feat/distributed-monitoring #1

Merged
nicklas merged 7 commits from feat/distributed-monitoring into master 2026-05-01 16:29:28 +00:00
Introduces a new `agents:` configuration block declaring remote probe
agents (ID, name, token, rate-per-second, stale-after). Adds
`monitored-by` on endpoints to map probes to specific agents (defaults
to `[local]` for zero-change upgrades) and `agent-quorum: N` on alerts to
gate notifications behind N-of-M agent agreement.
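A configuration sketch using the new keys might look like this (key names per the description above; agent IDs, tokens, and values are illustrative):

```yaml
agents:
  - id: probe-eu
    name: "EU probe"
    token: "${AGENT_EU_TOKEN}"
    rate-per-second: 5
    stale-after: 2m
  - id: probe-us
    name: "US probe"
    token: "${AGENT_US_TOKEN}"
    rate-per-second: 5
    stale-after: 2m

endpoints:
  - name: api
    url: "https://example.org/health"
    monitored-by: [probe-eu, probe-us]  # omit for the default [local]
    alerts:
      - type: slack
        agent-quorum: 2  # fire only when 2 of the agents agree
```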

Per-agent failure/success counters replace the prior scalar fields on
Endpoint; the previous accessors are kept as compatibility shims
returning the max across agents so metrics/* callers don't change.
Alert gains a TriggeredByAgent map plus a QuorumFired flag; the public
Triggered field is preserved for provider wire-format compatibility and
mirrors QuorumFired.
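A minimal sketch of the max-across-agents compatibility shim described above (struct and method names are illustrative, not necessarily the PR's actual identifiers):

```go
package main

import "fmt"

// Endpoint tracks consecutive failures/successes per agent instead of
// the previous single scalar pair (illustrative subset of the struct).
type Endpoint struct {
	FailuresByAgent  map[string]int
	SuccessesByAgent map[string]int
}

// NumberOfFailuresInARow keeps the old accessor's name and shape but
// now returns the maximum across agents, so existing metrics callers
// keep compiling and keep seeing the worst-case run.
func (e *Endpoint) NumberOfFailuresInARow() int {
	max := 0
	for _, n := range e.FailuresByAgent {
		if n > max {
			max = n
		}
	}
	return max
}

func main() {
	e := &Endpoint{FailuresByAgent: map[string]int{"local": 1, "probe-eu": 3}}
	fmt.Println(e.NumberOfFailuresInARow()) // 3
}
```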
Adds `agent_id` to endpoint_results and broadens the unique constraint
on endpoint_alerts_triggered to (endpoint_id, configuration_checksum,
agent_id). Postgres uses ALTER TABLE / ADD CONSTRAINT IF NOT EXISTS;
SQLite rebuilds the table inside a single transaction to swap the
unique constraint, with row-count assertions before/after.
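The SQLite rebuild follows the usual create-copy-drop-rename pattern; roughly (column set and table layout are assumptions beyond what the description states):

```sql
BEGIN TRANSACTION;
CREATE TABLE endpoint_alerts_triggered_new (
    endpoint_id            INTEGER NOT NULL,
    configuration_checksum TEXT    NOT NULL,
    agent_id               TEXT    NOT NULL DEFAULT 'local',
    UNIQUE (endpoint_id, configuration_checksum, agent_id)
);
INSERT INTO endpoint_alerts_triggered_new (endpoint_id, configuration_checksum, agent_id)
    SELECT endpoint_id, configuration_checksum, 'local'
    FROM endpoint_alerts_triggered;
DROP TABLE endpoint_alerts_triggered;
ALTER TABLE endpoint_alerts_triggered_new RENAME TO endpoint_alerts_triggered;
COMMIT;
```

Row counts would be compared before the `DROP` and after the `RENAME` to assert the copy was lossless.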

Existing single-agent rows default to agent_id='local' so the migration
is non-destructive on upgrade. Storage interface gains
GetUptimeByKeyAndAgent and per-agent triggered methods; the original
helpers are reimplemented as wrappers that pass agent_id='local'.
endpoint_uptimes stays global in v1 — per-agent uptime is deferred.
HandleAlerting becomes per-agent internally, keyed by result.AgentID. Each
agent maintains its own failure/success run; an agent crossing
FailureThreshold is recorded as triggered. The user-visible alert
fires only when at least AgentQuorum agents are simultaneously
triggered, and resolves when that count drops below quorum. Reminder
cadence keys off a sentinel "__quorum__" so reminders pace per-quorum
rather than per-agent.
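The quorum decision reduces to counting currently-triggered agents and comparing against the threshold; a minimal sketch of the fire/resolve transition (assuming the TriggeredByAgent map and QuorumFired flag described above; the method names are illustrative):

```go
package main

import "fmt"

type Alert struct {
	TriggeredByAgent map[string]bool // agent ID -> crossed FailureThreshold
	AgentQuorum      int
	QuorumFired      bool
}

// NumberOfTriggeredAgents counts agents currently past the threshold.
func (a *Alert) NumberOfTriggeredAgents() int {
	n := 0
	for _, triggered := range a.TriggeredByAgent {
		if triggered {
			n++
		}
	}
	return n
}

// Reconcile fires the user-visible alert only when quorum is reached,
// and resolves it when the triggered count drops back below quorum.
func (a *Alert) Reconcile() (fired, resolved bool) {
	n := a.NumberOfTriggeredAgents()
	switch {
	case !a.QuorumFired && n >= a.AgentQuorum:
		a.QuorumFired = true
		return true, false
	case a.QuorumFired && n < a.AgentQuorum:
		a.QuorumFired = false
		return false, true
	}
	return false, false
}

func main() {
	a := &Alert{TriggeredByAgent: map[string]bool{"eu": true, "us": true}, AgentQuorum: 2}
	fired, _ := a.Reconcile()
	fmt.Println(fired) // true
}
```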

In-memory NumberOfTriggeredAgents() is the authoritative count for
quorum decisions; storage rows reconcile that view across restarts and
multi-process setups.
POST /api/v1/agents/:agentId/endpoints/:key/results — bearer-token
authenticated push from agents, gated by per-agent rate limiting
(token bucket) and the endpoint's monitored-by allowlist. Results land
in storage with agent_id stamped in.

GET /api/v1/agents/:agentId/assignments — pulls the slim probe spec
for an agent (no alerts, no maintenance windows). Returns a SHA-256
checksum so polling agents can short-circuit with If-None-Match -> 304.
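The checksum handshake is effectively an ETag comparison; a sketch of the server side (the handler shape and serialization are illustrative):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// assignmentChecksum hashes the serialized probe spec so agents can
// poll cheaply with If-None-Match.
func assignmentChecksum(spec []byte) string {
	sum := sha256.Sum256(spec)
	return hex.EncodeToString(sum[:])
}

// respond returns 304 when the agent's cached checksum still matches,
// 200 (with a fresh body and checksum) otherwise.
func respond(ifNoneMatch string, spec []byte) int {
	if ifNoneMatch == assignmentChecksum(spec) {
		return 304
	}
	return 200
}

func main() {
	spec := []byte(`[{"name":"api","url":"https://example.org/health"}]`)
	etag := assignmentChecksum(spec)
	fmt.Println(respond(etag, spec), respond("stale", spec)) // 304 200
}
```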

GET /api/v1/agents — dashboard listing with last-seen and stale flags.
Uses c.Status(200).JSON() to override the 404 the static-file
middleware pre-sets before the JSON handlers run.

External-endpoint, raw, and several status tests get small follow-ups
to plumb agent_id through cleanly.
Same binary, mode-selected via GATUS_MODE=agent. The agent fetches its
assignments from the central server (with checksum-based 304 caching),
spawns one probe goroutine per assigned endpoint, and POSTs results
back over HTTPS. No local config of endpoints — single source of truth
is the server's `agents:` block + `monitored-by` allowlist.

Brief central outages are tolerated by a bounded FIFO queue (drop-
oldest on overflow) that drains on the next successful POST. Probe
goroutines diff cleanly against new assignment lists, so adding or
removing endpoints on the server takes effect within one refresh
interval without restart.
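The outage buffer described above is a bounded drop-oldest FIFO; a minimal sketch (the string payload and method names stand in for the real result type):

```go
package main

import "fmt"

// resultQueue buffers results during central outages; when full, the
// oldest entry is dropped to make room for the newest.
type resultQueue struct {
	items []string
	max   int
}

func (q *resultQueue) push(r string) {
	if len(q.items) == q.max {
		q.items = q.items[1:] // drop-oldest on overflow
	}
	q.items = append(q.items, r)
}

// drain returns everything buffered and empties the queue; called once
// a POST to the central server succeeds again.
func (q *resultQueue) drain() []string {
	out := q.items
	q.items = nil
	return out
}

func main() {
	q := &resultQueue{max: 2}
	q.push("a")
	q.push("b")
	q.push("c") // "a" is dropped
	fmt.Println(q.drain()) // [b c]
}
```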

main.go gains the GATUS_MODE branch and the startup recovery loop that
replays persisted triggered rows into per-agent counters and re-arms
QuorumFired so post-restart we don't double-fire alerts.
EndpointCard renders one badge per agent observed in the latest
results, greyed out when the agent is currently flagged stale. New
/agents route + AgentsList view show each agent's last-seen timestamp
and a Stale/Healthy pill.

Header on the home view gets a small Agents button (next to refresh)
with a red dot when any agent is stale, so dashboard users notice
silent agents without leaving the page.

web/static/{app.css,app.js,chunk-vendors.js} are the regenerated build
artifacts (web/static is //go:embed-ed into the Go binary at compile
time so they must ship together with the Vue source changes).
docs: distributed monitoring docs and production examples
Adds a worked walkthrough for distributed monitoring in README and a
production-leaning docker compose example under
.examples/docker-compose-distributed-monitoring/ that bundles its own
Traefik (TLS via Let's Encrypt HTTP-01), Postgres, and the central
gatus server, plus a single-container agent stack. A simpler local
docker-only example lives in .examples/docker-distributed-monitoring/
for kicking the tires without a public hostname.

The Dockerfile drops the unconditional `-a` flag so go build's
per-package cache survives between rebuilds and copies go.mod/go.sum
before the rest of the source so dependency downloads only re-run when
modules change. Cuts repeat builds substantially on small VMs (1GB
RAM / 1 core was the original pain point).
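The resulting build-stage ordering looks roughly like this (a sketch of the two changes, not the exact Dockerfile):

```dockerfile
FROM golang:1.22 AS build
WORKDIR /src
# Copy module files first so the download layer is cached until
# go.mod/go.sum actually change.
COPY go.mod go.sum ./
RUN go mod download
COPY . .
# No -a flag: go's per-package build cache survives between rebuilds.
RUN go build -o gatus .
```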

CLAUDE.md collects project-specific guidance for AI coding assistants
working in this tree.
nicklas merged commit b0a7dce665 into master 2026-05-01 16:29:28 +00:00
nicklas deleted branch feat/distributed-monitoring 2026-05-01 16:30:07 +00:00