Monitoring and observability
The server exposes Prometheus metrics and a health endpoint on the metrics port (default 9090), enabled by default.
- GET /metrics: Prometheus text format.
- GET /health: JSON, {"status":"ok","node_id":"...","shards":N}. Use it for load-balancer and orchestrator probes.
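A probe only needs to inspect the status field of the /health body. The following is a minimal Python sketch, assuming that a body which is not valid JSON, or whose status is anything other than "ok", should count as unhealthy. The function name is_healthy is illustrative and not part of the server.

```python
import json


def is_healthy(body: str) -> bool:
    """Return True only when a /health response body reports status "ok"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # Assumption: an unparseable body is treated as unhealthy.
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"
```

In a real probe the body would come from an HTTP GET against /health on the metrics port, for example via urllib.request.urlopen with a short timeout.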
warning
The metrics port is plaintext HTTP, unauthenticated, and binds all interfaces regardless of --listen-address. /metrics exposes the node id, shard count, and full operational detail. Keep it on a private network or firewall it.
The metrics that matter
There are many; these are the ones to put on a dashboard and alert on.
Is the cluster healthy
- celeriant_node_role: 1 on the leader, 0 on the follower. Exactly one node should report 1. Two is split-brain; zero means nobody holds the lease.
- celeriant_leader_elections_total: should be flat. A climbing count means the cluster is flapping.
- celeriant_heartbeat_failures_total and celeriant_follower_auto_fence_total: rising values mean the follower is struggling to keep up or the link is bad.
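The celeriant_node_role rule can be sketched as a small classifier. This is an illustration only: it assumes you have already scraped the gauge from each node into a mapping keyed by a node name of your choosing, and the function name and return labels are hypothetical.

```python
from typing import Mapping


def classify_leadership(roles: Mapping[str, int]) -> str:
    """Classify cluster leadership from per-node celeriant_node_role values."""
    leaders = sum(1 for value in roles.values() if value == 1)
    if leaders == 1:
        return "ok"
    if leaders == 0:
        return "no-leader"  # nobody holds the lease
    return "split-brain"  # more than one node claims leadership
```

For example, classify_leadership({"node-a": 1, "node-b": 0}) returns "ok".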
Is replication keeping up
- celeriant_replication_follower_pressured: 1 means the follower is falling behind and S3 fallback is imminent.
- celeriant_replication_s3_fallbacks_total: every increment is a window where writes took the slower S3 path. Occasional is fine; sustained means the follower cannot keep up.
- celeriant_replication_queue_bytes: the backlog awaiting the follower.
Latency and throughput
- celeriant_write_duration_seconds, celeriant_read_duration_seconds: end-to-end histograms. Watch p99.
- celeriant_fsync_duration_seconds, celeriant_replication_duration_seconds: where write latency comes from.
- celeriant_writes_total, celeriant_write_errors_total, celeriant_reads_total.
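To make "watch p99" concrete, here is a simplified Python sketch of how a quantile is estimated from histogram buckets. It assumes the histograms follow the usual Prometheus convention of cumulative bucket counts with upper bounds ending in +Inf, and it uses linear interpolation inside the bucket in the same spirit as PromQL's histogram_quantile. The bucket values in the usage note are made up for illustration.

```python
import math
from typing import Sequence, Tuple


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # The quantile lies past the last finite bucket: report that bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan  # not reached when the counts are cumulative
```

With hypothetical buckets [(0.01, 90), (0.05, 99), (0.1, 100), (math.inf, 100)], the p99 estimate is about 0.05 seconds. The estimate is only as fine as the bucket boundaries, so an SLO threshold that falls between two bounds cannot be resolved precisely.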
Resource health
- celeriant_client_connections_active, celeriant_watch_subscribers_active.
- celeriant_cache_*_hits_total / _misses_total: a collapsing hit rate signals the working set has outgrown the memory budget.
- celeriant_shard_panics_total, celeriant_shard_restarts_total: should be zero.
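The hit rate is derived from the two counters rather than exported directly. A minimal Python sketch, assuming you pass in the increase of the matching hits and misses counters over the same time window (the function and parameter names are illustrative):

```python
from typing import Optional


def hit_rate(hits_delta: float, misses_delta: float) -> Optional[float]:
    """Return hits / (hits + misses) for a window, or None if there were no lookups."""
    total = hits_delta + misses_delta
    if total <= 0:
        return None  # no cache traffic in the window, so the rate is undefined
    return hits_delta / total
```

Returning None for an idle window avoids reporting a misleading 0% hit rate when nothing was looked up.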
A starting alert set
- celeriant_node_role summed across the cluster is not exactly 1.
- rate(celeriant_leader_elections_total) above zero for more than a minute.
- rate(celeriant_replication_s3_fallbacks_total) sustained.
- write p99 from celeriant_write_duration_seconds over your SLO.
- any celeriant_shard_panics_total.
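Several of these alerts hinge on whether a counter moved during a window. The Python sketch below shows the idea behind such a check, assuming a list of (timestamp, value) samples ordered by time. It treats any decrease as a counter reset, as Prometheus does, but omits the extrapolation that PromQL's rate() and increase() apply, so treat it as an illustration rather than a replacement for a real alert rule.

```python
from typing import Sequence, Tuple


def counter_increase(samples: Sequence[Tuple[float, float]]) -> float:
    """Sum the growth of a counter across consecutive samples.

    A drop between two samples is treated as a reset: the counter is assumed
    to have restarted from zero, so the new value counts as growth.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def counter_moved(samples: Sequence[Tuple[float, float]]) -> bool:
    """Return True if the counter increased at all within the sampled window."""
    return counter_increase(samples) > 0
```

For example, samples of celeriant_leader_elections_total taken over the last minute that yield counter_moved(...) == True would correspond to the "above zero" election alert firing.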
The deploy/local-cluster stack wires Prometheus, Loki, and Grafana against both nodes, which is the quickest way to see these in motion.