Monitoring & Metrics

Chango samples CPU% and resident memory of every host and every managed component process and keeps a short in-memory history that backs the admin UI's CPU / memory charts. The metrics layer is built in — no Prometheus / Grafana / external agent is required to see the cluster's resource usage.

What is sampled

Sample	Source	Cadence
Node CPU / memory	`/proc/loadavg`, `/proc/meminfo`, `/proc/stat` on each NM	Pulled by the leader from each NM every `chango.metrics.collect.interval.seconds` (default 10 s)
Component CPU / memory	`ps -p <pid> -o %cpu=,rss=` on the NM that hosts the instance	Same cadence; one `ps` per managed process
Component alive	`ProcessHandle.of(pid).isAlive()`	Same cadence

For components whose pid file lives in a 0700 directory (Trino's var/var/run/launcher.pid, ZooKeeper sometimes), the NM falls back to sudo cat to read it. The pid that ends up in the metric record is the one the chango master tracks for stop / restart.

How the data flows

The leader (and only the leader) runs the ComponentMetricsCollector thread on a fixed interval.
For each managed ComponentInstance, the leader resolves which NM hosts it and sends OpCode.COMPONENT_METRICS_REQ over the internal NIO protocol.
The NM runs ps against the instance's pid and returns { cpu, memRss, alive } as JSON.
The leader appends a Sample (ts, instanceId, cpu, memRss) to an in-memory circular buffer. Older samples past chango.metrics.retention.seconds (default 3600 s) are evicted.
The admin UI's per-cluster page polls /admin/api/<component>/<clusterId>/metrics every 5 s and renders the recharts line chart from the returned series.

Followers do not collect — they do not have authoritative inventory. After a leader change the new leader's history is empty for the first interval, then populates.

In-memory only

The metric history is held only in the leader's JVM heap. A master restart resets the chart back to zero. There is no on-disk persistence of metric samples and no external Prometheus shipping (out of the box). This is a deliberate tradeoff — metric stores are an entirely separate operational layer, and most teams already have one. What chango provides is the admin-UI dashboard.

To ship metrics to a real time-series database, run a node_exporter / process_exporter alongside chango and scrape per host. The chango master also exposes the same JSON the UI uses, so a small adapter can scrape /admin/api/.../metrics directly.

What the admin UI shows

Per cluster page (Trino cluster, Spark cluster, …) carries two charts:

CPU% — one line per instance in the cluster, last hour.
Memory (RSS, MB) — one line per instance in the cluster, last hour.

Per chango-cluster page also shows node-level CPU / memory across every NM in the fleet.

Process health

Beyond CPU / memory, every component instance carries a textual status (STARTING / RUNNING / STOPPING / STOPPED / ERROR). Chango derives that from:

The last lifecycle action it issued (start → STARTING, stop → STOPPING).
The NM's liveness signal (the ps and ProcessHandle.isAlive() checks above).
A successful pid read after start (no pid file → ERROR).

If a component process dies outside of chango's control (OOM, segfault, manual kill), the next metrics interval flips it to "alive = false" and the chango master moves the status to ERROR. The cluster page surfaces the failure within ~10 s.

Tuning

Knob	Default	Purpose
`chango.metrics.collect.interval.seconds`	10	How often the leader walks every component and asks each NM for samples
`chango.metrics.retention.seconds`	3600	How long the in-memory history is kept
`chango.component.op.timeout.ms`	120000	Timeout for individual `COMPONENT_METRICS_REQ` to a slow NM — keep high enough to not race

Increasing the interval lowers admin-side overhead at the cost of chart resolution; raising the retention buys longer history at the cost of leader JVM heap (each sample is ~64 bytes).