Internal Protocol
Chango's masters and node managers talk over a custom binary protocol on TCP. The protocol is small, length-framed, KMS-envelope-encrypted in flight, and serves a single purpose — let the master ship ComponentInstance lifecycle commands to NMs and pull metrics / log tails back.
Why not gRPC / HTTP
- The admin UI surface is HTTP — chango already serves HTTP on the admin port. The internal protocol is between chango processes, not between a client and chango.
- Component install streams large tarballs (4 GB+). A binary length-prefixed protocol with chunked frames was simpler than HTTP/2 over TLS, especially under air-gapped operation where TLS certificate provisioning is its own problem.
- The same protocol carries KMS envelope encryption end-to-end, so the cluster works without TLS on the wire.
Wire format
+---------+------+----------------+-----------+
| length  | code | correlation id | payload   |
| 4 bytes | 2 B  | 8 B            | length-12 |
+---------+------+----------------+-----------+
- length: total frame length in bytes, big-endian.
- code: OpCode (see below).
- correlation id: opaque request / response pairing; the responder echoes the request's id so the caller can match.
- payload: opcode-specific. If KMS-envelope encryption is enabled (chango.kms.internal.protocol.encrypt = true, default), the payload is { wrappedDek || iv || ciphertext || gcmTag }. The wrappedDek is unwrapped by the receiver using the KMS provider, the ciphertext is AES-256-GCM with the unwrapped DEK.
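The frame layout can be sketched with a plain ByteBuffer. This is a minimal illustration, not chango's codec: FrameCodec, Frame and HEADER_BYTES are hypothetical names, and the sketch assumes that every header field is big-endian and that length counts the whole frame including a 14-byte header (4 + 2 + 8). The diagram's length-12 payload width suggests the real accounting may differ, so treat HEADER_BYTES as an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class FrameCodec {
    // Assumption: header = length (4) + code (2) + correlation id (8),
    // and the length field counts the whole frame, header included.
    static final int HEADER_BYTES = 4 + 2 + 8;

    record Frame(short code, long correlationId, byte[] payload) {}

    /** Serialises one frame: length, code, correlation id, then the payload bytes. */
    static byte[] encode(short code, long correlationId, byte[] payload) {
        int total = HEADER_BYTES + payload.length;
        ByteBuffer buf = ByteBuffer.allocate(total).order(ByteOrder.BIG_ENDIAN);
        buf.putInt(total);
        buf.putShort(code);
        buf.putLong(correlationId);
        buf.put(payload);
        return buf.array();
    }

    /** Parses exactly one frame and rejects a length field that does not match the buffer. */
    static Frame decode(byte[] wire) {
        ByteBuffer buf = ByteBuffer.wrap(wire).order(ByteOrder.BIG_ENDIAN);
        int total = buf.getInt();
        if (total != wire.length || total < HEADER_BYTES) {
            throw new IllegalArgumentException("bad frame length: " + total);
        }
        short code = buf.getShort();
        long correlationId = buf.getLong();
        byte[] payload = new byte[total - HEADER_BYTES];
        buf.get(payload);
        return new Frame(code, correlationId, payload);
    }

    public static void main(String[] args) {
        byte[] wire = encode((short) 1, 42L, "ping".getBytes(StandardCharsets.UTF_8));
        Frame f = decode(wire);
        System.out.println(f.code() + " " + f.correlationId() + " " + f.payload().length);
    }
}
```

A reader on a TCP stream would first read the 4-byte length, then read the remaining bytes of the frame before calling a decoder like this one.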
Opcodes
The opcode set is small. The ones that matter most:
| OpCode | Direction | Purpose |
|---|---|---|
| PING (1) | Either | Keepalive |
| MASTER_LEADER_GET (2) | NM → master | Ask any master for the leader's address |
| KMS_SYNC_PUSH (10) | Leader → follower / NM | Push the latest KMS state |
| IAM_SYNC_PUSH (11) | Leader → follower / NM | Push the latest IAM state |
| METADATA_SYNC_PUSH (12) | Leader → follower | Push the latest component-inventory state |
| KMS_SYNC_PULL_REQ (20) | Follower / NM → leader | Pull current KMS snapshot |
| IAM_SYNC_PULL_REQ (21) | Follower / NM → leader | Pull current IAM snapshot |
| METADATA_SYNC_PULL_REQ (22) | Follower → leader | Pull current metadata snapshot |
| COMPONENT_INSTALL_BEGIN (213) | Master → NM | Install header: instanceId, install path, run-as, start / stop commands, env, config files, binary artifacts |
| COMPONENT_INSTALL_DATA (214) | Master → NM | One chunk of the install tarball payload |
| COMPONENT_INSTALL_END (215) | Master → NM | Finish + verify hash |
| COMPONENT_INSTALL_RES (201) | NM → master | Install result (status, install path, pid file, runtime pid if known) |
| COMPONENT_START (202) | Master → NM | Start instance |
| COMPONENT_STOP (203) | Master → NM | Stop instance |
| COMPONENT_RESTART (204) | Master → NM | Restart instance |
| COMPONENT_DELETE (205) | Master → NM | Delete instance + remove install dir |
| COMPONENT_STATUS_REQ (206) | Master → NM | Pull instance status + pid |
| COMPONENT_METRICS_REQ (208) | Master → NM | Pull CPU / memory sample for one instance |
| LOG_TAIL_REQ | Master → NM | Stream the last N lines of an instance log |
| COMPONENT_CONFIG_PUT (212) | Master → NM | Update an instance's launch fields (startCommand, stopCommand, pidFile, env, config) without reinstalling |
| COMPONENT_FILE_GET / _PUT / _DELETE | Master ↔ NM | Read / write / delete a config file inside the instance's install dir (for the admin UI's Configure panels) |
| NGINX_APPLY / NGINX_APPLY_MULTI / NGINX_STATUS | Master → NM | Apply / inspect the component-nginx config on the NM's host |
The full enum lives in chango-protocol/src/main/java/.../OpCode.java.
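For orientation, a numeric-code lookup over a few of the opcodes from the table might look like the following. This is a hypothetical sketch (OpCodeSketch and fromCode are illustrative names); the real OpCode.java may be structured quite differently, and only a subset of the listed codes is shown.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical subset of the opcode table; names and numbers are copied from the table above.
enum OpCodeSketch {
    PING(1),
    MASTER_LEADER_GET(2),
    KMS_SYNC_PUSH(10),
    COMPONENT_INSTALL_RES(201),
    COMPONENT_START(202),
    COMPONENT_INSTALL_BEGIN(213),
    COMPONENT_INSTALL_DATA(214),
    COMPONENT_INSTALL_END(215);

    // Reverse index built once, so decoding a frame's code field is a map lookup.
    private static final Map<Integer, OpCodeSketch> BY_CODE = new HashMap<>();
    static {
        for (OpCodeSketch op : values()) {
            BY_CODE.put(op.code, op);
        }
    }

    final int code;

    OpCodeSketch(int code) {
        this.code = code;
    }

    /** Returns the opcode for a wire value, or empty for a code this sketch does not know. */
    static Optional<OpCodeSketch> fromCode(int code) {
        return Optional.ofNullable(BY_CODE.get(code));
    }
}
```

Returning Optional rather than throwing lets a receiver decide how to treat an unknown code, for example by logging it and dropping the frame.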
Streaming installs
For a 4 GB component tarball the protocol does not buffer the whole payload in memory. The master issues:
- COMPONENT_INSTALL_BEGIN: header (instanceId, install path, run-as, the start / stop scripts, env, and the list of files the NM should extract).
- Many COMPONENT_INSTALL_DATA chunks, each chango.master.component.transfer.chunk.bytes long (default 4 MB), in order.
- COMPONENT_INSTALL_END: final SHA-256 check; the NM verifies the assembled tarball matches before extraction.
The NM writes each chunk straight to a temp file on its /opt volume, so neither the NM nor the master needs to hold the whole tarball in memory.
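The chunked transfer can be sketched as a sender that reads fixed-size chunks while hashing them, and a receiver that appends each chunk to a temp file and checks the digest at the end. Everything here is illustrative: StreamingInstallSketch, ChunkSink and Receiver are hypothetical names, the network hop is replaced by a direct method call, and hashing incrementally on both sides is an assumption about how the SHA-256 check could be done.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

final class StreamingInstallSketch {
    /** Stand-in for "send one COMPONENT_INSTALL_DATA frame". */
    interface ChunkSink {
        void accept(byte[] chunk) throws IOException;
    }

    /** Master side: emits chunks of at most chunkBytes and returns the SHA-256 of all bytes sent. */
    static byte[] sendChunks(InputStream in, int chunkBytes, ChunkSink sink)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        byte[] buf = new byte[chunkBytes];
        int n;
        // readNBytes fills the buffer unless the stream ends, so only the last chunk is short.
        while ((n = in.readNBytes(buf, 0, chunkBytes)) > 0) {
            sha.update(buf, 0, n);
            sink.accept(Arrays.copyOf(buf, n));
        }
        return sha.digest();
    }

    /** NM side: appends chunks to a temp file and hashes them as they arrive. */
    static final class Receiver implements ChunkSink, AutoCloseable {
        final Path tempFile;
        private final OutputStream out;
        private final MessageDigest sha;

        Receiver(Path dir) throws IOException, NoSuchAlgorithmException {
            tempFile = Files.createTempFile(dir, "install-", ".part");
            out = Files.newOutputStream(tempFile);
            sha = MessageDigest.getInstance("SHA-256");
        }

        @Override
        public void accept(byte[] chunk) throws IOException {
            sha.update(chunk);
            out.write(chunk);
        }

        /** On END: closes the file and compares digests; extraction should only follow a true result. */
        boolean finish(byte[] expectedSha256) throws IOException {
            out.close();
            return MessageDigest.isEqual(sha.digest(), expectedSha256);
        }

        @Override
        public void close() throws IOException {
            out.close();
        }
    }
}
```

At any moment each side holds only one chunk-sized buffer, which is what keeps memory use flat regardless of tarball size.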
Correlation, retries, timeouts
- The master assigns a monotonically increasing 64-bit correlation id per request. The NM echoes the id on every response and every install chunk's ack.
- A single operation has a timeout (chango.master.component.op.timeout.ms, default 120 000). If the NM does not ack within that window the master gives up and returns an error to the admin REST caller.
- Long-running ops (install of a multi-GB tarball, slow cluster start) use a longer per-operation timeout configured per provisioner.
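One way to pair responses with requests and enforce the per-operation timeout is a map of pending futures keyed by correlation id. The sketch below is an assumption about how such bookkeeping could work, not chango's implementation; PendingRequests, register and complete are hypothetical names, and the timeout is passed in rather than read from chango.master.component.op.timeout.ms.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

final class PendingRequests {
    private final AtomicLong nextId = new AtomicLong();
    private final ConcurrentHashMap<Long, CompletableFuture<byte[]>> pending = new ConcurrentHashMap<>();

    record Pending(long correlationId, CompletableFuture<byte[]> response) {}

    /** Allocates the next correlation id and a future that fails with TimeoutException after timeoutMs. */
    Pending register(long timeoutMs) {
        long id = nextId.incrementAndGet(); // monotonically increasing 64-bit id
        CompletableFuture<byte[]> future = new CompletableFuture<>();
        pending.put(id, future);
        future.orTimeout(timeoutMs, TimeUnit.MILLISECONDS);
        // Drop the entry however the future ends, so timed-out requests do not leak.
        future.whenComplete((result, error) -> pending.remove(id));
        return new Pending(id, future);
    }

    /** Called when a response frame arrives; returns false for unknown, late, or duplicate ids. */
    boolean complete(long correlationId, byte[] payload) {
        CompletableFuture<byte[]> future = pending.remove(correlationId);
        return future != null && future.complete(payload);
    }

    int inFlight() {
        return pending.size();
    }
}
```

A caller would send the frame with the returned correlation id and then wait on the future; a timeout surfaces as an exceptionally completed future, which maps naturally to an error returned to the REST caller.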
KMS envelope on the wire
When chango.kms.internal.protocol.encrypt = true (default), every InternalMessage payload is encrypted with AES-256-GCM:
- Sender requests a wrapped DEK from chango's KMS provider for the wire key (internal-protocol).
- AES-GCM-encrypts the cleartext payload with the unwrapped DEK and a fresh 96-bit IV.
- Frames the wire payload as wrappedDek || iv || ciphertext || gcmTag.
- Receiver unwraps the DEK and decrypts.
A wrong key (the cluster root secret diverged between master and NM, for example) fails to decrypt with AEADBadTagException. The NM logs the failure and refuses the message — no plaintext fallback.
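The payload layout and the failure mode can be illustrated with the standard javax.crypto API. This is a sketch under several assumptions: EnvelopeSketch, seal and open are hypothetical names, the KMS wrap / unwrap step is left out (the wrapped DEK is treated as opaque bytes whose length the receiver already knows), and a 128-bit GCM tag is assumed. In the real flow the receiver would obtain the DEK by unwrapping the first segment through the KMS provider instead of being handed it.

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

final class EnvelopeSketch {
    static final int IV_BYTES = 12;  // 96-bit IV, fresh for every message
    static final int TAG_BITS = 128; // assumption: full-length GCM tag
    private static final SecureRandom RNG = new SecureRandom();

    /** Builds wrappedDek || iv || ciphertext || gcmTag; wrappedDek is opaque to this sketch. */
    static byte[] seal(SecretKey dek, byte[] wrappedDek, byte[] plaintext) throws GeneralSecurityException {
        byte[] iv = new byte[IV_BYTES];
        RNG.nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, dek, new GCMParameterSpec(TAG_BITS, iv));
        byte[] ciphertextAndTag = cipher.doFinal(plaintext); // the JCE appends the tag to the ciphertext
        return ByteBuffer.allocate(wrappedDek.length + IV_BYTES + ciphertextAndTag.length)
                .put(wrappedDek).put(iv).put(ciphertextAndTag).array();
    }

    /** Decrypts a sealed payload; a wrong DEK or tampered bytes raise AEADBadTagException. */
    static byte[] open(SecretKey dek, int wrappedDekLen, byte[] wire) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, dek, new GCMParameterSpec(TAG_BITS, wire, wrappedDekLen, IV_BYTES));
        int offset = wrappedDekLen + IV_BYTES;
        return cipher.doFinal(wire, offset, wire.length - offset);
    }

    /** Generates a random 256-bit AES key, standing in for a DEK issued by the KMS provider. */
    static SecretKey newDek() {
        byte[] key = new byte[32];
        RNG.nextBytes(key);
        return new SecretKeySpec(key, "AES");
    }
}
```

Because GCM authenticates the ciphertext, there is no way to produce partial plaintext from a mismatched key: doFinal either returns the full cleartext or throws, which matches the refuse-the-message behaviour described above.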
When to disable encryption
chango.kms.internal.protocol.encrypt = false exists for diagnosis only. It removes the envelope and sends payloads in cleartext, which lets you tcpdump the wire to see what's actually being sent during a slow install. Never run a production cluster with encryption disabled.
Sync push + pull cadence
State propagation from leader to followers / NMs is dual:
- Push: the leader writes a change locally, then fires KMS_SYNC_PUSH / IAM_SYNC_PUSH / METADATA_SYNC_PUSH to every known peer in parallel.
- Pull: every non-leader runs a periodic *_SYNC_PULL_REQ against the leader on a fixed interval (chango.cluster.sync.pull.interval.ms, default 30 000) to catch any push that was missed (peer was offline, network blip).
This belt-and-suspenders pattern is what keeps the cached state on every NM consistent without requiring at-least-once delivery on the push path.
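The reason a lost push is harmless can be shown with a small model in which push and pull share one idempotent apply step. The version counter is an assumption made for the sketch (the source does not say how snapshots are ordered), and SyncSketch, Leader, Peer and Snapshot are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicReference;

final class SyncSketch {
    // Assumption: each snapshot carries a version that the leader increments on every write.
    record Snapshot(long version, String state) {}

    static final class Leader {
        private final AtomicReference<Snapshot> current = new AtomicReference<>(new Snapshot(0, ""));

        /** Writes locally and returns the snapshot that would then be pushed to peers. */
        Snapshot write(String state) {
            return current.updateAndGet(s -> new Snapshot(s.version() + 1, state));
        }

        /** What a *_SYNC_PULL_REQ would return. */
        Snapshot snapshot() {
            return current.get();
        }
    }

    static final class Peer {
        private final AtomicReference<Snapshot> cached = new AtomicReference<>(new Snapshot(0, ""));

        /** Used for both push and pull: keep whichever snapshot is newer, ignore the rest. */
        void apply(Snapshot incoming) {
            cached.accumulateAndGet(incoming, (old, in) -> in.version() > old.version() ? in : old);
        }

        Snapshot cached() {
            return cached.get();
        }
    }
}
```

With this shape a missed push only delays convergence until the next pull interval, and a push that arrives late or twice is ignored because its version is not newer than the cached one.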