Encrypted off-site backup: the durable layer
The board (board/data/), the guard's quarantine store (guard/data/), config/, and the
active vault are live data, not test fixtures. The board's local rolling snapshots
(board/data/backups/) are crash-safety only — same-disk, count-pruned, and trivially deletable
with the thing they protect. They survive a bad write; they do not survive a disk loss, an
rm -rf, or a corruption that propagates before anyone notices.
This subsystem is the missing layer: daily AES-256-GCM-encrypted snapshots pushed to a private GitHub repo, where git history is an immutable, off-site, versioned record you cannot silently overwrite. The whole design follows from one incident — live data was bulk-edited and the only backups were the local, count-pruned kind, so the pre-edit state was already gone.
Setup and recovery run through the /backup-recovery skill (SKILL.md); the implementation lives
in backup/. This page covers the architecture and the contracts.
The pipeline
A backup is a single immutable artifact per run. backup.mjs
archives the in-scope stores, encrypts the tarball, writes one file plus a manifest entry into
the backup repo, and commits + pushes.
flowchart LR
subgraph live["live data (SCOPE)"]
B[board/data]
G[guard/data]
C[config/]
V["vault/<VAULT_NAME>"]
end
live --> TAR[gzip-tar]
TAR --> ENC["AES-256-GCM encrypt<br/>scrypt(passphrase, salt) → key<br/>random salt+IV per run"]
ENC --> FILE["snapshots/cos-backup-<ts>.enc"]
ENC --> MAN["MANIFEST.json<br/>sha256 + metadata"]
FILE --> PUSH["git commit + push<br/>(private repo)"]
MAN --> PUSH
The scope is declared in config.mjs
(SCOPE). The vault entry is the subtle one: it is resolved as vault/<VAULT_NAME> from
config/cos.env, the same active vault that setup-vault records, so a renamed or relocated
vault is never silently dropped. If the configured vault directory is missing at backup time,
backup.mjs emits a loud WARN (surfaced in the /backups log tails) rather than quietly
shipping a snapshot without it.
Encryption (lib/crypto.mjs)
is authenticated AES-256-GCM. The key is derived from the recovery passphrase via scrypt with a
random salt and IV per backup — so two backups of identical bytes produce different ciphertext,
and the GCM auth tag makes any post-hoc edit detectable. One file per run, snapshots/cos-backup-<ts>.enc,
never overwritten: the off-site versioning is git history itself, not a mutable "latest".
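The encryption step described above can be sketched with Node's built-in crypto module. This is a minimal illustration, not the contents of lib/crypto.mjs: the container layout (magic bytes, then salt, IV, auth tag, ciphertext), the `COS1` magic value, the field lengths, and the function names are all assumptions, and scrypt runs with Node's default cost parameters. It is written as CommonJS so it runs standalone.

```javascript
const crypto = require("node:crypto");

// Hypothetical container layout (an assumption, not the real file format):
//   MAGIC | salt | iv | authTag | ciphertext
const MAGIC = Buffer.from("COS1");
const SALT_LEN = 16;
const IV_LEN = 12; // 96-bit nonce, the size GCM is designed around
const TAG_LEN = 16;

const SALT_AT = MAGIC.length;
const IV_AT = SALT_AT + SALT_LEN;
const TAG_AT = IV_AT + IV_LEN;
const DATA_AT = TAG_AT + TAG_LEN;

// Derive a 32-byte (AES-256) key from the passphrase and a per-run salt.
function deriveKey(passphrase, salt) {
  return crypto.scryptSync(passphrase, salt, 32);
}

function encryptSnapshot(plaintext, passphrase) {
  // Fresh salt and IV on every run, so identical input never yields identical output.
  const salt = crypto.randomBytes(SALT_LEN);
  const iv = crypto.randomBytes(IV_LEN);
  const cipher = crypto.createCipheriv("aes-256-gcm", deriveKey(passphrase, salt), iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return Buffer.concat([MAGIC, salt, iv, cipher.getAuthTag(), ciphertext]);
}

function decryptSnapshot(blob, passphrase) {
  if (!blob.subarray(0, MAGIC.length).equals(MAGIC)) throw new Error("bad magic");
  const salt = blob.subarray(SALT_AT, IV_AT);
  const iv = blob.subarray(IV_AT, TAG_AT);
  const tag = blob.subarray(TAG_AT, DATA_AT);
  const decipher = crypto.createDecipheriv("aes-256-gcm", deriveKey(passphrase, salt), iv);
  decipher.setAuthTag(tag);
  // final() throws if the ciphertext, tag, or key does not match.
  return Buffer.concat([decipher.update(blob.subarray(DATA_AT)), decipher.final()]);
}
```

Storing the salt and IV in the clear next to the ciphertext is standard practice: neither is secret, and both are needed to re-derive the key and verify the tag at restore time.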
The recovery key: the part to get exactly right
One passphrase, no recovery
A single high-entropy passphrase is the only way to decrypt. Lose it and the backups are unrecoverable by design — that is the entire point of encrypting before pushing off-site. Treat it like a root credential.
The key lives in the macOS login Keychain (security item cos-backup-key) and is read at
backup time by the LaunchAgent. It is never written to this repo, to the backup repo, or to any
log. The protocol is two-copy: Keychain (for automated daily runs) plus one offline copy in a
password manager (for the day the Keychain is gone — e.g. a new machine). For one-off or CI restores
before the Keychain item exists, COS_BACKUP_KEY (env) overrides the Keychain lookup in
resolveKey().
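The lookup order can be sketched as follows. Only the name resolveKey(), the COS_BACKUP_KEY variable, and the cos-backup-key item come from this page; the `readKeychain` helper, the assumption that the item is addressed by service name (`-s`), and the injectable parameters are illustrative choices made so the precedence can be exercised without a real Keychain.

```javascript
const { execFileSync } = require("node:child_process");

// Hypothetical default reader: `security find-generic-password -s <service> -w`
// prints only the stored secret. Looking the item up by service name is an assumption.
function readKeychain(service) {
  return execFileSync("security", ["find-generic-password", "-s", service, "-w"], {
    encoding: "utf8",
  }).trim();
}

// The env var wins over the Keychain. Both sources are parameters so the
// precedence is testable; nothing here logs or persists the key.
function resolveKey(env = process.env, keychain = readKeychain) {
  if (env.COS_BACKUP_KEY) return env.COS_BACKUP_KEY;
  const fromKeychain = keychain("cos-backup-key");
  if (!fromKeychain) {
    throw new Error("no recovery key: set COS_BACKUP_KEY or create the Keychain item");
  }
  return fromKeychain;
}
```

Calling `resolveKey({ COS_BACKUP_KEY: "..." })` never touches the Keychain, which is what makes a restore on a fresh machine or in CI possible.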
Rotation re-keys new snapshots only; it does not re-encrypt old ones — they still need the old key, so the retired key stays archived offline, labelled with the cutover date.
Operational design: three triggers, one floor
The hard guarantee is the launchd 03:30 daily floor
(com.chiefofstaff.backup.plist.template).
While it is running, the board adds two more triggers of the same backup.mjs (a manual
button and an opportunistic top-up), so a machine that's awake and in use is backed up well inside
the daily window, while a machine that's only ever asleep at 03:30 still gets caught at next wake.
| Trigger | Source | Gate |
|---|---|---|
| Daily floor | launchd at 03:30 | none — always runs |
| Back up now | POST /api/backups/run (button on /backups) | none; ?force=1 also bypasses the freshness gate |
| Opportunistic top-up | fired non-blocking from hot read routes (GET /api/cases, GET /api/backups) | only when newest snapshot is older than the 12h freshness window and a positive live-board identity check passes |
The board side (board/lib/backup-status.ts)
gates the top-up; it never blocks the request it piggybacks on.
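A minimal sketch of the top-up gate, assuming the two conditions from the table are available as plain values. The function names, the shape of the `state` object, and the choice to treat "no snapshot yet" as stale are assumptions; the real logic lives in board/lib/backup-status.ts and may differ.

```javascript
const FRESHNESS_WINDOW_MS = 12 * 60 * 60 * 1000; // the 12h freshness window

// Pure gate: true only on a positively identified live board whose newest
// snapshot is older than the window.
function shouldTopUp({ newestSnapshotAt, isLiveBoard, now = Date.now() }) {
  if (!isLiveBoard) return false;
  if (newestSnapshotAt == null) return true; // assumption: no snapshot counts as stale
  return now - newestSnapshotAt > FRESHNESS_WINDOW_MS;
}

// Fire-and-forget wrapper: the backup starts on a later tick and its errors
// are swallowed, so the read route that triggered it is never delayed or failed.
function maybeTopUp(state, runBackup) {
  if (!shouldTopUp(state)) return false;
  Promise.resolve().then(runBackup).catch(() => {});
  return true;
}
```

Keeping the gate a pure function of its inputs makes the 12h boundary easy to test, and the wrapper returns before `runBackup` is even invoked.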
Single-flight lock + exit codes
All three callers serialize on one exclusive .backup.lock inside the backup repo (atomic wx
create, gitignored, reclaimed if a crashed run left it >120s stale). The lock lives in backup.mjs,
not in board code, precisely because launchd runs the file directly and never passes through
the board. The exit code is the contract every caller reads:
| Exit | Meaning | Failure? |
|---|---|---|
| 0 | snapshot written and pushed | no (healthy) |
| 2 | committed locally only (push failed, e.g. no network); still a real backup | no |
| 3 | benign lock-skip: another run held the lock, this one did nothing | no |
| 1 | fail-closed repo-guard refusal (see below) | yes |
| other non-zero | hard failure | yes |
Exit 2 and 3 are not failures
2 means the snapshot is committed and just needs a later push; 3 means another trigger was
already running. Health checks (and the /backups verdict) deliberately treat only a non-0/2/3
exit as a real failure.
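The lock and the exit-code contract can be sketched together. This is an illustration under assumptions, not the code in backup.mjs: the function names, the `EXIT` constant names, the use of the lock file's mtime to judge staleness, and the single retry after a reclaim are all hypothetical. Only the `wx` flag, the 120s threshold, and the meaning of each exit code come from this page.

```javascript
const fs = require("node:fs");

const EXIT = { OK: 0, REFUSED: 1, LOCAL_ONLY: 2, LOCK_SKIP: 3 };
const STALE_MS = 120_000; // a lock older than this is treated as left by a crashed run

// Returns true if this process now holds the lock, false if another run holds it.
function tryAcquireLock(lockPath, now = Date.now()) {
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      // "wx" creates the file exclusively and fails with EEXIST if it already exists.
      fs.writeFileSync(lockPath, String(process.pid), { flag: "wx" });
      return true;
    } catch (err) {
      if (err.code !== "EEXIST") throw err;
      let mtimeMs;
      try {
        mtimeMs = fs.statSync(lockPath).mtimeMs;
      } catch {
        continue; // the lock vanished between the two calls; try again
      }
      if (now - mtimeMs <= STALE_MS) return false; // held by a live run
      fs.rmSync(lockPath, { force: true }); // stale: reclaim, then retry once
    }
  }
  return false;
}

// Callers treat only codes outside {0, 2, 3} as real failures.
function isFailure(exitCode) {
  return ![EXIT.OK, EXIT.LOCAL_ONLY, EXIT.LOCK_SKIP].includes(exitCode);
}
```

A caller that gets `false` from `tryAcquireLock` would exit with `EXIT.LOCK_SKIP`. Note that reclaiming by delete-then-create is not itself atomic, so two processes reclaiming at the same instant can still race; the exclusive create on the retry is what decides the winner.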
Fail-closed repo guard
The effective backup repo path resolves by precedence — COS_BACKUP_REPO env override >
config/cos.env BACKUP_REPO > the ~/.cos-backups default. The EXPECTED repo is derived
from cos.env (the config value, not the env var). Before doing anything,
assertDefaultRepoOrRefuse() exits 1 unless the effective repo === EXPECTED — so a
COS_BACKUP_REPO=/tmp/... override is refused rather than pushed to. This is the same fail-closed
instinct as the Guard: when the target is ambiguous, stop. The escape hatch
COS_BACKUP_ALLOW_NONDEFAULT=1 is reserved for deliberate disposable-repo tests. Consequently
COS_BACKUP_REPO is not the relocation knob — to move the repo you edit cos.env (keeping
effective === EXPECTED) and reinstall the agent.
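The precedence and the refusal rule can be expressed as a small pure function. This sketch makes several assumptions: `checkRepoGuard` and `resolveRepos` are hypothetical names (the real entry point is assertDefaultRepoOrRefuse(), which exits the process), paths are compared as plain strings without expanding `~`, and EXPECTED falls back to the default when cos.env has no BACKUP_REPO.

```javascript
const DEFAULT_REPO = "~/.cos-backups";

// env    : process environment (may carry COS_BACKUP_REPO)
// cosEnv : parsed config/cos.env (may carry BACKUP_REPO)
function resolveRepos(env, cosEnv) {
  // EXPECTED deliberately ignores the env override.
  const expected = cosEnv.BACKUP_REPO || DEFAULT_REPO;
  // Effective path: env override > cos.env > default.
  const effective = env.COS_BACKUP_REPO || expected;
  return { effective, expected };
}

// Returns a decision instead of exiting, so the rule can be tested directly.
function checkRepoGuard(env, cosEnv) {
  const { effective, expected } = resolveRepos(env, cosEnv);
  if (effective === expected) return { ok: true, repo: effective };
  if (env.COS_BACKUP_ALLOW_NONDEFAULT === "1") {
    return { ok: true, repo: effective, nonDefault: true }; // disposable-repo tests only
  }
  return { ok: false, exitCode: 1, reason: `refusing ${effective}: expected ${expected}` };
}
```

With this shape, `checkRepoGuard({ COS_BACKUP_REPO: "/tmp/x" }, {})` is refused with exit code 1, while the same path set through cos.env passes because it moves EXPECTED along with the effective path.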
Restore: verify before you touch anything
restore.mjs inverts the pipeline,
and its ordering is the whole safety story: nothing is written until the snapshot has fully
verified, and the live state is itself snapshotted before any overwrite, so a restore is reversible.
sequenceDiagram
participant R as restore.mjs
participant Repo as backup repo
participant FS as live stores
R->>Repo: git pull, pick snapshot (--date / latest)
R->>R: decrypt → GCM auth tag verify
R->>R: sha256 vs MANIFEST.json
R->>R: every *.json parses
alt verification fails
R-->>R: refuse (bad magic / tag throw / mismatch)
else dry-run (default)
R-->>R: report OK, write nothing
else --apply
R->>FS: copy current state → ~/cos-recovery/pre-restore-<ts>/
R->>FS: overwrite live stores
end
Verification is a three-gate chain — GCM auth tag, then sha256 against the manifest, then a JSON
parse of every store — and the default is dry-run; --apply is required to write. On --apply,
the current live stores are copied to ~/cos-recovery/pre-restore-<ts>/ before the overwrite, so
the restore can itself be undone. A snapshot that fails to verify (bad magic, an auth-tag throw, a
sha256 mismatch) is never applied — the remedy is an earlier --date, not a force.
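The three-gate chain can be sketched as a function that stops at the first failing gate and never writes anything. The names and signature are hypothetical, and two details are assumptions rather than facts about restore.mjs: that the manifest's sha256 covers the encrypted file bytes, and that the JSON documents are handed over by a caller-supplied `jsonFilesOf` helper (unpacking the tarball is out of scope here).

```javascript
const crypto = require("node:crypto");

// encBytes       : Buffer with the encrypted snapshot as stored in the repo
// manifestSha256 : hex digest recorded in MANIFEST.json (assumed to cover encBytes)
// decrypt        : function(Buffer) -> Buffer, throws when the GCM tag does not verify
// jsonFilesOf    : function(Buffer) -> iterable of [name, text] pairs (hypothetical)
function verifySnapshot(encBytes, manifestSha256, decrypt, jsonFilesOf) {
  // Gate 1: authenticated decryption.
  let plaintext;
  try {
    plaintext = decrypt(encBytes);
  } catch (err) {
    return { ok: false, gate: "auth-tag", detail: err.message };
  }

  // Gate 2: digest recorded at backup time.
  const actual = crypto.createHash("sha256").update(encBytes).digest("hex");
  if (actual !== manifestSha256) {
    return { ok: false, gate: "sha256", detail: `got ${actual}` };
  }

  // Gate 3: every JSON store must parse.
  for (const [name, text] of jsonFilesOf(plaintext)) {
    try {
      JSON.parse(text);
    } catch {
      return { ok: false, gate: "json", detail: name };
    }
  }
  return { ok: true, plaintext };
}
```

Returning the failing gate by name keeps the refusal message specific, which matters when the remedy is to pick an earlier --date rather than to retry.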
Board health surface
The board exposes a read-only health view at /backups (sidebar → Review → Backups — a
top-level item next to Security/Trash/Activity, not nested under Security), served by
board/lib/backup-status.ts over GET /api/backups. That route is always 200 with a fail-safe
envelope: a never-broken health page is more useful than one that errors when the thing it monitors
is broken. It renders a healthy / warning / error verdict, last-run facts (time, size, store
count), push state, snapshot history, log tails, repo-path provenance, and a readiness checklist.
Two properties matter for the threat model. The readiness probe is read-only and offline — it
checks the recovery key's existence with security find-generic-password without -w (it never
reads the secret) and never touches the network. And POST /api/backups/run — the one mutating route,
which spawns the same backup.mjs — 403s on a non-live-board (sandbox) context, so the agent twin
can't fire backups from a test fixture.
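The always-200 behaviour amounts to wrapping the status collection in a catch-all that turns any exception into an error verdict. A minimal sketch, assuming a hypothetical `collect` function and an envelope shape (`ok`, `verdict`, `error`) that is not taken from the real route:

```javascript
// collect : function() -> object with the health facts; may throw when the
//           backup repo, logs, or manifest are unreadable.
function buildHealthEnvelope(collect) {
  try {
    return { status: 200, body: { ok: true, ...collect() } };
  } catch (err) {
    // The monitored thing being broken is itself a health result, not a route error.
    const message = err instanceof Error ? err.message : String(err);
    return { status: 200, body: { ok: false, verdict: "error", error: message } };
  }
}
```

The page can then always render: a broken backup repo shows up as an error verdict with a reason rather than as a failed request.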
Threat model / guarantees
| Property | Mechanism |
|---|---|
| Off-site | pushed to a private GitHub repo — survives local disk loss or rm -rf board/data/backups |
| Immutable | git history keeps every daily snapshot; one file per run, never overwritten |
| Confidential | AES-256-GCM before push — a leak of the private repo exposes nothing without the key |
| Tamper-evident | GCM auth tag + manifest sha256 — a modified snapshot fails verification, never restores |
| Reversible | --apply snapshots current state to ~/cos-recovery/ before any overwrite |
| Serialized | one exclusive .backup.lock (120s stale-reclaim) keeps the three callers from interleaving a push |
See also
- Architecture overview — where backup sits among the subsystems.
- Prompt-injection guard — the other fail-closed control, and the sibling of this subsystem's repo guard.
- Semantic search — the deliberate counterpoint: a subsystem that fails open.