Catalog maintenance — data-only backfills & fixes¶
Note
Scope. This page covers category-A catalog work (see the “Catalog migrations & upgrades” section of the repo CLAUDE.md): correcting field values or backfilling missing rows against the live catalog prefix, with no rebuild and no pointer flip. These operations are safe to run while writers keep appending — they upsert via register_recording(on_conflict="overwrite"). If you need to rename/retype a column or rebuild rows from raw, that’s a category-C catalog rebuild instead.
The maintenance scripts in scripts/ all default to dry-run (no writes without --commit). Their prod-write guards differ: fix_durations_inplace.py, reconstruct_all_recent.py, and fix_modalities_start_epoch.py refuse --commit against a non-staging catalog URI unless you pass --i-know-this-targets-prod (the same guard rebuild_catalog.py uses); backfill_participants.py instead requires --confirm-prod plus an interactive typed-bucket-name prompt; audit_participants.py is read-only and needs no guard. Run a dry-run first; it prints exactly what would change and writes nothing.
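The dry-run default and the prod guard described above can be sketched roughly as follows. This is an illustration only, not the scripts' actual code: `build_parser`, `is_staging`, and `check_write_guard` are hypothetical names, the example URIs are made up, and the "staging" substring test is an assumption about how a staging catalog URI might be recognised.

```python
from __future__ import annotations

import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with the two flags named in the text; everything else is omitted."""
    parser = argparse.ArgumentParser(description="maintenance script (sketch)")
    parser.add_argument("--commit", action="store_true",
                        help="write changes; without it the run is a dry-run")
    parser.add_argument("--i-know-this-targets-prod", action="store_true",
                        dest="ack_prod",
                        help="acknowledge a write against a non-staging catalog")
    return parser


def is_staging(catalog_uri: str) -> bool:
    # ASSUMPTION: a staging catalog is recognisable from its URI.
    # The real scripts may decide this differently.
    return "staging" in catalog_uri


def check_write_guard(args: argparse.Namespace, catalog_uri: str) -> str:
    """Return "dry-run" or "commit"; exit when a prod write is not acknowledged."""
    if not args.commit:
        return "dry-run"
    if not is_staging(catalog_uri) and not args.ack_prod:
        raise SystemExit(
            "refusing --commit against a non-staging catalog "
            "without --i-know-this-targets-prod"
        )
    return "commit"
```

The guard is evaluated before any write happens, so a forgotten flag costs a re-run rather than an unintended prod write.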
fix_durations_inplace.py — correct recording durations¶
Re-derives every recording’s end time from raw segment timestamps (derive_recording_end_time → ModalitySignals.max_end, which prefers actual data-stream ends over the worker_report process-teardown time) and overwrites the stored duration. It only writes rows whose derived duration differs from the stored value by more than --min-delta-seconds (default 60s), so already-correct rows aren’t churned. Rows with no derivable end are flagged duration_known_wrong=True rather than left with a wrong value.
Use this when durations are wrong across many rows — e.g. after a timezone bug inflated ended_at, or to re-correct rows a prior rebuild “fixed” with the old worker_report-inclusive max_end. As a safety brake (mirroring rebuild_catalog.py), --commit refuses (exit 2) if more than --max-indeterminate N rows yield no derivable end; default 0, so eyeball the dry-run’s indeterminate= count and raise the cap when that many are expected.
# Dry-run against prod (no writes) — eyeball the per-row deltas:
op run -- uv run python scripts/fix_durations_inplace.py --profile r2
# Apply (back the catalog up first — see the rebuild runbook's server-side copy):
op run -- uv run python scripts/fix_durations_inplace.py --profile r2 --commit --i-know-this-targets-prod
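The write-selection rule and the safety brake described in this section can be sketched as below. It is a simplified model under stated assumptions: durations are plain seconds keyed by a recording id, a derived value of `None` stands for "no derivable end", and `Plan`, `plan_duration_fixes`, and `commit_allowed` are hypothetical names rather than the script's own.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Plan:
    updates: dict[str, float]   # recording id -> corrected duration (seconds)
    indeterminate: list[str]    # rows that would get duration_known_wrong=True
    unchanged: list[str]        # rows within the threshold, left alone


def plan_duration_fixes(
    stored: dict[str, float],
    derived: dict[str, float | None],
    min_delta_seconds: float = 60.0,
) -> Plan:
    """Decide per row whether to overwrite, flag, or skip."""
    plan = Plan(updates={}, indeterminate=[], unchanged=[])
    for rec_id, stored_s in stored.items():
        derived_s = derived.get(rec_id)
        if derived_s is None:
            # No derivable end: flag the row instead of keeping a wrong value.
            plan.indeterminate.append(rec_id)
        elif abs(derived_s - stored_s) > min_delta_seconds:
            # Differs by MORE than the threshold: worth rewriting.
            plan.updates[rec_id] = derived_s
        else:
            plan.unchanged.append(rec_id)
    return plan


def commit_allowed(plan: Plan, max_indeterminate: int = 0) -> bool:
    """Safety brake: refuse when more than the cap are indeterminate."""
    return len(plan.indeterminate) <= max_indeterminate
```

With the default cap of 0, a single indeterminate row blocks the commit, which is why the dry-run count is worth checking before raising `--max-indeterminate`.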
reconstruct_all_recent.py — backfill recordings missing from the catalog¶
Some recordings reach R2 but never make it into the catalog: they carry a legacy-shape manifest.json (started_at/ended_at/workers, with no recording_hash and no files inventory — the shape written by data-engine deployments that predate manifest-at-start) or no manifest at all (a crashed uploader). ingest_from_r2 rejects both at the _require_augmented gate, so the observability dashboard shows a gap for those days.
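The three manifest situations described above can be told apart with a check along these lines. This is a sketch, assuming `recording_hash` and `files` are top-level keys of manifest.json; `classify_manifest` is a hypothetical helper, and the real `_require_augmented` gate may check more than this.

```python
from __future__ import annotations

import json


def classify_manifest(raw: str | None) -> str:
    """Classify a manifest.json body as "missing", "legacy", or "augmented"."""
    if raw is None:
        # No manifest at all, e.g. the uploader crashed before writing one.
        return "missing"
    manifest = json.loads(raw)
    # ASSUMPTION: an ingestable manifest carries both a content hash and a
    # files inventory at the top level.
    if "recording_hash" in manifest and "files" in manifest:
        return "augmented"
    # started_at / ended_at / workers only: the legacy shape.
    return "legacy"
```

Only the "augmented" case would pass straight through ingestion; the other two are the candidates for reconstruction.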
For each recording started in the last --days days that isn’t already in the catalog, smallest-first, the script:
1. reconstructs a manifest via examples/reconstruct_missing_manifests.py: this streams one GET per file through sha256 (it downloads and hashes every byte) and writes a recording_hash + files manifest back to the recording prefix;
2. runs ingest_from_r2, which hot-promotes the reconstructed manifest into a catalog row;
3. corrects the duration: reconstruct stamps an upload-time ended_at, so the duration is re-derived from segment timestamps (same logic as fix_durations_inplace.py).
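The reason reconstruction transfers every byte is that the content hash is computed over the streamed body. A minimal sketch of that hashing step, assuming the GET body arrives as an iterable of byte chunks; `sha256_of_chunks` is a hypothetical helper, and the layout of the files inventory itself is not reproduced here.

```python
from __future__ import annotations

import hashlib
from typing import Iterable


def sha256_of_chunks(chunks: Iterable[bytes]) -> tuple[str, int]:
    """Hash a streamed body without holding it in memory.

    Returns the hex digest and the total number of bytes seen.
    """
    digest = hashlib.sha256()
    size = 0
    for chunk in chunks:
        digest.update(chunk)   # incremental update, one chunk at a time
        size += len(chunk)
    return digest.hexdigest(), size
```

Memory use stays bounded by the chunk size, but the network cost is still the full size of each file.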
It is idempotent (recordings already in the catalog are skipped), so an interrupted run resumes on re-invocation. Empty/aborted sessions with no uploadable data are skipped (they correctly get no row). Each reconstruct has a per-recording timeout scaled by byte size, so one stalled GET can’t block the queue.
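The worklist ordering and the size-scaled timeout can be sketched as follows. All names and constants here are assumptions for illustration: `Candidate`, `build_worklist`, and `timeout_seconds` are hypothetical, a zero byte count stands in for "no uploadable data", `max_gb` mirrors the script's `--max-gb` flag, and the base timeout and minimum throughput are invented numbers.

```python
from __future__ import annotations

from dataclasses import dataclass

GB = 1024 ** 3  # ASSUMPTION: binary gigabytes


@dataclass(frozen=True)
class Candidate:
    recording_id: str
    size_bytes: int


def build_worklist(
    candidates: list[Candidate],
    in_catalog: set[str],
    max_gb: float | None = None,
) -> list[Candidate]:
    """Skip catalogued and empty recordings, then order smallest-first."""
    todo = [
        c for c in candidates
        if c.recording_id not in in_catalog   # idempotent: already done
        and c.size_bytes > 0                  # empty/aborted session
    ]
    if max_gb is not None:
        # Defer the heaviest sessions to a later pass.
        todo = [c for c in todo if c.size_bytes <= max_gb * GB]
    return sorted(todo, key=lambda c: c.size_bytes)


def timeout_seconds(
    size_bytes: int,
    base_s: float = 300.0,                     # hypothetical fixed allowance
    min_bytes_per_s: float = 5 * 1024 ** 2,    # hypothetical throughput floor
) -> float:
    """Per-recording timeout that grows linearly with byte size."""
    return base_s + size_bytes / min_bytes_per_s
```

Because catalogued recordings are filtered out on every invocation, re-running after an interruption naturally picks up where the previous run stopped.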
Warning
Reconstruct transfers the recording’s full byte volume. A batch of EEG sessions can be hundreds of GB. Run this on a host with high R2 bandwidth (an in-region cloud node), not a laptop. Use --max-gb to defer the heaviest sessions when you only want the quick wins (e.g. fill the dashboard’s recent days) first.
The recordings themselves only exist in the prod raw bucket, so a meaningful run targets --profile r2:
# Dry-run: list the worklist + per-recording sizes + total GB (no writes, no downloads):
op run -- uv run python scripts/reconstruct_all_recent.py --profile r2 --days 10
# Backfill, deferring sessions over 5 GB for a later pass:
op run -- uv run python scripts/reconstruct_all_recent.py --profile r2 --days 10 \
--max-gb 5 --commit --i-know-this-targets-prod
The final VERIFY recordings=… recent<N>d=… newest=… line reports the catalog total, how many fall inside the window, and the newest recording’s date — confirm recent<N>d rose and newest advanced to the expected date.
fix_modalities_start_epoch.py — backfill start_epoch_ns on legacy modality rows¶
Modality rows ingested before the shared-origin alignment work carry zero-origin domain_intervals (each modality anchored to its own first sample) and start_epoch_ns IS NULL, so cross-modal offsets within a recording are lost. This script re-anchors those rows onto the shared recording origin (RecordingRow.start_time) without re-ingesting any bulk data. It derives each modality’s own first-sample wall-clock epoch from its raw timestamps sidecars (ursa.recovery.first_epoch, which uses the same formulas Virgo’s shared-origin parsers apply at ingest), shifts every domain_intervals pair by (first_sample_epoch − recording.start_time), and sets start_epoch_ns. All updated rows are then re-registered in a single batched on_conflict="overwrite" catalog commit, with every other field forwarded verbatim.
It selects only processed rows with start_epoch_ns IS NULL, so it is idempotent: committed rows drop out of the selection and re-runs never double-shift. Each derived epoch must land inside the recording’s own [start, start+duration] window (± 1 h slack) or the row is skipped as out-of-bounds, so a glitched first row can’t commit a wild shift. As a safety brake (mirroring fix_durations_inplace.py), --commit refuses (exit 2) if more than --max-underivable N in-scope rows yield no derivable epoch (missing/GC’d raw sidecars, parse failures, out-of-window epochs). The default is 0, so eyeball the dry-run’s underivable=/out_of_window=/no_recording= counts (the gate sums all three) and raise the cap only once you’ve confirmed those rows are the expected permanent-NULL set: GC’d raw sidecars, out-of-bounds epochs, and modality rows whose recording is absent from the catalog. Use --modality SLUG (repeatable) to restrict the scope to specific slugs.
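The re-anchoring arithmetic and the window check can be sketched as below. This assumes every quantity is an integer nanosecond count and that domain_intervals are (start, end) pairs; the stored representation may differ, and `in_window` and `reanchor` are hypothetical names.

```python
from __future__ import annotations

NS_PER_S = 1_000_000_000
SLACK_NS = 3600 * NS_PER_S  # the 1 h slack on either side of the window


def in_window(
    first_epoch_ns: int,
    rec_start_ns: int,
    rec_duration_ns: int,
    slack_ns: int = SLACK_NS,
) -> bool:
    """True when the derived epoch lies in [start, start+duration] plus slack."""
    lower = rec_start_ns - slack_ns
    upper = rec_start_ns + rec_duration_ns + slack_ns
    return lower <= first_epoch_ns <= upper


def reanchor(
    intervals: list[tuple[int, int]],
    first_epoch_ns: int,
    rec_start_ns: int,
    rec_duration_ns: int,
) -> list[tuple[int, int]] | None:
    """Shift zero-origin intervals onto the shared recording origin.

    Returns None when the epoch is out of window, so the row is skipped.
    """
    if not in_window(first_epoch_ns, rec_start_ns, rec_duration_ns):
        return None
    offset = first_epoch_ns - rec_start_ns  # first_sample_epoch - recording.start_time
    return [(start + offset, end + offset) for start, end in intervals]
```

A modality whose first sample arrives two seconds after the recording start ends up with every interval shifted by two seconds, which restores its offset relative to the other modalities.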
Two modality families are out of scope by design: video rows (video-webcam / camera / screen) are backfilled by re-ingesting through Virgo’s sweep — their MP4 frame-index sidecar must be rewritten together with the domain origin — and pupillabs-scene has no joinable per-frame timestamps, so those rows stay NULL permanently.
# Dry-run against prod (no writes) — eyeball the per-row shifts + underivable report:
op run -- uv run python scripts/fix_modalities_start_epoch.py --profile r2
# Apply (back the catalog up first — see the rebuild runbook's server-side copy):
op run -- uv run python scripts/fix_modalities_start_epoch.py --profile r2 \
--commit --max-underivable <N> --i-know-this-targets-prod
The script prints the start_epoch_ns IS NULL AND ingestion_status = 'processed' count before and after. Definition of done: that count falls to the size of the expected permanent-NULL set — video rows still pending the Virgo re-ingest sweep, pupillabs-scene, and any rows whose raw sidecars were GC’d (all enumerated in the run’s report).
backfill_participants.py — recover dropped participant names¶
Recovers operator-typed participant names that never reached the catalog: recordings reconstructed before the reconstruction path started reading participant.txt ended up with empty or "unknown" participant_ids. It collects names from participant.txt (NAS primary via --nas-root, which holds the complete set; R2 as fallback), canonicalizes them (title-cases for consistency; an explicit merge map folds known partials), and skips test-mode placeholders ("Testing mode"/"test") and unknown-sentinel values. It then upserts a ParticipantRow and re-registers each recording with on_conflict="overwrite" to attach the slug and strip the sentinel. A rollback snapshot is written before the recording overwrites; --restore <snapshot.json> reverts.
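The canonicalization and skip rules can be sketched as follows. The placeholder and sentinel sets come from the text above, but the merge-map entry is invented for illustration, `canonicalize` is a hypothetical name, and the use of `str.title()` is an assumption about how title-casing is done (it mishandles names such as "McDonald").

```python
from __future__ import annotations

PLACEHOLDERS = {"testing mode", "test"}   # test-mode placeholders
UNKNOWN_SENTINELS = {"", "unknown"}       # empty / unknown-sentinel values
MERGE_MAP = {"Jo": "Jo Example"}          # HYPOTHETICAL partial -> full name


def canonicalize(raw: str, merge_map: dict[str, str] = MERGE_MAP) -> str | None:
    """Return the canonical participant name, or None when it must be skipped."""
    name = " ".join(raw.split())          # trim and collapse whitespace
    if name.casefold() in PLACEHOLDERS | UNKNOWN_SENTINELS:
        return None                       # intentionally unattributed
    name = name.title()                   # title-case for consistency
    return merge_map.get(name, name)      # fold known partials
```

Returning `None` for placeholders keeps test recordings unattributed instead of creating a participant named "Test".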
Its safety flags differ from those of the --i-know-this-targets-prod scripts above: a prod write needs --confirm-prod plus an interactive typed-bucket-name prompt. A committed run also requires --nas-root (or an explicit --allow-r2-only once the NAS is gone), and collisions or casefold near-duplicates block --commit unless you pass --allow-collisions.
# Dry-run against prod, NAS primary (no writes):
op run -- uv run python scripts/backfill_participants.py --profile r2 \
--nas-root /Volumes/Shared_Drive/constellation-data/recordings
# Commit (prompts for the bucket name):
op run -- uv run python scripts/backfill_participants.py --profile r2 \
--nas-root /Volumes/Shared_Drive/constellation-data/recordings --commit --confirm-prod
audit_participants.py — standing guardrail for dropped names¶
Read-only. Flags recordings that are unattributed (empty/sentinel participant_ids) and have a recoverable name in R2 participant.txt (applying the same placeholder/sentinel skip as the backfill, so intentionally-unattributed test recordings aren’t flagged). Exits non-zero when any such drop exists, so a nightly conformance job catches a future regression within a day. Safe to run against prod.
op run -- uv run python scripts/audit_participants.py --profile r2
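The audit condition and its exit code can be sketched like this. It is a simplified model, not the script's code: recordings are represented as a mapping from a recording id to its participant_ids, `r2_names` holds the contents of each recording's participant.txt, and `is_unattributed`, `find_drops`, and `exit_code` are hypothetical names.

```python
from __future__ import annotations

UNKNOWN = {"", "unknown"}
SKIP_NAMES = UNKNOWN | {"test", "testing mode"}   # same skip rule as the backfill


def is_unattributed(participant_ids: list[str]) -> bool:
    """True when the list is empty or holds only empty/sentinel values."""
    return all(p.strip().casefold() in UNKNOWN for p in participant_ids)


def find_drops(
    recordings: dict[str, list[str]],
    r2_names: dict[str, str],
) -> list[str]:
    """Recording ids that are unattributed yet have a recoverable name."""
    drops = []
    for rec_id, participant_ids in recordings.items():
        name = r2_names.get(rec_id, "").strip()
        if is_unattributed(participant_ids) and name.casefold() not in SKIP_NAMES:
            drops.append(rec_id)
    return sorted(drops)


def exit_code(drops: list[str]) -> int:
    """Non-zero when any drop exists, so a scheduled job fails loudly."""
    return 1 if drops else 0
```

A nightly job only needs to check the process exit status; the list of flagged ids is there for whoever follows up.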
The durable fix lives upstream¶
reconstruct_all_recent.py is a remediation for recordings that reached R2 without a usable manifest — it is not the normal ingestion path. When you find yourself reconstructing a whole recent batch, the rigs that produced it are running a data-engine that predates manifest-at-start (no content hash written at session start). The lasting fix is upgrading those rigs so the standard ingest_from_r2 path works without a download-and-hash pass; file/track that in the data-engine repo rather than relying on repeated reconstructs.