Catalog rebuild — rebuild the production catalog from R2 raw data

Danger

ADMIN-ONLY OPERATION. This procedure replaces the production catalog Lance tables. It is destructive (the old catalog is archived, not deleted, but consumers will stop seeing the old rows the moment the switch lands) and coordinated (every machine using ursa.DataInterface(profile="r2") must restart or re-sync after the switch). Do not run this without (a) a documented reason, (b) a maintenance window announced in #engineering-support, and (c) a tested R2 backup of the current catalog prefix.

If you are reading this because something is wrong with the catalog and you’re not sure whether a rebuild is the answer: post in #engineering-support first. A rebuild is the answer for schema-shape problems and mass row-level corruption; it is not the answer for individual-row bugs (those are one-off scripts under scripts/).

Note

Scope. This runbook describes a catalog rebuild and its cutover via the active-catalog pointer (Phase 3). It is not the data-engine ingestion-schema rewrite itself (that work lives in the data-engine repo), not an automatic / cron-driven process (every step is gated on a documented maintenance window + operator sign-off), and not a multi-bucket migration (all steps stay within one R2 bucket per profile). For data-only fixes and additive schema changes — which do not need a rebuild or a pointer flip — see the “Catalog migrations & upgrades” section in the repo CLAUDE.md.

When you need this

A full catalog rebuild is justified when the schema or fill-policy of the existing catalog has drifted so far from the truth that row-level fixes don’t converge. Examples that have triggered it (or are expected to):

  • The ingestion format changes. When the data-engine ingestion-schema rewrite lands, recordings start carrying authoritative per-modality domain_intervals / channel_spec / started_at / ended_at at session start rather than at upload time. The 198 v0.1.0-backfilled rows were built before this signal existed; their metadata / device_info / duration are all stitched together from less reliable sources. A rebuild lets the catalog reflect the new schema’s full fidelity for those rows.

  • Lance schema evolution at the table level. When a MetadataDict column is promoted to a typed Lance MapType (ENG-1066), or a frequently-queried metadata key is promoted to its own typed column, a row-by-row migration may be more expensive than a full rebuild — especially when the inputs are still in R2 and re-derivation is cheap.

  • Mass loss of provenance. If a chain of one-off fix scripts has left the catalog in a state where you can no longer cleanly answer “where did this row come from?” — rebuild from the manifests, which are the ground truth.

A rebuild is not the answer when:

  • One column is wrong on N recordings (write a one-off scripts/fix_*.py instead).

  • A few recordings are missing (re-run ingest_from_r2 on those individually).

  • You want to add a new derived modality (that’s a Virgo job, not an Ursa catalog rebuild).

How it works

The catalog’s R2 location is resolved by a pointer object

Catalog.open(profile="r2") resolves the catalog’s Lance directory as <bucket>/<active_prefix>catalog/, where:

  1. The bucket name comes from src/ursa/config/profiles.yaml: constellation-assets for r2, constellation-assets-test for r2-test.

  2. The <active_prefix> comes from the active-catalog pointer: a small JSON object at <bucket>/ursa/ACTIVE.json whose active_prefix field names the live catalog prefix. This is read once per Catalog.open (resolve-at-open). If the pointer object is absent, the catalog.fallback_prefix in src/ursa/config/ursa.default.yaml is used instead.

With the pointer set to ursa/v0.2.0/ that resolves to:

r2://constellation-assets/ursa/v0.2.0/catalog/

The pointer is the rebuild’s primary lever. Build a staging catalog at a sibling prefix (ursa-staging-<UTC-TIMESTAMP>/), then flip the pointer with scripts/set_active_catalog.py — a single atomic PutObject. Both readers and writers (data-engine ingest) resolve through the same pointer, so they cut over together; there is no fleet-wide redeploy window during which a stale writer registers rows into the old catalog.

For build/verify (Phases 1–2) you target a specific prefix deterministically by setting catalog.resolution: pinned in a staging config_path (see Phase 1) — that bypasses the pointer entirely. The pointer governs the catalog directory only; asset bytes (Zarr / video / derived) are addressed by absolute storage_uri in each row, so a catalog rebuild needn’t move asset data.
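The resolution order described above (pointer first, fallback when the pointer object is absent, pinned bypassing the pointer) can be sketched as follows. This is an illustrative model only, not Ursa's implementation: resolve_catalog_uri, the in-memory objects mapping, and the "pointer" mode name are hypothetical; only ursa/ACTIVE.json, active_prefix, fallback_prefix, and resolution: pinned come from this runbook.

```python
import json
from typing import Mapping, Optional

POINTER_KEY = "ursa/ACTIVE.json"  # pointer location named in this runbook


def resolve_catalog_uri(
    bucket: str,
    objects: Mapping[str, bytes],
    fallback_prefix: str,
    resolution: str = "pointer",  # hypothetical name for the non-pinned mode
) -> str:
    """Return the catalog URI a fresh open would resolve (illustrative).

    `objects` is a hypothetical in-memory stand-in for the bucket (key -> bytes).
    """
    prefix: Optional[str] = None
    if resolution != "pinned":
        raw = objects.get(POINTER_KEY)
        if raw is not None:
            prefix = json.loads(raw)["active_prefix"]
    if prefix is None:
        # Pinned config, or pointer object absent: use the configured fallback.
        prefix = fallback_prefix
    return f"r2://{bucket}/{prefix}catalog/"


if __name__ == "__main__":
    bucket_objects = {POINTER_KEY: b'{"active_prefix": "ursa/v0.2.0/"}'}
    print(resolve_catalog_uri("constellation-assets", bucket_objects, "ursa/"))
```

Calling it once per open mirrors the resolve-at-open behaviour: a process that never re-opens the catalog never sees a later flip.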

Naming the staging prefix

Use ursa-staging-<UTC-TIMESTAMP>/ (e.g. ursa-staging-20260520T1900Z/), not ursa-staging-<DATE>/. The longer prefix costs nothing and is collision-proof across same-day retries — two attempts on the same date with <DATE>/ would either silently merge or require a manual prefix delete between runs.
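A small helper can generate the prefix so the timestamp format stays consistent between operators. This is a sketch; staging_prefix is a hypothetical name, and the format string is inferred from the ursa-staging-20260520T1900Z/ example above.

```python
from datetime import datetime, timezone
from typing import Optional


def staging_prefix(now: Optional[datetime] = None) -> str:
    """Build a staging prefix from a UTC timestamp at minute resolution."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return f"ursa-staging-{now:%Y%m%dT%H%MZ}/"


if __name__ == "__main__":
    print(staging_prefix(datetime(2026, 5, 20, 19, 0, tzinfo=timezone.utc)))
```

Minute resolution means two attempts started within the same minute would still collide; wait a minute between retries or extend the format with seconds.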

Why a pointer flip, not an R2 rename or a release bump

R2 has no native directory rename — a “rename” is a full copy-then-delete of every key under the prefix. For a Lance catalog with many small fragment files this is N×2 operations, and the in-flight window leaves consumers half-renamed. The historical alternative — editing ursa.default.yaml and cutting a release that every consumer re-pins — is not atomic across the fleet: a writer still on the old pin keeps registering rows into the old catalog until it redeploys. The pointer flip is atomic from every consumer’s POV: one PutObject swaps the prefix, and each consumer resolves either the old or new prefix on its next Catalog.open, never a mix. The old catalog stays in place until you archive it explicitly, so rollback is a one-line --rollback.

Procedure

The rebuild has four phases (plus a reconciliation step, Phase 3b) and a Rollback section. Phase 1 is reversible at any time (you can throw away the staging prefix and try again). Phases 2–4 are progressively less reversible. Read the entire procedure before starting.

Prerequisites

Verify all six before kicking off:

  1. Ursa version pin. Every consumer (research notebook, Virgo, Orion, ursa-mcp) is pinned to an Ursa version that includes the schema you’re rebuilding toward. Audit by searching pyproject.toml files in each consumer repo for ursa ==; confirm every consumer is on a version that contains the schema (see CHANGELOG for the cut release).

  2. Multi-node ingest support. ingest_from_r2 already consumes both manifest layouts: it reads the legacy single-key recordings/<rec_id>/manifest.json first, and falls back to merging per-node manifests under recordings/<rec_id>/_node_manifests/<hostname>.json in memory (src/ursa/register/orchestrator.py:963-1024). No separate prerequisite gates the per-node path. A recording with neither layout raises FileNotFoundError and lands in /tmp/rebuild_skipped.txt for the reconstruct-then-retry path (Phase 2.4).

  3. R2 credentials. You have assets_rw access in 1Password (catalog write) and raw_ro access (manifest read) for --profile r2.

  4. Backup. A recent dump of r2://constellation-assets/ursa/catalog/ exists somewhere off-bucket (a local LanceDB snapshot, or an R2 cross-bucket copy to constellation-backups/). The Rollback path assumes the old catalog is still at the old prefix; the off-bucket copy is insurance against that not being true.

  5. Maintenance window. Announced in #engineering-support at least 24 h ahead. Specify: (a) start time, (b) expected duration (see Timing below), (c) the prefix you’re flipping to. Researchers will pause work that hits DataInterface.query() during the switch.

  6. Dry-run on r2-test. Run the entire procedure end-to-end against the test bucket first. The current r2-test catalog is small (count varies; the rebuild + switch takes minutes), and any procedural gap surfaces there rather than in prod.
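The pin audit in prerequisite 1 can be scripted rather than done by hand. The sketch below assumes the consumer repos are checked out locally and that pins are written as quoted "ursa==<version>" strings in pyproject.toml; find_ursa_pins and the regex are illustrative, not part of Ursa.

```python
import re
from pathlib import Path
from typing import Dict, List

# Matches a quoted requirement such as "ursa==0.3.0" or 'ursa == 0.3.0'.
# It deliberately does not match other packages like "ursa-mcp==1.0".
PIN_RE = re.compile(r"""["']ursa\s*==\s*([^"',;\s]+)""")


def find_ursa_pins(repo_roots: List[Path]) -> Dict[str, List[str]]:
    """Map each pyproject.toml path to the ursa versions pinned in it."""
    pins: Dict[str, List[str]] = {}
    for root in repo_roots:
        for pyproject in sorted(root.rglob("pyproject.toml")):
            text = pyproject.read_text(encoding="utf-8")
            pins[str(pyproject)] = PIN_RE.findall(text)
    return pins


if __name__ == "__main__":
    for path, versions in find_ursa_pins([Path(".")]).items():
        print(path, versions or "no exact ursa pin found")
```

An empty list for a consumer means it has no exact pin in that file; check how that repo depends on Ursa before assuming it is covered.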

Timing

Wall-clock estimate for the prod rebuild is TBD — record the timing of the first end-to-end dry-run on r2-test here, then the first prod commit-run, so future operators can plan maintenance windows against real numbers.

Run                      Date      Wall-clock  Notes
First r2-test dry-run    (record)  (record)
First r2-test --commit   (record)  (record)
First r2 --commit        (record)  (record)

Phase 1 — Build the staging catalog

Goal: populate r2://constellation-assets/ursa-staging-<UTC-TIMESTAMP>/catalog/ from R2 raw data. Old prod catalog at ursa/catalog/ is untouched throughout.

1.1. Create a staging config file. Save as /tmp/ursa-staging.yaml (replace the timestamp with your chosen UTC value):

# Staging-catalog config — pins the catalog at a sibling prefix so the
# rebuild populates Lance tables in parallel to the live ones, and so
# build/verify deterministically targets the staging prefix regardless of
# what the live pointer says. `resolution: pinned` is the key line: it
# bypasses the active-catalog pointer entirely.
# Raw bucket is unchanged (we're reading raw recordings, not rewriting them).
catalog:
  resolution: pinned
  fallback_prefix: ursa-staging-20260520T1900Z/
stores:
  # assets_rw.prefix and assets_ro.prefix MUST be identical — the staging
  # catalog lives at this single prefix; if they drift, RO reads and RW
  # writes target different locations.
  assets_rw:
    backend: r2
    creds: assets_rw
    prefix: ursa-staging-20260520T1900Z/
  assets_ro:
    backend: r2
    creds: assets_ro
    prefix: ursa-staging-20260520T1900Z/
  raw_rw:
    backend: r2
    creds: raw_rw
    prefix: ""
  raw_ro:
    backend: r2
    creds: raw_ro
    prefix: ""

1.2. Sanity check that the staging config resolves correctly:

uv run python -c "
import ursa
from pathlib import Path
print(ursa.DataInterface(
    profile='r2',
    config_path=Path('/tmp/ursa-staging.yaml'),
).catalog.uri)
"
# expect '.../constellation-assets/ursa-staging-20260520T1900Z/catalog'

If the printed URI still shows /ursa/catalog, the config_path argument didn’t reach Catalog.open — re-check the path you passed.

1.3. Run the rebuild script in dry-run mode against the staging config. The script encodes Phase 1’s discovery + two-pass ingestion as code; nothing is written until you pass --commit. The script enforces a startup safety guard: --commit against a URI lacking ursa-staging aborts unless you pass --i-know-this-targets-prod.

uv run python scripts/rebuild_catalog.py --profile r2 --config /tmp/ursa-staging.yaml
# Reports n_total, n_ingested (dry-run shows the would-ingest count),
# n_already_ingested, n_skipped, n_corrected, n_indeterminate.

1.4. Once the dry-run report looks right, commit:

uv run python scripts/rebuild_catalog.py --profile r2 --config /tmp/ursa-staging.yaml --commit

The script is idempotent — re-running after a crashed first pass counts already-ingested rows in n_already_ingested rather than aborting on CatalogRowExists. If pass-2 produces indeterminate rows (rec_ids whose segment metadata can’t derive an end time), the script exits non-zero before pass-3 (compact) unless you pass --max-indeterminate <observed-count>. Indeterminate rows ship with metadata["duration_known_wrong"] = True so they’re queryable post-rebuild via data.list_recordings(metadata={"duration_known_wrong": True}).
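The gate between pass-2 and pass-3 can be pictured as a simple predicate. This is a sketch of the behaviour described above, not the script's code: compact_allowed is a hypothetical name, and treating --max-indeterminate as an upper bound (rather than an exact match) is an assumption.

```python
from typing import Optional


def compact_allowed(n_indeterminate: int, max_indeterminate: Optional[int] = None) -> bool:
    """Decide whether the rebuild may continue to the compact pass."""
    if n_indeterminate == 0:
        return True
    # Assumption: the flag acknowledges up to this many indeterminate rows.
    return max_indeterminate is not None and n_indeterminate <= max_indeterminate


if __name__ == "__main__":
    print(compact_allowed(3))                       # operator has not acknowledged
    print(compact_allowed(3, max_indeterminate=3))  # operator acknowledged 3 rows
```

Passing the observed count explicitly forces the operator to read the pass-2 report before the catalog is compacted.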

1.5. Cleanup. When the rebuild is done, remove the staging tmp files: /tmp/ursa-staging.yaml, /tmp/rebuild_skipped.txt.

If the rebuild is interrupted, /tmp/rebuild_skipped.txt is the only artifact worth preserving — it re-feeds the reconstruct-manifests retry for rec_ids with missing manifests. Everything else regenerates from a fresh script run; the staging catalog itself is the canonical resume point.

Startup safety guard

scripts/rebuild_catalog.py reads data.catalog.uri at startup and refuses to --commit unless the URI contains the staging marker (default ursa-staging; configurable via --staging-marker). To override (e.g. you adopt a different naming convention), pass --i-know-this-targets-prod. The mode-announcing header always prints the resolved URI so the guard’s decision is visible. Dry-run is always allowed regardless of URI.
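The guard's decision table is small enough to state as code. The function below is an illustrative restatement of the rules in this section; commit_allowed and its parameter names are hypothetical, while the default marker ursa-staging and the override flag's meaning come from the text above.

```python
def commit_allowed(
    uri: str,
    commit: bool,
    staging_marker: str = "ursa-staging",
    override: bool = False,  # stands in for --i-know-this-targets-prod
) -> bool:
    """Return True when the run may proceed under the startup guard."""
    if not commit:
        return True  # dry-run is always allowed, regardless of URI
    return staging_marker in uri or override


if __name__ == "__main__":
    staging = "r2://constellation-assets/ursa-staging-20260520T1900Z/catalog"
    live = "r2://constellation-assets/ursa/catalog"
    print(commit_allowed(staging, commit=True))
    print(commit_allowed(live, commit=True))
```

Because the check is a substring match, a prefix that merely contains the marker passes; that is one more reason to read the resolved URI in the header before committing.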

Phase 2 — Verify the staging catalog

Goal: confirm the staging catalog matches what you expect before switching consumers over.

Run these with config_path=Path("/tmp/ursa-staging.yaml") and compare to baseline reads from the live config (omit config_path).

2.1. Row counts:

uv run python -c "
import ursa
from pathlib import Path
for label, cfg in (('staging', Path('/tmp/ursa-staging.yaml')), ('live', None)):
    print(f'=== {label} ===')
    data = ursa.DataInterface(profile='r2', config_path=cfg)
    for tbl in ('participants', 'recordings', 'modalities', 'events', 'virgo_assets'):
        print(f'  {tbl}: {data.catalog.count(tbl)}')
"

Staging-recordings should be ≥ live-recordings (you’re not losing rows). Other tables depend on what the rebuild populated — if you didn’t re-derive virgo_assets, expect staging = 0 there.
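If you capture the two sets of counts from 2.1 as dictionaries, the acceptance rule can be checked mechanically. This is a minimal sketch; compare_counts is a hypothetical helper, and only the recordings rule (staging ≥ live) is treated as blocking, as stated above.

```python
from typing import Dict, List, Tuple


def compare_counts(
    staging: Dict[str, int], live: Dict[str, int]
) -> Tuple[List[str], List[str]]:
    """Return (blocking problems, informational notes) for staging vs live."""
    blocking: List[str] = []
    notes: List[str] = []
    for table in sorted(set(staging) | set(live)):
        s, l = staging.get(table, 0), live.get(table, 0)
        if table == "recordings" and s < l:
            blocking.append(f"recordings: staging {s} < live {l} (rows would be lost)")
        elif s != l:
            # Differences elsewhere depend on what the rebuild re-derived.
            notes.append(f"{table}: staging {s} vs live {l}")
    return blocking, notes


if __name__ == "__main__":
    blocking, notes = compare_counts(
        {"recordings": 200, "virgo_assets": 0},
        {"recordings": 198, "virgo_assets": 12},
    )
    print("blocking:", blocking)
    print("notes:", notes)
```

The example numbers are made up; substitute the counts printed by the 2.1 command.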

2.2. Demo recording end-to-end. The ENG-892 doctest pin is the canonical fixture:

uv run python -c "
import ursa
from pathlib import Path
data = ursa.DataInterface(profile='r2', config_path=Path('/tmp/ursa-staging.yaml'))
rec = data.get_recording('1ad200d69a2f8d36424e5ad00e1b4196f45c30fca791b623f6a060bddca446a3')
assert rec is not None, 'ENG-892 demo recording missing from staging'
print('recording_hash:', rec.recording_hash[:12])
print('duration:', rec.duration)
print('duration_source:', (rec.metadata or {}).get('duration_source'))
print('participants:', rec.participant_ids)
mods = [m for m in data.list_modalities() if m.recording_hash == rec.recording_hash]
print('modalities:', sorted(m.modality for m in mods))
"

duration should be plausibly close to the ~30 min the actual session lasted. If the staging catalog still shows a 34-day duration, pass-2 of the rebuild didn’t run successfully — re-run scripts/rebuild_catalog.py --commit and check pass-2 log lines for that rec_id.

2.3. Run the full Ursa test suite against staging. The doctests pinned to the ENG-892 demo recording are the regression gate.

These doctests always construct DataInterface(profile='r2') against the packaged-default prefix. To exercise them against staging, create a ./ursa.yaml symlink pointing to /tmp/ursa-staging.yaml in the repo working directory (the cwd lookup is still in the config search order) and run:

ln -s /tmp/ursa-staging.yaml ./ursa.yaml
uv run pytest tests/ --doctest-modules src/ursa -v
rm ./ursa.yaml  # restore the live default

2.4. Skipped-recording audit. Re-read /tmp/rebuild_skipped.txt (written by pass-1) and decide whether each skip is acceptable:

  • FileNotFoundError for a recording → neither a legacy recordings/<id>/manifest.json nor any _node_manifests/*.json was found (likely a crashed worker that never wrote a manifest). Run examples/reconstruct_missing_manifests.py --profile r2 --rec-id <id> --yes to add a reconstructed manifest, then re-run the rebuild script for that rec_id (the pass-1 wrap handles already-ingested siblings without aborting).

  • Other exceptions → investigate per-row; do not switch.
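When triaging a FileNotFoundError skip, it helps to keep the manifest lookup order in mind: legacy single-key manifest first, then per-node manifests, then failure. The sketch below models only that ordering against a list of bucket keys; manifest_keys is a hypothetical helper, and the real merge logic in orchestrator.py is not reproduced here.

```python
from typing import Iterable, List, Tuple


def manifest_keys(rec_id: str, keys: Iterable[str]) -> Tuple[str, List[str]]:
    """Return (layout, keys to read) for a recording, legacy layout first."""
    present = set(keys)
    legacy = f"recordings/{rec_id}/manifest.json"
    if legacy in present:
        return "legacy", [legacy]
    node_prefix = f"recordings/{rec_id}/_node_manifests/"
    per_node = sorted(
        k for k in present if k.startswith(node_prefix) and k.endswith(".json")
    )
    if per_node:
        return "per-node", per_node
    # Neither layout exists: this is the case that lands in the skipped file.
    raise FileNotFoundError(f"no manifest found for recording {rec_id}")


if __name__ == "__main__":
    bucket_keys = [
        "recordings/abc/_node_manifests/host-1.json",
        "recordings/abc/_node_manifests/host-2.json",
    ]
    print(manifest_keys("abc", bucket_keys))
```

Listing the recording's keys in R2 and running them through a check like this tells you whether a skip is a genuinely missing manifest or something else.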

Phase 3 — Flip the pointer to the staging catalog

Goal: every ursa.DataInterface(profile="r2") caller in the org reads from the staging catalog on its next Catalog.open.

This is the cutover step. The Rollback path exists (see Rollback), but rolling back after the flip is more disruptive than abandoning the rebuild before it. Confirm your maintenance window is active and #engineering-support is aware.

3.1. Flip the active-catalog pointer with scripts/set_active_catalog.py. It dry-runs by default, validates that the target prefix actually holds a catalog (<prefix>catalog/recordings.lance/ is non-empty) before writing, and requires --confirm-prod-flip for the production bucket:

# Dry-run first — prints current → target and writes nothing.
uv run python scripts/set_active_catalog.py \
    --profile r2 --prefix ursa-staging-<UTC-TIMESTAMP>/

# Commit the flip (one atomic PutObject to <bucket>/ursa/ACTIVE.json).
uv run python scripts/set_active_catalog.py \
    --profile r2 --prefix ursa-staging-<UTC-TIMESTAMP>/ \
    --commit --confirm-prod-flip --note "v0.3.0 cutover"

Note

--confirm-prod-flip here is a mandatory confirmation of an always-prod action — distinct from rebuild_catalog.py’s --i-know-this-targets-prod, which overrides a staging-marker heuristic. The names differ on purpose so muscle memory from one script can’t misfire the other.

Because both readers and writers resolve the prefix through the pointer, the flip reaches every consumer that opens a fresh DataInterface after it lands — no per-consumer release bump or config drop is required. (The legacy release-pin / per-machine-YAML paths still apply only to consumers configured with catalog.resolution: pinned, which opt out of the pointer.) Resolution is at Catalog.open, so long-lived sessions pick up the flip on their next reconstruct.

3.2. Smoke test against the now-live prefix. From any consumer:

uv run python -c "
import ursa
data = ursa.DataInterface(profile='r2')
print('catalog at:', data.catalog.uri)  # expect '.../ursa-staging-<TS>/catalog'
print('recordings:', data.catalog.count('recordings'))
"

The first line confirms the consumer picked up the new prefix. The count must match what Phase 2 reported for staging.

3.3. Watch for the first hour. Park on #engineering-support and monitor for any researcher reporting missing data, slow queries, or CatalogNotInitialized errors. The most common Phase-3 failure mode is a long-lived session that hasn’t reconstructed since the flip — data.catalog.uri still shows the old prefix; tell them to restart the process.

Phase 3b — Reconcile the cutover-window tail

Goal: catch any recordings that landed in the old catalog during the build→flip window (an in-flight ingest that committed against the old prefix before the flip reached its process).

Because recordings are content-addressed by recording_hash and the raw bytes always live in constellation-data (never lost), the catalog is re-derivable from R2. Reconcile by recording_hash set-difference, not a timestamp watermark — raw recordings arrive out of order, so a start_time cutoff would silently skip a session that predates the mark but uploaded after the flip.

uv run python -c "
import ursa
data = ursa.DataInterface(profile='r2')           # now resolves the new (live) catalog
have = {r.recording_hash for r in data.list_recordings()}
# Enumerate rec_ids present in the raw bucket (recordings/<rec_id>/), map each
# to its recording_hash, and collect those whose hash is not in `have`.
# This runbook does not specify the enumeration helper, so fill the list in.
missing_rec_ids = None                            # REQUIRED: rec_ids whose hash is not in have
assert missing_rec_ids is not None, 'populate missing_rec_ids before running'
data.enable_writes()
for rec_id in sorted(missing_rec_ids):
    data.ingest_from_r2(rec_id)                    # idempotent on identical rows; safe to re-run
"

ingest_from_r2 registers with recording_on_conflict="error" (src/ursa/register/orchestrator.py:1086), which is idempotent on re-register with identical fields (src/ursa/data_interface.py:2866) — re-ingesting a hash already present with the same row is a no-op, while a divergent row raises CatalogRowExists so a real collision (hand-correction, double-backfill) surfaces loudly rather than silently taking max(duration). The reconcile only ingests hashes missing from the new catalog, so each lands cleanly; repeat until the set-difference is empty. Only then proceed to Phase 4. This converts the cutover-window race from silent data loss into eventual consistency.
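The case for a hash set-difference over a timestamp watermark is easiest to see with toy data. In the sketch below, RawRecording and both helper functions are hypothetical; the point is only that a session which started before the flip but uploaded after it is found by the set-difference and missed by the watermark.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Set


@dataclass(frozen=True)
class RawRecording:
    rec_id: str
    recording_hash: str
    start_time: datetime  # when the session started, not when it uploaded


def missing_by_hash(raw: Iterable[RawRecording], have: Set[str]) -> List[str]:
    """rec_ids whose hash is absent from the new catalog."""
    return sorted(r.rec_id for r in raw if r.recording_hash not in have)


def missing_by_watermark(raw: Iterable[RawRecording], flip: datetime) -> List[str]:
    """rec_ids a start_time cutoff would select (shown for contrast)."""
    return sorted(r.rec_id for r in raw if r.start_time >= flip)


if __name__ == "__main__":
    flip = datetime(2026, 5, 20, 19, 0)
    raw = [
        RawRecording("a", "h-a", datetime(2026, 5, 20, 17, 0)),
        # Started before the flip, uploaded after it: absent from the new catalog.
        RawRecording("late-upload", "h-late", datetime(2026, 5, 20, 18, 30)),
        RawRecording("c", "h-c", datetime(2026, 5, 20, 19, 30)),
    ]
    have = {"h-a"}
    print("set-difference:", missing_by_hash(raw, have))
    print("watermark:     ", missing_by_watermark(raw, flip))
```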

Phase 4 — Archive the old catalog

Goal: keep the pre-rebuild catalog reachable for ~30 days as Rollback insurance, then delete it.

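4.1. Rename the old catalog directory from ursa/catalog/ to ursa-prev-<YYYY-MM-DD>/catalog/ (copy, verify, then delete; the pointer object at ursa/ACTIVE.json stays where it is). Use any S3-compatible tool with multi-part copy support: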

# rclone example — see https://rclone.org/s3/ for setup against R2.
rclone copy \
    r2-prod:constellation-assets/ursa/catalog/ \
    r2-prod:constellation-assets/ursa-prev-2026-05-20/catalog/ \
    --transfers 32 --checksum

# Verify byte-for-byte before deletion.
rclone check \
    r2-prod:constellation-assets/ursa/catalog/ \
    r2-prod:constellation-assets/ursa-prev-2026-05-20/catalog/

# Delete the old prefix only after `check` reports zero diff.
rclone delete r2-prod:constellation-assets/ursa/catalog/

The rclone check step is non-negotiable. Lance dataset directories carry _versions/, _indices/, _transactions/, _latest.manifest, plus fragment data; a partial copy that misses any one of those leaves the renamed catalog unreadable.
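The property being verified is that the two prefixes hold exactly the same set of keys with the same content. The sketch below shows that comparison over two listings given as relative key to checksum mappings; diff_listings is a hypothetical helper for illustration and is not a substitute for running rclone check.

```python
from typing import Dict, List


def diff_listings(src: Dict[str, str], dst: Dict[str, str]) -> List[str]:
    """Compare two listings (relative key -> checksum); empty list means identical."""
    problems: List[str] = []
    for key in sorted(src.keys() | dst.keys()):
        if key not in dst:
            problems.append(f"missing in destination: {key}")
        elif key not in src:
            problems.append(f"unexpected in destination: {key}")
        elif src[key] != dst[key]:
            problems.append(f"checksum differs: {key}")
    return problems


if __name__ == "__main__":
    source = {
        "recordings.lance/_latest.manifest": "aa",
        "recordings.lance/_versions/1.manifest": "bb",
    }
    partial_copy = {"recordings.lance/_latest.manifest": "aa"}
    print(diff_listings(source, partial_copy))
```

A single missing key under _versions/ or _transactions/ is enough to fail the check, which matches the "any one of those" warning above.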

Note

The rclone dependency is the only brew install / apt install step in this runbook, in tension with the workspace-CLAUDE.md preference for pure-Python alternatives. An in-tree replacement (Catalog.copy_to(dst_prefix)) is tracked as ENG-1244; once that lands, this section switches to a one-line data.catalog.copy_to('ursa-prev-<DATE>/catalog/') and the rclone dependency goes away.

4.2. Schedule deletion. The archived prefix lives for at least 30 days. After that window, an admin runs rclone delete r2-prod:constellation-assets/ursa-prev-2026-05-20/ to reclaim the bytes. Do not auto-schedule this — the deletion is a manual step so a human signs off on “yes, we’re past the rollback horizon.”
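To avoid date arithmetic by hand when the sign-off comes up, the earliest deletion date can be derived from the archive prefix name. This sketch assumes the date embedded in ursa-prev-<YYYY-MM-DD>/ is the day the archive was made; earliest_deletion is a hypothetical helper and it only computes a date, it deletes nothing.

```python
from datetime import date, timedelta

ARCHIVE_MARKER = "ursa-prev-"


def earliest_deletion(archive_prefix: str, retention_days: int = 30) -> date:
    """Return the first day on which the archived prefix may be deleted."""
    name = archive_prefix.rstrip("/")
    if not name.startswith(ARCHIVE_MARKER):
        raise ValueError(f"not an archive prefix: {archive_prefix!r}")
    archived_on = date.fromisoformat(name[len(ARCHIVE_MARKER):])
    return archived_on + timedelta(days=retention_days)


if __name__ == "__main__":
    print(earliest_deletion("ursa-prev-2026-05-20/"))
```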

Rollback

Trigger: any researcher reports missing data, or data.catalog.count('recordings') drops below the pre-rebuild baseline, or the doctest suite fails against the live config after Phase 3.

Before Phase 4 (old catalog still in place):

  • Flip the pointer back — atomic, one command:

    uv run python scripts/set_active_catalog.py \
        --profile r2 --rollback --commit --confirm-prod-flip
    

    --rollback reads the pointer’s recorded previous_prefix (the prefix that was live before this flip) and swaps to it. Consumers re-resolve the old prefix on their next Catalog.open and the rebuild is invisible.

  • The staging prefix ursa-staging-<TS>/ stays in R2 for forensics. Keep it for 30 days, then rclone delete.
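The flip and rollback semantics above amount to a two-field swap on the pointer document. The sketch below models that with plain dictionaries; flip_pointer and rollback_pointer are hypothetical names, and storing the note inside the pointer is an assumption. Only active_prefix and previous_prefix are field names taken from this runbook.

```python
from typing import Dict, Optional


def flip_pointer(
    pointer: Optional[Dict[str, str]], new_prefix: str, note: str = ""
) -> Dict[str, str]:
    """Return the pointer document after switching to new_prefix."""
    previous = pointer["active_prefix"] if pointer else ""
    return {"active_prefix": new_prefix, "previous_prefix": previous, "note": note}


def rollback_pointer(pointer: Dict[str, str]) -> Dict[str, str]:
    """Return the pointer document after swapping back to previous_prefix."""
    previous = pointer.get("previous_prefix")
    if not previous:
        raise ValueError("pointer records no previous_prefix to roll back to")
    return flip_pointer(pointer, previous, note="rollback")


if __name__ == "__main__":
    live = {"active_prefix": "ursa/", "previous_prefix": ""}
    flipped = flip_pointer(live, "ursa-staging-20260520T1900Z/", note="cutover")
    print(flipped)
    print(rollback_pointer(flipped))
```

In this model a rollback is itself a flip, so the staging prefix becomes the new previous_prefix and a second rollback would return to it.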

After Phase 4 (old catalog renamed to ursa-prev-<DATE>/):

  • The pointer’s previous_prefix still names the original prefix, which has been renamed away. First restore the bytes: copy ursa-prev-<DATE>/ back to its original prefix (same rclone copy + rclone check ritual), then --rollback (or --prefix <restored>) the pointer to it.

  • Announce the rollback in #engineering-support. The rebuild is suspect until the root cause is identified.

Anti-patterns to avoid

  • Don’t skip Phase 2. Verifying staging before the switch is the entire reason a parallel catalog exists. Skipping it forfeits the safety property.

  • Don’t run Phase 4 immediately after Phase 3. The 30-day window is short, but deleting too early cannot be undone. Wait.

  • Don’t skip Phase 3b (reconciliation). The pointer makes accidental dual-write far less likely — there’s one resolution source, and writers cut over with readers — but a write already in flight when the flip lands can still commit to the old prefix. The recording_hash set-difference reconciliation is the belt-and-suspenders that recovers those stragglers; run it before Phase 4 and confirm the delta is empty.

  • One operator at a time on the pointer. set_active_catalog.py writes the pointer with an ETag-conditional put, so a concurrent flip/rollback fails loud (ETagMismatch) rather than silently clobbering the previous_prefix lineage — but coordinate flips through #engineering-support regardless; don’t run two at once.

  • Don’t override the safety guard casually. scripts/rebuild_catalog.py --i-know-this-targets-prod exists for a reason but is verbose-by-design. If you’re typing it without explicit cause, stop and re-confirm the catalog URI.

  • Don’t rebuild against profile="local". ingest_from_r2 rejects the local profile (there’s no R2 to read from); a local rebuild would have to upload its output to R2 separately, which defeats the parallel-catalog model.

  • Don’t rebuild “for fun.” Catalog reads are the load-bearing path for research workflows. Researchers depend on stable queries. A rebuild that doesn’t fix a documented problem is operational risk for no gain.
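The "one operator at a time" bullet above relies on an ETag-conditional write. The following in-memory model shows why a concurrent flip fails loudly instead of clobbering the other operator's write. PointerStore and its methods are hypothetical, the ETag here is a SHA-256 of the body purely for illustration, and the ETagMismatch class is defined locally to mirror the error name mentioned above.

```python
import hashlib
from typing import Optional, Tuple


class ETagMismatch(Exception):
    """Raised when the stored object changed since it was read."""


class PointerStore:
    """In-memory stand-in for a single object with conditional writes."""

    def __init__(self) -> None:
        self._body: Optional[bytes] = None

    @staticmethod
    def _etag(body: Optional[bytes]) -> Optional[str]:
        return None if body is None else hashlib.sha256(body).hexdigest()

    def get(self) -> Tuple[Optional[bytes], Optional[str]]:
        return self._body, self._etag(self._body)

    def put_if_match(self, body: bytes, expected_etag: Optional[str]) -> Optional[str]:
        # Write only if nobody else has written since `expected_etag` was read.
        if self._etag(self._body) != expected_etag:
            raise ETagMismatch("pointer changed since it was read; re-read and retry")
        self._body = body
        return self._etag(body)


if __name__ == "__main__":
    store = PointerStore()
    store.put_if_match(b'{"active_prefix": "ursa/"}', None)
    _, seen_by_a = store.get()
    _, seen_by_b = store.get()  # operator B reads the same version
    store.put_if_match(b'{"active_prefix": "ursa-staging-20260520T1900Z/"}', seen_by_a)
    try:
        store.put_if_match(b'{"active_prefix": "ursa/"}', seen_by_b)
    except ETagMismatch as exc:
        print("second writer rejected:", exc)
```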

Open questions / future work

  • In-tree Catalog.copy_to helper to retire rclone (ENG-1244). Phase 4 currently uses rclone copy + rclone check. An in-tree alternative would let the runbook drop the host-binary dependency.

  • Online rebuild. Today’s procedure requires a maintenance window because the switch is consumer-coordinated. A future change could make Catalog.open re-read its prefix on each call (or on a signal), letting consumers pick up the new prefix without restarts. Out of scope for this runbook.

  • Cross-bucket rebuilds. All the steps above keep the rebuild in the same R2 bucket. A cross-bucket rebuild (e.g. promoting r2-test content to r2) is a separate procedure — Linear ticket pending.

  • Incremental rebuild. Today’s rebuild ingests every recording from scratch. For very large catalogs, an incremental “only the rows that changed schema” rebuild may be cheaper. Defer until catalog size makes the full rebuild’s wall-clock unacceptable.

  • Pass-2 retirement. scripts/rebuild_catalog.py’s pass-2 (legacy-only duration correction from segment metadata) is dead code once the legacy 198-row cohort is corrected and the data-engine schema rewrite has landed. Remove pass-2 at that point.