ursa.layout

Two-store model (architecture v0.4)

Ursa is a two-store database backed by two R2 buckets:

  • Raw store: constellation-data (this module’s RAW_BUCKET). Cold/infrequent-access tier. Whole-file objects exactly as data-engine writes them; no chunk indexing or temporal structure. Designed to be read exactly once, by Virgo’s ingestion node, then rarely accessed again. A lifecycle policy can archive or delete raw segment files after a configurable retention window once Virgo has produced the processed artifact (tracked by ENG-1085).

  • Processed store: constellation-assets (this module’s ASSETS_BUCKET). Hot tier. Populated by Virgo’s ingestion node: Zarr arrays for regular continuous streams, Lance tables for irregular events and the catalog itself, MP4 + Lance frame indices for video.

Two buckets, two layouts:

constellation-data: raw recordings, written by data-engine rigs. Ursa treats this bucket as read-only. A Cloudflare bucket lock is applied to the recordings/ prefix so rigs cannot modify already-uploaded objects. Key structure (per data-engine PR #49):

    recordings/<recording_id>/<worker_subdir>/<file>   raw segment files
    manifests/<recording_id>/manifest.json             upload commit marker
    _status/<hostname>.json                            uploader heartbeat
    nodes/<node_id>.json                               node registry

constellation-assets: Ursa-managed objects, organised by repo header so each package owns a distinct prefix and permissions can be scoped independently:

    virgo/<recording_hash>/<modality>.<ext>              Virgo canonical outputs
    ursa/catalog/<table>.lance                           Lance catalog tables
    orion/checkpoints/<checkpoint_id>/                   Orion model checkpoints
    orion/benchmark-suites/<name>/<version>.json         Benchmark suite configs
    orion/benchmark-results/<result_id>.json             Benchmark evaluation results

Per-modality lifecycle (architecture v0.4)

Each ursa.catalog.ModalityRow carries two URI fields plus an ingestion_status enum:

  • raw_storage_uri — the immutable cold-bucket pointer (r2://constellation-data/recordings/...). Set at registration and preserved forever, even after Virgo’s ingestion node has produced the processed artifact, so re-ingestion is always possible.

  • storage_uri — the current authoritative location. While ingestion_status="raw" it mirrors raw_storage_uri; once Virgo’s ingestion node runs, the row is upserted with ingestion_status="processed", storage_uri swapped to the constellation-assets key, and format / domain_intervals / channel_spec populated.

recording_hash is the join key throughout — no re-keying during the raw → processed transition.
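The raw → processed transition described above can be sketched as follows. This is a minimal illustration, not Ursa's implementation: ModalityRowSketch, register_raw and mark_processed are hypothetical names, the row carries only a subset of the fields mentioned above, and the "zarr" format value is an assumption.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ModalityRowSketch:
    """Hypothetical stand-in for ModalityRow; only a few of its fields."""

    recording_hash: str
    modality: str
    raw_storage_uri: str  # immutable cold-bucket pointer
    storage_uri: str      # current authoritative location
    ingestion_status: str = "raw"
    format: Optional[str] = None


def register_raw(recording_hash: str, modality: str, raw_uri: str) -> ModalityRowSketch:
    # At registration both URI fields point at the raw prefix.
    return ModalityRowSketch(recording_hash, modality, raw_uri, raw_uri)


def mark_processed(row: ModalityRowSketch, processed_uri: str, fmt: str) -> ModalityRowSketch:
    # Swap storage_uri and status; raw_storage_uri and recording_hash stay untouched.
    return replace(row, storage_uri=processed_uri, ingestion_status="processed", format=fmt)


raw_row = register_raw(
    "abc123def456",
    "eeg",
    "r2://constellation-data/recordings/rec_20260507_143022_a7f3/eeg_default/",
)
processed_row = mark_processed(
    raw_row, "r2://constellation-assets/virgo/abc123def456/eeg.zarr", "zarr"
)
```

Because raw_storage_uri is never rewritten, the processed row still carries everything needed to re-run ingestion against the raw prefix.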

R2 storage layout conventions for Ursa.

All code that constructs or interprets R2 object keys (ingestion, query, lifecycle) MUST use the functions here. Having a single module as the canonical source prevents the key-structure from diverging across callers.

Module Contents

Functions

virgo_recording_prefix

Prefix for all Virgo-processed objects belonging to one recording.

virgo_modality_key

Key for one modality’s Virgo-processed object in constellation-assets.

virgo_modality_uri

Full r2:// URI for a Virgo-processed modality object.

catalog_prefix

Prefix for all Lance catalog tables in constellation-assets.

catalog_table_key

Key for a named Lance catalog table.

catalog_table_uri

Full r2:// URI for a Lance catalog table.

orion_checkpoint_prefix

Prefix for all objects belonging to one Orion model checkpoint.

orion_checkpoint_uri

Full r2:// URI for an Orion checkpoint prefix.

orion_checkpoint_data_hash_key

Key for the data-hash manifest inside a checkpoint.

orion_benchmark_suite_key

Key for a versioned benchmark suite configuration object.

orion_benchmark_suite_uri

Full r2:// URI for a benchmark suite configuration object.

orion_benchmark_result_key

Key for a benchmark evaluation result object.

orion_benchmark_result_uri

Full r2:// URI for a benchmark evaluation result.

raw_recording_prefix

Prefix for all raw segment files belonging to one recording.

raw_modality_prefix

Prefix for one modality’s raw segment files.

raw_modality_uri

Full r2:// URI for a raw modality prefix in constellation-data.

raw_commit_marker_key

Key for the upload commit marker written by the uploader after a complete session.

raw_status_key

Key for a rig’s uploader status heartbeat file.

raw_node_key

Key for a node’s registry entry in constellation-data.

validate_storage_uri

Reject a storage_uri that doesn’t match its StorageFormat tier.

Data

API

ursa.layout.__all__

[‘RAW_BUCKET’, ‘ASSETS_BUCKET’, ‘FORMAT_EXT’, ‘CANONICAL_FORMATS’, ‘virgo_recording_prefix’, ‘virgo_…

ursa.layout.RAW_BUCKET

‘constellation-data’

ursa.layout.ASSETS_BUCKET

‘constellation-assets’

ursa.layout.FORMAT_EXT: dict[ursa.catalog.schemas.StorageFormat, str]

None

ursa.layout.CANONICAL_FORMATS: frozenset[ursa.catalog.schemas.StorageFormat]

‘frozenset(…)’

ursa.layout.virgo_recording_prefix(recording_hash: str) -> str

Prefix for all Virgo-processed objects belonging to one recording.

Used by ingestion to list all canonical objects for a recording (e.g. before lifecycle GC runs). Not used for individual object writes; call virgo_modality_key for those.

Example: virgo/abc123def456/

ursa.layout.virgo_modality_key(recording_hash: str, modality: str, fmt: ursa.catalog.schemas.StorageFormat) -> str

Key for one modality’s Virgo-processed object in constellation-assets.

Only accepts canonical formats (ZARR, LANCE, MP4_INDEX, PARQUET). Raises ValueError for RAW_* formats, which belong in constellation-data and must be addressed via raw_modality_uri.

Example: virgo/abc123def456/eeg.zarr

ursa.layout.virgo_modality_uri(recording_hash: str, modality: str, fmt: ursa.catalog.schemas.StorageFormat) -> str

Full r2:// URI for a Virgo-processed modality object.

Example: r2://constellation-assets/virgo/abc123def456/eeg.zarr
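A minimal sketch of how the three Virgo helpers could fit together, written against the documented examples rather than the real source. The StorageFormat enum here is a hypothetical stand-in, RAW_EXAMPLE is a placeholder for the RAW_* family, and every extension other than zarr is an assumption (only eeg.zarr appears in the examples).

```python
from enum import Enum

ASSETS_BUCKET = "constellation-assets"


class StorageFormat(Enum):
    """Hypothetical stand-in for ursa.catalog.schemas.StorageFormat."""

    ZARR = "zarr"
    LANCE = "lance"
    MP4_INDEX = "mp4_index"
    PARQUET = "parquet"
    RAW_EXAMPLE = "raw_example"  # placeholder for the RAW_* family


# Assumed extension map; only "zarr" is confirmed by the documented examples.
FORMAT_EXT = {
    StorageFormat.ZARR: "zarr",
    StorageFormat.LANCE: "lance",
    StorageFormat.MP4_INDEX: "mp4",
    StorageFormat.PARQUET: "parquet",
}
CANONICAL_FORMATS = frozenset(FORMAT_EXT)


def virgo_recording_prefix(recording_hash: str) -> str:
    return f"virgo/{recording_hash}/"


def virgo_modality_key(recording_hash: str, modality: str, fmt: StorageFormat) -> str:
    # Raw formats live in constellation-data, so they are rejected here.
    if fmt not in CANONICAL_FORMATS:
        raise ValueError(f"{fmt.name} is not a canonical format")
    return f"{virgo_recording_prefix(recording_hash)}{modality}.{FORMAT_EXT[fmt]}"


def virgo_modality_uri(recording_hash: str, modality: str, fmt: StorageFormat) -> str:
    return f"r2://{ASSETS_BUCKET}/{virgo_modality_key(recording_hash, modality, fmt)}"


eeg_key = virgo_modality_key("abc123def456", "eeg", StorageFormat.ZARR)
eeg_uri = virgo_modality_uri("abc123def456", "eeg", StorageFormat.ZARR)
```

Building the URI from the key, and the key from the prefix, keeps the three helpers from drifting apart if the layout changes.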

ursa.layout.catalog_prefix() -> str

Prefix for all Lance catalog tables in constellation-assets.

Tables live under ursa/catalog/ — the ursa/ repo header scopes permissions for the Ursa package, mirroring how virgo/ scopes Virgo and orion/ scopes Orion.

Example: ursa/catalog/

ursa.layout.catalog_table_key(table_name: str) -> str

Key for a named Lance catalog table.

Prefer a TABLE_* constant over a bare string so that a rename is a single-file edit.

Example: ursa/catalog/recordings.lance

ursa.layout.catalog_table_uri(table_name: str) -> str

Full r2:// URI for a Lance catalog table.

Example: r2://constellation-assets/ursa/catalog/recordings.lance
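The catalog helpers can be illustrated with a small stand-alone sketch. The function bodies below are assumptions reconstructed from the documented examples, not the module's source; the constant values match those listed on this page.

```python
ASSETS_BUCKET = "constellation-assets"
TABLE_RECORDINGS = "recordings"
TABLE_MODALITIES = "modalities"


def catalog_prefix() -> str:
    return "ursa/catalog/"


def catalog_table_key(table_name: str) -> str:
    # One Lance dataset per table, named <table>.lance under the catalog prefix.
    return f"{catalog_prefix()}{table_name}.lance"


def catalog_table_uri(table_name: str) -> str:
    return f"r2://{ASSETS_BUCKET}/{catalog_table_key(table_name)}"


# Passing the constant rather than the literal "recordings" means a table
# rename only has to touch the constant's definition.
recordings_uri = catalog_table_uri(TABLE_RECORDINGS)
modalities_key = catalog_table_key(TABLE_MODALITIES)
```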

ursa.layout.TABLE_PARTICIPANTS

‘participants’

ursa.layout.TABLE_RECORDINGS

‘recordings’

ursa.layout.TABLE_MODALITIES

‘modalities’

ursa.layout.TABLE_EVENTS

‘events’

ursa.layout.TABLE_EMBEDDINGS

‘embeddings’

ursa.layout.TABLE_VIRGO_ASSETS

‘virgo_assets’

ursa.layout.TABLE_CHECKPOINTS

‘checkpoints’

ursa.layout.TABLE_BENCHMARK_SUITES

‘benchmark_suites’

ursa.layout.TABLE_BENCHMARK_RESULTS

‘benchmark_results’

ursa.layout.ALL_CATALOG_TABLES: tuple[str, ...]

()

ursa.layout.orion_checkpoint_prefix(checkpoint_id: str) -> str

Prefix for all objects belonging to one Orion model checkpoint.

CheckpointRow.storage_uri should be set to this prefix. The data-hash manifest lives at {orion_checkpoint_prefix(id)}data_hashes/manifest.json; use orion_checkpoint_data_hash_key to construct that path rather than concatenating strings.

Example: orion/checkpoints/ckpt-abc123/

ursa.layout.orion_checkpoint_uri(checkpoint_id: str) -> str

Full r2:// URI for an Orion checkpoint prefix.

Example: r2://constellation-assets/orion/checkpoints/ckpt-abc123/

ursa.layout.orion_checkpoint_data_hash_key(checkpoint_id: str) -> str

Key for the data-hash manifest inside a checkpoint.

This is the file Orion writes that lists every recording consumed during the training run, used for train/test overlap detection.

Example: orion/checkpoints/ckpt-abc123/data_hashes/manifest.json
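The sketch below shows how the checkpoint helpers relate, and how a data-hash manifest might feed a train/test overlap check. The helper bodies are reconstructed from the documented examples. The manifest schema (a JSON object with a "recording_hashes" list) and the overlapping_recordings function are purely hypothetical, since the real manifest format is not described here.

```python
import json

ASSETS_BUCKET = "constellation-assets"


def orion_checkpoint_prefix(checkpoint_id: str) -> str:
    return f"orion/checkpoints/{checkpoint_id}/"


def orion_checkpoint_uri(checkpoint_id: str) -> str:
    return f"r2://{ASSETS_BUCKET}/{orion_checkpoint_prefix(checkpoint_id)}"


def orion_checkpoint_data_hash_key(checkpoint_id: str) -> str:
    # Derived from the prefix so the two can never disagree.
    return f"{orion_checkpoint_prefix(checkpoint_id)}data_hashes/manifest.json"


def overlapping_recordings(manifest_json: str, test_hashes: set) -> set:
    # Hypothetical schema: {"recording_hashes": ["<hash>", ...]}.
    trained_on = set(json.loads(manifest_json)["recording_hashes"])
    return trained_on & set(test_hashes)


manifest_key = orion_checkpoint_data_hash_key("ckpt-abc123")
example_manifest = json.dumps({"recording_hashes": ["abc123def456", "0011aabbccdd"]})
leaked = overlapping_recordings(example_manifest, {"abc123def456", "ffff00001111"})
```

A non-empty result would mean at least one held-out recording was also consumed during training.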

ursa.layout.orion_benchmark_suite_key(suite_name: str, suite_version: int) -> str

Key for a versioned benchmark suite configuration object.

BenchmarkSuiteRow.storage_uri should point at this key. The object contains the held-out query spec and metric definitions.

Example: orion/benchmark-suites/cognitive_load_eval/1.json

ursa.layout.orion_benchmark_suite_uri(suite_name: str, suite_version: int) -> str

Full r2:// URI for a benchmark suite configuration object.

Example: r2://constellation-assets/orion/benchmark-suites/cognitive_load_eval/1.json

ursa.layout.orion_benchmark_result_key(result_id: str) -> str

Key for a benchmark evaluation result object.

BenchmarkResultRow.storage_uri should point at this key.

Example: orion/benchmark-results/result-deadbeef.json

ursa.layout.orion_benchmark_result_uri(result_id: str) -> str

Full r2:// URI for a benchmark evaluation result.

Example: r2://constellation-assets/orion/benchmark-results/result-deadbeef.json
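As an illustration of the benchmark key scheme, the following stand-alone sketch reproduces the documented examples. The function bodies are assumptions inferred from those examples, not copied from the module.

```python
ASSETS_BUCKET = "constellation-assets"


def orion_benchmark_suite_key(suite_name: str, suite_version: int) -> str:
    # The integer version becomes the file stem, so versions of a suite sit side by side.
    return f"orion/benchmark-suites/{suite_name}/{suite_version}.json"


def orion_benchmark_suite_uri(suite_name: str, suite_version: int) -> str:
    return f"r2://{ASSETS_BUCKET}/{orion_benchmark_suite_key(suite_name, suite_version)}"


def orion_benchmark_result_key(result_id: str) -> str:
    return f"orion/benchmark-results/{result_id}.json"


def orion_benchmark_result_uri(result_id: str) -> str:
    return f"r2://{ASSETS_BUCKET}/{orion_benchmark_result_key(result_id)}"


suite_uri = orion_benchmark_suite_uri("cognitive_load_eval", 1)
result_uri = orion_benchmark_result_uri("result-deadbeef")
```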

ursa.layout.raw_recording_prefix(recording_id: str) -> str

Prefix for all raw segment files belonging to one recording.

Matches the key layout introduced in data-engine PR #49: recordings/<recording_id>/. Note: manifests/ is a sibling prefix at the bucket root, not nested under recordings/.

Example: recordings/rec_20260507_143022_a7f3/

ursa.layout.raw_modality_prefix(recording_id: str, worker_subdir: str) -> str

Prefix for one modality’s raw segment files.

worker_subdir is the per-worker directory data-engine creates, e.g. camera_front_cam or eeg_default.

Example: recordings/rec_20260507_143022_a7f3/camera_front_cam/

ursa.layout.raw_modality_uri(recording_id: str, worker_subdir: str) -> str

Full r2:// URI for a raw modality prefix in constellation-data.

The URI points at the prefix (trailing /) — raw modalities consist of multiple segment files. The ingestion step (ENG-888) resolves individual objects within the prefix when building ModalityRow entries.

Example: r2://constellation-data/recordings/rec_20260507_.../camera_front_cam/

ursa.layout.raw_commit_marker_key(recording_id: str) -> str

Key for the upload commit marker written by the uploader after a complete session.

This is NOT under recordings/ — the manifests prefix sits at the bucket root alongside recordings/, _status/, and nodes/.

Example: manifests/rec_20260507_143022_a7f3/manifest.json

ursa.layout.raw_status_key(hostname: str) -> str

Key for a rig’s uploader status heartbeat file.

Example: _status/green-mantis.json

ursa.layout.raw_node_key(node_id: str) -> str

Key for a node’s registry entry in constellation-data.

Example: nodes/green-mantis.json
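The raw-bucket helpers above can be sketched together as follows. The bodies are assumptions reconstructed from the documented key layout and examples. The point worth seeing is that the commit marker, status and node keys are built from the bucket root, not from the recording prefix.

```python
RAW_BUCKET = "constellation-data"


def raw_recording_prefix(recording_id: str) -> str:
    return f"recordings/{recording_id}/"


def raw_modality_prefix(recording_id: str, worker_subdir: str) -> str:
    return f"{raw_recording_prefix(recording_id)}{worker_subdir}/"


def raw_modality_uri(recording_id: str, worker_subdir: str) -> str:
    # Points at a prefix (trailing slash), since a raw modality spans several segment files.
    return f"r2://{RAW_BUCKET}/{raw_modality_prefix(recording_id, worker_subdir)}"


def raw_commit_marker_key(recording_id: str) -> str:
    # Sibling of recordings/ at the bucket root, not nested beneath it.
    return f"manifests/{recording_id}/manifest.json"


def raw_status_key(hostname: str) -> str:
    return f"_status/{hostname}.json"


def raw_node_key(node_id: str) -> str:
    return f"nodes/{node_id}.json"


rec = "rec_20260507_143022_a7f3"
camera_uri = raw_modality_uri(rec, "camera_front_cam")
marker_key = raw_commit_marker_key(rec)
```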

ursa.layout._VALID_URI_SCHEMES: tuple[str, ...]

(‘r2’, ‘s3’, ‘gcs’, ‘file’)

ursa.layout.validate_storage_uri(uri: str, fmt: ursa.catalog.schemas.StorageFormat) -> None

Reject a storage_uri that doesn’t match its StorageFormat tier.

Phase 1a (M2) callers register raw modalities (RAW_*) under constellation-data and canonical modalities (ZARR, LANCE, MP4_INDEX, PARQUET) under constellation-assets. This helper enforces that contract before any catalog row is written.

Test-profile bucket suffixes (-test) are not yet recognised; see ENG-1071 (https://linear.app/constellationlab/issue/ENG-1071).

Raises ValueError when the URI points at the wrong bucket for its format. The typed Pydantic URI_PATTERN regex would catch malformed input upstream; this validator handles the semantic mismatch, where a syntactically valid URI names the wrong tier.
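The tier check can be sketched roughly as below. This is an illustration under several assumptions, not the real validator: formats are passed as plain name strings instead of StorageFormat members, validate_storage_uri_sketch is a hypothetical name, and URIs with a scheme other than r2 are skipped because their handling is not described here.

```python
RAW_BUCKET = "constellation-data"
ASSETS_BUCKET = "constellation-assets"
CANONICAL_FORMAT_NAMES = frozenset({"ZARR", "LANCE", "MP4_INDEX", "PARQUET"})


def validate_storage_uri_sketch(uri: str, fmt_name: str) -> None:
    # Decide which bucket this format belongs in.
    if fmt_name in CANONICAL_FORMAT_NAMES:
        expected_bucket = ASSETS_BUCKET
    elif fmt_name.startswith("RAW_"):
        expected_bucket = RAW_BUCKET
    else:
        raise ValueError(f"unknown format name: {fmt_name}")

    scheme, sep, rest = uri.partition("://")
    if not sep or scheme != "r2":
        return  # assumption: only r2:// URIs are tier-checked in this sketch

    # The bucket is the first path component after the scheme.
    bucket = rest.split("/", 1)[0]
    if bucket != expected_bucket:
        raise ValueError(
            f"{fmt_name} objects belong in {expected_bucket}, got bucket {bucket!r}"
        )


# A canonical format in the assets bucket passes silently.
validate_storage_uri_sketch("r2://constellation-assets/virgo/abc123def456/eeg.zarr", "ZARR")
```

Calling it with a ZARR format and a constellation-data URI raises ValueError, which is the semantic mismatch the regex alone cannot detect.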