ursa.catalog.schemas

Pydantic schemas for the nine Lance catalog tables.

Design notes:

  • extra="allow" is deliberate. A row read with the model from version N can still round-trip unknown columns added in version N+1, so additive Lance schema evolution stays a config change, not a code change. This is the opposite of the secrets models in constellation-utils, which use extra="forbid".

  • frozen=True: declared fields are immutable once constructed. Catalog mutations go through writes of new rows. Note: __pydantic_extra__ is a regular dict and is not frozen — we treat that as a permitted escape hatch for ad-hoc extra hydration during ingestion, and lock the behavior in a test so a Pydantic upgrade can’t change it silently.

  • All datetime fields use UTCDatetime: aware-only on input, normalized to UTC after validation. Tradeoff: the original tz offset is discarded; the rationale is fewer foot-guns when comparing timestamps across rows in the lifecycle layer (e.g. last_accessed_at GC).

  • Map-style fields are constrained to scalar values (and lists of scalars). Lance/Arrow MapType requires a single value type per column, and the architecture promises “frequently-used keys can be promoted to typed columns later.” Both promises break with nested dicts or arbitrary objects.

  • Time-origin convention: per-recording time fields (domain_intervals tuple bounds, event_time, TimeWindow.start_seconds) are in the recording’s native time domain. Negative values are legal for re-aligned timelines (stimulus-onset-relative events, pre-roll baseline). Only the relative invariant end > start is enforced per interval; intervals must additionally be sorted ascending and non-overlapping.

  • ID fields use CatalogID: non-empty, matching [A-Za-z0-9_-]+. Stricter content-hash / UUID4 enforcement is deferred to a shared util / writer.

  • EmbeddingRow.vector is the highest-cardinality hot path; per-element Pydantic validation of a 1024+-dim list[float] is ~4 ms/row. We type the field as Any and run a single BeforeValidator that checks length and v[0]. Static type degrades to Any; runtime contract is preserved.

Module Contents

Classes

StorageFormat

How a modality’s bytes are laid out in object storage.

IngestionStatus

Per-modality ingestion lifecycle (architecture v0.4 two-store model).

TimeWindow

Half-open [start, end) window in the recording’s native time domain.

EmbeddingSource

Source of an embedding row: which (recording, modality, window) it covers. Same nested-struct caveat as TimeWindow.

CatalogRow

Base for all Lance catalog table rows.

ParticipantRow

One row per enrolled participant.

RecordingRow

One row per recording.

ModalityRow

One row per stream within a recording.

EventRow

System prompts, user responses, and any time-stamped event.

VirgoAssetRow

One row per Virgo output, with full provenance.

CheckpointRow

One row per Orion model checkpoint.

BenchmarkSuiteRow

One row per versioned benchmark suite configuration.

BenchmarkResultRow

One row per benchmark evaluation result.

EmbeddingRow

Vector embedding over a (recording, modality, window) tuple.

Functions

_to_utc

Normalize an aware datetime to UTC.

_check_homogeneous_lists

Reject mixed-type lists in a metadata map.

_validate_vector_fast

Fast-path vector validator: skips per-element Pydantic checks (~4 ms/1024-dim). Accepts any sized indexable sequence — list, tuple, numpy.ndarray, torch.Tensor — without importing numpy/torch. Only checks v[0]; heterogeneous tails slip through and are caught by Arrow at write time. Raises ValueError (not TypeError) so Pydantic wraps as ValidationError.

Data

API

ursa.catalog.schemas.__all__

[‘BenchmarkResultRow’, ‘BenchmarkSuiteRow’, ‘CatalogID’, ‘CatalogRow’, ‘CheckpointRow’, ‘EmbeddingRo…

ursa.catalog.schemas._to_utc(dt: datetime.datetime) → datetime.datetime

Normalize an aware datetime to UTC.

AwareDatetime rejects naive inputs upstream, so dt.tzinfo is guaranteed non-None here. The astimezone call is a no-op when input is already UTC.
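A minimal standard-library sketch of the normalization described above. The name to_utc and the explicit naive-input check are assumptions for illustration; in the module itself, naive inputs are rejected upstream by AwareDatetime.

```python
from datetime import datetime, timedelta, timezone


def to_utc(dt: datetime) -> datetime:
    """Convert an aware datetime to UTC (hypothetical stand-in for _to_utc)."""
    if dt.tzinfo is None or dt.utcoffset() is None:
        # Assumption: the real module never reaches this branch because
        # AwareDatetime has already rejected naive values.
        raise ValueError("naive datetime is not allowed")
    return dt.astimezone(timezone.utc)


# 09:30 at UTC+02:00 is the same instant as 07:30 UTC; only the offset is lost.
local = datetime(2024, 5, 1, 9, 30, tzinfo=timezone(timedelta(hours=2)))
normalized = to_utc(local)
```

The normalized value compares equal to the input because aware datetimes compare by instant, which is what makes cross-row timestamp comparisons safe after normalization.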

ursa.catalog.schemas.UTCDatetime

None

ursa.catalog.schemas.ScalarMetadataValue

None

ursa.catalog.schemas.MetadataValue

None

ursa.catalog.schemas._check_homogeneous_lists(value: dict[str, Any]) → dict[str, Any]

Reject mixed-type lists in a metadata map.

MetadataValue already type-restricts list elements to scalars, but a list with mixed scalar types ([1, "two", True]) still passes. Lance MapType requires a single Arrow type per column, so we walk the values and reject heterogeneous lists at construction time. Bool is a subclass of int — collapse them to the same bucket so [True, False, 0] isn’t flagged as mixed.
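The check described above can be sketched roughly as follows. This is not the module's implementation: the helper names are hypothetical, and treating int and float as distinct buckets is an assumption.

```python
from typing import Any


def _bucket(item: Any) -> type:
    # bool is a subclass of int, so both fall into the int bucket.
    return int if isinstance(item, bool) else type(item)


def check_homogeneous_lists(value: dict[str, Any]) -> dict[str, Any]:
    """Raise ValueError if any list value mixes scalar types (illustrative only)."""
    for key, item in value.items():
        if isinstance(item, list) and len({_bucket(x) for x in item}) > 1:
            raise ValueError(f"metadata[{key!r}] is a mixed-type list")
    return value


ok = check_homogeneous_lists({"flags": [True, False, 0], "site": "lab-a"})
```

Raising ValueError lets a Pydantic validator surface the failure as a ValidationError at construction time.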

ursa.catalog.schemas.MetadataDict

None

ursa.catalog.schemas.ID_PATTERN

‘^[A-Za-z0-9_-]+$’

ursa.catalog.schemas.CatalogID

None

ursa.catalog.schemas.URI_PATTERN

‘^(r2|s3|gcs|file)://\S+$’

ursa.catalog.schemas.StorageURI

None
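To make the two patterns above concrete, here is a small sketch that applies them with Python's re module. The helper names are hypothetical, and how CatalogID and StorageURI attach these patterns to their fields is not shown on this page.

```python
import re

ID_PATTERN = r"^[A-Za-z0-9_-]+$"
URI_PATTERN = r"^(r2|s3|gcs|file)://\S+$"


def is_catalog_id(value: str) -> bool:
    # fullmatch also rejects a trailing newline, which "$" alone would tolerate.
    return re.fullmatch(ID_PATTERN, value) is not None


def is_storage_uri(value: str) -> bool:
    return re.fullmatch(URI_PATTERN, value) is not None


examples = {
    "p042": is_catalog_id("p042"),                   # slug-style ID
    "a/b": is_catalog_id("a/b"),                     # "/" is outside the class
    "s3://bucket/key": is_storage_uri("s3://bucket/key"),
    "https://host/key": is_storage_uri("https://host/key"),  # scheme not listed
}
```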

ursa.catalog.schemas.NonEmptyString

None

ursa.catalog.schemas.ModalityName

None

class ursa.catalog.schemas.StorageFormat

Bases: enum.StrEnum

How a modality’s bytes are laid out in object storage.

Two tiers:

  • Canonical: Ursa-managed, produced by ingestion or Virgo. These are the permanent storage formats that ursa.query reads directly.

  • Raw (RAW_*): data-engine-native segment files, registered during Phase 1a ingestion before Virgo has processed them. ModalityRow entries with raw formats have storage_uri pointing at the data-engine raw prefix. Virgo promotes raw modalities to canonical formats; old raw rows are retired by lifecycle GC. See ursa.layout for key conventions.

ZARR

‘zarr’

LANCE

‘lance’

MP4_INDEX

‘mp4_index’

PARQUET

‘parquet’

RAW_BINARY

‘raw_binary’

RAW_CSV

‘raw_csv’

RAW_JSONL

‘raw_jsonl’

RAW_AUDIO

‘raw_audio’

RAW_VIDEO

‘raw_video’

class ursa.catalog.schemas.IngestionStatus

Bases: enum.StrEnum

Per-modality ingestion lifecycle (architecture v0.4 two-store model).

  • raw: modality registered against a cold-bucket raw file (whole object addressable via ModalityRow.raw_storage_uri); Virgo’s ingestion node has not yet run, so domain_intervals, channel_spec, and format may be null. storage_uri mirrors raw_storage_uri until ingestion completes.

  • processed: Virgo’s ingestion node has converted the raw file to a canonical format (Zarr / MP4 + Lance frame index). format is a canonical (non-RAW_*) value, domain_intervals and channel_spec are populated, and storage_uri points at the processed object on the hot bucket. raw_storage_uri is preserved so re-ingestion is always possible.

RAW

‘raw’

PROCESSED

‘processed’

class ursa.catalog.schemas.TimeWindow(/, **data: typing.Any)

Bases: pydantic.BaseModel

Half-open [start, end) window in the recording’s native time domain.

No absolute bound on start_seconds — re-aligned timelines (stimulus onset, pre-roll baseline) legitimately produce negative times. Only the relative invariant end > start is enforced.

Note (deferred benchmark): nested struct columns in Lance have weaker filter pushdown than top-level columns. If query latency on VirgoAssetRow.time_window or EmbeddingRow.source.time_window becomes a bottleneck, the writer can flatten internally and treat this Pydantic type as a logical view.
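As an illustration of the half-open convention and the end > start invariant, here is a plain-dataclass sketch. The class name Window and the contains helper are hypothetical and not part of TimeWindow's documented API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Hypothetical stand-in for TimeWindow: half-open [start, end)."""

    start_seconds: float
    end_seconds: float

    def __post_init__(self) -> None:
        # Only the relative invariant is checked; negative times are legal.
        if not self.end_seconds > self.start_seconds:
            raise ValueError("end_seconds must be greater than start_seconds")

    def contains(self, t: float) -> bool:
        # Half-open: the start is included, the end is excluded.
        return self.start_seconds <= t < self.end_seconds


# A two-second pre-roll baseline ending at stimulus onset (t = 0).
baseline = Window(start_seconds=-2.0, end_seconds=0.0)
```

With this convention, adjacent windows such as [-2, 0) and [0, 2) tile a timeline without double-counting the boundary sample.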

Initialization

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

model_config

‘ConfigDict(…)’

start_seconds: float

None

end_seconds: float

None

_end_after_start() → ursa.catalog.schemas.TimeWindow

class ursa.catalog.schemas.EmbeddingSource(/, **data: typing.Any)

Bases: pydantic.BaseModel

Source of an embedding row: which (recording, modality, window) it covers. Same nested-struct caveat as TimeWindow.

model_config

‘ConfigDict(…)’

recording_hash: ursa.catalog.schemas.CatalogID

None

modality: ursa.catalog.schemas.ModalityName

None

time_window: ursa.catalog.schemas.TimeWindow

None

ursa.catalog.schemas._validate_vector_fast(v: Any) → list[float]

Fast-path vector validator: skips per-element Pydantic checks (~4 ms/1024-dim). Accepts any sized indexable sequence — list, tuple, numpy.ndarray, torch.Tensor — without importing numpy/torch. Only checks v[0]; heterogeneous tails slip through and are caught by Arrow at write time. Raises ValueError (not TypeError) so Pydantic wraps as ValidationError.
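A rough sketch of the fast path described above, limited to plain Python sequences. Several details are assumptions rather than the module's behavior: the function name, accepting int alongside float, and returning list(v). NumPy or torch inputs would need their own element check, which is not attempted here.

```python
from typing import Any


def validate_vector_fast(v: Any) -> list[float]:
    """Check only the length and the first element (illustrative sketch)."""
    if not (hasattr(v, "__len__") and hasattr(v, "__getitem__")):
        raise ValueError("vector must be a sized, indexable sequence")
    if len(v) == 0:
        raise ValueError("vector must be non-empty")
    first = v[0]
    # bool is an int subclass, so it has to be excluded explicitly.
    if isinstance(first, bool) or not isinstance(first, (int, float)):
        raise ValueError("vector elements must be floats")
    # No per-element validation: a bad tail is left for the writer to catch.
    return list(v)


vec = validate_vector_fast((0.25, 0.5, 0.75))
```

The cost of the checks is constant in the vector length; only the final list copy scales with it.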

ursa.catalog.schemas.Vector

None

class ursa.catalog.schemas.CatalogRow(/, **data: typing.Any)

Bases: pydantic.BaseModel

Base for all Lance catalog table rows.

Subclasses MUST declare __primary_key__ as a non-empty tuple of field names that identify the row uniquely within its table. Failure raises TypeError at class-definition time. The future Lance writer reads this attribute to enforce uniqueness.
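The class-definition-time check can be illustrated without Pydantic using Python's __init_subclass__ hook; the module itself uses __pydantic_init_subclass__. The class names are hypothetical, and requiring every subclass to declare the attribute directly (rather than inherit it) is an assumption.

```python
from typing import ClassVar


class Row:
    """Hypothetical plain-Python stand-in for CatalogRow."""

    __primary_key__: ClassVar[tuple[str, ...]] = ()

    def __init_subclass__(cls, **kwargs: object) -> None:
        super().__init_subclass__(**kwargs)
        # Look only at the subclass's own namespace, not inherited values.
        pk = cls.__dict__.get("__primary_key__")
        if not isinstance(pk, tuple) or not pk:
            raise TypeError(f"{cls.__name__} must declare a non-empty __primary_key__")


class GoodRow(Row):
    __primary_key__ = ("recording_hash", "modality")


try:
    class BadRow(Row):  # no __primary_key__ declared
        pass
except TypeError as exc:
    error = str(exc)
```

Failing at class definition means a missing key is caught at import time, before any row is constructed.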

model_config

‘ConfigDict(…)’

__primary_key__: ClassVar[tuple[str, ...]]

()

classmethod __pydantic_init_subclass__(**kwargs: Any) → None

class ursa.catalog.schemas.ParticipantRow(/, **data: typing.Any)

Bases: ursa.catalog.schemas.CatalogRow

One row per enrolled participant.

Primary key: participant_id — any unique catalog ID. By convention a short slug like p042; not enforced.

__primary_key__: ClassVar[tuple[str, ...]]

(‘participant_id’,)

participant_id: ursa.catalog.schemas.CatalogID

None

enrolled_at: ursa.catalog.schemas.UTCDatetime

None

metadata: ursa.catalog.schemas.MetadataDict

‘Field(…)’

class ursa.catalog.schemas.RecordingRow(/, **data: typing.Any)

Bases: ursa.catalog.schemas.CatalogRow

One row per recording.

Primary key: recording_hash — any unique catalog ID; content-hash convention enforced later via a shared util.

participant_ids is a list (architecture v0.4): a recording can cover multiple participants (multi-subject experiments, dyad sessions, crowd recordings). Single-participant recordings carry a one-element list. The list must be non-empty.

__primary_key__: ClassVar[tuple[str, ...]]

(‘recording_hash’,)

recording_hash: ursa.catalog.schemas.CatalogID

None

participant_ids: list[ursa.catalog.schemas.CatalogID]

‘Field(…)’

start_time: ursa.catalog.schemas.UTCDatetime

None

duration: datetime.timedelta

‘Field(…)’

device_info: ursa.catalog.schemas.MetadataDict

‘Field(…)’

metadata: ursa.catalog.schemas.MetadataDict

‘Field(…)’

class ursa.catalog.schemas.ModalityRow(/, **data: typing.Any)

Bases: ursa.catalog.schemas.CatalogRow

One row per stream within a recording.

Primary key: composite (recording_hash, modality). The Lance writer will enforce uniqueness; this row schema only declares the contract.

Architecture v0.4 splits a modality’s lifecycle into two states tracked by ingestion_status:

  • raw: registered against a cold-bucket raw file. format may be a RAW_* value or null (unknown at registration time); domain_intervals and channel_spec are typically null. storage_uri mirrors raw_storage_uri (the immutable cold-bucket pointer).

  • processed: Virgo’s ingestion node has converted the raw file to a canonical format. format must be a non-RAW_* value; domain_intervals and channel_spec must be populated; storage_uri points at the processed object on the hot bucket.

raw_storage_uri is preserved across the transition so re-ingestion is always possible.
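The coherence rules for the two states can be sketched with a dataclass. This is a simplified, hypothetical stand-in for ModalityRow with a trimmed format enum; it checks only the processed-state requirements and makes no claim about what the real validator enforces for raw rows.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Format(str, Enum):  # trimmed stand-in for StorageFormat
    ZARR = "zarr"
    RAW_BINARY = "raw_binary"


@dataclass(frozen=True)
class Modality:  # hypothetical stand-in for ModalityRow
    ingestion_status: str
    storage_uri: str
    raw_storage_uri: str
    format: Optional[Format] = None
    domain_intervals: Optional[list] = None
    channel_spec: Optional[dict] = None

    def __post_init__(self) -> None:
        if self.ingestion_status != "processed":
            return  # raw rows may leave the optional fields null
        if self.format is None or self.format.name.startswith("RAW_"):
            raise ValueError("processed rows need a canonical (non-RAW_*) format")
        if self.domain_intervals is None or self.channel_spec is None:
            raise ValueError("processed rows need domain_intervals and channel_spec")


# The bucket names and keys below are made up for the example.
raw_row = Modality("raw", "r2://cold/rec1/eeg.bin", "r2://cold/rec1/eeg.bin")
```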

storage_uri always points to the current authoritative object (raw URI initially, swapped to Zarr/MP4 after Virgo ingestion).

raw_storage_uri is the permanent cold-bucket pointer.

domain_intervals is a list of (start, end) tuples in the recording’s native time domain, handling non-continuous recordings, irregular series, and gaps. Each interval enforces end > start and intervals must be sorted ascending and non-overlapping. Null while ingestion_status="raw" (Virgo populates it during ingestion using temporaldata domain-computation utilities).
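A minimal sketch of the interval rules just described, written as a standalone function. The function name is hypothetical, and allowing touching intervals (one interval starting exactly where the previous one ends) is an assumption the text does not settle.

```python
def validate_domain_intervals(
    intervals: list[tuple[float, float]],
) -> list[tuple[float, float]]:
    """Illustrative check: each end > start, sorted ascending, no overlap."""
    previous_end = None
    for start, end in intervals:
        if not end > start:
            raise ValueError(f"interval ({start}, {end}) must satisfy end > start")
        # Assumption: start == previous_end (touching intervals) is allowed.
        if previous_end is not None and start < previous_end:
            raise ValueError("intervals must be sorted ascending and non-overlapping")
        previous_end = end
    return intervals


# Negative bounds are legal; the gap between 2.5 and 4.0 models a dropout.
checked = validate_domain_intervals([(-1.0, 0.0), (0.0, 2.5), (4.0, 5.0)])
```

Because every interval has end > start, requiring each start to be at or after the previous end implies both ascending order and non-overlap in a single pass.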

channel_spec uses MetadataDict: per-channel structured metadata (e.g. polarity, reference) should be encoded as parallel lists ({"channel_names": [...], "polarity": [...], "reference": [...]}), not as a list of dicts; that is required for Lance Map queryability.
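For example, per-channel records can be converted to the parallel-list encoding as follows. The channel names and attribute values are made up for illustration.

```python
# Per-channel records as one might naturally write them (a list of dicts).
channels = [
    {"name": "Fz", "polarity": "positive", "reference": "Cz"},
    {"name": "Pz", "polarity": "negative", "reference": "Cz"},
]

# Parallel-list encoding: one homogeneous list per attribute, aligned by index.
channel_spec = {
    "channel_names": [c["name"] for c in channels],
    "polarity": [c["polarity"] for c in channels],
    "reference": [c["reference"] for c in channels],
}
```

Each value is a list of a single scalar type, which fits the scalar-or-list-of-scalars constraint on map-style fields.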

__primary_key__: ClassVar[tuple[str, ...]]

(‘recording_hash’, ‘modality’)

recording_hash: ursa.catalog.schemas.CatalogID

None

modality: ursa.catalog.schemas.ModalityName

None

ingestion_status: ursa.catalog.schemas.IngestionStatus

None

storage_uri: ursa.catalog.schemas.StorageURI

None

raw_storage_uri: ursa.catalog.schemas.StorageURI

None

format: ursa.catalog.schemas.StorageFormat | None

None

sampling_rate: float | None

‘Field(…)’

domain_intervals: list[tuple[float, float]] | None

None

channel_spec: ursa.catalog.schemas.MetadataDict | None

None

metadata: ursa.catalog.schemas.MetadataDict

‘Field(…)’

_validate_domain_intervals() → ursa.catalog.schemas.ModalityRow

_validate_ingestion_status_coherence() → ursa.catalog.schemas.ModalityRow

class ursa.catalog.schemas.EventRow(/, **data: typing.Any)

Bases: ursa.catalog.schemas.CatalogRow

System prompts, user responses, and any time-stamped event.

Primary key: event_id — any unique catalog ID.

event_time is in the recording’s native time domain; negative values are legal for re-aligned timelines or pre-roll events.

__primary_key__: ClassVar[tuple[str, ...]]

(‘event_id’,)

event_id: ursa.catalog.schemas.CatalogID

None

recording_hash: ursa.catalog.schemas.CatalogID

None

event_time: float

None

event_type: ursa.catalog.schemas.NonEmptyString

None

prompt: str | None

None

response: str | None

None

metadata: ursa.catalog.schemas.MetadataDict

‘Field(…)’

class ursa.catalog.schemas.VirgoAssetRow(/, **data: typing.Any)

Bases: ursa.catalog.schemas.CatalogRow

One row per Virgo output, with full provenance.

Primary key: asset_id — any unique catalog ID.

last_accessed_at is updated by Ursa’s query layer on every read and drives lifecycle GC (architecture doc §3.6).

__primary_key__: ClassVar[tuple[str, ...]]

(‘asset_id’,)

asset_id: ursa.catalog.schemas.CatalogID

None

recording_hash: ursa.catalog.schemas.CatalogID

None

pipeline_name: ursa.catalog.schemas.NonEmptyString

None

pipeline_version: ursa.catalog.schemas.NonEmptyString

None

cache_key: ursa.catalog.schemas.CatalogID

None

code_version: ursa.catalog.schemas.NonEmptyString

None

config_hash: ursa.catalog.schemas.CatalogID

None

time_window: ursa.catalog.schemas.TimeWindow

None

created_at: ursa.catalog.schemas.UTCDatetime

None

last_accessed_at: ursa.catalog.schemas.UTCDatetime

None

storage_uri: ursa.catalog.schemas.StorageURI

None

class ursa.catalog.schemas.CheckpointRow(/, **data: typing.Any)

Bases: ursa.catalog.schemas.CatalogRow

One row per Orion model checkpoint.

Primary key: checkpoint_id — any unique catalog ID.

run_id is an opaque ClearML task ID; runs are tracked in ClearML’s own database, not in the Ursa catalog. parent_checkpoint_id records the checkpoint this one was resumed from, enabling resume chains and full lineage traversal. The full list of recordings consumed lives at storage_uri/data_hashes/manifest.json.
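Lineage traversal over parent_checkpoint_id might look like the sketch below. It assumes the relevant rows have already been loaded into a dict mapping checkpoint_id to parent_checkpoint_id; the function name, the IDs, and the cycle guard are illustrative.

```python
from typing import Optional


def lineage(checkpoint_id: str, parents: dict[str, Optional[str]]) -> list[str]:
    """Return the resume chain from checkpoint_id back to its root."""
    chain: list[str] = []
    seen: set[str] = set()
    current: Optional[str] = checkpoint_id
    while current is not None:
        if current in seen:
            # Defensive guard: a well-formed catalog should never contain a cycle.
            raise ValueError(f"cycle detected at {current}")
        seen.add(current)
        chain.append(current)
        current = parents.get(current)
    return chain


# Hypothetical checkpoint IDs mapped to their parent_checkpoint_id.
parents = {"ckpt-a": None, "ckpt-b": "ckpt-a", "ckpt-c": "ckpt-b"}
chain = lineage("ckpt-c", parents)
```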

__primary_key__: ClassVar[tuple[str, ...]]

(‘checkpoint_id’,)

checkpoint_id: ursa.catalog.schemas.CatalogID

None

run_id: ursa.catalog.schemas.CatalogID

None

step: int

‘Field(…)’

model_id: ursa.catalog.schemas.NonEmptyString

None

code_version: ursa.catalog.schemas.NonEmptyString

None

storage_uri: ursa.catalog.schemas.StorageURI

None

created_at: ursa.catalog.schemas.UTCDatetime

None

parent_checkpoint_id: ursa.catalog.schemas.CatalogID | None

None

metadata: ursa.catalog.schemas.MetadataDict

‘Field(…)’

class ursa.catalog.schemas.BenchmarkSuiteRow(/, **data: typing.Any)

Bases: ursa.catalog.schemas.CatalogRow

One row per versioned benchmark suite configuration.

Primary key: composite (suite_name, suite_version), matching the identity used by BenchmarkResultRow as its FK. Suite configs are standalone: no FK to recordings or participants. storage_uri points to the held-out query spec and metric definitions on R2.

__primary_key__: ClassVar[tuple[str, ...]]

(‘suite_name’, ‘suite_version’)

suite_name: ursa.catalog.schemas.NonEmptyString

None

suite_version: int

‘Field(…)’

storage_uri: ursa.catalog.schemas.StorageURI

None

created_at: ursa.catalog.schemas.UTCDatetime

None

metadata: ursa.catalog.schemas.MetadataDict

‘Field(…)’

class ursa.catalog.schemas.BenchmarkResultRow(/, **data: typing.Any)

Bases: ursa.catalog.schemas.CatalogRow

One row per benchmark evaluation result.

Primary key: result_id — content-addressed hash of the six identity fields (suite_name, suite_version, checkpoint_id, dataset_hash, partial_subset, partial_seed). The six fields are stored as queryable columns so callers can look up results without pre-computing the hash.

dataset_hash is an opaque string; computation convention is a future feature. partial_subset ∈ (0, 1] and partial_seed distinguish full evals from in-training partial benchmarks (see architecture §5.7).
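One way a content-addressed result_id could be derived from the six identity fields is sketched below. The hashing scheme (SHA-256 over a compact JSON array) is an assumption made purely for illustration; this page does not specify the actual convention.

```python
import hashlib
import json
from typing import Optional


def result_id(
    suite_name: str,
    suite_version: int,
    checkpoint_id: str,
    dataset_hash: str,
    partial_subset: float,
    partial_seed: Optional[int],
) -> str:
    """Hash the six identity fields into a hex digest (assumed scheme)."""
    payload = json.dumps(
        [suite_name, suite_version, checkpoint_id, dataset_hash, partial_subset, partial_seed],
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical identity values: a full eval and a seeded 10% partial eval.
full_eval = result_id("retrieval", 1, "ckpt-c", "ds-01", 1.0, None)
partial_eval = result_id("retrieval", 1, "ckpt-c", "ds-01", 0.1, 7)
```

A hex digest contains only characters from [0-9a-f], so it satisfies the CatalogID pattern.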

__primary_key__: ClassVar[tuple[str, ...]]

(‘result_id’,)

result_id: ursa.catalog.schemas.CatalogID

None

suite_name: ursa.catalog.schemas.NonEmptyString

None

suite_version: int

‘Field(…)’

checkpoint_id: ursa.catalog.schemas.CatalogID

None

dataset_hash: ursa.catalog.schemas.CatalogID

None

partial_subset: float

‘Field(…)’

partial_seed: int | None

None

storage_uri: ursa.catalog.schemas.StorageURI

None

computed_at: ursa.catalog.schemas.UTCDatetime

None

metadata: ursa.catalog.schemas.MetadataDict

‘Field(…)’

class ursa.catalog.schemas.EmbeddingRow(/, **data: typing.Any)

Bases: ursa.catalog.schemas.CatalogRow

Vector embedding over a (recording, modality, window) tuple.

Primary key: embedding_id — any unique catalog ID.

vector is typed as Any for runtime performance; the runtime contract is a non-empty list[float] (no bools), enforced by _validate_vector_fast. Per-model fixed-dim enforcement lives in the Lance writer, materialized as FixedSizeList[float, dim] keyed by model_id.

__primary_key__: ClassVar[tuple[str, ...]]

(‘embedding_id’,)

embedding_id: ursa.catalog.schemas.CatalogID

None

source: ursa.catalog.schemas.EmbeddingSource

None

vector: ursa.catalog.schemas.Vector

None

model_id: ursa.catalog.schemas.NonEmptyString

None