ursa.catalog.schemas¶
Pydantic schemas for the nine Lance catalog tables.
Design notes:
- extra="allow" is deliberate. A row read with the model from version N can still round-trip unknown columns added in version N+1, so additive Lance schema evolution stays a config change, not a code change. This is the opposite of the secrets models in constellation-utils (extra="forbid").
- frozen=True: declared fields are immutable once constructed. Catalog mutations go through writes of new rows. Note: __pydantic_extra__ is a regular dict and is not frozen. We treat that as a permitted escape hatch for ad-hoc extra hydration during ingestion, and lock the behavior in a test so a Pydantic upgrade can’t change it silently.
- All datetime fields use UTCDatetime: aware-only on input, normalized to UTC after validation. Tradeoff: the original tz offset is discarded; the rationale is fewer foot-guns when comparing timestamps across rows in the lifecycle layer (e.g. last_accessed_at GC).
- Map-style fields are constrained to scalar values (and lists of scalars). Lance/Arrow MapType requires a single value type per column, and the architecture promises “frequently-used keys can be promoted to typed columns later.” Both promises break with nested dicts or arbitrary objects.
- Time-origin convention: per-recording time fields (domain_intervals tuple bounds, event_time, TimeWindow.start_seconds) are in the recording’s native time domain. Negative values are legal for re-aligned timelines (stimulus-onset-relative events, pre-roll baseline). Only the relative invariant end > start is enforced per interval; intervals must additionally be sorted ascending and non-overlapping.
- ID fields use CatalogID: non-empty, [A-Za-z0-9_-]+. Stricter content-hash / UUID4 enforcement is deferred to a shared util / writer.
- EmbeddingRow.vector is the highest-cardinality hot path; per-element Pydantic validation of a 1024+-dim list[float] is ~4 ms/row. We type the field as Any and run a single BeforeValidator that checks length and v[0]. Static type degrades to Any; runtime contract is preserved.
Module Contents¶
Classes¶
StorageFormat | How a modality’s bytes are laid out in object storage. |
IngestionStatus | Per-modality ingestion lifecycle (architecture v0.4 two-store model). |
TimeWindow | Half-open [start, end) window in the recording’s native time domain. |
EmbeddingSource | Source of an embedding row: which (recording, modality, window) it covers. Same nested-struct caveat as TimeWindow. |
CatalogRow | Base for all Lance catalog table rows. |
ParticipantRow | One row per enrolled participant. |
RecordingRow | One row per recording. |
ModalityRow | One row per stream within a recording. |
EventRow | System prompts, user responses, and any time-stamped event. |
VirgoAssetRow | One row per Virgo output, with full provenance. |
CheckpointRow | One row per Orion model checkpoint. |
BenchmarkSuiteRow | One row per versioned benchmark suite configuration. |
BenchmarkResultRow | One row per benchmark evaluation result. |
EmbeddingRow | Vector embedding over a (recording, modality, window) tuple. |
Functions¶
_to_utc | Normalize an aware datetime to UTC. |
_check_homogeneous_lists | Reject mixed-type lists in a metadata map. |
_validate_vector_fast | Fast-path vector validator: skips per-element Pydantic checks (~4 ms/1024-dim). Accepts any sized indexable sequence (list, tuple, numpy.ndarray, torch.Tensor) without importing numpy/torch. Only checks v[0]; heterogeneous tails slip through and are caught by Arrow at write time. Raises ValueError (not TypeError) so Pydantic wraps as ValidationError. |
Data¶
API¶
- ursa.catalog.schemas.__all__¶
[‘BenchmarkResultRow’, ‘BenchmarkSuiteRow’, ‘CatalogID’, ‘CatalogRow’, ‘CheckpointRow’, ‘EmbeddingRo…
- ursa.catalog.schemas._to_utc(dt: datetime.datetime) datetime.datetime[source]¶
Normalize an aware datetime to UTC.
AwareDatetime rejects naive inputs upstream, so dt.tzinfo is guaranteed non-None here. The astimezone call is a no-op when input is already UTC.
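The normalization described above can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: the name to_utc is a hypothetical stand-in for _to_utc, and the explicit naive-input guard is an assumption added so the sketch is safe to call on its own (in the real module that rejection happens upstream).

```python
from datetime import datetime, timedelta, timezone


def to_utc(dt: datetime) -> datetime:
    """Normalize an aware datetime to UTC (hypothetical stand-in for _to_utc)."""
    if dt.tzinfo is None or dt.utcoffset() is None:
        # Assumption: the real module never sees naive values here because
        # they are rejected before this function runs.
        raise ValueError("naive datetime is not allowed")
    # Same instant, expressed in UTC; the original offset is discarded.
    return dt.astimezone(timezone.utc)


# 09:30 at UTC+02:00 is the same instant as 07:30 UTC.
local = datetime(2024, 5, 1, 9, 30, tzinfo=timezone(timedelta(hours=2)))
normalized = to_utc(local)
```

Because aware datetimes compare by instant, normalized still equals local; only the stored offset changes, which is the tradeoff the design notes accept.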
- ursa.catalog.schemas.UTCDatetime¶
None
- ursa.catalog.schemas.ScalarMetadataValue¶
None
- ursa.catalog.schemas.MetadataValue¶
None
- ursa.catalog.schemas._check_homogeneous_lists(value: dict[str, Any]) dict[str, Any][source]¶
Reject mixed-type lists in a metadata map.
MetadataValue already type-restricts list elements to scalars, but a list with mixed scalar types ([1, "two", True]) still passes. Lance MapType requires a single Arrow type per column, so we walk the values and reject heterogeneous lists at construction time. Bool is a subclass of int, so both collapse to the same bucket and [True, False, 0] isn’t flagged as mixed.
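A minimal sketch of the homogeneity check, assuming a plain dict input. The function and helper names are hypothetical, and how the real validator treats int/float mixes is not stated in the docstring; this sketch treats them as different types.

```python
from typing import Any


def _bucket(item: Any) -> type:
    # bool is a subclass of int, so both land in the int bucket.
    return int if isinstance(item, bool) else type(item)


def check_homogeneous_lists(value: dict[str, Any]) -> dict[str, Any]:
    """Reject lists whose elements fall into more than one type bucket."""
    for key, item in value.items():
        if isinstance(item, list):
            buckets = {_bucket(element) for element in item}
            if len(buckets) > 1:
                names = sorted(b.__name__ for b in buckets)
                raise ValueError(f"metadata[{key!r}] mixes types: {names}")
    return value


# Bools and ints share a bucket, so this map is accepted unchanged.
accepted = check_homogeneous_lists({"flags": [True, False, 0], "site": "lab-a"})
```

Raising ValueError keeps the sketch compatible with use inside a Pydantic validator, where it would surface as a validation error.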
- ursa.catalog.schemas.MetadataDict¶
None
- ursa.catalog.schemas.ID_PATTERN¶
‘^[A-Za-z0-9_-]+$’
- ursa.catalog.schemas.CatalogID¶
None
- ursa.catalog.schemas.URI_PATTERN¶
‘^(r2|s3|gcs|file)://\S+$’
- ursa.catalog.schemas.StorageURI¶
None
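The two patterns above can be exercised directly with the re module. This is only a sketch: the helper names is_catalog_id and is_storage_uri are hypothetical, and using re.fullmatch (so that a trailing newline is rejected) is an assumption about the intended strictness, not a statement about how the annotated types apply the patterns.

```python
import re

# Pattern strings copied from ID_PATTERN and URI_PATTERN above.
ID_PATTERN = r"^[A-Za-z0-9_-]+$"
URI_PATTERN = r"^(r2|s3|gcs|file)://\S+$"


def is_catalog_id(value: str) -> bool:
    """True if value is non-empty and uses only letters, digits, '_' and '-'."""
    return re.fullmatch(ID_PATTERN, value) is not None


def is_storage_uri(value: str) -> bool:
    """True if value has an allowed scheme followed by a non-empty, space-free path."""
    return re.fullmatch(URI_PATTERN, value) is not None
```

For example, is_catalog_id("p042") and is_storage_uri("s3://bucket/key") both hold, while an empty ID, an ID containing a space, or an https:// URI is rejected.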
- ursa.catalog.schemas.NonEmptyString¶
None
- ursa.catalog.schemas.ModalityName¶
None
- class ursa.catalog.schemas.StorageFormat[source]¶
Bases: enum.StrEnum
How a modality’s bytes are laid out in object storage.
Two tiers:
- Canonical: Ursa-managed, produced by ingestion or Virgo. These are the permanent storage formats that ursa.query reads directly.
- Raw (RAW_*): data-engine-native segment files, registered during Phase 1a ingestion before Virgo has processed them. ModalityRow entries with raw formats have storage_uri pointing at the data-engine raw prefix.
Virgo promotes raw modalities to canonical formats; old raw rows are retired by lifecycle GC. See ursa.layout for key conventions.
Initialization
Initialize self. See help(type(self)) for accurate signature.
- ZARR¶
‘zarr’
- LANCE¶
‘lance’
- MP4_INDEX¶
‘mp4_index’
- PARQUET¶
‘parquet’
- RAW_BINARY¶
‘raw_binary’
- RAW_CSV¶
‘raw_csv’
- RAW_JSONL¶
‘raw_jsonl’
- RAW_AUDIO¶
‘raw_audio’
- RAW_VIDEO¶
‘raw_video’
- class ursa.catalog.schemas.IngestionStatus[source]¶
Bases: enum.StrEnum
Per-modality ingestion lifecycle (architecture v0.4 two-store model).
- raw: modality registered against a cold-bucket raw file (whole object addressable via ModalityRow.raw_storage_uri); Virgo’s ingestion node has not yet run, so domain_intervals, channel_spec, and format may be null. storage_uri mirrors raw_storage_uri until ingestion completes.
- processed: Virgo’s ingestion node has converted the raw file to a canonical format (Zarr / MP4 + Lance frame index). format is a canonical (non-RAW_*) value, domain_intervals and channel_spec are populated, and storage_uri points at the processed object on the hot bucket. raw_storage_uri is preserved so re-ingestion is always possible.
Initialization
Initialize self. See help(type(self)) for accurate signature.
- RAW¶
‘raw’
- PROCESSED¶
‘processed’
- class ursa.catalog.schemas.TimeWindow(/, **data: typing.Any)[source]¶
Bases: pydantic.BaseModel
Half-open [start, end) window in the recording’s native time domain.
No absolute bound on start_seconds: re-aligned timelines (stimulus onset, pre-roll baseline) legitimately produce negative times. Only the relative invariant end > start is enforced.
Note (deferred benchmark): nested struct columns in Lance have weaker filter pushdown than top-level columns. If query latency on VirgoAssetRow.time_window or EmbeddingRow.source.time_window becomes a bottleneck, the writer can flatten internally and treat this Pydantic type as a logical view.
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- model_config¶
‘ConfigDict(…)’
- start_seconds: float¶
None
- end_seconds: float¶
None
- _end_after_start() ursa.catalog.schemas.TimeWindow[source]¶
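The half-open window and its single invariant can be illustrated with a frozen dataclass. This is a sketch under stated assumptions: Window is a hypothetical stand-in for the Pydantic model, and the contains helper is not part of the documented API; it is included only to show what half-open means in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Stand-in for TimeWindow: half-open [start, end) in native recording time."""

    start_seconds: float
    end_seconds: float

    def __post_init__(self) -> None:
        # Only the relative invariant is checked; negative starts are legal.
        if not self.end_seconds > self.start_seconds:
            raise ValueError("end_seconds must be greater than start_seconds")

    def contains(self, t: float) -> bool:
        # Hypothetical helper: the start is included, the end is excluded.
        return self.start_seconds <= t < self.end_seconds


# A pre-roll baseline window on a stimulus-aligned timeline.
baseline = Window(start_seconds=-2.0, end_seconds=0.5)
```

With this definition baseline.contains(-2.0) is true and baseline.contains(0.5) is false, so adjacent windows that share a boundary never both claim the same instant.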
- class ursa.catalog.schemas.EmbeddingSource(/, **data: typing.Any)[source]¶
Bases: pydantic.BaseModel
Source of an embedding row: which (recording, modality, window) it covers. Same nested-struct caveat as TimeWindow.
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- model_config¶
‘ConfigDict(…)’
- recording_hash: ursa.catalog.schemas.CatalogID¶
None
- modality: ursa.catalog.schemas.ModalityName¶
None
- time_window: ursa.catalog.schemas.TimeWindow¶
None
- ursa.catalog.schemas._validate_vector_fast(v: Any) list[float][source]¶
Fast-path vector validator: skips per-element Pydantic checks (~4 ms/1024-dim). Accepts any sized indexable sequence — list, tuple, numpy.ndarray, torch.Tensor — without importing numpy/torch. Only checks v[0]; heterogeneous tails slip through and are caught by Arrow at write time. Raises ValueError (not TypeError) so Pydantic wraps as ValidationError.
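A minimal sketch of the fast-path idea: check the length and the first element only, and leave everything else to the writer. The name validate_vector_fast is hypothetical, and several details are assumptions rather than documented behavior: the use of numbers.Real to recognize numeric scalars, the acceptance of ints, and returning the input unchanged instead of converting it to list[float].

```python
from numbers import Real
from typing import Any


def validate_vector_fast(v: Any) -> Any:
    """O(1) sanity check: sized, non-empty, first element a non-bool real number."""
    try:
        n = len(v)
    except TypeError:
        # ValueError (not TypeError) so a Pydantic validator would wrap it.
        raise ValueError("vector must be a sized sequence") from None
    if n == 0:
        raise ValueError("vector must be non-empty")
    try:
        first = v[0]
    except (TypeError, KeyError, IndexError):
        raise ValueError("vector must be indexable by position") from None
    if isinstance(first, bool) or not isinstance(first, Real):
        raise ValueError("vector elements must be real numbers, not bool")
    # Assumption: no conversion here; elements after v[0] are never inspected.
    return v


checked = validate_vector_fast([0.1, 0.2, 0.3])
```

Note that a list such as [0.1, "x"] passes this check, which matches the documented caveat that heterogeneous tails are only caught at write time.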
- ursa.catalog.schemas.Vector¶
None
- class ursa.catalog.schemas.CatalogRow(/, **data: typing.Any)[source]¶
Bases: pydantic.BaseModel
Base for all Lance catalog table rows.
Subclasses MUST declare __primary_key__ as a non-empty tuple of field names that identify the row uniquely within its table. Failure raises TypeError at class-definition time. The future Lance writer reads this attribute to enforce uniqueness.
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- model_config¶
‘ConfigDict(…)’
- __primary_key__: ClassVar[tuple[str, ...]]¶
()
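One way to get a TypeError at class-definition time is the __init_subclass__ hook. The sketch below uses plain Python classes; Row, Participant, Modality and declares_key are hypothetical names, and whether CatalogRow uses this hook or a Pydantic-specific one is not stated in the documentation.

```python
from typing import ClassVar


class Row:
    """Plain-Python stand-in for CatalogRow (the real base is a Pydantic model)."""

    __primary_key__: ClassVar[tuple[str, ...]] = ()

    def __init_subclass__(cls, **kwargs: object) -> None:
        super().__init_subclass__(**kwargs)
        pk = getattr(cls, "__primary_key__", ())
        ok = isinstance(pk, tuple) and bool(pk) and all(
            isinstance(name, str) and name for name in pk
        )
        if not ok:
            # Fails while the class statement executes, before any row exists.
            raise TypeError(
                f"{cls.__name__} must declare a non-empty __primary_key__ tuple"
            )


class Participant(Row):
    __primary_key__ = ("participant_id",)


class Modality(Row):
    __primary_key__ = ("recording_hash", "modality")


def declares_key(namespace: dict) -> bool:
    """Try to build a subclass from a namespace; report whether it was accepted."""
    try:
        type("Probe", (Row,), namespace)
    except TypeError:
        return False
    return True
```

The sketch only checks the shape of the tuple; verifying that each name is an actual field of the row would be a natural extension.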
- class ursa.catalog.schemas.ParticipantRow(/, **data: typing.Any)[source]¶
Bases: ursa.catalog.schemas.CatalogRow
One row per enrolled participant.
Primary key: participant_id, any unique catalog ID. By convention a short slug like p042; not enforced.
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- __primary_key__: ClassVar[tuple[str, ...]]¶
(‘participant_id’,)
- participant_id: ursa.catalog.schemas.CatalogID¶
None
- enrolled_at: ursa.catalog.schemas.UTCDatetime¶
None
- metadata: ursa.catalog.schemas.MetadataDict¶
‘Field(…)’
- class ursa.catalog.schemas.RecordingRow(/, **data: typing.Any)[source]¶
Bases: ursa.catalog.schemas.CatalogRow
One row per recording.
Primary key: recording_hash, any unique catalog ID; content-hash convention enforced later via a shared util.
participant_ids is a list (architecture v0.4): a recording can cover multiple participants (multi-subject experiments, dyad sessions, crowd recordings). Single-participant recordings carry a one-element list. The list must be non-empty.
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- __primary_key__: ClassVar[tuple[str, ...]]¶
(‘recording_hash’,)
- recording_hash: ursa.catalog.schemas.CatalogID¶
None
- participant_ids: list[ursa.catalog.schemas.CatalogID]¶
‘Field(…)’
- start_time: ursa.catalog.schemas.UTCDatetime¶
None
- duration: datetime.timedelta¶
‘Field(…)’
- device_info: ursa.catalog.schemas.MetadataDict¶
‘Field(…)’
- metadata: ursa.catalog.schemas.MetadataDict¶
‘Field(…)’
- class ursa.catalog.schemas.ModalityRow(/, **data: typing.Any)[source]¶
Bases: ursa.catalog.schemas.CatalogRow
One row per stream within a recording.
Primary key: composite (recording_hash, modality). The Lance writer will enforce uniqueness; this row schema only declares the contract.
Architecture v0.4 splits a modality’s lifecycle into two states tracked by ingestion_status:
- raw: registered against a cold-bucket raw file. format may be a RAW_* value or null (unknown at registration time); domain_intervals and channel_spec are typically null. storage_uri mirrors raw_storage_uri (the immutable cold-bucket pointer).
- processed: Virgo’s ingestion node has converted the raw file to a canonical format. format must be a non-RAW_* value; domain_intervals and channel_spec must be populated; storage_uri points at the processed object on the hot bucket. raw_storage_uri is preserved across the transition so re-ingestion is always possible.
Field notes:
- storage_uri always points to the current authoritative object (raw URI initially, swapped to Zarr/MP4 after Virgo ingestion).
- raw_storage_uri is the permanent cold-bucket pointer.
- domain_intervals is a list of (start, end) tuples in the recording’s native time domain, handling non-continuous recordings, irregular series, and gaps. Each interval enforces end > start, and intervals must be sorted ascending and non-overlapping. Null while ingestion_status="raw" (Virgo populates it during ingestion using temporaldata domain-computation utilities).
- channel_spec uses MetadataDict: per-channel structured metadata (e.g. polarity, reference) should be encoded as parallel lists ({"channel_names": [...], "polarity": [...], "reference": [...]}), not list-of-dicts; that’s required for Lance Map queryability.
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- __primary_key__: ClassVar[tuple[str, ...]]¶
(‘recording_hash’, ‘modality’)
- recording_hash: ursa.catalog.schemas.CatalogID¶
None
- modality: ursa.catalog.schemas.ModalityName¶
None
- ingestion_status: ursa.catalog.schemas.IngestionStatus¶
None
- storage_uri: ursa.catalog.schemas.StorageURI¶
None
- raw_storage_uri: ursa.catalog.schemas.StorageURI¶
None
- format: ursa.catalog.schemas.StorageFormat | None¶
None
- sampling_rate: float | None¶
‘Field(…)’
- domain_intervals: list[tuple[float, float]] | None¶
None
- channel_spec: ursa.catalog.schemas.MetadataDict | None¶
None
- metadata: ursa.catalog.schemas.MetadataDict¶
‘Field(…)’
- _validate_domain_intervals() ursa.catalog.schemas.ModalityRow[source]¶
- _validate_ingestion_status_coherence() ursa.catalog.schemas.ModalityRow[source]¶
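The two validators listed above have no rendered docstrings, but the rules they are described as enforcing can be sketched as plain functions. Everything below is illustrative: the function names and the RAW_FORMATS set are hypothetical, intervals that merely touch (start equal to the previous end) are assumed to be allowed, and no extra constraint is placed on raw rows.

```python
from typing import Optional, Sequence

# Assumption: the RAW_* members of StorageFormat, written as their string values.
RAW_FORMATS = frozenset(
    {"raw_binary", "raw_csv", "raw_jsonl", "raw_audio", "raw_video"}
)


def validate_domain_intervals(
    intervals: Sequence[tuple[float, float]],
) -> list[tuple[float, float]]:
    """Check end > start per interval, ascending order, and no overlap."""
    checked: list[tuple[float, float]] = []
    prev_end: Optional[float] = None
    for start, end in intervals:
        if not end > start:
            raise ValueError(f"interval ({start}, {end}) must satisfy end > start")
        # With end > start guaranteed, start >= prev_end covers both
        # "sorted ascending" and "non-overlapping" in a single comparison.
        if prev_end is not None and start < prev_end:
            raise ValueError("intervals must be sorted and non-overlapping")
        checked.append((start, end))
        prev_end = end
    return checked


def check_status_coherence(
    status: str,
    fmt: Optional[str],
    domain_intervals: Optional[list],
    channel_spec: Optional[dict],
) -> None:
    """Processed rows need a canonical format and populated metadata."""
    if status != "processed":
        return  # raw rows: format, intervals and channel spec may all be null
    if fmt is None or fmt in RAW_FORMATS:
        raise ValueError("processed rows need a canonical (non-RAW_*) format")
    if domain_intervals is None or channel_spec is None:
        raise ValueError("processed rows need domain_intervals and channel_spec")


gapped = validate_domain_intervals([(-1.5, 0.0), (0.0, 2.0), (3.0, 4.5)])
```

Negative bounds pass because only the relative ordering is checked, consistent with the time-origin convention in the design notes.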
- class ursa.catalog.schemas.EventRow(/, **data: typing.Any)[source]¶
Bases: ursa.catalog.schemas.CatalogRow
System prompts, user responses, and any time-stamped event.
Primary key: event_id, any unique catalog ID.
event_time is in the recording’s native time domain; negative values are legal for re-aligned timelines or pre-roll events.
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- __primary_key__: ClassVar[tuple[str, ...]]¶
(‘event_id’,)
- event_id: ursa.catalog.schemas.CatalogID¶
None
- recording_hash: ursa.catalog.schemas.CatalogID¶
None
- event_time: float¶
None
- event_type: ursa.catalog.schemas.NonEmptyString¶
None
- prompt: str | None¶
None
- response: str | None¶
None
- metadata: ursa.catalog.schemas.MetadataDict¶
‘Field(…)’
- class ursa.catalog.schemas.VirgoAssetRow(/, **data: typing.Any)[source]¶
Bases: ursa.catalog.schemas.CatalogRow
One row per Virgo output, with full provenance.
Primary key: asset_id, any unique catalog ID.
last_accessed_at is updated by Ursa’s query layer on every read and drives lifecycle GC (architecture doc §3.6).
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- __primary_key__: ClassVar[tuple[str, ...]]¶
(‘asset_id’,)
- asset_id: ursa.catalog.schemas.CatalogID¶
None
- recording_hash: ursa.catalog.schemas.CatalogID¶
None
- pipeline_name: ursa.catalog.schemas.NonEmptyString¶
None
- pipeline_version: ursa.catalog.schemas.NonEmptyString¶
None
- cache_key: ursa.catalog.schemas.CatalogID¶
None
- code_version: ursa.catalog.schemas.NonEmptyString¶
None
- config_hash: ursa.catalog.schemas.CatalogID¶
None
- time_window: ursa.catalog.schemas.TimeWindow¶
None
- created_at: ursa.catalog.schemas.UTCDatetime¶
None
- last_accessed_at: ursa.catalog.schemas.UTCDatetime¶
None
- storage_uri: ursa.catalog.schemas.StorageURI¶
None
- class ursa.catalog.schemas.CheckpointRow(/, **data: typing.Any)[source]¶
Bases: ursa.catalog.schemas.CatalogRow
One row per Orion model checkpoint.
Primary key: checkpoint_id, any unique catalog ID.
run_id is an opaque ClearML task ID; runs are tracked in ClearML’s own database, not in the Ursa catalog.
parent_checkpoint_id records the checkpoint this one was resumed from, enabling resume chains and full lineage traversal. The full list of recordings consumed lives at storage_uri/data_hashes/manifest.json.
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- __primary_key__: ClassVar[tuple[str, ...]]¶
(‘checkpoint_id’,)
- checkpoint_id: ursa.catalog.schemas.CatalogID¶
None
- run_id: ursa.catalog.schemas.CatalogID¶
None
- step: int¶
‘Field(…)’
- model_id: ursa.catalog.schemas.NonEmptyString¶
None
- code_version: ursa.catalog.schemas.NonEmptyString¶
None
- storage_uri: ursa.catalog.schemas.StorageURI¶
None
- created_at: ursa.catalog.schemas.UTCDatetime¶
None
- parent_checkpoint_id: ursa.catalog.schemas.CatalogID | None¶
None
- metadata: ursa.catalog.schemas.MetadataDict¶
‘Field(…)’
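As an illustration of the lineage traversal that parent_checkpoint_id enables, the sketch below walks parent pointers held in an in-memory dict. The lineage function and the dict-based lookup are hypothetical; the catalog itself would supply these pointers through a query, which is not modeled here.

```python
from typing import Optional


def lineage(checkpoint_id: str, parents: dict[str, Optional[str]]) -> list[str]:
    """Return the resume chain from checkpoint_id back to its root."""
    chain: list[str] = []
    seen: set[str] = set()
    current: Optional[str] = checkpoint_id
    while current is not None:
        if current in seen:
            # Defensive guard: a cycle would mean corrupted lineage data.
            raise ValueError(f"cycle detected at {current!r}")
        seen.add(current)
        chain.append(current)
        # Unknown IDs end the walk the same way a null parent does.
        current = parents.get(current)
    return chain


# Hypothetical data: ckpt-3 resumed from ckpt-2, which resumed from ckpt-1.
parents = {"ckpt-3": "ckpt-2", "ckpt-2": "ckpt-1", "ckpt-1": None}
chain = lineage("ckpt-3", parents)
```

The result lists the newest checkpoint first, so the last element is the checkpoint that started the resume chain.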
- class ursa.catalog.schemas.BenchmarkSuiteRow(/, **data: typing.Any)[source]¶
Bases: ursa.catalog.schemas.CatalogRow
One row per versioned benchmark suite configuration.
Primary key: composite (suite_name, suite_version), matching the identity used by BenchmarkResultRow as its FK. Suite configs are standalone: no FK to recordings or participants.
storage_uri points to the held-out query spec and metric definitions on R2.
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- __primary_key__: ClassVar[tuple[str, ...]]¶
(‘suite_name’, ‘suite_version’)
- suite_name: ursa.catalog.schemas.NonEmptyString¶
None
- suite_version: int¶
‘Field(…)’
- storage_uri: ursa.catalog.schemas.StorageURI¶
None
- created_at: ursa.catalog.schemas.UTCDatetime¶
None
- metadata: ursa.catalog.schemas.MetadataDict¶
‘Field(…)’
- class ursa.catalog.schemas.BenchmarkResultRow(/, **data: typing.Any)[source]¶
Bases: ursa.catalog.schemas.CatalogRow
One row per benchmark evaluation result.
Primary key: result_id, a content-addressed hash of the six identity fields (suite_name, suite_version, checkpoint_id, dataset_hash, partial_subset, partial_seed). The six fields are stored as queryable columns so callers can look up results without pre-computing the hash.
dataset_hash is an opaque string; computation convention is a future feature.
partial_subset ∈ (0, 1] and partial_seed distinguish full evals from in-training partial benchmarks (see architecture §5.7).
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- __primary_key__: ClassVar[tuple[str, ...]]¶
(‘result_id’,)
- result_id: ursa.catalog.schemas.CatalogID¶
None
- suite_name: ursa.catalog.schemas.NonEmptyString¶
None
- suite_version: int¶
‘Field(…)’
- checkpoint_id: ursa.catalog.schemas.CatalogID¶
None
- dataset_hash: ursa.catalog.schemas.CatalogID¶
None
- partial_subset: float¶
‘Field(…)’
- partial_seed: int | None¶
None
- storage_uri: ursa.catalog.schemas.StorageURI¶
None
- computed_at: ursa.catalog.schemas.UTCDatetime¶
None
- metadata: ursa.catalog.schemas.MetadataDict¶
‘Field(…)’
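The documentation does not specify how result_id is derived from the six identity fields. One plausible convention, shown purely as an assumption, is a SHA-256 digest over a canonical JSON encoding of the fields in a fixed order; the function name and the encoding below are hypothetical.

```python
import hashlib
import json
from typing import Optional


def compute_result_id(
    suite_name: str,
    suite_version: int,
    checkpoint_id: str,
    dataset_hash: str,
    partial_subset: float,
    partial_seed: Optional[int],
) -> str:
    """Hash the six identity fields into a hex digest (assumed convention)."""
    # A fixed field order and compact separators keep the encoding stable.
    payload = json.dumps(
        [suite_name, suite_version, checkpoint_id, dataset_hash,
         partial_subset, partial_seed],
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical values for a full evaluation (partial_subset = 1.0, no seed).
full_eval = compute_result_id("retrieval", 3, "ckpt-0042", "d41d8c", 1.0, None)
partial_eval = compute_result_id("retrieval", 3, "ckpt-0042", "d41d8c", 0.1, 7)
```

A lowercase hex digest uses only characters permitted by the CatalogID pattern, so an ID produced this way would pass the ID validation described earlier.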
- class ursa.catalog.schemas.EmbeddingRow(/, **data: typing.Any)[source]¶
Bases: ursa.catalog.schemas.CatalogRow
Vector embedding over a (recording, modality, window) tuple.
Primary key: embedding_id, any unique catalog ID.
vector is typed as Any for runtime performance; the runtime contract is a non-empty list[float] (no bools), enforced by _validate_vector_fast. Per-model fixed-dim enforcement lives in the Lance writer, materialized as FixedSizeList[float, dim] keyed by model_id.
Initialization
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- __primary_key__: ClassVar[tuple[str, ...]]¶
(‘embedding_id’,)
- embedding_id: ursa.catalog.schemas.CatalogID¶
None
- source: ursa.catalog.schemas.EmbeddingSource¶
None
- vector: ursa.catalog.schemas.Vector¶
None
- model_id: ursa.catalog.schemas.NonEmptyString¶
None