# Architecture
The full Constellation Research Stack architecture is captured in the canonical Notion doc; this page mirrors the Ursa-specific section so the deployed docs site can stand on its own.
## Goals
- **Single API for all multimodal data.** EEG, multiple video feeds, eye tracking, biometrics, questionnaire responses, keyboard/mouse/screen captures. Researchers don’t think about per-modality file formats.
- **Temporal-first queries.** Load a 10 s window across modalities for a given recording without loading the full recording (sketched below).
- **Lazy / streaming by default.** Multi-day continuous recordings can’t fit in memory; the API supports iterators that yield aligned windows.
- **Rich filtering.** SQL-style on metadata, vector search over embeddings, and combined queries.
- **Cloud-agnostic storage.** R2 today, S3 tomorrow, Polaris-local SSD always.
- **Append-only / versioned.** Schema evolution is non-negotiable for a long-lived dataset.
- **Lifecycle controls.** Retention rules, automated GC, and Polaris cache sync, all configured with the same filter language.
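A minimal sketch of what these goals imply at the call site. Every name here (`ursa.open`, the `r2://` URI, `iter_windows`) is illustrative, not the shipped interface:

```python
import ursa  # hypothetical top-level package

ds = ursa.open("r2://constellation")  # storage backend resolved from the URI

# Temporal-first: load a 10 s window across modalities without
# reading the full recording.
rec = ds.recordings.get("a1b2c3")  # looked up by recording_hash
window = rec.load(start=120.0, end=130.0, modalities=["eeg", "pupil"])

# Lazy / streaming by default: iterate aligned windows over a
# multi-day recording without materializing it.
for chunk in rec.iter_windows(length=10.0, modalities=["eeg", "video_webcam"]):
    ...  # analyze or train on each aligned window
```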
## Data model
```text
Participant
├── many Recordings (recording_hash is the primary identifier)
│   ├── many Modalities (eeg, video_webcam, video_screen, pupil, biometrics, ...)
│   │   ├── Zarr or Lance backing store
│   │   └── Domain (start, end), sampling spec, channel/frame metadata
│   ├── many Events (system-issued prompts, user responses, time-stamped)
│   ├── many Derived assets (Virgo outputs)
│   └── flexible metadata (queryable key-value)
└── flexible metadata (queryable key-value)
```
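Navigating that hierarchy could look like the following, continuing the hypothetical `ds` handle from the sketch above (attribute names are assumptions):

```python
p = ds.participants.get("P017")   # hypothetical participant id
for rec in p.recordings:          # recordings are keyed by recording_hash
    eeg = rec.modalities["eeg"]   # one stream within the recording
    print(rec.recording_hash, eeg.domain.start, eeg.domain.end)
```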
### Catalog tables (Lance)
Six Lance tables, all Pydantic-schema-enforced:
- `participants`: one row per participant
- `recordings`: one row per recording (`recording_hash` is the primary key)
- `modalities`: one row per stream within a recording (eeg, video, pupil, …)
- `events`: system prompts + user responses + any time-stamped events
- `derived_assets`: Virgo outputs with full provenance
- `embeddings`: vector column for semantic search

A seventh, `benchmark_results`, is populated by Orion.
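As a rough illustration of the Pydantic enforcement, a `recordings` row model might look like this; every field other than `recording_hash` is an assumption, not the shipped schema:

```python
from datetime import datetime
from pydantic import BaseModel, Field

class Recording(BaseModel):
    """One row of the recordings table (illustrative sketch only)."""

    recording_hash: str  # primary key
    participant_id: str  # joins to the participants table
    started_at: datetime
    ended_at: datetime
    metadata: dict[str, str] = Field(default_factory=dict)  # queryable map column
```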
Map columns are first-class and queryable. Frequently used keys can be promoted to typed columns later via Lance schema evolution, without rewriting data.
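For instance, promoting a hot metadata key with pylance’s `add_columns` might look like this; the dataset path and the map-access expression are assumptions:

```python
import lance

recordings = lance.dataset("catalog/recordings.lance")

# add_columns evaluates a SQL expression against existing columns and
# appends the result as a new typed column, without rewriting old data.
recordings.add_columns({"device_model": "metadata['device_model']"})
```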
## Storage backends
Different modalities live in different stores; the API hides this from users.
- **Regular continuous streams** (EEG @ 1024 Hz, biometrics @ 100 Hz, etc.) → Zarr on R2: cloud-native chunked NDArrays that scale to multi-day recordings.
- **Video** → mp4 stays as mp4 on R2; Ursa writes a Lance index of `(frame_idx → byte_offset, timestamp)` so we can seek to a frame in O(1) (see the sketch after this list).
- **Irregular event streams** (clicks, keystrokes, prompts, responses, fixations) → Lance rows with timestamp columns.
- **Catalog and embeddings** → Lance tables with full-text + vector search.
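A sketch of the per-video frame index described above, using pylance; the column names mirror the tuple in the list, and the paths and values are made up:

```python
import lance
import pyarrow as pa

# Build the (frame_idx, byte_offset, timestamp) index for one video.
frames = pa.table({
    "frame_idx":   pa.array([0, 1, 2], pa.int64()),
    "byte_offset": pa.array([0, 48_213, 97_544], pa.int64()),
    "timestamp":   pa.array([0.000, 0.033, 0.067], pa.float64()),
})
lance.write_dataset(frames, "video_webcam.frames.lance")

# Seek: fetch one frame's byte offset, then range-read the mp4 on R2.
idx = lance.dataset("video_webcam.frames.lance")
row = idx.to_table(filter="frame_idx = 1", columns=["byte_offset"])
offset = row["byte_offset"][0].as_py()
```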
The user-facing data type is always `temporaldata.Data` / `RegularTimeSeries` / `IrregularTimeSeries` / `Interval`. Ursa adds Zarr- and Lance-backed implementations of those primitives so the rest of the stack doesn’t care which backend is in play.
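As a sketch of the primitive itself, assuming temporaldata’s documented constructor style; Ursa’s Zarr-backed variant would expose the same interface as this in-memory one:

```python
import numpy as np
from temporaldata import Interval, RegularTimeSeries

# 10 s of 64-channel EEG at 1024 Hz, held in memory purely for illustration.
eeg = RegularTimeSeries(
    data=np.random.randn(1024 * 10, 64),
    sampling_rate=1024.0,
    domain=Interval(np.array([0.0]), np.array([10.0])),
)
clip = eeg.slice(2.0, 4.0)  # 2 s window; callers never see Zarr vs. Lance
```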
## Lifecycle
A single filter language drives three operations:
- `ursa.lifecycle.gc(filters, dry_run)`: garbage collection (with Slack confirm over a configurable size threshold)
- `ursa.lifecycle.sync_polaris(filters, max_size_tb)`: what stays on the Polaris local SSD
- `ursa.lifecycle.backfill(pipeline, where)`: re-run a Virgo pipeline on stale recordings
Filters: `NotAccessedSince`, `OutdatedBy`, `SupersededBy`, `PinnedBy`, plus a generic `Filter(field, op, value)`.
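Putting it together: the function signatures and filter names below come from this section, while the import paths, the `&` combinator for composing filters, and the pipeline spec string are assumptions:

```python
from ursa.lifecycle import gc, sync_polaris            # import paths assumed
from ursa.filters import Filter, NotAccessedSince, SupersededBy

# Dry-run GC of assets that are stale and already superseded
# (the pipeline spec string is invented for the example).
stale = NotAccessedSince("90d") & SupersededBy("virgo/embeddings@v3")
gc(filters=stale, dry_run=True)

# Keep up to 2 TB of EEG recordings on the Polaris local SSD.
sync_polaris(filters=Filter("modality", "=", "eeg"), max_size_tb=2)
```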
## Phasing
See Linear for issue-level detail.