Architecture

The full Constellation Research Stack architecture is captured in the canonical Notion doc.

This page mirrors the Ursa-specific section so the deployed docs site can stand on its own.

Goals

  1. Single API for all multimodal data. EEG, multiple video feeds, eye tracking, biometrics, questionnaire responses, keyboard/mouse/screen captures. Researchers shouldn’t have to think about per-modality file formats.

  2. Temporal-first queries. Load a 10s window across modalities for a given recording without loading the full recording.

  3. Lazy / streaming by default. Multi-day continuous recordings can’t fit in memory; the API supports iterators that yield aligned windows (see the sketch after this list).

  4. Rich filtering. SQL-style filters on metadata; vector search over embeddings; combined queries.

  5. Cloud-agnostic storage. R2 today, S3 tomorrow, Polaris-local SSD always.

  6. Append-only / versioned. Schema evolution is non-negotiable for a long-lived dataset.

  7. Lifecycle controls. Retention rules, automated GC, Polaris cache sync — all configured by the same filter language.
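
A hedged sketch of what goals 1–3 imply for the API surface. Every name here (open_recording, slice, iter_windows, their keyword arguments) is an assumption about the interface, not a confirmed one:

    # Illustrative only: function and argument names are assumptions,
    # not Ursa's confirmed API.
    import ursa

    rec = ursa.open_recording("abc123")          # one handle, all modalities (goal 1)
    clip = rec.slice(start=120.0, end=130.0)     # 10 s window, loaded lazily (goal 2)

    # Stream aligned windows over a multi-day recording without materializing it (goal 3).
    for window in rec.iter_windows(length_s=10.0, modalities=["eeg", "pupil"]):
        ...  # each window arrives time-aligned across the requested modalities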

Data model

Participant
   ├── many Recordings (recording_hash is the primary identifier)
   │     ├── many Modalities (eeg, video_webcam, video_screen, pupil, biometrics, ...)
   │     │     ├── Zarr or Lance backing store
   │     │     └── Domain (start, end), sampling spec, channel/frame metadata
   │     ├── many Events (system-issued prompts, user responses, time-stamped)
   │     ├── many Derived assets (Virgo outputs)
   │     └── flexible metadata (queryable key-value)
   └── flexible metadata (queryable key-value)

Catalog tables (Lance)

Six Lance tables, all Pydantic-schema-enforced:

  • participants — one row per participant

  • recordings — one row per recording (recording_hash is the primary key)

  • modalities — one row per stream within a recording (eeg, video, pupil, …)

  • events — system prompts + user responses + any time-stamped events

  • derived_assets — Virgo outputs with full provenance

  • embeddings — vector column for semantic search

A seventh table, benchmark_results, is populated by Orion.
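
As one example of Pydantic enforcement, LanceDB can derive a table schema directly from a Pydantic model. In the sketch below, every field name other than recording_hash is an assumption:

    import lancedb
    from lancedb.pydantic import LanceModel

    # Illustrative schema for the `recordings` table; fields other than
    # recording_hash are assumptions, not the real catalog schema.
    class Recording(LanceModel):
        recording_hash: str    # primary key
        participant_id: str
        started_at_ns: int     # epoch nanoseconds
        duration_s: float

    db = lancedb.connect("catalog/")  # local path, or an object-store URI
    tbl = db.create_table("recordings", schema=Recording, exist_ok=True)
    tbl.add([{"recording_hash": "abc123", "participant_id": "p01",
              "started_at_ns": 0, "duration_s": 3600.0}])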

Map columns are queryable as first-class fields. Frequently used keys can be promoted to typed columns later via Lance schema evolution, without rewriting existing data.
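
Combined SQL-style plus vector filtering (goal 4) then looks roughly like the following, reusing the db handle from the sketch above. The `modality` column name and the query dimensionality are assumptions:

    # Hedged sketch: ANN search over the embeddings table with a SQL-style prefilter.
    query_vector = [0.0] * 768        # stand-in; real queries come from an encoder
    emb = db.open_table("embeddings")
    hits = (
        emb.search(query_vector)
           .where("modality = 'video_webcam'")
           .limit(10)
           .to_list()
    )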

Storage backends

Different modalities live in different stores; the API hides this from users.

  • Regular continuous streams (EEG @ 1024 Hz, biometrics @ 100 Hz, etc.) → Zarr on R2. Cloud-native chunked n-dimensional arrays; scales to multi-day recordings (see the read sketch after this list).

  • Video → mp4 stays as mp4 on R2; Ursa writes a Lance index of (frame_idx, byte_offset, timestamp) so we can seek to a frame in O(1) (seek sketch after this list).

  • Irregular event streams (clicks, keystrokes, prompts, responses, fixations) → Lance rows with timestamp columns.

  • Catalog and embeddings → Lance tables with full-text + vector search.
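
For the Zarr-on-R2 path, reading a window reduces to slicing a chunked array over an S3-compatible store. A minimal sketch using s3fs; the endpoint, bucket, and paths are placeholders:

    import s3fs
    import zarr

    # R2 is S3-compatible; endpoint and object paths below are placeholders.
    fs = s3fs.S3FileSystem(endpoint_url="https://<account>.r2.cloudflarestorage.com")
    store = s3fs.S3Map(root="bucket/recordings/abc123/eeg.zarr", s3=fs)
    eeg = zarr.open(store, mode="r")          # shape ~ (n_samples, n_channels)

    fs_hz = 1024
    window = eeg[120 * fs_hz : 130 * fs_hz]   # seconds 120-130; only those chunks are fetched

For video, the Lance index makes seeks cheap: a frame_idx lookup is O(1), and a timestamp lookup is a binary search over the index followed by a byte-range request into the mp4. A hypothetical helper over the index columns:

    import numpy as np

    def seek_offset(timestamps: np.ndarray, byte_offsets: np.ndarray, t: float) -> int:
        """Byte offset of the last frame at or before time t.

        `timestamps` and `byte_offsets` are the (sorted) columns of the
        Lance frame index described above.
        """
        i = int(np.searchsorted(timestamps, t, side="right")) - 1
        return int(byte_offsets[max(i, 0)])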

The user-facing data type is always temporaldata.Data / RegularTimeSeries / IrregularTimeSeries / Interval. Ursa adds Zarr- and Lance-backed implementations of those primitives so the rest of the stack doesn’t care which backend is in play.
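
A sketch of what that looks like with temporaldata’s primitives; the constructor details are a best guess at the library’s API and may differ from the release Ursa targets:

    import numpy as np
    from temporaldata import Data, Interval, IrregularTimeSeries, RegularTimeSeries

    # Constructor keywords are assumptions about temporaldata's API.
    domain = Interval(0.0, 10.0)
    eeg = RegularTimeSeries(
        raw=np.zeros((1024 * 10, 64)),   # 10 s of 64-channel EEG at 1024 Hz
        sampling_rate=1024.0,
        domain=domain,
    )
    clicks = IrregularTimeSeries(timestamps=np.array([0.5, 3.2, 7.8]), domain=domain)
    rec = Data(eeg=eeg, clicks=clicks, domain=domain)

    window = rec.slice(2.0, 4.0)         # aligned 2 s window across both streams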

Lifecycle

A single filter language drives three operations:

  • ursa.lifecycle.gc(filters, dry_run) — garbage collection (with Slack confirmation when deletions exceed a configurable size threshold)

  • ursa.lifecycle.sync_polaris(filters, max_size_tb) — what stays on the Polaris local SSD

  • ursa.lifecycle.backfill(pipeline, where) — re-run a Virgo pipeline on stale recordings

Filters: NotAccessedSince, OutdatedBy, SupersededBy, PinnedBy, plus a generic Filter(field, op, value).
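
Putting the three operations together. The entry points and their parameters come from the list above; the filter constructor arguments, the ursa.filters module path, and the pipeline name are assumptions:

    from ursa.filters import Filter, NotAccessedSince, OutdatedBy
    from ursa.lifecycle import backfill, gc, sync_polaris

    # Dry-run GC: derived assets untouched for 90 days and over 1 GiB.
    # (Filter constructor arguments are assumptions, not confirmed signatures.)
    gc(
        filters=[NotAccessedSince(days=90), Filter("size_bytes", ">", 1 << 30)],
        dry_run=True,
    )

    # Keep at most 4 TB of EEG hot on the Polaris SSD.
    sync_polaris(filters=[Filter("modality", "=", "eeg")], max_size_tb=4)

    # Re-run a Virgo pipeline (name is hypothetical) wherever its outputs are stale.
    backfill(pipeline="pupil_features", where=OutdatedBy("pupil_features>=2.0"))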

Phasing

See Linear for issue-level detail.