Architecture

The full Constellation Research Stack architecture is captured in the canonical Notion doc.

This page mirrors the Ursa-specific section so the deployed docs site can stand on its own.

Goals

  1. Single API for all multimodal data. EEG, multiple video feeds, eye tracking, biometrics, questionnaire responses, keyboard/mouse/screen captures. Researchers shouldn’t have to think about per-modality file formats.

  2. Temporal-first queries. Load a 10s window across modalities for a given recording without loading the full recording.

  3. Lazy / streaming by default. Multi-day continuous recordings can’t fit in memory; the API supports iterators that yield aligned windows (see the sketch after this list).

  4. Rich filtering. SQL-style filters on metadata; vector search over embeddings; combined queries.

  5. Cloud-agnostic storage. R2 today, S3 tomorrow, Polaris-local SSD always.

  6. Append-only / versioned. Schema evolution is non-negotiable for a long-lived dataset.

  7. Lifecycle controls. Retention rules, automated GC, Polaris cache sync — all configured by the same filter language.
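
A hedged sketch of what goals 1–3 imply for the API surface. Every name here (open_recording, slice, iter_windows, their keyword arguments) is an assumption about the interface, not a confirmed one:

    # Illustrative only: function and argument names are assumptions,
    # not Ursa's confirmed API.
    import ursa

    rec = ursa.open_recording("abc123")          # one handle, all modalities (goal 1)
    clip = rec.slice(start=120.0, end=130.0)     # 10 s window, loaded lazily (goal 2)

    # Stream aligned windows over a multi-day recording without materializing it (goal 3).
    for window in rec.iter_windows(length_s=10.0, modalities=["eeg", "pupil"]):
        ...  # each window arrives time-aligned across the requested modalities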

Data model

Participant
   ├── many Recordings (recording_hash is the primary identifier)
   │     ├── many Modalities (eeg, video_webcam, video_screen, pupil, biometrics, ...)
   │     │     ├── Zarr or Lance backing store
   │     │     └── Domain (start, end), sampling spec, channel/frame metadata
   │     ├── many Events (system-issued prompts, user responses, time-stamped)
   │     ├── many Derived assets (Virgo outputs)
   │     └── flexible metadata (queryable key-value)
   └── flexible metadata (queryable key-value)

Catalog tables (Lance)

Six Lance tables, all Pydantic-schema-enforced:

  • participants — one row per participant

  • recordings — one row per recording (recording_hash is the primary key)

  • modalities — one row per stream within a recording (eeg, video, pupil, …)

  • events — system prompts + user responses + any time-stamped events

  • derived_assets — Virgo outputs with full provenance

  • embeddings — vector column for semantic search

A seventh table, benchmark_results, is populated by Orion.
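
As one example of Pydantic enforcement, LanceDB can derive a table schema directly from a Pydantic model. In the sketch below, every field name other than recording_hash is an assumption:

    import lancedb
    from lancedb.pydantic import LanceModel

    # Illustrative schema for the `recordings` table; fields other than
    # recording_hash are assumptions, not the real catalog schema.
    class Recording(LanceModel):
        recording_hash: str    # primary key
        participant_id: str
        started_at_ns: int     # epoch nanoseconds
        duration_s: float

    db = lancedb.connect("catalog/")  # local path, or an object-store URI
    tbl = db.create_table("recordings", schema=Recording, exist_ok=True)
    tbl.add([{"recording_hash": "abc123", "participant_id": "p01",
              "started_at_ns": 0, "duration_s": 3600.0}])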

Map columns are queryable as first-class fields. Frequently used keys can be promoted to typed columns later via Lance schema evolution, without rewriting existing data.
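
Combined SQL-style plus vector filtering (goal 4) then looks roughly like the following, reusing the db handle from the sketch above. The `modality` column name and the query dimensionality are assumptions:

    # Hedged sketch: ANN search over the embeddings table with a SQL-style prefilter.
    query_vector = [0.0] * 768        # stand-in; real queries come from an encoder
    emb = db.open_table("embeddings")
    hits = (
        emb.search(query_vector)
           .where("modality = 'video_webcam'")
           .limit(10)
           .to_list()
    )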

Storage backends

Different modalities live in different stores; the API hides this from users.

  • Regular continuous streams (EEG @ 1024 Hz, biometrics @ 100 Hz, etc.) → Zarr on R2. Cloud-native chunked n-dimensional arrays; scales to multi-day recordings (see the read sketch after this list).

  • Video → mp4 stays as mp4 on R2; Ursa writes a Lance index of (frame_idx, byte_offset, timestamp) so we can seek to a frame in O(1) (seek sketch after this list).

  • Irregular event streams (clicks, keystrokes, prompts, responses, fixations) → Lance rows with timestamp columns.

  • Catalog and embeddings → Lance tables with full-text + vector search.
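
For the Zarr-on-R2 path, reading a window reduces to slicing a chunked array over an S3-compatible store. A minimal sketch using s3fs; the endpoint, bucket, and paths are placeholders:

    import s3fs
    import zarr

    # R2 is S3-compatible; endpoint and object paths below are placeholders.
    fs = s3fs.S3FileSystem(endpoint_url="https://<account>.r2.cloudflarestorage.com")
    store = s3fs.S3Map(root="bucket/recordings/abc123/eeg.zarr", s3=fs)
    eeg = zarr.open(store, mode="r")          # shape ~ (n_samples, n_channels)

    fs_hz = 1024
    window = eeg[120 * fs_hz : 130 * fs_hz]   # seconds 120-130; only those chunks are fetched

For video, the Lance index makes seeks cheap: a frame_idx lookup is O(1), and a timestamp lookup is a binary search over the index followed by a byte-range request into the mp4. A hypothetical helper over the index columns:

    import numpy as np

    def seek_offset(timestamps: np.ndarray, byte_offsets: np.ndarray, t: float) -> int:
        """Byte offset of the last frame at or before time t.

        `timestamps` and `byte_offsets` are the (sorted) columns of the
        Lance frame index described above.
        """
        i = int(np.searchsorted(timestamps, t, side="right")) - 1
        return int(byte_offsets[max(i, 0)])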

The user-facing data type is always temporaldata.Data / RegularTimeSeries / IrregularTimeSeries / Interval. Ursa adds Zarr- and Lance-backed implementations of those primitives so the rest of the stack doesn’t care which backend is in play.
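
A sketch of what that looks like with temporaldata’s primitives; the constructor details are a best guess at the library’s API and may differ from the release Ursa targets:

    import numpy as np
    from temporaldata import Data, Interval, IrregularTimeSeries, RegularTimeSeries

    # Constructor keywords are assumptions about temporaldata's API.
    domain = Interval(0.0, 10.0)
    eeg = RegularTimeSeries(
        raw=np.zeros((1024 * 10, 64)),   # 10 s of 64-channel EEG at 1024 Hz
        sampling_rate=1024.0,
        domain=domain,
    )
    clicks = IrregularTimeSeries(timestamps=np.array([0.5, 3.2, 7.8]), domain=domain)
    rec = Data(eeg=eeg, clicks=clicks, domain=domain)

    window = rec.slice(2.0, 4.0)         # aligned 2 s window across both streams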

Lifecycle

A single filter language drives three operations:

  • ursa.lifecycle.gc(filters, dry_run) — garbage collection (with Slack confirmation when deletions exceed a configurable size threshold)

  • ursa.lifecycle.sync_polaris(filters, max_size_tb) — what stays on the Polaris local SSD

  • ursa.lifecycle.backfill(pipeline, where) — re-run a Virgo pipeline on stale recordings

Filters: NotAccessedSince, OutdatedBy, SupersededBy, PinnedBy, plus a generic Filter(field, op, value).
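
Putting the three operations together. The entry points and their parameters come from the list above; the filter constructor arguments, the ursa.filters module path, and the pipeline name are assumptions:

    from ursa.filters import Filter, NotAccessedSince, OutdatedBy
    from ursa.lifecycle import backfill, gc, sync_polaris

    # Dry-run GC: derived assets untouched for 90 days and over 1 GiB.
    # (Filter constructor arguments are assumptions, not confirmed signatures.)
    gc(
        filters=[NotAccessedSince(days=90), Filter("size_bytes", ">", 1 << 30)],
        dry_run=True,
    )

    # Keep at most 4 TB of EEG hot on the Polaris SSD.
    sync_polaris(filters=[Filter("modality", "=", "eeg")], max_size_tb=4)

    # Re-run a Virgo pipeline (name is hypothetical) wherever its outputs are stale.
    backfill(pipeline="pupil_features", where=OutdatedBy("pupil_features>=2.0"))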

Phasing

See Linear for issue-level detail.