# Derrick — Data Model Design Document

## Overview

Derrick's data model represents a **pipeline configuration and execution tracking** system. It captures the full lifecycle: from how sources and destinations are configured, through how sync work is planned and scheduled, to how execution is tracked at the individual time-range level.

All entities use **UUIDs** as primary keys. Timestamps track creation and modification. Soft-delete is used for entities that participate in historical data (workspaces, connections, streams).
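As a minimal sketch of this shared shape (illustrative only — `BaseRecord` is not an actual Derrick type, and field names come from the tables below):

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class BaseRecord:
    """Illustrative shared columns: UUID primary key, audit timestamps,
    and an optional soft-delete marker (None means the row is active)."""
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    updated_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    deleted_at: Optional[datetime] = None

    @property
    def is_deleted(self) -> bool:
        return self.deleted_at is not None
```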

---

## Entity Relationship Diagram

```mermaid
erDiagram
    Workspace ||--o{ Source : "scopes"
    Workspace ||--o{ Destination : "scopes"
    Workspace ||--o{ Connection : "contains"
    Workspace ||--o{ ApiKey : "bound to"

    Source ||--o{ Connection : "feeds"
    Destination ||--o{ Connection : "receives"

    Connection ||--o{ Stream : "configures"
    Source ||--o{ Stream : "reads from"
    Destination ||--o{ Stream : "writes to"

    Stream ||--o{ SyncRun : "produces"
    Stream ||--o{ SyncRange : "divided into"
    Stream ||--o{ SyncRangeLog : "logs activity"

    SyncRun ||--o{ SyncRangeLog : "referenced by"
    SyncRange ||--o{ SyncRangeLog : "referenced by"

    Stream ||--o{ Notification : "triggers"

    ApiKey ||--o{ Session : "authenticates"
```

---

## Entities

### Workspace

Provides **logical multi-tenancy**. All pipeline entities (sources, destinations, connections) are scoped to a workspace. A single instance can serve multiple isolated environments.

| Field | Description |
|---|---|
| `id` | UUID primary key |
| `name` | Human-readable workspace name |
| `deleted_at` | Soft-delete timestamp (nullable) |
| `created_at` | Creation timestamp |
| `updated_at` | Last modification timestamp |

**Lifecycle**: Active → Soft-deleted (retains historical data).

---

### Source

A configured integration with an external API. Holds the **connector type** and **credentials**.

| Field | Description |
|---|---|
| `id` | UUID primary key |
| `name` | Display name (e.g. "Production Klaviyo") |
| `connector` | Connector type identifier (e.g. `klaviyo`, `shopify`, `facebook_marketing`) |
| `config` | JSON blob containing credentials and connector-specific settings |
| `start_date` | Earliest date from which to sync data (nullable) |
| `workspace_id` | Parent workspace reference (nullable for backward compat) |
| `created_at`, `updated_at` | Timestamps |

**Design notes**:
- `config` is a JSON blob rather than structured columns because each connector type has a different credential shape (API key, OAuth2 tokens with refresh/expiry, basic auth, service account JSON).
- For OAuth2 sources, `config` also holds `access_token`, `refresh_token`, and `token_expires_at` — updated in-place by the token refresh system.
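A sketch of the two config shapes and a staleness check (only `access_token`, `refresh_token`, and `token_expires_at` are named in this document; all other keys and values are illustrative):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical config blobs for two credential shapes.
api_key_config = {"api_key": "example-key"}
oauth_config = {
    "client_id": "example-client",
    "access_token": "example-access",
    "refresh_token": "example-refresh",
    "token_expires_at": "2024-01-01T00:00:00+00:00",
}

def needs_refresh(config: dict, margin_s: int = 300) -> bool:
    """True when an OAuth2 token is absent-of-expiry or expires within the margin.
    Non-OAuth sources carry no token_expires_at and never need a refresh."""
    expires = config.get("token_expires_at")
    if expires is None:
        return False
    deadline = datetime.fromisoformat(expires)
    return deadline - datetime.now(timezone.utc) < timedelta(seconds=margin_s)
```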

---

### Destination

A configured target system where synced data is written.

| Field | Description |
|---|---|
| `id` | UUID primary key |
| `name` | Display name (e.g. "Production Warehouse") |
| `dest_type` | Type identifier: `postgres`, `s3`, `kafka`, `local_file` |
| `config` | JSON blob containing connection details specific to the destination type |
| `destination_namespace` | Default schema/prefix for tables (nullable) |
| `stream_prefix` | Default prefix prepended to stream table names (nullable) |
| `workspace_id` | Parent workspace reference (nullable) |
| `created_at`, `updated_at` | Timestamps |

**Design notes**:
- `destination_namespace` and `stream_prefix` provide default routing. These can be overridden at the connection level.
- Config shapes vary by type: Postgres needs a `connection_string`, S3 needs `bucket`/`region`/`access_key_id`/`secret_access_key`, Kafka needs `bootstrap_servers`/`topic`, local file needs `folder`.
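The per-type shapes above can be checked with a simple required-keys map. Key names follow the design note; the example values and the validator itself are illustrative:

```python
# Illustrative config blobs per dest_type; values are made up.
DEST_CONFIG_EXAMPLES = {
    "postgres": {"connection_string": "postgres://user:pass@host:5432/db"},
    "s3": {"bucket": "my-lake", "region": "eu-west-1",
           "access_key_id": "example-id", "secret_access_key": "example-secret"},
    "kafka": {"bootstrap_servers": "broker1:9092,broker2:9092", "topic": "events"},
    "local_file": {"folder": "/var/lib/derrick/out"},
}

REQUIRED_KEYS = {t: set(cfg) for t, cfg in DEST_CONFIG_EXAMPLES.items()}

def validate_dest_config(dest_type: str, config: dict) -> list[str]:
    """Return the sorted list of required keys missing from a destination config."""
    return sorted(REQUIRED_KEYS.get(dest_type, set()) - set(config))
```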

---

### Connection

Binds a source to a destination within a workspace. Serves as the primary organisational unit in the UI.

| Field | Description |
|---|---|
| `id` | UUID primary key |
| `workspace_id` | Parent workspace |
| `source_id` | Reference to the source |
| `destination_id` | Reference to the destination |
| `name` | Display name (e.g. "Klaviyo → S3 Lake") |
| `destination_namespace` | Schema/prefix override for this connection (nullable) |
| `stream_prefix` | Table name prefix override for this connection (nullable) |
| `deleted_at` | Soft-delete timestamp |
| `created_at`, `updated_at` | Timestamps |

**Design notes**:
- Routing config at the connection level takes precedence over destination defaults. This allows multiple connections to the same destination with different namespaces.
- `ConnectionWithStats` is a **view model** (not a persisted entity) that joins connection data with aggregate stream statistics: stream count, enabled count, total records/bytes synced, last sync time, and health status.
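The precedence rule can be sketched as follows (function names are illustrative, not Derrick's actual API):

```python
from typing import Optional

def resolve_routing(conn_value: Optional[str], dest_default: Optional[str]) -> Optional[str]:
    """Connection-level override wins; otherwise fall back to the destination default."""
    return conn_value if conn_value is not None else dest_default

def table_name(stream_name: str,
               conn_prefix: Optional[str], dest_prefix: Optional[str]) -> str:
    """Apply the resolved stream_prefix to a stream's table name."""
    prefix = resolve_routing(conn_prefix, dest_prefix) or ""
    return f"{prefix}{stream_name}"
```

Two connections pointing at the same destination can thus land in different schemas simply by setting different connection-level values.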

---

### Stream

The fundamental unit of sync work. Represents a specific data entity being pulled from a source and written to a destination.

| Field | Description |
|---|---|
| `id` | UUID primary key |
| `source_id` | Reference to the source (for direct resolution) |
| `destination_id` | Reference to the destination (for direct resolution) |
| `connection_id` | Reference to the parent connection |
| `name` | Stream name matching the connector's stream definition (e.g. `events`, `orders`) |
| `path` | API path from the connector manifest |
| `primary_key` | Field used for deduplication/upsert (e.g. `id`) |
| `enabled` | Whether the stream is active for scheduling |
| `sync_strategy` | Processing order: `oldest_first` or `newest_first` |
| `sync_granularity` | Range size: `day` or `hour` |
| `sync_interval` | Deprecated; replaced by `schedule_cron` |
| `schedule_cron` | Cron expression for scheduled syncs (nullable) |
| `retry_max_attempts` | Maximum retry attempts per range |
| `retry_backoff_seconds` | Base backoff duration in seconds |
| `deleted_at` | Soft-delete timestamp |
| `created_at`, `updated_at` | Timestamps |

**Design notes**:
- Streams reference both `source_id`/`destination_id` directly AND `connection_id`. This redundancy enables efficient queries (e.g. finding all streams for a source) without always joining through connections.
- `sync_strategy` determines processing order: `newest_first` prioritises recent data (useful for dashboards), `oldest_first` for complete backfills.
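The ordering behaviour amounts to a sort direction over pending range starts — a sketch, not the engine's actual scheduling code:

```python
from datetime import date

def order_ranges(starts: list[date], strategy: str) -> list[date]:
    """Order pending range starts according to the stream's sync_strategy."""
    if strategy == "newest_first":
        return sorted(starts, reverse=True)  # recent data lands first
    if strategy == "oldest_first":
        return sorted(starts)                # backfill from the beginning
    raise ValueError(f"unknown sync_strategy: {strategy}")
```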

---

### SyncRun

Tracks a single execution of a stream's sync. One sync run may process many ranges.

| Field | Description |
|---|---|
| `id` | UUID primary key |
| `stream_id` | Reference to the stream |
| `status` | Current state: `running`, `complete`, `failed`, `cancelled` |
| `records_synced` | Total records written across all ranges |
| `bytes_synced` | Total bytes written across all ranges |
| `error` | Error message if failed (nullable) |
| `error_category` | Classified error type: `auth`, `rate_limit`, `network`, `transform`, `destination`, `unknown` (nullable) |
| `started_at` | When execution began |
| `completed_at` | When execution finished (nullable) |
| `duration_ms` | Total duration in milliseconds (nullable) |

**Lifecycle**: `running` → `complete` | `failed` | `cancelled`.

---

### SyncRange

Represents a single time interval within a sync. This is the **granular unit of progress tracking**.

| Field | Description |
|---|---|
| `id` | UUID primary key |
| `stream_id` | Reference to the stream |
| `range_start` | Start of the time interval |
| `range_end` | End of the time interval |
| `status` | Current state: `pending`, `running`, `complete`, `failed`, `retrying` |
| `records_synced` | Records written for this range |
| `bytes_synced` | Bytes written for this range |
| `error` | Error message if failed (nullable) |
| `attempt_count` | Number of processing attempts |
| `created_at` | When the range row was created |
| `started_at` | When processing began (nullable) |
| `completed_at` | When processing finished (nullable) |

**Lifecycle**: `pending` → `running` → (`complete` | `failed`); a `failed` range moves to `retrying` → `running` → … until it completes or `retry_max_attempts` is exhausted.

**Design notes**:
- Ranges are created idempotently via bulk `INSERT ... ON CONFLICT DO NOTHING`. The engine periodically refreshes interval rows (every 10 ranges or 5 minutes) to discover newly-elapsed hours/days.
- A "current" range (where `now` falls between `range_start` and `range_end`) is kept in `pending` state after processing, since more data may arrive within that interval.
- `attempt_count` drives retry backoff scheduling at the scheduler level.
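A sketch of day-granularity range generation and attempt-driven backoff. The inclusion of the "current" interval matches the note above; the exponential formula is an assumption, not the documented scheduler behaviour:

```python
from datetime import datetime, timedelta, timezone

def day_ranges(start: datetime, now: datetime):
    """Yield (range_start, range_end) day intervals from start up to now.
    The interval containing `now` is included, since more data may still arrive."""
    cur = start.replace(hour=0, minute=0, second=0, microsecond=0)
    while cur <= now:
        yield (cur, cur + timedelta(days=1))
        cur += timedelta(days=1)

def retry_delay(attempt_count: int, base_seconds: int) -> int:
    """Assumed exponential backoff: retry_backoff_seconds * 2^(attempts - 1)."""
    return base_seconds * 2 ** max(attempt_count - 1, 0)
```

Because range rows are inserted with `ON CONFLICT DO NOTHING`, regenerating this sequence on every refresh is safe: only newly-elapsed intervals produce new rows.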

---

### SyncRangeLog

An append-only **activity log** tracking range-level events. Used for operational visibility and debugging.

| Field | Description |
|---|---|
| `id` | UUID primary key |
| `stream_id` | Reference to the stream |
| `sync_range_id` | Reference to the range (nullable for stream-level events) |
| `run_id` | Reference to the sync run (nullable for recovery events) |
| `event_type` | Event classification (see below) |
| `range_start`, `range_end` | Time boundaries of the event (nullable) |
| `attempt_number` | Which attempt this event relates to (nullable) |
| `backoff_seconds` | Backoff duration if retrying (nullable) |
| `error` | Error message (nullable) |
| `error_category` | Classified error type (nullable) |
| `records_synced`, `bytes_synced` | Metrics at time of event (nullable) |
| `metadata` | Arbitrary JSON for extensibility (nullable) |
| `created_at` | Event timestamp |

**Event types**: `started`, `completed`, `failed`, `retry_scheduled`, `retry_attempt`, `cancelled`, `interrupted`, `interleave`.

---

### Notification

User-facing alerts for operational events.

| Field | Description |
|---|---|
| `id` | UUID primary key |
| `notification_type` | Category: `schema_change`, `token_refresh_failed`, `sync_error` |
| `title` | Short display title |
| `message` | Detailed message body |
| `stream_id` | Related stream (nullable — some notifications are source-level) |
| `is_read` | Whether the user has acknowledged it |
| `created_at` | When generated |

---

### ApiKey

Stores hashed API keys for authentication.

| Field | Description |
|---|---|
| `id` | UUID primary key |
| `key_hash` | SHA-256 hash of the full key |
| `key_prefix` | First 8 characters of the random portion (for display identification) |
| `label` | Human-readable label (e.g. "admin", "ci-pipeline") |
| `workspace_id` | Workspace scope (nullable — null means full access) |
| `is_active` | Whether the key is valid for authentication |
| `created_at`, `updated_at` | Timestamps |

**Design notes**:
- Raw API keys are never stored. Only the hash is persisted.
- `key_prefix` allows users to identify which key is which in a list, without exposing the full key.
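A sketch of the hash-and-prefix scheme (the `drk_` prefix and the length of the random portion are illustrative — this document does not specify Derrick's actual key format):

```python
import hashlib
import secrets

def generate_api_key(label: str) -> tuple[str, dict]:
    """Return the raw key (shown to the user once) and the row persisted instead."""
    random_part = secrets.token_hex(16)
    raw_key = f"drk_{random_part}"
    row = {
        "key_hash": hashlib.sha256(raw_key.encode()).hexdigest(),
        "key_prefix": random_part[:8],  # enough to identify the key in a list
        "label": label,
        "is_active": True,
    }
    return raw_key, row

def verify_api_key(candidate: str, stored_hash: str) -> bool:
    """Constant-time comparison of the candidate's hash against the stored hash."""
    digest = hashlib.sha256(candidate.encode()).hexdigest()
    return secrets.compare_digest(digest, stored_hash)
```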

---

### Session

Ephemeral login sessions created when authenticating via the UI.

| Field | Description |
|---|---|
| `id` | UUID primary key (stored as session cookie) |
| `api_key_id` | Reference to the API key used to authenticate |
| `expires_at` | Session expiry (auto-cleaned hourly by the scheduler) |
| `created_at` | When created |

---

### Setting

Key-value store for runtime configuration.

| Field | Description |
|---|---|
| `key` | Setting identifier (e.g. `global_max_concurrency`, `source_max_concurrency`) |
| `value` | String value (parsed to the appropriate type at read time) |
| `updated_at` | Last modification timestamp |
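Since values are stored as strings, readers parse them on access — a minimal sketch of one such typed getter (the helper itself is illustrative):

```python
def get_int_setting(settings: dict[str, str], key: str, default: int) -> int:
    """Read a setting stored as a string, parsing to int with a fallback default."""
    raw = settings.get(key)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        return default  # malformed value: fall back rather than crash
```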

---

## View Models (Non-Persisted)

These models are computed via joins/aggregations and returned by the API. They are not stored as separate tables.

| Model | Purpose |
|---|---|
| `StreamStatus` | Joins stream + source + latest sync run for the dashboard status view |
| `ConnectionWithStats` | Joins connection + source + destination + aggregate stream stats |
| `ActivityLogEntry` | Joins SyncRangeLog with stream name for human-readable activity feeds |
| `SyncStatsRow` | Aggregated counters (complete/pending/failed/running ranges, recent activity buckets) |

---

## Key Design Decisions

1. **JSON config blobs** — Source and destination configs use JSON rather than structured columns. This is intentional: each connector and destination type has a unique configuration shape, and the schema changes when new connectors are added. JSON provides forward-compatible storage.

2. **Range-level tracking** — Progress is tracked at the individual time-range level, not just the run level. This enables partial resync (re-process a single day), visual status heatmaps, and precise retry targeting.

3. **Activity log as append-only event stream** — `SyncRangeLog` serves as an audit trail and debugging tool. Events are fire-and-forget (failures to log don't block syncs).

4. **Soft-delete for referential entities** — Workspaces, connections, and streams use `deleted_at` rather than hard delete, preserving historical sync data.

5. **Redundant foreign keys on Stream** — Streams reference both `connection_id` and `source_id`/`destination_id` directly. This trades normalisation for query efficiency in the common path (triggering a sync needs the source and destination configs without traversing connections).
