# Derrick — Concept Document

## Purpose

Derrick is a **self-contained data pipeline engine** that syncs data from SaaS APIs into data infrastructure. It exists to solve a specific problem: getting operational data (orders, customers, events, ad metrics) out of third-party APIs and into a queryable destination — with minimal setup, zero external dependencies, and full operational visibility.

The system is designed to be deployed as a **single binary** that includes the backend, the embedded database, and the frontend — making it runnable with a single command.

---

## Core Concepts

### Sources

A **Source** represents a configured connection to an external API. Each source references a **connector type** (e.g. Klaviyo, Shopify) and holds the credentials and configuration needed to authenticate against that API. A source may optionally define a **start date** — the earliest date from which data should be synced.

Sources are the "read side" of the pipeline. They are reusable: one source can feed multiple streams.

### Connectors

A **Connector** defines *how* to interact with a specific API. Derrick supports two flavours:

- **Declarative connectors** — defined entirely in YAML. A manifest file specifies the base URL, authentication scheme, available streams, pagination strategy, date filtering, and response paths. The engine interprets these manifests at runtime. Adding a new API integration requires no code — only a new YAML file.

- **Coded connectors** — implemented in code for APIs that require more complex logic (OAuth2 flows, custom request construction, report-based APIs like Google Ads or GA4). These implement the source connector interface directly.
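
As a sketch, a declarative manifest for a hypothetical API might look like the following. Every field name here is illustrative, not the engine's actual manifest schema:

```yaml
# Hypothetical manifest — field names are illustrative only.
name: exampleapi
base_url: https://api.example.com/v1
auth:
  type: api_key
  header: Authorization
streams:
  - name: events
    path: /events
    primary_key: id
    pagination:
      type: cursor
      cursor_param: page_token
    date_filter:
      start_param: since
      end_param: until      # declaring an end-date filter is what
                            # would permit the newest_first strategy
    response_path: data
```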

Both types are registered in a **ConnectorRegistry** at startup and are discoverable by the API and UI.

### Destinations

A **Destination** is a configured target where synced data is written. Derrick supports four destination types:

| Destination | Behaviour |
|---|---|
| **Postgres** | Upsert semantics — matching records are updated, new records are inserted. Auto-creates tables after inferring schema from the data shape. Supports schema evolution (new columns detected and added automatically). |
| **S3** | Uploads files in Parquet, Avro, or JSONL format with Hive-style date partitioning (`dt=YYYY-MM-DD`). Supports hour-level partitioning. |
| **Kafka** | Produces JSON messages keyed by the record's primary key. Supports SASL/PLAIN authentication. |
| **Local File** | Writes newline-delimited JSON files to a local folder. Optional date partitioning. |

All destinations implement a common writer interface that defines three operations: ensure table exists, write a batch of records, and clean up a range before resync.
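
The three-operation interface can be sketched as follows; method names and the in-memory implementation are illustrative, not Derrick's actual API:

```python
from abc import ABC, abstractmethod

# Sketch of the common writer interface described above.
class DestinationWriter(ABC):
    @abstractmethod
    def ensure_table(self, namespace: str, table: str) -> None:
        """Create the target table (or prefix) if it does not already exist."""

    @abstractmethod
    def write_batch(self, table: str, records: list[dict]) -> int:
        """Write one page of records; returns the number written."""

    @abstractmethod
    def cleanup_range(self, table: str, start: str, end: str) -> None:
        """Remove previously written data in [start, end) ahead of a resync."""

# Minimal in-memory implementation, e.g. for tests.
class MemoryWriter(DestinationWriter):
    def __init__(self):
        self.tables: dict[str, list[dict]] = {}

    def ensure_table(self, namespace: str, table: str) -> None:
        self.tables.setdefault(f"{namespace}.{table}", [])

    def write_batch(self, table: str, records: list[dict]) -> int:
        self.tables[table].extend(records)
        return len(records)

    def cleanup_range(self, table: str, start: str, end: str) -> None:
        self.tables[table] = [
            r for r in self.tables[table] if not (start <= r["dt"] < end)
        ]
```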

### Connections

A **Connection** binds a source and a destination together within a workspace. Connections carry routing configuration — a **destination namespace** (e.g. a Postgres schema) and a **stream prefix** (prepended to table names). Connections serve as the organizing layer: streams are created within a connection.

### Streams

A **Stream** is the fundamental unit of work. It represents a specific data entity (e.g. "Klaviyo events", "Shopify orders") being synced from a source to a destination via a connection. Each stream has:

- A **name** and **path** (matching the connector's stream definition)
- A **primary key** (for deduplication/upsert)
- A **sync strategy** — controls the chronological direction of processing:
  - `oldest_first` — processes ranges in forward chronological order (standard backfill)
  - `newest_first` — processes ranges in **reverse chronological order**, delivering the most recent data first before working backwards through history
- A **sync granularity** (`day` or `hour`)
- Optional **scheduling** via cron expression
- **Retry configuration** (max attempts, backoff seconds)

Streams can be enabled/disabled, triggered manually, scheduled, or reset.
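
The fields above suggest a record shape along these lines; the names and defaults are assumptions mirroring the bullet list, not Derrick's actual model:

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Illustrative shape of a stream definition.
@dataclass
class StreamConfig:
    name: str
    path: str
    primary_key: str
    sync_strategy: Literal["oldest_first", "newest_first"] = "oldest_first"
    granularity: Literal["day", "hour"] = "day"
    cron: Optional[str] = None       # e.g. "0 * * * *" for hourly scheduling
    max_attempts: int = 3            # retry configuration
    backoff_seconds: int = 30
    enabled: bool = True
```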

### Workspaces

A **Workspace** provides logical tenant isolation. Sources, destinations, connections, and streams are scoped to a workspace. API keys can be bound to a specific workspace. Workspaces enable multi-tenant operation within a single deployment.

---

## How Syncing Works

### Range-Based Incremental Sync

Derrick's sync engine operates on **time ranges**. When a stream is synced:

1. The engine generates intervals based on the stream's granularity (day or hour) from the source's start date to the current time.
2. All expected intervals are persisted as **SyncRange** rows in the database (idempotent — duplicates are silently ignored).
3. The engine enters a **pick-next loop**, querying the database for the next processable range (pending or failed, ordered by the sync strategy).
4. For each range, data is fetched from the source using **streaming pagination** — pages are sent via a bounded channel and written to the destination as they arrive, avoiding full dataset accumulation in memory.
5. Ranges are marked as `running` → `complete` or `failed`, with progress tracked at the range level.

If a connector does not support date-range filtering, the engine falls back to a **full read**.
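
Step 1 above can be sketched for day granularity as follows; the function name is illustrative:

```python
from datetime import date, timedelta

# Generate day-granularity sync ranges from the source's start date up
# to "today". Each (start, end) pair would be persisted as a SyncRange
# row; re-running is idempotent because duplicate rows are ignored.
def day_ranges(start: date, today: date) -> list[tuple[date, date]]:
    ranges = []
    current = start
    while current <= today:
        ranges.append((current, current + timedelta(days=1)))
        current += timedelta(days=1)
    return ranges
```

Hour granularity would be the same loop with `timedelta(hours=1)` steps.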

### Backward Chronological Sync (Newest-First)

A key capability of Derrick is the ability to **sync data in reverse chronological order**. When a stream's sync strategy is set to `newest_first`:

- The pick-next loop selects ranges starting from the **most recent date** and works backwards toward the source's start date.
- This means users get **today's data first**, then yesterday's, then last week's — before the system fills in historical data from months or years ago.
- Each range is still an independent, atomic unit of work. The only difference is the processing order.

This is particularly valuable for:
- **Dashboard-first use cases** — users need current data immediately and can tolerate a gradual backfill of history.
- **Large historical datasets** — when a source has years of data, waiting for a complete backfill before seeing any recent data is impractical.
- **APIs with rate limits** — the most valuable (recent) data is secured first, even if rate limits slow down the full backfill.

The `newest_first` strategy requires the API connector to support **range-bounded queries** (both a start and end date filter). Connectors that only support a "greater-than" date filter are limited to `oldest_first`. The connector manifest explicitly declares whether it supports an end-date filter, and the UI constrains the available strategy options accordingly.
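
The effect of the strategy on the pick-next query can be sketched like this; the dict shape is illustrative:

```python
# Same pending/failed candidates, ordered forward or backward
# depending on the stream's sync strategy.
def pick_next(ranges: list[dict], strategy: str):
    candidates = [r for r in ranges if r["status"] in ("pending", "failed")]
    if not candidates:
        return None
    key = lambda r: r["start"]
    if strategy == "newest_first":
        return max(candidates, key=key)   # most recent range first
    return min(candidates, key=key)       # standard forward backfill
```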

### Retry and Error Handling

Each range is individually retryable. Errors are categorized (`auth`, `rate_limit`, `network`, `transform`, `destination`, `unknown`), and retry eligibility is determined by category — only transient failures (rate limit, network, unknown) are retried.

Retries use **exponential backoff** with a configurable base and a ceiling of 300 seconds. A **circuit breaker** pattern is used: ranges that fail during the main loop are excluded from the pick-next query and given a final retry pass after all other ranges complete.
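
A minimal sketch of the delay computation, assuming a doubling factor (only the configurable base and the 300-second ceiling are stated above):

```python
# Exponential backoff capped at the 300-second ceiling.
def backoff_delay(base_seconds: int, attempt: int, ceiling: int = 300) -> int:
    return min(base_seconds * (2 ** attempt), ceiling)
```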

### Streaming Write (Page-at-a-Time)

The connector sends pages of records through an async channel. The writer consumes pages as they arrive. This design means:

- Memory usage is bounded to the channel buffer (4 pages) rather than the total dataset.
- For object store destinations, each page becomes a separate file within the date partition.
- For relational destinations, each page is a separate insert batch.
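
The producer/consumer shape can be sketched with a bounded queue; names are illustrative:

```python
import asyncio

# Page-at-a-time streaming: a bounded queue of 4 pages sits between the
# connector (producer) and the writer (consumer), so memory never holds
# more than the buffer plus the pages in flight.
async def stream_range(fetch_pages, write_page):
    queue: asyncio.Queue = asyncio.Queue(maxsize=4)

    async def produce():
        async for page in fetch_pages():
            await queue.put(page)      # blocks once 4 pages are buffered
        await queue.put(None)          # end-of-stream sentinel

    async def consume():
        while (page := await queue.get()) is not None:
            await write_page(page)

    await asyncio.gather(produce(), consume())
```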

### Cancellation

Running syncs support **cooperative cancellation**. Each sync run is assigned a cancellation token registered in a shared registry. The API or scheduler can signal cancellation, and the sync loop checks the token between ranges. Cleanup is automatic: the token is removed from the registry when the sync task completes, even on error.
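
A sketch of the registry-plus-token pattern; all names are illustrative:

```python
import threading

# One cancellation event per run, held in a shared registry. The sync
# loop checks the token between ranges; removal always runs via finally.
class CancelRegistry:
    def __init__(self):
        self._tokens: dict[str, threading.Event] = {}

    def register(self, run_id: str) -> threading.Event:
        token = threading.Event()
        self._tokens[run_id] = token
        return token

    def cancel(self, run_id: str) -> None:
        if run_id in self._tokens:
            self._tokens[run_id].set()

    def remove(self, run_id: str) -> None:
        self._tokens.pop(run_id, None)

def run_sync(registry: CancelRegistry, run_id: str, ranges, process) -> int:
    token = registry.register(run_id)
    done = 0
    try:
        for r in ranges:
            if token.is_set():         # checked between ranges
                break
            process(r)
            done += 1
    finally:
        registry.remove(run_id)        # cleanup even on error or cancel
    return done
```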

---

## Scheduling

The **Scheduler** runs as a background task, ticking every 10 seconds. On each tick it:

1. Refreshes OAuth2 tokens that are expiring within 5 minutes.
2. Cleans up expired sessions (hourly).
3. Re-reads concurrency settings from the database.
4. Evaluates cron expressions for all enabled, scheduled streams.
5. Checks retry eligibility for streams with failed ranges.
6. Acquires concurrency permits (global + per-source) before spawning syncs.

### Concurrency Control

Concurrency is controlled by a two-level semaphore system:
- **Global limit** — maximum total concurrent syncs across all sources (default: 4).
- **Per-source limit** — maximum concurrent syncs from a single source/connector (default: 2).

Both limits are configurable at runtime via the settings API. Permits are held for the duration of the sync and released when the task completes.
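
The two-level acquisition can be sketched as nested semaphores; names and the lazy per-source creation are assumptions:

```python
import asyncio

# Acquire the global permit, then the per-source permit, and hold both
# for the duration of the sync.
async def run_with_permits(global_sem, source_sems, source_id, sync,
                           per_source_limit: int = 2):
    sem = source_sems.setdefault(source_id, asyncio.Semaphore(per_source_limit))
    async with global_sem:       # global cap (default 4)
        async with sem:          # per-source cap (default 2)
            await sync()
```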

### Crash Recovery

On startup, the scheduler runs **crash recovery**: any sync runs or sync ranges left in a `running` state (from a previous unclean shutdown) are transitioned to `failed`. Activity log events are emitted for each interrupted range.
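
The recovery pass amounts to one bulk status transition. A sketch against an in-memory SQLite stand-in for the embedded database (table and column names are assumptions):

```python
import sqlite3

# Fail anything still marked 'running' from a previous unclean shutdown.
def recover_interrupted(conn: sqlite3.Connection) -> int:
    cur = conn.execute(
        "UPDATE sync_ranges SET status = 'failed' WHERE status = 'running'"
    )
    conn.commit()
    return cur.rowcount    # one activity-log event would be emitted per row
```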

---

## Data Explorer

Derrick includes a built-in **Data Explorer** for querying synced data in S3 destinations. It shells out to the DuckDB CLI, injecting S3 credentials as a preamble. The explorer also supports table discovery — walking the object store prefix hierarchy to enumerate available namespaces and streams.
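
Assembling the SQL handed to the CLI might look like this; the `SET` options are DuckDB's httpfs S3 configuration, while the function name and how Derrick actually invokes the binary are assumptions:

```python
# Build a credential preamble followed by the user's query; the result
# could then be piped to the duckdb binary's stdin via subprocess.
def build_explorer_sql(query: str, key_id: str, secret: str, region: str) -> str:
    preamble = "\n".join([
        "INSTALL httpfs; LOAD httpfs;",
        f"SET s3_access_key_id='{key_id}';",
        f"SET s3_secret_access_key='{secret}';",
        f"SET s3_region='{region}';",
    ])
    return f"{preamble}\n{query}"
```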

---

## Authentication & Security

- **API Key authentication**: Keys are stored as SHA-256 hashes. Keys follow the format `dk_live_{32 hex}`.
- **Session-based auth**: Login with an API key creates a time-limited session stored as a cookie.
- **Auth bypass**: Configurable via `DERRICK_AUTH_DISABLED` for development.
- **Admin key seeding**: On first boot, an admin key can be provisioned via `DERRICK_ADMIN_API_KEY` env var.
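
The key scheme above can be sketched directly; function names are illustrative:

```python
import hashlib
import secrets

# "dk_live_" plus 32 hex characters; only the SHA-256 digest is persisted.
def generate_api_key() -> tuple[str, str]:
    key = "dk_live_" + secrets.token_hex(16)           # 16 bytes -> 32 hex chars
    digest = hashlib.sha256(key.encode()).hexdigest()  # stored server-side
    return key, digest                                 # show the key once

def verify_api_key(presented: str, stored_digest: str) -> bool:
    candidate = hashlib.sha256(presented.encode()).hexdigest()
    return secrets.compare_digest(candidate, stored_digest)
```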

---

## Notifications

The system generates notifications for operational events:
- **Schema changes** — when a new column is detected and auto-added during Postgres writes.
- **OAuth2 token failures** — when token refresh fails, including revocation detection.
- **Sync failures** — tracked at the range and run level.

Notifications are surfaced in the UI via a notification bell with unread count.

---

## Frontend

The React/TypeScript frontend serves as the operational console. It includes:

| Page | Purpose |
|---|---|
| **Dashboard** | Overview of sync activity and system health |
| **Connections** | Manage source-destination bindings and stream configuration |
| **Connection Detail** | Deep view into a connection's streams, activity log, and range calendar |
| **Sources** | CRUD for API source configurations |
| **Destinations** | CRUD for data destinations with connection testing |
| **Streams** | Stream management, sync triggering, and scheduling |
| **Sync Status** | Live view of stream activity and last-run status |
| **Sync History** | Historical sync run log with filtering |
| **Data Explorer** | SQL query interface for S3 data via DuckDB |
| **Settings** | Scheduler concurrency settings and API key management |
| **Login** | API key authentication |

Notable components include a **Range Calendar** (visualizing sync range status as a heatmap), **Sync Charts**, **Scheduler Badge**, and a **Workspace Switcher** for multi-tenant navigation.

The frontend is embedded into the server binary as static assets and served as a single-page application with a fallback catch-all route.
