# Derrick — Architecture Document

## Scope

This document describes the **technical constraints and architectural expectations** of the Derrick data pipeline engine. It is not a technology selection document — it describes what the system must handle, what guarantees it must provide, and what trade-offs have been made.

---

## 1. Deployment Constraints

### Single-Binary Distribution

The system **must** be deployable as a single executable that bundles:
- The HTTP API server
- The background scheduler
- The frontend SPA (embedded as static assets)
- Database migrations

A user should be able to run `./derrick` and have a fully operational system — no configuration files, no orchestration, no companion services.

### Zero-Dependency Local Mode

When no external database is configured, the system **must** automatically start and manage an embedded PostgreSQL instance. This includes:
- Downloading and installing the database engine on first run
- Persistent data storage across restarts (data survives process restart)
- Automatic database creation and migration
- Stale process cleanup (detecting and killing orphaned database processes from unclean shutdown)

This constraint eliminates the "install Postgres first" barrier for development and small deployments.

### External Database Mode

When an external database connection string is provided, the system **must** use it instead of the embedded instance. The same migration set must work for both modes. When the embedded instance is in use instead, it must remain alive for the full duration of the process.
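
A minimal sketch of the mode decision, assuming a Rust implementation; the `DERRICK_DATABASE_URL` variable name and data directory are illustrative, not part of the contract:

```rust
use std::path::PathBuf;

/// Which database backend to use, decided once at startup.
enum DatabaseMode {
    /// External mode: a user-supplied connection string; the embedded engine is never started.
    External(String),
    /// Local mode: start and manage embedded PostgreSQL, keeping it alive for the full
    /// lifetime of the process and reusing the data directory across restarts.
    Embedded { data_dir: PathBuf },
}

/// Sketch only — the environment variable name is an assumption for illustration.
fn resolve_database_mode() -> DatabaseMode {
    match std::env::var("DERRICK_DATABASE_URL") {
        Ok(url) if !url.is_empty() => DatabaseMode::External(url),
        _ => DatabaseMode::Embedded { data_dir: PathBuf::from("./data/postgres") },
    }
}
```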

### Container Deployment

The system **must** support containerised deployment. In container mode, the embedded database is not available — an external database connection is required. The container image should be minimal (runtime libraries only, no build tools).

---

## 2. Runtime Constraints

### Async Concurrency Model

All I/O operations (API calls, database queries, file writes, network operations) **must** be non-blocking. The system runs many concurrent syncs, each of which makes many HTTP requests and database writes. Blocking any operation would starve the async runtime and degrade throughput for all other concurrent work.

### Memory-Bounded Streaming

Sync operations **must not** accumulate the entire dataset in memory. The system deals with API responses that can contain millions of records across thousands of pages. The architecture requires a **page-at-a-time streaming model**:

- The connector (reader) sends pages through a bounded channel.
- The destination (writer) consumes pages as they arrive.
- The channel buffer is bounded (currently 4 pages), creating natural backpressure — if the writer is slower than the reader, the reader blocks until buffer space is available.

This constraint ensures memory usage is proportional to the channel buffer size, not the total dataset size.
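
A minimal sketch of that flow with a bounded Tokio channel; the `Page` alias and the inline page generation are placeholders for the real connector and writer:

```rust
use tokio::sync::mpsc;

/// One page of API records, produced by the connector, consumed by the writer.
type Page = Vec<serde_json::Value>;

async fn run_streaming_sync() {
    // Bounded to 4 pages: if the writer falls behind, `send().await` waits,
    // so the reader stalls instead of buffering the whole dataset in memory.
    let (tx, mut rx) = mpsc::channel::<Page>(4);

    // Reader side: the connector pushes pages as it fetches them.
    let reader = tokio::spawn(async move {
        for page_no in 0..1_000u64 {
            // In the real engine this is an HTTP request for page `page_no`.
            let page: Page = vec![serde_json::json!({ "page": page_no })];
            if tx.send(page).await.is_err() {
                break; // writer gone; stop producing
            }
        }
    });

    // Writer side: consume pages as they arrive; memory use is bounded by
    // the channel capacity, not by the total number of pages.
    while let Some(page) = rx.recv().await {
        let _ = page; // write the batch to the destination here
    }
    let _ = reader.await;
}
```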

### Cooperative Cancellation

Running syncs **must** be cancellable without killing the process. Cancellation is cooperative — the sync loop checks a cancellation signal between ranges. Requirements:
- Cancellation must be signalled externally (via API) and checked internally (in the sync loop).
- Partial progress (ranges already completed before cancellation) must be preserved.
- Resources (permits, registry entries) must be cleaned up even if the task panics.
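
A minimal sketch of the cooperative check, assuming `tokio_util`'s `CancellationToken` (the crate choice and range type are assumptions):

```rust
use chrono::NaiveDateTime;
use tokio_util::sync::CancellationToken;

/// Sketch only: the range-processing body is a placeholder.
async fn sync_stream(ranges: Vec<(NaiveDateTime, NaiveDateTime)>, cancel: CancellationToken) {
    for (start, end) in ranges {
        // Cooperative check between ranges: earlier ranges are already
        // persisted, so cancelling here preserves partial progress.
        if cancel.is_cancelled() {
            tracing::info!("sync cancelled; completed ranges preserved");
            return;
        }
        let _ = (start, end); // fetch, write, and mark this range complete
    }
    // Permit release and registry cleanup belong in a guard (Drop), so they
    // run even if this task panics.
}
```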

### Clock and Time Handling

All system timestamps use UTC. Range boundaries are expressed as timezone-naive datetime values (representing logical calendar intervals), while event tracking uses timezone-aware UTC timestamps. The sync engine must handle the "current range" problem — a range whose interval includes `now` cannot be marked complete, because more data may still arrive within it. Such ranges must remain in `pending` state.
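
A one-line sketch of the current-range check (chrono types assumed):

```rust
use chrono::{NaiveDateTime, Utc};

/// A range whose interval still contains `now` may receive more data,
/// so it cannot be marked complete and stays `pending`.
fn is_current_range(range_end: NaiveDateTime) -> bool {
    range_end > Utc::now().naive_utc()
}
```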

---

## 3. Data Integrity Constraints

### Idempotent Range Processing

Range processing **must** be idempotent. The same range can be processed multiple times (due to retries, resyncs, or scheduler re-triggers) without producing duplicate data. Requirements:
- Range rows are created idempotently — duplicate inserts for the same stream and time interval are silently ignored.
- Before writing data for a range, previously written data for that range must be cleaned up (for file-based destinations, this means deleting existing partition files; for upsert-capable destinations, the conflict resolution handles deduplication).
- Range status transitions must be safe under concurrent access.
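
A sketch of the idempotent insert, assuming `sqlx` and an illustrative `sync_ranges` table with a unique constraint on `(stream_id, range_start, range_end)`:

```rust
use chrono::NaiveDateTime;
use sqlx::PgPool;
use uuid::Uuid;

/// Sketch only — table and column names are illustrative, not the actual schema.
async fn create_range(
    pool: &PgPool,
    stream_id: Uuid,
    range_start: NaiveDateTime,
    range_end: NaiveDateTime,
) -> sqlx::Result<()> {
    sqlx::query(
        "INSERT INTO sync_ranges (stream_id, range_start, range_end, status)
         VALUES ($1, $2, $3, 'pending')
         ON CONFLICT (stream_id, range_start, range_end) DO NOTHING",
    )
    .bind(stream_id)
    .bind(range_start)
    .bind(range_end)
    .execute(pool)
    .await?;
    Ok(())
}
```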

### Upsert Semantics for Relational Destinations

Relational database writes **must** use upsert semantics. When the same primary key is encountered twice, the newer record must overwrite the older one: all non-primary-key columns are updated on conflict.
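
A sketch of how such a statement might be assembled for PostgreSQL; the helper name and dynamic construction are illustrative:

```rust
/// Build a PostgreSQL upsert: on primary-key conflict, every non-key column
/// is overwritten by the incoming (newer) record.
fn upsert_sql(table: &str, primary_key: &str, columns: &[&str]) -> String {
    let placeholders: Vec<String> = (1..=columns.len()).map(|i| format!("${i}")).collect();
    let updates: Vec<String> = columns
        .iter()
        .filter(|c| **c != primary_key)
        .map(|c| format!("{c} = EXCLUDED.{c}"))
        .collect();
    format!(
        "INSERT INTO {table} ({}) VALUES ({}) ON CONFLICT ({primary_key}) DO UPDATE SET {}",
        columns.join(", "),
        placeholders.join(", "),
        updates.join(", ")
    )
}
// e.g. upsert_sql("derrick.exampleapi_orders", "id", &["id", "total", "updated_at"])
```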

### Schema Evolution

When new fields appear in API responses that don't exist as columns in the destination table, the system **must** automatically add them. Requirements:
- New columns are detected by comparing record keys against cached column metadata.
- Columns are added automatically with a permissive type (string/text for evolved columns).
- Schema changes generate user-visible notifications.
- The column cache must be populated once per sync and updated in-memory after evolution.
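
A minimal sketch of the detection step; names and the in-memory cache shape are illustrative:

```rust
use std::collections::HashSet;

/// Compare a record's keys against the cached column set and return the DDL
/// needed for anything new; evolved columns get a permissive TEXT type.
fn evolve_columns(
    table: &str,
    cached_columns: &mut HashSet<String>,
    record: &serde_json::Map<String, serde_json::Value>,
) -> Vec<String> {
    let mut ddl = Vec::new();
    for key in record.keys() {
        if !cached_columns.contains(key) {
            ddl.push(format!("ALTER TABLE {table} ADD COLUMN IF NOT EXISTS {key} TEXT"));
            // Update the cache in-memory so later records in the same sync
            // don't re-detect the same column.
            cached_columns.insert(key.clone());
        }
    }
    ddl
}
```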

### Crash Recovery

If the system crashes or is killed during a sync:
- All sync runs left in `running` state must be transitioned to `failed` on next startup.
- All sync ranges left in `running` state must be transitioned to `failed` on next startup.
- Activity log events (`interrupted`) must be emitted for each recovered range.
- The scheduler must not start normal operation until recovery is complete.
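
A sketch of the recovery pass, run before the scheduler starts (sqlx assumed; table names illustrative):

```rust
use sqlx::PgPool;

/// Anything still marked `running` at startup cannot actually be running —
/// the process just started — so it is transitioned to `failed`.
async fn recover_interrupted(pool: &PgPool) -> sqlx::Result<()> {
    sqlx::query("UPDATE sync_runs SET status = 'failed' WHERE status = 'running'")
        .execute(pool)
        .await?;
    sqlx::query("UPDATE sync_ranges SET status = 'failed' WHERE status = 'running'")
        .execute(pool)
        .await?;
    // An `interrupted` activity-log event would be emitted per recovered
    // range here (omitted in this sketch).
    Ok(())
}
```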

---

## 4. Connector Extensibility

### Declarative Connector Contract

Adding a new API integration via YAML must not require any code changes beyond:
1. Adding the YAML manifest file.
2. Registering the manifest in the connector registry builder.

The declarative engine must support:
- Multiple authentication types: API key (header), Bearer token, Basic Auth, and OAuth2.
- Request construction with configurable headers, base URLs, and query parameters.
- Date filtering via parameterised templates (e.g. `filter=greater-than(updated,{date})`).
- Range-bounded queries with both min and max date parameters.
- Response extraction via configurable JSON paths.
- Cursor-based pagination with configurable next-page path.
- Attribute flattening for nested API response structures (e.g. JSON:API `attributes`).
- Extra static headers (e.g. API versioning headers).
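
A hypothetical manifest illustrating the kinds of fields the engine needs to express; every key name below is illustrative, not the actual manifest schema:

```yaml
# Hypothetical manifest — key names are illustrative only.
id: exampleapi
auth:
  type: api_key
  header: X-Api-Key
base_url: https://api.example.com/v2
extra_headers:
  Accept-Version: "2024-01"
streams:
  - name: events
    path: /events
    params:
      filter: "greater-than(updated,{date})"
    records_path: data
    flatten: attributes
    pagination:
      type: cursor
      next_page_path: links.next
```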

### Coded Connector Contract

Connectors requiring custom logic implement the source connector interface with three operations:
- **metadata** — returns connector identity and auth type.
- **read range** — fetches data for a specific time range given config and stream name.
- **read range paginated** — streams pages through a bounded channel for a time range. Default behaviour wraps the basic read-range as a single page.
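
A sketch of what that contract could look like as a Rust trait (using `async_trait` for illustration; names and signatures are assumptions):

```rust
use async_trait::async_trait;
use chrono::NaiveDateTime;
use tokio::sync::mpsc::Sender;

type Record = serde_json::Value;

struct ConnectorMetadata {
    id: &'static str,
    auth_type: &'static str,
}

#[async_trait]
trait SourceConnector: Send + Sync {
    /// Connector identity and auth type.
    fn metadata(&self) -> ConnectorMetadata;

    /// Fetch all records for one time range of one stream.
    async fn read_range(
        &self,
        config: &serde_json::Value,
        stream: &str,
        start: NaiveDateTime,
        end: NaiveDateTime,
    ) -> anyhow::Result<Vec<Record>>;

    /// Stream pages through a bounded channel for a time range.
    /// Default behaviour wraps the basic read-range as a single page.
    async fn read_range_paginated(
        &self,
        config: &serde_json::Value,
        stream: &str,
        start: NaiveDateTime,
        end: NaiveDateTime,
        pages: Sender<Vec<Record>>,
    ) -> anyhow::Result<()> {
        let records = self.read_range(config, stream, start, end).await?;
        let _ = pages.send(records).await;
        Ok(())
    }
}
```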

Coded connectors must handle their own:
- OAuth2 token injection into requests.
- Complex request construction (e.g. Google Ads GAQL queries, GA4 report requests).
- Custom pagination patterns (offset-based, token-based).
- Response transformation (normalising API-specific response structures to flat JSON records).

### Connector Registry

All connectors (declarative and coded) must be registered in a central registry at startup. The registry provides:
- Connector lookup by ID.
- Stream discovery (listing available streams for a connector).
- Config field metadata (for driving the UI source creation form).
- Connector detail (auth type, stream count, config field definitions).
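
A sketch of the registry shape, assuming the dyn-compatible `SourceConnector` trait sketched earlier:

```rust
use std::collections::HashMap;

/// Built once at startup; declarative and coded connectors both register here.
struct ConnectorRegistry {
    connectors: HashMap<String, Box<dyn SourceConnector>>,
}

impl ConnectorRegistry {
    /// Connector lookup by ID (stream discovery and config-field metadata
    /// would hang off the same entry).
    fn get(&self, id: &str) -> Option<&dyn SourceConnector> {
        self.connectors.get(id).map(|c| c.as_ref())
    }
}
```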

---

## 5. Destination Abstraction

### Writer Contract

All destinations implement a common writer interface with three operations:
- **ensure table** — given a table name, primary key, and a sample record, create the target structure if it doesn't exist.
- **write batch** — write a batch of records to the named table. Returns the number written.
- **cleanup range** — delete previously written data for a date range before rewrite. Default is a no-op (for upsert-capable destinations).
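
A sketch of the writer surface (again via `async_trait`; names are illustrative):

```rust
use async_trait::async_trait;
use chrono::NaiveDateTime;

type Record = serde_json::Value;

#[async_trait]
trait DestinationWriter: Send + Sync {
    /// Create the target structure if it doesn't exist, inferred from a sample record.
    async fn ensure_table(&self, table: &str, primary_key: &str, sample: &Record)
        -> anyhow::Result<()>;

    /// Write one batch to the named table; returns the number of records written.
    async fn write_batch(&self, table: &str, batch: &[Record]) -> anyhow::Result<usize>;

    /// Delete previously written data for a date range before rewrite.
    /// Default is a no-op for upsert-capable destinations.
    async fn cleanup_range(
        &self,
        _table: &str,
        _start: NaiveDateTime,
        _end: NaiveDateTime,
    ) -> anyhow::Result<()> {
        Ok(())
    }
}
```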

### Writer Factory

A centralised factory function **must** construct the appropriate writer from destination type and config. This eliminates duplication between API-triggered syncs and scheduler-triggered syncs.
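
A sketch of the single construction point; the concrete writer types are hypothetical implementations of the trait above:

```rust
/// Shared by API-triggered and scheduler-triggered syncs.
fn build_writer(
    destination_type: &str,
    config: &serde_json::Value,
) -> anyhow::Result<Box<dyn DestinationWriter>> {
    match destination_type {
        "postgres" => Ok(Box::new(PostgresWriter::from_config(config)?)),    // hypothetical
        "s3" => Ok(Box::new(S3Writer::from_config(config)?)),                // hypothetical
        "local_file" => Ok(Box::new(LocalFileWriter::from_config(config)?)), // hypothetical
        other => anyhow::bail!("unknown destination type: {other}"),
    }
}
```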

### Object Store Partitioning

File-based destinations (S3, local file) must support three path layout strategies:
- **Flat** — single file per table.
- **Date-partitioned** — Hive-style `dt=YYYY-MM-DD` directories.
- **Hour-partitioned** — Hive-style `dt=YYYY-MM-DD/hr=HH` directories.

Partition cleanup before resync must delete all files within the target partition prefix.
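
A sketch of the three layouts; exact prefix shapes beyond the `dt=`/`hr=` directories are illustrative:

```rust
use chrono::NaiveDateTime;

enum PathLayout {
    Flat,
    DatePartitioned,
    HourPartitioned,
}

/// Prefix under which a range's files are written — and deleted before resync.
fn partition_prefix(layout: &PathLayout, table: &str, ts: NaiveDateTime) -> String {
    match layout {
        PathLayout::Flat => format!("{table}/"),
        PathLayout::DatePartitioned => format!("{table}/dt={}/", ts.format("%Y-%m-%d")),
        PathLayout::HourPartitioned => {
            format!("{table}/dt={}/hr={}/", ts.format("%Y-%m-%d"), ts.format("%H"))
        }
    }
}
```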

### Format Pluggability

Object store writes must support multiple serialisation formats:
- JSONL (newline-delimited JSON)
- Parquet (columnar, compressed)
- Avro (schema-embedded binary)

The format is selected per-destination configuration and is independent of the storage backend.

---

## 6. Scheduling Constraints

### Cron Evaluation

The scheduler must evaluate standard cron expressions. It must prevent duplicate triggers — a stream must not be triggered again while a sync is already running for it. The scheduler evaluates the cron expression against the last run's start time to determine whether a tick has been missed since then.
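
A sketch of the due check, assuming the `cron` crate (the crate choice is not mandated by this document):

```rust
use std::str::FromStr;
use chrono::{DateTime, Utc};
use cron::Schedule;

/// A stream is due when a scheduled tick falls between the last run's start
/// and now — and no sync is currently running for it.
fn is_due(expr: &str, last_run_start: DateTime<Utc>, already_running: bool) -> bool {
    if already_running {
        return false; // never double-trigger a stream
    }
    let Ok(schedule) = Schedule::from_str(expr) else {
        return false;
    };
    schedule
        .after(&last_run_start)
        .next()
        .map(|next_tick| next_tick <= Utc::now())
        .unwrap_or(false)
}
```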

### Concurrency Control

The scheduler must enforce two-level concurrency limits:
- A global limit on total concurrent syncs.
- A per-source limit on concurrent syncs from a single connector type.

Both limits must be adjustable at runtime (read from the database on every scheduler tick). Permits must be held for the duration of the sync and automatically released when the sync completes or fails.
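
A sketch of the permit flow with Tokio semaphores (crate choice assumed); because the limits are re-read from the database, the scheduler would rebuild or resize the semaphores when settings change (not shown):

```rust
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

/// Both permits are held by the sync task for its whole lifetime; dropping
/// them — on completion, failure, or panic unwind — releases the slots.
async fn acquire_permits(
    global: Arc<Semaphore>,
    per_source: Arc<Semaphore>,
) -> anyhow::Result<(OwnedSemaphorePermit, OwnedSemaphorePermit)> {
    let g = global.acquire_owned().await?;
    let s = per_source.acquire_owned().await?;
    Ok((g, s))
}
```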

### Retry Eligibility

The scheduler must check retry eligibility before triggering a sync for a stream with failed ranges:
- Ranges that have exceeded `retry_max_attempts` are considered exhausted and must not block the stream.
- Ranges still within their retry count must respect the exponential backoff period — the scheduler must not trigger a sync while any non-exhausted range is in its backoff window.
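
A sketch of the per-range check (the base delay and doubling factor are illustrative):

```rust
use chrono::{DateTime, Duration, Utc};

/// Returns true while this range should hold back scheduling for its stream.
fn range_blocks_stream(
    attempts: u32,
    retry_max_attempts: u32,
    last_attempt_at: DateTime<Utc>,
    base_backoff: Duration,
) -> bool {
    if attempts >= retry_max_attempts {
        return false; // exhausted: never blocks the stream
    }
    // Exponential backoff: base * 2^(attempts - 1). The stream is blocked
    // while the range is still inside its backoff window.
    let backoff = base_backoff * 2_i32.pow(attempts.saturating_sub(1));
    Utc::now() < last_attempt_at + backoff
}
```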

### OAuth2 Token Lifecycle

The scheduler must proactively refresh OAuth2 tokens before they expire. Requirements:
- Check all OAuth2 sources on every tick.
- Refresh any token expiring within 5 minutes.
- On refresh failure, generate a user-visible notification.
- On grant revocation (`invalid_grant`), generate a specific re-authentication notification.
- Updated tokens must be persisted back to the source config in-place.
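
A sketch of the per-tick expiry check (the 5-minute window mirrors the requirement above):

```rust
use chrono::{DateTime, Duration, Utc};

/// Checked for every OAuth2 source on every scheduler tick.
fn needs_refresh(access_token_expires_at: DateTime<Utc>) -> bool {
    access_token_expires_at - Utc::now() < Duration::minutes(5)
}
```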

---

## 7. API Contract

### REST + JSON

The API exposes a RESTful interface with JSON payloads. All endpoints are mounted under `/api`. The frontend SPA is served from the root `/` with a catch-all fallback for client-side routing.

### CORS

The API must support cross-origin requests (for development mode where frontend and backend run on different ports).

### Authentication

All API endpoints (except health check) must require authentication via:
1. `Authorization: Bearer <api_key>` header, OR
2. `derrick_session` cookie containing a valid session UUID.

Authentication can be globally disabled via environment variable for development.

### Connection Testing

The destination creation flow must support a **test connection** endpoint that validates credentials before saving. Each destination type implements its own test logic.

### Bulk Operations

The streams API must support bulk creation — creating multiple streams at once for a connection, avoiding the n+1 round-trip problem when setting up a new integration.

---

## 8. Security Constraints

- API keys must never be stored in plaintext. Only SHA-256 hashes are persisted (see the sketch after this list).
- Destination credentials (database passwords, AWS secret keys, Kafka passwords) are stored in the database config JSON blob. In a production deployment, the database itself must be secured.
- OAuth2 client secrets are stored in source configs — the system assumes the state database is a trusted store.
- The auth bypass mode (`DERRICK_AUTH_DISABLED`) must log a prominent warning and must never be used in production.
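
A sketch of the key-hashing rule from the first item, assuming the `sha2` and `hex` crates:

```rust
use sha2::{Digest, Sha256};

/// Only this hex-encoded digest is ever persisted; presented keys are hashed
/// and compared against it.
fn hash_api_key(key: &str) -> String {
    hex::encode(Sha256::digest(key.as_bytes()))
}
```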

---

## 9. Observability Constraints

### Structured Logging

The system must support two logging modes:
- Human-readable (default for development).
- Structured JSON (for production log aggregation).

Logs must include contextual fields: `run_id`, `stream_id`, `stream_name`, `source_name`, `range`, `error_category`, `records_synced`, `bytes_synced`, `duration_ms`.
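
A sketch of the mode switch, assuming `tracing-subscriber` (the crate choice is an assumption):

```rust
/// Human-readable output by default; structured JSON for log aggregation.
fn init_logging(json: bool) {
    if json {
        tracing_subscriber::fmt().json().init();
    } else {
        tracing_subscriber::fmt().init();
    }
}
```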

### Activity Log

Range-level events must be persisted to an append-only activity log. This log must survive process restarts and serve as the source of truth for operational debugging. Event emission is fire-and-forget — logging failures must not block sync operations.

### Health Check

A health endpoint must report:
- Database connectivity.
- Scheduler status (active syncs by source).
- Runtime configuration (embedded PG URL if applicable).

---

## 10. Scalability Boundaries

### Single-Process Architecture

Derrick is designed as a **single-process** system. There is one scheduler, one set of concurrency permits, one cancellation registry. This simplifies reasoning about state but imposes limits:
- Total throughput is bounded by the process's async runtime capacity and the concurrency settings.
- Horizontal scaling would require architectural changes (distributed locks, external job queues).

### Database as the Coordination Layer

The relational database serves as the single source of truth for:
- Pipeline configuration (sources, destinations, streams).
- Sync state (runs, ranges, range status).
- Settings and concurrency configuration.
- Auth state (keys, sessions).

This means the database is the bottleneck for coordination. The system assumes database performance is sufficient for the operational load (hundreds of concurrent range-status updates per second).

### API Rate Limiting

The system does not impose its own rate limits on outbound API calls. It relies on:
- Per-source concurrency limits (limiting parallel syncs to the same API).
- Retries with exponential backoff on `429 Too Many Requests` responses.
- The assumption that per-source concurrency limits are configured to respect each API provider's rate limits.

---

## 11. Table Naming and Namespace Resolution

When writing data, the system constructs the target table name from three components:

1. **Namespace** — A schema or prefix grouping (defaults to `derrick`). Can be overridden at the connection level.
2. **Stream prefix** — A prefix prepended to the stream name (defaults to the source name, can be overridden or suppressed).
3. **Stream name** — The API entity name (e.g. `events`, `orders`).

The final table name follows the pattern: `{namespace}.{prefix}{stream_name}`.

When a namespace is explicitly set, the source name is not auto-prefixed — the namespace provides sufficient context. When using the default namespace, the source name is auto-prefixed to prevent collisions between different sources writing to the same destination.
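
A sketch of the resolution rule; the underscore separator for the auto-prefix is an assumption:

```rust
/// `namespace: None` means the default (`derrick`), which triggers the
/// source-name auto-prefix; an explicit namespace suppresses it.
fn resolve_table_name(
    namespace: Option<&str>,
    source_name: &str,
    prefix_override: Option<&str>,
    stream_name: &str,
) -> String {
    match namespace {
        Some(ns) => {
            let prefix = prefix_override.unwrap_or("");
            format!("{ns}.{prefix}{stream_name}")
        }
        None => {
            // Auto-prefix with the source name to avoid collisions between
            // different sources writing to the same destination.
            let prefix = prefix_override
                .map(|p| p.to_string())
                .unwrap_or_else(|| format!("{source_name}_"));
            format!("derrick.{prefix}{stream_name}")
        }
    }
}
// e.g. resolve_table_name(None, "exampleapi", None, "orders") -> "derrick.exampleapi_orders"
```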
