Getting Started

Get from zero to a working data quality pipeline in 5 minutes. No cloud account required — everything runs locally with DuckDB.

1. Install

pip install aegis-dq

Verify:

aegis --help

You should see the init, validate, run, audit, and rules commands listed.

2. Seed demo data

Create a local DuckDB file with 10 000 orders and intentional data quality issues baked in:

# seed.py
import duckdb

con = duckdb.connect("demo.db")

con.execute("""
    CREATE TABLE orders AS
    SELECT
        i          AS order_id,
        'placed'   AS status,
        i * 9.99   AS revenue
    FROM range(1, 10001) t(i)
""")

# Inject nulls: every 200th order loses its order_id
con.execute("UPDATE orders SET order_id = NULL WHERE order_id % 200 = 0")

# Inject negatives: every 500th order has negative revenue
con.execute("UPDATE orders SET revenue = -5.00 WHERE order_id % 500 = 0")

con.close()
print("Created demo.db  (10 000 rows, 50 null order_ids, 20 negative revenues)")

python seed.py
# Created demo.db  (10 000 rows, 50 null order_ids, 20 negative revenues)

DuckDB ships as a Python wheel — no separate install needed.

3. Create rules.yaml

Paste this into rules.yaml in your working directory:

rules:

  - apiVersion: aegis.dev/v1
    kind: DataQualityRule
    metadata:
      id: orders_order_id_not_null
      severity: critical
      domain: retail
      owner: data-platform
      description: Every order must have an order_id
    scope:
      warehouse: duckdb
      table: orders
      columns: [order_id]
    logic:
      type: not_null
    diagnosis:
      common_causes:
        - ETL pipeline failed mid-load
        - Source system sent partial records

  - apiVersion: aegis.dev/v1
    kind: DataQualityRule
    metadata:
      id: orders_revenue_positive
      severity: high
      domain: retail
      owner: revenue-team
      description: Revenue must be non-negative
    scope:
      warehouse: duckdb
      table: orders
    logic:
      type: sql_expression
      expression: "revenue >= 0"
    diagnosis:
      common_causes:
        - Refund logic inverted the sign
        - Currency conversion failure

  - apiVersion: aegis.dev/v1
    kind: DataQualityRule
    metadata:
      id: orders_minimum_rows
      severity: medium
      domain: retail
      description: Orders table must have at least 1 000 rows
    scope:
      warehouse: duckdb
      table: orders
    logic:
      type: row_count
      threshold: 1000

4. Validate syntax offline

Before touching any data, confirm your rules are correctly formed:

aegis validate rules.yaml

Expected output:

Aegis validate — rules.yaml

  ✓ orders_order_id_not_null
  ✓ orders_revenue_positive
  ✓ orders_minimum_rows

All 3 rule(s) valid.

Errors (✗) must be fixed before running. Warnings (⚠) are informational and do not block execution.

5. Run without LLM

Run your rules against the demo database in offline mode — no API key needed:

aegis run rules.yaml --db demo.db --no-llm

Expected output:

Aegis DQ — loading rules from rules.yaml
Loaded 3 rules  •  warehouse: duckdb  •  llm: disabled

Running pipeline: plan → execute → reconcile → classify → diagnose → rca → report

  ✓  orders_minimum_rows          passed    10 000 rows
  ✗  orders_order_id_not_null     FAILED    50 / 10 000   critical
  ✗  orders_revenue_positive      FAILED    20 / 10 000   high


   Aegis Validation Report
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric              ┃ Value                         ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Rules checked       │ 3                             │
│ Passed              │ 1                             │
│ Failed              │ 2                             │
│ Pass rate           │ 33.3%                         │
│ LLM cost            │ $0.000000                     │
└─────────────────────┴───────────────────────────────┘

Exit code: 1  (failures detected)

The process exits with code 1 whenever any rule fails — useful for blocking CI pipelines.

6. Run with LLM diagnosis

Set your Anthropic API key and re-run. The diagnose and rca nodes will now call the LLM for each failure:

export ANTHROPIC_API_KEY=sk-ant-...
aegis run rules.yaml --db demo.db

AWS Bedrock (no API key — uses AWS credentials profile):

# Uses the Bedrock Converse API with Amazon Nova Pro (no use-case form needed)
python demo/realworld_demo.py --aws-profile your-aws-profile

Expected output (additional LLM section appended after the summary table):

LLM Diagnosis
─────────────────────────────────────────────────────────────────────
Rule:    orders_order_id_not_null  (critical)
Table:   orders
Failed:  50 rows (0.5% of 10 000)

Explanation:
  50 rows in the orders table have a NULL order_id. Downstream joins
  on order_id will silently drop these rows, causing undercounting in
  revenue reports.

Likely cause:
  The ETL pipeline loads from the source OLTP database without a
  NOT NULL guard. When the source emits a partial record (e.g. a
  cart-abandonment event), order_id is omitted and lands as NULL.

Recommended action:
  1. Run: SELECT * FROM orders WHERE order_id IS NULL LIMIT 20
  2. Check ETL logs for the most recent ingestion window
  3. Add a NOT NULL constraint or COALESCE guard in staging
─────────────────────────────────────────────────────────────────────
Rule:    orders_revenue_positive  (high)
Table:   orders
Failed:  20 rows (0.2% of 10 000)

Explanation:
  20 rows have revenue = -5.00, violating the non-negative revenue
  constraint.

Likely cause:
  Refund processing logic may have inverted the sign rather than
  recording refunds in a separate table.

Recommended action:
  1. Run: SELECT * FROM orders WHERE revenue < 0 LIMIT 20
  2. Verify refund handling in the ETL transform
─────────────────────────────────────────────────────────────────────

LLM cost: $0.000412  (claude-haiku-4-5, 2 diagnoses)

7. Use a local LLM (no API key)

If you have Ollama running locally, you can run diagnosis entirely offline:

# requires ollama running locally with llama3.2 pulled
aegis run rules.yaml --db demo.db --llm ollama --llm-model llama3.2

Ollama runs on http://localhost:11434 by default. To use a different host:

aegis run rules.yaml --db demo.db --llm ollama --llm-model llama3.2 \
  --llm-base-url http://my-ollama-host:11434

8. Inspect the audit trail

Every run writes to ~/.aegis/history.db. Use the audit subcommands to explore it:

# List all runs (newest first)
aegis audit list-runs

# Show the full node-by-node trajectory for a specific run
aegis audit trajectory run_20260511_143022_a1b2c3

# Full-text search across all LLM decisions
aegis audit search "null order_id"

Example list-runs output:

  run_id                           started              rules  passed  failed
  run_20260511_143022_a1b2c3      2026-05-11 14:30:22      3       1       2
  run_20260510_091500_d4e5f6      2026-05-10 09:15:00      3       3       0

9. Export for fine-tuning

Dump the audit trail for a run as ShareGPT-format JSONL, ready for supervised fine-tuning:

aegis audit export-dataset output.jsonl --run-id run_20260511_143022_a1b2c3

Omit --run-id to export all runs:

aegis audit export-dataset output.jsonl

Each line in output.jsonl is one conversation turn: the rule context as a user message and the LLM diagnosis as an assistant message.

10. Real-world end-to-end demo

The repository ships a complete RetailCo e-commerce demo that exercises every pipeline node against a 4-table DuckDB database with realistic dirty data. Use it to see the full agentic output — diagnosis, root-cause analysis, and LLM-generated remediation SQL — in one command.

# Validation only (no LLM, instant)
python demo/realworld_demo.py --no-llm

# Full pipeline with AWS Bedrock (requires ~/.aws/credentials profile)
python demo/realworld_demo.py --aws-profile your-profile

# Full pipeline with Anthropic
export ANTHROPIC_API_KEY=sk-ant-...
python demo/realworld_demo.py --no-llm   # swap in AnthropicAdapter in the script

What runs: - 4 tables — customers, products, orders, payments (10 rows each, dirty data injected) - 12 rules — not_null, not_empty_string, accepted_values, sql_expression, unique, min_value_check, foreign_key, date_order - 11 failures detected: NULL email, empty email, invalid tier, negative price, duplicate SKU, negative stock, orphan FK, date inversion, invalid status, orphan payment, negative payment - Full LLM output per failure: explanation + likely_cause + suggested_action + root_cause + propagation + fix + proposed SQL - Run time: ~7s | LLM cost: ~$0.006 (Amazon Nova Pro) | Tokens: ~3,600

The demo script is at demo/realworld_demo.py.

11. Next steps

Rule Schema Reference — all 31 rule types with full field definitions
Architecture — deep dive into the 7-node pipeline
dbt Integration — auto-generate Aegis rules from your dbt manifest
Airflow Integration — run Aegis as an Airflow operator
MCP Server — use Aegis as a Claude Desktop tool
vs Competitors — how Aegis compares to Great Expectations, Soda, and Monte Carlo

12. Generate rules with the LLM (v0.6.0)

Instead of writing rules by hand, let Aegis introspect your table schema and generate a draft rules file:

aegis generate orders --db demo.db --output orders_rules.yaml

This generates structural rules based on what Aegis observes in the schema: not_null on non-nullable columns, between from observed min/max, unique on ID columns, null percentage thresholds.

Business validation rules with --kb

Pass a plain-text or markdown file describing your business logic and the LLM generates business validation rules alongside the structural ones:

aegis generate orders --db demo.db \
  --kb docs/orders_policy.md \
  --output orders_rules.yaml

Example KB file (docs/orders_policy.md):

- status must be one of: placed, confirmed, shipped, delivered, cancelled
- amount must be greater than 0; refunds are handled in a separate table
- customer_id must reference a valid customer (no test accounts: id > 1000)
- order_date must not be in the future
- discount_pct must be between 0 and 0.5 (max 50% discount)
- email must match standard email format

From this, Aegis generates rules like:

- logic:
    type: accepted_values
    values: [placed, confirmed, shipped, delivered, cancelled]
- logic:
    type: sql_expression
    expression: "amount > 0"
- logic:
    type: between
    min_value: 0
    max_value: 0.5
- logic:
    type: regex_match
    pattern: "^[^@]+@[^@]+\\.[^@]+$"

Generated rules are stamped status: draft and generated_by: <model>. Review them, promote to active, and commit to version control.

13. Validate SQL expressions (v0.6.0)

Run the SQL verification pipeline against your rules without executing a full run:

# Stage 1 — syntax only (no DB needed)
aegis validate rules.yaml --check-sql

# Stages 1-3 — syntax + schema + dry-run
aegis validate rules.yaml --db demo.db

Any sql_expression or custom_sql rule with a broken expression is caught here before it reaches production.