Pipeline Building & ETL
Build, test, and debug data pipelines in MarcoPolo.
Build and test data pipelines by pulling data from sources, transforming it, and loading it somewhere useful. Your workspace has Python, shell access, and DuckDB, so the full pipeline runs in one place.
Building a pipeline through conversation
User: Pull all orders from Snowflake and all customers. Join them, compute total revenue per customer, and export as Parquet.
AI calls query("snowflake-demo-datasource", "queries/snowflake-demo-datasource/all_orders.sql")
-> 17 orders -> cached in DuckDB table: all_orders
AI calls query("snowflake-demo-datasource", "queries/snowflake-demo-datasource/all_customers.sql")
-> 10 customers -> cached in DuckDB table: all_customers
AI writes queries/DUCKDB/revenue_per_customer.sql
AI calls query("DUCKDB", "queries/DUCKDB/revenue_per_customer.sql")
-> 10 rows -> cached in DuckDB table: revenue_per_customer
AI: Revenue per customer:
Fiona Green: $3,450.80 (3 orders)
Charlie Brown: $2,340.00 (2 orders)
Jessica Martinez: $2,100.00 (2 orders)
...
Ethan Hunt: $123.00 (1 order, cancelled)
All 10 customers matched. No orphaned orders.
User: Validate: are we losing any rows in the join?
AI writes queries/DUCKDB/join_validation.sql
AI calls query("DUCKDB", "queries/DUCKDB/join_validation.sql")
AI: All 17 orders matched to a customer. No rows dropped.
All 10 customers appear (Ethan Hunt has 1 cancelled order).
User: Export as Parquet.
AI calls execute_command("python3 export_parquet.py")
-> Exported to /workspace/exports/revenue_per_customer.parquet
AI: Exported. File is available in your workspace web app at
exports/revenue_per_customer.parquet

The workspace as a staging area
The pattern for pipeline work in MarcoPolo:
Pull: query data from multiple sources into DuckDB. Each source is queried independently and results are cached locally.
Transform: your AI writes DuckDB SQL or Python scripts for cleaning, joining, deduplication, and computed columns. Intermediate results are saved.
Validate: compare row counts before and after. Check join cardinality. Verify no data was lost.
Output: export as Parquet, CSV, or write to a DuckDB table for future sessions.
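The transform, validate, and output steps above can all be expressed as DuckDB SQL. The sketch below mirrors the transcript's revenue pipeline; the table and column names (`customer_id`, `order_total`, and so on) are illustrative assumptions, not the actual schema.

```sql
-- Transform: join the cached source tables and compute revenue per customer.
CREATE OR REPLACE TABLE revenue_per_customer AS
SELECT
    c.customer_name,
    COUNT(o.order_id)  AS order_count,
    SUM(o.order_total) AS total_revenue
FROM all_customers AS c
LEFT JOIN all_orders AS o
  ON o.customer_id = c.customer_id
GROUP BY c.customer_name;

-- Validate: compare input and output row counts.
SELECT
    (SELECT COUNT(*) FROM all_orders)           AS orders_in,
    (SELECT COUNT(*) FROM all_customers)        AS customers_in,
    (SELECT COUNT(*) FROM revenue_per_customer) AS customers_out;

-- Output: DuckDB writes Parquet directly via COPY.
COPY revenue_per_customer TO 'exports/revenue_per_customer.parquet' (FORMAT PARQUET);
```

Using a `LEFT JOIN` from customers keeps customers with zero orders in the output, which is why the customer count in and out should match.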
How real teams use this
Manufacturing data export. A QC team on Microsoft Fabric runs yield analysis queries, transforms the results in DuckDB, and exports as CSVs for downstream tools. The query files and scripts persist in the workspace - they re-run the same pipeline weekly with updated data.
Customer activity ETL. A RevOps team pulls closed-won opportunities from Salesforce and usage events from Athena, joins them in DuckDB, computes days-to-first-login per account, and exports as Parquet. The pipeline script is saved to the workspace for reuse.
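The days-to-first-login computation in the RevOps example might look roughly like the following DuckDB SQL. The table and column names (`won_opportunities`, `usage_events`, `event_type`) are assumptions for illustration, not the team's actual schema.

```sql
-- Sketch: days from closed-won to first login, per account.
SELECT
    w.account_id,
    DATE_DIFF('day', w.close_date, MIN(e.event_time)) AS days_to_first_login
FROM won_opportunities AS w
JOIN usage_events AS e
  ON e.account_id = w.account_id
 AND e.event_type = 'login'
 AND e.event_time >= w.close_date
GROUP BY w.account_id, w.close_date;
```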
Best practices
Save useful scripts. If you build a transformation that works, it's already saved in your workspace. Re-run or adapt it later by asking your AI.
Validate at every step. Ask your AI to check row counts and data integrity after each transformation. Catching a bad join early saves debugging later.
Use DuckDB for joins. DuckDB is fast at analytical operations. Joining data from multiple sources in DuckDB is often faster than running the join in your warehouse, and it consumes no warehouse compute.
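A common way to validate a join, as suggested above, is an anti-join that surfaces rows an inner join would silently drop. A minimal sketch, assuming the transcript's `all_orders` and `all_customers` tables:

```sql
-- Find orphaned orders: order rows whose customer_id has no matching customer.
SELECT o.order_id, o.customer_id
FROM all_orders AS o
LEFT JOIN all_customers AS c
  ON c.customer_id = o.customer_id
WHERE c.customer_id IS NULL;
-- Zero rows returned means every order matched a customer and the join is safe.
```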