Use Cases

Pipeline Building & ETL

Build, test, and debug data pipelines in MarcoPolo.

Pull data from sources, transform it, and load it somewhere useful. Your workspace has Python, shell access, and DuckDB, so the full pipeline runs in one place.

Building a pipeline through conversation

Pull all orders and all customers from Snowflake. Join them, compute total revenue per customer, and export as Parquet.

AI calls query("snowflake-demo-datasource", "queries/snowflake-demo-datasource/all_orders.sql")
   -> 17 orders -> cached in DuckDB table: all_orders

AI calls query("snowflake-demo-datasource", "queries/snowflake-demo-datasource/all_customers.sql")
   -> 10 customers -> cached in DuckDB table: all_customers

AI writes queries/DUCKDB/revenue_per_customer.sql
AI calls query("DUCKDB", "queries/DUCKDB/revenue_per_customer.sql")
   -> 10 rows -> cached in DuckDB table: revenue_per_customer

AI: Revenue per customer:
    Fiona Green:      $3,450.80 (3 orders)
    Charlie Brown:    $2,340.00 (2 orders)
    Jessica Martinez: $2,100.00 (2 orders)
    ...
    Ethan Hunt:       $123.00   (1 order, cancelled)

    All 10 customers matched. No orphaned orders.

Validate: are we losing any rows in the join?

AI writes queries/DUCKDB/join_validation.sql
AI calls query("DUCKDB", "queries/DUCKDB/join_validation.sql")

AI: All 17 orders matched to a customer. No rows dropped.
    All 10 customers appear (Ethan Hunt has 1 cancelled order).

Export as Parquet.

AI calls execute_command("python3 export_parquet.py")
   -> Exported to /workspace/exports/revenue_per_customer.parquet

AI: Exported. File is available in your workspace web app at
    exports/revenue_per_customer.parquet

The workspace as a staging area

The pattern for pipeline work in MarcoPolo:

Pull: query data from multiple sources into DuckDB. Each source is queried independently and results are cached locally.

Transform: your AI writes DuckDB SQL or Python scripts for cleaning, joining, deduplication, and computed columns. Intermediate results are saved.

Validate: compare row counts before and after. Check join cardinality. Verify no data was lost.

Output: export as Parquet, CSV, or write to a DuckDB table for future sessions.

How real teams use this

Manufacturing data export. A QC team on Microsoft Fabric runs yield analysis queries, transforms the results in DuckDB, and exports as CSVs for downstream tools. The query files and scripts persist in the workspace - they re-run the same pipeline weekly with updated data.

Customer activity ETL. A RevOps team pulls closed-won opportunities from Salesforce and usage events from Athena, joins them in DuckDB, computes days-to-first-login per account, and exports as Parquet. The pipeline script is saved to the workspace for reuse.

Best practices

Save useful scripts. If you build a transformation that works, it's already saved in your workspace. Re-run or adapt it later by asking your AI.

Validate at every step. Ask your AI to check row counts and data integrity after each transformation. Catching a bad join early saves debugging later.

Use DuckDB for joins. DuckDB is fast at analytical operations. Joining data from multiple sources in DuckDB is often faster than doing it in your warehouse, and it incurs no warehouse compute cost.
