Pipeline Building & ETL
Build, test, and debug data pipelines in MarcoPolo.
Build and test data pipelines by pulling data from sources, transforming it, and loading it somewhere useful. Your workspace has Python, shell access, and DuckDB, so the full pipeline runs in one place.
Building a pipeline through conversation
User: Pull all orders from Snowflake and all customers. Join them, compute total revenue per customer, and export as Parquet.
AI calls query("snowflake-demo-datasource", "queries/snowflake-demo-datasource/all_orders.sql")
-> 17 orders -> cached in DuckDB table: all_orders
AI calls query("snowflake-demo-datasource", "queries/snowflake-demo-datasource/all_customers.sql")
-> 10 customers -> cached in DuckDB table: all_customers
AI writes queries/DUCKDB/revenue_per_customer.sql
AI calls query("DUCKDB", "queries/DUCKDB/revenue_per_customer.sql")
-> 10 rows -> cached in DuckDB table: revenue_per_customer
AI: Revenue per customer:
Fiona Green: $3,450.80 (3 orders)
Charlie Brown: $2,340.00 (2 orders)
Jessica Martinez: $2,100.00 (2 orders)
...
Ethan Hunt: $123.00 (1 order, cancelled)
All 10 customers matched. No orphaned orders.
User: Validate: are we losing any rows in the join?
AI writes queries/DUCKDB/join_validation.sql
AI calls query("DUCKDB", "queries/DUCKDB/join_validation.sql")
AI: All 17 orders matched to a customer. No rows dropped.
All 10 customers appear (Ethan Hunt has 1 cancelled order).
User: Export as Parquet.
AI calls execute_command("python3 export_parquet.py")
-> Exported to /workspace/exports/revenue_per_customer.parquet
AI: Exported. File is available in your workspace web app at
exports/revenue_per_customer.parquet

The workspace as a staging area
The pattern for pipeline work in MarcoPolo:
Pull: query data from multiple sources into DuckDB. Each source is queried independently and results are cached locally.
Transform: your AI writes DuckDB SQL or Python scripts for cleaning, joining, deduplication, and computed columns. Intermediate results are saved.
Validate: compare row counts before and after. Check join cardinality. Verify no data was lost.
Output: export as Parquet, CSV, or write to a DuckDB table for future sessions.
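The transform, validate, and output steps above can all be expressed as DuckDB SQL. The sketch below mirrors the transcript's revenue pipeline; the table and column names (`customer_id`, `order_total`, and so on) are illustrative assumptions, not the actual schema.

```sql
-- Transform: join the cached source tables and compute revenue per customer.
CREATE OR REPLACE TABLE revenue_per_customer AS
SELECT
    c.customer_name,
    COUNT(o.order_id)  AS order_count,
    SUM(o.order_total) AS total_revenue
FROM all_customers AS c
LEFT JOIN all_orders AS o
  ON o.customer_id = c.customer_id
GROUP BY c.customer_name;

-- Validate: compare input and output row counts.
SELECT
    (SELECT COUNT(*) FROM all_orders)           AS orders_in,
    (SELECT COUNT(*) FROM all_customers)        AS customers_in,
    (SELECT COUNT(*) FROM revenue_per_customer) AS customers_out;

-- Output: DuckDB writes Parquet directly via COPY.
COPY revenue_per_customer TO 'exports/revenue_per_customer.parquet' (FORMAT PARQUET);
```

Using a `LEFT JOIN` from customers keeps customers with zero orders in the output, which is why the customer count in and out should match.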
How real teams use this
Manufacturing data export. A QC team on Microsoft Fabric runs yield analysis queries, transforms the results in DuckDB, and exports as CSVs for downstream tools. The query files and scripts persist in the workspace - they re-run the same pipeline weekly with updated data.
Customer activity ETL. A RevOps team pulls closed-won opportunities from Salesforce and usage events from Athena, joins them in DuckDB, computes days-to-first-login per account, and exports as Parquet. The pipeline script is saved to the workspace for reuse.
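The days-to-first-login computation in the RevOps example might look roughly like the following DuckDB SQL. The table and column names (`won_opportunities`, `usage_events`, `event_type`) are assumptions for illustration, not the team's actual schema.

```sql
-- Sketch: days from closed-won to first login, per account.
SELECT
    w.account_id,
    DATE_DIFF('day', w.close_date, MIN(e.event_time)) AS days_to_first_login
FROM won_opportunities AS w
JOIN usage_events AS e
  ON e.account_id = w.account_id
 AND e.event_type = 'login'
 AND e.event_time >= w.close_date
GROUP BY w.account_id, w.close_date;
```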
Best practices
Save useful scripts. If you build a transformation that works, it's already saved in your workspace. Re-run or adapt it later by asking your AI.
Validate at every step. Ask your AI to check row counts and data integrity after each transformation. Catching a bad join early saves debugging later.
Use DuckDB for joins. DuckDB is fast at analytical operations. Joining data from multiple sources in DuckDB is often faster than running the join in your warehouse, and it consumes no warehouse compute.
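A common way to validate a join, as suggested above, is an anti-join that surfaces rows an inner join would silently drop. A minimal sketch, assuming the transcript's `all_orders` and `all_customers` tables:

```sql
-- Find orphaned orders: order rows whose customer_id has no matching customer.
SELECT o.order_id, o.customer_id
FROM all_orders AS o
LEFT JOIN all_customers AS c
  ON c.customer_id = o.customer_id
WHERE c.customer_id IS NULL;
-- Zero rows returned means every order matched a customer and the join is safe.
```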