Use Cases

Data Exploration & Profiling

Get oriented in unfamiliar data sources using MarcoPolo.

You just connected a data source and have no idea what's in it. Or you inherited a warehouse with 200 tables and need to figure out which ones matter. This is where exploration starts.

Exploring the Snowflake demo in Claude Code

Here's a real exploration session against the Snowflake demo data source that ships with every MarcoPolo workspace.

What's in the Snowflake demo data source?

AI calls get_schema("snowflake-demo-datasource", database="ANALYTICS")
   -> 3 tables: CUSTOMERS (10 rows), ORDERS (17 rows), PRODUCT_METRICS (16 rows)

Profile the ORDERS table. Show me column types, null rates, and value distributions.

AI calls get_schema("snowflake-demo-datasource", database="ANALYTICS", table="ORDERS")
   -> 8 columns: ORDER_ID, CUSTOMER_ID, ORDER_DATE, ORDER_AMOUNT,
      ORDER_STATUS, SHIPPING_COUNTRY, PRODUCT_CATEGORY, CREATED_AT

AI writes queries/snowflake-demo-datasource/profile_orders.sql
AI calls query("snowflake-demo-datasource", "queries/snowflake-demo-datasource/profile_orders.sql")
   -> Results cached in DuckDB table: profile_orders

AI: ORDERS: 17 rows, date range 2024-10-05 to 2024-12-15
    - 10 unique customers, 9 countries, 5 product categories
    - ORDER_AMOUNT: $123.00 to $1,750.40, avg $808.06
    - No NULLs in any column
    - 4 statuses: DELIVERED (11), SHIPPED (3), PENDING (2), CANCELLED (1)
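The transcript doesn't show the profiling query itself. A minimal sketch of what queries/snowflake-demo-datasource/profile_orders.sql might contain, using column names from the schema call above (the actual generated query may differ):

```sql
-- Illustrative profiling sketch for the demo ORDERS table.
-- Column names come from get_schema above; everything else is assumed.
SELECT
    COUNT(*)                         AS row_count,
    MIN(ORDER_DATE)                  AS first_order,
    MAX(ORDER_DATE)                  AS last_order,
    COUNT(DISTINCT CUSTOMER_ID)      AS unique_customers,
    COUNT(DISTINCT SHIPPING_COUNTRY) AS countries,
    COUNT(DISTINCT PRODUCT_CATEGORY) AS categories,
    MIN(ORDER_AMOUNT)                AS min_amount,
    MAX(ORDER_AMOUNT)                AS max_amount,
    ROUND(AVG(ORDER_AMOUNT), 2)      AS avg_amount,
    -- COUNT(col) skips NULLs, so this difference is the NULL count
    COUNT(*) - COUNT(ORDER_STATUS)   AS order_status_nulls
FROM ORDERS;
```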

Show me how the tables relate to each other.

AI calls get_schema for CUSTOMERS and PRODUCT_METRICS columns

AI: Join paths:
    ORDERS.CUSTOMER_ID -> CUSTOMERS.CUSTOMER_ID
    ORDERS.PRODUCT_CATEGORY -> PRODUCT_METRICS.PRODUCT_CATEGORY

    CUSTOMERS is the reference table (10 rows, one per customer).
    ORDERS has the transactional data (17 rows).
    PRODUCT_METRICS aggregates by category and date (16 rows).
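The join paths above can be sketched as a single query. Note one caveat worth documenting: PRODUCT_METRICS is keyed by category *and* date, so joining on category alone fans out (each order matches every metric row for its category). Table and column names are taken from the schema calls earlier on this page:

```sql
-- Illustrative join across the three demo tables.
SELECT
    c.CUSTOMER_ID,
    o.ORDER_ID,
    o.ORDER_AMOUNT,
    p.PRODUCT_CATEGORY
FROM ORDERS o
JOIN CUSTOMERS c
  ON o.CUSTOMER_ID = c.CUSTOMER_ID
-- Fans out unless you also filter or aggregate PRODUCT_METRICS by date
JOIN PRODUCT_METRICS p
  ON o.PRODUCT_CATEGORY = p.PRODUCT_CATEGORY;
```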

Save what you learned to RULES.md.

AI calls execute_command to write docs/snowflake-demo-datasource/RULES.md

AI: Updated RULES.md:
    - CUSTOMERS: 10 customers across 9 countries. One INACTIVE (Ethan Hunt).
    - ORDERS: transactional, Oct-Dec 2024. Join to CUSTOMERS on CUSTOMER_ID.
    - PRODUCT_METRICS: pre-aggregated by category and date. Join to ORDERS
      on PRODUCT_CATEGORY.
    - No NULL issues in any table.

After this session, the RULES.md and query files persist in the workspace. The next conversation starts with this context already available.

That persistent context matters. Strong exploration sessions do two things: document what the AI should remember in RULES.md, and leave behind reusable query files in the workspace so later analysis builds on known-good patterns instead of starting over.

Example of datasource-specific metric definitions captured in RULES.md

Example of reusable query files persisted in the MarcoPolo workspace

How real teams use this

A manufacturing QC team connected their Microsoft Fabric SQL Analytics endpoints and ran a similar flow: surveyed 3 schemas with 47 tables, profiled the key quality tables, discovered a 12% NULL rate in a grade column (grading system introduced mid-year), and documented all of it in RULES.md. Every subsequent analysis session started from that baseline.

A RevOps team connected Salesforce and discovered 656 objects in the salesforce_data database. Rather than exploring all of them, they asked Claude to filter to the objects with data (Account, Opportunity, Contact, Lead) and profiled those. Within 20 minutes they had a working RULES.md with join paths and metric definitions.

The pattern

  1. Survey the landscape. Start broad. Ask for all schemas, all tables, grouped by topic.
  2. Drill into what matters. Pick the tables that are relevant and profile them at the column level.
  3. Check data quality. Null rates, unique values, value distributions. This surfaces gotchas before they lead to wrong analysis.
  4. Map relationships. Understand join paths between tables, especially in complex schemas.
  5. Save the context. Write what you learned to RULES.md. Future sessions start from this baseline.
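Step 3 usually reduces to a handful of SQL checks. A sketch of a null-rate check against the demo ORDERS table (column names from the schema shown earlier; the query is illustrative, not generated by the tool):

```sql
-- Illustrative null-rate check. 100.0 forces floating-point division.
SELECT
    COUNT(*)                                                          AS row_count,
    ROUND(100.0 * (COUNT(*) - COUNT(ORDER_STATUS))     / COUNT(*), 1) AS order_status_null_pct,
    ROUND(100.0 * (COUNT(*) - COUNT(SHIPPING_COUNTRY)) / COUNT(*), 1) AS shipping_country_null_pct,
    COUNT(DISTINCT ORDER_STATUS)                                      AS order_status_distinct
FROM ORDERS;
```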

Good prompts for exploration

List all tables in the analytics schema with row counts and
descriptions of what each one likely contains.

Profile the events table: row count by day for the last 90 days,
null rate per column, and the top 20 most common event types.

Are there any duplicate primary keys, orphaned foreign keys,
or tables that look like they haven't been updated recently?

Which tables contain customer data? Show me how they relate
to each other and which one is the canonical source.
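The duplicate-key and orphaned-key prompt maps to two standard checks. Sketched here for the demo schema (table and column names assumed from earlier on this page):

```sql
-- Duplicate primary keys in ORDERS
SELECT ORDER_ID, COUNT(*) AS copies
FROM ORDERS
GROUP BY ORDER_ID
HAVING COUNT(*) > 1;

-- Orphaned foreign keys: orders referencing a customer that doesn't exist
SELECT o.ORDER_ID, o.CUSTOMER_ID
FROM ORDERS o
LEFT JOIN CUSTOMERS c ON o.CUSTOMER_ID = c.CUSTOMER_ID
WHERE c.CUSTOMER_ID IS NULL;
```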
