Triage & Break-Fix
Investigate production issues by checking logs, querying databases, and tracing data.
Something's broken. The API is slow, the dashboard numbers look wrong, a customer is reporting missing data. You need to check logs, query databases, and trace data through systems, fast.
This is where the MarcoPolo plugin is most valuable. Triage requires chaining multiple tools across multiple data sources in a single investigation. The plugin's query workflow skill and workspace navigation keep Claude on the right path instead of fumbling with tool discovery.
Investigation: cancelled order anomaly
Using Claude Code with the plugin and demo data sources:
I'm seeing a cancelled order in the Snowflake data. Investigate - how many are there, which customers, and check if there are related support tickets in S3.
AI reads docs/snowflake-demo-datasource/RULES.md and SYNTAX.md
AI writes queries/snowflake-demo-datasource/cancelled_orders.sql
AI calls query("snowflake-demo-datasource", "queries/snowflake-demo-datasource/cancelled_orders.sql")
-> 1 cancelled order: ORDER_ID 1005, CUSTOMER_ID 5 (Ethan Hunt),
$123.00, 2024-11-01. Only INACTIVE customer in the system.
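The query file Claude writes isn't shown in the transcript, but its shape is easy to sketch. Here is a runnable stand-in using Python's built-in SQLite in place of Snowflake; the ORDERS table, STATUS column, and 'CANCELLED' value are assumptions inferred from the result, and the DELIVERED row is purely illustrative:

```python
import sqlite3

# In-memory stand-in for the Snowflake demo data source. Only the 1005 row
# reflects values from the transcript; the rest is illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ORDERS (
        ORDER_ID INTEGER, CUSTOMER_ID INTEGER,
        AMOUNT REAL, ORDER_DATE TEXT, STATUS TEXT
    );
    INSERT INTO ORDERS VALUES
        (1001, 1, 450.00, '2024-10-12', 'DELIVERED'),
        (1005, 5, 123.00, '2024-11-01', 'CANCELLED');
""")

# The likely shape of queries/snowflake-demo-datasource/cancelled_orders.sql:
rows = conn.execute("""
    SELECT ORDER_ID, CUSTOMER_ID, AMOUNT, ORDER_DATE
    FROM ORDERS
    WHERE STATUS = 'CANCELLED'
    ORDER BY ORDER_DATE
""").fetchall()

print(rows)  # [(1005, 5, 123.0, '2024-11-01')]
```

The point of the workflow is that this SQL lives as a file in the workspace, so later queries can iterate on it rather than starting over.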
AI calls browse("aws-s3-demo-datasource", "opsrus-lakehouse-demo/bronze/support_tickets/2025/")
-> Partitions: 06/, 07/
AI calls browse("aws-s3-demo-datasource", "opsrus-lakehouse-demo/bronze/events/2025/07/")
-> Daily partitions: 01/ through 15/
AI writes queries/DUCKDB/customer_order_status.sql
AI calls query("DUCKDB", "queries/DUCKDB/customer_order_status.sql")
-> Joins cached customer and order data:
Ethan Hunt is the only INACTIVE customer and the only cancellation.
All other customers are ACTIVE with DELIVERED/SHIPPED/PENDING orders.
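The DuckDB cross-reference is a join over the cached customer and order data. A minimal sketch, again using SQLite as a local stand-in; the CUSTOMERS schema and the 'Ada Lovelace' row are hypothetical, with only the Ethan Hunt values taken from the transcript:

```python
import sqlite3

# Stand-in for the DuckDB cross-reference over cached customer/order data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CUSTOMERS (
        CUSTOMER_ID INTEGER, NAME TEXT, COUNTRY TEXT, STATUS TEXT
    );
    CREATE TABLE ORDERS (ORDER_ID INTEGER, CUSTOMER_ID INTEGER, STATUS TEXT);
    INSERT INTO CUSTOMERS VALUES
        (1, 'Ada Lovelace', 'UK', 'ACTIVE'),        -- illustrative row
        (5, 'Ethan Hunt', 'Australia', 'INACTIVE');  -- from the transcript
    INSERT INTO ORDERS VALUES
        (1001, 1, 'DELIVERED'),
        (1005, 5, 'CANCELLED');
""")

# The likely shape of queries/DUCKDB/customer_order_status.sql: line up
# customer status against order status and surface the anomalies.
rows = conn.execute("""
    SELECT c.NAME, c.STATUS AS customer_status, o.STATUS AS order_status
    FROM CUSTOMERS c
    JOIN ORDERS o ON o.CUSTOMER_ID = c.CUSTOMER_ID
    WHERE o.STATUS = 'CANCELLED' OR c.STATUS = 'INACTIVE'
""").fetchall()

print(rows)  # [('Ethan Hunt', 'INACTIVE', 'CANCELLED')]
```

The join is what turns two isolated facts (one cancelled order, one inactive customer) into the finding that they are the same account.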
AI: One cancelled order, tied to the only inactive customer (Ethan Hunt,
Australia, $123). The S3 support tickets have data for Jun-Jul 2025
but the order is from Nov 2024 - the timelines don't overlap.
This looks like a clean account deactivation, not a system issue.
The cancellation amount is the minimum in the dataset, suggesting
it may have been a test order.

Three data sources checked (Snowflake, S3, DuckDB cross-reference), root cause identified, all in one conversation.
How real teams use this
ClickHouse performance investigation. A restaurant analytics team debugged slow queries by profiling snapshot distributions, event filtering patterns, and materialized view behavior across ClickHouse, Athena, and PostgreSQL - all in one Claude Code session. The investigation involved 352 commands over 11 active days, iterating on the same query files as they narrowed down the issue.
Missing customer data. A CS lead reports that a customer's dashboard shows zero revenue last month. Claude checks the warehouse - revenue data exists in the raw tables but was excluded during aggregation because a billing sync process marked the account as inactive. Root cause is upstream in the billing sync, not the data pipeline.
Pipeline failure. A data engineer notices the nightly aggregation hasn't run since Monday. Claude checks the job config table in Postgres, finds a cron schedule change from a deployment, traces the impact through the S3 data lake partitions to confirm which days are affected, and identifies the specific config value to revert.
Best practices
Start broad, narrow quickly. Before diving deep, ask your AI to check the obvious things first: is the data there? Are there error logs? Is the service up?
Use the workspace as evidence. Your AI saves queries and results to the workspace. This creates an audit trail of the investigation that you can share with the team or reference later.
Save the pattern. If you find yourself investigating the same category of issue repeatedly, ask your AI to write a diagnostic script and save it to the workspace.
Install the plugin for triage. Triage requires chaining tools across data sources without wrong turns. The plugin's skills keep Claude oriented on which workspace to use and which tools to call.
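A saved diagnostic script can be as small as a freshness check that answers "is the data there?" before anyone digs deeper. This sketch is hypothetical: the table and column names are placeholders, and a real version would run against your warehouse through the plugin's query tool rather than the in-memory SQLite used here for the demo.

```python
import sqlite3
from datetime import date

def data_freshness(conn, table, date_column):
    """Return the most recent date in a table, or None if it is empty.
    Note: table/column names are interpolated directly, so only pass
    trusted identifiers (fine for a personal diagnostic script)."""
    row = conn.execute(f"SELECT MAX({date_column}) FROM {table}").fetchone()
    return row[0]

# Demo against an in-memory stand-in with two days of events:
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EVENTS (EVENT_DATE TEXT);
    INSERT INTO EVENTS VALUES ('2025-07-14'), ('2025-07-15');
""")
latest = data_freshness(conn, "EVENTS", "EVENT_DATE")
print(latest)  # 2025-07-15

# Flag staleness: ISO date strings compare correctly as text.
stale = latest < date.today().isoformat()
```

Saved to the workspace, a script like this turns a recurring investigation into a one-command check.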