Data Sources

Cloud Storage

Browsing and analyzing files in S3, Azure Blob, Google Drive, OneDrive, and other cloud storage through MarcoPolo.

MarcoPolo connects to object storage (AWS S3, Azure Blob Storage) and cloud file storage (Google Drive, OneDrive). Your AI can browse directories, download files, and process data from any of these. Useful for log analysis, data lake exploration, document processing, and file-based workflows.

Connecting

Provide your storage credentials in the web UI or via a connector link from your AI. For object storage, you'll typically need a bucket or container name, access credentials, and region. For Google Drive and OneDrive, you authenticate via OAuth.

Credentials are stored encrypted and never exposed to the AI.

Browsing, downloading, and processing

Storage data sources use a different set of tools than databases do:

Browsing with browse: navigate your bucket's file structure:

You: "Show me what's in the raw-logs bucket."
You: "List files in s3://data-lake/bronze/orders/ from last week."
You: "What spreadsheets are in my Google Drive?"
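Under the hood, a request like "from last week" comes down to filtering listed object keys by prefix and date. A minimal sketch in Python, assuming a Hive-style dt=YYYY-MM-DD partition layout (an assumption about your bucket, not a MarcoPolo convention):

```python
from datetime import date

def keys_in_window(keys, prefix, start, end):
    """Keep keys under `prefix` whose dt=YYYY-MM-DD partition falls in [start, end]."""
    out = []
    for key in keys:
        if not key.startswith(prefix):
            continue
        for part in key.split("/"):
            # Expects a partition segment like .../dt=2024-02-11/...
            if part.startswith("dt="):
                d = date.fromisoformat(part[3:])
                if start <= d <= end:
                    out.append(key)
                break
    return out

# Hypothetical listing output for a data lake bucket.
keys = [
    "bronze/orders/dt=2024-02-05/part-0.parquet",
    "bronze/orders/dt=2024-02-11/part-0.parquet",
    "bronze/events/dt=2024-02-11/part-0.parquet",
]
last_week = keys_in_window(keys, "bronze/orders/", date(2024, 2, 8), date(2024, 2, 14))
```

Your AI applies this kind of narrowing for you; the point is that a concrete prefix and date range turns a million-key listing into a handful of candidates.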

Downloading with download: pull files into your workspace:

You: "Download the latest sales report from S3."
You: "Grab the Feb 11 partition from the orders data lake."
You: "Pull the Q1 revenue spreadsheet from OneDrive."
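Resolving "the latest" report means comparing last-modified timestamps from the listing before downloading. A sketch under the assumption that the listing yields (key, ISO-8601 timestamp) pairs; the key names are hypothetical:

```python
from datetime import datetime

def latest_object(listing):
    """Return the key with the most recent last-modified timestamp.

    `listing` is assumed to be (key, ISO-8601 timestamp) pairs,
    as a storage listing call might return them.
    """
    return max(listing, key=lambda kv: datetime.fromisoformat(kv[1]))[0]

listing = [
    ("reports/sales-2024-02-09.csv", "2024-02-09T06:00:00"),
    ("reports/sales-2024-02-11.csv", "2024-02-11T06:00:00"),
]
```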

Processing with DuckDB, Python, or shell tools once files are in your workspace:

You: "Parse this CSV, clean up the date columns, and load it into DuckDB."
You: "Read the JSON logs and count error types by hour."

DuckDB reads CSV, Parquet, JSON, and Excel files directly, making it a natural fit for storage-based analysis.
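The "clean up the date columns" request above can be sketched with Python's standard library before handing the result to DuckDB. The column name and input format are hypothetical; adapt them to your file:

```python
import csv
import io
from datetime import datetime

def normalize_dates(csv_text, date_col="order_date", in_fmt="%m/%d/%Y"):
    """Rewrite a date column to ISO format (YYYY-MM-DD) so it loads cleanly as a DATE."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row[date_col] = datetime.strptime(row[date_col], in_fmt).date().isoformat()
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Hypothetical downloaded file with US-style dates.
raw = "order_id,order_date\n1,02/11/2024\n2,02/12/2024\n"
cleaned = normalize_dates(raw)
```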

Common workflows

Log analysis. Browse log partitions, download a time range, parse and aggregate in the workspace. Ideal for incident investigation: "The orders API has been slow since yesterday. Check the S3 event logs."
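The parse-and-aggregate step of a log investigation can be sketched with the standard library. This assumes newline-delimited JSON logs with `timestamp`, `level`, and `error` fields; your actual log schema will differ:

```python
import json
from collections import Counter

def error_counts_by_hour(lines):
    """Count error types per hour from JSON log lines.

    Assumes each line carries an ISO-8601 `timestamp` and a `level` field;
    adjust the field names to your log schema.
    """
    counts = Counter()
    for line in lines:
        rec = json.loads(line)
        if rec.get("level") == "ERROR":
            hour = rec["timestamp"][:13]  # truncate to YYYY-MM-DDTHH
            counts[(hour, rec.get("error", "unknown"))] += 1
    return counts

# Hypothetical log lines from a downloaded partition.
logs = [
    '{"timestamp": "2024-02-11T09:14:02", "level": "ERROR", "error": "timeout"}',
    '{"timestamp": "2024-02-11T09:40:51", "level": "ERROR", "error": "timeout"}',
    '{"timestamp": "2024-02-11T10:05:10", "level": "INFO"}',
]
```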

Data lake exploration. Navigate bronze/silver/gold layers, check partition sizes, sample files to understand structure. Answer questions like "Is this partition larger than normal?" or "What schema does this Parquet file use?"
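A question like "Is this partition larger than normal?" reduces to comparing a partition's byte size against its peers. A minimal sketch; the 2x-median threshold is a hypothetical heuristic, not a MarcoPolo feature:

```python
from statistics import median

def is_partition_oversized(peer_sizes_bytes, candidate_bytes, factor=2.0):
    """Flag a partition whose size exceeds `factor` times the median of its peers.

    `peer_sizes_bytes` would come from a storage listing of sibling partitions.
    """
    return candidate_bytes > factor * median(peer_sizes_bytes)

# Hypothetical sizes of recent daily partitions.
peers = [95_000_000, 102_000_000, 99_000_000, 101_000_000]
```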

File-based ETL. Download from storage, transform with Python, load into DuckDB, export results. The workspace is a natural staging area for this kind of work.

Best practices

Be specific about paths and time ranges. Storage buckets can contain millions of files. Give your AI a starting path or date range to narrow the search.

Sample before loading. For large files, ask your AI to check the file size first or download a sample. Loading a 10GB file into DuckDB is a different operation than loading a 10MB file.
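Sampling is cheap: reading the first few lines of a file is enough to check delimiters, headers, and column types before committing to a full load. A sketch using only the standard library:

```python
import os
import tempfile
from itertools import islice

def head(path, n=100):
    """Read only the first `n` lines of a file to inspect its structure."""
    with open(path, encoding="utf-8") as f:
        return list(islice(f, n))

# Demo: write a throwaway file, sample it, clean up.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("\n".join(f"row{i}" for i in range(1000)))
    demo = f.name
sample = head(demo, 5)
os.unlink(demo)
```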

Document storage structure in RULES.md. If your data lake follows a specific layout (partitioned by date, organized by domain) or your Drive has a folder convention, document it so your AI knows where to look. See Context (RULES.md).
