Data Sources

Cloud Storage

Browsing and analyzing files in S3, Azure Blob, Google Drive, OneDrive, and other cloud storage through MarcoPolo.

MarcoPolo connects to object storage (AWS S3, Azure Blob Storage) and cloud file storage (Google Drive, OneDrive). Your AI can browse directories, download files, and process data from any of these. Useful for log analysis, data lake exploration, document processing, and file-based workflows.

Connecting

Provide your storage credentials in the web UI or via a connector link from your AI. For object storage, you'll typically need a bucket or container name, access credentials, and region. For Google Drive and OneDrive, you authenticate via OAuth.

Credentials are stored encrypted and never exposed to the AI.

Browsing, downloading, and processing

Storage data sources use a different set of tools than databases do:

Browsing with browse: navigate your bucket's file structure:

You: "Show me what's in the raw-logs bucket."
You: "List files in s3://data-lake/bronze/orders/ from last week."
You: "What spreadsheets are in my Google Drive?"
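Under the hood, a request like "from last week" comes down to filtering listed object keys by prefix and date. A minimal sketch in Python, assuming a Hive-style dt=YYYY-MM-DD partition layout (an assumption about your bucket, not a MarcoPolo convention):

```python
from datetime import date

def keys_in_window(keys, prefix, start, end):
    """Keep keys under `prefix` whose dt=YYYY-MM-DD partition falls in [start, end]."""
    out = []
    for key in keys:
        if not key.startswith(prefix):
            continue
        for part in key.split("/"):
            # Expects a partition segment like .../dt=2024-02-11/...
            if part.startswith("dt="):
                d = date.fromisoformat(part[3:])
                if start <= d <= end:
                    out.append(key)
                break
    return out

# Hypothetical listing output for a data lake bucket.
keys = [
    "bronze/orders/dt=2024-02-05/part-0.parquet",
    "bronze/orders/dt=2024-02-11/part-0.parquet",
    "bronze/events/dt=2024-02-11/part-0.parquet",
]
last_week = keys_in_window(keys, "bronze/orders/", date(2024, 2, 8), date(2024, 2, 14))
```

Your AI applies this kind of narrowing for you; the point is that a concrete prefix and date range turns a million-key listing into a handful of candidates.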

Downloading with download: pull files into your workspace:

You: "Download the latest sales report from S3."
You: "Grab the Feb 11 partition from the orders data lake."
You: "Pull the Q1 revenue spreadsheet from OneDrive."
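Resolving "the latest" report means comparing last-modified timestamps from the listing before downloading. A sketch under the assumption that the listing yields (key, ISO-8601 timestamp) pairs; the key names are hypothetical:

```python
from datetime import datetime

def latest_object(listing):
    """Return the key with the most recent last-modified timestamp.

    `listing` is assumed to be (key, ISO-8601 timestamp) pairs,
    as a storage listing call might return them.
    """
    return max(listing, key=lambda kv: datetime.fromisoformat(kv[1]))[0]

listing = [
    ("reports/sales-2024-02-09.csv", "2024-02-09T06:00:00"),
    ("reports/sales-2024-02-11.csv", "2024-02-11T06:00:00"),
]
```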

Processing with DuckDB, Python, or shell tools once files are in your workspace:

You: "Parse this CSV, clean up the date columns, and load it into DuckDB."
You: "Read the JSON logs and count error types by hour."

DuckDB reads CSV, Parquet, JSON, and Excel files directly, making it a natural fit for storage-based analysis.
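The "clean up the date columns" request above can be sketched with Python's standard library before handing the result to DuckDB. The column name and input format are hypothetical; adapt them to your file:

```python
import csv
import io
from datetime import datetime

def normalize_dates(csv_text, date_col="order_date", in_fmt="%m/%d/%Y"):
    """Rewrite a date column to ISO format (YYYY-MM-DD) so it loads cleanly as a DATE."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row[date_col] = datetime.strptime(row[date_col], in_fmt).date().isoformat()
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Hypothetical downloaded file with US-style dates.
raw = "order_id,order_date\n1,02/11/2024\n2,02/12/2024\n"
cleaned = normalize_dates(raw)
```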

Common workflows

Log analysis. Browse log partitions, download a time range, parse and aggregate in the workspace. Ideal for incident investigation: "The orders API has been slow since yesterday. Check the S3 event logs."
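The parse-and-aggregate step of a log investigation can be sketched with the standard library. This assumes newline-delimited JSON logs with `timestamp`, `level`, and `error` fields; your actual log schema will differ:

```python
import json
from collections import Counter

def error_counts_by_hour(lines):
    """Count error types per hour from JSON log lines.

    Assumes each line carries an ISO-8601 `timestamp` and a `level` field;
    adjust the field names to your log schema.
    """
    counts = Counter()
    for line in lines:
        rec = json.loads(line)
        if rec.get("level") == "ERROR":
            hour = rec["timestamp"][:13]  # truncate to YYYY-MM-DDTHH
            counts[(hour, rec.get("error", "unknown"))] += 1
    return counts

# Hypothetical log lines from a downloaded partition.
logs = [
    '{"timestamp": "2024-02-11T09:14:02", "level": "ERROR", "error": "timeout"}',
    '{"timestamp": "2024-02-11T09:40:51", "level": "ERROR", "error": "timeout"}',
    '{"timestamp": "2024-02-11T10:05:10", "level": "INFO"}',
]
```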

Data lake exploration. Navigate bronze/silver/gold layers, check partition sizes, sample files to understand structure. Answer questions like "Is this partition larger than normal?" or "What schema does this Parquet file use?"
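A question like "Is this partition larger than normal?" reduces to comparing a partition's byte size against its peers. A minimal sketch; the 2x-median threshold is a hypothetical heuristic, not a MarcoPolo feature:

```python
from statistics import median

def is_partition_oversized(peer_sizes_bytes, candidate_bytes, factor=2.0):
    """Flag a partition whose size exceeds `factor` times the median of its peers.

    `peer_sizes_bytes` would come from a storage listing of sibling partitions.
    """
    return candidate_bytes > factor * median(peer_sizes_bytes)

# Hypothetical sizes of recent daily partitions.
peers = [95_000_000, 102_000_000, 99_000_000, 101_000_000]
```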

File-based ETL. Download from storage, transform with Python, load into DuckDB, export results. The workspace is a natural staging area for this kind of work.

Best practices

Be specific about paths and time ranges. Storage buckets can contain millions of files. Give your AI a starting path or date range to narrow the search.

Sample before loading. For large files, ask your AI to check the file size first or download a sample. Loading a 10GB file into DuckDB is a different operation than loading a 10MB file.
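Sampling is cheap: reading the first few lines of a file is enough to check delimiters, headers, and column types before committing to a full load. A sketch using only the standard library:

```python
import os
import tempfile
from itertools import islice

def head(path, n=100):
    """Read only the first `n` lines of a file to inspect its structure."""
    with open(path, encoding="utf-8") as f:
        return list(islice(f, n))

# Demo: write a throwaway file, sample it, clean up.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("\n".join(f"row{i}" for i in range(1000)))
    demo = f.name
sample = head(demo, 5)
os.unlink(demo)
```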

Document storage structure in RULES.md. If your data lake follows a specific layout (partitioned by date, organized by domain) or your Drive has a folder convention, document it so your AI knows where to look. See Context (RULES.md).
