The Challenge
You need a central repository for storing and analyzing large volumes of semi-structured data like application logs, clickstream events, or transaction records. A data lake architecture lets you ingest data in its raw form, transform it into queryable formats, and run SQL analytics without provisioning any database servers.
What You'll Build
- S3 data lake with raw, processed, and curated zones
- Glue crawlers for automatic schema discovery
- Glue ETL jobs for data transformation
- Athena for SQL queries on processed data
- Lifecycle policies for storage cost optimization
Try This Prompt in Pulumi Neo
Run this prompt in Neo to deploy your infrastructure, or edit it to customize.
Architecture Overview
This architecture implements a data lake on AWS using S3 as the storage foundation, Glue for data cataloging and transformation, and Athena for interactive SQL queries. Data flows through three zones: raw data arrives in the landing zone in its original format, Glue ETL jobs clean and transform it into the processed zone, and curated datasets optimized for specific analytical use cases go into the curated zone.
The power of this approach is schema-on-read. Unlike a traditional data warehouse where you must define a schema before loading data, a data lake stores raw data first and applies schema when you query it. Glue crawlers analyze files in S3, detect their format (JSON, CSV, Parquet), and register the schema in the Glue Data Catalog. Athena reads from the catalog and lets you run SQL queries directly against files in S3 without loading them into a database.
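To make schema-on-read concrete, here is a minimal sketch of inferring column types from newline-delimited JSON after the data has already been stored. It is only an illustration of the idea, not the algorithm Glue crawlers use. The function names, the type names, and the widening rule (integer plus float becomes double, any other conflict becomes string) are assumptions made for this example.

```python
import json


def infer_type(value):
    """Map a value decoded from JSON to a simple column type name."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    # Strings, and nested objects or arrays, fall back to string in this sketch.
    return "string"


def merge_types(current, observed):
    """Widen two observed types into one column type (assumed rule)."""
    if current == observed:
        return current
    if {current, observed} == {"bigint", "double"}:
        return "double"
    return "string"


def infer_schema(json_lines):
    """Infer {column: type} from an iterable of JSON object strings."""
    schema = {}
    for line in json_lines:
        for column, value in json.loads(line).items():
            if value is None:
                continue  # nulls carry no type information
            observed = infer_type(value)
            schema[column] = merge_types(schema.get(column, observed), observed)
    return schema


sample = [
    '{"user_id": 42, "event": "click", "latency_ms": 12}',
    '{"user_id": 43, "event": "view", "latency_ms": 8.5, "mobile": true}',
]
schema = infer_schema(sample)
print(schema)
```

The second record adds a column and widens `latency_ms`, and neither change requires reloading the stored data. That is the practical benefit of applying the schema at read time.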
Lifecycle policies handle cost optimization automatically. S3 lifecycle rules act on object age rather than on access patterns: recent data, which is typically queried most often, stays in S3 Standard, and data older than a configurable threshold transitions to S3 Standard-Infrequent Access or S3 Glacier. This keeps storage costs proportional to data utility without requiring manual archival.
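As a rough illustration, a lifecycle rule for the raw zone could be described with a structure like the one below. The dictionary mirrors the general shape of an S3 lifecycle configuration (rules with a prefix filter and a list of transitions). The `raw/` prefix, the 30-day and 90-day thresholds, and the `lifecycle_rule` helper are assumptions for this sketch, not values taken from the prompt.

```python
import json


def lifecycle_rule(prefix, ia_after_days, glacier_after_days):
    """Build one lifecycle rule that tiers objects under `prefix` by age."""
    # S3 expects objects to spend at least 30 days in Standard before moving
    # to Standard-IA; the Glacier transition must come later than that.
    if not 30 <= ia_after_days < glacier_after_days:
        raise ValueError("expected 30 <= ia_after_days < glacier_after_days")
    return {
        "ID": f"tier-{prefix.strip('/')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }


# Hypothetical thresholds: tune these to how long raw data stays useful.
config = {"Rules": [lifecycle_rule("raw/", ia_after_days=30, glacier_after_days=90)]}
print(json.dumps(config, indent=2))
```

Keeping the thresholds as parameters makes it easy to give each zone its own retention profile, for example longer Standard retention for curated data that dashboards read daily.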
S3 Storage Zones
The data lake uses a multi-zone structure within S3. The raw zone stores ingested data in its original format, preserving a complete audit trail. The processed zone contains cleaned and normalized data in columnar formats like Parquet for efficient querying. The curated zone holds purpose-built datasets optimized for specific dashboards, reports, or machine learning models.
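One way to picture the zone layout is as a key-naming convention inside a bucket. The sketch below assumes each zone is a top-level prefix (`raw/`, `processed/`, `curated/`) followed by a dataset name; that layout, the helper names, and the example file name are all hypothetical.

```python
from pathlib import PurePosixPath

ZONES = ("raw", "processed", "curated")


def zone_key(zone, dataset, filename):
    """Build an object key of the form <zone>/<dataset>/<filename>."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"{zone}/{dataset}/{filename}"


def promote_to_processed(raw_key):
    """Map a raw-zone key to the processed-zone key for its Parquet output."""
    parts = PurePosixPath(raw_key).parts
    if len(parts) < 2 or parts[0] != "raw":
        raise ValueError(f"not a raw-zone key: {raw_key}")
    # Same dataset and file stem, new zone prefix and columnar extension.
    return str(PurePosixPath("processed", *parts[1:]).with_suffix(".parquet"))


raw_key = zone_key("raw", "clickstream", "events-0001.json")
processed_key = promote_to_processed(raw_key)
print(raw_key)
print(processed_key)
```

Because the raw object is never overwritten, the original file remains available as the audit trail while the processed copy is optimized for querying.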
Glue Crawlers and ETL
Glue crawlers scan S3 paths on a schedule, detect file formats and column types, and register tables in the Glue Data Catalog. ETL jobs then transform raw data: parsing JSON logs into structured columns, deduplicating records, converting formats, and writing results to the processed zone in Parquet. Parquet reduces query costs because Athena reads only the columns it needs.
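The transformation steps described above (parse, deduplicate, normalize into columns) can be sketched in plain Python. A real Glue job would typically run as a Spark script and write Parquet; this sketch stops at producing structured rows because writing Parquet needs a library outside the standard library. The field names `event_id`, `type`, and `ts` are assumptions about the log format, not something the prompt defines.

```python
import json


def transform(raw_lines, key_field="event_id"):
    """Parse JSON log lines into structured rows, dropping duplicates."""
    seen = set()
    rows = []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of failing the whole batch
        if not isinstance(record, dict) or key_field not in record:
            continue
        key = str(record[key_field])
        if key in seen:
            continue  # deduplicate on the assumed unique key
        seen.add(key)
        rows.append({
            "event_id": key,
            "event_type": str(record.get("type", "unknown")),
            "ts": record.get("ts"),
        })
    return rows


raw = [
    '{"event_id": "a1", "type": "click", "ts": "2024-01-01T00:00:00Z"}',
    '{"event_id": "a1", "type": "click", "ts": "2024-01-01T00:00:00Z"}',
    'not json at all',
    '{"event_id": "b2", "ts": "2024-01-01T00:00:05Z"}',
]
rows = transform(raw)
for row in rows:
    print(row)
```

Every output row has the same three columns, which is what makes the result suitable for a columnar format in the processed zone.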
Athena Query Engine
Athena provides serverless SQL queries against data stored in S3. There are no servers to provision or clusters to manage. You pay per query based on the amount of data scanned, which makes Parquet format valuable since it reduces scan volume. Analysts connect to Athena through standard SQL clients, BI tools, or the AWS console to explore processed datasets.
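A back-of-the-envelope calculation shows why scan volume matters under pay-per-scan pricing. The price of 5 USD per TB is an assumed parameter (check current regional pricing), and the model is deliberately simplified: it assumes all columns are the same size and ignores compression, which would usually reduce the columnar scan further.

```python
PRICE_PER_TB_USD = 5.00   # assumed price; verify against current Athena pricing
BYTES_PER_TB = 1024 ** 4


def query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Estimated cost in USD for a query that scans `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


# Hypothetical table: 2 TB of data spread evenly across 20 columns.
table_bytes = 2 * BYTES_PER_TB
total_columns = 20
columns_needed = 3

# Row-oriented formats such as JSON or CSV force a scan of every byte.
full_scan_cost = query_cost(table_bytes)

# A columnar format lets the engine read only the referenced columns.
columnar_cost = query_cost(table_bytes * columns_needed / total_columns)

print(f"full scan: ${full_scan_cost:.2f}")
print(f"columnar:  ${columnar_cost:.2f}")
```

Under these assumptions, selecting 3 of 20 columns scans 15% of the data, so the query costs 15% as much.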
Common Customizations
- Add event-driven ingestion: Extend the prompt to include S3 event notifications that trigger Lambda functions when new files arrive, starting ETL processing automatically.
- Add data quality checks: Request Glue Data Quality rules that validate transformed data against expectations like null-rate thresholds or value ranges before it moves to the curated zone.
- Add partitioning: Ask for partitioned tables (by date, region, or event type) to reduce Athena query costs and improve performance on large datasets.
- Connect a BI tool: Request a QuickSight dataset or a Grafana connection to Athena for dashboard visualizations on top of the data lake.
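For the event-driven ingestion option, a Lambda function receives an S3 event notification and needs to extract which objects arrived. The sketch below parses the notification's `Records` list and keeps only keys under an assumed `raw/` prefix. The bucket name, the prefix, and the return shape are hypothetical, and starting the ETL job is left as a placeholder rather than a real service call.

```python
from urllib.parse import unquote_plus


def new_raw_objects(event, prefix="raw/"):
    """Return (bucket, key) pairs for newly arrived objects under `prefix`."""
    objects = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(s3.get("object", {}).get("key", ""))
        if bucket and key.startswith(prefix):
            objects.append((bucket, key))
    return objects


def handler(event, context):
    """Lambda-style entry point: report which raw objects would be processed."""
    targets = new_raw_objects(event)
    # Placeholder: a real function would start the ETL job for these objects.
    return {"triggered": [f"s3://{bucket}/{key}" for bucket, key in targets]}


sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-data-lake"},
                "object": {"key": "raw/clickstream/events+2024.json"}}},
        {"s3": {"bucket": {"name": "example-data-lake"},
                "object": {"key": "processed/clickstream/part-0.parquet"}}},
    ]
}
result = handler(sample_event, None)
print(result)
```

Filtering by prefix in the handler guards against a feedback loop where the ETL job's own output in the processed zone triggers another run.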