Build a Data Lake Architecture

By Pulumi Team

The Challenge

You need a central repository for storing and analyzing large volumes of semi-structured data like application logs, clickstream events, or transaction records. A data lake architecture lets you ingest data in its raw form, transform it into queryable formats, and run SQL analytics without provisioning any database servers.

What You'll Build

  • S3 data lake with raw, processed, and curated zones
  • Glue crawlers for automatic schema discovery
  • Glue ETL jobs for data transformation
  • Athena for SQL queries on processed data
  • Lifecycle policies for storage cost optimization

Try This Prompt in Pulumi Neo

Run this prompt in Neo to deploy your infrastructure, or edit it to customize.

Best For

Use this prompt when you need to centralize data from multiple sources for analytics, reporting, or machine learning. This architecture suits teams that ingest application logs, user behavior events, transaction records, or IoT sensor data and want to query it with standard SQL without managing a data warehouse.

Architecture Overview

This architecture implements a data lake on AWS using S3 as the storage foundation, Glue for data cataloging and transformation, and Athena for interactive SQL queries. Data flows through three zones: raw data arrives in the landing zone in its original format, Glue ETL jobs clean and transform it into the processed zone, and curated datasets optimized for specific analytical use cases go into the curated zone.

The power of this approach is schema-on-read. Unlike a traditional data warehouse where you must define a schema before loading data, a data lake stores raw data first and applies schema when you query it. Glue crawlers analyze files in S3, detect their format (JSON, CSV, Parquet), and register the schema in the Glue Data Catalog. Athena reads from the catalog and lets you run SQL queries directly against files in S3 without loading them into a database.
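To make schema-on-read concrete, here is a minimal sketch of the kind of inference a Glue crawler performs: sample JSON records and map value types to Glue catalog column types. This is an illustration only; a real crawler also handles nested structs, arrays, partitions, and type conflicts across files.

```python
import json

# Illustrative mapping from Python value types to Glue catalog types.
GLUE_TYPES = {str: "string", int: "bigint", float: "double", bool: "boolean"}

def infer_schema(json_lines: list[str]) -> dict[str, str]:
    """Infer a flat column -> type mapping from newline-delimited JSON."""
    schema: dict[str, str] = {}
    for line in json_lines:
        record = json.loads(line)
        for column, value in record.items():
            # First type seen wins; unknown types fall back to string.
            schema.setdefault(column, GLUE_TYPES.get(type(value), "string"))
    return schema

sample = ['{"user_id": 42, "event": "click", "latency_ms": 3.5}']
print(infer_schema(sample))
# {'user_id': 'bigint', 'event': 'string', 'latency_ms': 'double'}
```

The schema is derived after the data lands, which is exactly what lets the raw zone accept files in any shape.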

Lifecycle policies handle cost optimization automatically. Raw data that is queried frequently stays in S3 Standard, but data older than a configurable threshold transitions to S3 Standard-IA or S3 Glacier. This keeps storage costs proportional to data utility without requiring manual archival.
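The transition rule described above can be expressed as an S3 lifecycle configuration. The sketch below builds one in the shape boto3's `put_bucket_lifecycle_configuration` expects; the `raw/` prefix and the day thresholds are illustrative assumptions, not values the prompt prescribes.

```python
def raw_zone_lifecycle(ia_after_days: int = 90, glacier_after_days: int = 365) -> dict:
    """Build a lifecycle configuration that tiers aging raw-zone objects
    from S3 Standard to Standard-IA, then to Glacier."""
    return {
        "Rules": [
            {
                "ID": "tier-raw-zone",
                "Status": "Enabled",
                # Hypothetical layout: raw data lives under the raw/ prefix.
                "Filter": {"Prefix": "raw/"},
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }
```

In a Pulumi program the same rule would be declared on the bucket resource rather than applied via boto3, but the structure of the policy is the same.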

S3 Storage Zones

The data lake uses a multi-zone structure within S3. The raw zone stores ingested data in its original format, preserving a complete audit trail. The processed zone contains cleaned and normalized data in columnar formats like Parquet for efficient querying. The curated zone holds purpose-built datasets optimized for specific dashboards, reports, or machine learning models.
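One way to lay out the three zones is Hive-style key prefixes within a bucket, so Glue and Athena can prune by partition. The naming convention below (zone/source/year=/month=/day=) is an assumption for illustration; the prompt does not mandate a specific layout, and separate buckets per zone work equally well.

```python
from datetime import date

ZONES = ("raw", "processed", "curated")

def object_key(zone: str, source: str, day: date, filename: str) -> str:
    """Build an S3 object key with Hive-style date partitions
    (year=/month=/day=) under the given zone prefix."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return (f"{zone}/{source}/year={day.year}/month={day.month:02d}/"
            f"day={day.day:02d}/{filename}")

print(object_key("raw", "clickstream", date(2024, 3, 7), "events.json"))
# raw/clickstream/year=2024/month=03/day=07/events.json
```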

Glue Crawlers and ETL

Glue crawlers scan S3 paths on a schedule, detect file formats and column types, and register tables in the Glue Data Catalog. ETL jobs then transform raw data: parsing JSON logs into structured columns, deduplicating records, converting formats, and writing results to the processed zone in Parquet. Parquet reduces query costs because Athena reads only the columns it needs.
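The transformation step can be sketched in plain Python to show the logic a Glue job applies: parse raw JSON lines, drop malformed records, and deduplicate. The `event_id` field is a hypothetical dedupe key; a real Glue job would do this with Spark DataFrames and write the result to the processed zone as Parquet.

```python
import json

def transform(raw_lines: list[str]) -> list[dict]:
    """Parse newline-delimited JSON, skip malformed lines, and keep the
    first record seen for each event_id."""
    seen: set[str] = set()
    rows: list[dict] = []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # in practice, quarantine malformed input for review
        event_id = record.get("event_id")
        if event_id is None or event_id in seen:
            continue
        seen.add(event_id)
        rows.append(record)
    return rows
```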

Athena Query Engine

Athena provides serverless SQL queries against data stored in S3. There are no servers to provision or clusters to manage. You pay per query based on the amount of data scanned, which makes the Parquet format valuable because it reduces scan volume. Analysts connect to Athena through standard SQL clients, BI tools, or the AWS console to explore processed datasets.
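A quick back-of-the-envelope helper shows why scan volume drives cost. The $5/TB default is the commonly cited Athena rate, used here for illustration; check current pricing for your region. Athena also rounds each query up to a 10 MB minimum scan.

```python
def athena_query_cost_usd(bytes_scanned: int, price_per_tb: float = 5.0) -> float:
    """Estimate the cost of one Athena query from bytes scanned."""
    TB = 1024 ** 4
    MIN_BYTES = 10 * 1024 ** 2  # Athena's per-query 10 MB minimum
    return max(bytes_scanned, MIN_BYTES) / TB * price_per_tb

# Full scan of a 500 GiB CSV vs. reading 3 of 30 equally sized columns
# from the same data stored as Parquet (~50 GiB scanned):
full_scan = athena_query_cost_usd(500 * 1024 ** 3)
column_scan = athena_query_cost_usd(50 * 1024 ** 3)
print(f"CSV: ${full_scan:.2f}  Parquet: ${column_scan:.2f}")
```

The tenfold difference comes purely from Parquet letting Athena skip unneeded columns.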

Common Customizations

  • Add event-driven ingestion: Extend the prompt to include S3 event notifications that trigger Lambda functions when new files arrive, starting ETL processing automatically.
  • Add data quality checks: Request Glue Data Quality rules that validate transformed data against expectations like null-rate thresholds or value ranges before it moves to the curated zone.
  • Add partitioning: Ask for partitioned tables (by date, region, or event type) to reduce Athena query costs and improve performance on large datasets.
  • Connect a BI tool: Request a QuickSight dataset or a Grafana connection to Athena for dashboard visualizations on top of the data lake.
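To illustrate the partitioning customization above: with Hive-style year=/month= prefixes registered as partition columns, a WHERE clause on those columns lets Athena list and read only the matching S3 prefixes instead of the whole table. The table root and query below are hypothetical examples.

```python
def partition_prefixes(table_root: str, year: int, months: list[int]) -> list[str]:
    """Return the S3 prefixes a partition-filtered query would touch."""
    return [f"{table_root}/year={year}/month={m:02d}/" for m in months]

# A query such as
#   SELECT count(*) FROM events WHERE year = 2024 AND month IN (1, 2)
# only scans data under these prefixes:
print(partition_prefixes("processed/clickstream", 2024, [1, 2]))
# ['processed/clickstream/year=2024/month=01/',
#  'processed/clickstream/year=2024/month=02/']
```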