ELT Pipeline on AWS: S3, Glue and Athena Step by Step
Cloud · 12 min read · Jan 25, 2026

A practical guide to building a scalable, serverless ELT pipeline on AWS, using S3 for storage, Glue for transformation, and Athena for queries, as a foundation for modern data architectures.
Diego Velez
Technical leadership

Why ELT Instead of ETL?

Traditional ETL (Extract, Transform, Load) transforms data before loading it into a warehouse. ELT (Extract, Load, Transform) inverts the approach: load first, then transform.

Why does it matter? Current data volumes are huge. Transforming before loading creates bottlenecks and demands expensive compute that sits idle between runs.

With ELT you can:

  • Load raw data quickly into cheap storage (S3)
  • Transform on demand when you actually need it
  • Scale compute independently from storage
  • Keep raw data for future use cases you haven't imagined yet

The ELT Stack on AWS

This pipeline uses three main AWS services:

  1. Amazon S3: Object storage for raw and transformed data
  2. AWS Glue: Serverless ETL for data transformation
  3. Amazon Athena: Serverless SQL query engine over S3 data

Together they form a fully serverless, pay-per-use pipeline that scales from gigabytes to petabytes.

Architecture Overview

Data flows as:

Source systems → S3 (Raw zone) → Glue (Transform) → S3 (Curated zone) → Athena (Queries)

S3 Data Zones

  • Raw zone: s3://your-bucket/raw/ — Unprocessed source data
  • Staging zone: s3://your-bucket/staging/ — Intermediate transforms
  • Curated zone: s3://your-bucket/curated/ — Final data ready for query

Step 1: Configure S3 Buckets

Create your bucket and zone structure. Enable versioning, lifecycle policies (e.g. move to Glacier after 90 days), encryption (SSE-S3 or SSE-KMS), and block public access.
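This setup can be scripted with boto3. A minimal sketch, assuming the `your-bucket` name from the zone layout above (the helper functions just build the configuration dictionaries; the guarded block at the bottom needs AWS credentials to actually run):

```python
BUCKET = "your-bucket"  # hypothetical bucket name from the zone layout above


def lifecycle_config(days_to_glacier: int = 90) -> dict:
    """Lifecycle rule: move raw-zone objects to Glacier after N days."""
    return {
        "Rules": [{
            "ID": "raw-to-glacier",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            "Transitions": [{"Days": days_to_glacier, "StorageClass": "GLACIER"}],
        }]
    }


def encryption_config() -> dict:
    """Default bucket encryption with SSE-S3 (AES-256)."""
    return {
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}
        }]
    }


if __name__ == "__main__":
    import boto3  # requires AWS credentials and permissions

    s3 = boto3.client("s3")
    s3.create_bucket(Bucket=BUCKET)
    s3.put_bucket_versioning(
        Bucket=BUCKET, VersioningConfiguration={"Status": "Enabled"})
    s3.put_bucket_lifecycle_configuration(
        Bucket=BUCKET, LifecycleConfiguration=lifecycle_config())
    s3.put_bucket_encryption(
        Bucket=BUCKET, ServerSideEncryptionConfiguration=encryption_config())
    s3.put_public_access_block(
        Bucket=BUCKET,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True, "IgnorePublicAcls": True,
            "BlockPublicPolicy": True, "RestrictPublicBuckets": True,
        },
    )
```

Keeping the configuration in functions like this also makes the setup easy to reuse per environment (dev/prod buckets) and to version-control alongside the pipeline code.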

Step 2: Create AWS Glue Crawlers

Glue Crawlers discover schema automatically. Point them at your raw zone; they create or update the Glue Data Catalog. Use an IAM role with S3 and Glue permissions. Run crawlers on a schedule or on-demand.
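A crawler can be created through the Glue API as well as the console. A sketch with boto3, where the crawler name, database name, and IAM role ARN are placeholders you would replace with your own:

```python
def crawler_config(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Configuration for a Glue crawler over one S3 zone."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Schedule": "cron(0 2 * * ? *)",  # daily at 02:00 UTC; omit for on-demand
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",  # update catalog on schema drift
            "DeleteBehavior": "LOG",
        },
    }


if __name__ == "__main__":
    import boto3  # requires AWS credentials and permissions

    glue = boto3.client("glue")
    glue.create_crawler(**crawler_config(
        name="raw-zone-crawler",                              # hypothetical name
        role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder
        database="raw_db",
        s3_path="s3://your-bucket/raw/",
    ))
    glue.start_crawler(Name="raw-zone-crawler")
```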

Step 3: Create Glue ETL Jobs

Glue jobs transform data in Python or Scala. Read from the Glue catalog, clean and enrich, write to the curated zone as Parquet with partition keys (e.g. year, month, day). Tune worker type (G.1X) and count; set job timeout.
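A minimal PySpark job script along these lines, assuming the `raw_db` database and an `orders` table discovered by the crawler (table, column, and argument names are illustrative):

```python
def curated_path(bucket: str) -> str:
    """Curated-zone prefix the job writes Parquet into."""
    return f"s3://{bucket}/curated/orders/"


if __name__ == "__main__":  # runs inside the Glue job environment
    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME", "BUCKET"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the raw table registered by the crawler
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="raw_db", table_name="orders")

    # Clean: drop rows missing the key column (add enrichment steps here)
    df = dyf.toDF().dropna(subset=["order_id"])

    # Write Parquet partitioned by date so Athena can prune partitions
    (df.write.mode("append")
       .partitionBy("year", "month", "day")
       .parquet(curated_path(args["BUCKET"])))

    job.commit()
```

Note that the bucket is passed in as a job argument (`--BUCKET`) rather than hardcoded, which matters for the parameterization point in the Common Mistakes section.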

Step 4: Query with Amazon Athena

Create a database and tables (or use the Glue catalog). Use Parquet/ORC, partition by date, SELECT only needed columns, and filter early with WHERE to minimize cost.
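A query following those rules might look like the sketch below, submitted through the Athena API (the `curated_db.orders` table, its columns, and the results location are assumptions):

```python
def daily_revenue_sql(database: str = "curated_db") -> str:
    """Partition-pruned aggregate: few columns, partition keys in WHERE."""
    return f"""
    SELECT customer_id, SUM(amount) AS revenue
    FROM {database}.orders
    WHERE year = '2026' AND month = '01'  -- filters on partition keys, cuts scan cost
    GROUP BY customer_id
    """


if __name__ == "__main__":
    import boto3  # requires AWS credentials and permissions

    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=daily_revenue_sql(),
        ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},
    )
    print(resp["QueryExecutionId"])  # poll get_query_execution with this id
```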

Orchestration

Use AWS Glue Workflows (visual) or EventBridge + Step Functions to chain crawler → ETL → crawler with error handling.
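With the Step Functions route, the crawler → ETL → crawler chain can be expressed as a state machine definition. A minimal sketch in Amazon States Language, built as a Python dict (the crawler and job names match the hypothetical examples above):

```python
def pipeline_definition() -> dict:
    """Step Functions ASL: crawl raw zone, run ETL, re-crawl curated zone."""
    return {
        "StartAt": "CrawlRaw",
        "States": {
            "CrawlRaw": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
                "Parameters": {"Name": "raw-zone-crawler"},
                "Next": "RunEtl",
            },
            "RunEtl": {
                "Type": "Task",
                # .sync waits for the job run to finish before moving on
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": "orders-etl"},
                "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 2}],
                "Next": "CrawlCurated",
            },
            "CrawlCurated": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
                "Parameters": {"Name": "curated-zone-crawler"},
                "End": True,
            },
        },
    }
```

An EventBridge schedule rule can then start this state machine on a cadence, giving you retries and error handling without a dedicated orchestrator.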

Cost Optimization

  • S3: Lifecycle to Glacier for old data
  • Glue: Right-size workers; use Flex execution for non-urgent jobs
  • Athena: Partition, use columnar formats, limit columns scanned

Best Practices

Schema evolution (Glue Schema Registry), dead-letter queues for failed records, CloudWatch alarms, encryption and IAM, test with sample data, document lineage and transformation logic.

Common Mistakes

Over-partitioning (tiny partitions slow queries), under-partitioning (full scans are expensive), ignoring file size (aim for 128 MB–1 GB Parquet files), no retry logic, hardcoded paths instead of parameterized jobs.
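The file-size point is easy to act on: estimate the output size and repartition before writing. A rough helper, assuming a 256 MB target within the 128 MB–1 GB range above:

```python
def target_partitions(total_bytes: int, target_bytes: int = 256 * 1024 * 1024) -> int:
    """Rough output file count so each Parquet file lands near the target size."""
    return max(1, round(total_bytes / target_bytes))


# In a Spark/Glue job (illustrative):
#   df.repartition(target_partitions(estimated_bytes)).write.parquet(path)
```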

Conclusion

ELT on AWS with S3, Glue and Athena gives a scalable, cost-effective base for modern data architectures. Start simple: load raw into S3, transform with Glue, query with Athena. Add orchestration, monitoring and optimization as you grow.

Build your future.

Ready to transform your infrastructure with intelligent AI agents?

Book assessment