
Building an ELT Pipeline on AWS with S3, Glue, and Athena
Why ELT Over ETL?
Traditional ETL (Extract, Transform, Load) processes transform data before loading it into a data warehouse. ELT (Extract, Load, Transform) flips this approach: load first, transform later.
Why does this matter? Modern data volumes are massive. Transforming data before loading creates bottlenecks and leaves expensive compute resources sitting idle between transformation runs.
With ELT, you:
- Load raw data quickly into cheap storage (S3)
- Transform on-demand when you actually need it
- Scale compute independently from storage
- Preserve raw data for future use cases you haven't thought of yet
Figure 1: High-level architecture of an AWS ELT pipeline
The AWS ELT Stack
Our pipeline uses three core AWS services:
- Amazon S3: Object storage for raw and transformed data
- AWS Glue: Serverless ETL service for data transformation
- Amazon Athena: Serverless SQL query engine over S3 data
Together, they form a fully serverless, pay-per-use data pipeline that scales from gigabytes to petabytes.
Figure 2: How S3, Glue, and Athena work together
Architecture Overview
Here's how data flows through our pipeline:
Source Systems → S3 (Raw Zone) → Glue (Transform) → S3 (Curated Zone) → Athena (Query)
Data Zones in S3
We organize S3 buckets into logical zones:
- Raw Zone: s3://your-bucket/raw/ - Unprocessed source data
- Staging Zone: s3://your-bucket/staging/ - Intermediate transformations
- Curated Zone: s3://your-bucket/curated/ - Final, query-ready data
This separation provides:
- Data lineage: You can always trace back to source
- Reprocessing: Easy to rebuild curated data from raw
- Compliance: Raw data preserved for audit requirements
Figure 3: S3 bucket structure with data zones
Step 1: Setting Up S3 Buckets
First, create your S3 bucket with proper organization:
aws s3 mb s3://your-company-data-lake --region us-east-1
Then create the zone structure:
aws s3api put-object --bucket your-company-data-lake --key raw/
aws s3api put-object --bucket your-company-data-lake --key staging/
aws s3api put-object --bucket your-company-data-lake --key curated/
Critical S3 Settings:
- Versioning: Enable versioning for data recovery
- Lifecycle Policies: Move old data to Glacier after 90 days
- Encryption: Enable SSE-S3 or SSE-KMS
- Block Public Access: Always enabled for data lakes
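These settings can also be applied with boto3 rather than the console. A minimal sketch, assuming the bucket name from the commands above (the lifecycle rule encodes the 90-day Glacier transition for the raw zone):

```python
# Request parameters for the critical settings (pure data, applied by the function below)
VERSIONING = {"Status": "Enabled"}
ENCRYPTION = {
    "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
}
PUBLIC_ACCESS_BLOCK = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}
LIFECYCLE = {
    "Rules": [{
        "ID": "raw-to-glacier",
        "Status": "Enabled",
        "Filter": {"Prefix": "raw/"},
        "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
    }]
}

def harden_bucket(bucket: str = "your-company-data-lake") -> None:
    """Apply versioning, default encryption, public-access block, and lifecycle."""
    import boto3  # client created at call time
    s3 = boto3.client("s3")
    s3.put_bucket_versioning(Bucket=bucket, VersioningConfiguration=VERSIONING)
    s3.put_bucket_encryption(Bucket=bucket, ServerSideEncryptionConfiguration=ENCRYPTION)
    s3.put_public_access_block(Bucket=bucket, PublicAccessBlockConfiguration=PUBLIC_ACCESS_BLOCK)
    s3.put_bucket_lifecycle_configuration(Bucket=bucket, LifecycleConfiguration=LIFECYCLE)
```

Swap the SSEAlgorithm for "aws:kms" if you want SSE-KMS with a customer-managed key.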
Figure 4: S3 bucket configuration in AWS Console
Step 2: Creating AWS Glue Crawlers
Glue Crawlers automatically discover your data schema. They scan S3 and create or update the Glue Data Catalog.
Setting Up a Crawler
1. Navigate to AWS Glue Console → Crawlers → Add crawler
2. Configure Data Source: Point to your S3 raw zone
   - Path: s3://your-company-data-lake/raw/source-system/
3. IAM Role: Create a role with S3 and Glue permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::your-company-data-lake/*"
    },
    {
      "Effect": "Allow",
      "Action": ["glue:*"],
      "Resource": "*"
    }
  ]
}
4. Output: Create database raw_data and table source_system
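The same crawler can be defined with boto3 instead of the console. A sketch, assuming the database and path from the steps above (the role name GlueCrawlerRole is illustrative):

```python
# Crawler definition mirroring the console steps above
CRAWLER = {
    "Name": "raw-data-crawler",
    "Role": "GlueCrawlerRole",  # illustrative name for the IAM role created above
    "DatabaseName": "raw_data",
    "Targets": {"S3Targets": [{"Path": "s3://your-company-data-lake/raw/source-system/"}]},
    "Schedule": "cron(0 * * * ? *)",  # hourly; omit this key for on-demand runs
}

def create_crawler() -> None:
    import boto3
    glue = boto3.client("glue")
    glue.create_crawler(**CRAWLER)
```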
Figure 5: Configuring a Glue Crawler in AWS Console
Running the Crawler
Crawlers can run on schedule or on-demand. For production, schedule them:
- Hourly: For high-frequency data sources
- Daily: For batch processing
- On S3 Event: Trigger via EventBridge when new files arrive
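For on-demand runs, a small script can start the crawler and poll until it returns to the READY state. A sketch using the crawler name from above:

```python
import time

def run_crawler(name: str = "raw-data-crawler", poll_seconds: int = 30) -> str:
    """Start a crawler, block until it finishes, and return the last crawl status."""
    import boto3
    glue = boto3.client("glue")
    glue.start_crawler(Name=name)
    while True:
        state = glue.get_crawler(Name=name)["Crawler"]["State"]
        if state == "READY":  # crawler finished and is idle again
            break
        time.sleep(poll_seconds)
    return glue.get_crawler(Name=name)["Crawler"]["LastCrawl"]["Status"]
```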
Figure 6: Setting up crawler schedule
Step 3: Building Glue ETL Jobs
Glue ETL Jobs transform your data. You can write them in Python or Scala, or use the visual editor.
Python Glue Job Example
import sys
from datetime import datetime

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME', 'input_path', 'output_path'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read from the Glue Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="raw_data",
    table_name="source_system"
)

# Transform: clean and enrich each record
def clean_record(record):
    # Standardize formats and stamp the processing time
    record["processed_date"] = datetime.now().isoformat()
    return record

cleaned = Map.apply(
    frame=datasource,
    f=clean_record,
    transformation_ctx="cleaned"
)

# Write to the curated zone as partitioned Parquet
glueContext.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={
        "path": args['output_path'],
        "partitionKeys": ["year", "month", "day"]
    },
    format="parquet"
)

job.commit()
Figure 7: Glue ETL job code in AWS Console
Job Configuration
- Worker Type: G.1X (1 DPU per worker: 4 vCPU, 16 GB) for most workloads
- Number of Workers: Start with 2, scale based on data volume
- Job Timeout: Set to 2 hours for safety
- Glue Version: Use latest (4.0+) for Python 3.10 support
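The configuration above maps directly onto the Glue CreateJob API. A boto3 sketch, where the role name and script location are illustrative placeholders:

```python
# Job definition matching the configuration list above
JOB_CONFIG = {
    "Name": "transform-job",
    "Role": "GlueJobRole",  # illustrative IAM role name
    "Command": {
        "Name": "glueetl",  # Spark ETL job
        "ScriptLocation": "s3://your-company-data-lake/scripts/transform_job.py",  # assumed path
        "PythonVersion": "3",
    },
    "GlueVersion": "4.0",
    "WorkerType": "G.1X",
    "NumberOfWorkers": 2,
    "Timeout": 120,     # minutes ("2 hours for safety")
    "MaxRetries": 1,    # retry transient failures once
}

def create_job() -> None:
    import boto3
    boto3.client("glue").create_job(**JOB_CONFIG)
```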
Figure 8: Glue job configuration settings
Step 4: Querying with Amazon Athena
Once data is in S3 (raw or curated), Athena lets you query it with standard SQL.
Setting Up Athena
1. Create Database:
CREATE DATABASE IF NOT EXISTS curated_data
LOCATION 's3://your-company-data-lake/curated/';
2. Create Table (if not using the Glue Catalog):
CREATE EXTERNAL TABLE curated_data.transactions (
  transaction_id string,
  customer_id string,
  amount decimal(10,2),
  transaction_date timestamp
)
PARTITIONED BY (year int, month int, day int)
STORED AS PARQUET
LOCATION 's3://your-company-data-lake/curated/transactions/';
3. Load Partitions:
MSCK REPAIR TABLE curated_data.transactions;
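Athena queries can also be submitted programmatically. A sketch with boto3, polling until the query reaches a terminal state (the athena-results/ output prefix is an assumption; Athena requires some output location for results):

```python
QUERY = """
SELECT customer_id, SUM(amount) AS total
FROM curated_data.transactions
WHERE year = 2025 AND month = 1
GROUP BY customer_id
"""

def run_query(output: str = "s3://your-company-data-lake/athena-results/") -> str:
    """Submit the query, poll until it completes, and return the final state."""
    import time
    import boto3
    athena = boto3.client("athena")
    qid = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": "curated_data"},
        ResultConfiguration={"OutputLocation": output},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(
            QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(2)
```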
Figure 9: Running queries in Athena
Query Optimization Tips
- Use Partitioning: Always partition by date columns
- Use Columnar Formats: Parquet or ORC, not CSV
- Limit Columns: SELECT only what you need
- Use WHERE Early: Filter before joins
-- Good: Partitioned and filtered
SELECT customer_id, SUM(amount) as total
FROM curated_data.transactions
WHERE year = 2025 AND month = 1
GROUP BY customer_id;
-- Bad: Full table scan
SELECT * FROM curated_data.transactions;
Figure 10: Query results and performance metrics
Orchestrating the Pipeline
For production, you need orchestration. Options:
Option 1: AWS Glue Workflows
Visual workflow builder in Glue Console:
- Create workflow
- Add triggers (crawler → ETL job → crawler)
- Set dependencies
- Schedule or trigger on events
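The four console steps above can be scripted with boto3: a workflow, a scheduled trigger that starts the crawler, and a conditional trigger that runs the ETL job when the crawl succeeds. A sketch, reusing the crawler and job names from earlier (the cron expression is an assumed example):

```python
def build_workflow() -> None:
    """Create a workflow: daily schedule -> crawler -> (on success) ETL job."""
    import boto3
    glue = boto3.client("glue")
    glue.create_workflow(Name="elt-pipeline")
    # Scheduled trigger that starts the crawler
    glue.create_trigger(
        Name="start-crawl",
        WorkflowName="elt-pipeline",
        Type="SCHEDULED",
        Schedule="cron(0 2 * * ? *)",  # 02:00 UTC daily (assumed example)
        Actions=[{"CrawlerName": "raw-data-crawler"}],
        StartOnCreation=True,
    )
    # Conditional trigger: run the ETL job once the crawl succeeds
    glue.create_trigger(
        Name="run-transform",
        WorkflowName="elt-pipeline",
        Type="CONDITIONAL",
        Predicate={"Conditions": [{
            "LogicalOperator": "EQUALS",
            "CrawlerName": "raw-data-crawler",
            "CrawlState": "SUCCEEDED",
        }]},
        Actions=[{"JobName": "transform-job"}],
        StartOnCreation=True,
    )
```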
Figure 11: Glue workflow visual diagram
Option 2: EventBridge + Step Functions
More control and better error handling. (Note: Step Functions has no .sync integration for starting a crawler, so the sketch below fires the crawl via the SDK integration and returns immediately; a production state machine would poll glue:GetCrawler in a Wait loop before moving on to the transform step.)
{
  "Comment": "ELT Pipeline Orchestration",
  "StartAt": "RunCrawler",
  "States": {
    "RunCrawler": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
      "Parameters": {
        "Name": "raw-data-crawler"
      },
      "Next": "TransformData"
    },
    "TransformData": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "transform-job"
      },
      "End": true
    }
  }
}
Figure 12: Step Functions state machine for pipeline orchestration
Cost Optimization
This serverless stack is cost-effective, but costs can spiral if not managed:
S3 Costs
- Storage: $0.023/GB/month (Standard)
- Requests: $0.0004 per 1,000 GET requests
- Optimization: Use lifecycle policies to move to Glacier
Glue Costs
- Crawler: $0.44 per DPU-hour (minimum 10 minutes)
- ETL Job: $0.44 per DPU-hour
- Optimization: Right-size workers; use the Flex execution class for non-urgent jobs
Athena Costs
- Query: $5 per TB scanned
- Optimization:
- Use Parquet (columnar, compressed)
- Partition tables
- Use column projection
Real Example: A 100GB daily pipeline costs approximately:
- S3 Storage: $2.30/month
- Glue (1 hour/day): $13.20/month
- Athena (10 queries/day, 1 GB scanned each, ~300 GB/month): ~$1.50/month
- Total: ~$17/month
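The estimate above can be sanity-checked with a small cost model using the unit prices listed (30-day month assumed):

```python
# Unit prices from the sections above (us-east-1, subject to change)
GB_MONTH_STANDARD = 0.023   # S3 Standard, $/GB-month
DPU_HOUR = 0.44             # Glue, $/DPU-hour
ATHENA_PER_TB = 5.0         # Athena, $/TB scanned

def monthly_cost(storage_gb: float, glue_dpu_hours_per_day: float,
                 queries_per_day: int, gb_scanned_per_query: float,
                 days: int = 30) -> dict:
    """Rough monthly cost breakdown for the serverless ELT stack."""
    s3 = storage_gb * GB_MONTH_STANDARD
    glue = glue_dpu_hours_per_day * days * DPU_HOUR
    athena = queries_per_day * gb_scanned_per_query * days / 1024 * ATHENA_PER_TB
    return {"s3": round(s3, 2), "glue": round(glue, 2),
            "athena": round(athena, 2),
            "total": round(s3 + glue + athena, 2)}
```

For example, monthly_cost(100, 1, 10, 1) reproduces the 100 GB scenario above.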
Figure 13: Monthly cost breakdown for a typical ELT pipeline
Best Practices
- Schema Evolution: Use Glue Schema Registry for schema versioning
- Error Handling: Implement dead-letter queues for failed records
- Monitoring: Set up CloudWatch alarms for job failures
- Security: Encrypt data at rest and in transit, use IAM roles
- Testing: Test transformations on sample data before full runs
- Documentation: Document data lineage and transformation logic
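For the monitoring point, job-failure alerting can be wired up with an EventBridge rule on Glue's job state-change events routed to an SNS topic. A sketch (the topic ARN is a placeholder you supply):

```python
# Event pattern matching Glue job runs that fail or time out
FAILED_JOB_PATTERN = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {"state": ["FAILED", "TIMEOUT"]},
}

def create_failure_alert(topic_arn: str) -> None:
    """Route matching failure events to an SNS topic for notification."""
    import json
    import boto3
    events = boto3.client("events")
    events.put_rule(Name="glue-job-failures",
                    EventPattern=json.dumps(FAILED_JOB_PATTERN))
    events.put_targets(Rule="glue-job-failures",
                       Targets=[{"Id": "notify", "Arn": topic_arn}])
```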
Figure 14: CloudWatch dashboard for pipeline monitoring
Common Pitfalls
- Over-partitioning: Too many small partitions slow down queries
- Under-partitioning: Full table scans are expensive
- Ignoring File Sizes: Aim for 128MB-1GB Parquet files
- No Retry Logic: Transient failures need automatic retries
- Hardcoding Paths: Use parameterized jobs for flexibility
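The last point ties back to the job script in Step 3: getResolvedOptions reads input_path and output_path, so the paths can be supplied per run instead of hardcoded. A sketch (the day-partitioned input prefix is an assumed layout):

```python
def run_arguments(day: str) -> dict:
    """Per-run job arguments; Glue expects custom arguments with a leading '--'."""
    return {
        "--input_path": f"s3://your-company-data-lake/raw/source-system/{day}/",
        "--output_path": "s3://your-company-data-lake/curated/transactions/",
    }

def start_parameterized_run(day: str) -> str:
    """Start the transform job for one day's data; returns the run id."""
    import boto3
    run = boto3.client("glue").start_job_run(
        JobName="transform-job",
        Arguments=run_arguments(day),
    )
    return run["JobRunId"]
```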
Conclusion
AWS ELT pipelines with S3, Glue, and Athena provide a scalable, cost-effective foundation for modern data architectures. They're serverless, which means no infrastructure management, and they scale automatically with your data volume.
The key is starting simple: load raw data into S3, use Glue to transform it, and query with Athena. As your needs grow, add orchestration, monitoring, and optimization.
At Solardevs, we've built ELT pipelines processing terabytes daily for clients across industries. The pattern is proven, the tools are mature, and the costs are predictable.
Next Steps:
- Start with a single data source
- Automate one transformation
- Query the results
- Iterate and expand
The data lake is waiting.