
Building an ELT Pipeline on AWS with S3, Glue, and Athena
Why ELT Over ETL?
Traditional ETL (Extract, Transform, Load) processes transform data before loading it into a data warehouse. ELT (Extract, Load, Transform) flips this approach: load first, transform later.
Why does this matter? Modern data volumes are massive. Transforming data before loading creates bottlenecks and leaves expensive compute resources sitting idle between transformation runs.
With ELT, you:
- Load raw data quickly into cheap storage (S3)
- Transform on-demand when you actually need it
- Scale compute independently from storage
- Preserve raw data for future use cases you haven't thought of yet
Figure 1: High-level architecture of an AWS ELT pipeline
The AWS ELT Stack
Our pipeline uses three core AWS services:
- Amazon S3: Object storage for raw and transformed data
- AWS Glue: Serverless ETL service for data transformation
- Amazon Athena: Serverless SQL query engine over S3 data
Together, they form a fully serverless, pay-per-use data pipeline that scales from gigabytes to petabytes.
Figure 2: How S3, Glue, and Athena work together
Architecture Overview
Here's how data flows through our pipeline:
Source Systems → S3 (Raw Zone) → Glue (Transform) → S3 (Curated Zone) → Athena (Query)
Data Zones in S3
We organize S3 buckets into logical zones:
- Raw Zone: s3://your-bucket/raw/ - Unprocessed source data
- Staging Zone: s3://your-bucket/staging/ - Intermediate transformations
- Curated Zone: s3://your-bucket/curated/ - Final, query-ready data
This separation provides:
- Data lineage: You can always trace back to source
- Reprocessing: Easy to rebuild curated data from raw
- Compliance: Raw data preserved for audit requirements
Figure 3: S3 bucket structure with data zones
Step 1: Setting Up S3 Buckets
First, create your S3 bucket with proper organization:
aws s3 mb s3://your-company-data-lake --region us-east-1
Then create the zone structure:
aws s3api put-object --bucket your-company-data-lake --key raw/
aws s3api put-object --bucket your-company-data-lake --key staging/
aws s3api put-object --bucket your-company-data-lake --key curated/
Critical S3 Settings:
- Versioning: Enable versioning for data recovery
- Lifecycle Policies: Move old data to Glacier after 90 days
- Encryption: Enable SSE-S3 or SSE-KMS
- Block Public Access: Always enabled for data lakes
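These settings can also be applied with boto3 rather than the console. A minimal sketch, assuming the bucket name from the commands above (the lifecycle rule encodes the 90-day Glacier transition for the raw zone):

```python
# Request parameters for the critical settings (pure data, applied by the function below)
VERSIONING = {"Status": "Enabled"}
ENCRYPTION = {
    "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
}
PUBLIC_ACCESS_BLOCK = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}
LIFECYCLE = {
    "Rules": [{
        "ID": "raw-to-glacier",
        "Status": "Enabled",
        "Filter": {"Prefix": "raw/"},
        "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
    }]
}

def harden_bucket(bucket: str = "your-company-data-lake") -> None:
    """Apply versioning, default encryption, public-access block, and lifecycle."""
    import boto3  # client created at call time
    s3 = boto3.client("s3")
    s3.put_bucket_versioning(Bucket=bucket, VersioningConfiguration=VERSIONING)
    s3.put_bucket_encryption(Bucket=bucket, ServerSideEncryptionConfiguration=ENCRYPTION)
    s3.put_public_access_block(Bucket=bucket, PublicAccessBlockConfiguration=PUBLIC_ACCESS_BLOCK)
    s3.put_bucket_lifecycle_configuration(Bucket=bucket, LifecycleConfiguration=LIFECYCLE)
```

Swap the SSEAlgorithm for "aws:kms" if you want SSE-KMS with a customer-managed key.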
Figure 4: S3 bucket configuration in AWS Console
Step 2: Creating AWS Glue Crawlers
Glue Crawlers automatically discover your data schema. They scan S3 and create or update the Glue Data Catalog.
Setting Up a Crawler
1. Navigate to AWS Glue Console → Crawlers → Add crawler
2. Configure Data Source: Point to your S3 raw zone
   - Path: s3://your-company-data-lake/raw/source-system/
3. IAM Role: Create a role with S3 and Glue permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::your-company-data-lake/*"
    },
    {
      "Effect": "Allow",
      "Action": ["glue:*"],
      "Resource": "*"
    }
  ]
}
4. Output: Create database raw_data and table source_system
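The same crawler can be defined with boto3 instead of the console. A sketch, assuming the database and path from the steps above (the role name GlueCrawlerRole is illustrative):

```python
# Crawler definition mirroring the console steps above
CRAWLER = {
    "Name": "raw-data-crawler",
    "Role": "GlueCrawlerRole",  # illustrative name for the IAM role created above
    "DatabaseName": "raw_data",
    "Targets": {"S3Targets": [{"Path": "s3://your-company-data-lake/raw/source-system/"}]},
    "Schedule": "cron(0 * * * ? *)",  # hourly; omit this key for on-demand runs
}

def create_crawler() -> None:
    import boto3
    glue = boto3.client("glue")
    glue.create_crawler(**CRAWLER)
```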
Figure 5: Configuring a Glue Crawler in AWS Console
Running the Crawler
Crawlers can run on schedule or on-demand. For production, schedule them:
- Hourly: For high-frequency data sources
- Daily: For batch processing
- On S3 Event: Trigger via EventBridge when new files arrive
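For on-demand runs, a small script can start the crawler and poll until it returns to the READY state. A sketch using the crawler name from above:

```python
import time

def run_crawler(name: str = "raw-data-crawler", poll_seconds: int = 30) -> str:
    """Start a crawler, block until it finishes, and return the last crawl status."""
    import boto3
    glue = boto3.client("glue")
    glue.start_crawler(Name=name)
    while True:
        state = glue.get_crawler(Name=name)["Crawler"]["State"]
        if state == "READY":  # crawler finished and is idle again
            break
        time.sleep(poll_seconds)
    return glue.get_crawler(Name=name)["Crawler"]["LastCrawl"]["Status"]
```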
Figure 6: Setting up crawler schedule
Step 3: Building Glue ETL Jobs
Glue ETL Jobs transform your data. You can write them in Python or Scala, or use the visual editor.
Python Glue Job Example
import sys
from datetime import datetime

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME', 'input_path', 'output_path'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read from the Glue Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="raw_data",
    table_name="source_system"
)

# Transform: clean and enrich each record
def clean_record(record):
    # Standardize formats and stamp the processing time
    record["processed_date"] = datetime.now().isoformat()
    return record

cleaned = Map.apply(
    frame=datasource,
    f=clean_record,
    transformation_ctx="cleaned"
)

# Write to the curated zone as partitioned Parquet
glueContext.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={
        "path": args['output_path'],
        "partitionKeys": ["year", "month", "day"]
    },
    format="parquet"
)

job.commit()
Figure 7: Glue ETL job code in AWS Console
Job Configuration
- Worker Type: G.1X (1 DPU per worker: 4 vCPU, 16 GB) for most workloads
- Number of Workers: Start with 2, scale based on data volume
- Job Timeout: Set to 2 hours for safety
- Glue Version: Use latest (4.0+) for Python 3.10 support
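The configuration above maps directly onto the Glue CreateJob API. A boto3 sketch, where the role name and script location are illustrative placeholders:

```python
# Job definition matching the configuration list above
JOB_CONFIG = {
    "Name": "transform-job",
    "Role": "GlueJobRole",  # illustrative IAM role name
    "Command": {
        "Name": "glueetl",  # Spark ETL job
        "ScriptLocation": "s3://your-company-data-lake/scripts/transform_job.py",  # assumed path
        "PythonVersion": "3",
    },
    "GlueVersion": "4.0",
    "WorkerType": "G.1X",
    "NumberOfWorkers": 2,
    "Timeout": 120,     # minutes ("2 hours for safety")
    "MaxRetries": 1,    # retry transient failures once
}

def create_job() -> None:
    import boto3
    boto3.client("glue").create_job(**JOB_CONFIG)
```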
Figure 8: Glue job configuration settings
Step 4: Querying with Amazon Athena
Once data is in S3 (raw or curated), Athena lets you query it with standard SQL.
Setting Up Athena
1. Create Database:
CREATE DATABASE IF NOT EXISTS curated_data
LOCATION 's3://your-company-data-lake/curated/';
2. Create Table (if not using the Glue Catalog):
CREATE EXTERNAL TABLE curated_data.transactions (
  transaction_id string,
  customer_id string,
  amount decimal(10,2),
  transaction_date timestamp
)
PARTITIONED BY (year int, month int, day int)
STORED AS PARQUET
LOCATION 's3://your-company-data-lake/curated/transactions/';
3. Load Partitions:
MSCK REPAIR TABLE curated_data.transactions;
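Athena queries can also be submitted programmatically. A sketch with boto3, polling until the query reaches a terminal state (the athena-results/ output prefix is an assumption; Athena requires some output location for results):

```python
QUERY = """
SELECT customer_id, SUM(amount) AS total
FROM curated_data.transactions
WHERE year = 2025 AND month = 1
GROUP BY customer_id
"""

def run_query(output: str = "s3://your-company-data-lake/athena-results/") -> str:
    """Submit the query, poll until it completes, and return the final state."""
    import time
    import boto3
    athena = boto3.client("athena")
    qid = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": "curated_data"},
        ResultConfiguration={"OutputLocation": output},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(
            QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(2)
```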
Figure 9: Running queries in Athena
Query Optimization Tips
- Use Partitioning: Always partition by date columns
- Use Columnar Formats: Parquet or ORC, not CSV
- Limit Columns: SELECT only what you need
- Use WHERE Early: Filter before joins
-- Good: Partitioned and filtered
SELECT customer_id, SUM(amount) as total
FROM curated_data.transactions
WHERE year = 2025 AND month = 1
GROUP BY customer_id;
-- Bad: Full table scan
SELECT * FROM curated_data.transactions;
Figure 10: Query results and performance metrics
Orchestrating the Pipeline
For production, you need orchestration. Options:
Option 1: AWS Glue Workflows
Visual workflow builder in Glue Console:
- Create workflow
- Add triggers (crawler → ETL job → crawler)
- Set dependencies
- Schedule or trigger on events
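The four console steps above can be scripted with boto3: a workflow, a scheduled trigger that starts the crawler, and a conditional trigger that runs the ETL job when the crawl succeeds. A sketch, reusing the crawler and job names from earlier (the cron expression is an assumed example):

```python
def build_workflow() -> None:
    """Create a workflow: daily schedule -> crawler -> (on success) ETL job."""
    import boto3
    glue = boto3.client("glue")
    glue.create_workflow(Name="elt-pipeline")
    # Scheduled trigger that starts the crawler
    glue.create_trigger(
        Name="start-crawl",
        WorkflowName="elt-pipeline",
        Type="SCHEDULED",
        Schedule="cron(0 2 * * ? *)",  # 02:00 UTC daily (assumed example)
        Actions=[{"CrawlerName": "raw-data-crawler"}],
        StartOnCreation=True,
    )
    # Conditional trigger: run the ETL job once the crawl succeeds
    glue.create_trigger(
        Name="run-transform",
        WorkflowName="elt-pipeline",
        Type="CONDITIONAL",
        Predicate={"Conditions": [{
            "LogicalOperator": "EQUALS",
            "CrawlerName": "raw-data-crawler",
            "CrawlState": "SUCCEEDED",
        }]},
        Actions=[{"JobName": "transform-job"}],
        StartOnCreation=True,
    )
```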
Figure 11: Glue workflow visual diagram
Option 2: EventBridge + Step Functions
More control and better error handling. (Note: Step Functions has no .sync integration for starting a crawler, so the sketch below fires the crawl via the SDK integration and returns immediately; a production state machine would poll glue:GetCrawler in a Wait loop before moving on to the transform step.)
{
  "Comment": "ELT Pipeline Orchestration",
  "StartAt": "RunCrawler",
  "States": {
    "RunCrawler": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
      "Parameters": {
        "Name": "raw-data-crawler"
      },
      "Next": "TransformData"
    },
    "TransformData": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "transform-job"
      },
      "End": true
    }
  }
}
Figure 12: Step Functions state machine for pipeline orchestration
Cost Optimization
This serverless stack is cost-effective, but costs can spiral if not managed:
S3 Costs
- Storage: $0.023/GB/month (Standard)
- Requests: $0.0004 per 1,000 GET requests
- Optimization: Use lifecycle policies to move to Glacier
Glue Costs
- Crawler: $0.44 per DPU-hour (minimum 10 minutes)
- ETL Job: $0.44 per DPU-hour
- Optimization: Right-size workers; use the Flex execution class for non-urgent jobs
Athena Costs
- Query: $5 per TB scanned
- Optimization:
- Use Parquet (columnar, compressed)
- Partition tables
- Use column projection
Real Example: A 100GB daily pipeline costs approximately:
- S3 Storage: $2.30/month
- Glue (1 hour/day): $13.20/month
- Athena (10 queries/day, 1 GB scanned each, ~300 GB/month): ~$1.50/month
- Total: ~$17/month
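The estimate above can be sanity-checked with a small cost model using the unit prices listed (30-day month assumed):

```python
# Unit prices from the sections above (us-east-1, subject to change)
GB_MONTH_STANDARD = 0.023   # S3 Standard, $/GB-month
DPU_HOUR = 0.44             # Glue, $/DPU-hour
ATHENA_PER_TB = 5.0         # Athena, $/TB scanned

def monthly_cost(storage_gb: float, glue_dpu_hours_per_day: float,
                 queries_per_day: int, gb_scanned_per_query: float,
                 days: int = 30) -> dict:
    """Rough monthly cost breakdown for the serverless ELT stack."""
    s3 = storage_gb * GB_MONTH_STANDARD
    glue = glue_dpu_hours_per_day * days * DPU_HOUR
    athena = queries_per_day * gb_scanned_per_query * days / 1024 * ATHENA_PER_TB
    return {"s3": round(s3, 2), "glue": round(glue, 2),
            "athena": round(athena, 2),
            "total": round(s3 + glue + athena, 2)}
```

For example, monthly_cost(100, 1, 10, 1) reproduces the 100 GB scenario above.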
Figure 13: Monthly cost breakdown for a typical ELT pipeline
Best Practices
- Schema Evolution: Use Glue Schema Registry for schema versioning
- Error Handling: Implement dead-letter queues for failed records
- Monitoring: Set up CloudWatch alarms for job failures
- Security: Encrypt data at rest and in transit, use IAM roles
- Testing: Test transformations on sample data before full runs
- Documentation: Document data lineage and transformation logic
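For the monitoring point, job-failure alerting can be wired up with an EventBridge rule on Glue's job state-change events routed to an SNS topic. A sketch (the topic ARN is a placeholder you supply):

```python
# Event pattern matching Glue job runs that fail or time out
FAILED_JOB_PATTERN = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {"state": ["FAILED", "TIMEOUT"]},
}

def create_failure_alert(topic_arn: str) -> None:
    """Route matching failure events to an SNS topic for notification."""
    import json
    import boto3
    events = boto3.client("events")
    events.put_rule(Name="glue-job-failures",
                    EventPattern=json.dumps(FAILED_JOB_PATTERN))
    events.put_targets(Rule="glue-job-failures",
                       Targets=[{"Id": "notify", "Arn": topic_arn}])
```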
Figure 14: CloudWatch dashboard for pipeline monitoring
Common Pitfalls
- Over-partitioning: Too many small partitions slow down queries
- Under-partitioning: Full table scans are expensive
- Ignoring File Sizes: Aim for 128MB-1GB Parquet files
- No Retry Logic: Transient failures need automatic retries
- Hardcoding Paths: Use parameterized jobs for flexibility
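The last point ties back to the job script in Step 3: getResolvedOptions reads input_path and output_path, so the paths can be supplied per run instead of hardcoded. A sketch (the day-partitioned input prefix is an assumed layout):

```python
def run_arguments(day: str) -> dict:
    """Per-run job arguments; Glue expects custom arguments with a leading '--'."""
    return {
        "--input_path": f"s3://your-company-data-lake/raw/source-system/{day}/",
        "--output_path": "s3://your-company-data-lake/curated/transactions/",
    }

def start_parameterized_run(day: str) -> str:
    """Start the transform job for one day's data; returns the run id."""
    import boto3
    run = boto3.client("glue").start_job_run(
        JobName="transform-job",
        Arguments=run_arguments(day),
    )
    return run["JobRunId"]
```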
Conclusion
AWS ELT pipelines with S3, Glue, and Athena provide a scalable, cost-effective foundation for modern data architectures. They're serverless, which means no infrastructure management, and they scale automatically with your data volume.
The key is starting simple: load raw data into S3, use Glue to transform it, and query with Athena. As your needs grow, add orchestration, monitoring, and optimization.
At Solardevs, we've built ELT pipelines processing terabytes daily for clients across industries. The pattern is proven, the tools are mature, and the costs are predictable.
Next Steps:
- Start with a single data source
- Automate one transformation
- Query the results
- Iterate and expand
The data lake is waiting.