Building an ELT Pipeline on AWS with S3, Glue, and Athena

Cloud · 12 min read · Jan 25, 2025

Learn how to build a scalable, serverless ELT pipeline on AWS with S3 for storage, Glue for transformation, and Athena for queries. Ideal for modern data architectures.
SolarDevs Team
Technical Leadership

Why ELT Over ETL?

Traditional ETL (Extract, Transform, Load) processes transform data before loading it into a data warehouse. ELT (Extract, Load, Transform) flips this approach: load first, transform later.

Why does this matter? Modern data volumes are massive. Transforming data before loading creates bottlenecks and ties up expensive compute resources that sit idle between transformation runs.

With ELT, you:

  • Load raw data quickly into cheap storage (S3)
  • Transform on-demand when you actually need it
  • Scale compute independently from storage
  • Preserve raw data for future use cases you haven't thought of yet

Figure 1: High-level architecture of an AWS ELT pipeline

The AWS ELT Stack

Our pipeline uses three core AWS services:

  1. Amazon S3: Object storage for raw and transformed data
  2. AWS Glue: Serverless ETL service for data transformation
  3. Amazon Athena: Serverless SQL query engine over S3 data

Together, they form a fully serverless, pay-per-use data pipeline that scales from gigabytes to petabytes.

Figure 2: How S3, Glue, and Athena work together

Architecture Overview

Here's how data flows through our pipeline:

Source Systems → S3 (Raw Zone) → Glue (Transform) → S3 (Curated Zone) → Athena (Query)

Data Zones in S3

We organize S3 buckets into logical zones:

  • Raw Zone: s3://your-bucket/raw/ - Unprocessed source data
  • Staging Zone: s3://your-bucket/staging/ - Intermediate transformations
  • Curated Zone: s3://your-bucket/curated/ - Final, query-ready data

This separation provides:

  • Data lineage: You can always trace back to source
  • Reprocessing: Easy to rebuild curated data from raw
  • Compliance: Raw data preserved for audit requirements

Figure 3: S3 bucket structure with data zones

Step 1: Setting Up S3 Buckets

First, create your S3 bucket with proper organization:

aws s3 mb s3://your-company-data-lake --region us-east-1

Then create the zone structure:

aws s3api put-object --bucket your-company-data-lake --key raw/
aws s3api put-object --bucket your-company-data-lake --key staging/
aws s3api put-object --bucket your-company-data-lake --key curated/

Critical S3 Settings (a boto3 sketch follows the list):

  1. Versioning: Enable versioning for data recovery
  2. Lifecycle Policies: Move old data to Glacier after 90 days
  3. Encryption: Enable SSE-S3 or SSE-KMS
  4. Block Public Access: Always enabled for data lakes
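
The same settings can also be applied from code. Below is a minimal boto3 sketch, assuming the bucket from above and a 90-day Glacier transition for the raw zone; the rule ID and prefixes are illustrative:

import boto3

s3 = boto3.client("s3")
bucket = "your-company-data-lake"  # bucket created in Step 1

# 1. Versioning: keep previous object versions for data recovery
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# 2. Lifecycle: move raw objects to Glacier after 90 days
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-to-glacier",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)

# 3. Encryption: default SSE-S3 for all new objects
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    },
)

# 4. Block all public access
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)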

Figure 4: S3 bucket configuration in the AWS Console

Step 2: Creating AWS Glue Crawlers

Glue Crawlers automatically discover your data schema. They scan S3 and create or update the Glue Data Catalog.

Setting Up a Crawler

  1. Navigate to AWS Glue Console → Crawlers → Add crawler

  2. Configure Data Source: Point to your S3 raw zone

    • Path: s3://your-company-data-lake/raw/source-system/
  3. IAM Role: Create a role with S3 and Glue permissions:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:GetObject",
            "s3:PutObject"
          ],
          "Resource": "arn:aws:s3:::your-company-data-lake/*"
        },
        {
          "Effect": "Allow",
          "Action": [
            "glue:*"
          ],
          "Resource": "*"
        }
      ]
    }
    
  4. Output: Create database raw_data and table source_system

Figure 5: Configuring a Glue Crawler in the AWS Console

Running the Crawler

Crawlers can run on a schedule or on demand. For production, schedule them (a boto3 sketch follows the list):

  • Hourly: For high-frequency data sources
  • Daily: For batch processing
  • On S3 Event: Trigger via EventBridge when new files arrive
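
If you prefer to script the crawler instead of using the console, here is a minimal boto3 sketch that creates an hourly crawler over the raw zone; the crawler name, role, and database are illustrative:

import boto3

glue = boto3.client("glue")

# Create a crawler over the raw zone, scheduled to run hourly
glue.create_crawler(
    Name="raw-data-crawler",
    Role="GlueDataLakeRole",  # IAM role from the previous step (illustrative name)
    DatabaseName="raw_data",
    Targets={"S3Targets": [{"Path": "s3://your-company-data-lake/raw/source-system/"}]},
    Schedule="cron(0 * * * ? *)",  # top of every hour
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
)

# Or kick it off on demand
glue.start_crawler(Name="raw-data-crawler")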

Figure 6: Setting up the crawler schedule

Step 3: Building Glue ETL Jobs

Glue ETL Jobs transform your data. You can write them in Python or Scala, or use the visual editor.

Python Glue Job Example

import sys
from datetime import datetime

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME', 'input_path', 'output_path'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read from Glue Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="raw_data",
    table_name="source_system"
)

# Transform: enrich each record with a processing timestamp
# (extend this function to drop nulls or standardize formats)
def clean_record(record):
    record["processed_date"] = datetime.now().isoformat()
    return record

cleaned = Map.apply(
    frame=datasource,
    f=clean_record,
    transformation_ctx="cleaned"
)

# Write to curated zone as partitioned Parquet (assumes year/month/day fields exist in the data)
glueContext.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={
        "path": args['output_path'],
        "partitionKeys": ["year", "month", "day"]
    },
    format="parquet"
)

job.commit()

Figure 7: Glue ETL job code in the AWS Console

Job Configuration

  • Worker Type: G.1X (1 DPU: 4 vCPU, 16 GB memory) for most workloads
  • Number of Workers: Start with 2, scale based on data volume
  • Job Timeout: Set to 2 hours for safety
  • Glue Version: Use the latest (4.0+) for Python 3.10 support (see the boto3 sketch after this list)
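
These settings map onto the Glue CreateJob API. A minimal boto3 sketch, assuming the job script has already been uploaded to S3; the script path, role, and argument names are illustrative:

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="transform-job",
    Role="GlueDataLakeRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://your-company-data-lake/scripts/transform_job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--input_path": "s3://your-company-data-lake/raw/source-system/",
        "--output_path": "s3://your-company-data-lake/curated/transactions/",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
    Timeout=120,  # minutes
)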

Figure 8: Glue job configuration settings

Step 4: Querying with Amazon Athena

Once data is in S3 (raw or curated), Athena lets you query it with standard SQL.

Setting Up Athena

  1. Create Database:

    CREATE DATABASE IF NOT EXISTS curated_data
    LOCATION 's3://your-company-data-lake/curated/';
    
  2. Create Table (if not using Glue Catalog):

    CREATE EXTERNAL TABLE curated_data.transactions (
      transaction_id string,
      customer_id string,
      amount decimal(10,2),
      transaction_date timestamp
    )
    PARTITIONED BY (year int, month int, day int)
    STORED AS PARQUET
    LOCATION 's3://your-company-data-lake/curated/transactions/';
    
  3. Load Partitions:

    MSCK REPAIR TABLE curated_data.transactions;
    

Figure 9: Running queries in Athena

Query Optimization Tips

  • Use Partitioning: Always partition by date columns
  • Use Columnar Formats: Parquet or ORC, not CSV
  • Limit Columns: SELECT only what you need
  • Use WHERE Early: Filter before joins

-- Good: Partitioned and filtered
SELECT customer_id, SUM(amount) as total
FROM curated_data.transactions
WHERE year = 2025 AND month = 1
GROUP BY customer_id;

-- Bad: Full table scan
SELECT * FROM curated_data.transactions;
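
The same queries can be run programmatically. A minimal boto3 sketch, assuming a results prefix in your bucket for Athena output; the query and output location are illustrative:

import time
import boto3

athena = boto3.client("athena")

# Submit the query; Athena is asynchronous, so poll until it finishes
execution = athena.start_query_execution(
    QueryString=(
        "SELECT customer_id, SUM(amount) AS total "
        "FROM curated_data.transactions "
        "WHERE year = 2025 AND month = 1 "
        "GROUP BY customer_id"
    ),
    QueryExecutionContext={"Database": "curated_data"},
    ResultConfiguration={"OutputLocation": "s3://your-company-data-lake/athena-results/"},
)
query_id = execution["QueryExecutionId"]

while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(f"Returned {len(rows) - 1} rows")  # first row is the header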

Figure 10: Query results and performance metrics

Orchestrating the Pipeline

For production, you need orchestration. Options:

Option 1: AWS Glue Workflows

The visual workflow builder in the Glue Console (the same setup can be scripted with boto3; see the sketch after this list):

  1. Create workflow
  2. Add triggers (crawler → ETL job → crawler)
  3. Set dependencies
  4. Schedule or trigger on events
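
Here is a boto3 sketch of the same workflow, assuming the crawler and job from the earlier steps already exist; the workflow and trigger names are illustrative:

import boto3

glue = boto3.client("glue")

# Container for the pipeline
glue.create_workflow(Name="elt-pipeline")

# Trigger 1: run the raw-zone crawler on a daily schedule
glue.create_trigger(
    Name="start-raw-crawler",
    WorkflowName="elt-pipeline",
    Type="SCHEDULED",
    Schedule="cron(0 6 * * ? *)",  # 06:00 UTC daily
    Actions=[{"CrawlerName": "raw-data-crawler"}],
    StartOnCreation=True,
)

# Trigger 2: run the ETL job once the crawler succeeds
glue.create_trigger(
    Name="start-transform-job",
    WorkflowName="elt-pipeline",
    Type="CONDITIONAL",
    Predicate={
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "CrawlerName": "raw-data-crawler",
                "CrawlState": "SUCCEEDED",
            }
        ]
    },
    Actions=[{"JobName": "transform-job"}],
    StartOnCreation=True,
)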

Figure 11: Glue workflow visual diagram

Option 2: EventBridge + Step Functions

More control and better error handling. Note that Step Functions has no synchronous (.sync) integration for starting a crawler, so the state machine below starts it through the AWS SDK integration; in production you would typically add a Wait/GetCrawler polling loop before the transform step:

{
  "Comment": "ELT Pipeline Orchestration",
  "StartAt": "RunCrawler",
  "States": {
    "RunCrawler": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startCrawler.sync",
      "Parameters": {
        "CrawlerName": "raw-data-crawler"
      },
      "Next": "TransformData"
    },
    "TransformData": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "transform-job"
      },
      "End": true
    }
  }
}

Figure 12: Step Functions state machine for pipeline orchestration

Cost Optimization

This serverless stack is cost-effective, but costs can spiral if not managed:

S3 Costs

  • Storage: $0.023/GB/month (Standard)
  • Requests: $0.0004 per 1,000 GET requests
  • Optimization: Use lifecycle policies to move to Glacier

Glue Costs

  • Crawler: $0.44 per DPU-hour (minimum 10 minutes)
  • ETL Job: $0.44 per DPU-hour
  • Optimization: Right-size workers; use the Flex execution class for non-urgent jobs

Athena Costs

  • Query: $5 per TB scanned
  • Optimization:
    • Use Parquet (columnar, compressed)
    • Partition tables
    • Use column projection

Real Example: A pipeline handling roughly 100 GB/day costs approximately:

  • S3 Storage (assuming ~100 GB stays in Standard, with older data tiered to Glacier): $2.30/month
  • Glue (~1 DPU-hour/day × 30 days × $0.44): $13.20/month
  • Athena (10 queries/day scanning ~1 GB each ≈ 0.3 TB/month): ~$1.50/month
  • Total: ~$17/month

Figure 13: Monthly cost breakdown for a typical ELT pipeline

Best Practices

  1. Schema Evolution: Use Glue Schema Registry for schema versioning
  2. Error Handling: Implement dead-letter queues for failed records
  3. Monitoring: Set up CloudWatch alarms for job failures (an EventBridge-based sketch follows this list)
  4. Security: Encrypt data at rest and in transit, use IAM roles
  5. Testing: Test transformations on sample data before full runs
  6. Documentation: Document data lineage and transformation logic
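
As a sketch of the monitoring practice above: Glue publishes job state changes to EventBridge, so one lightweight option is a rule that notifies an SNS topic on failure. The topic ARN below is a placeholder you would replace:

import json
import boto3

events = boto3.client("events")

# Match failed (or timed-out) runs of the transform job
events.put_rule(
    Name="glue-transform-job-failed",
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {
            "jobName": ["transform-job"],
            "state": ["FAILED", "TIMEOUT"],
        },
    }),
    State="ENABLED",
)

# Send matching events to an SNS topic (placeholder ARN)
events.put_targets(
    Rule="glue-transform-job-failed",
    Targets=[{"Id": "notify-oncall", "Arn": "arn:aws:sns:us-east-1:123456789012:data-alerts"}],
)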

Figure 14: CloudWatch dashboard for pipeline monitoring

Common Pitfalls

  1. Over-partitioning: Too many small partitions slow down queries
  2. Under-partitioning: Full table scans are expensive
  3. Ignoring File Sizes: Aim for 128MB-1GB Parquet files
  4. No Retry Logic: Transient failures need automatic retries
  5. Hardcoding Paths: Use parameterized jobs for flexibility

Conclusion

AWS ELT pipelines with S3, Glue, and Athena provide a scalable, cost-effective foundation for modern data architectures. They're serverless, which means no infrastructure management, and they scale automatically with your data volume.

The key is starting simple: load raw data into S3, use Glue to transform it, and query with Athena. As your needs grow, add orchestration, monitoring, and optimization.

At Solardevs, we've built ELT pipelines processing terabytes daily for clients across industries. The pattern is proven, the tools are mature, and the costs are predictable.

Next Steps:

  • Start with a single data source
  • Automate one transformation
  • Query the results
  • Iterate and expand

The data lake is waiting.
