Building a Serverless ETL Pipeline with AWS Glue: Integrating DynamoDB, S3, and OpenSearch

Overview

AWS Glue is a fully managed ETL (Extract, Transform, Load) service designed to simplify data integration. It automates much of the effort in building, maintaining, and running ETL jobs by providing a serverless environment, built-in transformation logic, and native integration with other AWS services like S3, DynamoDB, and OpenSearch.

In this article, we’ll break down how AWS Glue works and walk through a detailed use case: syncing data from DynamoDB to S3, transforming it, and indexing the results into Amazon OpenSearch Service for real-time querying.

What is AWS Glue?

AWS Glue is a serverless data integration service that lets you discover, catalog, clean, enrich, and move data between different sources. It’s designed to help you build and manage ETL (Extract, Transform, Load) pipelines at scale without needing to manage infrastructure.

Glue is built for structured, semi-structured, and unstructured data — ideal for analytics, data lakes, machine learning, and search applications.

Core Components of AWS Glue

  1. Glue Data Catalog
    The Data Catalog is a centralized metadata repository. It stores:

    • Table definitions (schema, location, format)

    • Partition info

    • Job metadata

    • Connection definitions (for sources/destinations)

    It integrates natively with services like Amazon Athena, Redshift Spectrum, and EMR.

  2. Glue Crawlers
    Crawlers connect to your data sources (S3, RDS, DynamoDB, JDBC, etc.), infer schema, and populate the Data Catalog automatically. They detect changes like schema drift and can update metadata without manual intervention.

  3. Glue Jobs
    These are scripts (written in PySpark, Scala, or via visual Glue Studio) that extract data from sources, apply transformations, and write the results to targets (e.g., S3, Redshift, OpenSearch).

    Glue jobs support:

    • Built-in transformations (e.g., mapping, filtering, joins, flattening)

    • Custom logic using Python or Spark APIs

    • Bookmarks to track processing state and support incremental loads

    • Job triggers and retries

    • Parallelism and partitioning for performance

  4. Glue Studio
    A no-code/low-code visual IDE to build, test, and run jobs using drag-and-drop components. Great for users unfamiliar with Spark.

  5. Glue Workflows and Triggers
    These allow you to chain multiple jobs, crawlers, and actions in a directed graph. You can run them on a schedule, on job completion, or via events — enabling full orchestration of ETL pipelines.

  6. Glue Streaming ETL
    For processing real-time streaming data from Kinesis Data Streams or Kafka. It enables sub-minute latency ETL — ideal for fraud detection, log monitoring, etc.

  7. Glue Libraries and DynamicFrames
    Glue provides DynamicFrames (an abstraction over Spark DataFrames) tailored for semi-structured data (e.g., nested JSON, Avro). They make it easier to transform and process complex data structures with less code.
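
To make the DynamicFrame abstraction concrete, here is a minimal sketch of reading a catalog table and resolving a field that shows up with conflicting types across records. The database, table, and nested field names here are placeholders for illustration:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read a catalog table whose semi-structured records may disagree on field types
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="logs_db",      # placeholder database
    table_name="applogs"     # placeholder table
)

# resolveChoice settles fields that appear with more than one type across records
resolved = dyf.resolveChoice(specs=[("metadata.status_code", "cast:long")])

# Convert to a plain Spark DataFrame only when full Spark SQL functionality is needed
df = resolved.toDF()
df.printSchema()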

Key Features

  • Serverless: No infrastructure to manage; Glue automatically provisions and scales compute resources (called DPUs — Data Processing Units).

  • Automatic Schema Discovery: With crawlers and schema inference.

  • Supports Popular Formats: JSON, CSV, Parquet, Avro, ORC, XML, and more.

  • Built-in Connectors: To S3, RDS, Redshift, DynamoDB, JDBC, Kafka, MongoDB, OpenSearch, and others.

  • Pay-As-You-Go: Charged per DPU-hour. Idle time isn’t billed.

How Glue Fits in the Data Stack

AWS Glue acts as the ETL backbone in modern data architectures:

  • Data Lake Ingestion: Raw data → S3 → Parquet

  • Data Warehouse Loads: S3 → Redshift

  • Search Indexing: Raw events → Cleaned JSON → OpenSearch

  • Machine Learning Pipelines: Data cleaning and feature engineering for SageMaker

  • Log Aggregation & Analytics: DynamoDB/Kinesis logs → Transformed → OpenSearch/QuickSight

The Use Case

Let’s say you're building a log analysis system. Here's the flow:

  1. DynamoDB stores real-time log events from various microservices.

  2. You want to archive those logs in S3 for long-term storage and analytics.

  3. You need to transform and index those logs in OpenSearch for full-text search and dashboarding (e.g., via Kibana).

Goals:

  • Extract data periodically from DynamoDB.

  • Clean and format the data.

  • Store raw and transformed data in S3.

  • Index transformed data into OpenSearch for querying.

Step-by-Step Implementation

1. Set Up DynamoDB Table

Create a table, e.g., AppLogs, with:

  • log_id (partition key)

  • timestamp

  • service_name

  • log_level

  • message

  • metadata (map)

Populate it with test data using a Lambda function, the AWS CLI, or an SDK directly.
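
For example, here is a minimal boto3 sketch that creates the AppLogs table and writes one test item; the region and the sample field values are placeholders:

import time
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")  # placeholder region

# Create the AppLogs table with log_id as the partition key, using on-demand capacity
table = dynamodb.create_table(
    TableName="AppLogs",
    KeySchema=[{"AttributeName": "log_id", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "log_id", "AttributeType": "S"}],
    BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()

# Insert one test log event matching the schema above
table.put_item(Item={
    "log_id": "log-0001",
    "timestamp": int(time.time()),
    "service_name": "Checkout",
    "log_level": "ERROR",
    "message": "Payment gateway timeout",
    "metadata": {"request_id": "abc-123", "status_code": 504},
})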

2. Configure AWS Glue

a. Crawler for DynamoDB

  • Set up a crawler that reads from the AppLogs table.

  • Output to the Glue Data Catalog as a table.

  • Schedule it periodically to reflect schema changes if needed.
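
If you prefer to script the crawler rather than use the console, a boto3 sketch along these lines should work; the crawler name, IAM role ARN, and catalog database name are placeholders:

import boto3

glue = boto3.client("glue")

# Crawler that scans the AppLogs DynamoDB table and registers its schema
# in the logs_db database of the Glue Data Catalog
glue.create_crawler(
    Name="applogs-crawler",                                 # placeholder name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role ARN
    DatabaseName="logs_db",
    Targets={"DynamoDBTargets": [{"Path": "AppLogs"}]},
    Schedule="cron(0 * * * ? *)",                           # optional: run hourly
)

glue.start_crawler(Name="applogs-crawler")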

b. Glue Job to Extract and Transform

Create a new Glue job with the following parameters:

  • Source: The catalog table mapped to DynamoDB.

  • Script Language: Python (PySpark).

  • Target 1: S3 bucket for raw data (s3://your-bucket/logs/raw/)

  • Target 2: S3 bucket for cleaned data (s3://your-bucket/logs/cleaned/)

  • Target 3: OpenSearch index (via the optional OpenSearch connector, or by calling the REST API with the requests library, as in the sample script below).

c. Sample ETL Script (Simplified)

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Load from DynamoDB (via Catalog)
dyf = glueContext.create_dynamic_frame.from_catalog(database="logs_db", table_name="applogs")

# Write raw data to S3
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://your-bucket/logs/raw/"},
    format="json"
)

# Clean and transform logs
def clean_logs(rec):
    rec['service_name'] = rec['service_name'].lower()
    rec['timestamp'] = int(rec['timestamp'])  # ensure consistent type
    return rec

mapped_dyf = Map.apply(frame=dyf, f=clean_logs)

# Save transformed data to S3
glueContext.write_dynamic_frame.from_options(
    frame=mapped_dyf,
    connection_type="s3",
    connection_options={"path": "s3://your-bucket/logs/cleaned/"},
    format="json"
)

# Index data into OpenSearch (runs on the Spark executors, one HTTP call per document)
def index_to_opensearch(partition):
    import json
    import requests  # import inside the function so it is available on every executor

    host = 'https://your-opensearch-domain.region.es.amazonaws.com'
    headers = {'Content-Type': 'application/json'}
    for row in partition:
        doc = json.dumps(row.asDict(recursive=True))  # Spark Row -> plain dict -> JSON
        response = requests.post(f"{host}/logs/_doc/", headers=headers, data=doc, auth=("user", "pass"))
        if response.status_code not in (200, 201):
            print(f"Error indexing document: {response.text}")

mapped_dyf.toDF().foreachPartition(index_to_opensearch)

job.commit()

3. Set Up OpenSearch

  • Create an OpenSearch domain with an access policy that allows the Glue job's IAM role to POST documents.

  • Create a logs index with an appropriate mapping (see the sketch after this list).

  • Use Kibana or OpenSearch Dashboards to visualize logs (search by service name, log level, etc.).
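
As a rough sketch of the index-creation step, the mapping below types the fields from the log schema used earlier; the domain endpoint and basic-auth credentials are placeholders, consistent with the sample ETL script:

import json
import requests

host = "https://your-opensearch-domain.region.es.amazonaws.com"  # placeholder endpoint

# Explicit mapping so the logs index stores each field with a sensible type
mapping = {
    "mappings": {
        "properties": {
            "log_id":       {"type": "keyword"},
            "timestamp":    {"type": "date", "format": "epoch_second"},
            "service_name": {"type": "keyword"},
            "log_level":    {"type": "keyword"},
            "message":      {"type": "text"},
        }
    }
}

response = requests.put(
    f"{host}/logs",
    headers={"Content-Type": "application/json"},
    data=json.dumps(mapping),
    auth=("user", "pass"),  # placeholder credentials
)
print(response.status_code, response.text)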

Automation and Monitoring

  • Use AWS Glue Triggers to schedule the job every N minutes.

  • Set up CloudWatch Alarms for job failures.

  • Enable Glue job bookmarks for incremental loads (avoid reprocessing the same data).
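
A minimal boto3 sketch covering the first and last points: a scheduled trigger that starts the job every 15 minutes with job bookmarks enabled. The trigger and job names are placeholders:

import boto3

glue = boto3.client("glue")

# Scheduled trigger that starts the ETL job every 15 minutes and
# passes the argument that turns on Glue job bookmarks
glue.create_trigger(
    Name="applogs-etl-every-15-min",   # placeholder trigger name
    Type="SCHEDULED",
    Schedule="cron(0/15 * * * ? *)",
    Actions=[{
        "JobName": "applogs-etl-job",  # placeholder job name
        "Arguments": {"--job-bookmark-option": "job-bookmark-enable"},
    }],
    StartOnCreation=True,
)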

Cost Considerations

  • Glue charges per DPU-hour.

  • DynamoDB charges per read/write capacity.

  • OpenSearch charges for instance hours and storage.

  • Use compression (e.g., GZIP) and partitioning in S3 to reduce costs and improve performance.
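
For the last point, the write to S3 in the sample job could be switched from plain JSON to partitioned, compressed Parquet. This sketch reuses the glueContext and mapped_dyf variables from the sample script; the output path and partition keys are choices you can adjust:

# Write the cleaned logs as Parquet, partitioned by the fields most often filtered on
glueContext.write_dynamic_frame.from_options(
    frame=mapped_dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://your-bucket/logs/cleaned-parquet/",
        "partitionKeys": ["service_name", "log_level"],
    },
    format="parquet",
    format_options={"compression": "gzip"},
)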

Conclusion

AWS Glue simplifies complex ETL workflows by offering a fully managed, scalable, and serverless platform. With seamless integration into DynamoDB, S3, and OpenSearch, you can build powerful pipelines for log analytics, data warehousing, and search-driven applications — all without provisioning infrastructure.

This architecture supports both batch-driven jobs (as demonstrated) and near-real-time ingestion (for example, DynamoDB Streams feeding a Kinesis data stream that Glue Streaming ETL consumes), making it flexible as your needs evolve.