
Designing a system that handled 30 million records a day

From a lonely EC2 instance to processing 30M records daily with serverless architecture


There are two types of "big data" stories.

The first type is the conference version:

"We built a distributed system leveraging cutting-edge data technologies to process large-scale datasets."

The second type — the real one — usually starts with:

"Someone uploaded a 1GB file to an EC2 instance and we had to deal with it."

This is the second kind.

Back in early 2023, we had to process tens of millions of financial records daily. The pipeline wasn't designed in a lab with pristine whiteboards and unlimited budgets. It evolved out of constraints, deadlines, and the occasional "this absolutely cannot fail tomorrow morning" message.

The end result?
A system that reliably processed ~30 million records per day.

Let's walk through how it worked.

The Core Constraint

The incoming dataset arrived as a ~1GB file.

It would be uploaded manually to a shared EC2 instance by a partner organization.

Not glamorous. Not event-driven. Just a file sitting there.

So the first question was:

How do we turn a random file on an EC2 box into a reliable processing pipeline?

Step 1: The Cron Job (Yes, Really)

Every few minutes, a cron job ran on the EC2 instance.

Its job was simple:

  • Check if a new file had been uploaded
  • Download it locally
  • Push it into an S3 bucket
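The post doesn't show the cron script itself, so here is a minimal sketch of what such a watcher might look like. The directory, state-file, and bucket names are hypothetical, and it shells out to the AWS CLI, though boto3 would work equally well:

```python
from pathlib import Path
import subprocess

# Hypothetical paths and bucket name -- the real values were site-specific.
WATCH_DIR = Path("/data/incoming")           # where the partner drops the ~1GB file
STATE_FILE = Path("/data/incoming/.pushed")  # names already sent to S3
BUCKET = "s3://example-ingest-bucket"

def find_new_files(watch_dir: Path, seen: set) -> list:
    """Return files in watch_dir that have not been pushed to S3 yet."""
    return [p for p in sorted(watch_dir.glob("*.csv")) if p.name not in seen]

def push_to_s3(path: Path) -> None:
    """Ship one file up with the AWS CLI; boto3 would work just as well."""
    subprocess.run(
        ["aws", "s3", "cp", str(path), f"{BUCKET}/{path.name}"],
        check=True,
    )

def run_once() -> None:
    """Entry point for a cron line such as: */5 * * * * python3 watcher.py"""
    seen = set(STATE_FILE.read_text().split()) if STATE_FILE.exists() else set()
    for path in find_new_files(WATCH_DIR, seen):
        push_to_s3(path)
        seen.add(path.name)
    STATE_FILE.write_text("\n".join(sorted(seen)))
```

Tracking already-pushed filenames in a small state file is what makes a dumb polling loop idempotent: re-running it every few minutes is safe.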

Cron jobs get a bad reputation in modern architecture discussions, but they are extremely reliable, and sometimes the simplest solution wins.

Once the file landed in S3, the real pipeline started.

Step 2: S3 Becomes the Trigger

Uploading the file into S3 triggered two parallel workflows.

This split was intentional:

  • Analytics pipeline
  • Processing pipeline

Each had different performance requirements.

Path A — Analytics Pipeline (Glue + Athena)

The first path was designed to make the data queryable quickly.

When the file arrived:

  • An S3 event trigger launched an AWS Glue job
  • The Glue job converted the raw dataset into Parquet format
  • The data became immediately queryable via Athena
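The exact wiring between the S3 event and Glue wasn't described; one common pattern is an S3 notification invoking a tiny Lambda that starts the Glue job via boto3. A sketch under that assumption (the Glue job name and argument key are placeholders):

```python
# Hypothetical Glue job name; the real pipeline's name was not given.
GLUE_JOB_NAME = "raw-to-parquet"

def extract_object(event: dict) -> tuple:
    """Pull (bucket, key) out of a standard S3 event notification payload."""
    record = event["Records"][0]
    return record["s3"]["bucket"]["name"], record["s3"]["object"]["key"]

def handler(event: dict, context) -> dict:
    """Lambda entry point: kick off the Glue job against the freshly landed file."""
    import boto3  # bundled in the AWS Lambda Python runtime

    bucket, key = extract_object(event)
    glue = boto3.client("glue")
    run = glue.start_job_run(
        JobName=GLUE_JOB_NAME,
        Arguments={"--input_path": f"s3://{bucket}/{key}"},
    )
    return {"JobRunId": run["JobRunId"]}
```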

Why Parquet?

Because querying raw CSV at scale is painful. Parquet reduces storage size and dramatically improves query performance.

This meant analysts could run queries like:

SELECT COUNT(*) FROM transactions WHERE status='failed'

…without waiting for the entire processing pipeline to complete.

In other words:

Analytics became near-real-time.

Path B — The Processing Pipeline

The second pipeline handled the actual record processing.

This was orchestrated using AWS Step Functions.

The main challenge was simple:

  • ~30 million records
  • limited execution time
  • Lambda constraints
  • a need for predictable throughput

We solved it with controlled parallelism.

The dataset was split into batches.

Each batch triggered Lambda workers.

Rough numbers looked something like this:

  • ~250 Lambda functions
  • each processing ~500 records
  • batches executed in parallel
  • Step Functions orchestrated the flow
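The post doesn't show the state machine itself, but the fan-out described above maps naturally onto a Map state in Amazon States Language: iterate over pre-split batches, cap concurrency, and retry per batch. A minimal sketch, with a placeholder function ARN and input field names:

```json
{
  "ProcessBatches": {
    "Type": "Map",
    "ItemsPath": "$.batches",
    "MaxConcurrency": 250,
    "Iterator": {
      "StartAt": "ProcessBatch",
      "States": {
        "ProcessBatch": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-batch",
          "Retry": [
            {
              "ErrorEquals": ["States.TaskFailed", "Lambda.TooManyRequestsException"],
              "IntervalSeconds": 5,
              "MaxAttempts": 3,
              "BackoffRate": 2.0
            }
          ],
          "End": true
        }
      }
    },
    "End": true
  }
}
```

The `Retry` block is what gives you the per-batch recovery described below: a failed Lambda re-runs its own batch, and the rest of the Map keeps going.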

This approach had several advantages:

  • Parallel scaling
  • Failure isolation
  • Retry support

If one Lambda failed, Step Functions simply retried the batch instead of restarting the entire job.

Why the Batch Size Mattered

Lambda enforces a hard 15-minute execution limit.

Keeping each invocation comfortably under that ceiling was critical.

If we sent too many records per function, we risked timeouts.

Too few records, and overhead would dominate.

After testing multiple configurations, we settled on something close to:

~500 records per Lambda
~250 parallel workers

This gave us a good balance between:

  • throughput
  • cost
  • execution time
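The batch split itself is a one-liner's worth of logic. A sketch, assuming records arrive as an iterable of dicts:

```python
from itertools import islice

def chunked(records, size=500):
    """Yield successive batches of `size` records; the last batch may be smaller."""
    it = iter(records)
    while batch := list(islice(it, size)):
        yield batch
```

Because it consumes a plain iterator, this works just as well streaming rows out of a 1GB file as it does over an in-memory list.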

Throughput Math

Roughly speaking:

250 lambdas * 500 records
= 125,000 records per batch wave

Multiple waves ran through the pipeline until the entire dataset finished processing.
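Extending that arithmetic to the full daily dataset:

```python
import math

RECORDS_PER_DAY = 30_000_000
WORKERS = 250
RECORDS_PER_WORKER = 500

wave_capacity = WORKERS * RECORDS_PER_WORKER               # records per wave
waves_needed = math.ceil(RECORDS_PER_DAY / wave_capacity)  # waves per day

print(wave_capacity, waves_needed)  # 125000 240
```

So a full day's data was roughly 240 waves, each bounded by the slowest Lambda in it.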

The system scaled horizontally without needing bigger machines.

Just more workers.

Observability (The Part People Forget)

When you run hundreds of distributed workers, debugging becomes... interesting.

So we leaned heavily on:

  • CloudWatch logs
  • structured logging
  • retry tracking
  • Step Function execution traces
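"Structured logging" in practice meant one JSON object per log line, so CloudWatch Logs Insights could filter on fields instead of grepping prose. The field names here are illustrative, not the pipeline's actual schema:

```python
import json
import logging
import sys
import time

def make_logger():
    """Plain stdout logger; in Lambda, stdout lands in CloudWatch automatically."""
    logger = logging.getLogger("pipeline")
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

def log_event(logger, **fields):
    """Emit one JSON log line and return it; keys become queryable fields."""
    line = json.dumps({"ts": time.time(), **fields})
    logger.info(line)
    return line

logger = make_logger()
line = log_event(logger, batch_id=17, attempt=2, status="retrying", records=500)
```

With logs in this shape, "show me every batch that needed a retry" is a one-line Insights query rather than an archaeology project.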

Without observability, distributed systems quickly become distributed confusion.

Why This Architecture Worked

Looking back, the system worked well because it respected a few simple principles:

1. Use the Right Tool for Each Stage

  • S3 → storage and event trigger
  • Glue → data transformation
  • Athena → analytics
  • Step Functions → orchestration
  • Lambda → parallel compute

Each service did one job well.

2. Embrace Parallelism

Trying to process millions of records sequentially is a losing game.

Parallel workers allowed us to process data at a scale that would otherwise require far larger infrastructure.

3. Design for Failure

In distributed systems, things will fail.

Step Functions made retries straightforward and prevented the dreaded "restart everything" scenario.

What I Would Do Differently Today

This system was built in early 2023.

If I redesigned it today, I might explore:

  • AWS Batch for large compute jobs
  • Kafka / streaming ingestion
  • Fargate workers instead of Lambda bursts
  • Iceberg tables instead of plain Parquet

But for the constraints we had at the time, the architecture held up surprisingly well.

Final Thoughts

Processing 30 million records per day sounds intimidating.

In reality, it's often just:

  • a file
  • a bucket
  • a few hundred workers
  • and a lot of careful orchestration

The real challenge isn't writing code.

It's designing systems that stay predictable under scale.

And sometimes, it all starts with a cron job on a lonely EC2 instance.


If you're building similar pipelines or have opinions on better architectures, I'd love to hear them.

The internet is full of strong opinions about distributed systems — and tech Twitter is rarely shy about sharing them.

