How to handle 2 million events per second all day every day

Using Ruby, SQS, SNS, S3, and Redis on K8s

by Mark Long | markclong.com

Problem

  • We need to track processes over time
  • Did they open files?
  • Make network connections?
  • Change files?
  • One event might not be a problem, but a sequence might be a threat
  • Process events come FAST, very fast!
  • We have less than 15 minutes to get a threat in front of a person

Problem

  • Operations need to scale horizontally
  • Vertical scaling should be minimized as much as possible
  • Process events, while small, are massive in aggregate
  • Storage requirements grow linearly with event volume

Solution

  • Ruby code running on JRuby
  • Using the parallel and concurrent-ruby gems
  • Using a scalable Redis cluster as a datastore (ElastiCache on AWS)
  • Batched event files on S3, with SNS adding a message to SQS for each new file (see the sketch below)
  • 2,000 or more Kubernetes pods and over 16,000 CPUs
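
As a rough sketch of how those pieces fit together (assuming the aws-sdk-sqs and aws-sdk-s3 gems; the queue URL and process_batch are placeholders, not the production code):

    require "aws-sdk-sqs"
    require "aws-sdk-s3"
    require "json"

    sqs = Aws::SQS::Client.new
    s3  = Aws::S3::Client.new
    queue_url = ENV.fetch("EVENTS_QUEUE_URL")   # placeholder

    loop do
      resp = sqs.receive_message(queue_url: queue_url,
                                 max_number_of_messages: 10,
                                 wait_time_seconds: 20)
      resp.messages.each do |msg|
        # SNS wraps the S3 event notification in its own JSON envelope
        notification = JSON.parse(JSON.parse(msg.body)["Message"])
        notification["Records"].each do |record|
          bucket = record["s3"]["bucket"]["name"]
          key    = record["s3"]["object"]["key"]
          file   = s3.get_object(bucket: bucket, key: key).body.read
          process_batch(file)   # hypothetical; raises if any record fails
        end
        # Delete only after the whole file succeeds, so SQS redelivers on failure
        sqs.delete_message(queue_url: queue_url, receipt_handle: msg.receipt_handle)
      end
    end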

Features

  • Fully scalable
  • Pods scale to the queue depth
  • ElastiCache can be scaled up or down manually
  • S3 provides virtually limitless capacity

ElastiCache/Redis

  • Runs on 64 nodes
  • Use more, smaller nodes to spread the load
  • Can add nodes as memory pressure increases
  • Scale-downs are rare but possible
  • Use only WATCH, GET, and SETEX operations (see the sketch below)
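
A minimal sketch of the GET/SETEX pattern with the redis gem (the key scheme, TTL, and state shape are illustrative; the WATCH side shows up under optimistic locking later):

    require "redis"
    require "json"

    redis = Redis.new(url: ENV.fetch("REDIS_URL", "redis://localhost:6379"))

    def record_event(redis, process_id, event, ttl_seconds: 900)
      key     = "proc:#{process_id}"          # illustrative key scheme
      current = redis.get(key)                # GET the existing state, if any
      state   = current ? JSON.parse(current) : { "events" => [] }
      state["events"] << event
      redis.setex(key, ttl_seconds, JSON.dump(state))  # SETEX sets value and TTL together
    end

Keeping every key behind SETEX means stale process state ages out of the cluster on its own.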

Kubernetes pods

  • Each pod runs with six GB of RAM
  • Handles all the data manipulation
  • Runs in parallel across eight cores
  • Unzips files in memory to process them (sketched below)
  • Fully elastic depending on queue depth
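
A sketch of the in-memory unzip and multi-core fan-out, assuming gzipped, newline-delimited JSON batch files and the parallel gem; handle_event is a placeholder:

    require "zlib"
    require "stringio"
    require "json"
    require "parallel"

    def process_file(gzipped_bytes)
      # Decompress entirely in memory; nothing is written to the pod's disk
      records = Zlib::GzipReader.new(StringIO.new(gzipped_bytes)).readlines

      # JRuby threads are real OS threads, so in_threads can use all eight cores
      Parallel.each(records, in_threads: 8) do |line|
        handle_event(JSON.parse(line))        # placeholder for the per-event logic
      end
    end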

Data storage

  • Redis is in memory
  • Compressing and compacting values reduced RAM usage by 75% (rough sketch below)
  • Compressed values can't be inspected in Redis without a custom tool
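
The exact compaction scheme isn't shown here; as a rough illustration, deflating compact JSON before it goes into Redis captures the idea:

    require "zlib"
    require "json"

    def pack(state)
      Zlib::Deflate.deflate(JSON.dump(state))   # compact JSON, then deflate
    end

    def unpack(blob)
      JSON.parse(Zlib::Inflate.inflate(blob))
    end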

Challenges

  • Data arrives so fast that values change while a batch is still in flight
  • Dirty writes occur when the value is changed by another process
  • Handling operations at the batch-file level makes this especially hard
  • If one record fails, the whole file should fail
  • This violates most of the ACID properties
  • Pipelining with Redis is impossible due to dirty writes

Mistakes

  • Premature optimization
  • I focused on making one file FAST
  • It was amazing on a single file on my 12-core MBP
  • But terrible when there were 350+ files in flight at once
  • Focus on getting it working, then make it fast

Mistakes

  • To prevent dirty writes I tried using pessimistic locking
  • It was an unmitigated disaster
  • Settled on optimistic locking, since dirty writes are generally rare (see the sketch below)
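
A sketch of that optimistic-locking loop with the redis gem: WATCH the key, do the read-modify-write, and retry if the transaction aborts because another writer got there first. The MULTI/EXEC wrapper is what makes WATCH meaningful; the key name, TTL, and retry count are illustrative:

    require "redis"
    require "json"

    def append_event(redis, key, event, ttl_seconds: 900, max_attempts: 5)
      max_attempts.times do
        result = redis.watch(key) do            # WATCH the key for concurrent writes
          current = redis.get(key)
          state   = current ? JSON.parse(current) : { "events" => [] }
          state["events"] << event
          redis.multi do |tx|                   # EXEC aborts if the key changed after WATCH
            tx.setex(key, ttl_seconds, JSON.dump(state))
          end
        end
        return true if result                   # nil means a dirty write beat us; retry
      end
      false                                     # caller fails the whole file and lets SQS retry it
    end

Because dirty writes are rare, most updates commit on the first pass and the retry loop stays cheap.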

Results

  • The system has been running without issue for the most part
  • There was a crash with ElastiCache
  • It has scaled to handle over 2 million events per second
  • We could scale it at least 8x before we run out of Redis nodes
  • It is no longer the bottleneck in our pipeline
  • The only issue has been AWS Redis node failures
  • The on-call engineer isn’t being woken up all night