How to handle 2 million events per second all day every day
Using Ruby, SQS, SNS, S3, and Redis on K8s
by Mark Long | markclong.com
Problem
- We need to track processes over time
- Did they open files?
- Make network connections?
- Change files?
- One event might not be a problem but a sequence might be a threat
- Process events come FAST, very fast!
- We have less than 15 minutes to get a threat in front of a person
Problem
- Operations need to scale horizontally
- Vertical scaling should be kept to a minimum
- Process events, while small, are massive in aggregate
- Storage scales linearly
Solution
- Ruby code running on JRuby
- Using the parallel and concurrent-ruby gems
- Using scalable Redis cluster as a datastore (Elasticache on AWS)
- Events batched into files on S3, with SNS adding a message to SQS for each file (see the worker sketch below)
- 2,000 or more Kubernetes pods and over 16,000 CPUs
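A minimal sketch of the worker loop, assuming an SNS-to-SQS subscription carrying S3 event notifications; the queue URL, key handling, and process_batch are placeholders rather than the production code:

```ruby
require "aws-sdk-sqs"
require "aws-sdk-s3"
require "json"

QUEUE_URL = ENV.fetch("EVENTS_QUEUE_URL") # hypothetical SNS -> SQS queue

sqs = Aws::SQS::Client.new
s3  = Aws::S3::Client.new

loop do
  resp = sqs.receive_message(queue_url: QUEUE_URL,
                             max_number_of_messages: 10,
                             wait_time_seconds: 20)

  resp.messages.each do |msg|
    # SNS wraps the S3 event notification in a "Message" field
    s3_event = JSON.parse(JSON.parse(msg.body)["Message"])

    s3_event["Records"].each do |record|
      bucket = record.dig("s3", "bucket", "name")
      key    = record.dig("s3", "object", "key")

      gz_bytes = s3.get_object(bucket: bucket, key: key).body.read
      process_batch(gz_bytes) # unzip in memory and apply events (later slides)
    end

    # Acknowledge only after the whole file succeeded
    sqs.delete_message(queue_url: QUEUE_URL, receipt_handle: msg.receipt_handle)
  end
end
```

Because each pod just polls the queue, scaling the workers is simply a matter of adding pods as queue depth grows.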
Features
- Fully scalable
- Pods scale to the queue depth
- Elasticache can scale up or down manually
- S3 provides virtually limitless capacity
Elasticache/Redis
- Runs on 64 nodes
- Use more nodes of smaller size to spread load
- Can add nodes as memory pressure increases
- Scale downs are rare but possible
- Use only watch, get, and setex operations (sketched below)
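A sketch of how those three commands combine into a check-and-set update with redis-rb; the key layout, TTL, and merge step are assumptions. WATCH only aborts writes issued inside a MULTI/EXEC block, so the setex is wrapped in one:

```ruby
require "redis"
require "json"

TTL_SECONDS = 15 * 60 # assumed retention window

# Returns the EXEC result on success, or nil if the watched key was
# changed by another process in the meantime (a dirty write)
def update_process_state(redis, key, event)
  redis.watch(key) do |rd|
    current = rd.get(key)
    state   = current ? JSON.parse(current) : { "events" => [] }
    state["events"] << event

    rd.multi { |tx| tx.setex(key, TTL_SECONDS, JSON.dump(state)) }
  end
end
```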
Kubernetes pods
- They run with six gigs of RAM
- Handle all the data manipulation
- Run the work in parallel across eight cores
- Unzip files in memory before processing (see the sketch below)
- Fully elastic depending on queue depth
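A rough sketch of the per-file work, assuming newline-delimited JSON inside each gzipped batch; apply_event stands in for the Redis update on the previous slide. Threads (rather than processes) suit JRuby, which has real thread parallelism and no fork:

```ruby
require "zlib"
require "stringio"
require "json"
require "parallel"

CORES = 8

def process_batch(gz_bytes)
  # Unzip entirely in memory; nothing touches disk
  ndjson = Zlib::GzipReader.new(StringIO.new(gz_bytes)).read
  events = ndjson.each_line.map { |line| JSON.parse(line) }

  # Spread the records across eight worker threads
  Parallel.each(events, in_threads: CORES) do |event|
    apply_event(event) # e.g. the watch/get/setex update from the Redis slide
  end
end
```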
Data storage
- Redis is in memory
- Compressing and compacting values reduced RAM usage by 75% (sketched below)
- The trade-off: values can no longer be inspected in Redis without a custom tool
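One way to get a saving of that order, sketched with Zlib over compacted JSON; the field-name mapping and compression level are assumptions. The downside is exactly the visibility point above: stored values become opaque binary blobs, hence the custom inspection tool:

```ruby
require "zlib"
require "json"

# Compact: shorten field names before serialising (assumed mapping)
FIELD_MAP = { "process_id" => "p", "event_type" => "t", "timestamp" => "ts" }.freeze

def pack(state)
  compact = state.transform_keys { |k| FIELD_MAP.fetch(k, k) }
  Zlib::Deflate.deflate(JSON.dump(compact), Zlib::BEST_COMPRESSION)
end

def unpack(blob)
  JSON.parse(Zlib::Inflate.inflate(blob))
      .transform_keys { |k| FIELD_MAP.key(k) || k }
end
```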
Challenges
- Data arrives so fast that values change while a batch is in flight
- Dirty writes occur when a value is changed by another process mid-update
- Handling operations at the batch-file level makes this especially tricky
- If one record fails, the whole file should fail (see the sketch below)
- This violates most of the ACID properties
- Pipelining with Redis is off the table because of dirty writes
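A sketch of how whole-file failure can sit on top of the optimistic updates, reusing the update_process_state helper from the Redis slide; the retry budget and key scheme are assumptions. A record that keeps hitting dirty writes raises, the SQS message is never deleted, and the entire file comes back for another attempt:

```ruby
require "redis"

REDIS       = Redis.new(url: ENV.fetch("REDIS_URL")) # shared connection (sketch only)
MAX_RETRIES = 5                                      # assumed per-record retry budget

def apply_event(event)
  key = "proc:#{event.fetch('process_id')}" # hypothetical key scheme

  MAX_RETRIES.times do
    # update_process_state returns nil when WATCH detects a dirty write;
    # anything truthy means the setex went through
    return if update_process_state(REDIS, key, event)
  end

  # One stubborn record fails the whole batch: raising here skips the SQS
  # delete, so the full file is retried later
  raise "dirty write not resolved for #{key}"
end
```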
Mistakes
- Premature optimization
- I focused on making a single file FAST
- It was amazing on one file on my 12-core MBP
- But terrible once 350+ files were in flight at once
- Focus on getting it working, then make it fast
Mistakes
- To prevent dirty writes I first tried pessimistic locking
- It was an unmitigated disaster
- Settled on optimistic locking, since dirty writes are generally rare
Results
- The system has been running without issue for the most part
- There was a crash with Elasticache
- It has scaled to handle over 2 million events per second
- We could scale it at least 8x before we run out of Redis nodes
- It is no longer the bottleneck in our pipeline
- The only issue has been AWS Redis node failures
- The on-call engineer isn't being woken up all night