How to handle 2 million events per second all day every day

Using Ruby, SQS, SNS, S3, and Redis on K8s

by Mark Long | markclong.com

Problem

  • We need to track processes over time
  • Did they open files?
  • Make network connections?
  • Change files?
  • One event might not be a problem, but a sequence might be a threat
  • Process events come FAST, very fast!
  • We have less than 15 minutes to get a threat in front of a person

Problem

  • Operations need to scale horizontally
  • Vertical scaling should be minimized as much as possible
  • Process events, while small, are massive in aggregate
  • Storage requirements grow linearly with event volume

Solution

  • Ruby code running on JRuby
  • Using the parallel and concurrent-ruby gems
  • Using a scalable Redis cluster as a datastore (ElastiCache on AWS)
  • Batched event files on S3, with SNS adding a message to SQS for each new file (see the sketch below)
  • 2,000 or more Kubernetes pods and over 16,000 CPUs
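
As a rough sketch of how those pieces fit together (assuming the aws-sdk-sqs and aws-sdk-s3 gems; the queue URL and process_batch are placeholders, not the production code):

    require "aws-sdk-sqs"
    require "aws-sdk-s3"
    require "json"

    sqs = Aws::SQS::Client.new
    s3  = Aws::S3::Client.new
    queue_url = ENV.fetch("EVENTS_QUEUE_URL")   # placeholder

    loop do
      resp = sqs.receive_message(queue_url: queue_url,
                                 max_number_of_messages: 10,
                                 wait_time_seconds: 20)
      resp.messages.each do |msg|
        # SNS wraps the S3 event notification in its own JSON envelope
        notification = JSON.parse(JSON.parse(msg.body)["Message"])
        notification["Records"].each do |record|
          bucket = record["s3"]["bucket"]["name"]
          key    = record["s3"]["object"]["key"]
          file   = s3.get_object(bucket: bucket, key: key).body.read
          process_batch(file)   # hypothetical; raises if any record fails
        end
        # Delete only after the whole file succeeds, so SQS redelivers on failure
        sqs.delete_message(queue_url: queue_url, receipt_handle: msg.receipt_handle)
      end
    end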

Features

  • Fully scalable
  • Pods scale to the queue depth
  • ElastiCache can be scaled up or down manually
  • S3 provides virtually limitless capacity

ElastiCache/Redis

  • Runs on 64 nodes
  • Use more, smaller nodes to spread the load
  • Can add nodes as memory pressure increases
  • Scale-downs are rare but possible
  • Use only WATCH, GET, and SETEX operations (see the sketch below)
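
A minimal sketch of the GET/SETEX pattern with the redis gem (the key scheme, TTL, and state shape are illustrative; the WATCH side shows up under optimistic locking later):

    require "redis"
    require "json"

    redis = Redis.new(url: ENV.fetch("REDIS_URL", "redis://localhost:6379"))

    def record_event(redis, process_id, event, ttl_seconds: 900)
      key     = "proc:#{process_id}"          # illustrative key scheme
      current = redis.get(key)                # GET the existing state, if any
      state   = current ? JSON.parse(current) : { "events" => [] }
      state["events"] << event
      redis.setex(key, ttl_seconds, JSON.dump(state))  # SETEX sets value and TTL together
    end

Keeping every key behind SETEX means stale process state ages out of the cluster on its own.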

Kubernetes pods

  • Each pod runs with six GB of RAM
  • Handles all the data manipulation
  • Runs in parallel across eight cores
  • Unzips files in memory to process them (sketched below)
  • Fully elastic depending on queue depth
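
A sketch of the in-memory unzip and multi-core fan-out, assuming gzipped, newline-delimited JSON batch files and the parallel gem; handle_event is a placeholder:

    require "zlib"
    require "stringio"
    require "json"
    require "parallel"

    def process_file(gzipped_bytes)
      # Decompress entirely in memory; nothing is written to the pod's disk
      records = Zlib::GzipReader.new(StringIO.new(gzipped_bytes)).readlines

      # JRuby threads are real OS threads, so in_threads can use all eight cores
      Parallel.each(records, in_threads: 8) do |line|
        handle_event(JSON.parse(line))        # placeholder for the per-event logic
      end
    end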

Data storage

  • Redis is in memory
  • Compressing and compacting values reduced RAM usage by 75% (rough sketch below)
  • Compressed values can't be inspected in Redis without a custom tool
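
The exact compaction scheme isn't shown here; as a rough illustration, deflating compact JSON before it goes into Redis captures the idea:

    require "zlib"
    require "json"

    def pack(state)
      Zlib::Deflate.deflate(JSON.dump(state))   # compact JSON, then deflate
    end

    def unpack(blob)
      JSON.parse(Zlib::Inflate.inflate(blob))
    end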

Challenges

  • Data arrives so fast that values change while a batch is still in flight
  • Dirty writes occur when the value is changed by another process
  • Handling operations at the batch-file level makes this especially hard
  • If one record fails, the whole file should fail
  • This violates most of the ACID properties
  • Pipelining with Redis is impossible due to dirty writes

Mistakes

  • Premature optimization
  • I focused on making one file FAST
  • It was amazing on a single file on my 12-core MBP
  • But terrible when there were 350+ files in flight at once
  • Focus on getting it working, then make it fast

Mistakes

  • To prevent dirty writes I tried using pessimistic locking
  • It was an unmitigated disaster
  • Settled on optimistic locking, since dirty writes are generally rare (see the sketch below)
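
A sketch of that optimistic-locking loop with the redis gem: WATCH the key, do the read-modify-write, and retry if the transaction aborts because another writer got there first. The MULTI/EXEC wrapper is what makes WATCH meaningful; the key name, TTL, and retry count are illustrative:

    require "redis"
    require "json"

    def append_event(redis, key, event, ttl_seconds: 900, max_attempts: 5)
      max_attempts.times do
        result = redis.watch(key) do            # WATCH the key for concurrent writes
          current = redis.get(key)
          state   = current ? JSON.parse(current) : { "events" => [] }
          state["events"] << event
          redis.multi do |tx|                   # EXEC aborts if the key changed after WATCH
            tx.setex(key, ttl_seconds, JSON.dump(state))
          end
        end
        return true if result                   # nil means a dirty write beat us; retry
      end
      false                                     # caller fails the whole file and lets SQS retry it
    end

Because dirty writes are rare, most updates commit on the first pass and the retry loop stays cheap.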

Results

  • The system has been running without issue for the most part
  • There was a crash with ElastiCache
  • It has scaled to handle over 2 million events per second
  • We could scale it at least 8x before we run out of Redis nodes
  • It is no longer the bottleneck in our pipeline
  • The only issue has been AWS Redis node failures
  • The on-call engineer isn’t being woken up all night