Cache Wars: A Journey from Simple LRU to Advanced Adaptive Caching

As dawn breaks over the ultra‑low‑latency trading floor of Trading Exchange in Mumbai, glowing dashboards reveal a critical issue: cache hit rates are plummeting, and tail latencies are spiking. Two engineers spring into action—Chandan, the quiet atomics specialist, and Avinash, the fearless systems strategist. Together, they embark on a mission to tame the unpredictable behavior of an LRU cache under intense load.
1. Facing the LRU Meltdown
Avinash leans over a screen, eyebrow raised.
“Misses are shooting up—orders are slowing down!”
Chandan scans the logs, nodding.
“Our fixed‑size LRU is thrashing. We need O(1) get and put, and bulletproof thread safety.”
They agree on three core requirements:
- Capacity limit: hold at most N entries, evicting the least‑recently‑used when full.
- Constant‑time operations: every lookup and insertion must complete in O(1).
- Concurrency: support dozens of simultaneous threads without data races.
To satisfy O(1) access and eviction, they combine a hash map (for key lookup) with a doubly linked list (to track usage order). This classic pattern moves accessed items to the front and evicts from the tail.
2. Avinash’s Mutex‑Guarded LRU
Avinash’s first draft wraps all operations in a simple global lock:
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

class LockedLRUCache<K, V> {
    private final ReentrantLock lock = new ReentrantLock();
    private final int capacity;
    private final Map<K, Node> map;
    private final Node head, tail; // sentinels: head.next is the MRU entry, tail.prev the LRU

    class Node { K key; V value; Node prev, next; }

    public LockedLRUCache(int cap) {
        capacity = cap;
        map = new HashMap<>(cap);
        head = new Node();
        tail = new Node();
        head.next = tail;
        tail.prev = head;
    }

    public V get(K key) {
        lock.lock();
        try {
            Node n = map.get(key);
            if (n == null) return null;
            // Move the accessed node to the front so it becomes most recently used.
            remove(n);
            insertAtFront(n);
            return n.value;
        } finally {
            lock.unlock();
        }
    }

    public void put(K key, V value) {
        lock.lock();
        try {
            Node n = map.get(key);
            if (n != null) {
                // Key already cached: update in place and mark as most recently used.
                n.value = value;
                remove(n);
                insertAtFront(n);
                return;
            }
            if (map.size() == capacity) {
                // Cache full: evict the least recently used entry (tail.prev).
                Node lru = tail.prev;
                remove(lru);
                map.remove(lru.key);
            }
            Node newNode = new Node();
            newNode.key = key;
            newNode.value = value;
            map.put(key, newNode);
            insertAtFront(newNode);
        } finally {
            lock.unlock();
        }
    }

    // Unlink a node from the doubly linked list.
    private void remove(Node n) {
        n.prev.next = n.next;
        n.next.prev = n.prev;
    }

    // Insert a node right after the head sentinel (most recently used position).
    private void insertAtFront(Node n) {
        n.next = head.next;
        head.next.prev = n;
        head.next = n;
        n.prev = head;
    }
}
This locking approach is straightforward and easy to reason about. However, every thread must wait its turn at the lock, leading to contention under high concurrency. Benchmarks with 16 threads peaked around 12 M ops/sec, leaving room for improvement.
3. Sharding Meets Lock‑Free Magic
To break the contention bottleneck, Chandan proposes sharding combined with lock‑free techniques:
- Shard Cache: split the keyspace into S independent sub‑caches. Each shard handles ~N/S entries.
- Lock‑Free Ring Buffers: within each shard, use atomic pointers and a pre‑allocated array to implement O(1) enqueue/dequeue without locks.
- Hazard Pointers: ensure safe memory reclamation by preventing nodes from being freed while still in use.
// Skeleton only: the Entry type, the spin loops, and hazard-pointer reclamation are elided.
class LockFreeShard<K, V> {
    private final AtomicInteger head = new AtomicInteger(0);
    private final AtomicInteger tail = new AtomicInteger(0);
    private final Entry[] buffer; // fixed size, pre-allocated ring

    public V get(K key) { /* spin-read with version tags */ }
    public void put(K key, V value) { /* spin-write with ABA protection */ }
}
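To make the routing concrete, here is a minimal Java sketch of a ShardedLRUCache wrapper that hashes each key to one of S sub‑caches. For brevity it reuses the mutex‑guarded LockedLRUCache per shard and picks an arbitrary hash‑spreading step; in the lock‑free design each shard would be a LockFreeShard instead.

import java.util.ArrayList;
import java.util.List;

// Illustrative wrapper: route each key to one of S independent sub-caches by hash.
class ShardedLRUCache<K, V> {
    private final List<LockedLRUCache<K, V>> shards = new ArrayList<>();
    private final int shardCount;

    ShardedLRUCache(int totalCapacity, int shardCount) {
        this.shardCount = shardCount;
        for (int i = 0; i < shardCount; i++) {
            // Each shard holds roughly N/S entries and has its own lock.
            shards.add(new LockedLRUCache<>(totalCapacity / shardCount));
        }
    }

    // Spread the hash bits so keys distribute evenly, then pick a shard index.
    private int shardFor(K key) {
        int h = key.hashCode();
        h ^= (h >>> 16);
        return Math.abs(h % shardCount);
    }

    public V get(K key) { return shards.get(shardFor(key)).get(key); }
    public void put(K key, V value) { shards.get(shardFor(key)).put(key, value); }
}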
By routing threads to different shards and eliminating locks inside each shard, the design avoids cache-line ping‑pong and reduces stalls. In tests, this sharded lock‑free LRU soared to 50 M ops/sec on 16 threads—a 4× uplift.
4. Balancing Throughput and Latency
Avinash raises a fair concern: “Spin‑loops are fast, but uncontrolled spinning can hurt tail latency.”
Chandan agrees and suggests a hybrid strategy:
- Adaptive Backoff: attempt a few lock‑free spins, then yield or briefly lock if contention persists (a minimal sketch follows this list).
- Read‑Write Locks: for mostly‑read workloads, allow concurrent get calls without blocking each other.
- NUMA‑Awareness: bind shards to local memory banks to minimize cross‑node traffic.
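As a rough illustration of adaptive backoff (the AdaptiveCounter class and its MAX_SPINS limit are hypothetical, not part of the cache itself): try a bounded number of lock‑free CAS attempts, then fall back to a plain lock so a contended thread never spins without bound.

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

class AdaptiveCounter {
    private static final int MAX_SPINS = 64;              // assumed tuning knob
    private final AtomicLong value = new AtomicLong();
    private final ReentrantLock fallback = new ReentrantLock();

    long increment() {
        // Fast path: a few lock-free attempts.
        for (int i = 0; i < MAX_SPINS; i++) {
            long current = value.get();
            if (value.compareAndSet(current, current + 1)) {
                return current + 1;
            }
            Thread.onSpinWait();                           // CPU hint that we are busy-waiting (Java 9+)
        }
        // Slow path: contention persisted, so queue up behind the lock instead of spinning.
        fallback.lock();
        try {
            return value.incrementAndGet();
        } finally {
            fallback.unlock();
        }
    }
}

The same shape applies to a shard's get/put path: the fast path stays lock‑free under light contention, while the bounded fallback keeps tail latency predictable when many threads collide.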
They measure not only average throughput but critical percentiles (p95, p99) to ensure consistent performance even under bursts.
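One practical way to capture those percentiles is a latency histogram. The probe below assumes the HdrHistogram library is available and, for brevity, drives the mutex‑guarded cache from a single thread; per‑thread histograms can be merged for a multi‑threaded benchmark.

import org.HdrHistogram.Histogram;

public class LatencyProbe {
    public static void main(String[] args) {
        LockedLRUCache<String, String> cache = new LockedLRUCache<>(100_000);
        // Track values up to one hour (in nanoseconds) with 3 significant digits of precision.
        Histogram latencies = new Histogram(3_600_000_000_000L, 3);

        for (int i = 0; i < 1_000_000; i++) {
            String key = "SYM" + (i % 1_000);               // synthetic hot keyset
            long start = System.nanoTime();
            cache.put(key, "tick-" + i);
            cache.get(key);
            latencies.recordValue(System.nanoTime() - start);
        }

        System.out.printf("p95 = %d ns, p99 = %d ns%n",
                latencies.getValueAtPercentile(95.0),
                latencies.getValueAtPercentile(99.0));
    }
}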
5. Advanced Adaptive Caching
When single‑machine LRU reaches its limits, consider adaptive policies:
- ARC (Adaptive Replacement Cache): auto‑tunes between recency and frequency by maintaining two LRU lists, adapting to changing access patterns.
- W‑TinyLFU: combines a small LRU window for recent items with a TinyLFU sketch to decide long‑term retention, improving hit rates with minimal memory (see the Caffeine example after this list).
- SLRU (Segmented LRU): divides items into probationary and protected segments, protecting frequently accessed data.
- Counting Bloom Filters: maintain approximate counts to quickly filter evictions in massive keyspaces.
- RCU (Read‑Copy‑Update): for read‑heavy scenarios, enable lock‑free reads with deferred updates.
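For a concrete starting point, the Caffeine library's size‑bounded caches use a Window TinyLFU policy internally, so the adaptive admission behavior described above comes without custom code. The symbol and payload below are placeholders.

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class TinyLfuQuoteCache {
    public static void main(String[] args) {
        Cache<String, String> quotes = Caffeine.newBuilder()
                .maximumSize(100_000)          // capacity bound; W-TinyLFU decides what is worth keeping
                .build();

        quotes.put("RELIANCE", "bid=2951.10,ask=2951.35");   // placeholder tick
        String cached = quotes.getIfPresent("RELIANCE");     // returns null on a miss
        System.out.println(cached);
    }
}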
Beyond algorithms, hardware and deployment optimizations—like prefetch hints, persistent NVDIMM caches, or RDMA‑enabled distributed caches—can push latencies further below the microsecond barrier.
6. Real‑World Example Scenarios
- High‑Frequency Trading Quotes
// cache.js exports a LockedLRUCache implementation
const { LockedLRUCache } = require('./cache');
const quoteCache = new LockedLRUCache(100_000);

function onNewTick(symbol, bid, ask) {
  quoteCache.put(symbol, { timestamp: Date.now(), bid, ask });
}

async function buildOrderBook(symbol) {
  const last = quoteCache.get(symbol); // synchronous in-memory lookup, no await needed
  if (last) {
    processOrderWithPrice(last);
  } else {
    const fresh = await fetchFromMarketData(symbol);
    processOrderWithPrice(fresh);
  }
}
Sub‑microsecond lookups for hot symbols, evicting the least‐recently‐seen under heavy churn.
- Web Session Caching
// cache.js exports a ShardedLRUCache implementation
const { ShardedLRUCache } = require('./cache');
const sessionCache = new ShardedLRUCache(1_000, 16);

// On user login
app.post('/login', async (req, res) => {
  const sessionData = { userId: req.user.id, loginTime: Date.now() };
  sessionCache.put(req.sessionID, sessionData);
  res.redirect('/dashboard');
});

// On each request
app.use((req, res, next) => {
  const session = sessionCache.get(req.sessionID);
  if (session && session.loginTime + 30 * 60 * 1000 > Date.now()) { // 30-minute session window
    req.userId = session.userId;
    next();
  } else {
    res.redirect('/login');
  }
});
Sharding spreads load across 16 independent LRU sub‑caches, minimizing lock contention.
- Image Thumbnail Store
const LRU = require('lru-cache');
const sharp = require('sharp');

const thumbnailCache = new LRU({ max: 500 }); // keep at most 500 rendered thumbnails

async function generateThumbnail(imagePath) {
  let thumb = thumbnailCache.get(imagePath);
  if (!thumb) {
    thumb = await sharp(imagePath)
      .resize(128, 128)
      .png()
      .toBuffer();
    thumbnailCache.set(imagePath, thumb);
  }
  return thumb;
}

// In Express handler
app.get('/thumb/:name', async (req, res) => {
  try {
    const buf = await generateThumbnail(`/images/${req.params.name}`);
    res.type('image/png').send(buf);
  } catch (err) {
    res.status(500).send('Error generating thumbnail');
  }
});
An off‑the‑shelf LRU cache paired with Sharp delivers fast, cached thumbnails for web clients.
7. Key Takeaways
- Start simple: implement a lock‑based LRU to validate correctness and workload characteristics.
- Measure holistically: track both average and tail latencies under realistic traffic.
- Scale incrementally: introduce sharding, lock‑free structures, or hybrid locks only when and where contention demands.
- Adopt adaptive policies: ARC, TinyLFU, or SLRU can boost hit rates in dynamic workloads.
- Harness hardware: leverage NUMA affinity, prefetch, persistence, and RDMA to squeeze every microsecond.
In high‑frequency trading, every cache hit is a victory—and every saved microsecond, a win.