Engineering a Sub‑Microsecond Queue for High‑Frequency Trading

It’s early morning in Mumbai’s trading hub—just as servers hum to life, a red alert flashes across the Latency Dashboard. Our two star engineers, Chandan and Avinash, dash into the server room. Who will rescue us from a budding performance crisis?

Meet the Duo

  • Chandan: The quiet problem-solver. He’s a wizard with atomics, memory ordering, and low-level tricks.
  • Avinash: The big-picture guy. Give him a mutex, and he’ll build an ironclad system with deadlock avoidance and priority inheritance in minutes.

1. The Morning Call

A sudden spike in order-processing time sends panic through the desks. The CTO’s voice crackles over the intercom:

“We need sub-microsecond response. No excuses.”

Chandan and Avinash exchange a glance—this is more than a friendly competition. It’s about the survival of our trading engine.

2. Avinash’s Lock-Based Queue

Avinash strides to his workstation like a general to the front lines. He leverages a classic mutex with priority inversion safeguards:

const { Mutex } = require('async-mutex');

class LockedQueue {
  constructor() {
    this.queue = [];          // plain array, guarded by the mutex
    this.mutex = new Mutex(); // single lock, never nested: no deadlock
  }

  async enqueue(item) {
    const release = await this.mutex.acquire();
    try {
      this.queue.push(item);
    } finally {
      release(); // always release, even if push throws
    }
  }

  async dequeue() {
    const release = await this.mutex.acquire();
    try {
      // Note: Array#shift is O(n); a ring buffer would avoid the copy.
      return this.queue.length ? this.queue.shift() : null;
    } finally {
      release();
    }
  }
}

module.exports = LockedQueue;
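
A quick smoke test might look like this (a minimal sketch, assuming Node.js with the async-mutex package installed; the './LockedQueue' path is hypothetical):

const LockedQueue = require('./LockedQueue'); // hypothetical module path

(async () => {
  const q = new LockedQueue();
  await q.enqueue({ orderId: 42, side: 'BUY' });
  console.log(await q.dequeue()); // -> { orderId: 42, side: 'BUY' }
  console.log(await q.dequeue()); // -> null (queue is empty)
})();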

He glances back at the team.

Avinash (grinning): “Watch and learn—deadlock-free, livelock-free, and with priority inheritance it handles the worst of thread storms.”

Why it works:

  • Deadlock avoidance via single lock, no nested locking.
  • Priority inheritance prevents unbounded priority inversion: a low-priority lock holder is boosted so it cannot stall high-priority threads.

Why it chokes:

  • Global bottleneck: one lock serializes all producers and consumers.
  • Cache-line bouncing: lock/unlock ping-pongs the cache line, adding hundreds of nanoseconds under contention.

Benchmark time. Sixteen threads charge. The fortress groans at 15 M ops/sec—solid, yet far from unbeatable.

3. Chandan’s Lock‑Free Masterpiece

Chandan retreats to a quiet corner, headphones on, lights dimmed. He crafts a lock-free, multi‑producer/multi‑consumer ring buffer: a compare-and-swap claims each slot, and a per-slot ready flag tells consumers when the payload has actually been written. (JavaScript’s Atomics operations are sequentially consistent, so no explicit memory fence is needed, and none exists in the spec.)

const BUFFER_SIZE = 1024;

class LockFreeQueue {
  constructor() {
    // Payload slots. Caveat: a plain Array lives in one thread's heap;
    // for true cross-worker sharing, the payload itself must sit in a
    // SharedArrayBuffer-backed typed array.
    this.buffer = new Array(BUFFER_SIZE).fill(null);
    // Per-slot ready flags: 1 means the slot holds a published item.
    this.ready = new Int32Array(new SharedArrayBuffer(4 * BUFFER_SIZE));
    this.headView = new Uint32Array(new SharedArrayBuffer(4)); // consumer index
    this.tailView = new Uint32Array(new SharedArrayBuffer(4)); // producer index
  }

  enqueue(item) {
    while (true) {
      const tail = Atomics.load(this.tailView, 0);
      const head = Atomics.load(this.headView, 0);
      const nextTail = (tail + 1) % BUFFER_SIZE;
      if (nextTail === head) continue; // full: spin (real code would back off)
      // Claim the slot; exactly one producer wins this CAS.
      if (Atomics.compareExchange(this.tailView, 0, tail, nextTail) === tail) {
        this.buffer[tail] = item;
        Atomics.store(this.ready, tail, 1); // publish: consumers may read now
        return;
      }
    }
  }

  dequeue() {
    while (true) {
      const head = Atomics.load(this.headView, 0);
      const tail = Atomics.load(this.tailView, 0);
      if (head === tail) return null; // empty
      if (Atomics.load(this.ready, head) === 0) continue; // claimed, not yet published
      const nextHead = (head + 1) % BUFFER_SIZE;
      // Claim the slot; exactly one consumer wins this CAS.
      if (Atomics.compareExchange(this.headView, 0, head, nextHead) === head) {
        const item = this.buffer[head];
        this.buffer[head] = null;
        Atomics.store(this.ready, head, 0); // recycle the slot
        return item;
      }
    }
  }
}

module.exports = LockFreeQueue;

Advanced moves:

  • ABA defense: embed version bits in the index or use tagged pointers.
  • False-sharing shield: pad head/tail counters into separate cache lines (both moves are sketched after this list).
  • Memory reclamation: add hazard pointers or an epoch-based garbage collector to safely reclaim nodes in unbounded variants.
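
A minimal sketch of the first two moves, assuming a 64-byte cache line (the common size on x86 and most ARM cores; verify for your target CPU):

const CACHE_LINE = 64;    // assumed line size; check your hardware
const BUFFER_SIZE = 1024; // power of two, so a mask replaces modulo

// False-sharing shield: place head and tail one cache line apart so the
// producer and consumer cores stop invalidating each other's line.
const counters = new SharedArrayBuffer(2 * CACHE_LINE);
const headView = new Uint32Array(counters, 0 * CACHE_LINE, 1);
const tailView = new Uint32Array(counters, 1 * CACHE_LINE, 1);

// ABA defense: let the counters grow monotonically and mask down to a
// slot only when indexing the ring. compareExchange then compares the
// full 32-bit counter, so an index that has wrapped all the way around
// no longer looks "equal" to its old value.
const slotOf = (counter) => counter & (BUFFER_SIZE - 1);

The monotonic counter doubles as the version tag: two claims of the same slot differ by a multiple of BUFFER_SIZE, so a stale compareExchange simply fails.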

Chandan smirks:

Chandan (softly): “No lock can outmuscle a well-ordered set of atomics.”

4. Side‑by‑Side Benchmarks

Threads | Locked (Avinash) | Lock‑Free (Chandan)
--------|------------------|--------------------
   1    | 10 M ops/sec     |  9 M ops/sec
   4    | 20 M ops/sec     | 35 M ops/sec
   8    | 22 M ops/sec     | 60 M ops/sec
  16    | 15 M ops/sec     | 80 M ops/sec

Chandan’s design shines under heavy load—over 5× improvement at 16 threads. In the world of high-frequency trading, every cache line and fence counts.
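
For readers who want a feel for such numbers, here is a minimal single-threaded sanity harness. It is a sketch only: real contention figures require worker_threads, pinned cores, and a warm-up phase, and the './LockFreeQueue' path is hypothetical.

const LockFreeQueue = require('./LockFreeQueue'); // hypothetical module path

function bench(queue, pairs = 1e7) {
  const start = process.hrtime.bigint();
  for (let i = 0; i < pairs; i++) {
    queue.enqueue(i);
    queue.dequeue();
  }
  const ns = Number(process.hrtime.bigint() - start);
  return (pairs * 2) / (ns / 1e9); // enqueue + dequeue ops per second
}

console.log(`${(bench(new LockFreeQueue()) / 1e6).toFixed(1)} M ops/sec`);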

5. The Rivalry Intensifies

  • Avinash’s rebuttal: “I can add a read‑write lock for batched consumers or a lock striping scheme to split contention across multiple shards.” (A striping sketch follows this exchange.)
  • Chandan’s counter: “Sure—just don’t forget to handle memory ordering across shards, or you’ll reintroduce subtle races.”
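
A hedged sketch of Avinash’s striping idea, built on the LockedQueue above. The shard count and hashing are illustrative assumptions, and note Chandan’s point: striping trades global FIFO ordering for lower contention.

const LockedQueue = require('./LockedQueue'); // hypothetical module path

class StripedQueue {
  constructor(shardCount = 8) { // shard count is an assumption; tune per workload
    this.shards = Array.from({ length: shardCount }, () => new LockedQueue());
    this.cursor = 0; // round-robin cursor for consumers
  }

  // Producers hash onto a shard, so contention drops by roughly the
  // shard count, at the cost of cross-shard ordering guarantees.
  enqueue(item, key = (Math.random() * this.shards.length) | 0) {
    return this.shards[key % this.shards.length].enqueue(item);
  }

  // Consumers sweep the shards round-robin; one full pass with no hit
  // means the queue looked empty at that instant.
  async dequeue() {
    for (let i = 0; i < this.shards.length; i++) {
      const shard = this.shards[this.cursor++ % this.shards.length];
      const item = await shard.dequeue();
      if (item !== null) return item;
    }
    return null;
  }
}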

They go back and forth, debating wait‑freedom vs. lock-freedom, bounded vs. unbounded queues, and throughput vs. tail latency. The hall echoes with talk of cache prefetching, NUMA-aware allocation, and vectorized operations.

6. Takeaways

  • Locks are simple, safe, and come with clear correctness models (mutual exclusion, deadlock freedom).
  • Lock‑free shines when nanoseconds matter: carefully crafted atomics, publication flags, and hazard pointers can multiply throughput under contention.
  • Always prototype and profile on your target hardware—false sharing, cache thrashing, and branch mispredictions lurk in every corner.

Avinash claps Chandan on the back:

Avinash: “Your atomic wizardry is impressive—next time I’ll bring my own reclaim scheme. Now let’s merge our forces: lock‑free shards with adaptive locking fallbacks.”