Engineering a Sub‑Microsecond Queue for High‑Frequency Trading

It’s early morning in Mumbai’s trading hub—just as servers hum to life, a red alert flashes across the Latency Dashboard. Our two star engineers, Chandan and Avinash, dash into the server room. Who will rescue us from a budding performance crisis?

Meet the Duo

  • Chandan: The quiet problem-solver. He’s a wizard with atomics, memory ordering, and low-level tricks.
  • Avinash: The big-picture guy. Give him a mutex, and he’ll build an ironclad system with deadlock avoidance and priority inheritance in minutes.

1. The Morning Call

A sudden spike in order-processing time sends panic through the desks. The CTO’s voice crackles over the intercom:

“We need sub-microsecond response. No excuses.”

Chandan and Avinash exchange a glance—this is more than a friendly competition. It’s about the survival of our trading engine.

2. Avinash’s Lock-Based Queue

Avinash strides to his workstation like a general to the front lines. He leverages a classic mutex with priority inversion safeguards:

const { Mutex } = require('async-mutex');

class LockedQueue {
  constructor() {
    this.queue = [];          // plain array, guarded by the mutex
    this.mutex = new Mutex(); // single lock, never nested: no deadlock
  }

  async enqueue(item) {
    const release = await this.mutex.acquire();
    try {
      this.queue.push(item);
    } finally {
      release(); // always release, even if push throws
    }
  }

  async dequeue() {
    const release = await this.mutex.acquire();
    try {
      // Note: Array#shift is O(n); a ring buffer would avoid the copy.
      return this.queue.length ? this.queue.shift() : null;
    } finally {
      release();
    }
  }
}

module.exports = LockedQueue;
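
A quick smoke test might look like this (a minimal sketch, assuming Node.js with the async-mutex package installed; the './LockedQueue' path is hypothetical):

const LockedQueue = require('./LockedQueue'); // hypothetical module path

(async () => {
  const q = new LockedQueue();
  await q.enqueue({ orderId: 42, side: 'BUY' });
  console.log(await q.dequeue()); // -> { orderId: 42, side: 'BUY' }
  console.log(await q.dequeue()); // -> null (queue is empty)
})();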

He glances back at the team.

Avinash (grinning): “Watch and learn—deadlock-free, livelock-free, and with priority inheritance it handles the worst of thread storms.”

Why it works:

  • Deadlock avoidance via single lock, no nested locking.
  • Priority inheritance prevents unbounded priority inversion: a low-priority lock holder is boosted so it cannot stall high-priority threads.

Why it chokes:

  • Global bottleneck: one lock serializes all producers and consumers.
  • Cache-line bouncing: lock/unlock ping-pongs the cache line, adding hundreds of nanoseconds under contention.

Benchmark time. Sixteen threads charge. The fortress groans at 15 M ops/sec—solid, yet far from unbeatable.

3. Chandan’s Lock‑Free Masterpiece

Chandan retreats to a quiet corner, headphones on, lights dimmed. He crafts a lock-free, multi‑producer/multi‑consumer ring buffer: a compare-and-swap claims each slot, and a per-slot ready flag tells consumers when the payload has actually been written. (JavaScript’s Atomics operations are sequentially consistent, so no explicit memory fence is needed, and none exists in the spec.)

const BUFFER_SIZE = 1024;

class LockFreeQueue {
  constructor() {
    // Payload slots. Caveat: a plain Array lives in one thread's heap;
    // for true cross-worker sharing, the payload itself must sit in a
    // SharedArrayBuffer-backed typed array.
    this.buffer = new Array(BUFFER_SIZE).fill(null);
    // Per-slot ready flags: 1 means the slot holds a published item.
    this.ready = new Int32Array(new SharedArrayBuffer(4 * BUFFER_SIZE));
    this.headView = new Uint32Array(new SharedArrayBuffer(4)); // consumer index
    this.tailView = new Uint32Array(new SharedArrayBuffer(4)); // producer index
  }

  enqueue(item) {
    while (true) {
      const tail = Atomics.load(this.tailView, 0);
      const head = Atomics.load(this.headView, 0);
      const nextTail = (tail + 1) % BUFFER_SIZE;
      if (nextTail === head) continue; // full: spin (real code would back off)
      // Claim the slot; exactly one producer wins this CAS.
      if (Atomics.compareExchange(this.tailView, 0, tail, nextTail) === tail) {
        this.buffer[tail] = item;
        Atomics.store(this.ready, tail, 1); // publish: consumers may read now
        return;
      }
    }
  }

  dequeue() {
    while (true) {
      const head = Atomics.load(this.headView, 0);
      const tail = Atomics.load(this.tailView, 0);
      if (head === tail) return null; // empty
      if (Atomics.load(this.ready, head) === 0) continue; // claimed, not yet published
      const nextHead = (head + 1) % BUFFER_SIZE;
      // Claim the slot; exactly one consumer wins this CAS.
      if (Atomics.compareExchange(this.headView, 0, head, nextHead) === head) {
        const item = this.buffer[head];
        this.buffer[head] = null;
        Atomics.store(this.ready, head, 0); // recycle the slot
        return item;
      }
    }
  }
}

module.exports = LockFreeQueue;

Advanced moves:

  • ABA defense: embed version bits in the index or use tagged pointers.
  • False-sharing shield: pad head/tail counters into separate cache lines (both moves are sketched after this list).
  • Memory reclamation: add hazard pointers or an epoch-based garbage collector to safely reclaim nodes in unbounded variants.
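
A minimal sketch of the first two moves, assuming a 64-byte cache line (the common size on x86 and most ARM cores; verify for your target CPU):

const CACHE_LINE = 64;    // assumed line size; check your hardware
const BUFFER_SIZE = 1024; // power of two, so a mask replaces modulo

// False-sharing shield: place head and tail one cache line apart so the
// producer and consumer cores stop invalidating each other's line.
const counters = new SharedArrayBuffer(2 * CACHE_LINE);
const headView = new Uint32Array(counters, 0 * CACHE_LINE, 1);
const tailView = new Uint32Array(counters, 1 * CACHE_LINE, 1);

// ABA defense: let the counters grow monotonically and mask down to a
// slot only when indexing the ring. compareExchange then compares the
// full 32-bit counter, so an index that has wrapped all the way around
// no longer looks "equal" to its old value.
const slotOf = (counter) => counter & (BUFFER_SIZE - 1);

The monotonic counter doubles as the version tag: two claims of the same slot differ by a multiple of BUFFER_SIZE, so a stale compareExchange simply fails.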

Chandan smirks:

Chandan (softly): “No lock can outmuscle a well-ordered set of atomics.”

4. Side‑by‑Side Benchmarks

Threads | Locked (Avinash) | Lock‑Free (Chandan)
--------|------------------|--------------------
   1    | 10 M ops/sec     |  9 M ops/sec
   4    | 20 M ops/sec     | 35 M ops/sec
   8    | 22 M ops/sec     | 60 M ops/sec
  16    | 15 M ops/sec     | 80 M ops/sec

Chandan’s design shines under heavy load—over 5× improvement at 16 threads. In the world of high-frequency trading, every cache line and fence counts.
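
For readers who want a feel for such numbers, here is a minimal single-threaded sanity harness. It is a sketch only: real contention figures require worker_threads, pinned cores, and a warm-up phase, and the './LockFreeQueue' path is hypothetical.

const LockFreeQueue = require('./LockFreeQueue'); // hypothetical module path

function bench(queue, pairs = 1e7) {
  const start = process.hrtime.bigint();
  for (let i = 0; i < pairs; i++) {
    queue.enqueue(i);
    queue.dequeue();
  }
  const ns = Number(process.hrtime.bigint() - start);
  return (pairs * 2) / (ns / 1e9); // enqueue + dequeue ops per second
}

console.log(`${(bench(new LockFreeQueue()) / 1e6).toFixed(1)} M ops/sec`);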

5. The Rivalry Intensifies

  • Avinash’s rebuttal: “I can add a read‑write lock for batched consumers or a lock striping scheme to split contention across multiple shards.” (A striping sketch follows this exchange.)
  • Chandan’s counter: “Sure—just don’t forget to handle memory ordering across shards, or you’ll reintroduce subtle races.”
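
A hedged sketch of Avinash’s striping idea, built on the LockedQueue above. The shard count and hashing are illustrative assumptions, and note Chandan’s point: striping trades global FIFO ordering for lower contention.

const LockedQueue = require('./LockedQueue'); // hypothetical module path

class StripedQueue {
  constructor(shardCount = 8) { // shard count is an assumption; tune per workload
    this.shards = Array.from({ length: shardCount }, () => new LockedQueue());
    this.cursor = 0; // round-robin cursor for consumers
  }

  // Producers hash onto a shard, so contention drops by roughly the
  // shard count, at the cost of cross-shard ordering guarantees.
  enqueue(item, key = (Math.random() * this.shards.length) | 0) {
    return this.shards[key % this.shards.length].enqueue(item);
  }

  // Consumers sweep the shards round-robin; one full pass with no hit
  // means the queue looked empty at that instant.
  async dequeue() {
    for (let i = 0; i < this.shards.length; i++) {
      const shard = this.shards[this.cursor++ % this.shards.length];
      const item = await shard.dequeue();
      if (item !== null) return item;
    }
    return null;
  }
}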

They go back and forth, debating wait‑freedom vs. lock-freedom, bounded vs. unbounded queues, and throughput vs. tail latency. The hall echoes with talk of cache prefetching, NUMA-aware allocation, and vectorized operations.

6. Takeaways

  • Locks are simple, safe, and come with clear correctness models (mutual exclusion, deadlock freedom).
  • Lock‑free shines when nanoseconds matter: carefully crafted atomics, publication flags, and hazard pointers can multiply throughput under contention.
  • Always prototype and profile on your target hardware—false sharing, cache thrashing, and branch mispredictions lurk in every corner.

Avinash claps Chandan on the back:

Avinash: “Your atomic wizardry is impressive—next time I’ll bring my own reclaim scheme. Now let’s merge our forces: lock‑free shards with adaptive locking fallbacks.”