C++ · 2026-03-10

Building a Sub-Microsecond Matching Engine in C++20

How a three-layer data structure design achieves O(1) cancellation and 84ns median latency.

Why Build a Matching Engine?

Every electronic exchange — NYSE, NASDAQ, CME — runs a matching engine at its core. It is the system that pairs buy orders with sell orders at the best available price. Understanding how one works at the systems level means understanding price-time priority, memory allocation on the hot path, and sub-microsecond latency engineering.

NanoMatch is my from-scratch implementation in modern C++20, achieving 9.29M operations per second with 84ns p50 latency.

The Three-Layer Data Structure

A naive implementation might use a single priority queue. The problem: O(1) best price but O(n) cancellation. On real exchanges, 90%+ of all order activity is cancellations, so optimising cancel latency matters more than insert.

NanoMatch uses three cooperating structures:

| Layer | Structure | Purpose |
|-------|-----------|---------|
| Price levels | std::map<Price, PriceLevel> | Sorted by price: std::greater for bids (highest first), std::less for asks (lowest first). O(log M) insert, O(1) best via begin(). |
| Order queue | std::list<Order> per level | FIFO queue at each price. O(1) append, O(1) erase by iterator. |
| Lookup | std::unordered_map<OrderID, Iterator> | O(1) cancel: hash lookup yields a list iterator; erase is constant time. |

This is the same asymptotic profile used by production exchange engines.
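The three layers can be sketched for one side of the book. This is a minimal illustration, not NanoMatch's actual API; the names Order, BidBook, add, cancel, and best are assumptions for the example:

```cpp
// One side of the book (bids). Three cooperating structures:
// sorted price levels, a FIFO list per level, and an id -> iterator index.
#include <cstdint>
#include <functional>
#include <iterator>
#include <list>
#include <map>
#include <unordered_map>

using Price   = std::int32_t;   // integer cents
using OrderID = std::uint64_t;

struct Order {
    OrderID       id;
    Price         price;
    std::uint32_t qty;
};

struct BidBook {
    using Level = std::list<Order>;                       // FIFO queue per price
    std::map<Price, Level, std::greater<Price>> levels;   // highest bid first
    std::unordered_map<OrderID, Level::iterator> index;   // O(1) id -> list node

    void add(const Order& o) {
        auto& level = levels[o.price];         // O(log M) find-or-create level
        level.push_back(o);                    // O(1) FIFO append
        index[o.id] = std::prev(level.end());  // remember the node for cancel
    }

    bool cancel(OrderID id) {
        auto it = index.find(id);              // O(1) hash lookup
        if (it == index.end()) return false;
        auto node = it->second;
        Price p   = node->price;               // read before erase invalidates node
        levels[p].erase(node);                 // O(1) list erase by iterator
        if (levels[p].empty()) levels.erase(p);
        index.erase(it);
        return true;
    }

    const Order* best() const {                // O(1): begin() of the sorted map
        return levels.empty() ? nullptr : &levels.begin()->second.front();
    }
};
```

Note that cancel never searches a queue: the hash map hands back the exact list node, which std::list can unlink in constant time without invalidating any other iterator.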

Integer Prices — Avoiding Floating-Point Traps

Prices are stored as int32_t in cents ($101.50 = 10150). Floating-point comparison (==) is unreliable due to representation error: in a matching engine, that could cause price levels that should merge to remain separate entries, or orders to match at the wrong price. Integer arithmetic makes comparison exact and deterministic.
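A tiny demonstration of the trap (the helper names are illustrative): two orders whose prices should land on the same 30-cent level compare unequal when the price was accumulated in doubles, but exactly when held in integer cents.

```cpp
#include <cstdint>

// Computing "30 cents" two ways. In binary64, 0.10 + 0.20 yields
// 0.30000000000000004, so the == comparison the book relies on fails
// and the two levels would not merge.
inline bool float_prices_merge() { return (0.10 + 0.20) == 0.30; }  // false

// The same price in integer cents compares exactly and deterministically.
inline bool cent_prices_merge()  { return (10 + 20) == 30; }        // true
```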

The Pool Allocator

Every malloc / free on the hot path risks kernel-mode syscalls, lock contention in the allocator, and cache pollution. NanoMatch uses a custom free-list pool allocator:

  • Pre-allocates a contiguous std::vector<T> of fixed capacity
  • allocate() = pop head of free list = O(1), zero syscalls
  • deallocate() = push to head = O(1), zero syscalls
  • All memory lives in a single cache-friendly contiguous block

The benchmark shows the pool allocator eliminates the allocation jitter that causes tail latency spikes.
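The free-list idea can be sketched as follows. This is a simplified illustration of the technique, not NanoMatch's actual allocator; for brevity it hands out raw storage and leaves construction (placement-new) to the caller:

```cpp
// Free-list pool: capacity slots pre-allocated in one contiguous block.
// After construction, allocate/deallocate are O(1) pointer swaps with
// zero syscalls and zero lock contention.
#include <cstddef>
#include <vector>

template <typename T>
class Pool {
    // Each slot is either a free-list link or storage for one T.
    union Slot { Slot* next; alignas(T) unsigned char storage[sizeof(T)]; };
    std::vector<Slot> slots_;   // single contiguous, cache-friendly block
    Slot* free_head_ = nullptr;

public:
    explicit Pool(std::size_t capacity) : slots_(capacity) {
        // Thread every slot onto the free list (front to back).
        for (std::size_t i = capacity; i-- > 0;) {
            slots_[i].next = free_head_;
            free_head_ = &slots_[i];
        }
    }

    T* allocate() {                        // pop head of free list: O(1)
        if (!free_head_) return nullptr;   // pool exhausted
        Slot* s = free_head_;
        free_head_ = s->next;
        return reinterpret_cast<T*>(s->storage);
    }

    void deallocate(T* p) {                // push back onto head: O(1)
        Slot* s = reinterpret_cast<Slot*>(p);
        s->next = free_head_;
        free_head_ = s;
    }
};
```

The LIFO free list also helps the cache: the most recently freed slot is the next one handed out, so hot-path allocations tend to reuse lines already resident in L1.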

Order Types Done Right

Four order types with production-grade semantics:

  • Limit: Match what crosses, rest the remainder on the book
  • Market: Price set to MAX/MIN to guarantee crossing, never rests
  • IOC (Immediate-or-Cancel): Like limit but cancel unfilled remainder
  • FOK (Fill-or-Kill): Atomic pre-check via can_fill_completely() — scans all crossable levels before executing. If insufficient liquidity exists, zero fills occur
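The FOK pre-check logic can be sketched as a standalone function. can_fill_completely is the name used above; this version, which walks a plain list of ask levels rather than the book itself, is an illustrative assumption:

```cpp
// FOK pre-check: sum resting quantity across all crossable ask levels
// before executing anything. Either the whole order can fill, or we
// touch nothing - zero partial fills.
#include <cstdint>
#include <utility>
#include <vector>

using Price = std::int32_t;
using Qty   = std::uint64_t;

// asks: (price, total resting qty at that level), sorted ascending by price.
bool can_fill_completely(const std::vector<std::pair<Price, Qty>>& asks,
                         Price limit, Qty wanted) {
    Qty available = 0;
    for (const auto& [price, qty] : asks) {
        if (price > limit) break;          // no longer crossable, stop scanning
        available += qty;
        if (available >= wanted) return true;
    }
    return false;                          // insufficient liquidity: kill, zero fills
}
```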

Modify is implemented as cancel + re-add, which is correct exchange behavior — modifications lose time priority, just like on NYSE and NASDAQ.

Latency Profile

| Percentile | Latency |
|------------|---------|
| p50 | 84 ns |
| p99 | 625 ns |
| p99.9 | 1,250 ns |

The 15x ratio between p50 and p99.9 is realistic. Spikes come from red-black tree rebalancing on new price level insertion, orders sweeping multiple levels, and cache misses on cold levels. In production, tail latency is the metric that differentiates systems.
