C++ · 2026-03-10

Building a Sub-Microsecond Matching Engine in C++20

How a three-layer data structure design achieves O(1) cancellation and 84ns median latency.

Why Build a Matching Engine?

Every electronic exchange — NYSE, NASDAQ, CME — runs a matching engine at its core. It is the system that pairs buy orders with sell orders at the best available price. Understanding how one works at the systems level means understanding price-time priority, memory allocation on the hot path, and sub-microsecond latency engineering.

NanoMatch is my from-scratch implementation in modern C++20, achieving 9.29M operations per second with 84ns p50 latency.

The Three-Layer Data Structure

A naive implementation might use a single priority queue. The problem: O(1) best price but O(n) cancellation. On real exchanges, 90%+ of all order activity is cancellations, so optimising cancel latency matters more than insert.

NanoMatch uses three cooperating structures:

| Layer | Structure | Purpose |
|-------|-----------|---------|
| Price levels | std::map<Price, PriceLevel> | Sorted by price: std::greater for bids (highest first), std::less for asks (lowest first). O(log M) insert, O(1) best via begin(). |
| Order queue | std::list<Order> per level | FIFO queue at each price. O(1) append, O(1) erase by iterator. |
| Lookup | std::unordered_map<OrderID, Iterator> | O(1) cancel: hash lookup yields a list iterator; erase is constant time. |

This is the same asymptotic profile used by production exchange engines.
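The three layers can be sketched for one side of the book. This is a minimal illustration, not NanoMatch's actual API; the names Order, BidBook, add, cancel, and best are assumptions for the example:

```cpp
// One side of the book (bids). Three cooperating structures:
// sorted price levels, a FIFO list per level, and an id -> iterator index.
#include <cstdint>
#include <functional>
#include <iterator>
#include <list>
#include <map>
#include <unordered_map>

using Price   = std::int32_t;   // integer cents
using OrderID = std::uint64_t;

struct Order {
    OrderID       id;
    Price         price;
    std::uint32_t qty;
};

struct BidBook {
    using Level = std::list<Order>;                       // FIFO queue per price
    std::map<Price, Level, std::greater<Price>> levels;   // highest bid first
    std::unordered_map<OrderID, Level::iterator> index;   // O(1) id -> list node

    void add(const Order& o) {
        auto& level = levels[o.price];         // O(log M) find-or-create level
        level.push_back(o);                    // O(1) FIFO append
        index[o.id] = std::prev(level.end());  // remember the node for cancel
    }

    bool cancel(OrderID id) {
        auto it = index.find(id);              // O(1) hash lookup
        if (it == index.end()) return false;
        auto node = it->second;
        Price p   = node->price;               // read before erase invalidates node
        levels[p].erase(node);                 // O(1) list erase by iterator
        if (levels[p].empty()) levels.erase(p);
        index.erase(it);
        return true;
    }

    const Order* best() const {                // O(1): begin() of the sorted map
        return levels.empty() ? nullptr : &levels.begin()->second.front();
    }
};
```

Note that cancel never searches a queue: the hash map hands back the exact list node, which std::list can unlink in constant time without invalidating any other iterator.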

Integer Prices — Avoiding Floating-Point Traps

Prices are stored as int32_t in cents ($101.50 = 10150). Floating-point comparison (==) is unreliable due to representation error: in a matching engine, that could cause price levels that should merge to remain separate entries, or orders to match at the wrong price. Integer arithmetic makes comparison exact and deterministic.
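A tiny demonstration of the trap (the helper names are illustrative): two orders whose prices should land on the same 30-cent level compare unequal when the price was accumulated in doubles, but exactly when held in integer cents.

```cpp
#include <cstdint>

// Computing "30 cents" two ways. In binary64, 0.10 + 0.20 yields
// 0.30000000000000004, so the == comparison the book relies on fails
// and the two levels would not merge.
inline bool float_prices_merge() { return (0.10 + 0.20) == 0.30; }  // false

// The same price in integer cents compares exactly and deterministically.
inline bool cent_prices_merge()  { return (10 + 20) == 30; }        // true
```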

The Pool Allocator

Every malloc / free on the hot path risks kernel-mode syscalls, lock contention in the allocator, and cache pollution. NanoMatch uses a custom free-list pool allocator:

  • Pre-allocates a contiguous std::vector<T> of fixed capacity
  • allocate() = pop head of free list = O(1), zero syscalls
  • deallocate() = push to head = O(1), zero syscalls
  • All memory lives in a single cache-friendly contiguous block

The benchmark shows the pool allocator eliminates the allocation jitter that causes tail latency spikes.
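The free-list idea can be sketched as follows. This is a simplified illustration of the technique, not NanoMatch's actual allocator; for brevity it hands out raw storage and leaves construction (placement-new) to the caller:

```cpp
// Free-list pool: capacity slots pre-allocated in one contiguous block.
// After construction, allocate/deallocate are O(1) pointer swaps with
// zero syscalls and zero lock contention.
#include <cstddef>
#include <vector>

template <typename T>
class Pool {
    // Each slot is either a free-list link or storage for one T.
    union Slot { Slot* next; alignas(T) unsigned char storage[sizeof(T)]; };
    std::vector<Slot> slots_;   // single contiguous, cache-friendly block
    Slot* free_head_ = nullptr;

public:
    explicit Pool(std::size_t capacity) : slots_(capacity) {
        // Thread every slot onto the free list (front to back).
        for (std::size_t i = capacity; i-- > 0;) {
            slots_[i].next = free_head_;
            free_head_ = &slots_[i];
        }
    }

    T* allocate() {                        // pop head of free list: O(1)
        if (!free_head_) return nullptr;   // pool exhausted
        Slot* s = free_head_;
        free_head_ = s->next;
        return reinterpret_cast<T*>(s->storage);
    }

    void deallocate(T* p) {                // push back onto head: O(1)
        Slot* s = reinterpret_cast<Slot*>(p);
        s->next = free_head_;
        free_head_ = s;
    }
};
```

The LIFO free list also helps the cache: the most recently freed slot is the next one handed out, so hot-path allocations tend to reuse lines already resident in L1.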

Order Types Done Right

Four order types with production-grade semantics:

  • Limit: Match what crosses, rest the remainder on the book
  • Market: Price set to MAX/MIN to guarantee crossing, never rests
  • IOC (Immediate-or-Cancel): Like limit but cancel unfilled remainder
  • FOK (Fill-or-Kill): Atomic pre-check via can_fill_completely() — scans all crossable levels before executing. If insufficient liquidity exists, zero fills occur
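The FOK pre-check logic can be sketched as a standalone function. can_fill_completely is the name used above; this version, which walks a plain list of ask levels rather than the book itself, is an illustrative assumption:

```cpp
// FOK pre-check: sum resting quantity across all crossable ask levels
// before executing anything. Either the whole order can fill, or we
// touch nothing - zero partial fills.
#include <cstdint>
#include <utility>
#include <vector>

using Price = std::int32_t;
using Qty   = std::uint64_t;

// asks: (price, total resting qty at that level), sorted ascending by price.
bool can_fill_completely(const std::vector<std::pair<Price, Qty>>& asks,
                         Price limit, Qty wanted) {
    Qty available = 0;
    for (const auto& [price, qty] : asks) {
        if (price > limit) break;          // no longer crossable, stop scanning
        available += qty;
        if (available >= wanted) return true;
    }
    return false;                          // insufficient liquidity: kill, zero fills
}
```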

Modify is implemented as cancel + re-add, which is correct exchange behavior — modifications lose time priority, just like on NYSE and NASDAQ.

Latency Profile

| Percentile | Latency |
|------------|---------|
| p50 | 84 ns |
| p99 | 625 ns |
| p99.9 | 1,250 ns |

The 15x ratio between p50 and p99.9 is realistic. Spikes come from red-black tree rebalancing on new price level insertion, orders sweeping multiple levels, and cache misses on cold levels. In production, tail latency is the metric that differentiates systems.
