A high-performance HTTP/1.1 and WebSocket load generator built on Linux io_uring.
Official load generator for Http Arena
Requires Linux 6.1+, gcc, liburing-dev 2.5+
sudo apt install build-essential liburing-dev
git clone https://github.com/MDA2AV/gcannon.git && cd gcannon && make
An extremely fast HTTP load generator. Built on io_uring's batched async I/O to maximize requests per second from a single machine.
Per-request latency tracking with microsecond-resolution histograms via CLOCK_MONOTONIC. Every response is recorded — percentiles are exact, not estimated.
Pass --tui for a rich terminal interface with live progress, throughput graph, and colored results.
During execution, the TUI displays a progress bar, real-time throughput stats, and a sparkline graph showing req/s over time. Updates every second.
Percentile latencies displayed in a clean box-drawn table with color coding: cyan for normal, yellow for p99, red for p99.9.
When using -r N, each connection closes and reconnects after N request/response pairs. The results show the total reconnect count and confirm that every response was latency-sampled. In this example, 28.97M responses across 10 req/conn = ~2.9M reconnects.
The histogram automatically zooms into your data range. Bucket boundaries are computed from p0 to p99.9 of the actual latency distribution, divided into equal-width slices. Control granularity with -b.
Results are saved after every run to ~/.gcannon/history.bin (up to 100 runs). In TUI mode, bar graphs show req/s and avg latency trends across the last 10 runs. The Y-axis adapts to the data range so small differences are visible, and each bar shows its value at the tip. The current run is highlighted in green. Clear history with --clear-history.
Run History (5 runs): two terminal bar graphs, Req/s (3.46M–3.52M across the last five runs) and Avg Latency (144us–154us), each bar labeled with its value at the tip and the current run highlighted.
When using multiple --raw request files, pass --per-tpl-latency to track latency histograms per template. Each template gets its own percentile breakdown (avg, p50, p99, p99.9). Zero overhead when not enabled.
Machine-readable output for scripts, CI pipelines, and dashboards. Pass --json to get a single JSON object on stdout with no banner or progress output.
In WebSocket mode (--ws), additional ws_upgrades and ws_frames fields are included. When using multiple --raw templates with --per-tpl-latency, a per_template array is added with per-template response counts and latency percentiles.
Pipe to jq for filtering:
Simple CLI interface. The only required argument is the target URL.
| Flag | Default | Description |
|---|---|---|
| <url> | required | Target URL. Only http:// is supported (no TLS). Can be omitted if --raw files contain a Host header. |
| -c <N> | 100 | Total number of concurrent TCP connections, distributed evenly across worker threads. |
| -t <N> | 1 | Number of worker threads. Each runs an independent io_uring event loop. Typically set to the number of CPU cores. |
| -d <duration> | 10s | Benchmark duration. Accepts seconds (5s) or minutes (1m). A 100ms warmup is excluded from results. |
| -p <N> | 1 | HTTP pipeline depth (max 64). Sends N requests per connection before waiting for responses. Higher values increase throughput but raise latency. |
| -r <N> | unlimited | Requests per connection. After N request/response pairs complete, the connection is closed and reopened. Useful for testing connection handling. Default is keep-alive forever. |
| -s <code> | 200 | Expected HTTP status code. Glass Cannon warns and exits with code 1 if a significant portion of responses don't match this status class. |
| --raw <files> | — | Comma-separated list of raw HTTP request files. Each file contains a complete HTTP request including headers. Connections rotate through templates on reconnect. |
| --ws | off | WebSocket echo mode. Performs an HTTP upgrade, then sends and receives WebSocket frames. |
| --ws-msg <text> | hello | Custom WebSocket message payload. Implies --ws. |
| --tui | off | Enable TUI mode. Shows a live progress bar with throughput sparkline during execution, and colored table-formatted results. |
| --json | off | JSON output mode. Prints a single machine-readable JSON object to stdout. No banner or progress output. Ideal for scripts, CI pipelines, and dashboards. |
| -b <N> | 10 | Number of latency histogram buckets in TUI mode (max 100). Buckets are computed adaptively from the actual data range, not fixed thresholds. |
| --recv-buf <bytes> | 4096 | Receive buffer size in bytes (minimum 512). Each worker allocates 4096 buffers of this size. Larger values suit big response bodies; smaller values save memory with many connections. |
| --cqe-latency | off | Measure latency at io_uring CQE arrival (when the kernel signals data is ready) instead of after the full HTTP response is parsed. Gives a lower-level "time to first data" measurement that excludes userspace parse time. |
| --per-tpl-latency | off | Enable per-template latency histograms when using multiple --raw request files. Each template gets its own percentile breakdown. Only allocates extra memory when enabled. |
| --clear-history | — | Clear saved run history (~/.gcannon/history.bin) and exit. Does not run a benchmark. |
Latency numbers are only useful if they're measured correctly. Glass Cannon tracks per-request latency with microsecond resolution using a two-tier histogram.
CLOCK_MONOTONIC
All timestamps use the Linux kernel's CLOCK_MONOTONIC via clock_gettime(). This is a kernel-maintained clock that counts nanoseconds since an arbitrary point (usually boot). Unlike wall-clock time (CLOCK_REALTIME), it is immune to NTP adjustments, leap seconds, and manual time changes — it only moves forward at a steady rate. This makes it the correct clock for measuring elapsed durations. On modern x86_64, clock_gettime(CLOCK_MONOTONIC) reads the TSC register via the kernel's vDSO, so it doesn't even require a syscall — it's a fast userspace read with nanosecond resolution.
When a batch of pipelined requests is dispatched via io_uring, clock_gettime(CLOCK_MONOTONIC) captures a nanosecond timestamp. This timestamp is stored for each request in the batch in a per-connection circular buffer (send_times[]). Since all requests in a pipeline batch go out in a single io_uring_prep_send(), they share the same dispatch timestamp — this measures latency from application dispatch, not individual wire time.
When the HTTP response parser (picohttpparser) completes parsing a response, a second clock_gettime(CLOCK_MONOTONIC) call captures the arrival time. The latency for that request is arrival − send_times[oldest], measured in microseconds. The oldest timestamp is consumed from the circular buffer, matching responses to requests in FIFO order.
The latency sample is recorded in a two-tier histogram. Tier 1 covers 0–10ms at 1μs resolution (10,000 buckets) — this is where most samples land for fast servers. Tier 2 covers 10ms–5s at 100μs resolution (49,900 buckets). Anything above 5s goes into an overflow counter. This gives exact percentile calculations without storing individual samples or doing any heap allocation.
After the benchmark completes, histograms from all worker threads are merged by summing corresponding bucket counts. Percentiles (p50, p90, p99, p99.9) are computed by walking the merged histogram and finding the bucket where the cumulative count crosses the target threshold. This is an exact calculation, not a statistical estimate.
Many benchmarking tools report only averages or sample a fraction of requests. Averages hide tail latency — your p99 could be 100x your average, and you'd never know. Glass Cannon records every single response in a fixed-memory histogram, so percentiles are exact, not estimated.
The two-tier histogram uses ~234KB per worker and requires zero allocations during the benchmark. The sub-10ms tier has 1μs precision, which matters when your server responds in tens of microseconds. And because CLOCK_MONOTONIC reads via vDSO, the timestamping itself adds negligible overhead — no syscall, no context switch, just a fast register read.
The tool itself must not be the bottleneck. If your load generator maxes out before your server does, you're measuring the wrong thing.
io_uring processes thousands of operations per kernel transition. With epoll, each send/recv is a separate syscall — at 1M req/s that's 2M+ context switches per second.
These io_uring flags tell the kernel only one thread submits per ring, and to defer work until the next submission. This eliminates internal locking and reduces interrupts.
Written in C with all buffers pre-allocated. No GC pauses, no allocation jitter. The benchmark runs at constant memory usage from start to finish.
Glass Cannon uses a fundamentally different approach to I/O than traditional load generators. Instead of one thread per connection or async callbacks, it talks directly to the Linux kernel's io_uring interface.
Most load generators (wrk, hey, ab) use epoll — the application asks the kernel "which sockets are ready?", then makes individual read() and write() system calls for each one. Every syscall means a context switch between your program and the kernel.
io_uring uses two shared memory ring buffers between your program and the kernel. You write requests into the submission queue, and the kernel writes results into the completion queue. No system calls per operation. The kernel processes batches of I/O while your code processes batches of results.
The main thread spawns N worker threads. Each worker owns an independent io_uring ring, a set of connections, and pre-built request buffers. There is zero communication between workers during the benchmark — no mutexes, no atomics, no shared state.
Pre-registered buffer pool managed by the kernel. Eliminates per-recv buffer allocation and reduces submission overhead. Buffer size is configurable via --recv-buf.
One submission arms continuous receive on a socket. The kernel keeps delivering data without repeated requests.
N copies of the HTTP request are pre-built and concatenated at startup. One send call pushes the entire pipeline.
Each connection has a generation counter packed into io_uring user-data. Stale completions from reconnected sockets are safely ignored.
Each worker thread sets up its own isolated environment before entering the event loop. Nothing is shared between workers — each one is a fully self-contained load generator.
A ring with 4096 submission queue entries is created with IORING_SETUP_SINGLE_ISSUER (only this thread will ever submit) and IORING_SETUP_DEFER_TASKRUN (kernel defers work until the next submission, reducing interrupts). These two flags together eliminate all internal locking in the kernel's io_uring path.
4096 receive buffers (4KB each by default, configurable via --recv-buf) are pre-registered with the kernel. When a recv completion fires, the kernel picks a buffer from this pool automatically — no userspace allocation or buffer address needed in the submission. After processing, the buffer is returned to the ring for reuse.
An array of gc_conn_t structs is allocated — one per connection assigned to this worker. Each connection gets a non-blocking TCP socket with TCP_NODELAY, and an io_uring_prep_connect() is submitted for each. All connects happen in parallel via the ring.
The core of Glass Cannon is a tight loop that processes io_uring completions in batches. Each iteration drains up to 2048 completions, processes them, then submits new operations — all in a single kernel transition.
Every io_uring submission carries a 64-bit user-data value that comes back with the completion. Glass Cannon packs three fields into this value to identify what each completion means:
The generation counter solves a critical race: when a connection is closed and a new socket is opened on the same index, completions from the old socket may still arrive. The worker compares cqe_gen != c->gen and silently discards stale completions. This avoids use-after-free bugs without any locking.
Each connection follows a state machine driven entirely by io_uring completions. No threads block, no callbacks are registered — just CQEs moving state forward.
Traditional recv requires submitting a new SQE for every chunk of data you want to read, and you must tell the kernel which buffer to write into. Glass Cannon uses two io_uring features to eliminate both costs:
A single io_uring_prep_recv_multishot() call arms continuous receive on a socket. The kernel fires a CQE for each chunk of data that arrives, without the application resubmitting. The IORING_CQE_F_MORE flag on each CQE indicates whether the multishot is still armed. Only when it's not (socket closed, error, or buffer exhaustion) does the application need to rearm.
Instead of specifying a buffer address in each recv SQE, the application registers a pool of buffers with the kernel upfront. The kernel picks a free buffer from the pool when data arrives and reports which one it chose via the IORING_CQE_F_BUFFER flag and buffer ID in the CQE. After processing, the application returns the buffer to the pool. No allocation, no copying.
Combined, these mean a single SQE handles all future data on a connection, with zero-copy buffer management. The application only touches buffers when it has data to parse.
Glass Cannon keeps the pipeline full at all times. When responses arrive, it immediately sends new requests to replace them:
All requests in a pipeline batch are pre-built and concatenated at startup into a single buffer. Sending N pipelined requests is one io_uring_prep_send() call — no per-request formatting or allocation. If the kernel can't send the full buffer in one shot (partial send), the worker resubmits the remainder automatically.
The key to Glass Cannon's throughput is batching at every level. Instead of processing one event at a time:
io_uring_peek_batch_cqe() drains up to 2048 completions in one call. All are processed before any new submissions, amortizing the cost of the kernel transition across thousands of operations.
New operations (sends, recv rearms, connects) are queued as SQEs during CQE processing, then flushed to the kernel in a single io_uring_submit() at the end of each batch. One syscall for potentially thousands of new operations.
N HTTP requests are concatenated and sent in a single send operation. The response parser handles multiple responses per recv buffer, matching them to pipelined requests in FIFO order.