A high-performance HTTP/1.1 and WebSocket load generator built on Linux io_uring.
Official load generator for Http Arena
Requires Linux 6.1+, gcc, liburing-dev 2.5+
sudo apt install build-essential liburing-dev
git clone https://github.com/MDA2AV/gcannon.git && cd gcannon && make
An extremely fast HTTP load generator. Built on io_uring's batched async I/O to maximize requests per second from a single machine.
Per-request latency tracking with microsecond-resolution histograms via CLOCK_MONOTONIC. Every response is recorded — percentiles are exact, not estimated.
Pass --tui for a rich terminal interface with live progress, throughput graph, and colored results.
During execution, the TUI displays a progress bar, real-time throughput stats, and a sparkline graph showing req/s over time. Updates every second.
Percentile latencies displayed in a clean box-drawn table with color coding: cyan for normal, yellow for p99, red for p99.9.
When using -r N, each connection closes and reconnects after N request/response pairs. The results show the total reconnect count and confirm that every response was latency-sampled. In this example, 28.97M responses across 10 req/conn = ~2.9M reconnects.
The histogram automatically zooms into your data range. Bucket boundaries are computed from p0 to p99.9 of the actual latency distribution, divided into equal-width slices. Control granularity with -b.
Results are saved after every run to ~/.gcannon/history.bin (up to 100 runs). In TUI mode, bar graphs show req/s and avg latency trends across the last 10 runs. The Y-axis adapts to the data range so small differences are visible, and each bar shows its value at the tip. The current run is highlighted in green. Clear history with --clear-history.
Run History (5 runs): two terminal bar graphs, Req/s (3.46M–3.52M across the last five runs) and Avg Latency (144us–154us), each bar labeled with its value at the tip and the current run highlighted.
When using multiple --raw request files, pass --per-tpl-latency to track latency histograms per template. Each template gets its own percentile breakdown (avg, p50, p99, p99.9). Zero overhead when not enabled.
Machine-readable output for scripts, CI pipelines, and dashboards. Pass --json to get a single JSON object on stdout with no banner or progress output.
In WebSocket mode (--ws), additional ws_upgrades and ws_frames fields are included. When using multiple --raw templates with --per-tpl-latency, a per_template array is added with per-template response counts and latency percentiles.
Pipe to jq for filtering:
Simple CLI interface. The only required argument is the target URL.
| Flag | Default | Description |
|---|---|---|
| <url> | required | Target URL. Only http:// is supported (no TLS). Can be omitted if --raw files contain a Host header. |
| -c <N> | 100 | Total number of concurrent TCP connections, distributed evenly across worker threads. |
| -t <N> | 1 | Number of worker threads. Each runs an independent io_uring event loop. Typically set to the number of CPU cores. |
| -d <duration> | 10s | Benchmark duration. Accepts seconds (5s) or minutes (1m). A 100ms warmup is excluded from results. |
| -p <N> | 1 | HTTP pipeline depth (max 64). Sends N requests per connection before waiting for responses. Higher values increase throughput but raise latency. |
| -r <N> | unlimited | Requests per connection. After N request/response pairs complete, the connection is closed and reopened. Useful for testing connection handling. Default is keep-alive forever. |
| -s <code> | 200 | Expected HTTP status code. Glass Cannon warns and exits with code 1 if a significant portion of responses don't match this status class. |
| --raw <files> | — | Comma-separated list of raw HTTP request files. Each file contains a complete HTTP request including headers. Connections rotate through templates on reconnect. |
| --ws | off | WebSocket echo mode. Performs an HTTP upgrade, then sends and receives WebSocket frames. |
| --ws-msg <text> | hello | Custom WebSocket message payload. Implies --ws. |
| --tui | off | Enable TUI mode. Shows a live progress bar with throughput sparkline during execution, and colored table-formatted results. |
| --json | off | JSON output mode. Prints a single machine-readable JSON object to stdout. No banner or progress output. Ideal for scripts, CI pipelines, and dashboards. |
| -b <N> | 10 | Number of latency histogram buckets in TUI mode (max 100). Buckets are computed adaptively from the actual data range, not fixed thresholds. |
| --recv-buf <bytes> | 4096 | Receive buffer size in bytes (minimum 512). Each worker allocates 4096 buffers of this size. Larger values suit big response bodies; smaller values save memory with many connections. |
| --cqe-latency | off | Measure latency at io_uring CQE arrival (when the kernel signals data is ready) instead of after the full HTTP response is parsed. Gives a lower-level "time to first data" measurement that excludes userspace parse time. |
| --per-tpl-latency | off | Enable per-template latency histograms when using multiple --raw request files. Each template gets its own percentile breakdown. Only allocates extra memory when enabled. |
| --clear-history | — | Clear saved run history (~/.gcannon/history.bin) and exit. Does not run a benchmark. |
Latency numbers are only useful if they're measured correctly. Glass Cannon tracks per-request latency with microsecond resolution using a two-tier histogram.
CLOCK_MONOTONIC
All timestamps use the Linux kernel's CLOCK_MONOTONIC via clock_gettime(). This is a kernel-maintained clock that counts nanoseconds since an arbitrary point (usually boot). Unlike wall-clock time (CLOCK_REALTIME), it is immune to NTP adjustments, leap seconds, and manual time changes — it only moves forward at a steady rate. This makes it the correct clock for measuring elapsed durations. On modern x86_64, clock_gettime(CLOCK_MONOTONIC) reads the TSC register via the kernel's vDSO, so it doesn't even require a syscall — it's a fast userspace read with nanosecond resolution.
When a batch of pipelined requests is dispatched via io_uring, clock_gettime(CLOCK_MONOTONIC) captures a nanosecond timestamp. This timestamp is stored for each request in the batch in a per-connection circular buffer (send_times[]). Since all requests in a pipeline batch go out in a single io_uring_prep_send(), they share the same dispatch timestamp — this measures latency from application dispatch, not individual wire time.
When the HTTP response parser (picohttpparser) completes parsing a response, a second clock_gettime(CLOCK_MONOTONIC) call captures the arrival time. The latency for that request is arrival − send_times[oldest], measured in microseconds. The oldest timestamp is consumed from the circular buffer, matching responses to requests in FIFO order.
The latency sample is recorded in a two-tier histogram. Tier 1 covers 0–10ms at 1μs resolution (10,000 buckets) — this is where most samples land for fast servers. Tier 2 covers 10ms–5s at 100μs resolution (49,900 buckets). Anything above 5s goes into an overflow counter. This gives exact percentile calculations without storing individual samples or doing any heap allocation.
After the benchmark completes, histograms from all worker threads are merged by summing corresponding bucket counts. Percentiles (p50, p90, p99, p99.9) are computed by walking the merged histogram and finding the bucket where the cumulative count crosses the target threshold. This is an exact calculation, not a statistical estimate.
Many benchmarking tools report only averages or sample a fraction of requests. Averages hide tail latency — your p99 could be 100x your average, and you'd never know. Glass Cannon records every single response in a fixed-memory histogram, so percentiles are exact, not estimated.
The two-tier histogram uses ~234KB per worker and requires zero allocations during the benchmark. The sub-10ms tier has 1μs precision, which matters when your server responds in tens of microseconds. And because CLOCK_MONOTONIC reads via vDSO, the timestamping itself adds negligible overhead — no syscall, no context switch, just a fast register read.
The tool itself must not be the bottleneck. If your load generator maxes out before your server does, you're measuring the wrong thing.
io_uring processes thousands of operations per kernel transition. With epoll, each send/recv is a separate syscall — at 1M req/s that's 2M+ context switches per second.
These io_uring flags tell the kernel only one thread submits per ring, and to defer work until the next submission. This eliminates internal locking and reduces interrupts.
Written in C with all buffers pre-allocated. No GC pauses, no allocation jitter. The benchmark runs at constant memory usage from start to finish.
Glass Cannon uses a fundamentally different approach to I/O than traditional load generators. Instead of one thread per connection or async callbacks, it talks directly to the Linux kernel's io_uring interface.
Most load generators (wrk, hey, ab) use epoll — the application asks the kernel "which sockets are ready?", then makes individual read() and write() system calls for each one. Every syscall means a context switch between your program and the kernel.
io_uring uses two shared memory ring buffers between your program and the kernel. You write requests into the submission queue, and the kernel writes results into the completion queue. No system calls per operation. The kernel processes batches of I/O while your code processes batches of results.
The main thread spawns N worker threads. Each worker owns an independent io_uring ring, a set of connections, and pre-built request buffers. There is zero communication between workers during the benchmark — no mutexes, no atomics, no shared state.
Pre-registered buffer pool managed by the kernel. Eliminates per-recv buffer allocation and reduces submission overhead. Buffer size is configurable via --recv-buf.
One submission arms continuous receive on a socket. The kernel keeps delivering data without repeated requests.
N copies of the HTTP request are pre-built and concatenated at startup. One send call pushes the entire pipeline.
Each connection has a generation counter packed into io_uring user-data. Stale completions from reconnected sockets are safely ignored.
Each worker thread sets up its own isolated environment before entering the event loop. Nothing is shared between workers — each one is a fully self-contained load generator.
A ring with 4096 submission queue entries is created with IORING_SETUP_SINGLE_ISSUER (only this thread will ever submit) and IORING_SETUP_DEFER_TASKRUN (kernel defers work until the next submission, reducing interrupts). These two flags together eliminate all internal locking in the kernel's io_uring path.
4096 receive buffers (4KB each by default, configurable via --recv-buf) are pre-registered with the kernel. When a recv completion fires, the kernel picks a buffer from this pool automatically — no userspace allocation or buffer address needed in the submission. After processing, the buffer is returned to the ring for reuse.
An array of gc_conn_t structs is allocated — one per connection assigned to this worker. Each connection gets a non-blocking TCP socket with TCP_NODELAY, and an io_uring_prep_connect() is submitted for each. All connects happen in parallel via the ring.
The core of Glass Cannon is a tight loop that processes io_uring completions in batches. Each iteration drains up to 2048 completions, processes them, then submits new operations — all in a single kernel transition.
Every io_uring submission carries a 64-bit user-data value that comes back with the completion. Glass Cannon packs three fields into this value to identify what each completion means:
The generation counter solves a critical race: when a connection is closed and a new socket is opened on the same index, completions from the old socket may still arrive. The worker compares cqe_gen != c->gen and silently discards stale completions. This avoids use-after-free bugs without any locking.
Each connection follows a state machine driven entirely by io_uring completions. No threads block, no callbacks are registered — just CQEs moving state forward.
Traditional recv requires submitting a new SQE for every chunk of data you want to read, and you must tell the kernel which buffer to write into. Glass Cannon uses two io_uring features to eliminate both costs:
A single io_uring_prep_recv_multishot() call arms continuous receive on a socket. The kernel fires a CQE for each chunk of data that arrives, without the application resubmitting. The IORING_CQE_F_MORE flag on each CQE indicates whether the multishot is still armed. Only when it's not (socket closed, error, or buffer exhaustion) does the application need to rearm.
Instead of specifying a buffer address in each recv SQE, the application registers a pool of buffers with the kernel upfront. The kernel picks a free buffer from the pool when data arrives and reports which one it chose via the IORING_CQE_F_BUFFER flag and buffer ID in the CQE. After processing, the application returns the buffer to the pool. No allocation, no copying.
Combined, these mean a single SQE handles all future data on a connection, with zero-copy buffer management. The application only touches buffers when it has data to parse.
Glass Cannon keeps the pipeline full at all times. When responses arrive, it immediately sends new requests to replace them:
All requests in a pipeline batch are pre-built and concatenated at startup into a single buffer. Sending N pipelined requests is one io_uring_prep_send() call — no per-request formatting or allocation. If the kernel can't send the full buffer in one shot (partial send), the worker resubmits the remainder automatically.
The key to Glass Cannon's throughput is batching at every level. Instead of processing one event at a time:
io_uring_peek_batch_cqe() drains up to 2048 completions in one call. All are processed before any new submissions, amortizing the cost of the kernel transition across thousands of operations.
New operations (sends, recv rearms, connects) are queued as SQEs during CQE processing, then flushed to the kernel in a single io_uring_submit() at the end of each batch. One syscall for potentially thousands of new operations.
N HTTP requests are concatenated and sent in a single send operation. The response parser handles multiple responses per recv buffer, matching them to pipelined requests in FIFO order.