Key Metrics for Benchmarking LLMs

When benchmarking a self-hosted LLM, it's important to look beyond simple throughput numbers. Different metrics capture different aspects of model performance, from how quickly a response starts to how efficiently tokens are generated over time. Below are the key metrics every customer should understand before running performance tests.

Time to First Token (TTFT)

Time to First Token measures how long it takes for the model to process a prompt and produce the very first token of output. It represents the user's initial waiting time: the delay between sending a request and seeing the first word appear.

Formula: first_non_empty_token_received_time - request_send_time

What it includes:

  • Request queuing time (wait time if other requests are being processed)
  • Prefill time (model processing the entire input prompt to create the KV cache)
  • Network latency
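
To make the formula concrete, here is a minimal sketch of measuring TTFT against a streaming, OpenAI-compatible completions endpoint. The URL, model name, and measure_ttft helper are illustrative assumptions, not Inference Perf internals:

```python
import json
import time

import requests

# Hypothetical endpoint and model name; adjust to your deployment.
# Assumes an OpenAI-compatible server (e.g. vLLM) with SSE streaming.
URL = "http://localhost:8000/v1/completions"
MODEL = "my-model"

def measure_ttft(prompt: str) -> float:
    """Return TTFT in seconds: first non-empty token time minus request send time."""
    send_time = time.perf_counter()
    with requests.post(
        URL,
        json={"model": MODEL, "prompt": prompt, "stream": True, "max_tokens": 64},
        stream=True,
        timeout=60,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # SSE frames look like: b"data: {...}" or b"data: [DONE]"
            if not line.startswith(b"data: "):
                continue
            payload = line[len(b"data: "):]
            if payload == b"[DONE]":
                break
            token = json.loads(payload)["choices"][0].get("text", "")
            if token:  # skip initial empty chunks, as Inference Perf does
                return time.perf_counter() - send_time
    raise RuntimeError("stream ended before a non-empty token arrived")
```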

Technical considerations:

  • Longer prompts result in higher TTFT; standard attention computation during prefill scales quadratically with input sequence length
  • Benchmarking tools like Inference Perf disregard initial empty responses to ensure meaningful measurements
  • TTFT represents your minimum latency baseline for the current deployment configuration; while users cannot see faster response times than TTFT within the same setup, this baseline can be reduced through optimizations such as model quantization, faster hardware, or improved parallelism strategies
  • Advanced optimization note: Some production deployments use disaggregated prefill/decode architectures, where the prefill and token-generation phases run on separate infrastructure to optimize resource utilization; this is beyond the scope of this guide but can significantly impact TTFT in specialized setups

Intertoken Latency (ITL)

Intertoken Latency (also called Time per Output Token, or TPOT) measures the average delay between consecutive tokens generated by the model. This reflects how smoothly and steadily the model streams output once generation starts.

Formula (Inference Perf): generation_time / (output_tokens - 1)

Key detail: generation_time covers only the decode phase, so TTFT is excluded; the denominator is output_tokens - 1 because the first token is attributed to TTFT, leaving output_tokens - 1 inter-token gaps. This makes ITL a metric of only the decoding/generation phase.

Technical considerations:

  • As output grows, the KV cache grows, impacting memory bandwidth and attention computation cost (linear with input + output length)
  • Consistent ITL indicates efficient memory management and bandwidth utilization
  • Increasing ITL during long generations suggests memory bandwidth constraints or KV cache pressure
  • Different tools calculate ITL differently: Inference Perf excludes TTFT, while some tools (like LLMPerf) fold it into the average; a sketch of both conventions follows this list
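
The convention matters when comparing numbers across tools. Here is a minimal sketch of both calculations, assuming you have recorded an arrival timestamp for each non-empty output token (the token_times list and function names are illustrative):

```python
def itl_excluding_ttft(token_times: list[float]) -> float:
    """Inference Perf convention: generation_time / (output_tokens - 1).

    token_times holds the arrival time of each non-empty output token, so
    token_times[0] marks the end of TTFT and the decode phase spans
    len(token_times) - 1 inter-token gaps.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure a gap")
    generation_time = token_times[-1] - token_times[0]  # prefill/TTFT excluded
    return generation_time / (len(token_times) - 1)


def itl_including_ttft(token_times: list[float], send_time: float) -> float:
    """Convention used by some tools (e.g. LLMPerf): TTFT is folded into the
    average by dividing total latency by the full token count."""
    return (token_times[-1] - send_time) / len(token_times)
```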

Tokens per Second (TPS)

Tokens per Second captures the overall throughput of the system: how many output tokens the model generates each second across all active requests. As more requests run in parallel, TPS typically increases until the hardware reaches its saturation point. Beyond that, performance may plateau or even drop as resources become overutilized. This metric helps teams understand system capacity and plan for scaling or batching strategies.

Formula (Inference Perf): total_output_tokens / (last_response_time - first_request_time)

Two perspectives to understand:

  • System TPS (total throughput): Aggregate capacity across all concurrent requests; increases with concurrency until saturation
  • Per-user TPS: Individual request throughput (output_tokens / e2e_latency); decreases as system load increases

Critical relationship: once concurrent load exceeds the system's immediate serving capacity (more in-flight requests than the available model replicas can absorb), system TPS ↑ while per-user TPS ↓
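
A short sketch of both perspectives, assuming each completed request is summarized by its send time, final-token time, and output token count (the RequestRecord structure is an illustrative assumption, not a tool's data model):

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    send_time: float           # when the request was issued
    last_response_time: float  # when its final token arrived
    output_tokens: int


def system_tps(records: list[RequestRecord]) -> float:
    """Aggregate throughput: total_output_tokens / (last_response_time - first_request_time)."""
    total_tokens = sum(r.output_tokens for r in records)
    window = max(r.last_response_time for r in records) - min(r.send_time for r in records)
    return total_tokens / window


def per_user_tps(record: RequestRecord) -> float:
    """Individual request throughput: output_tokens / end-to-end latency."""
    return record.output_tokens / (record.last_response_time - record.send_time)
```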

Technical considerations:

  • Inference Perf uses a sliding window technique, excluding warmup and cooldown requests for stable measurements (a simplified sketch follows this list)
  • Other tools may include full benchmark duration, adding overhead (up to 33% in single-concurrency scenarios)
  • The saturation point is where TTFT spikes, TPS plateaus, and errors begin to appear
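
As a rough illustration of the sliding window idea (not Inference Perf's actual implementation), one can drop the first and last few completed requests before computing throughput, reusing the RequestRecord and system_tps helpers sketched above:

```python
def sliding_window_tps(
    records: list[RequestRecord], warmup: int = 5, cooldown: int = 5
) -> float:
    """Compute TPS over a steady-state window by discarding the first
    `warmup` and last `cooldown` requests (ordered by send time)."""
    steady = sorted(records, key=lambda r: r.send_time)
    steady = steady[warmup:len(steady) - cooldown]
    if not steady:
        raise ValueError("not enough requests to form a steady-state window")
    return system_tps(steady)
```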

Why it matters: TPS is essential for capacity planning; understanding maximum sustainable throughput helps determine infrastructure sizing and production operating points (typically 50-70% of saturation).