Anthropic messages
anthropic_messages
LLMeter endpoints for the Anthropic Messages API.
Supports any client provided by the anthropic Python SDK:
- `"anthropic"` - Direct API at api.anthropic.com
- `"bedrock-mantle"` - Amazon Bedrock Mantle
- `"vertex"` - Google Vertex AI (requires `anthropic[vertex]`)
- `"foundry"` - Azure Foundry
Install the dependency:

```
pip install 'llmeter[anthropic]'
```
Extended thinking
Claude models can perform internal reasoning ("thinking") before producing a visible answer. This is controlled via the `thinking` parameter on the request payload (also available in the `create_payload` utility function).

Because these thinking tokens are not part of the "final" output, it is important to understand how they are treated for response timing and token counting.
Token accounting
The Anthropic API reports a single output_tokens count that includes both thinking and
visible text tokens. There is no separate reasoning_tokens field. As a result:
- `InvocationResponse.num_tokens_output` reflects the total billed output tokens (thinking and output).
- `InvocationResponse.num_tokens_output_reasoning` is always `None` for Anthropic endpoints because the API does not provide this breakdown.
This differs from OpenAI, which reports `reasoning_tokens` separately. When comparing across providers, keep in mind that LLMeter's `num_tokens_output` is semantically consistent (total billed output), but the reasoning breakdown is only available where the provider exposes it.
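As an illustration, the mapping described above can be sketched in plain Python (the `usage` field names follow the Anthropic API; the `account_tokens` helper and the token counts are hypothetical):

```python
# Sketch: how LLMeter's token fields map onto an Anthropic usage block.
# The API reports a single combined output_tokens figure, so there is
# no thinking/text breakdown to surface.

def account_tokens(usage: dict) -> dict:
    return {
        "num_tokens_input": usage.get("input_tokens"),
        # Total billed output: thinking + visible text, undifferentiated.
        "num_tokens_output": usage.get("output_tokens"),
        # Unlike OpenAI's reasoning_tokens, Anthropic provides no separate
        # reasoning count, so this is always None.
        "num_tokens_output_reasoning": None,
    }

usage = {"input_tokens": 42, "output_tokens": 910}  # hypothetical counts
print(account_tokens(usage)["num_tokens_output"])  # 910
```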
Time to first token (TTFT) and the display setting
The display field on the thinking configuration controls whether thinking content is streamed
back to the client:
- `"summarized"` (default on most models) - `thinking_delta` events stream before the visible text.
- `"omitted"` (default on Claude Opus 4.7 and Mythos) - no `thinking_delta` events are emitted; only a `signature_delta` signals that the thinking block completed.
The `AnthropicMessagesStream.ttft_visible_tokens_only` parameter controls how `InvocationResponse.time_to_first_token` is measured:
- `True` (default) - TTFT records the first visible `text_delta`. Thinking events (`thinking_delta`, `signature_delta`) are ignored. This measures the latency the end user experiences before seeing output.
- `False` - TTFT records the first token of any kind, including `thinking_delta` (summarized mode) or `signature_delta` (omitted mode). This measures when the model first started producing output.
Because `display: "omitted"` suppresses `thinking_delta` events entirely, the `signature_delta` is the earliest signal available. With `ttft_visible_tokens_only=False`, the measured TTFT will therefore differ between summarized and omitted modes for the same model and prompt: summarized mode captures the first thinking token, while omitted mode captures the signature that arrives after all thinking is complete.
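The effect of `display` mode and `ttft_visible_tokens_only` on measured TTFT can be sketched with a small simulation (event names match the SSE deltas described above; the `ttft` helper and all timings are invented for illustration):

```python
# Sketch: first-token timing over simulated event streams.
# Each event is (type, seconds_since_request_start); timings are invented.

VISIBLE = {"text_delta"}
ANY_OUTPUT = {"thinking_delta", "signature_delta", "text_delta"}

def ttft(events, visible_tokens_only=True):
    accepted = VISIBLE if visible_tokens_only else ANY_OUTPUT
    return next(t for kind, t in events if kind in accepted)

# display: "summarized" - thinking deltas stream before the text.
summarized = [("thinking_delta", 0.4), ("thinking_delta", 0.9),
              ("signature_delta", 2.0), ("text_delta", 2.2)]

# display: "omitted" - no thinking deltas; the signature is the earliest
# signal, and it only arrives after all thinking is complete.
omitted = [("signature_delta", 2.0), ("text_delta", 2.2)]

print(ttft(summarized))                             # 2.2 (first visible text)
print(ttft(summarized, visible_tokens_only=False))  # 0.4 (first thinking delta)
print(ttft(omitted, visible_tokens_only=False))     # 2.0 (signature only)
```

Note how the two `display` modes give the same TTFT when only visible tokens count, but diverge when any event counts.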
AnthropicMessages
AnthropicMessages(model_id, endpoint_name='anthropic-messages', provider='anthropic', api_key=None, aws_region=None, **kwargs)
Bases: AnthropicMessagesEndpoint[Message]
Endpoint for the Anthropic Messages API (non-streaming).
When extended thinking is enabled, the response may contain `thinking` content blocks alongside `text` blocks. Only `text` blocks contribute to `InvocationResponse.response_text`.
The reported `num_tokens_output` is the total billed count (thinking + text); `num_tokens_output_reasoning` is `None` because the Anthropic API does not provide a separate thinking token count.
Examples:
Direct Anthropic API:

```python
endpoint = AnthropicMessages(model_id="claude-opus-4-7")
```

Amazon Bedrock Mantle:

```python
endpoint = AnthropicMessages(
    model_id="anthropic.claude-opus-4-7",
    provider="bedrock-mantle",
    aws_region="us-east-1",
)
```
Source code in `llmeter/endpoints/anthropic_messages.py`, lines 120-145.
invoke
invoke(payload)
Invoke the Anthropic Messages API (non-streaming).
Source code in `llmeter/endpoints/anthropic_messages.py`, lines 318-322.
process_raw_response
process_raw_response(raw_response, start_t, response)
Parse a non-streaming Anthropic Messages API response.
Only text content blocks are extracted into `response_text`; `thinking` and `redacted_thinking` blocks are skipped.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `raw_response` | Message | The raw `Message` returned by the API. | *required* |
| `start_t` | float | Start time of the API call. | *required* |
| `response` | InvocationResponse | The LLMeter response object to be populated in-place. | *required* |
Source code in `llmeter/endpoints/anthropic_messages.py`, lines 329-361.
AnthropicMessagesEndpoint
AnthropicMessagesEndpoint(model_id, endpoint_name='anthropic-messages', provider='anthropic', api_key=None, aws_region=None, **kwargs)
Bases: Endpoint[TAnthropicResponseBase], Generic[TAnthropicResponseBase]
Base class for Anthropic Messages API endpoints.
Works with any client provided by the anthropic SDK. The provider
argument selects which client to instantiate.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_id` | str | Model identifier (e.g. `"claude-opus-4-7"`). | *required* |
| `endpoint_name` | str | Display name for this endpoint. Defaults to `'anthropic-messages'`. | `'anthropic-messages'` |
| `provider` | str | Backend to use: one of `"anthropic"`, `"bedrock-mantle"`, `"vertex"`, or `"foundry"`. | `'anthropic'` |
| `api_key` | str \| None | API key for the direct Anthropic API. | `None` |
| `aws_region` | str \| None | AWS region for Bedrock Mantle. | `None` |
| `**kwargs` | Any | Additional keyword arguments forwarded to the underlying client constructor. | `{}` |
Source code in `llmeter/endpoints/anthropic_messages.py`, lines 120-145.
create_payload
staticmethod
create_payload(user_message, max_tokens=256, thinking=None, **kwargs)
Create a payload for the Anthropic Messages API.
This is a convenience helper. You can also build the payload dict directly, following the Anthropic Messages API reference.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `user_message` | str | The user message text. | *required* |
| `max_tokens` | int | Maximum tokens to generate. Defaults to 256. | `256` |
| `thinking` | dict \| None | Extended thinking configuration. Common values include `{"type": "adaptive"}` and `{"type": "disabled"}`; a `display` key (`"summarized"` or `"omitted"`) controls whether thinking content is streamed. | `None` |
| `**kwargs` | Any | Additional payload parameters (e.g. `system`). | `{}` |
Returns:
| Name | Type | Description |
|---|---|---|
| dict | MessageCreateParams | Formatted payload for the Anthropic Messages API. |
Raises:
| Type | Description |
|---|---|
| ValueError | If … |
| TypeError | If … |
Examples:
Text only:

```python
create_payload("Hello, Claude!")
```

With system prompt:

```python
create_payload(
    "Explain quantum computing",
    system="You are a physics professor.",
    max_tokens=1024,
)
```

With adaptive thinking:

```python
create_payload(
    "Prove that there are infinitely many primes.",
    max_tokens=16000,
    thinking={"type": "adaptive"},
)
```

With thinking explicitly disabled:

```python
create_payload(
    "Hello!",
    thinking={"type": "disabled"},
)
```
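Since `create_payload` is only a convenience helper, the same payload can be assembled as a plain dict. A sketch of the expected shape for the adaptive-thinking example above (field names follow the Anthropic Messages API request format; the exact dict `create_payload` emits may differ):

```python
# Hand-built equivalent of the create_payload(...) adaptive-thinking example.
payload = {
    "messages": [
        {"role": "user",
         "content": "Prove that there are infinitely many primes."},
    ],
    "max_tokens": 16000,
    # Extended thinking configuration, as described above.
    "thinking": {"type": "adaptive"},
}
print(payload["thinking"])  # {'type': 'adaptive'}
```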
Source code in `llmeter/endpoints/anthropic_messages.py`, lines 174-287.
AnthropicMessagesStream
AnthropicMessagesStream(model_id, endpoint_name='anthropic-messages', provider='anthropic', api_key=None, aws_region=None, ttft_visible_tokens_only=True, **kwargs)
Bases: AnthropicMessagesEndpoint[Iterable[RawMessageStreamEvent]]
Endpoint for the Anthropic Messages API (streaming).
Uses client.messages.create(..., stream=True) to stream SSE events, enabling
time-to-first-token and time-to-last-token measurements.
Extended thinking and TTFT
When extended thinking is enabled, the stream contains thinking-related events before the
visible text. The ttft_visible_tokens_only parameter controls which event sets
time_to_first_token:
- `True` (default) - TTFT is set on the first `text_delta`. Thinking events are ignored. Use this to measure the latency an end user experiences before seeing output.
- `False` - TTFT is set on the first event of any kind, including `thinking_delta` (when `display` is `"summarized"`) or `signature_delta` (when `display` is `"omitted"`). Use this to measure when the model first started producing output.

The `display` setting on the `thinking` configuration affects which events are emitted:

- `"summarized"` - `thinking_delta` events stream before the text. With `ttft_visible_tokens_only=False`, TTFT captures the first thinking token.
- `"omitted"` - no `thinking_delta` events; only a `signature_delta` signals the end of the thinking block. With `ttft_visible_tokens_only=False`, TTFT captures the signature, which arrives later than a thinking delta would.
This means that for the same model and prompt, measured TTFT with
ttft_visible_tokens_only=False will differ between summarized and omitted modes. Summarized
mode captures the first thinking token; omitted mode captures the signature that arrives after
all thinking is complete.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_id` | str | Model identifier. | *required* |
| `endpoint_name` | str | Display name. Defaults to `'anthropic-messages'`. | `'anthropic-messages'` |
| `provider` | str | Backend to use. Defaults to `'anthropic'`. | `'anthropic'` |
| `api_key` | str \| None | API key for the direct Anthropic API. | `None` |
| `aws_region` | str \| None | AWS region for Bedrock Mantle. | `None` |
| `ttft_visible_tokens_only` | bool | When `True` (default), TTFT records only the first visible `text_delta`; when `False`, the first event of any kind. | `True` |
| `**kwargs` | Any | Additional arguments forwarded to the client constructor. | `{}` |
Examples:
Direct Anthropic API:

```python
endpoint = AnthropicMessagesStream(model_id="claude-opus-4-7")
```

Measure TTFT including thinking:

```python
endpoint = AnthropicMessagesStream(
    model_id="claude-sonnet-4-6",
    ttft_visible_tokens_only=False,
)
```

Amazon Bedrock Mantle:

```python
endpoint = AnthropicMessagesStream(
    model_id="anthropic.claude-opus-4-7",
    provider="bedrock-mantle",
    aws_region="us-east-1",
)
```
Source code in `llmeter/endpoints/anthropic_messages.py`, lines 434-452.
invoke
invoke(payload)
Invoke the Anthropic Messages API with streaming.
Source code in `llmeter/endpoints/anthropic_messages.py`, lines 454-458.
process_raw_response
process_raw_response(raw_response, start_t, response)
Parse a streaming Anthropic Messages API response.
Processes SSE events to extract text, token counts, and timing.
Only `text_delta` events contribute to `response_text`. `thinking_delta` and `signature_delta` events are used solely for TTFT measurement when `ttft_visible_tokens_only` is `False`.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `raw_response` | Iterable[RawMessageStreamEvent] | The streaming iterator of SSE events. | *required* |
| `start_t` | float | Start time of the API call. | *required* |
| `response` | InvocationResponse | The LLMeter response object to be populated in-place. | *required* |
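The parsing behaviour described above can be sketched as a self-contained loop over mocked events (the real method consumes `RawMessageStreamEvent` objects from the SDK; the `parse_stream` helper, the event tuples, and all timings here are invented for illustration):

```python
# Sketch: accumulate text and timing from a mocked event stream.
# Each event is (kind, text, absolute_time_seconds); values are invented.

def parse_stream(events, start_t, ttft_visible_tokens_only=True):
    text_parts, first_token_t, last_t = [], None, start_t
    for kind, text, t in events:
        is_visible = kind == "text_delta"
        if first_token_t is None and (is_visible or not ttft_visible_tokens_only):
            first_token_t = t
        # Only visible text contributes to response_text; thinking_delta
        # and signature_delta are timing signals only.
        if is_visible:
            text_parts.append(text)
        last_t = t
    return {
        "response_text": "".join(text_parts),
        "time_to_first_token": first_token_t - start_t,
        "time_to_last_token": last_t - start_t,
    }

events = [("thinking_delta", "...", 10.4), ("signature_delta", "", 11.0),
          ("text_delta", "Hello", 11.2), ("text_delta", "!", 11.3)]
result = parse_stream(events, start_t=10.0)
print(result["response_text"], round(result["time_to_first_token"], 3))  # Hello! 1.2
```

With `ttft_visible_tokens_only=False`, the same stream would report a TTFT of roughly 0.4 s (the first thinking delta) instead of 1.2 s.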
Source code in `llmeter/endpoints/anthropic_messages.py`, lines 466-527.