ANTHROPIC CLAUDE API REFERENCE
Complete Model Specifications, Pricing, Tools, and Practical Details
Written February 26, 2026

This document is a comprehensive reference for the Anthropic Claude API as of late February 2026, covering every currently available model, its pricing and capabilities, how the API works at a mechanical level, every tool and feature available to developers, and enough practical detail (endpoints, headers, curl examples) that a developer could begin making API calls without consulting any other source. The information here was compiled from Anthropic's official documentation and pricing pages. Models are presented with current recommended models first and legacy models at the end.

THE API ENDPOINT AND AUTHENTICATION

All requests to the Anthropic API go to a single endpoint:

    POST https://api.anthropic.com/v1/messages

Authentication is via an API key passed in the x-api-key header. You must also include an anthropic-version header specifying the API version string. The current version string is "2023-06-01" and has been stable for a long time.

A minimal curl request looks like this:

    curl https://api.anthropic.com/v1/messages \
      -H "x-api-key: $ANTHROPIC_API_KEY" \
      -H "anthropic-version: 2023-06-01" \
      -H "content-type: application/json" \
      -d '{
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "messages": [
          {"role": "user", "content": "Hello, Claude."}
        ]
      }'

The -H flag in curl sets an HTTP header, a key-value pair sent alongside your request that provides metadata the server needs to process it. The x-api-key header carries your authentication credentials. The anthropic-version header tells the API which version of the protocol you expect. The content-type header tells the server you are sending JSON.

Beta features require an additional anthropic-beta header whose value is the beta feature identifier (more on this below). You can include multiple beta features as a comma-separated list in a single anthropic-beta header.
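The same request can be assembled in any programming language. The Python sketch below builds the headers and JSON body from the curl example; it stops short of actually performing the HTTP call, which you would make with whatever HTTP library you prefer.

```python
# Build the headers and body for the minimal request shown above.
# This sketch does not perform the network call itself.
import json
import os

headers = {
    "x-api-key": os.environ.get("ANTHROPIC_API_KEY", "sk-placeholder"),
    "anthropic-version": "2023-06-01",
    "content-type": "application/json",
}

payload = {
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Hello, Claude."}],
}

body = json.dumps(payload)
# A real client would POST `body` with `headers` to
# https://api.anthropic.com/v1/messages using any HTTP library.
```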
The response comes back as a JSON object containing, among other things, a "content" array of content blocks (which may include text blocks, thinking blocks, tool_use blocks, and others), a "usage" object reporting token consumption, a "stop_reason" indicating why the model stopped generating, and a "model" field confirming which model was used.

HOW THE API WORKS: STATELESSNESS AND CONVERSATION MANAGEMENT

The Anthropic API is completely stateless. There is no conversation ID, no session, no server-side memory of previous calls. Every API request must contain the entire conversation history you want the model to see. From the model's perspective, every request is the first time it is seeing the conversation.

In practice this means you maintain a messages array on your side. You start with a single user message. The API returns an assistant message. You append that assistant message to your array. When the user says something new, you append their new message and send the entire array back. By turn ten of a conversation, you are sending every previous user message, every previous assistant message, every tool call and tool result, and every thinking block as input on every request. The model reads all of it and produces its next response.

This is why token costs grow over the course of a conversation. On turn one you might send 500 input tokens. By turn ten you might be sending 30,000 input tokens per request even if the new user message is only 50 tokens, because every previous exchange is re-sent and re-billed as input tokens.
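The client-side bookkeeping just described can be sketched in a few lines of Python. The send_to_api function here is a hypothetical stand-in for the HTTP call to the Messages endpoint; the point is the pattern of appending each assistant turn and re-sending the full history.

```python
# Sketch of client-side conversation management against a stateless API.
# send_to_api is a hypothetical stand-in for POST /v1/messages; a real
# client would send the payload over HTTPS with the headers shown above.

def send_to_api(payload):
    # Placeholder: return a canned assistant turn instead of calling the API.
    return {"role": "assistant", "content": "The capital of France is Paris."}

def make_payload(messages, model="claude-sonnet-4-6", max_tokens=1024):
    # Every request carries the entire history -- there is no session ID.
    return {"model": model, "max_tokens": max_tokens, "messages": messages}

messages = [{"role": "user", "content": "What is the capital of France?"}]
reply = send_to_api(make_payload(messages))
messages.append(reply)  # append the assistant turn before the next user turn
messages.append({"role": "user", "content": "What is its population?"})
payload = make_payload(messages)
# By this turn the payload re-sends all three messages as input tokens.
```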
A basic multi-turn conversation looks like this in the request body:

    {
      "model": "claude-sonnet-4-6",
      "max_tokens": 1024,
      "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
        {"role": "user", "content": "What is its population?"}
      ]
    }

The assistant message in the middle is the response you received from the previous API call, which you are now sending back so the model has context for the follow-up question.

CURRENT RECOMMENDED MODELS

The three models Anthropic currently recommends for new work are Claude Opus 4.6, Claude Sonnet 4.6, and Claude Haiku 4.5. All three support text and image input, text output, multilingual capabilities, vision, PDF support, extended thinking, tool use, web search, web fetch, code execution, computer use, prompt caching, batch processing, structured outputs, and agent skills.

Claude Opus 4.6

API model ID: claude-opus-4-6 (there is no separate alias or dated snapshot ID; "claude-opus-4-6" is the sole identifier)
Released: February 4, 2026

Anthropic's most intelligent and capable model, designed for complex reasoning, long-horizon agentic tasks, coding, multi-step workflows, and tasks requiring sustained performance over large contexts.

Pricing: $5 per million input tokens, $25 per million output tokens.
Batch pricing: $2.50 per million input tokens, $12.50 per million output tokens.
Context window: 200,000 tokens standard, 1,000,000 tokens in beta.
Maximum output: 128,000 tokens.
Reliable knowledge cutoff: May 2025. Training data cutoff: August 2025.

Supports adaptive thinking (recommended), extended thinking, interleaved thinking natively without a beta header, the effort parameter at all four levels (low, medium, high, max), structured outputs, vision, PDF support, all tools, context compaction, memory, and agent skills. Also supports fast mode in beta for up to 2.5x faster output generation at premium pricing of $30/$150 per million tokens.
AWS Bedrock ID: anthropic.claude-opus-4-6-v1
Google Vertex AI ID: claude-opus-4-6

Claude Sonnet 4.6

API model ID: claude-sonnet-4-6
Released: February 17, 2026

Near-flagship intelligence at the mid-tier Sonnet price point. Matches Opus 4.6 on many benchmarks while costing one-fifth as much. The best balance of speed, intelligence, and cost for most use cases, with exceptional performance in coding, agentic tasks, and computer use.

Pricing: $3 per million input tokens, $15 per million output tokens.
Batch pricing: $1.50 per million input tokens, $7.50 per million output tokens.
Context window: 200,000 tokens standard, 1,000,000 tokens in beta.
Maximum output: 64,000 tokens.
Reliable knowledge cutoff: August 2025. Training data cutoff: January 2026 (the most recent training data of any model in the lineup).

Supports adaptive thinking, extended thinking, interleaved thinking via the "interleaved-thinking-2025-05-14" beta header (or automatically via adaptive thinking), the effort parameter (new to the Sonnet family with this release, recommended at medium for most use cases), structured outputs, vision, PDF support, all tools including web search with dynamic filtering, context compaction, and agent skills.

AWS Bedrock ID: anthropic.claude-sonnet-4-6
Google Vertex AI ID: claude-sonnet-4-6

Claude Haiku 4.5

API model ID: claude-haiku-4-5-20251001
Alias: claude-haiku-4-5
Released: October 15, 2025

The fastest model in the lineup with near-frontier intelligence. Designed for high-volume, latency-sensitive workloads where cost matters.

Pricing: $1 per million input tokens, $5 per million output tokens.
Batch pricing: $0.50 per million input tokens, $2.50 per million output tokens.
Context window: 200,000 tokens (no 1M beta option).
Maximum output: 64,000 tokens.
Reliable knowledge cutoff: February 2025. Training data cutoff: July 2025.

Supports extended thinking but not adaptive thinking.
First Haiku model with context awareness (the ability to track remaining context window during a conversation). Supports vision, PDF support, structured outputs, tool use, coding tasks, and agent skills.

AWS Bedrock ID: anthropic.claude-haiku-4-5-20251001-v1:0
Google Vertex AI ID: claude-haiku-4-5@20251001

EXTENDED THINKING

Extended thinking is the mechanism by which Claude reasons through a problem before producing its final answer. When enabled, the model generates internal "thinking" content blocks that appear in the API response before the actual text output. These blocks contain Claude's step-by-step reasoning--working through logic, considering alternatives, catching its own mistakes--and the quality improvement on complex tasks is substantial. Thinking tokens are billed as output tokens at the standard rate for whatever model you are using. There is no separate "thinking price."

On older Claude 4 models (Opus 4, 4.1, and 4.5; Sonnet 4 and 4.5; Haiku 4.5), you enable thinking by setting the thinking parameter to type "enabled" and specifying a budget_tokens value--a minimum of 1,024 tokens--that caps how many tokens the model can spend reasoning:

    {
      "model": "claude-sonnet-4-5-20250929",
      "max_tokens": 16000,
      "thinking": {
        "type": "enabled",
        "budget_tokens": 10000
      },
      "messages": [...]
    }

On Opus 4.6 and Sonnet 4.6, the recommended mode is adaptive thinking:

    {
      "model": "claude-opus-4-6",
      "max_tokens": 16000,
      "thinking": {
        "type": "adaptive"
      },
      "messages": [...]
    }

With adaptive thinking, the model decides for itself whether and how much to think based on the complexity of the query. At the default effort level of "high," Claude will almost always think. At lower effort levels it may skip thinking entirely for simple questions. The older budget_tokens approach still works on Opus 4.6 and Sonnet 4.6 but is deprecated and will be removed in a future release.

INTERLEAVED THINKING

Interleaved thinking is a refinement of extended thinking that matters for multi-step tool use.
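A client that targets both model generations might select the thinking configuration from the model ID. The helper below is a minimal sketch of that decision, following the guidance above; the name-based lookup is an illustration, not an official capability check.

```python
# Sketch: choose a thinking configuration by model generation.
# The set membership test below is a simplistic illustration, not an
# official capability lookup.

ADAPTIVE_MODELS = {"claude-opus-4-6", "claude-sonnet-4-6"}

def thinking_config(model, budget_tokens=10000):
    if model in ADAPTIVE_MODELS:
        # 4.6 models: let the model decide whether and how much to think.
        return {"type": "adaptive"}
    # Older Claude 4 models: explicit budget, minimum 1,024 tokens.
    return {"type": "enabled", "budget_tokens": max(budget_tokens, 1024)}
```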
Without it, Claude does all its thinking in one block before responding. With interleaved thinking, Claude can think between tool calls--reason about a search result, decide what to do next, call another tool, reason again. This produces much better results on agentic tasks where the model needs to react to intermediate information.

On Opus 4.6, interleaved thinking is automatic when you use adaptive thinking. No beta header is needed. The "interleaved-thinking-2025-05-14" beta header is deprecated on Opus 4.6 and safely ignored if included. On Sonnet 4.6, you can get interleaved thinking either through adaptive thinking or by adding the beta header with manual extended thinking. On all older Claude 4 models, the beta header is required:

    curl https://api.anthropic.com/v1/messages \
      -H "x-api-key: $ANTHROPIC_API_KEY" \
      -H "anthropic-version: 2023-06-01" \
      -H "anthropic-beta: interleaved-thinking-2025-05-14" \
      -H "content-type: application/json" \
      -d '{
        "model": "claude-sonnet-4-5-20250929",
        "max_tokens": 16000,
        "thinking": {"type": "enabled", "budget_tokens": 10000},
        "messages": [...]
      }'

Starting with Opus 4.5 and continuing through Opus 4.6, thinking blocks from previous assistant turns are preserved in model context by default. Earlier models discard thinking blocks from prior turns. Preserving them enables cache hits in multi-step workflows and maintains reasoning continuity, but it also means long conversations consume more context, because those thinking blocks accumulate as input tokens on every subsequent turn.

THE EFFORT PARAMETER

The effort parameter gives you a coarse-grained knob over how much work Claude puts into a response. The four levels are "low," "medium," "high," and "max." High is the default. Lower effort means fewer tokens, faster responses, and lower cost but potentially less thorough answers. Higher effort means more reasoning, more careful output, and more tokens consumed.
    {
      "model": "claude-opus-4-6",
      "max_tokens": 16000,
      "thinking": {"type": "adaptive"},
      "effort": "medium",
      "messages": [...]
    }

The effort parameter interacts with adaptive thinking: at lower effort levels, the model is more likely to skip thinking entirely, while at higher levels it will think more deeply. On Opus 4.6, the "max" level provides the absolute highest capability. Anthropic recommends "medium" for most Sonnet 4.6 use cases to balance speed, cost, and performance.

The effort parameter is now generally available with no beta header required. It was originally introduced with Opus 4.5 and was initially exclusive to Opus models. With the 4.6 release, Sonnet 4.6 also supports it.

TOOL USE OVERVIEW

Tool use (sometimes called function calling) is the mechanism by which Claude can invoke external capabilities during a conversation. You define tools in your API request by providing a name, a description, and a JSON schema for the tool's input parameters in the "tools" array. When Claude determines it needs to use a tool, it generates a "tool_use" content block in its response specifying which tool to call and what arguments to pass. What happens next depends on whether the tool is client-side or server-side.

With client-side tools (bash, text editor, computer use, and any custom tools you define), the API call returns with a stop_reason of "tool_use." Your application receives the tool_use block, executes the requested action in whatever environment you control, and sends a new API request containing the original messages plus the assistant's tool_use response plus a "tool_result" content block with the output. Claude then reads the tool result and continues generating its response. This back-and-forth may happen multiple times in what feels like a single "turn" from the user's perspective--Claude might call several tools in sequence, each requiring a round trip through your application.
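The client-side round trip just described is naturally written as a loop. The sketch below shows the shape of that loop; call_api stands in for the HTTP call to the Messages endpoint, and run_tool (with its get_weather tool) is a hypothetical executor you would replace with your own.

```python
# Sketch of the client-side tool-use loop. call_api is a stand-in for
# POST /v1/messages; run_tool is your own tool executor. The get_weather
# tool and its canned result are hypothetical.

def run_tool(name, tool_input):
    if name == "get_weather":
        return "15 degrees C and cloudy in " + tool_input["location"]
    raise ValueError("unknown tool: " + name)

def tool_loop(messages, call_api):
    while True:
        response = call_api(messages)
        messages.append({"role": "assistant", "content": response["content"]})
        if response["stop_reason"] != "tool_use":
            return response  # final answer; the loop is done
        # Execute every tool_use block and send the results back as a
        # user message containing tool_result blocks.
        results = [
            {"type": "tool_result",
             "tool_use_id": block["id"],
             "content": run_tool(block["name"], block["input"])}
            for block in response["content"]
            if block["type"] == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
```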
With server-side tools (web search, web fetch, code execution), the entire tool loop happens within a single API request. Claude decides to search, Anthropic's infrastructure executes the search, the results flow back to Claude, and Claude reads them and composes its answer--all before the response is returned to you. You send one request and receive a complete response that includes both the tool use content blocks (so you can see what Claude searched for and what came back) and Claude's final text answer. You do not need to execute anything yourself or make intermediate API calls.

Defining tools adds a small overhead to every request: a system prompt for tool use consumes 346 tokens on most current models (313 tokens if you use the "any" or specific tool-name tool_choice modes), and each tool definition adds tokens proportional to the length of its name, description, and schema.

CLIENT-SIDE TOOLS

Bash Tool

The bash tool lets Claude execute shell commands. When Claude wants to run a command, it generates a tool_use block containing the command string. Your application receives that block, runs the command in whatever environment you have set up (a Docker container, a VM, a sandboxed shell), captures stdout and stderr, and sends the output back as a tool_result. Claude never has direct access to a terminal--it produces structured descriptions of what it wants to run, and your application is the intermediary. The tool definition adds 245 input tokens per request.

The real cost variability comes from command outputs. If Claude runs a command that dumps a large file or produces verbose logs, all of that text flows back as input tokens on the next turn. In agentic coding workflows where Claude might execute dozens of commands in sequence, the accumulated stdout can consume tens of thousands of tokens across the session. Thoughtful truncation of command output before sending it back as tool results is one of the most effective cost-control measures.
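A minimal truncation helper might look like the following. It is a sketch of the cost-control measure just described; the character limit and head/tail split are illustrative choices, not recommendations from Anthropic.

```python
# Sketch: truncate long command output before returning it as a
# tool_result. The 4,000-character default is an illustrative choice.

def truncate_output(stdout, max_chars=4000):
    # Keep the head and tail of long output; errors and summaries tend
    # to appear at the edges, while the middle is often bulk logging.
    if len(stdout) <= max_chars:
        return stdout
    half = max_chars // 2
    omitted = len(stdout) - max_chars
    return (stdout[:half]
            + "\n... [" + str(omitted) + " characters truncated] ...\n"
            + stdout[-half:])
```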
Text Editor Tool

The text editor tool gives Claude the ability to view, create, and edit files. It generates structured commands--"view this file from line 50 to line 80," "replace this exact string with this other string," "create a file with this content"--and your application executes them against an actual filesystem. The tool definition adds 700 input tokens per request. The current tool version is "text_editor_20250728" for Claude 4.x models.

String replacement operations require the target string to appear exactly once in the file, ensuring edits are precise and unambiguous. The view operation supports optional line ranges so Claude can look at a specific section of a file rather than ingesting the whole thing, which matters for large files where the full content would consume many thousands of tokens. In practice the text editor and bash tool are almost always used together: Claude reads a file with the text editor, reasons about what needs to change, makes an edit, then uses bash to run tests and verify the change worked.

Computer Use Tool

Computer use lets Claude interact with a computer desktop--clicking, typing, scrolling, taking screenshots--to accomplish tasks through a graphical interface. This is a client-side tool despite involving visual interaction. When Claude decides to click a button or type into a text field, it generates a tool_use block describing the action: coordinates to click, text to type, keys to press, or a request for a screenshot. Your application executes that action on an actual computer (typically a virtual machine or browser instance you control), takes a screenshot of the result, and sends that screenshot back as a tool_result containing an image. Claude looks at the new screenshot, reasons about what it sees, and decides what to do next. The tool definition adds 735 input tokens on Claude 4.x models, and the computer use beta adds 466-499 tokens to the system prompt.
The primary cost beyond token overhead is screenshots: each screenshot is processed through Claude's vision capabilities, and a typical 1280x800 screenshot consumes several thousand tokens. In complex workflows where Claude takes dozens or hundreds of actions, those screenshot tokens dominate the cost of the interaction. There is no separate per-action charge.

Custom Client-Side Tools

You can define arbitrary custom tools by including them in the "tools" array of your request. You provide a name, description, and a JSON schema for the input parameters. Claude will generate tool_use blocks specifying your custom tool when it determines the tool would be helpful. You execute the tool however you see fit and return the result. For example:

    {
      "model": "claude-sonnet-4-6",
      "max_tokens": 1024,
      "tools": [
        {
          "name": "get_weather",
          "description": "Get the current weather in a given location.",
          "input_schema": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "City and state, e.g. San Francisco, CA"
              }
            },
            "required": ["location"]
          }
        }
      ],
      "messages": [
        {"role": "user", "content": "What is the weather in London?"}
      ]
    }

Client-side tools cost nothing beyond the tokens consumed by the tool definitions, the tool_use blocks, and the tool_result blocks.

SERVER-SIDE TOOLS

Web Search

Web search lets Claude search the internet during a conversation. You enable it by adding a tool of the appropriate type to your request. The current versions are "web_search_20250305" for standard search and "web_search_20260209" for dynamic filtering on Opus 4.6 and Sonnet 4.6.
    {
      "model": "claude-opus-4-6",
      "max_tokens": 4096,
      "tools": [
        {
          "type": "web_search_20250305",
          "name": "web_search"
        }
      ],
      "messages": [
        {"role": "user", "content": "What happened in the news today?"}
      ]
    }

When Claude decides to search, Anthropic's infrastructure executes the search, retrieves and ranks the results, and returns them as structured content blocks within the assistant message in the API response. Claude reads those results and composes its final text answer, all within a single API round trip. The response you receive contains the tool_use blocks (showing what Claude searched for), the search result content blocks (the actual results Claude read), and Claude's synthesized text answer with citations. When you append this full assistant response to your messages array for the next turn, the search results persist in context as input tokens, meaning Claude can reference them on follow-up questions without searching again.

The cost is $10 per 1,000 searches plus standard token pricing for all search result content. Each search counts as one use regardless of how many results come back. Failed searches are not billed. The usage object in the response includes a server_tool_use field reporting the number of web_search_requests made. For complex research questions, Claude may perform several searches in a single turn, and the accumulated search result text can reach tens of thousands of tokens.

The dynamic filtering feature on the 4.6 models (via "web_search_20260209") lets Claude write and execute code to filter search results before they enter the context window, keeping only relevant information and reducing both immediate token cost and ongoing context bloat.

Web Fetch

Web fetch lets Claude retrieve the full content of a specific URL.
You enable it similarly:

    {
      "model": "claude-sonnet-4-6",
      "max_tokens": 4096,
      "tools": [
        {
          "type": "web_fetch_20250305",
          "name": "web_fetch"
        }
      ],
      "messages": [
        {"role": "user", "content": "Summarize the article at https://example.com/article"}
      ]
    }

The fetched content enters the conversation as input tokens. There is no additional charge beyond standard token pricing. However, the token consumption can be significant: a typical web page might consume around 2,500 tokens, a large documentation page around 25,000, and a research paper PDF around 125,000. Use the max_content_tokens parameter to cap how much content gets pulled in. Like web search, the newer "web_fetch_20260209" version supports dynamic filtering on the 4.6 models. The fetched content persists in the conversation context on subsequent turns, so it inflates input token costs for the rest of the conversation.

Code Execution

Code execution is a server-side tool that gives Claude access to a sandboxed Python environment where it can write and run code. You enable it by including the tool and the required beta header:

    curl https://api.anthropic.com/v1/messages \
      -H "x-api-key: $ANTHROPIC_API_KEY" \
      -H "anthropic-version: 2023-06-01" \
      -H "anthropic-beta: code-execution-2025-05-22" \
      -H "content-type: application/json" \
      -d '{
        "model": "claude-sonnet-4-6",
        "max_tokens": 4096,
        "tools": [
          {"type": "code_execution_20250522", "name": "code_execution"}
        ],
        "messages": [
          {"role": "user", "content": "Calculate the first 100 prime numbers."}
        ]
      }'

Unlike the bash tool, this is genuinely server-side: Anthropic provisions an actual sandboxed container that Claude writes to and executes code in. You do not need to set up or manage any execution environment. The entire cycle--Claude writing code, executing it, reading the output--happens within a single API request. The tradeoff is that the environment is constrained to Python in a sandbox, not an arbitrary shell.
Each organization gets 1,550 free hours of execution time per month. Additional usage beyond that is billed at $0.05 per hour per container. Execution time has a minimum granularity of 5 minutes, so even a one-second execution rounds up. If files are included in the request, execution time is billed even if Claude never runs any code, because the files are preloaded onto the container. The container persists across turns within a single conversation, so files created in one code execution call are available in subsequent calls.

MCP CONNECTOR

The MCP (Model Context Protocol) connector lets Claude interact with external services through a standardized protocol. MCP servers are remote endpoints that expose tools--Asana for task management, Gmail for email, Google Calendar, Salesforce, and many others. You include them in your API request by specifying the MCP server URL:

    {
      "model": "claude-sonnet-4-6",
      "max_tokens": 4096,
      "messages": [
        {"role": "user", "content": "Create a task in Asana for reviewing the Q3 report."}
      ],
      "mcp_servers": [
        {
          "type": "url",
          "url": "https://mcp.asana.com/sse",
          "name": "asana-mcp"
        }
      ]
    }

Claude discovers the available tools from the MCP server at request time, uses them as needed, and the results flow back through the same tool_use/tool_result pattern. MCP tools are treated as client-side from a pricing perspective: there is no additional Anthropic charge beyond the tokens consumed by the tool definitions, call parameters, and returned results. However, the MCP server itself is a third-party service and may have its own costs or rate limits. The protocol is open-source and standardized, so anyone can build and host an MCP server.

AGENT SKILLS

Agent skills are organized packages of instructions, scripts, and resources that Claude loads dynamically to perform specialized tasks--creating PowerPoint files, generating Excel spreadsheets, working with Word documents, producing PDFs, and more.
They require the code execution tool to be enabled, because they typically involve Claude writing and running Python code that uses libraries like python-pptx, openpyxl, or python-docx to produce output files. Anthropic provides pre-built skills for common document types, and you can upload custom skills via the Skills API (/v1/skills endpoints) to package domain-specific expertise or organizational workflows. Skills are accessed via the "skills-2025-10-02" beta header. From a pricing standpoint, skills add no separate charge--they consume tokens (for the skill instructions that enter the context) and code execution time (for the scripts that produce the files), both billed at standard rates.

PROMPT CACHING

Prompt caching lets you avoid re-sending and re-processing the same tokens on every request. If you have a long system prompt, a large document, or any stable block of text that stays the same across multiple API calls, you can cache it so the model only processes it once and reads it from cache on subsequent requests.

There are two cache durations. The default 5-minute cache charges 1.25x the base input token price when you write to the cache, then only 0.1x (a 90% discount) when you read from it on subsequent requests. The 1-hour cache charges 2x on writes but the same 0.1x on reads, and exists specifically for extended thinking workflows where a single reasoning session can easily exceed five minutes and you would lose your cache before the next turn.

To use caching, you mark specific content blocks in your request with a cache_control parameter. The cache is keyed on the exact content, so any change to the cached block invalidates it and requires a new write. Caching is most valuable when you have a large system prompt or reference document that stays constant while user messages change--you pay the write cost once and then get 90% off on every subsequent request that references the same content.
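As a sketch of the cache_control marking described above, the request body below caches a long system prompt as a content block. The exact block shape (a system array of text blocks with a cache_control marker of type "ephemeral") is the commonly documented pattern; treat the details as illustrative, and the policy text is of course a placeholder.

```python
# Sketch: a request body with a cacheable system prompt. The repeated
# placeholder text stands in for a genuinely long, stable prompt.

LONG_SYSTEM_PROMPT = "You are a support agent. " + "Policy text... " * 200

payload = {
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Mark the block as cacheable: the first request pays the
            # 1.25x write price; identical content on later requests is
            # read from cache at 0.1x the base input price.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Where is my order?"}],
}
```

Because the cache is keyed on exact content, the system block must be byte-identical across requests for a read to hit.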
The pricing multipliers for all models are:
- 5-minute cache write: 1.25x base input price
- 1-hour cache write: 2x base input price
- Cache read (both durations): 0.1x base input price

Prompt caching stacks with batch processing discounts and with long context pricing multipliers.

BATCH PROCESSING

The Message Batches API lets you submit large volumes of requests asynchronously at a 50% discount on all token prices. You assemble a batch of individual message requests, submit them as a group, and Anthropic processes them with no guaranteed turnaround time (though they typically complete within minutes to hours depending on volume). This is designed for workloads that do not need real-time responses: bulk classification, dataset processing, generating summaries of a document corpus, running evaluations, or any pipeline where you can tolerate latency.

The 50% discount applies to both input and output tokens and is available for every model in the API. Batch discounts stack with prompt caching, so if you run a batch of requests that all share a cached system prompt, you get both the caching read discount and the batch discount.

Batch pricing by model:
- Opus 4.6, Opus 4.5: $2.50 input / $12.50 output per million tokens
- Sonnet 4.6, Sonnet 4.5, Sonnet 4: $1.50 / $7.50
- Haiku 4.5: $0.50 / $2.50
- Haiku 3.5: $0.40 / $2.00
- Haiku 3: $0.125 / $0.625
- Opus 4.1, Opus 4: $7.50 / $37.50

LONG CONTEXT WINDOW

The standard context window for all current models is 200,000 tokens. A beta 1,000,000-token context window is available for Opus 4.6, Sonnet 4.6, Sonnet 4.5, and Sonnet 4. To enable it, include the beta header:

    curl https://api.anthropic.com/v1/messages \
      -H "x-api-key: $ANTHROPIC_API_KEY" \
      -H "anthropic-version: 2023-06-01" \
      -H "anthropic-beta: context-1m-2025-08-07" \
      -H "content-type: application/json" \
      -d '{
        "model": "claude-opus-4-6",
        "max_tokens": 1024,
        "messages": [...]
      }'

Access is restricted to organizations in usage tier 4 or those with custom rate limits.
The pricing is tiered: requests with 200,000 or fewer input tokens are charged at standard rates, but requests exceeding 200K input tokens are charged at premium rates--2x the base input price and 1.5x the base output price. The 200K threshold is based on the total of input_tokens, cache_creation_input_tokens, and cache_read_input_tokens. If you exceed it, the premium rate applies to the entire request, not just the tokens above 200K. Even with the beta header enabled, requests with fewer than 200K input tokens are charged at standard rates.

For Sonnet models using the 1M context window, the long context rates are:
- 200K tokens or fewer: $3 input / $15 output per million tokens (standard)
- Over 200K tokens: $6 input / $22.50 output per million tokens (premium)

Long context pricing stacks with batch discounts and prompt caching multipliers.

CONTEXT COMPACTION

Context compaction, introduced with the 4.6 models, provides automatic server-side context summarization that enables effectively infinite conversations. When context approaches the window limit, the API automatically summarizes earlier parts of the conversation, compressing the context so the conversation can continue. This is what powers the "Infinite Chats" feature on claude.ai.

Despite being server-side processing, compaction does not introduce true statefulness. You still send the full conversation history on each request. Anthropic's infrastructure compresses it before the model processes it, so the model sees a shorter version. This is particularly valuable for long-running agent sessions where context accumulates over hours of tool calls and reasoning.

STRUCTURED OUTPUTS

Structured outputs constrain Claude's response to conform to a JSON schema you specify.
You provide a schema in the output_config.format parameter:

    {
      "model": "claude-sonnet-4-6",
      "max_tokens": 1024,
      "output_config": {
        "format": {
          "type": "json_schema",
          "schema": {
            "type": "object",
            "properties": {
              "name": {"type": "string"},
              "age": {"type": "integer"}
            },
            "required": ["name", "age"],
            "additionalProperties": false
          }
        }
      },
      "messages": [...]
    }

Claude guarantees its output will validate against the schema. This is generally available on Sonnet 4.5, Sonnet 4.6, Opus 4.5, Opus 4.6, and Haiku 4.5, and in public beta on Amazon Bedrock and Microsoft Foundry. There is no additional cost beyond standard token pricing. The older output_format parameter still works but is deprecated in favor of output_config.format.

MEMORY AND CONTEXT EDITING

The memory tool provides genuine persistent state across separate API requests and conversations. When Claude writes something to the memory tool, that data is stored server-side and persists across completely separate conversations. This is the one real exception to the API's otherwise stateless design. It is accessed via the "context-management-2025-06-27" beta header.

Context editing, a related beta feature accessible via the same header, lets you define rules for automatically managing conversation context as it grows--for example, clearing old tool call results while keeping the most recent ones, so the model stays within its context window without losing critical information. Both features are supported on Opus 4.5, Opus 4.6, and other recent models.

    curl https://api.anthropic.com/v1/messages \
      -H "x-api-key: $ANTHROPIC_API_KEY" \
      -H "anthropic-version: 2023-06-01" \
      -H "anthropic-beta: context-management-2025-06-27" \
      -H "content-type: application/json" \
      -d '{
        "model": "claude-opus-4-6",
        "max_tokens": 4096,
        "tools": [
          {"type": "memory_20250818", "name": "memory"}
        ],
        "messages": [...]
      }'

FAST MODE

Fast mode is a research preview on Opus 4.6 that delivers up to 2.5x faster output token generation at premium pricing.
It runs the same model with the same intelligence--the only difference is inference speed.

curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "anthropic-beta: fast-mode-2026-02-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-opus-4-6",
    "max_tokens": 4096,
    "speed": "fast",
    "messages": [...]
  }'

Fast mode pricing is $30 per million input tokens and $150 per million output tokens--6x the standard Opus rate. It exists for latency-sensitive deployments where you need full Opus reasoning power but cannot tolerate typical Opus generation times.

INFERENCE GEOGRAPHY

For Opus 4.6 and newer models, specifying US-only inference via the inference_geo parameter incurs a 1.1x multiplier on all token pricing categories. On third-party platforms (AWS Bedrock and Google Vertex AI), starting with Sonnet 4.5 and Haiku 4.5 and continuing with all subsequent models, regional endpoints (which guarantee data routing through specific geographic regions) carry a 10% premium over global endpoints. The Claude API itself is global-only and is unaffected by this regional/global distinction.

RATE LIMITS

Rate limits are organized into tiers numbered 1 through 4, with each tier allowing progressively higher request rates, tokens per minute, and tokens per day. Your tier advances as you spend more on the API. Enterprise customers can negotiate custom limits. The 1M token context window requires tier 4 or custom rate limits. Anthropic introduced weekly rate limits for heavy Claude Code users in August 2025 to prevent cost spikes and uneven infrastructure load.

THIRD-PARTY PLATFORM AVAILABILITY

All Claude models are available on AWS Bedrock, Google Vertex AI, and Microsoft Foundry in addition to the Anthropic API. Platform-specific model IDs differ from the Anthropic API IDs (these are listed under each model above).
Pricing on third-party platforms may differ from Anthropic's direct pricing--consult each platform's pricing page for current rates.

LEGACY MODELS

The following models are still available in the API but are superseded by newer, cheaper, or more capable alternatives. They are listed here for completeness.

Claude Opus 4.5
API model ID: claude-opus-4-5-20251101
Alias: claude-opus-4-5
Released: November 24, 2025
Superseded by: Opus 4.6
Pricing: $5 input / $25 output per million tokens (a 66% reduction from Opus 4.1).
Context window: 200,000 tokens. Maximum output: 128,000 tokens.
Was the first Opus model to introduce the effort parameter, persistent memory via the memory tool, and automatic preservation of thinking blocks across conversation turns. Scored 80.9% on SWE-bench Verified at launch. Supports extended thinking, interleaved thinking via beta header, structured outputs, and all standard tools. Still a strong model, but Opus 4.6 offers improved long-context performance, adaptive thinking, and the "max" effort level at the same price.

Claude Sonnet 4.5
API model ID: claude-sonnet-4-5-20250929
Alias: claude-sonnet-4-5
Released: September 29, 2025
Superseded by: Sonnet 4.6
Pricing: $3 input / $15 output per million tokens.
Context window: 200,000 tokens (1M beta available). Maximum output: 64,000 tokens.
Reliable knowledge cutoff: January 2025. Training data cutoff: July 2025.
Scored 77.2% on SWE-bench Verified at launch and was marketed as "the best coding model in the world." Supports extended thinking, interleaved thinking via beta header, context awareness, structured outputs, and all standard tools. A capable model, but Sonnet 4.6 at the same price offers higher benchmark scores, adaptive thinking, the effort parameter, and a more recent training data cutoff.

Claude Sonnet 4
API model ID: claude-sonnet-4-20250514
Alias: claude-sonnet-4
Released: May 22, 2025
Superseded by: Sonnet 4.5 and Sonnet 4.6
Pricing: $3 input / $15 output per million tokens.
Context window: 200,000 tokens (1M beta available). Maximum output: 64,000 tokens.
Marked the transition from the Claude 3.x naming scheme to the Claude 4 family. Was the default model on claude.ai for free-tier users through much of 2025. Supports extended thinking and interleaved thinking via beta header.

Claude Opus 4.1
API model ID: claude-opus-4-1-20250805
Alias: claude-opus-4-1
Released: August 5, 2025
Superseded by: Opus 4.5 and Opus 4.6
Pricing: $15 input / $75 output per million tokens.
Context window: 200,000 tokens. Maximum output: 32,000 tokens.
The most expensive model still available in the API--three times the cost of Opus 4.5 and 4.6 with lower benchmark scores. Some developers still use it for final code reviews before merging, where it reportedly catches async bugs and resource leaks that faster models miss. For most purposes, Opus 4.5 or 4.6 is strictly better and much cheaper. Supports extended thinking and interleaved thinking via beta header.

Claude Opus 4
API model ID: claude-opus-4-20250514
Alias: claude-opus-4
Released: May 22, 2025
Superseded by: Opus 4.1, 4.5, and 4.6
Pricing: $15 input / $75 output per million tokens.
Context window: 200,000 tokens. Maximum output: 32,000 tokens.
The original Claude 4 flagship. Anthropic classified it as a Level 3 model on their four-point safety scale at launch. Like Opus 4.1, it is priced three times higher than its successors with lower performance. There is little reason to use it for new work. Supports extended thinking and interleaved thinking via beta header.

Claude Haiku 3.5
API model ID: claude-3-5-haiku-20241022
Alias: claude-haiku-3-5
Released: October 2024
Pricing: $0.80 input / $4 output per million tokens.
Context window: 200,000 tokens. Maximum output: 8,192 tokens.
A Claude 3.5-era model. Serviceable for simple high-volume tasks where even Haiku 4.5's cost is a consideration, though substantially less capable than the newer model.
The tool use system prompt overhead differs from the Claude 4 models: 264 tokens with tool_choice set to auto or none, and 340 tokens with tool_choice set to any or a specific tool.

Claude Haiku 3
API model ID: claude-3-haiku-20240307
Released: March 2024
Pricing: $0.25 input / $1.25 output per million tokens.
Context window: 200,000 tokens. Maximum output: 4,096 tokens.
The cheapest model in the API. A Claude 3-era model that lacks many features of the Claude 4 family--no extended thinking, no adaptive thinking, no effort parameter, no structured outputs. For cost-sensitive classification, routing, or simple extraction tasks at enormous volume, nothing in the API is cheaper.

DEPRECATED MODELS

Claude Sonnet 3.7 (claude-3-7-sonnet-20250219) at $3/$15 per million tokens and Claude Opus 3 (claude-3-opus-20240229) at $15/$75 per million tokens are both flagged for deprecation. They are still callable, but Anthropic recommends migrating away from them.

SUMMARY OF ALL BETA HEADERS

The following beta feature identifiers can be passed in the anthropic-beta header:

- interleaved-thinking-2025-05-14 (interleaved thinking for pre-4.6 models; deprecated on Opus 4.6)
- context-1m-2025-08-07 (1M token context window)
- context-management-2025-06-27 (memory tool and context editing)
- code-execution-2025-05-22 (code execution tool)
- fast-mode-2026-02-01 (fast inference for Opus 4.6)
- skills-2025-10-02 (agent skills)
- output-128k-2025-02-19 (128K max output for Claude Sonnet 3.7 only; not needed for Claude 4 models)
- token-efficient-tools-2025-02-19 (legacy; not needed on 4.6 models)
- extended-cache-ttl-2025-04-11 (1-hour prompt cache duration)
- files-api-2025-04-14 (Files API)
- mcp-client-2025-11-20 (MCP connector, newer version)
- model-context-window-exceeded-2025-08-26 (returns a stop reason instead of an error when the context window is exceeded)

SUMMARY OF PRICING

Current recommended models:

- Opus 4.6: $5 input / $25 output per million tokens
- Sonnet 4.6: $3 input / $15 output per million tokens
- Haiku 4.5: $1 input / $5 output per million tokens

Legacy models still available:

- Opus 4.5: $5 / $25
- Sonnet 4.5: $3 / $15
- Sonnet 4: $3 / $15
- Opus 4.1: $15 / $75
- Opus 4: $15 / $75
- Haiku 3.5: $0.80 / $4
- Haiku 3: $0.25 / $1.25

Additional costs:

- Web search: $10 per 1,000 searches plus standard token costs
- Web fetch: no additional cost beyond tokens
- Code execution: $0.05/hour per container after 1,550 free hours/month (5-minute minimum)
- Fast mode (Opus 4.6): $30 / $150 per million tokens
- US-only inference (Opus 4.6+): 1.1x multiplier on all token prices
- 1M context (over 200K input tokens): 2x input, 1.5x output multiplier
- Prompt cache writes (5-minute): 1.25x input price
- Prompt cache writes (1-hour): 2x input price
- Prompt cache reads: 0.1x input price
- Batch processing: 50% discount on all token prices, all models

All prices are in USD. One token is approximately 4 characters or 0.75 words in English.
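The pricing summary above, combined with the 4-characters-per-token rule of thumb, is enough for back-of-envelope cost estimates. The sketch below is illustrative only: the prices are taken from this document, but the function names are invented, and real token counts should always come from the usage object in the API response rather than a character count.

```python
# Back-of-envelope cost estimator using the per-million-token prices
# summarized above. Ignores cache and long-context multipliers.
PRICES = {  # USD per million tokens: (input, output)
    "claude-opus-4-6":   (5.00, 25.00),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-haiku-4-5":  (1.00, 5.00),
}

def approx_tokens(text: str) -> int:
    """Rough English token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def estimate_cost(model: str, input_text: str, output_text: str,
                  batch: bool = False) -> float:
    """Estimate USD cost of one request from raw text lengths."""
    inp_rate, out_rate = PRICES[model]
    cost = (approx_tokens(input_text) / 1e6) * inp_rate \
         + (approx_tokens(output_text) / 1e6) * out_rate
    # Batch processing is a flat 50% discount on all token prices.
    return cost * 0.5 if batch else cost

print(estimate_cost("claude-sonnet-4-6", "x" * 4000, "y" * 4000))
```

For anything beyond a rough estimate, add the cache-write (1.25x or 2x), cache-read (0.1x), and long-context (2x/1.5x over 200K input tokens) multipliers described earlier.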
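As described in the authentication section, every request carries the same small set of headers, and multiple beta features are comma-separated in a single anthropic-beta header. A minimal sketch of assembling them (the helper function and its name are hypothetical; the header names and version string are from this document):

```python
def build_headers(api_key: str, betas=None) -> dict:
    """Assemble the standard Anthropic API request headers."""
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    if betas:
        # Multiple beta features go in one comma-separated header.
        headers["anthropic-beta"] = ",".join(betas)
    return headers

h = build_headers("sk-...", ["context-1m-2025-08-07",
                             "context-management-2025-06-27"])
```

Omitting the betas argument yields the three headers every request needs; passing a list opts the request into those beta features.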