How Claude Code's QueryEngine orchestrates every conversation turn —
from the first user message to streaming tokens, tool calls, retries, autocompact,
and stop hooks — before looping back for more.
When you send a message to Claude Code, it passes through at least four distinct layers before the model's reply reaches your terminal. Understanding those layers is the key to understanding why Claude Code behaves the way it does — why it retries, why it compacts, why it can run tools in the middle of a stream.
- `query()` — an async function* that loops until the model stops calling tools. Each iteration is one model call.
- `withRetry()` — the retry wrapper that every API call passes through.
- Both the interactive REPL and the SDK path feed into the same `query()` generator. The generator is the single source of truth for how turns work.
Below is the full message flow for a single conversation turn that involves at least one tool call. Follow the arrows: the loop between queryLoop and queryModel is the heart of the agentic behavior.
The loop itself is a while (true) at line 307 of query.ts. Every iteration
through that loop is exactly one API call.
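The shape of that loop can be sketched in a few lines. Everything here is an illustrative stand-in — `queryLoopSketch`, `FakeMessage`, and the `callModel`/`runTool` parameters are invented for the sketch, not Claude Code's real internals:

```typescript
// Minimal sketch of the one-API-call-per-iteration agentic loop.
// All names here are illustrative stand-ins, not the real Claude Code API.
type FakeMessage = { role: 'user' | 'assistant' | 'tool'; content: string }

async function* queryLoopSketch(
  callModel: (msgs: FakeMessage[]) => Promise<{ text: string; toolCalls: string[] }>,
  runTool: (name: string) => Promise<string>,
  messages: FakeMessage[],
): AsyncGenerator<FakeMessage> {
  while (true) {
    const reply = await callModel(messages) // exactly one API call per iteration
    const assistant: FakeMessage = { role: 'assistant', content: reply.text }
    messages.push(assistant)
    yield assistant
    if (reply.toolCalls.length === 0) return // model stopped calling tools: turn is done
    for (const name of reply.toolCalls) {
      const result: FakeMessage = { role: 'tool', content: await runTool(name) }
      messages.push(result) // tool results become input for the next iteration
      yield result
    }
  }
}
```

Fed a fake model that calls one tool and then stops, the sketch makes exactly two "API calls" and yields three messages.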
QueryEngine is a stateful class instantiated once
per conversation. It holds the mutable message history, total token usage,
permission denials, and the abort controller. Each call to
submitMessage() is one "turn" within that conversation.
```ts
// QueryEngine.ts (simplified)
export class QueryEngine {
  private mutableMessages: Message[]
  private abortController: AbortController
  private totalUsage: NonNullableUsage
  private permissionDenials: SDKPermissionDenial[]

  // Turn-scoped: cleared at start of each submitMessage() call
  private discoveredSkillNames = new Set<string>()

  async *submitMessage(
    prompt: string | ContentBlockParam[],
    options?: { uuid?: string; isMeta?: boolean },
  ): AsyncGenerator<SDKMessage> {
    // 1. Build system prompt (fetchSystemPromptParts)
    // 2. processUserInput — handles slash commands
    // 3. recordTranscript — persists BEFORE the API call
    // 4. yield* query({ messages, ... })
    // 5. yield final result SDKMessage
  }
}
```
Before query() is even called, submitMessage() persists
the user's message to disk. This means a session is resumable even if
the process is killed before the model ever responds. The comment in the source
is illuminating:
```ts
// If the process is killed before that (e.g. user clicks Stop in
// cowork seconds after send), the transcript is left with only
// queue-operation entries; getLastSessionLog filters those out,
// returns null, and --resume fails with "No conversation found".
// Writing now makes the transcript resumable from the point the
// user message was accepted, even if no API response ever arrives.
```
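The ordering guarantee is easy to demonstrate in miniature. This is a toy sketch of the write-before-call pattern — `makeSession` and the in-memory transcript are invented stand-ins for the real transcript file:

```typescript
// Sketch of write-before-call ordering: persist the user message first,
// so a crash before the model responds still leaves a resumable log.
// makeSession and the in-memory "transcript" are stand-ins, not real code.
function makeSession() {
  const transcript: string[] = [] // stand-in for the on-disk transcript
  return {
    transcript,
    async submit(prompt: string, callModel: () => Promise<string>) {
      transcript.push(`user: ${prompt}`) // persisted BEFORE the API call
      const reply = await callModel() // may throw, or never return at all
      transcript.push(`assistant: ${reply}`)
      return reply
    },
  }
}
```

Even if `callModel` throws (or the process dies mid-call), the transcript already holds the user message, so a later resume has something to pick up.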
QueryEngine is used by the SDK/headless path. The REPL has its own
wiring through ask() but calls the same query() function
underneath.
queryLoop() in query.ts is a while(true)
loop that carries a typed State object between iterations. Rather
than nine separate let variables, a single
state = { ... } reassignment at each continue site makes
the transitions explicit and auditable.
```ts
// query.ts
type State = {
  messages: Message[]
  toolUseContext: ToolUseContext
  autoCompactTracking: AutoCompactTrackingState | undefined
  maxOutputTokensRecoveryCount: number
  hasAttemptedReactiveCompact: boolean
  maxOutputTokensOverride: number | undefined
  pendingToolUseSummary: Promise<ToolUseSummaryMessage | null> | undefined
  stopHookActive: boolean | undefined
  turnCount: number
  transition: Continue | undefined // WHY we looped again
}
```
The transition field records why the loop continued. This
makes the continue sites self-documenting and lets tests assert which recovery
path fired without inspecting message content:
| `transition.reason` | Meaning |
|---|---|
| `max_output_tokens_escalate` | First hit of 8k cap; retry at 64k `max_tokens` |
| `max_output_tokens_recovery` | Model hit output limit; inject recovery nudge (up to 3×) |
| `reactive_compact_retry` | Prompt-too-long → compacted history → retry |
| `collapse_drain_retry` | Prompt-too-long → drained context-collapse stages → retry |
| `stop_hook_blocking` | A stop hook returned a blocking error; re-query with error as user message |
| `token_budget_continuation` | Token budget says work isn't done; inject nudge and continue |
| (needs follow-up) | Normal: model returned `tool_use` blocks → run tools → loop |
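A discriminated union makes these reasons checkable at compile time. The following is a sketch of the pattern only — the reason strings come from the table above, but the field shapes are assumed, not the real `Continue` type:

```typescript
// Sketch of the transition pattern: each continue site tags WHY it looped.
// Reason strings come from the table above; the extra fields are assumed.
type Continue =
  | { reason: 'max_output_tokens_escalate' }
  | { reason: 'max_output_tokens_recovery'; attempt: number }
  | { reason: 'reactive_compact_retry' }
  | { reason: 'collapse_drain_retry' }
  | { reason: 'stop_hook_blocking'; hookError: string }
  | { reason: 'token_budget_continuation' }

// A test can now assert which recovery path fired without
// string-matching on message content:
function firedReactiveCompact(t: Continue | undefined): boolean {
  return t?.reason === 'reactive_compact_retry'
}
```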
The loop exits (yielding a typed Terminal) on: `completed`,
`blocking_limit`, `model_error`, `prompt_too_long`,
`aborted_streaming`, `stop_hook_prevented`, and
`image_error`. Each maps to a different user-visible outcome.
queryModel in claude.ts is an async function*
that calls the Anthropic beta messages endpoint and re-yields each stream event as
an internal AssistantMessage or StreamEvent.
```ts
// query.ts (inner stream loop, simplified)
for await (const message of deps.callModel({
  messages: prependUserContext(messagesForQuery, userContext),
  systemPrompt: fullSystemPrompt,
  thinkingConfig: toolUseContext.options.thinkingConfig,
  tools: toolUseContext.options.tools,
  signal: toolUseContext.abortController.signal,
  options: { model: currentModel, fallbackModel, ... },
})) {
  if (message.type === 'assistant') {
    assistantMessages.push(message)
    // tool_use blocks trigger needsFollowUp = true
    const toolBlocks = message.message.content
      .filter(b => b.type === 'tool_use')
    if (toolBlocks.length > 0) needsFollowUp = true
  }
  yield yieldMessage // surfaces to SDK caller / REPL
}
```
When config.gates.streamingToolExecution is enabled, a
StreamingToolExecutor fires tools while the stream is still
open. Tools whose inputs arrive early start executing in parallel with the
model still generating text, cutting latency on multi-tool turns.
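The early-dispatch idea can be isolated into a small sketch. `ToolCall`, `runToolsEagerly`, and `execute` are invented names; only the fire-as-soon-as-input-completes shape mirrors the `StreamingToolExecutor` described above:

```typescript
// Sketch of eager tool dispatch: fire each tool the moment its input
// is complete, rather than waiting for the stream to finish.
// Names here are illustrative, not the real StreamingToolExecutor API.
type ToolCall = { name: string; input: string }

async function runToolsEagerly(
  stream: AsyncIterable<ToolCall>,
  execute: (call: ToolCall) => Promise<string>,
): Promise<string[]> {
  const pending: Promise<string>[] = []
  for await (const call of stream) {
    pending.push(execute(call)) // starts now, in parallel with the stream
  }
  return Promise.all(pending) // collect once the stream closes
}
```

Each tool starts executing while later stream chunks are still arriving; the only synchronization point is the final `Promise.all`.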
When a retry discards a partially streamed response, the already-yielded
AssistantMessage objects are tombstoned — the engine yields
{ type: 'tombstone', message } so the UI and transcript can remove
them. This prevents "thinking blocks cannot be modified" API errors on the retry.
Every API call goes through withRetry() in
services/api/withRetry.ts. The function is an
async function* that retries up to
DEFAULT_MAX_RETRIES = 10 times by default, yielding a
SystemAPIErrorMessage before each sleep so the user sees a
live status update.
```ts
// Backoff formula (from withRetry.ts)
export function getRetryDelay(
  attempt: number,
  retryAfterHeader?: string | null,
  maxDelayMs = 32000,
): number {
  if (retryAfterHeader) {
    const seconds = parseInt(retryAfterHeader, 10)
    if (!isNaN(seconds)) return seconds * 1000
  }
  const baseDelay = Math.min(
    BASE_DELAY_MS * Math.pow(2, attempt - 1),
    maxDelayMs,
  )
  const jitter = Math.random() * 0.25 * baseDelay
  return baseDelay + jitter
}
```
Key retry decision rules:

- Repeated 529 "overloaded" errors on Opus throw a `FallbackTriggeredError`, which `queryLoop` catches and switches to `fallbackModel`.
- A 401 triggers `handleOAuth401Error()` before the next attempt.
- Context-overflow errors recalculate `maxTokensOverride`.
- On connection-level failures, `disableKeepAlive()` is called before the retry.
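These rules sit on top of the backoff formula shown earlier. Here is a runnable approximation of that formula — `BASE_DELAY_MS = 500` is an assumption for the sketch (the real constant is not shown in the source):

```typescript
// Runnable approximation of the backoff policy: honor Retry-After,
// otherwise exponential backoff with a cap plus up to 25% jitter.
// BASE_DELAY_MS = 500 is an assumption; the real constant may differ.
const BASE_DELAY_MS = 500

function retryDelaySketch(
  attempt: number,
  retryAfterHeader?: string | null,
  maxDelayMs = 32000,
): number {
  if (retryAfterHeader) {
    const seconds = parseInt(retryAfterHeader, 10)
    if (!isNaN(seconds)) return seconds * 1000 // server-directed delay wins
  }
  const baseDelay = Math.min(BASE_DELAY_MS * Math.pow(2, attempt - 1), maxDelayMs)
  return baseDelay + Math.random() * 0.25 * baseDelay // add jitter
}
```

A Retry-After header short-circuits everything; otherwise attempt 1 waits roughly half a second and later attempts double up to the cap.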
The Anthropic streaming API sends Server-Sent Events in this sequence:
`message_start` → one or more `content_block_start` /
`content_block_delta` / `content_block_stop` groups →
`message_delta` (with final usage + stop_reason) →
`message_stop`.
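A reducer over that event sequence shows how the final text and `stop_reason` come together. The event shapes below are pared way down from the real Anthropic stream events; only the ordering logic is the point:

```typescript
// Simplified reducer over the SSE sequence described above.
// Event shapes are pared down from the real Anthropic stream events.
type StreamEvent =
  | { type: 'message_start' }
  | { type: 'content_block_start'; index: number }
  | { type: 'content_block_delta'; index: number; text: string }
  | { type: 'content_block_stop'; index: number }
  | { type: 'message_delta'; stop_reason: string; output_tokens: number }
  | { type: 'message_stop' }

function assembleMessage(events: StreamEvent[]) {
  const blocks: string[] = []
  let stopReason: string | null = null // unknown until message_delta arrives
  let outputTokens = 0
  for (const ev of events) {
    if (ev.type === 'content_block_start') blocks[ev.index] = ''
    if (ev.type === 'content_block_delta') blocks[ev.index] += ev.text
    if (ev.type === 'message_delta') {
      stopReason = ev.stop_reason
      outputTokens = ev.output_tokens
    }
  }
  return { blocks, stopReason, outputTokens }
}
```

Note that `stopReason` stays `null` through all the content deltas — exactly the "not available until the stream ends" property the next paragraph describes.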
queryModel reconstructs a complete AssistantMessage
object per content block and yields it. Usage is mutated in-place on the last
message once message_delta arrives — the final stop_reason and
token counts are not available until the stream ends.
```ts
// From QueryEngine.ts — usage tracking
if (message.event.type === 'message_start') {
  currentMessageUsage = updateUsage(EMPTY_USAGE, message.event.message.usage)
}
if (message.event.type === 'message_delta') {
  currentMessageUsage = updateUsage(currentMessageUsage, message.event.usage)
  if (message.event.delta.stop_reason != null) {
    lastStopReason = message.event.delta.stop_reason
  }
}
```
One subtlety: `tool_use` blocks stream their JSON
input incrementally via deltas. If a tool's
`backfillObservableInput` method adds fields to the input (e.g.,
expanding a file path), only a clone of the message is yielded to
observers — the original stays byte-for-byte identical for prompt caching.
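The clone-for-observers trick can be sketched like this — `yieldToObservers` is an invented name, and `structuredClone` stands in for whatever deep copy the real code uses:

```typescript
// Sketch: observers get an enriched clone; the cached original is untouched,
// so the bytes sent back to the API on the next call stay cache-stable.
// yieldToObservers is an invented name for illustration.
type ToolUseBlock = { type: 'tool_use'; name: string; input: Record<string, unknown> }

function yieldToObservers(
  original: ToolUseBlock,
  backfill: (input: Record<string, unknown>) => Record<string, unknown>,
): ToolUseBlock {
  const clone = structuredClone(original) // deep copy; original stays pristine
  clone.input = backfill(clone.input)
  return clone
}
```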
Before each API call, queryLoop runs a pipeline of context
reduction strategies in a fixed priority order.
The blocking limit check happens after all compaction strategies
have run. If context is still over the limit, a synthetic
PROMPT_TOO_LONG_ERROR_MESSAGE is yielded and the loop exits with
reason blocking_limit — the user must manually run
/compact.
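One way to picture that ordering: strategies run in priority order until the context fits, and the blocking check fires only after every strategy has had its chance. The strategy names and shapes below are placeholders, not the real pipeline:

```typescript
// Sketch of a fixed-priority reduction pipeline. Strategy names are
// placeholders; the ordering-then-blocking-check shape is the point.
type Strategy = { name: string; apply: (tokens: number) => number }

function reduceContext(
  tokens: number,
  limit: number,
  strategies: Strategy[],
): { tokens: number; blocked: boolean } {
  for (const s of strategies) {
    if (tokens <= limit) break // stop as soon as we fit
    tokens = s.apply(tokens)
  }
  // Blocking check runs only AFTER all strategies have run.
  return { tokens, blocked: tokens > limit }
}
```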
Reactive compact is a fallback path triggered by a real 413 from the API (prompt-too-long). The engine withholds the error message during streaming, then attempts one reactive compaction. If that fails, the error is surfaced and stop hooks are skipped (to prevent a death spiral).
The task_budget feature tracks total context tokens consumed
across compact boundaries. When the server summarizes history, a naive
count would lose the pre-compact spend; taskBudgetRemaining
carries the correct cumulative spend across boundaries.
```ts
// task_budget carryover across compaction (query.ts ~508)
if (params.taskBudget) {
  const preCompactContext = finalContextTokensFromLastResponse(messagesForQuery)
  taskBudgetRemaining = Math.max(
    0,
    (taskBudgetRemaining ?? params.taskBudget.total) - preCompactContext,
  )
}
```
After the model finishes (no tool calls, no recovery needed), the engine calls
handleStopHooks() in query/stopHooks.ts.
Stop hooks are external shell scripts or commands configured by the user. They
run after every turn and can block the stop — the hook's error is fed back to
the model as a user message and the loop continues — or prevent continuation
entirely, ending the turn with { reason: 'stop_hook_prevented' }.
handleStopHooks() runs three categories of hooks in order:

1. **User-configured stop hooks.** Registered via the `settings.json` hooks configuration. They run in parallel; each result is collected as a `hook_success`, `hook_non_blocking_error`, or `hook_error_during_execution` attachment. A blocking error is any hook exit-code failure where the hook explicitly signals it should block.
2. **Teammate task hooks.** In teammate mode (multi-agent setups), hooks fire for each `in_progress` task owned by this agent. These mirror stop hook semantics (can block, can prevent continuation).
3. **Idle hooks.** Fire when this teammate transitions to idle. They can also block or prevent continuation.
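The two hook outcomes map onto the loop transitions described earlier. Here is a sketch of that mapping — `HookResult`, `resolveStopHooks`, and the prevent-over-block precedence are all assumptions for illustration, not the real implementation:

```typescript
// Sketch of mapping hook results to loop behavior. Shapes and the
// precedence of "prevent" over "block" are assumptions, not real code.
type HookResult = { blocking: boolean; preventContinuation?: boolean; message?: string }

function resolveStopHooks(results: HookResult[]):
  | { kind: 'continue_with_error'; message: string } // re-query, error as user msg
  | { kind: 'terminal'; reason: 'stop_hook_prevented' }
  | { kind: 'done' } {
  const preventer = results.find(r => r.preventContinuation)
  if (preventer) return { kind: 'terminal', reason: 'stop_hook_prevented' }
  const blocker = results.find(r => r.blocking)
  if (blocker) return { kind: 'continue_with_error', message: blocker.message ?? '' }
  return { kind: 'done' }
}
```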
A set of fire-and-forget background tasks runs after stop hooks. They are
skipped in bare mode (the -p flag) and fired without await
in interactive mode:
- `executePromptSuggestion` — generates "btw..." suggestions
- `executeExtractMemories` — extracts facts to MEMORY.md
- `executeAutoDream` — autonomous background exploration

```ts
// --bare / SIMPLE: skip background bookkeeping
// Scripted -p calls don't want auto-memory or forked agents
// contending for resources during shutdown.
if (!isBareMode()) {
  void executePromptSuggestion(stopHookContext)
  if (feature('EXTRACT_MEMORIES') && isExtractModeActive()) {
    void extractMemoriesModule!.executeExtractMemories(...)
  }
  if (!toolUseContext.agentId) {
    void executeAutoDream(...)
  }
}
```
query/tokenBudget.ts implements an auto-continue feature for the
SDK path. When a per-turn token budget is configured, the engine checks after
each clean model stop whether the model has "used up" enough of its budget.
If not, it injects a nudge message and loops again.
```ts
// query/tokenBudget.ts
const COMPLETION_THRESHOLD = 0.9 // 90% used = done
const DIMINISHING_THRESHOLD = 500 // <500 new tokens = no progress

export function checkTokenBudget(
  tracker: BudgetTracker,
  agentId: string | undefined,
  budget: number | null,
  globalTurnTokens: number,
): TokenBudgetDecision {
  if (agentId || budget === null || budget <= 0) {
    return { action: 'stop', completionEvent: null }
  }
  const pct = Math.round((globalTurnTokens / budget) * 100)
  const isDiminishing =
    tracker.continuationCount >= 3 &&
    deltaSinceLastCheck < DIMINISHING_THRESHOLD &&
    tracker.lastDeltaTokens < DIMINISHING_THRESHOLD
  // Continue if under 90% AND not diminishing
  if (!isDiminishing && turnTokens < budget * COMPLETION_THRESHOLD) {
    return { action: 'continue', nudgeMessage: ... }
  }
  return { action: 'stop', ... }
}
```
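A runnable reduction of that decision, with the tracker plumbing simplified down to plain parameters (`budgetDecision` is an invented name for the sketch):

```typescript
// Runnable reduction of the token-budget decision: continue while under
// 90% of budget unless progress has stalled. Tracker shape is simplified;
// budgetDecision is an invented name for illustration.
const COMPLETION = 0.9
const MIN_DELTA = 500

function budgetDecision(
  turnTokens: number,
  budget: number | null,
  continuationCount: number,
  lastDeltaTokens: number,
): 'continue' | 'stop' {
  if (budget === null || budget <= 0) return 'stop' // no budget configured
  const diminishing = continuationCount >= 3 && lastDeltaTokens < MIN_DELTA
  if (!diminishing && turnTokens < budget * COMPLETION) return 'continue'
  return 'stop'
}
```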
The decision logic has two early-stop conditions: subagents (any
agentId set) and missing or non-positive budgets stop immediately,
and diminishing returns — three or more continuations each yielding fewer
than 500 new tokens — stops the loop even with budget remaining.
The nudge message is injected as an isMeta user message so it
doesn't appear in the REPL transcript, and the loop continues with
transition.reason = 'token_budget_continuation'.
The while(true) in queryLoop exits via a typed
Terminal value. Every possible stopping condition —
completion, errors, abort, stop hooks, budget — has a named reason.
submitMessage, query, queryLoop,
queryModel, withRetry, handleStopHooks
— all are async function*. This lets the entire stack compose
cleanly with yield* and backpressure flows naturally.
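The composition benefit is concrete: each layer forwards the inner layers' messages with a single yield*. A toy version (layer names loosely mirror the stack; the bodies are fake):

```typescript
// Toy version of the generator stack: each layer forwards inner
// messages with yield*, so backpressure propagates for free.
// Bodies are fake; only the composition shape mirrors the real stack.
async function* inner(): AsyncGenerator<string> {
  yield 'stream_event'
  yield 'assistant'
}

async function* middle(): AsyncGenerator<string> {
  yield* inner() // everything inner yields surfaces here unchanged
  yield 'tool_result'
}

async function* outer(): AsyncGenerator<string> {
  yield* middle()
  yield 'result' // final SDKMessage analogue
}
```

A consumer pulling from `outer()` sees every message from every layer, in order, without any layer buffering the whole stream.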
The user message is written to disk before the API is called. Even a process kill between send and response leaves a resumable session.
Feature gates — feature('HISTORY_SNIP'), feature('TOKEN_BUDGET'),
feature('CONTEXT_COLLAPSE'), etc. — are evaluated at bundle time
by Bun, eliminating unreachable code from external builds and preventing
gate-name string leakage.
Memory extraction, prompt suggestions, auto-dream — all are void
promises. They must not block the response stream and must not run in bare
(-p) mode where resource contention on shutdown matters.
Foreground vs background source routing, fast-mode cooldowns, OAuth refresh, persistent keep-alive, Opus→fallback after 3×529, context-overflow token recalculation — the retry layer is a small state machine, not just a sleep loop.