Lesson 04

Query Engine & LLM API

How Claude Code's QueryEngine orchestrates every conversation turn — from the first user message to streaming tokens, tool calls, retries, autocompact, and stop hooks — before looping back for more.

1. The Big Picture

When you send a message to Claude Code, it passes through at least four distinct layers before the model's reply reaches your terminal. Understanding those layers is the key to understanding why Claude Code behaves the way it does — why it retries, why it compacts, why it can run tools in the middle of a stream.

  1. QueryEngine.submitMessage(): Validates the prompt, builds the system prompt, resolves the model, records the transcript, then hands off to query().
  2. query() → queryLoop(): An async function* that loops until the model stops calling tools. Each iteration is one model call.
  3. queryModel / callModel: Calls the Anthropic API via the SDK's streaming interface, wrapping everything in withRetry().
  4. Stop hooks & token budget: After the model finishes each turn, external hooks run; the token budget decides whether to inject a nudge and loop again.
Key insight: Every public surface of Claude Code — the REPL, the SDK, remote Claude Code — funnels through the same query() generator. The generator is the single source of truth for how turns work.

2. Sequence Diagram

Below is the full message flow for a single conversation turn that involves at least one tool call. Follow the arrows: the loop between queryLoop and queryModel is the heart of the agentic behavior.

sequenceDiagram
    participant User
    participant QE as QueryEngine submitMessage()
    participant Q as query() / queryLoop()
    participant QM as queryModel (claude.ts)
    participant API as Anthropic API (streaming)
    participant Tools as Tool Executor
    participant SH as stopHooks.ts

    User->>QE: submitMessage(prompt)
    QE->>QE: fetchSystemPromptParts(), buildSystemInitMessage()
    QE->>Q: query({ messages, systemPrompt, ... })
    Q->>Q: applyToolResultBudget(), microcompact / snip / autocompact

    loop queryLoop — one iteration per model call
        Q->>QM: callModel({ messages, tools, ... })
        QM->>API: POST /v1/messages (streaming, withRetry)
        API-->>QM: stream: content_block_delta events
        QM-->>Q: yield AssistantMessage (text / tool_use blocks)
        Q->>Q: StreamingToolExecutor tracks tool_use blocks
        alt tool_use blocks present
            Q->>Tools: runTools(toolUseBlocks)
            Tools-->>Q: yield progress + tool_result UserMessages
            Q->>Q: append tool_results to messages
            Note over Q: needsFollowUp = true → loop continues
        else no tool calls
            Note over Q: needsFollowUp = false
            Q->>SH: handleStopHooks()
            SH-->>Q: yield hook progress/attachments
            alt hook blocking error
                Q->>Q: append blockingError, loop again
            else hook prevents continuation
                Q-->>QE: Terminal { reason: 'stop_hook_prevented' }
            else clean stop
                Q->>Q: checkTokenBudget()
                alt budget says continue
                    Q->>Q: inject nudge message, loop again
                else budget says stop
                    Q-->>QE: Terminal { reason: 'completed' }
                end
            end
        end
    end

    QE-->>User: yield SDKMessage stream (assistant / user / result)
Reading the diagram: The loop box is not just a diagram convention — it maps directly to the while (true) at line 307 of query.ts. Every iteration through that loop is exactly one API call.

3. QueryEngine — One Engine Per Conversation

QueryEngine is a stateful class instantiated once per conversation. It holds the mutable message history, total token usage, permission denials, and the abort controller. Each call to submitMessage() is one "turn" within that conversation.

QueryEngine.ts (simplified)
export class QueryEngine {
  private mutableMessages: Message[]
  private abortController: AbortController
  private totalUsage: NonNullableUsage
  private permissionDenials: SDKPermissionDenial[]

  // Turn-scoped: cleared at start of each submitMessage() call
  private discoveredSkillNames = new Set<string>()

  async *submitMessage(
    prompt: string | ContentBlockParam[],
    options?: { uuid?: string; isMeta?: boolean },
  ): AsyncGenerator<SDKMessage> {
    // 1. Build system prompt (fetchSystemPromptParts)
    // 2. processUserInput — handles slash commands
    // 3. recordTranscript — persists BEFORE the API call
    // 4. yield* query({ messages, ... })
    // 5. yield final result SDKMessage
  }
}
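
A sketch of how a caller might consume this generator (for await over an async generator is standard TypeScript; engine and the render helpers are hypothetical stand-ins for host code, not Claude Code APIs):

// Hypothetical consumer of QueryEngine.submitMessage() — the SDK path.
declare const engine: { submitMessage(p: string): AsyncGenerator<{ type: string }> }
declare function renderAssistant(m: unknown): void
declare function renderToolResult(m: unknown): void
declare function showSummary(m: unknown): void

for await (const msg of engine.submitMessage('refactor src/query.ts')) {
  switch (msg.type) {
    case 'assistant': renderAssistant(msg); break // streamed model output
    case 'user': renderToolResult(msg); break     // tool_result messages
    case 'result': showSummary(msg); break        // final per-turn result
  }
}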

Why the transcript is written before the API call

Before query() is even called, submitMessage() persists the user's message to disk. This means a session is resumable even if the process is killed before the model ever responds. The comment in the source is illuminating:

// If the process is killed before that (e.g. user clicks Stop in
// cowork seconds after send), the transcript is left with only
// queue-operation entries; getLastSessionLog filters those out,
// returns null, and --resume fails with "No conversation found".
// Writing now makes the transcript resumable from the point the
// user message was accepted, even if no API response ever arrives.
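
A minimal sketch of that ordering, assuming a hypothetical appendToTranscript helper (recordTranscript is the real step named above; every other name here is illustrative):

type Message = unknown
type SDKMessage = unknown
declare const history: Message[]
declare function buildUserMessage(prompt: string): Message
declare function appendToTranscript(m: Message): Promise<void>
declare function query(args: { messages: Message[] }): AsyncGenerator<SDKMessage>

// Sketch: persist first, then stream. A process kill between the two
// awaits still leaves a resumable transcript on disk.
async function* submitMessage(prompt: string): AsyncGenerator<SDKMessage> {
  const userMessage = buildUserMessage(prompt)
  await appendToTranscript(userMessage) // durable BEFORE any network I/O
  yield* query({ messages: [...history, userMessage] })
}
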
SDK vs REPL: QueryEngine is used by the SDK/headless path. The REPL has its own wiring through ask() but calls the same query() function underneath.

4. Inside queryLoop() — The while(true) Core

queryLoop() in query.ts is a while(true) loop that carries a typed State object between iterations. Rather than nine separate let variables, a single state = { ... } reassignment at each continue site makes the transitions explicit and auditable.

query.ts
type State = {
  messages: Message[]
  toolUseContext: ToolUseContext
  autoCompactTracking: AutoCompactTrackingState | undefined
  maxOutputTokensRecoveryCount: number
  hasAttemptedReactiveCompact: boolean
  maxOutputTokensOverride: number | undefined
  pendingToolUseSummary: Promise<ToolUseSummaryMessage | null> | undefined
  stopHookActive: boolean | undefined
  turnCount: number
  transition: Continue | undefined   // WHY we looped again
}
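
To make the reassignment pattern concrete, here is an illustrative continue site (the shape follows the State type above; the specific fields updated and the recoveryNudge message are invented for the example):

// Inside queryLoop()'s while (true), at a max-output-tokens recovery:
// one explicit reassignment, with every carried field visible via spread.
state = {
  ...state,
  messages: [...state.messages, recoveryNudge], // hypothetical nudge message
  maxOutputTokensRecoveryCount: state.maxOutputTokensRecoveryCount + 1,
  transition: { reason: 'max_output_tokens_recovery' }, // WHY we looped
}
continue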

Continue transitions — the seven reasons to loop

The transition field records why the loop continued. This makes the continue sites self-documenting and lets tests assert which recovery path fired without inspecting message content:

Each transition.reason and its meaning:

  • max_output_tokens_escalate: First hit of the 8k cap; retry at 64k max_tokens
  • max_output_tokens_recovery: Model hit its output limit; inject a recovery nudge (up to 3×)
  • reactive_compact_retry: Prompt-too-long → compacted history → retry
  • collapse_drain_retry: Prompt-too-long → drained context-collapse stages → retry
  • stop_hook_blocking: A stop hook returned a blocking error; re-query with the error as a user message
  • token_budget_continuation: Token budget says the work isn't done; inject a nudge and continue
  • (needs follow-up): The normal case: the model returned tool_use blocks → run tools → loop
Termination conditions: The loop exits (returns a Terminal) on: completed, blocking_limit, model_error, prompt_too_long, aborted_streaming, stop_hook_prevented, image_error. Each maps to a different user-visible outcome.
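
Inferred from that list, the Terminal value plausibly looks like a tagged union (a sketch; the real type likely carries extra payload per reason):

type TerminalReason =
  | 'completed'
  | 'blocking_limit'
  | 'model_error'
  | 'prompt_too_long'
  | 'aborted_streaming'
  | 'stop_hook_prevented'
  | 'image_error'

// Sketch only: the real Terminal probably attaches usage, errors, etc.
type Terminal = { reason: TerminalReason }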

5. Streaming & the API Layer

queryModel in claude.ts is an async function* that calls the Anthropic beta messages endpoint and re-yields each stream event as an internal AssistantMessage or StreamEvent.

query.ts (inner stream loop, simplified)
for await (const message of deps.callModel({
  messages: prependUserContext(messagesForQuery, userContext),
  systemPrompt: fullSystemPrompt,
  thinkingConfig: toolUseContext.options.thinkingConfig,
  tools: toolUseContext.options.tools,
  signal: toolUseContext.abortController.signal,
  options: { model: currentModel, fallbackModel, ... },
})) {
  if (message.type === 'assistant') {
    assistantMessages.push(message)
    // tool_use blocks trigger needsFollowUp = true
    const toolBlocks = message.message.content
      .filter(b => b.type === 'tool_use')
    if (toolBlocks.length > 0) needsFollowUp = true
  }
  yield yieldMessage // surfaces to SDK caller / REPL
}

Streaming tool execution

When config.gates.streamingToolExecution is enabled, a StreamingToolExecutor fires tools while the stream is still open. Tools whose inputs arrive early start executing in parallel with the model still generating text, cutting latency on multi-tool turns.
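
A sketch of the eager-start idea (StreamingToolExecutor is the real class name; this runner, runTool, and ToolResult are illustrative stand-ins):

type ToolResult = { toolUseId: string; content: string }
declare function runTool(name: string, input: unknown): Promise<ToolResult>

class EagerToolRunner {
  private running = new Map<string, Promise<ToolResult>>()

  // Called as soon as a tool_use block's input JSON finishes streaming;
  // the tool then runs in parallel with the model's remaining output.
  onToolInputComplete(block: { id: string; name: string; input: unknown }) {
    this.running.set(block.id, runTool(block.name, block.input))
  }

  // Awaited once the stream closes; early starters are already done or nearly so.
  async collect(): Promise<ToolResult[]> {
    return Promise.all(this.running.values())
  }
}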

Tombstone messages: If a streaming fallback is triggered mid-stream, partially-received AssistantMessage objects are tombstoned — the engine yields { type: 'tombstone', message } so the UI and transcript can remove them. This prevents "thinking blocks cannot be modified" API errors on the retry.
Deep dive: withRetry() — exponential backoff, 529s, and OAuth refresh

Every API call goes through withRetry() in services/api/withRetry.ts. The function is an async function* that retries up to DEFAULT_MAX_RETRIES = 10 times, yielding a SystemAPIErrorMessage before each sleep so the user sees a live status update.

// Backoff formula (from withRetry.ts)
export function getRetryDelay(
  attempt: number,
  retryAfterHeader?: string | null,
  maxDelayMs = 32000,
): number {
  if (retryAfterHeader) {
    const seconds = parseInt(retryAfterHeader, 10)
    if (!isNaN(seconds)) return seconds * 1000
  }
  const baseDelay = Math.min(
    BASE_DELAY_MS * Math.pow(2, attempt - 1),
    maxDelayMs,
  )
  const jitter = Math.random() * 0.25 * baseDelay
  return baseDelay + jitter
}
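
For intuition, here is the schedule this formula produces if BASE_DELAY_MS were 1000 (the actual constant lives in withRetry.ts; 1000 is an assumption for illustration):

// Assuming BASE_DELAY_MS = 1000 (illustrative only), no Retry-After header:
// attempt 1  → 1000 ms + jitter (0–250 ms)
// attempt 2  → 2000 ms + jitter (0–500 ms)
// attempt 3  → 4000 ms + jitter (0–1000 ms)
// attempt 4  → 8000 ms + jitter (0–2000 ms)
// attempt 6+ → capped at 32000 ms (+ up to 8000 ms jitter)
// A Retry-After header short-circuits all of this: the server's value wins.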

Key retry decision rules:

  • 529 (overloaded): Only foreground query sources retry (user is waiting). Background sources — summaries, classifiers — bail immediately to avoid amplifying capacity cascades.
  • Opus fallback: After 3 consecutive 529s on a non-custom Opus model, throws FallbackTriggeredError which queryLoop catches and switches to fallbackModel.
  • OAuth 401: Forces a token refresh via handleOAuth401Error() before the next attempt.
  • Context overflow 400: Parses the token counts from the error message and computes a new maxTokensOverride.
  • Persistent mode (UNATTENDED_RETRY): Retries indefinitely with 30-min backoff cap, yielding heartbeat messages every 30s so the host doesn't kill the session for inactivity.
  • ECONNRESET/EPIPE: Stale keep-alive socket detected; disableKeepAlive() is called before the retry.
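
A condensed sketch of just the 529 routing from the list above (QuerySource and this helper are illustrative; the real logic lives inside withRetry() and signals fallback by throwing FallbackTriggeredError):

type QuerySource = 'foreground' | 'background'

function decide529(
  source: QuerySource,
  consecutive529s: number,
  isOpus: boolean,
): 'retry' | 'bail' | 'fallback' {
  if (source === 'background') return 'bail'            // don't amplify a capacity cascade
  if (isOpus && consecutive529s >= 3) return 'fallback' // switch to fallbackModel
  return 'retry'                                        // user is waiting; keep trying
}
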
Deep dive: SSE stream → AssistantMessage reconstruction

The Anthropic streaming API sends Server-Sent Events in this sequence: message_start → one or more content_block_start / content_block_delta / content_block_stop pairs → message_delta (with final usage + stop_reason) → message_stop.

queryModel reconstructs a complete AssistantMessage object per content block and yields it. Usage is mutated in-place on the last message once message_delta arrives — the final stop_reason and token counts are not available until the stream ends.

// From QueryEngine.ts — usage tracking
if (message.event.type === 'message_start') {
  currentMessageUsage = updateUsage(EMPTY_USAGE, message.event.message.usage)
}
if (message.event.type === 'message_delta') {
  currentMessageUsage = updateUsage(currentMessageUsage, message.event.usage)
  if (message.event.delta.stop_reason != null) {
    lastStopReason = message.event.delta.stop_reason
  }
}

One subtlety: tool_use blocks include their JSON input via deltas. If a tool's backfillObservableInput method adds fields to the input (e.g., expanding a file path), only a clone of the message is yielded to observers — the original stays byte-for-byte identical for prompt caching.
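
A minimal sketch of that clone-before-mutate rule (backfillObservableInput is the real hook named above; structuredClone and this wrapper are assumptions for illustration):

// Keep the original message byte-identical for prompt caching; observers
// get a mutated clone instead.
function observableCopy<T>(original: T, backfill: (clone: T) => void): T {
  const clone = structuredClone(original) // deep copy; original untouched
  backfill(clone)                         // e.g. expand a relative file path
  return clone                            // this is what gets yielded to observers
}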

6. Context Management & Autocompact

Before each API call, queryLoop runs a pipeline of context reduction strategies in a fixed priority order:

  1. applyToolResultBudget(): Caps the byte size of individual tool results. Large results are stored externally and replaced with a reference stub.
  2. snipCompact (HISTORY_SNIP feature): Removes old messages from the middle of the history when they are provably not needed, freeing tokens without a full summarization pass.
  3. microcompact / cached microcompact: Merges consecutive tool-result/user message pairs into condensed summaries. The cached variant uses API-side cache edits to avoid retransmitting deleted blocks.
  4. contextCollapse (CONTEXT_COLLAPSE feature): A read-time projection over the REPL's full history. Staged collapses are committed on each entry; the model sees a collapsed view while the UI retains the full history for scrollback.
  5. autoCompact: When the context approaches the blocking limit, triggers a full summarization via a forked agent. If it fires, the loop continues immediately with the post-compact messages.
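
The pipeline's shape, sketched under assumptions (the strategy names come from the list above; the Strategy type, runner, underLimit predicate, and per-stage early exit are invented for illustration):

type Message = { role: string; content: unknown }
type Strategy = (msgs: Message[]) => Promise<Message[]>

// A plausible runner: apply stages in priority order, cheapest first.
// Whether the real pipeline early-exits per stage is an assumption.
async function reduceContext(
  msgs: Message[],
  strategies: Strategy[], // [toolResultBudget, snip, microcompact, collapse, autoCompact]
  underLimit: (m: Message[]) => boolean,
): Promise<Message[]> {
  for (const apply of strategies) {
    if (underLimit(msgs)) break // context fits; skip the costlier stages
    msgs = await apply(msgs)
  }
  return msgs
}
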
Deep dive: autocompact — thresholds, circuit breakers, and task_budget

The blocking limit check happens after all compaction strategies have run. If context is still over the limit, a synthetic PROMPT_TOO_LONG_ERROR_MESSAGE is yielded and the loop exits with reason blocking_limit — the user must manually run /compact.

Reactive compact is a fallback path triggered by a real 413 from the API (prompt-too-long). The engine withholds the error message during streaming, then attempts one reactive compaction. If that fails, the error is surfaced and stop hooks are skipped (to prevent a death spiral).

The task_budget feature tracks total context tokens consumed across compact boundaries. When the server summarizes history, it would normally under-count the pre-compact spend; taskBudgetRemaining carries the correct cumulative spend across boundaries.

// task_budget carryover across compaction (query.ts ~508)
if (params.taskBudget) {
  const preCompactContext =
    finalContextTokensFromLastResponse(messagesForQuery)
  taskBudgetRemaining = Math.max(
    0,
    (taskBudgetRemaining ?? params.taskBudget.total) - preCompactContext,
  )
}
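
Worked through with invented numbers: suppose params.taskBudget.total is 200,000 and the pre-compact context measured 120,000 tokens:

// Invented numbers, first compaction of the session:
// taskBudgetRemaining is undefined → falls back to params.taskBudget.total (200_000)
// preCompactContext = 120_000
// taskBudgetRemaining = Math.max(0, 200_000 - 120_000) = 80_000
// → the post-compact turn inherits 80k of budget rather than resetting to 200k.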

7. Stop Hooks — Post-Turn Lifecycle

After the model finishes (no tool calls, no recovery needed), the engine calls handleStopHooks() in query/stopHooks.ts. Stop hooks are external shell scripts or commands configured by the user. They run after every turn and can:

  • Produce blocking errors — injected as user messages, triggering another loop iteration
  • Prevent continuation — the engine returns { reason: 'stop_hook_prevented' }
  • Fire background tasks — prompt suggestions, memory extraction, auto-dream — all fire-and-forget
Deep dive: Stop hooks, TeammateIdle, TaskCompleted, and fire-and-forget side effects

handleStopHooks() runs three categories of hooks in order, then fires a set of fire-and-forget background tasks:

1. Stop Hooks (always)

Registered via settings.json hooks configuration. Run in parallel; each result is collected as a hook_success, hook_non_blocking_error, or hook_error_during_execution attachment. A blocking error is any hook exit-code failure where the hook explicitly signals it should block.
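
A sketch of the parallel fan-out and attachment classification (the attachment names come from the lesson; the runner and outcome mapping are illustrative, and a hook's non-blocking-error signaling via exit status is glossed over here):

type HookOutcome =
  | { type: 'hook_success'; output: string }
  | { type: 'hook_non_blocking_error'; error: string }
  | { type: 'hook_error_during_execution'; error: string }

async function runStopHooksParallel(
  hooks: Array<() => Promise<string>>,
): Promise<HookOutcome[]> {
  // Run in parallel; classify each settled result as an attachment.
  const settled = await Promise.allSettled(hooks.map(h => h()))
  return settled.map(r =>
    r.status === 'fulfilled'
      ? { type: 'hook_success', output: r.value }
      : { type: 'hook_error_during_execution', error: String(r.reason) },
  )
}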

2. TaskCompleted hooks (teammate mode only)

In teammate mode (multi-agent setups), hooks fire for each in_progress task owned by this agent. These mirror stop hook semantics (can block, can prevent continuation).

3. TeammateIdle hooks (teammate mode only)

Fire when this teammate transitions to idle. Can also block or prevent continuation.

4. Fire-and-forget background tasks

Skipped in bare mode (-p flag). Fired without await in interactive mode:

  • executePromptSuggestion — generates btw... suggestions
  • executeExtractMemories — extracts facts to MEMORY.md
  • executeAutoDream — autonomous background exploration
// --bare / SIMPLE: skip background bookkeeping
// Scripted -p calls don't want auto-memory or forked agents
// contending for resources during shutdown.
if (!isBareMode()) {
  void executePromptSuggestion(stopHookContext)
  if (feature('EXTRACT_MEMORIES') && isExtractModeActive()) {
    void extractMemoriesModule!.executeExtractMemories(...)
  }
  if (!toolUseContext.agentId) {
    void executeAutoDream(...)
  }
}

8. Token Budget — The Auto-Continue Feature

query/tokenBudget.ts implements an auto-continue feature for the SDK path. When a per-turn token budget is configured, the engine checks after each clean model stop whether the model has "used up" enough of its budget. If not, it injects a nudge message and loops again.

Deep dive: BudgetTracker, thresholds, and diminishing-returns detection
query/tokenBudget.ts (simplified)
const COMPLETION_THRESHOLD = 0.9   // 90% used = done
const DIMINISHING_THRESHOLD = 500  // <500 new tokens = no progress

export function checkTokenBudget(
  tracker: BudgetTracker,
  agentId: string | undefined,
  budget: number | null,
  globalTurnTokens: number,
): TokenBudgetDecision {
  // Subagents and unconfigured budgets never auto-continue
  if (agentId || budget === null || budget <= 0) {
    return { action: 'stop', completionEvent: null }
  }
  const pct = Math.round((globalTurnTokens / budget) * 100) // feeds the nudge text
  // Tokens generated since the previous check (tracker field name simplified here)
  const deltaSinceLastCheck = globalTurnTokens - tracker.tokensAtLastCheck
  const isDiminishing =
    tracker.continuationCount >= 3 &&
    deltaSinceLastCheck < DIMINISHING_THRESHOLD &&
    tracker.lastDeltaTokens < DIMINISHING_THRESHOLD
  // Continue if under 90% of the budget AND not diminishing
  if (!isDiminishing && globalTurnTokens < budget * COMPLETION_THRESHOLD) {
    return { action: 'continue', nudgeMessage: ... }
  }
  return { action: 'stop', ... }
}

The decision logic has two early-stop conditions:

  • Budget exhausted: turn tokens ≥ 90% of budget → stop
  • Diminishing returns: after 3+ continuations, if both the current delta and the previous delta are under 500 tokens → stop (the model is spinning)
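
Tracing the two rules with invented numbers, for a budget of 20,000 tokens:

// Invented numbers for illustration:
// • Budget exhausted: globalTurnTokens = 18_500 ≥ 0.9 × 20_000 → stop.
// • Diminishing returns: continuationCount = 4, and the last two deltas
//   were 320 and 410 tokens (both < 500) → stop at only ~60% of budget,
//   because the model is spinning rather than progressing.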

The nudge message is injected as an isMeta user message so it doesn't appear in the REPL transcript, and the loop continues with transition.reason = 'token_budget_continuation'.

9. Key Takeaways

One loop, many exit reasons

The while(true) in queryLoop exits via a typed Terminal value. Every possible stopping condition — completion, errors, abort, stop hooks, budget — has a named reason.

Generators all the way down

submitMessage, query, queryLoop, queryModel, withRetry, handleStopHooks — all are async function*. This lets the entire stack compose cleanly with yield* and backpressure flows naturally.
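
A toy sketch of that composition (layer names mirror the lesson; bodies are placeholders):

// Each layer re-yields its callee with yield*, so messages flow out and
// backpressure (the consumer's pull rate) flows in without buffering glue.
async function* queryModel(): AsyncGenerator<string> {
  yield 'content_block_delta' // stands in for real stream events
}
async function* queryLoop(): AsyncGenerator<string> {
  yield* queryModel() // one iteration = one model call
}
async function* query(): AsyncGenerator<string> {
  yield* queryLoop()
}
async function* submitMessage(): AsyncGenerator<string> {
  yield* query() // the SDK consumer's pull propagates all the way down
}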

Transcript-first reliability

The user message is written to disk before the API is called. Even a process kill between send and response leaves a resumable session.

Feature-gated dead code elimination

feature('HISTORY_SNIP'), feature('TOKEN_BUDGET'), feature('CONTEXT_COLLAPSE') etc. are evaluated at bundle time by Bun, eliminating unreachable code from external builds and preventing string leakage.

Background effects are fire-and-forget

Memory extraction, prompt suggestions, auto-dream — all are void promises. They must not block the response stream and must not run in bare (-p) mode where resource contention on shutdown matters.

Retry is smarter than exponential backoff

Foreground vs background source routing, fast-mode cooldowns, OAuth refresh, persistent keep-alive, Opus→fallback after 3×529, context-overflow token recalculation — the retry layer is a small state machine, not just a sleep loop.

10. Quiz

Five questions to check your understanding.

1. What is the primary reason QueryEngine.submitMessage() writes the transcript to disk before calling query()?

2. After exactly 3 consecutive 529 (overloaded) errors on a non-custom Opus model, withRetry() throws which error?

3. What does the transition.reason field on the State object represent?

4. In the token budget feature, when does checkTokenBudget() trigger a diminishing-returns early stop?

5. Why does the stop-hooks logic skip running hooks when the last assistant message is an API error (e.g., rate limit or prompt-too-long)?
