Thinking and Retrieval in the AI Stack

The modern AI stack can be viewed as three logical layers: Compute, Model, and Agent. While thinking (reasoning) and retrieval (fetching external information) may span multiple layers, separating them helps us reason about trade‑offs such as latency, observability, and cost.

| Layer | Primary Function | Typical Primitive |
| --- | --- | --- |
| Compute | Executes the heavy‑weight operations that power the stack. | GPU/TPU kernels for transformer forward passes; ANN‑search kernels; CPU‑based HTTP calls to external services. |
| Model | Performs core reasoning and, optionally, internal retrieval. | Native chain‑of‑thought – the model generates step‑by‑step reasoning within a single forward pass (e.g., “Let’s think step‑by‑step…”). Built‑in retriever – the model invokes a search tool (e.g., GPT‑4o browsing, Claude “search”, Gemini grounding) and conditions its output on the returned snippets. |
| Agent | Orchestrates complex workflows, decides when to call the model, and handles external data sources. | Agent‑orchestrated reasoning – the agent decomposes a problem, builds prompts, may run meta‑reasoning loops, and determines when to invoke the model again. External retrieval – the agent queries a vector store, a web‑search API, or any custom data source, then injects the retrieved passages into the next model prompt. |
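To make the Agent row concrete, here is a minimal sketch of agent‑orchestrated external retrieval: the agent searches first, then injects the retrieved passages into the next model prompt. Every name here (`vector_search`, `call_model`) is a hypothetical stand‑in for whatever vector store and model endpoint a real system would use, not any particular library's API.

```python
from typing import List


def vector_search(query: str, k: int = 3) -> List[str]:
    """Hypothetical retrieval step: query a vector store or web-search API."""
    # A real implementation would embed `query` and run an ANN search
    # (Compute layer), returning the top-k passages.
    return [f"[stub passage {i} for: {query}]" for i in range(k)]


def call_model(prompt: str) -> str:
    """Hypothetical Model-layer call, e.g. an HTTP request to a chat endpoint."""
    return f"[stub answer conditioned on: {prompt[:60]}...]"


def agent_answer(question: str) -> str:
    """Agent-orchestrated retrieval: search first, then prompt the model once."""
    passages = vector_search(question)                 # external retrieval
    context = "\n".join(f"- {p}" for p in passages)
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return call_model(prompt)                          # one model invocation


if __name__ == "__main__":
    print(agent_answer("Where should retrieval live in an agentic system?"))
```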

Whether thinking or retrieval happens in the Model layer or the Agent layer has concrete implications for latency, control, resource use, and observability.

| Dimension | Thinking – Model | Thinking – Agent | Retrieval – Model | Retrieval – Agent |
| --- | --- | --- | --- | --- |
| Latency | One forward pass → minimal overhead (unless the model also does internal search). | Multiple orchestrated calls → higher latency, but sub‑tasks can run in parallel. | Single endpoint (e.g., POST /v1/chat/completions with built‑in tool) → low latency. | Two‑step flow (search → prompt → model) → added round‑trip time, but can parallelise search with other work. |
| Control / Policy | Model decides autonomously when to fetch external data → harder to audit or enforce policies. | Agent mediates every external call → straightforward throttling, redaction, logging, and policy enforcement. | Retrieval baked into the model → policy changes require a new model version. | Agent can enforce dynamic policies (rate limits, content filters) on each external request. |
| Resource Use | GPU must handle both inference and any ANN‑search kernels; higher compute density. | Retrieval can be off‑loaded to cheaper CPUs or dedicated search services; GPU used mainly for inference. | GPU handles only inference; no extra search kernels needed. | CPU or specialised search services handle retrieval, freeing GPU capacity for inference. |
| Observability | Reasoning is embedded in the token stream → debugging is indirect; limited visibility. | Agent logs each sub‑task, providing a clear, structured trace of why and when calls were made. | Limited visibility beyond token usage; retrieval is opaque to the caller. | Agent records search queries, responses, and any filtering applied, giving end‑to‑end traceability. |
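The agent‑side advantages in the Control / Policy and Observability rows follow from one fact: every external call passes through agent code, so it can be throttled, filtered, and logged in a single place. Below is a rough sketch of such a wrapper, assuming a hypothetical `guarded_search` function and toy policy rules; none of it is a real library API.

```python
import logging
import time
from typing import Callable, List

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent.retrieval")

BLOCKED_TERMS = {"password", "ssn"}   # toy content filter
MIN_INTERVAL_S = 0.5                  # toy rate limit between external calls
_last_call = 0.0


class PolicyError(Exception):
    """Raised when a retrieval request violates an agent-side policy."""


def guarded_search(query: str, search_backend: Callable[[str], List[str]]) -> List[str]:
    """Wrap any search backend with rate limiting, content filtering, and logging."""
    global _last_call

    # Control / Policy: dynamic rules applied per request, no model change needed.
    if any(term in query.lower() for term in BLOCKED_TERMS):
        raise PolicyError(f"query rejected by content filter: {query!r}")
    wait = MIN_INTERVAL_S - (time.monotonic() - _last_call)
    if wait > 0:
        time.sleep(wait)
    _last_call = time.monotonic()

    # Observability: a structured trace of what was asked and what came back.
    log.info("retrieval query: %s", query)
    results = search_backend(query)
    log.info("retrieval returned %d passages", len(results))
    return results


if __name__ == "__main__":
    print(guarded_search("agent observability", lambda q: [f"[stub result for {q}]"]))
```

Enforcing the same rules inside a model with built‑in retrieval would, as the table notes, require shipping a new model version rather than editing agent code.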

If you are building an agentic system, you’ll need to decide which responsibilities belong to the model and which to the agent.
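One way to frame that decision is as a single switch the agent flips per request. The sketch below is purely illustrative: `call_model`, `search`, and `build_prompt` are hypothetical callables supplied by the caller, and the `use_builtin_search` keyword stands in for however a given provider exposes its built‑in search tool.

```python
from enum import Enum
from typing import Callable, List


class RetrievalMode(Enum):
    MODEL_SIDE = "model"   # one endpoint call; the provider's search tool does the work
    AGENT_SIDE = "agent"   # the agent searches first, then prompts the model


def answer(
    question: str,
    mode: RetrievalMode,
    call_model: Callable[..., str],
    search: Callable[[str], List[str]],
    build_prompt: Callable[[str, List[str]], str],
) -> str:
    """Route the request according to where retrieval responsibility lives."""
    if mode is RetrievalMode.MODEL_SIDE:
        # One call; lowest latency, but policy and observability sit with the provider.
        return call_model(question, use_builtin_search=True)
    # Two steps; an extra round-trip, but the agent controls and logs the search.
    passages = search(question)
    return call_model(build_prompt(question, passages))
```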


If you are merely a user of such a system, the distinction is mostly invisible, showing up only as differences in answer quality, latency, and cost.
