The modern AI stack can be viewed as three logical layers: Compute, Model, and Agent. While thinking (reasoning) and retrieval (fetching external information) may span multiple layers, separating them helps us reason about trade‑offs such as latency, observability, and cost.
| Layer | Primary Function | Typical Primitive |
|---|---|---|
| Compute | Executes the heavyweight operations that power the stack. | GPU/TPU kernels for transformer forward passes; ANN‑search kernels; CPU‑based HTTP calls to external services. |
| Model | Performs core reasoning and, optionally, internal retrieval. | • Native chain‑of‑thought – the model generates step‑by‑step reasoning within a single forward pass (e.g., “Let’s think step‑by‑step…”). • Built‑in retriever – the model invokes a search tool (e.g., GPT‑4o browsing, Claude “search”, Gemini grounding) and conditions its output on the returned snippets. |
| Agent | Orchestrates complex workflows, decides when to call the model, and handles external data sources. | • Agent‑orchestrated reasoning – the agent decomposes a problem, builds prompts, may run meta‑reasoning loops, and determines when to invoke the model again. • External retrieval – the agent queries a vector store, a web‑search API, or any custom data source, then injects the retrieved passages into the next model prompt. |
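The agent‑side external retrieval described above can be sketched as a small retrieve‑then‑prompt loop. The scoring function and prompt template below are illustrative stand‑ins: a real system would use embedding similarity against a vector store and a production prompt format, not keyword overlap.

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the top-k passages for a query.

    Keyword-overlap scoring is a stand-in for a real ANN search
    over embeddings in a vector store.
    """
    q_terms = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda p: len(q_terms & set(p.lower().split())),
        reverse=True,
    )
    return scored[:k]


def build_prompt(query: str, passages: list[str]) -> str:
    """Inject the retrieved passages into the next model prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


corpus = [
    "GPUs execute transformer forward passes.",
    "Vector stores support approximate nearest-neighbour search.",
    "Agents orchestrate workflows and decide when to call the model.",
]
query = "What do agents do?"
prompt = build_prompt(query, retrieve(query, corpus))
```

The key point is the two‑step flow: the agent performs the search itself and conditions the model on the result, rather than relying on a retriever baked into the model.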
Whether thinking or retrieval happens in the Model layer or the Agent layer has concrete implications for latency, control, resource use, and observability.
| Dimension | Thinking – Model | Thinking – Agent | Retrieval – Model | Retrieval – Agent |
|---|---|---|---|---|
| Latency | One forward pass → minimal overhead (unless the model also does internal search). | Multiple orchestrated calls → higher latency, but sub‑tasks can run in parallel. | Single endpoint (e.g., POST /v1/chat/completions with built‑in tool) → low latency. | Two‑step flow (search → prompt → model) → added round‑trip time, but can parallelise search with other work. |
| Control / Policy | Model decides autonomously when to fetch external data → harder to audit or enforce policies. | Agent mediates every external call → straightforward throttling, redaction, logging, and policy enforcement. | Retrieval baked into the model → policy changes require a new model version. | Agent can enforce dynamic policies (rate limits, content filters) on each external request. |
| Resource Use | GPU must handle both inference and any ANN‑search kernels; higher compute density. | Retrieval can be off‑loaded to cheaper CPUs or dedicated search services; GPU used mainly for inference. | GPU handles only inference; no extra search kernels needed. | CPU or specialised search services handle retrieval, freeing GPU capacity for inference. |
| Observability | Reasoning is embedded in the token stream → debugging is indirect; limited visibility. | Agent logs each sub‑task, providing a clear, structured trace of why and when calls were made. | Limited visibility beyond token usage; retrieval is opaque to the caller. | Agent records search queries, responses, and any filtering applied, giving end‑to‑end traceability. |
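The Control/Policy and Observability advantages of agent‑side retrieval follow from a simple pattern: the agent mediates every external call through a gateway that throttles, redacts, and logs. The sketch below is a hypothetical illustration of that pattern; the class name, the `SECRET` redaction rule, and the rate limit are invented for the example, not part of any real framework.

```python
import time


class RetrievalGateway:
    """Hypothetical agent-side mediator for external search calls.

    Every call is rate-limited, content-filtered, and recorded in a
    structured trace, which is what makes agent-side retrieval easy
    to audit compared with retrieval baked into the model.
    """

    def __init__(self, search_fn, max_calls: int = 5):
        self.search_fn = search_fn   # underlying search backend (assumed)
        self.max_calls = max_calls   # simple per-session rate limit
        self.calls = 0
        self.trace = []              # structured log of every call

    def _redact(self, query: str) -> str:
        # Stand-in content filter: mask one known-sensitive token.
        return query.replace("SECRET", "[REDACTED]")

    def search(self, query: str):
        if self.calls >= self.max_calls:
            raise RuntimeError("rate limit exceeded")
        self.calls += 1
        safe_query = self._redact(query)
        results = self.search_fn(safe_query)
        self.trace.append({
            "ts": time.time(),
            "query": safe_query,
            "n_results": len(results),
        })
        return results
```

Because every query passes through `search`, a policy change (a new filter, a tighter limit) is a one‑line code change in the agent, whereas changing retrieval behaviour inside the model would require shipping a new model version.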
If you are building an agentic system, you’ll need to decide which responsibilities belong to the model and which to the agent.
If you are merely a user of such a system, the distinction is mostly invisible; it surfaces only as differences in answer quality, latency, and cost.