Mentoring Was a Soft Skill — AI Made It a Hard Skill

Working with GitHub Copilot to develop software, I was struck by how surprisingly human AI can feel.

When you give Copilot a task, it does not produce a perfect answer in one step. It makes a plan, follows it, checks its own work, notices mistakes, and tries again. This loop of planning, acting, evaluating, and adjusting is the same way humans work.1

For a long time we imagined AI as something that would make no mistakes. Early hallucinations challenged that idea. But with agent-style workflows, the problem becomes manageable in the same way it is with humans. We create checks for correctness, break big problems into smaller pieces, and work around limited memory or context.

We also like to think that humans reason from first principles. In reality, we mostly reuse ideas we have already heard. AI works in a similar way.

The main differences are speed, endurance, and focus. AI does not get tired or distracted.

Working with AI agents also feels similar to delegating work to coworkers. First you make sure you both understand the task. Then you set guardrails so things do not go in the wrong direction. You do not want to micromanage, but you also do not want to discover too late that everything has drifted off course. If you have ever delegated work to a junior colleague, you already have an advantage when working with AI.

In fact, working with AI is teaching techies a new skill: mentoring. What was once a soft skill is now a hard skill.

The unsettling part will come when AI is no longer the junior partner. When Copilot starts taking real initiative and becomes your mentor, what will that look like?

More

https://www.oneusefulthing.org/p/three-years-from-gpt-3-to-gemini?

In Promoting AI Agents, DHH refers to “supervised collaboration” – I’m fine with this wording too.

  1. Note that it’s not that surprising. The agent mode was designed like this by humans; the loop isn’t an emergent property of the LLM. ↩︎

The Great AI Buildout

The ongoing AI buildout has similarities with the railroad expansion of the 19th century. Both are capital-intensive undertakings with the potential to reshape the entire economy. Just as railroads transformed how we navigate physical space, AI is poised to transform how we navigate the information space.1 Railroads were obviously useful, and AI is no different.

During the railway boom, railroads proliferated amid intense competition. Overcapacity was common, some companies went bankrupt, and the industry took years to consolidate. Eventually, railroads became commoditized.

The same dynamics may play out with AI. Semiconductors and datacenters are the tracks and rolling stock. AI applications are the railway companies operating the lines. The coming years will reveal which segments of the AI ecosystem are truly profitable.

At the peak of the railroad era, rail companies accounted for roughly 60 percent of market capitalization. Today, AI makes up about 30 percent of the stock market. Such valuations are only justifiable if AI adoption becomes widespread. For semiconductors and datacenters, this means continuing infrastructure buildout. For AI applications, this means acquiring enough users to finance that growth.

The investment in AI is enormous—around $220 billion per year. But it does not need to replace all labor to be justified. Global labor is about $60 trillion per year, and information work accounts for roughly 10–20 percent of that. By this math, AI only needs to replace 1.8 to 3.7 percent of information work per year to pay off the investment.
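
To make the arithmetic explicit, here is the back-of-the-envelope calculation with the figures quoted above (all of them rough estimates, not precise data):

```typescript
// Back-of-the-envelope check of the figures quoted in the text (rough estimates).
const aiInvestmentPerYear = 220e9;   // ~$220 billion invested in AI per year
const globalLaborPerYear = 60e12;    // ~$60 trillion of global labor per year

for (const share of [0.10, 0.20]) {  // information work: 10–20% of global labor
  const informationWork = globalLaborPerYear * share;
  const requiredReplacement = aiInvestmentPerYear / informationWork;
  console.log(`${share * 100}% info work -> replace ${(requiredReplacement * 100).toFixed(1)}% to break even`);
}
// Prints ~3.7% (if information work is 10% of labor) and ~1.8% (if it is 20%).
```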

At the individual level, that is about one or two days of work saved per information worker per year. With AI agents, improving information work—searching, aggregating, writing, and generating information—is already within reach. This means the current investment is economically justified even if AI only captures a small portion of information work.

  1. The metaphor is not as stretched as it seems. Large language models literally encode information in multi-dimensional vector spaces, computing distances between vectors to find similarities. ↩︎

Working with AI – Second Experiment

My first “experiment” is now 1.5 years old, which feels like a lifetime in AI.

To get a sense of where we are with Copilot today, I decided to revisit that project. The goal was to upgrade the one-page web app into a proper Angular application using TypeScript.

I also took on the “Copilot challenge”: use only Copilot – no manual edits.

Here’s what I learned this time:

Refactoring with AI works
I was able to split code into Angular components, convert interactions to RxJS, move logic around, add proper typing, convert promises to async functions, extract services, remove dead code, and clean up naming. Copilot handled all of this well.
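
As a flavor of what this looked like, here is the kind of promise-to-async conversion Copilot handled in one go. The snippet is a reconstructed sketch with invented names, not the actual project code:

```typescript
// Reconstructed sketch (invented names) of a typical refactoring Copilot handled well:
// converting a promise chain into a typed async function.

// Before
function loadReport(reportService: { fetch(id: string): Promise<string> }) {
  return reportService.fetch('42').then(raw => JSON.parse(raw)).catch(() => null);
}

// After the Copilot-style rewrite
async function loadReportAsync(reportService: { fetch(id: string): Promise<string> }): Promise<unknown> {
  try {
    const raw = await reportService.fetch('42');
    return JSON.parse(raw);
  } catch {
    return null; // same fallback behavior as the promise chain
  }
}
```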

It handles technical code reliably
Parsing data, building user interfaces, caching with local storage, customizing chart behavior, adding compression: Copilot handles all of this very well too. It got confused by API differences across chart library versions, but a real developer might have too. It made UI iteration fast and smooth.
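
The local-storage caching, for instance, was the kind of small utility it produced in one shot. This is a reconstruction for illustration, not the code actually generated in the project:

```typescript
// Reconstructed example of a small local-storage cache helper
// (illustration only, not the code Copilot actually generated).
function getCached<T>(key: string, maxAgeMs: number, load: () => T): T {
  const raw = localStorage.getItem(key);
  if (raw) {
    const cached = JSON.parse(raw) as { timestamp: number; value: T };
    if (Date.now() - cached.timestamp < maxAgeMs) {
      return cached.value; // still fresh, reuse the cached value
    }
  }
  const value = load();
  localStorage.setItem(key, JSON.stringify({ timestamp: Date.now(), value }));
  return value;
}
```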

Agent mode is far better than edit mode
Edit mode often produced code that wouldn’t compile or had broken imports. Agent mode fixed those issues automatically. Not having to think as much about context is also a relief. Using only edit mode first helped me see how much better agent mode is for real-world use.

Feels like working with a junior developer
Copilot gets things done, but may take shortcuts. Sometimes it ties logic too closely to rendering or makes structural choices that aren’t ideal. It helps, but you still need to guide it.

The code quality is generally good
Its output is usually clean, readable, and idiomatic. Not always how I’d write it, but solid. It consistently handles edge cases like null values. Over time, though, consistency degrades. One feature might follow one style, another a different one. You still need to set and enforce coding standards. Comments appear inconsistently, sometimes helpful, sometimes missing.

Mixed experience with CSS styling
Copilot is good at suggesting layout ideas, but maintaining a consistent visual style across the app was difficult.

It can write basic business logic
If your specification is clear and specific, it can generate useful logic. But for the code to match your expectations, you need to put in the effort to write a precise specification. I didn’t investigate generating the corresponding tests; that is definitely a subject to explore further.

Temporal data types remain a weak spot
Handling dates was frustrating. I started by converting strings to date objects, thinking it would be more robust. But JavaScript’s Date isn’t well suited for this. Copilot didn’t flag the issue. Only later, when I asked directly, did it suggest sticking with strings. It often confused timestamps and date objects.
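
A simplified example of the kind of trap involved, assuming date-only values like “2024-03-10” (not the actual data from the project):

```typescript
// Parsing a date-only ISO string yields a Date at midnight UTC...
const d = new Date('2024-03-10');
console.log(d.getTime()); // a millisecond timestamp, not a calendar date

// ...so formatting it in a timezone west of UTC can show the previous day.
console.log(d.toLocaleDateString('en-US', { timeZone: 'America/New_York' })); // "3/9/2024"

// Keeping plain ISO strings sidesteps the issue for date-only data.
const day = '2024-03-10';
console.log(day === '2024-03-10'); // true, no timezone involved
```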

Data manipulation is a strong point
Tasks like changing the structure of JSON files or merging them worked well. No major issues here.
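
A sketch of what these tasks typically looked like, with invented file names and record shapes:

```typescript
// Sketch with invented file names and shapes: merge two JSON files keyed by id,
// letting the newer file take precedence.
import { readFileSync, writeFileSync } from 'fs';

type Entry = { id: string; value: number };

const older: Entry[] = JSON.parse(readFileSync('data-2023.json', 'utf8'));
const newer: Entry[] = JSON.parse(readFileSync('data-2024.json', 'utf8'));

const merged = new Map<string, Entry>();
for (const entry of [...older, ...newer]) {
  merged.set(entry.id, entry); // later entries overwrite earlier ones
}

writeFileSync('data-merged.json', JSON.stringify([...merged.values()], null, 2));
```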

Copilot enables much faster iteration, significantly lowering the cost of programming and shifting the work towards software design. The improved reliability of agent mode compared to edit mode provides major cognitive relief. Iterating through chat, or even “negotiating” a solution before asking Copilot to implement it, feels fundamentally different from classic development.

Programming is an activity that taxes your short-term memory. Usually, if I have less than half an hour, I won’t engage in programming: it’s too short a window to switch context and produce working code. Something interesting happened during this second experiment: even with only 15–20 minutes, I could quickly try out a new idea with Copilot.

There’s no question that it’s a more productive way to work.

Stop Writing Code: The Copilot Challenge

What if the best way to master Copilot… was to stop writing code yourself?

Seriously. For one week, try this: don’t type a single line of code manually. Not a function, not a fix, not even a variable rename. Use only Copilot’s edit panel or inline chat. Prompt it. Guide it. Iterate with it until it gives you exactly what you want.

At first, it’ll feel terrible. You’ll see a bug and want to just fix it. You’ll want to write the obvious helper function. You’ll feel slower. But that discomfort is exactly what makes this powerful. It forces you to get better at communicating with Copilot. You have to break down problems into clear steps. You have to give proper context. You have to think like a teacher, not just a doer.

And without even realizing it, your coding style starts to shift. You begin naming things more clearly so you can refer to them easily in prompts. You start structuring logic in ways that are easier to explain and reuse. Just like writing testable code improves architecture, writing code that’s “promptable” improves clarity and design. You’re not just optimizing for the machine — you’re learning to write code that explains itself.

You can still use inline chat to nudge Copilot: “Add a loop that groups items by category.” You can even dictate code word-for-word if you really want to. But that’s tedious, and you’ll quickly realize it’s easier to just get better at prompting.
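
For the record, a prompt like “add a loop that groups items by category” typically yields something along these lines (a sketch, not Copilot’s literal output):

```typescript
// Sketch of what a "group items by category" prompt typically produces.
interface Item {
  name: string;
  category: string;
}

function groupByCategory(items: Item[]): Map<string, Item[]> {
  const groups = new Map<string, Item[]>();
  for (const item of items) {
    const bucket = groups.get(item.category) ?? [];
    bucket.push(item);
    groups.set(item.category, bucket);
  }
  return groups;
}
```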

This constraint becomes a forcing function. You stop relying on your muscle memory and start building Copilot fluency. You learn when to rephrase, when to simplify, when to break things down smaller. And as you do, you become dramatically more productive — not just faster, but more thoughtful.

Now imagine turning this into a game. A leaderboard that says “It’s been 6 days since I last wrote a line of code myself.” Friendly team competition. Badges like “100% Copilot PR” or “Zero Typing Tuesday.” A culture shift that makes learning fun — and a little bit addictive.

This isn’t about replacing developers. It’s about training ourselves to think differently. To move from writing code to shaping it. To go from typing to directing. And in the process, to write code that’s not just correct — but clear, composable, and ready for collaboration with humans and machines alike.

So give it a shot. No code writing. Just Copilot. See how far you can go — and what you learn along the way.

Climbing the Abstraction Ladder with LLMs

Software engineering has always been about raising the level of abstraction.

We started with assembly language, then moved to procedural programming (Pascal), followed by structured programming (C). Object-oriented programming (C++) introduced encapsulation, while managed memory (Smalltalk) improved reliability. Later, portability became a focus with language runtimes like Java and C#.

This progression has been captured through the concept of programming language generations—1st, 2nd, 3rd, 4th, and 5th generations. The 4th generation was envisioned as a major shift: instead of manually managing implementation details like data structures and persistence, developers would write high-level specifications, and the language or environment would handle the rest. The 5th generation would rely on automated problem-solving. However, despite various attempts, mainstream programming remains at the 3rd generation, and 4th- or 5th-generation programming has yet to materialize—until now.

Large language models (LLMs) might be the first real step toward achieving this vision. For the first time, we can provide human-level specifications and let an AI generate working code. LLMs attempt to resolve ambiguities and infer intent—something they do surprisingly well. Interestingly, generating code that strictly adheres to a precise specification might turn out to be the challenge.

Right now, with tools like Copilot, we use LLMs “in the small” to generate local sections of code. But what happens when we use them “in the large”—to generate most of a system? We may soon be able to integrate architecture documentation, business specifications, and UX mockups to produce functional software, but how such documentation should be structured remains an open question. As a colleague once put it, we will need an entirely new “theory of software documentation” for the age of LLMs.

This shift also echoes some of the ideas behind literate programming, introduced by Donald Knuth, which aimed to make code more readable by structuring it like natural language text. Throughout the history of software engineering, the primary medium for expressing and evolving programs has been text. LLMs, trained on vast amounts of textual data, take this evolution to its next logical step, blurring the lines between documentation and implementation.

Between small-scale and large-scale code generation, one particularly interesting application of LLMs is refactoring. Automating local modifications across large codebases has traditionally been handled by tools like OpenRewrite. However, writing precise transformation rules is tedious. Given LLMs’ ability to extrapolate from examples, they seem promising in this area.
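
One could imagine, for instance, giving the model a single before/after pair instead of a hand-written transformation rule and asking it to apply the same change across the codebase. A hypothetical illustration (the logging API and the transformation are made up for the example):

```typescript
// Hypothetical illustration of refactoring by example with an LLM
// (the logging API and the transformation are invented).
const logger = { log: (s: string) => console.log(s), info: (s: string) => console.info(s) };
const user = { id: 'u-1' };
const order = { trackingNumber: 'T-42' };

// The before/after pair shown to the model as the "rule by example":
logger.log('user created: ' + user.id);   // before
logger.info(`user created: ${user.id}`);  // after

// The model is then asked to apply the same change to code it has never seen:
logger.log('order shipped: ' + order.trackingNumber);   // before
logger.info(`order shipped: ${order.trackingNumber}`);  // expected after
```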

With large-scale code generation, it’s still unclear how much of our code will be AI-generated in the future and how much will require manual intervention. It’s also unclear how we will manage and distinguish between the two. Traditional methods rely on inheritance or annotations to link generated and manually written code. LLMs, however, are not bound by these constraints. They can generate, rework, or extend code seamlessly. This isn’t just a technical problem but a methodological one.

The foundations—both technical and methodological—for engineering software with LLMs are still being developed. The best way forward is through experimentation and collective learning. The software industry is already embarking on this journey, and it’s exciting to be part of such a profound shift.

Understanding ChatGPT

ChatGPT has surprised everyone. We now have systems that produce human-like texts. Without much fanfare, ChatGPT actually passed the Turing test.

While we don’t fully comprehend how and why large language models work so well, they do. Even if we don’t fully understand them, it’s worth building an intuition about how they function. This helps avoid misunderstandings about their capabilities.

In essence, ChatGPT is a system that learns from millions of texts to predict sentences. If you start a sentence, it tries to predict the next word. Taking the sentence augmented with one word, it tries again to predict the next word. This way, word after word, it can complete sentences or write whole paragraphs.

Interestingly, the best results are achieved by introducing randomness into the process. Instead of always selecting the most probable word, it’s best to sometimes pick alternatives with lower probabilities. This makes the sentences more interesting and less repetitive.
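
A minimal sketch of this generation loop, using a toy bigram table as a stand-in for the neural network (a real model computes the probabilities from the whole context):

```typescript
// Toy sketch of autoregressive generation: predict a distribution over the next
// word, sample one (with some randomness), append it, repeat.
// The bigram table is only a stand-in for the neural network.
const bigrams: Record<string, Record<string, number>> = {
  the: { cat: 0.5, dog: 0.3, sun: 0.2 },
  cat: { sat: 0.6, ran: 0.4 },
  dog: { ran: 0.7, slept: 0.3 },
  sat: { quietly: 1.0 },
  ran: { away: 1.0 },
};

function nextWordProbabilities(context: string[]): Record<string, number> {
  const last = context[context.length - 1];
  return bigrams[last] ?? { '.': 1.0 };
}

// Temperature below 1 sharpens the distribution (more greedy);
// above 1 it flattens it (more random).
function sampleWord(probs: Record<string, number>, temperature: number): string {
  const rescaled = Object.entries(probs).map(([w, p]) => [w, Math.pow(p, 1 / temperature)] as const);
  const total = rescaled.reduce((sum, [, p]) => sum + p, 0);
  let r = Math.random() * total;
  for (const [word, p] of rescaled) {
    r -= p;
    if (r <= 0) return word;
  }
  return rescaled[rescaled.length - 1][0];
}

function generate(prompt: string, maxWords: number, temperature = 0.8): string {
  const words = prompt.split(' ');
  for (let i = 0; i < maxWords; i++) {
    const probs = nextWordProbabilities(words);   // one "pass of the network"
    words.push(sampleWord(probs, temperature));   // not always the most probable word
  }
  return words.join(' ');
}

console.log(generate('the', 4)); // e.g. "the cat sat quietly ."
```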

What’s also kind of amazing is that this approach works to answer questions. If you start with a question, it tries to predict a reasonable answer.

Thinking of ChatGPT as a text predictor is useful, but it’s even more useful to think of it as a form of compression. When neural networks learn from examples, they try to identify regularities and extract recurring features. The learning is lossy: the neural network doesn’t remember exactly the examples it was fed, but it remembers their key features. When ChatGPT generates text, it “interpolates” between these features.

Impressive examples of “interpolation” are prompts that mandate an answer “in the style of,” for instance, “in the style of a poem.” ChatGPT not only gives a coherent answer content-wise but also applies a given style.

But ChatGPT is, in essence, interpolating all the time. It’s like a clever student who didn’t study a topic for the course but has access to the course material during the exam. The student may copy-paste elements of the answer and tweak the text to sound plausible, without having any real understanding of the matter.

What ChatGPT shows us is that you can go very far without a true understanding of anything. And I believe this applies to how we behave too. On many topics, we can hold a conversation based on facts we have heard, without actually understanding the underlying subject. Many people who sound smart are very good at regurgitating things they learned somewhere. They wouldn’t necessarily be particularly good at reasoning about a topic from first principles. To a certain degree, we conflate memory with intelligence.

At the same time, ChatGPT can do some reasoning, at least simple reasoning. It has probably extracted features that capture some logic. This works for simple things like basic arithmetic, but it fails when things become more complicated.

Fundamentally, when predicting the next word, ChatGPT does one pass of the neural network, which is essentially one function evaluation. A single pass cannot perform a computation that would involve, for instance, a loop. It fails the prompt “Starting with 50, double the number 4 times. Do not write intermediate text, only the end result.”, giving 400 back. But asked to write the intermediate steps, it correctly computes 800. You can nudge ChatGPT into multi-step computation by asking it to write the intermediate steps, because then it goes through the neural net several times. This pattern is known as “chain of thought prompting.”
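
For reference, the intermediate steps that chain-of-thought prompting makes the model write out, and that a single forward pass cannot iterate through, are simply:

```typescript
// Doubling 50 four times: the intermediate steps ChatGPT gets right
// only when asked to write them out.
let n = 50;
for (let i = 0; i < 4; i++) {
  n *= 2;
  console.log(n); // 100, 200, 400, 800
}
```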

We don’t fully understand ChatGPT yet—how it works and what it really can do. But clearly, it can do more than we expected, and it will bring all kinds of exciting insights about cognition.


What Is It Like to Be a Robot?

In “Metazoa”, Peter Godfrey-Smith explores the rise of consciousness in animals – from simple multicellular organisms to vertebrates like us.

Consciousness is a concept that’s not so easy to capture. It’s about a sense of self, a perception of the environment and oneself, a subjective experience of the world. When does an animal qualify as conscious? Godfrey-Smith postulates that consciousness is a spectrum, not something one either has or doesn’t. The analogy he uses is sleep, or the state right after waking up: we are conscious, but at a different level of consciousness than when fully awake.

The nature of consciousness can be explored by taking extreme positions:

  • can you be conscious without any perception of the environment (a “pure mind”)?
  • does reacting to what happens around you without any emotion qualify as conscious?
  • do you need to have a nervous system and feel pain to be conscious, or is having a mood enough?
  • could you be conscious, but act indistinguishably from an unconscious animal?

I would have described consciousness as being aware of one’s own existence, something related to mortality, and rather binary. Godfrey-Smith equates consciousness more with having a sense of self and feelings, which makes it less sharply demarcated. He uses consciousness more like “awareness”, whereas I would use it more like “self-awareness”. (That said, maybe even self-awareness isn’t so binary. Between being aware of deadly dangers and being aware of your own existence, it’s hard to say when we transition from instinct to consciousness.)

The book focuses on the relationship between senses and consciousness. Godfrey-Smith explains how various animals sense the world and what kind of consciousness they might have. Some animals have antennae (shrimp), some have tentacles (octopuses), some sense water pressure (fish). Many animals have vision, but the eye structure can differ. Some animals feel pain (mammals, fish, molluscs), but some don’t (insects) – it’s however not so easy to define when pain is felt or not. Not feeling pain doesn’t mean the animal is unaware of body damage, just as you don’t feel pain for your car but notice very well when something breaks while driving.

The book reminded me of “What Is It Like to Be a Robot?” by Rodney Brooks. This article, unsurprisingly, references Godfrey-Smith’s previous book, “Other Minds”. Brooks draws parallels between the perception of the octopus and artificial intelligence systems. Many of the questions Godfrey-Smith raises about the animal world can indeed be translated directly to the digital world. Computer systems have sensors, too. They have rules to react to inputs and produce outputs. They can learn and remember things, and develop an individual “subjective” perception of the world. They don’t “feel” pain, but they can be aware of malfunctions in their own system. Does this qualify as a very limited form of consciousness?

The book touches on the question of artificial intelligence at the end, but only superficially. Rather than wondering whether an artificial intelligence could be conscious, Godfrey-Smith focuses on refuting the possibility of human-like artificial intelligence. His argument is basically that neural networks model only a subset of the brain’s physical and chemical processes and thus can’t match human intelligence (there is more at play in the brain than synapse firing). He also argues that an emulation of these processes still wouldn’t cut it, since it wouldn’t be the real stuff.

Artificial intelligence will not have a human-like intelligence, though. Each system, biological or digital, has its own form of intelligence. Because he anthropomorphizes artificial intelligence, Godfrey-Smith doesn’t explore the avenue of consciousness in AI systems much further. This is unfortunate, because with his consciousness-as-spectrum approach, it would have been an interesting discussion.

Wording Matters: Principles vs Practices

It struck me when reading Scaling the Practice of Architecture that people often use the term “principle” in a sloppy way:

There is a great deal I could write here about bad architectural principles but I’ll stick to the key aspects. Firstly, they are not practices. Practices are how you go about something, such as following TDD, or Trunk Based Delivery, or Pair Programming. This is not to say that practices are bad […] they’re just not architectural principles.

I’ve probably been using the term in the wrong way more than once. Principles don’t tell you exactly how to do something. They are just criteria for evaluating decisions: all things being equal, take the decision that fulfills the principle the most. Examples of well-known design principles are, for instance:

  • Single-responsibility principle
  • Keep it simple, stupid
  • Composition over inheritance

A practice, on the other hand, is a way of doing something. Examples of practices are:

  • Pair Programming
  • Shift left with CI/CD
  • Limit Work in Progress (WIP)

A lot of documents confuse the two. For instance, the SAFe Lean-Agile principles are actually mostly practices.

It could look like principles are for software design and practices are for software delivery. But you can have principles for software delivery, too. For instance, “maximize autonomy” could be a delivery principle. It doesn’t tell you how. It just tells you that if you have two options to design the organization, you should go with the one that maximizes autonomy. On the other hand, a software design practice could be to “model visually”.

Another confusion in this area comes from a term similar to principles and practices: values. A value is a judgment of what we consider important. Usually values describe behaviors and are adjectives (though “profit” could be a value and isn’t an adjective). “Autonomy”, for instance, could be a value. A value implicitly embodies the principle of favoring it over others. For instance, if you value “autonomy”, you will automatically follow the principle “maximize autonomy”. If you adhere to a value, the corresponding principle comes for free.

Finally, there are “conventions” and “guidelines”. Conventions tell you exactly how to do things and are mandatory. You can check whether you adhere to a convention or not. This is unlike principles or practices, which leave room for interpretation. A guideline is like a convention, but optional. Examples of conventions or guidelines are:

  • Interfaces are versioned
  • Sanitize all inputs
  • Limit WIP to 3

Using a full example of value/principle/practice/guideline within one area, we could have:

  • value: resilience
  • principle: tolerate failures
  • practice: chaos testing
  • guideline: use tolerant reader

Granted, no matter how we try to distinguish the terms from one another, there will be some overlap in some cases. Natural language is messy. But I think it’s worth using the terms in the most appropriate way possible. It helps create a mental model that works. If you mix practices, principles, values, and guidelines together, people might not notice immediately, but it creates a cognitive friction that makes it harder to actually apply the underlying ideas.

The Superpower of Framing Problems

Some problems we work on are concrete. They have a clear scope, and you know exactly what has to be solved. Sometimes, however, the problems we need to address are muddy or unclear.

When something used to work but doesn’t work anymore, the problem is clearly framed: the thing is broken and must be repaired. However, if you have something like a “software quality problem”, the problem isn’t clearly framed. Quality takes many forms. It’s unclear what you have to solve.

To explore solutions, you first need to frame the problem in a meaningful way. With this frame in place, you can explore the solution space and check how well the various solutions solve the problem. Without a proper frame, you might not even be able to tell when you have solved your problem, because the problem is defined in such a muddy way.

The “quality problem” mentioned previously could be reframed more precisely, for instance, as a problem of reliability, usability, or performance. It could also be framed in terms of the number of tickets opened per release, or the time it takes to resolve tickets.

Depending on how you frame your problem, you will find different solutions. Using the wrong frame limits the solution space, or in the worst case, means you will solve the wrong problem. It’s worth investing the time to understand the problem and frame it correctly.

If I had an hour to solve a problem I’d spend 55 minutes thinking about the problem and five minutes thinking about solutions. – Albert Einstein

So far I’ve talked about framing problems. Framing, however, also works in a broader sense and can be used whenever there is a challenge or an open question. Whenever you have to come up with a solution, there is some framing going on.

Something interesting about framing is that, in itself, it isn’t about proposing a solution. It’s about framing the solution space. As such, people are usually quite open to reframing problems or exploring new frames. Whereas proposing solutions can lead to heated discussions, when it’s only about framing, the friction with other people is usually pretty low. While framing in itself is not a solution, it does impact the solutions you will find. When people don’t agree on a solution, they usually have different implicit frames for the problem. Working on understanding the frames is sometimes more productive than debating the solutions themselves.

A second interesting thing about framing is that you don’t need to be an expert in the solution to help frame problems. You need to be an expert in the solution space, but not in the actual solution. Going back to the example of the “software quality problem”, you can help with framing if you know about software delivery in general. You don’t need to be a cloud expert or a process expert. This means that good framing skills are more transferable than skills about specific solutions.

I wrote a long time ago about using breadth & depth to assess whether a thesis is good. In essence, this is a specific frame for the problem of thesis quality. Finding good frames helps in many other cases. Framing problems is a great skill to learn.

SAFe: Systems Thinking

I was pleasantly surprised to see Systems Thinking as principle #2 in SAFe. I recently came in contact with systems thinking when reading Limits to Growth, which explores the feedback loops in the global economy. Donella Meadows is also the author of Thinking in Systems, which addresses more generally how to understand complex system dynamics through such feedback loops (the book is on my to-read list).

This is the definition of systems thinking according to SAFe:

Systems thinking takes a holistic approach to solution development, incorporating all aspects of a system and its environment into the design, development, deployment, and maintenance of the system itself.

It’s quite general. But arguably, there isn’t one definition of systems thinking. If you read Tools for Systems Thinkers, the study of feedback loops is only one aspect of systems thinking. The more general theme is to understand the “interconnectedness” of the elements in the system.

A system is a set of related components that work together in a particular environment to perform whatever functions are required to achieve the system’s objective. – Donella Meadows

Principle #2 in SAFe is about realizing that not only the solution but also the organization is a complex system that benefits from systems thinking.

Interestingly, Large Scale Scrum (LeSS) also has systems thinking as a principle. It’s more concrete than the equivalent principle in SAFe: the emphasis is on seeing system dynamics, especially with causal loop diagrams. The LeSS article is a very good introduction to such diagrams and includes an example of a very simple causal loop diagram.

I like the emphasis on actively visualizing system dynamics:

The practical aspect of this tip (NB: visualizing) is more important than may first be appreciated. It is vague and low-impact to suggest “be a systems thinker.” But if you and four colleagues get into the habit of standing together at a large whiteboard, sketching causal loop diagrams together, then there is a concrete and potentially high-impact practice that connects “be a systems thinker” with “do systems thinking.”

The idea is that only when you start visualizing the system dynamics do you also start understanding the mental models that people have, and only then can you start discussing improvements.

I like the more concrete way LeSS addresses systems thinking compared to SAFe. Recently, I discussed a cultural issue related to know-how sharing with our RTE. Using a causal loop diagram would have been a very good vehicle to brainstorm about the problem. I think I will borrow the tip from LeSS and start sketching such diagrams during conversations.