On The Surprising Power of Self-reflection in Large Language Models
In which we ponder the ability of LLMs to correct themselves when told their mistakes, and how this forces us to change how we approach them as tools.
I use LLMs for coding almost every day now and I feel like I’ve started to understand their strengths and weaknesses. One thing I don’t do is any prompt engineering: I just ask the LLM to write me some code to accomplish a specific task, in the fewest words I can use to describe it. I do it the way I would if I were communicating with a human assistant who was there at my command and had read the entire internet (even if it didn’t understand much of it and had no idea of its own poor understanding).
With this mental model, it is often necessary to steer the conversation toward what I need, because the first interaction is often not good enough to get me what I was after. This usage style comes, admittedly, out of laziness on my part, but it has forced me into a posture of dialog with the LLM which has revealed surprising advantages.
Specifically, it seems that LLMs generally benefit from being told they were wrong about something. They will generate some code, but more often than not it doesn’t work out of the box. I noticed that if I feed back criticism or information on why the code doesn’t do the job (or even things like compiler errors or runtime stack traces), the LLM is able to improve its output.
This process gets called “self-reflection” these days, although I really don’t think any “reflection” is happening: I believe what’s happening is that we are augmenting the context with more information, and this appears to nudge the LLM into a separate output valley. We could call it “scrappy interactive RAG”. The surprising part is that this seems to work better than any prompt engineering I can do ahead of time.
The other day, for example, I wanted to see if GPT4 was able to derive the schema from a JSON object that represents the player state in our game. This is data that has been growing rather organically over time, so writing a schema by hand to bring in some post-hoc validation is a tedious and repetitive task. LLMs seem made specifically for tedious and repetitive tasks involving big sequences of text, so I pasted a big chunk of this state into GPT4 (no private information was included!) and asked it to make me a JSONSchema that would validate it.
It did what it was asked, but I didn’t trust it: I wanted to verify that it worked.
So I asked it to generate some Python code that would let me pass in the JSONSchema and the JSON object and have it validated. The script worked out of the box but the schema did not: the data wasn’t valid according to the schema.
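For reference, a minimal version of that kind of validation script looks roughly like this (using the jsonschema Python library; the file names here are placeholders, not the ones from my session):

```python
# Minimal sketch: validate a JSON document against a JSONSchema.
import json
from jsonschema import validate, ValidationError

with open("player_state.schema.json") as f:   # placeholder file name
    schema = json.load(f)

with open("player_state.json") as f:          # placeholder file name
    data = json.load(f)

try:
    validate(instance=data, schema=schema)
    print("Data is valid against the schema.")
except ValidationError as e:
    print("Validation failed:", e.message)
```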
At this point, I feel that a lot of people using LLMs in “oracle” mode would have assumed that GPT4 wasn’t smart enough for this kind of task, given up and moved on. Instead, armed with past experience and personal curiosity about whether “self-reflection” could nudge it into a more useful output, I asked it to modify the Python script so that it would output the parts of the JSON object that failed the validation (with the intent to then feed this information back to it).
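A sketch of what that modified script might look like, using jsonschema’s Draft7Validator to report every failing fragment and where it lives, so the output can be pasted straight back into the chat (again, file names are placeholders):

```python
# Sketch: list every part of the JSON object that fails validation.
import json
from jsonschema import Draft7Validator

with open("player_state.schema.json") as f:   # placeholder file name
    schema = json.load(f)

with open("player_state.json") as f:          # placeholder file name
    data = json.load(f)

errors = list(Draft7Validator(schema).iter_errors(data))
for error in errors:
    path = "/".join(str(p) for p in error.absolute_path) or "<root>"
    print(f"At {path}: {error.message}")

if not errors:
    print("Data is valid against the schema.")
```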
The data failed validation, and I fed the fragment that didn’t pass back to it. I asked it to modify the schema to make sure that this error would go away.
I had to do it a few more times, but each time the schema was modified and the error was different. We were making progress.
In the end, in less than 5 minutes, not only did I have a fully working schema, but I had a recipe I could use to come up with more schemas from semi-structured data whenever the need emerges.
There are many different things that feel significant in this story:
GPT4 (the current state of the art) wasn’t able to perform the task at hand (creating a schema out of semi-structured data) on its first attempt.
I know I could have done it myself, but in practice I wasn’t going to: it felt like a slog and way too much work for something that was experimental and would have to scale to potentially tens of other object formats.
The combination (LLM + me) was able to achieve an excellent result in less than 5 minutes.
And it gave me the confidence to believe this approach could scale, semi-linearly, to as many other similar tasks as I would face.
This kind of interaction could probably be automated in code with LLM orchestration (say, with Langchain, Breadboard or possibly even Autogen), although I didn’t try it because we ultimately went a different way to deal with potential inconsistencies in this data.
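If I had gone down that route, the loop itself is simple to express. Here’s a rough sketch, where ask_llm is a hypothetical wrapper around whatever chat API or orchestration framework you prefer, and the prompts are illustrative rather than the ones I actually used:

```python
# Rough sketch of automating the schema/feedback loop.
# `ask_llm` is a hypothetical callable: prompt string in, response string out.
import json
from jsonschema import Draft7Validator

MAX_ROUNDS = 5

def validation_errors(schema, data):
    """Human-readable descriptions of every validation failure."""
    return [
        f"At {'/'.join(str(p) for p in e.absolute_path) or '<root>'}: {e.message}"
        for e in Draft7Validator(schema).iter_errors(data)
    ]

def derive_schema(data, ask_llm):
    schema = json.loads(ask_llm(
        "Write a JSONSchema that validates this JSON object. Return JSON only.\n"
        + json.dumps(data)
    ))
    for _ in range(MAX_ROUNDS):
        errors = validation_errors(schema, data)
        if not errors:
            return schema  # the schema now validates the data
        # Feed the failures back, exactly as I did by hand in the chat.
        schema = json.loads(ask_llm(
            "This JSONSchema fails to validate the data:\n" + json.dumps(schema)
            + "\nErrors:\n" + "\n".join(errors)
            + "\nReturn a corrected JSONSchema as JSON only."
        ))
    raise RuntimeError("Schema still failing after several rounds of feedback")
```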
Overall, I can’t shake the feeling that (LLM + me) is a far better software engineer than (me) alone. OpenAI is pushing hard on these personalized GPTs (with specialized prompts and minimal RAG augmentation), but they don’t feel very appealing to me right now because interaction and self-reflection supply all the “context augmentation” I feel I need. If that’s their attempt to build a moat, it’s not resonating much with me.
Things might change as the task shifts from exploration to exploitation and more consistent behavior is needed out of these stochastic machines, but for now I’m very happy chatting back and forth with these LLMs to get my work done, even if it seems stupid and wasteful from afar. They seem to expand my reach and the kinds of tasks I feel I can attempt to complete given the amount of skill, time and energy I have at my disposal, and that is no small thing.