What’s On My Mind: Improve ChatGPT Performance By Understanding How It Works

Let’s take some time to deconstruct the architecture of a large language model like InstructGPT/GPT-3. These models, which power useful tools like GoCharlie and ChatGPT, at first seem like magic to the end user. However, understanding how they work will help you use them more effectively. In 1957, linguist John Rupert Firth wrote the following in a paper titled “A Synopsis of Linguistic Theory”:

“You shall know a word by the company it keeps.”

This single sentence summarizes the entirety of how large language models work. Every natural language processing model in artificial intelligence is built on this axiom, mainly because language itself is built on this axiom. We understand a word based on the context we use it in.

For example, if I talk about brewing some tea, I’m talking about a literal beverage made from the camellia plant. If I talk about spilling some tea, I’m no longer talking about the beverage; I’m talking about gossip. The meaning of the word changes with the words around it.

But it’s not just the words immediately adjacent to the word in question. It’s all the words in relation to each other. Every language that’s functional has some kind of word order, a structure that helps us understand words.

I’m brewing the tea.
There’s a clear subject, me. There’s a verb, to brew. And there’s an object, the tea.

The tea I’m brewing.
This word order changes the focus. It’s still intelligible, but conversationally, the focus is now on the tea instead of me.

Brewing I’m the tea.
Now we’re so out of order that in English this doesn’t make much sense - verb, subject, object. Yet this sentence would be perfectly appropriate in Arabic, Gaelic, and a few other languages.

The structure of a language is a matter of probabilities.

I’m brewing the { } could be tea, coffee, beer, or some other object, but if you widen the window of words around it, the context becomes clearer. If the immediately preceding sentence talks about a coffee shop, then probabilistically, beer is unlikely to be the next word.
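To make the idea of a widening window concrete, here is a minimal sketch in Python. The tiny corpus, the setting labels, and the function name next_word_probs are all made up for illustration; a real model learns from vastly more text and does not store literal counts like this.

```python
from collections import Counter

# Tiny made-up corpus: each entry is (setting of the previous sentence, sentence).
corpus = [
    ("coffee shop", "i'm brewing the coffee"),
    ("coffee shop", "i'm brewing the coffee"),
    ("coffee shop", "i'm brewing the tea"),
    ("tea house", "i'm brewing the tea"),
    ("tea house", "i'm brewing the tea"),
    ("brewery", "i'm brewing the beer"),
]

def next_word_probs(prefix, setting=None):
    """Estimate P(next word | prefix, optional setting) from raw counts."""
    counts = Counter()
    for ctx, sentence in corpus:
        # Widening the window: only keep sentences that share the setting.
        if setting is not None and ctx != setting:
            continue
        if sentence.startswith(prefix + " "):
            next_word = sentence[len(prefix) + 1:].split()[0]
            counts[next_word] += 1
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}

# Without context, tea, coffee, and beer are all candidates.
print(next_word_probs("i'm brewing the"))
# With a coffee shop in the preceding sentence, beer drops out entirely.
print(next_word_probs("i'm brewing the", setting="coffee shop"))
```

The only thing that changes between the two calls is how much context the estimate is allowed to see, which is the whole point of the fill-in-the-blank example above.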

What does this have to do with ChatGPT? The underlying model, InstructGPT (which itself is a sister to GPT-3), is built by taking massive amounts of text and converting that text into mathematical probabilities. The seminal paper “Attention Is All You Need” by Ashish Vaswani et al. explains how the transformer architecture - which is what ChatGPT is built on - operates.

First, you start with a huge amount of text.
Next, you convert every word and word fragment into what is essentially a very large table, with each cell holding the probability of one word appearing next to another. Imagine taking a sentence and putting each word in a column in a spreadsheet. Then take the same sentence and put each word in a row in the same spreadsheet. Then count the number of times one word appears next to another word. Now do this over and over again for every sentence in your sample of text.
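The spreadsheet analogy can be sketched in a few lines of Python. This follows the analogy literally and is only an illustration: the sample sentences and the function name adjacent_counts are my own, and real models learn their numeric representations during training rather than filling in a count table.

```python
from collections import Counter

# Made-up sample text; a real model would see billions of sentences.
sentences = [
    "i'm brewing the tea",
    "i'm brewing the coffee",
    "the tea is hot",
]

def adjacent_counts(sentences):
    """Count how often each word is immediately followed by another word."""
    table = Counter()
    for sentence in sentences:
        words = sentence.split()
        # Pair every word with the word right after it.
        for left, right in zip(words, words[1:]):
            table[(left, right)] += 1
    return table

table = adjacent_counts(sentences)
print(table[("brewing", "the")])  # 2
print(table[("the", "tea")])      # 2
print(table[("the", "coffee")])   # 1
```

Each key is one cell of the spreadsheet: the row word, the column word, and how many times they sat next to each other.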

That’s the embedding part. After that, there’s a step called positional encoding. This is where word order is taken into account - the position of a word in relation to other words is given mathematical weight, so that in English, “I’m brewing the tea” has one value and “Brewing I’m the tea” has another value. Because you’ll see “I’m brewing the tea” far more times than “Brewing I’m the tea”, the former phrase and its positions will have more weight in the model, meaning that when it’s time to generate text, the probability that ChatGPT will spit out “Brewing I’m the tea” is fairly low, while “I’m brewing the tea” will be fairly high.
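For the curious, here is a minimal sketch of the sinusoidal positional encoding described in the Vaswani et al. paper. The function name and the small default size of 8 are my own choices for illustration, the size is assumed to be even, and later models may use other schemes for encoding position.

```python
import math

def positional_encoding(position, d_model=8):
    """Sinusoidal encoding for one position (d_model assumed to be even)."""
    vec = []
    for i in range(0, d_model, 2):
        # Each pair of dimensions uses a different wavelength.
        angle = position / (10000 ** (i / d_model))
        vec.append(math.sin(angle))  # even dimension
        vec.append(math.cos(angle))  # odd dimension
    return vec

# The same word gets a different position signal depending on where it sits.
for position, word in enumerate("i'm brewing the tea".split()):
    print(word, [round(x, 3) for x in positional_encoding(position)])
```

These position values are combined with each word's own values, so “brewing” in the first slot and “brewing” in the second slot look different to the model.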

Here’s what makes the transformer-based models like GPT-3 and InstructGPT different and better than their predecessors: through a mechanism called attention, they don’t just take into account the words immediately around a word. They take into account a LOT of text around each word - up to several thousand words at a time. That’s how they know to generate “I’m brewing the tea” and not “I’m brewing the beer” in the context of whatever we’re prompting them to do.
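The part of the transformer that weighs all of that surrounding text is called attention in the Vaswani et al. paper. Below is a minimal sketch of the scaled dot-product weighting it describes, in plain Python. The two-number word vectors are invented for illustration; real models learn vectors with hundreds or thousands of dimensions.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Hypothetical vectors for words already in the context window.
context = {"tea": [1.0, 0.0], "house": [0.8, 0.2], "beer": [0.0, 1.0]}
# Hypothetical vector for the blank after "I'm brewing the".
query = [1.0, 0.0]

weights = attention_weights(query, list(context.values()))
for word, weight in zip(context, weights):
    print(word, round(weight, 3))
```

In this toy setup the blank pays the most attention to “tea”, somewhat less to “house”, and the least to “beer”, which is the kind of weighting that steers the next word.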

Now that we understand the underlying technology, what do we do with this information? It should be clear that words matter in our usage of these tools. Large language models can only make probability-based guesses from what they’re provided, what we prompt them to do.

Which means that in general, the more relevant words in your prompt, the better the model will do. Let’s look at an example. I gave ChatGPT this prompt:

Complete this sentence:
“I’m brewing the

It responded:
perfect cup of coffee to start my day.”

That’s a reasonable response to a terribly vague prompt. What does it tell us about the underlying model? Its word choice shows that, in the absence of any other context, the model associates brewing most strongly with coffee, and so it created that response.

Now, let’s add some relevant, contextual words to the prompt.

Complete this sentence:
“The tea house was packed with customers. I’m brewing the

It responded:
last pot of tea for the day, and soon we’ll have to close up shop.”

Just by giving it a relevant starter sentence with a more relevant word - tea - it narrowed the window of probability and gave more weight to tea, ending up with a sentence that matched the context I intended.

What happens if I add even more context?

Complete this sentence:
“The tea house was packed with customers. We’d been busy all day, and the golden rooibos was almost sold out. We still had plenty of yerba mate, but the oolong was starting to run low. As Karina retrieved the last box out of storage, she turned to me. I said, “I’m brewing the

It responded:
oolong now, so we can offer it while it lasts. Could you please start serving the yerba mate to the customers?” Karina nodded and quickly went to the counter, deftly pouring the hot yerba mate into gourds and adding just the right amount of sugar.

I added very specific types of tea (yes, yerba mate is a tisane, I know) and got even better results because the inclusion of tea names - a type of jargon - is highly specific and changes the probabilities of what words should come next.

The more relevant words in your prompt, the better the model will do.

Working with prompts - prompt engineering - is a lot like working with keywords in SEO. The more relevant words you use, the better the results you get. Think of it like prompt optimization instead of search engine optimization.

This is why I tell interested folks that these models are good at generation but GREAT at transformation. They rewrite like pros because they don’t have to guess what the words are; the words are already in front of them. They only have to work out, using known probabilities, what the words should become.

If you want them to perform better, write longer prompts with relevant words that help the model quickly understand the context of your ask. How long? My best-performing prompts are over a page of text long. They’re highly specific, and they contain a ton of detail, a fair amount of jargon when appropriate, and specific instructions that yield repeatable, reliable results.
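If you build prompts in code, one way to keep them detailed and repeatable is to assemble them from named parts. The sketch below is a hypothetical helper, not part of any library; the function name build_prompt and the section labels it uses are my own assumptions, and nothing about the models requires this exact layout.

```python
def build_prompt(task, context, jargon, instructions):
    """Assemble a detailed prompt from its parts (hypothetical helper)."""
    lines = [f"Task: {task}", "", "Context:", context, ""]
    if jargon:
        # Domain-specific terms help narrow what the model thinks comes next.
        lines.append("Key terms: " + ", ".join(jargon))
        lines.append("")
    lines.append("Instructions:")
    lines.extend(f"{n}. {step}" for n, step in enumerate(instructions, start=1))
    return "\n".join(lines)

prompt = build_prompt(
    task="Complete this sentence in the narrator's voice.",
    context="The tea house was packed with customers. We'd been busy all day.",
    jargon=["golden rooibos", "yerba mate", "oolong"],
    instructions=["Stay in first person.", "Mention which tea is running low."],
)
print(prompt)
```

Keeping the parts separate makes it easy to reuse the same instructions while swapping in new context, which is where the repeatable results come from.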

In software development, this is requirements gathering. In creative work, this is the creative brief. In cooking, this is the recipe. You would never hand someone a two-sentence recipe for baking a loaf of bread. You would never hand a creative team a two-sentence brief, not if you want the result to match a vision you already have in mind.

Not coincidentally, humans work the same way, too. In general, you’ll get better results with overcommunication than with insufficient communication, for both machines and humans.