How does ChatGPT work? | Zapier

How does ChatGPT work? | Zapier

ChatGPT has been a household name for less than two years, so it’s no surprise that lots of people are still unsure how it works. If I didn’t have to write about it so much, I probably would be too. So let’s dial things back: to understand how ChatGPT works, we need to start by talking about the underlying language engine that powers it.

The GPT in ChatGPT is mostly three related algorithms: GPT-3.5 Turbo, GPT-4 Turbo, and GPT-4o. The GPT bit stands for Generative Pre-trained Transformer, and the number is just the version of the algorithm. GPT-4o is a little different because it’s multimodal, which means it can work with text, images, and audio—but we’ll get to that a little later.

The GPT models were developed by OpenAI (the company behind ChatGPT and the image generator DALL·E 3), but they power everything from Bing’s AI features to writing tools like Jasper and Copy.ai. In fact, many of the AI text generators available at the moment use GPT models, as well as similar models from other companies—though they tend to keep quiet when they use each one.

ChatGPT brought GPT into the limelight because it made the process of interacting with an AI text generator simple and—most importantly—free to everyone. Plus, it’s a chatbot, and people have loved a good chatbot since SmarterChild.

While GPT is the most prominent large language model (LLM) right now, there are now plenty of others. Google has its Gemini models and chatbot; Anthropic has Claude; Meta has Llama 3, which powers its Meta AI chatbot. And that’s before you dive into the models aimed at large companies, like Writer’s Palmyra LLMs, or open models like Mixtral 8x22B. Still, at least for now, OpenAI’s offerings are still the most powerful widely available and the de facto industry standard.

So the answer to “how does ChatGPT work?” is basically: GPT-3.5, GPT-4, and GPT-4o. But let’s dig a little deeper.

Table of contents:

What is ChatGPT?

ChatGPT is an app built by OpenAI. Using the GPT AI models, it can answer your questions, write copy, generate images, draft emails, hold a conversation, brainstorm ideas, explain code in different programming languages, translate natural language to code, and more—or at least try to—all based on the natural language prompts you feed it. It’s a chatbot, but a really, really good one.

The latest version of ChatGPT is also multimodal, at least if you use the GPT-4o model. In addition to text prompts, it can respond to images and audio. This opens up a wide range of real-world uses, like translating a conversation in real time or helping you identify a restaurant dish from a photo. 

Since it launched at the end of 2022, ChatGPT has gotten a lot more powerful and useful. It can search the web to find answers to your prompts, interact with other apps through custom GPTs (what OpenAI calls its extension framework), and create images using the DALL·E 3 image model.

Of course, ChatGPT is also a way for OpenAI to get a lot of real-world data on how its models perform from actual users and serves as a fancy demo for the power of GPT, which could otherwise feel a little fuzzy unless you were deep into machine learning.

Right now, ChatGPT offers three GPT models. GPT-3.5 is less powerful but available to everyone for free. The more advanced GPT-4 is limited to ChatGPT Plus subscribers, and they only get a limited number of questions every day. GPT-4o is available to everyone, though ChatGPT Plus subscribers get five times as many prompts per day.

One of ChatGPT’s biggest features is that it can remember all the context from the conversation you’re having with it. If you tell it something in your initial prompt, it can recall it much later in the conversation. You’re also able to ask it to rework things and correct any mistakes. It makes interacting with the AI feel like a genuine back-and-forth.

If you want to really get a feel for it, go and spend five minutes playing with ChatGPT now (it’s free!), and then come back to read about how it works. 

How does ChatGPT work?

This humongous dataset was used to form a deep learning neural network […] modeled after the human brain—which allowed ChatGPT to learn patterns and relationships in the text data […] predicting what text should come next in any given sentence. 

ChatGPT works by attempting to understand your prompt and then spitting out strings of words that it predicts will best answer your question, based on the data it was trained on. While that might sound relatively simple, it belies the complexity of what’s going on under the hood. 

Supervised vs. unsupervised learning

Let’s actually talk about that training. The P in GPT stands for “pre-trained,” and it’s a super important part of why GPT is able to do what it can do. 

Before GPT, the best performing AI models used “supervised learning” to develop their underlying algorithms. They were trained with manually-labeled data, like a database with photos of different animals paired with a text description of each animal written by humans. These kinds of training data, while effective in some circumstances, are incredibly expensive to produce. Even now, there just isn’t that much data suitably labeled and categorized to be used to train LLMs.

Instead, GPT employed generative pre-training, where it was given a few ground rules and then fed vast amounts of unlabeled data—near enough the entire open internet. It was then left “unsupervised” to crunch through all this data and develop its own understanding of the rules and relationships that govern text. 

GPT-4o was seemingly trained in the same way, though in addition to text, its training data also included images and audio. This way, it could learn not only what an apple was, but what one looks like too.

Of course, you don’t really know what you’re going to get when you use unsupervised learning, so GPT is also “fine-tuned” to make its behavior more predictable and appropriate. There are a few ways this is done (which I’ll get to), but it often uses forms of supervised learning. 

Transformer architecture

All this training is intended to create a deep learning neural network—a complex, many-layered, weighted algorithm modeled after the human brain—which allowed ChatGPT to learn patterns and relationships in the text data and tap into the ability to create human-like responses by predicting what text should come next in any given sentence. 

This network uses something called transformer architecture (the T in GPT) and was proposed in a research paper back in 2017. It’s absolutely essential to the current boom in AI models. 

While it sounds—and is—complicated when you explain it, the transformer model fundamentally simplified how AI algorithms were designed. It allows for the computations to be parallelized (or done at the same time), which means significantly reduced training times. Not only did it make AI models better, but it made them quicker and cheaper to produce.

At the core of transformers is a process called “self-attention.” Older recurrent neural networks (RNNs) read text from left-to-right. This is fine when related words and concepts are beside each other, but it makes things complicated when they’re at opposite ends of the sentence. (It’s also a slow way to compute things as it has to be done sequentially.)

Transformers, however, read every word in a sentence at once and compare each word to all the others. This allows them to direct their “attention” to the most relevant words, no matter where they are in the sentence. And it can be done in parallel on modern computing hardware. 

Of course, this is all vastly simplifying things. Transformers don’t work with words: they work with “tokens,” which are chunks of text or an image encoded as a vector (a number with position and direction). The closer two token-vectors are in space, the more related they are. Similarly, attention is encoded as a vector, which allows transformer-based neural networks to remember important information from earlier in a paragraph. 

And that’s before we even get into the underlying math of how this works. While it’s beyond the scope of this article to get into it, Machine Learning Mastery has a few explainers that dive into the technical side of things.

Tokens

How text is understood by AI models is also important, so let’s look a little deeper at tokens. GPT-3 was trained on roughly 500 billion tokens, which allows its language models to more easily assign meaning and predict plausible follow-on text by mapping them in vector-space. Many words map to single tokens, though longer or more complex words often break down into multiple tokens. On average, tokens are roughly four characters long. OpenAI has stayed quiet about the inner workings of GPT-4 and GPT-4o, but we can safely assume it was trained on much the same dataset since it’s even more powerful.

Block of text broken down into GPT-3 tokens and characters.

All the text tokens came from a massive corpus of data written by humans, at least for GPT-3. That includes books, articles, and other documents across all different topics, styles, and genres—and an unbelievable amount of content scraped from the open internet. Basically, it was allowed to crunch through the sum total of human knowledge to develop the network it uses to generate text.

Now, researchers are running out of human-created training data, so GPT-4 and later models may also be trained on synthetic—or AI-created—training data.

Based on all that training, GPT-3’s neural network has 175 billion parameters or variables that allow it to take an input—your prompt—and then, based on the values and weightings it gives to the different parameters (and a small amount of randomness), output whatever it thinks best matches your request. OpenAI hasn’t said how many parameters GPT-4 has, but it’s a safe guess that it’s more than 175 billion and less than the once-rumored 100 trillion parameters. Regardless of the exact number, more parameters doesn’t automatically mean better. Some of GPT-4’s increased power probably comes from having more parameters than GPT-3, but a lot is probably down to improvements in how it was trained.

GPT-4o is even harder to draw conclusions about. In addition to text, it was trained on images and audio—which can also be broken down into discrete tokens—so the neural network must have billions of additional parameters to deal with those additional modalities. Unfortunately, the corporate competition between the different AI companies means that their researchers are now unable or unwilling to share all the interesting details about how their models were developed.

Reinforcement learning from human feedback (RLHF)

Of course, GPT’s initial neural network was entirely unsuitable for public release. It was trained on the open internet with almost no guidance, after all. So, to further refine ChatGPT’s ability to respond to a variety of different prompts in a safe, sensible, and coherent way, it was optimized for dialogue with a technique called reinforcement learning with human feedback (RLHF). 

Essentially, OpenAI created some demonstration data that showed the neural network how it should respond in typical situations. From that, they created a reward model with comparison data (where two or more model responses were ranked by AI trainers) so the AI could learn which was the best response in any given situation. While not pure supervised learning, RLHF allows networks like GPT to be fine-tuned effectively.   

This process has continued with each subsequent release of GPT and is part of what has allowed the later models like GPT-4 and GPT-4o to be safer and more reliable.

Natural language processing (NLP)

All this effort is intended to make GPT as effective as possible at natural language processing (NLP). NLP is a huge bucket category that encompasses many aspects of artificial intelligence, including speech recognition, machine translation, and chatbots, but it can be understood as the process through which Al is taught to understand the rules and syntax of language, programmed to develop complex algorithms to represent those rules, and then made to use those algorithms to carry out specific tasks.

Since I’ve covered the training and algorithm development side of things, let’s look at how NLP enables GPT to carry out certain tasks—in particular, responding to user prompts. 

It’s important to understand that for all this discussion of tokens, ChatGPT is generating text of what words, sentences, and even paragraphs or stanzas could follow. It’s not the predictive text on your phone bluntly guessing the next word; it’s attempting to create fully coherent responses to any prompt. This is what transformers bring to NLP.

In the end, the simplest way to imagine it is like one of those “finish the sentence” games you played as a kid.

In the end, the simplest way to imagine it is like one of those “finish the sentence” games you played as a kid. ChatGPT starts by taking your prompt, breaking it down into tokens, and then using its transformer-based neural network to try to understand what the most salient parts of it are, and what you are really asking it to do. From there, the neural network kicks into gear again and generates an appropriate output sequence of tokens, relying on what it learned from its training data and fine-tuning.

For example, when I gave ChatGPT the prompt, “Zapier is…” it responded saying:

“Zapier is a web-based automation tool that allows users to connect different web applications together in order to automate repetitive tasks and improve workflows.”

That’s the kind of sentence you can find in hundreds of articles describing what Zapier does, so it makes sense that it’s the kind of thing that it spits out here. But when my editor gave it the same prompt, it said:

“Zapier is a web-based automation tool that allows users to connect different web applications and automate workflows between them.”

That’s pretty similar, but it isn’t exactly the same response. Asking “What is Zapier?”, “What does Zapier do?”, and “Describe Zapier” all get similar results too, presumably because they occupy similar positions in vector space. GPT understands that the most salient word here is Zapier, and that all the others are just asking for a short summary in slightly different ways.

That randomness (which you can control in some GPT apps with a setting called “temperature”) ensures that ChatGPT isn’t just responding to every single response with what amounts to a stock answer. It’s running each prompt through the entire neural network each time, and rolling a couple of dice here and there to keep things fresh. Its understanding of natural language also allows it to parse the subtle differences between “What is Zapier?” and “What does Zapier do?” While fundamentally similar questions, you would expect the answer to be slightly different. Whatever way you ask things, ChatGPT is not likely to start claiming that Zapier is a color from Mars, but it will mix up the following words based on their relative likelihoods.  

Multimodality in ChatGPT: Images, audio, and more

ChatGPT analyzing an image and creating a new one

While natural language processing is a huge part of ChatGPT, over the past year, the chatbot has become increasingly multimodal. That means that, using GPT-4o (the “o” stands for “omni”), ChatGPT can also understand images and audio as part of the same prompt.

If you’re thinking “couldn’t ChatGPT do that before?” you’re kind of right. The crux is how it happens. 

Before GPT-4o, ChatGPT was able to understand images using a separate AI model that created a text description of the image. Similarly, it could “listen” to audio input, but what was really happening was a speech-to-text AI model was converting the spoken words to text. In both cases, the text was what was passed to GPT, which then responded with text or, in some cases, by prompting DALL·E 3 to create an image.

With GPT-4o, the same AI model is able to both understand the text, audio, and image inputs and output a response using text, audio, or images. It’s a big step forward.

Spitting out words, not knowledge

For all my talk of “understanding” and “comprehending,” you have to remember that these are simply the most useful words we have for describing how AIs operate. GPT doesn’t truly understand English, but it has a very detailed map of how many concepts relate to each other. The makers of ChatGPT are the first to say that it can produce incorrect (and possibly harmful) information, though they’re working hard to fix it.

Here’s a tame example. When I asked ChatGPT using GPT-3 and GPT-4 to reply to “Harry Guinness is…” GPT-3 first replied by asking which Harry Guinness. Great question—there are literally two of us! So I said, “Harry Guinness the writer” (that’s me). GPT-4 guessed who I was straight away, though both responses were fascinating:

GPT-3: Harry Guinness is a freelance writer and journalist based in Ireland. He has written for a variety of publications, including The New York Times, The Guardian, The Huffington Post, and Popular Mechanics. He covers topics ranging from technology and photography to travel and culture.

GPT-4: Harry Guinness is an Irish writer, photographer, and technology expert. He has written for various publications, including How-To Geek, Lifehacker, and Tuts+. His work primarily focuses on technology, tutorials, and tips, as well as photography and image editing. Guinness has built a reputation for creating detailed guides that help readers understand complex subjects or solve problems in a straightforward manner.

GPT-3’s first and last lines are pulled almost verbatim from my various websites and author bios around the web (although I normally list myself as a freelance writer and photographer, not a journalist). But the list of publications is basically made up. I’ve written for The New York Times, but not for The Guardian, The Huffington Post, or Popular Mechanics (I do write regularly for Popular Science, so that might be where that came from).

GPT-4 gets the photographer part right and actually lists some publications I’ve written for, which is impressive, though they’re not the ones I’d be most proud of. It’s a great example of how OpenAI has been able to increase the accuracy of GPT-4 relative to GPT-3, though it might not always offer the most correct answer. 

But let’s go back to GPT-3 as its error provides an interesting example of what’s going on behind the scenes in ChatGPT. It doesn’t actually know anything about me. It’s not even copy/pasting from the internet and trusting the source of the information. Instead, it’s simply predicting a string of words that will come next based on the billions of data points it has.

For example: The New York Times is grouped far more often with The Guardian and The Huffington Post than it is with the places I’ve written for, like Wired, Outside, The Irish Times, and, of course, Zapier. So when it has to work out what should follow on from The New York Times, it doesn’t pull from the published information about me; it pulls that list of large publications from all the training data it has (or really, considers where they’re mapped in vector space). It’s very clever and looks plausible, but it isn’t true.

GPT-4 does a much better job and nails the publications, but the rest of what it says really just feels like plausible follow-on sentences. I don’t think it has any great appreciation for my reputation: it’s just saying the kind of thing a bio says. It’s far better at hiding how it works than GPT-3, though it’s actually using much the same technique.

I’ve tested this with GPT-4o, too, and the results are much the same. Though ChatGPT can now search the web—at least if you’re a ChatGPT Plus subscriber—which means it can find more up-to-date and accurate information rather than just relying on its training data.

What is the ChatGPT API?

OpenAI doesn’t have a just-us attitude with its technology. The company has an API platform that allows developers to integrate the power of ChatGPT into their own apps and services (for a price, of course).

Zapier uses the ChatGPT API to power its own ChatGPT integration, which lets you connect ChatGPT to thousands of other apps and add AI to your business-critical workflows. Learn more about how to automate ChatGPT, or take a look at these examples to get you started.

Zapier is the leader in workflow automation—integrating with 6,000+ apps from partners like Google, Salesforce, and Microsoft. Use interfaces, data tables, and logic to build secure, automated systems for your business-critical workflows across your organization’s technology stack. Learn more.

What’s next for ChatGPT?

ChatGPT has gone from a novelty to an increasingly useful productivity tool over the past year. It’s clear that multimodality is going to be the next big feature for chatbots and AI models, so expect it to get increasingly good at responding to image, audio, and maybe even things like video prompts. 

Right now, GPT-4o is pretty bad at generating images, so ChatGPT tends to use DALL·E 3. Maybe the next version will have a more powerful image generator built into the base model.

Otherwise, OpenAI is already training the next version of GPT. We’ll have to wait and see what features that brings to ChatGPT.

Related reading:

This article was originally published in February 2023. The most recent update was in June 2024.

by Zapier