What is multimodal AI? Large multimodal models, explained

What is multimodal AI? Large multimodal models, explained

Large language models (LLMs), like OpenAI’s GPT-4, are the extremely capable, state-of-the-art AI models that have been generating countless headlines for the past couple of years. The best of these LLMs are capable of parsing, understanding, interpreting, and generating text as well as most humans—and are able to ace many standardized tests. 

But there are still plenty of things LLMs can’t do by themselves, like understand different forms of inputs. For example, LLMs can’t natively respond to spoken or handwritten instructions, video footage, or anything else that isn’t just text. Of course, the world isn’t just made up of neatly formatted text, so some AI researchers think that training large AI models to be able to understand different “modalities”—like images, videos, and audio—is going to be a big deal in AI research. 

We’re already seeing the first of these new large multimodal models or LMMs. Google, OpenAI, and Anthropic, the makers of Claude, are all hinting at how powerful their latest AI models are across different modalities—even if the features aren’t widely available to the public just yet. 

So, if large multimodal models are the next frontier of AI, let’s have a look at what they are, how they work, and what they can do.

What is multimodal AI?

Large multimodal models are AI models that are capable across multiple “modalities.”

In machine learning and artificial intelligence research, a modality is a given kind of data. So text is a modality, as are images, videos, audio, computer code, mathematical equations, and so on. Most current AI models can only work with a single modality or convert information from one modality to another.

For example, large language models, like GPT-4, typically just work with one modality: text. They take a text prompt as an input, do some black box AI stuff, and then return text as an output. 

AI image recognition and text-to-image models both work with two modalities: text and images. AI image recognition models take an image as an input and output a text description, while text-to-image models take a text prompt and generate a corresponding image.

When an LLM appears to work with multiple modalities, it’s most likely using an additional AI model to convert the other input into text. For example, ChatGPT uses GPT-3.5 and GPT-4 to power its text features, but it relies on Whisper to parse audio inputs and DALL·E 3 to generate images. 

But that’s starting to change. 

Multimodal AI models go mainstream: Gemini, GPT-4V, and Claude 3

When Google announced its Gemini series of AI models, it made a big deal about how they were “natively multimodal.” Instead of having different modules tacked on to give the appearance of multimodality, they were apparently trained from the start to be able to handle text, images, audio, video, and more. 

Of course, if you’re a keen ChatGPT user, you’ll have noticed that it’s been able to handle image inputs since last year. That’s because, in addition to the LLM GPT-4, OpenAI also developed a multimodal model called GPT-4 Vision or GPT-4V. It can only handle text and images, but it seems to be able to do it very well.

Similarly, Anthropic claims that Claude 3 has “sophisticated vision capabilities on par with other leading models.” So, while large multimodal model is a fancy new term, it’s basically describing the direction the major LLMs have been going. 

How do large multimodal models work?

Large multimodal models are very similar to large language models in training, design, and operation. They rely on the same training and reinforcement strategies, and have the same underlying transformer architecture. This article on how ChatGPT works is a good place to start if you want a breakdown of some of these concepts. 

We’re in the era of rapid AI commercialization, so a lot of interesting and important information about the various AI models is no longer released publicly. The broad strokes have to be pieced together from the technical announcements, product specs, and general direction of the research. As a result, this is more of an overarching view about how these models work as a whole, rather than a detailed breakdown of how a specific LMM was developed.

In addition to an unimaginable quantity of text, LMMs are also trained on millions or billions of images (with accompanying text descriptions), video clips, audio snippets, and examples of any other modality that the AI model is designed to understand (e.g., code). Crucially, all this training happens at the same time. The underlying neural network—the algorithm that powers the whole AI model—not only learns the word “dog,” but it also learns the concept of what a dog is, as well as what a dog looks and sounds like. In theory, it should be just as capable of recognizing a photo of a dog or identifying a woof in an audio clip as it is at processing the word “dog.”

Of course, this pre-training is just the first step in creating a functional AI model. It’s likely to have incorporated some pretty unhealthy stereotypes and toxic ideas—mainlining the entire internet isn’t good for human brains, let alone artificial networks based on them. To get a large multimodal model that behaves as expected and, importantly, is actually useful, the results are fine-tuned using techniques like reinforcement learning with human feedback (RLHF), supervisory AI models, and “red teaming” (to try to break it).

Once all that is done, you should have a working large multimodal model that’s similar to a large language model, but capable of handling other modalities, too. 

What can a large multimodal model do?

ChatGPT describing a picture of a Pekingese dog uploaded by the user

Other than Gemini, most LMMs or LLMs with multimodal capabilities are limited to just text and images, so that’s what I’m going to focus on here. If you want to see an idealized, kind-of-faked, demo of how LMMs could work in the future, check out Google’s Gemini product demo. While Gemini has the capabilities shown, they don’t quite work the way Google presents them

Still, right now, LMMs or LLMs with some multimodal capabilities have some pretty neat features. With Gemini, ChatGPT, and Claude 3, you can do things like:

  • Upload an image and get a description of what’s going on, as well as use it as part of a prompt to generate text or images.

  • Upload an image and ask questions about it, as well as follow-up questions about specific elements of the image.

  • Translate the text in an image of, say, a menu, to a different language, and then use it as part of a text prompt. 

  • Upload charts and graphs and ask complicated follow-up questions about what they show.

  • Upload a design mockup and get the HTML and CSS code necessary to create it. 

Katie's hand-drawn logo in ChatGPT

And as LMMs get more widely available, what they’re capable of will likely expand. A multimodal medical chatbot, for example, would be able to better diagnose different rashes and skin discolorations. Or more simply, while I couldn’t get ChatGPT or Gemini to solve a Sudoku puzzle, the fully multimodal versions of both would be well able to.

Multimodal AI models available now

Multimodal features are rolling out to most major LLMs. Both GPT-4V and Google Gemini have more powerful multimodal features that are not yet widely available through their chatbot frontends. Similarly, it remains to be seen how powerful Claude 3 is when it comes to handling image inputs. 

You can get a sense for how these will feel with ChatGPT Plus or with the Data Analyst GPT.

Either way, over the next year or two, we’re likely to see significantly more multimodal AI tools—and even some legitimate large multimodal models—capable of working with text, images, video footage, audio, code, and other modalities we probably haven’t even considered.

Automate your multimodal AI models

Even if you can’t interact with most multimodal models easily via chatbot right now, you can still use it in your everyday workflows. With Zapier’s Google Vertex AI and Google AI Studio integrations, you can access Gemini from all the apps you use at work. Here are a few examples to get you started.

Zapier is the leader in workflow automation—integrating with 6,000+ apps from partners like Google, Salesforce, and Microsoft. Use interfaces, data tables, and logic to build secure, automated systems for your business-critical workflows across your organization’s technology stack. Learn more.

Related reading:

by Zapier