Google Gemini is a family of new AI models from Google. Despite Google being a leader in AI research for almost a decade and developing the transformer architecture—one of the key technologies in large language models (LLMs)—OpenAI and its GPT models are dominating the conversation.
Gemini Nano, Gemini Pro, and Gemini Ultra are Google’s attempt to play catch-up. All three versions are multimodal, which means that in addition to text, they can understand and work with images, audio, videos, and code. Let’s dig in a little deeper and see if Google can really get back in the AI game.
What is Google Gemini?
Gemini is also the name of Google’s AI chatbot (formerly known as Bard). This article is about the family of AI models by the same name—the one that powers the chatbot.
Google Gemini is a family of AI models, like OpenAI’s GPT. The major difference: while Gemini can understand and generate text like other LLMs, it can also natively understand, operate on, and combine other kinds of information like images, audio, videos, and code. For example, you can give it a prompt like “what’s going on in this picture?” and attach an image, and it will describe the image and respond to further prompts asking for more complex information.
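For developers (API access is covered later in this article), a prompt like that maps almost directly onto a single API call. Here's a minimal sketch, assuming the google-generativeai Python SDK, the gemini-pro-vision model, and placeholder values for the API key and image file:

```python
# Minimal sketch of a multimodal prompt: text plus an image.
# Assumes the google-generativeai SDK (pip install google-generativeai pillow)
# and a placeholder API key from Google AI Studio.
import PIL.Image
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

model = genai.GenerativeModel("gemini-pro-vision")
image = PIL.Image.open("photo.jpg")  # hypothetical local image

# Send the text prompt and the image together as one multimodal request.
response = model.generate_content(["What's going on in this picture?", image])
print(response.text)
```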
Because we’ve now entered the corporate competition era of AI, most companies are keeping pretty quiet on the specifics of how their models work and differ. Still, Google has confirmed that the Gemini models use a transformer architecture and rely on strategies like pretraining and fine-tuning, much as other LLMs like GPT-4 do. The main difference between it and a typical LLM is that it’s also trained on images, audio, and videos at the same time it’s being trained on text; they aren’t the result of a separate model bolted on at the end.
In theory, this should mean it understands things in a more intuitive manner. Take a phrase like “monkey business”: if an AI is just trained on images tagged “monkey” and “business,” it’s likely to just think of monkeys in suits when asked to draw something related to it. On the other hand, if the AI for understanding images and the AI for understanding language are trained at the same time, the entire model should have a deeper understanding of the mischievous and deceitful connotations of the phrase. It’s ok for the monkeys to be wearing suits—but they’d better be throwing poo.
While this all makes Google Gemini more interesting, it doesn’t make it entirely unique: GPT-4 Vision (GPT-4V) is a similar multimodal model from OpenAI that adds image processing to GPT-4’s LLM capabilities. (Although it did fail my “monkey business” test.)
Google Gemini comes in three sizes
Gemini is designed to run on almost any device. Google claims that its three versions—Gemini Ultra, Gemini Pro, and Gemini Nano—are capable of running efficiently on everything from data centers to smartphones.
- Gemini Ultra is the largest model designed for the most complex tasks. In LLM benchmarks like MMLU, Big-Bench Hard, and HumanEval, it outperformed GPT-4, and in multimodal benchmarks like MMMU, VQAv2, and MathVista, it outperformed GPT-4V. It’s still undergoing testing and is due to be released next year.
- Gemini Pro offers a balance between scalability and performance. It’s designed to be used for a variety of different tasks. Right now, a specially trained version of it is used by the Google Gemini chatbot (formerly called Bard) to handle more complex queries. In independent testing, Gemini Pro was found to achieve “accuracy that is close but slightly inferior to the corresponding GPT 3.5 Turbo” model.
- Gemini Nano is designed to operate locally on smartphones and other mobile devices. In theory, this would allow your smartphone to respond to simple prompts and do things like summarize text far faster than if it had to connect to an external server. For now, Gemini Nano is only available on the Google Pixel 8 Pro and powers features like smart replies in Gboard.
Each Gemini model differs in how many parameters it has and, as a result, how good it is at responding to more complex queries as well as how much processing power it needs to run. Unfortunately, figures like the number of parameters any given model has are often kept secret—unless there’s a reason for a company to brag.
Google claims the smallest model, Nano, has two versions: one with 1.8 billion parameters and another with 3.25 billion parameters. While Google doesn’t reveal how many parameters the larger models have, as a ballpark, GPT-3 has 175 billion parameters, while Meta’s Llama 2 family has models with up to 70 billion parameters. Presumably the two larger Gemini models have parameter counts in the same sort of range.
Google Gemini is designed to be built on top of
Almost every app now seems to be adding AI-based features, and many of them are using OpenAI’s GPT, DALL·E, and other APIs to do it. Most AI writing generators, for example, are powered by GPT.
Google wants a piece of that action, so Gemini is designed from the start for developers to build AI-powered apps and otherwise integrate AI into their products. Google’s big advantage is that it can deliver Gemini through its existing cloud computing, hosting, and other web services.
While Google will use Gemini to power its own products like the chatbot formerly known as Bard (now called Gemini), developers can access Gemini Pro through the Gemini API in Google AI Studio or Google Cloud Vertex AI. This allows them to further train Gemini on their own data to build powerful tools like folks have already been doing with GPT.
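To give a sense of what that looks like, here's a minimal text-only sketch, assuming the google-generativeai Python SDK and an API key generated in Google AI Studio (Vertex AI instead uses Google Cloud's own client libraries and project-based authentication):

```python
# Minimal sketch of calling Gemini Pro through the Gemini API.
# Assumes the google-generativeai SDK and an API key from Google AI Studio.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

model = genai.GenerativeModel("gemini-pro")
response = model.generate_content("Write a product description for a solar-powered lantern.")
print(response.text)
```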
How does Google Gemini work?
According to Google, before Gemini, most multimodal AI models were developed by combining multiple separately trained AI models. The text and image processing, for example, would be trained separately and then combined into a single model that could approximate the features of a true multimodal model.
With Gemini, they set out to create a natively multimodal model. It was pretrained on a dataset with trillions of tokens of text, as well as images (along with accompanying text descriptions), videos, and audio from the start—and at the same time. It was then further fine-tuned through techniques like reinforcement learning with human feedback (RLHF) to get the model to create better and safer responses.
While Google doesn’t say where all this training data came from, it likely includes archives of websites like Common Crawl, image-text databases like LAION-5B, as well as proprietary data sources like the entirety of Google Books.
By training all its modalities at once, Google claims that Gemini can “seamlessly understand and reason about all kinds of inputs from the ground up.” For example, it can understand charts and the captions that accompany them, read text from signs, and otherwise integrate information from multiple modalities. (For what it’s worth, GPT-4V, a not-yet-fully-released version of GPT-4, also seems to have been trained in the same way, but only on text and images.)
This all allows the Gemini models to respond to prompts with both text and generatively created images, much like ChatGPT can do using a combination of DALL·E and GPT.
Aside from having a greater capacity to understand different kinds of input, actual text generation works much the same with Gemini as it does with any other AI model. Its neural network tries to generate plausible follow-on text to any given prompt based on the training data it’s seen in the past. The version of Gemini Pro fine-tuned for the Gemini chatbot, for example, is designed to interact like a chatbot, while the version of Gemini Nano embedded in the Pixel 8 Pro’s Recorder app is designed to create text summaries from the automatically generated transcripts.
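As a rough illustration of those two use cases, the sketch below (again assuming the google-generativeai Python SDK, with a placeholder API key and transcript) uses a multi-turn chat session for conversational prompts and a single one-shot call for summarization:

```python
# Rough sketch of two usage patterns on the same underlying model:
# a multi-turn chat session and a one-shot summarization call.
# Assumes the google-generativeai SDK and a placeholder API key.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

model = genai.GenerativeModel("gemini-pro")

# Chatbot-style usage: the SDK keeps the conversation history for you.
chat = model.start_chat(history=[])
print(chat.send_message("What's a transformer model?").text)
print(chat.send_message("Now explain it to a ten-year-old.").text)

# Summarization-style usage: a single prompt with no conversation state.
transcript = "Speaker 1: ... Speaker 2: ..."  # placeholder transcript
summary = model.generate_content("Summarize this transcript:\n\n" + transcript)
print(summary.text)
```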
How does Google Gemini compare to other LLMs?
As a family of multimodal models, Gemini is hard to compare on a one-to-one basis. Roughly speaking, though, its models understand and generate text that is as good as the equivalent GPT models, and thus just ahead of Llama, Claude, and most other available LLMs.
For example, Gemini Ultra outperforms GPT-4 and GPT-4V on most benchmarks—although it’s not yet available—while independent research found that Gemini Pro trails GPT-3.5 Turbo across many of the same benchmarks.
Still, it’s Gemini’s multimodality that makes it most interesting, though how effective that is in the real world remains to be seen. Gemini models just aren’t widely available yet, and Google kind of flubbed the launch.
A much-hyped demo video that supposedly showed Gemini Ultra responding to live video in real time was essentially faked, although the accompanying blog post was a little more transparent about what was going on. Instead of responding to actual video and audio prompts, Gemini was responding to more detailed text and image prompts, and taking a lot longer to do it than was apparent in the demo. The video makes an impressive case for how multimodal AI models could be used in the future, though it doesn’t really represent any Gemini model’s current capabilities. (Gemini also reportedly has some issues getting facts straight.)
How to access Google Gemini
A specially trained version of Gemini Pro is available to some users through Google’s Gemini chatbot. I haven’t got it yet, but you might. Everyone has to wait until next year for Gemini Ultra, the most powerful model, though it will be available both to developers and through the Gemini chatbot (formerly Bard).
For now, developers can test Google Gemini Pro out through Google AI Studio or Vertex AI. And with Zapier’s Google Vertex AI and Google AI Studio integrations, you can access Gemini from all the apps you use at work. Here are a few examples to get you started.
This article was originally published in January 2024. The most recent update was in March 2024.