What is Sora? OpenAI’s text-to-video model

Sora is DALL·E for video—it’s even built by the same people—and it’s now available to ChatGPT Plus and Pro members.

With Sora, you type a text prompt, and the AI model does its best to generate a matching video. It makes sense that once AI models could generate decent images, the next challenge would be getting them to generate good video footage—and that’s what Sora is getting at.

While Sora is impressive, its results can have a surreal video-game-like quality to them. Sign up and give it a try if you want to judge for yourself whether they’re truly realistic or not. But first, let’s have a look at what Sora is, how it works, and how it could be used going forward.

What is Sora?

[Image: the Sora landing page]

Sora is a generative text-to-video AI model developed by OpenAI, the makers of ChatGPT and DALL·E 3. OpenAI claims that it “can create realistic and imaginative scenes.” I’d argue that “realistic” might oversell things a touch—and they also lack sound—but the videos it generates from written prompts do look great. 

In addition to using text prompts, Sora can also take an image and turn it into a video, or take a video clip and extend it forward or backward in time. This has the potential to be even more useful, though it can highlight the model’s questionable grasp of physics.

The current version of Sora is Sora Turbo. It’s the first version of the model that’s been available to the general public. At launch, Sora Turbo can create videos in up to 1080p resolution and up to 20 seconds long. The videos can have multiple characters, camera motion, and (somewhat) persistent and accurate details. Thanks to its training (which I’ll dive into below), it has a surprisingly good understanding of how things exist in the real world—if not always how they physically interact.

Crucially, OpenAI has also developed a basic video editor that allows you to do more than generate one-off video clips:

  • Remix allows you to change elements of any AI-generated video with a written prompt. 

  • Recut allows you to pull out the best sections of an AI-generated video and create a new video.

  • Storyboard allows you to combine multiple AI-generated clips into a single video.

  • Loop allows you to create seamlessly repeating AI-generated videos.

  • Blend allows you to combine elements from two different videos.

OpenAI has also borrowed a few ideas from more community-oriented generative AI apps like Midjourney. For example, the Recent and Featured tabs highlight other users’ creations. You can click on any video to see the exact prompt used to create it and even use it as the basis for your own generations.

How does Sora work?

Sora is built on the ideas behind OpenAI's other models, plus plenty of new techniques.

Without getting into the technical details: Sora was trained on an unspecified amount of video footage that appears to include everything from selfie videos to movies, TV shows, real-world footage, video game recordings, and lots more. All of this training footage was captioned, mostly by AI, so that Sora could develop a deep understanding of natural language and how it relates to the physical world.

Tokenizing visual data with patches

In the technical report released in February 2024, the OpenAI researchers explain that they were inspired by how large language models (LLMs) like GPT are able to become incredibly competent at a wide variety of tasks just by being trained on massive quantities of data. 

A big part of this is that LLMs model the relationships between individual “tokens” (fragments of text, roughly four characters long on average) across different domains, including multiple languages, mathematics, and computer code. Feed in billions of web pages, and the model has one consistent structure it can use to make sense of all of it.
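
You can get a feel for what these tokens look like with tiktoken, OpenAI's open source tokenizer library. Here's a quick Python example (the sample sentence is mine; cl100k_base is the encoding used by GPT-4-era models):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4-era OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")
text = "Sora generates video from text prompts."

# Print each token ID next to the text fragment it represents.
for token_id in enc.encode(text):
    fragment = enc.decode_single_token_bytes(token_id).decode("utf-8")
    print(f"{token_id:>6} -> {fragment!r}")
```

Run it, and the sentence breaks into word-level and sub-word fragments, each around four characters long on average.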

To achieve some of the same benefits with video, OpenAI uses “spacetime patches.” In essence, every frame in a video is broken down into a series of smaller segments called patches. How each segment changes through the length of the video is also encoded in the spacetime patch—hence the name, spacetime. Crucially, this allowed Sora to be trained on a wide variety of different visual data, from vertical social media videos to widescreen movies, as each clip didn’t have to be cropped or compressed to a specific set of dimensions. 
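
OpenAI hasn't released Sora's patching code, but the core idea is easy to sketch. In this toy Python example (every dimension here is made up for illustration), a small video tensor is carved into “tubes” that each span a few frames and a small square of pixels, and each tube is then flattened into a token-like vector:

```python
# Toy illustration of spacetime patches (not OpenAI's actual code).
import numpy as np

# A tiny hypothetical clip: 16 frames of 64x64 RGB video.
frames, height, width, channels = 16, 64, 64, 3
video = np.random.rand(frames, height, width, channels)

# Hypothetical patch size: each patch spans 4 frames and an 8x8 pixel square.
patch_t, patch_h, patch_w = 4, 8, 8

# Carve the clip into a grid of spacetime patches...
patches = video.reshape(
    frames // patch_t, patch_t,
    height // patch_h, patch_h,
    width // patch_w, patch_w,
    channels,
).transpose(0, 2, 4, 1, 3, 5, 6)

# ...and flatten each patch into one vector, giving a sequence the
# transformer can treat the way an LLM treats a sequence of tokens.
tokens = patches.reshape(-1, patch_t * patch_h * patch_w * channels)
print(tokens.shape)  # (256, 768): 256 patch "tokens" of 768 values each
```

Because the patch grid simply adapts to whatever shape the clip has, vertical phone videos and widescreen movies alike can be turned into the same kind of token sequence.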

It gets really complicated really quickly, so if you want to learn more, check out the technical report or this article from Towards Data Science, or keep reading for a few more details.

Generating patches with a transformer diffusion network

To generate videos, Sora uses the same diffusion approach as DALL·E, paired with a transformer architecture similar to GPT's, which lets it generate long, detailed clips made up of multiple shots.

Diffusion starts with a random field of noise, and the AI repeatedly refines it so that it gets closer and closer to what the prompt describes. It sounds wild, and I explain it in more detail in Zapier's roundup of the best AI image generators, but it works really well with modern image models. It's how Stable Diffusion, Midjourney, DALL·E 3, and every other AI art generator is able to create such interesting results.

Sora’s biggest development is that it doesn’t generate a video frame by frame. Instead, it uses diffusion to generate the entire video all at once. The model has “foresight” of future frames, which allows it to keep generated details mostly consistent throughout the entire clip, even if they move in and out of frame, are obscured by other objects, or the virtual camera moves through 3D space.
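
Here's a deliberately simplified Python sketch of that idea, with the whole clip held in one tensor so every frame is refined together at every step. In a real diffusion model, a trained network conditioned on the prompt predicts each update; here a known target stands in for that prediction, purely to show the shape of the loop:

```python
# Minimal diffusion-style loop (illustration only, not Sora's real code).
import torch

frames, channels, height, width = 20, 3, 64, 64

# Stand-in for "the video the prompt describes"; a trained network would
# predict the update direction instead of us knowing the answer up front.
target = torch.zeros(frames, channels, height, width)

# Start from a random field of noise covering the *entire* clip...
x = torch.randn(frames, channels, height, width)

steps = 50
for t in range(steps):
    # ...and refine all frames jointly at every step, which is what keeps
    # details consistent instead of drifting frame to frame.
    predicted_direction = target - x
    x = x + predicted_direction / (steps - t)

print(torch.allclose(x, target, atol=1e-5))  # True: the noise is refined away
```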

Check out some of Sora’s sample videos to see it in action, or just sign up (if they’ve unpaused signups) and give Sora a go.

How good is Sora?

Sora is incredibly impressive. While the videos it generates won’t always look realistic, plenty of them look realistic enough to pass a casual inspection. 

In particular, landscapes, abstract patterns, cartoons, and stop-motion-style animation can all look great. Videos of people and animals can also look good if they aren’t moving too much, but as soon as you add lots of movement, things tend to fall apart.

OpenAI is transparent about this. According to the announcement, Sora “often generates unrealistic physics and struggles with complex actions over long durations.” Look closely at any action-packed video created with Sora and you’ll quickly see that this is true. Objects subtly morph or even vanish over the course of a video, people and animals move unnaturally, and the physics just feels off. None of this is surprising, but it is by far Sora’s biggest limitation as it currently stands.

While AI-generated images haven’t replaced photographers and other artists, they’re definitely being widely used—especially online. I can see Sora reaching the same kind of prominence as it becomes more widely available.

How safe is Sora?

Of course, there’s the potential for deepfakes. While existing video editing and AI tools already make them easy to create, text-to-video AI models could supercharge the ability of unscrupulous people to generate them with little to no effort. The video quality isn’t quite convincing yet, but that doesn’t mean it won’t be eventually, or that some people won’t try to pass off AI videos as real anyway.

Right now, OpenAI is doing everything it can to keep Sora’s press positive. That means major guardrails to prevent people from generating any kind of deepfake. Currently, uploads of videos featuring people are limited, though OpenAI intends to roll the feature out to more users as it improves its ability to mitigate deepfakes. Similarly, Sora has DALL·E 3 and ChatGPT’s limitations on creating violent, abusive, and otherwise objectionable content, as well as very strict limits on creating anything using potentially copyrighted or trademarked material.

And while OpenAI is building those guardrails to make its models hard to misuse and abuse, the same can’t necessarily be said for other services built on similar open source models. The next few years will likely be strange as society as a whole comes to terms with fake videos becoming easier and cheaper to produce.

While there are always going to be techniques to skirt these kinds of guardrails, Sora is as safe as any AI tool can reasonably be expected to be. There are other text-to-video generators without the same guardrails that are simply much easier to use if you want to create videos of Mickey Mouse duking it out with Batman.

How to try Sora

Sora is currently available to ChatGPT Plus and ChatGPT Pro subscribers. ChatGPT Plus users can generate a limited number of watermarked videos up to 720p and five seconds in length. ChatGPT Pro users can generate an unlimited number of unwatermarked videos up to 1080p and 20 seconds in length.

But as I’m writing this (the day after it was released to the public), Sora has already paused new user activations, even for ChatGPT Plus subscribers. While this is likely to change as OpenAI ramps up Sora as a commercial product, you should expect some limits to stay in place for a while.

If you can’t sign up for Sora and want to try a text-to-video AI model today, you have a few other options. Runway Gen-2 is the big name, and research models like Google’s Lumiere and Meta’s Make-A-Video have unofficial PyTorch implementations if you have the technical chops to run them. Or you can check out Zapier’s list of the best AI video generators.

This article was originally published in March 2024. The most recent update was in December 2024.

by Zapier