Feb 22 2024

AI Video

Recently OpenAI launched a website showcasing their latest AI application, Sora. This app takes prompts similar to those you would use for ChatGPT or for image-creation applications like Midjourney or DALL-E 2, and creates a one-minute photorealistic video without sound. Take a look at the videos and then come back.

Pretty amazing. Of course, I have no idea how cherry picked these videos are. Were there hundreds of failures for each one we are seeing? Probably not, but we don’t know. They do give the prompts they used, and they state explicitly that these videos were created entirely by Sora from the prompt without any further editing.

I have been using Midjourney quite extensively since it came out, and more recently I have been using ChatGPT 4, which is linked to DALL-E, so that ChatGPT will create the prompt for you from more natural-language instructions. It’s pretty neat. I sometimes use it to create the images I attach to my blog posts. If I need, for example, a generic picture of a lion, I can just make one, rather than borrowing one from the internet and risking that some German firm will start harassing me about copyright violation and try to shake me down for a few hundred Euros. I also make images for personal use, mostly gaming. It’s a lot of fun.

Now I am looking forward to getting my hands on Sora. They say they are testing the app, having given it to a group of creators for feedback. They are also exploring ways in which the app can be exploited for evil, and trying to make it safe. This is where the app raises some tricky questions.

But first I have a technical question – how long will it be before AI video creation is so good that it becomes indistinguishable (without technical analysis) from real video? Right now Sora is about as good at video as Midjourney is at pictures. It’s impressive, but there are some things it has difficulty doing. It doesn’t actually understand anything, like physics or cause and effect, and is just inferring, in its way, what something probably looks like. Probably the best illustration of this is how these models deal with words. They will create pseudo-letters and words, reconstructing word-like images without understanding language.

Here is a picture I made through ChatGPT and DALL-E, asking for an advanced spaceship with the SGU logo. Superficially very nice, but the words are not quite right (and this is after several iterations). You can see the same kind of thing in the Sora videos. Often there are errors in scale and in how things relate to each other, and objects just spawn out of nowhere. The video of the birthday party is interesting – I think everyone is supposed to be clapping, but it’s just weird.

So we are still right in the middle of the uncanny valley with AI-generated video. Also, this is without sound. The hardest thing to do with photorealistic CG people is to make them talk. As soon as their mouths start moving, you know they are CG. They don’t even attempt that in these videos. My question is – how close are we to getting past the uncanny valley and fixing all the physics problems with these videos?

On the one hand, it seems close. These videos are pretty impressive. But this kind of technology (self-driving cars, speech recognition) has historically tended to follow a curve where the last 5% of quality is as hard as, or harder than, the first 95%. So while we may seem close, fixing the current problems may be really hard. We will have to wait and see.

The trickier question is – once we do get through the uncanny valley and can create realistic video, paired with sound, of essentially anything, indistinguishable from reality, what will the world be like? We can already make fairly good voice simulations (again, at the 95% level). OpenAI says they are addressing these questions, and that’s great, but once this code is out in the world, who’s to say everyone will adhere to good AI hygiene?

There are some obvious abuses of this technology to consider. One is creating fake videos meant to confuse the public, influence elections, or serve general propaganda purposes. Democracy requires a certain amount of transparency and shared reality, and we are already seeing what happens when different groups cannot even agree on basic facts. This problem cuts both ways – people can make videos to create the impression that something happened that didn’t, but real video can also be dismissed as fake. That wasn’t me taking a bribe; it was an AI fake. This creates easy plausible deniability.

This is a perfect scenario for dictators and authoritarians, who can simply create and claim whatever reality they wish. The average person will be left with no solid confidence in what reality is. You can’t trust anything, and so there is no shared truth. Best put our trust in a strongman who vows to protect us.

There are other ways to abuse this technology, such as violating other people’s privacy by using their image. This could also revolutionize the porn industry, although I wonder if that will be a good thing.

While I am excited to get my hands on this kind of software for my personal use, and to see what real artists and creators can do with the medium, I also worry that we are again at the precipice of a social disruption. It seems we need to learn the lessons of recent history and try to get ahead of this technology with regulations and standards. We can’t just leave it up to individual companies. Even if most of them are responsible, there are bound to be some that aren’t. Not only do we need international standards, we need the technology to enforce them (if that’s even possible).

The trick is, even if AI-generated videos can be detected and revealed, the damage may already be done. The media will have to take a tremendous amount of responsibility for any video they show, and this includes the social media giants. At the very least, any AI-generated video should be clearly labeled as such, and making that work effectively may require several layers of detection. At minimum we need to make abuse as difficult as possible, so that not every teenager with a cellphone can interfere with elections. At the creation end, AI-generated video can be watermarked, for example, with layers of digital watermarking that alert social media platforms so they can properly label such videos, or refuse to host them depending on content.
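To make the watermarking idea concrete, here is a minimal sketch of the simplest possible scheme – hiding a short identifier in the least significant bits of an image’s pixels, which a platform could then check before deciding how to label the content. To be clear, this is my own toy illustration, not anything OpenAI has announced: the tag, the function names, and the file name are all made up, and a real provenance system (the C2PA content-credentials standard, for example) would use cryptographically signed metadata and watermarks designed to survive re-encoding, which this one would not.

```python
# Toy illustration only: a least-significant-bit (LSB) watermark.
# It survives lossless formats like PNG but would be destroyed by
# JPEG compression or re-encoding; real schemes are far more robust.
import numpy as np
from PIL import Image

TAG = b"AI-GENERATED"  # hypothetical marker a generator might embed


def embed_watermark(pixels: np.ndarray, tag: bytes = TAG) -> np.ndarray:
    """Hide `tag` in the lowest bit of the first len(tag)*8 pixel bytes."""
    flat = pixels.flatten()  # flatten() returns a copy
    bits = np.unpackbits(np.frombuffer(tag, dtype=np.uint8))
    if bits.size > flat.size:
        raise ValueError("image too small to hold the tag")
    # Clear each byte's lowest bit, then write one tag bit into it.
    flat[:bits.size] = (flat[:bits.size] & 0xFE) | bits
    return flat.reshape(pixels.shape)


def read_watermark(pixels: np.ndarray, length: int = len(TAG)) -> bytes:
    """Recover `length` bytes from the image's least significant bits."""
    bits = pixels.flatten()[:length * 8] & 1
    return np.packbits(bits.astype(np.uint8)).tobytes()


if __name__ == "__main__":
    # "frame.png" is a placeholder; any RGB image would do.
    img = np.asarray(Image.open("frame.png").convert("RGB"), dtype=np.uint8)
    marked = embed_watermark(img)
    # Platform-side check: label (or refuse) content carrying the tag.
    if read_watermark(marked) == TAG:
        print("watermark found: label this content as AI-generated")
```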

I don’t have the final answers, but I do have a strong feeling we should not just go blindly into this new world. I want a world in which I can write a screenplay, and have that screenplay automatically translated into a film. But I don’t want a world in which there is no shared reality, where everything is “fake news” and “alternative facts”. We are already too close to that reality, and taking another giant leap in that direction is intimidating.
