Happy Monday and welcome to another week of Artificial Insights – brought to you from a rainy Amsterdam. I have lived here on and off over the past decades for work and study, and am thrilled to be living here full-time again. Part of the reason for moving back relates to some of the subjects discussed in this very newsletter, and I will be sharing a lot more in the coming weeks and months.
All news this week revolves around AI being used for simulation. A few days ago OpenAI showcased Sora, their video generation model, which is nothing short of amazing. The most interesting aspect of Sora is how existing generative models and architectures can be applied to generating high-fidelity videos that are 99% convincing.
By learning from existing video data, Sora represents video as “visual patches”, the visual equivalent of textual tokens. These patches are then used to generate video snippets from a text or image prompt by simulating entire artificial worlds. My understanding of the research paper is that Sora renders the output of a simulated physical or digital process, which means the model is tapping into entire simulated realities.
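To make the “patches as tokens” idea a bit more concrete, here is a minimal sketch of how a video clip might be sliced into spacetime patches. The patch sizes are made up, and the real model patchifies a compressed latent representation rather than raw pixels, so treat this purely as an illustration of the idea rather than how Sora actually does it.

```python
import numpy as np

def to_spacetime_patches(video, pt=4, ph=16, pw=16):
    """Split a video tensor (T, H, W, C) into flat "spacetime patches".

    Each patch covers `pt` frames and a `ph` x `pw` pixel region and is
    flattened into one vector -- the visual analogue of a text token.
    Patch sizes here are illustrative, not Sora's actual values.
    """
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)      # group the patch-grid dims first
    return v.reshape(-1, pt * ph * pw * C)    # (num_patches, patch_dim)

# Example: a 16-frame 128x128 RGB clip becomes a sequence of 256 "tokens".
clip = np.random.rand(16, 128, 128, 3)
patches = to_spacetime_patches(clip)
print(patches.shape)  # (256, 3072)
```

A transformer can then attend over this sequence exactly as a language model attends over word tokens, which is what makes the “video tokens” framing so powerful.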
We talked briefly about generative environments a couple of weeks ago, and now more than ever I am convinced that this particular intersection is among the most interesting areas of emerging technology. I half-expected MidJourney to get there first, but it seems the underlying technology is more widely available than I thought. Immersive virtual worlds are on the horizon, whether we are ready for them or not.
Until next week,
MZ
P.S. To celebrate 40 issues of Artificial Insights, I am finally launching a group chat for readers to connect and discuss. Free to join on WhatsApp 💬.
“Every single pixel will be generated soon. Not rendered: generated”
Jensen Huang, CEO of NVIDIA
Demo video by OpenAI
MKBHD explains the impact of Sora better than anyone else IMO
Jim Fan on Twitter explains Sora’s simulation model and its implications
If you think OpenAI Sora is a creative toy like DALLE, ... think again. Sora is a data-driven physics engine. It is a simulation of many worlds, real or fantastical. The simulator learns intricate rendering, "intuitive" physics, long-horizon reasoning, and semantic grounding, all by some denoising and gradient maths.
I won't be surprised if Sora is trained on lots of synthetic data using Unreal Engine 5. It has to be!
Let's break down the following video. Prompt: "Photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee."
- The simulator instantiates two exquisite 3D assets: pirate ships with different decorations. Sora has to solve text-to-3D implicitly in its latent space.
- The 3D objects are consistently animated as they sail and avoid each other's paths.
- Fluid dynamics of the coffee, even the foams that form around the ships. Fluid simulation is an entire sub-field of computer graphics, which traditionally requires very complex algorithms and equations.
- Photorealism, almost like rendering with raytracing.
- The simulator takes into account the small size of the cup compared to oceans, and applies tilt-shift photography to give a "minuscule" vibe.
- The semantics of the scene do not exist in the real world, but the engine still implements the correct physical rules that we expect.
Next up: add more modalities and conditioning, then we have a full data-driven UE that will replace all the hand-engineered graphics pipelines.
Apparently some folks don't get "data-driven physics engine", so let me clarify. Sora is an end-to-end, diffusion transformer model. It inputs text/image and outputs video pixels directly. Sora learns a physics engine implicitly in the neural parameters by gradient descent through massive amounts of videos.
Sora is a learnable simulator, or "world model". Of course it does not call UE5 explicitly in the loop, but it's possible that UE5-generated (text, video) pairs are added as synthetic data to the training set.
Don’t miss the Sora announcement research paper
We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.
Dream Machines
Check out this essay on Sora.
Interview with Yann LeCun at the World Government Summit
All about open-source models and why they are the path forward at Meta and beyond.
“It’s like having a staff of really smart people working for you. We shouldn’t feel threatened by this.”
Human tasks that machines could automate →
Via Greg Kamradt on Twitter
I thought this was a cool question/tweet from @yoheinakajima
Then I saw this diagram which made me think of it. As the dark area grows (more tech is created)...
1. The dark area consumes more white space (it eats up jobs)
2. The white space grows into the grey (more jobs get created)
Groq.com - hyperfast LLM responses
Give it a try. Responds in milliseconds instead of seconds, running on Groq's custom LPU chips rather than GPUs.
Welcome Groq® Prompter! Are you ready to experience the world's fastest Large Language Model (LLM)? We'd suggest asking about a piece of history, requesting a guide on how to achieve your new year resolution, or copying and pasting in some text to be translated by prompting, "Make it French."
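If you would rather call Groq from code than from the web demo, the sketch below shows roughly what that looks like with their Python client. The package, model id and API surface are my assumptions based on their documentation at the time of writing, so double-check their docs before relying on it.

```python
# A minimal sketch of calling Groq's hosted LLMs from Python.
# Assumes the `groq` package (pip install groq) and a GROQ_API_KEY env var;
# the model name below is illustrative -- check groq.com for current options.
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

response = client.chat.completions.create(
    model="mixtral-8x7b-32768",  # illustrative model id
    messages=[{"role": "user", "content": "Make it French: Hello, world!"}],
)

print(response.choices[0].message.content)
```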
If Artificial Insights makes sense to you, please help us out by:
Subscribing to the weekly newsletter on Substack.
Joining our WhatsApp group.
Following the weekly newsletter on LinkedIn.
Forwarding this issue to colleagues and friends.
Sharing the newsletter on your socials.
Commenting with your favorite talks and thinkers.
Artificial Insights is written by Michell Zappa, CEO and founder of Envisioning, a technology research institute.
You are receiving this newsletter because you signed up on envisioning.io.