Google's VEO3, NBA Finals Ad made with AI, Creating Video Consistency
Pierson Marks (00:00.377)
Nice. Cool. Well, this is episode one of the yet-to-be-named generative media podcast. It's super exciting. I think we talked about this for a little bit on our weekly Wednesdays. So to give the audience a little bit of background, my name is Pierson Marks. I'm the co-founder and CEO of JellyPod, and this is Bilal.
He's our founding product engineer. And we were talking a little bit about, hey, JellyPod is an AI podcast studio and there's a lot of interesting stuff going on in the world of AI. A lot of people are covering large language models — very technical details, technical news — but we didn't think there was enough coverage around
Bilal Tahir (00:31.534)
Nice to meet you guys.
Pierson Marks (01:00.057)
the creative space and how AI is impacting generation of media, audio, speech, video, images, and how those could be really used in people's day-to-day lives, whether that's in your business or personally, coming from people that are in the trenches every day. So we're building a creative platform called JellyPod that allows people to define hosts and create AI-generated podcasts.
And we're experiencing this really cool change in how people are creating content. And we wanted to share a little bit more about it weekly: what's come out in the last week, what have we been using, what are people in the industry excited about in the generative media space. And yeah, this is our first podcast, so it's funny.
Bilal Tahir (01:56.312)
So, rough edges — we're figuring out the format ourselves. It's exciting. There's so many cool tools out there, so I'm excited to talk about it, and even experiment and create and share stuff with you. Like Pierson said, we use these tools day in, day out, so we're very familiar with them. And we want you guys to get the benefit of our expertise as well.
Pierson Marks (01:58.029)
Rough edges.
Pierson Marks (02:20.982)
Right, totally. And I know today we're going to really try to focus on two of the big things that happened in the last week or so: Google's release of Veo 3, which is a text-to-video model with audio-native generations — so the video also includes audio — and then ElevenLabs v3, which is a text-to-speech AI model that's very expressive. It's probably the best one that at least I've seen so far.
And we're gonna jump into those really shortly. But I thought it'd be kind of interesting, before we even jump into this, to give the audience a little bit of background on why you like AI media. Just give a little intro.
Bilal Tahir (03:07.65)
Yeah, I think both of us have been super passionate about generative media. That's why we love building JellyPod, which — if anyone doesn't know — is an AI podcast platform. We let you generate AI podcasts, but in your own voice. We let you customize it: you can create the perfect script, the host that sounds just like the way you want it to,
or hosts, and we let you generate the audio and the video in different formats, et cetera, and publish to Spotify, YouTube, et cetera. So we're an end-to-end solution for that. I think the reason we love building in this space is because it's just so exciting. I firmly believe we are going through this Cambrian explosion of ideas, because of what's happened in the last few years:
we've gotten these tools that started with, let's say, Stable Diffusion with images, and now obviously up to Veo 3 — we're gonna talk more about it — where you can generate high-quality, amazing, realistic video just with a prompt. It just enables you to go wild, just go crazy with your ideas, and, you know,
I don't know, it's like, it's hard for me to even describe like, you know, how excited I am because like, I feel like we're gonna see this insane surge of ideas and creativity and content coming out in the next few years that is just gonna...
be so insane, because there are so many people like me especially who — I'm not an artist, I can't draw, I can't paint, but I have ideas in my head, and for the longest time they just stayed in my head. But now I feel like I have an avenue where I can explore that. So I'm excited for it as a consumer, because I want to use these tools and create art, you know, my own show or whatever. And on the other side, as a tech geek, I'm excited for taking the engineering behind it and building
Pierson Marks (04:40.824)
Certainly.
Pierson Marks (04:48.546)
Mm-hmm.
Bilal Tahir (05:10.128)
solutions that enable other people to do it more easily. What about you — what excites you about generative AI?
Pierson Marks (05:17.048)
Totally. I mean, you just mentioned something. So to give a little background: when I was young, I was really passionate about... one of my first experiences in computing was building video games and websites. So I liked Club Penguin. I really liked this game, and this is a story that maybe you've heard, but yeah, I liked Club Penguin. And in the third grade, I learned a lot about
the game, and there were these things called pins — every month there was a new pin located somewhere in the map, and I'd go find it. And I was like, how do I share this with the world, like where this pin was found? I was really young, and I was like, how do I create a basic blog? And I created something like that. I would post about once a month about where the new pin was and some secrets about the game. And I was like, I really like video games. And I spent a lot of my time
playing video games, and then I was like, well, why can't I just try to make a video game? And there was software out there like Blender. I forget what the physics engine was back then. It wasn't Unreal, there was another engine that was used... but yeah, it was Unity. Yeah, it was Unity. And playing around with that, it was pretty hard. I mean, very complicated software, just even to render some basic, I don't know, like,
render a building or a coffee mug. You had to extrude edges, and the amount of shortcuts you had to learn to do these things... I just wanted to create this really sweet race car in Blender, make a 3D model of it, and you had to go through hours and hours of tutorials just to be able to create the wheel. And it was frustrating as a kid. It was rewarding, but it was really slow. It was a slow process of creation, where I had something in my head
that I could describe pretty well, but the medium with which to convert it from this idea in my head to something that's actually rendered — that I could see on screen and make changes to — was difficult. And because of generative AI, I mean, you now have the ability to explain clearly the thoughts that are in your head and no longer have to be constrained by the medium of the software application to be able to create the thing. So with
Pierson Marks (07:42.84)
generative video, generative image, you can just explain, like, hey, I want an image where there's a race car that's red. The wheels need to look like this. It needs to be tinted. It needs to have a spoiler. It needs to do all these things. And you can iterate on that very easily, but you can also just one-shot prompt it, and it probably can do it pretty well as long as you're describing the thing well enough. That's what kind of makes me excited. I mean, for as long as I can remember, I've just
loved to create — from Photoshop and Adobe Illustrator to 3D modeling and everything. It's just always fun. I mean, it really should be fun too.
Bilal Tahir (08:21.816)
Yeah, no, 100%. I mean, it's just such a cool, interesting space, because we are basically in the business of making your ideas come true. I think when I was younger, I watched a documentary on the Imagineers at Disney. Their whole job is literally to just come up with cool ideas. And I'm like,
Pierson Marks (08:43.159)
Totally.
Bilal Tahir (08:44.014)
I would love to get paid just to sit around and think of what kind of animatronic robot we should make for the Disney park. That's such a cool job, right?
In a way, gen AI kind of lets us do that in a smaller way for ourselves. We can just be our own Imagineers. So maybe that should be the podcast name, I don't know. Please comment if anyone sees this first one — tell us if you have any ideas for the podcast name. Yeah.
Pierson Marks (09:01.943)
Right. Imagineers. Yeah. Don't sue us, Disney. Don't sue us.
Pierson Marks (09:16.779)
Yeah, no, please do. That'd be awesome. I'm a huge Star Wars fan, and I know there's this big lawsuit right now going on with Disney and Midjourney, which will be very interesting to talk about. I don't think we'll get into the copyright stuff today, but this will be a theme, I think, especially as this case unfolds.
Bilal Tahir (09:34.274)
Yeah, I mean, because it starts with Midjourney, but I think it'll impact all the labs, because they've all trained on that data. So it's interesting.
Pierson Marks (09:38.264)
Totally, totally. Yeah, it'll be really interesting. And I think there are valid arguments on both sides, and it'll really reshape the landscape when it comes to model training and inference. So I don't think we'll touch on it today, but if the audience wants us to keep track of what's happening in that lawsuit, it'd be interesting, and maybe we'll touch on it.
Bilal Tahir (10:06.254)
Yeah, hopefully we make it a few episodes before we have to add not financial advice as a disclaimer.
Pierson Marks (10:12.599)
Yes, no, sure. Yeah, totally. Well, I was thinking, okay — this is episode one, and just for anyone listening, we're gonna really focus on media creation. We're not gonna be talking about the latest, you know, 30-billion-parameter small models or any of the above. We're very focused on higher-level, leveraged
activities around, you know, what's new in the media creation space, how people are using it, product releases, and things you can use today. And we'll break them down as people that are using these day in, day out and staying up to date with all the announcements across this space. So I think the first thing we should just jump right into is Veo 3. We talked about it a little bit. So — what is Veo 3,
and why should people care?
Bilal Tahir (11:10.606)
Yeah, I mean, Veo 3 was a step change. It was crazy. We've had text-to-video models for a while.
Kind of setting the stage before Veo 3 — well, first of all, what is a text-to-video model? Basically, you put in a prompt and you get a video, usually a five-to-ten-second video, out. That's been the flow. And I would say the best models — it started with Sora, I guess. It's funny, even at that time I wasn't that impressed. I don't know, maybe I didn't quite get it, but I was like, okay, Sora's cool. But it was crazy — this five-second video of a woman walking blew people's minds
because of how real the physics were. But it cost like a hundred dollars apparently, there was no access, it was closed source. And — this is why I love the space — people were like, all right, we're gonna make our own open source models and we're gonna catch up, and that's what they did. A lot of companies, particularly in China, caught up really quickly. And now I would say the state of the art before Veo 3 was Kling, and Kling 2.1, which is on fal — they also have their own app —
but I recommend just checking it out on fal, because you don't need a subscription, you just need to put in 20 bucks or whatever. And this thing is really good — it would give you a 5-to-10-second video, but no sound, just video. And then what Veo 3 did — Google had Veo 2 before, and Veo 2 was decent but nothing crazy — Veo 3
added sound. And it was also really good: it basically matched and exceeded in terms of video quality, which, I mean, you expect with every version bump, so you're like, yeah. But then the big thing that people were
Bilal Tahir (12:54.082)
excited about was the sound. So you can actually have people talk in the video, or have sound effects — like, you know, someone walking, and you can have the leaves rustling as they walk — and it's actually really good. It actually nails it. Not always — sometimes you just get weird mismatches — but it does do a really great job, and people have been generating videos — we'll talk about the Kalshi example — which are basically as good as a hundred-thousand-dollar commercial. You can kind of get there
now, for the first time. People see a direct path, especially if you know what you're doing — if you understand how to direct a video, cinematography, et cetera. So I think this is why people are so excited. They're like, wow, this is going to be a huge change for people who understand how to create a good shot.
Pierson Marks (13:45.142)
So before Veo 3, were there no models out there that generated audio natively?
Bilal Tahir (13:54.138)
Not natively. I think what people did was they would generate a sound from ElevenLabs or something and then kind of stitch them together. But as far as I know, I did not hear of one. There was this other one which was interesting, where you could give it a video and it would generate the sound for the video. So that was another step, and I wonder under the hood if they kind of do that. But I forget
the model name — it's called effects or something — but, let's say a horse is running: you can give it the input video and then you can say, add
the horse-running effect, and it adds that. But that was another post-processing step. So this combines it and makes it obviously much simpler. It is expensive though, just to kind of set the stage. So Veo 3, for a five-second — or eight-second — video, which is what you get on fal, you have to pay $6, so that's about 75 cents per second. So it is not cheap. And...
It really depends on where you're coming from. If you're coming from creating a $100,000, 20-minute commercial, you spend $1,000 and maybe it's cheaper for you. But for hobbyists — we tried playing around, and we were suddenly down a whole $200 in a couple of days. We were like,
Pierson Marks (15:10.55)
I know I got the notification. I'm like your account balance is critically low. Please refresh or refill your credits. I'm like, whoa, it's crazy. I put a hundred bucks on that today.
Bilal Tahir (15:15.638)
Right. Yeah. Yeah, yeah. So yeah, but I do think they already have a Veo 3 Fast or whatever, I think, on their app. So it's also available on Google Ultra — they have their own subscription tier. I haven't checked that out. But, oh sorry, you told me about — what's the Google platform called, Flow?
Flow 2? Flow, I think it's Flow. Yeah, so you can use Veo 3 there. And I think they are introducing Veo 3 Fast, and I assume we'll get that hopefully next week. Yeah.
Pierson Marks (15:44.906)
flow.
Pierson Marks (15:56.18)
Right. So Veo 3 Fast — that's going to be cheaper and faster than Veo 3.
Bilal Tahir (16:00.984)
Hopefully, yeah. Usually — I mean, you distill it down. So you have a premium model, and then you distill it down and the quality drops a bit, but you save money. And — not to dump too much information all at once — Veo 3 was not the only release. Google also released Imagen 4. Imagen is Google's image model, and they've been routinely releasing better versions of it. So they released Imagen 4, which is really good.
But they also released Imagen 4 Ultra and Fast, so you can get a cheaper image, a standard image, and a really good image. But even the really good one is only $0.075, which is, you know, almost like a penny per image, and it's really good. So 75 — yeah, per image. It's really good.
Pierson Marks (16:49.878)
Wait, wait, .075?
Which is like 10 cents, almost.
Bilal Tahir (16:58.808)
Point — is it point? Yeah, yeah, yeah, you're right. Yeah, point zero eight. So eight cents, eight cents per image. But for the detail you get, it's pretty good. And I guess we'll share all these links in the comments once we post, but yeah.
Pierson Marks (16:59.766)
Was it .075 or .0075?
Okay, okay. Gotcha.
Totally.
Bilal Tahir (17:17.678)
They released that, and so they did the Fast and Ultra, and I think they're gonna do the same with the video model. I don't know if they'll do an Ultra, but they'll definitely do a faster version. And Kling does the same. Kling has Master, which is their premium video model; then they have Pro, which is, I guess, the middle; and then a flash version, which is called Standard. So —
Pierson Marks (17:39.126)
Gotcha. So a lot of...
Bilal Tahir (17:40.418)
So it's interesting. I think it's funny — you always get the three-tier system. It's a law of nature or something; the three-tier system appears everywhere.
Pierson Marks (17:46.678)
It's funny.
Right. It gets frustrating. I mean, the other day I saw something on Reddit talking about, you know, OpenAI's models, and naming is very hard. And you can see this if you're in this space — if you're watching this video, you know that naming is tough. You'll get o4, you'll get 4o, you get mini, high, ultra, flash. For any standard person that's not paying attention — I mean, even people that are paying attention — you're like, ooh,
what is going on? Which is better than what? What do I care about? Is 4.5 better than 4.1? But no. And then all this craziness. So, I mean, just to wrap up and summarize these releases: Google, I think it's now two weeks ago or so, released Veo 3, which was the first video model that was able to create audio
natively. So you get eight seconds of video, the video is high quality, it's really good, the lips are synced with the audio, and you can just prompt it to generate audio as well. So you have Veo 3, and they have Veo 3 Fast that just came out, which would be a cheaper, quicker model. And then Google also has an image model, Imagen — Imagen 4 — and that's just like
the same thing as OpenAI's ChatGPT image generation? Is it like the same, for people? Pretty much. Right.
Bilal Tahir (19:19.874)
Yeah, yeah, it's a text-to-image model. But yeah, I would say it's pretty good. It's hard, in the image space, to see which one's the best, because it used to be either Stable Diffusion or Midjourney — and Midjourney is still, I would say, the most artistic. It's very hard to evaluate which is the best in images, but I would say right now Imagen 4 is up there. Obviously ChatGPT's — gpt-image-1, I think, is the technical name for ChatGPT's image model —
it just takes 30 seconds to load. I mean, it's probably the slowest one, but the quality is really good there. And then there's Recraft, which I think is really good — I would say it was pretty much the best outside of these — I forget the exact name of the company.
But then there's also Black Forest Labs, which is probably one of my favorite companies, because they released Flux, and Flux was a game changer. And they did the classic three tiers: they have Flux Schnell, which is the smallest and cheapest, and it's ridiculous — I think you can generate an image for $0.0003 or something like that. There's Flux Dev, which is the standard model. And then there's Flux Pro, which is the best, highest quality. And they actually released another one called Flux Kontext. I know I'm throwing a lot out, but Kontext is another one where
Pierson Marks (20:30.975)
You
Bilal Tahir (20:33.152)
you can give it an image and basically get an edited image back, which is similar to ChatGPT. But I would say it's probably better than ChatGPT in terms of editing and maintaining the original quality, because ChatGPT has this sheen — almost like, you know, the Ghibli effect somehow bleeds over even to normal photos, at least in my experience. So I feel like there's a tinge to ChatGPT.
Pierson Marks (21:00.637)
You're right. Now that you mention it, every photo that I've generated through ChatGPT kind of feels like a polished apple. You know, when you get an apple from the supermarket, it's not shiny, but it's polished in a kind of way. I don't know.
Bilal Tahir (21:14.478)
Yeah, yeah, yeah, exactly. So Flux Kontext is really good. And Ideogram — if you want text, I would say Ideogram is probably up there. They released their V3 model, I think it's probably a month ago now, it's been a while. But they're focusing more on generating text accurately — that's their use case, I guess — and it's really good, and cheap-ish. And
Pierson Marks (21:40.469)
Totally.
Bilal Tahir (21:41.144)
there was another one I forgot — another model came out that caused a lot of stir. It was called Re-something, not Remix... but yeah, that's the thing. I mean, there are so many competitors, it's crazy. But yeah.
Pierson Marks (21:55.221)
So many. It's hard to — I don't know, for me at least, you see once a week, once every other week, like, oh, a new image model came out. What does it do? Everyone gets really excited. Like when Flux — the one you just mentioned, what was the one where you could do the text-based editing? Kontext, yes, Flux Kontext. It was just really good at taking an image, putting text in there, maintaining the context
Bilal Tahir (22:14.83)
Kontext.
Pierson Marks (22:24.116)
throughout generation. So I could generate image one, and then image two needs to make a small change somewhere. And it was really good at taking some base image and applying a change and generating the new one with that change.
Bilal Tahir (22:39.532)
Right. And if you've ever done that in ChatGPT, you know how frustrating it is, because you take an image, you edit it, but then it's like whack-a-mole: it fixes that, but then it causes another thing, and you're like, no, I'm so close to the perfect image. Just give me
Pierson Marks (22:46.898)
Right.
Pierson Marks (22:51.7)
Totally.
Bilal Tahir (22:52.654)
the original image, go back, go back — and you can't really go back at that point. Because once the model goes down a different path, it's very hard to set it back to the original course. I know. I found that, and the 30-second load time obviously makes it even more frustrating. So I've had a frustrating time editing images. So I think Kontext is very interesting. And actually — we've talked about text-to-video models, but there's a whole other class of models called image-to-video. And actually, most people,
Pierson Marks (23:02.708)
Totally.
Bilal Tahir (23:22.608)
before Veo 3 — and Veo 3 is only available as text-to-video; I imagine image-to-video will come out soon — but if you actually look at the people who produce the most, you know, who are doing this professionally or day in, day out, they almost always like the image-to-video models. And the reason is you have a lot more control, because with text-to-video you can describe a prompt and a frame and it may get it or it may not. But if you
generate the first image first, you have a lot more control, because you control a lot of how the scene starts. And so people like that flow a lot more. So they'll generate the first image using Midjourney — a lot of times I see the Midjourney-plus-Kling combo, because Midjourney gives you the most beautiful, expressive images — so they'll get that and then they'll generate the five-second shot from that. Very interesting.
Pierson Marks (23:57.685)
Right.
Bilal Tahir (24:16.462)
And the reason I mentioned that in this context is because of Kontext. With Kontext, what a lot of people, me included, are excited by is that you can have consistent characters through a story: you can take a base character image, generate different first frames, and then generate the five-second shots and stitch them together. And it actually will basically give you the same character throughout the story. That's been a huge challenge, keeping consistency — because if you want to generate
Pierson Marks (24:16.99)
So.
Bilal Tahir (24:46.596)
even a one-to-two-minute short, that's 20, 30 videos. And so you quickly lose that consistency, even across a short story.
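For anyone who wants to try that consistent-character flow programmatically, here is a minimal sketch using fal's Python client with a Kontext-style image-editing endpoint. The endpoint ID, argument names, and response shape are assumptions made for illustration; check fal's model pages for the real ones.

```python
# Sketch: keep one character consistent across several "first frames" by editing a
# single base character image with an instruction-driven editing model (a Flux
# Kontext-style endpoint). Endpoint ID, argument names, and response shape are assumed.
# Requires: pip install fal-client, and the FAL_KEY environment variable set.
import fal_client

BASE_CHARACTER_URL = "https://example.com/character.png"  # hypothetical base image

SCENES = [
    "same character standing in a neon-lit alley at night, rain falling",
    "same character sitting in a sunlit coffee shop, reading a newspaper",
    "same character walking along a desert highway at golden hour",
]

def edit_image(image_url: str, instruction: str) -> str:
    # subscribe() queues the job and waits for the result
    result = fal_client.subscribe(
        "fal-ai/flux-pro/kontext",  # assumed endpoint name; verify on fal.ai
        arguments={"image_url": image_url, "prompt": instruction},
    )
    return result["images"][0]["url"]  # assumed response shape

# One on-model first frame per scene; each can then be fed to an image-to-video
# model to produce the five-second shots you stitch together.
first_frames = [edit_image(BASE_CHARACTER_URL, scene) for scene in SCENES]
print(first_frames)
```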
Pierson Marks (24:55.252)
It's super interesting. I mean, this just sent me down another rabbit hole of thought. So, you know, when you're creating something with any of these models, they're non-deterministic. You give them some prompt, and they'll give you back some version of that prompt in an image. So you say, create an apple on a table — it's going to create an apple on the table, but the table might be black or brown. It'll infer, kind of,
what's not being spoken, and maybe it makes a mistake as well. And so you can kind of focus your efforts: take some prompt, get it to an image that you actually like using something like Kontext, one of these image generation models, and you kind of get that first frame right. Like, hey, I have this apple on this table and there are some flowers in the background, the wall is cream colored,
and there's a window, and that frame is perfect. And you iterate on that first frame. And then you take that and you throw it into an image-to-video model. And so it kind of provides at least a grounding — like a post where that first frame is fixed — and now you have an image.
Bilal Tahir (26:03.138)
Right.
Bilal Tahir (26:13.154)
Yeah, and it's much cheaper and faster too, because you can iterate 30, 50 times. Whereas if you do that with a video, it'll cost you like 10 times more money. And also, you can't just press a button — you have to go wait for 30 seconds, and that latency adds up. I think it disrupts your flow a lot as a creative. Whereas with images, it's almost instant. And there's something magical about instant edits, because you can just be in the flow and go, never mind, change that, change that, change that — boom, boom, boom, boom. And
Pierson Marks (26:39.092)
Totally.
Bilal Tahir (26:40.238)
I actually think — I mean, this is a mid-curve thing, because I think we'll get to a point where we just get real-time video and then we won't worry about the first frame. We'll just generate and regenerate the five-second video, or 10-second, or hopefully just the one-minute video, and you won't care about credits, or "I have to go make a coffee and come back until this process is finished," right? We'll just have that. So it's exciting.
Pierson Marks (26:58.356)
Totally.
It'll be — yeah, because, at least the way I think I work is, if I can just pin a frame — where that first frame is pinned and regenerations just change the remainder of the video... Like, if you can pin the first frame and the last frame of the video and say, hey, I want the first frame to be the apple on the table with the window in the background, and I want the end of that scene to be
the apple rolling off the table onto the ground. And so you generate that first image really well, you know, make it perfect. (I don't know why this thing happens with a thumbs up. Maybe I need to keep my thumbs away from the screen — for anyone listening, a thumbs-up just appeared on my screen.) But yeah, you pin that first image frame, and you have that last image frame, and then the video model kind of
Bilal Tahir (27:40.098)
It always happens to you, I feel like.
Pierson Marks (27:58.982)
expands on how you'd extrapolate from image one to image two. And you can kind of pin different frames — like maybe right in the middle the apple did this weird thing rolling off the table, and you could pin that frame to make sure that when you regenerate, maybe the parts between those frames change, but the frames themselves that you really liked stay consistent. I think we're not there yet. Maybe Google Flow — I think,
From the small time I've spent there, I think that's kind of their goal is like, hey, can we make a creative studio for video creators where they work in this sort of mental model where you generate scenes, you can expand on scenes. If you have a scene one that's eight seconds, you can just expand that scene for another eight seconds and maintain consistency.
Bilal Tahir (28:53.656)
Yeah, and that's a very common flow, where a lot of people generate the first five seconds, then extract the last frame and use that as the first frame for the next shot. So once you're good with that, you can do that. Obviously limited, but yeah, it's interesting, because I feel like...
that's like — you know, what's the joke? Everything becomes a timeline editor. Because you think of this and you immediately think: shot, drag, you can stretch the image or whatever, then you take that, you generate that, stitch them together. But I mean, it does make sense — there's a reason why it's a standard. And I wonder if that basically becomes a CapCut or Photoshop feature, where you just generate the images and drag them into your timeline as an asset. It's like an import, basically. It just becomes a button. So —
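Pulling that last frame out of a clip is a small job with OpenCV. A minimal sketch, independent of any particular video model:

```python
# Sketch: grab the final frame of a generated clip so it can seed the next
# image-to-video generation. Requires: pip install opencv-python
import cv2

def extract_last_frame(video_path: str, output_path: str) -> None:
    cap = cv2.VideoCapture(video_path)
    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Seek to the last frame index and read it
    cap.set(cv2.CAP_PROP_POS_FRAMES, max(frame_count - 1, 0))
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"Could not read the last frame of {video_path}")
    cv2.imwrite(output_path, frame)

# Chain shot 1 into shot 2 by reusing its closing frame as the next first frame
extract_last_frame("shot_01.mp4", "shot_02_first_frame.png")
```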
Pierson Marks (29:38.228)
Right, right, right. I think, yeah, it's something that we've always talked about. It's like, we want to declare what we want versus be prescriptive. I mean, it's much easier to say, hey, I want this, versus here's how to do that thing. And a timeline editor, because it's based in time, it's like, hey, I need the audio to be synchronized to the person's lips at the exact right time,
Bilal Tahir (29:39.101)
Makes sense.
Pierson Marks (30:05.063)
or I want the bird flying over the scene when it squawks — you know, it's like matching up audio. But if it can just do that because it can infer based on the context, we live in a much easier world where it just understands what you're trying to achieve versus sliding and dropping. And so maybe time isn't the right medium. It's more of like a frame, where this frame itself is kind of that post,
Bilal Tahir (30:30.862)
you
Pierson Marks (30:35.121)
where — is it after this frame, before this frame — and the granularity of changes no longer has to be measured in time or in frames, but more in concepts. Like, you have this one scene and you have another scene, and those are concepts, versus measured in seconds or minutes. So —
Bilal Tahir (30:56.222)
Interesting way to break it down, yeah.
Pierson Marks (30:58.227)
And I know we just mentioned a lot of these models. And you mentioned something earlier called fal. So fal is an API provider, a model inference provider that runs these models.
Bilal Tahir (31:13.538)
I would say the two biggest are Replicate and fal, which are basically platforms where they host a bunch of these models and you can go and play around with them. I always recommend using these platforms, because a lot of times non-technical people will use a service which under the hood uses these models — and so obviously you have a markup there, and usually a subscription. With fal and Replicate, they have the latest models. You go,
just pay for what you use, and that's it. And it's just so much easier. It's primarily for technical people, because they have an API and you can use that, but they also have a playground. So even if you're not technical, you can go in, put in a prompt, and hit the button and get the video. Like Veo 3 — it's just a page. It's easy to use. They have a nice, basic UI. So even if you're not technical and you're hearing this and you're like, I don't know anything about fal or Replicate — just go sign up and use it. It's very simple
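For the technical listeners, calling one of these hosted models through fal's Python client looks roughly like this. A minimal sketch: the model ID string and parameter names below are assumptions, so check fal's docs before running.

```python
# Sketch: generate a clip from a prompt through fal's Python client.
# Requires: pip install fal-client, and your API key in the FAL_KEY env variable.
# The model ID and argument names are illustrative, not confirmed.
import fal_client

def generate_clip(prompt: str) -> str:
    # subscribe() submits the job to fal's queue and blocks until it finishes
    result = fal_client.subscribe(
        "fal-ai/veo3",                  # assumed model ID; verify on fal.ai
        arguments={
            "prompt": prompt,
            "aspect_ratio": "16:9",     # assumed parameter name
        },
    )
    # Hosted models typically return a URL to the rendered asset
    return result["video"]["url"]       # assumed response shape

if __name__ == "__main__":
    url = generate_clip("A red race car drifting through a rain-soaked city at night")
    print("Clip ready at:", url)
```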
Pierson Marks (32:09.395)
Totally. So like all these, so, right.
Bilal Tahir (32:10.6)
to use, and they have hundreds of models — hundreds of models, you know. I've given them flak about making the discovery better; it's not the best, you know. I actually built a website called Replica
that was doing a better job of using their API to discover the models. And I had to deprecate that because I got rugged by PlanetScale, which is a whole different story. But yeah, it can be challenging to find the models and stuff, so you have to look for them — but there's a lot of gold there, almost like hidden gold. People have found models where some guy uploaded an open source model that has like 15 runs — because you can see how many people have run it — and then, you know, they use it. It's a hidden trick, almost, right?
Pierson Marks (32:33.169)
Hahaha.
Pierson Marks (32:52.657)
Right. I was looking today — I saw some models on there that were just like cartoonify and posterize. There are these things where they're essentially filters, almost — you can imagine them like filters — but it was super cool where it was actually generating these new cartoon, like Marvel comics. Like yesterday you showed something where it was, you know, the four-panel, two-by-two comic-book-style image, where you might have a model in there that is just trained to do that specifically. So you can make these comic books.
Bilal Tahir (32:58.541)
Yeah.
Pierson Marks (33:22.099)
And that can be really cool for memes. I mean, if you're a meme generator account or you just want to grow your business through memes — like comic book memes — maybe that's the right way to do it, versus going to ChatGPT and hoping that it can just generate that consistent comic book style.
Bilal Tahir (33:25.026)
Right.
Bilal Tahir (33:39.438)
No, 100%. I mean, there are literally people that build businesses where
they take a general editing model, they have a perfect prompt to, maybe, cartoonify or whatever, and they build a cartoonify app — and people actually will pay for that. It sounds like, why wouldn't they just edit it themselves? But people just want the solution. They're like, I don't want to figure out the prompt — or they probably don't even know there's a model that can do everything; they think the guy probably has his own model. But under the hood, nine out of ten times, they have a general model they've just customized to this one or two niches, and they give you one button or a dropdown of cartoonify,
Ghibli-fy, black-and-white-ify, make a manga of it, right? And you're like, yeah, I'll pay the full premium on that. So — lots of opportunity. A lot of people just literally take these models and build a nice wrapper on top of them.
Pierson Marks (34:15.42)
Right, totally. for sure.
Pierson Marks (34:21.874)
Totally, totally.
Pierson Marks (34:28.018)
Right, right. This is super interesting. And I mean, just to wrap up on the video model side too: I know that we talked about Veo 3, the differences between Veo 3, Flux, all the other Imagen models. They can be run on fal for, you said, $0.08 essentially for an image, and then $6 —
Bilal Tahir (34:49.55)
Yes, $6 for 8 seconds, I would say.
Yeah, I would say — I mean, just to give you guys this: if you're budget conscious, Kling is probably the best. It's 25 cents for the standard one, which, for five seconds, is reasonable. Another one which doesn't get love is LTX, and LTX is awesome. It's very basic — it'll give you crappy video, like, you can't do something complex with it — but if you just want a simple image of an astronaut getting up from the ground, which is their example, or a zoom-in effect or something, you can generate that for four cents:
a 4-cent video for 5 seconds. That's the distilled version of the 13-billion-parameter model. And so I use that a lot because of that instant feedback. So I can play out the video, and if I like something approximately and I can see that with a better model I could actually get the shot, I'll then take that prompt and go to Kling. And so —
Pierson Marks (35:23.152)
Mm. Wow.
Pierson Marks (35:43.368)
interesting.
Bilal Tahir (35:44.428)
So it's like what we talked about with images — perfect the first frame and then go to video — but you can also perfect the first image, then use a crappy video model, spend the four cents, or let's say 20 cents on five iterations, perfect the video shot, and then go to Kling or Veo 3, you know? So it saves you a lot of money.
Pierson Marks (35:59.58)
Wow.
That's a super interesting workflow.
Bilal Tahir (36:02.762)
So I've actually thought about it — I'm sure somebody can just build that wrapper. You almost go: choose an image — which do you like? — choose this image, then choose a video, and then go. Right? You can build that pipeline, almost, for people.
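That draft-cheap-then-upgrade pipeline might look roughly like this end to end. All model IDs and argument names here are assumptions made for illustration; the point is the three-stage flow, not the exact endpoints.

```python
# Sketch of the workflow described above: lock the first frame with a cheap image
# model, iterate on a cheap/fast video model (LTX-style) until the motion looks
# right, then spend real money on a premium model (Kling / Veo) for the final shot.
# Requires: pip install fal-client, FAL_KEY set. Model IDs and params are illustrative.
import fal_client

def run(model_id: str, args: dict) -> dict:
    return fal_client.subscribe(model_id, arguments=args)

prompt = "An astronaut slowly getting up from the ground, dust drifting, wide shot"

# 1. Cheap first frame you can iterate on until it is exactly right
frame_url = run("fal-ai/flux/schnell", {"prompt": prompt})["images"][0]["url"]

# 2. Cheap draft video (a few cents) to sanity-check motion and framing
draft = run(
    "fal-ai/ltx-video/image-to-video",                   # assumed endpoint
    {"prompt": prompt, "image_url": frame_url},
)

# 3. Only once you're happy, pay for the premium render
final = run(
    "fal-ai/kling-video/v2.1/standard/image-to-video",   # assumed endpoint
    {"prompt": prompt, "image_url": frame_url},
)
print("Final clip:", final["video"]["url"])              # assumed response shape
```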
Pierson Marks (36:13.522)
Right, that's super cool. It's like an upscaler, where you go from image to video, low-quality video to high-quality video. That's interesting. Wow, no, that's cool.
Bilal Tahir (36:20.983)
Yes.
Yeah, sorry, not to pile on, but you said upscale — people then take the video from Kling, which is 720p, and there's also a model called Topaz. I mean, they're a service, an OG service, but they're on fal or Replicate too, I think, and you can upscale it to 4K. So that's another step people do who generate the high-quality videos and stuff. So it's really exciting. And just to wrap up the video section, because we haven't mentioned Kalshi: so, the potential of this — there's a guy called PJ
Accetturo — I don't know how to pronounce his last name — but he's a very talented videographer. He's worked with big companies on commercials, and he's pretty big on Twitter because he's been making amazing viral videos using Kling and other video models. One of his early videos went viral because he took Princess Mononoke, which is a Studio Ghibli movie — the two-minute trailer — and he took the shots, went to Midjourney, and generated a real live-action shot of the
Pierson Marks (36:56.7)
PJ, let's just do PJ. Right.
Bilal Tahir (37:22.096)
animated frame, then went to Kling and generated the video, and then he stitched it together. It was a very good frame-by-frame live-action trailer, and he got, I think, 20 million views on that. It blew up. He actually got death threats — he took it down because he got death threats from people who were like, you know, it's blasphemy, and Miyazaki is crying right now, or whatever, blah blah blah.
So he took it down. But then, you know, he was like, screw it, and he's doing other stuff too. And the latest thing he did was a commercial for Kalshi. Kalshi is a prediction market company — it's blown up a lot in the last couple of years. And they did an NBA Finals ad, which would have cost them $400,000. He did it for $500.
Hopefully he pocketed the change, or a bunch of it. And it's a really good commercial. It's one of the first commercials, I think, where a big company has taken a slot out in something like an NBA Finals and done an almost fully AI-generated thing. I think by the next Super Bowl you're gonna see a lot of these AI ads — I think that's where we're going now. So yeah, the potential is, you know, crazy. So if you guys get in now and perfect these
Pierson Marks (38:32.689)
for sure. That'd be nuts.
Bilal Tahir (38:40.56)
tools — I mean, it's like the wild west days right now, the early days of this creativity and art — and if you perfect, you know, just develop prompting skills, because it does require skill to get the most out of these tools, there's a lot of demand out there for people who understand these tools and can get the most out of them.
Pierson Marks (38:58.563)
No, it's super interesting. We'll put the link to the Kalshi ad in the show notes. And I think you also just talked about these really cool best practices and the workflow that you go through — I mean, going from image to video, low-quality video to high-quality video, and what PJ does also. If you don't —
Bilal Tahir (39:02.573)
Yeah.
Bilal Tahir (39:06.604)
Yeah.
Bilal Tahir (39:20.716)
Right, yeah, PJ did an interview with Greg — I think it's a very good interview where he talks more about it. He's also tweeted about how he comes up with the shots; his prompts are very descriptive, and they're also available. And yeah, as they say, good artists copy, great artists steal. So I would just go on Google and see, you know, what's popping, like —
you can take their prompt, or you can even take their screenshot and go to ChatGPT and be like, describe this — reverse-engineer the prompt out. Like, how would you describe this image? Take that to Replicate and fal. They also have guidance on how to best prompt Veo 3, so they have that. Take that,
give it to ChatGPT, and get the shot. Somebody said the best way to prompt AI is to ask AI how to prompt it. So just go meta on it and start there. And then I feel like once you use it, you develop your own style. You're like, I know I like the dolly shot, or this — this is what makes it work. Our other co-founder, who's not here with us right now — he's on vacation — Jason, he's been playing around with Veo 3.
Pierson Marks (40:11.313)
Call it.
Bilal Tahir (40:31.92)
He was making these really cool notebooks, like talking notebooks, in Veo 3. And so, seeing his prompts, he's developing his own style: okay, this shot works there, this is how you get the lip sync working. So as you get experience, you get better at it.
Pierson Marks (40:48.009)
And I think the big thing to take away from all this is that it's so cheap now to be creative — to take something that's in your head and get it into real life, whether it's an image or a video. It's cheap. It maybe takes 30 seconds and a few cents to do it. Years ago, if you wanted to be a graphic artist, you had to really learn these programs, and you had to be able to understand the UIs, and —
Bilal Tahir (40:54.851)
Yes.
Pierson Marks (41:15.313)
to get somewhere, it was impossible for most people. And now people have this tool that's so easy — just explain, you know, hey, this is what I want, generate this for me — and then your creativity can be spent on the story. And, you know, to be able to think of, hey, I want to generate this NBA Finals ad that has this guy sitting in a pool of eggs, or, you know — those concepts. It allows you to spend more time on
how can I make the content itself better — higher quality, better storytelling. It doesn't take away from that stuff. And it allows more people to have access to these tools, through something like fal, through something like Google Flow, that are very easy to use. And I don't think enough people are aware that these tools exist — or they're scared, they don't know that they exist, or they're intimidated by
the platforms, which they shouldn't be at all, because they're so easy. Literally, Google has this whole page called Google Labs. They have experiments — there are like 30 experiments there of the coolest applications of AI: music generation, video... Google Flow is one of those. There are just so many cool things that are so user-friendly. It's nuts.
Bilal Tahir (42:36.654)
Yeah, I mean, we've talked about a lot of the models. I'm curious, because you've played around there: what are some models on Google Flow that people aren't aware of? Because they have their own custom stuff there which I feel like most people don't know about — like, Veo 3 has maybe made its way to more mainstream circles, but what are some hidden gems there?
Pierson Marks (42:56.143)
Yeah, I haven't spent that much time there on Google Flow. I think most of it is around Veo 3 and Veo 2, if they have Veo 2 as well. But it's more of just like: can you bring this super cool model that's in the cloud, that most people have to go to Replicate or fal to use, and bring it into a more familiar interface that looks something more like Premiere Pro or CapCut or
Bilal Tahir (43:04.044)
Right, yeah.
Bilal Tahir (43:15.991)
Right.
Bilal Tahir (43:22.072)
Yeah.
Pierson Marks (43:25.583)
TikTok's native video editor. So I think that's kind of where you're seeing this merge between high-end production tools and user-friendly, consumer-based tools. And I think that's the most exciting part of this whole thing. It's like, we can now generate things that are just awesome. So —
Bilal Tahir (43:27.95)
Mm-hmm.
Bilal Tahir (43:48.418)
Yes, yes. Yeah, it's a good first release. I recommend people check it out — I know Google products can suck, but Flow... I remember I checked it out a while back, and it's a pretty nice UI, very clean.
Yeah, so, lots of models. Actually, one thing we didn't even touch on — I just remembered there's actually a third model in that release, which is Lyria 2, which is the music model; you said music generation and that triggered the memory. It's actually a pretty good model for generating instrumental stuff especially — you can generate 30-second beats. It's also on fal. Actually, to kind of back up, I would say probably the best AI music generators are Suno and Udio; those are the two. I mean, they're great — you should check them out, but you have to have a subscription. They don't open their API.
Pierson Marks (44:01.466)
Totally.
Bilal Tahir (44:30.992)
But Lyria 2 is pretty good. You can just use the API for it. You can generate different sounds and combine it with the video part, et cetera.
Pierson Marks (44:32.944)
Those are nuts.
Pierson Marks (44:42.192)
If you're a music fan and you just like music, I mean, you have to check out Suno. I haven't checked out the Google model that you just mentioned. If you don't like music — well, you're crazy anyway if you don't like music at all, so just turn this off. But I mean, if you enjoy music at all, and maybe wish you were a musician, or you know the theme of a song — if you're a house music fan
Bilal Tahir (44:48.309)
yeah.
Pierson Marks (45:09.872)
and you like EDM house, like that type of music. You go to something like Suno, I mean, half the stuff produced today sounds identical to the music that AI can produce. So I was always thinking, I used to DJ a little bit in college and it was just super fun. I'm a pianist and I have a keyboard in my room over there. But I can explain pretty well, like, hey, I want these types of drums.
and this type of keyboard, and maybe a saxophone, and this is the bass, and all these things — putting it into Suno just blew my mind. I was like, wow, I could have spent 10, 20, 30 hours to get something half as good as this. And it helped me visualize it — or audio-ize it — better.
Bilal Tahir (45:57.806)
No, because you play keyboard and you're more into music than me and understand this stuff — I'm curious, because I've talked to other musicians. A lot of people like me will just go and say, generate me a song, but it sounds like you do more presets and samples — you're like, give me the bass or the drums. And then do you just take that and combine it in Ableton or whatever? Like, you'll create your own unique blend of it?
Pierson Marks (46:22.734)
It's hard. I think the challenge is, when you export the tracks into something like Logic or Ableton, it's the consistency that isn't there. I think it's still much better if you one-shot it. I haven't actually gotten to play around with it in that much depth. I know you can export the different instrumental tracks, I think, now in Suno. I haven't really gotten to do that, but...
At least for the one-shot generations where you get like a three minute long song. Most songs are three minutes anyway. And the consistency is hard if you want to make a strategic change.
Bilal Tahir (46:55.896)
Yeah. I mean, no, it's super.
There are so many things — I'm surprised they haven't done that, because there used to be models, even before generative AI, where you could give it music and get the bass out; it was able to split it into multiple stems. I remember I did this: I started this channel where I was doing a cappellas. I would take old tracks, like "The Man Who Sold the World," like Nirvana, and I would just do a cappella versions of them. It's actually still up there. I think it got a decent amount of views. You just run it — I forget the name, it was an open source GitHub repo. So I'd put it in and be like, yeah, this is awesome.
So I had a Jupyter notebook I was putting them through — definitely, you know, the studios would come after me; again, it's copyright. I made a cappella versions, and also I'm like, nobody's doing this — there's a business right there. You know, people would want this.
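He doesn't name the repo here, but as one example of this class of open-source stem-splitting tools, Demucs can pull the vocal out of a track from Python by shelling out to its CLI. A sketch, assuming Demucs is the tool you pick; it is not necessarily the repo he used.

```python
# Sketch: split a track into vocals vs. everything else with an open-source
# source-separation tool. Demucs is one such tool (not necessarily the repo
# mentioned above). Requires: pip install demucs, plus ffmpeg on your PATH.
import subprocess

def make_acapella(track_path: str) -> None:
    # --two-stems=vocals writes vocals.wav and no_vocals.wav under ./separated/
    subprocess.run(["demucs", "--two-stems=vocals", track_path], check=True)

make_acapella("some_track.mp3")
```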
Pierson Marks (47:32.912)
Do it.
Pierson Marks (47:42.468)
I think they are doing it. I just haven't really — I personally haven't gotten to play around with it much. So maybe that's something fun to do this weekend. I used to do that all the time — just mess around and mix music — and it's really fun and relaxing. I haven't been able to spend that much time on it, because there's just so much stuff to even do. I mean, we just spent, what, like 40 minutes talking about all the video and image models. And honestly, I think that, like —
Bilal Tahir (48:04.834)
I know, we haven't even gotten to ElevenLabs yet. I know we were gonna talk about it.
Pierson Marks (48:09.102)
For episode one we'll probably just stick with image and video — Veo 3 and all this stuff. We have a lot of gold here. And next week we'll talk about ElevenLabs v3. We'll get another week to kind of play around with it in the playground — and in the API, hopefully. ElevenLabs, if you're watching this: API, yes.
Bilal Tahir (48:13.678)
Right.
Bilal Tahir (48:20.334)
Yes, yes — they have not released the API as of the recording of this video, which is a tease. So hopefully they will do that.
Pierson Marks (48:29.786)
So, also just know that as soon as the ElevenLabs API is out there, we'll be covering it — but we'll also cover the studio product next week. And if you don't know what ElevenLabs is, check it out. It's pretty cool. But yeah, I mean, this was awesome, I think, for episode one — talking about media and everything in the generative AI space.
Hopefully people find this helpful. I even learned something right now, just about your workflow and how you're going from image to low-res video. I thought that was the coolest part. Honestly, I think if I was to take one thing away from this whole thing, it was that workflow you just mentioned to create video.
Bilal Tahir (48:57.847)
Yeah, yeah.
Bilal Tahir (49:11.744)
And we'll share — because I want us to —
I mean, because we're both technical and stuff — and maybe some of you in the audience are technical, maybe some aren't — but we do want to share some code and tooling. So hopefully in the future — because I actually have scripts and stuff that do this — I will be sharing that with the audience. We'll start an open source repo and stuff. And sometimes it might not just be scripts; it might be simple GUIs as well, like apps, which you guys can play around with — drag and drop stuff and press buttons with your own API keys.
You know, to make it easy or whatever. So yeah, be on the lookout for that. We want this to be a hands-on podcast — we yap, but we also, you know, actually play with these tools. Yeah.
Pierson Marks (49:56.944)
All right, hopefully. Yeah, totally. Hopefully you learned something. And then this weekend, I'm trying to get this out today. Today's Friday the 13th. So we can get this out so people can play around with it. Yeah, I know, I know, I know. So yeah, but cool. OK, let's wrap this one up. But we'll see you next week. Cool.
Bilal Tahir (50:06.228)
how many days did it
