Exploring Fei-Fei Li's World Labs, Image Upscaling, and ChatGPT's Most Common Use Cases

Pierson Marks (00:02.552)
Well, welcome to Creative Flux. Hey, everybody. I'm Pierson. And yes, we are. And Creative Flux is the podcast for generative AI media. We talk about images, video, music, 3D worlds, all of the above. So we just kind of chat.

Bilal Tahir (00:05.683)
Hello, hello.

and I'm Bilal. We're getting our timing down with this now.

Pierson Marks (00:28.718)
and discuss everything in the generative media landscape. It moves very quickly. We were talking about Nano Banana in the past. That one's a big release that kind of brought Gemini and Google's media capabilities to the forefront for, I think, a lot of people. It was kind of like a ChatGPT moment for the mainstream in terms of image editing. That and Studio Ghibli, I think, on ChatGPT back a few months ago.

Bilal Tahir (00:53.683)
Mm-hmm. Yep.

Pierson Marks (00:56.536)
But yeah, we cover all this stuff. We talk about the new things that we're into, what projects we're working on on the side, things we just want to follow in the space. And Bilal is into this stuff super hard. It's great.

Bilal Tahir (01:10.95)
I didn't catch that part.

Pierson Marks (01:12.384)
Yeah, I mean, you're in this space, like going hard at it. Yeah, we both are. We both are.

Bilal Tahir (01:14.835)
Wow, I mean, you are too, you are too. I feel like we're both passionate about it. Yeah, no, I'm super passionate about it, as are you. And it's just so exciting. I don't know. I mean, I feel like it's a cheat, because it's like saying, you know, I work as an Imagineer or something. It's so cool. My day job is so cool. I'm like, yeah.

Pierson Marks (01:35.704)
I bet.

Bilal Tahir (01:35.783)
You know, it's easy when you just get to go and create these amazing, interesting worlds, which we'll talk about, speaking of worlds. But before that, I do want to, 'cause you mentioned Nano Banana. It's interesting, because you were talking about the ChatGPT image moment. When that happened, apparently it was a crazy, insane moment where their servers just couldn't handle it. Sam Altman, I remember, tweeted that they added a million users in like an hour, some insane, mind-boggling figure, like

they were adding those users. And then with Nano Banana, apparently the Gemini app became number one in the App Store because of Nano Banana, which, I mean, I don't know, maybe I shouldn't be surprised. Because in my head I think image editing and image iteration, it's cool, but it's been cool for like two years at this point, I feel. But still, I feel like every couple of months, people suddenly go, wow, you're saying I can take an image and edit it and make it into a Ghibli or a cartoon? They still, you know, love that.

Pierson Marks (02:12.664)
Right.

Bilal Tahir (02:35.737)
They're still wooed by that transformation. It's interesting. And maybe at this point, we should just lean in on it and be like, people will always have this insatiable demand to take themselves and either make them look pretty, add filters and stuff. It goes back to Instagram and even before that, with the artists and stuff. So yeah, very cool.

Pierson Marks (02:41.112)
Right.

Pierson Marks (02:59.17)
No, totally. Did you see this on X the other day about the usage for ChatGPT? There was an image here. and here it is.

Bilal Tahir (03:09.944)
About the Sankey, how many people use it for coding versus images and stuff? Yeah.

Pierson Marks (03:14.39)
Yeah, it was interesting. Let me pull this up here and just show the screen because I thought it was interesting to discuss real quick because it just proves how much of a bubble I think we are in in terms of familiarity with this stuff. So let me see. Let me pull this up. Might not be the best, highest quality image here, but if you could see this.

And it needs to be higher res. But here's the breakdown of what people use ChatGPT for. It looks like maybe the bars are more accurate this time versus GPT-5, if you remember that, where like the 60% was... Yeah, yeah, the y-axis is actually accurate. But these are the topics that people use ChatGPT for.

Bilal Tahir (03:54.515)
yeah, they figured out the y-axis

Pierson Marks (04:09.39)
And so going from largest to smallest here, 28% of people use ChatGPT for practical guidance. And that's creative ideation, health, beauty, fitness, self-care, how-to advice, tutoring or teaching. So that's about 30% of people using ChatGPT as like a mentor in some way, shape, or form.

And then right next to it, another 30% use ChatGPT for writing: summaries, critique, or personal writing and communication. So that makes sense. That's about two-thirds, so roughly 60%, using it for writing and guidance. And then another 20% are seeking information, so looking for information about somebody or something. So 80% is very common use cases of just mentorship, looking for content, seeking

information, like Google, essentially. Rather than going to Google, they're going to ChatGPT. So 80% is there. And then across the rest of the board, you've got technical help, which seems like it's kind of like practical guidance, but more in terms of expert stuff, coding, math, which is surprising to me. I mean, 4.2% being computer programming, maybe just because on X we're blinded, but I feel like

Bilal Tahir (05:18.909)
Right, yeah.

Bilal Tahir (05:30.087)
Yeah, I mean, I also think most programmers use Cursor or other IDEs and tools to access LLMs rather than going to ChatGPT and copy-pasting. I bet it was different two years ago when there was no other equivalent.

Pierson Marks (05:40.398)
That's right.

Pierson Marks (05:46.19)
Totally, yeah, absolutely. And this is a sample from 1.1 million conversations. And this is probably, I don't know. I haven't read the privacy policy for ChatGPT.

I don't know if, when you're a paying customer, they can use aggregated analytics. I'm assuming they can. I'm assuming on enterprise plans they don't use that, maybe. I'm not sure. We'll see how this is biased, if it's more biased towards free users versus paying users or not. I don't know. But the interesting thing here, this being a generative media podcast, is 6% is multimedia.

And 4% is creating an image, 1% is generating or retrieving other media. I don't know what you can really create besides an image.

Bilal Tahir (06:33.863)
I think it lets you create charts and stuff. I don't know. I mean, charts are still images, but I know they were trying to get into like Excel or tables and stuff.

Pierson Marks (06:37.991)
right.

Pierson Marks (06:44.268)
Right. And I could see them also generating music, and this will segue into that, because they have an audio model for their advanced voice mode. And like we discussed in the past, if they can do audio and voice, is music that much more difficult?

Bilal Tahir (06:48.403)
Mmm.

Bilal Tahir (06:53.735)
Yeah. Yeah.

Bilal Tahir (07:00.315)
Right. You know, it was always surprising to me that they never integrated Sora into ChatGPT. I mean, just create a video right there, because they had that capability. And as a ChatGPT user, you do get Sora credits. So I just don't know why they didn't pull that in, you know.

Pierson Marks (07:08.172)
Yeah, right.

Pierson Marks (07:13.41)
Really.

How is Sora doing? Have you followed that?

Bilal Tahir (07:19.409)
No, not really. I don't think, I mean, the latest update they had wasn't like...

I mean, it was good, but by then Kling was out. There were so many other models at that point. So I don't think it was the leader anymore. They kind of fumbled that with their initial release. They came out with Sora, everyone was blown away, but then they just didn't do anything with it for a year, I suspect because they just could not get the cost of generation down. I've noticed that consistent theme with both their image generation and their video generation. It's good, but it just takes so many tokens, and other companies have just figured out how

Pierson Marks (07:48.718)
Excuse me.

Bilal Tahir (07:58.297)
to do it in a more efficient way. And so speed and cost are important.

Pierson Marks (08:00.621)
That's interesting.

Yeah, I want to ask, if you were to make a bet, do you think OpenAI is still working on improving Sora, either iteratively improving Sora or creating Sora 2, or do you think they're kind of moving away from that and letting it go? I don't know. Do you think they're still actively working on video gen?

Bilal Tahir (08:22.583)
I'm pretty sure they are, because OpenAI, for me, in my head, I think of them as a consumer product company. You know, I think it totally makes sense that they would be focusing on that. And again, I mean, they saw what happened with GPT Image 1, like that was an insane viral moment for them, you know, and I think

it makes sense like that. And that conceivably can happen with video, if they have a Sora-type equivalent video moment like that. I mean, that would be insane. So I'm sure they're pursuing it under the hood. They're kind of doing everything. In fact, they're doing too much. Like, they're doing their own chip, apparently. I won't be surprised if this company does their own Labubu dolls at this point, because they're just all over the place. You know, Sam Altman's doing all these deals and stuff. Sometimes you want to focus more. It's one of the reasons I'm a little

Pierson Marks (08:51.084)
Mm-hmm.

Pierson Marks (08:58.157)
Right.

Pierson Marks (09:05.004)
You

Bilal Tahir (09:12.507)
bit bearish on OpenAI, because I feel like they sometimes don't show the discipline to just focus on one thing, or ten things, you know, and just do those.

Pierson Marks (09:19.596)
Right. That's a fair critique. I mean, I think what you have to look at is: does the company, like, I think it's good for them to pursue a lot of different things, but do they cut? Do they do something and then cut their losses? Because you can't do everything, but it's fun as a startup to explore a little bit, because some things might be better received than others, and then they invest more. And it seems like maybe

Sora

is either, they cut back funding, or they're working on it and maybe they're doing too much. But also, I think it's good for them to try and explore, especially if they can raise the money, if they explore and don't sacrifice quality. I think that's a key too, because if they're spread too thin and they're starting to release products that aren't fully baked, they feel like beta experiences, and they're not clearly labeled as beta experiences. Like MCPs now in ChatGPT, I think that's beta,

or it's under developer settings or something. But I think that is a sign, when you start seeing a company like that, the quality degrade because they're doing too much. But I haven't seen that yet. Maybe the io thing with Jony Ive, like whatever product they're doing there. Really?

Bilal Tahir (10:39.537)
I don't even know if that's happening anymore. They had the whole thing where the blog post got removed, and then people were like, no, that was a mistake or something. And I haven't, we haven't heard anything. So I don't know. I mean, some of these partnerships, I'll believe it when I see it. From what I did hear, though, the thing they were making was similar to the product the startup released that we talked about last week, where you could

Pierson Marks (10:51.49)
Right.

Bilal Tahir (11:05.106)
talk with your thoughts. I think that was what they were aiming for, going for as well, where you can just, you know, think. I saw a couple of tweets about it, that that was the form factor that Jony Ive was going for, which makes sense, you know, because I've never understood the whole pin thing, like wearing a pin on your shirt. I'm like, why? But if it was like a

Pierson Marks (11:06.893)
Right.

Pierson Marks (11:10.638)
Really?

Interesting.

Pierson Marks (11:19.96)
to like read it.

Bilal Tahir (11:28.037)
If I'm just thinking and it's recording or talking to me via like some, you know, under the hood subconsciously, then that makes more sense, I guess, in a way.

Pierson Marks (11:36.78)
So just, I mean, for people just joining: the thing we're talking about was a device that sits on your ears, kind of, right? Like glasses, but it was behind your ears. This was a different form factor. This is all speculation in terms of what OpenAI is working on, but you wore it, and it wasn't a chip in your brain, but it could read your thoughts.

Bilal Tahir (11:40.637)
Yeah.

Bilal Tahir (12:01.435)
Right. Yeah, yeah. And you can send it like to, so if you're wearing like a receiver, like I can think and you, it'll trans, I don't know if translate is the right word, but whatever, it just takes that, decodes it, encodes and decodes it, and then sends it to your device. And you can basically hear what I'm thinking in a way, which is crazy. It's like basically telepathy, right?

And I wonder, I was thinking about that. I was like, you know, it's funny, because we're basically going towards everyone becoming schizophrenic in a way, because you're just going to get a voice in your head. You know, why talk? You'll just be walking around with a headphone piece in our ears, just thinking. And it's like Her, but like...

Pierson Marks (12:35.192)
You're a man.

Bilal Tahir (12:45.331)
It's like one of those classic futuristic movies where they get so much right, but then they miss the most basic thing. Like, why are these people talking to their AI on a phone? That's so basic. No, you just think thoughts and then it thinks it, and you get the sound back in your ears, you know.

Pierson Marks (13:01.518)
It's so wild. I always just... 'cause I've had a lot of concussions in my life and done a lot of brain testing, and it was interesting when I was in high school. And so I was always envisioning, I mean, your brain produces waves when you're thinking, and they record that and they do testing, but the science is still very young there.

But if you can train an algorithm with a lot of data, like if we're just sitting in front of a TV and you're wearing a hat or like a device that just records your emotions and your thoughts when you're watching something, like if you show somebody an apple and they have like a thing on their head and like in your head you're saying apple, but you're not actually saying it, but you're just like thinking apple, because you're seeing the apple, you're thinking apple and like what...

Bilal Tahir (13:41.235)
Mm-hmm.

Pierson Marks (13:52.01)
areas of your brain are firing, what brain waves, everything. If you scale that up in terms of millions and millions and billions of data points and parameters, can you actually train a model that is able to understand, rather than tokens, waves, and then convert that into text?

Bilal Tahir (14:10.001)
It's crazy. Yeah. Hey, maybe that's the future of the human race. We just gradually lose the ability to speak language because why do you need languages? We're just a thought species. We're just thinking. It's just brainwaves.

Pierson Marks (14:20.027)
Yeah, this is exactly what we were talking about the other day with the book where they can't lie to each other.

Bilal Tahir (14:26.365)
The Culture? Or, I thought you meant the Culture, because the Culture series has a species that, it's basically a species that is just brainwaves. They don't even have a body, they're just floating brainwaves, and that's the whole species. It's already bizarre. I recommend the Culture series to anyone. It's one of the few not-dystopian, depending on how you see it, sci-fi novels where basically we reach AGI and the machines take over, but it's great, people love it.

Pierson Marks (14:40.418)
That's wild.

Pierson Marks (14:53.198)
Right. Well, talking about brain waves and craziness, I mean, did you see the Meta demos? Because I don't really...

Bilal Tahir (15:03.187)
Yeah, yeah, I thought it was pretty cool, even though apparently it flopped like twice, which, you know, I think a lot of people actually gave him credit that Zuck did a live demo and it flopped. So that's like, hey, he was trying something ambitious, so kudos. But the price point was what I think stood out for a lot of people. I think it was like 700 or 800 dollars for the flagship model, which, I mean, compared to three thousand dollars for a much thicker

Pierson Marks (15:18.04)
Totally.

Pierson Marks (15:23.95)
800 bucks?

Bilal Tahir (15:33.233)
two-year-old Apple glasses, you know, you can see the price going down. And so I can totally see glasses being a great form factor for LLMs. We talked about, obviously,

like having a little device that sits on top, in the side of your glasses, which you could like listen in on, but you can also overlay information on the glasses itself. And I think that's gonna be super common going forward where, like you pull up to a bar or a restaurant and someone comes and you suddenly get all their information like, this blah, blah.

Pierson Marks (16:07.182)
That's scary. Yeah, I mean, you're in a meeting with like your boss or something and you're just watching football, you know, you're watching the NFL or something. You're not in the moment. I don't know. Yeah, yeah, we'll see what happens. But...

Bilal Tahir (16:20.069)
Right.

Pierson Marks (16:27.734)
Yeah, I know we were just talking right before this too about HDR and like worlds and things I wanted to touch on that. So I didn't even know what HDR stands for. High definition something. Dynamic range, right.

Bilal Tahir (16:39.821)
Dynamic range. So apparently what it is, is a newer technology that makes the colors and contrast pop more, compared to SDR, which is standard dynamic range, from before. I mean, this is the latest tech, apparently, you know.

Pierson Marks (16:48.238)
Mm.

Bilal Tahir (16:55.729)
I mean, back in the day, it used to be, I remember when I was young, it was like HD, man, HD is the shit. And then at some point it became 4K, but those are resolutions. HDR is more the technology itself. I remember there was a Sony technology as well, the Bravia had something, which I forget. No, was it the display? Because there was a display as well, LCD versus all that. But basically, HDR is just better software to create videos that

pop more, look better. And the first AI video model, apparently, which supports it is Ray 3. Ray 2 is out already; we've talked about Ray 2 on the show before. I don't know if the company is called Ray too, but it's a decent model. I think it's up there. I've used Ray 2 Flash, which is the cheaper version of that model. And they have a version of it called Ray 2 Modify, which takes a video and you can

modify it, like you can do a style transfer on it. So, and now...

Pierson Marks (17:57.806)
Wait, Ray here is a video model? A video model.

Bilal Tahir (18:01.095)
It's a video model, video and video editing. And now Ray 3 is the latest iteration of that line and supposedly very good, supports HDR. Also, it's the first reasoning video model. Like in text reasoning, you know, chain of thought, where you ask a question and the model thinks for a while. And apparently in this version, in video, you can put in a prompt and it thinks about it and generates a scene, which...

Pierson Marks (18:14.958)
Mm.

Bilal Tahir (18:31.079)
For me, I don't know how much of that is PR versus like an actual breakthrough because.

I mean, I just assumed that's just a pipeline thing, right? You could always pipe it through a model and optimize the prompt, but apparently it's baked into the model itself and that helps with that. I'm not as familiar with the architecture of the model. But I did see some examples where the video generator was asked a prompt to do something and it thought it through; it created the first two seconds and then it judged

Pierson Marks (18:48.684)
interesting.

Bilal Tahir (19:05.263)
the first two seconds, and it was like, I'm going down the wrong direction, this is probably not it, and then it created a new video. So that's actually kind of cool, I guess, if you can self-correct a video generation, because video is so expensive and, you know, it takes a while to get that feedback loop going.

Pierson Marks (19:09.134)
Mm.

Pierson Marks (19:23.406)
So correct me if I'm wrong here. I need to read some papers and understand this space a little more technically, because I don't have the best grasp on video gen models. I know image, well, kind of. So an image is generated with a diffusion model, kind of like you generate it all at once. It's kind of just like that.

Bilal Tahir (19:49.831)
Right, yeah.

Pierson Marks (19:50.606)
For video, is it a full diffusion model where every frame is generated at the same time? Or is it iterative where you're generating frame one and then frame two as reference to frame one?

Bilal Tahir (20:03.633)
Yeah, no, that's a good question. I'm less familiar with the video side as well. I have seen language like eight-step inference for some video models, so it makes me think there's some diffusion also going on there. I don't think it's as simple as generating 24 frames for one second of video, just because I feel like there's something, and I remember I read something about it that

Pierson Marks (20:18.573)
Right.

Pierson Marks (20:24.022)
I don't think.

Bilal Tahir (20:30.157)
you can't generate it independently, because, you know, one scene needs to seamlessly move to the other, and so for the consistency to be there, you kind of... that's why video is so much more expensive than 24 times five images. You know, it's not as simple as just generating that amount of images. Yeah.
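
For anyone who wants the intuition in code, here is a minimal, hypothetical sketch of why frames get denoised together rather than one at a time. The denoiser below is a dummy placeholder, not any real model's API; the point is only that a single latent tensor spans every frame, which is what keeps frames consistent and also what makes video generation expensive.

```python
# Conceptual sketch only: a video diffusion latent covers ALL frames at once.
import torch

T, C, H, W = 24 * 5, 4, 64, 64            # ~5 seconds at 24 fps, in latent space
latent = torch.randn(1, C, T, H, W)        # one noisy latent spanning every frame

def denoise_step(x, step):
    # Placeholder for a spatiotemporal denoiser. A real model attends across
    # the time axis, so frame 1 and frame 120 are predicted jointly and stay
    # consistent; that joint attention is also why the cost is far more than
    # 24 x 5 independent images.
    return x * 0.98  # dummy update, just to keep the sketch runnable

for step in reversed(range(50)):
    latent = denoise_step(latent, step)

# Contrast: 120 separate image generations would be cheaper per call, but
# nothing ties frame t to frame t+1, so objects would flicker and drift.
```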

Pierson Marks (20:35.894)
Right. Right.

That makes sense.

Pierson Marks (20:48.332)
Right, right. That makes so much sense, because, I mean, think about it. If it was generating frame one and then generating frame two, one, that would be very slow to generate all those frames, but also the consistency between frames, because right now most video gen models are limited to a few seconds. And so if you could actually just

sequentially generate like that, then you wouldn't have an issue, because in the four or five seconds you have consistency between frames, but you can't scale that. And so that makes sense. It's like the multi-dimensionality of the vectors in that space is very interesting, because linear algebra was

Bilal Tahir (21:13.405)
Right.

Bilal Tahir (21:18.279)
Right, yeah. Yeah.

Bilal Tahir (21:24.989)
me.

Bilal Tahir (21:29.009)
Yeah.

Pierson Marks (21:33.218)
kind of mind-blowing to me when you start dealing with, you know, a thousand-dimensional space, and now even much bigger for these. It blows your mind, because we live in 3D space, but then you're dealing with numbers and...

Bilal Tahir (21:46.067)
Yeah. Oh, yeah. I mean, the math actually is very fascinating, because I don't know the details, but I know there are two methods of interpolation, which is how do you go from frame one to frame two? There's like

a couple of methods that you can use. I forgot the name, I think RIFE is one. But it comes down to linear: do we take the stuff and divide by two? Do we weight it? You know, so it's just simple mathematical decisions, and it's fascinating to see the output, like using this equation will create this experience where maybe you have a seamless transition, maybe you have a different one. It's kind of like, you know, obviously in web development, we have

the linear curves, animation curves, where you can have ease-in, ease-out, ease-in-out, you know. I mean, it's all mathematical curves, and the Bezier curve is very fascinating, to see how the animation itself changes based on the curve you're using. So I guess a similar thing happens for video. But it is cool though. I've actually thought about just a hacky way, because back in the day, I mean, in the twenties, the way movies were made, like animated
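
As a rough illustration of what Bilal is describing, here is a tiny sketch, just the arithmetic intuition and not how RIFE actually works (RIFE is a learned model): naive linear blending between two frames, plus the kind of ease-in-out timing curve used in web animation.

```python
# Minimal sketch: linear frame blending and an ease-in-out timing curve.
import numpy as np

def blend(frame_a: np.ndarray, frame_b: np.ndarray, t: float) -> np.ndarray:
    """Linear interpolation: t=0 returns frame_a, t=1 returns frame_b."""
    return (1.0 - t) * frame_a + t * frame_b

def ease_in_out(t: float) -> float:
    """Cubic ease-in-out: slow near 0 and 1, fast in the middle."""
    return 3 * t**2 - 2 * t**3

frame_a = np.zeros((64, 64, 3))   # stand-in "frames"
frame_b = np.ones((64, 64, 3))

mid_linear = blend(frame_a, frame_b, 0.5)               # even timing
mid_eased  = blend(frame_a, frame_b, ease_in_out(0.5))  # eased timing
```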

Pierson Marks (22:46.894)
Right.

Bilal Tahir (22:59.473)
movies, artists literally drew 24 panels. It was such a painstaking process, they literally drew 24 pages and they would just flip through them in front of the camera. It was choppy, but those were the first movies. And I've actually thought about, one of the projects I wanted to do is, what if you just use Nano Banana or something and you generate 24 images, and then you kind of do a flipping, like

kind of like those old school cameras where it's like kind of twirling effects, like a cartoon like that. I think that would be a pretty fun experiment to do.
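
If you wanted to try that flipbook experiment, a minimal sketch might look like this, assuming you have already generated and saved 24 frames (the filenames here are made up); Pillow can stitch them into a looping GIF.

```python
# Stitch 24 pre-generated frames into a looping "flipbook" GIF with Pillow.
from PIL import Image

frames = [Image.open(f"frame_{i:02d}.png") for i in range(24)]  # hypothetical files
frames[0].save(
    "flipbook.gif",
    save_all=True,
    append_images=frames[1:],
    duration=83,   # ms per frame, ~12 fps for that hand-drawn, slightly choppy feel
    loop=0,        # loop forever
)
```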

Pierson Marks (23:29.976)
Totally.

Pierson Marks (23:33.654)
The interesting thing is that, like...

We'll look back at these conversations and we'll be like, wow, look how crazy we were thinking, like video was hard to get beyond four seconds, it was hard to keep consistency. Because when we look at Genie 3, for example, the world models, right? You could walk around in a world that's consistent, and you see the tree, or the paint on the wall, and the paint stays, and you can actually have long-running, minute-plus kind of memory, where you put something on the wall and it's going to stay there, even if you look away.

Bilal Tahir (23:51.591)
Right. Right.

Pierson Marks (24:07.704)
And when you look back, it's still there in the same spot. Because to me, why are we training a video model that's so different from a world model, when with a world model, a video should literally just be a camera, where you have some space recording the world model? And if you can get a world model to be consistent over time, and

Bilal Tahir (24:23.571)
Mmm.

Pierson Marks (24:32.16)
you know, just have that space, like, why are we even dealing with specific video models? When you have the world, just create the world to be what you want the video to be, and then you have camera control. You can literally, like...

Bilal Tahir (24:45.117)
Wow. So you're literally going to have a cast and it's like, okay, this is the studio, lights, camera, action, and just, you know, right. Yeah.

Pierson Marks (24:49.4)
Kind of. I mean, the interface and how you interact with the world. Because, let's say you go to Veo 3 in some program and you want to say, generate me a horse running across this African landscape and put a tree over in the distance. Right now what it's doing, it's actually generating just that, and there's no concept of beyond the frame. And so, you know, you have the tree in the background, you have the horizon, you have the African plains, you have the horse, but...

you can't say pan that camera to the left and show the other horses behind it, because it just generated that one thing. But if, instead of generating just the video, you're generating the whole world, then you can actually, you know, move the camera, zoom the camera out or zoom the camera in. You're actually kind of...

Bilal Tahir (25:35.443)
you

Pierson Marks (25:38.248)
controlling the camera in the world versus generating just a video. And I think we're going to move towards that because I listened to a conversation between Logan Kilpatrick and Demis. Is that how you pronounce his name? Yeah, the CEO of DeepMind.

Bilal Tahir (25:57.401)
Yeah, yeah, Demis Hassabis.

Pierson Marks (26:00.61)
And they were talking about this. They're pretty much saying they're going to see Veo 3 and whatever merge, very similarly, kind of between Genie and Veo.

Bilal Tahir (26:08.381)
Yeah. No, I mean, it's fascinating. And it's something I confess I still struggle to really wrap my head around, world models, because I do think they are the next iteration. You know, I think Levels had this tweet today that was like, 2022 was the year of photos, and then '23 was the year of avatars

or something, and then 2024, 2025, basically, was the year of videos, and 2026 will be the year of world models. Which I guess tracks, I mean, it kind of makes sense, but at the same time I'm like, okay, what does that look like? It comes to, I guess, we'll share the news about World Labs in a second, but it...

I can see, like what you were saying, it makes me start thinking about the different angles. Maybe that's one use case. Because it's funny, I think about how every time somebody shares a real-life TikTok of an event that happened, which is like crazy viral, suddenly on your feed you get the same video, but sometimes somebody is like, some guy recorded it from this angle, like instantly, right? And then it's like, wow, it's a whole new perspective, right?

And so I wonder, maybe that's what will happen: there'll be a simulated event, somebody will do, like, the horse riding, and maybe it goes viral for some reason. And then somebody says, well, I took the shot from the side, so if you want it from the side, you can do that. I don't know, I mean, it's kind of contradictory in my head, because I'm like, well, why would anyone care? But maybe they will care. That's where you've got to be a little more optimistic, techno-optimistic, that people will care.

Pierson Marks (27:29.77)
Right.

Pierson Marks (27:37.143)
Right.

Pierson Marks (27:42.286)
Yeah, and even then, you know, I mean, 2025, everyone was like, it's the year of agents. I mean, yeah, right. It's a decade of agents, more so. You know, it's going to be the decade of generative media. It's going to be a decade of AI. And

I think as developers and people in this space, we kind of over-index on some technology getting released, and then you're like, everyone's going to use it right away, it's going to be everywhere. And then reality hits: it's just harder to change and harder to integrate new technology into the world, much more so than just, hey, throw it out into the world and let people play. Sometimes you can, with new paradigms, like AI in general, you threw it out into the world,

and it's kind of embraced. But then when you get to a finer granularity, like world models, who are the users? What are the use cases? Is it for physics simulation in science labs, and design and architecture and engineering? You have to integrate that into existing systems, because you're not just going to go to the world models and say, design a bridge, and then you're like, okay, now I have a bridge in this world, what am I going to do with that? Or self-driving cars and vehicles. You have to really think about how to

bridge the gap from this really cool technology to something that's useful for people. Except in the creative space. The creative space is kind of cool: you throw out the creative tool and you're like, let's see what people make with it. So, yeah.

Bilal Tahir (29:08.157)
Yeah, the stakes are lower in that sense. I mean, I guess you can be more experimental. But yeah, I'm sure I'm going to be very surprised at how people end up using these models, but we know for sure they will be used in some very interesting ways that will become clear. And of course we're going to look back and be like, of course, yeah, that was going to happen. But for now, the future is murky.

Pierson Marks (29:31.598)
Totally. All right. There was something I wanted to circle back to, the HDR and the dimensions of video and image, real quick, because there was a paper that I didn't read fully, but it was very interesting to me, because it was about how to scale up the resolution of image gen

to be resolution agnostic. So right now, and it's pretty cool, because right now when you go to ChatGPT and you're generating an image or whatever, it's going to generate a JPEG or PNG or whatever at some fixed dimensions. And I think they're not that high, depending on what model you use, like Midjourney versus others. You're not going to get super high, ultra-high def.

Bilal Tahir (30:21.767)
Right, I think the best is, Flux actually released a model called SRPO, which I think is just 4K, or just really realistic high res. Oh sorry, there's one called Dreamina as well. I think Dreamina is by, it might be ByteDance. They have one which is, I think, the 4K version. But most, you're right, most images are like a thousand by a thousand. So, you know, basically...

Pierson Marks (30:29.826)
Interesting.

Pierson Marks (30:44.75)
Right. And it's interesting because it scales quadratically, which makes sense. You have one pixel and you need to scale that density up, so that one pixel is going to turn into four to keep that square resolution. But there was this paper that was pretty much saying they invented a new layer, I guess, that is able to, let me see...

It replaces one of the steps in image gen, not getting into too much depth, and it's able to scale up any image generator, any image model, to whatever resolution. So it becomes resolution agnostic, and you don't need to use more compute to generate a higher-resolution image. It's called InfGen, and it replaces the decoder with a new generator.

Bilal Tahir (31:16.552)
Right.

Bilal Tahir (31:28.179)
Hello.

Bilal Tahir (31:37.715)
Wait, so you're saying you can upscale an image and it doesn't require any compute at all? Like it just...

Pierson Marks (31:44.878)
Maybe it's linear compute versus quadratic, I think.

Bilal Tahir (31:47.565)
Okay. Yeah, because there are upscalers, but they're ridiculously expensive and slow. Like going from, I think, a 1,000 by 1,000 to a 2,000 by 2,000 image, which is a 4x increase in pixels, that takes... I mean, it costs money and, yeah.

Pierson Marks (31:58.818)
Right.

Pierson Marks (32:03.182)
It costs money and time too. So I'll read the abstract real quick, because it's interesting, and I'll link this back to why it comes back to TVs and HDR in a second. The abstract says: arbitrary-resolution image generation provides a consistent

visual experience across devices, having extensive applications for producers and consumers. Current diffusion models increase computational demand quadratically with resolution, causing 4K image generation delays of over 100 seconds. So to generate a 4K image usually takes a minute-plus, I guess, according to this. To solve this, we explore the second generation upon latent diffusion models, where the fixed latent

Bilal Tahir (32:33.053)
Wow.

Right.

Pierson Marks (32:44.782)
The fixed latent generated by diffusion models is regarded as content, blah blah blah. Okay, skipping through this. So, we present InfGen, replacing the decoder with a new generator to generate images at any resolution from a fixed-size latent, without retraining diffusion models, reducing computational complexity, and it can be applied to any model using the same latent space. So it's

capable of improving models into the high-resolution era while cutting the generation time down to under 10 seconds. So...

Bilal Tahir (33:20.307)
That's amazing. That's like basically converting every image to an SVG.

Pierson Marks (33:26.284)
Yeah, no, it's super interesting, because if you have these good models, then essentially you commoditize the resolution aspect. You replace whatever layer, whatever part of that pipeline, is actually producing the resolution. You train your image model, and then you put this thing there, and now you can generate whatever resolution you want without having to retrain the model for higher res.
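
To make the idea concrete, here is a conceptual sketch of the decoder swap described in the abstract. The class and its internals are hypothetical, not the paper's code; the point is that the diffusion model keeps producing one fixed-size latent, and only the final decoder decides the output resolution.

```python
# Conceptual sketch of an arbitrary-resolution decoder on top of a fixed latent.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArbitraryResolutionDecoder(nn.Module):
    """Stand-in for an InfGen-style generator: fixed latent in, any resolution out."""
    def __init__(self, latent_channels: int = 4):
        super().__init__()
        self.to_rgb = nn.Conv2d(latent_channels, 3, kernel_size=3, padding=1)

    def forward(self, latent, out_size):
        # The real method learns the upsampling; bilinear resize is just a placeholder.
        x = F.interpolate(latent, size=out_size, mode="bilinear", align_corners=False)
        return self.to_rgb(x)

latent = torch.randn(1, 4, 64, 64)       # fixed-size latent from any latent diffusion model
decoder = ArbitraryResolutionDecoder()
img_1k = decoder(latent, (1024, 1024))   # same latent...
img_4k = decoder(latent, (4096, 4096))   # ...decoded at a higher resolution
```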

Bilal Tahir (33:33.651)
Hmm.

Bilal Tahir (33:50.547)
I mean, I've actually seen, I remember I saw some experiments with AI compression, and the problem with AI compression was that it wasn't deterministic like a JPEG, which is a lossy compression, but at least, you know, it's inputs and outputs. But I do think there'll probably be a world where the next JPEG or WebP or whatever could be an AI algorithm like this, where you maybe just store it as a 128-by-128 latent or whatever.

Pierson Marks (34:06.669)
All right.

Pierson Marks (34:18.62)
that's very interesting.

Bilal Tahir (34:19.559)
And you can save it as kilobytes and then instantly upscale it to like a 10-megabyte image or whatever. It saves a lot of storage and compute. So yeah.
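
Back-of-the-envelope on that idea, with purely illustrative numbers: a small latent is orders of magnitude smaller than the pixels it could be decoded into.

```python
# Illustrative size comparison for a hypothetical "store the latent" codec.
latent_bytes = 4 * 128 * 128 * 2      # 4 channels, 128x128, fp16 -> ~128 KB
pixel_bytes = 3 * 4096 * 4096         # raw 4K RGB -> ~48 MB before JPEG
print(f"{latent_bytes / 1024:.0f} KB latent vs {pixel_bytes / 1024**2:.0f} MB of raw pixels")
```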

Pierson Marks (34:26.986)
No.

Yeah, no, that's super interesting. And the reason why I thought it was cool, though, going back to the HDR stuff, is that we were talking about the different dimensions of televisions. We had HD, and then you got 4K, and you got Ultra HD, and now you have OLED, and there are all these different dimensions to play with when you talk about the quality of an image. So resolution's one; you have how dark are your blacks, how light are your lights, what's the bleed between your light pixels and your dark pixels, HDR, contrast,

all these dimensions, and some of them are trade-offs, but they all go towards creating a good image. But in this space it'd be really cool, because the problem with resolution right now is, 4K is where every TV is, every TV is pretty much 4K now, and 8K never happened, just because there wasn't enough content filmed at 8K and it's just too expensive, and you're not filming at 8K. But if you can actually upscale it...

Bilal Tahir (35:23.667)
You know, I remember 4K was a thing, I think 2010, 2015. So, I mean, I don't know why we never went further, it's just kind of a ceiling we hit. Maybe 4K is good enough that we don't need that, you can't really distinguish between 4K and 8K. Maybe it's not worth the trade-off or something, but it is...

Pierson Marks (35:44.29)
Right, right, especially if you're at a TV and you're sitting pretty far away from it.

Bilal Tahir (35:49.009)
Yeah, and I've heard the TV needs to be bigger than 65 inches or something, I think, for it to matter.

Pierson Marks (35:54.668)
Interesting. And it could get expensive with all the pixels. Super dense, you know.

Bilal Tahir (35:58.033)
Yeah, it is interesting how we basically hit this, it's like a scaling wall. We had these small cable TVs and we went to HD and then 4K, and if you had extrapolated from 2010 onwards, you'd probably be at 16K now, but we're still pretty much at the same place.

Pierson Marks (36:13.23)
Right, right. What about the Retina displays on your phones? Are those like 8K or...?

Bilal Tahir (36:20.435)
that's a good question. I thought retina display was more of a, isn't that an Apple specific technology or something?

Pierson Marks (36:25.782)
Yeah, it is, but let me just look up Retina... Let's see, iPhone...

Bilal Tahir (36:28.499)
Yeah. But I do think AI will play a huge part in this upscaling. And to your point, you can create amazing, crisp images. I mean, that's

just going to make the quality of AI slop even better. Actually, one hack it reminded me of, one model I discovered is called Recraft Vectorize, which is so cool, especially for certain images. You can just give it any PNG, JPEG, or any image format, and it'll convert it into an SVG, and it does such a good job, depending on the image. Obviously, if you give it an image of yourself, it'll kind of do a blurry output of it. But if you do an anime,

Pierson Marks (36:51.146)
Mm. Right.

Bilal Tahir (37:11.029)
or, I gave it a manga panel, and it just did such a great job, you could not tell the difference. If anything, the manga was way crisper and cleaner, and I could zoom into the character's face and the speech bubble and the writing. And I was like, wow.

Pierson Marks (37:16.238)
That's cool.

Pierson Marks (37:20.771)
Right.

So this is.

Bilal Tahir (37:24.083)
This model is being very undersold right now, I feel like, because you can imagine using it a lot. If you have animated assets for your website or something, you could just take all your images, which are like two, three megabytes, convert them into SVGs, and put them on your page. Because images tend to be the heaviest assets on a webpage, you can instantly lose like 80% of the weight of a webpage using SVGs.

Pierson Marks (37:48.918)
Right. Is this Recraft V3? ... Right. Gotcha.

Bilal Tahir (37:52.711)
Recraft Vectorize. V3 is their text-to-image generation model. It's on Replicate. I think Replicate is the only one that has them for now, but...
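
A rough usage sketch with the Replicate Python client; the model slug and the input field name here are assumptions, so check the actual model page before relying on them.

```python
# Hypothetical call to a vectorizer model hosted on Replicate; slug/inputs are assumed.
import replicate

output = replicate.run(
    "recraft-ai/recraft-vectorize",                  # assumed model slug
    input={"image": open("manga_panel.png", "rb")},  # assumed input field name
)
print(output)  # typically a URL or file pointing at the generated SVG
```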

Pierson Marks (38:01.262)
This is kind of what we talked about last week. So if you weren't listening the other week: SVGs are vector-based graphics that can scale easily. Instead of being pixels, they're just equations, essentially, and you can scale them up and zoom in indefinitely, right? And so that's what's super cool about this, because when you take a photo of

your backyard or whatever, it's a bunch of pixels, and when you zoom in, it gets blurry and you can't zoom in further. But imagine if you could zoom in indefinitely, like actually keep zooming until the whole screen is one big color of a pixel, without any blurriness. It's pretty cool.

Bilal Tahir (38:41.211)
Yeah. Yeah. And it's like kilobytes versus megabytes. It's crazy how

Pierson Marks (38:47.726)
It's so much smaller.

Bilal Tahir (38:47.827)
small and lightweight SVGs are, but very powerful. I think there are tons of good ideas here. One idea I threw out there: I actually think if you want to create your own handwriting font, this is it. Because it does such a great job, if you give it your own handwriting or even old letters, like I gave it some old 18th-century letters, it basically converted the whole thing into an SVG, and it did such a good job of capturing the handwriting. And from there, it's just about going to another AI that can categorize the A's and B's and C's.

And if you give it enough of a distribution, you can create a handwriting font from it. The way a font is created right now, basically, if you want to do it, you have to write every letter, uppercase, lowercase, and then there are certain characters, and you literally write them one by one in a box and upload that file. It's literally a file for each character, and then you create your own font. So it's a very painful process. There are services out there, but because it's very manual, it probably costs hundreds of dollars.

Pierson Marks (39:38.958)
All right.

Bilal Tahir (39:47.683)
And now imagine just uploading your writing, your essay or whatever, and suddenly you have your own font right there. Somebody could make a business out of this and give me 10%.

Pierson Marks (39:53.358)
All right. That's cool.

Pierson Marks (39:58.938)
Yeah, these are the type of things that could be so fun, just to be like, okay, monthly hackathon. That'd be a really cool thing for JellyPod. I remember when we first were talking, one of the things that I really wanted to make sure we do was keep Fridays, at least. Like, no joke, I really want to make sure, as a culture, we keep Fridays as just playing around with stuff, like whatever.

It's competing with priorities, obviously, so staying focused on the big items is important. But also giving us the time to play around with these cool tools, to see what's out there in a world that's moving so fast, is so important. So, I don't know, it'd be cool to do like a monthly hackathon where, on the last Friday of the month, we throw up on a board all these cool new tech things that came out and play with them. Like the whole last Friday of the month is

Bilal Tahir (40:45.783)
yeah.

Pierson Marks (40:54.682)
hackathon day and we're just going to all build.

Bilal Tahir (40:55.635)
How do you mean? Is that just an internal thing, or do you want to do like a JellyPod hackathon and invite everyone in the Bay Area there?

Pierson Marks (41:03.662)
Oh, I mean, I was just thinking more internally, where it'd just be cool: the last Friday of the month is kind of like a holiday, a fun Friday, and everybody comes in and you have bagels in the morning, beer at night, and we're just building for the day.

Bilal Tahir (41:18.127)
Love it. It's like that scene in The Social Network where, you know, you're just coding and chugging beer. It's like frat boy stuff.

Pierson Marks (41:25.17)
Hey, mean, coding with a glass of whiskey is super fun.

Bilal Tahir (41:30.051)
It is, yeah, I agree with that. If I write some beautiful code, I mean, in the morning it's not so beautiful, but in the moment you feel like you're a genius. But I want to circle back before we get too far from the world model stuff, because one piece of news we didn't mention, which I thought was very interesting, was World Labs, which is Fei-Fei Li's startup. Fei-Fei Li is one of the OG AI, I guess,

Pierson Marks (41:39.31)
Totally.

Bilal Tahir (41:58.035)
godfathers, godmothers, you know, up there with Geoff Hinton and, you know, Yoshua Bengio, Yann LeCun. And so she started a startup, World Labs. I think they raised like $200 million or something, you know, a few months ago, I think, or last year, something. And they released their first demo, or one of their first demos, which kind of went viral because it was so cool. And what they did was, within, I think, a matter of seconds,

you could generate a whole world, like a 3D world, just with a prompt. And this world wasn't just like

a room or something. It was actually very detailed. So the demo that went viral was the Shire. It literally was, you know, Bilbo's house, with rooms: this is the library, this is the kitchen, this is the living room, this is the bedroom. And you could go in and walk around and see it, and it felt so real. I literally had this visceral reaction where I was like, I want to live here. You know, it's so cool. This would be a great country house.

Pierson Marks (42:43.064)
Right.

Pierson Marks (42:59.16)
Totally. No, absolutely. Wait, let's just show it to everybody that's actually watching the video. So this is the World Labs demo that went viral. Let's show it.

Bilal Tahir (43:09.191)
Yeah, this might be another one, but this is pretty cool too. This is like a castle or a garden. Beautiful. So now we're in a castle, like Prince of Persia.

Pierson Marks (43:17.538)
this is a...

Pierson Marks (43:21.538)
This is the one that is at least pinned on WorldLab's Twitter account right now.

Bilal Tahir (43:25.221)
Yeah, I think this is a new one, but this is so cool too. I mean, they're just showing off the different worlds you can generate. And all of these are generated, I believe, in five to ten, like very little time, which again comes back to our question about video. Video is taking so long, but then you can generate a world quicker, you know, and

Pierson Marks (43:42.114)
Right.

Thanks.

Bilal Tahir (43:44.687)
you can go to their site and generate for free right now. So it's a pretty cool opportunity to play around with this kind of stuff. But this comes to something like I remember we were talking about a few episodes back where you're saying we can create our own worlds and just put them out there and other people can come explore it. It's like a new medium of expression for us.

Pierson Marks (43:51.116)
No, it's...

Pierson Marks (44:07.06)
It's so, so, so cool. I'll have to go through and spend some more time on this Twitter account, because, look at this. I mean, if you're watching this, it's a living room right here. And if the generation speed is actually super slow, that's what makes me bearish on video models specifically, compared to world models. Like, look, these are videos. We're watching a video. So why do we need the video models? We have the video.

Bilal Tahir (44:16.525)
more of it in the world.

Bilal Tahir (44:32.443)
Yeah, it is. It is interesting. You know, like, yeah, this might be it.

Pierson Marks (44:36.902)
this is, that's the, you could play projectile and tango.

Bilal Tahir (44:41.092)
I don't know, that's a new thing. Just like, yeah, I guess somebody edited like a Minecraft thing inside the Shire, which is pretty cool.

Pierson Marks (44:48.332)
Right. And it looks like they have a camera button on the bottom. This is going to be so cool. I'm going to sign up for this beta right after this. So...

Bilal Tahir (44:55.808)
you're saying you click the button and you can interact with the video itself. yeah. Yeah.

Pierson Marks (45:00.034)
I mean, like, look at this, because you have this camera icon down below right here. So that probably just takes a screenshot of wherever you are. And yeah, if you can move the camera around with prompting, there you go. There's your video model.

Bilal Tahir (45:06.419)
Right.

Bilal Tahir (45:12.124)
wow. That's so cool.

Bilal Tahir (45:18.171)
Mm-hmm. Yeah. I mean, this is the future. I feel like if you just put your sci-fi tinfoil hat on, ultimately, I do think the future is more like WALL-E, except hopefully skinny and not fat. Everyone's just, you know, plugged in and in their own little Matrix world. I think it's pretty cool, actually. So...

Pierson Marks (45:30.51)
Yeah, yeah.

Pierson Marks (45:37.421)
Totally. Right, right, right. Well, I mean, I know we're pretty much out of time here, but we covered a lot. We covered a lot of different things: World Labs, latent space, image res, all of it.

Bilal Tahir (45:43.911)
Yeah.

Bilal Tahir (45:54.546)
Yeah.

Yeah, I mean, so much. We didn't even get to Lucy, which is another model, from Decart, which is another company that released a video generation and editing model. They're actually probably the fastest video generation model. Maybe they saw these world models and were like, video is too slow, so we should speed it up. And they've been able to generate videos in almost real time. And they released a new model which can edit video. So, let's say you

have a video of somebody dancing in a yellow shirt, you can change that shirt to purple. And so, yeah, this is the future: worlds, editable video, everything is just malleable in front of you. You can just take it, manipulate it, make it your own. So it's wild.

Pierson Marks (46:31.672)
Wow.

Pierson Marks (46:42.166)
Right, it's wild. It's wild. Well, to wrap up episode 14 of Creative Flux: we've talked about a lot, a lot of world models, a lot of image, video, everything gen. So stay tuned for next week, and goodbye. See you all.

Bilal Tahir (46:59.751)
Yeah? Alright, see you later guys. Bye.
