ElevenLabs V3 and the Future of Text to Speech

Pierson Marks (00:01.122)
Yeah. How's it going?

Bilal Tahir (00:02.662)
Hey, hello, hello, how's it going?

Pierson Marks (00:05.326)
Hey, hey, well, welcome to episode two of Creative Flux. Now we have a name. Last week we didn't have a name and, you know, this is the name. It came to us, or came to me kind of quickly and I was just like, you know what?

Bilal Tahir (00:10.214)
Creative Flux, now we have a name.

Bilal Tahir (00:19.1)
Yeah, I was gonna say, what's the story? What's the story? Did you feel, you know, Creative Flux, when you heard it or thought of it, and you were like, this is it?

Pierson Marks (00:28.226)
So, yeah, I think I went through the same process as when I created the JellyPod name, but just much more streamlined and compact. I wrote down a bunch of words all around, like creativity, art, media, and things, and I threw them into ChatGPT a little bit, and I got some, like, thesaurus sort of things and related words.

And creative was something that we always kind of wanted to talk about, because we're talking about creativity, talking about media, the future of media. And flux is also a really cool term as well, because it's the change, you know, the change in something over time. And I was like, well...

Bilal Tahir (01:06.096)
Mm-hmm, yeah. I was gonna say, we're not the first ones to put that together, but yeah. Black Forest Labs probably is mad at us.

Pierson Marks (01:12.206)
No. Yes, exactly, exactly. And so, yeah, that's the Flux image models. But I was like, well, that's cool. I mean, you measure flux, you measure the change, and then creative. I was like, well, you know what? I didn't want to spend too much time on it. We could always change the name, but I thought it was pretty cool. And we have creativeflux.com, or no, creativefluxpodcast.com. So that's our podcast website. So.

Bilal Tahir (01:32.806)
Yeah.

Bilal Tahir (01:37.774)
I didn't know that. That's awesome. You have a domain now. That's awesome.

Pierson Marks (01:41.112)
We have a domain. All of the episodes will be up there and also on Spotify, Apple Podcasts, YouTube. So if you're listening in for the first time, check us out. Last week we talked about a lot of video models, Veo 3. We were supposed to talk about ElevenLabs V3 last week, but we didn't even get to it. So that's what today's going to be on, a lot of text-to-speech stuff. So.

Bilal Tahir (02:03.696)
Yeah, yeah, it's a huge field. Last week, you know, it was video, and every field has subfields. So video has text to video, image to video, which we talked about, and even video to video, which is taking a video and stylizing it into another, which we didn't talk about. And now there's also audio to video, where you take an audio file and an image, and you can combine them and animate the image with the audio. So similarly, text to speech is...

Well, text to speech is a subvariant of a bigger field, the audio field. If you just break it down, there's text to speech, which is you put in some text and you get speech out. There's speech to speech, which is like, you take a recording of me, and you take a recording of, you know, I don't know, Walter White or whatever, and

the same audio will come out, in Walter White's voice. That's also called a voice changer, if you're familiar with that, like people call it. But that's speech to speech. It's not quite a voice changer, because you don't just change...

the tone. You can also change the way someone else would speak. So there's a spectrum, you know: a naive voice changer versus taking the actual audio and, really from first principles, changing it based on this new character's voice, et cetera. So it's a very interesting field. And within that, you can probably pair voice cloning with it. We at JellyPod do that too. We let you clone your voice, and then you can put text in and generate with it. So that, you could say, is text to speech and speech to

speech combined, right? You get both aspects. And then on the other side there's, and I do pair it with this, even though it's very different, speech to text, which is you take an audio file and you transcribe it or translate it. It's also called ASR in machine learning, automatic speech recognition. And that's its own field. We should probably do an episode just on that, because it's a very interesting field, too. It's probably the OG field, because for the longest time, we never...

Pierson Marks (04:00.6)
Right, right.

Bilal Tahir (04:02.714)
did text to speech well until recently. ASR has been around since even the 70s, 80s, Stephen Hawking and all. It's been a field that's been ongoing, and it's probably the OG. I would say image recognition and ASR have been the OG machine learning fields, probably, right?

Pierson Marks (04:09.742)
Totally.

Pierson Marks (04:19.214)
Totally. It's super interesting. I mean, I remember when I was in elementary school, the big hype was Dragon Dictation.

And it was probably the first software platform, at least the first big one that I remember, where you could speak to your computer and it would be decent. Decent. And the architecture back then was just completely different from how it's done today. Completely different fields. I mean, I don't know how much you know about how it used to work, but it was pretty simple: you take a word and kind of map what someone's saying to a word, and it was bad, and it just wasn't

Bilal Tahir (04:27.919)
Mmm.

Pierson Marks (04:57.654)
ever reliable. Like, you'd try to talk and write an email and it didn't work. I mean, the promise was there.

Bilal Tahir (05:02.586)
Right, and the whole computer voice, like the Stephen Hawking voice and stuff, you know. Yeah.

Pierson Marks (05:07.662)
Totally.

Yeah. And then Siri came out and changed the whole game. I mean, Siri was definitely the first consumer product. I want to say Siri was before Alexa. Yeah, for sure. And honestly, when that came out, it was pretty good. I mean, the voice of Siri back then was, you know, listenable. And I think that was the first time where it didn't sound like your robotic text-to-speech computer, you know, what

Bilal Tahir (05:18.833)
Mm-hmm.

Pierson Marks (05:38.64)
you'd see in movies or in your other devices. So yeah, a lot has changed in text to speech, and just in speech in general. So.

Bilal Tahir (05:45.916)
Yeah.

Bilal Tahir (05:50.844)
Yeah, there have been all sorts of models, I think. I feel like Google actually released models that changed the game. Was it Wav2, Wav2-something? Oh, sorry, Tortoise. Tortoise TTS was, I think, the big changer. And I think that's what ElevenLabs initially used. Now they probably have their own models. But Tortoise TTS was the game changer, I think, that came out around 2019, 2020. And then...

Pierson Marks (06:08.406)
Right. Right.

Bilal Tahir (06:16.8)
Suddenly it got really good, and I remember there were actually a bunch of startups. There was one in Seattle called Uberduck. They're still around, but yeah. Yeah, cool guy. They were initially just doing something similar to ElevenLabs, and then they pivoted towards music, where they basically fine-tuned their models just for music, and then

Pierson Marks (06:24.191)
Zach, Zach Ocean. Yeah. I met him. Yeah, I met him. Yeah.

Bilal Tahir (06:42.3)
I don't want to go into too much detail, but they used to have these models where you could literally make a Drake model or whatever. It was all on their site. They had a marketplace, actually, of models with ratings. It was so cool. And then one day they just took it down. I think they got sued by a bunch of different parties. And then they pivoted. I think now they just do music, song videos, and that's their niche. But yeah, Uberduck, that was a cool company back then for consumers.

Pierson Marks (07:08.91)
Totally. No, it's interesting. I mean, I remember reading about ElevenLabs too, starting off as a fine-tuned version of Tortoise.

And yeah, I never looked into it much more. But I'm assuming most people listening to this, if you're in the field, you've heard of ElevenLabs. For people that aren't aware, ElevenLabs is probably the number one, the biggest text-to-speech provider. And...

They've had the highest quality voices for a long time. There are definitely some competitors. They had the most languages supported for the longest time; it was 29 different languages, then they added a few extra ones and got up to 32. But the big thing that happened in the last two weeks was the release of their newest V3 model, which supports 70 different languages. Extremely expressive. You can add in laughter and sort of direction to the

speech, so you could say, say this in this way, or slow it down, and really just prompt with specific tags to make the speech do things, say things that aren't actually words, and say things in different ways. Right. So it's pretty exciting. I mean, we were together when it came out, so, or...

Bilal Tahir (08:27.996)
Yeah, super excited.

Bilal Tahir (08:33.988)
Right, right, yeah. Yeah, I know. I remember I talked to the ElevenLabs guys, I think, the day before. They were like, tomorrow, just wait for it. So yeah, I mean, you mentioned the emotional expressiveness and stuff. How do they surface that?

Pierson Marks (08:40.558)
Right.

Pierson Marks (08:48.93)
Yeah, I mean, right now the best way to go about this, if you're interested in text to speech, is to go to ElevenLabs and their studio playground. You get this canvas, and it kind of guides you through the process of how do you create emotive, realistic speech. You create a voice, or you get a voice from ElevenLabs, one of the many in their library. And then you just type out the speech that you want the voice to say. But then if you want to add laughter or

coughing, or any of those nonverbal speech patterns, you put brackets around the text. So you'd say laugh in brackets, and then you'd write "that was so funny," and you get laughter. Or a cough. So yeah, I mean, it's pretty cool. It works really well too.
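
If you want to try the same thing programmatically, here's a rough sketch of what a v3-style call with audio tags might look like, based on ElevenLabs' documented text-to-speech endpoint. The voice ID is a placeholder and the `eleven_v3` model string is an assumption, so check the official docs:

```python
# Sketch of ElevenLabs v3-style audio tags over their text-to-speech API.
# VOICE_ID is a placeholder and "eleven_v3" is an assumed model id; the
# endpoint and header shape follow ElevenLabs' documented TTS API.
import requests

VOICE_ID = "your-voice-id"            # any voice from the ElevenLabs library
API_KEY = "your-elevenlabs-api-key"

resp = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY},
    json={
        # Square-bracket tags steer the delivery: laughs, coughs, whispers...
        "text": "[laughs] That was so funny! [coughs] Okay, okay, moving on.",
        "model_id": "eleven_v3",
    },
)
resp.raise_for_status()
with open("output.mp3", "wb") as f:
    f.write(resp.content)             # the response body is the audio itself
```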

Bilal Tahir (09:46.812)
Yeah, yeah, I mean, these tags are so interesting, and also they're very natural. It's not like, "that's so funny. Haha." It's like, "that's so funny, haha." It comes off as very natural, you know. So they really fine-tuned it toward human language, which I found very cool. I also found that in their studio, you can just add text, and there's a button called Enhance, in alpha or whatever, and it'll add the tags for you, based on, you know, after this there should probably be a laugh

Pierson Marks (09:54.638)
All right.

Bilal Tahir (10:17.006)
or a cough or whatever. So it's a very cool dimension, I think, that you can add

to really make it more expressive. Before this, we didn't really have that. We used to have something called SSML tags, which is basically like HTML for voice. So you could add stuff like, and you can still do it, a different tag. Instead of the square brackets, you add angle brackets, like break time, 1.5 seconds, between words, and it'll add pauses and stuff. But that was limited. It was mostly stuff like prosody, where you had maybe two to five different settings,

so you're still limited. These new tags really allow you to express a proper, natural, flowing conversation, and really hit a bunch of emotions, everything from angry to sad to, you know, just a soft voice whispering, et cetera. You can easily add all those features.
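
For contrast, here's what that older SSML style looks like. SSML is a W3C standard, and engines like Amazon Polly, Google, and Azure accept some dialect of it; this sketch sends it through Polly via boto3, with tag support varying by engine and voice:

```python
# A minimal SSML example: angle-bracket tags for pauses and prosody,
# sent through Amazon Polly as one engine that accepts SSML input.
import boto3

ssml = """
<speak>
  I can't believe it.
  <break time="1.5s"/>
  <prosody rate="slow" volume="soft">That was so funny.</prosody>
</speak>
"""

polly = boto3.client("polly")
response = polly.synthesize_speech(
    Text=ssml,
    TextType="ssml",       # parse the tags instead of reading them aloud
    OutputFormat="mp3",
    VoiceId="Joanna",
)
with open("output.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```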

Pierson Marks (11:09.134)
Right. I remember when we were first building JellyPod out, SSML, the Speech Synthesis Markup Language, influenced some of the decisions that we made in podcasting. But the interesting thing now is that there's not an open standard when it comes to this. ElevenLabs went out and defined, okay, we're going to use these brackets, and you just put a natural English text prompt inside, and that's going to be understood by the models.

Bilal Tahir (11:15.58)
Hmm.

Pierson Marks (11:39.16)
And I wonder if that's going to be the winning approach. It may be. But to standardize across model providers: somebody like OpenAI, they have a text to speech, and it's very good, but it's not as...

They don't have all the different languages. They have, like, six voices or something. But it would be interesting. I think it would be useful for the community to have an open-sourced schema language for this type of...

Bilal Tahir (12:06.609)
Yeah.

Bilal Tahir (12:10.202)
Yeah, no, it's interesting you mention OpenAI, because OpenAI has a very different approach. So OpenAI,

I would say for the longest time, they were second. I mean, there are other ones we'll talk about, but OpenAI is a lot cheaper than ElevenLabs, I think. And if you actually do want to play around with it, there's a site called OpenAI.fm. If you just Google OpenAI.fm, it comes up, and it lets you do it for free. As far as I know, there's no rate limit. So it's a free text-to-speech UI they have for all their voices, which is cool. But OpenAI's approach has been more high level. They take away the customization that we just talked about, the tags. What they

do is they have a system prompt and

text, and they rely on the fact that the context can capture the essence. So you can have, like, a rough cowboy voice that's angry. That's the system prompt. And then the actual prompt could be, "Hey, what are you looking at?" And instead of you spelling out how to say it, it just relies on the fact that you have the system prompt, and it'll come out as, "Hey, what are you looking at?" in that voice. Right? And they're like, you don't need to write "hey" and then, you know, "he grunts"

Pierson Marks (13:13.795)
Right.

Bilal Tahir (13:17.95)
loudly. You just say it, and the model should be smart enough to understand the context. So it's an interesting approach. In a lot of ways, I think it's a more robust approach, because ideally, as the models get better, you'd think it would just get it. But at the same time, you always have this long tail of customization where you're like, you know, I need that little laughter here and there. So I wonder if these two standards kind of meet in the middle, where you get both, you know.
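
Here's roughly what that looks like with the openai Python SDK. The `instructions` field is the "system prompt" being described; the exact model and voice names were current around the time of recording and may change:

```python
# OpenAI's approach: no inline tags, just an instructions prompt that
# describes the character once; the model infers the delivery from context.
# Model/voice names may change; check OpenAI's audio docs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="onyx",
    input="Hey, what are you looking at?",
    instructions="Speak as a rough, angry cowboy: gravelly, slow, menacing.",
) as response:
    response.stream_to_file("cowboy.mp3")
```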

Pierson Marks (13:26.008)
sort of.

Pierson Marks (13:47.48)
No, for sure. I mean... Yeah, I always wonder if the right approach will be...

like this world model that has a really in-depth understanding of the world. I know when Logan Kilpatrick came to the AI Engineer World's Fair a couple of weeks ago and they were talking about Gemini and Google's approach, the goal, at least eventually, is to consolidate all these different model offerings into one model, where it's just Gemini. There's no Gemini Pro, there's no Gemini Flash, there's no thinking, non-thinking. It's just Gemini. And based on what you're trying to achieve, it will be able to

Bilal Tahir (13:57.82)
Mm-hmm.

Pierson Marks (14:23.028)
do the thing that you want it to do, whether that's generating text, or audio, or video through Veo. It would be a very interesting world if that happens, because it's also going to be hard to parse that information out of the model, but, right...

Bilal Tahir (14:39.228)
Yeah, so you're going full LOTR, one model to rule them all.

Pierson Marks (14:44.012)
That's what I mean. And the challenging thing too: even with ElevenLabs in text to speech, ElevenLabs I think has six or so models, maybe more. They have V2, V2.5, and V3; they have Multilingual V2, Turbo V2. It's like, why should I have to care about all these things? I mean, it should just kind of work. I get the trade-offs, like if I'm building,

for example, a conversational agent and building out a call center where I want the first level of support agents to be a low-latency model. It's gonna be better than today, when you call an airline and get put on hold: "this call may be recorded for quality assurance," that robotic voice. That's the baseline today. If you can just be better than that, and still be really fast, low latency,

and just already be better. But I went on a tangent. I mean...

Bilal Tahir (15:49.818)
No, no, but it's so interesting, because you touched on voice agents. In text to speech, I would say there are two ends of the spectrum, and one is definitely more popular than the other, I feel like, at least in terms of

commercial applications, and that's voice agents. We talk more about the other end because we're JellyPod. We care about you creating amazing, rich, conversational podcasts. So we're naturally attracted to the highest quality text to speech, even if it costs a little bit more, because we want the highest quality output. We're creatives and we want very interesting generative media. That's our end. But I would say we're in the minority in this industry. On the other side are conversational agents, call centers, because there are billions of dollars of economic activity that

Pierson Marks (16:24.558)
Totally.

Bilal Tahir (16:31.102)
pass through that channel. Every business, if you think about it, has some sort of customer support call center. So on that side, as you alluded to, it's not about having a rich,

quality voice; it's about low latency. As soon as I start talking, can the agent immediately respond to me? And low cost, because if you're doing hundreds of thousands of calls per day, you want it at the lowest cost. You don't want to use ElevenLabs V3; you probably want to use a flash model or whatever. So a lot of businesses focus on that, and they're not optimizing for quality. And like you said, their baseline is the robotic "quality assurance" voice. They're like, as long as we beat that,

Pierson Marks (16:58.104)
Right.

Bilal Tahir (17:08.176)
that's good enough, let's move on. But it's also interesting, because the conversational AI space has its own issues. Probably the biggest thing with real time is what they call VAD, voice activity detection. It's basically, if I'm talking to you,

Bilal Tahir (17:27.142)
the straightforward way is: I say something, the model takes it, processes it, gives the answer. In real life, that's a very bad way to have a conversation, because there's this latency. So there are smart people working on this. Usually the way it works is, if I'm talking to you, you start thinking about an answer as I'm talking, so you're ready to go as soon as I stop. Or you might even interject; you might say, "right, right." You're nodding your head right now as I'm talking, right? So there are these algorithms and things people are working on where a model will start streaming an answer

midway, or it'll wait for some sort of silence or pause. As soon as I stop talking, it'll detect that and go, all right, that's my cue, and it starts talking. This is a whole space in the industry. It's very interesting.
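
To make that concrete, here's a toy, energy-based end-of-turn detector. Production systems use trained models (Silero VAD, WebRTC VAD), but the loop is the same idea: watch short frames, classify speech versus silence, and fire a "your turn" event after enough trailing silence. The thresholds here are made-up illustrative values:

```python
# Toy energy-based voice activity detection (VAD). Real systems use trained
# models, but the control loop is the same: frame the audio, classify each
# frame as speech or silence, end the turn after sustained silence.
import numpy as np

FRAME_MS = 30          # analyze audio in 30 ms frames
SILENCE_RMS = 0.01     # RMS energy below this counts as silence (illustrative)
END_OF_TURN_MS = 700   # this much continuous silence ends the speaker's turn

def detect_end_of_turn(audio: np.ndarray, sample_rate: int) -> bool:
    """Return True once trailing silence exceeds END_OF_TURN_MS."""
    step = int(sample_rate * FRAME_MS / 1000)
    silent_ms = 0
    for i in range(0, len(audio) - step + 1, step):
        frame = audio[i : i + step]
        rms = float(np.sqrt(np.mean(frame ** 2)))
        silent_ms = silent_ms + FRAME_MS if rms < SILENCE_RMS else 0
        if silent_ms >= END_OF_TURN_MS:
            return True   # cue the agent to start speaking
    return False
```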

Pierson Marks (18:07.331)
Mm-hmm.

Pierson Marks (18:11.832)
Totally.

Yeah, the whole interruption-handling thing. I mean, we're on video right now, and if you're just listening to this on Spotify, you're not watching the video. But there are so many subtle cues that we interpret: our lips moving, where our eyes are, whether we're thinking. There are all these dimensions to human conversation that allow us to know, without much thought, when the other person is talking. And I think it was really hard during COVID, for example, when everybody was wearing masks.

Bilal Tahir (18:22.31)
Right.

Bilal Tahir (18:43.706)
Mmm.

Pierson Marks (18:43.748)
It was very hard to have a conversation without butting in, because you weren't able to read each other's lips, you weren't able to see. So there's a big dimension of proper interruption handling, of proper conversation, that occurs outside of speech, in the more visual cues. And so you see these players like Sesame, which was a model that came out about a month or two ago, probably two months ago,

that was just really great at natural interruption handling. I think, the way that they do it right now, I haven't read the whole paper, I've just used the product, but the way I would think about building a conversational agent is: okay, one, you listen for pauses, you listen for spacing.

That's one dimension. The other dimension is that you have to be able to process what's currently being said, so you understand: is this a mid-sentence pause? Is this something that probably just requires the speaker to think a little bit more? So you're intelligently understanding what's being said, also taking into consideration how it's being said. But you never have the visual part. So I think it might be an interesting world where, if you have a camera on all the time, that would make it even easier for an AI to understand when to

interrupt or not.

Bilal Tahir (20:15.424)
I would assume so, yeah. I mean, you're right. It's funny you mention COVID, because I read this article: apparently a lot of people developed a visual cue where they would use their eyes to smile, basically. They would blink as an agreement. It's funny, we subconsciously picked up these new expressions. It's almost a new body language trait we had to adopt

to make sure we were communicating. But it's fascinating. I do think that the long game is video, basically: low-bandwidth video and microexpressions. Not just the model understanding us, but the model maybe having an avatar that smiles and nods its head at us, right? That's always the

progression, upstream from text to audio to video. You know, we're always going upstream as we can afford it and it gets better.

Pierson Marks (20:58.798)
That's all.

Pierson Marks (21:05.856)
No, completely agree.

Bilal Tahir (21:06.096)
But yeah, so that's the voice agent part. Coming back to ElevenLabs, though: they are the best, and they've been the best for the past three, four years, I guess. The gap has closed, thankfully, because more competition is good for us. You've alluded to other startups like Sesame and stuff, and you mentioned OpenAI. A couple of others I want to mention: there was a company called Play, Play.ai, PlayHT before. For a while, they were in the running as the best, highest quality

Pierson Marks (21:22.488)
Mm-hmm. Right.

Pierson Marks (21:30.924)
Right.

Bilal Tahir (21:36.032)
voice. I think they messed up their pricing, and they never opened up their API until recently. Now you can actually play with Play.ai on fal.ai and other platforms, maybe Replicate as well, and they have very interesting models. They actually have an open API for text to dialogue, which we'll get to in a second with ElevenLabs, but you can actually do multi-speaker. So,

Pierson Marks (22:02.542)
Mm.

Bilal Tahir (22:03.204)
rather than just putting in text and getting the audio for one speaker, you can say: speaker one, colon, "Hey, how are you?" Speaker two, "I'm good, how are you?" And then you can generate a whole dialogue from one text. Now, JellyPod will let you create these conversations, but we have to do a lot of engineering under the hood to make it happen and make it sound natural for you. But the cool thing about these models is that it just

takes text and audio, multimodal, with N number of speakers. So not just two; it can be three speakers. And it makes it much easier on our end to serve that, once we have this capability, and to let you use it. But also, you can imagine, because we're using a lot of hand-coded rules of thumb, the more rules you add, the more edge cases and places where it can break. But if you just have a model that natively outputs this, the quality is way better.

Pierson Marks (22:35.864)
Mm-hmm.

Bilal Tahir (23:01.788)
So that's why it's so exciting to have these things. And sorry, I know we haven't gotten to ElevenLabs, but ElevenLabs does also have this model now. It's called text to dialogue.

Pierson Marks (23:11.022)
Right. And so, is the text-to-dialogue model different from V3? Is it the same model, or, okay...

Bilal Tahir (23:18.364)
It is, I think. So it's interesting. I think their UI is just text to speech, where you put in the text and get the speech, and you can add speakers. But on their API, they have a text-to-speech endpoint, which is the old endpoint with the new model, and they have a text-to-dialogue endpoint. So they do distinguish between those two, at least on the API front. We don't have access to the API yet; it's not released to the public. But I think they distinguish it under the hood. Sorry, go ahead.
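
Since the dialogue endpoint wasn't public when this was recorded, here is a purely hypothetical sketch of what its request shape could look like, extrapolated from ElevenLabs' existing text-to-speech API; the URL and every field name below are guesses:

```python
# Hypothetical: the text-to-dialogue API was not public at recording time.
# The endpoint path, model id, and field names are all guesses extrapolated
# from ElevenLabs' existing TTS API. The point is the shape: one request,
# N speakers, one stitched audio file back.
import requests

resp = requests.post(
    "https://api.elevenlabs.io/v1/text-to-dialogue",   # assumed endpoint
    headers={"xi-api-key": "your-elevenlabs-api-key"},
    json={
        "model_id": "eleven_v3",                       # assumed model id
        "inputs": [
            {"voice_id": "voice-a", "text": "Hey, how are you?"},
            {"voice_id": "voice-b", "text": "[laughs] I'm good, how are you?"},
            {"voice_id": "voice-a", "text": "Glad to hear it."},
        ],
    },
)
resp.raise_for_status()
with open("dialogue.mp3", "wb") as f:
    f.write(resp.content)
```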

Pierson Marks (23:38.552)
Gotcha.

Pierson Marks (23:41.966)
Maybe? I don't know, I don't know. I think I looked right before this, and I saw the API docs might be up on ElevenLabs. Which would be annoying, because we've been asking them for API access and they're like, it's not available. I actually just saw an email from our account manager over there, and...

Bilal Tahir (23:53.764)
okay.

Yeah, yeah.

Pierson Marks (24:03.288)
He was like, I think he might have said that it wasn't available, but then I looked on the website, I was like, wait, it looks like it's available. I haven't tried it. It was right before this call. So we'd have to see.

Bilal Tahir (24:14.596)
I did. I tried the text to speech V3; it didn't work right before this call. But hey, I mean, that would be awesome, because I think it's super exciting. But yeah, so, Play.ai: if you do want to check it out, it's available on fal.ai. The only thing is the speakers are limited, so you have to select from their preselected speakers, I think. Maybe you can actually upload some audio. But they do have that ability. The other one...

Pierson Marks (24:19.31)
Oh, you did? Oh, okay. Well, maybe... All right.

Bilal Tahir (24:43.14)
Sorry, let me finish going through Play, and then I'll get to the other one. The other thing they have, which I think is super interesting, and they're the only ones as far as I know that can do it, is called audio inpainting. So let's say you have a phrase like, you know, "my name is Bilal." You generate that, and then you're like, I wanted it to say "my name is Pierson."

Instead of changing the Bilal to Pierson and regenerating the whole thing, you can just say, okay, edit that one word. And the cool thing is it'll only charge you for the edit. The parallel, I think, is to images. For the longest time, when it first started, we could just create images. And then somebody came out with image editing, where you could take the same image and edit a part of it. And that was cool, because you don't have to generate it from scratch: if you have an image that's 90% there, you don't want to generate a new one that

Pierson Marks (25:17.154)
Right.

Pierson Marks (25:24.93)
Right, right.

Bilal Tahir (25:32.84)
fixes the thing you were worried about but then breaks something else. Same with audio: if you really like the audio but you messed up one piece of it, you can just inpaint that. And I think this is super powerful, very new, slept on. You can imagine, when you have text to dialogue, you can pair the two: you have this long conversation and then just edit the snippets that you don't like. So it's a very powerful combination.

Pierson Marks (25:55.82)
Right. That's super interesting. And that's on Play AI, or PlayHT? Oh, nice.

Bilal Tahir (26:00.624)
That's on fal as well; they have a model there. Usually the way these companies work is, they have a model they host on fal or Replicate, but then they have a slightly more advanced model that's maybe part of their subscription. So it may be that their site has a better one. That's, I guess, their play. And then that model gets deployed elsewhere; there's always some three-month lag or something.

Pierson Marks (26:24.472)
So just so I understand: is the inpainting model a text-to-speech model that has inpainting? Or is it a different model, where I could take another snippet from over here and say, hey, change the last two seconds to something else? Or do I have to generate the audio through the same model and then inpaint with that same model?

Bilal Tahir (26:49.584)
Yeah, no, it's a good question. It requires an audio input to begin with. So you already need to have the audio; it's always an editing thing. You give it an audio URL, you give it the input text and the output text, and it diffs between them.
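
In practice that looks something like this through fal's Python client. `fal_client` is real, but the model slug and the argument names are from memory of PlayAI's inpainting model and should be checked against the model page on fal.ai:

```python
# Audio inpainting sketch via fal. The fal_client library is real; the model
# slug and the argument/response field names are assumptions -- verify them
# on the model's page at fal.ai before using.
import fal_client

result = fal_client.subscribe(
    "fal-ai/playai/inpaint/diffusion",   # assumed model slug
    arguments={
        "audio_url": "https://example.com/my_name_is_bilal.mp3",
        "input_text": "Hey there, my name is Bilal.",     # what the audio says now
        "output_text": "Hey there, my name is Pierson.",  # what it should say
    },
)
print(result["audio"]["url"])  # assumed response shape: URL of the edited audio
```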

Pierson Marks (27:03.212)
Right. Wow, that's really cool. I mean, that's even awesome for what we're doing at JellyPod too. I mean...

Bilal Tahir (27:09.348)
Yeah, exactly. So I almost think of it as a post-processing step. A good combo could be: you generate the dialogue from ElevenLabs, and then you use the inpainting model to just make the edits. And assuming it can match the quality of the input and it's not jarring, that would be an amazing workflow. Yeah, yeah. So very interesting. And...

Pierson Marks (27:17.431)
Right.

Pierson Marks (27:32.567)
wow. That's sick.

Bilal Tahir (27:36.432)
You can imagine, the way you would do this kind of conversation before is, you would generate a bunch of audio files and have to concatenate them together intelligently, you know. That's kind of a hacky approach, you could say. It's probably still the way to do it. But it does allow you to customize each snippet, because you're generating a bunch of files. With text to dialogue, it's one file, and so...

There's always been that question about how do you edit the one big file and maybe the inpainting might be the answer for that. So it's very interesting.

Pierson Marks (28:09.196)
Right, no, that's super interesting.

Bilal Tahir (28:11.132)
Speaking of dialogue, sorry, one other model that I would be remiss not to talk about, because they kind of started this, is Gemini, actually. People don't think about Gemini as an audio model, but Gemini TTS is available in AI Studio, and it's basically free. It's probably the most economical, and it's really good. So you can put in text and generate speech, but you can also do the same speaker one, speaker two text, and it'll generate the dialogue.

And it's pretty good, you know, pretty decent. There are two models: Gemini Flash TTS and Gemini Pro TTS. And just like Flash and Pro, Flash is fast and efficient, still good quality, and Pro is a little slower but higher quality. I think that's how they divide it.
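
Here's a sketch of the multi-speaker flow with the google-genai SDK. The model id was a preview name around the time of recording, and the voice names and config classes come from Gemini's speech generation docs, so treat the details as subject to change:

```python
# Multi-speaker Gemini TTS sketch using the google-genai SDK. The model id
# was a preview name at recording time; voice names and config classes are
# from Gemini's speech generation docs and may change.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents="Speaker 1: Hey, how's it going?\nSpeaker 2: Great, welcome back!",
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                speaker_voice_configs=[
                    types.SpeakerVoiceConfig(
                        speaker="Speaker 1",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
                        ),
                    ),
                    types.SpeakerVoiceConfig(
                        speaker="Speaker 2",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Puck")
                        ),
                    ),
                ]
            )
        ),
    ),
)

# The returned part carries raw PCM audio bytes (inline_data) to write out.
pcm = response.candidates[0].content.parts[0].inline_data.data
with open("dialogue.pcm", "wb") as f:
    f.write(pcm)
```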

Pierson Marks (28:56.866)
Right, that's interesting. Wait, you mentioned something there; maybe people don't know about it. What is AI Studio?

Bilal Tahir (29:05.85)
Yeah, AI Studio is the UI for using all the Gemini models. We'll leave links to all of these in the show notes. But yeah, if you want to try out Gemini, just

Google "AI Studio," and you go in and you can try text to speech, but also just try all the models. And it's really good. As we all know, I mean, anyone who's used Google products, it can be a pain to use them because of all the permissions and stuff. But AI Studio is probably the one Google product I've used that has been seamless. And I give full credit to the Gemini team for that, because they've basically built a parallel product where it's easy to go in there,

Pierson Marks (29:38.712)
Really?

Bilal Tahir (29:47.406)
you can play around with the UI, you can even generate the code. One of the cool things I like about the UI is, if you're just playing around, you can click the code button and it'll actually give you the code that was used under the hood to generate that, and you can just copy-paste it if you want to use it as an API. And it's easy to see

the different configurations. So if you upload an image or a video, you see how the API call changes. So I know exactly how to work with a video or an image; it's right there. It's a very seamless way to go from the GUI to the API code level.

Pierson Marks (30:22.692)
Interesting. Do you think it's good for the people that are, I like the term semi-technical, where it's

the people that are in the weeds with automation software like Gumloop or n8n or Zapier, and they want to know, like, what AI Studio is. I think that's kind of the audience this should be for. If you're super deep in the AI world already, you probably know what text to speech is, and a lot of the stuff we're talking about you probably know. But I think when I'm sharing this with my friends, they probably had no idea what AI Studio is.

Bilal Tahir (30:58.491)
Right. Right.

Pierson Marks (30:58.832)
It's a good spot for people who are smart and could, like, Claude Code their way to a basic web app, you know? Do you think that's the right place for them to go play around with Gemini? Is AI Studio?

Bilal Tahir (31:09.7)
Yeah, I mean, I'm biased, but I think AI Studio is super simple to play with.

Some people have complained about it, which I found surprising. I think, honestly, because it's Google, it's always held to such a high standard, and people just like complaining about Google. So a lot of times I see a lot of unfair criticism of it. That doesn't mean they're perfect; they have their issues. AI Studio, is it the prettiest UI you're going to see? No. Compared to ChatGPT, there are some annoying things, like you can't just type text and hit enter; you have to actually click run and stuff, you know? Small things like that, because

Pierson Marks (31:19.981)
Interesting.

Bilal Tahir (31:45.978)
most people who go there don't really use it like ChatGPT. They go there, they try it out, and then they get the code; they're developers. But it is such a slept-on resource, because it's literally free, basically free to use. And Gemini Pro, which is the second or third best model out there right now, is really good. So I do think Gemini is slept on. Most people think OpenAI or Anthropic are the best, but I think Gemini is up there for sure.

Pierson Marks (31:56.878)
nice, yeah.

Pierson Marks (32:04.686)
Alright.

Pierson Marks (32:14.35)
That's interesting. You know, the other day I downloaded this Spotlight replacement called Raycast. It's been really cool; I've liked it, and it's kind of like...

it integrates into the Mac as if Spotlight were AI native. And it also has a desktop chatbot where you can switch between models. And it's really cool because you can also bring your own keys. So if you just want to plug in your Gemini key or your Anthropic key or OpenAI key, you can use those. But you can also have a subscription if you just want to make it easy, like 20 bucks a month, just like ChatGPT Pro. And it hooks in with everything. We're not going to get into it today, but MCP

and all these other things that you can just prompt it to use. I know we're getting away from text to speech, but the other day, I mean yesterday, you asked me something and I didn't know the answer, but I knew it was in our project management software somewhere. And I opened up Raycast and said, hey, look through all the issues and do some research based on these issues. Go out onto the internet and figure out the best way to implement role-based permissions in

our app. And so it went through all our issues in Linear, pulled that stuff out, went on the internet, connected back to Linear, added the comment under my name. It was pretty cool. And I'm speaking into it, too. So I'm using speech to text to prompt my AI operating system to go onto the internet and connect to my project management system. It's pretty nuts. I mean...

Bilal Tahir (33:45.414)
Yeah, that is so interesting. I wonder how it'll all add up. Maybe you'll just integrate everything using MCP, like Raycast, Claude Code, Linear. Very interesting.

Pierson Marks (33:54.414)
All right. All right.

Totally. I mean, that's kind of what's happening with Claude Code. I know there's an MCP integration with Linear, so that seemed pretty cool. But yeah, getting back to it, one other thing I wanted to talk about real quick before we wrap up: I think the two other things we haven't talked about in text to speech were, one, open source models and how important those are, and also local models. I just wanted to bring this up as a thought experiment, because

Bilal Tahir (34:02.844)
Hmm.

Pierson Marks (34:26.81)
because we started off talking about latency and the importance of low-latency use cases like a phone call center. And it's really puzzling to me:

the next Pixel phone or the next iPhone, they've got to be able to have Siri running locally, on a small model. There's no way, if you have a big text-to-speech model in the cloud, or a big speech-to-speech model, there's always going to be a latency delay and a penalty there, versus having a small, maybe open source, model running locally on your phone. You're not sending all your speech up to the cloud every time you ask Siri to go

do something. It's all happening locally, on device. It's realistic, it's natural. And I think we're really close to that future. I don't know if Apple will get it right, but it should be local and private. So.

Bilal Tahir (35:19.462)
And private, yeah. No, for sure. It's funny you mention that, because the next one on my list was actually a model that I feel is this hidden gem nobody knows about, but it's probably the best small model, and it's open source. It's called Kokoro.

Pierson Marks (35:35.778)
No way.

Bilal Tahir (35:36.764)
It's 82 million parameters, very small; it can probably run on most devices. I've run it on my MacBook as well, although it takes a while, so you'll probably want to use an API. But it's on fal, it's on Replicate, and it's got a bunch of premade voices, and it is so good. And it's probably the cheapest model. I think for two cents, you can do 1,000 characters of

text to speech, which, if you compare that to ElevenLabs, which is 30 cents, I think, right? Yeah, so that's like a 15x difference. And I've used it in the past; it's really good. So if you guys are interested in open source small models, check out Kokoro. I think it is a variant of, what's it called, StyleTTS, no, StyleTTS 2, but

Bilal Tahir (36:29.116)
I forget the details, but there's a series of open source models that it builds on. But Kokoro is good. Any others? You mentioned Sesame. So Sesame is a company; they have a proprietary model, but they did release an open source model, like a 1-billion-parameter model, which is pretty good, supposedly.

Pierson Marks (36:34.67)
Right.

Pierson Marks (36:47.628)
Right.

Totally. And you know, it kind of comes full circle to what we were talking about with the open sourcing of text to dialogue, I mean, having a standard when it comes to generating speech. For any developer out there: if you've ever built with Next.js or on Vercel, Vercel has this thing called the AI SDK, and it's pretty much a wrapper around all these model providers, making it really easy, because the shape of the API for OpenAI, like their Responses API, versus Gemini's is slightly different.

And if you want to switch between calling Gemini versus OpenAI without this sort of wrapper layer, you have to change a lot of your code. So you've seen companies like OpenRouter, or the AI SDK, come out and say, hey, just use us as an abstraction over these LLMs. But we haven't really seen the same sort of abstraction yet for text to speech.
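
To make the idea concrete, here's an invented toy sketch of what a unified text-to-speech layer could look like: one `speak` signature, with each provider's API shape hidden behind an adapter. Nothing like this shipped as a standard at recording time, and the provider calls are simplified versions of the earlier sketches, assumptions included:

```python
# An invented illustration of a unified TTS interface ("AI SDK for speech").
# Each adapter hides one provider's API shape behind the same signature;
# model ids and endpoints are assumptions, as in the earlier sketches.
from typing import Protocol

class SpeechProvider(Protocol):
    def speak(self, text: str, voice: str) -> bytes: ...

class ElevenLabsProvider:
    def __init__(self, api_key: str):
        self.api_key = api_key

    def speak(self, text: str, voice: str) -> bytes:
        import requests
        resp = requests.post(
            f"https://api.elevenlabs.io/v1/text-to-speech/{voice}",
            headers={"xi-api-key": self.api_key},
            json={"text": text, "model_id": "eleven_v3"},  # assumed model id
        )
        resp.raise_for_status()
        return resp.content

class OpenAIProvider:
    def speak(self, text: str, voice: str) -> bytes:
        from openai import OpenAI
        resp = OpenAI().audio.speech.create(
            model="gpt-4o-mini-tts", voice=voice, input=text
        )
        return resp.content  # binary response body holds the audio bytes

def synthesize(provider: SpeechProvider, text: str, voice: str, path: str) -> None:
    # Calling code never touches a provider-specific API shape.
    with open(path, "wb") as f:
        f.write(provider.speak(text, voice))

# Swap vendors without changing calling code:
# synthesize(ElevenLabsProvider("key"), "Hello!", "some-voice-id", "a.mp3")
# synthesize(OpenAIProvider(), "Hello!", "onyx", "b.mp3")
```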

Pierson Marks (37:49.112)
I wonder if that's coming. I met a guy recently whose company got acquired by Vercel. He had this platform called Orate, O-R-A-T-E, and Vercel acquired it. Orate was essentially an abstraction layer for text-to-speech models, where they were able to call ElevenLabs, OpenAI, Gemini,

PlayHT, essentially, with a unified interface. I never used it, because it was acquired and then they shut it down. Now, when you go to the website, it redirects you to the AI SDK. So I wonder if they're going to embrace media gen and text to speech in that, because it makes sense: no developer wants to be locked into a vendor and unable to try the next newest model, right?

Bilal Tahir (38:43.388)
Or pay the subscription. I mean, it makes sense. These companies are VC-backed, they have investors, and as you know, recurring revenue is the gold standard, so everyone pushes toward a subscription. I get that, but it is very annoying if you just want to play around, especially if you're a developer and you want a bunch of options. You just want to pay

Pierson Marks (38:55.267)
Yeah.

Bilal Tahir (39:04.614)
on demand. And there are other companies, I mean, Sieve used to do that. I don't know if they still do, but I used to use ElevenLabs through them, and it's kind of a hack, right? It's almost like another middleman: you can literally buy the business plan and then offer hobby-plan rates, but flexible. You can be like, one API key, but we have ElevenLabs, we have Play. I'm sure people have done that, but...

Pierson Marks (39:28.387)
Right.

Pierson Marks (39:32.088)
Right.

Bilal Tahir (39:33.574)
Yeah, it is annoying that you can't just use them directly. And I would say that's not just an audio issue; it's also an issue with image services, character lip-sync, I mean, Hedra, et cetera. I'd love to try them, but they have these expensive subscriptions. Same with video. Actually, video is probably the best, because for some reason Veo 3, Kling, MiniMax, they all land on fal relatively quickly. So at least you can play around with them.

Pierson Marks (39:48.526)
Totally.

Pierson Marks (40:01.169)
Right, no, it's super true. The space is just moving so crazy fast. I mean, that's kind of our goal, I think. It's just...

we'll choose a topic each week. Today was text to speech; last week was text to video, image, media gen. This is what we're trying to do, so we'll try to stay focused on the media side. I know we got into MCPs a little bit and some other stuff, but hopefully people find this interesting. I love having these conversations with you, because I learn so many things from you, even just during this hour.

Bilal Tahir (40:31.516)
No, same. And it's such an exciting space. Even media gen is so much. I mean, we obviously focused on text to speech; that was our episode. But just to recap: this week we had Midjourney launch their video model. I mean, Midjourney is the most expressive image model, so it made sense that, because of the data they had, they would also have the most expressive, cool video model. And I've been going down these rabbit holes on Twitter of

Pierson Marks (40:44.642)
Right.

Pierson Marks (40:54.563)
Right.

Bilal Tahir (40:59.356)
seeing amazing videos, animations of people taking their old Midjourney images and animating them. I'm like, oh, this is going to be a whole new domain, you know, because people have created so much good art, and they can just animate it now. So very exciting. MiniMax also is having an open source week. They launched their image-to-video and text-to-video model called Hailuo, I kind

Pierson Marks (41:02.402)
Right.

Pierson Marks (41:15.95)
Totally.

Bilal Tahir (41:25.91)
of misremembered the name; it's called Hailuo 2. Really good, supposedly as good as the best models out there. It doesn't have voice like Veo 3; Veo 3 is the only one with multimodal audio. But Hailuo 2: really good quality, and it's very cheap. They have a light version which is only like 28 cents. Yeah. ByteDance also released their own video model, where the light one is 18 cents, called Seedream, no, sorry, Seedance, sorry.

Seedance. And Seedream is their image model, which is also supposedly the best text-to-image model. So this space moves so fast. Every week there are new models and stuff. This is why it's so exciting to be in it.

Pierson Marks (42:06.902)
And there was the other one you mentioned this week, the audio-to-video model, where you throw in an audio file and then you get a video from the audio.

Bilal Tahir (42:16.868)
Yes, so there are a bunch of them. I don't think there's a new one this week, but among the traditional ones, I would say the big one, probably the leader, is Hedra. They have Hedra Character 3, but you have to sign up for their subscription. If you want to try Hedra Character 2, you can go on Sieve; they have Character 2 there. There's also Lemon Slice, which is

supposedly good. They're both ridiculously expensive, and I know the cost of serving these models is ridiculously expensive, that's why, but hopefully they'll get that down. But you can do some cool things. You can generate podcasts, like if you've ever seen those baby-podcast videos and stuff. We'll probably do an episode on this area. I don't have the word for it; it's basically half lip sync, because you're lip syncing, but also animating an avatar. And, sorry, you just made me remember the model I showed...

Pierson Marks (42:54.798)
Yeah.

Pierson Marks (43:04.962)
Right.

Bilal Tahir (43:10.334)
I think the next stage of this is: we used to have one image, one guy or girl talking, and now you can do one image with multiple characters talking to each other. There have been a couple of models going down that path, so exciting news there. And probably the open source leader there is Hunyuan, Hunyuan Avatar,

Pierson Marks (43:21.75)
cool.

Bilal Tahir (43:36.208)
but really expensive. I did one of their videos; it cost me a dollar forty for five seconds, and it took 10 minutes. So yeah, I mean, there's a ways to go here to get it down.

Pierson Marks (43:42.818)
Whoa, wow. That's crazy. You know, that's annoying today, but it's also so exciting, because you just see the trajectory downward. It's so expensive now, but you just know, based on technology and progress, that this is going to approach zero cost and instant speed. It's...

Bilal Tahir (43:54.555)
Yes.

Bilal Tahir (44:06.032)
Yeah, I mean, I'm excited for that.

Every week at JellyPod we get asked, people are like, hey, I love the audio and stuff; I just want to make some cool video of this. And we want to give that to you, but we're just waiting for these models to get good enough and cheap. And at this point they are kind of good enough, to be honest; it's more the cost now. Usually the quality goes up, and the cost, and then it hits a certain threshold and it starts getting cheap. It's funny, I almost compare it to Roger Bannister,

Pierson Marks (44:18.304)
Right.

Bilal Tahir (44:38.27)
like, I call it the four-minute-mile moment. I've seen this happen time and again. I saw it with images with Stable Diffusion: everyone says, oh, it's too expensive, too costly, and then one model comes out and it just blows everyone away. I mean, the quality is basically similar, but it's like 10 times cheaper or whatever. And then within the next one to two months, you see a bunch of competitors matching that price. So you saw that with images. You saw that with audio, because ElevenLabs used to charge so much money, and then OpenAI came out and it was like 10 times cheaper, and then a bunch of other text-to-speech models followed.

And now with video, I feel like we're having that four-minute-mile moment, because you saw the first, like, sub-20-cent five-second video with Seedance. And Kling basically has

an amazing model at 25 cents. Veo 3 is $6 for five seconds, but it's the first model with audio. And I won't be surprised if somebody comes out with similar audio, at least 90% of the audio quality, at a 10x cheaper price. It's just going to open up a whole slew of options in the next 60 days, I think. So very exciting.

Pierson Marks (45:38.862)
Totally.

Pierson Marks (45:46.414)
It's gonna be wild. It's gonna be wild. And I mean, what better place to get those updates than this podcast? We're gonna be there in the thick of it. We're always gonna know every model that comes out, every price. It's crazy. I mean, you're just listing out all these prices. How do you keep them in your head? It's impressive.

Bilal Tahir (45:52.54)
Oh yeah, yeah. We're going to be there, week to week, trying out everything.

Bilal Tahir (46:06.241)
The trick is to not have a life. Yeah, just be glued to the internet.

Pierson Marks (46:08.332)
Yes, yes, it is. It is a good trick. Yeah, 100%. Well, I think it's the perfect time to wrap up. So if you've tuned in today, this is episode two of Creative Flux. I'm Pierson Marks, this is Bilal Tahir. And yeah, we'll see you next week. So.

Bilal Tahir (46:30.332)
Yeah, we've gotta come up with some sort of sign-off, like a call sign or whatever. But for now, bye-bye. See you.

Pierson Marks (46:35.724)
Right, right, we do. We need, like, the outro music, yeah. Sweet.

Creators and Guests

Bilal Tahir (Host)
Builder of things. Product Engineering @ Jellypod

Pierson Marks (Host)
Software engineer, designer, and CEO/Co-founder of Jellypod