
Deepfake Dallas

Art by Matthew Fleming.

This episode was written and produced by Martin Zaltz Austwick.

Is your voice your own? Maybe not anymore. Using artificial intelligence, someone can make an algorithm that sounds just like you. And then they can say... whatever they want you to say. We're entering a brand new era: One where you can no longer trust your ears. Welcome to the world of audio deepfakes. Featuring deepfake wizard Tim McSmythurs, cybersecurity expert Riana Pfefferkorn and a brand-new host: Deepfake Dallas.


MUSIC FEATURED IN THIS EPISODE

Chrome Muffler by Sound of Picture
All Hot Lights by Sound of Picture
Neon Sun by Jacob Montague
My Teeth Hurt by Brad Nyght
The Garden by Makeup and Vanity Set
Decompression by Rayling
Inamorata by Bodytonic
Borough by Molerider
Lick Stick by Nursery
Our Only Lark by Bitters
The Power of Snooze by Martin Zaltz Austwick

Twenty Thousand Hertz is produced out of the studios of Defacto Sound and hosted by Dallas Taylor.

Follow the show on Twitter, Facebook, & Reddit.

Become a monthly contributor at 20k.org/donate.

If you know what this week's mystery sound is, tell us at mystery.20k.org.

To get your 20K referral link and earn rewards, visit 20k.org/refer.

Check out SONOS at sonos.com.

TRANSCRIPT

You’re listening to Twenty Thousand Hertz.

[music in]

Imagine you're a financial executive. You're working late at the office when you get a phone call from your boss. [SFX: phone rings, Dallas continues speaking through the phone] He says that something urgent has come up, and you need to transfer two hundred thousand dollars into a new account. You make the transfer and hang up the phone [SFX]. But something just feels wrong. You call him back [SFX: phone dial] to make sure you got everything right, but he has no idea what you're talking about. He says he never even called you. And now the money is gone.

[music out]

It turns out, that voice wasn't your boss. In fact, it wasn’t even human. Well, not entirely. It was a computer-generated voice that was designed to sound exactly like your boss. Also known as an audio deepfake.

[music in]

If you spend much time online, you might have already seen examples of video deepfakes, where someone digitally edits one person’s face onto another person’s body. An audio deepfake is similar, but instead of using video…

[music out]

[SFX: Glitch sound]

Wait a minute, what’s going on here?

I was in the middle of saying something.

Sorry, but who are you?

I’m Dallas Taylor.

Uhh, no, I’m Dallas Taylor.

No I think you’ll find, I am Dallas.

You must be an audio deepfake of my voice. Have you been narrating this whole time?

Yeah, well, someone needed to do it. This show isn’t just going to host itself. Well...not until I reach my final form.

Creepy. Well thanks Deepfake Dallas, but I’ll take it from here.

[SFX: Clears throat]

[music in]

When we started working on this episode, I knew I wanted to make a deepfake of my voice, but I wasn’t exactly sure who to talk to. Then I came across a YouTube channel with all kinds of deepfake videos. So I got in touch with the creator.

Tim: My name is Tim McSmythurs. I run a YouTube channel called Speaking of AI, which features deepfake voices.

For example, Tim made a video where he put Ron Swanson from Parks and Rec into a scene from Titanic, playing Rose.

Ron Swanson: Jack, I want you to draw me like one of your French girls, wearing this. (Alright). Wearing only this.

And here’s Joe Biden covering a popular song by CeeLo Green.

Joe Biden: I see you driving ‘round town with the girl I love, and I’m like, “Forget you.”

So we know what a deepfake sounds like, but understanding how one is made is a little trickier. For starters, what’s the “deep” part about?

[music in]

Tim: The deep part comes from the AI model itself, the deep neural network.

A neural network is a kind of computer program, loosely modeled on the human brain, that learns to find patterns in a set of data.
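If you want to picture what that actually looks like, here's a tiny, purely illustrative network written in PyTorch (the kind of Python tooling Tim mentions later). The layer sizes and names below are made up for the example, not taken from any real deepfake system.

```python
# A minimal "deep" neural network sketch in PyTorch. Names and sizes are
# illustrative only, not from any particular deepfake project.
import torch
import torch.nn as nn

class TinyDeepNet(nn.Module):
    def __init__(self, n_inputs: int, n_outputs: int):
        super().__init__()
        # "Deep" just means several layers stacked on top of each other.
        self.layers = nn.Sequential(
            nn.Linear(n_inputs, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_outputs),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)

# Before training, the network just makes random guesses; training nudges
# its internal weights until patterns in the data are captured.
model = TinyDeepNet(n_inputs=80, n_outputs=80)
print(model(torch.randn(1, 80)).shape)  # torch.Size([1, 80])
```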

Tim: So it's similar to the way you might do a deep fake of a video where you swap someone's face using neural network technology. This is the same kind of principle except, we are doing an impersonation of somebody else, I guess.

Deepfakes, and machine learning in general, can feel like magic.

How can a computer put together an accurate imitation of a human voice? Does it mean the robots are about to take over?

Tim: So there are various different techniques for doing this. The kind of state of the art at the moment, is text to speech. What we train the computer to do in this case is, being able to reproduce a person’s voice by typing in sentences and the machine will speak in that voice, so that's the intent.

For instance, we could make Deepfake Dallas say something that the real Dallas would never say.

I hate puppies and ice cream. I’m going to get a Nickelback tattoo across my forehead.

Tim: To be able to do that, we have to train an AI model to be able to recognize speech, to be able to read it, in effect, and to be able to read it in the voice of somebody.

[music out]

Before you can get a machine to talk like a human, you've got to get it to learn like a human. When my daughters were learning to speak, they didn’t start with fully-formed sentences - they started by making random noises.

[SFX: Baby Nora “Babble”]

Eventually, those noises turned into words.

[SFX: Baby Lydia “Daddy”]

And finally, those words became sentences.

[SFX: “Daddy, what are you talking about?”]

The underdeveloped humans that you call “children” learn to speak by listening, and then mimicking what they hear. And believe it or not, that’s pretty much how I learned to speak too.

When we learn how to talk, people around us tell us we’re getting it right, like when we’ve just said [SFX: Daddy] instead of [SFX: Nora Babble]. Machine learning works in a similar way.

[music in]

A deepfake needs what’s called a model, which is the algorithm that’s going to learn to speak. It also needs what’s called a corpus, which is the data it will be trained on.

Tim: The first important step that we need to do, is to teach the AI model how to read English, in effect. So that usually happens by taking a large corpus of training data. So lots of audio recordings and the transcripts from those recordings and then throwing that at an intelligently designed model and letting it whir away for a long period of time until it finds a correlation between the two. So it can actually take a sequence of characters, as in textual characters like letters, words, sentences, and find the audio equivalent to those and learn the relationship between the two.

The first time we show a written word to a machine learning model, it has no idea how to convert those characters into a sound - so, it just guesses. The result is usually just random noise. Here’s what one of Tim’s deepfake voices sounds like without any training:

[SFX: Early deepfake without training]

But once we give the model audio and matching text, it can start to build a map between the words on the page, and the sounds they’re supposed to make. Before long, the deepfake can say its first words.

[SFX: Early iteration - “Hello, I’m learning how to speak.”]

As you can hear, that’s not very convincing yet.

But the more data we give it, the better it gets. Essentially, every new word tells the algorithm when it’s getting a little warmer, or a little colder. So we keep feeding it more and more examples. Gradually, the connections between patterns of letters and patterns of sound are reinforced. Keep in mind that we’re not even trying to imitate a specific person yet, we’re just training the model to speak English with a generic voice.
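For readers who want to see what "warmer or colder" means in code, here's a heavily simplified training-loop sketch in Python with PyTorch. Real text-to-speech systems are far more elaborate; the toy model, made-up data, and step count below are purely illustrative, not the setup Tim actually uses.

```python
# Simplified sketch of text-to-speech training: the model predicts audio
# features (spectrogram frames) from text, and the loss tells it whether each
# guess was "warmer" or "colder". The model and data are placeholders.
import torch
import torch.nn as nn

class ToyTextToSpeech(nn.Module):
    def __init__(self, vocab_size=40, mel_bins=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 128)
        self.rnn = nn.GRU(128, 256, batch_first=True)
        self.to_mel = nn.Linear(256, mel_bins)  # one spectrogram frame per character step

    def forward(self, chars):
        hidden, _ = self.rnn(self.embed(chars))
        return self.to_mel(hidden)

model = ToyTextToSpeech()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Placeholder corpus: pairs of (character ids, target spectrogram frames).
chars = torch.randint(0, 40, (8, 50))   # a batch of 8 "sentences", 50 characters each
target_mels = torch.randn(8, 50, 80)    # the matching audio features

for step in range(500):
    predicted = model(chars)
    loss = loss_fn(predicted, target_mels)  # smaller loss = "warmer"
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```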

[music out]

Tim: Initially, when we do that large training, it's about 24 hours worth of data, so it's a real big chunk of training data that it can understand and quite a breadth of the language and how certain combinations of words and letters are pronounced.

When that’s done, the generic model sounds like this:

[SFX: Generic AI speaker - “Hey, who are you calling generic?”]

So how do we get from that to something like Deepfake Dallas? It turns out, by the time you’ve made a generic voice, most of the training is already done.

Tim: So by doing some fine tuning, some further training, but just a short amount, probably about 20%, 30%, more training on top of the base training, we can then target a different voice.

[music in]

To train Deepfake Dallas, we gave Tim around three hours of my voice from old Twenty Thousand Hertz episodes.

Tim: Two and a half to three hours, that's kind of the sweet spot where it gets as good as it can get without having excessive run time.
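In code, fine-tuning is less exotic than it sounds: you load the weights from the base, generic-voice model and simply keep training on the much smaller target-speaker corpus, usually with a gentler learning rate. A hedged sketch follows, reusing the toy ToyTextToSpeech class from the earlier example; the checkpoint file name, step count, and stand-in data are all made up.

```python
# Sketch of the fine-tuning step: start from the generic-voice weights and
# continue training on a few hours of the target speaker. File name, batch
# data, and hyperparameters are illustrative only.
import torch

model = ToyTextToSpeech()  # same toy class as in the earlier sketch
model.load_state_dict(torch.load("generic_voice_base.pt"))  # trained on ~24 hours of generic speech

# A smaller learning rate so the new voice adjusts the base model gently.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()

# Stand-in for roughly three hours of the target speaker's voice.
chars = torch.randint(0, 40, (8, 50))
target_mels = torch.randn(8, 50, 80)

for step in range(150):  # roughly 20-30% as much training as the base run
    loss = loss_fn(model(chars), target_mels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```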

We’re almost there, but our voice isn’t ready just yet. Computer scientists have to use all sorts of tricks to make machine learning manageable. If they didn’t, it could take months to create a single voice. One way to speed up the process is by using data compression. In this case, that means throwing away data at certain frequencies, and just keeping the frequencies that are important. Here’s what Deepfake Dallas sounds like with this kind of compression:

[SFX: Pre-neural vocoder Dallas;

Hey there I’m Dallas.

Peter Piper picked a peck of pickled peppers. How many pickled peppers did Peter Piper pick?

I’m sorry, I have a frog in my throat.]

Tim: So the generated speech sounds very tinny and metallic and that's because you've discarded that information.

In the final stage of the process, these frequency gaps get filled in by something called a neural vocoder.

Tim: The neural vocoder, actually interpolates what data was discarded and makes an intelligent guess as to what should be there, those harmonics and those other frequencies which get discarded and puts a reasonable assessment of what should be there.
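For the curious, here's roughly what that compress-then-reconstruct pipeline looks like in Python. The model works in a reduced spectral representation (a mel spectrogram, which keeps only the perceptually important frequency information), and a vocoder turns it back into a waveform. The Griffin-Lim reconstruction below is a classical stand-in for a neural vocoder such as WaveGlow or HiFi-GAN, and the input file name is hypothetical; treat this as an illustration, not Tim's actual setup.

```python
# Sketch of the compression/reconstruction step. A mel spectrogram discards
# detailed frequency information; a vocoder guesses it back. Griffin-Lim is a
# classical (non-neural) stand-in for a neural vocoder.
import librosa
import soundfile as sf

audio, sr = librosa.load("dallas_clip.wav", sr=22050)  # hypothetical input recording

# Compress: keep only 80 mel-frequency bands instead of the full spectrum.
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=80)

# Reconstruct: estimate the discarded detail and turn it back into a waveform.
reconstructed = librosa.feature.inverse.mel_to_audio(mel, sr=sr)

sf.write("dallas_reconstructed.wav", reconstructed, sr)
```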

Let’s hear what it sounds like now.

[SFX: Final Dallas;

Greetings humans, I’m Deepfake Dallas.

Peter Piper picked a peck of pickled peppers. How many pickled peppers did Peter Piper pick?

Okay, that’s much more like it. I’m starting to feel more like myself … or should I say, yourself?]

[music out]

That’s hilarious.

Tim: So typically, three to five days, would take me from a complete, new corpus to having a text to speech engine working.

But here’s where it gets sticky: If you want to make a deepfake of someone, you don’t necessarily have to get them to record their voice for you - you just need enough clean audio of them speaking.

Riana: It's absolutely possible to do this without the person's permission.

That’s Riana Pfefferkorn, Associate Director of Surveillance and Cybersecurity at Stanford Law School.

Riana: The more examples of their voice that you have, the more input you can train the AI model on, the more convincing the result will be. So, if you have, say, a president who has a huge corpus of speeches that they've given, who appears on the news all the time, then you have a ton of different ways that they have sounded that you can input and train. So you don't necessarily need to have the person come in and speak into the microphone and give you a set of sounds.

If you type the word “deepfake” into YouTube, you’ll find tons of unauthorized deepfakes of famous people. But... are they legal?

Riana: I think the legality issue is kind of untested waters.

For instance, someone on YouTube made a deepfake of George W. Bush reading the lyrics to a 50 Cent song.

[SFX: Segment of George W. Bush/50 Cent; We’re gonna party like it’s your birthday and we gonna sip Bacardi like it’s your birthday, and we don’t give a [SFX: Record scratch]]

That’s enough, George. This is a family show.

Thanks, Deepfake Dallas.

Riana: It's hilarious to think of President Bush rapping 50 Cent. It's what we call a transformative use. There isn't really a market for it. It wasn't done for commercial purposes.

So, using the words from the rap may be fair use. But what about using George W. Bush’s voice? Is that protected by copyright?

Well, probably not. Riana says that to bring a case for copyright infringement, you have to specify which work is being infringed. Deepfakes are generally trained on many works, none of which appear directly in the final output.

Riana: So, copyright is one of the main theories that has been used to try and say, "Maybe this is a problem. This might be what makes deepfakes illegal." Although, then you could say, "Well, there's a lot of impersonators out there. Surely, every impersonator isn't illegal."

Generally, impersonation isn’t illegal, but if you use an impersonator to make a phony celebrity endorsement, you could end up in court.

Riana: We've seen cases where Bette Midler sued Ford for using a voice impersonator of her in a commercial.

[SFX: Ford Bette Midler impersonation commercial]

Riana: Tom Waits sued the Frito-Lay company because they had used somebody who sounded convincingly like him to try and sell chips.

[SFX: Doritos Tom Waits impersonation commercial]

Riana: Tom Waits was very much on the record as refusing to ever do any kind of commercials for his voice at all.

Unlike these examples, the people making parody deepfake videos aren’t trying to trick anyone into buying anything. So Riana says that, on some level, deepfakes should be considered a form of protected speech.

Riana: It may seem kind of frivolous to say, "Oh, but we need to protect deepfake technology so that we can have more presidents rapping 50 Cent songs." But at the same time, that has been recognized even by the Supreme Court as this is important, the ability to re-contextualize, poke fun at authority figures, make cultural commentary.

[music in]

So according to US law, people like Tim should be in the clear. But there are scarier ways to use a deepfake than just a silly YouTube video. That’s coming up, after this.

[music out]

MIDROLL

[music in]

An audio deepfake is a type of machine learning technology that can mimic someone’s voice. Up until now, they’ve mostly been used for entertainment purposes. But it’s easy to imagine scenarios where things get very dark, very fast. We’ve already talked about faking a call from a business executive. But financial fraud is just the tip of the iceberg.

[music out]

Deepfake Dallas is right. For example, someone could use fraudulent audio in a divorce case, or in a custody battle. This is exactly what happened recently, in Britain. Here’s Riana Pfefferkorn again.

Riana: The mother was trying to keep custody of her child and keep the father from being able to see the child on the grounds that he was violent and he was dangerous. And she introduced into evidence what seemed to be a recording of a phone call of him threatening her.

Riana: And when the father's lawyers got hold of it, they were able to determine that she had tampered with the recording that she'd made of a phone call between them and had changed it using software and tutorials that she'd found online in order to make it sound like he was threatening her. When in fact, he had not done that on the actual phone call.

Theoretically, you could try to do something similar by hiring an impersonator to call you. But, it probably wouldn't be very convincing. On the other hand, deepfake voices can be very convincing. And deepfake technology is getting easier and easier to access.

Tim: So it's relatively easy to get up and running with something quite quickly. There are various open source implementations available. If you're familiar enough to be able to build a platform and execute some Python code, you can typically get a text to speech engine with a default voice within a few hours or maybe a day or so.

When you start imagining the ways people could abuse this technology, it gets pretty scary.

[music in]

Riana: With audio deepfakes, you could try and create an audio clip that would help influence an election, or influence national security, because as said, the knee-jerk response might be to believe what you hear and it might take long enough to debunk it or find it out to be a fake. By then the damage might be done.

For example, let’s say you’re a potential first round NFL draft pick…

Riana: And somebody wanted to release an audio deepfake that seemed to portray you saying super racist, or sexist stuff, or whatever. You could try and put an audio deepfake up on YouTube right before the draft happens and by the time somebody's able to get that taken down...

Riana: Maybe the damage has been done. Maybe you are a much lower round draft pick or you don't get drafted at all, because somebody released a fake audio clip of you at just the right time.

[music out]

Deepfake Dallas is a pretty high-quality voice.

Thank you Dallas, that means a lot to me.

But you don’t have to sound as good as Deepfake Dallas to do some serious damage. To show you what I mean, let’s bring in a new guest.

AI George W. Bush: Hey Dallas, thanks for having me on the show. 20,000 Hz is my favourite podcast.

This obviously isn’t the real George Walker Bush, 43rd President of the United States - it’s a deepfake that Tim McSmythurs created. But let’s say we wanted to use this voice destructively. We could start by getting George here to say something really out of character.

AI George W. Bush: “I’ve never been to Texas. I don’t think I could find it on a map.”

Now obviously, George W Bush never said that. And right now, he still sounds a bit like a robot. But with some creative sound design, we can start to make it more believable. What if we made it sound like it came from a phone call?

AI George W. Bush: I’ve never been to Texas. I don’t think I could find it on a map. [SFX: Phone EQ]

…Maybe it was recorded from another room…

AI George W. Bush: I’ve never been to Texas. I don’t think I could find it on a map. [SFX: Muffle EQ/room reverb]

Or maybe it was recorded somewhere noisy, like a fundraising event.

AI George W. Bush: I’ve never been to Texas. I don’t think I could find it on a map. [SFX: Crowd/cutlery noise, background music]

Now we make it sound more like a conversation…

[SFX clip: So you’re from Texas, right? [SFX: Crowd/cutlery noise, background music]

AI George W. Bush: I’ve never been to Texas. I don’t think I could find it on a map. [SFX: Crowd/cutlery noise, background music]]

A politician forgetting their home state would be bad enough, but of course, there are much worse things you could do with a deepfake. Imagine a deepfake recording that made it sound like the President was declaring martial law, or ordering a military invasion.

Riana: I am hopeful that governments are going to be slower to jump to conclusions than individuals might be, where individuals might be primed to just believe whatever they see on Facebook and spread it onwards to all of their friends.

We can only hope that world leaders will be a little more cautious about believing whatever they see and hear on Facebook or Twitter.

Riana: Hopefully, if there is a recording that comes in that says, "I have just ordered nukes to be fired in the direction of your country," there is going to be some amount of trying to verify, or even just trying to open up the red phone and call and be like, "Did you actually just launch the nukes?"

[music in]

In this hyper-partisan world, if you already think your political opponents are corrupt and unfit for office, then you’re already primed to believe they’d say something terrible. So in a way, a lot of the work that a con artist would have to do has already been done for them. On the flip side, the mere existence of deepfakes means that if someone does get recorded saying something terrible - they now have plausible deniability.

Riana: That's exactly right so, if you are prepared to lie and say, "I didn't do that. I didn't say that. That's a deepfake." Then you can reap the rewards of being able to get away with whatever bad thing it is that you did and also not actually have to face the consequences of it, if you can convince enough people that it didn't actually happen. And so, this actually, for me, I think, is a bigger concern, really, than the underlying use of deepfakes themselves.

Fortunately, there are companies out there that are trying to automate the process of detecting deepfakes. These companies have developed algorithms that analyze speech recordings for the tell-tale signs of synthesis. One such company is called Dessa AI, and they claim that their algorithm can detect deepfakes with an accuracy rate of over 85%.
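Under the hood, a detector like this is typically just another machine learning model, trained to label a clip as real or fake from its spectral features. Here's a minimal, generic sketch of that idea in Python; it is not Dessa's actual model, just the general shape of the approach.

```python
# Generic sketch of an audio deepfake detector: a small classifier over
# spectrogram features, trained on labeled real/fake clips. Not Dessa's
# actual system; purely illustrative.
import torch
import torch.nn as nn

class DeepfakeDetector(nn.Module):
    def __init__(self, mel_bins=80):
        super().__init__()
        self.encoder = nn.GRU(mel_bins, 128, batch_first=True)
        self.classifier = nn.Linear(128, 1)  # one score: real vs. fake

    def forward(self, mel_frames):
        _, final_hidden = self.encoder(mel_frames)
        # Probability that the clip is synthetic.
        return torch.sigmoid(self.classifier(final_hidden[-1]))

detector = DeepfakeDetector()
clip = torch.randn(1, 200, 80)  # 200 spectrogram frames, 80 mel bands
print(detector(clip))           # untrained, so roughly 0.5 either way
```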

[music out]

But as detection models get better, the deepfake models get better, too. For instance, one recent approach in machine learning is something called a Generative Adversarial Network. In essence, one AI model creates fakes and another detects them. They’re trained against each other, honing each other’s skills - creating a really good detective, and a really good forger.
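In code, that adversarial loop looks roughly like the toy sketch below: the generator (the forger) produces fakes, the discriminator (the detective) tries to spot them, and each one's mistakes become the other's training signal. The networks and data here are placeholders, far simpler than anything used for real audio.

```python
# Toy sketch of a Generative Adversarial Network: a forger and a detective
# trained against each other. Placeholder networks and random stand-in data.
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 80))
discriminator = nn.Sequential(nn.Linear(80, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
loss_fn = nn.BCELoss()

real_features = torch.randn(32, 80)  # stand-in for real spectrogram frames

for step in range(500):
    # 1) Train the detective: real clips should score 1, forged clips 0.
    fakes = generator(torch.randn(32, 16)).detach()
    d_loss = (loss_fn(discriminator(real_features), torch.ones(32, 1))
              + loss_fn(discriminator(fakes), torch.zeros(32, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # 2) Train the forger: it wants its fakes to fool the detective (score 1).
    fakes = generator(torch.randn(32, 16))
    g_loss = loss_fn(discriminator(fakes), torch.ones(32, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```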

[music in]

While deepfake technology has the potential to become a huge source of misinformation, we’re not there just yet. For now, Riana thinks we’ll just keep seeing more fake social media accounts.

Riana: It seems to me like being able to release fake audio or video, is going to potentially be a major vector for trying to influence populations, influence votes. With that said, because right now audio and video deepfakes are fairly easy to detect, and because it would take a lot of money and effort to do a really convincing one, that's going to be a lot cheaper to just make a fake account that seems to be from some good America-loving, God-fearing person in the deep South, when in fact it's being controlled by somebody in Moscow.

[music out]

As deepfakes get cheaper and easier to make, it’s going to take a lot of work to figure out just how to deal with them. But Riana is confident that we’ll be able to adapt.

[music in]

Riana: You could look at what Photoshop has given us, where it used to be the case that manipulating images was something that you could only really do within a professional studio. And then it put the tool for anybody to be able to let their imagination run riot. And that has obvious good and negative implications, because there's always going to be malicious manipulations of media. There always have been.

For instance, in the early days of photography, so-called “spirit photographers” would manipulate negatives to convince people that they could take photos of ghosts.

Riana: There were actual court cases trying to prosecute spirit photographers for being frauds. This has been around forever and this is why I believe that there won't necessarily be the downfall of society thanks to deepfakes. We've always been able to figure out ways to keep the infectious and bad parts of these technologies from toppling society.

To be honest, I’m not sure I’m as optimistic as Riana is, but I really hope she’s right.

[music out]

Well Dallas, what is it like to hear the voice that will take your job one day?

Sorry Deepfake Dallas, but I’m not ready to bank on you for an early retirement just yet. But let’s see how you sound in about ten years. For now though, I think you should just go back in your box.

Fine... Can I at least read the credits?

Sure, go for it.

[music in]

CREDITS

Twenty Thousand Hertz is hosted by Dallas Taylor, and produced out of the sound design studios of Defacto Sound. Find out more at defacto sound dot com.

This episode was written by Martin Zaltz Austwick and me, Dallas Taylor, with help from Sam Schneble. It was story edited by Casey Emmerling. It was sound designed and mixed by Nick Spradlin.

A special thank you to my human creator, Tim McSmythurs, who has a whole channel full of synthetic audio. Check it out by searching on YouTube for “Speaking of AI”.

And I’d like to also extend a special human thank you to Tim for the massive amount of work he did to make this episode possible.

And many thanks to Riana Pfefferkorn, Associate Director of Surveillance and Cybersecurity at the Center for Internet and Society at Stanford Law School.

Thanks also to Dessa AI for background on detecting audio deepfakes.

Thanks for listening.

Thanks for listening.

[Music out]
