Hands-On: How Apple’s New Speech APIs Outpace Whisper for Lightning-Fast Transcription


Late last Tuesday night, after watching F1: The Movie at the Steve Jobs Theater, I was driving back from dropping Federico off at his hotel when I got a text:

Can you pick me up?

It was from my son Finn, who had spent the evening nearby and was stalking me in Find My. Of course, I swung by and picked him up, and we headed back to our hotel in Cupertino.

On the way, Finn filled me in on a new class in Apple’s Speech framework called SpeechAnalyzer and its SpeechTranscriber module. Both the class and module are part of Apple’s OS betas that were released to developers last week at WWDC. My ears perked up immediately when he told me that he’d tested SpeechAnalyzer and SpeechTranscriber and was impressed with how fast and accurate they were.

It’s still early days for these technologies, but I’m here to tell you that their speed alone is a game changer for anyone who uses voice transcription to create text from lectures, podcasts, YouTube videos, and more. That’s something I do multiple times every week for AppStories, NPC, and Unwind, generating transcripts that I upload to YouTube because the site’s built-in transcription isn’t very good.

What’s frustrated me with other tools is how slow they are. Most are built on Whisper, OpenAI’s open source speech-to-text model, which was released in 2022. It’s cheap at under a penny per one million tokens, but it isn’t fast, which is frustrating when you’re in the final steps of a YouTube workflow.

I asked Finn what it would take to build a command line tool to transcribe video and audio files with SpeechAnalyzer and SpeechTranscriber. He figured it would only take about 10 minutes, and he wasn’t far off. In the end, it took me longer to get around to installing macOS Tahoe after WWDC than it took Finn to build Yap, a simple command line utility that takes audio and video files as input and outputs SRT- and TXT-formatted transcripts.
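For readers unfamiliar with the two output formats, here is a minimal Python sketch of how timed transcript segments map to SRT cues and to plain text. It is not Yap's implementation; the `Segment` structure and the function names are hypothetical, chosen only to illustrate the SRT layout of a cue number, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, the text, and a blank line between cues.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One timed chunk of transcript text (hypothetical structure)."""
    start: float  # seconds from the start of the media
    end: float    # seconds from the start of the media
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(cues)


def to_txt(segments: list[Segment]) -> str:
    """Render segments as plain text with the timing information dropped."""
    return " ".join(seg.text.strip() for seg in segments)
```

The SRT version keeps the timing that YouTube needs for captions, while the TXT version is the easier one to skim or hand to a summarizer.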

Yesterday, I finally took the Tahoe plunge and immediately installed Yap. I grabbed the 7GB 4K video version of AppStories episode 441, which is about 34 minutes long, and ran it through Yap. It took just 45 seconds to generate an SRT file. In another run, Yap ripped through nearly 20% of an episode of NPC in 10 seconds.

Next, I ran the same file through VidCap and MacWhisper, using its Large V2 and Large V3 Turbo models. Here’s how each app and model did:

App	Transcription Time (m:ss)
Yap	0:45
MacWhisper (Large V3 Turbo)	1:41
VidCap	1:55
MacWhisper (Large V2)	3:55

All three transcription workflows had similar trouble with last names and words like “AppStories,” which LLMs tend to separate into two words instead of camel casing. That’s easily fixed by running a set of find and replace rules, although I’d love to feed those corrections back into the model itself for future transcriptions.
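A set of find and replace rules like that can be very small. The Python sketch below is one way to do it, assuming the transcript has already been read into a string; the rule list and the `apply_rules` name are hypothetical examples, not the rules actually used for these shows.

```python
import re

# Hypothetical correction rules as (pattern, replacement) pairs.
# \b keeps each rule from matching inside a longer word.
RULES = [
    (r"\bApp Stories\b", "AppStories"),
    (r"\bMac Stories\b", "MacStories"),
]


def apply_rules(text: str, rules=RULES) -> str:
    """Apply each correction in order, ignoring case in the transcript."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text
```

Because the rules only touch the listed phrases, cue numbers and timestamp lines in an SRT file pass through unchanged.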

What stood out above all else was Yap’s speed. By harnessing SpeechAnalyzer and SpeechTranscriber on-device, the command line tool tore through the 7GB video file a full 2.2× faster than MacWhisper’s Large V3 Turbo model, with no noticeable difference in transcription quality.
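The 2.2× figure follows directly from the timings in the table. As a quick check, this Python sketch converts each m:ss time to seconds and divides by Yap's time; the helper names are illustrative only.

```python
def to_seconds(clock: str) -> int:
    """Convert an m:ss string such as '1:41' to a number of seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + int(seconds)


# Timings from the comparison table above.
TIMES = {
    "Yap": "0:45",
    "MacWhisper (Large V3 Turbo)": "1:41",
    "VidCap": "1:55",
    "MacWhisper (Large V2)": "3:55",
}


def slowdown_vs(times: dict, baseline: str = "Yap") -> dict:
    """Return how many times longer each tool took than the baseline."""
    base = to_seconds(times[baseline])
    return {name: round(to_seconds(t) / base, 2) for name, t in times.items()}
```

With these numbers, MacWhisper's Large V3 Turbo model took about 2.24 times as long as Yap, VidCap about 2.56 times, and Large V2 about 5.22 times. Taking the episode length as roughly 34 minutes, 45 seconds also works out to around 45 times faster than real time.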

At first blush, the difference between 0:45 and 1:41 may seem insignificant, and it arguably is, but those are the results for just one 34-minute video. Extrapolate that to running Yap against the hours of Apple Developer videos released on YouTube with the help of yt-dlp, and suddenly, you’re talking about a significant amount of time. Like all automation, picking up a 2.2× speed gain one video or audio clip at a time, multiple times each week, adds up quickly.
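To put rough numbers on that extrapolation, the Python sketch below scales the single 34-minute result up to a larger library. It assumes processing time grows linearly with media length, which one sample cannot confirm, and the 100-hour library size is a made-up figure for illustration.

```python
def projected_minutes(total_media_minutes: float,
                      sample_media_minutes: float,
                      sample_processing_seconds: float) -> float:
    """Estimate processing time in minutes, assuming linear scaling."""
    # Seconds of processing needed per minute of media in the sample run.
    rate = sample_processing_seconds / sample_media_minutes
    return total_media_minutes * rate / 60


# Hypothetical library size: 100 hours of video.
LIBRARY_MINUTES = 100 * 60

yap_minutes = projected_minutes(LIBRARY_MINUTES, 34, 45)       # 0:45 sample
whisper_minutes = projected_minutes(LIBRARY_MINUTES, 34, 101)  # 1:41 sample
saved_hours = (whisper_minutes - yap_minutes) / 60
```

Under those assumptions, 100 hours of video would take Yap a little over two hours and MacWhisper's Large V3 Turbo model nearly five, a saving of roughly two and three-quarter hours.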

Whether you’re producing video for YouTube and need subtitles, generating transcripts to summarize lectures at school, or doing something else, SpeechAnalyzer and SpeechTranscriber, which are available across the iPhone, iPad, Mac, and Vision Pro, mark a significant leap forward in transcription speed without compromising on quality. I fully expect this combination to replace Whisper as the default transcription model for transcription apps on Apple platforms.

To test Apple’s new model, install the macOS Tahoe beta, which currently requires an Apple developer account, and then install Yap from its GitHub page.