Here’s how I transcribe audio on macOS. To prepare:
- Install Homebrew
- Install required tools:

  ```sh
  brew install ffmpeg whisper.cpp
  ```

- Download an appropriate Whisper speech recognition model, for example, from the Hugging Face repository. I have been using the `ggml-large-v3.bin` model with good results.
Then, to convert a file:
```sh
# Convert MP3 file to WAV
ffmpeg -i lecture01.mp3 -ar 16000 lecture01.wav

# Run Whisper model to transcribe audio
whisper-cli --language en --max-context 0 --max-len 65 --split-on-word --output-json --model ~/Downloads/ggml-large-v3.bin --file lecture01.wav --output-file lecture01

# Convert JSON output to text (see explanation and sample script below)
python3 json2text.py lecture01.json > lecture01.txt

# Cleanup
rm lecture01.wav lecture01.json
```
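If you process many lectures, the steps above are easy to wrap in a small driver script. Here is a minimal sketch of my own (the `build_commands` helper and the hard-coded model path are my invention, not part of whisper.cpp; adjust the flags and paths to taste):

```python
# transcribe.py - hypothetical wrapper around the pipeline above
import subprocess
import sys
from pathlib import Path

# Assumed model location; change to wherever you downloaded the model
MODEL = Path.home() / "Downloads" / "ggml-large-v3.bin"

def build_commands(mp3_path):
    """Return the ffmpeg and whisper-cli command lines for one MP3 file."""
    stem = Path(mp3_path).with_suffix("")   # strip .mp3
    wav = f"{stem}.wav"
    return [
        # Convert MP3 to 16 kHz WAV, as whisper.cpp expects
        ["ffmpeg", "-i", str(mp3_path), "-ar", "16000", wav],
        # Transcribe, writing JSON next to the input file
        ["whisper-cli", "--language", "en", "--max-context", "0",
         "--max-len", "65", "--split-on-word", "--output-json",
         "--model", str(MODEL), "--file", wav, "--output-file", str(stem)],
    ]

if __name__ == "__main__" and len(sys.argv) > 1:
    for cmd in build_commands(sys.argv[1]):
        subprocess.run(cmd, check=True)
```

Separating command construction from execution also makes the script easy to dry-run: you can print the command lists before committing to a 20-minute transcription.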
Some notes on the above:
- The tool automatically appends a file extension to the output file name; this is why I have left the extension off in the `--output-file` specification.
- I had difficulty with the model hallucinating about 30 minutes into a 45-minute lecture recording; it generated repetitive text from that point to the end of the recording. I found the use of `--max-context` recommended to help with this, and it did help in my case. Other people recommend switching to the `ggml-large-v2` model, or restarting the transcription near that point in the recording using the `--offset-t` parameter (note that this takes input in milliseconds).
- I like the idea of including timestamps in the output text, and while the tool prints them on stdout, it does not include them in an output text file. Therefore I've chosen to output JSON and convert it to text myself. I use the simple script below to do so.
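Since `--offset-t` takes milliseconds, restarting at (say) the 30-minute mark needs a quick conversion. A trivial helper (my own, not part of whisper.cpp) makes the arithmetic explicit:

```python
def minutes_to_ms(minutes):
    """Convert a minute offset to the milliseconds expected by --offset-t."""
    return minutes * 60 * 1000

# To restart transcription at the 30-minute mark:
#   whisper-cli ... --offset-t 1800000
print(minutes_to_ms(30))  # 1800000
```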
```python
# json2text.py
import json
import sys

# Load the JSON file produced by whisper-cli --output-json
with open(sys.argv[1]) as f:
    j = json.load(f)

# Print each segment as "[from --> to] text"
for i in j['transcription']:
    print(f"[{i['timestamps']['from']} --> {i['timestamps']['to']}] {i['text']}")
```
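For reference, the JSON the script consumes looks roughly like the sample below (field names as emitted by whisper.cpp's `--output-json` at the time of writing; the exact structure may vary between versions, and the segment contents here are invented):

```python
import json

# A hypothetical two-segment sample of the whisper.cpp JSON output shape
sample = json.loads("""
{
  "transcription": [
    {"timestamps": {"from": "00:00:00,000", "to": "00:00:04,500"},
     "text": " Welcome to the lecture."},
    {"timestamps": {"from": "00:00:04,500", "to": "00:00:09,000"},
     "text": " Today we discuss Melchizedek."}
  ]
}
""")

# The same loop as json2text.py, applied to the sample
for i in sample['transcription']:
    print(f"[{i['timestamps']['from']} --> {i['timestamps']['to']}] {i['text']}")
```

Note that the segment text carries a leading space, so the output has two spaces after the closing bracket; you could `.strip()` it if that bothers you.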
As a test case I transcribed a Biblical theology lecture, and I was pleased to find that the model had no difficulty with names such as Hittites, Ephraim, Manasseh, Melchizedek, and Eber. It also had a decent sense of capitalization of titles and of acronyms. My test case was also a relatively poor-quality recording and this did not seem to pose a problem for the model.
I found that on an M1 MacBook Air this 45-minute lecture took about 20 minutes to transcribe, and it successfully leveraged the GPU. This was much faster than using the faster-whisper tool (about 120 minutes, using CPU only). I also attempted the insanely-fast-whisper tool, but it took even longer and had difficulty using the GPU: I confess I did no tuning, but even with a batch size of 4 (as recommended) it failed after many hours with a GPU allocation error. So I am quite pleased with the performance of whisper.cpp!
By comparison, an M4 MacBook Pro was able to process the same 45-minute lecture in 3 minutes.