I tend to be something of a digital packrat. If there’s interesting data somewhere, I’ll collect it just in case I want to do something with it.
Helpful? Usually not. But it does lead to some interesting scripts.
In this case, I have a site that hosts videos. I want to download those videos and get a text-based transcription of them. With new AI tools, that shouldn’t be hard at all. Let’s give it a try!
Installing whisper.cpp
First up, OpenAI’s Whisper. It’s an audio-to-text model that does exactly what I’m looking for. There are two main wrappers around the model itself:
- whisper - the original Python version
- whisper.cpp - a port using the same models, but in C++
I used the Python version at first and it works fine. It’s also a bit more flexible about input: it doesn’t mind taking an mp3 in and will convert it behind the scenes. But I’m primarily writing this on an M1 Mac. Out of the box, whisper.cpp supports NEON instructions and (being C++) is just generally faster.
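For comparison, here’s roughly what the Python version looks like; a minimal sketch, assuming pip install openai-whisper and a hypothetical episode.mp3 lying around:
import whisper

# The Python version takes the mp3 directly and converts it
# behind the scenes (it shells out to ffmpeg for you)
model = whisper.load_model('base.en')
result = model.transcribe('episode.mp3')
print(result['text'])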
So let’s set that up instead, since (at least with my current setup) it’s really not so bad:
# Checkout
┌ ^_^ [email protected] ~/Projects
└ git clone [email protected]:ggerganov/whisper.cpp.git
Cloning into 'whisper.cpp'...
remote: Enumerating objects: 3024, done.
remote: Counting objects: 100% (46/46), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 3024 (delta 20), reused 38 (delta 19), pack-reused 2978
Receiving objects: 100% (3024/3024), 5.10 MiB | 18.40 MiB/s, done.
Resolving deltas: 100% (1864/1864), done.
# Build and download the base English model
┌ ^_^ [email protected] {git master} ~/Projects/whisper.cpp
└ make base.en
I whisper.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS: -framework Accelerate
I CC: Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX: Apple clang version 14.0.0 (clang-1400.0.29.202)
cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -DGGML_USE_ACCELERATE -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread -c whisper.cpp -o whisper.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread examples/main/main.cpp examples/common.cpp ggml.o whisper.o -o main -framework Accelerate
./main -h
usage: ./main [options] file0.wav file1.wav ...
options:
-h, --help [default] show this help message and exit
-t N, --threads N [4 ] number of threads to use during computation
-p N, --processors N [1 ] number of processors to use during computation
-ot N, --offset-t N [0 ] time offset in milliseconds
-on N, --offset-n N [0 ] segment index offset
-d N, --duration N [0 ] duration of audio to process in milliseconds
-mc N, --max-context N [-1 ] maximum number of text context tokens to store
-ml N, --max-len N [0 ] maximum segment length in characters
-sow, --split-on-word [false ] split on word rather than on token
-bo N, --best-of N [5 ] number of best candidates to keep
-bs N, --beam-size N [-1 ] beam size for beam search
-wt N, --word-thold N [0.01 ] word timestamp probability threshold
-et N, --entropy-thold N [2.40 ] entropy threshold for decoder fail
-lpt N, --logprob-thold N [-1.00 ] log probability threshold for decoder fail
-su, --speed-up [false ] speed up audio by x2 (reduced accuracy)
-tr, --translate [false ] translate from source language to english
-di, --diarize [false ] stereo audio diarization
-nf, --no-fallback [false ] do not use temperature fallback while decoding
-otxt, --output-txt [false ] output result in a text file
-ovtt, --output-vtt [false ] output result in a vtt file
-osrt, --output-srt [false ] output result in a srt file
-owts, --output-words [false ] output script for generating karaoke video
-fp, --font-path [/System/Library/Fonts/Supplemental/Courier New Bold.ttf] path to a monospace font for karaoke video
-ocsv, --output-csv [false ] output result in a CSV file
-oj, --output-json [false ] output result in a JSON file
-of FNAME, --output-file FNAME [ ] output file path (without file extension)
-ps, --print-special [false ] print special tokens
-pc, --print-colors [false ] print colors
-pp, --print-progress [false ] print progress
-nt, --no-timestamps [true ] do not print timestamps
-l LANG, --language LANG [en ] spoken language ('auto' for auto-detect)
--prompt PROMPT [ ] initial prompt
-m FNAME, --model FNAME [models/ggml-base.en.bin] model path
-f FNAME, --file FNAME [ ] input WAV file path
bash ./models/download-ggml-model.sh base.en
Downloading ggml model base.en from 'https://huggingface.co/ggerganov/whisper.cpp' ...
ggml-base.en.bin 100%[============================================================>] 141.11M 103MB/s in 1.4s
Done! Model 'base.en' saved in 'models/ggml-base.en.bin'
You can now use it like this:
$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav
===============================================
Running base.en on all samples in ./samples ...
===============================================
----------------------------------------------
[+] Running base.en on samples/jfk.wav ... (run 'ffplay samples/jfk.wav' to listen)
----------------------------------------------
whisper_init_from_file_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem required = 215.00 MB (+ 6.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 140.60 MB
whisper_model_load: model size = 140.54 MB
whisper_init_state: kv self size = 5.25 MB
whisper_init_state: kv cross size = 17.58 MB
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 117.99 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 20.57 ms
whisper_print_timings: sample time = 11.74 ms / 27 runs ( 0.43 ms per run)
whisper_print_timings: encode time = 304.17 ms / 1 runs ( 304.17 ms per run)
whisper_print_timings: decode time = 72.67 ms / 27 runs ( 2.69 ms per run)
whisper_print_timings: total time = 544.48 ms
And that’s really it.
Setting up a Python Poetry project
Next up, a quick Python script. I’ve recently gotten a bit annoyed with requirements.txt files (no nice CLI, no version locking by default), so I’m trying out Poetry.
It’s really quite easy as well:
# Set up a new directory for the project
┌ ^_^ [email protected] ~/Projects
└ mkdir transcript-scraper
┌ ^_^ [email protected] ~/Projects
└ cd transcript-scraper
# Initialize poetry, just use the defaults
┌ ^_^ [email protected] ~/Projects/transcript-scraper
└ poetry init -q
# Install the libraries I'm going to use
┌ ^_^ [email protected] ~/Projects/transcript-scraper
└ poetry add bs4 requests
Creating virtualenv transcript-scraper-WqErZKPo-py3.10 in /Users/jp/Library/Caches/pypoetry/virtualenvs
Using version ^0.0.1 for bs4
Using version ^2.28.2 for requests
Updating dependencies
Resolving dependencies... (0.2s)
Writing lock file
Package operations: 8 installs, 0 updates, 0 removals
• Installing soupsieve (2.4)
• Installing beautifulsoup4 (4.12.0)
• Installing certifi (2022.12.7)
• Installing charset-normalizer (3.1.0)
• Installing idna (3.4)
• Installing urllib3 (1.26.15)
• Installing bs4 (0.0.1)
• Installing requests (2.28.2)
Et voilà.
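Anything run through poetry run now uses that virtualenv. Once the script below exists (I’m assuming it gets saved as scrape.py; the URL here is a placeholder too), running it looks like:
poetry run python scrape.py https://example.com/videos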
Scraping the page
Finally, the script itself. I did use GitHub Copilot to help accelerate some of this, but it got a bit more wrong than the last time I tried it out. YMMV.
Our goals:
- Download the page
- For each video:
  - Extract date and title from this specific format
  - Generate the filenames we’ll be using
  - Check if we’ve already generated the text file; if so, skip
  - Otherwise:
    - Use requests with stream=True to download the mp4
    - Use ffmpeg to extract the audio as mp3 (this could be combined with the next step, but I wanted to keep the mp3 + text)
    - Use ffmpeg to convert the mp3 to wav (whisper needs this; this is where Copilot failed, it didn’t add the extra parameters to actually convert to a wav)
    - Use whisper to extract the text
    - Clean up the mp4 and wav, keeping only the mp3 and txt files
Something like this:
import bs4
import datetime
import os
import requests
import sys
import urllib.parse

url = sys.argv[1]
print('Scraping', url)

response = requests.get(url)
soup = bs4.BeautifulSoup(response.text, 'html.parser')

for article in soup.select('article'):
    # Extract metadata
    time = article.select_one('.time').text.strip()
    title = article.select_one('h4 a').text.strip()
    href = article.select_one('.media a').attrs['href']

    # Convert date like Mar 05, 2023 to datetime
    # Copilot just wrote this line for me; how cool is that?
    date = datetime.datetime.strptime(time, '%b %d, %Y')

    # Generate filenames
    working_path = f'data/{date:%Y}/{date:%m}/'
    output_path = 'output'
    filename = f'{date:%Y-%m-%d} - {title}'
    print(filename)

    os.makedirs(working_path, exist_ok=True)
    os.makedirs(output_path, exist_ok=True)

    mp4_filename = os.path.join(working_path, filename + '.mp4')
    mp3_filename = os.path.join(working_path, filename + '.mp3')
    wav_filename = os.path.join(working_path, filename + '.wav')
    txt_filename = os.path.join(output_path, filename + '.txt')

    # Skip if text already exists
    if os.path.exists(txt_filename):
        continue

    # Download href video
    if not os.path.exists(mp4_filename):
        print('- Downloading', mp4_filename)
        response = requests.get(href, stream=True)
        with open(mp4_filename, 'wb') as f:
            for chunk in response.iter_content(chunk_size=1024):
                f.write(chunk)

    # Convert to mp3 with ffmpeg
    if not os.path.exists(mp3_filename):
        print('- Converting', mp3_filename)
        os.system(f'ffmpeg -i "{mp4_filename}" "{mp3_filename}"')

    # Convert to wav with ffmpeg (whisper wants 16 kHz mono 16-bit PCM)
    if not os.path.exists(wav_filename):
        print('- Converting', wav_filename)
        os.system(f'ffmpeg -i "{mp3_filename}" -acodec pcm_s16le -ac 1 -ar 16000 "{wav_filename}"')

    # Extract text with whisper
    if not os.path.exists(txt_filename):
        print('- Extracting text', txt_filename)
        os.system(f'~/Projects/whisper.cpp/main -m ~/Projects/whisper.cpp/models/ggml-base.en.bin -f "./{wav_filename}" --output-txt')
        os.rename(f'{wav_filename}.txt', txt_filename)

    # Once we have the text, we no longer need to keep the video or wav
    os.remove(mp4_filename)
    os.remove(wav_filename)
So I’m doing a few ugly things.
For example, a lot of os.system calls instead of subprocess.check_call / subprocess.check_output. If there are any quotes in the filenames, I’ve opened myself right up for a command injection attack.
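For the record, the fix isn’t much more code. A minimal sketch of the wav conversion going through subprocess instead, reusing the mp3_filename / wav_filename variables from the script above:
import subprocess

# Arguments are passed as a list, so no shell is involved and
# quotes in a filename can't turn into command injection
subprocess.check_call([
    'ffmpeg', '-i', mp3_filename,
    '-acodec', 'pcm_s16le', '-ac', '1', '-ar', '16000',
    wav_filename,
])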
And then there’s the super ugly whisper command:
~/Projects/whisper.cpp/main
-m ~/Projects/whisper.cpp/models/ggml-base.en.bin
-f "./{wav_filename}"
--output-txt
Having to specify the executable’s full path made sense, since I didn’t put it on my PATH. I’ll probably fix that at some point, but not right now.
But having to specify the full path to the model (rather than whisper checking first relative to my working directory and then its own directory) is a bit annoying. And if you don’t have the ./ prefix on the file path, who knows where it’s going to run from…
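If I do clean this up, expanding the paths in Python would sidestep both annoyances at once; a sketch (again going through subprocess rather than os.system):
import os
import subprocess

# subprocess doesn't expand ~ the way a shell would, so do it ourselves
whisper_dir = os.path.expanduser('~/Projects/whisper.cpp')
subprocess.check_call([
    os.path.join(whisper_dir, 'main'),
    '-m', os.path.join(whisper_dir, 'models', 'ggml-base.en.bin'),
    # An absolute path means it no longer matters where whisper runs from
    '-f', os.path.abspath(wav_filename),
    '--output-txt',
])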
But other than that, it worked great: 10 minutes of audio transcribed in a few seconds.
And that’s it. Something I’ll set up as a cronjob and just leave running until I decide to try to do something more with it. 😄
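Something like this hypothetical crontab entry (the scrape.py name and the URL are placeholders again):
# Every night at 2am
0 2 * * * cd ~/Projects/transcript-scraper && poetry run python scrape.py https://example.com/videos >> scrape.log 2>&1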
Onward!