yt-cast: Generating podcasts from YouTube URLs

Today’s goal: Turn a collection of YouTube links into a podcast.

Start with a config.json like this:

{
  "brandon-sanderson": [
    "https://www.youtube.com/watch?v=H4lWbkERlxo&list=PLSH_xM-KC3ZuOZayK68JAAjj5W9ShnFVC",
    "https://www.youtube.com/watch?v=YyaC7NmPsc0&list=PLSH_xM-KC3ZtjKTR2z8rPWxv1pP6bOVzZ",
    "https://www.youtube.com/watch?v=o3V0Zok_kT0&list=PLSH_xM-KC3ZuteHw3G1ZrCDWQrAVgO0ER"
  ]
}

And it will automatically download all referenced YouTube videos, convert them to MP3 (both using youtube-dl), and serve an RSS feed that’s compatible with most podcast programs.

Tested URLs include:

  • Playlist URLs (like the above)
  • Single video URLs
  • Channel URLs

Most youtube URLs should work though.

How? With these parts:

  • The update-thread to download all the playlists and metadata
  • The download-thread to download the actual episodes and convert them to mp3s
  • Flask routes to generate the RSS feed and return the mp3s
  • The main function

Update thread

# A thread to download information on the requested URLs periodically
def update_thread():
    # Update once a minute, but caches will probably mostly be used
    while True:
        with open(CONFIG_FILE, 'r') as fin:
            config = json.load(fin)

        for key in config:
            for url in config[key]:
                path = path_for(url)

                # If we don't have the metadata (or an update in the last hour), download it
                if not os.path.exists(path) or os.path.getmtime(path) + CACHE_TTL < time.time():
                    logging.info(f'Fetching {url}')
                    try:
                        with youtube_dl.YoutubeDL(YDL_OPTS) as ydl:
                            info = ydl.extract_info(url, download=False)
                            
                        with open(path, 'w') as fout:
                            json.dump(info, fout)

                        # Queue downloads for all videos (existing ones will be skipped)
                        if 'entries' in info:
                            for entry in info['entries']:
                                if entry['upload_date'] >= cutoff():
                                    DOWNLOAD_QUEUE.append(entry['id'])
                        else:
                            if info['upload_date'] >= cutoff():
                                DOWNLOAD_QUEUE.append(info['id'])

                    except Exception as ex:
                        logging.error(f'Failed to fetch {url}: {ex}')

        time.sleep(60)

Go through the config file and use YouTubeDL to download the info files:

with youtube_dl.YoutubeDL(YDL_OPTS) as ydl:
    info = ydl.extract_info(url, download=False)

If there are entries, we have more than one video. Include them all. If not, it’s a single video. I did add a cutoff function to make sure I didn’t download every video on long feeds:

# When we should return history back to
def cutoff():
    return (datetime.date.today() - datetime.timedelta(**CUTOFF)).strftime('%Y%m%d')

The next thing that we will do is generate a hashed filename for the URL (mostly to make sure unusual filenames etc aren’t a problem):

# Get the cache file for a url
def path_for(url):
    hash = hashlib.md5(url.encode()).hexdigest()
    return f'data/{hash}.json'

This also uses the file mtime to make sure that we only download files every so often, even if this update script runs once per minute.

Download thread

Next up, the download thread!

# Download a single youtube video
def download_thread():
    # Prepopulate with any missing videos
    logging.info('Prepopulating download queue')
    for key in config:
        for url in config[key]:
            path = path_for(url)
            if not os.path.exists(path):
                continue

            with open(path, 'r') as fin:
                info = json.load(fin)
                if 'entries' in info:
                    for entry in info['entries']:
                        DOWNLOAD_QUEUE.append(entry['id'])
                else:
                    DOWNLOAD_QUEUE.append(info['id'])

    # Download loop
    while True:
        while DOWNLOAD_QUEUE:
            id = DOWNLOAD_QUEUE.pop()
            logging.info(f'Download queue [{len(DOWNLOAD_QUEUE)}]: {id}')

            url = f'https://www.youtube.com/watch?v={id}'
            path = path_for(url)
            if os.path.exists(path):
                continue

            logging.info(f'Downloading video at {url}')
            with youtube_dl.YoutubeDL(YDL_OPTS) as ydl:
                ydl.extract_info(url, download=True)

        time.sleep(60)

It starts with a single download ahead of time because the mp3 server doesn’t like not having files. If files are already updated, it should go very quickly and then go into the main update loop which goes through newly queued videos and downloads them. It’s actually the same functionality as getting the info, just for single videos and with download=True. Onwards!

Flask server: generate podcast RSS/XML

First, let’s generate the XML files:

@app.route('/<key>.xml')
def podcast(key):
    entries = []

    for url in config[key]:
        path = path_for(url)
        if not os.path.exists(path):
            continue

        with open(path, 'r') as fin:
            info = json.load(fin)
            if 'entries' in info:
                for entry in info['entries']:
                    if entry['upload_date'] >= cutoff():
                        entries.append(entry)
            else:
                if info['upload_date'] >= cutoff():
                    entries.append(info)

    entries = list(reversed(sorted(entries, key=lambda entry: entry['upload_date'])))

    # Generate the XML
    return flask.Response(
        flask.render_template('podcast.xml', key = key, entries = entries, format_date = format_date),
        mimetype='application/atom+xml'
    )

This will filter through videos based on the cutoff (above). Most of the work is done in the Jinja templates.

<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" version="2.0">
    <channel>
        <title>{{ key }}</title>
        <link>{{ request['url'] }}</link>
        <language>en-us</language>
        <itunes:subtitle>Generated by yt-cast</itunes:subtitle>
        <itunes:author>{{ entries[0]['uploader'] }}</itunes:author>
        <itunes:summary>{{ key }}</itunes:summary>
        <description>{{ key }}</description>
        <itunes:owner>
            <itunes:name>{{ entries[0]['uploader'] }}</itunes:name>
            <itunes:email>[email protected]</itunes:email>
        </itunes:owner>
        <itunes:explicit>no</itunes:explicit>
        <itunes:image href="{{ request.host_url }}static/rss.svg" />
        <itunes:category>Comedy</itunes:category>
        {% for entry in entries %}
        <item>
            <title>{{ entry['title'] }}</title>
            <itunes:summary></itunes:summary>
            <description>{{ entry['description'] }}</description>
            <link>{{ entry['webpage_url'] }}</link>
            <enclosure url="{{ request.host_url }}{{ entry['id'] }}.mp3" type="audio/mpeg" length="1024"></enclosure>
            <pubDate>{{ format_date(entry['upload_date']) }}</pubDate>
            <itunes:author></itunes:author>
            <itunes:duration>00:00:01</itunes:duration>
            <itunes:explicit>no</itunes:explicit>
            <guid>{{ path }}</guid>
        </item> 
        {% endfor %}
    </channel>
</rss>

It’s half incomplete, but it’s at least functional. One interesting thing I did learn was that the itunes:duration and enclosure@length don’t actually have to be realistic values, but for many programs they do have to be set. Legacy!

Flask server: serve mp3s

This one is really quick. It does require that the id file actually look like an ID (primarily to prevent a directory traversal attack, although Flask should do that). Then just send_file it back. This could be much more efficient by using a reverse proxy (nginx etc) in front of Flask to actually serve the static files, but in practice it seems to be working well enough.

@app.route('/<id>.mp3')
def episode(id):
    if not re.match(r'^[a-zA-Z0-9_-]+$', id):
        raise Exception('Close but no cigar')

    return flask.send_file(f'data/{id}.mp3')

Main

Start it all up and we’re good to go:

if __name__ == '__main__':
    threading.Thread(target=download_thread, daemon=True).start()
    threading.Thread(target=update_thread, daemon=True).start()

    app.run(host = '0.0.0.0')

Setting this as daemon threads means that when the server is shut down, the threads will go with it.

TODOs

  • Automatically remove files that have passed the cutoff date

Source

Full source: https://github.com/jpverkamp/yt-cast

If you have any ideas, send in a pull request or shoot me an email.