Automatic Audio Books

PUBLISHED ON APR 24, 2020 — OTHER

Ideas on my “Someday” list usually spend a couple of weeks before leaving forever or returning in disgrace (mine not its). “Can I automatically create an audiobook?” had a successful turn around of about three days.

The book I choose was Carroll Quigley’s “The Evolution Of Civilizations”. I’m at a point where I am militantly electronic with my reading, so the pdf link from Twitter was perfect. Skimming the book I got the sense that I’d prefer to listen rather than read it - a bifurcation that I recently adopted after ignoring audiobooks for years. No audiobook on Amazon’s left me only one option. (Well there was the possibility of using an end to end implementation like Speechify but where would be the fun in that?)

Three days from idea to completion is the small, visible part of an enormous glacier. Beneath the surface lies hours of previous research into pdf extraction and text to speech (also speech to text) programs. The pipeline and its implementation almost immediately existed as one entity in my mind. Here was my process:

  • Use pdf extraction to get the text
library(pdftools)

text_extracted <- pdf_text("auto-audio-book/CarrollQuigley-TheEvolutionOfCivilizations-AnIntroductionToHistoricalAnalysis1979.pdf")

text_concatenated <- paste(text_extracted, collapse = " ")
con <- file("evol_of_civilizations.txt")
writeLines(text_concatenated, con)
close(con)

Check.

  • Clean the text

So I lied about the automatically part. I still had to go into the document and delete footers, references, the table of contents, etc, and I left stuff in there because it was too tedious to go through a 400+ page book to get every single one. Much easier to ignore them while listening. Alternatively this could be done after the text to speech conversion using a service like descript.

  • Use text to speech to create the audio

This turned out to be more of a hassle than I thought it would be. First I went through Google’s convoluted process to get the Text To Speech API working. Then my attempt at making a single mp3 was thwarted by the API’s limits. Chunking the file and adding in a delay (*What felt like two hours later*) left me with 180 separate mp3s. (In retrospect I should have used the multiprocess module to speed it up.)

from google.cloud import texttospeech
import time

def tts_book(ind, text_chunk):
    # Instantiates a client
    client = texttospeech.TextToSpeechClient()

    # Set the text input to be synthesized
    synthesis_input = texttospeech.types.SynthesisInput(text = text_chunk)

    # Build the voice request, select the language code ("en-US") and the ssml
    # voice gender ("neutral")
    voice = texttospeech.types.VoiceSelectionParams(
        language_code = 'en-US-Wavenet-B',
        ssml_gender = texttospeech.enums.SsmlVoiceGender.NEUTRAL
        )

    # Select the type of audio file you want returned
    audio_config = texttospeech.types.AudioConfig(
        audio_encoding = texttospeech.enums.AudioEncoding.MP3
        )

    # Perform the text-to-speech request on the text input with the selected
    # voice parameters and audio file type
    response = client.synthesize_speech(synthesis_input, voice, audio_config)

    # The response's audio_content is binary.
    with open('output_{}.mp3'.format(ind), 'wb') as out:
        # Write the response to the output file.
        out.write(response.audio_content)
        print('Audio content written to file "output_{}.mp3"'.format(ind))
        
    return(1)

with open("auto-audio-book/evol-of-civilizations-cleaned.txt", "r") as f:
    evol_civ = f.read().replace("\n", " ")

chunks = []
temp = evol_civ

while len(temp) != 0:
    
    chunks.append(temp[:4000])
    temp = temp[4000:]
    
for ind, chunk in enumerate(chunks): 
    _ = tts_book(ind, chunk)
    time.sleep(10)

It was a pretty small wrench though. I knew ffmpeg was the solution I needed. The tricky bit was in getting it to work. After some Googling I assembled the parts:

# Get all the mp3 names and write them to a text file.
ls | grep "output" > mp3-files.txt

# Prepend the word "file" to each line of said text file.
awk '{print "file " $0}' mp3-files.txt > mp3-files.txt

# Bring in the heavy guns to produce the combined file
ffmpeg -f concat -safe 0 -i mp3-files.txt -c copy output-final.mp3

Success!

It lacks the sophistication of a robust pipeline but it works. Maybe I’ll make an auto audio book creation service. You can find the pdf here and the mp3 here.