
Python Video to Text – Speech Recognition

<div>
<p>A good friend and his wife recently founded an AI startup in the lifestyle niche that uses <a rel="noreferrer noopener" href="https://blog.finxter.com/machine-learning-engineer-income-and-opportunity/" data-type="post" data-id="306050" target="_blank">machine learning</a> to discover specific real-world patterns from videos.</p>
<p>For their business system, they need a pipeline that takes a video file, converts it to audio, and transcribes the audio to standard text that is then used for further processing. I couldn’t help but work on a basic solution to help fix their business problem. </p>
<h2>Project Overview</h2>
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="568" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-276-1024x568.png" alt="" class="wp-image-1077229" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-276-1024x568.png 1024w, https://blog.finxter.com/wp-content/uplo...00x166.png 300w, https://blog.finxter.com/wp-content/uplo...68x426.png 768w, https://blog.finxter.com/wp-content/uplo...ge-276.png 1453w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
<p>I finished the project in three steps:</p>
<ul>
<li>First, install the necessary libraries.</li>
<li>Second, <strong>convert the video to an audio file</strong> (<code>.mp4</code> to <code>.wav</code>)</li>
<li>Third, <strong>convert the audio file to a text file</strong> (<code>.wav</code> to <code>.txt</code>). We first break the large audio file into smaller chunks and transcribe each one separately, because the speech recognition API limits the size of the audio it accepts.</li>
</ul>
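<p>As a rough sketch of how these steps chain together, the following function wires an audio extraction step and a transcription step into one pipeline. The name <code>transcribe_video</code> and its two callable parameters are hypothetical placeholders for the functions built in the steps that follow, not part of any library.</p>

```python
from pathlib import Path


def transcribe_video(video_path, extract_audio, transcribe_audio):
    """Run the video to audio to text pipeline and return the text file path.

    extract_audio(video_file, audio_file) is expected to write a .wav file.
    transcribe_audio(audio_file) is expected to return the text as a string.
    Both callables are placeholders (assumptions) for steps 2 and 3.
    """
    video_path = Path(video_path)
    audio_path = video_path.with_suffix(".wav")  # e.g. sample_video.wav
    text_path = video_path.with_suffix(".txt")   # e.g. sample_video.txt

    # Step 2: video to audio
    extract_audio(str(video_path), str(audio_path))
    # Step 3: audio to text
    text = transcribe_audio(str(audio_path))

    text_path.write_text(text, encoding="utf-8")
    return text_path
```

<p>Passing the two steps in as arguments keeps the sketch independent of any particular library, so you can swap in a different extractor or recognizer later.</p>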
<p>Let’s get started!</p>
<h2>Step 1: Install Libraries</h2>
<p>We need the following <code>import</code> statements in our code:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Import libraries
import speech_recognition as sr
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence
import moviepy.editor as mp</pre>
<p>Consequently, you need to <code>pip install</code> the following three libraries in your shell — assuming you run <a href="https://blog.finxter.com/how-to-check-your-python-version/" data-type="post" data-id="1371" target="_blank" rel="noreferrer noopener">Python version</a> 3.9:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">pip3.9 install pydub
pip3.9 install SpeechRecognition
pip3.9 install moviepy</pre>
<p>The <code><a href="https://blog.finxter.com/exploring-pythons-os-module/" data-type="post" data-id="19050" target="_blank" rel="noreferrer noopener">os</a></code> module is part of the Python standard library, so you don’t need to install it.</p>
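<p>If you are unsure whether the installation worked, a quick check like the following may help. It is only a sketch: the helper name <code>missing_modules</code> is made up for illustration, and it relies on the standard library’s <code>importlib.util.find_spec()</code> to test whether each module can be found without actually importing it.</p>

```python
import importlib.util


def missing_modules(module_names):
    """Return the names in module_names that Python cannot find."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Import names used in this tutorial. Note that the pip package
# SpeechRecognition is imported as speech_recognition.
required = ["speech_recognition", "pydub", "moviepy"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are installed.")
```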
<p>If you need an additional guide on how to install Python libraries, check out this tutorial:</p>
<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f449.png" alt="👉" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/how-to-install-xxx-in-python/" data-type="post" data-id="653128" target="_blank" rel="noreferrer noopener">Python Install Library Guide</a></p>
<h2>Step 2: Video to Audio</h2>
<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="690" height="460" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-252.png" alt="" class="wp-image-1075726" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-252.png 690w, https://blog.finxter.com/wp-content/uplo...00x200.png 300w" sizes="(max-width: 690px) 100vw, 690px" /></figure>
</div>
<p>Before we can run speech recognition on the video, we need to extract its audio track as a <code>.wav</code> file by calling <code>write_audiofile()</code> on the <code>audio</code> attribute of a <code>moviepy.editor.VideoFileClip</code> object.</p>
<p>Here’s the code:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def video_to_audio(in_path, out_path):
    """Convert video file to audio file"""
    video = mp.VideoFileClip(in_path)
    video.audio.write_audiofile(out_path)</pre>
<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f449.png" alt="👉" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a href="https://blog.finxter.com/python-video-to-audio/" data-type="post" data-id="1077175" target="_blank" rel="noreferrer noopener">Python Video to Audio</a></p>
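<p>A slightly more defensive variant of the video-to-audio step might look like the sketch below. It rests on a few assumptions: the function names <code>load_video_file_clip</code> and <code>video_to_audio_checked</code> are invented for this example, and the fallback import reflects my understanding that MoviePy 2.x no longer ships the <code>moviepy.editor</code> module, so please verify it against the version you have installed. The sketch also checks that the clip has an audio track at all, because <code>video.audio</code> is <code>None</code> for silent videos.</p>

```python
from pathlib import Path


def load_video_file_clip():
    """Import VideoFileClip lazily, trying the older module path first."""
    try:
        from moviepy.editor import VideoFileClip  # MoviePy 1.x layout
    except ImportError:
        from moviepy import VideoFileClip  # assumed MoviePy 2.x layout
    return VideoFileClip


def video_to_audio_checked(in_path, out_path):
    """Extract the audio track of in_path into out_path with basic checks."""
    in_path = Path(in_path)
    if not in_path.is_file():
        raise FileNotFoundError(f"Video file not found: {in_path}")

    VideoFileClip = load_video_file_clip()
    video = VideoFileClip(str(in_path))
    try:
        if video.audio is None:
            raise ValueError(f"No audio track in {in_path}")
        video.audio.write_audiofile(str(out_path))
    finally:
        # Release the file handles held by the clip
        video.close()
    return Path(out_path)
```

<p>Checking the input path before importing MoviePy means a typo in the file name fails fast with a clear message.</p>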
<h2>Step 3: Audio to Text</h2>
<p>After extracting the audio file, we can transcribe the speech in the <code>.wav</code> file. The <code>SpeechRecognition</code> library sends the audio to Google’s speech recognition service through <code>recognize_google()</code>, and we call it on chunks of the potentially large audio file.</p>
<p>Passing chunks instead of the whole audio file avoids an error for large audio files, because Google restricts the size of the audio it accepts per request.</p>
<p>We split the audio wherever there is at least 700ms of silence. You can play around with this threshold: a larger or smaller value may work better, depending on your concrete file.</p>
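<p>To build some intuition for the silence threshold, here is a toy model in plain Python. It is not how <code>pydub</code> implements <code>split_on_silence()</code>: it assumes one loudness value per millisecond, ignores the <code>keep_silence</code> padding, and uses the made-up names <code>find_long_silences</code> and <code>split_toy</code>. It only illustrates that a pause counts as a split point when it is both quiet enough and long enough.</p>

```python
def find_long_silences(levels_db, min_silence_len, silence_thresh):
    """Return (start, end) index ranges of silent runs that are long enough.

    levels_db holds one loudness value per millisecond (toy assumption).
    """
    silences = []
    run_start = None
    for i, level in enumerate(levels_db):
        if level < silence_thresh:
            if run_start is None:
                run_start = i  # a silent run begins here
        else:
            if run_start is not None and i - run_start >= min_silence_len:
                silences.append((run_start, i))
            run_start = None
    # Handle a silent run that lasts until the end of the audio
    if run_start is not None and len(levels_db) - run_start >= min_silence_len:
        silences.append((run_start, len(levels_db)))
    return silences


def split_toy(levels_db, min_silence_len, silence_thresh):
    """Return (start, end) ranges of the chunks between long silences."""
    chunks = []
    pos = 0
    for start, end in find_long_silences(levels_db, min_silence_len, silence_thresh):
        if start > pos:
            chunks.append((pos, start))
        pos = end
    if pos < len(levels_db):
        chunks.append((pos, len(levels_db)))
    return chunks


# 5 ms speech, 3 ms pause, 4 ms speech, 1 ms pause, 2 ms speech
levels = [-10] * 5 + [-50] * 3 + [-10] * 4 + [-50] * 1 + [-10] * 2
print(split_toy(levels, min_silence_len=3, silence_thresh=-30))  # [(0, 5), (8, 15)]
print(split_toy(levels, min_silence_len=1, silence_thresh=-30))  # [(0, 5), (8, 12), (13, 15)]
```

<p>A smaller minimum silence length produces more and shorter chunks, while a larger one merges short pauses into the surrounding speech.</p>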
<p>Here’s the audio-to-text function that worked for me:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">def large_audio_to_text(path):
    """Split audio into chunks and apply speech recognition"""
    # Note: r is the module-level sr.Recognizer() object created in the full script

    # Open audio file with pydub
    sound = AudioSegment.from_wav(path)

    # Split audio where silence is 700ms or greater and get chunks
    chunks = split_on_silence(sound, min_silence_len=700, silence_thresh=sound.dBFS-14, keep_silence=700)

    # Create folder to store audio chunks
    folder_name = "audio-chunks"
    if not os.path.isdir(folder_name):
        os.mkdir(folder_name)

    whole_text = ""
    # Process each chunk
    for i, audio_chunk in enumerate(chunks, start=1):
        # Export chunk and save in folder
        chunk_filename = os.path.join(folder_name, f"chunk{i}.wav")
        audio_chunk.export(chunk_filename, format="wav")

        # Recognize chunk
        with sr.AudioFile(chunk_filename) as source:
            audio_listened = r.record(source)
            # Convert to text
            try:
                text = r.recognize_google(audio_listened)
            except sr.UnknownValueError as e:
                print("Error:", str(e))
            else:
                text = f"{text.capitalize()}. "
                print(chunk_filename, ":", text)
                whole_text += text

    # Return text for all chunks
    return whole_text</pre>
<p>Need more info? Check out the following deep dive:</p>
<p class="has-base-background-color has-background"><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f449.png" alt="👉" class="wp-smiley" style="height: 1em; max-height: 1em;" /> <strong>Recommended</strong>: <a rel="noreferrer noopener" href="https://blog.finxter.com/large-audio-to-text-heres-my-speech-recognition-solution-in-python/" data-type="post" data-id="1075593" target="_blank">Large Audio to Text? Here’s My Speech Recognition Solution in Python</a></p>
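<p>The function above only handles <code>sr.UnknownValueError</code>, which is raised when the speech in a chunk is unintelligible. Because the recognizer talks to a web service, a request can also fail for network reasons, which the library reports as <code>sr.RequestError</code>. One possible way to deal with that is a small retry wrapper like the sketch below. The helper <code>recognize_with_retry</code> and its parameters are hypothetical and not part of <code>SpeechRecognition</code>; it takes the recognizer method and the exception type as arguments so it stays library-agnostic.</p>

```python
import time


def recognize_with_retry(recognize, audio, retry_on, attempts=3, delay=1.0):
    """Call recognize(audio), retrying when a retry_on exception occurs.

    recognize: a callable such as r.recognize_google (assumption)
    retry_on:  exception class or tuple of classes worth retrying,
               for example sr.RequestError
    """
    for attempt in range(1, attempts + 1):
        try:
            return recognize(audio)
        except retry_on:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay)


# Possible use inside the chunk loop (assumes r and sr from the tutorial):
# text = recognize_with_retry(r.recognize_google, audio_listened,
#                             retry_on=sr.RequestError)
```

<p>Unintelligible chunks are deliberately not retried here, since sending the same audio again is unlikely to change the result.</p>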
<h2>Step 4: Putting It Together</h2>
<div class="wp-block-image">
<figure class="aligncenter size-large"><img decoding="async" loading="lazy" width="1024" height="683" src="https://blog.finxter.com/wp-content/uploads/2023/01/image-255-1024x683.png" alt="" class="wp-image-1075808" srcset="https://blog.finxter.com/wp-content/uploads/2023/01/image-255-1024x683.png 1024w, https://blog.finxter.com/wp-content/uplo...00x200.png 300w, https://blog.finxter.com/wp-content/uplo...68x512.png 768w, https://blog.finxter.com/wp-content/uplo...ge-255.png 1282w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
</div>
<p>Finally, we can combine our functions. First, we extract the audio from the video. Second, we chunk the audio into smaller files and recognize speech independently on each chunk using Google’s speech recognition module.</p>
<p>I added comments to annotate the most important parts of this code:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Import libraries
import speech_recognition as sr
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence
import moviepy.editor as mp


def video_to_audio(in_path, out_path):
    """Convert video file to audio file"""
    video = mp.VideoFileClip(in_path)
    video.audio.write_audiofile(out_path)


def large_audio_to_text(path):
    """Split audio into chunks and apply speech recognition"""
    # Open audio file with pydub
    sound = AudioSegment.from_wav(path)

    # Split audio where silence is 700ms or greater and get chunks
    chunks = split_on_silence(sound, min_silence_len=700, silence_thresh=sound.dBFS-14, keep_silence=700)

    # Create folder to store audio chunks
    folder_name = "audio-chunks"
    if not os.path.isdir(folder_name):
        os.mkdir(folder_name)

    whole_text = ""
    # Process each chunk
    for i, audio_chunk in enumerate(chunks, start=1):
        # Export chunk and save in folder
        chunk_filename = os.path.join(folder_name, f"chunk{i}.wav")
        audio_chunk.export(chunk_filename, format="wav")

        # Recognize chunk
        with sr.AudioFile(chunk_filename) as source:
            audio_listened = r.record(source)
            # Convert to text
            try:
                text = r.recognize_google(audio_listened)
            except sr.UnknownValueError as e:
                print("Error:", str(e))
            else:
                text = f"{text.capitalize()}. "
                print(chunk_filename, ":", text)
                whole_text += text

    # Return text for all chunks
    return whole_text


# Create a speech recognition object
r = sr.Recognizer()

# Video to audio to text
video_to_audio('sample_video.mp4', 'sample_audio.wav')
result = large_audio_to_text('sample_audio.wav')

# Print to shell and file
print(result)
with open('result.txt', 'w') as f:
    print(result, file=f)
</pre>
<p>Store this code in the same folder as your video file <code>'sample_video.mp4'</code> and run it. The script creates the audio file <code>'sample_audio.wav'</code>, splits it into chunks, and prints the transcription of the video to the shell as well as to a file called <code>'result.txt'</code>.</p>
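<p>If you would rather not hard-code the file names, one option is a small command-line interface built with the standard library’s <code>argparse</code> module. The following is only a sketch: the function name <code>build_parser</code> and the option names <code>--audio</code> and <code>--out</code> are my own choices, with defaults taken from the file names used in this tutorial.</p>

```python
import argparse


def build_parser():
    """Create a parser for the video, audio, and text file names."""
    parser = argparse.ArgumentParser(
        description="Transcribe the speech in a video file.")
    parser.add_argument("video",
                        help="path to the input video, e.g. sample_video.mp4")
    parser.add_argument("--audio", default="sample_audio.wav",
                        help="where to store the extracted audio")
    parser.add_argument("--out", default="result.txt",
                        help="where to store the transcription")
    return parser


# Parse an explicit argument list for demonstration purposes.
# In a real script, call parse_args() without arguments to read sys.argv.
args = build_parser().parse_args(["sample_video.mp4"])
print(args.video, args.audio, args.out)

# With the tutorial's functions in scope, the pipeline would then be:
# video_to_audio(args.video, args.audio)
# result = large_audio_to_text(args.audio)
```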
</div>

https://www.sickgaming.net/blog/2023/01/...cognition/

xSicKxBot