How to match text to audio in Python? - python

I have an audio file and a text that corresponds to the speech in this audio file.
Is there any way to match the text to the audio so that I get something like timestamps showing where each word of the text appears in the audio?

I have found exactly what I was looking for.
The technique that matches a given text to an audio recording and returns the corresponding timestamps is called forced alignment.
Here is a very useful list of forced alignment tools: https://github.com/pettarin/forced-alignment-tools
Personally, I used aeneas, as it worked really well for me.
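For anyone who wants to try the same route, here is a minimal sketch of driving aeneas from Python. It assumes aeneas and its dependencies (ffmpeg, espeak) are installed, that all three paths are absolute, and that the text file has one fragment per line. The helper names build_config and align are made up for this example; only the Task / ExecuteTask calls come from aeneas itself.

```python
def build_config(language="eng", text_type="plain", output_format="json"):
    """Build the pipe-separated configuration string that aeneas expects."""
    return "|".join([
        f"task_language={language}",
        f"is_text_type={text_type}",
        f"os_task_file_format={output_format}",
    ])


def align(audio_path, text_path, output_path, language="eng"):
    """Run forced alignment and write the sync map (timestamps) to output_path.

    Hypothetical wrapper: all three paths are assumed to be absolute.
    """
    # Imported lazily so this module still loads where aeneas is not installed.
    from aeneas.executetask import ExecuteTask
    from aeneas.task import Task

    task = Task(config_string=build_config(language=language))
    task.audio_file_path_absolute = audio_path
    task.text_file_path_absolute = text_path
    task.sync_map_file_path_absolute = output_path

    ExecuteTask(task).execute()   # compute the alignment
    task.output_sync_map_file()   # write the result to output_path
```

With the plain text type, each line of the text file is treated as one fragment, so putting one word per line is one way to ask for word-level timestamps; how accurate that is will depend on the recording.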

Yes, that is possible. I am assuming you are aware of the basic terminology of audio processing.
See the approach described at https://www.geeksforgeeks.org/python-speech-recognition-on-large-audio-files/
It reads the audio file chunk by chunk. You can pass each chunk to audio-to-text conversion and collect the resulting text chunk by chunk.
Also, if the sample rate of the audio file is 44100 Hz, then a chunk of 8192 samples represents a time unit of around 186 milliseconds.
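The arithmetic behind that figure is just chunk size divided by sample rate. A small sketch (the function names are made up for illustration) that also gives a rough start time for the n-th chunk, which is what you would attach to the text recognized from that chunk:

```python
def chunk_duration_ms(chunk_size, sample_rate):
    """Duration of one chunk of `chunk_size` samples, in milliseconds."""
    return 1000.0 * chunk_size / sample_rate


def chunk_start_ms(chunk_index, chunk_size, sample_rate):
    """Start time of the chunk with the given zero-based index, in milliseconds."""
    return chunk_index * chunk_duration_ms(chunk_size, sample_rate)


if __name__ == "__main__":
    print(round(chunk_duration_ms(8192, 44100), 2))
    print(round(chunk_start_ms(10, 8192, 44100), 2))
```

Note that this only gives timestamps at chunk granularity, not per word.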

Related

sampling audio doesn't preserve waves (vectors)!

I made a Telegram bot, and one of its jobs is to create samples from audio files. For most audio files sent to it, the sample is perfectly fine; something like this:
However, for some audio files, the sample looks a bit odd:
As you can see, the waveform of this file is not shown! (I can assure you that the audio is not empty.)
To create the sample, I use pydub (Thanks, James!). Here's the part where I create the sample:
from pydub import AudioSegment

# start and end are in seconds; pydub slices in milliseconds
song = AudioSegment.from_mp3('song.mp3')
sliced = song[start*1000:end*1000]
sliced.export('song.ogg', format='ogg', parameters=["-acodec", "libopus"])
And then I send the sample using bot.send_voice method. Like this:
bot.send_voice(
    chat_id=update.message.chat.id,
    voice=open('song.ogg', 'rb'),
    caption=settings.caption,
    parse_mode=ParseMode.MARKDOWN,
    timeout=1000
)
The documentation of Telegram Bot API says:
Use this method to send audio files, if you want Telegram clients to
display the file as a playable voice message. For this to work, your
audio must be in an .ogg file encoded with OPUS (other formats may be
sent as Audio or Document).
That's why in this line of code:
sliced.export('song.ogg', format='ogg', parameters=["-acodec", "libopus"])
I used parameters=["-acodec", "libopus"].
Can anyone tell me what I'm doing wrong? Thanks in advance!
Shot in the dark guess:
Having just sampled those two Muse songs, "Pressure" is a much louder rock song than "The Void". I suspect the Telegram service itself just detects the music as noise when performing speech to text translation. Unlike speech, which has a wide dynamic range between spoken words, music tends to be all the same volume. Hence, the relative volume of each sample is roughly the same, which gives a flat line.
Since it happens only with some of the songs, I believe the issue is linked to the original song format. Make sure that pydub got the file parameters right, e.g. number of channels, sample width, frame rate, etc. Sometimes the resulting sample format also changes, so you can get audio in the range [-1..1] (float) and sometimes [-32768..32767] (integer).
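The two ranges mentioned above differ only by a scale factor, so converting between them is straightforward. A minimal sketch, assuming signed integer PCM samples; to_float is a made-up helper name, not part of pydub:

```python
def to_float(samples, sample_width):
    """Scale signed integer samples to floats in the range [-1.0, 1.0).

    sample_width is the number of bytes per sample (2 for 16-bit audio).
    """
    full_scale = float(2 ** (8 * sample_width - 1))  # 32768.0 for 16-bit
    return [s / full_scale for s in samples]


if __name__ == "__main__":
    print(to_float([-32768, 0, 16384], 2))
```

With pydub, the values to compare between a "good" and a "bad" file would be the channels, sample_width and frame_rate attributes of the AudioSegment, and the integer samples can be obtained with get_array_of_samples().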

Return type of wavfile.read in python

I need to analyse a sound file in order to find out when the sound is louder.
I have this:
rate, data = wavfile.read('test.wav')
I know the meaning of the rate value, but what exactly is in the data variable?
Looking at the data list works well when I want to retrieve the time intervals of the louder parts of the audio, but I can't really work out what this list means...
Thank you very much
data holds a numpy array representing the sound in your .wav file.
Some good explanations on how the sound is represented in that data can be found in the following question:
What do the bytes in a .wav file represent?
The data in a WAV file is audio samples, most often 16-bit signed integers. For WAV files you mostly care about the rate (sampling frequency) and the number of channels (if your WAV file is not mono).
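Since each entry of data is one amplitude sample, "when is the sound louder" can be estimated by computing the RMS level over short windows. A rough sketch, assuming mono audio (a one-dimensional sequence of samples); the function names and the half-second window are arbitrary choices for illustration:

```python
import math


def rms_per_window(samples, rate, window_seconds=0.5):
    """Return a list of (start_time_in_seconds, rms_level), one per window."""
    window = max(1, int(rate * window_seconds))
    result = []
    for start in range(0, len(samples), window):
        block = samples[start:start + window]
        # float() avoids integer overflow when the samples are 16-bit NumPy values
        mean_square = sum(float(x) * float(x) for x in block) / len(block)
        result.append((start / rate, math.sqrt(mean_square)))
    return result


def loudest_window(samples, rate, window_seconds=0.5):
    """Return the (start_time_in_seconds, rms_level) pair with the highest RMS."""
    return max(rms_per_window(samples, rate, window_seconds), key=lambda p: p[1])
```

It would be called as loudest_window(data, rate) with the values returned by wavfile.read. For a stereo file, data has one column per channel, so one column would have to be picked (or the channels averaged) first.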

Right way to equalize mp3 file in python

I'm currently working on a little Python script to equalize MP3 files.
I've read some docs about the MP3 file format (at https://en.wikipedia.org/wiki/ID3)
and I've noticed that the ID3v2 format has a field for equalization (EQUA, EQU2).
Using the Python library mutagen I've tried to extract this information from the MP3, but the field isn't present.
What's the right way to equalize an MP3 file regardless of the ID3 version?
Thanks in advance. Creekorful
There are two high-level approaches you can take: modify the encoded audio stream, or put metadata on it describing the desired change. Modifying the audio stream is the most compatible, but generally less desirable. However, ID3v1 has no place for this metadata, only ID3v2.2 and up do.
Depending on what you mean by equalize, you might want equalization information stored in the EQA/EQUA/EQU2 frames, or a replay gain volume adjustment stored in the RVA/RVAD/RVA2 frames. Mutagen supports the linked frames, so all but EQA/EQUA. If you need them, it should be straightforward to add them from the information in the actual specification (see 4.12 on http://id3.org/id3v2.4.0-frames). With tests they could likely be contributed back to the project.
Note that Quod Libet, the player paired with Mutagen, has taken a preference for reading and storing replay gain information in a TXXX frame.
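To check which of these frames a given file actually carries, something like the following could be used. This is a sketch under assumptions: it only looks at RVA2 and EQU2, the helper names are invented for the example, and the only mutagen calls relied on are ID3(path) and getall().

```python
ADJUSTMENT_FRAME_IDS = ("RVA2", "EQU2")


def load_tags(path):
    """Load the ID3 tag of an MP3 file (raises if the file has no ID3 header)."""
    # Imported lazily so the rest of the example works without mutagen installed.
    from mutagen.id3 import ID3
    return ID3(path)


def collect_adjustment_frames(tags, frame_ids=ADJUSTMENT_FRAME_IDS):
    """Map each frame ID to the list of matching frames found in `tags`.

    `tags` can be any object with a getall(frame_id) method, such as the
    object returned by load_tags().
    """
    return {frame_id: tags.getall(frame_id) for frame_id in frame_ids}
```

An empty list for both IDs matches what the question describes: most files simply do not contain these frames, so they would have to be written by your own script or the audio itself would have to be re-encoded.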

How do you read this .wav file byte data?

I used the python wave module and read the first frame from a .wav file and it returned this :
b'\x00\x00\x00\x00\x00\x00'
What does each byte mean and will it be the same for every frame or for just some?
I've done some research into the subject and have found that there are bytes that give information about the .wav file in front of the sound data, so does python miss out this information and skip straight to the sound data or do I have to manually separate it?
There are 2 channels and a sample width of 3 according to python.
UPDATE
I have successfully created the waveform for the wav file, it wasn't as difficult as I first thought, now to show it whilst the song is playing....
The wave module reads the header for you, which is why it can tell you how many channels there are, and what the sample width is.
Reading frames gives you direct access to the raw sample data, but because the WAV format is a bit of a mixed, confused beast it depends on the sample width and channel count how you need to interpret each frame. See this article for a good in-depth discussion on that.
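For the case in the question (2 channels, sample width 3), one frame is 6 bytes: 3 bytes for the left sample followed by 3 bytes for the right one, each a little-endian signed integer. A minimal sketch of decoding a single frame, assuming uncompressed integer PCM; decode_frame is a made-up name, not part of the wave module:

```python
def decode_frame(frame, channels, sample_width):
    """Split one raw frame into a list with one integer sample per channel."""
    if len(frame) != channels * sample_width:
        raise ValueError("frame length does not match channels * sample_width")
    values = []
    for channel in range(channels):
        raw = frame[channel * sample_width:(channel + 1) * sample_width]
        if sample_width == 1:
            # 8-bit WAV samples are unsigned, with silence at 128
            values.append(raw[0] - 128)
        else:
            # wider samples are little-endian signed integers
            values.append(int.from_bytes(raw, "little", signed=True))
    return values


if __name__ == "__main__":
    print(decode_frame(b"\x00\x00\x00\x00\x00\x00", 2, 3))
```

So the six zero bytes from the question are simply silence on both channels, which is common for the first frame of a song.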

.wav questions and python wave

The Python "wave" module gives me a list of hexadecimal bytes that I can read as numbers. Let's say the sampling frequency of my file is 11025 Hz. Is there a 'header' in those bytes that specifies this? I know I can use the wave module to get the frequency, but I want to understand the .wav file structure. Does it have a header? If I get those bytes, how do I know which ones are the music and which ones are information? If I could send these numbers to a speaker 11025 times per second with an intensity from 0 to 255, would that play the sound just as it is in the file?
Thanks!
.wav files are actually RIFF files under the hood. The WAVE section contains both the format information and the waveform data. Reading the codec, sample rate, sample size, and sample polarity from the format information will allow you to play the waveform data assuming you support the codec used.
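To make the structure concrete, here is a small sketch that walks the RIFF chunks by hand and decodes the "fmt " chunk. It assumes the standard 16-byte PCM layout at the start of the format chunk; read_fmt_chunk is a made-up name, and the demo writes a tiny 8-bit mono file into memory with the wave module so that there is something to parse.

```python
import io
import struct
import wave

FMT_FIELDS = ("audio_format", "channels", "sample_rate",
              "byte_rate", "block_align", "bits_per_sample")


def read_fmt_chunk(stream):
    """Return the fields of the 'fmt ' chunk of a RIFF/WAVE stream as a dict."""
    riff, _size, form = struct.unpack("<4sI4s", stream.read(12))
    if riff != b"RIFF" or form != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    while True:
        header = stream.read(8)  # chunk id (4 bytes) + chunk size (4 bytes)
        if len(header) < 8:
            raise ValueError("no fmt chunk found")
        chunk_id, chunk_size = struct.unpack("<4sI", header)
        if chunk_id == b"fmt ":
            return dict(zip(FMT_FIELDS, struct.unpack("<HHIIHH", stream.read(16))))
        # skip other chunks; chunks are padded to an even number of bytes
        stream.seek(chunk_size + (chunk_size & 1), io.SEEK_CUR)


# Demo: write a short 8-bit mono file at 11025 Hz into memory, then parse it.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as writer:
    writer.setnchannels(1)
    writer.setsampwidth(1)
    writer.setframerate(11025)
    writer.writeframes(bytes([128] * 100))
buffer.seek(0)
info = read_fmt_chunk(buffer)
print(info)
```

Everything before the "data" chunk is information; the music is the payload of the "data" chunk. Playing the values as 0 to 255 intensities at 11025 Hz would only reproduce the sound if the format chunk says 8 bits per sample, mono, uncompressed PCM (audio_format 1), as in this demo file.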
