I'm learning how to use the py-webrtcvad library for Python and trying to pick out segments containing speech in a .wav file. Right now, I don't know how to split a .wav file into smaller frames to feed into the vad object. Here's my code below:
spf = wave.open("test4.wav", "rb")
signal = spf.readframes(-1)
vad = webrtcvad.Vad(3)
sample_rate = 16000
frame_duration = 10 # ms
#TODO: turn signal into segments and run the below print statement in a for loop
print('Contains speech: %s' % (vad.is_speech(signal, sample_rate)))
As I have read at https://github.com/wiseman/py-webrtcvad, the frames fed into the vad object must be 16-bit PCM and either 10, 20, or 30 ms long. My sample is about 9 s long, and I don't know how to split it into frames of the appropriate size. I have tested example.py in the repo, but it is a bit too complicated for me, so I tried to do a very simple example first.
Thank you.
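For reference, a minimal sketch of how the splitting could look, assuming test4.wav is 16-bit mono PCM at one of the supported rates; at 16 kHz a 10 ms frame is 160 samples, i.e. 320 bytes:
import wave
import webrtcvad

vad = webrtcvad.Vad(3)

with wave.open("test4.wav", "rb") as spf:
    sample_rate = spf.getframerate()           # must be 8000, 16000, 32000 or 48000
    signal = spf.readframes(spf.getnframes())  # raw 16-bit PCM bytes

frame_duration = 10                                          # ms
frame_bytes = int(sample_rate * frame_duration / 1000) * 2   # 2 bytes per 16-bit sample

# walk over the signal in whole frames; a trailing partial frame is skipped,
# since the VAD only accepts complete 10/20/30 ms frames
for offset in range(0, len(signal) - frame_bytes + 1, frame_bytes):
    frame = signal[offset:offset + frame_bytes]
    print('Contains speech: %s' % vad.is_speech(frame, sample_rate))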
The first line below is not part of the Python code, so it is prefixed with the ! operator, since it is a command for the Google Colab command line (I am first testing this code in a notebook, and then the same code can be used on a local PC). This first line is necessary for the code to work in Google Colab.
! pip install -q transformers
And here is the code:
import librosa
import torch
# to download Wav2Vec2 from Hugging Face's Transformers library
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
In this case, I load the tokenizer and the pre-trained model:
#load model and tokenizer
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
Here is the problem: I am loading the complete audio file, which probably saturates the RAM when trying to process an .mp3 file with a duration of 14:18 min (6.54 MB).
I choose a 16000 Hz sampling rate because this Facebook Wav2Vec2 model was pretrained on 16 kHz audio.
I'm not sure whether what saturates the RAM is the size of the file or the length of the audio. In any case, I think the solution is to cut it into parts of 5, 10, or 30 seconds each (so it would be convenient to have a function with a parameter to adjust that).
input_audio_file = "CLASE 9 HOMBRE MAQUINA.mp3"
speech, rate = librosa.load(input_audio_file, sr=16000)
Finally, the audio processing itself, and this is where my RAM gets clogged. I need some help creating a loop: if the input audio is longer than X seconds, the algorithm should split the original audio into n parts of X seconds each (the last part may be shorter, as it has fewer than X seconds left).
Here I tokenize the inputs (in this case the complete audio, but I think it would be best to pass audio fragments of X seconds each) and build the PyTorch tensors that hold the audio data.
Here is the processing code:
input_values = tokenizer(speech, return_tensors='pt').input_values
# Store logits (non-normalized predictions)
logits = model(input_values).logits
# Store predicted ids
predicted_ids = torch.argmax(logits, dim=-1)
# Decode the predicted ids to generate text
transcriptions = tokenizer.decode(predicted_ids[0])
# And finally I print the text after this speech-to-text process
print(transcriptions)
Here is where the RAM saturates and causes the connection to be closed (I think if I tried this code with this audio file on my PC it would just freeze).
I think that for this question the Wav2Vec2 operation can be taken as context for how the program is going to work; what I need is how to fragment the audio into parts, send it part by part to the processing code (perhaps it is convenient to place that code inside a function called from the loop), and print the converted text line by line in the console.
I have been able to determine the length of an audio file in seconds with librosa, to decide whether to split it or not:
speech, rate = librosa.load(input_audio_file)
x = 10  # chunk length threshold in seconds
len_data = len(speech)  # number of samples in the numpy array
duration_in_seconds = int(len_data / rate)  # duration in whole seconds
print(duration_in_seconds)
if duration_in_seconds >= x:
    pass  # split the audio file into x-second chunks (see the sketch below)
else:
    pass  # don't split the audio file
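A rough sketch of one way the chunked loop could look, reusing the tokenizer and model loaded above; chunk_seconds is an illustrative parameter, and torch.no_grad() keeps gradient buffers from piling up in RAM:
import librosa
import torch

def transcribe_chunk(chunk):
    # chunk is a 1-D numpy array of 16 kHz samples
    input_values = tokenizer(chunk, return_tensors='pt').input_values
    with torch.no_grad():
        logits = model(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    return tokenizer.decode(predicted_ids[0])

chunk_seconds = 10  # adjust the chunk length here
speech, rate = librosa.load(input_audio_file, sr=16000)
samples_per_chunk = chunk_seconds * rate

for start in range(0, len(speech), samples_per_chunk):
    chunk = speech[start:start + samples_per_chunk]  # the last chunk may be shorter
    print(transcribe_chunk(chunk))                   # print each fragment's text as it is ready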
I basically have this audio file that is 16-bit PCM WAV, mono, at 44100 Hz, and I'm trying to convert it into a spectrogram. But I want a spectrogram of the audio every 20 ms (I'm trying this for speech recognition), and whenever I compare what I have to Audacity, it's really different. I'm kind of new to Python, so I was trying to base this off of my Java knowledge. Any help would be appreciated. I think I'm splitting the read samples incorrectly (what I did was split the array every 220 elements, since I believe audio data is just samples in the time domain, to get 20 ms of audio).
Here's the code I have right now:
import librosa.display
import matplotlib.pyplot as plt
import numpy

audioPath = 'C:\\Users\\pawar\\Desktop\\Resister.wav'
audioData, sampleRate = librosa.load(audioPath, sr=None)
print(sampleRate)
counter = 0
for i in range(0, len(audioData), 882):
    new = audioData[i: i + 882]          # 882 samples = 20 ms at 44100 Hz
    STFT = librosa.stft(new, n_fft=882)
    print(type(STFT))
    audioDatainDB = librosa.amplitude_to_db(abs(STFT))
    print(type(audioDatainDB))
    librosa.display.specshow(audioDatainDB, sr=sampleRate, x_axis='time', y_axis='hz')
    #plt.figure(figsize=(20,10))
    plt.show()
    counter += 1
    print("Your local debug print statement ", counter)
As for the values, I was playing around with them quite a bit trying to get it to work, without any luck ;/
Here's the output:
https://i.stack.imgur.com/EVntx.png
And here's what Audacity shows:
https://i.stack.imgur.com/GIGy8.png
I know it's not 20 ms in the Audacity one, but you can see the two don't look even remotely similar.
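One common way (not what the post above does) is to skip the manual splitting and let librosa.stft do the 20 ms framing via n_fft/hop_length, then plot a single spectrogram. A rough sketch, assuming the same 44100 Hz mono file:
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

audioData, sampleRate = librosa.load('C:\\Users\\pawar\\Desktop\\Resister.wav', sr=None)

frame_len = int(0.020 * sampleRate)   # 20 ms window: 882 samples at 44100 Hz
STFT = librosa.stft(audioData, n_fft=frame_len, hop_length=frame_len)
audioDataInDB = librosa.amplitude_to_db(np.abs(STFT))

librosa.display.specshow(audioDataInDB, sr=sampleRate, hop_length=frame_len,
                         x_axis='time', y_axis='hz')
plt.show()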
I'm writing a program to check for glitches in an audio signal recorded by a computer. After audio has been detected, I would like to check for glitches in the first 5 seconds of data (corresponding to 220500 samples at a sampling rate of 44.1 kHz), move on to the next 5 seconds of data and check for glitches in that, then the next 5 seconds, etc. I have a while loop which begins after audio has been detected; it reads audio samples from a stream into an array until it has 220500 samples, after which it enters an if statement where it checks for glitches in those 220500 samples (and deletes all the elements in the array afterwards). My problem is that while this is happening, audio is still being recorded by the computer, but it is not being read from the stream into the array, so by the time I have exited the if statement and restarted the while loop I have missed a few seconds of audio data.
while 1:
    # little endian, signed short
    snd_data = array('h', stream.read(1500))
    if byteorder == 'big':
        snd_data.byteswap()
    r.extend(snd_data)
    if len(r) == 220500 or silent:
        r = trim(r)
        data = pack('<' + ('h' * len(r)), *r)
        data = np.fromstring(data, dtype=np.int16)
        # glitch detection carried out here...
I am using PyAudio for recording audio
p=pyaudio.PyAudio() # start the PyAudio class
stream=p.open(format=pyaudio.paInt16,channels=1,rate=RATE,input=True,input_device_index = 1, output_device_index = 6,frames_per_buffer=1500)
I would like to know is there any way for me to continue reading from the audio stream into the array whilst carrying out the glitch detection in the if statement? Or if not, is there any other way I could go about this?
Your answer is multi-threading, not multiprocessing. By using Python's threading package and its signalling features, you can achieve what you want: one thread keeps reading from the stream while another checks the collected samples for glitches.
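A minimal producer/consumer sketch of that idea, assuming the same PyAudio settings as in the question; detect_glitches stands in for the existing glitch-checking code:
import threading
import queue
import numpy as np
import pyaudio

RATE = 44100
CHUNK = 1500
BLOCK = 220500   # 5 seconds at 44.1 kHz

audio_queue = queue.Queue()

def reader(stream):
    # producer thread: keep pulling from the stream so no audio is dropped
    while True:
        audio_queue.put(stream.read(CHUNK))

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                input=True, input_device_index=1, frames_per_buffer=CHUNK)

threading.Thread(target=reader, args=(stream,), daemon=True).start()

samples = np.empty(0, dtype=np.int16)
while True:
    # consumer: assemble 5-second blocks and analyse them while the reader keeps recording
    samples = np.concatenate([samples, np.frombuffer(audio_queue.get(), dtype=np.int16)])
    if len(samples) >= BLOCK:
        block, samples = samples[:BLOCK], samples[BLOCK:]
        # detect_glitches(block)  # placeholder for the glitch detection from the question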
I'm experimenting with pydub, which I like very much, however I am having a problem when splitting/joining an mp3 file.
I need to generate a series of small snippets of audio on the server, which will be sent in sequence to a web browser and played via an <audio/> element. I need the audio playback to be 'seamless' with no audible joins between the separate pieces. At the moment however, the joins between the separate bits of audio are quite obvious, sometimes there is a short silence and sometimes a strange audio glitch.
In my proof of concept code I have taken a single large mp3 and split it into 1-second chunks as follows:
from pydub import AudioSegment
import StringIO  # Python 2; use io.BytesIO on Python 3

song = AudioSegment.from_mp3('my.mp3')
song_pos = 0
while song_pos < 100:
    p1 = song_pos * 1000
    p2 = p1 + 1000
    segment = song[p1:p2]  # 1 second of audio
    output = StringIO.StringIO()
    segment.export(output, format="mp3")
    client_data = output.getvalue()  # send this to client
    song_pos += 1
The client_data values are streamed to the browser over a long-lived http connection:
socket.send("HTTP/1.1 200 OK\r\nConnection: Keep-Alive\r\nContent-Type: audio/mp3\r\n\r\n")
and then for each new chunk of audio
socket.send(client_data)
Can anyone explain the glitches that I am hearing, and suggest a way to eliminate them?
Upgrading my comment to an answer:
The primary issue is that MP3 codecs used by ffmpeg add silence to the end of the encoded audio (and your approach is producing multiple individual audio files).
If possible, use a lossless format like WAV and then reduce the file size with gzip or similar. You may also be able to use lossless audio compression (for example, FLAC), but that probably depends on how the encoder works.
I don't have a conclusive explanation for the audible artifacts you're hearing, but it could be that you're splitting the audio at a point where the signal is non-zero. If a sound begins at a sample with a value of 100 (for example), that sounds like a digital pop. The MP3 compression may also alter the sound, especially at lower bit rates. If this is the issue, a 1 ms fade-in will eliminate the pop without a noticeable audible "fade" (though it may introduce other artifacts); a longer fade-in (like 20 or 50 ms) would avoid strange frequency-domain artifacts but would introduce a noticeable "fade in".
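If you want to try the fade-in suggestion, pydub has a fade_in method on AudioSegment; a small sketch using the question's 1-second slices (durations are in milliseconds):
from pydub import AudioSegment

song = AudioSegment.from_mp3('my.mp3')
segment = song[0:1000].fade_in(1)    # 1 ms fade-in to soften a non-zero starting sample
# longer fade, audible but smoother in the frequency domain:
# segment = song[0:1000].fade_in(20)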
If you're willing to do a little more (coding) work, you can search for a "zero crossing" (basically, a place where the signal is at a zero point naturally) and split the audio there.
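Here is a rough sketch of what that zero-crossing search could look like with pydub and numpy (note that pydub slices in whole milliseconds, so the cut only lands near the crossing):
import numpy as np
from pydub import AudioSegment

def nearest_zero_crossing_ms(seg, approx_ms):
    # find a point near approx_ms where the (mono-mixed) signal crosses zero
    mono = seg.set_channels(1)
    samples = np.array(mono.get_array_of_samples())
    target = int(approx_ms * mono.frame_rate / 1000)
    crossings = np.where(np.diff(np.sign(samples)) != 0)[0]
    if len(crossings) == 0:
        return approx_ms                       # no crossing found, fall back to the requested point
    nearest = crossings[np.argmin(np.abs(crossings - target))]
    return int(nearest * 1000 / mono.frame_rate)

song = AudioSegment.from_mp3('my.mp3')
cut = nearest_zero_crossing_ms(song, 1000)     # split near the 1-second mark
first_piece, rest = song[:cut], song[cut:]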
Probably the best approach if it's possible:
Encode the entire signal as a single, compressed file, and send the bytes (of that one file) down to the client in chunks for playback as a single stream. If you use constant bitrate mp3 encoding (CBR) you can send almost perfectly 1-second-long chunks just by counting bytes. e.g., with 256 kbps CBR, each second of audio is 32 KB, so just send 32 KB at a time.
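A rough sketch of that single-stream approach, assuming the socket from the question; the filename and bitrate are placeholders:
# 256 kbps CBR = 32 KB of mp3 data per second of audio
CHUNK_BYTES = 32 * 1024

socket.send("HTTP/1.1 200 OK\r\nConnection: Keep-Alive\r\nContent-Type: audio/mp3\r\n\r\n")

with open('my_cbr_256k.mp3', 'rb') as f:       # hypothetical CBR-encoded file
    while True:
        chunk = f.read(CHUNK_BYTES)            # roughly 1 second of audio per send
        if not chunk:
            break
        socket.send(chunk)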
So, I could be totally wrong (I don't usually mess with audio files), but it could be an indexing issue. Try:
p2 = p1 + 1001
but you may need to invert the concatenation process for it to work, unless you add an extra millisecond on the end.
The only other thing I can think it could be is an artifact introduced into the stream when you convert the bytes to a string. Try using the AudioSegment().raw_data property for a bytes representation of the audio.
Sound is a waveform, and you are connecting two waves that are out of phase with one another, so you get a step discontinuity, and that makes the pop.
I'm unfamiliar with this software but codifying Nils Werner's suggestions, you might try:
from pydub import AudioSegment
import StringIO  # Python 2; use io.BytesIO on Python 3

song = AudioSegment.from_mp3('my.mp3')
song_pos = 0
# begin with a short blank segment (it must be at least as long as the crossfade)
segment = AudioSegment.silent(duration=50)
# append all your pieces to it
while song_pos < 100:
    p1 = song_pos * 1000
    p2 = p1 + 1000
    # append an item to your segment with several milliseconds of crossfade
    # (append returns a new AudioSegment, so reassign it)
    segment = segment.append(song[p1:p2], crossfade=50)
    song_pos += 1
# then pass it on to your client outside of your loop
output = StringIO.StringIO()
segment.export(output, format="mp3")
client_data = output.getvalue()  # send this to client
Depending upon how low or high the frequency of what you're joining is, you'll need to adjust the crossfade time to blend; lower frequencies will require a longer fade.
I'm trying to write a WAV upload function for my web app. The front-end portion seems to be working great. The problem is my backend (Python). When it receives the binary data, I'm not sure how to write it to a file. I tried using the basic write function, and the sound is corrupt... it sounds like gobbledygook. Is there a special way to write WAV files in Python?
Here is my backend... Not really much to it.
form = cgi.FieldStorage()
fileData = str(form.getvalue('data'))
with open("audio", 'w') as file:
file.write(fileData)
I even tried...
with open("audio", 'wb') as file:
file.write(fileData)
I am using aplay to play the sound, and I noticed that all the properties are messed up as well.
Before:
Signed 16 bit Little Endian, Rate 44100 Hz, Stereo
After upload:
Unsigned 8 bit, Rate 8000 Hz, Mono
Perhaps the wave module might help?
import wave
import struct
import numpy as np

rate = 44100

def sine_samples(freq, dur):
    # Get (sample rate * duration) samples on the X axis (between freq
    # oscillations of 2*pi)
    X = (2 * np.pi * freq / rate) * np.arange(rate * dur)
    # Get sine values for these X axis samples (as integers)
    S = (32767 * np.sin(X)).astype(int)
    # Pack integers as signed "short" integers (-32767 to 32767)
    as_packed_bytes = map(lambda v: struct.pack('h', v), S)
    return as_packed_bytes

def output_wave(path, frames):
    # Python 3.x allows the use of the with statement:
    # with wave.open(path, 'w') as output:
    #     # Set parameters for output WAV file
    #     output.setparams((1, 2, rate, 0, 'NONE', 'not compressed'))
    #     output.writeframes(frames)
    output = wave.open(path, 'w')
    # (nchannels, sampwidth, framerate, nframes, comptype, compname):
    # 1 channel (the generated data is mono), 2 bytes per sample
    output.setparams((1, 2, rate, 0, 'NONE', 'not compressed'))
    output.writeframes(frames)
    output.close()

def output_sound(path, freq, dur):
    # join the packed bytes into a single bytes frame
    frames = b''.join(sine_samples(freq, dur))
    # output frames to file
    output_wave(path, frames)

output_sound('sine440.wav', 440, 2)
EDIT:
I think in your case, you might only need:
# this assumes fileData is a sequence of integer sample values
packedData = map(lambda v: struct.pack('h', v), fileData)
frames = b''.join(packedData)
output_wave('example.wav', frames)
In this case, you just need to know the sampling rate. Check the wave module for information on the other output file parameters (i.e. the arguments to the setparams method).
The code I pasted will write a wav file as long as the data isn't corrupt. It was not necessary to use the wave module.
with open("audio", 'w') as file:
file.write(fileData)
I was originally reading the file in JavaScript with FileAPI.readAsBinaryString. I changed this to FileAPI.readAsDataURL, and then decoded it in Python using base64.decode(). Once I decoded it, I was able to just write the data to a file, and the .wav file was in perfect condition.
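For completeness, a rough sketch of that decoding step on the Python side; the 'data' field name and the data-URL prefix handling are assumptions about the front end:
import base64
import cgi

form = cgi.FieldStorage()
data_url = form.getvalue('data')            # e.g. "data:audio/wav;base64,UklGR..."
b64_payload = data_url.split(',', 1)[1]     # drop the "data:...;base64," prefix
wav_bytes = base64.b64decode(b64_payload)

with open("audio.wav", 'wb') as f:          # binary mode keeps the WAV bytes intact
    f.write(wav_bytes)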