I'm doing speech recognition and denoising. To feed the data to my model I need to resample it and make it 2 channels, but I don't know the optimal resampling rate (resr) for each sound. When I use a fixed resampling rate like 20000 or 16000, sometimes it works and sometimes it makes the pitch wrong or slows the audio down. How does resampling work in this case? Do I need an optimizer?
Also, what can I do if I have a phone call where one person's voice is so quiet that it gets recognized as noise?
This is my code:
import torch
import torchaudio

num_channels = sig.shape[0]
# Resample first channel
resig = torchaudio.transforms.Resample(sr, resr)(sig[:1, :])
print(resig.shape)
if num_channels > 1:
    # Resample the second channel and merge both channels
    retwo = torchaudio.transforms.Resample(sr, resr)(sig[1:, :])
    resig = torch.cat([resig, retwo])
I don't know the optimized resampling rate for each sound
Sample rate is not a parameter you tune for each audio file; rather, you should use the same sample rate that the speech recognition model was trained with.
sometimes it works and sometimes it makes the pitch wrong or makes it slow.
Resampling, when done properly, does not alter pitch or speed. My guess is that you are saving the resulting data with the wrong sample rate. The sample rate is not an arbitrary number you can pick; you have to choose one that conforms to the system you are working with.
Having said that, the proper way to do resampling, regardless of the number of channels, is to simply pass the waveform to the torchaudio.functional.resample function along with the original and target sample rates. The function processes multiple channels at the same time, so there is no need to run resampling separately for each channel.
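For example, reusing sig, sr and resr from the question (a minimal sketch; it assumes sig is a (channels, frames) tensor):

import torchaudio

# Resamples every channel of sig from sr to resr in a single call
resig = torchaudio.functional.resample(sig, orig_freq=sr, new_freq=resr)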
Then, if you know the sample rate of the input audio beforehand and all the audio you process has the same sample rate, using torchaudio.transforms.Resample will make the process faster because it caches the convolution kernel used for resampling.
resampler = torchaudio.transforms.Resample(original_sample_rate, target_sample_rate)
for sig in signals:
    resig = resampler(sig)
    # process the resulting resampled signal
Related
I'm building a simple Python application that involves altering the speed of an audio track.
(I acknowledge that changing the frame rate of audio also makes the pitch sound different, and I do not care about the pitch being altered.)
I have tried the solution from abhi krishnan using pydub, which looks like this.
from pydub import AudioSegment

sound = AudioSegment.from_file(…)

def speed_change(sound, speed=1.0):
    # Manually override the frame_rate. This tells the computer how many
    # samples to play per second
    sound_with_altered_frame_rate = sound._spawn(sound.raw_data, overrides={
        "frame_rate": int(sound.frame_rate * speed)
    })
    # convert the sound with altered frame rate to a standard frame rate
    # so that regular playback programs will work right. They often only
    # know how to play audio at standard frame rate (like 44.1k)
    return sound_with_altered_frame_rate.set_frame_rate(sound.frame_rate)
However, the audio with changed speed sounds distorted or crackly, which is not the case when doing the same thing in Audacity, and I hope to find a way to reproduce in Python how Audacity (or other digital audio editors) changes the speed of audio tracks.
I presume that the quality loss is caused by the original audio having a low frame rate (8 kHz), and that .set_frame_rate(sound.frame_rate) tries to sample points of the speed-altered audio at that original, low frame rate. Simple attempts at setting the frame rate of the original audio, of the speed-altered audio, or of the exported audio didn't work out.
Is there a way, in Pydub or another Python module, to perform this task the way Audacity does?
Assuming what you want to do is play audio back at, say, 1.5x the speed of the original: this is equivalent to resampling the audio down to 2/3 of its original number of samples and pretending that the sampling rate hasn't changed. If this is what you are after, I suspect most DSP packages support it (search for "audio resampling" as the key phrase).
You can try scipy.signal.resample_poly()
from scipy.signal import resample_poly
import numpy as np
samples = np.array(sound.get_array_of_samples())  # pydub audio -> numpy sample array
dec_data = resample_poly(samples, up=2, down=3)
dec_data should have 2/3 as many samples as the original audio. If you play dec_data at the sound's original sampling rate, you should get a sped-up version. The downside of resample_poly is that it needs a rational factor, and a large numerator or denominator makes the output less ideal. You can try scipy's resample function or look for other packages that support audio resampling.
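If the target speed factor isn't a neat ratio, one option (a sketch reusing the samples array from above; the 1.5x factor and the limit_denominator bound are arbitrary choices) is to approximate the factor with a small fraction first:

from fractions import Fraction
from scipy.signal import resample_poly

speed = 1.5                                          # example speed-up factor
ratio = Fraction(1 / speed).limit_denominator(100)   # e.g. 2/3 for a 1.5x speed-up
dec_data = resample_poly(samples, up=ratio.numerator, down=ratio.denominator)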
I'm making a Raspberry Pi bat detector using a USB-powered ultrasonic microphone. I want to be able to record bats while excluding insects and other non-bat noises. Recording needs to be sound-triggered to avoid filling the SD card too quickly and to aid with analysis. This website explains how to do this with SoX:
rec -c1 -r 192000 record.wav sinc 10k silence 1 0.1 1% trim 0 5
This records for 5 seconds after a trigger sound of at least 0.1 seconds and includes a 10kHz high pass filter. This is a good start, but what I'd really like is an advanced filter that excludes crickets and other non-bat noises. Insect and bat calls overlap in frequency so a high pass or band filter won't do.
The Elekon Batlogger does this with a period trigger that analyses zero crossings. From the Batlogger website:
The difference in sound production of bats (vocal cords) and insects
(stridulation) affects the period continuity. The period trigger takes
advantage of this:
The trigger fires when ProdVal and DivVal are lower than the set
limits, i.e. when the values are within the yellow range.
(The values given are the defaults):
ProdVal = 8, higher values trigger more easily
DivVal = 20, higher values trigger more easily
Translated text from the image:
Bat: Tonal signal
Period constant => zero crossings / time = stable
Insects: scratching
Period constant => zero crossings / time = differs
MN => mean value of the number of periods per measurement interval
SD => standard deviation of the number of periods
Higher values trigger better even at low frequencies (also insects!)
And vice versa
Is there a way to implement this (or something to the same effect) in Raspberry Pi OS? The language I'm most familiar with is R. Based on answers to this question it seems like R would be suitable for this problem, although if R isn't the best choice then I'm open to other suggestions.
I'd really appreciate some working code for recording audio and filtering as described above. My desired output is 5 second files that contain bat calls, not insects or noise. Needs to be efficient in terms of CPU / power use and needs to work on-the-fly.
Example recordings of bats and insects here.
UPDATE:
I've got a basic sound-activated script working in Python (based on this answer) but I'm not sure how to include an advanced filter in this:
import pyaudio
import wave
from array import array
import time

FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 44100
CHUNK = 1024
RECORD_SECONDS = 5

audio = pyaudio.PyAudio()
stream = audio.open(format=FORMAT, channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)

nighttime = True  # I will expand this later

while nighttime:
    data = stream.read(CHUNK)
    data_chunk = array('h', data)
    vol = max(data_chunk)
    if vol >= 3000:
        print("recording triggered")
        frames = []
        for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
            data = stream.read(CHUNK)
            frames.append(data)
        print("recording saved")
        # write to file
        words = ["RECORDING-", time.strftime("%Y%m%d-%H%M%S"), ".wav"]
        FILE_NAME = "".join(words)
        wavfile = wave.open(FILE_NAME, 'wb')
        wavfile.setnchannels(CHANNELS)
        wavfile.setsampwidth(audio.get_sample_size(FORMAT))
        wavfile.setframerate(RATE)
        wavfile.writeframes(b''.join(frames))
        wavfile.close()
    # check if still nighttime
    nighttime = True  # I will expand this later

stream.stop_stream()
stream.close()
audio.terminate()
TL;DR
R should be capable of doing this in post-processing. If you want this done on a live audio recording/stream, I'd suggest you look for a different tool.
Longer Answer
R is capable of processing audio files through several packages (the most notable seems to be tuneR), but I'm fairly sure that will be limited to post-collection processing, i.e. analysing the files you have already collected rather than 'live' filtering of streaming audio input.
There are a couple of approaches you could take to 'live' filtering of the insect/unwanted sounds. One would be to record files as you have listed above, then write R code to process them (you could automate this on a schedule with cron, for example) and discard parts or files that don't match your criteria. If you are concerned about SD card space, you can also offload these files to another location after processing (e.g. upload them to another drive somewhere). You could make this a fairly short time frame (at the cost of CPU usage on the Pi) to get an 'almost-live' processing approach.
Another approach would be to look more at the sox documentation and see if there are options in there to achieve what you want based on streaming audio input, or see if there is another tool that you can stream input to, that will allow that sort of filtering.
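If you do go down the record-then-filter route, the period trigger described in the question can be roughly approximated in Python/numpy: measure the intervals between zero crossings in each recorded chunk and keep the chunk only when those intervals are stable. This is only an illustrative sketch; the minimum-crossings count and the spread threshold are made-up placeholders you would need to calibrate against real bat and insect recordings.

import numpy as np

def looks_tonal(chunk, max_rel_spread=0.2):
    """Rough period-trigger check: True when zero-crossing intervals are stable (bat-like)."""
    x = np.asarray(chunk, dtype=float)
    sign_changes = np.diff(np.signbit(x).astype(np.int8)) != 0
    crossings = np.flatnonzero(sign_changes)      # sample indices where the signal crosses zero
    if crossings.size < 10:
        return False                              # too few periods to judge
    periods = np.diff(crossings)                  # samples between successive zero crossings
    return np.std(periods) / np.mean(periods) < max_rel_spread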
Good luck!
I have a numpy array which is continuously growing in size, with a function adding data to it every so often. This array is actually sound data, which I would like to play not after the array is complete, but while it is still growing. Is there a way to do that using pyaudio? I have tried implementing a callback, but without success: it gives me choppy audio with delay.
You could perhaps intercept the event or pipeline that appends the data to your array.
To get rid of the choppiness you will need some kind of intermediate buffer: imagine that data comes in at random intervals, sometimes several data points at once and sometimes nothing for a while, but with some average inflow rate over longer timescales. This is standard practice in streaming services to increase video quality.
Adjust the buffer size and this should eliminate the choppiness. This will of course introduce initial delay in playing the data, i.e. it won't be "live", but might be close to live with less choppiness.
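A minimal sketch of that idea (the sample rate, the float32 format, and the feed() helper are assumptions for illustration, not a drop-in fix): push incoming samples into a deque, and let the PyAudio callback drain it, padding with silence when the buffer runs dry.

import collections
import numpy as np
import pyaudio

RATE = 44100
audio_buffer = collections.deque()   # intermediate buffer of float32 samples

def feed(new_samples):
    """Called by whatever grows your array; appends the new samples to the buffer."""
    audio_buffer.extend(np.asarray(new_samples, dtype=np.float32))

def callback(in_data, frame_count, time_info, status):
    out = np.zeros(frame_count, dtype=np.float32)    # pad with silence if the buffer runs dry
    for i in range(min(frame_count, len(audio_buffer))):
        out[i] = audio_buffer.popleft()
    return (out.tobytes(), pyaudio.paContinue)

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paFloat32, channels=1, rate=RATE,
                output=True, stream_callback=callback)
stream.start_stream()
# ... keep calling feed(...) as new data arrives; when finished:
# stream.stop_stream(); stream.close(); p.terminate()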
I am about to run the following experiment, but am wondering whether the chosen approach is correct, hence the cry for help. The goal is to develop a general rough rule of thumb for the target video clip size. We have an input video and, using the MoviePy library, want to produce an output video that does not exceed a certain arbitrary size. The output is created by accelerating the input video:
VideoFileClip(fname).without_audio().fx(vfx.resize, 0.3).fx(vfx.speedx, 4)
So, if we set a target that the output should not exceed 200 kB, the question is: what speed rate should be applied for a video of size 3 MB, 10 MB, or 20 MB?
To do that, the idea was to take a sample of random videos, try out different speed rates, and find a line of best fit on the outputs. What is your take on that approach? I know it's a very coarse measure that renders different results depending on the input type, but still... Thanks
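For what it's worth, a rough sketch of that experiment might look like the following (the input file name, the speed factors, and the kB conversion are arbitrary choices; np.polyfit just fits a straight line through the measured sizes):

import os
import numpy as np
from moviepy.editor import VideoFileClip, vfx

fname = "input.mp4"            # example input video
speeds = [1, 2, 4, 8]
sizes = []
for speed in speeds:
    out = f"out_x{speed}.mp4"
    clip = VideoFileClip(fname).without_audio().fx(vfx.resize, 0.3).fx(vfx.speedx, speed)
    clip.write_videofile(out)
    sizes.append(os.path.getsize(out) / 1024)   # output size in kB

# Fit a rough rule of thumb: size ≈ a * (1 / speed) + b
a, b = np.polyfit([1 / s for s in speeds], sizes, 1)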
I want to write a program that automatically syncs unsynced subtitles. One of the solutions I thought of is to somehow algorithmically detect human speech and adjust the subtitles to it. The APIs I found (Google Speech API, Yandex SpeechKit) work with servers (which is not very convenient for me) and (probably) do a lot of unnecessary work determining what exactly has been said, while I only need to know that something has been said.
In other words, I want to give it the audio file and get something like this:
[(00:12, 00:26), (01:45, 01:49) ... , (25:21, 26:11)]
Is there a solution (preferably in python) that only finds human speech and runs on a local machine?
The technical term for what you are trying to do is Voice Activity Detection (VAD). There is a Python library called SPEAR that does it (among other things).
webrtcvad is a Python wrapper around Google's excellent WebRTC Voice Activity Detection (VAD) implementation--it does the best job of any VAD I've used as far as correctly classifying human speech, even with noisy audio.
To use it for your purpose, you would do something like this:
Convert the file to either 8 kHz or 16 kHz, 16-bit, mono format. This is required by the WebRTC code.
Create a VAD object: vad = webrtcvad.Vad()
Split the audio into 30 millisecond chunks.
Check each chunk to see if it contains speech: vad.is_speech(chunk, sample_rate)
The VAD output may be "noisy", and if it classifies a single 30 millisecond chunk of audio as speech you don't really want to output a time for that. You probably want to look over the past 0.3 seconds (or so) of audio and see if the majority of 30 millisecond chunks in that period are classified as speech. If they are, then you output the start time of that 0.3 second period as the beginning of speech. Then you do something similar to detect when the speech ends: Wait for a 0.3 second period of audio where the majority of 30 millisecond chunks are not classified as speech by the VAD--when that happens, output the end time as the end of speech.
You may have to tweak the timing a little bit to get good results for your purposes--maybe you decide that you need 0.2 seconds of audio where more than 30% of chunks are classified as speech by the VAD before you trigger, and 1.0 seconds of audio with more than 50% of chunks classified as non-speech before you de-trigger.
A ring buffer (collections.deque in Python) is a helpful data structure for keeping track of the last N chunks of audio and their classification.
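To make the chunking, VAD check, and ring-buffer smoothing concrete, here is a minimal sketch along those lines. The aggressiveness level, the 90% trigger/de-trigger ratios, and the read_16khz_mono_pcm helper (which stands in for however you load 16-bit mono PCM) are assumptions for illustration, not part of webrtcvad itself.

import collections
import webrtcvad

vad = webrtcvad.Vad(2)          # aggressiveness 0-3
sample_rate = 16000             # webrtcvad accepts 8000, 16000, 32000, or 48000 Hz
frame_ms = 30
frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 16-bit mono => 2 bytes per sample

# pcm is assumed to be raw 16-bit mono PCM at sample_rate (e.g. read from a converted WAV file)
pcm = read_16khz_mono_pcm("input.wav")  # hypothetical helper

ring = collections.deque(maxlen=10)     # last 10 frames ~= 0.3 s of audio
triggered = False
segments = []

for offset in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
    frame = pcm[offset:offset + frame_bytes]
    is_speech = vad.is_speech(frame, sample_rate)
    ring.append((offset, is_speech))
    voiced = sum(1 for _, s in ring if s)
    if not triggered and voiced > 0.9 * ring.maxlen:
        triggered = True
        start = ring[0][0]                               # speech began near the oldest buffered frame
    elif triggered and (ring.maxlen - voiced) > 0.9 * ring.maxlen:
        triggered = False
        segments.append((start, offset + frame_bytes))   # byte offsets of one speech region

# Convert byte offsets to seconds: offset / 2 / sample_rate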
You could run a window across your audio file and try to extract what fraction of the total signal power lies in the human vocal range (fundamental frequencies lie between 50 and 300 Hz). The following is to give intuition and is untested on real audio.
import numpy as np

def hasHumanVoice(X, threshold, F_sample, Low_cutoff=50, High_cutoff=300):
    """ Search for the presence of vocal frequencies in a real signal using the FFT.
    Inputs
    =======
    X: 1-D numpy array, the real time-domain audio signal (single-channel time series)
    Low_cutoff: float, frequency components below this frequency will not pass the filter (physical frequency in Hz)
    High_cutoff: float, frequency components above this frequency will not pass the filter (physical frequency in Hz)
    F_sample: float, the sampling frequency of the signal (physical frequency in Hz)
    threshold: has to be standardized once to say how much power must be present in real vocal signal frequencies.
    """
    M = X.size  # let M be the length of the time series
    # np.fft.rfft gives one complex coefficient per bin; bin k corresponds to frequency k * F_sample / M
    Spectrum = np.fft.rfft(X, n=M)
    [Low_cutoff, High_cutoff, F_sample] = map(float, [Low_cutoff, High_cutoff, F_sample])

    # Convert cutoff frequencies into integer bin indices on the spectrum
    [Low_point, High_point] = map(lambda F: int(F / F_sample * M), [Low_cutoff, High_cutoff])

    totalPower = np.sum(np.abs(Spectrum) ** 2)
    # Fraction of the total power that falls within the vocal band
    fractionPowerInSignal = np.sum(np.abs(Spectrum[Low_point:High_point]) ** 2) / totalPower

    if fractionPowerInSignal > threshold:
        return 1
    else:
        return 0

voiceVector = []
for window in fullAudio:  # Run a window of appropriate length across the audio file
    voiceVector.append(hasHumanVoice(window, threshold, samplingRate))