Difficulty performing silence removal with librosa.effects.trim command - python

For part of a project, I have the user say a word, which gets recorded. The silence around the word is then cut out, and a button plays back their word without the silence. I am using librosa's librosa.effects.trim function to achieve this.
For example:
import sounddevice as sd
import librosa
from playsound import playsound

def record_audio():
    global myrecording
    global yt
    playsound(beep1)  # beep1/beep2, seconds and fs are defined elsewhere
    myrecording = sd.rec(int(seconds * fs), samplerate=fs, channels=1)
    sd.wait()
    playsound(beep2)
    # trimming the audio
    yt, index = librosa.effects.trim(myrecording, top_db=60)
However, when I play the audio back, I can tell that it is not trimming the recording. The variable explorer shows that myrecording and yt are the same length. I can hear it when I play what is supposed to be the trimmed audio clip back as well. I don't get any error messages when this occurs either. Is there any way to get librosa to actually clip the audio? I have tried adjusting top_db and that did not fix it. Aside from that, I am not quite sure what I could be doing wrong.

For a real answer, you'd have to post a sample recording so that we could inspect what exactly is going on.
In lieu of that, I'd like to refer to this GitHub issue, where one of the main authors of librosa offers advice for a very similar issue.
In essence: You want to lower the top_db threshold and reduce frame_length and hop_length. E.g.:
yt, index = librosa.effects.trim(myrecording, top_db=50, frame_length=256, hop_length=64)
Decreasing hop_length effectively increases the resolution for trimming. Decreasing top_db makes the function less sensitive, i.e., low-level noise is also regarded as silence. Using a computer microphone, you probably have quite a bit of low-level background noise.
If all this does not help, you might want to consider using SoX, or its Python wrapper pysox. It also has a trim function.
Update: Look at the waveform of your audio. Does it have a spike somewhere at the beginning? Some crackling sound, perhaps? That will keep librosa from trimming correctly. Perhaps manually throwing away the first second (= fs samples) and then trimming solves the issue:
librosa.effects.trim(myrecording[fs:], top_db=50, frame_length=256, hop_length=64)

Related

How to get complete fundamental (f0) frequency extraction with python lib librosa.pyin?

I am running librosa.pyin on a speech audio clip, and it doesn't seem to be extracting all the fundamentals (f0) from the first part of the recording.
librosa documentation: https://librosa.org/doc/main/generated/librosa.pyin.html
sr = 22050
fmin = librosa.note_to_hz('C0')
fmax = librosa.note_to_hz('C7')
f0, voiced_flag, voiced_probs = librosa.pyin(y,
                                             fmin=fmin,
                                             fmax=fmax,
                                             pad_mode='constant',
                                             n_thresholds=10,
                                             max_transition_rate=100,
                                             sr=sr)
Raw audio:
Spectrogram with fundamental tones, onsets, and onset strength, but the first part doesn't have any fundamental tones extracted.
link to audio file: https://jasonmhead.com/wp-content/uploads/2022/12/quick_fox.wav
# o_env is the onset strength envelope, e.g. librosa.onset.onset_strength(y=y, sr=sr)
times = librosa.times_like(o_env, sr=sr)
onset_frames = librosa.onset.onset_detect(onset_envelope=o_env, sr=sr)
Another view with power spectrogram:
I tried compressing the audio, but that didn't seem to work.
Any suggestions on what parameters I can adjust, or audio pre-processing that can be done to have fundamental tones extracted from all words?
What type of things affect fundamental tone extraction success?
TL;DR It seems like it's all about parameter tweaking.
Here are some results that I've got playing with the example, it would be better to open it in a separate tab:
The bottom plot shows a phonetic transcription (well, kinda) of the example file. Some conclusions I've made to myself:
There are some words/parts of words that are difficult to hear: they have low energy, and listened to alone they don't sound like words; they only do when coupled with nearby segments ("the" is very short and sounds more like "z").
Some words are divided into parts (e.g. "fo"-"x").
I don't really know what the F0 frequency should be when someone pronounces "x". I'm not even sure that there is any difference in pronunciation between people (otherwise how do cats know that we are calling them all over the world).
Two-seconds period is a pretty short amount of time.
Some experiments:
If we want to see a smooth F0 graph, n_thresholds=1 will do it, but it's a bad idea: in the voiced_flag part of the graphs, we see that with n_thresholds=1 every frame is declared voiced, counting every frequency change as activity.
Changing the sample rate affects the ability to retrieve F0 (in the rightmost graph the sample rate was halved); as mentioned, n_thresholds=1 doesn't count, but we also see that n_thresholds=100 (the default value for pyin) doesn't produce any F0 at all.
The top-left (max_transition_rate=200) and middle (max_transition_rate=100) graphs show the extracted F0 for n_thresholds=2 and n_thresholds=100. It actually degrades pretty fast: n_thresholds=3 looks almost the same as n_thresholds=100. I find the lower part, the voiced_flag decision plot, most useful when combined with the phonetic transcript. In the middle graph, the default parameters recognise "qui", "jum", "over", "la". If we want F0 for other phonemes, n_thresholds=2 should do the work.
Setting n_thresholds=3 or higher gives F0s in the same range. Increasing max_transition_rate adds noise and a reluctance to declare that a voiced segment is over.
That's my thoughts. Hope it helps.

How do I change the speed of an audio file in Python, like in Audacity, without quality loss?

I'm building a simple Python application that involves altering the speed of an audio track.
(I acknowledge that changing the frame rate of audio also makes the pitch different, and I do not care about the pitch being altered.)
I have tried the solution from abhi krishnan using pydub, which looks like this:
from pydub import AudioSegment

sound = AudioSegment.from_file(…)

def speed_change(sound, speed=1.0):
    # Manually override the frame_rate. This tells the computer how many
    # samples to play per second
    sound_with_altered_frame_rate = sound._spawn(sound.raw_data, overrides={
        "frame_rate": int(sound.frame_rate * speed)
    })
    # Convert the sound with altered frame rate to a standard frame rate
    # so that regular playback programs will work right. They often only
    # know how to play audio at standard frame rates (like 44.1 kHz).
    return sound_with_altered_frame_rate.set_frame_rate(sound.frame_rate)
However, the audio with changed speed sounds distorted, or crackled, in a way that would not be heard when using Audacity to do the same, and I hope to find a way to reproduce in Python how Audacity (or other digital audio editors) changes the speed of audio tracks.
I presume that the quality loss is caused by the original audio having a low frame rate (8 kHz), and that .set_frame_rate(sound.frame_rate) tries to sample points of the speed-altered audio at that original, low frame rate. Simple attempts at setting the frame rate of the original audio, of the one with the altered frame rate, and of the one to be exported didn't work out.
Is there a way, in pydub or other Python modules, to perform the task the same way Audacity does?
Assuming what you want to do is play audio back at, say, x1.5 the speed of the original: this is equivalent to resampling the audio down to 2/3 of the samples and pretending that the sampling rate hasn't changed. Assuming this is what you are after, I suspect most DSP packages support it (search for "audio resampling" as the key phrase).
You can try scipy.signal.resample_poly():
from scipy.signal import resample_poly
import numpy as np

# raw_data is a bytestring; convert it to a numeric array first
# (assuming 16-bit samples, i.e. sound.sample_width == 2)
samples = np.frombuffer(sound.raw_data, dtype=np.int16)
dec_data = resample_poly(samples, up=2, down=3)
dec_data should have 2/3 of the number of samples of the original raw_data. If you play the dec_data samples at the sound's original sampling rate, you should get a sped-up version. The downside of resample_poly is that you need a rational factor, and a large numerator or denominator makes the output less ideal. You can also try scipy's resample function, or seek other packages that support audio resampling.
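A self-contained version of the idea, assuming 16-bit mono audio (the tone and rates here are invented for illustration):

```python
import numpy as np
from scipy.signal import resample_poly

fs = 8000  # original sampling rate
t = np.arange(fs) / fs
samples = (0.5 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)
raw = samples.tobytes()  # same layout as pydub's sound.raw_data

# bytes -> numeric array, then keep 2/3 of the samples
x = np.frombuffer(raw, dtype=np.int16).astype(np.float64)
dec_data = resample_poly(x, up=2, down=3)

# playing dec_data back at fs now sounds 1.5x faster (and higher-pitched)
print(len(x), len(dec_data))
```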

Pyaudio Output Callback Stream delivers very choppy audio

I am writing a program which takes audio input via PyAudio, analyses the input via FFT, extracts the dominant frequency and then generates a sine wave as a bytestring based upon the obtained frequency. All of those steps work (kinda).
However, when it comes to playing the sine wave via a PyAudio callback stream, the signal gets very choppy. If I play the audio in a blocking stream, however, the output sounds fine.
This is the function that generates the sine wave (note that I left out quite a bit of surrounding code here):
self.waveData = b''  # needs to be reset here, otherwise its content keeps growing
for i in range(0, int(fs * 0.01)):  # make just a small sample of 0.01 s duration
    pcmValue = int(volume * np.sin(2 * np.pi * frequency * i / fs))  # i replaces self.x
    self.waveData += struct.pack('h', pcmValue)
# repeat the chunk duration/0.01 times to cover the full duration
self.waveData += int(duration / 0.01) * self.waveData[0:len(self.waveData)]
So I don't think the problem lies within here.
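As an aside, the sample-by-sample loop above can be written as one vectorized NumPy step, which is less error-prone (using the same fs, frequency, volume and duration names; the values below are placeholders):

```python
import numpy as np

fs = 44100
frequency = 440.0
volume = 16000   # amplitude in int16 range
duration = 1.0   # seconds

# Generate all samples for the whole duration in a single step
n = np.arange(int(fs * duration))
pcm = (volume * np.sin(2 * np.pi * frequency * n / fs)).astype(np.int16)
waveData = pcm.tobytes()  # same byte layout as repeated struct.pack('h', ...)
```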
This is the stream and callback function (they both sit in the same class, obviously):
def __init__(self):
    self.stream2 = p.open(format=pyaudio.paInt16, channels=1, rate=fs,
                          output=True, output_device_index=5,
                          frames_per_buffer=2**13,
                          stream_callback=self.actuallyplaystream)

def actuallyplaystream(self, in_data, frame_count, time_info, status):
    return (self.waveData, pyaudio.paContinue)
In the rest of the program I just use
self.stream2.start_stream()
and .stop_stream() to start and stop the playback. As said, the playback in and of itself works. However, the choppiness occurs at intervals tied to the frames_per_buffer=2**13 setting used when creating the stream: when I increase the buffer chunk size mentioned above, the intervals between each 'chop' decrease, and they increase when I decrease the chunk size.
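For what it's worth, PyAudio's callback contract is to return exactly frame_count frames on every call, so returning the whole bytestring each time is one plausible source of chop. A minimal sketch of serving a precomputed buffer in frame_count-sized slices (the class and variable names here are illustrative, not from the program above):

```python
class CallbackSource:
    """Serve a precomputed bytestring to a PyAudio callback in exact slices."""

    def __init__(self, wave_data, bytes_per_frame=2):  # mono 16-bit -> 2 bytes
        self.wave_data = wave_data
        self.bytes_per_frame = bytes_per_frame
        self.pos = 0

    def callback(self, in_data, frame_count, time_info, status):
        nbytes = frame_count * self.bytes_per_frame
        out = bytearray()
        while len(out) < nbytes:  # wrap around so the tone loops seamlessly
            take = self.wave_data[self.pos:self.pos + nbytes - len(out)]
            out += take
            self.pos = (self.pos + len(take)) % len(self.wave_data)
        # with real PyAudio: return (bytes(out), pyaudio.paContinue)
        return bytes(out)
```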
This seems like a very easy problem to solve; however, I can't seem to find the answer in any SO post or even the PyAudio docs. I would greatly appreciate any help, or even just a hint towards solving this issue.
Please don't hesitate to ask if you are confused by the code snippets. I will figure out a way to link to all my code if that's needed.
Thank You
(Also: if this question is inappropriate, I will immediately remove it)

Detecting a noise in an audio stream

My goal is to be able to detect a specific noise that comes through the speakers of a PC using Python. That means the following, in pseudo code:
Sound is being played out of the speakers by applications such as games, for example.
My "audio to detect" sound happens, and I want to detect that and take an action.
The specific sound I want to detect can be found here.
If I break that down, I believe I need two things:
A way to sample the audio that is being streamed to an audio device
I actually have this bit working, with the code found here: https://gist.github.com/renegadeandy/8424327f471f52a1b656bfb1c4ddf3e8 -- it is based on the sounddevice plot example, which I combine with an audio loopback device. This allows my code to receive a callback with the data that is played to the speakers.
A way to compare each sample with my "audio to detect" sound file.
The detection does not need to be exact, just close. There will be lots of other noises happening at the same time, so it's more about being able to detect the footprint of the "audio to detect" within an audio stream containing a variety of sounds.
Having investigated this, I found the technologies mentioned in this post on SO and also this interesting article on Chromaprint. The Chromaprint article uses fpcalc to generate fingerprints, but because my "audio to detect" is around 1-2 seconds, fpcalc can't generate the fingerprint. I need something which works across shorter timespans.
Can somebody help me with problem #2 as detailed above?
How should I attempt this comparison (ideally with a little example), based upon my sampling using sounddevice in the audio_callback function?
Many thanks in advance.
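Not a full answer, but one way to attempt comparison #2, sketched under the assumption that matching smoothed amplitude envelopes is "close enough": cross-correlate the envelope of the incoming audio with the envelope of the target clip (all names and numbers below are illustrative):

```python
import numpy as np
from scipy.signal import correlate

def envelope(x, win=256):
    # smoothed absolute amplitude: a very cheap "footprint" of a sound
    return np.convolve(np.abs(x), np.ones(win) / win, mode='same')

def best_match(stream, template):
    a = envelope(stream)
    b = envelope(template)
    corr = correlate(a - a.mean(), b - b.mean(), mode='valid')
    return int(corr.argmax()), float(corr.max())

# Toy check: bury a half-second 880 Hz beep at sample 6000 in quiet noise
sr = 8000
rng = np.random.default_rng(1)
template = np.sin(2 * np.pi * 880 * np.arange(sr // 2) / sr)
stream = 0.05 * rng.standard_normal(20000)
stream[6000:6000 + len(template)] += template
offset, score = best_match(stream, template)
print(offset)  # close to 6000
```

In practice you would run this over the chunks arriving in audio_callback and fire the action when the peak correlation exceeds a threshold calibrated on real recordings; for more robustness against overlapping sounds, spectrogram-based fingerprints (as in the Chromaprint article) are the natural next step.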

Making specific frequency (ranges) louder

I want to make certain frequencies in a sequence of audio data louder. I have already analyzed the data using FFT and have gotten a value for each audio frequency in the data. I just have no idea how I can use the frequencies to manipulate the sound data itself.
From what I understand so far, data is encoded in such a way that the difference between every two consecutive readings determines the audio amplitude at that time instant. So making the audio louder at that time instant would involve making the difference between the two consecutive readings greater. But how do I know which time instants are involved with which frequency? I don't know when the frequency starts appearing.
(I am using Python, specifically PyAudio for getting the audio data and Num/SciPy for the FFT, though this probably shouldn't be relevant.)
You are looking for a graphic equalizer. Some quick Googling turned up rbeq, which seems to be a Rhythmbox plugin written in Python. I haven't looked through the code to see whether the actual EQ part is written in Python or just controls something in the host, but I recommend looking through its source.
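If you would rather stay with your existing FFT approach, the basic move is: transform, scale the bins in the band you want louder, and transform back. A minimal sketch (band edges and gain are arbitrary; a real equalizer would use per-band filters, e.g. from scipy.signal, rather than one FFT over the whole clip):

```python
import numpy as np

def boost_band(x, sr, f_lo, f_hi, gain=2.0):
    """Multiply the [f_lo, f_hi] frequency band of a real signal by gain."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1 / sr)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    spec[band] *= gain
    return np.fft.irfft(spec, n=len(x))

# Toy check: a 440 Hz tone plus a 2000 Hz tone; boost only around 440 Hz
sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 2000 * t)
y = boost_band(x, sr, 300, 600, gain=2.0)
```

Boosting a band this way answers the "which time instants" worry: you never touch individual readings directly; you scale the frequency component everywhere it occurs and let the inverse transform redistribute it over time.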
