I want to make certain frequencies in a sequence of audio data louder. I have already analyzed the data using FFT and have gotten a value for each audio frequency in the data. I just have no idea how I can use the frequencies to manipulate the sound data itself.
From what I understand so far, data is encoded in such a way that the difference between every two consecutive readings determines the audio amplitude at that time instant. So making the audio louder at that time instant would involve making the difference between the two consecutive readings greater. But how do I know which time instants are involved with which frequency? I don't know when the frequency starts appearing.
(I am using Python, specifically PyAudio for getting the audio data and Num/SciPy for the FFT, though this probably shouldn't be relevant.)
You are looking for a graphic equalizer. Some quick Googling turned up rbeq, which seems to be a plugin for Rhythmbox written in Python. I haven't looked through the code to see if the actual EQ part is written in Python or is just controlling something in the host, but I recommend looking through their source.
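If you want to stay with the FFT route from the question, the basic idea is to scale the bins that fall inside the band you care about and then inverse-transform. A minimal sketch with NumPy follows; the function name, band edges, and gain are made up for illustration, and a real equalizer would process the signal in short overlapping windows rather than one FFT over the whole clip:

import numpy as np

def boost_band(samples, sample_rate, f_low, f_high, gain=2.0):
    # samples is assumed to be a 1-D float array of the raw audio readings
    spectrum = np.fft.rfft(samples)
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    band = (freqs >= f_low) & (freqs <= f_high)
    spectrum[band] *= gain            # make those frequencies louder
    return np.fft.irfft(spectrum, n=len(samples))

# e.g. louder_chunk = boost_band(chunk, 44100, 200, 800, gain=1.5)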
I am running librosa.pyin on a speech audio clip, and it doesn't seem to be extracting all the fundamentals (f0) from the first part of the recording.
librosa documentation: https://librosa.org/doc/main/generated/librosa.pyin.html
sr = 22050
fmin = librosa.note_to_hz('C0')
fmax = librosa.note_to_hz('C7')
f0, voiced_flag, voiced_probs = librosa.pyin(y,
                                             fmin=fmin,
                                             fmax=fmax,
                                             pad_mode='constant',
                                             n_thresholds=10,
                                             max_transition_rate=100,
                                             sr=sr)
Raw audio:
Spectrogram with fundamental tones, onsets, and onset strength; the first part doesn't have any fundamental tones extracted.
link to audio file: https://jasonmhead.com/wp-content/uploads/2022/12/quick_fox.wav
o_env = librosa.onset.onset_strength(y=y, sr=sr)  # onset envelope used below
times = librosa.times_like(o_env, sr=sr)
onset_frames = librosa.onset.onset_detect(onset_envelope=o_env, sr=sr)
Another view with power spectrogram:
I tried compressing the audio, but that didn't seem to work.
Any suggestions on what parameters I can adjust, or audio pre-processing that can be done to have fundamental tones extracted from all words?
What type of things affect fundamental tone extraction success?
TL;DR It seems like it's all about parameter tweaking.
Here are some results I got while playing with the example (it's better to open the image in a separate tab):
The bottom plot shows a phonetic transcription (well, kind of) of the example file. Some conclusions I've drawn for myself:
There are some words/parts of words that are difficult to hear: they have low energy, and when you listen to them in isolation they don't sound like words, only when coupled with nearby segments ("the" is very short and sounds more like "z").
Some words are divided into parts (e.g. "fo"-"x").
I don't really know what the F0 frequency should be when someone pronounces "x". I'm not even sure there is any difference in pronunciation between people (otherwise how would cats know that we are calling them all over the world?).
A two-second clip is a pretty short amount of time.
Some experiments:
If we want to see a smooth F0 graph, setting n_thresholds=1 will do it, but it's a bad idea: in the voiced_flag part of the graphs we see that with n_thresholds=1 it decides that every frame was voiced, counting every frequency change as voice activity.
Changing the sample rate affects the ability to retrieve F0 (in the rightmost graph the sample rate was halved): as mentioned before, n_thresholds=1 doesn't count, but we also see that n_thresholds=100 (the default value for pyin) doesn't produce any F0 at all.
The top left (max_transition_rate=200) and middle (max_transition_rate=100) graphs show the extracted F0 for n_thresholds=2 and n_thresholds=100. It actually degrades pretty fast: n_thresholds=3 already looks almost the same as n_thresholds=100. I find the lower part, the voiced_flag decision plot, particularly important when combined with the phonetic transcript. In the middle graph the default parameters recognise "qui", "jum", "over", "la"; if we want F0 for the other phonemes, n_thresholds=2 should do the work.
Setting n_thresholds=3 or higher gives F0s in the same range. Increasing max_transition_rate adds noise and a reluctance to declare that a voiced segment is over.
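As a concrete starting point, the tweaks above translate into a call roughly like this (assuming y and sr are loaded from the clip as in the question; the exact values are the knobs to experiment with):

f0, voiced_flag, voiced_probs = librosa.pyin(y,
                                             fmin=librosa.note_to_hz('C0'),
                                             fmax=librosa.note_to_hz('C7'),
                                             sr=sr,
                                             n_thresholds=2,           # lower -> more frames accepted as voiced
                                             max_transition_rate=100)  # higher -> slower to end a voiced segment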
Those are my thoughts. Hope it helps.
My goal is to be able to detect a specific noise that comes through the speakers of a PC using Python. That means the following, in pseudo code:
Sound is being played out of the speakers by applications such as games, for example.
My "audio to detect" sound happens, and I want to detect that and take an action.
The specific sound I want to detect can be found here.
If I break that down, I believe I need two things:
A way to sample the audio that is being streamed to an audio device
I actually have this bit working with the code found here: https://gist.github.com/renegadeandy/8424327f471f52a1b656bfb1c4ddf3e8 . It is based on the sounddevice example plot, which I combine with an audio loopback device. This allows my code to receive a callback with the data that is being played to the speakers (a rough sketch of this setup is shown just after this breakdown).
A way to compare each sample with my "audio to detect" sound file.
The detection does not need to be exact; it just needs to be close. For example, there will be lots of other noises happening at the same time, so it's more about being able to detect the footprint of the "audio to detect" within an audio stream containing a variety of sounds.
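For context, a stripped-down version of the capture side described in #1 looks roughly like this (the loopback device name is system-specific and the numbers are placeholders; the full version is in the gist linked above):

import numpy as np
import sounddevice as sd

LOOPBACK_DEVICE = "Stereo Mix"   # placeholder: the loopback/monitor device on your system
SAMPLE_RATE = 44100
BLOCK_SIZE = 2048

def audio_callback(indata, frames, time, status):
    # indata is a (frames, channels) float32 array of whatever is being played
    mono = np.mean(indata, axis=1)
    # ... hand `mono` to whatever comparison step ends up answering #2 ...
    print(f"block RMS: {np.sqrt(np.mean(mono ** 2)):.4f}")

with sd.InputStream(device=LOOPBACK_DEVICE,
                    channels=2,
                    samplerate=SAMPLE_RATE,
                    blocksize=BLOCK_SIZE,
                    callback=audio_callback):
    sd.sleep(10_000)  # capture for 10 seconds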
Having investigated this, I found technologies mentioned in this post on SO and also this interesting article on Chromaprint. The Chromaprint article uses fpcalc to generate fingerprints, but because my "audio to detect" is only around 1 - 2 seconds long, fpcalc can't generate a fingerprint. I need something that works across smaller timespans.
Can somebody help me with problem #2 as detailed above?
How should I attempt this comparison (ideally with a little example), based upon my sampling using sounddevice in the audio_callback function?
Many thanks in advance.
I have a large library of pre-recorded music notes (around 1,200), all of consistent amplitude.
I'm researching methods of layering two notes over each other so that it sounds like a chord where both notes are played at the same time.
Samples with different attack times:
As you can see, these samples have different peak amplitude points, which need to line up in order to sound like a human played chord.
Manually aligned attack points:
The 2nd image shows the attack points manually aligned by ear, but this is an unfeasible method for such a large data set where I wish to create many permutations of chord samples.
I'm considering a method whereby I identify the time of peak amplitude of two audio samples, and then align those two peak amplitude times when mixing the notes to create the chord. But I am unsure of how to go about such an implementation.
I'm thinking of using a Python mixing solution such as the one found here: Mixing two audio files together with python, with some tweaking to mix audio samples over each other.
I'm looking for ideas on how I can identify the times of peak amplitude in my audio samples, or if you have any thoughts on other ways this idea could be implemented I'd be very interested.
In case anyone is actually interested in this question, I have found a solution to my problem. It's a little convoluted, but it has yielded excellent results.
To find the time of peak amplitude of a sample, I found this thread: Finding the 'volume' of a .wav at a given time, where the top answer links to a Scala library called AudioFile, which provides a method to find the peak amplitude by going through a sample in frame buffer windows. However, this library required all files to be in .aiff format, so a second library of samples was created consisting of all the old .wav samples converted to .aiff.
After reducing the frame buffer window, I was able to determine in which frame the highest amplitude was found. Dividing this frame by the sample rate of the audio samples (which was known to be 48000), I was able to accurately find the time of peak amplitude. This information was used to create a file which stored both the name of the sample file, along with its time of peak amplitude.
Once this was accomplished, a Python script was written using the Pydub library (http://pydub.com/) which would pair up two samples and find the difference (t) between their times of peak amplitude. The sample with the lower time of peak amplitude would have silence of length (t) prepended to it from a .wav containing only silence.
These two samples were then overlaid onto each other to produce an accurately mixed chord!
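For anyone wanting to reproduce the alignment step in pure Python, here is a rough sketch of the same idea using only Pydub and NumPy: a simple argmax over the decoded samples stands in for the Scala AudioFile frame-buffer search, AudioSegment.silent stands in for the silence .wav, and the file names are hypothetical.

import numpy as np
from pydub import AudioSegment

def peak_time_ms(path):
    # time (in ms) of the highest-amplitude sample in the file
    seg = AudioSegment.from_file(path)
    samples = np.array(seg.get_array_of_samples())
    peak_index = int(np.argmax(np.abs(samples)))
    return 1000.0 * (peak_index // seg.channels) / seg.frame_rate

def mix_chord(path_a, path_b):
    a = AudioSegment.from_file(path_a)
    b = AudioSegment.from_file(path_b)
    t = peak_time_ms(path_a) - peak_time_ms(path_b)
    # pad whichever note peaks earlier so that the two peaks line up
    if t > 0:
        b = AudioSegment.silent(duration=t, frame_rate=b.frame_rate) + b
    else:
        a = AudioSegment.silent(duration=-t, frame_rate=a.frame_rate) + a
    # overlay onto the longer segment so nothing gets truncated
    longer, shorter = (a, b) if len(a) >= len(b) else (b, a)
    return longer.overlay(shorter)

# chord = mix_chord("note_C4.wav", "note_E4.wav")   # hypothetical file names
# chord.export("chord_C4_E4.wav", format="wav")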
I actually have a photodiode connected to my PC and do the capturing with Audacity.
I want to improve this by using an old RPi 1 as a dedicated test station. As a result, the shutter speed should appear on the console. I would prefer a Python solution for getting the signal and analysing it.
Can anyone give me some suggestions? I played around with oct2py, but I don't really understand how to calculate the time between the two peaks of the signal.
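One way to get that peak-to-peak time, sketched under the assumption that the photodiode trace has been exported to a WAV file (the file name and the height/distance thresholds below are placeholders that would need tuning for a real trace):

import numpy as np
from scipy.io import wavfile
from scipy.signal import find_peaks

sample_rate, signal = wavfile.read("shutter_capture.wav")   # hypothetical file name
if signal.ndim > 1:                      # mix down to mono if the capture is stereo
    signal = signal.mean(axis=1)
signal = np.abs(signal.astype(float))

# the two events should be the dominant peaks; thresholds need tuning
peaks, _ = find_peaks(signal, height=0.5 * signal.max(), distance=sample_rate // 100)

if len(peaks) >= 2:
    gap = (peaks[1] - peaks[0]) / sample_rate
    print(f"time between the first two peaks: {gap:.6f} s (~1/{1 / gap:.0f} s)")
else:
    print("fewer than two peaks found; adjust the thresholds")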
I have no expertise in sound analysis with Python, but this is what I found doing some internet research, as I am interested in this topic myself.
pyAudioAnalysis, for exactly this purpose
You can use pyAudioAnalysis, developed by Theodoros Giannakopoulos.
For your purposes, the function mtFileClassification() from audioSegmentation.py can be a good start. This function:
splits an audio signal into successive mid-term segments and extracts mid-term feature statistics from each of these segments, using mtFeatureExtraction() from audioFeatureExtraction.py
classifies each segment using a pre-trained supervised model
merges successive fixed-sized segments that share the same class label into larger segments
visualizes statistics regarding the results of the segmentation-classification process.
For instance
from pyAudioAnalysis import audioSegmentation as aS
[flagsInd, classesAll, acc, CM] = aS.mtFileClassification("data/scottish.wav","data/svmSM", "svm", True, 'data/scottish.segments')
Note that the last argument of this function is a .segment file. This is used as ground truth (if available) in order to estimate the overall performance of the classification-segmentation method. If this file does not exist, the performance measure is not calculated. These files are simple comma-separated files of the format: <start time in seconds>,<end time in seconds>,<label>. For example:
0.01,9.90,speech
9.90,10.70,silence
10.70,23.50,speech
23.50,184.30,music
184.30,185.10,silence
185.10,200.75,speech
...
If I have understood your question correctly, this is at least what you want to generate, isn't it? Though I rather think you will have to provide that file yourself.
Most of this information is quoted directly from the project's wiki, which I suggest you read. Don't hesitate to reach out, as I am really interested in this topic.
Other available libraries for audio analysis:
So, I'm planning to try making a light organ with an Arduino and Python, communicating over serial to control the brightness of several LEDs. The computer will use the microphone or a playing MP3 to generate the data.
I'm not so sure how to handle the audio processing. What's a good option for Python that can take either a playing audio file or microphone data (I'd prefer the microphone) and then split it into different frequency ranges and write the intensity to variables? Do I need to worry about overtones if I use the microphone?
If you're not committed to using Python, you should also look at using PureData (PD) to handle the audio analysis. Interfacing PD to the Arduino is already a solved problem, and there are a lot of pre-existing components that make working with audio easy.
Try http://wiki.python.org/moin/Audio for links to various Python audio processing packages.
The audioop package has some basic waveform manipulation functions.
See also:
Detect and record a sound with python
Detect & Record Audio in Python
Portaudio has a Python interface that would let you read data off the microphone.
For the band splitting, you could use something like a band-pass filter feeding into an envelope follower -- one filter+follower for each frequency band of interest.
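A rough sketch of one such filter+follower pair with SciPy (the band edges and the source of the sample block are made up, and a simple RMS over the filtered block stands in for a proper envelope follower):

import numpy as np
from scipy.signal import butter, sosfilt

def band_intensity(block, sample_rate, f_low, f_high):
    # band-pass the block, then take the RMS as a crude envelope value
    sos = butter(4, [f_low, f_high], btype='bandpass', fs=sample_rate, output='sos')
    filtered = sosfilt(sos, block)
    return np.sqrt(np.mean(filtered ** 2))

# one call per LED band, e.g. for a block of microphone samples at 44.1 kHz:
# bass   = band_intensity(block, 44100, 40, 250)
# mids   = band_intensity(block, 44100, 250, 2000)
# treble = band_intensity(block, 44100, 2000, 8000)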