Calculation of mean fundamental frequency of audio sample - python

I am working on a project to predict the gender of a user from live audio input. While researching this project I came across a Kaggle dataset, https://www.kaggle.com/primaryobjects/voicegender, where a CART rule like the following was proposed:
if meanfun < 0.14:
    if IQR >= 0.07:
        return male
    else:
        return female
else:
    return female
I have tried to search for mean fundamental frequency but could not find any useful resources.
Please explain this concept: what is the difference between mean frequency and mean fundamental frequency, and how do I calculate its value?

I'll attempt to explain the concept...
Signals in general can be represented as a sum of sine waves. As you may or may not know, a sine wave can be defined mathematically by the equation Asin(ωt+φ), where A is the amplitude, ω is the angular frequency, t is time, and φ is the phase shift. ω can be further replaced by 2πf, where f is the frequency in Hz (the unit used in the dataset you linked). When they refer to a frequency in this context, you can think of it as one sine-wave component of the original/raw signal.
The definition of a sine wave is described in the wikipedia page, amongst many other resources, here.
The audio signals you're looking at are complex signals, likely made up of many sine waves. The fundamental frequency refers to the lowest frequency of the periodic waveform (wiki here). The mean fundamental frequency is then the fundamental frequency estimated over short frames of the recording and averaged across those frames, rather than the average of every frequency present in the signal.
The most common method to find the frequencies is the Fast Fourier Transform (FFT): it changes the signal from the time domain to the frequency domain, and you essentially get the breakdown of all the sine-wave components that make up the original signal. Alternatively, you could get your hands dirty with peak detection: frequency is essentially the number of times something repeats within some period of time, so you could literally count the number of peaks per second to get a frequency value in Hz. I definitely don't recommend that for voice audio signals, though.
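To make that concrete, here is a minimal sketch of one way to compute a mean fundamental frequency yourself, using librosa's pyin tracker (the file name is just a placeholder and the frequency limits are assumptions; the 0.14 threshold in the Kaggle rule suggests meanfun is stored in kHz there):

import librosa
import numpy as np

# Load a mono recording (the file name is a placeholder for this sketch)
y, sr = librosa.load("voice.wav", sr=None, mono=True)

# pyin gives one fundamental-frequency estimate per frame (NaN for unvoiced frames)
f0, voiced_flag, voiced_prob = librosa.pyin(y, sr=sr,
                                            fmin=librosa.note_to_hz("C2"),   # ~65 Hz
                                            fmax=librosa.note_to_hz("C6"))   # ~1047 Hz

# Mean fundamental frequency = average f0 over the voiced frames only
mean_fun_hz = np.nanmean(f0)
print("mean fundamental frequency: %.1f Hz (%.3f kHz)" % (mean_fun_hz, mean_fun_hz / 1000.0))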
To give you an idea of how a frequency value places within the audio spectrum, let's compare the musical note middle C to the A above it. Middle C is 261.626 Hz and A is 440.000 Hz (source). As you can see, higher notes have higher frequencies.
What this project's logic is saying is that female voices are made up of higher frequencies than male voices (somewhat unsurprising). It is also saying that, among the lower-pitched voices, male voices show a wider spread of fundamental frequency (IQR >= 0.07) than female voices, which is pretty interesting to know.
Hope this helps.

Related

Model for predicting temperature data of fridge

I set up a sensor which measures temperature every 3 seconds. I collected data for 3 days and have 60,000 rows in my CSV export. Now I would like to forecast the next few days. When looking at the data you can already see a "seasonality" that reflects the fridge's heating and cooling cycle, so I guess it shouldn't be too difficult to predict. I am not really sure if my data is too granular and whether I should do some kind of undersampling. I thought about using a seasonal ARIMA model but I am having difficulties picking parameters. As the seasonality in the data is pretty obvious, is there maybe a model that fits better? Please bear with me, I'm pretty new to machine learning.
When the goal is to forecast rising temperatures, you can forecast the lower and upper peaks, i.e. their heights and distances. Assuming (as a simplified model) that the temperature change in between is linear, we can model each complete peak, starting from a first lower peak of the temperature curve, up to the next upper peak, and down to the next lower peak. A complete peak can then be seen as a triangle which we easily integrate (calculate its area plus the area of the rectangle below it). The estimation can now be done by integrating a number of complete peaks we have already measured. By repeating this procedure we can run a linear regression on the average temperatures and alert when the slope is above a defined threshold.
As this only tackles a certain kind of error, one can do the same for the average distances between the upper peaks, and also for the lower peaks. I.e. take the times between them over a certain period, fit a curve (linear regression may well be sufficient), and alert when the slope of the curve indicates that the distances are getting too long.
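As a rough sketch of the peak-and-regression idea above (the CSV layout, the column index, the peak distance, and the alert threshold are all assumptions), you could do something like:

import numpy as np
from scipy.signal import find_peaks

# One temperature reading every 3 seconds, loaded from the CSV export (column index assumed)
temps = np.loadtxt("export.csv", delimiter=",", skiprows=1, usecols=1)
times = np.arange(temps.size) * 3.0   # seconds

# Upper and lower peaks of the heating/cooling cycle ('distance' must be tuned to the cycle length)
upper, _ = find_peaks(temps, distance=100)
lower, _ = find_peaks(-temps, distance=100)

# Average temperature of each complete cycle (lower peak to lower peak)
cycle_means = np.array([temps[lower[i]:lower[i + 1]].mean() for i in range(lower.size - 1)])
cycle_times = times[lower[:-1]]

# Linear regression on the per-cycle averages; alert when the slope exceeds a threshold
slope, intercept = np.polyfit(cycle_times, cycle_means, 1)
if slope > 1e-5:   # degrees per second, arbitrary threshold
    print("average temperature is trending upwards: %.2e deg/s" % slope)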
It's mission impossible. If the fridge works without interference, the graph always looks the same. A change can be caused, for example, by an opened door, a breakdown, or a major change in external conditions, but you cannot predict such events. Instead, you can try to warn about the possibility of problems in the near future, for example based on a steady increase in average temperature. Such a situation may indicate a leak in the cooling system.
By the way, have you reconsidered logging the temperature every 3 seconds? This is usually unjustified, because it is physically impossible for the temperature to change by a measurable amount in such a short interval. Our team usually sets the logging interval to 30 or 60 seconds in such cases, sometimes even more, depending on the size of the chamber, the way the air is circulated, the ratio of volume to power of the refrigeration unit, etc.
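If you'd rather keep the 3-second export and just thin it out, a simple pandas resample gives you 30-second averages (the column names here are assumptions):

import pandas as pd

# Read the export; assume it has a timestamp column and a temperature column
df = pd.read_csv("export.csv", parse_dates=["timestamp"], index_col="timestamp")

# Downsample from one reading every 3 s to 30 s averages
df_30s = df["temperature"].resample("30s").mean()
print(df_30s.head())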

Can I retrieve a signal from a scipy.signal.welch power spectrum?

If I have a power spectrum that has been computed using the Welch method in scipy.signal, is there any way I can retrieve the original signal? If not, what data can I get from the power spectrum that can tell me something about the signal?
Is there any way I can retrieve the original signal?
It is impossible to recover the original signal from its power spectral density. Welch's method is calculated as

$$\hat{P}_{\mathrm{welch}}(\omega) = \frac{1}{K U} \sum_{k=0}^{K-1} \left| \sum_{n=0}^{L-1} w[n]\, x[n + kR]\, e^{-j \omega n} \right|^{2}$$

where K is the number of segments averaged together, L is the number of samples in each segment's Fourier transform, R is the decimation factor, or the number of samples "jumped" when moving to the next segment, w[n] is a window function (e.g. Hann, Hamming), and U is a normalization factor equal to the energy of the window function:

$$U = \sum_{n=0}^{L-1} w^{2}[n]$$
Two important things to notice:
The Fourier transform (the sum indexed by n) has an absolute value squared around it. This means that you have lost all phase information of your signal. See this post for the importance of the phase information in a signal. Using only the magnitude information, it is impossible to tell what the original signal/image was.
The equation above is averaging multiple PSD estimates (modified periodograms, to be specific). In the same way that a simple average loses the detailed information contained in the individual samples x[n] and how they vary over time, Welch's method also loses how the signal varies over time. To keep any information about how the signal varies over time, you would need to have calculated a spectrogram instead.
If not, what data can I get that can tell me something about the signal given the power spectrum?
As the name implies, the power spectral density (PSD) tells you the density of energy at each frequency. You can identify whether the majority of energy is in low, mid, or high frequencies. The averaging of Welch's method does a decent job at reducing stochastic noise, so the steady state signatures present in your signal should be well separated from the noise. Assuming your L is sufficiently large and you are not aliasing your data, you should be able to easily estimate the power level and frequency of any signatures.
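As a quick illustrative sketch of pulling that kind of information out of a Welch PSD (the tone frequency and sampling rate are made up for this example):

import numpy as np
from scipy import signal

fs = 8000.0                                   # sampling rate, Hz (assumed)
t = np.arange(0, 2.0, 1.0 / fs)
x = np.sin(2 * np.pi * 1000 * t) + 0.5 * np.random.randn(t.size)   # 1 kHz tone in noise

# Welch PSD: nperseg is L, noverlap controls R, the window defaults to Hann
f, Pxx = signal.welch(x, fs=fs, nperseg=1024)

# Frequency of the strongest signature and the fraction of power below 500 Hz
print("dominant frequency: %.1f Hz" % f[np.argmax(Pxx)])
print("fraction of power below 500 Hz: %.2f" % (Pxx[f < 500].sum() / Pxx.sum()))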

Setting parameters in Librosa's CQT function for an 88-key piano

I am doing an automatic music recognition project with a deep learning model. For my data preprocessing, I am trying to calculate the Constant Q Transform for polyphonic 88-key piano audio using Python's Librosa library. However, I do not understand what I should set fmin, n_bins, and bins_per_octave to in Librosa's cqt() method to do this. Specifically:
What exactly is a bin? Do the upper and lower boundaries of a bin correspond to the frequencies of two consecutive notes? In other words, because an 88-key piano has 7 octaves each with 12 unique notes, should I set n_bins = 7 * 12 = 84 or equivalently bins_per_octave = 7? Or should several bins correspond to a single note interval?
Is fmin supposed to be the deepest note on the 88-key piano, i.e. the A note with a frequency of about 27.5 Hz?
Why do we need fmin? Is this some sort of reference point, similar to the equation from amplitude to decibels?
What are the differences between n_bins and bins_per_octave and which is better to use? For example, this research paper here uses both.
When is it appropriate to use Librosa's chroma_cqt method?
I'm not an expert in CQT, but I can maybe help answer some of these questions. According to this wikipedia page, the CQT can be thought of as a series of filters on the signal. Each filter isolates some frequency band of the signal, and the amplitude of the filtered signal is the amplitude that is output for that frequency.
So, to your first question, a bin is a filter which isolates a particular part of the signal in the frequency domain. The exact upper and lower bounds of each filter are a bit unclear to me, but the idea is that each bin is centered at a frequency (described below), and the bins ideally wouldn't overlap, yet also wouldn't lose any data if you attempted to reconstruct the original signal.
For your case, I would set fmin to the lowest note on the piano, like you said: 27.5 Hz. Then I would leave bins_per_octave at the default (12), since you'd like to match the bins to the notes on the piano. Finally, n_bins would be 88, since you have 88 keys. You won't capture any harmonics of the keys (especially the higher ones), but maybe that's okay for your case.
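In librosa, that choice of parameters would look roughly like this (the file name is a placeholder):

import librosa

y, sr = librosa.load("piano.wav", sr=None)   # placeholder file name

# One bin per semitone, starting at the lowest piano key (A0 = 27.5 Hz), 88 bins total
C = librosa.cqt(y, sr=sr,
                fmin=librosa.note_to_hz("A0"),
                n_bins=88,
                bins_per_octave=12)
print(C.shape)   # (88, number_of_frames)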
To explain more about how the frequencies are chosen, the idea is to mimic how humans hear frequencies. We are more discerning at lower frequencies and less so at high frequencies, with a roughly logarithmic response. So the internal formula for each bin's center frequency is probably something like:
f_min * 2 ** (i / bins_per_octave)
where i is the bin index, in the range [0, n_bins).
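You can sanity-check that mapping with a small sketch using librosa's cqt_frequencies helper:

import librosa
import numpy as np

fmin = librosa.note_to_hz("A0")                       # 27.5 Hz
freqs = librosa.cqt_frequencies(n_bins=88, fmin=fmin, bins_per_octave=12)

# The same thing computed by hand with the formula above
i = np.arange(88)
by_hand = fmin * 2.0 ** (i / 12.0)

print(np.allclose(freqs, by_hand))                    # True
print(librosa.hz_to_note(freqs[:3]))                  # roughly ['A0', 'A#0', 'B0']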
I have no idea when chroma_cqt is best used, so hopefully someone else can help with that :)

Pitch detection in Python

The concept of the program I'm working on is a Python module which detects certain frequencies (the human speech range, roughly 80-300 Hz) and, by checking against a database, shows the intonation of the sentence. I use SciPy to plot the frequency content of the sound files, but I cannot set any particular frequency range in order to analyse pitch. How can I do this?
More info: I would like to be able to define a pattern in speech (e.g. rising, falling) and have the program detect whether the sound file follows that specific pattern.
UPDATE (2019): there are now very accurate pitch trackers based on neural networks, and they work in Python out of the box. Check
https://pypi.org/project/crepe/
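A minimal usage sketch, following the crepe README (the file name is a placeholder):

from scipy.io import wavfile
import crepe

sr, audio = wavfile.read("speech.wav")        # placeholder file name
time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)

# frequency[i] is the pitch estimate (Hz) at time[i]; confidence[i] says how reliable it is
print(frequency[confidence > 0.8][:10])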
ANSWER FROM 2015. Pitch detection is a complex problem; a recent Google package provides a highly capable solution to this non-trivial task:
https://github.com/google/REAPER
You can wrap it in Python if you want to access it from Python.
You could try the following. I'm sure you know that the human voice also has harmonics which go way beyond 300 Hz. Nevertheless, you can move a window across your audio file and look at the change in power at the maximum-power frequency (as shown below), or at a set of frequencies, within each window. The code below is meant to give intuition:
import numpy as np

def maxFrequency(X, F_sample, Low_cutoff=80, High_cutoff=300):
    """Return the frequency (in Hz) carrying the most power between the cutoffs.

    Inputs
    ======
    X: 1-D numpy array, the real time-domain audio signal (single-channel time series)
    F_sample: float, the sampling frequency of the signal (physical frequency in Hz)
    Low_cutoff: float, frequency components below this frequency are ignored (in Hz)
    High_cutoff: float, frequency components above this frequency are ignored (in Hz)
    """
    M = X.size                                    # length of the time series
    Spectrum = np.abs(np.fft.rfft(X, n=M))        # magnitude spectrum of the real signal
    freqs = np.fft.rfftfreq(M, d=1.0 / F_sample)  # frequency (Hz) of each spectrum bin

    # Convert the cutoff frequencies into bin indices on the spectrum
    Low_point = np.searchsorted(freqs, Low_cutoff)
    High_point = np.searchsorted(freqs, High_cutoff)

    # Find which bin inside the band has the maximum power, then return its frequency
    peak_bin = Low_point + np.argmax(Spectrum[Low_point:High_point])
    return freqs[peak_bin]

voiceVector = []
for window in fullAudio:  # run a window of appropriate length across the audio file
    voiceVector.append(maxFrequency(window, samplingRate))
Now, based on the intonation of the voice, the maximum-power frequency may shift, which you can register and map to a given intonation. This will not necessarily hold in every case, and you may have to monitor shifts in several frequencies together, but it should get you started.
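As a very rough sketch of that last step, you could fit a line to the voiceVector from the code above and read the sign of the slope (the thresholds are arbitrary):

import numpy as np

f0_track = np.asarray(voiceVector, dtype=float)       # one dominant frequency per window
slope = np.polyfit(np.arange(f0_track.size), f0_track, 1)[0]

if slope > 1.0:          # Hz per window, arbitrary threshold
    pattern = "rising"
elif slope < -1.0:
    pattern = "falling"
else:
    pattern = "flat"
print(pattern)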
There are many different algorithms to estimate pitch, but a study found that Praat's algorithm is the most accurate [1]. Recently, the Parselmouth library has made it a lot easier to call Praat functions from Python [2].
[1]: Strömbergsson, Sofia. "Today's Most Frequently Used F0 Estimation Methods, and Their Accuracy in Estimating Male and Female Pitch in Clean Speech." INTERSPEECH. 2016. https://pdfs.semanticscholar.org/ff04/0316f44eab5c0497cec280bfb1fd0e7c0e85.pdf
[2]: https://github.com/YannickJadoul/Parselmouth
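For completeness, a short sketch of getting a Praat pitch track through Parselmouth (the file name is a placeholder; unvoiced frames come back as 0 Hz):

import parselmouth
import numpy as np

snd = parselmouth.Sound("speech.wav")          # placeholder file name
pitch = snd.to_pitch()                         # Praat's pitch tracker

f0 = pitch.selected_array["frequency"]         # one value per frame, 0 where unvoiced
print("mean F0: %.1f Hz" % f0[f0 > 0].mean())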
There are basically two classes of f0 (pitch) estimation: time domain (with autocorrelation/cross-correlation, for example), and frequency domain (e.g. identifying the fundamental frequency by measuring distances between harmonics, or identifying the frequency in the spectrum with maximum power, as shown in the example above by Sahil M).
For many years I have successfully used RAPT (Robust Algorithm for Pitch Tracking), the predecessor of REAPER, also by David Talkin. The widely used Praat software which you mention also includes a RAPT-like cross-correlation algorithm option. Description and code are readily available on the web. A DEB install archive is available here: http://www.phon.ox.ac.uk/releases
Pattern detection (rises, falls, etc.) with the pitch function is a separate issue. The suggestion above by Sahil M for using a moving window across the pitch function is a good way to start.

Strange FFT peaks when working with many tight frequencies

I'm using a slightly modified version of this python code to do frequency analysis:
FFT wrong value?
Let's say I have a pack of sine waves in the time domain that are very close together in frequency, while sharing the same amplitude. This is how they look in the frequency domain, using an FFT on 1024 samples from which I strip out the second half, giving 512 bins of resolution:
This is what I get when I apply an FFT over the same group of waves, but this time with 128 samples (64 bins):
I expected a plateau-ish frequency response, but it looks like the waves in the center are being cancelled. What are those "horns" I see? Is this normal?
I believe your result is correct. The peaks are at ±f1 and ±f2, corresponding to the respective frequency components of the two signals shown in your first plot.
I assume that you are shifting the DC component back to the center? What "waves in the center" are you referring to?
There are a couple of other potential issues that you should be aware of:
Aliasing: by inspection it appears that you have enough samples across your signal, but keep in mind that artificial (aliased) frequencies can be created by the FFT if there are not enough sample points to capture the underlying frequency. Specifically, if your highest frequency is f, then your data sample spacing needs to be Δx = 1/(2*f) or smaller (the Nyquist criterion).
Windowing: your signal is windowed (it has a finite extent), so there will also be some broadening, ringing, or redistribution of power about each spatial frequency due to edge effects.
Since I don't know the details of your data, I went ahead and created a sinusoid and then sampled it close to what appears to be your sampling rate. For example, below is a sinusoid with 64 points and a signal frequency of 10 cycles (count the peaks):
The FFT result is then:
which shows the same quantitative features as yours, but without having your data it's difficult for me to match your exact situation (spacing and taper).
Next I applied a super-Gauss window function (shown below) to simulate the finite extent of your data:
After applying the window to the input signal we have:
The corresponding FFT result shows some additional power redistribution, due to the finite extent of the data:
Although I can't match your exact situation, I believe your results appear as expected and some qualitative features of your data have been identified. Hope this helps.
Sine waves closely spaced in the frequency domain will occasionally nearly cancel out in the time domain. Since your second FFT is 8 times shorter than your first FFT, you may have windowed just such a short region of cancellation. Try a different time location for the shorter time window (or different phases for the sinusoids) to see something different.
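A small numpy experiment illustrates this (the frequencies are made up, but closely spaced like in the question): over the full 1024 samples the four tones are resolved, while a 128-sample window gives very different magnitudes depending on where it lands in the beat pattern.

import numpy as np

fs = 1024.0
t = np.arange(1024) / fs

# Four equal-amplitude sines, closely spaced in frequency (values chosen for illustration)
freqs = [100.0, 101.0, 102.0, 103.0]
x = sum(np.sin(2 * np.pi * f * t) for f in freqs)

# Long FFT (1024 samples, 1 Hz per bin): enough resolution to separate the tones
long_spec = np.abs(np.fft.rfft(x))
print("long window, strongest bins:", np.sort(np.argsort(long_spec)[-4:]))   # bins 100-103

# Short FFTs (128 samples, 8 Hz per bin) taken at two different positions in the beat pattern
spec_a = np.abs(np.fft.rfft(x[:128]))
spec_b = np.abs(np.fft.rfft(x[448:576]))
print("short window A, peak magnitude:", spec_a.max())
print("short window B, peak magnitude:", spec_b.max())   # much smaller: the tones nearly cancel here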
