Pitch detection in Python

The program I'm working on is a Python module that detects certain frequencies (the human speech range, roughly 80–300 Hz) and, by checking against a database, reports the intonation of the sentence. I use SciPy to plot the frequency content of the sound files, but I cannot isolate a specific frequency band in order to analyze pitch. How can I do this?
More info: I would like to be able to define an intonation pattern (e.g. rising, falling) and have the program detect whether a sound file follows that specific pattern.

UPDATE in 2019: there are now very accurate pitch trackers based on neural networks, and they work in Python out of the box. Check
https://pypi.org/project/crepe/
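As a minimal sketch of CREPE's documented predict() API (the file name here is just a placeholder):

import crepe
from scipy.io import wavfile

sr, audio = wavfile.read("speech.wav")  # placeholder file name
# Returns one f0 estimate (Hz) every 10 ms, with a confidence value per frame
time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)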
ANSWER FROM 2015: Pitch detection is a complex problem; a recent package from Google provides a highly capable solution to this non-trivial task:
https://github.com/google/REAPER
You can wrap it in Python if you want to access it from Python.

You could try the following. Keep in mind that the human voice also has harmonics that go well beyond 300 Hz. Nevertheless, you can move a window across your audio file and look at the change in power of the maximum frequency (as shown below), or of a set of frequencies, in each window. The code below is meant to give intuition:
import numpy as np

def maxFrequency(X, F_sample, Low_cutoff=80, High_cutoff=300):
    """Search for the strongest frequency component of a real signal in a band, using the FFT.
    Inputs
    ======
    X: 1-D numpy array, the real time-domain audio signal (single-channel time series)
    F_sample: float, the sampling frequency of the signal (physical frequency in Hz)
    Low_cutoff: float, frequency components below this frequency are ignored (in Hz)
    High_cutoff: float, frequency components above this frequency are ignored (in Hz)
    """
    M = X.size  # let M be the length of the time series
    Spectrum = np.abs(np.fft.rfft(X, n=M))  # magnitude spectrum; with np.fft.rfft, bin k sits at k * F_sample / M Hz
    Low_cutoff, High_cutoff, F_sample = map(float, [Low_cutoff, High_cutoff, F_sample])
    # Convert the cutoff frequencies into (integer) bin indices on the spectrum
    Low_point = int(Low_cutoff / F_sample * M)
    High_point = int(High_cutoff / F_sample * M)
    # Find the bin with maximum power within the band, and convert it back to Hz
    maximumFrequency = (Low_point + np.argmax(Spectrum[Low_point:High_point])) * F_sample / M
    return maximumFrequency

voiceVector = []
for window in fullAudio:  # run a window of appropriate length across the audio file
    voiceVector.append(maxFrequency(window, samplingRate))
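For completeness, here is one (hypothetical) way to produce the fullAudio and samplingRate used above: read a WAV file and slice it into fixed-length, non-overlapping windows. The window length is chosen only for illustration:

from scipy.io import wavfile

samplingRate, audio = wavfile.read("speech.wav")  # placeholder file name
windowSize = 2048  # samples per analysis window
fullAudio = [audio[i:i + windowSize]
             for i in range(0, len(audio) - windowSize + 1, windowSize)]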
Now, depending on the intonation of the voice, the frequency with maximum power may shift, which you can register and map to a given intonation. This will not necessarily hold in every case, and you may have to monitor shifts in several frequencies together, but it should get you started.

There are many different algorithms to estimate pitch, but a study found that Praat's algorithm is the most accurate [1]. Recently, the Parselmouth library has made it a lot easier to call Praat functions from Python [2].
[1]: Strömbergsson, Sofia. "Today's Most Frequently Used F0 Estimation Methods, and Their Accuracy in Estimating Male and Female Pitch in Clean Speech." INTERSPEECH. 2016. https://pdfs.semanticscholar.org/ff04/0316f44eab5c0497cec280bfb1fd0e7c0e85.pdf
[2]: https://github.com/YannickJadoul/Parselmouth
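As a minimal sketch of Parselmouth's documented API (the file name is a placeholder), extracting an f0 contour looks like this:

import parselmouth

snd = parselmouth.Sound("speech.wav")   # placeholder file name
pitch = snd.to_pitch()                  # Praat's pitch tracker with default settings
f0 = pitch.selected_array['frequency']  # one f0 value (Hz) per frame; 0 where unvoiced
print(f0[f0 > 0].mean())                # mean f0 over voiced frames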

There are basically two classes of f0 (pitch) estimation: time domain (with autocorrelation/cross-correlation, for example), and frequency domain (e.g. identifying the fundamental frequency by measuring distances between harmonics, or identifying the frequency in the spectrum with maximum power, as shown in the example above by Sahil M).
For many years I have successfully used RAPT (Robust Algorithm for Pitch Tracking), the predecessor of REAPER, also by David Talkin. The widely used Praat software which you mention also includes a RAPT-like cross-correlation algorithm option. Description and code are readily available on the web. A DEB install archive is available here: http://www.phon.ox.ac.uk/releases
Pattern detection (rises, falls, etc.) with the pitch function is a separate issue. The suggestion above by Sahil M for using a moving window across the pitch function is a good way to start.
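Building on that, here is a minimal sketch of a rise/fall classifier: it fits a least-squares line to the voiced part of an f0 contour (such as the one Parselmouth returns above) and labels the contour by the sign of the slope. The threshold is an illustrative value, not a calibrated one:

import numpy as np

def contour_direction(f0, threshold=10.0):
    """Label an f0 contour (Hz per frame, 0 = unvoiced) as rising, falling, or flat."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if voiced.sum() < 2:
        return "flat"
    t = np.arange(f0.size)[voiced]
    slope = np.polyfit(t, f0[voiced], 1)[0]  # Hz per frame
    excursion = slope * f0.size              # total rise/fall over the contour, in Hz
    if excursion > threshold:
        return "rising"
    if excursion < -threshold:
        return "falling"
    return "flat"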

Related

How to Identify Each Component from an Audio Signal?

I have some audio files recorded from wind turbines, and I'm trying to do anomaly detection. The general idea is that if a blade has a fault (e.g. cracking), its sound will differ from that of the other two blades, so we can basically find a way to extract each blade's sound signal and compare the similarity/distance between them; if one of the signals differs significantly, we can say the turbine is going to fail.
I only have a few faulty samples, and labels are lacking.
However, no one seems to have done this kind of work before, and I ran into a lot of trouble while attempting it.
I've tried using the STFT to convert the signal to a power spectrum, and some spikes show up. How do I identify each blade from the raw data? (Some related work uses autoencoders to detect anomalies in audio, but in this task we want to use a similarity-based method.)
Does anyone have a good idea, or related work/papers to recommend?
Well...
If your shaft is rotating at, say 1200 RPM or 20 Hz, then all the significant sound produced by that rotation should be at harmonics of 20Hz.
If the turbine has 3 perfect blades, however, then it will be in exactly the same configuration 3 times for every rotation, so all of the sound produced by the rotation should be confined to multiples of 60 Hz.
Energy at the other harmonics of 20 Hz -- 20, 40, 80, 100, etc. -- that is above the noise floor would generally result from differences between the blades.
This of course ignores noise from other sources that are also synchronized to the shaft, which can mess up the analysis.
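To make that idea concrete, here is a minimal sketch (the shaft rate is assumed known, e.g. 1200 RPM gives 20 Hz) that compares the spectral energy at shaft harmonics which are not blade-pass harmonics against the energy at the blade-pass harmonics themselves:

import numpy as np

def blade_asymmetry_ratio(x, fs, shaft_hz=20.0, n_blades=3, n_harmonics=30):
    """Energy at shaft harmonics that are NOT blade-pass harmonics, divided by
    energy at the blade-pass harmonics. For a healthy 3-bladed rotor, most
    rotation-locked energy should sit at multiples of n_blades * shaft_hz,
    so a large ratio hints at differences between the blades."""
    spectrum = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)

    def energy_at(f_target):  # power in the bin nearest the target frequency
        return spectrum[np.argmin(np.abs(freqs - f_target))]

    blade = sum(energy_at(k * shaft_hz) for k in range(1, n_harmonics + 1)
                if k % n_blades == 0)
    other = sum(energy_at(k * shaft_hz) for k in range(1, n_harmonics + 1)
                if k % n_blades != 0)
    return other / blade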
Assuming that the audio you got is from a location where one can hear individual blades as they pass by, there are two subproblems:
1) Estimate each blade position, and extract the audio for each blade.
2) Compare the signals from the blades to each other, and determine whether one of them is different enough to be considered an anomaly.
Estimating the blade position can be done with a sensor that detects the rotation directly, for example based on the magnetic field of the generator. Ideally you would have this kind of known-good sensor data, at least while developing your system. It may also be possible to estimate it from the audio alone, using some sort of periodicity detection; autocorrelation is a commonly used technique for that, as sketched below.
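A minimal autocorrelation sketch for the periodicity estimate (the plausible rotation-rate bounds are assumptions to be adjusted):

import numpy as np

def estimate_rotation_period(x, fs, min_hz=0.2, max_hz=5.0):
    """Estimate the dominant period (seconds) of a signal via autocorrelation,
    searching only lags that correspond to plausible rotation rates."""
    x = x - np.mean(x)
    acf = np.correlate(x, x, mode='full')[x.size - 1:]  # keep non-negative lags
    lo, hi = int(fs / max_hz), int(fs / min_hz)
    lag = lo + np.argmax(acf[lo:hi])  # strongest periodicity within the bounds
    return lag / fs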
To detect differences between blades, you can try to use a standard distance function on a standard feature description, like Euclidean on MFCC. You will still need to have some samples for both known faulty examples and known good/acceptable examples, to evaluate your solution.
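For instance, a sketch of such a comparison with librosa, assuming the per-blade segments have already been extracted:

import numpy as np
import librosa

def blade_distance(seg_a, seg_b, sr):
    """Euclidean distance between the mean MFCC vectors of two audio segments."""
    mfcc_a = librosa.feature.mfcc(y=seg_a, sr=sr, n_mfcc=13).mean(axis=1)
    mfcc_b = librosa.feature.mfcc(y=seg_b, sr=sr, n_mfcc=13).mean(axis=1)
    return np.linalg.norm(mfcc_a - mfcc_b)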
There is however a risk that this will not be good enough. Then try to compute some better features as basis for the distance computation. Perhaps using an AutoEncoder. You can also try some sort of Similarity Learning.
If you have a good amount of both good and faulty data, you may be able to use a triplet loss setup to learn the similarity metric. Feed in data for two good blades as objects that should be similar, and the known-bad as something that should be dissimilar.

Calculation of mean fundamental frequency of audio sample

I am working on a project to predict the gender of a user by taking live audio input from the user. While researching for this project I came across a dataset by kaggle https://www.kaggle.com/primaryobjects/voicegender, a CART logic was proposed where
if meanfun < 0.14:
    if IQR >= 0.07:
        return male
    else:
        return female
else:
    return female
I have tried to search for mean fundamental frequency but could not find any useful resources.
Please explain this concept: what is the difference between mean frequency and mean fundamental frequency, and how is its value calculated?
I'll attempt to explain the concept...
Signals in general can be described as a sum of sine waves. As you may or may not know, a sine wave can be defined mathematically by the equation A·sin(ωt + φ), where A is the amplitude, ω is the angular frequency, t is the time, and φ is the phase shift. ω can further be replaced by 2πf, where f is the frequency in Hz (the unit used in the documentation you linked). When they refer to a frequency in this context, you can think of it as one sine-wave component of the original/raw signal.
The definition of a sine wave is described in the wikipedia page, amongst many other resources, here.
The audio signals you're looking at are complex signals, likely with many sine waves involved. The fundamental frequency is the lowest frequency of the periodic waveform (wiki here). The mean fundamental frequency is that fundamental frequency averaged over successive analysis frames of the signal, whereas the mean frequency is the average over all of the signal's frequency components.
The most common method to find the frequencies is the Fast Fourier Transform (FFT). This changes the signal from the time domain to the frequency domain, and you essentially get the breakdown of all the sine-wave components that make up the original signal. Alternatively, you could get your hands dirty with peak detection: frequency is essentially the number of times something occurs within some period of time, so you could literally count the number of peaks per second to get a frequency value in Hz. I definitely don't recommend that for voice audio signals, though.
To give you an idea of how a frequency value places within the audio spectrum, let's compare the musical note middle C to the A above it. Middle C is 261.626 Hz and A is 440.000 Hz (source). As you can see, higher notes have higher frequencies.
What this project's logic is saying is that female voices are made up of higher frequencies than male voices (somewhat unsurprising). It's also saying that, among low-pitched voices, female voices have a more tightly bound spread of frequency components than male voices, judging by the IQR >= 0.07 split, which is pretty interesting to know.
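To make this concrete, here is a sketch of computing a mean fundamental frequency in Python with librosa's pYIN tracker. The file name is a placeholder; note that the Kaggle dataset's frequency features appear to be in kHz, so meanfun = 0.14 corresponds to 140 Hz:

import numpy as np
import librosa

y, sr = librosa.load("voice.wav", sr=None)  # placeholder file name
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=65, fmax=300, sr=sr)
mean_fun_hz = np.nanmean(f0)  # pyin marks unvoiced frames as NaN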
Hope this helps.

How to Calculate power spectral density using USRP data?

I wanted to plot a graph of average power spectral density (in dBm) against frequency (2.4 GHz to 2.5 GHz).
The basic procedure I used earlier for a power vs. frequency plot was to store the data generated by "usrp_spectrum_sense.py" for some time period and then take the average.
Can I calculate the PSD from the power used in "usrp_spectrum_sense.py"?
Is there any way to calculate the PSD directly from USRP data?
Is there any other approach that can be used to calculate the PSD over the desired frequency range using a USRP?
PS: I recently found out about psd() in matplotlib; can it be used to solve my problem?
I wasn't 100% sure whether or not to mark this question a duplicate of "Retrieve data from USRP N210 device"; however, since the poster of that question was very confused and so was his question, let's answer this in a concise way:
What an SDR device like the USRP does is give you digital samples. What these are is nothing more or less than what the ADC (Analog-to-Digital converter) makes out of the voltages it sees. Then, those numbers are subject to a DSP chain that does frequency shifting, decimation and appropriate filtering. In other words, the discrete complex signal's envelope coming from the USRP should be proportional to the voltages observed by the ADC. Thanks to physics, that means that the magnitude square of these samples should be proportional to the signal power as seen by the ADC.
Thus, the values you get are "dBFS" (dB relative to Full Scale), which is an arbitrary measure relative to the maximum value the signal processing chain might produce.
Now, notice two things:
As seen by the ADC is important. Prior to the ADC there's
an unknown antenna with a) an unknown efficiency and b) unknown radiation pattern illuminated from an unknown direction,
connected to a cable that might or might not perfectly match the antenna's impedance, and that might or might not perfectly match the USRP's RF front-end's impedance,
potentially a bank of preselection filters with different attenuations,
a low-noise front-end amplifier (with adjustable gain, depending on the device/daughterboard) whose gain is not perfectly flat over frequency,
a mixer with frequency-dependent gain,
baseband and/or IF gain stages and attenuators, adjustable,
baseband filters, might be adjustable,
component variances in PCBs, connectors, passives and active components, temperature-dependent gain and intermodulation, as well as
ADC non-linearity, frequency-dependent behaviour.
proportional is important here, since after sampling, there will be
I/Q imbalance correction,
DC/LO leakage cancellation,
anti-aliasing filtering prior to
decimation,
and bit-width and numerical type changing operations.
All in all, the USRPs are not calibrated measurement devices. They are pretty nice, and if you chose the right one for your specific application, you might just need to calibrate once, with a known external power source feeding exactly your system, from the antenna through to the samples coming out at the end, at exactly the frequency you want to observe. Once you know "OK, when I feed in x dBm of power, I see y dBFS, so there's a factor of (x - y) dB between dBFS and dBm", you have calibrated your device for exactly one configuration, consisting of:
hardware models and individual units used, including antennas and cables,
center frequency,
gain,
filter settings,
decimation/sampling rate
Note that doing such calibrations, especially in the 2.4 GHz ISM band, will require an "RF-silent" room; it'll be hard to find an office or lab with no 2.4 GHz devices these days, and the reason these frequencies are free for use is that microwave ovens interfere there. And then there's the fact that these frequencies tend to diffract and reflect off building structures, PC cases, furniture with metal parts... In other words: get access to an anechoic chamber, a reference transmit antenna and transmit power source, and do the whole antenna-system calibration dance that normally results in a directivity diagram, but instead generate a "digital value relative to transmit power" measurement. Whether or not that measurement is really representative of how you'll be using your USRP in a lab environment is very much for you to consider.
That is a problem of any microwave equipment, not only the USRPs; RF propagation isn't easy to predict in complex environments, and the power characteristics of a receiving system aren't determined by a single component, but by the system as a whole in exactly its intended operational environment. Thus, calibration requires that you either know your antenna, cable, measurement front end, digitizer and DSP exactly and can do the math including error margins, or that you calibrate the system as a whole and change as little as possible afterwards.
So: no. No Matplotlib (or Matlab) function in this world can give meaning to numbers that isn't in those numbers; for absolute power, you'll need to calibrate against a reference.
Another word on linearity: a USRP's analog hardware at full gain is pretty sensitive; so sensitive that operating e.g. a WiFi device in the same room would be like screaming in its ear, blanking out weaker signals and driving the analog signal chain into non-linearity. In that case, not only do the voltages observed by the ADC lose their linear relation to the voltages at the antenna port, but also, and that is usually worse, amplifiers become mixers, so unwanted intermodulation introduces energy in spectral places where there was none. So make sure you operate your device in a regime where you make the most of your signal's dynamic range without running into non-linearities.
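For the relative (uncalibrated) part of the job, the PSD itself is easy to compute from the complex baseband samples, for example with scipy.signal.welch. A sketch, where the file name, sample rate, and the zero calibration offset are placeholders:

import numpy as np
from scipy.signal import welch

fs = 1e6  # sample rate in Hz (placeholder)
iq = np.fromfile("capture.dat", dtype=np.complex64)  # placeholder capture file

f, psd = welch(iq, fs=fs, nperseg=4096, return_onesided=False)
psd_dbfs = 10 * np.log10(psd)  # relative (uncalibrated) power density
cal_offset_db = 0.0            # the (x - y) dB factor, measured against a known reference
psd_dbm_per_hz = psd_dbfs + cal_offset_db  # only meaningful after calibration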

Strange FFT peaks when working with many tight frequencies

I'm using a slightly modified version of this python code to do frequency analysis:
FFT wrong value?
Let's say I have a pack of sine waves in the time domain that are very close together in frequency and share the same amplitude. This is what they look like in the frequency domain, using an FFT on 1024 samples from which I strip out the second half, giving 512 bins of resolution:
This is what I get when I apply an FFT over the same group of waves, but this time with 128 samples (64 bins):
I expected a plateau-ish frequency response, but it looks like the waves in the center are being cancelled. What are those "horns" I see? Is this normal?
I believe your result is correct. The peaks are at ±f1 and ±f2, corresponding to the respective frequency components of the two signals shown in your first plot.
I assume that you are shifting the DC component back to the center? What "waves in the center" are you referring to?
There are a couple of other potential issues that you should be aware of:
Aliasing: by inspection it appears that you have enough samples across your signal, but keep in mind that artificial (aliased) frequencies can be created by the FFT if there are not enough sample points to capture the underlying frequency. Specifically, if your highest frequency is f, then you need your sample spacing to be at most Δx = 1/(2f).
Windowing: your signal is windowed (has a finite extent), so there will also be some broadening, ringing, or redistribution of power about each frequency due to edge effects.
Since I don't know the details of your data, I went ahead and created a sinusoid and then sampled the data close to what appears to be your sampling rate. For example, below is a sinusoid with 64 points and with a signal frequency at 10 cycles (count the peaks):
The FFT result is then:
which shows the same qualitative features as yours, but without having your data it's difficult for me to match your exact situation (spacing and taper).
Next I applied a super-Gauss window function (shown below) to simulate the finite extent of your data:
After applying the window to the input signal we have:
The corresponding FFT result shows some additional power redistribution, due to the finite extent of the data:
Although I can't match your exact situation, I believe your results appear as expected and some qualitative features of your data have been identified. Hope this helps.
Sine waves closely spaced in the frequency domain will occasionally nearly cancel out in the time domain. Since your second FFT is 8 times shorter than your first FFT, you may have windowed just such a short area of cancellation. Try a different time location for the shorter time window to see something different (or different phases of the sinusoids).
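A small self-contained experiment illustrates both effects: ten equal-amplitude tones spaced 1 Hz apart are resolved by a 1024-point FFT (1 Hz bins) but merge, and partially cancel, in a 128-point FFT (8 Hz bins), with a result that changes with the window position:

import numpy as np

fs = 1024.0
t = np.arange(2048) / fs
x = sum(np.sin(2 * np.pi * f * t) for f in np.arange(95.0, 105.0))  # ten close tones

for n in (1024, 128):
    for start in (0, 256):  # two different window positions
        spectrum = np.abs(np.fft.rfft(x[start:start + n]))
        print(n, start, spectrum.argmax() * fs / n)  # bin spacing is fs/n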

Using Python to measure audio "loudness"

I'm looking to calculate the loudness of a piece of audio using Python — probably by extracting the peak volume of a piece of audio, or possibly using a more accurate measure (RMS?).
What's the best way to do this? I've had a look at pyaudio, but that didn't seem to do what I wanted. What looked good was ruby-audio, as this seemingly has sound.abs.max built into it.
The input audio will be taken from various local MP3 files that are around 30s in duration.
I think that the RMS would be the most accurate measure. One thing to note is that we perceive loudness differently at different frequencies, so convert the audio to frequency space with an FFT (numpy.fft should work great on only 30 s of audio). Now compute a power spectral density from this, and weight the PSD by frequency using some loudness curve. Especially de-emphasize frequencies below 10 Hz, since there will be a lot of power there (it would dominate the RMS calculation in the time domain), yet we can't hear it. Now integrate the PSD and take the square root, and that will give a perceived RMS.
You can also break the MP3 into sections or windows and apply this technique to give the volume of particular sections.
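A sketch of that recipe, using the standard closed-form A-weighting curve as the loudness curve; x is assumed to be a mono sample array (MP3 decoding is left aside), and the Parseval bookkeeping is simplified (the factor of 2 for the doubled one-sided bins is ignored):

import numpy as np

def a_weighting_db(f):
    """Standard A-weighting curve in dB for frequencies f in Hz."""
    f = np.asarray(f, dtype=float)
    ra = (12194.0**2 * f**4) / (
        (f**2 + 20.6**2)
        * np.sqrt((f**2 + 107.7**2) * (f**2 + 737.9**2))
        * (f**2 + 12194.0**2))
    return 20 * np.log10(np.maximum(ra, 1e-30)) + 2.0

def weighted_rms(x, fs):
    """Weight the power spectrum by the A-curve, then integrate and take the
    square root (Parseval's theorem) to get a perceptually weighted RMS."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    weights = 10.0 ** (a_weighting_db(freqs) / 10.0)  # dB -> power ratio
    return np.sqrt(np.sum(weights * np.abs(spectrum) ** 2)) / x.size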
