Amplitude units in FFT

Amplitude units in FFT - python

I'm completely new to python, scipy, matplotlib and programming in general.
I'm using the following code, which I came across online, to apply FFT to .wav files:
import scipy.io.wavfile as wavfile
import scipy
import scipy.fftpack as fftpk
import numpy as np
from matplotlib import pyplot as plt
s_rate, signal = wavfile.read("file.wav")
FFT = abs(scipy.fft.fft(signal))
freqs = fftpk.fftfreq(len(FFT), (1.0/s_rate))
plt.plot(freqs[range(len(FFT)//2)], FFT[range(len(FFT)//2)])
plt.xlabel('Frequency (Hz)')
plt.ylabel('Amplitude')
plt.show()
The resulting graphs give amplitude values that range from 0 to a few thousands, depending on the files, and I have no idea what unit these are in. I'm guessing they might be relative amplitudes, and I was wondering if there is a way to turn that into decibels, as I need specific values.
Thank you
Tanguy

They are amplitudes relative to the quantization units used for the samples in your input signal. So, without calibrating your input signal against a known level of source input (to get Volts per 1 bit change, etc.), the actual units are unknown. If calibrated, you may still need to divide the magnitudes of the FFT output by N (the FFT length), depending on your particular FFT implementation.
To get Decibels, convert by taking 20*log10(abs(...)) of the FFT results, and offset by your 0 dB calibration level.

Related

How to smooth frequency spectrum of time series?

I tried to reproduce Watson's spectrum plot from these set of slides (PDF p. 30, p.29 of the slides), that came from this data of housing building permits.
Watson achieves a very smooth spectrum curve in which it is very easy to tell the peak frequencies.
When I tried to run a FFT on the data, I get a really noisy spectrum curve and I wonder if there is an intermediate step that I am missing.
I ran the fourier analysis on python, using scipy package fftpack as follows:
from scipy import fftpack
fs = 1 / 12 # monthly
N = data.shape[0]
spectrum = fftpack.fft(data.PERMITNSA.values)
freqs = fftpack.fftfreq(len(spectrum)) #* fs
plt.plot(freqs[:N//2], 20 * np.log10(np.abs(spectrum[:N//2])))
Could anyone help me with the missing link?
The original data is:
Below is the Watson's spectrum curve, the one I tried to reproduce:
And these are my results:

The posted curve doesn't look realistic. But there are many methods to get a smooth result with a similar amount of "curviness", using various kinds of resampling and/or plot interpolation.
One method I like is to chop the data into segments (windows, possibly overlapped) roughly 4X longer than the maximum number of "bumps" you want to see, maybe a bit longer. Then window each segment before using a much longer (size of about the resolution of the final plot you want) zero-padded FFT. Then average the results of the multiple FFTs of the multiple windowed segments. This works because a zero-padded FFT is (almost) equivalent to a highest-quality Sinc interpolating low-pass filter.

Difference between scipy periodogram and self implemented power spectral density

I am trying to evaluate the frequency domain of several signals. For this I used the PSD implementation given in this answer. As a comparison I used the signal.periodogram function provided in scipy:
from scipy.signal import tukey
import scipy as sp
f, Pxx_den = sp.signal.periodogram(a_gtrend_orig,12,window=tukey( len(a_gtrend_orig) ))
However when I plot this next to the self-implemented PSD they look significantly different:
As the same window function is used and the periodogram function should also use an FFT where does this difference coming from?

The example that you are comparing this to, is graphing the amplitude at each frequency bin, i.e, abs(fft())
The periodogram produces a power spectral density, that means it is the square of the amplitude at each frequency bin.
The label "windowed psd" is from an early edit, and was corrected later.

Finding only the "prominent" local maxima of a 1d array

I have a couple of data sets with clusters of peaks that look like the following:
You can see that the main features here a clusters of peaks, each cluster having three peaks. I would like to find the x values of those local peaks, but I am running into a few problems. My current code is as follows:
import numpy as np
import matplotlib.pyplot as plt
from scipy import loadtxt, optimize
from scipy.signal import argrelmax
def rounddown(x):
return int(np.floor(x / 10.0)) * 10
pixel, value = loadtxt('voltage152_4.txt', unpack=True, skiprows=0)
ax = plt.axes()
ax.plot(pixel, value, '-')
ax.axis([0, np.max(pixel), np.min(value), np.max(value) + 1])
maxTemp = argrelmax(value, order=5)
maxes = []
for maxi in maxTemp[0]:
if value[maxi] > 40:
maxes.append(maxi)
ax.plot(maxes, value[maxes], 'ro')
plt.yticks(np.arange(rounddown(value.min()), value.max(), 10))
plt.savefig("spectrum1.pdf")
plt.show()
Which works relatively well, but still isn't perfect. Some peaks labeled: The main problem here is that my signal isn't smooth, so a few things that aren't actually my relevant peaks are getting picked up. You can see this in the stray maxima about halfway down a cluster, as well as peaks that have two maxima where in reality it should be one. You can see near the center of the plot there are some high frequency maxima. I was picking those up so I added in the loop only considering values above a certain point.
I am afraid that smoothing the curve will actually make me loose some of the clustered peaks that I want, as in some of my other datasets there are even closer together. Maybe my fears are unfounded, though, and I am just misunderstanding how smoothing works. Any help would be appreciated.
Does anyone have a solution on how to pick out only "prominent" peaks? That is, only those peaks that are quick large compared to the others?

Starting with SciPy version 1.1.0 you may also use the function scipy.signal.find_peaks which allows you to select detected peaks based on their topographic prominence. This function is often easier to use than find_peaks_cwt. You'll have to play around a little bit to find the optimal lower bound to pass as a value to prominence but e.g. find_peaks(..., prominence=5) will ignore the unwanted peaks in your example. This should bring you reasonably close to your goal. If that's not enough you might do your own peak selection based upon peak properties like the left_/right_bases which are optionally returned.

I'd also recommend scipy.signal.find_peaks for what you're looking for. The other, older, scipy alternate find_peaks_cwt is quite complicated to use.
It will basically do what you're looking for in a single line. Apart from the prominence parameter that lagru mentioned, for your data either the threshold or height parameters might also do what you need.
height = 40 would filter to get all the peaks you like.
Prominence is a bit hard to wrap your head around for exactly what it does sometimes.

Confusion with bandwidth on seaborn's kdeplot

lineslist, below, represents a set of lines (for some chemical spectrum, let's say), in MHz. I know the linewidth of the laser used to probe these lines to be 5 MHz. So, naively, the kernel density estimate of these lines with a bandwidth of 5 should give me the continuous distribution that would be produced in an experiment using the aforementioned laser.
The following code:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
lineslist=np.array([-153.3048645 , -75.71982528, -12.1897835 , -73.94903264,
-178.14293936, -123.51339541, -118.11826988, -50.19812838,
-43.69282206, -34.21268228])
sns.kdeplot(lineslist, shade=True, color="r",bw=5)
plt.show()
yields
Which looks like a Gaussian with bandwidth much larger than 5 MHz.
I'm guessing that for some reason, the bandwidth of the kdeplot has different units than the plot itself. The separation between the highest and lowest line is ~170.0 MHz. Supposing that I need to rescale the bandwidth by this factor:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
lineslist=np.array([-153.3048645 , -75.71982528, -12.1897835 , -73.94903264,
-178.14293936, -123.51339541, -118.11826988, -50.19812838,
-43.69282206, -34.21268228])
sns.kdeplot(lineslist, shade=True, color="r",bw=5/(np.max(lineslist)-np.min(lineslist)))
plt.show()
I get:
With lines that seem to have the expected 5 MHz bandwidth.
As dandy as that solution is, I've pulled it from my arse, and I'm curious whether someone more familiar with seaborn's kdeplot internals can comment on why this is.
Thanks,
Samuel

One thing to note is that Seaborn doesn't actually handle the bandwidth itself - it passes the setting on more-or-less as-is to either SciPy or the Statsmodels packages, depending on what you have installed. (It prefers Statsmodels, but will fall back to SciPy.)
The documentation for this parameter in the various sub-packages is a little confusing, but from what I can tell, the key issue here is that the setting for SciPy is a bandwidth factor, rather than a bandwidth itself. That is, this factor is (effectively) multiplied by the standard deviation of the data you're plotting to give you the actual bandwidth used in the kernels.
So with SciPy, if you have a fixed number which you want to use as your bandwidth, you need to divide through by your data standard deviation. And if you're trying to plot multiple datasets consistently, you need to adjust for the standard deviation of each dataset. This adjustment effectively what you did by scaling by the range -- but again, it's not the range of the data that's the number used, but the standard deviation of the data.
To make things all the more confusing, Statsmodels expects the true bandwidth when given a scalar value, rather than a factor that's multiplied by the standard deviation of the sample. So depending on what backend you're using, Seaborn will behave differently. There's no direct way to tell Seaborn which backend to use - the best way to test is probably trying to import statsmodels, and seeing if that succeeds (takes bandwidth directly) or fails (takes bandwidth factor).
By the way, these results were tested against Seaborn version 0.7.0 - I expect (hope?) that versions in the future might change this behavior.

Matplotlib slow with large data sets, how to enable decimation?

I use matplotlib for a signal processing application and I noticed that it chokes on large data sets. This is something that I really need to improve to make it a usable application.
What I'm looking for is a way to let matplotlib decimate my data. Is there a setting, property or other simple way to enable that? Any suggestion of how to implement this are welcome.
Some code:
import numpy as np
import matplotlib.pyplot as plt
n=100000 # more then 100000 points makes it unusable slow
plt.plot(np.random.random_sample(n))
plt.show()
Some background information
I used to work on a large C++ application where we needed to plot large datasets and to solve this problem we used to take advantage of the structure of the data as follows:
In most cases, if we want a line plot then the data is ordered and often even equidistantial. If it is equidistantial, then you can calculate the start and end index in the data array directly from the zoom rectangle and the inverse axis transformation. If it is ordered but not equidistantial a binary search can be used.
Next the zoomed slice is decimated, and because the data is ordered we can simply iterate a block of points that fall inside one pixel. And for each block the mean, maximum and minimum is calculated. Instead of one pixel, we then draw a bar in the plot.
For example: if the x axis is ordered, a vertical line will be drawn for each block, possibly the mean with a different color.
To avoid aliasing the plot is oversampled with a factor of two.
In case it is a scatter plot, the data can be made ordered by sorting, because the sequence of plotting is not important.
The nice thing of this simple recipe is that the more you zoom in the faster it becomes. In my experience, as long as the data fits in memory the plots stays very responsive. For instance, 20 plots of timehistory data with 10 million points should be no problem.

It seems like you just need to decimate the data before you plot it
import numpy as np
import matplotlib.pyplot as plt
n=100000 # more then 100000 points makes it unusable slow
X=np.random.random_sample(n)
i=10*array(range(n/10))
plt.plot(X[i])
plt.show()

Decimation is not best for example if you decimate sparse data it might all appear as zeros.
The decimation has to be smart such that each LCD horizontal pixel is plotted with the min and the max of the data between decimation points. Then as you zoom in you see more an more detail.
With zooming this can not be done easy outside matplotlib and thus is better to handle internally.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.