How to validate that downsampling works as intended - python

How can I validate whether the downsampled output is correct? For example, I made the example below, but I am not sure whether the output is right.
Any ideas on how to validate it?
Code
import numpy as np
import matplotlib.pyplot as plt  # For plotting
from scipy import signal

fs = 100      # sample rate
rsample = 50  # number of samples after resampling
fTwo = 400    # frequency of the signal
x = np.arange(fs)
y = [np.sin(2*np.pi*fTwo * (i/fs)) for i in x]
f_res = signal.resample(y, rsample)
xnew = np.linspace(0, 100, f_res.size, endpoint=False)
#
# ##############################
#
plt.figure(1)
plt.subplot(211)
plt.stem(x, y)
plt.subplot(212)
plt.stem(xnew, f_res, 'r')
plt.show()

Plotting the data is a good first take at verification. Here I made a regular plot with the points connected by lines. The lines are useful because they give a guide for where you expect the downsampled data to lie, and they also emphasize what the downsampled data is missing. (It would also work to show lines only for the original data; two sets of lines, as in the stem plots, are too confusing, imho.)
import numpy as np
import matplotlib.pyplot as plt  # For plotting
from scipy import signal

fs = 100      # sample rate
rsample = 43  # number of samples after resampling
fTwo = 13     # frequency of the signal
x = np.arange(fs, dtype=float)
y = np.sin(2*np.pi*fTwo * (x/fs))
print(y)
f_res = signal.resample(y, rsample)
xnew = np.linspace(0, 100, f_res.size, endpoint=False)
#
# ##############################
#
plt.figure()
plt.plot(x, y, 'o')
plt.plot(xnew, f_res, 'or')
plt.show()
A few notes:
If you're trying to make a general algorithm, use non-round numbers; otherwise you can easily introduce bugs that don't show up when things are even multiples. Similarly, if you need to zoom in to verify, look at a few random places, not, for example, only the start.
Note that I changed fTwo to be significantly less than the number of samples. You need more than one data point per oscillation if you want to make sense of the signal.
I also removed the loop for calculating y: in general, you should vectorize calculations when using numpy.

The spectrum of the resampled signal should have a tone at the same frequency as the input signal, just within a smaller Nyquist bandwidth.
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
import scipy.fftpack as fft

fs = 100      # sample rate
rsample = 50  # downsampled rate
fTwo = 10     # frequency of the signal
n = np.arange(1024)
y = np.sin(2*np.pi*fTwo/fs*n)
y_res = signal.resample(y, len(n)//2)  # integer division for the sample count
Y = fft.fftshift(fft.fft(y))
f = fs*np.arange(-512, 512)/1024        # original axis: -fs/2 .. fs/2
Y_res = fft.fftshift(fft.fft(y_res, 1024))
f_res = fs/2*np.arange(-512, 512)/1024  # downsampled axis: -fs/4 .. fs/4
plt.figure(1)
plt.subplot(211)
plt.stem(f, abs(Y))
plt.subplot(212)
plt.stem(f_res, abs(Y_res))
plt.show()
The tone is still at 10 Hz.

If you downsample a signal, both signals will still have exactly the same value at a given time, so just loop through "time" and check that the values are the same. In your case you go from a sample rate of 100 to 50. Assuming you have 1 second's worth of data from building your x from fs, just loop from t = 0 to t = 1 in 1/50th increments and make sure that Yd(t) = Ys(t), where Yd is the downsampled signal and Ys is the original sampled signal. Put simply: Yd(n) = Ys(2n) for every sample n of the downsampled signal.
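A minimal sketch of that check (my addition, reusing the question's numbers): note that scipy.signal.resample is FFT-based, so the equality holds to numerical precision only for signals bandlimited below the new Nyquist frequency; a 13 Hz tone at a 50 Hz output rate qualifies.
import numpy as np
from scipy import signal

fs = 100                  # original sample rate
t = np.arange(fs) / fs    # 1 second of samples
y = np.sin(2*np.pi*13*t)  # 13 Hz tone, below both Nyquist limits
y_down = signal.resample(y, fs // 2)  # downsample by a factor of 2
# Yd(n) should equal Ys(2n): compare every downsampled point
# against the original sample taken at the same instant
for n in range(len(y_down)):
    assert np.isclose(y_down[n], y[2*n]), f"mismatch at n={n}"
print("all downsampled points match the original signal")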


How to plot frequency data from a .wav file in Python?

I'm trying to plot the frequencies that make up the first 1 second of a voice recording.
My approach was to:
Read the .wav file as a numpy array containing time series data
Slice the array from [0:sample_rate-1], given that the sample rate has units of [samples/1 second], which implies that sample_rate [samples/seconds] * 1 [seconds] = sample_rate [samples]
Perform a fast Fourier transform (FFT) on the time-series array in order to get the frequencies that make up that time-series sample.
Plot the frequencies on the x-axis and amplitude on the y-axis. The frequency domain will range from 0 to sample_rate/2, since the Nyquist sampling theorem tells us the recording was sampled at a rate of at least two times the maximum captured frequency, i.e. sample_rate >= 2*max(frequency). I'll also slice the frequency output array in half, since the output frequency data is symmetrical.
Here is my implementation
import matplotlib.pyplot as plt
import numpy as np
from scipy.fftpack import fft
from scipy.io import wavfile
sample_rate, audio_time_series = wavfile.read(audio_path)
single_sample_data = audio_time_series[:sample_rate]
def fft_plot(audio, sample_rate):
    N = len(audio)       # Number of samples
    T = 1 / sample_rate  # Sample period (not used below)
    y_freq = fft(audio)
    domain = len(y_freq) // 2
    x_freq = np.linspace(0, sample_rate // 2, N // 2)
    plt.plot(x_freq, abs(y_freq[:domain]))
    plt.xlabel("Frequency [Hz]")
    plt.ylabel("Frequency Amplitude |X(t)|")
    return plt.show()
fft_plot(single_sample_data, sample_rate)
This is the plot it generated:
However, this is incorrect; my spectrogram tells me I should have frequency peaks below the 5 kHz range:
In fact, what this plot is actually showing is the first second of my time-series data:
I was able to debug this by removing the absolute value function from y_freq when plotting it, and passing the entire audio signal into my fft_plot function:
...
sample_rate, audio_time_series = wavfile.read(audio_path)
single_sample_data = audio_time_series[:sample_rate]
def fft_plot(audio, sample_rate):
    N = len(audio)  # Number of samples
    y_freq = fft(audio)
    domain = len(y_freq) // 2
    x_freq = np.linspace(0, sample_rate // 2, N // 2)
    # Changed from abs(y_freq[:domain]) -> y_freq[:domain]
    plt.plot(x_freq, y_freq[:domain])
    plt.xlabel("Frequency [Hz]")
    plt.ylabel("Frequency Amplitude |X(t)|")
    return plt.show()
# Changed from single_sample_data -> audio_time_series
fft_plot(audio_time_series, sample_rate)
The code sample above produced this plot:
Therefore, I think one of two things is going on:
The fft() function is not actually performing an fft on the time series data it is being given
The .wav file does not contain time series data to begin with
What could be the issue? Has anyone else experienced this?
I have essentially replicated the code in the question, and I don't see the problem the OP describes.
In [172]: %reset -f
...: import matplotlib.pyplot as plt
...: import numpy as np
...: from scipy.fftpack import fft
...: from scipy.io import wavfile
...:
...: sr, data = wavfile.read('sample.wav')
...: print(data.shape, sr)
...: signal = data[:sr,0]
...: Signal = fft(signal)
...: fig, (axt, axf) = plt.subplots(2, 1,
...: constrained_layout=1,
...: figsize=(11.8,3))
...: axt.plot(signal, lw=0.15) ; axt.grid(1)
...: axf.plot(np.abs(Signal[:sr//2]), lw=0.15) ; axf.grid(1)
...: plt.show()
(268237, 2) 8000
Hence, I'm voting for closing the question because it is "Not reproducible or was caused by a typo".
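For what it's worth, one plausible reading of the OP's second experiment (my speculation, not something the OP confirmed): plotting y_freq[:domain] without abs() makes matplotlib cast the complex FFT output to its real part (it emits a ComplexWarning), so whatever that plot shows is Re(X(f)), not evidence that fft() failed. A quick way to see this:
import numpy as np
import matplotlib.pyplot as plt
from scipy.fftpack import fft

x = np.sin(2 * np.pi * 5 * np.arange(1000) / 1000)
X = fft(x)
plt.plot(X[:500])          # ComplexWarning: only the real part is drawn
plt.plot(np.abs(X[:500]))  # the magnitude spectrum you actually want
plt.show()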

Gaussian Mixture Model with discrete data

I have 136 numbers that come from an overlapping mixture of 8 Gaussian distributions. I want to find each Gaussian's mean and variance! Can you find any mistakes in my code?
file = open("1.txt", 'r')  # data in 1.txt looks like 0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194...
y = [int(i) for i in file.read().split(',')]  # list whose elements are the data above
x = list(range(1, len(y) + 1))  # the x values
z = list(zip(x, y))  # z elements are (1, 0), (2, 0), ...
Through the above process, I obtain a list z whose elements are the 136 points (x, y) in the xy-plane, with the given data as the y values.
Now I want to obtain each Gaussian distribution's mean and variance. The basic assumption is that the given data consist of 8 overlapping Gaussian distributions.
import numpy as np
from sklearn.mixture import GaussianMixture
data = np.array(z).reshape(-1,1)
model = GaussianMixture(n_components=8).fit(data)
print(model.means_)
file.close()
Actually, I don't know how to write the code that prints the 8 means and variances... Can anyone help me?
You can use this; I have made some sample code for your visualizations -
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline

# Sample data
x = [0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194]
num_components = 2

# Fit a model onto the data
data = np.array(x).reshape(-1, 1)
model = GaussianMixture(n_components=num_components).fit(data)

# Get list of means and standard deviations
mu = np.abs(model.means_.flatten())
sd = np.sqrt(np.abs(model.covariances_.flatten()))

# Plotting
extend_window = 50  # for zooming into or out of the graph; the higher it is, the more zoomed out
x_values = np.arange(data.min() - extend_window, data.max() + extend_window, 0.1)  # for plotting smooth curves
plt.plot(data, np.zeros(data.shape), linestyle='None', markersize=10.0, marker='o')  # plot the data on the x axis

# Plot the different distributions (in this case 2 of them)
for i in range(num_components):
    y_values = stats.norm(mu[i], sd[i])
    plt.plot(x_values, y_values.pdf(x_values))

Setting Wn for analog Bessel model in scipy.signal

I have been trying to implement an analog Bessel filter with a cutoff frequency of 2 kHz using scipy.signal, and I am confused about what value of Wn to set, as the documentation states that Wn (for analog filters) should be an angular frequency (approximately 12566 rad/s here). But if I apply this to my 1 second of dummy data (a half-second pulse sampled at 500 000 Hz), I get a string of 0s and NaNs. What is it that I am missing?
import numpy as np
import scipy
import matplotlib.pyplot as plt
import scipy.signal
def make_signal(pulse_length, rate=500000):
    new_x = np.zeros(rate)
    end_signal = 250000 + pulse_length
    new_x[250000:end_signal] = 1
    data = new_x
    print(np.shape(data))
    # pad on both sides
    data = np.concatenate((np.zeros(rate), data, np.zeros(rate)))
    return data

def conv_time(t):
    pulse_length = t * 500000
    pulse_length = int(pulse_length)
    return pulse_length

def make_data(ti):  # give time in seconds
    pulse_length = conv_time(ti)
    print(pulse_length)
    data = make_signal(pulse_length)
    return data
time_scale = np.linspace(0,1,500000)
data = make_data(0.5)
[b,a] = scipy.signal.bessel(4, 12566.37, btype='low', analog=True, output='ba', norm='phase', fs=None)
output_signal = scipy.signal.filtfilt(b, a, data)
plt.plot(data[600000:800000])
plt.plot(output_signal[600000:800000])
When I plot the response using freqs, it doesn't seem that bad to me; where am I making a mistake?
You are passing an analog filter to a function, scipy.signal.filtfilt, that expects a digital (i.e. discrete-time) filter. If you are going to use filtfilt or lfilter, the filter must be digital.
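For instance (a minimal sketch, not from the original answer), the same 4th-order 2 kHz Bessel filter can be designed in discrete time by dropping analog=True and passing the sample rate via the fs argument, after which filtfilt behaves as expected:
import numpy as np
from scipy.signal import bessel, filtfilt

fs = 500000  # sample rate of the pulse data, in Hz
b, a = bessel(4, 2000, btype='low', analog=False, norm='phase', fs=fs)

# Apply to a half-second pulse like the one in the question
t = np.arange(fs) / fs
data = np.where((t >= 0.25) & (t < 0.75), 1.0, 0.0)
filtered = filtfilt(b, a, data)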
To work with continuous time systems, take a look at the functions
scipy.signal.impulse (scipy.signal.impulse2)
scipy.signal.step (scipy.signal.step2)
scipy.signal.lsim (scipy.signal.lsim2)
(The 2 versions solve the same mathematical problem as the version without 2 but use a different method. In most cases, the version without 2 is fine and is much faster than the 2 version.)
Other related functions and classes are listed in the section Continuous-Time Linear Systems of the SciPy documentation.
For example, here's a script that plots the impulse and step responses of your Bessel filter:
import numpy as np
from scipy.signal import bessel, step, impulse
import matplotlib.pyplot as plt
order = 4
Wn = 2*np.pi * 2000
b, a = bessel(order, Wn, btype='low', analog=True, output='ba', norm='phase')
# Note: the upper limit for t was chosen after some experimentation.
# If you don't give a T argument to impulse or step, it will choose
# a "pretty good" time span.
t = np.linspace(0, 0.00125, 2500, endpoint=False)
timp, yimp = impulse((b, a), T=t)
tstep, ystep = step((b, a), T=t)
plt.subplot(2, 1, 1)
plt.plot(timp, yimp, label='impulse response')
plt.legend(loc='upper right', framealpha=1, shadow=True)
plt.grid(alpha=0.25)
plt.title('Impulse and step response of the Bessel filter')
plt.subplot(2, 1, 2)
plt.plot(tstep, ystep, label='step response')
plt.legend(loc='lower right', framealpha=1, shadow=True)
plt.grid(alpha=0.25)
plt.xlabel('t')
plt.show()
The script generates this plot:

Frequency domain of a sine wave with frequency 1000Hz

I'm starting DSP in Python and I'm having some difficulties:
I'm trying to define a sine wave with a frequency of 1000 Hz
I try to do the FFT and find its frequency with the following piece of code:
import numpy as np
import matplotlib.pyplot as plt
sampling_rate = int(10e3)
n = int(10e3)
sine_wave = [100*np.sin(2 * np.pi * 1000 * x/sampling_rate) for x in range(0, n)]
s = np.array(sine_wave)
print(s)
plt.plot(s[:200])
plt.show()
s_fft = np.fft.fft(s)
frequencies = np.abs(s_fft)
plt.plot(frequencies)
plt.show()
So the first plot makes sense to me.
The second plot (FFT) shows two frequencies:
i) 1000 Hz, which is the one I set at the beginning
ii) 9000 Hz, unexpectedly
Frequency-domain plot:
Your frequency axis is not set correctly: the peak at 9000 Hz is simply the negative-frequency image of your 1000 Hz tone, which shows up there because the full FFT is plotted against bin indices. Also make sure your data respect the Shannon criterion (the signal frequency must stay below sampling_rate/2).
It's also easier to use rfft rather than fft when the signal is real.
Your code can be adapted like:
import numpy as np
import matplotlib.pyplot as plt

sampling_rate = 10000
n = 10000
signal_freq = 4000  # must be < sampling_rate/2
amplitude = 100
t = np.arange(0, n/sampling_rate, 1/sampling_rate)
sine_wave = amplitude * np.sin(2 * np.pi * signal_freq * t)

plt.subplot(211)
plt.plot(t[:30], sine_wave[:30], 'ro')

spectrum = 2/n * np.abs(np.fft.rfft(sine_wave))
frequencies = np.fft.rfftfreq(n, 1/sampling_rate)
plt.subplot(212)
plt.plot(frequencies, spectrum)
plt.show()
Output:
There is no information loss, even if a human eye can be troubled by the time-domain representation.
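To confirm the peak numerically rather than by eye, the dominant frequency can be read straight off the rfft output (a small addition, reusing frequencies and spectrum from the code above):
peak_hz = frequencies[np.argmax(spectrum)]
print(peak_hz)  # ~4000.0 for the example above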

Analyzing seasonality of Google trend time series using FFT

I am trying to evaluate the amplitude spectrum of the Google Trends time series using a fast Fourier transform. If you look at the data for 'diet' in the data provided here, it shows a very strong seasonal pattern:
I thought I could analyze this pattern using an FFT, which presumably should show a strong peak for a period of 1 year.
However, when I apply an FFT like this (a_gtrend_ham being the time series multiplied by a Hamming window):
import matplotlib.pyplot as plt
import numpy as np
from numpy.fft import fft, fftshift
import pandas as pd
gtrend = pd.read_csv('multiTimeline.csv',index_col=0)
gtrend.index = pd.to_datetime(gtrend.index, format='%Y-%m')
# Sampling rate
fs = 12 #Points per year
a_gtrend_orig = gtrend['diet: (Worldwide)']
N_gtrend_orig = len(a_gtrend_orig)
length_gtrend_orig = N_gtrend_orig / fs
t_gtrend_orig = np.linspace(0, length_gtrend_orig, num = N_gtrend_orig, endpoint = False)
a_gtrend_sel = a_gtrend_orig.loc['2005-01-01 00:00:00':'2017-12-01 00:00:00']
N_gtrend = len(a_gtrend_sel)
length_gtrend = N_gtrend / fs
t_gtrend = np.linspace(0, length_gtrend, num = N_gtrend, endpoint = False)
a_gtrend_zero_mean = a_gtrend_sel - np.mean(a_gtrend_sel)
ham = np.hamming(len(a_gtrend_zero_mean))
a_gtrend_ham = a_gtrend_zero_mean * ham
N_gtrend = len(a_gtrend_ham)
ampl_gtrend = 1/N_gtrend * abs(fft(a_gtrend_ham))
mag_gtrend = fftshift(ampl_gtrend)
freq_gtrend = np.linspace(-0.5, 0.5, len(ampl_gtrend))
response_gtrend = 20 * np.log10(mag_gtrend)
response_gtrend = np.clip(response_gtrend, -100, 100)
My resulting amplitude spectrum does not show any dominant peak:
Where is my misunderstanding of how to use the FFT to get the spectrum of the data series?
Here is a clean implementation of what I think you are trying to accomplish. I include graphical output and a brief discussion of what it likely means.
First, we use rfft() because the data is real-valued. This saves time and effort (and reduces the bug rate) that would otherwise follow from generating the redundant negative frequencies. And we use rfftfreq() to generate the frequency list (again, it is unnecessary to hand-code it, and using the API reduces the bug rate).
For your data, the Tukey window is more appropriate than the Hamming and similar cos- or sin-based window functions. Notice also that we subtract the median before multiplying by the window function. The median() is a fairly robust estimate of the baseline, certainly more so than the mean().
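(A quick illustration of that robustness, with made-up numbers: a single spike drags the mean far off the baseline but barely moves the median.)
import numpy as np

x = np.array([10, 11, 9, 10, 12, 10, 95])  # one spike in otherwise flat data
print(np.mean(x))    # ~22.4, pulled off the baseline by the spike
print(np.median(x))  # 10.0, still a good baseline estimate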
In the graph you can see that the data falls quickly from its initial value and then ends low. The Hamming and similar windows sample the middle too narrowly for this and needlessly attenuate a lot of useful data.
For the FT graphs, we skip the zero frequency bin (the first point) since this only contains the baseline and omitting it provides a more convenient scaling for the y-axes.
You will notice some high frequency components in the graph of the FT output.
I include a sample code below that illustrates a possible origin of those high frequency components.
Okay here is the code:
import matplotlib.pyplot as plt
import numpy as np
from numpy.fft import rfft, rfftfreq
from scipy.signal.windows import tukey
import pandas as pd
gtrend = pd.read_csv('multiTimeline.csv',index_col=0,skiprows=2)
#print(gtrend)
gtrend.index = pd.to_datetime(gtrend.index, format='%Y-%m')
#print(gtrend.index)
a_gtrend_orig = gtrend['diet: (Worldwide)']
t_gtrend_orig = np.linspace( 0, len(a_gtrend_orig)/12, len(a_gtrend_orig), endpoint=False )
a_gtrend_windowed = (a_gtrend_orig-np.median( a_gtrend_orig ))*tukey( len(a_gtrend_orig) )
plt.subplot( 2, 1, 1 )
plt.plot( t_gtrend_orig, a_gtrend_orig, label='raw data' )
plt.plot( t_gtrend_orig, a_gtrend_windowed, label='windowed data' )
plt.xlabel( 'years' )
plt.legend()
a_gtrend_psd = abs(rfft( a_gtrend_orig ))
a_gtrend_psdtukey = abs(rfft( a_gtrend_windowed ) )
# Notice that we assert the delta-time here,
# It would be better to get it from the data.
a_gtrend_freqs = rfftfreq( len(a_gtrend_orig), d = 1./12. )
# For the PSD graph, we skip the first point (the zero-frequency bin);
# it represents the baseline (or mean) and is usually not relevant to the
# analysis, and omitting it brings the plot to a more useful scale
plt.subplot( 2, 1, 2 )
plt.plot( a_gtrend_freqs[1:], a_gtrend_psd[1:], label='psd raw data' )
plt.plot( a_gtrend_freqs[1:], a_gtrend_psdtukey[1:], label='windowed psd' )
plt.xlabel( 'frequency ($yr^{-1}$)' )
plt.legend()
plt.tight_layout()
plt.show()
And here is the output displayed graphically. There are strong signals at 1/year and at 0.14 yr^-1 (a period of about 7 years, half the 14-year record), and there is a set of higher-frequency signals that at first perusal might seem quite mysterious.
We see that the windowing function is actually quite effective in bringing the data to baseline and you see that the relative signal strengths in the FT are not altered very much by applying the window function.
If you look at the data closely, there seems to be some repeated variations within the year. If those occur with some regularity, they can be expected to appear as signals in the FT, and indeed the presence or absence of signals in the FT is often used to distinguish between signal and noise. But as will be shown, there is a better explanation for the high frequency signals.
Okay, now here is a sample code that illustrates one way those high frequency components can be produced. In this code, we create a single tone, and then we create a set of spikes at the same frequency as the tone. Then we Fourier transform the two signals and finally, graph the raw and FT data.
import matplotlib.pyplot as plt
import numpy as np
from numpy.fft import rfft, rfftfreq
t = np.linspace( 0, 1, 1000 )  # the sample count must be an integer
y = np.cos( 50*3.14*t )
y2 = [ 1. if 1.-v < 0.01 else 0. for v in y ]  # spikes at the tone's peaks
plt.subplot( 2, 1, 1 )
plt.plot( t, y, label='tone' )
plt.plot( t, y2, label='spikes' )
plt.xlabel('time')
plt.subplot( 2, 1, 2 )
plt.plot( rfftfreq(len(y), d=1/1000.), abs( rfft(y) ), label='tone' )   # d matches the 1/1000 s spacing of t
plt.plot( rfftfreq(len(y2), d=1/1000.), abs( rfft(y2) ), label='spikes' )
plt.xlabel('frequency')
plt.legend()
plt.tight_layout()
plt.show()
Okay, here are the graphs of the tone, and the spikes, and then their Fourier transforms. Notice that the spikes produce high frequency components that are very similar to those in our data.
In other words, the origin of the high-frequency components is very likely the short time scales associated with the spiky character of the signals in the raw data.
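As an aside, the code above hard-codes the sample spacing (d = 1./12.); as its comment notes, it would be better to get it from the data. A possible sketch, assuming gtrend still has the monthly DatetimeIndex built earlier:
import numpy as np

# Infer the sample spacing, in years, from the datetime index itself
deltas = np.diff(gtrend.index.values).astype('timedelta64[D]').astype(float)
d_years = deltas.mean() / 365.25  # ~1/12 for monthly data
print(d_years)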
