what is the difference between the two datasets for numpy.fft - python

I am trying to find the period of a sine curve, and I can find the right period for sin(t).
However, for sin(k*t) the recovered frequency shifts, and I do not know how it shifts.
I can adjust the value of interd below to get the right result, but only because I already know the dataset is sin(0.6*t).
Why do I get the right result for sin(t) but not here?
Can anyone detect the right frequency based on my code alone, or with just a small change?
The figure below is the power spectral density of sin(0.6*t).
The dataset looks like:
1,sin(1*0.6)
2,sin(2*0.6)
3,sin(3*0.6)
.........
2000,sin(2000*0.6)
And my code:
import numpy as np
import matplotlib.pyplot as pl

timepoints = np.loadtxt('dataset', usecols=(0,), unpack=True, delimiter=",")
intensity = np.loadtxt('dataset', usecols=(1,), unpack=True, delimiter=",")
binshu = 300
lastime = 2000
interd = 2000.0/300
sp = np.fft.fft(intensity)
freq = np.fft.fftfreq(len(intensity), d=interd)
freqnum = freq.argsort()
pl.xlabel("frequency(Hz)")
pl.plot(freq[freqnum]*6.28, np.sqrt(sp.real**2+sp.imag**2)[freqnum])

I think you're making it too complicated. If you consider timepoints to be in seconds, then interd is 1 (the difference between consecutive values in timepoints). This works fine for me:
import numpy as np
import matplotlib.pyplot as pl
# you can do this in one line, that's what 'unpack' is for:
timepoints, intensity = np.loadtxt('dataset', usecols=(0,1), unpack=True, delimiter=",")
interd = timepoints[1] - timepoints[0] # if this is 1, it can be ignored
sp = np.fft.fft(intensity)
freq = np.fft.fftfreq(len(intensity), d=interd)
pl.plot(np.fft.fftshift(freq), np.fft.fftshift(np.abs(sp)))
pl.xlabel("frequency(Hz)")
pl.show()
You'll also note that I didn't sort the frequencies, that's what fftshift is for.
Also, don't do np.sqrt(sp.imag**2 + sp.real**2), that's what np.abs is for :)
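To actually read off the period the question is after, here is a minimal sketch that builds on the freq and sp arrays computed above (the expected result of roughly 0.6 assumes the sin(0.6*t) dataset described in the question):
import numpy as np
# locate the dominant positive-frequency peak and convert it to an angular frequency
pos = freq > 0
peak_freq = freq[pos][np.argmax(np.abs(sp[pos]))]
print("angular frequency ~", 2 * np.pi * peak_freq)   # roughly 0.6 for sin(0.6*t)
With interd = 1 and 2000 samples, the frequency resolution is 1/2000, so the peak lands within one frequency bin (about 0.003 in angular units) of 0.6.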
If you're not sampling densely enough (the signal's angular frequency exceeds the Nyquist limit, i.e., k > pi/interd), then there's no way for the FFT to know how much data you're missing, so it assumes you're not missing any. You can't expect it to know a priori. This is the data you're giving it:

Related

Interpolate: spectra (wavelength, counts) at a given temperature, to create grid of temperature and counts

I have a number of spectra: wavelength/counts at a given temperature. The wavelength range is the same for each spectrum.
I would like to interpolate between temperature and counts to create a large grid of spectra (temperature and counts over the given wavelength range).
The code below is my current progress. When I try to get a spectrum for a given temperature, I only get one value of counts, when I need a range of counts representing the spectrum (I already know the wavelengths).
I think I am confused about arrays and interpolation. What am I doing wrong?
import pandas as pd
import numpy as np
from scipy import interpolate
image_template_one = pd.read_excel("mr_image_one.xlsx")
counts = np.array(image_template_one['counts'])
temp = np.array(image_template_one['temp'])
inter = interpolate.interp1d(temp, counts, kind='linear')
temp_new = np.arange(30, 50, 0.5)  # 0.5-degree steps (np.linspace's third argument is a number of points, not a step size)
counts_new = inter(temp_new)
I now think that I have two arrays, [wavelength, counts] and [wavelength, temperature]. Is this correct, and do I need to interpolate between the arrays?
Example data
I think what you want to achieve can be done with interp2d:
import numpy as np
import pandas as pd
from scipy import interpolate

# dummy data
data = pd.DataFrame({
    'temp': [30]*6 + [40]*6 + [50]*6,
    'wave': 3 * [a for a in range(400, 460, 10)],
    'counts': np.random.uniform(.93, .95, 18),
})
# make the interpolator
inter = interpolate.interp2d(data['temp'], data['wave'], data['counts'])
# scipy's interpolators return functions,
# which you need to call with the values you want interpolated.
new_x, new_y = np.linspace(30,50,100), np.linspace(400,450,100)
interpolated_values = inter(new_x, new_y)
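To get a whole spectrum back for a single temperature (the point the question is stuck on), evaluate the interpolator at one temperature across the full wavelength grid. A minimal sketch using the dummy data above; the temperature of 35 is just an illustrative value, and note that newer SciPy versions deprecate interp2d in favour of other interpolators:
waves = np.arange(400, 460, 10)          # the known wavelength grid from the dummy data
spectrum_35 = inter(35, waves).ravel()   # one interpolated counts value per wavelength
print(spectrum_35)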

Problem with converting octave code to python/pandas - wrong signal processing due to incorrect float64 values

I did not find answers covering my whole problem; each existing question deals with just part of it. After a few days of trying I decided to post a question.
I am doing biomechanical research that involves computing the maximum velocity of kicks. There are three kicks captured in each file (so simply finding the global maximum won't do); I need the maximum velocity of each kick. With some help I managed to do it in MATLAB/Octave, but for future work I decided to stick with Python for data processing.
The point is that I have time, x, y, z data for a specific marker, and I need to compute its velocity for each recorded frame and pick the maximum velocity of each kick.
This is the code in Octave:
pkg load signal
txyz=importdata('295ltoe.txt',',',8); % read the text file
txyz=txyz.data; % all data in array time,x,y,z
dxyz=diff(txyz); % first differences of all columns
vxyz=dxyz(:,2:end)./dxyz(:,1)/1000; % compute velocity components
v=sqrt(sum(vxyz.^2,2)); % and the total velocity
[pks,locs]=findpeaks(v,'minpeakheight',6 )
I tried to convert it to pandas with this code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
data1 = pd.read_csv("B0264_dollyo_air_P_T01 rtoe.txt")  # example of a txt file imported from a c3d file
df = data1.diff()
dfx = (df['X'] /1000) / df['T']
dfy = (df['Y'] /1000) / df['T']
dfz = (df['Z'] /1000) / df['T']
dfx1 = dfx**2
dfy1 = dfy**2
dfz1 = dfz**2
v = (dfx1 + dfy1 + dfz1)**1/2
peaks, _ = find_peaks(v, height=6)
plt.plot(v)
plt.plot(peaks, v[peaks], "x")
plt.show()
The problem is that the velocity comes out with values like this:
0 NaN
1 6.450000e-07
2 8.237500e-07
3 1.159062e-06
4 1.250312e-06
5 1.657500e-06
instead of the normal, correct values, which are greater than 60. I am attaching the plots I received vs. the correct Excel plot (computing this in Excel is time-consuming due to constant copy-pasting).
My overall aim is to get the 3 maximum peaks and 3 minimum peaks in order to compute the execution time of each kick, but I do not know how to obtain them.
For now: if anyone is willing to help, I can provide the files I have used.

How to validate the downsampling is as intended

How can I validate whether the downsampled output is correct? For example, I made an example below, but I am not sure whether its output is correct or not.
Any idea on how to validate it?
Code
import numpy as np
import matplotlib.pyplot as plt # For plotting
from scipy import signal
import mne
fs = 100 # sample rate
rsample=50 # downsample frequency
fTwo=400 # frequency of the signal
x = np.arange(fs)
y = [ np.sin(2*np.pi*fTwo * (i/fs)) for i in x]
f_res = signal.resample(y, rsample)
xnew = np.linspace(0, 100, f_res.size, endpoint=False)
#
# ##############################
#
plt.figure(1)
plt.subplot(211)
plt.stem(x, y)
plt.subplot(212)
plt.stem(xnew, f_res, 'r')
plt.show()
Plotting the data is a good first take at verification. Here I made a regular plot with the points connected by lines. The lines are useful since they give a guide for where you expect the down-sampled data to lie, and they also emphasize what the down-sampled data is missing. (It would also work to show lines only for the original data, but no connecting lines at all, as in a stem plot, is too confusing, imho.)
import numpy as np
import matplotlib.pyplot as plt # For plotting
from scipy import signal
fs = 100 # sample rate
rsample=43 # downsample frequency
fTwo=13 # frequency of the signal
x = np.arange(fs, dtype=float)
y = np.sin(2*np.pi*fTwo * (x/fs))
print(y)
f_res = signal.resample(y, rsample)
xnew = np.linspace(0, 100, f_res.size, endpoint=False)
#
# ##############################
#
plt.figure()
plt.plot(x, y, 'o')
plt.plot(xnew, f_res, 'or')
plt.show()
A few notes:
If you're trying to make a general algorithm, use non-rounded numbers, otherwise you could easily introduce bugs that don't show up when things are even multiples. Similarly, if you need to zoom in to verify, go to a few random places, not, for example, only the start.
Note that I changed fTwo to be significantly less than the number of samples. You need more than one data point per oscillation if you want to make sense of the signal.
I also removed the loop for calculating y: in general, you should try to vectorize calculations when using numpy, as shown in the sketch below.
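As a small illustration of that last note, here is a sketch (with arbitrary values for fs and fTwo) showing that the loop and the vectorized version produce the same samples:
import numpy as np
fs, fTwo = 100, 13                     # sample rate and tone frequency, arbitrary here
x = np.arange(fs, dtype=float)
y_loop = np.array([np.sin(2*np.pi*fTwo * (i/fs)) for i in x])  # original loop version
y_vec = np.sin(2*np.pi*fTwo * (x/fs))                          # vectorized version
print(np.allclose(y_loop, y_vec))                              # True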
The spectrum of the resampled signal should have a tone at the same frequency as the input signal, just within a smaller Nyquist bandwidth.
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
import scipy.fftpack as fft
fs = 100 # sample rate
rsample=50 # downsample frequency
fTwo=10 # frequency of the signal
n = np.arange(1024)
y = np.sin(2*np.pi*fTwo/fs*n)
y_res = signal.resample(y, len(n)//2)          # downsample by a factor of 2
Y = fft.fftshift(fft.fft(y))
f = fs*np.arange(-512, 512)/1024               # frequency axis of the original signal
Y_res = fft.fftshift(fft.fft(y_res, 1024))     # zero-padded to the same FFT length
f_res = fs/2*np.arange(-512, 512)/1024         # frequency axis at the new sample rate
plt.figure(1)
plt.subplot(211)
plt.stem(f, abs(Y))
plt.subplot(212)
plt.stem(f_res, abs(Y_res))
plt.show()
The tone is still at 10.
If you downsample a signal, both signals will still have the exact same value at a given time, so just loop through "time" and check that the values are the same. In your case you go from a sample rate of 100 to 50. Assuming you have 1 second's worth of data from building your x from fs, just loop through t = 0 to t = 1 in 1/50th increments and make sure that Yd(t) = Ys(t), where Yd is the downsampled signal and Ys is the originally sampled signal. Or, to say it simply, Yd(n) = Ys(2n) for every sample index n of the downsampled signal.
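A minimal sketch of that check, reusing the fs, rsample, and fTwo values from the examples above (note that scipy.signal.resample is FFT-based, so the downsampled points only match the original samples up to numerical precision, not by construction):
import numpy as np
from scipy import signal
fs, rsample, fTwo = 100, 50, 10      # sample rates and tone frequency from the examples above
x = np.arange(fs, dtype=float)
ys = np.sin(2*np.pi*fTwo * (x/fs))   # original signal: 1 second sampled at 100 Hz
yd = signal.resample(ys, rsample)    # downsampled to 50 Hz
# Yd(n) should line up with Ys(2n) at the shared time instants
print(np.max(np.abs(yd - ys[::2])))  # tiny residual for this band-limited, periodic tone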

Analyzing seasonality of Google trend time series using FFT

I am trying to evaluate the amplitude spectrum of a Google Trends time series using a fast Fourier transform. If you look at the data for 'diet' in the data provided here, it shows a very strong seasonal pattern:
I thought I could analyze this pattern using an FFT, which presumably should show a strong peak for a period of 1 year.
However, when I apply an FFT like this (a_gtrend_ham being the time series multiplied by a Hamming window):
import matplotlib.pyplot as plt
import numpy as np
from numpy.fft import fft, fftshift
import pandas as pd
gtrend = pd.read_csv('multiTimeline.csv',index_col=0)
gtrend.index = pd.to_datetime(gtrend.index, format='%Y-%m')
# Sampling rate
fs = 12 #Points per year
a_gtrend_orig = gtrend['diet: (Worldwide)']
N_gtrend_orig = len(a_gtrend_orig)
length_gtrend_orig = N_gtrend_orig / fs
t_gtrend_orig = np.linspace(0, length_gtrend_orig, num = N_gtrend_orig, endpoint = False)
a_gtrend_sel = a_gtrend_orig.loc['2005-01-01 00:00:00':'2017-12-01 00:00:00']
N_gtrend = len(a_gtrend_sel)
length_gtrend = N_gtrend / fs
t_gtrend = np.linspace(0, length_gtrend, num = N_gtrend, endpoint = False)
a_gtrend_zero_mean = a_gtrend_sel - np.mean(a_gtrend_sel)
ham = np.hamming(len(a_gtrend_zero_mean))
a_gtrend_ham = a_gtrend_zero_mean * ham
N_gtrend = len(a_gtrend_ham)
ampl_gtrend = 1/N_gtrend * abs(fft(a_gtrend_ham))
mag_gtrend = fftshift(ampl_gtrend)
freq_gtrend = np.linspace(-0.5, 0.5, len(ampl_gtrend))
response_gtrend = 20 * np.log10(mag_gtrend)
response_gtrend = np.clip(response_gtrend, -100, 100)
My resulting amplitude spectrum does not show any dominant peak:
Where is my misunderstanding of how to use the FFT to get the spectrum of the data series?
Here is a clean implementation of what I think you are trying to accomplish. I include graphical output and a brief discussion of what it likely means.
First, we use the rfft() because the data is real valued. This saves time and effort (and reduces the bug rate) that otherwise follows from generating the redundant negative frequencies. And we use rfftfreq() to generate the frequency list (again, it is unnecessary to hand code it, and using the api reduces the bug rate).
For your data, the Tukey window is more appropriate than the Hamming and similar cos or sin based window functions. Notice also that we subtract the median before multiplying by the window function. The median() is a fairly robust estimate of the baseline, certainly more so than the mean().
In the graph you can see that the data falls quickly from its initial value and then ends low. The Hamming and similar windows sample the middle too narrowly for this and needlessly attenuate a lot of useful data.
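As a small aside (my own illustrative sketch, not part of the original answer), plotting the two window shapes side by side shows what "samples the middle too narrowly" means: the Hamming tapers almost the whole record, while the Tukey stays flat over the middle and only tapers the ends:
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import tukey   # scipy.signal.windows.tukey in newer SciPy versions
N = 156                                   # roughly 13 years of monthly samples
plt.plot(np.hamming(N), label='Hamming')  # bell-shaped: attenuates everything away from the centre
plt.plot(tukey(N), label='Tukey')         # flat in the middle, tapered only at the edges
plt.legend()
plt.show()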
For the FT graphs, we skip the zero frequency bin (the first point) since this only contains the baseline and omitting it provides a more convenient scaling for the y-axes.
You will notice some high frequency components in the graph of the FT output.
I include a sample code below that illustrates a possible origin of those high frequency components.
Okay here is the code:
import matplotlib.pyplot as plt
import numpy as np
from numpy.fft import rfft, rfftfreq
from scipy.signal import tukey
import pandas as pd
gtrend = pd.read_csv('multiTimeline.csv',index_col=0,skiprows=2)
#print(gtrend)
gtrend.index = pd.to_datetime(gtrend.index, format='%Y-%m')
#print(gtrend.index)
a_gtrend_orig = gtrend['diet: (Worldwide)']
t_gtrend_orig = np.linspace( 0, len(a_gtrend_orig)/12, len(a_gtrend_orig), endpoint=False )
a_gtrend_windowed = (a_gtrend_orig-np.median( a_gtrend_orig ))*tukey( len(a_gtrend_orig) )
plt.subplot( 2, 1, 1 )
plt.plot( t_gtrend_orig, a_gtrend_orig, label='raw data' )
plt.plot( t_gtrend_orig, a_gtrend_windowed, label='windowed data' )
plt.xlabel( 'years' )
plt.legend()
a_gtrend_psd = abs(rfft( a_gtrend_orig ))
a_gtrend_psdtukey = abs(rfft( a_gtrend_windowed ) )
# Notice that we assert the delta-time here,
# It would be better to get it from the data.
a_gtrend_freqs = rfftfreq( len(a_gtrend_orig), d = 1./12. )
# For the PSD graph, we skip the first point; this brings us to a more useful scale.
# That point represents the baseline (or mean) and is usually not relevant to the analysis.
plt.subplot( 2, 1, 2 )
plt.plot( a_gtrend_freqs[1:], a_gtrend_psd[1:], label='psd raw data' )
plt.plot( a_gtrend_freqs[1:], a_gtrend_psdtukey[1:], label='windowed psd' )
plt.xlabel( 'frequency ($yr^{-1}$)' )
plt.legend()
plt.tight_layout()
plt.show()
And here is the output displayed graphically. There are strong signals at 1/year and at about 0.14 per year (roughly one cycle per 7 years, i.e., twice 1/(14 yr)), and there is a set of higher frequency signals that at first perusal might seem quite mysterious.
We see that the windowing function is actually quite effective in bringing the data to baseline and you see that the relative signal strengths in the FT are not altered very much by applying the window function.
If you look at the data closely, there seems to be some repeated variations within the year. If those occur with some regularity, they can be expected to appear as signals in the FT, and indeed the presence or absence of signals in the FT is often used to distinguish between signal and noise. But as will be shown, there is a better explanation for the high frequency signals.
Okay, now here is a sample code that illustrates one way those high frequency components can be produced. In this code, we create a single tone, and then we create a set of spikes at the same frequency as the tone. Then we Fourier transform the two signals and finally, graph the raw and FT data.
import matplotlib.pyplot as plt
import numpy as np
from numpy.fft import rfft, rfftfreq
t = np.linspace( 0, 1, 1000 )
y = np.cos( 50*3.14*t )
y2 = [ 1. if 1.-v < 0.01 else 0. for v in y ]
plt.subplot( 2, 1, 1 )
plt.plot( t, y, label='tone' )
plt.plot( t, y2, label='spikes' )
plt.xlabel('time')
plt.subplot( 2, 1, 2 )
plt.plot( rfftfreq(len(y),d=1/100.), abs( rfft(y) ), label='tone' )
plt.plot( rfftfreq(len(y2),d=1/100.), abs( rfft(y2) ), label='spikes' )
plt.xlabel('frequency')
plt.legend()
plt.tight_layout()
plt.show()
Okay, here are the graphs of the tone, and the spikes, and then their Fourier transforms. Notice that the spikes produce high frequency components that are very similar to those in our data.
In other words, the origin of the high frequency components is very likely in the short time scales associated with the spikey character of signals in the raw data.
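As a final check on the seasonal peak itself, here is a hedged sketch (my own addition, reusing the a_gtrend_freqs and a_gtrend_psdtukey arrays from the first code block) that reads off the dominant non-zero frequency directly:
import numpy as np
# skip the baseline bin at zero frequency, then find the strongest remaining component
peak_idx = 1 + np.argmax(a_gtrend_psdtukey[1:])
print("dominant frequency:", a_gtrend_freqs[peak_idx], "per year")  # expected to be close to 1.0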

Determining if data in a txt file obeys certain statistics

I'm working with a Geiger counter that can be hooked up to a computer and that records its output in a .txt file, NC.txt, listing the time since starting and the 'value' of the radiation it recorded. My code so far:
import pylab
import scipy.stats
import numpy as np
import matplotlib.pyplot as plt
x1 = []
y1 = []
# lists to hold the time and value columns
f = open("NC.txt", "r")
for line in f:
    line = line.strip()
    parts = line.split(",")  # the columns are separated by commas and spaces
    time = float(parts[1])   # time is recorded in the second column of NC.txt
    value = float(parts[2])  # and the value it records is in the third
    x1.append(time)
    y1.append(value)
f.close()
xv = np.array(x1)
yv = np.array(y1)
#Statistics
m = np.mean(yv)
d = np.std(yv)
#Strip out background radiation
trueval = yv - m
#Basic plot of counts
num_bins = 10000
plt.hist(trueval,num_bins)
plt.xlabel('Value')
plt.ylabel('Count')
plt.show()
So this code so far will just create a simple histogram of the radiation counts centred at zero, so the background radiation is ignored.
What I want to do now is perform a chi-squared test to see how well the data fits, say, Poisson statistics (and then go on to compare it with other distributions later). I'm not really sure how to do that. I have access to scipy and numpy, so I feel like this should be a simple task, but I'm just learning Python as I go here, so I'm not a terrific programmer.
Does anyone know of a straightforward way to do this?
Edit for clarity: I'm not asking so much about whether there is a chi-squared function or not. I'm more interested in how to compare the data with other statistical distributions.
Thanks in advance.
You can use the SciPy library; see its documentation and examples.
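A minimal sketch of what a Poisson goodness-of-fit check could look like, assuming yv holds the integer counts read in the question's code; the binning on unique values and the use of scipy.stats.poisson and scipy.stats.chisquare are my own illustrative choices, not something given in the original answer:
import numpy as np
from scipy import stats
# stand-in data so the sketch runs on its own; replace with the yv array from the question
yv = np.random.poisson(4.2, size=5000)
lam = yv.mean()                                 # Poisson parameter estimated from the data
values, observed = np.unique(yv, return_counts=True)
# expected frequencies under a Poisson(lam) model, rescaled to match the observed total
expected = stats.poisson.pmf(values, lam) * len(yv)
expected *= observed.sum() / expected.sum()
# ddof=1 accounts for the one parameter (lam) estimated from the data;
# for a real test, bins with very small expected counts should be merged first
chi2, p = stats.chisquare(observed, expected, ddof=1)
print(chi2, p)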
