How can I detect periodicity using auto-correlation automatically? - python

This is my code:
import matplotlib.pyplot as plt
import numpy as np
from pandas.plotting import autocorrelation_plot
y = np.sin(np.arange(1,6*3.14,0.1))
autocorrelation_plot(y)
plt.show()
And this is the output of the auto-correlation plot:
auto-correlation plot of y
I would like to figure out a way to classify whether the function is periodic or not automatically (without eyeballing the autocorrelation plot). I read that it is related to the confidence interval, which is the line shown in the attached plot, but I still have doubts about what I should do with it to make the decision. So is there an automated way of using auto-correlation to decide on the periodicity of the data?
Though, this is my try for an automated way:
result = np.correlate(y, y, mode = "full")
ACF = result[np.round(result.size/2).astype(int):]
ACF = ACF/ACF[0]
acceptedVar = []
for i in range(len(ACF)):
    if ACF[i] > 0.05:
        acceptedVar = np.append(acceptedVar, ACF[i])
percent = len(acceptedVar)/len(ACF) * 100
I just used a threshold of 0.05 to detect the points for which the confidence interval is 95%. I don't know whether this is right or wrong statistically and logically. I then check whether percent is bigger than 95% to call the pattern periodic; I'm not sure about that either.
Credit to: the first answer to How can I use numpy.correlate to do autocorrelation?

To start: with e.g. ax = autocorrelation_plot(y), you can use ax.lines[5].get_data()[1] to get the values from the pandas autocorrelation function directly.
This may be a somewhat naïve solution, but if you are just looking for the first, most significant periodicity, you can grab the index of the highest peak in the plot:
first_max = np.argmax(autocorr) + 1
This gives you the lag at which the autocorrelation is highest, i.e. the period of interest (in units of your data's sampling interval).
Say you wanted the next most significant period:
second_max = np.argmax(autocorr[first_max:]) + first_max + 1
And so on and so forth...
Note that this wouldn't work as well if your data were not as regular and periodic as it appears to be from your autocorrelation plot.
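Putting the pieces above together, here is a minimal sketch of an automated check. It is an assumption on my part, not a standard test: it computes the ACF as in the question, takes the first local maximum after lag 0 as the candidate period, and calls the series periodic if the ACF at that lag clears a rough 95% white-noise confidence bound of 1.96/sqrt(N). Both the peak-finding rule and the threshold are heuristics you may want to tune.
import numpy as np

def detect_period(y):
    # Return (lag, acf_value, is_periodic) from a simple ACF heuristic.
    y = np.asarray(y, dtype=float) - np.mean(y)
    acf = np.correlate(y, y, mode="full")[y.size - 1:]
    acf = acf / acf[0]
    # first local maximum after lag 0 (heuristic peak detection)
    peaks = np.where((acf[1:-1] > acf[:-2]) & (acf[1:-1] > acf[2:]))[0] + 1
    if peaks.size == 0:
        return None, 0.0, False
    lag = peaks[0]
    threshold = 1.96 / np.sqrt(y.size)  # rough 95% bound for white noise
    return lag, acf[lag], acf[lag] > threshold

y = np.sin(np.arange(1, 6 * 3.14, 0.1))
lag, strength, periodic = detect_period(y)
print(lag, strength, periodic)  # lag comes out close to the ~63-sample period of the sine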

Related

2d interpolators of scipy for scattered data going crazy

I want to find the derivatives of some scattered data. I have tried two different methods:
1. projecting the scattered data onto a regular grid using scipy.interpolate.griddata, then computing the gradients with numpy.gradient, and then projecting the values back onto the scattered locations.
2. creating a CloughTocher2DInterpolator (but I have the same issue with others) and getting the gradients out of it.
The second one is an order of magnitude faster than the first one, but unfortunately it also goes crazy quite quickly when the data are a bit complex. For instance, starting with this signal (called F, which is a simple sum of tanh stepwise functions along x and y):
When I process F using the two methods, I get:
Method 1 gives a good approximation. Method 2 is also good, but I need to force the colormap because of the existence of some extreme values.
Now, if I add a small noise (of amplitude 0.1, while the signal has amplitudes between -3 and 3), the interpolator just goes crazy, giving very large extreme values:
I don't know how to deal with this. I understand the interpolator won't like an irregular function or noise, but I was not expecting such a discrepancy. My first idea was to smooth the data first, but strangely I can't find any method that would help me with this. Another idea would be to make a 2D fit of F to try to remove the noise, but I'm dry here too... any ideas?
Here is the corresponding Python example (working on Python 3.6.9):
import numpy as np
from scipy import interpolate
import matplotlib.pyplot as plt
plt.interactive(True)
# scattered data
N = 200
coordu = np.random.rand(N**2,2)
Xu=coordu[:,0]
Yu=coordu[:,1]
noise = 0.
noise = np.random.rand(Xu.shape[0])*0.1
Zu=np.tanh((Xu-0.25)/0.01+(Yu-0.25)/0.001)+np.tanh((Xu-0.5)/0.01+(Yu-0.5)/0.001)+np.tanh((Xu-0.75)/0.001+(Yu-0.75)/0.001)+noise
plt.figure();plt.scatter(Xu,Yu,1,Zu)
plt.title('Data signal F')
#plt.savefig('signalF_noisy.png')
### get the gradient
# using griddata np.gradients
Xs,Ys=np.meshgrid(np.linspace(0,1,N),np.linspace(0,1,N))
coords = np.array([Xs,Ys]).T
Zs = interpolate.griddata(coordu,Zu,coords)
nearest = interpolate.griddata(coordu,Zu,coords,method='nearest')
znan = np.isnan(Zs)
Zs[znan] = nearest[znan]
dZs = np.gradient(Zs,np.min(np.diff(Xs[0,:])))
dZus = interpolate.griddata(coords.reshape(N*N,2),dZs[0].reshape(N*N),coordu)
hist_dzus = np.histogram(dZus,100)
plt.figure();plt.scatter(Xu,Yu,1,dZus)
plt.colorbar()
plt.clim([0 ,10])
plt.title('dF/dx using griddata and np.gradients')
#plt.savefig('dxF_griddata_noisy.png')
# using interpolation method Clough
interp = interpolate.CloughTocher2DInterpolator(coordu,Zu)
dZuCT = interp.grad
hist_dzct = np.histogram(dZuCT[:,0,0],100)
plt.figure();plt.scatter(Xu,Yu,1,dZuCT[:,0,0])
plt.colorbar()
plt.clim([0 ,10])
plt.title('dF/dx using CloughTocher2DInterpolator')
#plt.savefig('dxF_CT2D_noisy.png')
# histograms
plt.figure()
plt.semilogy(hist_dzus[1][:-1],hist_dzus[0],'.-')
plt.semilogy(hist_dzct[1][:-1],hist_dzct[0],'.-')
plt.title('histogram of dF/dx')
plt.legend(('griddata','CloughTocher'))
#plt.savefig('dxF_hist_noisy.png')
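There is no answer attached to this question, but following the "smooth the data first" idea from the question itself, one option is to apply a small Gaussian blur to the gridded field before taking gradients. A minimal sketch continuing from the script above (the sigma value is an arbitrary assumption and should be tuned to the noise level):
from scipy import ndimage
# continuing from the script above: coordu, Zu, coords and N are reused
Zs2 = interpolate.griddata(coordu, Zu, coords)
nearest2 = interpolate.griddata(coordu, Zu, coords, method='nearest')
Zs2[np.isnan(Zs2)] = nearest2[np.isnan(Zs2)]
Zs2_smooth = ndimage.gaussian_filter(Zs2, sigma=2)      # mild smoothing before differentiation
dZs2 = np.gradient(Zs2_smooth, 1.0 / (N - 1))           # spacing of np.linspace(0, 1, N)
dZu2 = interpolate.griddata(coords.reshape(N * N, 2), dZs2[0].reshape(N * N), coordu)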

Interpreting and understanding fft plots of time series data

I have time series sensor data. It covers a period of 24 hours, sampled every minute (so 1440 data points per day in total). I did an FFT on this to see what the dominant frequencies are. But what I got is a very noisy spectrum with a strong peak at zero.
I have already subtracted the mean to remove the DC component at bin 0, but I still get a strong peak at zero. I'm not able to figure out what else could cause it or what I should try next to remove it.
The graph is very different from what I have usually seen online while learning about the FFT, in the sense that I'm not able to see dominant peaks the way they usually appear. Is my FFT wrong?
Attaching the code that I tried and the images:
import numpy as np
from matplotlib import pyplot as plt
from scipy.fftpack import fft,fftfreq
import random
x = np.random.default_rng().uniform(29,32,1440).tolist()
x=np.array(x)
x=x-x.mean()
N = 1440
# sample spacing
T = 1.0 / 60
yf = fft(x)
yf_abs = abs(yf).tolist()
plt.plot(abs(yf))
plt.show()
freqs = fftfreq(len(x), 60)
plt.plot(freqs,yf_abs)
plt.show()
Frequency vs amplitude
Since I'm new to this, I'm not able to figure out where I'm wrong or interpret the results. Any help will be appreciated. Thanks! :)
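There is no answer attached here either, but one common way to make the spectrum easier to read (an assumption, not something from the question) is to use the real FFT and a frequency axis in physical units, keeping only the positive frequencies. A minimal sketch assuming one sample per minute:
import numpy as np
import matplotlib.pyplot as plt

x = np.random.default_rng().uniform(29, 32, 1440)  # placeholder for the real sensor data
x = x - x.mean()                                   # remove the DC component
dt = 60.0                                          # sample spacing in seconds (one sample per minute)
freqs = np.fft.rfftfreq(x.size, d=dt)              # frequencies in Hz, positive only
amp = np.abs(np.fft.rfft(x))
plt.plot(freqs * 3600, amp)                        # axis in cycles per hour
plt.xlabel('frequency (cycles per hour)')
plt.ylabel('amplitude')
plt.show()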

Integration of a function with discrete values

I want to do an integration without knowing the functional equation f(x). I only have discrete values, which Python has connected in a plot. It looks like this:
This is the code with the calculation for it:
import numpy as np
import matplotlib.pyplot as plt
import math as m
import loaddataa as ld
# Loading of the values
dataListStride = ld.loadData("../Projektpraktikum Binder/Data/1 Fabienne/Test1/left foot/50cm")
indexStrideData = 0
strideData = dataListStride[indexStrideData]
#%%Calculation of the horizontal acceleration
def horizontal(yAngle, yAcceleration, xAcceleration):
    a = ((m.cos(m.radians(yAngle))) * yAcceleration) - ((m.sin(m.radians(yAngle))) * xAcceleration)
    return a
resultsHorizontal = list()
for i in range(len(strideData)):
    strideData_yAngle = strideData.to_numpy()[i, 2]
    strideData_xAcceleration = strideData.to_numpy()[i, 4]
    strideData_yAcceleration = strideData.to_numpy()[i, 5]
    resultsHorizontal.append(horizontal(strideData_yAngle, strideData_yAcceleration, strideData_xAcceleration))
resultsHorizontal.insert(0, 0)
print("The values are: " +str(resultsHorizontal))
print("There are " +str(len(resultsHorizontal)) + " values.")
#x-axis "convert" into time: 100 Hertz makes 0.01 seconds
scale_factor = 0.01
x_values = np.arange(len(resultsHorizontal)) * scale_factor
plt.plot(x_values, resultsHorizontal)
After the calculation I get a list of these values (which were shown and plotted in the diagram above):
Note about the code:
The code works as follows: using loaddataa.py, a CSV file is read in. Then the formula for the calculation of the horizontal acceleration is defined in def horizontal(yAngle, yAcceleration, xAcceleration). In the for loop, the previously loaded data is run through line by line; columns 2, 4 and 5 of the CSV file are used here. Then a 0 is added to the beginning of the resulting list of values. This is important in order to perform the integration from 0 to the end.
Now, after the calculation, I want to integrate this function (represented in the plot at the top) using these values (shown in the image after the code).
Is there a way to implement this? If so, how, and what would the plot look like? Maybe there is a way to do this with trapezoidal integration? Thanks for helping me!
At the end of my task I want to do a double integration with the acceleration values to get the path length. The first (trapezoidal) integration of the acceleration should represent the velocity and the second (trapezoidal) integration the path length (location). The x-axis should remain as it is.
What I just noticed are the negative values. Theoretically the integration should always result in positive values, right? Because there are no negative areas.
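Trapezoidal integration does work for this. Here is a minimal sketch of the double integration described above, using scipy.integrate.cumulative_trapezoid (called cumtrapz in older SciPy versions); it reuses resultsHorizontal and the 100 Hz sampling from the code above, and everything else is an assumption:
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import cumulative_trapezoid

acceleration = np.asarray(resultsHorizontal)                  # horizontal acceleration from above
t = np.arange(len(acceleration)) * 0.01                       # 100 Hz -> 0.01 s per sample
velocity = cumulative_trapezoid(acceleration, t, initial=0)   # first integration: velocity
position = cumulative_trapezoid(velocity, t, initial=0)       # second integration: path length
plt.plot(t, velocity, label='velocity')
plt.plot(t, position, label='position')
plt.xlabel('time (s)')
plt.legend()
plt.show()
Note that negative acceleration values are expected and do reduce the integral; the result is only guaranteed to stay positive if the integrand never goes negative.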

Interpolating 1D nonfunction data points

I am having difficulties finding an interpolation for my data points. The line should slightly resemble a negative inverse quadratic (i.e. like a backwards 'c').
Since this is not a function (a single x can map to multiple y values), I am not sure what interpolation to use.
I was thinking that perhaps I should flip the axes to create the interpolation points/line using something like UnivariateSpline, and then flip it back when I am plotting it?
This is a graph of just the individual points:
Here is my code:
import datetime as dt
import matplotlib.pyplot as plt
from scipy import interpolate
file = open_file("010217.hdf5", mode = "a", title = 'Sondrestrom1')
all_data = file.getNode('/Data/Table Layout').read()
file.close()
time = all_data['ut1_unix'] #time in seconds since 1/1/1970
alt = all_data['gdalt'] #all altitude points
electronDens = all_data['nel'] #all electron density points
x = []
y = []
positions = []
for t in range(len(time)): # at this specific time, find all the respective altitude and electron density points
    if time[t] == 982376726:
        x.append(electronDens[t])
        y.append(alt[t])
        positions.append(t)
#FINDING THE DATE
datetime1970 = dt.datetime(1970,1,1,0,0,0)
seconds = int(time[t])
newDatetime = datetime1970 + dt.timedelta(0, seconds)
time1 = newDatetime.strftime('%Y-%m-%d %H:%M:%S')
title = "Electron Density vs. Altitude at "
title += time1
plt.plot(x,y,"o")
plt.title(title)
plt.xlabel('Electron Density (log_10[Ne])')
plt.ylabel('Altitude (km)')
plt.show()
As the graph heading says "Electron Density vs. Altitude", I suppose there's only one value per point on the vertical axis?
This means you are actually looking at a function that has been flipped to make the x-axis vertical, because having altitude on the vertical axis is just more intuitive to humans.
Looking at your code, there seems to have been a measurement where both altitude and electron density were measured. Therefore, even if my theory above is wrong, you should still be able to interpolate everything in the time domain and create a spline from that.
... that's if you really want to have a curve that goes exactly through every point.
Seeing as how much scatter there is in the data, you should probably go for a curve fit that doesn't exactly replicate every measurement:
scipy.interpolate.Rbf should work alright, and again, for this you should switch the axes, i.e. compute electron density as a function of altitude. Just be sure to use smooth=0.01 or maybe a little more (0.0 will go exactly through every point and look a little silly on noisy data).
... actually it seems most of your problem is understanding your data better :)
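A minimal sketch of the Rbf suggestion above. The sample arrays here are hypothetical stand-ins for the real altitude and electron-density data, and smooth=0.01 is just the value suggested above:
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import Rbf

# hypothetical stand-ins: a backwards-'c' profile with some noise
alt = np.linspace(100, 500, 40)                                             # altitude (km)
ne = 11 - ((alt - 300) / 150.0) ** 2 + np.random.normal(0, 0.1, alt.size)   # log10 electron density

# fit density as a function of altitude (axes switched, as suggested above)
rbf = Rbf(alt, ne, smooth=0.01)
alt_fine = np.linspace(alt.min(), alt.max(), 300)

plt.plot(ne, alt, 'o', label='data')
plt.plot(rbf(alt_fine), alt_fine, '-', label='Rbf, smooth=0.01')
plt.xlabel('Electron Density (log_10[Ne])')
plt.ylabel('Altitude (km)')
plt.legend()
plt.show()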

Manipulating the numpy.random.exponential distribution in Python

I am trying to create an array of random numbers using Numpy's random exponential distribution. I've got this working fine, however I have one extra requirement for my project and that is the ability to specify precisely how many array elements have a certain value.
Let me explain (code is below, but I'll have a go at explaining it here): I generate my random exponential distribution and plot a histogram of the data, producing a nice exponential curve. What I really want to be able to do is use a variable to specify the y-intercept of this curve (point where curve meets the y-axis). I can achieve this in a basic way by changing the number of bins in my histogram, but this only changes the plot and not the original data.
I have inserted the bones of my code here. To give some context, I am trying to create the exponential disc of a galaxy, hence the random array I want to generate is an array of radii and the variable I want to be able to specify is the number density in the centre of the galaxy:
import numpy as N
import matplotlib.pyplot as P
n = 1000
scale_radius = 2
central_surface_density = 100 # I would like this to be the controlling variable, even if its specification had knock-on effects on n.
radius_array = N.random.exponential(scale_radius,(n,1))
P.figure()
nbins = 100
number_density, radii = N.histogram(radius_array, bins=nbins)
P.plot(radii[0:-1], number_density)
P.xlabel('$R$')
P.ylabel(r'$\Sigma$')
P.ylim(0, central_surface_density)
P.legend()
P.show()
This code creates the following histogram:
So, to summarise, I would like to be able to specify where this plot intercepts the y-axis by controlling how I've generated the data, not by changing how the histogram has been plotted.
Any help or requests for further clarification would be very much appreciated.
According to the docs for numpy.random.exponential, the input parameter is the scale beta = 1/lambda, using the definition of the exponential distribution described on Wikipedia.
The density evaluated at x = 0 is f(0) = lambda = 1/beta, so in a normalized distribution you set the y-intercept simply by passing its inverse as the scale to the numpy function:
import numpy as np
import pylab as plt
target = 250
beta = 1.0/target
Y = np.random.exponential(beta, 5000)
plt.hist(Y, density=True, bins=200, lw=0, alpha=.8)
plt.plot([0,max(Y)],[target,target],'r--')
plt.ylim(0,target*1.1)
plt.show()
Yes, the y-intercept of the histogram will change with different bin sizes, but this doesn't mean anything. The only thing that you can reasonably talk about here is the underlying probability distribution (hence density=True).
