Interpolating 1D nonfunction data points - python

I am having difficulties finding an interpolation for my data points. The line should slightly resemble a negative inverse quadratic (ie like a backwards 'c').
Since this is not a function (x can have multiple values of y), I am not sure what interpolation to use.
I was thinking that perhaps I should flip the axis to create the interpolation points/line using something like UnivariateSpline and then flip it back when I am plotting it?
This is a graph of just the individual points:
Here is my code:
import datetime as dt
import matplotlib.pyplot as plt
from scipy import interpolate
file = open_file("010217.hdf5", mode = "a", title = 'Sondrestrom1')
all_data = file.getNode('/Data/Table Layout').read()
file.close()
time = all_data['ut1_unix'] #time in seconds since 1/1/1970
alt = all_data['gdalt'] #all altitude points
electronDens = all_data['nel'] #all electron density points
x = []
y = []
positions = []
for t in range(len(time)): #Looking at this specific time, find all the respective altitude and electron density points
if time[t] == 982376726:
x.append(electronDens[t])
y.append(alt[t])
positions.append(t)
#FINDING THE DATE
datetime1970 = dt.datetime(1970,1,1,0,0,0)
seconds = long(time[t])
newDatetime = datetime1970 + dt.timedelta(0, seconds)
time1 = newDatetime.strftime('%Y-%m-%d %H:%M:%S')
title = "Electron Density vs. Altitude at "
title += time1
plt.plot(x,y,"o")
plt.title(title)
plt.xlabel('Electron Density (log_10[Ne])')
plt.ylabel('Altitude (km)')
plt.show()

As the graph heading says "electron density vs. Altidude", I suppose there's only one value per point on the vertical axis?
This means you are actually looking at a function that has been flipped, in order to make the x axis vertical because having altitude on the vertical axis is just more intuitive to humans.
Looking at your code, there seems to have been a measurement where both altitude and electron density were measured. Therefore, even if my theory above is wrong, you should still be able to interpolate everything in the time domain and create a spline from that.
... that's if you really want to have a curve that goes exactly through every point.
Seeing as how much scatter there is in the data, you should probably go for a curve fit that doesn't exactly replicate every measurement:
scipy.interpolate.Rbf should work alright, and again, for this you should switch the axes, i.e. compute electron density as function of altitude. Just be sure to use smooth=0.01 or maybe a little more (0.0 will exactly go through every point and look a little silly on noisy data).
... actually it seems most of your problem is understanding your data better :)

Related

2d interpolators of scipy for scattered data going crazy

I want to find the derivatives of some scattered data. I have tried two different methods:
projecting the scattered data on a regular grid using scipy.interpolate.griddata, then computing the gradients with numpy.gradients, and then projecting values back to the scattered locations.
creating a CloughTocher2DInterpolater (but I have the same issue with others) and getting the gradients out of it
The second one is an order of magnitude faster than the first one but unfortunately, it also goes crazy quite quickly when data are a bit complex. For instance starting with this signal (called F and which is a simple addition of tanh stepwise functions along x and y):
When I process F using the two methods, I get:
Method 1 gives a good approximation. Method 2 is also good but I need force the colormap because of the existence of some extreme values.
Now, if I add a small noise (i.e. of amplitude 0.1 while the signal has amplitudes between -3 and 3), the interpolator just goes crazy giving very large extreme values:
I don't know how to deal with this. I understand the interpolator won't like irregular function or noise, but I was not expecting such discrepancy. My first idea was to smooth data first but strangely I can't find any method that would help me on this. Another idea would be to make a 2d fit of F to try to remove noise but I'm dry here too...any idea ?
Here is the corresponding python example (working on python3.6.9):
import numpy as np
from scipy import interpolate
import matplotlib.pyplot as plt
plt.interactive(True)
# scattered data
N = 200
coordu = np.random.rand(N**2,2)
Xu=coordu[:,0]
Yu=coordu[:,1]
noise = 0.
noise = np.random.rand(Xu.shape[0])*0.1
Zu=np.tanh((Xu-0.25)/0.01+(Yu-0.25)/0.001)+np.tanh((Xu-0.5)/0.01+(Yu-0.5)/0.001)+np.tanh((Xu-0.75)/0.001+(Yu-0.75)/0.001)+noise
plt.figure();plt.scatter(Xu,Yu,1,Zu)
plt.title('Data signal F')
#plt.savefig('signalF_noisy.png')
### get the gradient
# using griddata np.gradients
Xs,Ys=np.meshgrid(np.linspace(0,1,N),np.linspace(0,1,N))
coords = np.array([Xs,Ys]).T
Zs = interpolate.griddata(coordu,Zu,coords)
nearest = interpolate.griddata(coordu,Zu,coords,method='nearest')
znan = np.isnan(Zs)
Zs[znan] = nearest[znan]
dZs = np.gradient(Zs,np.min(np.diff(Xs[0,:])))
dZus = interpolate.griddata(coords.reshape(N*N,2),dZs[0].reshape(N*N),coordu)
hist_dzus = np.histogram(dZus,100)
plt.figure();plt.scatter(Xu,Yu,1,dZus)
plt.colorbar()
plt.clim([0 ,10])
plt.title('dF/dx using griddata and np.gradients')
#plt.savefig('dxF_griddata_noisy.png')
# using interpolation method Clough
interp = interpolate.CloughTocher2DInterpolator(coordu,Zu)
dZuCT = interp.grad
hist_dzct = np.histogram(dZuCT[:,0,0],100)
plt.figure();plt.scatter(Xu,Yu,1,dZuCT[:,0,0])
plt.colorbar()
plt.clim([0 ,10])
plt.title('dF/dx using CloughTocher2DInterpolator')
#plt.savefig('dxF_CT2D_noisy.png')
# histograms
plt.figure()
plt.semilogy(hist_dzus[1][:-1],hist_dzus[0],'.-')
plt.semilogy(hist_dzct[1][:-1],hist_dzct[0],'.-')
plt.title('histogram of dF/dx')
plt.legend(('griddata','ClouhTocher'))
#plt.savefig('dxF_hist_noisy.png')

How can I detect periodicity using auto-correlation automatically?

This is my code:
import matplotlib.pyplot as plt
import numpy as np
from pandas.plotting import autocorrelation_plot
y = np.sin(np.arange(1,6*3.14,0.1))
autocorrelation_plot(y)
plt.show()
And this is the output of the auto-correlation plot:
auto-correlation plot of y
I would like to figure out a way to classify whether the function is periodic or not automatically (without using the bare-eye to look at the autocorrelation plot). I read that it is related to the confidence interval which is the line shown in the attached plot, but still have doubt on what I should do with it to better decide. So is there an automated way for using auto-correlation to decide the perdiodicity of the data?
Though, this is my try for an automated way:
result = np.correlate(y, y, mode = "full")
ACF = result[np.round(result.size/2).astype(int):]
ACF = ACF/ACF[0]
acceptedVar = []
for i in range(len(ACF)):
if ACF[i] > 0.05:
acceptedVar = np.append(acceptedVar, ACF[i])
percent = len(acceptedVar)/len(ACF) * 100
I just made a threshold of 0.05 to detect the points for which the confidence interval is 95%. Don't know if this is right or wrong statistically and logically. I then see if percent is bigger than 95% for a periodic pattern. I'm not sure of that as well.
Credit to: the first answer to How can I use numpy.correlate to do autocorrelation?
To start, with e.g. ax = autocorrelation_plot(y) you can use ax.lines[5].get_data()[1] to use the values from the pandas autocorrelation function directly.
This may be a somewhat naïve solution, but say you are just looking for the first, most significant, periodicity, you could just grab the first index of the highest peak in the plot:
first_max = np.argmax(autocorr) + 1
Which gives you the lag for which autocorrelation is highest = the period of interest (in units of your data's sampling interval.)
Say you wanted the next most significant period:
second_max = np.argmax(autocorr[first_max:]) + first_max + 1
And so on and so forth...
To note: this wouldn't work as well if your data is not as regular and periodic as it seems to be from your autocorrelation plot.

Using numpy/scipy to identify slope changes in digital signals?

I am trying to come up with a generalised way in Python to identify pitch rotations occurring during a set of planned spacecraft manoeuvres. You could think of it as a particular case of a shift detection problem.
Let's consider the solar_elevation_angle variable in my set of measurements, identifying the elevation angle of the sun measured from the spacecraft's instrument. For those who might want to play with the data, I saved the solar_elevation_angle.txt file here.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec
from scipy.signal import argrelmax
from scipy.ndimage.filters import gaussian_filter1d
solar_elevation_angle = np.loadtxt("solar_elevation_angle.txt", dtype=np.float32)
fig, ax = plt.subplots()
ax.set_title('Solar elevation angle')
ax.set_xlabel('Scanline')
ax.set_ylabel('Solar elevation angle [deg]')
ax.plot(solar_elevation_angle)
plt.show()
The scanline is my time dimension. The four points where the slope changes identify the spacecraft pitch rotations.
As you can see, the solar elevation angle evolution outside the spacecraft manoeuvres regions is pretty much linear as a function of time, and that should always be the case for this particular spacecraft (except for major failures).
Note that during each spacecraft manoeuvre, the slope change is obviously continuous, although discretised in my set of angle values. That means: for each manoeuvre, it does not really make sense to try to locate a single scanline where a manoeuvre has taken place. My goal is rather to identify, for each manoeuvre, a "representative" scanline in the range of scanlines defining the interval of time where the manoeuvre occurred (e.g. middle value, or left boundary).
Once I get a set of "representative" scanline indexes where all manoeuvres have taken place, I could then use those indexes for rough estimations of manoeuvres durations, or to automatically place labels on the plot.
My solution so far has been to:
Compute the 2nd derivative of the solar elevation angle using
np.gradient.
Compute absolute value and clipping of resulting
curve. The clipping is necessary because of what I assume to be
discretisation noise in the linear segments, which would then severely affect the identification of the "real" local maxima in point 4.
Apply smoothing to the resulting curve, to get rid of multiple peaks. I'm using scipy's 1d gaussian filter with a trial-and-error sigma value for that.
Identify local maxima.
Here's my code:
fig = plt.figure(figsize=(8,12))
gs = gridspec.GridSpec(5, 1)
ax0 = plt.subplot(gs[0])
ax0.set_title('Solar elevation angle')
ax0.plot(solar_elevation_angle)
solar_elevation_angle_1stdev = np.gradient(solar_elevation_angle)
ax1 = plt.subplot(gs[1])
ax1.set_title('1st derivative')
ax1.plot(solar_elevation_angle_1stdev)
solar_elevation_angle_2nddev = np.gradient(solar_elevation_angle_1stdev)
ax2 = plt.subplot(gs[2])
ax2.set_title('2nd derivative')
ax2.plot(solar_elevation_angle_2nddev)
solar_elevation_angle_2nddev_clipped = np.clip(np.abs(np.gradient(solar_elevation_angle_2nddev)), 0.0001, 2)
ax3 = plt.subplot(gs[3])
ax3.set_title('absolute value + clipping')
ax3.plot(solar_elevation_angle_2nddev_clipped)
smoothed_signal = gaussian_filter1d(solar_elevation_angle_2nddev_clipped, 20)
ax4 = plt.subplot(gs[4])
ax4.set_title('Smoothing applied')
ax4.plot(smoothed_signal)
plt.tight_layout()
plt.show()
I can then easily identify the local maxima by using scipy's argrelmax function:
max_idx = argrelmax(smoothed_signal)[0]
print(max_idx)
# [ 689 1019 2356 2685]
Which correctly identifies the scanline indexes I was looking for:
fig, ax = plt.subplots()
ax.set_title('Solar elevation angle')
ax.set_xlabel('Scanline')
ax.set_ylabel('Solar elevation angle [deg]')
ax.plot(solar_elevation_angle)
ax.scatter(max_idx, solar_elevation_angle[max_idx], marker='x', color='red')
plt.show()
My question is: Is there a better way to approach this problem?
I find that having to manually specify the clipping threshold values to get rid of the noise and the sigma in the gaussian filter weakens this approach considerably, preventing it to be applied to other similar cases.
First improvement would be to use a Savitzky-Golay filter to find the derivative in a less noisy way. For example, it can fit a parabola (in the sense of least squares) to each data slice of certain size, and then take the second derivative of that parabola. The result is much nicer than just taking 2nd order difference with gradient. Here it is with window size 101:
savgol_filter(solar_elevation_angle, window_length=window, polyorder=2, deriv=2)
Second, instead of looking for points of maximum with argrelmax it is better to look for places where the second derivative is large; for example, at least half its maximal size. This will of course return many indexes, but we can then look at the gaps between those indexes to identify where each peak begins and ends. The midpoint of the peak is then easily found.
Here is the complete code. The only parameter is window size, which is set to 101. The approach is robust; the size 21 or 201 gives essentially the same outcome (it must be odd).
from scipy.signal import savgol_filter
window = 101
der2 = savgol_filter(solar_elevation_angle, window_length=window, polyorder=2, deriv=2)
max_der2 = np.max(np.abs(der2))
large = np.where(np.abs(der2) > max_der2/2)[0]
gaps = np.diff(large) > window
begins = np.insert(large[1:][gaps], 0, large[0])
ends = np.append(large[:-1][gaps], large[-1])
changes = ((begins+ends)/2).astype(np.int)
plt.plot(solar_elevation_angle)
plt.plot(changes, solar_elevation_angle[changes], 'ro')
plt.show()
The fuss with insert and append is because the first index with large derivative should qualify as "peak begins" and the last such index should qualify as "peak ends", even though they don't have a suitable gap next to them (the gap is infinite).
Piecewise linear fit
This is an alternative (not necessarily better) approach, which does not use derivatives: fit a smoothing spline of degree 1 (i.e., a piecewise linear curve), and notice where its knots are.
First, normalize the data (which I call y instead of solar_elevation_angle) to have standard deviation 1.
y /= np.std(y)
The first step is to build a piecewise linear curve that deviates from the data by at most the given threshold, arbitrarily set to 0.1 (no units here because y was normalized). This is done by calling UnivariateSpline repeatedly, starting with a large smoothing parameter and gradually reducing it until the curve fits. (Unfortunately, one can't simply pass in the desired uniform error bound).
from scipy.interpolate import UnivariateSpline
threshold = 0.1
m = y.size
x = np.arange(m)
s = m
max_error = 1
while max_error > threshold:
spl = UnivariateSpline(x, y, k=1, s=s)
interp_y = spl(x)
max_error = np.max(np.abs(interp_y - y))
s /= 2
knots = spl.get_knots()
values = spl(knots)
So far we found the knots, and noted the values of the spline at those knots. But not all of these knots are really important. To test the importance of each knot, I remove it and interpolate without it. If the new interpolant is substantially different from the old (doubling the error), the knot is considered important and is added to the list of found slope changes.
ts = knots.size
idx = np.arange(ts)
changes = []
for j in range(1, ts-1):
spl = UnivariateSpline(knots[idx != j], values[idx != j], k=1, s=0)
if np.max(np.abs(spl(x) - interp_y)) > 2*threshold:
changes.append(knots[j])
plt.plot(y)
plt.plot(changes, y[np.array(changes, dtype=int)], 'ro')
plt.show()
Ideally, one would fit piecewise linear functions to given data, increasing the number of knots until adding one more does not bring "substantial" improvement. The above is a crude approximation of that with SciPy tools, but far from best possible. I don't know of any off-the-shelf piecewise linear model selection tool in Python.

Python fastKDE beyond limits of data points

I'm trying to use the fastKDE package (https://pypi.python.org/pypi/fastkde/1.0.8) to find the KDE of a point in a 2D plot. However, I want to know the KDE beyond the limits of the data points, and cannot figure out how to do this.
Using the code listed on the site linked above;
#!python
import numpy as np
from fastkde import fastKDE
import pylab as PP
#Generate two random variables dataset (representing 100000 pairs of datapoints)
N = 2e5
var1 = 50*np.random.normal(size=N) + 0.1
var2 = 0.01*np.random.normal(size=N) - 300
#Do the self-consistent density estimate
myPDF,axes = fastKDE.pdf(var1,var2)
#Extract the axes from the axis list
v1,v2 = axes
#Plot contours of the PDF should be a set of concentric ellipsoids centered on
#(0.1, -300) Comparitively, the y axis range should be tiny and the x axis range
#should be large
PP.contour(v1,v2,myPDF)
PP.show()
I'm able to find the KDE for any point within the limits of the data, but how do I find the KDE for say the point (0,300), without having to include it into var1 and var2. I don't want the KDE to be calculated with this data point, I want to know the KDE at that point.
I guess what I really want to be able to do is give the fastKDE a histogram of the data, so that I can set its axes myself. I just don't know if this is possible?
Cheers
I, too, have been experimenting with this code and have run into the same issues. What I've done (in lieu of a good N-D extrapolator) is to build a KDTree (with scipy.spatial) from the grid points that fastKDE returns and find the nearest grid point to the point I was to evaluate. I then lookup the corresponding pdf value at that point (it should be small near the edge of the pdf grid if not identically zero) and assign that value accordingly.
I came across this post while searching for a solution of this problem. Similiar to the building of a KDTree you could just calculate your stepsize in every griddimension, and then get the index of your query point by just subtracting the point value with the beginning of your axis and divide by the stepsize of that dimension, finally round it off, turn it to integer and voila. So for example in 1D:
def fastkde_test(test_x):
kde, axes = fastKDE.pdf(test_x, numPoints=num_p)
x_step = (max(axes)-min(axes)) / len(axes)
x_ind = np.int32(np.round((test_x-min(axes)) / x_step))
return kde[x_ind]
where test_x in this case is both the set for defining the KDE and the query set. Doing it this way is marginally faster by a factor of 10 in my case (at least in 1D, higher dimensions not yet tested) and does basically the same thing as the KDTree query.
I hope this helps anyone coming across this problem in the future, as I just did.
Edit: if your querying points outside of the range over which the KDE was calculated this method of course can only give you the same result as the KDTree query, namely the corresponding border of your KDE-grid. You would however have to hardcode this by cutting the resulting x_ind at the highest index, i.e. `len(axes)-1'.

Interpolation over an irregular grid

So, I have three numpy arrays which store latitude, longitude, and some property value on a grid -- that is, I have LAT(y,x), LON(y,x), and, say temperature T(y,x), for some limits of x and y. The grid isn't necessarily regular -- in fact, it's tripolar.
I then want to interpolate these property (temperature) values onto a bunch of different lat/lon points (stored as lat1(t), lon1(t), for about 10,000 t...) which do not fall on the actual grid points. I've tried matplotlib.mlab.griddata, but that takes far too long (it's not really designed for what I'm doing, after all). I've also tried scipy.interpolate.interp2d, but I get a MemoryError (my grids are about 400x400).
Is there any sort of slick, preferably fast way of doing this? I can't help but think the answer is something obvious... Thanks!!
Try the combination of inverse-distance weighting and
scipy.spatial.KDTree
described in SO
inverse-distance-weighted-idw-interpolation-with-python.
Kd-trees
work nicely in 2d 3d ..., inverse-distance weighting is smooth and local,
and the k= number of nearest neighbours can be varied to tradeoff speed / accuracy.
There is a nice inverse distance example by Roger Veciana i Rovira along with some code using GDAL to write to geotiff if you're into that.
This is of coarse to a regular grid, but assuming you project the data first to a pixel grid with pyproj or something, all the while being careful what projection is used for your data.
A copy of his algorithm and example script:
from math import pow
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
def pointValue(x,y,power,smoothing,xv,yv,values):
nominator=0
denominator=0
for i in range(0,len(values)):
dist = sqrt((x-xv[i])*(x-xv[i])+(y-yv[i])*(y-yv[i])+smoothing*smoothing);
#If the point is really close to one of the data points, return the data point value to avoid singularities
if(dist<0.0000000001):
return values[i]
nominator=nominator+(values[i]/pow(dist,power))
denominator=denominator+(1/pow(dist,power))
#Return NODATA if the denominator is zero
if denominator > 0:
value = nominator/denominator
else:
value = -9999
return value
def invDist(xv,yv,values,xsize=100,ysize=100,power=2,smoothing=0):
valuesGrid = np.zeros((ysize,xsize))
for x in range(0,xsize):
for y in range(0,ysize):
valuesGrid[y][x] = pointValue(x,y,power,smoothing,xv,yv,values)
return valuesGrid
if __name__ == "__main__":
power=1
smoothing=20
#Creating some data, with each coodinate and the values stored in separated lists
xv = [10,60,40,70,10,50,20,70,30,60]
yv = [10,20,30,30,40,50,60,70,80,90]
values = [1,2,2,3,4,6,7,7,8,10]
#Creating the output grid (100x100, in the example)
ti = np.linspace(0, 100, 100)
XI, YI = np.meshgrid(ti, ti)
#Creating the interpolation function and populating the output matrix value
ZI = invDist(xv,yv,values,100,100,power,smoothing)
# Plotting the result
n = plt.normalize(0.0, 100.0)
plt.subplot(1, 1, 1)
plt.pcolor(XI, YI, ZI)
plt.scatter(xv, yv, 100, values)
plt.title('Inv dist interpolation - power: ' + str(power) + ' smoothing: ' + str(smoothing))
plt.xlim(0, 100)
plt.ylim(0, 100)
plt.colorbar()
plt.show()
There's a bunch of options here, which one is best will depend on your data...
However I don't know of an out-of-the-box solution for you
You say your input data is from tripolar data. There are three main cases for how this data could be structured.
Sampled from a 3d grid in tripolar space, projected back to 2d LAT, LON data.
Sampled from a 2d grid in tripolar space, projected into 2d LAT LON data.
Unstructured data in tripolar space projected into 2d LAT LON data
The easiest of these is 2. Instead of interpolating in LAT LON space, "just" transform your point back into the source space and interpolate there.
Another option that works for 1 and 2 is to search for the cells that maps from tripolar space to cover your sample point. (You can use a BSP or grid type structure to speed up this search) Pick one of the cells, and interpolate inside it.
Finally there's a heap of unstructured interpolation options .. but they tend to be slow.
A personal favourite of mine is to use a linear interpolation of the nearest N points, finding those N points can again be done with gridding or a BSP. Another good option is to Delauney triangulate the unstructured points and interpolate on the resulting triangular mesh.
Personally if my mesh was case 1, I'd use an unstructured strategy as I'd be worried about having to handle searching through cells with overlapping projections. Choosing the "right" cell would be difficult.
I suggest you taking a look at GRASS (an open source GIS package) interpolation features (http://grass.ibiblio.org/gdp/html_grass62/v.surf.bspline.html). It's not in python but you can reimplement it or interface with C code.
Am I right in thinking your data grids look something like this (red is the old data, blue is the new interpolated data)?
alt text http://www.geekops.co.uk/photos/0000-00-02%20%28Forum%20images%29/DataSeparation.png
This might be a slightly brute-force-ish approach, but what about rendering your existing data as a bitmap (opengl will do simple interpolation of colours for you with the right options configured and you could render the data as triangles which should be fairly fast). You could then sample pixels at the locations of the new points.
Alternatively, you could sort your first set of points spatially and then find the closest old points surrounding your new point and interpolate based on the distances to those points.
There is a FORTRAN library called BIVAR, which is very suitable for this problem. With a few modifications you can make it usable in python using f2py.
From the description:
BIVAR is a FORTRAN90 library which interpolates scattered bivariate data, by Hiroshi Akima.
BIVAR accepts a set of (X,Y) data points scattered in 2D, with associated Z data values, and is able to construct a smooth interpolation function Z(X,Y), which agrees with the given data, and can be evaluated at other points in the plane.

Categories

Resources