Fast interpolation of a scattered DataFrame - python

TL;DR: Question: Is there a fast way to interpolate a scattered 2D-dataset at specific coordinates?
And if so, could someone provide an example with the provided sample data and the variables used in "Current Solution" (as I'm apparently too stupid to implement it myself)?
Problem:
I need to interpolate (and, if possible, also extrapolate) a DataFrame (size = (34, 18)) of scattered data at specific coordinate points. The DataFrame always stays the same.
The interpolation needs to be fast, as it is done more than 10,000 times in a loop.
The coordinates at which to interpolate are not known in advance, as they change on every loop iteration.
Current Solution:
def Interpolation(a, b):
    #import external modules
    import pandas as pd
    from scipy import interpolate
    #reading .xlsx file into DataFrame (file_path is defined elsewhere)
    file = pd.ExcelFile(file_path)
    mr_df = file.parse('Model_References')
    matrix = mr_df.set_index(mr_df.columns[0])
    #interpolation at specific coordinates
    matrix = matrix.stack().reset_index().values
    value = interpolate.griddata(matrix[:,0:2], matrix[:,2], (a, b), method='cubic')
    return value
This method is not acceptable for long-term use, as the two lines of code under #interpolation at specific coordinates alone account for more than 95% of the execution time.
My Ideas:
scipy.interpolate.Rbf seems like the best solution if the data needs to be both interpolated and extrapolated, but as far as I understand it only creates a finer mesh of the existing data and cannot output an interpolated value at specific coordinates.
Creating a smaller 4x4 matrix of the area around the specific coordinates (a, b) might decrease the execution time per loop, but I am struggling with how to use griddata with the smaller matrix. I created a 5x5 matrix in which the first row and column are the indexes and the other 4x4 entries are the data, with the specific coordinates in the middle.
But I get a TypeError: list indices must be integers or slices, not tuple, which I do not understand, as I did not change anything else.
Sample Data:
           0.0    0.1    0.2    0.3
0.0       -407   -351   -294   -235
0.0001    -333   -285   -236   -185
0.0002    -293   -251   -206   -161
0.00021   -280   -239   -196   -151

Thanks to @Jdog's comment I was able to figure it out:
Creating the spline once before the loop with scipy.interpolate.RectBivariateSpline and reading out the values at specific coordinates with scipy.interpolate.RectBivariateSpline.ev decreased the execution time of the interpolation from 255s to 289ms.
def Interpolation(mesh, a, b):
    #interpolation at specific coordinates
    value = mesh.ev(a, b)
    return value
#%%
#import external modules
import pandas as pd
from scipy import interpolate
#reading .xlsx file into DataFrame
file = pd.ExcelFile(file_path)
mr_df = file.parse('Model_References')
matrix = mr_df.set_index(mr_df.columns[0])
#build the spline once; the row and column labels are assumed to be the grid coordinates
a_index = matrix.index.values
b_index = matrix.columns.values
mesh = interpolate.RectBivariateSpline(a_index, b_index, matrix.values)
for iterations in loop:
    value = Interpolation(mesh, a, b)
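For anyone who wants to try this without the Excel file, here is a minimal sketch using only the 4x4 sample data from the question. The names a_query and b_query are made up for illustration, and the default bicubic spline needs at least four points per axis, which the sample just satisfies. It also shows that ev accepts arrays, so many coordinate pairs can be evaluated in one call instead of a Python loop.

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

# Sample data from the question: row labels, column labels and values.
a_index = np.array([0.0, 0.0001, 0.0002, 0.00021])
b_index = np.array([0.0, 0.1, 0.2, 0.3])
values = np.array([
    [-407.0, -351.0, -294.0, -235.0],
    [-333.0, -285.0, -236.0, -185.0],
    [-293.0, -251.0, -206.0, -161.0],
    [-280.0, -239.0, -196.0, -151.0],
])

# Build the spline once, outside the loop (defaults: bicubic, s=0, so it
# passes through the grid values).
mesh = RectBivariateSpline(a_index, b_index, values)

# Evaluate at a single coordinate pair ...
single = mesh.ev(0.0001, 0.1)

# ... or at many pairs in one vectorised call (illustrative query points).
a_query = np.array([0.00005, 0.00015, 0.0002])
b_query = np.array([0.05, 0.15, 0.3])
many = mesh.ev(a_query, b_query)
```

If all query coordinates of one loop iteration are known at once, the array form is usually the cheaper way to call it.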

Related

Numpy griddata interpolation up to certain radius

I'm using griddata() to interpolate my (irregular) 2-dimensional depth-measurements; x,y,depth. The method does a great job, but it interpolates over the entire grid wherever it can find two opposing points. I don't want that behaviour. I'd like to have an interpolation around the existing measurements, say up to a certain radius.
Is it possible to tell numpy/scipy: don't interpolate if you're too far from an existing measurement? Resulting in a NODATA-value? ideal = griddata(.., .., .., radius=5.0)
edit example:
In the image below, black dots are the measurements. Shades of blue are the cells interpolated by numpy. The area marked in green is in fact part of the picture but is considered NODATA by numpy (because there are no points in between). Now, the red areas are interpolated, but I want to get rid of them. Any ideas?
Ok cool. I don't think there is a built-in option for griddata() that does what you want, so you will need to write it yourself.
This comes down to calculating the distances between N input data points and M interpolation points. This is simple enough to do, but if you have a lot of points it can be slow at ~O(M*N). Here's an example that calculates the distances to all N data points for each interpolation point. If the number of data points within the radius is at least neighbors, it keeps the value. Otherwise it writes the value of NODATA.
neighbors is 4 because griddata() will use bilinear interpolation, which needs points bounding the interpolants in each dimension (2*2 = 4).
import numpy as np
#invec - input points Nx2 numpy array
#mvec - interpolation points Mx2 numpy array
#just some random points for example
N=100
invec = 10*np.random.random([N,2])
M=50
mvec = 10*np.random.random([M,2])
# --- here you would put your griddata() call, returning interpolated_values
interpolated_values = np.zeros(M)
NODATA=np.nan
radius = 5.0
neighbors = 4
for m in range(M):
    #boolean mask of the data points within radius of interpolation point m
    data_in_radius = np.sqrt(np.sum( (invec - mvec[m])**2, axis=1)) <= radius
    if np.sum(data_in_radius) < neighbors :
        interpolated_values[m] = NODATA
Edit:
Ok re-read and noticed the input is really 2D. Example modified.
Just as an additional comment, this could be greatly accelerated if you first build a coarse mapping from each point mvec[m] to a subset of the relevant data points.
The costliest step in the loop would change from
np.sqrt(np.sum( (invec - mvec[m])**2, axis=1))
to something like
np.sqrt(np.sum( (invec[subset[m]] - mvec[m])**2, axis=1))
There are plenty of ways to do this, for example using a quadtree, a hashing function, or a 2D index. But whether this gives a performance advantage depends on the application, how your data is structured, etc.
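One ready-made spatial index for this is scipy.spatial.cKDTree. The following is only a sketch of the same radius test, not part of the original answer: the helper name mask_far_points and the toy coordinates are invented for illustration. It asks the tree for the `neighbors` nearest data points within `radius`; if the last of them comes back with an infinite distance, fewer than `neighbors` points are in range and the value is replaced by NODATA.

```python
import numpy as np
from scipy.spatial import cKDTree


def mask_far_points(interpolated, data_xy, query_xy, radius, neighbors=4, nodata=np.nan):
    """Return a copy of `interpolated` with nodata where fewer than
    `neighbors` data points lie within `radius` of the query point."""
    tree = cKDTree(data_xy)
    # Distances to the `neighbors` nearest data points, sorted ascending.
    # Neighbours beyond `radius` are reported with an infinite distance.
    dist, _ = tree.query(query_xy, k=neighbors, distance_upper_bound=radius)
    dist = np.reshape(dist, (len(query_xy), -1))  # also handles neighbors == 1
    too_sparse = ~np.isfinite(dist[:, -1])
    result = np.array(interpolated, dtype=float)
    result[too_sparse] = nodata
    return result


# Toy example: four clustered measurements and one isolated measurement.
data_xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [10.0, 10.0]])
query_xy = np.array([[0.5, 0.5], [10.0, 10.5]])
interpolated = np.array([7.0, 9.0])  # stand-in for the griddata() result
masked = mask_far_points(interpolated, data_xy, query_xy, radius=1.0)
```

The first query point has four measurements within the radius and keeps its value; the second has only one and becomes NaN.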

Python fastKDE beyond limits of data points

I'm trying to use the fastKDE package (https://pypi.python.org/pypi/fastkde/1.0.8) to find the KDE of a point in a 2D plot. However, I want to know the KDE beyond the limits of the data points, and cannot figure out how to do this.
Using the code listed on the site linked above;
#!python
import numpy as np
from fastkde import fastKDE
import pylab as PP
#Generate two random variables dataset (representing 200000 pairs of datapoints)
N = int(2e5)
var1 = 50*np.random.normal(size=N) + 0.1
var2 = 0.01*np.random.normal(size=N) - 300
#Do the self-consistent density estimate
myPDF,axes = fastKDE.pdf(var1,var2)
#Extract the axes from the axis list
v1,v2 = axes
#Plot contours of the PDF, which should be a set of concentric ellipsoids centered on
#(0.1, -300). Comparatively, the y axis range should be tiny and the x axis range
#should be large
PP.contour(v1,v2,myPDF)
PP.show()
I'm able to find the KDE for any point within the limits of the data, but how do I find the KDE for, say, the point (0,300) without having to include it in var1 and var2? I don't want the KDE to be calculated with this data point; I want to know the KDE at that point.
I guess what I really want to be able to do is give the fastKDE a histogram of the data, so that I can set its axes myself. I just don't know if this is possible?
Cheers
I, too, have been experimenting with this code and have run into the same issues. What I've done (in lieu of a good N-D extrapolator) is to build a KDTree (with scipy.spatial) from the grid points that fastKDE returns and find the nearest grid point to the point I want to evaluate. I then look up the corresponding pdf value at that point (it should be small near the edge of the pdf grid, if not identically zero) and assign that value accordingly.
I came across this post while searching for a solution to this problem. Similar to building a KDTree, you could just calculate the step size in every grid dimension and then get the index of your query point by subtracting the start of the axis from the point value and dividing by the step size of that dimension; finally, round the result and convert it to an integer, and voila. So, for example, in 1D:
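A rough sketch of that KDTree lookup follows, written against a small made-up grid rather than real fastKDE output so that it runs on its own. The helper name nearest_grid_pdf is invented, and it assumes the pdf array is indexed as pdf[i2, i1], i.e. with shape (len(v2), len(v1)); check that against the arrays you actually get back.

```python
import numpy as np
from scipy.spatial import cKDTree


def nearest_grid_pdf(pdf, v1, v2, points):
    """Return the pdf value at the grid point nearest to each (x1, x2) in `points`.

    Assumes pdf has shape (len(v2), len(v1)).
    """
    g1, g2 = np.meshgrid(v1, v2)  # both have shape (len(v2), len(v1))
    tree = cKDTree(np.column_stack([g1.ravel(), g2.ravel()]))
    _, flat_index = tree.query(np.atleast_2d(points))
    return np.asarray(pdf).ravel()[flat_index]


# Made-up 3 x 2 grid standing in for the axes and pdf returned by fastKDE.
v1 = np.array([0.0, 1.0, 2.0])
v2 = np.array([10.0, 20.0])
pdf = np.arange(6, dtype=float).reshape(2, 3)

inside = nearest_grid_pdf(pdf, v1, v2, [1.1, 19.0])       # nearest grid point is (1, 20)
outside = nearest_grid_pdf(pdf, v1, v2, [100.0, 100.0])   # falls back to the corner (2, 20)
```

Note that the nearest-neighbour search uses plain Euclidean distance, so if the two axes have very different scales (as in the question's example) it may be worth rescaling them before building the tree.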
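def fastkde_test(test_x):
    #num_p (the number of grid points) is defined elsewhere
    kde, axes = fastKDE.pdf(test_x, numPoints=num_p)
    #len(axes) grid points span len(axes)-1 steps
    x_step = (max(axes)-min(axes)) / (len(axes)-1)
    x_ind = np.int32(np.round((test_x-min(axes)) / x_step))
    return kde[x_ind]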
where test_x in this case is both the set for defining the KDE and the query set. Doing it this way is faster by about a factor of 10 in my case (at least in 1D, higher dimensions not yet tested) and does basically the same thing as the KDTree query.
I hope this helps anyone coming across this problem in the future, as I just did.
Edit: if you're querying points outside of the range over which the KDE was calculated, this method can of course only give you the same result as the KDTree query, namely the value at the corresponding border of your KDE grid. You would, however, have to hardcode this by clipping the resulting x_ind at the highest index, i.e. len(axes)-1 (and at 0 on the lower side).
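The index arithmetic with the clipping from the edit can be sketched with plain NumPy as below. It uses a synthetic, evenly spaced axis instead of real fastKDE output, and the function name lookup_on_grid is made up for illustration.

```python
import numpy as np


def lookup_on_grid(axis, pdf, query):
    """Nearest-grid-point lookup on an evenly spaced 1D axis, clipped to the grid."""
    axis = np.asarray(axis, dtype=float)
    step = (axis[-1] - axis[0]) / (len(axis) - 1)
    idx = np.rint((np.asarray(query, dtype=float) - axis[0]) / step).astype(int)
    # Queries outside the grid are mapped to the first or last grid point.
    idx = np.clip(idx, 0, len(axis) - 1)
    return np.asarray(pdf)[idx]


# Synthetic stand-in for the axis and pdf returned by a 1D KDE.
axis = np.linspace(0.0, 1.0, 11)
pdf = axis ** 2
vals = lookup_on_grid(axis, pdf, [0.52, 5.0, -3.0])
```

Here 0.52 maps to the grid point 0.5, while 5.0 and -3.0 lie outside the axis and are clipped to the last and first grid point respectively.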

Polar transformation of pandas DataFrame

I have a pandas.DataFrame 2048 by 2048 with index and columns representing y and x coordinates respectively.
I want to make an axis transformation and get to polar coordinates, making a new pandas.DataFrame with index and columns representing radius and polar angle.
The only way I can think of is to access the dataframe values one by one, calculating the radius and angle and then setting the corresponding value in the new dataframe, but it is extremely slow, since element-wise operations are not that fast in pandas. It's still slow even if I perform the operation row by row.
Is there a better way to do that without writing my own CPython extension in C?
This should not be problematic if your dataframe is built as I understand it (though I think I am wrong). To build an example:
from __future__ import division
import numpy as np, pandas as pd
index = np.arange(1,2049,dtype=float)
cols = np.arange(2050,4098,dtype=float)
df = pd.DataFrame(index=index, columns=cols)
# now calculate angle and radius, then set in new dataframe
phi = np.arctan(df.columns/df.index)
r = np.power( np.power(df.index,2) + np.power(df.columns,2), 0.5 )
df_polar = pd.DataFrame(index=r, columns=phi)
While this agrees with what you stated, I think I have missed something here. If this is not right, can you clarify?
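If the goal is instead to resample the values onto a regular (radius, angle) grid, one possible approach, which is not what the snippet above does, is scipy.ndimage.map_coordinates. The sketch below makes several assumptions that may not match your data: the index and columns are evenly spaced pixel coordinates, the origin of the polar system is the centre of the array, and bilinear interpolation is good enough. The function name to_polar and the grid sizes are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.ndimage import map_coordinates


def to_polar(df, n_r=64, n_phi=90):
    """Resample a 2D DataFrame onto a regular (radius, angle) grid.

    Assumes evenly spaced pixel coordinates and an origin at the array centre.
    """
    values = df.to_numpy(dtype=float)
    ny, nx = values.shape
    cy, cx = (ny - 1) / 2.0, (nx - 1) / 2.0
    r = np.linspace(0.0, min(cy, cx), n_r)
    phi = np.linspace(0.0, 2.0 * np.pi, n_phi, endpoint=False)
    rr, pp = np.meshgrid(r, phi, indexing="ij")
    # Fractional row/column positions of every (r, phi) sample.
    rows = cy + rr * np.sin(pp)
    cols = cx + rr * np.cos(pp)
    polar = map_coordinates(values, [rows, cols], order=1, mode="nearest")
    return pd.DataFrame(polar, index=r, columns=phi)


# Small demo: a field equal to the column number, resampled to 8 radii x 12 angles.
demo = pd.DataFrame(np.tile(np.arange(32, dtype=float), (32, 1)))
demo_polar = to_polar(demo, n_r=8, n_phi=12)
```

All samples are computed in one vectorised call, so no Python-level loop over the 2048 x 2048 elements is needed.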

Python-Integrate then save text file

To all:
This question pertains to some MS research I am doing.
What I want to do is seemingly simple. I have a text file of values( time, .....values). I then want to integrate from 0 to those values for all times, then save that value to a text file.
from numpy import *
from pylab import *
import os, sys, shutil
import math
#######################
#Load Data
data = loadtxt('wh.txt')
#Open file to save plots to
shutil.rmtree("wh_files")
os.makedirs("wh_files")
os.chdir("wh_files")
for i in range(0,100,1):
    int = trapz(data[i,:],axis=0)
    print int
    savetxt('int.txt', int)
When I run this I get the following error:
File "integral.py", line 19, in
savetxt('int.txt', int)
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/numpy/lib/npyio.py", line 960, in savetxt
ncol = X.shape[1]
IndexError: tuple index out of range
I have been trying to solve this problem for a few days on and off now, but have yet to find a solution. I appreciate any help/comments you can give.
The value of int is a float but savetxt requires an array. You want to create a numpy array for all the integration results, then save it at the end using savetxt. I think something like this should do the trick:
int_array = apply_along_axis(trapz, 0, data)
savetxt('int.txt', int_array)
Keep in mind that this (along with your original example) will sum the time fields, rather than integrating across time. This will only yield valid results if the spacing along the x-axis is 1. Otherwise you'll want to do something like this:
t = data[0,:]
int_array = apply_along_axis(trapz, 0, data[1:,:], t)
savetxt('int.txt', int_array)
Assuming the time fields are numeric.
edit: Further explanation of 2nd code section follows.
You're using the trapezoidal rule to integrate a variety of values. This is an integration approximation technique that works by summing the average of successive y-values on a curve multiplied by the change in x between the two y-values. This amounts to calculating the area of a trapezoid that connects the two y-values and the x-axis.
It's not completely clear from your question, but it seemed that you were integrating the values across time, so that the x-axis would represent time. The x-values need to be incorporated to get the correct area of each trapezoid (the area of each trapezoid is (x2 - x1) * (y2 + y1) / 2 and the final integration result is the sum of all such areas).
The simplest way to incorporate these x-axis values is to pass them into the trapz function as the x parameter (see the docstring). In the example above, I used t = data[0,:] as the array of x values for this purpose.
Another caveat: if the spacing between all x values is the same (so that x2 - x1 is a constant), you can save yourself some calculation by pulling this out of the summation and simply multiplying by it at the end. This functionality is available through the trapz function with the dx parameter. So, if your time measurements were taken every 30 seconds, for example, you could replace the 2nd line in my 2nd example with:
int_array = apply_along_axis(trapz, 0, data[1:,:], None, 30)
Hope that helps.
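As a self-contained illustration of the x parameter, here is a small sketch with made-up numbers. It assumes a layout that may differ from your file: one row per time step, with time in the first column and the measured values in the remaining columns. The result is a 1-D array, which savetxt accepts (it is written to an in-memory buffer here instead of int.txt).

```python
import io
import numpy as np

# np.trapz was renamed np.trapezoid in NumPy 2.0; use whichever is available.
trapezoid = getattr(np, "trapezoid", None) or np.trapz


def integrate_over_time(data):
    """Integrate every value column over time (assumed to be column 0)."""
    t = data[:, 0]
    return trapezoid(data[:, 1:], x=t, axis=0)


# Made-up data: time in seconds, then two value columns.
data = np.array([
    [0.0, 1.0, 0.0],
    [30.0, 1.0, 30.0],
    [60.0, 1.0, 60.0],
])
int_array = integrate_over_time(data)  # one integral per value column

# savetxt needs an array, not a bare float; a 1-D array of results works.
buffer = io.StringIO()
np.savetxt(buffer, int_array)
```

For these numbers the constant column integrates to 60 and the ramp to 1800, which is easy to check by hand with the trapezoid area formula above.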

Interpolation over an irregular grid

So, I have three numpy arrays which store latitude, longitude, and some property value on a grid -- that is, I have LAT(y,x), LON(y,x), and, say temperature T(y,x), for some limits of x and y. The grid isn't necessarily regular -- in fact, it's tripolar.
I then want to interpolate these property (temperature) values onto a bunch of different lat/lon points (stored as lat1(t), lon1(t), for about 10,000 t...) which do not fall on the actual grid points. I've tried matplotlib.mlab.griddata, but that takes far too long (it's not really designed for what I'm doing, after all). I've also tried scipy.interpolate.interp2d, but I get a MemoryError (my grids are about 400x400).
Is there any sort of slick, preferably fast way of doing this? I can't help but think the answer is something obvious... Thanks!!
Try the combination of inverse-distance weighting and scipy.spatial.KDTree described in SO inverse-distance-weighted-idw-interpolation-with-python. Kd-trees work nicely in 2d 3d ..., inverse-distance weighting is smooth and local, and the k= number of nearest neighbours can be varied to trade off speed / accuracy.
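A minimal sketch of that combination is shown below. It is not the code from the linked post; the function name idw_interpolate, its parameters and the toy points are made up for illustration. It also measures distance in plain coordinate units, which is only an approximation for lat/lon data, so projecting the coordinates first may be advisable.

```python
import numpy as np
from scipy.spatial import cKDTree


def idw_interpolate(xy_known, values, xy_query, k=8, power=2.0, eps=1e-12):
    """Inverse-distance-weighted average of the k nearest known points."""
    values = np.asarray(values, dtype=float)
    tree = cKDTree(xy_known)
    k = min(k, len(xy_known))
    dist, idx = tree.query(xy_query, k=k)
    dist = np.reshape(dist, (len(xy_query), -1))  # also handles k == 1
    idx = np.reshape(idx, (len(xy_query), -1))
    # eps avoids division by zero when a query coincides with a known point.
    weights = 1.0 / np.maximum(dist, eps) ** power
    weights /= weights.sum(axis=1, keepdims=True)
    return np.sum(weights * values[idx], axis=1)


# Toy example: four known points on a square.
xy_known = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
values = np.array([0.0, 10.0, 10.0, 40.0])
xy_query = np.array([[1.0, 1.0], [2.0, 0.0]])
est = idw_interpolate(xy_known, values, xy_query, k=4)
```

The tree is built once and all query points are handled in one call, so roughly 10,000 query points against a 400x400 grid (flattened to an array of shape (160000, 2)) should be well within reach.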
There is a nice inverse distance example by Roger Veciana i Rovira, along with some code using GDAL to write to GeoTIFF if you're into that.
This is of course interpolation onto a regular grid, but it should work if you first project the data to a pixel grid with pyproj or something similar, all the while being careful about which projection is used for your data.
A copy of his algorithm and example script:
from math import pow
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
def pointValue(x,y,power,smoothing,xv,yv,values):
    nominator=0
    denominator=0
    for i in range(0,len(values)):
        dist = sqrt((x-xv[i])*(x-xv[i])+(y-yv[i])*(y-yv[i])+smoothing*smoothing)
        #If the point is really close to one of the data points, return the data point value to avoid singularities
        if(dist<0.0000000001):
            return values[i]
        nominator=nominator+(values[i]/pow(dist,power))
        denominator=denominator+(1/pow(dist,power))
    #Return NODATA if the denominator is zero
    if denominator > 0:
        value = nominator/denominator
    else:
        value = -9999
    return value
def invDist(xv,yv,values,xsize=100,ysize=100,power=2,smoothing=0):
    valuesGrid = np.zeros((ysize,xsize))
    for x in range(0,xsize):
        for y in range(0,ysize):
            valuesGrid[y][x] = pointValue(x,y,power,smoothing,xv,yv,values)
    return valuesGrid
if __name__ == "__main__":
    power=1
    smoothing=20
    #Creating some data, with each coordinate and the values stored in separate lists
    xv = [10,60,40,70,10,50,20,70,30,60]
    yv = [10,20,30,30,40,50,60,70,80,90]
    values = [1,2,2,3,4,6,7,7,8,10]
    #Creating the output grid (100x100, in the example)
    ti = np.linspace(0, 100, 100)
    XI, YI = np.meshgrid(ti, ti)
    #Creating the interpolation function and populating the output matrix value
    ZI = invDist(xv,yv,values,100,100,power,smoothing)
    # Plotting the result
    n = plt.Normalize(0.0, 100.0)
    plt.subplot(1, 1, 1)
    plt.pcolor(XI, YI, ZI)
    plt.scatter(xv, yv, 100, values)
    plt.title('Inv dist interpolation - power: ' + str(power) + ' smoothing: ' + str(smoothing))
    plt.xlim(0, 100)
    plt.ylim(0, 100)
    plt.colorbar()
    plt.show()
There's a bunch of options here, which one is best will depend on your data...
However I don't know of an out-of-the-box solution for you
You say your input data is from tripolar data. There are three main cases for how this data could be structured.
1. Sampled from a 3d grid in tripolar space, projected back to 2d LAT, LON data.
2. Sampled from a 2d grid in tripolar space, projected into 2d LAT LON data.
3. Unstructured data in tripolar space projected into 2d LAT LON data.
The easiest of these is 2. Instead of interpolating in LAT LON space, "just" transform your point back into the source space and interpolate there.
Another option that works for 1 and 2 is to search for the cells that map from tripolar space to cover your sample point. (You can use a BSP or grid-type structure to speed up this search.) Pick one of the cells, and interpolate inside it.
Finally, there's a heap of unstructured interpolation options, but they tend to be slow.
A personal favourite of mine is to use a linear interpolation of the nearest N points; finding those N points can again be done with gridding or a BSP. Another good option is to Delaunay triangulate the unstructured points and interpolate on the resulting triangular mesh.
Personally if my mesh was case 1, I'd use an unstructured strategy as I'd be worried about having to handle searching through cells with overlapping projections. Choosing the "right" cell would be difficult.
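For the Delaunay option, SciPy already provides scipy.interpolate.LinearNDInterpolator, which triangulates the points once and can then be called repeatedly. The sketch below uses made-up scattered points and a made-up linear "temperature" field purely so that the result is easy to check; it is an illustration of the idea, not a recommendation specific to tripolar grids.

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

rng = np.random.default_rng(0)

# Hypothetical scattered sample locations; the four corners are added so that
# the convex hull covers the whole unit square.
corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
points = np.vstack([corners, rng.random((50, 2))])
temperature = 2.0 * points[:, 0] + 3.0 * points[:, 1]  # made-up linear field

# The triangulation is built once here ...
interp = LinearNDInterpolator(points, temperature)

# ... and can then be evaluated at any number of query points.
queries = np.array([[0.5, 0.5], [0.25, 0.75], [2.0, 2.0]])
result = interp(queries)
```

Linear interpolation on triangles reproduces a linear field exactly inside the convex hull; query points outside the hull, such as the last one here, come back as NaN by default.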
I suggest taking a look at the interpolation features of GRASS (an open source GIS package) (http://grass.ibiblio.org/gdp/html_grass62/v.surf.bspline.html). It's not in Python, but you can reimplement it or interface with the C code.
Am I right in thinking your data grids look something like this (red is the old data, blue is the new interpolated data)?
(Image: http://www.geekops.co.uk/photos/0000-00-02%20%28Forum%20images%29/DataSeparation.png)
This might be a slightly brute-force-ish approach, but what about rendering your existing data as a bitmap (opengl will do simple interpolation of colours for you with the right options configured and you could render the data as triangles which should be fairly fast). You could then sample pixels at the locations of the new points.
Alternatively, you could sort your first set of points spatially and then find the closest old points surrounding your new point and interpolate based on the distances to those points.
There is a FORTRAN library called BIVAR, which is very suitable for this problem. With a few modifications you can make it usable in python using f2py.
From the description:
BIVAR is a FORTRAN90 library which interpolates scattered bivariate data, by Hiroshi Akima.
BIVAR accepts a set of (X,Y) data points scattered in 2D, with associated Z data values, and is able to construct a smooth interpolation function Z(X,Y), which agrees with the given data, and can be evaluated at other points in the plane.
