Python fastKDE beyond limits of data points

Python fastKDE beyond limits of data points - python

I'm trying to use the fastKDE package (https://pypi.python.org/pypi/fastkde/1.0.8) to find the KDE of a point in a 2D plot. However, I want to know the KDE beyond the limits of the data points, and cannot figure out how to do this.
Using the code listed on the site linked above;
#!python
import numpy as np
from fastkde import fastKDE
import pylab as PP
#Generate two random variables dataset (representing 100000 pairs of datapoints)
N = 2e5
var1 = 50*np.random.normal(size=N) + 0.1
var2 = 0.01*np.random.normal(size=N) - 300
#Do the self-consistent density estimate
myPDF,axes = fastKDE.pdf(var1,var2)
#Extract the axes from the axis list
v1,v2 = axes
#Plot contours of the PDF should be a set of concentric ellipsoids centered on
#(0.1, -300) Comparitively, the y axis range should be tiny and the x axis range
#should be large
PP.contour(v1,v2,myPDF)
PP.show()
I'm able to find the KDE for any point within the limits of the data, but how do I find the KDE for say the point (0,300), without having to include it into var1 and var2. I don't want the KDE to be calculated with this data point, I want to know the KDE at that point.
I guess what I really want to be able to do is give the fastKDE a histogram of the data, so that I can set its axes myself. I just don't know if this is possible?
Cheers

I, too, have been experimenting with this code and have run into the same issues. What I've done (in lieu of a good N-D extrapolator) is to build a KDTree (with scipy.spatial) from the grid points that fastKDE returns and find the nearest grid point to the point I was to evaluate. I then lookup the corresponding pdf value at that point (it should be small near the edge of the pdf grid if not identically zero) and assign that value accordingly.

I came across this post while searching for a solution of this problem. Similiar to the building of a KDTree you could just calculate your stepsize in every griddimension, and then get the index of your query point by just subtracting the point value with the beginning of your axis and divide by the stepsize of that dimension, finally round it off, turn it to integer and voila. So for example in 1D:
def fastkde_test(test_x):
kde, axes = fastKDE.pdf(test_x, numPoints=num_p)
x_step = (max(axes)-min(axes)) / len(axes)
x_ind = np.int32(np.round((test_x-min(axes)) / x_step))
return kde[x_ind]
where test_x in this case is both the set for defining the KDE and the query set. Doing it this way is marginally faster by a factor of 10 in my case (at least in 1D, higher dimensions not yet tested) and does basically the same thing as the KDTree query.
I hope this helps anyone coming across this problem in the future, as I just did.
Edit: if your querying points outside of the range over which the KDE was calculated this method of course can only give you the same result as the KDTree query, namely the corresponding border of your KDE-grid. You would however have to hardcode this by cutting the resulting x_ind at the highest index, i.e. `len(axes)-1'.

Related

Is there a way to use function to draw a square with a minimum size over densest region of points?

The Problem:
Using NumPy, I have created an array of random points within a range.
import numpy as np
min_square = 5
positions = (np.random.random(size=(100, 2)) - 0.5) * 2 * container_radius
Where container_radius is an integer and min_square is an integer.
Following that, using matplotlib, I plot the points on a graph.
import matplotlib.pyplot as plt
plt.plot(positions[:, 0], positions[:, 1], 'r.')
plt.show()
This graph shows me the distribution of the points in relation to each other.
What I am looking for is a method to implement something similar to or exactly a k-d tree to draw a rectangle over the densest area of the scatter plot with a defined minimum for the size.
This would be done using plt.gca().add_patch(plt.Rectangle((x, y), width=square_size, height=square_side, fill=None where square_side is the defined by the density function and is at least a minimum sizeo of min_square.
Attempts to Solve the Problem:
So far, I have created my own sort of density function that is within my understanding of Python and easy enough to code without lagging my computer too hard.
The solve comes in the form of creating an additional predefined variable intervals which is an integer.
Using what I had so far, I define a function to calculate the densities by checking if the points are within a range of floats.
# clb stands for calculate_lower_bound
def clb(x):
return -1 * container_radius + (x * 2 * container_radius - min_square) / (intervals - 1)
# crd stands for calculate_regional_density
def crd(x, y):
return np.where(np.logical_and(\
np.logical_and(positions[:, 0] >= clb(x), positions[:, 0] < clb(x) + min_square),\
np.logical_and(positions[:, 1] >= clb(y), positions[:, 1] < clb(y) + min_square)))[0].shape[0]
Then, I create a NumPy array of size size=(intervals, intervals) and pass the indices of the array (I have another question about this as I am currently using a quite inefficient method) as inputs into crd(x,y) and store the values in another array called densities. Then using some method, I calculate the maximum value in my densities array and draw the rectangle using some pretty straightforward code that I do not think is necessary to include here as it is not the problem.
What I Looking For:
I am looking for some function, f(x), that computes the dimensions and coordinates of a square encompassing the densest region on a scatterplot graph. The function would have access to all the variables it needs such as positions, min_square, etc. If you could use informative variable names or explain what each variable means, that would be a great help as well.
Other (Potentially) Important Notes:
I am looking for something that gets the job done in a reasonable time. In most scenarios, I am going to be working with around 10000 points and I need to calculate the densest region around 100 times so the function needs to be efficient enough so that the task completes within around 10-20 seconds.
As such, approximations using formulas like the example I have shown are completely valid as long as they implement well and are able to grow the dimensions of the square larger if necessary.
Thanks!

How to normalise plotted points and get a circle?

Given 2000 random points in a unit circle (using numpy.random.normal(0,1)), I want to normalize them such that the output is a circle, how do I do that?
I was requested to show my efforts. This is part of a larger question: Write a program that samples 2000 points uniformly from the circumference of a unit circle. Plot and show it is indeed picked from the circumference. To generate a point (x,y) from the circumference, sample (x,y) from std normal distribution and normalise them.
I'm almost certain my code isn't correct, but this is where I am up to. Any advice would be helpful.
This is the new updated code, but it still doesn't seem to be working.
import numpy as np
import matplotlib.pyplot as plot
def plot():
xy = np.random.normal(0,1,(2000,2))
for i in range(2000):
s=np.linalg.norm(xy[i,])
xy[i,]=xy[i,]/s
plot.plot(xy)
plot.show()
I think the problem is in
plot.plot(xy)
even if I use
plot.plot(xy[:,0],xy[:,1])
it doesn't work.

Connected lines are not a good visualization here. You essentially connect random points on the circle. Since you do this quite often, you will get a filled circle. Try drawing points instead.
Also avoid name space mangling. You import matplotlib.pyplot as plot and also name your function plot. This will lead to name conflicts.
import numpy as np
import matplotlib.pyplot as plt
def plot():
xy = np.random.normal(0,1,(2000,2))
for i in range(2000):
s=np.linalg.norm(xy[i,])
xy[i,]=xy[i,]/s
fig, ax = plt.subplots(figsize=(5,5))
# scatter draws dots instead of lines
ax.scatter(xy[:,0], xy[:,1])
If you use dots instead, you will see that your points indeed lie on the unit circle.

Your code has many problems:
Why using np.random.normal (a gaussian distribution) when the problem text is about uniform (flat) sampling?
To pick points on a circle you need to correlate x and y; i.e. randomly sampling x and y will not give a point on the circle as x**2+y**2 must be 1 (for example for the unit circle centered in (x=0, y=0)).
A couple of ways to get the second point is to either "project" a random point from [-1...1]x[-1...1] on the unit circle or to pick instead uniformly the angle and compute a point on that angle on the circle.

First of all, if you look at the documentation for numpy.random.normal (and, by the way, you could just use numpy.random.randn), it takes an optional size parameter, which lets you create as large of an array as you'd like. You can use this to get a large number of values at once. For example: xy = numpy.random.normal(0,1,(2000,2)) will give you all the values that you need.
At that point, you need to normalize them such that xy[:,0]**2 + xy[:,1]**2 == 1. This should be relatively trivial after computing what xy[:,0]**2 + xy[:,1]**2 is. Simply using norm on each dimension separately isn't going to work.

Usual boilerplate
import numpy as np
import matplotlib.pyplot as plt
generate the random sample with two rows, so that it's more convenient to refer to x's and y's
xy = np.random.normal(0,1,(2,2000))
normalize the random sample using a library function to compute the norm, axis=0 means consider the subarrays obtained varying the first array index, the result is a (2000) shaped array that can be broadcasted to xy /= to have points with unit norm, hence lying on the unit circle
xy /= np.linalg.norm(xy, axis=0)
Eventually, the plot... here the key is the add_subplot() method, and in particular the keyword argument aspect='equal' that requires that the scale from user units to output units it's the same for both axes
plt.figure().add_subplot(111, aspect='equal').scatter(xy[0], xy[1])
pt.show()
to have

How do I limit the interpolation region in the InterpolatedUnivariateSpline in Python when given non-uniform samples?

I'm trying to get a nice upsampler using Python when I have non-uniform spaced inputs. Any suggestions would be helpful. I've tried a number of interp functions. Here's an example:
from scipy.interpolate import InterpolatedUnivariateSpline
from numpy import linspace, arange, append
from matplotlib.pyplot import plot
F=[0, 1000,1500,2000,2500,3000,3500,4000,4500,5000,5500,22050]
M=[0.,2.85,2.49,1.65,1.55,1.81,1.35,1.00,1.13,1.58,1.21,0.]
ff=linspace(F[0],F[1],10)
for i in arange(2, len(F)):
ff=append(ff,linspace(F[i-1],F[i], 10))
aa=InterpolatedUnivariateSpline(x=F,y=M,k=2);
mm=aa(ff)
plot(F,M,'r-o'); plot(ff,mm,'bo'); show()
This is the plot I get:
I need to get interpolated values that don't go below 0. Note that the blue dots go below zero. The red line represents the original F vs. M data. If I use k=1 (piece-wise linear interp) then I get good values as shown here:
aa=InterpolatedUnivariateSpline(x=F,y=M,k=1)
mm=aa(ff); plot(F,M,'r-o');plot(ff,mm,'bo'); show()
The problem is that I need to have a "smooth" interpolation and not the piece-wise value. Does anyone know if the bbox argument in InterpolatedUnivarientSpline helps to fix that? I cant find any documentation on what bbox does. Is there another easier way to accomplish this?
Thanks in advance for any help.

Positivity-preserving interpolation is hard (if it wasn't, there wouldn't be a bunch of papers written about it). The splines of low degree (2, 3) usually do pretty well in this regard, but your data has that large gap in it, and it happens to be at the end of data range, making things worse.
One solution is to do interpolation in two steps: first upsample the data by piecewise linear interpolation, then interpolate new data with a smooth spline (I'll use cubic spline below, though quadratic also works).
The gap_size array records how large each gap is, relative to the smallest one. In subsequent loop, uniformly spaced points are replaced in large gaps (those that are at least twice the size of smallest one). The result is F_new, a nearly-uniform better grid that still includes the original points. The corresponding M values for it are generated by a piecewise linear spline.
Subsequent cubic interpolation produces a smooth curve that stays positive.
F = [0, 1000,1500,2000,2500,3000,3500,4000,4500,5000,5500,22050]
M = [0.,2.85,2.49,1.65,1.55,1.81,1.35,1.00,1.13,1.58,1.21,0.]
gap_size = np.diff(F) // np.diff(F).min()
F_new = []
for i in range(len(F)-1):
F_new.extend(np.linspace(F[i], F[i+1], gap_size[i], endpoint=False))
F_new.append(F[-1])
pl_spline = InterpolatedUnivariateSpline(F, M, k=1);
M_new = pl_spline(F_new)
smooth_spline = InterpolatedUnivariateSpline(F_new, M_new, k=3)
ff = np.linspace(F[0], F[-1], 100)
plt.plot(F, M, 'ro')
plt.plot(ff, smooth_spline(ff), 'b')
plt.show()
Of course, no tricks can hide the truth that we don't know what happens between 5500 and 22050 (Hz, I presume), the nearly-linear part is just a placeholder.

Python: Choose the n points better distributed from a bunch of points

I have a numpy array of points in an XY plane like:
I want to select the n points (let's say 100) better distributed from all these points. This is, I want the density of points to be constant anywhere.
Something like this:
Is there any pythonic way or any numpy/scipy function to do this?

#EMS is very correct that you should give a lot of thought to exactly what you want.
There more sophisticated ways to do this (EMS's suggestions are very good!), but a brute-force-ish approach is to bin the points onto a regular, rectangular grid and draw a random point from each bin.
The major downside is that you won't get the number of points you ask for. Instead, you'll get some number smaller than that number.
A bit of creative indexing with pandas makes this "gridding" approach quite easy, though you can certainly do it with "pure" numpy, as well.
As an example of the simplest possible, brute force, grid approach: (There's a lot we could do better, here.)
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
total_num = 100000
x, y = np.random.normal(0, 1, (2, total_num))
# We'll always get fewer than this number for two reasons.
# 1) We're choosing a square grid, and "subset_num" may not be a perfect square
# 2) There won't be data in every cell of the grid
subset_num = 1000
# Bin points onto a rectangular grid with approximately "subset_num" cells
nbins = int(np.sqrt(subset_num))
xbins = np.linspace(x.min(), x.max(), nbins+1)
ybins = np.linspace(y.min(), y.max(), nbins+1)
# Make a dataframe indexed by the grid coordinates.
i, j = np.digitize(y, ybins), np.digitize(x, xbins)
df = pd.DataFrame(dict(x=x, y=y), index=[i, j])
# Group by which cell the points fall into and choose a random point from each
groups = df.groupby(df.index)
new = groups.agg(lambda x: np.random.permutation(x)[0])
# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].plot(x, y, 'k.')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(new.x, new.y, 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(new)))
plt.setp(axes, aspect=1, adjustable='box-forced')
fig.tight_layout()
plt.show()
Loosely based on #EMS's suggestion in a comment, here's another approach.
We'll calculate the density of points using a kernel density estimate, and then use the inverse of that as the probability that a given point will be chosen.
scipy.stats.gaussian_kde is not optimized for this use case (or for large numbers of points in general). It's the bottleneck here. It's possible to write a more optimized version for this specific use case in several ways (approximations, special case here of pairwise distances, etc). However, that's beyond the scope of this particular question. Just be aware that for this specific example with 1e5 points, it will take a minute or two to run.
The advantage of this method is that you get the exact number of points that you asked for. The disadvantage is that you are likely to have local clusters of selected points.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
total_num = 100000
subset_num = 1000
x, y = np.random.normal(0, 1, (2, total_num))
# Let's approximate the PDF of the point distribution with a kernel density
# estimate. scipy.stats.gaussian_kde is slow for large numbers of points, so
# you might want to use another implementation in some cases.
xy = np.vstack([x, y])
dens = gaussian_kde(xy)(xy)
# Try playing around with this weight. Compare 1/dens, 1-dens, and (1-dens)**2
weight = 1 / dens
weight /= weight.sum()
# Draw a sample using np.random.choice with the specified probabilities.
# We'll need to view things as an object array because np.random.choice
# expects a 1D array.
dat = xy.T.ravel().view([('x', float), ('y', float)])
subset = np.random.choice(dat, subset_num, p=weight)
# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].scatter(x, y, c=dens, edgecolor='')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(subset['x'], subset['y'], 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(subset)))
plt.setp(axes, aspect=1, adjustable='box-forced')
fig.tight_layout()
plt.show()

Unless you give a specific criterion for defining "better distributed" we can't give a definite answer.
The phrase "constant density of points anywhere" is also misleading, because you have to specify the empirical method for calculating density. Are you approximating it on a grid? If so, the grid size will matter, and points near the boundary won't be correctly represented.
A different approach might be as follows:
Calculate the distance matrix between all pairs of points
Treating this distance matrix as a weighted network, calculate some measure of centrality for each point in the data, such as eigenvalue centrality, Betweenness centrality or Bonacich centrality.
Order the points in descending order according to the centrality measure, and keep the first 100.
Repeat steps 1-4 possibly using a different notion of "distance" between points and with different centrality measures.
Many of these functions are provided directly by SciPy, NetworkX, and scikits.learn and will work directly on a NumPy array.
If you are definitely committed to thinking of the problem in terms of regular spacing and grid density, you might take a look at quasi-Monte Carlo methods. In particular, you could try to compute the convex hull of the set of points and then apply a QMC technique to regularly sample from anywhere within that convex hull. But again, this privileges the exterior of the region, which should be sampled far less than the interior.
Yet another interesting approach would be to simply run the K-means algorithm on the scattered data, with a fixed number of clusters K=100. After the algorithm converges, you'll have 100 points from your space (the mean of each cluster). You could repeat this several times with different random starting points for the cluster means and then sample from that larger set of possible means. Since your data do not appear to actually cluster into 100 components naturally, the convergence of this approach won't be very good and may require running the algorithm for a large number of iterations. This also has the downside that the resulting set of 100 points are not necessarily points that come form the observed data, and instead will be local averages of many points.

This method to iteratively pick the point from the remaining points which has the lowest minimum distance to the already picked points has terrible time complexity, but produces pretty uniformly distributed results:
from numpy import array, argmax, ndarray
from numpy.ma import vstack
from numpy.random import normal, randint
from scipy.spatial.distance import cdist
def well_spaced_points(points: ndarray, num_points: int):
"""
Pick `num_points` well-spaced points from `points` array.
:param points: An m x n array of m n-dimensional points.
:param num_points: The number of points to pick.
:rtype: ndarray
:return: A num_points x n array of points from the original array.
"""
# pick a random point
current_point_index = randint(0, num_points)
picked_points = array([points[current_point_index]])
remaining_points = vstack((
points[: current_point_index],
points[current_point_index + 1:]
))
# while there are more points to pick
while picked_points.shape[0] < num_points:
# find the furthest point to the current point
distance_pk_rmn = cdist(picked_points, remaining_points)
min_distance_pk = distance_pk_rmn.min(axis=0)
i_furthest = argmax(min_distance_pk)
# add it to picked points and remove it from remaining
picked_points = vstack((
picked_points,
remaining_points[i_furthest]
))
remaining_points = vstack((
remaining_points[: i_furthest],
remaining_points[i_furthest + 1:]
))
return picked_points

Interpolation over an irregular grid

So, I have three numpy arrays which store latitude, longitude, and some property value on a grid -- that is, I have LAT(y,x), LON(y,x), and, say temperature T(y,x), for some limits of x and y. The grid isn't necessarily regular -- in fact, it's tripolar.
I then want to interpolate these property (temperature) values onto a bunch of different lat/lon points (stored as lat1(t), lon1(t), for about 10,000 t...) which do not fall on the actual grid points. I've tried matplotlib.mlab.griddata, but that takes far too long (it's not really designed for what I'm doing, after all). I've also tried scipy.interpolate.interp2d, but I get a MemoryError (my grids are about 400x400).
Is there any sort of slick, preferably fast way of doing this? I can't help but think the answer is something obvious... Thanks!!

Try the combination of inverse-distance weighting and
scipy.spatial.KDTree
described in SO
inverse-distance-weighted-idw-interpolation-with-python.
Kd-trees
work nicely in 2d 3d ..., inverse-distance weighting is smooth and local,
and the k= number of nearest neighbours can be varied to tradeoff speed / accuracy.

There is a nice inverse distance example by Roger Veciana i Rovira along with some code using GDAL to write to geotiff if you're into that.
This is of coarse to a regular grid, but assuming you project the data first to a pixel grid with pyproj or something, all the while being careful what projection is used for your data.
A copy of his algorithm and example script:
from math import pow
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
def pointValue(x,y,power,smoothing,xv,yv,values):
nominator=0
denominator=0
for i in range(0,len(values)):
dist = sqrt((x-xv[i])*(x-xv[i])+(y-yv[i])*(y-yv[i])+smoothing*smoothing);
#If the point is really close to one of the data points, return the data point value to avoid singularities
if(dist<0.0000000001):
return values[i]
nominator=nominator+(values[i]/pow(dist,power))
denominator=denominator+(1/pow(dist,power))
#Return NODATA if the denominator is zero
if denominator > 0:
value = nominator/denominator
else:
value = -9999
return value
def invDist(xv,yv,values,xsize=100,ysize=100,power=2,smoothing=0):
valuesGrid = np.zeros((ysize,xsize))
for x in range(0,xsize):
for y in range(0,ysize):
valuesGrid[y][x] = pointValue(x,y,power,smoothing,xv,yv,values)
return valuesGrid
if __name__ == "__main__":
power=1
smoothing=20
#Creating some data, with each coodinate and the values stored in separated lists
xv = [10,60,40,70,10,50,20,70,30,60]
yv = [10,20,30,30,40,50,60,70,80,90]
values = [1,2,2,3,4,6,7,7,8,10]
#Creating the output grid (100x100, in the example)
ti = np.linspace(0, 100, 100)
XI, YI = np.meshgrid(ti, ti)
#Creating the interpolation function and populating the output matrix value
ZI = invDist(xv,yv,values,100,100,power,smoothing)
# Plotting the result
n = plt.normalize(0.0, 100.0)
plt.subplot(1, 1, 1)
plt.pcolor(XI, YI, ZI)
plt.scatter(xv, yv, 100, values)
plt.title('Inv dist interpolation - power: ' + str(power) + ' smoothing: ' + str(smoothing))
plt.xlim(0, 100)
plt.ylim(0, 100)
plt.colorbar()
plt.show()

There's a bunch of options here, which one is best will depend on your data...
However I don't know of an out-of-the-box solution for you
You say your input data is from tripolar data. There are three main cases for how this data could be structured.
Sampled from a 3d grid in tripolar space, projected back to 2d LAT, LON data.
Sampled from a 2d grid in tripolar space, projected into 2d LAT LON data.
Unstructured data in tripolar space projected into 2d LAT LON data
The easiest of these is 2. Instead of interpolating in LAT LON space, "just" transform your point back into the source space and interpolate there.
Another option that works for 1 and 2 is to search for the cells that maps from tripolar space to cover your sample point. (You can use a BSP or grid type structure to speed up this search) Pick one of the cells, and interpolate inside it.
Finally there's a heap of unstructured interpolation options .. but they tend to be slow.
A personal favourite of mine is to use a linear interpolation of the nearest N points, finding those N points can again be done with gridding or a BSP. Another good option is to Delauney triangulate the unstructured points and interpolate on the resulting triangular mesh.
Personally if my mesh was case 1, I'd use an unstructured strategy as I'd be worried about having to handle searching through cells with overlapping projections. Choosing the "right" cell would be difficult.

I suggest you taking a look at GRASS (an open source GIS package) interpolation features (http://grass.ibiblio.org/gdp/html_grass62/v.surf.bspline.html). It's not in python but you can reimplement it or interface with C code.

Am I right in thinking your data grids look something like this (red is the old data, blue is the new interpolated data)?
alt text http://www.geekops.co.uk/photos/0000-00-02%20%28Forum%20images%29/DataSeparation.png
This might be a slightly brute-force-ish approach, but what about rendering your existing data as a bitmap (opengl will do simple interpolation of colours for you with the right options configured and you could render the data as triangles which should be fairly fast). You could then sample pixels at the locations of the new points.
Alternatively, you could sort your first set of points spatially and then find the closest old points surrounding your new point and interpolate based on the distances to those points.

There is a FORTRAN library called BIVAR, which is very suitable for this problem. With a few modifications you can make it usable in python using f2py.
From the description:
BIVAR is a FORTRAN90 library which interpolates scattered bivariate data, by Hiroshi Akima.
BIVAR accepts a set of (X,Y) data points scattered in 2D, with associated Z data values, and is able to construct a smooth interpolation function Z(X,Y), which agrees with the given data, and can be evaluated at other points in the plane.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.