Finding peaks in noisy data with find_peaks_cwt - python

I am trying to find the peaks in some very noisy data such as this:
Without understanding the terminology very well, I'm defining the peaks as narrow (width<30) and more than 100000 higher than the nearby area.
I'm trying to use scipy's find_peaks_cwt but the documentation is pretty unclear to me. I tried find_peaks_cwt(my_data, np.arange(1,30)) but it returned a huge number of peaks. Then I tried adding the noise_perc=60 argument but this didn't really fix the problem. I've also tried playing around with the other parameters but I don't really understand what a 'ridge line' is.
What should I be doing differently? Is widths=np.arange(1,30) setting my width requirement like I think it is? How do I specify a height requirement?

A lot depends on what your data actually mean (or what you think they ought to mean). Here's an example with synthetic data:
from scipy.signal import find_peaks_cwt
from matplotlib.pyplot import plot, ylim
from numpy import *
N = 2000
x = arange(N)
pwid = 200.
zideal = sinc(x/pwid - 2)**2   # Vaguely similar to yours
z = zideal * random.randn(N)**2   # adding noise
plot(x, zideal, lw=4)
ylim(0, 1)
zf = find_peaks_cwt(z, pwid/4 + zeros(N))
plot(x[zf], zideal[zf], '*', ms=20, color='green')
# Create averaging zones around peaks (integer indices for slicing)
xlow = maximum(array(zf) - pwid/2, 0).astype(int)
xhigh = minimum(array(zf) + pwid/2, x.max()).astype(int)
zguess = zeros(len(zf))   # allocate space
for ii in range(len(zf)):
    zguess[ii] = z[xlow[ii]:xhigh[ii]].mean()
plot(x[zf], zguess, 'o', ms=10, color='red')
pwid scales the width of the peaks in the sinc() function. In the call to find_peaks_cwt(), using larger values in widths produces fewer peaks (a lower density of peaks). The best results seem to come from setting the values in widths to approximately the half-width at half-maximum (HWHM) of the peaks.
find_peaks_cwt() does a pretty respectable job of finding the peaks in the ideal data. Averaging around the returned locations, as in the loop above, is one way to estimate the peak values. If you're summing spectral power, you should probably sum all the values between peaks rather than over a fixed interval like I did for this quick-and-dirty demo.
I find the function especially impressive in the way it finds smaller peaks in the presence of much larger ones.
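As an editorial aside, not part of the answer above: newer SciPy releases also provide scipy.signal.find_peaks, whose parameters map directly onto the question's stated requirements (prominence for "higher than the nearby area", width for narrowness). A minimal sketch, assuming my_data is the question's array:
from scipy.signal import find_peaks
# prominence: how far a peak rises above its surrounding baseline;
# width bounds are in samples, mirroring the question's width < 30.
peaks, properties = find_peaks(my_data, prominence=100000, width=(1, 30))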

Related

Using numpy/scipy to identify slope changes in digital signals?

I am trying to come up with a generalised way in Python to identify pitch rotations occurring during a set of planned spacecraft manoeuvres. You could think of it as a particular case of a shift detection problem.
Let's consider the solar_elevation_angle variable in my set of measurements, identifying the elevation angle of the sun measured from the spacecraft's instrument. For those who might want to play with the data, I saved the solar_elevation_angle.txt file here.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec
from scipy.signal import argrelmax
from scipy.ndimage import gaussian_filter1d
solar_elevation_angle = np.loadtxt("solar_elevation_angle.txt", dtype=np.float32)
fig, ax = plt.subplots()
ax.set_title('Solar elevation angle')
ax.set_xlabel('Scanline')
ax.set_ylabel('Solar elevation angle [deg]')
ax.plot(solar_elevation_angle)
plt.show()
The scanline is my time dimension. The four points where the slope changes identify the spacecraft pitch rotations.
As you can see, the solar elevation angle evolution outside the spacecraft manoeuvres regions is pretty much linear as a function of time, and that should always be the case for this particular spacecraft (except for major failures).
Note that during each spacecraft manoeuvre, the slope change is obviously continuous, although discretised in my set of angle values. That means: for each manoeuvre, it does not really make sense to try to locate a single scanline where a manoeuvre has taken place. My goal is rather to identify, for each manoeuvre, a "representative" scanline in the range of scanlines defining the interval of time where the manoeuvre occurred (e.g. middle value, or left boundary).
Once I get a set of "representative" scanline indexes where all manoeuvres have taken place, I could then use those indexes for rough estimations of manoeuvres durations, or to automatically place labels on the plot.
My solution so far has been to:
1. Compute the 2nd derivative of the solar elevation angle using np.gradient.
2. Compute the absolute value of the resulting curve and clip it. The clipping is necessary because of what I assume to be discretisation noise in the linear segments, which would otherwise severely affect the identification of the "real" local maxima in step 4.
3. Apply smoothing to the resulting curve to get rid of multiple peaks. I'm using scipy's 1d gaussian filter with a trial-and-error sigma value for that.
4. Identify the local maxima.
Here's my code:
fig = plt.figure(figsize=(8,12))
gs = gridspec.GridSpec(5, 1)
ax0 = plt.subplot(gs[0])
ax0.set_title('Solar elevation angle')
ax0.plot(solar_elevation_angle)
solar_elevation_angle_1stdev = np.gradient(solar_elevation_angle)
ax1 = plt.subplot(gs[1])
ax1.set_title('1st derivative')
ax1.plot(solar_elevation_angle_1stdev)
solar_elevation_angle_2nddev = np.gradient(solar_elevation_angle_1stdev)
ax2 = plt.subplot(gs[2])
ax2.set_title('2nd derivative')
ax2.plot(solar_elevation_angle_2nddev)
solar_elevation_angle_2nddev_clipped = np.clip(np.abs(solar_elevation_angle_2nddev), 0.0001, 2)
ax3 = plt.subplot(gs[3])
ax3.set_title('absolute value + clipping')
ax3.plot(solar_elevation_angle_2nddev_clipped)
smoothed_signal = gaussian_filter1d(solar_elevation_angle_2nddev_clipped, 20)
ax4 = plt.subplot(gs[4])
ax4.set_title('Smoothing applied')
ax4.plot(smoothed_signal)
plt.tight_layout()
plt.show()
I can then easily identify the local maxima by using scipy's argrelmax function:
max_idx = argrelmax(smoothed_signal)[0]
print(max_idx)
# [ 689 1019 2356 2685]
Which correctly identifies the scanline indexes I was looking for:
fig, ax = plt.subplots()
ax.set_title('Solar elevation angle')
ax.set_xlabel('Scanline')
ax.set_ylabel('Solar elevation angle [deg]')
ax.plot(solar_elevation_angle)
ax.scatter(max_idx, solar_elevation_angle[max_idx], marker='x', color='red')
plt.show()
My question is: Is there a better way to approach this problem?
I find that having to manually specify the clipping thresholds to get rid of the noise, and the sigma of the gaussian filter, weakens this approach considerably and prevents it from being applied to other, similar cases.
First improvement would be to use a Savitzky-Golay filter to compute the derivative in a less noisy way. For example, it can fit a parabola (in the least-squares sense) to each data slice of a certain size, and then take the second derivative of that parabola. The result is much nicer than just taking a second-order difference with gradient. Here it is with window size 101:
savgol_filter(solar_elevation_angle, window_length=window, polyorder=2, deriv=2)
Second, instead of looking for points of maximum with argrelmax it is better to look for places where the second derivative is large; for example, at least half its maximal size. This will of course return many indexes, but we can then look at the gaps between those indexes to identify where each peak begins and ends. The midpoint of the peak is then easily found.
Here is the complete code. The only parameter is the window size, which is set to 101. The approach is robust: window sizes of 21 or 201 give essentially the same outcome (the size must be odd).
from scipy.signal import savgol_filter
window = 101
der2 = savgol_filter(solar_elevation_angle, window_length=window, polyorder=2, deriv=2)
max_der2 = np.max(np.abs(der2))
large = np.where(np.abs(der2) > max_der2/2)[0]
gaps = np.diff(large) > window
begins = np.insert(large[1:][gaps], 0, large[0])
ends = np.append(large[:-1][gaps], large[-1])
changes = ((begins+ends)/2).astype(int)
plt.plot(solar_elevation_angle)
plt.plot(changes, solar_elevation_angle[changes], 'ro')
plt.show()
The fuss with insert and append is because the first index with large derivative should qualify as "peak begins" and the last such index should qualify as "peak ends", even though they don't have a suitable gap next to them (the gap is infinite).
Piecewise linear fit
This is an alternative (not necessarily better) approach, which does not use derivatives: fit a smoothing spline of degree 1 (i.e., a piecewise linear curve), and notice where its knots are.
First, normalize the data (which I call y instead of solar_elevation_angle) to have standard deviation 1.
y /= np.std(y)
The first step is to build a piecewise linear curve that deviates from the data by at most the given threshold, arbitrarily set to 0.1 (no units here because y was normalized). This is done by calling UnivariateSpline repeatedly, starting with a large smoothing parameter and gradually reducing it until the curve fits. (Unfortunately, one can't simply pass in the desired uniform error bound).
from scipy.interpolate import UnivariateSpline
threshold = 0.1
m = y.size
x = np.arange(m)
s = m
max_error = 1
while max_error > threshold:
    spl = UnivariateSpline(x, y, k=1, s=s)
    interp_y = spl(x)
    max_error = np.max(np.abs(interp_y - y))
    s /= 2
knots = spl.get_knots()
values = spl(knots)
So far we have found the knots and noted the values of the spline at those knots. But not all of these knots are really important. To test the importance of each knot, I remove it and interpolate without it. If the new interpolant differs substantially from the old one (error exceeding twice the threshold), the knot is considered important and is added to the list of found slope changes.
ts = knots.size
idx = np.arange(ts)
changes = []
for j in range(1, ts-1):
    spl = UnivariateSpline(knots[idx != j], values[idx != j], k=1, s=0)
    if np.max(np.abs(spl(x) - interp_y)) > 2*threshold:
        changes.append(knots[j])
plt.plot(y)
plt.plot(changes, y[np.array(changes, dtype=int)], 'ro')
plt.show()
Ideally, one would fit piecewise linear functions to the given data, increasing the number of knots until adding one more does not bring "substantial" improvement. The above is a crude approximation of that with SciPy tools, but far from the best possible. I don't know of any off-the-shelf piecewise linear model selection tool in Python.
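(An editorial aside: one third-party option is the pwlf package, which fits a piecewise linear model with a fixed number of segments via global optimization; whether its breakpoints line up with the manoeuvres here is an assumption, and it can be slow on long series.)
import numpy as np
import pwlf  # third-party: pip install pwlf
x = np.arange(y.size)
model = pwlf.PiecewiseLinFit(x, y)
breakpoints = model.fit(5)  # 5 segments -> 4 interior slope changes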

Python fastKDE beyond limits of data points

I'm trying to use the fastKDE package (https://pypi.python.org/pypi/fastkde/1.0.8) to find the KDE of a point in a 2D plot. However, I want to know the KDE beyond the limits of the data points, and cannot figure out how to do this.
Using the code listed on the site linked above:
#!python
import numpy as np
from fastkde import fastKDE
import pylab as PP
#Generate two random variables dataset (representing 200000 pairs of datapoints)
N = int(2e5)
var1 = 50*np.random.normal(size=N) + 0.1
var2 = 0.01*np.random.normal(size=N) - 300
#Do the self-consistent density estimate
myPDF,axes = fastKDE.pdf(var1,var2)
#Extract the axes from the axis list
v1,v2 = axes
#Plot contours of the PDF; it should be a set of concentric ellipsoids centered
#on (0.1, -300). Comparatively, the y-axis range should be tiny and the x-axis
#range should be large.
PP.contour(v1,v2,myPDF)
PP.show()
I'm able to find the KDE for any point within the limits of the data, but how do I find the KDE for, say, the point (0, 300), without having to include it in var1 and var2? I don't want the KDE to be calculated with this data point; I want to know the KDE at that point.
I guess what I really want is to give fastKDE a histogram of the data, so that I can set its axes myself. I just don't know if this is possible.
Cheers
I, too, have been experimenting with this code and have run into the same issues. What I've done (in lieu of a good N-D extrapolator) is to build a KDTree (with scipy.spatial) from the grid points that fastKDE returns and find the nearest grid point to the point I want to evaluate. I then look up the corresponding pdf value at that point (it should be small near the edge of the pdf grid, if not identically zero) and assign that value accordingly.
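A minimal sketch of that lookup, as I read it (assumes myPDF, v1, v2 from the question's example; query_point is illustrative):
import numpy as np
from scipy.spatial import cKDTree
# Build a tree over every (v1, v2) grid point returned by fastKDE.
V1, V2 = np.meshgrid(v1, v2)
tree = cKDTree(np.column_stack([V1.ravel(), V2.ravel()]))
# Find the grid point nearest the query and read off its pdf value.
query_point = np.array([0.0, 300.0])   # the question's example point
dist, idx = tree.query(query_point)
pdf_at_query = myPDF.ravel()[idx]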
I came across this post while searching for a solution to this problem. Similar to building a KDTree, you could just calculate your step size in every grid dimension, then get the index of your query point by subtracting the start of the axis from the point value, dividing by the step size of that dimension, and finally rounding off and converting to an integer. So, for example, in 1D:
def fastkde_test(test_x):
    kde, axes = fastKDE.pdf(test_x, numPoints=num_p)
    x_step = (max(axes)-min(axes)) / len(axes)
    x_ind = np.int32(np.round((test_x-min(axes)) / x_step))
    return kde[x_ind]
where test_x in this case is both the set used to define the KDE and the query set. Doing it this way was about a factor of 10 faster in my case (at least in 1D; I haven't tested higher dimensions) and does essentially the same thing as the KDTree query.
I hope this helps anyone coming across this problem in the future, as I just did.
Edit: if you're querying points outside of the range over which the KDE was calculated, this method can of course only give you the same result as the KDTree query, namely the corresponding border of your KDE grid. You would, however, have to handle this yourself by clamping the resulting x_ind at the highest valid index, i.e. len(axes)-1.
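For completeness, that clamp is a one-liner (reusing the names from the function above):
x_ind = np.clip(x_ind, 0, len(axes) - 1)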

Matplotlib: How to increase colormap/linewidth quality in streamplot?

I have the following code to generate a streamplot based on an interp1d-Interpolation of discrete data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as colors
from scipy.interpolate import interp1d
# CSV Import
a1array=pd.read_csv('a1.csv', sep=',',header=None).values
rv=a1array[:,0]
a1v=a1array[:,1]
da1vM=a1array[:,2]
a1 = interp1d(rv, a1v)
da1M = interp1d(rv, da1vM)
# Bx and By vector components
def bx(x, y):
    rad = np.sqrt(x**2+y**2)
    if rad == 0:
        return 0
    else:
        return x*y/rad**4*(-2*a1(rad)+rad*da1M(rad))/2.87445E-19*1E-12
def by(x, y):
    rad = np.sqrt(x**2+y**2)
    if rad == 0:
        return 4.02995937E-04/2.87445E-19*1E-12
    else:
        return -1/rad**4*(2*a1(rad)*y**2+rad*da1M(rad)*x**2)/2.87445E-19*1E-12
Bx = np.vectorize(bx, otypes=[float])
By = np.vectorize(by, otypes=[float])
# Grid
num_steps = 11
Y, X = np.mgrid[-25:25:(num_steps * 1j), 0:25:(num_steps * 1j)]
Vx = Bx(X, Y)
Vy = By(X, Y)
speed = np.sqrt(Bx(X, Y)**2+By(X, Y)**2)
lw = 2*speed / speed.max()+.5
# Star Radius
circle3 = plt.Circle((0, 0), 16.3473140, color='black', fill=False)
# Plot
fig0, ax0 = plt.subplots(num=None, figsize=(11,9), dpi=80, facecolor='w', edgecolor='k')
strm = ax0.streamplot(X, Y, Vx, Vy, color=speed, linewidth=lw,density=[1,2], cmap=plt.cm.jet)
ax0.streamplot(-X, Y, -Vx, Vy, color=speed, linewidth=lw,density=[1,2], cmap=plt.cm.jet)
ax0.add_artist(circle3)
cbar=fig0.colorbar(strm.lines,fraction=0.046, pad=0.04)
cbar.set_label('B[GT]', rotation=270, labelpad=8)
cbar.set_clim(0,1500)
cbar.draw_all()
ax0.set_ylim([-25,25])
ax0.set_xlim([-25,25])
ax0.set_xlabel('x [km]')
ax0.set_ylabel('z [km]')
ax0.set_aspect(1)
plt.title('polyEos(0.05,2), M/R=0.2, B_r(0,0)=1402GT', y=1.01)
plt.savefig('MR02Br1402.pdf',bbox_inches=0)
plt.show()
I uploaded the csv-file here if you want to try some stuff https://www.dropbox.com/s/4t7jixpglt0mkl5/a1.csv?dl=0.
Which generates the following plot:
I am actually pretty happy with the result except for one small detail, which I cannot figure out: if one looks closely, the linewidth and the color change in rather big steps, which is especially visible at the center:
Is there some way/option to decrease the size of these steps, especially to make the colormap smoother?
I had another look at this and it wasn't as painful as I thought it might be.
Add:
subdiv = 15
points = np.arange(len(t[0]))
interp_points = np.linspace(0, len(t[0]), subdiv * len(t[0]))
tgx = np.interp(interp_points, points, tgx)
tgy = np.interp(interp_points, points, tgy)
tx = np.interp(interp_points, points, tx)
ty = np.interp(interp_points, points, ty)
after ty is initialised in the trajectories loop (line 164 in my version). Just substitute whatever number of subdivisions you want for subdiv = 15. All the segments in the streamplot will be subdivided into as many equally sized segments as you choose. The colors and linewidths for each will still be properly obtained from interpolating the data.
It's not as neat as changing the integration step, but it plots exactly the same trajectories.
If you don't mind changing the streamplot code (matplotlib/streamplot.py), you could simply decrease the size of the integration steps. Inside _integrate_rk12() the maximum step size is defined as:
maxds = min(1. / dmap.mask.nx, 1. / dmap.mask.ny, 0.1)
If you decrease that, let's say:
maxds = 0.1 * min(1. / dmap.mask.nx, 1. / dmap.mask.ny, 0.1)
I get this result (left = new, right = original):
Of course, this makes the code about 10x slower, and I haven't thoroughly tested it, but it seems to work (as a quick hack) for this example.
About the density (mentioned in the comments): I personally don't see a problem with that. It's not like we are trying to visualize the actual path line of (e.g.) a particle; the density is already an arbitrary (controllable) choice, and yes, it is influenced by choices in the integration, but I don't think it changes the (not quite sure what to call it) intended visualization we're after.
The results (density) do seem to converge a bit for decreasing step sizes; this shows the results for decreasing the integration step by factors of {1, 5, 10, 20}:
You could increase the density parameter to get smoother color transitions, but then use the start_points parameter to reduce the overall clutter. The start_points parameter allows you to explicitly choose the location and number of trajectories to draw. It overrides the default, which is to plot as many as possible to fill up the entire plot.
But first you need one little fix to your existing code: according to the streamplot documentation, the X and Y args should be 1d arrays, not the 2d arrays produced by mgrid. It looks like passing in 2d arrays is supported, but it is undocumented and currently not compatible with the start_points parameter.
Here is how I revised your X, Y, Vx, Vy and speed:
# Grid
num_steps = 11
Y = np.linspace(-25, 25, num_steps)
X = np.linspace(0, 25, num_steps)
Ygrid, Xgrid = np.mgrid[-25:25:(num_steps * 1j), 0:25:(num_steps * 1j)]
Vx = Bx(Xgrid, Ygrid)
Vy = By(Xgrid, Ygrid)
speed = np.hypot(Vx, Vy)
lw = 3*speed / speed.max()+.5
Now you can explicitly set your start_points parameter. The start points are actually "seed" points: any given stream trajectory will grow in both directions from the seed point. So if you put a seed point right in the center of the example plot, it will grow both up and down to produce a vertical stream line.
Besides controlling the number of trajectories, the start_points parameter also controls the order in which they are drawn. This is important when considering how trajectories terminate: they will either hit the border of the plot, or they will terminate if they hit a cell of the plot that already has a trajectory. That means your first seeds will tend to grow longer and your later seeds will tend to get limited by previous ones; some of the later seeds may not grow at all. The default seeding strategy is to plant a seed at every cell, which is pretty obnoxious if you have a high density. It also orders them by planting seeds first along the plot borders and spiraling inward. This may not be ideal for your particular case. I found a very simple strategy for your example was to just plant a few seeds between those two points of zero velocity, y=0 and x from -10 to 10. Those trajectories grow to their fullest and fill in most of the plot without clutter.
Here is how I create the seed points and set the density:
num_streams = 8
stptsy = np.zeros((num_streams,), float)
stptsx_left = np.linspace(0, -10.0, num_streams)
stptsx_right = np.linspace(0, 10.0, num_streams)
stpts_left = np.column_stack((stptsx_left, stptsy))
stpts_right = np.column_stack((stptsx_right, stptsy))
density = (3,6)
And here is how I modify the calls to streamplot:
strm = ax0.streamplot(X, Y, Vx, Vy, color=speed, linewidth=lw, density=density,
cmap=plt.cm.jet, start_points=stpts_right)
ax0.streamplot(-X, Y, -Vx, Vy, color=speed, linewidth=lw,density=density,
cmap=plt.cm.jet, start_points=stpts_left)
The result basically looks like the original, but with smoother color transitions and only 15 stream lines. (sorry no reputation to inline the image)
I think your best bet is to use a colormap other than jet, perhaps cmap=plt.cm.plasma. Weird-looking graphs obscure understanding of the data.
For data which is ordered in some way, like by the speed vector magnitude in this case, uniform sequential colormaps will always look smoother. The brightness of sequential maps varies monotonically over the color range, removing large perceived color changes over small ranges of data. The uniform maps vary linearly over their whole range, which makes the main features in the data much more visually apparent.
[figure: the perceptually uniform sequential colormaps; source: matplotlib.org]
The jet colormap spans a very wide range of brightnesses over its range, with an inflexion in the middle. This is responsible for the particularly egregious red-to-blue transition around the center region of your graph.
[figure: the jet colormap; source: matplotlib.org]
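For reference, a sketch of that one-argument change against the plotting call from the question (plasma here; viridis would serve equally well):
strm = ax0.streamplot(X, Y, Vx, Vy, color=speed, linewidth=lw, density=[1,2], cmap=plt.cm.plasma)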
The matplotlib user guide on choosing a colormap has a few recommendations about selecting an appropriate map for a given data set.
I don't think there is much else you can do to improve this by just changing parameters in your plot.
The streamplot divides the graph into cells, 30*density[x,y] in each direction; at most one streamline goes through each cell. The only setting which directly increases the number of segments is the density of the grid matplotlib uses. Increasing the y density will decrease the segment length, so that the middle region may transition more smoothly. The cost of this is an inevitable cluttering of the graph in regions where the streamlines are horizontal.
You could also try to normalise the speeds differently, so that the change is artificially lowered near the center. At the end of the day, though, that seems to defeat the point of the graph. The graph should provide a useful view of the data for a human to understand; using a colormap with strange inflexions, or warping the data so that it looks nicer, removes some understanding which could otherwise be obtained from looking at the graph.
A more detailed discussion about the issues with colormaps like jet can be found on this blog.

How do I limit the interpolation region in the InterpolatedUnivariateSpline in Python when given non-uniform samples?

I'm trying to get a nice upsampler in Python when I have non-uniformly spaced inputs. Any suggestions would be helpful. I've tried a number of interp functions. Here's an example:
from scipy.interpolate import InterpolatedUnivariateSpline
from numpy import linspace, arange, append
from matplotlib.pyplot import plot, show
F=[0, 1000,1500,2000,2500,3000,3500,4000,4500,5000,5500,22050]
M=[0.,2.85,2.49,1.65,1.55,1.81,1.35,1.00,1.13,1.58,1.21,0.]
ff=linspace(F[0],F[1],10)
for i in arange(2, len(F)):
    ff=append(ff,linspace(F[i-1],F[i], 10))
aa=InterpolatedUnivariateSpline(x=F,y=M,k=2)
mm=aa(ff)
plot(F,M,'r-o'); plot(ff,mm,'bo'); show()
This is the plot I get:
I need to get interpolated values that don't go below 0. Note that the blue dots go below zero. The red line represents the original F vs. M data. If I use k=1 (piece-wise linear interp) then I get good values as shown here:
aa=InterpolatedUnivariateSpline(x=F,y=M,k=1)
mm=aa(ff); plot(F,M,'r-o');plot(ff,mm,'bo'); show()
The problem is that I need a "smooth" interpolation, not the piecewise linear one. Does anyone know if the bbox argument in InterpolatedUnivariateSpline helps to fix that? I can't find any documentation on what bbox does. Is there another, easier way to accomplish this?
Thanks in advance for any help.
Positivity-preserving interpolation is hard (if it wasn't, there wouldn't be a bunch of papers written about it). The splines of low degree (2, 3) usually do pretty well in this regard, but your data has that large gap in it, and it happens to be at the end of data range, making things worse.
One solution is to do interpolation in two steps: first upsample the data by piecewise linear interpolation, then interpolate new data with a smooth spline (I'll use cubic spline below, though quadratic also works).
The gap_size array records how large each gap is, relative to the smallest one. In the subsequent loop, uniformly spaced points are inserted into large gaps (those that are at least twice the size of the smallest one). The result is F_new, a better, nearly uniform grid that still includes the original points. The corresponding M values for it are generated by a piecewise linear spline.
Subsequent cubic interpolation produces a smooth curve that stays positive.
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import InterpolatedUnivariateSpline
F = [0, 1000,1500,2000,2500,3000,3500,4000,4500,5000,5500,22050]
M = [0.,2.85,2.49,1.65,1.55,1.81,1.35,1.00,1.13,1.58,1.21,0.]
gap_size = np.diff(F) // np.diff(F).min()
F_new = []
for i in range(len(F)-1):
    F_new.extend(np.linspace(F[i], F[i+1], gap_size[i], endpoint=False))
F_new.append(F[-1])
pl_spline = InterpolatedUnivariateSpline(F, M, k=1)
M_new = pl_spline(F_new)
smooth_spline = InterpolatedUnivariateSpline(F_new, M_new, k=3)
ff = np.linspace(F[0], F[-1], 100)
plt.plot(F, M, 'ro')
plt.plot(ff, smooth_spline(ff), 'b')
plt.show()
Of course, no tricks can hide the truth that we don't know what happens between 5500 and 22050 (Hz, I presume), the nearly-linear part is just a placeholder.
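(An editorial aside, not from the answer above: scipy.interpolate.PchipInterpolator is shape-preserving and never overshoots the data between points, so it stays nonnegative whenever the data are nonnegative, at the cost of only a continuous first, not second, derivative. It may be worth comparing on this data, reusing F, M, ff and plt from the block above.)
from scipy.interpolate import PchipInterpolator
pchip = PchipInterpolator(F, M)  # monotone cubic; no overshoot between points
plt.plot(F, M, 'ro')
plt.plot(ff, pchip(ff), 'g')
plt.show()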

Python: Choose the n points better distributed from a bunch of points

I have a numpy array of points in an XY plane like:
I want to select the n points (let's say 100) better distributed from all these points. This is, I want the density of points to be constant anywhere.
Something like this:
Is there any pythonic way or any numpy/scipy function to do this?
@EMS is very correct that you should give a lot of thought to exactly what you want.
There are more sophisticated ways to do this (@EMS's suggestions are very good!), but a brute-force-ish approach is to bin the points onto a regular, rectangular grid and draw a random point from each bin.
The major downside is that you won't get the number of points you ask for. Instead, you'll get some number smaller than that number.
A bit of creative indexing with pandas makes this "gridding" approach quite easy, though you can certainly do it with "pure" numpy, as well.
As an example of the simplest possible, brute force, grid approach: (There's a lot we could do better, here.)
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
total_num = 100000
x, y = np.random.normal(0, 1, (2, total_num))
# We'll always get fewer than this number for two reasons.
# 1) We're choosing a square grid, and "subset_num" may not be a perfect square
# 2) There won't be data in every cell of the grid
subset_num = 1000
# Bin points onto a rectangular grid with approximately "subset_num" cells
nbins = int(np.sqrt(subset_num))
xbins = np.linspace(x.min(), x.max(), nbins+1)
ybins = np.linspace(y.min(), y.max(), nbins+1)
# Make a dataframe indexed by the grid coordinates.
i, j = np.digitize(y, ybins), np.digitize(x, xbins)
df = pd.DataFrame(dict(x=x, y=y), index=[i, j])
# Group by which cell the points fall into and choose a random point from each
groups = df.groupby(df.index)
new = groups.agg(lambda x: np.random.permutation(x)[0])
# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].plot(x, y, 'k.')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(new.x, new.y, 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(new)))
plt.setp(axes, aspect=1, adjustable='box')
fig.tight_layout()
plt.show()
Loosely based on @EMS's suggestion in a comment, here's another approach.
We'll calculate the density of points using a kernel density estimate, and then use the inverse of that as the probability that a given point will be chosen.
scipy.stats.gaussian_kde is not optimized for this use case (or for large numbers of points in general). It's the bottleneck here. It's possible to write a more optimized version for this specific use case in several ways (approximations, special case here of pairwise distances, etc). However, that's beyond the scope of this particular question. Just be aware that for this specific example with 1e5 points, it will take a minute or two to run.
The advantage of this method is that you get the exact number of points that you asked for. The disadvantage is that you are likely to have local clusters of selected points.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
total_num = 100000
subset_num = 1000
x, y = np.random.normal(0, 1, (2, total_num))
# Let's approximate the PDF of the point distribution with a kernel density
# estimate. scipy.stats.gaussian_kde is slow for large numbers of points, so
# you might want to use another implementation in some cases.
xy = np.vstack([x, y])
dens = gaussian_kde(xy)(xy)
# Try playing around with this weight. Compare 1/dens, 1-dens, and (1-dens)**2
weight = 1 / dens
weight /= weight.sum()
# Draw a sample using np.random.choice with the specified probabilities.
# We'll need to view things as an object array because np.random.choice
# expects a 1D array.
dat = xy.T.ravel().view([('x', float), ('y', float)])
subset = np.random.choice(dat, subset_num, p=weight)
# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].scatter(x, y, c=dens, edgecolor='none')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(subset['x'], subset['y'], 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(subset)))
plt.setp(axes, aspect=1, adjustable='box')
fig.tight_layout()
plt.show()
Unless you give a specific criterion for defining "better distributed", we can't give a definite answer.
The phrase "constant density of points anywhere" is also misleading, because you have to specify the empirical method for calculating density. Are you approximating it on a grid? If so, the grid size will matter, and points near the boundary won't be correctly represented.
A different approach might be as follows:
1. Calculate the distance matrix between all pairs of points.
2. Treating this distance matrix as a weighted network, calculate some measure of centrality for each point in the data, such as eigenvector centrality, betweenness centrality, or Bonacich centrality.
3. Order the points in descending order according to the centrality measure, and keep the first 100.
4. Repeat steps 1-3, possibly using a different notion of "distance" between points and with different centrality measures.
Many of these functions are provided directly by SciPy, NetworkX, and scikit-learn, and will work directly on a NumPy array.
If you are definitely committed to thinking of the problem in terms of regular spacing and grid density, you might take a look at quasi-Monte Carlo (QMC) methods. In particular, you could compute the convex hull of the set of points and then apply a QMC technique to sample regularly from anywhere within that convex hull. But again, this privileges the exterior of the region, which should be sampled far less than the interior.
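A rough sketch of that idea with SciPy's QMC module (an editorial aside; assumes SciPy >= 1.7, with `points` standing in for the scattered data):
import numpy as np
from scipy.stats import qmc
from scipy.spatial import Delaunay
points = np.random.normal(0, 1, (100000, 2))   # stand-in for the real data
hull = Delaunay(points)                        # find_simplex() tests hull membership
lo, hi = points.min(axis=0), points.max(axis=0)
# Oversample a scrambled Sobol sequence in the bounding box, then keep only
# candidates inside the convex hull; this may return fewer than 100 points.
candidates = qmc.scale(qmc.Sobol(d=2, scramble=True).random(1024), lo, hi)
inside = candidates[hull.find_simplex(candidates) >= 0][:100]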
Yet another interesting approach would be to simply run the K-means algorithm on the scattered data with a fixed number of clusters, K=100. After the algorithm converges, you'll have 100 points from your space (the mean of each cluster). You could repeat this several times with different random starting points for the cluster means and then sample from that larger set of possible means. Since your data do not appear to actually cluster into 100 components naturally, the convergence of this approach won't be very good and may require running the algorithm for a large number of iterations. It also has the downside that the resulting set of 100 points are not necessarily points that come from the observed data; instead, they will be local averages of many points.
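A minimal sketch of that K-means idea (an aside; assumes scikit-learn, with `points` again standing in for the scattered data):
import numpy as np
from sklearn.cluster import KMeans
points = np.random.normal(0, 1, (100000, 2))   # stand-in for the real data
km = KMeans(n_clusters=100, n_init=10).fit(points)
centers = km.cluster_centers_   # 100 local averages, not original data points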
This method, which iteratively picks the point from the remaining points that has the largest minimum distance to the already picked points (farthest-point sampling), has terrible time complexity, but produces pretty uniformly distributed results:
from numpy import array, argmax, ndarray, vstack
from numpy.random import randint
from scipy.spatial.distance import cdist

def well_spaced_points(points: ndarray, num_points: int):
    """
    Pick `num_points` well-spaced points from `points` array.
    :param points: An m x n array of m n-dimensional points.
    :param num_points: The number of points to pick.
    :rtype: ndarray
    :return: A num_points x n array of points from the original array.
    """
    # pick a random starting point
    current_point_index = randint(0, len(points))
    picked_points = array([points[current_point_index]])
    remaining_points = vstack((
        points[: current_point_index],
        points[current_point_index + 1:]
    ))
    # while there are more points to pick
    while picked_points.shape[0] < num_points:
        # find the remaining point furthest from all picked points
        distance_pk_rmn = cdist(picked_points, remaining_points)
        min_distance_pk = distance_pk_rmn.min(axis=0)
        i_furthest = argmax(min_distance_pk)
        # add it to picked points and remove it from remaining
        picked_points = vstack((
            picked_points,
            remaining_points[i_furthest]
        ))
        remaining_points = vstack((
            remaining_points[: i_furthest],
            remaining_points[i_furthest + 1:]
        ))
    return picked_points
