Using numpy/scipy to identify slope changes in digital signals? - python

I am trying to come up with a generalised way in Python to identify pitch rotations occurring during a set of planned spacecraft manoeuvres. You could think of it as a particular case of a shift detection problem.
Let's consider the solar_elevation_angle variable in my set of measurements, identifying the elevation angle of the sun measured from the spacecraft's instrument. For those who might want to play with the data, I saved the solar_elevation_angle.txt file here.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec
from scipy.signal import argrelmax
from scipy.ndimage import gaussian_filter1d
solar_elevation_angle = np.loadtxt("solar_elevation_angle.txt", dtype=np.float32)
fig, ax = plt.subplots()
ax.set_title('Solar elevation angle')
ax.set_ylabel('Solar elevation angle [deg]')
The scanline is my time dimension. The four points where the slope changes identify the spacecraft pitch rotations.
As you can see, the solar elevation angle evolution outside the spacecraft manoeuvres regions is pretty much linear as a function of time, and that should always be the case for this particular spacecraft (except for major failures).
Note that during each spacecraft manoeuvre, the slope change is obviously continuous, although discretised in my set of angle values. That means: for each manoeuvre, it does not really make sense to try to locate a single scanline where a manoeuvre has taken place. My goal is rather to identify, for each manoeuvre, a "representative" scanline in the range of scanlines defining the interval of time where the manoeuvre occurred (e.g. middle value, or left boundary).
Once I get a set of "representative" scanline indexes where all manoeuvres have taken place, I could then use those indexes for rough estimations of manoeuvres durations, or to automatically place labels on the plot.
My solution so far has been to:
1. Compute the 2nd derivative of the solar elevation angle using np.gradient.
2. Compute the absolute value of the resulting curve and clip it. The clipping is necessary because of what I assume to be discretisation noise in the linear segments, which would otherwise severely affect the identification of the "real" local maxima in point 4.
3. Apply smoothing to the resulting curve, to get rid of multiple peaks. I'm using scipy's 1d gaussian filter with a trial-and-error sigma value for that.
4. Identify local maxima.
Here's my code:
fig = plt.figure(figsize=(8,12))
gs = gridspec.GridSpec(5, 1)
ax0 = plt.subplot(gs[0])
ax0.set_title('Solar elevation angle')
solar_elevation_angle_1stdev = np.gradient(solar_elevation_angle)
ax1 = plt.subplot(gs[1])
ax1.set_title('1st derivative')
solar_elevation_angle_2nddev = np.gradient(solar_elevation_angle_1stdev)
ax2 = plt.subplot(gs[2])
ax2.set_title('2nd derivative')
solar_elevation_angle_2nddev_clipped = np.clip(np.abs(np.gradient(solar_elevation_angle_2nddev)), 0.0001, 2)
ax3 = plt.subplot(gs[3])
ax3.set_title('absolute value + clipping')
smoothed_signal = gaussian_filter1d(solar_elevation_angle_2nddev_clipped, 20)
ax4 = plt.subplot(gs[4])
ax4.set_title('Smoothing applied')
I can then easily identify the local maxima by using scipy's argrelmax function:
max_idx = argrelmax(smoothed_signal)[0]
# [ 689 1019 2356 2685]
Which correctly identifies the scanline indexes I was looking for:
fig, ax = plt.subplots()
ax.set_title('Solar elevation angle')
ax.set_ylabel('Solar elevation angle [deg]')
ax.scatter(max_idx, solar_elevation_angle[max_idx], marker='x', color='red')
My question is: Is there a better way to approach this problem?
I find that having to manually specify the clipping threshold values (to get rid of the noise) and the sigma of the gaussian filter weakens this approach considerably, preventing it from being applied to other similar cases.

First improvement would be to use a Savitzky-Golay filter to find the derivative in a less noisy way. For example, it can fit a parabola (in the sense of least squares) to each data slice of certain size, and then take the second derivative of that parabola. The result is much nicer than just taking 2nd order difference with gradient. Here it is with window size 101:
savgol_filter(solar_elevation_angle, window_length=window, polyorder=2, deriv=2)
Second, instead of looking for points of maximum with argrelmax it is better to look for places where the second derivative is large; for example, at least half its maximal size. This will of course return many indexes, but we can then look at the gaps between those indexes to identify where each peak begins and ends. The midpoint of the peak is then easily found.
Here is the complete code. The only parameter is the window size, which is set to 101. The approach is robust; a window size of 21 or 201 gives essentially the same outcome (the window size must be odd).
from scipy.signal import savgol_filter
window = 101
der2 = savgol_filter(solar_elevation_angle, window_length=window, polyorder=2, deriv=2)
max_der2 = np.max(np.abs(der2))
large = np.where(np.abs(der2) > max_der2/2)[0]
gaps = np.diff(large) > window
begins = np.insert(large[1:][gaps], 0, large[0])
ends = np.append(large[:-1][gaps], large[-1])
changes = ((begins+ends)/2).astype(int)
plt.plot(changes, solar_elevation_angle[changes], 'ro')
The fuss with insert and append is because the first index with large derivative should qualify as "peak begins" and the last such index should qualify as "peak ends", even though they don't have a suitable gap next to them (the gap is infinite).
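Since the original data file may not be at hand, here is a minimal sketch of the same steps on a synthetic signal. The signal (three linear segments with breakpoints near samples 1000 and 2000) and the helper name find_slope_changes are assumptions made up for illustration; only the detection logic follows the code above.

```python
import numpy as np
from scipy.signal import savgol_filter

def find_slope_changes(signal, window=101):
    """Return one representative index per slope change (midpoint of each peak)."""
    # Second derivative from a least-squares parabola fitted to each window
    der2 = savgol_filter(signal, window_length=window, polyorder=2, deriv=2)
    # Indexes where |second derivative| exceeds half of its maximum
    large = np.where(np.abs(der2) > np.max(np.abs(der2)) / 2)[0]
    # A gap longer than the window separates two distinct peaks
    gaps = np.diff(large) > window
    begins = np.insert(large[1:][gaps], 0, large[0])
    ends = np.append(large[:-1][gaps], large[-1])
    return (begins + ends) // 2

# Synthetic stand-in for the real data: slopes 1, -0.5, 1
n = np.arange(3000)
slope = np.where(n < 1000, 1.0, np.where(n < 2000, -0.5, 1.0))
synthetic = np.cumsum(slope) * 0.01

changes = find_slope_changes(synthetic)
print(changes)
```

One caveat of the half-maximum threshold: a slope change less than half as strong as the strongest one in the record would not be reported, so the factor 1/2 may need adjusting when manoeuvres differ a lot in magnitude.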
Piecewise linear fit
This is an alternative (not necessarily better) approach, which does not use derivatives: fit a smoothing spline of degree 1 (i.e., a piecewise linear curve), and notice where its knots are.
First, normalize the data (which I call y instead of solar_elevation_angle) to have standard deviation 1.
y /= np.std(y)
The first step is to build a piecewise linear curve that deviates from the data by at most the given threshold, arbitrarily set to 0.1 (no units here because y was normalized). This is done by calling UnivariateSpline repeatedly, starting with a large smoothing parameter and gradually reducing it until the curve fits. (Unfortunately, one can't simply pass in the desired uniform error bound).
from scipy.interpolate import UnivariateSpline
threshold = 0.1
m = y.size
x = np.arange(m)
s = m
max_error = 1
while max_error > threshold:
    spl = UnivariateSpline(x, y, k=1, s=s)
    interp_y = spl(x)
    max_error = np.max(np.abs(interp_y - y))
    s /= 2
knots = spl.get_knots()
values = spl(knots)
So far we found the knots, and noted the values of the spline at those knots. But not all of these knots are really important. To test the importance of each knot, I remove it and interpolate without it. If the new interpolant is substantially different from the old (doubling the error), the knot is considered important and is added to the list of found slope changes.
ts = knots.size
idx = np.arange(ts)
changes = []
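for j in range(1, ts-1):
    # Interpolate through all knots except the j-th one
    spl = UnivariateSpline(knots[idx != j], values[idx != j], k=1, s=0)
    if np.max(np.abs(spl(x) - interp_y)) > 2*threshold:
        # Removing this knot changes the curve substantially: keep it
        changes.append(knots[j])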
plt.plot(changes, y[np.array(changes, dtype=int)], 'ro')
Ideally, one would fit piecewise linear functions to given data, increasing the number of knots until adding one more does not bring "substantial" improvement. The above is a crude approximation of that with SciPy tools, but far from best possible. I don't know of any off-the-shelf piecewise linear model selection tool in Python.


Finding local extreme not working as expected in Scipy

I am writing code to find the local minima and maxima of the gradient of a signal, similar to this stackoverflow question. I am using argrelextrema to do this. To test my approach I implemented a quick test using the cosine function.
# Get the data
x = np.arange(start=0,
y = np.cos(x)
# Calculate the gradient
gradient = []
for y1, y2 in zip(y, y[1:]):
    # Append the gradient (Delta Y / Delta X) where Delta X = 1
    gradient.append(y2 - y1)
# Turn the gradient from a list to an array
gradient = np.array(gradient)
# Calculate the maximum points of the gradient
maxima = argrelextrema(gradient, np.greater_equal, order=2)
minima = argrelextrema(gradient, np.less_equal, order=2)
# Plot the original signal
plt.plot(x, y, label="Original Signal")
plt.scatter(x[maxima], y[maxima], color="red", label="Maxima")
plt.scatter(x[minima], y[minima], color="blue", label="Minima")
plt.title("Original Graph")
plt.legend(loc='lower left')
# Plot the gradient
plt.plot(gradient, label="First Derivative")
plt.scatter(maxima, gradient[maxima], color="red", label="Maxima")
plt.scatter(minima, gradient[minima], color="blue", label="Minima")
plt.title("1st Derivative Graph")
plt.legend(loc='lower left')
This gives the following results:
All seems well. However when I change the code such that:
x = [my data of 720 points]
y = [my data of 720 points some are np.inf]
Link to the data (saved as a '.txt' file)
I get really strange results as seen below:
At first, I thought it could be due to the order=2 parameter of the argrelextrema function or noise in my signal. Changing the order to a larger window size reduces the number of points found, and so does including a digital filter. However, I still can't understand why it does not find the maxima and minima at the peaks of the gradient rather than simply along the flat region.
Note: This question had the inverse of my problem.
Changing the parameters less_equal to less and greater_equal to greater also removes many of the points along the flat region, although I am still confused about why the maxima and minima of the gradient are not selected.
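A minimal sketch of the flat-region effect, on a made-up array (the values are assumptions chosen for illustration, not the linked data): with a non-strict comparator, every sample of a plateau is "greater than or equal to" its neighbours and is therefore reported, while the strict comparator only reports the isolated peak.

```python
import numpy as np
from scipy.signal import argrelextrema

# Flat signal with a single peak in the middle
g = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])

# Strict comparison: only the true peak qualifies
strict = argrelextrema(g, np.greater, order=1)[0]

# Non-strict comparison: flat samples tie with their neighbours and qualify too
non_strict = argrelextrema(g, np.greater_equal, order=1)[0]

print(strict)      # [3]
print(non_strict)  # [0 1 3 5 6]
```

Samples 2 and 4 are excluded in the non-strict case only because they sit directly next to the larger peak value.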
The problem has been fixed!
The first problem was that the data was noisy. That means that minima and maxima were being found all along the line. You can solve that in two ways. Firstly you can apply a filter to smooth the line:
from scipy import ndimage
# Filter the signal (to remove excess noise)
scanner_readings = ndimage.gaussian_filter(data, sigma=3)
Another option would be to increase the window size (the order parameter) of the argrelextrema calls.
# Calculate the maximum points of the gradient
maxima = argrelextrema(gradient, np.greater_equal, order=4)
minima = argrelextrema(gradient, np.less_equal, order=4)
Finally, it seems as if argrelextrema does not handle discontinuities well. To fix this, I replaced all inf values with the maximum finite value found, which in this case was 10. You can see how I did this below:
# Remove discontinuity (10 is the max value in the data)
data[data == np.inf] = 10
When you do this, you get the following results:

Inverse FFT returns negative values when it should not

I have several points (x, y, z coordinates) in a 3D box with associated masses. I want to draw a histogram of the mass density found in spheres of a given radius R.
I have written some code that, provided I did not make any errors (which I think I may have), works in the following way:
1. My "real" data is huge, so I wrote a little code to randomly generate non-overlapping points with arbitrary masses in a box.
2. I compute a 3D histogram (weighted by mass) with a bin size about 10 times smaller than the radius of my spheres.
3. I take the FFT of my histogram, compute the wave-modes (kx, ky and kz) and use them to multiply my histogram in Fourier space by the analytic expression of the 3D top-hat window (sphere filtering) function in Fourier space.
4. I inverse FFT my newly computed grid.
Drawing a 1D histogram of the values in each bin then gives me what I want.
My issue is the following: given what I do, there should not be any negative values in my inverted FFT grid (step 4), but I get some, with magnitudes much larger than the numerical error.
If I run my code on a small box (300x300x300 cm3, with the points separated by at least 1 cm) I do not get the issue. I do get it for 600x600x600 cm3 though.
If I set all the masses to 0, thus working on an empty grid, I do get back my 0 without any noted issues.
I here give my code in a full block so that it is easily copied.
import numpy as np
import matplotlib.pyplot as plt
import random
from numba import njit
# 1. Generate a bunch of points with masses from 1 to 3 separated by a radius of 1 cm
radius = 1
rangeX = (0, 100)
rangeY = (0, 100)
rangeZ = (0, 100)
rangem = (1,3)
qty = 20000 # or however many points you want
# Generate a set of all points within 1 of the origin, to be used as offsets later
deltas = set()
for x in range(-radius, radius+1):
    for y in range(-radius, radius+1):
        for z in range(-radius, radius+1):
            if x*x + y*y + z*z<= radius*radius:
                deltas.add((x,y,z))
X = []
Y = []
Z = []
M = []
excluded = set()
for i in range(qty):
    x = random.randrange(*rangeX)
    y = random.randrange(*rangeY)
    z = random.randrange(*rangeZ)
    m = random.uniform(*rangem)
    # Skip candidates that fall too close to an already accepted point
    if (x,y,z) in excluded: continue
    X.append(x)
    Y.append(y)
    Z.append(z)
    M.append(m)
    excluded.update((x+dx, y+dy, z+dz) for (dx,dy,dz) in deltas)
print("There is ",len(X)," points in the box")
# Compute the 3D histogram
a = np.vstack((X, Y, Z)).T
b = 200
H, edges = np.histogramdd(a, weights=M, bins = b)
# Compute the FFT of the grid
Fh = np.fft.fftn(H, axes=(-3,-2, -1))
# Compute the different wave-modes
kx = 2*np.pi*np.fft.fftfreq(len(edges[0][:-1]))*len(edges[0][:-1])/(np.amax(X)-np.amin(X))
ky = 2*np.pi*np.fft.fftfreq(len(edges[1][:-1]))*len(edges[1][:-1])/(np.amax(Y)-np.amin(Y))
kz = 2*np.pi*np.fft.fftfreq(len(edges[2][:-1]))*len(edges[2][:-1])/(np.amax(Z)-np.amin(Z))
# I create a matrix containing the values of the filter in each point of the grid in Fourier space
R = 5
Kh = np.empty((len(kx),len(ky),len(kz)))
@njit
def func_njit(kx, ky, kz, Kh):
    for i in range(len(kx)):
        for j in range(len(ky)):
            for k in range(len(kz)):
                if np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2) != 0:
                    Kh[i][j][k] = (np.sin((np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2))*R)-(np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2))*R*np.cos((np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2))*R))*3/((np.sqrt(kx[i]**2+ky[j]**2+kz[k]**2))*R)**3
                else:
                    # The window tends to 1 as |k| -> 0
                    Kh[i][j][k] = 1
    return Kh
Kh = func_njit(kx, ky, kz, Kh)
# I multiply each point of my grid by the associated value of the filter (multiplication in Fourier space = convolution in real space)
Gh = np.multiply(Fh, Kh)
# I take the inverse FFT of my filtered grid. I take the real part to get back floats but there should only be zeros for the imaginary part.
Density = np.real(np.fft.ifftn(Gh,axes=(-3,-2, -1)))
# Here it shows if there are negative values the magnitude of the error
D = Density.flatten()
N = np.mean(D)
# I then compute the histogram I want
hist, bins = np.histogram(D/N, bins='auto', density=True)
bin_centers = (bins[1:]+bins[:-1])*0.5
plt.plot(bin_centers, hist)
Do you know why I'm getting these negative values? Do you think there is a simpler way to proceed?
Sorry if this is a very long post, I tried to make it very clear and will edit it with your comments, thanks a lot!
A follow-up question on the issue can be found here.
The filter you create in the frequency domain is only an approximation to the filter you want to create. The problem is that we are dealing with the DFT here, not the continuous-domain FT (with its infinite frequencies). The Fourier transform of a ball is indeed the function you describe, however this function has infinite extent in frequency: it is not band-limited!
By sampling this function only within a window, you are effectively multiplying it with an ideal low-pass filter (the rectangle of the domain). This low-pass filter, in the spatial domain, has negative values. Therefore, the filter you create also has negative values in the spatial domain.
This is a slice through the origin of the inverse transform of Kh (after I applied fftshift to move the origin to the middle of the image, for better display):
As you can tell here, there is some ringing that leads to negative values.
One way to overcome this ringing is to apply a windowing function in the frequency domain. Another option is to generate a ball in the spatial domain, and compute its Fourier transform. This second option would be the simplest to achieve. Do remember that the kernel in the spatial domain must also have the origin at the top-left pixel to obtain a correct FFT.
A windowing function is typically applied in the spatial domain to avoid issues with the image border when computing the FFT. Here, I propose to apply such a window in the frequency domain to avoid similar issues when computing the IFFT. Note, however, that this will always further reduce the bandwidth of the kernel (the windowing function would work as a low-pass filter after all), and therefore yield a smoother transition of foreground to background in the spatial domain (i.e. the spatial domain kernel will not have as sharp a transition as you might like). The best known windowing functions are Hamming and Hann windows, but there are many others worth trying out.
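The second option (building the ball in the spatial domain and taking its FFT) can be sketched as follows. This is only an illustration under stated assumptions: the grid H is random stand-in data, the radius is expressed in grid cells rather than physical units, and ball_kernel is a hypothetical helper name. Because both the data and the kernel are non-negative, the filtered result stays non-negative up to rounding error.

```python
import numpy as np

def ball_kernel(shape, radius):
    """Spatial-domain top-hat (ball) kernel with its origin at index [0, 0, 0]."""
    # fftfreq(n) * n gives the signed offsets 0, 1, ..., -2, -1, i.e.
    # distances from the origin measured with periodic wrap-around.
    axes = [np.fft.fftfreq(n) * n for n in shape]
    dx, dy, dz = np.meshgrid(*axes, indexing="ij")
    kernel = (dx**2 + dy**2 + dz**2 <= radius**2).astype(float)
    # Normalise so that filtering preserves the total mass
    return kernel / kernel.sum()

# Stand-in for the mass-weighted histogram (assumption: random non-negative data)
rng = np.random.default_rng(0)
H = rng.uniform(0.0, 3.0, size=(32, 32, 32))

kernel = ball_kernel(H.shape, radius=5)  # radius in grid cells (assumption)
density = np.fft.ifftn(np.fft.fftn(H) * np.fft.fftn(kernel)).real

print(density.min(), density.mean(), H.mean())
```

The price of this approach is that the ball is only resolved to the nearest grid cell, so its Fourier transform is not exactly the analytic top-hat expression.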
Unsolicited advice:
I simplified your code to compute Kh to the following:
kr = np.sqrt(kx[:,None,None]**2 + ky[None,:,None]**2 + kz[None,None,:]**2)
kr *= R
Kh = (np.sin(kr)-kr*np.cos(kr))*3/(kr)**3
Kh[0,0,0] = 1
I find this easier to read than the nested loops. It should also be significantly faster, and avoid the need for njit. Note that you were computing the same distance (what I call kr here) 5 times. Factoring out such computation is not only faster, but yields more readable code.
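A small variant of the vectorised version, sketched here under the assumption that you would rather not divide by zero at the origin and overwrite the result afterwards. The function name tophat_window and the 16-point grid are made up for illustration; the formula is the same one used above.

```python
import numpy as np

def tophat_window(kx, ky, kz, R):
    """Fourier-space top-hat window 3 (sin(kR) - kR cos(kR)) / (kR)^3 on a 3D grid."""
    kr = R * np.sqrt(kx[:, None, None]**2 + ky[None, :, None]**2 + kz[None, None, :]**2)
    # Start from the k -> 0 limit (1) and fill in only the non-zero modes
    Kh = np.ones_like(kr)
    nz = kr != 0
    Kh[nz] = 3.0 * (np.sin(kr[nz]) - kr[nz] * np.cos(kr[nz])) / kr[nz]**3
    return Kh

# Hypothetical wave-modes for a 16-cell axis with unit spacing
k = 2 * np.pi * np.fft.fftfreq(16, d=1.0)
Kh = tophat_window(k, k, k, R=5.0)
print(Kh.shape, Kh[0, 0, 0])
```

The boolean mask costs a little extra memory but avoids the RuntimeWarning that the divide-then-overwrite form triggers.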
Just a guess:
Where do you get the idea that the imaginary part MUST be zero? Have you ever tried to take the absolute values (sqrt(re^2 + im^2)) and forget about the phase instead of just taking the real part? Just something that came to my mind.

Matplotlib: How to increase colormap/linewidth quality in streamplot?

I have the following code to generate a streamplot based on an interp1d-Interpolation of discrete data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as colors
from scipy.interpolate import interp1d
# CSV Import
a1array=pd.read_csv('a1.csv', sep=',',header=None).values
a1 = interp1d(rv, a1v)
da1M = interp1d(rv, da1vM)
# Bx and By vector components
def bx(x ,y):
    rad = np.sqrt(x**2+y**2)
    if rad == 0:
        return 0
    return x*y/rad**4*(-2*a1(rad)+rad*da1M(rad))/2.87445E-19*1E-12
def by(x ,y):
    rad = np.sqrt(x**2+y**2)
    if rad == 0:
        return 4.02995937E-04/2.87445E-19*1E-12
    return -1/rad**4*(2*a1(rad)*y**2+rad*da1M(rad)*x**2)/2.87445E-19*1E-12
# np.float was removed from NumPy; the builtin float is equivalent here
Bx = np.vectorize(bx, otypes=[float])
By = np.vectorize(by, otypes=[float])
# Grid
num_steps = 11
Y, X = np.mgrid[-25:25:(num_steps * 1j), 0:25:(num_steps * 1j)]
Vx = Bx(X, Y)
Vy = By(X, Y)
speed = np.sqrt(Bx(X, Y)**2+By(X, Y)**2)
lw = 2*speed / speed.max()+.5
# Star Radius
circle3 = plt.Circle((0, 0), 16.3473140, color='black', fill=False)
# Plot
fig0, ax0 = plt.subplots(num=None, figsize=(11,9), dpi=80, facecolor='w', edgecolor='k')
strm = ax0.streamplot(X, Y, Vx, Vy, color=speed, linewidth=lw,density=[1,2],
ax0.streamplot(-X, Y, -Vx, Vy, color=speed, linewidth=lw,density=[1,2],
cbar=fig0.colorbar(strm.lines,fraction=0.046, pad=0.04)
cbar.set_label('B[GT]', rotation=270, labelpad=8)
ax0.set_xlabel('x [km]')
ax0.set_ylabel('z [km]')
plt.title('polyEos(0.05,2), M/R=0.2, B_r(0,0)=1402GT', y=1.01)
I uploaded the csv-file here if you want to try some stuff
Which generates the following plot:
I am actually pretty happy with the result except for one small detail, which I can not figure out: If one looks closely the linewidth and the color change in rather big steps, which is especially visible at the center:
Is there some way/option with which I can decrease the size of these steps, especially to make the colormap smoother?
I had another look at this and it wasn't as painful as I thought it might be. Add the following lines:
subdiv = 15
points = np.arange(len(t[0]))
interp_points = np.linspace(0, len(t[0]), subdiv * len(t[0]))
tgx = np.interp(interp_points, points, tgx)
tgy = np.interp(interp_points, points, tgy)
tx = np.interp(interp_points, points, tx)
ty = np.interp(interp_points, points, ty)
after ty is initialised in the trajectories loop (line 164 in my version). Just substitute whatever number of subdivisions you want for subdiv = 15. All the segments in the streamplot will be subdivided into as many equally sized segments as you choose. The colors and linewidths for each will still be properly obtained from interpolating the data.
It's not as neat as changing the integration step but it does plot exactly the same trajectories.
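Outside of the matplotlib source, the subdivision idea can be tried on a plain array. This is a minimal sketch, not the patch itself: subdivide is a hypothetical helper, and unlike the snippet above it stops the interpolation grid at the last index, so that every original vertex is reproduced exactly.

```python
import numpy as np

def subdivide(values, subdiv):
    """Linearly resample a 1-D sequence so each original segment becomes `subdiv` pieces."""
    values = np.asarray(values, dtype=float)
    points = np.arange(values.size)
    # subdiv pieces per segment, original vertices included
    fine = np.linspace(0, values.size - 1, subdiv * (values.size - 1) + 1)
    return np.interp(fine, points, values)

tx = np.array([0.0, 1.0, 3.0])
tx_fine = subdivide(tx, subdiv=4)
print(tx_fine)  # [0.   0.25 0.5  0.75 1.   1.5  2.   2.5  3.  ]
```

The same call would be applied to each of the coordinate and grid arrays of a trajectory so that they stay the same length.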
If you don't mind changing the streamplot code (matplotlib/, you could simply decrease the size of the integration steps. Inside _integrate_rk12() the maximum step size is defined as:
maxds = min(1. / dmap.mask.nx, 1. / dmap.mask.ny, 0.1)
If you decrease that, let's say:
maxds = 0.1 * min(1. / dmap.mask.nx, 1. / dmap.mask.ny, 0.1)
I get this result (left = new, right = original):
Of course, this makes the code about 10x slower, and I haven't thoroughly tested it, but it seems to work (as a quick hack) for this example.
About the density (mentioned in the comments): I personally don't see a problem with that. It's not like we are trying to visualize the actual path line of (e.g.) a particle; the density is already some arbitrary (controllable) choice, and yes it is influenced by choices in the integration, but I don't think that it changes the (not quite sure how to call this) required visualization we're after.
The results (density) do seem to converge a bit for decreasing step sizes, this shows the results for decreasing the integration step with a factor {1,5,10,20}:
You could increase the density parameter to get more smooth color transitions,
but then use the start_points parameter to reduce your overall clutter.
The start_points parameter allows you to explicitly choose the location and
number of trajectories to draw. It overrides the default, which is to plot
as many as possible to fill up the entire plot.
But first you need one little fix to your existing code:
According to the streamplot documentation, the X and Y args should be 1d arrays, not 2d arrays as produced by mgrid.
It looks like passing in 2d arrays is supported, but it is undocumented
and it is currently not compatible with the start_points parameter.
Here is how I revised your X, Y, Vx, Vy and speed:
# Grid
num_steps = 11
Y = np.linspace(-25, 25, num_steps)
X = np.linspace(0, 25, num_steps)
Ygrid, Xgrid = np.mgrid[-25:25:(num_steps * 1j), 0:25:(num_steps * 1j)]
Vx = Bx(Xgrid, Ygrid)
Vy = By(Xgrid, Ygrid)
speed = np.hypot(Vx, Vy)
lw = 3*speed / speed.max()+.5
Now you can explicitly set your start_points parameter. The start points are actually
"seed" points. Any given stream trajectory will grow in both directions
from the seed point. So if you put a seed point right in the center of
the example plot, it will grow both up and down to produce a vertical
stream line.
Besides controlling the number of trajectories, using the
start_points parameter also controls the order they are
drawn. This is important when considering how trajectories terminate.
They will either hit the border of the plot, or they will terminate if
they hit a cell of the plot that already has a trajectory. That means
your first seeds will tend to grow longer and your later seeds will tend
to get limited by previous ones. Some of the later seeds may not grow
at all. The default seeding strategy is to plant a seed at every cell,
which is pretty obnoxious if you have a high density. It also orders
them by planting seeds first along the plot borders and spiraling inward.
This may not be ideal for your particular case. I found a very simple
strategy for your example was to just plant a few seeds between those
two points of zero velocity, y=0 and x from -10 to 10. Those trajectories
grow to their fullest and fill in most of the plot without clutter.
Here is how I create the seed points and set the density:
num_streams = 8
stptsy = np.zeros((num_streams,), float)
stptsx_left = np.linspace(0, -10.0, num_streams)
stptsx_right = np.linspace(0, 10.0, num_streams)
stpts_left = np.column_stack((stptsx_left, stptsy))
stpts_right = np.column_stack((stptsx_right, stptsy))
density = (3,6)
And here is how I modify the calls to streamplot:
strm = ax0.streamplot(X, Y, Vx, Vy, color=speed, linewidth=lw, density=density, start_points=stpts_right)
ax0.streamplot(-X, Y, -Vx, Vy, color=speed, linewidth=lw, density=density, start_points=stpts_left)
The result basically looks like the original, but with smoother color transitions and only 15 stream lines. (sorry no reputation to inline the image)
I think your best bet is to use a colormap other than jet. Perhaps cmap=plt.cm.plasma.
Weird-looking graphs obscure understanding of the data.
For data which is ordered in some way, like by the speed vector magnitude in this case, uniform sequential colormaps will always look smoother. The brightness of sequential maps varies monotonically over the color range, removing large perceived color changes over small ranges of data. The uniform maps vary linearly over their whole range, which makes the main features in the data much more visually apparent.
The jet colormap spans a very wide variety of brightnesses over its range, with an inflexion in the middle. This is responsible for the particularly egregious red to blue transition around the center region of your graph.
The matplotlib user guide on choosing a color map has a few recommendations about selecting an appropriate map for a given data set.
I don't think there is much else you can do to improve this by just changing parameters in your plot.
The streamplot divides the graph into cells, with 30*density[x,y] cells in each direction, and at most one streamline goes through each cell. The only setting which directly increases the number of segments is the density of the grid matplotlib uses. Increasing the Y density will decrease the segment length so that the middle region may transition more smoothly. The cost of this is an inevitable cluttering of the graph in regions where the streamlines are horizontal.
You could also try to normalise the speeds differently so that the change is artificially lowered near the center. At the end of the day though, it seems like that defeats the point of the graph. The graph should provide a useful view of the data for a human to understand. Using a colormap with strange inflexions, or warping the data so that it looks nicer, removes some understanding which could otherwise be obtained from looking at the graph.
A more detailed discussion about the issues with colormaps like jet can be found on this blog.

How do I limit the interpolation region in the InterpolatedUnivariateSpline in Python when given non-uniform samples?

I'm trying to get a nice upsampler using Python when I have non-uniform spaced inputs. Any suggestions would be helpful. I've tried a number of interp functions. Here's an example:
from scipy.interpolate import InterpolatedUnivariateSpline
from numpy import linspace, arange, append
from matplotlib.pyplot import plot
F=[0, 1000,1500,2000,2500,3000,3500,4000,4500,5000,5500,22050]
for i in arange(2, len(F)):
    ff=append(ff,linspace(F[i-1],F[i], 10))
plot(F,M,'r-o'); plot(ff,mm,'bo'); show()
This is the plot I get:
I need to get interpolated values that don't go below 0. Note that the blue dots go below zero. The red line represents the original F vs. M data. If I use k=1 (piece-wise linear interp) then I get good values as shown here:
mm=aa(ff); plot(F,M,'r-o');plot(ff,mm,'bo'); show()
The problem is that I need to have a "smooth" interpolation and not the piece-wise value. Does anyone know if the bbox argument in InterpolatedUnivariateSpline helps to fix that? I can't find any documentation on what bbox does. Is there another, easier way to accomplish this?
Thanks in advance for any help.
Positivity-preserving interpolation is hard (if it wasn't, there wouldn't be a bunch of papers written about it). The splines of low degree (2, 3) usually do pretty well in this regard, but your data has that large gap in it, and it happens to be at the end of data range, making things worse.
One solution is to do interpolation in two steps: first upsample the data by piecewise linear interpolation, then interpolate new data with a smooth spline (I'll use cubic spline below, though quadratic also works).
The gap_size array records how large each gap is, relative to the smallest one. In the subsequent loop, uniformly spaced points are placed in the large gaps (those that are at least twice the size of the smallest one). The result is F_new, a nearly uniform grid that still includes the original points. The corresponding M values for it are generated by a piecewise linear spline.
Subsequent cubic interpolation produces a smooth curve that stays positive.
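import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import InterpolatedUnivariateSpline
F = [0, 1000,1500,2000,2500,3000,3500,4000,4500,5000,5500,22050]
M = [0.,2.85,2.49,1.65,1.55,1.81,1.35,1.00,1.13,1.58,1.21,0.]
# Size of each gap relative to the smallest one (integer number of sub-intervals)
gap_size = np.diff(F) // np.diff(F).min()
F_new = []
for i in range(len(F)-1):
    # Fill each gap with uniformly spaced points, keeping its left endpoint
    F_new.extend(np.linspace(F[i], F[i+1], gap_size[i], endpoint=False))
# endpoint=False leaves out the very last original point, so add it back
F_new.append(F[-1])
pl_spline = InterpolatedUnivariateSpline(F, M, k=1)
M_new = pl_spline(F_new)
smooth_spline = InterpolatedUnivariateSpline(F_new, M_new, k=3)
ff = np.linspace(F[0], F[-1], 100)
plt.plot(F, M, 'ro')
plt.plot(ff, smooth_spline(ff), 'b')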
Of course, no tricks can hide the truth that we don't know what happens between 5500 and 22050 (Hz, I presume), the nearly-linear part is just a placeholder.
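A different technique that may also be worth trying, named plainly because it is not the two-step approach above: shape-preserving piecewise cubic Hermite interpolation via scipy.interpolate.PchipInterpolator. It does not overshoot between data points, so non-negative data gives a non-negative curve, at the cost of being only once continuously differentiable. A minimal sketch with the same F and M values:

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

F = np.array([0, 1000, 1500, 2000, 2500, 3000, 3500, 4000,
              4500, 5000, 5500, 22050], dtype=float)
M = np.array([0., 2.85, 2.49, 1.65, 1.55, 1.81, 1.35, 1.00,
              1.13, 1.58, 1.21, 0.])

# Monotone on every interval between data points, hence no undershoot below 0
pchip = PchipInterpolator(F, M)

ff = np.linspace(F[0], F[-1], 500)
mm = pchip(ff)
print(mm.min(), mm.max())
```

Whether the flatter look around local extrema is acceptable depends on what the upsampled curve is used for.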

Python: Choose the n points better distributed from a bunch of points

I have a numpy array of points in an XY plane like:
I want to select the n points (let's say 100) better distributed from all these points. This is, I want the density of points to be constant anywhere.
Something like this:
Is there any pythonic way or any numpy/scipy function to do this?
@EMS is very correct that you should give a lot of thought to exactly what you want.
There are more sophisticated ways to do this (EMS's suggestions are very good!), but a brute-force-ish approach is to bin the points onto a regular, rectangular grid and draw a random point from each bin.
The major downside is that you won't get the number of points you ask for. Instead, you'll get some number smaller than that number.
A bit of creative indexing with pandas makes this "gridding" approach quite easy, though you can certainly do it with "pure" numpy, as well.
As an example of the simplest possible, brute force, grid approach: (There's a lot we could do better, here.)
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
total_num = 100000
x, y = np.random.normal(0, 1, (2, total_num))
# We'll always get fewer than this number for two reasons.
# 1) We're choosing a square grid, and "subset_num" may not be a perfect square
# 2) There won't be data in every cell of the grid
subset_num = 1000
# Bin points onto a rectangular grid with approximately "subset_num" cells
nbins = int(np.sqrt(subset_num))
xbins = np.linspace(x.min(), x.max(), nbins+1)
ybins = np.linspace(y.min(), y.max(), nbins+1)
# Make a dataframe indexed by the grid coordinates.
i, j = np.digitize(y, ybins), np.digitize(x, xbins)
df = pd.DataFrame(dict(x=x, y=y), index=[i, j])
# Group by which cell the points fall into and choose a random point from each
# Sample whole rows so that each chosen x stays paired with its own y
new = df.groupby(level=[0, 1]).sample(n=1)
# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].plot(x, y, 'k.')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(new.x, new.y, 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(new)))
plt.setp(axes, aspect=1, adjustable='box')
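The same gridding idea can be written in "pure" numpy. This is a minimal sketch under assumptions: grid_subsample is a hypothetical helper, and it returns indexes into the original arrays instead of a dataframe. Shuffling first and then keeping the first point seen in each cell is what makes the pick random.

```python
import numpy as np

def grid_subsample(x, y, nbins, rng):
    """Pick one random point index per occupied cell of an nbins x nbins grid."""
    xbins = np.linspace(x.min(), x.max(), nbins + 1)
    ybins = np.linspace(y.min(), y.max(), nbins + 1)
    # Using only the interior edges yields cell indexes 0 .. nbins-1
    i = np.digitize(y, ybins[1:-1])
    j = np.digitize(x, xbins[1:-1])
    cell = i * nbins + j
    # Random order, then np.unique keeps the first point met in each cell
    order = rng.permutation(x.size)
    _, first = np.unique(cell[order], return_index=True)
    return order[first]

rng = np.random.default_rng(0)
x, y = rng.normal(0, 1, (2, 100_000))
idx = grid_subsample(x, y, nbins=31, rng=rng)
print(len(idx))  # at most 31 * 31, fewer if some cells are empty
```

The selected points are then simply x[idx] and y[idx].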
Loosely based on @EMS's suggestion in a comment, here's another approach.
We'll calculate the density of points using a kernel density estimate, and then use the inverse of that as the probability that a given point will be chosen.
scipy.stats.gaussian_kde is not optimized for this use case (or for large numbers of points in general). It's the bottleneck here. It's possible to write a more optimized version for this specific use case in several ways (approximations, special case here of pairwise distances, etc). However, that's beyond the scope of this particular question. Just be aware that for this specific example with 1e5 points, it will take a minute or two to run.
The advantage of this method is that you get the exact number of points that you asked for. The disadvantage is that you are likely to have local clusters of selected points.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
total_num = 100000
subset_num = 1000
x, y = np.random.normal(0, 1, (2, total_num))
# Let's approximate the PDF of the point distribution with a kernel density
# estimate. scipy.stats.gaussian_kde is slow for large numbers of points, so
# you might want to use another implementation in some cases.
xy = np.vstack([x, y])
dens = gaussian_kde(xy)(xy)
# Try playing around with this weight. Compare 1/dens, 1-dens, and (1-dens)**2
weight = 1 / dens
weight /= weight.sum()
# Draw a sample using np.random.choice with the specified probabilities.
# We'll need to view things as a structured array because np.random.choice
# expects a 1D array.
dat = xy.T.ravel().view([('x', float), ('y', float)])
subset = np.random.choice(dat, subset_num, p=weight)
# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].scatter(x, y, c=dens, edgecolor='none')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(subset['x'], subset['y'], 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(subset)))
# 'box-forced' was removed from recent matplotlib releases; 'box' now works with shared axes
plt.setp(axes, aspect=1, adjustable='box')
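If the KDE is too slow, one of the approximations hinted at above is to swap it for a plain 2D histogram and use the count of each point's cell as its density. This is a rough sketch under that assumption, not an equivalent of `gaussian_kde`; `nbins = 50` is an arbitrary choice, and drawing without replacement is my addition so that no point is selected twice.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
total_num = 100000
subset_num = 1000
x, y = rng.normal(0, 1, (2, total_num))

nbins = 50  # assumed resolution; coarser is smoother, finer is noisier
counts, xedges, yedges = np.histogram2d(x, y, bins=nbins)

# Look up the count of the cell each point falls in. Every point is counted
# in its own cell, so "dens" is at least 1 and the division below is safe.
ix = np.clip(np.digitize(x, xedges) - 1, 0, nbins - 1)
iy = np.clip(np.digitize(y, yedges) - 1, 0, nbins - 1)
dens = counts[ix, iy]

# Inverse-density weights, normalised into probabilities
weight = 1 / dens
weight /= weight.sum()

# Draw indices rather than points, without replacement so no point repeats
idx = rng.choice(total_num, size=subset_num, replace=False, p=weight)
x_sub, y_sub = x[idx], y[idx]
```

This runs in well under a second for 1e5 points, at the price of a blockier density estimate than the KDE gives.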
Unless you give a specific criterion for defining "better distributed" we can't give a definite answer.
The phrase "constant density of points anywhere" is also misleading, because you have to specify the empirical method for calculating density. Are you approximating it on a grid? If so, the grid size will matter, and points near the boundary won't be correctly represented.
A different approach might be as follows:
1. Calculate the distance matrix between all pairs of points.
2. Treating this distance matrix as a weighted network, calculate some measure of centrality for each point in the data, such as eigenvector centrality, betweenness centrality or Bonacich centrality.
3. Order the points in descending order according to the centrality measure, and keep the first 100.
4. Repeat steps 1-3, possibly using a different notion of "distance" between points and with different centrality measures.
Many of these functions are provided directly by SciPy, NetworkX, and scikit-learn and will work directly on a NumPy array.
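As a rough illustration of the first three steps of the centrality recipe above, here is a sketch using only NumPy and SciPy. It assumes Euclidean distance and takes eigenvector centrality to be the eigenvector of the largest eigenvalue of the distance matrix; the sample size of 500 and the name `keep` are arbitrary choices of mine, since the full matrix is n x n and will not fit in memory for very large point sets.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
points = rng.normal(0, 1, (500, 2))  # small on purpose: the matrix is n x n
keep = 100

# Step 1: distance matrix between all pairs of points
dist = squareform(pdist(points))

# Step 2: eigenvector centrality of the weighted network, taken here as the
# eigenvector of the largest eigenvalue of the symmetric distance matrix.
# eigh returns eigenvalues in ascending order; the sign of an eigenvector is
# arbitrary, hence the abs().
eigvals, eigvecs = np.linalg.eigh(dist)
centrality = np.abs(eigvecs[:, -1])

# Step 3: order by descending centrality and keep the first "keep" points
ranked = np.argsort(centrality)[::-1]
subset = points[ranked[:keep]]
```

Note that with raw distances as edge weights, the most "central" points are the ones far from everything else, so this particular choice favours the outskirts of the cloud. A different weighting (for example a decreasing function of distance) would rank points differently.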
If you are definitely committed to thinking of the problem in terms of regular spacing and grid density, you might take a look at quasi-Monte Carlo methods. In particular, you could try to compute the convex hull of the set of points and then apply a QMC technique to regularly sample from anywhere within that convex hull. But again, this privileges the exterior of the region, which should be sampled far less than the interior.
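A sketch of the convex-hull idea, assuming SciPy 1.7 or later for `scipy.stats.qmc`. The Halton sequence, the bounding-box scaling and the rejection step based on a Delaunay point-in-hull test are my own choices for illustration; other QMC sequences would work the same way. The result is a set of new, regularly spread locations inside the hull, not a subset of the observed points.

```python
import numpy as np
from scipy.spatial import ConvexHull, Delaunay
from scipy.stats import qmc

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
points = rng.normal(0, 1, (5000, 2))

# Convex hull of the data, triangulated so we can test membership
hull = ConvexHull(points)
tri = Delaunay(points[hull.vertices])

# Low-discrepancy points in the unit square, scaled to the bounding box
sampler = qmc.Halton(d=2, scramble=False)
unit = sampler.random(400)
candidates = qmc.scale(unit, points.min(axis=0), points.max(axis=0))

# Keep only the candidates that fall inside the hull
# (find_simplex returns -1 for points outside the triangulation)
inside = tri.find_simplex(candidates) >= 0
qmc_points = candidates[inside]
```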
Yet another interesting approach would be to simply run the K-means algorithm on the scattered data, with a fixed number of clusters K=100. After the algorithm converges, you'll have 100 points from your space (the mean of each cluster). You could repeat this several times with different random starting points for the cluster means and then sample from that larger set of possible means. Since your data do not appear to actually cluster into 100 components naturally, the convergence of this approach won't be very good and may require running the algorithm for a large number of iterations. This also has the downside that the resulting 100 points are not necessarily points that come from the observed data, but are instead local averages of many points.
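A minimal sketch of the K-means idea using `scipy.cluster.vq.kmeans2`. The data size, `iter=50` and the choice to start from randomly selected observed points are assumptions made for illustration. The last step, snapping each mean to its nearest observed point with a k-d tree, is an addition of mine to work around the "local averages" downside; it is not part of the suggestion itself.

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
points = rng.normal(0, 1, (20000, 2))
k = 100

# Start from k distinct observed points so the run is repeatable
init = points[rng.choice(len(points), size=k, replace=False)]
centroids, labels = kmeans2(points, init, iter=50, minit='matrix')

# Cluster means are local averages, not observations. Snap each mean to its
# nearest observed point to get a subset of the original data instead.
_, nearest = cKDTree(points).query(centroids)
subset = points[nearest]
```

Repeating this with different `init` arrays gives the larger pool of candidate means described above.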
This method iteratively picks, from the remaining points, the one with the greatest minimum distance to the already picked points. It has terrible time complexity, but produces pretty uniformly distributed results:
from numpy import array, argmax, ndarray, vstack
from numpy.random import randint
from scipy.spatial.distance import cdist

def well_spaced_points(points: ndarray, num_points: int):
    """
    Pick `num_points` well-spaced points from `points` array.

    :param points: An m x n array of m n-dimensional points.
    :param num_points: The number of points to pick.
    :rtype: ndarray
    :return: A num_points x n array of points from the original array.
    """
    # pick a random starting point from the whole array
    current_point_index = randint(0, points.shape[0])
    picked_points = array([points[current_point_index]])
    remaining_points = vstack((
        points[: current_point_index],
        points[current_point_index + 1:]
    ))
    # while there are more points to pick
    while picked_points.shape[0] < num_points:
        # find the remaining point whose minimum distance to the
        # already picked points is largest
        distance_pk_rmn = cdist(picked_points, remaining_points)
        min_distance_pk = distance_pk_rmn.min(axis=0)
        i_furthest = argmax(min_distance_pk)
        # add it to picked points and remove it from remaining
        picked_points = vstack((
            picked_points,
            remaining_points[i_furthest]
        ))
        remaining_points = vstack((
            remaining_points[: i_furthest],
            remaining_points[i_furthest + 1:]
        ))
    return picked_points
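Most of the cost in the function above comes from recomputing the full picked-by-remaining distance table with `cdist` on every pass and from rebuilding the arrays with `vstack`. A sketch of the same greedy selection that instead keeps one running minimum distance per point is shown below. The name `farthest_point_indices` and its `seed` parameter are hypothetical, and it returns indices into `points` rather than the points themselves.

```python
import numpy as np

def farthest_point_indices(points, num_points, seed=None):
    """Greedy farthest-point selection; returns indices into `points`."""
    rng = np.random.default_rng(seed)
    picked = np.empty(num_points, dtype=int)

    # Start from a random point
    picked[0] = rng.integers(points.shape[0])
    # min_dist[i] = distance from point i to its nearest picked point
    min_dist = np.linalg.norm(points - points[picked[0]], axis=1)

    for k in range(1, num_points):
        # The next pick is the point farthest from everything picked so far.
        # Already picked points have min_dist == 0, so they are not re-picked
        # as long as the data contain enough distinct points.
        picked[k] = np.argmax(min_dist)
        # Only distances to the newly picked point need to be computed
        new_dist = np.linalg.norm(points - points[picked[k]], axis=1)
        np.minimum(min_dist, new_dist, out=min_dist)
    return picked

pts = np.random.default_rng(0).normal(0, 1, (10000, 2))
idx = farthest_point_indices(pts, 50, seed=0)
subset = pts[idx]
```

Each pass now costs one distance computation against all points, so the total work grows with the number of points times the number of picks rather than with the square of the number of picks.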

