I am trying to interpolate data with gaps. Sometimes the gap can be very large, and I do not want the interpolation to "succeed" within the gap; the result should be NaNs inside a large gap. For example, consider this example data set:
orig_x = [26219, 26225, 26232, 28521, 28538]
orig_y = [39, 40, 41, 72, 71]
which has a clear gap between x-values 26232 and 28521. Now, I would like to have the orig_y interpolated to x-values like this:
import numpy as np
x_target = np.array(range(min(orig_x) // 10 * 10 + 10, max(orig_x) // 10 * 10 + 10, 10))
# array([26220, 26230, 26240, 26250, 26260, 26270, 26280, 26290,
# ...
# 28460, 28470, 28480, 28490, 28500, 28510, 28520, 28530])
and the output y_target should be np.nan everywhere except at 26220, 26230 and 28530. Let's say the condition for this is: if there is a gap larger than 40 in the x-data, the interpolation should result in np.nan inside that gap.
Goal shown as a picture
Instead of this (interpolation cutting straight across the gap), I want to get something like this (np.nan inside the gap), i.e. the "gap" in the data should result in np.nan instead of garbage values.
Question
What would be the best way (fastest interpolation) to achieve this kind of interpolation? The interpolation can be linear or more sophisticated (e.g. a cubic spline). One possibility I have in mind would be to use scipy.interpolate.interp1d as a starting point, like this
from scipy.interpolate import interp1d
f = interp1d(orig_x, orig_y, bounds_error=False)
y_target = f(x_target)
and then search for gaps in the data and replace the interpolated values with np.nan inside the gaps; a sketch of that post-processing is shown below. Since I will be using this on a fairly large dataset (~10M rows, a few columns, handled in parts), performance is key.
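For reference, here is a minimal sketch of that interp1d-plus-post-processing idea (the helper name interp1d_with_gap_mask and the searchsorted-based masking are my own illustration, not the final implementation further below; a target point that coincides exactly with the right edge of a large gap would also get masked):
import numpy as np
from scipy.interpolate import interp1d

def interp1d_with_gap_mask(orig_x, orig_y, x_target, max_gap=40):
    # Minimal sketch: interpolate first, then blank out targets that fall
    # inside a too-large gap of the (sorted) original x-data.
    orig_x = np.asarray(orig_x, dtype=float)
    orig_y = np.asarray(orig_y, dtype=float)
    f = interp1d(orig_x, orig_y, bounds_error=False)
    y_target = f(np.asarray(x_target, dtype=float))
    # For each target point, find the indices of the surrounding original samples.
    right = np.searchsorted(orig_x, x_target)
    left = np.clip(right - 1, 0, orig_x.size - 1)
    right = np.clip(right, 0, orig_x.size - 1)
    gap = orig_x[right] - orig_x[left]
    y_target[gap > max_gap] = np.nan
    return y_target

y_target = interp1d_with_gap_mask(orig_x, orig_y, x_target, max_gap=40)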
After some trial and error, I think I got a "fast enough" implementation using basic linear interpolation and numba for the speedup. Forgive me for writing everything in one loop and one function, but that seems to be the numba way of making code fast (numba loves loops, and does not seem to accept nested functions).
Test data used
I added some more data to x_target to test the algorithm's performance.
orig_x = np.array([26219, 26225, 26232, 28521, 28538])
orig_y = np.array([39, 40, 41, 72, 71])
x_target = np.array(
    np.arange(min(orig_x) // 10 * 10,
              max(orig_x) // 10 * 10 + 10, 0.1))
Test code
from matplotlib import pyplot as plt
y_target = interpolate_with_max_gap(orig_x, orig_y, x_target, max_gap=40)
plt.scatter(x_target, y_target, label='interpolated', s=10)
plt.scatter(orig_x, orig_y, label='orig', s=10)
plt.legend()
plt.show()
Test results
The data is interpolated in regions with gap less than max_gap (40):
closeup:
Speed:
I first tried a pure Python + numpy implementation, which took 49.6 ms with the same test data (using timeit). This implementation with numba takes 480 µs (a ~100x speedup!). When using target_x_is_sorted=True, the time drops to 80.1 µs!
Setting orig_x_is_sorted=True did not give a speedup, probably because orig_x is so short here that skipping the sort makes no measurable difference.
Implementation
import numba
import numpy as np
@numba.njit()
def interpolate_with_max_gap(orig_x,
                             orig_y,
                             target_x,
                             max_gap=np.inf,
                             orig_x_is_sorted=False,
                             target_x_is_sorted=False):
    """
    Interpolate data linearly with maximum gap. If there is a
    larger gap in the data than `max_gap`, the gap will be filled
    with np.nan.

    The input values should not contain NaNs.

    Parameters
    ----------
    orig_x: np.array
        The input x-data.
    orig_y: np.array
        The input y-data.
    target_x: np.array
        The output x-data; the data points on the x-axis that
        you want the interpolation results for.
    max_gap: float
        The maximum allowable gap in `orig_x` inside which
        interpolation is still performed. Gaps larger than
        this will be filled with np.nan in the output `target_y`.
    orig_x_is_sorted: boolean, default: False
        If True, the input data `orig_x` is assumed to be monotonically
        increasing. Some performance gain if you supply sorted input data.
    target_x_is_sorted: boolean, default: False
        If True, the input data `target_x` is assumed to be
        monotonically increasing. Some performance gain if you supply
        sorted input data.

    Returns
    -------
    target_y: np.array
        The interpolation results.
    """
    if not orig_x_is_sorted:
        # Sort to be monotonic wrt. the input x-variable.
        idx = orig_x.argsort()
        orig_x = orig_x[idx]
        orig_y = orig_y[idx]

    if not target_x_is_sorted:
        target_idx = target_x.argsort()
        # Needed for sorting the data back afterwards.
        target_idx_for_reverse = target_idx.argsort()
        target_x = target_x[target_idx]

    target_y = np.empty(target_x.size)
    idx_orig = 0
    orig_gone_through = False

    for idx_target, x_new in enumerate(target_x):
        # Grow idx_orig if needed.
        while not orig_gone_through:
            if idx_orig + 1 >= len(orig_x):
                # Already consumed orig_x; no more data,
                # so we would need to extrapolate.
                orig_gone_through = True
            elif x_new > orig_x[idx_orig + 1]:
                idx_orig += 1
            else:
                # x_new <= x2
                break

        if orig_gone_through:
            target_y[idx_target] = np.nan
            continue

        x1 = orig_x[idx_orig]
        y1 = orig_y[idx_orig]
        x2 = orig_x[idx_orig + 1]
        y2 = orig_y[idx_orig + 1]

        if x_new < x1:
            # Would need to extrapolate to the left.
            target_y[idx_target] = np.nan
            continue

        delta_x = x2 - x1
        if delta_x > max_gap:
            target_y[idx_target] = np.nan
            continue

        delta_y = y2 - y1
        if delta_x == 0:
            target_y[idx_target] = np.nan
            continue

        k = delta_y / delta_x
        delta_x_new = x_new - x1
        delta_y_new = k * delta_x_new
        y_new = y1 + delta_y_new

        target_y[idx_target] = y_new

    if not target_x_is_sorted:
        return target_y[target_idx_for_reverse]
    return target_y
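The timings above were measured with timeit; a minimal sketch of how such a measurement could look (my own illustration, not the exact benchmark used):
import timeit

# Warm-up call so numba's JIT compilation time is not included in the timing.
interpolate_with_max_gap(orig_x, orig_y, x_target, max_gap=40)
t = timeit.timeit(
    lambda: interpolate_with_max_gap(orig_x, orig_y, x_target, max_gap=40),
    number=100) / 100
print(f"{t * 1e6:.1f} µs per call")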
Related
I'm processing joystick data. There are two time series, one for the joystick's X motion and another for its Y motion. The two data sets have different time stamps. In the end, I hope to use matplotlib to plot a parametric 2D graph of the joystick data (where time is implicit, and the X and Y motion make up the points on the graph). However, before this end goal, I have to "merge" the two time series. For convenience, I'm going to assume that joystick motion is linear between timestamps.
I've coded something that can complete this (see below), but it seems needlessly complex. I'm hoping to find a simpler approach to accomplish this linear interpolation, if possible.
import numpy as np
import matplotlib.pyplot as plt
# Example data
X = np.array([[0.98092103, 1013],
              [1.01400101, 375],
              [1.0561214, -8484],
              [1.06982589, -17181],
              [1.09453125, -16965]])
Y = np.array([[0.98092103, 534],
              [1.00847602, 1690],
              [1.0392499, -5327],
              [1.06982589, -27921],
              [1.10026598, -28915]])
data = []
# keep track of which index was used last
current_indices = [-1, -1]
# make ordered list of all timestamps between both data sets, no repeats
all_timestamps = sorted(set(X[:, 0]).union(set(Y[:, 0])))
for ts in all_timestamps:
    # For each dimension (X & Y): index where this timestamp exists, else None.
    ts_indices = tuple(indx[0] if len(indx := np.where(Z[:, 0] == ts)[0]) > 0 else None
                       for Z in (X, Y))
    # Out-of-range timesteps assumed to be zero.
    ts_vals = [0, 0]
    for variable_indx, (current_z_indx, Z) in enumerate(zip(ts_indices, (X, Y))):
        last_index_used = current_indices[variable_indx]
        if current_z_indx is not None:
            # If the timestep is present, take the value directly.
            current_indices[variable_indx] = current_z_indx
            ts_vals[variable_indx] = Z[current_z_indx, 1]
        elif last_index_used not in (-1, len(Z[:, 0]) - 1):
            # If the timestep is within the range of the data, linearly interpolate.
            t0, z0 = Z[last_index_used, :]
            t1, z1 = Z[last_index_used + 1, :]
            ts_vals[variable_indx] = z0 + (z1 - z0) * (ts - t0) / (t1 - t0)
    data.append([ts, *ts_vals])
merged_data = np.array(data)
plt.plot(merged_data[:,1],merged_data[:,2])
plt.show()
You are looking for np.interp to simplify the linear interpolation.
Following your example:
import numpy as np
import matplotlib.pyplot as plt
# Example data
X = np.array([[0.98092103, 1013],
              [1.01400101, 375],
              [1.0561214, -8484],
              [1.06982589, -17181],
              [1.09453125, -16965]])
Y = np.array([[0.98092103, 534],
              [1.00847602, 1690],
              [1.0392499, -5327],
              [1.06982589, -27921],
              [1.10026598, -28915]])
#extract all timestamps
all_timestamps = sorted(set(X[:, 0]).union(set(Y[:, 0])))
#linear interpolation
valuesX = np.interp(all_timestamps, X[:,0], X[:,1])
valuesY = np.interp(all_timestamps, Y[:,0], Y[:,1])
#plotting
plt.plot(valuesX, valuesY)
plt.show()
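A note of my own, not part of the original answer: outside the range of the given timestamps, np.interp holds the first/last value constant instead of returning NaN; the optional left and right arguments change that behaviour if needed:
# np.interp clamps to the edge values outside the x-range by default;
# left/right let you return np.nan (or any other sentinel) there instead.
valuesX = np.interp(all_timestamps, X[:, 0], X[:, 1], left=np.nan, right=np.nan)
valuesY = np.interp(all_timestamps, Y[:, 0], Y[:, 1], left=np.nan, right=np.nan)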
I have the following boundary conditions for a time series in python.
The notation I use here is t_x, where x describes the time in milliseconds (this is not my code, I just thought this notation is good for explaining my issue).
t_0 = 0
t_440 = -1.6
t_830 = 0
mean_value = -0.6
I want to create a list of 84 values (one value per 10 ms, from 0 to 830 ms).
The list should describe a "curve" that starts at zero, has its minimum value of -1.6 at 440 ms (index 44 in the list), ends with 0 at 830 ms (index 83 in the list), and the overall mean of the list should be -0.6.
I could not come up with an idea of how to "fit" these boundary conditions to create such a list.
I would really appreciate help.
It is a quick and dirty approach, but it works:
X = list(range(0, 830 + 1, 10))
Y = [0.0 for x in X]
Y[44] = -1.6
b = 12.3486
for x in range(44):
    Y[x] = -1.6*(b*x + x**2)/(b*44 + 44**2)
for x in range(83, 44, -1):
    Y[x] = -1.6*(b*(83-x) + (83-x)**2)/(b*38 + 38**2)
print(f'{sum(Y)/len(Y)=:8.6f}, {Y[0]=}, {Y[44]=}, {Y[83]=}')
from matplotlib import pyplot as plt
plt.plot(X,Y)
plt.show()
The code gives the following output:
sum(Y)/len(Y)=-0.600000, Y[0]=-0.0, Y[44]=-1.6, Y[83]=-0.0
and shows the following diagram:
The first step in coming up with the above approach was to create a linearly sloping 'curve' from the minimum up to the zeroes. It turned out that the linear approach gives too large a mean Y value here, which means that the 'curve' must have a sharp peak at its minimum and needs to be modelled with a polynomial. To keep things simple I decided to use a quadratic polynomial and to approach the minimum from the left and the right side separately, as the curve isn't symmetric. The b-value was found by trial and error; its precision can be increased manually or by writing a small function that finds it iteratively (a sketch of such a search is shown below).
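For illustration, a minimal sketch of such an iterative search (my addition, not part of the original answer): a simple bisection on b, using the fact that the mean of Y becomes more negative as b grows; build_Y is a hypothetical helper that just repeats the construction from the code above.
def build_Y(b):
    # Hypothetical helper repeating the construction above for a given b.
    Y = [0.0] * 84
    Y[44] = -1.6
    for x in range(44):
        Y[x] = -1.6*(b*x + x**2)/(b*44 + 44**2)
    for x in range(83, 44, -1):
        Y[x] = -1.6*(b*(83-x) + (83-x)**2)/(b*38 + 38**2)
    return Y

# The mean of Y decreases monotonically as b grows, so bisect on b.
lo, hi = 0.0, 1000.0   # assumed bracket around the sought b
for _ in range(60):
    b = (lo + hi) / 2
    if sum(build_Y(b)) / 84 > -0.6:   # mean not negative enough -> increase b
        lo = b
    else:
        hi = b
print(b, sum(build_Y(b)) / 84)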
Update providing a generic solution as requested in a comment
The code below provides a
meanYboundaryXY(lbc = [(0,0), (440,-1.6), (830,0), -0.6], shape='saw')
function returning the X and Y lists of the time-series data calculated from the passed boundary values:
def meanYboundaryXY(lbc=[(0, 0), (440, -1.6), (830, 0), -0.6], shape='saw'):
    lbcXY = lbc[0:3]; meanY_boundary = lbc[3]
    minX = min(x for x, y in lbcXY)
    maxX = max(x for x, y in lbcXY)
    minY = lbc[1][1]
    step = 10
    X = list(range(minX, maxX + 1, step))
    lenX = len(X)
    Y = [None for x in X]
    sumY = 0
    for x, y in lbcXY:
        Y[x//step] = y
        sumY += y
    target_sumY = meanY_boundary*lenX
    if shape == 'rect':
        subY = (target_sumY - sumY)/(lenX - 3)
        for i, y in enumerate(Y):
            if y is None:
                Y[i] = subY
    elif shape == 'saw':
        peakNextY = 2*(target_sumY - sumY)/(lenX - 1)
        iYleft = lbc[1][0]//step - 1
        iYrght = iYleft + 2
        iYstart = lbc[0][0] // step
        iYend = lbc[2][0] // step
        for i in range(iYstart, iYleft + 1, 1):
            Y[i] = peakNextY * i / iYleft
        for i in range(iYend, iYrght - 1, -1):
            Y[i] = peakNextY * (iYend - i)/(iYend - iYrght)
    else:
        raise ValueError(str(f'meanYboundaryXY() EXIT, {shape=} not in ["saw","rect"]'))
    return (X, Y)
X, Y = meanYboundaryXY()
print(f'{sum(Y)/len(Y)=:8.6f}, {Y[0]=}, {Y[44]=}, {Y[83]=}')
from matplotlib import pyplot as plt
plt.plot(X,Y)
plt.show()
The code outputs:
sum(Y)/len(Y)=-0.600000, Y[0]=0, Y[44]=-1.6, Y[83]=0
and creates the following two diagrams for shape='rect' and shape='saw':
As an old geek, I tried to solve the question with a simple algorithm.
First, calculate the points as two symmetric lines, from 0 up to 44 and from 44 down to 88 (orange on the graph).
Then calculate the sum of the points excluding the middle point, and its ratio to the sum those points should have for an overall mean of -0.6 (again excluding the middle point).
Apply that ratio to the previously calculated points, except the middle point (blue curve on the graph).
This yields the curve that Claudio called "saw".
Personally, I think Claudio's quadratic interpolation gives a nicer curve, but it needs trial-and-error loops.
import matplotlib
# define goals
nbPoints = 89
msPerPoint = 10
midPoint = nbPoints//2
valueMidPoint = -1.6
meanGoal = -0.6
def createSerieLinear():
    # two lines: 0 up to 44, 44 down to 88 (89 values centered on 44)
    serie = [0 for i in range(0, nbPoints)]
    interval = valueMidPoint/midPoint
    for i in range(0, midPoint + 1):
        serie[i] = i*interval
        serie[nbPoints - 1 - i] = i*interval
    return serie
# keep an original to plot
orange = createSerieLinear()
# work on a base
base = createSerieLinear()
# total except midPoint
totalBase = (sum(base)-valueMidPoint)
#total goal except 44
totalGoal = meanGoal*nbPoints - valueMidPoint
# apply ratio to reduce
reduceRatio = totalGoal/totalBase
for i in range(0, midPoint):
    base[i] *= reduceRatio
    base[nbPoints - 1 - i] *= reduceRatio
# verify
meanBase = sum(base)/nbPoints
print("new mean:",meanBase)
# draw
from matplotlib import pyplot as plt
X =[i*msPerPoint for i in range(0,nbPoints)]
plt.plot(X,base)
plt.plot(X,orange)
plt.show()
new mean: -0.5999999999999998
Hope you enjoy simple things :)
I am creating a phase-space plot of the first derivative of voltage against voltage:
I want to interpolate the plot so that it is smooth. So far, I have approached this by interpolating the voltage and the first derivative of the voltage separately against time, then generating the phase-space plot.
Python Code (toy data example)
import numpy as np
import scipy.interpolate
interp_factor = 100
n = 12
time = np.linspace(0, 10, n)
voltage = np.array([0, 1, 2, 10, 30, 70, 140, 150, 140, 80, 40, 10])
voltage_diff = np.diff(voltage)
voltage = voltage[:-1]
time = time[:-1]
interp_function_voltage = scipy.interpolate.interp1d(time, voltage, kind="cubic")
interp_function_voltage_diff = scipy.interpolate.interp1d(time, voltage_diff, kind="cubic")
new_sample_num = interp_factor * (n - 1) + 1
new_time = np.linspace(np.min(time), np.max(time), new_sample_num)
interp_voltage = interp_function_voltage(new_time)
interp_voltage_diff = interp_function_voltage_diff(new_time)
I would like to ask:
a) is the method as implemented reasonable?
b) is there a better method that interpolates directly in the phase space, e.g. with voltage as x and voltage_diff as y? I do not think this makes sense, because the voltage values are not uniformly spaced and there may be repeated voltage values. I also tried the scipy parametric interpolation methods (e.g. scipy.interpolate.splprep), but these threw an input value error. I expect (it would be nice to have this clarified) that this is because the data is raw, rather than a well-behaved parametric function.
I guess more generally, I am wondering if it makes sense to somehow do the interpolation in the phase-space to make use of the direct relationship between voltage and voltage_diff for interpolating / smoothing.
Many thanks
It is reasonable, but your difference will be biased (np.diff is a forward difference, so it estimates the derivative half a sample ahead of each voltage value); a better approximation for the derivative would be the central difference (v[i+1] - v[i-1])/(2*dt), for example as sketched below.
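A quick sketch of that central difference (my addition, using the untrimmed time and voltage arrays from the question); np.gradient implements exactly this in the interior and falls back to one-sided differences at the two endpoints:
import numpy as np

# Central differences: (v[i+1] - v[i-1]) / (2*dt) for interior points,
# one-sided differences at the two endpoints.
dt = time[1] - time[0]
voltage_diff_central = np.gradient(voltage, dt)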
Another approach is using Fourier transform smoothing
import numpy as np
import matplotlib.pyplot as plt

def smoother_phase_space(y, sps=1, T=1):
    # Upsample by zero-padding in the frequency domain (factor `sps`) and
    # obtain the derivative by multiplying the spectrum by 2*pi*i*f.
    Y = np.fft.rfft(y)
    yu = np.fft.irfft(Y, len(y)*sps).real * sps
    dyu = np.fft.irfft(Y * (2j * np.pi * np.fft.rfftfreq(len(y))), len(y)*sps).real
    # Wrap the first two samples around so the plotted curve closes on itself.
    k = np.arange(len(yu) + 2) % len(yu)
    return yu[k], dyu[k] * sps / T
v, dv = smoother_phase_space(voltage, sps=1)
plt.plot(v, dv, '-ob')
v, dv = smoother_phase_space(voltage, sps=4)
plt.plot(v, dv, '-r')
plt.plot(v[::4], dv[::4], 'or')
v, dv = smoother_phase_space(voltage, sps=32)
plt.plot(v, dv, '-g')
plt.plot(v[::32], dv[::32], 'og')
try:  # the data computed in the original post
    plt.plot(interp_voltage, interp_voltage_diff, '--')
except:
    pass
I have an M x 3 array of 3D coordinates, coords (M ~1000-10000), and I would like to compute the sum of Gaussians centered at these coordinates over a mesh grid 3D array. The mesh grid 3D array is typically something like 64 x 64 x 64, but sometimes upwards of 256 x 256 x 256, and can go even larger. I’ve followed this question to get started, by converting my meshgrid array into an array of N x 3 coordinates, xyz, where N is 64^3 or 256^3, etc. However, for large array sizes it takes too much memory to vectorize the entire calculation (understandable since it could approach 1e11 elements and consume a terabyte of RAM) so I’ve broken it up into a loop over M coordinates. However, this is too slow.
I’m wondering if there is any way to speed this up at all without overloading memory. By converting the meshgrid to xyz, I feel like I’ve lost any advantage of the grid being equally spaced, and that somehow, maybe with scipy.ndimage, I should be able to take advantage of the even spacing to speed things up.
Here’s my initial start:
import numpy as np
from scipy import spatial
#create meshgrid
side = 100.
n = 64 #could be 256 or larger
x_ = np.linspace(-side/2,side/2,n)
x,y,z = np.meshgrid(x_,x_,x_,indexing='ij')
#convert meshgrid to list of coordinates
xyz = np.column_stack((x.ravel(),y.ravel(),z.ravel()))
#create some coordinates
coords = np.random.random(size=(1000,3))*side - side/2
def sumofgauss(coords, xyz, sigma):
    """Simple isotropic gaussian sum at coordinate locations."""
    n = int(round(xyz.shape[0]**(1/3.)))  # get n samples for reshaping to 3D later
    # this version overloads memory
    # dist = spatial.distance.cdist(coords, xyz)
    # dist *= dist
    # values = 1./np.sqrt(2*np.pi*sigma**2) * np.exp(-dist/(2*sigma**2))
    # values = np.sum(values, axis=0)
    # run cdist in a loop over coords to avoid overloading memory
    values = np.zeros((xyz.shape[0]))
    for i in range(coords.shape[0]):
        dist = spatial.distance.cdist(coords[None, i], xyz)
        dist *= dist
        values += 1./np.sqrt(2*np.pi*sigma**2) * np.exp(-dist[0]/(2*sigma**2))
    return values.reshape(n, n, n)
image = sumofgauss(coords,xyz,1.0)
import matplotlib.pyplot as plt
plt.imshow(image[n//2])  # show a slice (integer index required in Python 3)
plt.show()
M = 1000, N = 64 (~5 seconds):
M = 1000, N = 256 (~10 minutes):
Considering that many of your distance calculations will give zero weight after the exponential, you can probably drop a lot of your distances. Doing big chunks of distance calculations while dropping distances that are greater than a threshold is usually faster with a KDTree:
import numpy as np
from scipy.spatial import cKDTree # so we can get a `coo_matrix` output
def gaussgrid(coords, sigma=1, n=64, side=100, eps=None):
    x_ = np.linspace(-side/2, side/2, n)
    x, y, z = np.meshgrid(x_, x_, x_, indexing='ij')
    xyz = np.column_stack((x.ravel(), y.ravel(), z.ravel()))
    if eps is None:
        eps = np.finfo('float64').eps
    thr = -np.log(eps) * 2 * sigma**2
    data_tree = cKDTree(coords)
    discr = 1000  # you can tweak this to get best results on your system
    values = np.empty(n**3)
    for i in range(n**3//discr + 1):
        slc = slice(i * discr, i * discr + discr)
        grid_tree = cKDTree(xyz[slc])
        dists = grid_tree.sparse_distance_matrix(data_tree, thr, output_type='coo_matrix')
        # sparse_distance_matrix returns Euclidean distances, so square them in
        # the exponent to match the Gaussian used in sumofgauss.
        dists.data = 1./np.sqrt(2*np.pi*sigma**2) * np.exp(-dists.data**2/(2*sigma**2))
        values[slc] = dists.sum(1).squeeze()
    return values.reshape(n, n, n)
Now, even if you keep eps = None it'll be a bit faster, as you're still returning only about 10% of your distances, but with eps = 1e-6 or so you should get a big speedup. On my system:
%timeit out = sumofgauss(coords, xyz, 1.0)
1 loop, best of 3: 23.7 s per loop
%timeit out = gaussgrid(coords)
1 loop, best of 3: 2.12 s per loop
%timeit out = gaussgrid(coords, eps = 1e-6)
1 loop, best of 3: 382 ms per loop
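Side note (my addition, not part of the answer above): since the question wonders whether the regular grid spacing could be exploited, e.g. with scipy.ndimage, here is a rough sketch of that idea. It bins the points onto the grid and blurs with a Gaussian kernel; note that gaussian_filter's kernel sums to 1, so the normalization differs from the 1/sqrt(2*pi*sigma**2) factor used in sumofgauss, and each point is snapped to its nearest voxel before blurring, so this trades some accuracy for speed.
import numpy as np
from scipy import ndimage

def gauss_ndimage(coords, sigma=1.0, n=64, side=100.0):
    # Histogram the points onto the regular grid, then apply a Gaussian blur.
    # sigma is converted from physical units to voxel units for gaussian_filter.
    edges = [np.linspace(-side/2, side/2, n + 1)] * 3
    hist, _ = np.histogramdd(coords, bins=edges)
    voxel = side / n
    return ndimage.gaussian_filter(hist, sigma=sigma / voxel, mode='constant')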
I am categorizing a quantitative variable (e.g. price) and I would like to bin it in such a manner that the bins are much more frequent around the mean and sparser away from it.
I have seen that it is possible to cut() in a linear manner and, thanks to numpy.logspace, in a logarithmic manner, but there seems to be nothing built in for binning around the mean, and my own ideas so far haven't worked and seem inefficient.
You can make bins that increase in size linearly:
import numpy as np
def make_progressive_bins(min_x, max_x, mean_x, num_bins=10):
    # Half-width of the symmetric bin range: the larger distance from the mean
    # to either end, so that [min_x, max_x] is fully covered.
    x_rel_lim = max(mean_x - min_x, max_x - mean_x)
    num_bins_half = num_bins // 2
    bins_right = np.arange(0, num_bins_half + 1)
    if num_bins % 2 == 1:
        bins_right = bins_right + 0.5
    bins_right = np.cumsum(bins_right)
    bins = np.concatenate([-bins_right[bins_right > 0][::-1], bins_right])
    bins = bins * (float(x_rel_lim) / bins[-1]) + mean_x
    return bins
And then you can use it like:
import numpy as np
import matplotlib.pyplot as plt
bins = make_progressive_bins(-20, 50, 10, 15)
plt.bar(bins - 0.1, np.ones_like(bins), 0.2)
I made a script that might do what you want to achieve, but I'm not sure how to convert the resulting cut object into a histogram to see whether it does what I want it to do, so please check and tell me if it works :). (A sketch of one way to check is added after the code.)
import numpy as np
import pandas as pd
# Make normally distributed price with mean 50.
df = pd.DataFrame(data=np.random.normal(50, size=1000), columns=['price'])
df.hist(bins=30)
num_bins = 100
# I used a square function to distribute the bins more around 0 and
# less at the outskirts of the range.
shape_func = lambda x: x**2
bin_loc = [shape_func(i) for i in range(num_bins//2)]
mirrored_bin_loc = [-x for x in bin_loc[::-1]]
bin_loc = mirrored_bin_loc + bin_loc[1:]
# Rescale and translate bins
data_mean = df.price.mean()
data_range = df.price.max() - df.price.min()
final_bin_loc = [(x + data_mean) / (data_range * num_bins) for x in bin_loc]
# display(final_bin_loc)
binned = pd.cut(df.price, bin_loc)
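As a quick way to do the check asked about above (my addition; it assumes the df and binned objects created by the code): count how many prices fall into each bin and plot the counts.
import matplotlib.pyplot as plt

# Count how many prices fall into each pd.cut interval, keeping the bins in
# their natural order instead of sorting by frequency.
counts = binned.value_counts(sort=False)
counts.plot(kind='bar')
plt.show()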