Cubic Spline Interpolation of Phase Space Plot - python

I am creating a phase-space plot of the first derivative of voltage against voltage:
I want to interpolate the plot so that it is smooth. So far, I have approached this by interpolating the voltage and the first derivative of the voltage separately against time, then generating the phase-space plot.
Python Code (toy data example)
import numpy as np
import scipy.interpolate
interp_factor = 100
n = 12
time = np.linspace(0, 10, n)
voltage = np.array([0, 1, 2, 10, 30, 70, 140, 150, 140, 80, 40, 10])
voltage_diff = np.diff(voltage)
voltage = voltage[:-1]
time = time[:-1]
interp_function_voltage = scipy.interpolate.interp1d(time, voltage, kind="cubic")
interp_function_voltage_diff = scipy.interpolate.interp1d(time, voltage_diff, kind="cubic")
new_sample_num = interp_factor * (n - 1) + 1
new_time = np.linspace(np.min(time), np.max(time), new_sample_num)
interp_voltage = interp_function_voltage(new_time)
interp_voltage_diff = interp_function_voltage_diff(new_time)
I would like to ask:
a) Is the method as implemented reasonable?
b) Is there a better method that interpolates directly in the phase space, e.g. with voltage as x and voltage_diff as y? I do not think this makes sense, because the voltage values are not uniformly spaced and there may be repeated voltage values. I also tried the scipy parametric interpolation methods (e.g. scipy.interpolate.splprep), but these threw an input value error. I expect this is because the input is raw data rather than a well-behaved parametric curve (it would be nice to have this clarified).
More generally, I am wondering whether it makes sense to do the interpolation in the phase space itself, to make use of the direct relationship between voltage and voltage_diff for interpolation / smoothing.
Many thanks

It is reasonable, but your difference will be biased: np.diff estimates the slope between samples rather than at the samples themselves. A better approximation for the derivative might be the central difference (v[i+1] - v[i-1]) / (2*dt).
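A minimal sketch of that idea, using np.gradient on the arrays from the question (np.gradient applies central differences in the interior and one-sided differences at the two ends, so the result has the same length as voltage and nothing needs to be trimmed):
import numpy as np

time = np.linspace(0, 10, 12)
voltage = np.array([0, 1, 2, 10, 30, 70, 140, 150, 140, 80, 40, 10], dtype=float)

dt = time[1] - time[0]
# Central differences in the interior, one-sided differences at the ends.
dvoltage_dt = np.gradient(voltage, dt)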
Another approach is Fourier-transform smoothing: resample the signal by zero-padding its spectrum, and obtain the derivative by multiplying the spectrum by 2j*pi*f.
import numpy as np
import matplotlib.pyplot as plt

def smoother_phase_space(y, sps=1, T=1):
    Y = np.fft.rfft(y)
    yu = np.fft.irfft(Y, len(y) * sps).real * sps
    dyu = np.fft.irfft(Y * (2j * np.pi * np.fft.rfftfreq(len(y))), len(y) * sps).real
    k = np.arange(len(yu) + 2) % len(yu)  # wrap the first points around to close the phase-space loop
    return yu[k], dyu[k] * sps / T
v, dv = smoother_phase_space(voltage, sps=1)
plt.plot(v, dv, '-ob')

v, dv = smoother_phase_space(voltage, sps=4)
plt.plot(v, dv, '-r')
plt.plot(v[::4], dv[::4], 'or')

v, dv = smoother_phase_space(voltage, sps=32)
plt.plot(v, dv, '-g')
plt.plot(v[::32], dv[::32], 'og')

try:  # overlay the data computed in the original post, if it is defined
    plt.plot(interp_voltage, interp_voltage_diff, '--')
except NameError:
    pass

Related

Curve_Fit not accurate

I am trying to fit strongly fluctuating data over time as well as possible. First I smoothed the data, which works fine. The smoothed data should then be represented by a fit in order to bring out the peaks. As you can see in the code, I want to use a log-tanh function to fit the data. I am well aware that this problem has come up in other threads already, but I have tried those suggestions, and the data is not extremely small or large, which I know can also cause problems.
The polynomial fit I tried also works pretty well, as you can see, but it does not eliminate all of the wavy values. These cause problems for the subsequent derivative, which comes out very badly.
import tkinter as tk
from tkinter import filedialog
import numpy as np
import scipy.signal
from scipy.optimize import curve_fit
from numpy import diff
import matplotlib.pyplot as plt
from lmfit.models import StepModel, LinearModel
def loghypfunc(x, A, B, C, D, E):
    return A*np.log(1+x) + B*np.tanh(C*x) + D*x + E

def expfunc(t, c0, c1, c2, c3):
    return c0 + c1*t - c2*np.exp(-c3*t)

def expdecay(x, a, b, c):
    return a * np.exp(-b * x) + c
path="C:/Users/Sammy/Documents/Masterarbeit WT/CSM und Kriechdaten/Kriechen/Creep_10mN_00008_LC_20210406_2121_DYN.txt"
dataFile = np.loadtxt(path, delimiter='\t', skiprows=2, usecols=(0, 1, 2, 3, 29, 30), dtype=float)
num_rows, num_cols = dataFile.shape
# time column
time = dataFile[:, [0]].transpose()
time = time.flatten()
refTime = time[0] # get first time in column (reference)
# zeroed test time (relative to reference)
timeNull = time - refTime
print("time", time)
flatTimeNull = timeNull.flatten() # now a 1D array (one row)
##################################################################################
# indent displacement column
indentDis = dataFile[:, [4]].transpose()
indentDis = indentDis.flatten()
indentDis = indentDis - indentDis[0]
# the indent data has to be smoothed so there is not such a big fluctuation
indentSmooth = scipy.signal.savgol_filter(indentDis, 2001, 3)
# null the indent Smooth data
indentSmooth_Null = indentSmooth - indentSmooth[0]
hind_Smooth_flat = indentSmooth_Null.flatten() # now a 1D array
print('indent smooth', indentSmooth)
######################################################################
p0 = [100, 0.1, 100, 0.1]
c, cov = curve_fit(expfunc, time, indentSmooth, p0)
y_indent = expfunc(indentSmooth, *c)
p0 = [70, 0.5, 50, 0.1, 100]
popt, pcov = curve_fit(loghypfunc, time, indentSmooth, p0, maxfev = 10000)
y_indentTan = loghypfunc(indentSmooth, *popt)
modelh_t = np.poly1d(np.polyfit(time, indentSmooth, 8))
plt.plot(time, indentSmooth, 'r', label="Data smoothed")
plt.scatter(time, modelh_t(time), s=0.1, label="Polyfit")
plt.plot(time, y_indentTan, label="Curve fit Tangens function")
plt.plot(time, y_indent, label="Curve fit exp function")
plt.legend(loc="lower right")
plt.xlabel("time")
plt.ylabel("indent")
plt.show()
These are the two arrays I get the data from:
time [ 6.299596 6.349592 6.399589 ... 608.0109 608.060897 608.110894]
indent smooth [120.81411822 121.07093706 121.32748184 ... 476.78825661 476.89357473 476.99915287]
Here are the plots:
Plots
The question for me now is how to fix it. Is it because of badly optimized fit parameters? I would have guessed that Python handles that automatically well enough.
My second guess was that the data is packed too densely along the time axis, as the array is about 12000 values long. Could this be a reason?
I would be very grateful for any kind of advice regarding the fits.
Regards
Hndrx

Interpolate values and replace with NaNs within a long gap?

I am trying to interpolate data with gaps. Sometimes the gap can be very large, and I do not want the interpolation to "succeed" within the gap; the result should be NaNs inside a large gap. For example, consider this example data set:
orig_x = [26219, 26225, 26232, 28521, 28538]
orig_y = [39, 40, 41, 72, 71]
which has a clear gap between x-values 26232 and 28521. Now, I would like to have the orig_y interpolated to x-values like this:
import numpy as np
x_target = np.array(range(min(orig_x) // 10 * 10 + 10, max(orig_x) // 10 * 10 + 10, 10))
# array([26220, 26230, 26240, 26250, 26260, 26270, 26280, 26290,
# ...
# 28460, 28470, 28480, 28490, 28500, 28510, 28520, 28530])
and the output y_target should be np.nan everywhere else than at 26220, 26230 and 28520. Let's say the condition would be that if there is a gap larger than 40 in the data, the interpolation should result in np.nan inside that gap.
Goal shown as a picture
Instead of this
Get something like this
i.e. the "gap" in the data should result in np.nan instead of garbage data.
Question
What would be the best way (fastest interpolation) to achieve this kind of interpolation? The interpolation can be linear or more sophisticated (e.g. cubic spline). One possibility I have in mind would be to use the scipy.interpolate.interp1d as starting point like this
from scipy.interpolate import interp1d
f = interp1d(orig_x, orig_y, bounds_error=False)
y_target = f(x_target)
and then search for gaps in the data and replace the interpolated data with np.nan inside the gaps. Since I will be using this on a fairly large dataset (~10M rows, a few columns, handled in parts), performance is key.
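A minimal sketch of that interp1d-plus-masking idea (np.searchsorted is used here to find, for each target point, the original interval it falls into; this is just the starting-point approach described above, not the final implementation):
import numpy as np
from scipy.interpolate import interp1d

orig_x = np.array([26219., 26225., 26232., 28521., 28538.])
orig_y = np.array([39., 40., 41., 72., 71.])
x_target = np.arange(26220., 28540., 10.)
max_gap = 40

f = interp1d(orig_x, orig_y, bounds_error=False)  # NaN outside the data range
y_target = f(x_target)

# For each target point, find the surrounding original interval and blank
# the result wherever that interval is wider than max_gap.
idx = np.clip(np.searchsorted(orig_x, x_target, side='right'), 1, len(orig_x) - 1)
gap = orig_x[idx] - orig_x[idx - 1]
y_target[gap > max_gap] = np.nan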
After some trial and error, I think I got a "fast enough" implementation using basic linear interpolation and numba for speedups. Forgive me for writing everything in the same loop and the same function, but that seems to be the numba way of making code fast (numba loves loops, and does not seem to accept nested functions).
Test data used
I added some more data to x_target to test the algorithm's performance.
orig_x = np.array([26219, 26225, 26232, 28521, 28538])
orig_y = np.array([39, 40, 41, 72, 71])
x_target = np.array(
    np.arange(min(orig_x) // 10 * 10,
              max(orig_x) // 10 * 10 + 10, 0.1))
Test code
from matplotlib import pyplot as plt
y_target = interpolate_with_max_gap(orig_x, orig_y, x_target, max_gap=40)
plt.scatter(x_target, y_target, label='interpolated', s=10)
plt.scatter(orig_x, orig_y, label='orig', s=10)
plt.legend()
plt.show()
Test results
The data is interpolated in regions with gap less than max_gap (40):
closeup:
Speed:
I first tried a pure python + numpy implementation, which took 49.6 ms with the same test data (using timeit). This implementation with numba takes 480µs (100x speedup!). When using target_x_is_sorted=True, the speed is 80.1µs!
Using orig_x_is_sorted=True did not give a speedup, probably because orig_x is so short that sorting it does not make any measurable difference in this example.
Implementation
import numba
import numpy as np
@numba.njit()
def interpolate_with_max_gap(orig_x,
                             orig_y,
                             target_x,
                             max_gap=np.inf,
                             orig_x_is_sorted=False,
                             target_x_is_sorted=False):
    """
    Interpolate data linearly with maximum gap. If there is
    a larger gap in the data than `max_gap`, the gap will be
    filled with np.nan.

    The input values should not contain NaNs.

    Parameters
    ----------
    orig_x: np.array
        The input x-data.
    orig_y: np.array
        The input y-data.
    target_x: np.array
        The output x-data; the data points on the x-axis that
        you want the interpolation results for.
    max_gap: float
        The maximum allowable gap in `orig_x` inside which
        interpolation is still performed. Gaps larger than
        this will be filled with np.nan in the output `target_y`.
    orig_x_is_sorted: boolean, default: False
        If True, the input data `orig_x` is assumed to be monotonically
        increasing. Some performance gain if you supply sorted input data.
    target_x_is_sorted: boolean, default: False
        If True, the input data `target_x` is assumed to be
        monotonically increasing. Some performance gain if you supply
        sorted input data.

    Returns
    -------
    target_y: np.array
        The interpolation results.
    """
    if not orig_x_is_sorted:
        # Sort to be monotonic wrt. the input x-variable.
        idx = orig_x.argsort()
        orig_x = orig_x[idx]
        orig_y = orig_y[idx]

    if not target_x_is_sorted:
        target_idx = target_x.argsort()
        # Needed for sorting the data back afterwards.
        target_idx_for_reverse = target_idx.argsort()
        target_x = target_x[target_idx]

    target_y = np.empty(target_x.size)
    idx_orig = 0
    orig_gone_through = False

    for idx_target, x_new in enumerate(target_x):

        # Grow idx_orig if needed.
        while not orig_gone_through:
            if idx_orig + 1 >= len(orig_x):
                # Already consumed orig_x; no more data,
                # so we would need to extrapolate.
                orig_gone_through = True
            elif x_new > orig_x[idx_orig + 1]:
                idx_orig += 1
            else:
                # x_new <= x2
                break

        if orig_gone_through:
            target_y[idx_target] = np.nan
            continue

        x1 = orig_x[idx_orig]
        y1 = orig_y[idx_orig]
        x2 = orig_x[idx_orig + 1]
        y2 = orig_y[idx_orig + 1]

        if x_new < x1:
            # Would need to extrapolate to the left.
            target_y[idx_target] = np.nan
            continue

        delta_x = x2 - x1

        if delta_x > max_gap:
            target_y[idx_target] = np.nan
            continue

        delta_y = y2 - y1

        if delta_x == 0:
            target_y[idx_target] = np.nan
            continue

        k = delta_y / delta_x

        delta_x_new = x_new - x1
        delta_y_new = k * delta_x_new
        y_new = y1 + delta_y_new

        target_y[idx_target] = y_new

    if not target_x_is_sorted:
        return target_y[target_idx_for_reverse]
    return target_y

Strange result from Fast Fourier transform signal reconstruction

I have some data which is shown in the below figure and am interested in finding some of its Fourier series coefficients.
import numpy as np
import matplotlib.pyplot as plt

r = np.array([119.80601628, 119.84629291, 119.85290735, 119.45778804,
115.64497439, 105.58519852, 100.72765819, 100.04327702,
100.08590518, 100.35824977, 101.58424993, 105.47976376,
112.27556007, 117.07679226, 118.99998888, 119.60458086,
119.78624424, 119.83022022, 119.36116943, 115.72323767,
106.58946834, 101.19479124, 100.11537349, 100.13313755,
100.41846106, 101.42255377, 104.33650237, 109.73625492,
115.14763728, 118.24665037, 119.35359999, 119.68061835])
z = np.array([-411.42980545, -384.98596279, -358.13032372, -330.89578468,
-303.39129113, -275.76248957, -248.24478443, -221.07069838,
-194.33260984, -168.05271807, -142.19357982, -116.62090103,
-91.15354178, -65.56745626, -39.65284757, -13.29632162,
13.54374939, 40.84929432, 68.50496394, 96.33720787,
124.08525182, 151.36802193, 177.98791952, 204.0805317 ,
229.85399128, 255.44727674, 281.02166554, 306.75399703,
332.74638285, 359.05528646, 385.74336711, 412.8189858 ])
plt.plot(z, r, label='data')
plt.legend()
Then I calculate the average sampling period, since it is not constant as seen in the Z variable:
l = []
for i in range(32-1):
    l.append(z[i]-z[i+1])
Ts = np.mean(l)
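(As an aside, the same average spacing can be computed with np.diff; note that z is increasing, so the z[i] - z[i+1] terms in the loop above are negative, whereas np.diff(z) is positive.)
Ts = np.mean(np.diff(z))  # positive mean spacing; the loop above gives the negative of this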
Then I calculate the fft:
from scipy.fftpack import fft
rf = fft(r)
For reconstruction of the signal then:
fs = 1/Ts
amp = np.abs(rf)/r.shape[0]
n = r.shape[0]
s = 0
for i in range(n//2):
    phi = np.angle(rf[i], deg=False)
    a = amp[i]
    k = i*fs/n
    s += a*np.cos(2*np.pi*k*z + phi)
plt.plot(z, s, label='fft result')
plt.plot(z, r, label='data')
plt.legend()
The result, however, is strange both in terms of amplitude and frequency.
The complex spectrum of a real signal is symmetric and covers the range (-fMax/2, ..., +fMax/2).
You only used the right-hand, positive part of the spectrum. This means your reconstructed signal contains only half of the spectrum's frequencies.
Because the spectrum is symmetric, all you have to do is double the calculated amplitudes. There is one important exception, however: the DC value amplitude[0] must not be doubled.
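A minimal sketch of that correction applied to the reconstruction loop from the question; it only changes the amplitude handling (DC bin kept as-is, every other bin doubled) and leaves the rest of the original reconstruction untouched:
s = np.zeros_like(z)
for i in range(n//2):
    phi = np.angle(rf[i], deg=False)
    a = amp[i]
    if i > 0:  # double every bin except the DC value amplitude[0]
        a = 2 * a
    k = i*fs/n
    s += a*np.cos(2*np.pi*k*z + phi)

plt.plot(z, s, label='corrected fft result')
plt.plot(z, r, label='data')
plt.legend()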

Pandas dataframe column cut - add more bins more frequently around the mean

I am categorizing a quantitative variable (e.g. price) and I would like to categorize it in such a way that the bins are much more frequent around the mean and sparser away from it.
I have seen that cut() can bin linearly and, thanks to numpy.logspace, logarithmically, but there seems to be nothing built in for binning around the mean, and my ideas so far have not worked and seem inefficient.
You can make bins that increase in size linearly:
import numpy as np

def make_progressive_bins(min_x, max_x, mean_x, num_bins=10):
    x_rel_lim = max(mean_x - min_x, mean_x - max_x)
    num_bins_half = num_bins // 2
    bins_right = np.arange(0, num_bins_half + 1)
    if num_bins % 2 == 1:
        bins_right = bins_right + 0.5
    bins_right = np.cumsum(bins_right)
    bins = np.concatenate([-bins_right[bins_right > 0][::-1], bins_right])
    bins = bins * (float(x_rel_lim) / bins[-1]) + mean_x
    return bins
And then you can use it like:
import numpy as np
import matplotlib.pyplot as plt
bins = make_progressive_bins(-20, 50, 10, 15)
plt.bar(bins - 0.1, np.ones_like(bins), 0.2)
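To assign data to those bins you could then pass the edges to pd.cut; a minimal sketch reusing the bins computed above, with a hypothetical price series (values outside the outermost edges come back as NaN):
import pandas as pd

prices = pd.Series(np.random.uniform(-20, 40, size=1000))  # hypothetical data inside the bin range
binned = pd.cut(prices, bins=bins)
print(binned.value_counts().sort_index())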
I made a script that might do what you want to achieve, but I'm not sure how to convert the resulting cut object into a histogram to see whether it does what I want it to do, so please check and tell me if it works :).
import numpy as np
import pandas as pd

# Make normally distributed price with mean 50.
df = pd.DataFrame(data=np.random.normal(50, size=1000), columns=['price'])
df.hist(bins=30)
num_bins = 100
# I used a square function to distribute the bins more around 0 and
# less at the outskirts of the range.
shape_func = lambda x: x**2
bin_loc = [shape_func(i) for i in range(num_bins//2)]
mirrored_bin_loc = [-x for x in bin_loc[::-1]]
bin_loc = mirrored_bin_loc + bin_loc[1:]
# Rescale and translate bins
data_mean = df.price.mean()
data_range = df.price.max() - df.price.min()
final_bin_loc = [(x + data_mean) / (data_range * num_bins) for x in bin_loc]
# display(final_bin_loc)
binned = pd.cut(df.price, bin_loc)
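One way to check the result, as wondered above, is to count how many prices land in each bin and plot the counts as a bar chart; a minimal sketch that only inspects the binned object produced above:
import matplotlib.pyplot as plt

counts = binned.value_counts().sort_index()
counts.plot(kind='bar')
plt.xlabel('price bin')
plt.ylabel('count')
plt.show()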

Fit the gamma distribution only to a subset of the samples

I have the histogram of my input data (in black) given in the following graph:
I'm trying to fit the Gamma distribution not to the whole data set, but just to the first curve of the histogram (the first mode). The green plot in the graph above corresponds to fitting the Gamma distribution to all the samples, using the following Python code which makes use of scipy.stats.gamma:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gamma
from geotiff.io import IO

img = IO.read(input_file)
data = img.flatten() + abs(np.min(img)) + 1
# calculate dB positive image
img_db = 10 * np.log10(img)
img_db_pos = img_db + abs(np.min(img_db))
data = img_db_pos.flatten() + 1
# data histogram
n, bins, patches = plt.hist(data, 1000, normed=True)
# slice histogram here
# estimation of the parameters of the gamma distribution
fit_alpha, fit_loc, fit_beta = gamma.fit(data, floc=0)
x = np.linspace(0, 100)
y = gamma.pdf(x, fit_alpha, fit_loc, fit_beta)
print('(alpha, beta): (%f, %f)' % (fit_alpha, fit_beta))
# plot estimated model
plt.plot(x, y, linewidth=2, color='g')
plt.show()
How can I restrict the fitting only to the interesting subset of this data?
Update1 (slicing):
I sliced the input data by keeping only values below the max of the previous histogram, but the results were not really convincing:
This was achieved by inserting the following code below the # slice histogram here comment in the previous code:
max_data = bins[np.argmax(n)]
data = data[data < max_data]
Update2 (scipy.optimize.minimize):
The code below shows how scipy.optimize.minimize() is used to minimize an energy function to find (alpha, beta):
import matplotlib.pyplot as plt
import numpy as np
from geotiff.io import IO
from scipy.stats import gamma
from scipy.optimize import minimize
def truncated_gamma(x, max_data, alpha, beta):
    gammapdf = gamma.pdf(x, alpha, loc=0, scale=beta)
    norm = gamma.cdf(max_data, alpha, loc=0, scale=beta)
    return np.where(x < max_data, gammapdf / norm, 0)
# read image
img = IO.read(input_file)
# calculate dB positive image
img_db = 10 * np.log10(img)
img_db_pos = img_db + abs(np.min(img_db))
data = img_db_pos.flatten() + 1
# data histogram
n, bins = np.histogram(data, 100, normed=True)
# using minimize on a slice data below max of histogram
max_data = bins[np.argmax(n)]
data = data[data < max_data]
data = np.random.choice(data, 1000)
energy = lambda p: -np.sum(np.log(truncated_gamma(data, max_data, *p)))
initial_guess = [np.mean(data), 2.]
o = minimize(energy, initial_guess, method='SLSQP')
fit_alpha, fit_beta = o.x
# plot data histogram and model
x = np.linspace(0, 100)
y = gamma.pdf(x, fit_alpha, 0, fit_beta)
plt.hist(data, 30, normed=True)
plt.plot(x, y, linewidth=2, color='g')
plt.show()
The algorithm above converged for a subset of data, and the output in o was:
x: array([ 16.66912781, 6.88105559])
But as can be seen on the screenshot below, the gamma plot doesn't fit the histogram:
You can use a general optimization tool such as scipy.optimize.minimize to fit a truncated version of the desired function, resulting in a nice fit:
First, the modified function:
def truncated_gamma(x, alpha, beta):
    gammapdf = gamma.pdf(x, alpha, loc=0, scale=beta)
    norm = gamma.cdf(max_data, alpha, loc=0, scale=beta)
    return np.where(x < max_data, gammapdf / norm, 0)
This selects values from the gamma distribution where x < max_data, and zero elsewhere. The np.where part is not actually important here, because the data is exclusively to the left of max_data anyway. The key is normalization, because varying alpha and beta will change the area to the left of the truncation point in the original gamma.
The rest is just optimization technicalities.
It's common practice to work with logarithms, so I used what's sometimes called "energy", or the logarithm of the inverse of the probability density.
energy = lambda p: -np.sum(np.log(truncated_gamma(data, *p)))
Minimize:
initial_guess = [np.mean(data), 2.]
o = minimize(energy, initial_guess, method='SLSQP')
fit_alpha, fit_beta = o.x
My output is (alpha, beta): (11.595208, 824.712481). Like the original, it is a maximum likelihood estimate.
If you're not happy with the convergence rate, you may want to
Select a sample from your rather big dataset:
data = np.random.choice(data, 10000)
Try different algorithms using the method keyword argument.
Some optimization routines output a representation of the inverse Hessian, which is useful for uncertainty estimation. Enforcing nonnegativity of the parameters may also be a good idea.
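For example, a minimal sketch of enforcing nonnegativity via the bounds argument (accepted by SLSQP and L-BFGS-B, among others), reusing the energy function and initial_guess defined above:
# Keep alpha and beta strictly positive during the optimization.
bounds = [(1e-6, None), (1e-6, None)]
o = minimize(energy, initial_guess, method='SLSQP', bounds=bounds)
fit_alpha, fit_beta = o.x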
A log-scaled plot without truncation shows the entire distribution:
Here's another possible approach, using a dataset created manually in Excel that more or less matches the plot given.
Raw Data
Outline
Imported data into a Pandas dataframe.
Mask the indices after the max response index.
Create a mirror image of the remaining data.
Append the mirror image while leaving a buffer of empty space.
Fit the desired distribution to the modified data. Below I do a normal fit by the method of moments and adjust the amplitude and width.
Working Script
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import data to dataframe.
df = pd.read_csv('sample.csv', header=0, index_col=0)
# Mask indices after index at max Y.
mask = df.index.values <= df.Y.argmax()
df = df.loc[mask, :]
scaled_y = 100*df.Y.values
# Create new df with mirror image of Y appended.
sep = 6
app_zeroes = np.append(scaled_y, np.zeros(sep, dtype=float))
mir_y = np.flipud(scaled_y)
new_y = np.append(app_zeroes, mir_y)
# Using Scipy-cookbook to fit a normal by method of moments.
idxs = np.arange(new_y.size)  # idxs = [0, 1, 2, ..., len(data)-1]
mid_idxs = idxs.mean()  # (len(data)-1)/2
# idxs - mid_idxs is [-53.5, -52.5, ..., 52.5, 53.5]
scaling_param = np.sqrt(np.abs(np.sum((idxs-mid_idxs)**2*new_y)/np.sum(new_y)))
# adjust amplitude
fmax = new_y.max()*1.2 # adjusted function max to 120% max y.
# adjust width
scaling_param = scaling_param*.7 # adjusted by 70%.
# Fit normal.
fit = lambda t: fmax*np.exp(-(t-mid_idxs)**2/(2*scaling_param**2))
# Plot results.
plt.plot(new_y, '.')
plt.plot(fit(idxs), '--')
plt.show()
Result
See the scipy-cookbook fitting data page for more on fitting a normal using method of moments.
