I want to get a kernel density estimate for positive data points. Using the SciPy stats package in Python, I came up with the following code.
def get_pdf(data):
    a = np.array(data)
    ag = st.gaussian_kde(a)
    x = np.linspace(0, max(data), max(data))
    y = ag(x)
    return x, y
This works perfectly for most data sets, but it gives an erroneous result for "all positive" data points. To make sure this works correctly, I use numerical integration to compute the area under this curve.
def trapezoidal_2(ag, a, b, n):
    h = float(b - a) / n
    s = 0.0
    s += ag(a)[0] / 2.0
    for i in range(1, n):
        s += ag(a + i*h)[0]
    s += ag(b)[0] / 2.0
    return s * h
Since the data is spread in the region (0, int(max(data))), we should get a value close to 1 when executing the following lines.
b = 1
data = st.pareto.rvs(b, size=10000)
data = list(data)
a = np.array(data)
ag = st.gaussian_kde(a)
trapezoidal_2(ag, 0, int(max(data)), int(max(data))*2)
But it gives a value close to 0.5 when I test it. However, when I integrate from -100 to max(data), it gives a value close to 1.
trapezoidal_2(ag, -100, int(max(data)), int(max(data))*2+200)
The reason is that ag (the KDE) assigns density to values less than 0, even though the original data set contains only positive values.
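A quick check confirms this (a minimal sketch reusing the ag object fitted above): the KDE returns non-zero densities at negative points, and integrating over (-inf, 0) shows how much probability mass falls below zero.

import numpy as np

# Assumes `ag` was fit on the all-positive Pareto sample above.
print(ag(np.array([-5.0, -1.0, -0.1])))   # small but non-zero densities below zero
print(ag.integrate_box_1d(-np.inf, 0))    # probability mass assigned to negative values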
So how can I get a kernel density estimate that considers only positive values, such that the area under the curve in the region (0, max(data)) is close to 1?
The choice of the bandwidth is quite important when performing kernel density estimation. I think Scott's Rule and Silverman's Rule work well for distributions similar to a Gaussian. However, they do not work well for the Pareto distribution.
Quote from the doc:
Bandwidth selection strongly influences the estimate obtained from
the KDE (much more so than the actual shape of the kernel). Bandwidth selection
can be done by a "rule of thumb", by cross-validation, by "plug-in
methods" or by other means; see [3], [4] for reviews. gaussian_kde
uses a rule of thumb, the default is Scott's Rule.
Try with different bandwidth values, for example:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
b = 1
sample = stats.pareto.rvs(b, size=3000)
kde_sample_scott = stats.gaussian_kde(sample, bw_method='scott')
kde_sample_scalar = stats.gaussian_kde(sample, bw_method=1e-3)
# Compute the integral:
print('integral scott:', kde_sample_scott.integrate_box_1d(0, np.inf))
print('integral scalar:', kde_sample_scalar.integrate_box_1d(0, np.inf))
# Graph:
x_span = np.logspace(-2, 1, 550)
plt.plot(x_span, stats.pareto.pdf(x_span, b), label='theoretical pdf')
plt.plot(x_span, kde_sample_scott(x_span), label="estimated pdf 'scott'")
plt.plot(x_span, kde_sample_scalar(x_span), label="estimated pdf 'scalar'")
plt.xlabel('X'); plt.legend();
gives:
integral scott: 0.5572130540733236
integral scalar: 0.9999999999968957
and the corresponding plot shows that the KDE obtained with Scott's rule is clearly wrong for this data.
I'm wondering if there's a way I can find the range of local maxima of a histogram. For instance, suppose I have the following histogram (just ignore the orange curve):
The histogram is actually obtained from a dictionary. I'm hoping to find the range of local maxima of this histogram (on the horizontal axis), which are, say, 1.3-1.6, and 2.1-2.4 in this case. I have no idea which tools would be helpful or which techniques I may want to use. I know there's a tool to find local maxima of a 1-D array:
from scipy.signal import argrelextrema
x = np.random.random(12)
argrelextrema(x, np.greater)
but I don't think it would work here since I'm looking for a range, and there're some 'wiggles' on the histogram. Can anyone give me some suggestions/examples about how I can obtain the range I'm looking for? Thanks a lot for the help
PS: I am trying to avoid simply searching for the ranges of x whose y values are above a certain limit :)
I don't know if I correctly understand what you want to do, but you can treat the histogram as a Probability Density Function (PDF) of a bimodal distribution, then find the modes and the Highest Density Intervals (HDIs) around the two modes.
So, I create some sample data
import numpy as np
import pandas as pd
import scipy.stats as sps
from scipy.signal import find_peaks, argrelextrema
import matplotlib.pyplot as plt
d1 = sps.norm(loc=1.3, scale=.2)
d2 = sps.norm(loc=2.2, scale=.3)
r1 = d1.rvs(size=5000, random_state=1)
r2 = d2.rvs(size=5000, random_state=1)
r = np.concatenate((r1, r2))
h = plt.hist(r, bins=100, density=True);
We now have h, the result of the hist function, which contains the bin densities (100 values) and the bin edges (101 values).
print(h[0].size)
100
print(h[1].size)
101
So we first need to compute the center of each bin
density = h[0]
values = h[1][:-1] + np.diff(h[1])[0] / 2
plt.hist(r, bins=100, density=True, alpha=.25)
plt.plot(values, density);
Now we can normalize the PDF (so that it sums to 1) and smooth the data with a moving average, which we'll use only to find the peaks (maxima) and minima
norm_density = density / density.sum()
norm_density_ma = pd.Series(norm_density).rolling(7, center=True).mean().values
plt.plot(values, norm_density_ma)
plt.plot(values, norm_density);
Now we can obtain indexes of maxima
peaks = find_peaks(norm_density_ma)[0]
peaks
array([24, 57])
and minima
minima = argrelextrema(norm_density_ma, np.less)[0]
minima
array([40])
and check they're correct
plt.plot(values, norm_density_ma)
plt.plot(values, norm_density)
for peak in peaks:
    plt.axvline(values[peak], color='r')
plt.axvline(values[minima], color='k', ls='--');
Finally, we have to find the HDIs around the two modes (peaks) from the normalized histogram data. We can use a simple function to get the HDI of a grid (see HDI_of_grid below and Doing Bayesian Data Analysis by John K. Kruschke)
def HDI_of_grid(probMassVec, credMass=0.95):
    sortedProbMass = np.sort(probMassVec, axis=None)[::-1]
    HDIheightIdx = np.min(np.where(np.cumsum(sortedProbMass) >= credMass))
    HDIheight = sortedProbMass[HDIheightIdx]
    HDImass = np.sum(probMassVec[probMassVec >= HDIheight])
    idx = np.where(probMassVec >= HDIheight)[0]
    return {'indexes': idx, 'mass': HDImass, 'height': HDIheight}
Let's say we want the HDIs to contain a mass of 0.3
# HDI around the 1st mode
hdi1 = HDI_of_grid(norm_density, credMass=.3)
plt.plot(values, norm_density_ma)
plt.plot(values, norm_density)
plt.fill_between(
    values[hdi1['indexes']],
    0, norm_density[hdi1['indexes']],
    alpha=.25
)
for peak in peaks:
    plt.axvline(values[peak], color='r')
For the 2nd mode, we'll compute the HDI starting from the minimum, to avoid including the 1st mode
# HDI around the 2nd mode
hdi2 = HDI_of_grid(norm_density[minima[0]:], credMass=.3)
plt.plot(values, norm_density_ma)
plt.plot(values, norm_density)
plt.fill_between(
    values[hdi1['indexes']],
    0, norm_density[hdi1['indexes']],
    alpha=.25
)
plt.fill_between(
    values[hdi2['indexes']+minima],
    0, norm_density[hdi2['indexes']+minima],
    alpha=.25
)
for peak in peaks:
    plt.axvline(values[peak], color='r')
And we have the values of the two HDIs
# 1st mode
values[peaks[0]]
1.320249129265321
# 0.3 HDI
values[hdi1['indexes']].take([0, -1])
array([1.12857599, 1.45715851])
# 2nd mode
values[peaks[1]]
2.2238510564735363
# 0.3 HDI
values[hdi2['indexes']+minima].take([0, -1])
array([1.95003229, 2.47028795])
I am building a neural network that makes use of t-distribution noise. I am using np.random.standard_t from the NumPy library and tf.distributions.StudentT from TensorFlow. The documentation for the first function is here and for the second function is here. I am using the said functions like below:
a = np.random.standard_t(df=3, size=10000) # numpy's function
t_dist = tf.distributions.StudentT(df=3.0, loc=0.0, scale=1.0)
sess = tf.Session()
b = sess.run(t_dist.sample(10000))
In the documentation provided for the Tensorflow implementation, there's a parameter called scale whose description reads
The scaling factor(s) for the distribution(s). Note that scale is not technically the standard deviation of this distribution but has semantics more similar to standard deviation than variance.
I have set scale to be 1.0 but I have no way of knowing for sure if these refer to the same distribution.
Can someone help me verify this? Thanks
I would say they are, as their sampling is defined in almost the exact same way in both cases. This is how the sampling of tf.distributions.StudentT is defined:
def _sample_n(self, n, seed=None):
    # The sampling method comes from the fact that if:
    # X ~ Normal(0, 1)
    # Z ~ Chi2(df)
    # Y = X / sqrt(Z / df)
    # then:
    # Y ~ StudentT(df).
    seed = seed_stream.SeedStream(seed, "student_t")
    shape = tf.concat([[n], self.batch_shape_tensor()], 0)
    normal_sample = tf.random.normal(shape, dtype=self.dtype, seed=seed())
    df = self.df * tf.ones(self.batch_shape_tensor(), dtype=self.dtype)
    gamma_sample = tf.random.gamma([n],
                                   0.5 * df,
                                   beta=0.5,
                                   dtype=self.dtype,
                                   seed=seed())
    samples = normal_sample * tf.math.rsqrt(gamma_sample / df)
    return samples * self.scale + self.loc  # Abs(scale) not wanted.
So it is a standard normal sample divided by the square root of a chi-square sample with parameter df divided by df. The chi-square sample is taken as a gamma sample with parameter 0.5 * df and rate 0.5, which is equivalent (chi-square is a special case of gamma). The scale value, like the loc, only comes into play in the last line, as a way to "relocate" the distribution sample at some point and scale. When scale is one and loc is zero, they do nothing.
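To make that concrete, here is a small NumPy sketch (my own illustration, not the library code) that follows the same recipe: a standard normal sample divided by the square root of a Gamma(0.5 * df, rate 0.5) sample divided by df, with loc and scale only applied at the very end.

import numpy as np

rng = np.random.default_rng(0)

def student_t_sample(df, loc=0.0, scale=1.0, size=10000):
    # X ~ Normal(0, 1); Z ~ Gamma(df / 2, rate 1/2) == Chi2(df); Y = X / sqrt(Z / df)
    normal_sample = rng.standard_normal(size)
    gamma_sample = rng.gamma(shape=0.5 * df, scale=2.0, size=size)  # scale = 1 / rate
    samples = normal_sample / np.sqrt(gamma_sample / df)
    return samples * scale + loc  # loc and scale only enter here

print(student_t_sample(df=3.0).std())  # roughly sqrt(3) for df=3, even with scale=1.0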
Here is the implementation for np.random.standard_t:
double legacy_standard_t(aug_bitgen_t *aug_state, double df) {
  double num, denom;

  num = legacy_gauss(aug_state);
  denom = legacy_standard_gamma(aug_state, df / 2);
  return sqrt(df / 2) * num / sqrt(denom);
}
So it is essentially the same thing, slightly rephrased. Here we also have a gamma with shape df / 2, but it is standard (rate one). However, the missing 0.5 now shows up in the numerator as the / 2 inside the sqrt. So it is just moving the numbers around. Here there is no scale or loc, though.
In truth, the difference is that in the TensorFlow case the distribution is really a shifted and scaled (location-scale) t-distribution. A simple empirical check that they are the same for loc=0.0 and scale=1.0 is to plot histograms of both samples and see how close they look.
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

np.random.seed(0)
t_np = np.random.standard_t(df=3, size=10000)

with tf.Graph().as_default(), tf.Session() as sess:
    tf.random.set_random_seed(0)
    t_dist = tf.distributions.StudentT(df=3.0, loc=0.0, scale=1.0)
    t_tf = sess.run(t_dist.sample(10000))

plt.hist((t_np, t_tf), np.linspace(-10, 10, 20), label=['NumPy', 'TensorFlow'])
plt.legend()
plt.tight_layout()
plt.show()
Output:
That looks pretty close. Obviously, from the point of view of statistical samples, this is not any kind of proof. If you are still not convinced, there are statistical tools for testing whether a sample comes from a certain distribution or whether two samples come from the same distribution.
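For instance, a two-sample Kolmogorov–Smirnov test from SciPy (a quick sketch using the t_np and t_tf samples drawn above; the exact p-value will depend on the seeds) would look like this:

from scipy import stats

# Null hypothesis: t_np and t_tf were drawn from the same distribution.
statistic, p_value = stats.ks_2samp(t_np, t_tf)
print(statistic, p_value)  # a large p-value means no evidence against the null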
I am completely new to pymc3, so please excuse the fact that this is likely trivial. I have a very simple model where I am predicting a binary response function. The model is almost a verbatim copy of this example: https://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/gelman_bioassay.py
I get back the model parameters (alpha, beta, and theta), but I can't seem to figure out how to overplot the predictions of the model vs. the input data. I tried doing this (using the parlance of the bioassay model):
from scipy.stats import binom
mean_alpha = mean(trace['alpha'])
mean_beta = mean(trace['beta'])
pred_death = binom.rvs(n, 1./(1.+np.exp(-(mean_alpha + mean_beta * dose))))
and then plotting dose vs. pred_death, but this is manifestly not correct as I get different draws of the binomial distribution every time.
Related to this is another question: how do I evaluate the goodness of fit? I couldn't seem to find anything to that effect in the "getting started" pymc3 tutorial.
Thanks very much for any advice!
Hi, a simple way to do it is as follows:
from pymc3 import *
from numpy import ones, array
import numpy as np
import matplotlib.pyplot as plt

# Samples for each dose level
n = 5 * ones(4, dtype=int)
# Log-dose
dose = array([-.86, -.3, -.05, .73])

def invlogit(x):
    return np.exp(x) / (1 + np.exp(x))

with Model() as model:
    # Logit-linear model parameters
    alpha = Normal('alpha', 0, 0.01)
    beta = Normal('beta', 0, 0.01)
    # Calculate probabilities of death
    theta = Deterministic('theta', invlogit(alpha + beta * dose))
    # Data likelihood
    deaths = Binomial('deaths', n=n, p=theta, observed=[0, 1, 3, 5])
    start = find_MAP()
    step = NUTS(scaling=start)
    trace = sample(2000, step, start=start, progressbar=True)

death_fit = np.percentile(trace.theta, 50, axis=0)
plt.plot(dose, death_fit, 'g', marker='.', lw=1.25, ls='-', ms=5, mew=1)
plt.show()
If you want to plot dose vs pred_death, where pred_death is computed from the mean estimated values of alpha and beta, then do:
pred_death = 1./(1. + np.exp(-(mean_alpha + mean_beta * dose)))
plt.plot(dose, pred_death)
Instead, if you want to plot dose vs pred_death taking into account the uncertainty in the posterior for alpha and beta, then probably the easiest way is to use the function sample_ppc. Maybe something like:
ppc = pm.sample_ppc(trace, samples=100, model=pmmodel)
for i in range(100):
    plt.plot(dose, ppc['deaths'][i], 'bo', alpha=0.5)
Using Posterior Predictive Checks (ppc) is a way to check how well your model behaves by comparing the predictions of the model to your actual data. Here you have an example of sample_ppc
Other options could be to plot the mean value plus some interval of interest.
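For example, a minimal sketch of that last option (assuming the ppc dictionary returned by sample_ppc above) that plots the posterior predictive mean together with a 95% percentile interval:

import numpy as np
import matplotlib.pyplot as plt

# ppc['deaths'] has shape (samples, number of dose levels)
deaths_ppc = np.asarray(ppc['deaths'])
mean_deaths = deaths_ppc.mean(axis=0)
lower, upper = np.percentile(deaths_ppc, [2.5, 97.5], axis=0)

plt.plot(dose, mean_deaths, 'b-', label='posterior predictive mean')
plt.fill_between(dose, lower, upper, color='b', alpha=0.2, label='95% interval')
plt.xlabel('log-dose')
plt.ylabel('deaths')
plt.legend()
plt.show()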
I have a TOF spectrum and I would like to implement an algorithm using python (numpy) that finds all the maxima of the spectrum and returns the corresponding x values.
I have looked up online and I found the algorithm reported below.
The assumption here is that, near a maximum, the difference between a value and the value at the maximum is bigger than a number DELTA. The problem is that my spectrum is composed of points that are evenly distributed, even near the maximum, so DELTA is never exceeded and the function peakdet returns an empty array.
Do you have any idea how to overcome this problem? I would really appreciate comments to understand better the code since I am quite new in python.
Thanks!
import sys
from numpy import NaN, Inf, arange, isscalar, asarray, array

def peakdet(v, delta, x=None):
    maxtab = []
    mintab = []

    if x is None:
        x = arange(len(v))
    v = asarray(v)

    if len(v) != len(x):
        sys.exit('Input vectors v and x must have same length')
    if not isscalar(delta):
        sys.exit('Input argument delta must be a scalar')
    if delta <= 0:
        sys.exit('Input argument delta must be positive')

    mn, mx = Inf, -Inf
    mnpos, mxpos = NaN, NaN
    lookformax = True

    for i in arange(len(v)):
        this = v[i]
        if this > mx:
            mx = this
            mxpos = x[i]
        if this < mn:
            mn = this
            mnpos = x[i]

        if lookformax:
            if this < mx - delta:
                maxtab.append((mxpos, mx))
                mn = this
                mnpos = x[i]
                lookformax = False
        else:
            if this > mn + delta:
                mintab.append((mnpos, mn))
                mx = this
                mxpos = x[i]
                lookformax = True

    return array(maxtab), array(mintab)
Below is shown part of the spectrum. I actually have more peaks than those shown here.
This, I think, could work as a starting point. I'm not a signal-processing expert, but I tried this on a generated signal Y that looks quite like yours, and on one with much more noise:
from scipy.signal import convolve
import numpy as np
from matplotlib import pyplot as plt
#Obtaining derivative
kernel = [1, 0, -1]
dY = convolve(Y, kernel, 'valid')
#Checking for sign-flipping
S = np.sign(dY)
ddS = convolve(S, kernel, 'valid')
#These candidates are basically all negative slope positions
#Add one since using 'valid' shrinks the arrays
candidates = np.where(dY < 0)[0] + (len(kernel) - 1)
#Here they are filtered on actually being the final such position in a run of
#negative slopes
peaks = sorted(set(candidates).intersection(np.where(ddS == 2)[0] + 1))
plt.plot(Y)
#If you need a simple filter on peak size you could use:
alpha = -0.0025
peaks = np.array(peaks)[Y[peaks] < alpha]
plt.scatter(peaks, Y[peaks], marker='x', color='g', s=40)
The sample outcomes:
For the noisy one, I filtered peaks with alpha:
If the alpha needs more sophistication, you could try dynamically setting alpha from the discovered peaks, e.g. using assumptions about them being a mixed Gaussian (my favourite being the Otsu threshold, which exists in cv and skimage) or some sort of clustering (k-means could work); a rough sketch follows below.
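As a sketch of that idea (assuming Y and the peaks list from above, and that scikit-image is installed), Otsu's threshold can pick alpha from the detected peak depths instead of hard-coding it:

import numpy as np
from skimage.filters import threshold_otsu

peak_vals = Y[np.asarray(peaks)]
alpha = threshold_otsu(peak_vals)             # data-driven split between noise and real dips
peaks = np.asarray(peaks)[peak_vals < alpha]  # keep only the deeper (more negative) candidates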
And for reference, this is what I used to generate the signal:
Y = np.zeros(1000)

def peaker(Y, alpha=0.01, df=2, loc=-0.005, size=-.0015, threshold=0.001, decay=0.5):
    peaking = False
    for i, v in enumerate(Y):
        if not peaking:
            peaking = np.random.random() < alpha
            if peaking:
                Y[i] = loc + size * np.random.chisquare(df=2)
                continue
        elif Y[i - 1] < threshold:
            peaking = False

        if i > 0:
            Y[i] = Y[i - 1] * decay
peaker(Y)
EDIT: Support for degrading base-line
I simulated a slanting base-line by doing this:
Z = np.log2(np.arange(Y.size) + 100) * 0.001
Y = Y + Z[::-1] - Z[-1]
Then to detect with a fixed alpha (note that I changed sign on alpha):
from scipy.signal import medfilt
alpha = 0.0025
Ybase = medfilt(Y, 51) # 51 should be large in comparison to your peak X-axis lengths and an odd number.
peaks = np.array(peaks)[Ybase[peaks] - Y[peaks] > alpha]
Resulting in the following outcome (the base-line is plotted as dashed black line):
EDIT 2: Simplification and a comment
I simplified the code to use one kernel for both convolves as #skymandr commented. This also removed the magic number in adjusting the shrinkage so that any size of the kernel should do.
As for the choice of "valid" as the option to convolve: it would probably have worked just as well with "same", but I chose "valid" so I didn't have to think about the edge conditions and whether the algorithm could detect spurious peaks there.
As of SciPy version 1.1, you can also use find_peaks:
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
np.random.seed(0)
Y = np.zeros(1000)
# insert #deinonychusaur's peaker function here
peaker(Y)
# make data noisy
Y = Y + 10e-4 * np.random.randn(len(Y))
# find_peaks gets the maxima, so we multiply our signal by -1
Y *= -1
# get the actual peaks
peaks, _ = find_peaks(Y, height=0.002)
# multiply back for plotting purposes
Y *= -1
plt.plot(Y)
plt.plot(peaks, Y[peaks], "x")
plt.show()
This will plot (note that we use height=0.002 which will only find peaks higher than 0.002):
In addition to height, we can also set the minimal distance between two peaks. If you use distance=100, the plot then looks as follows:
You can use
peaks, _ = find_peaks(Y, height=0.002, distance=100)
in the code above.
After looking at the answers and suggestions I decided to offer a solution I often use because it is straightforward and easier to tweak.
It uses a sliding window and counts how many times a local peak appears as a maximum as the window shifts along the x-axis. As #DrV suggested, no universal definition of "local maximum" exists, meaning that some tuning parameters are unavoidable. This function uses "window size" and "frequency" to fine-tune the outcome. Window size is measured in number of data points of the independent variable (x), and frequency counts how sensitive the peak detection should be (also expressed as a number of data points; lower values of frequency produce more peaks and vice versa). The main function is here:
def peak_finder(x0, y0, window_size, peak_threshold):
    # extend x, y using window size
    y = numpy.concatenate([y0, numpy.repeat(y0[-1], window_size)])
    x = numpy.concatenate([x0, numpy.arange(x0[-1], x0[-1] + window_size)])
    local_max = numpy.zeros(len(x0))
    for ii in range(len(x0)):
        local_max[ii] = x[y[ii:(ii + window_size)].argmax() + ii]
    u, c = numpy.unique(local_max, return_counts=True)
    i_return = numpy.where(c >= peak_threshold)[0]
    return(list(zip(u[i_return], c[i_return])))
along with a snippet used to produce the figure shown below:
import numpy
from matplotlib import pyplot

def plot_case(axx, w_f):
    p = peak_finder(numpy.arange(0, len(Y)), -Y, w_f[0], w_f[1])
    r = .9 * min(Y) / 10
    axx.plot(Y)
    for ip in p:
        axx.text(ip[0], r + Y[int(ip[0])], int(ip[0]),
                 rotation=90, horizontalalignment='center')
    yL = pyplot.gca().get_ylim()
    axx.set_ylim([1.15 * min(Y), yL[1]])
    axx.set_xlim([-50, 1100])
    axx.set_title(f'window: {w_f[0]}, count: {w_f[1]}', loc='left', fontsize=10)
    return(None)

window_frequency = {1: (15, 15), 2: (100, 100), 3: (100, 5)}
f, ax = pyplot.subplots(1, 3, sharey='row', figsize=(9, 4),
                        gridspec_kw={'hspace': 0, 'wspace': 0, 'left': .08,
                                     'right': .99, 'top': .93, 'bottom': .06})
for k, v in window_frequency.items():
    plot_case(ax[k-1], v)
pyplot.show()
Three cases show parameter values that render (from left to right panel):
(1) too many, (2) too few, and (3) an intermediate amount of peaks.
To generate Y data, I used the function #deinonychusaur gave above, and added some noise to it from #Cleb's answer.
I hope some might find this useful, but its efficiency primarily depends on actual peak shapes and distances.
Finding a minimum or a maximum is not that simple, because there is no universal definition for "local maximum".
Your code seems to look for a maximum and then accept it as a maximum if the signal later falls below the maximum minus some delta value. After that it starts to look for a minimum with similar criteria. It does not really matter whether your data falls or rises slowly, as the maximum is recorded when it is reached and appended to the list of maxima once the level falls below the hysteresis threshold.
This is a possible way to find local minima and maxima, but it has several shortcomings. One of them is that the method is not symmetric, i.e. if the same data is run backwards, the results are not necessarily the same.
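If you want to see whether this matters for your spectrum, a quick check (just a sketch, assuming the peakdet function from the question, a signal v and a threshold delta) is to run the detection on the reversed data and compare the reported positions:

import numpy as np

maxtab_fwd, _ = peakdet(v, delta)
maxtab_bwd, _ = peakdet(v[::-1], delta)

fwd_pos = set(maxtab_fwd[:, 0]) if len(maxtab_fwd) else set()
# map indices found on the reversed signal back to the original orientation
bwd_pos = set(len(v) - 1 - maxtab_bwd[:, 0]) if len(maxtab_bwd) else set()
print(fwd_pos == bwd_pos)  # False illustrates the asymmetry described above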
Unfortunately, I cannot help much more, because the correct method really depends on the data you are looking at, its shape and its noisiness. If you have some samples, then we might be able to come up with some suggestions.
I'm having trouble getting scipy.interpolate.UnivariateSpline to use any smoothing when interpolating. Based on the function's page as well as some previous posts, I believe it should provide smoothing with the s parameter.
Here is my code:
# Imports
import scipy.interpolate
import pylab
# Set up and plot actual data
x = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542, 9879.9717114947653, 13738.60563208926, 15113.277958924193]
y = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966, 6056.3852496014106, 7895.2332350173638, 9154.2956175610598]
pylab.plot(x, y, "o", label="Actual")
# Plot estimates using splines with a range of degrees
for k in range(1, 4):
    mySpline = scipy.interpolate.UnivariateSpline(x=x, y=y, k=k, s=2)
    xi = range(0, 15100, 20)
    yi = mySpline(xi)
    pylab.plot(xi, yi, label="Predicted k=%d" % k)
# Show the plot
pylab.grid(True)
pylab.xticks(rotation=45)
pylab.legend( loc="lower right" )
pylab.show()
Here is the result:
I have tried this with a range of s values (0.01, 0.1, 1, 2, 5, 50), as well as explicit weights, set to either the same thing (1.0) or randomized. I still can't get any smoothing, and the number of knots is always the same as the number of data points. In particular, I'm looking for outliers like that 4th point (7990.4664106277542, 5851.6866463790966) to be smoothed over.
Is it because I don't have enough data? If so, is there a similar spline function or cluster technique I can apply to achieve smoothing with this few datapoints?
Short answer: you need to choose the value for s more carefully.
The documentation for UnivariateSpline states that:
Positive smoothing factor used to choose the number of knots. Number of
knots will be increased until the smoothing condition is satisfied:
sum((w[i]*(y[i]-s(x[i])))**2,axis=0) <= s
From this one can deduce that "reasonable" values for smoothing, if you don't pass in explicit weights, are around s = m * v where m is the number of data points and v the variance of the data. In this case, s_good ~ 5e7.
EDIT: sensible values for s depend of course also on the noise level in the data. The docs seem to recommend choosing s in the range (m - sqrt(2*m)) * std**2 <= s <= (m + sqrt(2*m)) * std**2 where std is the standard deviation associated with the "noise" you want to smooth over.
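Putting that rule of thumb into code (a short sketch reusing the x and y lists from the question; only the order of magnitude of s really matters):

import numpy as np
from scipy.interpolate import UnivariateSpline

m = len(y)
s_good = m * np.var(y)                 # on the order of 5e7 for this data
spline = UnivariateSpline(x, y, k=3, s=s_good)
print(len(spline.get_knots()))         # fewer knots than data points, so smoothing happens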
#Zhenya's answer of manually setting knots in between datapoints was too rough to deliver good results in noisy data without being selective about how this technique is applied. However, inspired by his/her suggestion, I have had success with Mean-Shift clustering from the scikit-learn package. It performs auto-determination of the cluster count and seems to do a fairly good smoothing job (very smooth in fact).
# Imports
import numpy
import pylab
import scipy
import sklearn.cluster
# Set up original data - note that it's monotonically increasing by X value!
data = {}
data['original'] = {}
data['original']['x'] = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542, 9879.9717114947653, 13738.60563208926, 15113.277958924193]
data['original']['y'] = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966, 6056.3852496014106, 7895.2332350173638, 9154.2956175610598]
# Cluster data, sort it and save
inputNumpy = numpy.array([[data['original']['x'][i], data['original']['y'][i]] for i in range(0, len(data['original']['x']))])
meanShift = sklearn.cluster.MeanShift()
meanShift.fit(inputNumpy)
clusteredData = [[pair[0], pair[1]] for pair in meanShift.cluster_centers_]
clusteredData.sort(lambda pair1, pair2: cmp(pair1[0],pair2[0]))
data['clustered'] = {}
data['clustered']['x'] = [pair[0] for pair in clusteredData]
data['clustered']['y'] = [pair[1] for pair in clusteredData]
# Build a spline using the clustered data and predict
mySpline = scipy.interpolate.UnivariateSpline(x=data['clustered']['x'], y=data['clustered']['y'], k=1)
xi = range(0, round(max(data['original']['x']), -3) + 3000, 20)
yi = mySpline(xi)
# Plot the datapoints
pylab.plot(data['clustered']['x'], data['clustered']['y'], "D", label="Datapoints (%s)" % 'clustered')
pylab.plot(xi, yi, label="Predicted (%s)" % 'clustered')
pylab.plot(data['original']['x'], data['original']['y'], "o", label="Datapoints (%s)" % 'original')
# Show the plot
pylab.grid(True)
pylab.xticks(rotation=45)
pylab.legend( loc="lower right" )
pylab.show()
While I'm not aware of any library which will do it for you off-hand, I'd try a bit more of a DIY approach: I'd start by making a spline with knots in between the raw data points, in both x and y. In your particular example, having a single knot in between the 4th and 5th points should do the trick, since it'd remove the huge derivative at around x=8000.
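A rough sketch of this idea (my own illustration, reusing the x and y lists from the question): fit a least-squares spline with a single interior knot placed between the 4th and 5th points, so the curve is no longer forced through every datapoint.

import numpy as np
from scipy.interpolate import LSQUnivariateSpline
import pylab

t = [(x[3] + x[4]) / 2.0]                      # one interior knot between the 4th and 5th points
mySpline = LSQUnivariateSpline(x, y, t, k=3)   # least-squares fit, not interpolation
xi = np.linspace(min(x), max(x), 500)

pylab.plot(x, y, "o", label="Actual")
pylab.plot(xi, mySpline(xi), label="LSQ spline, 1 interior knot")
pylab.legend(loc="lower right")
pylab.show()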
I had trouble getting BigChef's answer running, here is a variation that works on python 3.6:
# Imports
import pylab
import scipy
import sklearn.cluster
# Set up original data - note that it's monotonically increasing by X value!
data = {}
data['original'] = {}
data['original']['x'] = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542, 9879.9717114947653, 13738.60563208926, 15113.277958924193]
data['original']['y'] = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966, 6056.3852496014106, 7895.2332350173638, 9154.2956175610598]
# Cluster data, sort it and save
import numpy
inputNumpy = numpy.array([[data['original']['x'][i], data['original']['y'][i]] for i in range(0, len(data['original']['x']))])
meanShift = sklearn.cluster.MeanShift()
meanShift.fit(inputNumpy)
clusteredData = [[pair[0], pair[1]] for pair in meanShift.cluster_centers_]
clusteredData.sort(key=lambda li: li[0])
data['clustered'] = {}
data['clustered']['x'] = [pair[0] for pair in clusteredData]
data['clustered']['y'] = [pair[1] for pair in clusteredData]
# Build a spline using the clustered data and predict
mySpline = scipy.interpolate.UnivariateSpline(x=data['clustered']['x'], y=data['clustered']['y'], k=1)
xi = range(0, int(round(max(data['original']['x']), -3)) + 3000, 20)
yi = mySpline(xi)
# Plot the datapoints
pylab.plot(data['clustered']['x'], data['clustered']['y'], "D", label="Datapoints (%s)" % 'clustered')
pylab.plot(xi, yi, label="Predicted (%s)" % 'clustered')
pylab.plot(data['original']['x'], data['original']['y'], "o", label="Datapoints (%s)" % 'original')
# Show the plot
pylab.grid(True)
pylab.xticks(rotation=45)
pylab.show()