I'm trying to port some MATLAB code over to SciPy, and I've tried two different functions from scipy.interpolate: interp1d and UnivariateSpline. The interp1d results match MATLAB's interp1 function, but the UnivariateSpline numbers come out different, and in some cases very different.
f = interp1d(row1,row2,kind='cubic',bounds_error=False,fill_value=numpy.max(row2))
return f(interp)
f = UnivariateSpline(row1,row2,k=3,s=0)
return f(interp)
Could anyone offer any insight? My x vals aren't equally spaced, although I'm not sure why that would matter.
I just ran into the same issue.
Short answer
Use InterpolatedUnivariateSpline instead:
f = InterpolatedUnivariateSpline(row1, row2)
return f(interp)
Long answer
UnivariateSpline is a 'one-dimensional smoothing spline fit to a given set of data points' whereas InterpolatedUnivariateSpline is a 'one-dimensional interpolating spline for a given set of data points'. The former smooths the data, whereas the latter is a more conventional interpolation method and reproduces the results expected from interp1d. The figure below illustrates the difference.
The code to reproduce the figure is shown below.
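import scipy.interpolate as ip
from numpy import linspace, pi, sin
from matplotlib.pyplot import subplot, plot, ylim, ylabel, legend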
#Define independent variable
sparse = linspace(0, 2 * pi, num = 20)
dense = linspace(0, 2 * pi, num = 200)
#Define function and calculate dependent variable
f = lambda x: sin(x) + 2
fsparse = f(sparse)
fdense = f(dense)
ax = subplot(2, 1, 1)
#Plot the sparse samples and the true function
plot(sparse, fsparse, label = 'Sparse samples', linestyle = 'None', marker = 'o')
plot(dense, fdense, label = 'True function')
#Plot the different interpolation results
interpolate = ip.InterpolatedUnivariateSpline(sparse, fsparse)
plot(dense, interpolate(dense), label = 'InterpolatedUnivariateSpline', linewidth = 2)
smoothing = ip.UnivariateSpline(sparse, fsparse)
plot(dense, smoothing(dense), label = 'UnivariateSpline', color = 'k', linewidth = 2)
ip1d = ip.interp1d(sparse, fsparse, kind = 'cubic')
plot(dense, ip1d(dense), label = 'interp1d')
ylim(.9, 3.3)
legend(loc = 'upper right', frameon = False)
#Plot the fractional error
subplot(2, 1, 2, sharex = ax)
plot(dense, smoothing(dense) / fdense - 1, label = 'UnivariateSpline')
plot(dense, interpolate(dense) / fdense - 1, label = 'InterpolatedUnivariateSpline')
plot(dense, ip1d(dense) / fdense - 1, label = 'interp1d')
ylabel('Fractional error')
legend(loc = 'upper left', frameon = False)
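As a quick numerical check of the distinction, here is a minimal sketch that is not part of the original answer. It assumes the same sparse sine samples as above and relies on the documented behaviour that InterpolatedUnivariateSpline is equivalent to UnivariateSpline with s=0; the variable names are purely illustrative.

```python
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline, UnivariateSpline

# Same sparse samples as in the figure code above
sparse = np.linspace(0, 2 * np.pi, num=20)
dense = np.linspace(0, 2 * np.pi, num=200)
fsparse = np.sin(sparse) + 2

interpolating = InterpolatedUnivariateSpline(sparse, fsparse)  # passes through every sample
smoothing = UnivariateSpline(sparse, fsparse)                  # default s: allowed to miss samples
exact = UnivariateSpline(sparse, fsparse, s=0)                 # s=0 switches the smoothing off

passes_through = np.allclose(interpolating(sparse), fsparse)
smoothed_away = not np.allclose(smoothing(sparse), fsparse)
same_as_s0 = np.allclose(exact(dense), interpolating(dense))
print(passes_through, smoothed_away, same_as_s0)
```

If all three flags come out true, the difference seen in the figure is down to the default smoothing factor rather than to a different spline family.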
The reason why the results are different (but both likely correct) is that the interpolation routines used by UnivariateSpline and interp1d are different.
interp1d constructs a smooth B-spline using the x-points you gave to it as knots.
UnivariateSpline is based on FITPACK, which also constructs a smooth B-spline. However, FITPACK tries to choose new knots for the spline, to fit the data better (probably to minimize chi^2 plus some penalty for curvature, or something similar). You can find out which knot points it used by calling get_knots() on the spline object.
So the reason why you get different results is that the interpolation algorithm is different. If you want B-splines with knots at data points, use interp1d or splmake. If you want what FITPACK does, use UnivariateSpline. In the limit of dense data, both methods give the same results, but when the data is sparse, you may get different results.
(How do I know all this: I read the code :-)
Works for me,
from numpy import allclose, linspace
from scipy.interpolate import interp1d, UnivariateSpline
from numpy.random import normal
from pylab import plot, show
n = 2**5
x = linspace(0,3,n)
y = (2*x**2 + 3*x + 1) + normal(0.0,2.0,n)
i = interp1d(x,y,kind=3)
u = UnivariateSpline(x,y,k=3,s=0)
m = 2**4
t = linspace(1,2,m)
print(allclose(i(t),u(t))) # evaluates to True
This gives me,
UnivariateSpline: A more recent
wrapper of the FITPACK routines.
This might explain the slightly different values. (I also found that UnivariateSpline is much faster than interp1d.)
I have two arrays with x- and y- data.
This data shows lognormal behavior. I need a graph of the fit, as well as mu and sigma, to do some statistics.
I did a fit in order to calculate mu and sigma and, from them, some further statistical values (see code below).
I obtain the scaling factor, by which I have to multiply the distribution, from an integral over the data points.
The code below does work. My question is whether there is a better way to do this (I am sure there is). It feels like a workaround that will only work sometimes, and I have to plot hundreds of these.
My code (sorry that it is this long; I wanted to include everything except the import of the raw data):
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
# produce plot True/False
ploton = True
x0=np.array([3.58381e+01, 3.27125e+01, 2.98680e+01, 2.72888e+01, 2.49364e+01,
2.27933e+01, 2.08366e+01, 1.90563e+01, 1.74380e+01, 1.59550e+01,
1.45904e+01, 1.33460e+01, 1.22096e+01, 1.11733e+01, 1.02262e+01,
9.35893e+00, 8.56556e+00, 7.86688e+00, 7.20265e+00, 6.59782e+00,
6.01571e+00, 5.53207e+00, 5.03979e+00, 4.64415e+00, 4.19920e+00,
3.83595e+00, 3.50393e+00, 3.28070e+00, 3.00930e+00, 2.75634e+00,
2.52050e+00, 2.31349e+00, 2.12280e+00, 1.92642e+00, 1.77820e+00,
1.61692e+00, 1.49094e+00, 1.36233e+00, 1.22935e+00, 1.14177e+00,
1.03078e+00, 9.39603e-01, 8.78425e-01, 1.01490e+00, 1.07461e-01,
4.81523e-02, 4.81523e-02, 1.00000e-02, 1.00000e-02])
y0=np.array([3.94604811e+04, 2.78223936e+04, 1.95979179e+04, 2.14447807e+04,
1.68677487e+04, 1.79429516e+04, 1.73589776e+04, 2.16101026e+04,
3.79705638e+04, 6.83622301e+04, 1.73687772e+05, 5.74854475e+05,
1.69497465e+06, 3.79135941e+06, 7.76757753e+06, 1.33429094e+07,
1.96096415e+07, 2.50403065e+07, 2.72818618e+07, 2.53120387e+07,
1.93102362e+07, 1.22219224e+07, 4.96725699e+06, 1.61174658e+06,
3.19352386e+05, 1.80305856e+05, 1.41728002e+05, 1.66191809e+05,
1.33223816e+05, 1.31384905e+05, 2.49100945e+05, 2.28300583e+05,
3.01063903e+05, 1.84271914e+05, 1.26412781e+05, 8.57488083e+04,
1.35536571e+05, 4.50076293e+04, 1.98080100e+05, 2.27630303e+05,
1.89484527e+05, 0.00000000e+00, 1.36543525e+05, 2.20677520e+05,
3.60100586e+05, 1.62676486e+05, 1.90105093e+04, 9.27461467e+05,
Dnm = x0
dndlndp = y0
#lognormal PDF:
def f(x, mu, sigma):
    return 1/(np.sqrt(2*np.pi)*sigma*x)*np.exp(-((np.log(x)-mu)**2)/(2*sigma**2))
#normalizing y-values to obtain lognormal distributed data:
y0_normalized = y0/np.trapz(x0.ravel(), y0.ravel())
#calculating mu/sigma of this distribution:
params, extras = curve_fit(f, x0.ravel(), y0_normalized.ravel())
median = np.exp(params[0])
mu = params[0]
sigma = params[1]
#output of mu / sigma / calculated median:
print("mu=%g, sigma=%g" % (params[0], params[1]))
print("median=%g" % median)
#new variable z for smooth fit-curve:
z = np.linspace(0.1, 100, 10000)
Dnm = np.ravel(Dnm)
dndlndp = np.ravel(dndlndp)
Dnm_rev = list(reversed(Dnm))
dndlndp_rev = list(reversed(dndlndp))
scalingfactor = np.trapz(dndlndp_rev, Dnm_rev, dx = np.log(Dnm_rev))
if ploton:
    plt.plot(z, f(z, mu, sigma)*scalingfactor, label="fit", color = "red")
    plt.scatter(x0, y0, label="data")
EDIT1: Maybe I should add that I have no idea why the scaling factor calculated with
scalingfactor = np.trapz(dndlndp_rev, Dnm_rev, dx = np.log(Dnm_rev))
is right. It was simply trial and error. I really want to know why this does the trick, since the "area" of all bins combined is:
N = np.trapz(dndlndp_rev, np.log(Dnm_rev), dx = np.log(Dnm_rev))
because the width of the bins is log(Dnm).
EDIT2: Thank you for all the answers. I copied the arrays into the code, which is now runnable. I want to simplify the question since, due to my poor English, I think I was not able to say what I really want:
I have a lognormal set of data. The code above allows me to calculate mu and sigma. To do so, I need to normalize the data, so that the area under the function is 1 from then on.
In order to plot a lognormal function with the calculated mu and sigma, I need to multiply the function by an (unknown) factor, because the area under the real function is something like 1e8, certainly not one. I worked around this by calculating the "scalingfactor" via the trapz integral of the discrete raw data.
There has to be a better way to plot the fitted function when mu and sigma are already known.
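One possible reading of why the scaling factor works, offered as a sketch rather than a definitive answer: np.trapz takes its arguments as (y, x) and only uses dx when no x array is passed, so the call with both Dnm_rev and dx is simply the area under the raw data over x. If the data are N times a lognormal PDF, that area is N. The example below uses hypothetical synthetic data (N_true, mu and sigma are made up) and a hand-written trapezoid helper to illustrate this.

```python
import numpy as np

def lognormal_pdf(x, mu, sigma):
    """Lognormal probability density; integrates to 1 over x > 0."""
    return np.exp(-(np.log(x) - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma * x)

def trapezoid(y, x):
    """Plain trapezoidal rule, written out so it does not depend on the NumPy version."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

# Hypothetical synthetic data: a lognormal shape scaled by a known total N_true
N_true, mu, sigma = 1e8, 2.0, 0.3
x = np.logspace(-2, 2, 2000)           # increasing x, like Dnm_rev in the question
y = N_true * lognormal_pdf(x, mu, sigma)

scalingfactor = trapezoid(y, x)        # area under the raw data
y_normalized = y / scalingfactor       # area 1, directly comparable to the PDF

print(scalingfactor / N_true)
```

Under this reading, the factor that normalizes the data and the factor that rescales the fitted PDF for plotting are the same number, so it only needs to be computed once.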
I am using Isomap from scikit-learn manifold learning. I reduce to two dimensions and observe that, with every run of the algorithm on the same data set without any changes, the resulting vectors change. I assume some random numbers are used in the algorithm, but there is no way to set a seed. random_state is not a parameter that can be passed to Isomap. Am I missing something?
The randomness you've seen is in the sign of your result. The sign is not (in my opinion) 100% random. Signs within each component are consistent, so the relative relations in your result are preserved. Signs between components are random; in other words, which components get multiplied by -1 or 1 is random. This behavior comes from the KernelPCA function used by Isomap when the arpack eigensolver is used.
To give you a solution first: you can use eigen_solver='dense' when using Isomap. That may slow down your algorithm but should remove this randomness. I know the explanation above might be confusing, so let me give more details and show this with a plot.
First, what is the visible consequence of the "sign randomness"? Using the following code (modified from this official example) with eigen_solver = 'arpack', you can see that two fit_transform calls on the same Isomap object may (or may not) give you different results. However, as you can see in the plot, the relative locations are preserved; it's just the whole plot getting flipped. If you use eigen_solver='dense' and run the code multiple times, you won't see this randomness:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import offsetbox
from sklearn import (manifold, datasets, decomposition, ensemble,
                     discriminant_analysis, random_projection)
digits = datasets.load_digits(n_class=6)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
n_neighbors = 30
def plot_embedding(X, ax, title=None):
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)
    for i in range(X.shape[0]):
        ax.text(X[i, 0], X[i, 1], str(digits.target[i]),
                color=plt.cm.Set1(y[i] / 10.),
                fontdict={'weight': 'bold', 'size': 9})
eigen_solver = 'arpack'
#eigen_solver = 'dense'
iso = manifold.Isomap(n_neighbors, n_components=2, eigen_solver=eigen_solver)
X_iso1 = iso.fit_transform(X)
X_iso2 = iso.fit_transform(X)
fig = plt.figure(figsize=(16, 6))
ax1 = fig.add_subplot(121)
plot_embedding(X_iso1, ax1)
ax2 = fig.add_subplot(122)
plot_embedding(X_iso2, ax2)
Secondly, is there a way to set a seed to "stabilize" the random state? No, there is currently no way to set a seed for KernelPCA from Isomap. With KernelPCA, however, there is a kwarg random_state which is "A pseudo random number generator used for the initialization of the residuals when eigen_solver == ‘arpack’". Play with the following code (modified from this official test code) and you can see this randomness is gone (blue dots cover red dots) even with eigen_solver = 'arpack':
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import KernelPCA
X_fit = np.random.rand(100, 4)
X = np.dot(X_fit, X_fit.T)
eigen_solver = 'arpack'
#eigen_solver = 'dense'
#random_state = None
random_state = 0
kpca = KernelPCA(n_components=2, kernel='precomputed',
                 eigen_solver=eigen_solver, random_state=random_state)
X_kpca1 = kpca.fit_transform(X)
X_kpca2 = kpca.fit_transform(X)
plt.plot(X_kpca1[:,0], X_kpca1[:,1], 'ro')
plt.plot(X_kpca2[:,0], X_kpca2[:,1], 'bo')
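If neither changing the solver nor seeding is an option, another possibility is to normalize the signs after the fact. The sketch below is my own assumption, not something scikit-learn provides: fix_signs is a hypothetical helper that flips each component so that its largest-magnitude entry is positive. It only removes the sign ambiguity described above and can be unstable when two entries in a component tie in magnitude.

```python
import numpy as np

def fix_signs(embedding):
    """Flip each column so that its largest-magnitude entry is positive."""
    embedding = np.asarray(embedding, dtype=float)
    # Row index of the largest absolute value in each column
    idx = np.argmax(np.abs(embedding), axis=0)
    signs = np.sign(embedding[idx, np.arange(embedding.shape[1])])
    signs[signs == 0] = 1.0  # leave all-zero columns untouched
    return embedding * signs

# Two toy embeddings that differ only by a flip of the second component
a = np.array([[1.0, -2.0], [0.5, 3.0], [-4.0, 1.0]])
b = a * np.array([1.0, -1.0])
print(np.allclose(fix_signs(a), fix_signs(b)))
```

Applied to the output of fit_transform, this would make repeated runs comparable as long as they differ only by per-component sign flips.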
I'm using scipy.stats.binned_statistic_2d and then plotting the output. When I use stat="count", I have no problems. When I use stat="mean" (or np.max, for that matter), I end up with negative values in each bin (as identified by the color bar), which should not be the case because I have constructed zvals such that it is always greater than zero. Does anyone know why this is the case? I've included the minimal code I use to generate the plots. I also get an "invalid value" RuntimeWarning, which makes me think that something strange is going on in binned_statistic_2d. The following code should just copy and run.
From the documentation:
'count' : compute the count of points within each bin. This is
identical to an unweighted histogram. `values` array is not
which leads me to believe that there might be something going on in binned_statistic_2d and how it handles z-values.
import numbers as _numbers
import numpy as _np
import scipy as _scipy
import matplotlib as _mpl
import types as _types
import scipy.stats
from matplotlib import pyplot as _plt
norm_args = (0, 3, int(1e5)) # loc, scale, size
x = _np.random.random(norm_args[-1]) # xvals can be log scaled.
y = _np.random.normal(*norm_args) #_np.random.random(norm_args[-1]) #
z = _np.abs(_np.random.normal(1e2, *norm_args[1:]))
nbins = 100  # the bin count must be an integer
kwargs = {}
stat = _np.max
fig, ax = _plt.subplots()
binned_stats = _scipy.stats.binned_statistic_2d(x, y, z, stat,
H, xedges, yedges, binnumber = binned_stats
Hplot = H
if isinstance(stat, str):
    cbar_title = stat.title()
elif isinstance(stat, _types.FunctionType):
    cbar_title = stat.__name__.title()
XX, YY = _np.meshgrid(xedges, yedges)
Image = ax.pcolormesh(XX, YY, Hplot.T) #norm=norm,
grid_kargs = {'orientation': 'vertical'}
cax, kw = _mpl.colorbar.make_axes_gridspec(ax, **grid_kargs)
cbar = fig.colorbar(Image, cax=cax)
Here's the runtime warning:
/Users/balterma/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/matplotlib/colors.py:584: RuntimeWarning: invalid value encountered in less
  cbook._putmask(xa, xa < 0.0, -1)
Image with mean:
Image with max:
Image with count:
Turns out the problem was interfacing with plt.pcolormesh. I had to convert the output array from binned_statistic_2d to a masked array that masked the NaNs.
Here's the question that gave me the answer:
pcolormesh with missing values?
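A minimal sketch of that masking step, assuming the statistic array contains NaN for bins that received no points (which is what binned_statistic_2d documents for 'mean'). The small array H below is a hypothetical stand-in for the real output, so the snippet runs without the rest of the code.

```python
import numpy as np

# Hypothetical stand-in for the statistic array H returned by binned_statistic_2d;
# the NaN entries play the role of empty bins.
H = np.array([[101.2, np.nan, 99.8],
              [np.nan, 100.5, 102.1]])

H_masked = np.ma.masked_invalid(H)   # masks NaN (and inf) entries

print(H_masked.min(), H_masked.max(), H_masked.count())
# With matplotlib, the masked array is then passed in place of H, e.g.
# ax.pcolormesh(XX, YY, H_masked.T)
```

Masked cells are skipped when the color limits are computed, so the color bar should reflect only the bins that actually contain data.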
I have four hexbin plots which have all been normalized. How do I add them together to make one big distribution?
I have tried concatenating the input vectors and then creating the hexbin plot, but this throws off the normalization of the individual distributions:
So how do I add the individual hexbin distributions whilst still maintaining the individual normalization?
The relevant part of my code is:
def hex_plot(x,y,max_v):
    # 2.0 forces float division so the 3-sigma bound is exp(-4.5) on Python 2 as well
    bounds = [0,max_v*m.exp(-(3**2)/2.0),max_v*m.exp(-2),max_v*m.exp(-0.5),max_v] # The sigma bounds
    norm = mpl.colors.BoundaryNorm(bounds, ncolors=4)
    hex_ = plt.hexbin(x, y, C=None, gridsize=gridsize,reduce_C_function=np.mean,cmap=cmap,mincnt=1,norm=norm)
    print("Hex plot max: %s" % hex_.norm.vmax)
    return hex_
cmap = mpl.colors.ListedColormap(['grey','#6A92D4','#1049A9','#052C6E'])
Thank you.
I've written a bit of code that does what you're after. From the snippet in your question, it looks like you already know the height (max_v) of your distribution given your binning scheme, so I worked under that assumption. Depending on the data you're applying this to, this might not actually be the case, in which case the following will fail (it's only as good as your guess/knowledge of the height of the distributions). For the purposes of my example data, I've just taken a reasonable guess (based on a quick plot) for the values of max_v1 and max_v2. Switching the c1 and c2 I've defined for the commented versions should reproduce your original problem.
import scipy
import matplotlib.pyplot as pyplot
import matplotlib.colors
import math
#need to know the height of the distributions a priori
max_v1 = 850 #approximate height of distribution 1 (defined below) with binning defined below
max_v2 = 400 #approximate height of distribution 2 (defined below) with binning defined below
max_v = max(max_v1,max_v2)
#make 2 differently sized datasets (so will require different normalizations)
#all normal distributions with assorted means/variances
x1 = scipy.randn(50000)/6.0+0.5
y1 = scipy.randn(50000)/3.0+0.5
x2 = scipy.randn(100000)/2.0-0.5
y2 = scipy.randn(100000)/2.0-0.5
#c1 = scipy.ones(len(x1)) #I don't assign meaningful weights here
#c2 = scipy.ones(len(x2)) #I don't assign meaningful weights here
c1 = scipy.ones(len(x1))*(max_v/max_v1) #highest distribution: no net change in normalization here
c2 = scipy.ones(len(x2))*(max_v/max_v2) #renormalized to same height as highest distribution
#define plot boundaries
#custom colormap
cmap = matplotlib.colors.ListedColormap(['grey','#6A92D4','#1049A9','#052C6E'])
#the bounds of 1sigma, 2sigma, etc. regions
bounds = [0,max_v*math.exp(-(3**2)/2.0),max_v*math.exp(-2),max_v*math.exp(-0.5),max_v] #2.0 forces float division on Python 2
norm = matplotlib.colors.BoundaryNorm(bounds, ncolors=4)
#make the hexbin plot
normalized = pyplot
hexplot = normalized.subplot(111)
normalized.hexbin(scipy.concatenate((x1,x2)), scipy.concatenate((y1,y2)), C=scipy.concatenate((c1,c2)), cmap=cmap, mincnt=1, extent=(xmin,xmax,ymin,ymax),gridsize=50, reduce_C_function=scipy.sum, norm=norm) #combine distributions and weights
cax = pyplot.axes([0.86, 0.1, 0.03, 0.85])
clims = cax.axis()
cb = normalized.colorbar(cax=cax)
cax.set_yticklabels([' ','3','2','1',' '])
normalized.subplots_adjust(wspace=0, hspace=0, bottom=0.1, right=0.78, top=0.95, left=0.12)
Here's the result without the fix (commented c1 and c2 used),
and the result with the fix (code as-is);
Hope that helps.
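The weighting idea can also be checked without plotting. The sketch below is an illustration under my own assumptions, not the answer's code: it uses numpy.histogram2d on a rectangular grid as a stand-in for hexbin, measures each distribution's peak instead of guessing it, and uses made-up plot boundaries in extent.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two differently sized samples, loosely mirroring the example above
x1, y1 = rng.normal(0.5, 1 / 6.0, 50000), rng.normal(0.5, 1 / 3.0, 50000)
x2, y2 = rng.normal(-0.5, 0.5, 100000), rng.normal(-0.5, 0.5, 100000)

bins = 50
extent = [[-2.5, 2.0], [-2.5, 2.0]]   # hypothetical plot boundaries

# Peak count of each distribution on its own, measured instead of guessed
h1, _, _ = np.histogram2d(x1, y1, bins=bins, range=extent)
h2, _, _ = np.histogram2d(x2, y2, bins=bins, range=extent)
max_v = max(h1.max(), h2.max())
w1, w2 = max_v / h1.max(), max_v / h2.max()

# Concatenate the samples and sum the per-point weights in each bin,
# analogous to passing C=... with a summing reduce_C_function to hexbin
combined, _, _ = np.histogram2d(
    np.concatenate((x1, x2)),
    np.concatenate((y1, y2)),
    bins=bins,
    range=extent,
    weights=np.concatenate((np.full(len(x1), w1), np.full(len(x2), w2))),
)
print(np.allclose(combined, w1 * h1 + w2 * h2))
```

Because the weighted sum is linear, the combined grid equals the sum of the individually rescaled grids, which is why each distribution keeps its own normalization.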
I'm having trouble getting scipy.interpolate.UnivariateSpline to use any smoothing when interpolating. Based on the function's page as well as some previous posts, I believe it should provide smoothing with the s parameter.
Here is my code:
# Imports
import scipy.interpolate
import pylab
# Set up and plot actual data
x = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542, 9879.9717114947653, 13738.60563208926, 15113.277958924193]
y = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966, 6056.3852496014106, 7895.2332350173638, 9154.2956175610598]
pylab.plot(x, y, "o", label="Actual")
# Plot estimates using splines with a range of degrees
for k in range(1, 4):
    mySpline = scipy.interpolate.UnivariateSpline(x=x, y=y, k=k, s=2)
    xi = range(0, 15100, 20)
    yi = mySpline(xi)
    pylab.plot(xi, yi, label="Predicted k=%d" % k)
# Show the plot
pylab.legend( loc="lower right" )
Here is the result:
I have tried this with a range of s values (0.01, 0.1, 1, 2, 5, 50), as well as explicit weights, set to either the same thing (1.0) or randomized. I still can't get any smoothing, and the number of knots is always the same as the number of data points. In particular, I'm looking for outliers like that 4th point (7990.4664106277542, 5851.6866463790966) to be smoothed over.
Is it because I don't have enough data? If so, is there a similar spline function or cluster technique I can apply to achieve smoothing with this few datapoints?
Short answer: you need to choose the value for s more carefully.
The documentation for UnivariateSpline states that:
Positive smoothing factor used to choose the number of knots. Number of
knots will be increased until the smoothing condition is satisfied:
sum((w[i]*(y[i]-s(x[i])))**2,axis=0) <= s
From this one can deduce that "reasonable" values for smoothing, if you don't pass in explicit weights, are around s = m * v where m is the number of data points and v the variance of the data. In this case, s_good ~ 5e7.
EDIT: sensible values for s depend of course also on the noise level in the data. The docs seem to recommend choosing s in the range (m - sqrt(2*m)) * std**2 <= s <= (m + sqrt(2*m)) * std**2 where std is the standard deviation associated with the "noise" you want to smooth over.
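To make the rule of thumb concrete, here is a small sketch using the data from the question. It assumes unit weights and takes s = m * v with v computed by numpy.var; the names tight and loose are only illustrative.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

x = np.array([0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542,
              9879.9717114947653, 13738.60563208926, 15113.277958924193])
y = np.array([0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966,
              6056.3852496014106, 7895.2332350173638, 9154.2956175610598])

m = len(x)
s_good = m * np.var(y)   # number of points times the variance of the data

tight = UnivariateSpline(x, y, k=3, s=2)        # s tiny compared with the data scale
loose = UnivariateSpline(x, y, k=3, s=s_good)   # s on the scale of the data

print(s_good)
print(len(tight.get_knots()), len(loose.get_knots()))
print(loose.get_residual() <= s_good)
```

With s on the scale of the data the fit needs far fewer knots, which is the smoothing the question was looking for; values between the two extremes trade smoothness against fidelity.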
@Zhenya's answer of manually setting knots in between datapoints was too rough to deliver good results in noisy data without being selective about how this technique is applied. However, inspired by his/her suggestion, I have had success with Mean-Shift clustering from the scikit-learn package. It performs auto-determination of the cluster count and seems to do a fairly good smoothing job (very smooth in fact).
# Imports
import numpy
import pylab
import scipy.interpolate
import sklearn.cluster
# Set up original data - note that it's monotonically increasing by X value!
data = {}
data['original'] = {}
data['original']['x'] = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542, 9879.9717114947653, 13738.60563208926, 15113.277958924193]
data['original']['y'] = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966, 6056.3852496014106, 7895.2332350173638, 9154.2956175610598]
# Cluster data, sort it and and save
inputNumpy = numpy.array([[data['original']['x'][i], data['original']['y'][i]] for i in range(0, len(data['original']['x']))])
meanShift = sklearn.cluster.MeanShift()
meanShift.fit(inputNumpy)  # cluster_centers_ only exists after fitting
clusteredData = [[pair[0], pair[1]] for pair in meanShift.cluster_centers_]
clusteredData.sort(lambda pair1, pair2: cmp(pair1[0],pair2[0]))
data['clustered'] = {}
data['clustered']['x'] = [pair[0] for pair in clusteredData]
data['clustered']['y'] = [pair[1] for pair in clusteredData]
# Build a spline using the clustered data and predict
mySpline = scipy.interpolate.UnivariateSpline(x=data['clustered']['x'], y=data['clustered']['y'], k=1)
xi = range(0, round(max(data['original']['x']), -3) + 3000, 20)
yi = mySpline(xi)
# Plot the datapoints
pylab.plot(data['clustered']['x'], data['clustered']['y'], "D", label="Datapoints (%s)" % 'clustered')
pylab.plot(xi, yi, label="Predicted (%s)" % 'clustered')
pylab.plot(data['original']['x'], data['original']['y'], "o", label="Datapoints (%s)" % 'original')
# Show the plot
pylab.legend( loc="lower right" )
While I'm not aware of any library which will do it for you off-hand, I'd try a bit more DIY approach: I'd start from making a spline with knots in between the raw data points, in both x and y. In your particular example, having a single knot in between the 4th and 5th points should do the trick, since it'd remove the huge derivative at around x=8000.
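One way to try the single-knot idea is scipy.interpolate.LSQUnivariateSpline, which fits a least-squares spline with user-supplied interior knots. This is a sketch under assumptions: the knot position 8900 is an arbitrary value between the 4th and 5th points, not a tuned choice, and whether the result is smooth enough depends on the data.

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

x = np.array([0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542,
              9879.9717114947653, 13738.60563208926, 15113.277958924193])
y = np.array([0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966,
              6056.3852496014106, 7895.2332350173638, 9154.2956175610598])

# One interior knot placed between the 4th and 5th data points
# (8900 is an arbitrary illustrative choice, not a tuned value)
knots = [8900.0]
spline = LSQUnivariateSpline(x, y, knots, k=3)

xi = np.arange(0, 15100, 20)
yi = spline(xi)
print(spline.get_knots())
print(spline.get_residual())
```

Since the spline has fewer degrees of freedom than there are data points, it can no longer chase every sample, and the nonzero residual shows how far it departs from the raw data.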
I had trouble getting BigChef's answer running, here is a variation that works on python 3.6:
# Imports
import pylab
import scipy.interpolate
import sklearn.cluster
# Set up original data - note that it's monotonically increasing by X value!
data = {}
data['original'] = {}
data['original']['x'] = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542, 9879.9717114947653, 13738.60563208926, 15113.277958924193]
data['original']['y'] = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966, 6056.3852496014106, 7895.2332350173638, 9154.2956175610598]
# Cluster data, sort it and and save
import numpy
inputNumpy = numpy.array([[data['original']['x'][i], data['original']['y'][i]] for i in range(0, len(data['original']['x']))])
meanShift = sklearn.cluster.MeanShift()
meanShift.fit(inputNumpy)  # cluster_centers_ only exists after fitting
clusteredData = [[pair[0], pair[1]] for pair in meanShift.cluster_centers_]
clusteredData.sort(key=lambda li: li[0])
data['clustered'] = {}
data['clustered']['x'] = [pair[0] for pair in clusteredData]
data['clustered']['y'] = [pair[1] for pair in clusteredData]
# Build a spline using the clustered data and predict
mySpline = scipy.interpolate.UnivariateSpline(x=data['clustered']['x'], y=data['clustered']['y'], k=1)
xi = range(0, int(round(max(data['original']['x']), -3)) + 3000, 20)
yi = mySpline(xi)
# Plot the datapoints
pylab.plot(data['clustered']['x'], data['clustered']['y'], "D", label="Datapoints (%s)" % 'clustered')
pylab.plot(xi, yi, label="Predicted (%s)" % 'clustered')
pylab.plot(data['original']['x'], data['original']['y'], "o", label="Datapoints (%s)" % 'original')
# Show the plot