Regression line for holoviews scatter plot? - python

I'm creating a scatter plot from an xarray Dataset using
scat = ds.hvplot.scatter(x='a', y='b', groupby='c', height=900, width=900)
How can I add a regression line to this plot?
I'm also using this to set some of the properties in the plot and I could add the Slope within the hook function but I can't figure out how to access x and y from the plot.state. This also might be completely the wrong way of doing it.
scat = scat.opts(hooks=[hook])
def hook(plot, element):
print('plot.state: ', plot.state)
print('plot.handles: ', sorted(plot.handles.keys()))
par = np.polyfit(x, y, 1, full=True)
gradient=par[0][0]
y_intercept=par[0][1]
slope = Slope(gradient=gradient, y_intercept=y_intercept,
line_color='orange', line_dash='dashed', line_width=3.5)
plot.state.add_layout(slope)
scat = scat.opts(hooks=[hook])

HoloViews >= 1.13 now has support for adding a regression line to your plot, so you don't need hooks anymore.
1) You can either add the regression line yourself by specifying keywords slope and y_intercept:
gradient = 2
y_intercept = 15
# create random data
xpts = np.arange(0, 20)
ypts = gradient * xpts + y_intercept + np.random.normal(0, 4, 20)
scatter = hv.Scatter((xpts, ypts))
# create slope with hv.Slope()
slope = hv.Slope(gradient, y_intercept)
scatter.opts(size=10) * slope.opts(color='red', line_width=6)
2) Or you can have HoloViews calculate it for you with hv.Slope.from_scatter():
normal = hv.Scatter(np.random.randn(20, 2))
normal.opts(size=10) * hv.Slope.from_scatter(normal)
Resulting plot:

The plot hooks is given two arguments, the second of which is the element being displayed. Since the element contains the data being displayed we can write a callback to compute the slope using the dimension_values method to get the values of the 'a' and 'b' dimensions in your data. Additionally, in order to avoid the Slope glyph being added multiple times, we can cache it on the plot and update its attributes:
def hook(plot, element):
x, y = element.dimension_values('a'), element.dimension_values('b')
par = np.polyfit(x, y, 1, full=True)
gradient=par[0][0]
y_intercept=par[0][1]
if 'slope' in plot.handles:
slope = plot.handles['slope']
slope.gradient = gradient
slope.y_intercept = y_intercept
else:
slope = Slope(gradient=gradient, y_intercept=y_intercept,
line_color='orange', line_dash='dashed', line_width=3.5)
plot.handles['slope'] = slope
plot.state.add_layout(slope)

Related

How to plot 2 trendlines on a single scatterplot? (python)

I want to plot 2 trendlines for one scatterplot using Matplotlib in Python but I don't know how. The graph should be similar to this target plot (from here, fig.2).
I managed to plot 1 trendline on a scatterplot here but can't figure out how to plot another trend.
Underneath is what I tried until now:
This proved ok for other parameters that I plotted, but not for this case, which led me to the conclusion that it's not too correct.
X = vO2.reshape(-1, 1)
Y = ve.reshape(-1, 1)
linear_regressor = LinearRegression()
linear_regressor.fit(X, Y)
y_pred = linear_regressor.predict(X)
x_pred = linear_regressor.predict(Y)
plt.scatter(X, Y)
plt.plot(X, y_pred, '-*',label="O2")
plt.plot(x_pred, Y, '-*',label="vent")
plt.xlabel("VO2 (L/min)")
plt.ylabel("VE (L/min)")
plt.show()
and also
z1 = np.polyfit(vO2, ve, 1)
p1 = np.poly1d(z1)
z2 = np.polyfit(ve, vO2, 1)
p2 = np.poly1d(z2)
plt.scatter(vO2, ref_vent, label='original')
plt.plot(vO2, p1(vO2), label='trendline')
plt.plot(ve, p2(ve), label='trendline')
plt.show()
which also didn't look similar to the target plot.
I don't know how to continue. Thanks in advance!
example dataset:
vo2 = [1.673925 1.9015125 1.981775 2.112875 2.1112625 2.086375 2.13475
2.1777 2.176975 2.1857125 2.258925 2.2718375 2.3381 2.3330875
2.353725 2.4879625 2.448275 2.4829875 2.5084375 2.511275 2.5511
2.5678375 2.5844625 2.6101875 2.6457375 2.6602125 2.6939875 2.7210625
2.720475 2.767025 2.751375 2.7771875 2.776025 2.7319875 2.564
2.3977625 2.4459125 2.42965 2.401275 2.387175 2.3544375]
ve = [ 3.93125 7.1975 9.04375 14.06125 14.11875 13.24375
14.6625 15.3625 15.2 15.035 17.7625 17.955
19.2675 19.875 21.1575 22.9825 23.75625 23.30875
25.9925 25.6775 27.33875 27.7775 27.9625 29.35
31.86125 32.2425 33.7575 34.69125 36.20125 38.6325
39.4425 42.085 45.17 47.18 42.295 37.5125
38.84375 37.4775 34.20375 33.18 32.67708333]
OK, so you need to find the point, where slope of line changes. I tried 2nd derivative, but it was noisy and I coulnd't find the right spot.
Another way is to try all possible points, calculate left and right regression lines and find pair with best fit (r2 coeff). Give this code a try. It is not complete. I do not know, how to force regression lines to go through point in the middle. And it might be better to work with interpolated data, if there are not enough datapoints.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
vo2 = [1.673925,1.9015125,1.981775,2.112875,2.1112625,2.086375,2.13475,2.1777,2.176975,2.1857125,2.258925,2.2718375,2.3381,2.3330875,2.353725,2.4879625,2.448275,2.4829875,2.5084375,2.511275,2.5511,2.5678375,2.5844625,2.6101875,2.6457375,2.6602125,2.6939875,2.7210625,2.720475,2.767025,2.751375,2.7771875,2.776025,2.7319875,2.564,2.3977625,2.4459125,2.42965,2.401275,2.387175,2.3544375]
ve = [ 3.93125,7.1975,9.04375,14.06125,14.11875,13.24375,14.6625,15.3625,15.2,15.035,17.7625,17.955,19.2675,19.875,21.1575,22.9825,23.75625,23.30875,25.9925,25.6775,27.33875,27.7775,27.9625,29.35,31.86125,32.2425,33.7575,34.69125,36.20125,38.6325,39.4425,42.085,45.17,47.18,42.295,37.5125,38.84375,37.4775,34.20375,33.18,32.67708333]
x = np.array(vo2)
y = np.array(ve)
sort_idx = x.argsort()
x = x[sort_idx]
y = y[sort_idx]
assert len(x) == len(y)
def fit(x,y):
p = np.polyfit(x, y, 1)
f = np.poly1d(p)
r2 = r2_score(y, f(x))
return p, f, r2
skip = 5 # minimal length of split data
r2 = [0] * len(x)
funcs = {}
for i in range(len(x)):
if i < skip or i > len(x) - skip:
continue
_, f_left, r2_left = fit(x[:i], y[:i])
_, f_right, r2_right = fit(x[i:], y[i:])
r2[i] = r2_left * r2_right
funcs[i] = (f_left, f_right)
split_ix = np.argmax(r2) # index of split
f_left,f_right = funcs[split_ix]
print(f"split point index: {split_ix}, x: {x[split_ix]}, y: {y[split_ix]}")
xd = np.linspace(min(x), max(x), 100)
plt.plot(x, y, "o")
plt.plot(xd, f_left(xd))
plt.plot(xd, f_right(xd))
plt.plot(x[split_ix], y[split_ix], "x")
plt.show()

How to set widgets to link to array for jupyternotebooks

I am trying to set an interactive notebook up that plots some interpolated GPS data. I have the plotting working by itself, but I am trying to use the ipython widgets to make it more interactive for others.
Currently, my plotting looks like this
def create_grid(array,spacing=.01):
'''
creates evenly spaced grid from the min and max of an array
'''
grid = np.arange(np.amin(array), np.amax(array),spacing)
return grid
def interpolate(x, y, z, grid_spacing = .01, model='spherical',returngrid = False):
'''Interpolates z value and uses create_grid to create a grid of values based on min and max of x and y'''
grid_x = create_grid(x,spacing = grid_spacing)
grid_y = create_grid(y, spacing = grid_spacing)
OK = OrdinaryKriging(x, y, z, variogram_model=model, verbose = False,\
enable_plotting=False, nlags = 20)
z1, ss1 = OK.execute('grid', grid_x,grid_y,mask = False)
print('Interpolation Complete')
vals=np.ma.getdata(z1)
sigma = np.ma.getdata(ss1)
if returngrid == False:
return vals,sigma
else:
return vals, sigma, grid_x, grid_y
mesh_x, mesh_y = np.meshgrid(grid_x,grid_y)
plot = plt.scatter(mesh_x, mesh_y, c = z1, cmap = cm.hsv)
cb = plt.colorbar(plot)
cb.set_label('Northing Change')
plt.show()
'''
This works currently, but I am trying to set up a widget to change the variogram model in the kriging interpolation, as well as change the field to be interpolated.
Currently, to do that I have:
def update_plot(zfield,variogram):
plt.clf()
z1, ss1, grid_x,grid_y =interpolate(lon,lat,zfield,returngrid= True,model=variogram)
mesh_x, mesh_y = np.meshgrid(grid_x,grid_y)
plot = plt.scatter(mesh_x, mesh_y, c = z1, cmap = cm.hsv)
cb = plot.colorbar(plot)
cb.set_label('Interpolated Value')
variogram = widgets.Dropdown(options = ['linear', 'power', 'gaussian', 'spherical', 'exponential', 'hole-effect'],
value = 'spherical', description = "Variogram model for interpolation")
zfield = widgets.Dropdown(options = {'Delta N':delta_n, 'Delta E': delta_e,'Delta V':delta_v},value = 'Delta N',
description = 'Interpolated value')
widgets.interactive(update_plot, variogram = variogram,zfield =zfield)
Which brings up the error
TraitError: Invalid selection: value not found
the values delta_n, delta_e and delta_v are numpy arrays. I have tried looking at documentation but it is not as detailed as something like matplotlibs documentation or something so I feel like I am kind of flying blind here.
Thank you
In this line, you specify the possible values of the Dropdown as:
zfield = widgets.Dropdown(options = {'Delta N':delta_n, 'Delta E': delta_e,'Delta V':delta_v}
When a mapping is used, the values of the dict are interpreted as the possible options. So value = 'Delta N' causes an error as this is not one of the possible values of the Dropdown (although it is one of the keys in the mapping dict). I believe you want value = delta_n instead.

How to fix this likelihood in python and plot it?

For a linear logistic regression model formulated as follows:
I was trying to code the likelihood and plot it to see what it looks like. This is how I've implemented it:
# Define likelihood function
def likelihood(beta):
# reshape correctly
beta = np.array(beta).reshape(-1, 1)
# X beta has x_i^T * \beta in every entry
Xbeta = np.matmul(X, beta).flatten()
exp_argument = np.multiply(newrows.y.values, Xbeta)
numerator = np.exp(exp_argument.sum())
denominator = np.prod((1 + np.exp(Xbeta)))
return numerator / denominator
# Prepare a grid
b1 = np.linspace(-10, -8.5, 100) # these values chosen by trial and error
b2 = np.linspace(4.5, 6.2, 100) # these values chosen by trial and error
B1, B2 = np.meshgrid(b1, b2)
# Evaluate function at the gridpoints
Zbeta = np.array([likelihood(thing) for thing in zip(B1.ravel(), B2.ravel())])
Zbeta = Zbeta.reshape(B1.shape)
# Surface plot
fig = plt.figure(figsize=(15, 6))
ax = fig.add_subplot(121, projection='3d')
ax.plot_surface(B1, B2, Zbeta)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax = fig.add_subplot(122)
ax.contour(B1, B2, Zbeta)
plt.show()
Here the matrix X is a matrix containing "fake" Bernoulli observations. It is a well known dataset called BeetleMortality, which looks like this and I've stored it in a Pandas Dataframe called data.
which are basically Binomial observations. I've created Bernoulli observations as follows:
newrows = np.zeros((data.NumberExposed.sum(), 2))
for index, row in data.iterrows():
obs = np.zeros(int(row.NumberExposed), dtype=int)
obs[:int(row.NumberKilled)] = 1
np.random.shuffle(obs)
obs = obs.reshape(-1, 1)
doses = np.repeat(row.Dose, len(obs)).reshape(-1, 1)
obs = np.hstack((doses, obs))
from_index = int(row.CumulativeN - row.NumberExposed)
to_index = int(row.CumulativeN)
newrows[from_index:to_index] = obs
# Save it as a dataframe
newrows = pd.DataFrame(newrows, columns=['x', 'y'])
# Notice that if you wanted to check that these calculations are correct use: newrows.groupby('x').sum()
newrows.head()
Everything seems to work fine, except that when I try to plot this I obtain the following plot:
Which makes absolutely no sense.. Any help? Maybe I'm implementing the likelihood function in the wrong way?

In matplotlib, how can I plot a multi-colored line, like a rainbow

I would like to plot parallel lines with different colors. E.g. rather than a single red line of thickness 6, I would like to have two parallel lines of thickness 3, with one red and one blue.
Any thoughts would be appreciated.
Merci
Even with the smart offsetting (s. below), there is still an issue in a view that has sharp angles between consecutive points.
Zoomed view of smart offsetting:
Overlaying lines of varying thickness:
Plotting parallel lines is not an easy task. Using a simple uniform offset will of course not show the desired result. This is shown in the left picture below.
Such a simple offset can be produced in matplotlib as shown in the transformation tutorial.
Method1
A better solution may be to use the idea sketched on the right side. To calculate the offset of the nth point we can use the normal vector to the line between the n-1st and the n+1st point and use the same distance along this normal vector to calculate the offset point.
The advantage of this method is that we have the same number of points in the original line as in the offset line. The disadvantage is that it is not completely accurate, as can be see in the picture.
This method is implemented in the function offset in the code below.
In order to make this useful for a matplotlib plot, we need to consider that the linewidth should be independent of the data units. Linewidth is usually given in units of points, and the offset would best be given in the same unit, such that e.g. the requirement from the question ("two parallel lines of width 3") can be met.
The idea is therefore to transform the coordinates from data to display coordinates, using ax.transData.transform. Also the offset in points o can be transformed to the same units: Using the dpi and the standard of ppi=72, the offset in display coordinates is o*dpi/ppi. After the offset in display coordinates has been applied, the inverse transform (ax.transData.inverted().transform) allows a backtransformation.
Now there is another dimension of the problem: How to assure that the offset remains the same independent of the zoom and size of the figure?
This last point can be addressed by recalculating the offset each time a zooming of resizing event has taken place.
Here is how a rainbow curve would look like produced by this method.
And here is the code to produce the image.
import numpy as np
import matplotlib.pyplot as plt
dpi = 100
def offset(x,y, o):
""" Offset coordinates given by array x,y by o """
X = np.c_[x,y].T
m = np.array([[0,-1],[1,0]])
R = np.zeros_like(X)
S = X[:,2:]-X[:,:-2]
R[:,1:-1] = np.dot(m, S)
R[:,0] = np.dot(m, X[:,1]-X[:,0])
R[:,-1] = np.dot(m, X[:,-1]-X[:,-2])
On = R/np.sqrt(R[0,:]**2+R[1,:]**2)*o
Out = On+X
return Out[0,:], Out[1,:]
def offset_curve(ax, x,y, o):
""" Offset array x,y in data coordinates
by o in points """
trans = ax.transData.transform
inv = ax.transData.inverted().transform
X = np.c_[x,y]
Xt = trans(X)
xto, yto = offset(Xt[:,0],Xt[:,1],o*dpi/72. )
Xto = np.c_[xto, yto]
Xo = inv(Xto)
return Xo[:,0], Xo[:,1]
# some single points
y = np.array([1,2,2,3,3,0])
x = np.arange(len(y))
#or try a sinus
x = np.linspace(0,9)
y=np.sin(x)*x/3.
fig, ax=plt.subplots(figsize=(4,2.5), dpi=dpi)
cols = ["#fff40b", "#00e103", "#ff9921", "#3a00ef", "#ff2121", "#af00e7"]
lw = 2.
lines = []
for i in range(len(cols)):
l, = plt.plot(x,y, lw=lw, color=cols[i])
lines.append(l)
def plot_rainbow(event=None):
xr = range(6); yr = range(6);
xr[0],yr[0] = offset_curve(ax, x,y, lw/2.)
xr[1],yr[1] = offset_curve(ax, x,y, -lw/2.)
xr[2],yr[2] = offset_curve(ax, xr[0],yr[0], lw)
xr[3],yr[3] = offset_curve(ax, xr[1],yr[1], -lw)
xr[4],yr[4] = offset_curve(ax, xr[2],yr[2], lw)
xr[5],yr[5] = offset_curve(ax, xr[3],yr[3], -lw)
for i in range(6):
lines[i].set_data(xr[i], yr[i])
plot_rainbow()
fig.canvas.mpl_connect("resize_event", plot_rainbow)
fig.canvas.mpl_connect("button_release_event", plot_rainbow)
plt.savefig(__file__+".png", dpi=dpi)
plt.show()
Method2
To avoid overlapping lines, one has to use a more complicated solution.
One could first offset every point normal to the two line segments it is part of (green points in the picture below). Then calculate the line through those offset points and find their intersection.
A particular case would be when the slopes of two subsequent line segments equal. This has to be taken care of (eps in the code below).
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
dpi = 100
def intersect(p1, p2, q1, q2, eps=1.e-10):
""" given two lines, first through points pn, second through qn,
find the intersection """
x1 = p1[0]; y1 = p1[1]; x2 = p2[0]; y2 = p2[1]
x3 = q1[0]; y3 = q1[1]; x4 = q2[0]; y4 = q2[1]
nomX = ((x1*y2-y1*x2)*(x3-x4)- (x1-x2)*(x3*y4-y3*x4))
denom = float( (x1-x2)*(y3-y4) - (y1-y2)*(x3-x4) )
nomY = (x1*y2-y1*x2)*(y3-y4) - (y1-y2)*(x3*y4-y3*x4)
if np.abs(denom) < eps:
#print "intersection undefined", p1
return np.array( p1 )
else:
return np.array( [ nomX/denom , nomY/denom ])
def offset(x,y, o, eps=1.e-10):
""" Offset coordinates given by array x,y by o """
X = np.c_[x,y].T
m = np.array([[0,-1],[1,0]])
S = X[:,1:]-X[:,:-1]
R = np.dot(m, S)
norm = np.sqrt(R[0,:]**2+R[1,:]**2) / o
On = R/norm
Outa = On+X[:,1:]
Outb = On+X[:,:-1]
G = np.zeros_like(X)
for i in xrange(0, len(X[0,:])-2):
p = intersect(Outa[:,i], Outb[:,i], Outa[:,i+1], Outb[:,i+1], eps=eps)
G[:,i+1] = p
G[:,0] = Outb[:,0]
G[:,-1] = Outa[:,-1]
return G[0,:], G[1,:]
def offset_curve(ax, x,y, o, eps=1.e-10):
""" Offset array x,y in data coordinates
by o in points """
trans = ax.transData.transform
inv = ax.transData.inverted().transform
X = np.c_[x,y]
Xt = trans(X)
xto, yto = offset(Xt[:,0],Xt[:,1],o*dpi/72., eps=eps )
Xto = np.c_[xto, yto]
Xo = inv(Xto)
return Xo[:,0], Xo[:,1]
# some single points
y = np.array([1,1,2,0,3,2,1.,4,3]) *1.e9
x = np.arange(len(y))
x[3]=x[4]
#or try a sinus
#x = np.linspace(0,9)
#y=np.sin(x)*x/3.
fig, ax=plt.subplots(figsize=(4,2.5), dpi=dpi)
cols = ["r", "b"]
lw = 11.
lines = []
for i in range(len(cols)):
l, = plt.plot(x,y, lw=lw, color=cols[i], solid_joinstyle="miter")
lines.append(l)
def plot_rainbow(event=None):
xr = range(2); yr = range(2);
xr[0],yr[0] = offset_curve(ax, x,y, lw/2.)
xr[1],yr[1] = offset_curve(ax, x,y, -lw/2.)
for i in range(2):
lines[i].set_data(xr[i], yr[i])
plot_rainbow()
fig.canvas.mpl_connect("resize_event", plot_rainbow)
fig.canvas.mpl_connect("button_release_event", plot_rainbow)
plt.show()
Note that this method should work well as long as the offset between the lines is smaller then the distance between subsequent points on the line. Otherwise method 1 may be better suited.
The best that I can think of is to take your data, generate a series of small offsets, and use fill_between to make bands of whatever color you like.
I wrote a function to do this. I don't know what shape you're trying to plot, so this may or may not work for you. I tested it on a parabola and got decent results. You can also play around with the list of colors.
def rainbow_plot(x, y, spacing=0.1):
fig, ax = plt.subplots()
colors = ['red', 'yellow', 'green', 'cyan','blue']
top = max(y)
lines = []
for i in range(len(colors)+1):
newline_data = y - top*spacing*i
lines.append(newline_data)
for i, c in enumerate(colors):
ax.fill_between(x, lines[i], lines[i+1], facecolor=c)
return fig, ax
x = np.linspace(0,1,51)
y = 1-(x-0.5)**2
rainbow_plot(x,y)

Fit the gamma distribution only to a subset of the samples

I have the histogram of my input data (in black) given in the following graph:
I'm trying to fit the Gamma distribution but not on the whole data but just to the first curve of the histogram (the first mode). The green plot in the previous graph corresponds to when I fitted the Gamma distribution on all the samples using the following python code which makes use of scipy.stats.gamma:
img = IO.read(input_file)
data = img.flatten() + abs(np.min(img)) + 1
# calculate dB positive image
img_db = 10 * np.log10(img)
img_db_pos = img_db + abs(np.min(img_db))
data = img_db_pos.flatten() + 1
# data histogram
n, bins, patches = plt.hist(data, 1000, normed=True)
# slice histogram here
# estimation of the parameters of the gamma distribution
fit_alpha, fit_loc, fit_beta = gamma.fit(data, floc=0)
x = np.linspace(0, 100)
y = gamma.pdf(x, fit_alpha, fit_loc, fit_beta)
print '(alpha, beta): (%f, %f)' % (fit_alpha, fit_beta)
# plot estimated model
plt.plot(x, y, linewidth=2, color='g')
plt.show()
How can I restrict the fitting only to the interesting subset of this data?
Update1 (slicing):
I sliced the input data by keeping only values below the max of the previous histogram, but the results were not really convincing:
This was achieved by inserting the following code below the # slice histogram here comment in the previous code:
max_data = bins[np.argmax(n)]
data = data[data < max_data]
Update2 (scipy.optimize.minimize):
The code below shows how scipy.optimize.minimize() is used to minimize an energy function to find (alpha, beta):
import matplotlib.pyplot as plt
import numpy as np
from geotiff.io import IO
from scipy.stats import gamma
from scipy.optimize import minimize
def truncated_gamma(x, max_data, alpha, beta):
gammapdf = gamma.pdf(x, alpha, loc=0, scale=beta)
norm = gamma.cdf(max_data, alpha, loc=0, scale=beta)
return np.where(x < max_data, gammapdf / norm, 0)
# read image
img = IO.read(input_file)
# calculate dB positive image
img_db = 10 * np.log10(img)
img_db_pos = img_db + abs(np.min(img_db))
data = img_db_pos.flatten() + 1
# data histogram
n, bins = np.histogram(data, 100, normed=True)
# using minimize on a slice data below max of histogram
max_data = bins[np.argmax(n)]
data = data[data < max_data]
data = np.random.choice(data, 1000)
energy = lambda p: -np.sum(np.log(truncated_gamma(data, max_data, *p)))
initial_guess = [np.mean(data), 2.]
o = minimize(energy, initial_guess, method='SLSQP')
fit_alpha, fit_beta = o.x
# plot data histogram and model
x = np.linspace(0, 100)
y = gamma.pdf(x, fit_alpha, 0, fit_beta)
plt.hist(data, 30, normed=True)
plt.plot(x, y, linewidth=2, color='g')
plt.show()
The algorithm above converged for a subset of data, and the output in o was:
x: array([ 16.66912781, 6.88105559])
But as can be seen on the screenshot below, the gamma plot doesn't fit the histogram:
You can use a general optimization tool such as scipy.optimize.minimize to fit a truncated version of the desired function, resulting in a nice fit:
First, the modified function:
def truncated_gamma(x, alpha, beta):
gammapdf = gamma.pdf(x, alpha, loc=0, scale=beta)
norm = gamma.cdf(max_data, alpha, loc=0, scale=beta)
return np.where(x<max_data, gammapdf/norm, 0)
This selects values from the gamma distribution where x < max_data, and zero elsewhere. The np.where part is not actually important here, because the data is exclusively to the left of max_data anyway. The key is normalization, because varying alpha and beta will change the area to the left of the truncation point in the original gamma.
The rest is just optimization technicalities.
It's common practise to work with logarithms, so I used what's sometimes called "energy", or the logarithm of the inverse of the probability density.
energy = lambda p: -np.sum(np.log(truncated_gamma(data, *p)))
Minimize:
initial_guess = [np.mean(data), 2.]
o = minimize(energy, initial_guess, method='SLSQP')
fit_alpha, fit_beta = o.x
My output is (alpha, beta): (11.595208, 824.712481). Like the original, it is a maximum likelihood estimate.
If you're not happy with the convergence rate, you may want to
Select a sample from your rather big dataset:
data = np.random.choice(data, 10000)
Try different algorithms using the method keyword argument.
Some optimization routines output a representation of the inverse hessian, which is useful for uncertainty estimation. Enforcement of nonnegativity for the parameters may also be a good idea.
A log-scaled plot without truncation shows the entire distribution:
Here's another possible approach using a manually created dataset in excel that more or less matched the plot given.
Raw Data
Outline
Imported data into a Pandas dataframe.
Mask the indices after the
max response index.
Create a mirror image of the remaining data.
Append the mirror image while leaving a buffer of empty space.
Fit the desired distribution to the modified data. Below I do a normal fit by the method of moments and adjust the amplitude and width.
Working Script
# Import data to dataframe.
df = pd.read_csv('sample.csv', header=0, index_col=0)
# Mask indices after index at max Y.
mask = df.index.values <= df.Y.argmax()
df = df.loc[mask, :]
scaled_y = 100*df.Y.values
# Create new df with mirror image of Y appended.
sep = 6
app_zeroes = np.append(scaled_y, np.zeros(sep, dtype=np.float))
mir_y = np.flipud(scaled_y)
new_y = np.append(app_zeroes, mir_y)
# Using Scipy-cookbook to fit a normal by method of moments.
idxs = np.arange(new_y.size) # idxs=[0, 1, 2,...,len(data)]
mid_idxs = idxs.mean() # len(data)/2
# idxs-mid_idxs is [-53.5, -52.5, ..., 52.5, len(data)/2]
scaling_param = np.sqrt(np.abs(np.sum((idxs-mid_idxs)**2*new_y)/np.sum(new_y)))
# adjust amplitude
fmax = new_y.max()*1.2 # adjusted function max to 120% max y.
# adjust width
scaling_param = scaling_param*.7 # adjusted by 70%.
# Fit normal.
fit = lambda t: fmax*np.exp(-(t-mid_idxs)**2/(2*scaling_param**2))
# Plot results.
plt.plot(new_y, '.')
plt.plot(fit(idxs), '--')
plt.show()
Result
See the scipy-cookbook fitting data page for more on fitting a normal using method of moments.

Categories

Resources