Finding first derivative using DFT in Python - python

I want to find the first derivative of exp(sin(x)) on the interval [0, 2π] using a discrete Fourier transform. The basic idea is to first evaluate the DFT of exp(sin(x)) on the given interval, giving, say, v_k, and then to compute the inverse DFT of i k v_k, which gives the desired answer. In practice, because of how Fourier transforms are implemented in programming languages, you may need to reorder the output somewhere and/or multiply by different factors here and there.
I first did it in Mathematica, where there is an option FourierParameters, which enables you to specify a convention for the transform. Firstly, I obtained the Fourier series of a Gaussian, in order to see what the normalisation factors are that I have to multiply by and then went on finding the derivative. Unfortunately, translating my Mathematica code into Python thereafter (whereby again I first did the Fourier series of a Gaussian - this was successful), I didn't get the same results. Here is my code:
import numpy as np
import matplotlib.pyplot as plt
N=1000
xmin=0
xmax=2.0*np.pi
step = (xmax-xmin)/(N)
xdata = np.linspace(xmin, xmax-step, N)
v = np.exp(np.sin(xdata))
derv = np.cos(xdata)*v
vhat = np.fft.fft(v)
kvals1 = np.arange(0, N/2.0, 1)
kvals2 = np.arange(-N/2.0, 0, 1)
what1 = np.zeros(kvals1.size+1)
what2 = np.empty(kvals2.size)
it = np.nditer(kvals1, flags=['f_index'])
while not it.finished:
    np.put(what1, it.index, 1j*(2.0*np.pi)/((xmax-xmin))*it[0]*vhat[[int(it[0])]])
    it.iternext()
it = np.nditer(kvals2, flags=['f_index'])
while not it.finished:
    np.put(what2, it.index, 1j*(2.0*np.pi)/((xmax-xmin))*it[0]*vhat[[int(it[0])]])
    it.iternext()
xdatafull = np.concatenate((xdata, [2.0*np.pi]))
what = np.concatenate((what1, what2))
w = np.real(np.fft.ifft(what))
fig = plt.figure()
ax = plt.gca()
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.spines['bottom'].set_position(('data',0))
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data',0))
plt.plot(xdata, derv, color='blue')
plt.plot(xdatafull, w, color='red')
plt.show()
I can post the Mathematica code, if people want me to.

It turns out the problem is that np.zeros gives you an array of real zeros and not complex ones, so the imaginary parts of the values assigned to it afterwards are dropped.
Thus the solution is quite simply
import numpy as np
N=100
xmin=0
xmax=2.0*np.pi
step = (xmax-xmin)/(N)
xdata = np.linspace(step, xmax, N)
v = np.exp(np.sin(xdata))
derv = np.cos(xdata)*v
vhat = np.fft.fft(v)
what = 1j*np.zeros(N)
# Integer division: slice bounds must be integers in Python 3
what[0:N//2] = 1j*np.arange(0, N//2, 1)
what[N//2+1:] = 1j*np.arange(-(N//2) + 1, 0, 1)
what = what*vhat
w = np.real(np.fft.ifft(what))
# Then plotting
whereby the np.zeros is replaced by 1j*np.zeros
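The wavenumber bookkeeping above can also be delegated to np.fft.fftfreq, which returns the frequencies in the same order that np.fft.fft uses. The following is a minimal sketch under that assumption, not the original poster's code; the grid size and the zeroing of the Nyquist mode for even N are choices made here for illustration.

```python
import numpy as np

N = 100
xmin, xmax = 0.0, 2.0 * np.pi
step = (xmax - xmin) / N
# Periodic grid: the right endpoint is left out because it duplicates x = xmin.
xdata = xmin + step * np.arange(N)

v = np.exp(np.sin(xdata))
derv = np.cos(xdata) * v  # analytic derivative, for comparison

# Angular wavenumbers in np.fft.fft ordering: 0, 1, ..., N/2-1, -N/2, ..., -1
k = 2.0 * np.pi * np.fft.fftfreq(N, d=step)
if N % 2 == 0:
    k[N // 2] = 0.0  # drop the unmatched Nyquist mode for a first derivative

# Multiply each Fourier coefficient by i*k and transform back.
w = np.real(np.fft.ifft(1j * k * np.fft.fft(v)))
max_err = np.max(np.abs(w - derv))
print(max_err)
```

Because exp(sin(x)) is smooth and periodic, the spectral derivative should agree with the analytic one to roughly rounding error on this grid.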

Related

Converting indices in marching cubes to original x,y,z space - visualizing isosurface 3d skimage

I want to draw a volume in x1,x2,x3-space. The volume is an isosurface found by the marching cubes algorithm in skimage. The function generating the volume is pdf_grid = f(x1,x2,x3), and I want to draw the volume where pdf = 60% of max(pdf).
My issue is that the marching cubes algorithm generates vertices and faces, but how do I map those back to the x1, x2, x3-space?
My (rather limited) understanding of marching cubes is that "vertices" refer to the indices in the volume (pdf_grid in my case). If "vertices" contained only the exact indices in the grid this would have been easy, but "vertices" contains floats and not integers. It seems that marching cubes does some interpolation between grid points (according to https://www.cs.carleton.edu/cs_comps/0405/shape/marching_cubes.html), so the question is how to recover the exact values of x1,x2,x3.
import numpy as np
import scipy.stats
import scipy.optimize
import matplotlib.pyplot as plt
#Make some random data
cov = np.array([[1, .2, -.5],
[.2, 1.2, .1],
[-.5, .1, .8]])
dist = scipy.stats.multivariate_normal(mean = [1., 3., 2], cov = cov)
N = 500
x_samples = dist.rvs(size=N).T
#Create the kernel density estimator - approximation of a pdf
kernel = scipy.stats.gaussian_kde(x_samples)
x_mean = x_samples.mean(axis=1)
#Find the mode
res = scipy.optimize.minimize(lambda x: -kernel.logpdf(x),
x_mean #x0, initial guess
)
x_mode = res["x"]
num_el = 50 #number of elements in the grid
x_min = np.min(x_samples, axis = 1)
x_max = np.max(x_samples, axis = 1)
x1g, x2g, x3g = np.mgrid[x_min[0]:x_max[0]:num_el*1j,
x_min[1]:x_max[1]:num_el*1j,
x_min[2]:x_max[2]:num_el*1j
]
pdf_grid = np.zeros(x1g.shape) #implicit function/grid for the marching cubes
for a in range(x1g.shape[0]):
    for b in range(x1g.shape[1]):
        for c in range(x1g.shape[2]):
            # kernel(...) returns a length-1 array; take its single element
            pdf_grid[a,b,c] = kernel(np.array([x1g[a,b,c],
                                               x2g[a,b,c],
                                               x3g[a,b,c]]
                                              ))[0]
from mpl_toolkits.mplot3d.art3d import Poly3DCollection
from skimage import measure
iso_level = .6 #draw a volume which contains pdf_val(mode)*60%
verts, faces, normals, values = measure.marching_cubes(pdf_grid, kernel(x_mode)*iso_level)
#How to convert the figure back to x1,x2,x3 space? I just draw the output as it was done in the skimage example here https://scikit-image.org/docs/0.16.x/auto_examples/edges/plot_marching_cubes.html#sphx-glr-auto-examples-edges-plot-marching-cubes-py so you can see the volume
# Fancy indexing: `verts[faces]` to generate a collection of triangles
mesh = Poly3DCollection(verts[faces],
alpha = .5,
label = f"KDE = {iso_level}"+r"$x_{mode}$",
linewidth = .1)
mesh.set_edgecolor('k')
fig, ax = plt.subplots(subplot_kw=dict(projection='3d'))
c1 = ax.add_collection3d(mesh)
c1._facecolors2d=c1._facecolor3d
c1._edgecolors2d=c1._edgecolor3d
#Plot the samples. Marching cubes volume does not capture these samples
pdf_val = kernel(x_samples) #get density value for each point (for color-coding)
x1, x2, x3 = x_samples
scatter_plot = ax.scatter(x1, x2, x3, c=pdf_val, alpha = .2, label = r" samples")
ax.scatter(x_mode[0], x_mode[1], x_mode[2], c = "r", alpha = .2, label = r"$x_{mode}$")
ax.set_xlabel(r"$x_1$")
ax.set_ylabel(r"$x_2$")
ax.set_zlabel(r"$x_3$")
# ax.set_box_aspect([np.ptp(i) for i in x_samples]) # equal aspect ratio
cbar = fig.colorbar(scatter_plot, ax=ax)
cbar.set_label(r"$KDE(w) \approx pdf(w)$")
ax.legend()
#Make the axis limit so that the volume and samples are shown.
ax.set_xlim(- 5, np.max(verts, axis=0)[0] + 3)
ax.set_ylim(- 5, np.max(verts, axis=0)[1] + 3)
ax.set_zlim(- 5, np.max(verts, axis=0)[2] + 3)
This is probably way too late of an answer to help OP, but in case anyone else comes across this post looking for a solution to this problem, the issue stems from the marching cubes algorithm outputting the relevant vertices in array space. This space is defined by the number of elements per dimension of the mesh grid and the marching cubes algorithm does indeed do some interpolation in this space (explaining the presence of floats).
Anyway, to transform the vertices back into x1,x2,x3 space you just need to scale and shift them by the appropriate quantities. These are defined by the range and number of elements of the mesh grid and by the minimum value in each dimension. Because np.mgrid with a complex step includes both endpoints, the grid spacing along each axis is the range divided by the number of elements minus one. So, using the variables defined in the OP, the following gives the actual location of the vertices:
verts_actual = verts*((x_max-x_min)/(np.array(pdf_grid.shape)-1)) + x_min
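The index-to-coordinate mapping can be checked without running marching cubes at all. The sketch below uses a hypothetical helper, index_to_coords, and made-up grid bounds (neither is from the original post) to confirm that integer indices map back onto the np.mgrid coordinates; fractional indices are interpolated linearly by the same formula.

```python
import numpy as np

def index_to_coords(verts, x_min, x_max, shape):
    """Map (fractional) array indices to physical coordinates.

    Assumes the grid was built with np.mgrid[lo:hi:n*1j], which includes
    both endpoints, so the spacing per axis is (hi - lo) / (n - 1).
    """
    x_min = np.asarray(x_min, dtype=float)
    x_max = np.asarray(x_max, dtype=float)
    spacing = (x_max - x_min) / (np.asarray(shape) - 1)
    return np.asarray(verts) * spacing + x_min

# Illustrative bounds and grid size (hypothetical values)
x_min = np.array([-1.0, 0.0, 2.0])
x_max = np.array([1.0, 5.0, 3.0])
num_el = 5
x1g, x2g, x3g = np.mgrid[x_min[0]:x_max[0]:num_el*1j,
                         x_min[1]:x_max[1]:num_el*1j,
                         x_min[2]:x_max[2]:num_el*1j]

# A few integer index triples standing in for marching-cubes vertices
idx = np.array([[0, 0, 0], [4, 4, 4], [1, 2, 3]])
coords = index_to_coords(idx, x_min, x_max, x1g.shape)

# The same positions looked up directly in the coordinate grids
sel = tuple(idx.T)
expected = np.column_stack([x1g[sel], x2g[sel], x3g[sel]])
print(np.allclose(coords, expected))
```

skimage's marching_cubes also accepts a spacing argument, so passing the per-axis spacing there and adding x_min to the returned vertices should be equivalent.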

Estimate joint density with 2d Gaussian kernel

I have the following data set where I have to estimate the joint density of 'bwt' and 'age' using kernel density estimation with a 2-dimensional Gaussian kernel and width h=5. I can't use modules such as scipy, which have ready-made functions for this; I have to build the functions to calculate the density myself. Here's what I've got so far.
import numpy as np
import pandas as pd
babies_full = pd.read_csv("https://www2.helsinki.fi/sites/default/files/atoms/files/babies2.txt", sep='\t')
#Getting the columns I need
babies_full1=babies_full[['gestation', 'age']]
x=np.array(babies_full1,'int')
#2d Gaussian kernel
def k_2dgauss(x):
    return np.exp(-np.sum(x**2, 1)/2) / np.sqrt(2*np.pi)
#Multivariate kernel density
def mv_kernel_density(t, x, h):
    d = x.shape[1]
    return np.mean(k_2dgauss((t - x)/h))/h**d
t = np.linspace(1.0, 5.0, 50)
h=5
print(mv_kernel_density(t, x, h))
However, I get the error 'ValueError: operands could not be broadcast together with shapes (50,) (1173,2)', which I think is because of the different shapes of the arrays. I also don't understand why k_2dgauss(x) returns an array of zeros for me, since it should only return one value. In general, I am new to the concept of kernel density estimation and I don't really know if I've written the functions right, so any hints would help!
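The broadcasting error arises because t is a 1-D array of 50 numbers while each sample in x is a 2-D point, so every evaluation point also has to be a pair. A minimal NumPy-only sketch of a 2-D Gaussian KDE follows. The function names, the synthetic stand-in data and the evaluation grid are assumptions made for illustration; note also that the normalising constant of a d-dimensional standard Gaussian is (2π)^(d/2), not sqrt(2π).

```python
import numpy as np

def gauss_kernel(u):
    """Standard d-dimensional Gaussian kernel, applied along the last axis."""
    d = u.shape[-1]
    return np.exp(-0.5 * np.sum(u**2, axis=-1)) / (2.0 * np.pi) ** (d / 2)

def kde(t, x, h):
    """Density estimate at points t (m, d) from samples x (n, d), bandwidth h."""
    t = np.atleast_2d(t)
    d = x.shape[1]
    # Pairwise scaled differences between evaluation points and samples: (m, n, d)
    u = (t[:, None, :] - x[None, :, :]) / h
    return gauss_kernel(u).mean(axis=1) / h**d

# Synthetic stand-in for the (gestation, age) columns
rng = np.random.default_rng(0)
x = rng.normal(loc=[280.0, 27.0], scale=[10.0, 4.0], size=(200, 2))

# Evaluate on a regular grid covering the samples
g1 = np.linspace(200.0, 360.0, 81)
g2 = np.linspace(0.0, 54.0, 28)
G1, G2 = np.meshgrid(g1, g2, indexing="ij")
t = np.column_stack([G1.ravel(), G2.ravel()])
density = kde(t, x, h=5.0).reshape(G1.shape)

# Riemann-sum check: the estimate should integrate to roughly one
cell = (g1[1] - g1[0]) * (g2[1] - g2[0])
total = density.sum() * cell
print(total)
```

The result is one density value per grid point, which can be passed to a contour or pcolormesh plot in the same way as the histogram-based arrays.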
Following on from my comments on your original post, I think this is what you want to do, but if not then come back to me and we can try again.
# info supplied by OP
import numpy as np
import pandas as pd
babies_full = \
    pd.read_csv("https://www2.helsinki.fi/sites/default/files/atoms/files/babies2.txt", sep='\t')
#Getting the columns I need
babies_full1=babies_full[['gestation', 'age']]
x=np.array(babies_full1,'int')
# my contributions
from math import floor, ceil
def binMaker(arr, base):
    """function I already use for this sort of thing.
    arr is the arr I want to make bins for
    base is the bin separation, but does require you to import floor and ceil
    otherwise you can make these bins manually yourself"""
    binMin = floor(arr.min() / base) * base
    binMax = ceil(arr.max() / base) * base
    return np.arange(binMin, binMax + base, base)
bins1 = binMaker(x[:,0], 20.) # bins from 140. to 360. spaced 20 apart
bins2 = binMaker(x[:,1], 5.) # bins from 15. to 45. spaced 5. apart
counts = np.zeros((len(bins1)-1, len(bins2)-1)) # empty array for counts to go in
for i in range(0, len(bins1)-1): # loop over the intervals, hence the -1
    boo = (x[:,0] >= bins1[i]) * (x[:,0] < bins1[i+1])
    for j in range(0, len(bins2)-1): # loop over the intervals, hence the -1
        counts[i,j] = np.count_nonzero((x[boo,1] >= bins2[j]) *
                                       (x[boo,1] < bins2[j+1]))
# if you want your PDF to be a fraction of the total
# rather than the number of counts, do the next line
counts /= x.shape[0]
# plotting
import matplotlib.pyplot as plt
from matplotlib.colors import BoundaryNorm
# setting the levels so that each number in counts has its own colour
levels = np.linspace(-0.5, counts.max()+0.5, int(counts.max())+2)
cmap = plt.get_cmap('viridis') # or any colormap you like
norm = BoundaryNorm(levels, ncolors=cmap.N, clip=True)
fig, ax = plt.subplots(1, 1, figsize=(6,5), dpi=150)
pcm = ax.pcolormesh(bins2, bins1, counts, ec='k', lw=1)
fig.colorbar(pcm, ax=ax, label='Counts (%)')
ax.set_xlabel('Age')
ax.set_ylabel('Gestation')
ax.set_xticks(bins2)
ax.set_yticks(bins1)
plt.title('Manually making a 2D (joint) PDF')
If this is what you wanted, then there is an easier way with np.histogram2d, although I think you specified that it had to be done using your own methods and not built-in functions. I've included it anyway for completeness' sake.
pdf = np.histogram2d(x[:,0], x[:,1], bins=(bins1,bins2))[0]
pdf /= x.shape[0] # again for normalising and making a percentage
levels = np.linspace(-0.5, pdf.max()+0.5, int(pdf.max())+2)
cmap = plt.get_cmap('viridis') # or any colormap you like
norm = BoundaryNorm(levels, ncolors=cmap.N, clip=True)
fig, ax = plt.subplots(1, 1, figsize=(6,5), dpi=150)
pcm = ax.pcolormesh(bins2, bins1, pdf, ec='k', lw=1)
fig.colorbar(pcm, ax=ax, label='Counts (%)')
ax.set_xlabel('Age')
ax.set_ylabel('Gestation')
ax.set_xticks(bins2)
ax.set_yticks(bins1)
plt.title('using np.histogram2d to make a 2D (joint) PDF')
Final note: in this example, the only place where counts doesn't equal pdf is the bin with 40 <= age < 45 and 280 <= gestation < 300. I think this is because my manual version uses >= and <, so a value sitting exactly on the last bin edge is dropped, whereas np.histogram2d treats the last bin as closed on its right edge and counts it. We can see the element of x that is responsible:
>>> print(x[1011])
[280 45]
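The edge behaviour is easy to reproduce in one dimension. The snippet below is a small illustration with made-up ages and bin edges (not the babies data): NumPy's histogram functions use half-open bins except for the last one, which also includes its right edge.

```python
import numpy as np

ages = np.array([15, 44, 45])        # 45 sits exactly on the last bin edge
edges = np.array([15.0, 30.0, 45.0])

# NumPy: bins are [15, 30) and [30, 45]; the last bin is closed on the right
np_counts, _ = np.histogram(ages, bins=edges)

# Manual half-open bins [lo, hi) for every bin, as in the loop above
manual_counts = np.array([
    np.count_nonzero((ages >= lo) & (ages < hi))
    for lo, hi in zip(edges[:-1], edges[1:])
])

print(np_counts)      # [1 2]
print(manual_counts)  # [1 1]
```

Making the manual comparison inclusive for the final bin only would bring the two results into agreement.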

Spline in 3D can not be differentiated due to an AttributeError

I am trying to fit a smoothing B-spline to some data and I found this very helpful post on here. However, I not only need the spline, but also its derivatives, so I tried to add the following code to the example:
tck_der = interpolate.splder(tck, n=1)
x_der, y_der, z_der = interpolate.splev(u_fine, tck_der)
For some reason this does not seem to work due to some data type issues. I get the following traceback:
Traceback (most recent call last):
File "interpolate_point_trace.py", line 31, in spline_example
tck_der = interpolate.splder(tck, n=1)
File "/home/user/anaconda3/lib/python3.7/site-packages/scipy/interpolate/fitpack.py", line 657, in splder
return _impl.splder(tck, n)
File "/home/user/anaconda3/lib/python3.7/site-packages/scipy/interpolate/_fitpack_impl.py", line 1206, in splder
sh = (slice(None),) + ((None,)*len(c.shape[1:]))
AttributeError: 'list' object has no attribute 'shape'
The reason for this seems to be that the second argument of the tck tuple contains a list of numpy arrays. I thought turning the input data to be a numpy array as well would help, but it does not change the data types of tck.
Does this behavior reflect an error in scipy, or is the input malformed?
I tried manually turning the list into an array:
tck[1] = np.array(tck[1])
but this (which didn't surprise me) also gave an error:
ValueError: operands could not be broadcast together with shapes (0,8) (7,1)
Any ideas of what the problem could be? I have used scipy before and on 1D splines the splder function works just fine, so I assume it has something to do with the spline being a line in 3D.
------- edit --------
Here is a minimum working example:
import numpy as np
import matplotlib.pyplot as plt
from scipy import interpolate
from mpl_toolkits.mplot3d import Axes3D
total_rad = 10
z_factor = 3
noise = 0.1
num_true_pts = 200
s_true = np.linspace(0, total_rad, num_true_pts)
x_true = np.cos(s_true)
y_true = np.sin(s_true)
z_true = s_true / z_factor
num_sample_pts = 80
s_sample = np.linspace(0, total_rad, num_sample_pts)
x_sample = np.cos(s_sample) + noise * np.random.randn(num_sample_pts)
y_sample = np.sin(s_sample) + noise * np.random.randn(num_sample_pts)
z_sample = s_sample / z_factor + noise * np.random.randn(num_sample_pts)
tck, u = interpolate.splprep([x_sample, y_sample, z_sample], s=2)
x_knots, y_knots, z_knots = interpolate.splev(tck[0], tck)
u_fine = np.linspace(0, 1, num_true_pts)
x_fine, y_fine, z_fine = interpolate.splev(u_fine, tck)
# this is the part of the code I inserted: the line under this causes the crash
tck_der = interpolate.splder(tck, n=1)
x_der, y_der, z_der = interpolate.splev(u_fine, tck_der)
# end of the inserted code
fig2 = plt.figure(2)
ax3d = fig2.add_subplot(111, projection='3d')
ax3d.plot(x_true, y_true, z_true, 'b')
ax3d.plot(x_sample, y_sample, z_sample, 'r*')
ax3d.plot(x_knots, y_knots, z_knots, 'go')
ax3d.plot(x_fine, y_fine, z_fine, 'g')
fig2.show()
plt.show()
Stumbled into the same problem...
I circumvented the error by not using interpolate.splder(tck, n=1) and instead using interpolate.splev(spline_ev, tck, der=1), which returns the derivatives at the points spline_ev (see the SciPy docs).
If you need the spline I think you can then use interpolate.splprep() again.
In total something like:
import numpy as np
from scipy import interpolate
import matplotlib.pyplot as plt
points = np.random.rand(10,2) * 10
(tck, u), fp, ier, msg = interpolate.splprep(points.T, s=0, k=3, full_output=True)
spline_ev = np.linspace(0.0, 1.0, 100, endpoint=True)
spline_points = interpolate.splev(spline_ev, tck)
# Calculate derivative
spline_der_points = interpolate.splev(spline_ev, tck, der=1)
# splev returns a list of arrays (one per dimension), which splprep accepts directly
spline_der = interpolate.splprep(spline_der_points, s=0, k=3, full_output=True)
# Plot the data and derivative
fig = plt.figure()
plt.plot(points[:,0], points[:,1], '.-', label="points")
plt.plot(spline_points[0], spline_points[1], '.-', label="tck")
plt.plot(spline_der_points[0], spline_der_points[1], '.-', label="tck_der")
# Show tangent
plt.arrow(spline_points[0][23]-spline_der_points[0][23], spline_points[1][23]-spline_der_points[1][23], 2.0*spline_der_points[0][23], 2.0*spline_der_points[1][23])
plt.legend()
plt.show()
EDIT:
I also opened an issue on GitHub and, according to ev-br, the usage of interpolate.splprep is deprecated and one should use make_interp_spline / BSpline instead.
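The same workaround carries over to a curve in 3D, which is the case in the question. The following sketch uses a noise-free helix and s=0 (both assumptions chosen so the result is easy to check); the derivatives returned by splev are taken with respect to the spline parameter u, not arc length.

```python
import numpy as np
from scipy import interpolate

# Noise-free helix sampled at 80 points (illustrative data)
s = np.linspace(0, 10, 80)
x, y, z = np.cos(s), np.sin(s), s / 3.0

# Interpolating parametric spline; u is the normalised chord-length parameter
tck, u = interpolate.splprep([x, y, z], s=0)

# First derivatives with respect to u, evaluated on a fine grid
u_fine = np.linspace(0, 1, 200)
dx, dy, dz = interpolate.splev(u_fine, tck, der=1)

# Unit tangent vectors along the curve, shape (200, 3)
tangent = np.column_stack([dx, dy, dz])
tangent /= np.linalg.norm(tangent, axis=1, keepdims=True)
print(tangent[:3])
```

For this helix the chord lengths between samples are all equal, so u is proportional to s and dz/du should be constant at 10/3.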
As noted in other answers, splprep output is incompatible with splder, but is compatible with splev. And the latter can evaluate the derivatives.
However, for interpolation, there is an alternative approach, which avoids splprep altogether. I'm basically copying a reply on the SciPy issue tracker (https://github.com/scipy/scipy/issues/10389):
Here's an example of replicating the splprep outputs. First let's make sense out of the splprep output:
# start with the OP example
import numpy as np
from scipy import interpolate
points = np.random.rand(10,2) * 10
(tck, u), fp, ier, msg = interpolate.splprep(points.T, s=0, k=3, full_output=True)
# check the meaning of the `u` array: evaluation of the spline at `u`
# gives back the original points (up to a list/transpose)
xy = interpolate.splev(u, tck)
xy = np.asarray(xy)
np.allclose(xy.T, points)
Next, let's replicate it without splprep. First, build the u array: the curve is represented parametrically, and u is essentially an approximation for the arc length. Other parametrizations are possible, but here let's stick to what splprep does. Translating the pseudocode from the doc page, https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.splprep.html
vv = np.sum((points[1:, :] - points[:-1, :])**2, axis=1)
vv = np.sqrt(vv).cumsum()
vv/= vv[-1]
vv = np.r_[0, vv]
# check:
np.allclose(u, vv)
Now, interpolate along the parametric curve: points vs vv:
spl = interpolate.make_interp_spline(vv, points)
# check spl.t vs knots from splprep
spl.t - tck[0]
The result, spl, is a BSpline object which you can evaluate, differentiate etc in a usual way:
np.allclose(points, spl(vv))
# differentiate
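spl_derivative = spl.derivative()  # a new BSpline; the optional argument is the derivative order
spl_derivative(vv)  # derivative values at the parameter values vv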
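Applied to a 3D curve, the same recipe looks as follows. This is a sketch with a noise-free helix as stand-in data (an assumption; the question's data is noisy and uses smoothing, which make_interp_spline does not do).

```python
import numpy as np
from scipy import interpolate

# Points along a helix, one row per point: shape (80, 3)
s = np.linspace(0, 10, 80)
points = np.column_stack([np.cos(s), np.sin(s), s / 3.0])

# Chord-length parametrisation, normalised to [0, 1]
seg = np.sqrt(np.sum(np.diff(points, axis=0) ** 2, axis=1))
vv = np.r_[0.0, np.cumsum(seg)]
vv /= vv[-1]

# Interpolating cubic B-spline through all three coordinates at once
spl = interpolate.make_interp_spline(vv, points, k=3)

# derivative() returns a new BSpline of degree k - 1
dspl = spl.derivative()
tangents = dspl(vv)  # shape (80, 3), derivatives with respect to vv
print(tangents[:3])
```

Since the result is itself a BSpline object, it can be evaluated at any parameter value or differentiated again.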

Fourier deconvolution with numpy

I am attempting to remove my probe's function from a signal using Fourier deconvolution, but I cannot get a correct output with test signals.
import numpy as np
import matplotlib.pyplot as plt
t = np.zeros(30)
t = np.append(t, np.arange(0, 20, 0.1))
sigma = 2
mu = 5.
g = 1/np.sqrt(2*np.pi*sigma**2) * np.exp(-(np.arange(mu-3*sigma,mu+3*sigma,0.1)-mu)**2/(2*sigma**2))
def pad_signals(s1, s2):
    size = t.size +g.size - 1
    size = int(2 ** np.ceil(np.log2(size)))
    s1 = np.pad(s1, ((size-s1.size)//2, int(np.ceil((size-s1.size)/2))), 'constant', constant_values=(0, 0))
    s2 = np.pad(s2, ((size-s2.size)//2, int(np.ceil((size-s2.size)/2))), 'constant', constant_values=(0, 0))
    return s1, s2
def decon_fourier_ratio(signal, removed_signal):
    signal, removed_signal = pad_signals(signal, removed_signal)
    recovered = np.fft.fftshift(np.fft.ifft(np.fft.fft(signal)/np.fft.fft(removed_signal)))
    return np.real(recovered)
gt = (np.convolve(t, g, mode='full') / g.sum())[:230]
tr = decon_fourier_ratio(gt, g)
fig, ax = plt.subplots(nrows=2, ncols=2, sharex=True)
ax[0,0].plot(np.arange(0,np.fft.irfft(np.fft.rfft(t)).size), np.fft.irfft(np.fft.rfft(t)), label='thickness')
ax[0,1].plot(np.arange(0,np.fft.irfft(np.fft.rfft(g)).size), np.fft.irfft(np.fft.rfft(g)), label='probe shape')
ax[1,0].plot(np.arange(0,gt.size),gt, label='recorded signal')
ax[1,1].plot(np.arange(0,tr.size),tr, label='deconvolved signal')
plt.show()
The above script creates a demo sample (t) and a probe with a Gaussian shape (g). It then convolves them into a signal gt, which is what a sample would look like when probed. I pad the signals to the nearest 2^N with pad_signals(), for efficiency and to fix any non-periodicity. Then I try to remove the Gaussian probe with decon_fourier_ratio(). As is clear from the images, I do not recover the initial thickness gradient. Any ideas why the deconvolution is not working?
Note: I have also tried SciPy's deconvolve. But, this function only works for gaussians of certain widths.
Any help is greatly appreciated,
Eric
Any reason you are not doing the full convolution? If I change the construction of gt to:
g /= g.sum() # so the deconvolved signal has the same amplitude
gt = np.convolve(t, g, mode='full')
Then I get the following plots:
I can't quite tell you why you're seeing this behavior, other than that the partial convolution is probably altering the frequency content. Alternatively, you can pad your input signal with zeros and use mode='same' if you want to get the same behavior.
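To see why keeping the full convolution matters, here is a minimal noise-free sketch (an illustration, not the code behind the plots mentioned above). It zero-pads both spectra to the length of the full convolution so that the circular convolution implied by the FFT matches the linear one; padding at the end rather than symmetrically is a choice made here, and it means no fftshift is needed.

```python
import numpy as np

# Test signal and Gaussian probe as in the question
t = np.append(np.zeros(30), np.arange(0, 20, 0.1))
sigma, mu = 2.0, 5.0
xg = np.arange(mu - 3 * sigma, mu + 3 * sigma, 0.1)
g = np.exp(-(xg - mu) ** 2 / (2 * sigma ** 2))
g /= g.sum()  # unit area, so the recovered signal keeps its amplitude

# Full linear convolution: nothing is cut off
gt = np.convolve(t, g, mode='full')

# With both transforms taken at length len(gt), the product of the spectra
# corresponds exactly to the linear convolution above.
n = gt.size
ratio = np.fft.rfft(gt, n) / np.fft.rfft(g, n)
recovered = np.fft.irfft(ratio, n)[:t.size]

max_err = np.max(np.abs(recovered - t))
print(max_err)
```

With measured data the plain ratio amplifies noise wherever the probe's spectrum is small, so some form of regularisation would normally be needed; this sketch only covers the idealised case.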

Fit the gamma distribution only to a subset of the samples

I have the histogram of my input data (in black) given in the following graph:
I'm trying to fit the Gamma distribution but not on the whole data but just to the first curve of the histogram (the first mode). The green plot in the previous graph corresponds to when I fitted the Gamma distribution on all the samples using the following python code which makes use of scipy.stats.gamma:
img = IO.read(input_file)
data = img.flatten() + abs(np.min(img)) + 1
# calculate dB positive image
img_db = 10 * np.log10(img)
img_db_pos = img_db + abs(np.min(img_db))
data = img_db_pos.flatten() + 1
# data histogram
n, bins, patches = plt.hist(data, 1000, density=True)
# slice histogram here
# estimation of the parameters of the gamma distribution
fit_alpha, fit_loc, fit_beta = gamma.fit(data, floc=0)
x = np.linspace(0, 100)
y = gamma.pdf(x, fit_alpha, fit_loc, fit_beta)
print('(alpha, beta): (%f, %f)' % (fit_alpha, fit_beta))
# plot estimated model
plt.plot(x, y, linewidth=2, color='g')
plt.show()
How can I restrict the fitting only to the interesting subset of this data?
Update1 (slicing):
I sliced the input data by keeping only values below the max of the previous histogram, but the results were not really convincing:
This was achieved by inserting the following code below the # slice histogram here comment in the previous code:
max_data = bins[np.argmax(n)]
data = data[data < max_data]
Update2 (scipy.optimize.minimize):
The code below shows how scipy.optimize.minimize() is used to minimize an energy function to find (alpha, beta):
import matplotlib.pyplot as plt
import numpy as np
from geotiff.io import IO
from scipy.stats import gamma
from scipy.optimize import minimize
def truncated_gamma(x, max_data, alpha, beta):
    gammapdf = gamma.pdf(x, alpha, loc=0, scale=beta)
    norm = gamma.cdf(max_data, alpha, loc=0, scale=beta)
    return np.where(x < max_data, gammapdf / norm, 0)
# read image
img = IO.read(input_file)
# calculate dB positive image
img_db = 10 * np.log10(img)
img_db_pos = img_db + abs(np.min(img_db))
data = img_db_pos.flatten() + 1
# data histogram
n, bins = np.histogram(data, 100, density=True)
# using minimize on a slice data below max of histogram
max_data = bins[np.argmax(n)]
data = data[data < max_data]
data = np.random.choice(data, 1000)
energy = lambda p: -np.sum(np.log(truncated_gamma(data, max_data, *p)))
initial_guess = [np.mean(data), 2.]
o = minimize(energy, initial_guess, method='SLSQP')
fit_alpha, fit_beta = o.x
# plot data histogram and model
x = np.linspace(0, 100)
y = gamma.pdf(x, fit_alpha, 0, fit_beta)
plt.hist(data, 30, density=True)
plt.plot(x, y, linewidth=2, color='g')
plt.show()
The algorithm above converged for a subset of data, and the output in o was:
x: array([ 16.66912781, 6.88105559])
But as can be seen on the screenshot below, the gamma plot doesn't fit the histogram:
You can use a general optimization tool such as scipy.optimize.minimize to fit a truncated version of the desired function, resulting in a nice fit:
First, the modified function:
def truncated_gamma(x, alpha, beta):
    gammapdf = gamma.pdf(x, alpha, loc=0, scale=beta)
    norm = gamma.cdf(max_data, alpha, loc=0, scale=beta)
    return np.where(x<max_data, gammapdf/norm, 0)
This selects values from the gamma distribution where x < max_data, and zero elsewhere. The np.where part is not actually important here, because the data is exclusively to the left of max_data anyway. The key is normalization, because varying alpha and beta will change the area to the left of the truncation point in the original gamma.
The rest is just optimization technicalities.
It's common practise to work with logarithms, so I used what's sometimes called "energy", or the logarithm of the inverse of the probability density.
energy = lambda p: -np.sum(np.log(truncated_gamma(data, *p)))
Minimize:
initial_guess = [np.mean(data), 2.]
o = minimize(energy, initial_guess, method='SLSQP')
fit_alpha, fit_beta = o.x
My output is (alpha, beta): (11.595208, 824.712481). Like the original, it is a maximum likelihood estimate.
If you're not happy with the convergence rate, you may want to:
- Select a sample from your rather big dataset:
data = np.random.choice(data, 10000)
- Try different algorithms using the method keyword argument.
Some optimization routines output a representation of the inverse hessian, which is useful for uncertainty estimation. Enforcement of nonnegativity for the parameters may also be a good idea.
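Put together, the approach can be tried on synthetic data where the true parameters are known. The sketch below is an illustration under stated assumptions: the samples, the truncation point and the starting guess are made up, and it swaps in Nelder-Mead with logpdf/logcdf in place of SLSQP with pdf/cdf, purely for numerical robustness.

```python
import numpy as np
from scipy.stats import gamma
from scipy.optimize import minimize

# Synthetic data: gamma samples, keeping only the values below a cut-off
rng = np.random.default_rng(0)
true_alpha, true_beta = 5.0, 2.0
samples = rng.gamma(true_alpha, true_beta, size=50000)
max_data = 14.0
data = samples[samples < max_data]

def neg_log_likelihood(p):
    """Negative log-likelihood of a gamma truncated to x < max_data."""
    alpha, beta = p
    if alpha <= 0 or beta <= 0:
        return np.inf  # keep the search inside the valid region
    log_pdf = gamma.logpdf(data, alpha, loc=0, scale=beta)
    # Renormalise by the probability mass left of the truncation point
    log_norm = gamma.logcdf(max_data, alpha, loc=0, scale=beta)
    return -np.sum(log_pdf - log_norm)

result = minimize(neg_log_likelihood, x0=[2.0, 1.0],
                  method='Nelder-Mead', options={'maxiter': 2000})
fit_alpha, fit_beta = result.x
print(fit_alpha, fit_beta)
```

The fitted values should land near the parameters used to generate the samples, even though the optimiser never sees the part of the distribution above the cut-off.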
A log-scaled plot without truncation shows the entire distribution:
Here's another possible approach using a manually created dataset in excel that more or less matched the plot given.
Raw Data
Outline
1. Import the data into a Pandas dataframe.
2. Mask the indices after the max response index.
3. Create a mirror image of the remaining data.
4. Append the mirror image while leaving a buffer of empty space.
5. Fit the desired distribution to the modified data. Below I do a normal fit by the method of moments and adjust the amplitude and width.
Working Script
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Import data to dataframe.
df = pd.read_csv('sample.csv', header=0, index_col=0)
# Mask indices after index at max Y.
mask = df.index.values <= df.Y.argmax()
df = df.loc[mask, :]
scaled_y = 100*df.Y.values
# Create new df with mirror image of Y appended.
sep = 6
app_zeroes = np.append(scaled_y, np.zeros(sep, dtype=float))
mir_y = np.flipud(scaled_y)
new_y = np.append(app_zeroes, mir_y)
# Using Scipy-cookbook to fit a normal by method of moments.
idxs = np.arange(new_y.size) # idxs=[0, 1, 2,...,len(data)]
mid_idxs = idxs.mean() # len(data)/2
# idxs-mid_idxs is [-53.5, -52.5, ..., 52.5, len(data)/2]
scaling_param = np.sqrt(np.abs(np.sum((idxs-mid_idxs)**2*new_y)/np.sum(new_y)))
# adjust amplitude
fmax = new_y.max()*1.2 # adjusted function max to 120% max y.
# adjust width
scaling_param = scaling_param*.7 # adjusted by 70%.
# Fit normal.
fit = lambda t: fmax*np.exp(-(t-mid_idxs)**2/(2*scaling_param**2))
# Plot results.
plt.plot(new_y, '.')
plt.plot(fit(idxs), '--')
plt.show()
Result
See the scipy-cookbook fitting data page for more on fitting a normal using method of moments.
