Fit a distribution to a histogram - python

I want to know the distribution of my data points, so first I plotted the histogram of my data. My histogram looks like the following:
Second, in order to fit them to a distribution, here's the code I wrote:
size = 20000
x = scipy.arange(size)
# fit
param = scipy.stats.gamma.fit(y)
pdf_fitted = scipy.stats.gamma.pdf(x, *param[:-2], loc = param[-2], scale = param[-1]) * size
plt.plot(pdf_fitted, color = 'r')
# plot the histogram
plt.hist(y)
plt.xlim(0, 0.3)
plt.show()
The result is:
What am I doing wrong?

Your data does not appear to be gamma-distributed, but assuming it is, you could fit it like this:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
gamma = stats.gamma
a, loc, scale = 3, 0, 2
size = 20000
y = gamma.rvs(a, loc, scale, size=size)
x = np.linspace(0, y.max(), 100)
# fit
param = gamma.fit(y, floc=0)
pdf_fitted = gamma.pdf(x, *param)
plt.plot(x, pdf_fitted, color='r')
# plot the histogram
plt.hist(y, normed=True, bins=30)
plt.show()
The area under the pdf (over the entire domain) equals 1.
The area under the histogram equals 1 if you use normed=True.
x has length size (i.e. 20000), and pdf_fitted has the same shape as x. If we call plot and specify only the y-values, e.g. plt.plot(pdf_fitted), then values are plotted over the x-range [0, size].
That is much too large an x-range. Since the histogram is going to use an x-range of [min(y), max(y)], we much choose x to span a similar range: x = np.linspace(0, y.max()), and call plot with both the x- and y-values specified, e.g. plt.plot(x, pdf_fitted).
As Warren Weckesser points out in the comments, for most applications you know the gamma distribution's domain begins at 0. If that is the case, use floc=0 to hold the loc parameter to 0. Without floc=0, gamma.fit will try to find the best-fit value for the loc parameter too, which given the vagaries of data will generally not be exactly zero.

Related

Adding histogram bins together and plotting a figure

I have a histogram with 8192 bins, each bin imported from a line from a text file. To cut things short, it makes an awful fit and it was suggested to mee I could reduce the statistical errors by adding counts from adjacent bins. e.g. add bins 0-7 to make a new first bin, 8 times as wide, but 8 times(roughly) as high.
Ideally, would like to just be able to output a histogram of a binwidth controlled by a single constant in the code. However my attempts to do this, instead of producing something like the first image below (which was born of the version of my code which can only do a binwidth of 1, produce something like the second image below, missing fit lines and with a second empty graph in the same image file (born of my attempts to generalise the code for any bin width).
The following a histogram plotted directly from the original data i.e. binwidth = 1
Original code output, only works for binwidth 1 though.
Example for trying bin width 8 with come code modifications
I also need it to return a fit report, and the area under the gaussian as this is plotted later on in the code, in an exponential decay curve.
Here is the section of code I think is relevant:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from numpy import exp, loadtxt, pi, sqrt, random, linspace
from lmfit import Model
import glob, os
## Load text file
x = np.linspace(0, 8191, 8192)
finalprefix = str(n).zfill(3)
fullprefix = folderToAnalyze + prefix + finalprefix
y = loadtxt(fullprefix + ".Spe", skiprows= 12, max_rows = 8192)
## Make figure and label
fig, ax = plt.subplots(figsize=(15,8))
fig.suptitle('Photon coincidence detections from $β^+$ + $β^-$ annhilation', fontsize=18)
plt.xlabel('Bins', fontsize=14)
plt.ylabel('Counts', fontsize=14)
## Plot data
ax.bar(x, y)
ax.set_xlim(600,960)
## Adding Bins Together
y = y.astype(int)
x = x.astype(int)
## create the data
data = np.repeat(x, y)
## determine the range of x
x_range = range(min(data), max(data)+1)
## determine the length of x
x_len = len(x_range)
## plot
fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(10, 10))
ax1.hist(data, bins=x_len) # outliers are not plotted
plt.show()
## given x_len determine how many bins for a given bin width
width = 8
bins = int(np.round(x_len / width))
## determine new x and y for the histogram
y, x = np.histogram(data, bins=bins)
## Fit data to Gaussian
gmodel = Model(gaussian)
result = gmodel.fit(y, x=x[:-1], amp=8, cen=approxcen, wid=1)
## result
print(result.fit_report())
fig.savefig("abw_" + finalprefix + ".png")
## Append to list if error in amplitude and amplitude itself is within reasonable bounds
if result.params['amp'].stderr < stderrThreshold and result.params['amp'] > minimumAmplitude:
amps.append(result.params['amp'].value)
ampserr.append(result.params['amp'].stderr)
ts.append(MaestroT*n)
## Plot decay curve
fig, ax = plt.subplots()
ax.errorbar(ts, amps, yerr= 2*np.array(ampserr), fmt="ko-", capsize = 5, capthick= 2, elinewidth=3, markersize=5)
plt.xlabel('Time', fontsize=14)
plt.ylabel('Peak amplitude', fontsize=14)
plt.title("Decay curve of P-31 by $β^+$ emission", fontsize=14)
Some synthetic data: {1,2,1,0,0,0,0,0,6,0,0,0,0,0,0,0,7,0,0,1,0,1,0,0,6,6,0,0,0,3,0,0,3,3,3,5,4,0,4,3,1,4,0,5,6,4,0,2,0,0,0,9,6,1,1,1,0,0,3,2,2,3,0,0,0,2,4,0,0,0,0,0,0,4,10,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0}
I think this should create 2 very different shaped histograms when the bin width is 1 and when it is 8. Though I have just made them up, the fit may not be good, and it is worth mentioning one of the problems I was having is related to being able to add together the information read in from the text file
In case it's useful:
-Here is the full original code
-Here is the data for that histogram

How to plot the angle frequency distribution curve in python

I have only the angle values for a set of data. Now i need to plot a angle distribution curve ie., angle on the x axis v/s no.of times/frequency of angle occurring on the y axis.
These are the angles sorted out for a set of data:-
[98.1706427, 99.09896751, 99.10879006, 100.47518838, 101.22770381, 101.70374296,
103.15715294, 104.4653976,105.50441485, 106.82885361, 107.4605319, 108.93228646,
111.22463712, 112.23658018, 113.31223886, 113.4000603, 114.14565594, 114.79809084,
115.15788861, 115.42991416, 115.66216071, 115.69821092, 116.56319054, 117.09232139,
119.30835385, 119.31377834, 125.88278338, 127.80937901, 132.16187185, 132.61262906,
136.6751744, 138.34164387,]
How can i do this..??
How can i write a python program for this...?? and plot it in a graph as a distribution curve
Function hist actually returns the x and y coordinates of the bins. You can use this function to prepare the data for the line plot:
y, x, _ = plt.hist(angles) # No need for the 3rd return value
xc = (x[:-1] + x[1:]) / 2 # Take centerpoints
# plt.clf()
plt.plot(xc, y)
plt.show() # Etc.
You will end up having both the histogram and the line plot. If this is not desirable, clean the canvas before plotting the line by uncommenting the call to clf().
EDIT:
If you want a line plot as well, it is better to generate the histogram with numpy and then use that information also for the line:
from matplotlib import pyplot as plt
import numpy as np
angles = [98.1706427, 99.09896751, 99.10879006, 100.47518838, 101.22770381,
101.70374296, 103.15715294, 104.4653976, 105.50441485, 106.82885361,
107.4605319, 108.93228646, 111.22463712, 112.23658018, 113.31223886,
113.4000603, 114.14565594, 114.79809084, 115.15788861, 115.42991416,
115.66216071, 115.69821092, 116.56319054, 117.09232139, 119.30835385,
119.31377834, 125.88278338, 127.80937901, 132.16187185, 132.61262906,
136.6751744, 138.34164387, ]
hist,edges = np.histogram(angles, bins=20)
bin_centers = 0.5*(edges[:-1] + edges[1:])
bin_widths = (edges[1:]-edges[:-1])
plt.bar(bin_centers,hist,width=bin_widths)
plt.plot(bin_centers, hist,'r')
plt.xlabel('angle [$^\circ$]')
plt.ylabel('frequency')
plt.show()
this looks like this:
If you are not interested in the histogram itself, leave out the line plt.bar(bin_centers,hist,width=bin_widths).
EDIT2:
I don't really see the scientific value in a smoothed histogram. If you increase the resolution of the histogram (the bins parameter in the np.histogram command), it can change quite considerably. For instance, new peaks may occur if you increase the bin count, or two peaks may merge into one if you decrease the bin count. Keeping this in mind, smoothing the histogram curve suggests that you have more data than you do. However, if you really must, you can smooth a curve as explained in this answer, i.e.
from scipy.interpolate import spline
x = np.linspace(edges[0], edges[-1], 500)
y = spline(bin_centers, hist, x)
and then plot y over x.

Setting axis values in numpy/matplotlib.plot

I am in the process of learning numpy. I wish to plot a graph of Planck's law for different temperatures and so have two np.arrays, T and l for temperature and wavelength respectively.
import scipy.constants as sc
import numpy as np
import matplotlib.pyplot as plt
lhinm = 10000 # Highest wavelength in nm
T = np.linspace(200, 1000, 10) # Temperature in K
l = np.linspace(0, lhinm*1E-9, 101) # Wavelength in m
labels = np.linspace(0, lhinm, 6) # Axis labels giving l in nm
B = (2*sc.h*sc.c**2/l[:, np.newaxis]**5)/(np.exp((sc.h*sc.c)/(T*l[:, np.newaxis]*sc.Boltzmann))-1)
for ticks in [True, False]:
plt.plot(B)
plt.xlabel("Wavelength (nm)")
if ticks:
plt.xticks(l, labels)
plt.title("With xticks()")
plt.savefig("withticks.png")
else:
plt.title("Without xticks()")
plt.savefig("withoutticks.png")
plt.show()
I would like to label the x-axis with the wavelength in nm. If I don't call plt.xitcks() the labels on the x-axis would appear to be the index in to the array B (which holds the caculated values).
I've seen answer 7559542, but when I call plt.xticks() all the values are scrunched up on the left of the axis, rather than being evenly spread along it.
So what's the best way to define my own set of values (in this case a subset of the values in l) and place them on the axis?
The problem is that you're not giving your wavelength values to plt.plot(), so Matplotlib puts the index into the array on the horizontal axis as a default. Quick solution:
plt.plot(l, B)
Without explicitly setting tick labels, that gives you this:
Of course, the values on the horizontal axis in this plot are actually in meters, not nanometers (despite the labeling), because the values you passed as the first argument to plot() (namely the array l) are in meters. That's where xticks() comes in. The two-argument version xticks(locations, labels) places the labels at the corresponding locations on the x axis. For example, xticks([1], 'one') would put a label "one" at the location x=1, if that location is in the plot.
However, it doesn't change the range displayed on the axis. In your original example, your call to xticks() placed a bunch of labels at coordinates like 10-9, but it didn't change the axis range, which was still 0 to 100. No wonder all the labels were squished over to the left.
What you need to do is call xticks() with the points at which you want to place the labels, and the desired text of the labels. The way you were doing it, xticks(l, labels), would work except that l has length 101 and labels only has length 6, so it only uses the first 6 elements of l. To fix that, you can do something like
plt.xticks(labels * 1e-9, labels)
where the multiplication by 1e-9 converts from nanometers (what you want displayed) to meters (which are the coordinates Matplotlib actually uses in the plot).
You can supply the x values to plt.plot, and let matplotlib take care of setting the tick labels.
In your case, you could plot plt.plot(l, B), but then you still have the ticks in m, not nm.
You could therefore convert your l array to nm before plotting (or during plotting). Here's a working example:
import scipy.constants as sc
import numpy as np
import matplotlib.pyplot as plt
lhinm = 10000 # Highest wavelength in nm
T = np.linspace(200, 1000, 10) # Temperature in K
l = np.linspace(0, lhinm*1E-9, 101) # Wavelength in m
l_nm = l*1e9 # Wavelength in nm
labels = np.linspace(0, lhinm, 6) # Axis labels giving l in nm
B = (2*sc.h*sc.c**2/l[:, np.newaxis]**5)/(np.exp((sc.h*sc.c)/(T*l[:, np.newaxis]*sc.Boltzmann))-1)
plt.plot(l_nm, B)
# Alternativly:
# plt.plot(l*1e9, B)
plt.xlabel("Wavelength (nm)")
plt.title("With xticks()")
plt.savefig("withticks.png")
plt.show()
You need to use same size lists at the xtick. Try setting the axis values separately from the plot value as below.
import scipy.constants as sc
import numpy as np
import matplotlib.pyplot as plt
lhinm = 10000 # Highest wavelength in nm
T = np.linspace(200, 1000, 10) # Temperature in K
l = np.linspace(0, lhinm*1E-9, 101) # Wavelength in m
ll = np.linspace(0, lhinm*1E-9, 6) # Axis values
labels = np.linspace(0, lhinm, 6) # Axis labels giving l in nm
B = (2*sc.h*sc.c**2/l[:, np.newaxis]**5)/(np.exp((sc.h*sc.c)/(T*l[:, np.newaxis]*sc.Boltzmann))-1)
for ticks in [True, False]:
plt.plot(B)
plt.xlabel("Wavelength (nm)")
if ticks:
plt.xticks(ll, labels)
plt.title("With xticks()")
plt.savefig("withticks.png")
else:
plt.title("Without xticks()")
plt.savefig("withoutticks.png")
plt.show()

Filling under histogram until exact point with fill_between python

Currently, I am trying to fill under the histogram with fill_between function in python until 10 and 90 percentile in the original numbers.
However, the problem is the histogram curve is not a "function' but the series of discrete number with the interval of bin size. I couldn't fill exactly up to 10 or 90 percentile. I have tried several tries, I failed.
The code bellow is what I tried:
S1 = [0.34804491 0.18036933 0.41111951 0.31947523 .........
0.46212255 0.39229157 0.28937502 0.22095423 0.52415083]
N, bins = np.histogram(S1, bins=np.linspace(0.1,0.7,20), normed=False)
bincenters = 0.5*(bins[1:]+bins[:-1])
ax.fill_between(bincenters,N,0,where=bincenters<=np.percentile(S1,10),interpolate=True,facecolor='r', alpha=0.5)
ax.fill_between(bincenters,N,0,where=bincenters>=np.percentile(S1,90),interpolate=True, facecolor='r', alpha=0.5,label = "Summer 10 P")
It seems to fill only until bincenter before or after given percentile number, not until up to those.
Any idea or help would be really appreciated.
Isaac
Try changing your last two lines to:
ax.fill_between(bincenters, 0, N, interpolate=True,
where=((bincenters>=np.percentile(bincenters, 10)) &
(bincenters<=np.percentile(bincenters, 90))))
I believe you want to call np.percentile on bincenters since that is your effective x-axis.
The other difference is that you want to want fill between regions where 10<x<90, which necessitates the use of & in the where parameter.
Edit based on comment from OP:
I think to achieve what you want, you have to do some minimal interpolation of your own. See my example below using a random, normal distribution in which I'm using interp1d from scipy.interpolate to interpolate over bincenters.
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
# create normally distributed random data
n = 10000
data = np.random.normal(0, 1, n)
bins = np.linspace(-data.max(), data.max(), 20)
hist = np.histogram(data, bins=bins)[0]
bincenters = 0.5 * (bins[1:] + bins[:-1])
# create interpolation function and dense x-axis to interpolate over
f = interp1d(bincenters, hist, kind='cubic')
x = np.linspace(bincenters.min(), bincenters.max(), n)
plt.plot(bincenters, hist, '-o')
# calculate greatest bincenter < 10th percentile
bincenter_under10thPerc = bincenters[bincenters < np.percentile(bincenters, 10)].max()
bincenter_10thPerc = np.percentile(bincenters, 10)
bincenter_90thPerc = np.percentile(bincenters, 90)
# calculate smallest bincenter > 90th percentile
bincenter_above90thPerc = bincenters[bincenters > np.percentile(bincenters, 90)].min()
# fill between 10th percentile region using dense x-axis array, x
plt.fill_between(x, 0, f(x), interpolate=True,
where=((x>=bincenter_under10thPerc) &
(x<=bincenter_10thPerc)))
# fill between 90th percentile region using dense x-axis array, x
plt.fill_between(x, 0, f(x), interpolate=True,
where=((x>=bincenter_90thPerc) &
(x<=bincenter_above90thPerc)))
The figure I get out is below. Note that I changed the percentiles from 10/90% to 30/70% so that they show up better in the plot. Again, I hope that this is what you're trying to do
I have a version of this that uses axvspan to make a Rectangle and then uses the hist as a clip_path:
def hist(sample, low=None, high=None):
# draw the histogram
options = dict(alpha=0.5, color='C0')
xs, ys, patches = plt.hist(sample,
density=True,
histtype='step',
linewidth=3,
**options)
# fill in the histogram, if desired
if low is not None:
x1 = low
if high is not None:
x2 = high
else:
x2 = np.max(sample)
fill = plt.axvspan(x1, x2,
clip_path=patches[0],
**options)
Would something like that work for you?

Plot histogram normalized by fixed parameter

I need to plot a plot a normalized histogram (by normalized I mean divided by a fixed value) using the histtype='step' style.
The issue is that plot.bar() doesn't seem to support that style and if I use instead plot.hist() which does, I can't (or at least don't know how) plot the normalized histogram.
Here's a MWE of what I mean:
import matplotlib.pyplot as plt
import numpy as np
def rand_data():
return np.random.uniform(low=10., high=20., size=(200,))
# Generate data.
x1 = rand_data()
# Define histogram params.
binwidth = 0.25
x_min, x_max = x1.min(), x1.max()
bin_n = np.arange(int(x_min), int(x_max + binwidth), binwidth)
# Obtain histogram.
hist1, edges1 = np.histogram(x1, bins=bin_n)
# Normalization parameter.
param = 5.
# Plot histogram normalized by the parameter defined above.
plt.ylim(0, 3)
plt.bar(edges1[:-1], hist1 / param, width=binwidth, color='none', edgecolor='r')
plt.show()
(notice the normalization: hist1 / param) which produces this:
I can generate a histtype='step' histogram using:
plt.hist(x1, bins=bin_n, histtype='step', color='r')
and get:
but then it wouldn't be normalized by the param value.
The step plot will generate the appearance that you want from a set of bins and the count (or normalized count) in those bins. Here I've used plt.hist to get the counts, then plot them, with the counts normalized. It's necessary to duplicate the first entry in order to get it to actually have a line there.
(a,b,c) = plt.hist(x1, bins=bin_n, histtype='step', color='r')
a = np.append(a[0],a[:])
plt.close()
step(b,a/param,color='r')
This is not quite right, because it doesn't finish the plot correctly. the end of the line is hanging in free space rather than dropping down the x axis.
you can fix that by adding a 0 to the end of 'a' and one more bin point to b
a=np.append(a[:],0)
b=np.append(b,(2*b[-1]-b[-2]))
step(b,a/param,color='r')
lastly, the ax.step mentioned would be used if you had used
fig, ax = plt.subplots()
to give you access to the figure and axis directly. For examples, see http://matplotlib.org/examples/ticks_and_spines/spines_demo_bounds.html
Based on tcaswell's comment (use step) I've developed my own answer. Notice that I need to add elements to both the x (one zero element at the beginning of the array) and y arrays (one zero element at the beginning and another at the end of the array) so that step will plot the vertical lines at the beginning and the end of the bars.
Here's the code:
import matplotlib.pyplot as plt
import numpy as np
def rand_data():
return np.random.uniform(low=10., high=20., size=(5000,))
# Generate data.
x1 = rand_data()
# Define histogram params.
binwidth = 0.25
x_min, x_max = x1.min(), x1.max()
bin_n = np.arange(int(x_min), int(x_max + binwidth), binwidth)
# Obtain histogram.
hist1, edges1 = np.histogram(x1, bins=bin_n)
# Normalization parameter.
param = 5.
# Create arrays adding elements so plt.bar will plot the first and last
# vertical bars.
x2 = np.concatenate((np.array([0.]), edges1))
y2 = np.concatenate((np.array([0.]), (hist1 / param), np.array([0.])))
# Plot histogram normalized by the parameter defined above.
plt.xlim(min(edges1) - (min(edges1) / 10.), max(edges1) + (min(edges1) / 10.))
plt.bar(x2, y2, width=binwidth, color='none', edgecolor='b')
plt.step(x2, y2, where='post', color='r', ls='--')
plt.show()
and here's the result:
The red lines generated by step are equal to those blue lines generated by bar as can be seen.

Categories

Resources