Filling under a histogram until an exact point with fill_between in Python

Currently, I am trying to fill under a histogram with the fill_between function in Python, up to the 10th and 90th percentiles of the original data.
However, the problem is that the histogram curve is not a continuous function but a series of discrete values at bin-size intervals, so I couldn't fill exactly up to the 10th or 90th percentile. I have made several attempts, but they all failed.
The code below is what I tried:
S1 = [0.34804491, 0.18036933, 0.41111951, 0.31947523, .........
      0.46212255, 0.39229157, 0.28937502, 0.22095423, 0.52415083]
N, bins = np.histogram(S1, bins=np.linspace(0.1, 0.7, 20), density=False)
bincenters = 0.5*(bins[1:]+bins[:-1])
ax.fill_between(bincenters,N,0,where=bincenters<=np.percentile(S1,10),interpolate=True,facecolor='r', alpha=0.5)
ax.fill_between(bincenters,N,0,where=bincenters>=np.percentile(S1,90),interpolate=True, facecolor='r', alpha=0.5,label = "Summer 10 P")
It seems to fill only up to the bin center just before (or just after) the given percentile value, not exactly up to it.
Any idea or help would be really appreciated.
Isaac

Try changing your last two lines to:
ax.fill_between(bincenters, 0, N, interpolate=True,
                where=((bincenters >= np.percentile(bincenters, 10)) &
                       (bincenters <= np.percentile(bincenters, 90))))
I believe you want to call np.percentile on bincenters since that is your effective x-axis.
The other difference is that you want to fill between regions where 10 < x < 90, which necessitates the use of & in the where parameter.
Edit based on comment from OP:
I think to achieve what you want, you have to do some minimal interpolation of your own. See my example below, which uses a random normal distribution and interp1d from scipy.interpolate to interpolate over bincenters.
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
# create normally distributed random data
n = 10000
data = np.random.normal(0, 1, n)
bins = np.linspace(-data.max(), data.max(), 20)
hist = np.histogram(data, bins=bins)[0]
bincenters = 0.5 * (bins[1:] + bins[:-1])
# create interpolation function and dense x-axis to interpolate over
f = interp1d(bincenters, hist, kind='cubic')
x = np.linspace(bincenters.min(), bincenters.max(), n)
plt.plot(bincenters, hist, '-o')
# calculate greatest bincenter < 10th percentile
bincenter_under10thPerc = bincenters[bincenters < np.percentile(bincenters, 10)].max()
bincenter_10thPerc = np.percentile(bincenters, 10)
bincenter_90thPerc = np.percentile(bincenters, 90)
# calculate smallest bincenter > 90th percentile
bincenter_above90thPerc = bincenters[bincenters > np.percentile(bincenters, 90)].min()
# fill between 10th percentile region using dense x-axis array, x
plt.fill_between(x, 0, f(x), interpolate=True,
                 where=((x >= bincenter_under10thPerc) &
                        (x <= bincenter_10thPerc)))
# fill between 90th percentile region using dense x-axis array, x
plt.fill_between(x, 0, f(x), interpolate=True,
                 where=((x >= bincenter_90thPerc) &
                        (x <= bincenter_above90thPerc)))
The figure I get out is below. Note that I changed the percentiles from 10/90% to 30/70% so that they show up better in the plot. Again, I hope that this is what you're trying to do.

I have a version of this that uses axvspan to make a Rectangle and then uses the hist as a clip_path:
import numpy as np
import matplotlib.pyplot as plt

def hist(sample, low=None, high=None):
    # draw the histogram
    options = dict(alpha=0.5, color='C0')
    xs, ys, patches = plt.hist(sample,
                               density=True,
                               histtype='step',
                               linewidth=3,
                               **options)
    # fill in the histogram, if desired
    if low is not None:
        x1 = low
        if high is not None:
            x2 = high
        else:
            x2 = np.max(sample)
        fill = plt.axvspan(x1, x2,
                           clip_path=patches[0],
                           **options)
Would something like that work for you?
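For illustration, a minimal usage sketch of that helper (my own example, not from the original answer; it assumes a 1-D NumPy array sample and takes the cutoff from np.percentile):
import numpy as np
import matplotlib.pyplot as plt

sample = np.random.normal(0, 1, 10_000)

# Shade the region above the 90th percentile of the data,
# clipped to the outline of the step histogram drawn by hist().
hist(sample, low=np.percentile(sample, 90))
plt.show()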

Related

Histogram line of best fit is jagged and not smooth?

I can't quite seem to figure out how to get my curve to be displayed smoothly instead of having so many sharp turns.
I am hoping to show a Boltzmann probability distribution with a nice smooth curve.
I expect it is a simple fix, but I can't see it. Can someone please help?
My code is below:
from matplotlib import pyplot as plt
import numpy as np
import scipy.stats
dE = 1
N = 500
n = 10000
# This is creating an array filled with all twos
def Create_Array(N):
    Particle_State_List_set = np.ones(N, dtype=int)
    Particle_State_List_twos = Particle_State_List_set + 1
    return Particle_State_List_twos

Array = Create_Array(N)

def Select_Random_index(N):
    Seed = np.random.default_rng()
    Particle_Index = Seed.integers(low=0, high=N - 1)
    return Particle_Index

def Exchange(N):
    Particle_Index_A = Select_Random_index(N)  # Selects a particle to be used as particle "a"
    Particle_Index_B = Select_Random_index(N)  # Selects a particle to be used as particle "b"
    # Checks whether particle "a" is already at the lowest energy level; if so, it selects another until it isn't.
    while Array[Particle_Index_A] == 1:
        Particle_Index_A = Select_Random_index(N)
    # This loop makes sure that particles "a" and "b" aren't the same particle; it chooses again until they are different.
    while Particle_Index_B == Particle_Index_A:
        Particle_Index_B = Select_Random_index(N)
    # This assigns variables to the chosen particles' energy values
    a = Array[Particle_Index_A]
    b = Array[Particle_Index_B]
    # This updates the values of the energy levels of the interacting particles
    Array[Particle_Index_A] = a - dE
    Array[Particle_Index_B] = b + dE
    return (Array[Particle_Index_A], Array[Particle_Index_B])

for i in range(n):
    Exchange(N)
# This part is making the histogram the curve will be made from
_, bins, _ = plt.hist(Array, 12, density=1, alpha=0.15, color="g")
# This is using scipy to find the mean and standard deviation in order to plot the curve
mean, std = scipy.stats.norm.fit(Array)
# This part is drawing the best fit line, using the established bins value and the std and mean from before
best_fit = scipy.stats.norm.pdf(bins, mean, std)
# Plotting the best fit curve
plt.plot(bins, best_fit, color="r", linewidth=2.5)
# These are instructions on how Python will show the graph
plt.title("Boltzmann Probability Curve")
plt.xlabel("Energy Value")
plt.ylabel('Percentage at this Energy Value')
plt.tick_params(top=True, right=True)
plt.tick_params(direction='in', length=6, width=1, colors='0')
plt.grid()
plt.show()
What's happening is that in these lines:
best_fit = scipy.stats.norm.pdf(bins, mean, std)
plt.plot(bins, best_fit, color="r", linewidth=2.5)
bins, the histogram bin edges, is being used as the x coordinates of the data points forming the best-fit line. The resulting plot is jagged because they are so widely spaced. Instead, you can define a more tightly packed set of x coordinates and use that:
bfX = np.arange(bins[0],bins[-1],.05)
best_fit = scipy.stats.norm.pdf(bfX, mean, std)
plt.plot(bfX, best_fit, color="r", linewidth=2.5)
For me that gives a nice smooth curve, but you can always use a tighter spacing than 0.05 if it's not yet to your liking.

Adding histogram bins together and plotting a figure

I have a histogram with 8192 bins, each bin imported from a line of a text file. To cut things short, it makes an awful fit, and it was suggested to me that I could reduce the statistical errors by adding counts from adjacent bins, e.g. adding bins 0-7 to make a new first bin, 8 times as wide but (roughly) 8 times as high.
Ideally, I would like to be able to output a histogram with a bin width controlled by a single constant in the code. However, my attempts to do this, instead of producing something like the first image below (from the version of my code that can only do a bin width of 1), produce something like the second image below: the fit lines are missing and there is a second empty graph in the same image file (from my attempts to generalise the code for any bin width).
The following is a histogram plotted directly from the original data, i.e. bin width = 1.
Original code output, which only works for bin width 1 though.
Example of trying bin width 8 with some code modifications.
I also need it to return a fit report and the area under the Gaussian, as this is plotted later in the code in an exponential decay curve.
Here is the section of code I think is relevant:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from numpy import exp, loadtxt, pi, sqrt, random, linspace
from lmfit import Model
import glob, os
## Load text file
x = np.linspace(0, 8191, 8192)
finalprefix = str(n).zfill(3)
fullprefix = folderToAnalyze + prefix + finalprefix
y = loadtxt(fullprefix + ".Spe", skiprows= 12, max_rows = 8192)
## Make figure and label
fig, ax = plt.subplots(figsize=(15,8))
fig.suptitle('Photon coincidence detections from $β^+$ + $β^-$ annhilation', fontsize=18)
plt.xlabel('Bins', fontsize=14)
plt.ylabel('Counts', fontsize=14)
## Plot data
ax.bar(x, y)
ax.set_xlim(600,960)
## Adding Bins Together
y = y.astype(int)
x = x.astype(int)
## create the data
data = np.repeat(x, y)
## determine the range of x
x_range = range(min(data), max(data)+1)
## determine the length of x
x_len = len(x_range)
## plot
fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(10, 10))
ax1.hist(data, bins=x_len) # outliers are not plotted
plt.show()
## given x_len determine how many bins for a given bin width
width = 8
bins = int(np.round(x_len / width))
## determine new x and y for the histogram
y, x = np.histogram(data, bins=bins)
## Fit data to Gaussian
gmodel = Model(gaussian)
result = gmodel.fit(y, x=x[:-1], amp=8, cen=approxcen, wid=1)
## result
print(result.fit_report())
fig.savefig("abw_" + finalprefix + ".png")
## Append to list if error in amplitude and amplitude itself is within reasonable bounds
if result.params['amp'].stderr < stderrThreshold and result.params['amp'] > minimumAmplitude:
    amps.append(result.params['amp'].value)
    ampserr.append(result.params['amp'].stderr)
    ts.append(MaestroT*n)
## Plot decay curve
fig, ax = plt.subplots()
ax.errorbar(ts, amps, yerr= 2*np.array(ampserr), fmt="ko-", capsize = 5, capthick= 2, elinewidth=3, markersize=5)
plt.xlabel('Time', fontsize=14)
plt.ylabel('Peak amplitude', fontsize=14)
plt.title("Decay curve of P-31 by $β^+$ emission", fontsize=14)
Some synthetic data: {1,2,1,0,0,0,0,0,6,0,0,0,0,0,0,0,7,0,0,1,0,1,0,0,6,6,0,0,0,3,0,0,3,3,3,5,4,0,4,3,1,4,0,5,6,4,0,2,0,0,0,9,6,1,1,1,0,0,3,2,2,3,0,0,0,2,4,0,0,0,0,0,0,4,10,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0}
I think this should create two very differently shaped histograms when the bin width is 1 and when it is 8. Though I have just made the numbers up and the fit may not be good, it is worth mentioning that one of the problems I was having relates to being able to add together the information read in from the text file.
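For what it's worth, here is a minimal sketch (my own, not from the question) of the bin-merging idea described above, summing each group of adjacent counts; it assumes the counts array length is a multiple of the chosen width:
import numpy as np

width = 8  # how many original bins to merge into one

# Hypothetical counts array standing in for the data loaded from the .Spe file.
y = np.random.poisson(5, 8192).astype(float)
x = np.arange(len(y))

# Sum each group of `width` adjacent counts and take the matching bin centres.
y_rebinned = y.reshape(-1, width).sum(axis=1)
x_rebinned = x.reshape(-1, width).mean(axis=1)

print(y_rebinned.shape)  # (1024,)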
In case it's useful:
-Here is the full original code
-Here is the data for that histogram

Better visualization of matplotlib plot

I want to include a plot in my thesis (the document will be a standard A4-page PDF) for which I have data for two time series, both continuous values expressed as percentages.
Both time series cover one year without Sundays, so about 310 data points each.
I tried to come up with something like this,
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
ts = day_agg_plan_temp.set_index('Date')
ts = ts['2018-01-01': '2019-01-01']
plt.figure(figsize=(20,15))
ax1 = ts.label.plot(grid=True, label='Ground Truth', marker='.')
ax2 = ts.pred.plot(grid=True, label='Prediction', marker='.')
plt.legend()
plt.show()
resulting in this:
This is not really appealing, as there is too much going on, and I want to point out the difference between each pair of data points on the blue and orange lines.
So my question is: is there a way to do this better, other than shrinking the date range (which I really don't want, because this plot is already a snippet of the actual time series, which covers almost 3 years)?
Here is some code that generates data using Fractional Brownian motion, calculates a trend using a Savitzky–Golay filter (but use whatever is best for your case study), and plots it in a way that lets you see the original data and the trend clearly at the same time.
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter
# Generating some Random Data
def brownian(x0, n, dt, delta, out=None):
    x0 = np.asarray(x0)
    r = norm.rvs(size=x0.shape + (n,), scale=delta * np.sqrt(dt))
    if out is None:
        out = np.empty(r.shape)
    np.cumsum(r, axis=-1, out=out)
    out += np.expand_dims(x0, axis=-1)
    return out
delta = 2
T = 10.0
N = 500
dt = T/N
m = 2
x = np.empty((m,N+1))
x[:, 0] = 50
brownian(x[:,0], N, dt, delta, out=x[:,1:])
t = np.linspace(0.0, N*dt, N+1)
# Obtaining the trend using some arbitrary filter
y1 = savgol_filter(x[0], 51, 3)
y2 = savgol_filter(x[1], 51, 3)
# Plotting the raw data (transparent)
plt.plot(t, x[0], color="red", alpha=0.2)
plt.plot(t, x[1], color="blue", alpha=0.2)
# Plotting the trend data (opaque)
plt.plot(t, y1, color="red")
plt.plot(t, y2, color="blue")
# Calling the plot
plt.show()
The result is this:
My point is that by playing with the colors (or transparency) you can make some data appear as if in the background, and other data (usually the most relevant) as if in the foreground. It's a UX technique (like blurring, darkening, or making the background paler).
You can also play with the line width (or style) if the vertical variability of the data is not enough to clearly separate the sets, though in your case I don't think that will be necessary.
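Applied to the time series from the question, the same foreground/background idea might look roughly like this (a sketch only: it assumes the DataFrame ts with columns label and pred from the question's code, and uses a simple rolling mean as the smoother rather than the Savitzky–Golay filter):
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 5))

# Raw series in the background (transparent)
ts['label'].plot(ax=ax, color='C0', alpha=0.25)
ts['pred'].plot(ax=ax, color='C1', alpha=0.25)

# Smoothed trends in the foreground (opaque); the window size here is arbitrary
ts['label'].rolling(14, center=True).mean().plot(ax=ax, color='C0', label='Ground Truth (trend)')
ts['pred'].rolling(14, center=True).mean().plot(ax=ax, color='C1', label='Prediction (trend)')

ax.legend()
plt.show()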

Fit a distribution to a histogram

I want to know the distribution of my data points, so first I plotted the histogram of my data. My histogram looks like the following:
Second, in order to fit them to a distribution, here's the code I wrote:
size = 20000
x = scipy.arange(size)
# fit
param = scipy.stats.gamma.fit(y)
pdf_fitted = scipy.stats.gamma.pdf(x, *param[:-2], loc = param[-2], scale = param[-1]) * size
plt.plot(pdf_fitted, color = 'r')
# plot the histogram
plt.hist(y)
plt.xlim(0, 0.3)
plt.show()
The result is:
What am I doing wrong?
Your data does not appear to be gamma-distributed, but assuming it is, you could fit it like this:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
gamma = stats.gamma
a, loc, scale = 3, 0, 2
size = 20000
y = gamma.rvs(a, loc, scale, size=size)
x = np.linspace(0, y.max(), 100)
# fit
param = gamma.fit(y, floc=0)
pdf_fitted = gamma.pdf(x, *param)
plt.plot(x, pdf_fitted, color='r')
# plot the histogram
plt.hist(y, density=True, bins=30)
plt.show()
The area under the pdf (over the entire domain) equals 1.
The area under the histogram equals 1 if you use density=True (called normed=True in older matplotlib versions).
x has length size (i.e. 20000), and pdf_fitted has the same shape as x. If we call plot and specify only the y-values, e.g. plt.plot(pdf_fitted), then values are plotted over the x-range [0, size].
That is much too large an x-range. Since the histogram is going to use an x-range of [min(y), max(y)], we must choose x to span a similar range: x = np.linspace(0, y.max()), and call plot with both the x- and y-values specified, e.g. plt.plot(x, pdf_fitted).
As Warren Weckesser points out in the comments, for most applications you know the gamma distribution's domain begins at 0. If that is the case, use floc=0 to hold the loc parameter to 0. Without floc=0, gamma.fit will try to find the best-fit value for the loc parameter too, which given the vagaries of data will generally not be exactly zero.
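For a quick illustration of that point (my own sketch, reusing the synthetic gamma sample y generated in the answer's code above):
# Fit with loc left free vs. held at zero; the free fit usually drifts slightly off 0.
a_free, loc_free, scale_free = gamma.fit(y)
a_fixed, loc_fixed, scale_fixed = gamma.fit(y, floc=0)

print("free loc:  a=%.3f, loc=%.3f, scale=%.3f" % (a_free, loc_free, scale_free))
print("fixed loc: a=%.3f, loc=%.3f, scale=%.3f" % (a_fixed, loc_fixed, scale_fixed))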

Plot histogram normalized by fixed parameter

I need to plot a normalized histogram (by normalized I mean divided by a fixed value) using the histtype='step' style.
The issue is that plt.bar() doesn't seem to support that style, and if I instead use plt.hist(), which does, I can't (or at least don't know how to) plot the normalized histogram.
Here's a MWE of what I mean:
import matplotlib.pyplot as plt
import numpy as np
def rand_data():
    return np.random.uniform(low=10., high=20., size=(200,))
# Generate data.
x1 = rand_data()
# Define histogram params.
binwidth = 0.25
x_min, x_max = x1.min(), x1.max()
bin_n = np.arange(int(x_min), int(x_max + binwidth), binwidth)
# Obtain histogram.
hist1, edges1 = np.histogram(x1, bins=bin_n)
# Normalization parameter.
param = 5.
# Plot histogram normalized by the parameter defined above.
plt.ylim(0, 3)
plt.bar(edges1[:-1], hist1 / param, width=binwidth, color='none', edgecolor='r')
plt.show()
(notice the normalization: hist1 / param) which produces this:
I can generate a histtype='step' histogram using:
plt.hist(x1, bins=bin_n, histtype='step', color='r')
and get:
but then it wouldn't be normalized by the param value.
The step plot will generate the appearance that you want from a set of bins and the count (or normalized count) in those bins. Here I've used plt.hist to get the counts, then plotted them, with the counts normalized. It's necessary to duplicate the first entry in order to get it to actually have a line there.
(a, b, c) = plt.hist(x1, bins=bin_n, histtype='step', color='r')
a = np.append(a[0], a[:])
plt.close()
plt.step(b, a/param, color='r')
This is not quite right, because it doesn't finish the plot correctly: the end of the line is left hanging in free space rather than dropping down to the x-axis.
You can fix that by adding a 0 to the end of a and one more bin point to b:
a = np.append(a[:], 0)
b = np.append(b, (2*b[-1] - b[-2]))
plt.step(b, a/param, color='r')
Lastly, the ax.step mentioned would be used if you had used
fig, ax = plt.subplots()
to give you access to the figure and axis directly. For examples, see http://matplotlib.org/examples/ticks_and_spines/spines_demo_bounds.html
Based on tcaswell's comment (use step), I've developed my own answer. Notice that I need to add elements to both the x array (one zero element at the beginning) and the y array (one zero element at the beginning and another at the end) so that step will plot the vertical lines at the beginning and the end of the bars.
Here's the code:
import matplotlib.pyplot as plt
import numpy as np
def rand_data():
    return np.random.uniform(low=10., high=20., size=(5000,))
# Generate data.
x1 = rand_data()
# Define histogram params.
binwidth = 0.25
x_min, x_max = x1.min(), x1.max()
bin_n = np.arange(int(x_min), int(x_max + binwidth), binwidth)
# Obtain histogram.
hist1, edges1 = np.histogram(x1, bins=bin_n)
# Normalization parameter.
param = 5.
# Create arrays adding elements so plt.step will plot the first and last
# vertical lines.
x2 = np.concatenate((np.array([0.]), edges1))
y2 = np.concatenate((np.array([0.]), (hist1 / param), np.array([0.])))
# Plot histogram normalized by the parameter defined above.
plt.xlim(min(edges1) - (min(edges1) / 10.), max(edges1) + (min(edges1) / 10.))
plt.bar(x2, y2, width=binwidth, color='none', edgecolor='b')
plt.step(x2, y2, where='post', color='r', ls='--')
plt.show()
and here's the result:
The red lines generated by step match the blue lines generated by bar, as can be seen.
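As a side note (not from the original answers): with current matplotlib you can also get a step histogram divided by a fixed constant directly, by passing per-sample weights of 1/param to plt.hist. A minimal sketch:
import numpy as np
import matplotlib.pyplot as plt

x1 = np.random.uniform(low=10., high=20., size=(5000,))
binwidth = 0.25
bin_n = np.arange(int(x1.min()), int(x1.max() + binwidth), binwidth)
param = 5.

# Each sample contributes 1/param to its bin, so the drawn heights equal counts/param.
plt.hist(x1, bins=bin_n, weights=np.full(x1.shape, 1. / param),
         histtype='step', color='r')
plt.show()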
