Setting a relative frequency in a matplotlib histogram - python

I have data as a list of floats and I want to plot it as a histogram. Hist() function does the job perfectly for plotting the absolute histogram. However, I cannot figure out how to represent it in a relative frequency format - I would like to have it as a fraction or ideally as a percentage on the y-axis.
Here is the code:
fig = plt.figure()
ax = fig.add_subplot(111)
n, bins, patches = ax.hist(mydata, bins=100, normed=1, cumulative=0)
ax.set_xlabel('Bins', size=20)
ax.set_ylabel('Frequency', size=20)
ax.legend
plt.show()
I thought normed=1 argument would do it, but it gives fractions that are too high and sometimes are greater than 1. They also seem to depend on the bin size, as if they are not normalized by the bin size or something. Nevertheless, when I set cumulative=1, it nicely sums up to 1. So, where is the catch? By the way, when I feed the same data into Origin and plot it, it gives me perfectly correct fractions. Thank you!

Because the normed option of hist (called density in current Matplotlib) returns the density of points, i.e. dN/dx.
What you need is something like this:
# assuming that mydata is a NumPy array
ax.hist(mydata, weights=np.zeros_like(mydata) + 1. / mydata.size)
# this will give you fractions
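To make the difference concrete, here is a minimal sketch on synthetic data (the array, its spread, and the bin count are illustrative assumptions, not the asker's data). It uses np.histogram, which normalizes the same way as ax.hist, and shows that density heights can exceed 1 while the weighted heights are true fractions:

```python
import numpy as np

rng = np.random.default_rng(0)
mydata = rng.normal(0.0, 0.1, 1000)  # synthetic stand-in for the real data

# density=True: heights are dN/dx, so height * bin width sums to 1
dens, edges = np.histogram(mydata, bins=100, density=True)
area = np.sum(dens * np.diff(edges))

# a weight of 1/N per sample: heights are fractions of the sample and sum to 1
weights = np.full(mydata.size, 1.0 / mydata.size)
frac, _ = np.histogram(mydata, bins=100, weights=weights)

print("area under density histogram:", area)
print("sum of fractions:", frac.sum())
print("largest density height:", dens.max())
print("largest fraction:", frac.max())
```

Because the bins here are much narrower than 1, the density heights have to be large for the area to come out as 1, which is the behaviour described in the question.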

Or you can use set_major_formatter to adjust the scale of the y-axis, as follows:
from matplotlib import ticker as tick
def adjust_y_axis(x, pos):
    # convert an absolute count into a fraction of the sample
    return "{:.3f}".format(x / (len(mydata) * 1.0))
ax.yaxis.set_major_formatter(tick.FuncFormatter(adjust_y_axis))
Just set the formatter as above before calling plt.show().

Setting density=True normalizes the histogram so that the areas of the bars sum to 1. The heights are then densities, which equal relative frequencies only when the bins have unit width. The code below draws such a histogram for 1000 samples taken from a normal distribution with mean 5 and standard deviation 2.0.
import numpy as np
import matplotlib.pyplot as plt
# Generate data from normal distribution
mu, sigma = 5, 2.0 # mean and standard deviation
mydata = np.random.normal(mu, sigma, 1000)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.hist(mydata, bins=100, density=True)
plt.show()
If you want % on the y-axis you can use PercentFormatter, with xmax set to the number of samples so that each count is shown as a share of the total:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
# Generate data from normal distribution
mu, sigma = 5, 2.0 # mean and standard deviation
mydata = np.random.normal(mu, sigma, 1000)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.hist(mydata, bins=100, density=False)
# a count equal to len(mydata) corresponds to 100%
ax.yaxis.set_major_formatter(PercentFormatter(xmax=len(mydata)))
plt.show()

You can use numpy.histogram to get the histogram value and bins, and then calculate frequency by yourself. Finally, use bar plot to get the frequency histogram.
hist, edges = np.histogram(p_hat)  # p_hat is the data array
freq = hist / float(hist.sum())
width = np.diff(edges)  # width of each bin
# with align="edge" each bar starts at the left edge of its bin
plt.bar(edges[:-1], freq, width=width, align="edge", ec="k")
plt.gca().set(xlabel='x', ylabel='frequency')

Related

Plot density histogram of Bernoulli sample and a Bernoulli pmf together

Summary of Question:
Why is my density from my sample so different to the pmf and how can I perform this simulation so that the pmf and the sample estimates are similar.
Question:
I have simulated a sample of independent Bernoulli trials using scipy. I am now trying to take a density histogram of the sample that I created and compare it to the pmf (probability mass function). I would expect the density histogram to show two bins each hovering near the pmf but instead, I have 2 bins above the pmf values at 5. Could someone please show me how to create a density histogram that does not do this for the Bernoulli? I tried a similar simulation with a few other distributions and it seemed to work fine. What am I missing here and could you show me how to manipulate my code to make this work?
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
trials = 10**3
p = 0.5
sample_bernoulli = stats.bernoulli.rvs(p, size=trials) # Generate Bernoulli RV
plt.plot((0,1), stats.bernoulli.pmf((0,1), p), 'bo', ms=8, label='bernoulli pmf')
# Density histogram of generated values
plt.hist(sample_bernoulli, density=True, alpha=0.5, color='steelblue', edgecolor='none')
plt.show()
I must apologize if this is a simple or trivial question but I couldn't find a solution online and found the issue interesting. Any help at all would be appreciated.
The reason is that plt.hist is primarily meant to work with continuous distributions. If you don't provide explicit bin boundaries, plt.hist just creates 10 equally spaced bins between the minimum and maximum value. Most of these bins will be empty. With only two possible data values, there should be just two bins, so 3 boundaries:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
trials = 10**3
p = 0.5
sample_bernoulli = stats.bernoulli.rvs(p, size=trials) # Generate Bernoulli RV
plt.plot((0,1), stats.bernoulli.pmf((0,1), p), 'bo', ms=8, label='bernoulli pmf')
# Density histogram of generated values
plt.hist(sample_bernoulli, density=True, alpha=0.5, color='steelblue', edgecolor='none', bins=np.linspace(-0.5, 1.5, 3))
plt.show()
Here is a visualization of the default bin boundaries and how the samples fit into the bins. Note that with density=True, the histogram is normalized such that the area of all the bars sums to 1. In this case two bars are 0.1 wide and about 5.0 high, while 8 others have height zero. So, the total area is 2*0.1*5 + 8*0.0 = 1.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
trials = 10 ** 3
p = 0.5
sample_bernoulli = stats.bernoulli.rvs(p, size=trials) # Generate Bernoulli RV
# Density histogram of generated values with default bins
values, binbounds, bars = plt.hist(sample_bernoulli, density=True, alpha=0.2, color='steelblue', edgecolor='none')
# show the bin boundaries
plt.vlines(binbounds, 0, max(values) * 1.05, color='crimson', ls=':')
# show the sample values with a random displacement
plt.scatter(sample_bernoulli * 0.9 + np.random.uniform(0, 0.1, trials),
            np.random.uniform(0, max(values), trials), color='lime')
# show the index of each bin
for i in range(len(binbounds) - 1):
    plt.text((binbounds[i] + binbounds[i + 1]) / 2, max(values) / 2, i, ha='center', va='center', fontsize=20, color='crimson')
plt.show()
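The area arithmetic above can also be checked numerically without plotting. This is a small sketch on a synthetic 0/1 sample (the seed and the use of rng.integers in place of stats.bernoulli.rvs are assumptions made to keep it self-contained):

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.integers(0, 2, size=1000)  # synthetic 0/1 sample with p = 0.5

# default binning: 10 equal-width bins between min and max, each 0.1 wide
heights, edges = np.histogram(sample, density=True)
area = np.sum(heights * np.diff(edges))

# two unit-width bins centred on 0 and 1: heights equal the sample fractions
heights2, _ = np.histogram(sample, bins=[-0.5, 0.5, 1.5], density=True)

print("default heights:", heights)
print("total area:", area)
print("unit-width heights:", heights2, "fraction of ones:", sample.mean())
```

With the default bins only the first and last bars are non-zero and their heights add up to 10, so the total area is still 1. With unit-width bins the heights coincide with the observed fractions and can be compared directly with the pmf.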

Fit histogram log scale python

I need to fit a curve with my histogram in python. I did this before with normal histograms, this time I am trying to do the same with a logarithmic plot in x.
This is my code:
import numpy as np
import matplotlib.pyplot as plt
# radius is my np.array
Rmin = min(radius)
Rmax = max(radius)
logmin = np.log(Rmin)
logmax = np.log(Rmax)
bins = 10**(np.arange(logmin,logmax,0.1))
plt.figure()
plt.xscale("log")
plt.hist(radius, bins, color = 'red')
plt.show()
This is showing a gaussian distribution. I am trying to fit a curve with it and what I did is computing the following before the show() command.
(mu, sigma) = np.log(norm.fit((radius)))
y = (mlab.normpdf(np.log(bins), mu, sigma))
plt.plot(bins, y, 'b--', linewidth=2)
My result is a very flattened curve with respect to my distribution.
Can someone help me?
I can not add the whole array r(50000 points), therefore I have added a picture showing my result. See image
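Two things in the snippets above may contribute to the flattened curve: the bin edges mix natural logarithms (np.log) with powers of 10, and a probability density integrates to 1 while the histogram shows raw counts. The following is only a sketch of one way to reconcile them, under the assumption that log10(radius) is roughly normal. The synthetic radius array, the 0.1 step, and the use of scipy.stats.norm in place of mlab.normpdf are all illustrative choices:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(0)
radius = rng.lognormal(mean=1.0, sigma=0.5, size=50_000)  # synthetic stand-in

# work consistently in log10: equal-width bins in log10(radius)
log_r = np.log10(radius)
step = 0.1
n_bins = int(np.ceil((log_r.max() - log_r.min()) / step))
edges_log = log_r.min() + step * np.arange(n_bins + 1)
centres_log = (edges_log[:-1] + edges_log[1:]) / 2
counts, _ = np.histogram(log_r, bins=edges_log)

# fit a normal to log10(radius), then scale the pdf to counts per bin
mu, sigma = norm.fit(log_r)
expected = len(radius) * step * norm.pdf(centres_log, mu, sigma)

fig, ax = plt.subplots()
ax.hist(radius, bins=10 ** edges_log, color="red")
ax.plot(10 ** centres_log, expected, "b--", linewidth=2)
ax.set_xscale("log")
```

The factor len(radius) * step converts the density into an expected count per bin, which is what an unnormalized histogram displays. Call plt.show() to display the figure.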

Plot a density function above a histogram

In Python, I have estimated the parameters for the density of a model of my distribution and I would like to plot the density function above the histogram of the distribution. In R it is similar to using the option prop=TRUE.
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
# initialization of the list "data"
# estimation of the parameter, in my case, mean and variance of a normal distribution
plt.hist(data, bins="auto") # data is the list of data
# here I would like to draw the density above the histogram
plt.show()
I guess the trickiest part is to make it fit.
Edit: I have tried this according to the first answer:
mean = np.mean(logdata)
var = np.var(logdata)
std = np.sqrt(var) # standard deviation, used by numpy as a replacement of the variance
plt.hist(logdata, bins="auto", alpha=0.5, label="empirical data")
x = np.linspace(min(logdata), max(logdata), 100)
plt.plot(x, mlab.normpdf(x, mean, std))
plt.xlabel("log(file size)")
plt.ylabel("number of files")
plt.legend(loc='upper right')
plt.grid(True)
plt.show()
But it doesn't fit the graph, here is how it looks:
** Edit 2 ** Works with the option normed=True in the histogram function.
If I understand you correctly, you have the mean and standard deviation of some data. You have plotted a histogram of it and would like to draw the normal distribution curve over the histogram. The curve can be generated with scipy.stats.norm.pdf() (older code used matplotlib.mlab.normpdf(), which has since been removed from Matplotlib).
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
mean = 100
sigma = 5
data = np.random.normal(mean, sigma, 1000)  # generate fake data
x = np.linspace(min(data), max(data), 100)
plt.hist(data, bins="auto", density=True)  # density=True replaces the old normed=True
plt.plot(x, norm.pdf(x, mean, sigma))
plt.show()
Which gives the following figure:
Edit: The above only works with a normalized histogram (density=True, formerly normed=True). If this is not an option, we can define our own function and scale it to the bin counts:
def gauss_function(x, a, x0, sigma):
    return a * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))
mean = 100
sigma = 5
data = np.random.normal(mean, sigma, 1000)  # generate fake data
x = np.linspace(min(data), max(data), 1000)
counts, edges, _ = plt.hist(data, bins="auto")
# scale the peak so that the area under the curve matches the histogram (N * bin width)
bin_width = edges[1] - edges[0]
amplitude = len(data) * bin_width / (sigma * np.sqrt(2 * np.pi))
plt.plot(x, gauss_function(x, amplitude, mean, sigma))
plt.show()
Everything you are looking for is already in seaborn.
You just have to use histplot (distplot in older seaborn releases, where it is now deprecated):
import seaborn as sns
import numpy as np
data = np.random.normal(5, 2, size=1000)
sns.histplot(data, kde=True, stat="density")

How to plot a hanging rootogram in python?

Inspired by this question, how do you make the same kind of plot in python? This plot aims at having a nice visual representation of how your distribution is off of the expected distribution. It hangs the bars of your histogram to the expected distribution line, so the difference to the expected value is read between the bottom of the bar and the x-axis, instead of between the top of the bar and the expected distribution curve.
I could not find any built in function.
The idea is to just move each bar of the histogram plot where the top of the bar is at the expected value:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm
fig, ax = plt.subplots(1, 2)
mu = 10
sig = 0.3
my_data = np.random.normal(mu, sig, 200)
x = np.linspace(9, 11, 100)
# I plot the data twice, one for the histogram only for comparison,
# and one for the rootogram.
# The trick will be to modify the histogram to make it hang to
# the expected distribution curve:
for a in ax:
    a.hist(my_data, density=True)
    a.plot(x, norm.pdf(x, mu, sig))
    a.set_ylim(-0.2)
    a.set_xlim(9, 11)
    a.hlines(0, 9, 11, linestyle="--")
for rectangle in ax[1].patches:
    # expected value in the middle of the bar
    exp = norm.pdf(rectangle.get_x() + rectangle.get_width()/2., mu, sig)
    # difference to the expected value
    diff = exp - rectangle.get_height()
    rectangle.set_y(diff)
    ax[1].plot(rectangle.get_x() + rectangle.get_width()/2., exp, "ro")
ax[0].set_title("histogram")
ax[1].set_title("hanging rootogram")
plt.tight_layout()
Which gives:
HTH
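The name "rootogram" conventionally refers to a plot on the square-root scale, which the code above does not apply. As a hedged variation (not the answer's method), the sketch below hangs bars of height sqrt(observed count) from sqrt(expected count). The bin count, the seed, and the use of the normal CDF to obtain expected counts per bin are assumptions for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sig = 10, 0.3
my_data = rng.normal(mu, sig, 200)

counts, edges = np.histogram(my_data, bins=10)
centres = (edges[:-1] + edges[1:]) / 2
widths = np.diff(edges)
# expected count per bin under the assumed normal model
expected = len(my_data) * np.diff(norm.cdf(edges, mu, sig))

# on the square-root scale, each bar hangs from sqrt(expected)
top = np.sqrt(expected)
bottom = top - np.sqrt(counts)

fig, ax = plt.subplots()
ax.bar(centres, np.sqrt(counts), width=widths, bottom=bottom, ec="k", alpha=0.5)
ax.plot(centres, top, "ro-")
ax.axhline(0, linestyle="--", color="k")
ax.set_title("hanging rootogram (square-root scale)")
```

A bar that ends above the dashed zero line marks a bin with fewer observations than expected, and a bar that crosses below it marks a bin with more. Call plt.show() to display the figure.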

Matplotlib: avoiding overlapping datapoints in a "scatter/dot/beeswarm" plot

When drawing a dot plot using matplotlib, I would like to offset overlapping datapoints to keep them all visible. For example, if I have:
CategoryA: 0,0,3,0,5
CategoryB: 5,10,5,5,10
I want each of the CategoryA "0" datapoints to be set side by side, rather than right on top of each other, while still remaining distinct from CategoryB.
In R (ggplot2) there is a "jitter" option that does this. Is there a similar option in matplotlib, or is there another approach that would lead to a similar result?
Edit: to clarify, the "beeswarm" plot in R is essentially what I have in mind, and pybeeswarm is an early but useful start at a matplotlib/Python version.
Edit: to add that Seaborn's Swarmplot, introduced in version 0.7, is an excellent implementation of what I wanted.
Extending the answer by @user2467675, here's how I did it:
import numpy as np
import matplotlib.pyplot as plt
def rand_jitter(arr):
    # jitter scale: 1% of the data range
    stdev = .01 * (max(arr) - min(arr))
    return arr + np.random.randn(len(arr)) * stdev
def jitter(x, y, s=20, c='b', marker='o', cmap=None, norm=None, vmin=None, vmax=None, alpha=None, linewidths=None, **kwargs):
    return plt.scatter(rand_jitter(x), rand_jitter(y), s=s, c=c, marker=marker, cmap=cmap, norm=norm, vmin=vmin, vmax=vmax, alpha=alpha, linewidths=linewidths, **kwargs)
The stdev variable makes sure that the jitter is enough to be seen on different scales, but it assumes that the limits of the axes are zero and the max value.
You can then call jitter instead of scatter.
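For the categorical example from the question, it may be preferable to jitter only the x positions so that the data values themselves stay exact. The sketch below is a variation on the idea above rather than the same function: it uses bounded uniform jitter, and half_width and the seed are assumed values:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
categories = {"CategoryA": [0, 0, 3, 0, 5], "CategoryB": [5, 10, 5, 5, 10]}
half_width = 0.1  # assumed maximum horizontal offset around each category

fig, ax = plt.subplots()
jittered = {}
for i, (name, values) in enumerate(categories.items()):
    # category i sits at x = i; only the x coordinate is perturbed
    x = i + rng.uniform(-half_width, half_width, len(values))
    jittered[name] = x
    ax.scatter(x, values, label=name)
ax.set_xticks(range(len(categories)))
ax.set_xticklabels(list(categories))
```

Because the offsets are bounded by half_width, the two categories stay visually distinct as long as half_width is below 0.5. Call plt.show() to display the figure.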
Seaborn provides histogram-like categorical dot-plots through sns.swarmplot() and jittered categorical dot-plots via sns.stripplot():
import seaborn as sns
sns.set(style='ticks', context='talk')
iris = sns.load_dataset('iris')
sns.swarmplot(x='species', y='sepal_length', data=iris)
sns.despine()
sns.stripplot(x='species', y='sepal_length', data=iris, jitter=0.2)
sns.despine()
I used numpy.random to "scatter/beeswarm" the data along X-axis but around a fixed point for each category, and then basically do pyplot.scatter() for each category:
import matplotlib.pyplot as plt
import numpy as np
#random data for category A, B, with B "taller"
yA, yB = np.random.randn(100), 5.0+np.random.randn(1000)
xA, xB = (np.random.normal(1, 0.1, len(yA)),
          np.random.normal(3, 0.1, len(yB)))
plt.scatter(xA, yA)
plt.scatter(xB, yB)
plt.show()
One way to approach the problem is to think of each 'row' in your scatter/dot/beeswarm plot as a bin in a histogram:
data = np.random.randn(100)
width = 0.8 # the maximum width of each 'row' in the scatter plot
xpos = 0 # the centre position of the scatter plot in x
counts, edges = np.histogram(data, bins=20)
centres = (edges[:-1] + edges[1:]) / 2.
yvals = centres.repeat(counts)
max_offset = width / counts.max()
offsets = np.hstack([np.arange(cc) - 0.5 * (cc - 1) for cc in counts])
xvals = xpos + (offsets * max_offset)
fig, ax = plt.subplots(1, 1)
ax.scatter(xvals, yvals, s=30, c='b')
This obviously involves binning the data, so you may lose some precision. If you have discrete data, you could replace:
counts, edges = np.histogram(data, bins=20)
centres = (edges[:-1] + edges[1:]) / 2.
with:
centres, counts = np.unique(data, return_counts=True)
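As a small worked example of the discrete variant, here it is applied to the CategoryB values from the question (width and xpos are taken from the snippet above; using the question's values here is an illustrative choice):

```python
import numpy as np

data = np.array([5, 10, 5, 5, 10])  # CategoryB from the question
width = 0.8  # the maximum width of each 'row'
xpos = 0     # the centre position in x

# each distinct value becomes one row; counts gives the points per row
centres, counts = np.unique(data, return_counts=True)
yvals = centres.repeat(counts)
max_offset = width / counts.max()
# symmetric integer-spaced offsets around the centre of each row
offsets = np.hstack([np.arange(cc) - 0.5 * (cc - 1) for cc in counts])
xvals = xpos + offsets * max_offset

print(list(zip(xvals.round(3), yvals)))
```

The three points at y = 5 are placed left, centre, and right of xpos, and the two points at y = 10 sit symmetrically on either side of it, so no two points share the same coordinates.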
An alternative approach that preserves the exact y-coordinates, even for continuous data, is to use a kernel density estimate to scale the amplitude of random jitter in the x-axis:
from scipy.stats import gaussian_kde
kde = gaussian_kde(data)
density = kde(data) # estimate the local density at each datapoint
# generate some random jitter between 0 and 1
jitter = np.random.rand(*data.shape) - 0.5
# scale the jitter by the KDE estimate and add it to the centre x-coordinate
xvals = 1 + (density * jitter * width * 2)
ax.scatter(xvals, data, s=30, c='g')
for sp in ['top', 'bottom', 'right']:
    ax.spines[sp].set_visible(False)
ax.tick_params(top=False, bottom=False, right=False)
ax.set_xticks([0, 1])
ax.set_xticklabels(['Histogram', 'KDE'], fontsize='x-large')
fig.tight_layout()
This second method is loosely based on how violin plots work. It still cannot guarantee that none of the points are overlapping, but I find that in practice it tends to give quite nice-looking results as long as there are a decent number of points (>20), and the distribution can be reasonably well approximated by a sum-of-Gaussians.
Not knowing of a direct mpl alternative here you have a very rudimentary proposal:
from matplotlib import pyplot as plt
from itertools import groupby
CA = [0,4,0,3,0,5]
CB = [0,0,4,4,2,2,2,2,3,0,5]
x = []
y = []
for indx, klass in enumerate([CA, CB]):
    klass = groupby(sorted(klass))
    for item, objt in klass:
        objt = list(objt)
        points = len(objt)
        # centre each group of identical values around its category position
        pos = 1 + indx + (1 - points) / 50.
        for item in objt:
            x.append(pos)
            y.append(item)
            pos += 0.04
plt.plot(x, y, 'o')
plt.xlim((0,3))
plt.show()
Seaborn's swarmplot seems like the most apt fit for what you have in mind, but you can also jitter with Seaborn's regplot:
import seaborn as sns
iris = sns.load_dataset('iris')
sns.swarmplot(x='species', y='sepal_length', data=iris)
sns.regplot(x='sepal_length',
            y='sepal_width',
            data=iris,
            fit_reg=False,  # do not fit a regression line
            x_jitter=0.1,  # could also dynamically set this with range of data
            y_jitter=0.1,
            scatter_kws={'alpha': 0.5})  # set transparency to 50%
Extending the answer by @wordsforthewise (sorry, can't comment with my reputation): if you need both jitter and hue to color the points by some categorical variable (like I did), Seaborn's lmplot is a great choice instead of regplot:
import seaborn as sns
iris = sns.load_dataset('iris')
sns.lmplot(x='sepal_length', y='sepal_width', hue='species', data=iris, fit_reg=False, x_jitter=0.1, y_jitter=0.1)
