Is there a way to tell matplotlib to "normalize" a histogram such that its area equals a specified value (other than 1)?
The option "normed = 0" in
n, bins, patches = plt.hist(x, 50, normed=0, histtype='stepfilled')
just brings it back to a frequency distribution.
Just calculate it and normalize it to any value you'd like, then use bar to plot the histogram.
On a side note, this will normalize things such that the area of all the bars is normed_value. The raw sum will not be normed_value (though it's easy to have that be the case, if you'd like).
E.g.
import numpy as np
import matplotlib.pyplot as plt
x = np.random.random(100)
normed_value = 2
hist, bins = np.histogram(x, bins=20, density=True)
widths = np.diff(bins)
hist *= normed_value
plt.bar(bins[:-1], hist, widths, align='edge')  # bins[:-1] are left edges, so align bars on their edge
plt.show()
So, in this case, if we were to integrate (sum the height multiplied by the width) the bins, we'd get 2.0 instead of 1.0. (i.e. (hist * widths).sum() will yield 2.0)
You can pass a weights argument to hist instead of using normed. For example, if your bins cover the interval [minval, maxval], you have n bins, and you want to normalize the area to A, then I think
weights = np.empty_like(x)
weights.fill(A * n / (maxval-minval) / x.size)
plt.hist(x, bins=n, range=(minval, maxval), weights=weights)
should do the trick.
EDIT: The weights argument must be the same size as x, and its effect is to make each value in x contribute the corresponding value in weights towards the bin count, instead of 1.
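As a quick sanity check of the formula above, here is a minimal sketch using np.histogram, which accepts the same bins, range and weights arguments. The sample data, the target area A = 2 and the bin count are illustrative assumptions, not values from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1000)            # illustrative data, all inside [0, 1)

A = 2.0                         # target area (assumed for this example)
n = 20                          # number of bins
minval, maxval = 0.0, 1.0       # interval covered by the bins

# Every sample contributes the same constant weight instead of 1.
weights = np.full_like(x, A * n / (maxval - minval) / x.size)

heights, edges = np.histogram(x, bins=n, range=(minval, maxval), weights=weights)

# Integrate the histogram: sum of bar height times bar width.
area = (heights * np.diff(edges)).sum()
print(area)
```

Because all samples fall inside the binned range here, the integrated area should come out at A, up to floating-point rounding.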
I think the hist function could probably do with a greater ability to control normalization, though. For example, I think as it stands, values outside the binned range are ignored when normalizing, which isn't generally what you want.
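To make the point about out-of-range values concrete, here is a small sketch comparing density=True with the constant-weights approach when part of the data lies outside the bins. The data and the [0, 1] range are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random(1000) * 2.0          # illustrative samples in [0, 2)
minval, maxval, n = 0.0, 1.0, 10    # the bins only cover [0, 1]

# density=True normalizes over the samples that fall inside the range only.
density, edges = np.histogram(x, bins=n, range=(minval, maxval), density=True)
widths = np.diff(edges)
area_density = (density * widths).sum()

# Constant weights (target area 1) divide by the total number of samples,
# so the area reflects the fraction of samples that landed in the bins.
w = np.full_like(x, n / (maxval - minval) / x.size)
weighted, _ = np.histogram(x, bins=n, range=(minval, maxval), weights=w)
area_weighted = (weighted * widths).sum()

in_range = ((x >= minval) & (x <= maxval)).mean()
print(area_density, area_weighted, in_range)
```

With density=True the area is 1 regardless of how many points were dropped, while the weighted version gives the in-range fraction, which is often the more honest picture.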
I have a range of positive integers ranging from 250-1200, with a normal distribution. I have found the answer to creating bins of equal density (Matplotlib: How to make a histogram with bins of equal area?). What I am actually looking for is to be able to retrieve the upper and lower boundaries of each bin. Is there a library/function that exists for this? or can this information be pulled out from matplotlib?
Let's take a look at the code provided in the question you linked:
def histedges_equalN(x, nbin):
    npt = len(x)
    return np.interp(np.linspace(0, npt, nbin + 1),
                     np.arange(npt),
                     np.sort(x))
x = np.random.randn(1000)
n, bins, patches = plt.hist(x, histedges_equalN(x, 10))
bins actually gives you the edges of each bin, as described in the docs of the hist function: for n bins it returns n + 1 edges (the left edge of every bin plus the right edge of the last one).
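If you want the lower and upper boundary of each bin explicitly, you can pair up consecutive edges. A minimal sketch, using np.histogram instead of plt.hist so that it runs without a plotting backend (the edges are the same either way); the random data is illustrative.

```python
import numpy as np

def histedges_equalN(x, nbin):
    """Bin edges such that each bin holds roughly the same number of points."""
    npt = len(x)
    return np.interp(np.linspace(0, npt, nbin + 1),
                     np.arange(npt),
                     np.sort(x))

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

counts, edges = np.histogram(x, bins=histedges_equalN(x, 10))

lower = edges[:-1]   # left boundary of each bin
upper = edges[1:]    # right boundary of each bin

for lo, hi, c in zip(lower, upper, counts):
    print(f"{lo:8.3f} to {hi:8.3f}: {c} points")
```

Note that every bin is half-open, [lower, upper), except the last one, which also includes its right edge.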
I'm trying to make a scaled scatter plot from a histogram. The scatter plot is fairly straightforward: make the histogram, find the bin centers, scatter plot.
nbins=7
# Some example data
A = np.random.randint(0, 10, 100)
B = np.random.rand(100)
counts, binEdges=np.histogram(A,bins=nbins)
bincenters = 0.5*(binEdges[1:]+binEdges[:-1])
fig = plt.figure(figsize=(7,5))
ax = fig.add_subplot(111)
ax.scatter(bincenters,counts,c='k', marker='.')
ax_setup(ax, 'X', 'Y')
plt.show()
but I want each element of A to only contribute a scaled value to its bin, and that scaled value is stored in B. (i.e. instead of each bin being the count of elements from A for that bin, I want each bin to be the sum of the corresponding values from B)
To do this I tried creating a list C (same length as A, and B) that had the bin number allocation for each element of A, then summing all of the values from B that go into the same bin. I thought numpy.searchsorted() is what I needed e.g.,
C = bincenters.searchsorted(A, 'right')
but this doesn't get the allocation right, and doesn't seem to return the correct number of bins.
So, how do I create a list that tells me which histogram bin each element of my data goes into?
You write
but I want each element of A to only contribute a scaled value to its bin, and that scaled value is stored in B. (i.e. instead of each bin being the count of elements from A for that bin, I want each bin to be the sum of the corresponding values from B)
IIUC, this functionality is already supported in numpy.histogram via the weights parameter:
An array of weights, of the same shape as a. Each value in a only contributes its associated weight towards the bin count (instead of 1). If normed is True, the weights are normalized, so that the integral of the density over the range remains 1.
So, for your case, it would just be
counts, binEdges=np.histogram(A, bins=nbins, weights=B)
Another point: if your intent is to plot the histogram, note that you can directly use matplotlib.pyplot's utility functions for this (which take weights as well):
from matplotlib import pyplot as plt
plt.hist(A, bins=nbins, weights=B);
Finally, if you're intent on getting the assignments to bins, then that's exactly what numpy.digitize does:
nbins=7
# Some example data
A = np.random.randint(0, 10, 10)
B = np.random.rand(10)
counts, binEdges=np.histogram(A,bins=nbins)
>>> binEdges, np.digitize(A, binEdges)
array([ 0. , 1.28571429, 2.57142857, 3.85714286, 5.14285714,
6.42857143, 7.71428571, 9. ])
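For completeness, here is a sketch of how the digitize route can reproduce the weighted histogram. One catch: np.digitize(A, binEdges) returns 1-based indices and puts the maximum value into an extra bin past the end, whereas np.histogram counts it in the last bin. Digitizing against the interior edges only is one way around that; the names C and sums are just for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
nbins = 7
A = rng.integers(0, 10, 100)    # illustrative data, as in the question
B = rng.random(100)             # per-element scale factors

counts, binEdges = np.histogram(A, bins=nbins)

# Using only the interior edges yields 0-based bin indices 0..nbins-1,
# and the maximum of A ends up in the last bin, like np.histogram.
C = np.digitize(A, binEdges[1:-1])

# Sum the values of B that belong to each bin.
sums = np.bincount(C, weights=B, minlength=nbins)

# Same result straight from np.histogram with weights.
expected, _ = np.histogram(A, bins=binEdges, weights=B)
print(sums)
print(expected)
```

Here C is the per-element bin assignment asked for in the question, and sums should match the weights=B histogram.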
I have a list.
Index of list is degree number.
Value is the probability of this degree number.
For example, x[1] = 0.01 means that the probability of degree 1 is 0.01.
I want to draw a distribution graph of this list, and I tried:
hist = plt.figure(1)
plt.hist(PrDeg, bins = 1)
plt.title("Degree Probability Histogram")
plt.xlabel("Degree")
plt.ylabel("Prob.")
hist.savefig("Prob_Hist")
PrDeg is the list I mentioned above.
But the saved figure is not correct.
The x-axis values become Prob. and the y axis is Degree (the index of the list).
How can I swap the x and y axes using pyplot?
Histograms do not usually show you probabilities; they show the count or frequency of observations within different intervals of values, called bins. pyplot defines the intervals, or bins, by splitting the range between the minimum and maximum value of your array into n equally sized bins, where n is the number you specified with the argument bins = 1. So, in this case your histogram has a single bin, which gives it its odd look. By increasing that number you will be able to better see what actually happens there.
The only information that we can get from such a histogram is that the values of your data range from 0.0 to ~0.122 and that len(PrDeg) is close to 1800. If I am right about that much, it means your graph looks like what one would expect from a histogram, and it is therefore not incorrect.
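To see how the bins argument changes what the histogram can tell you, here is a small sketch with np.histogram, which follows the same binning rule. The values array is a made-up stand-in with roughly the size and range described above, not the asker's actual PrDeg.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.random(1800) * 0.122    # hypothetical stand-in for PrDeg

# One bin: a single bar whose height is simply the number of values.
counts1, edges1 = np.histogram(values, bins=1)

# Ten bins: the range min..max is split into ten equal intervals.
counts10, edges10 = np.histogram(values, bins=10)

print(counts1, edges1)
print(counts10)
print(edges10)
```

With bins=1 the only things you can read off are the total count and the data range, which is exactly what the single-bar figure shows.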
To answer your question about swapping the axes, the argument orientation=u'horizontal' is what you are looking for. I used it in the example below, renaming the axes accordingly:
import numpy as np
import matplotlib.pyplot as plt
PrDeg = np.random.normal(0,1,10000)
print(PrDeg)
hist = plt.figure(1)
plt.hist(PrDeg, bins = 100, orientation=u'horizontal')
plt.title("Degree Probability Histogram")
plt.xlabel("count")
plt.ylabel("Values randomly generated by numpy")
hist.savefig("Prob_Hist")
plt.show()
How can I create a histogram that shows the probability distribution given an array of numbers x ranging from 0-1? I expect each bar to be <= 1 and that if I sum the y values of every bar they should add up to 1.
For example, if x=[.2, .2, .8] then I would expect a graph showing 2 bars, one at .2 with height .66, one at .8 with height .33.
I've tried:
matplotlib.pyplot.hist(x, bins=50, normed=True)
which gives me a histogram with bars that go above 1. I'm not saying that's wrong since that's what the normed parameter will do according to documentation, but that doesn't show the probabilities.
I've also tried:
counts, bins = numpy.histogram(x, bins=50, density=True)
bins = bins[:-1] + (bins[1] - bins[0])/2
matplotlib.pyplot.bar(bins, counts, 1.0/50)
which also gives me bars whose y values sum to greater than 1.
I think my original terminology was off. I have an array of continuous values [0-1) which I want to discretize and use to plot a probability mass function. I thought this might be common enough to warrant a single method to do it.
Here's the code:
import random
import numpy as np
import matplotlib.pyplot as plt
x = [random.random() for _ in range(1000)]
num_bins = 50
counts, bins = np.histogram(x, bins=num_bins)
bins = bins[:-1] + (bins[1] - bins[0])/2  # bin centers
probs = counts/float(counts.sum())
print(probs.sum())  # 1.0, up to floating-point rounding
plt.bar(bins, probs, 1.0/num_bins)
plt.show()
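A possible shortcut for the same result is to give every sample a weight of 1/N, so the bin "counts" come out directly as fractions. This is only a sketch with illustrative data, shown with np.histogram so the sum can be checked without plotting.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1000)        # illustrative values in [0, 1)
num_bins = 50

# Each sample contributes 1/N instead of 1, so each bar is a fraction.
weights = np.ones_like(x) / x.size

probs, edges = np.histogram(x, bins=num_bins, weights=weights)
print(probs.sum())
```

The same weights array can be passed to plt.hist(x, bins=num_bins, weights=weights) to draw the bars in one call.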
I think you are mistaking a sum for an integral. A proper PDF (probability density function) integrates to unity; if you simply sum the bar heights, you ignore the width of each rectangle.
import numpy as np
import pylab as plt
N = 10**5
X = np.random.normal(size=N)
counts, bins = np.histogram(X,bins=50, density=True)
bins = bins[:-1] + (bins[1] - bins[0])/2
print(np.trapz(counts, bins))  # named np.trapezoid in NumPy >= 2.0
Gives .999985, which is close enough to unity.
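To illustrate why the bars of a density histogram can exceed 1, here is a small sketch with data squeezed into a narrow interval. The interval [0, 0.1) and the bin count are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1000) * 0.1      # illustrative values in [0, 0.1)

density, edges = np.histogram(x, bins=10, density=True)
widths = np.diff(edges)

height_sum = density.sum()          # plain sum of the bar heights
area = (density * widths).sum()     # heights times widths, i.e. the integral

print(density.max(), height_sum, area)
```

The heights are far above 1 because the bins are narrow, but once each height is multiplied by its bin width the total is 1.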
EDIT: In response to the comment below:
If x=[.2, .2, .8] and I'm looking for a graph with two bars, one at .2 with height .66 because 66% of the values are at .2 and one bar at .8 with height .33, what would that graph be called and how do I generate it?
The following code:
from collections import Counter
x = [.2,.2,.8]
C = Counter(x)
total = float(sum(C.values()))
for key in C: C[key] /= total
Gives a "dictionary" C=Counter({0.2: 0.666666, 0.8: 0.333333}). From here one could construct a bar graph, but this would only work if the PDF is discrete and takes only a finite fixed set of values that are well separated from each other.
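An alternative sketch of the same idea with NumPy, in case the values are already in an array: np.unique with return_counts=True gives the distinct values and how often each occurs. This is a swapped-in technique, not part of the Counter code above.

```python
import numpy as np

x = np.array([0.2, 0.2, 0.8])

# Distinct values (sorted) and the number of times each one occurs.
values, counts = np.unique(x, return_counts=True)

# Relative frequency of each distinct value.
probs = counts / counts.sum()

print(values)
print(probs)
```

The two arrays can then be handed to a bar plot, e.g. plt.bar(values, probs, width=0.05), where the width is just an illustrative choice.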
I have a problem with a data set whose values range from 0 to tens of thousands. Showing the histogram of the whole data set with hist() is no problem. However, if I only want to show a detailed cumulative, normed histogram over, say, x = [0, 120], I have to use 600000 bins to get enough detail.
The tricky part is that if I just use the range (0, 120) for the normed, cumulative hist, it ends at 1. But the true value there is far less than 1, since the histogram is only normalized within this small range of data. Does anyone have an idea how to use hist() in matplotlib to tackle this problem? I thought this should not be so complicated that I have to write another function to draw the hist I need.
You can set bins to a list, not an integer, e.g., bins=[1,2,3,..,120,30000,60000].
To answer your comment below, here is an excerpt from the documentation:
bins:
Either an integer number of bins or a sequence giving the bins. If bins is an integer, bins + 1 bin edges will be returned, consistent with numpy.histogram() for numpy version >= 1.3, and with the new = True argument in earlier versions. Unequally spaced bins are supported if bins is a sequence.
And here is an example with cumulative normalized histogram. Notice the effect of bins = [100,125,150,160,170,180,190,200,210,220,230,240,250,275,300] on this bar plot, how the first two bars are wider than the middle bars.
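As a rough sketch of how unequal bins help with the 0 to 120 case: use unit-width bins up to 120 and one wide catch-all bin for everything above, then normalize the cumulative counts by the total number of samples. The data below is synthetic and only meant to mimic a wide-ranging data set.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.abs(rng.normal(0, 100000, 100000))   # synthetic, mostly far above 120

# Edges 0, 1, ..., 120 plus one final edge at the data maximum.
edges = np.concatenate([np.arange(0, 121), [data.max()]])

counts, _ = np.histogram(data, bins=edges)

# Cumulative fraction of ALL samples, not only those below 120.
cumulative = np.cumsum(counts) / data.size

print(cumulative[-2])   # fraction of samples below 120
print(cumulative[-1])   # 1.0 once the catch-all bin is included
```

Only the first 120 entries of cumulative need to be plotted for the detailed view; the catch-all bin is there purely so the normalization uses the whole data set.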
Hmmm, I guess this is related to your previous question (Memory error when dealing with huge data). My suggestion there doesn't seem to work for a cumulative histogram.
I can't get plt.hist() to play nice with cyborg's suggestion, so I did the cumsum and normalisation by hand:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import normal
inp = np.abs(normal(0, 100000, 100000))
bins = range(0, 120)
a,b = np.histogram(inp, bins = bins)
bar_edges = b[:-1]
bar_width = b[1] - b[0]
bar_height = (np.cumsum(a) + sum(inp<min(bins))) / len(inp)
plt.figure(1)
plt.bar(bar_edges, bar_height, width = bar_width, align = 'edge')  # bar_edges are left edges
plt.show()