How can I create a histogram that shows the probability distribution given an array of numbers x ranging from 0-1? I expect each bar to be <= 1 and that if I sum the y values of every bar they should add up to 1.
For example, if x=[.2, .2, .8] then I would expect a graph showing 2 bars, one at .2 with height .66, one at .8 with height .33.
I've tried:
matplotlib.pyplot.hist(x, bins=50, normed=True)
which gives me a histogram with bars that go above 1. I'm not saying that's wrong since that's what the normed parameter will do according to documentation, but that doesn't show the probabilities.
I've also tried:
counts, bins = numpy.histogram(x, bins=50, density=True)
bins = bins[:-1] + (bins[1] - bins[0])/2
matplotlib.pyplot.bar(bins, counts, 1.0/50)
which also gives me bars whose y values sum to greater than 1.
I think my original terminology was off. I have an array of continuous values [0-1) which I want to discretize and use to plot a probability mass function. I thought this might be common enough to warrant a single method to do it.
Here's the code:
import random
import numpy as np
import matplotlib.pyplot as plt

x = [random.random() for _ in range(1000)]
num_bins = 50
counts, bins = np.histogram(x, bins=num_bins)
bins = bins[:-1] + (bins[1] - bins[0])/2  # bin centers
probs = counts/float(counts.sum())
print(probs.sum())  # 1.0
plt.bar(bins, probs, 1.0/num_bins)
plt.show()
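A possible shortcut, sketched here on the assumption that per-bin probabilities are all that is needed: give every sample a weight of 1/N so that the bar heights sum to 1. The helper name pmf_histogram and the default range of [0, 1] are made up for this example; np.histogram does accept a weights argument.

```python
import numpy as np

def pmf_histogram(x, num_bins=50, value_range=(0.0, 1.0)):
    """Return (probs, edges) with probs summing to 1 (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    # Each sample contributes 1/N to its bin instead of 1.
    weights = np.full(x.shape, 1.0 / x.size)
    probs, edges = np.histogram(x, bins=num_bins, range=value_range,
                                weights=weights)
    return probs, edges

rng = np.random.default_rng(0)
x = rng.random(1000)          # continuous values in [0, 1)
probs, edges = pmf_histogram(x)
print(probs.sum())            # should be 1 up to rounding error
```

The same weights array can be passed to plt.hist(x, bins=50, weights=weights) to draw the bars in one call. Samples outside value_range are not counted, so the sum drops below 1 if the range does not cover the data.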
I think you are mistaking a sum for an integral. A proper PDF (probability density function) integrates to unity; if you simply sum the bar heights you leave out the width of each rectangle.
import numpy as np
import pylab as plt
N = 10**5
X = np.random.normal(size=N)
counts, bins = np.histogram(X,bins=50, density=True)
bins = bins[:-1] + (bins[1] - bins[0])/2
print(np.trapz(counts, bins))  # np.trapz is named np.trapezoid in NumPy 2.0+
Gives .999985, which is close enough to unity.
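To make the sum-versus-integral point concrete, here is a small sketch (variable names are my own) that compares the plain sum of the bar heights with the sum of height times width for a density=True histogram.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=100_000)

density, edges = np.histogram(samples, bins=50, density=True)
widths = np.diff(edges)

height_sum = density.sum()        # roughly 1 / bin_width, not 1 in general
area = (density * widths).sum()   # Riemann sum of the density
probs = density * widths          # per-bin probabilities

print(height_sum, area, probs.sum())
```

Multiplying each height by its bin width turns the density back into probabilities, and those are the values that add up to 1.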
EDIT: In response to the comment below:
If x=[.2, .2, .8] and I'm looking for a graph with two bars, one at .2 with height .66 because 66% of the values are at .2 and one bar at .8 with height .33, what would that graph be called and how do I generate it?
The following code:
from collections import Counter
x = [.2,.2,.8]
C = Counter(x)
total = float(sum(C.values()))
for key in C: C[key] /= total
Gives a "dictionary" C=Counter({0.2: 0.666666, 0.8: 0.333333}). From here one could construct a bar graph, but this would only work if the distribution is discrete and takes only a finite, fixed set of values that are well separated from each other.
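For the discrete case, the same table can be built with NumPy. This is a sketch that swaps Counter for np.unique, not a different method.

```python
import numpy as np

x = [.2, .2, .8]

# Distinct values and how often each one occurs.
values, counts = np.unique(x, return_counts=True)
probs = counts / counts.sum()

for v, p in zip(values, probs):
    print(v, round(float(p), 3))
```

A call such as plt.bar(values, probs, width=0.05) would then draw one bar per distinct value; the width of 0.05 is an arbitrary choice for this example.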
I want to know how count and bin_edges values are assigned at a time by the np.histogram.
counts,bin_edges=np.histogram(iris_setosa['sepal_length'],bins=10,density=True)
Supposing data is a 1D numpy array and bins is a strictly positive integer, the code roughly works like:
import numpy as np

def numpy_histogram(data, bins=10, density=False):
    xmin = data.min()
    xmax = data.max()
    bin_edges = np.linspace(xmin, xmax, bins + 1)
    counts = np.zeros(bins, dtype=int)
    bin_indices = ((data - xmin) / (xmax - xmin) * bins * 0.999999).astype(int)
    for i in bin_indices:
        counts[i] += 1
    if density:
        counts = counts / sum(counts) / (bin_edges[1] - bin_edges[0])
    return counts, bin_edges

counts, bin_edges = numpy_histogram(np.random.uniform(1, 10, 20), density=True)
print(sum(counts), counts)
So the min and max of the data define the bin boundaries (there is one more boundary than there are bins). xmin is subtracted from each data value, the result is divided by the total range of the data and then multiplied by the number of bins. Truncating that to an integer gives the index of the bin the value falls in. A correction by a factor slightly smaller than 1 is needed so the rightmost value doesn't fall in the following (undefined) bin.
When density=True, the counts are normalized such that the sum of the areas of all bars would be one. The bars have a width of the difference between two successive bin_edges.
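The normalization can be checked against np.histogram itself. In this sketch the variable names are my own; it recomputes the density from the raw counts and compares it with what density=True returns.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.uniform(1, 10, 20)

raw_counts, edges = np.histogram(data, bins=10)
density, _ = np.histogram(data, bins=10, density=True)

# density = count / (total count * bin width)
manual = raw_counts / raw_counts.sum() / np.diff(edges)
area = (density * np.diff(edges)).sum()

print(np.allclose(manual, density), area)
```

The first and last edges coincide with the minimum and maximum of the data, which is why every sample lands in some bin.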
PS: About Python's assignment of multiple elements simultaneously, this question is interesting.
I am new to Python and in the following code, I would like to plot a bell curve to show how the data follows a normal distribution. How would I go about it? Also, can anyone answer why, when showing the hist, I have values (x-axis) greater than 100? I would assume by defining the Randels to 100, it would not show anything above it. If I am not mistaken, the x-axis represents what "floor" I am in and the y-axis represents how many observations matched that floor. By the way, this is a DataCamp project.
"""
Let's say I roll a dice to determine if I go up or down a step in a building with
100 floors (1 step = 1 floor). If the dice is 2 or less, I go down a step. If
the dice is less than or equal to 5, I go up a step, and if the dice is equal to 6,
I go up x steps based on a random integer generator between 1 and 6. What is the probability
I will be higher than floor 60?
"""
import numpy as np
import matplotlib.pyplot as plt
# Set the seed
np.random.seed(123)
# Simulate random walk
all_walks = []
for i in range(1000) :
    random_walk = [0]
    for x in range(100) :
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        if np.random.rand() <= 0.001 : # There's a 0.1% chance I fall and have to start at 0
            step = 0
        random_walk.append(step)
    all_walks.append(random_walk)
# Create and plot np_aw_t
np_aw_t = np.transpose(np.array(all_walks))
# Select last row from np_aw_t: ends
ends = np_aw_t[-1,:]
# Plot histogram of ends, display plot
plt.hist(ends,bins=10,edgecolor='k',alpha=0.65)
plt.style.use('fivethirtyeight')
plt.xlabel("Floor")
plt.ylabel("# of times in floor")
plt.show()
You can use scipy.stats.norm to get a normal distribution, and scipy.optimize.curve_fit() to fit any function to a data set; both are described in the SciPy documentation. My suggestion would be something like the following:
import scipy.stats as ss
import numpy as np
import scipy.optimize as opt
import matplotlib.pyplot as plt
#Making a figure with two y-axis (one for the hist, one for the pdf)
#An alternative would be to multiply the pdf by the sum of counts if you just want to show the fit.
fig, ax = plt.subplots(1,1)
twinx = ax.twinx()
rands = ss.norm.rvs(loc = 1, scale = 1, size = 1000)
#hist returns the bins and the value of each bin, plot to the y-axis ax
hist = ax.hist(rands)
vals, bins = hist[0], hist[1]
#calculating the center of each bin
bin_centers = [(bins[i] + bins[i+1])/2 for i in range(len(bins)-1)]
#finding the best fit coefficients, note vals/sum(vals) to get the probability in each bin instead of the count
coeff, cov = opt.curve_fit(ss.norm.pdf, bin_centers, vals/sum(vals), p0 = [0,1] )
#loc and scale are mean and standard deviation i believe
loc, scale = coeff
#x-values to plot the normal distribution curve
x = np.linspace(min(bins), max(bins), 100)
#Evaluating the pdf with the best fit mean and std
p = ss.norm.pdf(x, loc = loc, scale = scale)
#plot the pdf to the other axis and show
twinx.plot(x,p)
plt.show()
There are likely more elegant ways to do this, but if you are new to Python and are going to use it for calculations and such, getting to know curve_fit and scipy.stats is recommended. I'm not sure I understand what you mean by "defining the Randels"; hist will plot a "standard" histogram with bins on the x-axis and the count in each bin on the y-axis. When using these counts to fit a pdf we can divide all the counts by the total number of counts, which gives the probability in each bin. Strictly speaking, a density also requires dividing by the bin width, so the fitted scale is only approximate when the bins are not one unit wide.
Hope that helps, just ask if anything is unclear :)
Edit: compact version
vals, bins,_ = ax.hist(my_histogram_data)
bin_centers = [(bins[i] + bins[i+1])/2 for i in range(len(bins)-1)]
coeff, cov = opt.curve_fit(ss.norm.pdf, bin_centers, vals/sum(vals), p0 = [0,1] )
x = np.linspace(min(bins), max(bins), 100)
p = ss.norm.pdf(x, loc = coeff[0], scale = coeff[1])
#p is now the fitted normal distribution
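The alternative mentioned in the code comments, scaling the pdf up to the counts instead of using a second axis, could look roughly like this. It is a sketch that swaps curve_fit for the sample mean and standard deviation and evaluates the normal density by hand; normal_pdf is a helper written for this example and the random data is a stand-in for the real input.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(123)
data = rng.normal(loc=1.0, scale=1.0, size=1000)  # stand-in for the real data

counts, edges = np.histogram(data, bins=10)
centers = 0.5 * (edges[:-1] + edges[1:])
width = edges[1] - edges[0]

# Estimate the parameters directly from the samples.
mu, sigma = data.mean(), data.std()

# Expected number of samples per bin: N * bin_width * pdf(center).
expected = data.size * width * normal_pdf(centers, mu, sigma)

print(mu, sigma)
print(np.round(expected, 1))
```

Plotting expected against centers on the same axis as the histogram puts the bell curve on the scale of the counts, because the bin width is included in the scaling.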
I need to use each individual bin of three histograms to use in an equation (i.e. bin1 from histogram1, histogram2, and histogram3, and then bin2 from h1, h2, h3, etc).
Is there any way to call the individual bins from a histogram to use?
Thanks in advance
Yes, when you call plt.hist, the locations of the bins as well as the number of entries in each bin are returned. Let's say you generated three histograms (I'm going to go for histograms 0, 1 and 2, because Python):
import matplotlib.pyplot as plt
import numpy as np
x0 = np.random.rand(25)
x1 = np.random.rand(25)
x2 = np.random.rand(25)
counts0, bins0, patches0 = plt.hist(x0)
counts1, bins1, patches1 = plt.hist(x1)
counts2, bins2, patches2 = plt.hist(x2)
The locations of the bins for histogram 0 are then stored in bins0.
The number of entries in the bins of histogram 0 is then stored in counts0.
I'd then be tempted to gather these together into 2d arrays:
counts = np.vstack([counts0, counts1, counts2]).T
bins = np.vstack([bins0, bins1, bins2]).T
Now, bins[i, j] details the location of bin i for histogram j. Similarly counts[i, j] contains the number of entries in bin i of histogram j.
With this set up you can get the counts in bin i for histograms 0, 1 and 2 as counts[i].
Additionally, if you don't actually need the plots, and are only calling plt.hist to get a handle on counts and bins then you can use np.histogram instead. The syntax is similar: counts, bins = np.histogram(x) (np.histogram doesn't return patches).
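One caveat worth sketching: with the default settings each histogram picks its bin edges from its own data, so "bin i" can cover a different interval in each histogram. If the bins are meant to be combined in an equation, passing the same explicit edges to every call keeps them comparable. The choice of 10 bins over [0, 1] is an assumption made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = [rng.random(25) for _ in range(3)]

# One set of edges shared by all three histograms.
edges = np.linspace(0.0, 1.0, 11)
counts = np.column_stack([np.histogram(s, bins=edges)[0] for s in samples])

# counts[i] holds bin i of histograms 0, 1 and 2.
for i, (c0, c1, c2) in enumerate(counts):
    print(i, c0, c1, c2)
```

plt.hist accepts the same edges through its bins argument, so the plotted histograms can share them as well.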
I am plotting a spectrogram of my data using matplotlib's specgram function.
Pxx, freqs, bins= mlab.specgram(my_data,NFFT=nFFT,Fs=Fs,detrend=mlab.detrend_linear,noverlap=n_overlap,pad_to=p_to,scale_by_freq=True)
For ref, the shape of "freqs", "bins" (i.e. times) and "Pxx" above are (1025,), (45510,) and (1025,45510) respectively.
where, I have defined the function parameters
Fs = 10E6 # Sampling Rate
w_length= 256 # window length
nFFT=2 * w_length
n_overlap = w_length // 2  # noverlap must be an integer
p_to = 8 *w_length
The frequency range (yaxis) for this plot is from 0 to 5E6 Hz. When I plot it, I am interested in viewing different frequency ranges, for example 100E3 Hz to 1E6. If I change the ylim of the plot, the colorbar limits don't change i.e. don't update to reflect the signal values in this "new" frequency range. Is there a way that I can do this, so that by changing the y-axis range plotted i.e. the frequency range limits , the colorbar will update/change accordingly?
interp='nearest'
cmap=seismic
fig = plt.figure()
ax1=fig.add_subplot(111)
img1=ax1.imshow(Pxx, interpolation=interp, aspect='auto',extent=extent,cmap=cmap)
ax1.autoscale(enable=True,axis='x',tight=True)
ax1.autoscale(enable=True,axis='y',tight=True)
ax1.set_autoscaley_on(False)
ax1.set_ylim([100E3,1E6])
fig.colorbar(img1)
plt.show()
I thought that if I could somehow find what the maximum and minimum value of Pxx was for the upper and lower frequencies respectively in the frequency range of interest, that I could use these values to set the colorbar limit e.g.
img1.set_clim(min_val, max_val)
I can find the max and min values of Pxx in general and return their indices using
import numpy as np
>>> np.unravel_index(Pxx.argmax(),Pxx.shape)
(20, 31805)
>>> np.unravel_index(Pxx.argmin(),Pxx.shape)
(1024, 31347)
How do I go about finding the values of Pxx that correspond to the freq range of interest?
I can do something like the following to roughly find where for example in "freqs" 100E3 and 1E6 are approx. located using (and take the first (or last) value from each )...
fmin_index= [i for i,x in enumerate(freqs) if x >= 100E3][0]
fmax_index= [i for i,x in enumerate(freqs) if x >= 1000E3][0]
OR
fmin_index= [i for i,x in enumerate(freqs) if x <= 100E3][-1]
fmax_index= [i for i,x in enumerate(freqs) if x <= 1000E3][-1]
Then possibly
min_val = np.min(Pxx[fmin_index,:])
max_val = np.min(Pxx[fmax_index,:])
and finally
img1.set_clim(min_val, max_val)
Unfortunately this doesn't appear to be working in the sense that value range on the colorbar doesn't look correct. There must be a better/easier/more accurate way to do the above.
Instead of changing the limits in the graph, a possible solution is to change the data you plot and let colorbar do its thing. A minimal working example in the pylab environment:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import mlab

# some random data
my_data = np.random.random(2048)

#### Your Code
Fs = 10E6  # Sampling Rate
w_length = 256  # window length
nFFT = 2 * w_length
n_overlap = w_length // 2  # noverlap must be an integer
p_to = 8 * w_length
Pxx, freqs, bins = mlab.specgram(my_data, NFFT=nFFT, Fs=Fs,
                                 detrend=mlab.detrend_linear,
                                 noverlap=n_overlap,
                                 pad_to=p_to, scale_by_freq=True)

# find a maximum frequency index
maxfreq = 1E5  # replace by your maximum freq
lastfreq = len(freqs) - 1  # default when no maximum is given
if maxfreq:
    # clamp so the index always stays inside freqs
    lastfreq = min(freqs.searchsorted(maxfreq), len(freqs) - 1)

Pxx = np.flipud(Pxx)  # flipping image in the y-axis
interp = 'nearest'
cmap = plt.get_cmap('seismic')
fig = plt.figure()
ax1 = fig.add_subplot(111)
extent = 0, 4, freqs[0], freqs[lastfreq]  # new extent
# plot reduced range
img1 = ax1.imshow(Pxx[-lastfreq:], interpolation=interp, aspect='auto',
                  extent=extent, cmap=cmap)
ax1.set_autoscaley_on(False)
fig.colorbar(img1)
plt.show()
My example only sets a maximum frequency, but with some small tweaks you can set a minimum.
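If you would rather keep plotting the full array and only adjust the color limits, as attempted in the question, the limits need to come from every row inside the band, not from the two boundary rows alone. A sketch under that assumption follows; band_clim is a hypothetical helper and the arrays are synthetic stand-ins for the output of mlab.specgram.

```python
import numpy as np

def band_clim(Pxx, freqs, fmin, fmax):
    """Return (vmin, vmax) over the rows of Pxx with fmin <= freq <= fmax."""
    rows = (freqs >= fmin) & (freqs <= fmax)
    if not rows.any():
        raise ValueError("no frequency bins inside the requested band")
    band = Pxx[rows, :]
    return band.min(), band.max()

# Synthetic stand-in for the spectrogram: rows are frequencies, columns are times.
freqs = np.linspace(0.0, 5e6, 1025)
rng = np.random.default_rng(0)
Pxx = rng.random((freqs.size, 40)) * freqs[:, None]

vmin, vmax = band_clim(Pxx, freqs, 100e3, 1e6)
print(vmin, vmax)
```

With the image object from the question, img1.set_clim(vmin, vmax) followed by ax1.set_ylim([100E3, 1E6]) would then restrict both the color scale and the view to the same band.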
Is there a way to tell matplotlib to "normalize" a histogram such that its area equals a specified value (other than 1)?
The option "normed = 0" in
n, bins, patches = plt.hist(x, 50, normed=0, histtype='stepfilled')
just brings it back to a frequency distribution.
Just calculate it and normalize it to any value you'd like, then use bar to plot the histogram.
On a side note, this will normalize things such that the area of all the bars is normed_value. The raw sum will not be normed_value (though it's easy to have that be the case, if you'd like).
E.g.
import numpy as np
import matplotlib.pyplot as plt
x = np.random.random(100)
normed_value = 2
hist, bins = np.histogram(x, bins=20, density=True)
widths = np.diff(bins)
hist *= normed_value
plt.bar(bins[:-1], hist, widths, align='edge')  # bins[:-1] are left edges
plt.show()
So, in this case, if we were to integrate (sum the height multiplied by the width) the bins, we'd get 2.0 instead of 1.0. (i.e. (hist * widths).sum() will yield 2.0)
You can pass a weights argument to hist instead of using normed. For example, if your bins cover the interval [minval, maxval], you have n bins, and you want to normalize the area to A, then I think
weights = np.empty_like(x)
weights.fill(A * n / (maxval-minval) / x.size)
plt.hist(x, bins=n, range=(minval, maxval), weights=weights)
should do the trick.
EDIT: The weights argument must be the same size as x, and its effect is to make each value in x contribute the corresponding value in weights towards the bin count, instead of 1.
I think the hist function could probably do with a greater ability to control normalization, though. For example, I think as it stands, values outside the binned range are ignored when normalizing, which isn't generally what you want.
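The weights formula can be checked numerically with np.histogram, which takes the same bins, range and weights arguments. This is a sketch; the values of A, n and the range are examples, and the range is chosen so that it covers all of x.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(500)

A = 2.0                     # target area (example value)
n = 20                      # number of bins
minval, maxval = 0.0, 1.0   # binned range, covers every sample

bin_width = (maxval - minval) / n
# Same value as A * n / (maxval - minval) / x.size
weights = np.full(x.shape, A / (bin_width * x.size))

heights, edges = np.histogram(x, bins=n, range=(minval, maxval),
                              weights=weights)
area = (heights * np.diff(edges)).sum()
print(area)
```

If some samples fall outside (minval, maxval) they contribute nothing, so the resulting area comes out smaller than A.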