Using pyplot to draw histogram - python

I have a list whose index is a degree number and whose value is the probability of that degree. For example, x[1] = 0.01 means degree 1 has probability 0.01.
I want to draw a distribution graph of this list, and I try
hist = plt.figure(1)
plt.hist(PrDeg, bins = 1)
plt.title("Degree Probability Histogram")
plt.xlabel("Degree")
plt.ylabel("Prob.")
hist.savefig("Prob_Hist")
PrDeg is the list I mentioned above, but the saved figure is not correct: the x axis shows Prob. and the y axis shows Degree (the list index).
How can I swap the x and y axis values using pyplot?

Histograms do not usually show probabilities; they show the count or frequency of observations within different intervals of values, called bins. pyplot defines the bins by splitting the range between the minimum and maximum values of your array into n equally sized bins, where n is the number you passed as bins=1. So in this case your histogram has a single bin, which gives it its odd aspect. By increasing that number you will be better able to see what actually happens there.
The only information we can get from such a histogram is that your data values range from 0.0 to ~0.122 and that len(PrDeg) is close to 1800. If I am right about that much, your graph looks like what one would expect from a histogram, and it is therefore not incorrect.
To answer your question about swapping the axes, the argument orientation=u'horizontal' is what you are looking for. I used it in the example below, renaming the axes accordingly:
import numpy as np
import matplotlib.pyplot as plt
PrDeg = np.random.normal(0,1,10000)
print(PrDeg)
hist = plt.figure(1)
plt.hist(PrDeg, bins = 100, orientation=u'horizontal')
plt.title("Degree Probability Histogram")
plt.xlabel("count")
plt.ylabel("Values randomly generated by numpy")
hist.savefig("Prob_Hist")
plt.show()
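As a side note: if PrDeg really already contains probabilities indexed by degree, then a histogram of those values may not be what is wanted at all; a bar plot of probability against degree (the list index) shows the distribution directly. A sketch with made-up probabilities standing in for the real PrDeg:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove if you want an interactive window
import matplotlib.pyplot as plt

# hypothetical probabilities indexed by degree (not the real PrDeg)
PrDeg = [0.0, 0.01, 0.05, 0.20, 0.40, 0.25, 0.09]

fig = plt.figure()
plt.bar(range(len(PrDeg)), PrDeg)  # x = degree (index), y = probability
plt.title("Degree Probability")
plt.xlabel("Degree")
plt.ylabel("Prob.")
fig.savefig("Prob_Bar")
```

Here the degrees land on the x axis and the probabilities on the y axis without any swapping being needed.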

Related

matplotlib density graph / histogram

I have an array: [0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,1] (16 elements long).
How can I create a density histogram with, for instance, bins=4 to see where most of the 1s appear? This histogram would, for instance, be very tall in the middle part and rise a little at the end (most 1s in the beginning and the end).
I have this:
plt.hist([0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,1], bins=4)
This is what I get. This histogram just shows that there are as many 1s as 0s.
How can I later create a (line) graph to show me the average rise and fall of the histogram?
I wouldn't call this a histogram. It's more a plot of spatial density.
So you can enumerate your list, such that the first element has the number 0, the last the number 15. Then divide this list into 4 bins. Within each bin you can count how often you see a 1. To automate this, an option is scipy.stats.binned_statistic.
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import binned_statistic
data = [0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,1]
x = np.arange(len(data))
s, edges, _ = binned_statistic(x, data, bins=4, statistic = lambda c: len(c[c==1]))
print(s, edges)
plt.bar(edges[:-1], s, width=np.diff(edges), align="edge", ec="k")
plt.show()
So here the edges are [0., 3.75, 7.5, 11.25, 15.] and the numbers of 1s within the bins are [0. 3. 2. 3.].
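For what it's worth, the same per-bin counts of 1s can be obtained from np.histogram alone by using the data values themselves as weights, since each 1 contributes one to its bin and each 0 contributes nothing:

```python
import numpy as np

data = np.array([0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,1])
x = np.arange(len(data))

# each position is weighted by its value: a 1 adds one to its bin,
# a 0 adds nothing, giving the per-bin count of 1s
counts, edges = np.histogram(x, bins=4, weights=data)
# counts per bin: 0, 3, 2, 3 with edges 0, 3.75, 7.5, 11.25, 15
```

This matches the binned_statistic result above without needing scipy.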
The histogram will not evaluate the positions of your values; it is a representation of the distribution of the data. The positions of the values within the list are irrelevant; the min and max of the bins simply represent the min and max of the data.
I'd try setting an index and using that as your range instead.
I think this is the closest answer to the graph you've described:
Pandas bar plot with binned range
ImportanceOfBeingErnest's answer here may also be helpful.
Plot a histogram using the index as x-axis labels

Histogram has only one bar

My data--a 196,585-record numpy array extracted from a pandas dataframe--are being placed into a single bin by matplotlib's hist. The data were originally integers, so I tried converting them to float as well, as shown below, but they are still not being distributed among 10 bins.
Interestingly, a small sub-sample (using df.sample(0.00x)) of the integer data are successfully distributed.
Any suggestions on where I may be erring in data preparation or use of matplotlib's histogram function would be appreciated.
x = df[(df['UNIT']=='X')].OPP_VALUE.values
num_bins = 10
n, bins, patches = plt.hist((x[(x>0)]).astype(float), num_bins, normed=False, facecolor='0.5', alpha=0.8)
plt.show()
Most likely what is happening is that the number of data points with x > 0.5 is very small, but you do have some outliers that force the hist function to pick the scale it does. Try removing all values > 0.5 (or 1 if you do not want to convert to float) and then plot again.
You should modify the number of bins, for example:
number_of_bins = 200
bin_cutoffs = np.linspace(np.percentile(x, 0), np.percentile(x, 99), number_of_bins)
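A sketch of the outlier effect with made-up data: a couple of huge values stretch the default bin range so nearly everything lands in the first bin, while percentile-based cutoffs (an illustrative 0th-to-99th-percentile choice) restore the detail:

```python
import numpy as np

rng = np.random.default_rng(0)
# mostly small values plus two huge outliers
x = np.concatenate([rng.uniform(0, 1, 10000), [1e6, 2e6]])

# ten equal-width bins over the full range: the outliers stretch
# the range, so nearly everything lands in the first bin
counts_full, _ = np.histogram(x, bins=10)

# percentile-based cutoffs ignore the outliers
# and spread the data across the bins
cutoffs = np.linspace(np.percentile(x, 0), np.percentile(x, 99), 11)
counts_clipped, _ = np.histogram(x, bins=cutoffs)
```

With the full range, counts_full[0] holds all 10000 small values; with the percentile cutoffs, each bin gets roughly a tenth of them.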

Which elements of list go into which histogram bins?

I'm trying to make a scaled scatter plot from a histogram. The scatter plot is fairly straightforward: make the histogram, find the bin centers, scatter plot.
nbins=7
# Some example data
A = np.random.randint(0, 10, 100)
B = np.random.rand(100)
counts, binEdges=np.histogram(A,bins=nbins)
bincenters = 0.5*(binEdges[1:]+binEdges[:-1])
fig = plt.figure(figsize=(7,5))
ax = fig.add_subplot(111)
ax.scatter(bincenters,counts,c='k', marker='.')
ax_setup(ax, 'X', 'Y')
plt.show()
but I want each element of A to contribute only a scaled value to its bin, and that scaled value is stored in B (i.e. instead of each bin being the count of elements from A for that bin, I want each bin to be the sum of the corresponding values from B).
To do this I tried creating a list C (same length as A and B) that held the bin number allocated to each element of A, then summing all of the values from B that go into the same bin. I thought numpy.searchsorted() was what I needed, e.g.,
C = bincenters.searchsorted(A, 'right')
but this doesn't get the allocation right, and doesn't seem to return the correct number of bins.
So, how do I create a list that tells me which histogram bin each element of my data goes into?
You write
but I want each element of A to contribute only a scaled value to its bin, and that scaled value is stored in B (i.e. instead of each bin being the count of elements from A for that bin, I want each bin to be the sum of the corresponding values from B).
IIUC, this functionality is already supported in numpy.histogram via the weights parameter:
An array of weights, of the same shape as a. Each value in a only contributes its associated weight towards the bin count (instead of 1). If normed is True, the weights are normalized, so that the integral of the density over the range remains 1.
So, for your case, it would just be
counts, binEdges=np.histogram(A, bins=nbins, weights=B)
Another point: if your intent is to plot the histogram, note that you can directly use matplotlib.pyplot's utility functions for this (which take weights as well):
from matplotlib import pyplot as plt
plt.hist(A, bins=nbins, weights=B);
Finally, if you're intent on getting the assignments to bins, then that's exactly what numpy.digitize does:
nbins = 7
# Some example data
A = np.random.randint(0, 10, 10)
B = np.random.rand(10)
counts, binEdges = np.histogram(A, bins=nbins)
print(binEdges)
# [0.         1.28571429 2.57142857 3.85714286 5.14285714
#  6.42857143 7.71428571 9.        ]
print(np.digitize(A, binEdges))  # the bin index assigned to each element of A
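Once np.digitize has assigned each element of A to a bin, np.bincount can sum the corresponding B values per bin, reproducing the weights result. A sketch (using numpy's Generator API for reproducible made-up data; note that digitize is 1-based and sends values equal to the last edge past the final bin, hence the clip):

```python
import numpy as np

rng = np.random.default_rng(42)
nbins = 7
A = rng.integers(0, 10, 100)
B = rng.random(100)

counts, binEdges = np.histogram(A, bins=nbins)

# digitize returns 1-based bin indices; values equal to the last
# edge get index nbins + 1, so clip them into the last bin
idx = np.clip(np.digitize(A, binEdges), 1, nbins) - 1

# sum of the B values landing in each bin
sums = np.bincount(idx, weights=B, minlength=nbins)
```

This gives the same per-bin sums as np.histogram(A, bins=nbins, weights=B).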

Mean of y value in vertical bin

So I have stock market data (dates from 0 onwards, and a close price), and with it I use numpy.fft to calculate the fast Fourier transform and the corresponding frequencies, which I then hold as a zipped list, 'FFT, Frequency'. I separate the frequency values into vertical logarithmic bins using:
logbins = np.logspace(min(logX), max(logX), num=10, base=10.0)
I then digitize the frequency values into these bins and use:
for k in range(1, len(freqlogbins)):
    mean_freq.append(np.mean(Tfreq2[freqdig == k]))
This works fine; however, I also need to somehow work out the mean of the y values in each bin.
I imagine it is somehow possible referring to the x values in the ZippedList[0,i], and the y values as ZippedList[1,i]
but can't quite work out how.
Here is an example of copyable code:
import numpy as np
T_date0=np.arange(0,400)
T_price=np.random.uniform(100,400, size=400)
T_fft=np.fft.fft(T_price)
Tfreq=np.fft.fftfreq(T_date0.shape[-1])
I then remove any negative frequency values and the corresponding fft values using:
Tfreq2=[]
T_fft2=[]
for i in range(len(Tfreq)):
    if Tfreq[i] > 0:
        Tfreq2.append(Tfreq[i])
        T_fft2.append(T_fft[i])
T_fft_absSq=(np.absolute(T_fft2))**2
logTFAS=np.log10(T_fft_absSq)
logTfreq=np.log10(Tfreq2)
numbins=10
logbins=np.logspace((min(logTfreq)-0.00000001),(max(logTfreq)+0.00000001),num=numbins, base=10.0) #The +/-0.00000001 are so that the endpoints lie in the bin intervals.
Tfreq2=np.array(Tfreq2)
TFAS=np.array(T_fft_absSq)
freqdig=np.digitize(Tfreq2,logbins)
mean_freq=[]
mean_fft=[]
for k in range(1, len(logbins)):
    mean_freq.append(np.mean(Tfreq2[freqdig == k]))
Fourier=zip(logTfreq,logTFAS)
##This is where I need to work out the mean of the y values, in the vertical bins
Here is what the data look like: the black dashed lines represent the bins, the dashed yellow lines represent the mean of the x values in each bin, and the blue line is a 2nd-order polynomial fit.
Obviously with random data it will look a little different to the link I posted below, but it gives an idea.
I was overthinking everything....
I was able to calculate the y value averages in a very similar way, using the frequency binning as such:
for k in range(1, len(logbins)):
    mean_freq.append(np.mean(np.array(logTfreq)[freqdig == k]))
    mean_fft.append(np.mean(np.array(logTFAS)[freqdig == k]))
Not quite sure what you're asking for, but maybe np.digitize will help:
import numpy as np
d = np.random.random(1000)
bins = np.linspace(0, 1, 10)
dig = np.digitize(d, bins)
binmean = [d[dig == i].mean() for i in range(1, len(bins))]
print(binmean)
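Alternatively, scipy.stats.binned_statistic (assuming scipy is available) computes the per-bin means in a single call, replacing the list comprehension:

```python
import numpy as np
from scipy.stats import binned_statistic

rng = np.random.default_rng(1)
d = rng.random(1000)
bins = np.linspace(0, 1, 10)

# mean of the d values falling into each of the 9 bins
binmean2, bin_edges, _ = binned_statistic(d, d, statistic="mean", bins=bins)
print(binmean2)
```

For the y-value means from the question, one would simply pass the y array as the second argument while still binning on the frequencies.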

How to best utilize the hist() to show a cumulative and normed histogram?

I have a problem while dealing with a data set whose values range from 0 to tens of thousands. There is no problem showing a histogram of the whole data set using hist(). However, if I only want to show the cumulative and normed histogram in detail over, say, x = [0, 120], I have to use 600000 bins to preserve that detail.
The tricky part is that if I just use the range (0, 120) to show the normed and cumulative hist, it ends at 1. But the true value is actually far less than 1, since the normalisation happens only within this small range of the data. Does anyone have ideas on how to use hist() in matplotlib to tackle this problem? I thought it should not be so complicated that I have to write another function to draw the hist I need.
You can set bins to a list, not an integer, e.g., bins=[1,2,3,...,120,30000,60000].
To answer your comment below, here is an excerpt from the documentation:
bins:
Either an integer number of bins or a sequence giving the bins. If bins is an integer, bins + 1 bin edges will be returned, consistent with numpy.histogram() for numpy version >= 1.3, and with the new = True argument in earlier versions. Unequally spaced bins are supported if bins is a sequence.
And here is an example with cumulative normalized histogram. Notice the effect of bins = [100,125,150,160,170,180,190,200,210,220,230,240,250,275,300] on this bar plot, how the first two bars are wider than the middle bars.
Hmmm, I guess this is related to your previous question (Memory error when dealing with huge data). My suggestion there doesn't seem to work for a cumulative histogram.
I can't get plt.hist() to play nice with cyborg's suggestion, so I did the cumsum and normalisation by hand:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import normal
inp = np.abs(normal(0, 100000, 100000))
bins = range(0, 120)
a,b = np.histogram(inp, bins = bins)
bar_edges = b[:-1]
bar_width = b[1] - b[0]
bar_height = (np.cumsum(a) + sum(inp<min(bins))) / len(inp)
plt.figure(1)
plt.bar(bar_edges, bar_height, width = bar_width)
plt.show()
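For reference, plt.hist can also build the cumulative curve directly. Passing explicit weights of 1/len(inp), instead of density=True, keeps the normalisation relative to the full data set, which avoids exactly the "ends at 1" pitfall from the question. A sketch with made-up data:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
inp = np.abs(rng.normal(0, 100000, 100000))
bins = np.arange(0, 121)

# weights of 1/len(inp) normalise against the FULL data set,
# so the cumulative curve does not artificially reach 1 at x = 120
w = np.full(len(inp), 1.0 / len(inp))
n, edges, patches = plt.hist(inp, bins=bins, weights=w, cumulative=True)
```

The last bar then shows the true fraction of the data below 120, a value far below 1, rather than a curve forced to end at 1.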
