I made the PDF with the histogram code below:
plt.figure()
values1,bins1,_ = plt.hist(np.log10(fakeclusterlum),bins=20)
plt.hist(np.log10(bigclusterlum151mh),alpha = .5,bins = bins1)
but I am not sure how to turn this into a CDF. I want to plot the CDF of both the fakeclusterlum and bigclusterlum151mh points. I hope that makes sense; if it doesn't, I apologise, I am somewhat of a beginner!
pyplot.hist has an argument
cumulative : bool, optional
If True, then a histogram is computed where each bin gives the counts in that bin plus all bins for smaller values. The last bin gives the total number of datapoints.
Default: False
Hence use
plt.hist(..., cumulative=True)
to plot a cumulative histogram.
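As a minimal sketch of how that could look for the two arrays in the question: the data here are synthetic stand-ins (the real fakeclusterlum and bigclusterlum151mh are not shown), and density=True is an added assumption so that each curve ends at 1 like a CDF rather than at the total count.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic stand-ins for the asker's arrays (assumption: positive luminosities)
fakeclusterlum = rng.lognormal(mean=3.0, sigma=1.0, size=500)
bigclusterlum151mh = rng.lognormal(mean=3.5, sigma=0.8, size=300)

fig, ax = plt.subplots()
# cumulative=True accumulates the bins; density=True normalises so the last bin is 1
values1, bins1, _ = ax.hist(np.log10(fakeclusterlum), bins=20,
                            cumulative=True, density=True,
                            label="fakeclusterlum")
# Reuse the same bin edges so both curves share one x grid
values2, _, _ = ax.hist(np.log10(bigclusterlum151mh), bins=bins1,
                        cumulative=True, density=True, alpha=0.5,
                        label="bigclusterlum151mh")
ax.set_xlabel("log10(luminosity)")
ax.set_ylabel("CDF")
ax.legend()
```

Note that values of the second array that fall outside bins1 are ignored, so its curve is normalised over the in-range points only.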
Related
I am trying to understand the matplotlib.hist function. I have the following data:
cs137_count = np.array([this has a size of 750 and integers in the range from 1820 to 1980])
plt.figure()
plt.hist(cs137_count, density=True, bins=50)  # the keyword is bins, not bin
plt.ylabel('Distribution')
plt.xlabel('Counts');
but the resulting plot has strange y-axis values, in the range 0 to 0.016, which make no sense to me, and I am not sure why it returns those values. I have attached an image of the plot below.
That's because you're using density=True. From the docs
density: bool, optional
If True, the first element of the return tuple
will be the counts normalized to form a probability density, i.e., the
area (or integral) under the histogram will sum to 1. This is achieved
by dividing the count by the number of observations times the bin
width and not dividing by the total number of observations. If stacked
is also True, the sum of the histograms is normalized to 1.
Default is False.
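To make the quoted normalisation concrete, here is a small sketch with np.histogram, which uses the same density definition. The array is a synthetic stand-in for cs137_count (750 integers between 1820 and 1980), so the exact numbers will differ from the plot in the question.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for the asker's data: 750 integers in [1820, 1980]
cs137_count = rng.integers(1820, 1981, size=750)

counts, edges = np.histogram(cs137_count, bins=50)
density, _ = np.histogram(cs137_count, bins=50, density=True)

widths = np.diff(edges)
# density = count / (number of observations * bin width)
manual = counts / (counts.sum() * widths)

print(np.allclose(density, manual))
print(density.max())  # small, because each bin is a few units wide
```

Because the bins are a few units wide and the data span roughly 160 units, heights of the order of 0.01 are what this normalisation is expected to produce.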
My data (a 196,585-record numpy array extracted from a pandas dataframe) are being placed into a single bin by matplotlib.hist. The data were originally integers, so I tried converting them to float as well, as shown below, but they are still not being distributed among 10 bins.
Interestingly, a small sub-sample (using df.sample(0.00x)) of the integer data are successfully distributed.
Any suggestions on where I may be erring in data preparation or use of matplotlib's histogram function would be appreciated.
x = df[(df['UNIT']=='X')].OPP_VALUE.values
num_bins = 10
n, bins, patches = plt.hist((x[(x>0)]).astype(float), num_bins, density=False, facecolor='0.5', alpha=0.8)  # normed was removed from matplotlib; density replaces it
plt.show()
Most likely, the number of data points with x > 0.5 is very small, but a few outliers force the hist function to pick the scale it does. Try removing all values > 0.5 (or 1 if you do not want to convert to float) and then plot again.
You should modify the number of bins, for example:
number_of_bins = 200
bin_cutoffs = np.linspace(np.percentile(x,0), np.percentile(x,99),number_of_bins)
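A rough sketch of why this helps, using made-up data (the real OPP_VALUE column is not available): a couple of extreme outliers stretch the default range so that nearly everything lands in the first bin, while edges capped at the 99th percentile spread the bulk of the data out.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 10,000 ordinary values plus two extreme outliers
x = np.concatenate([rng.uniform(1, 100, size=10_000), [1e7, 5e7]])

# Default behaviour: 10 equal bins between min and max
counts_default, _ = np.histogram(x, bins=10)

# Bin edges from the minimum up to the 99th percentile, as suggested above
number_of_bins = 200
bin_cutoffs = np.linspace(np.percentile(x, 0), np.percentile(x, 99), number_of_bins)
counts_clipped, _ = np.histogram(x, bins=bin_cutoffs)

print(counts_default)                    # almost everything in the first bin
print(np.count_nonzero(counts_clipped))  # many occupied bins
```

The same bin_cutoffs array can be passed to plt.hist via bins=bin_cutoffs; values above the last edge are simply left out of the plot.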
Below is the data for which I want to plot the PDF.
https://gist.github.com/ecenm/cbbdcea724e199dc60fe4a38b7791eb8#file-64_general-out
Below is the script
import numpy as np
import matplotlib.pyplot as plt
import pylab
data = np.loadtxt('64_general.out')
H,X1 = np.histogram( data, bins = 10, density = True) # Is this the right way to get the PDF ? (normed has been removed from NumPy; density alone is enough)
plt.xlabel('Latency')
plt.ylabel('PDF')
plt.title('PDF of latency values')
plt.plot(X1[1:], H)
plt.show()
When I plot the above, I get the following.
Is the above the correct way to calculate the PDF of a range of values?
Is there any other way to confirm that the result I get is the actual PDF? For example, how can I show that the area under the PDF equals 1 in my case?
It is a legitimate way of approximating the PDF. Since np.histogram groups the values into bins, you won't get the exact frequency of each number in your input. For a more exact approximation you should count the occurrences of each number and divide by the total count. Also, since these are discrete values, the data could be plotted as points or bars to give a more accurate impression.
In the discrete case, the sum of the frequencies should equal 1. In the continuous case you can, for example, use np.trapz() (renamed np.trapezoid() in NumPy 2.0) to approximate the integral.
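Both checks can be sketched as follows. The data are a synthetic stand-in for 64_general.out (assumed here to be non-negative integer latencies), and the area is computed as the sum of bar height times bar width, which is the exact integral of a histogram.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic stand-in for the latency file (assumption: integer-valued latencies)
data = rng.poisson(lam=20, size=5000)

# Continuous view: the bars of a density histogram have total area 1
H, X1 = np.histogram(data, bins=10, density=True)
area = np.sum(H * np.diff(X1))
print(area)

# Discrete view: exact relative frequency of each distinct value
values, counts = np.unique(data, return_counts=True)
freq = counts / counts.sum()
print(freq.sum())
```

For plotting, values against freq as bars or points gives the per-value frequencies directly, without any binning.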
I have a list.
The index of the list is the degree number.
The value is the probability of that degree number.
For example, x[ 1 ] = 0.01 means that the probability of degree 1 is 0.01.
I want to draw a distribution graph of this list, and I tried
hist = plt.figure(1)
plt.hist(PrDeg, bins = 1)
plt.title("Degree Probability Histogram")
plt.xlabel("Degree")
plt.ylabel("Prob.")
hist.savefig("Prob_Hist")
PrDeg is the list which I mentioned above.
But the saved figure is not correct.
The x-axis shows Prob. and the y-axis shows Degree (the index of the list).
How can I swap the x- and y-axis values using pyplot?
Histograms do not usually show you probabilities; they show the count or frequency of observations within different intervals of values, called bins. pyplot defines the bins by splitting the range between the minimum and maximum value of your array into n equally sized intervals, where n is the number you specified with the argument bins = 1. So, in this case, your histogram has a single bin, which gives it its odd appearance. By increasing that number you will be able to see better what actually happens there.
The only information that we can get from such a histogram is that the values of your data range from 0.0 to ~0.122 and that len(PrDeg) is close to 1800. If I am right about that much, it means your graph looks like what one would expect from a histogram and it is therefore not incorrect.
To answer your question about swapping the axes, the argument orientation=u'horizontal' is what you are looking for. I used it in the example below, renaming the axes accordingly:
import numpy as np
import matplotlib.pyplot as plt
PrDeg = np.random.normal(0,1,10000)
print(PrDeg)
hist = plt.figure(1)
plt.hist(PrDeg, bins = 100, orientation=u'horizontal')
plt.title("Degree Probability Histogram")
plt.xlabel("count")
plt.ylabel("Values randomly generated by numpy")
hist.savefig("Prob_Hist")
plt.show()
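If the goal is degree on the x-axis and its probability on the y-axis, a bar chart of the list against its own indices may be closer to what the question asks for than a histogram. A minimal sketch, with a short made-up PrDeg list standing in for the real one:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt

# Hypothetical probabilities: index = degree, value = probability (sums to 1)
PrDeg = [0.0, 0.01, 0.05, 0.2, 0.4, 0.2, 0.1, 0.04]
degrees = np.arange(len(PrDeg))

fig, ax = plt.subplots()
# One bar per degree, with height equal to that degree's probability
bars = ax.bar(degrees, PrDeg)
ax.set_title("Degree Probability")
ax.set_xlabel("Degree")
ax.set_ylabel("Prob.")
```

Calling fig.savefig("Prob_Hist") afterwards would write the figure to disk as in the question.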
I have a problem with a data set whose values range from 0 to tens of thousands. There is no problem showing the histogram of the whole data set using hist(). However, if I only want to show a detailed cumulative, normed histogram over, say, x = [0, 120], I have to use 600000 bins to get enough detail.
The tricky part is that if I just use the range (0, 120) to show the normed and cumulative histogram, it ends at 1. But the true value there is far less than 1, since the histogram was normalised only within this small range of the data. Does anyone have an idea how to use hist() in matplotlib to tackle this problem? I thought this should not be so complicated that I have to write another function to draw the histogram I need.
You can set bins to a list, not an integer, e.g., bins=[1,2,3,..,120,30000,60000].
To answer your comment below, here is an excerpt from the documentation:
bins:
Either an integer number of bins or a sequence giving the bins. If bins is an integer, bins + 1 bin edges will be returned, consistent with numpy.histogram() for numpy version >= 1.3, and with the new = True argument in earlier versions. Unequally spaced bins are supported if bins is a sequence.
And here is an example with cumulative normalized histogram. Notice the effect of bins = [100,125,150,160,170,180,190,200,210,220,230,240,250,275,300] on this bar plot, how the first two bars are wider than the middle bars.
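The original figure is not reproduced here. As a hedged sketch of the same idea applied to the question, the edges below use unit-width bins from 0 to 120 plus one wide bin up to the maximum, so the normalisation covers the whole data set while only the range of interest is displayed. The data are synthetic, and the sketch assumes the maximum lies above 120.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
# Synthetic data spanning far beyond the range of interest
inp = np.abs(rng.normal(0, 1000, size=100_000))

# Unit-width bins on [0, 120], then one catch-all bin up to the maximum
edges = np.append(np.arange(0, 121), inp.max())

fig, ax = plt.subplots()
n, _, _ = ax.hist(inp, bins=edges, density=True, cumulative=True)
ax.set_xlim(0, 120)  # show only the detailed part

print(n[119])  # cumulative fraction of all data below 120
print(n[-1])   # reaches 1 only after the catch-all bin
```

With density=True and cumulative=True the bar heights are cumulative fractions of the full sample, so the curve at x = 120 stays well below 1.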
Hmmm, I guess this is related to your previous question (Memory error when dealing with huge data). My suggestion there doesn't seem to work for a cumulative histogram.
I can't get plt.hist() to play nice with cyborg's suggestion, so I did the cumsum and normalisation by hand:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import normal
inp = np.abs(normal(0, 100000, 100000))
bins = range(0, 120)
a,b = np.histogram(inp, bins = bins)
bar_edges = b[:-1]
bar_width = b[1] - b[0]
bar_height = (np.cumsum(a) + sum(inp<min(bins))) / len(inp)
plt.figure(1)
plt.bar(bar_edges, bar_height, width = bar_width, align = 'edge')  # bar_edges are left edges; the default alignment is 'center'
plt.show()