I have a CDF plot with data of wifi usage in MB. For better understanding I would like to present the usage starting in KB and finishing in TB. I would like to know how to set a specific range for x axis to replace the produce by plt.plot() and show the axis x, per example, as [1KB 10KB 1MB 10MB 1TB 10TB], even the space between bins not representing the real values.
My code for now:
wifi = np.sort(matrix[matrix['wifi_total_mb']>0]['wifi_total_mb'].values)
g = sns.distplot(wifi, kde_kws=dict(cumulative=True))
plt.show()
Thanks
EDIT 1
I know that I can use plt.xticks and i already tried it: plt.xticks([0.00098, 0.00977, 1, 10, 1024, 10240, 1048576, 10485760, 24117248]). These are values in MB that represents the sample range I specified before. But the plot is still wrong.
The result expected
In excel it is pretty easy makes what I want to. Look the image, with the same range I get the plot I wanted.
Thanks
It may be better to calculate the data to plot manually, instead of relying on some seaborn helper function like distplot. This also makes it easier to understand the underlying issue of histogramming with very unequal bin sizes.
Calculating histogram
The histogram of the data can be calculated by using np.histogram(). It can take the desired bins as argument.
In order to get the cummulative histogram, np.cumsum does the job.
Now there are two options here: (a) plotting the real data or (b) plotting the data enumerated by bin.
(a) Plotting the real data:
Because the bin sizes are pretty unequal, a logarithmic scaling seems adequate, which can be done by semilogx(x,y). The bin edges can be shown as xticks using set_xticks (and since the semilogx plot will not automatically set the labels correctly, we also need to set them to the bin edges' values).
(b) Plotting data enumerated by bin:
The second option is to plot the histogram values bin by bin, independent of the actual bin size. Is is very close to the Excel solution from the question. In this case the x values of the plot are simply values from 0 to number of bins and the xticklabels are the bin edges.
Here is the complete code:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#use the bin from the question
bins = [0, 0.00098, 0.00977, 1, 10, 1024, 10240, 1048576, 10485760, 24117248]
# invent some data
data = np.random.lognormal(2,4,10000)
# calculate histogram of the data into the given bins
hist, _bins = np.histogram(data, bins=bins)
# make histogram cumulative
cum_hist=np.cumsum(hist)
# normalize data to 1
norm_cum_hist = cum_hist/float(cum_hist.max())
fig, (ax, ax2) = plt.subplots(nrows=2)
plt.subplots_adjust(hspace=0.5, bottom=0.17)
# First option plots the actual data, i.e. the bin width is reflected
# by the spacing between values on x-axis.
ax.set_title("Plotting actual data")
ax.semilogx(bins[1:],norm_cum_hist, marker="s")
ax.set_xticks(bins[1:])
ax.set_xticklabels(bins[1:] ,rotation=45, horizontalalignment="right")
# Second option plots the data bin by bin, i.e. every bin has the same width,
# independent of it's actual value.
ax2.set_title("Plotting bin by bin")
ax2.plot(range(len(bins[1:])),norm_cum_hist, marker="s")
ax2.set_xticks(range(len(bins[1:])))
ax2.set_xticklabels(bins[1:] ,rotation=45, horizontalalignment="right")
for axes in [ax, ax2]:
axes.set_ylim([0,1.05])
plt.show()
How do I plot a histogram using matplotlib.pyplot.hist?
I have a list of y-values that correspond to bar height, and a list of x-value strings.
Related: matplotlib.pyplot.bar.
If you want a histogram, you don't need to attach any 'names' to x-values because:
on x-axis you will have data bins
on y-axis counts (by default) or frequencies (density=True)
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
np.random.seed(42)
x = np.random.normal(size=1000)
plt.hist(x, density=True, bins=30) # density=False would make counts
plt.ylabel('Probability')
plt.xlabel('Data');
Note, the number of bins=30 was chosen arbitrarily, and there is Freedman–Diaconis rule to be more scientific in choosing the "right" bin width:
, where IQR is Interquartile range and n is total number of datapoints to plot
So, according to this rule one may calculate number of bins as:
q25, q75 = np.percentile(x, [25, 75])
bin_width = 2 * (q75 - q25) * len(x) ** (-1/3)
bins = round((x.max() - x.min()) / bin_width)
print("Freedman–Diaconis number of bins:", bins)
plt.hist(x, bins=bins);
Freedman–Diaconis number of bins: 82
And finally you can make your histogram a bit fancier with PDF line, titles, and legend:
import scipy.stats as st
plt.hist(x, density=True, bins=82, label="Data")
mn, mx = plt.xlim()
plt.xlim(mn, mx)
kde_xs = np.linspace(mn, mx, 300)
kde = st.gaussian_kde(x)
plt.plot(kde_xs, kde.pdf(kde_xs), label="PDF")
plt.legend(loc="upper left")
plt.ylabel("Probability")
plt.xlabel("Data")
plt.title("Histogram");
If you're willing to explore other opportunities, there is a shortcut with seaborn:
# !pip install seaborn
import seaborn as sns
sns.displot(x, bins=82, kde=True);
Now back to the OP.
If you have limited number of data points, a bar plot would make more sense to represent your data. Then you may attach labels to x-axis:
x = np.arange(3)
plt.bar(x, height=[1,2,3])
plt.xticks(x, ['a','b','c']);
If you haven't installed matplotlib yet just try the command.
> pip install matplotlib
Library import
import matplotlib.pyplot as plot
The histogram data:
plot.hist(weightList,density=1, bins=20)
plot.axis([50, 110, 0, 0.06])
#axis([xmin,xmax,ymin,ymax])
plot.xlabel('Weight')
plot.ylabel('Probability')
Display histogram
plot.show()
And the output is like :
This is an old question but none of the previous answers has addressed the real issue, i.e. that fact that the problem is with the question itself.
First, if the probabilities have been already calculated, i.e. the histogram aggregated data is available in a normalized way then the probabilities should add up to 1. They obviously do not and that means that something is wrong here, either with terminology or with the data or in the way the question is asked.
Second, the fact that the labels are provided (and not intervals) would normally mean that the probabilities are of categorical response variable - and a use of a bar plot for plotting the histogram is best (or some hacking of the pyplot's hist method), Shayan Shafiq's answer provides the code.
However, see issue 1, those probabilities are not correct and using bar plot in this case as "histogram" would be wrong because it does not tell the story of univariate distribution, for some reason (perhaps the classes are overlapping and observations are counted multiple times?) and such plot should not be called a histogram in this case.
Histogram is by definition a graphical representation of the distribution of univariate variable (see Histogram | NIST/SEMATECH e-Handbook of Statistical Methods & Histogram | Wikipedia) and is created by drawing bars of sizes representing counts or frequencies of observations in selected classes of the variable of interest. If the variable is measured on a continuous scale those classes are bins (intervals). Important part of histogram creation procedure is making a choice of how to group (or keep without grouping) the categories of responses for a categorical variable, or how to split the domain of possible values into intervals (where to put the bin boundaries) for continuous type variable. All observations should be represented, and each one only once in the plot. That means that the sum of the bar sizes should be equal to the total count of observation (or their areas in case of the variable widths, which is a less common approach). Or, if the histogram is normalised then all probabilities must add up to 1.
If the data itself is a list of "probabilities" as a response, i.e. the observations are probability values (of something) for each object of study then the best answer is simply plt.hist(probability) with maybe binning option, and use of x-labels already available is suspicious.
Then bar plot should not be used as histogram but rather simply
import matplotlib.pyplot as plt
probability = [0.3602150537634409, 0.42028985507246375,
0.373117033603708, 0.36813186813186816, 0.32517482517482516,
0.4175257731958763, 0.41025641025641024, 0.39408866995073893,
0.4143222506393862, 0.34, 0.391025641025641, 0.3130841121495327,
0.35398230088495575]
plt.hist(probability)
plt.show()
with the results
matplotlib in such case arrives by default with the following histogram values
(array([1., 1., 1., 1., 1., 2., 0., 2., 0., 4.]),
array([0.31308411, 0.32380469, 0.33452526, 0.34524584, 0.35596641,
0.36668698, 0.37740756, 0.38812813, 0.39884871, 0.40956928,
0.42028986]),
<a list of 10 Patch objects>)
the result is a tuple of arrays, the first array contains observation counts, i.e. what will be shown against the y-axis of the plot (they add up to 13, total number of observations) and the second array are the interval boundaries for x-axis.
One can check they they are equally spaced,
x = plt.hist(probability)[1]
for left, right in zip(x[:-1], x[1:]):
print(left, right, right-left)
Or, for example for 3 bins (my judgment call for 13 observations) one would get this histogram
plt.hist(probability, bins=3)
with the plot data "behind the bars" being
The author of the question needs to clarify what is the meaning of the "probability" list of values - is the "probability" just a name of the response variable (then why are there x-labels ready for the histogram, it makes no sense), or are the list values the probabilities calculated from the data (then the fact they do not add up to 1 makes no sense).
This is a very round-about way of doing it but if you want to make a histogram where you already know the bin values but dont have the source data, you can use the np.random.randint function to generate the correct number of values within the range of each bin for the hist function to graph, for example:
import numpy as np
import matplotlib.pyplot as plt
data = [np.random.randint(0, 9, *desired y value*), np.random.randint(10, 19, *desired y value*), etc..]
plt.hist(data, histtype='stepfilled', bins=[0, 10, etc..])
as for labels you can align x ticks with bins to get something like this:
#The following will align labels to the center of each bar with bin intervals of 10
plt.xticks([5, 15, etc.. ], ['Label 1', 'Label 2', etc.. ])
Though the question appears to be demanding plotting a histogram using matplotlib.hist() function, it can arguably be not done using the same as the latter part of the question demands to use the given probabilities as the y-values of bars and given names(strings) as the x-values.
I'm assuming a sample list of names corresponding to given probabilities to draw the plot. A simple bar plot serves the purpose here for the given problem. The following code can be used:
import matplotlib.pyplot as plt
probability = [0.3602150537634409, 0.42028985507246375,
0.373117033603708, 0.36813186813186816, 0.32517482517482516,
0.4175257731958763, 0.41025641025641024, 0.39408866995073893,
0.4143222506393862, 0.34, 0.391025641025641, 0.3130841121495327,
0.35398230088495575]
names = ['name1', 'name2', 'name3', 'name4', 'name5', 'name6', 'name7', 'name8', 'name9',
'name10', 'name11', 'name12', 'name13'] #sample names
plt.bar(names, probability)
plt.xticks(names)
plt.yticks(probability) #This may be included or excluded as per need
plt.xlabel('Names')
plt.ylabel('Probability')
I'm using pyplot to make a histogram. Here is approximately what I'm doing:
import numpy as np
import pylab as pl
A = {my dataset as a dictionary: different numbers and their frequencies}
numbers = A.keys()
frequencies = A.values()
plot = np.transpose(np.array([[numbers,frequencies]])
n = <my bins-value here>
pl.hist(plot,bins=n,log=True)
pl.show()
I have noticed that, regardless of the number of bins I specify, the second bin is always green, like below. Why is it green? What does this mean? How do I prevent this from happening?
You can't really use hist that way. hist computes value frequencies given the raw data. You have already computed the frequencies, and you're trying to pass them to hist, but that's not the input hist needs. When you pass in a two-dimensional array, as you're doing, hist displays multiple histograms, one for each column. This is documented:
Multiple data can be provided via x as a list of datasets of potentially different length ([x0, x1, ...]), or as a 2-D ndarray in which each column is a dataset.
So you're getting one bar graph (the blue ones) for your labels, and another (the green ones) for their counts. Presumably all the green ones are lumped together because their range is much smaller.
If you generated your frequencies from raw data, you can pass that raw data to hist to get your histogram. If you only have the histogram data, you should use matplotlib's bar function to make a bar graph yourself using the histogram data. However, you'd have to bin it yourself. The bottom line is that you can either let hist do everything, or nothing: you can have it compute the frequencies and the bins and do the plot, or you can compute the frequencies and the bins and do the plot, but you can't just compute the frequencies yourself and have hist just do the binning and the plot.
I'm plotting a 2d histogram as image in pyqtgraph. I would like to set the axes scales correctly (i.e. representing the actual values of the binned data).
I found this article but I'm not quite sure how to translate it to my case.
I do:
h = np.histogram2d(x, y, 30, normed = True)
w = pg.ImageView(view=pg.PlotItem())
w.setImage(h[0])
but the scale of the PlotItem axes run from 0 to 30 (number of bins), which is not what I would like.
You need to set the position and scale of the image. The link you provided has the following code:
view.setImage(img, pos=[x0, y0], scale=[xscale, yscale])
You only need to determine the correct values of [x0, y0] and [xscale, yscale] based on your bin values in h[1].
I'm trying to draw part of a histogram using matplotlib.
Instead of drawing the whole histogram which has a lot of outliers and large values I want to focus on just a small part. The original histogram looks like this:
hist(data, bins=arange(data.min(), data.max(), 1000), normed=1, cumulative=False)
plt.ylabel("PDF")
And after focusing it looks like this:
hist(data, bins=arange(0, 121, 1), normed=1, cumulative=False)
plt.ylabel("PDF")
Notice that the last bin is stretched and worst of all the Y ticks are scaled so that the sum is exactly 1 (so points out of the current range are not taken into account at all)
I know that I can achieve what I want by drawing the histogram over the whole possible range and then restricting the axis to the part I'm interested in, but it wastes a lot of time calculating bins that I won't use/see anyway.
hist(btsd-40, bins=arange(btsd.min(), btsd.max(), 1), normed=1, cumulative=False)
axis([0,120,0,0.0025])
Is there a fast and easy way to draw just the focused region but still get the Y scale correct?
In order to plot a subset of the histogram, I don't think you can get around to calculating the whole histogram.
Have you tried computing the histogram with numpy.histogram and then plotting a region using pylab.plot or something? I.e.
import numpy as np
import pylab as plt
data = np.random.normal(size=10000)*10000
plt.figure(0)
plt.hist(data, bins=np.arange(data.min(), data.max(), 1000))
plt.figure(1)
hist1 = np.histogram(data, bins=np.arange(data.min(), data.max(), 1000))
plt.bar(hist1[1][:-1], hist1[0], width=1000)
plt.figure(2)
hist2 = np.histogram(data, bins=np.arange(data.min(), data.max(), 200))
mask = (hist2[1][:-1] < 20000) * (hist2[1][:-1] > 0)
plt.bar(hist2[1][mask], hist2[0][mask], width=200)
Original histogram:
Histogram calculated manually:
Histogram calculated manually, cropped:
(N.B.: values are smaller because bins are narrower)
I think, you can normalize your data using a given weight. (repeat is a numpy function).
hist(data, bins=arange(0, 121, 1), weights=repeat(1.0/len(data), len(data)))