I want to plot a histogram of my df, which has about 60 thousand values. After I used plt.hist(x, bins = 30), it gave me something like this: [histogram image]
The problem is that there are values bigger than 20, but the frequencies of those values may be smaller than 10, so they barely show up. How can I adjust the displayed axis to show more bins? I want to look at the whole distribution here.
The problem with histograms that skew so much towards one value is you're going to essentially flatten out any outlying values. A solution might be just to present the data with two charts.
Can you create another histogram containing only the values greater than 20?
(pseudo-code, since I don't know your data structure from your post)
plt.hist(x[x.column > 20], bins = 30)
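A minimal sketch of the two-chart idea, assuming x is a 1-D NumPy array and 20 is the cutoff. The synthetic data and the names threshold, tail, ax_all and ax_tail are made up for illustration and are not from the original post:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the real data (an assumption): most values lie
# below 20, plus a thin tail of larger values.
rng = np.random.default_rng(0)
x = np.concatenate([rng.uniform(0, 20, 60_000), rng.uniform(20, 100, 300)])

threshold = 20
tail = x[x > threshold]  # boolean mask keeps only the large values

fig, (ax_all, ax_tail) = plt.subplots(ncols=2, figsize=(12, 5))
ax_all.hist(x, bins=30)      # full data set, dominated by the small values
ax_all.set_title('all values')
ax_tail.hist(tail, bins=30)  # the tail gets its own y-axis scale
ax_tail.set_title(f'values > {threshold}')
```

Call plt.show() or fig.savefig(...) afterwards to see the result. For a DataFrame column the same mask would be df.loc[df['col'] > 20, 'col'], with 'col' standing in for the real column name.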
Finally, it could look like this example:
import matplotlib.pyplot as plt
import numpy as np
values1 = np.random.rand(1000,1)*100
values2 = np.random.rand(100000,1)*5
values3 = np.random.rand(10000,1)*20
values = np.vstack((values1,values2,values3))
fig = plt.figure(figsize=(12,5))
ax1 = fig.add_subplot(121)
ax1.hist(values,bins=30)
ax1.set_yscale('log')
ax1.set_title('with log scale')
ax2 = fig.add_subplot(122)
ax2.hist(values,bins=30)
ax2.set_title('no log scale')
fig.savefig('test.jpg')
You could use plt.xscale('log') (see "PyPlot Logarithmic and other nonlinear axis").
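With a logarithmic x-axis, equal-width bins look very uneven, so it usually helps to make the bin edges log-spaced as well. A rough sketch, assuming strictly positive data (the lognormal sample here is only a placeholder):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data (an assumption): strictly positive and heavily skewed.
rng = np.random.default_rng(0)
values = rng.lognormal(mean=1.0, sigma=1.5, size=10_000)

# 31 edges give 30 bins; the edges are evenly spaced in log10.
edges = np.geomspace(values.min(), values.max(), 31)

fig, ax = plt.subplots()
ax.hist(values, bins=edges)
ax.set_xscale('log')  # bins now appear with equal width on screen
```

This only works if every value is greater than zero, since neither the log axis nor np.geomspace can handle zero or negative values.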
I have a Pandas column with data unique to .0001
I would like to plot a histogram that has a bar for each unique .0001 of data.
I achieve a lot of granularity by
plt.hist(df['data'], bins=500)
but I would like to see counts for each unique value.
How would I go about doing this?
thank you
As your values are discrete, it is important to set the bin boundaries nicely in-between these values. If the boundaries coincide with the values, strange rounding artifacts can happen. The example below has each value 10 times, but the histogram with the boundaries on top of the values puts the last two values into the same bin:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame({'data': np.repeat(np.arange(0.0005, 0.0030, 0.0001), 10)})
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15, 4))
ax1.hist(df['data'], bins=np.arange(df['data'].min(), df['data'].max(), 0.0001), ec='w')
ax1.set_title('bin boundaries on top of the values')
ax2.hist(df['data'], bins=np.arange(df['data'].min() - 0.00005, df['data'].max() + 0.0001, 0.0001), ec='w')
ax2.set_title('bin boundaries in-between the values')
plt.show()
Note that the version with the boundaries at the halves also puts the x-ticks nicely in the center of the bins.
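One way to build such in-between edges without accumulating floating-point error is to work on an integer grid first and scale afterwards. This is only a sketch; step, k and edges are names chosen here, and the data is rebuilt from integers rather than taken from the example above:

```python
import numpy as np

step = 0.0001
# Each grid value 0.0005, 0.0006, ..., 0.0029 appears 10 times.
data = np.repeat(np.arange(5, 30) * step, 10)

# Integer grid index of every value (5, 6, ..., 29).
k = np.rint(data / step).astype(int)

# Edges at k - 0.5 and k + 0.5, so each value sits mid-bin.
edges = (np.arange(k.min(), k.max() + 2) - 0.5) * step

counts, _ = np.histogram(data, bins=edges)
```

Every bin then holds exactly one distinct value, and the same edges array can be passed to ax.hist(data, bins=edges).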
Instead of specifying the number of bins (bins=500), you can specify the bin edges:
plt.hist(df['data'], bins=np.arange(df['data'].min(), df['data'].max(), 0.0001) )
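If the goal is simply a count per unique value, another option is to skip binning and count the values directly. A small sketch, assuming the column is called 'data' and is exact to four decimals:

```python
import numpy as np
import pandas as pd

# Illustrative frame (an assumption): 25 distinct values, 10 rows each.
df = pd.DataFrame({'data': np.repeat(np.arange(5, 30) * 0.0001, 10)})

# Round first so floating-point noise cannot split one value into two,
# then count occurrences and order the result by value.
counts = df['data'].round(4).value_counts().sort_index()
```

counts.plot(kind='bar') would then draw one bar per unique value.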
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
value = [8904,8953,8977,9147,9243,9320]
bin = np.arange(0,70,10)
ax.hist(value, bins=bin)
plt.grid(True)
plt.show()
I am trying to plot a histogram with the value array on the x-axis and the y-axis will be the bin. But when I run the code I get an empty chart. Could anyone please help me out. Thank you
First thing I see is that in your values array, your data points aren't separated by commas.
Second thing, your values are outside the ranges of your bins. All your values are well into the thousands, and your bins' range is between 0 and 70.
Here is my edited version of your code (I included my import statements to make things clear). I changed the values so that most of them fall within your bin range:
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
value = [7, 8, 15, 45, 50, 80]
bin = np.arange(0,70,10)
ax.hist(value, bins=bin)
plt.grid(True)
plt.show()
The result I get is this image, which illustrates what's going on. The data point 80 is outside the bin range, and therefore isn't shown at all, just like the data points you originally had. Other than that, all data points are shown in the histogram.
Hope this helps!
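To keep the original values instead of changing them, the bin edges can be derived from the data. A sketch under the assumption that a bin width of 100 suits these numbers (width, lo and hi are names introduced here):

```python
import numpy as np

value = [8904, 8953, 8977, 9147, 9243, 9320]
width = 100  # bin width; an assumption chosen to suit this data

# Round the data range outwards to multiples of the bin width.
lo = np.floor(min(value) / width) * width
hi = np.ceil(max(value) / width) * width
bins = np.arange(lo, hi + width, width)  # 8900, 9000, ..., 9400

# Count how many values land in each bin.
counts, edges = np.histogram(value, bins=bins)
```

Passing the same bins array to ax.hist(value, bins=bins) draws all six data points.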
Edit: you said in a comment to this answer that you want it to be horizontal, not vertical. You can add orientation="horizontal" as an argument to your ax.hist call. The new code looks like this:
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
value = [7, 8, 15, 45, 50, 80]
bin = np.arange(0,70,10)
ax.hist(value, bins=bin, orientation="horizontal")
plt.grid(True)
plt.show()
Your plot should now look like this.
I am having trouble using the pyplot.hist function to plot 2 histograms on the same figure. For each binning interval, I want the 2 bars to be centered between the bins (Python 3.6 user). To illustrate, here is an example:
import numpy as np
from matplotlib import pyplot as plt
bin_width=1
A=10*np.random.random(100)
B=10*np.random.random(100)
bins=np.arange(0,np.round(max(A.max(),B.max())/bin_width)*bin_width+2*bin_width,bin_width)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.hist(A,bins,color='Orange',alpha=0.8,rwidth=0.4,align='mid',label='A')
ax.hist(B,bins,color='Orange',alpha=0.8,rwidth=0.4,align='mid',label='B')
ax.legend()
ax.set_ylabel('Count')
I get this (image: Histogram_1).
The A and B series are overlapping, which is not good. Knowing there are only 3 options for 'align' (centered on the left bin edge, middle of the 2 edges, centered on the right bin edge), I see no option other than modifying the bins, by adding:
bins-=0.25*bin_width
Before plotting A, and adding:
bins+=0.5*bin_width
Before plotting B. That gives me: Histogram
That's better! However, I had to modify the binning, so it is not the same for A and B.
I searched for a simple way to use the same bins, and then shift the 1st and 2nd plot so they are correctly displayed in the binning intervals, but I didn't find it. Any advice?
I hope I explained my problem clearly.
As mentioned in the comment above, you do not need the hist plotting function. Use NumPy's histogram function and plot its results with matplotlib's bar function.
You can calculate the bar width from the bin width and the number of data series. The ticks can be adjusted with the xticks method:
import numpy as np
import matplotlib.pyplot as plt
A=10*np.random.random(100)
B=10*np.random.random(100)
bins=20
# calculate heights and bins for both lists
ahist, abins = np.histogram(A, bins)
bhist, bbins = np.histogram(B, abins)
fig = plt.figure()
ax = fig.add_subplot(111)
# calc bin width for two lists
w = (bbins[1] - bbins[0])/3.
# plot bars
ax.bar(abins[:-1]-w/2.,ahist,width=w,color='r')
ax.bar(bbins[:-1]+w/2.,bhist,width=w,color='orange')
# adjust xticks
plt.xticks(abins[:-1], np.arange(bins))
plt.show()
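As an alternative sketch (not the approach of the answer above), hist itself accepts a list of data sets and then draws the bars side by side within each bin, using one shared set of edges. The colors and the fixed 0 to 10 range are assumptions made for this example:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
A = 10 * rng.random(100)
B = 10 * rng.random(100)
bins = np.arange(0, 11, 1)  # shared edges 0, 1, ..., 10

fig, ax = plt.subplots()
# Passing a list of arrays makes hist place the bars next to each other
# inside every bin instead of on top of each other.
counts, edges, _ = ax.hist([A, B], bins=bins,
                           color=['red', 'orange'], label=['A', 'B'])
ax.set_xticks(edges)
ax.legend()
ax.set_ylabel('Count')
```

Both series are binned with identical edges, so the counts stay directly comparable.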
I think this is a simple question, but I just still can't seem to think of a simple solution. I have a set of data of molecular abundances, with values ranging many orders of magnitude. I want to represent these abundances with boxplots (box-and-whiskers plots), and I want the boxes to be calculated on log scale because of the wide range of values.
I know I can just calculate the log10 of the data and send it to matplotlib's boxplot, but this does not retain the logarithmic scale in plots later.
So my question is basically this:
When I have calculated a boxplot based on the log10 of my values, how do I convert the plot afterward to be shown on a logarithmic scale instead of linear with the log10 values?
I can change tick labels to partly fix this, but I have no clue how I get logarithmic scales back to the plot.
Or is there another, more direct way of plotting this? Maybe a different package that already has this option included?
Many thanks for the help.
I'd advise against doing the boxplot on the raw values and setting the y-axis to logarithmic, because the boxplot function is not designed to work across orders of magnitude and you may get too many outliers (depends on your data, of course).
Instead, you can plot the logarithm of the data and manually adjust the y-labels.
Here is a very crude example:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
values = 10 ** np.random.uniform(-3, 3, size=100)
fig = plt.figure(figsize=(9, 3))
ax = plt.subplot(1, 3, 1)
ax.boxplot(np.log10(values))
ax.set_yticks(np.arange(-3, 4))
ax.set_yticklabels(10.0**np.arange(-3, 4))
ax.set_title('log')
ax = plt.subplot(1, 3, 2)
ax.boxplot(values)
ax.set_yscale('log')
ax.set_title('raw')
ax = plt.subplot(1, 3, 3)
ax.boxplot(values, whis=[5, 95])
ax.set_yscale('log')
ax.set_title('5%')
plt.show()
The middle panel shows the box plot on the raw values. This leads to many outliers, because the maximum whisker length is computed as a multiple (default: 1.5) of the interquartile range (the box height), which does not scale across orders of magnitude.
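The effect of the 1.5 × IQR rule can be checked numerically without plotting. The helper count_outliers below is a hypothetical function written for this illustration; it mimics the default whisker rule but is not matplotlib's own code:

```python
import numpy as np

rng = np.random.default_rng(42)
values = 10 ** rng.uniform(-3, 3, size=1000)  # spans six orders of magnitude

def count_outliers(data, whis=1.5):
    """Count points beyond whis * IQR from the quartiles."""
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return int(np.count_nonzero((data < lower) | (data > upper)))

raw_outliers = count_outliers(values)            # fences computed on raw values
log_outliers = count_outliers(np.log10(values))  # fences computed on log10 values
```

For data like this, the raw values produce many points beyond the upper fence, while the log10 values produce none.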
Alternatively, you could specify to draw the whiskers for a given percentile range:
ax.boxplot(values, whis=[5, 95])
In this case you get a fixed fraction of outliers (5%) above and below.
You can use plt.yscale:
plt.boxplot(data); plt.yscale('log')
I would like to compare two histograms by having the Y axis show the percentage of each column from the overall dataset size instead of an absolute value. Is that possible? I am using Pandas and matplotlib.
Thanks
Using density=True (normed=True for matplotlib < 2.2.0) returns a histogram for which np.sum(pdf * np.diff(bins)) equals 1, where pdf holds the bar heights. If you want the sum of the bar heights to be 1, you can use NumPy's histogram() and normalize the results yourself.
import numpy as np
import matplotlib.pyplot as plt
x = np.random.randn(30)
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
# left: area under the bars sums to 1
ax[0].hist(x, density=True, color='grey')
hist, bins = np.histogram(x)
# right: bar heights sum to 1; align='edge' makes each bar span its bin
ax[1].bar(bins[:-1], hist / hist.sum(), width=np.diff(bins), align='edge', color='grey')
ax[0].set_title('density=True')
ax[1].set_title('hist = hist / hist.sum()')
Btw: Strange plotting glitch at the first bin of the left plot.
Pandas plotting can accept any extra keyword arguments from the respective matplotlib function. So for completeness from the comments of others here, this is how one would do it:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(100,2), columns=list('AB'))
df.hist(density=1)
Also, for direct comparison this may be a good way as well:
df.plot(kind='hist', density=1, bins=20, stacked=False, alpha=.5)
Looks like @CarstenKönig found the right way:
df.hist(bins=20, weights=np.ones_like(df[df.columns[0]]) * 100. / len(df))
I know this answer comes 6 years late, but for anyone using density=True (the replacement for normed=True): it may not do what you want. It normalizes the whole distribution so that the total area of the bars is 1. So if your bins have a width < 1, you can expect heights > 1 on the y-axis. If you want your histogram bounded to [0;1], you will have to calculate it yourself.
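The difference between the two normalizations can be verified with NumPy alone. This is only an illustrative sketch on random data; the bin count of 50 is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

# density=True: heights times bin widths sum to 1 (unit area),
# so individual heights can exceed 1 when bins are narrow.
density, edges = np.histogram(x, bins=50, density=True)
area = np.sum(density * np.diff(edges))

# weights of 1/N: each height is the fraction of samples in that bin,
# so the heights themselves sum to 1.
fractions, _ = np.histogram(x, bins=edges,
                            weights=np.full(x.size, 1 / x.size))
```

Multiplying fractions by 100 gives percentages. When plotting, the same weights argument can be passed to plt.hist, and matplotlib.ticker.PercentFormatter(xmax=1) can format the y-axis as percent.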
You can simplify the weighting using np.ones_like():
df["ColumnName"].plot.hist(weights = np.ones_like(df.index) / len(df.index))
Here np.ones_like() works fine with the df.index structure, and len(df.index) is faster for large DataFrames.
I see this is an old question but it shows up on top for some searches, so I think as of 2021 seaborn would be an easy way to do this.
You can do something like this:
import seaborn as sns
sns.histplot(df,stat="probability")
In some scenarios you can adapt with a barplot:
tweets_df['label'].value_counts(normalize=True).plot(figsize=(12,12), kind='bar')
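For categorical data, value_counts(normalize=True) already returns the share of each category, which is what the bar plot above displays. A tiny sketch with made-up labels (the Series here stands in for tweets_df['label']):

```python
import pandas as pd

# Made-up labels for illustration: 5 'pos', 2 'neg', 1 'neu'.
labels = pd.Series(['pos', 'neg', 'pos', 'neu', 'pos', 'neg', 'pos', 'pos'],
                   name='label')

# normalize=True divides each count by the total number of rows.
shares = labels.value_counts(normalize=True)
```

shares.plot(kind='bar') then draws one bar per label with heights that sum to 1.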