Python: Align bars between bin edges for a double histogram

I am having trouble using the pyplot.hist function to plot 2 histograms on the same figure. For each binning interval, I want the 2 bars to be centered between the bin edges (I am using Python 3.6). To illustrate, here is an example:
import numpy as np
from matplotlib import pyplot as plt
bin_width=1
A=10*np.random.random(100)
B=10*np.random.random(100)
bins=np.arange(0,np.round(max(A.max(),B.max())/bin_width)*bin_width+2*bin_width,bin_width)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.hist(A,bins,color='Orange',alpha=0.8,rwidth=0.4,align='mid',label='A')
ax.hist(B,bins,color='Orange',alpha=0.8,rwidth=0.4,align='mid',label='B')
ax.legend()
ax.set_ylabel('Count')
I get this:
Histogram_1
The A and B series overlap, which is not good. Knowing there are only 3 options for 'align' (centered on the left bin edge, in the middle between 2 edges, centered on the right bin edge), I see no option other than modifying the bins, by adding:
bins-=0.25*bin_width
Before plotting A, and adding:
bins+=0.5*bin_width
Before plotting B. That gives me: Histogram
That's better! However, I had to modify the binning, so it is not the same for A and B.
I searched for a simple way to use the same bins, and then shift the 1st and 2nd plot so they are correctly displayed in the binning intervals, but I didn't find it. Any advice?
I hope I explained my problem clearly.

As mentioned in the comment above, you do not need the hist plotting function. Use NumPy's histogram function and plot its results with matplotlib's bar function.
You can calculate the bar width from the bin width and the number of data series. The ticks can be adjusted with the xticks method:
import numpy as np
import matplotlib.pyplot as plt
A=10*np.random.random(100)
B=10*np.random.random(100)
bins=20
# calculate heights and bins for both lists
ahist, abins = np.histogram(A, bins)
bhist, bbins = np.histogram(B, abins)
fig = plt.figure()
ax = fig.add_subplot(111)
# bar width: one third of the bin width
w = (bbins[1] - bbins[0])/3.
# plot bars
ax.bar(abins[:-1]-w/2.,ahist,width=w,color='r')
ax.bar(bbins[:-1]+w/2.,bhist,width=w,color='orange')
# adjust xticks
plt.xticks(abins[:-1], np.arange(bins))
plt.show()
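Note that the offsets above center each pair of bars on the left bin edge. If the pair should instead sit centered between two consecutive edges, as the question asks, the bar centers can be derived from the bin centers. A minimal sketch, assuming the same bin edges are used for every series; `grouped_bar_centers`, its `fill` parameter, and `demo` are hypothetical names introduced here, not part of NumPy or matplotlib:

```python
import numpy as np

def grouped_bar_centers(edges, n_series, fill=0.8):
    """Return (centers, width) for n_series bars per bin, centered in each bin.

    centers has shape (n_series, n_bins); fill is the fraction of each bin
    covered by the whole group of bars (hypothetical parameter).
    """
    edges = np.asarray(edges, dtype=float)
    bin_width = np.diff(edges)               # width of every bin
    mids = edges[:-1] + bin_width / 2        # bin centers
    width = fill * bin_width / n_series      # width of a single bar
    # Bar offsets relative to the bin center, in units of one bar width
    offsets = np.arange(n_series) - (n_series - 1) / 2
    centers = mids[None, :] + offsets[:, None] * width[None, :]
    return centers, width

def demo():
    """Plot two series with identical bins; call this in a plotting session."""
    import matplotlib.pyplot as plt
    rng = np.random.default_rng(0)
    A = 10 * rng.random(100)
    B = 10 * rng.random(100)
    edges = np.arange(0, 11, 1)              # the same bins for both series
    counts = [np.histogram(data, edges)[0] for data in (A, B)]
    centers, width = grouped_bar_centers(edges, n_series=2)
    fig, ax = plt.subplots()
    for x, heights, label in zip(centers, counts, ("A", "B")):
        ax.bar(x, heights, width=width, label=label)
    ax.set_xticks(edges)
    ax.set_ylabel("Count")
    ax.legend()
    plt.show()
```

Calling `demo()` draws both series from the same bin edges. As a simpler alternative, `ax.hist` also accepts a list of datasets, e.g. `ax.hist([A, B], bins)`, in which case the bars are arranged side by side within each bin.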

Related

How to combine 2 dataframe histograms in 1 plot?

I would like to use code that shows histograms of all columns in a dataframe, i.e. df.hist(bins=10). However, I would also like to add a second set of histograms showing the CDF: df_hist=df.hist(cumulative=True,bins=100,density=1,histtype="step")
I tried separating their matplotlib axes by using fig=plt.figure() and plt.subplot(211). But df.hist is a pandas function, not a matplotlib function. I also tried creating axes and passing ax=ax1 and ax=ax2 to each histogram, but it didn't work.
How can I combine these histograms together?
Any help?
The histograms that I want to combine look like these. I want to show them side by side or put the second one on top of the first one.
Sorry that I didn't care to make them look good.
It is possible to draw them together:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# toy data frame
df = pd.DataFrame(np.random.normal(0,1,(100,20)))
# draw hist
fig, axes = plt.subplots(5,4, figsize=(16,10))
df.plot(kind='hist', subplots=True, ax=axes, alpha=0.5)
# clone axes so they have different scales
ax_new = [ax.twinx() for ax in axes.flatten()]
df.plot(kind='kde', ax=ax_new, subplots=True)
plt.show()
Output:
It's also possible to draw them side-by-side. For example:
fig, axes = plt.subplots(10,4, figsize=(16,10))
hist_axes = axes.flatten()[:20]
df.plot(kind='hist', subplots=True, ax=hist_axes, alpha=0.5)
kde_axes = axes.flatten()[20:]
df.plot(kind='kde', subplots=True, ax=kde_axes, alpha=0.5)
This plots the histograms in the top five rows of the grid and the KDE curves in the bottom five rows.
You can find more information in "Multiple histograms in Pandas" (a possible duplicate, by the way), but apparently pandas cannot draw multiple histograms on the same graph.
That is fine, because np.histogram and matplotlib.pyplot can; see the linked question for a more complete answer.
Solution for overlapping histograms with df.hist with any number of subplots
You can combine two dataframe histogram figures by creating twin axes using the grid of axes returned by df.hist. Here is an example of normal histograms combined with cumulative step histograms where the size of the figure and the layout of the grid of subplots are taken care of automatically:
import numpy as np # v 1.19.2
import pandas as pd # v 1.1.3
import matplotlib.pyplot as plt # v 3.3.2
# Create sample dataset stored in a pandas dataframe
rng = np.random.default_rng(seed=1) # random number generator
letters = [chr(i) for i in range(ord('A'), ord('G')+1)]
df = pd.DataFrame(rng.exponential(1, size=(100, len(letters))), columns=letters)
# Set parameters for figure dimensions and grid layout
nplots = df.columns.size
ncols = 3
nrows = int(np.ceil(nplots/ncols))
subp_w = 10/ncols # 10 is the total figure width in inches
subp_h = 0.75*subp_w
bins = 10
# Plot grid of histograms with pandas function (with a shared y-axis)
grid = df.hist(grid=False, sharey=True, figsize=(ncols*subp_w, nrows*subp_h),
               layout=(nrows, ncols), bins=bins, edgecolor='white', linewidth=0.5)
# Create list of twin axes containing second y-axis: note that due to the
# layout, the grid object may contain extra unused axes that are not shown
# (here in the H and I positions). The ax parameter of df.hist only accepts
# a number of axes that corresponds to the number of numerical variables
# in df, which is why the flattened array of grid axes is sliced here.
grid_twinx = [ax.twinx() for ax in grid.flat[:nplots]]
# Plot cumulative step histograms over normal histograms: note that the grid layout is
# preserved in grid_twinx so no need to set the layout parameter a second time here.
df.hist(ax=grid_twinx, histtype='step', bins=bins, cumulative=True, density=True,
        color='tab:orange', linewidth=2, grid=False)
# Adjust space between subplots after generating twin axes
plt.gcf().subplots_adjust(wspace=0.4, hspace=0.4)
plt.show()
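To make explicit what the orange step curves represent: with cumulative=True and density=True, the histogram is normalized so that the last bin equals 1, which makes it an empirical CDF evaluated bin by bin. A sketch that reproduces those values with NumPy alone, using sample data generated like the example above; `cumulative_fraction` is a hypothetical helper name:

```python
import numpy as np

def cumulative_fraction(values, bins=10):
    """Cumulative share of observations up to the right edge of each bin."""
    counts, edges = np.histogram(values, bins=bins)
    cum = np.cumsum(counts) / counts.sum()
    return cum, edges

rng = np.random.default_rng(seed=1)
sample = rng.exponential(1, size=100)   # stand-in for one dataframe column
cum, edges = cumulative_fraction(sample)
print(cum[-1])  # the last bin reaches 1.0
```

Because the cumulative curve always ends at 1 while the plain histogram shows counts, the two need separate y-axes, which is why the twin axes are used above.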
Solution for displaying histograms of different types side-by-side with matplotlib
To my knowledge, it is not possible to show the different types of plots side-by-side with df.hist. You need to create the figure from scratch, like in this example using the same dataset as before:
# Set parameters for figure dimensions and grid layout
nvars = df.columns.size
plot_types = 2 # normal histogram and cumulative step histogram
ncols_vars = 2
nrows = int(np.ceil(nvars/ncols_vars))
subp_w = 10/(plot_types*ncols_vars) # 10 is the total figure width in inches
subp_h = 0.75*subp_w
bins = 10
# Create figure with appropriate size
fig = plt.figure(figsize=(plot_types*ncols_vars*subp_w, nrows*subp_h))
fig.subplots_adjust(wspace=0.4, hspace=0.7)
# Create subplots by adding a new axes per type of plot for each variable
# and create lists of axes of normal histograms and their y-axis limits
axs_hist = []
axs_hist_ylims = []
for idx, var in enumerate(df.columns):
    axh = fig.add_subplot(nrows, plot_types*ncols_vars, idx*plot_types+1)
    axh.hist(df[var], bins=bins, edgecolor='white', linewidth=0.5)
    axh.set_title(f'{var} - Histogram', size=11)
    axs_hist.append(axh)
    axs_hist_ylims.append(axh.get_ylim())
    axc = fig.add_subplot(nrows, plot_types*ncols_vars, idx*plot_types+2)
    axc.hist(df[var], bins=bins, density=True, cumulative=True,
             histtype='step', color='tab:orange', linewidth=2)
    axc.set_title(f'{var} - Cumulative step hist.', size=11)
# Set shared y-axis for histograms
for ax in axs_hist:
    ax.set_ylim(max(axs_hist_ylims))
plt.show()
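The call max(axs_hist_ylims) works because Python compares tuples element by element: every histogram's lower limit is 0, so the comparison falls through to the upper limit. A short illustration with made-up limits, plus a hypothetical helper `common_ylim` that does not rely on the lower limits being equal:

```python
# Made-up (bottom, top) limits, as ax.get_ylim() would return for three histograms
ylims = [(0.0, 31.5), (0.0, 42.0), (0.0, 27.3)]

# Tuples compare lexicographically: the first elements tie, so the second decides
print(max(ylims))  # (0.0, 42.0)

def common_ylim(limits):
    """Smallest bottom and largest top over all (bottom, top) pairs."""
    bottoms, tops = zip(*limits)
    return min(bottoms), max(tops)

# With unequal lower limits, max() alone would pick (0.0, 3.0) here
print(common_ylim([(-1.0, 5.0), (0.0, 3.0)]))  # (-1.0, 5.0)
```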

Adjust spacing on X-axis in python boxplots

I plot boxplots using sns.boxplot and pandas.DataFrame.boxplot in Python 3.x.
Is it possible to adjust the spacing between the boxes, so that the box of Group_b sits farther to the right of the box of Group_a than in the output figures below? Thanks.
Code:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
dict_a = {'value':[1,2,3,7,8,9],'name':['Group_a']*3+['Group_b']*3}
dataframe = pd.DataFrame(dict_a)
sns.boxplot( y="value" , x="name" , data=dataframe )
Output figure:
dataframe.boxplot("value" ,by = "name" )
Output figure 2:
The distance between the two boxes is determined by the x-axis limits. The boxes are a constant distance apart in data units; how far apart they appear depends on what fraction of the displayed x-range that distance takes up.
For example, in the seaborn case, the first box sits at x=0, the second at x=1. The difference is 1 unit. The maximal distance between the two boxplots is hence achieved by setting the x axis limits to those exact limits,
ax.set_xlim(0, 1)
Of course this will cut half of each box.
So a more useful value would be ax.set_xlim(0-val, 1+val) with val being somewhere in the range of the width of the boxes.
One needs to mention that pandas uses different units. The first box is at x=1, the second at x=2. Hence one would need something like ax.set_xlim(1-val, 2+val).
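The relationship can be made concrete: the boxes are 1 data unit apart and the axis spans 1 + 2*val units, so the gap between the box centers takes up 1/(1 + 2*val) of the axis width. A small sketch of that arithmetic; `gap_fraction` is a hypothetical helper, not a matplotlib or seaborn function:

```python
def gap_fraction(val, distance=1.0):
    """Fraction of the x-axis width lying between the two box centers.

    val is the margin added on each side, as in ax.set_xlim(0 - val, 1 + val).
    """
    if val < 0:
        raise ValueError("val must be non-negative")
    return distance / (distance + 2 * val)

# Larger margins push the boxes closer together on screen
for val in (0.0, 0.25, 0.5, 1.0):
    print(f"val={val}: boxes are {gap_fraction(val):.0%} of the axis width apart")
```

The same fraction applies to the pandas plot, since only the positions (1 and 2 instead of 0 and 1) differ, not the distance between them.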
The following would add a slider to the plot to see the effect of different values.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dict_a = {'value':[1,2,3,7,8,9],'name':['Group_a']*3+['Group_b']*3}
dataframe = pd.DataFrame(dict_a)
fig, (ax, ax2, ax3) = plt.subplots(nrows=3,
                                   gridspec_kw=dict(height_ratios=[4,4,1], hspace=1))
sns.boxplot( y="value" , x="name" , data=dataframe, width=0.1, ax=ax)
dataframe.boxplot("value", by = "name", ax=ax2)
from matplotlib.widgets import Slider
slider = Slider(ax3, "", valmin=0, valmax=3)
def update(val):
    ax.set_xlim(-val, 1+val)
    ax2.set_xlim(1-val, 2+val)
slider.on_changed(update)
plt.show()

Second y axis and vertical line

I am creating a violinplot using the following code:
import seaborn as sns
ax = sns.violinplot(data=df[['SoundProduction','SoundForecast','diff']])
ax.set_ylabel("Sound power level [dB(A)]")
It gives me the following result:
Is there any way I can plot diff on a second y-axis so that all three series become clearly visible?
Also, is there a way to plot a vertical line in between 2 series? In this case I want a vertical line between SoundForecast and diff once they are plotted on two different axes.
You can achieve this using multiple subplots, which are easily set up with plt.subplots (the matplotlib gallery has many more subplot examples).
This allows you to display each distribution on an appropriate scale without "wasting" display space. Most (all?) of seaborn's plotting functions accept an ax= argument, so you can choose the axes on which each plot is rendered. The subplots are also clearly separated from one another.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# generate some random distribution data
n = 800 # samples
prod = 95 + 5 * np.random.beta(0.6, 0.5, size=n) # a bimodal distribution
forecast = prod + 3*np.random.randn(n) # forecast is a noisy estimate around the "true" production
diff = prod-forecast # should have mean 0 and standard deviation 3
df = pd.DataFrame(np.array([prod, forecast, diff]).T, columns=['SoundProduction','SoundForecast','diff'])
# set up two subplots, with one wider than the other
fig, ax = plt.subplots(1,2, num=1, gridspec_kw={'width_ratios':[2,1]})
# plot violin distribution estimates separately so the y-scaling makes sense in each group
sns.violinplot(data=df[['SoundProduction','SoundForecast']], ax=ax[0])
sns.violinplot(data=df[['diff']], ax=ax[1])
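If a literal second y-axis is preferred over separate subplots, a twin axes can be combined with a vertical separator line. The following is only a sketch: it uses matplotlib's own Axes.violinplot instead of seaborn, reuses the random data recipe from above, and the x-positions 0, 1, 2 and the separator at x=1.5 are choices made for this illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Random stand-in data, generated as in the example above
rng = np.random.default_rng(0)
n = 800
prod = 95 + 5 * rng.beta(0.6, 0.5, size=n)
forecast = prod + 3 * rng.standard_normal(n)
diff = prod - forecast

fig, ax_left = plt.subplots()
ax_right = ax_left.twinx()  # second y-axis sharing the same x-axis

# Two series on the left y-axis, the difference on the right y-axis
ax_left.violinplot([prod, forecast], positions=[0, 1], showmedians=True)
ax_right.violinplot([diff], positions=[2], showmedians=True)

# Vertical line separating the series that use different y-axes
ax_left.axvline(1.5, color="gray", linestyle="--")

ax_left.set_xticks([0, 1, 2])
ax_left.set_xticklabels(["SoundProduction", "SoundForecast", "diff"])
ax_left.set_ylabel("Sound power level [dB(A)]")
ax_right.set_ylabel("diff [dB(A)]")
ax_left.set_xlim(-0.5, 2.5)  # applies to both axes, since x is shared
```

Call plt.show() or fig.savefig(...) afterwards to display or store the figure.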

Extending the range of bins in seaborn histogram

I'm trying to create a histogram with seaborn, where the bins start at 0 and go to 1. However, there is only data in the range from 0.22 to 0.34. I want the empty space mainly for visual effect, to better present the data.
I create my sheet with
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('svg', 'pdf')
df = pd.read_excel('test.xlsx', sheetname='IvT')
Here I create a variable for my list and one that I think should define the range of the bins of the histogram.
st = pd.Series(df['Short total'])
a = np.arange(0, 1, 15, dtype=None)
And the histogram itself looks like this
sns.set_style("white")
plt.figure(figsize=(12,10))
plt.xlabel('Ration short/total', fontsize=18)
plt.title ('CO3 In vitro transcription, Na+', fontsize=22)
ax = sns.distplot(st, bins=a, kde=False)
plt.savefig("hist.svg", format="svg")
plt.show()
Histogram
It creates a graph, but the range in x goes from 0 to 0.2050 and in y from -0.04 to 0.04, which is completely different from what I expect. I searched Google for quite some time but can't seem to find an answer to my specific problem.
Already, thanks for your help guys.
There are a few approaches to achieve the desired results here. For example, you can change the xaxis limits after you have plotted the histogram, or adjust the range over which the bins are created.
import seaborn as sns
# Load sample data and create a column with values in the suitable range
iris = sns.load_dataset('iris')
iris['norm_sep_len'] = iris['sepal_length'] / (iris['sepal_length'].max()*2)
sns.distplot(iris['norm_sep_len'], bins=10, kde=False)
Change the xaxis limits (the bins are still created over the range of your data):
ax = sns.distplot(iris['norm_sep_len'], bins=10, kde=False)
ax.set_xlim(0,1)
Create the bins over the range 0 to 1:
sns.distplot(iris['norm_sep_len'], bins=10, kde=False, hist_kws={'range':(0,1)})
Since the range for the bins is larger, you now need to use more bins if you want to have the same bin width as when adjusting the xlim:
sns.distplot(iris['norm_sep_len'], bins=45, kde=False, hist_kws={'range':(0,1)})
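A third option is to pass explicit bin edges. Note that np.arange(0, 1, 15) from the question uses 15 as the step, not as the number of bins, so it yields a single edge; np.linspace takes the number of points instead. A sketch with made-up data in the 0.22 to 0.34 range; the choice of 15 bins is an assumption based on the question's code:

```python
import numpy as np

print(np.arange(0, 1, 15))      # [0] -> a step of 15 leaves only one edge
edges = np.linspace(0, 1, 16)   # 16 edges -> 15 equal-width bins covering 0..1

# Made-up stand-in for df['Short total'] from the question
rng = np.random.default_rng(0)
data = rng.uniform(0.22, 0.34, size=200)
counts, _ = np.histogram(data, bins=edges)
print(counts)                   # only the bins around 0.2 to 0.4 are non-zero

# Number of bins over (0, 1) needed to match the width of 10 bins over the data range
width = (data.max() - data.min()) / 10
n_bins_full_range = int(np.ceil(1 / width))
```

The `edges` array can be passed as the bins argument of the plotting call in place of an integer, and `n_bins_full_range` is the kind of calculation behind the larger bin count used in the last example above.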

Get actual numbers instead of normalized value in seaborn KDE plots

I have three dataframes and I plot their KDEs using the seaborn module in Python. The issue is that these plots normalize the area under each curve to 1 (which is how they are intended to work), so the heights shown are normalized. Is there any way to show the actual values instead of the normalized ones? Also, is there any way to find the points of intersection of the curves?
Note: I do not want to use the curve_fit method of scipy, as I am not sure which distribution I will get for each dataframe; it may also be multimodal.
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure()
sns.distplot(data_1['gap'],kde=True,hist=False,label='1')
sns.distplot(data_2['gap'],kde=True,hist=False,label='2')
sns.distplot(data_3['gap'],kde=True,hist=False,label='3')
plt.legend(loc='best')
plt.show()
The output of the code is attached via the link, as I can't post images: plot_link
You can just grab the line and rescale its y-values with set_data:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import seaborn as sns
# create some data
n = 1000
x = np.random.rand(n)
# plot stuff
fig, ax = plt.subplots(1,1)
ax = sns.distplot(x, kde=True, hist=False, ax=ax)
# find the KDE line and rescale its y-values
for child in ax.get_children():
    if isinstance(child, Line2D):
        xs, ys = child.get_data()
        ys = ys * n
        child.set_data(xs, ys)
# update y-limits (not done automatically)
ax.set_ylim(ys.min(), ys.max())
fig.canvas.draw()
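The second part of the question, finding where the curves intersect, can be approached by evaluating each KDE on a common grid and looking for sign changes in their difference. The sketch below is an assumption-laden illustration rather than seaborn's implementation: it hand-rolls a Gaussian KDE with Scott's rule-of-thumb bandwidth, and `kde_scaled` and `intersections` are hypothetical helper names. Note also that density times n is a count per unit of x; to compare it with histogram bar heights, it would additionally be multiplied by the bin width.

```python
import numpy as np

def kde_scaled(samples, grid, bandwidth=None):
    """Gaussian KDE evaluated on grid, scaled by the sample size (density * n)."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    if bandwidth is None:
        # Assumption: Scott's rule of thumb for one-dimensional data
        bandwidth = samples.std(ddof=1) * n ** (-1 / 5)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))
    return density * n

def intersections(grid, y1, y2):
    """x-positions where two curves sampled on the same grid cross."""
    d = y1 - y2
    idx = np.where(np.sign(d[:-1]) * np.sign(d[1:]) < 0)[0]
    # Linear interpolation of the zero of d between neighbouring grid points
    t = d[idx] / (d[idx] - d[idx + 1])
    return grid[idx] + t * (grid[idx + 1] - grid[idx])

# Made-up samples standing in for two of the 'gap' columns
rng = np.random.default_rng(0)
a = rng.normal(0, 1, 2000)
b = rng.normal(2, 1, 2000)
grid = np.linspace(-4, 6, 1001)
ya, yb = kde_scaled(a, grid), kde_scaled(b, grid)
print(intersections(grid, ya, yb))  # crossing(s) expected near x = 1
```

No distribution is fitted here, so the approach also works for multimodal data; such data may simply produce several crossings.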
