Side-by-side boxplots from two pandas in one figure - python

I have two pandas dataframes containing data for three different categories: 'a', 'b' and 'c'.
import pandas as pd
import numpy as np
n=100
df_a = pd.DataFrame({'id': np.ravel([['a' for i in range(n)], ['b' for i in range(n)], ['c' for i in range(n)]]),
'val': np.random.normal(0, 1, 3*n)})
df_b = pd.DataFrame({'id': np.ravel([['a' for i in range(n)], ['b' for i in range(n)], ['c' for i in range(n)]]),
'val': np.random.normal(1, 1, 3*n)})
I would like to illustrate the differences in 'a', 'b' and 'c' between the two dataframes, and for that I want to use boxplots. I.e., for each category ('a', 'b' and 'c'), I want to make side-by-side boxplots - and they should all be in the same figure.
So one figure containing 6 boxplots, 2 per category. What is the easiest way to achieve this?

IIUC:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(3, 2)
for j, df in enumerate([df_a, df_b]):
    for i, cat in enumerate(sorted(df['id'].unique())):
        df[df['id'] == cat].boxplot('val', 'id', ax=axes[i, j])
plt.tight_layout()
plt.show()
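If a single axes is preferred over a grid of subplots, the six boxes can also be drawn with plain matplotlib by giving each box an explicit position. This is only a sketch: the helper name paired_boxplots, the width value and the seeded sample data are assumptions made for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def paired_boxplots(frames, labels, ax=None, width=0.35):
    """Draw one box per (category, dataframe) pair on a single axes.

    Hypothetical helper: assumes every frame has 'id' and 'val' columns
    and that all frames share the same categories.
    """
    if ax is None:
        _, ax = plt.subplots()
    cats = sorted(frames[0]['id'].unique())
    handles = []
    for k, df in enumerate(frames):
        # Shift this dataframe's boxes left or right of the category centre
        offset = (k - (len(frames) - 1) / 2) * width
        data = [df.loc[df['id'] == cat, 'val'].to_numpy() for cat in cats]
        positions = np.arange(len(cats)) + offset
        bp = ax.boxplot(data, positions=positions, widths=width * 0.9,
                        patch_artist=True)
        for box in bp['boxes']:
            box.set_facecolor(f'C{k}')
        handles.append(bp['boxes'][0])
    # One tick per category, centred between its boxes
    ax.set_xticks(np.arange(len(cats)))
    ax.set_xticklabels(cats)
    ax.legend(handles, labels)
    return ax


# Sample data shaped like the question's dataframes (seeded for repeatability)
n = 100
rng = np.random.default_rng(0)
ids = np.repeat(['a', 'b', 'c'], n)
df_a = pd.DataFrame({'id': ids, 'val': rng.normal(0, 1, 3 * n)})
df_b = pd.DataFrame({'id': ids, 'val': rng.normal(1, 1, 3 * n)})
ax = paired_boxplots([df_a, df_b], ['df_a', 'df_b'])
```

Call plt.show() afterwards to display the figure. The offset arithmetic generalises to more than two dataframes, although the boxes start to overlap neighbouring categories once len(frames) * width exceeds 1.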

Does this help? I tried to make it somewhat dynamic and flexible:
import matplotlib.pyplot as plt
import pandas
import seaborn as sns
ids = [val for val in df_a["id"].unique() for _ in (0, 1)]
fig, ax = plt.subplots(len(ids)//2,2, figsize=(10,10))
plt.subplots_adjust(hspace=0.5, wspace=0.3)
plt.suptitle("df_a vs. df_b")
ax = ax.ravel()
for i, id in enumerate(ids):
    if i%2 == 0:
        ax[i] = sns.boxplot(x=df_a[df_a.id == id]["val"], ax = ax[i])
    else:
        ax[i] = sns.boxplot(x=df_b[df_b.id == id]["val"], ax = ax[i])
    ax[i].set_title(id)
sns.despine()

You could add an extra column to indicate the dataset and then concatenate the dataframes:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
n = 100
df_a = pd.DataFrame({'id': np.ravel([['a' for i in range(n)], ['b' for i in range(n)], ['c' for i in range(n)]]),
'val': np.random.normal(0, 1, 3 * n)})
df_b = pd.DataFrame({'id': np.ravel([['a' for i in range(n)], ['b' for i in range(n)], ['c' for i in range(n)]]),
'val': np.random.normal(1, 1, 3 * n)})
df_a['dataset'] = 'set a'
df_b['dataset'] = 'set b'
sns.boxplot(data=pd.concat([df_a, df_b]), x='id', y='val', hue='dataset', palette='spring')
plt.tight_layout()
plt.show()
PS: Note that in matplotlib (and seaborn, which builds upon it), a figure is a plot with one or more subplots (referred to as ax). As you write figure instead of plot, it might give the impression that you want multiple subplots. You can use sns.catplot(...., kind='box') to create multiple subplots from the concatenated dataframe.
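For the multi-subplot variant mentioned in the PS, something along these lines should work. It is a sketch under assumptions: the choice of col='id' with x='dataset' is one possible layout (the PS leaves the arguments open), and the sample data is seeded here only to keep the example repeatable.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Same structure as the dataframes above, with the extra 'dataset' column
n = 100
rng = np.random.default_rng(0)
ids = np.repeat(['a', 'b', 'c'], n)
df_a = pd.DataFrame({'id': ids, 'val': rng.normal(0, 1, 3 * n), 'dataset': 'set a'})
df_b = pd.DataFrame({'id': ids, 'val': rng.normal(1, 1, 3 * n), 'dataset': 'set b'})
combined = pd.concat([df_a, df_b], ignore_index=True)

# One subplot per category; inside each subplot, one box per dataset
g = sns.catplot(data=combined, x='dataset', y='val', col='id', kind='box')
g.set_titles('{col_name}')
```

sns.catplot is figure-level, so it creates its own figure and returns a FacetGrid rather than accepting an ax argument.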

Related

Place ellipsis on seaborn catplot

I have a seaborn.catplot that looks like this:
What I am trying to do is highlight differences in the graph with the following rules:
If A-B > 4, color it green
If A-B < -1, color it red
If A-B < 2 and >= 0, color it blue
I am looking to produce something akin to the below image:
I have an MRE here:
# Stack Overflow Example
import numpy as np, pandas as pd, seaborn as sns
import matplotlib.pyplot as plt
from random import choice
from string import ascii_lowercase, digits
chars = ascii_lowercase + digits
lst = [''.join(choice(chars) for _ in range(2)) for _ in range(100)]
np.random.seed(8)
t = pd.DataFrame(
{
'Key': [''.join(choice(chars) for _ in range(2)) for _ in range(5)]*2,
'Value': np.random.uniform(low=1, high=10, size=(10,)),
'Type': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B']
}
)
ax = sns.catplot(data=t, x='Value', y='Key', hue='Type', palette="dark").set(title="Stack Overflow Help Me")
plt.show()
I believe an ellipse will need to be plotted around the points of interest, and I have looked into some questions:
Creating a Confidence Ellipses in a sccatterplot using matplotlib
plot ellipse in a seaborn scatter plot
But none seem to be doing this with catplot in particular, or with customizing their color and with rules.
How can I achieve the desired result with my toy example?
You could create ellipses around the midpoint of A and B, using the distance between A and B, increased by some padding, as width. The height should be a bit smaller than 1.
To get a full outline and a transparent inner color, to_rgba() can be used. Setting the zorder to a low number puts the ellipse behind the scatter points.
sns.scatterplot is an axes-level equivalent for sns.catplot, and is easier to work with when there is only one subplot.
Making the Key column of type pd.Categorical gives a fixed relation between y-position and label.
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
from matplotlib.colors import to_rgba
import seaborn as sns
import pandas as pd
import numpy as np
from string import ascii_lowercase, digits
chars = ascii_lowercase + digits
num = 9
df = pd.DataFrame({'Key': [''.join(np.random.choice([*chars], 2)) for _ in range(num)] * 2,
'Value': np.random.uniform(low=1, high=10, size=2 * num),
'Type': np.repeat(['A', 'B'], num)})
df['Key'] = pd.Categorical(df['Key']) # make the key categorical for a consistent ordering
sns.set_style('white')
ax = sns.scatterplot(data=df, x='Value', y='Key', hue='Type', palette="dark")
df_grouped = df.groupby(['Key', 'Type'])['Value'].mean().unstack()
for y_pos, y_label in enumerate(df['Key'].cat.categories):
    A = df_grouped.loc[y_label, 'A']
    B = df_grouped.loc[y_label, 'B']
    dif = A - B
    color = 'limegreen' if dif > 4 else 'crimson' if dif < -1 else 'dodgerblue' if 0 <= dif < 2 else None
    if color is not None:
        ell = Ellipse(xy=((A + B) / 2, y_pos), width=abs(dif) + 0.8, height=0.8,
                      fc=to_rgba(color, 0.1), lw=1, ec=color, zorder=0)
        ax.add_patch(ell)
plt.tight_layout()
plt.show()
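The chained conditional that picks the color can be easier to check when it is pulled out into a small function. The name ellipse_color is made up for this sketch; the thresholds and color names are the ones used in the code above.

```python
def ellipse_color(a, b):
    """Return the highlight color for a pair of values, or None for no ellipse.

    Hypothetical helper mirroring the rules above:
    A-B > 4 is green, A-B < -1 is red, 0 <= A-B < 2 is blue.
    """
    dif = a - b
    if dif > 4:
        return 'limegreen'
    if dif < -1:
        return 'crimson'
    if 0 <= dif < 2:
        return 'dodgerblue'
    return None  # differences in [-1, 0) and [2, 4] are left unmarked


print(ellipse_color(9, 1))    # limegreen (difference is 8)
print(ellipse_color(1, 3))    # crimson (difference is -2)
print(ellipse_color(3, 2))    # dodgerblue (difference is 1)
print(ellipse_color(5, 2))    # None (difference is 3)
```

Written this way it is also visible that the rules leave two gaps, which is why some rows in the plot get no ellipse at all.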

How to label these points on the scatter plot

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_excel("path to the file")
fig, ax = plt.subplots()
fig.set_size_inches(7,3)
df = pd.DataFrame(data, columns = ['Player', 'Pos', 'Age'])
df.plot.scatter(x='Age',
y='Pos',
c='DarkBlue', xticks=([15,20,25,30,35,40]))
plt.show()
I got the plot, but I am not able to label these points.
Provided you'd like to label each point, you can loop over each coordinate plotted, assigning it a label using plt.text() at the plotted point's position, like so:
from matplotlib import pyplot as plt
y_points = [i for i in range(0, 20)]
x_points = [(i*3) for i in y_points]
offset = 5
plt.figure()
plt.grid(True)
plt.scatter(x_points, y_points)
for i in range(0, len(x_points)):
    plt.text(x_points[i] - offset, y_points[i], f'{x_points[i]}')
plt.show()
In the above example it will give the following:
The offset is just to make the labels more readable so that they're not right on top of the scattered points.
Obviously we don't have access to your spreadsheet, but the same basic concept would apply.
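Applied to a dataframe with the question's column names, the idea might look like the sketch below. The rows are invented stand-ins for the spreadsheet, and the 0.3 offset is an arbitrary choice; only the column names 'Player', 'Pos' and 'Age' come from the question.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows standing in for the Excel file, which is not available here
df = pd.DataFrame({
    'Player': ['Player 1', 'Player 2', 'Player 3', 'Player 4'],
    'Pos': ['GK', 'DF', 'MF', 'FW'],
    'Age': [31, 24, 27, 19],
})

fig, ax = plt.subplots(figsize=(7, 3))
ax.scatter(df['Age'], df['Pos'], c='DarkBlue')
ax.set_xticks([15, 20, 25, 30, 35, 40])

# Write each player's name slightly to the right of its point
x_offset = 0.3
for row in df.itertuples(index=False):
    ax.text(row.Age + x_offset, row.Pos, row.Player)
```

Call plt.show() to display the result. With many points the labels will overlap, so in practice it may be worth labelling only a subset of the rows.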
EDIT
For non-numerical values, you can simply use the string as the coordinate. This can be done like so:
from matplotlib import pyplot as plt
y_strings = ['a', 'b', 'c', 'd', 'a', 'b', 'c', 'd']
x_values = [i for i, string in enumerate(y_strings)]
# Plot coordinates:
plt.scatter(x_values, y_strings)
for i, string in enumerate(y_strings):
    plt.text(x_values[i], string, f'{x_values[i]}:{string}')
plt.grid(True)
plt.show()
Which will provide the following output:

Matplotlib - plotting grouped values with a for loop

I'm trying to plot a graph grouped by column values using a for loop without knowing the number of unique values in that column.
You can see sample code below (without a for loop) and the desired output.
I would like each plot to have a different color and marker (as seen below).
This is the code:
import pandas as pd
import matplotlib.pyplot as plt
from numpy import random
df = pd.DataFrame(data = random.randn(5,4), index = ['A','B','C','D','E'],
columns = ['W','X','Y','Z'])
df['W'] = ['10/01/2018 12:00:00','10/03/2018 13:00:00',
'10/03/2018 12:30:00','10/04/2018 12:05:00',
'10/08/2018 12:00:15']
df['W']=pd.to_datetime(df['W'])
df['Entity'] = ['C201','C201','C201','C202','C202']
print(df.head())
fig, ax = plt.subplots()
df[df['Entity']=="C201"].plot(x="W",y="Y",label='C201',ax=ax,marker='x')
df[df['Entity']=="C202"].plot(x="W",y="Y",label='C202',ax=ax, marker='o')
This is the output:
You can first find out the unique values of your df['Entity'] and then loop over them. To generate new markers automatically for each Entity, you can define an order of some markers (let's say 5 in the answer below) which will repeat via marker=next(marker).
Complete minimal answer
import itertools
import pandas as pd
from numpy import random
import matplotlib.pyplot as plt
marker = itertools.cycle(('+', 'o', '*', '^', 's'))
df = pd.DataFrame(data = random.randn(5,4), index = ['A','B','C','D','E'],
columns = ['W','X','Y','Z'])
df['W'] = ['10/01/2018 12:00:00','10/03/2018 13:00:00',
'10/03/2018 12:30:00','10/04/2018 12:05:00',
'10/08/2018 12:00:15']
df['W']=pd.to_datetime(df['W'])
df['Entity'] = ['C201','C201','C201','C202','C202']
fig, ax = plt.subplots()
# np was never imported above, so use pandas' own unique(); sorted() keeps np.unique's ordering
for idy in sorted(df['Entity'].unique()):
    df[df['Entity']==idy].plot(x="W",y="Y", label=idy, ax=ax, marker=next(marker))
plt.legend()
plt.show()
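An alternative that avoids filtering the dataframe once per entity is to iterate over df.groupby('Entity') directly. This is a sketch rather than the answer's own method: the seeded random values and the sort_values call are additions made here for illustration.

```python
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'W': pd.to_datetime(['10/01/2018 12:00:00', '10/03/2018 13:00:00',
                         '10/03/2018 12:30:00', '10/04/2018 12:05:00',
                         '10/08/2018 12:00:15']),
    'Y': rng.standard_normal(5),
    'Entity': ['C201', 'C201', 'C201', 'C202', 'C202'],
})

markers = itertools.cycle(('+', 'o', '*', '^', 's'))
fig, ax = plt.subplots()

# groupby yields (key, sub-dataframe) pairs, one per unique Entity value
for name, group in df.groupby('Entity'):
    # Sorting by time keeps each line from doubling back on itself
    group.sort_values('W').plot(x='W', y='Y', label=name, ax=ax,
                                marker=next(markers))
ax.legend()
```

Matplotlib picks a new color from its default cycle for every line, so only the markers need to be cycled explicitly.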

Remove empty bars from grouped barplot

I have a grouped barplot. It's working very well, but I am trying to remove the empty bars. They take up too much space.
I have already tried :
%matplotlib inline
import matplotlib as mpl
from matplotlib.gridspec import GridSpec
import matplotlib.pyplot as plt
import sys
import os
import glob
import seaborn as sns
import pandas as pd
import ggplot
from ggplot import aes
sns.set(style= "whitegrid", palette="pastel", color_codes=True )
tab_folder = 'myData'
out_folder ='myData/plots'
tab = glob.glob('%s/R*.tab'%(tab_folder))
#is reading all my data
for i, tab_file in enumerate(tab):
    folder,file_name=os.path.split(tab_file)
    s_id=file_name[:-4].replace('DD','')
    # pd.DataFrame.from_csv was removed in pandas 1.0; this read_csv call is its closest equivalent
    df=pd.read_csv(tab_file, sep='\t', index_col=0, parse_dates=True)
    df_2 = df.groupby(['name','ab']).size().reset_index(name='count')
    df_2 = df_2[df_2['count'] != 0]
    table = pd.pivot_table(df_2, index='name',columns='ab', values='count' )
    fig, ax = plt.subplots()  # ax was not defined in the posted snippet
    table.plot(kind='barh', width = 0.9, color = ['b', 'g', 'r'], ax = ax)
    for label in (ax.get_xticklabels() + ax.get_yticklabels()):
        label.set_fontsize(4)
    ax.set_title(s_id).update({'color':'black', 'size':5, 'family':'monospace'})
    ax.set_xlabel('')
    ax.set_ylabel('')
    handles, labels = ax.get_legend_handles_labels()
    ax.legend(handles[::-1], labels[::-1], bbox_to_anchor=(1, 1.05),prop= {'size': 4} )
    png_t = '%s/%s.b.png'%(out_folder,s_id)
    plt.savefig(png_t, dpi = 500)
But it's not working. The bars are still the same.
Is there any other method to remove empty bars?
Your question is not complete. I don't know exactly what you're trying to accomplish, but from what you've said I'd guess that you are trying not to display empty pivot pairs.
This is not possible with the standard plotting tools in pandas. A grouped plot reserves a slot for every group/column combination, including the NaN ones, which are drawn as "empty bars".
Furthermore, after groupby every group has a size of at least one, so df_2['count'] != 0 is always true and the filter removes nothing.
For example
df = pd.DataFrame([['nameA', 'abA'], ['nameB', 'abA'],['nameA','abB'],['nameD', 'abD']], columns=['names', 'ab'])
df_2 = df.groupby(['names', 'ab']).size().reset_index(name='count')
df_2 = df_2[df_2['count'] != 0] # this line has no effect
table = pd.pivot_table(df_2, index='names',columns='ab', values='count' )
table
gives
ab      abA   abB   abD
names
nameA  1.00  1.00   NaN
nameB  1.00   NaN   NaN
nameD   NaN   NaN  1.00
and
table.plot(kind='barh', width = 0.9, color = ['b', 'g', 'r'])
shows
And that's the way it is. The plot needs to show all groups after the pivot.
EDIT
You can also use a stacked plot to get rid of the gaps:
table.plot(kind='barh', width = 0.9, color = ['b', 'g', 'r'], stacked=True)
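If the bars have to stay side by side, one workaround is to bypass DataFrame.plot and draw only the non-NaN cells with matplotlib, packing them at consecutive y positions. The function below is a hypothetical sketch (plot_compact_barh, bar_height and group_gap are names invented here), not something pandas provides.

```python
import pandas as pd
import matplotlib.pyplot as plt


def plot_compact_barh(table, ax=None, bar_height=0.8, group_gap=0.5):
    """Horizontal grouped bars that skip NaN cells instead of leaving gaps."""
    if ax is None:
        _, ax = plt.subplots()
    # One fixed color per column so the legend stays meaningful
    colors = {col: f'C{i}' for i, col in enumerate(table.columns)}
    labelled = set()
    tick_pos, tick_labels = [], []
    y = 0.0
    for name, row in table.iterrows():
        present = row.dropna()
        if present.empty:
            continue
        first = y
        for col, value in present.items():
            # Labels starting with an underscore are left out of the legend
            label = col if col not in labelled else '_nolegend_'
            ax.barh(y, value, height=bar_height, color=colors[col], label=label)
            labelled.add(col)
            y += 1.0
        # Put the group's tick in the middle of the bars just drawn
        tick_pos.append((first + y - 1.0) / 2)
        tick_labels.append(name)
        y += group_gap
    ax.set_yticks(tick_pos)
    ax.set_yticklabels(tick_labels)
    ax.legend()
    return ax


df = pd.DataFrame([['nameA', 'abA'], ['nameB', 'abA'], ['nameA', 'abB'], ['nameD', 'abD']],
                  columns=['names', 'ab'])
table = df.groupby(['names', 'ab']).size().unstack()
ax = plot_compact_barh(table)
```

The trade-off is that groups no longer have equal height, so a given column's bar sits at a different place within each group and has to be identified by its color.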

Multiple histograms in Pandas

I would like to create the following histogram (see image below) taken from the book "Think Stats". However, I cannot get them on the same plot. Each DataFrame takes its own subplot.
I have the following code:
import nsfg
import matplotlib.pyplot as plt
df = nsfg.ReadFemPreg()
preg = nsfg.ReadFemPreg()
live = preg[preg.outcome == 1]
first = live[live.birthord == 1]
others = live[live.birthord != 1]
#fig = plt.figure()
#ax1 = fig.add_subplot(111)
first.hist(column = 'prglngth', bins = 40, color = 'teal', \
alpha = 0.5)
others.hist(column = 'prglngth', bins = 40, color = 'blue', \
alpha = 0.5)
plt.show()
The above code does not work when I use ax = ax1 as suggested in "pandas multiple plots not working as hists", and the example in "Overlaying multiple histograms using pandas" does not do what I need either. When I run the code as it is, it creates two windows with histograms. Any ideas how to combine them?
Here's an example of how I'd like the final figure to look:
As far as I can tell, pandas can't handle this situation. That's ok since all of their plotting methods are for convenience only. You'll need to use matplotlib directly. Here's how I do it:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas
#import seaborn
#seaborn.set(style='ticks')
np.random.seed(0)
df = pandas.DataFrame(np.random.normal(size=(37,2)), columns=['A', 'B'])
fig, ax = plt.subplots()
a_heights, a_bins = np.histogram(df['A'])
b_heights, b_bins = np.histogram(df['B'], bins=a_bins)
width = (a_bins[1] - a_bins[0])/3
ax.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue')
ax.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='seagreen')
#seaborn.despine(ax=ax, offset=10)
And that gives me:
In case anyone wants to plot one histogram over another (rather than alternating bars) you can simply call .hist() consecutively on the series you want to plot:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas
np.random.seed(0)
df = pandas.DataFrame(np.random.normal(size=(37,2)), columns=['A', 'B'])
df['A'].hist()
df['B'].hist()
This gives you:
Note that the order in which you call .hist() matters (the first one will be at the back).
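When overlaying this way, each call picks its own bin edges by default, so the bars of the two series do not line up. A possible refinement, sketched here with shared edges from np.histogram_bin_edges and some transparency (the bin count of 10 and alpha of 0.5 are arbitrary choices):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(37, 2)), columns=['A', 'B'])

# Compute the bin edges once over both columns so the bars line up
edges = np.histogram_bin_edges(df[['A', 'B']].to_numpy().ravel(), bins=10)

fig, ax = plt.subplots()
for col in ['A', 'B']:
    # alpha < 1 keeps the histogram at the back visible through the front one
    ax.hist(df[col], bins=edges, alpha=0.5, label=col)
ax.legend()
```

Call plt.show() to display the figure outside a notebook.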
A quick solution is to use melt() from pandas and then plot with seaborn.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# make dataframe
df = pd.DataFrame(np.random.normal(size=(200,2)), columns=['A', 'B'])
# plot melted dataframe in a single command
sns.histplot(df.melt(), x='value', hue='variable',
multiple='dodge', shrink=.75, bins=20);
Setting multiple='dodge' makes it so the bars are side-by-side, and shrink=.75 makes it so the pair of bars take up 3/4 of the whole bin.
To help understand what melt() did, these are the dataframes df and df.melt():
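A small sketch of the reshaping, using a 3-row frame (an arbitrary size chosen here for readability) so that both versions fit on screen:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(3, 2)), columns=['A', 'B'])

# melt() stacks the columns: one row per (column name, value) pair
long_df = df.melt()

print(df)       # wide: 3 rows, columns 'A' and 'B'
print(long_df)  # long: 6 rows, columns 'variable' and 'value'
```

The 'variable' column holds the original column names, which is what hue='variable' uses to split the histogram.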
From the pandas website (http://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization-hist):
df4 = pd.DataFrame({'a': np.random.randn(1000) + 1, 'b': np.random.randn(1000),
'c': np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])
plt.figure();
df4.plot(kind='hist', alpha=0.5)
Create two dataframes and draw both histograms on one matplotlib axis:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'data1': np.random.randn(10),
'data2': np.random.randn(10)
})
df2 = df1.copy()
fig, ax = plt.subplots()
df1.hist(column=['data1'], ax=ax)
df2.hist(column=['data2'], ax=ax)
Here is the snippet. In my case I explicitly specified the bins and the range, because I did not remove outliers the way the author of the book did.
fig, ax = plt.subplots()
ax.hist([first.prglngth, others.prglngth], 10, (27, 50), histtype="bar", label=("First", "Other"))
ax.set_title("Histogram")
ax.legend()
Refer to the Matplotlib multihist example that plots data sets of different sizes.
This can be done more briefly:
plt.hist([first.prglngth, others.prglngth], bins = 40, color =('teal','blue'), label=("First", "Other"))
plt.legend(loc='best')
Note that as the number of bins increases, the plot may become visually cluttered.
You could also try the pandas.DataFrame.plot.hist() function, which plots the histogram of each column of the dataframe in the same figure.
The histograms overlap, so readability is limited, but it may be enough for your case:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hist.html
