Boxplot by two groups in pandas

I have the following dataset:
df_plots = pd.DataFrame({'Group':['A','A','A','A','A','A','B','B','B','B','B','B'],
                         'Type':['X','X','X','Y','Y','Y','X','X','X','Y','Y','Y'],
                         'Value':[1,1.2,1.4,1.3,1.8,1.5,15,19,18,17,12,13]})
df_plots
   Group Type  Value
0      A    X    1.0
1      A    X    1.2
2      A    X    1.4
3      A    Y    1.3
4      A    Y    1.8
5      A    Y    1.5
6      B    X   15.0
7      B    X   19.0
8      B    X   18.0
9      B    Y   17.0
10     B    Y   12.0
11     B    Y   13.0
I want to create one boxplot per Group (there are two in the example), and within each plot show one box per Type. I have tried this:
fig, axs = plt.subplots(1,2,figsize=(8,6), sharey=False)
axs = axs.flatten()
for i, g in enumerate(df_plots[['Group','Type','Value']].groupby(['Group','Type'])):
    g[1].boxplot(ax=axs[i])
This results in an IndexError, because the loop tries to create 4 plots.
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-12-8e1150950024> in <module>
3
4 for i, g in enumerate(df[['Group','Type','Value']].groupby(['Group','Type'])):
----> 5 g[1].boxplot(ax=axs[i])
IndexError: index 2 is out of bounds for axis 0 with size 2
Then I tried this:
fig, axs = plt.subplots(1,2,figsize=(8,6), sharey=False)
axs = axs.flatten()
for i, g in enumerate(df_plots[['Group','Type','Value']].groupby(['Group','Type'])):
    g[1].boxplot(ax=axs[i], by=['Group','Type'])
But I get the same problem. The expected result should have only two plots, each with one box-and-whisker per Type. This is a sketch of this idea:
Any help will be greatly appreciated. With this code I can control some aspects of the data that I can't with seaborn.

Use seaborn.catplot:
import seaborn as sns
sns.catplot(data=df_plots, kind='box', col='Group', x='Type', y='Value', hue='Type', sharey=False, height=4)

We can use groupby boxplot to create subplots per Group and then separate each boxplot by Type:
fig, axes = plt.subplots(1, 2, figsize=(8, 6), sharey=False)
df_plots.groupby('Group').boxplot(by='Type', ax=axes)
plt.show()
Or without subplots by passing parameters directly through the function call:
axes = df_plots.groupby('Group').boxplot(by='Type', figsize=(8, 6),
                                         layout=(1, 2), sharey=False)
plt.show()
Data and imports:
import pandas as pd
from matplotlib import pyplot as plt
df_plots = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'Type': ['X', 'X', 'X', 'Y', 'Y', 'Y', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
    'Value': [1, 1.2, 1.4, 1.3, 1.8, 1.5, 15, 19, 18, 17, 12, 13]
})

As @Prune mentioned, the immediate issue is that your groupby() returns four groups (AX, AY, BX, BY), so first fix the indexing and then clean up a couple more issues:
Change axs[i] to axs[i//2] to put groups 0 and 1 on axs[0] and groups 2 and 3 on axs[1].
Add positions=[i] to place the boxplots side by side rather than stacked.
Set the title and xticklabels after plotting (I'm not aware of how to do this in the main loop).
for i, g in enumerate(df_plots.groupby(['Group', 'Type'])):
    g[1].boxplot(ax=axs[i//2], positions=[i])
for i, ax in enumerate(axs):
    ax.set_title('Group: ' + df_plots['Group'].unique()[i])
    ax.set_xticklabels(['Type: X', 'Type: Y'])
Note that mileage may vary depending on version:
                       matplotlib.__version__   pd.__version__
confirmed working      3.4.2                    1.3.1
confirmed not working  3.0.1                    1.2.4
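If the version-dependent behaviour is a concern, one alternative is to skip DataFrame.boxplot in the loop and call matplotlib's Axes.boxplot directly. The following is only a sketch of that idea, not the method used in the answer above: it loops over the two Groups (not the four Group/Type pairs) and passes one array per Type to each subplot. The Agg backend line is an assumption for running outside a notebook and can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df_plots = pd.DataFrame({
    'Group': ['A'] * 6 + ['B'] * 6,
    'Type': ['X', 'X', 'X', 'Y', 'Y', 'Y'] * 2,
    'Value': [1, 1.2, 1.4, 1.3, 1.8, 1.5, 15, 19, 18, 17, 12, 13],
})

groups = sorted(df_plots['Group'].unique())
fig, axs = plt.subplots(1, len(groups), figsize=(8, 6), sharey=False)

# One subplot per Group, one box per Type inside it.
for ax, group in zip(axs, groups):
    sub = df_plots[df_plots['Group'] == group]
    types = sorted(sub['Type'].unique())
    data = [sub.loc[sub['Type'] == t, 'Value'].to_numpy() for t in types]
    ax.boxplot(data)  # boxes are placed at x = 1, 2, ...
    ax.set_xticklabels([f'Type: {t}' for t in types])
    ax.set_title(f'Group: {group}')
```

Because the loop variable is the Group, the number of iterations always matches the number of subplots.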

The immediate problem is that your groupby operation returns four elements (AX, AY, BX, BY), which you're trying to plot individually. You try to use ax=axs[i] ... but i runs 0-3, while you have only the two elements in your flattened structure. There is no axs[2] or axs[3], which raises the given run-time exception.
You need to resolve your referencing one way or the other.
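To make the mismatch concrete, here is a minimal sketch using the question's data. It only counts the groups each groupby call produces and compares that with the two available axes; the variable names n_pairs and n_groups are illustrative.

```python
import pandas as pd

df_plots = pd.DataFrame({
    'Group': ['A'] * 6 + ['B'] * 6,
    'Type': ['X', 'X', 'X', 'Y', 'Y', 'Y'] * 2,
    'Value': [1, 1.2, 1.4, 1.3, 1.8, 1.5, 15, 19, 18, 17, 12, 13],
})

# Grouping by both columns yields one group per (Group, Type) pair.
n_pairs = df_plots.groupby(['Group', 'Type']).ngroups
# Grouping by 'Group' alone yields one group per subplot.
n_groups = df_plots.groupby('Group').ngroups

print(n_pairs, n_groups)  # 4 2
```

With only two axes, the loop index must either be mapped onto them (for example i // 2) or the loop must run over the two Groups only.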

Related

Pandas: Finding maxima of 2d data (integers) in Dataframe

I have a 2d data set of (x,y). x and y are integer values.
How can I use only Pandas code to find all x values where y reaches its maximum (the same absolute maximum occurs at several x values)?
I also want to plot (with pandas.DataFrame.plot) x vs. y and mark the maxima positions.
Example code:
import numpy as np
import pandas as pd
np.random.seed(10)
x = np.arange(100)*0.2
y = np.random.randint(0, 20, size=100)
data = np.vstack((x, y)).T
df = pd.DataFrame(data, columns=['x', 'y'])
ymax = df['y'].max()
df_ymax = df[df['y'] == ymax]
print(df_ymax)
# x y
# 13 2.6 19.0
# 24 4.8 19.0
# 28 5.6 19.0
# 86 17.2 19.0
# 88 17.6 19.0
df.plot(x='x', y='y', figsize=(8, 4),
        ylabel='y', legend=False, style=['b-'])
I have no idea how to mark the maxima values (df_ymax) in the same plot, e.g. using circles. How can that be solved?
The final plot should look like this (here I programmed everything with numpy and matplotlib):

Get the Axes returned by df.plot and reuse it to plot the maxima values:
ax = df.plot(x='x', y='y', figsize=(8, 4), ylabel='y', legend=False, style=['b-'])
df_ymax.plot.scatter(x='x', y='y', color='r', ax=ax)
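For reference, a self-contained sketch of the same idea, assuming the question's random data. It swaps plot.scatter for a second line plot with the style string 'ro' (red circle markers, no connecting line), which is just one way to get circles; the Agg backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove in a notebook
import numpy as np
import pandas as pd

np.random.seed(10)
df = pd.DataFrame({
    'x': np.arange(100) * 0.2,
    'y': np.random.randint(0, 20, size=100),
})

# All rows where y equals the global maximum.
df_ymax = df[df['y'] == df['y'].max()]

# First plot returns the Axes; the second plot reuses it via ax=ax.
ax = df.plot(x='x', y='y', figsize=(8, 4), legend=False, style='b-')
ax.set_ylabel('y')
df_ymax.plot(x='x', y='y', style='ro', ax=ax, legend=False)
```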

Weird shifting of boxplot in pandas boxplot combining it with seaborn pointplot - what is going on?

Imagine I have the following dataframes
import pandas as pd
import seaborn as sns
import numpy as np
d = {'val': [1, 2,3,4], 'a': [1, 1, 2, 2]}
d2 = {'val': [1, 2], 'a': [1, 2]}
df = pd.DataFrame(data=d)
df2 = pd.DataFrame(data=d2)
This will give me two dataframes that look the following:
df =
   val  a
0    1  1
1    2  1
2    3  2
3    4  2
and
df2 =
   val  a
0    1  1
1    2  2
Now I want to create a boxplot of val in df, grouped by the values of a. That is, fix a value of a, say a=1; it has two values of val, 1 and 2, so draw a box at x=1 based on the values {1,2}. Then move on to a=2, which has val={3,4}, so draw a box at x=2 based on the values {3,4}.
Then I simply want to draw a line based on df2, where a is again my x-axis and val my y-axis. The way I did that is the following:
ax = df.boxplot(column=['val'], by = ['a'],meanline=True, showmeans=True, showcaps=True,showbox=True)
sns.pointplot(x='a', y='val', data=df2, ax=ax)
The problem is that the box for a=1 is shifted to a=2 and the box for a=2 has disappeared. I am not sure whether I have an error in my code or whether it is a bug.
If I draw only the boxplot, everything is fine:
ax = df.boxplot(column=['val'], by = ['a'],meanline=True, showmeans=True, showcaps=True,showbox=True)
The boxes are at the right positions, but as soon as I add the pointplot, things no longer work.
Does anyone have an idea what to do?

The problem is that you are plotting categories on the x-axis. Pointplot plots the first item at position 0 while boxplot starts at 1, thus the shift. One possibility is to use a twinned axis:
ax = df.boxplot(column=['val'], by = ['a'])
ax2 = ax.twiny()
sns.pointplot(x='a', y='val', data=df2, ax=ax2)
ax2.xaxis.set_visible(False)
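The offset can also be sidestepped without a second axis. The sketch below is an alternative, not the answer's method: it reads the box positions back from the axes and draws the df2 line at those same positions with plain matplotlib instead of seaborn. It assumes df2 has exactly one row per value of a; the Agg backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'val': [1, 2, 3, 4], 'a': [1, 1, 2, 2]})
df2 = pd.DataFrame({'val': [1, 2], 'a': [1, 2]})

fig, ax = plt.subplots()
df.boxplot(column='val', by='a', ax=ax)

# The boxes sit at x = 1, 2, ... (not 0, 1, ...); reuse those positions.
positions = ax.get_xticks()
line = df2.sort_values('a')  # same order as the boxes
ax.plot(positions, line['val'].to_numpy(), marker='o')
```

Since both artists now share the same x positions, no box is shifted or hidden.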

Plotting: qcut then groupby two variables

I have the following dataset:
import numpy as np
import pandas as pd
df = pd.DataFrame({'cls': [1,2,2,1,2,1,2,1,2,1,2],
                   'x': [10,11,21,21,8,1,4,3,5,6,2],
                   'y': [10,1,2,2,5,2,4,3,8,6,5]})
df['bin'] = pd.qcut(np.array(df['x']), 4)
a = df.groupby(['bin', 'cls'])['y'].mean()
a
This gives me
bin           cls
(0.999, 3.5]  1       2.5
              2       5.0
(3.5, 6.0]    1       6.0
              2       6.0
(6.0, 10.5]   1      10.0
              2       5.0
(10.5, 21.0]  1       2.0
              2       1.5
Name: y, dtype: float64
I want to plot the right-most column (that is, the average of y per cls per bin) per bin per class. That is, for each bin we have two values of y that I would like to plot as points/scatters. Is that possible using matplotlib or seaborn?

You can indeed use seaborn for what you're asking. Does this work?
# import libraries
import matplotlib.pyplot as plt
import seaborn as sns
# set up some plotting options
fig = plt.figure(figsize=(5, 5))
ax = fig.add_subplot(1,1,1)
# we reset index to avoid having to do multi-indexing
a = a.reset_index()
# use seaborn with argument 'hue' to do the grouping
sns.barplot(x="bin", y="y", hue="cls", data=a, ax=ax)
plt.show()
EDIT: I've just noticed that you wanted to plot "points". I wouldn't advise it for this dataset, but you can do that by replacing barplot with catplot. Since catplot is a figure-level function, drop the ax=ax argument when you do.
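If seaborn is not required, the points can also be drawn with pandas alone. The following is a sketch under a few assumptions: it unstacks cls into columns so each class becomes one series, converts the interval labels to strings for the tick labels, and passes observed=True explicitly to groupby. The name means is illustrative, and the Agg backend line is only for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove in a notebook
import pandas as pd

df = pd.DataFrame({'cls': [1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2],
                   'x': [10, 11, 21, 21, 8, 1, 4, 3, 5, 6, 2],
                   'y': [10, 1, 2, 2, 5, 2, 4, 3, 8, 6, 5]})
df['bin'] = pd.qcut(df['x'], 4)

# Mean of y per (bin, cls), then one column per cls.
means = df.groupby(['bin', 'cls'], observed=True)['y'].mean().unstack('cls')
means.index = means.index.astype(str)  # interval labels as plain strings

# style='o' draws markers only, one colour per cls column.
ax = means.plot(style='o', figsize=(6, 4))
ax.set_xticks(range(len(means)))
ax.set_xticklabels(means.index)
ax.set_xlabel('bin')
ax.set_ylabel('mean of y')
```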

How can I group a stacked bar chart?

I'm trying to create a grouped, stacked bar chart.
Currently I have the following DataFrame:
>>> df
                       Value
Rating                     1          2         3
Context Parameter
Total   1          43.312347   9.507902  1.580367
        2          42.862649   9.482205  1.310549
        3          43.710651   9.430811  1.400488
        4          43.209559   9.803418  1.349094
        5          42.541436  10.008994  1.220609
        6          42.978286   9.430811  1.336246
        7          42.734164  10.317358  1.606064
User    1          47.652348  11.138861  2.297702
        2          47.102897  10.589411  1.848152
        3          46.853147  10.139860  1.848152
        4          47.252747  11.138861  1.748252
        5          45.954046  10.239760  1.448551
        6          46.353646  10.439560  1.498501
        7          47.102897  11.338661  1.998002
I'd like to have for each Parameter the bars for Total and User grouped together.
This is the resulting chart with df.plot(kind='bar', stacked=True):
The bars themselves look right, but how do I get the bars for Total and User next to each other for each Parameter, ideally with some margin between the parameters?

The following approach allows grouped and stacked bars at the same time.
First the dataframe is sorted by parameter, context. Then the context is unstacked from the index, creating new columns for every context, value pair.
Finally, three bar plots are drawn over each other to visualize the stacked bars.
import pandas as pd
from matplotlib import pyplot as plt
df = pd.DataFrame(columns=['Context', 'Parameter', 'Val1', 'Val2', 'Val3'],
                  data=[['Total', 1, 43.312347, 9.507902, 1.580367],
                        ['Total', 2, 42.862649, 9.482205, 1.310549],
                        ['Total', 3, 43.710651, 9.430811, 1.400488],
                        ['Total', 4, 43.209559, 9.803418, 1.349094],
                        ['Total', 5, 42.541436, 10.008994, 1.220609],
                        ['Total', 6, 42.978286, 9.430811, 1.336246],
                        ['Total', 7, 42.734164, 10.317358, 1.606064],
                        ['User', 1, 47.652348, 11.138861, 2.297702],
                        ['User', 2, 47.102897, 10.589411, 1.848152],
                        ['User', 3, 46.853147, 10.139860, 1.848152],
                        ['User', 4, 47.252747, 11.138861, 1.748252],
                        ['User', 5, 45.954046, 10.239760, 1.448551],
                        ['User', 6, 46.353646, 10.439560, 1.498501],
                        ['User', 7, 47.102897, 11.338661, 1.998002]])
df.set_index(['Context', 'Parameter'], inplace=True)
df0 = df.reorder_levels(['Parameter', 'Context']).sort_index()
colors = plt.cm.Paired.colors
df0 = df0.unstack(level=-1) # unstack the 'Context' column
fig, ax = plt.subplots()
(df0['Val1']+df0['Val2']+df0['Val3']).plot(kind='bar', color=[colors[1], colors[0]], rot=0, ax=ax)
(df0['Val2']+df0['Val3']).plot(kind='bar', color=[colors[3], colors[2]], rot=0, ax=ax)
df0['Val3'].plot(kind='bar', color=[colors[5], colors[4]], rot=0, ax=ax)
legend_labels = [f'{val} ({context})' for val, context in df0.columns]
ax.legend(legend_labels)
plt.tight_layout()
plt.show()
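The reshaping step is the part that is easiest to lose track of, so here is a small sketch of it on the first two Parameters only (values copied from the data above). It shows what unstacking Context does to the columns and how the cumulative heights for the overdrawn bars are formed; the names wide and top are illustrative.

```python
import pandas as pd

df = pd.DataFrame(columns=['Context', 'Parameter', 'Val1', 'Val2', 'Val3'],
                  data=[['Total', 1, 43.312347, 9.507902, 1.580367],
                        ['Total', 2, 42.862649, 9.482205, 1.310549],
                        ['User', 1, 47.652348, 11.138861, 2.297702],
                        ['User', 2, 47.102897, 10.589411, 1.848152]])

# Index by (Parameter, Context), then move Context into the columns:
# every value column splits into one column per Context.
wide = df.set_index(['Parameter', 'Context']).sort_index().unstack('Context')
print(wide.columns.tolist())

# Height of the tallest (first-drawn) bar: the sum of all three values.
top = wide['Val1'] + wide['Val2'] + wide['Val3']
print(top)
```

Each of the three plot calls then draws a two-column frame (Total, User) per Parameter, which is what places the two contexts side by side.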

Here's a way to do it:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid")
# reshape your data - ensure no index is set initially
df1 = (df
       .set_index(['Parameter','Context'])
       .stack()
       .reset_index()
       .drop(columns='level_2')
       .rename(columns={0:'value'}))
print(df1.head())
   Parameter Context      value
0          1   Total  43.312347
1          1   Total   9.507902
2          1   Total   1.580367
3          2   Total  42.862649
4          2   Total   9.482205
sns.barplot(x = 'Parameter',
            y = 'value',
            hue='Context',
            data=df1,
            errwidth=0.1)

scatter plot by category in pandas

This has been troubling me for the past 30 minutes. What I'd like to do is to scatter plot by category. I took a look at the documentation, but I haven't been able to find the answer there. I looked here, but when I ran that in iPython Notebook, I don't get anything.
Here's my data frame:
time  cpu  wait  category
   8    1   0.5         a
   9    2   0.2         a
   2    3   0.1         b
  10    4   0.7         c
   3    5   0.2         c
   5    6   0.8         b
Ideally, I'd like to have a scatter plot that shows CPU on the x axis, wait on the y axis, and each point on the graph is distinguished by category. So for example, if a=red, b=blue, and c=green then point (1, 0.5) and (2, 0.2) should be red, (3, 0.1) and (6, 0.8) should be blue, etc.
How would I do this with pandas? or matplotlib? whichever does the job.

This is essentially the same answer as @JoeCondron, but a two liner:
cmap = {'a': 'red', 'b': 'blue', 'c': 'yellow'}
df.plot(x='cpu', y='wait', kind='scatter',
        colors=[cmap.get(c, 'black') for c in df.category])
If no color is mapped for the category, it defaults to black.
EDIT:
The above works for Pandas 0.14.1. For 0.16.2, 'colors' needs to be changed to 'c':
df.plot(x='cpu', y='wait', kind='scatter',
        c=[cmap.get(c, 'black') for c in df.category])

I'd create a column with your colors based on category, then do the following, where ax is a matplotlib ax and df is your dataframe:
ax.scatter(df['cpu'], df['wait'], marker = '.', c = df['colors'], s = 100)

You could do
color_map = {'a': 'r', 'b': 'b', 'c': 'y'}
ax = plt.subplot()
x, y = df.cpu, df.wait
colors = df.category.map(color_map)
ax.scatter(x, y, color=colors)
This will give you red for category a, blue for b, yellow for c.
So you can pass a list of color aliases of the same length as the data arrays.
You can check out the myriad available colours here: http://matplotlib.org/api/colors_api.html.
I don't think the plot method is very useful for scatter plots.
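A variation on the answers above, sketched here as an alternative and not as any of the answerers' methods: looping over groupby('category') and calling scatter once per category also gives a legend entry per category. The colours follow the question's example (a=red, b=blue, c=green); the Agg backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'time': [8, 9, 2, 10, 3, 5],
    'cpu': [1, 2, 3, 4, 5, 6],
    'wait': [0.5, 0.2, 0.1, 0.7, 0.2, 0.8],
    'category': ['a', 'a', 'b', 'c', 'c', 'b'],
})
color_map = {'a': 'red', 'b': 'blue', 'c': 'green'}

fig, ax = plt.subplots()
# One scatter call per category, so each gets its own colour and label.
for name, group in df.groupby('category'):
    ax.scatter(group['cpu'], group['wait'], color=color_map[name], label=name)
ax.set_xlabel('cpu')
ax.set_ylabel('wait')
ax.legend(title='category')
```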
