Grouping boxplots in seaborn when input is a DataFrame

I want to plot multiple columns of a pandas DataFrame in one figure, all grouped by another column, using the groupby option of seaborn.boxplot. There is a nice answer for a similar problem in matplotlib (matplotlib: Group boxplots), but since seaborn.boxplot comes with a groupby option, I thought this would be much easier to do in seaborn.
Here we go with a reproducible example that fails:
import seaborn as sns
import pandas as pd
df = pd.DataFrame([[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
[10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
columns=['a1', 'a2', 'a3', 'a4', 'b'])
# display(df)
a1 a2 a3 a4 b
0 2 4 5 6 1
1 4 5 6 7 2
2 5 4 5 5 1
3 10 4 7 8 2
4 9 3 4 6 2
5 3 3 4 4 1
#Plotting by seaborn
sns.boxplot(df[['a1','a2', 'a3', 'a4']], groupby=df.b)
What I get is something that completely ignores groupby option:
Whereas if I do this with one column it works thanks to another SO question Seaborn groupby pandas Series :
sns.boxplot(df.a1, groupby=df.b)
So I would like to get all my columns in one plot (all columns come in a similar scale).
EDIT:
The above SO question was edited and now includes a 'not clean' answer to this problem, but it would be nice if someone has a better idea for this problem.

As the other answers note, the boxplot function is limited to plotting a single "layer" of boxplots, and the groupby parameter only has an effect when the input is a Series and you have a second variable you want to use to bin the observations into each box.
However, you can accomplish what I think you're hoping for with the factorplot function (renamed catplot in seaborn 0.9), using kind="box". First, though, you'll have to "melt" the sample dataframe into what is called long-form or "tidy" format, where each column is a variable and each row is an observation:
df_long = pd.melt(df, "b", var_name="a", value_name="c")
Then it's very simple to plot:
sns.factorplot("a", hue="b", y="c", data=df_long, kind="box")

You can use boxplot directly (I imagine this was not possible when the question was asked, but it is with seaborn version > 0.6).
As explained by @mwaskom, you have to "melt" the sample dataframe into its "long-form" where each column is a variable and each row is an observation:
df_long = pd.melt(df, "b", var_name="a", value_name="c")
# display(df_long.head())
b a c
0 1 a1 2
1 2 a1 4
2 1 a1 5
3 2 a1 10
4 2 a1 9
Then you just plot it:
sns.boxplot(x="a", hue="b", y="c", data=df_long)
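To see what the melt step produces before plotting, here is a minimal pandas-only sketch using the question's sample data. The name medians is introduced only for illustration; the per-group medians it prints correspond to the centre line of each box that the hue-grouped boxplot would draw.

```python
import pandas as pd

df = pd.DataFrame(
    [[2, 4, 5, 6, 1], [4, 5, 6, 7, 2], [5, 4, 5, 5, 1],
     [10, 4, 7, 8, 2], [9, 3, 4, 6, 2], [3, 3, 4, 4, 1]],
    columns=["a1", "a2", "a3", "a4", "b"],
)

# Long form: one row per (original row, a-column) pair, with b kept as an identifier.
df_long = pd.melt(df, id_vars="b", var_name="a", value_name="c")
print(df_long.shape)

# One value per box: the median of c for every combination of a and b.
medians = df_long.groupby(["a", "b"])["c"].median()
print(medians)
```

The long frame is what gets passed as data= to the plotting call, with x="a", hue="b" and y="c" as in the answer.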

The groupby parameter of seaborn's boxplot takes a Series, not a DataFrame, which is why it's not working.
As a workaround, you can do this:
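import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 2, sharey=True)
# one subplot per value of b; grp is a (key, sub-frame) tuple
for i, grp in enumerate(df.filter(regex="a").groupby(by=df.b)):
    sns.boxplot(data=grp[1], ax=ax[i])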
it gives :
Note that df.filter(regex="a") is equivalent to df[['a1','a2', 'a3', 'a4']]
a1 a2 a3 a4
0 2 4 5 6
1 4 5 6 7
2 5 4 5 5
3 10 4 7 8
4 9 3 4 6
5 3 3 4 4
Hope this helps

It isn't really any better than the answer you linked, but I think the way to achieve this in seaborn is using the FacetGrid feature, as the groupby parameter is only defined for Series passed to the boxplot function.
Here's some code - the pd.melt is necessary because (as best I can tell) the facet mapping can only take individual columns as parameters, so the data need to be turned into a 'long' format.
g = sns.FacetGrid(pd.melt(df, id_vars='b'), col='b')
g.map(sns.boxplot, 'value', 'variable')

It's not adding a lot to this conversation, but after struggling with this for longer than warranted (the actual clusters are unusable), I thought I would add my implementation as another example. It's got a superimposed scatterplot (because of how annoying my dataset is), shows melt using indices, and some aesthetic tweaks. I hope this is useful for someone.
Here it is without using column headers (I saw a different thread that wanted to know how to do this using indices):
combined_array: ndarray = np.concatenate([dbscan_output.data, dbscan_output.labels.reshape(-1, 1)], axis=1)
cluster_data_df: DataFrame = DataFrame(combined_array)
If you want to use labelled columns:
column_names: List[str] = list(outcome_variable_names)
column_names.append('cluster')
# the inplace= argument of set_axis was removed in pandas 2.0, so assign the result instead
cluster_data_df = cluster_data_df.set_axis(column_names, axis='columns')
graph_data: DataFrame = pd.melt(
frame=cluster_data_df,
id_vars=['cluster'],
# value_vars is an optional param - by default it uses columns except the id vars, but I've included it as an example
# value_vars=['outcome_var_1', 'outcome_var_2', 'outcome_var_3', 'outcome_var_4', 'outcome_var_5', 'outcome_var_6']
var_name='psychometric_test',
value_name='standard deviations from the mean'
)
The resulting dataframe (rows = sample_n x variable_n; in my case 1626 x 6 = 9756):
index  cluster  psychometric_test  standard deviations from the mean
0      0.0      outcome_var_1      -1.276182
1      0.0      outcome_var_1      -1.118813
2      0.0      outcome_var_1      -1.276182
...
9754   0.0      outcome_var_6      0.892548
9755   0.0      outcome_var_6      1.420480
If you want to use indices with melt:
graph_data: DataFrame = pd.melt(
frame=cluster_data_df,
id_vars=cluster_data_df.columns[-1],
# value_vars=cluster_data_df.columns[:-1],
var_name='psychometric_test',
value_name='standard deviations from the mean'
)
And here's the graphing code:
(Done with column headings - just note that y-axis=value_name, x-axis = var_name, hue = id_vars):
# plot graph grouped by cluster
sns.set_theme(style="ticks")
fig = plt.figure(figsize=(10, 10))
# font_scale and set_style are seaborn settings, not Figure methods
sns.set(font_scale=1.2)
sns.set_style("white")
# create boxplot
fig.ax = sns.boxplot(y='standard deviations from the mean', x='psychometric_test', hue='cluster', showfliers=False,
data=graph_data)
# set box alpha:
for patch in fig.ax.artists:  # on newer matplotlib/seaborn the boxes may be in fig.ax.patches instead
    r, g, b, a = patch.get_facecolor()
    patch.set_facecolor((r, g, b, .2))
# create scatterplot
fig.ax = sns.stripplot(y='standard deviations from the mean', x='psychometric_test', hue='cluster', data=graph_data,
dodge=True, alpha=.25, zorder=1)
# customise legend:
cluster_n: int = dbscan_output.n_clusters
## create list with legend text
i = 0
cluster_info: Dict[int, int] = dbscan_output.cluster_sizes # custom method
legend_labels: List[str] = []
while i < cluster_n:
    label: str = f"cluster {i+1}, n = {cluster_info[i]}"
    legend_labels.append(label)
    i += 1
if -1 in cluster_info.keys():
    cluster_n += 1
    label: str = f"Unclustered, n = {cluster_info[-1]}"
    legend_labels.insert(0, label)
## fetch existing handles and legends (each tuple will have 2*cluster number -> 1 for each boxplot cluster, 1 for each scatterplot cluster, so I will remove the first half)
handles, labels = fig.ax.get_legend_handles_labels()
index: int = int(cluster_n*(-1))
labels = legend_labels
plt.legend(handles[index:], labels[0:])
plt.xticks(rotation=45)
plt.show()
Just a note: Most of my time was spent debugging the melt function. I predominantly got the error "*only integer scalar arrays can be converted to a scalar index with 1D numpy indices array*". My output required me to concatenate my outcome variable value table and the clusters (DBSCAN), and I'd put extra square brackets around the cluster array in the concat method. So I had a column where each value was an invisible List[int], rather than a plain int. It's pretty niche, but maybe it'll help someone.
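A quick shape check before melting can catch this kind of problem early. The sketch below uses small hypothetical arrays in place of dbscan_output.data and dbscan_output.labels (the column names var_1, var_2, test and score are also made up for illustration) and asserts that the combined array is a plain 2-D table with scalar cells.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for dbscan_output.data and dbscan_output.labels.
data = np.array([[0.1, 1.2], [0.3, -0.4], [1.5, 0.6], [-0.7, 0.8]])
labels = np.array([0, 0, 1, -1])

# Reshape the 1-D label vector into a single column before concatenating.
combined = np.concatenate([data, labels.reshape(-1, 1)], axis=1)
assert combined.ndim == 2 and combined.shape == (4, 3)

frame = pd.DataFrame(combined, columns=["var_1", "var_2", "cluster"])
# Every cell of the id column should be a scalar, not a nested list or array.
assert frame["cluster"].map(np.isscalar).all()

# Rows after melting = samples x variables (here 4 x 2).
long_frame = pd.melt(frame, id_vars=["cluster"], var_name="test", value_name="score")
print(long_frame.shape)
```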

Related

Plotting multiple groups from a dataframe with datashader as lines

I am trying to make plots with datashader. The data itself is a time series of points in polar coordinates. I managed to transform them to Cartesian coordinates (to have equally spaced pixels) and I can plot them with datashader.
The point where I am stuck is that if I plot them with line() instead of points(), it connects the whole dataframe as a single line. I would like to plot the data group by group (the groups are the names in list_of_names) onto the canvas as lines.
Data can be found here
I get this kind of image with datashader:
This is a zoomed-in view of the plot generated with points() instead of line(). The goal is to produce the same plot but with connected lines instead of points.
import datashader as ds, pandas as pd, colorcet
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('file.csv')
print(df)
starlink_name = df.loc[:,'Name']
starlink_alt = df.loc[:,'starlink_alt']
starlink_az = df.loc[:,'starlink_az']
name = starlink_name.values
alt = starlink_alt.values
az = starlink_az.values
print(name)
print(df['Name'].nunique())
df['Date'] = pd.to_datetime(df['Date'])
for name, df_name in df.groupby('Name'):
    print(name)
df_grouped = df.groupby('Name')
list_of_names = list(df_grouped.groups)
print(len(list_of_names))
#########################################################################################
#i want this kind of plot with connected lines with datashader
#########################################################################################
fig = plt.figure()
ax = fig.add_axes([0.1,0.1,0.8,0.8], polar=True)
# ax.invert_yaxis()
ax.set_theta_zero_location('N')
ax.set_rlim(90, 60, 1)
# Note: you must set the end of arange to be slightly larger than 90 or it won't include 90
ax.set_yticks(np.arange(0, 91, 15))
ax.set_rlim(bottom=90, top=0)
for name in list_of_names:
    df2 = df_grouped.get_group(name)
    ax.plot(np.deg2rad(df2['starlink_az']), df2['starlink_alt'], linestyle='solid', marker='.',linewidth=0.5, markersize=0.1)
plt.show()
print(df)
#########################################################################################
#transformation to Cartesian coordinates
#########################################################################################
df['starlink_alt'] = 90 - df['starlink_alt']
df['x'] = df.apply(lambda row: np.deg2rad(row.starlink_alt) * np.cos(np.deg2rad(row.starlink_az)), axis=1)
df['y'] = df.apply(lambda row: -1 * np.deg2rad(row.starlink_alt) * np.sin(np.deg2rad(row.starlink_az)), axis=1)
#########################################################################################
# this is what i want but as lines group per group
#########################################################################################
cvs = ds.Canvas(plot_width=2000, plot_height=2000)
agg = cvs.points(df, 'y', 'x')
img = ds.tf.shade(agg, cmap=colorcet.fire, how='eq_hist')
#########################################################################################
#here i am stuck
#########################################################################################
for name in list_of_names:
    df2 = df_grouped.get_group(name)
    cvs = ds.Canvas(plot_width=2000, plot_height=2000)
    agg = cvs.line(df2, 'y', 'x')
    img = ds.tf.shade(agg, cmap=colorcet.fire, how='eq_hist')
    #plt.imshow(img)
plt.show()

To do this, you have a couple options. One is inserting NaN rows as a breakpoint into your dataframe when using cvs.line. You need DataShader to "pick up the pen" as it were, by inserting a row of NaNs after each group. It's not the slickest, but that's a current recommended solution.
Really simple, hacky example:
In [17]: df = pd.DataFrame({
...: 'name': list('AABBCCDD'),
...: 'x': np.arange(8),
...: 'y': np.arange(10, 18),
...: })
In [18]: df
Out[18]:
name x y
0 A 0 10
1 A 1 11
2 B 2 12
3 B 3 13
4 C 4 14
5 C 5 15
6 D 6 16
7 D 7 17
This block groups on the 'name' column, then reindexes each group to be one element longer than the original data:
In [20]: res = df.set_index('name').groupby('name').apply(
...: lambda x: x.reset_index(drop=True).reindex(np.arange(len(x) + 1))
...: )
In [21]: res
Out[21]:
x y
name
A 0 0.0 10.0
1 1.0 11.0
2 NaN NaN
B 0 2.0 12.0
1 3.0 13.0
2 NaN NaN
C 0 4.0 14.0
1 5.0 15.0
2 NaN NaN
D 0 6.0 16.0
1 7.0 17.0
2 NaN NaN
You can plug this reindexed dataframe into datashader to have multiple disconnected lines in the result.
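As an alternative to the groupby/reindex approach, the same NaN-separated layout can be built with a small explicit loop. This is only a sketch under the assumption that the coordinate columns are numeric; add_nan_breaks is a hypothetical helper name, not part of datashader or pandas.

```python
import numpy as np
import pandas as pd


def add_nan_breaks(df, group_col, value_cols):
    """Return value_cols as floats with one all-NaN row after each group."""
    nan_row = np.full((1, len(value_cols)), np.nan)
    blocks = []
    for _, grp in df.groupby(group_col, sort=False):
        blocks.append(grp[value_cols].to_numpy(dtype=float))
        # The NaN row is the "pen up" marker between consecutive groups.
        blocks.append(nan_row)
    return pd.DataFrame(np.vstack(blocks), columns=value_cols)


df = pd.DataFrame({
    "name": list("AABBCCDD"),
    "x": np.arange(8),
    "y": np.arange(10, 18),
})
lines = add_nan_breaks(df, "name", ["x", "y"])
print(lines)
```

The resulting frame has a flat index and can be handed to cvs.line(lines, 'x', 'y') in place of the original dataframe.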
This is a still-open issue on the datashader repo, including additional examples and boilerplate code: https://github.com/holoviz/datashader/issues/257
Other options include restructuring your data to accommodate one of cvs.line's other formats. From the Canvas.line docstring:
def line(self, source, x=None, y=None, agg=None, axis=0, geometry=None,
antialias=False):
Parameters
----------
source : pandas.DataFrame, dask.DataFrame, or xarray.DataArray/Dataset
The input datasource.
x, y : str or number or list or tuple or np.ndarray
Specification of the x and y coordinates of each vertex
* str or number: Column labels in source
* list or tuple: List or tuple of column labels in source
* np.ndarray: When axis=1, a literal array of the
coordinates to be used for every row
agg : Reduction, optional
Reduction to compute. Default is ``any()``.
axis : 0 or 1, default 0
Axis in source to draw lines along
* 0: Draw lines using data from the specified columns across
all rows in source
* 1: Draw one line per row in source using data from the
specified columns
There are a number of additional examples in the cvs.line docstring. You can pass arrays as the x, y arguments giving multiple columns to use in forming lines when axis=1, or you can pass a dataframe with ragged array values.
See this pull request adding the line options (h/t to @James-a-bednar in the comments) for a discussion of their use.

Weird shifting of boxplot in pandas boxplot combining it with seaborn pointplot - what is going on?

Imagine I have the following dataframes
import pandas as pd
import seaborn as sns
import numpy as np
d = {'val': [1, 2,3,4], 'a': [1, 1, 2, 2]}
d2 = {'val': [1, 2], 'a': [1, 2]}
df = pd.DataFrame(data=d)
df2 = pd.DataFrame(data=d2)
This will give me two dataframes that look the following:
df =
val a
0 1 1
1 2 1
2 3 2
3 4 2
and
df2 =
val a
0 1 1
1 2 2
Now I want to create a boxplot of val in df, grouped by the values of a. That is, fix a value of a, say 1; there are two values of val for it, 1 and 2, so draw a box at x=1 based on {1, 2}. Then move on to a=2, which has val={3, 4}, so draw a box at x=2 based on {3, 4}.
Then I want to draw a line based on df2, where a is again my x-axis and val my y-axis. The way I did that is the following:
ax = df.boxplot(column=['val'], by = ['a'],meanline=True, showmeans=True, showcaps=True,showbox=True)
sns.pointplot(x='a', y='val', data=df2, ax=ax)
The problem is that the box for a=1 is shifted to a=2 and the box for a=2 has disappeared. I am not sure whether I have an error in my code or whether it is a bug.
If I just add the boxplot, everything is fine, so if I do:
ax = df.boxplot(column=['val'], by = ['a'],meanline=True, showmeans=True, showcaps=True,showbox=True)
The boxes are at the right position, but as soon as I add the pointplot, things don't seem to work anymore.
Does anyone have an idea what to do?

The problem is that you are plotting categories on the x-axis. Pointplot plots the first item at position 0 while boxplot starts at 1, thus the shift. One possibility is to use a twinned axis:
ax = df.boxplot(column=['val'], by = ['a'])
ax2 = ax.twiny()
sns.pointplot(x='a', y='val', data=df2, ax=ax2)
ax2.xaxis.set_visible(False)
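Another way to avoid the offset, sketched here with plain matplotlib rather than pandas' boxplot or seaborn's pointplot, is to place both layers at the same explicit positions. The names categories, positions and pos_of are illustrative only, and the Agg backend is chosen just so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"val": [1, 2, 3, 4], "a": [1, 1, 2, 2]})
df2 = pd.DataFrame({"val": [1, 2], "a": [1, 2]})

categories = sorted(df["a"].unique())
positions = list(range(1, len(categories) + 1))  # boxes sit at 1, 2, ...
pos_of = dict(zip(categories, positions))

fig, ax = plt.subplots()
# One box per category, drawn at its explicit position.
ax.boxplot(
    [df.loc[df["a"] == c, "val"].to_numpy() for c in categories],
    positions=positions,
)
# Map each category in df2 to the same position before drawing the line.
ax.plot(df2["a"].map(pos_of), df2["val"], marker="o")
ax.set_xticks(positions)
ax.set_xticklabels(categories)
```

Because both layers share one coordinate system, no second axis is needed and the tick labels stay visible.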

How to plot a dot plot type scatterplot in matplotlib or seaborn?

Let's say I have a df like this:
df = pd.DataFrame({'col1': list('aabbb'), 'col2': [1, 3, 1, 5, 3]})
col1 col2
0 a 1
1 a 3
2 b 1
3 b 5
4 b 3
I would like to see a plot, where on the x axis, I have the col1 names ONCE, and on the y axis, the col2 data, as individual dots, so above 'a' I would have two dots at the height of 1 and 3, and above b I would have three dots at the heights of 1, 5 and 3.
My main problem is that anything I try results in several a and b on the x axis, not grouped.

Beeswarm, strip, and scatter plots are all options, depending on your data and preferred aesthetic.
plt.scatter or df.plot.scatter (most basic)
plt.scatter(data=df, x='col1', y='col2') # or df.plot.scatter(x='col1', y='col2')
plt.margins(x=0.5)
sns.swarmplot (avoid collisions)
sns.swarmplot(data=df, x='col1', y='col2')
sns.stripplot (random jitter)
sns.stripplot(data=df, x='col1', y='col2')

pandas data frame plotting in subplots

I have the following pandas data frame and would like to create n plots horizontally, where n is the number of unique labels (l1, l2, ...) in the a1 column (in the example below there will be two plots because of l1 and l2). Each plot should show a4 on the x-axis against a3 on the y-axis. For example, ax[0] will contain the graph for l1, with three lines linking the points [(1,15),(2,20)], [(1,17),(2,19)], [(1,23),(2,15)] for the data below.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
d = {'a1': ['l1','l1','l1','l1','l1','l1','l2','l2','l2','l2','l2','l2'],
'a2': ['a', 'a', 'b','b','c','c','d','d','e','e','f','f'],
'a3': [15,20,17,19,23,15,22,21,23,23,24,27],
'a4': [1,2,1,2,1,2,1,2,1,2,1,2]}
df=pd.DataFrame(d)
df
a1 a2 a3 a4
1 a 15 1
1 a 20 2
1 b 17 1
1 b 19 2
1 c 23 1
1 c 15 2
2 d 22 1
2 d 21 2
2 e 23 1
2 e 23 2
2 f 24 1
2 f 27 2
I currently have the following:
def graph(dataframe):
    x = dataframe["a4"]
    y = dataframe["a3"]
    ax[0].plot(x,y) #how do I plot and set the title for each group in their respective subplot without the use of for-loop?
fig, ax = plt.subplots(1,len(pd.unique(df["a1"])),sharey='row',figsize=(15,2))
df.groupby(["a1"]).apply(graph)
However, my above attempt only plots all a3 against a4 on the first subplot(because I wrote ax[0].plot()). I can always use a for-loop to accomplish the desired task, but for large number of unique groups in a1, it will be computationally expensive. Is there a way to make it a one-liner on the line ax[0].plot(x,y) and it accomplishes the desired task without a for loop? Any inputs are appreciated.

I do not see any way of avoiding a for loop when plotting this data with pandas. My initial thought was to reshape the dataframe to make subplots=True work, like this:
dfp = df.pivot(columns='a1').swaplevel(axis=1).sort_index(axis=1)
dfp
But I do not see how to select level 1 of the columns MultiIndex to make something like dfp.plot(x='a4', y='a3', subplots=True) work.
Removing level 0 and then running the plotting function with
dfp.droplevel(axis=1, level=0).plot(x='a4', y='a3', subplots=True) raises ValueError: x must be a label or position. And even if this worked, there would still be the issue of linking the correct points together.
The seaborn package was created to conveniently plot this kind of dataset. If you are open to using it here is an example with relplot:
import pandas as pd # v 1.1.3
import seaborn as sns # v 0.11.0
d = {'a1': ['l1','l1','l1','l1','l1','l1','l2','l2','l2','l2','l2','l2'],
'a2': ['a', 'a', 'b','b','c','c','d','d','e','e','f','f'],
'a3': [15,20,17,19,23,15,22,21,23,23,24,27],
'a4': [1,2,1,2,1,2,1,2,1,2,1,2]}
df = pd.DataFrame(d)
sns.relplot(data=df, x='a4', y='a3', col='a1', hue ='a2', kind='line', height=4)
You can customize the colors with the palette argument and adjust the grid layout with col_wrap.
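For comparison, the explicit-loop version with plain matplotlib might look like the sketch below: one subplot per a1 label and one line per a2 value inside it. The loop runs once per group rather than once per row. The variable names panel and line are illustrative, and the Agg backend is used only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

d = {'a1': ['l1','l1','l1','l1','l1','l1','l2','l2','l2','l2','l2','l2'],
     'a2': ['a', 'a', 'b','b','c','c','d','d','e','e','f','f'],
     'a3': [15,20,17,19,23,15,22,21,23,23,24,27],
     'a4': [1,2,1,2,1,2,1,2,1,2,1,2]}
df = pd.DataFrame(d)

n_panels = df["a1"].nunique()
fig, axes = plt.subplots(1, n_panels, sharey="row", figsize=(15, 2), squeeze=False)

# Outer loop: one subplot per a1 label; inner loop: one line per a2 value.
for ax, (label, panel) in zip(axes[0], df.groupby("a1")):
    for name, line in panel.groupby("a2"):
        ax.plot(line["a4"], line["a3"], label=name)
    ax.set_title(label)
    ax.legend()
```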

How to plot data after groupby

I have a data frame similar to this
import pandas as pd
df = pd.DataFrame([['1','3','1','2','3','1','2','2','1','1'], ['ONE','TWO','ONE','ONE','ONE','TWO','ONE','TWO','ONE','THREE']]).T
df.columns = ['age','data']
print(df) #printing dataframe.
I performed the groupby function on it to get the required output.
df['COUNTER'] =1 #initially, set that counter to 1.
group_data = df.groupby(['age','data'])['COUNTER'].sum() #sum function
print(group_data)
Now I want to plot the output using matplotlib. Please help me with it; I am not able to figure out how to start and what to do.
I want to plot the counter values as something similar to a bar graph.

Try:
group_data = group_data.reset_index()
in order to get rid of the multiple index that the groupby() has created for you.
Your print(group_data) will give you this:
In [24]: group_data = df.groupby(['age','data'])['COUNTER'].sum() #sum function
In [25]: print(group_data)
age data
1 ONE 3
THREE 1
TWO 1
2 ONE 2
TWO 1
3 ONE 1
TWO 1
Name: COUNTER, dtype: int64
Whereas resetting will 'simplify' the new index:
In [26]: group_data = group_data.reset_index()
In [27]: group_data
Out[27]:
age data COUNTER
0 1 ONE 3
1 1 THREE 1
2 1 TWO 1
3 2 ONE 2
4 2 TWO 1
5 3 ONE 1
6 3 TWO 1
Then depending on what it is exactly that you want to plot, you might want to take a look at the Matplotlib docs
EDIT
I now read more carefully that you want to create a 'bar' chart.
If that is the case then I would take a step back and not use reset_index() on the groupby result. Instead, try this:
In [46]: fig = group_data.plot.bar()
In [47]: fig.figure.show()
I hope this helps
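If one cluster of bars per age is preferred over a single bar per (age, data) pair, the grouped counts can be pivoted first. This is a sketch of that variation, not what the answer above does; it uses size() instead of the COUNTER column, and the names counts and wide are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "age": ["1", "3", "1", "2", "3", "1", "2", "2", "1", "1"],
    "data": ["ONE", "TWO", "ONE", "ONE", "ONE", "TWO", "ONE", "TWO", "ONE", "THREE"],
})

# Count rows per (age, data) pair; equivalent to summing a column of ones.
counts = df.groupby(["age", "data"]).size()

# Move the data level into columns so each age becomes one row of bars.
wide = counts.unstack(fill_value=0)
print(wide)
```

Calling wide.plot.bar() on the result draws the ages along the x-axis with one coloured bar per data value.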

Try with this:
# This magic command displays plots inline in a Jupyter notebook
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
# Parameters to make the plot bigger
plt.rcParams["axes.labelsize"] = 16
plt.rcParams["xtick.labelsize"] = 14
plt.rcParams["ytick.labelsize"] = 14
plt.rcParams["legend.fontsize"] = 12
plt.rcParams["figure.figsize"] = [15, 7]
df = pd.DataFrame([['1','3','1','2','3','1','2','2','1','1'], ['ONE','TWO','ONE','ONE','ONE','TWO','ONE','TWO','ONE','THREE']]).T
df.columns = ['age','data']
df['COUNTER'] = 1
group_data = df.groupby(['age','data']).sum()[['COUNTER']].plot.bar(rot = 90) # If you want to rotate labels from x axis
_ = group_data.set(xlabel = 'xlabel', ylabel = 'ylabel'), group_data.legend(['Legend']) # you can add labels and legend
