Multiple plotting from dataframe using seaborn - python

I am trying to improve my visualizations in Python. Assume I have this data:
data = {'animal':['cat', 'tiger', 'leopard', 'dog'], 'family':['mustelids ','felidae','felidae', 'canidae'], 'family_pct':[6.06,33.33,9.09,12.12]}
df = pd.DataFrame(data)
I want to create a barplot as follows:
fig, ax = plt.subplots()
sns.barplot(x = 'family', y = 'family_pct', hue='animal', data = df)
However, I would like each "family" to be plotted separately (one plot for mustelids, one for felidae, and one for canidae) and not on the same plot. I effectively would like to loop the graph over every value of the family column. However, I am not sure how to go about this.
Thanks!

Use catplot() to combine a barplot() and a FacetGrid. This allows grouping within additional categorical variables. Using catplot() is safer than using FacetGrid directly, as it ensures synchronization of variable order across facets:
sns.catplot(x = 'family', y = 'family_pct', hue='animal', col='family', data = df, kind='bar', height=4, aspect=.7)
See the seaborn documentation for catplot() for more details.
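If you would rather loop over the families yourself, as the question describes, a minimal sketch with plain matplotlib could look like the following. It reuses the data from the question; the single-row layout, the figure size and the shared y-axis are illustrative assumptions, not part of the answer above.

```python
import matplotlib.pyplot as plt
import pandas as pd

data = {'animal': ['cat', 'tiger', 'leopard', 'dog'],
        'family': ['mustelids ', 'felidae', 'felidae', 'canidae'],
        'family_pct': [6.06, 33.33, 9.09, 12.12]}
df = pd.DataFrame(data)

# one (family name, sub-frame) pair per value of the family column
groups = list(df.groupby('family'))

# one subplot per family; sharing the y-axis keeps bar heights comparable
fig, axes = plt.subplots(1, len(groups), figsize=(4 * len(groups), 4),
                         sharey=True, squeeze=False)
for ax, (family, sub) in zip(axes[0], groups):
    ax.bar(sub['animal'], sub['family_pct'])
    ax.set_title(family)
    ax.set_xlabel('animal')
axes[0][0].set_ylabel('family_pct')
fig.tight_layout()
```

Call plt.show() afterwards to display the figure. Creating a separate figure inside the loop instead of a shared one works the same way.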

Related

How to make a line plot from a dataframe with multiple categorical columns in matplotlib

I want to make weekly line charts for different categories, with one subplot per country and one line per year. I was able to draft line plots using seaborn, but it is not handy for setting labels, the legend, the color palette and so on. Is there an easy way to reshape this data with multiple categorical variables and render line charts? In my initial attempt I tried seaborn.relplot, but its parameters are not easy to tune and the resulting plot is hard to customize. Can anyone point me to an efficient way to reshape a dataframe with multiple categorical columns and render a clear line chart? Any thoughts?
reproducible data & my attempt:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
url = 'https://gist.githubusercontent.com/adamFlyn/cb0553e009933574ac7ec3109ffb5140/raw/a277bc00dc08e526a7d5b7ead5425905f7206bfa/export.csv'
dff = pd.read_csv(url, parse_dates=['weekly'])
dff.drop('Unnamed: 0', axis=1, inplace=True)
df2_bf = dff.groupby(['destination', 'weekly'])['FCF_Beef'].sum().unstack()
df2_bf = df2_bf.fillna(0)
mm = df2_bf.T
mm.columns.name = None
mm = mm[~(mm.isna().sum(1)/mm.shape[1]).gt(0.9)].fillna(0)
#Total sum per column:
mm.loc['Total',:]= mm.sum(axis=0)
mm1 = mm.T
mm1 = mm1.nlargest(6, columns=['Total'])
mm1.drop('Total', axis=1, inplace=True)
mm2 = mm1.T
mm2.reset_index(inplace=True)
mm2['weekly'] = pd.to_datetime(mm2['weekly'])
mm2['year'] = mm2['weekly'].dt.year
mm2['week'] = mm2['weekly'].dt.isocalendar().week
df = mm2.melt(id_vars=['weekly','week','year'], var_name='country')
df_ = df.groupby(['country', 'year', 'week'], as_index=False)['value'].sum()
sns.relplot(data=df_, x='week', y='value', hue='year', row='country', kind='line', height=6, aspect=2, facet_kws={'sharey': False, 'sharex': False}, sizes=(20, 10))
current plot
This is one of the current plots that I made with seaborn.relplot.
The structure of the plot is okay for me, but with seaborn.relplot it is hard to tune the parameters, and it is not as flexible as using matplotlib. Also, I realized that the way I aggregate my data is not very efficient. I think there might be a shortcut to make the above code snippet more efficient, like:
plt_data = []
for i in dff.loc[:, ['FCF_Beef','FCF_Beef']]:
...
but doing it this way I ran into a couple of issues making the right plot. Can anyone point out how to make this simple and efficient, in order to produce the expected line chart with matplotlib? Does anyone know a better way of doing this? Thanks
desired output
In my desired plot, I iterate over the list of countries so that each country has one subplot. In each subplot, the x-axis shows the 52 weeks and the y-axis shows the weekly export amount for the different years of that country. Here is a draft plot that I made with seaborn.relplot.
Note that I don't like the output from seaborn.relplot, so I am wondering how to do the above more efficiently with matplotlib. Any ideas?
As requested by the OP, following is an iterative way to plot the data.
The following example plots each year, for a given 'destination' in a single figure
This is similar to the answer for this question.
import pandas as pd
import matplotlib.pyplot as plt
# load the data
url = 'https://gist.githubusercontent.com/adamFlyn/cb0553e009933574ac7ec3109ffb5140/raw/a277bc00dc08e526a7d5b7ead5425905f7206bfa/export.csv'
df = pd.read_csv(url, parse_dates=['weekly'], usecols=range(1, 6))
# group by destination and iterate through the groups for plotting
for g, d in df.groupby('destination'):
    # create the figure
    fig, ax = plt.subplots(figsize=(7, 4))
    # add lines for specific years
    for year in d.weekly.dt.year.unique():
        data = d[d.weekly.dt.year == year].copy()  # select the data from d, by year
        data['week'] = data.weekly.dt.isocalendar().week  # create a week column
        data.sort_values('weekly', inplace=True)
        print(data.head())  # in Jupyter, display(data.head()) also works
        data.plot(x='week', y='FCF_Beef', ax=ax, label=year)
    plt.show()
Single sample plot
If we look at the tail of one of the dataframes, data.weekly.dt.isocalendar().week is putting the last day of the year in week 1, so a line is drawn back to week 1 for the last data point.
This behavior rests on datetime.datetime(2018, 12, 31).isocalendar() and is the expected behavior of the datetime module, as per this closed pandas bug.
Removing the last row with .iloc[:-1, :] is a workaround.
Alternatively, replace data['week'] = data.weekly.dt.isocalendar().week with data['week'] = data.weekly.dt.strftime('%W').astype('int')
data.iloc[:-1, :].plot(x='week', y='FCF_Beef', ax=ax, label=year)
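The difference between the two week numberings can be checked with the standard library alone. The short sketch below uses the same date mentioned above; the variable names are only illustrative.

```python
from datetime import datetime

last_day = datetime(2018, 12, 31)  # a Monday

# ISO 8601 numbering: this Monday already belongs to week 1 of ISO year 2019
iso = last_day.isocalendar()
iso_year, iso_week = iso[0], iso[1]

# '%W' numbering: week of the calendar year, with Monday as the first day
week_w = int(last_day.strftime('%W'))

print(iso_year, iso_week)  # 2019 1
print(week_w)              # 53
```

With '%W', the last day of 2018 stays at the right-hand end of the x-axis instead of jumping back to week 1.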
Updated with all code from OP
# load the data
url = 'https://gist.githubusercontent.com/adamFlyn/cb0553e009933574ac7ec3109ffb5140/raw/a277bc00dc08e526a7d5b7ead5425905f7206bfa/export.csv'
dff = pd.read_csv(url, parse_dates=['weekly'], usecols=range(1, 6))
df2_bf = dff.groupby(['destination', 'weekly'])['FCF_Beef'].sum().unstack()
df2_bf = df2_bf.fillna(0)
mm = df2_bf.T
mm.columns.name = None
mm = mm[~(mm.isna().sum(1)/mm.shape[1]).gt(0.9)].fillna(0)
#Total sum per column:
mm.loc['Total',:]= mm.sum(axis=0)
mm1 = mm.T
mm1 = mm1.nlargest(6, columns=['Total'])
mm1.drop('Total', axis=1, inplace=True)
mm2 = mm1.T
mm2.reset_index(inplace=True)
mm2['weekly'] = pd.to_datetime(mm2['weekly'])
mm2['year'] = mm2['weekly'].dt.year
mm2['week'] = mm2['weekly'].dt.strftime('%W').astype('int')
df = mm2.melt(id_vars=['weekly','week','year'], var_name='country')
# group by country and iterate through the groups for plotting
for g, d in df.groupby('country'):
    # create the figure
    fig, ax = plt.subplots(figsize=(7, 4))
    # add lines for specific years
    for year in d.weekly.dt.year.unique():
        data = d[d.weekly.dt.year == year].copy()  # select the data from d, by year
        data.sort_values('weekly', inplace=True)
        print(data.head())  # in Jupyter, display(data.head()) also works
        data.plot(x='week', y='value', ax=ax, label=year, title=g)
    plt.show()
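As a possible shortcut for the reshaping step, pivot_table can build one week-by-year table per destination directly. The sketch below is only an illustration: it uses a small synthetic frame with the same column names as the CSV ('destination', 'weekly', 'FCF_Beef') rather than the real data, and the destination labels 'A' and 'B' are made up.

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the CSV: two destinations, 104 weekly rows each
weeks = pd.date_range('2018-01-01', periods=104, freq='W-MON')
rng = np.random.default_rng(0)
dff = pd.DataFrame({
    'destination': np.repeat(['A', 'B'], len(weeks)),
    'weekly': np.tile(weeks.values, 2),
    'FCF_Beef': rng.integers(0, 100, size=2 * len(weeks)),
})

dff['year'] = dff['weekly'].dt.year
dff['week'] = dff['weekly'].dt.strftime('%W').astype(int)

# one table per destination: rows are weeks, columns are years
tables = {
    dest: d.pivot_table(index='week', columns='year', values='FCF_Beef', aggfunc='sum')
    for dest, d in dff.groupby('destination')
}
```

Calling tables['A'].plot(title='A') would then draw one line per year on a single axes, which replaces the inner loop over years.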

Unusual bar plot in matplotlib

I need to create a somewhat unusual bar plot in matplotlib and the standard functionality does not seem to offer what I need.
I have clustered some documents and want to show the 5 most important keywords per cluster. The first problem is that I have one group per cluster which consists of 5 individual bars. The second problem is that the labels of these individual bars are important, not the same across groups and not unique either.
I have a makeshift prototype that looks like this:
I just plotted all the individual bars in the right order and separated them by empty entries. The biggest problem (aside from being ugly) is that the only way to identify the cluster is by counting the groups. It would help a lot if the clusters could be identified either by color or something else, but I cannot figure out how to do this.
Edit: Here is some requested toy data as well as the code used to produce the plot I already have.
Toy data:
The following two pandas dataframes are included in an array. The two code blocks include the results from df_list[i].to_csv(). I hope this helps, but for the context of this problem the actual data does not really matter, so you can also just create your own dataframes.
,features,score
0,knowledg,0.09862235117497174
1,manag,0.07812351138840486
2,innov,0.06502084705448799
3,organ,0.0561819290497529
4,km,0.05580332888282127
and
,features,score
0,knowledg,0.04217018718591911
1,develop,0.03423580137595049
2,manag,0.032239226503136
3,system,0.031064303713788467
4,sustain,0.029628875636649198
Code:
The approach for the current solution is to combine all the individual dataframes into one dataframe, add empty entries where necessary, and plot the result.
def plot_all_clusters_words(dfs):
    # target structure: word as non unique column, value as other non unique column
    df_dict_list = []
    for df in dfs:
        for index, row in df.iterrows():
            df_dict_list.append({"word": row.features, "value": row.score})
        df_dict_list.append({"word": "", "value": 0})
    df_dict_list = df_dict_list[:-1]
    new_df = pd.DataFrame(df_dict_list)
    new_df.plot.bar(x="word")
    plt.show()
    return new_df
Note:
I just need a way to easily identify the groups, if you know a different approach than the ones I suggested above, feel free to do so.
Calling plt.bar for each of the dataframes, each with its own label and color, would create the following plot:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from io import StringIO
df1_str = '''features,score
0,knowledg,0.09862235117497174
1,manag,0.07812351138840486
2,innov,0.06502084705448799
3,organ,0.0561819290497529
4,km,0.05580332888282127'''
df2_str = '''features,score
0,knowledg,0.04217018718591911
1,develop,0.03423580137595049
2,manag,0.032239226503136
3,system,0.031064303713788467
4,sustain,0.029628875636649198'''
df1 = pd.read_csv(StringIO(df1_str))
df2 = pd.read_csv(StringIO(df2_str))
dfs = [df1, df2]
cluster_names = [f'cluster {i}' for i in range(1, len(dfs) + 1)]
colors = plt.cm.rainbow(np.linspace(0, 1, len(dfs)))
bar_width = 0.8 # width of individual bars
cluster_gap = 0.2 # extra distance between clusters
starts = np.append(0, np.array([len(df) + cluster_gap for df in dfs]).cumsum())
all_tickpos = [s + np.arange(len(df)) for df, s in zip(dfs, starts)]
for df, name, color, tickpos in zip(dfs, cluster_names, colors, all_tickpos):
    plt.bar(tickpos, df['score'], width=bar_width, color=color, label=name)
plt.xticks(np.concatenate(all_tickpos), [f for df in dfs for f in df['features']], rotation=90)
plt.legend()
plt.tight_layout()
plt.show()

Reorder Categorical X axis ticks Matplotlib [duplicate]

EDIT: this question arose back in 2013 with pandas ~0.13 and was obsoleted by direct support for boxplot somewhere between versions 0.15 and 0.18 (as per @Cireo's late answer; pandas has also greatly improved its support for categoricals since this was asked.)
I can get a boxplot of a salary column in a pandas DataFrame...
train.boxplot(column='Salary', by='Category', sym='')
...however I can't figure out how to define the index-order used on column 'Category' - I want to supply my own custom order, according to another criterion:
category_order_by_mean_salary = train.groupby('Category')['Salary'].mean().order().keys()
How can I apply my custom column order to the boxplot columns? (other than ugly kludging the column names with a prefix to force ordering)
'Category' is a string (really, should be a categorical, but this was back in 0.13, where categorical was a third-class citizen) column taking 27 distinct values: ['Accounting & Finance Jobs','Admin Jobs',...,'Travel Jobs']. So it can be easily factorized with pd.Categorical.from_array()
On inspection, the limitation is inside pandas.tools.plotting.py:boxplot(), which converts the column object without allowing ordering:
pandas.core.frame.py:boxplot() is a passthrough to ...
pandas.tools.plotting.py:boxplot(), which instantiates ...
matplotlib.pyplot.py:boxplot(), which instantiates ...
matplotlib.axes.py:boxplot()
I suppose I could either hack up a custom version of pandas boxplot(), or reach into the internals of the object. And also file an enhancement request.
Hard to say how to do this without a working example. My first guess would be to just add an integer column with the orders that you want.
A simple, brute-force way would be to add each boxplot one at a time.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.rand(37,4), columns=list('ABCD'))
columns_my_order = ['C', 'A', 'D', 'B']
fig, ax = plt.subplots()
for position, column in enumerate(columns_my_order):
    ax.boxplot(df[column], positions=[position])
ax.set_xticks(range(position+1))
ax.set_xticklabels(columns_my_order)
ax.set_xlim(xmin=-0.5)
plt.show()
EDIT: this is the right answer after direct support was added somewhere between version 0.15-0.18
tl;dr: for recent pandas - use positions argument to boxplot.
Adding a separate answer, which perhaps could be another question - feedback appreciated.
I wanted to add a custom column order within a groupby, which posed many problems for me. In the end, I had to avoid trying to use boxplot from a groupby object, and instead go through each subplot myself to provide explicit positions.
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame()
df['GroupBy'] = ['g1', 'g2', 'g3', 'g4'] * 6
df['PlotBy'] = [chr(ord('A') + i) for i in range(24)]
df['SortBy'] = list(reversed(range(24)))
df['Data'] = [i * 10 for i in range(24)]
# Note that this has no effect on the boxplot
df = df.sort_values(['GroupBy', 'SortBy'])
for group, info in df.groupby('GroupBy'):
    print('Group: %r\n%s\n' % (group, info))
# With the below, cannot use
# - sort data beforehand (not preserved, can't access in groupby)
# - categorical (not all present in every chart)
# - positional (different lengths and sort orders per group)
# df.groupby('GroupBy').boxplot(layout=(1, 5), column=['Data'], by=['PlotBy'])
fig, axes = plt.subplots(1, df.GroupBy.nunique(), sharey=True)
for ax, (g, d) in zip(axes, df.groupby('GroupBy')):
    d.boxplot(column=['Data'], by=['PlotBy'], ax=ax, positions=d.index.values)
plt.show()
Within my final code, it was even slightly more involved to determine positions because I had multiple data points for each sortby value, and I ended up having to do the below:
to_plot = data.sort_values([sort_col]).groupby(group_col)
for ax, (group, group_data) in zip(axes, to_plot):
    # Use existing sorting
    ordering = enumerate(group_data[sort_col].unique())
    positions = [ind for val, ind in sorted((v, i) for (i, v) in ordering)]
    ax = group_data.boxplot(column=[col], by=[plot_by], ax=ax, positions=positions)
I got stuck on the same question, and solved it by mapping the categories to sortable keys and then resetting the xticklabels, with code as follows:
df = pd.DataFrame({"A":["d","c","d","c",'d','c','a','c','a','c','a','c']})
df['val']=(np.random.rand(12))
df['B']=df['A'].replace({'d':'0','c':'1','a':'2'})
ax=df.boxplot(column='val',by='B')
ax.set_xticklabels(list('dca'))
Note that pandas can now create categorical columns. If you don't mind having all the columns present in your graph, or trimming them appropriately, you can do something like the below:
http://pandas.pydata.org/pandas-docs/stable/categorical.html
df['Category'] = df['Category'].astype(pd.CategoricalDtype(ordered=True))
Recent pandas also appears to allow positions to pass all the way through from frame to axes.
https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py
https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/_core.py
https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/pyplot.py
https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/axes/_axes.py
It might sound kind of silly, but many of the plotting functions let you set the order directly. For example:
Library & dataset
import seaborn as sns
import matplotlib.pyplot as plt
df = sns.load_dataset('iris')
Specific order
p1 = sns.boxplot(x='species', y='sepal_length', data=df, order=["virginica", "versicolor", "setosa"])
plt.show()
If you're not happy with the default column order in your boxplot, you can change it to a specific order by setting the column parameter in the boxplot function.
check the two examples below:
np.random.seed(0)
df = pd.DataFrame(np.random.rand(37,4), columns=list('ABCD'))
##
plt.figure()
df.boxplot()
plt.title("default column order")
##
plt.figure()
df.boxplot(column=['C','A', 'D', 'B'])
plt.title("Specified column order")
Use the new positions= argument:
df.boxplot(column=['Data'], by=['PlotBy'], positions=df.index.values)
This can be resolved by applying a categorical order. You can decide on the ranking yourself. I'll give an example with days of week.
Provide categorical order to weekday
# List the categorical values in the correct order
weekday = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
# Assign the above list to the category ranking
wDays = pd.api.types.CategoricalDtype(ordered=True, categories=weekday)
# Apply this to the specific column in the DataFrame
df['Weekday'] = df['Weekday'].astype(wDays)
# Then generate your plot (colour is any matplotlib color you have defined)
plt.figure(figsize=[15, 10])
sns.boxplot(data=df, x='Weekday', y='Y Axis Variable', color=colour)
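The same idea can answer the original question of ordering the boxes by mean salary. The following is a minimal sketch under stated assumptions: the train frame is synthetic (three of the category names from the question, with made-up salaries), and it relies on pandas grouping an ordered categorical in category order.

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the question's data
rng = np.random.default_rng(0)
train = pd.DataFrame({
    'Category': np.repeat(['Admin Jobs', 'Travel Jobs', 'Accounting & Finance Jobs'], 20),
    'Salary': np.concatenate([rng.normal(30000, 2000, 20),
                              rng.normal(20000, 2000, 20),
                              rng.normal(50000, 2000, 20)]),
})

# category names sorted by mean salary, lowest first
order = train.groupby('Category')['Salary'].mean().sort_values().index.tolist()

# an ordered categorical carries the custom order into groupby-based plots
train['Category'] = pd.Categorical(train['Category'], categories=order, ordered=True)

ax = train.boxplot(column='Salary', by='Category', sym='')
```

Call plt.show() from matplotlib.pyplot to display the result. Passing the same list as order= to sns.boxplot is an alternative that avoids changing the column dtype.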

How to plot multiple linear regressions in the same figure

Given the following:
import numpy as np
import pandas as pd
import seaborn as sns
np.random.seed(365)
x1 = np.random.randn(50)
y1 = np.random.randn(50) * 100
x2 = np.random.randn(50)
y2 = np.random.randn(50) * 100
df1 = pd.DataFrame({'x1':x1, 'y1': y1})
df2 = pd.DataFrame({'x2':x2, 'y2': y2})
sns.lmplot(x='x1', y='y1', data=df1, fit_reg=True, ci=None)
sns.lmplot(x='x2', y='y2', data=df2, fit_reg=True, ci=None)
This will create 2 separate plots. How can I add the data from df2 onto the SAME graph? All the seaborn examples I have found online seem to focus on how you can create adjacent graphs (say, via the 'hue' and 'col_wrap' options). Also, I prefer not to use the dataset examples where an additional column might be present as this does not have a natural meaning in the project I am working on.
If there is a mixture of matplotlib/seaborn functions that are required to achieve this, I would be grateful if someone could help illustrate.
You could use seaborn's FacetGrid class to get desired result.
You would need to replace your plotting calls with these lines:
# sns.lmplot('x1', 'y1', df1, fit_reg=True, ci = None)
# sns.lmplot('x2', 'y2', df2, fit_reg=True, ci = None)
import matplotlib.pyplot as plt

df = pd.concat([df1.rename(columns={'x1':'x','y1':'y'})
                   .join(pd.Series(['df1']*len(df1), name='df')),
                df2.rename(columns={'x2':'x','y2':'y'})
                   .join(pd.Series(['df2']*len(df2), name='df'))],
               ignore_index=True)
pal = dict(df1="red", df2="blue")
g = sns.FacetGrid(df, hue='df', palette=pal, height=5);  # `size` was renamed to `height` in seaborn 0.9
g.map(plt.scatter, "x", "y", s=50, alpha=.7, linewidth=.5, edgecolor="white")
g.map(sns.regplot, "x", "y", ci=None, robust=1)
g.add_legend();
This will yield this plot:
Which, if I understand correctly, is what you need.
Note that you will need to pay attention to .regplot parameters and may want to change the values I have put as an example.
; at the end of the line is to suppress output of the command (I use ipython notebook where it's visible).
The docs give some explanation of the .map() method. In essence, it does just that: it maps a plotting command onto the data. However, it works with 'low-level' plotting commands like regplot, and not lmplot, which actually calls regplot behind the scenes.
Normally plt.scatter would take the parameters c='none', edgecolor='r' to make non-filled markers. But seaborn interferes with the process and enforces a color on the markers, so I don't see an easy/straightforward way to fix this other than manipulating the ax elements after seaborn has produced the plot, which is best addressed as part of a different question.
Option 1: sns.regplot
In this case, the easiest to implement solution is to use sns.regplot, which is an axes-level function, because this will not require combining df1 and df2.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# create the figure and axes
fig, ax = plt.subplots(figsize=(6, 6))
# add the plots for each dataframe
sns.regplot(x='x1', y='y1', data=df1, fit_reg=True, ci=None, ax=ax, label='df1')
sns.regplot(x='x2', y='y2', data=df2, fit_reg=True, ci=None, ax=ax, label='df2')
ax.set(ylabel='y', xlabel='x')
ax.legend()
plt.show()
Option 2: sns.lmplot
As per sns.FacetGrid, it is better to use figure-level functions than to use FacetGrid directly.
Combine df1 and df2 into a long format, and then use sns.lmplot with the hue parameter.
When working with seaborn, it is almost always necessary for the data to be in a long format.
It's customary to use pandas.DataFrame.stack or pandas.melt to convert DataFrames from wide to long.
For this reason, df1 and df2 must have the columns renamed, and have an additional identifying column. This allows them to be concatenated on axis=0 (the default long format), instead of axis=1 (a wide format).
There are a number of ways to combine the DataFrames:
The combination method in the answer from Primer is fine if combining a few DataFrames.
However, a function, as shown below, is better for combining many DataFrames.
def fix_df(data: pd.DataFrame, name: str) -> pd.DataFrame:
    """rename columns and add a column"""
    # rename columns to a common name
    data.columns = ['x', 'y']
    # add an identifying value to use with hue
    data['df'] = name
    return data
# create a list of the dataframes
df_list = [df1, df2]
# update the dataframes by calling the function in a list comprehension
df_update_list = [fix_df(v, f'df{i}') for i, v in enumerate(df_list, 1)]
# combine the dataframes
df = pd.concat(df_update_list).reset_index(drop=True)
# plot the dataframe
sns.lmplot(data=df, x='x', y='y', hue='df', ci=None)
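Note that fix_df renames the columns of the frames passed to it in place, so df1 and df2 are changed as a side effect. If the originals need to stay untouched, a non-mutating variant can be sketched as follows. It rebuilds df1 and df2 as in the question; the names frames and long_df are only illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(365)
df1 = pd.DataFrame({'x1': np.random.randn(50), 'y1': np.random.randn(50) * 100})
df2 = pd.DataFrame({'x2': np.random.randn(50), 'y2': np.random.randn(50) * 100})

frames = {'df1': df1, 'df2': df2}

# set_axis returns a relabeled copy, so df1 and df2 keep their original columns;
# assign adds the identifying column used with hue
long_df = pd.concat(
    [frame.set_axis(['x', 'y'], axis=1).assign(df=name) for name, frame in frames.items()],
    ignore_index=True,
)
```

long_df can then be passed to sns.lmplot(data=long_df, x='x', y='y', hue='df', ci=None) exactly as above.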
Notes
Package versions used for this answer:
pandas v1.2.4
seaborn v0.11.1
matplotlib v3.3.4

Overlapping boxplots in python

I have the following dataframe:
Av_Temp Tot_Precip
278.001 0
274 0.0751864
270.294 0.631634
271.526 0.229285
272.246 0.0652201
273 0.0840059
270.463 0.0602944
269.983 0.103563
268.774 0.0694555
269.529 0.010908
270.062 0.043915
271.982 0.0295718
and want to plot a boxplot where the x-axis is 'Av_Temp' divided into equal-sized bins (say 2 in this case), and the y-axis shows the corresponding range of values for Tot_Precip. I have the following code (thanks to Find pandas quartiles based on another column); however, when I plot the boxplots, they are plotted one on top of another. Any suggestions?
expl_var = 'Av_Temp'
cname = 'Tot_Precip'
df[expl_var+'_Deciles'] = pandas.qcut(df[expl_var], 2)
grp_df = df.groupby(expl_var+'_Deciles').apply(lambda x: numpy.array(x[cname]))
fig, ax = plt.subplots()
for i in range(len(grp_df)):
    box_arr = grp_df[i]
    box_arr = box_arr[~numpy.isnan(box_arr)]
    stats = cbook.boxplot_stats(box_arr, labels = str(i))
    ax.bxp(stats)
ax.set_yscale('log')
plt.show()
Since you're using pandas already, why not use the boxplot method on dataframes?
expl_var = 'Av_Temp'
cname = 'Tot_Precip'
df[expl_var+'_Deciles'] = pandas.qcut(df[expl_var], 2)
ax = df.boxplot(by='Av_Temp_Deciles', column='Tot_Precip')
ax.set_yscale('log')
That produces this: http://i.stack.imgur.com/20KPx.png
If you don't like the labels, throw in a
plt.xlabel('');plt.suptitle('');plt.title('')
If you want a standard boxplot, the above should be fine. My understanding of the separation of boxplot into boxplot_stats and bxp is to allow you to modify or replace the stats generated and fed to the plotting routine. See https://github.com/matplotlib/matplotlib/pull/2643 for some details.
If you need to draw a boxplot with non-standard stats, you can use boxplot_stats on 2D numpy arrays, so you only need to call it once. No loops required.
expl_var = 'Av_Temp'
cname = 'Tot_Precip'
df[expl_var+'_Deciles'] = pandas.qcut(df[expl_var], 2)
# I moved your nan check into the df apply function
grp_df = df.groupby('Av_Temp_Deciles').apply(lambda x: numpy.array(x[cname][~numpy.isnan(x[cname])]))
# boxplot_stats can take a 2D numpy array of data, and a 1D array of labels
# stats is now a list of dictionaries of stats, one dictionary per quantile
stats = cbook.boxplot_stats(grp_df.values, labels=grp_df.index)
# now it's a one-shot plot, no loops
fig, ax = plt.subplots()
ax.bxp(stats)
ax.set_yscale('log')
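As for why the boxes in the question overlap: each ax.bxp call in the loop receives a single stats dictionary, so with the default positions every box is drawn at x=1. If the loop is kept, passing explicit positions separates the boxes. The sketch below rebuilds the dataframe from the question; the column name 'Av_Temp_Bins' and the positions list are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib import cbook

df = pd.DataFrame({
    'Av_Temp': [278.001, 274, 270.294, 271.526, 272.246, 273,
                270.463, 269.983, 268.774, 269.529, 270.062, 271.982],
    'Tot_Precip': [0, 0.0751864, 0.631634, 0.229285, 0.0652201, 0.0840059,
                   0.0602944, 0.103563, 0.0694555, 0.010908, 0.043915, 0.0295718],
})
df['Av_Temp_Bins'] = pd.qcut(df['Av_Temp'], 2)

fig, ax = plt.subplots()
positions = []
for i, (interval, sub) in enumerate(df.groupby('Av_Temp_Bins', observed=True), start=1):
    values = sub['Tot_Precip'].dropna().to_numpy()
    stats = cbook.boxplot_stats(values, labels=[str(interval)])
    ax.bxp(stats, positions=[i])  # without positions, every call draws at x=1
    positions.append(i)
ax.set_yscale('log')
```

Call plt.show() to display the figure. The one-shot version above avoids the issue altogether, because a single bxp call spaces the boxes automatically.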
