How to plot a bar chart without aggregation in Seaborn?

How do you plot a bar chart without aggregation? I have two columns, one contains values and the other is categorical, but I want to plot each row individually, without aggregation.
By default, sns.barplot(x = "col1", y = "col2", data = df) will aggregate by taking the mean of the values for each category in col1.
How do I simply just plot a bar for each row in my dataframe with no aggregation?

If 'col1' contains only unique labels, sns.barplot(x='col1', y='col2', data=df) already draws one bar per row, because each category then holds a single value and the mean is that value. If labels repeat, use the index as x and change the tick labels afterwards:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
df = pd.DataFrame({'col1': list('ababab'), 'col2': np.random.randint(10, 20, 6)})
ax = sns.barplot(x=df.index, y='col2', data=df)
ax.set_xticklabels(df['col1'])
ax.set_xlabel('col1')
plt.show()
PS: Similarly, a horizontal bar chart could be created as:
df = pd.DataFrame({'col1': list('ababab'), 'col2': np.random.randint(10, 20, 6)})
ax = sns.barplot(x='col2', y=df.index, data=df, orient='h')
ax.set_yticklabels(df['col1'])
ax.set_ylabel('col1')
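As an alternative that avoids seaborn's aggregation step entirely, the bars can be drawn with Matplotlib directly. This is only a minimal sketch: the values in col2 are made-up sample data, and the Agg backend is set just so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample data with repeated labels in col1
df = pd.DataFrame({"col1": list("ababab"), "col2": [12, 15, 11, 18, 14, 17]})

fig, ax = plt.subplots()
positions = list(range(len(df)))      # one x position per row
bars = ax.bar(positions, df["col2"])  # no aggregation: one bar per row
ax.set_xticks(positions)
ax.set_xticklabels(df["col1"])        # show the (repeated) labels on the ticks
ax.set_xlabel("col1")
ax.set_ylabel("col2")
```

Because ax.bar receives explicit positions, repeated labels never collapse into one category.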

Related

How to change the legend font size of pd.DataFrame.plot() when `secondary_y` is used?

Question
I have used the secondary_y argument in pd.DataFrame.plot().
While trying to change the legend font size with .legend(fontsize=20), I ended up with only one column name in the legend, although two columns should be listed.
This problem (only one column name in the legend) does not occur when I do not use the secondary_y argument.
I want all the column names in my dataframe to be printed in the legend, and change the fontsize of the legend even when I use secondary_y while plotting dataframe.
Example
The following example with secondary_y shows only 1 column name A, when I have actually 2 columns, which are A and B.
The fontsize of the legend is changed, but only for 1 column name.
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.randn(24*3, 2),
index=pd.date_range('1/1/2019', periods=24*3, freq='h'))
df.columns = ['A', 'B']
df.plot(secondary_y = ["B"], figsize=(12,5)).legend(fontsize=20, loc="upper right")
When I do not use secondary_y, then legend shows both of the 2 columns A and B.
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.randn(24*3, 2),
index=pd.date_range('1/1/2019', periods=24*3, freq='h'))
df.columns = ['A', 'B']
df.plot(figsize=(12,5)).legend(fontsize=20, loc="upper right")
One way to customize it is to build the plot yourself with Matplotlib's subplots function, so that you hold the line handles of both axes:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
df = pd.DataFrame(np.random.randn(24*3, 2),
index=pd.date_range('1/1/2019', periods=24*3, freq='h'))
df.columns = ['A', 'B']
#define colors to use
col1 = 'steelblue'
col2 = 'red'
#define subplots
fig,ax = plt.subplots()
#add first line to plot
lns1=ax.plot(df.index,df['A'], color=col1)
#add x-axis label
ax.set_xlabel('dates', fontsize=14)
#add y-axis label
ax.set_ylabel('A', color=col1, fontsize=16)
#define second y-axis that shares x-axis with current plot
ax2 = ax.twinx()
#add second line to plot
lns2=ax2.plot(df.index,df['B'], color=col2)
#add second y-axis label
ax2.set_ylabel('B', color=col2, fontsize=16)
#legend
ax.legend(lns1+lns2,['A','B'],loc="upper right",fontsize=20)
#another solution is to create the legend on the figure instead:
#fig.legend(['A','B'],loc="upper right")
plt.show()
result:
This is a somewhat late response, but something that worked for me was simply calling plt.legend(fontsize=wanted_fontsize) after the plot function.
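It may also be possible to keep df.plot() and still get both entries, by collecting the lines from the left axes and from the twinned right axes. The sketch below assumes that pandas exposes the secondary axes as ax.right_ax (as described in the pandas visualization docs) and passes the labels explicitly. The random data is only a stand-in for the example above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.standard_normal((24 * 3, 2)),
                  index=pd.date_range("1/1/2019", periods=24 * 3, freq="h"),
                  columns=["A", "B"])

ax = df.plot(secondary_y=["B"], figsize=(12, 5))

# Column A is drawn on the left axes, column B on the twinned right axes,
# so the handles have to be gathered from both.
handles = list(ax.get_lines()) + list(ax.right_ax.get_lines())
legend = ax.legend(handles, ["A", "B"], fontsize=20, loc="upper right")
```

Calling ax.legend() without handles only sees the artists on the left axes, which is why a single entry shows up in the question.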

Python plotting by different dataframe columns (using Seaborn?)

I'm trying to create a scatterplot of a dataset with point coloring based on different categorical columns. Seaborn works well here for one plot:
fg = sns.FacetGrid(data=plot_data, hue='col_1')
fg.map(plt.scatter, 'x_data', 'y_data', **kws).add_legend()
plt.show()
I then want to display the same data, but with hue='col_2' and hue='col_3'. It works fine if I just make 3 plots, but I'm really hoping to find a way to have them all appear as subplots in one figure. Unfortunately, I haven't found any way to change the hue from one plot to the next. I know there are plotting APIs that allow for an axis keyword, thereby letting you pop it into a matplotlib figure, but I haven't found one that simultaneously allows you to set 'ax=' and 'hue='. Any ideas?
Thanks in advance!
Edit:
Here's some sample code to illustrate the idea
xx = np.random.rand(10,2)
cat1 = np.array(['cat','dog','dog','dog','cat','hamster','cat','cat','hamster','dog'])
cat2 = np.array(['blond','brown','brown','black','black','blond','blond','blond','brown','blond'])
d = {'x':xx[:,0], 'y':xx[:,1], 'pet':cat1, 'hair':cat2}
df = pd.DataFrame(data=d)
sns.set(style='ticks')
fg = sns.FacetGrid(data=df, hue='pet', height=5)  # 'size' was renamed to 'height' in seaborn 0.9
fg.map(plt.scatter, 'x', 'y').add_legend()
fg = sns.FacetGrid(data=df, hue='hair', height=5)
fg.map(plt.scatter, 'x', 'y').add_legend()
plt.show()
This plots what I want, but in two windows. The color scheme is set in the first plot by grouping by 'pet', and in the second plot by 'hair'. Is there any way to do this on one plot?
In order to plot 3 scatterplots with different colors for each, you may create 3 axes in matplotlib and plot a scatter to each axes.
import pandas as pd
import numpy as np; np.random.seed(42)
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.rand(10,5),
columns=["x", "y", "col1", "col2", "col3"])
fig, axes = plt.subplots(nrows=3)
for ax, col in zip(axes, df.columns[2:]):
    ax.scatter(df.x, df.y, c=df[col])
plt.show()
For categorical data it is often easier to plot several scatter plots, one per category.
import pandas as pd
import numpy as np; np.random.seed(42)
import matplotlib.pyplot as plt
import seaborn as sns
xx = np.random.rand(10,2)
cat1 = np.array(['cat','dog','dog','dog','cat','hamster','cat','cat','hamster','dog'])
cat2 = np.array(['blond','brown','brown','black','black','blond','blond','blond','brown','blond'])
d = {'x':xx[:,0], 'y':xx[:,1], 'pet':cat1, 'hair':cat2}
df = pd.DataFrame(data=d)
cols = ['pet',"hair"]
fig, axes = plt.subplots(nrows=len(cols))
for ax, col in zip(axes, cols):
    # one scatter call per category so each gets its own color and legend entry
    for n, group in df.groupby(col):
        ax.scatter(group.x, group.y, label=n)
    ax.legend()
plt.show()
You can certainly use a FacetGrid if you really want to, but that requires reshaping the DataFrame into long format first.
import pandas as pd
import numpy as np; np.random.seed(42)
import matplotlib.pyplot as plt
import seaborn as sns
xx = np.random.rand(10,2)
cat1 = np.array(['cat','dog','dog','dog','cat','hamster','cat','cat','hamster','dog'])
cat2 = np.array(['blond','brown','brown','black','black','blond','blond','blond','brown','blond'])
d = {'x':xx[:,0], 'y':xx[:,1], 'pet':cat1, 'hair':cat2}
df = pd.DataFrame(data=d)
df2 = pd.melt(df, id_vars=['x','y'], value_name='category', var_name="kind")
fg = sns.FacetGrid(data=df2, row="kind", hue='category', height=3)  # 'size' was renamed to 'height' in seaborn 0.9
fg.map(plt.scatter, 'x', 'y').add_legend()
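Regarding the wish for a function that takes both 'ax=' and 'hue=': in seaborn 0.9 and later, sns.scatterplot accepts both, so the loop over axes can stay in seaborn. A minimal sketch, reusing the sample data from the question (the subplot layout and titles are illustrative choices):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(42)
xx = rng.random((10, 2))
df = pd.DataFrame({
    "x": xx[:, 0],
    "y": xx[:, 1],
    "pet": ["cat", "dog", "dog", "dog", "cat", "hamster", "cat", "cat", "hamster", "dog"],
    "hair": ["blond", "brown", "brown", "black", "black", "blond", "blond", "blond", "brown", "blond"],
})

hue_columns = ["pet", "hair"]
fig, axes = plt.subplots(ncols=len(hue_columns), figsize=(10, 4),
                         sharex=True, sharey=True)
for ax, col in zip(axes, hue_columns):
    # same points on every axes, colored by a different categorical column
    sns.scatterplot(data=df, x="x", y="y", hue=col, ax=ax)
    ax.set_title(f"colored by {col}")
```

Each axes gets its own legend, so no reshaping of the DataFrame is needed.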

Seaborn Plot including different distributions of the same data

I wish to create a seaborn pointplot to display the full data distribution in a column, alongside the distribution of the lowest 25% of values, and the distribution of the highest 25% of values, and all side by side (on the x axis).
My attempt so far gives me the values, but they are all drawn at the same x position instead of being spread from left to right, and there is no obvious way to label the points via x-ticks (which I would prefer over a legend).
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook
df = sns.load_dataset('tips')
df1 = df[(df.total_bill < df.total_bill.quantile(.25))]
df2 = df[(df.total_bill > df.total_bill.quantile(.75))]
sns.pointplot(y=df['total_bill'], data=df, color='red')
sns.pointplot(y=df1['total_bill'], data=df1, color='green')
sns.pointplot(y=df2['total_bill'], data=df2, color='blue')
You could .join() the new distributions to your existing df and then .plot() using wide format:
lower, upper = df.total_bill.quantile([.25, .75]).values.tolist()
df = df.join(df.loc[df.total_bill < lower, 'total_bill'], rsuffix='_lower')
df = df.join(df.loc[df.total_bill > upper, 'total_bill'], rsuffix='_upper')
sns.pointplot(data=df.loc[:, [c for c in df.columns if c.startswith('total')]])
to get:
If you wanted to add groups, you could simply use .unstack() to get to long format:
df = df.loc[:, ['total_bill', 'total_bill_upper', 'total_bill_lower']].unstack().reset_index().drop('level_1', axis=1).dropna()
df.columns = ['grp', 'val']
to get:
sns.pointplot(x='grp', y='val', hue='grp', data=df)
I would think along the lines of adding a "group" and then plot as a single DataFrame.
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook
df = sns.load_dataset('tips')
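df = pd.concat([df, df])  # DataFrame.append was removed in pandas 2.0
df.loc[(df.total_bill < df.total_bill.quantile(.25)),'group'] = 'L'
df.loc[(df.total_bill > df.total_bill.quantile(.75)),'group'] = 'H'
df = df.reset_index(drop=True)
df.loc[len(df)//2:,'group'] = 'all'  # integer division; the second copy holds the full distribution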
sns.pointplot(data = df,
y='total_bill',
x='group',
hue='group',
linestyles='')
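The "add a group column" idea can also be written without duplicating the frame, by concatenating the full data with its two quantile subsets. This is a sketch under assumptions: the total_bill values are synthetic stand-ins for the tips dataset (so no download is needed), and the group names are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the 'total_bill' column of the tips dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({"total_bill": rng.gamma(4.0, 5.0, size=200)})

lower, upper = df["total_bill"].quantile([0.25, 0.75])

# Long format: every row carries the name of the distribution it belongs to
long_df = pd.concat(
    [
        df.assign(group="all"),
        df[df["total_bill"] < lower].assign(group="lowest 25%"),
        df[df["total_bill"] > upper].assign(group="highest 25%"),
    ],
    ignore_index=True,
)

group_order = ["lowest 25%", "all", "highest 25%"]
ax = sns.pointplot(data=long_df, x="group", y="total_bill", order=group_order)
```

With the group on the x axis, the three points are spread from left to right and labeled by x-ticks rather than by a legend.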

pandas boxplot: swap box placement for comparison

tmpdf.boxplot(['original','new'], by = 'by column', ax = ax, sym = '')
gets me a plot like this
I want to compare "original" with "new", how can I arrange to put the two "0" boxes in one panel and the two "1" boxes in another panel? And of course swap the labelling with that.
Thanks
Here is a sample dataset to demonstrate.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# simulate some artificial data
# ==========================================
np.random.seed(0)
df = pd.DataFrame(np.random.rand(10,2), columns=['original', 'new'] )
df['by column'] = pd.Series([0,0,0,0,1,1,1,1,1,1])
# your original plot
ax = df.boxplot(['original', 'new'], by='by column', figsize=(12,6))
To get desired output, use groupby explicitly out of boxplot, so that we iterate over all subgroups, and plot a boxplot for each.
ax = df[['original', 'new']].groupby(df['by column']).boxplot(figsize=(12,6))
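If more control over the panels is needed (titles, shared y axis), the same grouping can be written as an explicit loop. A minimal sketch, assuming the sample data above; the one-row layout and the title format are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.rand(10, 2), columns=["original", "new"])
df["by column"] = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

groups = df.groupby("by column")
fig, axes = plt.subplots(ncols=len(groups), figsize=(12, 6), sharey=True)
for ax, (key, group) in zip(axes, groups):
    # one panel per group value, with 'original' and 'new' side by side
    group[["original", "new"]].boxplot(ax=ax)
    ax.set_title(f"by column = {key}")
```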

Multiple histograms in Pandas

I would like to create the following histogram (see image below) taken from the book "Think Stats". However, I cannot get them on the same plot. Each DataFrame takes its own subplot.
I have the following code:
import nsfg
import matplotlib.pyplot as plt
df = nsfg.ReadFemPreg()
preg = nsfg.ReadFemPreg()
live = preg[preg.outcome == 1]
first = live[live.birthord == 1]
others = live[live.birthord != 1]
#fig = plt.figure()
#ax1 = fig.add_subplot(111)
first.hist(column = 'prglngth', bins = 40, color = 'teal', \
alpha = 0.5)
others.hist(column = 'prglngth', bins = 40, color = 'blue', \
alpha = 0.5)
plt.show()
The above code does not work when I use ax = ax1, as suggested in "pandas multiple plots not working as hists", and the example in "Overlaying multiple histograms using pandas" does not do what I need either. When I use the code as it is, it creates two windows with histograms. Any ideas how to combine them?
Here's an example of how I'd like the final figure to look:
As far as I can tell, pandas can't handle this situation. That's ok since all of their plotting methods are for convenience only. You'll need to use matplotlib directly. Here's how I do it:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas
#import seaborn
#seaborn.set(style='ticks')
np.random.seed(0)
df = pandas.DataFrame(np.random.normal(size=(37,2)), columns=['A', 'B'])
fig, ax = plt.subplots()
a_heights, a_bins = np.histogram(df['A'])
b_heights, b_bins = np.histogram(df['B'], bins=a_bins)
width = (a_bins[1] - a_bins[0])/3
ax.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue')
ax.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='seagreen')
#seaborn.despine(ax=ax, offset=10)
And that gives me:
In case anyone wants to plot one histogram over another (rather than alternating bars) you can simply call .hist() consecutively on the series you want to plot:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas
np.random.seed(0)
df = pandas.DataFrame(np.random.normal(size=(37,2)), columns=['A', 'B'])
df['A'].hist()
df['B'].hist()
This gives you:
Note that the order you call .hist() matters (the first one will be at the back)
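When overlaying, the two histograms are easier to compare if they share bin edges and are semi-transparent. The sketch below is one way to do that, assuming the same random data as above; np.histogram_bin_edges computes common edges over both columns, and the bin count of 20 is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.normal(size=(37, 2)), columns=["A", "B"])

# Common bin edges spanning the range of both columns
edges = np.histogram_bin_edges(df[["A", "B"]].to_numpy(), bins=20)

fig, ax = plt.subplots()
counts_a, _, _ = ax.hist(df["A"], bins=edges, alpha=0.5, label="A")
counts_b, _, _ = ax.hist(df["B"], bins=edges, alpha=0.5, label="B")
ax.legend()
```

With alpha below 1, the histogram drawn first stays visible behind the second one.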
A quick solution is to use melt() from pandas and then plot with seaborn.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# make dataframe
df = pd.DataFrame(np.random.normal(size=(200,2)), columns=['A', 'B'])
# plot melted dataframe in a single command
sns.histplot(df.melt(), x='value', hue='variable',
multiple='dodge', shrink=.75, bins=20);
Setting multiple='dodge' makes it so the bars are side-by-side, and shrink=.75 makes it so the pair of bars take up 3/4 of the whole bin.
To help understand what melt() did, these are the dataframes df and df.melt():
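The reshaping can be reproduced on a tiny frame. This is only an illustration with made-up numbers: melt() stacks the columns into a 'variable' column (the former column name) and a 'value' column, which is the long format that hue='variable' relies on.

```python
import pandas as pd

# Made-up wide-format data: one column per series
df = pd.DataFrame({"A": [0.1, 0.2, 0.3], "B": [1.1, 1.2, 1.3]})

# Long format: one row per (column name, value) pair
melted = df.melt()

print(df)
print(melted)
```

The melted frame has twice as many rows as df, first all values of A and then all values of B.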
From the pandas website (http://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization-hist):
df4 = pd.DataFrame({'a': np.random.randn(1000) + 1, 'b': np.random.randn(1000),
'c': np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])
plt.figure();
df4.plot(kind='hist', alpha=0.5)
You make two dataframes and one matplotlib axis
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'data1': np.random.randn(10),
'data2': np.random.randn(10)
})
df2 = df1.copy()
fig, ax = plt.subplots()
df1.hist(column=['data1'], ax=ax)
df2.hist(column=['data2'], ax=ax)
Here is the snippet. In my case I specified the bins and range explicitly, because I did not remove outliers as the author of the book did.
fig, ax = plt.subplots()
ax.hist([first.prglngth, others.prglngth], 10, (27, 50), histtype="bar", label=("First", "Other"))
ax.set_title("Histogram")
ax.legend()
Refer to the Matplotlib multihist example that plots data sets of different sizes.
This can be done more concisely, using the first and others DataFrames from the question:
plt.hist([first.prglngth, others.prglngth], bins = 40, color =('teal','blue'), label=("First", "Other"))
plt.legend(loc='best')
Note that as the number of bins increases, the plot may become visually cluttered.
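Since the nsfg module from the book is not generally available, here is a self-contained sketch of the same call. The two samples are synthetic stand-ins for first.prglngth and others.prglngth (their sizes and parameters are invented), chosen with different lengths to show that hist accepts that.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for first.prglngth and others.prglngth (different sizes)
rng = np.random.default_rng(0)
first = rng.normal(38.6, 2.8, size=400)
others = rng.normal(38.5, 2.6, size=450)

fig, ax = plt.subplots()
# A list of arrays yields grouped (side-by-side) bars that share the same bins
counts, edges, _ = ax.hist([first, others], bins=10, range=(27, 50),
                           histtype="bar", color=("teal", "blue"),
                           label=("First", "Other"))
ax.legend(loc="best")
```

Here counts holds one array of bin counts per data set, and edges holds the shared bin edges.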
You could also try to check out the pandas.DataFrame.plot.hist() function which will plot the histogram of each column of the dataframe in the same figure.
Overlapping bars can be hard to read, but it may be worth a try.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hist.html
