I'm working on a school project and I'm stuck making a grouped bar chart. I found this article online with an explanation: https://www.pythoncharts.com/2019/03/26/grouped-bar-charts-matplotlib/
I have a dataset with an Age column and a Sex column. The Age column holds the client's age in years, and the Sex column holds 0 for female and 1 for male. I want to plot the age difference between males and females. I tried the following code, based on the example:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import pylab as pyl
fig, ax = plt.subplots(figsize=(12, 8))
x = np.arange(len(data.Age.unique()))
# Define bar width. We'll use this to offset the second bar.
bar_width = 0.4
# Note we add the `width` parameter now which sets the width of each bar.
b1 = ax.bar(x, data.loc[data['Sex'] == '0', 'count'], width=bar_width)
# Same thing, but offset the x by the width of the bar.
b2 = ax.bar(x + bar_width, data.loc[data['Sex'] == '1', 'count'], width=bar_width)
This raised the following error: KeyError: 'count'
Then I tried to change the code a bit and got another error:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import pylab as pyl
fig, ax = plt.subplots(figsize=(12, 8))
x = np.arange(len(data.Age.unique()))
# Define bar width. We'll use this to offset the second bar.
bar_width = 0.4
# Note we add the `width` parameter now which sets the width of each bar.
b1 = ax.bar(x, (data.loc[data['Sex'] == '0'].count()), width=bar_width)
# Same thing, but offset the x by the width of the bar.
b2 = ax.bar(x + bar_width, (data.loc[data['Sex'] == '1'].count()), width=bar_width)
This raised the error: ValueError: shape mismatch: objects cannot be broadcast to a single shape
How do I count the values so that I can make this grouped bar chart?
It seems like the article goes to too much trouble just to plot a grouped bar chart:
import numpy as np
import pandas as pd
np.random.seed(1)
data = pd.DataFrame({'Sex':np.random.randint(0,2,1000),
                     'Age':np.random.randint(20,50,1000)})
(data.groupby('Age')['Sex'].value_counts() # count the Sex values for each Age
.unstack('Sex') # turn Sex into columns
.plot.bar(figsize=(12,6)) # plot grouped bar
)
Or even simpler with seaborn:
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize=(12,6))
sns.countplot(data=data, x='Age', hue='Sex', ax=ax)
Output:
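If you would rather keep the article's manual `ax.bar` approach, the missing piece is a table of counts per age and sex. The sketch below is one way to build it with `pd.crosstab`, reusing the random sample data from above. It assumes the Sex column holds the integers 0 and 1; if your column really holds integers, comparing it to the string '0' as in the question would select no rows.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Sample data standing in for the real dataset (assumption: Sex is stored as int 0/1)
np.random.seed(1)
data = pd.DataFrame({'Sex': np.random.randint(0, 2, 1000),
                     'Age': np.random.randint(20, 50, 1000)})

# One row per age, one column per sex; combinations that never occur become 0,
# so both columns always have the same length as x
counts = pd.crosstab(data['Age'], data['Sex'])

x = np.arange(len(counts))
bar_width = 0.4

fig, ax = plt.subplots(figsize=(12, 6))
ax.bar(x, counts[0], width=bar_width, label='female (0)')
# Same thing, but offset the x positions by the width of one bar
ax.bar(x + bar_width, counts[1], width=bar_width, label='male (1)')

# Put one tick between each pair of bars and label it with the age
ax.set_xticks(x + bar_width / 2)
ax.set_xticklabels(counts.index)
ax.set_xlabel('Age')
ax.set_ylabel('count')
ax.legend()
```

Because both bar series come from the same table, their lengths always match `x`, which avoids the shape mismatch error from the second attempt.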
Question
I have used the secondary_y argument in pd.DataFrame.plot().
While trying to change the font size of the legend with .legend(fontsize=20), I ended up with only 1 column name in the legend, although I have 2 columns that should appear there.
This problem (only 1 column name in the legend) does not occur when I do not use the secondary_y argument.
I want all the column names in my dataframe to appear in the legend, and I want to change the legend font size, even when I use secondary_y while plotting the dataframe.
Example
The following example with secondary_y shows only 1 column name A, when I have actually 2 columns, which are A and B.
The fontsize of the legend is changed, but only for 1 column name.
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.randn(24*3, 2),
index=pd.date_range('1/1/2019', periods=24*3, freq='h'))
df.columns = ['A', 'B']
df.plot(secondary_y = ["B"], figsize=(12,5)).legend(fontsize=20, loc="upper right")
When I do not use secondary_y, then legend shows both of the 2 columns A and B.
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.randn(24*3, 2),
index=pd.date_range('1/1/2019', periods=24*3, freq='h'))
df.columns = ['A', 'B']
df.plot(figsize=(12,5)).legend(fontsize=20, loc="upper right")
To customize the legend, create the plot with Matplotlib's subplots function and a twin axis:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
df = pd.DataFrame(np.random.randn(24*3, 2),
index=pd.date_range('1/1/2019', periods=24*3, freq='h'))
df.columns = ['A', 'B']
#define colors to use
col1 = 'steelblue'
col2 = 'red'
#define subplots
fig,ax = plt.subplots()
#add first line to plot
lns1=ax.plot(df.index,df['A'], color=col1)
#add x-axis label
ax.set_xlabel('dates', fontsize=14)
#add y-axis label
ax.set_ylabel('A', color=col1, fontsize=16)
#define second y-axis that shares x-axis with current plot
ax2 = ax.twinx()
#add second line to plot
lns2=ax2.plot(df.index,df['B'], color=col2)
#add second y-axis label
ax2.set_ylabel('B', color=col2, fontsize=16)
#legend
ax.legend(lns1+lns2,['A','B'],loc="upper right",fontsize=20)
#another solution is to create the legend on the figure:
#fig.legend(['A','B'],loc="upper right")
plt.show()
result:
This is a somewhat late response, but something that worked for me was simply calling plt.legend(fontsize=wanted_fontsize) after the plot function.
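Another option that keeps the pandas `secondary_y` plot is to collect the legend entries from both axes yourself. Calling `.legend()` on the returned axes only sees the lines drawn on that axes, which is why B goes missing. The sketch below assumes the axes returned by `df.plot` is the left one and that the secondary axes is available as `ax.right_ax`, which is how pandas exposes it when only some of the columns are on the secondary axis.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)
df = pd.DataFrame(np.random.randn(24*3, 2),
                  index=pd.date_range('1/1/2019', periods=24*3, freq='h'),
                  columns=['A', 'B'])

# Let pandas draw both lines, but skip its automatic legend
ax = df.plot(secondary_y=['B'], figsize=(12, 5), legend=False)

# Gather handles and labels from the left axes and from the secondary (right) axes
handles_left, labels_left = ax.get_legend_handles_labels()
handles_right, labels_right = ax.right_ax.get_legend_handles_labels()

# Build a single legend containing both columns with the wanted font size
legend = ax.legend(handles_left + handles_right,
                   labels_left + labels_right,
                   fontsize=20, loc='upper right')
```

Depending on the pandas version, the label for B may or may not carry the " (right)" suffix; pass your own list of labels to `ax.legend` if you want full control.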
I have plotted a heatmap, which is displayed below. The x-axis shows the time of day and the y-axis shows the date. I want the x-axis to show a label at every hour instead of the arbitrary x labels it displays here.
I tried the following code, but in the resulting heatmap all the x labels are drawn on top of each other:
t = pd.date_range(start='00:00:00', end='23:59:59', freq='60T').time
df = pd.DataFrame(index=t)
df.reset_index(inplace=True)
df['index'] = df['index'].astype('str')
sns_hm = sns.heatmap(data=mat, cbar=True, lw=0,cmap=colormap,xticklabels=df['index'])
The following code assumes mat is a dataframe with one row per day and one column per timestamp, with the same timestamps repeating on every day.
After drawing the heatmap, the left and right limits of the x-axis are retrieved. Assuming these run from 0 to 24 hours, the range can be subdivided into 25 evenly spaced positions, one for each hour boundary from 00:00 to 24:00.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from pandas.tseries.offsets import DateOffset
from matplotlib.colors import ListedColormap, to_hex
# first, create some test data
df = pd.DataFrame()
df["date"] = pd.date_range('20220304', periods=19000, freq=DateOffset(seconds=54))
df["val"] = (((np.random.rand(len(df)) ** 100).cumsum() / 2).astype(int) % 2) * 100
df['day'] = df['date'].dt.strftime('%d-%m-%Y')
df['time'] = df['date'].dt.strftime('%H:%M:%S')
mat = df.pivot(index='day', columns='time', values='val')
colors = list(plt.cm.Greens(np.linspace(0.2, 0.9, 10)))
ax = sns.heatmap(mat, cmap=colors, cbar_kws={'ticks': range(0, 101, 10)})
xmin, xmax = ax.get_xlim()
tick_pos = np.linspace(xmin, xmax, 25)
tick_labels = [f'{h:02d}:00:00' for h in range(len(tick_pos))]
ax.set_xticks(tick_pos)
ax.set_xticklabels(tick_labels, rotation=90)
ax.set(xlabel='', ylabel='')
plt.tight_layout()
plt.show()
The left plot shows the default tick labels, the right plot the customized labels.
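If the columns do not cover exactly 0 to 24 hours, an alternative is to derive the tick positions from the column labels themselves. The following is a minimal sketch under the assumption that the columns of mat are 'HH:MM:SS' strings in ascending order; it places a tick at the first column of each hour. The small 10-minute test data set is made up for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up test data: 3 days sampled every 10 minutes (144 samples per day)
rng = np.random.default_rng(0)
times = pd.date_range('2022-03-04', periods=3 * 144, freq='10min')
df = pd.DataFrame({'date': times, 'val': rng.integers(0, 100, len(times))})
df['day'] = df['date'].dt.strftime('%Y-%m-%d')
df['time'] = df['date'].dt.strftime('%H:%M:%S')
mat = df.pivot(index='day', columns='time', values='val')

ax = sns.heatmap(mat)

# The first two characters of each column label are the hour
hours = mat.columns.str[:2]
# True for the first column belonging to each hour
is_first = ~hours.duplicated()
# Heatmap cell i spans [i, i + 1], so its center is at i + 0.5
tick_pos = np.flatnonzero(is_first) + 0.5
tick_labels = [f'{h}:00' for h in hours[is_first]]

ax.set_xticks(tick_pos)
ax.set_xticklabels(tick_labels, rotation=90)
ax.set(xlabel='', ylabel='')
plt.tight_layout()
```

This keeps each label attached to an actual column, so hours with no data simply get no tick.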
I have an issue with axis labels when using groupby and trying to plot with seaborn. Here is my problem:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.DataFrame({'user': ['Bob', 'Jane','Alice','Bob','Jane','Alice'],
'income': [40000, 50000, 42000,47000,53000,46000]})
df_group_user = df.groupby(['user']).sum().reset_index()
I then plot a horizontal bar plot using seaborn:
bar = sns.barplot( x="income", y="user", data=df_group_user, color="b" )
#Prettify the plot
bar.set_yticklabels( bar.get_yticks(), size = 10)
bar.set_xticklabels( bar.get_xticks(), size = 10)
bar.set_ylabel("User", fontsize = 20)
bar.set_xlabel("Income ($)", fontsize = 20)
bar.set_title("Total income per user", fontsize = 20)
sns.set_theme(style="whitegrid")
sns.set_color_codes("muted")
Unfortunately, when I run the code in such a manner, the y-axis ticks are labelled as 0,1,2 instead of Bob, Jane, Alice as I'd like it to.
I can get around the issue if I use matplotlib in the following manner:
df_group_user = df.groupby(['user']).sum()
df_group_user['income'].plot(kind="barh")
plt.title("Total income per user")
plt.ylabel("User")
plt.xlabel("Income ($)")
Ideally, I'd like to use seaborn for plotting, but if I don't use reset_index() like above, calling sns.barplot raises an error:
bar = sns.barplot( x="income", y="user", data=df_group_user, color="b" )
ValueError: Could not interpret input 'user'
Just try rewriting how you pass the x and y axes.
I'm using a different dataframe to show a similar situation.
gp = df.groupby("Gender")['Salary'].sum().reset_index()
gp
Output:
   Gender  Salary
0  Female    8870
1    Male   23667
Now, when plotting the bar chart, specify the x axis first, then supply the y axis, and check the result:
bar = sns.barplot(x = 'Salary', y = "Gender", data = gp);
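Applied to the data from the question, the numeric labels most likely come from the line `bar.set_yticklabels(bar.get_yticks(), size = 10)`: `get_yticks()` returns the tick positions (0, 1, 2), and passing them to `set_yticklabels` overwrites the names seaborn had already put there. A minimal sketch that changes only the label size through `tick_params` and leaves the label text alone:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice', 'Bob', 'Jane', 'Alice'],
                   'income': [40000, 50000, 42000, 47000, 53000, 46000]})

# reset_index() turns 'user' back into a regular column that seaborn can find by name
df_group_user = df.groupby('user')['income'].sum().reset_index()

fig, ax = plt.subplots()
sns.barplot(x='income', y='user', data=df_group_user, color='b', ax=ax)

# Change only the tick label size; the label text set by seaborn is kept
ax.tick_params(axis='both', labelsize=10)
ax.set_ylabel('User', fontsize=20)
ax.set_xlabel('Income ($)', fontsize=20)
ax.set_title('Total income per user', fontsize=20)
```

The same idea applies to the x-axis: `set_xticklabels(get_xticks())` is not needed just to change the font size.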
In Pandas, I am doing:
bp = p_df.groupby('class').plot(kind='kde')
p_df is a dataframe object.
However, this is producing two plots, one for each class.
How do I force one plot with both classes in the same plot?
Version 1:
You can create your axis, and then use the ax keyword of DataFrameGroupBy.plot to add everything to these axes:
import pandas as pd
import matplotlib.pyplot as plt
p_df = pd.DataFrame({"class": [1,1,2,2,1], "a": [2,3,2,3,2]})
fig, ax = plt.subplots(figsize=(8,6))
bp = p_df.groupby('class').plot(kind='kde', ax=ax)
This is the result:
Unfortunately, the labeling of the legend does not make too much sense here.
Version 2:
Another way would be to loop through the groups and plot the curves manually:
classes = ["class 1"] * 5 + ["class 2"] * 5
vals = [1,3,5,1,3] + [2,6,7,5,2]
p_df = pd.DataFrame({"class": classes, "vals": vals})
fig, ax = plt.subplots(figsize=(8,6))
for label, df in p_df.groupby('class'):
    df.vals.plot(kind="kde", ax=ax, label=label)
plt.legend()
This way you can easily control the legend. This is the result:
import matplotlib.pyplot as plt
p_df.groupby('class').plot(kind='kde', ax=plt.gca())
Another approach is to use the seaborn module. It plots the two density estimates on the same axes without needing a variable to hold the axes (using some of the data frame setup from the other answer):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# data to create an example data frame
classes = ["c1"] * 5 + ["c2"] * 5
vals = [1,3,5,1,3] + [2,6,7,5,2]
# the data frame
df = pd.DataFrame({"cls": classes, "vals": vals})
# this is to plot the kde
sns.kdeplot(df.vals[df.cls == "c1"],label='c1');
sns.kdeplot(df.vals[df.cls == "c2"],label='c2');
# show the legend for the two labels and beautify the axis labels
plt.legend()
plt.xlabel('value')
plt.ylabel('density')
plt.show()
This results in the following image.
There are two easy methods to plot each group in the same plot.
When using pandas.DataFrame.groupby, the column to be plotted, (e.g. the aggregation column) should be specified.
Use seaborn.kdeplot or seaborn.displot and specify the hue parameter
Using pandas v1.2.4, matplotlib 3.4.2, seaborn 0.11.1
The OP is specific to plotting the kde, but the steps are the same for many plot types (e.g. kind='line', sns.lineplot, etc.).
Imports and Sample Data
For the sample data, the groups are in the 'kind' column, and the kde of 'duration' will be plotted, ignoring 'waiting'.
import pandas as pd
import seaborn as sns
df = sns.load_dataset('geyser')
# display(df.head())
   duration  waiting   kind
0     3.600       79   long
1     1.800       54  short
2     3.333       74   long
3     2.283       62  short
4     4.533       85   long
Plot with pandas.DataFrame.plot
Reshape the data using .groupby or .pivot
.groupby
Specify the aggregation column, ['duration'], and kind='kde'.
ax = df.groupby('kind')['duration'].plot(kind='kde', legend=True)
.pivot
ax = df.pivot(columns='kind', values='duration').plot(kind='kde')
Plot with seaborn.kdeplot
Specify hue='kind'
ax = sns.kdeplot(data=df, x='duration', hue='kind')
Plot with seaborn.displot
Specify hue='kind' and kind='kde'
fig = sns.displot(data=df, kind='kde', x='duration', hue='kind')
Plot
Maybe you can try this:
fig, ax = plt.subplots(figsize=(10,8))
classes = list(df['class'].unique())  # 'class' is a Python keyword, so use bracket access
for c in classes:
    df2 = df.loc[df['class'] == c]
    df2.vals.plot(kind="kde", ax=ax, label=c)
plt.legend()
In addition to the solution posted in this link, I would also like to add the hue parameter and show the median values in each of the plots.
The Current Code:
testPlot = sns.boxplot(x='Pclass', y='Age', hue='Sex', data=trainData)
m1 = trainData.groupby(['Pclass', 'Sex'])['Age'].median().values
mL1 = [str(np.round(s, 2)) for s in m1]
p1 = range(len(m1))
for tick, label in zip(p1, testPlot.get_xticklabels()):
    print(testPlot.text(p1[tick], m1[tick] + 1, mL1[tick]))
Gives an output like:
I'm working on the Titanic Dataset which can be found in this link.
I'm getting the required values, but only when I use a print statement. How do I include them in my plot?
Place the labels manually by looping over all xticklabels and offsetting each label according to the hue level and the width of the boxes in every category:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
trainData = pd.read_csv('titanic.csv')
testPlot = sns.boxplot(x='pclass', y='age', hue='sex', data=trainData)
m1 = trainData.groupby(['pclass', 'sex'])['age'].median().values
mL1 = [str(np.round(s, 2)) for s in m1]
ind = 0
for tick in range(len(testPlot.get_xticklabels())):
    testPlot.text(tick-.2, m1[ind+1]+1, mL1[ind+1], horizontalalignment='center', color='w', weight='semibold')
    testPlot.text(tick+.2, m1[ind]+1, mL1[ind], horizontalalignment='center', color='w', weight='semibold')
    ind += 2
plt.show()
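The fixed offsets of -.2 and +.2 above only fit two hue levels. As a rough sketch, the offsets can be computed for any number of hue levels, assuming seaborn's default total box width of 0.8 per category is unchanged and that the x and hue orders are passed explicitly, so the labels are guaranteed to line up with the boxes. The random data frame below is made up so the example runs without downloading the Titanic file.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up stand-in for the Titanic data (same column names as in the answer above)
rng = np.random.default_rng(0)
df = pd.DataFrame({'pclass': rng.integers(1, 4, 300),
                   'sex': rng.choice(['male', 'female'], 300),
                   'age': rng.normal(30, 10, 300).round(1)})

# Fix both orders explicitly so the label positions match the boxes
x_order = sorted(df['pclass'].unique())
hue_order = sorted(df['sex'].unique())
ax = sns.boxplot(x='pclass', y='age', hue='sex', data=df,
                 order=x_order, hue_order=hue_order)

medians = df.groupby(['pclass', 'sex'])['age'].median()
n_hue = len(hue_order)
width = 0.8  # assumption: seaborn's default total width per category

for i, cat in enumerate(x_order):
    for j, hue in enumerate(hue_order):
        # Center of the j-th box inside category i
        offset = (j - (n_hue - 1) / 2) * width / n_hue
        m = medians.loc[(cat, hue)]
        ax.text(i + offset, m + 0.5, f'{m:.1f}',
                horizontalalignment='center', color='w', weight='semibold')
```

If you pass a different `width` to `sns.boxplot`, use the same value here.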
This answer is nearly copied and pasted from here, but adapted to fit your example code. The linked answer is IMHO a bit misplaced there, because that question is just about labeling a boxplot and not about a boxplot using the hue argument.
I couldn't use your Train dataset because it is not available as a Python package, so I used Titanic instead, which has nearly the same column names.
#!/usr/bin/env python3
import pandas as pd
import matplotlib
import matplotlib.patheffects as path_effects
import seaborn as sns
def add_median_labels(ax, fmt='.1f'):
    """Credits: https://stackoverflow.com/a/63295846/4865723
    """
    lines = ax.get_lines()
    boxes = [c for c in ax.get_children() if type(c).__name__ == 'PathPatch']
    lines_per_box = int(len(lines) / len(boxes))
    for median in lines[4:len(lines):lines_per_box]:
        x, y = (data.mean() for data in median.get_data())
        # choose value depending on horizontal or vertical plot orientation
        value = x if (median.get_xdata()[1] - median.get_xdata()[0]) == 0 else y
        text = ax.text(x, y, f'{value:{fmt}}', ha='center', va='center',
                       fontweight='bold', color='white')
        # create median-colored border around white text for contrast
        text.set_path_effects([
            path_effects.Stroke(linewidth=3, foreground=median.get_color()),
            path_effects.Normal(),
        ])
df = sns.load_dataset('titanic')
plot = sns.boxplot(x='pclass', y='age', hue='sex', data=df)
add_median_labels(plot)
plot.figure.show()
As an alternative, you can create your boxplot with a figure-level function. In that case you need to pass the axes to add_median_labels().
# imports and add_median_labels() unchanged
df = sns.load_dataset('titanic')
plot = sns.catplot(kind='box', x='pclass', y='age', hue='sex', data=df)
add_median_labels(plot.axes[0][0])
plot.figure.show()
The resulting plot
This solution also works with more than two categories in the column used for the hue argument.