I'm trying to rotate the x-axis labels of a barplot by 45° to make them readable (as is, they overlap).
len(genero) is 7, and len(filmes_por_genero) is 20
I'm using a MovieLens dataset and making a graph counting the number of movies in each individual genre. Here's my code as of now:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("whitegrid")
filmes_por_genero = filmes["generos"].str.get_dummies('|').sum().sort_values(ascending=False)
genero = filmes_com_media.index
chart = plt.figure(figsize=(16,8))
sns.barplot(x=genero,
            y=filmes_por_genero.values,
            palette=sns.color_palette("BuGn_r", n_colors=len(filmes_por_genero) + 4)
            )
chart.set_xticklabels(
    chart.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
)
Here's the full error:
/usr/local/lib/python3.6/dist-packages/pandas/core/groupby/grouper.py in get_grouper(obj, key, axis, level, sort, observed, mutated, validate)
623 in_axis=in_axis,
624 )
--> 625 if not isinstance(gpr, Grouping)
626 else gpr
627 )
/usr/local/lib/python3.6/dist-packages/pandas/core/groupby/grouper.py in __init__(self, index, grouper, obj, name, level, sort, observed, in_axis)
254 self.name = name
255 self.level = level
--> 256 self.grouper = _convert_grouper(index, grouper)
257 self.all_grouper = None
258 self.index = index
/usr/local/lib/python3.6/dist-packages/pandas/core/groupby/grouper.py in _convert_grouper(axis, grouper)
653 elif isinstance(grouper, (list, Series, Index, np.ndarray)):
654 if len(grouper) != len(axis):
--> 655 raise ValueError("Grouper and axis must be same length")
656 return grouper
657 else:
ValueError: Grouper and axis must be same length
Data from MovieLens 25M Dataset at MovieLens
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
# data
df = pd.read_csv('ml-25m/movies.csv')
print(df.head())
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
# clean genres
df['genres'] = df['genres'].str.split('|')
df = df.explode('genres', ignore_index=True)
print(df.head())
movieId title genres
0 1 Toy Story (1995) Adventure
1 1 Toy Story (1995) Animation
2 1 Toy Story (1995) Children
3 1 Toy Story (1995) Comedy
4 1 Toy Story (1995) Fantasy
Genres Counts
gc = df.genres.value_counts().to_frame()
print(gc)
genres
Drama 25606
Comedy 16870
Thriller 8654
Romance 7719
Action 7348
Horror 5989
Documentary 5605
Crime 5319
(no genres listed) 5062
Adventure 4145
Sci-Fi 3595
Children 2935
Animation 2929
Mystery 2925
Fantasy 2731
War 1874
Western 1399
Musical 1054
Film-Noir 353
IMAX 195
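A minimal sketch of why taking the labels and the heights from one object avoids the "Grouper and axis must be same length" error in the question, where genero (length 7) came from a different dataframe than the counts (length 20). The three-row filmes frame is hypothetical stand-in data; only the column name follows the question.

```python
import pandas as pd

# hypothetical three-row stand-in for the movies table
filmes = pd.DataFrame({"generos": ["Comedy|Romance", "Comedy|Drama|Romance", "Comedy"]})

# route used in the question: one indicator column per genre, then column sums
por_dummies = filmes["generos"].str.get_dummies("|").sum().sort_values(ascending=False)

# route used in this answer: one row per (movie, genre), then count the rows
por_explode = filmes["generos"].str.split("|").explode().value_counts()

# labels and heights come from the same Series, so their lengths cannot differ
x_labels = por_explode.index
heights = por_explode.values
```

Both routes should give the same counts; the point is only that x and y are derived from a single Series.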
sns.barplot
fig, ax = plt.subplots(figsize=(12, 6))
sns.barplot(x=gc.index, y=gc.genres, palette=sns.color_palette("BuGn_r", n_colors=len(gc) + 4), ax=ax)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.show()
plt.figure(figsize=(12, 6))
chart = sns.barplot(x=gc.index, y=gc.genres, palette=sns.color_palette("BuGn_r", n_colors=len(gc)))
chart.set_xticklabels(chart.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.show()
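Both variants above call set_xticklabels on an Axes. In the question, chart was the return value of plt.figure(), which is a Figure, so the label call would fail even once the length mismatch is fixed. A small matplotlib-only sketch of the difference, with hypothetical counts standing in for the genre totals:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# hypothetical counts standing in for the genre totals
counts = {"Drama": 25606, "Comedy": 16870, "Thriller": 8654}

fig = plt.figure(figsize=(6, 3))  # a Figure: the container, not the plot
ax = fig.add_subplot(111)         # an Axes: the object that owns the tick labels
ax.bar(list(counts), list(counts.values()))

figure_has_method = hasattr(fig, "set_xticklabels")
axes_has_method = hasattr(ax, "set_xticklabels")

# rotate the labels through the Axes
plt.setp(ax.get_xticklabels(), rotation=45, ha="right")
rotations = [label.get_rotation() for label in ax.get_xticklabels()]
```

sns.barplot returns the Axes it drew on, which is why chart = sns.barplot(...) works in the second variant.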
sns.countplot
Use sns.countplot to skip using .value_counts() if the plot order doesn't matter.
To order the countplot, order=df.genres.value_counts().index must be used, so countplot doesn't really save you from needing .value_counts(), if a descending order is desired.
fig, ax = plt.subplots(figsize=(12, 6))
sns.countplot(data=df, x='genres', ax=ax)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.show()
Shorter code for label rotation:
plt.xticks(rotation=45, ha='right')
Rotates labels by 45 degrees
Aligns labels horizontally to the right for better readability
Full Example
sns.countplot with sorted x-axis
import seaborn as sns
import matplotlib.pyplot as plt
df = sns.load_dataset('planets')
sns.countplot(data=df,
x='method',
order=df['method'].value_counts().index)
plt.xticks(rotation=45, ha='right');
To draw the plot, I am using seaborn. Below is my code:
import seaborn as sns
sns.set_theme(style="whitegrid")
tips = sns.load_dataset("tips")
tips=tips.head()
ax = sns.barplot(x="day", y="total_bill",hue="sex", data=tips, palette="tab20_r")
I want to get and print the frequency of each plotted group, that is, the number of times it occurs. Below is the expected image.
To add labels to the bars, I used the code below:
for rect in ax.patches:
    y_value = rect.get_height()
    x_value = rect.get_x() + rect.get_width() / 2
    space = 1
    label = "{:.0f}".format(y_value)
    ax.annotate(label, (x_value, y_value), xytext=(0, space), textcoords="offset points", ha='center', va='bottom')
plt.show()
With the above code, I can display the height of each bar, but I don't want the height. I want the frequency/count of each group. In the above example, 2 males and 3 females gave a tip on Sunday, so it should display 2 and 3 and not the tip amount.
Below is the code:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")
df = sns.load_dataset("tips")
ax = sns.barplot(x='day', y='tip',hue="sex", data=df, palette="tab20_r")
for rect in ax.patches:
    y_value = rect.get_height()
    x_value = rect.get_x() + rect.get_width() / 2
    space = 1
    label = "{:.0f}".format(y_value)
    ax.annotate(label, (x_value, y_value), xytext=(0, space), textcoords="offset points", ha='center', va='bottom')
plt.show()
How to display custom values on a bar plot does not clearly show how to annotate grouped bars, nor does it show how to determine the frequency of each hue category for each day.
How to plot and annotate grouped bars in seaborn / matplotlib shows how to annotate grouped bars, but not with custom labels.
for rect in ax.patches is an obsolete way to annotate bars. Use matplotlib.pyplot.bar_label, as fully described in How to add value labels on a bar chart.
Use pandas.crosstab or pandas.DataFrame.groupby to calculate the count of each category by the hue group.
As tips.info() shows, several columns have a category Dtype, which ensures the plotting order and is why the tp.index and tp.columns order matches the x-axis and hue order of ax. Use pandas.Categorical to set a column to a category Dtype.
Tested in python 3.11, pandas 1.5.2, matplotlib 3.6.2, seaborn 0.12.1
import pandas as pd
import seaborn as sns
# load the data
tips = sns.load_dataset('tips')
# determine the number of each gender for each day
tp = pd.crosstab(tips.day, tips.sex)
# or use groupby
# tp = tips.groupby(['day', 'sex']).sex.count().unstack('sex')
# plot the data
ax = sns.barplot(x='day', y='total_bill', hue='sex', data=tips)
# move the legend if needed
sns.move_legend(ax, bbox_to_anchor=(1, 1.02), loc='upper left', frameon=False)
# iterate through each group of bars, zipped to the corresponding column name
for c, col in zip(ax.containers, tp):
    # add bar labels with custom annotation values
    ax.bar_label(c, labels=tp[col], padding=3, label_type='center')
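The same container-by-container labelling can be sketched without seaborn or the tips download. This assumes matplotlib 3.4 or later (for bar_label) and uses a hypothetical five-row frame; the bars show mean total_bill while the labels show the counts.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for the sketch
import matplotlib.pyplot as plt
import pandas as pd

# hypothetical miniature of the tips data
df = pd.DataFrame({
    "day": ["Sat", "Sat", "Sat", "Sun", "Sun"],
    "sex": ["Male", "Female", "Male", "Male", "Female"],
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# bar heights: mean bill per day and sex; labels: number of rows per day and sex
means = df.pivot_table(index="day", columns="sex", values="total_bill", aggfunc="mean")
counts = pd.crosstab(df["day"], df["sex"])

ax = means.plot(kind="bar", rot=0)

# one container per column of `means`; label each with the matching counts column
label_texts = {}
for container, col in zip(ax.containers, means.columns):
    annotations = ax.bar_label(container, labels=counts[col], label_type="center")
    label_texts[col] = [a.get_text() for a in annotations]
```

The zip only lines up correctly because means and counts share the same row and column order, which is the role the category Dtype plays in the seaborn version.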
DataFrame Views
tips
tips.head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
tips.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 total_bill 244 non-null float64
1 tip 244 non-null float64
2 sex 244 non-null category
3 smoker 244 non-null category
4 day 244 non-null category
5 time 244 non-null category
6 size 244 non-null int64
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB
tp
sex Male Female
day
Thur 30 32
Fri 10 9
Sat 59 28
Sun 58 18
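To illustrate the point about category order, here is a small sketch with hypothetical data: once the columns are pandas.Categorical, pd.crosstab should return rows and columns in category order rather than alphabetical order.

```python
import pandas as pd

# hypothetical six-row sample; the category orders below are chosen by hand
df = pd.DataFrame({
    "day": ["Sun", "Sun", "Sun", "Thur", "Sun", "Thur"],
    "sex": ["Female", "Male", "Male", "Female", "Female", "Male"],
})
df["day"] = pd.Categorical(df["day"], categories=["Thur", "Sun"], ordered=True)
df["sex"] = pd.Categorical(df["sex"], categories=["Male", "Female"])

# counts of each sex per day, laid out in category order
tp = pd.crosstab(df["day"], df["sex"])
```

Without the Categorical conversion the index would sort as Sun, Thur and the columns as Female, Male, and the labels could end up on the wrong bars.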
I need help adding the percent distribution of the total (no decimals) in each section of a stacked bar plot in pandas created from a crosstab in a dataframe.
Here is sample data:
data = {
'Name':['Alisa','Bobby','Bobby','Alisa','Bobby','Alisa',
'Alisa','Bobby','Bobby','Alisa','Bobby','Alisa'],
'Exam':['Semester 1','Semester 1','Semester 1','Semester 1','Semester 1','Semester 1',
'Semester 2','Semester 2','Semester 2','Semester 2','Semester 2','Semester 2'],
'Subject':['Mathematics','Mathematics','English','English','Science','Science',
'Mathematics','Mathematics','English','English','Science','Science'],
'Result':['Pass','Pass','Fail','Pass','Fail','Pass','Pass','Fail','Fail','Pass','Pass','Fail']}
df = pd.DataFrame(data)
# display(df)
Name Exam Subject Result
0 Alisa Semester 1 Mathematics Pass
1 Bobby Semester 1 Mathematics Pass
2 Bobby Semester 1 English Fail
3 Alisa Semester 1 English Pass
4 Bobby Semester 1 Science Fail
5 Alisa Semester 1 Science Pass
6 Alisa Semester 2 Mathematics Pass
7 Bobby Semester 2 Mathematics Fail
8 Bobby Semester 2 English Fail
9 Alisa Semester 2 English Pass
10 Bobby Semester 2 Science Pass
11 Alisa Semester 2 Science Fail
Here is my code:
#crosstab
pal = ["royalblue", "dodgerblue", "lightskyblue", "lightblue"]
ax= pd.crosstab(df['Name'], df['Subject']).apply(lambda r: r/r.sum()*100, axis=1)
ax.plot.bar(figsize=(10,10),stacked=True, rot=0, color=pal)
display(ax)
plt.legend(loc='best', bbox_to_anchor=(0.1, 1.0),title="Subject",)
plt.xlabel('Name')
plt.ylabel('Percent Distribution')
plt.show()
I know I need to add a plt.text somehow, but can't figure it out. I would like the percent of the totals to be embedded within the stacked bars.
Let's try:
# crosstab
pal = ["royalblue", "dodgerblue", "lightskyblue", "lightblue"]
ax= pd.crosstab(df['Name'], df['Subject']).apply(lambda r: r/r.sum()*100, axis=1)
ax_1 = ax.plot.bar(figsize=(10,10), stacked=True, rot=0, color=pal)
display(ax)
plt.legend(loc='upper center', bbox_to_anchor=(0.1, 1.0), title="Subject")
plt.xlabel('Name')
plt.ylabel('Percent Distribution')
for rec in ax_1.patches:
    height = rec.get_height()
    ax_1.text(rec.get_x() + rec.get_width() / 2,
              rec.get_y() + height / 2,
              "{:.0f}%".format(height),
              ha='center',
              va='bottom')
plt.show()
Output:
Subject English Mathematics Science
Name
Alisa 33.333333 33.333333 33.333333
Bobby 33.333333 33.333333 33.333333
From matplotlib 3.4.2 use matplotlib.pyplot.bar_label
See this answer for a thorough explanation of using the method, and for additional examples.
Using label_type='center' will annotate with the value of each segment, and label_type='edge' will annotate with the cumulative sum of the segments.
It is easiest to plot stacked bars using pandas.DataFrame.plot with kind='bar' and stacked=True
To get the percent in a vectorized manner (without .apply):
Get the frequency count using pd.crosstab
Divide ct along axis=0 by ct.sum(axis=1)
It is important to specify the correct axis with .div and .sum.
Multiply by 100, and round.
This is best done using .crosstab because it results in a dataframe with the correct shape for plotting the stacked bars. .groupby would require further reshaping of the dataframe.
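A short sketch of why the axis argument matters, using a hypothetical two-by-two count table. Dividing along axis=0 aligns the row sums with the row index; dividing along axis=1 tries to align them with the column labels, finds no match, and yields only NaN.

```python
import pandas as pd

# hypothetical frequency table (rows: names, columns: subjects)
ct = pd.DataFrame({"English": [1, 2], "Science": [3, 6]}, index=["Alisa", "Bobby"])

row_totals = ct.sum(axis=1)  # one total per name

# correct: each row is divided by its own total
pct = ct.div(row_totals, axis=0).mul(100)

# wrong axis: totals indexed by name are aligned against subject columns
wrong = ct.div(row_totals, axis=1)
```

Checking that every row of pct sums to 100 is a cheap way to confirm the axes were chosen correctly.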
Tested in python 3.10, pandas 1.3.4, matplotlib 3.5.0
import pandas as pd
import matplotlib.pyplot as plt
# get a frequency count using crosstab
ct = pd.crosstab(df['Name'], df['Subject'])
# vectorized calculation of the percent per row
ct = ct.div(ct.sum(axis=1), axis=0).mul(100).round(2)
# display(ct)
Subject English Mathematics Science
Name
Alisa 33.33 33.33 33.33
Bobby 33.33 33.33 33.33
# specify custom colors
pal = ["royalblue", "dodgerblue", "lightskyblue", "lightblue"]
# plot
ax = ct.plot(kind='bar', figsize=(10, 10), stacked=True, rot=0, color=pal, xlabel='Name', ylabel='Percent Distribution')
# move the legend
ax.legend(title='Subject', bbox_to_anchor=(1, 1.02), loc='upper left')
# iterate through each bar container
for c in ax.containers:
    # add the annotations
    ax.bar_label(c, fmt='%0.0f%%', label_type='center')
plt.show()
Using label_type='edge' annotates with the cumulative sum
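The difference between the two label types can be checked on a small stacked plot. This sketch assumes matplotlib 3.4 or later and uses hypothetical percentages; it labels only the top segment of each stack so the two results are easy to compare.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for the sketch
import matplotlib.pyplot as plt
import pandas as pd

# hypothetical row percentages; each row sums to 100
pct = pd.DataFrame({"English": [25.0, 50.0], "Science": [75.0, 50.0]},
                   index=["Alisa", "Bobby"])

fig, (ax_center, ax_edge) = plt.subplots(ncols=2)
pct.plot(kind="bar", stacked=True, ax=ax_center, legend=False)
pct.plot(kind="bar", stacked=True, ax=ax_edge, legend=False)

# the last container holds the top segment (Science) of each stack
top_center = ax_center.bar_label(ax_center.containers[-1], fmt="%0.0f%%", label_type="center")
top_edge = ax_edge.bar_label(ax_edge.containers[-1], fmt="%0.0f%%", label_type="edge")

center_texts = [t.get_text() for t in top_center]  # segment values
edge_texts = [t.get_text() for t in top_edge]      # running totals at the bar tops
```

For percent-of-row plots, 'center' is usually the one wanted, since every 'edge' label on the top segment reads 100%.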
I have a dataframe:
count_single count_multi column_names
0 11345 7209 e
1 11125 6607 w
2 10421 5105 j
3 9840 4478 r
4 9561 5492 f
5 8317 3937 i
6 7808 3795 l
7 7240 4219 u
8 6915 3854 s
9 6639 2750 n
10 6340 2465 b
11 5627 2834 y
12 4783 2384 c
13 4401 1698 p
14 3305 1753 g
15 3283 1300 o
16 2767 1697 t
17 2453 1276 h
18 2125 1140 a
19 2090 929 q
20 1330 518 d
I want to visualize count_single and count_multi, with column_names as the common axis for both. I am looking for something like this:
What I've tried:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('paper')
f, ax = plt.subplots(figsize = (6,15))
sns.set_color_codes('pastel')
sns.barplot(x = 'count_single', y = 'column_names', data = df,
label = 'Type_1', color = 'orange', edgecolor = 'w')
sns.set_color_codes('muted')
sns.barplot(x = 'count_multi', y = 'column_names', data = df,
label = 'Type_2', color = 'green', edgecolor = 'w')
ax.legend(ncol = 2, loc = 'lower right')
sns.despine(left = True, bottom = True)
plt.show()
It gives me a plot like this:
How can I visualize these two columns as in the expected image?
I really appreciate any help you can provide.
# instantiate figure with two rows and one column
fig, axes = plt.subplots(nrows=2, figsize=(10,5))
# plot barplot in the first row
df.set_index('column_names').plot.bar(ax=axes[0], color=['rosybrown', 'tomato'])
# first scale each column by dividing by its sum, then take the cumulative sum to get the cumulative distribution; plot it on the second ax
df.set_index('column_names').apply(lambda x: x/x.sum()).cumsum().plot(ax=axes[1], color=['rosybrown', 'tomato'])
# change ticks in first plot:
axes[0].set_yticks(np.linspace(0, 12000, 7)) # this means: make 7 ticks between 0 and 12000
# adjust the axislabels for the second plot
axes[1].set_xticks(range(len(df)))
axes[1].set_xticklabels(df['column_names'], rotation=90)
plt.tight_layout()
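The scaling step in the second panel can be checked on its own. A minimal sketch using only the first four rows of the question's dataframe, so the numbers differ from the full plot: each column is divided by its sum and then accumulated, which means every curve should rise monotonically and end at 1.

```python
import pandas as pd

# first four rows of the question's dataframe (a truncated sample)
df = pd.DataFrame({
    "count_single": [11345, 11125, 10421, 9840],
    "count_multi": [7209, 6607, 5105, 4478],
    "column_names": ["e", "w", "j", "r"],
})

# share of each row within its column, then the running total of those shares
shares = df.set_index("column_names").apply(lambda col: col / col.sum())
cumulative = shares.cumsum()
```

Because both columns end at 1, the two curves are directly comparable even though the raw counts are on different scales.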
I'm trying to plot a grid of mosaic plots using Seaborn's FacetGrid and statsmodels' mosaic, but I can't quite get it to work.
Example dataset:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic
index = [i for j in [[k] * 6 for k in range(4)] for i in j]
gender = ['male', 'male', 'male', 'female', 'female', 'female'] * 4
pet = np.random.choice(['cat', 'dog'], 24).tolist()
data = pd.DataFrame({'index': index, 'gender': gender, 'pet': pet})
data.head(10)
index gender pet
0 0 male dog
1 0 male dog
2 0 male cat
3 0 female dog
4 0 female dog
5 0 female cat
6 1 male cat
7 1 male dog
8 1 male dog
9 1 female dog
I want to make a 2x2 grid of 4 mosaic plots, each for the subset of column index.
Now, a single mosaic plot of say the first group (index == 0):
data0 = data[data['index'] == 0]
props = {}
for x in ['female', 'male']:
    for y, col in {'dog': 'red', 'cat': 'blue'}.items():
        props[(x, y)] = {'color': col}
mosaic(data0, ['gender', 'pet'],
       labelizer=lambda k: '',
       properties=props)
plt.show()
But when I try to wrap this mosaic in a custom function that sns.FacetGrid.map() could use, it fails (this is one version; I tried a few):
def my_mosaic(sliced_data, **kwargs):
    mosaic(sliced_data, ['gender', 'pet'],
           labelizer=lambda k: '',
           properties=props)
g = sns.FacetGrid(data, col='index', col_wrap=2)
g = g.map(my_mosaic)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-323-a81a61aaeaff> in <module>()
5
6 g = sns.FacetGrid(data, col='index', col_wrap=2)
----> 7 g = g.map(my_mosaic)
~\AppData\Local\Programs\Python\Python36-32\lib\site-packages\seaborn\axisgrid.py in map(self, func, *args, **kwargs)
741
742 # Draw the plot
--> 743 self._facet_plot(func, ax, plot_args, kwargs)
744
745 # Finalize the annotations and layout
~\AppData\Local\Programs\Python\Python36-32\lib\site-packages\seaborn\axisgrid.py in _facet_plot(self, func, ax, plot_args, plot_kwargs)
825
826 # Draw the plot
--> 827 func(*plot_args, **plot_kwargs)
828
829 # Sort out the supporting information
TypeError: my_mosaic() missing 1 required positional argument: 'sliced_data'
I read the documentation and examples, but I just couldn't figure out how to make a callable function from any plotting function which isn't built-in in Seaborn or in matplotlib.pyplot (e.g. plt.scatter or sns.regplot).
I found it is easier to use map_dataframe() when you are ultimately dealing with... dataframes.
def my_mosaic(*args, **kwargs):
    mosaic(kwargs['data'], list(args),
           labelizer=lambda k: '',
           properties=props,
           ax=plt.gca())
g = sns.FacetGrid(data, col='index', col_wrap=2)
g = g.map_dataframe(my_mosaic, 'gender', 'pet')
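To see what map_dataframe hands to a custom function, here is a small sketch that records its arguments instead of plotting. It assumes seaborn is installed; the six-row frame and the record_facet helper are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for the sketch
import pandas as pd
import seaborn as sns

# hypothetical miniature of the question's data: two facets of three rows each
data = pd.DataFrame({
    "index": [0] * 3 + [1] * 3,
    "gender": ["male", "female", "male"] * 2,
    "pet": ["cat", "dog", "dog"] * 2,
})

received = []

def record_facet(*args, **kwargs):
    # positional arguments arrive unchanged (here: column names);
    # the facet's subset of the dataframe arrives as the `data` keyword
    received.append((args, len(kwargs["data"])))

g = sns.FacetGrid(data, col="index", col_wrap=2)
g.map_dataframe(record_facet, "gender", "pet")
```

By contrast, map() with no column names calls the function with no positional arguments at all, which matches the TypeError in the question.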
I have a dataframe called conversionRate like this:
            State  Apps  Loans  conversionratio
2013-01-01  IL     1165  152    13.047210
2013-01-01  NJ     2210  756    34.208145
2013-01-01  TX     1454  73     5.020633
2013-02-01  CA     2265  400    17.660044
2013-02-01  IL     1073  168    15.657036
2013-02-01  NJ     2036  739    36.296660
2013-02-01  TX     1370  63     4.598540
2013-03-01  CA     2545  548    21.532417
2013-03-01  IL     1108  172    15.523466
I intend to plot the number of apps and number of loans in the primary Y axis and the Conversion Ratio in the secondary axis for each state.
I tried the below code:
import math
rows =int(math.ceil(len(pd.Series.unique(conversionRate['State']))/2))
fig, axes = plt.subplots(nrows=rows, ncols=2, figsize=(10, 10),sharex=True, sharey=False)
columnCounter = itertools.cycle([0,1])
rowCounter1 = 0
for element in pd.Series.unique(conversionRate['State']):
    rowCounter = (rowCounter1)//2
    rowCounter1 = (rowCounter1+1)
    subSample = conversionRate[conversionRate['State']==element]
    axis=axes[rowCounter,next(columnCounter)]
    #ax2 = axis.twinx()
    subSample.plot(y=['Loans', 'Apps'],secondary_y=['conversionratio'],\
                   ax=axis)
I end up with a figure like the below:
The question is how do I get the secondary axis line to show? If I try the below (per the manual setting secondary_y in plot() should selectively plot those columns in the secondary axis), I see only the line I plot on the secondary axis. There must be something simple and obvious I am missing. I can't figure out what it is! Can any guru please help?
subSample.plot(secondary_y=['conversionratio'],ax=axis)
You need to include conversionratio in y=['Loans', 'Apps', 'conversionratio'] as well as in secondary_y, or better yet leave the y parameter out, since you're plotting all the columns.
rows = int(math.ceil(len(pd.Series.unique(conversionRate['State'])) / 2))
fig, axes = plt.subplots(nrows=rows, ncols=2, figsize=(10, 10), sharex=True, sharey=False)
columnCounter = itertools.cycle([0, 1])
rowCounter1 = 0
for element in pd.Series.unique(conversionRate['State']):
    rowCounter = rowCounter1 // 2
    rowCounter1 = rowCounter1 + 1
    subSample = conversionRate[conversionRate['State'] == element]
    axis = axes[rowCounter, next(columnCounter)]
    #ax2 = axis.twinx()
    subSample.plot(secondary_y=['conversionratio'], ax=axis)
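A minimal sketch, using the three IL rows from the question's table, of how pandas splits the columns between the two axes when every column is plotted and conversionratio is listed in secondary_y. The right_ax attribute is where pandas exposes the secondary axes; the variable names here are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for the sketch
import matplotlib.pyplot as plt
import pandas as pd

# the three IL rows from the question's table
il = pd.DataFrame(
    {
        "Apps": [1165, 1073, 1108],
        "Loans": [152, 168, 172],
        "conversionratio": [13.047210, 15.657036, 15.523466],
    },
    index=pd.to_datetime(["2013-01-01", "2013-02-01", "2013-03-01"]),
)

fig, axis = plt.subplots()

# no y= argument: all columns are drawn, one of them on the secondary axis
primary = il.plot(secondary_y=["conversionratio"], ax=axis)
secondary = primary.right_ax

n_primary_lines = len(primary.lines)      # Apps and Loans
n_secondary_lines = len(secondary.lines)  # conversionratio
```

If a y= list is passed as well, only the columns in that list are drawn, so a column named in secondary_y but missing from y never appears.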