Seaborn FacetGrid with mosaic plot - python

Trying to plot a grid of mosaic plots using Seaborn's FacetGrid and statsmodels' mosaic and not quite making it.
Example dataset:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic
index = [i for j in [[k] * 6 for k in range(4)] for i in j]
gender = ['male', 'male', 'male', 'female', 'female', 'female'] * 4
pet = np.random.choice(['cat', 'dog'], 24).tolist()
data = pd.DataFrame({'index': index, 'gender': gender, 'pet': pet})
data.head(10)
   index  gender  pet
0      0    male  dog
1      0    male  dog
2      0    male  cat
3      0  female  dog
4      0  female  dog
5      0  female  cat
6      1    male  cat
7      1    male  dog
8      1    male  dog
9      1  female  dog
I want to make a 2x2 grid of four mosaic plots, one for each value of the index column.
Now, a single mosaic plot of say the first group (index == 0):
data0 = data[data['index'] == 0]
props = {}
for x in ['female', 'male']:
    for y, col in {'dog': 'red', 'cat': 'blue'}.items():
        props[(x, y)] = {'color': col}
mosaic(data0, ['gender', 'pet'],
       labelizer=lambda k: '',
       properties=props)
plt.show()
But when I wrap this mosaic call in a custom function that sns.FacetGrid.map() could use, it fails (this is one version, I tried a few):
def my_mosaic(sliced_data, **kwargs):
    mosaic(sliced_data, ['gender', 'pet'],
           labelizer=lambda k: '',
           properties=props)
g = sns.FacetGrid(data, col='index', col_wrap=2)
g = g.map(my_mosaic)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-323-a81a61aaeaff> in <module>()
5
6 g = sns.FacetGrid(data, col='index', col_wrap=2)
----> 7 g = g.map(my_mosaic)
~\AppData\Local\Programs\Python\Python36-32\lib\site-packages\seaborn\axisgrid.py in map(self, func, *args, **kwargs)
741
742 # Draw the plot
--> 743 self._facet_plot(func, ax, plot_args, kwargs)
744
745 # Finalize the annotations and layout
~\AppData\Local\Programs\Python\Python36-32\lib\site-packages\seaborn\axisgrid.py in _facet_plot(self, func, ax, plot_args, plot_kwargs)
825
826 # Draw the plot
--> 827 func(*plot_args, **plot_kwargs)
828
829 # Sort out the supporting information
TypeError: my_mosaic() missing 1 required positional argument: 'sliced_data'
I read the documentation and examples, but I just couldn't figure out how to make a callable function from any plotting function which isn't built-in in Seaborn or in matplotlib.pyplot (e.g. plt.scatter or sns.regplot).

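FacetGrid.map() passes only the values of the columns you name as positional arguments, and here none were named, so my_mosaic() was called with no arguments. I found it easier to use map_dataframe() when you are ultimately dealing with dataframes: it passes each facet's subset as the data keyword and the column names you give as positional strings.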
def my_mosaic(*args, **kwargs):
    mosaic(kwargs['data'], list(args),
           labelizer=lambda k: '',
           properties=props,
           ax=plt.gca())
g = sns.FacetGrid(data, col='index', col_wrap=2)
g = g.map_dataframe(my_mosaic, 'gender', 'pet')
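To see what map_dataframe() actually hands to a custom function, a small recording function can stand in for the plotting call. This is a minimal sketch, not part of the original answer: record_call is a hypothetical helper, and the data frame only mirrors the shape of the example data (the facet column is named group here).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import pandas as pd
import seaborn as sns

# Small frame shaped like the example data: 4 groups of 6 rows each
df = pd.DataFrame({
    "group": [g for g in range(4) for _ in range(6)],
    "gender": ["male", "male", "male", "female", "female", "female"] * 4,
    "pet": ["cat", "dog"] * 12,
})

calls = []  # one entry per facet


def record_call(*args, **kwargs):
    """Hypothetical helper: record what FacetGrid passes instead of plotting."""
    calls.append((args, kwargs["data"].shape))


g = sns.FacetGrid(df, col="group", col_wrap=2)
g.map_dataframe(record_call, "gender", "pet")

for args, shape in calls:
    print(args, shape)
```

Each facet should produce one call whose positional arguments are the column names as strings and whose data keyword holds that facet's rows, which is why the answer reads kwargs['data'] and list(args).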

Related

TypeError: unhashable type: 'numpy.ndarray' using dataset for values

I want to draw a bar graph showing the 3 medal counts of one country. I've already dropped all the other teams so that only one is left, but when I plot it I get this error...
This is what I got:
N = 3
ind = np.arange(N)
width = 0.25
goldMedals = df[(df.Medal == 'Gold')]
bar1 = plt.bar(ind, goldMedals, width, color = 'gold')
silverMedals = df[(df.Medal == 'Silver')]
bar2 = plt.bar(ind+width, silverMedals, width, color='bronze')
bronzeMedals = df[(df.Medal == 'Bronze')]
bar3 = plt.bar(ind+width*2, bronzeMedals, width, color = 'b')
plt.xlabel("Medal")
plt.ylabel('Count')
plt.title("Medal Portugal")
plt.xticks(ind+width,['Gold', 'Bronze', 'Silver'])
plt.legend( (bar1, bar2, bar3), ('Gold', 'Bronze', 'Silver') )
plt.show()
I am guessing, since you have not presented any data, but I think you are trying to visualize the results of conditional selection on a long-form data frame. The error occurs because each selection (for example df[df.Medal == 'Gold']) is a whole data frame, and that data frame is passed to plt.bar as the bar heights. Instead, for each set of bars, pass one column of the selection as the x values and another column as the heights.
import pandas as pd
import numpy as np
import io
data = '''
nation medal count
0 "South Korea" gold 24
1 China gold 10
2 Canada gold 9
3 "South Korea" silver 13
4 China silver 15
5 Canada silver 12
6 "South Korea" bronze 11
7 China bronze 8
8 Canada bronze 12
'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+')  # delim_whitespace=True is deprecated in recent pandas
df = df.query('nation == "Canada"')
import matplotlib.pyplot as plt
N = 3
ind = np.arange(N)
width = 0.25
goldMedals = df[df.medal == 'gold']
bar1 = plt.bar(goldMedals.medal, goldMedals['count'], width, color='gold')
silverMedals = df[(df.medal == 'silver')]
bar2 = plt.bar(silverMedals.medal, silverMedals['count'], width, color='gray')
bronzeMedals = df[(df.medal == 'bronze')]
bar3 = plt.bar(bronzeMedals.medal, bronzeMedals['count'], width, color='brown')
plt.xlabel("Medal")
plt.ylabel('Count')
plt.title("Medal Canada")
#plt.xticks(ind+width,['Gold', 'Silver','Bronze'])
plt.legend((bar1, bar2, bar3), ('Gold', 'Silver', 'Bronze'))
plt.show()
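If all nations should appear side by side, the long-form frame can also be pivoted to wide form and plotted in one call. This is a sketch of an alternative to the answer above, not the answerer's method; it reuses the same sample values, and the bar colors are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Same long-form sample as in the answer above
df = pd.DataFrame({
    "nation": ["South Korea", "China", "Canada"] * 3,
    "medal": ["gold"] * 3 + ["silver"] * 3 + ["bronze"] * 3,
    "count": [24, 10, 9, 13, 15, 12, 11, 8, 12],
})

# One row per nation, one column per medal, in podium order
wide = df.pivot(index="nation", columns="medal", values="count")
wide = wide[["gold", "silver", "bronze"]]

# pandas draws one group of bars per row and one bar per column
ax = wide.plot.bar(color=["gold", "silver", "peru"], rot=0)
ax.set_ylabel("Count")
ax.set_title("Medals by nation")
plt.tight_layout()
```

Because the legend is built from the column names, the labels cannot get out of step with the bars.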

Better way to plot Gender count using Python

I am making a graph of gender counts for time series data that looks like the following. Each row represents one hour of data for the respective patient.
HR   SBP  DBP  Sepsis  Gender  P_ID
92   120  80   0       0       0
98   115  85   0       0       0
93   125  75   1       1       1
95   130  90   1       1       1
102  120  80   0       0       2
109  115  75   0       0       2
94   135  100  0       0       2
97   100  70   1       1       3
85   120  80   1       1       3
88   115  75   1       1       3
93   125  85   1       1       3
78   130  90   1       0       4
115  140  110  1       0       4
102  120  80   0       1       5
98   140  110  0       1       5
This is my code:
gender = df_n['Gender'].value_counts()
plt.figure(figsize=(7, 6))
ax = gender.plot(kind='bar', rot=0, color="c")
ax.set_title("Bar Graph of Gender", y = 1)
ax.set_xlabel('Gender')
ax.set_ylabel('Number of People')
ax.set_xticklabels(('Male', 'Female'))
for rect in ax.patches:
    y_value = rect.get_height()
    x_value = rect.get_x() + rect.get_width() / 2
    space = 1
    label = format(y_value)
    ax.annotate(label, (x_value, y_value), xytext=(0, space), textcoords="offset points", ha='center', va='bottom')
plt.show()
Right now the code counts the total number of rows for each value (0: Male, 1: Female) and plots that. But I want to plot the number of male and female patients, not the number of 0s and 1s, because the same patient has multiple rows of data (as per P_ID). In other words: how many patients are male and how many are female?
Can someone help me out? I guess maybe sns.countplot can be used. But I don't know how.
Thanks for helping me out >.<
__________ Update ________________
How can I group the genders by sepsis (1) or no sepsis (0)?
__________ Update 2 ___________
So, I got the actual total count of Male and Female, thanks to @Shaido.
In the whole dataset, there are only 2932 septic patients. The rest are non-septic. This is what I got from @JohanC's answer.
Now, the problem is that although there are 2932 septic patients, the graph suggests that only 426 patients (251 Male and 175 Female) are septic and the rest are non-septic. But this is not true. Please help. Thanks.
I have a working example for selecting the unique IDs. It looks ugly, so there is probably a better way, but it works:
import pandas as pd
# example of data:
data = {'gender': [0, 0, 1, 1, 1, 1, 0, 0], 'id': [1, 1, 2, 2, 3, 3, 4, 4]}
df = pd.DataFrame(data)
# get all unique ids:
ids = set(df.id)
# Go over all id, get first element of gender:
g = [list(df[df['id'] == i]['gender'])[0] for i in ids]
# count genders, the lazy way using pandas, since the rest of the code also assumes a pandas object for plotting:
gender_counts = pd.Series(g).value_counts()
# from here you can use your plot function.
# Or Counter
from collections import Counter
gender_counts = Counter(g)
# You have to create another method for plotting the gender.
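A shorter way to get the same per-ID counts, offered as a sketch rather than as the answerer's code, is to drop the repeated rows before counting. It assumes, like the loop above, that each id has a single gender.

```python
import pandas as pd

data = {'gender': [0, 0, 1, 1, 1, 1, 0, 0], 'id': [1, 1, 2, 2, 3, 3, 4, 4]}
df = pd.DataFrame(data)

# Keep the first row of every id, then count genders once per id
gender_counts = df.drop_duplicates(subset='id')['gender'].value_counts()
print(gender_counts)
```

The result is a Series, so it can be passed to the .plot(kind='bar') call from the question unchanged.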
You can group by 'P_ID' and take the first row for each of them (supposing a 'P_ID' has only one gender and only one sepsis). Then you can call sns.countplot on that dataframe, using gender for x and sepsis for hue (or vice versa). You can rename the values in the columns to show their names in the legend and in the tick labels.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from io import StringIO
data_str = '''
HR|SBP|DBP|Sepsis|Gender|P_ID
92|120|80|0|0|0
98|115|85|0|0|0
93|125|75|1|1|1
95|130|90|1|1|1
102|120|80|0|0|2
109|115|75|0|0|2
94|135|100|0|0|2
97|100|70|1|1|3
85|120|80|1|1|3
88|115|75|1|1|3
93|125|85|1|1|3
78|130|90|1|0|4
115|140|110|1|0|4
102|120|80|0|1|5
98|140|110|0|1|5
'''
df = pd.read_csv(StringIO(data_str), delimiter='|')
# new df: take Sepsis and Gender from the first row for every P_ID
df_per_PID = df.groupby('P_ID')[['Sepsis', 'Gender']].first()
# give names to the values in the columns
df_per_PID = df_per_PID.replace({'Gender': {0: 'Male', 1: 'Female'}, 'Sepsis': {0: 'No sepsis', 1: 'Sepsis'}})
# show counts per Gender and Sepsis
ax = sns.countplot(data=df_per_PID, x='Gender', hue='Sepsis', palette='rocket')
ax.legend(title='') # remove title, as it is clear from the legend items
ax.set_xlabel('')
for bars in ax.containers:
    ax.bar_label(bars)
# ax.margins(y=0.1) # make some extra space for the labels
ax.locator_params(axis='y', integer=True)
sns.despine()
plt.show()
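One possible cause of the mismatch described in Update 2 is that the Sepsis flag changes over time within a patient, so the first row of a septic patient can still be 0. That is an assumption, since the full dataset is not shown. Under that assumption, a sketch with hypothetical records shows the difference between taking the first row and taking the maximum per P_ID:

```python
import pandas as pd

# Hypothetical hourly records: patient 1 becomes septic only in the last hour
df = pd.DataFrame({
    'P_ID':   [0, 0, 1, 1, 1, 2, 2],
    'Gender': [0, 0, 1, 1, 1, 0, 0],
    'Sepsis': [0, 0, 0, 0, 1, 1, 1],
})

# First row per patient: misses patient 1, whose flag starts at 0
per_pid_first = df.groupby('P_ID')[['Sepsis', 'Gender']].first()

# A patient counts as septic if any of their rows is flagged
per_pid_any = df.groupby('P_ID').agg(Sepsis=('Sepsis', 'max'),
                                     Gender=('Gender', 'first'))

print(per_pid_first['Sepsis'].sum())  # septic patients judged by the first row
print(per_pid_any['Sepsis'].sum())    # septic patients judged by any row
```

If the per-patient flag is constant in the real data, both versions give the same counts and the cause lies elsewhere.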

How to draw cumulative density plot from pandas?

I have a dataframe:
count_single count_multi column_names
0 11345 7209 e
1 11125 6607 w
2 10421 5105 j
3 9840 4478 r
4 9561 5492 f
5 8317 3937 i
6 7808 3795 l
7 7240 4219 u
8 6915 3854 s
9 6639 2750 n
10 6340 2465 b
11 5627 2834 y
12 4783 2384 c
13 4401 1698 p
14 3305 1753 g
15 3283 1300 o
16 2767 1697 t
17 2453 1276 h
18 2125 1140 a
19 2090 929 q
20 1330 518 d
I want to visualize count_single and count_multi, with column_names as the common axis for both of them. I am looking for something like this:
What I've tried:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('paper')
f, ax = plt.subplots(figsize = (6,15))
sns.set_color_codes('pastel')
sns.barplot(x = 'count_single', y = 'column_names', data = df,
label = 'Type_1', color = 'orange', edgecolor = 'w')
sns.set_color_codes('muted')
sns.barplot(x = 'count_multi', y = 'column_names', data = df,
label = 'Type_2', color = 'green', edgecolor = 'w')
ax.legend(ncol = 2, loc = 'lower right')
sns.despine(left = True, bottom = True)
plt.show()
It gives me a plot like this:
How can I visualize these two columns as in the expected image?
I really appreciate any help you can provide.
# instantiate figure with two rows and one column
fig, axes = plt.subplots(nrows=2, figsize=(10,5))
# plot barplot in the first row
df.set_index('column_names').plot.bar(ax=axes[0], color=['rosybrown', 'tomato'])
# scale each column by dividing by its sum, then take the cumulative sum to get the cumulative distribution; plot on the second ax
df.set_index('column_names').apply(lambda x: x/x.sum()).cumsum().plot(ax=axes[1], color=['rosybrown', 'tomato'])
# change ticks in first plot:
axes[0].set_yticks(np.linspace(0, 12000, 7)) # this means: make 7 ticks between 0 and 12000
# adjust the axislabels for the second plot
axes[1].set_xticks(range(len(df)))
axes[1].set_xticklabels(df['column_names'], rotation=90)
plt.tight_layout()
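The scaling step in the second plot can be checked on its own. The sketch below uses only the first three rows of the question's data frame, so the shares are relative to those three rows and not to the full table.

```python
import pandas as pd

# First three rows of the data frame from the question
df = pd.DataFrame({
    'count_single': [11345, 11125, 10421],
    'count_multi': [7209, 6607, 5105],
    'column_names': ['e', 'w', 'j'],
})

counts = df.set_index('column_names')
shares = counts / counts.sum()   # each column now sums to 1
cdf = shares.cumsum()            # running total of the shares
print(cdf.round(3))
```

Each column of cdf rises monotonically and ends at 1, which is what makes the second panel read as a cumulative curve.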

How to rotate seaborn barplot x-axis tick labels

I'm trying to rotate a barplot's x-axis labels by 45° to make them readable (as is, there's overlap).
len(genero) is 7, and len(filmes_por_genero) is 20
I'm using a MovieLens dataset and making a graph counting the number of movies in each individual genre. Here's my code as of now:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("whitegrid")
filmes_por_genero = filmes["generos"].str.get_dummies('|').sum().sort_values(ascending=False)
genero = filmes_com_media.index
chart = plt.figure(figsize=(16,8))
sns.barplot(x=genero,
y=filmes_por_genero.values,
palette=sns.color_palette("BuGn_r", n_colors=len(filmes_por_genero) + 4)
)
chart.set_xticklabels(
chart.get_xticklabels(),
rotation=45,
horizontalalignment='right'
)
Here's the full error:
/usr/local/lib/python3.6/dist-packages/pandas/core/groupby/grouper.py in get_grouper(obj, key, axis, level, sort, observed, mutated, validate)
623 in_axis=in_axis,
624 )
--> 625 if not isinstance(gpr, Grouping)
626 else gpr
627 )
/usr/local/lib/python3.6/dist-packages/pandas/core/groupby/grouper.py in __init__(self, index, grouper, obj, name, level, sort, observed, in_axis)
254 self.name = name
255 self.level = level
--> 256 self.grouper = _convert_grouper(index, grouper)
257 self.all_grouper = None
258 self.index = index
/usr/local/lib/python3.6/dist-packages/pandas/core/groupby/grouper.py in _convert_grouper(axis, grouper)
653 elif isinstance(grouper, (list, Series, Index, np.ndarray)):
654 if len(grouper) != len(axis):
--> 655 raise ValueError("Grouper and axis must be same length")
656 return grouper
657 else:
ValueError: Grouper and axis must be same length
Data from MovieLens 25M Dataset at MovieLens
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
# data
df = pd.read_csv('ml-25m/movies.csv')
print(df.head())
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
# clean genres
df['genres'] = df['genres'].str.split('|')
df = df.explode('genres', ignore_index=True)
print(df.head())
movieId title genres
0 1 Toy Story (1995) Adventure
1 1 Toy Story (1995) Animation
2 1 Toy Story (1995) Children
3 1 Toy Story (1995) Comedy
4 1 Toy Story (1995) Fantasy
Genres Counts
gc = df.genres.value_counts().to_frame()
print(gc)
genres
Drama 25606
Comedy 16870
Thriller 8654
Romance 7719
Action 7348
Horror 5989
Documentary 5605
Crime 5319
(no genres listed) 5062
Adventure 4145
Sci-Fi 3595
Children 2935
Animation 2929
Mystery 2925
Fantasy 2731
War 1874
Western 1399
Musical 1054
Film-Noir 353
IMAX 195
sns.barplot
fig, ax = plt.subplots(figsize=(12, 6))
sns.barplot(x=gc.index, y=gc.genres, palette=sns.color_palette("BuGn_r", n_colors=len(gc) + 4), ax=ax)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.show()
plt.figure(figsize=(12, 6))
chart = sns.barplot(x=gc.index, y=gc.genres, palette=sns.color_palette("BuGn_r", n_colors=len(gc)))
chart.set_xticklabels(chart.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.show()
sns.countplot
Use sns.countplot to skip using .value_counts() if the plot order doesn't matter.
To order the countplot, order=df.genres.value_counts().index must be used, so countplot doesn't really save you from needing .value_counts(), if a descending order is desired.
fig, ax = plt.subplots(figsize=(12, 6))
sns.countplot(data=df, x='genres', ax=ax)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.show()
Shorter code for label rotation:
plt.xticks(rotation=45, ha='right')
Rotates the labels by 45 degrees
Aligns the labels horizontally to the right for better readability
Full Example
sns.countplot with sorted x-axis
import seaborn as sns
import matplotlib.pyplot as plt
df = sns.load_dataset('planets')
sns.countplot(data=df,
x='method',
order=df['method'].value_counts().index)
plt.xticks(rotation=45, ha='right');
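The same rotation can be applied through the Axes object with plt.setp, which changes the existing tick labels in place instead of setting new ones. This is a minimal Matplotlib-only sketch; the four values are taken from the genre counts printed earlier.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Top four genre counts from the value_counts() output above
genres = ["Drama", "Comedy", "Thriller", "Romance"]
counts = [25606, 16870, 8654, 7719]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(genres, counts)

# Rotate the existing tick labels in place; no new labels are set
plt.setp(ax.get_xticklabels(), rotation=45, ha="right")
fig.tight_layout()
```

Note that the object being modified is the Axes. In the question, chart was the Figure returned by plt.figure(), which has no set_xticklabels method.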

How to groupby column, and then create a scatterplot of counts

I have a dataframe similar to the one below:
id date available
0 1944 2019-07-11 f
1 1944 2019-07-11 t
2 159454 2019-07-12 f
3 159454 2019-07-13 f
4 159454 2019-07-14 f
I would like to form a scatter plot where each id has a corresponding point; the x value is the number of t occurrences, and the y value is the number of f occurrences in the available column.
I have tried:
grouped = df.groupby(['listing_id'])['available'].value_counts().to_frame()
grouped.head()
This gives me something like
available
listing_id available
1944 t 364
f 1
2015 f 184
t 181
3176 t 279
f 10
But I'm not sure how to go on from here. How can I get my desired plot? Is there a better way to proceed?
Assuming you won't have to use the date column:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Generate example data
N = 100
np.random.seed(1)
df = pd.DataFrame({'id': np.random.choice(list(range(1, 6)), size=N),
'available': np.random.choice(['t', 'f'], size=N)})
df = df.sort_values('id').reset_index(drop=True)
# For each id: get t and f counts, unstack into columns, ensure
# column order is ['t', 'f']
counts = df.groupby(['id', 'available']).size().unstack()[['t', 'f']]
# Plot
fig, ax = plt.subplots()
counts.plot(x='t', y='f', kind='scatter', ax=ax)
# Optional: label each data point with its id.
# This is rough and might not look good beyond a few data points
for label, (t, f) in counts.iterrows():
    ax.text(t + .05, f + .05, label)
You can group by both listing_id and available, do a count and then unstack and then plot using seaborn.
Below I used some random numbers, the image is only for illustration.
import seaborn as sns
data = df.groupby(['listing_id', 'available'])['date'].count().unstack()
sns.scatterplot(x=data.t, y=data.f, hue=data.index, legend='full')
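The group, count and unstack step can also be written with pd.crosstab, which fills missing id/value combinations with 0 instead of NaN. This is a sketch of an alternative, using the five sample rows from the question:

```python
import pandas as pd

# The five sample rows from the question
df = pd.DataFrame({
    'id': [1944, 1944, 159454, 159454, 159454],
    'date': ['2019-07-11', '2019-07-11', '2019-07-12',
             '2019-07-13', '2019-07-14'],
    'available': ['f', 't', 'f', 'f', 'f'],
})

# Rows are ids, columns are 'f' and 't'; missing combinations become 0
counts = pd.crosstab(df['id'], df['available'])
print(counts)
```

The resulting frame has one row per id, so counts['t'] and counts['f'] can be passed directly as the x and y values of the scatter plot.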
Using your data:
reset the index
df.reset_index(inplace=True)
id available count
1944 t 364
1944 f 1
2015 f 184
2015 t 181
3176 t 279
3176 f 10
create a t & f dataframe:
t = df[df.available == 't'].reset_index(drop=True)
id available count
0 1944 t 364
1 2015 t 181
2 3176 t 279
f = df[df.available == 'f'].reset_index(drop=True)
id available count
0 1944 f 1
1 2015 f 184
2 3176 f 10
Plot the data:
plt.scatter(x=t['count'], y=f['count'])
plt.xlabel('t')
plt.ylabel('f')
for i, txt in enumerate(f['id'].tolist()):
    plt.annotate(txt, (t['count'].loc[i] + 3, f['count'].loc[i]))
