How to groupby column, and then create a scatterplot of counts - python

I have a dataframe similar to the one below:
id date available
0 1944 2019-07-11 f
1 1944 2019-07-11 t
2 159454 2019-07-12 f
3 159454 2019-07-13 f
4 159454 2019-07-14 f
I would like to form a scatter plot where each id has a corresponding point; the x value is the number of t occurrences, and the y value is the number of f occurrences in the available column.
I have tried:
grouped = df.groupby(['listing_id'])['available'].value_counts().to_frame()
grouped.head()
This gives me something like
available
listing_id available
1944 t 364
f 1
2015 f 184
t 181
3176 t 279
f 10
But I'm not sure how to proceed from here. How can I get my desired plot? Is there a better way to do this?

Assuming you won't have to use the date column:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Generate example data
N = 100
np.random.seed(1)
df = pd.DataFrame({'id': np.random.choice(list(range(1, 6)), size=N),
                   'available': np.random.choice(['t', 'f'], size=N)})
df = df.sort_values('id').reset_index(drop=True)
# For each id: get t and f counts, unstack into columns (0 where an id
# has no t or no f rows), ensure column order is ['t', 'f']
counts = df.groupby(['id', 'available']).size().unstack(fill_value=0)[['t', 'f']]
# Plot
fig, ax = plt.subplots()
counts.plot(x='t', y='f', kind='scatter', ax=ax)
# Optional: label each data point with its id.
# This is rough and might not look good beyond a few data points
for label, (t, f) in counts.iterrows():
    ax.text(t + .05, f + .05, str(label))

You can group by both listing_id and available, count, unstack, and then plot using seaborn.
Below I used some random numbers; the image is only for illustration.
import seaborn as sns
data = df.groupby(['listing_id', 'available'])['date'].count().unstack()
sns.scatterplot(x=data.t, y=data.f, hue=data.index, legend='full')
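As a rough sketch of what the group/count/unstack step produces, the same per-listing table can also be built with pd.crosstab. The toy values below are made up for illustration; only the column names listing_id and available come from the question.

```python
import pandas as pd

# Toy data using the question's column names (values are invented)
df = pd.DataFrame({
    "listing_id": [1944, 1944, 1944, 2015, 2015, 3176],
    "available": ["t", "f", "t", "f", "f", "t"],
})

# One row per listing, one column per flag; combinations that never
# occur are filled with 0 rather than NaN
counts = pd.crosstab(df["listing_id"], df["available"])
print(counts)
```

The resulting t and f columns can then be passed as x and y to any scatter function.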

Using your data, reset the index:
df.reset_index(inplace=True)
id available count
1944 t 364
1944 f 1
2015 f 184
2015 t 181
3176 t 279
3176 f 10
Create separate t and f dataframes:
t = df[df.available == 't'].reset_index(drop=True)
id available count
0 1944 t 364
1 2015 t 181
2 3176 t 279
f = df[df.available == 'f'].reset_index(drop=True)
id available count
0 1944 f 1
1 2015 f 184
2 3176 f 10
Plot the data:
plt.scatter(x=t['count'], y=f['count'])
plt.xlabel('t')
plt.ylabel('f')
for i, txt in enumerate(f['id'].tolist()):
    plt.annotate(txt, (t['count'].loc[i] + 3, f['count'].loc[i]))
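The approach above pairs the t and f rows by position, which assumes every id appears in both frames and in the same order. A minimal sketch that aligns them by id instead, assuming the long table has the columns id, available and count shown above:

```python
import pandas as pd

# Long-format counts, copied from the table in this answer
long_counts = pd.DataFrame({
    "id": [1944, 1944, 2015, 2015, 3176, 3176],
    "available": ["t", "f", "f", "t", "t", "f"],
    "count": [364, 1, 184, 181, 279, 10],
})

# One row per id with a 't' and an 'f' column; an id missing one of
# the two flags gets 0 instead of NaN
wide = long_counts.pivot(index="id", columns="available", values="count").fillna(0)
print(wide)
```

wide['t'] and wide['f'] are then guaranteed to refer to the same id on each row.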

Related

Better way to plot Gender count using Python

I am making a graph to plot the gender count for time series data that looks like the following. Each row represents hourly data for the respective patient.
HR SBP DBP Sepsis Gender P_ID
92 120 80 0 0 0
98 115 85 0 0 0
93 125 75 1 1 1
95 130 90 1 1 1
102 120 80 0 0 2
109 115 75 0 0 2
94 135 100 0 0 2
97 100 70 1 1 3
85 120 80 1 1 3
88 115 75 1 1 3
93 125 85 1 1 3
78 130 90 1 0 4
115 140 110 1 0 4
102 120 80 0 1 5
98 140 110 0 1 5
This is my code:
gender = df_n['Gender'].value_counts()
plt.figure(figsize=(7, 6))
ax = gender.plot(kind='bar', rot=0, color="c")
ax.set_title("Bar Graph of Gender", y = 1)
ax.set_xlabel('Gender')
ax.set_ylabel('Number of People')
ax.set_xticklabels(('Male', 'Female'))
for rect in ax.patches:
    y_value = rect.get_height()
    x_value = rect.get_x() + rect.get_width() / 2
    space = 1
    label = format(y_value)
    ax.annotate(label, (x_value, y_value), xytext=(0, space), textcoords="offset points", ha='center', va='bottom')
plt.show()
The code currently counts the total number of rows per gender (0: Male, 1: Female) and plots that. But I want to plot the number of male and female patients, not the number of 0s and 1s, since the same patient has multiple rows of data (per P_ID). In other words: how many patients are male and how many are female?
Can someone help me out? I guess sns.countplot could be used, but I don't know how.
Thanks for helping me out >.<
__________ Update ________________
How can I group the genders by sepsis (1) or no sepsis (0)?
__________ Update 2 ___________
I got the actual total count of males and females, thanks to @Shaido.
In the whole dataset there are only 2932 septic patients; the rest are non-septic. This is what I got from @JohanC's answer.
The problem is that, although there are 2932 septic patients, the graph suggests that only 426 patients (251 male and 175 female) are septic and the rest are non-septic. This is not true. Please help. Thanks.
I have a working example for selecting the unique IDs. It looks ugly, so there is probably a better way, but it works:
import pandas as pd
# example of data:
data = {'gender': [0, 0, 1, 1, 1, 1, 0, 0], 'id': [1, 1, 2, 2, 3, 3, 4, 4]}
df = pd.DataFrame(data)
# get all unique ids:
ids = set(df.id)
# Go over all id, get first element of gender:
g = [list(df[df['id'] == i]['gender'])[0] for i in ids]
# count genders, lazy way using pandas since the rest of the code also assumes a dataframe for plotting:
gender_counts = pd.DataFrame(g).value_counts()
# from here you can use your plot function.
# Or Counter
from collections import Counter
gender_counts = Counter(g)
# You have to create another method for plotting the gender.
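A more compact sketch of the same idea, assuming (as the answer above does) that each id has a single gender: keep one row per id with drop_duplicates and count the genders on that reduced frame.

```python
import pandas as pd

# Same example data as above
df = pd.DataFrame({"gender": [0, 0, 1, 1, 1, 1, 0, 0],
                   "id": [1, 1, 2, 2, 3, 3, 4, 4]})

# Keep the first row of every id, then count genders per patient
per_patient = df.drop_duplicates(subset="id")
gender_counts = per_patient["gender"].value_counts().sort_index()
print(gender_counts)
```

gender_counts is a Series indexed by gender, so it can be passed to the bar-plot code from the question in place of the row-level counts.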
You can group by 'P_ID' and take the first row for each of them (supposing a 'P_ID' has only one gender and only one sepsis). Then you can call sns.countplot on that dataframe, using gender for x and sepsis for hue (or vice versa). You can rename the values in the columns to show their names in the legend and in the tick labels.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from io import StringIO
data_str = '''
HR|SBP|DBP|Sepsis|Gender|P_ID
92|120|80|0|0|0
98|115|85|0|0|0
93|125|75|1|1|1
95|130|90|1|1|1
102|120|80|0|0|2
109|115|75|0|0|2
94|135|100|0|0|2
97|100|70|1|1|3
85|120|80|1|1|3
88|115|75|1|1|3
93|125|85|1|1|3
78|130|90|1|0|4
115|140|110|1|0|4
102|120|80|0|1|5
98|140|110|0|1|5
'''
df = pd.read_csv(StringIO(data_str), delimiter='|')
# new df: take Sepsis and Gender from the first row for every P_ID
df_per_PID = df.groupby('P_ID')[['Sepsis', 'Gender']].first()
# give names to the values in the columns
df_per_PID = df_per_PID.replace({'Gender': {0: 'Male', 1: 'Female'}, 'Sepsis': {0: 'No sepsis', 1: 'Sepsis'}})
# show counts per Gender and Sepsis
ax = sns.countplot(data=df_per_PID, x='Gender', hue='Sepsis', palette='rocket')
ax.legend(title='') # remove title, as it is clear from the legend items
ax.set_xlabel('')
for bars in ax.containers:
    ax.bar_label(bars)
# ax.margins(y=0.1) # make some extra space for the labels
ax.locator_params(axis='y', integer=True)
sns.despine()
plt.show()
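One possible explanation for the undercount described in "Update 2", offered only as an assumption since the full dataset is not shown: if a patient's Sepsis value changes from 0 to 1 on a later row, first() records that patient as non-septic. The sketch below uses invented data to contrast first() with max(), which flags a patient as septic if any of their rows is 1.

```python
import pandas as pd

# Hypothetical data: patient 0 only becomes septic on the third row
df = pd.DataFrame({
    "P_ID":   [0, 0, 0, 1, 1, 2, 2],
    "Gender": [0, 0, 0, 1, 1, 0, 0],
    "Sepsis": [0, 0, 1, 1, 1, 0, 0],
})

# Sepsis status taken from the first row of each patient
first_status = df.groupby("P_ID")["Sepsis"].first()
# Sepsis status if the patient was ever septic
ever_status = df.groupby("P_ID")["Sepsis"].max()

print(first_status.sum(), ever_status.sum())
```

If this matches the real data, replacing first() with max() for the Sepsis column (while keeping first() for Gender) would count every patient who was septic at any time.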

Seaborn Line Plot for plotting multiple parameters

I have dataset as below,
index 10_YR_CAGR 5_YR_CAGR 1_YR_CAGR
c1_rev 20.5 21.5 31.5
c2_rev 20.5 22.5 24
c3_rev 21 24 27
c4_rev 20 26 30
c5_rev 24 19 15
c1_eps 21 22 23
c2_eps 21 24 25
This data has 5 companies and their parameters such as rev, eps, profit, etc. I need to plot them as below:
rev:
x_axis-> index_col c1_rev, ...c5_rev
y_axis -> 10_YR_CAGR .. 1_YR_CAGR
eps:
x_axis -> index_col: c1_eps,...c5_eps
y_axis -> 10_YR_CAGR,... 1_YR_CAGR
etc...
I have tried with following code:
eps = analysis_df[analysis_df.index.str.contains('eps',regex=True)]
for i1 in eps.columns[eps.columns!='index']:
    sns.lineplot(x="index",y=i1,data=eps,label=i1)
I have to build a separate dataframe from the source and then loop over it. How can I loop over the main source dataframe directly?
Instead of writing one loop per parameter, how can I plot rev, eps, profit, etc. as FacetGrid facets from the main dataframe? How do I apply those filters in a FacetGrid?
How can I produce the same kind of plot for each parameter in a single for loop?
The way facets are typically plotted is by "melting" your analysis_df into id/variable/value columns.
split() the index column into Company and Parameter, which we'll later use as id columns when melting:
analysis_df[['Company', 'Parameter']] = analysis_df['index'].str.split('_', expand=True)
# index 10_YR_CAGR 5_YR_CAGR 1_YR_CAGR Company Parameter
# 0 c1_rev 100 21 1 c1 rev
# 1 c2_rev 1 32 24 c2 rev
# ...
melt() the CAGR columns:
melted = analysis_df.melt(
    id_vars=['Company', 'Parameter'],
    value_vars=['10_YR_CAGR', '5_YR_CAGR', '1_YR_CAGR'],
    var_name='Period',
    value_name='CAGR',
)
# Company Parameter Period CAGR
# 0 c1 rev 10_YR_CAGR 100
# 1 c2 rev 10_YR_CAGR 1
# 2 c3 rev 10_YR_CAGR 14
# 3 c1 eps 10_YR_CAGR 1
# ...
# 25 c2 pft 1_YR_CAGR 14
# 26 c3 pft 1_YR_CAGR 17
relplot() CAGR vs Company (colored by Period) for each Parameter using the melted dataframe:
sns.relplot(
    data=melted,
    kind='line',
    col='Parameter',
    x='Company',
    y='CAGR',
    hue='Period',
    col_wrap=1,
    facet_kws={'sharex': False, 'sharey': False},
)
Sample data to reproduce this plot:
import io
import pandas as pd
csv = '''
index,10_YR_CAGR,5_YR_CAGR,1_YR_CAGR
c1_rev,100,21,1
c2_rev,1,32,24
c3_rev,14,23,7
c1_eps,1,20,50
c2_eps,21,20,25
c3_eps,31,20,37
c1_pft,20,1,10
c2_pft,25,20,14
c3_pft,11,55,17
'''
analysis_df = pd.read_csv(io.StringIO(csv))
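To see what each facet ends up plotting, and to address the "single for loop" part of the question, here is a small sketch that melts a shortened copy of the sample data and then loops over the parameters once. The tables dictionary is a hypothetical helper introduced only for this illustration.

```python
import io
import pandas as pd

# Shortened copy of the sample data above
csv = """index,10_YR_CAGR,5_YR_CAGR,1_YR_CAGR
c1_rev,100,21,1
c2_rev,1,32,24
c1_eps,1,20,50
c2_eps,21,20,25
"""
analysis_df = pd.read_csv(io.StringIO(csv))

# Split 'c1_rev' into Company='c1' and Parameter='rev'
analysis_df[["Company", "Parameter"]] = analysis_df["index"].str.split("_", expand=True)

# Long format: one row per (Company, Parameter, Period)
melted = analysis_df.melt(
    id_vars=["Company", "Parameter"],
    value_vars=["10_YR_CAGR", "5_YR_CAGR", "1_YR_CAGR"],
    var_name="Period",
    value_name="CAGR",
)

# One Company x Period table per parameter, built in a single loop
tables = {}
for parameter, group in melted.groupby("Parameter"):
    tables[parameter] = group.pivot(index="Company", columns="Period", values="CAGR")
    print(parameter)
    print(tables[parameter])
```

Each table could be drawn with its own plot call inside the loop if a FacetGrid is not wanted.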

How to draw cumulative density plot from pandas?

I have a dataframe:
count_single count_multi column_names
0 11345 7209 e
1 11125 6607 w
2 10421 5105 j
3 9840 4478 r
4 9561 5492 f
5 8317 3937 i
6 7808 3795 l
7 7240 4219 u
8 6915 3854 s
9 6639 2750 n
10 6340 2465 b
11 5627 2834 y
12 4783 2384 c
13 4401 1698 p
14 3305 1753 g
15 3283 1300 o
16 2767 1697 t
17 2453 1276 h
18 2125 1140 a
19 2090 929 q
20 1330 518 d
I want to visualize count_single and count_multi, with column_names as the common axis for both. I am looking for something like this:
What I've tried:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('paper')
f, ax = plt.subplots(figsize = (6,15))
sns.set_color_codes('pastel')
sns.barplot(x = 'count_single', y = 'column_names', data = df,
label = 'Type_1', color = 'orange', edgecolor = 'w')
sns.set_color_codes('muted')
sns.barplot(x = 'count_multi', y = 'column_names', data = df,
label = 'Type_2', color = 'green', edgecolor = 'w')
ax.legend(ncol = 2, loc = 'lower right')
sns.despine(left = True, bottom = True)
plt.show()
It gives me a plot like this:
How can I visualize these two columns as in the expected image?
I really appreciate any help you can provide.
# instantiate figure with two rows and one column
fig, axes = plt.subplots(nrows=2, figsize=(10,5))
# plot barplot in the first row
df.set_index('column_names').plot.bar(ax=axes[0], color=['rosybrown', 'tomato'])
# first scale each column by dividing by its sum, then take the cumulative sum to get the cumulative distribution; plot on the second ax
df.set_index('column_names').apply(lambda x: x/x.sum()).cumsum().plot(ax=axes[1], color=['rosybrown', 'tomato'])
# change ticks in first plot:
axes[0].set_yticks(np.linspace(0, 12000, 7)) # this means: make 7 ticks between 0 and 12000
# adjust the axislabels for the second plot
axes[1].set_xticks(range(len(df)))
axes[1].set_xticklabels(df['column_names'], rotation=90)
plt.tight_layout()
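The normalisation step can be checked without plotting. The sketch below uses the first three rows of the question's dataframe and confirms that each cumulative column ends at 1.

```python
import pandas as pd

# First three rows of the question's data
df = pd.DataFrame({
    "count_single": [11345, 11125, 10421],
    "count_multi": [7209, 6607, 5105],
    "column_names": ["e", "w", "j"],
})

counts = df.set_index("column_names")
# Share of each column's total, accumulated row by row
cumulative_share = (counts / counts.sum()).cumsum()
print(cumulative_share)
```

Because the rows are accumulated in their existing order, the shape of the curve depends on how the dataframe is sorted.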

Selecting top % of rows in pandas

I have a sample dataframe as below (actual dataset is roughly 300k entries long):
user_id revenue
0 234 100
1 2873 200
2 827 489
3 12 237
4 8942 28934
... ... ...
96 498 892384
97 2345 92
98 239 2803
99 4985 98332
100 947 4588
which displays the revenue generated by users. I would like to select the rows where the top 20% of the revenue is generated (hence giving the top 20% revenue generating users).
The approach that comes to mind is calculating the total number of users, working out 20% of this, sorting the dataframe with sort_values(), and then using head() or nlargest(), but I'd like to know if there is a simpler, more elegant way.
Can anybody propose a way for this?
Thank you!
Suppose you have a dataframe df:
user_id revenue
234 21
2873 20
827 23
12 23
8942 28
498 22
2345 20
239 24
4985 21
947 25
I've flattened the revenue distribution to show the idea.
Now calculating step by step:
df = pd.read_clipboard()
df = df.sort_values(by = 'revenue', ascending = False)
df['revenue_cum'] = df['revenue'].cumsum()
df['%revenue_cum'] = df['revenue_cum']/df['revenue'].sum()
df
result:
user_id revenue revenue_cum %revenue_cum
4 8942 28 28 0.123348
9 947 25 53 0.233480
7 239 24 77 0.339207
2 827 23 100 0.440529
3 12 23 123 0.541850
5 498 22 145 0.638767
0 234 21 166 0.731278
8 4985 21 187 0.823789
1 2873 20 207 0.911894
6 2345 20 227 1.000000
Only 2 top users generate 23.3% of total revenue.
This looks like a case for df.quantile. According to the pandas documentation, to get the top 20% you pass the quantile value you want (0.8).
A case example from your dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame({'user_id':[234,2873,827,12,8942],
'revenue':[100,200,489,237,28934]})
df.quantile([0.8,1],interpolation='nearest')
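This prints the 0.8 and 1.0 quantiles of each column. Note that each column's quantile is computed independently, so a user_id and a revenue on the same output line need not belong to the same user (here user 2873 actually has revenue 200):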
user_id revenue
0.8 2873 489
1.0 8942 28934
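To select actual rows rather than per-column quantiles, one option is to compute the quantile of the revenue column only and use it as a filter. This is a minimal sketch on the same five-row example; note that it returns the top 20% of users ranked by revenue, which is not the same as the users who generate 20% of total revenue.

```python
import pandas as pd

df = pd.DataFrame({'user_id': [234, 2873, 827, 12, 8942],
                   'revenue': [100, 200, 489, 237, 28934]})

# 80th percentile of revenue (default linear interpolation)
threshold = df['revenue'].quantile(0.8)
# Rows at or above the threshold keep user_id and revenue paired
top_users = df[df['revenue'] >= threshold]
print(top_users)
```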
I usually find it useful to use sort_values to see the cumulative effect of every row and then keep rows up to some threshold:
# Sort values from highest to lowest:
df = df.sort_values(by='revenue', ascending=False)
# Add a column with aggregated effect of the row:
df['cumulative_percentage'] = 100*df.revenue.cumsum()/df.revenue.sum()
# Define the threshold I need to analyze and keep those rows:
min_threshold = 30
top_percent = df.loc[df['cumulative_percentage'] <= min_threshold]
The original df will be nicely sorted with a clear indication of the top contributing rows and the created 'top_percent' df will contain the rows that need to be analyzed in particular.
I am assuming you are looking for the cumulative top 20% revenue generating users. Here is a function that will help you get the expected output and even more. Just specify your dataframe, column name of the revenue and the n_percent you are looking for:
import pandas as pd
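
def n_percent_revenue_generating_users(df, col, n_percent):
    # Sort a copy from highest to lowest so the caller's dataframe is not modified
    df = df.sort_values(by=[col], ascending=False)
    # Cumulative sum and cumulative percentage of the total
    df[f'{col}_cs'] = df[col].cumsum()
    df[f'{col}_csp'] = 100*df[f'{col}_cs']/df[col].sum()
    # Rows whose cumulative percentage exceeds n_percent
    df_ = df[df[f'{col}_csp'] > n_percent]
    # Revenue of the row whose cumulative percentage is nearest to n_percent
    index_nearest = (df_[f'{col}_csp']-n_percent).abs().idxmin()
    threshold_revenue = df_.loc[index_nearest, col]
    # Keep every row at or above that revenue and drop the helper columns
    output = df[df[col] >= threshold_revenue].drop(columns=[f'{col}_cs', f'{col}_csp'])
    return output

n_percent_revenue_generating_users(df, 'revenue', 20)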

pandas: how to run a pivot with a multi-index?

I would like to run a pivot on a pandas DataFrame, with the index being two columns, not one. For example, one field for the year, one for the month, an 'item' field which shows 'item 1' and 'item 2' and a 'value' field with numerical values. I want the index to be year + month.
The only way I managed to get this to work was to combine the two fields into one, then separate them again. Is there a better way?
Minimal code copied below. Thanks a lot!
PS Yes, I am aware there are other questions with the keywords 'pivot' and 'multi-index', but I did not understand if/how they can help me with this question.
import pandas as pd
import numpy as np
df= pd.DataFrame()
month = np.arange(1, 13)
values1 = np.random.randint(0, 100, 12)
values2 = np.random.randint(200, 300, 12)
df['month'] = np.hstack((month, month))
df['year'] = 2004
df['value'] = np.hstack((values1, values2))
df['item'] = np.hstack((np.repeat('item 1', 12), np.repeat('item 2', 12)))
# This doesn't work:
# ValueError: Wrong number of items passed 24, placement implies 2
# mypiv = df.pivot(['year', 'month'], 'item', 'value')
# This doesn't work, either:
# df.set_index(['year', 'month'], inplace=True)
# ValueError: cannot label index with a null key
# mypiv = df.pivot(columns='item', values='value')
# This below works but is not ideal:
# I have to first concatenate then separate the fields I need
df['new field'] = df['year'] * 100 + df['month']
mypiv = df.pivot(index='new field', columns='item', values='value').reset_index()
mypiv['year'] = mypiv['new field'] // 100
mypiv['month'] = mypiv['new field'] % 100
You can group and then unstack.
>>> df.groupby(['year', 'month', 'item'])['value'].sum().unstack('item')
item item 1 item 2
year month
2004 1 33 250
2 44 224
3 41 268
4 29 232
5 57 252
6 61 255
7 28 254
8 15 229
9 29 258
10 49 207
11 36 254
12 23 209
Or use pivot_table:
>>> df.pivot_table(
...     values='value',
...     index=['year', 'month'],
...     columns='item',
...     aggfunc='sum')
item item 1 item 2
year month
2004 1 33 250
2 44 224
3 41 268
4 29 232
5 57 252
6 61 255
7 28 254
8 15 229
9 29 258
10 49 207
11 36 254
12 23 209
I believe if you include item in your MultiIndex, then you can just unstack:
df.set_index(['year', 'month', 'item']).unstack(level=-1)
This yields:
value
item item 1 item 2
year month
2004 1 21 277
2 43 244
3 12 262
4 80 201
5 22 287
6 52 284
7 90 249
8 14 229
9 52 205
10 76 207
11 88 259
12 90 200
It's a bit faster than using pivot_table, and about the same speed or slightly slower than using groupby.
The following worked for me (passing a list as index requires pandas 1.1 or later):
mypiv = df.pivot(index=['year', 'month'], columns='item', values='value')
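Passing a list of columns as index to DataFrame.pivot (supported since pandas 1.1) can be checked on a small deterministic version of the question's data. The values 0 to 23 are made up so the result is easy to verify; the column names follow the question.

```python
import numpy as np
import pandas as pd

month = np.arange(1, 13)
df = pd.DataFrame({
    "year": 2004,
    "month": np.hstack((month, month)),
    "item": np.repeat(["item 1", "item 2"], 12),
    "value": np.arange(24),  # invented values: 0..11 for item 1, 12..23 for item 2
})

# Two index columns become a (year, month) MultiIndex on the rows
mypiv = df.pivot(index=["year", "month"], columns="item", values="value")
print(mypiv.head())
```

Unlike pivot_table, pivot does not aggregate, so it raises an error if a (year, month, item) combination appears more than once.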
Thanks to gmoutso's comment, you can also use this helper:
def multiindex_pivot(df, index=None, columns=None, values=None):
    if index is None:
        names = list(df.index.names)
        df = df.reset_index()
    else:
        names = index
    list_index = df[names].values
    tuples_index = [tuple(i) for i in list_index]  # hashable
    df = df.assign(tuples_index=tuples_index)
    df = df.pivot(index="tuples_index", columns=columns, values=values)
    tuples_index = df.index  # reduced
    index = pd.MultiIndex.from_tuples(tuples_index, names=names)
    df.index = index
    return df
usage:
df.pipe(multiindex_pivot, index=['idx_column1', 'idx_column2'], columns='foo', values='bar')
If you want a simple flat column structure with columns of their intended type, add this:
(df
    .infer_objects()  # coerce to the intended column type
    .rename_axis(None, axis=1))  # flatten column headers
