Grouped bar plot with categorical column count - python

I have a dataframe with many columns, two of which are y and poutcome. Both are categorical. I want to make a grouped bar plot based on y, with the bars within each group showing poutcome. I created a groupby that results in this:
y    poutcome
no   failure      427
     other        159
     success       46
     unknown     3368
yes  failure       63
     other         38
     success       83
     unknown      337
Based on that grouped dataframe, I expected a graph where the legend and the colored bars are failure, success, other and unknown, grouped by yes and no (in the example graph, groups 4 and 5 would be yes and no). You get the gist.
The groupby is bank.groupby(['y','poutcome'])['poutcome'].count(), but instead of producing the graph above, mine comes out looking different.
How do I make it like the first graph? The bars should represent poutcome, grouped by y.

You should be able to just unstack(level=1) before plotting:
grouped.unstack(level=1).plot.bar()
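For reference, a minimal end-to-end sketch of this suggestion (the sample data here is made up, standing in for the question's bank dataframe):
import pandas as pd
import matplotlib.pyplot as plt

# made-up stand-in for the question's bank dataframe
bank = pd.DataFrame({
    'y': ['no', 'no', 'no', 'yes', 'yes', 'yes'],
    'poutcome': ['failure', 'success', 'failure', 'unknown', 'success', 'success'],
})

grouped = bank.groupby(['y', 'poutcome'])['poutcome'].count()
# unstack(level=1) moves poutcome from the index into the columns,
# so plot.bar() draws one colored bar per poutcome within each y group
grouped.unstack(level=1).plot.bar()
plt.show()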

This is the dataframe that I used initially:
     y poutcome  count
0   no  failure    427
1   no    other    159
2   no  success     46
3   no  unknown   3368
4  yes  failure     63
5  yes    other     38
6  yes  success     83
7  yes  unknown    337
You should be able to get this from your groupby by setting as_index=False
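If counting a column that is also a grouping key gives you trouble, an equivalent sketch using size() (which counts rows per group) produces the same table; bank is the question's dataframe:
df = bank.groupby(['y', 'poutcome']).size().reset_index(name='count')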
You can then use DataFrame.pivot to arrange the values for plotting:
df.pivot(index="poutcome", columns="y", values="count")
# y          no  yes
# poutcome
# failure   427   63
# other     159   38
# success    46   83
# unknown  3368  337
# and plot that:
df.pivot(index="poutcome", columns="y", values="count").plot.bar()
Alternatively, to have poutcome in the legend, swap the index and columns parameters:
df.pivot(index="y", columns="poutcome", values="count").plot.bar()

Related

Best way to plot horizontal data and see it clearly?

I have my dataframe object df which looks like this:
  product     7.month  8.month  9.month  10.month  11.month  12.month  1.month  2.month  3.month  4.month  5.month  6.month
0 phone            68      137      202       230       143       220      110      173      187      149      204       90
1 television  <same kind of numerical data>
...
I would like to plot this data, but I'm not sure how, because the months run horizontally (as columns) and there are around 20 products (rows) in my dataframe, so the plot needs to stay readable.
Transpose the dataframe
df1 = df.T
and now plot df1
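A minimal sketch of that, assuming df is shaped as in the question (setting product as the index first so the product names become column labels instead of a data row):
import matplotlib.pyplot as plt

df1 = df.set_index('product').T  # months become the index, products the columns
df1.plot(figsize=(12, 6))        # one line per product, months along the x-axis
plt.show()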
I agree and recommend Aavesh's approach. However, if it is absolutely necessary to access the data horizontally, then you can use list(df.iloc[index]) where index is the index of the row.
Then plot.
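For completeness, a sketch of that row-wise idea (the 0 is just an example row index; the [1:] slices skip the product name in the first column):
import matplotlib.pyplot as plt

months = list(df.columns[1:])     # '7.month', '8.month', ...
values = list(df.iloc[0])[1:]     # the numeric values of one product row
plt.plot(months, values)
plt.show()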

Transforming a multiple-observation feature into a single-observation feature in a Pandas DataFrame - python

I have a dataframe that contains mother ids and multiple observations for the column (preDiabetes) as such:
   ChildID  MotherID     preDiabetes
0       20       455              No
1       20       455  Not documented
2       13       102             NaN
3       13       102             Yes
4      702       946              No
5       82       571              No
6       82       571             Yes
7       82       571  Not documented
I want to transform the multiple observational feature (preDiabetes) into one with single observations for each MotherID.
To do this, I will create a new dataframe with a feature newPreDiabetes and:
- assign newPreDiabetes a value of "Yes" if preDiabetes == "Yes" for a particular MotherID, regardless of the remaining observations;
- otherwise, if preDiabetes != "Yes" for all observations of a particular MotherID, assign newPreDiabetes a value of "No".
Therefore, my new dataframe will have a single observation for the feature preDiabetes and unique MotherIDs, as such:
   ChildID  MotherID  newPreDiabetes
0       20       455              No
1       13       102             Yes
2      702       946              No
3       82       571             Yes
I am new to Python and Pandas, so I am not sure what the best way to achieve this is, but this is what I have tried so far:
# get list of all unique mother ids
uniqueMotherIds = pd.unique(df[['MotherID']].values.ravel())

# create new dataframe that will contain unique MotherIDs and single observations for newPreDiabetes
newDf = {'MotherID', 'newPreDiabetes'}

# iterate through list of all mother ids and look for preDiabetes=="Yes"
for id in uniqueMotherIds:
    filteredDf = df[df['MotherID'] == id].preDiabetes == "Yes"
    result = pd.concat([filteredDf, newDf])
The code is not yet complete and I would appreciate some help as I am not sure if I am on the right track!
Many thanks :)
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'MotherID': [455, 455, 102, 102, 946, 571, 571, 571],
    'preDiabetes': ['No', 'Not documented', np.nan,
                    'Yes', 'No', 'No', 'Yes', 'Not documented'],
    'ChildID': [20, 20, 13, 13, 702, 82, 82, 82]
})

# collect all preDiabetes observations per (MotherID, ChildID) pair into a list
result = df.groupby(['MotherID', 'ChildID'])['preDiabetes'].apply(list).reset_index()
# 'Yes' wins if it appears anywhere in the list, otherwise 'No'
result['newPreDiabetes'] = result['preDiabetes'].apply(
    lambda x: 'Yes' if 'Yes' in x else 'No')
result = result.drop(columns=['preDiabetes'])
Output:
   MotherID  ChildID newPreDiabetes
0       102       13            Yes
1       455       20             No
2       571       82            Yes
3       946      702             No
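A slightly more direct sketch of the same idea, skipping the intermediate list column (any() checks whether 'Yes' occurs anywhere in the group):
result = (df.groupby(['MotherID', 'ChildID'])['preDiabetes']
            .agg(lambda s: 'Yes' if s.eq('Yes').any() else 'No')
            .reset_index(name='newPreDiabetes'))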

Selecting top % of rows in pandas

I have a sample dataframe as below (actual dataset is roughly 300k entries long):
     user_id  revenue
0        234      100
1       2873      200
2        827      489
3         12      237
4       8942    28934
..       ...      ...
96       498   892384
97      2345       92
98       239     2803
99      4985    98332
100      947     4588
which displays the revenue generated by users. I would like to select the rows where the top 20% of the revenue is generated (hence giving the top 20% revenue generating users).
The method that comes closest to mind is calculating the total number of users, working out 20% of that, sorting the dataframe with sort_values(), and then using head() or nlargest(), but I'd like to know if there is a simpler, more elegant way.
Can anybody propose a way for this?
Thank you!
Suppose you have dataframe df:
user_id  revenue
    234       21
   2873       20
    827       23
     12       23
   8942       28
    498       22
   2345       20
    239       24
   4985       21
    947       25
I've flattened the revenue distribution to show the idea.
Now calculating step by step:
import pandas as pd

df = pd.read_clipboard()  # reads the table above from the clipboard
df = df.sort_values(by='revenue', ascending=False)
df['revenue_cum'] = df['revenue'].cumsum()
df['%revenue_cum'] = df['revenue_cum'] / df['revenue'].sum()
df
result:
   user_id  revenue  revenue_cum  %revenue_cum
4     8942       28           28      0.123348
9      947       25           53      0.233480
7      239       24           77      0.339207
2      827       23          100      0.440529
3       12       23          123      0.541850
5      498       22          145      0.638767
0      234       21          166      0.731278
8     4985       21          187      0.823789
1     2873       20          207      0.911894
6     2345       20          227      1.000000
The top 2 users alone generate 23.3% of the total revenue.
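To then actually select those rows, a small sketch (the boundary handling is a choice; <= keeps the users whose cumulative share stays within 20%):
top_users = df[df['%revenue_cum'] <= 0.20]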
This seems to be a case for df.quantile. Per the pandas documentation, if you are looking for the top 20%, all you need to do is pass the quantile value you desire.
A case example from your dataset:
import pandas as pd

df = pd.DataFrame({'user_id': [234, 2873, 827, 12, 8942],
                   'revenue': [100, 200, 489, 237, 28934]})
df.quantile([0.8, 1], interpolation='nearest')
This would print the top 2 rows by value:
     user_id  revenue
0.8     2873      489
1.0     8942    28934
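Note that quantile works on each column independently, so the user_id and revenue above do not necessarily come from the same row. A related sketch that pulls the actual rows at or above the 80th revenue percentile:
top = df[df['revenue'] >= df['revenue'].quantile(0.8)]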
I usually find it useful to use sort_values to see the cumulative effect of every row and then keep rows up to some threshold:
# Sort values from highest to lowest:
df = df.sort_values(by='revenue', ascending=False)
# Add a column with aggregated effect of the row:
df['cumulative_percentage'] = 100*df.revenue.cumsum()/df.revenue.sum()
# Define the threshold I need to analyze and keep those rows:
min_threshold = 30
top_percent = df.loc[df['cumulative_percentage'] <= min_threshold]
The original df will be nicely sorted with a clear indication of the top contributing rows and the created 'top_percent' df will contain the rows that need to be analyzed in particular.
I am assuming you are looking for the cumulative top 20% revenue generating users. Here is a function that will help you get the expected output and even more. Just specify your dataframe, column name of the revenue and the n_percent you are looking for:
import pandas as pd

def n_percent_revenue_generating_users(df, col, n_percent):
    # sort by revenue, highest first, and compute each row's cumulative share
    df.sort_values(by=[col], ascending=False, inplace=True)
    df[f'{col}_cs'] = df[col].cumsum()
    df[f'{col}_csp'] = 100 * df[f'{col}_cs'] / df[col].sum()
    # find the revenue value where the cumulative share first exceeds n_percent
    df_ = df[df[f'{col}_csp'] > n_percent]
    index_nearest = (df_[f'{col}_csp'] - n_percent).abs().idxmin()
    threshold_revenue = df_.loc[index_nearest, col]
    # keep every row at or above that revenue, dropping the helper columns
    output = df[df[col] >= threshold_revenue].drop(columns=[f'{col}_cs', f'{col}_csp'])
    return output

n_percent_revenue_generating_users(df, 'revenue', 20)

Proportion distribution of column values by date

I am trying to get the proportion of each category in the data set by day, to be able to plot it eventually.
Sample (daily_usage):
    type        date  count
0      A  2016-03-01     70
1      A  2016-03-02     64
2      A  2016-03-03     38
3      A  2016-03-04     82
4      A  2016-03-05     37
..   ...         ...    ...
412    G  2016-03-27    149
413    G  2016-03-28    382
414    G  2016-03-29    232
415    G  2016-03-30    312
416    G  2016-03-31    412
I plotted the mean and median by type just fine with the following code:
daily_usage.groupby('type')['count'].agg(['median','mean']).plot(kind='bar')
But I want a similar plot with the proportion of the daily counts instead. For the eventual plot, I don't need to show the date; it would just show the average/median daily proportion for each type.
The proportion I mean is, for example, for the first line: type A happened 70 times on March 1; considering all other events on March 1, there is a sum of 948 events, so the proportion of type A on March 1 is 70/948. This would be computed for all rows. The final plot would show each type on the x-axis and the average daily proportion on the y-axis.
I tried getting the proportion in two ways.
First one:
daily_usage['ratio'] = (daily_usage / daily_usage.groupby('date').transform(sum))['count']
The denominator in this first try gives me this sample output, so it looks like it should be very easy to divide the original count column by this new daily count column:
   ...  count
0  ...    948
1  ...    910
2  ...    588
3  ...    786
4  ...    530
5  ...   1043
Error:
TypeError: unsupported operand type(s) for /: 'str' and 'str'
Second one:
daily_usage.div(day_total,axis='count')
where day_total = daily_usage.groupby('date').agg({'count':'sum'}).reset_index()
Error:
TypeError: ufunc true_divide cannot use operands with types dtype('<M8[ns]') and dtype('<M8[ns]')
What's a better way to do this?
If you just want to have the new column in your dataframe, you can do the following:
df['ratio'] = (df.groupby(['type','date'])['count'].transform(sum) / df.groupby('date')['count'].transform(sum))
However, I've now spent nearly 20 minutes trying to figure out what exactly you're trying to plot, and I still didn't really get your intention, so please leave a detailed comment if you need help with the plotting, specifying what you want to plot and how (one plot for the daily usage of each day, or some other form).
PS: in my code, df refers to your daily_usage dataframe.
Hope this was helpful.
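In case the goal is the bar plot described in the question (average/median daily proportion per type), a minimal sketch building on the ratio idea above; since the sample has one row per (type, date) pair, the numerator is simply the count column:
daily_usage['ratio'] = (daily_usage['count']
                        / daily_usage.groupby('date')['count'].transform('sum'))
daily_usage.groupby('type')['ratio'].agg(['median', 'mean']).plot(kind='bar')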

Accessing columns with MultiIndex after using pandas groupby and aggregate

I am using the df.groupby() method:
g1 = df[['md', 'agd', 'hgd']].groupby(['md']).agg(['mean', 'count', 'std'])
It produces exactly what I want!
         agd                        hgd
        mean  count       std      mean  count       std
md
-4  1.398350      2  0.456494 -0.418442      2  0.774611
-3 -0.281814     10  1.314223 -0.317675     10  1.161368
-2 -0.341940     38  0.882749  0.136395     38  1.240308
-1 -0.137268    125  1.162081 -0.103710    125  1.208362
 0 -0.018731    603  1.108109 -0.059108    603  1.252989
 1 -0.034113    178  1.128363 -0.042781    178  1.197477
 2  0.118068     43  1.107974  0.383795     43  1.225388
 3  0.452802     18  0.805491 -0.335087     18  1.120520
 4  0.304824      1       NaN -1.052011      1       NaN
However, I now want to access the aggregated result's columns like a "normal" dataframe.
I will then be able to:
1) calculate the errors on the agd and hgd means
2) make scatter plots of md (x-axis) vs the agd mean (or hgd mean) with appropriate error bars added.
Is this possible? Perhaps by playing with the indexing?
1) You can rename the columns and proceed as normal (this gets rid of the multi-indexing). Note there are six aggregate columns, so six names are needed (a programmatic version appears in the sketch after option 2):
g1.columns = ['agd_mean', 'agd_count', 'agd_std', 'hgd_mean', 'hgd_count', 'hgd_std']
2) You can keep multi-indexing and use both levels in turn (docs)
g1['agd'][['mean', 'count']]
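For instance, a minimal sketch of goal 2) that combines both ideas: flatten the columns programmatically, use the standard error of the mean (std/sqrt(count)) as the error, and draw md vs the agd mean with error bars; g1 is assumed to be the aggregated frame from the question:
import numpy as np
import matplotlib.pyplot as plt

# flatten the MultiIndex columns: ('agd', 'mean') -> 'agd_mean'
g1.columns = ['_'.join(col) for col in g1.columns]

# standard error of the mean per md group
sem = g1['agd_std'] / np.sqrt(g1['agd_count'])

plt.errorbar(g1.index, g1['agd_mean'], yerr=sem, fmt='o')
plt.xlabel('md')
plt.ylabel('agd mean')
plt.show()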
What you are searching for is possible and is called transform. You will find an example that does exactly this in the pandas documentation, here.
