Understanding apply and groupby in Pandas

I'm trying to understand an example from Python for Data Analysis book by Wes McKinney. I've looked through the pandas cookbook, documentation and SO but can't find an example like this one.
The example looks at the 2012 Federal Election Commission Database (https://github.com/wesm/pydata-book/blob/master/ch09.ipynb). The code below determines the top donor occupations donating to Obama and Romney.
I'm struggling to understand how the function takes a groupby object and performs another groupby operation on it. When I run this outside of the function I get an error. Could somebody shed some light on this behaviour?
Thanks,
Iwan
# top donor occupations donating to Obama or Romney
def get_top_amounts(group, key, n=5):
    totals = group.groupby(key)['contb_receipt_amt'].sum()
    return totals.sort_values(ascending=False)[:n]

# first group by candidate
grouped = fec_mrbo.groupby('cand_nm')

# for each candidate group, group again by contbr_occupation so we get a hierarchical index,
# take the contribution amount for each occupation,
# then return the totals sorted to give the top n
grouped.apply(get_top_amounts, 'contbr_occupation', n=5)
The result looks like this
cand_nm contbr_occupation
Obama, Barack RETIRED 25270507.23
ATTORNEY 11126932.97
INFORMATION REQUESTED 4849801.96
HOMEMAKER 4243394.30
PHYSICIAN 3732387.44
LAWYER 3159391.87
CONSULTANT 2459812.71
Romney, Mitt RETIRED 11266949.23
INFORMATION REQUESTED PER BEST EFFORTS 11173374.84
HOMEMAKER 8037250.86
ATTORNEY 5302578.82
PRESIDENT 2403439.77
EXECUTIVE 2230653.79
C.E.O. 1893931.11

When you use apply on a grouped DataFrame, you are actually iterating over the groups and passing each group to the function you're applying.
Let's look at a simple example:
import pandas as pd
df = pd.DataFrame({'col1': [1, 1, 1, 1, 2, 2, 2, 2],
                   'col2': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8]})
grouped = df.groupby('col1')
Now let's create a simple function which allows us to see what's getting passed to the function:
def print_group(group):
    print(group)
    print('=' * 10)
grouped.apply(print_group)
col1 col2 value
0 1 a 1
1 1 b 2
2 1 a 3
3 1 b 4
==========
col1 col2 value
0 1 a 1
1 1 b 2
2 1 a 3
3 1 b 4
==========
col1 col2 value
4 2 a 5
5 2 b 6
6 2 a 7
7 2 b 8
==========
As you can see, each group is passed to the function as a separate DataFrame, and of course you can apply all the normal operations to this subset.
The fact that you see the first group twice is an implementation detail: pandas calls the function on the first group an extra time to decide how to combine the results. It cannot be changed, and it's not a bug ;).
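If your function has side effects and that extra call bothers you, you can iterate over the GroupBy object directly instead; each group is visited exactly once (a sketch using the grouped object from above):
# iterating over a GroupBy yields (key, sub-DataFrame) pairs
for name, group in grouped:
    print(name)
    print(group)
    print('=' * 10)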
Let's create another function, performing a second groupby inside the applied function, to demonstrate this:
def second_group_sum(group):
    res = group.groupby('col2').value.sum()
    print(res)
    print('=' * 10)
    return res
grouped.apply(second_group_sum)
col2
a 4
b 6
Name: value, dtype: int64
==========
col2
a 4
b 6
Name: value, dtype: int64
==========
col2
a 12
b 14
Name: value, dtype: int64
==========
You could even go further and chain group-apply-group-apply-group-apply as deep as you like.
I hope this helps you understand what's going on.
By the way, if you use ipdb (a debugging tool) you can set a breakpoint in the applied function and interact with the group DataFrame.
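For example, a minimal sketch of that workflow, assuming ipdb is installed and using the grouped object from above:
import ipdb

def debug_group(group):
    ipdb.set_trace()  # execution pauses here; inspect `group` interactively
    return group['value'].sum()

grouped.apply(debug_group)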

Related

Count number of matching values from pandas groupby

I have created a pandas dataframe for a store, with columns Transaction and Item_Type:
import pandas as pd
data = {'Transaction':[1, 2, 2, 2, 3], 'Item_Type':['Food', 'Drink', 'Food', 'Drink', 'Food']}
df = pd.DataFrame(data, columns=['Transaction', 'Item_Type'])
Transaction Item_Type
1 Food
2 Drink
2 Food
2 Drink
3 Food
I am trying to group by transaction and count the number of drinks per transaction, but cannot find the right syntax to do it.
df = df.groupby(['Transaction','Item_Type']).size()
This sort of works, but gives me a multi-index Series, and I cannot figure out how to select drinks per transaction from it:
1/Food 1
2/Drink 2
2/Food 1
3/Food 1
This seems clunky - is there a better way?
This Stack Overflow question seemed most similar: Adding a 'count' column to the result of a groupby in pandas?
Another way is possible with pivot_table:
s = df.pivot_table(index='Transaction',
                   columns='Item_Type', aggfunc=len).stack().astype(int)
Or:
s = df.pivot_table(index=['Transaction', 'Item_Type'], aggfunc=len)  # thanks @Ch3steR
s.index = s.index.map("{0[0]}/{0[1]}".format)
print(s)
1/Food 1
2/Drink 2
2/Food 1
3/Food 1
Or if you wish to filter a particular category:
to_filter = 'Drink'
(df.pivot_table(index='Transaction', columns='Item_Type', aggfunc=len, fill_value=0)
   .filter(items=[to_filter]))
Item_Type Drink
Transaction
1 0
2 2
3 0
Edit: replacing original xs approach with unstack after seeing anky's answer.
>>> df.groupby('Transaction')['Item_Type'].value_counts().unstack(fill_value=0)['Drink']
Transaction
1 0
2 2
3 0
Name: Drink, dtype: int64
For a particular category, you can check the condition first and then sum the resulting Boolean Series within each group:
df['Item_Type'].eq('Drink').groupby(df['Transaction']).sum()
#Transaction
#1 0.0
#2 2.0
#3 0.0
#Name: Item_Type, dtype: float64
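The float dtype comes from summing booleans through the groupby; casting to int first keeps the counts integral (a small sketch on the same data):
df['Item_Type'].eq('Drink').astype(int).groupby(df['Transaction']).sum()
#Transaction
#1    0
#2    2
#3    0
#Name: Item_Type, dtype: int64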
I think I found a solution, based on Get statistics for each group (such as count, mean, etc) using pandas GroupBy?
df = df.groupby(['Transaction','Item_Type']).size().reset_index(name='counts')
This gives me the information I need:
Transaction Item_Type counts
1 Food 1
2 Drink 2
2 Food 1
3 Food 1
You may use agg and value_counts:
s = df.astype(str).agg('/'.join, axis=1).value_counts(sort=False)
Out[61]:
3/Food 1
2/Drink 2
1/Food 1
2/Food 1
dtype: int64
If you want to keep the original order, chain an additional sort_index:
s = df.astype(str).agg('/'.join, axis=1).value_counts().sort_index(kind='mergesort')
Out[62]:
1/Food 1
2/Drink 2
2/Food 1
3/Food 1
dtype: int64

How to concatenate partially sequential occurring rows in data frame using pandas

I have a csv which is broken into multiple rows, like the following:
Names,text,conv_id
tim,hi,1234
jon,hello,1234
jon,how,1234
jon,are you,1234
tim,hey,1234
tim,i am good,1234
pam, me too,1234
jon,great,1234
jon,hows life,1234
So I want to concatenate the sequentially occurring rows into one row, as follows, to make it more meaningful.
Expected output:
Names,text,conv_id
tim,hi,1234
jon,hello how are you,1234
tim,hey i am good,1234
pam, me too,1234
jon,great hows life,1234
I tried a couple of things but failed; can anyone please guide me on how to do this?
Thanks in advance.
You can use Series.shift + Series.cumsum to create the appropriate groups, then join the text within each group using groupby.apply. 'conv_id' and 'Names' are added to the grouping keys so they can be recovered with Series.reset_index. Finally, DataFrame.reindex places the columns in the initial order:
groups = df['Names'].rename('groups').ne(df['Names'].shift()).cumsum()
new_df = (df.groupby([groups, 'conv_id', 'Names'])['text']
            .apply(lambda x: ','.join(x))
            .reset_index(level=['Names', 'conv_id'])
            .reindex(columns=df.columns))
print(new_df)
Names text conv_id
1 tim hi 1234
2 jon hello,how,are you 1234
3 tim hey,i am good 1234
4 pam me too 1234
5 jon great,hows life 1234
Detail:
print(groups)
0 1
1 2
2 2
3 2
4 3
5 3
6 4
7 5
8 5
dtype: int64
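Note that the expected output joins the fragments with spaces rather than commas; if you prefer that form, swap ','.join for ' '.join in the apply (a sketch, otherwise identical to the above):
new_df = (df.groupby([groups, 'conv_id', 'Names'])['text']
            .apply(' '.join)
            .reset_index(level=['Names', 'conv_id'])
            .reindex(columns=df.columns))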

Calculating mean value of item in several columns in pandas

I have a dataframe with values spread over several columns. I want to calculate the mean value of all items from specific columns.
All the solutions I looked up end up giving me either the separate means of each column or the mean of the means of the selected columns.
E.g. my Dataframe looks like this:
Name a b c d
Alice 1 2 3 4
Alice 2 4 2
Alice 3 2
Alice 1 5 2
Ben 3 3 1 3
Ben 4 1 2 3
Ben 1 2 2
And I want to see the mean of the values in columns b & c for each "Alice":
When I try:
df[df["Name"]=="Alice"][["b","c"]].mean()
The result is:
b 2.00
c 4.00
dtype: float64
In another post I found a suggestion to try a "double" mean, one for each axis, e.g.:
df[df["Name"]=="Alice"][["b","c"]].mean(axis=1).mean()
But the result was then:
3.00
which is the mean of the means of both columns.
I am expecting a way to calculate:
(2 + 3 + 4 + 5) / 4 = 3.50
Is there a way to do this in Python?
You can use numpy's np.nanmean here; it simply treats your selection of the dataframe as one array and calculates the mean over the entire selection, ignoring NaNs:
>>> np.nanmean(df.loc[df['Name'] == 'Alice', ['b', 'c']])
3.5
Or if you want to group by name, you can first stack the dataframe, like:
>>> df[['Name','b','c']].set_index('Name').stack().reset_index().groupby('Name').agg('mean')
0
Name
Alice 3.500000
Ben 1.833333
You can also groupby to sum all the values and get their respective counts, then divide to get the mean. This way you get the result for all Names at once.
g = df.groupby('Name')[['b', 'c']]
g.sum().sum(1)/g.count().sum(1)
Name
Alice 3.500000
Ben 1.833333
dtype: float64
PS: In your example it looks like you have empty strings in some cells. That's not advisable, since your columns will end up with dtype object. Use NaNs instead, to take full advantage of vectorized operations.
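A minimal sketch of that cleanup, assuming the blanks really are empty strings:
import pandas as pd

# coerce the value columns to numeric; anything non-numeric (like '') becomes NaN
df[['a', 'b', 'c', 'd']] = df[['a', 'b', 'c', 'd']].apply(pd.to_numeric, errors='coerce')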
Assuming all your columns are numeric and the empty cells are NaN, a simple set_index, stack and direct mean works:
df.set_index('Name')[['b','c']].stack().mean(level=0)
Out[117]:
Name
Alice 3.500000
Ben 1.833333
dtype: float64
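Note that mean(level=0) is deprecated in newer pandas (and removed in 2.0); the equivalent spelling there is a groupby on the index level:
df.set_index('Name')[['b', 'c']].stack().groupby(level=0).mean()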

Pandas substring search for filter

I have a use case where I need to validate each row in the df and mark if it is correct or not. Validation rules are in another df.
Main
col1 col2
0 1 take me home
1 2 country roads
2 2 country roads take
3 4 me home
Rules
col3 col4
0 1 take
1 2 home
2 3 country
3 4 om
4 2 take
A row in Main is marked as Pass if the following condition matches for any row in Rules:
col1 == col3 and col4 is a substring of col2
Main
col1 col2 result
0 1 take me home Pass
1 2 country roads Fail
2 2 country roads take Pass
3 4 me home Pass
My initial approach was to parse the Rules df, build a function out of it dynamically, and then run:
def action_function(row) -> object:
    if self.combined_filter()(row):  # combined_filter() is the lambda equivalent of the Rules df
        return success_action(row)   # mark as pass
    return fail_action(row)          # mark as fail

Main["result"] = self.df.apply(action_function, axis=1)
This turned out to be very slow, since apply is not vectorized. The main df has about 3 million rows and the Rules df around 500 entries; the run takes around 3 hours.
I am trying to use pandas merge for this, but substring matching is not supported by the merge operation, and I cannot split the words by space or anything.
This will be used as part of a system, so I cannot hardcode anything; I need to read the dfs from Excel every time the system starts.
Can you please suggest an approach for this?
Merge and then apply the condition using np.where, i.e.:
import numpy as np

temp = main.merge(rules, left_on='col1', right_on='col3')
temp['results'] = temp.apply(lambda x: np.where(x['col4'] in x['col2'], 'Pass', 'Fail'), 1)
no_dupe_df = temp.drop_duplicates('col2', keep='last').drop(['col3', 'col4'], 1)
col1 col2 results
0 1 take me home Pass
2 2 country roads Fail
4 2 country roads take Pass
5 4 me home Pass
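On 3 million rows the rowwise apply is usually the bottleneck; a typically faster variant (a sketch, same merge and logic as above) swaps the apply for a plain list comprehension over the two columns:
temp = main.merge(rules, left_on='col1', right_on='col3')
temp['results'] = np.where(
    [sub in full for full, sub in zip(temp['col2'], temp['col4'])],  # per-row substring test
    'Pass', 'Fail')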

generating a column based on the values in another column in pandas (python)

The following is a subset of a data frame:
drug_id WD
lexapro.1 flu-like symptoms
lexapro.1 dizziness
lexapro.1 headache
lexapro.14 Dizziness
lexapro.14 headaches
lexapro.23 extremely difficult
lexapro.32 cry at anything
lexapro.32 Anxiety
I need to generate a column id based on the values in drug_id as follows:
id drug_id WD
1 lexapro.1 flu-like symptoms
1 lexapro.1 dizziness
1 lexapro.1 headache
2 lexapro.14 Dizziness
2 lexapro.14 headaches
3 lexapro.23 extremely difficult
4 lexapro.32 cry at anything
4 lexapro.32 Anxiety
I think I need to group them by drug_id and then generate the id from the group order, but I do not know how to do it.
The shift+cumsum pattern mentioned by Boud is good, just make sure to sort by drug_id first. So something like:
df = df.sort_values('drug_id')
df['id'] = (df['drug_id'] != df['drug_id'].shift()).cumsum()
A different approach that does not involve sorting your dataframe would be to map a number to each unique drug_id:
uid = df['drug_id'].unique()
id_map = {x: y for x, y in zip(uid, range(1, len(uid) + 1))}
df['id'] = df['drug_id'].map(id_map)
Use the shift+cumsum pattern:
(df.drug_id!=df.drug_id.shift()).cumsum()
Out[5]:
0 1
1 1
2 1
3 2
4 2
5 3
6 4
7 4
Name: drug_id, dtype: int32
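For completeness, GroupBy.ngroup produces the same kind of label directly, numbering groups in order of appearance when sort=False (a one-line sketch):
df['id'] = df.groupby('drug_id', sort=False).ngroup() + 1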
