Select a specific group of a grouped dataframe with pandas - python

I have the following dataframe:
df.index = df['Date']
df.groupby([df.index.month, df['Category']])['Amount'].sum()
Date  Category   Amount
1     A         -125.35
      B          -40.00
...
12    A          505.15
      B         -209.00
I would like to report the sum of the Amount for every Category B like:
Date  Category   Amount
1     B          -40.00
...
12    B         -209.00
I tried the df.get_group method, but it needs a tuple containing both the Date and the Category key. Is there a way to select only Category B?

You can use IndexSlice:
# groupby here; double brackets keep 'Amount' as a DataFrame
df_group = df.groupby([df.index.month, df['Category']])[['Amount']].sum()
# report only Category B
df_group.loc[pd.IndexSlice[:, 'B'], :]
Or query, which works with index level names too:
# 'Category' here is an index level, not a column
df_group.query('Category == "B"')
Output:
               Amount
Date Category
1    B          -40.0
12   B         -209.0
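A cross-section with .xs is a further option (not from the original answers, just standard pandas): it selects a single label from one index level, and drop_level=False keeps the Category level in the result:
# select label 'B' from the 'Category' index level
df_group.xs('B', level='Category', drop_level=False)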

Apply a filter to your dataframe before grouping, keeping only the rows where Category equals B:
# note: avoid naming the mask 'filter', which shadows the built-in
mask = df['Category'] == 'B'
sub = df[mask]
sub.groupby([sub.index.month, sub['Category']])['Amount'].sum()
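As a self-contained check, here is the same approach on a minimal frame built from the values shown in the question (the exact days within each month are assumed for illustration):
import pandas as pd

# minimal frame with the amounts from the question; the days are assumed
df = pd.DataFrame({'Date': pd.to_datetime(['2014-01-15', '2014-01-20',
                                           '2014-12-15', '2014-12-20']),
                   'Category': ['A', 'B', 'A', 'B'],
                   'Amount': [-125.35, -40.00, 505.15, -209.00]})
df.index = df['Date']

sub = df[df['Category'] == 'B']
print(sub.groupby([sub.index.month, sub['Category']])['Amount'].sum())
# Date  Category
# 1     B          -40.0
# 12    B         -209.0
# Name: Amount, dtype: float64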

Related

Groupby to count the number of calls on different days by id

Given a dataframe like the one below:
df = pd.DataFrame({'date': ['2013-04-19', '2013-04-19', '2013-04-20', '2013-04-20', '2013-04-19'],
                   'id': [1, 2, 2, 3, 1]})
I need to create another dataframe containing only the id and the number of calls made on different days. An example of output is as follows:
Id | Count
1 | 1
2 | 2
3 | 1
What I'm trying so far:
df2 = df.groupby(['id','date']).size().reset_index().rename(columns={0:'COUNT'})
df2
However, the output is far from what I want. Can anyone help?
You can make use of .nunique() [pandas-doc] to count the unique days per id:
df.groupby('id').date.nunique()
This gives us a series:
>>> df.groupby('id').date.nunique()
id
1    1
2    2
3    1
Name: date, dtype: int64
You can make use of .to_frame() [pandas-doc] to convert it to a dataframe:
>>> df.groupby('id').date.nunique().to_frame('count')
    count
id
1       1
2       2
3       1
You can use the pd.DataFrame constructor to convert the result into a dataframe, and rename the columns as you like.
import pandas as pd

df = pd.DataFrame({'date': ['2013-04-19', '2013-04-19', '2013-04-20', '2013-04-20', '2013-04-19'],
                   'id': [1, 2, 2, 3, 1]})
x = pd.DataFrame(df.groupby('id').date.nunique().reset_index())
x.columns = ['Id', 'Count']
print(x)
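The same result also fits in a shorter chain, since Series.reset_index accepts a name for the values column (a small variation on the answers above):
out = df.groupby('id').date.nunique().reset_index(name='Count')
out = out.rename(columns={'id': 'Id'})  # match the capitalization in the question
print(out)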

Conditional Calculation of Pandas Dataframe columns

I have a pandas dataframe which reads
Category  Sales
A         10
B         20
I want to conditionally create a new column, Target, so that my df looks like:
Category  Sales  Target
A         10     5
B         20     10
I used the below code and it threw an error
if (df['Category'] == 'A'):
    df['Target'] = df['Sales'] - 5
else:
    df['Target'] = df['Sales'] - 10
Use vectorized numpy.where (an if/else cannot evaluate a whole column at once, which is why your code raises an error):
import numpy as np

df['Target'] = np.where(df['Category'] == 'A', df['Sales'] - 5, df['Sales'] - 10)
print(df)
  Category  Sales  Target
0        A     10       5
1        B     20      10
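If more categories each need their own offset, numpy.select generalizes the same idea. A minimal sketch, with the offsets taken from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B'], 'Sales': [10, 20]})

# one condition per category, paired with the value to use when it is True
conditions = [df['Category'] == 'A', df['Category'] == 'B']
choices = [df['Sales'] - 5, df['Sales'] - 10]
df['Target'] = np.select(conditions, choices)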

Drop rows from each group if dates are within a given range

Given a DataFrame like below:
dfx = pd.DataFrame({"ID": ["A", "A", "C", "B", "B"],
                    "date": ["01/01/2014", "01/31/2014", "01/23/2014", "01/01/2014", "01/20/2014"]})
I want to remove "duplicates": rows that share the same ID and whose dates are less than 30 days apart.
The resulting DataFrame, after removing the "duplicates", is expected to appear as:
ID date
A 01/01/2014
A 01/31/2014
C 01/23/2014
B 01/01/2014
1. Convert date to datetime.
2. Group date by ID and find the difference between consecutive rows.
3. Extract the days component from the timedelta difference and compare it to 30.
4. Filter dfx based on the mask (unrolled step by step below).
dfx[~pd.to_datetime(dfx.date).groupby(dfx.ID).diff().dt.days.lt(30)]
  ID        date
0  A  01/01/2014
1  A  01/31/2014
2  C  01/23/2014
3  B  01/01/2014
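For readability, here is the same logic unrolled into the numbered steps above:
dates = pd.to_datetime(dfx.date)        # 1. convert to datetime
gaps = dates.groupby(dfx.ID).diff()     # 2. per-ID difference between consecutive rows
mask = gaps.dt.days.lt(30)              # 3. True where the gap is under 30 days (NaT -> False)
result = dfx[~mask]                     # 4. keep rows that are not "duplicates"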

Sort a column within groups in Pandas

I am new to pandas. I'm trying to sort a column within each group. So far, I have been able to group the first and second column values together and calculate the mean of the third column, but I am still struggling to sort that third column.
(My input dataframe, and the dataframe after applying groupby and mean, were shown as screenshots in the original post.)
I used the following line of code to group the input dataframe:
df_o = df.groupby(by=['Organization Group','Department']).agg({'Total Compensation': np.mean})
Please let me know how to sort the last column within each group of the first column using pandas.
It seems you need sort_values:
# to return a DataFrame rather than a Series, add as_index=False
df_o = df.groupby(['Organization Group','Department'],
                  as_index=False)['Total Compensation'].mean()
df_o = df_o.sort_values(['Total Compensation','Organization Group'])
Sample:
df = pd.DataFrame({'Organization Group':['a','b','a','a'],
                   'Department':['d','f','a','a'],
                   'Total Compensation':[1,8,9,1]})
print (df)
  Department Organization Group  Total Compensation
0          d                  a                   1
1          f                  b                   8
2          a                  a                   9
3          a                  a                   1
df_o = df.groupby(['Organization Group','Department'],
                  as_index=False)['Total Compensation'].mean()
print (df_o)
  Organization Group Department  Total Compensation
0                  a          a                   5
1                  a          d                   1
2                  b          f                   8
df_o = df_o.sort_values(['Total Compensation','Organization Group'])
print (df_o)
  Organization Group Department  Total Compensation
1                  a          d                   1
0                  a          a                   5
2                  b          f                   8
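If the goal is instead to sort compensation within each Organization Group rather than globally by value, putting the group key first achieves that (a small variation on the answer above):
# sort by group first, then by value within each group
df_o.sort_values(['Organization Group', 'Total Compensation'])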

pandas GroupBy aggregate only one column

I have a DataFrame of the following form:
>>> sales = pd.DataFrame({'seller_id': list('AAAABBBB'), 'buyer_id': list('CCDECDEF'),
...                       'amount': np.random.randint(10, 20, size=(8,))})
>>> sales = sales[['seller_id','buyer_id','amount']]
>>> sales
  seller_id buyer_id  amount
0         A        C      18
1         A        C      15
2         A        D      11
3         A        E      12
4         B        C      16
5         B        D      18
6         B        E      16
7         B        F      19
Now what I would like to do is for each seller calculate the share of total sale amount taken up by its largest buyer. I have code that does this, but I have to keep resetting the index and grouping again, which is wasteful. There has to be a better way. I would like a solution where I can aggregate one column at a time and keep the others grouped.
Here's my current code:
>>> gr2 = sales.groupby(['buyer_id','seller_id'])
>>> seller_buyer_level = gr2['amount'].sum() # sum over different purchases
>>> seller_buyer_level_reset = seller_buyer_level.reset_index('buyer_id')
>>> gr3 = seller_buyer_level_reset.groupby(seller_buyer_level_reset.index)
>>> result = gr3['amount'].max() / gr3['amount'].sum()
>>> result
seller_id
A 0.589286
B 0.275362
I simplified a bit. In reality I also have a time period column, and so I want to do this at the seller and time period level, that's why in gr3 I'm grouping by the multi-index (in this example, it appears as a single index).
I thought there would be a solution where instead of reducing and regrouping I would be able to aggregate only one index out of the group, leaving the others grouped, but couldn't find it in the documentation or online. Any ideas?
Here's a one-liner, but it resets the index once, too:
sales.groupby(['seller_id','buyer_id']).sum().\
reset_index(level=1).groupby(level=0).\
apply(lambda x: x.amount.max()/x.amount.sum())
#seller_id
#A 0.509091
#B 0.316667
#dtype: float64
I would do this using pivot_table and then broadcasting (see What does the term "broadcasting" mean in Pandas documentation?).
First, pivot the data with seller_id in the index and buyer_id in the columns:
sales_pivot = sales.pivot_table(index='seller_id', columns='buyer_id', values='amount', aggfunc='sum')
Then, divide the values in each row by the sum of said row:
result = sales_pivot.div(sales_pivot.sum(axis=1), axis=0)
Lastly, you can call result.max(axis=1) to see the top share for each seller.
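A variant that avoids resetting the index altogether, closer to what the question asks for (a sketch, not from the original answers): aggregate once into a MultiIndex Series, then group that Series by the seller level only:
s = sales.groupby(['seller_id', 'buyer_id'])['amount'].sum()
# max over buyers divided by the seller total, both grouped by the seller level
result = s.groupby(level='seller_id').max() / s.groupby(level='seller_id').sum()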
