Aggregating data using Python

I have some data that I want to both sum and count based upon a certain field. My data looks like this:
Value ID Object
100 ABD Type1
200 ABD Type1
400 ABD Type2
200 BCE Type1
100 BCE Type1
800 JHO Type3
600 TVM Type4
And I am trying to get to this, where I have counted the number of unique Objects related to an ID and also summed the total Value for that ID:
ValueSum ID CountObject
700 ABD 2
300 BCE 1
800 JHO 1
600 TVM 1
What I have been looking at is using the .groupby() function along with .count() and .sum(), but I can't seem to get things in the right format.
Any help is much appreciated.
Thanks!
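For reference, the sample above can be built as a DataFrame (a minimal reconstruction, values transcribed from the table):

import pandas as pd

df = pd.DataFrame({
    'Value': [100, 200, 400, 200, 100, 800, 600],
    'ID': ['ABD', 'ABD', 'ABD', 'BCE', 'BCE', 'JHO', 'TVM'],
    'Object': ['Type1', 'Type1', 'Type2', 'Type1', 'Type1', 'Type3', 'Type4'],
})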

You can pass a dict of the funcs to perform on multiple columns of your df using groupby and agg:
In [289]:
gp = df.groupby('ID', as_index=False).agg({'Value':sum, 'Object':'nunique'})
gp = gp.rename(columns={'Value':'ValueSum', 'Object':'ObjectCount'})
gp
Out[289]:
ID ValueSum ObjectCount
0 ABD 700 2
1 BCE 300 1
2 JHO 800 1
3 TVM 600 1
Here we pass a dict mapping each column name to the func to perform; for the counting we use nunique, which returns the number of unique values.
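If you are on pandas 0.25 or later, named aggregation does the same thing and renames the columns in one step; a minimal sketch of that variant:

# Named aggregation: each keyword becomes an output column,
# declared as (source column, aggregation function).
gp = df.groupby('ID', as_index=False).agg(
    ValueSum=('Value', 'sum'),
    ObjectCount=('Object', 'nunique'),
)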

Related

Pandas grouping with filtering on other columns

I have the following dataframe in Pandas:
name  value  in  out
A        50   1    0
A       -20   0    1
B       150   1    0
C        10   1    0
D       500   1    0
D      -250   0    1
E       800   1    0
There are at most 2 observations for each name: one for in and one for out.
If there is only an in record for a name, there is only one observation for it.
You can create this dataset with this code:
import pandas as pd

data = {
    'name': ['A', 'A', 'B', 'C', 'D', 'D', 'E'],
    'values': [50, -20, 150, 10, 500, -250, 800],
    'in': [1, 0, 1, 1, 1, 0, 1],
    'out': [0, 1, 0, 0, 0, 1, 0],
}
df = pd.DataFrame.from_dict(data)
I want to sum the values column for each name, but only if the name has both an in and an out record; in other words, only when one unique name has exactly 2 rows.
The result should look like this:
name  value
A        30
D       250
If I run the following code, I get all the results without filtering based on in and out.
df.groupby('name').sum()
      values
name
A         30
B        150
C         10
D        250
E        800
How do I add the aforementioned filtering based on the columns?
Maybe you can try something with groupby, agg, and query (like below):
df.groupby('name').agg({'name':'count', 'values': 'sum'}).query('name>1')[['values']]
Output:
values
name
A 30
D 250
You could also use .query('name == 2') above, but assuming at most 2 rows can occur per name, .query('name > 1') returns the same result.
IIUC, you could filter before aggregation:
# check that we have exactly 1 in and 1 out per group
mask = df.groupby('name')[['in', 'out']].transform('sum').eq([1,1]).all(1)
# slice the correct groups and aggregate
out = df[mask].groupby('name', as_index=False)['values'].sum()
Or, you could filter afterwards (maybe less efficient if you have a lot of groups that would be filtered out):
(df.groupby('name', as_index=False).sum()
.loc[lambda d: d['in'].eq(1) & d['out'].eq(1), ['name', 'values']]
)
output:
name values
0 A 30
1 D 250
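A GroupBy.filter variant is arguably more readable, though it scans the groups twice; a sketch, relying on the question's guarantee of at most one in and one out row per name:

# Keep only the names that have exactly two rows (one in, one out),
# then sum the surviving rows per name.
out = (df.groupby('name')
         .filter(lambda g: len(g) == 2)
         .groupby('name', as_index=False)['values'].sum())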

Groupby by sum of revenue and the corresponding highest contributing month - Pandas

I have a bill details data set, and I want to do a groupby of the products based on the sum of their total value; additionally, I want a column which indicates the month that has produced the most revenue for the corresponding product.
Data set:
Bill_Id Month Product_Id Net_Value
1 1 20 100
2 1 20 100
3 2 20 100
4 1 30 200
5 2 30 200
6 2 30 200
Desired result:
Product_Id Total_revenue Top_Month
20 300 1
30 600 2
This is just a sample dataset; I have the transaction data for the entire year.
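For reference, the sample can be built as a DataFrame (values transcribed from the table above):

import pandas as pd

df = pd.DataFrame({
    'Bill_Id': [1, 2, 3, 4, 5, 6],
    'Month': [1, 1, 2, 1, 2, 2],
    'Product_Id': [20, 20, 20, 30, 30, 30],
    'Net_Value': [100, 100, 100, 200, 200, 200],
})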
Pivot the dataframe with aggfunc='sum', then use sum and idxmax along the columns axis to find the total revenue and the month which contributes the most to it; finally, concat the individual components along the columns axis to get the result:
s = df.pivot_table('Net_Value', 'Product_Id', 'Month', aggfunc='sum')
pd.concat([s.sum(1), s.idxmax(1)], axis=1, keys=['Total_revenue', 'Top_Month'])
Total_revenue Top_Month
Product_Id
20 300 1
30 600 2
Assuming that only one Top_Month value is needed, based on the maximum sum of Net_Value, below is code that might work for you.
We can achieve this in 3 stages, as mentioned below:
1. Extracting the sum of net revenue based on product id
df_1 = df.groupby(['Product_Id']).agg({'Net_Value' : sum}).reset_index()
df_1 = df_1.rename(columns={'Net_Value' : 'Total_revenue'})
print(df_1)
Product_Id Total_revenue
0 20 300
1 30 600
2. Extracting the best contributing month based on the max sum of net revenue for each product id
df_2 = df.groupby(['Product_Id', 'Month']).agg({'Net_Value' : sum}).sort_values('Net_Value', ascending=False).reset_index()
df_2 = df_2.drop_duplicates(subset=['Product_Id'])[['Product_Id', 'Month']]
print(df_2)
Product_Id Month
0 30 2
1 20 1
3. The final step is to merge both dataframes into a single one based on product id
final_df = df_1.merge(df_2)
print(final_df)
Product_Id Total_revenue Month
0 20 300 1
1 30 600 2
Please do upvote the solution if it helps :)
A small modification of @Shubham's approach:
result = (
df.pivot_table("Net_Value", "Product_Id", "Month", aggfunc="sum")
.agg(["sum", "idxmax"], axis=1)
.set_axis(["Total_revenue", "Top_Month"], axis=1)
)
As multiple columns interact here, I have used the apply function in addition to groupby:
Net_Value is calculated using the basic aggregate function sum.
Top_month requires interaction between columns, so first get the index of the max Net_Value using idxmax, then use loc to find the Month.
The resulting Pandas Series object has the groupby column (Product_Id) as its index, so to make it a column I have used reset_index.
def f(x):
    d = {}
    d['Net_Value'] = x['Net_Value'].sum()
    d['Top_month'] = df.loc[x['Net_Value'].idxmax(), 'Month']
    return pd.Series(d, index=['Net_Value', 'Top_month'])

df.groupby('Product_Id').apply(f).reset_index()
# Output
Product_Id Net_Value Top_month
0 20 300 1
1 30 600 2
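A groupby-only variant avoids the per-group apply entirely; a sketch, exploiting the fact that idxmax on the per-month sums returns (Product_Id, Month) index tuples:

# Sum revenue per product and month once, then derive both outputs from it.
monthly = df.groupby(['Product_Id', 'Month'])['Net_Value'].sum()
result = pd.DataFrame({
    'Total_revenue': monthly.groupby('Product_Id').sum(),
    'Top_Month': monthly.groupby('Product_Id').idxmax().str[1],  # month part of the tuple
}).reset_index()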

How to find the percentage share of a field in Dataframe?

I have a Pandas DataFrame like below
df = pd.DataFrame({"ID": [1, 2, 3, 4],
                   "Amt": [100, 200, 300, 400]})
My objective is to find the percentage share of each amount and create a new field in this dataframe with it. My final DataFrame should look like:
ID Amt Avg
1 100 10
2 200 20
3 300 30
4 400 40
To achieve this, I concatenated the Amt field to the same DF, renamed it to Avg, and then calculated the percentage using an iterator:
df = pd.concat([df, df['Amt']], axis=1)
df.columns = ['ID', 'Amt', 'Avg']
for i in range(0, len(df)):
    df['Avg'] = ((df['Avg'] / df['Amt'].sum()) * 100)
I'm new to this. I've seen many achieve difficult objectives with much simpler code. So can someone please help me find a better approach than this one?
I believe you need:
df['Avg'] = df['Amt']/df['Amt'].sum()*100
print (df)
ID Amt Avg
0 1 100 10.0
1 2 200 20.0
2 3 300 30.0
3 4 400 40.0
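If you prefer not to mutate df in place, the same computation can be written with assign; a minimal sketch:

# Returns a new frame with the extra column; df itself is untouched.
out = df.assign(Avg=df['Amt'] / df['Amt'].sum() * 100)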

Reading data from Dataframe using other Dataframe data as iloc inputs

I'm trying to grab a value from an existing df using iloc coordinates stored in another df, then store that value in the second df.
df_source (source):
Category1 Category2 Category3
Bucket1 100 200 300
Bucket2 400 500 600
Bucket3 700 800 900
df_coord (coordinates):
Index_X Index_Y
0 0
1 1
2 2
Want:
df_coord
Index_X Index_Y Added
0 0 100
1 1 500
2 2 900
I'm more familiar with an analytical language like SAS, where data is processed one line at a time, so the natural approach for me was this:
df_coord['Added'] = df_source.iloc[df_coord[Index_X][df_coord[Index_Y]]
When I tried this I got an error, which I understand as df_coord[Index_X] does not refer to the data on the same row. I have seen a few posts where using a "axis=1" option worked for their respective cases, but I can't figure out how to apply it to this case. Thank you.
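For reference, both frames can be built like this (values transcribed from the tables above):

import pandas as pd

df_source = pd.DataFrame(
    {'Category1': [100, 400, 700],
     'Category2': [200, 500, 800],
     'Category3': [300, 600, 900]},
    index=['Bucket1', 'Bucket2', 'Bucket3'],
)
df_coord = pd.DataFrame({'Index_X': [0, 1, 2], 'Index_Y': [0, 1, 2]})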
You could index the underlying ndarray, i.e. the values attribute, using the columns in df_coord as the first and second axes:
df_coord['Added'] = df_source.values[df_coord.Index_X, df_coord.Index_Y]
Index_X Index_Y Added
0 0 0 100
1 1 1 500
2 2 2 900
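In recent pandas, to_numpy() is the recommended spelling of .values; the same integer fancy indexing then reads (a sketch):

# Each (Index_X, Index_Y) pair picks one cell from the 2-D array.
df_coord['Added'] = df_source.to_numpy()[df_coord['Index_X'], df_coord['Index_Y']]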

How to roll up events into metadata from original dataframe

I have data that looks like
Name,Report_ID,Amount,Flag,Actions
Fizz,123,5,,A
Fizz,123,10,Y,A
Buzz,456,10,,B
Buzz,456,40,,C
Buzz,456,70,,D
Bazz,678,100,Y,F
From these individual operations, I'd like to create a new dataframe that captures various statistics / metadata per report, mostly summations, counts of items, and counts of unique entries. I'd like the output dataframe to look like the following:
Report_ID,Number of Flags,Number of Entries, Total,Unique Actions
123,1,2,15,1
456,0,3,120,3
678,1,1,100,1
I've tried using groupby, but I cannot merge all of the individual groupby objects back together correctly. So far I've tried:
totals = raw_data.groupby('Report_ID')['Amount'].sum()
event_count = raw_data.groupby('Report_ID').size()
num_actions = raw_data.groupby('Report_ID').Actions.nunique()
output = pd.concat([totals,event_count,num_actions])
When I try this I get TypeError: cannot concatenate a non-NDFrame object. Any help would be appreciated!
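For reference, the CSV block above can be loaded directly (a sketch using io.StringIO):

import io
import pandas as pd

raw_data = pd.read_csv(io.StringIO("""\
Name,Report_ID,Amount,Flag,Actions
Fizz,123,5,,A
Fizz,123,10,Y,A
Buzz,456,10,,B
Buzz,456,40,,C
Buzz,456,70,,D
Bazz,678,100,Y,F
"""))
df = raw_data  # the answers below call the frame df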
You can use agg on the groupby:
f = dict(Flag=['count', 'size'], Amount='sum', Actions='nunique')
df.groupby('Report_ID').agg(f)
Flag Amount Actions
count size sum nunique
Report_ID
123 1 2 15 1
456 0 3 120 3
678 1 1 100 1
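The dict form leaves a MultiIndex on the columns; to match the desired headers you can flatten and rename afterwards, e.g. (a sketch, assuming the column order shown above):

res = df.groupby('Report_ID').agg(f)
res.columns = ['Number of Flags', 'Number of Entries', 'Total', 'Unique Actions']
res = res.reset_index()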
You just need to specify axis=1 when concatenating:
event_count.name = 'Event Count'  # name the Series, as you did not group on one
pd.concat([totals, event_count, num_actions], axis=1)
Amount Event Count Actions
Report_ID
123 15 2 1
456 120 3 3
678 100 1 1
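On pandas 0.25+, named aggregation can build the final frame in one step; unpacking a dict allows spaces in the output names (a sketch):

output = raw_data.groupby('Report_ID').agg(
    **{'Number of Flags': ('Flag', 'count'),
       'Number of Entries': ('Flag', 'size'),
       'Total': ('Amount', 'sum'),
       'Unique Actions': ('Actions', 'nunique')}
).reset_index()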
