Pandas aggfunc sum based on multiple columns

Pandas aggfunc sum based on multiple columns - python

I'm trying to sum data from multiple columns in my dataframe by pivoting the table and using aggfunc. My dataframe gives emission data for various regions. I don't want to sum some rows so I make a selection of the rows that I want to sum. The output however is two rows for each column:
one is named True and gives the sum of the rows that I defined (this is the column that I want)
the other is named False and gives the sum of the remainder of the rows that I did not define (this one I would like to drop/omit)
The data is numeric regional data for multiple years so what I want to do is add data from some regions in order to get data for larger regions. The years are listed in columns.
The data looks something like this:
inp = [{'Scenario':'Baseline', 'Region':'CHINA', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':5,'1995':10,'2000':15},
{'Scenario':'Baseline', 'Region':'INDIA', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':6,'1995':11,'2000':16},
{'Scenario':'Baseline', 'Region':'INDONESIA', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':7,'1995':12,'2000':17},
{'Scenario':'Baseline', 'Region':'KOREA', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':8,'1995':13,'2000':18},
{'Scenario':'Baseline', 'Region':'JAPAN', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':9,'1995':14,'2000':19},
{'Scenario':'Baseline', 'Region':'THAILAND', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':10,'1995':15,'2000':20},
{'Scenario':'Baseline', 'Region':'RUSSIA', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':11,'1995':16,'2000':21}]
dt = pd.DataFrame(inp)
dt
1990 1995 2000 Region Scenario Unit Variable
0 5 10 15 CHINA Baseline MtCO2eq Methane
1 6 11 16 INDIA Baseline MtCO2eq Methane
2 7 12 17 INDONESIA Baseline MtCO2eq Methane
3 8 13 18 KOREA Baseline MtCO2eq Methane
4 9 14 19 JAPAN Baseline MtCO2eq Methane
5 10 15 20 THAILAND Baseline MtCO2eq Methane
6 11 16 21 RUSSIA Baseline MtCO2eq Methane
I run this piece of code:
dt_test = dt.pivot_table(dt,index=['Scenario','Variable','Unit'],
columns=[(df['Region'] == 'CHINA')|
(df['Region'] == 'INDIA')|
(df['Region'] == 'INDONESIA')
|(df['Region'] == 'KOREA')],
aggfunc=np.sum)
And get this as output:
1990 1995 2000
Region False True False True False True
Scenario Variable Unit
Baseline Methane MtCO2eq 46 10 76 15 106 20
If someone could help me out with either a way to drop this False column for all the years or another nifty way to get the totals that I want that would be amazing.

Use xs:
print (dt_test.xs(True, axis=1, level=1))
1990 1995 2000
Scenario Variable Unit
Baseline Methane MtCO2eq 26 46 66
But better is filter first by isin and boolean indexing:
df = df[df['Region'].isin(['CHINA','INDIA','INDONESIA','KOREA'])]
print (df)
1990 1995 2000 Region Scenario Unit Variable
0 5 10 15 CHINA Baseline MtCO2eq Methane
1 6 11 16 INDIA Baseline MtCO2eq Methane
2 7 12 17 INDONESIA Baseline MtCO2eq Methane
3 8 13 18 KOREA Baseline MtCO2eq Methane
And then aggregate sum per groups:
dt_test = df.groupby(['Scenario','Variable','Unit']).sum()
print (dt_test)
1990 1995 2000
Scenario Variable Unit
Baseline Methane MtCO2eq 26 46 66

Related

I want to create a new column territory based on the city column

Data Frame :
city Temperature
0 Chandigarh 15
1 Delhi 22
2 Kanpur 20
3 Chennai 26
4 Manali -2
0 Bengalaru 24
1 Coimbatore 35
2 Srirangam 36
3 Pondicherry 39
I need to create another column in data frame, which contains a boolean value for each city to indicate whether it's a union territory or not. Chandigarh, Pondicherry and Delhi are only 3 union territories here.
I have written below code
import numpy as np
conditions = [df3['city'] == 'Chandigarh',df3['city'] == 'Pondicherry',df3['city'] == 'Delhi']
values =[1,1,1]
df3['territory'] = np.select(conditions, values)
Is there any easier or efficient way that I can write?

You can use isin:
union_terrs = ["Chandigarh", "Pondicherry", "Delhi"]
df3["territory"] = df3["city"].isin(union_terrs).astype(int)
which checks each entry in city column and if it is in union_terrs, gives True and otherwise False. The astype makes True/False to 1/0 conversion,
to get
city Temperature territory
0 Chandigarh 15 1
1 Delhi 22 1
2 Kanpur 20 0
3 Chennai 26 0
4 Manali -2 0
0 Bengalaru 24 0
1 Coimbatore 35 0
2 Srirangam 36 0
3 Pondicherry 39 1

How to get top 3 sales in data frame after using group by and sorting in python?

recently I am doing with this data set
import pandas as pd
data = {'Product':['Box','Bottles','Pen','Markers','Bottles','Pen','Markers','Bottles','Box','Markers','Markers','Pen'],
'State':['Alaska','California','Texas','North Carolina','California','Texas','Alaska','Texas','North Carolina','Alaska','California','Texas'],
'Sales':[14,24,31,12,13,7,9,31,18,16,18,14]}
df1=pd.DataFrame(data, columns=['Product','State','Sales'])
df1
I want to find the 3 groups that have the highest sales
grouped_df1 = df1.groupby('State')
grouped_df1.apply(lambda x: x.sort_values(by = 'Sales', ascending=False))
So I have a dataframe like this
Now, I want to find the top 3 State that have the highest sales.
I tried to use
grouped_df1.apply(lambda x: x.sort_values(by = 'Sales', ascending=False)).head(3)
# It gives me the first three rows
grouped_df1.apply(lambda x: x.sort_values(by = 'Sales', ascending=False)).max()
#It only gives me the maximum value
The expected result should be:
Texas: 31
California: 24
North Carolina: 18
Thus, how can I fix it? Because sometimes, a State can have 3 top sales, for example Alaska may have 3 top sales. When I simply sort it, the top 3 will be Alaska, and it cannot find 2 other groups.
Many thanks!

You could add a new column called Sales_Max_For_State and then use drop_duplicates and nlargest:
>>> df1['Sales_Max_For_State'] = df1.groupby(['State'])['Sales'].transform(max)
>>> df1
Product State Sales Sales_Max_For_State
0 Box Alaska 14 16
1 Bottles California 24 24
2 Pen Texas 31 31
3 Markers North Carolina 12 18
4 Bottles California 13 24
5 Pen Texas 7 31
6 Markers Alaska 9 16
7 Bottles Texas 31 31
8 Box North Carolina 18 18
9 Markers Alaska 16 16
10 Markers California 18 24
11 Pen Texas 14 31
>>> df2 = df1.drop_duplicates(['Sales_Max_For_State']).nlargest(3, 'Sales_Max_For_State')[['State', 'Sales_Max_For_State']]
>>> df2
State Sales_Max_For_State
2 Texas 31
1 California 24
3 North Carolina 18

I think there are a few ways to do this:
1-
df1.groupby('State').agg({'Sales': 'max'}).sort_values(by='Sales', ascending=False).iloc[:3]
2-df1.groupby('State').agg({'Sales': 'max'})['Sales'].nlargest(3)
Sales
State
Texas 31
California 24
North Carolina 18

Set multiple columns to zero based on a value in another column [duplicate]

This question already has answers here:
Change one value based on another value in pandas
(7 answers)
Closed 2 years ago.
I have a sample dataset here. In real case, it has a train and test dataset. Both of them have around 300 columns and 800 rows. I want to filter out all those rows based on a certain value in one column and then set all values in that row from column 3 e.g. to column 50 to zero. How can I do it?
Sample dataset:
import pandas as pd
data = {'Name':['Jai', 'Princi', 'Gaurav','Princi','Anuj','Nancy'],
'Age':[27, 24, 22, 32,66,43],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj', 'Katauj', 'vbinauj'],
'Payment':[15,20,40,50,3,23],
'Qualification':['Msc', 'MA', 'MCA', 'Phd','MA','MS']}
df = pd.DataFrame(data)
df
Here is the output of sample dataset:
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 Kanpur 20 MA
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 Kannauj 50 Phd
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
As you can see, in the first column, there values =="Princi", So if I find rows that Name column value =="Princi", then I want to set column "Address" and "Payment" in those rows to zero.
Here is the expected output:
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 0 0 MA #this row
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 0 0 Phd #this row
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
In my real dataset, I tried:
train.loc[:, 'got':'tod']# get those columns # I could select all those columns
and train.loc[df['column_wanted'] == "that value"] # I got all those rows
But how can I combine them? Thanks for your help!

Use the loc accessor; df.loc[boolean selection, columns]
df.loc[df['Name'].eq('Princi'),'Address':'Payment']=0
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 0 0 MA
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 0 0 Phd
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS

Merge pandas groupBy objects

I have a huge dataset of 292 million rows (6GB) in CSV format. Panda's read_csv function is not working for such big file. So I am reading data in small chunks (10 million rows) iteratively using this code :
for chunk in pd.read_csv('hugeData.csv', chunksize=10**7):
#something ...
In the #something I am grouping rows according to some columns. So in each iteration, I get new groupBy objects. I am not able to merge these groupBy objects.
A smaller dummy example is as follows :
Here dummy.csv is a 28 rows CSV file, which is trade report between some countries in some year. sitc is some product code and export is export amount in some USD billion. (Please note that data is fictional)
year,origin,dest,sitc,export
2000,ind,chn,2146,2
2000,ind,chn,4132,7
2001,ind,chn,2146,3
2001,ind,chn,4132,10
2002,ind,chn,2227,7
2002,ind,chn,4132,7
2000,ind,aus,7777,19
2001,ind,aus,2146,30
2001,ind,aus,4132,12
2002,ind,aus,4133,30
2000,aus,ind,4132,6
2001,aus,ind,2146,8
2001,chn,aus,1777,9
2001,chn,aus,1977,31
2001,chn,aus,1754,12
2002,chn,aus,8987,7
2001,chn,aus,4879,3
2002,aus,chn,3489,7
2002,chn,aus,2092,30
2002,chn,aus,4133,13
2002,aus,ind,0193,6
2002,aus,ind,0289,8
2003,chn,aus,0839,9
2003,chn,aus,9867,31
2003,aus,chn,3442,3
2004,aus,chn,3344,17
2005,aus,chn,3489,11
2001,aus,ind,0893,17
I split it into two 14 rows data and grouped them according to year, origin, dest.
for chunk in pd.read_csv('dummy.csv', chunksize=14):
xd = chunk.groupby(['origin','dest','year'])['export'].sum();
print(xd)
Results :
origin dest year
aus ind 2000 6
2001 8
chn aus 2001 40
ind aus 2000 19
2001 42
2002 30
chn 2000 9
2001 13
2002 14
Name: export, dtype: int64
origin dest year
aus chn 2002 7
2003 3
2004 17
2005 11
ind 2001 17
2002 14
chn aus 2001 15
2002 50
2003 40
Name: export, dtype: int64
How can I merge the two GroupBy objects?
Will merging them, again create memory issues in the big data? A prediction by looking at the nature of data, if properly merged the number of rows will surely reduce by at least 10-15 times.
The basic aim is :
Given origin country and dest country,
I need to plot total exports between them yearwise.
Querying this everytime over the whole data is taking a lot of time.
xd = chunk.loc[(chunk.origin == country1) & (chunk.dest == country2)]
Hence I was thinking to save time by once arranging them in groupBy manner.
Any suggestion is greatly appreciated.

You can use pd.concat to join groupby results and then apply sum:
>>> pd.concat([xd0,xd1],axis=1)
export export
origin dest year
aus ind 2000 6 6
2001 8 8
chn aus 2001 40 40
ind aus 2000 19 19
2001 42 42
2002 30 30
chn 2000 9 9
2001 13 13
2002 14 14
>>> pd.concat([xd0,xd1],axis=1).sum(axis=1)
origin dest year
aus ind 2000 12
2001 16
chn aus 2001 80
ind aus 2000 38
2001 84
2002 60
chn 2000 18
2001 26
2002 28

Add calculated column to a pandas pivot table

I have created a pandas data frame and then converted it into pivot table.
My pivot table looks like this:
Operators TotalCB Qd(cb) Autopass(cb)
Aircel India 55 11 44
Airtel Ghana 20 17 3
Airtel India 41 9 9
Airtel Kenya 9 4 5
Airtel Nigeria 24 17 7
AT&T USA 18 10 8
I was wondering how to add calculated columns so that I get my pivot table with Autopass% (Autopass(cb)/TotalCB*100) just like we are able to create them in Excel using calculated field option.
I want my pivot table output to be something like below:
Operators TotalCB Qd(cb) Autopass(cb) Qd(cb)% Autopass(cb)%
Aircel India 55 11 44 20% 80%
Airtel Ghana 20 17 3 85% 15%
Airtel India 41 29 9 71% 22%
Airtel Kenya 9 4 5 44% 56%
AT&T USA 18 10 8 56% 44%
How do I define the function which calculates the percentage columns and how to apply that function to my two columns namely Qd(cb) and Autopass(cb) to give me additional calculated columns

This should do it, assuming data is your pivoted dataframe:
data['Autopass(cb)%'] = data['Autopass(cb)'] / data['TotalCB'] * 100
data['Qd(cb)%'] = data['Qd(cb)'] / data['TotalCB'] * 100
Adding a new column to a dataframe is as simple as df['colname'] = new_series. Here we assign it with your requested function, when we do it as a vector operation it creates a new series.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas aggfunc sum based on multiple columns - python

Related

I want to create a new column territory based on the city column

How to get top 3 sales in data frame after using group by and sorting in python?

Set multiple columns to zero based on a value in another column [duplicate]

Merge pandas groupBy objects

Add calculated column to a pandas pivot table

Categories

Resources