Sum duplicated rows on a multi-index pandas dataframe - python

Hello, I'm having trouble with pandas. I'm trying to sum duplicated rows on a multi-index DataFrame.
I tried df.groupby(level=[0,1]).sum(), df.stack().reset_index().groupby(['year', 'product']).sum() and some others, but I cannot get it to work.
I'd also like to add every unique product for each given year and give it a value of 0 if it wasn't listed.
Example: dataframe with multi-index and 3 different products (A,B,C):
              volume1  volume2
year product
2010 A             10       12
     A              7        3
     B              7        7
2011 A             10       10
     B              7        6
     C              5        5
Expected output: if there are duplicated products for a given year, we sum them.
If one of the products isn't listed for a year, we create a new row full of 0s.
              volume1  volume2
year product
2010 A             17       15
     B              7        7
     C              0        0
2011 A             10       10
     B              7        6
     C              5        5
Any ideas? Thanks.

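You can make the second level of the index a CategoricalIndex; groupby with observed=False then includes every category, even those with no rows for a given year.
# set_levels(..., inplace=True) was removed in pandas 2.0, so assign the result back
df.index = df.index.set_levels(pd.CategoricalIndex(df.index.levels[1]), level=1)
# unobserved (year, product) pairs sum to 0; on older pandas they may come back as NaN, in which case append .fillna(0)
df.groupby(level=[0, 1], observed=False).sum()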
              volume1  volume2
year product
2010 A             17       15
     B              7        7
     C              0        0
2011 A             10       10
     B              7        6
     C              5        5

Use groupby and sum, then unstack and stack:
df = df.groupby(level=[0,1]).sum().unstack(fill_value=0).stack()
# on pandas < 2.0 the same could be written as
# df = df.sum(level=[0,1]).unstack(fill_value=0).stack()
Alternative with reindex:
df = df.groupby(level=[0,1]).sum()
mux = pd.MultiIndex.from_product(df.index.levels, names = df.index.names)
df = df.reindex(mux, fill_value=0)
Another alternative, thanks @Wen:
# without fill_value the missing combinations come back as NaN, so fill them afterwards
df = df.groupby(level=[0,1]).sum().unstack().stack(dropna=False).fillna(0).astype(int)
print (df)
              volume1  volume2
year product
2010 A             17       15
     B              7        7
     C              0        0
2011 A             10       10
     B              7        6
     C              5        5
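For anyone who wants to run the reindex approach end to end, here is a minimal self-contained sketch. The frame is rebuilt from the sample data in the question; the names summed, full_index and result are only illustrative.

```python
import pandas as pd

# Sample data from the question: (2010, A) appears twice, (2010, C) is missing.
index = pd.MultiIndex.from_tuples(
    [(2010, "A"), (2010, "A"), (2010, "B"),
     (2011, "A"), (2011, "B"), (2011, "C")],
    names=["year", "product"],
)
df = pd.DataFrame(
    {"volume1": [10, 7, 7, 10, 7, 5], "volume2": [12, 3, 7, 10, 6, 5]},
    index=index,
)

# Step 1: collapse duplicated (year, product) pairs by summing them.
summed = df.groupby(level=["year", "product"]).sum()

# Step 2: build every (year, product) combination and fill the gaps with 0.
full_index = pd.MultiIndex.from_product(
    summed.index.levels, names=summed.index.names
)
result = summed.reindex(full_index, fill_value=0)
print(result)
```

Because reindex is given fill_value=0, the new rows are integers from the start, so no NaN-to-int conversion is needed afterwards.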

Related

Python - Creating a custom dataframe from transposing an existing one

I have a dataframe that consists of the following entries:
The dataset contains information about each country for about 10 years, for 5 indicators, as represented above.
I am trying to convert the Indicator column so that each of the 5 indicators becomes its own column, and to merge the year columns into a single Year column.
Ideally the final output should look like:
So the Country column should have extra entries according to the number of years, and the values should be transposed according to each indicator.
I tried pandas built-in functions such as melt and pivot, but I am not getting anywhere.
Any guidance on this would be appreciated. Thanks!
You can use stack and unstack:
out = (df.set_index(['Country Name', 'Indicator'])
.rename_axis(columns='Year').stack('Year')
.unstack('Indicator').rename_axis(columns=None)
.reset_index())
print(out)
# Output
Country Name Year I J K L
0 A 2008 0.002535 0.966967 0.033397 0.487713
1 A 2009 0.797714 0.642878 0.752803 0.527796
2 A 2010 0.773789 0.288100 0.013059 0.918957
3 B 2008 0.979707 0.968245 0.820731 0.309862
4 B 2009 0.086082 0.608920 0.629591 0.156926
5 B 2010 0.983092 0.536192 0.380157 0.091473
6 C 2008 0.834870 0.145200 0.225985 0.686520
7 C 2009 0.771646 0.834432 0.519951 0.651756
8 C 2010 0.003791 0.292212 0.257748 0.473694
Alternative with melt and pivot:
out = (df.melt(['Country Name', 'Indicator'], var_name='Year')
.pivot(index=['Country Name', 'Year'], columns='Indicator', values='value')
.rename_axis(columns=None).reset_index())
Input data:
>>> df
Country Name Indicator 2008 2009 2010
0 A I 0.002535 0.797714 0.773789
1 A J 0.966967 0.642878 0.288100
2 A K 0.033397 0.752803 0.013059
3 A L 0.487713 0.527796 0.918957
4 B I 0.979707 0.086082 0.983092
5 B J 0.968245 0.608920 0.536192
6 B K 0.820731 0.629591 0.380157
7 B L 0.309862 0.156926 0.091473
8 C I 0.834870 0.771646 0.003791
9 C J 0.145200 0.834432 0.292212
10 C K 0.225985 0.519951 0.257748
11 C L 0.686520 0.651756 0.473694
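A small runnable sketch of the melt and pivot route may help to see the intermediate long format. The values below are made up for illustration (two countries, two indicators, two years) and are not the data from the question.

```python
import pandas as pd

# Hypothetical wide input: one row per (country, indicator), one column per year.
df = pd.DataFrame({
    "Country Name": ["A", "A", "B", "B"],
    "Indicator": ["I", "J", "I", "J"],
    "2008": [1.0, 2.0, 3.0, 4.0],
    "2009": [5.0, 6.0, 7.0, 8.0],
})

# melt: year columns become rows, giving one row per (country, indicator, year).
long = df.melt(
    id_vars=["Country Name", "Indicator"], var_name="Year", value_name="value"
)

# pivot: indicators become columns, giving one row per (country, year).
out = (
    long.pivot(index=["Country Name", "Year"], columns="Indicator", values="value")
        .rename_axis(columns=None)
        .reset_index()
)
print(out)
```

Passing a list as the pivot index needs pandas 1.1 or later; the keyword form is used because recent pandas versions no longer accept these arguments positionally.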

sum duplicate row with condition using pandas

I have a dataframe that looks like this:
Name rent sale
0 A 180 2
1 B 1 4
2 M 12 1
3 O 10 1
4 A 180 5
5 M 2 19
I want to apply a condition to duplicate rows. If a duplicated Name has the same value in a column, I keep that value once, without summing. Example:
duplicate row A has the duplicate value 180 in the rent column, so I keep only one 180.
Otherwise I sum the values. Example: duplicate row A has different values 2 and 5 in the sale column, and duplicate row M has different values in both the rent and sale columns.
Expected output:
Name rent sale
0 A 180 7
1 B 1 4
2 M 14 20
3 O 10 1
I tried this code, but it's not working as I want:
import pandas as pd
df=pd.DataFrame({'Name':['A','B','M','O','A','M'],
'rent':[180,1,12,10,180,2],
'sale':[2,4,1,1,5,19]})
df2 = df.drop_duplicates().groupby('Name',sort=False,as_index=False).agg(Name=('Name','first'),
rent=('rent', 'sum'),
sale=('sale','sum'))
print(df2)
I got this output:
Name rent sale
0 A 360 7
1 B 1 4
2 M 14 20
3 O 10 1
You can try summing only the unique values per group:
def sum_unique(s):
    return s.unique().sum()
df2 = df.groupby('Name', sort=False, as_index=False).agg(
Name=('Name', 'first'),
rent=('rent', sum_unique),
sale=('sale', sum_unique)
)
df2:
Name rent sale
0 A 180 7
1 B 1 4
2 M 14 20
3 O 10 1
You can first group by Name and rent, and then just by Name:
df2 = df.groupby(['Name', 'rent'], as_index=False).sum().groupby('Name', as_index=False).sum()
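To compare the two answers side by side, here is a self-contained sketch on the data from the question. The names by_unique and by_two_steps are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "B", "M", "O", "A", "M"],
    "rent": [180, 1, 12, 10, 180, 2],
    "sale": [2, 4, 1, 1, 5, 19],
})

def sum_unique(s):
    """Sum each distinct value in the group once."""
    return s.unique().sum()

# Approach 1: de-duplicate values per column, then sum.
by_unique = df.groupby("Name", sort=False, as_index=False).agg(
    rent=("rent", sum_unique),
    sale=("sale", sum_unique),
)

# Approach 2: collapse identical (Name, rent) pairs first, then sum per Name.
by_two_steps = (
    df.groupby(["Name", "rent"], as_index=False).sum()
      .groupby("Name", as_index=False).sum()
)

print(by_unique)
print(by_two_steps)
```

Both give the expected output here, but they are not equivalent in general: sum_unique also collapses repeated sale values within a name, while the two-step groupby only collapses repeated rent values and always sums every sale.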

Index duplicate rows in Python DataFrame

I am trying to add a column that indexes duplicate rows, ordered by another column.
Here's the example dataset:
df = pd.DataFrame({'Name': ['A','A','A','B','B','B','B'], 'Score': [9,10,10,8,7,8,8], 'Year': [2019,2018,2017,2019,2018,2017,2016]})
I want to use ['Name', 'Score'] to identify duplicates, then number the duplicates in order of Year to get the following result:
Here rows 2 and 3 are duplicates because they have the same name and score, so I order them by year and number them.
Does anyone have a good idea how to do this in Python? Thank you so much!
You are looking for cumcount:
df['Index'] = (df.sort_values('Year', ascending=False)
.groupby(['Name','Score'])
.cumcount() + 1
)
Output:
Name Score Year Index
0 A 9 2019 1
1 A 10 2018 1
2 A 10 2017 2
3 B 8 2019 1
4 B 7 2018 1
5 B 8 2017 2
6 B 8 2016 3
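The same answer as a runnable sketch, using the dataset from the question with its dictionary syntax corrected:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "B", "B", "B"],
    "Score": [9, 10, 10, 8, 7, 8, 8],
    "Year": [2019, 2018, 2017, 2019, 2018, 2017, 2016],
})

# Sort newest first, then number the rows inside each (Name, Score) group.
# cumcount starts at 0, hence the + 1.
df["Index"] = (
    df.sort_values("Year", ascending=False)
      .groupby(["Name", "Score"])
      .cumcount() + 1
)
print(df)
```

The counter is computed on a sorted copy, but the assignment aligns on the original row labels, so the frame keeps its original row order.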

Python: Calculate mathematical values in new row in dataframe based on few specific previous rows

I have the below pandas dataframe:
Input:
A B C
Expense 2 3
Sales 5 6
Travel 8 9
My Expected Output is:
A B C
Expense 2 3
Sales 5 6
Travel 8 9
Total Exp 10 12
The last row is basically the total of row 1 and row 3. This is a very simplified example; I actually have to perform complex calculations on a huge dataframe.
Is there a way in Python to perform such a calculation?
You can select rows by position with DataFrame.iloc and sum them, then assign the result to a new row:
df.loc[len(df.index)] = df.iloc[0] + df.iloc[2]
Or:
df.loc[len(df.index)] = df.iloc[[0,2]].sum()
print (df)
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 8 10 12
EDIT: The first idea is to create an index from column A, so you can use loc with the new value of A; the last step converts the index back to a column with reset_index:
df = df.set_index('A')
df.loc['Total Exp'] = df.iloc[[0,2]].sum()
df = df.reset_index()
print (df)
A B C
0 Expense 2 3
1 Sales 5 6
2 Travel 8 9
3 Total Exp 10 12
Similarly, you can select by label with loc, here Expense and Travel:
df = df.set_index('A')
df.loc['Total Exp'] = df.loc[['Expense', 'Travel']].sum()
df = df.reset_index()
print (df)
A B C
0 Expense 2 3
1 Sales 5 6
2 Travel 8 9
3 Total Exp 10 12
Or you can filter out the first column with 1: and add the value back with Series.reindex:
df.loc[len(df.index)] = df.iloc[[0,2], 1:].sum().reindex(df.columns, fill_value='Total Exp')
print (df)
A B C
0 Expense 2 3
1 Sales 5 6
2 Travel 8 9
3 Total Exp 10 12
Or you can set value of A separately:
s = df.iloc[[0,2]].sum()
s.loc['A'] = 'Total Exp'
df.loc[len(df.index)] = s
print (df)
A B C
0 Expense 2 3
1 Sales 5 6
2 Travel 8 9
3 Total Exp 10 12
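If the same kind of total is needed in several places, the label-based variant can be wrapped in a small function. This is only a sketch: add_total_row and its parameters are hypothetical names, not part of pandas.

```python
import pandas as pd

def add_total_row(frame, labels, name, key="A"):
    """Return a copy of frame with one extra row holding the sum of the
    rows whose key column matches labels (hypothetical helper)."""
    indexed = frame.set_index(key)                   # new frame; input is untouched
    indexed.loc[name] = indexed.loc[labels].sum()    # append the total by label
    return indexed.reset_index()

df = pd.DataFrame({
    "A": ["Expense", "Sales", "Travel"],
    "B": [2, 5, 8],
    "C": [3, 6, 9],
})

out = add_total_row(df, ["Expense", "Travel"], "Total Exp")
print(out)
```

Selecting by label rather than by position keeps the calculation correct even if the rows are reordered later.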

Want to join last row of two dataframe on condition

quantity:
a b c
3 1 nan
3 2 8
7 5 9
4 8 nan
price
34
I have two dataframes, quantity and price, and I want to join the last row of the quantity dataframe where c is not nan to price.
I wrote this query but didn't get the desired output:
price = pd.concat(price,quantity["a","b","c"].tail(1).isnotnull())
What I want looks like:
price a b c
34 7 5 9
If your dfs are these:
df = pd.DataFrame([[3,1,np.nan], [3,2,8], [7,5,9], [4,8,np.nan]], columns=['a','b','c'])
df2 = pd.DataFrame([34], columns=['price'])
You can do it this way:
final_df = pd.concat([df.dropna(subset=['c']).tail(1).reset_index(drop=True), df2], axis=1)
Output:
a b c price
0 7 5 9.0 34
I believe you need to remove the missing values and then take the last row; the double [] in iloc[[-1]] returns a one-row DataFrame:
df=pd.concat([price.reset_index(drop=True),
quantity[["a","b","c"]].dropna(subset=['c']).iloc[[-1]].reset_index(drop=True)],
axis=1)
print (df)
price a b c
0 34 7 5 9.0
Detail:
print (quantity[["a","b","c"]].dropna().iloc[[-1]])
a b c
2 7 5 9.0
I would filter the df to the rows where c is not null, then simply add the price to it:
new_df = df[df['c'].notnull()].copy()  # .copy() avoids SettingWithCopyWarning on the assignment below
Where c is your column name.
new_df['price'] = 32 # or the price from your df
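Putting the dropna, tail and concat steps from the answers together, a minimal runnable sketch on the data from the question could look like this (last_valid and result are illustrative names):

```python
import numpy as np
import pandas as pd

quantity = pd.DataFrame(
    [[3, 1, np.nan], [3, 2, 8], [7, 5, 9], [4, 8, np.nan]],
    columns=["a", "b", "c"],
)
price = pd.DataFrame({"price": [34]})

# Keep rows where c is present, take the last one, and reset its label to 0
# so it lines up with the single row of price.
last_valid = quantity.dropna(subset=["c"]).tail(1).reset_index(drop=True)

# Side-by-side concat aligns on the index, which is 0 in both frames.
result = pd.concat([price, last_valid], axis=1)
print(result)
```

Without reset_index(drop=True) the selected row would keep its original label 2, and concat would produce two half-empty rows instead of one.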
