I have a dataframe that looks like this:
Name rent sale
0 A 180 2
1 B 1 4
2 M 12 1
3 O 10 1
4 A 180 5
5 M 2 19
I want to apply the following condition: if duplicate rows share the same value in a column, keep only one of them (without making the sum). Example: the duplicate rows for A share the value 180 in the rent column, so 180 is kept once.
Otherwise, make the sum. Example: the duplicate rows for A have different values 2 and 5 in the sale column, and the duplicate rows for M have different values in both the rent and sale columns, so those are summed.
Expected output:
Name rent sale
0 A 180 7
1 B 1 4
2 M 14 20
3 O 10 1
I tried this code, but it's not working as I want:
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'M', 'O', 'A', 'M'],
                   'rent': [180, 1, 12, 10, 180, 2],
                   'sale': [2, 4, 1, 1, 5, 19]})
df2 = df.drop_duplicates().groupby('Name', sort=False, as_index=False).agg(
    Name=('Name', 'first'),
    rent=('rent', 'sum'),
    sale=('sale', 'sum'))
print(df2)
I got this output:
Name rent sale
0 A 360 7
1 B 1 4
2 M 14 20
3 O 10 1
You can try summing only the unique values per group:
def sum_unique(s):
    # sum only the distinct values in the group, so repeated values count once
    return s.unique().sum()

df2 = df.groupby('Name', sort=False, as_index=False).agg(
    Name=('Name', 'first'),
    rent=('rent', sum_unique),
    sale=('sale', sum_unique)
)
df2:
Name rent sale
0 A 180 7
1 B 1 4
2 M 14 20
3 O 10 1
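One caveat worth noting (this concerns data not shown in the question): sum_unique collapses equal values even when they come from separate rows that should arguably be summed. A tiny illustration:

# two occurrences of the same value within one group count only once
s = pd.Series([1, 1])
print(sum_unique(s))  # 1, not 2

This matches the question's rule as stated, but it applies per value, not per duplicate row pair.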
You can first group by Name and rent, and then just by Name:
df2 = df.groupby(['Name', 'rent'], as_index=False).sum().groupby('Name', as_index=False).sum()
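For clarity, here is the same pipeline split into its two steps, a sketch using the df defined above:

# step 1: collapse rows that share both Name and rent; sale is summed,
# so the duplicate (A, 180) rows become one row with sale = 2 + 5 = 7
step1 = df.groupby(['Name', 'rent'], as_index=False).sum()
# step 2: sum the remaining rows per Name, combining M's two rent values
df2 = step1.groupby('Name', as_index=False).sum()
print(df2)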
I have two dataframes, just like below.
Dataframe1:
country  type  start_week  end_week
1        a     12          13
2        b     13          14
Dataframe2:
country  type  week  value
1        a     12    1000
1        a     13    900
1        a     14    800
2        b     12    1000
2        b     13    900
2        b     14    800
I want to add to the first dataframe a column with the mean value from the second dataframe, matched on the key (country + type) and restricted to the weeks between start_week and end_week.
The desired output looks like this:
country  type  start_week  end_week  avg
1        a     12          13        950
2        b     13          14        850
Here is one way:
combined = df1.merge(df2, on=['country', 'type'])
combined = combined.loc[(combined.start_week <= combined.week) & (combined.week <= combined.end_week)]
output = combined.groupby(['country', 'type', 'start_week', 'end_week'])['value'].mean().reset_index()
Output:
country type start_week end_week value
0 1 a 12 13 950.0
1 2 b 13 14 850.0
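The aggregated column keeps the name value; to match the desired output, it can be renamed as a final step:

output = output.rename(columns={'value': 'avg'})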
You can use pd.melt and comparison of numpy arrays.
# melt df1 so start_week and end_week become rows in a single 'week' column
melted_df1 = df1.melt(id_vars=['country', 'type'], value_name='week')[['country', 'type', 'week']]
# for loop comparing the two dataframes' rows; note this matches only weeks equal
# to start_week or end_week, which suffices here because each range spans
# consecutive weeks
result = []
for i in df2.values:
    for j in melted_df1.values:
        if (j == i[:3]).all():
            result.append(i)
            break
# compute the mean of the matched rows per type
result_df = pd.DataFrame(result, columns=df2.columns).groupby('type')['value'].mean().reset_index()['value']
# assign the result to df1
df1['avg'] = result_df
Output:
country type start_week end_week avg
0 1 a 12 13 950.0
1 2 b 13 14 850.0
I have a sorted dataframe with an ID and a value column, which looks like:
ID value
A 10
A 10
A 10
B 15
B 15
C 10
C 10
...
How can I create a new dataframe that counts the "new" distinct values as a function of the number of different IDs, so that it basically walks over my dataframe and looks like:
Number of ID Number of distinct values
1 1
2 2
3 2
In the case above we have 3 different IDs, but IDs A and C have the same value.
So for the first row in the new dataframe:
Number of ID = 1, because we have 1 different ID so far.
Number of distinct values = 1, because we have one distinct value so far.
Second row:
Number of ID = 2, because we move on to row 4 in the old dataframe (we are only interested in new IDs).
Number of distinct values = 2, because the value changed to 15, which hasn't occurred so far.
I think you need to process a new DataFrame created by DataFrame.drop_duplicates: replace the duplicated values with NaN, forward fill them, and then call pd.factorize:
df1 = df.drop_duplicates(['ID','value']).copy()
df1['Number of ID'] = range(1, len(df1)+1)
df1['Number of distinct values'] = pd.factorize(df1['value'].mask(df1['value'].duplicated()).ffill())[0] + 1
print (df1)
ID value Number of ID Number of distinct values
0 A 10 1 1
3 B 15 2 2
5 C 10 3 2
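To see what the chained expression computes, the intermediate steps can be split out, a sketch using df1 from above:

masked = df1['value'].mask(df1['value'].duplicated())  # first occurrence kept, repeats become NaN
filled = masked.ffill()                                # carry the last newly seen value forward
df1['Number of distinct values'] = pd.factorize(filled)[0] + 1  # number the distinct values in order of appearance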
I changed the data for better testing:
print (df)
ID value
0 A 10
1 A 10
2 A 10
3 B 15
4 B 15
5 C 10
6 C 15
df1 = df.drop_duplicates(['ID','value']).copy()
df1['Number of ID'] = range(1, len(df1)+1)
df1['Number of distinct values'] = pd.factorize(df1['value'].mask(df1['value'].duplicated()).ffill())[0] + 1
print (df1)
ID value Number of ID Number of distinct values
0 A 10 1 1
3 B 15 2 2
5 C 10 3 2
6 C 15 4 2
This cumsum-based alternative works incorrectly if there are multiple values per ID, because cumsum re-adds a value's factorize code on every repeat instead of counting each value once; note the 3 in the last row of the output below, where the correct count is 2:

import numpy as np

df = pd.DataFrame({'Number of ID': range(1, len(df1) + 1),
                   'Number of distinct values': np.cumsum(pd.factorize(df1['value'])[0]) + 1})
print (df)
Number of ID Number of distinct values
0 1 1
1 2 2
2 3 2
3 4 3
I have the below pandas dataframe:
Input:
A B C
Expense 2 3
Sales 5 6
Travel 8 9
My Expected Output is:
A B C
Expense 2 3
Sales 5 6
Travel 8 9
Total Exp 10 12
The last row is basically the total of row 1 and row 3. This is a very simplified example; I actually have to perform complex calculations on a huge dataframe.
Is there a way in Python to perform such a calculation?
You can select rows by position with DataFrame.iloc, sum them, and then assign the result as a new row (demonstrated here on a fully numeric frame):
df.loc[len(df.index)] = df.iloc[0] + df.iloc[2]
Or:
df.loc[len(df.index)] = df.iloc[[0,2]].sum()
print (df)
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 8 10 12
EDIT: The first idea is to create an index from column A, so you can use loc with a new value of A; the last step converts the index back to a column with reset_index:
df = df.set_index('A')
df.loc['Total Exp'] = df.iloc[[0,2]].sum()
df = df.reset_index()
print (df)
A B C
0 Expense 2 3
1 Sales 5 6
2 Travel 8 9
3 Total Exp 10 12
Something similar is possible by selecting with loc by labels, here Expense and Travel:
df = df.set_index('A')
df.loc['Total Exp'] = df.loc[['Expense', 'Travel']].sum()
df = df.reset_index()
print (df)
A B C
0 Expense 2 3
1 Sales 5 6
2 Travel 8 9
3 Total Exp 10 12
Or you can exclude the first column with the slice 1: and add the A value back with Series.reindex:
df.loc[len(df.index)] = df.iloc[[0,2], 1:].sum().reindex(df.columns, fill_value='Total Exp')
print (df)
A B C
0 Expense 2 3
1 Sales 5 6
2 Travel 8 9
3 Total Exp 10 12
Or you can set the value of A separately:
s = df.iloc[[0,2]].sum()
s.loc['A'] = 'Total Exp'
df.loc[len(df.index)] = s
print (df)
A B C
0 Expense 2 3
1 Sales 5 6
2 Travel 8 9
3 Total Exp 10 12
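Since the real use case involves complex calculations on a huge dataframe, hard-coded positions may not scale. Here is a sketch that selects the rows to total with a boolean mask on A instead, assuming the same df as above:

# pick the rows to total by label rather than by position
mask = df['A'].isin(['Expense', 'Travel'])
total = df.loc[mask, ['B', 'C']].sum()
total['A'] = 'Total Exp'
df.loc[len(df.index)] = total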
Suppose I have this DF:
s1 = pd.Series([1,1,2,2,2,3,3,3,4])
s2 = pd.Series([10,20,10,5,10,7,7,3,10])
s3 = pd.Series([0,0,0,0,1,1,0,2,0])
df = pd.DataFrame([s1,s2,s3]).transpose()
df.columns = ['id','qual','nm']
df
id qual nm
0 1 10 0
1 1 20 0
2 2 10 0
3 2 5 0
4 2 10 1
5 3 7 1
6 3 7 0
7 3 3 2
8 4 10 0
I want to get a new DF in which there are no duplicate ids, so there should be 4 rows with ids 1, 2, 3, 4. The row to keep should be chosen based on the following criteria: take the one with the smallest nm; if equal, take the one with the largest qual; if still equal, just choose one.
I figure that my code should look something like:
df.groupby('id').apply(lambda x: ???)
And it should return:
id qual nm
0 1 20 0
1 2 10 0
2 3 7 0
3 4 10 0
But not sure what my function should take and return.
Or possibly there is an easier way?
Thanks!
Use boolean indexing with GroupBy.transform, first to keep the rows with the minimum nm per group, then the rows with the maximum qual; last, if dupes remain, remove them with DataFrame.drop_duplicates:
# keep the rows with the minimal nm per id
df1 = df[df['nm'] == df.groupby('id')['nm'].transform('min')]
# of those, keep the rows with the maximal qual per id
df1 = df1[df1['qual'] == df1.groupby('id')['qual'].transform('max')]
# if dupes remain, keep the first row per id
df1 = df1.drop_duplicates('id')
print (df1)
id qual nm
1 1 20 0
2 2 10 0
6 3 7 0
8 4 10 0
Use:
grouper = df.groupby(['id'])
df.loc[(grouper['nm'].transform('min') == df['nm']) &
       (grouper['qual'].transform('max') == df['qual']), :].drop_duplicates(subset=['id'])
Output
id qual nm
1 1 20 0
2 2 10 0
6 3 7 0
8 4 10 0
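A possibly simpler alternative, a sketch on the same df: sort by the tie-break criteria, then keep the first row per id:

df2 = (df.sort_values(['nm', 'qual'], ascending=[True, False])
         .drop_duplicates('id')
         .sort_values('id')
         .reset_index(drop=True))
print(df2)

Sorting nm ascending and qual descending puts the preferred row first within each id, so drop_duplicates('id') keeps exactly the row the criteria describe.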
I want to apply a custom operation on a column by grouping the values on another column: group by the column to get the count, then divide the other column's value by this count for all the grouped records.
My Data Frame:
emp opp amount
0 a 1 10
1 b 1 10
2 c 2 30
3 b 2 30
4 d 2 30
My scenario:
For opp=1, two emps worked (a, b), so the amount should be shared:
10/2 = 5
For opp=2, three emps worked (b, c, d), so the amount should be:
30/3 = 10
Final Output DataFrame:
emp opp amount
0 a 1 5
1 b 1 5
2 c 2 10
3 b 2 10
4 d 2 10
What is the best possible way to do this?
df['amount'] = df.groupby('opp')['amount'].transform(lambda g: g/g.size)
df
# emp opp amount
# 0 a 1 5
# 1 b 1 5
# 2 c 2 10
# 3 b 2 10
# 4 d 2 10
Or:
df['amount'] = df.groupby('opp')['amount'].apply(lambda g: g/g.size)
does a similar thing.
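An equivalent, fully vectorized variant avoids the Python-level lambda, a sketch on the same frame:

# divide each amount by the size of its opp group
df['amount'] = df['amount'] / df.groupby('opp')['amount'].transform('size')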
You could try something like this:
df2 = df.groupby('opp').amount.count()
df.loc[:, 'calculated'] = df.apply(
    lambda row: row.amount / df2.loc[row.opp], axis=1)  # .loc replaces the removed .ix
df
Yields:
emp opp amount calculated
0 a 1 10 5
1 b 1 10 5
2 c 2 30 10
3 b 2 30 10
4 d 2 30 10