Pandas: how to combine two rows in a group with complex rules/conditions - python

I have a dataframe:
import pandas as pd

df = pd.DataFrame({
    'ID': ['company A', 'company A', 'company A', 'company B', 'company B', 'company B',
           'company C', 'company C', 'company C', 'company C', 'company D', 'company D', 'company D'],
    'Sender': [28, 'remove1', 'flag_source', 56, 28, 312, 'remove2', 'flag_source', 78, 102, 26, 101, 96],
    'Receiver': [129, 28, 'remove1', 172, 56, 28, 61, 'remove2', 12, 78, 98, 26, 101],
    'Date': ['2020-04-12', '2020-03-20', '2020-03-20', '2019-02-11', '2019-01-31', '2018-04-02',
             '2020-06-29', '2020-06-29', '2019-11-29', '2019-10-01', '2020-04-03', '2020-01-30', '2019-10-18'],
    'Sender_type': ['house', 'temp', 'house', 'house', 'house', 'house', 'temp', 'house',
                    'house', 'house', 'house', 'temp', 'house'],
    'Receiver_type': ['house', 'house', 'temp', 'house', 'house', 'house', 'house', 'temp',
                      'house', 'house', 'house', 'house', 'temp'],
    'Price': [32, 50, 47, 21, 23, 19, 52, 39, 12, 22, 61, 53, 19]
})
The df looks like this:
ID Sender Receiver Date Sender_type Receiver_type Price
0 company A 28 129 2020-04-12 house house 32
1 company A remove1 28 2020-03-20 temp house 50 # combine this row with below
2 company A flag_source remove1 2020-03-20 house temp 47 # combine this row with above
3 company B 56 172 2019-02-11 house house 21
4 company B 28 56 2019-01-31 house house 23
5 company B 312 28 2018-04-02 house house 19
6 company C remove2 61 2020-06-29 temp house 52 # combine this row and below
7 company C flag_source remove2 2020-06-29 house temp 39 # combine this row with above
8 company C 78 12 2019-11-29 house house 12
9 company C 102 78 2019-10-01 house house 22
10 company D 26 98 2020-04-03 house house 61
11 company D 101 26 2020-01-30 temp house 53
12 company D 96 101 2019-10-18 house temp 19
I wish to combine/merge two rows within each 'ID' group (company x) by the following rule: combine the row whose 'Sender' contains 'flag_source' with the row above it into one new row. In this new row: 'Sender' is 'flag_source', 'Receiver' is the above row's 'Receiver' (the two 'remove' values are dropped), 'Date' is the above row's date, 'Sender_type' and 'Receiver_type' are both 'house', and 'Price' is the above row's price. Then remove the two original rows. For example, for company A it will combine lines 1 and 2 to generate the new row below:
ID Sender Receiver Date Sender_type Receiver_type Price
company A flag_source 28 2020-03-20 house house 50
Then use this new row to replace the previous two lines. The same rule applies to the other groups (in this case it only affects companies A and C). In the end, I wish to have a result like this:
ID Sender Receiver Date Sender_type Receiver_type Price
0 company A 28 129 2020-04-12 house house 32
1 company A flag_source 28 2020-03-20 house house 50 # new row
2 company B 56 172 2019-02-11 house house 21
3 company B 28 56 2019-01-31 house house 23
4 company B 312 28 2018-04-02 house house 19
5 company C flag_source 61 2020-06-29 house house 52 # new row
6 company C 78 12 2019-11-29 house house 12
7 company C 102 78 2019-10-01 house house 22
8 company D 26 98 2020-04-03 house house 61
9 company D 101 26 2020-01-30 temp house 53
10 company D 96 101 2019-10-18 house temp 19
Hopefully my explanation of the question is clear.
This is a brief sample; the real data set contains many rows like this. I wrote a loop, but it is very slow and unproductive, so please help if you have any ideas for an effective approach. Many thanks!

I believe the following works:
mask = df.Sender == 'flag_source'
df[mask] = df.shift()
df.loc[mask, 'Sender'] = 'flag_source'
df.loc[mask, ['Sender_type','Receiver_type']] = 'house'
df = df[~mask.shift(-1).fillna(False).astype(bool)].reset_index(drop=True)
So the steps are (by line):
make a mask of the rows you need to change
set those rows equal to the previous row with shift
rewrite Sender for those rows to 'flag_source'
also rewrite Sender_type and Receiver_type
remove the previous rows by using shift again on the mask. This seems a little convoluted; you could also do a loc against rows that don't contain the string 'remove', as in the sketch below.
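For example, the last line could be swapped for a filter on the marker string (a sketch, assuming every placeholder value starts with the literal string 'remove'):
# drop the leftover rows whose Sender still holds a 'remove' placeholder
df = df[~df['Sender'].astype(str).str.startswith('remove')].reset_index(drop=True)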
Output:
ID Sender Receiver Date Sender_type Receiver_type Price
0 company A 28 129 2020-04-12 house house 32.0
1 company A flag_source 28 2020-03-20 house house 50.0
2 company B 56 172 2019-02-11 house house 21.0
3 company B 28 56 2019-01-31 house house 23.0
4 company B 312 28 2018-04-02 house house 19.0
5 company C flag_source 61 2020-06-29 house house 52.0
6 company C 78 12 2019-11-29 house house 12.0
7 company C 102 78 2019-10-01 house house 22.0
8 company D 26 98 2020-04-03 house house 61.0
9 company D 101 26 2020-01-30 temp house 53.0
10 company D 96 101 2019-10-18 house temp 19.0


Fill the empty cells from their non-empty neighbours based on another column - Pandas

Good afternoon,
What is the simplest way to replace empty values in one column with another value when the text in another column is equal?
For example, we have the dataframe:
Name    Life    Score
Joe     789     45
Joe     563     13
Nick            24
Nick    45      155
Alice   188     34
Alice           43
Kate            43543
Kate            232
And the result should be:
Name    Life    Score
Joe     789     45
Joe     563     13
Nick    45      24
Nick    45      155
Alice   188     34
Alice   188     43
Kate            43543
Kate            232
Thanks for all help!
You can do it like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Joe', 'Joe', 'Nick', 'Nick', 'Alice', 'Alice', 'Kate', 'Kate'],
    'Life': [789.0, np.nan, np.nan, 45.0, 188.0, np.nan, np.nan, np.nan],
    'Score': [45, 13, 24, 155, 34, 43, 43543, 232]
})
print(df)
# output:
# Name Life Score
# 0 Joe 789.0 45
# 1 Joe NaN 13
# 2 Nick NaN 24
# 3 Nick 45.0 155
# 4 Alice 188.0 34
# 5 Alice NaN 43
# 6 Kate NaN 43543
# 7 Kate NaN 232
df['Life'] = df.groupby('Name')['Life'].transform('first')
print(df)
# output:
# Name Life Score
# 0 Joe 789.0 45
# 1 Joe 789.0 13
# 2 Nick 45.0 24
# 3 Nick 45.0 155
# 4 Alice 188.0 34
# 5 Alice 188.0 43
# 6 Kate NaN 43543
# 7 Kate NaN 232
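Note that 'first' takes the first non-null value in each group, which is why Nick gets 45 and Kate stays NaN (her group has no value at all). If the known value can sit anywhere in the group, an alternative sketch is to fill in both directions within each group:
# forward- and back-fill within each Name group
df['Life'] = df.groupby('Name')['Life'].transform(lambda s: s.ffill().bfill())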

How can I create a loop to merge two DataFrames?

I have two DataFrames:
name     age  weight  sex  d_type
john     21   56      M    futboll
martha   25   43      F    soccer
esthela  29   53      F    judo
harry    18   72      M    karate
irving   24   61      M    karate
jerry    21   56      M    soccer
john_2   26   69      M    futboll
malina   22   53      F    soccer
And
d_type   impact  founds_in
futboll  high    federal
soccer   medium  state
judo     medium  federal
karate   high    federal
In the end, I want a DF like this:
name     age  weight  sex  d_type   impact  founds_in
john     21   56      M    futboll  high    federal
martha   25   43      F    soccer   medium  state
esthela  29   53      F    judo     medium  federal
harry    18   72      M    karate   high    federal
irving   24   61      M    karate   high    federal
jerry    21   56      M    soccer   medium  state
john_2   26   69      M    futboll  high    federal
malina   22   53      F    soccer   medium  state
How can I do this in pandas? Do I need a loop, or is it better to do it in Linux?
In Python:
df1 = pd.DataFrame({'name': ['john', 'martha'],
                    'age': [21, 25],
                    'd_type': ['futbol', 'soccer']})
df2 = pd.DataFrame({'d_type': ['futbol', 'soccer'],
                    'impact': ['high', 'medium'],
                    'founds_in': ['federal', 'state']})
df1.merge(df2, on='d_type').set_index('name')
which gives:
        age  d_type  impact founds_in
name
john     21  futbol    high   federal
martha   25  soccer  medium     state
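Note that merge defaults to an inner join, so any row of df1 whose d_type is missing from df2 would be dropped. If that can happen in your data, a small sketch with the same frames keeps those rows with NaN instead:
# a left merge keeps every row of df1 even when d_type has no match in df2
df1.merge(df2, on='d_type', how='left')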

How to add a row to every group with pandas groupby?

I wish to add a new row as the first line within each group. My raw dataframe is:
df = pd.DataFrame({
    'ID': ['James', 'James', 'James', 'Max', 'Max', 'Max', 'Max', 'Park', 'Tom', 'Tom', 'Tom', 'Tom', 'Wong'],
    'From_num': [78, 420, 'Started', 298, 36, 298, 'Started', 'Started', 60, 520, 99, 'Started', 'Started'],
    'To_num': [96, 78, 420, 36, 78, 36, 298, 311, 150, 520, 78, 99, 39],
    'Date': ['2020-05-12', '2020-02-02', '2019-06-18',
             '2019-06-20', '2019-01-30', '2018-10-23',
             '2018-08-29', '2020-05-21', '2019-11-22',
             '2019-08-26', '2018-12-11', '2018-10-09', '2019-02-01']})
it is like this:
ID From_num To_num Date
0 James 78 96 2020-05-12
1 James 420 78 2020-02-02
2 James Started 420 2019-06-18
3 Max 298 36 2019-06-20
4 Max 36 78 2019-01-30
5 Max 298 36 2018-10-23
6 Max Started 298 2018-08-29
7 Park Started 311 2020-05-21
8 Tom 60 150 2019-11-22
9 Tom 520 520 2019-08-26
10 Tom 99 78 2018-12-11
11 Tom Started 99 2018-10-09
12 Wong Started 39 2019-02-01
For each person ('ID'), I wish to create a new duplicate row as the first row within each group ('ID'). The values of the created row in columns 'ID', 'From_num' and 'To_num' should be the same as the previous first row, but the 'Date' value should be the old first row's date plus one day, e.g. for James the newly created row's values are 'James', '78', '96', '2020-05-13'. The same applies to the rest of the data, so my expected result is:
ID From_num To_num Date
0 James 78 96 2020-05-13 # row added, Date + 1
1 James 78 96 2020-05-12
2 James 420 78 2020-02-02
3 James Started 420 2019-06-18
4 Max 298 36 2019-06-21 # row added, Date + 1
5 Max 298 36 2019-06-20
6 Max 36 78 2019-01-30
7 Max 298 36 2018-10-23
8 Max Started 298 2018-08-29
9 Park Started 311 2020-05-22 # Row added, Date + 1
10 Park Started 311 2020-05-21
11 Tom 60 150 2019-11-23 # Row added, Date + 1
12 Tom 60 150 2019-11-22
13 Tom 520 520 2019-08-26
14 Tom 99 78 2018-12-11
15 Tom Started 99 2018-10-09
16 Wong Started 39 2019-02-02 # Row added Date + 1
17 Wong Started 39 2019-02-01
I wrote some loop conditions, but they are quite slow. If you have any good ideas, please help. Thanks a lot.
Let's try groupby.apply here. We'll append a row to each group at the start, like this:
def augment_group(group):
    # duplicate the group's first row and bump its Date by one day
    first_row = group.iloc[[0]].copy()
    first_row['Date'] += pd.Timedelta(days=1)
    return pd.concat([first_row, group])

df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
(df.groupby('ID', as_index=False, group_keys=False)
   .apply(augment_group)
   .reset_index(drop=True))
ID From_num To_num Date
0 James 78 96 2020-05-13
1 James 78 96 2020-05-12
2 James 420 78 2020-02-02
3 James Started 420 2019-06-18
4 Max 298 36 2019-06-21
5 Max 298 36 2019-06-20
6 Max 36 78 2019-01-30
7 Max 298 36 2018-10-23
8 Max Started 298 2018-08-29
9 Park Started 311 2020-05-22
10 Park Started 311 2020-05-21
11 Tom 60 150 2019-11-23
12 Tom 60 150 2019-11-22
13 Tom 520 520 2019-08-26
14 Tom 99 78 2018-12-11
15 Tom Started 99 2018-10-09
16 Wong Started 39 2019-02-02
17 Wong Started 39 2019-02-01
Although I agree with @Joran Beasley in the comments that this feels like somewhat of an XY problem. Perhaps try clarifying the problem you're trying to solve, instead of asking how to implement what you think is the solution to your issue?
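If apply proves slow on a large frame, here is a vectorized sketch of the same idea (assuming, as in the sample, that each ID's rows are contiguous so drop_duplicates keeps the first row of each group, and that 'Date' has already been converted to datetime):
# take the first row of each ID, bump its Date by one day,
# then put it back in front of its group
first = df.drop_duplicates('ID').copy()
first['Date'] += pd.Timedelta(days=1)
out = (pd.concat([first, df])
       .sort_index(kind='stable')  # stable sort keeps each new row before its original
       .reset_index(drop=True))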

Pandas DataFrame: how to fillna object types with one of the values in the columns?

For example, I have a df:
import numpy as np
import pandas as pd

data = {'product': ['coca', np.nan, 'pepsi', 'pepsi', 'orange'],
        'price': [25, 94, 57, 62, 70],
        'company': [np.nan, 'coca-cola', np.nan, np.nan, np.nan]}
df = pd.DataFrame(data)
OUT:
company price product
0 NaN 25 coca
1 coca-cola 94 NaN
2 NaN 57 pepsi
3 NaN 62 pepsi
4 NaN 70 orange
I want to fill the NaNs in each object column with a value from the corresponding column.
Expected OUTPUT:
df
company price product
0 coca-cola 25 coca
1 coca-cola 94 coca
2 coca-cola 57 pepsi
3 coca-cola 62 pepsi
4 coca-cola 70 orange
I tried the following:
for col in df:
    # get dtype for column
    dt = df[col].dtype
    # check if it is a number
    if dt == int or dt == float:
        pass
    else:
        df[col].fillna(df[col][0])
But df[col][0] may be NaN, so I should fill with a value that isn't NaN. How can I do it?
Just use ffill and bfill
df.ffill().bfill()
product price company
0 coca 25 coca-cola
1 coca 94 coca-cola
2 pepsi 57 coca-cola
3 pepsi 62 coca-cola
4 orange 70 coca-cola
try this:
for i in range(len(df)):
    # fill company from product, and product from company, row by row
    if pd.isnull(df.loc[i, 'company']) and not pd.isnull(df.loc[i, 'product']):
        df.loc[i, 'company'] = df.loc[i, 'product']
    if not pd.isnull(df.loc[i, 'company']) and pd.isnull(df.loc[i, 'product']):
        df.loc[i, 'product'] = df.loc[i, 'company']
Or you can try this:
df['product'] = df['product'].fillna(df['company'])
df['company'] = df['company'].fillna(df['product'])
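If the goal is strictly to fill every NaN in an object column with that column's first non-null value (what the df[col][0] attempt was aiming for), a sketch using fillna with a per-column dict:
# first non-null value of each object column, skipping all-NaN columns
fill_values = {col: df[col].dropna().iloc[0]
               for col in df.select_dtypes('object')
               if df[col].notna().any()}
df = df.fillna(fill_values)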

Pandas Python Groupby Cumulative Sum Reverse

I found Pandas groupby cumulative sum and found it very useful.
The link suggests the following.
df.groupby(by=['name','day']).sum().groupby(level=[0]).cumsum()
In order to reverse sum, I tried slicing the data, but it fails.
df.groupby(by=['name','day']).ix[::-1, 'no'].sum().groupby(level=[0]).cumsum()
The desired output is (the last column is the reversed cumulative sum of no):
Jack | Monday | 10 | 90
Jack | Tuesday | 30 | 80
Jack | Wednesday | 50 | 50
Jill | Monday | 40 | 80
Jill | Wednesday | 40 | 40
EDIT:
Based on feedback, I tried to implement the code and make the dataframe larger:
import pandas as pd
df = pd.DataFrame(
    {'name': ['Jack', 'Jack', 'Jack', 'Jill', 'Jill'],
     'surname': ['Jones', 'Jones', 'Jones', 'Smith', 'Smith'],
     'car': ['VW', 'Mazda', 'VW', 'Merc', 'Merc'],
     'country': ['UK', 'US', 'UK', 'EU', 'EU'],
     'year': [1980, 1980, 1980, 1980, 1980],
     'day': ['Monday', 'Tuesday', 'Wednesday', 'Monday', 'Wednesday'],
     'date': ['2016-02-31', '2016-01-31', '2016-01-31', '2016-01-31', '2016-01-31'],
     'no': [10, 30, 50, 40, 40],
     'qty': [100, 500, 200, 433, 222]})
I then try to group on a number of columns, but it fails to apply the grouping:
df = df.groupby(by=['name','surname','car','country','year','day','date']).sum().iloc[::-1].groupby(level=[0]).cumsum().iloc[::-1].reset_index()
Why is this the case? I expect Jack Jones with the Mazda to be a separate cumulative quantity from Jack Jones with the VW.
You can use double iloc:
df = df.groupby(by=['name','day']).sum().iloc[::-1].groupby(level=[0]).cumsum().iloc[::-1]
print (df)
no
name day
Jack Monday 90
Tuesday 80
Wednesday 50
Jill Monday 80
Wednesday 40
For another column the solution is simpler:
df = df.groupby(by=['name','day']).sum()
df['new'] = df.iloc[::-1].groupby(level=[0]).cumsum()
print (df)
no new
name day
Jack Monday 10 90
Tuesday 30 80
Wednesday 50 50
Jill Monday 40 80
Wednesday 40 40
EDIT:
The problem is in the second groupby: you need to pass more levels - level=[0,1,2] means grouping by the first (name), second (surname) and third (car) index levels.
df1 = (df.groupby(by=['name','surname','car','country','year','day','date'])
.sum())
print (df1)
no qty
name surname car country year day date
Jack Jones Mazda US 1980 Tuesday 2016-01-31 30 500
VW UK 1980 Monday 2016-02-31 10 100
Wednesday 2016-01-31 50 200
Jill Smith Merc EU 1980 Monday 2016-01-31 40 433
Wednesday 2016-01-31 40 222
df2 = (df.groupby(by=['name','surname','car','country','year','day','date'])
.sum()
.iloc[::-1]
.groupby(level=[0,1,2])
.cumsum()
.iloc[::-1]
.reset_index())
print (df2)
name surname car country year day date no qty
0 Jack Jones Mazda US 1980 Tuesday 2016-01-31 30 500
1 Jack Jones VW UK 1980 Monday 2016-02-31 60 300
2 Jack Jones VW UK 1980 Wednesday 2016-01-31 50 200
3 Jill Smith Merc EU 1980 Monday 2016-01-31 80 655
4 Jill Smith Merc EU 1980 Wednesday 2016-01-31 40 222
Or it is possible to select by names - see the groupby enhancements in pandas 0.20.1+:
df2 = (df.groupby(by=['name','surname','car','country','year','day','date'])
.sum()
.iloc[::-1]
.groupby(['name','surname','car'])
.cumsum()
.iloc[::-1]
.reset_index())
print (df2)
name surname car country year day date no qty
0 Jack Jones Mazda US 1980 Tuesday 2016-01-31 30 500
1 Jack Jones VW UK 1980 Monday 2016-02-31 60 300
2 Jack Jones VW UK 1980 Wednesday 2016-01-31 50 200
3 Jill Smith Merc EU 1980 Monday 2016-01-31 80 655
4 Jill Smith Merc EU 1980 Wednesday 2016-01-31 40 222
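The double iloc simply reverses the row order before and after the cumsum. An equivalent sketch that avoids the reversals subtracts the running sum from the group total (total - cumsum + current value gives the sum from the current row to the end of its group):
df1 = df.groupby(by=['name','surname','car','country','year','day','date']).sum()
g = df1.groupby(level=[0, 1, 2])
# sum from the current row to the end of its group
df2 = (g.transform('sum') - g.cumsum() + df1).reset_index()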
