a:
Length 10kN
0 0.0 5
1 0.5 5
2 1.0 5
3 1.5 5
4 2.0 5
5 2.5 5
6 3.0 5
7 3.5 5
8 4.0 5
9 4.5 5
10 5.0 5
11 5.0 -5
12 5.5 -5
13 6.0 -5
14 6.5 -5
15 7.0 -5
16 7.5 -5
17 8.0 -5
18 8.5 -5
19 9.0 -5
20 9.5 -5
21 10.0 -5
b:
Length1 20kN
0 0.0 50
1 0.5 45
2 1.0 40
3 1.5 35
4 2.0 30
5 2.5 25
6 3.0 20
7 3.5 15
8 4.0 10
9 4.5 5
10 5.0 0
11 5.5 -5
12 6.0 -10
13 6.5 -15
14 7.0 -20
15 7.5 -25
16 8.0 -30
17 8.5 -35
18 9.0 -40
19 9.5 -45
20 10.0 -50
c as a result of my code below:
Length 10kN Length1 20kN Total
0 0.0 5 0.0 50.0 55.0
1 0.5 5 0.5 45.0 50.0
2 1.0 5 1.0 40.0 45.0
3 1.5 5 1.5 35.0 40.0
4 2.0 5 2.0 30.0 35.0
5 2.5 5 2.5 25.0 30.0
6 3.0 5 3.0 20.0 25.0
7 3.5 5 3.5 15.0 20.0
8 4.0 5 4.0 10.0 15.0
9 4.5 5 4.5 5.0 10.0
10 5.0 5 5.0 0.0 5.0
11 5.0 -5 5.5 -5.0 -10.0
12 5.5 -5 6.0 -10.0 -15.0
13 6.0 -5 6.5 -15.0 -20.0
14 6.5 -5 7.0 -20.0 -25.0
15 7.0 -5 7.5 -25.0 -30.0
16 7.5 -5 8.0 -30.0 -35.0
17 8.0 -5 8.5 -35.0 -40.0
18 8.5 -5 9.0 -40.0 -45.0
19 9.0 -5 9.5 -45.0 -50.0
20 9.5 -5 10.0 -50.0 -55.0
21 10.0 -5 NaN NaN NaN
Code I tried:
import pandas as pd
a=pd.read_csv("first.csv")
b=pd.read_csv("second.csv")
c=pd.concat([a,b], axis=1)
c['Total']=c['10kN']+c['20kN']
print(c['Total'])
print(a)
print(b)
print(c)
I want to add the two columns 10kN and 20kN for the same length.
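Since a has one more row than b (the duplicated Length 5.0), a positional concat leaves the last row unmatched. One way around this is to merge on the length value instead of on row position. A sketch with small stand-ins for the two CSVs (note that a duplicated length would match every row with that value, so you may need to deduplicate first):

```python
import pandas as pd

# Small stand-ins for first.csv / second.csv, with the column names shown above.
a = pd.DataFrame({"Length": [0.0, 0.5, 1.0], "10kN": [5, 5, 5]})
b = pd.DataFrame({"Length1": [0.0, 0.5, 1.0], "20kN": [50, 45, 40]})

# Align the two tables on the length value rather than by row position,
# so rows with the same length are summed even if the row counts differ.
c = a.merge(b, left_on="Length", right_on="Length1", how="outer")
c["Total"] = c["10kN"] + c["20kN"]
print(c)
```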
This is the base DataFrame:
g_accessor number_opened number_closed
0 49 - 20 3.0 1.0
1 50 - 20 2.0 14.0
2 51 - 20 1.0 6.0
3 52 - 20 0.0 6.0
4 1 - 21 1.0 4.0
5 2 - 21 3.0 5.0
6 3 - 21 4.0 11.0
7 4 - 21 2.0 7.0
8 5 - 21 6.0 10.0
9 6 - 21 2.0 8.0
10 7 - 21 4.0 9.0
11 8 - 21 2.0 3.0
12 9 - 21 2.0 1.0
13 10 - 21 1.0 11.0
14 11 - 21 6.0 3.0
15 12 - 21 3.0 3.0
16 13 - 21 2.0 6.0
17 14 - 21 5.0 9.0
18 15 - 21 9.0 13.0
19 16 - 21 7.0 7.0
20 17 - 21 9.0 4.0
21 18 - 21 3.0 8.0
22 19 - 21 6.0 3.0
23 20 - 21 6.0 1.0
24 21 - 21 3.0 5.0
25 22 - 21 5.0 3.0
26 23 - 21 1.0 0.0
I want to add a calculated new column number_active which relies on previous values. For this I'm trying to use pd.DataFrame.shift(), like this:
# Creating new column and setting all rows to 0
df['number_active'] = 0
# Active from previous period
PREVIOUS_PERIOD_ACTIVE = 22
# Calculating active value for first period in the DataFrame, based on `PREVIOUS_PERIOD_ACTIVE`
df.iat[0,3] = (df.iat[0,1] + PREVIOUS_PERIOD_ACTIVE) - df.iat[0,2]
# Calculating all columns using DataFrame.shift()
df['number_active'] = (df['number_opened'] + df['number_active'].shift(1)) - df['number_closed']
# Recalculating first active value as it was overwritten in the previous step.
df.iat[0,3] = (df.iat[0,1] + PREVIOUS_PERIOD_ACTIVE) - df.iat[0,2]
The result:
g_accessor number_opened number_closed number_active
0 49 - 20 3.0 1.0 24.0
1 50 - 20 2.0 14.0 12.0
2 51 - 20 1.0 6.0 -5.0
3 52 - 20 0.0 6.0 -6.0
4 1 - 21 1.0 4.0 -3.0
5 2 - 21 3.0 5.0 -2.0
6 3 - 21 4.0 11.0 -7.0
7 4 - 21 2.0 7.0 -5.0
8 5 - 21 6.0 10.0 -4.0
9 6 - 21 2.0 8.0 -6.0
10 7 - 21 4.0 9.0 -5.0
11 8 - 21 2.0 3.0 -1.0
12 9 - 21 2.0 1.0 1.0
13 10 - 21 1.0 11.0 -10.0
14 11 - 21 6.0 3.0 3.0
15 12 - 21 3.0 3.0 0.0
16 13 - 21 2.0 6.0 -4.0
17 14 - 21 5.0 9.0 -4.0
18 15 - 21 9.0 13.0 -4.0
19 16 - 21 7.0 7.0 0.0
20 17 - 21 9.0 4.0 5.0
21 18 - 21 3.0 8.0 -5.0
22 19 - 21 6.0 3.0 3.0
23 20 - 21 6.0 1.0 5.0
24 21 - 21 3.0 5.0 -2.0
25 22 - 21 5.0 3.0 2.0
26 23 - 21 1.0 0.0 1.0
Oddly, it seems that only the first active value (index 1) is calculated correctly (since the value at index 0 is calculated independently, via df.iat). For the rest of the values it seems that number_closed is interpreted as a negative value, for some reason.
What am I missing/doing wrong?
You are assuming that the result for the previous row is available when the current row is calculated. This is not how pandas calculations work. Pandas calculations treat each row in isolation, unless you are applying multi-row operations like cumsum and shift.
I would calculate the number active with a minimal example as:
import pandas
df = pandas.DataFrame({'ignore': ['a','b','c','d','e'], 'number_opened': [3,4,5,4,3], 'number_closed': [1,2,2,1,2]})
df['number_active'] = df['number_opened'].cumsum() + 22 - df['number_closed'].cumsum()
This gives a result of:
  ignore  number_opened  number_closed  number_active
0      a              3              1             24
1      b              4              2             26
2      c              5              2             29
3      d              4              1             32
4      e              3              2             33
The code in your question with my minimal example gave:
  ignore  number_opened  number_closed  number_active
0      a              3              1             24
1      b              4              2             26
2      c              5              2              3
3      d              4              1              3
4      e              3              2              1
I have a DataFrame like this:
RANK STA RUN BIB NAME FINISH FINISH.1 FINISH.2 COURSE
0 1 3 3 1 ingenting 3.0 0.00 NaN LØYPE 1
1 2 8 2 3 ingenting 4.0 1.97 NaN LØYPE 3
2 3 9 3 3 ingenting 5.0 2.06 NaN LØYPE 1
3 4 2 2 1 ingenting 6.0 3.21 NaN STRAIGHT-GLIDING
4 5 5 1 2 ingenting 6.0 3.32 NaN LØYPE 1
5 6 1 1 1 ingenting 6.0 3.34 NaN STRAIGHT-GLIDING
6 7 4 4 1 ingenting 6.0 3.43 NaN LØYPE 1
7 8 13 7 3 ingenting 6.0 3.48 NaN STRAIGHT-GLIDING
8 9 12 6 3 ingenting 6.0 3.65 NaN STRAIGHT-GLIDING
9 10 11 5 3 ingenting NaN 4.19 NaN STRAIGHT-GLIDING
10 11 6 2 2 ingenting 7.0 4.20 NaN LØYPE 3
11 12 14 3 2 ingenting 7.0 4.30 NaN STRAIGHT-GLIDING
12 13 10 4 3 ingenting 8.0 5.14 NaN LØYPE 2
13 14 7 1 3 ingenting 8.0 5.75 NaN LØYPE 3
The DataFrame consists of different athletes (BIB) in different courses (COURSE). Each BIB also has his own RUN number. My main interest is the FINISH column. Now I want to obtain the following:
I want to find the first STRAIGHT-GLIDING FINISH time for each BIB.
Next, I want to "store" this as a reference time.
Next, for each observation (14 in this example) I want to compute this BIB's FINISH time minus this BIB's STRAIGHT-GLIDING reference time.
The solution should add a new column with this information for each observation. To give an example: in observation 0 the FINISH time is 3.0 and the first STRAIGHT-GLIDING time for that BIB is 3.21, so I want to create the value 3.0 - 3.21. How can I accomplish this?
Here's my answer. It's a little longer :)
# Create filter for 'STRAIGHT-GLIDING'
sg_filt = df['COURSE'] == 'STRAIGHT-GLIDING'
# Create 'STRAIGHT-GLIDING' only dataframe using filter
sg_only = df.loc[sg_filt].copy()
# Preview new DataFrame
sg_only
Rank STA RUN BIB NAME FINISH FINISH.1 COURSE
3 4 2 2 1 ingenting 6.0 3.21 STRAIGHT-GLIDING
5 6 1 1 1 ingenting 6.0 3.34 STRAIGHT-GLIDING
7 8 13 7 3 ingenting 6.0 3.48 STRAIGHT-GLIDING
8 9 12 6 3 ingenting 6.0 3.65 STRAIGHT-GLIDING
9 10 11 5 3 ingenting NaN 4.19 STRAIGHT-GLIDING
11 12 14 3 2 ingenting 7.0 4.30 STRAIGHT-GLIDING
# Create DataFrame on only first times per BIB
first_times = sg_only[sg_only.groupby(['BIB','COURSE']).cumcount() == 0][['BIB','FINISH']].copy()
# Change column name on first_times dataFrame for merge
first_times.rename(columns={'FINISH':'Reference_Time'},inplace=True)
# Merge original DataFrame with first_times DataFrame to get reference time
final_df = pd.merge(df,first_times,on='BIB',how='left')
Rank STA RUN BIB NAME FINISH FINISH.1 COURSE Reference_Time
0 1 3 3 1 ingenting 3.0 0.00 LØYPE 1 6.0
1 2 8 2 3 ingenting 4.0 1.97 LØYPE 3 6.0
2 3 9 3 3 ingenting 5.0 2.06 LØYPE 1 6.0
3 4 2 2 1 ingenting 6.0 3.21 STRAIGHT-GLIDING 6.0
4 5 5 1 2 ingenting 6.0 3.32 LØYPE 1 7.0
5 6 1 1 1 ingenting 6.0 3.34 STRAIGHT-GLIDING 6.0
6 7 4 4 1 ingenting 6.0 3.43 LØYPE 1 6.0
7 8 13 7 3 ingenting 6.0 3.48 STRAIGHT-GLIDING 6.0
8 9 12 6 3 ingenting 6.0 3.65 STRAIGHT-GLIDING 6.0
9 10 11 5 3 ingenting NaN 4.19 STRAIGHT-GLIDING 6.0
10 11 6 2 2 ingenting 7.0 4.20 LØYPE 3 7.0
11 12 14 3 2 ingenting 7.0 4.30 STRAIGHT-GLIDING 7.0
12 13 10 4 1 ingenting 8.0 5.14 LØYPE 2 6.0
13 14 7 1 1 ingenting 8.0 5.75 LØYPE 3 6.0
# Create FINISH_TIME column
final_df['FINISH_TIME'] = final_df['FINISH'] - final_df['Reference_Time']
Rank STA RUN BIB NAME FINISH FINISH.1 COURSE Reference_Time FINISH_TIME
0 1 3 3 1 ingenting 3.0 0.00 LØYPE 1 6.0 3.0
1 2 8 2 3 ingenting 4.0 1.97 LØYPE 3 6.0 -2.0
2 3 9 3 3 ingenting 5.0 2.06 LØYPE 1 6.0 -1.0
3 4 2 2 1 ingenting 6.0 3.21 STRAIGHT-GLIDING 6.0 0.0
4 5 5 1 2 ingenting 6.0 3.32 LØYPE 1 7.0 -1.0
5 6 1 1 1 ingenting 6.0 3.34 STRAIGHT-GLIDING 6.0 0.0
6 7 4 4 1 ingenting 6.0 3.43 LØYPE 1 6.0 0.0
7 8 13 7 3 ingenting 6.0 3.48 STRAIGHT-GLIDING 6.0 0.0
8 9 12 6 3 ingenting 6.0 3.65 STRAIGHT-GLIDING 6.0 0.0
9 10 11 5 3 ingenting NaN 4.19 STRAIGHT-GLIDING 6.0 NaN
10 11 6 2 2 ingenting 7.0 4.20 LØYPE 3 7.0 0.0
11 12 14 3 2 ingenting 7.0 4.30 STRAIGHT-GLIDING 7.0 0.0
12 13 10 4 1 ingenting 8.0 5.14 LØYPE 2 6.0 2.0
13 14 7 1 1 ingenting 8.0 5.75 LØYPE 3 6.0 2.0
Here's my solution (I hope I understood you correctly):
import pandas as pd
import numpy as np
previousBib = ""
for i in range(df.shape[0]):
    currentBib = df.BIB.to_numpy()[i]
    if currentBib != previousBib:
        instances_BibI = df.loc[df.BIB == currentBib]
        # To ensure that the first gliding finish is the first race with that finish
        instances_BibI = instances_BibI.sort_values(by=["RUN"])
        first_StraightGliding_Finish = instances_BibI.loc[instances_BibI.COURSE == "STRAIGHT-GLIDING"].FINISH_1.to_numpy()[0]
    df.at[i, 'FINISH_2'] = df.iloc[i, 5] - first_StraightGliding_Finish
    previousBib = currentBib
where df is your example dataframe
My sample output (sorted by BIB and RUN) is as follows:
RANK STA RUN BIB NAME FINISH FINISH_1 FINISH_2 COURSE
5 6 1 1 1 Olle 6.0 3.34 2.66 STRAIGHT-GLIDING
3 4 2 2 1 Olle 6.0 3.21 2.66 STRAIGHT-GLIDING
0 1 3 3 1 Olle 3.0 0.00 -0.34 Loop1
6 7 4 4 1 Olle 6.0 3.43 2.66 Loop1
4 5 5 1 2 Olle 6.0 3.32 1.70 Loop1
10 11 6 2 2 Olle 7.0 4.20 2.70 Loop3
11 12 14 3 2 Olle 7.0 4.30 2.70 STRAIGHT-GLIDING
13 14 7 1 3 Olle 8.0 5.75 3.81 Loop3
1 2 8 2 3 Olle 4.0 1.97 -0.19 Loop3
2 3 9 3 3 Olle 5.0 2.06 0.81 Loop1
12 13 10 4 3 Olle 8.0 5.14 3.81 Loop2
9 10 11 5 3 Olle NaN 4.19 NaN STRAIGHT-GLIDING
8 9 12 6 3 Olle 6.0 3.65 1.81 STRAIGHT-GLIDING
7 8 13 7 3 Olle 6.0 3.48 1.81 STRAIGHT-GLIDING
where the FINISH_2 column is the subtracted time value you wanted to calculate
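The same per-BIB lookup can also be done without the explicit loop, using groupby and map. A sketch on a toy frame (the column names FINISH.1 and FINISH.2 follow the question's frame, not the renamed ones above):

```python
import pandas as pd

# Minimal stand-in frame: two BIBs, one STRAIGHT-GLIDING run each.
df = pd.DataFrame({
    "RUN": [3, 2, 1, 6],
    "BIB": [1, 1, 2, 2],
    "FINISH.1": [0.00, 3.21, 3.34, 4.20],
    "COURSE": ["LØYPE 1", "STRAIGHT-GLIDING", "STRAIGHT-GLIDING", "LØYPE 3"],
})

# Reference time per BIB: the first STRAIGHT-GLIDING time, taking "first" in RUN order.
ref = (df[df["COURSE"] == "STRAIGHT-GLIDING"]
       .sort_values("RUN")
       .groupby("BIB")["FINISH.1"]
       .first())

# Subtract each row's per-BIB reference time from its own time.
df["FINISH.2"] = df["FINISH.1"] - df["BIB"].map(ref)
print(df)
```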
I have two different datasets: total product data and selling data. I need to find the remaining products by comparing the product data against the selling data. I have done some general preprocessing to make both DataFrames ready to use, but I can't figure out how to compare them.
DataFrame 1:
Item Qty
0 BUDS2 1.0
1 C100 4.0
2 CK1 5.0
3 DM10 10.0
4 DM7 2.0
5 DM9 9.0
6 HM12 6.0
7 HM13 4.0
8 HOCOX25(CTYPE) 1.0
9 HOCOX30USB 1.0
10 RM510 8.0
11 RM512 8.0
12 RM569 1.0
13 RM711 2.0
14 T2C 1.0
and
DataFrame 2 :
Item Name Quantity
0 BUDS2 2.0
1 C100 5.0
2 C101CABLE 1.0
3 CK1 8.0
4 DM10 12.0
5 DM7 5.0
6 DM9 10.0
7 G20CTYPE 1.0
8 G20NORMAL 1.0
9 HM12 9.0
10 HM13 8.0
11 HM9 3.0
12 HOCOX25CTYPE 3.0
13 HOCOX30USB 3.0
14 M45 1.0
15 REMAXRC080M 2.0
16 RM510 11.0
17 RM512 10.0
18 RM569 2.0
19 RM711 3.0
20 T2C 1.0
21 Y1 3.0
22 ZIRCON 1.0
I want to see the available quantity for each item. I want an output like DataFrame 2, but with the Quantity column values changed after the subtraction operation. How can I do that?
Expected Output:
Item Name Quantity
0 BUDS2 1.0
1 C100 1.0
2 C101CABLE 1.0
3 CK1 3.0
4 DM10 2.0
5 DM7 3.0
6 DM9 1.0
7 G20CTYPE 1.0
8 G20NORMAL 1.0
9 HM12 3.0
10 HM13 4.0
11 HM9 3.0
12 HOCOX25CTYPE 2.0
13 HOCOX30USB 2.0
14 M45 1.0
15 REMAXRC080M 2.0
16 RM510 3.0
17 RM512 2.0
18 RM569 1.0
19 RM711 1.0
20 T2C 0.0
21 Y1 3.0
22 ZIRCON 1.0
Merging the two DataFrames can help here:
df_new = df_2.merge(df_1,'left',left_on='Item Name',right_on='Item').fillna(0)
df_new.Quantity = df_new.Quantity - df_new.Qty
df_new = df_new.drop(['Item','Qty'],axis=1)
df_new output:
Item Name Quantity
0 BUDS2 1.0
1 C100 1.0
2 C101CABLE 1.0
3 CK1 3.0
4 DM10 2.0
5 DM7 3.0
6 DM9 1.0
7 G20CTYPE 1.0
8 G20NORMAL 1.0
9 HM12 3.0
10 HM13 4.0
11 HM9 3.0
12 HOCOX25CTYPE 3.0
13 HOCOX30USB 2.0
14 M45 1.0
15 REMAXRC080M 2.0
16 RM510 3.0
17 RM512 2.0
18 RM569 1.0
19 RM711 1.0
20 T2C 0.0
21 Y1 3.0
22 ZIRCON 1.0
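An equivalent without the merge is index-aligned subtraction: Series.sub with fill_value=0 keeps items that appear in only one frame. A sketch on a few stand-in rows:

```python
import pandas as pd

# Stand-ins for the sold-items and stock frames above.
df_1 = pd.DataFrame({"Item": ["BUDS2", "C100"], "Qty": [1.0, 4.0]})
df_2 = pd.DataFrame({"Item Name": ["BUDS2", "C100", "ZIRCON"],
                     "Quantity": [2.0, 5.0, 1.0]})

# Align by item name; fill_value=0 treats an item missing from df_1 as 0 sold.
remaining = (df_2.set_index("Item Name")["Quantity"]
             .sub(df_1.set_index("Item")["Qty"], fill_value=0)
             .rename_axis("Item Name")
             .reset_index(name="Quantity"))
print(remaining)
```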
I am attempting to apply several operations that I usually do easily in R to the sample dataset below, using Python/Pandas.
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
QUER.MAC 9 8 3 5 6 0 5 0 0 0
QUER.VEL 8 9 8 7 0 0 0 0 0 0
CARY.OVA 6 6 2 7 0 2 0 0 0 0
PRUN.SER 3 5 6 6 6 4 5 0 4 1
QUER.ALB 5 4 9 9 7 7 4 6 0 2
JUGL.NIG 2 0 0 0 3 5 6 4 3 0
QUER.RUB 3 4 0 6 9 8 7 6 4 3
JUGL.CIN 0 0 5 0 2 0 0 2 0 2
ULMU.AME 2 2 4 5 6 0 5 0 2 5
TILI.AME 0 0 0 0 2 7 6 6 7 6
ULMU.RUB 4 0 2 2 5 7 8 8 8 7
CARY.COR 0 0 0 0 0 5 6 4 0 3
OSTR.VIR 0 0 0 0 0 0 7 4 6 5
ACER.SAC 0 0 0 0 0 5 4 8 8 9
After reading the data from a text file with
import numpy as np
import pandas as pd
df = pd.read_csv("sample.txt", header=0, index_col=0, delimiter=' ')
I want to: (1) get the frequency of values larger than zero for each column; (2) get the sum of values in each column; (3) find the maximum value in each column.
I managed to obtain (2) using
N = df.apply(lambda x: np.sum(x))
But could not figure out how to achieve (1) and (3).
I need generic solutions, that are not dependent on the names of the columns, because I want to apply these operations on any number of similar matrices (which of course will have different labels and numbers of columns/rows).
Thanks in advance for any hints and suggestions.
Your 1st:
df.gt(0).sum()
2nd:
df.sum()
3rd:
df.max()
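If you want all three statistics side by side in one frame, they can be combined like this (a sketch on a small stand-in):

```python
import pandas as pd

# Two short stand-in columns in place of the full S1..S10 matrix.
df = pd.DataFrame({"S1": [9, 8, 0, 3], "S2": [8, 0, 0, 4]})

stats = pd.DataFrame({
    "freq_gt0": df.gt(0).sum(),  # (1) count of values > 0 per column
    "sum": df.sum(),             # (2) sum of each column
    "max": df.max(),             # (3) maximum of each column
})
print(stats)
```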
You can use mask and describe to get a bunch of stats by column.
df.mask(df <= 0).describe().T
Output:
count mean std min 25% 50% 75% max
S1 9.0 4.666667 2.549510 2.0 3.00 4.0 6.00 9.0
S2 7.0 5.428571 2.439750 2.0 4.00 5.0 7.00 9.0
S3 8.0 4.875000 2.642374 2.0 2.75 4.5 6.50 9.0
S4 8.0 5.875000 2.031010 2.0 5.00 6.0 7.00 9.0
S5 9.0 5.111111 2.368778 2.0 3.00 6.0 6.00 9.0
S6 9.0 5.555556 1.878238 2.0 5.00 5.0 7.00 8.0
S7 11.0 5.727273 1.272078 4.0 5.00 6.0 6.50 8.0
S8 9.0 5.333333 2.000000 2.0 4.00 6.0 6.00 8.0
S9 8.0 5.250000 2.314550 2.0 3.75 5.0 7.25 8.0
S10 10.0 4.300000 2.540779 1.0 2.25 4.0 5.75 9.0
The reason to use mask is that count counts all non-NaN values, so masking anything that is <= 0 makes those values NaN and excludes them from the count.
And, finally, we can add "sum" too, using assign:
df.mask(df<=0).describe().T.assign(sum=df.sum())
Output:
count mean std min 25% 50% 75% max sum
S1 9.0 4.666667 2.549510 2.0 3.00 4.0 6.00 9.0 42
S2 7.0 5.428571 2.439750 2.0 4.00 5.0 7.00 9.0 38
S3 8.0 4.875000 2.642374 2.0 2.75 4.5 6.50 9.0 39
S4 8.0 5.875000 2.031010 2.0 5.00 6.0 7.00 9.0 47
S5 9.0 5.111111 2.368778 2.0 3.00 6.0 6.00 9.0 46
S6 9.0 5.555556 1.878238 2.0 5.00 5.0 7.00 8.0 50
S7 11.0 5.727273 1.272078 4.0 5.00 6.0 6.50 8.0 63
S8 9.0 5.333333 2.000000 2.0 4.00 6.0 6.00 8.0 48
S9 8.0 5.250000 2.314550 2.0 3.75 5.0 7.25 8.0 42
S10 10.0 4.300000 2.540779 1.0 2.25 4.0 5.75 9.0 43
Given a DataFrame df that looks roughly like this:
TripID time Latitude SectorID sector_leave_time
0 42 7 52.5 5 8
1 42 8 52.6 5 8
2 42 9 52.7 6 10
3 42 10 52.8 6 10
4 5 9 50.1 2 10
5 5 10 50.0 2 10
6 5 11 49.9 1 12
7 5 12 49.8 1 12
I already computed the time at which a trip leaves a sector by getting the maximum timestamp within the sector. Now, I would like to add another column for the latitude at the point of sector_leave_time for each trip and sector, so the DataFrame becomes this:
TripID time Latitude SectorID sector_leave_time sector_leave_lat
0 42 7 52.5 5 8 52.6
1 42 8 52.6 5 8 52.6
2 42 9 52.7 6 10 52.8
3 42 10 52.8 6 10 52.8
4 5 9 50.1 2 10 50.0
5 5 10 50.0 2 10 50.0
6 5 11 49.9 1 12 49.8
7 5 12 49.8 1 12 49.8
So far I've only been able to add the sector_leave_lat to the line where time == sector_leave_time, i.e. when the trip leaves the sector, using the following line of code:
df['sector_leave_lat'] = df.groupby('TripID').apply(lambda x : x.loc[x['time'] == x['sector_leave_time'], 'Latitude']).reset_index().set_index('level_1')['Latitude']
I know this line looks awful and I would like to add sector_leave_lat to all entries of that trip within that sector. I'm kind of running out of ideas, so I hope someone may be able to help.
The problem is not that complicated if you are familiar with SQL :)
The following code should do the trick :
#Given your dataframe :
df
TripID time Latitude SectorID sector_leave_time
0 42.0 7.0 52.5 5.0 8.0
1 42.0 8.0 52.6 5.0 8.0
2 42.0 9.0 52.7 6.0 10.0
3 42.0 10.0 52.8 6.0 10.0
4 5.0 9.0 50.1 2.0 10.0
5 5.0 10.0 50.0 2.0 10.0
6 5.0 11.0 49.9 1.0 12.0
7 5.0 12.0 49.8 1.0 12.0
# Get the Latitude corresponding to time = sector_leave_time
df_max_lat = df.loc[df.time == df.sector_leave_time, ['TripID', 'Latitude', 'SectorID']]
# Then you have :
TripID Latitude SectorID
1 42.0 52.6 5.0
3 42.0 52.8 6.0
5 5.0 50.0 2.0
7 5.0 49.8 1.0
# Add the max latitude to original dataframe applying a left join
pd.merge(df, df_max_lat, on=['TripID', 'SectorID'], how='left', suffixes=('','_sector_leave'))
# You're getting :
TripID time Latitude SectorID sector_leave_time Latitude_sector_leave
0 42.0 7.0 52.5 5.0 8.0 52.6
1 42.0 8.0 52.6 5.0 8.0 52.6
2 42.0 9.0 52.7 6.0 10.0 52.8
3 42.0 10.0 52.8 6.0 10.0 52.8
4 5.0 9.0 50.1 2.0 10.0 50.0
5 5.0 10.0 50.0 2.0 10.0 50.0
6 5.0 11.0 49.9 1.0 12.0 49.8
7 5.0 12.0 49.8 1.0 12.0 49.8
There you go :)
For each trip-sector combination you want the last Latitude, sorted by time:
df['sector_leave_lat'] = df.sort_values('time').groupby(
    ['TripID', 'SectorID']
).transform('last')['Latitude']
outputs:
TripID time Latitude SectorID sector_leave_time sector_leave_lat
0 42 7 52.5 5 8 52.6
1 42 8 52.6 5 8 52.6
2 42 9 52.7 6 10 52.8
3 42 10 52.8 6 10 52.8
4 5 9 50.1 2 10 50.0
5 5 10 50.0 2 10 50.0
6 5 11 49.9 1 12 49.8
7 5 12 49.8 1 12 49.8
As the sample data already appears sorted by time within each trip-sector group, the sorting here may be redundant.