Pandas dataframe: expand rows at specific times - python

I have a dataframe:
df =
T1                C1
01/01/2022 11:20   2
01/01/2022 15:40   8
01/01/2022 17:50   3
I want to expand it so that:
- I have the value at specific given times
- I have a row for each round (hourly) timestamp
So if the given times are
l = ['01/01/2022 15:46', '01/01/2022 11:28']
I will have:
df_new =
T1                C1
01/01/2022 11:20   2
01/01/2022 11:28   2
01/01/2022 12:00   2
01/01/2022 13:00   2
01/01/2022 14:00   2
01/01/2022 15:00   2
01/01/2022 15:40   8
01/01/2022 15:46   8
01/01/2022 16:00   8
01/01/2022 17:00   8
01/01/2022 17:50   3
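For reproducibility, the input can be constructed like this (a minimal sketch, values taken from the table above):
import pandas as pd

df = pd.DataFrame({'T1': ['01/01/2022 11:20', '01/01/2022 15:40', '01/01/2022 17:50'],
                   'C1': [2, 8, 3]})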

You can add the extra hourly dates and ffill (see the sketch after the output for also inserting the given times from l):
df['T1'] = pd.to_datetime(df['T1'])
extra = pd.date_range(df['T1'].min().ceil('H'), df['T1'].max().floor('H'), freq='1h')
(pd.concat([df, pd.DataFrame({'T1': extra})])
.sort_values(by='T1', ignore_index=True)
.ffill()
)
Output:
T1 C1
0 2022-01-01 11:20:00 2.0
1 2022-01-01 12:00:00 2.0
2 2022-01-01 13:00:00 2.0
3 2022-01-01 14:00:00 2.0
4 2022-01-01 15:00:00 2.0
5 2022-01-01 15:40:00 8.0
6 2022-01-01 16:00:00 8.0
7 2022-01-01 17:00:00 8.0
8 2022-01-01 17:50:00 3.0
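The desired output also contains the specific times from l; a hedged sketch extending the same idea (reusing extra from above and assuming l holds parseable timestamp strings):
l = ['01/01/2022 15:46', '01/01/2022 11:28']
extra_times = pd.DataFrame({'T1': pd.to_datetime(l)})
(pd.concat([df, pd.DataFrame({'T1': extra}), extra_times])
   .drop_duplicates(subset='T1')   # keep the original rows if a time appears twice
   .sort_values(by='T1', ignore_index=True)
   .ffill()
)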

Here is a way to do what your question asks that ensures:
- there are no duplicate times in T1 in the output, even if some of the times in the original are round hours
- the results keep the same type as the values in the C1 column of the input (in this case, integers rather than floats).
hours = pd.date_range(df.T1.min().ceil("H"), df.T1.max().floor("H"), freq="60min")
idx_new = df.set_index('T1').join(pd.DataFrame(index=hours), how='outer', sort=True).index
df_new = df.set_index('T1').reindex(index = idx_new, method='ffill').reset_index().rename(columns={'index':'T1'})
Output:
T1 C1
0 2022-01-01 11:20:00 2
1 2022-01-01 12:00:00 2
2 2022-01-01 13:00:00 2
3 2022-01-01 14:00:00 2
4 2022-01-01 15:00:00 2
5 2022-01-01 15:40:00 8
6 2022-01-01 16:00:00 8
7 2022-01-01 17:00:00 8
8 2022-01-01 17:50:00 3
Example of how round dates in the input are handled:
df = pd.DataFrame({
    # 'T1': pd.to_datetime(['01/01/2022 11:20', '01/01/2022 15:40', '01/01/2022 17:50']),
    'T1': pd.to_datetime(['01/01/2022 11:00', '01/01/2022 15:40', '01/01/2022 17:00']),
    'C1': [2, 8, 3]})
Input:
T1 C1
0 2022-01-01 11:00:00 2
1 2022-01-01 15:40:00 8
2 2022-01-01 17:00:00 3
Output (no duplicates):
T1 C1
0 2022-01-01 11:00:00 2
1 2022-01-01 12:00:00 2
2 2022-01-01 13:00:00 2
3 2022-01-01 14:00:00 2
4 2022-01-01 15:00:00 2
5 2022-01-01 15:40:00 8
6 2022-01-01 16:00:00 8
7 2022-01-01 17:00:00 3

Another possible solution, based on pandas.DataFrame.resample:
df['T1'] = pd.to_datetime(df['T1'])
(pd.concat([df, df.set_index('T1').resample('1H').asfreq().reset_index()])
.sort_values('T1').ffill().dropna().reset_index(drop=True))
Output:
T1 C1
0 2022-01-01 11:20:00 2.0
1 2022-01-01 12:00:00 2.0
2 2022-01-01 13:00:00 2.0
3 2022-01-01 14:00:00 2.0
4 2022-01-01 15:00:00 2.0
5 2022-01-01 15:40:00 8.0
6 2022-01-01 16:00:00 8.0
7 2022-01-01 17:00:00 8.0
8 2022-01-01 17:50:00 3.0
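Here too C1 becomes float because of the temporary NaNs; if the integer dtype matters, a cast can be appended to the same chain (a small sketch):
(pd.concat([df, df.set_index('T1').resample('1H').asfreq().reset_index()])
   .sort_values('T1').ffill().dropna().reset_index(drop=True)
   .astype({'C1': 'int64'}))  # restore the integer dtype once the NaNs are gone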


Merge two dataframes in pandas with common info as columns or as cells

I have two pandas dataframes in python, df_main and df_aux.
df_main is a table which gathers events, with the datetime when each event happened and a column "Description" which gives a coded location. It has the following structure:
Date                 Description
2022-01-01 13:45:23  A
2022-01-01 14:22:00  C
2022-01-01 16:15:33  D
2022-01-01 16:21:22  E
2022-01-02 13:21:56  B
2022-01-02 14:45:41  B
2022-01-02 15:11:34  C
df_aux is a table which gives the number of other events (say, for example, people walking by between Initial_Date and Final_Date) happening in each location (A, B, C, D), with a 1-hour granularity. The structure of df_aux is as follows:
Initial_Date         Final_Date           A  B  C  D
2022-01-01 12:00:00  2022-01-01 12:59:59  2  0  1  2
2022-01-01 13:00:00  2022-01-01 13:59:59  3  2  4  5
2022-01-01 14:00:00  2022-01-01 14:59:59  2  2  7  0
2022-01-01 15:00:00  2022-01-01 15:59:59  5  2  2  0
2022-01-02 12:00:00  2022-01-02 12:59:59  1  1  0  3
2022-01-02 13:00:00  2022-01-02 13:59:59  5  5  0  3
2022-01-02 14:00:00  2022-01-02 14:59:59  2  3  2  1
2022-01-02 15:00:00  2022-01-02 15:59:59  3  4  1  0
So my problem is that I would need to add a new column in df_main to account for the number of people who have walked by in the hour previous to the event. For example, in the first event, which happens at 13:45:23h, we would go to the df_aux and look for the previous hour (12:45:23), which is the first row, as 12:45:23 is between 12:00:00 and 12:59:59. In that time range, column A has a value of 2, so we would add a new column to the df_main, "People_prev_hour", taking the value 2.
Following the same logic, the full df_main would be,
Date                 Description  People_prev_hour
2022-01-01 13:45:23  A            2
2022-01-01 14:22:00  C            4
2022-01-01 16:15:33  D            0
2022-01-01 16:21:22  E            NaN
2022-01-02 13:21:56  B            1
2022-01-02 14:45:41  B            5
2022-01-02 15:11:34  F            NaN
Datetimes will always be complete between both dfs, but the Description values may not all appear as columns in df_aux. As seen in the full df_main, two rows have the Description values E and F, which are not in df_aux; in those cases a NaN must be present.
I can't think of a way of merging these two dfs into the desired output, as pd.merge uses common columns, and I don't manage to do anything with pd.melt or pd.pivot. Any help is much appreciated!
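For reference, a minimal sketch constructing the two example frames from the tables above (column names and values are taken from the question; the Final_Date offset is an assumption matching the table):
import pandas as pd

df_main = pd.DataFrame({
    'Date': pd.to_datetime(['2022-01-01 13:45:23', '2022-01-01 14:22:00', '2022-01-01 16:15:33',
                            '2022-01-01 16:21:22', '2022-01-02 13:21:56', '2022-01-02 14:45:41',
                            '2022-01-02 15:11:34']),
    'Description': ['A', 'C', 'D', 'E', 'B', 'B', 'C']})

df_aux = pd.DataFrame({
    'Initial_Date': pd.to_datetime(['2022-01-01 12:00:00', '2022-01-01 13:00:00', '2022-01-01 14:00:00',
                                    '2022-01-01 15:00:00', '2022-01-02 12:00:00', '2022-01-02 13:00:00',
                                    '2022-01-02 14:00:00', '2022-01-02 15:00:00']),
    'A': [2, 3, 2, 5, 1, 5, 2, 3],
    'B': [0, 2, 2, 2, 1, 5, 3, 4],
    'C': [1, 4, 7, 2, 0, 0, 2, 1],
    'D': [2, 5, 0, 0, 3, 3, 1, 0]})
df_aux['Final_Date'] = df_aux['Initial_Date'] + pd.Timedelta(minutes=59, seconds=59)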
A first idea is to use merge_asof, because the hourly intervals do not overlap:
df1 = pd.merge_asof(df_main,
                    df_aux.assign(Initial_Date=df_aux['Initial_Date'] + pd.Timedelta(1, 'hour')),
                    left_on='Date',
                    right_on='Initial_Date')
Then use indexing lookup:
import numpy as np

idx, cols = pd.factorize(df1['Description'])
df_main['People_prev_hour'] = df1.reindex(cols, axis=1).to_numpy()[np.arange(len(df1)), idx]
print (df_main)
Date Description People_prev_hour
0 2022-01-01 13:45:23 A 2.0
1 2022-01-01 14:22:00 C 4.0
2 2022-01-01 16:15:33 D 0.0
3 2022-01-01 16:21:22 E NaN
4 2022-01-02 13:21:56 B 1.0
5 2022-01-02 14:45:41 B 5.0
6 2022-01-02 15:11:34 C 2.0
Another idea with IntervalIndex:
s = pd.IntervalIndex.from_arrays(df_aux.Initial_Date + pd.Timedelta(1, 'hour'),
df_aux.Final_Date + pd.Timedelta(1, 'hour'), 'both')
df1 = df_aux.set_index(s).loc[df_main.Date]
print (df1)
Initial_Date \
[2022-01-01 13:00:00, 2022-01-01 13:59:59] 2022-01-01 12:00:00
[2022-01-01 14:00:00, 2022-01-01 14:59:59] 2022-01-01 13:00:00
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:00:00
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:00:00
[2022-01-02 13:00:00, 2022-01-02 13:59:59] 2022-01-02 12:00:00
[2022-01-02 14:00:00, 2022-01-02 14:59:59] 2022-01-02 13:00:00
[2022-01-02 15:00:00, 2022-01-02 15:59:59] 2022-01-02 14:00:00
Final_Date A B C D
[2022-01-01 13:00:00, 2022-01-01 13:59:59] 2022-01-01 12:59:59 2 0 1 2
[2022-01-01 14:00:00, 2022-01-01 14:59:59] 2022-01-01 13:59:59 3 2 4 5
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:59:59 5 2 2 0
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:59:59 5 2 2 0
[2022-01-02 13:00:00, 2022-01-02 13:59:59] 2022-01-02 12:59:59 1 1 0 3
[2022-01-02 14:00:00, 2022-01-02 14:59:59] 2022-01-02 13:59:59 5 5 0 3
[2022-01-02 15:00:00, 2022-01-02 15:59:59] 2022-01-02 14:59:59 2 3 2 1
idx, cols = pd.factorize(df_main['Description'])
df_main['People_prev_hour'] = df1.reindex(cols, axis=1).to_numpy()[np.arange(len(df1)), idx]
print (df_main)
Date Description People_prev_hour
0 2022-01-01 13:45:23 A 2.0
1 2022-01-01 14:22:00 C 4.0
2 2022-01-01 16:15:33 D 0.0
3 2022-01-01 16:21:22 E NaN
4 2022-01-02 13:21:56 B 1.0
5 2022-01-02 14:45:41 B 5.0
6 2022-01-02 15:11:34 C 2.0
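Since the question mentions pd.melt, here is an alternative sketch (not from the original answers): floor each event to the start of the previous hour, melt df_aux into long form, and merge on the hour and the location. It assumes df_main has a default RangeIndex:
# start of the hour preceding each event
prev_hour = df_main['Date'].dt.floor('H') - pd.Timedelta(hours=1)

# long form of df_aux: one row per (Initial_Date, location)
long = df_aux.melt(id_vars=['Initial_Date', 'Final_Date'],
                   var_name='Description', value_name='People_prev_hour')

df_main['People_prev_hour'] = (
    df_main.assign(Initial_Date=prev_hour)
           .merge(long, on=['Initial_Date', 'Description'], how='left')['People_prev_hour']
)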

count groups of values with aggregated value

I have a dataset like this one:
DateTime             Value
2022-01-01 11:03:45  0
2022-01-01 11:03:50  40
2022-01-01 11:03:55  50
2022-01-01 11:04:00  60
2022-01-01 11:04:05  5
2022-01-01 11:04:10  4
2022-01-01 11:04:15  3
2022-01-01 11:04:20  0
2022-01-01 11:04:25  0
2022-01-01 11:04:30  40
2022-01-01 11:04:35  50
2022-01-01 11:04:40  4
2022-01-01 11:04:45  3
2022-01-01 11:04:50  0
2022-01-02 11:03:45  0
2022-01-02 11:03:50  5
2022-01-02 11:03:55  50
2022-01-02 11:04:00  60
2022-01-02 11:04:05  5
2022-01-02 11:04:10  4
2022-01-02 11:04:15  3
2022-01-02 11:04:20  0
2022-01-02 11:04:25  49
2022-01-02 11:04:30  40
2022-01-02 11:04:35  50
2022-01-02 11:04:40  4
2022-01-02 11:04:45  3
2022-01-02 11:04:50  0
As you can see, I have some timestamps with values. It is a measurement from a device that takes a sample every 5 seconds; this is only a subset of all the data. There are groups with low values and groups with high values. I define a high value as one greater than 10. If consecutive rows have high values, I consider them a group. What I would like to achieve:
- count the number of groups per day
- for each group, calculate its duration
I will show an example of my desired result below:
DateTime             Value  GroupId  Duration (in seconds)
2022-01-01 11:03:45  0      NaN      NaN
2022-01-01 11:03:50  40     1        15
2022-01-01 11:03:55  50     1        15
2022-01-01 11:04:00  60     1        15
2022-01-01 11:04:05  5      NaN      NaN
2022-01-01 11:04:10  4      NaN      NaN
2022-01-01 11:04:15  3      NaN      NaN
2022-01-01 11:04:20  0      NaN      NaN
2022-01-01 11:04:25  0      NaN      NaN
2022-01-01 11:04:30  40     2        10
2022-01-01 11:04:35  50     2        10
2022-01-01 11:04:40  4      NaN      NaN
2022-01-01 11:04:45  3      NaN      NaN
2022-01-01 11:04:50  0      NaN      NaN
2022-01-02 11:03:45  0      NaN      NaN
2022-01-02 11:03:50  5      NaN      NaN
2022-01-02 11:03:55  50     1        10
2022-01-02 11:04:00  60     1        10
2022-01-02 11:04:05  5      NaN      NaN
2022-01-02 11:04:10  4      NaN      NaN
2022-01-02 11:04:15  3      NaN      NaN
2022-01-02 11:04:20  0      NaN      NaN
2022-01-02 11:04:25  49     2        15
2022-01-02 11:04:30  40     2        15
2022-01-02 11:04:35  50     2        15
2022-01-02 11:04:40  4      NaN      NaN
2022-01-02 11:04:45  3      NaN      NaN
2022-01-02 11:04:50  0      NaN      NaN
I know how to read data into Pandas and do basic manipulation. Can you give me any hints on how to find those groups, measure their duration, and assign a number to them? Thanks!
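For reference, a compact sketch reconstructing the sample data above (values copied from the table):
import pandas as pd

times = pd.date_range('2022-01-01 11:03:45', periods=14, freq='5s')
values_day1 = [0, 40, 50, 60, 5, 4, 3, 0, 0, 40, 50, 4, 3, 0]
values_day2 = [0, 5, 50, 60, 5, 4, 3, 0, 49, 40, 50, 4, 3, 0]
df = pd.DataFrame({'DateTime': list(times) + list(times + pd.Timedelta(days=1)),
                   'Value': values_day1 + values_day2})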
For GroupId, create groups from consecutive values greater than 10 and number them with a cumulative sum via GroupBy.cumsum; then, per date and GroupId, subtract the minimal datetime from the maximal one, and finally add 5 seconds because a sample is taken every 5 seconds:
df['DateTime'] = pd.to_datetime(df['DateTime'])
s = df['Value'].gt(10)
date = df['DateTime'].dt.date
df['GroupId'] = s.ne(s.shift())[s].groupby(date).cumsum()
g = df.groupby([date,'GroupId'])['DateTime']
df['Duration (in seconds)'] = (g.transform('max').sub(g.transform('min'))
.dt.total_seconds().add(5))
print (df)
DateTime Value GroupId Duration (in seconds)
0 2022-01-01 11:03:45 0 NaN NaN
1 2022-01-01 11:03:50 40 1.0 15.0
2 2022-01-01 11:03:55 50 1.0 15.0
3 2022-01-01 11:04:00 60 1.0 15.0
4 2022-01-01 11:04:05 5 NaN NaN
5 2022-01-01 11:04:10 4 NaN NaN
6 2022-01-01 11:04:15 3 NaN NaN
7 2022-01-01 11:04:20 0 NaN NaN
8 2022-01-01 11:04:25 0 NaN NaN
9 2022-01-01 11:04:30 40 2.0 10.0
10 2022-01-01 11:04:35 50 2.0 10.0
11 2022-01-01 11:04:40 4 NaN NaN
12 2022-01-01 11:04:45 3 NaN NaN
13 2022-01-01 11:04:50 0 NaN NaN
14 2022-01-02 11:03:45 0 NaN NaN
15 2022-01-02 11:03:50 5 NaN NaN
16 2022-01-02 11:03:55 50 1.0 10.0
17 2022-01-02 11:04:00 60 1.0 10.0
18 2022-01-02 11:04:05 5 NaN NaN
19 2022-01-02 11:04:10 4 NaN NaN
20 2022-01-02 11:04:15 3 NaN NaN
21 2022-01-02 11:04:20 0 NaN NaN
22 2022-01-02 11:04:25 49 2.0 15.0
23 2022-01-02 11:04:30 40 2.0 15.0
24 2022-01-02 11:04:35 50 2.0 15.0
25 2022-01-02 11:04:40 4 NaN NaN
26 2022-01-02 11:04:45 3 NaN NaN
27 2022-01-02 11:04:50 0 NaN NaN
Another idea for computing Duration, extending each group backwards by one row (back-fill) before aggregating, so the 5 seconds do not need to be added:
df['DateTime'] = pd.to_datetime(df['DateTime'])
s = df['Value'].gt(10)
date = df['DateTime'].dt.date
df['GroupId'] = s.ne(s.shift())[s].groupby(date).cumsum()
prev = df.groupby(date)['GroupId'].bfill(limit=1)
g = df.groupby([date,prev])['DateTime']
df['Duration (in seconds)'] = (g.transform('max').sub(g.transform('min'))
.dt.total_seconds()
.where(s))
print (df)
DateTime Value GroupId Duration (in seconds)
0 2022-01-01 11:03:45 0 NaN NaN
1 2022-01-01 11:03:50 40 1.0 15.0
2 2022-01-01 11:03:55 50 1.0 15.0
3 2022-01-01 11:04:00 60 1.0 15.0
4 2022-01-01 11:04:05 5 NaN NaN
5 2022-01-01 11:04:10 4 NaN NaN
6 2022-01-01 11:04:15 3 NaN NaN
7 2022-01-01 11:04:20 0 NaN NaN
8 2022-01-01 11:04:25 0 NaN NaN
9 2022-01-01 11:04:30 40 2.0 10.0
10 2022-01-01 11:04:35 50 2.0 10.0
11 2022-01-01 11:04:40 4 NaN NaN
12 2022-01-01 11:04:45 3 NaN NaN
13 2022-01-01 11:04:50 0 NaN NaN
14 2022-01-02 11:03:45 0 NaN NaN
15 2022-01-02 11:03:50 5 NaN NaN
16 2022-01-02 11:03:55 50 1.0 10.0
17 2022-01-02 11:04:00 60 1.0 10.0
18 2022-01-02 11:04:05 5 NaN NaN
19 2022-01-02 11:04:10 4 NaN NaN
20 2022-01-02 11:04:15 3 NaN NaN
21 2022-01-02 11:04:20 0 NaN NaN
22 2022-01-02 11:04:25 49 2.0 15.0
23 2022-01-02 11:04:30 40 2.0 15.0
24 2022-01-02 11:04:35 50 2.0 15.0
25 2022-01-02 11:04:40 4 NaN NaN
26 2022-01-02 11:04:45 3 NaN NaN
27 2022-01-02 11:04:50 0 NaN NaN
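The question also asks for the number of groups per day; a small sketch using the GroupId column computed above:
# number of high-value groups per day
groups_per_day = df.groupby(df['DateTime'].dt.date)['GroupId'].nunique()
print(groups_per_day)
# 2022-01-01    2
# 2022-01-02    2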

Remove row comparing two datetime [duplicate]

This question already has answers here:
How to compare two date columns in a dataframe using pandas?
(2 answers)
Closed 5 months ago.
I have a dataset of New York taxi rides. There are some wrong values in pickup_datetime and dropoff_datetime, because the dropoff is before the pickup. How can I compare these two values and drop those rows?
You can do it with:
import pandas as pd
import numpy as np

df = pd.read_excel("Taxi.xlsx")

# convert to datetime
df["Pickup"] = pd.to_datetime(df["Pickup"])
df["Drop"] = pd.to_datetime(df["Drop"])

# flag rows where the pickup time is later than the drop time
df["ToRemove"] = np.where(df["Pickup"] > df["Drop"], 1, 0)
print(df)

# keep only the rows that are not flagged
dfClean = df[df["ToRemove"] == 0]
print(dfClean)
Result:
df:
Pickup Drop Price
0 2022-01-01 10:00:00 2022-01-01 10:05:00 5
1 2022-01-01 10:20:00 2022-01-01 10:25:00 8
2 2022-01-01 10:40:00 2022-01-01 10:45:00 3
3 2022-01-01 11:00:00 2022-01-01 10:05:00 10
4 2022-01-01 11:20:00 2022-01-01 11:25:00 5
5 2022-01-01 11:40:00 2022-01-01 08:45:00 8
6 2022-01-01 12:00:00 2022-01-01 12:05:00 3
7 2022-01-01 12:20:00 2022-01-01 12:25:00 10
8 2022-01-01 12:40:00 2022-01-01 12:45:00 5
9 2022-01-01 13:00:00 2022-01-01 13:05:00 8
10 2022-01-01 13:20:00 2022-01-01 13:25:00 3
dfClean:
Pickup Drop Price ToRemove
0 2022-01-01 10:00:00 2022-01-01 10:05:00 5 0
1 2022-01-01 10:20:00 2022-01-01 10:25:00 8 0
2 2022-01-01 10:40:00 2022-01-01 10:45:00 3 0
4 2022-01-01 11:20:00 2022-01-01 11:25:00 5 0
6 2022-01-01 12:00:00 2022-01-01 12:05:00 3 0
7 2022-01-01 12:20:00 2022-01-01 12:25:00 10 0
8 2022-01-01 12:40:00 2022-01-01 12:45:00 5 0
9 2022-01-01 13:00:00 2022-01-01 13:05:00 8 0
10 2022-01-01 13:20:00 2022-01-01 13:25:00 3 0
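Equivalently, the helper column can be skipped with plain boolean indexing (a minimal sketch):
# keep only rides whose dropoff is not before the pickup
dfClean = df[df["Pickup"] <= df["Drop"]].copy()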

How to find occurrence of consecutive events in python timeseries data frame?

I have got a time series of meteorological observations with date and value columns:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'date': ['11/10/2017 0:00', '11/10/2017 03:00', '11/10/2017 06:00', '11/10/2017 09:00', '11/10/2017 12:00',
             '11/11/2017 0:00', '11/11/2017 03:00', '11/11/2017 06:00', '11/11/2017 09:00', '11/11/2017 12:00',
             '11/12/2017 00:00', '11/12/2017 03:00', '11/12/2017 06:00', '11/12/2017 09:00', '11/12/2017 12:00'],
    'value': [850, np.nan, np.nan, np.nan, np.nan, 500, 650, 780, np.nan, 800, 350, 690, 780, np.nan, np.nan],
    'consecutive_hour': [3, 0, 0, 0, 0, 3, 6, 9, 0, 3, 3, 6, 9, 0, 0]})
With this DataFrame, I want a third column consecutive_hour such that if the value at a particular timestamp is less than 1000, that row gets 3 hours, and consecutive such occurrences accumulate to 6, 9, and so on, as shown above (resetting to 0 where the value is missing).
Lastly, I want to summarize the table by counting how many runs reach each consecutive-hours value as their maximum (the number_of_day column), so that the summary table looks like:
df_summary = pd.DataFrame({'consecutive_hours':[3,6,9,12],
'number_of_day':[2,0,2,0]})
I tried several online solutions and methods like shift(), diff(), etc., as mentioned in: How to groupby consecutive values in pandas DataFrame
and more, and spent several days on it, but no luck yet.
I would highly appreciate help on this issue.
Thanks!
Input data:
>>> df
date value
0 2017-11-10 00:00:00 850.0
1 2017-11-10 03:00:00 NaN
2 2017-11-10 06:00:00 NaN
3 2017-11-10 09:00:00 NaN
4 2017-11-10 12:00:00 NaN
5 2017-11-11 00:00:00 500.0
6 2017-11-11 03:00:00 650.0
7 2017-11-11 06:00:00 780.0
8 2017-11-11 09:00:00 NaN
9 2017-11-11 12:00:00 800.0
10 2017-11-12 00:00:00 350.0
11 2017-11-12 03:00:00 690.0
12 2017-11-12 06:00:00 780.0
13 2017-11-12 09:00:00 NaN
14 2017-11-12 12:00:00 NaN
The cumcount_reset function is adapted from this answer by @jezrael:
Python pandas cumsum with reset everytime there is a 0
# parse dates first so the daily Grouper below works on a DatetimeIndex
df["date"] = pd.to_datetime(df["date"])

cumcount_reset = \
    lambda b: b.cumsum().sub(b.cumsum().where(~b).ffill().fillna(0)).astype(int)
df["consecutive_hour"] = (df.set_index("date")["value"] < 1000) \
.groupby(pd.Grouper(freq="D")) \
.apply(lambda b: cumcount_reset(b)).mul(3) \
.reset_index(drop=True)
Output result:
>>> df
date value consecutive_hour
0 2017-11-10 00:00:00 850.0 3
1 2017-11-10 03:00:00 NaN 0
2 2017-11-10 06:00:00 NaN 0
3 2017-11-10 09:00:00 NaN 0
4 2017-11-10 12:00:00 NaN 0
5 2017-11-11 00:00:00 500.0 3
6 2017-11-11 03:00:00 650.0 6
7 2017-11-11 06:00:00 780.0 9
8 2017-11-11 09:00:00 NaN 0
9 2017-11-11 12:00:00 800.0 3
10 2017-11-12 00:00:00 350.0 3
11 2017-11-12 03:00:00 690.0 6
12 2017-11-12 06:00:00 780.0 9
13 2017-11-12 09:00:00 NaN 0
14 2017-11-12 12:00:00 NaN 0
Summary table
df_summary = df.loc[df.groupby(pd.Grouper(key="date", freq="D"))["consecutive_hour"] \
.apply(lambda h: (h - h.shift(-1).fillna(0)) > 0),
"consecutive_hour"] \
.value_counts().reindex([3, 6, 9, 12], fill_value=0) \
.rename("number_of_day") \
.rename_axis("consecutive_hour") \
.reset_index()
>>> df_summary
consecutive_hour number_of_day
0 3 2
1 6 0
2 9 2
3 12 0

Pandas: changing values of a group under a condition

My df:
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                   'time': ['2020-01-01 12:00:15', '2020-01-01 12:00:30', '2020-01-01 12:00:45',
                            '2020-01-03 08:00:00', '2020-01-03 08:00:15'],
                   'time1': ['2020-01-01 12:00:00', '2020-01-01 12:00:00', '2020-01-01 12:00:00',
                             '2020-01-01 12:00:00', '2020-01-01 12:00:00'],
                   'numb': [1, 5, 8, 0, 4]})
df['time'] = pd.to_datetime(df['time'])
df['time1'] = pd.to_datetime(df['time1'])
df['numb_diff'] = df['numb'] - df['numb'].shift()
Output:
id time time1 numb numb_diff
0 1 2020-01-01 12:00:15 2020-01-01 12:00:00 1 NaN
1 1 2020-01-01 12:00:30 2020-01-01 12:00:00 5 4.0
2 1 2020-01-01 12:00:45 2020-01-01 12:00:00 8 3.0
3 2 2020-01-03 08:00:00 2020-01-01 12:00:00 0 -8.0
4 2 2020-01-03 08:00:15 2020-01-01 12:00:00 4 4.0
Now I would like to set time1 to the lowest value of time within the group (id) whenever the first numb_diff entry of that id is < 0.
Expected Output:
id time time1 numb numb_diff
0 1 2020-01-01 12:00:15 2020-01-01 12:00:00 1 NaN
1 1 2020-01-01 12:00:30 2020-01-01 12:00:00 5 4.0
2 1 2020-01-01 12:00:45 2020-01-01 12:00:00 8 3.0
3 2 2020-01-03 08:00:00 2020-01-03 08:00:00 0 -8.0 #Changing time1 to the min of time the group(id = 2)
4 2 2020-01-03 08:00:15 2020-01-03 08:00:00 4 4.0
Let's create a boolean mask representing the condition where numb_diff is less than zero, group this mask by id and transform using first, and finally use boolean indexing with the mask to substitute values in time1:
m = df['numb_diff'].lt(0).groupby(df['id']).transform('first')
df.loc[m, 'time1'] = df.groupby('id')['time'].transform('min')
id time time1 numb numb_diff
0 1 2020-01-01 12:00:15 2020-01-01 12:00:00 1 NaN
1 1 2020-01-01 12:00:30 2020-01-01 12:00:00 5 4.0
2 1 2020-01-01 12:00:45 2020-01-01 12:00:00 8 3.0
3 2 2020-01-03 08:00:00 2020-01-03 08:00:00 0 -8.0
4 2 2020-01-03 08:00:15 2020-01-03 08:00:00 4 4.0
