I have a dataset of New York taxi rides. There are some wrong values in pickup_datetime and dropoff_datetime, because the dropoff is earlier than the pickup. How can I compare these two values and drop those rows?
You can do it like this:
import pandas as pd
import numpy as np
df = pd.read_excel("Taxi.xlsx")
# convert to datetime
df["Pickup"] = pd.to_datetime(df["Pickup"])
df["Drop"] = pd.to_datetime(df["Drop"])
# flag rows where the pickup is after the drop
df["ToRemove"] = np.where(df["Pickup"] > df["Drop"], 1, 0)
print(df)
# keep only the rows that are not flagged
dfClean = df[df["ToRemove"] == 0]
print(dfClean)
Result:
df:
Pickup Drop Price
0 2022-01-01 10:00:00 2022-01-01 10:05:00 5
1 2022-01-01 10:20:00 2022-01-01 10:25:00 8
2 2022-01-01 10:40:00 2022-01-01 10:45:00 3
3 2022-01-01 11:00:00 2022-01-01 10:05:00 10
4 2022-01-01 11:20:00 2022-01-01 11:25:00 5
5 2022-01-01 11:40:00 2022-01-01 08:45:00 8
6 2022-01-01 12:00:00 2022-01-01 12:05:00 3
7 2022-01-01 12:20:00 2022-01-01 12:25:00 10
8 2022-01-01 12:40:00 2022-01-01 12:45:00 5
9 2022-01-01 13:00:00 2022-01-01 13:05:00 8
10 2022-01-01 13:20:00 2022-01-01 13:25:00 3
dfClean:
Pickup Drop Price ToRemove
0 2022-01-01 10:00:00 2022-01-01 10:05:00 5 0
1 2022-01-01 10:20:00 2022-01-01 10:25:00 8 0
2 2022-01-01 10:40:00 2022-01-01 10:45:00 3 0
4 2022-01-01 11:20:00 2022-01-01 11:25:00 5 0
6 2022-01-01 12:00:00 2022-01-01 12:05:00 3 0
7 2022-01-01 12:20:00 2022-01-01 12:25:00 10 0
8 2022-01-01 12:40:00 2022-01-01 12:45:00 5 0
9 2022-01-01 13:00:00 2022-01-01 13:05:00 8 0
10 2022-01-01 13:20:00 2022-01-01 13:25:00 3 0
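If the flag column is not needed afterwards, the same filtering can be done in one step with a boolean mask (a sketch on a small made-up frame):

```python
import pandas as pd

df = pd.DataFrame({
    "Pickup": pd.to_datetime(["2022-01-01 10:00:00", "2022-01-01 11:00:00"]),
    "Drop": pd.to_datetime(["2022-01-01 10:05:00", "2022-01-01 10:05:00"]),
    "Price": [5, 10],
})

# keep only rows where the dropoff is not before the pickup
dfClean = df[df["Pickup"] <= df["Drop"]].reset_index(drop=True)
print(dfClean)
```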
I have a dataframe:
df = T1 C1
01/01/2022 11:20 2
01/01/2022 15:40 8
01/01/2022 17:50 3
I want to expand it so that:
- I will have the value at specific given times
- I will have a row for each round (hourly) timestamp

So if the given times are
l = ['01/01/2022 15:46', '01/01/2022 11:28']
I will have:
df_new = T1 C1
01/01/2022 11:20 2
01/01/2022 11:28 2
01/01/2022 12:00 2
01/01/2022 13:00 2
01/01/2022 14:00 2
01/01/2022 15:00 2
01/01/2022 15:40 8
01/01/2022 15:46 8
01/01/2022 16:00 8
01/01/2022 17:00 8
01/01/2022 17:50 3
You can add the extra dates and ffill:
df['T1'] = pd.to_datetime(df['T1'])
extra = pd.date_range(df['T1'].min().ceil('H'), df['T1'].max().floor('H'), freq='1h')
(pd.concat([df, pd.DataFrame({'T1': extra})])
.sort_values(by='T1', ignore_index=True)
.ffill()
)
Output:
T1 C1
0 2022-01-01 11:20:00 2.0
1 2022-01-01 12:00:00 2.0
2 2022-01-01 13:00:00 2.0
3 2022-01-01 14:00:00 2.0
4 2022-01-01 15:00:00 2.0
5 2022-01-01 15:40:00 8.0
6 2022-01-01 16:00:00 8.0
7 2022-01-01 17:00:00 8.0
8 2022-01-01 17:50:00 3.0
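The extra times given in l do not appear in this output; one way to include them (a sketch, assuming l holds parseable datetime strings) is to union them with the hourly range before concatenating:

```python
import pandas as pd

df = pd.DataFrame({'T1': pd.to_datetime(['01/01/2022 11:20', '01/01/2022 15:40',
                                         '01/01/2022 17:50']),
                   'C1': [2, 8, 3]})
l = pd.to_datetime(['01/01/2022 15:46', '01/01/2022 11:28'])

# hourly grid between the first and last timestamps, plus the given times
extra = pd.date_range(df['T1'].min().ceil('h'), df['T1'].max().floor('h'), freq='1h')
out = (pd.concat([df, pd.DataFrame({'T1': extra.union(l)})])
       .sort_values('T1', ignore_index=True)
       .ffill())
print(out)
```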
Here is a way to do what your question asks that ensures:
- there are no duplicate times in T1 in the output, even if any of the times in the original are round hours;
- the results have the same dtype as the values in the C1 column of the input (in this case, integers, not floats).
hours = pd.date_range(df.T1.min().ceil("H"), df.T1.max().floor("H"), freq="60min")
idx_new = df.set_index('T1').join(pd.DataFrame(index=hours), how='outer', sort=True).index
df_new = (df.set_index('T1')
            .reindex(index=idx_new, method='ffill')
            .reset_index()
            .rename(columns={'index': 'T1'}))
Output:
T1 C1
0 2022-01-01 11:20:00 2
1 2022-01-01 12:00:00 2
2 2022-01-01 13:00:00 2
3 2022-01-01 14:00:00 2
4 2022-01-01 15:00:00 2
5 2022-01-01 15:40:00 8
6 2022-01-01 16:00:00 8
7 2022-01-01 17:00:00 8
8 2022-01-01 17:50:00 3
Example of how round dates in the input are handled:
df = pd.DataFrame({
#'T1':pd.to_datetime(['01/01/2022 11:20','01/01/2022 15:40','01/01/2022 17:50']),
'T1':pd.to_datetime(['01/01/2022 11:00','01/01/2022 15:40','01/01/2022 17:00']),
'C1':[2,8,3]})
Input:
T1 C1
0 2022-01-01 11:00:00 2
1 2022-01-01 15:40:00 8
2 2022-01-01 17:00:00 3
Output (no duplicates):
T1 C1
0 2022-01-01 11:00:00 2
1 2022-01-01 12:00:00 2
2 2022-01-01 13:00:00 2
3 2022-01-01 14:00:00 2
4 2022-01-01 15:00:00 2
5 2022-01-01 15:40:00 8
6 2022-01-01 16:00:00 8
7 2022-01-01 17:00:00 3
Another possible solution, based on pandas.DataFrame.resample:
df['T1'] = pd.to_datetime(df['T1'])
(pd.concat([df, df.set_index('T1').resample('1H').asfreq().reset_index()])
.sort_values('T1').ffill().dropna().reset_index(drop=True))
Output:
T1 C1
0 2022-01-01 11:20:00 2.0
1 2022-01-01 12:00:00 2.0
2 2022-01-01 13:00:00 2.0
3 2022-01-01 14:00:00 2.0
4 2022-01-01 15:00:00 2.0
5 2022-01-01 15:40:00 8.0
6 2022-01-01 16:00:00 8.0
7 2022-01-01 17:00:00 8.0
8 2022-01-01 17:50:00 3.0
This is a sample of my df, consisting of temperature and rain (mm) per city:
Datetime             Berlin_temperature  Dublin_temperature  London_temperature  Paris_temperature  Berlin_rain  Dublin_rain  London_rain  Paris_rain
2022-01-01 10:00:00  24                  24                  24                  24                 10           10           10           10
2022-01-01 11:00:00  24                  24                  24                  24                 10           10           10           10
2022-01-01 12:00:00  24                  24                  24                  24                 10           10           10           10
2022-01-01 13:00:00  24                  24                  24                  24                 10           10           10           10
I want to achieve the following output as a dataframe:
Datetime             City    Temperature  Rainfall
2022-01-01 10:00:00  Berlin  24           10
2022-01-01 10:00:00  Dublin  24           10
2022-01-01 10:00:00  London  24           10
2022-01-01 10:00:00  Paris   24           10
2022-01-01 11:00:00  Berlin  24           10
2022-01-01 11:00:00  Dublin  24           10
2022-01-01 11:00:00  London  24           10
2022-01-01 11:00:00  Paris   24           10
2022-01-01 12:00:00  ...     ...          ...
At the moment I don't know how to achieve this by transposing or something similar. How would this be possible?
Use DataFrame.stack with a MultiIndex created by splitting the columns on _, but first convert Datetime to the index with DataFrame.set_index:
df1 = df.set_index('Datetime')
df1.columns = df1.columns.str.split('_', expand=True)
df1 = df1.stack(0).rename_axis(['Datetime','City']).reset_index()
print (df1)
Datetime City rain temperature
0 2022-01-01 10:00:00 Berlin 10 24
1 2022-01-01 10:00:00 Dublin 10 24
2 2022-01-01 10:00:00 London 10 24
3 2022-01-01 10:00:00 Paris 10 24
4 2022-01-01 11:00:00 Berlin 10 24
5 2022-01-01 11:00:00 Dublin 10 24
6 2022-01-01 11:00:00 London 10 24
7 2022-01-01 11:00:00 Paris 10 24
8 2022-01-01 12:00:00 Berlin 10 24
9 2022-01-01 12:00:00 Dublin 10 24
10 2022-01-01 12:00:00 London 10 24
11 2022-01-01 12:00:00 Paris 10 24
12 2022-01-01 13:00:00 Berlin 10 24
13 2022-01-01 13:00:00 Dublin 10 24
14 2022-01-01 13:00:00 London 10 24
15 2022-01-01 13:00:00 Paris 10 24
Using janitor.pivot_longer:
import janitor
out = df.pivot_longer(index='Datetime', names_to=('City', '.value'),
                      names_sep='_', sort_by_appearance=True)
Output:
Datetime City temperature rain
0 2022-01-01 10:00:00 Berlin 24 10
1 2022-01-01 10:00:00 Dublin 24 10
2 2022-01-01 10:00:00 London 24 10
3 2022-01-01 10:00:00 Paris 24 10
4 2022-01-01 11:00:00 Berlin 24 10
5 2022-01-01 11:00:00 Dublin 24 10
6 2022-01-01 11:00:00 London 24 10
7 2022-01-01 11:00:00 Paris 24 10
8 2022-01-01 12:00:00 Berlin 24 10
9 2022-01-01 12:00:00 Dublin 24 10
10 2022-01-01 12:00:00 London 24 10
11 2022-01-01 12:00:00 Paris 24 10
12 2022-01-01 13:00:00 Berlin 24 10
13 2022-01-01 13:00:00 Dublin 24 10
14 2022-01-01 13:00:00 London 24 10
15 2022-01-01 13:00:00 Paris 24 10
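If the janitor dependency is not available, a similar result can be obtained with plain pandas via melt plus a pivot (a sketch on a reduced two-city frame; the Temperature/Rainfall renames are only there to match the desired headers):

```python
import pandas as pd

df = pd.DataFrame({
    'Datetime': pd.to_datetime(['2022-01-01 10:00:00', '2022-01-01 11:00:00']),
    'Berlin_temperature': [24, 24], 'Dublin_temperature': [24, 24],
    'Berlin_rain': [10, 10], 'Dublin_rain': [10, 10],
})

# melt to long form, then split e.g. 'Berlin_temperature' into city and measure
long = df.melt(id_vars='Datetime')
long[['City', 'Measure']] = long['variable'].str.split('_', expand=True)
out = (long.pivot_table(index=['Datetime', 'City'], columns='Measure',
                        values='value', aggfunc='first')
           .reset_index()
           .rename(columns={'temperature': 'Temperature', 'rain': 'Rainfall'}))
print(out)
```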
Let's suppose we have a pandas dataframe with work shifts:
df_aux = pd.DataFrame({'Worker' : ['Alice','Alice','Alice','Alice','Alice', 'Bob','Bob','Bob'],
'Shift_start' : ['2022-01-01 10:00:00', '2022-01-01 10:30:00', '2022-01-01 11:45:00', '2022-01-01 12:45:00', '2022-01-01 13:15:00', '2022-01-01 10:30:00', '2022-01-01 12:00:00', '2022-01-01 13:15:00'],
'Shift_end' : ['2022-01-01 10:15:00', '2022-01-01 11:45:00', '2022-01-01 12:30:00', '2022-01-01 13:15:00', '2022-01-01 14:00:00', '2022-01-01 11:30:00', '2022-01-01 13:10:00', '2022-01-01 14:30:00'],
'Position' : [1, 1, 2, 2, 2, 1, 2, 3],
'Role' : ['A', 'B', 'B', 'A', 'B', 'A', 'B', 'A']})
Worker  Shift_start          Shift_end            Position  Role
Alice   2022-01-01 10:00:00  2022-01-01 10:15:00  1         A
Alice   2022-01-01 10:30:00  2022-01-01 11:45:00  1         B
Alice   2022-01-01 11:45:00  2022-01-01 12:30:00  2         B
Alice   2022-01-01 12:45:00  2022-01-01 13:15:00  2         A
Alice   2022-01-01 13:15:00  2022-01-01 14:00:00  2         B
Bob     2022-01-01 10:30:00  2022-01-01 11:30:00  1         A
Bob     2022-01-01 12:00:00  2022-01-01 13:10:00  2         B
Bob     2022-01-01 13:15:00  2022-01-01 14:30:00  3         A
The Position column refers to the place where the workers are, and there are two roles, A and B (say, main and auxiliary). I need to compute the time each worker has been at their current position, regardless of role, and the time they have been in the same position AND role at the time of certain events. These events are given in df_main, which records the time and position:
df_main = pd.DataFrame({'Event_time' : ['2022-01-01 11:05:00', '2022-01-01 12:35:00', '2022-01-01 13:25:00'] ,
'Position' : [1, 2, 2]})
Event_time           Position
2022-01-01 11:05:00  1
2022-01-01 12:35:00  2
2022-01-01 13:25:00  2
The idea would be to perform a merge between df_main and df_aux to have the following info:
Event_time           Worker  Shift_start          Shift_end            Position  Role  Time_in_position    Time_in_position_role
2022-01-01 11:05:00  Alice   2022-01-01 10:30:00  2022-01-01 11:45:00  1         B     1 hours 05 minutes  0 hours 35 minutes
2022-01-01 11:05:00  Bob     2022-01-01 10:30:00  2022-01-01 13:30:00  1         A     0 hours 35 minutes  0 hours 35 minutes
2022-01-01 12:35:00  Bob     2022-01-01 12:00:00  2022-01-01 15:10:00  2         B     0 hours 35 minutes  0 hours 35 minutes
2022-01-01 13:25:00  Alice   2022-01-01 13:15:00  2022-01-01 14:00:00  2         B     1 hours 40 minutes  0 hours 10 minutes
The first row is duplicated, because both Alice and Bob were in that position at the time of the event, but with different roles. I managed to compute the Time_in_position_role column:
df_full = df_main.merge(df_aux, on='Position')
df_full = df_full[(df_full['Event_time']>df_full['Shift_start']) & (df_full['Event_time']<df_full['Shift_end'])]
df_full['Time_in_position_role'] = df_full['Event_time'] - df_full['Shift_start']
But I am unable to do the same for the Time_in_position one. Any ideas?
The logic is:
- For each Worker, find the time period for which they were in a particular position; if there are multiple consecutive rows, merge them.
- Join this with your result df and filter with the same logic for Time_in_position.
# For each Worker, find the period spent in a particular position.
# If there are multiple consecutive rows, merge them.
def sort_n_rank(g):
    df_g = g.apply(pd.Series)
    df_g = df_g.sort_values(0)
    return (df_g[1] != df_g[1].shift(1)).cumsum()

df_aux["start_position"] = df_aux[["Shift_start", "Position"]].apply(tuple, axis=1)
df_aux["rank"] = df_aux.groupby("Worker")[["start_position"]].transform(sort_n_rank)
df_worker_position = df_aux.groupby(["Worker", "rank"]) \
    .agg(
        Shift_start_min=("Shift_start", "min"),
        Shift_end_max=("Shift_end", "max"),
        Position=("Position", "first"),
    ) \
    .reset_index()

df_full = df_full.merge(df_worker_position, on=["Worker", "Position"])
df_full = df_full[(df_full["Event_time"] > df_full["Shift_start_min"])
                  & (df_full["Event_time"] < df_full["Shift_end_max"])]
df_full["Time_in_position"] = df_full["Event_time"] - df_full["Shift_start_min"]
Output:
Event_time Worker Shift_start Shift_end Position Role Time_in_position Time_in_position_role
0 2022-01-01 11:05:00 Alice 2022-01-01 10:30:00 2022-01-01 11:45:00 1 B 0 days 01:05:00 0 days 00:35:00
1 2022-01-01 11:05:00 Bob 2022-01-01 10:30:00 2022-01-01 11:30:00 1 A 0 days 00:35:00 0 days 00:35:00
2 2022-01-01 12:35:00 Bob 2022-01-01 12:00:00 2022-01-01 13:10:00 2 B 0 days 00:35:00 0 days 00:35:00
3 2022-01-01 13:25:00 Alice 2022-01-01 13:15:00 2022-01-01 14:00:00 2 B 0 days 01:40:00 0 days 00:10:00
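The same run detection can also be written without the tuple helper, using a shift/cumsum on the sorted frame (a sketch limited to the columns involved; 'run' and 'Position_start' are illustrative names):

```python
import pandas as pd

df_aux = pd.DataFrame({
    'Worker': ['Alice'] * 5 + ['Bob'] * 3,
    'Shift_start': pd.to_datetime(['2022-01-01 10:00', '2022-01-01 10:30',
                                   '2022-01-01 11:45', '2022-01-01 12:45',
                                   '2022-01-01 13:15', '2022-01-01 10:30',
                                   '2022-01-01 12:00', '2022-01-01 13:15']),
    'Position': [1, 1, 2, 2, 2, 1, 2, 3],
})

df_aux = df_aux.sort_values(['Worker', 'Shift_start'])
# start a new run whenever Position changes within a Worker
df_aux['run'] = (df_aux.groupby('Worker')['Position']
                 .transform(lambda s: s.ne(s.shift()).cumsum()))
# earliest shift start of the current position run
df_aux['Position_start'] = (df_aux.groupby(['Worker', 'run'])['Shift_start']
                            .transform('min'))
print(df_aux)
```

Time_in_position is then Event_time minus Position_start after the merge on Worker and Position.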
I have two pandas dataframes in Python, df_main and df_aux.
df_main is a table which gathers events, with the datetime when each happened and a column Description which gives a codified location. It has the following structure:
Date                 Description
2022-01-01 13:45:23  A
2022-01-01 14:22:00  C
2022-01-01 16:15:33  D
2022-01-01 16:21:22  E
2022-01-02 13:21:56  B
2022-01-02 14:45:41  B
2022-01-02 15:11:34  C
df_aux is a table which gives the number of other events (say, people walking by between Initial_Date and Final_Date) happening in each location (A, B, C, D), with a 1-hour granularity. The structure of df_aux is as follows:
Initial_Date         Final_Date           A  B  C  D
2022-01-01 12:00:00  2022-01-01 12:59:59  2  0  1  2
2022-01-01 13:00:00  2022-01-01 13:59:59  3  2  4  5
2022-01-01 14:00:00  2022-01-01 14:59:59  2  2  7  0
2022-01-01 15:00:00  2022-01-01 15:59:59  5  2  2  0
2022-01-02 12:00:00  2022-01-02 12:59:59  1  1  0  3
2022-01-02 13:00:00  2022-01-02 13:59:59  5  5  0  3
2022-01-02 14:00:00  2022-01-02 14:59:59  2  3  2  1
2022-01-02 15:00:00  2022-01-02 15:59:59  3  4  1  0
So my problem is that I would need to add a new column in df_main to account for the number of people who have walked by in the hour previous to the event. For example, in the first event, which happens at 13:45:23h, we would go to the df_aux and look for the previous hour (12:45:23), which is the first row, as 12:45:23 is between 12:00:00 and 12:59:59. In that time range, column A has a value of 2, so we would add a new column to the df_main, "People_prev_hour", taking the value 2.
Following the same logic, the full df_main would be:
Date                 Description  People_prev_hour
2022-01-01 13:45:23  A            2
2022-01-01 14:22:00  C            4
2022-01-01 16:15:33  D            0
2022-01-01 16:21:22  E            NaN
2022-01-02 13:21:56  B            1
2022-01-02 14:45:41  B            5
2022-01-02 15:11:34  F            NaN
Datetimes will always be complete between both dfs, but the Description values may not match: as seen in the full df_main, two rows have the Description values E and F, which are not in df_aux. In those cases a NaN must be present.
I can't think of a way to merge these two dfs into the desired output, as pd.merge uses common columns, and I haven't gotten anywhere with pd.melt or pd.pivot. Any help is much appreciated!
A first idea is to use merge_asof, because the hourly intervals do not overlap:
df1 = pd.merge_asof(df_main,
                    df_aux.assign(Initial_Date=df_aux['Initial_Date'] + pd.Timedelta(1, 'hour')),
                    left_on='Date',
                    right_on='Initial_Date')
Then use indexing lookup:
idx, cols = pd.factorize(df1['Description'])
df_main['People_prev_hour'] = (df1.reindex(cols, axis=1)
                                  .to_numpy()[np.arange(len(df1)), idx])
print (df_main)
Date Description People_prev_hour
0 2022-01-01 13:45:23 A 2.0
1 2022-01-01 14:22:00 C 4.0
2 2022-01-01 16:15:33 D 0.0
3 2022-01-01 16:21:22 E NaN
4 2022-01-02 13:21:56 B 1.0
5 2022-01-02 14:45:41 B 5.0
6 2022-01-02 15:11:34 C 2.0
Another idea with IntervalIndex:
s = pd.IntervalIndex.from_arrays(df_aux.Initial_Date + pd.Timedelta(1, 'hour'),
                                 df_aux.Final_Date + pd.Timedelta(1, 'hour'), 'both')
df1 = df_aux.set_index(s).loc[df_main.Date]
print (df1)
Initial_Date \
[2022-01-01 13:00:00, 2022-01-01 13:59:59] 2022-01-01 12:00:00
[2022-01-01 14:00:00, 2022-01-01 14:59:59] 2022-01-01 13:00:00
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:00:00
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:00:00
[2022-01-02 13:00:00, 2022-01-02 13:59:59] 2022-01-02 12:00:00
[2022-01-02 14:00:00, 2022-01-02 14:59:59] 2022-01-02 13:00:00
[2022-01-02 15:00:00, 2022-01-02 15:59:59] 2022-01-02 14:00:00
Final_Date A B C D
[2022-01-01 13:00:00, 2022-01-01 13:59:59] 2022-01-01 12:59:59 2 0 1 2
[2022-01-01 14:00:00, 2022-01-01 14:59:59] 2022-01-01 13:59:59 3 2 4 5
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:59:59 5 2 2 0
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:59:59 5 2 2 0
[2022-01-02 13:00:00, 2022-01-02 13:59:59] 2022-01-02 12:59:59 1 1 0 3
[2022-01-02 14:00:00, 2022-01-02 14:59:59] 2022-01-02 13:59:59 5 5 0 3
[2022-01-02 15:00:00, 2022-01-02 15:59:59] 2022-01-02 14:59:59 2 3 2 1
idx, cols = pd.factorize(df_main['Description'])
df_main['People_prev_hour'] = (df1.reindex(cols, axis=1)
                                  .to_numpy()[np.arange(len(df1)), idx])
print (df_main)
Date Description People_prev_hour
0 2022-01-01 13:45:23 A 2.0
1 2022-01-01 14:22:00 C 4.0
2 2022-01-01 16:15:33 D 0.0
3 2022-01-01 16:21:22 E NaN
4 2022-01-02 13:21:56 B 1.0
5 2022-01-02 14:45:41 B 5.0
6 2022-01-02 15:11:34 C 2.0
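A variant of the first idea (a sketch on a reduced frame) is to melt df_aux to long form so that Description becomes a merge key; merge_asof with by='Description' then yields NaN automatically for locations like E that are absent from df_aux:

```python
import pandas as pd

df_main = pd.DataFrame({
    'Date': pd.to_datetime(['2022-01-01 13:45:23', '2022-01-01 14:22:00',
                            '2022-01-01 16:21:22']),
    'Description': ['A', 'C', 'E'],
})
df_aux = pd.DataFrame({
    'Initial_Date': pd.to_datetime(['2022-01-01 12:00:00', '2022-01-01 13:00:00',
                                    '2022-01-01 15:00:00']),
    'Final_Date': pd.to_datetime(['2022-01-01 12:59:59', '2022-01-01 13:59:59',
                                  '2022-01-01 15:59:59']),
    'A': [2, 3, 5], 'B': [0, 2, 2], 'C': [1, 4, 2], 'D': [2, 5, 0],
})

# long form: one row per (interval, location)
long = (df_aux.melt(id_vars=['Initial_Date', 'Final_Date'],
                    var_name='Description', value_name='People_prev_hour')
              # shift intervals forward one hour so an event matches the hour before it
              .assign(Initial_Date=lambda d: d['Initial_Date'] + pd.Timedelta(hours=1))
              .sort_values('Initial_Date'))
out = pd.merge_asof(df_main.sort_values('Date'), long,
                    left_on='Date', right_on='Initial_Date', by='Description')
print(out[['Date', 'Description', 'People_prev_hour']])
```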
I want to resample a dataframe and forward-fill the values; see the code below. However, the last value isn't forward filled.
df=pd.DataFrame(index=pd.date_range(start='1/1/2022',periods=5,freq='h'),data=range(0,5))
print(df.resample('15T').ffill())
results in:
0
2022-01-01 00:00:00 0
2022-01-01 00:15:00 0
2022-01-01 00:30:00 0
2022-01-01 00:45:00 0
2022-01-01 01:00:00 1
2022-01-01 01:15:00 1
2022-01-01 01:30:00 1
2022-01-01 01:45:00 1
2022-01-01 02:00:00 2
2022-01-01 02:15:00 2
2022-01-01 02:30:00 2
2022-01-01 02:45:00 2
2022-01-01 03:00:00 3
2022-01-01 03:15:00 3
2022-01-01 03:30:00 3
2022-01-01 03:45:00 3
2022-01-01 04:00:00 4
I would like the last entry to also occur 3 more times. Currently I handle this by adding an extra entry manually, resample and then drop the last value, but that seems cumbersome. I hope there is a more elegant way.
As #mozway mentioned, this is just the way resampling works in pandas. Alternatively, you can manually do the upsampling with a join.
df = pd.DataFrame(
index=pd.date_range(start='1/1/2022',periods=5,freq='1h'),
data=range(0,5)
)
pd.DataFrame(
index=pd.date_range(start='1/1/2022',periods=5*4,freq='15min')
).join(df).ffill()
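Building the target index with date_range and using reindex with method='ffill' gives the same result without constructing a second frame (a sketch of the same toy data):

```python
import pandas as pd

df = pd.DataFrame(index=pd.date_range(start='1/1/2022', periods=5, freq='h'),
                  data=range(0, 5))

# extend the grid 45 minutes past the last timestamp so the final value repeats three times
idx = pd.date_range(df.index.min(),
                    df.index.max() + pd.Timedelta(minutes=45), freq='15min')
out = df.reindex(idx, method='ffill')
print(out)
```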