I want to resample a dataframe and forward-fill the values; see the code below. However, the last value isn't forward filled.
df=pd.DataFrame(index=pd.date_range(start='1/1/2022',periods=5,freq='h'),data=range(0,5))
print(df.resample('15T').ffill())
results in:
0
2022-01-01 00:00:00 0
2022-01-01 00:15:00 0
2022-01-01 00:30:00 0
2022-01-01 00:45:00 0
2022-01-01 01:00:00 1
2022-01-01 01:15:00 1
2022-01-01 01:30:00 1
2022-01-01 01:45:00 1
2022-01-01 02:00:00 2
2022-01-01 02:15:00 2
2022-01-01 02:30:00 2
2022-01-01 02:45:00 2
2022-01-01 03:00:00 3
2022-01-01 03:15:00 3
2022-01-01 03:30:00 3
2022-01-01 03:45:00 3
2022-01-01 04:00:00 4
I would like the last entry to also occur 3 more times. Currently I handle this by adding an extra entry manually, resample and then drop the last value, but that seems cumbersome. I hope there is a more elegant way.
As @mozway mentioned, this is just the way resampling works in pandas. Alternatively, you can do the upsampling manually with a join.
df = pd.DataFrame(
    index=pd.date_range(start='1/1/2022', periods=5, freq='1h'),
    data=range(0, 5)
)

pd.DataFrame(
    index=pd.date_range(start='1/1/2022', periods=5*4, freq='15min')
).join(df).ffill()
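An alternative sketch (using the same df as above) is to build the target 15-minute index yourself, reindex onto it, and forward fill; extending the end of the range by 45 minutes gives the last value its three extra rows:

import pandas as pd

# target grid covering the original range plus the final partial hour
target = pd.date_range(start=df.index[0],
                       end=df.index[-1] + pd.Timedelta('45min'),
                       freq='15min')
print(df.reindex(target).ffill())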
I have two pandas dataframes in Python, df_main and df_aux.
df_main is a table that gathers events, with the datetime when each event happened and a column "Description" giving a codified location. It has the following structure:
Date                 Description
2022-01-01 13:45:23  A
2022-01-01 14:22:00  C
2022-01-01 16:15:33  D
2022-01-01 16:21:22  E
2022-01-02 13:21:56  B
2022-01-02 14:45:41  B
2022-01-02 15:11:34  C
df_aux is a table that gives the number of other events (say, for example, people walking by between Initial_Date and Final_Date) happening in each location (A, B, C, D), with 1-hour granularity. The structure of df_aux is as follows:
Initial_Date         Final_Date           A  B  C  D
2022-01-01 12:00:00  2022-01-01 12:59:59  2  0  1  2
2022-01-01 13:00:00  2022-01-01 13:59:59  3  2  4  5
2022-01-01 14:00:00  2022-01-01 14:59:59  2  2  7  0
2022-01-01 15:00:00  2022-01-01 15:59:59  5  2  2  0
2022-01-02 12:00:00  2022-01-02 12:59:59  1  1  0  3
2022-01-02 13:00:00  2022-01-02 13:59:59  5  5  0  3
2022-01-02 14:00:00  2022-01-02 14:59:59  2  3  2  1
2022-01-02 15:00:00  2022-01-02 15:59:59  3  4  1  0
So my problem is that I need to add a new column to df_main with the number of people who walked by in the hour before each event. For example, the first event happens at 13:45:23, so we go to df_aux and look up the previous hour (12:45:23), which falls in the first row because 12:45:23 is between 12:00:00 and 12:59:59. In that time range column A has a value of 2, so the new column in df_main, "People_prev_hour", takes the value 2.
Following the same logic, the full df_main would be:
Date                 Description  People_prev_hour
2022-01-01 13:45:23  A            2
2022-01-01 14:22:00  C            4
2022-01-01 16:15:33  D            0
2022-01-01 16:21:22  E            NaN
2022-01-02 13:21:56  B            1
2022-01-02 14:45:41  B            5
2022-01-02 15:11:34  F            NaN
The datetimes will always be covered by both dfs, but the Description values may not be. As seen in the full df_main, two rows have Description values E and F, which are not columns of df_aux; in those cases the result must be NaN.
I can't think of a way of merging these two dfs into the desired output: pd.merge uses common columns, and I haven't managed to get anywhere with pd.melt or pd.pivot. Any help is much appreciated!
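For reference, a minimal reconstruction of the two example frames used below, based on the tables above (treat this as a sketch; dtypes are assumed to be datetimes and plain integers):

import pandas as pd

df_main = pd.DataFrame({
    'Date': pd.to_datetime([
        '2022-01-01 13:45:23', '2022-01-01 14:22:00', '2022-01-01 16:15:33',
        '2022-01-01 16:21:22', '2022-01-02 13:21:56', '2022-01-02 14:45:41',
        '2022-01-02 15:11:34']),
    # last row is 'C' per the first table; the expected output lists 'F' here
    'Description': ['A', 'C', 'D', 'E', 'B', 'B', 'C'],
})

hours = pd.to_datetime([
    '2022-01-01 12:00:00', '2022-01-01 13:00:00', '2022-01-01 14:00:00',
    '2022-01-01 15:00:00', '2022-01-02 12:00:00', '2022-01-02 13:00:00',
    '2022-01-02 14:00:00', '2022-01-02 15:00:00'])
df_aux = pd.DataFrame({
    'Initial_Date': hours,
    'Final_Date': hours + pd.Timedelta(minutes=59, seconds=59),
    'A': [2, 3, 2, 5, 1, 5, 2, 3],
    'B': [0, 2, 2, 2, 1, 5, 3, 4],
    'C': [1, 4, 7, 2, 0, 0, 2, 1],
    'D': [2, 5, 0, 0, 3, 3, 1, 0],
})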
A first idea is to use merge_asof, because the hourly intervals do not overlap:
df1 = pd.merge_asof(df_main,
                    df_aux.assign(Initial_Date=df_aux['Initial_Date'] + pd.Timedelta(1, 'hour')),
                    left_on='Date',
                    right_on='Initial_Date')
Then use indexing lookup:
import numpy as np

idx, cols = pd.factorize(df1['Description'])
df_main['People_prev_hour'] = df1.reindex(cols, axis=1).to_numpy()[np.arange(len(df1)), idx]
print (df_main)
Date Description People_prev_hour
0 2022-01-01 13:45:23 A 2.0
1 2022-01-01 14:22:00 C 4.0
2 2022-01-01 16:15:33 D 0.0
3 2022-01-01 16:21:22 E NaN
4 2022-01-02 13:21:56 B 1.0
5 2022-01-02 14:45:41 B 5.0
6 2022-01-02 15:11:34 C 2.0
Another idea with IntervalIndex:
s = pd.IntervalIndex.from_arrays(df_aux.Initial_Date + pd.Timedelta(1, 'hour'),
                                 df_aux.Final_Date + pd.Timedelta(1, 'hour'),
                                 closed='both')
df1 = df_aux.set_index(s).loc[df_main.Date]
print (df1)
Initial_Date \
[2022-01-01 13:00:00, 2022-01-01 13:59:59] 2022-01-01 12:00:00
[2022-01-01 14:00:00, 2022-01-01 14:59:59] 2022-01-01 13:00:00
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:00:00
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:00:00
[2022-01-02 13:00:00, 2022-01-02 13:59:59] 2022-01-02 12:00:00
[2022-01-02 14:00:00, 2022-01-02 14:59:59] 2022-01-02 13:00:00
[2022-01-02 15:00:00, 2022-01-02 15:59:59] 2022-01-02 14:00:00
Final_Date A B C D
[2022-01-01 13:00:00, 2022-01-01 13:59:59] 2022-01-01 12:59:59 2 0 1 2
[2022-01-01 14:00:00, 2022-01-01 14:59:59] 2022-01-01 13:59:59 3 2 4 5
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:59:59 5 2 2 0
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:59:59 5 2 2 0
[2022-01-02 13:00:00, 2022-01-02 13:59:59] 2022-01-02 12:59:59 1 1 0 3
[2022-01-02 14:00:00, 2022-01-02 14:59:59] 2022-01-02 13:59:59 5 5 0 3
[2022-01-02 15:00:00, 2022-01-02 15:59:59] 2022-01-02 14:59:59 2 3 2 1
idx, cols = pd.factorize(df_main['Description'])
df_main['People_prev_hour'] = df1.reindex(cols, axis=1).to_numpy()[np.arange(len(df1)), idx]
print (df_main)
Date Description People_prev_hour
0 2022-01-01 13:45:23 A 2.0
1 2022-01-01 14:22:00 C 4.0
2 2022-01-01 16:15:33 D 0.0
3 2022-01-01 16:21:22 E NaN
4 2022-01-02 13:21:56 B 1.0
5 2022-01-02 14:45:41 B 5.0
6 2022-01-02 15:11:34 C 2.0
I have a dataset of New York taxi rides. There are some wrong values in pickup_datetime and dropoff_datetime, because the dropoff is before the pickup. How can I compare these two values and drop those rows?
You can do it with:
import pandas as pd
import numpy as np

df = pd.read_excel("Taxi.xlsx")

# convert to datetime
df["Pickup"] = pd.to_datetime(df["Pickup"])
df["Drop"] = pd.to_datetime(df["Drop"])

# flag rows to remove (pickup after drop)
df["ToRemove"] = np.where(df["Pickup"] > df["Drop"], 1, 0)
print(df)

# keep only the rows where the pickup time is not after the drop time
dfClean = df[df["ToRemove"] == 0]
print(dfClean)
Result:
df:
Pickup Drop Price
0 2022-01-01 10:00:00 2022-01-01 10:05:00 5
1 2022-01-01 10:20:00 2022-01-01 10:25:00 8
2 2022-01-01 10:40:00 2022-01-01 10:45:00 3
3 2022-01-01 11:00:00 2022-01-01 10:05:00 10
4 2022-01-01 11:20:00 2022-01-01 11:25:00 5
5 2022-01-01 11:40:00 2022-01-01 08:45:00 8
6 2022-01-01 12:00:00 2022-01-01 12:05:00 3
7 2022-01-01 12:20:00 2022-01-01 12:25:00 10
8 2022-01-01 12:40:00 2022-01-01 12:45:00 5
9 2022-01-01 13:00:00 2022-01-01 13:05:00 8
10 2022-01-01 13:20:00 2022-01-01 13:25:00 3
dfClean:
Pickup Drop Price ToRemove
0 2022-01-01 10:00:00 2022-01-01 10:05:00 5 0
1 2022-01-01 10:20:00 2022-01-01 10:25:00 8 0
2 2022-01-01 10:40:00 2022-01-01 10:45:00 3 0
4 2022-01-01 11:20:00 2022-01-01 11:25:00 5 0
6 2022-01-01 12:00:00 2022-01-01 12:05:00 3 0
7 2022-01-01 12:20:00 2022-01-01 12:25:00 10 0
8 2022-01-01 12:40:00 2022-01-01 12:45:00 5 0
9 2022-01-01 13:00:00 2022-01-01 13:05:00 8 0
10 2022-01-01 13:20:00 2022-01-01 13:25:00 3 0
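If you don't need the helper column, a shorter variant of the same idea (assuming the same column names) is to filter with a boolean mask directly:

# keep only rides whose drop-off is at or after the pick-up
df_clean = df[df["Pickup"] <= df["Drop"]]
print(df_clean)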
May I know a better way to get the system downtime period?
The system takes a reading and transmits it every 20 minutes.
The code below identifies runs of identical values, where a value of 0.00 means no data was measured.
However, I would like to find out for how long data was lost across such a run of consecutive timestamps.
For example, 2:20am, 2:40am and 3:00am having the same value means the total lost time is 1 hour. I would like the start time and end time in the dataframe. Please help.
df_same_values = (df.groupby((df['Water Quality Sensor'].shift() != df['Water Quality Sensor']).cumsum())
                    .filter(lambda x: len(x) >= 3))
Water Quality Sensor
date
2022-01-01 02:20:00 2.500000
2022-01-01 02:40:00 2.500000
2022-01-01 03:00:00 2.500000
2022-01-01 09:00:00 0.000000
2022-01-01 09:20:00 0.000000
2022-01-01 09:40:00 0.000000
2022-01-01 10:00:00 0.000000
2022-01-01 10:20:00 0.000000
2022-01-01 10:40:00 0.000000
2022-01-01 12:40:00 0.000000
2022-01-01 13:00:00 0.000000
2022-01-01 13:20:00 0.000000
2022-01-01 18:00:00 2.500000
2022-01-01 18:20:00 2.500000
2022-01-01 18:40:00 2.500000
2022-01-01 19:00:00 2.500000
2022-01-02 03:00:00 2.500000
2022-01-02 03:20:00 2.500000
2022-01-02 03:40:00 2.500000
2022-01-02 04:00:00 2.500000
2022-01-02 04:20:00 2.500000
2022-01-02 04:40:00 2.500000
2022-01-02 05:00:00 2.500000
2022-01-02 05:20:00 2.500000
2022-01-02 18:00:00 2.500000
2022-01-02 18:20:00 2.500000
2022-01-02 18:40:00 2.500000
2022-01-02 19:00:00 2.500000
2022-01-02 19:20:00 2.500000
If your data are truly this small, why not just iterate over the rows while recording appropriate values?
# assumes a default RangeIndex with 'date' as a regular column
# (e.g. df = df.reset_index()), so that df.loc[i - 1, 'date'] works
down_flag = False
down_start = None
down_stop = None

for i, row in df.iterrows():
    if row['Water Quality Sensor'] == 0 and down_flag == False:
        down_flag = True
        down_start = row['date']
    if row['Water Quality Sensor'] > 0 and down_flag == True:
        down_flag = False
        down_stop = df.loc[i - 1, 'date']

down_time = down_stop - down_start
print(down_time)
gives
0 days 04:20:00
as expected. Note this only records the last portion of downtime if there are multiple outages.
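If you'd rather avoid the explicit loop, here is a sketch of a vectorized alternative; it assumes the original, unfiltered df with the 'Water Quality Sensor' column and the DatetimeIndex shown above, and it reports the start, end and duration of every zero run rather than only the last one:

import pandas as pd

zero = df['Water Quality Sensor'].eq(0)        # True where no data was measured
block = (zero != zero.shift()).cumsum()        # label consecutive runs of equal truth value

ts = df.index.to_series()
starts = ts[zero].groupby(block[zero]).min()   # first timestamp of each zero run
ends = ts[zero].groupby(block[zero]).max()     # last timestamp of each zero run

outages = pd.DataFrame({'start': starts, 'end': ends, 'duration': ends - starts})
print(outages)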
I have a dataframe with a DatetimeIndex in the format shown. The raw data is supposed to contain a record every hour for a year (each day having 24 records), but some hours/days are missing and not recorded in the data.
How can I get a list of all the missing DatetimeIndex hours?
Example: if the 01 hour is missing, how can I find and print out 2012-10-02 01:00:00?
I'm currently able to get the missing days but unable to do so for the hours.
from datetime import datetime

missing_day = pd.date_range(start=mdf.index[0], end=mdf.index[-1]).difference(mdf.index)
missing_day = missing_day.strftime('%Y%m%d')
missing = pd.Series(missing_day).array

for i in missing:
    print(i)
    for x in range(24):
        x = str(x)
        m = i + x
        m = datetime.strptime(m, '%Y%m%d%H')
        print(m)
Output (printing 24 hours for each missing day).
What would be the best way to list out all of the missing datetimes?
Use a set operation, Index.difference, on a full hourly range to find the missing index values:
out = pd.date_range(df.index.min(), df.index.max(), freq='H').difference(df.index)
print(out)
# Output
DatetimeIndex(['2022-01-01 06:00:00', '2022-01-01 12:00:00',
'2022-01-01 14:00:00', '2022-01-01 16:00:00'],
dtype='datetime64[ns]', freq=None)
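If you want each missing hour printed on its own line, as in the question, you can simply iterate over the result:

for ts in out:
    print(ts)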
Setup:
df = pd.DataFrame({'A':[0]}, index=pd.date_range('2022-01-01', freq='H', periods=24))
df = df.sample(n=20).sort_index()
print(df)
# Output
A
2022-01-01 00:00:00 0
2022-01-01 01:00:00 0
2022-01-01 02:00:00 0
2022-01-01 03:00:00 0
2022-01-01 04:00:00 0
2022-01-01 05:00:00 0
2022-01-01 07:00:00 0
2022-01-01 08:00:00 0
2022-01-01 09:00:00 0
2022-01-01 10:00:00 0
2022-01-01 11:00:00 0
2022-01-01 13:00:00 0
2022-01-01 15:00:00 0
2022-01-01 17:00:00 0
2022-01-01 18:00:00 0
2022-01-01 19:00:00 0
2022-01-01 20:00:00 0
2022-01-01 21:00:00 0
2022-01-01 22:00:00 0
2022-01-01 23:00:00 0
I have the following df
dates Final
2020-01-01 00:15:00 94.7
2020-01-01 00:30:00 94.1
2020-01-01 00:45:00 94.1
2020-01-01 01:00:00 95.0
2020-01-01 01:15:00 96.6
2020-01-01 01:30:00 98.4
2020-01-01 01:45:00 99.8
2020-01-01 02:00:00 99.8
2020-01-01 02:15:00 98.0
2020-01-01 02:30:00 95.1
2020-01-01 02:45:00 91.9
2020-01-01 03:00:00 89.5
The dataset runs until 2021-01-01 00:00:00 with value 95.6, with a gap of 15 minutes between rows.
Since the frequency is 15 minutes, I would like to change it to 1 hour and perhaps drop the intermediate values.
Expected output
dates Final
2020-01-01 01:00:00 95.0
2020-01-01 02:00:00 99.8
2020-01-01 03:00:00 89.5
With the last row being 2021-01-01 00:00:00 95.6
How can this be done?
Thanks
Use Series.dt.minute to perform boolean indexing:
df_filtered = df.loc[df['dates'].dt.minute.eq(0)]
#if necessary
#df_filtered = df.loc[pd.to_datetime(df['dates']).dt.minute.eq(0)]
print(df_filtered)
dates Final
3 2020-01-01 01:00:00 95.0
7 2020-01-01 02:00:00 99.8
11 2020-01-01 03:00:00 89.5
If you're doing data analysis or data science, I don't think dropping the intermediate values is a good approach at all! You should probably aggregate them instead, for example by summing them (I don't know your use case, but that is common with time series data).
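If you do want to aggregate rather than drop, a sketch using resample (assuming 'dates' is a datetime column as above) could look like this; closed='right' and label='right' make the (00:00, 01:00] bin carry the 01:00 label, matching the timestamps in the expected output:

# hourly mean of the four 15-minute readings in each hour
hourly_mean = df.resample('1h', on='dates', closed='right', label='right')['Final'].mean()

# or take the on-the-hour reading itself, equivalent to the minute-based filter above
hourly_last = df.resample('1h', on='dates', closed='right', label='right')['Final'].last()
print(hourly_last)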