Replace duplicated time index and fill by time interpolation - python

I have a dataframe with a wrong time stamp.
The time index is wrong: instead of being sampled in 1-minute periods, it contains duplicated indexes at multiples of 10 minutes
2021-08-01 00:00:00
2021-08-01 00:00:00
2021-08-01 00:00:00
2021-08-01 00:00:00
...
2021-08-01 00:10:00
2021-08-01 00:10:00
....
2021-08-01 00:20:00
2021-08-01 00:20:00
... and so on
The desired result after the postprocessing should be
2021-08-01 00:00:00
2021-08-01 00:01:00
2021-08-01 00:02:00
2021-08-01 00:03:00
...
2021-08-01 00:10:00
2021-08-01 00:11:00
...and so on
I have been trying with pandas index functions to fill the duplicated indexes with NaNs and then interpolate to 1 min, but without success.
Any hint?

You can add timedeltas of 1 minute per duplicate, using a counter of the duplicated indices from GroupBy.cumcount converted with to_timedelta:
print (df)
b
a
2021-08-01 00:00:00 1
2021-08-01 00:00:00 1
2021-08-01 00:00:00 1
2021-08-01 00:00:00 1
2021-08-01 00:10:00 1
2021-08-01 00:10:00 1
2021-08-01 00:20:00 1
2021-08-01 00:20:00 1
df.index = pd.to_datetime(df.index)
df.index += pd.to_timedelta(df.groupby(level=0).cumcount(), 'Min')
print (df)
b
2021-08-01 00:00:00 1
2021-08-01 00:01:00 1
2021-08-01 00:02:00 1
2021-08-01 00:03:00 1
2021-08-01 00:10:00 1
2021-08-01 00:11:00 1
2021-08-01 00:20:00 1
2021-08-01 00:21:00 1
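To get all the way to the 1-minute grid the question asks for, a minimal follow-up sketch (assuming the column b is numeric and linear interpolation in time is acceptable) is to resample and interpolate after the duplicates have been spread out:
# resample to a strict 1-minute grid; timestamps that are still missing become NaN
df = df.resample('1min').mean()
# fill the remaining gaps by interpolating against the time index
df['b'] = df['b'].interpolate(method='time')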

Related

Problems with replacing NaT in pandas correctly

I have a dataframe that contains some NaT values.
Date Value
6312957 2012-01-01 23:58:00 -49
6312958 2012-01-01 23:59:00 -49
6312959 NaT -48
6312960 2012-01-02 00:01:00 -47
6312961 2012-01-02 00:02:00 -46
I try to replace these NaT values by adding a minute to the previous entry.
indices_of_NAT = np.flatnonzero(pd.isna(df.loc[:, "Date"]))
df.loc[indices_of_NAT, "Date"] = df.loc[indices_of_NAT - 1, "Date"] + pd.Timedelta(minutes=1)
This produces the correct timestamps and indices, which I checked manually. The only problem is that they don't replace the NaT values for whatever reason. I wonder if something goes wrong with the indexing in my last line of code. Is there something obvious I am missing?
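A likely reason the assignment does not take effect is index alignment: when a Series is assigned through .loc, pandas aligns it on its index labels, and the right-hand side here is labelled indices_of_NAT - 1, which never matches the target rows, so NaT is written back. A minimal sketch of that idea (using positional indexing throughout and a hypothetical reconstruction of the frame) strips the index from the right-hand side with .to_numpy():
import numpy as np
import pandas as pd

# hypothetical reconstruction of the question's data
df = pd.DataFrame(
    {"Date": pd.to_datetime(["2012-01-01 23:58:00", "2012-01-01 23:59:00",
                             None, "2012-01-02 00:01:00", "2012-01-02 00:02:00"]),
     "Value": [-49, -49, -48, -47, -46]},
    index=range(6312957, 6312962),
)

pos_of_nat = np.flatnonzero(df["Date"].isna())
# .to_numpy() drops the index of the right-hand side, so no re-alignment happens
df.iloc[pos_of_nat, df.columns.get_loc("Date")] = (
    df["Date"].iloc[pos_of_nat - 1] + pd.Timedelta(minutes=1)
).to_numpy()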
You can fillna with the shifted values + 1 min:
df['Date'] = df['Date'].fillna(df['Date'].shift().add(pd.Timedelta('1min')))
Another method is to interpolate. For this you need to temporarily convert to a number. This way you can fill more than one gap and the increment will be calculated automatically, and there are many nice interpolation methods (see doc):
df['Date'] = (pd.to_datetime(pd.to_numeric(df['Date'])
                               .mask(df['Date'].isna())
                               .interpolate('linear'))
              )
Example:
Date Value shift interpolate
0 2012-01-01 23:58:00 -49 2012-01-01 23:58:00 2012-01-01 23:58:00
1 2012-01-01 23:59:00 -49 2012-01-01 23:59:00 2012-01-01 23:59:00
2 NaT -48 2012-01-02 00:00:00 2012-01-02 00:00:00
3 2012-01-02 00:01:00 -47 2012-01-02 00:01:00 2012-01-02 00:01:00
4 NaT -48 2012-01-02 00:02:00 2012-01-02 00:01:20
5 NaT -48 NaT 2012-01-02 00:01:40
6 2012-01-02 00:02:00 -46 2012-01-02 00:02:00 2012-01-02 00:02:00
Use Series.fillna with the shifted values plus 1 minute:
df['Date'] = df['Date'].fillna(df['Date'].shift() + pd.Timedelta(minutes=1))
Or forward fill the missing values and add 1 minute:
df['Date'] = df['Date'].fillna(df['Date'].ffill() + pd.Timedelta(minutes=1))
You can see the difference with different data:
df['Date'] = pd.to_datetime(df['Date'])
df['Date1'] = df['Date'].fillna(df['Date'].shift() + pd.Timedelta(minutes=1))
df['Date2'] = df['Date'].fillna(df['Date'].ffill() + pd.Timedelta(minutes=1))
print (df)
Date Value Date1 Date2
6312957 2012-01-01 23:58:00 -49 2012-01-01 23:58:00 2012-01-01 23:58:00
6312958 2012-01-01 23:59:00 -49 2012-01-01 23:59:00 2012-01-01 23:59:00
6312959 NaT -48 2012-01-02 00:00:00 2012-01-02 00:00:00
6312960 2012-01-02 00:01:00 -47 2012-01-02 00:01:00 2012-01-02 00:01:00
6312961 2012-01-02 00:02:00 -46 2012-01-02 00:02:00 2012-01-02 00:02:00
6312962 NaT -47 2012-01-02 00:03:00 2012-01-02 00:03:00
6312963 NaT -47 NaT 2012-01-02 00:03:00
6312967 2012-01-02 00:01:00 -47 2012-01-02 00:01:00 2012-01-02 00:01:00

Assign first element of groupby to a column yields NaN

Why does this not work?
I get the right results if I just print it out, but if I use the same expression to assign it to a df column, I get NaN values...
print(df.groupby('cumsum').first()['Date'])
cumsum
1 2021-01-05 11:00:00
2 2021-01-06 08:00:00
3 2021-01-06 10:00:00
4 2021-01-06 13:00:00
5 2021-01-06 14:00:00
...
557 2021-08-08 08:00:00
558 2021-08-08 09:00:00
559 2021-08-08 11:00:00
560 2021-08-08 13:00:00
561 2021-08-08 18:00:00
Name: Date, Length: 561, dtype: datetime64[ns]
vs
df["Date_First"] = df.groupby('cumsum').first()['Date']
Date
2021-01-01 00:00:00 NaT
2021-01-01 01:00:00 NaT
2021-01-01 02:00:00 NaT
2021-01-01 03:00:00 NaT
2021-01-01 04:00:00 NaT
..
2021-08-08 14:00:00 NaT
2021-08-08 15:00:00 NaT
2021-08-08 16:00:00 NaT
2021-08-08 17:00:00 NaT
2021-08-08 18:00:00 NaT
Name: Date_Last, Length: 5268, dtype: datetime64[ns]
What happens here?
I used an example from here, but want to get the first elements.
https://www.codeforests.com/2021/03/30/group-consecutive-rows-in-pandas/
What happens here?
If you use:
print(df.groupby('cumsum')['Date'].first())
#print(df.groupby('cumsum').first()['Date'])
the output contains the values of column Date aggregated per cumsum group with the aggregation function first.
So the result is indexed by the unique cumsum values; if you assign it to a new column, there is a mismatch with the original index and the output is all NaN.
The solution is GroupBy.transform, which broadcasts the aggregated value back to a Series (column) of the same size as the original DataFrame, so the index matches the original and the assignment works perfectly:
df["Date_First"] = df.groupby('cumsum')['Date'].transform("first")

How to fill zeroes of a day with the previous day values in Pandas

I have some days with complete zeroes and would like to replace them with the previous day's values, as shown here.
Input
count
2020-02-01 00:00:00 12
2020-02-01 00:01:00 3
2020-02-01 00:02:00 14
2020-02-01 00:03:00 0
2020-02-01 00:04:00 22
2020-02-02 00:00:00 0
2020-02-02 00:01:00 0
2020-02-02 00:02:00 0
2020-02-02 00:03:00 0
2020-02-02 00:04:00 0
2020-02-03 00:00:00 2
2020-02-03 00:01:00 4
2020-02-03 00:02:00 1
2020-02-03 00:03:00 0
2020-02-03 00:04:00 22
Output
count
2020-02-01 00:00:00 12
2020-02-01 00:01:00 3
2020-02-01 00:02:00 14
2020-02-01 00:03:00 0
2020-02-01 00:04:00 22
2020-02-02 00:00:00 12
2020-02-02 00:01:00 3
2020-02-02 00:02:00 14
2020-02-02 00:03:00 0
2020-02-02 00:04:00 22
2020-02-03 00:00:00 2
2020-02-03 00:01:00 4
2020-02-03 00:02:00 1
2020-02-03 00:03:00 0
2020-02-03 00:04:00 22
I was trying something like this but couldn't solve it.
df = df.fillna(0)
df = df.reset_index()
df['Date'] = df['index'].dt.date
df['Time'] = df['index'].dt.time
df.set_index(pd.to_datetime(df.Date + ' ' + df.Time), inplace=True)
for ind in df[df.count.eq(0)].index:
    df.loc[ind, 'count'] = df.loc[ind - pd.Timedelta('1D'), 'count']
df.reset_index(drop=True, inplace=True)
You can use mask to replace the 0s with NaN, then group by the time component of the DatetimeIndex and ffill, then fillna with 0 to handle the times that have no earlier value.
df_ = (df.mask(df.eq(0))
         .groupby(df.index.time)
         .ffill()   # add the parameter limit=1 if you want to fill only one day after
         .fillna(0)
       )
print (df_)
count
2020-02-01 00:00:00 12.0
2020-02-01 00:01:00 3.0
2020-02-01 00:02:00 14.0
2020-02-01 00:03:00 0.0
2020-02-01 00:04:00 22.0
2020-02-02 00:00:00 12.0
2020-02-02 00:01:00 3.0
2020-02-02 00:02:00 14.0
2020-02-02 00:03:00 0.0
2020-02-02 00:04:00 22.0
2020-02-03 00:00:00 2.0
2020-02-03 00:01:00 4.0
2020-02-03 00:02:00 1.0
2020-02-03 00:03:00 0.0
2020-02-03 00:04:00 22.0
If you want to fill with previous values ONLY if all values of the day are 0, then in the mask above change df.eq(0) to df['count'].eq(0).groupby(df.index.date).transform('all'). In this case it does not change the result.
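Spelled out, that variant might look like the following sketch (same assumptions as above: a single count column and a DatetimeIndex):
# NaN out a row only when every value of its calendar day is 0, then fill from earlier days
all_zero_day = df['count'].eq(0).groupby(df.index.date).transform('all')

df_ = (df.mask(all_zero_day)
         .groupby(df.index.time)
         .ffill()
         .fillna(0)
       )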
If you want to fill with the average of the same time of day up to the current date, you can use expanding:
(df.mask(df.eq(0))
   .groupby(df.index.time)
   .expanding().mean()
   .fillna(0)
   .reset_index(level=0, drop=True).sort_index()
)

Grouping dates by 5 minute periods irrespective of day

I have a DataFrame with data similar to the following
import pandas as pd; import numpy as np; import datetime; from datetime import timedelta;
df = pd.DataFrame(index=pd.date_range(start='20160102', end='20170301', freq='5min'))
df['value'] = np.random.randn(df.index.size)
df.index += pd.Series([timedelta(seconds=np.random.randint(-60, 60))
                       for _ in range(df.index.size)])
which looks like this
In[37]: df
Out[37]:
value
2016-01-02 00:00:33 0.546675
2016-01-02 00:04:52 1.080558
2016-01-02 00:10:46 -1.551206
2016-01-02 00:15:52 -1.278845
2016-01-02 00:19:04 -1.672387
2016-01-02 00:25:36 -0.786985
2016-01-02 00:29:35 1.067132
2016-01-02 00:34:36 -0.575365
2016-01-02 00:39:33 0.570341
2016-01-02 00:44:56 -0.636312
...
2017-02-28 23:14:57 -0.027981
2017-02-28 23:19:51 0.883150
2017-02-28 23:24:15 -0.706997
2017-02-28 23:30:09 -0.954630
2017-02-28 23:35:08 -1.184881
2017-02-28 23:40:20 0.104017
2017-02-28 23:44:10 -0.678742
2017-02-28 23:49:15 -0.959857
2017-02-28 23:54:36 -1.157165
2017-02-28 23:59:10 0.527642
Now, I'm aiming to get the mean per 5 minute period over the course of a 24 hour day - without considering what day those values actually come from.
How can I do this effectively? I would like to think I could somehow remove the actual dates from my index and then use something like pd.TimeGrouper, but I haven't figured out how to do so.
My not-so-great solution
My solution so far has been to use between_time in a loop like this, just using an arbitrary day.
aggregates = []
start_time = datetime.datetime(1990, 1, 1, 0, 0, 0)
while start_time < datetime.datetime(1990, 1, 1, 23, 59, 0):
    aggregates.append(
        (
            start_time,
            df.between_time(start_time.time(),
                            (start_time + timedelta(minutes=5)).time(),
                            include_end=False).value.mean()
        )
    )
    start_time += timedelta(minutes=5)
result = pd.DataFrame(aggregates, columns=['time', 'value'])
which works as expected
In[68]: result
Out[68]:
time value
0 1990-01-01 00:00:00 0.032667
1 1990-01-01 00:05:00 0.117288
2 1990-01-01 00:10:00 -0.052447
3 1990-01-01 00:15:00 -0.070428
4 1990-01-01 00:20:00 0.034584
5 1990-01-01 00:25:00 0.042414
6 1990-01-01 00:30:00 0.043388
7 1990-01-01 00:35:00 0.050371
8 1990-01-01 00:40:00 0.022209
9 1990-01-01 00:45:00 -0.035161
.. ... ...
278 1990-01-01 23:10:00 0.073753
279 1990-01-01 23:15:00 -0.005661
280 1990-01-01 23:20:00 -0.074529
281 1990-01-01 23:25:00 -0.083190
282 1990-01-01 23:30:00 -0.036636
283 1990-01-01 23:35:00 0.006767
284 1990-01-01 23:40:00 0.043436
285 1990-01-01 23:45:00 0.011117
286 1990-01-01 23:50:00 0.020737
287 1990-01-01 23:55:00 0.021030
[288 rows x 2 columns]
But this doesn't feel like a very Pandas-friendly solution.
IIUC then the following should work:
In [62]:
df.groupby(df.index.floor('5min').time).mean()
Out[62]:
value
00:00:00 -0.038002
00:05:00 -0.011646
00:10:00 0.010701
00:15:00 0.034699
00:20:00 0.041164
00:25:00 0.151187
00:30:00 -0.006149
00:35:00 -0.008256
00:40:00 0.021389
00:45:00 0.016851
00:50:00 -0.074825
00:55:00 0.012861
01:00:00 0.054048
01:05:00 0.041907
01:10:00 -0.004457
01:15:00 0.052428
01:20:00 -0.021518
01:25:00 -0.019010
01:30:00 0.030887
01:35:00 -0.085415
01:40:00 0.002386
01:45:00 -0.002189
01:50:00 0.049720
01:55:00 0.032292
02:00:00 -0.043642
02:05:00 0.067132
02:10:00 -0.029628
02:15:00 0.064098
02:20:00 0.042731
02:25:00 -0.031113
... ...
21:30:00 -0.018391
21:35:00 0.032155
21:40:00 0.035014
21:45:00 -0.016979
21:50:00 -0.025248
21:55:00 0.027896
22:00:00 -0.117036
22:05:00 -0.017970
22:10:00 -0.008494
22:15:00 -0.065303
22:20:00 -0.014623
22:25:00 0.076994
22:30:00 -0.030935
22:35:00 0.030308
22:40:00 -0.124668
22:45:00 0.064853
22:50:00 0.057913
22:55:00 0.002309
23:00:00 0.083586
23:05:00 -0.031043
23:10:00 -0.049510
23:15:00 0.003520
23:20:00 0.037135
23:25:00 -0.002231
23:30:00 -0.029592
23:35:00 0.040335
23:40:00 -0.021513
23:45:00 0.104421
23:50:00 -0.022280
23:55:00 -0.021283
[288 rows x 1 columns]
Here I floor the index to 5-minute intervals, then group on the time attribute and aggregate with the mean.
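An equivalent formulation (not from the original answer, just a different way to express the offset from midnight) groups on a TimedeltaIndex instead of datetime.time objects:
# offset from midnight, floored to 5 minutes; the result is indexed by
# Timedelta values (00:00:00, 00:05:00, ...) instead of time objects
offset = (df.index - df.index.normalize()).floor('5min')
print(df.groupby(offset).mean())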

Calculate difference between 'times' rows in DataFrame Pandas

My DataFrame is in the form:
TimeWeek TimeSat TimeHoli
0 6:40:00 8:00:00 8:00:00
1 6:45:00 8:05:00 8:05:00
2 6:50:00 8:09:00 8:10:00
3 6:55:00 8:11:00 8:14:00
4 6:58:00 8:13:00 8:17:00
5 7:40:00 8:15:00 8:21:00
I need to find the time difference between consecutive rows in TimeWeek, TimeSat and TimeHoli; the output must be
TimeWeekDiff TimeSatDiff TimeHoliDiff
00:05:00 00:05:00 00:05:00
00:05:00 00:04:00 00:05:00
00:05:00 00:02:00 00:04:00
00:03:00 00:02:00 00:03:00
00:02:00 00:02:00 00:04:00
I tried using df['TimeWeek'] - df['TimeWeek'].shift().fillna(0), but it throws an error:
TypeError: unsupported operand type(s) for -: 'str' and 'str'
Probably because of the presence of ':' in the column. How do I resolve this?
It looks like the error is thrown because the data is in the form of a string instead of a timestamp. First convert them to timestamps:
df2 = df.apply(lambda x: [pd.Timestamp(ts) for ts in x])
They will contain today's date by default, but this shouldn't matter once you difference the time (hopefully you don't have to worry about differencing 23:55 and 00:05 across dates).
Once converted, simply difference the DataFrame:
>>> df2 - df2.shift()
TimeWeek TimeSat TimeHoli
0 NaT NaT NaT
1 00:05:00 00:05:00 00:05:00
2 00:05:00 00:04:00 00:05:00
3 00:05:00 00:02:00 00:04:00
4 00:03:00 00:02:00 00:03:00
5 00:42:00 00:02:00 00:04:00
Depending on your needs, you can just take rows 1+ (ignoring the NaTs):
(df2 - df2.shift()).iloc[1:, :]
or you can fill the NaTs with zeros:
(df2 - df2.shift()).fillna(0)
Forget everything I just said. Pandas has great timedelta parsing.
df["TimeWeek"] = pd.to_timedelta(df["TimeWeek"])
df['TimeWeek'] - df['TimeWeek'].shift().fillna(pd.to_timedelta("00:00:00"))
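A fuller sketch of that timedelta route applied to all three columns (column names from the question; the frame below is reconstructed from the sample data, and the leading NaT row is simply dropped):
import pandas as pd

df = pd.DataFrame({
    'TimeWeek': ['6:40:00', '6:45:00', '6:50:00', '6:55:00', '6:58:00', '7:40:00'],
    'TimeSat':  ['8:00:00', '8:05:00', '8:09:00', '8:11:00', '8:13:00', '8:15:00'],
    'TimeHoli': ['8:00:00', '8:05:00', '8:10:00', '8:14:00', '8:17:00', '8:21:00'],
})

for col in ['TimeWeek', 'TimeSat', 'TimeHoli']:
    td = pd.to_timedelta(df[col])          # parse 'H:MM:SS' strings as timedeltas
    df[col + 'Diff'] = td - td.shift()     # row-to-row difference; first row is NaT

print(df.filter(like='Diff').iloc[1:])     # matches the desired output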
>>> import pandas as pd
>>> df = pd.DataFrame({'TimeWeek': ['6:40:00', '6:45:00', '6:50:00', '6:55:00', '7:40:00']})
>>> df["TimeWeek_date"] = pd.to_datetime(df["TimeWeek"], format="%H:%M:%S")
>>> print(df)
TimeWeek TimeWeek_date
0 6:40:00 1900-01-01 06:40:00
1 6:45:00 1900-01-01 06:45:00
2 6:50:00 1900-01-01 06:50:00
3 6:55:00 1900-01-01 06:55:00
4 7:40:00 1900-01-01 07:40:00
>>> df['TimeWeekDiff'] = (df['TimeWeek_date'] - df['TimeWeek_date'].shift().fillna(pd.to_datetime("00:00:00", format="%H:%M:%S")))
>>> print(df)
TimeWeek TimeWeek_date TimeWeekDiff
0 6:40:00 1900-01-01 06:40:00 06:40:00
1 6:45:00 1900-01-01 06:45:00 00:05:00
2 6:50:00 1900-01-01 06:50:00 00:05:00
3 6:55:00 1900-01-01 06:55:00 00:05:00
4 7:40:00 1900-01-01 07:40:00 00:45:00
