Problems with replacing NaT in pandas correctly - python

I have a dataframe that contains some NaT values.
Date Value
6312957 2012-01-01 23:58:00 -49
6312958 2012-01-01 23:59:00 -49
6312959 NaT -48
6312960 2012-01-02 00:01:00 -47
6312961 2012-01-02 00:02:00 -46
I am trying to replace these NaT values by adding one minute to the previous entry.
indices_of_NAT = np.flatnonzero(pd.isna(df.loc[:, "Date"]))
df.loc[indices_of_NAT, "Date"] = df.loc[indices_of_NAT - 1, "Date"] + pd.Timedelta(minutes=1)
This produces the correct timestamps and indices, which I checked manually. The only problem is that they don't replace the NaT values for whatever reason. I wonder if something goes wrong with the indexing in my last line of code. Is there something obvious I am missing?

You can fillna with the shifted values + 1 min:
df['Date'] = df['Date'].fillna(df['Date'].shift().add(pd.Timedelta('1min')))
Another method is to interpolate. For this you need to temporarily convert to a number. This way you can fill more than one gap and the increment will be calculated automatically, and there are many nice interpolation methods (see doc):
df['Date'] = (pd.to_datetime(pd.to_numeric(df['Date'])
.mask(df['Date'].isna())
.interpolate('linear'))
)
Example:
Date Value shift interpolate
0 2012-01-01 23:58:00 -49 2012-01-01 23:58:00 2012-01-01 23:58:00
1 2012-01-01 23:59:00 -49 2012-01-01 23:59:00 2012-01-01 23:59:00
2 NaT -48 2012-01-02 00:00:00 2012-01-02 00:00:00
3 2012-01-02 00:01:00 -47 2012-01-02 00:01:00 2012-01-02 00:01:00
4 NaT -48 2012-01-02 00:02:00 2012-01-02 00:01:20
5 NaT -48 NaT 2012-01-02 00:01:40
6 2012-01-02 00:02:00 -46 2012-01-02 00:02:00 2012-01-02 00:02:00
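In the example above, the shift approach leaves row 5 as NaT because the row before it was missing too. If the step is known to be exactly one minute, one possible way to fill runs of consecutive NaT is to count the position inside each run and add that many minutes to the last valid timestamp. This is only a sketch: the names run_id and steps are made up for the illustration, and a NaT at the very start of the column would stay unfilled.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime([
    "2012-01-01 23:58:00", None, None, "2012-01-02 00:01:00",
]))

is_nat = dates.isna()

# The group id increases at every valid row, so each valid row and the
# NaT rows that follow it share one id.
run_id = (~is_nat).cumsum()

# 0 on valid rows, then 1, 2, 3, ... along each run of NaT.
steps = is_nat.astype(int).groupby(run_id).cumsum()

# Last valid timestamp plus the number of minutes into the gap.
filled = dates.ffill() + pd.to_timedelta(steps, unit="min")
print(filled)
```

Unlike interpolation, this does not spread the gap evenly between the surrounding timestamps; it simply assumes a fixed one-minute step.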

Use Series.fillna with the shifted values plus 1 minute:
df['Date'] = df['Date'].fillna(df['Date'].shift() + pd.Timedelta(minutes=1))
Or forward fill the missing values and add 1 minute:
df['Date'] = df['Date'].fillna(df['Date'].ffill() + pd.Timedelta(minutes=1))
You can see the difference between the two with other data:
df['Date'] = pd.to_datetime(df['Date'])
df['Date1'] = df['Date'].fillna(df['Date'].shift() + pd.Timedelta(minutes=1))
df['Date2'] = df['Date'].fillna(df['Date'].ffill() + pd.Timedelta(minutes=1))
print (df)
Date Value Date1 Date2
6312957 2012-01-01 23:58:00 -49 2012-01-01 23:58:00 2012-01-01 23:58:00
6312958 2012-01-01 23:59:00 -49 2012-01-01 23:59:00 2012-01-01 23:59:00
6312959 NaT -48 2012-01-02 00:00:00 2012-01-02 00:00:00
6312960 2012-01-02 00:01:00 -47 2012-01-02 00:01:00 2012-01-02 00:01:00
6312961 2012-01-02 00:02:00 -46 2012-01-02 00:02:00 2012-01-02 00:02:00
6312962 NaT -47 2012-01-02 00:03:00 2012-01-02 00:03:00
6312963 NaT -47 NaT 2012-01-02 00:03:00
6312967 2012-01-02 00:01:00 -47 2012-01-02 00:01:00 2012-01-02 00:01:00
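As for why the assignment in the question seems to do nothing: a likely cause (an assumption, since the original data is not available) is index alignment. When a Series is assigned through .loc, pandas aligns it on its index labels. The right-hand side is labelled with the previous rows, not with the NaT rows, so nothing matches and NaT is written back. A small sketch on a default integer index, where the variable names are made up for the illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2012-01-01 23:58:00", "2012-01-01 23:59:00", None,
        "2012-01-02 00:01:00", "2012-01-02 00:02:00",
    ]),
    "Value": [-49, -49, -48, -47, -46],
})

nat_labels = df.index[df["Date"].isna().to_numpy()]  # labels of the NaT rows
prev_labels = nat_labels - 1                         # labels of the rows before them

# This Series is indexed by prev_labels, not by nat_labels.
rhs = df.loc[prev_labels, "Date"] + pd.Timedelta(minutes=1)

aligned = df.copy()
aligned.loc[nat_labels, "Date"] = rhs             # aligned on labels: stays NaT

fixed = df.copy()
fixed.loc[nat_labels, "Date"] = rhs.to_numpy()    # plain array: assigned by position

print(aligned["Date"].isna().sum(), fixed["Date"].isna().sum())
```

Passing a plain array drops the index, so the values are written positionally into the selected rows.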

Related

Replace duplicated time index and fill in by time interpolation

I have a dataframe with a wrong timestamp index.
Instead of being sampled in periods of 1 min, the time index contains duplicated entries at multiples of 10 minutes:
2021-08-01 00:00:00
2021-08-01 00:00:00
2021-08-01 00:00:00
2021-08-01 00:00:00
...
2021-08-01 00:10:00
2021-08-01 00:10:00
....
2021-08-01 00:20:00
2021-08-01 00:20:00
... and so on
The desired result after the postprocessing should be
2021-08-01 00:00:00
2021-08-01 00:01:00
2021-08-01 00:02:00
2021-08-01 00:03:00
...
2021-08-01 00:10:00
2021-08-01 00:11:00
...and so on
I have been trying to use the pandas index functions to fill the duplicated indexes with NaNs and then interpolate to 1 min, but without success.
Any hints?
You can add timedeltas in 1-minute steps, using a counter over the duplicated index values from GroupBy.cumcount together with to_timedelta:
print (df)
b
a
2021-08-01 00:00:00 1
2021-08-01 00:00:00 1
2021-08-01 00:00:00 1
2021-08-01 00:00:00 1
2021-08-01 00:10:00 1
2021-08-01 00:10:00 1
2021-08-01 00:20:00 1
2021-08-01 00:20:00 1
df.index = pd.to_datetime(df.index)
df.index += pd.to_timedelta(df.groupby(level=0).cumcount(), 'Min')
print (df)
b
2021-08-01 00:00:00 1
2021-08-01 00:01:00 1
2021-08-01 00:02:00 1
2021-08-01 00:03:00 1
2021-08-01 00:10:00 1
2021-08-01 00:11:00 1
2021-08-01 00:20:00 1
2021-08-01 00:21:00 1
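For anyone who wants to run this end to end, here is a self-contained sketch of the same cumcount idea. The sample frame is rebuilt by hand from the printed data, and the intermediate name offsets is made up for the illustration. The counter is converted to a plain array first so the addition happens position by position.

```python
import pandas as pd

idx = pd.to_datetime(
    ["2021-08-01 00:00:00"] * 4
    + ["2021-08-01 00:10:00"] * 2
    + ["2021-08-01 00:20:00"] * 2
)
df = pd.DataFrame({"b": 1}, index=idx)

# 0, 1, 2, ... within each group of identical timestamps, as minutes.
offsets = pd.to_timedelta(df.groupby(level=0).cumcount().to_numpy(), unit="min")

df.index = df.index + offsets
print(df)
```

This only works if no group has more duplicates than fit into the gap to the next timestamp (here at most 10 per group); otherwise the shifted values would collide with the following block.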

How to reset a date time column to be in increments of one minute in python

I have a dataframe that has a date time column called start time and it is set to a default of 12:00:00 AM. I would like to reset this column so that the first row is 00:01:00 and the second row is 00:02:00, that is one minute interval.
This is the original table.
ID State Time End Time
A001 12:00:00 12:00:00
A002 12:00:00 12:00:00
A003 12:00:00 12:00:00
A004 12:00:00 12:00:00
A005 12:00:00 12:00:00
A006 12:00:00 12:00:00
A007 12:00:00 12:00:00
I want to reset the start time column so that my output is this:
ID State Time End Time
A001 0:00:00 12:00:00
A002 0:00:01 12:00:00
A003 0:00:02 12:00:00
A004 0:00:03 12:00:00
A005 0:00:04 12:00:00
A006 0:00:05 12:00:00
A007 0:00:06 12:00:00
How do I go about this?
You could use pd.date_range:
df['Start Time'] = pd.date_range('00:00', periods=df['Start Time'].shape[0], freq='1min')
gives you
df
Out[23]:
Start Time
0 2019-09-30 00:00:00
1 2019-09-30 00:01:00
2 2019-09-30 00:02:00
3 2019-09-30 00:03:00
4 2019-09-30 00:04:00
5 2019-09-30 00:05:00
6 2019-09-30 00:06:00
7 2019-09-30 00:07:00
8 2019-09-30 00:08:00
9 2019-09-30 00:09:00
Supply a full date/time string as the first argument to get a different starting date.
First we convert your State Time column to datetime type. Then we use pd.date_range and use the first time as starting point with a frequency of 1 minute.
df['State Time'] = pd.to_datetime(df['State Time'])
df['State Time'] = pd.date_range(start=df['State Time'].min(),
periods=len(df),
freq='min').time
Output
ID State Time End Time
0 A001 12:00:00 12:00:00
1 A002 12:01:00 12:00:00
2 A003 12:02:00 12:00:00
3 A004 12:03:00 12:00:00
4 A005 12:04:00 12:00:00
5 A006 12:05:00 12:00:00
6 A007 12:06:00 12:00:00
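If no calendar date is wanted at all, another option might be to store elapsed time instead of a timestamp, using pd.timedelta_range. This is a sketch under the assumption that a timedelta column is acceptable for the use case; the small frame below is a stand-in for the real table.

```python
import pandas as pd

df = pd.DataFrame({"ID": ["A001", "A002", "A003", "A004"]})

# One Timedelta per row, starting at zero and stepping by one minute.
df["Start Time"] = pd.timedelta_range(start="0min", periods=len(df), freq="min")
print(df)
```

Changing freq to "s" would give one-second steps instead, which is what the desired output in the question actually shows.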

Round Datetime Object DOWN in Python

I'm trying to round a datetime object DOWN in Python, and am having a few problems. There is lots on here about rounding datetime but I can't find anything specific to my needs.
I'm trying to get a date range of 15 minute intervals, with .now() being the end point. To get my end= I do:
pd.Timestamp.now().round('15min')
which returns:
2019-08-16 11:15:00
which is exactly what I want. However, if I run this at 11:23, say, it returns 2019-08-16 11:30:00, which is not what I want: it should round down to 2019-08-16 11:15:00 until the moment we strike 11:30.
Is there a simple way to get it to round down? I haven't had any luck finding one.
Cheers for any help
Use Timestamp.floor:
print (pd.Timestamp('2019-08-16 11:15:00').floor('15min'))
2019-08-16 11:15:00
print (pd.Timestamp('2019-08-16 11:23:00').floor('15min'))
2019-08-16 11:15:00
print (pd.Timestamp('2019-08-16 11:30:00').floor('15min'))
2019-08-16 11:30:00
For testing:
df = pd.DataFrame({'dates':pd.date_range('2009-01-01', freq='min', periods=20)})
df['new'] = df['dates'].dt.floor('15min')
print (df)
dates new
0 2009-01-01 00:00:00 2009-01-01 00:00:00
1 2009-01-01 00:01:00 2009-01-01 00:00:00
2 2009-01-01 00:02:00 2009-01-01 00:00:00
3 2009-01-01 00:03:00 2009-01-01 00:00:00
4 2009-01-01 00:04:00 2009-01-01 00:00:00
5 2009-01-01 00:05:00 2009-01-01 00:00:00
6 2009-01-01 00:06:00 2009-01-01 00:00:00
7 2009-01-01 00:07:00 2009-01-01 00:00:00
8 2009-01-01 00:08:00 2009-01-01 00:00:00
9 2009-01-01 00:09:00 2009-01-01 00:00:00
10 2009-01-01 00:10:00 2009-01-01 00:00:00
11 2009-01-01 00:11:00 2009-01-01 00:00:00
12 2009-01-01 00:12:00 2009-01-01 00:00:00
13 2009-01-01 00:13:00 2009-01-01 00:00:00
14 2009-01-01 00:14:00 2009-01-01 00:00:00
15 2009-01-01 00:15:00 2009-01-01 00:15:00
16 2009-01-01 00:16:00 2009-01-01 00:15:00
17 2009-01-01 00:17:00 2009-01-01 00:15:00
18 2009-01-01 00:18:00 2009-01-01 00:15:00
19 2009-01-01 00:19:00 2009-01-01 00:15:00
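To tie this back to the original goal (a range of 15-minute steps ending at the floored current time), a minimal sketch could look like the following. A fixed timestamp stands in for pd.Timestamp.now() so the result is reproducible, and the number of periods is an arbitrary choice for the illustration.

```python
import pandas as pd

# Stand-in for pd.Timestamp.now(); 11:23 floors to 11:15.
now = pd.Timestamp("2019-08-16 11:23:00")
end = now.floor("15min")

# Four 15-minute steps ending at the floored time.
rng = pd.date_range(end=end, periods=4, freq="15min")
print(rng)
```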

Calculate difference between 'times' rows in DataFrame Pandas

My DataFrame is in the Form:
TimeWeek TimeSat TimeHoli
0 6:40:00 8:00:00 8:00:00
1 6:45:00 8:05:00 8:05:00
2 6:50:00 8:09:00 8:10:00
3 6:55:00 8:11:00 8:14:00
4 6:58:00 8:13:00 8:17:00
5 7:40:00 8:15:00 8:21:00
I need to find the time difference between each row in TimeWeek , TimeSat and TimeHoli, the output must be
TimeWeekDiff TimeSatDiff TimeHoliDiff
00:05:00 00:05:00 00:05:00
00:05:00 00:04:00 00:05:00
00:05:00 00:02:00 00:04:00
00:03:00 00:02:00 00:03:00
00:02:00 00:02:00 00:04:00
I tried using df['TimeWeek'] - df['TimeWeek'].shift().fillna(0), but it throws an error:
TypeError: unsupported operand type(s) for -: 'str' and 'str'
Probably because of the presence of ':' in the column. How do I resolve this?
It looks like the error is thrown because the data is in the form of a string instead of a timestamp. First convert them to timestamps:
df2 = df.apply(lambda x: [pd.Timestamp(ts) for ts in x])
They will contain today's date by default, but this shouldn't matter once you difference the time (hopefully you don't have to worry about differencing 23:55 and 00:05 across dates).
Once converted, simply difference the DataFrame:
>>> df2 - df2.shift()
TimeWeek TimeSat TimeHoli
0 NaT NaT NaT
1 00:05:00 00:05:00 00:05:00
2 00:05:00 00:04:00 00:05:00
3 00:05:00 00:02:00 00:04:00
4 00:03:00 00:02:00 00:03:00
5 00:42:00 00:02:00 00:04:00
Depending on your needs, you can just take rows 1+ (ignoring the NaTs):
(df2 - df2.shift()).iloc[1:, :]
or you can fill the NaTs with zeros:
(df2 - df2.shift()).fillna(pd.Timedelta(0))
Forget everything I just said. Pandas has great timedelta parsing.
df["TimeWeek"] = pd.to_timedelta(df["TimeWeek"])
df['TimeWeek'] - df['TimeWeek'].shift().fillna(pd.to_timedelta("00:00:00"))
>>> import pandas as pd
>>> df = pd.DataFrame({'TimeWeek': ['6:40:00', '6:45:00', '6:50:00', '6:55:00', '7:40:00']})
>>> df["TimeWeek_date"] = pd.to_datetime(df["TimeWeek"], format="%H:%M:%S")
>>> print(df)
TimeWeek TimeWeek_date
0 6:40:00 1900-01-01 06:40:00
1 6:45:00 1900-01-01 06:45:00
2 6:50:00 1900-01-01 06:50:00
3 6:55:00 1900-01-01 06:55:00
4 7:40:00 1900-01-01 07:40:00
>>> df['TimeWeekDiff'] = (df['TimeWeek_date'] - df['TimeWeek_date'].shift().fillna(pd.to_datetime("00:00:00", format="%H:%M:%S")))
>>> print(df)
TimeWeek TimeWeek_date TimeWeekDiff
0 6:40:00 1900-01-01 06:40:00 06:40:00
1 6:45:00 1900-01-01 06:45:00 00:05:00
2 6:50:00 1900-01-01 06:50:00 00:05:00
3 6:55:00 1900-01-01 06:55:00 00:05:00
4 7:40:00 1900-01-01 07:40:00 00:45:00
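Putting the timedelta idea together for all three columns at once, a short sketch might look like this. The frame is rebuilt by hand from the question's data, and the names deltas and diffs are made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "TimeWeek": ["6:40:00", "6:45:00", "6:50:00", "6:55:00", "6:58:00", "7:40:00"],
    "TimeSat":  ["8:00:00", "8:05:00", "8:09:00", "8:11:00", "8:13:00", "8:15:00"],
    "TimeHoli": ["8:00:00", "8:05:00", "8:10:00", "8:14:00", "8:17:00", "8:21:00"],
})

# Parse every column from "H:MM:SS" strings into timedeltas.
deltas = df.apply(pd.to_timedelta)

# Row-to-row difference; the first row has no predecessor, so drop it.
diffs = deltas.diff().iloc[1:]
print(diffs)
```

Working with timedeltas avoids attaching an arbitrary date to the values, so there is nothing to strip off afterwards.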

averaging every five minutes data as one datapoint in pandas dataframe

I have a Dataframe in Pandas like this
1. 2013-10-09 09:00:05
2. 2013-10-09 09:01:00
3. 2013-10-09 09:02:00
4. ............
5. ............
6. ............
7. 2013-10-10 09:15:05
8. 2013-10-10 09:16:00
9. 2013-10-10 09:17:00
I would like to reduce the size of the Dataframe by averaging every 5 minutes of data into one datapoint, like this:
1. 2013-10-09 09:05:00
2. 2013-10-09 09:10:00
3. 2013-10-09 09:15:00
Can someone help me with this ??
You may want to look at pandas resample:
df['Data'].resample('5min').mean()
In old pandas versions this was written with the how argument, which defaulted to 'mean' and has since been removed:
df['Data'].resample('5Min', how='mean')
For example:
>>> import numpy as np
>>> import pandas as pd
>>> rng = pd.date_range('1/1/2012', periods=10, freq='min')
>>> df = pd.DataFrame({'Data':np.random.randint(0, 500, len(rng))}, index=rng)
>>> df
Data
2012-01-01 00:00:00 488
2012-01-01 00:01:00 172
2012-01-01 00:02:00 276
2012-01-01 00:03:00 5
2012-01-01 00:04:00 233
2012-01-01 00:05:00 266
2012-01-01 00:06:00 103
2012-01-01 00:07:00 40
2012-01-01 00:08:00 274
2012-01-01 00:09:00 494
>>>
>>> df['Data'].resample('5min').mean()
2012-01-01 00:00:00 234.8
2012-01-01 00:05:00 235.4
You can find more examples here.
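The desired output in the question labels each average with the end of its 5-minute window (09:05, 09:10, ...), while resample labels bins with their left edge by default. One way to get end-of-window labels is the label argument. A small sketch with deterministic data, so the result is easy to check by hand:

```python
import pandas as pd

rng = pd.date_range("2012-01-01", periods=10, freq="min")
s = pd.Series(range(10), index=rng)

# Bins are still [00:00, 00:05) and [00:05, 00:10), but each one is
# labelled with its right edge instead of its left edge.
out = s.resample("5min", label="right").mean()
print(out)
```

Whether a point that falls exactly on a boundary should belong to the earlier or the later window is controlled separately by the closed argument.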
