Problems with replacing NaT in pandas correctly

I have a dataframe that contains some NaT values.
Date Value
6312957 2012-01-01 23:58:00 -49
6312958 2012-01-01 23:59:00 -49
6312959 NaT -48
6312960 2012-01-02 00:01:00 -47
6312961 2012-01-02 00:02:00 -46
I am trying to replace these NaT values by adding one minute to the previous entry.
indices_of_NAT = np.flatnonzero(pd.isna(df.loc[:, "Date"]))
df.loc[indices_of_NAT, "Date"] = df.loc[indices_of_NAT - 1, "Date"] + pd.Timedelta(minutes=1)
This produces the correct timestamps and indices, which I checked manually. The only problem is that they don't replace the NaT values for whatever reason. I wonder if something goes wrong with the indexing in my last line of code. Is there something obvious I am missing?

You can fillna with the shifted values + 1 min:
df['Date'] = df['Date'].fillna(df['Date'].shift().add(pd.Timedelta('1min')))
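A minimal, self-contained sketch of the fillna approach, using a hypothetical five-row frame with a default RangeIndex (an assumption; the question's index starts at 6312957). It also illustrates a likely reason the assignment in the question left the NaT in place: when the right-hand side of a .loc assignment is a Series, pandas aligns it on index labels, so values labelled with the previous row never line up with the NaT row. Assigning a plain array via .to_numpy() sidesteps that alignment.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a default RangeIndex (0..4).
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2012-01-01 23:58:00", "2012-01-01 23:59:00", None,
        "2012-01-02 00:01:00", "2012-01-02 00:02:00",
    ]),
    "Value": [-49, -49, -48, -47, -46],
})

one_min = pd.Timedelta(minutes=1)
nat_pos = np.flatnonzero(df["Date"].isna())  # positions of the NaT rows

# The question's approach: the right-hand Series carries the labels of the
# previous rows, so label alignment finds nothing for the NaT rows.
attempt = df.copy()
attempt.loc[nat_pos, "Date"] = attempt.loc[nat_pos - 1, "Date"] + one_min

# Same idea, but a plain array is assigned by position, not by label.
by_array = df.copy()
by_array.loc[nat_pos, "Date"] = (
    by_array.loc[nat_pos - 1, "Date"] + one_min
).to_numpy()

# The fillna approach from the answer above.
by_fillna = df.copy()
by_fillna["Date"] = by_fillna["Date"].fillna(by_fillna["Date"].shift() + one_min)

print(attempt["Date"].isna().sum(), by_array["Date"].isna().sum())
print(by_fillna)
```

The fillna version is usually the easier one to read, since it never needs explicit positions.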
Another method is to interpolate. For this you need to temporarily convert to a number. This way you can fill more than one gap and the increment will be calculated automatically, and there are many nice interpolation methods (see doc):
df['Date'] = (pd.to_datetime(pd.to_numeric(df['Date'])
.mask(df['Date'].isna())
.interpolate('linear'))
)
Example:
Date Value shift interpolate
0 2012-01-01 23:58:00 -49 2012-01-01 23:58:00 2012-01-01 23:58:00
1 2012-01-01 23:59:00 -49 2012-01-01 23:59:00 2012-01-01 23:59:00
2 NaT -48 2012-01-02 00:00:00 2012-01-02 00:00:00
3 2012-01-02 00:01:00 -47 2012-01-02 00:01:00 2012-01-02 00:01:00
4 NaT -48 2012-01-02 00:02:00 2012-01-02 00:01:20
5 NaT -48 NaT 2012-01-02 00:01:40
6 2012-01-02 00:02:00 -46 2012-01-02 00:02:00 2012-01-02 00:02:00
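For comparison, here is a sketch of the same interpolation idea done through elapsed seconds instead of pd.to_numeric. This is a swapped-in variant, not the code from the answer; the names origin, elapsed and filled are made up for the illustration, and the data mirrors the Date column of the example table.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime([
    "2012-01-01 23:58:00", "2012-01-01 23:59:00", None,
    "2012-01-02 00:01:00", None, None, "2012-01-02 00:02:00",
]))

origin = dates.min()                           # reference point; min() skips NaT
elapsed = (dates - origin).dt.total_seconds()  # NaT rows become NaN
# Interpolate the float seconds, then convert back to timestamps.
filled = origin + pd.to_timedelta(elapsed.interpolate("linear"), unit="s")
print(filled)
```

Working in seconds relative to the first timestamp keeps the intermediate numbers small, which can make the interpolated values easier to inspect.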

Use Series.fillna with the shifted values plus 1 minute:
df['Date'] = df['Date'].fillna(df['Date'].shift() + pd.Timedelta(minutes=1))
Or forward-fill the missing values and add 1 minute:
df['Date'] = df['Date'].fillna(df['Date'].ffill() + pd.Timedelta(minutes=1))
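You can see the difference on data with consecutive NaT values: shift leaves the second NaT unfilled, while ffill gives both rows the same timestamp: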
df['Date'] = pd.to_datetime(df['Date'])
df['Date1'] = df['Date'].fillna(df['Date'].shift() + pd.Timedelta(minutes=1))
df['Date2'] = df['Date'].fillna(df['Date'].ffill() + pd.Timedelta(minutes=1))
print (df)
Date Value Date1 Date2
6312957 2012-01-01 23:58:00 -49 2012-01-01 23:58:00 2012-01-01 23:58:00
6312958 2012-01-01 23:59:00 -49 2012-01-01 23:59:00 2012-01-01 23:59:00
6312959 NaT -48 2012-01-02 00:00:00 2012-01-02 00:00:00
6312960 2012-01-02 00:01:00 -47 2012-01-02 00:01:00 2012-01-02 00:01:00
6312961 2012-01-02 00:02:00 -46 2012-01-02 00:02:00 2012-01-02 00:02:00
6312962 NaT -47 2012-01-02 00:03:00 2012-01-02 00:03:00
6312963 NaT -47 NaT 2012-01-02 00:03:00
6312967 2012-01-02 00:01:00 -47 2012-01-02 00:01:00 2012-01-02 00:01:00
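If consecutive NaT rows should keep counting up (previous valid value + 1 minute, + 2 minutes, and so on), one possible sketch is to combine ffill with a per-gap counter. This goes beyond what the answer shows; the names is_gap, block and steps are hypothetical helpers.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime([
    "2012-01-02 00:02:00", None, None, "2012-01-02 00:01:00",
]))

is_gap = dates.isna()
# Each valid timestamp starts a new block; NaT rows join the block before them.
block = (~is_gap).cumsum()
# Position of each NaT inside its block: 1, 2, ... (0 for valid rows).
steps = is_gap.astype(int).groupby(block).cumsum()
filled = dates.ffill() + pd.to_timedelta(steps, unit="min")
print(filled)
```

Valid rows get a zero offset, so they pass through unchanged; a NaT at the very start stays NaT because there is nothing to forward-fill from.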

Related

Replace duplicated time index and fill by time interpolation

I have a dataframe with a wrong timestamp.
The time index is wrong: instead of being sampled at 1-minute intervals, it contains duplicated index values at multiples of 10 minutes
2021-08-01 00:00:00
2021-08-01 00:00:00
2021-08-01 00:00:00
2021-08-01 00:00:00
...
2021-08-01 00:10:00
2021-08-01 00:10:00
....
2021-08-01 00:20:00
2021-08-01 00:20:00
... and so on
The desired result after the postprocessing should be
2021-08-01 00:00:00
2021-08-01 00:01:00
2021-08-01 00:02:00
2021-08-01 00:03:00
...
2021-08-01 00:10:00
2021-08-01 00:11:00
...and so on
I have been trying to use pandas index functions to fill the duplicated indexes with NaNs and then interpolate to 1 min, but without success.
Any hints?

You can add 1-minute timedeltas based on a counter of the duplicated index values, using GroupBy.cumcount with to_timedelta:
print (df)
b
a
2021-08-01 00:00:00 1
2021-08-01 00:00:00 1
2021-08-01 00:00:00 1
2021-08-01 00:00:00 1
2021-08-01 00:10:00 1
2021-08-01 00:10:00 1
2021-08-01 00:20:00 1
2021-08-01 00:20:00 1
df.index = pd.to_datetime(df.index)
df.index += pd.to_timedelta(df.groupby(level=0).cumcount(), 'Min')
print (df)
b
2021-08-01 00:00:00 1
2021-08-01 00:01:00 1
2021-08-01 00:02:00 1
2021-08-01 00:03:00 1
2021-08-01 00:10:00 1
2021-08-01 00:11:00 1
2021-08-01 00:20:00 1
2021-08-01 00:21:00 1
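A self-contained sketch of the cumcount idea, with the sample frame built inline (the column name b and the values are taken from the printout above; the variable names stamps and counter are made up here). The counter is turned into a plain array before being added to the index.

```python
import pandas as pd

# Hypothetical frame reproducing the duplicated 10-minute stamps.
stamps = (["2021-08-01 00:00:00"] * 4
          + ["2021-08-01 00:10:00"] * 2
          + ["2021-08-01 00:20:00"] * 2)
df = pd.DataFrame({"b": 1}, index=pd.to_datetime(stamps))

# 0, 1, 2, ... within each group of identical timestamps.
counter = df.groupby(level=0).cumcount().to_numpy()
df.index = df.index + pd.to_timedelta(counter, unit="min")
print(df)
```

Note that this only spreads duplicates forward from each stamp; if a group held more than ten rows, its minutes would run into the next group.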

How to reset a date time column to be in increments of one minute in python

I have a dataframe that has a date time column called start time and it is set to a default of 12:00:00 AM. I would like to reset this column so that the first row is 00:01:00 and the second row is 00:02:00, that is one minute interval.
This is the original table.
ID State Time End Time
A001 12:00:00 12:00:00
A002 12:00:00 12:00:00
A003 12:00:00 12:00:00
A004 12:00:00 12:00:00
A005 12:00:00 12:00:00
A006 12:00:00 12:00:00
A007 12:00:00 12:00:00
I want to reset the start time column so that my output is this:
ID State Time End Time
A001 0:00:00 12:00:00
A002 0:00:01 12:00:00
A003 0:00:02 12:00:00
A004 0:00:03 12:00:00
A005 0:00:04 12:00:00
A006 0:00:05 12:00:00
A007 0:00:06 12:00:00
How do I go about this?

You could use pd.date_range:
df['Start Time'] = pd.date_range('00:00', periods=df['Start Time'].shape[0], freq='1min')
gives you
df
Out[23]:
Start Time
0 2019-09-30 00:00:00
1 2019-09-30 00:01:00
2 2019-09-30 00:02:00
3 2019-09-30 00:03:00
4 2019-09-30 00:04:00
5 2019-09-30 00:05:00
6 2019-09-30 00:06:00
7 2019-09-30 00:07:00
8 2019-09-30 00:08:00
9 2019-09-30 00:09:00
A bare time such as '00:00' is combined with the current date; supply a full date/time string to get another starting date.

First we convert your State Time column to datetime type. Then we use pd.date_range and use the first time as starting point with a frequency of 1 minute.
df['State Time'] = pd.to_datetime(df['State Time'])
df['State Time'] = pd.date_range(start=df['State Time'].min(),
periods=len(df),
freq='min').time
Output
ID State Time End Time
0 A001 12:00:00 12:00:00
1 A002 12:01:00 12:00:00
2 A003 12:02:00 12:00:00
3 A004 12:03:00 12:00:00
4 A005 12:04:00 12:00:00
5 A006 12:05:00 12:00:00
6 A007 12:06:00 12:00:00
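To start at midnight instead of the existing 12:00:00, a sketch along the same lines is to pass an explicit start to pd.date_range and keep only the time of day. The date used here is an arbitrary assumption, and the frame is rebuilt from the question's table.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [f"A{i:03d}" for i in range(1, 8)],
    "State Time": ["12:00:00"] * 7,
    "End Time": ["12:00:00"] * 7,
})

# The date part is a throwaway assumption; only the time of day is kept.
stamps = pd.date_range("2019-09-30 00:00:00", periods=len(df), freq="min")
df["State Time"] = stamps.time
print(df)
```

The column then holds datetime.time objects, which display cleanly but do not support arithmetic; keep the full timestamps if later calculations are needed.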

Round Datetime Object DOWN in Python

I'm trying to round a datetime object DOWN in Python, and am having a few problems. There is lots on here about rounding datetime but I can't find anything specific to my needs.
I'm trying to get a date range of 15 minute intervals, with .now() being the end point. To get my end= I do:
pd.Timestamp.now().round('15min')
which returns:
2019-08-16 11:15:00 which is exactly what I want, however, if I run this at 11:23 say, it will return me 2019-08-16 11:30:00, and that's not actually what I want, I want it to round down to 2019-08-16 11:15:00 up until the moment we strike 11:30.
Is there a simple way to get it to round down as I haven't had any luck finding the answer if so.
Cheers for any help

Use Timestamp.floor:
print (pd.Timestamp('2019-08-16 11:15:00').floor('15min'))
2019-08-16 11:15:00
print (pd.Timestamp('2019-08-16 11:23:00').floor('15min'))
2019-08-16 11:15:00
print (pd.Timestamp('2019-08-16 11:30:00').floor('15min'))
2019-08-16 11:30:00
For testing:
df = pd.DataFrame({'dates':pd.date_range('2009-01-01', freq='min', periods=20)})
df['new'] = df['dates'].dt.floor('15min')
print (df)
0 2009-01-01 00:00:00 2009-01-01 00:00:00
1 2009-01-01 00:01:00 2009-01-01 00:00:00
2 2009-01-01 00:02:00 2009-01-01 00:00:00
3 2009-01-01 00:03:00 2009-01-01 00:00:00
4 2009-01-01 00:04:00 2009-01-01 00:00:00
5 2009-01-01 00:05:00 2009-01-01 00:00:00
6 2009-01-01 00:06:00 2009-01-01 00:00:00
7 2009-01-01 00:07:00 2009-01-01 00:00:00
8 2009-01-01 00:08:00 2009-01-01 00:00:00
9 2009-01-01 00:09:00 2009-01-01 00:00:00
10 2009-01-01 00:10:00 2009-01-01 00:00:00
11 2009-01-01 00:11:00 2009-01-01 00:00:00
12 2009-01-01 00:12:00 2009-01-01 00:00:00
13 2009-01-01 00:13:00 2009-01-01 00:00:00
14 2009-01-01 00:14:00 2009-01-01 00:00:00
15 2009-01-01 00:15:00 2009-01-01 00:15:00
16 2009-01-01 00:16:00 2009-01-01 00:15:00
17 2009-01-01 00:17:00 2009-01-01 00:15:00
18 2009-01-01 00:18:00 2009-01-01 00:15:00
19 2009-01-01 00:19:00 2009-01-01 00:15:00
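Tying this back to the question's goal, here is a sketch of a 15-minute date range whose end point is the floored timestamp. A fixed timestamp stands in for pd.Timestamp.now() so the result is reproducible, and the four periods are an arbitrary choice.

```python
import pandas as pd

now = pd.Timestamp("2019-08-16 11:23:00")  # stand-in for pd.Timestamp.now()
end = now.floor("15min")                   # rounds down to the last 15-minute mark

# Four 15-minute marks ending at the floored timestamp.
rng = pd.date_range(end=end, periods=4, freq="15min")
print(rng)
```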

Calculate difference between 'times' rows in DataFrame Pandas

My DataFrame is in the Form:
TimeWeek TimeSat TimeHoli
0 6:40:00 8:00:00 8:00:00
1 6:45:00 8:05:00 8:05:00
2 6:50:00 8:09:00 8:10:00
3 6:55:00 8:11:00 8:14:00
4 6:58:00 8:13:00 8:17:00
5 7:40:00 8:15:00 8:21:00
I need to find the time difference between each row in TimeWeek , TimeSat and TimeHoli, the output must be
TimeWeekDiff TimeSatDiff TimeHoliDiff
00:05:00 00:05:00 00:05:00
00:05:00 00:04:00 00:05:00
00:05:00 00:02:00 00:04:00
00:03:00 00:02:00 00:03:00
00:02:00 00:02:00 00:04:00
I tried using df['TimeWeek'] - df['TimeWeek'].shift().fillna(0), but it throws an error:
TypeError: unsupported operand type(s) for -: 'str' and 'str'
Probably because of the presence of ':' in the column. How do I resolve this?

It looks like the error is thrown because the data is in the form of a string instead of a timestamp. First convert them to timestamps:
df2 = df.apply(lambda x: [pd.Timestamp(ts) for ts in x])
They will contain today's date by default, but this shouldn't matter once you difference the time (hopefully you don't have to worry about differencing 23:55 and 00:05 across dates).
Once converted, simply difference the DataFrame:
>>> df2 - df2.shift()
TimeWeek TimeSat TimeHoli
0 NaT NaT NaT
1 00:05:00 00:05:00 00:05:00
2 00:05:00 00:04:00 00:05:00
3 00:05:00 00:02:00 00:04:00
4 00:03:00 00:02:00 00:03:00
5 00:42:00 00:02:00 00:04:00
Depending on your needs, you can just take rows 1+ (ignoring the NaTs):
(df2 - df2.shift()).iloc[1:, :]
or you can fill the NaTs with a zero timedelta:
(df2 - df2.shift()).fillna(pd.Timedelta(0))

Forget everything I just said. Pandas has great timedelta parsing.
df["TimeWeek"] = pd.to_timedelta(df["TimeWeek"])
df['TimeWeek'] - df['TimeWeek'].shift().fillna(pd.to_timedelta("00:00:00"))
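A self-contained sketch of the timedelta route applied to all three columns from the question. It uses DataFrame.diff as a shortcut for the subtract-and-shift expression (a substitution, not the answer's exact code), and the names deltas and diffs are made up here.

```python
import pandas as pd

df = pd.DataFrame({
    "TimeWeek": ["6:40:00", "6:45:00", "6:50:00", "6:55:00", "6:58:00", "7:40:00"],
    "TimeSat":  ["8:00:00", "8:05:00", "8:09:00", "8:11:00", "8:13:00", "8:15:00"],
    "TimeHoli": ["8:00:00", "8:05:00", "8:10:00", "8:14:00", "8:17:00", "8:21:00"],
})

# Parse every column of H:MM:SS strings into timedeltas.
deltas = df.apply(pd.to_timedelta)

# Row-to-row differences; the first row has no predecessor and stays NaT.
diffs = deltas.diff().add_suffix("Diff")
print(diffs)
```

Drop the first row with diffs.iloc[1:] if the leading NaT is not wanted.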

>>> import pandas as pd
>>> df = pd.DataFrame({'TimeWeek': ['6:40:00', '6:45:00', '6:50:00', '6:55:00', '7:40:00']})
>>> df["TimeWeek_date"] = pd.to_datetime(df["TimeWeek"], format="%H:%M:%S")
>>> print(df)
TimeWeek TimeWeek_date
0 6:40:00 1900-01-01 06:40:00
1 6:45:00 1900-01-01 06:45:00
2 6:50:00 1900-01-01 06:50:00
3 6:55:00 1900-01-01 06:55:00
4 7:40:00 1900-01-01 07:40:00
>>> df['TimeWeekDiff'] = (df['TimeWeek_date'] - df['TimeWeek_date'].shift().fillna(pd.to_datetime("00:00:00", format="%H:%M:%S")))
>>> print(df)
TimeWeek TimeWeek_date TimeWeekDiff
0 6:40:00 1900-01-01 06:40:00 06:40:00
1 6:45:00 1900-01-01 06:45:00 00:05:00
2 6:50:00 1900-01-01 06:50:00 00:05:00
3 6:55:00 1900-01-01 06:55:00 00:05:00
4 7:40:00 1900-01-01 07:40:00 00:45:00

averaging every five minutes data as one datapoint in pandas dataframe

I have a Dataframe in Pandas like this
1. 2013-10-09 09:00:05
2. 2013-10-09 09:01:00
3. 2013-10-09 09:02:00
4. ............
5. ............
6. ............
7. 2013-10-10 09:15:05
8. 2013-10-10 09:16:00
9. 2013-10-10 09:17:00
I would like reduce the size of the Dataframe by averaging every 5 mins data and forming 1 datapoint for it ..like this
1. 2013-10-09 09:05:00
2. 2013-10-09 09:10:00
3. 2013-10-09 09:15:00
Can someone help me with this ??

You may want to look at pandas resample:
df['Data'].resample('5min').mean()
In pandas versions before 0.18 this was written with the how argument, where how='mean' was the default:
df['Data'].resample('5Min', how='mean')
For example:
>>> rng = pd.date_range('1/1/2012', periods=10, freq='Min')
>>> df = pd.DataFrame({'Data':np.random.randint(0, 500, len(rng))}, index=rng)
>>> df
Data
2012-01-01 00:00:00 488
2012-01-01 00:01:00 172
2012-01-01 00:02:00 276
2012-01-01 00:03:00 5
2012-01-01 00:04:00 233
2012-01-01 00:05:00 266
2012-01-01 00:06:00 103
2012-01-01 00:07:00 40
2012-01-01 00:08:00 274
2012-01-01 00:09:00 494
>>>
>>> df['Data'].resample('5min').mean()
2012-01-01 00:00:00 234.8
2012-01-01 00:05:00 235.4
You can find more examples here.
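A reproducible sketch of the five-minute averaging, using the fixed values from the printout above in place of random data. The second call assumes the bins should be labelled by their right edge, which seems closer to the 09:05, 09:10 labels in the question; that reading is a guess.

```python
import pandas as pd

rng = pd.date_range("2012-01-01", periods=10, freq="min")
df = pd.DataFrame(
    {"Data": [488, 172, 276, 5, 233, 266, 103, 40, 274, 494]},
    index=rng,
)

# Mean of each 5-minute bin, labelled by the start of the bin.
means = df["Data"].resample("5min").mean()

# Same bins, but labelled by the end of the bin.
right_labelled = df["Data"].resample("5min", label="right").mean()

print(means)
print(right_labelled)
```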
