Calculate difference between 'times' rows in DataFrame Pandas - python

My DataFrame is in the Form:
TimeWeek TimeSat TimeHoli
0 6:40:00 8:00:00 8:00:00
1 6:45:00 8:05:00 8:05:00
2 6:50:00 8:09:00 8:10:00
3 6:55:00 8:11:00 8:14:00
4 6:58:00 8:13:00 8:17:00
5 7:40:00 8:15:00 8:21:00
I need to find the time difference between consecutive rows in TimeWeek, TimeSat and TimeHoli. The output must be:
TimeWeekDiff TimeSatDiff TimeHoliDiff
00:05:00 00:05:00 00:05:00
00:05:00 00:04:00 00:05:00
00:05:00 00:02:00 00:04:00
00:03:00 00:02:00 00:03:00
00:02:00 00:02:00 00:04:00
I tried using df['TimeWeek'] - df['TimeWeek'].shift().fillna(0), but it throws an error:
TypeError: unsupported operand type(s) for -: 'str' and 'str'
Probably because of the presence of ':' in the column. How do I resolve this?

It looks like the error is thrown because the data is in the form of a string instead of a timestamp. First convert them to timestamps:
df2 = df.apply(lambda x: [pd.Timestamp(ts) for ts in x])
They will contain today's date by default, but this shouldn't matter once you difference the time (hopefully you don't have to worry about differencing 23:55 and 00:05 across dates).
Once converted, simply difference the DataFrame:
>>> df2 - df2.shift()
TimeWeek TimeSat TimeHoli
0 NaT NaT NaT
1 00:05:00 00:05:00 00:05:00
2 00:05:00 00:04:00 00:05:00
3 00:05:00 00:02:00 00:04:00
4 00:03:00 00:02:00 00:03:00
5 00:42:00 00:02:00 00:04:00
Depending on your needs, you can just take rows 1+ (ignoring the NaTs):
(df2 - df2.shift()).iloc[1:, :]
or you can fill the NaTs with zeros:
(df2 - df2.shift()).fillna(pd.Timedelta(0))
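The midnight caveat mentioned above (differencing 23:55 and 00:05) can be handled with a small adjustment. This is only a sketch under two assumptions that are not in the question: the rows are in chronological order, and no gap between consecutive rows is 24 hours or longer. The sample values and the names times, raw and wrapped are made up for illustration.

```python
import pandas as pd

# Hypothetical clock times that cross midnight between the 2nd and 3rd row.
times = pd.Series(pd.to_timedelta(["23:50:00", "23:55:00", "00:05:00"]))

# Plain row-to-row difference; the last value comes out negative (-23:50:00).
raw = times.diff()

# Assumption: a negative difference means the clock wrapped past midnight,
# so add one day to it. NaT in the first row is left untouched.
one_day = pd.Timedelta(days=1)
wrapped = raw.mask(raw < pd.Timedelta(0), raw + one_day)

print(wrapped)
```

If the data can contain gaps longer than a day, this trick is not enough and the real dates are needed.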

Forget everything I just said. Pandas has great timedelta parsing.
df["TimeWeek"] = pd.to_timedelta(df["TimeWeek"])
df['TimeWeek'] - df['TimeWeek'].shift().fillna(pd.to_timedelta("00:00:00"))
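For all three columns at once, the timedelta approach might look like the sketch below. It rebuilds the DataFrame from the question; the name diffs and the use of diff() and add_suffix() are my own choices, not something from the answers above.

```python
import pandas as pd

# Data copied from the question; all values are strings at this point.
df = pd.DataFrame({
    "TimeWeek": ["6:40:00", "6:45:00", "6:50:00", "6:55:00", "6:58:00", "7:40:00"],
    "TimeSat":  ["8:00:00", "8:05:00", "8:09:00", "8:11:00", "8:13:00", "8:15:00"],
    "TimeHoli": ["8:00:00", "8:05:00", "8:10:00", "8:14:00", "8:17:00", "8:21:00"],
})

# Parse every column as a timedelta (time elapsed since midnight).
deltas = df.apply(pd.to_timedelta)

# diff() subtracts the previous row; drop the first row, which is all NaT.
diffs = deltas.diff().iloc[1:].add_suffix("Diff")

print(diffs)
```

Note that the last TimeWeek difference is 00:42:00 (7:40 minus 6:58), the same value as in the first answer.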

>>> import pandas as pd
>>> df = pd.DataFrame({'TimeWeek': ['6:40:00', '6:45:00', '6:50:00', '6:55:00', '7:40:00']})
>>> df["TimeWeek_date"] = pd.to_datetime(df["TimeWeek"], format="%H:%M:%S")
>>> print(df)
TimeWeek TimeWeek_date
0 6:40:00 1900-01-01 06:40:00
1 6:45:00 1900-01-01 06:45:00
2 6:50:00 1900-01-01 06:50:00
3 6:55:00 1900-01-01 06:55:00
4 7:40:00 1900-01-01 07:40:00
>>> df['TimeWeekDiff'] = (df['TimeWeek_date'] - df['TimeWeek_date'].shift().fillna(pd.to_datetime("00:00:00", format="%H:%M:%S")))
>>> print(df)
TimeWeek TimeWeek_date TimeWeekDiff
0 6:40:00 1900-01-01 06:40:00 06:40:00
1 6:45:00 1900-01-01 06:45:00 00:05:00
2 6:50:00 1900-01-01 06:50:00 00:05:00
3 6:55:00 1900-01-01 06:55:00 00:05:00
4 7:40:00 1900-01-01 07:40:00 00:45:00

Related

Problems with replacing NaT in pandas correctly

I have a dataframe that contains some NaT values.
Date Value
6312957 2012-01-01 23:58:00 -49
6312958 2012-01-01 23:59:00 -49
6312959 NaT -48
6312960 2012-01-02 00:01:00 -47
6312961 2012-01-02 00:02:00 -46
I try to replace these NAT by adding a minute to the previous entry.
indices_of_NAT = np.flatnonzero(pd.isna(df.loc[:, "Date"]))
df.loc[indices_of_NAT, "Date"] = df.loc[indices_of_NAT - 1, "Date"] + pd.Timedelta(minutes=1)
This produces the correct timestamps and indices, which I checked manually. The only problem is that they don't replace the NaT values for whatever reason. I wonder if something goes wrong with the indexing in my last line of code. Is there something obvious I am missing?
You can fillna with the shifted values + 1 min:
df['Date'] = df['Date'].fillna(df['Date'].shift().add(pd.Timedelta('1min')))
Another method is to interpolate. For this you need to temporarily convert to a number. This way you can fill more than one gap and the increment will be calculated automatically, and there are many nice interpolation methods (see doc):
df['Date'] = (pd.to_datetime(pd.to_numeric(df['Date'])
.mask(df['Date'].isna())
.interpolate('linear'))
)
Example:
Date Value shift interpolate
0 2012-01-01 23:58:00 -49 2012-01-01 23:58:00 2012-01-01 23:58:00
1 2012-01-01 23:59:00 -49 2012-01-01 23:59:00 2012-01-01 23:59:00
2 NaT -48 2012-01-02 00:00:00 2012-01-02 00:00:00
3 2012-01-02 00:01:00 -47 2012-01-02 00:01:00 2012-01-02 00:01:00
4 NaT -48 2012-01-02 00:02:00 2012-01-02 00:01:20
5 NaT -48 NaT 2012-01-02 00:01:40
6 2012-01-02 00:02:00 -46 2012-01-02 00:02:00 2012-01-02 00:02:00
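A variation on the interpolation idea that avoids converting the datetimes to raw integers: measure each timestamp as seconds from the earliest one, interpolate those floats, and add the result back. This is only a sketch; the names dates, origin, seconds and filled are hypothetical, and the data mirrors the example table above.

```python
import pandas as pd

# Same timestamps as the example table, with three gaps (None becomes NaT).
dates = pd.Series(pd.to_datetime([
    "2012-01-01 23:58:00", "2012-01-01 23:59:00", None,
    "2012-01-02 00:01:00", None, None, "2012-01-02 00:02:00",
]))

# Express each timestamp as seconds since the earliest one; NaT becomes NaN.
origin = dates.min()
seconds = (dates - origin).dt.total_seconds()

# Interpolate the float column linearly, then convert back to timestamps.
filled = origin + pd.to_timedelta(seconds.interpolate("linear"), unit="s")

print(filled)
```

The two consecutive gaps are filled with 00:01:20 and 00:01:40, matching the interpolate column in the table.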
Use Series.fillna with the shifted values plus 1 minute:
df['Date'] = df['Date'].fillna(df['Date'].shift() + pd.Timedelta(minutes=1))
Or forward-fill the missing values and add 1 minute:
df['Date'] = df['Date'].fillna(df['Date'].ffill() + pd.Timedelta(minutes=1))
You can see the difference between the two with other data:
df['Date'] = pd.to_datetime(df['Date'])
df['Date1'] = df['Date'].fillna(df['Date'].shift() + pd.Timedelta(minutes=1))
df['Date2'] = df['Date'].fillna(df['Date'].ffill() + pd.Timedelta(minutes=1))
print (df)
Date Value Date1 Date2
6312957 2012-01-01 23:58:00 -49 2012-01-01 23:58:00 2012-01-01 23:58:00
6312958 2012-01-01 23:59:00 -49 2012-01-01 23:59:00 2012-01-01 23:59:00
6312959 NaT -48 2012-01-02 00:00:00 2012-01-02 00:00:00
6312960 2012-01-02 00:01:00 -47 2012-01-02 00:01:00 2012-01-02 00:01:00
6312961 2012-01-02 00:02:00 -46 2012-01-02 00:02:00 2012-01-02 00:02:00
6312962 NaT -47 2012-01-02 00:03:00 2012-01-02 00:03:00
6312963 NaT -47 NaT 2012-01-02 00:03:00
6312967 2012-01-02 00:01:00 -47 2012-01-02 00:01:00 2012-01-02 00:01:00
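Neither variant above keeps counting through a run of consecutive NaT values: shift leaves the second one as NaT, and ffill gives both the same timestamp. If each missing row should be one minute later than the row before it, something like the following sketch could work. It is my own extension, not part of the answers, and the names is_gap, run_id and steps are hypothetical.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime([
    "2012-01-01 23:58:00", "2012-01-01 23:59:00", None, None,
    "2012-01-02 00:02:00",
]))

is_gap = dates.isna()

# Every valid row starts a new run; the NaT rows that follow share its id.
run_id = (~is_gap).cumsum()

# Position of each NaT inside its run: 1 for the first, 2 for the second, ...
steps = is_gap.astype(int).groupby(run_id).cumsum()

# Last valid timestamp plus that many minutes.
filled = dates.ffill() + pd.to_timedelta(steps, unit="min")

print(filled)
```

Valid rows get a step of 0, so they are left unchanged.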

How to fill zeroes of a day with the previous day values in Pandas

I have some days with complete zeroes and would like to replace them with the previous day values as shown here.
Input
count
2020-02-01 00:00:00 12
2020-02-01 00:01:00 3
2020-02-01 00:02:00 14
2020-02-01 00:03:00 0
2020-02-01 00:04:00 22
2020-02-02 00:00:00 0
2020-02-02 00:01:00 0
2020-02-02 00:02:00 0
2020-02-02 00:03:00 0
2020-02-02 00:04:00 0
2020-02-03 00:00:00 2
2020-02-03 00:01:00 4
2020-02-03 00:02:00 1
2020-02-03 00:03:00 0
2020-02-03 00:04:00 22
Output
count
2020-02-01 00:00:00 12
2020-02-01 00:01:00 3
2020-02-01 00:02:00 14
2020-02-01 00:03:00 0
2020-02-01 00:04:00 22
2020-02-02 00:00:00 12
2020-02-02 00:01:00 3
2020-02-02 00:02:00 14
2020-02-02 00:03:00 0
2020-02-02 00:04:00 22
2020-02-03 00:00:00 2
2020-02-03 00:01:00 4
2020-02-03 00:02:00 1
2020-02-03 00:03:00 0
2020-02-03 00:04:00 22
I was trying something like this but couldn't solve it.
df = df.fillna(0)
df = df.reset_index()
df['Date'] = df['index'].dt.date
df['Time'] = df['index'].dt.time
df.set_index(pd.to_datetime(df.Date + ' ' + df.Time), inplace=True)
for ind in df[df.count.eq(0)].index:
    df.loc[ind, 'count'] = df.loc[ind - pd.Timedelta('1D'), 'count']
df.reset_index(drop=True, inplace=True)
You can use mask to replace the 0s with NaN, then group by the time of day in the DatetimeIndex and ffill, and finally fillna with 0 for the times that have no earlier value to fill from.
df_ = (df.mask(df.eq(0))
.groupby(df.index.time)
.ffill() #add the parameter limit=1 if you want to fill only one day after
.fillna(0)
)
print (df_)
count
2020-02-01 00:00:00 12.0
2020-02-01 00:01:00 3.0
2020-02-01 00:02:00 14.0
2020-02-01 00:03:00 0.0
2020-02-01 00:04:00 22.0
2020-02-02 00:00:00 12.0
2020-02-02 00:01:00 3.0
2020-02-02 00:02:00 14.0
2020-02-02 00:03:00 0.0
2020-02-02 00:04:00 22.0
2020-02-03 00:00:00 2.0
2020-02-03 00:01:00 4.0
2020-02-03 00:02:00 1.0
2020-02-03 00:03:00 0.0
2020-02-03 00:04:00 22.0
If you want to fill with previous values ONLY if all values of the day are 0, then in the mask above, replace df.eq(0) with df['count'].eq(0).groupby(df.index.date).transform('all'). In this case it does not change the result.
If you want to fill with the average of the same time until the current time, then you can use expanding like:
(df.mask(df.eq(0))
.groupby(df.index.time)
.expanding().mean()
.fillna(0)
.reset_index(level=0, drop=True).sort_index()
)
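The "only if the whole day is zero" variant described above could be sketched as follows. The data is rebuilt from the question; working on the count Series rather than the whole DataFrame, and the names all_zero_day and filled, are my own choices.

```python
import pandas as pd

# Three days with five one-minute readings each, as in the question.
days = pd.date_range("2020-02-01", periods=3, freq="D")
idx = pd.DatetimeIndex([d + pd.Timedelta(minutes=m) for d in days for m in range(5)])
df = pd.DataFrame(
    {"count": [12, 3, 14, 0, 22, 0, 0, 0, 0, 0, 2, 4, 1, 0, 22]},
    index=idx,
)

# True for every row that belongs to a day whose values are all zero.
all_zero_day = df["count"].eq(0).groupby(df.index.date).transform("all")

# Blank out only those days, then forward-fill within each time of day.
filled = (
    df["count"]
    .mask(all_zero_day)
    .groupby(df.index.time)
    .ffill()
    .fillna(0)      # only matters if the very first day is all zero
    .astype(int)
)

print(filled)
```

Isolated zeros such as 2020-02-03 00:03:00 are kept, because their day is not entirely zero.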

Split Time Series Data Into Time Intervals in one line (PythonicWay) - Hourly

I have minute-level data with a time column. I want to create a new column containing just the hour, in datetime format, for example format='%Y-%m-%d %H:%M:%S'. I know that in R we can use something like:
value$hour<- cut(as.POSIXct(paste(value$time),
format="%Y-%m-%d %H:%M:%S"), breaks="hour")
When I do this, I get the following output (which is what I need):
time hour
2017-02-10 00:00:00 2017-02-10 00:00:00
2017-02-10 00:01:00 2017-02-10 00:00:00
2017-02-10 00:02:00 2017-02-10 00:00:00
2017-02-10 00:03:00 2017-02-10 00:00:00
....
2017-12-1 10:05:00 2017-12-01 10:00:00
2017-12-1 10:06:00 2017-12-01 10:00:00
I am also aware that there are many threads that discuss dt.date, dt.hour, etc. I can do the following in Python:
value['date'] = value['time'].dt.date
value['hour'] = value['time'].dt.hour
Is there any way to do this in Python in one line, similar to the R code above?
Any thoughts would be appreciated. Thanks in advance!
You need dt.floor:
df['hour'] = df['time'].dt.floor('H')
print (df)
time hour
0 2017-02-10 00:00:00 2017-02-10 00:00:00
1 2017-02-10 00:01:00 2017-02-10 00:00:00
2 2017-02-10 00:02:00 2017-02-10 00:00:00
3 2017-02-10 00:03:00 2017-02-10 00:00:00
4 2017-12-01 10:05:00 2017-12-01 10:00:00
5 2017-12-01 10:06:00 2017-12-01 10:00:00
If the time column first needs converting to datetime, add to_datetime:
df['hour'] = pd.to_datetime(df['time']).dt.floor('H')
print (df)
time hour
0 2017-02-10 00:00:00 2017-02-10 00:00:00
1 2017-02-10 00:01:00 2017-02-10 00:00:00
2 2017-02-10 00:02:00 2017-02-10 00:00:00
3 2017-02-10 00:03:00 2017-02-10 00:00:00
4 2017-12-1 10:05:00 2017-12-01 10:00:00
5 2017-12-1 10:06:00 2017-12-01 10:00:00
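A self-contained version of the floor approach, with a per-hour count as one possible use of the new column. The sample rows and the name per_hour are made up for illustration. I use the lowercase alias "h" here because recent pandas releases warn about the uppercase "H"; that is an assumption about your pandas version.

```python
import pandas as pd

# Hypothetical minute-level timestamps stored as strings.
df = pd.DataFrame({"time": [
    "2017-02-10 00:00:00", "2017-02-10 00:01:00", "2017-02-10 00:59:00",
    "2017-12-01 10:05:00", "2017-12-01 10:06:00",
]})

# Parse the strings, then round each timestamp down to the start of its hour.
df["hour"] = pd.to_datetime(df["time"]).dt.floor("h")

# The hour column can then serve as a grouping key, e.g. to count rows per hour.
per_hour = df.groupby("hour").size()

print(df)
print(per_hour)
```

Unlike dt.hour, the floored value keeps the date, so the same hour on different days stays distinct.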

Problems resampling pandas time series with time gap between indices

I want to resample the pandas series
import pandas as pd
index_1 = pd.date_range('1/1/2000', periods=4, freq='min')
index_2 = pd.date_range('1/2/2000', periods=3, freq='min')
series = pd.Series(range(4), index=index_1)
series = pd.concat([series, pd.Series(range(3), index=index_2)])
print(series)
>>>2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-02 00:00:00 0
2000-01-02 00:01:00 1
2000-01-02 00:02:00 2
such that the resulting Series only contains every second entry, i.e.
>>>2000-01-01 00:00:00 0
2000-01-01 00:02:00 2
2000-01-02 00:00:00 0
2000-01-02 00:02:00 2
using the (poorly documented) resample method of pandas in the following way:
resampled_series = series.resample('2min', closed='right').mean()
print(resampled_series)
I get
>>>1999-12-31 23:58:00 0.0
2000-01-01 00:00:00 1.5
2000-01-01 00:02:00 3.0
2000-01-01 00:04:00 NaN
2000-01-01 00:56:00 NaN
...
2000-01-01 23:54:00 NaN
2000-01-01 23:56:00 NaN
2000-01-01 23:58:00 0.0
2000-01-02 00:00:00 1.5
2000-01-02 00:02:00 3.0
Why does it start 2 minutes earlier than the original series? Why does it contain all the time steps in between, which are not contained in the original series? How can I get my desired result?
resample() is not the right function for your purpose.
Try this:
series[series.index.minute % 2 == 0]
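Here is the filter as a runnable sketch on the series from the question, built with pd.concat. The name every_second is mine.

```python
import pandas as pd

index_1 = pd.date_range("2000-01-01", periods=4, freq="min")
index_2 = pd.date_range("2000-01-02", periods=3, freq="min")
series = pd.concat([
    pd.Series(range(4), index=index_1),
    pd.Series(range(3), index=index_2),
])

# Keep only the rows whose timestamp falls on an even minute.
every_second = series[series.index.minute % 2 == 0]

print(every_second)
```

As far as I understand it, resample builds a regular grid of 2-minute bins covering the whole span from the first to the last timestamp, which is why the empty bins between the two days show up as NaN. With closed='right' each bin includes its right edge, so the 00:00:00 entry lands in a bin labelled 23:58:00. Boolean indexing on the index avoids both effects because it only selects existing rows.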

Keeping only data for which timedelta=1minute with pandas

Let's generate 10 rows of a time series with non-constant time step :
import pandas as pd
import numpy as np
x = pd.DataFrame(np.random.random(10),pd.date_range('1/1/2011', periods=5, freq='1min') \
.union(pd.date_range('1/2/2011', periods=5, freq='1min')))
Example of data:
2011-01-01 00:00:00 0.144852
2011-01-01 00:01:00 0.510248
2011-01-01 00:02:00 0.911903
2011-01-01 00:03:00 0.392504
2011-01-01 00:04:00 0.054307
2011-01-02 00:00:00 0.918862
2011-01-02 00:01:00 0.988054
2011-01-02 00:02:00 0.780668
2011-01-02 00:03:00 0.831947
2011-01-02 00:04:00 0.707357
Now let's define r as the so-called "returns" (difference between consecutive rows):
r = x[1:] - x[:-1].values
How to clean the data by removing the r[i] for which the time difference was not 1 minute? (here there is exactly one such row in r to clean)
IIUC I think you want the following:
In [26]:
x[(x.index.to_series().diff() == pd.Timedelta(1, 'm')) | (x.index.to_series().diff().isnull())]
Out[26]:
0
2011-01-01 00:00:00 0.367675
2011-01-01 00:01:00 0.128325
2011-01-01 00:02:00 0.772191
2011-01-01 00:03:00 0.638847
2011-01-01 00:04:00 0.476668
2011-01-02 00:01:00 0.992888
2011-01-02 00:02:00 0.944810
2011-01-02 00:03:00 0.171831
2011-01-02 00:04:00 0.316064
This converts the index to a Series using to_series so we can call diff, and we can then compare the result with a timedelta of 1 minute. We also handle the first-row case, where diff returns NaT.
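To apply the same test to the returns r rather than to x itself, one option is to compute the time gaps alongside the value differences and filter on them. This is only a sketch: it uses a Series with deterministic values instead of random ones, and the names gaps and clean_r are hypothetical.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2011-01-01", periods=5, freq="min").union(
    pd.date_range("2011-01-02", periods=5, freq="min")
)
# Deterministic stand-in for the random data in the question.
x = pd.Series(np.arange(10, dtype=float), index=idx)

# Returns: difference between consecutive rows (first row has none).
r = x.diff().iloc[1:]

# Time elapsed between consecutive rows, aligned with r.
gaps = x.index.to_series().diff().iloc[1:]

# Keep only the returns computed across exactly one minute.
clean_r = r[gaps == pd.Timedelta(minutes=1)]

print(clean_r)
```

The single return that spans the jump from 2011-01-01 00:04:00 to 2011-01-02 00:00:00 is dropped; the other eight are kept.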
