I have a bunch of data point for each there are two columns: start_dt and end_dt. I am wondering how can I split the time gap between start_dt and end_dt into 5 minutes interval?
For instance,
id+++++++start_tm ++++++++++++++ end_dt
1+++++++2019-01-01 10:00 +++++++ 2019-01-01 11:00
=====================================================
What I am looking for is:
id+++++++start_tm ++++++++++++++ end_dt
1+++++++2019-01-01 10:00 +++++++ 2019-01-01 10:05
1+++++++2019-01-01 10:05 +++++++ 2019-01-01 10:10
1+++++++2019-01-01 10:10 +++++++ 2019-01-01 10:15
1+++++++2019-01-01 10:15 +++++++ 2019-01-01 10:20
==================================================
and so fort
is there any function out of the box to do so?
If not, any help to create this function is wonderful
If you have two Python datetime objects representing a timespan, and you just want to break that timespan up into 5 minute intervals represented by datetime objects, you could just do this:
import datetime
d1 = datetime.datetime(2019, 1, 1, 10, 0)
d2 = datetime.datetime(2019, 1, 1, 11, 0)
delta = datetime.timedelta(minutes=5)
times = []
while d1 < d2:
times.append(d1)
d1 += delta
times.append(d2)
for i in range(len(times) - 1):
print("{} - {}".format(times[i], times[i+1]))
Output:
2019-01-01 10:00:00 - 2019-01-01 10:05:00
2019-01-01 10:05:00 - 2019-01-01 10:10:00
2019-01-01 10:10:00 - 2019-01-01 10:15:00
2019-01-01 10:15:00 - 2019-01-01 10:20:00
2019-01-01 10:20:00 - 2019-01-01 10:25:00
2019-01-01 10:25:00 - 2019-01-01 10:30:00
2019-01-01 10:30:00 - 2019-01-01 10:35:00
2019-01-01 10:35:00 - 2019-01-01 10:40:00
2019-01-01 10:40:00 - 2019-01-01 10:45:00
2019-01-01 10:45:00 - 2019-01-01 10:50:00
2019-01-01 10:50:00 - 2019-01-01 10:55:00
2019-01-01 10:55:00 - 2019-01-01 11:00:00
This should handle a period that isn't an even multiple of the delta, giving you a shorter interval at the end.
I don't know pyspark, but if you are using pandas this works. (and pyspark may be similar):
1:create data
import pandas as pd
import numpy as np
data = pd.DataFrame({
'id':[1, 2],
'start_tm': pd.date_range('2019-01-01 00:00', periods=2, freq='D'),
'end_dt': pd.date_range('2019-01-01 00:30', periods=2, freq='D')})
# pandas dataframe is similar to the data in pyspark
output
id start_tm end_dt
1 2019-01-01 2019-01-01 00:30:00
2 2019-01-02 2019-01-02 00:30:00
2: split columns
period = np.timedelta64(5, 'm') # 5 minutes
idx = (data['end_dt'] - data['start_tm']) > period
while idx.any():
new_data = data[idx].copy()
new_data['start_tm'] = new_data['start_tm'] + period
data.loc[idx, 'end_dt'] = (data[idx]['start_tm'] + period).values
data = pd.concat([data, new_data], axis=0)
idx = (data['end_dt'] - data['start_tm']) > period
output
id start_tm end_dt
1 2019-01-01 00:00:00 2019-01-01 00:05:00
2 2019-01-02 00:00:00 2019-01-02 00:05:00
1 2019-01-01 00:05:00 2019-01-01 00:10:00
2 2019-01-02 00:05:00 2019-01-02 00:10:00
1 2019-01-01 00:10:00 2019-01-01 00:15:00
2 2019-01-02 00:10:00 2019-01-02 00:15:00
1 2019-01-01 00:15:00 2019-01-01 00:20:00
2 2019-01-02 00:15:00 2019-01-02 00:20:00
1 2019-01-01 00:20:00 2019-01-01 00:25:00
2 2019-01-02 00:20:00 2019-01-02 00:25:00
1 2019-01-01 00:25:00 2019-01-01 00:30:00
2 2019-01-02 00:25:00 2019-01-02 00:30:00
Related
i have some issues with my dataresampling in pandas. I´m trying to upsample 15 min values to 1min values. The resampled dataframe values shoud contain the sum spliited equaly between the two values of the original dataframe. This codes generates an extraction of the problem.
import pandas as pd
import numpy as np
dates = pd.DataFrame(pd.date_range(start="20190101",end="20200101", freq="15min"))
values = pd.DataFrame(np.random.randint(0,10,size=(35041, 1)))
df = pd.concat([dates,values], axis = 1)
df = df.set_index(pd.DatetimeIndex(df.iloc[:,0]))
print(df.resample("min").agg("sum").head(16))
This is an example output:
2019-01-01 00:00:00 3
2019-01-01 00:01:00 0
2019-01-01 00:02:00 0
2019-01-01 00:03:00 0
2019-01-01 00:04:00 0
2019-01-01 00:05:00 0
2019-01-01 00:06:00 0
2019-01-01 00:07:00 0
2019-01-01 00:08:00 0
2019-01-01 00:09:00 0
2019-01-01 00:10:00 0
2019-01-01 00:11:00 0
2019-01-01 00:12:00 0
2019-01-01 00:13:00 0
2019-01-01 00:14:00 0
2019-01-01 00:15:00 3
The values shown as 0 should be replaced by the sum of the two values (in this exapmle: 2019-01-01 00:00:00 3; and 2019-01-01 00:15:00 3) which equals to 6 and this should be evenly distibuted over the timearea.
2019-01-01 00:00:00 6/15
2019-01-01 00:01:00 6/15
2019-01-01 00:02:00 6/15
2019-01-01 00:03:00 6/15
2019-01-01 00:04:00 6/15
2019-01-01 00:05:00 6/15
2019-01-01 00:06:00 6/15
2019-01-01 00:07:00 6/15
2019-01-01 00:08:00 6/15
2019-01-01 00:09:00 6/15
2019-01-01 00:10:00 6/15
2019-01-01 00:11:00 6/15
2019-01-01 00:12:00 6/15
2019-01-01 00:13:00 6/15
2019-01-01 00:14:00 6/15
2019-01-01 00:15:00 6/15
This should be done for each resampled group over the whole Dataframe.
In other word the sum of the original dataframe and the resampled dataframe should be equal.
Thanks for your help.
First of all, personally, I would recommend working with a series, if there is only one column.
series = pd.Series(index=pd.date_range(start="20190101",end="20200101",
freq="15min"), data=(np.random.randint(0,10,size=(35041,))).tolist())
Then, I would create a new index with minutely values, calculate the cumulative sum of the values and interpolate between these values. In your use case "linear" is suggested as interpolation method:
beginning = series.index[0]
end = series.index[-1]
new_index = pd.date_range(start, end, freq="1T")
cumsum = series.cumsum()
cumsum = result.reindex(new_index)
cumsum = result.interpolate("linear")
Afterwards, you get an interpolated cumulative sum, which you can convert back to your searched values via:
series_upsampled = cumsum.diff()
If you want, you can shift the series_upsampled by 1, doing
series_upsampled = series_upsampled.shift(-1)
Pay attention to NaN value at the beginning (or if you shift your series, at the end).
I have imported data from some source that has date in datatype class 'object' and hour in integer and looks something like:
Date Hour Val
2019-01-01 1 0
2019-01-01 2 0
2019-01-01 3 0
2019-01-01 4 0
2019-01-01 5 0
2019-01-01 6 0
2019-01-01 7 0
2019-01-01 8 0
I need a single column that has the date-time in a column that looks like this:
DATETIME
2019-01-01 01:00:00
2019-01-01 02:00:00
2019-01-01 03:00:00
2019-01-01 04:00:00
2019-01-01 05:00:00
2019-01-01 06:00:00
2019-01-01 07:00:00
2019-01-01 08:00:00
I tried to convert the date column to dateTime format using
pd.datetime(df.Date)
and then using
df.Date.dt.hour = df.Hour
I get the error
ValueError: modifications to a property of a datetimelike object are not supported. Change values on the original.
Is there an easy way to do this?
Use pandas.to_timedelta and pandas.to_datetime:
# if needed
df['Date'] = pd.to_datetime(df['Date'])
df['Datetime'] = df['Date'] + pd.to_timedelta(df['Hour'], unit='H')
[out]
Date Hour Val Datetime
0 2019-01-01 1 0 2019-01-01 01:00:00
1 2019-01-01 2 0 2019-01-01 02:00:00
2 2019-01-01 3 0 2019-01-01 03:00:00
3 2019-01-01 4 0 2019-01-01 04:00:00
4 2019-01-01 5 0 2019-01-01 05:00:00
5 2019-01-01 6 0 2019-01-01 06:00:00
6 2019-01-01 7 0 2019-01-01 07:00:00
7 2019-01-01 8 0 2019-01-01 08:00:00
Since you asked for a method combining the columns and using a single pd.to_datetime call, you could do:
df['Datetime'] = pd.to_datetime((df['Date'].astype(str) + ' ' +
df['Hour'].astype(str)),
format='%Y-%m-%d %I')
I have a df:
dates values
2020-01-01 00:15:00 38.61487
2020-01-01 00:30:00 36.905204
2020-01-01 00:45:00 35.136584
2020-01-01 01:00:00 33.60378
2020-01-01 01:15:00 32.306791999999994
2020-01-01 01:30:00 31.304574
I am creating a new column named start as follows:
df = df.rename(columns={'dates': 'end'})
df['start']= df['end'].shift(1)
When I do this, I get the following:
end values start
2020-01-01 00:15:00 38.61487 NaT
2020-01-01 00:30:00 36.905204 2020-01-01 00:15:00
2020-01-01 00:45:00 35.136584 2020-01-01 00:30:00
2020-01-01 01:00:00 33.60378 2020-01-01 00:45:00
2020-01-01 01:15:00 32.306791999999994 2020-01-01 01:00:00
2020-01-01 01:30:00 31.304574 2020-01-01 01:15:00
I want to fill that NaT value with
2020-01-01 00:00:00
How can this be done?
Use Series.fillna with datetimes, e.g. by Timestamp:
df['start']= df['end'].shift().fillna(pd.Timestamp('2020-01-01'))
Or if pandas 0.24+ with fill_value parameter:
df['start']= df['end'].shift(fill_value=pd.Timestamp('2020-01-01'))
If all datetimes are regular, always difference 15 minutes is possible subtracting by offsets.DateOffset:
df['start']= df['end'] - pd.offsets.DateOffset(minutes=15)
print (df)
end values start
0 2020-01-01 00:15:00 38.614870 2020-01-01 00:00:00
1 2020-01-01 00:30:00 36.905204 2020-01-01 00:15:00
2 2020-01-01 00:45:00 35.136584 2020-01-01 00:30:00
3 2020-01-01 01:00:00 33.603780 2020-01-01 00:45:00
4 2020-01-01 01:15:00 32.306792 2020-01-01 01:00:00
5 2020-01-01 01:30:00 31.304574 2020-01-01 01:15:00
How about that?
df = pd.DataFrame(columns = ['end'])
df.loc[:, 'end'] = pd.date_range(start=pd.Timestamp(2019,1,1,0,15), end=pd.Timestamp(2019,1,2), freq='15min')
df.loc[:, 'start'] = df.loc[:, 'end'].shift(1)
delta = df.loc[df.index[3], 'end'] - df.loc[df.index[2], 'end']
df.loc[df.index[0], 'start'] = df.loc[df.index[1], 'start'] - delta
df
end start
0 2019-01-01 00:15:00 2019-01-01 00:00:00
1 2019-01-01 00:30:00 2019-01-01 00:15:00
2 2019-01-01 00:45:00 2019-01-01 00:30:00
3 2019-01-01 01:00:00 2019-01-01 00:45:00
4 2019-01-01 01:15:00 2019-01-01 01:00:00
... ... ...
91 2019-01-01 23:00:00 2019-01-01 22:45:00
92 2019-01-01 23:15:00 2019-01-01 23:00:00
93 2019-01-01 23:30:00 2019-01-01 23:15:00
94 2019-01-01 23:45:00 2019-01-01 23:30:00
95 2019-01-02 00:00:00 2019-01-01 23:45:00
I have hourly time series data stored in a pandas series. Similar to this example:
import pandas as pd
import numpy as np
date_rng = pd.date_range(start='1/1/2019', end='1/2/2019', freq='H')
data = np.random.uniform(180,182,size=(len(date_rng)))
timeseries = pd.Series(data, index=date_rng)
timeseries.iloc[4:12] = 181.911
At three decimal places, it is highly unlikely the data will be exactly the same for more than, say, 3 hours in a row. When this flatlining occurs, it indicates an issue with the sensor. So I want to detect repeated data and replace it with nan values (i.e., detect the repeated values 181.911 in the above and replace with nan)
I assume I can iterate over the time series and detect/replace that way, but is there a more efficient way to do this?
You can do it with diff, but the first occurrence retain in the series.
timeseries.where(timeseries.diff(1)!=0.0,np.nan)
2019-01-01 00:00:00 180.539278
2019-01-01 01:00:00 181.509729
2019-01-01 02:00:00 180.740326
2019-01-01 03:00:00 181.736425
2019-01-01 04:00:00 181.911000
2019-01-01 05:00:00 NaN
2019-01-01 06:00:00 NaN
2019-01-01 07:00:00 NaN
2019-01-01 08:00:00 NaN
2019-01-01 09:00:00 NaN
2019-01-01 10:00:00 NaN
2019-01-01 11:00:00 NaN
2019-01-01 12:00:00 180.093216
2019-01-01 13:00:00 180.623440
First occurrence also can be removed using diff(-1) and diff(1):
np.c_[timeseries.where(timeseries.diff(-1)!=0.0,np.nan), timeseries.where(timeseries.diff(1)!=0.0,np.nan)].mean(axis=1)
It works when repetitions are sequential in series.
With following reasonably efficient function one can choose the minimum number of repeated values to consider as flatline:
import numpy as np
def remove_flatlines(ts, threshold):
# get start and end indices of each flatline as an n x 2 array
isflat = np.concatenate(([False], np.isclose(ts.diff(), 0), [False]))
isedge = isflat[1:] != isflat[:-1]
flatrange = np.where(isedge)[0].reshape(-1, 2)
# include also first value of each flatline
flatrange[:, 0] -= 1
# remove flatlines with at least threshold number of equal values
ts = ts.copy()
for j in range(len(flatrange)):
if flatrange[j][1] - flatrange[j][0] >= threshold:
ts.iloc[flatrange[j][0]:flatrange[j][1]] = np.nan
return ts
Applied to example:
remove_flatlines(timeseries, threshold=3)
2019-01-01 00:00:00 181.447940
2019-01-01 01:00:00 180.142692
2019-01-01 02:00:00 180.994674
2019-01-01 03:00:00 180.116489
2019-01-01 04:00:00 NaN
2019-01-01 05:00:00 NaN
2019-01-01 06:00:00 NaN
2019-01-01 07:00:00 NaN
2019-01-01 08:00:00 NaN
2019-01-01 09:00:00 NaN
2019-01-01 10:00:00 NaN
2019-01-01 11:00:00 NaN
2019-01-01 12:00:00 180.972644
2019-01-01 13:00:00 181.969759
2019-01-01 14:00:00 181.008693
2019-01-01 15:00:00 180.769328
2019-01-01 16:00:00 180.576061
2019-01-01 17:00:00 181.562315
2019-01-01 18:00:00 181.978567
2019-01-01 19:00:00 181.928330
2019-01-01 20:00:00 180.773995
2019-01-01 21:00:00 180.475290
2019-01-01 22:00:00 181.460028
2019-01-01 23:00:00 180.220693
2019-01-02 00:00:00 181.630176
Freq: H, dtype: float64
I have following dataframe in pandas
code srt_date srt_time end_time fina_datetime
123 2019-01-01 23:23:00 00:12:00 2019-01-02 00:13:00
123 2019-01-02 00:13:00 00:14:00 2019-01-02 00:15:00
123 2019-01-02 23:00:00 00:15:00 2019-01-03 00:16:00
I want to calculate fina_datetime - end_time for which I am doing following thing in pandas
df['end_time'] = df['srt_date'].map(str) +" "+ df['end_time'].map(str)
df['end_time'] = pd.to_datetime(df['end_time'], format = "%Y-%m-%d %H:%M:%S")
df['latency_in_secs'] = [x-y for x, y in zip(df['final_datetime'] , df['end_time'])]
df['latency_in_secs'] = df.latency_in_secs.dt.total_seconds()
Above code has issues when date is entering into next date e.g. 1st and 3rd row. How do I do it in pandas?
My desired dataframe would be
code srt_date srt_time end_time fina_datetime latency_in_secs
123 2019-01-01 23:23:00 00:12:00 2019-01-02 00:13:00 60
123 2019-01-02 00:13:00 00:14:00 2019-01-02 00:15:00 60
123 2019-01-02 23:00:00 00.15:00 2019-01-03 00:16:00 60
IIUC, you can mask where the end_time < srt_time and add the date by one:
# convert to timedelta
df['srt_time'] = pd.to_timedelta(df['srt_time'])
df['end_time'] = pd.to_timedelta(df['end_time'])
# convert to datetime
df['srt_date'] = pd.to_datetime(df['srt_date'])
df['fina_datetime'] = pd.to_datetime(df['fina_datetime'])
# the normal end
end_dates = df['srt_date'] + df['end_time']
# increase the end time with end_time < srt_time by one day
end_dates.loc[df['end_time'].le(df['srt_time'])] += pd.to_timedelta(1, unit='D')
# substract:
df['latency_in_secs'] = (df['fina_datetime'].sub(end_dates)
.dt.total_seconds()
)
Output:
code srt_date srt_time end_time fina_datetime latency_in_secs
0 123 2019-01-01 23:23:00 00:12:00 2019-01-02 00:13:00 60.0
1 123 2019-01-02 00:13:00 00:14:00 2019-01-02 00:15:00 60.0
2 123 2019-01-02 23:00:00 00:15:00 2019-01-03 00:16:00 60.0