Timestamp range to hours Python Pandas

Is there a way to split a timestamp range into hours?
FROM:
Person  start                stop
Tom     2019-01-01 12:15:00  2019-01-01 14:25:00
TO:
Person  start                stop
Tom     2019-01-01 12:15:00  2019-01-01 13:00:00
Tom     2019-01-01 13:00:00  2019-01-01 14:00:00
Tom     2019-01-01 14:00:00  2019-01-01 14:25:00

First build the hourly boundaries between start.floor('h') and stop.ceil('h') with pd.date_range. For each row, return a list made of start, the boundaries from the second to the second-to-last, and stop. Explode that list into rows, take the next value (shift(-1)) as stop, rename the exploded column to start, and drop the rows with NaN. The NaN comes from the shift: the last value of the range has no following value, and that row is not needed.
import pandas as pd

def getRange(row):
    # hourly boundaries from floor(start) to ceil(stop)
    rang = pd.date_range(row['start'].floor('h'), row['stop'].ceil('h'), freq='h')
    # replace the two outer boundaries with the real start and stop
    return [row['start']] + rang.to_list()[1:-1] + [row['stop']]

df = (df.assign(range=df.apply(getRange, axis=1))
        .drop(columns=['start', 'stop'])
        .explode('range')
     )
# the next boundary becomes the stop of the current interval
df = df.assign(stop=df['range'].shift(-1)).dropna().rename(columns={'range': 'start'})
OUTPUT:
Person start stop
0 Tom 2019-01-01 12:15:00 2019-01-01 13:00:00
0 Tom 2019-01-01 13:00:00 2019-01-01 14:00:00
0 Tom 2019-01-01 14:00:00 2019-01-01 14:25:00
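Note that the shift(-1) in the answer runs over the whole exploded column, so with more than one input row the last value of one row would be paired with the first value of the next. A minimal sketch of one way to handle several rows, assuming the shift is grouped by the original row index (which repeats after explode). The names hourly_edges and split_to_hours and the second person "Ann" are only illustrative and not part of the answer:

```python
import pandas as pd

def hourly_edges(row):
    # Full-hour marks inside the interval, bracketed by the real start and stop.
    marks = pd.date_range(row["start"].floor("h"), row["stop"].ceil("h"), freq="h")
    return [row["start"]] + marks.to_list()[1:-1] + [row["stop"]]

def split_to_hours(df):
    out = (df.assign(start=df.apply(hourly_edges, axis=1))
             .drop(columns="stop")
             .explode("start"))
    out["start"] = pd.to_datetime(out["start"])
    # The index repeats after explode, so grouping on it keeps the edges
    # of each original row together while shifting.
    out["stop"] = out.groupby(level=0)["start"].shift(-1)
    return out.dropna(subset=["stop"]).reset_index(drop=True)

df = pd.DataFrame({
    "Person": ["Tom", "Ann"],  # "Ann" is made-up sample data
    "start": pd.to_datetime(["2019-01-01 12:15:00", "2019-01-01 09:40:00"]),
    "stop": pd.to_datetime(["2019-01-01 14:25:00", "2019-01-01 10:05:00"]),
})
result = split_to_hours(df)
print(result)
```

Each person keeps only their own intervals, and the summed interval lengths equal the summed lengths of the original ranges.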

Related

Python Pandas Upsampling on average values between data points (15min to 1min)

I have some issues with resampling my data in pandas. I'm trying to upsample 15-minute values to 1-minute values. The resampled dataframe values should contain the sum split equally between the two values of the original dataframe. This code generates an example of the problem.
import pandas as pd
import numpy as np
dates = pd.DataFrame(pd.date_range(start="20190101",end="20200101", freq="15min"))
values = pd.DataFrame(np.random.randint(0,10,size=(35041, 1)))
df = pd.concat([dates,values], axis = 1)
df = df.set_index(pd.DatetimeIndex(df.iloc[:,0]))
print(df.resample("min").agg("sum").head(16))
This is an example output:
2019-01-01 00:00:00 3
2019-01-01 00:01:00 0
2019-01-01 00:02:00 0
2019-01-01 00:03:00 0
2019-01-01 00:04:00 0
2019-01-01 00:05:00 0
2019-01-01 00:06:00 0
2019-01-01 00:07:00 0
2019-01-01 00:08:00 0
2019-01-01 00:09:00 0
2019-01-01 00:10:00 0
2019-01-01 00:11:00 0
2019-01-01 00:12:00 0
2019-01-01 00:13:00 0
2019-01-01 00:14:00 0
2019-01-01 00:15:00 3
The values shown as 0 should be replaced by the sum of the two values (in this example: 2019-01-01 00:00:00 3 and 2019-01-01 00:15:00 3), which equals 6, and this should be evenly distributed over the time range.
2019-01-01 00:00:00 6/15
2019-01-01 00:01:00 6/15
2019-01-01 00:02:00 6/15
2019-01-01 00:03:00 6/15
2019-01-01 00:04:00 6/15
2019-01-01 00:05:00 6/15
2019-01-01 00:06:00 6/15
2019-01-01 00:07:00 6/15
2019-01-01 00:08:00 6/15
2019-01-01 00:09:00 6/15
2019-01-01 00:10:00 6/15
2019-01-01 00:11:00 6/15
2019-01-01 00:12:00 6/15
2019-01-01 00:13:00 6/15
2019-01-01 00:14:00 6/15
2019-01-01 00:15:00 6/15
This should be done for each resampled group over the whole dataframe.
In other words, the sum of the original dataframe and the resampled dataframe should be equal.
Thanks for your help.
First of all, I would recommend working with a Series if there is only one column.
series = pd.Series(index=pd.date_range(start="20190101", end="20200101", freq="15min"),
                   data=np.random.randint(0, 10, size=(35041,)).tolist())
Then I would create a new index with one-minute steps, calculate the cumulative sum of the values, and interpolate between these values. In your use case, "linear" is the suggested interpolation method:
beginning = series.index[0]
end = series.index[-1]
new_index = pd.date_range(beginning, end, freq="min")
cumsum = series.cumsum()
cumsum = cumsum.reindex(new_index)
cumsum = cumsum.interpolate("linear")
Afterwards, you have an interpolated cumulative sum, which you can convert back to the values you are looking for via:
series_upsampled = cumsum.diff()
If you want, you can shift series_upsampled by one step:
series_upsampled = series_upsampled.shift(-1)
Pay attention to the NaN value at the beginning (or, if you shift the series, at the end).
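To see what the cumulative-sum approach produces, here is a small self-contained sketch on three made-up 15-minute values (3, 3, 6) instead of the random data. It assumes equally spaced input, since "linear" interpolation ignores the index values:

```python
import numpy as np
import pandas as pd

# Three made-up 15-minute readings.
series = pd.Series([3.0, 3.0, 6.0],
                   index=pd.date_range("2019-01-01", periods=3, freq="15min"))

# One-minute index spanning the same period.
new_index = pd.date_range(series.index[0], series.index[-1], freq="min")

# Interpolate the running total, then difference it to get per-minute values.
cumsum = series.cumsum().reindex(new_index).interpolate(method="linear")
upsampled = cumsum.diff()

print(upsampled.head(17))
print(upsampled.sum(), series.sum() - series.iloc[0])
```

Each original value ends up spread over the 15 minutes leading up to its timestamp, so the upsampled sum equals the original sum minus the very first value (which only shows up as the leading NaN).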

combine columns with different data types to make a single dateTime column in pandas data frames

I have imported data from some source that has date in datatype class 'object' and hour in integer and looks something like:
Date Hour Val
2019-01-01 1 0
2019-01-01 2 0
2019-01-01 3 0
2019-01-01 4 0
2019-01-01 5 0
2019-01-01 6 0
2019-01-01 7 0
2019-01-01 8 0
I need a single column that has the date-time in a column that looks like this:
DATETIME
2019-01-01 01:00:00
2019-01-01 02:00:00
2019-01-01 03:00:00
2019-01-01 04:00:00
2019-01-01 05:00:00
2019-01-01 06:00:00
2019-01-01 07:00:00
2019-01-01 08:00:00
I tried to convert the date column to dateTime format using
pd.datetime(df.Date)
and then using
df.Date.dt.hour = df.Hour
I get the error
ValueError: modifications to a property of a datetimelike object are not supported. Change values on the original.
Is there an easy way to do this?
Use pandas.to_timedelta and pandas.to_datetime:
# if needed
df['Date'] = pd.to_datetime(df['Date'])
df['Datetime'] = df['Date'] + pd.to_timedelta(df['Hour'], unit='h')
[out]
Date Hour Val Datetime
0 2019-01-01 1 0 2019-01-01 01:00:00
1 2019-01-01 2 0 2019-01-01 02:00:00
2 2019-01-01 3 0 2019-01-01 03:00:00
3 2019-01-01 4 0 2019-01-01 04:00:00
4 2019-01-01 5 0 2019-01-01 05:00:00
5 2019-01-01 6 0 2019-01-01 06:00:00
6 2019-01-01 7 0 2019-01-01 07:00:00
7 2019-01-01 8 0 2019-01-01 08:00:00
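Since you asked for a method combining the columns and using a single pd.to_datetime call, you could do the following (%H is the 24-hour directive; %I would only accept hours 1 to 12):
# zero-pad the hour so that it always has two digits
df['Datetime'] = pd.to_datetime(df['Date'].astype(str) + ' ' +
                                df['Hour'].astype(str).str.zfill(2),
                                format='%Y-%m-%d %H')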
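A quick self-contained check that both approaches agree, as a sketch on made-up rows that include hours above 12 (the variable names via_timedelta and via_string are only illustrative):

```python
import pandas as pd

# Made-up sample with afternoon hours included.
df = pd.DataFrame({"Date": ["2019-01-01"] * 3, "Hour": [1, 13, 23], "Val": [0, 0, 0]})

# Approach 1: parse the date, then add the hour as a timedelta.
via_timedelta = pd.to_datetime(df["Date"]) + pd.to_timedelta(df["Hour"], unit="h")

# Approach 2: build one string per row and parse it in a single call.
via_string = pd.to_datetime(
    df["Date"] + " " + df["Hour"].astype(str).str.zfill(2),
    format="%Y-%m-%d %H",
)

print(via_timedelta.tolist() == via_string.tolist())
```

The timedelta version avoids string formatting altogether, which is usually the simpler choice when the hour is already an integer column.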

How can I control the hourly GROUPBY setting in Pandas?

I have the following data set:
time value
2019-01-01 8:00:00 10
2019-01-01 8:30:00 20
2019-01-01 9:00:00 30
2019-01-01 9:30:00 100
2019-01-01 10:00:00 400
Using df.groupby(pd.Grouper(key = 'time', freq = '1h')).sum().reset_index(), it returned:
time value
2019-01-01 8:00:00 30
2019-01-01 9:00:00 130
2019-01-01 10:00:00 400
It aggregates each group by the hour value. But how can I control where the group boundaries fall? I would like any time > 8:00 and <= 9:00 to go into the 9:00 group. For example:
time value
2019-01-01 8:00:00 10
2019-01-01 9:00:00 50
2019-01-01 10:00:00 500
IIUC, use ceil (with time as the index):
Yourdf=df.groupby(df.index.ceil('H')).sum()
value
time
2019-01-01 08:00:00 10
2019-01-01 09:00:00 50
2019-01-01 10:00:00 500
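Or use resample with closed='right' (by default the bins are labeled by their left edge, hence the 07:00 to 09:00 labels below):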
df.resample('H',closed='right').sum()
value
time
2019-01-01 07:00:00 10
2019-01-01 08:00:00 50
2019-01-01 09:00:00 500
Use closed='right' (and label='right' if each group should be named after its right edge), i.e.
pd.Grouper(key = 'time', freq = '1h', closed='right', label='right')
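For comparison, a self-contained sketch on the question's five rows, with time kept as a regular column. It assumes label='right' is wanted so that each bin is named after its right edge; the names by_ceil and by_grouper are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime([
        "2019-01-01 08:00", "2019-01-01 08:30", "2019-01-01 09:00",
        "2019-01-01 09:30", "2019-01-01 10:00",
    ]),
    "value": [10, 20, 30, 100, 400],
})

# Round every timestamp up to the next full hour and group on that.
by_ceil = df.groupby(df["time"].dt.ceil("h"))["value"].sum()

# Right-closed hourly bins, labeled by their right edge.
by_grouper = df.groupby(
    pd.Grouper(key="time", freq="h", closed="right", label="right")
)["value"].sum()

print(by_ceil)
print(by_grouper)
```

Both give 10, 50 and 500 for 08:00, 09:00 and 10:00. One difference to keep in mind: the Grouper version also emits empty hours inside the range, while the ceil version only contains hours that actually occur.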

Replace "flatline" repeated data in Pandas series with nan

I have hourly time series data stored in a pandas series. Similar to this example:
import pandas as pd
import numpy as np
date_rng = pd.date_range(start='1/1/2019', end='1/2/2019', freq='H')
data = np.random.uniform(180,182,size=(len(date_rng)))
timeseries = pd.Series(data, index=date_rng)
timeseries.iloc[4:12] = 181.911
At three decimal places, it is highly unlikely the data will be exactly the same for more than, say, 3 hours in a row. When this flatlining occurs, it indicates an issue with the sensor. So I want to detect repeated data and replace it with nan values (i.e., detect the repeated values 181.911 in the above and replace with nan)
I assume I can iterate over the time series and detect/replace that way, but is there a more efficient way to do this?
You can do it with diff, but the first occurrence is retained in the series.
timeseries.where(timeseries.diff(1)!=0.0,np.nan)
2019-01-01 00:00:00 180.539278
2019-01-01 01:00:00 181.509729
2019-01-01 02:00:00 180.740326
2019-01-01 03:00:00 181.736425
2019-01-01 04:00:00 181.911000
2019-01-01 05:00:00 NaN
2019-01-01 06:00:00 NaN
2019-01-01 07:00:00 NaN
2019-01-01 08:00:00 NaN
2019-01-01 09:00:00 NaN
2019-01-01 10:00:00 NaN
2019-01-01 11:00:00 NaN
2019-01-01 12:00:00 180.093216
2019-01-01 13:00:00 180.623440
First occurrence also can be removed using diff(-1) and diff(1):
np.c_[timeseries.where(timeseries.diff(-1)!=0.0,np.nan), timeseries.where(timeseries.diff(1)!=0.0,np.nan)].mean(axis=1)
This works when the repetitions are consecutive in the series.
With the following reasonably efficient function, one can choose the minimum number of repeated values to consider as a flatline:
import numpy as np
def remove_flatlines(ts, threshold):
    # get start and end indices of each flatline as an n x 2 array
    isflat = np.concatenate(([False], np.isclose(ts.diff(), 0), [False]))
    isedge = isflat[1:] != isflat[:-1]
    flatrange = np.where(isedge)[0].reshape(-1, 2)
    # include also first value of each flatline
    flatrange[:, 0] -= 1
    # remove flatlines with at least threshold number of equal values
    ts = ts.copy()
    for j in range(len(flatrange)):
        if flatrange[j][1] - flatrange[j][0] >= threshold:
            ts.iloc[flatrange[j][0]:flatrange[j][1]] = np.nan
    return ts
Applied to example:
remove_flatlines(timeseries, threshold=3)
2019-01-01 00:00:00 181.447940
2019-01-01 01:00:00 180.142692
2019-01-01 02:00:00 180.994674
2019-01-01 03:00:00 180.116489
2019-01-01 04:00:00 NaN
2019-01-01 05:00:00 NaN
2019-01-01 06:00:00 NaN
2019-01-01 07:00:00 NaN
2019-01-01 08:00:00 NaN
2019-01-01 09:00:00 NaN
2019-01-01 10:00:00 NaN
2019-01-01 11:00:00 NaN
2019-01-01 12:00:00 180.972644
2019-01-01 13:00:00 181.969759
2019-01-01 14:00:00 181.008693
2019-01-01 15:00:00 180.769328
2019-01-01 16:00:00 180.576061
2019-01-01 17:00:00 181.562315
2019-01-01 18:00:00 181.978567
2019-01-01 19:00:00 181.928330
2019-01-01 20:00:00 180.773995
2019-01-01 21:00:00 180.475290
2019-01-01 22:00:00 181.460028
2019-01-01 23:00:00 180.220693
2019-01-02 00:00:00 181.630176
Freq: H, dtype: float64
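As an alternative to the edge-index function, the same idea can be sketched with a run-length grouping. This is a different technique from the one in the answer: it assumes exact equality between neighbours (not np.isclose), and the name mask_flatlines and the sample values are only illustrative:

```python
import numpy as np
import pandas as pd

def mask_flatlines(ts, threshold):
    # A new run starts wherever a value differs from its predecessor.
    run_id = ts.ne(ts.shift()).cumsum()
    # Length of the run each value belongs to.
    run_len = ts.groupby(run_id).transform("size")
    # Blank out every value that sits in a run of at least `threshold` values.
    return ts.mask(run_len >= threshold)

rng = pd.date_range("2019-01-01", periods=8, freq="h")
ts = pd.Series([180.5, 181.911, 181.911, 181.911, 180.2, 180.2, 181.0, 180.7],
               index=rng)
cleaned = mask_flatlines(ts, threshold=3)
print(cleaned)
```

Here the run of three 181.911 values is replaced by NaN, while the run of two 180.2 values stays because it is shorter than the threshold.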

grouping multiple time values into a start and finish time

I have a dataframe as follows
import pandas as pd
import numpy as np
IDs = ['A','A','A','B','B']
times = pd.date_range(start='01/01/2019',end='01/02/2019',freq='h')
times_2 = pd.date_range(start='01/01/2019',end='01/02/2019',freq='h') + pd.Timedelta('15min')
Vals = [np.random.randint(15,250) for x in enumerate(times)]
df = pd.DataFrame({'id' : IDs*5,
                   'Start' : times,
                   'End' : times_2,
                   'Value' : Vals}, columns=['id','Start','End','Value'])
this gives me a df as follows.
print(df.head(5))
id Start End Value
0 A 2019-01-01 00:00:00 2019-01-01 00:15:00 52
1 A 2019-01-01 01:00:00 2019-01-01 01:15:00 69
2 A 2019-01-01 02:00:00 2019-01-01 02:15:00 209
3 B 2019-01-01 03:00:00 2019-01-01 03:15:00 163
4 B 2019-01-01 04:00:00 2019-01-01 04:15:00 70
Now what I'm trying to do is apply a groupby to my data frame to get the sum of the Value column while retaining the min start and max end time for each id.
So my example output would be as follows:
id Start End Value
0 A 2019-01-01 00:00:00 2019-01-01 22:15:00 2007
1 B 2019-01-01 03:00:00 2019-01-02 00:15:00 1385
The only way I've sort of made this work is pass the min and max of each unique ID by start and end time, pass these to a list and then manually create the start and end times, but it was slow and messy and prone to error... hoping someone here can guide me as to what I'm missing.
Using groupby with agg
df.groupby('id').agg({'Start':'min','End':'max','Value':'sum'})  # append .reset_index() to turn id back into a column
Out[92]:
Start End Value
id
A 2019-01-01 00:00:00 2019-01-01 22:15:00 2152
B 2019-01-01 03:00:00 2019-01-02 00:15:00 972
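The same aggregation can also be written with named aggregation, which keeps id as a regular column when as_index=False is passed. A minimal sketch on three made-up rows (values taken from the head of the question's frame), assuming a pandas version that supports named aggregation:

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["A", "A", "B"],
    "Start": pd.to_datetime(["2019-01-01 00:00", "2019-01-01 01:00", "2019-01-01 03:00"]),
    "End": pd.to_datetime(["2019-01-01 00:15", "2019-01-01 01:15", "2019-01-01 03:15"]),
    "Value": [52, 69, 163],
})

# One output column per keyword: (source column, aggregation function).
summary = (df.groupby("id", as_index=False)
             .agg(Start=("Start", "min"), End=("End", "max"), Value=("Value", "sum")))
print(summary)
```

For id A this gives the earliest Start (00:00), the latest End (01:15) and the summed Value (121).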
