Python Pandas Upsampling on average values between data points (15min to 1min) - python

i have some issues with my dataresampling in pandas. I´m trying to upsample 15 min values to 1min values. The resampled dataframe values shoud contain the sum spliited equaly between the two values of the original dataframe. This codes generates an extraction of the problem.
import pandas as pd
import numpy as np
dates = pd.DataFrame(pd.date_range(start="20190101",end="20200101", freq="15min"))
values = pd.DataFrame(np.random.randint(0,10,size=(35041, 1)))
df = pd.concat([dates,values], axis = 1)
df = df.set_index(pd.DatetimeIndex(df.iloc[:,0]))
print(df.resample("min").agg("sum").head(16))
This is an example output:
2019-01-01 00:00:00 3
2019-01-01 00:01:00 0
2019-01-01 00:02:00 0
2019-01-01 00:03:00 0
2019-01-01 00:04:00 0
2019-01-01 00:05:00 0
2019-01-01 00:06:00 0
2019-01-01 00:07:00 0
2019-01-01 00:08:00 0
2019-01-01 00:09:00 0
2019-01-01 00:10:00 0
2019-01-01 00:11:00 0
2019-01-01 00:12:00 0
2019-01-01 00:13:00 0
2019-01-01 00:14:00 0
2019-01-01 00:15:00 3
The values shown as 0 should be replaced by the sum of the two values (in this exapmle: 2019-01-01 00:00:00 3; and 2019-01-01 00:15:00 3) which equals to 6 and this should be evenly distibuted over the timearea.
2019-01-01 00:00:00 6/15
2019-01-01 00:01:00 6/15
2019-01-01 00:02:00 6/15
2019-01-01 00:03:00 6/15
2019-01-01 00:04:00 6/15
2019-01-01 00:05:00 6/15
2019-01-01 00:06:00 6/15
2019-01-01 00:07:00 6/15
2019-01-01 00:08:00 6/15
2019-01-01 00:09:00 6/15
2019-01-01 00:10:00 6/15
2019-01-01 00:11:00 6/15
2019-01-01 00:12:00 6/15
2019-01-01 00:13:00 6/15
2019-01-01 00:14:00 6/15
2019-01-01 00:15:00 6/15
This should be done for each resampled group over the whole Dataframe.
In other word the sum of the original dataframe and the resampled dataframe should be equal.
Thanks for your help.

First of all, personally, I would recommend working with a series, if there is only one column.
series = pd.Series(index=pd.date_range(start="20190101",end="20200101",
freq="15min"), data=(np.random.randint(0,10,size=(35041,))).tolist())
 Then, I would create a new index with minutely values, calculate the cumulative sum of the values and interpolate between these values. In your use case "linear" is suggested as interpolation method:
beginning = series.index[0]
end = series.index[-1]
new_index = pd.date_range(start, end, freq="1T")
cumsum = series.cumsum()
cumsum = result.reindex(new_index)
cumsum = result.interpolate("linear")
Afterwards, you get an interpolated cumulative sum, which you can convert back to your searched values via:
series_upsampled = cumsum.diff()
If you want, you can shift the series_upsampled by 1, doing
series_upsampled = series_upsampled.shift(-1)
Pay attention to NaN value at the beginning (or if you shift your series, at the end).

Related

Timestamp range to hours Python Pandas

Is it a way to split timestamp range to hours ?
FROM:
Person
start
stop
Tom
2019-01-01 12:15:00
2019-01-01 14:25:00
TO:
Person
start
stop
Tom
2019-01-01 12:15:00
2019-01-01 13:00:00
Tom
2019-01-01 13:00:00
2019-01-01 14:00:00
Tom
2019-01-01 14:00:00
2019-01-01 14:25:00
First get all the ranges between start.floor('h') and stop.ceil('h') with hour frequency using pd.date_range, then return start, range from second to second last, and stop, it'll give a list, so explode it, assign stop to this range by shifting it by -1, and finally assign start to the range, and dropna rows (this na will appear due to the effect of shift, and is not required.)
def getRange(row):
rang = pd.date_range(row['start'].floor('h'), row['stop'].ceil('h'), freq='h')
return [row['start']] + rang.to_list()[1:-1] + [row['stop']]
df =(df.assign(range=df.apply(getRange, axis=1))
.drop(columns=['start', 'stop'])
.explode('range')
)
df = df.assign(stop=df['range'].shift(-1)).dropna().rename(columns={'range':'start'})
OUTPUT:
Person start stop
0 Tom 2019-01-01 12:15:00 2019-01-01 13:00:00
0 Tom 2019-01-01 13:00:00 2019-01-01 14:00:00
0 Tom 2019-01-01 14:00:00 2019-01-01 14:25:00

combine columns with different data types to make a single dateTime column in pandas data frames

I have imported data from some source that has date in datatype class 'object' and hour in integer and looks something like:
Date Hour Val
2019-01-01 1 0
2019-01-01 2 0
2019-01-01 3 0
2019-01-01 4 0
2019-01-01 5 0
2019-01-01 6 0
2019-01-01 7 0
2019-01-01 8 0
I need a single column that has the date-time in a column that looks like this:
DATETIME
2019-01-01 01:00:00
2019-01-01 02:00:00
2019-01-01 03:00:00
2019-01-01 04:00:00
2019-01-01 05:00:00
2019-01-01 06:00:00
2019-01-01 07:00:00
2019-01-01 08:00:00
I tried to convert the date column to dateTime format using
pd.datetime(df.Date)
and then using
df.Date.dt.hour = df.Hour
I get the error
ValueError: modifications to a property of a datetimelike object are not supported. Change values on the original.
Is there an easy way to do this?
Use pandas.to_timedelta and pandas.to_datetime:
# if needed
df['Date'] = pd.to_datetime(df['Date'])
df['Datetime'] = df['Date'] + pd.to_timedelta(df['Hour'], unit='H')
[out]
Date Hour Val Datetime
0 2019-01-01 1 0 2019-01-01 01:00:00
1 2019-01-01 2 0 2019-01-01 02:00:00
2 2019-01-01 3 0 2019-01-01 03:00:00
3 2019-01-01 4 0 2019-01-01 04:00:00
4 2019-01-01 5 0 2019-01-01 05:00:00
5 2019-01-01 6 0 2019-01-01 06:00:00
6 2019-01-01 7 0 2019-01-01 07:00:00
7 2019-01-01 8 0 2019-01-01 08:00:00
Since you asked for a method combining the columns and using a single pd.to_datetime call, you could do:
df['Datetime'] = pd.to_datetime((df['Date'].astype(str) + ' ' +
df['Hour'].astype(str)),
format='%Y-%m-%d %I')

Replace "flatline" repeated data in Pandas series with nan

I have hourly time series data stored in a pandas series. Similar to this example:
import pandas as pd
import numpy as np
date_rng = pd.date_range(start='1/1/2019', end='1/2/2019', freq='H')
data = np.random.uniform(180,182,size=(len(date_rng)))
timeseries = pd.Series(data, index=date_rng)
timeseries.iloc[4:12] = 181.911
At three decimal places, it is highly unlikely the data will be exactly the same for more than, say, 3 hours in a row. When this flatlining occurs, it indicates an issue with the sensor. So I want to detect repeated data and replace it with nan values (i.e., detect the repeated values 181.911 in the above and replace with nan)
I assume I can iterate over the time series and detect/replace that way, but is there a more efficient way to do this?
You can do it with diff, but the first occurrence retain in the series.
timeseries.where(timeseries.diff(1)!=0.0,np.nan)
2019-01-01 00:00:00 180.539278
2019-01-01 01:00:00 181.509729
2019-01-01 02:00:00 180.740326
2019-01-01 03:00:00 181.736425
2019-01-01 04:00:00 181.911000
2019-01-01 05:00:00 NaN
2019-01-01 06:00:00 NaN
2019-01-01 07:00:00 NaN
2019-01-01 08:00:00 NaN
2019-01-01 09:00:00 NaN
2019-01-01 10:00:00 NaN
2019-01-01 11:00:00 NaN
2019-01-01 12:00:00 180.093216
2019-01-01 13:00:00 180.623440
First occurrence also can be removed using diff(-1) and diff(1):
np.c_[timeseries.where(timeseries.diff(-1)!=0.0,np.nan), timeseries.where(timeseries.diff(1)!=0.0,np.nan)].mean(axis=1)
It works when repetitions are sequential in series.
With following reasonably efficient function one can choose the minimum number of repeated values to consider as flatline:
import numpy as np
def remove_flatlines(ts, threshold):
# get start and end indices of each flatline as an n x 2 array
isflat = np.concatenate(([False], np.isclose(ts.diff(), 0), [False]))
isedge = isflat[1:] != isflat[:-1]
flatrange = np.where(isedge)[0].reshape(-1, 2)
# include also first value of each flatline
flatrange[:, 0] -= 1
# remove flatlines with at least threshold number of equal values
ts = ts.copy()
for j in range(len(flatrange)):
if flatrange[j][1] - flatrange[j][0] >= threshold:
ts.iloc[flatrange[j][0]:flatrange[j][1]] = np.nan
return ts
Applied to example:
remove_flatlines(timeseries, threshold=3)
2019-01-01 00:00:00 181.447940
2019-01-01 01:00:00 180.142692
2019-01-01 02:00:00 180.994674
2019-01-01 03:00:00 180.116489
2019-01-01 04:00:00 NaN
2019-01-01 05:00:00 NaN
2019-01-01 06:00:00 NaN
2019-01-01 07:00:00 NaN
2019-01-01 08:00:00 NaN
2019-01-01 09:00:00 NaN
2019-01-01 10:00:00 NaN
2019-01-01 11:00:00 NaN
2019-01-01 12:00:00 180.972644
2019-01-01 13:00:00 181.969759
2019-01-01 14:00:00 181.008693
2019-01-01 15:00:00 180.769328
2019-01-01 16:00:00 180.576061
2019-01-01 17:00:00 181.562315
2019-01-01 18:00:00 181.978567
2019-01-01 19:00:00 181.928330
2019-01-01 20:00:00 180.773995
2019-01-01 21:00:00 180.475290
2019-01-01 22:00:00 181.460028
2019-01-01 23:00:00 180.220693
2019-01-02 00:00:00 181.630176
Freq: H, dtype: float64

grouping multiple time values into a start and finish time

I have a dataframe as follows
import pandas as pd
import numpy as np
IDs = ['A','A','A','B','B']
times = pd.date_range(start='01/01/2019',end='01/02/2019',freq='h')
times_2 = pd.date_range(start='01/01/2019',end='01/02/2019',freq='h') + pd.Timedelta('15min')
Vals = [np.random.randint(15,250) for x in enumerate(times)]
df = pd.DataFrame({'id' : IDs*5,
'Start' : times,
'End' : times_2,
'Value' : Vals},columns=['id','Start','End','Value'])
this gives me a df as follows.
print(df.head(5))
id Start End Value
0 A 2019-01-01 00:00:00 2019-01-01 00:15:00 52
1 A 2019-01-01 01:00:00 2019-01-01 01:15:00 69
2 A 2019-01-01 02:00:00 2019-01-01 02:15:00 209
3 B 2019-01-01 03:00:00 2019-01-01 03:15:00 163
4 B 2019-01-01 04:00:00 2019-01-01 04:15:00 70
now what I'm trying to do is apply a group by to my data frame to get the sum of the value column, however, whilst doing this I would like to retain the min start and max end time of my df.
so my example output would be as follows :
id Start End Value
0 A 2019-01-01 00:00:00 2019-01-01 22:15:00 2007
1 B 2019-01-01 03:00:00 2019-01-02 00:15:00 1385
The only way I've sort of made this work is pass the min and max of each unique ID by start and end time, pass these to a list and then manually create the start and end times, but it was slow and messy and prone to error... hoping someone here can guide me as to what I'm missing.
Using groupby with agg
df.groupby('id').agg({'Start':'min','End':'max','Value':'sum'})#reset_index()
Out[92]:
Start End Value
id
A 2019-01-01 00:00:00 2019-01-01 22:15:00 2152
B 2019-01-01 03:00:00 2019-01-02 00:15:00 972

Create 5-minute interval between two timestamp

I have a bunch of data point for each there are two columns: start_dt and end_dt. I am wondering how can I split the time gap between start_dt and end_dt into 5 minutes interval?
For instance,
id+++++++start_tm ++++++++++++++ end_dt
1+++++++2019-01-01 10:00 +++++++ 2019-01-01 11:00
=====================================================
What I am looking for is:
id+++++++start_tm ++++++++++++++ end_dt
1+++++++2019-01-01 10:00 +++++++ 2019-01-01 10:05
1+++++++2019-01-01 10:05 +++++++ 2019-01-01 10:10
1+++++++2019-01-01 10:10 +++++++ 2019-01-01 10:15
1+++++++2019-01-01 10:15 +++++++ 2019-01-01 10:20
==================================================
and so fort
is there any function out of the box to do so?
If not, any help to create this function is wonderful
If you have two Python datetime objects representing a timespan, and you just want to break that timespan up into 5 minute intervals represented by datetime objects, you could just do this:
import datetime
d1 = datetime.datetime(2019, 1, 1, 10, 0)
d2 = datetime.datetime(2019, 1, 1, 11, 0)
delta = datetime.timedelta(minutes=5)
times = []
while d1 < d2:
times.append(d1)
d1 += delta
times.append(d2)
for i in range(len(times) - 1):
print("{} - {}".format(times[i], times[i+1]))
Output:
2019-01-01 10:00:00 - 2019-01-01 10:05:00
2019-01-01 10:05:00 - 2019-01-01 10:10:00
2019-01-01 10:10:00 - 2019-01-01 10:15:00
2019-01-01 10:15:00 - 2019-01-01 10:20:00
2019-01-01 10:20:00 - 2019-01-01 10:25:00
2019-01-01 10:25:00 - 2019-01-01 10:30:00
2019-01-01 10:30:00 - 2019-01-01 10:35:00
2019-01-01 10:35:00 - 2019-01-01 10:40:00
2019-01-01 10:40:00 - 2019-01-01 10:45:00
2019-01-01 10:45:00 - 2019-01-01 10:50:00
2019-01-01 10:50:00 - 2019-01-01 10:55:00
2019-01-01 10:55:00 - 2019-01-01 11:00:00
This should handle a period that isn't an even multiple of the delta, giving you a shorter interval at the end.
I don't know pyspark, but if you are using pandas this works. (and pyspark may be similar):
1:create data
import pandas as pd
import numpy as np
data = pd.DataFrame({
'id':[1, 2],
'start_tm': pd.date_range('2019-01-01 00:00', periods=2, freq='D'),
'end_dt': pd.date_range('2019-01-01 00:30', periods=2, freq='D')})
# pandas dataframe is similar to the data in pyspark
output
id start_tm end_dt
1 2019-01-01 2019-01-01 00:30:00
2 2019-01-02 2019-01-02 00:30:00
2: split columns
period = np.timedelta64(5, 'm') # 5 minutes
idx = (data['end_dt'] - data['start_tm']) > period
while idx.any():
new_data = data[idx].copy()
new_data['start_tm'] = new_data['start_tm'] + period
data.loc[idx, 'end_dt'] = (data[idx]['start_tm'] + period).values
data = pd.concat([data, new_data], axis=0)
idx = (data['end_dt'] - data['start_tm']) > period
output
id start_tm end_dt
1 2019-01-01 00:00:00 2019-01-01 00:05:00
2 2019-01-02 00:00:00 2019-01-02 00:05:00
1 2019-01-01 00:05:00 2019-01-01 00:10:00
2 2019-01-02 00:05:00 2019-01-02 00:10:00
1 2019-01-01 00:10:00 2019-01-01 00:15:00
2 2019-01-02 00:10:00 2019-01-02 00:15:00
1 2019-01-01 00:15:00 2019-01-01 00:20:00
2 2019-01-02 00:15:00 2019-01-02 00:20:00
1 2019-01-01 00:20:00 2019-01-01 00:25:00
2 2019-01-02 00:20:00 2019-01-02 00:25:00
1 2019-01-01 00:25:00 2019-01-01 00:30:00
2 2019-01-02 00:25:00 2019-01-02 00:30:00

Categories

Resources