Rolling window on timestamped DataFrame with a custom step? - python

I have been fiddling about with pandas.DataFrame.rolling for some time now and I haven't been able to achieve the result that I am looking for, so before I write a custom windowing function I figured I would ask if I'm missing something.
I have postgresql data with a composite index of (time, node) that has been read into a pandas.DataFrame, where time is a certain hour on a certain date. I need to create windows that contain all entries within the last two calendar dates (or any arbitrary number of days), for example, beginning at 2022-12-26 00:00:00 and ending on 2022-12-27 23:00:00, and then perform operations on that window to return a new, resultant DataFrame. The window should then move forward an entire calendar date, which is where I am failing.
| time | node | value |
| --------------------- | ----- | ------ |
| 2022-12-26 00:00:00 | 123 | low |
| 2022-12-26 01:00:00 | 123 | med |
| 2022-12-26 02:00:00 | 123 | low |
| 2022-12-26 03:00:00 | 123 | high |
| ... | ... | ... |
| 2022-12-26 00:00:00 | 999 | low |
| 2022-12-26 01:00:00 | 999 | low |
| 2022-12-26 02:00:00 | 999 | low |
| 2022-12-26 03:00:00 | 999 | med |
| ... | ... | ... |
| 2022-12-27 00:00:00 | 123 | low |
| 2022-12-27 01:00:00 | 123 | med |
| 2022-12-27 02:00:00 | 123 | low |
| 2022-12-27 03:00:00 | 123 | high |
When I use something akin to df.rolling(window=pd.Timedelta('2days')), the windows move forward hour by hour rather than beginning on the next calendar date.
I've played around with min_periods, but it doesn't seem to work with my data, and it wouldn't be acceptable in the long run anyway because the number of expected observations per window is not fixed. The step parameter also appears to be of no use here, since I am using an offset rather than an integer for the window.
Is the behaviour I am looking for doable with pandas.DataFrame.rolling or must I look elsewhere/write my own windowing function?
Any guidance would be appreciated. Thanks!

So from what I understand, you want to create windows of length ndays, where each subsequent window starts on the next calendar day.
Given a dataframe spanning 5 calendar days with a 1-hour frequency between indices:
import pandas as pd
import numpy as np

periods = 23 * 5
df = pd.DataFrame(
    {'value': list(range(periods))},
    index=pd.date_range('2022-12-16', periods=periods, freq='H')
)
d = np.random.choice(
    pd.date_range('2022-12-16', periods=periods, freq='H'),
    int(periods * 0.25)
)
df = df.drop(index=d)
df.head(5)
>>> value
2022-12-16 00:00:00 0
2022-12-16 01:00:00 1
2022-12-16 02:00:00 2
2022-12-16 04:00:00 4
2022-12-16 05:00:00 5
I randomly dropped some indices to simulate missing data.
We can use df.resample (see the docs) to group the data by day, regardless of the missing data:
days = df.resample('1d')
print(days.get_group('2022-12-16'))
>>> value
2022-12-16 00:00:00 0
2022-12-16 01:00:00 1
2022-12-16 02:00:00 2
2022-12-16 04:00:00 4
2022-12-16 05:00:00 5
2022-12-16 06:00:00 6
2022-12-16 07:00:00 7
2022-12-16 08:00:00 8
2022-12-16 09:00:00 9
2022-12-16 11:00:00 11
2022-12-16 12:00:00 12
2022-12-16 13:00:00 13
2022-12-16 14:00:00 14
2022-12-16 15:00:00 15
2022-12-16 17:00:00 17
2022-12-16 18:00:00 18
2022-12-16 19:00:00 19
2022-12-16 21:00:00 21
2022-12-16 22:00:00 22
2022-12-16 23:00:00 23
Now, we only need to iterate over the days in a "sliding" manner. The package more-itertools has some helpful functions, such as windowed, which makes it easy to control the size of the window (here via ndays):
from more_itertools import windowed

ndays = 2
windows = [
    pd.concat([w[1] for w in window])
    for window in windowed(days, ndays)
]
Printing the first and last index of each window returns:
for window in windows:
    print(window.iloc[[0, -1]])
>>> value
2022-12-16 00:00:00 0
2022-12-17 23:00:00 47
value
2022-12-17 00:00:00 24
2022-12-18 23:00:00 71
value
2022-12-18 00:00:00 48
2022-12-19 23:00:00 95
value
2022-12-19 01:00:00 73
2022-12-20 18:00:00 114
Furthermore, you can set step in windowed to control the step size between windows.
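If you then need to collapse each window into one row of a new, resultant DataFrame (as described in the question), a minimal sketch could look like the following; the per-window mean and the window_start key are only illustrative assumptions about what you want to compute:
# Hypothetical per-window aggregation: one result row per window,
# keyed by the calendar date on which the window starts.
result = pd.DataFrame({
    'window_start': [w.index[0].normalize() for w in windows],
    'mean_value': [w['value'].mean() for w in windows],
}).set_index('window_start')
print(result)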

Related

django- get all model instances with overlapping date ranges

Let's say I have the following model:
class DateRange(models.Model):
    start = models.DateTimeField()
    end = models.DateTimeField()
Is there a way to get all pairs of DateRange instances with any overlap in their start to end range? For example, if I had:
id | start | end
-----+---------------------+---------------------
1 | 2020-01-02 12:00:00 | 2020-01-02 16:00:00 # overlap with 2, 3 and 5
2 | 2020-01-02 13:00:00 | 2020-01-02 14:00:00 # overlap with 1 and 3
3 | 2020-01-02 13:30:00 | 2020-01-02 17:00:00 # overlap with 1 and 2
4 | 2020-01-02 10:00:00 | 2020-01-02 10:30:00 # no overlap
5 | 2020-01-02 12:00:00 | 2020-01-02 12:30:00 # overlap with 1
I'd want:
id_1 | id_2
------+-----
1 | 2
1 | 3
1 | 5
2 | 3
Any thoughts on the best way to do this? The order of id_1 and id_2 doesn't matter, but I do need it to be distinct (e.g.- id_1=1, id_2=2 is the same as id_1=2, id_2=1 and should not be repeated)
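A minimal sketch of one way to approach this with the Django ORM, assuming the overlap condition a.start < b.end and b.start < a.end, and using id ordering to keep each pair distinct (this issues one query per row, so a single self-join would scale better on large tables):
# Sketch only: for each range, find the higher-id ranges that overlap it.
pairs = []
for r in DateRange.objects.all():
    overlapping = DateRange.objects.filter(
        id__gt=r.id,        # keeps (1, 2) but not (2, 1) or (1, 1)
        start__lt=r.end,    # the other range starts before this one ends
        end__gt=r.start,    # and ends after this one starts
    ).values_list('id', flat=True)
    pairs.extend((r.id, other) for other in overlapping)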

Pandas row-wise aggregation with multi-index

I have a pandas dataframe with three levels of row indexing, the last of which is a datetime index. There are NaN values and I am trying to fill them with the average of each row at the datetime level. How can I go about doing this?
data_df
                                       a    b    c
Level 0 Level 1 Level 2
A       123     2019-01-28 17:00:00    3    1  NaN
                2019-01-28 18:00:00    2  NaN    1
                2019-01-28 19:00:00  NaN  NaN    5
        234     2019-01-28 05:00:00    1    1    3
                2019-01-28 06:00:00  NaN  NaN  NaN
Some rows may be all NaN; in that case I want to fill the row with 0's. Other rows may have all values filled in, so imputing with the average isn't needed there.
I want the following result:
                                       a    b    c
Level 0 Level 1 Level 2
A       123     2019-01-28 17:00:00    3    1    2
                2019-01-28 18:00:00    2  1.5    1
                2019-01-28 19:00:00    5    5    5
        234     2019-01-28 05:00:00    1    1    3
                2019-01-28 06:00:00    0    0    0
Use DataFrame.mask with the mean per row, and then convert the remaining all-NaN rows with DataFrame.fillna:
df = df.mask(df.isna(), df.mean(axis=1), axis=0).fillna(0)
print(df)
                                       a    b    c
Level 0 Level 1 Level 2
A       123     2019-01-28 17:00:00  3.0  1.0  2.0
                2019-01-28 18:00:00  2.0  1.5  1.0
                2019-01-28 19:00:00  5.0  5.0  5.0
        234     2019-01-28 05:00:00  1.0  1.0  3.0
                2019-01-28 06:00:00  0.0  0.0  0.0
Another solution is to use DataFrame.fillna for the replacement, but because df.fillna(df.mean(axis=1), axis=1) is not implemented, a double transpose is necessary:
df = df.T.fillna(df.mean(axis=1)).fillna(0).T
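For readability (at the cost of speed on wide frames), a row-wise apply should give the same result; this is just a sketch assuming the same df as above:
# Fill each row's NaNs with that row's mean; rows that are entirely NaN
# stay NaN after the first step and are then filled with 0.
df = df.apply(lambda row: row.fillna(row.mean()), axis=1).fillna(0)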

Removing duplicates every 5 minutes [closed]

I am trying to remove duplicate IDs that appear within each 5-minute time frame of the dataset. The data frame looks something like this:
| ID | Date     | Time     |
|----|----------|----------|
| 12 | 2012-1-1 | 00:01:00 |
| 13 | 2012-1-1 | 00:01:30 |
| 12 | 2012-1-1 | 00:04:30 |
| 12 | 2012-1-1 | 00:05:10 |
| 12 | 2012-1-1 | 00:10:00 |
Which should become:
| ID | Date     | Time     |
|----|----------|----------|
| 12 | 2012-1-1 | 00:01:00 |
| 13 | 2012-1-1 | 00:01:30 |
| 12 | 2012-1-1 | 00:05:10 |
| 12 | 2012-1-1 | 00:10:00 |
The second occurrence of "12" should be flagged as a duplicate because it appears a second time within the 00:00:00 - 00:05:00 time frame.
I am using pandas to clean the current dataset.
Any help is appreciated!
Start by adding a DatTim column (of datetime type), built from the Date and Time columns:
df['DatTim'] = pd.to_datetime(df.Date + ' ' + df.Time)
Then, assuming that ID is an "ordinary" column (not the index), you should:

- group by the DatTim column with a 5-minute frequency,
- apply drop_duplicates to each group, with subset restricted to the ID column,
- and finally drop DatTim from the index.

Expressing the above instructions in Python:
df2 = df.groupby(pd.Grouper(key='DatTim', freq='5min'))\
    .apply(lambda grp: grp.drop_duplicates(subset='ID'))\
    .reset_index(level=0, drop=True)
If you print(df2), you will get:
ID Date Time DatTim
0 12 2012-1-1 00:01:00 2012-01-01 00:01:00
1 13 2012-1-1 00:01:30 2012-01-01 00:01:30
3 12 2012-1-1 00:05:10 2012-01-01 00:05:10
4 12 2012-1-1 00:10:00 2012-01-01 00:10:00
To "clean up", you can drop DatTim column:
df2.drop('DatTim', axis=1)
Edit
If ID is the index, a slight change is required:
df2 = df.groupby(pd.Grouper(key='DatTim', freq='5min'))\
    .apply(lambda grp: grp[~grp.index.duplicated(keep='first')])\
    .reset_index(level=0, drop=True)
And then the printed df2 is:
Date Time DatTim
ID
12 2012-1-1 00:01:00 2012-01-01 00:01:00
13 2012-1-1 00:01:30 2012-01-01 00:01:30
12 2012-1-1 00:05:10 2012-01-01 00:05:10
12 2012-1-1 00:10:00 2012-01-01 00:10:00
Of course, also in this case you can drop DatTim column.
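As an alternative sketch that avoids apply entirely (assuming, as above, that ID is an ordinary column), you could flag duplicates with a boolean mask over a 5-minute bucket:
# Floor each timestamp to its 5-minute bucket, then keep only the first
# occurrence of each (bucket, ID) pair.
bucket = df['DatTim'].dt.floor('5min')
df2 = df[~df.assign(bucket=bucket).duplicated(subset=['bucket', 'ID'])]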

Retrieve time difference since last action -- python/pandas

Let's say I have purchase records with two fields Buy and Time.
What I want is a third column with the time elapsed since the first non-buy, so it looks like:
buy| time | time difference
1 | 8:00 | NULL
0 | 9:01 | NULL
0 | 9:10 | NULL
0 | 9:21 | NULL
1 | 9:31 | 0:30
0 | 9:41 | NULL
0 | 9:42 | NULL
1 | 9:53 | 0:12
How can I achieve this? It seems to me that it's a mix of pd.groupby() and pd.shift(), but I can't seem to work that out in my head.
IIUC
df.time = pd.to_datetime(df.time)
df.loc[df.buy == 1, 'DIFF'] = (
    df.groupby(df.buy.cumsum().shift().fillna(0))
      .time.transform(lambda x: x.iloc[-1] - x.iloc[0])
)
df
Out[19]:
buy time timedifference DIFF
0 1 2018-02-26 08:00:00 NaN 00:00:00
1 0 2018-02-26 09:01:00 NaN NaT
2 0 2018-02-26 09:10:00 NaN NaT
3 0 2018-02-26 09:21:00 NaN NaT
4 1 2018-02-26 09:31:00 0:30 00:30:00
5 0 2018-02-26 09:41:00 NaN NaT
6 0 2018-02-26 09:42:00 NaN NaT
7 1 2018-02-26 09:53:00 0:12 00:12:00
# df.buy.cumsum().shift().fillna(0) creates the grouping key for groupby
# .time.transform(lambda x: x.iloc[-1] - x.iloc[0]) computes the difference within each group
# df.loc[df.buy == 1, 'DIFF'] writes the result only at the positions where buy equals 1
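An alternative sketch on the same data: mark the row right after each buy as the start of the next segment, forward-fill that start time, and subtract only on buy rows. Unlike the groupby version, the very first buy stays NaT (no preceding non-buy), matching the NULL in the question; the DIFF2 name is just for illustration.
# Time of the first row following a buy, forward-filled as the segment start.
segment_start = df.time.where(df.buy.shift().eq(1)).ffill()
df['DIFF2'] = df.time.sub(segment_start).where(df.buy.eq(1))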

compare values based on nearest datetime

I have two pandas dataframes, both with two columns: datetime and value (float). I want to subtract the value of dataframe B from the value of dataframe A based on the nearest datetime.
Example:
dataframe A:
datetime | value
01-01-2016 00:00 | 10
01-01-2016 01:00 | 12
01-01-2016 02:00 | 14
01-01-2016 03:00 | 12
01-01-2016 04:00 | 12
01-01-2016 05:00 | 16
01-01-2016 06:00 | 18
dataframe B:
datetime | value
01-01-2016 00:20 | 5
01-01-2016 00:50 | -5
01-01-2016 01:20 | 12
01-01-2016 01:50 | 30
01-01-2016 02:20 | 1
01-01-2016 02:50 | 6
01-01-2016 03:50 | 0
For the first row of A, the nearest datetime in B is also its first row, so: 10 - 5 = 5. For the fourth row of A (01-01-2016 03:00), the sixth row of B is nearest and the difference is 12 - 6 = 6.
I currently do this using a for loop:
for i, row in data.iterrows():
    # i is the index, a Timestamp
    data['h'][i] = row['h'] - baro.iloc[baro.index.get_loc(i, method='nearest')]['h']
It works fine, but would it be possible to do this faster?
New with pandas 0.19: pd.merge_asof
pd.merge_asof(dfa, dfb, 'datetime')
IIUC, you can use the reindex(..., method='nearest') method if you are on a pandas version < 0.19.0; starting from 0.19.0 it definitely makes sense to use pd.merge_asof instead, which is both more convenient and more efficient:
df1 = df1.set_index('datetime')
df2 = df2.set_index('datetime')
In [214]: df1.join(df2.reindex(df1.index, method='nearest'), rsuffix='_right')
Out[214]:
value value_right
datetime
2016-01-01 00:00:00 10 5
2016-01-01 01:00:00 12 -5
2016-01-01 02:00:00 14 30
2016-01-01 03:00:00 12 6
2016-01-01 04:00:00 12 0
2016-01-01 05:00:00 16 0
2016-01-01 06:00:00 18 0
In [224]: df1.value - df2.reindex(df1.index, method='nearest').value
Out[224]:
datetime
2016-01-01 00:00:00 5
2016-01-01 01:00:00 17
2016-01-01 02:00:00 -16
2016-01-01 03:00:00 6
2016-01-01 04:00:00 12
2016-01-01 05:00:00 16
2016-01-01 06:00:00 18
Name: value, dtype: int64
In [218]: merged = df1.join(df2.reindex(df1.index, method='nearest'), rsuffix='_right')
In [220]: merged.value.subtract(merged.value_right)
Out[220]:
datetime
2016-01-01 00:00:00 5
2016-01-01 01:00:00 17
2016-01-01 02:00:00 -16
2016-01-01 03:00:00 6
2016-01-01 04:00:00 12
2016-01-01 05:00:00 16
2016-01-01 06:00:00 18
dtype: int64
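For completeness, a hedged sketch of the merge_asof route on the original (non-indexed) frames; direction='nearest' requires pandas >= 0.20, both frames must be sorted on datetime, and the dfa/dfb names and _a/_b suffixes are just for illustration:
merged = pd.merge_asof(
    dfa.sort_values('datetime'),
    dfb.sort_values('datetime'),
    on='datetime',
    direction='nearest',   # match the closest timestamp in either direction
    suffixes=('_a', '_b'),
)
merged['diff'] = merged['value_a'] - merged['value_b']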
