Groupby with forward looking rolling maximum - python

I have data with date, time, and values and want to calculate a forward looking rolling maximum for each date:
Date        Time   Value  Output
01/01/2022  01:00    1.3     1.4
01/01/2022  02:00    1.4     1.2
01/01/2022  03:00    0.9     1.2
01/01/2022  04:00    1.2     NaN
01/02/2022  01:00    5       4
01/02/2022  02:00    4       3
01/02/2022  03:00    2       3
01/02/2022  04:00    3       NaN
I have tried this:
df = df.sort_values(by=['Date','Time'], ascending=True)
df['rollingmax'] = df.groupby(['Date'])['Value'].rolling(window=4,min_periods=0).max()
df = df.sort_values(by=['Date','Time'], ascending=False)
but that doesn't seem to work...

It looks like you want a shifted reverse rolling max:
n = 4
df['Output'] = (df[::-1]
                .groupby('Date')['Value']
                .apply(lambda g: g.rolling(n-1, min_periods=1).max().shift())
               )
Output:
   Date        Time   Value  Output
0  01/01/2022  01:00    1.3     1.4
1  01/01/2022  02:00    1.4     1.2
2  01/01/2022  03:00    0.9     1.2
3  01/01/2022  04:00    1.2     NaN
4  01/02/2022  01:00    5.0     4.0
5  01/02/2022  02:00    4.0     3.0
6  01/02/2022  03:00    2.0     3.0
7  01/02/2022  04:00    3.0     NaN
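For reference, here is a fully self-contained sketch of the same approach, with the example data rebuilt by hand and group_keys=False passed so the result stays aligned with the original index on recent pandas versions:
import pandas as pd
# Rebuild the example frame from the question.
df = pd.DataFrame({
    'Date':  ['01/01/2022'] * 4 + ['01/02/2022'] * 4,
    'Time':  ['01:00', '02:00', '03:00', '04:00'] * 2,
    'Value': [1.3, 1.4, 0.9, 1.2, 5, 4, 2, 3],
})
n = 4  # window size in rows, as in the question
# Reverse the frame so a backward rolling max becomes forward-looking,
# then shift one row so the current value is excluded from its own window.
df['Output'] = (df[::-1]
                .groupby('Date', group_keys=False)['Value']
                .apply(lambda g: g.rolling(n - 1, min_periods=1).max().shift()))
print(df)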

Related

Pandas dataframe expand rows in specific times

I have a dataframe:
df =
T1                 C1
01/01/2022 11:20    2
01/01/2022 15:40    8
01/01/2022 17:50    3
I want to expand it such that:
I will have the value at specific given times
I will have a row for each round (hourly) timestamp
So if the times are given in
l = ['01/01/2022 15:46', '01/01/2022 11:28']
I will have:
df_new =
T1                 C1
01/01/2022 11:20    2
01/01/2022 11:28    2
01/01/2022 12:00    2
01/01/2022 13:00    2
01/01/2022 14:00    2
01/01/2022 15:00    2
01/01/2022 15:40    8
01/01/2022 15:46    8
01/01/2022 16:00    8
01/01/2022 17:00    8
01/01/2022 17:50    3
You can add the extra dates and ffill:
df['T1'] = pd.to_datetime(df['T1'])
extra = pd.date_range(df['T1'].min().ceil('H'), df['T1'].max().floor('H'), freq='1h')
(pd.concat([df, pd.DataFrame({'T1': extra})])
   .sort_values(by='T1', ignore_index=True)
   .ffill()
)
Output:
T1 C1
0 2022-01-01 11:20:00 2.0
1 2022-01-01 12:00:00 2.0
2 2022-01-01 13:00:00 2.0
3 2022-01-01 14:00:00 2.0
4 2022-01-01 15:00:00 2.0
5 2022-01-01 15:40:00 8.0
6 2022-01-01 16:00:00 8.0
7 2022-01-01 17:00:00 8.0
8 2022-01-01 17:50:00 3.0
Here is a way to do what your question asks that will ensure:
there are no duplicate times in T1 in the output, even if any of the times in the original are round hours
the results will be of the same type as the values in the C1 column of the input (in this case, integers not floats).
hours = pd.date_range(df.T1.min().ceil("H"), df.T1.max().floor("H"), freq="60min")
idx_new = df.set_index('T1').join(pd.DataFrame(index=hours), how='outer', sort=True).index
df_new = df.set_index('T1').reindex(index = idx_new, method='ffill').reset_index().rename(columns={'index':'T1'})
Output:
T1 C1
0 2022-01-01 11:20:00 2
1 2022-01-01 12:00:00 2
2 2022-01-01 13:00:00 2
3 2022-01-01 14:00:00 2
4 2022-01-01 15:00:00 2
5 2022-01-01 15:40:00 8
6 2022-01-01 16:00:00 8
7 2022-01-01 17:00:00 8
8 2022-01-01 17:50:00 3
Example of how round dates in the input are handled:
df = pd.DataFrame({
    # 'T1': pd.to_datetime(['01/01/2022 11:20', '01/01/2022 15:40', '01/01/2022 17:50']),
    'T1': pd.to_datetime(['01/01/2022 11:00', '01/01/2022 15:40', '01/01/2022 17:00']),
    'C1': [2, 8, 3]})
Input:
T1 C1
0 2022-01-01 11:00:00 2
1 2022-01-01 15:40:00 8
2 2022-01-01 17:00:00 3
Output (no duplicates):
T1 C1
0 2022-01-01 11:00:00 2
1 2022-01-01 12:00:00 2
2 2022-01-01 13:00:00 2
3 2022-01-01 14:00:00 2
4 2022-01-01 15:00:00 2
5 2022-01-01 15:40:00 8
6 2022-01-01 16:00:00 8
7 2022-01-01 17:00:00 3
Another possible solution, based on pandas.DataFrame.resample:
df['T1'] = pd.to_datetime(df['T1'])
(pd.concat([df, df.set_index('T1').resample('1H').asfreq().reset_index()])
.sort_values('T1').ffill().dropna().reset_index(drop=True))
Output:
T1 C1
0 2022-01-01 11:20:00 2.0
1 2022-01-01 12:00:00 2.0
2 2022-01-01 13:00:00 2.0
3 2022-01-01 14:00:00 2.0
4 2022-01-01 15:00:00 2.0
5 2022-01-01 15:40:00 8.0
6 2022-01-01 16:00:00 8.0
7 2022-01-01 17:00:00 8.0
8 2022-01-01 17:50:00 3.0
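Note that the desired df_new also contains the specific times from l (11:28 and 15:46), which the snippets above leave out; a minimal sketch extending the concat-and-ffill idea to include them (example data rebuilt by hand, assuming those times should simply carry the preceding value, as in the desired output):
import pandas as pd
df = pd.DataFrame({'T1': ['01/01/2022 11:20', '01/01/2022 15:40', '01/01/2022 17:50'],
                   'C1': [2, 8, 3]})
l = ['01/01/2022 15:46', '01/01/2022 11:28']
df['T1'] = pd.to_datetime(df['T1'])
# Hourly grid between the first and last timestamps, plus the requested times.
grid = pd.date_range(df['T1'].min().ceil('H'), df['T1'].max().floor('H'), freq='1h')
extra = pd.DataFrame({'T1': grid.union(pd.to_datetime(l))})
# Keep the original rows first so drop_duplicates prefers them, then sort and ffill.
df_new = (pd.concat([df, extra])
            .drop_duplicates(subset='T1')
            .sort_values('T1', ignore_index=True)
            .ffill())
print(df_new)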

Update multiple column values based on condition in python

I have a dataframe like this,
ID 00:00 01:00 02:00 ... 23:00 avg_value
22 4.7 5.3 6 ... 8 5.5
37 0 9.2 4.5 ... 11.2 9.2
4469 2 9.8 11 ... 2 6.4
Can I use np.where to apply conditions on multiple columns at once?
I want to update the values from 00:00 to 23:00 to 0 or 1: if the value at that time of day is greater than avg_value, change it to 1, otherwise to 0.
I know how to apply this method to one single column.
np.where(df['00:00']>df['avg_value'],1,0)
Can I change it to multiple columns?
Output will be like,
ID 00:00 01:00 02:00 ... 23:00 avg_value
22 0 1 1 ... 1 5.5
37 0 0 0 ... 1 9.2
4469 0 1 1 ... 0 6.4
Select all columns except the last with DataFrame.iloc, compare with DataFrame.gt, cast to integers, and finally add the avg_value column back with DataFrame.join:
df = df.iloc[:, :-1].gt(df['avg_value'], axis=0).astype(int).join(df['avg_value'])
print (df)
00:00 01:00 02:00 23:00 avg_value
ID
22 0 0 1 1 5.5
37 0 0 0 1 9.2
4469 0 1 1 0 6.4
Or use DataFrame.pop to extract the column:
s = df.pop('avg_value')
df = df.gt(s, axis=0).astype(int).join(s)
print (df)
00:00 01:00 02:00 23:00 avg_value
ID
22 0 0 1 1 5.5
37 0 0 0 1 9.2
4469 0 1 1 0 6.4
This is because if you assign back to the same columns, the integers are converted to floats (it is a bug):
df.iloc[:, :-1] = df.iloc[:, :-1].gt(df['avg_value'], axis=0).astype(int)
print (df)
00:00 01:00 02:00 23:00 avg_value
ID
22 0.0 0.0 1.0 1.0 5.5
37 0.0 0.0 0.0 1.0 9.2
4469 0.0 1.0 1.0 0.0 6.4
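To answer the literal np.where question: yes, the comparison can be applied to all time-of-day columns at once, because NumPy broadcasts the (n_rows, 1) avg_value column against the whole block. A sketch, with a small hand-built frame standing in for the real data (only a few hourly columns shown):
import numpy as np
import pandas as pd
df = pd.DataFrame({'00:00': [4.7, 0, 2], '01:00': [5.3, 9.2, 9.8],
                   '02:00': [6, 4.5, 11], '23:00': [8, 11.2, 2],
                   'avg_value': [5.5, 9.2, 6.4]},
                  index=pd.Index([22, 37, 4469], name='ID'))
hour_cols = df.columns[:-1]  # every column except avg_value
# Broadcast the (n_rows, 1) avg_value array against the (n_rows, n_hours) block.
flags = np.where(df[hour_cols].to_numpy() > df[['avg_value']].to_numpy(), 1, 0)
# Build a new frame so the integer dtype is kept, then add avg_value back.
out = pd.DataFrame(flags, index=df.index, columns=hour_cols).join(df['avg_value'])
print(out)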

Create a dateTime for every ID in a dataframe

I have a data frame (named df) from 2016/1/1 00:00 until 2018/11/25 23:00 with a timestamp every hour, an object_id and a value. The data set only contains rows where an object_id has a value.
timestampHour object_id value
2016/1/1 00:00 1 2
2016/1/1 00:00 3 1
2016/1/1 01:00 1 1
2016/1/1 01:00 2 3
2016/1/1 02:00 2 3
2016/1/1 02:00 3 2
I would like to get a dataframe showing all object_ids for every hour, with a null value if there is no value.
timestampHour object_id value
2016/1/1 00:00 1 2
2016/1/1 00:00 2 null
2016/1/1 00:00 3 1
2016/1/1 01:00 1 1
2016/1/1 01:00 2 3
2016/1/1 01:00 3 null
2016/1/1 02:00 1 null
2016/1/1 02:00 2 3
2016/1/1 02:00 3 2
I have created the datetime from the timestamps and rounded it to hours with the following code:
df["timestamp"] = pd.to_datetime(df["result_timestamp"])
df['timestampHour'] = df['timestamp'].dt.round('60min')
(I don't know if there are better options, but I have been trying to create timestampHour rows until there are 12 per hour (I have 12 unique object_ids) and fill those newly created rows with the object_ids that are unused for that hour. But I have not been able to create the empty rows with that condition.)
I am fairly new to programming, and searching other posts has not given me a clue how to get closer to solving this problem.
Using pivot_table and unstack:
df.pivot_table(
    index='object_id', columns='timestampHour', values='value'
).unstack().rename('value').reset_index()
timestampHour object_id value
0 2016/1/1 00:00 1 2.0
1 2016/1/1 00:00 2 NaN
2 2016/1/1 00:00 3 1.0
3 2016/1/1 01:00 1 1.0
4 2016/1/1 01:00 2 3.0
5 2016/1/1 01:00 3 NaN
6 2016/1/1 02:00 1 NaN
7 2016/1/1 02:00 2 3.0
8 2016/1/1 02:00 3 2.0
To see why this works, it helps to look at the intermediate pivot_table:
timestampHour 2016/1/1 00:00 2016/1/1 01:00 2016/1/1 02:00
object_id
1 2.0 1.0 NaN
2 NaN 3.0 3.0
3 1.0 NaN 2.0
Where a value is not found for a combination of object_id and timestampHour, a NaN is added to the table. When you use unstack, these NaNs are kept, giving you the desired result with missing values represented.
This can also be done with .reindex and a Cartesian product of the two levels. This question goes into detail on ways to optimize the performance of the product for large datasets.
import pandas as pd
id_cols = ['timestampHour', 'object_id']
idx = pd.MultiIndex.from_product(df[id_cols].apply(pd.Series.unique).values.T, names=id_cols)
df.set_index(id_cols).reindex(idx).reset_index()
Output:
timestampHour object_id value
0 2016/1/1 00:00 1 2.0
1 2016/1/1 00:00 3 1.0
2 2016/1/1 00:00 2 NaN
3 2016/1/1 01:00 1 1.0
4 2016/1/1 01:00 3 NaN
5 2016/1/1 01:00 2 3.0
6 2016/1/1 02:00 1 NaN
7 2016/1/1 02:00 3 2.0
8 2016/1/1 02:00 2 3.0
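For completeness, a self-contained version of the reindex approach, sorting the unique object_ids so the rows come out in the 1, 2, 3 order of the desired output (example data rebuilt by hand):
import pandas as pd
df = pd.DataFrame({
    'timestampHour': ['2016/1/1 00:00', '2016/1/1 00:00', '2016/1/1 01:00',
                      '2016/1/1 01:00', '2016/1/1 02:00', '2016/1/1 02:00'],
    'object_id': [1, 3, 1, 2, 2, 3],
    'value': [2, 1, 1, 3, 3, 2],
})
id_cols = ['timestampHour', 'object_id']
# Cartesian product of every hour with every (sorted) object_id.
idx = pd.MultiIndex.from_product(
    [df['timestampHour'].unique(), sorted(df['object_id'].unique())], names=id_cols)
out = df.set_index(id_cols).reindex(idx).reset_index()
print(out)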

Making matching algorithm between two data frames more efficient

I have two data frames eg.
Shorter time frame ( 4 hourly )
Time Data_4h
1/1/01 00:00 1.1
1/1/01 06:00 1.2
1/1/01 12:00 1.3
1/1/01 18:00 1.1
2/1/01 00:00 1.1
2/1/01 06:00 1.2
2/1/01 12:00 1.3
2/1/01 18:00 1.1
3/1/01 00:00 1.1
3/1/01 06:00 1.2
3/1/01 12:00 1.3
3/1/01 18:00 1.1
Longer time frame ( 1 day )
Time Data_1d
1/1/01 00:00 1.1
2/1/01 00:00 1.6
3/1/01 00:00 1.0
I want to label the shorter time frame data with the data from the longer time frame, but from n-1 days (the previous day), leaving NaN where the n-1 day doesn't exist.
For example,
Final merged data combining 4h and 1d
Time Data_4h Data_1d
1/1/01 00:00 1.1 NaN
1/1/01 06:00 1.2 NaN
1/1/01 12:00 1.3 NaN
1/1/01 18:00 1.1 NaN
2/1/01 00:00 1.1 1.1
2/1/01 06:00 1.2 1.1
2/1/01 12:00 1.3 1.1
2/1/01 18:00 1.1 1.1
3/1/01 00:00 1.1 1.6
3/1/01 06:00 1.2 1.6
3/1/01 12:00 1.3 1.6
3/1/01 18:00 1.1 1.6
So for 1/1 it tried to find 31/12 but couldn't find it, so it was labelled as NaN. For 2/1, it searched for 1/1 and labelled those entries with 1.1, the value for 1/1. For 3/1, it searched for 2/1 and labelled those entries with 1.6, the value for 2/1.
It is important to note that the time frame data may have large gaps, so I can't access the rows in the larger time frame directly.
What is the best way to do this?
Currently I am iterating through all the rows of the smaller timeframe and then searching for the larger time frame date using a filter like:
large_tf_data[(large_tf_data.index <= target_timestamp)][0]
Where target_timestamp is calculated on each row in the smaller time frame data frame.
This is extremely slow! Any suggestions on how to speed it up?
First, take care of the dates:
dayfirstme = lambda d: pd.to_datetime(d.Time, dayfirst=True)
df = df.assign(Time=dayfirstme)
df2 = df2.assign(Time=dayfirstme)
Then convert df2 into a lookup Series by shifting its dates forward one day and indexing Data_1d by Time:
d2 = df2.assign(Time=lambda d: d.Time + pd.Timedelta(1, 'D')).set_index('Time').Data_1d
Finally, map each row's date onto that shifted Series and join it back:
df.join(df.Time.dt.date.map(d2).rename(d2.name))
Time Data_4h Data_1d
0 2001-01-01 00:00:00 1.1 NaN
1 2001-01-01 06:00:00 1.2 NaN
2 2001-01-01 12:00:00 1.3 NaN
3 2001-01-01 18:00:00 1.1 NaN
4 2001-01-02 00:00:00 1.1 1.1
5 2001-01-02 06:00:00 1.2 1.1
6 2001-01-02 12:00:00 1.3 1.1
7 2001-01-02 18:00:00 1.1 1.1
8 2001-01-03 00:00:00 1.1 1.6
9 2001-01-03 06:00:00 1.2 1.6
10 2001-01-03 12:00:00 1.3 1.6
11 2001-01-03 18:00:00 1.1 1.6
I'm sure there are other ways but I didn't want to think about this anymore.
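As an aside, the row-by-row index <= target_timestamp filter from the question has a vectorized counterpart in pd.merge_asof, which picks the most recent earlier row for every timestamp in one pass. A sketch under the same assumptions (day-first dates, frames rebuilt by hand):
import pandas as pd
df_4h = pd.DataFrame({
    'Time': pd.to_datetime(['1/1/01 00:00', '1/1/01 06:00', '2/1/01 00:00',
                            '2/1/01 06:00', '3/1/01 12:00'], dayfirst=True),
    'Data_4h': [1.1, 1.2, 1.1, 1.2, 1.3]})
df_1d = pd.DataFrame({
    'Time': pd.to_datetime(['1/1/01 00:00', '2/1/01 00:00', '3/1/01 00:00'], dayfirst=True),
    'Data_1d': [1.1, 1.6, 1.0]})
# Shift the daily data forward one day so each 4h row picks up the previous day's value,
# then take the most recent shifted daily row at or before each 4h timestamp.
daily = df_1d.assign(Time=df_1d['Time'] + pd.Timedelta(days=1)).sort_values('Time')
out = pd.merge_asof(df_4h.sort_values('Time'), daily, on='Time',
                    direction='backward',
                    tolerance=pd.Timedelta('1D'))  # gaps larger than a day stay NaN
print(out)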

Match dates in pandas and add duplicate in new column

I'm searching for an elegant way to match datetimes within a pandas DataFrame.
The original data looks like this:
point_id datetime value1 value2
1 2017-05-2017 00:00 1 1.1
2 2017-05-2017 00:00 2 2.2
3 2017-05-2017 00:00 3 3.3
2 2017-05-2017 01:00 4 4.4
What the result should look like:
datetime value value_cal value2 value_calc2 value3 value_calc3
2017-05-2017 00:00 1 1.1 2 2.2 3 3.3
2017-05-2017 01:00 Nan Nan 4 4.4 Nan NaN
In the end there should be one row for each datetime, with missing data points declared as such.
In [180]: x = (df.drop(columns='point_id')
...: .rename(columns={'value1':'value','value2':'value_cal'})
...: .assign(n=df.groupby('datetime')['value1'].cumcount()+1)
...: .pivot_table(index='datetime', columns='n', values=['value','value_cal'])
...: .sort_index(axis=1, level=1)
...: )
...:
In [181]: x
Out[181]:
value value_cal value value_cal value value_cal
n 1 1 2 2 3 3
datetime
2017-05-2017 00:00 1.0 1.1 2.0 2.2 3.0 3.3
2017-05-2017 01:00 4.0 4.4 NaN NaN NaN NaN
Now we can "fix" the column names:
In [182]: x.columns = ['{0[0]}{0[1]}'.format(c) for c in x.columns]
In [183]: x
Out[183]:
value1 value_cal1 value2 value_cal2 value3 value_cal3
datetime
2017-05-2017 00:00 1.0 1.1 2.0 2.2 3.0 3.3
2017-05-2017 01:00 4.0 4.4 NaN NaN NaN NaN
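As a minor alternative to the format-string comprehension in In [182], the same flattening can be done with Index.map on the original MultiIndex columns:
# Equivalent to the list comprehension above; c is a ('value', 1)-style tuple.
x.columns = x.columns.map(lambda c: f'{c[0]}{c[1]}')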
