Making a matching algorithm between two data frames more efficient - python

I have two data frames, e.g.
Shorter time frame (4-hourly):
Time Data_4h
1/1/01 00:00 1.1
1/1/01 06:00 1.2
1/1/01 12:00 1.3
1/1/01 18:00 1.1
2/1/01 00:00 1.1
2/1/01 06:00 1.2
2/1/01 12:00 1.3
2/1/01 18:00 1.1
3/1/01 00:00 1.1
3/1/01 06:00 1.2
3/1/01 12:00 1.3
3/1/01 18:00 1.1
Longer time frame (1 day):
Time Data_1d
1/1/01 00:00 1.1
2/1/01 00:00 1.6
3/1/01 00:00 1.0
I want to label the shorter time frame data with the value from the longer time frame for the previous day (day n-1), leaving NaN where day n-1 doesn't exist.
For example,
Final merged data combining 4h and 1d
Time Data_4h Data_1d
1/1/01 00:00 1.1 NaN
1/1/01 06:00 1.2 NaN
1/1/01 12:00 1.3 NaN
1/1/01 18:00 1.1 NaN
2/1/01 00:00 1.1 1.1
2/1/01 06:00 1.2 1.1
2/1/01 12:00 1.3 1.1
2/1/01 18:00 1.1 1.1
3/1/01 00:00 1.1 1.6
3/1/01 06:00 1.2 1.6
3/1/01 12:00 1.3 1.6
3/1/01 18:00 1.1 1.6
So for 1/1 it tried to find 31/12 but couldn't find it, so it was labelled NaN. For 2/1, it searched for 1/1 and labelled those entries with 1.1, the value for 1/1. For 3/1, it searched for 2/1 and labelled those entries with 1.6, the value for 2/1.
It is important to note that both time frames may have large gaps, so I can't index into the rows of the larger time frame directly.
What is the best way to do this?
Currently I am iterating through all the rows of the smaller time frame and then searching for the larger time frame date using a filter like:
large_tf_data[(large_tf_data.index <= target_timestamp)][0]
where target_timestamp is calculated for each row of the smaller time frame data frame.
This is extremely slow! Any suggestions on how to speed it up?

First, take care of the dates:
# parse the Time strings as day-first datetimes in both frames
dayfirstme = lambda d: pd.to_datetime(d.Time, dayfirst=True)
df = df.assign(Time=dayfirstme)
df2 = df2.assign(Time=dayfirstme)
Then convert df2 to something useful:
# stamp each daily value on the following day and index the series by that timestamp
d2 = df2.assign(Time=lambda d: d.Time + pd.Timedelta(1, 'D')).set_index('Time').Data_1d
Apply magic:
# map each 4h row's calendar date into the shifted daily series; missing dates become NaN
df.join(df.Time.dt.date.map(d2).rename(d2.name))
Time Data_4h Data_1d
0 2001-01-01 00:00:00 1.1 NaN
1 2001-01-01 06:00:00 1.2 NaN
2 2001-01-01 12:00:00 1.3 NaN
3 2001-01-01 18:00:00 1.1 NaN
4 2001-01-02 00:00:00 1.1 1.1
5 2001-01-02 06:00:00 1.2 1.1
6 2001-01-02 12:00:00 1.3 1.1
7 2001-01-02 18:00:00 1.1 1.1
8 2001-01-03 00:00:00 1.1 1.6
9 2001-01-03 06:00:00 1.2 1.6
10 2001-01-03 12:00:00 1.3 1.6
11 2001-01-03 18:00:00 1.1 1.6
I'm sure there are other ways but I didn't want to think about this anymore.
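For completeness, here is a rough merge_asof-based sketch of the same idea (assuming Time has already been parsed to datetimes as above; left, right and merged are just illustrative names). The tolerance stops values being carried across large gaps, and works here because the intraday stamps never run more than 23 hours past midnight:
left = df.sort_values('Time')
# stamp each daily value on the following day
right = (df2.assign(Time=lambda d: d.Time + pd.Timedelta(1, 'D'))
            .sort_values('Time'))
# backward search: each 4h row picks the most recent shifted daily stamp,
# but only if it is less than a day old; otherwise it stays NaN
merged = pd.merge_asof(left, right, on='Time',
                       direction='backward',
                       tolerance=pd.Timedelta('23H'))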

Related

Groupby with forward looking rolling maximum

I have data with date, time, and values and want to calculate a forward looking rolling maximum for each date:
Date Time Value Output
01/01/2022 01:00 1.3 1.4
01/01/2022 02:00 1.4 1.2
01/01/2022 03:00 0.9 1.2
01/01/2022 04:00 1.2 NaN
01/02/2022 01:00 5 4
01/02/2022 02:00 4 3
01/02/2022 03:00 2 3
01/02/2022 04:00 3 NaN
I have tried this:
df = df.sort_values(by=['Date','Time'], ascending=True)
df['rollingmax'] = df.groupby(['Date'])['Value'].rolling(window=4,min_periods=0).max()
df = df.sort_values(by=['Date','Time'], ascending=False)
but that doesn't seem to work...
It looks like you want a shifted reverse rolling max:
n = 4
df['Output'] = (df[::-1]
   .groupby('Date')['Value']
   .apply(lambda g: g.rolling(n-1, min_periods=1).max().shift())
)
Output:
Date Time Value Output
0 01/01/2022 01:00 1.3 1.4
1 01/01/2022 02:00 1.4 1.2
2 01/01/2022 03:00 0.9 1.2
3 01/01/2022 04:00 1.2 NaN
4 01/02/2022 01:00 5.0 4.0
5 01/02/2022 02:00 4.0 3.0
6 01/02/2022 03:00 2.0 3.0
7 01/02/2022 04:00 3.0 NaN
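Essentially the same computation can be written with transform per group, which keeps the original row order (a sketch, assuming the default integer index shown above; the hard-coded 3 is the n-1 window from the answer):
df['Output'] = (df.groupby('Date')['Value']
                  .transform(lambda s: s[::-1].rolling(3, min_periods=1).max().shift()[::-1]))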

Pandas get average from second dataframe of the last N rows within time interval

I have 2 DFs:
DF1
Name      Timestamp            Value
object 1  2021-11-01 10:00:00  1.0
object 1  2021-11-01 11:00:00  1.5
object 2  2021-11-01 10:30:00  1.7
DF2
Name      Timestamp            feature
object 1  2021-11-01 08:00:00  0.9
object 1  2021-11-01 09:00:00  1.1
object 1  2021-11-01 09:30:00  1.3
object 1  2021-11-01 12:00:00  1.0
object 2  2021-11-01 10:00:00  1.3
object 2  2021-11-01 11:30:00  1.9
Per row from DF1, I would like to get the rolling average of the last N rows from DF2 that have the same Name and whose Timestamp is smaller than that of the row I am considering (say N=2 in this example).
Example output should look like:
Name      Timestamp            Value  AVG of feature
object 1  2021-11-01 10:00:00  1.0    (1.1 + 1.3)/2
object 1  2021-11-01 11:00:00  1.5    (1.1 + 1.3)/2
object 2  2021-11-01 10:30:00  1.7    1.3
Ideally I would even be able to do weighted averages depending on the time differences. For example:
Name      Timestamp            Value  AVG of feature
object 1  2021-11-01 10:00:00  1.0    (60min * 1.1 + 30min * 1.3)/(2 * 90min)
object 1  2021-11-01 11:00:00  1.5    (120min * 1.1 + 90min * 1.3)/(2 * 210min)
object 2  2021-11-01 10:30:00  1.7    1.3
IMPORTANT: my issue is that doing a DF1.apply takes really long, as I have big dataframes (DF1 is around twice as big as DF2). I believe the main bottleneck is finding the largest timestamp in DF2 that is smaller than the current row of DF1.
You need to use pandas.merge_asof to align the timestamps:
df1.join(pd
   .merge_asof(df2.sort_values(by='Timestamp'),
               df1.sort_values(by='Timestamp')
                  .reset_index()
                  .drop(columns='Value')
                  .rename(columns={'Timestamp': 'TS'}),
               by='Name', left_on='Timestamp', right_on='TS',
               direction='forward')
   .assign(weight=lambda d: d['TS'].sub(d['Timestamp']).dt.total_seconds(),
           feature=lambda d: d['feature'].mul(d['weight'])
           )
   .groupby('index').apply(lambda g: g['feature'].sum()/g['weight'].sum()/len(g))
   .rename('AVG of (feature)')
)
output:
Name Timestamp Value AVG of (feature)
0.0 object 1 2021-11-01 10:00:00 1.0 0.583333
1.0 object 1 2021-11-01 11:00:00 1.5 NaN
2.0 object 2 2021-11-01 10:30:00 1.7 1.300000
NB: if you want to propagate the previous value of the AVG, you can use ffill per group.
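A minimal sketch of that, assuming the joined result above has been stored in a variable (here called out):
# forward-fill the per-row average within each Name, in time order
out['AVG of (feature)'] = (out.sort_values(['Name', 'Timestamp'])
                              .groupby('Name')['AVG of (feature)']
                              .ffill())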

Pandas Dataframe Pivot and reindex n timeseries

I have a pandas dataframe containing n time series in the same Datetime column, each one associated with a different Id and a corresponding value. I would like to pivot the table and reindex to the nearest timestamp. Notice that there can be cases where a timestamp is missing, as for Id 3; in that case the value should become NaN.
Datetime Id Value
5-26-17 8:00 1 2.3
5-26-17 8:30 1 4.5
5-26-17 9:00 1 7
5-26-17 9:30 1 8.1
5-26-17 10:00 1 7.9
5-26-17 10:30 1 3.4
5-26-17 11:00 1 2.1
5-26-17 11:30 1 1.8
5-26-17 12:00 1 0.4
5-26-17 8:02 2 2.6
5-26-17 8:32 2 4.8
5-26-17 9:02 2 7.3
5-26-17 9:32 2 8.4
5-26-17 10:02 2 8.2
5-26-17 10:32 2 3.7
5-26-17 11:02 2 2.4
5-26-17 11:32 2 2.1
5-26-17 12:02 2 0.7
5-26-17 8:30 3 4.5
5-26-17 9:00 3 7
5-26-17 9:30 3 8.1
5-26-17 10:00 3 7.9
5-26-17 10:30 3 3.4
5-26-17 11:00 3 2.1
5-26-17 11:30 3 1.8
5-26-17 12:00 3 0.4
Expected results:
Datetime Id-1 Id-2 Id-3
5-26-17 8:00 2.3 2.6 NaN
5-26-17 8:30 4.5 4.8 4.5
5-26-17 9:00 7 7.3 7
5-26-17 9:30 8.1 8.4 8.1
5-26-17 10:00 7.9 8.2 7.9
5-26-17 10:30 3.4 3.7 3.4
5-26-17 11:00 2.1 2.4 2.1
5-26-17 11:30 1.8 2.1 1.8
5-26-17 12:00 0.4 0.7 0.4
How would you do this?
I believe you need to convert the column to datetimes and floor them to 30 minutes with dt.floor, then pivot and add_prefix:
df['Datetime'] = pd.to_datetime(df['Datetime']).dt.floor('30T')
df = df.pivot(index='Datetime', columns='Id', values='Value').add_prefix('Id-')
print (df)
Id Id-1 Id-2 Id-3
Datetime
2017-05-26 08:00:00 2.3 2.6 NaN
2017-05-26 08:30:00 4.5 4.8 4.5
2017-05-26 09:00:00 7.0 7.3 7.0
2017-05-26 09:30:00 8.1 8.4 8.1
2017-05-26 10:00:00 7.9 8.2 7.9
2017-05-26 10:30:00 3.4 3.7 3.4
2017-05-26 11:00:00 2.1 2.4 2.1
2017-05-26 11:30:00 1.8 2.1 1.8
2017-05-26 12:00:00 0.4 0.7 0.4
Another solution is to use resample with mean:
df['Datetime'] = pd.to_datetime(df['Datetime'])
df = (df.set_index('Datetime')
        .groupby('Id')
        .resample('30T')['Value']
        .mean().unstack(0)
        .add_prefix('Id-'))
print (df)
Id Id-1 Id-2 Id-3
Datetime
2017-05-26 08:00:00 2.3 2.6 NaN
2017-05-26 08:30:00 4.5 4.8 4.5
2017-05-26 09:00:00 7.0 7.3 7.0
2017-05-26 09:30:00 8.1 8.4 8.1
2017-05-26 10:00:00 7.9 8.2 7.9
2017-05-26 10:30:00 3.4 3.7 3.4
2017-05-26 11:00:00 2.1 2.4 2.1
2017-05-26 11:30:00 1.8 2.1 1.8
2017-05-26 12:00:00 0.4 0.7 0.4
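If 'nearest' rather than flooring is the real requirement (say an Id logged at 8:58), a small variation on the first solution is to round instead of floor; a sketch:
df['Datetime'] = pd.to_datetime(df['Datetime']).dt.round('30T')
df = df.pivot(index='Datetime', columns='Id', values='Value').add_prefix('Id-')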

Match dates in pandas and add duplicate in new column

I'm searching for an elegant way to match datetimes within a pandas DataFrame.
The original data looks like this:
point_id datetime value1 value2
1 2017-05-2017 00:00 1 1.1
2 2017-05-2017 00:00 2 2.2
3 2017-05-2017 00:00 3 3.3
2 2017-05-2017 01:00 4 4.4
what the result should look like:
datetime value value_cal value2 value_calc2 value3 value_calc3
2017-05-2017 00:00 1 1.1 2 2.2 3 3.3
2017-05-2017 01:00 Nan Nan 4 4.4 Nan NaN
In the end there should be one row for each datetime, with missing data points declared as such.
In [180]: x = (df.drop(columns='point_id')
     ...:        .rename(columns={'value1':'value','value2':'value_cal'})
     ...:        .assign(n=df.groupby('datetime')['value1'].cumcount()+1)
     ...:        .pivot_table(index='datetime', columns='n', values=['value','value_cal'])
     ...:        .sort_index(axis=1, level=1)
     ...:       )
     ...:
In [181]: x
Out[181]:
value value_cal value value_cal value value_cal
n 1 1 2 2 3 3
datetime
2017-05-2017 00:00 1.0 1.1 2.0 2.2 3.0 3.3
2017-05-2017 01:00 4.0 4.4 NaN NaN NaN NaN
now we can "fix" column names
In [182]: x.columns = ['{0[0]}{0[1]}'.format(c) for c in x.columns]
In [183]: x
Out[183]:
value1 value_cal1 value2 value_cal2 value3 value_cal3
datetime
2017-05-2017 00:00 1.0 1.1 2.0 2.2 3.0 3.3
2017-05-2017 01:00 4.0 4.4 NaN NaN NaN NaN
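If the numeric suffix should actually follow point_id rather than the order of appearance within each datetime, a pivot on point_id is a possible variant (a sketch; the flattened column names are illustrative):
x = df.pivot(index='datetime', columns='point_id', values=['value1', 'value2'])
# collapse the (value name, point_id) MultiIndex into flat column names
x.columns = [f'{name}_{pid}' for name, pid in x.columns]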

Pandas - Rolling slope calculation

How can I calculate the slope of each column's rolling(window=60) values, stepped by 5?
I'd like to calculate a value every 5 minutes, and I don't need results for every record.
Here's a sample dataframe and the expected result:
df
Time A ... N
2016-01-01 00:00 1.2 ... 4.2
2016-01-01 00:01 1.2 ... 4.0
2016-01-01 00:02 1.2 ... 4.5
2016-01-01 00:03 1.5 ... 4.2
2016-01-01 00:04 1.1 ... 4.6
2016-01-01 00:05 1.6 ... 4.1
2016-01-01 00:06 1.7 ... 4.3
2016-01-01 00:07 1.8 ... 4.5
2016-01-01 00:08 1.1 ... 4.1
2016-01-01 00:09 1.5 ... 4.1
2016-01-01 00:10 1.6 ... 4.1
....
result
Time A ... N
2016-01-01 00:04 xxx ... xxx
2016-01-01 00:09 xxx ... xxx
2016-01-01 00:14 xxx ... xxx
...
Can the df.rolling function be applied to this problem?
It's fine if NaN is in the window, meaning the subset can contain fewer than 60 rows.
It seems that what you want is rolling with a specific step size.
However, according to the pandas documentation, a step size was not supported in rolling at the time this was written (newer pandas versions have since added a step argument to rolling).
If the data size is not too large, just perform rolling on all the data and select the results you need using indexing.
Here's a sample dataset. For simplicity, the time column is represented using integers.
data = pd.DataFrame(np.random.rand(500, 1) * 10, columns=['a'])
a
0 8.714074
1 0.985467
2 9.101299
3 4.598044
4 4.193559
.. ...
495 9.736984
496 2.447377
497 5.209420
498 2.698441
499 3.438271
Then, roll and calculate slopes,
def calc_slope(x):
    slope = np.polyfit(range(len(x)), x, 1)[0]
    return slope

# set min_periods=2 to allow subsets less than 60.
# use [4::5] to select the results you need.
result = data.rolling(60, min_periods=2).apply(calc_slope)[4::5]
The result will be,
a
4 -0.542845
9 0.084953
14 0.155297
19 -0.048813
24 -0.011947
.. ...
479 -0.004792
484 -0.003714
489 0.022448
494 0.037301
499 0.027189
Or, you can refer to this post. The first answer provides a numpy way to achieve this:
step size in pandas.DataFrame.rolling
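As a rough sketch of that numpy route (full windows only; it assumes no NaNs and numpy >= 1.20 for sliding_window_view):
import numpy as np

w, step = 60, 5
a = data['a'].to_numpy()
# one row per length-60 window, then keep every 5th window
windows = np.lib.stride_tricks.sliding_window_view(a, w)[::step]
x = np.arange(w)
# least-squares slope per window: cov(x, y) / var(x)
slopes = ((windows * x).mean(axis=1) - windows.mean(axis=1) * x.mean()) / x.var()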
Try this:
windows = df["A"].rolling(60)
df["out"] = windows.apply(lambda x: np.polyfit(range(len(x)), x, 1)[0], raw=True).values
You could use pandas resample. Note that to use this, you need an index with a time value.
df.index = pd.to_datetime(df.Time)
print(df)
result = df.resample('5Min').bfill()
print(result)
Time A N
Time
2016-01-01 00:00:00 2016-01-01 00:00 1.2 4.2
2016-01-01 00:01:00 2016-01-01 00:01 1.2 4.0
2016-01-01 00:02:00 2016-01-01 00:02 1.2 4.5
2016-01-01 00:03:00 2016-01-01 00:03 1.5 4.2
2016-01-01 00:04:00 2016-01-01 00:04 1.1 4.6
2016-01-01 00:05:00 2016-01-01 00:05 1.6 4.1
2016-01-01 00:06:00 2016-01-01 00:06 1.7 4.3
2016-01-01 00:07:00 2016-01-01 00:07 1.8 4.5
2016-01-01 00:08:00 2016-01-01 00:08 1.1 4.1
2016-01-01 00:09:00 2016-01-01 00:09 1.5 4.1
2016-01-01 00:10:00 2016-01-01 00:10 1.6 4.1
2016-01-01 00:15:00 2016-01-01 00:15 1.6 4.1
Output:
Time A N
Time
2016-01-01 00:00:00 2016-01-01 00:00 1.2 4.2
2016-01-01 00:05:00 2016-01-01 00:05 1.6 4.1
2016-01-01 00:10:00 2016-01-01 00:10 1.6 4.1
2016-01-01 00:15:00 2016-01-01 00:15 1.6 4.1
I use:
df['slope_I'] = df['I'].rolling('600s').apply(lambda x: (x.iloc[-1] - x.iloc[0]) / 600)
where the slope is in units of the column's value per second.
The first 600 s of the result will probably be empty; you should fill them with zeros or with the mean.
The first number in the slope column will be the slope of the line that goes from the first row inside the window to the last, and so on during the rolling.
For other answer seekers, here is another solution where the time intervals do not need to be of equal length.
df.A.diff(60)/df.Time.diff(60).dt.total_seconds()
This line of code takes the difference between the current row and the row sixty rows back, and divides it by the time difference between the same rows.
If you only want every fifth record, the next line should work:
(df.A.diff(60)/df.Time.diff(60).dt.total_seconds())[4::5]
Note: every row is still calculated; only the 5-stepped series is returned.
doc pandas diff: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.diff.html
