how to implement this difference operation efficiently? - python

I would like to build a dataframe from an existing one, where each value in a row depends on the previous one. I have an initial value v0 as a starting point. Let me give an example:
In [126]: import pandas as pd
In [127]: df = pd.DataFrame([1.0, 1.1, 1.2, 1.3])
In [128]: df_result = df.copy()
In [129]: v0 = 10
In [130]: for i in range(1, len(df.index)):
     ...:     df_result.iloc[i, 0] = df.iloc[i, 0] * df_result.iloc[i-1, 0]
     ...:
In [131]: df_result
Out[131]:
0
0 1.000
1 1.100
2 1.320
3 1.716
In [132]:
My question is about the for loop. How can I write this more efficiently?

I believe you need to first numpy.insert the value v0 at the first position and then call numpy.cumprod:
import numpy as np
import pandas as pd

df = pd.DataFrame([1.0, 1.1, 1.2, 1.3], columns=['r'])
v0 = 10
df['n'] = np.cumprod(np.insert(df['r'].values[1:], 0, v0))
print(df)
     r      n
0  1.0  10.00
1  1.1  11.00
2  1.2  13.20
3  1.3  17.16
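A pure-pandas variant of the same idea (a small sketch of my own, using Series.where and Series.cumprod on the same data):

import pandas as pd

df = pd.DataFrame([1.0, 1.1, 1.2, 1.3], columns=['r'])
v0 = 10
# replace the first ratio with the starting value, then take the running product
df['n'] = df['r'].where(df.index > 0, v0).cumprod()
print(df)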

Related

select rows in a grouped dataframe before the row which does not satisfy a condition (python)

I have a dataframe with some features. I want to group by the 'id' feature. Then for each group I want to identify the row whose 'speed' value is greater than a threshold and select all the rows before that one.
For example, my threshold is 1.5 for the 'speed' feature and my input is:
id   speed   ...
1    1.2     ...
1    1.9     ...
1    1.0     ...
5    0.9     ...
5    1.3     ...
5    3.5     ...
5    0.4     ...
And my desired output is:
id   speed   ...
1    1.2     ...
5    0.9     ...
5    1.3     ...
This should get you the desired results:
import pandas as pd

# Create sample data
df = pd.DataFrame({'id': [1, 1, 1, 5, 5, 5, 5],
                   'speed': [1.2, 1.9, 1.0, 0.9, 1.3, 9.5, 0.4]})
df
output:
id speed
0 1 1.2
1 1 1.9
2 1 1.0
3 5 0.9
4 5 1.3
5 5 9.5
6 5 0.4
ther = 1.5
s = df.speed.shift(-1).ge(ther)
df[s]
Output:
id speed
0 1 1.2
4 5 1.3
It took me an hour to figure out, but I got what you need. You need to REVERSE the dataframe and use .cumsum() (cumulative sum) within the grouped ids to find the rows at or before the row that exceeds the speed threshold you set. Then drop the over-threshold speeds themselves, along with the rows that never satisfy the condition. Finally, reverse the dataframe back:
import numpy as np
import pandas as pd

# Create sample data
df = pd.DataFrame({'id': [1, 1, 1, 5, 5, 5, 5],
                   'speed': [1.2, 1.9, 1.0, 0.9, 1.3, 9.5, 0.4]})
# Reverse the dataframe
df = df.iloc[::-1]
thre = 1.5
# Flag rows with speed at or above the threshold
df = df.assign(ge=df.speed.ge(thre))
# Groupby and cumsum to mark the rows at or after the threshold row within the same id
df.insert(0, 'beforethre', df.groupby('id')['ge'].cumsum())
# Drop speeds at or above the threshold
df['ge'] = df['ge'].replace(True, np.nan)
# Drop rows that don't have any speed above the threshold or that come after it
df['beforethre'] = df['beforethre'].replace(0, np.nan)
df = df.dropna(axis=0).drop(['ge', 'beforethre'], axis=1)
# Reverse the dataframe back
df = df.iloc[::-1]
# Voila!
df
Output:
id speed
0 1 1.2
3 5 0.9
4 5 1.3
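The same result can also be reached without reversing the frame, by flagging everything from the first over-threshold row onward within each id and keeping the rest (a compact alternative sketch of my own, on the same sample data):

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 5, 5, 5, 5],
                   'speed': [1.2, 1.9, 1.0, 0.9, 1.3, 9.5, 0.4]})
thre = 1.5
# True from the first over-threshold row onward within each id
after_breach = df['speed'].ge(thre).astype(int).groupby(df['id']).cummax().astype(bool)
print(df[~after_breach])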

Rolling average with window size an interval of column values

I'm trying to calculate a rolling average on some incomplete data. I want to average values in column 2 across windows of size 1.0 of the value in column 1 (miles). I've tried .rolling(), but (from my limited understanding) this only creates windows based on the index, and not on column values.
import pandas as pd
import numpy as np
df = pd.DataFrame([
    [4.5, 10],
    [4.6, 11],
    [4.8, 9],
    [5.5, 6],
    [5.6, 6],
    [8.1, 10],
    [8.2, 13]
])

averages = []
for index in range(len(df)):
    # all rows whose column-0 value is within 0.5 of the current row's value
    nearby = df.loc[np.abs(df[0] - df.loc[index][0]) <= 0.5]
    averages.append(nearby[1].mean())
df['rollingAve'] = averages
Gives the desired output:
0 1 rollingAve
0 4.5 10 10.0
1 4.6 11 10.0
2 4.8 9 10.0
3 5.5 6 6.0
4 5.6 6 6.0
5 8.1 10 11.5
6 8.2 13 11.5
But this slows down substantially for big dataframes. Is there a way to implement .rolling() with varying window sizes, or something similar?
Pandas' BaseIndexer is quite handy, although it takes a little bit of head-scratching to get it right.
In the following, I use np.searchsorted to quickly find the indices (start, end) of each window:
from pandas.api.indexers import BaseIndexer

class RangeWindow(BaseIndexer):
    def __init__(self, val, width):
        self.val = val.values
        self.width = width

    def get_window_bounds(self, num_values, min_periods, center, closed):
        if min_periods is None:
            min_periods = 0
        if closed is None:
            closed = 'left'
        w = (-self.width / 2, self.width / 2) if center else (0, self.width)
        side0 = 'left' if closed in ['left', 'both'] else 'right'
        side1 = 'right' if closed in ['right', 'both'] else 'left'
        ix0 = np.searchsorted(self.val, self.val + w[0], side=side0)
        ix1 = np.searchsorted(self.val, self.val + w[1], side=side1)
        ix1 = np.maximum(ix1, ix0 + min_periods)
        return ix0, ix1
Some deluxe options: min_periods, center, and closed are implemented according to what DataFrame.rolling specifies. Note that np.searchsorted assumes the values in val are sorted, so the dataframe should be sorted by that column (as it is in the examples here).
Application:
df = pd.DataFrame([
    [4.5, 10],
    [4.6, 11],
    [4.8, 9],
    [5.5, 6],
    [5.6, 6],
    [8.1, 10],
    [8.2, 13]
], columns='a b'.split())

df.b.rolling(RangeWindow(df.a, width=1.0), center=True, closed='both').mean()
# gives:
0 10.0
1 10.0
2 10.0
3 6.0
4 6.0
5 11.5
6 11.5
Name: b, dtype: float64
Timing:
df = pd.DataFrame(
    np.random.uniform(0, 1000, size=(1_000_000, 2)),
    columns='a b'.split(),
)
df = df.sort_values('a').reset_index(drop=True)
%%time
avg = df.b.rolling(RangeWindow(df.a, width=1.0)).mean()
CPU times: user 133 ms, sys: 3.58 ms, total: 136 ms
Wall time: 135 ms
Update on performance:
Following a comment from @anon01, I was wondering whether one could go faster for the case when the rolling involves large windows. It turns out I should have measured pandas' rolling mean and sum performance first... (Premature optimization, anyone?) See the end for why.
Anyway, the idea was to do a cumsum just once, then take the difference of elements dereferenced by the windows endpoints:
# both functions below work on numpy arrays:
def fast_rolling_sum(a, b, width):
    z = np.concatenate(([0], np.cumsum(b)))
    ix0 = np.searchsorted(a, a - width / 2, side='left')
    ix1 = np.searchsorted(a, a + width / 2, side='right')
    return z[ix1] - z[ix0]

def fast_rolling_mean(a, b, width):
    z = np.concatenate(([0], np.cumsum(b)))
    ix0 = np.searchsorted(a, a - width / 2, side='left')
    ix1 = np.searchsorted(a, a + width / 2, side='right')
    return (z[ix1] - z[ix0]) / (ix1 - ix0)
With this (and the 1-million rows df above), I see:
%timeit fast_rolling_mean(df.a.values, df.b.values, width=100.0)
# 93.9 ms ± 335 µs per loop
versus:
%timeit df.rolling(RangeWindow(df.a, width=100.0), min_periods=1).mean()
# 248 ms ± 1.54 ms per loop
However!!! Pandas is likely already doing such an optimization (it's a pretty obvious one). The timings don't increase with larger windows (which is why I was saying I should have checked first).
df.rolling and Series.rolling do allow value-based windows if the index is a DatetimeIndex or TimedeltaIndex. You can use this to get close to the desired result:
df = df.set_index(pd.TimedeltaIndex(df[0]*1e9))
df["rolling_mean"] = df[1].rolling("1s").mean()
df = df.reset_index(drop=True)
output:
0 1 rolling_mean
0 4.5 10 10.000000
1 4.6 11 10.500000
2 4.8 9 10.000000
3 5.5 6 8.666667
4 5.6 6 7.000000
5 8.1 10 10.000000
6 8.2 13 11.500000
Advantages
This is a three-line solution that should have great performance, leveraging the pandas datetime backend.
Disadvantages
This is definitely a hack: it casts your miles column to time-delta seconds, and the average isn't centered (center isn't implemented for datetime-like and offset-based windows).
Overall: if you value performance and can live with a non-centered mean, this would be a great way to go, with a comment or two in the code.
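If you prefer not to mutate df with set_index/reset_index, the same trick can go through a throwaway Series (a small variant sketch of my own, same assumptions as above):

import pandas as pd

df = pd.DataFrame([[4.5, 10], [4.6, 11], [4.8, 9],
                   [5.5, 6], [5.6, 6], [8.1, 10], [8.2, 13]])
# index the values by the miles column re-interpreted as seconds
s = pd.Series(df[1].to_numpy(), index=pd.to_timedelta(df[0], unit='s'))
df['rolling_mean'] = s.rolling('1s').mean().to_numpy()
print(df)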

Padding rows based on conditional

I have time series data per row (with columns as time steps) and I'd like to left and right pad each row with 0s based on a conditional row value (i.e. 'Padding amount'). This is what I have:
Padding amount   T1    T2    T3
0                3     2.9   2.8
1                2.9   2.8   2.7
1                2.8   2.3   2.0
2                4.4   3.3   2.3
And this is what I'd like to produce:
Padding amount   T1    T2    T3    T4    T5
0                3     2.9   2.8   0     0      (--> padding = 0, so no change)
1                0     2.9   2.8   2.7   0      (--> shifted one to the right)
1                0     2.8   2.3   2.0   0
2                0     0     4.4   3.3   2.3    (--> shifted two to the right)
I see that Keras has sequence padding, but I'm not sure how this would work considering all rows have the same number of entries. I'm looking at shift and np.roll, but I'm sure a solution for this already exists somewhere.
In numpy, you could construct an array of indices for the locations where you want to place your array elements.
Let's say you have
padding = np.array([0, 1, 1, 2])
data = np.array([[3.0, 2.9, 2.8],
                 [2.9, 2.8, 2.7],
                 [2.8, 2.3, 2.0],
                 [4.4, 3.3, 2.3]])
M, N = data.shape
The output array would be
output = np.zeros((M, N + padding.max()))
You can make an index of where the data goes:
rows = np.arange(M)[:, None]
cols = padding[:, None] + np.arange(N)
Since the shape of the index broadcasts to the shape of the data, you can assign to the output directly:
output[rows, cols] = data
Not sure how this applies to a DataFrame exactly, but you could probably construct a new one after operating on the values of the old one. Alternatively, you could probably implement all these operations equivalently directly in pandas.
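For illustration, here is a hedged sketch of my own showing how the fancy-indexing result could be wrapped back into a DataFrame, reusing the question's column names:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Padding amount': [0, 1, 1, 2],
                   'T1': [3.0, 2.9, 2.8, 4.4],
                   'T2': [2.9, 2.8, 2.3, 3.3],
                   'T3': [2.8, 2.7, 2.0, 2.3]})
padding = df['Padding amount'].to_numpy()
data = df[['T1', 'T2', 'T3']].to_numpy()
M, N = data.shape

# place each row's values at column offset given by its padding amount
out = np.zeros((M, N + padding.max()))
out[np.arange(M)[:, None], padding[:, None] + np.arange(N)] = data

result = pd.concat(
    [df[['Padding amount']],
     pd.DataFrame(out, columns=[f'T{i + 1}' for i in range(out.shape[1])])],
    axis=1)
print(result)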
This is one way of doing it; I've made the process really flexible in terms of how many time periods/steps it can take:
import pandas as pd

# data
d = {'Padding amount': [0, 1, 1, 2],
     'T1': [3, 2.9, 2.8, 4.4],
     'T2': [2.9, 2.7, 2.3, 3.3],
     'T3': [2.8, 2.7, 2.0, 2.3]}
# create DF
df = pd.DataFrame(data=d)
# get max padding amount
maxPadd = df['Padding amount'].max()
# list of time periods
timePeriodsCols = [c for c in df.columns.tolist() if 'T' in c]
# reverse list
reverseList = timePeriodsCols[::-1]
# number of periods
noOfPeriods = len(timePeriodsCols)
# create new needed columns
for i in range(noOfPeriods + 1, noOfPeriods + 1 + maxPadd):
    df['T' + str(i)] = ''
# loop over records
for i, row in df.iterrows():
    # get padding amount
    padAmount = df.at[i, 'Padding amount']
    # if zero then do nothing
    if padAmount == 0:
        continue
    # else: roll column value by padding amount and set old location to zero
    else:
        for col in reverseList:
            df.at[i, df.columns[df.columns.get_loc(col) + padAmount]] = df.at[i, df.columns[df.columns.get_loc(col)]]
            df.at[i, df.columns[df.columns.get_loc(col)]] = 0
print(df)
Padding amount T1 T2 T3 T4 T5
0 0 3.0 2.9 2.8
1 1 0.0 2.9 2.7 2.7
2 1 0.0 2.8 2.3 2
3 2 0.0 0.0 4.4 3.3 2.3

How to apply rolling function when all variables in window from multiple columns are required

I'm trying to calculate a rolling statistic that requires all variables in a window from two input columns.
My only solution involves a for loop. Is there a more efficient way, perhaps using Pandas' rolling and apply functions?
import pandas as pd
from statsmodels.tsa.stattools import coint

def f(x):
    return coint(x['a'], x['b'])[1]

df = pd.DataFrame(data={'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
df2 = df.rolling(2).apply(lambda x: f(x), raw=False)  # KeyError: 'a'
I get KeyError: 'a' because df gets passed to f() one series (column) at a time. Specifying axis=1 sends one row and all columns to f(), but neither approach provides the required set of observations.
You could try rolling, mean and sum:
df['result'] = df.rolling(2).mean().sum(axis=1)
a b result
0 1 5 0.0
1 2 6 7.0
2 3 7 9.0
3 4 8 11.0
EDIT
Adding a different answer based upon new information in the question by OP.
Set up the function.
import pandas as pd
from statsmodels.tsa.stattools import coint

def f(x):
    return coint(x['a'], x['b'])
Create the data and dataframe:
a_data = [1,2,3,4]
b_data = [5,6,7,8]
df = pd.DataFrame(data={'a': a_data, 'b': b_data})
a b
0 1 5
1 2 6
2 3 7
3 4 8
I gather, after researching coint, that you are trying to pass two rolling arrays to f as x['a'] and x['b']. The following will create the arrays and dataframe.
n=2
arr_a = [df['a'].shift(x).values[::-1][:n] for x in range(len(df['a']))[::-1]]
arr_b = [df['b'].shift(x).values[::-1][:n] for x in range(len(df['b']))[::-1]]
df1 = pd.DataFrame(data={'a': arr_a, 'b': arr_b})
n is the size of the rolling window.
df1
a b
0 [1.0, nan] [5.0, nan]
1 [2.0, 1.0] [6.0, 5.0]
2 [3.0, 2.0] [7.0, 6.0]
3 [4, 3] [8, 7]
Then you can use .apply(f) to send in the rows of arrays.
df1.iloc[(n-1):,].apply(f, axis=1)
Your output is as follows:
1 (-inf, 0.0, [-48.37534, -16.26923, -10.00565])
2 (-inf, 0.0, [-48.37534, -16.26923, -10.00565])
3 (-inf, 0.0, [-48.37534, -16.26923, -10.00565])
dtype: object
When I run this I do get an error for perfectly collinear data, but I suspect that will disappear with real data.
Also, I know a purely vectorized solution might have been faster. I wonder what the performance will be like for this, if it is what you are looking for.
Hats off to @Zero, who really had the solution for this problem here.
I tried placing the sum before the rolling:
import pandas as pd
import time
df = pd.DataFrame(data={'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
df2 = df.copy()
s = time.time()
df2.loc[:, 'mean1'] = df.sum(axis = 1).rolling(2).mean()
print(time.time() - s)
s = time.time()
df2.loc[:, 'mean2'] = df.rolling(2).mean().sum(axis=1)
print(time.time() - s)
df2
0.003737926483154297
0.005460023880004883
a b mean1 mean2
0 1 5 NaN 0.0
1 2 6 7.0 7.0
2 3 7 9.0 9.0
3 4 8 11.0 11.0
It is slightly faster than the previous answer, but works the same, and maybe in large datasets the difference might be significant.
You can modify it to select the columns of interest only:
s = time.time()
print(df[['a', 'b']].sum(axis = 1).rolling(2).mean())
print(time.time() - s)
0 NaN
1 7.0
2 9.0
3 11.0
dtype: float64
0.0033559799194335938
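Coming back to the original multi-column requirement, a plain loop over window start positions is still the most direct way to hand both columns of each window to a function at once. A minimal sketch of my own (using a simple correlation instead of coint so the tiny collinear toy data does not raise):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
n = 2  # rolling window size

def two_col_stat(window):
    # any statistic that needs both columns of the window at once
    return window['a'].corr(window['b'])

result = pd.Series(
    [two_col_stat(df.iloc[i:i + n]) for i in range(len(df) - n + 1)],
    index=df.index[n - 1:])
print(result)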

Python Pandas Cumulative Multiplication with IF case

I created a small dataframe and I want to multiply the previous row by 0.99 and so on, but only if the "IF case" is true; otherwise keep x[i].
In:
1
6
2
8
4
Out:
1.00
0.99
2.00
1.98
1.96
With help from someone, based on a similar problem, I tried the following, but it does not work.
import numpy as np
import pandas as pd

x = pd.DataFrame([1, 6, 2, 8, 4])
y = np.zeros(x.shape)
yd = pd.DataFrame(y)
yd = np.where(x < 3, x, pd.Series(.99, yd.index).cumprod() / .99)
Any idea? Thank you
This is more like a groupby problem: when the value is less than 3 you reset the cumprod.
y = x[0]
mask = y < 3
y.where(mask, 0.99).groupby(mask.cumsum()).cumprod()
Out[122]:
0 1.0000
1 0.9900
2 2.0000
3 1.9800
4 1.9602
Name: 0, dtype: float64
At least we have the for loop here (if the above does not work):
your = []
for t, v in enumerate(x[0]):
    if v < 3:
        your.append(v)
    else:
        your.append(your[t-1] * 0.99)
your
Out[129]: [1, 0.99, 2, 1.98, 1.9602]
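To put the groupby/cumprod result back into the dataframe as a column, a minimal self-contained sketch (same data and approach as above):

import pandas as pd

x = pd.DataFrame([1, 6, 2, 8, 4])
mask = x[0] < 3
# keep values below 3 as-is; otherwise restart a 0.99 running product within each segment
x['out'] = x[0].where(mask, 0.99).groupby(mask.cumsum()).cumprod()
print(x)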
This checks whether the value of x in the current row is less than 3. If it is, it keeps it as-is; otherwise it multiplies the previous row by 0.99.
import numpy as np
import pandas as pd

x = pd.DataFrame([1, 6, 2, 8, 4])
x['out'] = np.where(x[0] < 3, x[0], x[0].shift(1) * 0.99)
Output:
x['out']
0 1.00
1 0.99
2 2.00
3 1.98
4 7.92
