I have a DataFrame, df, in pandas with series df.A and df.B and am trying to create a third series, df.C that is dependent on A and B as well as the previous result. That is:
C[0]=A[0]
C[n]=A[n] + B[n]*C[n-1]
What is the most efficient way of doing this? Ideally, I wouldn't have to fall back to a for loop.
Edit
This is the desired output for C given A and B. Now just need to figure out how...
import pandas as pd
a = [ 2, 3,-8,-2, 1]
b = [ 1, 1, 4, 2, 1]
c = [ 2, 5,12,22,23]
df = pd.DataFrame({'A': a, 'B': b, 'C': c})
df
You can vectorize this with obnoxious cumulative products and zipping together of other vectors. But it won't end up saving you time. As a matter of fact, it will likely be numerically unstable.
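For what it's worth, here is a sketch of that cumulative-product trick (an illustration, not a recommendation; it assumes B contains no zeros, and the division by a running product is exactly where the numerical instability comes from):
import numpy as np
# Expanding the recurrence gives C[n] = sum_k A[k] * B[k+1] * ... * B[n],
# which can be rewritten with cumulative products and sums.
Bp = np.cumprod(b)                                   # Bp[n] = B[0] * ... * B[n]
C = Bp * np.cumsum(np.asarray(a, dtype=float) / Bp)  # reproduces [2, 5, 12, 22, 23] for the sample above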
Instead, you can use numba to speed up your loop.
from numba import njit
import numpy as np
import pandas as pd

@njit
def dynamic_alpha(a, b):
    c = a.copy()
    for i in range(1, len(a)):
        c[i] = a[i] + b[i] * c[i - 1]
    return c
df.assign(C=dynamic_alpha(df.A.values, df.B.values))
A B C
0 2 1 2
1 3 1 5
2 -8 4 12
3 -2 2 22
4 1 1 23
For this simple calculation, this will be about as fast as a simple
df.assign(C=np.arange(len(df)) ** 2 + 2)
df = pd.concat([df] * 10000)
%timeit df.assign(C=dynamic_alpha(df.A.values, df.B.values))
%timeit df.assign(C=np.arange(len(df)) ** 2 + 2)
337 µs ± 5.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
333 µs ± 20.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Try this:
C = [A[0]]
for i in range(1, len(A)):
    C.append(A[i] + B[i] * C[i - 1])
Building the list in plain Python like this is still much quicker than iterating over the DataFrame row by row.
For example, I have a dataframe where two of the columns are "Zeroes" and "Ones" that contain only zeroes and ones, respectively. If I combine them into one column I get first all the zeroes, then all the ones.
I want to combine them in a way that I get each element from both columns, not all elements from the first column and all elements from the second column. So I don't want the result to be [0, 0, 0, 1, 1, 1], I need it to be [0, 1, 0, 1, 0, 1].
I process 100K+ rows of data. What is the fastest or optimal way to achieve this?
Thanks in advance!
Try:
import pandas as pd
df = pd.DataFrame({ "zeroes" : [0, 0, 0], "ones": [1, 1, 1], "some_other" : list("abc")})
res = df[["zeroes", "ones"]].to_numpy().ravel(order="C")
print(res)
Output
[0 1 0 1 0 1]
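If you need the interleaved values back as a pandas object rather than a NumPy array, a small follow-up (the name combined is just illustrative):
combined = pd.Series(res, name="combined")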
Micro-Benchmarks
import pandas as pd
from itertools import chain
df = pd.DataFrame({ "zeroes" : [0] * 10_000, "ones": [1] * 10_000})
%timeit df[["zeroes", "ones"]].to_numpy().ravel(order="C").tolist()
672 µs ± 8.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit [v for vs in zip(df["zeroes"], df["ones"]) for v in vs]
2.57 ms ± 54 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit list(chain.from_iterable(zip(df["zeroes"], df["ones"])))
2.11 ms ± 73 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can use numpy.ndarray.flatten() as an alternative, like below:
import numpy as np
import pandas as pd
df[["zeroes", "ones"]].to_numpy().flatten()
Benchmark (running on colab):
df = pd.DataFrame({ "zeroes" : [0] * 10_000_000, "ones": [1] * 10_000_000})
%timeit df[["zeroes", "ones"]].to_numpy().flatten().tolist()
1 loop, best of 5: 320 ms per loop
%timeit df[["zeroes", "ones"]].to_numpy().ravel(order="C").tolist()
1 loop, best of 5: 322 ms per loop
I don't know if this is the optimal solution, but it should solve your case.
df = pd.DataFrame([[0 for x in range(10)], [1 for x in range(10)]]).T
l = [[x, y] for x, y in zip(df[0], df[1])]
l = [x for y in l for x in y]
l
This may help you: Alternate elements of different columns using Pandas
pd.concat(
    [df1, df2], axis=1
).stack().reset_index(1, drop=True).to_frame('C').rename(index='CC{}'.format)
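Adapted to the column names in this question (a sketch): stack interleaves the two columns row by row, which is exactly the desired ordering.
df[["zeroes", "ones"]].stack().reset_index(drop=True)
# -> Series with values [0, 1, 0, 1, 0, 1]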
This is a simple problem that I can't seem to find an elegant solution to. I am trying to select the rows of a data frame where two of the columns form a pair from a separate list.
For example:
import pandas as pd
df = pd.DataFrame({'a': range(8), 'b': range(8), 'c': list('zyxwvuts')})
pairs = [(4, 4), (5, 6), (6, 6), (7, 9)]
# The data has an arbitrary number of columns, but I just want
# to match 'a' and 'b'
df
a b c
0 0 0 z
1 1 1 y
2 2 2 x
3 3 3 w
4 4 4 v
5 5 5 u
6 6 6 t
7 7 7 s
In this example, my list pairs contains the combination of df.a and df.b at rows 4 and 6. What I would like is a clean way to get the data frame given by df.iloc[[4, 6], :].
Is there a pandas or numpy way to do this without explicitly looping over pairs?
Answer comparison
The solution using broadcasting is both clean and fast, as well as scaling very well.
def with_set_index(df, pairs):
    return df.set_index(['a','b']).loc[pairs].dropna()

def with_tuple_isin(df, pairs):
    return df[df[['a','b']].apply(tuple, 1).isin(pairs)]

def with_array_views(df, pairs):
    def view1D(a, b):  # a, b are arrays
        a = np.ascontiguousarray(a)
        b = np.ascontiguousarray(b)
        void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
        return a.view(void_dt).ravel(), b.view(void_dt).ravel()
    A, B = view1D(df[['a','b']].values, np.asarray(pairs))
    return df[np.isin(A, B)]

def with_broadcasting(df, pairs):
    return df[(df[['a','b']].values[:, None] == pairs).all(2).any(1)]
%timeit with_set_index(df, pairs)
# 7.35 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit with_tuple_isin(df, pairs)
# 1.89 ms ± 24.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit with_array_views(df, pairs)
# 917 µs ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit with_broadcasting(df, pairs)
# 879 µs ± 8.85 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
tuple with isin
df[df[['a','b']].apply(tuple,1).isin(pairs)]
Out[686]:
a b c
4 4 4 v
6 6 6 t
A vectorized one based on array-views -
# https://stackoverflow.com/a/45313353/ @Divakar
def view1D(a, b):  # a, b are arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(), b.view(void_dt).ravel()

A, B = view1D(df[['a','b']].values, np.asarray(pairs))
out = df[np.isin(A, B)]
Output for given sample -
In [263]: out
Out[263]:
a b c
4 4 4 v
6 6 6 t
If you are looking for a compact/clean version, we can also leverage broadcasting -
In [269]: df[(df[['a','b']].values[:,None] == pairs).all(2).any(1)]
Out[269]:
a b c
4 4 4 v
6 6 6 t
Try this:
df.set_index(['a','b']).loc[pairs].dropna()
If I have the following dataframe, derived like so: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 1)))
0
0 0
1 2
2 8
3 1
4 0
5 0
6 7
7 0
8 2
9 2
Is there an efficient way to cumsum rows with a limit, starting a new cumsum each time that limit is reached? After the limit is reached (however many rows that takes), a row is created with the total cumsum.
Below I have created an example of a function that does this, but it's very slow, especially when the dataframe becomes very large.
I don't like that my function is looping and I am looking for a way to make it faster (I guess a way without a loop).
def foo(df, max_value):
    last_value = 0
    storage = []
    for index, row in df.iterrows():
        this_value = np.nansum([row[0], last_value])
        if this_value >= max_value:
            storage.append((index, this_value))
            this_value = 0
        last_value = this_value
    return storage
If you run my function like so: foo(df, 5)
In the above context, it returns:
0
2 10
6 8
The loop cannot be avoided, but it can be sped up by compiling it with numba's njit:
from numba import njit, prange

@njit
def dynamic_cumsum(seq, index, max_value):
    cumsum = []
    running = 0
    for i in prange(len(seq)):
        if running > max_value:
            cumsum.append([index[i], running])
            running = 0
        running += seq[i]
    cumsum.append([index[-1], running])
    return cumsum
The index is required here, assuming your index is not numeric/monotonically increasing.
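A usage sketch that wraps the result back into a DataFrame (the column names here are just illustrative):
lst = dynamic_cumsum(df.iloc(axis=1)[0].values, df.index.values, 5)
out = pd.DataFrame(lst, columns=['index', 'running_total']).set_index('index')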
%timeit foo(df, 5)
1.24 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit dynamic_cumsum(df.iloc(axis=1)[0].values, df.index.values, 5)
77.2 µs ± 4.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
If the index is of Int64Index type, you can shorten this to:
@njit
def dynamic_cumsum2(seq, max_value):
    cumsum = []
    running = 0
    for i in prange(len(seq)):
        if running > max_value:
            cumsum.append([i, running])
            running = 0
        running += seq[i]
    cumsum.append([i, running])
    return cumsum
lst = dynamic_cumsum2(df.iloc(axis=1)[0].values, 5)
pd.DataFrame(lst, columns=['A', 'B']).set_index('A')
B
A
3 10
7 8
9 4
%timeit foo(df, 5)
1.23 ms ± 30.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit dynamic_cumsum2(df.iloc(axis=1)[0].values, 5)
71.4 µs ± 1.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
njit Functions Performance
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.randint(0, 10, size=(n, 1))),
    kernels=[
        lambda df: list(cumsum_limit_nb(df.iloc[:, 0].values, 5)),
        lambda df: dynamic_cumsum2(df.iloc[:, 0].values, 5)
    ],
    labels=['cumsum_limit_nb', 'dynamic_cumsum2'],
    n_range=[2**k for k in range(0, 17)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=None  # TODO - update when @jpp adds in the final `yield`
)
The log-log plot shows that the generator function is faster for larger inputs.
A possible explanation is that, as N increases, the overhead of appending to a growing list in dynamic_cumsum2 becomes prominent. While cumsum_limit_nb just has to yield.
A loop isn't necessarily bad. The trick is to make sure it's performed on low-level objects. In this case, you can use Numba or Cython. For example, using a generator with numba.njit:
from numba import njit

@njit
def cumsum_limit(A, limit=5):
    count = 0
    for i in range(A.shape[0]):
        count += A[i]
        if count > limit:
            yield i, count
            count = 0

idx, vals = zip(*cumsum_limit(df[0].values))
res = pd.Series(vals, index=idx)
To demonstrate the performance benefits of JIT-compiling with Numba:
import pandas as pd, numpy as np
from numba import njit
df = pd.DataFrame({0: [0, 2, 8, 1, 0, 0, 7, 0, 2, 2]})
@njit
def cumsum_limit_nb(A, limit=5):
    count = 0
    for i in range(A.shape[0]):
        count += A[i]
        if count > limit:
            yield i, count
            count = 0

def cumsum_limit(A, limit=5):
    count = 0
    for i in range(A.shape[0]):
        count += A[i]
        if count > limit:
            yield i, count
            count = 0
n = 10**4
df = pd.concat([df]*n, ignore_index=True)
%timeit list(cumsum_limit_nb(df[0].values)) # 4.19 ms ± 90.4 µs per loop
%timeit list(cumsum_limit(df[0].values)) # 58.3 ms ± 194 µs per loop
simpler approach:
def dynamic_cumsum(seq, limit):
    res = []
    cs = seq.cumsum()
    for i, e in enumerate(cs):
        if cs[i] > limit:
            res.append([i, e])
            cs[i+1:] -= e
    if res[-1][0] == i:
        return res
    res.append([i, e])
    return res
result:
x=dynamic_cumsum(df[0].values,5)
x
>>[[2, 10], [6, 8], [9, 4]]
There seem to be a lot of answers on how to get the last index value from a pandas DataFrame, but what I am trying to get is the index position number for the last row of every index at level 0 in a multi-index DataFrame. I found a way using a loop, but the DataFrame is millions of lines and this loop is slow. I assume there is a more pythonic way of doing this.
Here is a mini example of df3. I want a list (or maybe an array) of the index position numbers for the last row of each stock, i.e. the last row before the index changes to a new stock. The Index column below shows the values I want; these are the positional indices from the df.
Stock Date Index
AAPL 12/31/2004
1/3/2005
1/4/2005
1/5/2005
1/6/2005
1/7/2005
1/10/2005 3475
AMZN 12/31/2004
1/3/2005
1/4/2005
1/5/2005
1/6/2005
1/7/2005
1/10/2005 6951
BAC 12/31/2004
1/3/2005
1/4/2005
1/5/2005
1/6/2005
1/7/2005
1/10/2005 10427
This is the code I am using, where df3 is the DataFrame:
test_index_list = []
for start_index in range(len(df3) - 1):
    end_index = start_index + 1
    if df3.index[start_index][0] != df3.index[end_index][0]:
        test_index_list.append(start_index)
I changed Divakar's answer a bit, using get_level_values for the indices of the first level of the MultiIndex:
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbc')}).set_index(['F', 'A', 'B'])
print (df)
C D E
F A B
a a 4 7 1 5
b 5 8 3 3
c 4 9 5 6
b d 5 4 7 9
e 5 2 1 2
c f 4 3 0 4
def start_stop_arr(initial_list):
    a = np.asarray(initial_list)
    mask = np.concatenate(([True], a[1:] != a[:-1], [True]))
    idx = np.flatnonzero(mask)
    stop = idx[1:] - 1
    return stop
print (df.index.get_level_values(0))
Index(['a', 'a', 'a', 'b', 'b', 'c'], dtype='object', name='F')
print (start_stop_arr(df.index.get_level_values(0)))
[2 4 5]
dict.values
Using dict to track values leaves the last found value as the one that matters.
list(dict(map(reversed, enumerate(df.index.get_level_values(0)))).values())
[2, 4, 5]
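To see why this works, here is the trick in miniature (an illustrative snippet, not the full call above): later (value, position) pairs overwrite earlier ones, so each value ends up mapped to its last position.
dict(map(reversed, enumerate(['a', 'a', 'a', 'b', 'b', 'c'])))
# {'a': 2, 'b': 4, 'c': 5}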
With Loop
Create a function that takes a factorization and the number of unique values:
def last(bins, k):
    a = np.zeros(k, np.int64)
    for i, b in enumerate(bins):
        a[b] = i
    return a
You can then get the factorization with
f, u = pd.factorize(df.index.get_level_values(0))
last(f, len(u))
array([2, 4, 5])
However, the way MultiIndex is usually constructed, the labels objects are already factorizations and the levels objects are unique values.
last(df.index.labels[0], df.index.levels[0].size)
array([2, 4, 5])
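Note that in pandas 0.24+ the labels attribute of a MultiIndex was renamed to codes, so on newer versions the equivalent call would be:
last(df.index.codes[0], df.index.levels[0].size)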
What's more, we can use Numba's just-in-time compilation to super-charge this.
from numba import njit

@njit
def nlast(bins, k):
    a = np.zeros(k, np.int64)
    for i, b in enumerate(bins):
        a[b] = i
    return a
nlast(df.index.labels[0], df.index.levels[0].size)
array([2, 4, 5])
Timing
%%timeit
f, u = pd.factorize(df.index.get_level_values(0))
last(f, len(u))
641 µs ± 9.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
f, u = pd.factorize(df.index.get_level_values(0))
nlast(f, len(u))
264 µs ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
nlast(df.index.labels[0], len(df.index.levels[0]))
4.06 µs ± 43.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
last(df.index.labels[0], len(df.index.levels[0]))
654 µs ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
list(dict(map(reversed, enumerate(df.index.get_level_values(0)))).values())
709 µs ± 4.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
jezrael's solution. Also very fast.
%timeit start_stop_arr(df.index.get_level_values(0))
113 µs ± 83.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
np.unique
I did not time this because I don't like it. See below:
Using np.unique and the return_index argument. This returns the first place each unique value is found. After this, I'd do some shifting to get at the last position of the prior unique value.
Note: this works if the level values are in contiguous groups. If they aren't, we have to do sorting and unsorting that isn't worth it. Unless it really is then I'll show how to do it.
i = np.unique(df.index.get_level_values(0), return_index=True)[1]
np.append(i[1:], len(df)) - 1
array([2, 4, 5])
Setup
from @jezrael
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbc')}).set_index(['F', 'A', 'B'])
df:
   val  wt
1  100   2
2  300   3
3  200   5
required df:
   val  wt  cum_wt_avg
1  100   2         100
2  300   3         220
3  200   5         210
formula :
cum_wt_avg[i] = cum_sum(val * wt)[i] / cum_sum(wt)[i]
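For example, the second row works out as (100*2 + 300*3) / (2 + 3) = 1100 / 5 = 220, matching the required df above.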
Is there any easy way to do this in pandas or numpy?
Something like this
df["cum_wt_avg"] = pd.cum_mean(value=df.val, weight=df.wt)
I think in pandas it is best to avoid loops.
So first multiply the columns with mul, take the cumsum, and divide by the cumulative sum of column wt:
df["cum_wt_avg"] = df['val'].mul(df['wt']).cumsum().div(df['wt'].cumsum())
print (df)
val wt cum_wt_avg
1 100 2 100.0
2 300 3 220.0
3 200 5 210.0
To improve performance, use numpy with numpy.cumsum:
import numpy as np
a = df['val'].values
b = df['wt'].values
df["cum_wt_avg"] = np.cumsum(a * b) / np.cumsum(b)
Timings:
import numpy as np
from numba import jit
df = pd.concat([df]*1000)
# @jpp's solution
@jit(nopython=True)
def cum_wavg(arr, res):
    return np.cumsum(arr[:, 0] * arr[:, 1]) / np.cumsum(arr[:, 1])

def jez1(df):
    a = df['val'].values
    b = df['wt'].values
    return np.cumsum(a * b) / np.cumsum(b)
print (jez1(df))
In [184]: %timeit cum_wavg(df.values, res=np.zeros(len(df.index)))
65.5 µs ± 27.1 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [185]: %timeit df['val'].mul(df['wt']).cumsum().div(df['wt'].cumsum())
362 µs ± 6.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [186]: %timeit (jez1(df))
63.8 µs ± 491 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This is one way using numpy.
import numpy as np

def cum_wavg(arr):
    return [np.average(arr[:i+1, 0], weights=arr[:i+1, 1]) for i in range(arr.shape[0])]

df['cum_wavg'] = cum_wavg(df.values)
For better performance, you can use numba:
import numpy as np
from numba import jit
df = pd.concat([df]*1000)
@jit(nopython=True)
def cum_wavg(arr, res):
    return np.cumsum(arr[:, 0] * arr[:, 1]) / np.cumsum(arr[:, 1])
%timeit cum_wavg(df.values, res=np.zeros(len(df.index))) # 92.9 µs
%timeit df['val'].mul(df['wt']).cumsum().div(df['wt'].cumsum()) # 549 µs