I have a time-series A holding several values. I need to obtain a series B that is defined algebraically as follows:
B[t] = a * A[t] + b * B[t-1]
where we can assume B[0] = 0, and a and b are real numbers.
Is there any way to do this type of recursive computation in Pandas? Or do I have no choice but to loop in Python as suggested in this answer?
As an example of input:
> A = pd.Series(np.random.randn(10,))
0 -0.310354
1 -0.739515
2 -0.065390
3 0.214966
4 -0.605490
5 1.293448
6 -3.068725
7 -0.208818
8 0.930881
9 1.669210
As I noted in a comment, you can use scipy.signal.lfilter. In this case (assuming A is a one-dimensional numpy array), all you need is:
B = lfilter([a], [1.0, -b], A)
Here's a complete script:
import numpy as np
from scipy.signal import lfilter
np.random.seed(123)
A = np.random.randn(10)
a = 2.0
b = 3.0
# Compute the recursion using lfilter.
# [a] and [1, -b] are the coefficients of the numerator and
# denominator, resp., of the filter's transfer function.
B = lfilter([a], [1, -b], A)
print(B)
# Compare to a simple loop.
B2 = np.empty(len(A))
for k in range(0, len(B2)):
    if k == 0:
        B2[k] = a*A[k]
    else:
        B2[k] = a*A[k] + b*B2[k-1]

print(B2)
print("max difference:", np.max(np.abs(B2 - B)))
The output of the script is:
[ -2.17126121e+00 -4.51909273e+00 -1.29913212e+01 -4.19865530e+01
-1.27116859e+02 -3.78047705e+02 -1.13899647e+03 -3.41784725e+03
-1.02510099e+04 -3.07547631e+04]
[ -2.17126121e+00 -4.51909273e+00 -1.29913212e+01 -4.19865530e+01
-1.27116859e+02 -3.78047705e+02 -1.13899647e+03 -3.41784725e+03
-1.02510099e+04 -3.07547631e+04]
max difference: 0.0
Another example, in IPython, using a pandas DataFrame instead of a numpy array:
If you have
In [12]: df = pd.DataFrame([1, 7, 9, 5], columns=['A'])
In [13]: df
Out[13]:
A
0 1
1 7
2 9
3 5
and you want to create a new column, B, such that B[k] = A[k] + 2*B[k-1] (with B[k] == 0 for k < 0), you can write
In [14]: df['B'] = lfilter([1], [1, -2], df['A'].astype(float))
In [15]: df
Out[15]:
A B
0 1 1
1 7 9
2 9 27
3 5 59
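If the recursion needs to start from a nonzero previous output instead of B[-1] = 0, lfilter accepts an initial filter state through its zi argument, and scipy.signal.lfiltic can build that state from past outputs. A minimal sketch, with a hypothetical seed B_prev (this goes beyond the original question, which assumes a zero start):
import numpy as np
from scipy.signal import lfilter, lfiltic

np.random.seed(123)
A = np.random.randn(10)
a, b = 2.0, 3.0
B_prev = 5.0  # hypothetical value for B[-1]

# Build the filter's internal state as if the previous output were B_prev.
zi = lfiltic([a], [1.0, -b], y=[B_prev])

# With zi given, lfilter returns (output, final_state).
B, _ = lfilter([a], [1.0, -b], A, zi=zi)

# Check against the explicit recursion.
prev = B_prev
B2 = np.empty(len(A))
for k in range(len(A)):
    B2[k] = a*A[k] + b*prev
    prev = B2[k]
assert np.allclose(B, B2)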
I'm looking to make it so that NaN values in a dataframe are filled in by the mean of all the values up to that point, like this:
A
0 1
1 2
2 3
3 4
4 5
5 NaN
6 NaN
7 11
8 NaN
Would become
A
0 1
1 2
2 3
3 4
4 5
5 3
6 3
7 11
8 4
You can solve it by running the following code:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, pd.NA, pd.NA, 11, pd.NA]
})

for idx in df[pd.isna(df["A"])].index:
    df.loc[idx, "A"] = np.mean(df.loc[:idx, "A"])
It iterates over each NaN and fills it with the mean of the previous values, including previously filled NaNs.
At the end you will have:
>>> df
A
0 1
1 2
2 3
3 4
4 5
5 3
6 3
7 11
8 4
EDIT
As stated by RichieV, performance may be an issue with this solution (its runtime complexity is O(N^2)) when there are many NaNs, but we should also avoid Python iteration, since it is slow compared to native pandas/numpy calls.
Here is an optimized version:
last_idx = None
cumsum = 0
cumnum = 0
for idx in df[pd.isna(df["A"])].index:
    prev_values = df.loc[last_idx:idx, "A"]
    # .loc label slicing is inclusive of both endpoints, so drop idx itself
    prev_values = prev_values[:-1]
    cumsum += prev_values.sum()
    cumnum += len(prev_values)
    df.loc[idx, "A"] = int(cumsum / cumnum)
    last_idx = idx
Result:
>>> df
A
0 1
1 2
2 3
3 4
4 5
5 3
6 3
7 11
8 4
Since in the worst case the script passes over the dataframe twice, the runtime complexity is now O(N).
Marco's answer works fine, but it can be optimized with incremental average formulas from math.stackexchange.com.
Here is an adaptation of that answer (not the exact formula, just the concept).
cumsum = 0
expanding_mean = []
for i, xi in enumerate(df['A']):
    if pd.isna(xi):
        mean = cumsum / i  # divide by the number of items up to the previous row
        expanding_mean.append(mean)
        cumsum += mean
    else:
        cumsum += xi
df.loc[df['A'].isna(), 'A'] = expanding_mean
The main advantage with this code is not having to read all items up to the current index on each iteration to get the mean.
This option still uses a Python loop, which is not the best choice with pandas, but there seems to be no way around it for this use case (hopefully someone will get inspired by this and find such a method without a loop).
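One way to keep the loop but dodge the Python-level overhead is to compile the same recursion with numba. This is my own sketch, not part of the original answers, and like them it assumes the first value is not NaN:
import numba
import numpy as np

@numba.njit
def fill_expanding_mean(a):
    # a: float64 array with np.nan marking gaps; modified in place
    cumsum = 0.0
    for i in range(len(a)):
        if np.isnan(a[i]):
            a[i] = cumsum / i  # mean of everything before, filled values included
        cumsum += a[i]
    return a

# usage sketch: df['A'] = fill_expanding_mean(df['A'].astype('float64').to_numpy())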
Performance tests
Three alternative functions were defined:
incremental: My answer.
from_origin: Marco's original answer.
incremental_pandas: Marco's updated answer.
Tests were done using the timeit module with 3 repetitions on random samples with a 0.4 probability of NaN.
Full code for testing
import pandas as pd
import numpy as np
import timeit
import collections
from matplotlib import pyplot as plt


def incremental(df: pd.DataFrame):
    # error handling
    if pd.isna(df.iloc[0, 0]):
        df.iloc[0, 0] = 0
    cumsum = 0
    expanding_mean = []
    for i, xi in enumerate(df['A']):
        if pd.isna(xi):
            mean = cumsum / i  # divide by the number of items up to the previous row
            expanding_mean.append(mean)
            cumsum += mean
        else:
            cumsum += xi
    df.loc[df['A'].isna(), 'A'] = expanding_mean
    return df


def incremental_pandas(df: pd.DataFrame):
    # error handling
    if pd.isna(df.iloc[0, 0]):
        df.iloc[0, 0] = 0
    last_idx = None
    cumsum = 0
    cumnum = 0
    for idx in df[pd.isna(df["A"])].index:
        prev_values = df.loc[last_idx:idx, "A"]
        # .loc label slicing is inclusive of both endpoints, so drop idx itself
        prev_values = prev_values[:-1]
        cumsum += prev_values.sum()
        cumnum += len(prev_values)
        df.loc[idx, "A"] = cumsum / cumnum
        last_idx = idx
    return df


def from_origin(df: pd.DataFrame):
    # error handling
    if pd.isna(df.iloc[0, 0]):
        df.iloc[0, 0] = 0
    for idx in df[pd.isna(df["A"])].index:
        df.loc[idx, "A"] = np.mean(df.loc[:idx, "A"])
    return df


def get_random_sample(n, p):
    np.random.seed(123)
    return pd.DataFrame({'A':
        np.random.choice(list(range(10)) + [np.nan],
                         size=n, p=[(1 - p) / 10] * 10 + [p])})


r = 3
p = 0.4  # portion of NaNs

# check result from all functions
results = []
for func in [from_origin, incremental, incremental_pandas]:
    random_df = get_random_sample(1000, p)
    new_df = random_df.copy(deep=True)
    results.append(func(new_df))
print('Passed' if all(np.allclose(r, results[0]) for r in results[1:])
      else 'Failed', 'implementation test')

timings = {}
for n in np.geomspace(10, 10000, 10):
    random_df = get_random_sample(int(n), p)
    timings[n] = collections.defaultdict(float)
    results = {}
    for func in ['incremental', 'from_origin', 'incremental_pandas']:
        timings[n][func] = (
            timeit.timeit(f'{func}(random_df.copy(deep=True))', number=r, globals=globals())
            / r
        )
timings = pd.DataFrame(timings).T
print(timings)

timings.plot()
plt.xlabel('size of array')
plt.ylabel('avg runtime (s)')
plt.ylim(0)
plt.grid(True)
plt.tight_layout()
plt.show()
plt.close('all')
I have defined a function to create a dataframe, but I get two lists in each column. How could I get each element of the list as a separate row in the dataframe, as shown below?
a = [1, 2, 3, 4]

def function():
    result = []
    for i in range(0, len(a)):
        number = [i for i in a]
        operation = [8*i for i in a]
        result.append({'number': number, 'operation': operation})
        df = pd.DataFrame(result, columns=['number', 'operation'])
        return df

function()
Result:
number operation
0 [1, 2, 3, 4] [8, 16, 24, 32]
What I really want is:
number operation
0 1 8
1 2 16
2 3 24
3 4 32
Can anyone help me please? :)
Your problem is twofold: first, you are pushing the entire list of values (instead of the "current" value) into the result list on each pass through your for loop, and second, you are overwriting the dataframe each time as well. It would be simpler to use a list comprehension to generate the values for the dataframe:
import pandas as pd

a = [1, 2, 3, 4]

def function():
    result = [{'number': i, 'operation': 8*i} for i in a]
    df = pd.DataFrame(result)
    return df

print(function())
Output:
number operation
0 1 8
1 2 16
2 3 24
3 4 32
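As a side note, since both columns are derived directly from a, the intermediate list of dicts is optional; a dict of columns builds the same frame (a sketch):
import pandas as pd

a = [1, 2, 3, 4]
df = pd.DataFrame({'number': a, 'operation': [8 * i for i in a]})
print(df)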
import numpy as np
import pandas as pd

a = [1, 2, 3, 4]

def function():
    for i in range(0, len(a)):
        number = [i for i in a]
        operation = [8*i for i in a]
    v = np.rot90(np.array((number, operation)))
    result = np.flipud(v)
    df = pd.DataFrame(result, columns=['number', 'operation'])
    return df

print(function())
number operation
0 1 8
1 2 16
2 3 24
3 4 32
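For what it's worth, rotating by 90 degrees and then flipping vertically is equivalent to a plain transpose (np.flipud(np.rot90(m)) == m.T for a 2-D array), so the same frame can be built more directly; a sketch:
import numpy as np
import pandas as pd

a = [1, 2, 3, 4]
v = np.array((a, [8 * i for i in a])).T  # rows become (number, operation) pairs
df = pd.DataFrame(v, columns=['number', 'operation'])
print(df)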
You are almost there. Just replace number = [i for i in a] with number = a[i], and operation = [8*i for i in a] with operation = 8*a[i].
(FYI: there is no need to create the pandas dataframe inside the loop; you get the same output with the dataframe created outside the loop.)
Refer to the code below:
a = [1, 2, 3, 4]

def function():
    result = []
    for i in range(0, len(a)):
        number = a[i]
        operation = 8*a[i]
        result.append({'number': number, 'operation': operation})
    df = pd.DataFrame(result, columns=['number', 'operation'])
    return df

function()
function()
number operation
0 1 8
1 2 16
2 3 24
3 4 32
I need to apply a function to a subset of columns in a dataframe. Consider the following toy example:
pdf = pd.DataFrame({'a' : [1, 2, 3], 'b' : [2, 3, 4], 'c' : [5, 6, 7]})
arb_cols = ['a', 'b']
What I want to do is this:
[pdf[c] = pdf[c].apply(lambda x : 99 if x == 2 else x) for c in arb_cols]
But this is bad syntax. Is it possible to accomplish such a task without a for loop?
With mask (assign(c=pdf.c) restores column c, which lies outside arb_cols):
pdf.mask(pdf.loc[:,arb_cols]==2,99).assign(c=pdf.c)
Out[1190]:
a b c
0 1 99 5
1 99 3 6
2 3 4 7
Or with assign
pdf.assign(**pdf.loc[:,arb_cols].mask(pdf.loc[:,arb_cols]==2,99))
Out[1193]:
a b c
0 1 99 5
1 99 3 6
2 3 4 7
Do not use pd.Series.apply when you can use vectorised functions.
For example, the below should be efficient for larger dataframes even though there is an outer loop:
for col in arb_cols:
    pdf.loc[pdf[col] == 2, col] = 99
Another option is to use pd.DataFrame.replace:
pdf[arb_cols] = pdf[arb_cols].replace(2, 99)
Yet another option is to use numpy.where:
import numpy as np
pdf[arb_cols] = np.where(pdf[arb_cols] == 2, 99, pdf[arb_cols])
If you need to apply a custom function, it would probably be better to use applymap for this case:
pdf[arb_cols] = pdf[arb_cols].applymap(lambda x : 99 if x == 2 else x)
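Note that in recent pandas (2.1 and later) applymap was deprecated and renamed to DataFrame.map, so on newer versions the same elementwise call would be:
# pandas >= 2.1: DataFrame.map replaces the deprecated applymap
pdf[arb_cols] = pdf[arb_cols].map(lambda x: 99 if x == 2 else x)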
I'm trying to evaluate a simple formula provided via a GUI.
Currently I store the data in a dict with letters as keys (happy to change that, but I thought it might bring the solution one step closer).
Eventually I want to type in a simple formula such as "A - J*2".
import pandas as pd
data_dict = {}
data_dict['A'] = pd.Series([1, 2, 3])
data_dict['C'] = pd.Series([0, 1, 2])
data_dict['E'] = pd.Series([0.5, 1.5, 2.5])
data_dict['J'] = pd.Series([4, 5, 6])
e.g. "A - J*2" ==>
data_dict['A'] - data_dict['J'] * 2
The letters will change dynamically.
Use DataFrame.eval, but you first need to create a DataFrame from the dict of Series:
df = pd.DataFrame(data_dict)
print (df)
A C E J
0 1 0 0.5 4
1 2 1 1.5 5
2 3 2 2.5 6
print (df.eval("A - J*2"))
0 -7
1 -8
2 -9
dtype: int64
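As a side note (my addition, not from the original answer): if building a DataFrame is inconvenient, the module-level pd.eval can evaluate the same string directly against the dict through its local_dict argument. A minimal sketch:
import pandas as pd

data_dict = {
    'A': pd.Series([1, 2, 3]),
    'J': pd.Series([4, 5, 6]),
}

# local_dict supplies the names the expression may reference
result = pd.eval("A - J*2", local_dict=data_dict)
print(result)  # 0 -7, 1 -8, 2 -9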
I've answered this question several times in different contexts, and I realized that there isn't a good canonical approach specified anywhere.
So, to set up a simple problem:
Problem
df = pd.DataFrame(dict(A=range(6), B=[1, 2] * 3))
print(df)
A B
0 0 1
1 1 2
2 2 1
3 3 2
4 4 1
5 5 2
Question:
How do I sort by the product of columns 'A' and 'B'?
Here is an approach where I add a temporary column to the dataframe, use it to sort_values, then drop it.
df.assign(P=df.prod(1)).sort_values('P').drop('P', axis=1)
A B
0 0 1
1 1 2
2 2 1
4 4 1
3 3 2
5 5 2
Is there a better, more concise, clearer, more consistent approach?
TL;DR
iloc + argsort
We can approach this using iloc where we can take an array of ordinal positions and return the dataframe reordered by these positions.
With the power of iloc, we can sort with any array that specifies the order.
Now, all we need to do is identify a method for getting this ordering. Turns out there is a method called argsort which does exactly this. By passing the results of argsort to iloc, we can get our dataframe sorted out.
Example 1
Using the specified problem above
df.iloc[df.prod(1).argsort()]
Same results as above
A B
0 0 1
1 1 2
2 2 1
4 4 1
3 3 2
5 5 2
That was for simplicity. If performance is an issue, we could take this further and drop down to numpy:
v = df.values
a = v.prod(1).argsort()
pd.DataFrame(v[a], df.index[a], df.columns)
How fast are these solutions?
We can see that pd_ext_sort is the most concise but does not scale as well as the others.
np_ext_sort gives the best performance at the expense of transparency. Though, I'd argue that it's still very clear what is going on.
backtest setup
import numpy as np
import pandas as pd
from timeit import timeit

def add_drop():
    return df.assign(P=df.prod(1)).sort_values('P').drop('P', axis=1)

def pd_ext_sort():
    return df.iloc[df.prod(1).argsort()]

def np_ext_sort():
    v = df.values
    a = v.prod(1).argsort()
    return pd.DataFrame(v[a], df.index[a], df.columns)

results = pd.DataFrame(
    index=pd.Index([10, 100, 1000, 10000], name='Size'),
    columns=pd.Index(['add_drop', 'pd_ext_sort', 'np_ext_sort'], name='method')
)

for i in results.index:
    df = pd.DataFrame(np.random.rand(i, 2), columns=['A', 'B'])
    for j in results.columns:
        stmt = '{}()'.format(j)
        setup = 'from __main__ import df, {}'.format(j)
        results.loc[i, j] = timeit(stmt, setup, number=100)
results.plot()
Example 2
Suppose I have a column of negative and positive values. I want to sort by increasing magnitude... however, I want the negatives to come first.
Suppose I have dataframe df
df = pd.DataFrame(dict(A=range(-2, 3)))
print(df)
A
0 -2
1 -1
2 0
3 1
4 2
I'll set up 3 versions again. This time I'll use np.lexsort, which returns the same kind of index array as argsort, meaning I can use it to reorder the dataframe.
Caveat: np.lexsort sorts by the last array in its list first.
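A tiny demonstration of that key order (my example, not from the original answer):
import numpy as np

# np.lexsort treats the LAST key as the primary one
order = np.lexsort([[2, 1, 2, 1],   # secondary key
                    [0, 0, 1, 1]])  # primary key
print(order)  # [1 0 3 2]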
def add_drop():
    return df.assign(P=df.A >= 0, M=df.A.abs()).sort_values(['P', 'M']).drop(['P', 'M'], axis=1)

def pd_ext_sort():
    v = df.A.values
    return df.iloc[np.lexsort([np.abs(v), v >= 0])]

def np_ext_sort():
    v = df.A.values
    a = np.lexsort([np.abs(v), v >= 0])
    return pd.DataFrame(v[a, None], df.index[a], df.columns)
All of which return
A
1 -1
0 -2
2 0
3 1
4 2
How fast this time?
In this example, both pd_ext_sort and np_ext_sort outperformed add_drop.
backtest setup
results = pd.DataFrame(
    index=pd.Index([10, 100, 1000, 10000], name='Size'),
    columns=pd.Index(['add_drop', 'pd_ext_sort', 'np_ext_sort'], name='method')
)

for i in results.index:
    df = pd.DataFrame(np.random.randn(i, 1), columns=['A'])
    for j in results.columns:
        stmt = '{}()'.format(j)
        setup = 'from __main__ import df, {}'.format(j)
        results.loc[i, j] = timeit(stmt, setup, number=100)
results.plot(figsize=(15, 6))