pandas.Series : how to get rate of next value - python

Regarding pandas, I want to know how to get the rate of each transition to the next value. The series below is a sample.
import pandas as pd
s = pd.Series([1,2,1,1,1,3])
>>> s
0    1
1    2
2    1
3    1
4    1
5    3
# These are the rates I want to get:
# 1 to 2 : 1/5 (0.2)
# 2 to 1 : 1/5 (0.2)
# 1 to 1 : 2/5 (0.4)
# 1 to 3 : 1/5 (0.2)
Sorry for the bad description, but does anyone know how to do this?

One possible solution with strides: aggregate the counts with GroupBy.size and divide by the length of the DataFrame:
import pandas as pd
import numpy as np
s = pd.Series([1,2,1,1,1,3])
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
df1 = pd.DataFrame(rolling_window(s.values, 2), columns=['from','to'])
df1 = df1.groupby(['from','to'], sort=False).size().div(len(df1)).reset_index(name='rate')
print(df1)
   from  to  rate
0     1   2   0.2
1     2   1   0.2
2     1   1   0.4
3     1   3   0.2
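A lighter-weight sketch of the same idea (my own alternative, not from the answer above) pairs each value with its successor using shift(-1) and normalizes the pair counts:

```python
import pandas as pd

s = pd.Series([1, 2, 1, 1, 1, 3])

# Pair each value with the one following it; shift(-1) leaves a NaN in the
# last row (which has no successor), and dropna() removes that row.
pairs = pd.DataFrame({'from': s, 'to': s.shift(-1)}).dropna()

# Count each (from, to) transition and divide by the total number of transitions.
rates = pairs.groupby(['from', 'to']).size().div(len(pairs))
print(rates)
```

Note that the 'to' column becomes float here, because shift(-1) introduces an intermediate NaN.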

Related

How to efficiently do operation on pandas each group

So I have a data frame like this--
import pandas as pd
import numpy as np
df = pd.DataFrame([[1,22], [1,23], [1,44], [2, 33], [2, 55]], columns=['id', 'delay'])
   id  delay
0   1     22
1   1     23
2   1     44
3   2     33
4   2     55
What I am doing is grouping by id and performing a rolling operation on the delay column, as below:
k = [0.1, 0.5, 1]
def f(d):
    # Series.append was removed in pandas 2.0, so use pd.concat instead
    d['new_delay'] = pd.concat([pd.Series([0, 0]), d['delay']]).rolling(window=3).apply(lambda x: np.sum(x * k)).iloc[2:]
    return d
df.groupby(['id']).apply(f)
   id  delay  new_delay
0   1     22       22.0
1   1     23       34.0
2   1     44       57.7
3   2     33       33.0
4   2     55       71.5
It works just fine, but I am curious whether .apply on a grouped DataFrame is vectorized. Since my dataset is huge, is there a better, vectorized way to do this kind of operation? I am also curious how pandas and NumPy achieve vectorized calculation, given that Python is single-threaded and I am running on a CPU.
You can use strides for vectorized rolling with GroupBy.transform:
k = [0.1, 0.5, 1]
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
def f(d):
    return np.sum(rolling_window(np.append([0, 0], d.to_numpy()), 3) * k, axis=1)
df['new_delay'] = df.groupby('id')['delay'].transform(f)
print(df)
   id  delay  new_delay
0   1     22       22.0
1   1     23       34.0
2   1     44       57.7
3   2     33       33.0
4   2     55       71.5
Another option is to use np.convolve() instead:
# Our function
f = lambda x: np.convolve(np.array([1,0.5,0.1]),x)[:len(x)]
# Groupby + Transform
df['new_delay'] = df.groupby('id')['delay'].transform(f)
I don't know whether it is faster or not.
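To see why the convolution reproduces the weighted rolling sum: np.convolve slides one sequence over the other reversed, so listing the weights as [1, 0.5, 0.1] (most recent value first) matches multiplying a rolling window by k = [0.1, 0.5, 1]. A quick check (my own, not from the answer) on the delays of id == 1:

```python
import numpy as np

x = np.array([22, 23, 44])        # delay values for id == 1
kernel = np.array([1, 0.5, 0.1])  # weights with the most recent value first

# Truncate the full convolution to the length of the input; each entry i is
# x[i]*1 + x[i-1]*0.5 + x[i-2]*0.1, with missing values treated as zero.
out = np.convolve(kernel, x)[:len(x)]  # matches new_delay: 22.0, 34.0, 57.7
```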
Here is one approach with groupby + rolling, applying a custom function compiled with numba:
def func(v):
    k = np.array([0.1, 0.5, 1])
    return np.sum(v * k[len(k) - len(v):])

(
    df.groupby('id')['delay']
    .rolling(3, min_periods=1)
    .apply(func, raw=True, engine='numba')
    .droplevel(0)
)
0    22.0
1    34.0
2    57.7
3    33.0
4    71.5
Name: delay, dtype: float64

How to multiply the previous value of another column with the value of x column (shift)

I have the following pandas df, which consists of 3 factor columns and 3 derived signal columns.
import pandas as pd
data = [
    [0.1, -0.1, 0.1],
    [-0.1, 0.2, 0.3],
    [0.3, 0.1, 0.3],
    [0.1, 0.3, -0.2]
]
df = pd.DataFrame(data, columns=['factor_A', 'factor_B', 'factor_C'])
for col in df:
    new_name = col + '_signal'
    df[new_name] = [1 if x > 0 else -1 for x in df[col]]
print(df)
This gives me the following output:
   factor_A  factor_B  factor_C  factor_A_signal  factor_B_signal  factor_C_signal
0       0.1      -0.1       0.1                1               -1                1
1      -0.1       0.2       0.3               -1                1                1
2       0.3       0.1       0.3                1                1                1
3       0.1       0.3      -0.2                1                1               -1
Now, for a 1-month holding period, I have to multiply factor_A by the previous factor_A_signal, add factor_B multiplied by the previous factor_B_signal, divide by the number of factors (in this case 2), and store the result in a new column ("ret_1m"). At the moment I cannot say how many factors I will have as input, so I have to work with a for loop.
For a 2-month holding period, I have to multiply the t+1 factor_A by the previous factor_A_signal, add the t+1 factor_B multiplied by the previous factor_B_signal, divide by the number of factors, and store the result in a new column ("ret_2m"), and so on up to the 12th month.
To show an example, I would do that for 2 factors and a 3-month holding period as follows:
import pandas as pd
data = [
    [0.1, -0.1],
    [-0.1, 0.2],
    [0.3, 0.1],
    [0.1, 0.3]
]
df = pd.DataFrame(data, columns=['factor_A', 'factor_B'])
for col in df:
    new_name = col + '_signal'
    df[new_name] = [1 if x > 0 else -1 for x in df[col]]
print(df)

def one_three(n_factors):
    df["ret_1m"] = (df['factor_A_signal'].shift() * df["factor_A"] +
                    df['factor_B_signal'].shift() * df["factor_B"]) / n_factors
    df["ret_2m"] = (df['factor_A_signal'].shift() * df["factor_A"].shift(-1) +
                    df['factor_B_signal'].shift() * df["factor_B"].shift(-1)) / n_factors
    df["ret_3m"] = (df['factor_A_signal'].shift() * df["factor_A"].shift(-2) +
                    df['factor_B_signal'].shift() * df["factor_B"].shift(-2)) / n_factors
    return df

one_three(2)
Output:
   factor_A  factor_B  factor_A_signal  factor_B_signal  ret_1m  ret_2m  ret_3m
0       0.1      -0.1                1               -1     NaN     NaN     NaN
1      -0.1       0.2               -1                1   -0.15     0.1    -0.1
2       0.3       0.1                1                1   -0.10     0.1     NaN
3       0.1       0.3                1                1    0.20     NaN     NaN
How could I automate this with a for loop? Thank you very much in advance.
A for loop version of your function one_three(n_factors):
# Create a list of columns in the dataframe that are not signals
factors = [x for x in df.columns if not x.endswith("_signal")]
# Loop through the range from 1 to 1 + number of months (in your example 3)
for i in range(1, 3 + 1):
    name = "ret_" + str(i) + "m"
    df[name] = 0
    for x in factors:
        df[name] += df[x + "_signal"].shift() * df[x].shift(1 - i)
    df[name] /= len(factors)
This assumes you have already populated the factor_ columns and run the signal loop. The first section finds all columns that do not end with _signal and returns a list; otherwise you could use an explicit list such as ['factor_A', 'factor_B', ...]. Looping through the number of months (here 3, following your example), the computation then loops through all items in the list.
The output for this matched your output with the given input data.
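Wrapped into a reusable function (a sketch; the name ret_columns and the f-string column naming are my own, not from the answer above), the same loop generalizes to any number of months:

```python
import pandas as pd

def ret_columns(df, n_months):
    # Factor columns are those without the "_signal" suffix.
    factors = [c for c in df.columns if not c.endswith('_signal')]
    for i in range(1, n_months + 1):
        # Previous signal times the (i - 1)-month-ahead factor, averaged
        # over all factors; shift() alignment produces the NaN edge rows.
        df[f'ret_{i}m'] = sum(
            df[f + '_signal'].shift() * df[f].shift(1 - i) for f in factors
        ) / len(factors)
    return df

data = [[0.1, -0.1], [-0.1, 0.2], [0.3, 0.1], [0.1, 0.3]]
df = pd.DataFrame(data, columns=['factor_A', 'factor_B'])
for col in list(df):  # snapshot so adding signal columns doesn't affect iteration
    df[col + '_signal'] = [1 if x > 0 else -1 for x in df[col]]
print(ret_columns(df, 3))
```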

How to store the outputs of a function applied over an iterable

This is probably very simple, but I cannot figure out the proper way to produce a pandas DataFrame from the outputs of my function.
Let's say I have a function that divides each element of a list (leaving aside the easiest way to divide a list):
X = [1, 2, 3, 4, 5, 6]
for i in X:
    def SUM(X):
        output = i / 2
        return output
    df = SUM(X)
At the end, 'df' holds only the result of the last operation performed by my function. How can I collect all the outputs in a DataFrame?
Thanks for your suggestions.
Why not create the DataFrame in the first step and then process the column values with Series.apply?
import pandas as pd

X = [1, 2, 3, 4, 5, 6]

def SUM(X):
    output = X / 2
    return output

df = pd.DataFrame({'in': X})
df['out'] = df['in'].apply(SUM)
print(df)
   in  out
0   1  0.5
1   2  1.0
2   3  1.5
3   4  2.0
4   5  2.5
5   6  3.0
If you want to keep your own function, use a list comprehension:
import pandas as pd

X = [1, 2, 3, 4, 5, 6]

def SUM(X):
    output = X / 2
    return output

out = [SUM(i) for i in X]
df = pd.DataFrame({'out': out})
print(df)
print (df)
   out
0  0.5
1  1.0
2  1.5
3  2.0
4  2.5
5  3.0

window based weighted average in pandas

I am trying to compute a window-based weighted average of two columns.
For example, say I have a value column "a" and a weighting column "b":
    a  b
1:  1  2
2:  2  3
3:  3  4
With a trailing window of 2 (although I'd like to work with a variable window length),
my third column should be the weighted average "c", where rows that do not have enough previous data for a full window are NaN:
    c
1:  nan
2:  (1 * 2 + 2 * 3) / (2 + 3) = 1.6
3:  (2 * 3 + 3 * 4) / (3 + 4) = 2.57
For your particular case of a window of 2, you may use prod and shift:
s = df.prod(1)
(s + s.shift()) / (df.b + df.b.shift())
Out[189]:
1         NaN
2    1.600000
3    2.571429
dtype: float64
On sample df2:
       a      b
0  73.78  51.46
1  73.79  27.84
2  73.79  34.35

s = df2.prod(1)
(s + s.shift()) / (df2.b + df2.b.shift())
Out[193]:
0          NaN
1    73.783511
2    73.790000
dtype: float64
This method also generalizes to a variable window length; you just need an additional list comprehension and sum.
Try it on the sample df2 above:
s = df2.prod(1)
m = 2 #window length 2
sum([s.shift(x) for x in range(m)]) / sum([df2.b.shift(x) for x in range(m)])
Out[214]:
0          NaN
1    73.783511
2    73.790000
dtype: float64
On window length 3
m = 3 #window length 3
sum([s.shift(x) for x in range(m)]) / sum([df2.b.shift(x) for x in range(m)])
Out[215]:
0          NaN
1          NaN
2    73.785472
dtype: float64
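The same weighted average can also be written with pandas' built-in rolling sums, which handles any window length and produces the NaN rows automatically (a sketch of my own, not from the answers above):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4]}, index=[1, 2, 3])
m = 2  # trailing window length

# Rolling sum of a*b divided by the rolling sum of the weights b;
# rows without a full window of m values come out as NaN.
c = (df['a'] * df['b']).rolling(m).sum() / df['b'].rolling(m).sum()
print(c)
```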

Is it possible to use a custom filter function in pandas?

Can I use my helper function, which determines whether a shot was a three-pointer, as a filter function in pandas? My actual function is much more complex, but I simplified it for this question.
def isThree(x, y):
    return (x + y == 3)

print(data[isThree(data['x'], data['y'])].head())
Yes:
import numpy as np
import pandas as pd
data = pd.DataFrame({'x': np.random.randint(1, 3, 10),
                     'y': np.random.randint(1, 3, 10)})
print(data)
Output:
   x  y
0  1  2
1  2  1
2  2  1
3  1  2
4  2  1
5  2  1
6  2  1
7  2  1
8  2  1
9  2  2
def isThree(x, y):
    return (x + y == 3)

print(data[isThree(data['x'], data['y'])].head())
Output:
   x  y
0  1  2
1  2  1
2  2  1
3  1  2
4  2  1
Yes, so long as your function returns a Boolean Series with the same index you can slice your original DataFrame with the output. In this simple example, we can pass Series to your function:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 4, (30, 2)))
def isThree(x, y):
    return x + y == 3
df[isThree(df[0], df[1])]
#     0  1
# 2   2  1
# 5   2  1
# 9   0  3
# 11  2  1
# 12  0  3
# 13  2  1
# 27  3  0
In this case, I would recommend using np.where(). See the following example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'x': [1,2,4,2,3,1,2,3,4,0], 'y': [0,1,2,0,0,2,4,0,1,2]})
df['3 Pointer'] = np.where(df['x']+df['y']==3, 1, 0)
Yields:
   x  y  3 Pointer
0  1  0          0
1  2  1          1
2  4  2          0
3  2  0          0
4  3  0          1
5  1  2          1
6  2  4          0
7  3  0          1
8  4  1          0
9  0  2          0
You can use np.vectorize. The documentation is here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.vectorize.html
def isThree(x, y):
    return (x + y == 3)

df = pd.DataFrame({'A': [1, 2], 'B': [2, 0]})
df['new_column'] = np.vectorize(isThree)(df['A'], df['B'])
