Using pandas, how can I compute the rate of each transition to the next value? Below is a sample series.
import pandas as pd
s = pd.Series([1,2,1,1,1,3])
>>> s
0    1
1    2
2    1
3    1
4    1
5    3
# What I wanna get are below rates.
# 1 to 2 : 1/5(0.2)
# 2 to 1 : 1/5(0.2)
# 1 to 1 : 2/5(0.4)
# 1 to 3 : 1/5(0.2)
Sorry for the poor description, but does anyone know how to do this?
One possible solution uses strides: aggregate counts with GroupBy.size and divide by the length of the DataFrame:
import pandas as pd
import numpy as np
s = pd.Series([1,2,1,1,1,3])
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
df1 = pd.DataFrame(rolling_window(s.values, 2), columns=['from','to'])
df1 = df1.groupby(['from','to'], sort=False).size().div(len(df1)).reset_index(name='rate')
print(df1)
   from  to  rate
0     1   2   0.2
1     2   1   0.2
2     1   1   0.4
3     1   3   0.2
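If raw strides feel heavy, the same table can be built with a plain shift; a minimal sketch (column names from/to/rate chosen to match the output above):

```python
import pandas as pd

s = pd.Series([1, 2, 1, 1, 1, 3])

# Pair each value with its predecessor: shift() gives the previous value,
# so row i holds the transition s[i-1] -> s[i]; drop the first row (no predecessor).
trans = pd.DataFrame({'from': s.shift(), 'to': s}).iloc[1:]

# Count each (from, to) pair and divide by the number of transitions.
rate = (trans.groupby(['from', 'to'], sort=False)
             .size()
             .div(len(trans))
             .reset_index(name='rate'))
print(rate)
```

Note that `shift()` introduces a NaN, so the `from` column comes out as float; cast it back with `astype(int)` if that matters.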
I have a data frame like this:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1,22], [1,23], [1,44], [2, 33], [2, 55]], columns=['id', 'delay'])
   id  delay
0   1     22
1   1     23
2   1     44
3   2     33
4   2     55
What I am doing is grouping by id and applying a rolling operation on the delay column, like below:
k = [0.1, 0.5, 1]
def f(d):
    d['new_delay'] = pd.Series([0, 0]).append(d['delay']).rolling(window=3).apply(lambda x: np.sum(x * k)).iloc[2:]
    return d
df.groupby(['id']).apply(f)
   id  delay  new_delay
0   1     22       22.0
1   1     23       34.0
2   1     44       57.7
3   2     33       33.0
4   2     55       71.5
It works just fine, but I am curious whether .apply on a grouped data frame is vectorized. Since my dataset is huge, is there a better, vectorized way to do this kind of operation? I am also curious how pandas and NumPy achieve vectorized calculation, given that Python is single-threaded and I am running on a CPU.
You can use strides for vectorized rolling with GroupBy.transform:
k = [0.1, 0.5, 1]
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def f(d):
    return np.sum(rolling_window(np.append([0, 0], d.to_numpy()), 3) * k, axis=1)
df['new_delay'] = df.groupby('id')['delay'].transform(f)
print(df)
   id  delay  new_delay
0   1     22       22.0
1   1     23       34.0
2   1     44       57.7
3   2     33       33.0
4   2     55       71.5
Another option is to use np.convolve() instead:
# Our function
f = lambda x: np.convolve(np.array([1,0.5,0.1]),x)[:len(x)]
# Groupby + Transform
df['new_delay'] = df.groupby('id')['delay'].transform(f)
Don't know if it's faster or not.
Here is one approach with groupby + rolling, applying a custom function compiled with numba:
def func(v):
    k = np.array([0.1, 0.5, 1])
    return np.sum(v * k[len(k) - len(v):])
(
    df.groupby('id')['delay']
      .rolling(3, min_periods=1)
      .apply(func, raw=True, engine='numba')
      .droplevel(0)
)
0 22.0
1 34.0
2 57.7
3 33.0
4 71.5
Name: delay, dtype: float64
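On NumPy 1.20+, the hand-rolled strides helper can be replaced by np.lib.stride_tricks.sliding_window_view, which is the same trick with bounds checking. A sketch of the transform-based answer using it (same df and weights k as above):

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df = pd.DataFrame([[1, 22], [1, 23], [1, 44], [2, 33], [2, 55]],
                  columns=['id', 'delay'])
k = np.array([0.1, 0.5, 1.0])

def weighted(d):
    # Pad with two leading zeros so every row gets a full 3-wide window,
    # then take the dot product of each window with the weights.
    padded = np.append([0, 0], d.to_numpy())
    return sliding_window_view(padded, 3) @ k

df['new_delay'] = df.groupby('id')['delay'].transform(weighted)
print(df)
```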
I have the following pandas df, which consists of factor columns plus a signal column derived from each factor.
import pandas as pd
data = [
[0.1,-0.1,0.1],
[-0.1,0.2,0.3],
[0.3,0.1,0.3],
[0.1,0.3,-0.2]
]
df = pd.DataFrame(data, columns=['factor_A', 'factor_B', 'factor_C'])
for col in df:
    new_name = col + '_signal'
    df[new_name] = [1 if x > 0 else -1 for x in df[col]]
print(df)
This gives me the following output:
   factor_A  factor_B  factor_C  factor_A_signal  factor_B_signal  factor_C_signal
0       0.1      -0.1       0.1                1               -1                1
1      -0.1       0.2       0.3               -1                1                1
2       0.3       0.1       0.3                1                1                1
3       0.1       0.3      -0.2                1                1               -1
Now, for a 1-month holding period, I have to multiply factor_A by the previous factor_A_signal, add factor_B multiplied by the previous factor_B_signal, divide by the number of factors (in this case 2), and store the result in a new column ("ret_1m"). At the moment I cannot say how many factors I will have as input, so I have to work with a for loop.
For a 2-month holding period I have to multiply the t+1 factor_A by the previous factor_A_signal, add the t+1 factor_B multiplied by the previous factor_B_signal, divide by the number of factors, and store it in a new column ("ret_2m"), and so on up to the 12th month.
As an example, here is how I would do that with 2 factors for a 3-month holding period:
import pandas as pd
data = [
[0.1,-0.1],
[-0.1,0.2],
[0.3,0.1],
[0.1,0.3]
]
df = pd.DataFrame(data, columns=['factor_A', 'factor_B'])
for col in df:
    new_name = col + '_signal'
    df[new_name] = [1 if x > 0 else -1 for x in df[col]]
print(df)
def one_three(n_factors):
    df["ret_1m"] = (df['factor_A_signal'].shift() * df["factor_A"] +
                    df['factor_B_signal'].shift() * df["factor_B"]) / n_factors
    df["ret_2m"] = (df['factor_A_signal'].shift() * df["factor_A"].shift(-1) +
                    df['factor_B_signal'].shift() * df["factor_B"].shift(-1)) / n_factors
    df["ret_3m"] = (df['factor_A_signal'].shift() * df["factor_A"].shift(-2) +
                    df['factor_B_signal'].shift() * df["factor_B"].shift(-2)) / n_factors
    return df
one_three(2)
Output:
   factor_A  factor_B  factor_A_signal  factor_B_signal  ret_1m  ret_2m  ret_3m
0       0.1      -0.1                1               -1     NaN     NaN     NaN
1      -0.1       0.2               -1                1   -0.15     0.1    -0.1
2       0.3       0.1                1                1   -0.10     0.1     NaN
3       0.1       0.3                1                1    0.20     NaN     NaN
How could I automate this with a for loop? Thank you very much in advance.
Here is a for loop version of your function one_three(n_factors):
# Create a list of the columns in the dataframe that are not signals
factors = [x for x in df.columns if not x.endswith("_signal")]
# Loop through the months, from 1 to the number of months (3 in your example)
for i in range(1, 3 + 1):
    name = "ret_" + str(i) + "m"
    df[name] = 0
    for x in factors:
        df[name] += df[x + "_signal"].shift() * df[x].shift(1 - i)
    df[name] /= len(factors)
This assumes you have already populated the factor_ columns and run the signal loop. The first line finds all columns that do not end with _signal and returns them as a list; alternatively, you could hard-code a list such as ['factor_A', 'factor_B', ...]. The outer loop runs over the number of months (3 here, following your example), and the inner loop adds the contribution of each factor in the list.
The output matches yours for the given input data.
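Putting the pieces together, here is a self-contained sketch (using the two-factor example data and the 3-month horizon from the question):

```python
import pandas as pd

df = pd.DataFrame([[0.1, -0.1], [-0.1, 0.2], [0.3, 0.1], [0.1, 0.3]],
                  columns=['factor_A', 'factor_B'])

# Snapshot the factor names before adding columns, then build the signals.
factors = list(df.columns)
for col in factors:
    df[col + '_signal'] = [1 if x > 0 else -1 for x in df[col]]

# One column per holding period: previous signal times the t+(i-1) factor value.
for i in range(1, 4):
    name = 'ret_%dm' % i
    df[name] = 0.0
    for col in factors:
        df[name] += df[col + '_signal'].shift() * df[col].shift(1 - i)
    df[name] /= len(factors)
print(df)
```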
Probably this is very simple, but I cannot figure out the proper way to build a pandas DataFrame from the outputs of my function.
Say I have a function that divides each element of a list (setting aside that there are easier ways to divide a list):
X = [1,2,3,4,5,6]
for i in X:
    def SUM(X):
        output = i / 2
        return output
    df = SUM(X)
At the end, 'df' holds only the result of the last iteration. How can I collect all the outputs in a DataFrame?
Thanks for your suggestions.
Why not create the DataFrame in the first step and then process the column values with Series.apply?
import pandas as pd

X = [1,2,3,4,5,6]
def SUM(X):
    output = X / 2
    return output
df = pd.DataFrame({'in':X})
df['out'] = df['in'].apply(SUM)
print(df)
   in  out
0   1  0.5
1   2  1.0
2   3  1.5
3   4  2.0
4   5  2.5
5   6  3.0
If you want to keep your approach, it can be used like this:
X = [1,2,3,4,5,6]
def SUM(X):
    output = X / 2
    return output
out = [SUM(i) for i in X]
df = pd.DataFrame({'out':out})
print(df)
   out
0  0.5
1  1.0
2  1.5
3  2.0
4  2.5
5  3.0
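For an elementwise operation this simple, both apply and the list comprehension can be skipped entirely; arithmetic on a Series is already vectorized. A sketch with the same data:

```python
import pandas as pd

X = [1, 2, 3, 4, 5, 6]
df = pd.DataFrame({'in': X})
df['out'] = df['in'] / 2  # vectorized: divides every element at once
print(df)
```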
I am trying to do a window-based weighted average of two columns.
For example, say I have my value column "a" and my weighting column "b":
a b
1: 1 2
2: 2 3
3: 3 4
With a trailing window of 2 (although I'd like it to work with a variable window length), my third column should be the weighted average "c", where rows that do not have enough previous data for a full weighted average are NaN:
c
1: nan
2: (1 * 2 + 2 * 3) / (2 + 3) = 1.6
3: (2 * 3 + 3 * 4) / (3 + 4) = 2.57
For your particular case of a window of 2, you may use prod and shift:
s = df.prod(1)
(s + s.shift()) / (df.b + df.b.shift())
Out[189]:
1 NaN
2 1.600000
3 2.571429
dtype: float64
On sample df2:
a b
0 73.78 51.46
1 73.79 27.84
2 73.79 34.35
s = df2.prod(1)
(s + s.shift()) / (df2.b + df2.b.shift())
Out[193]:
0 NaN
1 73.783511
2 73.790000
dtype: float64
This method extends to a variable window length; you just need an additional list comprehension and sum.
Try it on the sample df2 above:
s = df2.prod(1)
m = 2 #window length 2
sum([s.shift(x) for x in range(m)]) / sum([df2.b.shift(x) for x in range(m)])
Out[214]:
0 NaN
1 73.783511
2 73.790000
dtype: float64
On window length 3
m = 3 #window length 3
sum([s.shift(x) for x in range(m)]) / sum([df2.b.shift(x) for x in range(m)])
Out[215]:
0 NaN
1 NaN
2 73.785472
dtype: float64
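The shift-and-sum pattern above is equivalent to a pair of rolling sums, which may read more clearly for arbitrary window lengths. A sketch using the question's original data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4]})
m = 2  # window length

# Weighted average = rolling sum of a*b divided by rolling sum of the weights b.
num = (df['a'] * df['b']).rolling(m).sum()
den = df['b'].rolling(m).sum()
df['c'] = num / den
print(df)
```

rolling(m) produces NaN for the first m-1 rows automatically, matching the requested behavior.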
Can I use my helper function, which determines whether a shot was a three-pointer, as a filter function in pandas? My actual function is much more complex, but I simplified it for this question.
def isThree(x, y):
    return (x + y == 3)

print(data[isThree(data['x'], data['y'])].head())
Yes:
import numpy as np
import pandas as pd
data = pd.DataFrame({'x': np.random.randint(1,3,10),
                     'y': np.random.randint(1,3,10)})
print(data)
Output:
   x  y
0  1  2
1  2  1
2  2  1
3  1  2
4  2  1
5  2  1
6  2  1
7  2  1
8  2  1
9  2  2
def isThree(x, y):
    return (x + y == 3)
print(data[isThree(data['x'], data['y'])].head())
Output:
   x  y
0  1  2
1  2  1
2  2  1
3  1  2
4  2  1
Yes, so long as your function returns a Boolean Series with the same index you can slice your original DataFrame with the output. In this simple example, we can pass Series to your function:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 4, (30, 2)))
def isThree(x, y):
    return x + y == 3
df[isThree(df[0], df[1])]
#     0  1
# 2   2  1
# 5   2  1
# 9   0  3
# 11  2  1
# 12  0  3
# 13  2  1
# 27  3  0
In this case, I would recommend using np.where(). See the following example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'x': [1,2,4,2,3,1,2,3,4,0], 'y': [0,1,2,0,0,2,4,0,1,2]})
df['3 Pointer'] = np.where(df['x']+df['y']==3, 1, 0)
Yields:
   x  y  3 Pointer
0  1  0          0
1  2  1          1
2  4  2          0
3  2  0          0
4  3  0          1
5  1  2          1
6  2  4          0
7  3  0          1
8  4  1          0
9  0  2          0
You can use np.vectorize. Documentation is here https://docs.scipy.org/doc/numpy/reference/generated/numpy.vectorize.html
import numpy as np
import pandas as pd

def isThree(x, y):
    return (x + y == 3)

df = pd.DataFrame({'A': [1, 2], 'B': [2, 0]})
df['new_column'] = np.vectorize(isThree)(df['A'], df['B'])
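Worth noting: np.vectorize is essentially a Python-level loop under the hood (the NumPy docs describe it as a convenience, not a performance tool), so for arithmetic like this a plain boolean comparison on the Series is both simpler and faster. A sketch with made-up data:

```python
import pandas as pd

data = pd.DataFrame({'x': [1, 2, 4, 2], 'y': [2, 1, 2, 0]})

# The comparison broadcasts over the whole Series, yielding a boolean mask.
mask = data['x'] + data['y'] == 3
print(data[mask])
```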