I tried to construct a column without using a for loop; however, each value uses the value calculated at the previous step.
Example:

f-
0    0.3
1    0.4
2    0.45
3    0.6

f+
0    0.21
1    0.78
2    0.54
3    0.9

Matrix P
     0    1
0  0.9  0.1
1  0.1  0.9

DataFrame df to be filled
     0    1    2
0  0.5  ...  ...
1  0.5  ...  ...
Then column 1 of DataFrame df becomes:

prob = f-[0,0] * df[0,0] / (f-[0,0] * df[0,0] + f+[0,0] * df[0,1])
df[1] = P.dot([prob, 1 - prob])
Here is the code with a for loop:

for i in np.arange(0, n):
    p = f_m[i+1] * df.iloc[0, i] / (f_m[i+1] * df.iloc[0, i] + f_p[i+1] * df.iloc[1, i])
    xi[i + 1] = p_matrix.dot(np.array([p, 1 - p]))
Does anyone have a solution for creating this without a for loop? Each subsequent column would then be calculated the same way.
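For reference, here is a minimal self-contained sketch of the loop version, with hypothetical placeholder data standing in for the f-/f+ series (f_m/f_p), the matrix P (p_matrix), and the initial column, and storing each new column directly in df instead of xi. Since every column depends on the one before it, the recurrence itself cannot be vectorised directly:

import numpy as np
import pandas as pd

# Hypothetical data matching the shapes described above
f_m = np.array([0.3, 0.4, 0.45, 0.6])    # the f- series
f_p = np.array([0.21, 0.78, 0.54, 0.9])  # the f+ series
p_matrix = np.array([[0.9, 0.1],
                     [0.1, 0.9]])        # the 2x2 matrix P

n = 3
df = pd.DataFrame(0.0, index=[0, 1], columns=range(n))
df[0] = 0.5                              # initial column

for i in range(n - 1):
    # probability derived from the previous column
    p = f_m[i + 1] * df.iloc[0, i] / (f_m[i + 1] * df.iloc[0, i]
                                      + f_p[i + 1] * df.iloc[1, i])
    df[i + 1] = p_matrix.dot(np.array([p, 1 - p]))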
I have the following pandas df which consists of 2 factor-columns and 2 signal-columns.
import pandas as pd

data = [
    [0.1, -0.1, 0.1],
    [-0.1, 0.2, 0.3],
    [0.3, 0.1, 0.3],
    [0.1, 0.3, -0.2]
]
df = pd.DataFrame(data, columns=['factor_A', 'factor_B', 'factor_C'])

for col in df:
    new_name = col + '_signal'
    df[new_name] = [1 if x > 0 else -1 for x in df[col]]
print(df)
This gives me the following output:
factor_A factor_B factor_C factor_A_signal factor_B_signal factor_C_signal
0 0.1 -0.1 0.1 1 -1 1
1 -0.1 0.2 0.3 -1 1 1
2 0.3 0.1 0.3 1 1 1
3 0.1 0.3 -0.2 1 1 -1
Now, for a 1-month holding period, I have to multiply factor_A by the previous factor_A_signal, add factor_B multiplied by the previous factor_B_signal, divide by the number of factors (in this case 2), and store the result in a new column ("ret_1m"). At the moment I cannot say how many factors I will have as input, so I have to work with a for loop.
For a 2-month holding period, I have to multiply the t+1 factor_A by the previous factor_A_signal, add the t+1 factor_B multiplied by the previous factor_B_signal, divide by the number of factors, and store the result in a new column ("ret_2m"), and so on up to the 12th month.
As an example, here is how I would do it for 2 factors and a 3-month holding period:
import pandas as pd

data = [
    [0.1, -0.1],
    [-0.1, 0.2],
    [0.3, 0.1],
    [0.1, 0.3]
]
df = pd.DataFrame(data, columns=['factor_A', 'factor_B'])

for col in df:
    new_name = col + '_signal'
    df[new_name] = [1 if x > 0 else -1 for x in df[col]]
print(df)

def one_three(n_factors):
    df["ret_1m"] = (df['factor_A_signal'].shift() * df["factor_A"] +
                    df['factor_B_signal'].shift() * df["factor_B"]) / n_factors
    df["ret_2m"] = (df['factor_A_signal'].shift() * df["factor_A"].shift(-1) +
                    df['factor_B_signal'].shift() * df["factor_B"].shift(-1)) / n_factors
    df["ret_3m"] = (df['factor_A_signal'].shift() * df["factor_A"].shift(-2) +
                    df['factor_B_signal'].shift() * df["factor_B"].shift(-2)) / n_factors
    return df

one_three(2)
Output:
factor_A factor_B factor_A_signal factor_B_signal ret_1m ret_2m ret_3m
0 0.1 -0.1 1 -1 NaN NaN NaN
1 -0.1 0.2 -1 1 -0.15 0.1 -0.1
2 0.3 0.1 1 1 -0.10 0.1 NaN
3 0.1 0.3 1 1 0.20 NaN NaN
How could I automate this with a for loop? Thank you very much in advance.
Here is a for loop version of your function one_three(n_factors):
# Create a list of the columns in the dataframe that are not signals
factors = [x for x in df.columns if not x.endswith("_signal")]

# Loop through the range from 1 to 1 + the number of months (in your example 3)
for i in range(1, 3 + 1):
    name = "ret_" + str(i) + "m"
    df[name] = 0
    for x in factors:
        df[name] += df[x + "_signal"].shift() * df[x].shift(1 - i)
    df[name] /= len(factors)
This assumes you have already populated the factor_ columns and then run the signal loop. The first section finds all columns that do not end with _signal and returns them as a list - otherwise you could supply an explicit list such as ['factor_A', 'factor_B', ...]. The outer loop runs over the number of months (here 3, following your example), and the inner loop runs the computation over every factor in the list.
The output for this matched your output with the given input data.
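Wrapped as a function, the same loop generalises to any number of months. A sketch (ret_columns and n_months are hypothetical names, not from the original post):

def ret_columns(df, n_months):
    # columns that are factors, i.e. do not end with "_signal"
    factors = [c for c in df.columns if not c.endswith("_signal")]
    for i in range(1, n_months + 1):
        name = "ret_" + str(i) + "m"
        df[name] = 0.0
        for c in factors:
            # previous signal times the factor value i-1 months ahead
            df[name] += df[c + "_signal"].shift() * df[c].shift(1 - i)
        df[name] /= len(factors)
    return df

ret_columns(df, 3)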
I have ~1.2k files that when converted into dataframes look like this:
df1
A B C D
0 0.1 0.5 0.2 C
1 0.0 0.0 0.8 C
2 0.5 0.1 0.1 H
3 0.4 0.5 0.1 H
4 0.0 0.0 0.8 C
5 0.1 0.5 0.2 C
6 0.1 0.5 0.2 C
Now, I have to subset each dataframe with a window of fixed size along the rows, and add its contents to a second dataframe, with all its values originally initialized to 0.
df_sum
A B C
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
For example, let's set the window size to 3. The first subset therefore will be
window = df.loc[start:end, 'A':'C']
window
A B C
0 0.1 0.5 0.2
1 0.0 0.0 0.8
2 0.5 0.1 0.1
window.index = correct_index
df_sum = df_sum.add(window, fill_value=0)
df_sum
A B C
0 0.1 0.5 0.2
1 0.0 0.0 0.8
2 0.5 0.1 0.1
After that, the window will be the subset of df1 from rows 1-4, then rows 2-5, and finally rows 3-6. Once the first file has been scanned, the second file will begin, until all files have been processed. As you can see, this approach relies on df.loc for the subsetting and df.add for the addition. However, despite the ease of coding, it is very inefficient: on my machine it takes about 5 minutes to process the whole batch of 1.2k files of 200 lines each. I know that an implementation based on NumPy arrays is orders of magnitude faster (about 10 seconds), but a bit more complicated in terms of subsetting and adding. Is there any way to increase the performance of this method while still using dataframes? For example, substituting loc with a more performant slicing method.
Example:
def generate_index_list(window_size):
    before_offset = -(window_size - 1) // 2
    after_offset = (window_size - 1) // 2
    index_list = []
    for n in range(before_offset, after_offset + 1):
        # shift the relative offsets so the labels run 0 .. window_size-1,
        # matching the index of df_sum
        index_list.append(n + after_offset)
    return index_list

window_size = 3
for file in os.listdir('.'):
    df1 = pd.read_csv(file, sep='\t')
    starting_index = (window_size - 1) // 2
    before_offset = (window_size - 1) // 2
    after_offset = (window_size - 1) // 2
    for index in df1.index:  # iterate over row labels (iterrows would yield tuples)
        # skip rows whose window would run past either edge
        if index < starting_index or index + before_offset + 1 > len(df1.index):
            continue
        indexes = generate_index_list(window_size)
        window = df1.loc[index - before_offset:index + after_offset, 'A':'C']
        window.index = indexes
        df_sum = df_sum.add(window, fill_value=0)
Expected output:
df_sum
A B C
0 1.0 1.1 2.0
1 1.0 1.1 2.0
2 1.1 1.6 1.4
Consider building a list of subsetted data frames with .loc and .head, then running a groupby aggregation after the individual pieces are concatenated.
window_size = 3

def window_process(file):
    csv_df = pd.read_csv(file, sep='\t')
    window_dfs = [(csv_df.loc[i:, ['A', 'B', 'C']]    # ROW AND COLUMN SLICE
                   .head(window_size)                 # SELECT FIRST WINDOW ROWS
                   .reset_index(drop=True)            # RESET INDEX TO 0, 1, 2, ...
                  ) for i in range(csv_df.shape[0])]
    sum_df = (pd.concat(window_dfs)                   # COMBINE WINDOW DFS
              .groupby(level=0).sum())                # AGGREGATE BY INDEX
    return sum_df

# BUILD LONG DF FROM ALL FILES
long_df = pd.concat([window_process(file) for file in os.listdir('.')])

# FINAL AGGREGATION
df_sum = long_df.groupby(level=0).sum()
Using the posted data sample, the individual window_dfs are:

     A    B    C
0  0.1  0.5  0.2
1  0.0  0.0  0.8
2  0.5  0.1  0.1

     A    B    C
0  0.0  0.0  0.8
1  0.5  0.1  0.1
2  0.4  0.5  0.1

     A    B    C
0  0.5  0.1  0.1
1  0.4  0.5  0.1
2  0.0  0.0  0.8

     A    B    C
0  0.4  0.5  0.1
1  0.0  0.0  0.8
2  0.1  0.5  0.2

     A    B    C
0  0.0  0.0  0.8
1  0.1  0.5  0.2
2  0.1  0.5  0.2

     A    B    C
0  0.1  0.5  0.2
1  0.1  0.5  0.2

     A    B    C
0  0.1  0.5  0.2
And the final df_sum, which matches the result of the DataFrame.add() approach:
df_sum
A B C
0 1.2 2.1 2.4
1 1.1 1.6 2.2
2 1.1 1.6 1.4
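Since the question notes that a NumPy implementation is much faster, a sketch using numpy.lib.stride_tricks.sliding_window_view (available in NumPy 1.20+, and not part of the original answer) may also be worth trying. Note that it sums complete windows only, skipping the partial tail windows, which is exactly what the expected output in the question shows:

import numpy as np
import pandas as pd

window_size = 3
arr = df1[['A', 'B', 'C']].to_numpy()
# shape (n_windows, n_cols, window_size); the last axis walks down the rows
windows = np.lib.stride_tricks.sliding_window_view(arr, window_size, axis=0)
# sum across all windows, then flip to (window position, column)
df_sum = pd.DataFrame(windows.sum(axis=0).T, columns=['A', 'B', 'C'])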
Apologies for the unclear title. My data look like this; the values always sum to 1.
>df
A B C D E
0.3 0.3 0.05 0.2 0.05
What I want to do is identify the columns that either:
1) have the highest value, or
2) are reduced from the previous highest value by less than the threshold percentage.
For example:
Assuming a 50% threshold, I want to end up with [A, B, D], based on the logic that:
1) A & B have the highest value.
2) 50% of A or B is 0.15. Since D is 0.2, which is above 0.15, it is added to the list.
3) 50% of D is 0.1. Since both C and E are less than 0.1, they are not added to the list.
I used the following test DataFrame:
A B C D E
0 0.3 0.3 0.05 0.2 0.05
1 0.5 0.1 0.20 0.1 0.10
Start by defining the following function to get the column names for the current row:
def getCols(row, threshold):
    # sort the row's values in descending order
    s = row.sort_values(ascending=False)
    currVal = 0.0
    lst = []
    # iterate over the distinct values, preserving the descending order
    for key, grp in s.groupby(s, sort=False):
        # stop once a value drops below threshold * the previously kept value
        if len(lst) > 0 and key < currVal * threshold:
            break
        currVal = key
        lst.extend(grp.index.sort_values().tolist())
    return lst
Then apply it:
df['cols'] = df.apply(getCols, axis=1, threshold = 0.5)
The result is:
A B C D E cols
0 0.3 0.3 0.05 0.2 0.05 [A, B, D]
1 0.5 0.1 0.20 0.1 0.10 [A]
I have a dataframe that contains two columns, a: [1, 2, 3, 4, 5] and b: [1, 0.4, 0.3, 0.5, 0.2]. How can I make a column c such that:
c[0] = 1
c[i] = c[i-1]*b[i] + a[i]*(1-b[i])
so that c = [1, 1.6, 2.58, 3.29, 4.658]?
Calculation:
1 = 1
1*0.4 + 2*0.6 = 1.6
1.6*0.3 + 3*0.7 = 2.58
2.58*0.5 + 4*0.5 = 3.29
3.29*0.2 + 5*0.8 = 4.658
I can't see a way to vectorise your recursive algorithm. However, you can use numba to optimize your current logic. This should be preferable to a regular loop.
import numpy as np
import pandas as pd
from numba import jit

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [1, 0.4, 0.3, 0.5, 0.2]})

@jit(nopython=True)
def foo(a, b):
    c = np.zeros(a.shape)
    c[0] = 1
    for i in range(1, c.shape[0]):
        c[i] = c[i-1] * b[i] + a[i] * (1 - b[i])
    return c

df['c'] = foo(df['a'].values, df['b'].values)
print(df)
a b c
0 1 1.0 1.000
1 2 0.4 1.600
2 3 0.3 2.580
3 4 0.5 3.290
4 5 0.2 4.658
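For what it's worth, this particular recurrence is first-order linear, so it can in principle be vectorised with cumulative products. A sketch (not from the original answer; it assumes b has no zeros after the first element, and the division by a running product can be numerically unstable for long series):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [1, 0.4, 0.3, 0.5, 0.2]})

a = df['a'].to_numpy(dtype=float)
b = df['b'].to_numpy(dtype=float)

# c[i] = b[i]*c[i-1] + d[i], with d[i] = a[i]*(1 - b[i]).
# Dividing through by the running product P[i] = b[1]*...*b[i]
# turns the recurrence into a cumulative sum:
#   c[i] = P[i] * (c[0] + sum_{k<=i} d[k]/P[k])
d = a * (1 - b)
P = np.ones_like(b)
P[1:] = np.cumprod(b[1:])
ratios = np.zeros_like(b)
ratios[1:] = d[1:] / P[1:]
df['c'] = P * (1 + np.cumsum(ratios))   # c[0] = 1
print(df)                               # c: 1.0, 1.6, 2.58, 3.29, 4.658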
There could be a smarter way, but here's my attempt:
import pandas as pd

a = [1, 2, 3, 4, 5]
b = [1, 0.4, 0.3, 0.5, 0.2]
df = pd.DataFrame({'a': a, 'b': b})

for i in range(len(df)):
    if i == 0:
        df.loc[i, 'c'] = 1
    else:
        df.loc[i, 'c'] = (df.loc[i-1, 'c'] * df.loc[i, 'b']
                          + df.loc[i, 'a'] * (1 - df.loc[i, 'b']))
Output:
a b c
0 1 1.0 1.000
1 2 0.4 1.600
2 3 0.3 2.580
3 4 0.5 3.290
4 5 0.2 4.658
I have some timeseries data that basically contains information on price change period by period. For example, let's say:
df = pd.DataFrame(columns = ['TimeStamp','PercPriceChange'])
df.loc[:,'TimeStamp']=[1457280,1457281,1457282,1457283,1457284,1457285,1457286]
df.loc[:,'PercPriceChange']=[0.1,0.2,-0.1,0.1,0.2,0.1,-0.1]
so that df looks like
TimeStamp PercPriceChange
0 1457280 0.1
1 1457281 0.2
2 1457282 -0.1
3 1457283 0.1
4 1457284 0.2
5 1457285 0.1
6 1457286 -0.1
What I want to achieve is to calculate the overall price change before an increase/decrease streak ends, and store that value in the row where the streak started. That is, I want a column 'TotalPriceChange':
TimeStamp PercPriceChange TotalPriceChange
0 1457280 0.1 1.1 * 1.2 - 1 = 0.32
1 1457281 0.2 0
2 1457282 -0.1 -0.1
3 1457283 0.1 1.1 * 1.2 * 1.1 - 1 = 0.452
4 1457284 0.2 0
5 1457285 0.1 0
6 1457286 -0.1 -0.1
I can identify the starting points using something like:
df['turn'] = 0
df['PriceChange_L1'] = df['PercPriceChange'].shift(periods=1, freq=None, axis=0)
df.loc[ df['PercPriceChange'] * df['PriceChange_L1'] < 0, 'turn' ] = 1
to get
TimeStamp PercPriceChange turn
0 1457280 0.1 NaN or 1?
1 1457281 0.2 0
2 1457282 -0.1 1
3 1457283 0.1 1
4 1457284 0.2 0
5 1457285 0.1 0
6 1457286 -0.1 1
Given this column "turn", I need help proceeding with my quest (or perhaps we don't need this "turn" at all). I am pretty sure I could write a nested for-loop going through the entire DataFrame row by row, calculating what I need and populating the column 'TotalPriceChange', but given that I plan on doing this on a fairly large data set (think minute or hourly data over a couple of years), I imagine nested for-loops would be really slow.
Therefore, I just wanted to check with you experts to see if there is any efficient solution to my problem that I am not aware of. Any help would be much appreciated!
Thanks!
The calculation you are looking for looks like a groupby/product operation.
To set up the groupby operation, we need to assign a group value to each row. Taking the cumulative sum of the turn column gives the desired result:
df['group'] = df['turn'].cumsum()
# 0 0
# 1 0
# 2 1
# 3 2
# 4 2
# 5 2
# 6 3
# Name: group, dtype: int64
Now we can define the TotalPriceChange column (modulo a little cleanup work) as
df['PercPriceChange_plus_one'] = df['PercPriceChange']+1
df['TotalPriceChange'] = df.groupby('group')['PercPriceChange_plus_one'].transform('prod') - 1
Putting it all together:

import pandas as pd
df = pd.DataFrame({'PercPriceChange': [0.1, 0.2, -0.1, 0.1, 0.2, 0.1, -0.1],
'TimeStamp': [1457280, 1457281, 1457282, 1457283, 1457284, 1457285, 1457286]})
df['turn'] = 0
df['PriceChange_L1'] = df['PercPriceChange'].shift(periods=1, freq=None, axis=0)
df.loc[ df['PercPriceChange'] * df['PriceChange_L1'] < 0, 'turn' ] = 1
df['group'] = df['turn'].cumsum()
df['PercPriceChange_plus_one'] = df['PercPriceChange']+1
df['TotalPriceChange'] = df.groupby('group')['PercPriceChange_plus_one'].transform('prod') - 1
mask = (df['group'].diff() != 0)
df.loc[~mask, 'TotalPriceChange'] = 0
df = df[['TimeStamp', 'PercPriceChange', 'TotalPriceChange']]
print(df)
yields
TimeStamp PercPriceChange TotalPriceChange
0 1457280 0.1 0.320
1 1457281 0.2 0.000
2 1457282 -0.1 -0.100
3 1457283 0.1 0.452
4 1457284 0.2 0.000
5 1457285 0.1 0.000
6 1457286 -0.1 -0.100