I have a dataframe that contains two columns, a: [1,2,3,4,5]; b: [1,0.4,0.3,0.5,0.2]. How can I make a column c such that:
c[0] = 1
c[i] = c[i-1]*b[i]+a[i]*(1-b[i])
so that c = [1, 1.6, 2.58, 3.29, 4.658]?
Calculation:
1 = 1
1*0.4+2*0.6 = 1.6
1.6*0.3+3*0.7 = 2.58
2.58*0.5+4*0.5 = 3.29
3.29*0.2+5*0.8 = 4.658
I can't see a way to vectorise your recursive algorithm. However, you can use numba to optimize your current logic. This should be preferable to a regular loop.
import numpy as np
import pandas as pd
from numba import jit

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [1, 0.4, 0.3, 0.5, 0.2]})

@jit(nopython=True)
def foo(a, b):
    c = np.zeros(a.shape)
    c[0] = 1
    for i in range(1, c.shape[0]):
        c[i] = c[i-1] * b[i] + a[i] * (1 - b[i])
    return c

df['c'] = foo(df['a'].values, df['b'].values)
print(df)
a b c
0 1 1.0 1.000
1 2 0.4 1.600
2 3 0.3 2.580
3 4 0.5 3.290
4 5 0.2 4.658
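One caveat: numba compiles foo on its first call, so the first run includes compilation overhead; the speedup shows on repeated calls or larger inputs.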
There could be a smarter way, but here's my attempt:
import pandas as pd

a = [1, 2, 3, 4, 5]
b = [1, 0.4, 0.3, 0.5, 0.2]
df = pd.DataFrame({'a': a, 'b': b})

for i in range(len(df)):
    if i == 0:
        df.loc[i, 'c'] = 1
    else:
        df.loc[i, 'c'] = df.loc[i-1, 'c'] * df.loc[i, 'b'] + df.loc[i, 'a'] * (1 - df.loc[i, 'b'])
Output:
a b c
0 1 1.0 1.000
1 2 0.4 1.600
2 3 0.3 2.580
3 4 0.5 3.290
4 5 0.2 4.658
I tried to construct a column without using a for loop. However, each value depends on the value calculated at the previous step.
Example:

f-:
0    0.30
1    0.40
2    0.45
3    0.60

f+:
0    0.21
1    0.78
2    0.54
3    0.90

Matrix P:
     0    1
0  0.9  0.1
1  0.1  0.9

DataFrame df to be filled:
     0    1    2
0  0.5  ...  ...
1  0.5  ...  ...
Then column 1 of DataFrame df becomes:

prob = f-[1] * df[0,0] / (f-[1] * df[0,0] + f+[1] * df[1,0])
df[1] = P.dot([prob, 1 - prob])
Here is the code with a for loop :
for i in np.arange(0, n):
    p = f_m[i+1] * df.iloc[0, i] / (f_m[i+1] * df.iloc[0, i] + f_p[i+1] * df.iloc[1, i])
    xi[i + 1] = p_matrix.dot(np.array([p, 1 - p]))
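For reference, here is a self-contained sketch of that loop. The array values come from the tables above; I am assuming xi[i + 1] is meant to become column i + 1 of df, which is not explicit in the question:

import numpy as np
import pandas as pd

f_m = np.array([0.3, 0.4, 0.45, 0.6])    # f- from the table above
f_p = np.array([0.21, 0.78, 0.54, 0.9])  # f+ from the table above
p_matrix = np.array([[0.9, 0.1],
                     [0.1, 0.9]])        # matrix P

n = 2                                    # number of columns to fill
df = pd.DataFrame(np.zeros((2, n + 1)))
df.iloc[:, 0] = [0.5, 0.5]               # initial column

for i in np.arange(0, n):
    p = f_m[i+1] * df.iloc[0, i] / (f_m[i+1] * df.iloc[0, i] + f_p[i+1] * df.iloc[1, i])
    df.iloc[:, i + 1] = p_matrix.dot(np.array([p, 1 - p]))   # xi[i + 1] in the question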
Does anyone have a solution to create it without a for loop? Each subsequent column would then be calculated the same way.
Suppose I have two DFs, say df1,df2 as follows:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[0, 1, 100], [1, 1.1, 120], [2, 0.8, 102]], columns=['id', 'a', 'b'])
df2 = pd.DataFrame([[0, 0.5, 110], [1, 1.05, 94], [2, 0.96, 145], [3, 0.86, 112], [4, 1.3, 97]],
                   columns=['id', 'a', 'b'])
print(df1)
id a b
0 0 1.0 100
1 1 1.1 120
2 2 0.8 102
print(df2)
id a b
0 0 0.50 110
1 1 1.05 94
2 2 0.96 145
3 3 0.86 112
4 4 1.30 97
Now, suppose I choose some interval sizes da, db. For each row in df1, I want to pick a random row from df2 such that abs(a1-a2) < da and abs(b1-b2) < db. What I am currently doing is very brute force:
da = 0.2
db = 25
df2_list = []
nbad = 0
for rid, row in df1.iterrows():
    ca = row['a']
    cb = row['b']
    c_df2 = df2[(np.abs(df2['a'] - ca) < da) & (np.abs(df2['b'] - cb) < db)]
    if len(c_df2) == 0:
        nbad += 1
        continue
    c_df2 = c_df2.sample()
    df2_list.append(c_df2['id'].values[0])
matched_df = df2[df2['id'].isin(df2_list)]
print(matched_df)
id a b
1 1 1.05 94
3 3 0.86 112
4 4 1.30 97
However, for my real purpose, where my DF is really big, this is very slow.
Is there a faster way to achieve this result?
Here's a solution based on a cross join followed by filtering:
da = 0.2
db = 25
res = pd.merge(df1.assign(dummy = 1), df2.assign(dummy = 1), on = "dummy").drop("dummy", axis = 1)
res = res[(np.abs(res.a_x - res.a_y) < da) & (np.abs(res.b_x - res.b_y) < db)]
res = res.groupby("id_x").apply(lambda x: x.sample(1))[["id_y", "a_y", "b_y"]]
res.index = res.index.droplevel(1)
print(res)
The output is:
id_y a_y b_y
id_x
0 1 1.05 94
1 4 1.30 97
2 3 0.86 112
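On pandas 1.2 and later, the dummy-column trick can be replaced by a native cross join; a sketch of the same logic (how="cross" produces the same _x/_y column naming):

res = pd.merge(df1, df2, how="cross", suffixes=("_x", "_y"))
res = res[(np.abs(res.a_x - res.a_y) < da) & (np.abs(res.b_x - res.b_y) < db)]
res = res.groupby("id_x").apply(lambda x: x.sample(1))[["id_y", "a_y", "b_y"]]
res.index = res.index.droplevel(1)

Either way, note that the cross join materialises len(df1) * len(df2) rows before filtering, so memory can become the bottleneck for very large frames.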
I want to create a new column for a big table using several criteria and columns, and was not sure of the best way to approach it.
df = pd.DataFrame({'a': ['A', 'B', 'B', 'C', 'D'],
                   'b': ['y', 'n', 'y', 'n', np.nan],
                   'c': [10, 20, 10, 40, 30],
                   'd': [.3, .1, .4, .2, .1]})
df.head()
def fun(df=df):
    df = df.copy()
    if df.a == 'A' & df.b == 'n':
        df['new_Col'] = df.c + df.d
    if df.a == 'A' & df.b == 'y':
        df['new_Col'] = df.d * 2
    else:
        df['new_Col'] = 0
    return df

fun()
OR
def fun(df=df):
    df = df.copy()
    if df.a == 'A' & df.b == 'n':
        return df.c + df.d
    if df.a == 'A' & df.b == 'y':
        return df.d * 2
    else:
        return 0

df['new_Col'] = df.apply(fun)
OR using np.where:
df['new_Col'] = np.where(df.a=='A' & df.b =='n', df.c+df.d,0 )
df['new_Col'] = np.where(df.a=='A' & df.b =='y', df.d *2,0 )
Looks like you need np.select
a, n, y = df.a.eq('A'), df.b.eq('n'), df.b.eq('y')
df['result'] = np.select([a & n, a & y], [df.c + df.d, df.d*2], default=0)
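For the sample df above, only row 0 matches one of the conditions (a == 'A' and b == 'y'), so this should give result = [0.6, 0.0, 0.0, 0.0, 0.0].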
This is an arithmetic way (I added one more row to your sample, for the case a == 'A' and b == 'n'):
sample
Out[1369]:
a b c d
0 A y 10 0.3
1 B n 20 0.1
2 B y 10 0.4
3 C n 40 0.2
4 D NaN 30 0.1
5 A n 50 0.9
nc = df.a.eq('A') & df.b.eq('y')   # condition for d * 2
mc = df.a.eq('A') & df.b.eq('n')   # condition for c + d
nr = df.d * 2
mr = df.c + df.d
df['new_col'] = nc*nr + mc*mr      # booleans act as 0/1, so unmatched rows get 0
Out[1371]:
a b c d new_col
0 A y 10 0.3 0.6
1 B n 20 0.1 0.0
2 B y 10 0.4 0.0
3 C n 40 0.2 0.0
4 D NaN 30 0.1 0.0
5 A n 50 0.9 50.9
I have a dataframe (minimal code to run it is below):
import numpy as np
import pandas as pd

df_dict = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 2, 3, 1, 5],
    'out': np.nan
}
df = pd.DataFrame(df_dict)
I am currently performing some row-by-row calculations by doing the following:
def transform(row):
    length = 2
    weight = 5
    row_num = int(row.name)
    out = row['A'] / length
    if row_num >= length:
        previous_out = df.at[row_num - 1, 'out']
        out = (row['B'] - previous_out) * weight + previous_out
    df.at[row_num, 'out'] = out

df.apply(lambda x: transform(x), axis=1)
This yields the correct result:
A B out
0 1 5 0.5
1 2 2 1.0
2 3 3 11.0
3 4 1 -39.0
4 5 5 181.0
The breakdown for the correct calculation is as follows:
   A  B    out
0  1  5    0.5     out = A / length = 1 / 2
1  2  2    1.0     out = A / length = 2 / 2
(from here on, row_num >= length, so out = (B - previous_out) * weight + previous_out)
2  3  3   11.0     out = (3 - 1) * 5 + 1 = 11
3  4  1  -39.0     out = (1 - 11) * 5 + 11 = -39
4  5  5  181.0     out = (5 - (-39)) * 5 + (-39) = 181
Executing this across many columns and looping is slow, so I would like to optimize by taking advantage of some kind of vectorization if possible.
My current attempt looks like this:
df['out'] = df['A'] / length
df[length:]['out'] = (df[length:]['B'] - df[length:]['out'].shift() ) * weight + df[length:]['out'].shift()
This is not working and I'm not quite sure where to go from here.
You won't be able to do better than this:
length = 2
weight = 5
df['out'] = df.A / length
for i in range(len(df)):
    if i >= length:
        df.loc[i, 'out'] = (df.loc[i, 'B'] - df.loc[i - 1, 'out']) * weight + df.loc[i - 1, 'out']
The reason is that "the iterative nature of the calculation where the inputs depend on results of previous steps complicates vectorization" (as a commenter here puts it). You can't express a calculation where every result depends on the previous ones as a single matrix operation; there will always be some kind of loop going on behind the scenes.
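That said, if the plain loop is too slow, the same logic can be JIT-compiled with numba, as in the first answer further up. A sketch, using the question's length and weight:

import numpy as np
from numba import jit

@jit(nopython=True)
def calc_out(a, b, length, weight):
    out = a / length                     # default: A / length
    for i in range(length, len(out)):
        out[i] = (b[i] - out[i-1]) * weight + out[i-1]
    return out

df['out'] = calc_out(df['A'].values.astype(np.float64),
                     df['B'].values.astype(np.float64), 2, 5)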
I want to know if there is any faster way to do the following loop. Maybe apply or a rolling apply could realize this?
Basically, I need to access the previous row's value to determine the current cell value.
df.iloc[0] = (np.abs(df.iloc[0]) >= So) * np.sign(df.iloc[0])
for i in range(1, len(df)):
    for col in list(df.columns.values):
        if (df.loc[i, col] > 1.25) & (df.loc[i-1, col] == 0):
            df.loc[i, col] = 1
        elif (df.loc[i, col] < -1.25) & (df.loc[i-1, col] == 0):
            df.loc[i, col] = -1
        elif ((df.loc[i, col] <= -0.75) & (df.loc[i-1, col] < 0)) | ((df.loc[i, col] >= 0.5) & (df.loc[i-1, col] > 0)):
            df.loc[i, col] = df.loc[i-1, col]
        else:
            df.loc[i, col] = 0
As you can see, the function updates the dataframe in place, and I need to access the most recently updated previous row, so using shift will not work.
For example:
Input:
A B C
1.3 -1.5 0.7
1.1 -1.4 0.6
1.0 -1.3 0.5
0.4 1.4 0.4
Output:
A B C
1 -1 0
1 -1 0
1 -1 0
0 1 0
You can use the .shift() function to access previous or next values:

Previous value of column col:
df['col'].shift()

Next value of column col:
df['col'].shift(-1)
Example:
In [38]: df
Out[38]:
a b c
0 1 0 5
1 9 9 2
2 2 2 8
3 6 3 0
4 6 1 7
In [39]: df['prev_a'] = df['a'].shift()
In [40]: df
Out[40]:
a b c prev_a
0 1 0 5 NaN
1 9 9 2 1.0
2 2 2 8 9.0
3 6 3 0 2.0
4 6 1 7 6.0
In [43]: df['next_a'] = df['a'].shift(-1)
In [44]: df
Out[44]:
a b c prev_a next_a
0 1 0 5 NaN 9.0
1 9 9 2 1.0 2.0
2 2 2 8 9.0 6.0
3 6 3 0 2.0 6.0
4 6 1 7 6.0 NaN
I am surprised there isn't a native pandas solution to this as well, because shift and rolling do not get it done. I have devised a way to do this using standard pandas syntax, but I am not sure it performs any better than your loop; my purposes just required this for consistency (not speed).
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2], 'b': [0, 10, 20]})
new_col = 'c'

def apply_func_decorator(func):
    prev_row = {}
    def wrapper(curr_row, **kwargs):
        val = func(curr_row, prev_row)
        prev_row.update(curr_row)
        prev_row[new_col] = val
        return val
    return wrapper

@apply_func_decorator
def running_total(curr_row, prev_row):
    return curr_row['a'] + curr_row['b'] + prev_row.get('c', 0)

df[new_col] = df.apply(running_total, axis=1)
print(df)
print(df)
# Output will be:
# a b c
# 0 0 0 0
# 1 1 10 11
# 2 2 20 33
Disclaimer: I used pandas 0.16 but with only slight modification this will work for the latest versions too.
Others had similar questions and I posted this solution on those as well:
Reference previous row when iterating through dataframe
Reference values in the previous row with map or apply
@maxU has it right with shift; I think you can even compare dataframes directly, something like this:
df_prev = df.shift()   # previous row
df_out = pd.DataFrame(index=df.index, columns=df.columns)
df_out[(df > 1.25) & (df_prev == 0)] = 1
df_out[(df < -1.25) & (df_prev == 0)] = -1
df_out[(df < -.75) & (df_prev < 0)] = df_prev
df_out[(df > .5) & (df_prev > 0)] = df_prev
The syntax may be off, but if you provide some test data I think this could work.
Saves you having to loop at all.
EDIT - Update based on comment below
I would try my absolute best not to loop through the DF itself. You're better off going column by column: dump the column to a list, do the updates on the list, then write it back. Something like this:
df.iloc[0] = (np.abs(df.iloc[0]) >= 1.25) * np.sign(df.iloc[0])
for col in df.columns.tolist():
    currData = df[col].tolist()
    for currRow in range(1, len(currData)):
        if currData[currRow] > 1.25 and currData[currRow-1] == 0:
            currData[currRow] = 1
        elif currData[currRow] < -1.25 and currData[currRow-1] == 0:
            currData[currRow] = -1
        elif currData[currRow] <= -.75 and currData[currRow-1] < 0:
            currData[currRow] = currData[currRow-1]
        elif currData[currRow] >= .5 and currData[currRow-1] > 0:
            currData[currRow] = currData[currRow-1]
        else:
            currData[currRow] = 0
    df[col] = currData