I want to create a new column for a big table using several criteria and columns, and I am not sure of the best way to approach it.
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': ['A', 'B', 'B', 'C', 'D'], 'b': ['y', 'n', 'y', 'n', np.nan],
                   'c': [10, 20, 10, 40, 30], 'd': [.3, .1, .4, .2, .1]})
df.head()
def fun(df=df):
    df = df.copy()
    if df.a=='A' & df.b =='n':
        df['new_Col'] = df.c+df.d
    if df.a=='A' & df.b =='y':
        df['new_Col'] = df.d *2
    else:
        df['new_Col'] = 0
    return df
fun()
OR
def fun(df=df):
    df = df.copy()
    if df.a=='A' & df.b =='n':
        return df.c+df.d
    if df.a=='A' & df.b =='y':
        return df.d *2
    else:
        return 0
df['new_Col'] = df.apply(fun)
OR using np.where:
df['new_Col'] = np.where((df.a == 'A') & (df.b == 'n'), df.c + df.d, 0)
df['new_Col'] = np.where((df.a == 'A') & (df.b == 'y'), df.d * 2, 0)
Looks like you need np.select
a, n, y = df.a.eq('A'), df.b.eq('n'), df.b.eq('y')
df['result'] = np.select([a & n, a & y], [df.c + df.d, df.d*2], default=0)
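For reference, here is a self-contained sketch of the np.select approach on the question's sample frame. The column name result follows the answer above; the mask names are only illustrative. Rows that match neither condition, including the row where b is NaN, fall through to the default of 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['A', 'B', 'B', 'C', 'D'],
                   'b': ['y', 'n', 'y', 'n', np.nan],
                   'c': [10, 20, 10, 40, 30],
                   'd': [.3, .1, .4, .2, .1]})

# Boolean masks; .eq() treats NaN as "not equal", so no special handling is needed
is_a = df['a'].eq('A')
is_n = df['b'].eq('n')
is_y = df['b'].eq('y')

# Conditions are checked in order; the first match wins, otherwise default applies
df['result'] = np.select([is_a & is_n, is_a & is_y],
                         [df['c'] + df['d'], df['d'] * 2],
                         default=0)
print(df)
```

Adding a third rule later only means appending one more condition and one more choice to the two lists.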
This is an arithmetic way (I added one more row to your sample for case a = 'A' and b = 'n'):
sample
Out[1369]:
a b c d
0 A y 10 0.3
1 B n 20 0.1
2 B y 10 0.4
3 C n 40 0.2
4 D NaN 30 0.1
5 A n 50 0.9
nc = df.a.eq('A') & df.b.eq('y')
mc = df.a.eq('A') & df.b.eq('n')
nr = df.d * 2
mr = df.c + df.d
df['new_col'] = nc*nr + mc*mr
Out[1371]:
a b c d new_col
0 A y 10 0.3 0.6
1 B n 20 0.1 0.0
2 B y 10 0.4 0.0
3 C n 40 0.2 0.0
4 D NaN 30 0.1 0.0
5 A n 50 0.9 50.9
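One caveat with the arithmetic approach, sketched here on a small made-up frame (the data is hypothetical, not from the question): multiplying by a False mask does not hide missing values, because 0 * NaN is still NaN. np.where selects values instead of multiplying, so it avoids this.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row fails the condition and has a missing value
demo = pd.DataFrame({'a': ['A', 'B'], 'd': [0.3, np.nan]})
mask = demo['a'].eq('A')

arith = mask * (demo['d'] * 2)               # False * NaN is still NaN
picked = np.where(mask, demo['d'] * 2, 0)    # the NaN is never selected

print(arith.tolist())
print(picked.tolist())
```

If the value columns are known to contain no missing data, the two forms should give the same result.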
d1=pd.DataFrame({'x':['a','b','c','c'],'y':[-1,-2,-3,0]})
d2=pd.DataFrame({'x':['d','c','a','b'],'y':[0.1,0.2,0.3,0.4]})
I want to replace d1.y where y < 0 with the corresponding y in d2, matched on x. It is something like VLOOKUP in Excel. The core problem is to replace y according to x rather than simply manipulating y. What I want is:
Out[40]:
x y
0 a 0.3
1 b 0.4
2 c 0.2
3 c 0.0
Use Series.map with condition:
s = d2.set_index('x')['y']
d1.loc[d1.y < 0, 'y'] = d1['x'].map(s)
print (d1)
x y
0 a 0.3
1 b 0.4
2 c 0.2
3 c 0.0
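A self-contained sketch of the same lookup, with two assumptions spelled out: x is unique in d2, and y is cast to float before assignment, since newer pandas versions may complain when float values are written into an integer column. The names lookup and negative are illustrative.

```python
import pandas as pd

d1 = pd.DataFrame({'x': ['a', 'b', 'c', 'c'], 'y': [-1, -2, -3, 0]})
d2 = pd.DataFrame({'x': ['d', 'c', 'a', 'b'], 'y': [0.1, 0.2, 0.3, 0.4]})

# Lookup table keyed by x (assumes x is unique in d2)
lookup = d2.set_index('x')['y']

# Cast to float first so the float replacements fit the column's dtype
d1['y'] = d1['y'].astype(float)

# Only rows with a negative y are looked up and overwritten
negative = d1['y'] < 0
d1.loc[negative, 'y'] = d1.loc[negative, 'x'].map(lookup)
print(d1)
```

Any x in d1 that is missing from d2 would come back as NaN from map, so it may be worth checking for that case on real data.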
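You can try this, but note that it aligns d1 and d2 by row index rather than by x, so it only gives the desired result when both frames list x in the same order (which is not the case for the sample above):
d1.loc[d1.y < 0, 'y'] = d2.loc[d1.y < 0, 'y']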
I have a dataframe that contains two columns, a: [1,2,3,4,5]; b: [1,0.4,0.3,0.5,0.2]. How can I make a column c such that:
c[0] = 1
c[i] = c[i-1]*b[i]+a[i]*(1-b[i])
so that c:[1,1.6,2.58,3.29,4.658]
Calculation:
1 = 1
1*0.4+2*0.6 = 1.6
1.6*0.3+3*0.7 = 2.58
2.58*0.5+4*0.5 = 3.29
3.29*0.2+5*0.8 = 4.658
?
I can't see a way to vectorise your recursive algorithm. However, you can use numba to optimize your current logic. This should be preferable to a regular loop.
import numpy as np
import pandas as pd
from numba import jit

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [1, 0.4, 0.3, 0.5, 0.2]})

@jit(nopython=True)
def foo(a, b):
    # c[0] is seeded with 1; each later value depends on the previous one
    c = np.zeros(a.shape)
    c[0] = 1
    for i in range(1, c.shape[0]):
        c[i] = c[i-1] * b[i] + a[i] * (1-b[i])
    return c

df['c'] = foo(df['a'].values, df['b'].values)
print(df)
a b c
0 1 1.0 1.000
1 2 0.4 1.600
2 3 0.3 2.580
3 4 0.5 3.290
4 5 0.2 4.658
There could be a smarter way, but here's my attempt:
import pandas as pd

a = [1, 2, 3, 4, 5]
b = [1, 0.4, 0.3, 0.5, 0.2]
df = pd.DataFrame({'a': a, 'b': b})

for i in range(len(df)):
    if i == 0:
        # seed value for the recurrence
        df.loc[i, 'c'] = 1
    else:
        df.loc[i, 'c'] = df.loc[i-1, 'c'] * df.loc[i, 'b'] + df.loc[i, 'a'] * (1 - df.loc[i, 'b'])
Output:
a b c
0 1 1.0 1.000
1 2 0.4 1.600
2 3 0.3 2.580
3 4 0.5 3.290
4 5 0.2 4.658
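A possible middle ground between the two answers above, assuming numba is not installed: run the same recurrence over plain NumPy arrays and assign the column once, which avoids a .loc lookup on every step. The function name recurrence is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [1, 0.4, 0.3, 0.5, 0.2]})

def recurrence(a, b):
    """c[0] = 1; c[i] = c[i-1] * b[i] + a[i] * (1 - b[i])."""
    c = np.empty(len(a), dtype=float)
    c[0] = 1.0
    for i in range(1, len(a)):
        c[i] = c[i - 1] * b[i] + a[i] * (1 - b[i])
    return c

df['c'] = recurrence(df['a'].to_numpy(), df['b'].to_numpy())
print(df)
```

The loop is still sequential, so this is a sketch of a tidier loop rather than a vectorised solution.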
How to get count of values greater than current row in the last n rows?
Imagine we have a dataframe as following:
col_a
0 8.4
1 11.3
2 7.2
3 6.5
4 4.5
5 8.9
I am trying to get a table such as following where n=3.
col_a col_b
0 8.4 0
1 11.3 0
2 7.2 2
3 6.5 3
4 4.5 3
5 8.9 0
Thanks in advance.
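In pandas it is best to avoid explicit loops because they are slow; here it is better to use rolling with a custom function. Passing raw=True hands each window to the function as a NumPy array, so positional indexing such as x[-1] works:
n = 3
df['new'] = (df['col_a'].rolling(n+1, min_periods=1)
                        .apply(lambda x: (x[-1] < x[:-1]).sum(), raw=True)
                        .astype(int))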
print (df)
col_a new
0 8.4 0
1 11.3 0
2 7.2 2
3 6.5 3
4 4.5 3
5 8.9 0
If performance is important, use strides:
n = 3
x = np.concatenate([[np.nan] * (n), df['col_a'].values])

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

arr = rolling_window(x, n + 1)
df['new'] = (arr[:, :-1] > arr[:, [-1]]).sum(axis=1)
print (df)
col_a new
0 8.4 0
1 11.3 0
2 7.2 2
3 6.5 3
4 4.5 3
5 8.9 0
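On NumPy 1.20 or later, the hand-written as_strided helper can probably be swapped for numpy.lib.stride_tricks.sliding_window_view, which builds the same windows as a read-only view. This is a sketch of that substitution on the question's data, not the answer's original code.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df = pd.DataFrame({'col_a': [8.4, 11.3, 7.2, 6.5, 4.5, 8.9]})
n = 3

# Pad the front with NaN so every row has n "previous" slots; NaN > x is False
padded = np.concatenate([np.full(n, np.nan), df['col_a'].to_numpy()])

# Each window holds the n previous values followed by the current value
windows = sliding_window_view(padded, n + 1)
df['new'] = (windows[:, :-1] > windows[:, [-1]]).sum(axis=1)
print(df)
```

The comparison logic is unchanged; only the window construction differs.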
Performance: timings were measured with perfplot for a small window, n = 3:
import numpy as np
import pandas as pd
import perfplot

np.random.seed(1256)
n = 3

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def roll(df):
    df['new'] = (df['col_a'].rolling(n+1, min_periods=1)
                            .apply(lambda x: (x[-1] < x[:-1]).sum(), raw=True)
                            .astype(int))
    return df

def list_comp(df):
    df['count'] = [(j < df['col_a'].iloc[max(0, i-n):i]).sum() for i, j in df['col_a'].items()]
    return df

def strides(df):
    x = np.concatenate([[np.nan] * (n), df['col_a'].values])
    arr = rolling_window(x, n + 1)
    df['new1'] = (arr[:, :-1] > arr[:, [-1]]).sum(axis=1)
    return df

def make_df(n):
    # here n is the number of rows supplied by perfplot, not the window size
    df = pd.DataFrame(np.random.randint(20, size=n), columns=['col_a'])
    return df

perfplot.show(
    setup=make_df,
    kernels=[list_comp, roll, strides],
    n_range=[2**k for k in range(2, 15)],
    logx=True,
    logy=True,
    xlabel='len(df)')
I was also curious about performance with a large window, n = 100.
n = 3
df['col_b'] = df.apply(lambda row: sum(row.col_a <= df.col_a.loc[row.name - n: row.name-1]), axis=1)
Out[]:
col_a col_b
0 8.4 0
1 11.3 0
2 7.2 2
3 6.5 3
4 4.5 3
5 8.9 0
Using a list comprehension with pd.Series.items:
n = 3
df['count'] = [(j < df['col_a'].iloc[max(0, i-n):i]).sum()
               for i, j in df['col_a'].items()]
Equivalently, using enumerate:
n = 3
df['count'] = [(j < df['col_a'].iloc[max(0, i-n):i]).sum()
               for i, j in enumerate(df['col_a'])]
Result:
print(df)
col_a count
0 8.4 0
1 11.3 0
2 7.2 2
3 6.5 3
4 4.5 3
5 8.9 0
Is there a faster way to do the following loop, perhaps with apply or a rolling apply?
Basically, I need to access the previous row's value to determine the current cell's value.
df.ix[0] = (np.abs(df.ix[0]) >= So) * np.sign(df.ix[0])
for i in range(1, len(df)):
    for col in list(df.columns.values):
        if (df[col].ix[i] > 1.25) & (df[col].ix[i-1] == 0):
            df[col].ix[i] = 1
        elif (df[col].ix[i] < -1.25) & (df[col].ix[i-1] == 0):
            df[col].ix[i] = -1
        elif ((df[col].ix[i] <= -0.75) & (df[col].ix[i-1] < 0)) | ((df[col].ix[i] >= 0.5) & (df[col].ix[i-1] > 0)):
            df[col].ix[i] = df[col].ix[i-1]
        else:
            df[col].ix[i] = 0
As you can see, the loop updates the dataframe as it goes, and each step needs the already-updated previous row, so using shift will not work.
For example:
Input:
A B C
1.3 -1.5 0.7
1.1 -1.4 0.6
1.0 -1.3 0.5
0.4 1.4 0.4
Output:
A B C
1 -1 0
1 -1 0
1 -1 0
0 1 0
You can use the .shift() function to access previous or next values.
Previous value for the col column:
df['col'].shift()
Next value for the col column:
df['col'].shift(-1)
Example:
In [38]: df
Out[38]:
a b c
0 1 0 5
1 9 9 2
2 2 2 8
3 6 3 0
4 6 1 7
In [39]: df['prev_a'] = df['a'].shift()
In [40]: df
Out[40]:
a b c prev_a
0 1 0 5 NaN
1 9 9 2 1.0
2 2 2 8 9.0
3 6 3 0 2.0
4 6 1 7 6.0
In [43]: df['next_a'] = df['a'].shift(-1)
In [44]: df
Out[44]:
a b c prev_a next_a
0 1 0 5 NaN 9.0
1 9 9 2 1.0 2.0
2 2 2 8 9.0 6.0
3 6 3 0 2.0 6.0
4 6 1 7 6.0 NaN
I am surprised there isn't a native pandas solution to this as well, because shift and rolling do not get it done. I have devised a way to do this using the standard pandas syntax but I am not sure if it performs any better than your loop... My purposes just required this for consistency (not speed).
import pandas as pd
df = pd.DataFrame({'a':[0,1,2], 'b':[0,10,20]})
new_col = 'c'
def apply_func_decorator(func):
    # prev_row persists between calls and carries the last row plus its result
    prev_row = {}
    def wrapper(curr_row, **kwargs):
        val = func(curr_row, prev_row)
        prev_row.update(curr_row)
        prev_row[new_col] = val
        return val
    return wrapper

@apply_func_decorator
def running_total(curr_row, prev_row):
    return curr_row['a'] + curr_row['b'] + prev_row.get('c', 0)
df[new_col] = df.apply(running_total, axis=1)
print(df)
# Output will be:
# a b c
# 0 0 0 0
# 1 1 10 11
# 2 2 20 33
Disclaimer: I used pandas 0.16 but with only slight modification this will work for the latest versions too.
Others had similar questions and I posted this solution on those as well:
Reference previous row when iterating through dataframe
Reference values in the previous row with map or apply
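If the decorator feels heavy, the same running total can be sketched as a plain loop over itertuples that carries the previous result in a local variable. This is an alternative technique rather than the method from the answer above; the names results and prev are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2], 'b': [0, 10, 20]})

results = []
prev = 0  # value of c assumed for the row before the first one
for row in df.itertuples(index=False):
    # each result depends on the previous result, which prev carries forward
    prev = row.a + row.b + prev
    results.append(prev)

df['c'] = results
print(df)
```

The trade-off is that this leaves the df.apply idiom behind, which the answer above deliberately kept for consistency.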
@maxU has it right with shift; I think you can even compare dataframes directly, something like this:
df_prev = df.shift()  # previous row; shift(-1) would give the next row
df_out = pd.DataFrame(index=df.index, columns=df.columns)
df_out[(df > 1.25) & (df_prev == 0)] = 1
df_out[(df < -1.25) & (df_prev == 0)] = -1
df_out[(df <= -.75) & (df_prev < 0)] = df_prev
df_out[(df >= .5) & (df_prev > 0)] = df_prev
The syntax may be off, but if you provide some test data I think this could work.
Saves you having to loop at all.
EDIT - Update based on comment below
I would try my absolute best not to loop through the DF itself. You're better off going column by column, sending to a list and doing the updating, then just importing back again. Something like this:
df.iloc[0] = (np.abs(df.iloc[0]) >= 1.25) * np.sign(df.iloc[0])
for col in df.columns.tolist():
    currData = df[col].tolist()
    for currRow in range(1, len(currData)):
        if currData[currRow] > 1.25 and currData[currRow-1] == 0:
            currData[currRow] = 1
        elif currData[currRow] < -1.25 and currData[currRow-1] == 0:
            currData[currRow] = -1
        elif currData[currRow] <= -.75 and currData[currRow-1] < 0:
            currData[currRow] = currData[currRow-1]
        elif currData[currRow] >= .5 and currData[currRow-1] > 0:
            currData[currRow] = currData[currRow-1]
        else:
            currData[currRow] = 0
    # write the updated list back once per column
    df[col] = currData
I have a dataframe of historical election results and want to calculate an additional column that applies a basic math formula for records for winning candidates and copies a value over for the rest of them.
Here is the code I tried:
va2 = va1[['contest_id', 'year', 'district', 'office', 'party_code',
'pct_vote', 'winner']].drop_duplicates()
va2['vote_waste'] = va2['winner'].map(lambda x: (-.5) + va2['pct_vote']
if x == 'w' else va2['pct_vote'])
This gave me a new column in which every row contained the calculation for all rows, rather than a single value per row.
You can use numpy.where() to achieve what you want:
import pandas as pd
import numpy as np
data = {
'winner': pd.Series(['w', 'l', 'l', 'w', 'l']),
'pct_vote': pd.Series([0.4, 0.9, 0.9, 0.4, 0.9]),
'party_code': pd.Series([10, 20, 30, 40, 50])
}
df = pd.DataFrame(data)
print(df)
party_code pct_vote winner
0 10 0.4 w
1 20 0.9 l
2 30 0.9 l
3 40 0.4 w
4 50 0.9 l
df['vote_waste'] = np.where(
df['winner'] == 'w',
df['pct_vote'] - 0.5, #if condition is true, use this value
df['pct_vote'] #if condition is false, use this value
)
print(df)
party_code pct_vote winner vote_waste
0 10 0.4 w -0.1
1 20 0.9 l 0.9
2 30 0.9 l 0.9
3 40 0.4 w -0.1
4 50 0.9 l 0.9
This is because you are combining a single element x with the whole Series va2['pct_vote']. What you need is an element-wise operation on va2['winner'] and va2['pct_vote']. You could use apply to achieve that.
Consider a as winner and b as pct_vote:
df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
df
Out[23]:
a b c
0 1 2 3
1 4 5 6
df['new'] = df[['a','b']].apply(lambda x: (-0.5) + x['b'] if x['a'] == 1 else x['b'], axis=1)
df
Out[42]:
a b c new
0 1 2 3 1.5
1 4 5 6 5.0
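To tie the two answers together, here is a sketch that applies the row-wise apply idea to data shaped like the election example and checks it against np.where. The frame reuses the sample values from the np.where answer; the names by_apply and by_where are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'winner': ['w', 'l', 'l', 'w', 'l'],
                   'pct_vote': [0.4, 0.9, 0.9, 0.4, 0.9]})

# Row-wise: each call sees one row, so the result is a scalar per row
by_apply = df.apply(
    lambda row: row['pct_vote'] - 0.5 if row['winner'] == 'w' else row['pct_vote'],
    axis=1)

# Vectorised equivalent of the same rule
by_where = np.where(df['winner'] == 'w', df['pct_vote'] - 0.5, df['pct_vote'])

df['vote_waste'] = by_apply
print(df)
```

Both give the same column here; on large frames the np.where form is usually the faster of the two because it avoids a Python call per row.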