Pythonic way to simplify multiplying 2 pandas columns with a condition - python

I am looking for a way to simplify the examples below:
self.df[TARGET_NAME] = self.df.apply(lambda row: 1 if row['WINNER'] == 1 and row['WINNER_OVER_2_5'] == 1 else 0, axis=1)
into something like:
self.df[TARGET_NAME] = self.df[(self.df.WINNER == 1)] & \
                       self.df[(self.df.WINNER_OVER_2_5 == 1)]  # no, this is not correct
and a more complex example below:
df["PROFIT"] = np.where((df[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"] == 0),
df['MATCH_HOME'] * df['HOME_STAKE'],
np.where((dfml[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"] == 1),
df['MATCH_DRAW'] * df['DRAW_STAKE'],
np.where((df[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"] == 2),
df['MATCH_AWAY'] * df['AWAY_STAKE'],
-0))).astype(float)

IIUC you can use isin:
print(df)
   WINNER  WINNER_OVER_2_5
0       1                0
1       1                1
2       0                2

df['TARGET_NAME'] = np.where((df.WINNER.isin([1]) & df.WINNER_OVER_2_5.isin([1])), 1, 0)

print(df)
   WINNER  WINNER_OVER_2_5  TARGET_NAME
0       1                0            0
1       1                1            1
2       0                2            0
EDIT (untested, because no data):
df["PROFIT"] = np.where((df[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"].isin([0])),
df['MATCH_HOME'] * df['HOME_STAKE'],
np.where((dfml[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"].isin([1])),
df['MATCH_DRAW'] * df['DRAW_STAKE'],
np.where((df[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"].isin([2])),
df['MATCH_AWAY'] * df['AWAY_STAKE'],
0))).astype(float)
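For what it's worth, the same three-branch logic can also be written flat with np.select instead of nested np.where (a sketch, assuming the same columns as above):

import numpy as np

match = df[TARGET_NAME] == df["PREDICTED"]
conditions = [match & (df["PREDICTED"] == 0),
              match & (df["PREDICTED"] == 1),
              match & (df["PREDICTED"] == 2)]
choices = [df['MATCH_HOME'] * df['HOME_STAKE'],
           df['MATCH_DRAW'] * df['DRAW_STAKE'],
           df['MATCH_AWAY'] * df['AWAY_STAKE']]
df["PROFIT"] = np.select(conditions, choices, default=0).astype(float)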

I am looking for a way to simplify the examples below:
I guess you're looking for a simpler syntax. How about this:
df['MATCH'] = matches(df, values=(0,1), WINNER=1, WINNER_OVER_2_5=1)
Note that values= is optional and takes any (false-value, true-value) tuple, defaulting to (False, True).
Getting there takes a bit of magic. Essentially this builds a truth table by chaining the conditions and then maps the boolean result onto the values specified. It ends up doing the same thing as your lambda, just in a generic way.
def matches(df, values=None, **kwargs):
    values = values or (False, True)
    flt = None
    for var, value in kwargs.items():
        t = (df[var] == value)
        flt = (flt & t) if flt is not None else t
    flt = flt.apply(lambda t: values[t])
    return flt
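A minimal usage sketch on the sample frame from the first answer (column names assumed from the question):

import pandas as pd

df = pd.DataFrame({'WINNER': [1, 1, 0], 'WINNER_OVER_2_5': [0, 1, 2]})
df['TARGET_NAME'] = matches(df, values=(0, 1), WINNER=1, WINNER_OVER_2_5=1)
print(df['TARGET_NAME'].tolist())  # [0, 1, 0]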

Maybe you can try casting the boolean result directly:
df = pd.DataFrame({'a': [1, 0, 1, 0], 'b': [1, 1, 0, np.nan]})
df['NEW'] = ((df['a'] == 1) & (df['b'] == 1)).astype(int).fillna(0)
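A quick check of that approach; note that the comparison with NaN already evaluates to False, so the cast alone would be enough:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 1, 0], 'b': [1, 1, 0, np.nan]})
df['NEW'] = ((df['a'] == 1) & (df['b'] == 1)).astype(int)
print(df['NEW'].tolist())  # [1, 0, 0, 0]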

Related

Pandas: Selecting rows in a table

I am trying to select these rows (where T-stage = 3 AND N-stage = 0 AND Radiation = 1) from three columns (T-stage, N-stage, and Radiation) with Python from the table below. I used the following, but the results were not what I expected:
df = pd.read_csv('Mydata.csv')  # loading my data
#I tried the two approaches below, but the results were not what I expected.
A = ((df['T-stage'] == 3) | (df['N-stage'] == 0 | (df['Radiation'] == 1)))
or
B = ((df['T-stage'] == 3) & (df['N-stage'] == 0 & (df['Radiation'] == 1)))
There seems to be a () mismatch; wrap each condition in its own ():
B = df[(df['T-stage'] == 3) & (df['N-stage'] == 0) & (df['Radiation'] == 1)]
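A small self-contained check of that filter on invented data (column names taken from the question):

import pandas as pd

df = pd.DataFrame({'T-stage': [3, 2, 3], 'N-stage': [0, 0, 1], 'Radiation': [1, 1, 1]})
B = df[(df['T-stage'] == 3) & (df['N-stage'] == 0) & (df['Radiation'] == 1)]
print(B)  # only the first row satisfies all three conditions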

What if I want to make a new column based on an old column's range?

I want to make a new column based on an old column's date range:
df['block']= np.where((df['transacted_date']> '2016-06-01') & (df['transacted_date']< '2016-09-01') ,0,'None')
df['block']= np.where((df['transacted_date']> '2016-09-01') & (df['transacted_date']< '2016-12-01') ,1,'None')
Is there a way to do this with an if/elif statement?
Try using np.select:
m1 = (df['transacted_date'] > '2016-06-01') & (df['transacted_date'] < '2016-09-01')
m2 = (df['transacted_date'] > '2016-09-01') & (df['transacted_date'] < '2016-12-01')
df['block'] = np.select(condlist=[m1, m2],
                        choicelist=[0, 1],
                        default=None)
Use numpy.select with Series.between:
m1 = df['transacted_date'].between('2016-06-01', '2016-09-01', inclusive='neither')
m2 = df['transacted_date'].between('2016-09-01', '2016-12-01', inclusive='neither')
df['block'] = np.select([m1,m2], [0,1], default=None)
If you need an if-else solution:
def f(x):
    if (x > pd.Timestamp('2016-06-01')) and (x < pd.Timestamp('2016-09-01')):
        return 0
    elif (x > pd.Timestamp('2016-09-01')) and (x < pd.Timestamp('2016-12-01')):
        return 1
    else:
        return None

df['block'] = df['transacted_date'].apply(f)
If you need a more general solution, use cut with numpy.where, because cut cannot create None or NaN labels:
b = pd.to_datetime([pd.Timestamp.min,'2016-06-01','2016-09-01','2016-12-01',pd.Timestamp.max])
s = pd.cut(df['transacted_date'], bins=b, labels=[-2, 0, 1, -1])
df['block1'] = np.where(s.astype(int) >= 0, s, np.nan)
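A runnable toy example of the np.select route (dates invented for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'transacted_date': pd.to_datetime(['2016-07-15', '2016-10-02', '2017-01-20'])})
m1 = df['transacted_date'].between('2016-06-01', '2016-09-01', inclusive='neither')
m2 = df['transacted_date'].between('2016-09-01', '2016-12-01', inclusive='neither')
df['block'] = np.select([m1, m2], [0, 1], default=None)
print(df['block'].tolist())  # [0, 1, None]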

How to form another column in a pd.DataFrame out of different variables

I'm trying to create a new boolean variable from an if-statement with multiple conditions on other variables. So far, none of my many attempts work, not even with a single variable as a parameter.
(head of the used columns in the data frame)
I would really appreciate it if anyone can spot the problem; I have already spent two days searching the whole World Wide Web, but as a beginner I couldn't find the solution yet.
amount = df4['AnzZahlungIDAD']
time = df4['DLZ_SCHDATSCHL']
Erstr = df4['Schadenwert']
Zahlges = df4['zahlgesbrut']
timequantil = time.quantile(.2)
diff = (Erstr-Zahlges)/Erstr*100
diffrange = [(diff <=15) & (diff >= -15)]
special = df4[['Taxatoreneinsatz', 'Belegpruefereinsatz_rel', 'IntSVKZ', 'ExtTechSVKZ']]
First Method with list comprehension
label = []
label = [True if (amount[i] <= 1) & (time[i] <= timequantil) & (diff == diffrange) & (special == 'N') else False for i in label]
label
Second Method with iterrows()
df4['label'] = pd.Series([])
df4['label'] = [True if (row[amount] <= 1) & (row[time] <= timequantil) & (row[diff] == diffrange) & (row[special] == 'N') else False for row in df4.iterrows()]
df4['label']
3rd Method with Lambda function
df4.loc[:,'label'] = '1'
df4['label'] = df4['label'].apply([lambda c: True if (c[amount] <= 1) & (c[time] <= timequantil) & (c[diff] == diffrange) & (c[special]) == 'N' else False for c in df4['label']], axis = 0)
df4['label'].value_counts()
I expected to get a variable "label" in my dataframe df4 that is either True or False.
Other attempts gave me only all values = False or all = True, even when I used only a single parameter, which is impossible given the data.
First Method runs fine but Outputs: []
Second Method gives me following error: TypeError: tuple indices must be integers or slices, not Series
Third Method does not load at all.
IIUC, try this
time = df4['DLZ_SCHDATSCHL']
Erstr = df4['Schadenwert']
Zahlges = df4['zahlgesbrut']
# timequantil = time.quantile(.2)
diff = (Erstr - Zahlges) / Erstr * 100

df4['label'] = ((df4['AnzZahlungIDAD'] <= 1)
                & (time <= time.quantile(.2))
                & (diff <= 15) & (diff >= -15)
                & (df4['Belegpruefereinsatz_rel'] == 'N')
                & (df4['Taxatoreneinsatz'] == 'N')
                & (df4['ExtTechSVKZ'] == 'N')
                & (df4['IntSVKZ'] == 'N'))
Given your dataset I got the following output:
   Anz  dlz  sch      zal taxa bel int ext  label
0    2   82  200   253.80    N   N   N   J  False
1    2   82  200   253.80    N   N   N   J  False
2    1  153  200   323.68    N   J   N   N  False
3    1  153  200   323.68    N   J   N   N  False
4    1  191  500  1252.12    N   J   N   N  False
Note: don't mind the abbreviations used in the column names.
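If the chain of & terms gets unwieldy, the four flag columns can also be checked in one step; a sketch, assuming the same column names and the time/diff variables defined above:

flags_all_N = df4[['Taxatoreneinsatz', 'Belegpruefereinsatz_rel', 'IntSVKZ', 'ExtTechSVKZ']].eq('N').all(axis=1)
df4['label'] = ((df4['AnzZahlungIDAD'] <= 1)
                & (time <= time.quantile(.2))
                & diff.between(-15, 15)
                & flags_all_N)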

Optimizing while loop in Python

I have a piece of code that takes forever to run. Does anybody know how to optimize it?
The purpose of the code is to build the column as follows: where 'action' != 0, populate 'buy_sell' with -1 if 'PX_LAST' < 'ma' and with 1 if 'PX_LAST' > 'ma'; in all other cases, do not populate 'buy_sell' with new values.
FYI, column 'action' is populated with either 0 or 1.
# create column
df_zinc['buy_sell'] = 0

index = 0
while index < df_zinc.shape[0]:
    if df_zinc['action'][index] != 0:
        continue
    if df_zinc['PX_LAST'][index] < df_zinc['ma'][index]:
        df_zinc.loc[index, 'buy_sell'] = -1
    elif df_zinc['PX_LAST'][index] > df_zinc['ma'][index]:
        df_zinc.loc[index, 'buy_sell'] = 1
    else:
        index = index + 1
I think you need:
import numpy as np
mask1 = df_zinc['action'] != 0
mask2 = df_zinc['PX_LAST'] < df_zinc['ma']
mask3 = df_zinc['PX_LAST'] > df_zinc['ma']
df_zinc['buy_sell'] = np.select([mask1 & mask2, mask1 & mask3], [-1,1], 0)
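A quick check on invented data (column names taken from the question):

import numpy as np
import pandas as pd

df_zinc = pd.DataFrame({'action': [1, 1, 0, 1],
                        'PX_LAST': [10, 20, 15, 30],
                        'ma': [15, 15, 15, 30]})
mask1 = df_zinc['action'] != 0
mask2 = df_zinc['PX_LAST'] < df_zinc['ma']
mask3 = df_zinc['PX_LAST'] > df_zinc['ma']
df_zinc['buy_sell'] = np.select([mask1 & mask2, mask1 & mask3], [-1, 1], 0)
print(df_zinc['buy_sell'].tolist())  # [-1, 1, 0, 0]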

writing function in pandas/python

I have just started to learn Python and don't have much of a dev background. Here is the code I have written while learning.
I now want to write a function that does exactly what my "for" loop is doing, but it needs to calculate a different exp (exp, exp1, etc.) based on a different num (num, num1, etc.).
How can I do this?
import pandas as pd

index = [0, 1]
s = pd.Series(['a', 'b'], index=index)
t = pd.Series([1, 2], index=index)
t1 = pd.Series([3, 4], index=index)
df = pd.DataFrame(s, columns=["str"])
df["num"] = t
df['num1'] = t1
print(df)

exp = []
for index, row in df.iterrows():
    if row['str'] == 'a':
        row['mul'] = -1 * row['num']
        exp.append(row['mul'])
    else:
        row['mul'] = 1 * row['num']
        exp.append(row['mul'])
df['exp'] = exp
print(df)
This is what I was trying to do, which gives wrong results:
import pandas as pd

index = [0, 1]
s = pd.Series(['a', 'b'], index=index)
t = pd.Series([1, 2], index=index)
t1 = pd.Series([3, 4], index=index)
df = pd.DataFrame(s, columns=["str"])
df["num"] = t
df['num1'] = t1

def f(x):
    exp = []
    for index, row in df.iterrows():
        if row['str'] == 'a':
            row['mul'] = -1 * x
            exp.append(row['mul'])
        else:
            row['mul'] = 1 * x
            exp.append(row['mul'])
    return exp

df['exp'] = df['num'].apply(f)
df['exp1'] = df['num1'].apply(f)
df
Per suggestion below, I would do:
df['exp']=np.where(df.str=='a',df['num']*-1,df['num']*1)
df['exp1']=np.where(df.str=='a',df['num1']*-1,df['num1']*1)
I think you are looking for np.where:
df['exp'] = np.where(df.str == 'a', df['num'] * -1, df['num'] * 1)
df
Out[281]:
  str  num  num1  exp
0   a    1     3   -1
1   b    2     4    2
Normal dataframe operation:
df["exp"] = df.apply(lambda x: x["num"] * (-1 if x["str"] == "a" else 1), axis=1)
Mathematical dataframe operation:
df["exp"] = (0.5 - (df["str"] == 'a')) * 2 * df["num"]
