I am trying to select the rows where T-stage = 3 AND N-stage = 0 AND Radiation = 1, based on three columns (T-stage, N-stage, and Radiation), with Python from the table below. I used the following, but the result is not what I expected:
import pandas as pd

df = pd.read_csv('Mydata.csv')  # load my data
#I tried the two approaches below, but the results were not what I expected.
A = ((df['T-stage'] == 3) | (df['N-stage'] == 0 | (df['Radiation'] == 1)))
or
B = ((df['T-stage'] == 3) & (df['N-stage'] == 0 & (df['Radiation'] == 1)))
It looks like there is a parenthesis mismatch; wrap each condition in its own parentheses:
B = df[(df['T-stage'] == 3) & (df['N-stage'] == 0) & (df['Radiation'] == 1)]
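If you prefer a single readable expression, the same filter can also be written with DataFrame.query; this is just a sketch assuming the column names above (names containing a hyphen must be wrapped in backticks inside query()):

import pandas as pd

df = pd.read_csv('Mydata.csv')
# backticks are needed because 'T-stage' and 'N-stage' contain a hyphen
B = df.query('`T-stage` == 3 and `N-stage` == 0 and Radiation == 1')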
Related
I have a loop where I constantly get an error message.
print(((df1['col1_df'] == 0) & (df1['col2_df'] == True)).sum())
print(((df1['col1_df'] == 0) & (df1['col3_df'] == True)).sum())
print(((df1['col1_df'] == 0) & (df1['col4_df'] == True)).sum())
print(((df1['col1_df'] == 0) & (df1['col5_df'] == True)).sum())
print(((df2['col1_df'] == 0) & (df2['col2_df'] == True)).sum())
print(((df2['col1_df'] == 0) & (df2['col3_df'] == True)).sum())
print(((df2['col1_df'] == 0) & (df2['col4_df'] == True)).sum())
print(((df2['col1_df'] == 0) & (df2['col5_df'] == True)).sum())
I want to loop them through a function.
So far I have:
for i in range(2,5):
    col = "col{}_df".format(i)
    print(((df['col'] == 0) & (df['col'] == 2)).sum())
How can I number the df and let the df go through 1, 2, 3, 4 (like df1, df2 df3)
col is a variable; 'col' is a string literal. Writing df['col'] looks up a column literally named "col", so it never uses your variable col.
The string formatting itself, col = "col{}_df".format(i), is fine; the problem is that you never use col afterwards.
Also, range(2,5) gives you [2,5), not [2,5]; the end point is not inclusive, so it only yields 2, 3 and 4.
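Putting those fixes together, a minimal corrected version of your loop for df1 might look like this (a sketch, assuming the column names from your prints):

for i in range(2, 6):           # 2, 3, 4, 5
    col = "col{}_df".format(i)  # use the variable col, not the string 'col'
    print(((df1['col1_df'] == 0) & (df1[col] == True)).sum())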
You can express your entire code in two lines using a comprehension:
print(*( ((df1['col1_df'] == 0) & (df1[f'col{i}_df'] == True)).sum() for i in range(2,6) ), sep="\n")
print(*( ((df2['col1_df'] == 0) & (df2[f'col{i}_df'] == True)).sum() for i in range(2,6) ), sep="\n")
The expression ( ((df1['col1_df'] == 0) & (df1[f'col{i}_df'] == True)).sum() for i in range(2,6) ) creates a generator object yielding the 4 sums one by one.
The * unpacks the elements of this generator and passes them to print as if you had written a comma-separated argument list. The sep="\n" ensures that each of these arguments is separated by a new line in the output.
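As for numbering the dataframes: you normally don't build names like df1, df2 dynamically; instead, put the frames in a list or tuple and loop over it. A small sketch, assuming df1 and df2 already exist:

for frame in (df1, df2):
    for i in range(2, 6):
        col = "col{}_df".format(i)
        print(((frame['col1_df'] == 0) & (frame[col] == True)).sum())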
I'm trying to create a new boolean variable with an if-statement using multiple conditions on other variables, but so far none of my many attempts works, not even with a single variable as parameter.
[Image: head of the used columns in the data frame]
I would really appreciate it if any of you could spot the problem; I have already been searching the web for two days, but as a beginner I haven't found the solution yet.
amount = df4['AnzZahlungIDAD']
time = df4['DLZ_SCHDATSCHL']
Erstr = df4['Schadenwert']
Zahlges = df4['zahlgesbrut']
timequantil = time.quantile(.2)
diff = (Erstr-Zahlges)/Erstr*100
diffrange = [(diff <=15) & (diff >= -15)]
special = df4[['Taxatoreneinsatz', 'Belegpruefereinsatz_rel', 'IntSVKZ', 'ExtTechSVKZ']]
First Method with list comprehension
label = []
label = [True if (amount[i] <= 1) & (time[i] <= timequantil) & (diff == diffrange) & (special == 'N') else False for i in label]
label
Second Method with iterrows()
df4['label'] = pd.Series([])
df4['label'] = [True if (row[amount] <= 1) & (row[time] <= timequantil) & (row[diff] == diffrange) & (row[special] == 'N') else False for row in df4.iterrows()]
df4['label']
Third Method with lambda function
df4.loc[:,'label'] = '1'
df4['label'] = df4['label'].apply([lambda c: True if (c[amount] <= 1) & (c[time] <= timequantil) & (c[diff] == diffrange) & (c[special]) == 'N' else False for c in df4['label']], axis = 0)
df4['label'].value_counts()
I expected to get a variable "label" in my dataframe df4 that is either True or False.
A few attempts gave me all values = False or all = True, even when I used only a single parameter, which is impossible given the data.
The first method runs fine but outputs: []
The second method gives me the following error: TypeError: tuple indices must be integers or slices, not Series
The third method does not finish running at all.
IIUC, try this
time = df4['DLZ_SCHDATSCHL']
Erstr = df4['Schadenwert']
Zahlges = df4['zahlgesbrut']
# timequantil = time.quantile(.2)
diff = (Erstr-Zahlges)/Erstr*100
df4['label'] = ((df4['AnzZahlungIDAD'] <= 1) & (time <= time.quantile(.2))
                & (diff <= 15) & (diff >= -15)
                & (df4['Belegpruefereinsatz_rel'] == 'N') & (df4['Taxatoreneinsatz'] == 'N')
                & (df4['ExtTechSVKZ'] == 'N') & (df4['IntSVKZ'] == 'N'))
Given your dataset, I got the following output:
Anz dlz sch zal taxa bel int ext label
0 2 82 200 253.80 N N N J False
1 2 82 200 253.80 N N N J False
2 1 153 200 323.68 N J N N False
3 1 153 200 323.68 N J N N False
4 1 191 500 1252.12 N J N N False
Note: Don't mind the abbreviations used in column name
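If you want to avoid repeating the == 'N' comparison four times, those four conditions can be collapsed with DataFrame.eq and all(axis=1); an untested sketch using the same columns:

special_cols = ['Taxatoreneinsatz', 'Belegpruefereinsatz_rel', 'IntSVKZ', 'ExtTechSVKZ']
all_n = df4[special_cols].eq('N').all(axis=1)   # True where all four columns are 'N'
df4['label'] = ((df4['AnzZahlungIDAD'] <= 1)
                & (time <= time.quantile(.2))
                & diff.between(-15, 15)
                & all_n)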
I have data represented using pandas DataFrame, which for example looks as follows:
| id | entity | name | value | location
where id is an integer value, entity is an integer, name is a string, value is an integer, and location is a string (for example US, CA, UK, etc.).
Now, I want to add a new column to this data frame, column "flag", where values are assigned as follows:
for d in df.iterrows():
    if d.entity == 10 and d.value != 1000 and d.location == "CA":
        d.flag = "A"
    elif d.entity != 10 and d.entity != 0 and d.value == 1000 and d.location == "US":
        d.flag = "C"
    elif d.entity == 0 and d.value == 1000 and d.location == "US":
        d.flag = "B"
    else:
        print("Different case")
Is there a way to speed this up and use some built in functions instead of the for loop?
Use np.select, to which you pass a list of conditions; based on those conditions you give it choices, and you can specify a default value for when none of the conditions is met.
import numpy as np

conditions = [
    (df.entity == 10) & (df.value != 1000) & (df.location == 'CA'),
    (df.entity != 10) & (df.entity != 0) & (df.value == 1000) & (df.location == 'US'),
    (df.entity == 0) & (df.value == 1000) & (df.location == 'US')
]
choices = ["A", "C", "B"]
df['flag'] = np.select(conditions, choices, default="Different case")
Add parentheses around each comparison and combine them with the bitwise AND operator & so the masks work with numpy.select:
m = [
    (df.entity == 10) & (df.value != 1000) & (df.location == 'CA'),
    (df.entity != 10) & (df.entity != 0) & (df.value == 1000) & (df.location == 'US'),
    (df.entity == 0) & (df.value == 1000) & (df.location == 'US')
]
df['flag'] = np.select(m, ["A", "C", "B"], default="Different case")
You wrote "find all columns which fulfill a set of conditions", but your code shows you're actually trying to add a new column whose value for each row is computed from the values of other columns of the same row.
If that's indeed the case, you can use df.apply, giving it a function that computes the value for a specific row:
def flag_value(row):
    if row.entity == 10 and row.value != 1000 and row.location == "CA":
        return "A"
    elif row.entity != 10 and row.entity != 0 and row.value == 1000 and row.location == "US":
        return "C"
    elif row.entity == 0 and row.value == 1000 and row.location == "US":
        return "B"
    else:
        return "Different case"
df['flag'] = df.apply(flag_value, axis=1)
Take a look at this related question for more information.
If you truly want to find all rows which satisfy a set of conditions, the usual way to do this with a Pandas dataframe is to use df.loc and boolean indexing:
only_a_cases = df.loc[(df.entity == 10) & (df.value != 1000) & (df.location == "CA")]
# or:
only_a_cases = df.loc[lambda df: (df.entity == 10) & (df.value != 1000) & (df.location == "CA")]
I have a piece of code that takes forever to run. Does anybody know how to optimize it?
The purpose of the code is to build a column that behaves as follows: when 'action' != 0, populate 'buy_sell' with -1 if 'PX_LAST' < 'ma' and with 1 if 'PX_LAST' > 'ma'; in all other cases, leave 'buy_sell' unchanged.
FYI, column 'action' is populated with either 0 or 1.
# create column
df_zinc['buy_sell'] = 0

index = 0
while index < df_zinc.shape[0]:
    if df_zinc['action'][index] != 0:
        continue
    if df_zinc['PX_LAST'][index] < df_zinc['ma'][index]:
        df_zinc.loc[index, 'buy_sell'] = -1
    elif df_zinc['PX_LAST'][index] > df_zinc['ma'][index]:
        df_zinc.loc[index, 'buy_sell'] = 1
    else:
        index = index + 1
I think you need:
import numpy as np
mask1 = df_zinc['action'] != 0
mask2 = df_zinc['PX_LAST'] < df_zinc['ma']
mask3 = df_zinc['PX_LAST'] > df_zinc['ma']
df_zinc['buy_sell'] = np.select([mask1 & mask2, mask1 & mask3], [-1,1], 0)
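An equivalent alternative that only touches the rows where the conditions hold (leaving the initial 0 everywhere else) is boolean indexing with .loc, reusing the same masks:

df_zinc['buy_sell'] = 0
df_zinc.loc[mask1 & mask2, 'buy_sell'] = -1
df_zinc.loc[mask1 & mask3, 'buy_sell'] = 1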
I am looking for a way to simplify the examples below:
self.df[TARGET_NAME] = self.df.apply(lambda row: 1 if row['WINNER'] == 1 and row['WINNER_OVER_2_5'] == 1 else 0, axis=1)
like:
self.df[TARGET_NAME] = self.df[(self.df.WINNER == 1)] & \
                       self.df[(self.df.WINNER_OVER_2_5 == 1)]  # no, this is not correct
and a more complex one below:
df["PROFIT"] = np.where((df[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"] == 0),
df['MATCH_HOME'] * df['HOME_STAKE'],
np.where((dfml[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"] == 1),
df['MATCH_DRAW'] * df['DRAW_STAKE'],
np.where((df[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"] == 2),
df['MATCH_AWAY'] * df['AWAY_STAKE'],
-0))).astype(float)
IIUC you can use isin:
print(df)
WINNER WINNER_OVER_2_5
0 1 0
1 1 1
2 0 2
df['TARGET_NAME'] = np.where((df.WINNER.isin([1]) & df.WINNER_OVER_2_5.isin([1])),1,0)
print(df)
WINNER WINNER_OVER_2_5 TARGET_NAME
0 1 0 0
1 1 1 1
2 0 2 0
EDIT (untested, because no data):
df["PROFIT"] = np.where((df[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"].isin([0])),
df['MATCH_HOME'] * df['HOME_STAKE'],
np.where((dfml[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"].isin([1])),
df['MATCH_DRAW'] * df['DRAW_STAKE'],
np.where((df[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"].isin([2])),
df['MATCH_AWAY'] * df['AWAY_STAKE'],
0))).astype(float)
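The nested np.where calls can also be flattened with np.select, which some find easier to read. This is an untested sketch with the same column names; I have assumed the dfml in your second condition was meant to be df:

import numpy as np

conditions = [
    (df[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"] == 0),
    (df[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"] == 1),
    (df[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"] == 2),
]
choices = [
    df['MATCH_HOME'] * df['HOME_STAKE'],
    df['MATCH_DRAW'] * df['DRAW_STAKE'],
    df['MATCH_AWAY'] * df['AWAY_STAKE'],
]
df["PROFIT"] = np.select(conditions, choices, default=0).astype(float)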
I am looking for a way to simplify the examples below:
I guess you're looking for a simpler syntax. How about this:
df['MATCH'] = matches(df, values=(0,1), WINNER=1, WINNER_OVER_2_5=1)
Note that values= is optional and takes any tuple (false-value, true-value), defaulting to (False, True).
To get there it takes a bit of magic. Essentially this builds a truth table by chaining the conditions and transforming the result into the values as specified. It ends up doing the same thing as your lambda, just in a generic way.
def matches(df, values=None, **kwargs):
    values = values or (False, True)
    flt = None
    for var, value in kwargs.items():   # iteritems() on Python 2
        t = (df[var] == value)
        flt = (flt & t) if flt is not None else t
    flt = flt.apply(lambda t: values[t])
    return flt
Maybe you can try it with a boolean mask:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 1, 0], 'b': [1, 1, 0, np.nan]})
df['NEW'] = ((df['a'] == 1) & (df['b'] == 1)).astype(int).fillna(0)