I have data represented as a pandas DataFrame, which looks, for example, as follows:
| id | entity | name | value | location |
where id is an integer, entity is an integer, name is a string, value is an integer, and location is a string (for example US, CA, UK, etc.).
Now, I want to add a new column to this data frame, column "flag", where values are assigned as follows:
for _, d in df.iterrows():
    # note: mutating the row returned by iterrows() does not write back to df
    if d.entity == 10 and d.value != 1000 and d.location == "CA":
        d.flag = "A"
    elif d.entity != 10 and d.entity != 0 and d.value == 1000 and d.location == "US":
        d.flag = "C"
    elif d.entity == 0 and d.value == 1000 and d.location == "US":
        d.flag = "B"
    else:
        print("Different case")
Is there a way to speed this up and use some built-in functions instead of the for loop?
Use np.select, which takes a list of conditions and a matching list of choices, and lets you specify a default value for when none of the conditions is met.
import numpy as np

conditions = [
    (df.entity == 10) & (df.value != 1000) & (df.location == 'CA'),
    (df.entity != 10) & (df.entity != 0) & (df.value == 1000) & (df.location == 'US'),
    (df.entity == 0) & (df.value == 1000) & (df.location == 'US')
]
choices = ["A", "C", "B"]
df['flag'] = np.select(conditions, choices, default="Different case")
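A quick sanity check with made-up data (the values here are invented purely for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "entity": [10, 5, 0, 7],
    "value": [500, 1000, 1000, 42],
    "location": ["CA", "US", "US", "UK"],
})
conditions = [
    (df.entity == 10) & (df.value != 1000) & (df.location == "CA"),
    (df.entity != 10) & (df.entity != 0) & (df.value == 1000) & (df.location == "US"),
    (df.entity == 0) & (df.value == 1000) & (df.location == "US"),
]
df["flag"] = np.select(conditions, ["A", "C", "B"], default="Different case")
# df.flag is now ["A", "C", "B", "Different case"]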
Wrap each condition in parentheses and use the bitwise AND operator (&) when building the masks for numpy.select:
m = [
    (df.entity == 10) & (df.value != 1000) & (df.location == 'CA'),
    (df.entity != 10) & (df.entity != 0) & (df.value == 1000) & (df.location == 'US'),
    (df.entity == 0) & (df.value == 1000) & (df.location == 'US')
]
df['flag'] = np.select(m, ["A", "C", "B"], default="Different case")
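The parentheses matter because & binds more tightly than the comparison operators. Without them, an expression like

df.entity == 10 & df.value != 1000

is parsed as df.entity == (10 & df.value) != 1000, a chained comparison, which typically fails with "ValueError: The truth value of a Series is ambiguous".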
You wrote "find all columns which fulfill a set of conditions", but your code shows you're actually trying to add a new column whose value for each row is computed from the values of other columns of the same row.
If that's indeed the case, you can use df.apply, giving it a function that computes the value for a specific row:
def flag_value(row):
    if row.entity == 10 and row.value != 1000 and row.location == "CA":
        return "A"
    elif row.entity != 10 and row.entity != 0 and row.value == 1000 and row.location == "US":
        return "C"
    elif row.entity == 0 and row.value == 1000 and row.location == "US":
        return "B"
    else:
        return "Different case"

df['flag'] = df.apply(flag_value, axis=1)
Take a look at this related question for more information.
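Keep in mind that df.apply with axis=1 calls flag_value once per row in plain Python, so it is more readable but generally slower on large frames than the vectorized np.select approach above.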
If you truly want to find all rows which fulfill a set of conditions, the usual way to do this with a Pandas dataframe is to use df.loc and boolean indexing (again, each condition parenthesized):
only_a_cases = df.loc[(df.entity == 10) & (df.value != 1000) & (df.location == "CA")]
# or:
only_a_cases = df.loc[lambda df: (df.entity == 10) & (df.value != 1000) & (df.location == "CA")]
Related
I am trying to select the rows where T-stage = 3 AND N-stage = 0 AND Radiation = 1, from the three columns (T-stage, N-stage, and Radiation) of my table. I used the following, but the results were not what I expected:
df = pd.read_csv('Mydata.csv')  # loading my data
# I tried the two approaches below, but the results were not what I expected.
A = ((df['T-stage'] == 3) | (df['N-stage'] == 0 | (df['Radiation'] == 1)))
or
B = ((df['T-stage'] == 3) & (df['N-stage'] == 0 & (df['Radiation'] == 1)))
There seems to be a parenthesis mismatch; wrap each condition in its own parentheses:
B = df[(df['T-stage'] == 3) & (df['N-stage'] == 0) & (df['Radiation'] == 1)]
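A minimal sketch with invented data (column names as in the question):

import pandas as pd

df = pd.DataFrame({
    "T-stage": [3, 2, 3],
    "N-stage": [0, 0, 1],
    "Radiation": [1, 1, 1],
})
B = df[(df["T-stage"] == 3) & (df["N-stage"] == 0) & (df["Radiation"] == 1)]
# keeps only the first row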
I have a loop that constantly gives me an error message.
print(((df1['col1_df'] == 0) & (df1['col2_df'] == True)).sum())
print(((df1['col1_df'] == 0) & (df1['col3_df'] == True)).sum())
print(((df1['col1_df'] == 0) & (df1['col4_df'] == True)).sum())
print(((df1['col1_df'] == 0) & (df1['col5_df'] == True)).sum())
print(((df2['col1_df'] == 0) & (df2['col2_df'] == True)).sum())
print(((df2['col1_df'] == 0) & (df2['col3_df'] == True)).sum())
print(((df2['col1_df'] == 0) & (df2['col4_df'] == True)).sum())
print(((df2['col1_df'] == 0) & (df2['col5_df'] == True)).sum())
I want to loop them through a function.
So far I have:
for i in range(2,5):
    col = "col{}_df".format(i)
    print(((df['col'] == 0) & (df['col'] == 2)).sum())
How can I also number the df and let it go through df1, df2, df3, and so on?
col is a variable, while 'col' is a string literal: df['col'] looks up a column literally named "col" and never uses your variable — you want df[col].
The string formatting itself, col = "col{}_df".format(i), is fine.
Also, range(2,5) gives you [2,5), not [2,5]; the end point is not inclusive, so use range(2,6) to cover columns 2 through 5.
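A corrected version of the loop might look like this (assuming the frames are df1 and df2 as in the question):

for df in (df1, df2):
    for i in range(2, 6):  # i = 2, 3, 4, 5
        col = "col{}_df".format(i)
        print(((df['col1_df'] == 0) & (df[col] == True)).sum())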
You can express your entire code in two lines using a comprehension:
print(*(((df1['col1_df'] == 0) & (df1[f'col{i}_df'] == True)).sum() for i in range(2,6)), sep="\n")
print(*(((df2['col1_df'] == 0) & (df2[f'col{i}_df'] == True)).sum() for i in range(2,6)), sep="\n")
The expression ((df1['col1_df'] == 0) & (df1[f'col{i}_df'] == True)).sum() for i in range(2,6) creates a generator object yielding the four sums one by one.
The * scatters the elements of this generator, passing them to print as if you had written a comma-separated argument list, and sep="\n" ensures the arguments are separated by a new line in the output.
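A stripped-down illustration of the same star-unpacking pattern:

print(*(i * i for i in range(2, 6)), sep="\n")
# prints 4, 9, 16 and 25, each on its own line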
I'm excluding rows from my df that fulfill certain conditions:
df[~((df['Wood_type'] == 'pine') & (df['wood_size'] == 20))]
I would also like to exclude the numbers that start with 0 in the column 'Serial':
df[~((df['Wood_type'] == 'pine') & (df['wood_size'] == 20) & (df['Serial'] == range(0) == 0))]
I tried the above, but got no result.
You will probably want to use Series.str.startswith to check the first character:
df[~((df['Wood_type'] == 'pine') & (df['wood_size'] == 20) & (df['Serial'].str.startswith('0')))]
The current expression df['Serial'] == range(0) == 0 is meaningless: as a chained comparison it is equivalent to (df['Serial'] == range(0)) and (range(0) == 0). Clearly, neither part has anything to do with comparing the first character of the string to '0' (as opposed to the integer 0).
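One caveat: .str.startswith only works on string values, and if 'Serial' was parsed as integers the leading zeros are already gone. Assuming the data comes from a CSV (file name hypothetical), you can force the column to stay a string when reading it:

df = pd.read_csv('mydata.csv', dtype={'Serial': str})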
I am looking for a way to simplify the examples below:
self.df[TARGET_NAME] = self.df.apply(lambda row: 1 if row['WINNER'] == 1 and row['WINNER_OVER_2_5'] == 1 else 0, axis=1)
like:
self.df[TARGET_NAME] = self.df[(self.df.WINNER == 1)] &
self.df[(self.df.WINNER_OVER_2_5 == 1)]  # note: this is not correct
and a more complex one as below:
df["PROFIT"] = np.where((df[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"] == 0),
                        df['MATCH_HOME'] * df['HOME_STAKE'],
                        np.where((df[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"] == 1),
                                 df['MATCH_DRAW'] * df['DRAW_STAKE'],
                                 np.where((df[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"] == 2),
                                          df['MATCH_AWAY'] * df['AWAY_STAKE'],
                                          0))).astype(float)
If I understand correctly, you can use isin:
print(df)
WINNER WINNER_OVER_2_5
0 1 0
1 1 1
2 0 2
df['TARGET_NAME'] = np.where((df.WINNER.isin([1]) & df.WINNER_OVER_2_5.isin([1])),1,0)
print(df)
WINNER WINNER_OVER_2_5 TARGET_NAME
0 1 0 0
1 1 1 1
2 0 2 0
EDIT (untested, because no data):
df["PROFIT"] = np.where((df[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"].isin([0])),
df['MATCH_HOME'] * df['HOME_STAKE'],
np.where((dfml[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"].isin([1])),
df['MATCH_DRAW'] * df['DRAW_STAKE'],
np.where((df[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"].isin([2])),
df['MATCH_AWAY'] * df['AWAY_STAKE'],
0))).astype(float)
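If the nesting gets deeper, np.select (used in the first answer above) flattens it; a sketch under the same column assumptions:

conditions = [
    (df[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"] == 0),
    (df[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"] == 1),
    (df[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"] == 2),
]
payouts = [
    df['MATCH_HOME'] * df['HOME_STAKE'],
    df['MATCH_DRAW'] * df['DRAW_STAKE'],
    df['MATCH_AWAY'] * df['AWAY_STAKE'],
]
df["PROFIT"] = np.select(conditions, payouts, default=0).astype(float)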
I am looking for a way to simplify the examples below:
I guess you're looking for a simpler syntax. How about this:
df['MATCH'] = matches(df, values=(0,1), WINNER=1, WINNER_OVER_2_5=1)
Note that values= is optional and takes any tuple of (false-value, true-value), defaulting to (False, True).
Getting there takes a bit of magic. Essentially this builds a truth table by chaining the conditions and transforming the result into the values specified. It ends up doing the same thing as your lambda, just in a generic way.
def matches(df, values=None, **kwargs):
    values = values or (False, True)
    flt = None
    # AND together one equality test per keyword argument
    for var, value in kwargs.items():
        t = (df[var] == value)
        flt = (flt & t) if flt is not None else t
    # map the boolean mask to the requested (false-value, true-value) pair
    flt = flt.apply(lambda t: values[t])
    return flt
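A quick illustration with invented data:

import pandas as pd

df = pd.DataFrame({"WINNER": [1, 1, 0], "WINNER_OVER_2_5": [0, 1, 2]})
df["MATCH"] = matches(df, values=(0, 1), WINNER=1, WINNER_OVER_2_5=1)
# MATCH: 0, 1, 0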
Maybe you can try with a boolean expression:
df = pd.DataFrame({'a': [1, 0, 1, 0], 'b': [1, 1, 0, np.nan]})
df['NEW'] = ((df['a'] == 1) & (df['b'] == 1)).astype(int).fillna(0)
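For this sample frame NEW comes out as 1, 0, 0, 0; note that NaN == 1 already evaluates to False, so the fillna(0) acts only as a safety net here.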
I have a where-condition query, and I want to make a dataframe with the fields that appear inside the where condition.
The question is how to extract those fields from inside the where condition.
I tried things like finding the substring before any operator (like ==, >=, &, /) using rstrip/lstrip, but was still not successful. I do believe some string-search method will do it, but I am not getting it.
My where condition is:
whereFields = "CITY_NAME == 'city1' & EVENT_GENRE == 'KIDS' & count_EVENT_GENRE >= 1$#$FAV_VENUE_CITY_NAME == 'city1' & EVENT_GENRE == 'FANTASY' & count_EVENT_GENRE >= 1$#$CITY_NAME == 'city1' & EVENT_GENRE == 'FESTIVAL' & count_EVENT_GENRE >= 1$#$CITY_NAME == 'city1' & EVENT_GENRE == 'WORKSHOP' & count_EVENT_GENRE >= 1$#$CITY_NAME == 'city1' & EVENT_GENRE == 'EXHIBITION' & count_EVENT_GENRE >= 1$#$CITY_NAME == 'city1' & FAV_GENRE == '|DRAMA|'$#$CITY_NAME == 'city1' & & FAV_GENRE == '|ACTION|ADVENTURE|SCI-FI"
I want the field names involved, i.e. my dataframe should have all the unique columns.
Any help will be appreciated.
import re

# note: the [...] are character classes, so the outer split breaks on any one of
# the characters & ( $ # ) and the inner split on any one of ( = ) >
res = [re.split(r'[(==)(>=)]', x)[0].strip() for x in re.split(r'[&($#$)]', whereFields)]
This seems to work. Now you may want only the unique, non-empty field names:
res = [x for x in list(set(res)) if x]
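On the whereFields above this yields, in some order (sets are unordered):

# ['CITY_NAME', 'EVENT_GENRE', 'count_EVENT_GENRE', 'FAV_VENUE_CITY_NAME', 'FAV_GENRE']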
A regex with a lookahead also works: grab every token that is followed by a comparison operator, then de-duplicate:
In [98]:
pd.DataFrame(data=pd.Series(re.findall(r'\w+ *(?==|<|>)', whereFields)).unique(), columns=['fields'])
Out[98]:
fields
0 CITY_NAME
1 EVENT_GENRE
2 count_EVENT_GENRE
3 FAV_VENUE_CITY_NAME
4 FAV_GENRE