Selection in dataframe base on multiple condition - python

I am developping a dashboard using dash.
The user can select different parameters and a dataframe is updated (6 parameters).
The idea was to do :
filtering = []
if len(filter1)>0:
filtering.append("df['col1'].isin(filter1)")
if len(filter2)>0:
filtering.append("df['col2'].isin(filter2)")
condition = ' & '.join(filtering)
df.loc[condition]
But I have a key error, what i understand, as condition is a string.
Any advice on how I can do it ? What is the best practise ?
NB : I have a solution with if condition but I would like to maximise this part, avoiding the copy of the dataframe (>10 millions of rows).
dff = df.copy()
if len(filter1)>0:
dff = dff.loc[dff.col1.isin(filter1)]
if len(filter2)>0:
dff = dff.loc[dff.col2.isin(filter2)]

you can use eval:
filtering = []
if len(filter1)>0:
filtering.append("df['col1'].isin(filter1)")
if len(filter2)>0:
filtering.append("df['col2'].isin(filter2)")
condition = ' & '.join(filtering)
df.loc[eval(condition)]

You can merge the masks using the & operator and only apply the merged mask once
from functools import reduce
filters = []
if len(filter1)>0:
filters.append(df.col1.isin(filter1))
if len(filter2)>0:
filters.append(df.col2.isin(filter2))
if len(filters) > 0:
final_filter = reduce(lambda a, b: a&b, filters)
df = df.loc[final_filter]

Related

How to filter this dataframe?

I have a large dataframe (sample). I was filtering the data according to this code:
A = [f"A{i}" for i in range(50)]
B = [f"B{i}" for i in range(50)]
C = [f"C{i}" for i in range(50)]
for i in A:
cond_A = (df[i]>= -0.0423) & (df[i]<=3)
filt_df = df[cond_A]
for i in B:
cond_B = (filt_df[i]>= 15) & (filt_df[i]<=20)
filt_df2 = filt_df[cond_B]
for i in C:
cond_C = (filt_df2[i]>= 15) & (filt_df2[i]<=20)
filt_df3 = filt_df2[cond_B]
When I print filt_df3, I am getting only an empty dataframe - why?
How can I improve the code, other approaches like some advanced techniques?
I am not sure the code above works as outlined in the edit below?
I would like to know how can I change the code, such that it works as outlined in the edit below?
Edit:
I want to remove the rows based on columns (A0 - A49) based on cond_A.
Then filter the dataframe from 1 based on columns (B0 - B49) with cond_B.
Then filter the dataframe from 2 based on columns (C0 - C49) with cond_C.
Thank you very much in advance.
It seems to me that there is an issue with your codes when you are using the iteration to do the filtering. For example, filt_df is being overwritten in every iteration of the first loop. When the loop ends, filt_df only contains the data filtered with the conditions set in the last iteration. Is this what you intend to do?
And if you want to do the filtering efficient, you can try to use pandas.DataFrame.query (see documentation here). For example, if you want to filter out all rows with column B0 to B49 containing values between 0 and 200 inclusive, you can try to use the Python codes below (assuming that you have imported the raw data in the variable df below).
condition_list = [f'B{i} >= 0 & B{i} <= 200' for i in range(50)]
filter_str = ' & '.join(condition_list)
subset_df = df.query(filter_str)
print(subset_df)
Since the column A1 contains only -0.057 which is outside [-0.0423, 3] everything gets filtered out.
Nevertheless, you seem not to take over the filter in every loop as filt_df{1|2|3} is reset.
This should work:
import pandas as pd
A = [f"A{i}" for i in range(50)]
B = [f"B{i}" for i in range(50)]
C = [f"C{i}" for i in range(50)]
filt_df = df.copy()
for i in A:
cond_A = (df[i] >= -0.0423) & (df[i]<=3)
filt_df = filt_df[cond_A]
filt_df2 = filt_df.copy()
for i in B:
cond_B = (filt_df[i]>= 15) & (filt_df[i]<=20)
filt_df2 = filt_df2[cond_B]
filt_df3 = filt_df2.copy()
for i in C:
cond_C = (filt_df2[i]>= 15) & (filt_df2[i]<=20)
filt_df3 = filt_df3[cond_B]
print(filt_df3)
Of course you will find a lot of filter tools in the pandas library that can be applied to multiple columns
For example this:
https://stackoverflow.com/a/39820329/6139079
You can filter by all columns together with DataFrame.all for test if all rows match together:
A = [f"A{i}" for i in range(50)]
cond_A = ((df[A] >= -0.0423) & (df[A]<=3)).all(axis=1)
B = [f"B{i}" for i in range(50)]
cond_B = ((df[B]>= 15) & (df[B]<=20)).all(axis=1)
C = [f"C{i}" for i in range(50)]
cond_C = ((df[C]>= 15) & (df[C]<=20)).all(axis=1)
And last chain all masks by & for bitwise AND:
filt_df = df[cond_A & cond_B & cond_C]
If get empty DataFrame it seems no row satisfy all conditions.

How to do multiple queries?

I want to do a multiple queries. Here is my data frame:
data = {'Name':['Penny','Ben','Benny','Mark','Ben1','Ben2','Ben3'],
'Eng':[5,1,4,3,1,2,3],
'Math':[1,5,3,2,2,2,3],
'Physics':[2,5,3,1,1,2,3],
'Sports':[4,5,2,3,1,2,3],
'Total':[12,16,12,9,5,8,12],
'Group':['A','A','A','A','A','B','B']}
df1=pd.DataFrame(data, columns=['Name','Eng','Math','Physics','Sports','Total','Group'])
df1
I have 3 queries:
Group A or B
Math > Eng
Name starts with 'B'
I tried to do it one by one
df1[df1.Name.str.startswith('B')]
df1.query('Math > Eng')
df1[df1.Group == 'A'] #I cannot run the code with df1[df1.Group == 'A' or 'B']
Then, I tried to merge those queries
df1.query("'Math > Eng' & 'df1[df1.Name.str.startswith('B')]' & 'df1[df1.Group == 'A']")
TokenError: ('EOF in multi-line statement', (2, 0))
I also tried to pass str.startswith() into df.query()
df1.query("df1.Name.str.startswith('B')")
UndefinedVariableError: name 'df1' is not defined
I have tried lots of ways but no one works. How can I put those queries together?
The long way to solve this – and the one with the most transparency, so best for beginners – is to create a boolean column for each filter. Then sum those columns as one final filter:
df1['filter_1'] = df1['Group'].isin(['A','B'])
df1['filter_2'] = df1['Math'] > df1['Eng']
df1['filter_3'] = df1['Name'].str.startswith('B')
# If all are true
df1['filter_final'] = df1[['filter_1', 'filter_2', 'filter_3']].all(axis=1)
You can certainly combine these steps into one:
mask = ((df1['Group'].isin(['A','B'])) &
(df1['Math'] > df1['Eng']) &
(df1['Name'].str.startswith('B'))
)
df['filter_final'] = mask
Lastly, selecting rows which satisfy your filter is done as follows:
df_filtered = df1[df1['filter_final']]
This selects rows from df1 where final_filter == True
Firstly, the answer is:
df1.query("Math > Eng & Name.str.startswith('B') & Group=='A'")
Additional comments
In query, the column's name doesn't accompany the data frame's name.
df1[df1.Group.isin(['A', 'B'])] or df1.query("Group in ['A', 'B']") instead of df1[df1.Group == 'A' or 'B']

Another way of passing orderby list in pyspark windows method

I just had a below concern in performing window operation on pyspark dataframe. I want to get the latest records from the input table with the below condition, but I want to exclude the for loop:
groupby_col = ["col('customer_id')"]
orderby_col = ["col('process_date').desc()", "col('load_date').desc()"]
window_spec = Window.partitionBy(*groupby_col).orderBy([eval(x) for x in orderby_col])
df = df.withColumn("rank", rank().over(window_spec))
df = df.filter(col('rank') == '1')
My concern, is I'm using the orderby_col and evaluating to covert in columner way using eval() and for loop to check all the orderby columns in the list.
Could you please let me know how we can pass multiple columns in order by without having a for loop to do the descending order??
import pyspark.sql.functions as f
groupby_col = ["col('customer_id')"]
orderby_col = ["col('process_date')", "col('load_date')"]
window_spec = Window.partitionBy(*groupby_col).orderBy(f.desc(*orderby_col))
df = df.withColumn("rank", f.rank().over(window_spec))
df = df.filter(col('rank') == '1')

I want to run a loop with condition and save all outputs as dataframes with different names

I wrote an function which only depends on a dataframe. The functions output is also a dataframe. I would like make different dataframes according a condition and save them as different datasets with different names. However I couldnt save them as dataframes with different names. Instead i manually do the process. Is there a code which would do the same. It would be much beneficial.
import os
import numpy as np
import pandas as pd
data1 = pd.read_csv('C:/Users/Oz/Desktop/vintage/vintage1.csv', encoding='latin-1')
product_list= data1['product_types'].unique()
def vintage_table(df):
df['Disbursement_Date']=pd.to_datetime(df.Disbursement_Date)
df['Closing_Date']=pd.to_datetime(df.Closing_Date)
df['NPL_date']=pd.to_datetime(df.NPL_date, errors='ignore')
df['NPL_date_period']=df.loc[df.NPL_date > '2015-01-01', 'NPL_date'].apply(lambda x: x.strftime('%Y-%m'))
df['Dis_date_period'] = df.Disbursement_Date.apply(lambda x: x.strftime('%Y-%m'))
df['diff']=((df.NPL_date-df.Disbursement_Date) / np.timedelta64(3, 'M')).round(0)
df=df.groupby(['Dis_date_period','NPL_date_period']).agg({'Dis_amount' : 'sum', 'NPL_amount' : 'sum', 'diff' : 'mean'})
df.reset_index(level=0, inplace=True)
df['Vintage_Ratio']=df['NPL_amount']/df['Dis_amount']
table=pd.pivot_table(df,values='Vintage_Ratio',index='Dis_date_period',columns=['diff'],).fillna(0)
return
The above is the function
#for e in product_list:
# sub = data1[data1['product_types'] == e]
# print(sub)
consumer = data1[data1['product_types'] == product_list[0]]
mortgage = data1[data1['product_types'] == product_list[1]]
vehicle = data1[data1['product_types'] == product_list[2]]
table_con = vintage_table(consumer)
table_mor = vintage_table(mortgage)
table_veh = vintage_table(vehicle)
I would like to improve this part is there a better way to do the same process?
You could have your vintage_table() function return a dataframe instead of just modifying one dataframe over and over and that way you could do this in the second code block:
table_con = vintage_table(consumer)
table_mor = vintage_table(mortgage)
table_veh = vintage_table(vechicle)

pandas: setting last N rows of multi-index to Nan for speeding up groupby with shift

I am trying to speed up my groupby.apply + shift and
thanks to this previous question and answer: How to speed up Pandas multilevel dataframe shift by group? I can prove that it does indeed speed things up when you have many groups.
From that question I now have the following code to set the first entry in each multi-index to Nan. And now I can do my shift globally rather than per group.
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
but I want to look forward, not backwards, and need to do calculations across N rows. So I am trying to use some similar code to set the last N entries to NaN, but obviously I am missing some important indexing knowledge as I just can't figure it out.
I figure I want to convert this so that every entry is a range rather than a single integer. How would I do that?
# the start of each group, ignoring the first entry
df.groupby(level=0).size().cumsum()[1:]
Test setup (for backwards shift) if you want to try it:
length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0,groups):
tmpdf = pd.DataFrame({'date':rng1,'category':int(10000000*abs(np.random.randn())),'colA':np.random.randn(length),'colB':np.random.randn(length)})
frames.append(tmpdf)
df = pd.concat(frames)
df.sort(columns=['category','date'],inplace=True)
df.set_index(['category','date'],inplace=True,drop=True)
df['tmpShift'] = df['colB'].shift(1)
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmp',1,inplace=True)
Thanks!
I ended up doing it using a groupby apply as follows (and coded to work forwards or backwards):
def replace_tail(grp,col,N,value):
if (N > 0):
grp[col][:N] = value
else:
grp[col][N:] = value
return grp
df = df.groupby(level=0).apply(replace_tail,'tmpShift',2,np.nan)
So the final code is:
def replace_tail(grp,col,N,value):
if (N > 0):
grp[col][:N] = value
else:
grp[col][N:] = value
return grp
length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0,groups):
tmpdf = pd.DataFrame({'date':rng1,'category':int(10000000*abs(np.random.randn())),'colA':np.random.randn(length),'colB':np.random.randn(length)})
frames.append(tmpdf)
df = pd.concat(frames)
df.sort(columns=['category','date'],inplace=True)
df.set_index(['category','date'],inplace=True,drop=True)
shiftBy=-1
df['tmpShift'] = df['colB'].shift(shiftBy)
df = df.groupby(level=0).apply(replace_tail,'tmpShift',shiftBy,np.nan)
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift',1,inplace=True)

Categories

Resources