Masking a DataFrame on multiple column conditions - inside a loop - python

I would like to mask my dataframe conditional on multiple columns inside a loop. I am trying to do something like this:
dfs = []
val_dict = {0: 'a', 1: 'b', 2: 'c', 3: 'd'}
for i in range(4):
    items = [val_dict[i] for i in range(i+1)]
    df_ = df[(df['0'] == items[0]) & (df['1'] == items[1]) & ... ]
    dfs.append(df_)
Please note that the second condition I wrote above would not exist for the first iteration of the loop because there would be no items[1] element.
Here is a sample dataframe you are welcome to test on:
df = pd.DataFrame({'0': ['a']*3 + ['b']*3 + ['c']*3,
                   '1': ['a']*3 + ['b']*6,
                   '2': ['b']*4 + ['c']*5,
                   '3': ['c']*5 + ['d']*4})
The only solution I have come up with uses eval which I would like very much to avoid.

If you subset your DataFrame to include only the columns you want to use for comparison (as you have done in your example) and the keys in your val_dict are the same as the columns you want to compare, then you can get Pandas to do this for you.
Making a slight modification to your df
df = pd.DataFrame({0: ['a']*3 + ['b']*3 + ['c']*3,
                   1: ['a']*3 + ['d']*6,
                   2: ['b']*4 + ['c']*5,
                   3: ['c']*5 + ['a']*4})
You can now accomplish what you want by the following
dfs = []
val_dict = {0: 'a', 1: 'b', 2: 'c', 3: 'd'}
val_series = pd.Series(val_dict)
for i in range(4):
    mask = (df == val_series).all(axis=1)
    dfs.append(df[mask])
EDIT
I am leaving my original solution even though it addresses a different problem than OP intended to solve. The intended problem can be solved by the following:
mask = True
for key in range(4):
    mask &= df[key] == val_dict[key]
    dfs.append(df[mask])
Again, this is using the modified df used earlier in my original answer.
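For anyone who wants to run the cumulative-mask idea end to end, here is a minimal sketch. The sample frame below is hypothetical (it is not the df from the question or the answer); I picked it so that each extra condition removes exactly one row. The mask starts as an all-True Series rather than a bare True, which is my own small variation on the loop above.

```python
import pandas as pd

# Hypothetical data: row 0 matches every condition, rows 1-3 fail one step
# later each time, and row 4 fails the very first condition.
df = pd.DataFrame({
    0: ['a', 'a', 'a', 'a', 'z'],
    1: ['b', 'b', 'b', 'x', 'b'],
    2: ['c', 'c', 'x', 'x', 'c'],
    3: ['d', 'x', 'x', 'x', 'd'],
})
val_dict = {0: 'a', 1: 'b', 2: 'c', 3: 'd'}

dfs = []
# Start with every row selected, then AND in one more column condition per step.
mask = pd.Series(True, index=df.index)
for key in range(4):
    mask = mask & (df[key] == val_dict[key])
    dfs.append(df[mask])

sizes = [len(d) for d in dfs]
print(sizes)  # [4, 3, 2, 1]
```

Each element of dfs is the frame filtered on the first 1, 2, 3 and 4 columns respectively, which is what the eval version builds, without any string evaluation.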

I'll share my eval solution.
for i in range(4):
    items = [val_dict[i] for i in range(i+1)]
    df_ = eval('df[(' + ') & ('.join(['df["'+str(j)+'"] == items['+str(j)+']' for j in range(i+1)]) + ')]')
    dfs.append(df_)
It works... but so ugly :(

Related

How to properly write if-then lambda statement for pandas df?

I have the following code:
data = [[11001218, 'Value', 93483.37, 'G', '', 93483.37, '', '56117J100', 'FRA', 'Equity'],
[11001218, 'Value', 3572.73, 'G', 3572.73, '', '56117J100', '', 'LUM', 'Equity'],
[11001218, 'Value', 89910.64, 'G', 89910.64, '', '56117J100', '', 'WAR', 'Equity'],
[11005597, 'Value', 72640313.34,'L','',72640313.34, 'REVR21964', '','IN2', 'Repo']
]
df = pd.DataFrame(data, columns = ['ID', 'Type', 'Diff', 'Group', 'Amount','Amount2', 'Id2', 'Id3', 'Executor', 'Name'])
def logic_builder(row, row2, row3):
    if row['Name'] == 'Repo' and row['Group'] == 'L':
        return 'Fine resultant'
    elif (row['ID'] == row2['ID']) and (row['ID'] == row3['ID']) and (row['Group'] == row2['Group']) and (row['Group'] == row3['Group']) and (row['Executor'] != row2['Executor']) and (row['Executor'] != row3['Executor']):
        return 'Difference in Executor'
df['Results'] = df.apply(lambda row: logic_builder(row, row2, row3), axis=1)
If you look at the first 3 rows, they are all technically the same. They contain the same ID, Type, Group, and Name. The only difference is the executor, hence I would like my if-then statement to return "Difference in Executor". I am having trouble figuring out how to write the if-then so that it looks at all the rows with matching values in the fields I mentioned above.
Thank you.
You can pass a single row, then determine its index and look for the other rows with df.iloc[index].
Here is an example:
def logic_builder(row):
    global df  # you need to access the df
    i = row.name  # row index
    # get next rows
    try:
        row2 = df.iloc[i+1]
        row3 = df.iloc[i+2]
    except IndexError:
        return
    if row['Name'] == 'Repo' and row['Group'] == 'L':
        return 'Fine resultant'
    elif (row['ID'] == row2['ID']) and (row['ID'] == row3['ID']) and (row['Group'] == row2['Group']) and (row['Group'] == row3['Group']) and (row['Executor'] != row2['Executor']) and (row['Executor'] != row3['Executor']):
        return 'Difference in Executor'
df['Results'] = df.apply(logic_builder, axis=1)
Of course, since the result depends on the next two rows, you can't run it on the last 2 rows of the dataframe.
You can modify the function a bit to perform on a chunk/slice of a dataframe, based on group using groupby since you are performing the action per group. A modified version of the function you have written would look something like this:
def logic_builder(group):
    if group['Name'].eq('Repo').all() and group['Group'].eq('L').all():
        return 'Fine resultant'
    elif group['Group'].nunique()==1 and group['Executor'].nunique()>1:
        return 'Difference in Executor'
Comparing row1, row2, row3, ..., rowN individually is not going to work, because a group may contain any number of rows. A better strategy is to express the if/else logic with all and nunique (which gives the number of unique values in the selected column).
Then apply the function on groupby object:
df.groupby('ID').apply(logic_builder)
ID
11001218    Difference in Executor
11005597            Fine resultant
dtype: object
You can finally join above value to the actual dataframe if needed.
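One possible way to do that join, sketched under a few assumptions: the frame below is a trimmed version of the question's data (only the columns the rule reads), and labels is a helper name I made up. Instead of groupby().apply() it builds a plain dict of ID to label and maps it back onto the ID column, so every row receives its group's result.

```python
import pandas as pd

# Trimmed version of the question's data: only the columns the rule uses.
df = pd.DataFrame({
    'ID': [11001218, 11001218, 11001218, 11005597],
    'Group': ['G', 'G', 'G', 'L'],
    'Executor': ['FRA', 'LUM', 'WAR', 'IN2'],
    'Name': ['Equity', 'Equity', 'Equity', 'Repo'],
})

def logic_builder(group):
    # Same per-group rule as in the answer above.
    if group['Name'].eq('Repo').all() and group['Group'].eq('L').all():
        return 'Fine resultant'
    elif group['Group'].nunique() == 1 and group['Executor'].nunique() > 1:
        return 'Difference in Executor'
    return None

# One label per ID (hypothetical helper dict), then broadcast back to the rows.
labels = {group_id: logic_builder(group) for group_id, group in df.groupby('ID')}
df['Results'] = df['ID'].map(labels)
print(df)
```

The three rows with ID 11001218 all end up with 'Difference in Executor' and the Repo row with 'Fine resultant'.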

How to transpose values from top few rows in python dataframe into new columns

I am trying to select the values from the top 3 records of each group in a python sorted dataframe and put them into new columns. I have a function that is processing each group but I am having difficulties finding the right method to extract, rename the series, then combine the result as a single series to return.
Below is a simplified example of an input dataframe (df_in) and the expected output (df_out):
import pandas as pd
data_in = {'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
           'Price': [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0],
           'Qty': [15, 13, 14, 10, 5, 2, 1]}
df_in = pd.DataFrame(data_in, columns=['Product', 'Price', 'Qty'])
I am reproducing below 2 examples of the functions I've tested and trying to get a more efficient option that works, especially if I have to process many more columns and records.
Function best3_prices_v1 works, but I have to explicitly specify each column or variable, which becomes a real problem as I add more columns.
def best3_prices_v1(x):
    d = {}
    # get best 3 records if records available, else set volumes as zeroes
    best_price_lv1 = x.iloc[0].copy()
    rec_with_zeroes = best_price_lv1.copy()
    rec_with_zeroes['Price'] = 0
    rec_with_zeroes['Qty'] = 0
    recs = len(x)  # number of records
    if (recs == 1):
        # 2nd and 3rd records not available
        best_price_lv2 = rec_with_zeroes.copy()
        best_price_lv3 = rec_with_zeroes.copy()
    elif (recs == 2):
        best_price_lv2 = x.iloc[1]
        # 3rd record not available
        best_price_lv3 = rec_with_zeroes.copy()
    else:
        best_price_lv2 = x.iloc[1]
        best_price_lv3 = x.iloc[2]
    # 1st best
    d['Price_1'] = best_price_lv1['Price']
    d['Qty_1'] = best_price_lv1['Qty']
    # 2nd best
    d['Price_2'] = best_price_lv2['Price']
    d['Qty_2'] = best_price_lv2['Qty']
    # 3rd best
    d['Price_3'] = best_price_lv3['Price']
    d['Qty_3'] = best_price_lv3['Qty']
    # return combined results as a series
    return pd.Series(d, index=['Price_1', 'Qty_1', 'Price_2', 'Qty_2', 'Price_3', 'Qty_3'])
Code to call the function:
# sort dataframe by Product and Price
df_in.sort_values(by=['Product', 'Price'], ascending=True, inplace=True)
# get best 3 prices and qty as new columns
df_out = df_in.groupby(['Product']).apply(best3_prices_v1).reset_index()
Second attempt, trying to reduce the code and the explicit names for each variable. It is not complete and not working.
def best3_prices_v2(x):
    d = {}
    # get best 3 records if records available, else set volumes as zeroes
    best_price_lv1 = x.iloc[0].copy()
    rec_with_zeroes = best_price_lv1.copy()
    rec_with_zeroes['Price'] = 0
    rec_with_zeroes['Qty'] = 0
    recs = len(x)  # number of records
    if (recs == 1):
        # 2nd and 3rd records not available
        best_price_lv2 = rec_with_zeroes.copy()
        best_price_lv3 = rec_with_zeroes.copy()
    elif (recs == 2):
        best_price_lv2 = x.iloc[1]
        # 3rd record not available
        best_price_lv3 = rec_with_zeroes.copy()
    else:
        best_price_lv2 = x.iloc[1]
        best_price_lv3 = x.iloc[2]
    stats_columns = ['Price', 'Qty']
    # get records values for best 3 prices
    d_lv1 = best_price_lv1[stats_columns]
    d_lv2 = best_price_lv2[stats_columns]
    d_lv3 = best_price_lv3[stats_columns]
    # How to rename (keys?) or combine values to return?
    lv1_stats_columns = [c + '_1' for c in stats_columns]
    lv2_stats_columns = [c + '_2' for c in stats_columns]
    lv3_stats_columns = [c + '_3' for c in stats_columns]
    # return combined results as a series
    return pd.Series(d, index=lv1_stats_columns + lv2_stats_columns + lv3_stats_columns)
Let's unstack():
df_in=(df_in.set_index([df_in.groupby('Product').cumcount().add(1),'Product'])
.unstack(0,fill_value=0))
df_in.columns=[f"{x}_{y}" for x,y in df_in]
df_in=df_in.reset_index()
OR via pivot()
df_in=(df_in.assign(key=df_in.groupby('Product').cumcount().add(1))
       .pivot(index='Product', columns='key', values=['Price','Qty'])
       .fillna(0,downcast='infer'))
df_in.columns=[f"{x}_{y}" for x,y in df_in]
df_in=df_in.reset_index()
Based on @AnuragDabas's pivot solution and @ceruler's feedback above, I can now expand it to a more general problem.
New dataframe with more groups and columns:
data_in = {'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
           'Model': ['A1', 'A1', 'A1', 'A2', 'B1', 'C1', 'C1'],
           'Price': [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0],
           'Qty': [15, 13, 14, 10, 5, 2, 1],
           'Ratings': [9, 7, 8, 10, 6, 7, 8]}
df_in = pd.DataFrame(data_in, columns=['Product', 'Model', 'Price', 'Qty', 'Ratings'])
group_list = ['Product', 'Model']
stats_list = ['Price','Qty', 'Ratings']
df_out = df_in.groupby(group_list).head(3)
df_out=(df_out.assign(key=df_out.groupby(group_list).cumcount().add(1))
        .pivot(index=group_list, columns='key', values=stats_list)
        .fillna(0,downcast='infer'))
df_out.columns=[f"{x}_{y}" for x,y in df_out]
df_out = df_out.reset_index()
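To see what the cumcount/pivot combination is doing, here is a small self-contained sketch on the simpler dataframe from the question (one grouping column). The intermediate names top3 and wide are mine. cumcount numbers the rows inside each Product group (1, 2, 3 after adding 1), and pivot then turns that number into the column suffix. I use a plain fillna(0) here, so the filled Qty columns come out as floats.

```python
import pandas as pd

df_in = pd.DataFrame({
    'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
    'Price': [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0],
    'Qty': [15, 13, 14, 10, 5, 2, 1],
})

# Keep at most the 3 cheapest records per product.
top3 = df_in.sort_values(['Product', 'Price']).groupby('Product').head(3)
# Rank of each record inside its group: 1, 2, 3.
top3 = top3.assign(key=top3.groupby('Product').cumcount() + 1)
# One row per product, one column per (statistic, rank) pair; missing ranks become 0.
wide = top3.pivot(index='Product', columns='key', values=['Price', 'Qty']).fillna(0)
# Flatten the two-level column labels, e.g. ('Price', 1) becomes 'Price_1'.
wide.columns = [f"{name}_{rank}" for name, rank in wide.columns]
df_out = wide.reset_index()
print(df_out)
```

Product B has a single record, so its Price_2, Price_3, Qty_2 and Qty_3 values are filled with 0, which matches what best3_prices_v1 was doing by hand.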

Google cloud NL API data to Pandas Dataframe

I'm using the Google NL API (sample_classify_text).
It's sending me data that I transformed into this format:
response_list = [[['a', 'b', 'c'], [1,2,3], ['url1']], [['d'], [4], ['url2']]]
From here I'd like to build a Pandas df that looks like this:
a  b  c  1  2  3  url1
d        4        url2
Note that the number of results differs per URL (a, b, c = 3 results; d = 1 result). Most of the time the number of results seems to be below 4, but I'm not sure about this, so I'd like to keep it flexible.
I've tried a few things, but it gets pretty complicated. I'm wondering if there's an easy way to handle that?
Have you tried creating a Pandas DF directly from the list?
For example:
import pandas as pd
response_list = [[['a', 'b', 'c'], [1,2,3], ['url1']], [['d'], [4], ['url2']]]
df = pd.DataFrame(response_list)
The result of the print(df) is:
           0          1       2
0  [a, b, c]  [1, 2, 3]  [url1]
1        [d]        [4]  [url2]
That's what I ended up doing.
Not the most elegant solution...
Please don't tell me this can be done with a one-liner :D
import pandas as pd
response_list = [[['a', 'b', 'c'], [1,2,3], ['url1']], [['d'], [4], ['url2']]]
colum_0, colum_1, colum_2, colum_3, colum_4, colum_5, colum_6 = [None],[None],[None],[None],[None],[None],[None]  # to create the columns
for main_list in response_list:
    for idx_macro, sub_list in enumerate(main_list):
        for idx, elem in enumerate(sub_list):
            if idx_macro == 0:
                if idx == 0:
                    colum_0.append(elem)
                if idx == 1:
                    colum_1.append(elem)
                if idx == 2:
                    colum_2.append(elem)
            elif idx_macro == 1:
                if idx == 0:
                    colum_3.append(elem)
                if idx == 1:
                    colum_4.append(elem)
                if idx == 2:
                    colum_5.append(elem)
            elif idx_macro == 2:
                colum_6.append(elem)
colum_lists = [colum_0, colum_1, colum_2, colum_3, colum_4, colum_5, colum_6]
longest_list = 3
colum_lists2 = []
for lst in colum_lists[:-1]:  # skip urls
    while len(lst) < longest_list:
        lst.append(None)
    colum_lists2.append(lst)
colum_lists2.append(colum_6)  # add urls
df = pd.DataFrame(colum_lists2)
df = df.transpose()
df = df.drop(0)
display(df)
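Not quite a one-liner, but here is a shorter sketch that keeps the number of results flexible. It pads each inner list to the widest one found and builds one row per URL. The helper pad and the column names (category_1, score_1, url, ...) are made up for illustration; they are not anything the API returns.

```python
import pandas as pd

response_list = [[['a', 'b', 'c'], [1, 2, 3], ['url1']], [['d'], [4], ['url2']]]

# Width of the widest result list, so nothing is hard-coded to 3.
width = max(len(cats) for cats, _, _ in response_list)

def pad(values, size):
    """Return a copy of values padded with None up to the given size."""
    return list(values) + [None] * (size - len(values))

# One flat row per URL: padded categories, padded scores, then the URL itself.
rows = [pad(cats, width) + pad(scores, width) + list(url)
        for cats, scores, url in response_list]

# Hypothetical column names.
columns = ([f'category_{i + 1}' for i in range(width)]
           + [f'score_{i + 1}' for i in range(width)]
           + ['url'])
df = pd.DataFrame(rows, columns=columns)
print(df)
```

Because each row is padded on its own, a short result that arrives before a long one stays in its own row instead of shifting values from other URLs upward.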

Select rows from a DataFrame using .loc and multiple conditions and then show the row corresponding to the min/max of one column

I know how to select data using .loc and multiple conditions, like so:
df.loc[(df['A'] == True)&(df['B'] == 'Tuesday')]
But from the result of this I can't figure out how to show the entire row corresponding to the min (or max) taken on one other column of numbers, 'C'. How do I do this?
Use this:
df2 = df.loc[(df['A'] == True)&(df['B'] == 'Tuesday')]
df2.loc[df2.C == df2.C.min(), :]
Use this:
for columns:
df.loc[(df['A'] == True)&(df['B'] == 'Tuesday')].apply(max, axis=0)
for rows:
df.loc[(df['A'] == True)&(df['B'] == 'Tuesday')].apply(max, axis=1)
You can use the idxmin or idxmax functions.
Docs for the idxmin function: "Return the row label of the minimum value. If multiple values equal the minimum, the first row label with that value is returned."
So, if you run
df.loc[df.loc[(df['A'] == True) & (df['B'] == 'Tuesday'), 'C'].idxmin()]
this will return the first row with the minimum value of column C among the rows that satisfy both conditions.
The easiest option:
df = pd.DataFrame({
'A': [True,False,True,True],
'B': ['Sun', 'Mon', 'Tue', 'Tue'],
'C': [1,4,5,1],
'D': [10,20,30,40]})
print(df.query(""" A == True and B == 'Tue' and C == C.min() """))
      A    B  C   D
3  True  Tue  1  40
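One caveat with the query version: as far as I can tell, C.min() there is evaluated on the whole column, not only on the rows that pass the A and B conditions, so it can return nothing when the overall minimum sits in a row that was filtered out. A sketch that filters first and then looks up the min/max inside the filtered rows (same sample data as above; subset, row_min, row_max and all_min are names I chose):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [True, False, True, True],
    'B': ['Sun', 'Mon', 'Tue', 'Tue'],
    'C': [1, 4, 5, 1],
    'D': [10, 20, 30, 40]})

# Step 1: apply the conditions.
subset = df.loc[df['A'] & (df['B'] == 'Tue')]

# Step 2: idxmin/idxmax give the row label of the first min/max of C
# within the filtered rows; .loc then returns that whole row.
row_min = subset.loc[subset['C'].idxmin()]
row_max = subset.loc[subset['C'].idxmax()]

# If ties matter, keep every row that equals the filtered minimum.
all_min = subset[subset['C'] == subset['C'].min()]
print(row_min)
```

Here row_min is the row labelled 3 (C = 1) and row_max is the row labelled 2 (C = 5).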

searching in a pandas df that contains ranges

I have a pandas df that contains 2 columns 'start' and 'end' (both are integers). I would like an efficient method to search for rows such that the range that is represented by the row [start,end] contains a specific value.
Two additional notes:
It is possible to assume that ranges don't overlap
The solution should support a batch mode - that given a list of inputs, the output will be a mapping (dictionary or whatever) to the row indices that contain the matching range.
For example:
   start   end
0   7216  7342
1   7343  7343
2   7344  7471
3   7472  8239
4   8240  8495
and the query of
[7215,7217,7344]
will result in
{7217: 0, 7344: 2}
Thanks!
Brute force solution, could use lots of improvements though.
df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
'end': [7342, 7343, 7471, 8239, 8495]})
search = [7215, 7217, 7344]
res = {}
for i in search:
    mask = (df.start <= i) & (df.end >= i)
    idx = df[mask].index.values
    if len(idx):
        res[i] = idx[0]
print(res)
Yields
{7344: 2, 7217: 0}
Selected solution
This new solution could perform better, but it has a limitation: it only works if there are no gaps between ranges, as in the example provided.
# Test data
df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
'end': [7342, 7343, 7471, 8239, 8495]}, columns=['start','end'])
query = [7215,7217,7344]
# Reshaping the original DataFrame
df = df.reset_index()
df = pd.concat([df['start'], df['end']]).reset_index()
df = df.set_index(0).sort_index()
# Creating a DataFrame with a continuous index
max_range = max(df.index) + 1
min_range = min(df.index)
s = pd.DataFrame(index=range(min_range,max_range))
# Joining them
s = s.join(df)
# Filling the gaps
s = s.bfill()
# Then a simple selection gives the result
s.loc[s.index.intersection(query), :].dropna().to_dict()['index']
# Result
{7217: 0.0, 7344: 2.0}
Previous proposal
import numpy as np
import pandas as pd
# Test data
df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
'end': [7342, 7343, 7471, 8239, 8495]}, columns=['start','end'])
# Constructing a DataFrame containing the query numbers
query = [7215,7217,7344]
result = pd.DataFrame(np.tile(query, (len(df), 1)), columns=query)
# Merging the data and the query
df = pd.concat([df, result], axis=1)
# Making the test
df = df.apply(lambda x: (x >= x['start']) & (x <= x['end']), axis=1).loc[:,query]
# Keeping only values found
df = df[df==True]
df = df.dropna(how='all', axis=0).dropna(how='all', axis=1)
# Extracting to the output format
result = df.to_dict('split')
result = dict(zip(result['columns'], result['index']))
# The result
{7217: 0, 7344: 2}
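Another option for the batch case, sketched under the question's assumption that the ranges don't overlap: sort by start and let NumPy's searchsorted do a binary search for each query value. Unlike the backfill approach, this should also cope with gaps between ranges, because each hit is checked against the end of the candidate range. The names ordered, pos and valid are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
                   'end': [7342, 7343, 7471, 8239, 8495]})
query = np.array([7215, 7217, 7344])

# Binary search needs the starts in ascending order.
ordered = df.sort_values('start')
starts = ordered['start'].to_numpy()
ends = ordered['end'].to_numpy()

# Position of the last range whose start is <= the query value (-1 if none).
pos = np.searchsorted(starts, query, side='right') - 1
# A hit requires a candidate range and a value that does not exceed its end.
valid = (pos >= 0) & (query <= ends[np.clip(pos, 0, None)])

# Map each matched value to the original row label of its range.
result = {int(v): int(ordered.index[p]) for v, p in zip(query[valid], pos[valid])}
print(result)  # {7217: 0, 7344: 2}
```

The lookup is one vectorized call for the whole list of inputs, so the cost per query is logarithmic in the number of ranges rather than a full scan.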
