Pandas substring search for filter - python

I have a use case where I need to validate each row in a df and mark whether it is correct or not. The validation rules are in another df.
Main
col1 col2
0 1 take me home
1 2 country roads
2 2 country roads take
3 4 me home
Rules
col3 col4
0 1 take
1 2 home
2 3 country
3 4 om
4 2 take
A row in Main is marked as Pass if the following condition holds for any row in Rules:
col1 == col3 and col4 is a substring of col2
Main (expected result)
col1 col2 result
0 1 take me home Pass
1 2 country roads Fail
2 2 country roads take Pass
3 4 me home Pass
My initial approach was to parse the Rules df, dynamically build a function from it, and then run:
def action_function(row) -> object:
    if self.combined_filter()(row):  # combined_filter() is the lambda equivalent of the Rules df
        return success_action(row)   # mark as Pass
    return fail_action(row)          # mark as Fail

Main["result"] = self.df.apply(action_function, axis=1)
This turned out to be very slow because apply is not vectorized. The Main df has about 3 million rows and the Rules df around 500 entries; the run takes around 3 hours.
I am trying to use a pandas merge for this, but substring matching is not supported by the merge operation, and I cannot split the words by space or anything similar.
This will be used as part of a system, so I cannot hardcode anything; I need to read the dataframes from Excel every time the system starts.
Can you please suggest an approach for this?

Merge and then apply the condition using np.where, i.e.
temp = main.merge(rules, left_on='col1', right_on='col3')
temp['results'] = temp.apply(lambda x: np.where(x['col4'] in x['col2'], 'Pass', 'Fail'), axis=1)
no_dupe_df = temp.drop_duplicates('col2', keep='last').drop(columns=['col3', 'col4'])
col1 col2 results
0 1 take me home Pass
2 2 country roads Fail
4 2 country roads take Pass
5 4 me home Pass
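One caveat: drop_duplicates(keep='last') only keeps the Pass row if it happens to come after the Fail rows for the same key. A variant that is robust to rule order, sketched below on the assumption that a Main row should pass when any of its matching rules hits (and fail when no rule matches its col1), aggregates the per-rule checks and maps them back onto Main; the list comprehension also avoids the row-wise apply, which tends to be much faster at 3 million rows:
import numpy as np

# keep Main's original row index through the merge
temp = main.reset_index().merge(rules, left_on='col1', right_on='col3')

# substring check per merged row; a list comprehension is usually far faster than DataFrame.apply
temp['hit'] = [c4 in c2 for c2, c4 in zip(temp['col2'], temp['col4'])]

# a Main row passes if ANY of its matching rules produced a hit
passed_idx = temp.loc[temp['hit'], 'index'].unique()
main['result'] = np.where(main.index.isin(passed_idx), 'Pass', 'Fail')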

Pandas: remove rows that are in another dataframe, comparison on a subset of columns

I have 2 dataframes, X and Y with the same columns, and I'm trying to remove rows in X that appear in Y, but I want to compare them only based on a subset of the columns.
Example:
>>> X
site_domain id url
0 a.com 1 ad_a.com/test
1 b.com 2 ad_b.com/test
2 c.com 3 ad_c.com/test
3 d.com 4 ad_d.com/test
4 e.com 5 ad_e.com/test
>>> Y
site_domain id url
0 a.com 1 ad_a.com/test
1 b.com 10 ad_b.com/test
2 other.com 3 ad_c.com/test
3 d.com 4 ad_other.com/test
I want to remove rows from X that appear in Y, in my definition, this means that columns site_domain and url must match, but I don't care about id. The result of my operation should thus be:
site_domain id url
0 c.com 3 ad_c.com/test
1 d.com 4 ad_d.com/test
2 e.com 5 ad_e.com/test
How could I do this? I think this would require some boolean mask applied to X, but I don't know how to generate a mask that applies to the index (so as to keep or reject entire rows at a time).
I tried to create such a mask with X['site_domain'] == Y['site_domain'] & X['url'] == Y['url'] and then taking the negation of it, but Pandas complains that these series are not identically labeled. I could probably make versions of these series with identical labels, but that feels like a lot of trouble for such a simple problem.
You can concatenate site_domain and url and use isin to check if the concatenated string is present in Y
X[~(X['site_domain']+'_'+X['url']).isin(Y['site_domain']+'_'+Y['url'])]
site_domain id url
2 c.com 3 ad_c.com/test
3 d.com 4 ad_d.com/test
4 e.com 5 ad_e.com/test
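If the concatenated key could ever be ambiguous (for example, if a site_domain value itself contains '_'), an alternative sketch is an anti-join via a left merge with indicator=True on just the columns that matter; the variable names below are illustrative:
# anti-join: keep rows of X whose (site_domain, url) pair never appears in Y
merged = X.merge(Y[['site_domain', 'url']].drop_duplicates(),
                 on=['site_domain', 'url'], how='left', indicator=True)
result = merged.loc[merged['_merge'] == 'left_only', X.columns].reset_index(drop=True)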

For each row, select all instances where value is not '0' (in any/all columns)

I have a df that looks something like this:
Col1 Col2 Col3 ColN
0 0 2 1
10 5 0 8
0 0 0 12
Trying to get a sum/mean of all the times a value has not been zero, for each row (and then add it as a 'Sum/Mean' column), to have output:
Col1 Col2 Col3 ColN Sum
0 0 2 1 2
10 5 0 8 3
0 0 0 12 1
In the df I'm recording the number of times an event has occurred. I'm trying to get the average number of occurrences or frequency (or, I guess, the number of times a value in a row has not been 0).
Is there some way to apply this dataframe-wide? I have about 2000 rows, and have been hacking away trying to use Counter but have managed to get the number of times something has been observed only for 1 row :(
Or maybe I should convert all non-zero numbers to a dummy variable, but then still don't know how to select and sum?
As yatu suggested,
df.ne(0).sum(1)
does the job. (Note: when I use it to do df['Sum'] = df.ne(0).sum(1), I get a warning message, but I don't really understand the implications)
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Actually, several rows with zeroes in the new column are still there afterwards (not sure why), so I also remove any rows whose Sum is zero (this is all very ugly, but I have no better idea...):
df = df[(df[['Sum']] != 0).all(axis=1)]
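The SettingWithCopyWarning usually means that df was itself produced by slicing another DataFrame, so pandas cannot tell whether the assignment will propagate back to the original. A common fix, sketched here on the assumption that df did come from such a slice, is to take an explicit copy before adding the column:
df = df.copy()                    # make df an independent DataFrame, silencing the warning
df['Sum'] = df.ne(0).sum(axis=1)  # count non-zero values per row
df = df[df['Sum'] != 0]           # drop rows where every column was zero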

Selecting rows based on last occurrence of a string in pandas

I have a pandas dataframe which looks something like this,
id desc
1 Description
1 02.09.2017 15:00 abcd
1 this is a sample description
1 which is continued here also
1
1 Description
1 01.09.2017 12:00 absd
1 this is another sample description
1 which might be continued here
1 or here
1
2 Description
2 09.03.2017 12:00 abcd
2 another sample again
2 and again
2
2 Description
2 08.03.2017 12:00 abcd
2 another sample again
2 and again times two
Basically, there is an id and the rows contain the information in a very unstructured format. I want to extract the description which is after the last "Description" row and store that in 1 row. The resulting dataframe would look something like this:
id desc
1 this is another sample description which might be continued here or here
2 another sample again and again times two
From what I am able to think, I might have to use groupby but I don't know what to do after that.
Extract the position of the last Description in each group and join the remaining rows using str.cat:
In [2840]: def lastjoin(x):
...: pos = x.desc.eq('Description').cumsum().idxmax()
...: return x.desc.loc[pos+2:].str.cat(sep=' ')
...:
In [2841]: df.groupby('id').apply(lastjoin)
Out[2841]:
id
1 this is another sample description which might...
2 another sample again and again times two
dtype: object
To have columns use reset_index
In [3216]: df.groupby('id').apply(lastjoin).reset_index(name='desc')
Out[3216]:
id desc
0 1 this is another sample description which might...
1 2 another sample again and again times two
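Note that pos + 2 is a label-based slice, so it assumes the default RangeIndex and that a timestamp line always follows each Description row. A positional variant under the same layout assumption (the function name lastjoin_pos is just illustrative) could look like this:
def lastjoin_pos(x):
    # position, within the group, of the last 'Description' row
    last = x.desc.eq('Description').to_numpy().nonzero()[0][-1]
    # skip the 'Description' row and the timestamp row that follows it
    return x.desc.iloc[last + 2:].str.cat(sep=' ')

df.groupby('id').apply(lastjoin_pos).reset_index(name='desc')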

How to merge by a column of collection using Python Pandas?

I have 2 lists of Stack Overflow questions, group A and group B. Both have two columns, Id and Tag. e.g:
| Id | Tag |
| --- | --- |
| 2 | c#,winforms,type-conversion,decimal,opacity |
For each question in group A, I need to find in group B all questions that share at least one tag with the question in group A, regardless of the position of the tags. For example, these questions should all be matched:
| Id | Tag |
| --- | --- |
| 3 | c# |
| 4 | winforms,type-conversion |
| 5 | winforms,c# |
My first thought was to convert the Tag variable into a set and merge using Pandas, because a set ignores position. However, it seems that Pandas doesn't allow a set to be the key variable. So I am now using a for loop to search over group B, but it is extremely slow since I have 13 million observations in group B.
My question is:
1. Is there any other way in Python to merge on a column of collections that can also tell me the number of overlapping tags?
2. How can I improve the efficiency of the for-loop search?
This can be achieved using df.merge and df.groupby.
This is the setup I'm working with:
df1 = pd.DataFrame({ 'Id' : [2], 'Tag' : [['c#', 'winforms', 'type-conversion', 'decimal', 'opacity']]})
Id Tag
0 2 [c#, winforms, type-conversion, decimal, opacity]
df2 = pd.DataFrame({ 'Id' : [3, 4, 5], 'Tag' : [['c#'], ['winforms', 'type-conversion'], ['winforms', 'c#']]})
Id Tag
0 3 [c#]
1 4 [winforms, type-conversion]
2 5 [winforms, c#]
Let's flatten out the Tag column in both data frames. This helps:
In [2331]: from itertools import chain
In [2332]: def flatten(df):
...: return pd.DataFrame({"Id": np.repeat(df.Id.values, df.Tag.str.len()),
...: "Tag": list(chain.from_iterable(df.Tag))})
...:
In [2333]: df1 = flatten(df1)
In [2334]: df2 = flatten(df2)
In [2335]: df1.head()
Out[2335]:
Id Tag
0 2 c#
1 2 winforms
2 2 type-conversion
3 2 decimal
4 2 opacity
And similarly for df2, which is also flattened.
Now for the magic. We'll do a merge on the Tag column, and then group by the joined Ids to find the count of overlapping tags.
In [2337]: df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y']).count().reset_index()
Out[2337]:
Id_x Id_y Tag
0 2 3 1
1 2 4 2
2 2 5 2
The output shows each pair of Ids along with the number of overlapping tags. Pairs with no overlap never appear, because the inner merge drops them.
The count counts overlapping tags, and reset_index just prettifies the output, since groupby assigns the grouped columns as the index, so we reset it.
To see matching tags, you'll modify the above slightly:
In [2359]: df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y'])['Tag'].apply(list).reset_index()
Out[2359]:
Id_x Id_y Tag
0 2 3 [c#]
1 2 4 [winforms, type-conversion]
2 2 5 [c#, winforms]
To filter out 1-overlaps, chain a df.query call to the first expression:
In [2367]: df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y']).count().reset_index().query('Tag > 1')
Out[2367]:
Id_x Id_y Tag
1 2 4 2
2 2 5 2
Step 1: list all tags.
Step 2: create a binary representation of each question, i.e. use bit 1 or 0 to indicate whether it has a given tag.
Step 3: to find the Ids that share a tag, call a simple apply function to decode the binary representation.
In terms of processing speed it should be all right; however, if the number of tags is very large, there might be memory issues. If you only need to find questions sharing a tag with a single Id, I would suggest writing a simple function and calling df.apply. If you need to check many Ids and find questions with shared tags, the approach above will be better.
(Intended to leave it as comment, but not enough reputation... sigh)
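A minimal sketch of that one-hot idea, assuming Tag is stored as a comma-separated string and using str.get_dummies (the frame names dfA and dfB are illustrative):
import pandas as pd

dfA = pd.DataFrame({'Id': [2], 'Tag': ['c#,winforms,type-conversion,decimal,opacity']})
dfB = pd.DataFrame({'Id': [3, 4, 5], 'Tag': ['c#', 'winforms,type-conversion', 'winforms,c#']})

# one row per question, one 0/1 column per tag
A = dfA.set_index('Id')['Tag'].str.get_dummies(sep=',')
B = dfB.set_index('Id')['Tag'].str.get_dummies(sep=',')

# align the tag columns, then count shared tags with a matrix product
A, B = A.align(B, join='outer', axis=1, fill_value=0)
pairs = pd.DataFrame(A.values @ B.values.T, index=A.index, columns=B.index)
print(pairs)  # entry (i, j) = number of tags shared by A-question i and B-question j
As noted above, a dense 0/1 matrix over 13 million questions would not fit in memory, so for that scale the flatten-and-merge approach is the safer choice.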

Understanding apply and groupby in Pandas

I'm trying to understand an example from Python for Data Analysis book by Wes McKinney. I've looked through the pandas cookbook, documentation and SO but can't find an example like this one.
The example looks at the 2012 Federal Election Commission Database (https://github.com/wesm/pydata-book/blob/master/ch09.ipynb). The code below determines the top donor occupations donating to Obama and Romney.
I'm struggling to understand how the function takes a groupby object and performs another groupby operation on it. When I run this outside of the function I get an error. Could somebody shed some light on this behaviour?
Thanks,
Iwan
# top donor occupations donating to Obama or Romney
def get_top_amounts(group, key, n=5):
    totals = group.groupby(key)['contb_receipt_amt'].sum()
    return totals.sort_values(ascending=False)[:n]

# first group by candidate
grouped = fec_mrbo.groupby('cand_nm')

# for each group, group again by contbr_occupation so we get a hierarchical index,
# sum the contribution amounts,
# then return the total amount per occupation for each candidate, sorted to give the top n
grouped.apply(get_top_amounts, 'contbr_occupation', n=5)
The result looks like this
cand_nm contbr_occupation
Obama, Barack RETIRED 25270507.23
ATTORNEY 11126932.97
INFORMATION REQUESTED 4849801.96
HOMEMAKER 4243394.30
PHYSICIAN 3732387.44
LAWYER 3159391.87
CONSULTANT 2459812.71
Romney, Mitt RETIRED 11266949.23
INFORMATION REQUESTED PER BEST EFFORTS 11173374.84
HOMEMAKER 8037250.86
ATTORNEY 5302578.82
PRESIDENT 2403439.77
EXECUTIVE 2230653.79
C.E.O. 1893931.11
When you use apply on a grouped dataframe, you are actually iterating over the groups and passing each group to the function you're applying.
Let's look at a simple example:
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 1, 1, 2, 2, 2, 2],
                   'col2': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8]})
grouped = df.groupby('col1')
Now let's create a simple function which allows us to see what's getting passed to the function:
def print_group(group):
    print(group)
    print('=' * 10)

grouped.apply(print_group)
col1 col2 value
0 1 a 1
1 1 b 2
2 1 a 3
3 1 b 4
==========
col1 col2 value
0 1 a 1
1 1 b 2
2 1 a 3
3 1 b 4
==========
col1 col2 value
4 2 a 5
5 2 b 6
6 2 a 7
7 2 b 8
==========
As you can see each group is getting passed to the function as a separate dataframe. And of course you can apply all the normal functions to this subset.
The fact that you see the first group twice is because apply calls the function an extra time on the first group to decide how to combine the results; it cannot be changed and it's not a bug ;).
Let's create another function to prove this:
def second_group_sum(group):
    res = group.groupby('col2').value.sum()
    print(res)
    print('=' * 10)
    return res

grouped.apply(second_group_sum)
col2
a 4
b 6
Name: value, dtype: int64
==========
col2
a 4
b 6
Name: value, dtype: int64
==========
col2
a 12
b 14
Name: value, dtype: int64
==========
You could even go further and do group-apply-group-apply-group-apply etc etc...
I hope this helps a bit in understanding what's going on.
By the way, if you use ipdb (a debugging tool) you can set a breakpoint in the applied function and interact with the group dataframe, as sketched below.
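A minimal sketch of that trick, assuming ipdb is installed (pip install ipdb); the function name inspect_group is just illustrative:
import ipdb

def inspect_group(group):
    ipdb.set_trace()  # execution pauses here for every group; inspect `group`, then type `c` to continue
    return group['value'].sum()

grouped.apply(inspect_group)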
