I am cleaning a data set with fraudulent email addresses that I am removing.
I established multiple rules for catching duplicates and fraudulent domains. But there is one screnario, where I can't think of how to code a rule in python to flag them.
So I have for example rules like this:
#delete punction
df['email'].apply(lambda x:''.join([i for i in x if i not in string.punctuation]))
#flag yopmail
pattern = "yopmail"
match = df['email'].str.contains(pattern)
df['yopmail'] = np.where(match, 'Y', '0')
#flag duplicates
df['duplicate']=df.email.duplicated(keep=False)
This is the data where I can't figure out a rule to catch it. Basically I am looking for a way to flag addresses that start the same way, but then have consecutive numbers in the end.
abc7020#gmail.com
abc7020.1#gmail.com
abc7020.10#gmail.com
abc7020.11#gmail.com
abc7020.12#gmail.com
abc7020.13#gmail.com
abc7020.14#gmail.com
abc7020.15#gmail.com
attn1#gmail.com
attn12#gmail.com
attn123#gmail.com
attn1234#gmail.com
attn12345#gmail.com
attn123456#gmail.com
attn1234567#gmail.com
attn12345678#gmail.com
My solution isn't efficient, nor pretty. But check it out and see if it works for you #jeangelj. It definitely works for the examples you provided. Good luck!
import os
from random import shuffle
from difflib import SequenceMatcher
emails = [... ...] # for example the 16 email addresses you gave in your question
shuffle(emails) # everyday i'm shuffling
emails = sorted(emails) # sort that shit!
names = [email.split('#')[0] for email in emails]
T = 0.7 # <- set your string similarity threshold here!!
split_indices=[]
for i in range(1,len(emails)):
if SequenceMatcher(None, emails[i], emails[i-1]).ratio() < T:
split_indices.append(i) # we want to remember where dissimilar email address occurs
grouped=[]
for i in split_indices:
grouped.append(emails[:i])
grouped.append(emails[i:])
# now we have similar email addresses grouped, we want to find the common prefix for each group
prefix_strings=[]
for group in grouped:
prefix_strings.append(os.path.commonprefix(group))
# finally
ham=[]
spam=[]
true_ids = [names.index(p) for p in prefix_strings]
for i in range(len(emails)):
if i in true_ids:
ham.append(emails[i])
else:
spam.append(emails[i])
In [30]: ham
Out[30]: ['abc7020#gmail.com', 'attn1#gmail.com']
In [31]: spam
Out[31]:
['abc7020.10#gmail.com',
'abc7020.11#gmail.com',
'abc7020.12#gmail.com',
'abc7020.13#gmail.com',
'abc7020.14#gmail.com',
'abc7020.15#gmail.com',
'abc7020.1#gmail.com',
'attn12345678#gmail.com',
'attn1234567#gmail.com',
'attn123456#gmail.com',
'attn12345#gmail.com',
'attn1234#gmail.com',
'attn123#gmail.com',
'attn12#gmail.com']
# THE TRUTH YALL!
You can use a regular expression to do this; example below:
import re
a = "attn12345#gmail.comf"
b = "abc7020.14#gmail.com"
c = "abc7020#gmail.com"
d = "attn12345678#gmail.com"
pattern = re.compile("[0-9]{3,500}\.?[0-9]{0,500}?#")
if pattern.search(a):
print("spam1")
if pattern.search(b):
print("spam2")
if pattern.search(c):
print("spam3")
if pattern.search(d):
print("spam4")
If you run the code you will see:
$ python spam.py
spam1
spam2
spam3
spam4
The benefit to this method is that its standardized (regular expressions) and that you can adjust the strength of the match easily by adjusting the values within {}; which means you can have a global configuration file where you set/adjust the values. You can also adjust the regular expression easily without having to rewrite code.
First take a look at regexp question here
Second, try to filter email address like that:
# Let's email is = 'attn1234#gmail.com'
email = 'attn1234#gmail.com'
email_name = email.split(',', maxsplit=1)[0]
# Here you get email_name = 'attn1234
import re
m = re.search(r'\d+$', email_name)
# if the string ends in digits m will be a Match object, or None otherwise.
if m is not None:
print ('%s is good' % email)
else:
print ('%s is BAD' % email)
You could pick a diff threshold using edit distance (aka Levenshtein distance). In python:
$pip install editdistance
$ipython2
>>> import editdistance
>>> threshold = 5 # This could be anything, really
>>> data = ["attn1#gmail.com...", ...]# set up data to be the set you gave
>>> fraudulent_emails = set([email for email in data for _ in data if editdistance.eval(email, _) < threshold])
If you wanted to be smarter about it, you could run through the resulting list and, instead of turning it into a set, keep track of how many other email addresses it was near - then use that as a 'weight' to determine fake-ness.
This gets you not only the given cases (where the fraudulent addresses all share a common start and differ only in numerical suffix, but additionally number or letter padding eg at the beginning or in the middle of an email address.
ids = [s.split('#')[0] for s in email_list]
det = np.zeros((len(ids), len(ids)), dtype=np.bool)
for i in range(len(ids)):
for j in range(i + 1, len(ids)):
mi = ids[i]
mj = ids[j]
if len(mj) == len(mi) + 1 and mj.startswith(mi):
try:
int(mj[-1])
det[j,i] = True
det[i,j] = True
except:
continue
spam_indices = np.where(np.sum(det, axis=0) != 0)[0].tolist()
Here's one way to approach it, that should be pretty efficient.
We do it by grouping the email address in lengths, so that we only need to check if each email address matches the level down, by a slice and set membership check.
The code:
First, read in the data:
import pandas as pd
import numpy as np
string = '''
abc7020#gmail.com
abc7020.1#gmail.com
abc7020.10#gmail.com
abc7020.11#gmail.com
abc7020.12#gmail.com
abc7020.13#gmail.com
abc7020.14#gmail.com
abc7020.15#gmail.com
attn1#gmail.com
attn12#gmail.com
attn123#gmail.com
attn1234#gmail.com
attn12345#gmail.com
attn123456#gmail.com
attn1234567#gmail.com
attn12345678#gmail.com
foo123#bar.com
foo1#bar.com
'''
x = pd.DataFrame({'x':string.split()})
#remove duplicates:
x = x[~x.x.duplicated()]
We strip off the #foo.bar part, and then filer to only those that end with a number, and add on a 'length' column:
#split on #, expand means into two columns
emails = x.x.str.split('#', expand = True)
#filter by last in string is a digit
emails = emails.loc[:,emails.loc[:,0].str[-1].str.isdigit()]
#add a length of email column for the next step
emails['lengths'] = emails.loc[:,0].str.len()
Now, all we have to do, is take each length, and length -1, and see if the length. with it's last character dropped, appears in a set of the n-1 lengths (and, we have to check if the opposite is true, in case it is the shortest repeat):
#unique lengths to check
lengths = emails.lengths.unique()
#mask to hold results
mask = pd.Series([0]*len(emails), index = emails.index)
#for each length
for j in lengths:
#we subset those of that length
totest = emails['lengths'] == j
#and those who might be the shorter version
against = emails['lengths'] == j -1
#we make a set of unique values, for a hashed lookup
againstset = set([i for i in emails.loc[against,0]])
#we cut off the last char of each in to test
tests = emails.loc[totest,0].str[:-1]
#we check matches, by checking the set
mask = mask.add(tests.apply(lambda x: x in againstset), fill_value = 0)
#viceversa, otherwise we miss the smallest one in the group
againstset = set([i for i in emails.loc[totest,0].str[:-1]])
tests = emails.loc[against,0]
mask = mask.add(tests.apply(lambda x: x in againstset), fill_value = 0)
The resulting mask can be converted to boolean, and used to subset the original (deduplicated) dataframe, and the indices should match the original indices to subset like that:
x.loc[~mask.astype(bool),:]
x
0 abc7020#gmail.com
16 foo123#bar.com
17 foo1#bar.com
You can see that we have not removed your first value, as the '.' means it did not match - you can remove the punctuation first.
I have an idea on how to solve this:
fuzzywuzzy
Create a set of unique emails, for-loop over them and compare them with fuzzywuzzy.
Example:
from fuzzywuzzy import fuzz
for email in emailset:
for row in data:
emailcomp = re.search(pattern=r'(.+)#.+',string=email).groups()[0]
rowemail = re.search(pattern=r'(.+)#.+',string=row['email']).groups()[0]
if row['email']==email:
continue
elif fuzz.partial_ratio(emailcomp,rowemail)>80:
'flagging operation'
I took some liberties with how the data is represented, but I feel the variable names are mnemonic enough for you to understand what I am getting at. It is a very rough piece of code, in that I have not thought through how to stop repetitive flagging.
Anyways, the elif part compares the two email addresses without #gmail.com (or any other email e.g. #yahoo.com), if the ratio is above 80 (play around with this number) use your flagging operation.
For example:
fuzz.partial_ratio("abc7020.1", "abc7020")
100
Related
I am working on a project that uses fuzzy logic on a list of names that could go about 100,000 unique records. On the recent screening that we have conducted, the functions that we use can complete a single name within 2.20 seconds on average. This means that on a list of 10,000 names, this process could take 6 hours, which is really too long.
Is there a way that we can speed up our process? Here's the snippet of the script that we use.
# Importing packages
import pandas as pd
import Levenshtein as lev
# Reading cleaned datasets
df_name_reference = pd.read_csv('path_to_file')
df_name_to_screen = pd.read_csv('path_to_file')
# Function used in name screening
def get_similarity_score(s1, s2):
''' Return match percentage between 2 strings disregarding name swapping
Parameters
-----------
s1 : str : name from df_name_reference (to be used within pandas apply)
s2 : str : name from df_name_to_screen (ref_name variable)
Return
-----------
float
'''
# Get sorted names
s1_sort = ' '.join(sorted(s1.split(' '))).strip() if type(s1)==str else ''
s2_sort = ' '.join(sorted(s2.split(' '))).strip() if type(s2)==str else ''
# Get ratios and return the max value
# THIS COULD BE THE BOTTLENECK OF OUR SCRIPT: MORE DETAILS BELOW
return max([
lev.ratio(s1, s2),
lev.ratio(s1_sort, s2),
lev.ratio(s1, s2_sort),
lev.ratio(s1_sort, s2_sort)
])
# Returning file
screening_results = []
for row in range(df_name_to_screen.shape[0]):
# Get name to screen
ref_name = df_name_to_screen.loc[row, 'fullname']
# Get scores
scores = df_name_reference.fullname.apply(lev.ratio, args=(ref_name,))
# Append results
screening_results.append(pd.DataFrame({'screened_name':ref_name, 'scores':scores}))
I took four scores from lev.ratio. This is to address variations in the arrangement of names, ie. firstname-lastname and lastname-firstname formats. I know that fuzzywuzzy package has token_sort_ratio, but I've noticed that it's just splitting the name parts, and sorting it alphabetically, which leads to lower scores. Plus, fuzzywuzzy is slower than Levenshtein. So, I had to manually capture the similarity score of sorted and unsorted names.
Can anyone give an approach that I could try? Thanks!
EDIT: Here's a sample dataset that you may try. This is in Google Drive.
In case you don't need scores for all entries in the reference data but just the top N then you can use difflib.get_close_matches to remove the others before calculating any scores:
screening_results = []
for row in range(df_name_to_screen.shape[0]):
ref_name = df_name_to_screen.loc[row, 'fullname']
skimmed = pd.DataFrame({
'fullname': difflib.get_close_matches(
ref_name,
df_name_reference.fullname,
N_RESULTS,
0
)
})
scores = skimmed.fullname.apply(lev.ratio, args=(ref_name,))
screening_results.append(pd.DataFrame({'screened_name': ref_name, 'scores': scores}))
This takes about 50ms per row using the file you provided.
I am looking to get the closest match between two columns of string data type in two separate tables. I don't think the content matters too much. There are words that I can match by pre-processing the data (lower all letters, replace spaces and stop words, etc...) and doing a join. However I get around 80 matches out of over 350. It is important to know that the length of each table is different.
I did try to use some code I found online but it isn't working:
def Races_chien(df1,df2):
myList = []
total = len(df1)
possibilities = list(df2['Rasse'])
s = SequenceMatcher(isjunk=None, autojunk=False)
for idx1, df1_str in enumerate(df1['Race']):
my_str = ('Progress : ' + str(round((idx1 / total) * 100, 3)) + '%')
sys.stdout.write('\r' + str(my_str))
sys.stdout.flush()
# get 1 best match that has a ratio of at least 0.7
best_match = get_close_matches(df1_str, possibilities, 1, 0.7)
s.set_seq2(df1_str, best_match)
myList.append([df1_str, best_match, s.ratio()])
return myList
It says: TypeError: set_seq2() takes 2 positional arguments but 3 were given
How can I make this work?
I think you need s.set_seqs(df1_str, best_match) function instead of s.set_seq2(df1_str, best_match) (docs)
You can use jellyfish library that has useful tools for comparing how similar two strings are if that is what you are looking for.
Try changing:
s = SequenceMatcher(isjunk=None, autojunk=False)
To:
s = SequenceMatcher(None, isjunk=None, autojunk=False)
Here is an answer I finally got:
from fuzzywuzzy import process, fuzz
value = []
similarity = []
for i in df1.col:
ratio = process.extract(i, df2.col, limit= 1)
value.append(ratio[0][0])
similarity.append(ratio[0][1])
df1['value'] = pd.Series(value)
df1['similarity'] = pd.Series(similarity)
This will add the value with the closest match from df2 in df1 together with the similarity %
I have two lists: ref_list and inp_list. How can one make use of FuzzyWuzzy to match the input list from the reference list?
inp_list = pd.DataFrame(['ADAMS SEBASTIAN', 'HAIMBILI SEUN', 'MUTESI
JOHN', 'SHEETEKELA MATT', 'MUTESI JOHN KUTALIKA',
'ADAMS SEBASTIAN HAUSIKU', 'PETERS WILSON',
'PETERS MARIO', 'SHEETEKELA MATT NICKY'],
columns =['Names'])
ref_list = pd.DataFrame(['ADAMS SEBASTIAN HAUSIKU', 'HAIMBILI MIKE', 'HAIMBILI SEUN', 'MUTESI JOHN
KUTALIKA', 'PETERS WILSON MARIO', 'SHEETEKELA MATT NICKY MBILI'], columns =
['Names'])
After some research, I modified some codes I found on the internet. Problems with these codes - they work very well on small sample size. In my case the inp_list and ref_list are 29k and 18k respectively in length and it takes more than a day to run.
Below are the codes, first a helper function was defined.
def match_term(term, inp_list, min_score=0):
# -1 score in case I don't get any matches
max_score = -1
# return empty for no match
max_name = ''
# iterate over all names in the other
for term2 in inp_list:
# find the fuzzy match score
score = fuzz.token_sort_ratio(term, term2)
# checking if I am above my threshold and have a better score
if (score > min_score) & (score > max_score):
max_name = term2
max_score = score
return (max_name, max_score)
# list for dicts for easy dataframe creation
dict_list = []
#iterating over the sales file
for name in inp_list:
#use the defined function above to find the best match, also set the threshold to a chosen #
match = match_term(name, ref_list, 94)
#new dict for storing data
dict_ = {}
dict_.update({'passenger_name': name})
dict_.update({'match_name': match[0]})
dict_.update({'score': match[1]})
dict_list.append(dict_)
Where can these codes be improved to run smoothly and perhaps avoid evaluating items that have already been assessed?
You can try to vectorized the operations instead of evaluate the scores in a loop.
Make a df where the firse col ref is ref_list and the second col inp is each name in inp_list. Then call df.apply(lambda row:process.extractOne(row['inp'], row['ref']), axis=1). Finally you'll get the best match name and score in ref_list for each name in inp_list.
The measures you are using are computationally demanding with a number of pairs of strings that high. Alternatively to fuzzywuzzy, you could try to use instead a library called string-grouper which exploits a faster Tf-idf method and the cosine similarity measure to find similar words. As an example:
import random, string, time
import pandas as pd
from string_grouper import match_strings
alphabet = list(string.ascii_lowercase)
from_r, to_r = 0, len(alphabet)-1
random_strings_1 = ["".join(alphabet[random.randint(from_r, to_r)]
for i in range(6)) for j in range(5000)]
random_strings_2 = ["".join(alphabet[random.randint(from_r, to_r)]
for i in range(6)) for j in range(5000)]
series_1 = pd.Series(random_strings_1)
series_2 = pd.Series(random_strings_2)
t_1 = time.time()
matches = match_strings(series_1, series_2,
min_similarity=0.6)
t_2 = time.time()
print(t_2 - t_1)
print(matches)
It takes less than one second to do 25.000.000 comparisons! For a surely more useful test of the library look here: https://bergvca.github.io/2017/10/14/super-fast-string-matching.html where it is claimed that
"Using this approach made it possible to search for near duplicates in
a set of 663,000 company names in 42 minutes using only a dual-core
laptop".
To tune your matching algorithm further look at the **kwargs arguments you can give to the match_strings function above.
In my Python application I have an array of IP address strings which looks something like this:
[
"50.28.85.81-140", // Matches any IP address that matches the first 3 octets, and has its final octet somewhere between 81 and 140
"26.83.152.12-194" // Same idea: 26.83.152.12 would match, 26.83.152.120 would match, 26.83.152.195 would not match
]
I installed netaddr and although the documentation seems great, I can't wrap my head around it. This must be really simple - how do I check if a given IP address matches one of these ranges? Don't need to use netaddr in particular - any simple Python solution will do.
The idea is to split the IP and check every component separately.
mask = "26.83.152.12-192"
IP = "26.83.152.19"
def match(mask, IP):
splitted_IP = IP.split('.')
for index, current_range in enumerate(mask.split('.')):
if '-' in current_range:
mini, maxi = map(int,current_range.split('-'))
else:
mini = maxi = int(current_range)
if not (mini <= int(splitted_IP[index]) <= maxi):
return False
return True
Not sure this is the most optimal, but this is base python, no need for extra packages.
parse the ip_range, creating a list with 1 element if simple value, and a range if range. So it creates a list of 4 int/range objects.
then zip it with a split version of your address and test each value in range of the other
Note: Using range ensures super-fast in test (in Python 3) (Why is "1000000000000000 in range(1000000000000001)" so fast in Python 3?)
ip_range = "50.28.85.81-140"
toks = [[int(d)] if d.isdigit() else range(int(d.split("-")[0]),int(d.split("-")[1]+1)) for d in ip_range.split(".")]
print(toks) # debug
for test_ip in ("50.28.85.86","50.284.85.200","1.2.3.4"):
print (all(int(a) in b for a,b in zip(test_ip.split("."),toks)))
result (as expected):
[[50], [28], [85], range(81, 140)]
True
False
False
I am getting rows from a spreadsheet with mixtures of numbers, text and dates
I want to find elements within the list, some numbers and some text
for example
sg = [500782, u'BMOU9015488', u'SD4', u'CLOSED', -1, '', '', -1]
sg = map(str, sg)
#sg = map(unicode, sg) #option?
if any("-1" in s for s in sg):
#do something if matched
I don't feel this is the correct way to do this, I am also trying to match stuff like -1.5 and -1.5C and other unexpected characters like OPEN15 compared to 15
I have also looked at
sg.index("-1")
If positive then its a match (Only good for direct matches)
Some help would be appreciated
If you want to call a function for each case, I would do it this way:
def stub1(elem):
#do something for match of type '-1'
return
def stub2(elem):
#do something for match of type 'SD4'
return
def stub3(elem):
#do something for match of type 'OPEN15'
return
sg = [500782, u'BMOU9015488', u'SD4', u'CLOSED', -1, '', '', -1]
sg = map(unicode, sg)
patterns = {u"-1":stub1, u"SD4": stub2, u"OPEN15": stub3} # add more if you want
for elem in sg:
for k, stub in patterns.iteritems():
if k in elem:
stub(elem)
break
Where stub1, stub2, ... are the fonctions that contains the code for each case.
It will be called (max 1 time per strings) if the string contains a matching substring.
What do you mean by "I don't feel this is the correct way to do this" ? Are you not getting the result you expect ? Is it too slow ?
Maybe, you can organize your data by columns instead of rows and have a more specific filters. If you are looking for speed, I'd suggest using the numpy module which has a very intersting function called select()
Scipy select example
By transforming all your rows in a numpy array, you can test several columns in one pass. This function is amazingly efficient and powerful ! Basically it's used like this:
import numpy as np
a = array(...)
conds = [a < 10, a % 3 == 0, a > 25]
actions = [a + 100, a / 3, a * 10]
result = np.select(conds, actions, default = 0)
All values in a will be transformed as follow:
A value 100 will be added to any value of a which is smaller than 10
Any value in a which is a multiple of 3, will be divided by 3
Any value above 25 will be multiplied by 10
Any other value, not matching the previous conditions, will be set to 0
Bot conds and actions are lists, and must have the same number of arguments. The first element in conds has its action set as the first element of actions.
It could be used to determine the index in a vector for a particular value (eventhough this should be done using the nonzero() numpy function).
a = array(....)
conds = [a <= target, a > target]
actions = [1, 0]
index = select(conds, actions).sum()
This is probably a stupid way of getting an index, but it demonstrates how we can use select()... and it works :-)