Data modeling with App Engine Python and queries with 'IN' - python

I have a list of addresses as string type and I'd like to find all events whose location value matches the contents of the list. Because I have thousands of such entries, using 'IN' with a filter won't work, as I exceed the limit of 30 subqueries per fetch.
Here's how I'm trying to do a filter:
# addresses come in as a list of strings
addresses = ['123 Main St, Portland, ME', '500 Broadway, New York, NY', ...]
query = Event.all()
query.filter('location IN ', addresses)
# above causes the error:
<class 'google.appengine.api.datastore_errors.BadArgumentError'>:
Cannot satisfy query -- too many subqueries (max: 30, got 119).
Probable cause: too many IN/!= filters in query.
My model classes:
class Event(GeoModel):
    name = db.StringProperty()
    location = db.PostalAddressProperty()
Is there a better way to find all entries that match specific criteria?

There's no way around this other than multiple queries - you are, after all, asking for the combined results of a set of queries for different addresses, and this is how 'IN' queries are implemented in the datastore. You might want to consider using ndb or asynchronous queries so you can run them in parallel.
Perhaps if you explain what you're trying to achieve, we can suggest a more efficient approach.
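For illustration, here is a minimal sketch of that suggestion, assuming an ndb version of the Event model (an ndb.Model with a location StringProperty, which is an assumption on my part); it splits the addresses into chunks of at most 30, starts the queries asynchronously, and unions the results by entity key:

from google.appengine.ext import ndb

def events_for_addresses(addresses, chunk_size=30):
    # One IN query per chunk of <= 30 addresses, all started asynchronously
    chunks = [addresses[i:i + chunk_size]
              for i in range(0, len(addresses), chunk_size)]
    futures = [Event.query(Event.location.IN(chunk)).fetch_async()
               for chunk in chunks]
    # Union the results, de-duplicating by entity key
    events = {}
    for future in futures:
        for event in future.get_result():
            events[event.key] = event
    return list(events.values())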

A simple solution (hack) would be to break up your list of addresses into chunks of 30. Do one query per chunk of 30 locations, then take the union of the query results to get the events whose location appears anywhere in the original list.

GQL 'IN' does not allow more than 30 subqueries. To work around this, I divide the subqueries into chunks of at most 30 and store the results in an array.
import math

resultArray = []
rCount = len(subQueryArray)
rLength = len(subQueryArray) / 29.0
arrayLength = int(math.ceil(rLength))
# If there are more than 30 subqueries, split them into chunks of at most 30
if arrayLength > 1:
    for ii in range(0, arrayLength):
        # srange = start of chunk, nrange = end of chunk
        if ii == 0:
            srange = ii
        else:
            srange = nrange + 1
        nrange = 29 * (ii + 1)
        newList = []
        for nii in range(srange, nrange + 1):
            if nii < rCount:
                newList.append(subQueryArray[nii])
        query = db.GqlQuery("SELECT * FROM table_name "
                            "WHERE req_id IN :1", newList)
        for result in query.run():
            # result.id belongs to the table entity
            resultArray.append(result.id)


Making a comparison of 2 tables faster (Postgres/SQLAlchemy)

I wrote a script in Python to manipulate a table I have in my database, using SQLAlchemy. Basically, table 1 has 2,500,000 entries and table 2 has 200,000 entries. What I am trying to do is compare the source IP and dest IP in table 1 with the source IP and dest IP in table 2. If there is a match, I replace the source IP and dest IP in table 1 with the data that matches them in table 2, and I add the entry to table 3. My code also checks whether the entry is already in the new table; if so, it skips it and moves on to the next row.
My problem is that it's extremely slow. I launched my script yesterday and in 24 hours it only went through 47,000 entries out of 2,500,000. I am wondering if there is any way I can speed up the process. It's a Postgres DB and I can't tell whether the script taking this much time is reasonable or whether something is wrong. If anyone has had a similar experience with something like this, how much time did it take before completion?
Many thanks.
session = Session()
i = 0
start_id = 1
flows = session.query(Table1).filter(Table1.id >= start_id).all()
result_number = len(flows)
vlan_list = {"['0050']", "['0130']", "['0120']", "['0011']", "['0110']"}
while i < result_number:
    for flow in flows:
        if flow.vlan_destination in vlan_list:
            usage = session.query(Table2).filter(
                Table2.ip == str(flow.ip_destination)).all()
            if len(usage) > 0:
                usage = usage[0].usage
            else:
                usage = str(flow.ip_destination)
            usage_ip_src = session.query(Table2).filter(
                Table2.ip == str(flow.ip_source)).all()
            if len(usage_ip_src) > 0:
                usage_ip_src = usage_ip_src[0].usage
            else:
                usage_ip_src = str(flow.ip_source)
            if flow.protocol == "17":
                protocol = func.REPLACE(flow.protocol, "17", 'UDP')
            elif flow.protocol == "1":
                protocol = func.REPLACE(flow.protocol, "1", 'ICMP')
            elif flow.protocol == "6":
                protocol = func.REPLACE(flow.protocol, "6", 'TCP')
            else:
                protocol = flow.protocol
            is_in_db = session.query(Table3).filter(Table3.protocol == protocol)\
                .filter(Table3.application == flow.application)\
                .filter(Table3.destination_port == flow.destination_port)\
                .filter(Table3.vlan_destination == flow.vlan_destination)\
                .filter(Table3.usage_source == usage_ip_src)\
                .filter(Table3.state == flow.state)\
                .filter(Table3.usage_destination == usage).count()
            if is_in_db == 0:
                to_add = Table3(usage_ip_src, usage, protocol, flow.application,
                                flow.destination_port, flow.vlan_destination, flow.state)
                session.add(to_add)
                session.flush()
                session.commit()
                print("added " + str(i))
            else:
                print("usage already in DB")
        i = i + 1
session.close()
EDIT As requested, here are more details: Table 1 has 11 columns; the two we are interested in are source IP and dest IP.
Table 2 has an IP and a usage. What my script does is take the source IP and dest IP from Table 1 and look up whether there is a match in Table 2. If so, it replaces the IP address with the usage and adds this, along with some of the columns of Table 1, to Table 3.
While doing this, when adding the protocol column to Table 3, it writes the protocol name instead of the number, just to make it more readable.
EDIT 2 I am trying to think about this differently, so I made a diagram of my problem (the X problem; diagram omitted).
What I am trying to figure out is whether my code (the Y solution) is working as intended. I've been coding in Python for only a month and I feel like I am messing something up. My code is supposed to take every row from Table 1, compare it to Table 2, and add data to Table 3. Table 1 has over 2 million entries, and it's understandable that it should take a while, but it's too slow. For example, when I had to load the data from the API into the DB, it went faster than the comparisons I'm trying to do with everything that is already in the DB. I am running my code on a virtual machine that has sufficient memory, so I am sure it's my code that is lacking and I need direction as to what can be improved. (Screenshots of Table 1, Table 2, and Table 3 omitted.)
EDIT 3 : Postgresql QUERY
SELECT
coalesce(table2_1.usage, table1.ip_source) AS coalesce_1,
coalesce(table2_2.usage, table1.ip_destination) AS coalesce_2,
CASE table1.protocol WHEN %(param_1)s THEN %(param_2)s WHEN %(param_3)s THEN %(param_4)s WHEN %(param_5)s THEN %(param_6)s ELSE table1.protocol END AS anon_1,
table1.application AS table1_application,
table1.destination_port AS table1_destination_port,
table1.vlan_destination AS table1_vlan_destination,
table1.state AS table1_state
FROM
table1
LEFT OUTER JOIN table2 AS table2_2 ON table2_2.ip = table1.ip_destination
LEFT OUTER JOIN table2 AS table2_1 ON table2_1.ip = table1.ip_source
WHERE
table1.vlan_destination IN (
%(vlan_destination_1)s,
%(vlan_destination_2)s,
%(vlan_destination_3)s,
%(vlan_destination_4)s,
%(vlan_destination_5)s
)
AND NOT (
EXISTS (
SELECT
1
FROM
table3
WHERE
table3.usage_source = coalesce(table2_1.usage, table1.ip_source)
AND table3.usage_destination = coalesce(table2_2.usage, table1.ip_destination)
AND table3.protocol = CASE table1.protocol WHEN %(param_1)s THEN %(param_2)s WHEN %(param_3)s THEN %(param_4)s WHEN %(param_5)s THEN %(param_6)s ELSE table1.protocol END
AND table3.application = table1.application
AND table3.destination_port = table1.destination_port
AND table3.vlan_destination = table1.vlan_destination
AND table3.state = table1.state
)
)
Given the current question, I think this at least comes close to what you might be after. The idea is to perform the entire operation in the database, instead of fetching everything – the whole 2,500,000 rows – and filtering in Python etc.:
from sqlalchemy import func, case, insert
from sqlalchemy.orm import aliased

def newhotness(session, vlan_list):
    # The query needs to join Table2 twice, so it has to be aliased
    dst = aliased(Table2)
    src = aliased(Table2)
    # Prepare required SQL expressions
    usage = func.coalesce(dst.usage, Table1.ip_destination)
    usage_ip_src = func.coalesce(src.usage, Table1.ip_source)
    protocol = case({"17": "UDP",
                     "1": "ICMP",
                     "6": "TCP"},
                    value=Table1.protocol,
                    else_=Table1.protocol)
    # Form a query producing the data to insert to Table3
    flows = session.query(
        usage_ip_src,
        usage,
        protocol,
        Table1.application,
        Table1.destination_port,
        Table1.vlan_destination,
        Table1.state).\
        outerjoin(dst, dst.ip == Table1.ip_destination).\
        outerjoin(src, src.ip == Table1.ip_source).\
        filter(Table1.vlan_destination.in_(vlan_list),
               ~session.query(Table3).
               filter_by(usage_source=usage_ip_src,
                         usage_destination=usage,
                         protocol=protocol,
                         application=Table1.application,
                         destination_port=Table1.destination_port,
                         vlan_destination=Table1.vlan_destination,
                         state=Table1.state).
               exists())
    stmt = insert(Table3).from_select(
        ["usage_source", "usage_destination", "protocol", "application",
         "destination_port", "vlan_destination", "state"],
        flows)
    return session.execute(stmt)
If the vlan_list is selective, or in other words filters out most rows, this will perform a lot less operations in the database. Depending on the size of Table2 you may benefit from indexing Table2.ip, but do test first. If it is relatively small, I would guess that PostgreSQL will perform a hash or nested loop join there. If some column of the ones used to filter out duplicates in Table3 is unique, you could perform an INSERT ... ON CONFLICT ... DO NOTHING instead of removing duplicates in the SELECT using the NOT EXISTS subquery expression (which PostgreSQL will perform as an antijoin). If there is a possibility that the flows query may produce duplicates, add a call to Query.distinct() to it.
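As a rough illustration of that ON CONFLICT alternative, assuming Table3 has a unique constraint covering the de-duplication columns (the constraint name uq_table3_flow below is hypothetical), the insert could look like this, with flows built without the NOT EXISTS filter:

from sqlalchemy.dialects.postgresql import insert as pg_insert

def insert_flows_ignore_duplicates(session, flows_select):
    # flows_select is the SELECT built above, minus the NOT EXISTS subquery
    stmt = pg_insert(Table3).from_select(
        ["usage_source", "usage_destination", "protocol", "application",
         "destination_port", "vlan_destination", "state"],
        flows_select)
    # Skip rows that would violate the (hypothetical) unique constraint
    stmt = stmt.on_conflict_do_nothing(constraint="uq_table3_flow")
    return session.execute(stmt)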

What is the fastest way to match, replace, and extract substrings from pandas dataframe with multiple criteria?

I have an approximately 1 million row pandas dataframe containing data parsed from federal appellate court opinions. I need to extract the names of judges hearing the cases. The data has an unknown number of judges per case (one row) which are contained in a string. That string (currently stored in a single column) contains a lot of excess text as well as has inconsistent formatting and capitalization. I use different dictionaries of judge names (with 2,575 regex keys possible to be used) to match judges listed based on multiple criteria described below. I use the dictionary with the most stringent matching criteria first and gradually loosen the criteria. I also remove the matched string from the source column. The current methods that I have tried are simply too slow (taking days, weeks, or even months).
The reason there are multiple possible dictionaries is that many judges share the same (last) names. The strings don't ordinarily include full names. I use data contained in two other columns to get the right match: year the case was decided and the court hearing the case (both integers). I also have higher and lower quality substring search terms. The dictionaries I use can be recreated at will in different formats besides regex if needed.
The fastest solution I have tried was crude and unpythonic. In the initial parsing of the data (extraction of sections and keywords from raw text files), which occurs on a case-by-case basis, I did the following: 1) removed excess text to the degree possible, 2) sorted the remaining text into a list stored within a pandas column, 3) concatenated as strings the year and court to each item in that list, and 4) matched that concatenated string to a dictionary that I had similarly prepared. That dictionary didn't use regular expressions and had approximately 800,000 keys. That process took about a day (with all of the other parsing involved as well) and was not as accurate as I would have liked (because it omitted certain name format permutations).
The code below contains my most recent attempt (which is currently running and looks to be among the slowest options yet). It creates subset dictionaries on the fly and still ends up iterating through those smaller dictionaries with regex keys. I've read through and tried to apply solutions from many stackoverflow questions, but couldn't find a workable solution. I'm open to any python-based idea. The data is real data that I've cleaned with a prior function.
import numpy as np
import pandas as pd

test_data = {'panel_judges' : ['CHAGARES, VANASKIE, SCHWARTZ',
                               'Sidney R. Thomas, Barry G. Silverman, Raymond C. Fisher, Opinion by Thomas'],
             'court_num' : [3, 9],
             'date_year' : [2014, 2014]}
test_df = pd.DataFrame(data = test_data)
name_dict = {'full_name' : ['Chagares, Michael A.',
                            'Vanaskie, Thomas Ignatius',
                            'Schwartz, Charles, Jr.',
                            'Schwartz, Edward Joseph',
                            'Schwartz, Milton Lewis',
                            'Schwartz, Murray Merle'],
             'court_num' : [3, 3, 1061, 1097, 1058, 1013],
             'circuit_num' : [3, 3, 5, 9, 9, 3],
             'start_year' : [2006, 2010, 1976, 1968, 1979, 1974],
             'end_year' : [2016, 2019, 2012, 2000, 2005, 2013],
             'hq_match' : ['M(?=ICHAEL)? ?A?(?=\.)? ?CHAGARES',
                           'T(?=HOMAS)? ?I?(?=GNATIUS)? ?VANASKIE',
                           'C(?=HARLES)? SCHWARTZ',
                           'E(?=DWARD)? ?J?(?=OSEPH)? ?SCHWARTZ',
                           'M(?=ILTON)? ?L?(?=EWIS)? ?SCHWARTZ',
                           'M(?=URRAY)? ?M?(?=ERLE)? ?SCHWARTZ'],
             'lq_match' : ['CHAGARES',
                           'VANASKIE',
                           'SCHWARTZ',
                           'SCHWARTZ',
                           'SCHWARTZ',
                           'SCHWARTZ']}
names = pd.DataFrame(data = name_dict)
in_col = 'panel_judges'
year_col = 'date_year'
out_col = 'fixed_panel'
court_num_col = 'court_num'
test_df[out_col] = ''
test_df[out_col].astype(list, inplace = True)

def judge_matcher(df, in_col, out_col, year_col, court_num_col,
                  size_column = None):
    general_cols = ['start_year', 'end_year', 'full_name']
    court_cols = ['court_num', 'circuit_num']
    match_cols = ['hq_match', 'lq_match']
    for match_col in match_cols:
        for court_col in court_cols:
            lookup_cols = general_cols + [court_col] + [match_col]
            judge_df = names[lookup_cols]
            for year in range(df[year_col].min(),
                              df[year_col].max() + 1):
                for court in range(df[court_num_col].min(),
                                   df[court_num_col].max() + 1):
                    lookup_subset = ((judge_df['start_year'] <= year)
                                     & (year < (judge_df['end_year'] + 2))
                                     & (judge_df[court_col] == court))
                    new_names = names.loc[lookup_subset]
                    df_subset = ((df[year_col] == year)
                                 & (df[court_num_col] == court))
                    df.loc[df_subset] = matcher(df.loc[df_subset],
                                                in_col, out_col, new_names, match_col)
    return df

def matcher(df, in_col, out_col, lookup, keys):
    patterns = dict(zip(lookup[keys], lookup['full_name']))
    for key, value in patterns.items():
        df[out_col] = (
            np.where(df[in_col].astype(str).str.upper().str.contains(key),
                     df[out_col] + value + ', ', df[out_col]))
        df[in_col] = df[in_col].astype(str).str.upper().str.replace(key, '')
    return df

df = judge_matcher(test_df, in_col, out_col, year_col, court_num_col)
The output I currently get is essentially right (although the names should be sorted and in a list). The proper "Schwartz" is picked and the matches are all correct. The problem is speed.
My goal is to have a de-duplicated, sorted (alphabetically) list of judges on each panel, either stored in a single column or exploded into up to 15 separate columns (I presently do that in a separate vectorized function). I then will do other lookups on those judges based upon other demographic and biographical information. The produced data will be openly available to researchers in the area and the code will be part of a free, publicly available platform usable for studying other courts as well. So accuracy and speed are both important considerations for users on many different machines.
For anyone who stumbles across this question and has a similar complex string matching issue in pandas, this is the solution I found to be the fastest.
It isn't fully vectorized like I wanted, but I used df.apply with this method within a class:
# 'unique_everseen' comes from more_itertools (or the itertools recipes)
def judge_matcher(self, row, in_col, out_col, year_col, court_num_col,
                  size_col = None):
    final_list = []
    raw_list = row[in_col]
    cleaned_list = [x for x in raw_list if x]
    cleaned_list = [x.strip() for x in cleaned_list]
    for name in cleaned_list:
        name1 = self.convert_judge_name(row[year_col],
                                        row[court_num_col], name, 1)
        name2 = self.convert_judge_name(row[year_col],
                                        row[court_num_col], name, 2)
        if name1 in self.names_dict_list[0]:
            final_list.append(self.names_dict_list[0].get(name1))
        elif name1 in self.names_dict_list[1]:
            final_list.append(self.names_dict_list[1].get(name1))
        elif name2 in self.names_dict_list[2]:
            final_list.append(self.names_dict_list[2].get(name2))
        elif name2 in self.names_dict_list[3]:
            final_list.append(self.names_dict_list[3].get(name2))
        elif name in self.names_dict_list[4]:
            final_list.append(self.names_dict_list[4].get(name))
    final_list = list(unique_everseen(final_list))
    final_list.sort()
    row[out_col] = final_list
    if size_col and final_list:
        row[size_col] = len(final_list)
    return row

@staticmethod
def convert_judge_name(year, court, name, dict_type):
    if dict_type == 1:
        return str(int(court) * 10000 + int(year)) + name
    elif dict_type == 2:
        return str(int(year)) + name
    else:
        return name
Basically, it concatenates three columns together and performs hashed dictionary lookups (instead of regexes) with the concatenated strings. Multiplication is used to efficiently concatenate the two numbers to be side-by-side as strings. The dictionaries had similarly prepared keys (and the values are the desired strings). By using lists and then deduplicating, I didn't have to remove the matched strings. I didn't time this specific function, but the overall module took just over 10 hours to process ~ 1 million rows. When I run it again, I will try to remember to time this applied function specifically and post the results here. The method is ugly, but fairly effective.
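For readers following along, here is a hypothetical illustration of how those concatenated lookup keys line up (the dictionaries in names_dict_list are built elsewhere; their contents here are assumptions based on the description above):

year, court, name = 2014, 3, 'CHAGARES'
key1 = str(int(court) * 10000 + int(year)) + name   # '32014CHAGARES' (court*10000 + year, then name)
key2 = str(int(year)) + name                        # '2014CHAGARES'  (year, then name)
# names_dict_list[0] might then map '32014CHAGARES' -> 'Chagares, Michael A.'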

Dataflow job is timing out. Having issues comparing two collections, and appending the values of one to another.

Hoping someone can help me here. I have two bigquery tables that I read into 2 different p collections, p1 and p2. I essentially want to update product based on a type II transformation that keeps track of history (previous values in the nested column in product) and appends new values from dwsku.
The idea is to check every row in each collection. If there is a match based on some table values (between p1 and p2), then check product's nested data to see if it contains all values in p1 (based on its sku number and brand). If it does not contain the most recent data from p2, then take a copy of the format of the current nested data in product and fit the new data into it. Take this nested format and add it to the existing nested products in product.
def process_changes(element, productdata):
    for data in productdata:
        if element['sku_number'] == data['sku_number'] and element['brand'] == data['brand']:
            logging.info('Processing Product: ' + str(element['sku_number']) + ' brand:' + str(element['brand']))
            datatoappend = []
            for nestline in data['product']:
                logging.info('Nested Data: ' + nestline['product'])
                if nestline['in_use'] == 'Y' and (nestline['sku_description'] != element['sku_description']
                        or nestline['department_id'] != element['department_id']
                        or nestline['department_description'] != element['department_description']
                        or nestline['class_id'] != element['class_id']
                        or nestline['class_description'] != element['class_description']
                        or nestline['sub_class_id'] != element['sub_class_id']
                        or nestline['sub_class_description'] != element['sub_class_description']):
                    logging.info('we found a sku we need to update')
                    logging.info('sku is ' + data['sku_number'])
                    newline = nestline.copy()
                    logging.info('most recent nested product element turned off...')
                    nestline['in_use'] = 'N'
                    nestline['expiration_date'] = "%s-%s-%s" % (curdate.year, curdate.month, curdate.day)  # CURRENT DATE
                    logging.info(nestline)
                    logging.info('inserting most recent change in dwsku inside nest')
                    newline['sku_description'] = element['sku_description']
                    newline['department_id'] = element['department_id']
                    newline['department_description'] = element['department_description']
                    newline['class_id'] = element['class_id']
                    newline['class_description'] = element['class_description']
                    newline['sub_class_id'] = element['sub_class_id']
                    newline['sub_class_description'] = element['sub_class_description']
                    newline['in_use'] = 'Y'
                    newline['effective_date'] = "%s-%s-%s" % (curdate.year, curdate.month, curdate.day)  # CURRENT DATE
                    newline['modified_date'] = "%s-%s-%s" % (curdate.year, curdate.month, curdate.day)  # CURRENT DATE
                    newline['modified_time'] = "%s:%s:%s" % (curdate.hour, curdate.minute, curdate.second)
                    nestline['expiration_date'] = "9999-01-01"
                    datatoappend.append(newline)
                else:
                    logging.info('Nothing changed for sku ' + str(data['sku_number']))
            for dt in datatoappend:
                logging.info('processed sku ' + str(element['sku_number']))
                logging.info('adding the changes (if any)')
                data['product'].append(dt)
    return data
changed_product = p1 | beam.FlatMap(process_changes, AsIter(p2))
Afterwards I want to add all values in p1 not in p2 in a nested format as seen in nestline.
Any help would be appreciated as I'm wondering why my job is taking hours to run with nothing to show. Even the output logs in dataflow UI don't show anything.
Thanks in advance!
This can be quite expensive if the side input PCollection p2 is large. From your code snippet it's not clear how PCollection p2 is constructed, but if it is, for example, a text file of size 62.7 MB, processing it per element can be pretty expensive. Can you consider using CoGroupByKey: https://beam.apache.org/documentation/programming-guide/#cogroupbykey
Also note that from a FlatMap you are supposed to return an iterator of elements from the processing method. It seems like you are returning a dictionary ('data'), which is probably incorrect.
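A hedged sketch of the CoGroupByKey suggestion follows (the names key_by_sku_brand, merge_changes, and the 'dwsku'/'product' tags are illustrative, not part of your pipeline): key both collections by (sku_number, brand) so Beam groups matching rows together instead of re-scanning the whole side input for every element, and yield results from the FlatMap instead of returning a single dict.

import apache_beam as beam

def key_by_sku_brand(row):
    return ((row['sku_number'], row['brand']), row)

def merge_changes(kv):
    (sku_number, brand), grouped = kv
    for data in grouped['product']:
        for element in grouped['dwsku']:
            # reuse the existing type-II update logic for each matching pair
            data = process_changes(element, [data])
        yield data

keyed_dwsku = p1 | 'KeyDwsku' >> beam.Map(key_by_sku_brand)
keyed_product = p2 | 'KeyProduct' >> beam.Map(key_by_sku_brand)
changed_product = (
    {'dwsku': keyed_dwsku, 'product': keyed_product}
    | 'GroupBySkuBrand' >> beam.CoGroupByKey()
    | 'ApplyChanges' >> beam.FlatMap(merge_changes))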

python: data cleaning - detect pattern for fraudulent email addresses

I am cleaning a data set and removing fraudulent email addresses.
I established multiple rules for catching duplicates and fraudulent domains, but there is one scenario where I can't think of how to code a rule in Python to flag them.
So I have for example rules like this:
import string
import numpy as np

# delete punctuation
df['email'] = df['email'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation]))
# flag yopmail
pattern = "yopmail"
match = df['email'].str.contains(pattern)
df['yopmail'] = np.where(match, 'Y', '0')
# flag duplicates
df['duplicate'] = df.email.duplicated(keep=False)
This is the data where I can't figure out a rule to catch it. Basically I am looking for a way to flag addresses that start the same way but then have consecutive numbers at the end.
abc7020@gmail.com
abc7020.1@gmail.com
abc7020.10@gmail.com
abc7020.11@gmail.com
abc7020.12@gmail.com
abc7020.13@gmail.com
abc7020.14@gmail.com
abc7020.15@gmail.com
attn1@gmail.com
attn12@gmail.com
attn123@gmail.com
attn1234@gmail.com
attn12345@gmail.com
attn123456@gmail.com
attn1234567@gmail.com
attn12345678@gmail.com
My solution isn't efficient, nor pretty. But check it out and see if it works for you, @jeangelj. It definitely works for the examples you provided. Good luck!
import os
from random import shuffle
from difflib import SequenceMatcher

emails = [... ...] # for example the 16 email addresses you gave in your question
shuffle(emails) # everyday i'm shuffling
emails = sorted(emails) # sort that shit!
names = [email.split('@')[0] for email in emails]
T = 0.7 # <- set your string similarity threshold here!!
split_indices = []
for i in range(1, len(emails)):
    if SequenceMatcher(None, emails[i], emails[i-1]).ratio() < T:
        split_indices.append(i) # we want to remember where a dissimilar email address occurs
grouped = []
for i in split_indices:
    grouped.append(emails[:i])
    grouped.append(emails[i:])
# now we have similar email addresses grouped, we want to find the common prefix for each group
prefix_strings = []
for group in grouped:
    prefix_strings.append(os.path.commonprefix(group))
# finally
ham = []
spam = []
true_ids = [names.index(p) for p in prefix_strings]
for i in range(len(emails)):
    if i in true_ids:
        ham.append(emails[i])
    else:
        spam.append(emails[i])
In [30]: ham
Out[30]: ['abc7020@gmail.com', 'attn1@gmail.com']
In [31]: spam
Out[31]:
['abc7020.10@gmail.com',
 'abc7020.11@gmail.com',
 'abc7020.12@gmail.com',
 'abc7020.13@gmail.com',
 'abc7020.14@gmail.com',
 'abc7020.15@gmail.com',
 'abc7020.1@gmail.com',
 'attn12345678@gmail.com',
 'attn1234567@gmail.com',
 'attn123456@gmail.com',
 'attn12345@gmail.com',
 'attn1234@gmail.com',
 'attn123@gmail.com',
 'attn12@gmail.com']
# THE TRUTH YALL!
You can use a regular expression to do this; example below:
import re

a = "attn12345@gmail.com"
b = "abc7020.14@gmail.com"
c = "abc7020@gmail.com"
d = "attn12345678@gmail.com"
pattern = re.compile(r"[0-9]{3,500}\.?[0-9]{0,500}?@")
if pattern.search(a):
    print("spam1")
if pattern.search(b):
    print("spam2")
if pattern.search(c):
    print("spam3")
if pattern.search(d):
    print("spam4")
If you run the code you will see:
$ python spam.py
spam1
spam2
spam3
spam4
The benefit of this method is that it's standardized (regular expressions) and that you can adjust the strength of the match easily by tuning the values within {}; this means you can have a global configuration file where you set/adjust the values. You can also adjust the regular expression easily without having to rewrite code.
First, take a look at the regexp question here.
Second, try to filter the email addresses like this:
# Let's say the email is 'attn1234@gmail.com'
email = 'attn1234@gmail.com'
email_name = email.split('@', maxsplit=1)[0]
# Here you get email_name = 'attn1234'
import re
m = re.search(r'\d+$', email_name)
# if the string ends in digits, m will be a Match object, or None otherwise
if m is not None:
    print('%s is good' % email)
else:
    print('%s is BAD' % email)
You could pick a diff threshold using edit distance (aka Levenshtein distance). In python:
$ pip install editdistance
$ ipython2
>>> import editdistance
>>> threshold = 5  # This could be anything, really
>>> data = ["attn1@gmail.com...", ...]  # set up data to be the set you gave
>>> fraudulent_emails = set([email for email in data for _ in data if editdistance.eval(email, _) < threshold])
If you wanted to be smarter about it, you could run through the resulting list and, instead of turning it into a set, keep track of how many other email addresses it was near - then use that as a 'weight' to determine fake-ness.
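For example, a small sketch of that weighting idea (the threshold of 5 is as arbitrary as above):

import editdistance
from collections import Counter

def neighbour_counts(emails, threshold=5):
    # Count, for each address, how many other addresses are within the threshold
    counts = Counter()
    for email in emails:
        for other in emails:
            if email != other and editdistance.eval(email, other) < threshold:
                counts[email] += 1
    return counts  # higher counts suggest membership in a fraudulent cluster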
This catches not only the given cases (where the fraudulent addresses all share a common start and differ only in a numerical suffix), but also number or letter padding, e.g. at the beginning or in the middle of an email address.
ids = [s.split('@')[0] for s in email_list]
det = np.zeros((len(ids), len(ids)), dtype=np.bool)
for i in range(len(ids)):
    for j in range(i + 1, len(ids)):
        mi = ids[i]
        mj = ids[j]
        if len(mj) == len(mi) + 1 and mj.startswith(mi):
            try:
                int(mj[-1])
                det[j, i] = True
                det[i, j] = True
            except ValueError:
                continue
spam_indices = np.where(np.sum(det, axis=0) != 0)[0].tolist()
Here's one way to approach it, that should be pretty efficient.
We do it by grouping the email addresses by length, so that we only need to check whether each email address matches the level below, using a slice and a set-membership check.
The code:
First, read in the data:
import pandas as pd
import numpy as np
string = '''
abc7020@gmail.com
abc7020.1@gmail.com
abc7020.10@gmail.com
abc7020.11@gmail.com
abc7020.12@gmail.com
abc7020.13@gmail.com
abc7020.14@gmail.com
abc7020.15@gmail.com
attn1@gmail.com
attn12@gmail.com
attn123@gmail.com
attn1234@gmail.com
attn12345@gmail.com
attn123456@gmail.com
attn1234567@gmail.com
attn12345678@gmail.com
foo123@bar.com
foo1@bar.com
'''
x = pd.DataFrame({'x':string.split()})
#remove duplicates:
x = x[~x.x.duplicated()]
We strip off the @foo.bar part, then filter to only those that end with a number, and add on a 'length' column:
#split on @, expand means into two columns
emails = x.x.str.split('@', expand = True)
#filter by last in string is a digit
emails = emails.loc[:,emails.loc[:,0].str[-1].str.isdigit()]
#add a length of email column for the next step
emails['lengths'] = emails.loc[:,0].str.len()
Now, all we have to do is take each length and length - 1, and see whether the string, with its last character dropped, appears in the set of strings one character shorter (and we also have to check the opposite direction, in case it is the shortest repeat):
#unique lengths to check
lengths = emails.lengths.unique()
#mask to hold results
mask = pd.Series([0]*len(emails), index = emails.index)
#for each length
for j in lengths:
    #we subset those of that length
    totest = emails['lengths'] == j
    #and those who might be the shorter version
    against = emails['lengths'] == j - 1
    #we make a set of unique values, for a hashed lookup
    againstset = set([i for i in emails.loc[against, 0]])
    #we cut off the last char of each in totest
    tests = emails.loc[totest, 0].str[:-1]
    #we check matches, by checking the set
    mask = mask.add(tests.apply(lambda x: x in againstset), fill_value = 0)
    #vice versa, otherwise we miss the smallest one in the group
    againstset = set([i for i in emails.loc[totest, 0].str[:-1]])
    tests = emails.loc[against, 0]
    mask = mask.add(tests.apply(lambda x: x in againstset), fill_value = 0)
The resulting mask can be converted to boolean, and used to subset the original (deduplicated) dataframe, and the indices should match the original indices to subset like that:
x.loc[~mask.astype(bool),:]
x
0 abc7020@gmail.com
16 foo123@bar.com
17 foo1@bar.com
You can see that we have not removed your first value, as the '.' means it did not match - you can remove the punctuation first.
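For instance (a small assumption layered on the answer above), stripping the dots from the local part before building the length groups makes 'abc7020.1' differ from 'abc7020' by exactly one trailing digit again:

# drop '.' from the local part so the length-based check also catches 'abc7020.1'
emails.loc[:, 0] = emails.loc[:, 0].str.replace('.', '', regex=False)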
I have an idea on how to solve this:
fuzzywuzzy
Create a set of unique emails, for-loop over them and compare them with fuzzywuzzy.
Example:
import re
from fuzzywuzzy import fuzz

for email in emailset:
    for row in data:
        emailcomp = re.search(pattern=r'(.+)@.+', string=email).groups()[0]
        rowemail = re.search(pattern=r'(.+)@.+', string=row['email']).groups()[0]
        if row['email'] == email:
            continue
        elif fuzz.partial_ratio(emailcomp, rowemail) > 80:
            'flagging operation'
I took some liberties with how the data is represented, but I feel the variable names are mnemonic enough for you to understand what I am getting at. It is a very rough piece of code, in that I have not thought through how to stop repetitive flagging.
Anyway, the elif part compares the two email addresses without @gmail.com (or any other domain, e.g. @yahoo.com); if the ratio is above 80 (play around with this number), use your flagging operation.
For example:
fuzz.partial_ratio("abc7020.1", "abc7020")
100

Python, complex looping calculations with lists or arrays

I am converting old pseudo-Fortran code into Python and am struggling to create a framework within which I can perform some complex iterative calculations.
As a beginner, my first instinct is to use lists as I find them easier to work with, but I understand that arrays would probably be a more suitable approach.
I already have all the input channels as lists and am hoping for a good explanation of how to set up loops for such calculations.
This is an example of the pseudo-Fortran I am replicating. Each (t) indicates a 'time-series channel' that I currently have stored as a list (i.e. ECART2(t) and NNNN(t) are lists). All lists have the same number of entries.
do while ( ecart2(t) > 0.0002 .and. nnnn(t) < 2000. ) ;
mmm(t)=nnnn(t)+1.;
if YRPVBPO(t).ge.0.1 .and. YRPVBPO(t).le.0.999930338 .and. YAEVBPO(t).ge.0.000015 .and. YAEVBPO(t).le.0.000615 then do;
YM5(t) = customFunction(YRPVBPO,YAEVBPO);*
end;
YUEVBO(t) = YU0VBO(t) * YM5(t) ;*m/s
YHEVBO(t) = YCPEVBO(t)*TPO_TGETO1(t)+0.5*YUEVBO(t)*YUEVBO(t);*J/kg
YAVBO(t) = ddnn2(t)*(YUEVBO(t)**2);*
YDVBO(t) = YCPEVBO(t)**2 + 4*YHEVBO(t)*YAVBO(t) ;*
YTSVBPO(t) = (sqrt(YDVBO(t))-YCPEVBO(t))/2./YAVBO(t);*K
YUSVBO(t) = ddnn(t)*YUEVBO(t)*YTSVBPO(t);*m/s
YM7(t) = YUSVBO(t)/YU0VBO(t);*
YPHSVBPOtot(t) = (YPHEVBPO(t) - YPDHVBPO(t))/(1.+((YGAMAEVBO(t)-1)/2)*(YM7(t)**2))**(YGAMAEVBO(t)/(1-YGAMAEVBO(t)));*bar
YPHEVBPOtot(t) = YPHEVBPO(t) / (1.+rss0(t)*YM5(t)*YM5(t))**rss1(t);*bar
YDPVBPOtot(t) = YPHEVBPOtot(t) - YPHSVBPOtot(t) ;*bar
iter(t) = (YPHEVBPOtot(t) - YDPVBPOtot(t))/YPHEVBPOtot(t);*
ecart2(t)= ABS(iter(t)-YRPVBPO(t));*
aa(t)=YRPVBPO(t)+0.0001;
YRPVBPO(t)=aa(t);*
nnnn(t)=mmm(t);*
end;
Understanding the pseudo-Fortran: with 'time-series data' there is an implicit loop iterating through the individual values in each list, as well as looping over each of those values until the conditions are met.
It carries out the loop calculations on the first list values until the conditions are met, then moves on to the second value in the lists and performs the same looping calculations until the conditions are met...
ECART2 = [2,0,3,5,3,4]
NNNN = [6,7,5,8,6,7]
do while ( ecart2(t) > 0.0002 .and. nnnn(t) < 2000. )
MMM = NNNN + 1
This looks at the first values in each list (2 and 6). Because the conditions are met, subsequent calculations are performed on the first values in the new lists, such as MMM = [6+1, ...].
Once the rest of the calculations have been performed (looping multiple times if the conditions are not met), only then does the second value in every list get considered. The second values (0 and 7) do not meet the conditions, and therefore the second entry of MMM is 0.
MMM = [6+1, 0, ...]
Because 0 must be entered if the conditions are not met, I am considering setting up all the 'new lists' in advance and populating them with 0s.
NB: 'customFunction()' is a separate function that is called, returning a value from two input values
MY CURRENT SOLUTION
set up all the empty lists
nPts = len(ECART2)
MMM = [0]*nPts
YM5 = [0]*nPts
# etc...
then start performing calculations
for i in range(nPts):
    while (ECART2[i] > 0.0002) and (NNNN[i] < 2000):
        MMM[i] = NNNN[i] + 1
        if YRPVBPO[i] >= 0.1 and YRPVBPO[i] <= 0.999930338 and YAEVBPO[i] >= 0.000015 and YAEVBPO[i] <= 0.000615:
            YM5[i] = MACH_LBP_DIA30(YRPVBPO[i], YAEVBPO[i])
        YUEVBO[i] = YU0VBO[i]*YM5[i]
        YHEVBO[i] = YCPEVBO[i]*TGETO1[i] + 0.5*YUEVBO[i]**2
        YAVBO[i] = DDNN2[i]*YUEVBO[i]**2
        YDVBO[i] = YCPEVBO[i]**2 + 4*YHEVBO[i]*YAVBO[i]
        # etc etc...
but I'm guessing that there are better ways of doing this, such as the suggestion to use numpy arrays (something I plan on learning in the near future).
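As a hedged sketch of that numpy idea (channel names reused from above; the elided formulas go where the comment indicates), you can keep every channel as an array and use a boolean mask so each pass updates only the entries that still need iterating:

import numpy as np

ecart2 = np.asarray(ECART2, dtype=float)
nnnn = np.asarray(NNNN, dtype=float)
mmm = np.zeros_like(ecart2)

active = (ecart2 > 0.0002) & (nnnn < 2000)   # entries that still need iterating
while active.any():
    mmm[active] = nnnn[active] + 1
    # ... evaluate the remaining formulas with the same [active] indexing,
    #     which must update ecart2[active] just as in the list version ...
    nnnn[active] = mmm[active]
    active = (ecart2 > 0.0002) & (nnnn < 2000)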
