Properly mapping first names to emails? - python

I am working on a research project and I have a list of about 200 names and 6 email addresses. The requirement is to map every one of those names to a single email address following this rule:
"Names starting with A, B, C, D, E will map to email1. F, G, H, I, J will map to email2" and so on.
Now I'm trying to think of a way to map those names to the specific email in a fashion of "if name starts with A-E then email1", rather than iterating through all the names and checking the starting letter of each name. Is there a way to accomplish this? I'm thinking RegEx might help, but I'm not sure exactly how (possibly something along the lines of ^[a-eA-E]?)

The re module has an undocumented Scanner class which can be used to attach an arbitrary function call to regex patterns. When the Scanner.scan method is called, the supplied text is matched against each regex pattern, and the associated function is called when a match is found. The scan method ends when the remaining text matches none of the patterns.
import re
def make_email(i):
    def email(scanner, token):
        print('{t}: Send to email{i}'.format(t=token, i=i))
    return email
scanner = re.Scanner(
    [(pat, make_email(i))  # 2
     for i, pat in enumerate((r"^[a-e]\w+", r"^[f-j]\w+"))]  # 1
    + [(r"\s+", None)],
    flags=re.IGNORECASE|re.MULTILINE)
scanner.scan("""\
Albert
Barry
Carrie
David
Erin
Franklin
Geoff
Harold
Isadore
Jay""")
prints
Albert: Send to email0
Barry: Send to email0
Carrie: Send to email0
David: Send to email0
Erin: Send to email0
Franklin: Send to email1
Geoff: Send to email1
Harold: Send to email1
Isadore: Send to email1
Jay: Send to email1
The numbered comments in the code mark the following:
1. You can add more regex patterns here.
2. The Scanner class is initialized with a list of 2-tuples. Each 2-tuple consists of a regex pattern and the associated callback function.
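If you would rather collect the routing than print it, the callbacks can return a value. The sketch below relies on the behaviour of the undocumented Scanner class as I understand it: non-None return values are gathered into a list, and scan() hands back that list together with the unmatched remainder. The helper name route and the sample names are made up for illustration.

```python
import re

def route(target):
    # Build a callback that pairs the matched name with its target address label.
    def callback(scanner, token):
        return (token, target)
    return callback

scanner = re.Scanner(
    [(r"[a-e]\w*", route("email1")),   # names starting with A-E
     (r"[f-j]\w*", route("email2")),   # names starting with F-J
     (r"\s+", None)],                  # whitespace between names is skipped
    flags=re.IGNORECASE)

# scan() returns the collected callback results and whatever text it could not match.
pairs, remainder = scanner.scan("Albert\nFranklin\nErin Zed")
print(pairs)
print(repr(remainder))
```

Scanning stops at the first name that no pattern covers, so a non-empty remainder is a cheap way to notice names that still need a rule.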

The simple and straightforward solution is to create a simple dictionary with regexes as keys, and loop over those.
import re
mappings = { r'^[a-e]': "email0", r'^[f-j]': "email1" }
for name in names:
    for regex in mappings:
        if re.match(regex, name, flags=re.IGNORECASE):
            print("%s: send to %s" % (name, mappings[regex]))
            break
    else:
        # for/else: runs only when no pattern matched (the loop never hit break)
        print("%s: no match" % name)
If you do this on an industrial scale, you would probably want to precompile the regexes with re.compile() but for a quick and dirty solution, this gets the job done.
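The precompiling mentioned above might look like the following sketch. The sample names, the MAPPINGS list, and the lookup helper are all made up for illustration; only two of the six ranges are filled in.

```python
import re

# Hypothetical sample data; the real list would hold about 200 names.
names = ["Albert", "franklin", "Zed"]

# Compile each pattern once and pair it with the label it maps to.
# match() already anchors at the start of the string, so no "^" is needed.
MAPPINGS = [
    (re.compile(r"[a-e]", re.IGNORECASE), "email0"),
    (re.compile(r"[f-j]", re.IGNORECASE), "email1"),
]

def lookup(name):
    # Return the label of the first pattern that matches, or None if none does.
    for pattern, target in MAPPINGS:
        if pattern.match(name):
            return target
    return None

results = {name: lookup(name) for name in names}
print(results)
```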

You only need to know the first letter in each name, and map it to an email address. You don't need a regex for that.
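def address(name):
    addresses = ['foo@bar.com', 'spam@eggs.org', ... ]
    # Alphabet position of the first letter, in blocks of five: a-e -> 0, f-j -> 1, ...
    # Note: find() returns -1 for a non-letter, which would select the last address.
    i = 'abcdefghijklmnopqrstuvwxyz'.find(name[0].lower()) // 5
    return addresses[i]
Then you want to iterate over the names.
for name in names:
    print(name, address(name))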
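A slightly more defensive sketch of the same idea, assuming six placeholder addresses (the example.com addresses are invented, substitute the real ones). It rejects empty names and names that do not start with a letter instead of silently picking the last address.

```python
import string

# Hypothetical placeholder addresses standing in for the six real ones.
ADDRESSES = ["email%d@example.com" % n for n in range(1, 7)]

def address(name):
    first = name.strip()[:1].lower()
    # str.find("") returns 0, so an empty name has to be ruled out explicitly.
    pos = string.ascii_lowercase.find(first) if first else -1
    if pos < 0:
        raise ValueError("name does not start with a letter a-z: %r" % name)
    # Blocks of five letters: a-e -> 0, f-j -> 1, ..., u-y -> 4, z -> 5.
    return ADDRESSES[pos // 5]

for name in ["Albert", "jay", "Karl", "Zoe"]:
    print(name, address(name))
```

With 26 letters in blocks of five, the sixth address only ever receives names starting with Z, which may or may not be the split you want.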

Related

Extract strings between two words that are supplied from two lists respectively

I have a text which looks like an email body as follows.
To: Abc Cohen <abc.cohen@email.com> Cc: <braggis.mathew@nomail.com>,<samanth.castillo@email.com> Hi
Abc, I happened to see your report. I have not seen any abnormalities and thus I don't think we
should proceed to Braggis. I am open to your thought as well. Regards, Abc On Tue 23 Jul 2017 07:22
PM
Tony Stark wrote:
Then I have a list of key words as follows.
no_wds = ["No","don't","Can't","Not"]
yes_wds = ["Proceed","Approve","May go ahead"]
Objective:
I want to first search the text string as given above and if any of the key words as listed above is (or are) present then I want to extract the strings in between those key words. In this case, we have Not and don't keywords matched from no_wds. Also we have Proceed key word matched from yes_wds list. Thus the text I want to be extracted as list as follows
txt = ["seen any abnormalities and thus I don't think we should", "think we should"]
My approach:
I have tried
re.findall(r'{}(.*){}'.format(re.escape('|'.join(no_wds)),re.escape('|'.join(yes_wds))),text,re.I)
Or
text_f = []
for i in no_wds:
    for j in yes_wds:
        t = re.findall(r'{}(.*){}'.format(re.escape(i),re.escape(j)),text, re.I)
        text_f.append(t)
Didn't get any suitable result. Then I tried str.find() method, there also no success.
I tried to get a clue from here.
Can anybody help in solving this? Any non-regex solution is something I am keen to see, as regex at times is not a good fit. Having said that, if anyone can come up with a regex-based solution where I can iterate over the lists, it is welcome.

Loop through the list containing the keys, use the iterator as a splitter (whatever.split(yourIterator)).
EDIT:
I am not doing your homework, but this should get you on your way:
I split the message at every space, looped over the resulting list looking for the key words, and collected the index of each hit in a list. I then used those indexes to slice the message. It is probably worth trying to slice the message without splitting it first. You must also find a way to automate the process when there are more indexes. Tip: check that the number of indexes is even, or you are going to have a bad time slicing.
*Note that you should replace the \n characters and find a way to sort the key lists.
message = """To: Abc Cohen <abc.cohen@email.com> Cc: <braggis.mathew@nomail.com>,<samanth.castillo@email.com> Hi
Abc, I happened to see your report. I have not seen any abnormalities and thus I don't think we
should proceed to Braggis. I am open to your thought as well. Regards, Abc On Tue 23 Jul 2017 07:22"""
no_wds = ["No","don't","Can't","Not"]
yes_wds = ["Proceed","Approve","May go ahead"]
splittedMessage = message.split( ' ' )
msg = []  # indexes of the words that matched a key word
for i in range( 0, len( splittedMessage ) ):
    temp = splittedMessage[i]
    # zip() stops at the shorter list, so the fourth entry of no_wds ("Not") is never checked
    for j, k in zip( no_wds, yes_wds ):
        tempJ = j.lower()
        tempK = k.lower()
        if( tempJ == temp or tempK == temp ):
            msg.append( i )
# Slice from the first hit up to (not including) the second
found = ' '.join( splittedMessage[msg[0]:msg[1]] )
print( found )
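For comparison, here is a regex-based sketch that iterates over both lists and returns the text between each "no" word and the next "yes" word after it. It is only one possible reading of the question; the helper names alternation and between are invented, and the sample text is a shortened version of the message.

```python
import re

text = ("I have not seen any abnormalities and thus I don't think we "
        "should proceed to Braggis.")
no_wds = ["No", "don't", "Can't", "Not"]
yes_wds = ["Proceed", "Approve", "May go ahead"]

def alternation(words):
    # Escape each word on its own, then join with "|" (escaping the joined
    # string would also escape the "|"). Longest first so "Not" beats "No".
    return "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))

# (?<!\w) and (?!\w) keep "no" from matching inside words like "abnormalities".
no_re = re.compile(r"(?<!\w)(?:%s)(?!\w)" % alternation(no_wds), re.IGNORECASE)
yes_re = re.compile(r"(?<!\w)(?:%s)(?!\w)" % alternation(yes_wds), re.IGNORECASE)

def between(text):
    spans = []
    for start in no_re.finditer(text):
        # Look for the first "yes" word that follows this "no" word.
        stop = yes_re.search(text, start.end())
        if stop:
            spans.append(text[start.end():stop.start()].strip())
    return spans

found = between(text)
print(found)
```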

How to go about matching first and last name to an email in Python scraper

I'm very new to python and have been getting help from peers with developing this program. I essentially have a very unrefined, dynamic scraper, that pulls emails from a given URL.
I've been considering how I would go about matching up a first/last name to the email address, and came up with the idea of matching any 4+ consecutive characters before the '@' in an email to another element on the web page, under the assumption that most businesses use at least some portion of the first/last name when creating the email address. I decided on 4 characters to avoid the mix-ups that might occur at 3+ characters, which I don't feel is specific enough.
I hope this isn't too general of a question, I'm just unsure where to start.
Most of what I have found while pondering this question has been based on splitting the email and using regex to match, but I'm unsure if this will work on the page itself/how to implement.
import urllib.request,re
f = urllib.request.urlopen("http://www.sampleurl.com")
s = f.read().decode('utf-8')
print(re.findall(r"\+\d{2}\s?0?\d{10}",s))
print(re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}",s))
This is a very basic version of a much larger program, but it is most of the inner workings. It returns email and phone number properly based on the given URL.
from email_split import email_split
import urllib.request,re
# Find emails
f = urllib.request.urlopen("https://www.sampleurl.com/")
s = f.read().decode('utf-8')
e = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}",s)
emails = []
for x in e:
    emails.append(x)
    # Split email
    email = email_split(x)
    # Named `local` rather than `str`, which would shadow the built-in
    local = email.local
    match = re.search(r'([\w.-]+)', local)
    if match:
        print(match.group())

It is sort of a general question, but I thought of this right away. I think this would be a 3-step process:
1) Extract the names on the website. I haven't used it, but it sounds like you could use spaCy to pull out name entities. Then store those in a list of names.
2) Extract all the email addresses on the site and store those in a list of email addresses.
3) Then use fuzzywuzzy to iterate through the names and find matches to an email address.
Without having a specific website to try this on, it's just in theory, and I created a sample list to show what I'm thinking:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
names = ['John A. Smith', 'Michael Jordan', 'James Thomas', 'Bradley Cooper']
emails = ["jasmith@foo.com", "john.smith@bar.com", "jthomas@foobar.com", "bradc@company.com"]
# Local part (before the '@') of each address, mapped back to the full address
email_users = [ x.split('@')[0] for x in emails ]
email_dict = dict(zip(email_users, emails))
for name in names:
    # Up to three best-scoring local parts for this name
    matches = process.extract(name, email_users, limit=3)
    print('%s: ' % name)
    for each in matches:
        # Keep only reasonably close matches
        if each[1] >= 70:
            print('\t%-20s - Score: %s' % (email_dict[each[0]], each[1]))
Output:
John A. Smith:
    john.smith@bar.com   - Score: 95
    jasmith@foo.com      - Score: 70
Michael Jordan:
James Thomas:
    jthomas@foobar.com   - Score: 77
Bradley Cooper:
    bradc@company.com    - Score: 72
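The "4+ consecutive characters" rule from the question can also be tried with the standard library alone. This is a rough sketch using difflib rather than fuzzywuzzy; the function name shares_run and the min_len parameter are made up, and the sample data is the list from the answer above.

```python
from difflib import SequenceMatcher

names = ['John A. Smith', 'Michael Jordan', 'James Thomas', 'Bradley Cooper']
emails = ["jasmith@foo.com", "john.smith@bar.com", "jthomas@foobar.com", "bradc@company.com"]

def shares_run(local_part, name, min_len=4):
    # Compare lowercase text; drop spaces and punctuation from the name.
    a = local_part.lower()
    b = "".join(ch for ch in name.lower() if ch.isalpha())
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # Longest block of consecutive characters common to both strings.
    longest = matcher.find_longest_match(0, len(a), 0, len(b))
    return longest.size >= min_len

# For every name, list the addresses whose local part shares a long enough run.
candidates = {
    name: [e for e in emails if shares_run(e.split("@")[0], name)]
    for name in names
}
print(candidates)
```

Unlike a fuzzy score, this gives a plain yes/no per pair, so a name can still end up with several candidate addresses that need a second look.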

Write semi-colon separated email addresses into a csv file

I have a file email.txt which has email addresses as follows:
James, Brian < brian.james@abc.com>; Attar, Daniel < Daniel.Attar@abc.com>; Alex, James < james.alex@abc.com>; Trendy, Elizabeth < elizabeth.trendy@abc.com>; jones, Gary < Gary.Jones@abc.com>; bones, byron < byron.bones@abc.com>;
I want to write the email addresses into a .csv file in one column like this:
brian.james@abc.com
daniel.attar@abc.com
...
byron.bones@abc.com
I wrote a Python script as follows which does this:
fn1 = "email.txt"
f1 = open(fn1,"r")
f1r1 = f1.readlines()
f1r2 = [i.strip() for i in f1r1]
f1r3 = [i.split(";") for i in f1r2]
s1 = f1r3[0]
a = open("ef.csv","w")
for i in s1:
    j = i.split("<")
    # Skip the empty piece left after the trailing ";"
    if len(j) > 1:
        # Strip the closing ">" and the surrounding spaces
        a.write(j[1].strip(" >")+"\n")
a.close()
Is there a better, more efficient or more elegant way to write this?

You could consider reading the contents of the text file as a single string and then using re to extract the emails from that string.
In this case, it looks like your email format is fairly specific, so the regex below is specific as well. Realize, though, that a regex capable of finding any RFC 5322-compliant email address (the "official standard" for email address formats) is several hundred characters long. For more on that see How to Find or Validate an Email Address from Jan Goyvaerts.
Anyway...
import re
with open('email.txt', 'r') as file:
    # Produces a single string, `emails`
    emails = file.read().replace('\n', '')
regex = re.compile(r'\S+\.\S+@abc\.com')
for email in regex.findall(emails):
    print(email)
# brian.james@abc.com
# Daniel.Attar@abc.com
# james.alex@abc.com
# elizabeth.trendy@abc.com
# Gary.Jones@abc.com
# byron.bones@abc.com
Regex walkthrough: this regex assumes each email takes a pretty specific form: something<dot>somethingelse<at>abc.com.
\S+ is 1 or more non-whitespace characters
\. is a literal period (backslashing a metacharacter)
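To finish the job the question asks for, the extracted addresses still have to be written out one per row. A minimal sketch using the csv module follows; it keys on the angle brackets instead of the abc.com domain, lowercases the addresses to match the desired output, and uses an in-memory buffer as a stand-in for the real ef.csv file. The shortened raw string is sample data.

```python
import csv
import io
import re

raw = ("James, Brian < brian.james@abc.com>; Attar, Daniel < Daniel.Attar@abc.com>; "
       "bones, byron < byron.bones@abc.com>;")

# Capture whatever sits between "<" and ">", ignoring padding spaces.
addresses = [a.lower() for a in re.findall(r"<\s*([^<>\s]+)\s*>", raw)]

# Stand-in for: open("ef.csv", "w", newline="")
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows([a] for a in addresses)  # one single-column row per address

csv_text = buffer.getvalue()
print(csv_text)
```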

Method for parsing text Cc field of email header?

I have the plain text of a Cc header field that looks like so:
friend@email.com, John Smith <john.smith@email.com>,"Smith, Jane" <jane.smith@uconn.edu>
Are there any battle tested modules for parsing this properly?
(bonus if it's in python! the email module just returns the raw text without any methods for splitting it, AFAIK)
(also bonus if it splits name and address into two fields)

There are a bunch of functions available as a standard python module, but I think you're looking for
email.utils.parseaddr() or email.utils.getaddresses()
>>> import email.utils
>>> addresses = 'friend@email.com, John Smith <john.smith@email.com>,"Smith, Jane" <jane.smith@uconn.edu>'
>>> email.utils.getaddresses([addresses])
[('', 'friend@email.com'), ('John Smith', 'john.smith@email.com'), ('Smith, Jane', 'jane.smith@uconn.edu')]

I haven't used it myself, but it looks to me like you could use the csv package quite easily to parse the data.
The code below is completely unnecessary. I wrote it before realising that you could pass getaddresses() a list containing a single string containing multiple addresses.
I haven't had a chance to look at the specifications for addresses in email headers, but based on the string you provided, this code should do the job of splitting it into a list, making sure to ignore commas if they are within quotes (and therefore part of a name).
from email.utils import getaddresses
addrstring = ',friend@email.com, John Smith <john.smith@email.com>,"Smith, Jane" <jane.smith@uconn.edu>,'
def addrparser(addrstring):
    addrlist = ['']
    quoted = False
    # ignore comma at beginning or end
    addrstring = addrstring.strip(',')
    for char in addrstring:
        if char == '"':
            # toggle quoted mode
            quoted = not quoted
            addrlist[-1] += char
        # a comma outside of quotes means a new address
        elif char == ',' and not quoted:
            addrlist.append('')
        # anything else is the next letter of the current address
        else:
            addrlist[-1] += char
    return getaddresses(addrlist)
print(addrparser(addrstring))
Gives:
[('', 'friend@email.com'), ('John Smith', 'john.smith@email.com'),
('Smith, Jane', 'jane.smith@uconn.edu')]
I'd be interested to see how other people would go about this problem!
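Since getaddresses() already copes with commas inside quoted names, it can be called on the raw header directly. A minimal sketch; the fourth address is invented to show a repeated display name, and keying the dictionary by address rather than by name is a design choice made here so that repeated names do not collide.

```python
from email.utils import getaddresses

header = ('friend@email.com, John Smith <john.smith@email.com>,'
          '"Smith, Jane" <jane.smith@uconn.edu>, John Smith <jsmith@example.org>')

# One (display name, address) tuple per recipient.
pairs = getaddresses([header])

# Addresses are unique per mailbox, names are not, so key on the address.
by_address = {addr: name for name, addr in pairs}

for name, addr in pairs:
    print(repr(name), addr)
```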
Convert a string holding multiple named email addresses into a dictionary.
import email.utils
emailstring = 'Friends <friend@email.com>, John Smith <john.smith@email.com>,"Smith" <jane.smith@uconn.edu>'
Split the string by comma:
email_list = emailstring.split(',')
Use the name as key and the email as value to build the dictionary:
email_dict = dict(map(lambda x: email.utils.parseaddr(x), email_list))
The result looks like this:
{'John Smith': 'john.smith@email.com', 'Friends': 'friend@email.com', 'Smith': 'jane.smith@uconn.edu'}
Note:
If the same name appears with different email addresses, only one record is kept, because the later entry overwrites the earlier one. For example:
'Friends <friend@email.com>, John Smith <john.smith@email.com>,"Smith" <jane.smith@uconn.edu>, Friends <friend_co@email.com>'
Here "Friends" appears twice.

How to eliminate email formatting in received email?

I am practicing sending emails with Google App Engine with Python. This code checks to see if message.sender is in the database:
class ReceiveEmail(InboundMailHandler):
    def receive(self, message):
        querySender = User.all()
        querySender.filter("userEmail =", message.sender)
        senderInDatabase = None
        for match in querySender:
            senderInDatabase = match.userEmail
This works in the development server because I send the email as "az@example.com" and message.sender="az@example.com"
But I realized that in the production server emails come formatted as "az <az@example.com>" and my code fails because now message.sender="az <az@example.com>" but the email in the database is simply "az@example.com".
I searched for how to do this with regex and it is possible but I was wondering if I can do this with Python lists? Or, what do you think is the best way to achieve this result? I need to take just the email address from the message.sender.
App Engine documentation acknowledges the formatting but I could not find a specific way to select the email address only.
Thanks!
EDIT2 (re: Forest answer)
@Forest:
parseaddr() appears to be simple enough:
>>> from email.utils import parseaddr
>>> e = "az <az@example.com>"
>>> parsed = parseaddr(e)
>>> parsed
('az', 'az@example.com')
>>> parsed[1]
'az@example.com'
>>>
But this still does not cover the other type of formatting that you mention: user@example.com (Full Name)
>>> e2 = "<az@example.com> az"
>>> parsed2 = parseaddr(e2)
>>> parsed2
('', 'az@example.com')
>>>
Is there really a formatting where the full name comes after the email?
EDIT (re: Adam Bernier answer)
My try about how the regex works (probably not correct):
r   # raw string
<   # first limit character
(   # what is inside () is matched
[   # indicates a set of characters
^   # start of string
>   # start with this and go backward?
]   # end set of characters
+   # repeat the match
)   # end group
>   # end limit character

Rather than storing the entire contents of a To: or From: header field as an opaque string, why don't you parse incoming email and store the email address separately from the full name? See email.utils.parseaddr(). This way you don't have to use complicated, slow pattern matching when you want to look up an address. You can always reassemble the fields using formataddr().

If you want to use regex try something like this:
>>> import re
>>> email_string = "az <az@example.com>"
>>> re.findall(r'<([^>]+)>', email_string)
['az@example.com']
Note that the above regex handles multiple addresses...
>>> email_string2 = "az <az@example.com>, bz <bz@example.com>"
>>> re.findall(r'<([^>]+)>', email_string2)
['az@example.com', 'bz@example.com']
but this simpler regex doesn't:
>>> re.findall(r'<(.*)>', email_string2)
['az@example.com>, bz <bz@example.com'] # matches too much
Using slices (which I think you meant to say instead of "lists") seems more convoluted, e.g.:
>>> email_string[email_string.find('<')+1:-1]
'az@example.com'
and if multiple:
>>> email_strings = email_string2.split(',')
>>> for s in email_strings:
...     s[s.find('<')+1:-1]
...
'az@example.com'
'bz@example.com'
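To clear up how that pattern reads, here is the same regex spelled out with re.VERBOSE, next to the parseaddr() route. This is only an annotated sketch; the name ANGLE_ADDR is made up. The key point is that a ^ placed first inside square brackets negates the set, it does not mean "start of string" there.

```python
import re
from email.utils import parseaddr

ANGLE_ADDR = re.compile(r"""
    <          # a literal opening angle bracket
    (          # start of the capture group (the part findall returns)
      [^>]+    # one or more characters that are NOT a closing bracket
    )          # end of the capture group
    >          # a literal closing angle bracket
""", re.VERBOSE)

sender = "az <az@example.com>"
via_regex = ANGLE_ADDR.findall(sender)

# parseaddr() returns (display name, address) and also handles a bare address.
via_parseaddr = parseaddr(sender)[1]
bare = parseaddr("az@example.com")[1]

print(via_regex, via_parseaddr, bare)
```

Because parseaddr() gives the same address for both the development and the production form of the sender, its second element is probably the safer value to compare against the database.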
