I have a file email.txt which has email addresses as follows:
James, Brian < brian.james#abc.com>; Attar, Daniel < Daniel.Attar#abc.com>; Alex, James < james.alex#abc.com>; Trendy, Elizabeth < elizabeth.trendy#abc.com>; jones, Gary < Gary.Jones#abc.com>; bones, byron < byron.bones#abc.com>;
I want to write the email addresses into a .csv file in one column like this:
brian.james#abc.com
daniel.attar#abc.com
...
byron.bones#abc.com
I wrote a Python script as follows which does this:
fn1 = "email.txt"
f1 = open(fn1,"r")
f1r1 = f1.readlines()
f1r2 = [i.strip() for i in f1r1]
f1r3 = [i.split(";") for i in f1r2]
s1 = f1r3[0]
a = open("ef.csv","w")
for i in s1:
    j = i.split("<")
    a.write(j[1].strip(">")+"\n")
a.close()
Is there a better, more efficient or more elegant way to write this?
You could consider reading the contents of the text file as a single string and then using re to extract the emails from that string.
In this case, it looks like your email format is fairly specific, so the regex below is specific as well. Realize, though, that a regex capable of finding any RFC 5322-compliant email address (the "official standard" for email address formats) is several hundred characters long. For more on that, see "How to Find or Validate an Email Address" by Jan Goyvaerts.
Anyway...
import re

with open('email.txt', 'r') as file:
    # Produces a single string, `emails`
    emails = file.read().replace('\n', '')

# Raw string so that \S and \. reach the regex engine unchanged
regex = re.compile(r'\S+\.\S+#abc\.com')
for email in regex.findall(emails):
    print(email)

# brian.james#abc.com
# Daniel.Attar#abc.com
# james.alex#abc.com
# elizabeth.trendy#abc.com
# Gary.Jones#abc.com
# byron.bones#abc.com
Regex walkthrough: this regex assumes each email takes a pretty specific form: something<dot>somethingelse<at>abc.com.
\S+ is 1 or more non-whitespace characters
\. is a literal period (backslashing a metacharacter)
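To connect the regex back to the original goal (one lowercase address per row in a .csv file), here is a minimal sketch. The helper name `write_emails` and the in-memory buffer are assumptions made for illustration; the pattern is the same narrow one described above, so it only finds `abc.com` addresses.

```python
import csv
import io
import re

# Same deliberately narrow pattern as in the walkthrough above.
EMAIL_RE = re.compile(r'\S+\.\S+#abc\.com')

def write_emails(text, out):
    """Write every address found in `text` to `out`, one lowercase row each."""
    found = [addr.lower() for addr in EMAIL_RE.findall(text)]
    writer = csv.writer(out, lineterminator='\n')
    writer.writerows([addr] for addr in found)
    return found

sample = ("James, Brian < brian.james#abc.com>; "
          "Attar, Daniel < Daniel.Attar#abc.com>;")
buf = io.StringIO()          # stands in for an open .csv file
result = write_emails(sample, buf)
print(buf.getvalue())
```

With real files, the buffer would be replaced by something like `open("ef.csv", "w", newline="")`, and `sample` by the contents of email.txt.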
I'm very new to python and have been getting help from peers with developing this program. I essentially have a very unrefined, dynamic scraper, that pulls emails from a given URL.
I've been considering how I would go about matching up a first/last name to the email address, and came up with the idea of matching any 4+ consecutive characters before the '#' in an email to another element on the web page, under the assumption that most businesses use at least some portion of the first/last name when creating the email. I also decided to go with 4 characters to avoid any mix-ups that might occur at 3+ characters, as I don't feel that is specific enough.
I hope this isn't too general of a question, I'm just unsure where to start.
Most of what I have found while pondering this question has been based on splitting the email and using regex to match, but I'm unsure if this will work on the page itself/how to implement.
import urllib.request,re
f = urllib.request.urlopen("http://www.sampleurl.com")
s = f.read().decode('utf-8')
print(re.findall(r"\+\d{2}\s?0?\d{10}",s))
print(re.findall(r"[A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,4}",s))
This is a very basic version of a much larger program, but it is most of the inner workings. It returns email addresses and phone numbers properly based on the given URL.
from email_split import email_split
import urllib.request, re

# Find emails
f = urllib.request.urlopen("https://www.sampleurl.com/")
s = f.read().decode('utf-8')
e = re.findall(r"[A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,4}", s)

emails = []
for x in e:
    emails.append(str(x))
    # Split email and look at the local part (the text before the '#')
    email = email_split(x)
    local = email.local  # renamed from `str` to avoid shadowing the builtin
    match = re.search(r'([\w.-]+)', local)
    if match:
        print(match.group())
It is sort of a general question, but I thought of this right away. I'm sort of thinking this would be a 3 step process:
1) extract the names on the website. I haven't used it, but sounds like you could use spaCy to pull out name entities. Then store those in some type of list of names.
2) extract all the email addresses on the site and store those in a list of email addresses.
3) Then use fuzzywuzzy to iterate through the names and find the email addresses that best match each name.
Without having a specific website to try this on, it's just in theory, and I created a sample list to sort of show what I'm thinking:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

names = ['John A. Smith', 'Michael Jordan', 'James Thomas', 'Bradley Cooper']
emails = ["jasmith#foo.com", "john.smith#bar.com", "jthomas#foobar.com", "bradc#company.com"]

# Compare names against the part of each address before the '#'
email_users = [x.split('#')[0] for x in emails]
email_dict = dict(zip(email_users, emails))

for name in names:
    matches = process.extract(name, email_users, limit=3)
    print('%s: ' % name)
    for each in matches:
        # Only report candidates scoring 70 or higher
        if each[1] >= 70:
            print('\t%-20s - Score: %s' % (email_dict[each[0]], each[1]))
        else:
            continue
Output:
John A. Smith:
    john.smith#bar.com - Score: 95
    jasmith#foo.com - Score: 70
Michael Jordan:
James Thomas:
    jthomas#foobar.com - Score: 77
Bradley Cooper:
    bradc#company.com - Score: 72
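The "4+ consecutive characters" idea from the question can also be tried with the standard library alone. The sketch below is only one possible reading of that idea: it strips everything except letters from both the name and the part of the address before the '#', then uses `difflib.SequenceMatcher.find_longest_match` to measure the longest shared run. The helper names (`letters_only`, `longest_shared_run`, `match_names`) are made up for this example.

```python
import re
from difflib import SequenceMatcher

def letters_only(text):
    """Lowercase the text and drop everything that is not a-z."""
    return re.sub(r'[^a-z]', '', text.lower())

def longest_shared_run(a, b):
    """Length of the longest block of consecutive characters common to a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size

def match_names(names, emails, min_run=4):
    """Map each name to the addresses sharing at least `min_run` characters in a row."""
    result = {}
    for name in names:
        key = letters_only(name)
        result[name] = [
            e for e in emails
            if longest_shared_run(key, letters_only(e.split('#')[0])) >= min_run
        ]
    return result

names = ['John A. Smith', 'Michael Jordan', 'James Thomas', 'Bradley Cooper']
emails = ["jasmith#foo.com", "john.smith#bar.com", "jthomas#foobar.com", "bradc#company.com"]
matched = match_names(names, emails)
for name, hits in matched.items():
    print(name, hits)
```

On this sample list, the threshold of 4 is just enough for "brad" to link Bradley Cooper to bradc#company.com; on a real page a shorter threshold would likely produce more false matches, as the question anticipates.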
I have a list of strings containing Twitter text, for example the data below (in practice there is far more text than this). I want to extract all the user names that follow a # and all the URL links in the tweets, for example galaxy5univ and the URL.
tweet_text = ['#galaxy5univ I like you',
              "RT #BestOfGalaxies: Let's sit under the stars ...",
              '#jonghyun__bot .........((thanks)',
              'RT #yosizo: thanks.ddddd <https://yahoo.com>',
              'RT #LDH_3_yui: #fam, ccccc https://msn.news.com']
My code:
import re

pu = re.compile(r'http\S+')
pn = re.compile(r'#(\S+)')
for row in tweet_text:
    text = pu.findall(row)
    name = pn.findall(row)
    print("url: ", text)
    print("name: ", name)
After testing this code on a large amount of Twitter data, I found that both my URL pattern and my name pattern are wrong (although they work on some tweets). Do you know of any documentation or links about extracting names and URLs from tweets when there is a lot of data?
If you have any advice about extracting names and URLs from Twitter data, please tell me. Thanks!
Note that your pn = re.compile(r'#(\S+)') regex will capture any 1+ non-whitespace characters after #.
To exclude matching :, you need to convert the shorthand \S class to [^\s] negated character class equivalent, and add : to it:
pn = re.compile(r'#([^\s:]+)')
Now, it will stop capturing non-whitespace symbols before the first :. See the regex demo.
If you need to capture until the last :, you can just add : after the capturing group: pn = re.compile(r'#(\S+):').
As for a URL matching regex, there are many on the Web, just choose the one that works best for you.
Here is some example code:
import re

p = re.compile(r'#([^\s:]+)')
test_str = "#galaxy5univ I like you\nRT #BestOfGalaxies: Let's sit under the stars ...\n#jonghyun__bot .........((thanks)\nRT #yosizo: thanks.ddddd <https://yahoo.com>\nRT #LDH_3_yui: #fam, ccccc https://msn.news.com"
print(p.findall(test_str))
# => ['galaxy5univ', 'BestOfGalaxies', 'jonghyun__bot', 'yosizo', 'LDH_3_yui', 'fam,']

p2 = re.compile(r'(?:http|ftp|https)://(?:[\w_-]+(?:(?:\.[\w_-]+)+))(?:[\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?')
print(p2.findall(test_str))
# => ['https://yahoo.com', 'https://msn.news.com']
If the usernames don't contain special characters, you can use:
#([\w]+)
See Live demo
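Putting the two ideas together, a small sketch that processes the tweets one at a time might look like this. The function name `extract` and the URL pattern `https?://[^\s<>]+` are assumptions for illustration: the URL pattern is intentionally simple (it stops at whitespace and angle brackets) and is not meant to cover every valid URL.

```python
import re

# User names: word characters only, so trailing ':' or ',' is never captured.
NAME_RE = re.compile(r'#(\w+)')
# Simplified URL pattern (an assumption): stop at whitespace, '<' or '>'.
URL_RE = re.compile(r'https?://[^\s<>]+')

def extract(tweet):
    """Return (names, urls) found in a single tweet."""
    return NAME_RE.findall(tweet), URL_RE.findall(tweet)

tweets = ['#galaxy5univ I like you',
          'RT #yosizo: thanks.ddddd <https://yahoo.com>']
for tweet in tweets:
    names, urls = extract(tweet)
    print("name:", names, "url:", urls)
```

Keeping the work per tweet, rather than joining everything into one string, makes it easier to see which tweet a name or URL came from when checking results on a large data set.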
I am working on a research project and I have a list of about 200 names and 6 email addresses. The requirement is to map every one of those names to a single email address following this rule:
"Names starting with A, B, C, D, E will map to email1. F, G, H, I, J will map to email2" and so on and so forth.
Now I'm trying to think of a way to map those names to the specific email in a fashion of "if name starts with A-E then email1", rather than iterating through all the names and checking the starting letter of each name. Is there a way to accomplish this? I'm thinking regex might help, but I'm not sure exactly how (possibly something along the lines of ^[a-eA-E]?)
The re module has an undocumented Scanner class which can be used to attach an arbitrary function call to regex patterns. When the Scanner.scan method is called, the supplied text is matched against each regex pattern, and the associated function is called when a match is found. The scan method ends when the remaining text matches none of the patterns.
import re

def make_email(i):
    def email(scanner, token):
        print('{t}: Send to email{i}'.format(t=token, i=i))
    return email

scanner = re.Scanner(
    [(pat, make_email(i))                                      # 2
     for i, pat in enumerate((r"^[a-e]\w+", r"^[f-j]\w+"))]    # 1
    + [(r"\s+", None)],
    flags=re.IGNORECASE|re.MULTILINE)

scanner.scan("""\
Albert
Barry
Carrie
David
Erin
Franklin
Geoff
Harold
Isadore
Jay""")
prints
Albert: Send to email0
Barry: Send to email0
Carrie: Send to email0
David: Send to email0
Erin: Send to email0
Franklin: Send to email1
Geoff: Send to email1
Harold: Send to email1
Isadore: Send to email1
Jay: Send to email1
1. You can add more regex patterns here.
2. The Scanner class is initialized with a list of 2-tuples. Each 2-tuple consists of a regex pattern and the associated callback function.
The simple and straightforward solution is to create a simple dictionary with regexes as keys, and loop over those.
import re

mappings = { r'^[a-e]': "email0", r'^[f-j]': "email1" }
for name in names:
    for regex in mappings:
        if re.match(regex, name, flags=re.IGNORECASE):
            print("%s: send to %s" % (name, mappings[regex]))
            break
    else:
        # The inner loop finished without a break: no pattern matched
        print("%s: no match" % name)
If you do this on an industrial scale, you would probably want to precompile the regexes with re.compile() but for a quick and dirty solution, this gets the job done.
You only need to know the first letter in each name, and map it to an email address. You don't need a regex for that.
def address(name):
    addresses = ['foo#bar.com', 'spam#eggs.org', ... ]
    # Position in the alphabet, integer-divided into buckets of five letters
    i = 'abcdefghijklmnopqrstuvwxyz'.find(name[0].lower()) // 5
    return addresses[i]
Then you want to iterate over the names.
for name in names: print(name, address(name))
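For completeness, here is a self-contained sketch of the same bucket arithmetic. The six values in `ADDRESSES` are placeholders standing in for the real addresses, and `address_for` is a hypothetical name. One detail worth handling explicitly: `str.find` returns -1 for a character that is not in the alphabet, and -1 // 5 is -1, which would silently pick the last address, so this version raises an error instead.

```python
import string

# Placeholder labels (assumptions), one per bucket of five letters;
# the last bucket holds the single remaining letter, Z.
ADDRESSES = ['email1', 'email2', 'email3', 'email4', 'email5', 'email6']

def address_for(name, addresses=ADDRESSES, bucket_size=5):
    """Pick an address from the first letter of `name` (A-E -> 0, F-J -> 1, ...)."""
    first = name.strip()[:1].lower()
    if not first or first not in string.ascii_lowercase:
        raise ValueError("name must start with a letter A-Z: %r" % name)
    position = string.ascii_lowercase.index(first)   # a=0 ... z=25
    return addresses[position // bucket_size]

for name in ['Albert', 'Erin', 'Franklin', 'Zed']:
    print(name, address_for(name))
```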
I have the plain text of a Cc header field that looks like so:
friend#email.com, John Smith <john.smith#email.com>,"Smith, Jane" <jane.smith#uconn.edu>
Are there any battle tested modules for parsing this properly?
(bonus if it's in python! the email module just returns the raw text without any methods for splitting it, AFAIK)
(also bonus if it splits name and address into two fields)
There are a bunch of functions available as a standard Python module, but I think you're looking for
email.utils.parseaddr() or email.utils.getaddresses()
>>> import email.utils
>>> addresses = 'friend#email.com, John Smith <john.smith#email.com>,"Smith, Jane" <jane.smith#uconn.edu>'
>>> email.utils.getaddresses([addresses])
[('', 'friend#email.com'), ('John Smith', 'john.smith#email.com'), ('Smith, Jane', 'jane.smith#uconn.edu')]
I haven't used it myself, but it looks to me like you could use the csv package quite easily to parse the data.
The code below is completely unnecessary. I wrote it before realising that you could pass getaddresses() a list containing a single string containing multiple addresses.
I haven't had a chance to look at the specifications for addresses in email headers, but based on the string you provided, this code should do the job splitting it into a list, making sure to ignore commas if they are within quotes (and therefore part of a name).
from email.utils import getaddresses

addrstring = ',friend#email.com, John Smith <john.smith#email.com>,"Smith, Jane" <jane.smith#uconn.edu>,'

def addrparser(addrstring):
    addrlist = ['']
    quoted = False

    # ignore comma at beginning or end
    addrstring = addrstring.strip(',')

    for char in addrstring:
        if char == '"':
            # toggle quoted mode
            quoted = not quoted
            addrlist[-1] += char
        # a comma outside of quotes means a new address
        elif char == ',' and not quoted:
            addrlist.append('')
        # anything else is the next letter of the current address
        else:
            addrlist[-1] += char

    return getaddresses(addrlist)

print(addrparser(addrstring))
Gives:
[('', 'friend#email.com'), ('John Smith', 'john.smith#email.com'),
('Smith, Jane', 'jane.smith#uconn.edu')]
I'd be interested to see how other people would go about this problem!
Convert a string containing multiple named email addresses into a dictionary.
import email.utils
emailstring = 'Friends <friend#email.com>, John Smith <john.smith#email.com>,"Smith" <jane.smith#uconn.edu>'
Split the string by comma:
email_list = emailstring.split(',')
Build a dictionary with the name as key and the email as value:
email_dict = dict(map(lambda x: email.utils.parseaddr(x), email_list))
The result looks like this:
{'John Smith': 'john.smith#email.com', 'Friends': 'friend#email.com', 'Smith': 'jane.smith#uconn.edu'}
Note:
If the same name appears with different email addresses, only the last one is kept, because dictionary keys are unique. For example, in
'Friends <friend#email.com>, John Smith <john.smith#email.com>,"Smith" <jane.smith#uconn.edu>, Friends <friend_co#email.com>'
"Friends" appears twice. Splitting on commas also breaks quoted names that themselves contain a comma, such as "Smith, Jane".
I am practicing sending emails with Google App Engine with Python. This code checks to see if message.sender is in the database:
class ReceiveEmail(InboundMailHandler):
    def receive(self, message):
        querySender = User.all()
        querySender.filter("userEmail =", message.sender)
        senderInDatabase = None
        for match in querySender:
            senderInDatabase = match.userEmail
This works in the development server because I send the email as "az#example.com" and message.sender="az#example.com"
But I realized that in the production server emails come formatted as "az <az#example.com>" and my code fails because now message.sender="az <az#example.com>" but the email in the database is simply "az#example.com".
I searched for how to do this with regex and it is possible but I was wondering if I can do this with Python lists? Or, what do you think is the best way to achieve this result? I need to take just the email address from the message.sender.
App Engine documentation acknowledges the formatting but I could not find a specific way to select the email address only.
Thanks!
EDIT2 (re: Forest answer)
#Forest:
parseaddr() appears to be simple enough:
>>> from email.utils import parseaddr
>>> e = "az <az#example.com>"
>>> parsed = parseaddr(e)
>>> parsed
('az', 'az#example.com')
>>> parsed[1]
'az#example.com'
>>>
But this still does not cover the other type of formatting that you mention: user#example.com (Full Name)
>>> e2 = "<az#example.com> az"
>>> parsed2 = parseaddr(e2)
>>> parsed2
('', 'az#example.com')
>>>
Is there really a formatting where full name comes after the email?
EDIT (re: Adam Bernier answer)
My try about how the regex works (probably not correct):
r # raw string
< # first limit character
( # what is inside () is matched
[ # indicates a set of characters
^ # start of string
> # start with this and go backward?
] # end set of characters
+ # repeat the match
) # end group
> # end limit character
Rather than storing the entire contents of a To: or From: header field as an opaque string, why don't you parse incoming email and store email address separately from full name? See email.utils.parseaddr(). This way you don't have to use complicated, slow pattern matching when you want to look up an address. You can always reassemble the fields using formataddr().
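As a rough illustration of that suggestion, the sender value can be reduced to the bare address before it is compared with the database. This is only a sketch: `sender_address` is a hypothetical helper, and how it plugs into the handler (for example, passing its result to the `filter("userEmail =", ...)` call from the question) is an assumption.

```python
from email.utils import parseaddr

def sender_address(sender):
    """Return only the address part of a sender value.

    Works for both "az <az#example.com>" and a bare "az#example.com".
    """
    _display_name, address = parseaddr(sender)
    return address

print(sender_address("az <az#example.com>"))
print(sender_address("az#example.com"))
```

Because both the development-server form and the production form come out the same, the lookup no longer depends on which formatting the incoming message uses.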
If you want to use regex try something like this:
>>> import re
>>> email_string = "az <az#example.com>"
>>> re.findall(r'<([^>]+)>', email_string)
['az#example.com']
Note that the above regex handles multiple addresses...
>>> email_string2 = "az <az#example.com>, bz <bz#example.com>"
>>> re.findall(r'<([^>]+)>', email_string2)
['az#example.com', 'bz#example.com']
but this simpler regex doesn't:
>>> re.findall(r'<(.*)>', email_string2)
['az#example.com>, bz <bz#example.com'] # matches too much
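To make the pieces of the first pattern easier to follow, here is the same regex written with `re.VERBOSE` so each part can carry a comment. The name `ANGLE_ADDR` is made up for this sketch. Note that inside square brackets a leading `^` means "not" rather than "start of string", so `[^>]` matches any single character except `>`.

```python
import re

ANGLE_ADDR = re.compile(r"""
    <          # a literal '<' opens the address
    (          # start of the capture group (this is what findall returns)
      [^>]+    # one or more characters, none of which is '>'
    )          # end of the capture group
    >          # a literal '>' closes the address
""", re.VERBOSE)

email_string2 = "az <az#example.com>, bz <bz#example.com>"
print(ANGLE_ADDR.findall(email_string2))
```

Because `[^>]+` can never run past a closing bracket, each match stops at its own `>`, which is why this version copes with several addresses while `<(.*)>` does not.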
Using slices—which I think you meant to say instead of "lists"—seems more convoluted, e.g.:
>>> email_string[email_string.find('<')+1:-1]
'az#example.com'
and if multiple:
>>> email_strings = email_string2.split(',')
>>> for s in email_strings:
...     s[s.find('<')+1:-1]
...
'az#example.com'
'bz#example.com'