Python extract email address from a HUGE string [duplicate] - python

This question already has answers here:
Extract email sub-strings from large document
(14 answers)
Closed 21 days ago.
I have been using this:
(I know, there are probably more efficient ways...)
Given this in an email message:
Submitted data:
First Name: MyName
Your Email Address: email#domain.com
TAG:
I coded this:
intStart = (bodystring.rfind('First ')) + 12
intEnd = (bodystring.rfind('Your Email'))
receiver_name = bodystring[intStart:intEnd]
intStart = (bodystring.rfind('Your Email Address: ')) + 20
intEnd = (bodystring.rfind('TAG:'))
receiver_email = bodystring[intStart:intEnd]
... and got what I needed. This worked because I had the 'TAG' label.
Now I am given this:
Submitted data:
First name: MyName
Last name:
Email: email#domain.com
I'm having a brain block on getting the email address without a next word. There is whitespace. Can someone nudge me in the right direction? I suspect I can dig out the email address after the occurrence of 'Email:' using regex...

Yes, you can use a regular expression to extract the address.
To find a single address in a text, use
re.search().group()
To find several addresses, use
re.findall()
An example:
import re
text = "First name: MyName Last name: Email: email#domain.com "
email = re.search(r"[a-z0-9\.\-+_]+#[a-z0-9\.\-+_]+\.[a-z]+", text)
print(email.group())
emails = re.findall(r"[a-z0-9\.\-+_]+#[a-z0-9\.\-+_]+\.[a-z]+", text)
print(emails)
This would give the output as
email#domain.com
['email#domain.com']

If the address always follows a label that contains the word Email and ends with a :, you can match the label and capture the address in a group with an email-like pattern.
\bEmail[^:]*:\s*([^\s#]+#[^\s#]+)
\bEmail A word boundary to prevent a partial match, match Email
[^:]*:\s* Match optional chars other than :, then match : and optional whitespace chars
( Capture group 1
[^\s#]+#[^\s#]+ Match a single # between 1 or more non-whitespace chars, excluding the # itself
) Close group 1
Example with re.findall that returns the values of the capture groups:
import re
regex = r"\bEmail[^:]*:\s*([^\s#]+#[^\s#]+)"
s = ("Submitted data:\n"
"First Name: MyName\n"
"Your Email Address: email#domain.com\n"
"TAG:\n\n"
"Submitted data:\n"
"First name: MyName\n"
"Last name:\n"
"Email: email#domain.com")
print(re.findall(regex, s))
Output
['email#domain.com', 'email#domain.com']

Searching for strings is often better done with splitting, and occasionally regular expressions. So first split the lines:
bodylines = bodystring.splitlines()
Split each line on its first : delimiter, skipping lines that have none so that the unpacking below cannot fail (this makes a generator):
chunks = (line.split(':', 1) for line in bodylines if ':' in line)
Now grab the first one that has "email" on the left and # on the right:
address = next(val.strip() for key, val in chunks if 'email' in key.lower() and '#' in val)
If you want all the emails across multiple lines, use a list comprehension instead of next (recreate chunks first if next has already consumed part of the generator):
addresses = [val.strip() for key, val in chunks if 'email' in key.lower() and '#' in val]
This can be done in one line with no imports (if you replace chunks with its definition, not that I recommend it). Regular expressions are a much heavier tool: they let you specify far more general patterns, but they are often slower and harder to read for a simple case like this one. If you can get away with simple and effective tools, do it: don't bring in the sledgehammer until you need it!
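The splitting steps above can be put together as a small function. This is only a sketch: the name find_emails and its label and sep parameters are made up for illustration, and sep defaults to "#" because that is how the sample data on this page writes the address separator.

```python
def find_emails(bodystring, label="email", sep="#"):
    """Return the values of all 'key: value' lines whose key contains
    `label` (case-insensitive) and whose value contains `sep`."""
    found = []
    for line in bodystring.splitlines():
        # partition never raises: colon is '' when the line has no ':'
        key, colon, val = line.partition(":")
        if colon and label in key.lower() and sep in val:
            found.append(val.strip())
    return found


body = "Submitted data:\nFirst name: MyName\nLast name:\n\nEmail: email#domain.com   \n"
print(find_emails(body))  # ['email#domain.com']
```

Using str.partition instead of str.split keeps lines without a colon, or with several colons, from breaking the unpacking.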

Related

Finding first index after symbol

I need to extract emails from random text strings. For example:
s = 'Application for training - customer#gmail.com Some notes'
I found out how to find the end of the email:
email_end = s.find('.com')+4
But how can I find its start index? Maybe we could reverse the string and find the first ' ' after the #, but how can we do that?
Here is a somewhat roundabout approach that avoids regular expressions: reverse the string.
s = 'Application for training - customer#gmail.com Some notes'
s_rev = s[::-1]
# Now you are looking for "moc." and this is the starting point:
s_rev.find("moc.")
-> 11
# Then you can search for the next "space" after this index:
s_rev.find(" ", 11)
-> 29
# Then you can find the email from the reversed string:
s_rev[11:29]
-> 'moc.liamg#remotsuc'
# Finally reverse it back:
s_rev[11:29][::-1]
-> 'customer#gmail.com'
As a one-liner:
s[::-1][s[::-1].find("moc."):s[::-1].find(" ", s[::-1].find("moc."))][::-1]
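Note that the second find looks for the space that follows the address in the reversed string, which is the space that precedes it in the original. If the original string ends with the address, that's fine: "moc." is then found at index 0 and the rest works unchanged. If the string starts with the address, however, find returns -1, and -1 used as a slice end drops the last character of the reversed string (the first character of the address), so that case has to be handled separately. The approach also fails when the address is directly preceded by a character other than a space (e.g., a comma).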
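The same idea can be written without reversing anything. The sketch below swaps the reversal for str.rfind, which searches backwards from the end of the address. The function name email_around and its parameters are hypothetical, and sep defaults to "#" only because the sample strings here use it as the separator.

```python
def email_around(s, suffix=".com", sep="#"):
    """Return the space-delimited token of s that ends with the first
    occurrence of `suffix`, or None if that token is not address-like."""
    end = s.find(suffix)
    if end == -1:
        return None
    end += len(suffix)
    # rfind returns -1 when no space precedes the address; -1 + 1 == 0,
    # so the slice then simply starts at the beginning of the string.
    start = s.rfind(" ", 0, end) + 1
    token = s[start:end]
    return token if sep in token else None


s = 'Application for training - customer#gmail.com Some notes'
print(email_around(s))                                # customer#gmail.com
print(email_around('customer#gmail.com Some notes'))  # customer#gmail.com
```

Because a result of -1 from rfind turns into a start index of 0, a string that begins with the address needs no special case here.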
I would use the re library as follows:
import re
p = r"\w+#\w+\.\w{3}"
email = re.findall(p, s)
See "Regular expression operations" in the Python documentation for an explanation of the syntax for p.

How can I extract the email address string

My python script currently pulls an email address as a list, but I need to get the text portion only. In this example, it should have been golfshop#3lakesgolf.com. I have tried using the text attribute (gc_email.text), but that didn't work.
gc_email=web.select('a[href^=mailto]')
print(gc_email)
output:
[<a href="mailto:golfshop#3lakesgolf.com">golfshop#3lakesgolf.com</a>]
Help! How can I extract just the mailto address?
You can use a regex capture group to pull out this string:
import re
html = '<a href="mailto:golfshop#3lakesgolf.com">golfshop#3lakesgolf.com</a>'
regex = r'<a href="mailto:(.*?)".*'
# re.match returns None when the pattern does not match
m = re.match(regex, html)
match = m.group(1) if m else None
if match is not None:
    print(match)
Output
golfshop#3lakesgolf.com
Assuming every line follows the format you provided, you could use the '.split()' function on a series of characters and then select the appropriate items from the returned lists.
line = '[<a href="mailto:golfshop#3lakesgolf.com">golfshop#3lakesgolf.com</a>]'
sections1 = line.split(':')
email = sections1[1].split('.com')[0]+'.com'
Output
golfshop#3lakesgolf.com
If the formatting varies and is not like this every single time, then I'd go with regular expressions.
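Another option is to read the href attribute instead of pattern-matching the printed markup. If web in the question is a BeautifulSoup object (an assumption), each element of the list returned by select() is a tag whose attribute is available as tag['href']. The sketch below shows the same idea using only the standard library's html.parser; the class name MailtoCollector and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Collect the address part of every <a href="mailto:..."> link."""

    def __init__(self):
        super().__init__()
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().startswith("mailto:"):
            # Drop the scheme and any ?subject=... query part.
            self.addresses.append(href[len("mailto:"):].split("?", 1)[0])


collector = MailtoCollector()
collector.feed('<p><a href="mailto:golfshop#3lakesgolf.com">Email us</a></p>')
print(collector.addresses)  # ['golfshop#3lakesgolf.com']
```

Reading the attribute works even when the link text is something other than the address itself.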

Python how to check if something is alphanumeric except for certain values

I'm trying to determine whether a string is a valid email or not. The requirements are that it has the #emuail.com domain, that the first letter is a capital, and that it is alphanumeric except for the # and the period.
What I would like is for the code to return True for an email if and only if all three conditions hold. In particular, I am looking for a way to check that the email is alphanumeric while ignoring the # and the period in the #emuail.com portion.
I was thinking I could split the email at the #emuail.com part and call .isalnum() on everything before it, but I wanted to see if there was an easier way.
Here is my current code, which of course returns all False, because of the # and the period:
emails = ['Hello#emuail.com', 'Hello2#emuail.comaas', 'hello--1#emuail.com']
result = []
for idx, email in enumerate(emails):
    if '#emuail.com' in email and email[0].isupper() and email.isalnum():
        result.append(True)
    else:
        result.append(False)
print(result)
When doing string searching/testing that gets even modestly complicated, it's usually better (more readable and more flexible) to use regular expressions.
import re
# from https://emailregex.com/
email_pattern = re.compile(r"(^[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)")
emails = ['Hello#emuail.com', 'Hello2#emuail.comaas', 'hello--1#emuail.com']
for email in emails:
    if email_pattern.match(email):
        print(email)
Note that hyphens are allowed in email addresses, but if you want to disallow them for some reason, delete them from the regular expression.
This list comprehension returns the valid emails. If you want more rules, add them to the condition. re is better, however.
emails = ['Hello#emuail.com', 'Hello2#emuail.comaas', 'hello--1#emuail.com']
[i for i in emails if '#' in i and i[-4:] == '.com' and i.split('#')[0].isalnum() and i[-5] != '#']
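Neither snippet above enforces all three rules from the question (capital first letter, alphanumeric local part, the fixed #emuail.com domain). A minimal sketch that does, assuming those are exactly the intended rules, can use re.fullmatch with a pattern written for this one domain. The name PATTERN is made up for illustration.

```python
import re

# Assumed rules, taken from the question: a capital first letter, an
# alphanumeric local part, and the literal domain "#emuail.com"
# (the samples on this page use "#" as the separator).
PATTERN = re.compile(r"[A-Z][A-Za-z0-9]*#emuail\.com")

emails = ['Hello#emuail.com', 'Hello2#emuail.comaas', 'hello--1#emuail.com']
# fullmatch succeeds only if the whole string fits the pattern
result = [PATTERN.fullmatch(e) is not None for e in emails]
print(result)  # [True, False, False]
```

fullmatch is what rejects 'Hello2#emuail.comaas': a plain match or search would accept the valid prefix and ignore the trailing characters.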

Find the next word after a word in a string

I am trying to record the word after a specific word. For example, let's say I have a string:
First Name: John
Last Name: Doe
Email: John.Doe#email.com
I want to search the string for a key word such as "First Name:". Then I want to only capture the next word after it, in this case John.
I started using string.find("First Name:"), but I do not think that is the correct approach.
I could use some help with this. Most examples either split the string or keep everything else after "John". My goal is to be able to search strings for specific keywords no matter their location.
SOLUTION:
I used a similar set of code as below:
search = r"(First Name:)(.)(.+)"
x = re.compile(search)
This gave me the "John" with no spaces
A regular expression is the way to go:
import re
pattern = r"First Name: (\w+)"
first_names = re.findall(pattern, my_string)
The pattern matches the prefix First Name: without returning it, because re.findall returns only the contents of the capture group (\w+), which matches a single word. Alternatively, you can split the string and iterate over the resulting list, skipping the first piece (the text before the first occurrence):
my_words = [x.split()[0] for x in my_string.split("First Name: ")[1:]]
The .find approach is a good start.
You can use split on the remaining string to limit results to the single word.
Without using regex
s = "abc def opx"
q = 'abc'
res = s[s.find(q)+len(q):].split()[0]
res == 'def'
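The find-and-split idea generalizes to any keyword. The helper below is a sketch only: word_after is a hypothetical name, and it uses str.partition rather than find so that a missing keyword is easy to detect.

```python
def word_after(text, key):
    """Return the first whitespace-delimited word following `key`, or None."""
    _, found, rest = text.partition(key)
    if not found:
        return None  # key does not occur in text
    words = rest.split()
    return words[0] if words else None


record = "First Name: John\nLast Name: Doe\nEmail: John.Doe#email.com"
print(word_after(record, "First Name:"))  # John
print(word_after(record, "Email:"))       # John.Doe#email.com
print(word_after(record, "Phone:"))       # None
```

Note that if the value after a keyword is empty, this returns the first word of the following line; split rest on "\n" first if the word must come from the same line.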

Extract email sub-strings from large document

I have a very large .txt file with hundreds of thousands of email addresses scattered throughout. They all take the format:
...<name#domain.com>...
What is the best way to have Python cycle through the entire .txt file looking for all instances of a certain #domain string, and then grab the entirety of the address within the <...>'s, and add it to a list? The trouble I have is with the variable length of different addresses.
This code extracts the email addresses in a string. Use it while reading line by line
>>> import re
>>> line = "should we use regex more often? let me know at jdsk#bob.com.lol"
>>> match = re.search(r'[\w.+-]+#[\w-]+\.[\w.-]+', line)
>>> match.group(0)
'jdsk#bob.com.lol'
If you have several email addresses use findall:
>>> line = "should we use regex more often? let me know at jdsk#bob.com.lol or popop#coco.com"
>>> match = re.findall(r'[\w.+-]+#[\w-]+\.[\w.-]+', line)
>>> match
['jdsk#bob.com.lol', 'popop#coco.com']
The regex above will find most common real-world email addresses. If you need to be fully aligned with RFC 5322, check which address forms the specification allows, so that valid addresses are not missed.
Edit: as suggested in a comment by #kostek:
In the string Contact us at support#example.com. my regex returns support#example.com. (with dot at the end). To avoid this, use [\w\.,]+#[\w\.,]+\.\w+
Edit II: another wonderful improvement was mentioned in the comments: [\w\.-]+#[\w\.-]+\.\w+ which will capture example#do-main.com as well.
Edit III: Added further improvements as discussed in the comments: "In addition to allowing + in the beginning of the address, this also ensures that there is at least one period in the domain. It allows multiple segments of domain like abc.co.uk as well, and does NOT match bad#ss :). Finally, you don't actually need to escape periods within a character class, so it doesn't do that."
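Edit III lists properties without spelling out a pattern here. One pattern with those properties might look like the sketch below. It is an assumption, not necessarily the exact pattern discussed in the comments, and it uses "#" as the separator only because the samples on this page do.

```python
import re

# Allows + in the local part, requires at least one period in the domain,
# accepts multi-segment domains, and does not escape periods inside
# character classes.
EMAIL = re.compile(r"[\w.+-]+#[\w-]+(?:\.[\w-]+)+")

text = "Write to a+b#abc.co.uk or support#example.com. Not bad#ss though."
print(EMAIL.findall(text))  # ['a+b#abc.co.uk', 'support#example.com']
```

Since every period in the domain must be followed by at least one more word character, the sentence-ending dot after support#example.com is left out of the match.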
Update 2023
Seems stackabuse has compiled a post based on the popular SO answer mentioned above.
import re
regex = re.compile(r"([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\"([]!#-[^-~ \t]|(\\[\t -~]))+\")#([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\[[\t -Z^-~]*])")
def isValid(email):
    if re.fullmatch(regex, email):
        print("Valid email")
    else:
        print("Invalid email")
isValid("name.surname#gmail.com")
isValid("anonymous123#yahoo.co.uk")
isValid("anonymous123#...uk")
isValid("...#domain.us")
You can also use the following to find all the email addresses in a text and print them in an array or each email on a separate line.
import re
line = "why people don't know what regex are? let me know asdfal2#als.com, Users1#gmail.de " \
"Dariush#dasd-asasdsa.com.lo,Dariush.lastName#someDomain.com"
match = re.findall(r'[\w\.-]+#[\w\.-]+', line)
for i in match:
    print(i)
match is already a list, so if you want the addresses as a list, just print match:
# this will print the list
print(match)
import re
rgx = r'(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(#|[ ]?\(?[ ]?(at|AT)[ ]?\)?[ ]?)(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])'
matches = re.findall(rgx, text)
get_first_group = lambda y: list(map(lambda x: x[0], y))
emails = get_first_group(matches)
Forgive me lord for having a go at this infamous regex. The regex works for a decent portion of email addresses shown below. I mostly used this as my basis for the valid chars in an email address.
I also made a variation where the regex captures emails like name at example.com
(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(#|[ ]\(?[ ]?(at|AT)[ ]?\)?[ ])(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])
If you're looking for a specific domain:
>>> import re
>>> text = "this is an email la#test.com, it will be matched, x#y.com will not, and test#test.com will"
>>> match = re.findall(r'[\w.+%-]+#test\.com',text) # replace test\.com with the domain you're looking for, adding a backslash before periods
>>> match
['la#test.com', 'test#test.com']
import re
reg_pat = r'\S+#\S+\.\S+'
test_text = 'xyz.byc#cfg-jj.com ir_er#cu.co.kl uiufubvcbuw bvkw ko#com m#urice'
emails = re.findall(reg_pat ,test_text,re.IGNORECASE)
print(emails)
Output:
['xyz.byc#cfg-jj.com', 'ir_er#cu.co.kl']
import re
mess = '''Jawadahmed#gmail.com Ahmed#gmail.com
abc#gmail'''
email = re.compile(r'([\w\.-]+#gmail\.com)')
result = email.findall(mess)
# findall returns an empty list, never None, when nothing matches
if result:
    print(result)
The code above returns only the Gmail addresses found in the text.
You can put \b at the end of the pattern to mark where the email ends.
The regex
[\w\.\-]+#[\w\-\.]+\b
Example: if the mail id consists only of lowercase letters a-z, digits 0-9, and at most one . or _ in the local part, the regex below applies:
>>> import re
>>> str1 = "abcdef_12345#gmail.com"
>>> regex1 = r"^[a-z0-9]+[\._]?[a-z0-9]+[#]\w+[.]\w{2,3}$"
>>> re_com = re.compile(regex1)
>>> re_match = re_com.search(str1)
>>> re_match
<_sre.SRE_Match object at 0x1063c9ac0>
>>> re_match.group(0)
'abcdef_12345#gmail.com'
content = ' abcdabcd jcopelan#nyx.cs.du.edu afgh 65882#mimsy.umd.edu qwertyuiop mangoe#cs.umd'
match_objects = re.findall(r'\w+#\w+[\.\w+]+', content)
# \b[\w|\.]+ ---> means begins with any english and number character or dot.
import re
marks = '''
!()[]{};?#$%:'"\,/^&é*
'''
text = 'Hello from priyankv#gmail.com to python#gmail.com, datascience##gmail.com and machinelearning##yahoo..com wrong email address: farzad#google.commmm'
# list of sequences of characters:
text_pieces = text.split()
pattern = r'\b[a-zA-Z]{1}[\w|\.]*#[\w|\.]+\.[a-zA-Z]{2,3}$'
for p in text_pieces:
    for x in marks:
        p = p.replace(x, "")
    if len(re.findall(pattern, p)) > 0:
        print(re.findall(pattern, p))
Another way is to divide the pattern into three groups and take group(0), which is the whole match. See below:
emails=[]
for line in email: # email is the text file where some emails exist.
    e=re.search(r'([.\w\d-]+)(#)([.\w\d-]+)',line) # 3 different groups are composed.
    if e:
        emails.append(e.group(0))
print(emails)
Here's another approach for this specific problem, with a regex from emailregex.com:
import re
text = "blabla <hello#world.com>><123#123.at> <huhu#fake> bla bla <myname#some-domain.pt>"
# 1. find all potential email addresses (note: < inside <> is a problem)
matches = re.findall(r'<\S+?>', text) # ['<hello#world.com>', '<123#123.at>', '<huhu#fake>', '<myname#some-domain.pt>']
# 2. apply email regex pattern to string inside <>
emails = [ x[1:-1] for x in matches if re.match(r"(^[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)", x[1:-1]) ]
print(emails) # ['hello#world.com', '123#123.at', 'myname#some-domain.pt']
import re
txt = 'hello from absc#gmail.com to par1#yahoo.com about the meeting #2PM'
email = re.findall(r'\S+#\S+', txt)
print(email)
Printed output:
['absc#gmail.com', 'par1#yahoo.com']
import re
with open("file_name",'r') as f:
    s = f.read()
result = re.findall(r'\S+#\S+',s)
for r in result:
    print(r)
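For the very large file described in the question, it may be preferable to read line by line rather than load everything with f.read(), and to match only the requested domain inside <...>. The sketch below assumes the addresses never span a line break. The function name addresses_for_domain and the sample text are made up, and "#" is used as the separator only because the samples here use it.

```python
import io
import re


def addresses_for_domain(lines, domain, sep="#"):
    """Collect every <local-part SEP domain> address from an iterable of lines."""
    # re.escape keeps the dots in the domain literal.
    pattern = re.compile(r"<([^<>\s]+" + re.escape(sep + domain) + r")>")
    found = []
    for line in lines:  # a file object yields one line at a time
        found.extend(pattern.findall(line))
    return found


# Stand-in for open("file_name"); any iterable of lines works.
sample = io.StringIO(
    "junk <alice#domain.com> more junk <bob#other.org>\n"
    "<carol.smith#domain.com>><dave#domain.com.evil>\n"
)
print(addresses_for_domain(sample, "domain.com"))
# ['alice#domain.com', 'carol.smith#domain.com']
```

Passing the open file object itself as lines keeps memory use proportional to the longest line rather than to the whole file.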
