I need to extract emails from random text strings. For example:
s = 'Application for training - customer#gmail.com Some notes'
I found out how to find the end of the email:
email_end = s.find('.com')+4
But how can I find its start index? Maybe we could reverse the string and find the first ' ' after the #, but how can we do that?
Here is a somewhat roundabout approach that avoids regular expressions: reverse the string.
s = 'Application for training - customer#gmail.com Some notes'
s_rev = s[::-1]
# Now you are looking for "moc." and this is the starting point:
s_rev.find("moc.")
-> 11
# Then you can search for the next "space" after this index:
s_rev.find(" ", 11)
-> 29
# Then you can find the email from the reversed string:
s_rev[11:29]
-> 'moc.liamg#remotsuc'
# Finally reverse it back:
s_rev[11:29][::-1]
-> 'customer#gmail.com'
As a one-liner:
s[::-1][s[::-1].find("moc."):s[::-1].find(" ", s[::-1].find("moc."))][::-1]
Note that the second find looks for the space that follows the email address in the reversed string, which is the space that precedes it in the original. You might ask what happens if the string ends with the email. That's fine: the first find then simply returns 0. If the string starts with the email, however, the second find returns -1, and slicing up to -1 drops the first character of the address, so that case needs separate handling. The approach also breaks when the character in front of the address is something other than a space (e.g., a comma).
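If reversing the string feels awkward, a similar result can probably be had with str.rfind, which searches backwards from a given position. This is only a sketch: the helper name extract_email and its suffix parameter are made up for illustration, and it assumes the address is preceded by a space (or starts the string) and ends in the given suffix.

```python
def extract_email(s, suffix=".com"):
    """Return the first '#'-style address ending in `suffix`, or None."""
    at = s.find("#")
    if at == -1:
        return None
    end = s.find(suffix, at)
    if end == -1:
        return None
    end += len(suffix)
    # rfind returns -1 when there is no space before '#', so start becomes 0
    start = s.rfind(" ", 0, at) + 1
    return s[start:end]


s = 'Application for training - customer#gmail.com Some notes'
print(extract_email(s))                           # customer#gmail.com
print(extract_email('customer#gmail.com notes'))  # customer#gmail.com
```

Because rfind returns -1 when nothing is found, adding 1 gives a start index of 0, so an address at the very beginning of the string is handled without a special case.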
I would use the re library as follows:
import re
p = r"\w+#\w+\.\w{3}"
email = re.findall(p, s)
See Regular expression operations for an explanation of the syntax for p.
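A quick check of why the dot in such a pattern is worth escaping. This is a small sketch; the string 'user#hostXcom' is an invented test input, not something from the question.

```python
import re

s = 'Application for training - customer#gmail.com Some notes'

# Escaped dot: only a literal period may precede the three-character suffix.
print(re.findall(r"\w+#\w+\.\w{3}", s))  # ['customer#gmail.com']

# Unescaped dot: any character is accepted in that position.
loose = re.findall(r"\w+#\w+.\w{3}", "user#hostXcom")
strict = re.findall(r"\w+#\w+\.\w{3}", "user#hostXcom")
print(loose)   # ['user#hostXcom']
print(strict)  # []
```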
My python script currently pulls an email address as a list, but I need to get the text portion only. In this example, it should have been golfshop#3lakesgolf.com. I have tried using the text attribute (gc_email.text), but that didn't work.
gc_email=web.select('a[href^=mailto]')
print(gc_email)
output:
[<a href="mailto:golfshop#3lakesgolf.com">golfshop#3lakesgolf.com</a>]
Help! How can I extract just the mailto address?
You can use a regex capture to pull this string
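import re
# the element from the printed list, as a string
html = '<a href="mailto:golfshop#3lakesgolf.com">golfshop#3lakesgolf.com</a>'
regex = r'<a href="mailto:(.*?)".*'
m = re.match(regex, html)
# group(1) holds whatever follows "mailto:" up to the closing quote
match = m.group(1) if m else None
if match is not None:
    print(match)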
Output
golfshop#3lakesgolf.com
Assuming every line follows the format you provided, you could call the .split() method on a series of characters and then select the appropriate items from the returned lists.
line = '[<a href="mailto:golfshop#3lakesgolf.com">golfshop#3lakesgolf.com</a>]'
sections1 = line.split(':')
email = sections1[1].split('.com')[0]+'.com'
Output
golfshop#3lakesgolf.com
If the formatting varies and is not like this every single time, then I'd go with regular expressions.
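If the input is real HTML rather than a single predictable line, reading the href attribute is likely more robust than string matching. The sketch below uses only the standard library's html.parser; the class name MailtoCollector is made up for illustration, and the sample markup is an assumption about what the page contains.

```python
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Collect the address part of every <a href="mailto:..."> link."""

    def __init__(self):
        super().__init__()
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().startswith("mailto:"):
                # drop the scheme and any "?subject=..." query string
                self.addresses.append(value[len("mailto:"):].split("?", 1)[0])


collector = MailtoCollector()
collector.feed('<a href="mailto:golfshop#3lakesgolf.com">golfshop#3lakesgolf.com</a>')
print(collector.addresses)  # ['golfshop#3lakesgolf.com']
```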
I'm trying to determine whether a string is a valid email or not. The requirements are that it ends with #emuail.com, that the first letter is a capital, and that it is alphanumeric apart from the # and the period. What I'm looking for is a way to check that the email is alphanumeric, except for the period and the #.
The code should return True for an email if and only if the first letter is a capital, it contains #emuail.com, and everything other than the # and the period in the #emuail.com portion is alphanumeric.
I was thinking I could split the email at the #emuail.com part and call .isalnum() on everything before it, but I wanted to see if there was an easier way.
Here is my current code, which of course returns all False, because of the # and the period:
emails = ['Hello#emuail.com', 'Hello2#emuail.comaas', 'hello--1#emuail.com']
result = []
for idx, email in enumerate(emails):
    if '#emuail.com' in email and email[0].isupper() and email.isalnum():
        result.append(True)
    else:
        result.append(False)
print(result)
When doing string searching/testing that gets even modestly complicated, it's usually better (more readable and more flexible) to use regular expressions.
import re
# from https://emailregex.com/
email_pattern = re.compile(r"(^[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)")
emails = ['Hello#emuail.com', 'Hello2#emuail.comaas', 'hello--1#emuail.com']
for email in emails:
    if email_pattern.match(email):
        print(email)
Note that hyphens are allowed in email addresses, but if you want to disallow them for some reason, delete them from the regular expression.
This list comprehension returns the valid emails. If you want more rules, add them to the condition. A regular expression is the better tool, however.
emails = ['Hello#emuail.com', 'Hello2#emuail.comaas', 'hello--1#emuail.com']
[i for i in emails if '#' in i and i[-4:] == '.com' and i.split('#')[0].isalnum() and i[-5] != '#']
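For the exact rules in the question (capital first letter, alphanumeric before the #, then #emuail.com and nothing after it), one possible sketch uses re.fullmatch so that trailing text is rejected. The pattern is my reading of the requirements, not a general-purpose email check.

```python
import re

# Assumed rules: one capital letter, then letters/digits only,
# then exactly "#emuail.com" with nothing after it.
PATTERN = re.compile(r"[A-Z][A-Za-z0-9]*#emuail\.com")

emails = ['Hello#emuail.com', 'Hello2#emuail.comaas', 'hello--1#emuail.com']
result = [PATTERN.fullmatch(e) is not None for e in emails]
print(result)  # [True, False, False]
```

fullmatch requires the whole string to match, which is what rules out 'Hello2#emuail.comaas' without needing ^ and $ anchors.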
I am trying to record the word after a specific word. For example, let's say I have a string:
First Name: John
Last Name: Doe
Email: John.Doe#email.com
I want to search the string for a key word such as "First Name:". Then I want to only capture the next word after it, in this case John.
I started using string.find("First Name:"), but I do not think that is the correct approach.
I could use some help with this. Most examples either split the string or keep everything else after "John". My goal is to be able to search strings for specific keywords no matter their location.
SOLUTION:
I used a similar set of code as below:
search = r"(First Name:)(.)(.+)"
x = re.compile(search)
This gave me the "John" with no spaces
A regular expression is the way to go:
import re
pattern = r"(?<=First Name: )\w+"
first_names = re.findall(pattern, my_string)
The lookbehind (?<=First Name: ) finds the prefix without including it in the match,
and \w+ then extracts the word that follows. Alternatively, you can split the string and iterate over the resulting list, skipping the piece that precedes the first keyword:
my_words = [x.split()[0] for x in my_string.split("First Name: ")[1:]]
The .find approach is a good start.
You can use split on the remaining string to limit results to the single word.
Without using regex
s = "abc def opx"
q = 'abc'
res = s[s.find(q)+len(q):].split()[0]
res == 'def'
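Applied to the record from the question, the same find-then-split idea could be wrapped up like this. It is a sketch; the function name word_after is hypothetical, and it returns None when the keyword is missing or nothing follows it.

```python
def word_after(text, keyword):
    """Return the first whitespace-delimited word after `keyword`, or None."""
    idx = text.find(keyword)
    if idx == -1:
        return None
    rest = text[idx + len(keyword):].split()
    return rest[0] if rest else None


record = "First Name: John\nLast Name: Doe\nEmail: John.Doe#email.com"
print(word_after(record, "First Name:"))  # John
print(word_after(record, "Email:"))       # John.Doe#email.com
```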
I have a very large .txt file with hundreds of thousands of email addresses scattered throughout. They all take the format:
...<name#domain.com>...
What is the best way to have Python cycle through the entire .txt file looking for all instances of a certain #domain string, grab the entire address within the <...>, and add it to a list? The trouble I have is with the variable length of the addresses.
This code extracts the email addresses in a string. Use it while reading line by line
>>> import re
>>> line = "should we use regex more often? let me know at jdsk#bob.com.lol"
>>> match = re.search(r'[\w.+-]+#[\w-]+\.[\w.-]+', line)
>>> match.group(0)
'jdsk#bob.com.lol'
If you have several email addresses use findall:
>>> line = "should we use regex more often? let me know at jdsk#bob.com.lol or popop#coco.com"
>>> match = re.findall(r'[\w.+-]+#[\w-]+\.[\w.-]+', line)
>>> match
['jdsk#bob.com.lol', 'popop#coco.com']
The regex above probably finds the most common legitimate email addresses. If you want to be fully aligned with RFC 5322, you should check which email addresses the specification allows, so that you avoid bugs in finding addresses correctly.
Edit: as suggested in a comment by #kostek:
In the string Contact us at support#example.com. my regex returns support#example.com. (with the dot at the end). To avoid this, use [\w\.,]+#[\w\.,]+\.\w+
Edit II: another wonderful improvement was mentioned in the comments: [\w\.-]+#[\w\.-]+\.\w+ which will capture example#do-main.com as well.
Edit III: Added further improvements as discussed in the comments: "In addition to allowing + in the beginning of the address, this also ensures that there is at least one period in the domain. It allows multiple segments of domain like abc.co.uk as well, and does NOT match bad#ss :). Finally, you don't actually need to escape periods within a character class, so it doesn't do that."
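To see the trailing-dot difference described in the edits, here is a short comparison of the original pattern with the one from Edit II. It is a sketch; the sample sentences are taken from or modelled on the ones mentioned above.

```python
import re

line = "Contact us at support#example.com."

original = re.findall(r'[\w.+-]+#[\w-]+\.[\w.-]+', line)
revised = re.findall(r'[\w.-]+#[\w.-]+\.\w+', line)

print(original)  # ['support#example.com.']
print(revised)   # ['support#example.com']

# The revised pattern also accepts a hyphen in the domain
# and needs at least one period after the '#'.
print(re.findall(r'[\w.-]+#[\w.-]+\.\w+', "example#do-main.com bad#ss"))
# ['example#do-main.com']
```

The revised pattern ends in \.\w+, so the match has to finish on a word character, which is what keeps the sentence-ending period out.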
Update 2023
Seems stackabuse has compiled a post based on the popular SO answer mentioned above.
import re
regex = re.compile(r"([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\"([]!#-[^-~ \t]|(\\[\t -~]))+\")#([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\[[\t -Z^-~]*])")
def isValid(email):
    if re.fullmatch(regex, email):
        print("Valid email")
    else:
        print("Invalid email")
isValid("name.surname#gmail.com")
isValid("anonymous123#yahoo.co.uk")
isValid("anonymous123#...uk")
isValid("...#domain.us")
You can also use the following to find all the email addresses in a text and print them in an array or each email on a separate line.
import re
line = "why people don't know what regex are? let me know asdfal2#als.com, Users1#gmail.de " \
"Dariush#dasd-asasdsa.com.lo,Dariush.lastName#someDomain.com"
match = re.findall(r'[\w\.-]+#[\w\.-]+', line)
for i in match:
    print(i)
match is already a list, so if you want the addresses as a list you can print it directly:
# this will print the list
print(match)
import re
rgx = r'(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(#|[ ]?\(?[ ]?(at|AT)[ ]?\)?[ ]?)(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])'
matches = re.findall(rgx, text)
get_first_group = lambda y: list(map(lambda x: x[0], y))
emails = get_first_group(matches)
Forgive me lord for having a go at this infamous regex. The regex works for a decent portion of email addresses shown below. I mostly used this as my basis for the valid chars in an email address.
Feel free to play around with it here
I also made a variation where the regex captures emails like name at example.com
(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(#|[ ]\(?[ ]?(at|AT)[ ]?\)?[ ])(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])
If you're looking for a specific domain:
>>> import re
>>> text = "this is an email la#test.com, it will be matched, x#y.com will not, and test#test.com will"
>>> match = re.findall(r'[\w.+%-]+#test\.com', text) # replace test\.com with the domain you're looking for, adding a backslash before periods
>>> match
['la#test.com', 'test#test.com']
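For the original question (addresses wrapped in <...> in a very large file), the domain-specific pattern can be combined with line-by-line reading so the file never has to be loaded at once. A sketch under assumptions: the function name addresses_for_domain is invented, and the set of characters allowed before the # is a guess at what the file contains.

```python
import re


def addresses_for_domain(lines, domain):
    """Collect <...>-wrapped addresses whose domain is exactly `domain`."""
    # re.escape takes care of the periods in the domain
    pattern = re.compile(r"<([\w.+%-]+#" + re.escape(domain) + r")>")
    found = []
    for line in lines:
        found.extend(pattern.findall(line))
    return found


sample = ["foo <la#test.com> bar <x#y.com>", "baz <test#test.com> qux"]
print(addresses_for_domain(sample, "test.com"))  # ['la#test.com', 'test#test.com']
```

A file object iterates line by line, so the same function can be called as addresses_for_domain(f, "test.com") inside a with open(...) as f: block.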
import re
reg_pat = r'\S+#\S+\.\S+'
test_text = 'xyz.byc#cfg-jj.com ir_er#cu.co.kl uiufubvcbuw bvkw ko#com m#urice'
emails = re.findall(reg_pat ,test_text,re.IGNORECASE)
print(emails)
Output:
['xyz.byc#cfg-jj.com', 'ir_er#cu.co.kl']
import re
mess = '''Jawadahmed#gmail.com Ahmed#gmail.com
abc#gmail'''
email = re.compile(r'([\w.-]+#gmail\.com)')
result = email.findall(mess)
if result:
    print(result)
The code above returns only the Gmail addresses found in the text.
You can put \b at the end of the pattern to mark where the email ends, so that trailing punctuation is not captured.
The regex
[\w\.\-]+#[\w\-\.]+\b
Example: if the mail ID consists only of lowercase letters a-z and digits 0-9, with at most one _ or period in between, the regex looks like this:
>>> import re
>>> str1 = "abcdef_12345#gmail.com"
>>> regex1 = r"^[a-z0-9]+[\._]?[a-z0-9]+[#]\w+[.]\w{2,3}$"
>>> re_com = re.compile(regex1)
>>> re_match = re_com.search(str1)
>>> re_match
<_sre.SRE_Match object at 0x1063c9ac0>
>>> re_match.group(0)
'abcdef_12345#gmail.com'
content = ' abcdabcd jcopelan#nyx.cs.du.edu afgh 65882#mimsy.umd.edu qwertyuiop mangoe#cs.umd'
match_objects = re.findall(r'\w+#\w+[\.\w+]+', content)
# \b[\w|\.]+ ---> means begins with any english and number character or dot.
import re
# punctuation to strip from each piece; '#' is left out because it separates user and domain here
marks = '''
!()[]{};?$%:'"\,/^&é*
'''
text = 'Hello from priyankv#gmail.com to python#gmail.com, datascience##gmail.com and machinelearning##yahoo..com wrong email address: farzad#google.commmm'
# list of sequences of characters:
text_pieces = text.split()
pattern = r'\b[a-zA-Z][\w.]*#[\w.]+\.[a-zA-Z]{2,3}$'
for p in text_pieces:
    for x in marks:
        p = p.replace(x, "")
    if len(re.findall(pattern, p)) > 0:
        print(re.findall(pattern, p))
One other way is to divide it into 3 different groups and capture the group(0). See below:
emails=[]
for line in email:  # email is the text file where some emails exist.
    e = re.search(r'([.\w\d-]+)(#)([.\w\d-]+)', line)  # 3 different groups are composed.
    if e:
        emails.append(e.group(0))
print(emails)
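To make the three groups concrete, here is what each one holds for a single line. This is a sketch with an invented sample line; group(0) is the whole match, while groups 1 to 3 are the parenthesised parts.

```python
import re

line = "contact: jdoe#example.com today"
e = re.search(r'([.\w-]+)(#)([.\w-]+)', line)
if e:
    print(e.group(0))  # jdoe#example.com  (entire match)
    print(e.group(1))  # jdoe              (part before the '#')
    print(e.group(3))  # example.com       (part after the '#')
```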
Here's another approach for this specific problem, with a regex from emailregex.com:
import re
text = "blabla <hello#world.com>><123#123.at> <huhu#fake> bla bla <myname#some-domain.pt>"
# 1. find all potential email addresses (note: < inside <> is a problem)
matches = re.findall(r'<\S+?>', text)  # ['<hello#world.com>', '<123#123.at>', '<huhu#fake>', '<myname#some-domain.pt>']
# 2. apply email regex pattern to string inside <>
emails = [x[1:-1] for x in matches if re.match(r"(^[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)", x[1:-1])]
print(emails)  # ['hello#world.com', '123#123.at', 'myname#some-domain.pt']
import re
txt = 'hello from absc#gmail.com to par1#yahoo.com about the meeting #2PM'
email = re.findall(r'\S+#\S+', txt)
print(email)
Printed output:
['absc#gmail.com', 'par1#yahoo.com']
import re
with open("file_name", 'r') as f:
    s = f.read()
result = re.findall(r'\S+#\S+', s)
for r in result:
    print(r)