Finding first index after symbol - python

I need to extract emails from random text strings. For example:
s = 'Application for training - customer#gmail.com Some notes'
I found out how to find the end of the email:
email_end = s.find('.com')+4
But how can I find its start index? Maybe we could reverse the string and find the first ' ' after the #, but how can we do that?

Here is a rather non-trivial approach that avoids regular expressions: you can reverse the string.
s = 'Application for training - customer#gmail.com Some notes'
s_rev = s[::-1]
# Now you are looking for "moc." and this is the starting point:
s_rev.find("moc.")
-> 11
# Then you can search for the next "space" after this index:
s_rev.find(" ", 11)
-> 29
# Then you can find the email from the reversed string:
s_rev[11:29]
-> 'moc.liamg#remotsuc'
# Finally reverse it back:
s_rev[11:29][::-1]
-> 'customer#gmail.com'
As a one-liner:
s[::-1][s[::-1].find("moc."):s[::-1].find(" ", s[::-1].find("moc."))][::-1]
Note that the second find looks for the first space after "moc." in the reversed string, which is the space just before the email address in the original string. If the string ends with the email, nothing changes: "moc." is simply found at index 0. If the string starts with the email, however, there is no such space and find returns -1; used as a slice end, -1 means "up to but excluding the last character", so the first character of the address is lost. The approach also breaks when the address is preceded by something other than a space (e.g., a comma).
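A minimal sketch of the reversal approach wrapped in a function, assuming addresses end in ".com" and are delimited by spaces. The name extract_by_reversal and its suffix parameter are hypothetical. It treats a -1 from the second find as "no space found" instead of passing it to the slice:

```python
def extract_by_reversal(s, suffix=".com"):
    """Return the space-delimited token ending in `suffix`, or None."""
    s_rev = s[::-1]
    start = s_rev.find(suffix[::-1])
    if start == -1:
        return None  # the suffix does not occur in the string
    stop = s_rev.find(" ", start)
    if stop == -1:
        stop = len(s_rev)  # the address begins the original string
    return s_rev[start:stop][::-1]


print(extract_by_reversal('Application for training - customer#gmail.com Some notes'))
print(extract_by_reversal('customer#gmail.com Some notes'))
```

Both calls print customer#gmail.com.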

I would use the re library as follows:
import re
p = r"\w+#\w+\.\w{3}"
email = re.findall(p, s)
See the Regular expression operations documentation for an explanation of the syntax used in p.
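Since the question asks for the start index, here is a small sketch of how a match object can supply it directly. It assumes the same sample string and a three-letter top-level domain; the variable names are illustrative only:

```python
import re

s = 'Application for training - customer#gmail.com Some notes'

# "\." matches a literal dot; a bare "." would match any character.
pattern = re.compile(r"\w+#\w+\.\w{3}")

match = pattern.search(s)
if match:
    print(match.group())  # the matched address
    print(match.start())  # index where the address begins
    print(match.end())    # index just past the end of the address
```

With re.search you get the position of the first match; re.finditer yields one match object per address if there are several.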

Related

Regex : replace url inside string

I have
string = 'Server:xxx-zzzzzzzzz.eeeeeeeeeee.frPIPELININGSIZE'
I need a Python regex to identify xxx-zzzzzzzzz.eeeeeeeeeee.fr so that I can remove it from the string.
Expected output:
string : 'Server:PIPELININGSIZE'
The URL is inside a string; I have tried a lot of regex expressions.
Not sure if this helps, because your question was quite vaguely formulated. :)
import re
string = 'Server:xxx-zzzzzzzzz.eeeeeeeeeee.frPIPELININGSIZE'
string_1 = re.search('[a-z.-]+([A-Z]+)', string).group(1)
print(f'string: Server:{string_1}')
Output:
string: Server:PIPELININGSIZE
No regex: a single line that splits on your target word.
string = 'Server:xxx-zzzzzzzzz.eeeeeeeeeee.frPIPELININGSIZE'
last = string.split("fr", 1)[1]
first = string[:string.index(":")]
print(f'{first}:{last}')
Gives:
Server:PIPELININGSIZE
The wording of the question suggests that you wish to find the hostname in the string, but the expected output suggests that you want to remove it. The following regular expression will create a tuple and allow you to do either.
import re
text = "Server:xxx-zzzzzzzzz.eeeeeeeeeee.frPIPELININGSIZE"  # avoid shadowing the built-in str
p = re.compile('^([A-Za-z]+[:])(.*?)([A-Z]+)$')
m = re.search(p, text)
result = m.groups()
# ('Server:', 'xxx-zzzzzzzzz.eeeeeeeeeee.fr', 'PIPELININGSIZE')
Remove the hostname:
print(f'{result[0]}{result[2]}')
# Output: 'Server:PIPELININGSIZE'
Extract the hostname:
print(result[1])
# Output: 'xxx-zzzzzzzzz.eeeeeeeeeee.fr'
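If removal is all that is needed, re.sub can do it in one step. This is only a sketch under a stated assumption: the hostname directly follows the first ":" and contains nothing but lowercase letters, digits, dots and hyphens, as in the sample string.

```python
import re

text = "Server:xxx-zzzzzzzzz.eeeeeeeeeee.frPIPELININGSIZE"

# Assumption: the hostname directly follows the first ":" and contains only
# lowercase letters, digits, dots and hyphens.
# (?<=:) is a lookbehind, so the colon itself is kept.
cleaned = re.sub(r"(?<=:)[a-z0-9.-]+", "", text, count=1)
print(cleaned)
```

This prints Server:PIPELININGSIZE. If the text after the hostname could start with a lowercase letter, the character class would swallow it, and the grouped pattern above would be the safer choice.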

Python extract email address from a HUGE string [duplicate]

I have been using this:
(I know, there are probably more efficient ways...)
Given this in an email message:
Submitted data:
First Name: MyName
Your Email Address: email#domain.com
TAG:
I coded this:
intStart = (bodystring.rfind('First ')) + 12
intEnd = (bodystring.rfind('Your Email'))
receiver_name = bodystring[intStart:intEnd]
intStart = (bodystring.rfind('Your Email Address: ')) + 20
intEnd = (bodystring.rfind('TAG:'))
receiver_email = bodystring[intStart:intEnd]
... and got what I needed. This worked because I had the 'TAG' label.
Now I am given this:
Submitted data:
First name: MyName
Last name:
Email: email#domain.com
I'm having a brain block on getting the email address without a next word. There is whitespace. Can someone nudge me in the right direction? I suspect I can dig out the email address after the occurrence of 'Email:' using regex...
You can, in fact, make use of RegEx to extract e-mails.
To find single e-mails in a text, you can make use of
re.search().group()
In case you want to find multiple emails, you can make use of
re.findall()
An example
import re
text = "First name: MyName Last name: Email: email#domain.com "
email = re.search(r"[a-z0-9\.\-+_]+#[a-z0-9\.\-+_]+\.[a-z]+", text)
print(email.group())
emails = re.findall(r"[a-z0-9\.\-+_]+#[a-z0-9\.\-+_]+\.[a-z]+", text)
print (emails)
This would give the output as
email#domain.com
['email#domain.com']
If the email should come after the word Email followed by a :, you could match that label and capture the address in a group with an email-like pattern.
\bEmail[^:]*:\s*([^\s#]+#[^\s#]+)
\bEmail A word boundary to prevent a partial match, then match Email
[^:]*:\s* Match optional chars other than :, then match : and optional whitespace chars
( Capture group 1
[^\s#]+#[^\s#]+ Match a single # between 1 or more non-whitespace chars, excluding the # itself
) Close group 1
Example with re.findall that returns the values of the capture groups:
import re
regex = r"\bEmail[^:]*:\s*([^\s#]+#[^\s#]+)"
s = ("Submitted data:\n"
"First Name: MyName\n"
"Your Email Address: email#domain.com\n"
"TAG:\n\n"
"Submitted data:\n"
"First name: MyName\n"
"Last name:\n"
"Email: email#domain.com")
print(re.findall(regex, s))
Output
['email#domain.com', 'email#domain.com']
Searching for strings is often better done with splitting, and occasionally regular expressions. So first split the lines:
bodylines = bodystring.splitlines()
Split the resulting lines on the first : delimiter, skipping lines that contain none (make a generator):
chunks = (line.split(':', 1) for line in bodylines if ':' in line)
Now grab the first one that has "email" on the left and # on the right:
address = next(val.strip() for key, val in chunks if 'email' in key.lower() and '#' in val)
If you want all the emails across multiple lines, replace next with a list comprehension (recreate chunks first, because a generator can only be consumed once):
addresses = [val.strip() for key, val in chunks if 'email' in key.lower() and '#' in val]
This can be done in one line with no imports (if you replace chunks with its definition, not that I recommend it). Regular expressions are a much heavier tool that lets you specify much more general patterns, but they are often slower as a result. If you can get away with simple and effective tools, do it: don't bring in the sledgehammer until you need it!
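The splitting steps above can be sketched as one function. The name find_emails is hypothetical, and str.partition is swapped in for split so that lines without a colon need no special handling:

```python
def find_emails(bodystring):
    """Return values of 'label: value' lines whose label mentions email."""
    addresses = []
    for line in bodystring.splitlines():
        # partition always returns three parts, so lines without ":" are safe
        key, sep, val = line.partition(":")
        if sep and "email" in key.lower() and "#" in val:
            addresses.append(val.strip())
    return addresses


body = (
    "Submitted data:\n"
    "First name: MyName\n"
    "Last name:\n"
    "\n"
    "Email: email#domain.com\n"
)
print(find_emails(body))
```

This prints ['email#domain.com'] and works for both message layouts from the question, since it does not rely on a trailing TAG: label.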

How can I extract the email address string

My python script currently pulls an email address as a list, but I need to get the text portion only. In this example, it should have been golfshop#3lakesgolf.com. I have tried using the text attribute (gc_email.text), but that didn't work.
gc_email=web.select('a[href^=mailto]')
print(gc_email)
output:
[golfshop#3lakesgolf.com]
Help! How can I extract just the mailto address?
You can use a regex capture to pull this string
import re
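html = '<a href="mailto:golfshop#3lakesgolf.com">golfshop#3lakesgolf.com</a>'
regex = '<a href="mailto:(.*?)".*'
try:
    match = re.match(regex, html).group(1)
except AttributeError:  # re.match returned None, so there is no .group
    match = None
if match is not None:
    print(match)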
Output
golfshop#3lakesgolf.com
Assuming every line follows the format you provided, you could use the '.split()' function on a series of characters and then select the appropriate items from the returned lists.
line = '[<a href="mailto:golfshop#3lakesgolf.com">golfshop#3lakesgolf.com</a>]'
sections1 = line.split(':')
email = sections1[1].split('.com')[0]+'.com'
Output
golfshop#3lakesgolf.com
If the formatting varies and is not like this every single time, then I'd go with regular expressions.
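As an alternative to string splitting, the href attribute can be read with an HTML parser. The sketch below uses only the standard library's html.parser; the class name MailtoCollector and the sample markup are hypothetical stand-ins for the page in the question.

```python
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Collect addresses from <a href="mailto:..."> tags."""

    def __init__(self):
        super().__init__()
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().startswith("mailto:"):
            # Drop the scheme and any "?subject=..." query part
            self.addresses.append(href[len("mailto:"):].split("?", 1)[0])


# Hypothetical markup standing in for the page in the question
html = '<p><a href="mailto:golfshop#3lakesgolf.com">Email the shop</a></p>'
collector = MailtoCollector()
collector.feed(html)
print(collector.addresses)
```

This prints ['golfshop#3lakesgolf.com']. Reading the attribute rather than the link text also covers links whose visible text is not the address itself.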

How to use "find" in a string search to locate a starting position to the left of the search result

Forgive me for probably posting the dumbest of questions, but Python newbie here. I am on the chapter about regex and it seems easy to extract an email address out of a file. The problem is that I have not yet understood how to accomplish this using "normal" code. I can locate the position of the "#" using .find and then the end of the email address by finding the next space after the "#". But how do I move the search "to the left" of the "#"? There is no lfind...
It is probably the simplest of things, but I have searched so many sites now that I gave up and created an account here. I thought by going negative I could maybe move to the left, but wrong. Would be very grateful if someone could turn the lightbulb on for me. Thanks a bunch!
Example:
data = "From random-text myemail#gmail.com Sat 21:19"
atpos = data.find("#")
end = data.find(" ",atpos)
start = data.find(" ",**???**,**???**)
address = data[start:end]
print(address)
data = "From random-text myemail#gmail.com Sat 21:19"
data_list = data.split(' ')
for word in data_list:
    if '#' in word:
        print(word)
You can later extract the email domain and name by splitting each result on the '#' sign.
You could just split the string by spaces and then check whether each word contains the # character. If it does, you can split the word on # to get the parts to the left and right of it.
data = "From random-text myemail#gmail.com Sat 21:19"
for text in data.split():
    if "#" in text:
        left, right = text.split('#')
        print(f'The email starts with "{left}" and is in the domain "{right}"')
OUTPUT
The email starts with "myemail" and is in the domain "gmail.com"
UPDATE
If you truly do want to do this with index positions and find, then you already know how to find the position of #, and you know how to find the first space after the # by specifying your start index.
The documentation of find specifies that it finds the lowest index. However, we can use rfind to find the highest index, i.e. the last occurrence in the string.
string.find(s, sub[, start[, end]]) Return the lowest index in s where
the substring sub is found such that sub is wholly contained in
s[start:end]. Return -1 on failure. Defaults for start and end and
interpretation of negative values is the same as for slices.
string.rfind(s, sub[, start[, end]]) Like find() but find the highest
index
So using rfind we can find the last space before the #. If we pass 0 as the start (the beginning of the string) and the index of # as the end, rfind returns the index of the last space character between the start of the string and the # symbol. We then add 1 because we don't want the index of the space but the index after it.
data = "From random-text myemail#gmail.com Sat 21:19"
#get the index of the #
at_index = data.find('#')
# get the first index of space starting after #
right_index = data.find(' ', at_index)
# get the last index of space starting from start but not exceeding index of #
left_index = data.rfind(' ', 0, at_index) + 1
print(data[left_index:right_index])
OUTPUT
myemail#gmail.com
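A sketch of the same find/rfind idea as a function that also copes with an address at the very start or end of the string. The name token_around and its marker parameter are hypothetical:

```python
def token_around(data, marker="#"):
    """Return the space-delimited token containing `marker`, or None."""
    at_index = data.find(marker)
    if at_index == -1:
        return None
    # rfind returns -1 when there is no space on the left; adding 1 gives 0,
    # which is the start of the string, so no special case is needed.
    left_index = data.rfind(" ", 0, at_index) + 1
    right_index = data.find(" ", at_index)
    if right_index == -1:
        right_index = len(data)  # the token runs to the end of the string
    return data[left_index:right_index]


print(token_around("From random-text myemail#gmail.com Sat 21:19"))
print(token_around("myemail#gmail.com"))
```

Both calls print myemail#gmail.com. Only the right-hand search needs an explicit check, because a -1 used as a slice end would cut off the last character.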

Extract email sub-strings from large document

I have a very large .txt file with hundreds of thousands of email addresses scattered throughout. They all take the format:
...<name#domain.com>...
What is the best way to have Python to cycle through the entire .txt file looking for a all instances of a certain #domain string, and then grab the entirety of the address within the <...>'s, and add it to a list? The trouble I have is with the variable length of different addresses.
This code extracts the email addresses in a string. Use it while reading line by line
>>> import re
>>> line = "should we use regex more often? let me know at jdsk#bob.com.lol"
>>> match = re.search(r'[\w.+-]+#[\w-]+\.[\w.-]+', line)
>>> match.group(0)
'jdsk#bob.com.lol'
If you have several email addresses use findall:
>>> line = "should we use regex more often? let me know at jdsk#bob.com.lol or popop#coco.com"
>>> match = re.findall(r'[\w.+-]+#[\w-]+\.[\w.-]+', line)
>>> match
['jdsk#bob.com.lol', 'popop#coco.com']
The regex above probably finds the most common non-fake email addresses. If you want to be completely aligned with RFC 5322, you should check which email addresses follow the specification.
Edit: as suggested in a comment by #kostek:
In the string Contact us at support#example.com. my regex returns support#example.com. (with dot at the end). To avoid this, use [\w\.,]+#[\w\.,]+\.\w+
Edit II: another wonderful improvement was mentioned in the comments: [\w\.-]+#[\w\.-]+\.\w+ which will capture example#do-main.com as well.
Edit III: Added further improvements as discussed in the comments: "In addition to allowing + in the beginning of the address, this also ensures that there is at least one period in the domain. It allows multiple segments of domain like abc.co.uk as well, and does NOT match bad#ss :). Finally, you don't actually need to escape periods within a character class, so it doesn't do that."
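To make the trailing-dot issue from the edits concrete, here is a small comparison. The second pattern is only an illustrative variant that ends in \w+, not necessarily the exact regex any of the commenters proposed, and the sample text is made up:

```python
import re

text = "Contact us at support#example.com. Or try sales#do-main.co.uk"

# Domain tail may contain dots, so a sentence-ending period is swallowed.
greedy_tail = re.findall(r"[\w.+-]+#[\w-]+\.[\w.-]+", text)

# Domain must end in word characters, so the trailing period is left out.
word_tail = re.findall(r"[\w.+-]+#[\w.-]+\.\w+", text)

print(greedy_tail)
print(word_tail)
```

The first list contains support#example.com. with the period attached, while the second contains support#example.com; both keep sales#do-main.co.uk intact.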
Update 2023
Seems stackabuse has compiled a post based on the popular SO answer mentioned above.
import re
regex = re.compile(r"([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\"([]!#-[^-~ \t]|(\\[\t -~]))+\")#([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\[[\t -Z^-~]*])")
def isValid(email):
    if re.fullmatch(regex, email):
        print("Valid email")
    else:
        print("Invalid email")
isValid("name.surname#gmail.com")
isValid("anonymous123#yahoo.co.uk")
isValid("anonymous123#...uk")
isValid("...#domain.us")
You can also use the following to find all the email addresses in a text and print them in an array or each email on a separate line.
import re
line = "why people don't know what regex are? let me know asdfal2#als.com, Users1#gmail.de " \
"Dariush#dasd-asasdsa.com.lo,Dariush.lastName#someDomain.com"
match = re.findall(r'[\w\.-]+#[\w\.-]+', line)
for i in match:
    print(i)
match is already a list, so if you want the addresses as a list you can use it (or print it) directly:
# this will print the list
print(match)
import re
rgx = r'(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(#|[ ]?\(?[ ]?(at|AT)[ ]?\)?[ ]?)(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])'
matches = re.findall(rgx, text)
get_first_group = lambda y: list(map(lambda x: x[0], y))
emails = get_first_group(matches)
Forgive me lord for having a go at this infamous regex. The regex works for a decent portion of email addresses shown below. I mostly used this as my basis for the valid chars in an email address.
I also made a variation where the regex captures emails like name at example.com
(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(#|[ ]\(?[ ]?(at|AT)[ ]?\)?[ ])(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])
If you're looking for a specific domain:
>>> import re
>>> text = "this is an email la#test.com, it will be matched, x#y.com will not, and test#test.com will"
>>> match = re.findall(r'[\w.+%-]+#test\.com', text) # replace test\.com with the domain you're looking for, adding a backslash before periods; the hyphen goes last in the class because \w-\. is rejected as a bad character range in Python 3
>>> match
['la#test.com', 'test#test.com']
import re
reg_pat = r'\S+#\S+\.\S+'
test_text = 'xyz.byc#cfg-jj.com ir_er#cu.co.kl uiufubvcbuw bvkw ko#com m#urice'
emails = re.findall(reg_pat ,test_text,re.IGNORECASE)
print(emails)
Output:
['xyz.byc#cfg-jj.com', 'ir_er#cu.co.kl']
import re
mess = '''Jawadahmed#gmail.com Ahmed#gmail.com
abc#gmail'''
email = re.compile(r'([\w\.-]+#gmail\.com)')
result = email.findall(mess)
if result:  # findall returns an empty list, never None, when nothing matches
    print(result)
The code above returns only the Gmail addresses found in the text.
You can use \b at the end to get the correct email to define ending of the email.
The regex
[\w\.\-]+#[\w\-\.]+\b
Example: if the mail ID consists only of lowercase letters a-z and digits 0-9, with at most one . or _ in between, then the regex below can be used:
>>> import re
>>> str1 = "abcdef_12345#gmail.com"
>>> regex1 = r"^[a-z0-9]+[\._]?[a-z0-9]+[#]\w+[.]\w{2,3}$"
>>> re_com = re.compile(regex1)
>>> re_match = re_com.search(str1)
>>> re_match
<_sre.SRE_Match object at 0x1063c9ac0>
>>> re_match.group(0)
'abcdef_12345#gmail.com'
content = ' abcdabcd jcopelan#nyx.cs.du.edu afgh 65882#mimsy.umd.edu qwertyuiop mangoe#cs.umd'
match_objects = re.findall(r'\w+#\w+[\.\w+]+', content)
# \b[\w|\.]+ ---> means begins with any english and number character or dot.
import re
# Punctuation to strip from each piece. '#' is deliberately left out because
# it separates the local part from the domain in this text.
marks = r'''
!()[]{};?$%:'"\,/^&é*
'''
text = 'Hello from priyankv#gmail.com to python#gmail.com, datascience##gmail.com and machinelearning##yahoo..com wrong email address: farzad#google.commmm'
# list of sequences of characters:
text_pieces = text.split()
pattern = r'\b[a-zA-Z]{1}[\w|\.]*#[\w|\.]+\.[a-zA-Z]{2,3}$'
for p in text_pieces:
    for x in marks:
        p = p.replace(x, "")
    if len(re.findall(pattern, p)) > 0:
        print(re.findall(pattern, p))
Another way is to split the pattern into 3 groups and take group(0), which is the whole match. See below:
emails = []
for line in email:  # email is the text file where some emails exist.
    e = re.search(r'([.\w\d-]+)(#)([.\w\d-]+)', line)  # 3 different groups are composed.
    if e:
        emails.append(e.group(0))
print(emails)
Here's another approach for this specific problem, with a regex from emailregex.com:
text = "blabla <hello#world.com>><123#123.at> <huhu#fake> bla bla <myname#some-domain.pt>"
# 1. find all potential email addresses (note: < inside <> is a problem)
matches = re.findall(r'<\S+?>', text) # ['<hello#world.com>', '<123#123.at>', '<huhu#fake>', '<myname#some-domain.pt>']
# 2. apply email regex pattern to string inside <>
emails = [ x[1:-1] for x in matches if re.match(r"(^[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)", x[1:-1]) ]
print(emails) # ['hello#world.com', '123#123.at', 'myname#some-domain.pt']
import re
txt = 'hello from absc#gmail.com to par1#yahoo.com about the meeting #2PM'
email = re.findall(r'\S+#\S+', txt)
print(email)
Printed output:
['absc#gmail.com', 'par1#yahoo.com']
import re
with open("file_name", 'r') as f:
    s = f.read()
result = re.findall(r'\S+#\S+', s)
for r in result:
    print(r)
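For a very large file, reading line by line avoids loading everything into memory at once. The following is a sketch for the <name#domain.com> format from the question; the function name addresses_for_domain and the sample lines are hypothetical.

```python
import re


def addresses_for_domain(lines, domain):
    """Yield addresses of the form <name#domain> found in an iterable of lines."""
    # re.escape keeps the dots in the domain from acting as wildcards
    pattern = re.compile(r"<([^<>\s#]+#" + re.escape(domain) + r")>")
    for line in lines:
        yield from pattern.findall(line)


sample = [
    "noise <alice#domain.com> more noise <bob#other.org>\n",
    "<carol.smith#domain.com>\n",
]
print(list(addresses_for_domain(sample, "domain.com")))
```

This prints ['alice#domain.com', 'carol.smith#domain.com']. Because a file object opened with open() iterates over its lines, it can be passed in place of sample.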
