I tried this code:
import re
contents = 'alokm.014#gmail.yahoo.com.....thankyou'
match = re.findall(r'[\w\.-]+#[\w\.-]+', contents)
print(match)
Result:
['alokm.014#gmail.yahoo.com.....thankyou']
I want to remove .....thankyou from the email address.
Is it possible to obtain only alokm.014#gmail.yahoo.com?
One more thing: the real content is much larger, so if possible I would like a fix that only changes the pattern in
re.findall(r'[\w\.-]+#[\w\.-]+', contents)
I don't know about Python, but languages like Java have libraries that help validate URLs and email addresses. Alternatively, you can use a well-vetted regular expression.
My suggestion would be to keep removing the end of the string, dot by dot, until the string validates. So test the string, and if it doesn't validate as an email, read the string from the right until you encounter a period, then drop the period and everything to the right and start again.
So you'd loop through like this:
alokm.014#gmail.yahoo.com.....thankyou
alokm.014#gmail.yahoo.com....
alokm.014#gmail.yahoo.com...
alokm.014#gmail.yahoo.com..
alokm.014#gmail.yahoo.com.
alokm.014#gmail.yahoo.com
At which point it would validate as a real email address. Yes, it's slow. Yes, it can be tricked. But it will work most of the time based on the little info (possible strings) given.
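The trimming loop described above could be sketched roughly as follows. The validator pattern VALID and the function name trim_to_valid are assumptions made for illustration (a deliberately simple check, not a full address validator), and # is kept as the separator because that is what the sample string uses.

```python
import re

# Assumed validator: local part, '#', dot-separated labels, alphabetic last label.
VALID = re.compile(r'^[\w.-]+#(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}$')

def trim_to_valid(candidate):
    """Drop everything from the last '.' onward until the string validates."""
    while candidate:
        if VALID.match(candidate):
            return candidate
        head, dot, _ = candidate.rpartition('.')
        if not dot:
            # No period left to cut at, so give up.
            return None
        candidate = head
    return None

print(trim_to_valid('alokm.014#gmail.yahoo.com.....thankyou'))
```

As the answer says, this is slow and can be tricked, so it is only a fallback when no better pattern is available.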
Interesting question! Here is a Python regex program that extracts the email from the contents:
import re
contents = 'alokm.014#gmail.yahoo.com.....thankyou'
emailRegex = re.compile(r'''
    [a-zA-Z0-9.]+        # username
    \#                   # literal # symbol (escaped, because a bare # starts a comment under re.VERBOSE)
    [a-zA-Z0-9.]+\.com   # domain
    ''', re.VERBOSE)  # re.VERBOSE allows a multi-line regex with comments for better readability
extractEmail = emailRegex.findall(contents)
print(extractEmail)
Output will be:
['alokm.014#gmail.yahoo.com']
I suggest that you refer to the Regex HOWTO doc to understand what's happening in this program, and to come up with a better version that can extract all the emails from your larger text.
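One possible direction for such a "better version", offered only as a sketch: instead of hard-coding .com, require dot-separated labels that end in an alphabetic label. The pattern, the variable names, and the second address in the sample text are assumptions for illustration.

```python
import re

# Assumed pattern: local part, '#', one or more "label." groups, then a
# final alphabetic label. The group is non-capturing so findall returns
# whole matches rather than the group contents.
email_re = re.compile(r'[\w.-]+#(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}')

# Made-up larger text containing the original sample and a second address.
text = 'Write to alokm.014#gmail.yahoo.com.....thankyou or bob#example.org, please'

emails = email_re.findall(text)
print(emails)
```

Because each label must contain at least one character, the run of dots before "thankyou" cannot be absorbed into the match.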
Related
I am learning regex for validating an email id, for which the regex I made is the following:
regex = r'^([a-zA-Z0-9_\\-\\.]+)#([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,})$'
Here someone#...com is showing as valid but it should not. How can I fix this?
I would recommend the regular expression suggested on this site, which correctly reports someone#...com as invalid. I quickly wrote up an example using their suggestion below. Happy coding!
>>> import re
>>> email = "someone#...com"
>>> regex = re.compile(r"(^[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)")
>>> print(re.match(regex, email))
None
The reason it matches someone#...com is that the dot is part of the character class in #([a-zA-Z0-9_\\-\\.]+), which is repeated 1 or more times and can therefore also match ...
What you can do is place the dot after the character class, and use that whole part in a repeating group.
If you put the - at the end you don't have to escape it.
Note that that character class at the start also has a dot.
^[a-zA-Z0-9_.-]+#(?:[a-zA-Z0-9_-]+\.)+([a-zA-Z]{2,})$
Regex demo
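As a quick check of the pattern above, here is a small sketch that runs it against the problem string and two ordinary-looking addresses. The sample addresses other than someone#...com are made up for illustration.

```python
import re

# The pattern from the answer: the dot sits after the character class,
# inside a repeated non-capturing group, so every label must be non-empty.
pattern = re.compile(r'^[a-zA-Z0-9_.-]+#(?:[a-zA-Z0-9_-]+\.)+([a-zA-Z]{2,})$')

candidates = ['someone#...com', 'someone#example.com', 'someone#mail.example.org']
results = {c: bool(pattern.match(c)) for c in candidates}

for candidate, ok in results.items():
    print(candidate, ok)
```

Only the first candidate should be rejected, since "..." contains no label characters between the dots.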
I was trying to scrape a link out of a .eml file, but my search always returns None. I can't even get the link together with the brackets around it; turning it into a valid link once the string is pulled would be no problem.
One problem I see is that the string the regex should find spans multiple lines, but the regex itself seems to be valid.
Code/regex I use:
def get_url(raw):
    # get rid of whitespace
    raw = raw.replace(' ', '')
    # search for the link
    url = re.search(r'href=3D(.*?)token([^\s]+)\W([^\s]+)\W([^\s]+)\W([^\s]+)\W([^\s]+)', raw).group(1)
    return url
First, the .eml is encoded in MIME quoted-printable (the hint is the = signs at the end of the lines). You should decode this first, instead of dealing with the encoded raw text.
Second, regex is overkill. Some simple str.split() usage will work just as well. Regex is extremely useful in its proper usage scenarios, but plain Python can usually do the same without regex's flavor of magic, which can be confusing as [REDACTED].
Note that if you're building a regex, it's always advisable to use one of the gazillion regex editors, as these will help you build it. My personal favorite is regex101.
EDIT: added regex way to do it.
import quopri
import re

def get_url_by_regex(raw):
    decoded = quopri.decodestring(raw).decode("utf-8")
    return re.search('(<a href=")(.*?)(")', decoded).group(2)

def get_url(raw):
    decoded = quopri.decodestring(raw).decode("utf-8")
    for line in decoded.split('\n'):
        if 'token=' in line:
            return line.split('<a href="')[1].split('"')[0]
    return None  # just in case this is needed

print(get_url(raw_email))
print(get_url_by_regex(raw_email))
result is:
https://app.rule.io/subscriber/optIn?token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzd[REST_OF_TOKEN_REDACTED]
https://app.rule.io/subscriber/optIn?token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzd[REST_OF_TOKEN_REDACTED]
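To see why decoding first matters, here is a self-contained sketch with a made-up quoted-printable fragment. The sample bytes, the example.com URL and the token are all hypothetical. It shows quopri turning =3D back into = and joining the line that was split with a trailing =.

```python
import quopri
import re

# Hypothetical quoted-printable body: '=3D' encodes '=', and a '=' at the
# end of a line is a soft line break that continues on the next line.
raw_sample = (
    b'<p>Please confirm:</p>\n'
    b'<a href=3D"https://example.com/optIn?token=3Dabc123def456ghi789jkl012mno345=\n'
    b'pqr678">Confirm</a>\n'
)

decoded = quopri.decodestring(raw_sample).decode('utf-8')

# After decoding, the link sits on one line and a short pattern is enough.
match = re.search(r'<a href="(.*?)"', decoded)
url = match.group(1) if match else None
print(url)
```

Checking the match object before calling group() avoids an AttributeError when a message contains no link at all.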
I'm trying to use regex in scrapy to find all email addresses on a page.
I'm using this code:
item["email"] = re.findall('[\w\.-]+#[\w\.-]+', response.body)
Which works almost perfectly: it grabs all the emails and gives them to me. However, what I want is for it not to return repeats, even if the same email address appears more than once on the page.
I'm getting responses like this (which is correct):
{'email': ['billy666#stanford.edu',
'cantorfamilies#stanford.edu',
'cantorfamilies#stanford.edu',
'cantorfamilies#stanford.edu',
'footer-stanford-logo#2x.png']}
However I want to only show the unique addresses which would be
{'email': ['billy666#stanford.edu',
'cantorfamilies#stanford.edu',
'footer-stanford-logo#2x.png']}
If you want to throw in how to only collect the email and not that
'footer-stanford-logo#2x.png'
that is helpful also.
Thanks everyone!
Here is how you can get rid of the dupes and 'footer-stanford-logo#2x.png'-like thingies in your output:
import re
p = re.compile(r'[\w.-]+#(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b')
test_str = "{'email': ['billy666#stanford.edu',\n 'cantorfamilies#stanford.edu',\n 'cantorfamilies#stanford.edu',\n 'cantorfamilies#stanford.edu',\n 'footer-stanford-logo#2x.png']}"
print(set(p.findall(test_str)))
See the Python demo
The regex will look like
[\w.-]+#(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^
See demo
The negative lookahead (?![\w.-]*\.(?:png|jpe?g|gif)\b) will disallow all matches with png, jpg, etc. extensions at the end of the word (\b is a word boundary, and in this case, it is a trailing word boundary).
Dupes can easily be removed with a set - it is the least troublesome part here.
FINAL SOLUTION:
item["email"] = set(re.findall(r'[\w.-]+#(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b', response.body))
Can't you just use a set instead of a list?
item["email"] = set(re.findall('[\w\.-]+#[\w\.-]+', response.body))
And if you really want a list then:
item["email"] = list(set(re.findall('[\w\.-]+#[\w\.-]+', response.body)))
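One caveat: a set does not keep the order in which the addresses were found, and the desired output in the question lists them in page order. If that matters, a sketch like the following keeps the first occurrence of each match. It assumes Python 3.7+ (where dicts preserve insertion order), and page_text is a made-up stand-in for an already decoded response body.

```python
import re

# Sample text standing in for the page content (assumed to be a str).
page_text = (
    "billy666#stanford.edu cantorfamilies#stanford.edu "
    "cantorfamilies#stanford.edu cantorfamilies#stanford.edu "
    "footer-stanford-logo#2x.png"
)

found = re.findall(r'[\w.-]+#[\w.-]+', page_text)

# dict.fromkeys keeps one key per distinct value, in first-seen order.
unique_in_order = list(dict.fromkeys(found))
print(unique_in_order)
```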
I'm asked to write a regular expression that can catch multi-domain email addresses and implement it in Python. I came up with the following regular expression (and code; the emphasis is on the regex though), which I think is correct:
import re
regex = r'\b[\w|\.|-]+#([\w]+\.)+\w{2,4}\b'
input_string = "hey my mail is abc#def.ghi"
match = re.findall(regex, input_string)
print(match)
Now when I run this (using a very simple address) it doesn't catch it!
Instead it shows an empty list as the output. Can somebody tell me where I went wrong in the regular expression?
Here's a simple one to start you off with
regex = r'\b[\w.-]+?#\w+?\.\w+?\b'
re.findall(regex,input_string) # ['abc#def.ghi']
The problem with your original one is that you don't need the | operator inside a character class ([..]); there it is just a literal | character. Just write [\w|\.|-] as [\w.-] (if the - is at the end, you don't need to escape it).
Next, there are far too many variations on legitimate domain names to enumerate. Just look for at least one period surrounded by word characters after the # symbol:
#\w+?\.\w+?\b
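One further detail that may be worth checking with the original pattern: when a pattern contains a capturing group, re.findall returns the group's text rather than the whole match. The sketch below compares a capturing and a non-capturing version of the multi-domain part. The variable names are illustrative.

```python
import re

input_string = "hey my mail is abc#def.ghi"

# Capturing group: findall returns only what the (last repetition of the)
# group matched, not the full address.
with_capture = re.findall(r'\b[\w.-]+#(\w+\.)+\w{2,4}\b', input_string)

# Non-capturing group (?:...): findall returns the whole match.
without_capture = re.findall(r'\b[\w.-]+#(?:\w+\.)+\w{2,4}\b', input_string)

print(with_capture)     # ['def.']
print(without_capture)  # ['abc#def.ghi']
```

Keeping the repeated group, but making it non-capturing, still allows several dot-separated domain labels.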
This might be a silly question, but I'm just trying to learn!
I'm trying to build a simple email search tool to learn more about python. I'm modifying some open source code to parse the email address:
emails = re.findall(r'([A-Za-z0-9\.\+_-]+#[A-Za-z0-9\._-]+\.[a-zA-Z]*)', html)
Then I'm writing the results into a spreadsheet using the CSV module.
Since I'd like to keep the domain extension open to almost anything, my results include image file names that happen to look like email addresses:
example: forbes#2x-302019213j32.png
How can I exclude strings ending in "png" from the re.findall results?
Code:
def scrape(self, page):
    try:
        request = urllib2.Request(page.url.encode("utf8"))
        html = urllib2.urlopen(request).read()
    except Exception, e:
        return
    emails = re.findall(r'([A-Za-z0-9\.\+_-]+#[A-Za-z0-9\._-]+\.[a-zA-Z]*)', html)
    for email in emails:
        if email not in self.emails:  # if not a duplicate
            self.csvwriter.writerow([page.title.encode('utf8'), page.url.encode("utf8"), email])
            self.emails.append(email)
You are already only acting inside an if, so just make the png check part of that condition. That will be much easier than trying to exclude it in the regex:
if email not in self.emails and not email.endswith("png"):  # not a duplicate and not a png name
    self.csvwriter.writerow([page.title.encode('utf8'), page.url.encode("utf8"), email])
    self.emails.append(email)
I know Joran already gave you a response, but here's another way to do it with Python regex that I found cool.
There is a (?!...) construct, the negative lookahead, which essentially says: "At this point in the string, if the pattern inside the parentheses would match next, then the overall match fails here." It consumes no characters itself.
If that was a bad explanation, the Python documentation does a much better job: https://docs.python.org/2/howto/regex.html#lookahead-assertions
Also, here is a working example:
import re
y = r'([A-Za-z0-9\.\+_-]+#[A-Za-z0-9\._-]+\.(?!png)[a-zA-Z]*)'
s = 'forbes#2x-302019213j32.png'
re.findall(y, s) # Will return an empty list
s2 = 'myname#email2018529391230.net'
re.findall(y, s2) # Will return a list with s2 string
s3 = s + ' ' + s2 # Concatenates the two e-mail-formatted strings
re.findall(y, s3) # Will only return s2 string in list
Lots of ways to do this, but my favorite is:
pat = re.compile(r'''
    [A-Za-z0-9\.\+_-]+    # 1+ of letters, digits, . + _ -
    \#[A-Za-z0-9\._-]+    # literal # (escaped, since a bare # starts a comment under re.X) followed by same
    \.png                 # if png, DON'T CAPTURE
    |([A-Za-z0-9\.\+_-]+\#[A-Za-z0-9\._-]+\.[a-zA-Z]*)
                          # if not png, CAPTURE''', flags=re.X)
Since regexes are evaluated left-to-right, if a string starts to match then it will match the left side of the | first. If the string ends in .png, then it will consume that string but NOT capture it. If it DOESN'T end in .png, the right side of the | will begin to consume it and WILL capture it. For a more in-depth conversation of this trick, see here. To use these do:
matches = list(filter(None, pat.findall(html)))  # list() so the result is a list on Python 3 as well
Any string matched by the left side (e.g. all the png files that are matched but NOT part of a capturing group) will show up as an empty string in your findall. filter(None, iterable) removes all the empty strings from your iterable, leaving you with only the data you want.
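Here is a small runnable sketch of the same trick on a made-up snippet of text, showing the empty string that the uncaptured png branch leaves behind. The html value is hypothetical, and the # inside the verbose pattern is escaped so that it is not read as a comment marker.

```python
import re

pat = re.compile(r'''
    [A-Za-z0-9.+_-]+\#[A-Za-z0-9._-]+\.png            # png-like names: matched, not captured
    |([A-Za-z0-9.+_-]+\#[A-Za-z0-9._-]+\.[a-zA-Z]*)   # everything else: captured
    ''', flags=re.X)

# Made-up page text with one image-like name and one address.
html = 'logo forbes#2x-302019213j32.png contact myname#email2018529391230.net'

raw = pat.findall(html)           # one entry per match; '' where the left branch matched
matches = list(filter(None, raw)) # drop the empty strings

print(raw)
print(matches)
```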
Alternatively, you can filter after you grab everything:
pat = re.compile(r'''[A-Za-z0-9\.\+_-]+#[A-Za-z0-9\._-]+\.[a-zA-Z]*''')
# same regex you have currently
matches = [m for m in pat.findall(html) if not m.endswith('png')]
Note that further on, you should really make self.emails a set. It doesn't seem to need to keep its ordering, and set lookup is WAY faster than list lookup. Remember to use set.add instead of list.append though.
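A minimal sketch of that set-based bookkeeping, pulled out of the scraper class so it can run on its own. The function name new_emails, the seen argument and the sample strings are assumptions for illustration; in the original code the set would live on self.emails.

```python
import re

EMAIL_RE = re.compile(r'[A-Za-z0-9.+_-]+#[A-Za-z0-9._-]+\.[a-zA-Z]*')

def new_emails(html, seen):
    """Return addresses in html that are not in seen yet, skipping png names.

    seen is a set and is updated in place.
    """
    fresh = []
    for email in EMAIL_RE.findall(html):
        if email in seen or email.endswith('png'):
            continue
        seen.add(email)       # set.add instead of list.append
        fresh.append(email)   # list keeps the order for writing rows
    return fresh

seen = set()
first = new_emails('a#b.com forbes#2x-1.png a#b.com c#d.org', seen)
second = new_emails('a#b.com e#f.net', seen)
print(first, second)
```

The returned list is what would be written to the CSV, while the set only answers the "seen before?" question.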