This might be a silly question, but I'm just trying to learn!
I'm trying to build a simple email search tool to learn more about python. I'm modifying some open source code to parse the email address:
emails = re.findall(r'([A-Za-z0-9\.\+_-]+#[A-Za-z0-9\._-]+\.[a-zA-Z]*)', html)
Then I'm writing the results into a spreadsheet using the CSV module.
Since I'd like to keep the domain extension open to almost any, my results are outputting image files with an email type format:
example: forbes#2x-302019213j32.png
How can I add to exclude "png" string from re.findall
Code:
def scrape(self, page):
    try:
        request = urllib2.Request(page.url.encode("utf8"))
        html = urllib2.urlopen(request).read()
    except Exception, e:
        return
    emails = re.findall(r'([A-Za-z0-9\.\+_-]+#[A-Za-z0-9\._-]+\.[a-zA-Z]*)', html)
    for email in emails:
        if email not in self.emails: # if not a duplicate
            self.csvwriter.writerow([page.title.encode('utf8'), page.url.encode("utf8"), email])
            self.emails.append(email)
You are already guarding the write with an if, so just make the extension test part of that check. That will be much easier than trying to exclude it in the regex:
    if email not in self.emails and not email.endswith("png"): # if not a duplicate and not a png file name
        self.csvwriter.writerow([page.title.encode('utf8'), page.url.encode("utf8"), email])
        self.emails.append(email)
I know Joran already gave you a response, but here's another way to do it with Python regex that I found cool.
There is a (?!...) matching pattern that essentially says: "Wherever you place this matching pattern, if at that point in the string this pattern is checked and a match is found, then that match occurrence fails."
If that was a bad explanation, the Python document does a much better job: https://docs.python.org/2/howto/regex.html#lookahead-assertions
Also, here is a working example:
y = r'([A-Za-z0-9\.\+_-]+#[A-Za-z0-9\._-]+\.(?!png)[a-zA-Z]*)'
s = 'forbes#2x-302019213j32.png'
re.findall(y, s) # Will return an empty list
s2 = 'myname#email2018529391230.net'
re.findall(y, s2) # Will return a list with s2 string
s3 = s + ' ' + s2 # Concatenates the two e-mail-formatted strings
re.findall(y, s3) # Will only return s2 string in list
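The same lookahead can reject several extensions at once. Here is a minimal sketch, assuming a hypothetical list of image extensions (not from the question) and the `#` separator used in the sample strings of this thread:

```python
import re

# Hypothetical list of extensions to reject; adjust to taste.
IMAGE_EXTS = ("png", "jpg", "jpeg", "gif")

# (?!...) makes the match fail when the text right after the dot is one of
# the listed extensions followed by a word boundary.
pattern = re.compile(
    r'[A-Za-z0-9.+_-]+#[A-Za-z0-9._-]+\.(?!(?:%s)\b)[a-zA-Z]+' % "|".join(IMAGE_EXTS)
)

text = 'forbes#2x-302019213j32.png myname#email2018529391230.net'
found = pattern.findall(text)
print(found)
```

The check is case-sensitive as written; passing `re.IGNORECASE` to `re.compile` would also reject `.PNG`.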
Lots of ways to do this, but my favorite is:
pat = re.compile(r'''
    [A-Za-z0-9\.\+_-]+       # one or more letters, digits, or . + _ -
    \#[A-Za-z0-9\._-]+       # literal # (escaped, since a bare # starts a comment under re.X) followed by same
    \.png                    # if png, DON'T CAPTURE
    |([A-Za-z0-9\.\+_-]+\#[A-Za-z0-9\._-]+\.[a-zA-Z]*)
                             # if not png, CAPTURE''', flags=re.X)
Since regexes are evaluated left-to-right, if a string starts to match then it will match the left side of the | first. If the string ends in .png, then it will consume that string but NOT capture it. If it DOESN'T end in .png, the right side of the | will begin to consume it and WILL capture it. For a more in-depth conversation of this trick, see here. To use these do:
matches = filter(None,pat.findall(html))
Any string matched by the left side (e.g. all the png files that are matched but NOT part of a capturing group) will show up as an empty string in your findall. filter(None, iterable) removes all the empty strings from your iterable, leaving you with only the data you want.
Alternatively, you can filter after you grab everything
pat = re.compile(r'''[A-Za-z0-9\.\+_-]+#[A-Za-z0-9\._-]+\.[a-zA-Z]*''')
# same regex you have currently
matches = filter(lambda x: not x.endswith('png'), pat.findall(html))
Note that further on, you should really make self.emails a set. It doesn't seem to need to keep its ordering, and set lookup is WAY faster than list lookup. Remember to use set.add instead of list.append though.
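To illustrate the set suggestion, here is a small sketch that combines the post-filtering with a set of already-seen addresses. The helper name `new_emails` and the sample strings are made up for the example; the regex is the one from the question.

```python
import re

EMAIL_RE = re.compile(r'[A-Za-z0-9.+_-]+#[A-Za-z0-9._-]+\.[a-zA-Z]*')

def new_emails(html, seen):
    """Return addresses found in html that are not in `seen`, adding them to `seen`."""
    fresh = []
    for email in EMAIL_RE.findall(html):
        # Skip image file names and anything already recorded.
        if email.endswith('.png') or email in seen:
            continue
        seen.add(email)          # set.add instead of list.append
        fresh.append(email)
    return fresh

seen = set()
first = new_emails('a#b.com forbes#2x-302019213j32.png a#b.com', seen)
second = new_emails('a#b.com c#d.org', seen)
print(first, second)
```

Membership tests on a set take roughly constant time, so the duplicate check stays cheap as the number of collected addresses grows.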
I tried this code:
contents = 'alokm.014#gmail.yahoo.com.....thankyou'
match = re.findall(r'[\w\.-]+#[\w\.-]+', contents)
print match
Result:
alokm.014#gmail.yahoo.com.....thankyou
I want to remove .....thankyou from the matched email.
Is it possible to obtain only alokm.014#gmail.yahoo.com?
One more thing: the real content is much bigger, so I would like to do this by changing
re.findall(r'[\w\.-]+#[\w\.-]+', contents)
if it is possible.
I don't know about python, but languages like Java have libraries that help validate URLs and email addresses. Alternately, you can use a well-vetted regex expression.
My suggestion would be to keep removing the end of the string based on dots until the string validates. So test the string, and if it doesn't validate as an email, read the string from the right until you encounter a period, then drop the period and everything to the right and start again.
So you'd loop through like this
alokm.014#gmail.yahoo.com.....thankyou
alokm.014#gmail.yahoo.com....
alokm.014#gmail.yahoo.com...
alokm.014#gmail.yahoo.com..
alokm.014#gmail.yahoo.com.
alokm.014#gmail.yahoo.com
At which point it would validate as a real email address. Yes, it's slow. Yes, it can be tricked. But it will work most of the time based on the little info (possible strings) given.
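The loop described above could be sketched in Python roughly like this. The `VALID` pattern is a deliberately simple stand-in for a real validator (an assumption, not a vetted expression), and it uses the `#` separator from the sample string:

```python
import re

# Simplistic validity check (assumption): local part, '#', then
# dot-separated labels ending in at least two letters.
VALID = re.compile(r'^[\w.+-]+#[\w-]+(?:\.[\w-]+)*\.[A-Za-z]{2,}$')

def trim_to_email(candidate):
    """Cut at the last period until the candidate validates, or give up."""
    while candidate:
        if VALID.match(candidate):
            return candidate
        head, dot, _tail = candidate.rpartition('.')
        if not dot:              # no period left to cut at
            return None
        candidate = head         # drop the period and everything after it
    return None

result = trim_to_email('alokm.014#gmail.yahoo.com.....thankyou')
print(result)
```

As the answer says, this can be fooled: any trailing text that happens to look like a domain label would be kept.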
Interesting question! And, here's a Python Regex program to help make extraction of email from the contents possible:
import re
contents = 'alokm.014#gmail.yahoo.com.....thankyou'
emailRegex = re.compile(r'''
    [a-zA-Z0-9.]+        # username
    \#                   # the # symbol (escaped, because a bare # starts a comment in verbose mode)
    [a-zA-Z0-9.]+\.com   # domain
    ''', re.VERBOSE)     # re.VERBOSE lets the regex span multiple lines with comments for better readability
extractEmail = emailRegex.findall(contents)
print(extractEmail)
Output will be:
['alokm.014#gmail.yahoo.com']
I will now suggest that you refer to this Regex-HowTo doc to understand what's happening in this program and to come up with a better version that could extract all the emails from your larger text.
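One possible "better version" that is not tied to `.com` is to require every dot in the domain to be followed by at least one word character, so a run of trailing dots ends the match. This is only a sketch; the second address in the sample text is invented for the example.

```python
import re

# Domain = one label, then one or more ".label" parts. A dot that is not
# followed by a word character (as in ".....") cannot continue the match.
EMAIL = re.compile(r'[\w.+-]+#[\w-]+(?:\.[\w-]+)+')

contents = 'alokm.014#gmail.yahoo.com.....thankyou and bob#example.org.'
emails = EMAIL.findall(contents)
print(emails)
```

It still accepts things that are not real addresses; it only fixes the trailing-dots problem from the question.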
I'm struggling to do multiline regex with multiple matches.
I have data separated by newlines/line breaks like below. My pattern matches each of these lines if I test it separately. How can I match all the occurrences (specifically the numbers)?
I've read that I could/should use DOTALL somehow (possibly with MULTILINE). This seems to make . match any character (newlines included), but I'm not sure about possible side effects. I don't want it to match a stray integer or something and give me malformed data in the end.
Any info on this would be great.
What I really need, though, is some assistance in making this example code work. I only need to fetch the numbers from the data.
I used re.fullmatch when I only needed one specific match in a previous case, and I'm not entirely sure which function I should use now (finditer, findall, search etc.).
Thank you for any and all help :)
import re
data = """http://store.steampowered.com/app/254060/
http://www.store.steampowered.com/app/254061/
https://www.store.steampowered.com/app/254062
store.steampowered.com/app/254063
254064"""
regPattern = r'^\s*(?:https?:\/\/)?(?:www\.)?(?:store\.steampowered\.com\/app\/)?([0-9]+)\/?\s*$'
evaluateData = re.search(regPattern, data, re.DOTALL | re.MULTILINE)
if evaluateData is not None:
    print('do stuff')
else:
    print('found no match')
import re
p = re.compile(ur'^\s*(?:https?:\/\/)?(?:www\.)?(?:store\.steampowered\.com\/app\/)?([0-9]+)\/?\s*$', re.MULTILINE)
test_str = u"http://store.steampowered.com/app/254060/\nhttp://www.store.steampowered.com/app/254061/\nhttps://www.store.steampowered.com/app/254062\nstore.steampowered.com/app/254063\n254064"
re.findall(p, test_str)
https://regex101.com/r/rC9rI0/1
this gives [u'254060', u'254061', u'254062', u'254063', u'254064'].
Are you trying to return those specific integers?
re.search stops at the first occurrence.
You should use this instead:
re.findall(regPattern, data, re.MULTILINE)
['254060', '254061', '254062', '254063', '254064']
Note: re.search did not work for me (Python 2.7.9). It only returned the first line of data.
/ has no special meaning in a Python regex, so you do not have to escape it. Also use a raw string; in a non-raw string you would have to escape every \.
Try this:
regPattern = r'^\s*(?:https?://)?(?:www\.)?(?:store\.steampowered\.com/app/)?([0-9]+)/?\s*$'
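Since the question mentions re.fullmatch, another option is to split the data into lines and apply fullmatch to each one, which avoids the MULTILINE and DOTALL flags entirely. A sketch, assuming Python 3.4+ (where fullmatch exists) and that converting the IDs to int is wanted:

```python
import re

data = """http://store.steampowered.com/app/254060/
http://www.store.steampowered.com/app/254061/
https://www.store.steampowered.com/app/254062
store.steampowered.com/app/254063
254064"""

# Same pattern as above without ^ and $; fullmatch anchors both ends itself.
APP_ID = re.compile(
    r'\s*(?:https?://)?(?:www\.)?(?:store\.steampowered\.com/app/)?([0-9]+)/?\s*'
)

app_ids = []
for line in data.splitlines():
    m = APP_ID.fullmatch(line)
    if m:                          # lines that do not fit the pattern are skipped
        app_ids.append(int(m.group(1)))

print(app_ids)
```

Handling each line separately also means `\s*` can never swallow a newline and join two lines into one match.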
I am trying to write a generic replace function for a regex sub operation in Python (trying in both 2 and 3) Where the user can provide a regex pattern and a replacement for the match. This could be just a simple string replacement to replacing using the groups from the match.
In the end, I get from the user a dictionary in this form:
regex_dict = {pattern:replacement}
When I replace all occurrences of a pattern with the following call, the replacement works, including replacements that refer to a group number (such as \1):
re.sub(pattern, regex_dict[pattern], text)
This works as expected, but I need to do additional stuff when a match is found. Basically, what I try to achieve is as follows:
def replace_function(matchobj):
    result = regex_dict[matchobj.re]
    ##
    ## Do some other things
    ##
    return result
re.sub(pattern, replace_function, text)
I see that this works for normal replacements, but the re.sub does not use the group information to get the match when the function is used.
I also tried to convert the \1 pattern to \g<1>, hoping that the re.sub would understand it, but to no avail.
Am I missing something vital?
Thanks in advance!
Additional notes: I compile the pattern using strings as in bytes, and the replacements are also in bytes. I have non-Latin characters in my pattern, but I read everything in bytes, including the text where the regex substitution will operate on.
EDIT
Just to clarify, I do not know in advance what kind of replacement the user will provide. It could be some combination of normal strings and groups, or just a string replacement.
SOLUTION
def replace_function(matchobj):
    repl = regex_dict[matchobj.re]
    ##
    ## Do some other things
    ##
    return matchobj.expand(repl)
re.sub(pattern, replace_function, text)
I suspect you're after .expand. If you have a match object (from a compiled regex, for instance), you can pass it the replacement template and it will substitute the group references, e.g.:
import re
text = 'abc'
# This would be your key in the dict
rx = re.compile(r'a(\w)c')
# This would be the value for the key (the replacement string, eg: `\1\1\1`)
res = rx.match(text).expand(r'\1\1\1')
# bbb
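Putting this together with the dictionary from the question, a generic replace function might look like the sketch below. The patterns, templates, and the `log` list (standing in for the "other things") are all invented for illustration; the dictionary is assumed to be keyed by compiled patterns so that `matchobj.re` can be used as the lookup key.

```python
import re

# Hypothetical user-supplied mapping: compiled pattern -> replacement template.
regex_dict = {
    re.compile(r'(\w+)-(\w+)'): r'\2-\1',   # template using groups
    re.compile(r'\d+'): 'N',                # plain string replacement
}

log = []  # stands in for the extra work done on every match

def replace_function(matchobj):
    repl = regex_dict[matchobj.re]   # matchobj.re is the compiled pattern
    log.append(matchobj.group(0))
    return matchobj.expand(repl)     # resolves \1, \g<1>, \g<name> in the template

text = 'left-right 42'
for pattern in regex_dict:
    text = pattern.sub(replace_function, text)

print(text)
```

Because expand treats the template exactly like the string form of the repl argument to re.sub, plain strings and group references can be mixed freely.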
Is there a simple method to pull content between a regex? Assume I have the following sample text
SOME TEXT [SOME MORE TEXT] value="ssss" SOME MORE TEXT
My regex is:
compiledRegex = re.compile('\[.*\] value=("|\').*("|\')')
This will obviously return the entire [SOME MORE TEXT] value="ssss", however I only want ssss to be returned since that's what I'm looking for
I can obviously define a parser function but I feel as if python provides some simple pythonic way to do such a task
This is what capturing groups are designed to do.
compiledRegex = re.compile(r'\[.*\] value=(?:"|\')(.*)(?:"|\')')
matches = compiledRegex.search(sampleText) # search, not match: the pattern does not start at the beginning of the text
capturedGroup = matches.group(1) # grab contents of first group
The ?: inside the old groups (the parentheses) means that the group is now a non-capturing group; that is, it won't be accessible as a group in the result. I converted them to keep the output simpler, but you can leave them as capturing groups if you prefer (but then you have to use matches.group(2) instead, since the first quote would be the first captured group).
Your original regex is too greedy: r'.*\]' won't stop at the first ']' and the second '.*' won't stop at '"'. To stop at c you could use [^c] or '.*?':
regex = re.compile(r"""\[[^]]*\] value=("|')(.*?)\1""")
Example
m = regex.search("""SOME TEXT [SOME MORE TEXT] value="ssss" SOME MORE TEXT""")
print m.group(2)
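The difference between the greedy and non-greedy forms shows up as soon as the text contains more than one quoted value. A small sketch with a made-up sample string:

```python
import re

sample = '''SOME TEXT [A] value="ssss" MORE [B] value='tttt' END "x"'''

# Greedy: .* runs to the LAST quote character in the text.
greedy = re.search(r'''value=(["'])(.*)(["'])''', sample).group(2)

# Non-greedy with a backreference: stops at the first matching closing quote.
lazy = [m.group(2)
        for m in re.finditer(r'''\[[^]]*\] value=(["'])(.*?)\1''', sample)]

print(greedy)
print(lazy)
```

The backreference `\1` also guarantees that the closing quote is the same kind as the opening one.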
While testing on http://gskinner.com/RegExr/ (online regex tester), the regex [jpg|bmp] returns results when either jpg or bmp exist, however, when I run this regex in python, it only return j or b. How do I make the regex take the whole word "jpg" or "bmp" inside the set ? This may have been asked before however I was not sure how to structure question to find the answer. Thanks !!!
Here is the whole regex if it helps
"http://www\S*(?i)\\.(jpg|bmp|png|gif|img|jng|jpeg|jpe|gif|giff)"
Its just basically to look for pictures in a url
Use (jpg|bmp) instead of square brackets.
Square brackets mean - match a character from the set in the square brackets.
Edit - you might want something like that: [^ ].*?(jpg|bmp) or [^ ].*?\.(jpg|bmp)
When you use [] you are creating a character class that contains all the characters between the brackets.
So you are not matching jpg or bmp; you are matching a single character that is a j, a p, a g, a | ...
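A quick way to see the difference is to run both forms over the same string. The file name here is just an example:

```python
import re

name = 'photo.jpg'

# Character class: matches any ONE of the characters j, p, g, |, b, m.
char_class = re.findall(r'[jpg|bmp]', name)

# Group with alternation: matches the whole word jpg or bmp.
group = re.findall(r'(?:jpg|bmp)', name)

print(char_class)
print(group)
```

The character class even picks up the `p` at the start of `photo`, which shows it is working one character at a time.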
You should add an anchor for the end of the string to your regex
http://www\S*(?i)\\.(jpg|bmp|png|gif|img|jng|jpeg|jpe|gif|giff)$
^ ^^
If you need double escaping, apply it everywhere in your pattern:
http://www\\S*(?i)\\.(jpg|bmp|png|gif|img|jng|jpeg|jpe|gif|giff)$
to ensure that it checks for the file ending at the very end of the string.
If you are searching a list of URLs
urls = [ 'http://some.link.com/path/to/file.jpg',
'http://some.link.com/path/to/another.png',
'http://and.another.place.com/path/to/not-image.txt',
]
to find ones that match a given pattern you can use:
import re
for url in urls:
    if re.match(r'http://.*(jpg|png|gif)$', url):
        print(url)
which will output
http://some.link.com/path/to/file.jpg
http://some.link.com/path/to/another.png
re.match() will test for a match at the start of the string and return a match object for the first two links, and None for the third.
If you are getting just the extension, you can use the following:
for url in urls:
    m = re.match(r'http://.*(jpg|png|gif)$', url)
    if m:  # the third URL does not match, so m is None there
        print(m.groups())
which will print
('jpg',)
('png',)
You will get just the extensions because that's what was defined as a group.
If you need to find the url in a long string of text (such as returned from wget), you need to use re.search() and enclose the part you are interested in with ( )'s. For example,
response = """dlkjkd dkjfadlfjkd fkdfl kadfjlkadfald ljkdskdfkl adfdf
kjakldjflkhttp://some.url.com/path/to/file.jpgkaksdj fkdjakjflakdjfad;kadj af
kdlfjd dkkf aldfkaklfakldfkja df"""
reg = re.search(r'(http:.*/(.*\.(jpg|png|gif)))', response)
print reg.groups()
will print
('http://some.url.com/path/to/file.jpg', 'file.jpg', 'jpg')
or you can use re.findall or re.finditer in place of re.search to get all of the URL's in the long response. Search will only return the first one.
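For the findall variant, a non-capturing group keeps the result a flat list of URLs. A minimal sketch, assuming the URLs contain no whitespace and are followed by a non-word character; the sample text and hosts are made up:

```python
import re

response = """see http://some.url.com/path/to/file.jpg and
http://other.example.com/img/logo.png plus http://example.com/readme.txt"""

# \S+? grows lazily until it reaches an image extension at a word boundary;
# (?:...) is non-capturing, so findall returns the whole match.
IMAGE_URL = re.compile(r'http://\S+?\.(?:jpg|png|gif)\b')

image_urls = IMAGE_URL.findall(response)
print(image_urls)
```

Unlike the `.*` pattern above, this does not cope with a URL glued directly to following letters, since `\b` needs a non-word character after the extension.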