Python re.findall hangs on certain websites

I've got a Python script that loops through a list of websites/domains to scrape phones and e-mails from my clients' websites. 99% of the scrapes work fine, but some websites just hang, and I can't even force-break the operation; it's as if it's stuck in an infinite loop. Below is an example. Could anyone help me improve or fix this?
import requests, re

try:
    r = requests.Session()
    f = r.get('http://www.poffoconsultoria.com.br', verify=False,
              allow_redirects=False, timeout=(5, 5))
    s = f.text
    tels = set(re.findall(r"\s?\(?0?[1-9][1-9]\)?[-\.\s][2-5]\d{3}\.?-?\s?\d{4}", s))
    emails = set(re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}", s))
    print(tels)
    print(emails)
except Exception as e:
    print(e)

You should remove the \s? from the first regex (you do not really need the optional whitespace at the start of the match), or replace it with (?<!\S) if you want to match only after whitespace or at the start of the string.
The real problem is with the second regex, where . sits inside a character class that is quantified with +. The \. that follows can also match a ., so when no matching text appears in the string the engine tries an enormous number of ways to split the input between the two. This is catastrophic backtracking.
Since the matches you expect are whole words, I suggest enhancing the pattern by 1) adding word boundaries, 2) making all adjoining subpatterns match different types of chars.
Use
r'\b[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,4}\b'
to match emails.
See the (?:[A-Za-z0-9-]+\.)+ part: it matches one or more repetitions of one or more alphanumeric/hyphen chars followed by a dot, and there is no \. after this pattern, only an alphabetic character class, so the ambiguity that caused the problem before is gone.
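As a quick sanity check, here is the enhanced pattern run against a made-up sample string (the real input would be the scraped page text):

```python
import re

# Enhanced pattern from above: word boundaries, and a domain part
# (?:[A-Za-z0-9-]+\.)+ whose dot can never be re-matched by a later \.
EMAIL_RE = re.compile(r'\b[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,4}\b')

# Hypothetical sample text for illustration only.
sample = "Contact us at info@example.com or sales@example.co.uk for details."

print(EMAIL_RE.findall(sample))
# ['info@example.com', 'sales@example.co.uk']
```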

So: I fetched the website data fine in Python 2.7 using
>>> string = requests.get('http://www.poffoconsultoria.com.br').text
I then took the length of the string:
>>> len(string)
474038
That's a really large value.
For problems like these, where a regex takes such a long time (really, right after checking the length of the page), you should visit the page in your browser and inspect the page source.
When I inspected the page in my browser, I found enormous blocks of dictionary-like data embedded in the source.
The second regex's [A-Za-z0-9._%+-]+ will definitely hang (really, take a very long time) because its unbounded + has to churn through those ginormous portions.
You either need to chunk the page or limit your regex. Or you could write a function that discards the dictionary data if you suspect that what you need won't appear inside it; basically, those huge dictionaries are what makes the regex you posted take so long.
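One way to sketch the "discard" idea: strip <script>…</script> blocks (where such embedded dictionaries usually live) before running the e-mail regex, so the pattern never scans them. The HTML below is a made-up miniature stand-in for the real page:

```python
import re

# Made-up miniature page: a big inline dictionary lives in a <script> block.
html = ('<p>Fale conosco: contato@example.com.br</p>'
        '<script>var data = {"k1": "v1", "k2": "v2"};</script>')

# Discard the <script> blocks so the e-mail regex never has to scan them.
cleaned = re.sub(r"(?s)<script\b.*?</script>", " ", html)

emails = set(re.findall(r"\b[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,4}\b",
                        cleaned))
print(emails)
# {'contato@example.com.br'}
```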

Use a valid email regex:
(?i)(?:("[^"\\]*(?:\\.[^"\\]*)*"@)|((?:[0-9a-z](?:\.(?!\.)|[-!#$%&'*+/=?^`{}|~\w])*)?[0-9a-z]@))(?:(\[(?:\d{1,3}\.){3}\d{1,3}\])|((?:[0-9a-z][-\w]*[0-9a-z]*\.)+[a-z0-9][-a-z0-9]{0,22}[a-z0-9]))

Related

Positive Lookbehind Stripping Out Metacharacters

I need to get the sequence at the end of many urls to label csv files. The approach I have taken gives me the result I want, but I am struggling to understand how I might use a positive lookbehind to capture all the characters after the word 'series' in the url while ignoring any metacharacters? I know I can use re.sub() to delete them, however, I am interested in learning how I can complete the whole process in one regex.
I have searched through many posts on how I might do this and experimented with lots of different approaches, but I haven't been able to figure it out, mainly by replacing the .+ after the (?<=series\-) with something that negates the hyphen; it hasn't worked.
url = 'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a'
res = re.search(r"(?<=series\-).+", url).group(0)
re.sub('-', '', res)
Which gives the desired result 'kbw10a'
Is it possible to strip out the metacharacter '-' in the positive lookbehind? Is there a better approach to this without the lookaround?
More examples;
'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1014416/yanmar-marine-marine-main-engine-small-qm-series-kbw10',
'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a',
'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1018923/yanmar-marine-marine-main-engine-small-qm-series-kh18-a',
You cannot "ignore" chars in a lookaround the way you describe, because in order to match a part of a string, the regex engine needs to consume the part, from left to right, matching all subsequent subpatterns in your regex.
The only way to achieve that is through an additional step: removing the hyphens once the match is found. Note that you do not need another regex to remove the hyphens; .replace('-', '') will suffice:
url = 'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a'
resObj = re.search(r"series-(.+)", url)
if resObj:
    res = resObj.group(1).replace('-', '')
Note it is much safer to first run re.search to get the match data object and then access the .group(), else, when there is no match, you may get an exception.
Also, there is no need of any lookarounds in the pattern, a capturing group will work as well.
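Applied to the three example URLs from the question, the capture-group-plus-replace approach gives:

```python
import re

urls = [
    'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1014416/yanmar-marine-marine-main-engine-small-qm-series-kbw10',
    'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a',
    'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1018923/yanmar-marine-marine-main-engine-small-qm-series-kh18-a',
]

labels = []
for url in urls:
    resObj = re.search(r"series-(.+)", url)
    if resObj:  # guard against URLs that have no 'series-' part
        labels.append(resObj.group(1).replace('-', ''))

print(labels)
# ['kbw10', 'kbw10a', 'kh18a']
```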

Why do multi-line strings lead to different pattern matches from single line strings when using python regex?

I am trying to create a Discord Bot that reads users messages and detects when an Amazon link(s) is/are present in their message.
If I use a multi-line string I capture different results from when the message is used on a single line.
Here is the code I am using:
import re

AMAZON_REGEX = re.compile("(http[s]?://[a-zA-Z0-9.-]*(?:amazon|amzn).["
                          "a-zA-Z]+(?:.+?(?:ref=[^?]+)|.+(?= )|[^?]+))")

def extract_url(message):
    foo = AMAZON_REGEX.findall(message)
    return foo
user_message = """https://www.amazon.co.uk/dp/B07RLWTXKG blah blah
hello
https://www.amazon.co.uk/dp/B07RLWToop foobar"""
print(extract_url(user_message))
The result of the above code is: ['https://www.amazon.co.uk/dp/B07RLWTXKG blah blah', 'https://www.amazon.co.uk/dp/B07RLWToop']
However, if I change user_message from a multiline string to a single line one then I get the following result: ['https://www.amazon.co.uk/dp/B07RLWTXKG blah blah hello https://www.amazon.co.uk/dp/B07RLWToop']
Why is this the case? Also, how do I capture just the URL without the rest of the users' messages?
It seems like you're having an issue with the exact regex you're using.
Why does the newline change the output?
After parsing the link, your regex keeps capturing the words that follow, separated by spaces, but a newline character stops the match from continuing. The newline between "blah" and "hello" is what causes "hello" (and everything after it) not to be captured in the multi-line case: a newline (\n) is a character just like a, * and any other.
Only capturing the link
I'm not quite sure what format the amazon link would come in, so it's difficult to say how it should look. However, you know that the link will not contain a space, so stopping the matching when you see a space character would be optimal.
(http[s]?:\/\/[a-zA-Z0-9.-]*(?:amazon|amzn).[a-zA-Z]+(?:.+?(?:ref=[^?]+)|.+(?= )|[^?]+))
(http[s]?:\/\/[a-zA-Z0-9.-]*(?:amazon|amzn).[a-zA-Z]+(?:.+?(?:ref=[^?]+)|[^ ]+(?= )|[^?]+))
In the example above, I turned one of your trailing . subpatterns (basically "match any character") into [^ ] (basically "match anything except a space"). This means the match stops at the first space instead of running on into the words that follow the URL.
Good luck with the Discord bot!
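A quick check of the suggested [^ ]+(?= ) fix against the multi-line message from the question (the pattern is otherwise the one posted above):

```python
import re

# Same pattern as above, with the middle alternative changed to [^ ]+(?= )
# so the match stops at the first space after the URL.
AMAZON_REGEX = re.compile(r"(http[s]?://[a-zA-Z0-9.-]*(?:amazon|amzn)."
                          r"[a-zA-Z]+(?:.+?(?:ref=[^?]+)|[^ ]+(?= )|[^?]+))")

user_message = """https://www.amazon.co.uk/dp/B07RLWTXKG blah blah
hello
https://www.amazon.co.uk/dp/B07RLWToop foobar"""

print(AMAZON_REGEX.findall(user_message))
# ['https://www.amazon.co.uk/dp/B07RLWTXKG', 'https://www.amazon.co.uk/dp/B07RLWToop']
```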
So the reason you're getting different results from your two input sources is that your regex doesn't account for newlines. This answer goes into a little more detail about how your regex might need to be modified to handle a newline in the string.
But - if what you really want is just to get a list of links without the rest of the text, you're better off using a different regex string designed to capture just the URL. This post has several different regex strategies for matching just a single URL.

Find substring of url using regular expression

I would like suggestions on extracting a substring from a range of URLs. The code I'm writing should extract this piece of info (the actual id of the URL) from URLs in incoming events from our web tracker.
Take these URLs (the URLs that contain the substrings I'm looking for are in the format of the first three):
https://www.rbnett.no/sport/i/LA8gxP/_
https://www.itromso.no/sport/sprek/i/GGobq6/derfor-vraker-tromsoes-beste-loeper-sesongens-eneste-konkurranse-det-er-for-risikabelt-aa-delta
https://www.adressa.no/sport/fotball/i/9vyQGW/brann-treneren-ferdig-avsluttet-pressekonferansen-med-aa-sitere-max-manus
https://www.rbnett.no/dakapo/banner/
https://www.adressa.no/search/
where I want to extract the substrings "LA8gxP", "GGobq6" and "9vyQGW" from the three former URLs respectively, without hitting "dakapo", "banner" or "search" from the latter two.
I'm asking for suggestions on a regexp to extract that piece of info. As far as I know, the substrings only contain a-z, A-Z, and 0-9. The substrings seem to be only 6 chars long, but that will probably change over time.
The best solution (using Python) I have found so far is this:
match = re.search(r"/i/([a-zA-Z0-9]+)/", url)
substring = match.group(1)
It works, but I don't find it to be very elegant.
Also, it's relying on having the /i/-pattern as a prefix. Even though it looks like a consistent pattern, I'm not 100% sure if it is.
The only other alternative I can think of is:
\/i\/(.+)\/
Here is the demo: https://regex101.com/r/2iOyCE/1
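For what it's worth, the /i/ pattern from the question behaves as intended on all five sample URLs, with the latter two simply yielding no match:

```python
import re

urls = [
    'https://www.rbnett.no/sport/i/LA8gxP/_',
    'https://www.itromso.no/sport/sprek/i/GGobq6/derfor-vraker-tromsoes-beste-loeper-sesongens-eneste-konkurranse-det-er-for-risikabelt-aa-delta',
    'https://www.adressa.no/sport/fotball/i/9vyQGW/brann-treneren-ferdig-avsluttet-pressekonferansen-med-aa-sitere-max-manus',
    'https://www.rbnett.no/dakapo/banner/',
    'https://www.adressa.no/search/',
]

ids = []
for url in urls:
    match = re.search(r"/i/([a-zA-Z0-9]+)/", url)
    # None for URLs without the /i/<id>/ segment
    ids.append(match.group(1) if match else None)

print(ids)
# ['LA8gxP', 'GGobq6', '9vyQGW', None, None]
```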

Regex to Catch Url

I have a URL: http://200.73.81.212/.CREDIT-UNION/update.php. None of the regular expressions I've found or developed myself work. I'm working on a phishing-mail dataset and there are lots of strange hyperlinks. This is one of mine:
https?:\/\/([a-zA-z0-9]+.)+)|(www.[a-zA-Z0-9]+.([a-zA-Z0-9]+\.[a-zA-Z0-9]+)+)(((/[\.A-Za-z0-9]+))+/?.
Of course, no success. I work in Python.
EDIT:
I need a regex to catch this kind of url's and, also, any ordinary hyperlinks, like:
https://cnn.com/
www.foxnews.com/story/122345678
Any thoughts?
What about something like this?
import re
phish = re.compile('''(?P<http>http\://)
(?P<ipaddress>(([0-9]*(\.)?)[0-9]*)*)/\.
(?P<name>(\.)?([A-Za-z]*)(\-)?([A-Za-z]*))/
(?P<ending>(update\.php))''', re.VERBOSE)
example_string = 'http://200.73.81.212/.CREDIT-UNION/update.php'
found_matches = []
# check that matches actually exist in input string
if phish.search(example_string):
    # in case there are many matches, iterate over them
    for mtch in phish.finditer(example_string):
        # and append matches to master list
        found_matches.append(mtch.group(0))
print(found_matches)
# ['http://200.73.81.212/.CREDIT-UNION/update.php']
This is flexible enough so now in case you have alternate endings than update.php, you can simply include them in the named capture group by separating all alternate ending matches with |, i.e.
(update\.php|remove\.php, ...)
Furthermore, your ipaddress named capture group can take any address like 123.23.123.12; it doesn't have to be a fixed number of repeated digits followed by periods. Since each octet of an IP address caps out at 3 digits, you could anchor those down with curly-brace quantifiers to make sure you are matching the right kinds of numbers:
[0-9]{1,3}\. # minimum of 1 digit, maximum of 3 (no space inside the braces, or Python treats them as literal text)
While #datawrestler's answer works for the original question, I had to extend it to catch a wider group of URLs (I've edited the question). This regex seems to do the task (written as adjacent string literals, since a trailing backslash inside a triple-quoted raw string would leak a literal backslash-newline into the pattern):
r"(https?://www\.[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.#-]+){0,20})"
r"|(https?://[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.#-]+){0,20})"
r"|(www\.[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.#-]+){0,20})"
Three alternatives: https?://www.domain, https?://domain, and www.domain
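A quick check of the three alternatives against the phishing URL and the two ordinary hyperlinks from the question:

```python
import re

# Three alternatives: https?://www.domain, https?://domain, www.domain,
# each with up to 20 optional /path segments.
URL_RE = re.compile(
    r"(https?://www\.[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.#-]+){0,20})"
    r"|(https?://[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.#-]+){0,20})"
    r"|(www\.[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.#-]+){0,20})"
)

samples = [
    'http://200.73.81.212/.CREDIT-UNION/update.php',
    'https://cnn.com/',
    'www.foxnews.com/story/122345678',
]

matches = [URL_RE.search(s).group(0) for s in samples]
print(matches)
# ['http://200.73.81.212/.CREDIT-UNION/update.php', 'https://cnn.com', 'www.foxnews.com/story/122345678']
```

Note that the bare trailing slash in 'https://cnn.com/' is not captured, because each path segment requires at least one character after the /.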

Extracting a string from a txt file

So I'm just experimenting, trying to parse the web using Python, and I thought I would make a script that searches for my favorite links to watch shows online. I'm trying to have my program search sidereel.com for a good link to my desired show and return the links to me. I know that the site saves the links in the following format:
watch-freeseries.mu, then some long string that I need to ignore, followed by '14792088'
So what I need is to find this string in the text of the site and return only the 8 digits at the end of the string. I'm not sure how I can get to the numbers, and I need them because they are the link number. Any help would be much appreciated.
You could use a regular expression to do this fairly easily.
>>> import re
>>> text = "watch-freeseries.mu=lklsflamflkasfmsaldfasmf14792088"
>>> expr = re.compile("watch\-freeseries\.mu.*?(\d{8})")
>>> expr.findall(text)
['14792088']
A breakdown of the expression:
watch\-freeseries\.mu - Match the start of the expected expression. Escape any possible special characters by preceding them with \.
.*? - Match any characters. . means any character and * means zero or more of them in a row. The ? makes the match non-greedy, so that two or more urls in the same string will not produce overlapping matches.
(\d{8}) - Match and save the last 8 digits
Note: If you're trying to parse links out of a webpage there are easier ways. I've seen many recommendations on StackOverflow for the BeautifulSoup package in particular. I've never used it myself so YMMV.
