Find substring of url using regular expression - python

I would like suggestions on extracting a substring from a range of URLs. The code I'm writing should extract this piece of info (the actual id of the URL) from URLs in incoming events from our web tracker.
Take these URLs (the URLs that contain the substrings I'm looking for are in the format of the first three):
https://www.rbnett.no/sport/i/LA8gxP/_
https://www.itromso.no/sport/sprek/i/GGobq6/derfor-vraker-tromsoes-beste-loeper-sesongens-eneste-konkurranse-det-er-for-risikabelt-aa-delta
https://www.adressa.no/sport/fotball/i/9vyQGW/brann-treneren-ferdig-avsluttet-pressekonferansen-med-aa-sitere-max-manus
https://www.rbnett.no/dakapo/banner/
https://www.adressa.no/search/
where I want to extract the substrings "LA8gxP", "GGobq6" and "9vyQGW" from the first three URLs respectively, without hitting "dakapo", "banner" or "search" in the last two.
I'm asking for suggestions on a regexp to extract that piece of info. As far as I know, the substrings only contain a-z, A-Z, and 0-9. The substrings seem to be only 6 chars long, but that will probably change over time.
The best solution (using Python) I have found so far is this:
import re
match = re.search(r"/i/([a-zA-Z0-9]+)/", url)
substring = match.group(1) if match else None
It works, but I don't find it to be very elegant.
Also, it relies on having the /i/ pattern as a prefix. Even though it looks like a consistent pattern, I'm not 100% sure that it is.

The only other alternative I can think of is:
\/i\/(.+)\/
Here is the demo: https://regex101.com/r/2iOyCE/1
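For reference, here is the pattern above run against the sample URLs; the first three yield their ids and the last two yield no match:

import re

urls = [
    "https://www.rbnett.no/sport/i/LA8gxP/_",
    "https://www.itromso.no/sport/sprek/i/GGobq6/derfor-vraker-tromsoes-beste-loeper-sesongens-eneste-konkurranse-det-er-for-risikabelt-aa-delta",
    "https://www.adressa.no/sport/fotball/i/9vyQGW/brann-treneren-ferdig-avsluttet-pressekonferansen-med-aa-sitere-max-manus",
    "https://www.rbnett.no/dakapo/banner/",
    "https://www.adressa.no/search/",
]

article_id = re.compile(r"/i/([a-zA-Z0-9]+)/")

for url in urls:
    m = article_id.search(url)
    print(m.group(1) if m else None)
# LA8gxP, GGobq6, 9vyQGW, None, None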

Related

Python regex to exclude several words

I am trying to search for URLs and want to exclude some. The variable download_artist stores the base URL, and I want to find additional links, but not uploads, favorites, followers or listens.
So I tried different versions with the mentioned words and a |, like:
urls = re.findall(rf'^{download_artist}uploads/|{download_artist}^favorites/|^{download_artist}followers/|^{download_artist}listens/|{download_artist}\S+"', response.text, re.IGNORECASE)
or:
urls = re.findall(rf'{download_artist}^uploads/|^favorites/|^followers/|^listens/|\S+"', response.text, re.IGNORECASE)
But it ignores my ^ for excluding the words. Where is my mistake?
You need to use a "lookaround" in this case; you can see more details at https://www.regular-expressions.info/lookaround.html.
So I think this regex solves your problem:
{download_artist}(?!uploads/|favorites/|followers/|listens/)\S+\"
You can test whether the regex works at https://regex101.com/. This site is very useful when you work with regex.
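Here is a minimal, runnable sketch of that idea; the download_artist value and the HTML snippet are made up for illustration, and re.escape is added so the dots in the base URL are taken literally:

import re

download_artist = "https://example.com/artist/"  # hypothetical base URL

html = """
<a href="https://example.com/artist/uploads/">uploads</a>
<a href="https://example.com/artist/track-1">a track</a>
<a href="https://example.com/artist/followers/">followers</a>
"""

# the negative lookahead rejects matches where an unwanted section follows the base URL
urls = re.findall(
    rf'{re.escape(download_artist)}(?!uploads/|favorites/|followers/|listens/)\S+"',
    html,
    re.IGNORECASE,
)
print(urls)  # ['https://example.com/artist/track-1"']  (the trailing quote is kept, as in the original pattern)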
^ only works as a negation inside character classes ([...]); outside of them it represents the beginning of the input.
I suggest you do two matches: One to match all urls and another one to match the ones to exclude. Then remove the second set of urls from the first one.
That will keep the regexes simple and readable.
If you have to do it in one regex for whatever reason you can try to solve it with (negative) lookaround pattern (see https://www.rexegg.com/regex-lookarounds.html).
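A minimal sketch of that two-pass idea, again with a made-up download_artist value and HTML snippet:

import re

download_artist = "https://example.com/artist/"  # hypothetical base URL
html = """
<a href="https://example.com/artist/uploads/">uploads</a>
<a href="https://example.com/artist/track-1">a track</a>
"""

# first pass: every link under the base URL; second pass: only the unwanted sections
all_urls = set(re.findall(rf'{re.escape(download_artist)}\S+?"', html, re.IGNORECASE))
excluded = set(re.findall(
    rf'{re.escape(download_artist)}(?:uploads|favorites|followers|listens)/\S*?"',
    html, re.IGNORECASE))

print(all_urls - excluded)  # {'https://example.com/artist/track-1"'}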

Python re.findall hangs on certain websites

I've got a Python script that loops through a list of websites/domains to scrape phones and e-mails from my clients' websites. 99% of the scrapes work fine, but some websites just hang and I can't even force-break the operation, as if it were stuck in an endless loop. Below is an example. Could anyone help me improve or fix this?
import requests, re
try:
    r = requests.Session()
    f = r.get('http://www.poffoconsultoria.com.br', verify=False, allow_redirects=False, timeout=(5, 5))
    s = f.text
    tels = set(re.findall(r"\s?\(?0?[1-9][1-9]\)?[-\.\s][2-5]\d{3}\.?-?\s?\d{4}", s))
    emails = set(re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}", s))
    print(tels)
    print(emails)
except Exception as e:
    print(e)
You should remove the \s? from the first regex (you do not really need a whitespace at the start of the match), or replace with (?<!\S) if you want to only match after a whitespace or start of string.
The real problem is with the second regex where . resides in a character class that is quantified with +. The \. that follows it also matches a . and that makes it a problem when no matching text appears in the string. This is catastrophic backtracking.
Since the matches you expect are whole words, I suggest enhancing the pattern by 1) adding word boundaries, 2) making all adjoining subpatterns match different types of chars.
Use
r'\b[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,4}\b'
to match emails.
See the (?:[A-Za-z0-9-]+\.)+ part: it matches one or more repetitions of one or more alphanumeric/hyphen chars followed by a dot. There is no \. after this pattern, only an alphabetic character class, so there should be no problem like the one present before.
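A quick sketch of the suggested pattern in use (the sample text is made up):

import re

email_re = re.compile(r'\b[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,4}\b')

sample = "Contato: vendas@example.com.br | suporte@example.org | tel (47) 3333-4444"
print(sorted(set(email_re.findall(sample))))
# ['suporte@example.org', 'vendas@example.com.br']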
So: I got the website data fine in Python 2.7 using
>>> string = requests.get('http://www.poffoconsultoria.com.br').text
I then took the length of the string:
>>> len(string)
474038
That's a really high value.
So for problems like these, when you see a regex take such a long time (really, after checking the length of the page), you should visit the page in your browser and inspect the page source.
When I inspected the page in my browser, I found huge inline dictionaries of data embedded in the source (the screenshots are omitted here).
The second regex's [A-Za-z0-9._%+-]+ part will definitely hang (really, take a long time) because it is unbounded and has to search through those enormous portions.
You either need to chunk the page or limit your regex. Or you could write a function that discards the dictionary data if you suspect that what you need to return won't appear inside it; basically, though, those huge dictionaries are what cause the regex you posted to take so long.
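As a hedged illustration of the "limit your regex" idea: one option is to strip <script> blocks before running the extraction patterns, assuming that is where the embedded data lives (an assumption, not something stated in the original answer):

import re

def strip_scripts(html):
    # drop <script>...</script> blocks so the extraction regexes only see markup/text
    return re.sub(r'(?is)<script.*?</script>', ' ', html)

# emails = set(re.findall(email_pattern, strip_scripts(s)))  # email_pattern is whichever regex you settle on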
Use a valid email regex:
(?i)(?:("[^"\\]*(?:\\.[^"\\]*)*"@)|((?:[0-9a-z](?:\.(?!\.)|[-!#$%&'*+/=?^`{}|~\w])*)?[0-9a-z]@))(?:(\[(?:\d{1,3}\.){3}\d{1,3}\])|((?:[0-9a-z][-\w]*[0-9a-z]*\.)+[a-z0-9][-a-z0-9]{0,22}[a-z0-9]))

Regex to Catch Url

I have a URL: http://200.73.81.212/.CREDIT-UNION/update.php None of the regular expressions I've found or developed myself works. I'm working on a phishing-mail dataset and there are lots of strange hyperlinks. This is one of mine:
https?:\/\/([a-zA-z0-9]+.)+)|(www.[a-zA-Z0-9]+.([a-zA-Z0-9]+\.[a-zA-Z0-9]+)+)(((/[\.A-Za-z0-9]+))+/?.
Of course no success. I work in Python.
EDIT:
I need a regex to catch this kind of URL and also any ordinary hyperlinks, like:
https://cnn.com/
www.foxnews.com/story/122345678
Any thoughts?
What about something like this?
import re
phish = re.compile('''(?P<http>http\://)
(?P<ipaddress>(([0-9]*(\.)?)[0-9]*)*)/\.
(?P<name>(\.)?([A-Za-z]*)(\-)?([A-Za-z]*))/
(?P<ending>(update\.php))''', re.VERBOSE)
example_string = 'http://200.73.81.212/.CREDIT-UNION/update.php'
found_matches = []
# check that matches actually exist in input string
if phish.search(example_string):
    # in case there are many matches, iterate over them
    for mtch in phish.finditer(example_string):
        # and append matches to master list
        found_matches.append(mtch.group(0))
print(found_matches)
# ['http://200.73.81.212/.CREDIT-UNION/update.php']
This is flexible enough so now in case you have alternate endings than update.php, you can simply include them in the named capture group by separating all alternate ending matches with |, i.e.
(update\.php|remove\.php, ...)
Furthermore, your ipaddress named capture group can take any number like 123.23.123.12; it doesn't have to be a fixed number of repeating digits followed by period patterns. IP address octets cap out at 3 digits, so you could pin those down with curly brackets to make sure you are matching the right kinds of numbers:
[0-9]{2,3}\.  # minimum of 2 digits, maximum of 3 per octet
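Putting those two suggestions together, a hedged sketch of a tightened variant (remove.php is a made-up alternative ending, not something from the question):

import re

phish2 = re.compile(r'''(?P<http>https?://)
                        (?P<ipaddress>(?:\d{1,3}\.){3}\d{1,3})/\.
                        (?P<name>[A-Za-z-]+)/
                        (?P<ending>update\.php|remove\.php)''', re.VERBOSE)

print(phish2.search('http://200.73.81.212/.CREDIT-UNION/update.php').group(0))
# http://200.73.81.212/.CREDIT-UNION/update.php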
While @datawrestler's answer works for the original question, I had to extend it to catch a wider group of URLs (I've edited the question). This regex seems to work for the task:
r"""(https?://www\.[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.#-]+){0,20})|\
(https?://[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.#-]+){0,20})|\
(www.[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.#-]+){0,20})"""
Three alternatives: https?://www, https://domain, www.domain

how to find occurrence of special character using regex

I have a URL like this:
http://foo.com/bar_by_baz.html
Now I want to extract baz from that URL using a regex, but so far I have only managed to write this much:
[_]+?\w[^.]+
This is giving me
_by_baz
as output. Now I want to know that how can I select any special character exactly one time or what would be the best approach to solve this using regex ?
I am trying it on python 3.x
Here's your regex: [_]+?([^_.]+). The captured group will return baz. The concept is to exclude the underscore and dot from the target match.
Alternatively, this works by capturing only the alphanumerics: [_]+?([A-Za-z0-9]+)
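For what it's worth, a quick check in Python 3; note that re.findall returns every captured segment, so the last hit is the one wanted here:

import re

url = "http://foo.com/bar_by_baz.html"
print(re.findall(r"[_]+?([A-Za-z0-9]+)", url))      # ['by', 'baz']
print(re.findall(r"[_]+?([A-Za-z0-9]+)", url)[-1])  # baz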
I am going to assume from your profile that you are seeking a JavaScript-friendly solution (you should update your question & tags).
For JavaScript, you could use this pattern: /[^_]+(?=\.[a-z]+$)/
The pattern matches the substring containing no underscores that is followed by a dot and then one or more alphabetical characters until the end of the string.
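Since the question is about Python 3, here is the same lookahead idea translated to Python (a minimal sketch):

import re

url = "http://foo.com/bar_by_baz.html"
m = re.search(r"[^_]+(?=\.[a-z]+$)", url)
print(m.group(0) if m else None)  # baz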
There will be several ways to accomplish your task. Finding the best/most efficient one can only be achieved if you provide more information about the coding environment/language and a few more sample strings.

Improving accuracy/brevity of regex for inconsistent url filtering

So, for some lulz, a friend and I were playing with the idea of filtering a list (100k+) of URLs to retrieve only the parent domain (e.g. "domain.com|org|etc"). The only caveat is that they are not all nicely and consistently formatted.
So, to explain, some may be "http://www.domain.com/urlstuff", some have country codes like "www.domain.co.uk/urlstuff", while others can be a bit more odd, more akin to "hello.in.con.sistent.urls.com/urlstuff".
So, story aside, I have a regex that works:
import re
firsturl = 'www.foobar.com/fizz/buzz'
m = re.search('\w+(?=(\..{3}/|\..{2}\..{2}/))\.(.{3}|.{2}\..{2})', firsturl)
m.group(0)
which returns:
foobar.com
It looks ahead for the first "/" after the domain part of the URL, then returns the two "."-separated fields before it.
So, my query: would anyone in the Stack hive mind have any wisdom to share on how this could be done with a better/shorter regex, or a regex that doesn't rely on a forward lookup for the "/" within the string?
Appreciation for all of the help in this!
I do think that regex is just the right tool for this. Regex is pattern matching, which is put to best use when you have a known pattern that might have several variations, as in this case.
In your explanation of and attempted solution to the problem, I think you are greatly oversimplifying it, though. TLDs come in many more flavors than "2-letter country codes" and "3-letter" others. See ICANN's list of top-level domains for the hundreds currently available, with lengths from 2 letters and up. Also, you may have URLs without any slashes and some with multiple slashes and dots after the domain name.
So here's my solution (see on regex101):
^(?:https?://)?(?:[^/]+\.)*([^/]+\.[a-z]{2,})
What you want is captured in the first matching group.
Breakdown:
^(?:https?://)? matches a possible protocol at the beginning
(?:[^/]+\.)* matches possible multiple non-slash sequences, each followed by a dot
([^/]+\.[a-z]{2,}) matches (and captures) one final non-slash sequence followed by a dot and the TLD (2+ letters)
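A quick sketch of the pattern in use on a couple of the URLs from the question:

import re

domain_re = re.compile(r'^(?:https?://)?(?:[^/]+\.)*([^/]+\.[a-z]{2,})')

for url in ('http://www.domain.com/urlstuff', 'www.foobar.com/fizz/buzz'):
    m = domain_re.match(url)
    print(m.group(1) if m else None)
# domain.com
# foobar.com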
You can use this regex instead:
import re
firsturl = 'www.foobar.com/fizz/buzz'
domain = re.match("(.+?)\/", firsturl).group()
Notice, though, that this will only work without 'http://'.
