I am trying to search for URLs and want to exclude some. In the variable download_artist I stored the base URL and want to find additional links, but not uploads, favorites, followers or listens.
So I tried different versions with the mentioned words and a |, like:
urls = re.findall(rf'^{download_artist}uploads/|{download_artist}^favorites/|^{download_artist}followers/|^{download_artist}listens/|{download_artist}\S+"', response.text, re.IGNORECASE)
or:
urls = re.findall(rf'{download_artist}^uploads/|^favorites/|^followers/|^listens/|\S+"', response.text, re.IGNORECASE)
But it ignores my ^ for excluding the words. Where is my mistake?
You need to use a "lookaround" in this case; you can see more details at https://www.regular-expressions.info/lookaround.html.
So, I think this regex solves your problem:
{download_artist}(?!uploads/|favorites/|followers/|listens/)\S+\"
You can test whether the regex is working at https://regex101.com/. This site is very useful when you work with regex.
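For instance, a minimal sketch of how the negative lookahead could be used in Python; the base URL and HTML snippet below are invented placeholders for illustration, not your actual data:

import re

# invented placeholders, not the asker's real data
download_artist = "https://example.com/artist/"
html = '<a href="https://example.com/artist/uploads/">up</a> <a href="https://example.com/artist/albums/1">album</a>'

# re.escape keeps the dots in the base URL from acting as wildcards
pattern = rf'{re.escape(download_artist)}(?!uploads/|favorites/|followers/|listens/)\S+"'
urls = re.findall(pattern, html, re.IGNORECASE)
print(urls)  # ['https://example.com/artist/albums/1"'] - the trailing quote comes from the \S+" part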
^ only works as a negation inside a character class ([]); outside of one it matches the beginning of the input.
I suggest you do two matches: one to match all URLs and another one to match the ones to exclude. Then remove the second set of URLs from the first one.
That will keep the regexes simple and readable.
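A rough sketch of that two-pass idea, reusing download_artist and response from the question (untested against the actual page, so treat it as a starting point):

import re

# download_artist and response come from your existing code
base = re.escape(download_artist)
all_urls = set(re.findall(rf'{base}\S+"', response.text, re.IGNORECASE))
unwanted = set(re.findall(rf'{base}(?:uploads|favorites|followers|listens)/\S*"', response.text, re.IGNORECASE))
urls = all_urls - unwanted  # keep only the links you actually want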
If you have to do it in one regex for whatever reason you can try to solve it with (negative) lookaround pattern (see https://www.rexegg.com/regex-lookarounds.html).
I have a URL: http://200.73.81.212/.CREDIT-UNION/update.php None of the regular expressions I've found or developed myself works. I'm working on a phishing mail dataset and there are lots of strange hyperlinks. This is one of mine:
https?:\/\/([a-zA-z0-9]+.)+)|(www.[a-zA-Z0-9]+.([a-zA-Z0-9]+\.[a-zA-Z0-9]+)+)(((/[\.A-Za-z0-9]+))+/?.
Of course no success. I work in Python.
EDIT:
I need a regex to catch this kind of URL and also any ordinary hyperlinks, like:
https://cnn.com/
www.foxnews.com/story/122345678
Any thoughts?
What about something like this?
import re
phish = re.compile(r'''(?P<http>http\://)
(?P<ipaddress>(([0-9]*(\.)?)[0-9]*)*)/\.
(?P<name>(\.)?([A-Za-z]*)(\-)?([A-Za-z]*))/
(?P<ending>(update\.php))''', re.VERBOSE)
example_string = 'http://200.73.81.212/.CREDIT-UNION/update.php'
found_matches = []
# check that matches actually exist in input string
if phish.search(example_string):
    # in case there are many matches, iterate over them
    for mtch in phish.finditer(example_string):
        # and append matches to master list
        found_matches.append(mtch.group(0))
print(found_matches)
# ['http://200.73.81.212/.CREDIT-UNION/update.php']
This is flexible enough that, in case you have endings other than update.php, you can simply include them in the named capture group, separating the alternative endings with |, i.e.
(update\.php|remove\.php, ...)
Furthermore, the ipaddress named capture group can take any IP such as 123.23.123.12; it doesn't have to be a fixed number of repeating digits-followed-by-period patterns. Now, IP address octets cap out at 3 digits, so you could anchor those down with curly braces to make sure you are matching the right kinds of numbers:
[0-9]{2,3}\. # minimum of 2 digits, maximum of 3 (note: no space inside the braces, or re treats them literally)
While @datawrestler's answer works for the original question, I had to extend it to catch a wider group of URLs (I've edited the question). This regex seems to work for the task:
r"""(https?://www\.[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.#-]+){0,20})|\
(https?://[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.#-]+){0,20})|\
(www.[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.#-]+){0,20})"""
Three alternatives: https?://www, https://domain, www.domain
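A quick, hedged check of how that pattern behaves; finditer with group(0) is used because re.findall would return per-group tuples for a pattern with this many groups:

import re

pattern = re.compile(
    r"(https?://www\.[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.#-]+){0,20})|"
    r"(https?://[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.#-]+){0,20})|"
    r"(www\.[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+(/[a-zA-Z0-9.#-]+){0,20})"
)
text = ("http://200.73.81.212/.CREDIT-UNION/update.php "
        "https://cnn.com/ www.foxnews.com/story/122345678")
print([m.group(0) for m in pattern.finditer(text)])
# ['http://200.73.81.212/.CREDIT-UNION/update.php', 'https://cnn.com', 'www.foxnews.com/story/122345678']
# note: the trailing slash of https://cnn.com/ is not part of the match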
I have a URL like this
http://foo.com/bar_by_baz.html
Now I want to extract baz from that URL using a regex, but so far I have only managed to write this much:
[_]+?\w[^.]+
This is giving me
_by_baz
as output. Now I want to know how I can select any special character exactly one time, or what would be the best approach to solve this using regex?
I am trying it on python 3.x
Here's your regex: [_]+?([^_.]+). The captured group gives you the pieces between the separators, with baz as the last one. The concept is to exclude underscore and dot from the target match.
In another case, this works by capturing only the alphanumerics: [_]+?([A-Za-z0-9]+)
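A quick sanity check of both patterns (my own snippet, just this one URL); note that re.search would return the first piece, 'by', so re.findall plus the last element is shown here:

import re

url = "http://foo.com/bar_by_baz.html"
print(re.findall(r"[_]+?([^_.]+)", url))        # ['by', 'baz'] -> take [-1] for 'baz'
print(re.findall(r"[_]+?([A-Za-z0-9]+)", url))  # ['by', 'baz']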
I am going to assume from your profile that you are seeking a javascript-friendly solution (you should update your question & tags).
For javascript, you could use this pattern: /[^_]+(?=\.[a-z]+$)/
Demo Link
The pattern matches the substring containing no underscores that is followed by a dot and then one or more alphabetical characters at the end of the string.
There will be several ways to accomplish your task. Finding the best/most efficient one can only be achieved if you provide more information about the coding environment/language and a few more sample strings.
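Since the question mentions Python 3.x, here is a rough sketch of the same lookahead pattern with Python's re module (using the URL from the question):

import re

url = "http://foo.com/bar_by_baz.html"
m = re.search(r"[^_]+(?=\.[a-z]+$)", url)
print(m.group(0) if m else None)  # baz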
So, for some lulz, a friend and I were playing with the idea of filtering a list (100k+) of URLs to retrieve only the parent domain (e.g. "domain.com|org|etc"). The only caveat is that they are not all nice and matching in format.
So, to explain, some may be "http://www.domain.com/urlstuff", some have country codes like "www.domain.co.uk/urlstuff", while others can be a bit more odd, more akin to "hello.in.con.sistent.urls.com/urlstuff".
So, story aside, I have a regex that works:
import re
firsturl = 'www.foobar.com/fizz/buzz'
m = re.search('\w+(?=(\..{3}/|\..{2}\..{2}/))\.(.{3}|.{2}\..{2})', firsturl)
m.group(0)
which returns:
foobar.com
It looks up the first "/" at the end of the url, then returns the two "." separated fields before it.
So, my query: would anyone in the Stack hive mind have any wisdom to share on how this could be done with a better/shorter regex, or a regex that doesn't rely on a lookahead for the "/" within the string?
Appreciation for all of the help in this!
I do think that regex is just the right tool for this. Regex is pattern matching, which is put to best use when you have a known pattern that might have several variations, as in this case.
In your explanation of and attempted solution to the problem, I think you are greatly oversimplifying it, though. TLDs come in many more flavors than 2-letter country codes and 3-letter others. See ICANN's list of top-level domains for the hundreds currently available, with lengths from 2 letters and up. Also, you may have URLs without any slashes and some with multiple slashes and dots after the domain name.
So here's my solution (see on regex101):
^(?:https?://)?(?:[^/]+\.)*([^/]+\.[a-z]{2,})
What you want is captured in the first matching group.
Breakdown:
^(?:https?://)? matches a possible protocol at the beginning
(?:[^/]+\.)* matches zero or more non-slash sequences, each followed by a dot
([^/]+\.[a-z]{2,}) matches (and captures) one final non-slash sequence followed by a dot and the TLD (2+ letters)
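A small Python sketch of the pattern in use, with sample URLs taken from the question; group(1) holds the parent domain:

import re

pattern = re.compile(r'^(?:https?://)?(?:[^/]+\.)*([^/]+\.[a-z]{2,})')
for url in ['http://www.domain.com/urlstuff',
            'hello.in.con.sistent.urls.com/urlstuff',
            'www.foobar.com/fizz/buzz']:
    m = pattern.match(url)
    print(m.group(1) if m else None)
# domain.com
# urls.com
# foobar.com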
You can use this regex instead:
import re
firsturl = 'www.foobar.com/fizz/buzz'
domain = re.match("(.+?)\/", firsturl).group(1)  # group(1) is the part before the first "/"
Notice, though, that this will only work without 'http://'.
I have the following URL pattern:
http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en
I would like to get everything up to and including /watch/\d+/.
So far I have:
>>> re.split(r'watch/\d+/', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en')
['http://www.hulu.jp/', 'supernatural-dub-hollywood-babylon/en']
But this does not include the separator itself (the part which appears between the domain and the rest of the path). The end answer I want to achieve is:
http://www.hulu.jp/watch/589851
You need to use a capture group:
>>> re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en')
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/en']
As mentioned in the other answer, you need to use groups to capture the "glue" between the split strings.
I wonder though, is what you want here a split() or a search()? It looks (from the sample) that you're trying to extract from a URL everything from the first occurrence of /watch/XXX/ where XXX is 1 or more digits, to the end of the string. If that's the case, then a match/search might be more suitable, as with a split if the search regex can match multiple times you'll split into multiple groups. Ex:
re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/', 'watch/2342/', 'fdsaafsdf']
Which doesn't look like what you want. Instead perhaps:
result = re.search(r'(watch/\d+/)(.*)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
result.groups() if result else []
which gives:
('watch/589851/', 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
You could also use this approach combined with named groups to get extra fancy:
result = re.search(r'(?P<watchId>watch/\d+/)(?P<path>.*)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
result.groupdict() if result else {}
giving:
{'path': 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf', 'watchId': 'watch/589851/'}
If you're set on the split() approach, you can also set the maxsplit parameter to ensure it's only split once:
re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf', maxsplit=1)
giving:
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf']
Personally though, I find that when parsing URL's into constituent parts the search() with named groups approach works extremely well as it allows you to name the various parts in the regex itself, and via groupdict() get a nice dictionary you can use for working with those parts.
You've surely seen the Stack Overflow don't-parse-HTML-with-regex post, yes?
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML.
Well, regex can parse URLs, but trying to do so when there's a plethora of better tools is foolish.
This is what a regex for URLs looks like:
^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?#)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?$ (+ caseless flag)
It's just a mess of characters, right? Exactly!
Don't parse URLs with regex... almost.
There is one simple thing:
A path-relative URL must be zero or more path segments separated from each other by a "/".
Splitting the URL should be as simple as url.split("/").
from urllib.parse import urlparse, urlunparse
myurl = "http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en"
# Run a parser over it
parts = urlparse(myurl)
# Crop the path to UP TO length 2
new_path = str("/".join(parts.path.split("/")[:3]))
# Unparse
urlunparse(parts._replace(path=new_path))
#>>> 'http://www.hulu.jp/watch/589851'
You can try the following regex:
.*\/watch\/\d+
Working Demo
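A minimal sketch of that pattern with re.search, using the URL from the question:

import re

url = 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en'
m = re.search(r'.*\/watch\/\d+', url)
print(m.group(0) if m else None)
# http://www.hulu.jp/watch/589851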
My script works fine doing this:
images = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", doc)
videos = re.findall("\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)", doc)
However, I believe it is inefficient to search through the whole document twice.
Here's a sample document if it helps: http://pastebin.com/5kRZXjij
I would expect the following output from the above:
images = http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg
videos = http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl
Instead it would be better to do something like:
image_and_video_links = re.findall(" <match-image-links-or-video links> ", doc)
How can I combine the two re.findall lines into one?
I have tried using the | character but I always fail to match anything. So I'm sure I'm completely confused as to how to use it properly.
As mentioned in the comments, a pipe (|) should do the trick.
The regular expression
(src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg))|(\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*))
catches either of the two patterns.
Demo on Regex Tester
If you really want efficient...
For starters, I would cut out the \S*? in the second regex. It serves no purpose apart from an opportunity for lots of backtracking.
src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)|(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)
Other ideas
You can get rid of the capture groups by using a small lookbehind in the first one, allowing you to drop all the parentheses and directly match what you want. Not faster, but tidier:
(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*
Do you intend for the periods after src and media to mean "any character", or to mean "a literal period"? If the latter, escape them: \.
You can use the re.IGNORECASE option and get rid of some letters:
(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-z0-9]*