Using regex to deal with escape characters in URLs - python

I'm in the process of tokenizing strings which contain URLs. Here is the part I use to pick up the URLs:
regex_str = [r'http[s]?://(?:[a-z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+']
It picks up "regular" URLs perfectly fine; however some of the URLs look like this:
https:\/\/t.co\/c1taPXzi4X
How can I modify the regex so that it deals with the escape characters, in order to end up with a complete and clean URL?
Many thanks in advance! :)

As pointed out in this other question, a URL can't contain a "\".
Your regex seems OK to me; I tested it on RegExr. The only thing I did was escape the backslashes after http.

Calling re.sub to strip the backslashes before you apply the regex would work:
re.sub(r"\\","",r"https:\/\/abc.com\/defg")
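A minimal sketch of that clean-then-match idea, assuming the only escaping in the input is the JSON-style \/ sequence. The helper name extract_urls and the sample text are made up for illustration; the pattern is the one from the question. Unlike the re.sub call above, which strips every backslash, this variant replaces only \/ so that other backslashes in the text are left alone.

```python
import re

# URL pattern taken from the question, unchanged.
URL_RE = re.compile(
    r'http[s]?://(?:[a-z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+'
)

def extract_urls(text):
    """Unescape '\\/' to '/' first, then run the URL regex (hypothetical helper)."""
    cleaned = text.replace("\\/", "/")
    return URL_RE.findall(cleaned)

sample = r"see https:\/\/t.co\/c1taPXzi4X now"
print(extract_urls(sample))
```

If the strings come from JSON in the first place, decoding them with a JSON parser before tokenizing would remove the escapes as well.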

Related

Python regex to exclude several words

I am trying to search for URLs and want to exclude some of them. I stored the base URL in the variable download_artist and want to find additional links, but not uploads, favorites, followers or listens.
So I tried different versions with the mentioned words and a |. Like:
urls = re.findall(rf'^{download_artist}uploads/|{download_artist}^favorites/|^{download_artist}followers/|^{download_artist}listens/|{download_artist}\S+"', response.text, re.IGNORECASE)
or:
urls = re.findall(rf'{download_artist}^uploads/|^favorites/|^followers/|^listens/|\S+"', response.text, re.IGNORECASE)
But it ignores my ^ for excluding the words. Where is my mistake?
You need to use a "lookaround" in this case; you can find more details at https://www.regular-expressions.info/lookaround.html.
So I think this regex solves your problem:
{download_artist}(?!uploads/|favorites/|followers/|listens/)\S+\"
You can test whether the regex works at https://regex101.com/. This site is very useful when you work with regex.
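A short sketch of the negative lookahead in Python. The base URL and the sample markup are invented for illustration, and two details differ from the pattern above as assumptions of this sketch: the base URL is passed through re.escape so that its dots are matched literally, and [^\s"]+ is used instead of \S+" so that the closing quote is not part of the result.

```python
import re

def find_artist_links(base_url, text):
    """Return links under base_url, skipping the four excluded sub-paths."""
    pattern = re.compile(
        re.escape(base_url)
        + r'(?!(?:uploads|favorites|followers|listens)/)'  # negative lookahead
        + r'[^\s"]+',
        re.IGNORECASE,
    )
    return pattern.findall(text)

# Hypothetical values, not from the question.
download_artist = "https://example.com/artist/"
page = (
    '<a href="https://example.com/artist/uploads/">'
    '<a href="https://example.com/artist/mix-one/">'
    '<a href="https://example.com/artist/favorites/">'
)
print(find_artist_links(download_artist, page))
```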
^ only works as a negation in character classes, that is, inside []; outside them it anchors the match to the beginning of the input.
I suggest you do two matches: One to match all urls and another one to match the ones to exclude. Then remove the second set of urls from the first one.
That will keep the regexes simple and readable.
If you have to do it in one regex for whatever reason, you can try to solve it with a (negative) lookaround pattern (see https://www.rexegg.com/regex-lookarounds.html).
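The two-match approach could look roughly like this. It is only a sketch: the function name, the base URL and the sample markup are hypothetical, and it assumes links end at whitespace or a double quote.

```python
import re

def links_excluding(base_url, text):
    """Match all links, match the unwanted ones, then remove the second set."""
    base = re.escape(base_url)
    all_links = re.findall(base + r'[^\s"]+', text, re.IGNORECASE)
    unwanted = set(
        re.findall(
            base + r'(?:uploads|favorites|followers|listens)/[^\s"]*',
            text,
            re.IGNORECASE,
        )
    )
    return [link for link in all_links if link not in unwanted]

# Hypothetical values, not from the question.
base_url = "https://example.com/artist/"
html = (
    '<a href="https://example.com/artist/uploads/">'
    '<a href="https://example.com/artist/mix-one/">'
    '<a href="https://example.com/artist/followers/">'
)
print(links_excluding(base_url, html))
```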

Regex to extract file paths except urls

I have a large text containing some file paths, and I need a regex which can help me extract all the paths. Currently I'm using this one:
\/.+?\/[\w]+\.\w+
It works almost perfectly, but links containing a filename or a dot at the end are interpreted as paths too, like this one:
http://example.com/index.html
Help in providing a valid regular expression is highly appreciated. Also if you can add support of spaces in paths in this regex, it would be awesome. Thanks in advance!
You could try a negative look-behind to exclude the "http:" and "https:" prefix.
(?<!https:)(?<!http:)(?<!/)(?<!\w)((/[^\s]+)?/\w+\.\w+)
If you try it with these test strings in pythex:
/abc/def/def.ps
/abc/def/ttt/def.ps
/test.txt
/abc/test.txt http://example.com/index.html
http://www.google.com/bla/test/index.html https://www.google.com/bla/test/index.html
It will only match the first 4.
The advantage of this regular expression is that it does not rely on the beginning of the line to work.
You can add as many look behinds as you wish to support other protocols, like ftp, etc.
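The look-behind pattern can be checked in Python against the same test strings. This is a sketch: the helper name extract_paths is made up, and the inner group has been made non-capturing (an adjustment of this sketch, not part of the answer) so that findall returns plain strings.

```python
import re

# Same look-behinds as above; only the inner group is changed to (?:...).
PATH_RE = re.compile(
    r'(?<!https:)(?<!http:)(?<!/)(?<!\w)((?:/[^\s]+)?/\w+\.\w+)'
)

def extract_paths(text):
    """Return file paths found in text, skipping http/https URLs."""
    return PATH_RE.findall(text)

sample = (
    "/abc/def/def.ps\n"
    "/abc/def/ttt/def.ps\n"
    "/test.txt\n"
    "/abc/test.txt http://example.com/index.html\n"
    "http://www.google.com/bla/test/index.html "
    "https://www.google.com/bla/test/index.html"
)
print(extract_paths(sample))
```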
Try this: ^\/.+?\/[\w]+\.\w+$ with multi-line mode enabled.

Regex to capture url until a certain character

With a url such as
https://search.yahoo.com/search?p=Fetty+Wap&fr=fp-tts&
I am using
pat = re.compile('<a href="(https?://.*?)".*',re.DOTALL)
as a search pattern.
I want to pick any url like the yahoo url above, but I want to capture the url up to the literal ? in the actual url.
In other words, I want to extract the URL up to the ?, knowing that not all the URLs I'm parsing have the ? character. In that case I need to capture the whole URL.
The above regex works and extracts the url but goes to the end of the url. How can I get it to stop at the first ? it encounters, and keep going to the end if it doesn't encounter a ?
Regex is really the wrong tool for the job. Doing a basic string split will get you exactly what you want.
def beforeQuestionMrk(inputStr):
    return inputStr.split("?")[0]
url = "https://search.yahoo.com/sometext"
url2 = "https://search.yahoo.com/search?p=Fetty+Wap&fr=fp-tts&"
print(beforeQuestionMrk(url))
print(beforeQuestionMrk(url2))
#https://search.yahoo.com/sometext
#https://search.yahoo.com/search
If you really wanted to use regex, I suppose you could do the following:
import re
def getBeforeQuestRegex(inputStr):
    return re.search(r"(.+?\?|.+)", inputStr).group(0)
print(getBeforeQuestRegex("https://search.yahoo.com/search?p=Fetty+Wap&fr=fp-tts&"))
print(getBeforeQuestRegex("https://search.yahoo.com/sometext"))
#https://search.yahoo.com/search?
#https://search.yahoo.com/sometext
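Note that the regex version above keeps the trailing ? in its result. If the goal is to stop just before it, a negated character class is one option. A small sketch, with the function name before_question_mark invented for illustration:

```python
import re

def before_question_mark(url):
    """Return everything before the first '?', or the whole string if there is none."""
    match = re.match(r'[^?]+', url)
    return match.group(0) if match else ""

print(before_question_mark("https://search.yahoo.com/search?p=Fetty+Wap&fr=fp-tts&"))
print(before_question_mark("https://search.yahoo.com/sometext"))
```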
Bobble bubble's solution above worked very well for me:
"You can try like this by use of negated class: <a[^>]*?href="(http[^"?]+)" <- bobble's answer.
The URL looks like this:
https://search.yahoo.com/search?p=Justin+Bieber&fr=fp-tts&fr2=p:fp,m:tn,ct:all......
or it could be something like this
https://www.yahoo.com/style/5-joyful-bob-ross-tees-202237009.html
The objective was to extract the full URL if there was no literal ? in it, but if there was one, to stop just before the literal ?.
That was Bobble bubble's answer; it works very cleanly and does what I wanted. Again, thank you to everyone for participating in this discussion, I really appreciate it.
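For reference, a sketch of how a negated character class can be applied to markup. The exact pattern below is an assumption modeled on the approach described in this answer, and the HTML snippet is made up around the two example URLs (the first one shortened).

```python
import re

# [^"?]+ stops at the closing quote or at the first '?', whichever comes first.
HREF_RE = re.compile(r'<a[^>]*?href="(http[^"?]+)', re.IGNORECASE)

# Hypothetical markup built for this example.
html = (
    '<a class="x" href="https://search.yahoo.com/search?p=Justin+Bieber&fr=fp-tts">a</a>\n'
    '<a href="https://www.yahoo.com/style/5-joyful-bob-ross-tees-202237009.html">b</a>'
)
urls = HREF_RE.findall(html)
print(urls)
```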
I agree with the other answer that regex is not the right solution here, especially because there may be any number of attributes between the opening of the <a> tag and the href attribute, and there can be a newline in between too.
But, to answer the initial question:
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible.
That's why there are non-greedy versions of them:
'*?', '+?' and '??'
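A quick illustration of the difference, using a made-up tag:

```python
import re

# Hypothetical tag with a second attribute after href.
tag = '<a href="https://a.example/x" id="1">'

# Greedy: .* runs to the end, then backs off to the LAST double quote.
greedy = re.search(r'href="(.*)"', tag).group(1)

# Non-greedy: .*? stops at the FIRST double quote after href=".
lazy = re.search(r'href="(.*?)"', tag).group(1)

print(greedy)
print(lazy)
```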

Tuning of a Web handler regex routes configuration

In a web handler routes configuration I have the following Regex:
('/post/(\w+)/.*', foo.app.WebHandlerFooClass)
This regex matches these kinds of URLs:
/post/HUIHUIGgS823SHUIH/this-is-the-slug
/post/HUIHUIGgS823SHUIH/
passing the correct HUIHUIGgS823SHUIH Id parameter to the web handler matched by the (\w+) group.
How could I modify the above Regex to match also this url?
/post/HUIHUIGgS823SHUIH
The handler is coded to accept just one parameter, the base64 Id, so there should be just one group in the Regex that matches the Id.
So, these are the urls that should be matched:
/post/HUIHUIGgS823SHUIH/this-is-the-slug
/post/HUIHUIGgS823SHUIH/
/post/HUIHUIGgS823SHUIH <-- Hey, I wanna this too
'/post/([\w-]+)(?:/([\w-]+))?/?'
This matches the following.
/post/HUIHUIGgS823SHUIH/this-is-the-slug
/post/HUIHUIGgS823SHUIH/this-is-the-slug/
/post/HUIHUIGgS823SHUIH/
/post/HUIHUIGgS823SHUIH
I think this is a better implementation because it captures only the pieces you want, e.g. the slug doesn't capture a trailing /. However, your spec is still slightly unclear to me, so this may not be your intention.
If you don't care about the data at the end, then why not just use this?
'/post/(\w+).*'
Otherwise you'll have to provide more info.
I think you just want:
'/post/([^/]+).*'
But that seems too simple an answer :)
If I guessed your real intention right, then you are fine with this one:
'/post/(\w+)'
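Whether any of these patterns works depends on how the framework anchors routes. The sketch below assumes the whole path must match (emulated here with re.fullmatch) and uses one possible single-group variant that is not taken from the answers above; the names ROUTE_RE and post_id are hypothetical.

```python
import re

# One capturing group for the Id; the optional "/slug" tail is non-capturing.
ROUTE_RE = re.compile(r'/post/(\w+)(?:/.*)?')

def post_id(path):
    """Return the Id if the whole path matches the route, else None."""
    match = ROUTE_RE.fullmatch(path)
    return match.group(1) if match else None

for path in (
    "/post/HUIHUIGgS823SHUIH/this-is-the-slug",
    "/post/HUIHUIGgS823SHUIH/",
    "/post/HUIHUIGgS823SHUIH",
):
    print(path, "->", post_id(path))
```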

What is wrong with this Regular Expression?

I am trying to create a test to verify that a link is rendered on a webpage.
I'm not understanding what I'm doing wrong on this assertion test:
self.assertRegexpMatches( response.content, r'elite')
I know that the markup is on the page because I copied it from response.content
I tried to use the regular expression in the Python shell:
In [27]: links = """<div class="tabsA">activenewesthottestmost votedelite</div>"""
In [28]: re.search(r'elite', links)
For some reason it's not working there either.
How do I create the regular expression so it works?
Why are you using a regex here? There's absolutely no reason to. You're just matching a simple string. Use:
self.assertContains(response, 'elite')
The ? in your regex is getting interpreted as a ? quantifier (end of this part):
<a href="/questions/?...
Thus the engine never matches the literal ? that appears in the string, and instead matches an optional / at that position. Escape it with a backslash like so:
<a href="/questions/\?...
You should escape "?", because that symbol has a special meaning in regex.
>>> re.search(r'elite', links)
The ? character is a special RegEx Character and must be escaped.
The following regexp would work:
elite
Note the \ before the ?
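A small sketch of the effect. The markup below is hypothetical, since the real link from the question is not shown here; re.escape is offered as an alternative to escaping by hand when the whole string should be matched literally.

```python
import re

# Hypothetical markup and link; stand-ins for the ones in the question.
markup = '<div class="tabsA"><a href="/questions/?sort=elite">elite</a></div>'
link = '<a href="/questions/?sort=elite">'

# Unescaped: "/?" means "an optional slash", so the literal "?" never matches.
unescaped = re.search(link, markup)

# Escaped by hand: "\?" matches a literal question mark.
escaped_by_hand = re.search(r'<a href="/questions/\?sort=elite">', markup)

# re.escape escapes every special character in the string for you.
escaped_auto = re.search(re.escape(link), markup)

print(unescaped, escaped_by_hand is not None, escaped_auto is not None)
```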
A great tool for messing around with RegEx can be found here:
http://regexpal.com/
It can save you an awful lot of time and headaches...
It's probably the "<" and ">" characters. In some regular expression syntaxes they are special characters that indicate beginning and end of line.
You might look at a regular expression tester tool to help you learn them.
