Tuning of a Web handler regex routes configuration - python

In a web handler routes configuration I have the following Regex:
('/post/(\w+)/.*', foo.app.WebHandlerFooClass)
This regex matches URLs like these:
/post/HUIHUIGgS823SHUIH/this-is-the-slug
/post/HUIHUIGgS823SHUIH/
passing the correct HUIHUIGgS823SHUIH Id parameter to the web handler matched by the (\w+) group.
How could I modify the above Regex to match also this url?
/post/HUIHUIGgS823SHUIH
The handler is coded to accept just one parameter, the base64 Id, so there should be just one group in the Regex that matches the Id.
So, these are the urls that should be matched:
/post/HUIHUIGgS823SHUIH/this-is-the-slug
/post/HUIHUIGgS823SHUIH/
/post/HUIHUIGgS823SHUIH <-- Hey, I wanna this too

'/post/([\w-]+)(?:/([\w-]+))?/?'
This matches the following.
/post/HUIHUIGgS823SHUIH/this-is-the-slug
/post/HUIHUIGgS823SHUIH/this-is-the-slug/
/post/HUIHUIGgS823SHUIH/
/post/HUIHUIGgS823SHUIH
I think this is a better implementation because it captures only the pieces you want, e.g. the slug doesn't capture a trailing /. However, your spec is still slightly unclear to me, so this may not be your intention.

If you don't care about the data at the end, then why not just use this?
'/post/(\w+).*'
Otherwise you'll have to provide more info.

I think you just want:
'/post/([^/]+).*'
But that seems too simple an answer :)

If I guessed your real intention correctly, then this one should be enough:
'/post/(\w+)'
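To check the three URLs from the question against a single-group pattern, here is a minimal sketch. It assumes the framework matches the route against the whole path, which is emulated here with re.fullmatch; the pattern is an illustrative variant, not one taken verbatim from the answers above.

```python
import re

# Illustrative route pattern: one capturing group for the Id, followed by
# an optional, non-captured "/anything" tail.
ROUTE = re.compile(r'/post/(\w+)(?:/.*)?')

paths = [
    '/post/HUIHUIGgS823SHUIH/this-is-the-slug',
    '/post/HUIHUIGgS823SHUIH/',
    '/post/HUIHUIGgS823SHUIH',
]

# fullmatch stands in for a router that anchors the pattern at both ends
# (an assumption about the framework).
ids = [ROUTE.fullmatch(p).group(1) for p in paths]
print(ids)
```

Because the tail is wrapped in (?:...), the handler still receives exactly one argument.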

Related

regex to add conditional values in python

I am working with regex in Python and trying to write a pattern so that if the URL uses https it must contain www3, and if it uses http it must contain www. My solution works for https, but for http the match does not include http. Can anybody help me correct this?
st='''
https://www3.yahoo.com
http://www.yahoo.com
'''
p=re.compile(r'(https)?://(?(1)www3|www)\.\w+\.\w+')
It would seem the simplest solution is just to write out both alternatives:
st = '''
https://www3.yahoo.com
https://www.yahoo.com
http://www3.yahoo.com
http://www.yahoo.com
'''
p = re.compile(r'http(?:s://www3|://www)\.\w+\.\w+')
p.findall(st)
Output:
['https://www3.yahoo.com', 'http://www.yahoo.com']
A plain solution, simple but it works:
re.findall(r'(http(?P<s>s)?://www(?(s)3|)\..*)', """
https://www3.yahoo.com
http://www.yahoo.com
http://www3.yahoo.com
https://www34.yahoo.com
""")
[('https://www3.yahoo.com', 's'), ('http://www.yahoo.com', '')]
Explanation
(?P<s>s): (?P<name>...) gives the group a name.
(?(s)...): (?(id/name)...) refers to a group that was matched earlier.
(?(s)3|): (?(id/name)yes-pattern|no-pattern) uses the yes-pattern if the group matched, and the no-pattern otherwise.
Advice
A group id such as (1) does not always work: you have to keep track of the group order and calculate the index yourself, which often leads to errors.
A named group (name) is a good way to avoid that problem.
Reference
docs.python/re
For the conditional to work, you have to make only the s char optional
http(s)?://(?(1)www3|www)\.\w+\.\w+
Note that \.\w+\.\w+ matches only a limited set of URLs. For a broader match, you could use \S+ to match any run of non-whitespace characters.
http(s)?://(?(1)www3|www)\.\S+
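To see why the pattern from the question loses the http part, here is a small comparison of the original pattern and the one with only the s made optional. The sample text reuses the URLs from this thread; the variable names are just for illustration.

```python
import re

st = '''
https://www3.yahoo.com
https://www.yahoo.com
http://www3.yahoo.com
http://www.yahoo.com
'''

# Original pattern: the whole "https" is optional, so for plain http the
# match can only start at "://".
original = re.compile(r'(https)?://(?(1)www3|www)\.\w+\.\w+')
# Fixed pattern: "http" is mandatory and only the "s" is optional.
fixed = re.compile(r'http(s)?://(?(1)www3|www)\.\w+\.\w+')

original_matches = [m.group(0) for m in original.finditer(st)]
fixed_matches = [m.group(0) for m in fixed.finditer(st)]
print(original_matches)
print(fixed_matches)
```

finditer with group(0) is used instead of findall so that the full match is returned rather than the content of the capturing group.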

Python regex to exclude several words

I am searching for URLs and want to exclude some. The variable download_artist holds the base URL, and I want to find additional links, but not uploads, favorites, followers or listens.
So I tried different versions with the mentioned words and a |. Like:
urls = re.findall(rf'^{download_artist}uploads/|{download_artist}^favorites/|^{download_artist}followers/|^{download_artist}listens/|{download_artist}\S+"', response.text, re.IGNORECASE)
or:
urls = re.findall(rf'{download_artist}^uploads/|^favorites/|^followers/|^listens/|\S+"', response.text, re.IGNORECASE)
But it ignores my ^ for excluding the words. Where is my mistake?
You need to use a "lookaround" in this case; see https://www.regular-expressions.info/lookaround.html for more details.
So I think this regex solves your problem:
{download_artist}(?!uploads/|favorites/|followers/|listens/)\S+\"
You can test whether the regex works at https://regex101.com/. This site is very useful when you work with regex.
^ only works as a negation in character classes inside [], outside it represents the beginning of the input.
I suggest you do two matches: One to match all urls and another one to match the ones to exclude. Then remove the second set of urls from the first one.
That will keep the regexes simple and readable.
If you have to do it in one regex for whatever reason, you can try to solve it with a (negative) lookaround pattern (see https://www.rexegg.com/regex-lookarounds.html).
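A minimal sketch of the negative-lookahead approach follows. The base URL and the HTML snippet are made up for illustration, and [^"\s]+ is used instead of \S+" so that the closing quote is not part of the result; neither detail comes from the question.

```python
import re

# Hypothetical base URL and page content, for illustration only.
download_artist = "https://example.com/artist/"
html = '''
<a href="https://example.com/artist/uploads/">
<a href="https://example.com/artist/favorites/">
<a href="https://example.com/artist/some-mix/">
<a href="https://example.com/artist/listens/">
<a href="https://example.com/artist/another-show/">
'''

# re.escape keeps characters such as "." in the base URL literal.
# (?!...) rejects a link when one of the listed words follows the base URL.
pattern = re.compile(
    re.escape(download_artist)
    + r'(?!uploads/|favorites/|followers/|listens/)[^"\s]+',
    re.IGNORECASE,
)

urls = pattern.findall(html)
print(urls)
```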

Improving accuracy/brevity of regex for inconsistent url filtering

So, for some lulz, a friend and I were playing with the idea of filtering a list (100k+) of urls to retrieve only the parent domain (ex. "domain.com|org|etc"). The only caveat is that they are not all nice and matching in format.
So, to explain, some may be "http://www.domain.com/urlstuff", some have country codes like "www.domain.co.uk/urlstuff", while others can be a bit more odd, more akin to "hello.in.con.sistent.urls.com/urlstuff".
So, story aside, I have a regex that works:
import re
firsturl = 'www.foobar.com/fizz/buzz'
m = re.search('\w+(?=(\..{3}/|\..{2}\..{2}/))\.(.{3}|.{2}\..{2})', firsturl)
m.group(0)
which returns:
foobar.com
It looks up the first "/" at the end of the url, then returns the two "." separated fields before it.
So, my query, would anyone in the stack hive mind have any wisdom to shed on how this could be done with better/shorter regex, or regex that doesn't rely on a forward lookup of the "/" within the string?
Appreciation for all of the help in this!
I do think that regex is just the right tool for this. Regex is pattern matching, which is put to best use when you have a known pattern that might have several variations, as in this case.
In your explanation of and attempted solution to the problem, I think you are greatly oversimplifying it, though. TLDs come in many more flavors than 2-letter country codes and 3-letter others. See ICANN's list of top-level domains for the hundreds currently available, with lengths from 2 characters and up. Also, you may have URLs without any slashes and some with multiple slashes and dots after the domain name.
So here's my solution (see on regex101):
^(?:https?://)?(?:[^/]+\.)*([^/]+\.[a-z]{2,})
What you want is captured in the first matching group.
Breakdown:
^(?:https?://)? matches a possible protocol at the beginning
(?:[^/]+\.)* matches possible multiple non-slash sequences, each followed by a dot
([^/]+\.[a-z]{2,}) matches (and captures) one final non-slash sequence followed by a dot and the TLD (2+ letters)
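Here is a short sketch that runs this pattern over the example URLs from the question. The helper name parent_domain is made up for illustration. Note that for a two-part suffix such as co.uk the pattern as written captures only the last two labels, so that case would need extra handling.

```python
import re

DOMAIN_RE = re.compile(r'^(?:https?://)?(?:[^/]+\.)*([^/]+\.[a-z]{2,})')

def parent_domain(url):
    """Return the first capturing group, or None if the URL does not match."""
    m = DOMAIN_RE.match(url)
    return m.group(1) if m else None

samples = [
    'www.foobar.com/fizz/buzz',
    'http://www.domain.com/urlstuff',
    'hello.in.con.sistent.urls.com/urlstuff',
    'www.domain.co.uk/urlstuff',  # two-part suffix: only "co.uk" is captured
]

results = [parent_domain(u) for u in samples]
print(results)
```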
You can use this regex instead:
import re
firsturl = 'www.foobar.com/fizz/buzz'
domain = re.match(r"(.+?)/", firsturl).group(1)  # everything before the first "/"
Notice, though, that this will only work without 'http://'.

Regex to capture url until a certain character

With a url such as
https://search.yahoo.com/search?p=Fetty+Wap&fr=fp-tts&
I am using
pat = re.compile('<a href="(https?://.*?)".*',re.DOTALL)
as a search pattern.
I want to pick any url like the yahoo url above, but I want to capture the url up to the literal ? in the actual url.
In other words, I want to extract the URL up to the ?, knowing that not all the URLs I'm parsing have the ? character. In that case I need to capture the whole URL.
The above regex works and extracts the url but goes to the end of the url. How can I get it to stop at the first ? it encounters, and keep going to the end if it doesn't encounter a ?
Regex is really the wrong tool for the job. Doing a basic string split will get you exactly what you want.
def beforeQuestionMrk(inputStr):
    return inputStr.split("?")[0]

url = "https://search.yahoo.com/sometext"
url2 = "https://search.yahoo.com/search?p=Fetty+Wap&fr=fp-tts&"
print(beforeQuestionMrk(url))
print(beforeQuestionMrk(url2))
#https://search.yahoo.com/sometext
#https://search.yahoo.com/search
If you really wanted to use regex, I suppose you could do the following:
import re

def getBeforeQuestRegex(inputStr):
    # [^?]+ matches everything up to, but not including, the first "?"
    return re.search(r"[^?]+", inputStr).group(0)

print(getBeforeQuestRegex("https://search.yahoo.com/search?p=Fetty+Wap&fr=fp-tts&"))
print(getBeforeQuestRegex("https://search.yahoo.com/sometext"))
#https://search.yahoo.com/search
#https://search.yahoo.com/sometext
Bobble bubble's solution above worked very well for me:
"You can try like this by use of negated class: ]*?href="(http[^"?]+)"" <- Bobble bubble's answer.
The URL looks like this:
https://search.yahoo.com/search?p=Justin+Bieber&fr=fp-tts&fr2=p:fp,m:tn,ct:all......
or it could be something like this:
https://www.yahoo.com/style/5-joyful-bob-ross-tees-202237009.html
The objective was to extract the full URL if there was no literal ? in it, and otherwise to stop just before the literal ?.
Bobble bubble's answer works very cleanly and does what I wanted. Again, thank you to everyone for participating in this discussion; I really appreciate it.
I agree with the other answer that regex is not the right solution here, especially because there may be any number of attributes between the opening of the <a> tag and the href attribute, and there can be a newline in between too.
But, to answer the initial question:
'*', '+', and '?' qualifiers are all greedy - they match as much text as possible
that's why there are non-greedy versions of them:
'*?', '+?' and '??'
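The difference can be seen in a small sketch. The HTML string is invented for illustration, and the last pattern is a simplified variant of the negated-class idea mentioned in this thread, not the exact pattern from that answer.

```python
import re

# Made-up snippet containing two links, one with a query string.
html = (
    '<a class="x" href="https://search.yahoo.com/search?p=Fetty+Wap&fr=fp-tts&">a</a> '
    '<a href="https://www.yahoo.com/style/5-joyful-bob-ross-tees-202237009.html">b</a>'
)

# Greedy: .* runs to the last double quote in the string.
greedy = re.search(r'href="(.*)"', html).group(1)
# Non-greedy: .*? stops at the first closing quote.
lazy = re.search(r'href="(.*?)"', html).group(1)
# Negated class: stop at the first quote or question mark, whichever comes first.
negated = re.findall(r'href="(https?://[^"?]+)', html)

print(len(greedy), len(lazy))
print(negated)
```

The negated class needs no backtracking to find the end of the URL, and it handles both the "has a ?" and "no ?" cases with one pattern.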

How to parse a list in Django urlparser?

On Stack Overflow, you can view a list of questions with multiple tags at a URL such as http://stackoverflow.com/questions/tagged/django+python.
I'd like to do something similar in a project I am working on, where one of the url parameters would be a list of tags, but I'm not sure how to write a regex urlparser that can parse it out. I'm fond of SO's way of using the + sign, but it's not a dealbreaker. I also imagine that the urlparser may have to take the whole string (foo+bar+baz) as a single variable to give to the view, which is also fine as I can just split it in the view itself- that is, I'm not expecting the URL parser to give the view an already split list, but if it can, even better!
Right now all I have is:
url(r'^documents/tag/(?P<tag>\w+)/$', ListDocuments.as_view(), name="list_documents"),
Which just pulls out one single tag since \w+ just gets me those [A-Za-z0-9_], but not +. I tried something like:
url(r'^documents/tag/(?P<tag>[\w+\+*])/$', ListDocuments.as_view(), name="list_documents"),
But this no longer matched documents/tag/foo nor documents/tag/foo+bar.
Please assist, I'm not so great with regex, thanks!
It's not possible to do this automatically. From the documentation: "Each captured argument is sent to the view as a plain Python string, regardless of what sort of match the regular expression makes." Splitting it in the view is the way to go.
The second regex in your answer is OK, but it does allow some things you might not want (e.g. 'django+++python+'). A stricter version might be something like: (?P<tag>\w+(?:\+\w+)*). Then you can just do a simple tag.split('+') in the view without worrying about any edge cases.
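As a rough illustration of the stricter pattern plus the split in the view, here is a sketch that uses only the re module, outside of Django. The helper parse_tags is hypothetical and only mimics what the URL resolver and the view would do together.

```python
import re

# Same shape as the url() pattern from the question, with the stricter group.
TAG_ROUTE = re.compile(r'^documents/tag/(?P<tag>\w+(?:\+\w+)*)/$')

def parse_tags(path):
    """Return the list of tags for a matching path, or None if it does not match."""
    m = TAG_ROUTE.match(path)
    if m is None:
        return None
    # The captured value is one plain string; splitting happens afterwards.
    return m.group('tag').split('+')

print(parse_tags('documents/tag/foo/'))
print(parse_tags('documents/tag/foo+bar+baz/'))
print(parse_tags('documents/tag/django+++python+/'))
```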
This works for now:
url(r'^documents/tag/(?P<tag>[A-Za-z0-9_\+]+)/$', ListDocuments.as_view(), name="list_documents"),
But I'd like to be able to get that \w back in there instead of the full list of characters like that.
[Edit]
Here we go:
url(r'^documents/tag/(?P<tag>[\w\+]+)/$', ListDocuments.as_view(), name="list_documents"),
I will still select a better answer if there is a way for the Django urlparser to give the view an actual list instead of just one big long string, but if that's not possible, this solution does work.
