Unable to match trailing slash after email address in url regex - python

I can't match an email address with a trailing slash in my url regex, and I can't figure out why. Here's the regex that matches the email address with no trailing slash:
r'^customer/(?P<customer_email>[\w.%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,63}$)$'
As expected, this matches /customer/someone#example.com, and not /customer/someone#example.com/
I would have thought that appending /? would work, given that the regex to match the domain suffix of the email address should not greedily match the slash. (This was the solution to many of the other duplicate regex trailing-slash questions).
r'^customer/(?P<customer_email>[\w.%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,63}$)/?$'
As expected, this matches /customer/someone#example.com, but unexpectedly it doesn't match /customer/someone#example.com/. Why?
APPEND_SLASH in settings.py is not set. I don't want to capture the slash as part of the customer_email url parameter.

The $ anchor asserts the end of the string. Your pattern has a $ inside the capturing group, right after {2,63}, so the string must end at that point and the /? that follows can never consume a slash.
Thus, you need to remove the first $ in your pattern and use
^customer/(?P<customer_email>[\w.%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,63})/?$
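The difference can be checked outside Django with Python's re module. This is only a minimal sketch: the helper name customer_email and the sample paths are made up for illustration, and the paths omit the leading slash because the pattern itself starts at customer/.

```python
import re

# Pattern from the question: the `$` inside the group forces end-of-string
# before the optional slash can be tried.
BROKEN = re.compile(
    r'^customer/(?P<customer_email>[\w.%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,63}$)/?$'
)

# Same pattern with the inner `$` removed.
FIXED = re.compile(
    r'^customer/(?P<customer_email>[\w.%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,63})/?$'
)


def customer_email(pattern, path):
    """Return the captured customer_email, or None if the path does not match."""
    match = pattern.match(path)
    return match.group('customer_email') if match else None


with_slash = 'customer/someone#example.com/'
without_slash = 'customer/someone#example.com'

print(customer_email(BROKEN, with_slash))     # None
print(customer_email(FIXED, with_slash))      # someone#example.com
print(customer_email(FIXED, without_slash))   # someone#example.com
```

In both matching cases the slash stays outside the named group, so it is not passed on as part of customer_email.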

Related

Python / regex - Extract string between nth and nth character

I have multiple url values, for ex:
https://www.happy.com/de/article/98238811/poppers
https://www.happy.com/sr
https://www.happy.com/en/forum/ocean-liveliness
I want to extract the values between the 3rd and 4th slash (if 4th slash exists) (ex: de, sr, en)
between the 4th and 5th slash (ex: article, forum)
I'm terrible at regex, I've tried this [\/]*[^\/]+[\/]([^\/]+) but it seems to get all values including www.happy. which I don't want.
I agree with the other answers/comments that the split function is easier, but if you insist on regex, there is the \K operator, which discards the portion of the match to its left. Note that \K is supported by the third-party regex module, not by Python's built-in re module.
So, ^(?:.*?\/){3}\K.*?(?=\/|$) will search for three slashes from the beginning of each line, then discard them from the match, do a non-greedy match .*? to pick up the result you want, then do a lookahead to stop your match on a slash or end of line, whichever is encountered first. The lookahead will not be included in the match.
Make sure you include the MULTILINE flag (regex.M) if you are scanning multiple lines at once, so that ^ and $ match at the beginning/end of each line as well as at the beginning/end of the string.
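If only the built-in re module is available, a capturing group can stand in for \K. The sketch below is a swapped-in variant, not the exact pattern above: the name THIRD_SEGMENT is made up, and the three sample URLs are the ones from the question.

```python
import re

# Skip everything up to and including the third slash, then capture
# lazily until the next slash or the end of the line.
THIRD_SEGMENT = re.compile(r'^(?:.*?/){3}(.*?)(?=/|$)', re.M)

urls = (
    "https://www.happy.com/de/article/98238811/poppers\n"
    "https://www.happy.com/sr\n"
    "https://www.happy.com/en/forum/ocean-liveliness"
)

# With one capturing group, findall returns the captured text per match.
segments = THIRD_SEGMENT.findall(urls)
print(segments)  # ['de', 'sr', 'en']
```

Changing {3} to {4} would target the next segment instead; lines that do not have that many slashes simply produce no match.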
In this case you might not even need a regular expression. Simply split the string on slashes and check the returned chunks.
For example:
>>> "https://www.happy.com/de/article/98238811/poppers".split('/')[3]
'de'
>>> "https://www.happy.com/de/article/98238811/poppers".split('/')[4]
'article'
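Since some of the URLs are shorter than others, indexing the split result directly can raise IndexError. One possible way to guard against that is sketched below; path_segments and segment are hypothetical helper names, and urlsplit from the standard library is used here so that the scheme and host never end up among the chunks.

```python
from urllib.parse import urlsplit


def path_segments(url):
    """Return the non-empty path components of a URL."""
    return [part for part in urlsplit(url).path.split('/') if part]


def segment(url, index):
    """Return the path component at `index`, or None if the URL is too short."""
    parts = path_segments(url)
    return parts[index] if index < len(parts) else None


long_url = "https://www.happy.com/de/article/98238811/poppers"
short_url = "https://www.happy.com/sr"

print(segment(long_url, 0))   # de
print(segment(long_url, 1))   # article
print(segment(short_url, 0))  # sr
print(segment(short_url, 1))  # None
```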

Python regular expression domain names

I am trying to extract multiple domain names that end in .com either starting with https or http from a string.
The string is:
string="jssbhshhahttps://www.one.comsbshhshshttp://www.another.comhehsbwkwkwjhttp://www.again.co.uksbsbs"
I have created the pattern as follows:
pattern=re.compile("https?://")
I am not sure how to finish it off.
I would like to return a list of all domains that start with http or https and end in .com only, so no .co.uk domains in the output.
I have tried using (.*) in the middle to represent unlimited combinations of characters, but I am not sure how to finish it off.
Any help would be much appreciated and it would be great if all parts of the expression could be explained.
You can use
https?://(?:(?!https?://)\S)*?\.com
You may pass the re.I flag or add the (?i) inline flag to make the regex case insensitive.
Details
https?:// - http:// or https://
(?:(?!https?://)\S)*? - any non-whitespace char, zero or more but as few as possible occurrences, that does not start an http:// or https:// char sequence (this regex construct is known as a "tempered greedy token")
\.com - a .com string.
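Applied to the string from the question, the pattern can be tried with re.findall. This is just a sketch; the variable names DOTCOM_URL and text are made up for illustration.

```python
import re

text = (
    "jssbhshhahttps://www.one.comsbshhshshttp://www.another.com"
    "hehsbwkwkwjhttp://www.again.co.uksbsbs"
)

# Tempered greedy token: consume non-whitespace chars one at a time,
# refusing to step over the start of another http:// or https://.
DOTCOM_URL = re.compile(r'https?://(?:(?!https?://)\S)*?\.com')

# No capturing groups are used, so findall returns the full matches.
domains = DOTCOM_URL.findall(text)
print(domains)  # ['https://www.one.com', 'http://www.another.com']
```

The .co.uk URL is skipped because no .com follows its http:// before the string ends.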

Django URLs - trailing slash gets added to variable value

I have a django application hosted with Apache. I'm busy using the django restframework to create an API, but I am having issues with URLs. As an example, I have a URL like this:
url(r'path/to/endpoint/(?P<db_id>.+)/$', views.PathDetail.as_view())
If I try to access this url and don't include the trailing slash, it will not match. If I add a question mark at the end like this:
url(r'path/to/endpoint/(?P<db_id>.+)/?', views.PathDetail.as_view())
This matches with and without a trailing slash. The only issue is that if a trailing slash is used, it now gets included in the db_id variable in my view. So when it searches the database, the id doesn't match. I don't want to have to go through all my views and remove trailing slashes from my url variables using string handling.
So my question is, what is the best way to make a url match both with and without a trailing slash without including that trailing slash in a parameter that gets sent to the view?
Your pattern for the parameter is .+, which means 1 or more of any character, including /. No wonder the slash is included in it, why wouldn't it?
If you want the pattern to include anything but /, use [^/]+ instead. If you want the pattern to include anything except slashes at the end, use .*[^/] for the pattern.
The .+ part of your regex will match one or more characters. This match is "greedy", meaning it will match as many characters as it can.
Check out: http://www.regular-expressions.info/repeat.html.
In the first case, the / has to be there for the full pattern to match.
In the second case, when the slash is missing, the pattern will match anyway because the slash is optional.
If the slash is present, the greedy db_id field will expand to the end (including the slash) and the slash will not match anything, but the overall pattern will still match because the slash is optional.
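Some easy solutions would be to make db_id non-greedy with the ? modifier and anchor the end of the pattern, (?P<db_id>.+?)/?$, or to make the field not match any slashes: (?P<db_id>[^/]+)/?$. The $ matters for the non-greedy version: without it, .+? is satisfied after a single character.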
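The behaviours described above can be compared with plain re, without a Django project. This is a rough sketch under a few assumptions: the helper db_id and the sample path are made up, re.search is used as an approximation of how the old url() patterns are matched, and the two fixed patterns carry a trailing $ so that the non-greedy version cannot stop after one character.

```python
import re

# Pattern from the question: greedy `.+` swallows the trailing slash.
GREEDY = re.compile(r'path/to/endpoint/(?P<db_id>.+)/?')

# Non-greedy field, anchored at the end of the path.
LAZY = re.compile(r'path/to/endpoint/(?P<db_id>.+?)/?$')

# Field that can never contain a slash.
NO_SLASH = re.compile(r'path/to/endpoint/(?P<db_id>[^/]+)/?$')


def db_id(pattern, path):
    """Return the captured db_id, or None if the path does not match."""
    match = pattern.search(path)
    return match.group('db_id') if match else None


path = 'path/to/endpoint/42/'

print(db_id(GREEDY, path))    # 42/
print(db_id(LAZY, path))      # 42
print(db_id(NO_SLASH, path))  # 42
```

The [^/]+ version also rejects ids that contain a slash in the middle, which may or may not be what the endpoint needs.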

repeat only once character regex in python

I want to match URLs in which the slash (/) appears only once after the fixed prefix. The URL must start with http://www.website/shop. I am currently using the pattern below, and it matches all of the .html URLs. How can I restrict it?
'^[http://www.website/shop/].*.html$'
With this regex, I get no result:
^[http://]\W*(/){2}\W
Valid
http://www.website/shop/men.html
http://www.website/shop/women.html
Invalid
http://www.website/shop/men/footwear.html
http://www.website/shop/men/causal.html
http://www.website/shop/women/footwear.html
This regex should work (the dots are escaped so that they match only a literal dot):
"http://www\.website/shop/\w*\.html"
It won't match if there is another slash after the one in .../shop/, because \w does not match a slash.
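Run against the sample URLs from the question, the idea looks roughly like this. The sketch assumes a version of the pattern with escaped dots and uses re.fullmatch so that the whole URL has to fit; the names SHOP_PAGE and urls are made up for illustration.

```python
import re

# \w* cannot cross a slash, so only pages directly under /shop/ fit.
SHOP_PAGE = re.compile(r'http://www\.website/shop/\w*\.html')

urls = [
    "http://www.website/shop/men.html",
    "http://www.website/shop/women.html",
    "http://www.website/shop/men/footwear.html",
    "http://www.website/shop/men/causal.html",
    "http://www.website/shop/women/footwear.html",
]

# fullmatch requires the pattern to cover the entire URL.
valid = [url for url in urls if SHOP_PAGE.fullmatch(url)]
print(valid)
```

Note that \w also excludes hyphens, so a page name such as new-arrivals.html would need [^/]* instead.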

python regex pattern to extract value between two characters

I am trying to extract an id number from urls in the form of
http://www.domain.com/some-slug-here/person/237570
http://www.domain.com/person/237570
either one of these urls could also have params on them
http://www.domain.com/some-slug-here/person/237570?q=some+search+string
http://www.domain.com/person/237570?q=some+search+string
I have tried the following expressions to capture the id value of '237570' from the above urls. Each one kind of works, but none works across all four url scenarios.
(?<=person\/)(.*)(?=\?)
(?<=person\/)(.*)(?=\?|\z)
(?<=person\/)(.*)(?=\??*)
What I am seeing is that it gets the 237570 but also includes the ? and the characters that come after it in the url. How can I say "stop capturing when you hit a ?, a /, or the end of the string"?
String:
http://www.domain.com/some-slug-here/person/1234?q=some+search+string
http://www.domain.com/person/3456?q=some+search+string
http://www.domain.com/some-slug-here/person/5678
http://www.domain.com/person/7890
Regexp:
person\/(\d{1,})
Output:
>>> regex.findall(string)
[u'1234', u'3456', u'5678', u'7890']
Don't use .* to match the ID. . will match any character (except for line breaks, unless you use the DOTALL option). Just match a bunch of digits: (.*) --> (\d+)
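Put together as a runnable Python 3 snippet, the digit-only approach might look like this. The names PERSON_ID and text are made up for illustration, and the four URLs are the ones listed above.

```python
import re

text = """\
http://www.domain.com/some-slug-here/person/1234?q=some+search+string
http://www.domain.com/person/3456?q=some+search+string
http://www.domain.com/some-slug-here/person/5678
http://www.domain.com/person/7890
"""

# \d+ stops at the first non-digit, so ?, / and end of line all end the id.
PERSON_ID = re.compile(r'person/(\d+)')

ids = PERSON_ID.findall(text)
print(ids)  # ['1234', '3456', '5678', '7890']
```

Because the digits class cannot match ? or /, no lookahead is needed to stop the capture.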
