Python regular expression domain names - python

I am trying to extract multiple domain names that end in .com either starting with https or http from a string.
The string is:
string="jssbhshhahttps://www.one.comsbshhshshttp://www.another.comhehsbwkwkwjhttp://www.again.co.uksbsbs"
I have created the pattern as follows:
pattern=re.compile("https?://")
I am not sure how to finish it off.
I would like to return a list of all domains that start with http or https and end in .com only. So no .co.uk domains in the output.
I have tried using (.*) in the middle to represent unlimited combinations of characters but I am not sure how to finish it off.
Any help would be much appreciated and it would be great if all parts of the expression could be explained.

You can use
https?://(?:(?!https?://)\S)*?\.com
You may pass the case-insensitive flag re.I or add the inline flag (?i) to make the regex case insensitive.
Details
https?:// - http:// or https://
(?:(?!https?://)\S)*? - any non-whitespace char that does not start an http:// or https:// char sequence, zero or more occurrences but as few as possible (this construct is known as a "tempered greedy token")
\.com - a .com string.
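As a minimal sketch, the pattern can be run with re.findall on the string from the question. The variable names text, pattern and domains are only illustrative. Because the pattern has no capturing groups, findall returns the full matches.

```python
import re

# Sample input taken from the question.
text = ("jssbhshhahttps://www.one.comsbshhshshttp://www.another.com"
        "hehsbwkwkwjhttp://www.again.co.uksbsbs")

# Tempered greedy token: consume one non-whitespace char at a time,
# but never step onto the start of another http:// or https://.
pattern = re.compile(r"https?://(?:(?!https?://)\S)*?\.com", re.I)

domains = pattern.findall(text)
print(domains)  # ['https://www.one.com', 'http://www.another.com']
```

The .co.uk URL is skipped because no .com follows its http:// before the string ends.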

Related

Colliding regex for emails (Python)

I'm trying to grab both usernames (such as abc123#) and emails (such as abc123#company.com) in the same Pythonic regex.
Here's an example statement:
abc123# is a researcher at abc123#company.com doing cool work.
Regex used:
For username:
re.match("^([A-Za-z])+([#]){1}$")
For email:
re.match("^([A-Za-z0-9-_])+(#company.com){1}$")
In most cases, the username gets grabbed but not the email address (I am trying to grab them as two separate entities). Any ideas what's going on?
Actually you have a lot of groups and repetition counts and start/end boundaries in your regexes that are not really necessary. These 2 are just enough to find each in the input string.
For user: [A-Za-z0-9]+#
For email: [A-Za-z0-9-_]+#company.com
If, however, you want your groupings, these versions will work:
For user: ([A-Za-z0-9])+(#)
For email: ([A-Za-z0-9-_])+(#company.com)
Disclaimer: I have tested this only on Java, as I am not so familiar with Python.
In your patterns you use anchors ^ and $ to assert the start and end of the string.
Removing the anchors leaves this for the username pattern: ([A-Za-z])+([#]){1}
Here, you can omit the {1} and the capture groups. Note that in the example, abc123# has digits that you are not matching.
Still, using [A-Za-z0-9]+# will get a partial match in the email abc123#company.com. To prevent that, you can assert a whitespace boundary on the right.
The username pattern might look like
\b[A-Za-z0-9]+#(?!\S)
\b A word boundary
[A-Za-z0-9]+ Match 1+ occurrences of the listed (including the digits)
# Match literally
(?!\S) Negative lookahead, assert that no non-whitespace char follows to the right
For the email address, using a character class like [A-Za-z0-9-_] is quite strict.
If you want a broad match, you might use:
[^\s#]+#[^\s#]+\.[a-z]{2,}
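A short sketch of both patterns applied to the example sentence, assuming re.findall is used instead of re.match (re.match only looks at the start of the string). The names username_re, email_re, usernames and emails are made up for this illustration.

```python
import re

text = "abc123# is a researcher at abc123#company.com doing cool work."

# Username: word boundary, letters/digits, '#', then whitespace or end of string.
username_re = re.compile(r"\b[A-Za-z0-9]+#(?!\S)")
# Broad email-like match, with '#' as the separator exactly as in the question.
email_re = re.compile(r"[^\s#]+#[^\s#]+\.[a-z]{2,}")

usernames = username_re.findall(text)
emails = email_re.findall(text)

print(usernames)  # ['abc123#']
print(emails)     # ['abc123#company.com']
```

The lookahead (?!\S) is what keeps the username pattern from matching the abc123# prefix of the email address.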

Unable to match trailing slash after email address in url regex

I can't match an email address with a trailing slash in my url regex, and I can't figure out why. Here's the regex that matches the email address with no trailing slash:
r'^customer/(?P<customer_email>[\w.%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,63}$)$'
As expected, this matches /customer/someone#example.com, and not /customer/someone#example.com/
I would have thought that appending /? would work, given that the regex to match the domain suffix of the email address should not greedily match the slash. (This was the solution to many of the other duplicate regex trailing-slash questions).
r'^customer/(?P<customer_email>[\w.%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,63}$)/?$'
As expected, this matches /customer/someone#example.com, but unexpectedly it doesn't match /customer/someone#example.com/. Why?
APPEND_SLASH in settings.py is not set. I don't want to capture the slash as part of the customer_email url parameter.
The $ anchor asserts the end of the string. Your first $ sits inside the capturing group, before /?, so the string must already end right after the email address and a trailing slash can never match.
Thus, you need to remove the first $ in your pattern and use
^customer/(?P<customer_email>[\w.%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,63})/?$
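The difference can be checked with plain re, outside Django. This is only a sketch: the names broken, fixed and paths are illustrative, and the test paths are written without a leading slash on the assumption that this is how the URL resolver hands them to the pattern.

```python
import re

# Original attempt: the first $ sits inside the group, before the optional slash.
broken = re.compile(
    r'^customer/(?P<customer_email>[\w.%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,63}$)/?$'
)
# Fixed pattern: only one $ remains, after the optional slash.
fixed = re.compile(
    r'^customer/(?P<customer_email>[\w.%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,63})/?$'
)

paths = ["customer/someone#example.com", "customer/someone#example.com/"]

for path in paths:
    old = broken.match(path)
    new = fixed.match(path)
    # The fixed pattern matches both paths and never captures the slash.
    print(path, "| broken:", bool(old), "| fixed:", new.group("customer_email"))
```

With the fixed pattern the slash is matched by /? outside the group, so customer_email is someone#example.com in both cases.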

What does the regex [^\s]*? mean?

I am starting to learn python spider to download some pictures on the web and I found the code as follows. I know some basic regex.
I know \.jpg means .jpg and | means or. What is the meaning of [^\s]*? in the first line? Why is \s used here?
And what's the difference between the two regexes?
http:[^\s]*?(\.jpg|\.png|\.gif)
http://.*?(\.jpg|\.png|\.gif)
Alright, so to answer your first question, I'll break down [^\s]*?.
The square brackets ([]) indicate a character class. A character class basically means that you want to match anything in the class, at that position, one time. [abc] will match the strings a, b, and c. In this case, your character class is negated using the caret (^) at the beginning - this inverts its meaning, making it match anything but the characters in it.
\s is fairly simple - it's a common shorthand in many regex flavours for "any whitespace character". This includes spaces, tabs, and newlines.
*? is a little harder to explain. The * quantifier is fairly simple - it means "match this token (the character class in this case) zero or more times". The ?, when applied to a quantifier, makes it lazy - it will match as little as it can, going from left to right one character at a time.
In this case, what the whole pattern snippet [^\s]*? means is "match any sequence of non-whitespace characters, including the empty string". As mentioned in the comments, this can more succinctly be written as \S*?.
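To make the lazy behaviour concrete, here is a small sketch comparing [^\s]*?, its greedy form [^\s]*, and the shorthand \S*?. The example.com URLs and variable names are invented for the illustration, and the extension group is written as non-capturing so that findall returns whole matches.

```python
import re

# Hypothetical text; the second URL carries two extensions on purpose.
text = "see http://example.com/a.jpg and http://example.com/b.png.gif now"

lazy = re.findall(r"http:[^\s]*?\.(?:jpg|png|gif)", text)
greedy = re.findall(r"http:[^\s]*\.(?:jpg|png|gif)", text)
shorthand = re.findall(r"http:\S*?\.(?:jpg|png|gif)", text)

print(lazy)       # ['http://example.com/a.jpg', 'http://example.com/b.png']
print(greedy)     # ['http://example.com/a.jpg', 'http://example.com/b.png.gif']
print(shorthand)  # same as lazy
```

The lazy version stops at the first extension it can reach, while the greedy version runs to the last one before the whitespace.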
To answer the second part of your question, I'll compare the two regexes you give:
http:[^\s]*?(\.jpg|\.png|\.gif)
http://.*?(\.jpg|\.png|\.gif)
They both start the same way: attempting to match the protocol at the beginning of a URL and the subsequent colon (:) character. The first then matches any string that does not contain any whitespace and ends with the specified file extensions. The second, meanwhile, will match two literal slash characters (/) before matching any sequence of characters followed by a valid extension.
Now, it's obvious that both patterns are meant to match a URL, but both are incorrect. The first pattern, for instance, will match strings like
http:foo.bar.png
http:.png
Both of which are invalid. Likewise, the second pattern will permit spaces, allowing stuff like this:
http:// .jpg
http://foo bar.png
Neither of these is a valid URL either. A better regex for this (though I caution strongly against trying to match URLs with regexes) might look like:
https?://\S+\.(jpe?g|png|gif)
In this case, it'll match URLs starting with either http or https, and files ending in either .jpg or .jpeg as well as .png and .gif.
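The three patterns can be compared side by side with re.fullmatch. This is a rough sketch: the sample strings (including the example.com URL) and the names first, second, better and results are assumptions made for the demonstration.

```python
import re

first = re.compile(r"http:[^\s]*?(\.jpg|\.png|\.gif)")
second = re.compile(r"http://.*?(\.jpg|\.png|\.gif)")
better = re.compile(r"https?://\S+\.(jpe?g|png|gif)")

samples = ["http:.png", "http://foo bar.png", "https://example.com/cat.jpeg"]

# For each sample, record whether (first, second, better) accept the whole string.
results = {
    s: (bool(first.fullmatch(s)), bool(second.fullmatch(s)), bool(better.fullmatch(s)))
    for s in samples
}

for sample, flags in results.items():
    print(repr(sample), flags)
```

Only the first pattern accepts http:.png, only the second accepts the URL containing a space, and only the third accepts the https .jpeg URL.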

python regex pattern to extract value between two characters

I am trying to extract an id number from urls in the form of
http://www.domain.com/some-slug-here/person/237570
http://www.domain.com/person/237570
either one of these urls could also have params on them
http://www.domain.com/some-slug-here/person/237570?q=some+search+string
http://www.domain.com/person/237570?q=some+search+string
I have tried the following expressions to capture the id value of '237570' from the above urls. Each one kind of works but does not work across all four url scenarios.
(?<=person\/)(.*)(?=\?)
(?<=person\/)(.*)(?=\?|\z)
(?<=person\/)(.*)(?=\??*)
What I am seeing is that it gets the 237570 but also includes the ? and the characters that come after it in the url. How can I say stop capturing when you hit a ?, a /, or the end of the string?
String:
http://www.domain.com/some-slug-here/person/1234?q=some+search+string
http://www.domain.com/person/3456?q=some+search+string
http://www.domain.com/some-slug-here/person/5678
http://www.domain.com/person/7890
Regexp:
person\/(\d{1,})
Output:
>>> regex.findall(string)
[u'1234', u'3456', u'5678', u'7890']
Don't use .* to match the ID. . will match any character (except for line breaks, unless you use the DOTALL option). Just match a bunch of digits: (.*) --> (\d+)
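A runnable Python 3 sketch of this advice, applied to each URL separately. The list name urls and the helper extract_id are hypothetical. The second pattern is an assumed variant for ids that are not purely numeric: it stops at the first /, ? or the end of the string.

```python
import re

urls = [
    "http://www.domain.com/some-slug-here/person/1234?q=some+search+string",
    "http://www.domain.com/person/3456?q=some+search+string",
    "http://www.domain.com/some-slug-here/person/5678",
    "http://www.domain.com/person/7890",
]

digits_re = re.compile(r"person/(\d+)")       # ids made of digits only
generic_re = re.compile(r"person/([^/?]+)")   # anything up to '/', '?' or end


def extract_id(pattern, url):
    """Return the captured id, or None when the URL has no person segment."""
    match = pattern.search(url)
    return match.group(1) if match else None


ids = [extract_id(digits_re, url) for url in urls]
generic_ids = [extract_id(generic_re, url) for url in urls]
print(ids)  # ['1234', '3456', '5678', '7890']
```

No lookarounds are needed here: the capture group already isolates the id, and the character class decides where it stops.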

Stop capturing any string based on / for urls in Django (regex)

So I'm using Django to build a webservice, but I've got a problem with the regexp in Urls.
I want to go to the localhost/user/USERNAME/products/ where the username can be any character, not just \w. I tried capturing it using this:
('^user/(?P<username>.+)/$', 'user'),
('^user/(?P<username>.+)/products/$', 'get_products'),
but of course it captures the username 'USERNAME/products' and sends it to user. How should I say every possible character EXCEPT a forward slash, for any length except zero?
Use [^/]+ instead of .+.
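As Chris Pratt notes below, .+? can also work, as it is a "non-greedy" (lazy) quantifier, which matches as few characters as possible until the following sub-expression matches. Note that a lazy quantifier can still extend across a slash when the rest of the pattern cannot match otherwise, so [^/]+ is the stricter choice. I generally prefer normal greedy solutions to non-greedy solutions where possible, because they do not introduce multiple evaluation rules.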
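The effect can be reproduced with plain re, without Django. A minimal sketch, assuming the path is matched without a leading slash; the pattern names and the sample path are illustrative only.

```python
import re

greedy_any = re.compile(r"^user/(?P<username>.+)/$")
lazy_any = re.compile(r"^user/(?P<username>.+?)/$")
no_slash = re.compile(r"^user/(?P<username>[^/]+)/$")
products = re.compile(r"^user/(?P<username>[^/]+)/products/$")

path = "user/USERNAME/products/"

# .+ swallows the slash, so the plain user pattern wrongly accepts this path.
print(greedy_any.match(path).group("username"))  # USERNAME/products
# .+? is lazy, but the trailing /$ still forces it across the slash here.
print(lazy_any.match(path).group("username"))    # USERNAME/products
# [^/]+ cannot cross a slash, so only the products pattern matches.
print(no_slash.match(path))                      # None
print(products.match(path).group("username"))    # USERNAME
```

With [^/]+ the two URL patterns no longer overlap, so their order in the URL list stops mattering for this case.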
