Stop capturing any string based on / for urls in Django (regex) - python

So I'm using Django to build a webservice, but I've got a problem with the regexp in Urls.
I want to go to the localhost/user/USERNAME/products/ where the username can be any character, not just \w. I tried capturing it using this:
('^user/(?P<username>.+)/$', 'user'),
('^user/(?P<username>.+)/products/$', 'get_products'),
but ofc it captures the username 'USERNAME/products' and sends it to user. How should I say every possible character EXCEPT front slash for any length except zero?

Use [^/]+ instead of .+.
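As Chris Pratt notes below, .+? can also work, as it is a "non-greedy" (lazy) regex, which matches as few characters as possible before the following sub-expression matches. Note that a lazy quantifier can still expand across a / when that is the only way for the rest of the pattern to match, so [^/]+ is the stricter choice. I generally prefer normal greedy solutions to non-greedy ones where possible, because they do not introduce multiple evaluation rules.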
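A minimal sketch of the difference, using the standard re module directly rather than a Django URLconf. The variable names and the sample path are only for illustration; the patterns are the ones from the question with the quantifier swapped.

```python
import re

path = "user/USERNAME/products/"

# Patterns for the plain "user" view, differing only in the username part.
greedy = re.compile(r"^user/(?P<username>.+)/$")
lazy = re.compile(r"^user/(?P<username>.+?)/$")
no_slash = re.compile(r"^user/(?P<username>[^/]+)/$")

# Pattern for the "get_products" view, using the negated character class.
products = re.compile(r"^user/(?P<username>[^/]+)/products/$")

# Both .+ and .+? are allowed to cross a slash, so they swallow "/products".
greedy_name = greedy.match(path).group("username")
lazy_name = lazy.match(path).group("username")

# [^/]+ cannot cross a slash, so the plain "user" pattern no longer matches.
user_match = no_slash.match(path)
products_name = products.match(path).group("username")

print(greedy_name, lazy_name, user_match, products_name)
```

Because [^/] excludes only the slash, usernames containing characters outside \w (for example a dot or a plus sign) are still accepted.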

Related

Colliding regex for emails (Python)

I'm trying to grab both usernames (such as abc123#) and emails (such as abc123#company.com) in the same Pythonic regex.
Here's an example statement:
abc123# is a researcher at abc123#company.com doing cool work.
Regex used:
For username:
re.match("^([A-Za-z])+([#]){1}$")
For email:
re.match("^([A-Za-z0-9-_])+(#company.com){1}$")
Most cases, what happens is username gets grabbed but not email address (trying to grab them as two separate entities) - any ideas what's going on?
Actually you have a lot of groups and repetition counts and start/end boundaries in your regexes that are not really necessary. These 2 are just enough to find each in the input string.
For user: [A-Za-z0-9]+#
For email: [A-Za-z0-9-_]+#company.com
If, however, you want your groupings, these versions will work:
For user: ([A-Za-z0-9])+(#)
For email: ([A-Za-z0-9-_])+(#company.com)
Disclaimer: I have tested this only on Java, as I am not so familiar with Python.
In your patterns you use anchors ^ and $ to assert the start and end of the string.
Removing the anchors leaves this for the username pattern: ([A-Za-z])+([#]){1}
Here, you can omit the {1} and the capture groups. Note that in the example, abc123# has digits that you are not matching.
Still, using [A-Za-z0-9]+# will get a partial match in the email abc123#company.com. To prevent that, you can use a right-hand whitespace boundary.
The username pattern might look like
\b[A-Za-z0-9]+#(?!\S)
\b A word boundary
[A-Za-z0-9]+ Match 1+ occurrences of the listed (including the digits)
# Match literally
(?!\S) Negative lookahead, assert that there is no non-whitespace char to the right
For the email address, using a character class like [A-Za-z0-9-_] is quite strict.
If you want a broad match, you might use:
[^\s#]+#[^\s#]+\.[a-z]{2,}
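As a quick check, the two patterns above can be run against the example sentence from the question with re.findall. This is only a sketch; the names USERNAME and EMAIL are chosen here for illustration.

```python
import re

text = "abc123# is a researcher at abc123#company.com doing cool work."

# Username: word boundary, alphanumerics, a literal '#', then whitespace or end.
USERNAME = re.compile(r"\b[A-Za-z0-9]+#(?!\S)")

# Email (broad match): non-space/non-# chars around '#', then a dot and a TLD.
EMAIL = re.compile(r"[^\s#]+#[^\s#]+\.[a-z]{2,}")

usernames = USERNAME.findall(text)
emails = EMAIL.findall(text)

print(usernames)
print(emails)
```

Using findall (or search) rather than match matters here: re.match only tries the pattern at the start of the string, so the email in the middle of the sentence would never be found.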

Python regular expression domain names

I am trying to extract multiple domain names that end in .com either starting with https or http from a string.
The string is:
string="jssbhshhahttps://www.one.comsbshhshshttp://www.another.comhehsbwkwkwjhttp://www.again.co.uksbsbs"
I have created the pattern as follows:
pattern=re.compile("https?://")
I am not sure how to finish it off.
I would like to return a list of all domains that start with http or Https and end in .com only. So no .co.uk domains in the output.
I have tried using (.*) in the middle to represent unlimited combinations of characters but am not sure how to finish it off.
Any help would be much appreciated and it would be great if all parts of the expression could be explained.
You can use
https?://(?:(?!https?://)\S)*?\.com
To make the regex case insensitive, pass the re.I flag or add the (?i) inline flag.
Details
https?:// - http:// or https://
(?:(?!https?://)\S)*? - any non-whitespace char, zero or more but as few as possible occurrences, not starting a http:// or https:// char sequence (this regex construct is known as a "tempered greedy token")
\.com - a .com string.
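A short sketch applying this pattern to the string from the question. The variable name text replaces the question's string, and re.I is added so that Https is accepted too.

```python
import re

text = ("jssbhshhahttps://www.one.comsbshhshshttp://www.another.com"
        "hehsbwkwkwjhttp://www.again.co.uksbsbs")

# Tempered token: consume non-whitespace chars lazily, but never step over
# the start of another http:// or https:// sequence, until ".com" is reached.
pattern = re.compile(r"https?://(?:(?!https?://)\S)*?\.com", re.I)

domains = pattern.findall(text)
print(domains)
```

The .co.uk URL is skipped because no ".com" follows it before the end of the string.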

How can I avoid selecting email Ids with a particular domain name with regex

I have a list of e-mail ids among which I have to select only those which do not have ruba.com as domain name with regex. For example, if I have ads#gmail.com, dgh#rubd.com and ert#ruba.com, then my regular expression should select the first two ids. What should be the regular expression for this problem?
I have tried with two expressions:
[a-zA-Z0-9_.+-]+#[^(ruba)]+.[a-zA-Z0-9-.]+
and
[a-zA-Z0-9_.+-]+#[^r][^u][^b][^a]+.[a-zA-Z0-9-.]+
Neither of these fulfilled my requirement.
I assume that by email ID you mean the part before the # symbol, otherwise that would be a full email address.
.+(?=#)(?!#ruba\.com)
. the dot character is a special symbol for regex engines, and it is used to match any character
+ also known as the Kleene plus, says you want to capture one or more instances of the preceding symbol, in our case .; basically you are saying "give me every char"
(?=#) is a positive lookahead, i.e. a special search feature that makes sure that what follows is #; I'm using it to take the cursor to the position of # and "stop" capturing, otherwise + would go on indefinitely
(?!#ruba\.com) is a negative lookahead, i.e. a special search feature that makes sure that what follows is not (!) #ruba\.com; I'm escaping the dot so as not to confuse it with the capture-all symbol I was talking about before
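A small sketch of how this pattern behaves on the three sample addresses from the question, assuming each address is tested on its own with re.match. The helper name local_part is hypothetical.

```python
import re

# Everything up to '#', provided that what follows is not "#ruba.com".
LOCAL_PART = re.compile(r".+(?=#)(?!#ruba\.com)")

def local_part(address):
    """Return the part before '#', or None for ruba.com addresses."""
    m = LOCAL_PART.match(address)
    return m.group(0) if m else None

addresses = ["ads#gmail.com", "dgh#rubd.com", "ert#ruba.com"]
results = [local_part(a) for a in addresses]
print(results)
```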
You could use a negative lookahead to ensure that you do not match the domain ruba.com.
The negative lookahead (?!ruba\.com$) rejects any address whose domain is exactly ruba.com, the one you want to exclude. Also, because emails typically have more than word characters (such as hyphens and periods), you would be better off using [\w\.\-] rather than just \w.
^[\w\.\-]+#(?!ruba\.com$)[\w\.\-]+\.(?:com|net|org|edu)$
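A sketch of the negative-lookahead filter applied to the sample addresses from the question. It assumes the lookahead is written as (?!ruba\.com$) so that only the exact domain ruba.com is excluded; the names ALLOWED and selected are illustrative.

```python
import re

# Whole-address pattern: local part, '#', a domain that is not exactly
# ruba.com, and one of a fixed set of top-level domains.
ALLOWED = re.compile(r"^[\w.\-]+#(?!ruba\.com$)[\w.\-]+\.(?:com|net|org|edu)$")

addresses = ["ads#gmail.com", "dgh#rubd.com", "ert#ruba.com"]

# Keep only the addresses that the pattern accepts.
selected = [a for a in addresses if ALLOWED.match(a)]
print(selected)
```

The fixed list of top-level domains is a limitation of this pattern: an address ending in, say, .io would be rejected as well.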

How to have 2 or more regular expressions with exclusive OR in Python?

I have a regex to find an email mailto:(.*)\"|(\S+#+\S*)|(.{1,40}#.{1,40}) on several HTML sources.
For the string Email: <u><a href="mailto:test#test.com">email me, the piece mailto:(.*)\" works great.
I would like the regex to stop there and return that value instead of continuing with the other expressions. Is there something like an XOR operator, or another way to do this, since I will eventually have to add more pieces?
I tried here: http://pythex.org/
Regex should do this naturally. To illustrate this point, one of the easy ways of matching a word with one exception is to precede your match with the exception and use an alternation.
For instance, in my email program I need to sort all emails with the subject line /labels?/ to another folder. However, some of my contacts never learned how to spell (apparently), so I also sort /lables?/. Then I noticed that emails containing the subject line Available were also being picked up by this filter.
I could have done /\blables?/ but preferred instead to catch available and handle it separately, so I did:
`/available|(lables?)|(labels?)/`
This alternation will match available, but only match and capture lable, lables, label, or labels.
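The same idea sketched in Python: the unwanted word is matched by the first, group-less alternative, so only the matches that filled a capture group are kept. The sample subject line is made up for illustration.

```python
import re

# "available" is matched but not captured; the two spellings are captured.
pattern = re.compile(r"available|(lables?)|(labels?)", re.IGNORECASE)

subject = "Available labels and lables"

captured = [
    m.group(1) or m.group(2)
    for m in pattern.finditer(subject)
    if m.group(1) or m.group(2)  # skip matches of the bare "available" branch
]
print(captured)
```

Because the regex engine reports the leftmost match, "Available" is consumed as a whole before the "lable" hidden inside it can match on its own.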
As Adam Smith said, the problem you're having is that your last alternation .{1,40}#.{1,40} begins matching well before the rest of your alternations, so it consumes the text and that's the match that returns.
To overcome this, you could modify the other partial patterns so that they match just as early, by prepending .*?, e.g. .*?mailto:(.*)\"|.*?(\S+#+\S*)|(.{1,40}#.{1,40}).
Or, perhaps somewhat less convoluted, you could just search for one after the other:
import re

string = 'Email: <u><a href="mailto:test#test.com">email me'
# Try the most specific pattern first; `or` stops at the first one that matches.
m = (re.search(r'mailto:(.*)"', string)
     or re.search(r'(\S+#+\S*)', string)
     or re.search(r'(.{1,40}#.{1,40})', string))
print(m.group(1))

How can I find all Markdown links using regular expressions?

In Markdown there are two ways to place a link: one is to just type the raw link in, like http://example.com; the other is to use the ()[] syntax: (Stack Overflow)[http://example.com].
I'm trying to write a regular expression that can match both of these, and, if it's the second match to also capture the display string.
So far I have this:
(?P<href>http://(?:www\.)?\S+.com)|(?<=\((.*)\)\[)((?P=href))(?=\])
But this doesn't seem to match either of my two test cases in Debuggex:
http://example.com
(Example)[http://example.com]
Really not sure why the first one isn't matched at the very least, is it something to do with my use of the named group? Which, if possible I'd like to keep using because this is a simplified expression to match the link and in the real example it is too long for me to feel comfortable duplicating it in two different places in the same pattern.
What am I doing wrong? Or is this not doable at all?
EDIT: I'm doing this in Python so will be using their regex engine.
The reason your pattern doesn't work is here: (?<=\((.*)\)\[) since the re module of Python doesn't allow variable length lookbehind.
You can obtain what you want in a more handy way using the third-party regex module for Python, available from PyPI (the built-in re module has few features in comparison).
Example: (?|(?<txt>(?<url>(?:ht|f)tps?://\S+(?<=\P{P})))|\(([^)]+)\)\[(\g<url>)\])
pattern details:
(?|                                  # open a branch reset group
                                     # first case: there is only the url
    (?<txt>                          # in this case, the text and the url
        (?<url>                      # are the same
            (?:ht|f)tps?://\S+(?<=\P{P})
        )
    )
  |                                  # OR
                                     # the (text)[url] format
    \( ([^)]+) \)                    # this group will be named "txt" too
    \[ (\g<url>) \]                  # this one "url"
)
This pattern uses the branch reset feature (?|...|...|...), which preserves capturing group names (or numbers) across an alternation. In the pattern, since the ?<txt> group is opened first in the first member of the alternation, the first group in the second member will automatically have the same name. The same goes for the ?<url> group.
\g<url> is a reference to the named subpattern ?<url> (like an alias, in this way, no need to rewrite it in the second member.)
(?<=\P{P}) checks if the last character of the url is not a punctuation character (useful to avoid the closing square bracket for example). (I'm not sure of the syntax, it may be \P{Punct})
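For comparison, here is a sketch that stays within the built-in re module instead of regex. It is a different technique from the branch reset above: the URL sub-pattern is written once as a Python string and interpolated into both alternatives, and the alternation replaces the variable-length lookbehind. The names URL, LINK and find_links, and the simplified URL pattern, are assumptions made for this example.

```python
import re

# Simplified URL sub-pattern, defined once: stops at whitespace or ']'.
URL = r"https?://[^\s\]]+"

# Either the (text)[url] form, or a raw URL on its own.
LINK = re.compile(
    rf"\((?P<text>[^)]+)\)\[(?P<href>{URL})\s*\]|(?P<raw>{URL})"
)

def find_links(source):
    """Return (display_text, url) pairs; a raw link is its own display text."""
    links = []
    for m in LINK.finditer(source):
        if m.group("raw"):
            links.append((m.group("raw"), m.group("raw")))
        else:
            links.append((m.group("text"), m.group("href")))
    return links

links = find_links("See http://example.com and (Example)[http://example.com]")
print(links)
```

The group names have to differ between the two alternatives here, because re has no branch reset; the function normalizes them into a single pair afterwards. Unlike the (?<=\P{P}) check above, this simplified URL pattern does not strip trailing punctuation from a raw link.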
