Conditional Regular Expressions - python

I'm using Python and I want to use regular expressions to check if something "is part of an include list" but "is not part of an exclude list".
My include list is represented by a regex, for example:
And.*
Everything which starts with And.
Also the exclude list is represented by a regex, for example:
(?!Andrea)
Everything, but not the string Andrea. The exclude list is obviously a negation.
Using the two examples above, for example, I want to match everything which starts with And except for Andrea.
In the general case I have an includeRegEx and an excludeRegEx. I want to match everything that matches includeRegEx but does not match excludeRegEx. Note that excludeRegEx is still in the negative form (as you can see in the example above), so it is more accurate to say: if something matches includeRegEx, I check whether it also matches excludeRegEx, and if it does, the match is satisfied. Is it possible to represent this in a single regular expression?
I think Conditional Regular Expressions could be the solution but I'm not really sure of that.
I'd like to see a working example in Python.
Thank you very much.

Why not put both in one regex?
And(?!rea$).*
Since the lookahead only "looks ahead" without consuming any characters, this works just fine (well, this is the whole point of lookaround, actually).
So, in Python:
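import re

subject = "Andrew"  # example input, not from the original question
if re.match(r"And(?!rea$).*", subject):
    # Successful match.
    # Note that re.match always anchors the match
    # to the start of the string.
    print("match")
else:
    # Match attempt failed.
    print("no match")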
From the wording of your question, I'm not sure if you're starting with two already finished lists of "match/don't match" pairs. In that case, you could simply combine them automatically by concatenating the regexes. This works just as well but is uglier:
(?!Andrea$)And.*
In general, then:
(?!excludeRegex$)includeRegex
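A minimal sketch of the general form in Python. It assumes the exclude list is supplied as a plain pattern (Andrea) rather than as a ready-made lookahead; build_filter is a hypothetical helper name, and the non-capturing groups are an addition so that alternations inside either pattern stay contained.

```python
import re

def build_filter(include_regex, exclude_regex):
    # Hypothetical helper: reject strings that exclude_regex matches in full,
    # then require include_regex to match from the start of the string.
    return re.compile(r"(?!(?:%s)$)(?:%s)" % (exclude_regex, include_regex))

pattern = build_filter(r"And.*", r"Andrea")

names = ["Andrew", "Andrea", "Andreas", "Bob"]
results = {name: bool(pattern.match(name)) for name in names}
print(results)
# {'Andrew': True, 'Andrea': False, 'Andreas': True, 'Bob': False}
```

Because of the $ inside the lookahead, only the exact string Andrea is excluded; Andreas still passes.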

Related

Python Regex that will work for any type of bracket

Is there a way to take a regular expression, such as
\(.*\)
and make it correctly identify pairs of any type of bracket, like
(\(|\{|\[).*(\)|\}|\])
without making incorrect matches, like \(.*\]?
I'm specifically working with Python, but it should work similarly in any language.
No. Regular languages can't handle nesting correctly. You'll need a proper parser for that.
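As a rough illustration of what such a parser could look like, here is a small stack-based sketch. It is not a full parser, and find_bracket_pairs is a hypothetical name; it only pairs brackets and reports mismatches.

```python
def find_bracket_pairs(text):
    """Return (open_index, close_index) tuples in the order the pairs close."""
    closers = {")": "(", "}": "{", "]": "["}
    openers = set(closers.values())
    stack = []   # (bracket character, index) of brackets still open
    pairs = []
    for index, char in enumerate(text):
        if char in openers:
            stack.append((char, index))
        elif char in closers:
            # The closing bracket must match the most recently opened one.
            if not stack or stack[-1][0] != closers[char]:
                raise ValueError("unbalanced %r at index %d" % (char, index))
            pairs.append((stack.pop()[1], index))
    if stack:
        raise ValueError("unclosed %r at index %d" % stack[-1])
    return pairs

print(find_bracket_pairs("(ab(cd)ef)"))  # [(3, 6), (0, 9)]
```

The stack is what a plain regular expression lacks: it remembers which bracket type was opened at each nesting level.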
((?:\([^()]*\))|(?:\{[^{}]*\})|(?:\[[^[\]]*\]))
Even more unwieldy, but unlike the solution containing .* this will only catch the innermost pair of brackets in case of nested brackets. Between a pair of brackets everything is allowed, even newlines, except the brackets themselves [^{}].
.* is greedy: it would catch two pairs of brackets as one group, as in (ab)cd(ef), or even mix pairs chaotically and match (ab)cd), for example.
Catching a group that includes the outer pair of brackets, like (ab(cd)ef), I would not consider possible with regex.
There is no magic way to tell a regex to pair one specific character with a different one (if it were the exact same character or string you could use a backreference, but that is not the case here).
What you can do is write a "complex" (not so much) expression for that:
((?:\(.*\))|(?:\{.*\})|(?:\[.*\]))
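The difference between the greedy .* pattern and the negated-class pattern can be checked with a short sketch. The variable names are illustrative, and the outer capturing group is dropped here so that findall returns whole matches.

```python
import re

# Greedy version: .* runs to the LAST closing bracket of the same type.
greedy = re.compile(r"\(.*\)|\{.*\}|\[.*\]")

# Negated-class version: no bracket of the same type may appear inside,
# so only innermost pairs are matched.
innermost = re.compile(r"\([^()]*\)|\{[^{}]*\}|\[[^\[\]]*\]")

print(greedy.findall("(ab)cd(ef)"))     # ['(ab)cd(ef)']
print(innermost.findall("(ab)cd(ef)"))  # ['(ab)', '(ef)']
print(innermost.findall("(ab(cd)ef)"))  # ['(cd)']
```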

regex named group starting with but not ends with

I have two regexes (simplified to be equal)
r'^(?P<slug>(^foo)[-\w]+)/$'
r'^(?P<slug>(^foo)[-\w]+)/$'
I would like to add an exclusion to the first that checks the end of the string, so that the latter wins.
For example:
foobar/ should pass the first and never the latter
I want foobar-my-string/ to fail the first but match the latter
I have tried @sdanzig's answer:
r'^(?P<slug>(^foo)[-\w]+(?!my-string$))/$'
r'^(?P<slug>(^foo)[-\w]+)/$'
But it doesn't work: I always end up in the latter, whether or not the string ends with "my-string".
I also tried it the other way around as my regexes are evaluated top to bottom, but it also doesn't work:
r'^(?P<slug>(^foo)[-\w]+(my-string$))/$'
r'^(?P<slug>(^foo)[-\w]+(?!my-string$))/$'
You should use this negative lookahead for the second regex, placed before [-\w]+. Because [-\w]+ is greedy, it consumes the entire string before a lookahead placed after it is even checked.
p = r'(?P<slug>(?!.*my-string/$)(^foo)[-\w]+)'
Correction... for this particular requirement, you need a look BEHIND assertion, just before the $, to make sure the string doesn't end with my-string/:
(foo[-\w\/]+)(?<!my-string\/)$
Note that (?P<slug>...) is Python's syntax for a named capturing group, so it can stay as it is. Combined with the lookbehind, the pattern becomes:
^(?P<slug>foo[-\w]+)/(?<!my-string/)$
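A small sketch of the top-to-bottom evaluation described in the question, with the negative lookahead placed at the start of the named group in the first pattern. The route function and the pattern names are hypothetical stand-ins for whatever dispatcher is actually used.

```python
import re

# First pattern: refuses anything that ends with "my-string/".
first = re.compile(r"^(?P<slug>(?!.*my-string/$)foo[-\w]+)/$")
# Second pattern: the unrestricted fallback.
second = re.compile(r"^(?P<slug>foo[-\w]+)/$")

def route(path):
    # Try the patterns in order and report which one matched.
    for name, pattern in (("first", first), ("second", second)):
        match = pattern.match(path)
        if match:
            return name, match.group("slug")
    return None

print(route("foobar/"))            # ('first', 'foobar')
print(route("foobar-my-string/"))  # ('second', 'foobar-my-string')
```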

Regular expression code is not working (Python)

Assume I have a word AB1234XZY or even 1AB1234XYZ.
I want to extract ONLY 'AB1234' or 1AB1234 (ie. everything up until the letters at the end).
I have used the following code to extract that but it's not working:
base= re.match(r"^(\D+)(\d+)", word).group(0)
When I print base, it's not working for the second case. Any ideas why?
Your regex doesn't work for the second case because the word starts with a digit; the \D at the beginning of your pattern matches anything that ISN'T a digit.
You should be able to use something quite simple for this--simpler, in fact, than anything else I see here.
'.*\d'
That's it! This should match everything up to and including the last digit in your string, and ignore everything after that.
Here's the pattern working online, so you can see for yourself.
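For a quick local check of the same pattern, here is a minimal sketch; up_to_last_digit is a hypothetical helper name, and returning None for input without digits is an assumption about the desired behaviour.

```python
import re

def up_to_last_digit(word):
    # The greedy .* backtracks just far enough for \d to match the last digit.
    match = re.match(r".*\d", word)
    return match.group(0) if match else None

print(up_to_last_digit("AB1234XZY"))   # AB1234
print(up_to_last_digit("1AB1234XYZ"))  # 1AB1234
```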
(.+?\d+)\w+ would give you what you want.
Or even something like this
^(.+?)[a-zA-Z]+$
re.match only matches at the beginning of the string, whereas re.search looks for the pattern anywhere in the string. Both return the first match. .group(0) is everything included in the match; if you have capturing groups, .group(1) is the first group, and so on. Unlike the usual convention where 0 is the first index, here 0 is a special case meaning the whole match.
In your case, depending on what you really need to capture, re.search may be the better choice. Instead of two groups you can use a single one, (\D+\d+). Keep in mind that it will capture only the first (non-digits, digits) run. That might be sufficient for you, but you might want to be more specific.
After reading your comment, "everything before the letters at the end", this regex is what you need (the lazy .+? and the trailing [A-Za-z]+$ make group 1 stop before the final run of letters):
regex = re.compile(r'(.+?)[A-Za-z]+$')
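The difference between re.match and re.search, and between group(0) and group(1), can be seen in a short sketch using the second word from the question. The lazy pattern at the end is one way to express "everything before the letters at the end", not the only one.

```python
import re

word = "1AB1234XYZ"

# re.match is anchored at the start, and \D cannot match the leading "1".
anchored = re.match(r"^(\D+)(\d+)", word)
print(anchored)  # None

# re.search scans forward and finds the first (non-digits, digits) run.
found = re.search(r"(\D+)(\d+)", word)
print(found.group(0), found.group(1), found.group(2))  # AB1234 AB 1234

# Lazy prefix plus a trailing run of letters: group(1) is the wanted part.
lazy = re.match(r"(.+?)[A-Za-z]+$", word)
print(lazy.group(0), lazy.group(1))  # 1AB1234XYZ 1AB1234
```

Note that re.search silently drops the leading 1 here, which is the caveat mentioned above about capturing only the first (non-digits, digits) run.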

Regular Expressions Dependant on Previous Matchings

For example, how could we recognize a string of the following format with a single RE:
LenOfStr:Str
An example string in this format is:
5:5:str
The string we're looking for is "5:str".
In python, maybe something like the following (this isn't working):
r'(?P<len>\d+):(?P<str>.{int((?P=len))})'
In general, is there a way to transform previously matched groups before using them, or have I just asked yet another question that REs are not meant for?
Thanks.
Yep, what you're describing is outside the bounds of regular expressions. Regular expressions only deal with actual character data. This provides some limited ability to make matches dependent on context (e.g., (.)\1 to match the same character twice), but you can't apply arbitrary functions to pieces of an in-progress match and use the results later in the same match.
You could do something like search for text matching the regex (\d+):\w+, and then postprocess the results to check if the string length is equal to the int value of the first part of the match. But you can't do that as part of the matching process itself.
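The post-processing idea could be sketched as follows. The function name parse_length_prefixed is hypothetical, and the choice to return None when the input is too short is an assumption about the desired behaviour.

```python
import re

def parse_length_prefixed(text):
    # Step 1: let the regex find only the numeric length prefix.
    match = re.match(r"(\d+):", text)
    if match is None:
        return None
    length = int(match.group(1))
    # Step 2: take exactly that many characters after the colon.
    start = match.end()
    payload = text[start:start + length]
    # Step 3: reject input that is shorter than the declared length.
    return payload if len(payload) == length else None

print(parse_length_prefixed("5:5:str"))  # 5:str
print(parse_length_prefixed("9:abc"))    # None
```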
Well, this can be done with a regex plus a little post-processing (if I understand the question):
>>> import re
>>> s='5:5:str and some more characters...'
>>> m=re.search(r'^(\d+):(.*)$',s)
>>> m.group(2)[0:int(m.group(1))]
'5:str'
It just cannot be done by dynamically changing the previous match group.
You can make it look like a single regex like so:
>>> re.sub(r'^(\d+):(.*)$',lambda m: m.group(2)[0:int(m.group(1))],s)
'5:str'

beginning and ending sign in regular expression in python

'[A-Za-z0-9-_]*'
'^[A-Za-z0-9-_]*$'
I want to check whether a string contains only the characters in the above expression, to make sure that no other characters such as #%&/() appear in it.
Is there any difference between these two regular expressions? Do the beginning and ending anchors (^ and $) matter? Will they affect the result?
With re.match, Python regular expressions are anchored at the beginning of the string, hence the ^ sign at the beginning doesn't make any difference there (with re.search it does). However, the $ sign very much does: if you don't include it, you only match the beginning of your string, and the end could contain anything, including the characters you want to exclude. Just try re.match("[a-z0-9]", "abcdef/%&").
In addition to that, you may want to use a regular expression that simply excludes the characters you're testing for, which is much safer (hence [^#%&/()]; the parentheses need no escaping inside a character class).
The beginning and end sign match the beginning and end of a String.
The first will match any string, because it only requires zero or more occurrences of the class [A-Za-z0-9-_] (basically any string whatsoever).
The second will match an empty string or one made up entirely of those characters, but not one that contains characters not defined in [A-Za-z0-9-_].
Yes, it will. An unanchored regex can match anywhere in its input, so a string containing # will still match your first regex.
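The answers above can be compared directly in a short sketch. The character class is written as [A-Za-z0-9_-], with the hyphen last, to make it unambiguously literal; that spelling and the variable names are choices made for this example.

```python
import re

unanchored = re.compile(r"[A-Za-z0-9_-]*")
anchored = re.compile(r"^[A-Za-z0-9_-]*$")

bad = "abc#%&"

# Without $, match() succeeds on the valid prefix and ignores the rest.
print(unanchored.match(bad).group(0))      # abc
# With ^...$, the unwanted characters make the match fail.
print(anchored.match(bad))                 # None
# search() on the unanchored pattern even succeeds with an empty match.
print(repr(unanchored.search("#abc").group(0)))  # ''
# $ also matches just before a trailing newline; fullmatch() does not.
print(anchored.match("abc\n") is not None)  # True
print(unanchored.fullmatch("abc\n"))        # None
```

If a trailing newline must also be rejected, re.fullmatch (or \Z in place of $) is the stricter option.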
