Perhaps a silly question, but though Google returned lots of similar cases, I could not find this exact situation: what regular expression will match all strings NOT containing a particular string? For example, I want to match any string that does not contain 'foo_'.
Now,
re.match('(?<!foo_).*', 'foo_bar')
returns a match. While
re.match('(?<!foo_)bar', 'foo_bar')
does not.
I tried the non-greedy version:
re.match('(?<!foo_).*?', 'foo_bar')
still returns a match.
If I add more characters after the ),
re.search('(?<!foo_)b.*', 'foo_bar')
it returns None, but if the target string has more trailing chars:
re.search('(?<!foo_)b.*', 'foo_barbaric')
it returns a match.
I intentionally kept out the initial .* or .*? in the re. But same thing happens with that.
Any ideas why this strange behaviour? (I need this as a single regular expression - to be entered as a user input).
You're using lookbehind assertions where you need lookahead assertions:
re.match(r"(?!.*foo_).*", "foo_bar")
would work (i.e. not match).
(?!.*foo_) means "Assert that it is impossible to match .*foo_ from the current position in the string." Since you're using re.match(), that position is automatically defined as the start of the string.
Try this pattern instead:
^(?!.*foo_).*
This uses the ^ metacharacter to match from the beginning of the string, and then uses a negative look-ahead that checks for "foo_". If it exists, the match will fail.
Since you gave examples using both re.match() and re.search(), the above pattern would work with both approaches. However, when you're using re.match() you can safely omit the usage of the ^ metacharacter since it will match at the beginning of the string, unlike re.search() which matches anywhere in the string.
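To make the lookahead behaviour concrete, here is a minimal sketch that runs the pattern above against a few strings. The helper name lacks_foo and the sample strings are my own, chosen only for illustration.

```python
import re

# Pattern from the answers above: succeed only if "foo_" occurs nowhere ahead.
NO_FOO = re.compile(r"^(?!.*foo_).*")

def lacks_foo(text):
    """Return True when `text` does not contain 'foo_' (hypothetical helper)."""
    return NO_FOO.match(text) is not None

samples = ["foo_bar", "bar", "xfoo_", "foo", ""]
results = {s: lacks_foo(s) for s in samples}
print(results)
```

One caveat: by default . does not match a newline, so for multi-line input the lookahead only inspects the first line. Passing re.DOTALL when compiling should cover that case.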
I feel like there is a good chance that you could just design around this with a conditional statement.
(It would be nice if we knew specifically what you're trying to accomplish).
Why not:
# re.search scans the whole string; re.match would only test the start
if not re.search("foo_", something):
    do_something()
else:
    print("Skipping this")
What I am trying to do is match values from one file to another, but I only need to match the first portion of the string and the last portion.
I am reading each file into a list, and manipulating these based on different Regex patterns I have created. Everything works, except when it comes to these type of values:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
In this example, I only want to match 'V-1\ZDS\R\EMBO-20' and then compare the '24' value at the end of the string. The number x in '20-x:', can vary and doesn't matter in terms of comparisons, as long as the first and last parts of this string match.
This is the Regex I am using:
re.compile(r"(?:.*V-1\\ZDS\\R\\EMBO-20-\d.*)(:\d*\w.*)")
Once I filter down the list, I use the following function to return the difference between the two sets:
funcDiff = lambda x, y: list((set(x)- set(y))) + list((set(y)- set(x)))
Is there a way to take the list of differences and filter out the ones that have matching values after the : as mentioned above?
I apologize if this is an obvious answer, I'm new to Python and Regex!
The output I get is the differences between the entire strings, so even if the first and last part of the string match, if the number following the 'EMBO-20-x' doesn't also match, it returns it as being different.
Before discussing your question, regex101 is an incredibly useful tool for this type of thing.
Your problem stems from two issues:
1.) The way you used .*
2.) Greedy vs. Nongreedy matches
.* kinda sucks
.* is a regex expression that is very rarely what you actually want.
As a quick aside, a useful regex expression is [^c]* or [^c]+. These expressions match any character except the letter c, with the first expression matching 0 or more, and the second matching 1 or more.
.* will match all characters as many times as it can. Instead, try to start your regex patterns with more concrete starting points. Two good ways to do this are lookbehind expressions and anchors.
Another quick aside: it's likely that you are confusing re.match and re.search. re.match will only return a match that begins at the start of the string, while re.search will return a match anywhere in the input string. This could be the reason you included the .* in the first place, to allow a re.match call to find a match deeper in the string.
Lookbehind Expressions
There are more complete explanations online, but in short, regex patterns like:
(?<=test)foo
will match the text foo, but only if test is right in front of it. To be more clear, the following strings will not match that regex:
foo
test-foo
test foo
but the following string will match:
testfoo
This will only match the text foo, though.
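A quick way to check the four strings above is the sketch below. Note that it uses re.search; with re.match the pattern could never succeed, because nothing precedes the start of the string.

```python
import re

pattern = re.compile(r"(?<=test)foo")
candidates = ["foo", "test-foo", "test foo", "testfoo"]

# Map each candidate to the span of the match, or None when there is none.
spans = {}
for text in candidates:
    m = pattern.search(text)
    spans[text] = m.span() if m else None

print(spans)
```

Only testfoo yields a match, and its span covers just the foo part, since the lookbehind itself consumes no characters.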
Anchors
Another option is anchors. ^ and $ are special characters, matching the start and end of a line of text. If you know your regex pattern will match exactly one line of text, start it with ^ and end it with $.
Leading patterns with .* and ending with .* are likely the source of your issue. Although you did not include full examples of your input or your code, you likely used re.match as opposed to re.search.
In regex, . matches any character, and * means 0 or more times. This means that for any input, your pattern will match the entire string.
Greedy vs. Non-Greedy qualifiers
The second issue is related to greediness. When your regex patterns have a * in them, they can match 0 or more characters. This can hide problems, as entire * expressions can be skipped. Your regex is likely matching several lines of text as one match, hiding multiple records in a single .*.
The Actual Answer
Taking all of this into consideration, let's assume that your input data looks like this:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
V-1\ZDS\R\EMBO-20-3:93
V-1\ZDS\R\EMBO-20-6:22309
V-1\ZDS\R\EMBO-20-8:2238
V-1\ZDS\R\EMBO-20-3:28
A better regular expression would be:
^V-1\\ZDS\\R\\EMBO-20-\d:(\d+)$
There are several differences I would like to highlight:
Starting the expression with ^ and ending with $. This forces the regex to match exactly one line. Even though the pattern works without these characters, it's good practice when working with regex to be as explicit as possible.
No useless non-capturing group. Your example had a (?:) group at the start. This denotes a group that does not capture its match. It's useful if you want to match a subpattern multiple times ((?:ab){5} matches ababababab without capturing anything). However, in your example, it did nothing :)
Only capturing the number. This makes it easier to extract the value of the capture groups.
No use of *, one use of +. + works like *, but it matches 1 or more. This is often more correct, as it prevents 'skipping' entire characters.
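For the comparison the question actually asks about, one possible sketch is to reduce every entry to a (prefix, trailing number) key and diff the keys rather than the whole strings. The pattern below is my own variation on the one above (it also captures the prefix, and uses \d+ in case x has more than one digit), and the helper names and sample lists are made up for illustration.

```python
import re

# Assumed variation of the pattern above: the prefix is captured too, and
# \d+ tolerates a multi-digit x in "20-x:".
ENTRY = re.compile(r"^(V-1\\ZDS\\R\\EMBO-20)-\d+:(\d+)$")

def entry_key(entry):
    """Return (prefix, trailing number), or None if the line does not fit."""
    m = ENTRY.match(entry.strip())
    return m.groups() if m else None

def key_difference(lines_a, lines_b):
    """Keys present in exactly one of the two lists (hypothetical helper)."""
    keys_a = {entry_key(e) for e in lines_a} - {None}
    keys_b = {entry_key(e) for e in lines_b} - {None}
    return sorted(keys_a ^ keys_b)

file_a = [r"V-1\ZDS\R\EMBO-20-1:24", r"V-1\ZDS\R\EMBO-20-3:93"]
file_b = [r"V-1\ZDS\R\EMBO-20-6:24", r"V-1\ZDS\R\EMBO-20-3:28"]
diff = key_difference(file_a, file_b)
print(diff)
```

Here the two :24 entries count as equal even though x differs (1 versus 6), so only the :93 and :28 keys are reported. Lines that do not fit the pattern are silently dropped in this sketch; you may prefer to report them instead.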
These are the results from Python 2.7.
>>> re.sub('.*?', '-', 'abc')
'-a-b-c-'
The results I thought should be as follows.
>>> re.sub('.*?', '-', 'abc')
'-------'
But it's not. Why?
The best explanation of this behaviour I know of is from the regex PyPI package, which is intended to eventually replace re (although it has been this way for a long time now).
Sometimes it’s not clear how zero-width matches should be handled. For example, should .* match 0 characters directly after matching >0 characters?
Most regex implementations follow the lead of Perl (PCRE), but the re module sometimes doesn’t. The Perl behaviour appears to be the most common (and the re module is sometimes definitely wrong), so in version 1 the regex module follows the Perl behaviour, whereas in version 0 it follows the legacy re behaviour.
Examples:
# Version 0 behaviour (like re)
>>> regex.sub('(?V0).*', 'x', 'test')
'x'
>>> regex.sub('(?V0).*?', '|', 'test')
'|t|e|s|t|'
# Version 1 behaviour (like Perl)
>>> regex.sub('(?V1).*', 'x', 'test')
'xx'
>>> regex.sub('(?V1).*?', '|', 'test')
'|||||||||'
(?VX) sets the version flag in the regex. The second example is what you expect, and is supposedly what PCRE does. Python's re is somewhat nonstandard, and is kept as it is probably solely due to backwards compatibility concerns. I've found an example of something similar (with re.split).
For your new, edited question:
The .*? can match any number of characters, including zero. So what it does is it matches zero characters at every position in the string: before the "a", between the "a" and "b", etc. It replaces each of those zero-width matches with a hyphen, giving the result you see.
The regex does not try to match each character one by one; it tries to match at each position in the string. Your regex allows it to match zero characters. So it matches zero at each position and moves on to the next. You seem to be thinking that in a string like "abc" there is one position before the "b", one position "inside" the "b", and one position after "b", but there isn't a position "inside" an individual character. If it matches zero characters starting before "b", the next thing it tries is to match starting after "b". There's no way you can get a regex to match seven times in a three-character string, because there are only four positions to match at.
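The "positions" idea can be checked directly with a small sketch that tries the pattern once at every position, independently of how re.sub iterates. I have deliberately not hard-coded a single expected re.sub result: as far as I know, Python 3.7 changed re.sub to also replace empty matches adjacent to a previous non-empty match, so newer interpreters may print something different from the 2.7 output quoted in the question.

```python
import re

text = "abc"
lazy = re.compile(r".*?")

# Try the pattern once at each position; every attempt matches zero characters.
spans = [lazy.match(text, pos).span() for pos in range(len(text) + 1)]
print(spans)

# The overall substitution result depends on how the interpreter iterates.
result = re.sub(r".*?", "-", text)
print(result)
```

The list of spans shows one zero-width match at each of the four positions, which is where the four hyphens in '-a-b-c-' come from.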
Are you sure you interpreted re.sub's documentation correctly?
*?, +?, ?? The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if
the RE <.*> is matched against '<H1>title</H1>', it will match the
entire string, and not just '<H1>'. Adding '?' after the qualifier
makes it perform the match in non-greedy or minimal fashion; as few
characters as possible will be matched. Using .*? in the previous
expression will match only '<H1>'.
Adding a ? will turn the expression into a non-greedy one.
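The example from the documentation is easy to reproduce; a minimal sketch, with variable names of my own choosing:

```python
import re

html = "<H1>title</H1>"

# Greedy: .* runs to the last '>' it can reach.
greedy = re.match(r"<.*>", html).group()

# Non-greedy: .*? stops at the first '>' that lets the pattern succeed.
lazy = re.match(r"<.*?>", html).group()

print(greedy)
print(lazy)
```

The greedy version returns the whole string, the non-greedy one only the opening tag.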
Greedy:
re.sub(".*", "-", "abc")
non-Greedy:
re.sub(".*?", "-", "abc")
Update: FWIW re.sub does exactly what it should:
>>> from re import sub
>>> sub(".*?", "-", "abc")
'-a-b-c-'
>>> sub(".*", "-", "abc")
'-'
See @BrenBarn's awesome answer on why you get -a-b-c- :)
To elaborate on Veedrac's answer, different implementations treat zero-width matches differently in FindAll (or ReplaceAll) operations. Two behaviors can be observed among implementations, and Python's re simply follows the first.
1. Always bump along by one character on zero-width match
In Java and JavaScript, zero-width match causes the index to bump along by one character, since staying at the same index will cause an infinite loop in FindAll or ReplaceAll operations.
As a result, output of FindAll operations in such implementation can contain at most 1 match starting at a particular index.
The default Python re package probably also follows the same implementation (and it seems to be also the case for Ruby).
2. Disallow zero-width match on next match at same index
In PHP, which provides a wrapper over the PCRE library, a zero-width match does not cause the index to bump along immediately. Instead, it will set a flag (PCRE_NOTEMPTY) requiring the next match (which starts at the same index) to be a non-zero-width match. If the match succeeds, it will bump along by the length of the match (non-zero); otherwise, it bumps along by one character.
By the way, PCRE library does not provide built-in FindAll or ReplaceAll operation. It is actually provided by PHP wrapper.
As a result, output of FindAll operations in such implementation can contain up to 2 matches starting at the same index.
Python regex package probably follows this line of implementation.
This line of implementation is more complex, since it requires the implementation of FindAll or ReplaceAll to keep extra state recording whether a zero-width match is disallowed. Developers also need to keep track of this extra flag when they use the low-level matching API.
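The first strategy (always bump along by one character after a zero-width match) can be sketched in a few lines on top of the low-level search call. This is only my illustration of the strategy as described above, not a claim about how re, Java or JavaScript are implemented internally; the function name is hypothetical.

```python
import re

def find_all_bump_along(pattern, text):
    """Collect match spans, advancing one character after each empty match."""
    compiled = re.compile(pattern)
    spans = []
    pos = 0
    while pos <= len(text):
        m = compiled.search(text, pos)
        if m is None:
            break
        spans.append(m.span())
        # An empty match would otherwise be found again at the same index.
        pos = m.end() + 1 if m.end() == m.start() else m.end()
    return spans

print(find_all_bump_along(r".*?", "abc"))
print(find_all_bump_along(r".*", "test"))
```

With this loop no two matches ever start at the same index. The second strategy would need a way to demand a non-empty match at a given position, which re does not expose as a flag as far as I know, so I have left it out.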
I'm working with long strings and I need to replace with '' all the combinations of adjacent full stops . and/or colons :, but only when they are not adjacent to any whitespace. Examples:
a.bcd should give abcd
a..::.:::.:bcde.....:fg should give abcdefg
a.b.c.d.e.f.g.h should give abcdefgh
a .b should give a .b, because . here is adjacent to a whitespace on its left, so it must not be replaced
a..::.:::.:bcde.. ...:fg should give abcde.. ...:fg for the same reason
Well, here is what I tried (without any success).
Attempt 1:
s1 = r'a.b.c.d.e.f.g.h'
re.sub(re.search(r'[^\s.:]+([.:]+)[^\s.:]+', s1).group(1), r'', s1)
I would expect to get 'abcdefgh' but what I actually get is r''. I understood why: the code
re.search(r'[^\s.:]+([.:]+)[^\s.:]+', s1).group(1)
returns '.' instead of '\.', and thus re.sub treats that '.' as the usual regex wildcard rather than as a literal full stop to be replaced.
Attempt 2:
s1 = r'a.b.c.d.e.f.g.h'
re.sub(r'([^\s.:]*\S)[.:]+(\S[^\s.:]*)', r'\g<1>\g<2>', s1)
This doesn't work as it returns a.b.c.d.e.f.gh.
Attempt 3:
s1 = r'a.b.c.d.e.f.g.h'
re.sub(r'([^\s.:]*)[.:]+([^\s.:]*)', r'\g<1>\g<2>', s1)
This works on s1, but it doesn't solve my problem because on s2 = r'a .b' it returns a b rather than a .b.
Any suggestion?
There are multiple problems here. Your regex doesn't match what you want to match; but also, your understanding of re.sub and re.search is off.
To find something, re.search lets you find where in a string that something occurs.
To replace that something, use re.sub on the same regular expression instead of re.search, not as well.
And, understand that re.sub(r'thing(moo)other', '', s1) replaces the entire match with the replacement string.
With that out of the way, for your regex, it sounds like you want
r'(?<![\s.:])[.:]+(?![\s.:])' # updated from comments, thanks!
which contains a character class with full stop and colon (notice how no backslash is necessary inside the square brackets -- this is a context where dot and colon do not have any special meaning [1]), repeated as many times as possible; and lookarounds on both sides to say we cannot match these characters when there is whitespace \s on either side, and also excluding the characters themselves so that there is no way for the regex engine to find a match by applying the + less strictly (it will do its darndest to find a match if there is a way).
Now, the regex only matches the part you want to actually replace, so you can do
>>> import re
>>> s1 = 'name.surname#domain.com'
>>> re.sub(r'(?<![\s.:])[.:]+(?![\s.:])', r'', s1)
'namesurname#domaincom'
though in the broader scheme of things, you also need to know how to preserve some parts of the match. For the purpose of this demonstration, I will use a regular expression which captures into parenthesized groups the text before and after the dot or colon:
>>> re.sub(r'(.*\S)[.:]+(\S.*)', r'\g<1>\g<2>', s1)
'name.surname#domaincom'
See how \g<1> in the replacement string refers back to "whatever the first set of parentheses matched" and similarly \g<2> to the second parenthesized group.
You will also notice that this failed to replace the first full stop, because the .* inside the first set of parentheses matches as much of the string as possible. To avoid this, you need a regex which only matches as little as possible. We already solved that above with the lookarounds, so I will leave you here, though it would be interesting (and yet not too hard) to solve this in a different way.
[1] You could even say that the normal regex language (or syntax, or notation, or formalism) is separate from the language (or syntax, or notation, or formalism) inside square brackets!
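As a sanity check, here is a small sketch that runs the lookaround regex against the five examples from the question. The helper name strip_inner_punct is mine.

```python
import re

# Runs of '.'/':' that touch neither whitespace nor more '.'/':' on either side.
STRAY_PUNCT = re.compile(r"(?<![\s.:])[.:]+(?![\s.:])")

def strip_inner_punct(text):
    """Remove the matching runs from `text` (hypothetical helper)."""
    return STRAY_PUNCT.sub("", text)

cases = {
    "a.bcd": "abcd",
    "a..::.:::.:bcde.....:fg": "abcdefg",
    "a.b.c.d.e.f.g.h": "abcdefgh",
    "a .b": "a .b",
    "a..::.:::.:bcde.. ...:fg": "abcde.. ...:fg",
}
actual = {text: strip_inner_punct(text) for text in cases}
print(actual == cases)
```

One thing the question does not specify is what should happen to a run at the very start or end of the string; with this pattern such a run is removed, because there is no neighbouring whitespace to block it.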
I'm having a hell of a time trying to transfer my experience with javascript regex to Python.
I'm just trying to get this to work:
print(re.match('e','test'))
...but it prints None. If I do:
print(re.match('e','est'))
It matches... does it by default match the beginning of the string? When it does match, how do I use the result?
How do I make the first one match? Is there better documentation than the python site offers?
re.match implicitly adds ^ to the start of your regex. In other words, it only matches at the start of the string.
re.search will retry at all positions.
Generally speaking, I recommend using re.search and adding ^ explicitly when you want it.
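Since the question also asks how to use the result, here is a short sketch of the difference together with the most common match-object methods:

```python
import re

# re.match only tries position 0, so 'e' is not found in 'test'.
at_start = re.match("e", "test")

# re.search scans forward and returns the first match anywhere.
found = re.search("e", "test")

print(at_start)        # no match object is returned
print(found.group())   # the matched text
print(found.span())    # (start, end) indices of the match

# Anchoring explicitly makes re.search behave like re.match here.
anchored = re.search("^e", "test")
print(anchored)
```

A failed match returns None, so it is worth testing the result (if found: ...) before calling methods on it.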
http://docs.python.org/library/re.html
The docs are clear, I think.
re.match(pattern, string[, flags])
If zero or more characters **at the beginning of string** match the
regular expression pattern, return a corresponding MatchObject
instance. Return None if the string does not match the pattern; note
that this is different from a zero-length match.
I am using python, and this regexp doesn't match, and I don't understand why.
import re
string = "15++12"
if re.match("[-+*/][-+*/]+",string):
    pass  # raise an error here
I am trying to raise an error, if one or more of "-","+","*","/" follows another one of those.
Use re.search() as re.match() only searches at the beginning of the string:
string = "15++12"
if re.search("[-+*/][-+*/]+",string):
    pass  # raise an error here
Also, this could be simplified to:
string = "15++12"
if re.search("[-+*/]{2,}",string):
    pass  # raise an error here
as the {2,} operator searches for two or more of the previous class.
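Put together, the check might look like the sketch below. The function name check_expression and the error message are my own invention; only the pattern comes from the answer above.

```python
import re

REPEATED_OPERATOR = re.compile(r"[-+*/]{2,}")

def check_expression(expr):
    """Raise ValueError if two or more operators appear in a row."""
    m = REPEATED_OPERATOR.search(expr)
    if m:
        raise ValueError(
            "operators %r repeated at index %d" % (m.group(), m.start())
        )
    return expr

print(check_expression("15+12"))
try:
    check_expression("15++12")
except ValueError as exc:
    message = str(exc)
    print(message)
```

Be aware that this also rejects inputs such as 2*-3 or 2**3, which may or may not be what you want.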
re.match tries to match from the beginning of the string. To match any substring, either use re.search or put a .* before the pattern:
>>> s = '15++12'
>>> re.match("[-+*/][-+*/]+", s)
>>> re.search("[-+*/][-+*/]+", s)
<_sre.SRE_Match object at 0x7f5639474780>
>>> re.match(".*[-+*/][-+*/]+", '15++12')
<_sre.SRE_Match object at 0x7f5639404c60>
I believe it is because re.match matches only the beginning of the string. Try re.search or re.findall
Check out 7.2.2 at python docs:
http://docs.python.org/library/re.html
Python violates the Principle of Least Surprise here: they've chosen a
word with an established meaning and warped it into meaning something
different from that. This isn't quite evil and wrong, but it is
certainly stupid and wrong. – tchrist
I don't agree. In fact, I think exactly the contrary: it isn't stupid.
If I say :
a regex's pattern "\d+[abc]" matches the string '145caba'
everybody will agree with this assertion.
If I say :
a regex's pattern "\d+[abc]" matches the string 'ref/ 789lomono
145abaca ubulutatouti'
80 % of people will agree
and the other, more rigorous 20 % of people, among whom I count myself, will be unsatisfied by the wording and will ask that the expression be changed to :
"\d+[abc]" matches SOMEWHERE in the string 'ref/ 789lomono
145abaca ubulutatouti'
That's why I find it justified to call the action of searching whether and where a pattern matches in a string: search()
and to call the action of verifying whether a match occurs from the beginning: match()
For me it's very logical, not surprising.
PS
A former answer of mine has been deleted. As I don't know how to contact the person who deleted it to ask why he judged my former answer to be a rant (!!!?), I am re-posting what seems to me absolutely impossible to qualify as such.