Python regex negating metacharacters - python

Python metacharacter negation.
After scouring the net and writing a few different syntaxes I'm out of ideas.
Trying to rename some files. They have a year in the title e.g. [2002].
Some don't have the brackets, which I want to rectify.
So I'm trying to find a regex (that I can compile preferably) that in my mind looks something like (^[\d4^]) because I want the set of 4 numbers that don't have square brackets around them. I'm using the brackets in the hope of binding this so that I can then rename using something like [\1].

If you want to check for things around a pattern you can use lookahead and lookbehind assertions. These don't form part of the match but say what you expect to find (or not find) around it.
As we don't want brackets, we'll need to use a negative lookbehind and a negative lookahead.
A negative lookahead looks like this (?!...) where it matches if ... does not come next. Similarly a negative lookbehind looks like this (?<!...) and matches if ... does not come before.
Our example is made slightly more complicated because we're using [ and ], which themselves have meaning in regular expressions, so we have to escape them with \.
So we can build up a pattern as follows:
A negative lookbehind for [ - (?<!\[)
Four digits - \d{4}
A negative lookahead for ] - (?!\])
This gives us the following Python code:
>>> import re
>>> r = re.compile(r"(?<!\[)\d{4}(?!\])")
>>> r.match(" 2011 ")
>>> r.search(" 2011 ")
<_sre.SRE_Match object at 0x10884de00>
>>> r.search("[2011]")
To rename you can use the re.sub function or the sub method on your compiled pattern. To make it work you'll need to add an extra set of parentheses around the year to mark it as a group.
Also, when specifying your replacement you refer to the group as \1, so you have to either escape the \ or use a raw string.
>>> r = re.compile(r"(?<!\[)(\d{4})(?!\])")
>>> name = "2011 - This Year"
>>> r.sub(r"[\1]",name)
'[2011] - This Year'
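The substitution above can be applied to a whole batch of names. This is only a minimal sketch: it assumes the names are plain strings, the helper add_brackets and the sample titles are made up for illustration, and the actual renaming on disk is left out.

```python
import re

# Four digits not preceded by "[" and not followed by "]".
YEAR = re.compile(r"(?<!\[)(\d{4})(?!\])")

def add_brackets(name):
    """Wrap a bare four-digit year in square brackets (hypothetical helper)."""
    return YEAR.sub(r"[\1]", name)

# Sample titles invented for this sketch.
titles = ["2011 - This Year", "[2002] Already Done", "No Year Here"]
renamed = [add_brackets(t) for t in titles]
print(renamed)
```

Note that the pattern as written also matches the first four digits of a longer run such as 12345. If that matters, one option is to add \d to both lookarounds, i.e. (?<![\[\d]) and (?![\]\d]).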

Related

Python regex: how to achieve this complex replacement rule?

I'm working with long strings and I need to replace with '' all the combinations of adjacent full stops . and/or colons :, but only when they are not adjacent to any whitespace. Examples:
a.bcd should give abcd
a..::.:::.:bcde.....:fg should give abcdefg
a.b.c.d.e.f.g.h should give abcdefgh
a .b should give a .b, because . here is adjacent to a whitespace on its left, so it has not to be replaced
a..::.:::.:bcde.. ...:fg should give abcde.. ...:fg for the same reason
Well, here is what I tried (without any success).
Attempt 1:
s1 = r'a.b.c.d.e.f.g.h'
re.sub(re.search(r'[^\s.:]+([.:]+)[^\s.:]+', s1).group(1), r'', s1)
I would expect to get 'abcdefgh' but what I actually get is r''. I understood why: the code
re.search(r'[^\s.:]+([.:]+)[^\s.:]+', s1).group(1)
returns '.' instead of '\.', and thus re.search doesn't understand that it has to replace the single full stop . rather than understanding '.' as the usual regex.
Attempt 2:
s1 = r'a.b.c.d.e.f.g.h'
re.sub(r'([^\s.:]*\S)[.:]+(\S[^\s.:]*)', r'\g<1>\g<2>', s1)
This doesn't work as it returns a.b.c.d.e.f.gh.
Attempt 3:
s1 = r'a.b.c.d.e.f.g.h'
re.sub(r'([^\s.:]*)[.:]+([^\s.:]*)', r'\g<1>\g<2>', s1)
This works on s1, but it doesn't solve my problem because on s2 = r'a .b' it returns a b rather than a .b.
Any suggestion?
There are multiple problems here. Your regex doesn't match what you want to match; but also, your understanding of re.sub and re.search is off.
To find something, re.search lets you find where in a string that something occurs.
To replace that something, use re.sub with the same regular expression instead of re.search, not in addition to it.
And, understand that re.sub(r'thing(moo)other', '', s1) replaces the entire match with the replacement string.
With that out of the way, for your regex, it sounds like you want
r'(?<![\s.:])[.:]+(?![\s.:])' # updated from comments, thanks!
which contains a character class with full stop and colon (notice how no backslash is necessary inside the square brackets: this is a context where dot and colon have no special meaning; see footnote 1), repeated as many times as possible. The lookarounds on both sides say we cannot match these characters when there is whitespace \s on either side. They also exclude the characters themselves, so that the regex engine cannot find a match by applying the + less strictly (it will do its darndest to find a match if there is a way).
Now, the regex only matches the part you want to actually replace, so you can do
>>> import re
>>> s1 = 'name.surname#domain.com'
>>> re.sub(r'(?<![\s.:])[.:]+(?![\s.:])', r'', s1)
'namesurname#domaincom'
though in the broader scheme of things, you also need to know how to preserve some parts of the match. For the purpose of this demonstration, I will use a regular expression which captures into parenthesized groups the text before and after the dot or colon:
>>> re.sub(r'(.*\S)[.:]+(\S.*)', r'\g<1>\g<2>', s1)
'name.surname#domaincom'
See how \g<1> in the replacement string refers back to "whatever the first set of parentheses matched" and similarly \g<2> to the second parenthesized group.
You will also notice that this failed to replace the first full stop, because the .* inside the first set of parentheses matches as much of the string as possible. To avoid this, you need a regex which only matches as little as possible. We already solved that above with the lookarounds, so I will leave you here, though it would be interesting (and yet not too hard) to solve this in a different way.
1 You could even say that the normal regex language (or syntax, or notation, or formalism) is separate from the language (or syntax, or notation, or formalism) inside square brackets!
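As a quick check, the lookaround pattern can be run against the five examples from the question. This is just a sketch; the names PUNCT, examples and results are chosen here for illustration.

```python
import re

# Runs of "." and ":" that are not next to whitespace (or to more "." / ":").
PUNCT = re.compile(r"(?<![\s.:])[.:]+(?![\s.:])")

# Input -> expected output, copied from the question.
examples = {
    "a.bcd": "abcd",
    "a..::.:::.:bcde.....:fg": "abcdefg",
    "a.b.c.d.e.f.g.h": "abcdefgh",
    "a .b": "a .b",
    "a..::.:::.:bcde.. ...:fg": "abcde.. ...:fg",
}

results = {text: PUNCT.sub("", text) for text in examples}
for text, out in results.items():
    print(repr(text), "->", repr(out))
```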

Python regex find all matches

I'm using python 2.7 re library to find all numbers written in scientific form in a string. I'm using the following code:
import re
y = re.findall(".([0-9]+\.[0-9]+[eE][-+]?[0-9]+).","{8.25e+07|8.26206e+07}")
print y
However, the output is only ['8.25e+07'] while I'm expecting something like ['8.25e+07', '8.26206e+07']. I've been trying different things but couldn't find where the problem is. If I input y = re.findall(".([0-9]+\.[0-9]+[eE][-+]?[0-9]+).","|8.26206e+07}") then it gives ['8.26206e+07'], so the pattern does match the second number, but I don't get why it doesn't match both at the same time.
You are slightly overcomplicating your regex by misusing the . which matches any character while not actually needing it and using a capturing group () without really using it.
With your pattern you are looking for a number in scientific notation which has to be BOTH preceded and followed by exactly one character.
{8.25e+07|8.26206e+07}
[--------]
After re.findall traverses your string from the beginning it finds your defined pattern, which then drops the { and the | because of your capturing group (..) and saves this as a match. It then continues but only has 8.26206e+07} left. That now does not satisfy your pattern, because it is missing one "any" character for your first ., and no further match is found. Note that findall only looks for non-overlapping matches[1].
To illustrate, change your input string by duplicating your separator |:
>>> p = r".([0-9]+\.[0-9]+[eE][-+]?[0-9]+)."
>>> s = "{8.25e+07||8.26206e+07}"
>>> print(re.findall(p, s))
['8.25e+07', '8.26206e+07']
To satisfy your two .s you need two separators between any two numbers.
Two things I would change in your pattern, (1) remove the .s and (2) remove your capturing group ( ), you have no need for it:
p = r"[0-9]+\.[0-9]+[eE][-+]?[0-9]+"
Capturing groups can be very useful if you need to refer to specific captured groups again later, but your task at hand has no need for them.
[1] https://docs.python.org/2/library/re.html?highlight=findall#re.findall
Because findall is documented to
... Return all non-overlapping matches of pattern in string, as a list of strings.
But your patterns overlap: the leading . of the second match would have to be the | character, but that was already consumed by the trailing . of the first match.
Just remove those non-captured .s at the start and end of your regex.
I think you have extra dots.
Try this:
import re
y = re.findall(r"([0-9]+\.[0-9]+[eE][-+]?[0-9]+)","{8.25e+07|8.26206e+07}")
print (y)
When you use regular expressions to match, the default mode is to find all non-overlapping matches. With the dots at both the beginning and the end, the two matches would have to overlap.
r"([0-9]+\.[0-9]+[eE][-+]?[0-9]+)"
should work
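To see the difference side by side, here is a small sketch comparing the original pattern with the one without the surrounding dots. The variable names are illustrative, and the float conversion at the end is an extra step not asked for in the question.

```python
import re

text = "{8.25e+07|8.26206e+07}"

# Original pattern: each match also consumes one character on either side,
# so the "|" cannot serve both the first and the second match.
with_dots = re.findall(r".([0-9]+\.[0-9]+[eE][-+]?[0-9]+).", text)

# Without the surrounding dots the matches no longer compete for "|".
without_dots = re.findall(r"[0-9]+\.[0-9]+[eE][-+]?[0-9]+", text)

# Optional extra step: turn the matched strings into numbers.
values = [float(s) for s in without_dots]

print(with_dots)
print(without_dots)
print(values)
```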

Python Regex for alpha(alpha|digit)*

I'm trying to produce a python regex to represent identifiers for a lexical analyzer. My approach is:
([a-zA-Z]([a-zA-Z]|\d)*)
When I use this in:
regex = re.compile("\s*([a-zA-Z]([a-zA-Z]|\d)*)")
regex.findall(line)
It doesn't produce a list of identifiers like it should. Have I built the expression incorrectly?
What's a good way to represent the form:
alpha(alpha|digit)*
With the python re module?
Like this:
regex = re.compile(r'[a-zA-Z][a-zA-Z\d]*')
Note the r before the quote to obtain a raw string, otherwise you need to escape all backslashes.
Since the leading \s* is optional, it contributes nothing and you can remove it, along with the capture groups.
If you want to ensure that the match isn't preceded by a digit, you can write it like this with a negative lookbehind (?<!...):
regex = re.compile(r'(?:^|(?<![\da-zA-Z]))[a-zA-Z][a-zA-Z\d]*')
Note that with re.compile you can use the case insensitive option:
regex = re.compile(r'(?:^|(?<![\da-z]))[a-z][a-z\d]*', re.I)
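The reason the original expression does not return a flat list is that findall returns the contents of the groups, not the whole match, whenever the pattern contains capture groups. A short sketch of the three variants follows; the sample inputs are made up for illustration.

```python
import re

line = "foo bar42 x"

# With capture groups, findall returns one tuple of groups per match.
# A repeated group keeps only its last repetition, and a group that
# did not take part in the match shows up as ''.
grouped = re.findall(r"\s*([a-zA-Z]([a-zA-Z]|\d)*)", line)

# Without groups, findall returns the matched text itself.
plain = re.findall(r"[a-zA-Z][a-zA-Z\d]*", line)

# The lookbehind variant refuses to start right after a digit or letter.
ident = re.compile(r"(?:^|(?<![\da-zA-Z]))[a-zA-Z][a-zA-Z\d]*")
loose = re.findall(r"[a-zA-Z][a-zA-Z\d]*", "9lives foo")
strict = ident.findall("9lives foo")

print(grouped)
print(plain)
print(loose, strict)
```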

Why does `(c*)|(cccd)` match `ccc`, not `cccd`?

I thought I understood Regular Expressions pretty well, but why is this matching 'ccc', not 'cccd'?
>>> mo = re.match('(c*)|(cccd)', 'cccd')
>>> mo.group(0)
'ccc'
This particular case is using Python's re module.
Regex alternatives are tried from left to right. Put the pattern with higher precedence first (to the left of |) and the one with lower precedence second (to the right of |). Note that the second pattern is never tried at a position where the first one has already matched. That is, the regex engine does not produce overlapping matches by default. To get overlapping matches, you need to put your pattern inside a capturing group and put that capturing group inside a positive lookaround assertion (a positive lookahead or a positive lookbehind).
mo = re.match('(cccd)|(c*)', 'cccd')
Your regex ((c*)|(cccd)) is saying match either one of two things:
0 or unlimited c's
The literal sequence cccd
The engine tries the alternatives in order, so it starts with c*. Because c* is greedy, it consumes ccc, and since that is already a successful match the second alternative is never tried; ccc is what you get back.
To correct to what you want, try the regex: (cccd)|(c*). With this:
>>> mo = re.match('(cccd)|(c*)', 'cccd')
>>> mo.group(0)
'cccd'
Example is here: https://regex101.com/r/aU8pE7/1
(c*) matches 'ccc', thus you get the match. To match "cccd", use ^(?:(c*)|(cccd))$
See demo.
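The suggestions from the answers can be compared in one go. This is a sketch with illustrative variable names; the last line uses re.fullmatch, which is not mentioned in the answers and is only available from Python 3.4 on.

```python
import re

text = "cccd"

# Leftmost alternative wins as soon as it succeeds.
first = re.match(r"(c*)|(cccd)", text).group(0)

# Swapping the alternatives lets the longer literal be tried first.
swapped = re.match(r"(cccd)|(c*)", text).group(0)

# Anchoring forces the engine to backtrack into the second alternative.
anchored = re.match(r"^(?:(c*)|(cccd))$", text).group(0)

# Addition for this sketch: fullmatch requires the whole string to match.
whole = re.fullmatch(r"(c*)|(cccd)", text).group(0)

print(first, swapped, anchored, whole)
```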

Python Regex - Is it possible to use the same group (named or unnamed) in multiple spots?

I have a bunch of strings, some of which I need to replace a part of. However, the parts before and after the parts that need to be replaced are not always the same. Also, the part of the string that needs to be replaced is not something I can match with a regex without it matching other parts that I don't want to replace. For example:
"prefixA_REPLACEME_postfixA",
"prefixB_SOMETHING_postfixB",
"prefixA_LLAMAS_postfixC",
"prefixB_DONTREPLACE_postfixA",
Turned into:
"prefixA_NEWSTR_postfixA",
"prefixB_NEWSTR_postfixB",
"prefixA_NEWSTR_postfixC",
"prefixB_DONTREPLACE_postfixA",
I would love to do this with a single regex, like this:
re.sub('(prefixA_).*(_postfixA)|(prefixB_).*(_postfixB)|(prefixA_).*(_postfixC)', '\\1NEWSTR\\2', stringToFix)
Unfortunately this doesn't work, because group 1 and group 2 are (prefixA_) and (postfixA), whether or not that is the part of the regex that ends up being used. I also can't use this
re.sub('(?P<one>prefixA_).*(?P<two>_postfixA)|(?P<one>prefixB_).*(?P<two>_postfixB)|(?P<one>prefixA_).*(?P<two>_postfixC)', '\\1NEWSTR\\2', stringToFix)
because it gives me the error
sre_constants.error: redefinition of group name 'one' as group 3; was group 1
Something else that won't work is this
re.sub('(prefixA_|prefixB).*(_postfixA|_postfixB|_postfixC)', '\\1NEWSTR\\2', stringToFix)
because this would capture the fourth string, which I don't want to be matched.
So is there a way to make it so that any uncaptured groups are not counted (which would make my first regex work correctly)? Or any other way to do this with a single regex?
You can't define a named capturing group more than once within the same regex (unlike other regex flavors like .NET). But since you're not doing anything with the pre- and postfixes, you can simply use lookaround assertions:
>>> s = """prefixA_REPLACEME_postfixA
... prefixB_SOMETHING_postfixB
... prefixA_LLAMAS_postfixC
... prefixB_DONTREPLACE_postfixA"""
>>> import re
>>> print(re.sub("(?<=prefixA).*(?=postfixA)|(?<=prefixB).*(?=postfixB)|(?<=prefixA).*(?=postfixC)", "_NEWSTR_", s))
prefixA_NEWSTR_postfixA
prefixB_NEWSTR_postfixB
prefixA_NEWSTR_postfixC
prefixB_DONTREPLACE_postfixA
Looks like what you want to do is use:
if re.search("shouldReplaceRegex",matchstring): matchstring = re.sub("_.*?_","_yourReplacement_",matchstring)
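Another possibility, not taken from the answers above, is to keep the numbered groups from the first attempt and pass a function as the replacement, so that only the groups that actually took part in the match are used. A minimal sketch, where swap_middle and the other names are hypothetical:

```python
import re

# Same alternation as in the question, with plain numbered groups.
PATTERN = re.compile(
    r"(prefixA_).*(_postfixA)|(prefixB_).*(_postfixB)|(prefixA_).*(_postfixC)"
)

def swap_middle(match):
    """Rebuild the string from whichever prefix/postfix pair matched."""
    # Only one alternative can match, so exactly two groups are not None.
    prefix, postfix = [g for g in match.groups() if g is not None]
    return prefix + "NEWSTR" + postfix

strings = [
    "prefixA_REPLACEME_postfixA",
    "prefixB_SOMETHING_postfixB",
    "prefixA_LLAMAS_postfixC",
    "prefixB_DONTREPLACE_postfixA",
]
fixed = [PATTERN.sub(swap_middle, s) for s in strings]
print(fixed)
```

re.sub accepts a callable as its replacement argument and calls it with each match object, which sidesteps the problem of \1 and \2 always referring to the first alternative.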
