Python regex: how to achieve this complex replacement rule?

Python regex: how to achieve this complex replacement rule? - python

I'm working with long strings and I need to replace with '' all the combinations of adjacent full stops . and/or colons :, but only when they are not adjacent to any whitespace. Examples:
a.bcd should give abcd
a..::.:::.:bcde.....:fg should give abcdefg
a.b.c.d.e.f.g.h should give abcdefgh
a .b should give a .b, because . here is adjacent to a whitespace on its left, so it has not to be replaced
a..::.:::.:bcde.. ...:fg should give abcde.. ...:fg for the same reason
Well, here is what I tried (without any success).
Attempt 1:
s1 = r'a.b.c.d.e.f.g.h'
re.sub(re.search(r'[^\s.:]+([.:]+)[^\s.:]+', s1).group(1), r'', s1)
I would expect to get 'abcdefgh' but what I actually get is r''. I understood why: the code
re.search(r'[^\s.:]+([.:]+)[^\s.:]+', s1).group(1)
returns '.' instead of '\.', and thus re.search doesn't understand that it has to replace the single full stop . rather than understanding '.' as the usual regex.
Attempt 2:
s1 = r'a.b.c.d.e.f.g.h'
re.sub(r'([^\s.:]*\S)[.:]+(\S[^\s.:]*)', r'\g<1>\g<2>', s1)
This doesn't work as it returns a.b.c.d.e.f.gh.
Attempt 3:
s1 = r'a.b.c.d.e.f.g.h'
re.sub(r'([^\s.:]*)[.:]+([^\s.:]*)', r'\g<1>\g<2>', s1)
This works on s1, but it doesn't solve my problem because on s2 = r'a .b' it returns a b rather than a .b.
Any suggestion?

There are multiple problems here. Your regex doesn't match what you want to match; but also, your understanding of re.sub and re.search is off.
To find something, re.search lets you find where in a string that something occurs.
To replace that something, use re.sub on the same regular expression instead of re.search, not as well.
And, understand that re.sub(r'thing(moo)other', '', s1) replaces the entire match with the replacement string.
With that out of the way, for your regex, it sounds like you want
r'(?<![\s.:])[.:]+(?![\s.:])' # updated from comments, thanks!
which contains a character class with full stop and colon (notice how no backslash is necessary inside the square brackets -- this is a context where dot and colon do not have any special meaning1), repeated as many times as possible; and lookarounds on both sides to say we cannot match these characters when there is whitespace \s on either side, and also excluding the characters themselves so that there is no way for the regex engine to find a match by applying the + less strictly (it will do its darndest to find a match if there is a way).
Now, the regex only matches the part you want to actually replace, so you can do
>>> import re
>>> s1 = 'name.surname#domain.com'
>>> re.sub(r'(?<![\s.:])[.:]+(?![\s.:])', r'', s1)
'namesurname#domaincom'
though in the broader scheme of things, you also need to know how to preserve some parts of the match. For the purpose of this demonstration, I will use a regular expression which captures into parenthesized groups the text before and after the dot or colon:
>>> re.sub(r'(.*\S)[.:]+(\S.*)', r'\g<1>\g<2>', s1)
'name.surname#domaincom'
See how \g<1> in the replacement string refers back to "whatever the first set of parentheses matched" and similarly \g<2> to the second parenthesized group.
You will also notice that this failed to replace the first full stop, because the .* inside the first set of parentheses matches as much of the string as possible. To avoid this, you need a regex which only matches as little as possible. We already solved that above with the lookarounds, so I will leave you here, though it would be interesting (and yet not too hard) to solve this in a different way.
1 You could even say that the normal regex language (or syntax, or notation, or formalism) is separate from the language (or syntax, or notation, or formalism) inside square brackets!

Related

Regex Statement to only match parts of a string for comparison - Python

What I am trying to do is match values from one file to another, but I only need to match the first portion of the string and the last portion.
I am reading each file into a list, and manipulating these based on different Regex patterns I have created. Everything works, except when it comes to these type of values:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
In this example, I only want to match 'V-1\ZDS\R\EMBO-20' and then compare the '24' value at the end of the string. The number x in '20-x:', can vary and doesn't matter in terms of comparisons, as long as the first and last parts of this string match.
This is the Regex I am using:
re.compile(r"(?:.*V-1\\ZDS\\R\\EMBO-20-\d.*)(:\d*\w.*)")
Once I filter down the list, I use the following function to return the difference between the two sets:
funcDiff = lambda x, y: list((set(x)- set(y))) + list((set(y)- set(x)))
Is there a way to take the list of differences and filter out the ones that have matching values after the
:
as mentioned above?
I apologize is this is an obvious answer, I'm new to Python and Regex!
The output I get is the differences between the entire strings, so even if the first and last part of the string match, if the number following the 'EMBO-20-x' doesn't also match, it returns it as being different.

Before discussing your question, regex101 is an incredibly useful tool for this type of thing.
Your issue stems from two issues:
1.) The way you used .*
2.) Greedy vs. Nongreedy matches
.* kinda sucks
.* is a regex expression that is very rarely what you actually want.
As a quick aside, a useful regex expression is [^c]* or [^c]+. These expressions match any character except the letter c, with the first expression matching 0 or more, and the second matched 1 or more.
.* will match all characters as many times as it can. Instead, try to start your regex patterns with more concrete starting points. Two good ways to do this are lookbehind expressions and anchors.
Another quick aside, it's likely that you are misusing regex.match and regex.find. match will only return a match that begins at the start of the string, while find will return matches anywhere in the input string. This could be the reason you included the .* in the first place, to allow a .match call to return a match deeper in the string.
Lookbehind Expressions
There are more complete explanations online, but in short, regex patterns like:
(?<=test)foo
will match the text foo, but only if test is right in front of it. To be more clear, the following strings will not match that regex:
foo
test-foo
test foo
but the following string will match:
testfoo
This will only match the text foo, though.
Anchors
Another option is anchors. ^ and $ are special characters, matching the start and end of a line of text. If you know your regex pattern will match exactly one line of text, start it with ^ and end it with $.
Leading patterns with .* and ending with .* are likely the source of your issue. Although you did not include full examples of your input or your code, you likely used match as opposed to find.
In regex, . matches any character, and * means 0 or more times. This means that for any input, your pattern will match the entire string.
Greedy vs. Non-Greedy qualifiers
The second issue is related to greediness. When your regex patterns have a * in them, they can match 0 or more characters. This can hide problems, as entire * expressions can be skipped. Your regex is likely matched several lines of text as one match, and hiding multiple records in a single .*.
The Actual Answer
Taking all of this in to consideration, let's assume that your input data looks like this:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
V-1\ZDS\R\EMBO-20-3:93
V-1\ZDS\R\EMBO-20-6:22309
V-1\ZDS\R\EMBO-20-8:2238
V-1\ZDS\R\EMBO-20-3:28
A better regular expression would be:
^V-1\\ZDS\\R\\EMBO-20-\d:(\d+)$
To visualize this regex in action, follow this link.
There are several differences I would like to highlight:
Starting the expression with ^ and ending with $. This forces the regex to match exactly one line. Even though the pattern works without these characters, it's good practice when working with regex to be as explicit as possible.
No useless non-capturing group. Your example had a (?:) group at the start. This denotes a group that does not capture it's match. It's useful if you want to match a subpattern multiple times ((?:ab){5} matches ababababab without capturing anything). However, in your example, it did nothing :)
Only capturing the number. This makes it easier to extract the value of the capture groups.
No use of *, one use of +. + works like *, but it matches 1 or more. This is often more correct, as it prevents 'skipping' entire characters.

Regex only finds results once

I'm trying to find any text between a '>' character and a new line, so I came up with this regex:
result = re.search(">(.*)\n", text).group(1)
It works perfectly with only one result, such as:
>test1
(something else here)
Where the result, as intended, is
test1
But whenever there's more than one result, it only shows the first one, like in:
>test1
(something else here)
>test2
(something else here)
Which should give something like
test1\ntest2
But instead just shows
test1
What am I missing? Thank you very much in advance.

re.search only returns the first match, as documented:
Scan through string looking for the first location where the regular
expression pattern produces a match, and return a corresponding
MatchObject instance.
To find all the matches, use findall.
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found.
Here's an example from the shell:
>>> import re
>>> re.findall(">(.*)\n", ">test1\nxxx>test2\nxxx")
['test1', 'test2']
Edit: I just read your question again and realised that you want "test1\ntest2" as output. Well, just join the list with \n:
>>> "\n".join(re.findall(">(.*)\n", ">test1\nxxx>test2\nxxx"))
'test1\ntest2'

You could try:
y = re.findall(r'((?:(?:.+?)(?:(?=[\n\r][^\n\r])\n|))+)', text)
Which returns ['t1\nt2\nt3'] for 't1\nt2\nt3\n'. If you simply want the string, you can get it by:
s = y[0]
Although it seems much larger than your initial code, it will give you your desired string.
Explanation -
((?:(?:.+?)(?:(?=[\n\r][^\n\r])\n|))+) is the regex as well as the match.
(?:(?:.+?)(?:(?=[\n\r][^\n\r])\n|)) is the non-capturing group that matches any text followed by a newline, and is repeatedly found one-or-more times by the + after it.
(?:.+?) matches the actual words which are then followed by a newline.
(?:(?=[\n\r][^\n\r])\n|) is a non-capturing conditional group which tells the regex that if the matched text is followed by a newline, then it should match it, provided that the newline is not followed by another newline or carriage return
(?=[\n\r][^\n\r]) is a positive look-ahead which ascertains that the text found is followed by a newline or carriage return, and then some non-newline characters, which combined with the \n| after it, tells the regex to match a newline.
Granted, after typing this big mess out, the regex is pretty long and complicated, so you would be better off implementing the answers you understand, rather than this answer, which you may not. However, this seems to be the only one-line answer to get the exact output you desire.

Regex, find pattern only in middle of string

I am using python 2.6 and trying to find a bunch of repeating characters in a string, let's say a bunch of n's, e.g. nnnnnnnABCnnnnnnnnnDEF. In any place of the string the number of n's can be variable.
If I construct a regex like this:
re.findall(r'^(((?i)n)\2{2,})', s),
I can find occurences of case-insensitive n's only in the beginning of the string, which is fine. If I do it like this:
re.findall(r'(((?i)n)\2{2,}$)', s),
I can detect the ones only in the end of the sequence. But what about just in the middle?
At first, I thought of using re.findall(r'(((?i)n)\2{2,})', s) and the two previous regex(-ices?) to check the length of the returned list and the presence of n's either in the beginning or end of the string and make logical tests, but it became an ugly if-else mess very quickly.
Then, I tried re.findall(r'(?!^)(((?i)n)\2{2,})', s), which seems to exlude the beginning just fine but (?!$) or (?!\z) at the end of the regex only excludes the last n in ABCnnnn. Finally, I tried re.findall(r'(?!^)(((?i)n)\2{2,})\w+', s) which seems to work sometimes, but I get weird results at others. It feels like I need a lookahead or lookbehind, but I can't wrap my head around them.

Instead of using a complicated regex in order to refuse of matching the leading and trailing n characters. As a more pythonic approach you can strip() your string then find all the sequence of ns using re.findall() and a simple regex:
>>> s = "nnnABCnnnnDEFnnnnnGHInnnnnn"
>>> import re
>>>
>>> re.findall(r'n{2,}', s.strip('n'), re.I)
['nnnn', 'nnnnn']
Note : re.I is Ignore-case flag which makes the regex engine matches upper case and lower case characters.

Since "n" is a character (and not a subpattern), you can simply use:
re.findall(r'(?<=[^n])nn+(?=[^n])(?i)', s)
or better:
re.findall(r'n(?<=[^n]n)n+(?=[^n])(?i)', s)

NOTE: This solution assumes n may be a sequence of some characters. For more efficient alternatives when n is just 1 character, see other answers here.
You can use
(?<!^)(?<!n)((n)\2{2,})(?!$)(?!n)
See the regex demo
The regex will match repeated consecutive ns (ignoring case can be achieved with re.I flag) that are not at the beginning ((?<!^)) or end ((?!$)) of the string and not before ((?!n)) or after ((?<!n)) another n.
The (?<!^)(?<!n) is a sequence of 2 lookbehinds: (?<!^) means do not consume the next pattern if preceded with the start of the string. The (?<!n) negative lookbehind means do not consume the next pattern if preceded with n. The negative lookaheads (?!$) and (?!n)have similar meanings: (?!$) fails a match if after the current position the end of string occurs and (?!n) will fail a match if n occurs after the current position in string (that is, right after matching all consecutive ns. The lookaround conditions must all be met, that is why we only get the innermost matches.
See IDEONE demo:
import re
p = re.compile(r'(?<!^)(?<!n)((n)\2{2,})(?!$)(?!n)', re.IGNORECASE)
s = "nnnnnnnABCnnnnnNnnnnDEFnNn"
print([x.group() for x in p.finditer(s)])

multiple negative lookahead assertions

I can't figure out how to do multiple lookaround for the life of me. Say I want to match a variable number of numbers following a hash but not if preceded by something or followed by something else. For example I want to match #123 or #12345 in the following. The lookbehinds seem to be fine but the lookaheads do not. I'm out of ideas.
matches = ["#123", "This is #12345",
# But not
"bad #123", "No match #12345", "This is #123-ubuntu",
"This is #123 0x08"]
pat = '(?<!bad )(?<!No match )(#[0-9]+)(?! 0x0)(?!-ubuntu)'
for i in matches:
print i, re.search(pat, i)

You should have a look at the captures as well. I bet for the last two strings you will get:
#12
This is what happens:
The engine checks the two lookbehinds - they don't match, so it continues with the capturing group #[0-9]+ and matches #123. Now it checks the lookaheads. They fail as desired. But now there's backtracking! There is one variable in the pattern and that is the +. So the engine discards the last matched character (3) and tries again. Now the lookaheads are no problem any more and you get a match. The simplest way to solve this is to add another lookahead that makes sure that you go to the last digit:
pat = r'(?<!bad )(?<!No match )(#[0-9]+)(?![0-9])(?! 0x0)(?!-ubuntu)'
Note the use of a raw string (the leading r) - it doesn't matter in this pattern, but it's generally a good practice, because things get ugly once you start escaping characters.
EDIT: If you are using or willing to use the regex package instead of re, you get possessive quantifiers which suppress backtracking:
pat = r'(?<!bad )(?<!No match )(#[0-9]++)(?! 0x0)(?!-ubuntu)'
It's up to you which you find more readable or maintainable. The latter will be marginally more efficient, though. (Credits go to nhahtdh for pointing me to the regex package.)

Why is the minimal (non-greedy) match affected by the end of string character '$'?

EDIT: remove original example because it provoked ancillary answers. also fixed the title.
The question is why the presence of the "$" in the regular expression effects the greedyness of the expression:
Here is a simpler example:
>>> import re
>>> str = "baaaaaaaa"
>>> m = re.search(r"a+$", str)
>>> m.group()
'aaaaaaaa'
>>> m = re.search(r"a+?$", str)
>>> m.group()
'aaaaaaaa'
The "?" seems to be doing nothing. Note the when the "$" is removed, however, then the "?" is respected:
>>> m = re.search(r"a+?", str)
>>> m.group()
'a'
EDIT:
In other words, "a+?$" is matching ALL of the a's instead of just the last one, this is not what I expected. Here is the description of the regex "+?" from the python docs:
"Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched."
This does not seem to be the case in this example: the string "a" matches the regex "a+?$", so why isn't the match for the same regex on the string "baaaaaaa" just a single a (the rightmost one)?

Matches are "ordered" by "left-most, then longest"; however "longest" is the term used before non-greedy was allowed, and instead means something like "preferred number of repetitions for each atom". Being left-most is more important than the number of repetitions. Thus, "a+?$" will not match the last A in "baaaaa" because matching at the first A starts earlier in the string.
(Answer changed after OP clarification in comments. See history for previous text.)

The non-greedy modifier only affects where the match stops, never where it starts. If you want to start the match as late as possible, you will have to add .+? to the beginning of your pattern.
Without the $, your pattern is allowed to be less greedy and stop sooner, because it doesn't have to match to the end of the string.
EDIT:
More details... In this case:
re.search(r"a+?$", "baaaaaaaa")
the regex engine will ignore everything up until the first 'a', because that's how re.search works. It will match the first a, and would "want" to return a match, except it doesn't match the pattern yet because it must reach a match for the $. So it just keeps eating the a's one at a time and checking for $. If it were greedy, it wouldn't check for the $ after each a, but only after it couldn't match any more a's.
But in this case:
re.search(r"a+?", "baaaaaaaa")
the regex engine will check if it has a complete match after eating the first match (because it's non-greedy) and succeed because there is no $ in this case.

The presence of the $ in the regular expression does not affect the greediness of the expression. It merely adds another condition which must be met for the overall match to succeed.
Both a+ and a+? are required to consume the first a they find. If that a is followed by more a's, a+ goes ahead and consumes them too, while a+? is content with just the one. If there were anything more to the regex, a+ would be willing to settle for fewer a's, and a+? would consume more, if that's what it took to achieve a match.
With a+$ and a+?$, you've added another condition: match at least one a followed by the end of the string. a+ still consumes all of the a's initially, then it hands off to the anchor ($). That succeeds on the first try, so a+ is not required to give back any of its a's.
On the other hand, a+? initially consumes just the one a before handing off to $. That fails, so control is returned to a+?, which consumes another a and hands off again. And so it goes, until a+? consumes the last a and $ finally succeeds. So yes, a+?$ does match the same number of a's as a+$, but it does so reluctantly, not greedily.
As for the leftmost-longest rule that was mentioned elsewhere, that never did apply to Perl-derived regex flavors like Python's. Even without reluctant quantifiers, they could always return a less-then-maximal match thanks to ordered alternation. I think Jan's got the right idea: Perl-derived (or regex-directed) flavors should be called eager, not greedy.
I believe the leftmost-longest rule only applies to POSIX NFA regexes, which use NFA engines under under the hood, but are required to return the same results a DFA (text-directed) regex would.

Answer to original question:
Why does the first search() span
multiple "/"s rather than taking the
shortest match?
A non-greedy subpattern will take the shortest match consistent with the whole pattern succeeding. In your example, the last subpattern is $, so the previous ones need to stretch out to the end of the string.
Answer to revised question:
A non-greedy subpattern will take the shortest match consistent with the whole pattern succeeding.
Another way of looking at it: A non-greedy subpattern will initially match the shortest possible match. However if this causes the whole pattern to fail, it will be retried with an extra character. This process continues until the subpattern fails (causing the whole pattern to fail) or the whole pattern matches.

There are two issues going on, here. You used group() without specifying a group, and I can tell you are getting confused between the behavior of regular expressions with an explicitly parenthesized group and without a parenthesized group. This behavior without parentheses that you are observing is just a shortcut that Python provides, and you need to read the documentation on group() to understand it fully.
>>> import re
>>> string = "baaa"
>>>
>>> # Here you're searching for one or more `a`s until the end of the line.
>>> pattern = re.search(r"a+$", string)
>>> pattern.group()
'aaa'
>>>
>>> # This means the same thing as above, since the presence of the `$`
>>> # cancels out any meaning that the `?` might have.
>>> pattern = re.search(r"a+?$", string)
>>> pattern.group()
'aaa'
>>>
>>> # Here you remove the `$`, so it matches the least amount of `a` it can.
>>> pattern = re.search(r"a+?", string)
>>> pattern.group()
'a'
Bottom line is that the string a+? matches one a, period. However, a+?$ matches a's until the end of the line. Note that without explicit grouping, you'll have a hard time getting the ? to mean anything at all, ever. In general, it's better to be explicit about what you're grouping with parentheses, anyway. Let me give you an example with explicit groups.
>>> # This is close to the example pattern with `a+?$` and therefore `a+$`.
>>> # It matches `a`s until the end of the line. Again the `?` can't do anything.
>>> pattern = re.search(r"(a+?)$", string)
>>> pattern.group(1)
'aaa'
>>>
>>> # In order to get the `?` to work, you need something else in your pattern
>>> # and outside your group that can be matched that will allow the selection
>>> # of `a`s to be lazy. # In this case, the `.*` is greedy and will gobble up
>>> # everything that the lazy `a+?` doesn't want to.
>>> pattern = re.search(r"(a+?).*$", string)
>>> pattern.group(1)
'a'
Edit: Removed text related to old versions of the question.

Unless your question isn't including some important information, you don't need, and shouldn't use, regex for this task.
>>> import os
>>> p = "/we/shant/see/this/butshouldseethis"
>>> os.path.basename(p)
butshouldseethis

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python regex: how to achieve this complex replacement rule? - python

Related

Regex Statement to only match parts of a string for comparison - Python

Regex only finds results once

Regex, find pattern only in middle of string

multiple negative lookahead assertions

Why is the minimal (non-greedy) match affected by the end of string character '$'?

Categories

Resources