regex non greedy matching in python [duplicate] - python

Here's something I'm trying to do with regular expressions, and I can't figure out how. I have a big file, and strings abc, 123 and xyz that appear multiple times throughout the file.
I want a regular expression to match a substring of the big file that begins with abc, contains 123 somewhere in the middle, ends with xyz, and there are no other instances of abc or xyz in the substring besides the start and the end.
Is this possible with regular expressions?

When your left- and right-hand delimiters are single characters, this is easily solved with negated character classes. So, if your match is between a and c and should not contain b (literally), you may use:
a[^abc]*c
This is the same technique you use when you want to make sure there is a b in between the closest a and c:
a[^abc]*b[^ac]*c
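As a quick check of the two negated-character-class patterns above, here is a minimal Python sketch. The sample string and the variable names are made up for illustration only.

```python
import re

# a ... c with no other a, b or c in between
closest_pair = re.compile(r'a[^abc]*c')
# closest a ... c pair that is guaranteed to contain a b
pair_with_b = re.compile(r'a[^abc]*b[^ac]*c')

sample = 'a--a--c a-b-c'  # hypothetical test input

closest_matches = closest_pair.findall(sample)
b_matches = pair_with_b.findall(sample)

print(closest_matches)  # ['a--c']
print(b_matches)        # ['a-b-c']
```

Note that the first pattern skips the leading a at index 0, because a second a appears before the next c.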
When your left- and right-hand delimiters are multi-character strings, you need a tempered greedy token:
abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz
To make sure it matches across lines, use the re.DOTALL flag when compiling the regex.
Note that to achieve a better performance with such a heavy pattern, you should consider unrolling it. It can be done with negated character classes and negative lookaheads.
Pattern details:
abc - match abc
(?:(?!abc|xyz|123).)* - match any character that is not the starting point of an abc, xyz or 123 character sequence, zero or more times
123 - a literal string 123
(?:(?!abc|xyz).)* - any character that is not the starting point of an abc or xyz character sequence, zero or more times
xyz - a trailing substring xyz
If re.S (re.DOTALL) is used, . will match any character, including a newline.
See the Python demo:
import re
p = re.compile(r'abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz', re.DOTALL)
s = "abc 123 xyz\nabc abc 123 xyz\nabc text 123 xyz\nabc text xyz xyz"
print(p.findall(s))
# => ['abc 123 xyz', 'abc 123 xyz', 'abc text 123 xyz']
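The unrolling mentioned above could look roughly like the sketch below. This is one possible unrolled form, not necessarily the one the answer had in mind: runs of characters that cannot start abc, xyz or 123 are consumed with a negated character class, and the lookahead only runs on a, x or 1. The names tempered and unrolled are chosen here for illustration.

```python
import re

# Original tempered greedy token pattern from the answer
tempered = re.compile(
    r'abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz', re.DOTALL)

# Unrolled sketch: [^ax1]* eats "safe" characters in bulk, and the
# lookahead is only evaluated when the next character is a, x or 1.
unrolled = re.compile(
    r'abc[^ax1]*(?:(?!abc|xyz|123)[ax1][^ax1]*)*'
    r'123[^ax]*(?:(?!abc|xyz)[ax][^ax]*)*xyz')

s = "abc 123 xyz\nabc abc 123 xyz\nabc text 123 xyz\nabc text xyz xyz"

tempered_result = tempered.findall(s)
unrolled_result = unrolled.findall(s)
print(unrolled_result == tempered_result)  # True
```

The negated classes already match newlines, so the unrolled version does not need re.DOTALL.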

Using PCRE, a solution would be the pattern below. It uses the m flag; if you want to match only whole lines, add ^ and $ at the beginning and end respectively:
abc(?!.*(abc|xyz).*123).*123(?!.*(abc|xyz).*xyz).*xyz

The comment by hvd is quite appropriate, and this just provides an example. In SQL, for instance, I think it would be clearer to do:
where val like 'abc%123%xyz' and
val not like 'abc%abc%' and
val not like '%xyz%xyz'
I imagine something quite similar is simple to do in other environments.
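For instance, in Python the same three conditions can be expressed with plain string operations. This is only a sketch that validates a candidate string, like the SQL does; the function name is_valid_span is hypothetical.

```python
def is_valid_span(val: str) -> bool:
    """Mirror the three LIKE conditions with plain string operations."""
    return (
        val.startswith('abc')
        and val.endswith('xyz')
        and '123' in val[3:-3]      # like 'abc%123%xyz'
        and 'abc' not in val[3:]    # not like 'abc%abc%'
        and 'xyz' not in val[:-3]   # not like '%xyz%xyz'
    )

print(is_valid_span('abc 123 xyz'))      # True
print(is_valid_span('abc abc 123 xyz'))  # False
print(is_valid_span('abc 123 xyz xyz'))  # False
print(is_valid_span('abc text xyz'))     # False
```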

You could use lookaround.
/^abc(?!.*abc).*123.*(?<!xyz.*)xyz$/g
(I've not tested it.)

Related

How to regex for a numerical suffix?

I have the following regex (example is in Python):
pattern = re.compile(r'^(([a-zA-Z0-9]*[a-zA-Z]+)([\d]+)|([\d]+))$')
This correctly parses any string that has a numerical suffix and an optional prefix that is alphanumerics:
a123
a2a123
123
All will correctly see 123 as a suffix. It will correctly reject bad inputs:
abc
123abc
()123 # Or other non-alphanumerics
The regex itself is fairly unwieldy, though, and several of the capture groups are often empty as a result, meaning I have to go through the additional step of filtering them out. I am curious if there is a better way to be thinking about this regex than "a number OR a number preceded by an alphanumeric that ends in a character"?
You may use
^[A-Za-z0-9]*?([0-9]+)$
Details
^ - start of string
[A-Za-z0-9]*? - any letters/digits, zero or more times, as few as possible (because of this non-greedy matching, the next pattern, ([0-9]+), will match all the digits there are at the end of the string)
([0-9]+) - Group 1: one or more digits
$ - end of string.
In Python:
import re
s = 'a2a123'  # sample input
m = re.search(r'^[A-Za-z0-9]*?([0-9]+)$', s) # Or, see below
# m = re.match(r'[A-Za-z0-9]*?([0-9]+)$', s) # re.match only searches at the start of the string
# m = re.fullmatch(r'[A-Za-z0-9]*?([0-9]+)', s) # Only in Python 3.4+
if m:
    print(m.group(1))
If you use a non-capturing group and make the prefix optional, the problem becomes simpler.
pattern = re.compile(r'^(?:[a-zA-Z0-9]*[a-zA-Z]+)?([0-9]+)$')
There's only one capturing group (group 1) for the suffix, and the alphanumeric prefix before it is not captured.
Alternatively, using named groups is another option, and it often makes long, structured regexes easier to maintain:
pattern = re.compile(r'^(?P<a>[a-zA-Z0-9]*[a-zA-Z]+)?(?P<suffix>[0-9]+)$')
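To compare the suggestions, here is a small sketch that runs the lazy pattern and the non-capturing-group pattern against the sample inputs from the question. The helper numeric_suffix is a hypothetical name introduced for this example.

```python
import re

lazy = re.compile(r'^[A-Za-z0-9]*?([0-9]+)$')
grouped = re.compile(r'^(?:[a-zA-Z0-9]*[a-zA-Z]+)?([0-9]+)$')

def numeric_suffix(pattern, text):
    """Return the trailing digits captured in group 1, or None if no match."""
    m = pattern.match(text)
    return m.group(1) if m else None

good = ['a123', 'a2a123', '123']
bad = ['abc', '123abc', '()123']

for text in good + bad:
    print(text, numeric_suffix(lazy, text), numeric_suffix(grouped, text))
```

On these six inputs both patterns agree: the first three yield 123 and the last three yield None.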

The Behavior of Alternative Match "|" with .* in a Regex

I have seldom used | together with .* before. But today, when I used both of them together, I found some results really confusing. The expression I used is as follows (in Python):
>>> s = "abcdefg"
>>> re.findall(r"((a.*?c)|(.*g))",s)
[('abc', 'abc', ''), ('defg', '', 'defg')]
The result of the first match is all right, but the second match is not what I expected: I expected it to be "abcdefg" (the whole string).
Then I reverse the two alternatives:
>>> re.findall(r"(.*?g)|(a.*?c)",s)
[('abcdefg', '')]
It seems that the regex engine only reads the string once - when the whole string is read in the first alternative, the regex engine will stop and no longer check the second alternative. However, in the first case, after dealing with the first alternative, the regex engine only reads from "a" to "c", and there are still "d" to "g" left in the string, which matches ".*?g" in the second alternative. Have I got it right? What's more, as for an expression with alternatives, the regex engine will check the first alternative first, and if it matches the string, it will never check the second alternative. Is it correct?
Besides, if I want to get both "abc" and "abcdefg" or "abc" and "bcde" (the two results overlap) like in the first case, what expression should I use?
Thank you so much!
You cannot have two matches starting from the same location in the regex (the only regex flavor that does it is Perl6).
In re.findall(r"((a.*?c)|(.*g))",s), re.findall will grab all non-overlapping matches in the string, and since the first one starts at the beginning, ends with c, the next one can only be found after c, within defg.
The (.*?g)|(a.*?c) regex matches abcdefg because the regex engine parses the string from left to right, and .*? will get any 0+ chars as few as possible but up to the first g. And since g is the last char, it will match and capture the whole string into Group 1.
To get abc and abcdefg, you may use, say
(a.*?c)?.*g
Python demo:
import re
rx = r"(a.*?c)?.*g"
s = "abcdefg"
m = re.search(rx, s)
if m:
    print(m.group(0)) # => abcdefg
    print(m.group(1)) # => abc
It might not be what you exactly want, but it should give you a hint: you match the bigger part, and capture a subpart of the string.
Re-read the docs for the re.findall method.
findall "return[s] all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found."
Specifically, non-overlapping matches, and left-to-right. So if you have a string abcdefg and one pattern will match abc, then any other patterns must (1) not overlap; and (2) be further to the right.
It's perfectly valid to match abc and defg per the description. It would be a bug to match abc and abcdefg or even abc and cdefg because they would overlap.
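The non-overlapping, left-to-right behaviour described above can be made visible by printing match spans. The sketch below also shows one simple way to get both abc and abcdefg, namely searching with each alternative separately; that workaround is an assumption of this example, not something the answers propose.

```python
import re

s = "abcdefg"

# Spans show that the second match starts exactly where the first one ended.
spans = [m.span() for m in re.finditer(r"(a.*?c)|(.*g)", s)]
print(spans)  # [(0, 3), (3, 7)]

# Searching with each alternative on its own lets both results start at index 0.
separate = [re.search(p, s).group() for p in (r"a.*?c", r".*g")]
print(separate)  # ['abc', 'abcdefg']
```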

Recursive regex in python regex module?

I would like to capture all [[A-Za-z].]+ in my string, that is, all repeats of an alphabetic character followed by a dot.
So for example, in "ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z."
I would like to pull out "A.B.C." and "U.V.W.X." only (as they are repeats of one character followed by a dot).
It seems almost that I need a recursive regex to do this [[A-Za-z].]+.
Is it possible to implement this with either python's re module or regex module?
You can use a non-capturing group to define your match, then group its repeats nested between boundary characters (in this case anything that's not a letter or a dot) and capture all matched groups:
import re
MATCH_GROUPS = re.compile(r"(?:[^a-z.]|^)((?:[a-z]\.)+)(?:[^a-z.]|$)", re.IGNORECASE)
your_string = "ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z."
# get a list of matches
print(MATCH_GROUPS.findall(your_string)) # ['A.B.C.', 'U.V.W.X.']
A bit clunky but should get the job done with edge cases as well.
P.S. The above will match single occurrences as well (e.g. A. if it appears standalone). If you are looking for multiple repeats only, replace the + (one or more repeats) with a range of your choice (e.g. {2,} for two or more repeats).
edit: A small change to match beginning/end of string boundaries as well.
This will work for you, using simple re.findall notation:
(?:(?<=\s)|(?<=^))(?:[A-Za-z]\.)+
In the regex, I first check whether we are at the start of the string or there is a whitespace character before the match, and then I check for a repeated letter + period. I place the parts I do not want to capture into a non-capturing group (?:...).
You can see it working here:
https://regex101.com/r/ZwW7c7/4
Python Code (that I wrote):
import re
regex = r"(?:(?<=\s)|(?<=^))(?:[A-Za-z]\.)+"
string = 'D.E.F. ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z.'
print(re.findall(regex,string))
Output:
['D.E.F.', 'A.B.C.', 'U.V.W.X.']
Using positive look-around assertions:
>>> import re
>>> pattern = r'(?:(?<=\s)|^)(?:[A-Za-z]\.)+(?:(?=\s)|$)'
>>> re.findall(pattern, 'ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z.')
['A.B.C.', 'U.V.W.X.']
>>> re.findall(pattern, 'A.B.C. UVWX U.V.W.X. XYZ XY.Z.')
['A.B.C.', 'U.V.W.X.']
>>> re.findall(pattern, 'DEF A.B.C. UVWX U.V.W.X.Y')
['A.B.C.']
UPDATE As @bubblebobble suggested, the regex could be simplified using \S (non-space character) with negative look-around assertions:
pattern = r'(?<!\S)(?:[A-Za-z]\.)+(?!\S)'
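A short sketch of the simplified pattern in use, on the same inputs as the session above. The {2,} variant is an illustration of the "two or more repeats" idea from an earlier answer; the third input string is made up for this example.

```python
import re

pattern = re.compile(r'(?<!\S)(?:[A-Za-z]\.)+(?!\S)')
# Variant that requires at least two letter+dot repeats
at_least_two = re.compile(r'(?<!\S)(?:[A-Za-z]\.){2,}(?!\S)')

first = pattern.findall('ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z.')
second = pattern.findall('DEF A.B.C. UVWX U.V.W.X.Y')
singles = pattern.findall('A. B.C.')        # hypothetical input
no_singles = at_least_two.findall('A. B.C.')

print(first)       # ['A.B.C.', 'U.V.W.X.']
print(second)      # ['A.B.C.']
print(singles)     # ['A.', 'B.C.']
print(no_singles)  # ['B.C.']
```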
This regex seems to do the job (testing if we are on the beginning of the string or after a space) :
\A([A-Za-z]\.)+|(?<=\s)([A-Za-z]\.)+
EDIT : Sorry Shawn didn't see your modified answer

Match specific pattern with regular expression

I've to make a regex to match exactly this kind of pattern
here an example
JK+6.00,PP*2,ZZ,GROUPO
having a match for every group like
Match 1
JK
+
6.00
Match 2
PP
*
2
Match 3
ZZ
Match 4
GROUPO
So comma separated blocks of
(2 to 12 capital letters), optionally followed by (+ or *) and a (positive number in the form 0[.0[0]])
This block successfully parse the pattern
(?P<block>(?P<subject>[A-Z]{2,12})(?:(?P<operation>\*|\+)(?P<value>\d+(?:.?\d{1,2})?))?)
we have the subject group
(?P<subject>[A-Z]{2,12})
The value
(?P<value>\d+(?:.?\d{1,2})?)
All the optional operation section (value within)
(?:(?P<operation>\*|\+)(?P<value>\d+(?:.?\d{1,2})?))?
But the regex must fail if the string doesn't match EXACTLY the pattern
and that's the problem
I tried this, but it doesn't work:
^(?P<block>(?P<subject>[A-Z]{2,12})(?:(?P<operation>\*|\+)(?P<value>\d+(?:.?\d{1,2})?))?)(?:,(?P=block))*$
Any suggestion?
PS. I use Python re
I'd personally go for a 2 step solution, first check that the whole string fits to your pattern, then extract the groups you want.
For the overall check you might want to use ^(?:[A-Z]{2,12}(?:[*+]\d+(?:\.\d{1,2})?)?(?:,|$))*$ as a pattern, which contains basically your pattern, the (?:,|$) to match the delimiters and anchors.
I have also adjusted your pattern a bit, to (?P<block>(?P<subject>[A-Z]{2,12})(?:(?P<operation>[*+])(?P<value>\d+(?:\.\d{1,2})?))?). I have replaced \*|\+ with [*+] in your operation pattern and .? with \. in your value pattern, so that the decimal separator has to be a literal dot.
A (very basic) python implementation could look like
import re
text = 'JK+6.00,PP*2,ZZ,GROUPO'  # renamed from str to avoid shadowing the built-in
full_pattern = r'^(?:[A-Z]{2,12}(?:[*+]\d+(?:\.\d{1,2})?)?(?:,|$))*$'
extract_pattern = r'(?P<block>(?P<subject>[A-Z]{2,12})(?:(?P<operation>[*+])(?P<value>\d+(?:\.\d{1,2})?))?)'
# Step 1: validate the whole string; step 2: extract the individual blocks
if re.fullmatch(full_pattern, text):
    for match in re.finditer(extract_pattern, text):
        print(match.groups())
http://ideone.com/kMl9qu
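The sketch below illustrates why the attempt with (?P=block) in the question fails: a backreference re-matches the exact text captured earlier, not the sub-pattern. It also shows a hypothetical single-pattern validator built by repeating the block sub-pattern as a string. Unlike the (?:,|$) pattern above, this variant rejects a trailing comma and the empty string; the names BLOCK, backref and full are assumptions of this example.

```python
import re

# One block: 2 to 12 capitals, optionally followed by + or * and a number
BLOCK = r'[A-Z]{2,12}(?:[*+]\d+(?:\.\d{1,2})?)?'

# (?P=block) only matches a repeat of the same text as the first block.
backref = re.compile(r'^(?P<block>' + BLOCK + r')(?:,(?P=block))*$')
same_text = bool(backref.match('JK,JK'))
different_text = bool(backref.match('JK,PP'))
print(same_text, different_text)  # True False

# Repeating the sub-pattern itself validates the whole list in one pattern.
full = re.compile(BLOCK + r'(?:,' + BLOCK + r')*')
valid = bool(full.fullmatch('JK+6.00,PP*2,ZZ,GROUPO'))
trailing_comma = bool(full.fullmatch('JK+6.00,PP*2,'))
print(valid, trailing_comma)  # True False
```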
I'm guessing this is the pattern you were looking for:
(2 different letter)+(time stamp),(2 of the same letter)*(1 number),(2 of the same letter),(a string)
If that's the case, this regex would do the trick:
^(\w{2}\+\d{1,2}\.\d{2}),((\w)\3\*\d),((\w)\5),(\w+)$
Demo: https://regex101.com/r/8B3C6e/2

Conditional Regex: if A and B, choose B

I need to extract IDs from a string of the following format: Name ID, where the two are separated by white space.
Example:
'Riverside 456'
Sometimes, the ID is followed by the letter A or B (separated by white space):
'Riverside 456 A'
In this case I want to extract '456 A' instead of just '456':
I tried to accomplish this with the following regex:
(\d{1,3}) | (\d{1,3}\s[AB])
The conditional operator | does not quite work in this setting as I only get numerical IDs. Any suggestions how to properly set up regex in this setting?
Any help would be appreciated.
Try reversing the order of the alternatives so that the more specific one comes first. Also drop the spaces around |, since outside verbose mode they are matched literally. I.e.:
(\d{1,3}\s[AB])|(\d{1,3})
If you have an optional part that you might want to include, but not necessarily need, you could just use an "at most one time" quantifier:
Riverside (\d{1,3}(?: [AB])?)
The ?: marks groups as "not-capturing", so they won't be returned. And the ? tells it to either match it once or ignore it.
Your (\d{1,3})|(\d{1,3}\s[AB]) will always match the first branch. In an NFA regex, if the alternation group is not anchored on either side, the first branch that matches "wins", and the branches to its right are not tested.
You can use an optional group:
\d{1,3}(?:\s[AB])?
Add a $ at the end if the value you need is always at the end of the string.
If there can be more than 1 whitespace, add + after \s. Or * if there can be zero or more whitespaces.
Note that the last ? quantifier is greedy, so if there is a whitespace and A or B, they will be part of the match.
See the Python demo:
import re
rx = r'\d{1,3}(?:\s[AB])?'
s = ['Riverside 456 A', 'Riverside 456']
print([re.search(rx, x).group() for x in s])
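The "first branch wins" behaviour described above can be checked directly. This is a small sketch using the sample string from the question; the variable names are chosen here for illustration.

```python
import re

s = 'Riverside 456 A'

# The shorter branch is listed first, so it wins as soon as it matches.
short_first = re.search(r'\d{1,3}|\d{1,3}\s[AB]', s).group()
# Listing the more specific branch first gives the longer match.
specific_first = re.search(r'\d{1,3}\s[AB]|\d{1,3}', s).group()
# The optional group gives the same result without alternation.
optional_group = re.search(r'\d{1,3}(?:\s[AB])?', s).group()

print(short_first)     # 456
print(specific_first)  # 456 A
print(optional_group)  # 456 A
```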
import re
pattern = re.compile(r'(\d{1,3}\s?[AB]?)$')
print(pattern.search('Riverside 456').group(0)) # => '456'
print(pattern.search('Riverside 456 A').group(0)) # => '456 A'
You could use alternation
p = re.compile(r'(\d{1,3}\s[AB]|\d{1,3})$')
NB $ or maybe \s at the end (outside the group) is important, otherwise it will capture both 123 C and 1234 as 123 rather than fail to match.
