Conditional Regex: if A and B, choose B - python

I need to extract IDs from a string of the following format: Name ID, where the two are separated by white space.
Example:
'Riverside 456'
Sometimes, the ID is followed by the letter A or B (separated by white space):
'Riverside 456 A'
In this case I want to extract '456 A' instead of just '456':
I tried to accomplish this with the following regex:
(\d{1,3}) | (\d{1,3}\s[AB])
The conditional operator | does not quite work in this setting as I only get numerical IDs. Any suggestions how to properly set up regex in this setting?
Any help would be appreciated.

Try just reversing the order of the alternatives so that the more specific one comes first (and drop the spaces around |, which would otherwise be matched literally), i.e.:
(\d{1,3}\s[AB])|(\d{1,3})
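To see why the order matters, here is a minimal sketch comparing the two orderings on the sample strings from the question. The variable names are only illustrative.

```python
import re

# Shorter alternative first: it always wins, so the letter is never included.
first_short = re.compile(r'(\d{1,3})|(\d{1,3}\s[AB])')
# More specific alternative first: the letter is included when present.
first_long = re.compile(r'(\d{1,3}\s[AB])|(\d{1,3})')

samples = ['Riverside 456', 'Riverside 456 A']
short_results = [first_short.search(s).group() for s in samples]
long_results = [first_long.search(s).group() for s in samples]
print(short_results)  # ['456', '456']
print(long_results)   # ['456', '456 A']
```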

If you have an optional part that you might want to include, but not necessarily need, you could just use an "at most one time" quantifier:
Riverside (\d{1,3}(?: [AB])?)
The ?: marks groups as "not-capturing", so they won't be returned. And the ? tells it to either match it once or ignore it.
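As a quick check of the optional-group idea, the sketch below runs that pattern against both sample strings. The literal "Riverside" is hard-coded only because it appears in the examples; a real pattern would probably generalize the name part.

```python
import re

# The inner (?: [AB])? is optional and non-capturing; group 1 holds the whole ID.
pattern = re.compile(r'Riverside (\d{1,3}(?: [AB])?)')
ids = [pattern.search(s).group(1) for s in ('Riverside 456', 'Riverside 456 A')]
print(ids)  # ['456', '456 A']
```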

Your (\d{1,3})|(\d{1,3}\s[AB]) will always match the first branch. In an NFA regex engine, if the alternation group is not anchored on either side, the first branch that matches "wins", and the branches to its right are not tried.
You can use an optional group:
\d{1,3}(?:\s[AB])?
Add a $ at the end if the value you need is always at the end of the string.
If there can be more than one whitespace character, add + after \s, or * if there can be zero or more whitespace characters.
Note that the last ? quantifier is greedy, so if there is a whitespace and A or B, they will be part of the match.
See the Python demo:
import re
rx = r'\d{1,3}(?:\s[AB])?'
s = ['Riverside 456 A', 'Riverside 456']
print([re.search(rx, x).group() for x in s])
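
The variations mentioned above (the trailing $ and the \s+ or \s* quantifiers) can be sketched like this. The extra test strings are made up for illustration.

```python
import re

# Anchored at the end of the string, allowing one or more whitespace characters.
anchored = re.compile(r'\d{1,3}(?:\s+[AB])?$')
wide_gap = anchored.search('Riverside 456   A').group()
trailing_text = anchored.search('Riverside 456 A extra')
print(wide_gap)       # '456   A'
print(trailing_text)  # None, because the ID is not at the end

# Zero or more whitespace characters between the number and the letter.
loose = re.compile(r'\d{1,3}(?:\s*[AB])?$')
no_gap = loose.search('Riverside 456A').group()
print(no_gap)         # '456A'
```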

import re
pattern = re.compile(r'(\d{1,3}\s?[AB]?)$')
print(pattern.search('Riverside 456').group(0)) # => '456'
print(pattern.search('Riverside 456 A').group(0)) # => '456 A'

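You could use alternation
p = re.compile(r'(\d{1,3}\s[AB]|\d{1,3})$')
NB: the $ (or maybe \s) at the end, outside the group, is important; otherwise 123 C is captured as 123 rather than failing to match. Note that with re.search a four-digit value such as 1234 still yields its last three digits (234) unless the digits are also bounded on the left, e.g. with \b.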
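
A small sketch of what the trailing anchor changes, using re.search on made-up inputs:

```python
import re

unanchored = re.compile(r'(\d{1,3}\s[AB]|\d{1,3})')
anchored = re.compile(r'(\d{1,3}\s[AB]|\d{1,3})$')

loose_hit = unanchored.search('Riverside 123 C').group()
strict_miss = anchored.search('Riverside 123 C')
strict_hit = anchored.search('Riverside 123 A').group()
four_digits = anchored.search('Riverside 1234').group()

print(loose_hit)    # '123'   (the " C" is silently ignored)
print(strict_miss)  # None    (the string does not end with a valid ID)
print(strict_hit)   # '123 A'
print(four_digits)  # '234'   (search may still start in the middle of a number)
```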

Related

regex non greedy matching in python [duplicate]

Here's something I'm trying to do with regular expressions, and I can't figure out how. I have a big file, and strings abc, 123 and xyz that appear multiple times throughout the file.
I want a regular expression to match a substring of the big file that begins with abc, contains 123 somewhere in the middle, ends with xyz, and there are no other instances of abc or xyz in the substring besides the start and the end.
Is this possible with regular expressions?
When your left- and right-hand delimiters are single characters, it can be easily solved with negated character classes. So, if your match is between a and c and should not contain b (literally), you may use:
a[^abc]*c
This is the same technique you use when you want to make sure there is a b in between the closest a and c:
a[^abc]*b[^ac]*c
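A short sketch of both single-character patterns on a made-up test string:

```python
import re

text = 'a1c a2b3c aXXc'

# Between the closest a and c, with no a, b or c in between.
without_b = re.findall(r'a[^abc]*c', text)
# Between the closest a and c, with exactly one required b in between.
with_b = re.findall(r'a[^abc]*b[^ac]*c', text)

print(without_b)  # ['a1c', 'aXXc']
print(with_b)     # ['a2b3c']
```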
When your left- and right-hand delimiters are multi-character strings, you need a tempered greedy token:
abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz
To make sure it matches across lines, use re.DOTALL flag when compiling the regex.
Note that to achieve a better performance with such a heavy pattern, you should consider unrolling it. It can be done with negated character classes and negative lookaheads.
Pattern details:
abc - match abc
(?:(?!abc|xyz|123).)* - match any character that is not the starting point of an abc, xyz or 123 character sequence
123 - a literal string 123
(?:(?!abc|xyz).)* - any character that is not the starting point of an abc or xyz character sequence
xyz - a trailing substring xyz
See the Python demo:
import re
p = re.compile(r'abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz', re.DOTALL)
s = "abc 123 xyz\nabc abc 123 xyz\nabc text 123 xyz\nabc text xyz xyz"
print(p.findall(s))
# => ['abc 123 xyz', 'abc 123 xyz', 'abc text 123 xyz']
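The unrolling mentioned above could look roughly like the sketch below. This particular unrolled form is an illustration rather than the answer's own pattern: characters that cannot start abc, xyz or 123 are consumed in bulk by a negated class, and the lookahead is only checked on a, x or 1.

```python
import re

tempered = re.compile(r'abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz', re.DOTALL)

# Unrolled variant: negated classes match newlines too, so no DOTALL is needed.
unrolled = re.compile(
    r'abc[^ax1]*(?:(?!abc|xyz|123)[ax1][^ax1]*)*'
    r'123[^ax]*(?:(?!abc|xyz)[ax][^ax]*)*xyz'
)

s = "abc 123 xyz\nabc abc 123 xyz\nabc text 123 xyz\nabc text xyz xyz"
same_result = unrolled.findall(s) == tempered.findall(s)
print(same_result)  # True
```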
Using PCRE, a solution would be the pattern below.
It uses the m flag. If you want to match only from the start to the end of a line, add ^ and $ at the beginning and end respectively:
abc(?!.*(abc|xyz).*123).*123(?!.*(abc|xyz).*xyz).*xyz
The comment by hvd is quite appropriate, and this just provides an example. In SQL, for instance, I think it would be clearer to do:
where val like 'abc%123%xyz' and
val not like 'abc%abc%' and
val not like '%xyz%xyz'
I imagine something quite similar is simple to do in other environments.
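In Python, for instance, the same three conditions could be written with plain string methods. This is only a sketch: is_clean_span is a hypothetical helper name, and it checks a whole candidate string rather than searching inside a larger file.

```python
def is_clean_span(val: str) -> bool:
    """Hypothetical helper mirroring the three LIKE conditions above."""
    return (
        val.startswith('abc')
        and val.endswith('xyz')
        and '123' in val[3:-3]     # like 'abc%123%xyz'
        and 'abc' not in val[3:]   # not like 'abc%abc%'
        and 'xyz' not in val[:-3]  # not like '%xyz%xyz'
    )

checks = [is_clean_span(v) for v in
          ('abc 123 xyz', 'abc abc 123 xyz', 'abc 123 xyz xyz', 'abc text xyz')]
print(checks)  # [True, False, False, False]
```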
You could use lookaround.
/^abc(?!.*abc).*123.*(?<!xyz.*)xyz$/g
(I've not tested it.)
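For what it is worth, Python's built-in re module rejects the variable-width look-behind used there. A look-ahead-only rewrite in the same spirit might look like the sketch below; it is an assumption about the intent, not the answer's own regex, and it validates a whole string rather than searching within one.

```python
import re

# Python's re only accepts fixed-width look-behind, so this fails to compile.
try:
    re.compile(r'^abc(?!.*abc).*123.*(?<!xyz.*)xyz$')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Look-ahead-only sketch: no second abc, a 123 somewhere, no xyz before the final one.
pattern = re.compile(r'^abc(?!.*abc)(?=.*123)(?:(?!xyz).)*xyz$', re.DOTALL)

results = [bool(pattern.match(v)) for v in
           ('abc 123 xyz', 'abc abc 123 xyz', 'abc 123 xyz xyz', 'abc xyz')]
print(lookbehind_ok)  # False
print(results)        # [True, False, False, False]
```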

Repeated pattern in python regex

New to python regex and would like to write something that matches this
<name>.name.<age>.age#<place>
I can do this, but I would like the pattern to also contain and check the literal words name and age.
pat = re.compile("""
^(?P<name>.*)
\.
(?P<name>.*)
\.
(?P<age>.*)
\.
(?P<age>.*?)
\#
(?P<place>.*?)
$""", re.X)
I then match and extract the values.
res = pat.match('alan.name.65.age#jamaica')
Would like to know the best practice to do this?
Match .name and .age literally. You don't need new groups for that.
import re
pat = re.compile(r"""
^(?P<name>[^.]*)\.name
\.
(?P<age>[^.]*)\.age
\#
(?P<place>.*)
$""", re.X)
Notes
I've replaced .* ("anything") by [^.]* ("anything except a dot"), because the dot cannot really be part of the name in the pattern you show.
Think whether you mean * (0-unlimited occurrences) or rather + (1-unlimited occurrences).
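A minimal sketch of how the + variant behaves, using the sample string from the question plus one made-up degenerate input:

```python
import re

# Same layout as above, but with + so that empty fields are rejected.
pat = re.compile(r"""
    ^(?P<name>[^.]+)\.name
    \.
    (?P<age>[^.]+)\.age
    \#
    (?P<place>.+)
    $""", re.X)

fields = pat.match('alan.name.65.age#jamaica').groupdict()
empty_fields = pat.match('.name..age#')
print(fields)        # {'name': 'alan', 'age': '65', 'place': 'jamaica'}
print(empty_fields)  # None, because + requires at least one character
```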
No reason not to allow . in names, e.g. John Q. Public.
import re
pat = re.compile(r"""(?P<name>.*?)\.name
\.(?P<age>\d+)\.age
\#(?P<place>.*$)""",
flags=re.X)
m = pat.match('alan.name.65.age#jamaica')
print(m.group('name'))
print(m.group('age'))
print(m.group('place'))
Prints:
alan
65
jamaica
You don't need the groups if you use re.split:
re.split(r'\.name\.|\.age', "alan.name.65.age#jamaica")
This will return name and age as the first two elements of the list; the third element is the place, still prefixed with #.
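A quick sketch of the split approach. The second call, which also consumes the # as part of the delimiter, is a small variation and not part of the answer itself.

```python
import re

s = "alan.name.65.age#jamaica"

parts = re.split(r'\.name\.|\.age', s)
print(parts)        # ['alan', '65', '#jamaica']

# Variation: treat '.age#' as the second delimiter so the place comes out clean.
parts_clean = re.split(r'\.name\.|\.age#', s)
print(parts_clean)  # ['alan', '65', 'jamaica']
```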

The Behavior of Alternative Match "|" with .* in a Regex

I have seldom used | together with .* before. But today, when I used both of them together, I found some results really confusing. The expression I use is as follows (in python):
>>> s = "abcdefg"
>>> re.findall(r"((a.*?c)|(.*g))",s)
[('abc', 'abc', ''), ('defg', '', 'defg')]
The result of the first capture is all right, but the second capture is beyond my expectation, for I have expected the second capture would be "abcdefg" (the whole string).
Then I reverse the two alternatives:
>>> re.findall(r"(.*?g)|(a.*?c)",s)
[('abcdefg', '')]
It seems that the regex engine only reads the string once - when the whole string is read in the first alternative, the regex engine will stop and no longer check the second alternative. However, in the first case, after dealing with the first alternative, the regex engine only reads from "a" to "c", and there are still "d" to "g" left in the string, which matches ".*?g" in the second alternative. Have I got it right? What's more, as for an expression with alternatives, the regex engine will check the first alternative first, and if it matches the string, it will never check the second alternative. Is it correct?
Besides, if I want to get both "abc" and "abcdefg" or "abc" and "bcde" (the two results overlap) like in the first case, what expression should I use?
Thank you so much!
You cannot have two matches starting from the same location in the regex (the only regex flavor that does it is Perl6).
In re.findall(r"((a.*?c)|(.*g))",s), re.findall will grab all non-overlapping matches in the string, and since the first one starts at the beginning, ends with c, the next one can only be found after c, within defg.
The (.*?g)|(a.*?c) regex matches abcdefg because the regex engine parses the string from left to right, and .*? will get any 0+ chars as few as possible but up to the first g. And since g is the last char, it will match and capture the whole string into Group 1.
To get abc and abcdefg, you may use, say
(a.*?c)?.*g
Python demo:
import re
rx = r"(a.*?c)?.*g"
s = "abcdefg"
m = re.search(rx, s)
if m:
    print(m.group(0)) # => abcdefg
    print(m.group(1)) # => abc
It might not be what you exactly want, but it should give you a hint: you match the bigger part, and capture a subpart of the string.
Re-read the docs for the re.findall method.
findall "return[s] all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found."
Specifically, non-overlapping matches, and left-to-right. So if you have a string abcdefg and one pattern will match abc, then any other patterns must (1) not overlap; and (2) be further to the right.
It's perfectly valid to match abc and defg per the description. It would be a bug to match abc and abcdefg or even abc and cdefg because they would overlap.
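The non-overlapping, left-to-right behaviour is easy to see by printing the match spans with re.finditer. A minimal sketch:

```python
import re

s = "abcdefg"
# Each match starts where the previous one ended, never earlier.
spans = [(m.span(), m.group()) for m in re.finditer(r"(a.*?c)|(.*g)", s)]
print(spans)  # [((0, 3), 'abc'), ((3, 7), 'defg')]
```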

Recursive regex in python regex module?

I would like to capture all [[A-Za-z].]+ in my string, that is, all repeats of an alphabetic character followed by a dot.
So for example, in "ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z."
I would like to pull out "A.B.C." and "U.V.W.X." only (as they are repeats of one character followed by a dot).
It seems almost that I need a recursive regex to do this [[A-Za-z].]+.
Is it possible to implement this with either python's re module or regex module?
You can use a non-capturing group to define your match, then group its repeats nested between boundary characters (in this case anything that's not a letter or a dot) and capture all matched groups:
import re
MATCH_GROUPS = re.compile(r"(?:[^a-z.]|^)((?:[a-z]\.)+)(?:[^a-z.]|$)", re.IGNORECASE)
your_string = "ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z."
# get a list of matches
print(MATCH_GROUPS.findall(your_string)) # ['A.B.C.', 'U.V.W.X.']
A bit clunky but should get the job done with edge cases as well.
P.S. The above will match single occurrences as well (e.g. a standalone A.). If you are looking for multiple repeats only, replace the + (one or more repeats) with a range of your choice (e.g. {2,} for two or more repeats).
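The effect of swapping + for {2,} can be sketched as follows. The test string is made up for illustration.

```python
import re

one_or_more = re.compile(r"(?:[^a-z.]|^)((?:[a-z]\.)+)(?:[^a-z.]|$)", re.IGNORECASE)
two_or_more = re.compile(r"(?:[^a-z.]|^)((?:[a-z]\.){2,})(?:[^a-z.]|$)", re.IGNORECASE)

text = "A. x A.B.C. y"
singles_included = one_or_more.findall(text)
singles_excluded = two_or_more.findall(text)
print(singles_included)  # ['A.', 'A.B.C.']
print(singles_excluded)  # ['A.B.C.']
```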
edit: A small change to match beginning/end of string boundaries as well.
This will work for you, using simple re.findall notation:
(?:(?<=\s)|(?<=^))(?:[A-Za-z]\.)+
In the regex, I first check if it is the start of the string, or if there is a space before the string, and then I check for a repeated letter+period. I place the parts I do not want to capture into a non-capturing group (?:...)
You can see it working here:
https://regex101.com/r/ZwW7c7/4
Python Code (that I wrote):
import re
regex = r"(?:(?<=\s)|(?<=^))(?:[A-Za-z]\.)+"
string = 'D.E.F. ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z.'
print(re.findall(regex,string))
Output:
['D.E.F.', 'A.B.C.', 'U.V.W.X.']
Using positive look-around assertions:
>>> import re
>>> pattern = r'(?:(?<=\s)|^)(?:[A-Za-z]\.)+(?:(?=\s)|$)'
>>> re.findall(pattern, 'ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z.')
['A.B.C.', 'U.V.W.X.']
>>> re.findall(pattern, 'A.B.C. UVWX U.V.W.X. XYZ XY.Z.')
['A.B.C.', 'U.V.W.X.']
>>> re.findall(pattern, 'DEF A.B.C. UVWX U.V.W.X.Y')
['A.B.C.']
UPDATE: As @bubblebobble suggested, the regex could be simplified using \S (non-space character) with negative look-around assertions:
pattern = r'(?<!\S)(?:[A-Za-z]\.)+(?!\S)'
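As a sanity check, the simplified pattern can be run against the same inputs as the longer version:

```python
import re

pattern = r'(?<!\S)(?:[A-Za-z]\.)+(?!\S)'

r1 = re.findall(pattern, 'ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z.')
r2 = re.findall(pattern, 'A.B.C. UVWX U.V.W.X. XYZ XY.Z.')
r3 = re.findall(pattern, 'DEF A.B.C. UVWX U.V.W.X.Y')
print(r1)  # ['A.B.C.', 'U.V.W.X.']
print(r2)  # ['A.B.C.', 'U.V.W.X.']
print(r3)  # ['A.B.C.']
```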
This regex seems to do the job (testing if we are on the beginning of the string or after a space) :
\A([A-Za-z]\.)+|(?<=\s)([A-Za-z]\.)+
EDIT : Sorry Shawn didn't see your modified answer

How can I express 'repeat this part' in a regular expression?

Suppose I want to match a string like this:
123(432)123(342)2348(34)
I can match digits like 123 with [\d]* and (432) with \([\d]+\).
How can I match the whole string by repeating either of the 2 patterns?
I tried [[\d]* | \([\d]+\)]+, but this is incorrect.
I am using python re module.
I think you need this regex:
"^(\d+|\(\d+\))+$"
and to avoid catastrophic backtracking you need to change it to a regex like this:
"^(\d|\(\d+\))+$"
You can use a character class to match the whole string:
[\d()]+
But if you want to match the separate parts in separate groups, you can use re.findall with a special regex based on your need, for example:
>>> import re
>>> s="123(432)123(342)2348(34)"
>>> re.findall(r'\d+\(\d+\)',s)
['123(432)', '123(342)', '2348(34)']
>>>
Or :
>>> re.findall(r'(\d+)\((\d+)\)',s)
[('123', '432'), ('123', '342'), ('2348', '34')]
Or you can just use \d+ to get all the numbers :
>>> re.findall(r'\d+',s)
['123', '432', '123', '342', '2348', '34']
If you want to match the pattern \d+\(\d+\) repeatedly, you can use the following regex:
(?:\d+\(\d+\))+
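Used with re.fullmatch, that repeated pattern can serve as a whole-string check. A small sketch, with a made-up negative example:

```python
import re

repeated = re.compile(r'(?:\d+\(\d+\))+')

good = bool(repeated.fullmatch('123(432)123(342)2348(34)'))
bad = bool(repeated.fullmatch('123(432)99'))  # trailing digits without parentheses
print(good)  # True
print(bad)   # False
```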
You can achieve it with this pattern:
^(?=.)\d*(?:\(\d+\)\d*)*$
(?=.) ensures there is at least one character (if you want to allow empty strings, remove it).
\d*(?:\(\d+\)\d*)* is an unrolled sub-pattern. Explanation: With a backtracking regex engine, when you have a sub-pattern like (A|B)* where A and B are mutually exclusive (or at least when the end of A or B doesn't match respectively the beginning of B or A), you can rewrite the sub-pattern like this: A*(BA*)* or B*(AB*)*. For your example, it replaces (?:\d+|\(\d+\))*
This new form is more efficient: it reduces the steps needed to obtain a match and avoids a great part of the potential backtracking.
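To illustrate that the unrolled form accepts the same strings as the alternation it replaces, here is a small sketch comparing the two on a handful of made-up inputs. It checks agreement only and does not measure speed.

```python
import re

alternation = re.compile(r'^(?:\d+|\(\d+\))+$')
unrolled = re.compile(r'^(?=.)\d*(?:\(\d+\)\d*)*$')

samples = ['123(432)123(342)2348(34)', '(1)(2)', '42', '', '12(', '(12', '1()2']
verdicts = [(bool(alternation.match(s)), bool(unrolled.match(s))) for s in samples]
agree = all(a == u for a, u in verdicts)
print(agree)  # True
```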
Note that you can improve it more, if you emulate an atomic group (?>....) with this trick (?=(....))\1 that uses the fact that a lookahead is naturally atomic:
^(?=.)(?=(\d*(?:\(\d+\)\d*)*))\1$
demo (compare the number of steps needed with the previous version and check the debugger to see what happens)
Note: if you don't want two consecutive numbers enclosed in parentheses, you only need to change the quantifier * to + inside the non-capturing group and add (?:\(\d+\))? at the end of the pattern, before the anchor $:
^(?=.)\d*(?:\(\d+\)\d+)*(?:\(\d+\))?$
or
^(?=.)(?=(\d*(?:\(\d+\)\d+)*(?:\(\d+\))?))\1$
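A brief sketch of how the first of these two variants treats adjacent parenthesized numbers (the shorter test strings are made up):

```python
import re

no_adjacent = re.compile(r'^(?=.)\d*(?:\(\d+\)\d+)*(?:\(\d+\))?$')

original = bool(no_adjacent.match('123(432)123(342)2348(34)'))
adjacent = bool(no_adjacent.match('(1)(2)'))
separated = bool(no_adjacent.match('(1)2(3)'))
print(original)   # True
print(adjacent)   # False, two parenthesized numbers in a row
print(separated)  # True
```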
