(?:) regular expression in Python

I came across a regular expression today, but the explanation I found was sparse and unclear. What is the purpose of (?:) in a Python regex, and where and when is it used?
I have tried this but it doesn't seem to be working. Why is that?
word = "Hello. ) kahn. ho.w are 19tee,n doing 2day; (x+y)"
expression = re.findall(r'(?:a-z\+a-z)', word)

From the re module documentation:
(?:...)
A non-capturing version of regular parentheses. Matches whatever
regular expression is inside the parentheses, but the substring
matched by the group cannot be retrieved after performing a match or
referenced later in the pattern.
Basically, it's the same thing as (...) but without storing a captured string in a group.
Demo:
>>> import re
>>> re.search('(?:foo)(bar)', 'foobar').groups()
('bar',)
Only one group is returned, containing bar. The (?:foo) group was matched but not captured.
Use this whenever you need to group metacharacters that would otherwise apply to a larger section of the expression, such as | alternate groups:
monty's (?:spam|ham|eggs)
You don't need to capture the group but do need to limit the scope of the | meta characters.
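A minimal sketch of that scoping effect, using the same monty's pattern with and without the group. The sample text and variable names are only illustrative:

```python
import re

text = "monty's ham"

# With the non-capturing group, | only chooses between the three foods.
grouped = re.fullmatch(r"monty's (?:spam|ham|eggs)", text)

# Without it, | splits the whole pattern into "monty's spam", "ham" and "eggs".
ungrouped = re.fullmatch(r"monty's spam|ham|eggs", text)

print(grouped is not None)    # True
print(ungrouped is not None)  # False
print(grouped.groups())       # () because nothing was captured
```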
As for your sample attempt: a-z without square brackets matches the literal characters a, -, z rather than a range, and with re.findall() you usually do want to capture output. You are most likely looking for:
re.findall(r'([a-z]\+[a-z])', word)
where re.findall() returns a list of tuples of all captured groups; if there is only one capturing group, it returns a list of strings containing just that group for each match.
Demo:
>>> word = "Hello. ) kahn. ho.w are 19tee,n doing 2day; (x+y)"
>>> re.findall(r'([a-z]\+[a-z])', word)
['x+y']
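To see why the original attempt returned nothing, here is a small sketch comparing the literal pattern from the question with the character-class version. The variable names are illustrative only:

```python
import re

word = "Hello. ) kahn. ho.w are 19tee,n doing 2day; (x+y)"

# Without brackets, a-z is the literal text "a-z", so this looks for "a-z+a-z".
literal = re.findall(r'(?:a-z\+a-z)', word)

# With brackets, [a-z] is a character class matching one lowercase letter.
classes = re.findall(r'[a-z]\+[a-z]', word)

# The literal pattern only matches input that contains that exact text.
literal_hit = re.findall(r'(?:a-z\+a-z)', 'a-z+a-z')

print(literal)      # []
print(classes)      # ['x+y']
print(literal_hit)  # ['a-z+a-z']
```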

?: makes a group non-capturing.
For example, in the regex (\d+) the match is stored in group 1 (\1).
But if you use (?:\d+), nothing is stored, so there is no group 1.
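A short sketch of that difference with re.search; the input string 'abc123' is just an assumed example:

```python
import re

capturing = re.search(r'(\d+)', 'abc123')
non_capturing = re.search(r'(?:\d+)', 'abc123')

print(capturing.groups())      # ('123',)
print(non_capturing.groups())  # ()
print(non_capturing.group(0))  # '123' (the whole match is still available)

# Asking for group 1 when the pattern has no capturing group raises IndexError.
try:
    non_capturing.group(1)
except IndexError as exc:
    print("no group 1:", exc)
```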

It creates a non-capturing group:
>>> matched = re.search('(?:a)(b)', 'ab') # using non-capturing group
>>> matched.group(1)
'b'
>>> matched = re.search('(a)(b)', 'ab') # using capturing group
>>> matched.group(1)
'a'
>>> matched.group(2)
'b'

Related

python regex - findall not returning output as expected

I am having trouble understanding findall, which says...
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
Why doesn't this basic IP regex work with findall as expected? The matches are not overlapping, and regexpal confirms that pattern is highlighted in re_str.
Expected: ['1.2.2.3', '123.345.34.3']
Actual: ['2.', '34.']
re_str = r'(\d{1,3}\.){3}\d{1,3}'
line = 'blahblah -- 1.2.2.3 blah 123.345.34.3'
matches = re.findall(re_str, line)
print(matches) # ['2.', '34.']
When you use parentheses in your regex, re.findall() will return only the parenthesized groups, not the entire matched string. Put a ?: after the ( to tell it not to use the parentheses to extract a group, and then the results should be the entire matched string.
You get only '2.' and '34.' because a repeated capturing group keeps only its last iteration.
Instead, you should make the repeating group non-capturing, and use a non-repeated capture at an outer layer:
re_str = r'((?:\d{1,3}\.){3}\d{1,3})'
Note that for findall, if there is no capturing group, the whole match is automatically selected (like \0), so you could drop the outer capture:
re_str = r'(?:\d{1,3}\.){3}\d{1,3}'
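The three behaviours can be compared side by side. This is a sketch using the question's input; the re.finditer variant is an additional option not mentioned above, which reads the whole match from group(0) regardless of capturing groups:

```python
import re

line = 'blahblah -- 1.2.2.3 blah 123.345.34.3'

# Repeated capturing group: findall returns the last iteration of the group.
capturing = re.findall(r'(\d{1,3}\.){3}\d{1,3}', line)

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r'(?:\d{1,3}\.){3}\d{1,3}', line)

# finditer yields match objects, so group(0) is always the whole match.
via_finditer = [m.group(0) for m in re.finditer(r'(\d{1,3}\.){3}\d{1,3}', line)]

print(capturing)      # ['2.', '34.']
print(non_capturing)  # ['1.2.2.3', '123.345.34.3']
print(via_finditer)   # ['1.2.2.3', '123.345.34.3']
```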

Python Regex Find match group of range of non digits after hyphen and if range is not present ignore rest of pattern

I'm newer to more advanced regex concepts and am starting to look into look behinds and lookaheads but I'm getting confused and need some guidance. I have a scenario in which I may have several different kind of release zips named something like:
v1.1.2-beta.2.zip
v1.1.2.zip
I want to write a one line regex that can find match groups in both types. For example if file type is the first zip, I would want three match groups that look like:
v1.1.2-beta.2.zip
Group 1: v1.1.2
Group 2: beta
Group 3: 2
or, for the second zip, one match group:
v1.1.2.zip
Group 1: v1.1.2
This is where things start getting confusing to me: I assume the regex would need to check whether the hyphen exists and, if it does not, look only for the one match group; otherwise it should find all three.
(v[0-9.]{0,}).([A-Za-z]{0,}).([0-9]).zip
This was the initial regex I wrote, which successfully matches the first type but does not have the conditional. I was thinking about matching a range of non-digits after the hyphen, but I can't quite get it to work, and I don't know how to make it ignore the rest of the pattern and accept just the first group if it doesn't find the hyphen.
([\D]{0,}(?=[-]) # Does not work
Can someone point me in the right direction?
You can use re.findall:
import re
s = ['v1.1.2-beta.2.zip', 'v1.1.2.zip']
final_results = [re.findall(r'[a-zA-Z]{1}[\d\.]+|(?<=\-)[a-zA-Z]+|\d+(?=\.zip)', i) for i in s]
groupings = ["{}\n{}".format(a, '\n'.join(f'Group {i}: {c}' for i, c in enumerate(b, 1))) for a, b in zip(s, final_results)]
for i in groupings:
    print(i)
    print('-'*10)
Output:
v1.1.2-beta.2.zip
Group 1: v1.1.2
Group 2: beta
Group 3: 2
----------
v1.1.2.zip
Group 1: v1.1.2.
----------
Note that the result garnered from re.findall is:
[['v1.1.2', 'beta', '2'], ['v1.1.2.']]
Here is how I would approach this using re.search. Note that we don't need lookarounds here; just a fairly complex pattern will do the job.
import re
regex = r"(v\d+(?:\.\d+)*)(?:-(\w+)\.(\d+))?\.zip"
str1 = "v1.1.2-beta.2.zip"
str2 = "v1.1.2.zip"
match = re.search(regex, str1)
print(match.group(1))
print(match.group(2))
print(match.group(3))
print("\n")
match = re.search(regex, str2)
print(match.group(1))
Output:
v1.1.2
beta
2
v1.1.2
If you don't have a ton of experience with regex, providing an explanation of each step probably isn't going to bring you up to speed. I will comment, though, on the use of ?: which appears in some of the parentheses. In that context, ?: tells the regex engine not to capture what is inside. We do this because you only want to capture (up to) three specific things.
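When the optional part is absent, groups 2 and 3 of this pattern are None rather than missing. A minimal sketch that wraps the same regex in a helper; the name parse_release and the use of fullmatch instead of search are assumptions for illustration, not part of the answer above:

```python
import re

RELEASE_RE = re.compile(r"(v\d+(?:\.\d+)*)(?:-(\w+)\.(\d+))?\.zip")

def parse_release(name):
    """Return the captured parts of a release file name, or None if it does not match."""
    m = RELEASE_RE.fullmatch(name)
    if m is None:
        return None
    # Unmatched optional groups are None; drop them so only found parts remain.
    return tuple(g for g in m.groups() if g is not None)

print(parse_release("v1.1.2-beta.2.zip"))  # ('v1.1.2', 'beta', '2')
print(parse_release("v1.1.2.zip"))         # ('v1.1.2',)
print(parse_release("notes.txt"))          # None
```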
We can use the following regex:
(v\d+(?:\.\d+)*)(?:[-]([A-Za-z]+))?((?:\.\d+)*)\.zip
This thus produces three groups: the first one the version, the second is optional: a dash - followed by alphabetical characters, and then an optional sequence of dots followed by numbers, and finally .zip.
If we ignore the \.zip suffix (well I assume this is rather trivial), then there are still three groups:
(v\d+(?:\.\d+)*): a capture group that starts with a v followed by \d+ (one or more digits). Then we have a non-capture group (a group starting with (?:) that matches \.\d+, a dot followed by a sequence of one or more digits. We repeat this subgroup zero or more times.
(?:[-]([A-Za-z]+))?: a non-capture group that starts with a hyphen [-] and then contains a capture group for one or more [A-Za-z] characters. The whole group is optional (the ? at the end), and the hyphen itself is not captured.
((?:\.\d+)*): a group that again has such \.\d+ non-capture subgroup, so we capture a dot followed by a sequence of digits, and this pattern is repeated zero or more times.
For example, using a variant that keeps the hyphen inside the second capture group:
rgx = re.compile(r'(v\d+(?:\.\d+)*)([-][A-Za-z]+)?((?:\.\d+)*)\.zip')
We then obtain:
>>> rgx.findall('v1.1.2-beta.2.zip')
[('v1.1.2', '-beta', '.2')]
>>> rgx.findall('v1.1.2.zip')
[('v1.1.2', '', '')]
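For comparison, a sketch that compiles the pattern as first written in this answer, with the hyphen outside the second capture group, so the hyphen is not part of the result. The name rgx_no_hyphen is illustrative:

```python
import re

# Hyphen sits in the non-capturing wrapper, so only the letters are captured.
rgx_no_hyphen = re.compile(r'(v\d+(?:\.\d+)*)(?:[-]([A-Za-z]+))?((?:\.\d+)*)\.zip')

with_suffix = rgx_no_hyphen.findall('v1.1.2-beta.2.zip')
without_suffix = rgx_no_hyphen.findall('v1.1.2.zip')

print(with_suffix)     # [('v1.1.2', 'beta', '.2')]
print(without_suffix)  # [('v1.1.2', '', '')]
```

Note that findall reports an unmatched optional group as an empty string, whereas a match object's groups() would report None.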

Python regex: findall() and search()

I have the following Python regex:
>>> p = re.compile(r"(\b\w+)\s+\1")
\b : word boundary
\w+ : one or more word characters (letters, digits, or underscore)
\s+ : one or more whitespace characters (can be a space, \t, \n, ...)
\1 : backreference to group 1 ( = the part between (..))
This regex should find all doubled occurrences of a word, provided the two occurrences are next to each other with some whitespace in between.
The regex seems to work fine when using the search function:
>>> p.search("I am in the the car.")
<_sre.SRE_Match object; span=(8, 15), match='the the'>
The found match is the the, just as I had expected. The weird behaviour is in the findall function:
>>> p.findall("I am in the the car.")
['the']
The found match is now only the. Why the difference?
When using groups in a regular expression, findall() returns only the groups; from the documentation:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
You can't avoid using groups when using backreferences, but you can put a new group around the whole pattern:
>>> p = re.compile(r"((\b\w+)\s+\2)")
>>> p.findall("I am in the the car.")
[('the the', 'the')]
The outer group is group 1, so the backreference should be pointing to group 2. You now have two groups, so there are two results per entry. Using a named group might make this more readable:
>>> p = re.compile(r"((?P<word>\b\w+)\s+(?P=word))")
You can filter that back to just the outer group result:
>>> [m[0] for m in p.findall("I am in the the car.")]
['the the']
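An alternative sketch that keeps the original single-group pattern and uses finditer instead, reading the whole match from group(0). This is an additional option, not part of the answer above:

```python
import re

p = re.compile(r"(\b\w+)\s+\1")
sentence = "I am in the the car."

# findall returns the contents of group 1 only.
groups_only = p.findall(sentence)

# finditer yields match objects; group(0) is the full doubled-word match.
doubled = [m.group(0) for m in p.finditer(sentence)]

print(groups_only)  # ['the']
print(doubled)      # ['the the']
```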

Matching sequentially repeated brackets with Python Regex

Basically, I'm trying to find a series of consecutive repeating patterns in Python with the regex:
(X[0-9]+)+
For example, given the input string:
YYYX4X5Z3X2
Get a list of results:
["X4X5", "X2"]
However I am instead getting:
["X5", "X2"]
I have tested the regex on regexpal and verified that it is correct. However, due to the way Python treats "()", I am unable to get the desired result. Can someone advise?
Turn your capturing group into a non-capturing (?:...) group instead ...
>>> import re
>>> re.findall(r'(?:X[0-9]+)+', 'YYYX4X5Z3X2')
['X4X5', 'X2']
Another example:
>>> re.findall(r'(?:X[0-9]+)+', 'YYYX4X5Z3X2Z4X6X7X8Z5X9')
['X4X5', 'X2', 'X6X7X8', 'X9']
Modify your pattern like so:
((?:X[0-9]+)+)
Breakdown:
(            # Capturing Group (1)
  (?:        # Non Capturing Group
    X        # "X"
    [0-9]    # Character Class [0-9]
    +        # (one or more)(greedy)
  )          # End of Non Capturing Group
  +          # (one or more)(greedy)
)            # End of Capturing Group (1)
You need to use a non-capturing group (?:<pattern>) for the inner, repeated pattern:
((?:X[0-9]+)+)
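A quick sketch comparing the question's pattern with the nested version, on the question's input. The variable names are illustrative:

```python
import re

text = 'YYYX4X5Z3X2'

# Repeated capturing group: each run reports only its last X<digits> piece.
last_only = re.findall(r'(X[0-9]+)+', text)

# Outer capturing group around a repeated non-capturing group: whole runs.
whole_runs = re.findall(r'((?:X[0-9]+)+)', text)

print(last_only)   # ['X5', 'X2']
print(whole_runs)  # ['X4X5', 'X2']
```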

Python regex match OR operator

I'm trying to match time formats in AM or PM.
i.e. 02:40PM
12:29AM
I'm using the following regex
timePattern = re.compile('\d{2}:\d{2}(AM|PM)')
but it keeps returning only AM PM string without the numbers. What's going wrong?
Use a non-capturing group (?:...) and retrieve the whole match with .group().
Use re.I for case insensitive matching.
import re
def find_t(text):
    return re.search(r'\d{2}:\d{2}(?:am|pm)', text, re.I).group()
You can also use re.findall() to return every non-overlapping match.
def find_t(text):
    return re.findall(r'\d{2}:\d{2}(?:am|pm)', text, re.I)
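One caveat with the re.search version: re.search returns None when nothing matches, so calling .group() on the result raises AttributeError. A minimal sketch of a guarded variant; the names TIME_RE and find_first_time are hypothetical:

```python
import re

TIME_RE = re.compile(r'\d{2}:\d{2}(?:am|pm)', re.I)

def find_first_time(text):
    """Return the first time-like substring, or None if there is none."""
    m = TIME_RE.search(text)
    return m.group() if m else None

print(find_first_time('Meet at 02:40pm'))  # 02:40pm
print(find_first_time('no time here'))     # None
```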
Use a non-capturing group (?:...):
>>> from re import findall
>>> mystr = """
... 02:40PM
... 12:29AM
... """
>>> findall(r"\d{2}:\d{2}(?:AM|PM)", mystr)
['02:40PM', '12:29AM']
>>>
Also, you can shorten your Regex to \d\d:\d\d(?:A|P)M.
It sounds like you're accessing group 1, when you need to be accessing group 0.
The groups in your regex are as follows:
\d{2}:\d{2}(AM|PM)
           |-----| - group 1
|----------------| - group 0 (always the match of the entire pattern)
You can access the entire match via:
timePattern.match('02:40PM').group(0)
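A sketch of the group 0 versus group 1 distinction using the question's pattern (written here as a raw string). The two-time sample string is an assumption for illustration:

```python
import re

timePattern = re.compile(r'\d{2}:\d{2}(AM|PM)')

m = timePattern.match('02:40PM')
print(m.group(0))  # 02:40PM (entire match)
print(m.group(1))  # PM (only the parenthesized part)

sample = '02:40PM 12:29AM'

# findall returns group 1 because the pattern has one capturing group.
suffixes = timePattern.findall(sample)

# finditer plus group(0) recovers the full times without changing the pattern.
times = [t.group(0) for t in timePattern.finditer(sample)]

print(suffixes)  # ['PM', 'AM']
print(times)     # ['02:40PM', '12:29AM']
```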
You're not capturing the hour and minute fields:
>>> import re
>>> r = re.compile(r'(\d{2}:\d{2}(?:AM|PM))')
>>> r.search('02:40PM').group()
'02:40PM'
>>> r.search('Time is 12:29AM').group()
'12:29AM'
Are you accidentally grabbing the 1st cluster (the stuff in that matches the portion of the pattern in the parentheses) instead of the "0st" cluster (which is the whole match)?
