Matching sequentially repeated brackets with Python Regex - python

Basically I'm trying to find a series of consecutive repeating patterns using the python with the regex:
(X[0-9]+)+
For example, give the input string:
YYYX4X5Z3X2
Get a list of results:
["X4X5", "X2"]
However I am instead getting:
["X5", "X2"]
I have tested the regex on regexpal and verified that it is correct however, due to the way python treats "()" I am unable to get the desired result. Can someone advise?

Turn your capturing group into a non-capturing (?:...) group instead ...
>>> import re
>>> re.findall(r'(?:X[0-9]+)+', 'YYYX4X5Z3X2')
['X4X5', 'X2']
Another example:
>>> re.findall(r'(?:X[0-9]+)+', 'YYYX4X5Z3X2Z4X6X7X8Z5X9')
['X4X5', 'X2', 'X6X7X8', 'X9']

modify your pattern like so
((?:X[0-9]+)+)
Demo
( # Capturing Group (1)
(?: # Non Capturing Group
X # "X"
[0-9] # Character Class [0-9]
+ # (one or more)(greedy)
) # End of Non Capturing Group
+ # (one or more)(greedy)
) # End of Capturing Group (1)

You need to give in a non-capturing group (?:<pattern>) for the first pattern:
((?:X[0-9]+)+)

Related

Regex string between square brackets only if '.' is within string

I'm trying to detect the text between two square brackets in Python however I only want the result where there is a "." within it.
I currently have [(.*?] as my regex, using the following example:
String To Search:
CASE[Data Source].[Week] = 'THIS WEEK'
Result:
Data Source, Week
However I need the whole string as [Data Source].[Week], (square brackets included, only if there is a '.' in the middle of the string). There could also be multiple instances where it matches.
You might write a pattern matching [...] and then repeat 1 or more times a . and again [...]
\[[^][]*](?:\.\[[^][]*])+
Explanation
\[[^][]*] Match from [...] using a negated character class
(?: Non capture group to repeat as a whole part
\.\[[^][]*] Match a dot and again [...]
)+ Close the non capture group and repeat 1+ times
See a regex demo.
To get multiple matches, you can use re.findall
import re
pattern = r"\[[^][]*](?:\.\[[^][]*])+"
s = ("CASE[Data Source].[Week] = 'THIS WEEK'\n"
"CASE[Data Source].[Week] = 'THIS WEEK'")
print(re.findall(pattern, s))
Output
['[Data Source].[Week]', '[Data Source].[Week]']
If you also want the values of between square brackets when there is not dot, you can use an alternation with lookaround assertions:
\[[^][]*](?:\.\[[^][]*])+|(?<=\[)[^][]*(?=])
Explanation
\[[^][]*](?:\.\[[^][]*])+ The same as the previous pattern
| Or
(?<=\[)[^][]*(?=]) Match [...] asserting [ to the left and ] to the right
See another regex demo
I think an alternative approach could be:
import re
pattern = re.compile("(\[[^\]]*\]\.\[[^\]]*\])")
print(pattern.findall(sss))
OUTPUT
['[Data Source].[Week]']

Python Regex Find match group of range of non digits after hyphen and if range is not present ignore rest of pattern

I'm newer to more advanced regex concepts and am starting to look into look behinds and lookaheads but I'm getting confused and need some guidance. I have a scenario in which I may have several different kind of release zips named something like:
v1.1.2-beta.2.zip
v1.1.2.zip
I want to write a one line regex that can find match groups in both types. For example if file type is the first zip, I would want three match groups that look like:
v1.1.2-beta.2.zip
Group 1: v1.1.2
Group 2: beta
Group 3. 2
or if the second zip one match group:
v1.1.2.zip
Group 1: v1.1.2
This is where things start getting confusing to me as I would assume that the regex would need to assert if the hyphen exists and if does not, only look for the one match group, if not find the other 3.
(v[0-9.]{0,}).([A-Za-z]{0,}).([0-9]).zip
This was the initial regex I wrote witch successfully matches the first type but does not have the conditional. I was thinking about doing something like match group range of non digits after hyphen but can't quite get it to work and don't not know to make it ignore the rest of the pattern and accept just the first group if it doesn't find the hyphen
([\D]{0,}(?=[-]) # Does not work
Can someone point me in the right right direction?
You can use re.findall:
import re
s = ['v1.1.2-beta.2.zip', 'v1.1.2.zip']
final_results = [re.findall('[a-zA-Z]{1}[\d\.]+|(?<=\-)[a-zA-Z]+|\d+(?=\.zip)', i) for i in s]
groupings = ["{}\n{}".format(a, '\n'.join(f'Group {i}: {c}' for i, c in enumerate(b, 1))) for a, b in zip(s, final_results)]
for i in groupings:
print(i)
print('-'*10)
Output:
v1.1.2-beta.2.zip
Group 1: v1.1.2
Group 2: beta
Group 3: 2
----------
v1.1.2.zip
Group 1: v1.1.2.
----------
Note that the result garnered from re.findall is:
[['v1.1.2', 'beta', '2'], ['v1.1.2.']]
Here is how I would approach this using re.search. Note that we don't need lookarounds here; just a fairly complex pattern will do the job.
import re
regex = r"(v\d+(?:\.\d+)*)(?:-(\w+)\.(\d+))?\.zip"
str1 = "v1.1.2-beta.2.zip"
str2 = "v1.1.2.zip"
match = re.search(regex, str1)
print(match.group(1))
print(match.group(2))
print(match.group(3))
print("\n")
match = re.search(regex, str2)
print(match.group(1))
v1.1.2
beta
2
v1.1.2
Demo
If you don't have a ton of experience with regex, providing an explanation of each step probably isn't going to bring you up to speed. I will comment, though, on the use of ?: which appears in some of the parentheses. In that context, ?: tells the regex engine not to capture what is inside. We do this because you only want to capture (up to) three specific things.
We can use the following regex:
(v\d+(?:\.\d+)*)(?:[-]([A-Za-z]+))?((?:\.\d+)*)\.zip
This thus produces three groups: the first one the version, the second is optional: a dash - followed by alphabetical characters, and then an optional sequence of dots followed by numbers, and finally .zip.
If we ignore the \.zip suffix (well I assume this is rather trivial), then there are still three groups:
(v\d+(?:\.\d+)*): a regex group that starts with a v followed by \d+ (one or more digits). Then we have a non-capture group (a group starting with (?:..) that captures \.\d+ a dot followed by a sequence of one or more digits. We repeat such subgroup zero or more times.
(?:[-]([A-Za-z]+))?: a capture group that starts with a hyphen [-] and then one or more [A-Za-z] characters. The capture group is however optional (the ? at the end).
((?:\.\d+)*): a group that again has such \.\d+ non-capture subgroup, so we capture a dot followed by a sequence of digits, and this pattern is repeated zero or more times.
For example:
rgx = re.compile(r'(v\d+(?:\.\d+)*)([-][A-Za-z]+)?((?:\.\d+)*)\.zip')
We then obtain:
>>> rgx.findall('v1.1.2-beta.2.zip')
[('v1.1.2', '-beta', '.2')]
>>> rgx.findall('v1.1.2.zip')
[('v1.1.2', '', '')]

regex: combine capturing group with OR condition

Let's say I have a string :
s = "id_john, num847, id_000, num___"
I know how to retrieve either of 2 patterns with |:
re.findall("id_[a-z]+|num[0-9]+", s)
#### ['id_john', 'num847'] # OK
I know how to capture a portion only of a match with parenthesis:
re.findall("id_([a-z]+)", s)
#### ['john']
But I fail when i try to combine those two features, this is my desired output:
#### ['john', '847']
Thanks for your help.. (I work with python)
No need for lookaheads or complex patterns.
Consider this:
>>> re.findall('id_([a-z]+)|num([0-9]+)', s)
[('john', ''), ('', '847')]
When the first pattern matches, the first group will contain the match, and the second group will be empty. When the second pattern matches, the first group is empty, and the second group contains the match.
Since one of the two groups will always be empty, joining them couldn't hurt.
>>> [a+b for a,b in re.findall('id_([a-z]+)|num([0-9]+)', s)]
['john', '847']
You may use this code in Python with lookaheads:
>>> s = "id_john, num847, id_000, num___"
>>> print re.findall(r'(?:id_(?=[a-z]+\b)|num(?=\d+\b))([a-z\d]+)', s)
['john', '847']
RegEx Details:
(?:: Start non-capture group
id_(?=[a-z]+\b): Match id_ with a lookahead assertion to make sure we have [a-z]+ characters ahead followed by word boundary
|: OR
num(?=\d+\b))([a-z\d]+: Matchnum` with a lookahead assertion to make sure we have digits ahead followed by word boundary
): End non-capture group
([a-z\d]+): Match 1+ characters with lowercase letters or digits

Python Regular expression for splitting mentions of two years appearing altogether

I have the following case, where in my string I have improperly formatted mentions of the form "(19561958)" that I would like to split into "(1956-1958)". The regular expression that I tried is:
import re
a = "(19561958)"
re.sub(r"(\d\d\d\d\d\d\d\d)", r"\1-", a)
but this returns me "(19561958-)". How can I achieve my purpose? Many thanks!
You could capture the two years separately, and insert the hyphen between the two groups:
>>> import re
>>> re.sub(r'(\d{4})(\d{4})', r'\1-\2', '(19561958)')
'(1956-1958)'
Note that \d\d\d\d is written more concisely as \d{4}.
As currently written, this will insert a hyphen between the first two groups of four in any eight-digit-plus number. If you require the parentheses for the match, you can include them explicitly with look-arounds:
>>> re.sub(r'''
(?<=\() # make sure there's an opening parenthesis prior to the groups
(\d{4}) # one group of four digits
(\d{4}) # and a second group of four digits
(?=\)) # with a closing parenthesis after the two groups
''', r'\1-\2', '(19561958)', flags=re.VERBOSE)
'(1956-1958)'
Alternatively, you could use word boundaries, which would also deal with e.g. spaces around an eight-digit number:
>>> re.sub(r'\b(\d{4})(\d{4})\b', r'\1-\2', '(19561958)')
'(1956-1958)'
Use two capturing groups: r"(\d\d\d\d)(\d\d\d\d)" or r"(\d{4})(\d{4})".
The 2nd group is referenced with \2.
You could use capturing groups or look arounds.
re.sub(r"\((\d{4})(\d{4})\)", r"(\1-\2)", a)
\d{4} matches exactly 4 digits.
Example:
>>> a = "(19561958)"
>>> re.sub(r"\((\d{4})(\d{4})\)", r"(\1-\2)", a)
'(1956-1958)'
OR
Through lookarounds.
>>> a = "(19561958)"
>>> re.sub(r"(?<=\(\d{4})(?=\d{4}\))", r"-", a)
'(1956-1958)'
(?<=\(\d{4}) Positive lookbehind which asserts that the match must be preceded by ( and four digit characters.
(?=\d{4}\)) Posiitve lookahead which asserts that the match must be followed by 4 digits plus ) symbol.
Here a boundary got matched. Replacing the matched boundary with - will give you the desired output.

(?:) regular expression Python

I came across a regular expression today but it was very poorly and scarcely explained. What is the purpose of (?:) regex in python and where & when is it used?
I have tried this but it doesn't seem to be working. Why is that?
word = "Hello. ) kahn. ho.w are 19tee,n doing 2day; (x+y)"
expressoin = re.findall(r'(?:a-z\+a-z)', word);
From the re module documentation:
(?:...)
A non-capturing version of regular parentheses. Matches whatever
regular expression is inside the parentheses, but the substring
matched by the group cannot be retrieved after performing a match or
referenced later in the pattern.
Basically, it's the same thing as (...) but without storing a captured string in a group.
Demo:
>>> import re
>>> re.search('(?:foo)(bar)', 'foobar').groups()
('bar',)
Only one group is returned, containing bar. The (?:foo) group was not.
Use this whenever you need to group metacharacters that would otherwise apply to a larger section of the expression, such as | alternate groups:
monty's (?:spam|ham|eggs)
You don't need to capture the group but do need to limit the scope of the | meta characters.
As for your sample attempt; using re.findall() you often do want to capture output. You most likely are looking for:
re.findall('([a-z]\+[a-z])', word)
where re.findall() will return a list tuples of all captured groups; if there is only one captured group, it's a list of strings containing just the one group per match.
Demo:
>>> word = "Hello. ) kahn. ho.w are 19tee,n doing 2day; (x+y)"
>>> re.findall('([a-z]\+[a-z])', word)
['x+y']
?: is used to ignore capturing a group.
For example in regex (\d+) match will be in group \1
But if you use (?:\d+) then there will be nothing in group \1
It is used for non-capturing group:
>>> matched = re.search('(?:a)(b)', 'ab') # using non-capturing group
>>> matched.group(1)
'b'
>>> matched = re.search('(a)(b)', 'ab') # using capturing group
>>> matched.group(1)
'a'
>>> matched.group(2)
'b'

Categories

Resources