Python Regex - find all occurrences of a group after a prefix

I have strings like these:
s1 = 'H: 1234.34.34'
s2 = 'H: 1234.34.34 12.12 123.5'
I would like to get the elements separated by space after the H inside groups, so I tried:
myRegex = r'\bH\s*[\s|\:]+(?:\s?(\b\d+[\.?\d+]*\b))*'
It works for string s1:
print(re.search(myRegex, s1).groups())
It gives me ('1234.34.34',), which is fine.
But for s2, I have:
print(re.search(myRegex, s2).groups())
It returns only the last group, ('123.5',), but I expect ('1234.34.34', '12.12', '123.5').
Do you have an idea how to get my expected value?
In addition, I'm not limited to 2 groups; I may have many more.
Thanks a lot
Fred

In your pattern, the part (?:\s?(\b\d+[\.?\d+]*\b))* has a capturing group inside a repeated non-capturing group. A repeated capturing group keeps only the value of its last iteration, so group 1 ends up holding whatever the final repetition of the outer group matched.
The last iteration will match 123.5 and that will be the group 1 value.
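A quick way to see this behaviour, as a minimal sketch using only the standard re module (the sample string and variable names are made up for illustration):

```python
import re

sample = '10 20 30'  # hypothetical input, not taken from the question

# A capturing group inside a repeated group is overwritten on every
# iteration, so only the value from the last iteration survives.
repeated = re.match(r'(?:(\d+)\s?)+', sample)
last_only = repeated.groups()

# Matching one number at a time with findall keeps every occurrence.
all_numbers = re.findall(r'\d+', sample)

print(last_only)
print(all_numbers)
```

Here last_only holds a single value even though the group matched three times, while all_numbers holds all three.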
One option is to match the whole pattern and use a capturing group for the last part.
\bH: (\d+(?:\.\d+)+(?: \d+(?:\.\d+)+)*)\b
If you have the group, you could use split:
import re
s2 = 'H: 1234.34.34 12.12 123.5'
myRegex = r'\bH: (\d+(?:\.\d+)+(?: \d+(?:\.\d+)+)*)\b'
res = re.search(myRegex, s2)
if res:
    print(res.group(1).split())
Output
['1234.34.34', '12.12', '123.5']
Using the PyPI regex module, you could use \G to get iterative matches for the numbers and \K to forget what has been matched so far, which would be the space before the number.
(?:\bH:|\G(?!\A)) \K\d+(?:\.\d+)+
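The standard re module supports neither \G nor \K. One rough way to approximate the same iterative matching with re alone is to call Pattern.match in a loop with an explicit start position, which anchors each attempt where the previous match ended. This is only a sketch, assuming the numbers are separated by single spaces; the names PREFIX, NUMBER and numbers_after_prefix are illustrative and not part of the answer above:

```python
import re

PREFIX = re.compile(r'\bH:')
NUMBER = re.compile(r' (\d+(?:\.\d+)+)')

def numbers_after_prefix(text):
    """Collect the dotted numbers that directly follow 'H:'."""
    start = PREFIX.search(text)
    if start is None:
        return []
    pos = start.end()
    found = []
    while True:
        # match() with a position only succeeds exactly at pos,
        # which plays the role of \G in this sketch.
        m = NUMBER.match(text, pos)
        if m is None:
            break
        found.append(m.group(1))
        pos = m.end()
    return found

print(numbers_after_prefix('H: 1234.34.34 12.12 123.5'))
```

The loop stops at the first token that is not a dotted number, so anything after an unrelated word is ignored.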

Assuming your string will always start with H:, you can do the following:
s2 = 'H: 1234.34.34 12.12 123.5'
output = s2.split("H: ")[-1].split()
Output will be
['1234.34.34', '12.12', '123.5']
The first split gives you all the characters after the "H: ".
The second split breaks that remainder into pieces at the spaces.
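If the prefix might be missing, a slightly more defensive variant can be sketched with str.partition, which reports whether the separator was found. The function name values_after_prefix and its prefix parameter are made up for this illustration:

```python
def values_after_prefix(line, prefix="H:"):
    """Return the space-separated values after prefix, or [] if it is absent."""
    head, sep, tail = line.partition(prefix)
    if not sep:
        # partition returns an empty separator when prefix was not found
        return []
    return tail.split()

print(values_after_prefix('H: 1234.34.34 12.12 123.5'))
print(values_after_prefix('no prefix here'))
```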

Based on your examples, you don't need a regex; split() will suffice:
s1 = 'H: 1234.34.34'
s2 = 'H: 1234.34.34 12.12 123.5'
match1 = s1.split()[1:]
match2 = s2.split()[1:]
print(match1)
print(match2)
['1234.34.34']
['1234.34.34', '12.12', '123.5']

Related

Split a string at the first encounter of a number and ":"

I have a string "person:x:1319:nobody,jram,dapp,test1,app1,lasp\r\n" (for example) and need to split the string and get output only as
"nobody,jram,dapp,test1,app1,lasp\r\n"
How can I do that?
You can use str.rsplit(), which splits the string on the delimiter starting from the right side. rsplit() returns the result as a list, so you can access the values by index.
s = "person:x:1319:nobody,jram,dapp,test1,app1,lasp\r\n"
res = s.rsplit(':', 1)[-1]
print(res)
This solution uses a regex lookbehind to find a digit followed by a colon, then returns the part after it as the match.
import re
s1 = "person:x:1319:nobody,jram,dapp,test1,app1,lasp\r\n"
m = re.search(r'(?<=\d:).*', s1)
match1 = m.group(0)
print(match1)
Output: nobody,jram,dapp,test1,app1,lasp
Note that this solution will still work (according to what was requested in the title) even if you have another colon in the text which is not preceded by a number.
s2 = "person:x:1319:test:nobody,jram,dapp,test1,app1,lasp\r\n"
m = re.search(r'(?<=\d:).*', s2)
match2 = m.group(0)
print(match2)
Output: test:nobody,jram,dapp,test1,app1,lasp
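For comparison, here is a small sketch of how the split-based approach behaves when the tail contains an extra colon. It assumes the record always has exactly three leading fields, which is an assumption for illustration and not something stated in the question; the sample record is shortened:

```python
record = "person:x:1319:test:nobody,jram,dapp\r\n"

# rsplit cuts at the LAST colon, so an extra colon in the tail loses text.
from_right = record.rsplit(':', 1)[-1]

# split with maxsplit=3 cuts at the first three colons only
# (assumption: there are always exactly three leading fields).
from_left = record.split(':', 3)[-1]

print(repr(from_right))
print(repr(from_left))
```

Unlike the regex above, both split variants keep the trailing "\r\n", because the dot in the regex does not match the newline.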

Python Regex Find match group of range of non digits after hyphen and if range is not present ignore rest of pattern

I'm newer to more advanced regex concepts and am starting to look into look behinds and lookaheads but I'm getting confused and need some guidance. I have a scenario in which I may have several different kind of release zips named something like:
v1.1.2-beta.2.zip
v1.1.2.zip
I want to write a one line regex that can find match groups in both types. For example if file type is the first zip, I would want three match groups that look like:
v1.1.2-beta.2.zip
Group 1: v1.1.2
Group 2: beta
Group 3: 2
or if the second zip one match group:
v1.1.2.zip
Group 1: v1.1.2
This is where things start getting confusing to me, as I assume the regex would need to check whether the hyphen exists: if it does not, look only for the one match group; if it does, find all three.
(v[0-9.]{0,}).([A-Za-z]{0,}).([0-9]).zip
This was the initial regex I wrote, which successfully matches the first type but does not have the conditional. I was thinking about doing something like matching a group of non-digits after the hyphen, but I can't quite get it to work, and I don't know how to make it ignore the rest of the pattern and accept just the first group if it doesn't find the hyphen:
([\D]{0,}(?=[-]) # Does not work
Can someone point me in the right direction?
You can use re.findall:
import re
s = ['v1.1.2-beta.2.zip', 'v1.1.2.zip']
final_results = [re.findall(r'[a-zA-Z]\d+(?:\.\d+)*|(?<=-)[a-zA-Z]+|\d+(?=\.zip)', i) for i in s]
groupings = ["{}\n{}".format(a, '\n'.join(f'Group {i}: {c}' for i, c in enumerate(b, 1))) for a, b in zip(s, final_results)]
for i in groupings:
    print(i)
    print('-'*10)
Output:
v1.1.2-beta.2.zip
Group 1: v1.1.2
Group 2: beta
Group 3: 2
----------
v1.1.2.zip
Group 1: v1.1.2
----------
Note that the result returned by re.findall is:
[['v1.1.2', 'beta', '2'], ['v1.1.2']]
Here is how I would approach this using re.search. Note that we don't need lookarounds here; just a fairly complex pattern will do the job.
import re
regex = r"(v\d+(?:\.\d+)*)(?:-(\w+)\.(\d+))?\.zip"
str1 = "v1.1.2-beta.2.zip"
str2 = "v1.1.2.zip"
match = re.search(regex, str1)
print(match.group(1))
print(match.group(2))
print(match.group(3))
print("\n")
match = re.search(regex, str2)
print(match.group(1))
Output:
v1.1.2
beta
2
v1.1.2
If you don't have a ton of experience with regex, providing an explanation of each step probably isn't going to bring you up to speed. I will comment, though, on the use of ?: which appears in some of the parentheses. In that context, ?: tells the regex engine not to capture what is inside. We do this because you only want to capture (up to) three specific things.
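One detail worth seeing in code: when the optional part is absent, its capture groups are reported as None rather than being dropped. The following is a minimal sketch built around the regex above; the names VERSION_RE and release_parts are invented for this example:

```python
import re

VERSION_RE = re.compile(r"(v\d+(?:\.\d+)*)(?:-(\w+)\.(\d+))?\.zip")

def release_parts(filename):
    """Return the captured groups that actually matched, or [] if no match."""
    match = VERSION_RE.search(filename)
    if match is None:
        return []
    # Optional groups that did not take part in the match are None.
    return [g for g in match.groups() if g is not None]

print(release_parts("v1.1.2-beta.2.zip"))
print(release_parts("v1.1.2.zip"))
```

Filtering out None gives three values for the pre-release name and one for the plain release, which is the shape asked for in the question.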
We can use the following regex:
(v\d+(?:\.\d+)*)(?:[-]([A-Za-z]+))?((?:\.\d+)*)\.zip
This produces three groups: the first is the version; the second is optional and holds the alphabetical characters that follow a dash -; the third is an optional sequence of dots followed by numbers. The pattern ends with .zip.
If we ignore the \.zip suffix (which I assume is rather trivial), three groups remain:
(v\d+(?:\.\d+)*): a capture group that starts with a v followed by \d+ (one or more digits). Then comes a non-capture group (a group starting with (?:) that matches \.\d+, a dot followed by one or more digits. This subgroup is repeated zero or more times.
(?:[-]([A-Za-z]+))?: a non-capture group that starts with a hyphen [-], followed by a capture group of one or more [A-Za-z] characters. The whole non-capture group is optional (the ? at the end).
((?:\.\d+)*): a capture group that again contains the \.\d+ non-capture subgroup, so it captures a dot followed by a sequence of digits, repeated zero or more times.
For example:
import re
rgx = re.compile(r'(v\d+(?:\.\d+)*)(?:[-]([A-Za-z]+))?((?:\.\d+)*)\.zip')
We then obtain:
>>> rgx.findall('v1.1.2-beta.2.zip')
[('v1.1.2', 'beta', '.2')]
>>> rgx.findall('v1.1.2.zip')
[('v1.1.2', '', '')]

Stripping variable borders with python re

How does one replace a pattern when the substitution itself is a variable?
I have the following string:
s = '''[[merit|merited]] and [[eat|eaten]] and [[go]]'''
I would like to retain only the right-most word in the brackets ('merited', 'eaten', 'go'), stripping away what surrounds these words, thus producing:
merited and eaten and go
I have the regex:
p = '''\[\[[a-zA-Z]*\[|]*([a-zA-Z]*)\]\]'''
...which produces:
>>> re.findall(p, s)
['merited', 'eaten', 'go']
However, as this varies, I don't see a way to use re.sub() or s.replace().
import re
s = '''[[merit|merited]] and [[eat|eaten]] and [[go]]'''
p = r'''\[\[[a-zA-Z]*?[|]*([a-zA-Z]*)\]\]'''
re.sub(p, r'\1', s)
The ? makes the first [a-zA-Z]* lazy, so that for [[go]] it matches the empty (shortest) string and the second [a-zA-Z]* captures the actual string go.
\1 substitutes the first (in this case the only) match group of the pattern for each non-overlapping match in the string s. r'\1' is used so that \1 is not interpreted as the character with code 0x1.
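The difference between '\1' and r'\1' can be checked directly. This is a small sketch with a simplified pattern and input chosen only for illustration:

```python
import re

plain = '\1'   # octal escape: one control character, code 0x1
raw = r'\1'    # two characters: a backslash and the digit 1

pattern = r'\[\[(\w+)\]\]'  # simplified pattern for this example only

with_plain = re.sub(pattern, plain, '[[go]]')
with_raw = re.sub(pattern, raw, '[[go]]')

print(repr(with_plain))
print(repr(with_raw))
```

Only the raw string reaches re.sub as a backreference; the plain one inserts the control character itself.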
Well, first you need to fix your regex to capture the whole group:
>>> s = '[[merit|merited]] and [[eat|eaten]] and [[go]]'
>>> p = r'(\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\])'
>>> re.findall(p, s)
[('[[merit|merited]]', 'merited'), ('[[eat|eaten]]', 'eaten'), ('[[go]]', 'go')]
This matches the whole [[whateverisinhere]], capturing the whole match as group 1 and just the final word as group 2. You can then use the \2 token to replace the whole match with just group 2:
>>> re.sub(p,r'\2',s)
'merited and eaten and go'
Or change your pattern to:
p = r'\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\]'
which no longer captures the entire match as group 1 and captures only what you want. You can then do:
>>> re.sub(p,r'\1',s)
to have the same effect.
POST EDIT:
I forgot to mention that I actually changed your regex so here is the explanation:
\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\]
\[\[                           \]\]   # literal matches of brackets
    (?:           )*                  # non-capturing group that can match 0 or more of what's inside
       [a-zA-Z]*\|                    # matches any word that is followed by a '|' character
                    ([a-zA-Z]*)       # captures the final word into group one
I feel like this is more robust than what you originally had because it also works if there are more than 2 options:
>>> s = '[[merit|merited]] and [[ate|eat|eaten]] and [[go]]'
>>> p = r'\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\]'
>>> re.sub(p,r'\1',s)
'merited and eaten and go'
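When the replacement has to be computed from each match, re.sub also accepts a function instead of a template string. The sketch below is an alternative to the patterns above, not a restatement of them: it captures everything between the double brackets and picks the last option in plain Python. The helper name last_option is hypothetical:

```python
import re

def last_option(match):
    """Return the right-most '|'-separated word inside [[...]]."""
    return match.group(1).split('|')[-1]

s = '[[merit|merited]] and [[ate|eat|eaten]] and [[go]]'

# [^\]]* grabs everything up to the closing brackets.
result = re.sub(r'\[\[([^\]]*)\]\]', last_option, s)
print(result)
```

This keeps the regex simple and moves the choice of word into ordinary string handling.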

Matching sequentially repeated brackets with Python Regex

Basically, I'm trying to find a series of consecutive repeating patterns using Python with the regex:
(X[0-9]+)+
For example, given the input string:
YYYX4X5Z3X2
Get a list of results:
["X4X5", "X2"]
However I am instead getting:
["X5", "X2"]
I have tested the regex on regexpal and verified that it is correct. However, due to the way Python treats "()", I am unable to get the desired result. Can someone advise?
Turn your capturing group into a non-capturing (?:...) group instead ...
>>> import re
>>> re.findall(r'(?:X[0-9]+)+', 'YYYX4X5Z3X2')
['X4X5', 'X2']
Another example:
>>> re.findall(r'(?:X[0-9]+)+', 'YYYX4X5Z3X2Z4X6X7X8Z5X9')
['X4X5', 'X2', 'X6X7X8', 'X9']
Modify your pattern like so:
((?:X[0-9]+)+)
(                 # Capturing Group (1)
  (?:             # Non Capturing Group
    X             # "X"
    [0-9]         # Character Class [0-9]
    +             # (one or more)(greedy)
  )               # End of Non Capturing Group
  +               # (one or more)(greedy)
)                 # End of Capturing Group (1)
You need to use a non-capturing group (?:<pattern>) for the inner pattern:
((?:X[0-9]+)+)
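The underlying reason is that re.findall returns the contents of the capture group whenever the pattern contains one, instead of the whole match. A short sketch comparing the variants on the string from the question (the variable names are illustrative):

```python
import re

text = 'YYYX4X5Z3X2'

# With a capture group, findall returns the group, which only holds
# the last repetition.
with_capture = re.findall(r'(X[0-9]+)+', text)

# With a non-capturing group, findall returns the whole match.
without_capture = re.findall(r'(?:X[0-9]+)+', text)

# finditer gives access to the whole match even with a capture group.
whole_matches = [m.group(0) for m in re.finditer(r'(X[0-9]+)+', text)]

print(with_capture, without_capture, whole_matches)
```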

Regex - Replacing a matching subgroup

The following code finds, in a string, the names of regex-like groups to be replaced. I would like to use it to change the names name_1, name_2 and not_escaped to test_name_1, test_name_2 and test_not_escaped respectively. In each match m, the name is m.group(2). How can I do that?
import re
p = re.compile(r"(?<!\\)(\\\\)*\\g<([a-zA-Z_][a-zA-Z\d_]*)>")
text = r"</\g<name_1>\g<name_2>\\\\\g<not_escaped>\\g<escaped>>>"
for m in p.finditer(text):
    print(
        '---',
        m.group(),
        m.group(2),
        sep="\n"
    )
This gives the following output.
---
\g<name_1>
name_1
---
\g<name_2>
name_2
---
\\\\\g<not_escaped>
not_escaped
You'd need to reproduce the whole group 0 text, using \<digit> back-references to reuse the captured groups:
p.sub(r'\1\\g<test_\2>', text)
Here \1 refers to the initial backslashes group, and \2 to the name to be prefixed by test_.
For this to work, you do need to move the * into the first capturing group, so that the group always takes part in the match and captures all of the leading backslashes:
p = re.compile(r"(?<!\\)((?:\\\\)*)\\g<([a-zA-Z_][a-zA-Z\d_]*)>")
I've used a non-capturing group ((?:...)) to still keep the backslashes grouped together.
Demo:
>>> text = r"</\g<name_1>\g<name_2>\\\\\g<not_escaped>\\g<escaped>>>"
>>> p = re.compile(r"(?<!\\)((?:\\\\)*)\\g<([a-zA-Z_][a-zA-Z\d_]*)>")
>>> print(p.sub(r'\1\\g<test_\2>', text))
</\g<test_name_1>\g<test_name_2>\\\\\g<test_not_escaped>\\g<escaped>>>
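The same substitution can also be sketched with a replacement function, which avoids having to escape the backslash inside a replacement template. This is an alternative formulation under the same corrected pattern; the function name add_prefix is made up for the example:

```python
import re

p = re.compile(r"(?<!\\)((?:\\\\)*)\\g<([a-zA-Z_][a-zA-Z\d_]*)>")
text = r"</\g<name_1>\g<name_2>\\\\\g<not_escaped>\\g<escaped>>>"

def add_prefix(match):
    """Rebuild the matched text with 'test_' in front of the group name."""
    backslashes, name = match.group(1), match.group(2)
    # A string returned by a replacement function is inserted literally.
    return backslashes + '\\g<test_' + name + '>'

renamed = p.sub(add_prefix, text)
print(renamed)
```

A function is also the natural place to add further logic, for example renaming only some of the names.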
The easiest way to accomplish this is to use a series of three simple calls to str.replace, one per matched name, rather than using regexes for the replacement:
import re
p = re.compile(r"(?<!\\)(\\\\)*\\g<([a-zA-Z_][a-zA-Z\d_]*)>")
text = r"</\g<name_1>\g<name_2>\\\\\g<not_escaped>\\g<escaped>>>"
for m in p.finditer(text):
    # group(2) is the captured name, e.g. name_1
    replacement = m.group(2)
    # str.replace changes every occurrence of the name in text
    text = text.replace(replacement, 'test_' + replacement)
