re.findall('(ab|cd)', string) vs re.findall('(ab|cd)+', string) - python

I've run into a puzzling problem with a Python regular expression.
Could you explain the difference between re.findall('(ab|cd)', string) and re.findall('(ab|cd)+', string)?
import re
string = 'abcdla'
result = re.findall('(ab|cd)', string)
result2 = re.findall('(ab|cd)+', string)
print(result)
print(result2)
Actual Output is:
['ab', 'cd']
['cd']
I'm confused as to why the second result doesn't contain 'ab' as well.

+ is a quantifier that matches one or more repetitions. In the regex (ab|cd)+, you are repeating the capture group (ab|cd) with +. A repeated capture group keeps only its last iteration.
You can reason about this behaviour as follows:
Say your string is abcdla and the regex is (ab|cd)+. The regex engine first matches ab at positions 0 and 1 and exits the capture group. It then sees the + quantifier, tries the group again, and matches cd at positions 2 and 3, overwriting the ab stored in the group.
If you want to capture all iterations, capture the repeated group instead with ((ab|cd)+): group 1 then holds abcd and group 2 holds cd. Since we don't care about the inner group's matches, you can make it non-capturing with ((?:ab|cd)+), which captures only abcd.
https://www.regular-expressions.info/captureall.html
From that page:
Let’s say you want to match a tag like !abc! or !123!. Only these two
are possible, and you want to capture the abc or 123 to figure out
which tag you got. That’s easy enough: !(abc|123)! will do the trick.
Now let’s say that the tag can contain multiple sequences of abc and
123, like !abc123! or !123abcabc!. The quick and easy solution is
!(abc|123)+!. This regular expression will indeed match these tags.
However, it no longer meets our requirement to capture the tag’s label
into the capturing group. When this regex matches !abc123!, the
capturing group stores only 123. When it matches !123abcabc!, it only
stores abc.
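To see the three variants side by side, they can be run against the string from the question. This is only an illustrative sketch; the variable names are invented for the example.

```python
import re

string = 'abcdla'

# Repeated capture group: findall reports only the last iteration of group 1.
last_iteration = re.findall(r'(ab|cd)+', string)

# Capturing the repetition as well: one tuple per match, (outer group, inner group).
outer_and_inner = re.findall(r'((ab|cd)+)', string)

# Inner group made non-capturing: only the outer group is reported.
outer_only = re.findall(r'((?:ab|cd)+)', string)

print(last_iteration)   # ['cd']
print(outer_and_inner)  # [('abcd', 'cd')]
print(outer_only)       # ['abcd']
```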

I don't know if this will make things clearer, but let's try to imagine what happens under the hood in a simple way.
We are going to simulate what happens using match:
import re
# group(0) returns the whole matched string; the captured groups are returned by groups(),
# or you can access them one by one with group(1), group(2), ... In your case there is only
# one group, and one group holds only one substring, so when you do this:
string = 'abcdla'
print(re.match('(ab|cd)', string).group(0)) # only 'ab' is matched and the group captures 'ab'
print(re.match('(ab|cd)+', string).group(0)) # this matches 'abcd', but the group captures only 'cd', the last iteration
findall matches and consumes the string at the same time. Let's imagine what happens with the regex '(ab|cd)':
'abcdabla' ---> 1: match: 'ab' | capture: 'ab' | left to process: 'cdabla'
'cdabla'   ---> 2: match: 'cd' | capture: 'cd' | left to process: 'abla'
'abla'     ---> 3: match: 'ab' | capture: 'ab' | left to process: 'la'
'la'       ---> 4: no match | nothing captured | left to process: ''
---> final result: ['ab', 'cd', 'ab']
Now the same thing with '(ab|cd)+':
'abcdabla' ---> 1: match: 'abcdab' | capture: 'ab' (last iteration) | left to process: 'la'
'la'       ---> 2: no match | nothing captured | left to process: ''
---> final result: ['ab']
I hope this clears things up a little.
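The walkthrough above can be reproduced with re.finditer, which exposes both the whole match and the captured group at each step. The helper name trace is hypothetical and exists only for this sketch.

```python
import re

def trace(pattern, text):
    """Return (whole match, group 1, text left to process) for each match."""
    return [(m.group(0), m.group(1), text[m.end():])
            for m in re.finditer(pattern, text)]

for step in trace(r'(ab|cd)', 'abcdabla'):
    print(step)
# ('ab', 'ab', 'cdabla')
# ('cd', 'cd', 'abla')
# ('ab', 'ab', 'la')

for step in trace(r'(ab|cd)+', 'abcdabla'):
    print(step)
# ('abcdab', 'ab', 'la')
```

findall returns the middle element of each tuple, which is why the two patterns give ['ab', 'cd', 'ab'] and ['ab'].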

For me, the confusing part was this sentence from the docs:
If one or more groups are present in the pattern, return a list of groups;
So findall returns not the full match but only what the capture group matched. If you make the group non-capturing, re.findall('(?:ab|cd)+', string), it returns ['abcd'], as I initially expected.
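If the capturing group has to stay in the pattern, one alternative (an assumption on my part, not something proposed in the answers above) is to iterate with re.finditer and read group(0), which is always the full match regardless of how many groups the pattern has.

```python
import re

string = 'abcdla'

# group(0) is the whole match, even though the pattern has a capturing group.
full_matches = [m.group(0) for m in re.finditer(r'(ab|cd)+', string)]

print(full_matches)  # ['abcd']
```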

Related

python regex return non-capturing group

I want to generate a username from an email with :
firstname's first letter
lastname's first 7 letters
eg :
getUsername("my-firstname.my-lastname#email.com")
mmylastn
Here is getUsername's code :
def getUsername(email):
    return re.match(r"(.){1}[a-z]+.([a-z]{7})", email.replace('-', '')).group()
email.replace('-','') gets rid of the - symbol
the regex captures the 2 groups I described above
If I do .group(1,2) I can see the captured groups are m and mylastn, so it's all good.
But .group() doesn't just return the capturing groups; it also returns everything between them: myfirstname.mylastn
Can someone explain this behavior to me?
First of all, a . in a pattern is a metacharacter that matches any char except line break chars. You need to escape the . in the regex pattern to match a literal dot.
Also, the {1} limiting quantifier is always redundant; you may safely remove it from any regex you have.
Next, if you need to get the string mmylastn as a result, you cannot use match.group(), because .group() fetches the overall match value, not the concatenated capturing group values.
So, in your case,
Check that there is a match first, since calling .groups() on None raises an AttributeError
Then join the match.groups()
You can use
import re
def getUsername(email):
    m = re.match(r"(.)[a-z]+\.([a-z]{7})", email.replace('-', ''))
    if m:
        return "".join(m.groups())
    return email
print(getUsername("my-firstname.my-lastname#email.com"))
See the Python demo.
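The difference between the overall match and the captured groups can be checked directly. This is a small sketch that applies the corrected pattern to the already de-hyphenated address; the variable names are made up for the illustration.

```python
import re

m = re.match(r"(.)[a-z]+\.([a-z]{7})", "myfirstname.mylastname#email.com")

whole = m.group()          # the overall match, including the text between the groups
parts = m.groups()         # only the captured groups
username = "".join(parts)  # the groups concatenated

print(whole)     # myfirstname.mylastn
print(parts)     # ('m', 'mylastn')
print(username)  # mmylastn
```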

Python Regex - find all occurences of a group after a prefix

I have strings like these:
s1 = 'H: 1234.34.34'
s2 = 'H: 1234.34.34 12.12 123.5'
I would like to get the elements separated by space after the H inside groups, so I tried:
myRegex = r'\bH\s*[\s|\:]+(?:\s?(\b\d+[\.?\d+]*\b))*'
It's fine with string s1
print(re.search(myRegex , s1).groups())
It's giving me: ('1234.34.34',) => It's fine
But for s2, I have:
print(re.search(myRegex , s2).groups())
It returns only the last group, ('123.5',), but I'm expecting ('1234.34.34', '12.12', '123.5').
Do you have an idea how to get my expected value?
In addition, I'm not limited to 2 groups; I may have many more.
Thanks a lot
Fred
In this part of your pattern, (?:\s?(\b\d+[\.?\d+]*\b))*, you have a capturing group inside a repeated non-capturing group. The capturing group therefore ends up with the value from the last iteration of the outer non-capturing group.
The last iteration matches 123.5, and that becomes the group 1 value.
One option is to match the whole pattern and use a single capturing group for the part after H:.
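The last-iteration behaviour is easy to reproduce with a simplified pattern. The regex below is a reduced stand-in written for this illustration, not the pattern from the question.

```python
import re

s2 = 'H: 1234.34.34 12.12 123.5'

# The capturing group sits inside a repeated non-capturing group.
m = re.search(r'H:(?: (\d+(?:\.\d+)*))+', s2)

print(m.group(0))  # H: 1234.34.34 12.12 123.5
print(m.groups())  # ('123.5',)
```

The overall match spans all three numbers, but group 1 holds only the value from the final repetition.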
\bH: (\d+(?:\.\d+)+(?: \d+(?:\.\d+)+)*)\b
If you have the group, you could use split:
import re
s2 = 'H: 1234.34.34 12.12 123.5'
myRegex = r'\bH: (\d+(?:\.\d+)+(?: \d+(?:\.\d+)+)*)\b'
res = re.search(myRegex , s2)
if res:
    print(res.group(1).split())
Output
['1234.34.34', '12.12', '123.5']
Using the PyPI regex module, you could use \G to get iterative matches for the numbers and \K to forget what has been matched so far, which would be the space before the number.
(?:\bH:|\G(?!\A)) \K\d+(?:\.\d+)+
Assuming your string always starts with H:, you can do as follows:
s2 = 'H: 1234.34.34 12.12 123.5'
output = s2.split("H: ")[-1].split()
Output will be
['1234.34.34', '12.12', '123.5']
The first split gives you all the characters after the "H: ".
The second split breaks that remainder on whitespace.
Based on your examples, you don't need a regex, split() will suffice:
s1 = 'H: 1234.34.34'
s2 = 'H: 1234.34.34 12.12 123.5'
match1 = s1.split()[1:]
match2 = s2.split()[1:]
print(match1)
print(match2)
['1234.34.34']
['1234.34.34', '12.12', '123.5']

How to retrieve second value from search object?

I am unable to obtain the second match from a match object.
EDIT: found a fix
ptrn_buy_in = re.compile("(?<=[$€£])([0-9]+[.,][0-9]+)")
buy_in = re.findall(ptrn_buy_in, lines[0])
if buy_in[1] and buy_in[2]:
    parsed_hand["metadata"]["buy_in"] = float(buy_in[1])
    parsed_hand["metadata"]["rake"] = float(buy_in[2])
My string is: $14.69+$0.31
I have tried to see if the match object actually holds more values within the same group i.e. .group(0)[0] and [1]. This actually gave me the second digit of the number so not at all what I was expecting.
ptrn_buy_in = re.compile("(?<=[$€£])([0-9]+.[0-9]+)")
buy_in = re.search(ptrn_buy_in, lines[0])
if buy_in.group(0) and buy_in.group(1):
    parsed_hand["metadata"]["buy_in"] = float(buy_in.group(0))
    parsed_hand["metadata"]["rake"] = float(buy_in.group(1))
I am expecting to get 14.69 in .group(0) and 0.31 from .group(1) however I am getting 14.69 twice. Does anyone have any ideas?
Kind regards
There is only 1 capturing group, ([0-9]+.[0-9]+), which matches 1+ digits, then any single character, then 1+ digits again.
re.search returns a match object whose .group(0) is the entire match and whose .group(1) is the first capturing group. Here the group spans the entire match, which is why you get 14.69 twice.
You have to escape the dot to match it literally. And since you use a positive lookbehind, you can omit the group and take the whole match:
(?<=[$€£])[0-9]+\.[0-9]+
ptrn_buy_in = re.compile(r"(?<=[$€£])[0-9]+\.[0-9]+")
print(re.findall(ptrn_buy_in, r"$14.69+$0.31"))
Result
['14.69', '0.31']
Or use a match with the capturing group:
[$€£]([0-9]+\.[0-9]+)
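As a rough sketch of how the capturing-group variant could feed the two values from the question, the snippet below uses re.findall and converts each capture to float. The variable line is a hypothetical stand-in for lines[0], and the unpacking assumes exactly two amounts are present.

```python
import re

line = "$14.69+$0.31"  # hypothetical stand-in for lines[0]

# With one capturing group, findall returns the captured amounts only.
amounts = [float(a) for a in re.findall(r"[$€£]([0-9]+\.[0-9]+)", line)]

# Assumes exactly two amounts were found; a different count raises ValueError.
buy_in, rake = amounts

print(buy_in, rake)  # 14.69 0.31
```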

Python Regex Find match group of range of non digits after hyphen and if range is not present ignore rest of pattern

I'm new to more advanced regex concepts and am starting to look into lookbehinds and lookaheads, but I'm getting confused and need some guidance. I have a scenario in which I may have several different kinds of release zips named something like:
v1.1.2-beta.2.zip
v1.1.2.zip
I want to write a one-line regex that finds match groups in both types. For example, if the file is the first zip, I would want three match groups that look like:
v1.1.2-beta.2.zip
Group 1: v1.1.2
Group 2: beta
Group 3: 2
or, for the second zip, one match group:
v1.1.2.zip
Group 1: v1.1.2
This is where things start getting confusing to me: I would assume the regex needs to check whether the hyphen exists and, if it does not, look for only the one match group; otherwise it should find all three.
(v[0-9.]{0,}).([A-Za-z]{0,}).([0-9]).zip
This was the initial regex I wrote, which successfully matches the first type but does not have the conditional. I was thinking about matching a range of non-digits after the hyphen, but I can't quite get it to work, and I don't know how to make it ignore the rest of the pattern and accept just the first group if it doesn't find the hyphen.
([\D]{0,}(?=[-]) # Does not work
Can someone point me in the right direction?
You can use re.findall:
import re
s = ['v1.1.2-beta.2.zip', 'v1.1.2.zip']
final_results = [re.findall(r'[a-zA-Z]{1}[\d\.]+|(?<=\-)[a-zA-Z]+|\d+(?=\.zip)', i) for i in s]
groupings = ["{}\n{}".format(a, '\n'.join(f'Group {i}: {c}' for i, c in enumerate(b, 1))) for a, b in zip(s, final_results)]
for i in groupings:
    print(i)
    print('-'*10)
Output:
v1.1.2-beta.2.zip
Group 1: v1.1.2
Group 2: beta
Group 3: 2
----------
v1.1.2.zip
Group 1: v1.1.2.
----------
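Note that the result from re.findall is as follows (for the second file name, the first alternative also consumes the dot before zip, hence the trailing dot):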
[['v1.1.2', 'beta', '2'], ['v1.1.2.']]
Here is how I would approach this using re.search. Note that we don't need lookarounds here; just a fairly complex pattern will do the job.
import re
regex = r"(v\d+(?:\.\d+)*)(?:-(\w+)\.(\d+))?\.zip"
str1 = "v1.1.2-beta.2.zip"
str2 = "v1.1.2.zip"
match = re.search(regex, str1)
print(match.group(1))
print(match.group(2))
print(match.group(3))
print("\n")
match = re.search(regex, str2)
print(match.group(1))
Output:
v1.1.2
beta
2
v1.1.2
Demo
If you don't have a ton of experience with regex, providing an explanation of each step probably isn't going to bring you up to speed. I will comment, though, on the use of ?: which appears in some of the parentheses. In that context, ?: tells the regex engine not to capture what is inside. We do this because you only want to capture (up to) three specific things.
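One detail worth checking when part of a pattern is optional: groups inside an optional part that did not participate in the match come back as None. The sketch below reuses the regex from this answer; the loop and the filtering step are illustrative additions.

```python
import re

regex = r"(v\d+(?:\.\d+)*)(?:-(\w+)\.(\d+))?\.zip"

for name in ("v1.1.2-beta.2.zip", "v1.1.2.zip"):
    match = re.search(regex, name)
    # Drop the groups that did not take part in the match.
    present = [g for g in match.groups() if g is not None]
    print(match.groups(), present)
# ('v1.1.2', 'beta', '2') ['v1.1.2', 'beta', '2']
# ('v1.1.2', None, None) ['v1.1.2']
```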
We can use the following regex:
(v\d+(?:\.\d+)*)(?:[-]([A-Za-z]+))?((?:\.\d+)*)\.zip
This produces three groups: the first is the version; the second is optional and holds the alphabetical characters that follow a dash -; the third is an optional sequence of dots followed by numbers. The pattern ends with .zip.
If we ignore the \.zip suffix (I assume this part is trivial), the three parts are:
(v\d+(?:\.\d+)*): a capture group that starts with a v followed by \d+ (one or more digits). Then we have a non-capture group (a group starting with (?:) that matches \.\d+, a dot followed by one or more digits. This subgroup is repeated zero or more times.
(?:[-]([A-Za-z]+))?: a non-capture group that starts with a hyphen [-] and contains a capture group for one or more [A-Za-z] characters. The whole group is optional (the ? at the end).
((?:\.\d+)*): a capture group that again wraps the \.\d+ non-capture subgroup, so we capture a dot followed by a sequence of digits, repeated zero or more times.
For example:
rgx = re.compile(r'(v\d+(?:\.\d+)*)(?:[-]([A-Za-z]+))?((?:\.\d+)*)\.zip')
We then obtain:
>>> rgx.findall('v1.1.2-beta.2.zip')
[('v1.1.2', 'beta', '.2')]
>>> rgx.findall('v1.1.2.zip')
[('v1.1.2', '', '')]

(?:) regular expression Python

I came across a regular expression today but it was very poorly and scarcely explained. What is the purpose of (?:) regex in python and where & when is it used?
I have tried this but it doesn't seem to be working. Why is that?
word = "Hello. ) kahn. ho.w are 19tee,n doing 2day; (x+y)"
expressoin = re.findall(r'(?:a-z\+a-z)', word);
From the re module documentation:
(?:...)
A non-capturing version of regular parentheses. Matches whatever
regular expression is inside the parentheses, but the substring
matched by the group cannot be retrieved after performing a match or
referenced later in the pattern.
Basically, it's the same thing as (...) but without storing a captured string in a group.
Demo:
>>> import re
>>> re.search('(?:foo)(bar)', 'foobar').groups()
('bar',)
Only one group is returned, containing bar. The (?:foo) group was not captured.
Use this whenever you need to group metacharacters that would otherwise apply to a larger section of the expression, such as | alternate groups:
monty's (?:spam|ham|eggs)
You don't need to capture the group but do need to limit the scope of the | meta characters.
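To illustrate how the group limits the scope of |, here is a minimal sketch that compares the grouped pattern with an ungrouped one. The sample strings are invented for the example.

```python
import re

grouped = r"monty's (?:spam|ham|eggs)"
ungrouped = r"monty's spam|ham|eggs"  # alternatives: "monty's spam", "ham", "eggs"

# With the group, the prefix applies to every alternative.
grouped_ok = re.fullmatch(grouped, "monty's ham") is not None

# Without it, "monty's ham" is not one of the three alternatives.
ungrouped_ok = re.fullmatch(ungrouped, "monty's ham") is not None

# No capturing group, so findall returns the full matches.
found = re.findall(grouped, "monty's spam, monty's eggs")

print(grouped_ok, ungrouped_ok)  # True False
print(found)                     # ["monty's spam", "monty's eggs"]
```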
As for your sample attempt: a-z outside square brackets matches the literal text a-z, not a character range. And with re.findall() you often do want to capture output. You are most likely looking for:
re.findall(r'([a-z]\+[a-z])', word)
where re.findall() returns a list of tuples of all captured groups; if there is only one captured group, it's a list of strings containing just the one group per match.
Demo:
>>> word = "Hello. ) kahn. ho.w are 19tee,n doing 2day; (x+y)"
>>> re.findall(r'([a-z]\+[a-z])', word)
['x+y']
?: is used to avoid capturing a group.
For example, with the regex (\d+) the match will be stored in group \1.
But if you use (?:\d+), there will be nothing in group \1.
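This can be verified by checking which groups each pattern exposes. The snippet is a small sketch with made-up variable names.

```python
import re

capturing = re.search(r'(\d+)', 'abc 123')
non_capturing = re.search(r'(?:\d+)', 'abc 123')

print(capturing.groups())      # ('123',)
print(non_capturing.groups())  # ()
print(non_capturing.group(0))  # 123

# Asking for group 1 when the pattern defines no groups raises IndexError.
try:
    non_capturing.group(1)
    has_group_1 = True
except IndexError:
    has_group_1 = False

print(has_group_1)  # False
```

Both patterns match the same text; only the capturing version stores it in group 1.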
It is used for non-capturing group:
>>> matched = re.search('(?:a)(b)', 'ab') # using non-capturing group
>>> matched.group(1)
'b'
>>> matched = re.search('(a)(b)', 'ab') # using capturing group
>>> matched.group(1)
'a'
>>> matched.group(2)
'b'
