Groups in regular expressions - python

I'm reading an online book on Python which explains regular expressions, but I can't understand what groups in regular expressions are.
For example, what is the difference between:
regex = re.compile(r'Name (\w)*')
regex.findall('Name Mahmoud')
and:
regex = re.compile(r'Name \w*')
regex.findall('Name Mahmoud')
Why does the first call to findall() give me ['d'], while the second gives me ['Name Mahmoud']?

Regex groups are used to capture part of a match.
Name (\w)* captures a single character \w, and that capture is repeated many times by *. Only the last capture is kept in the result (the d of Mahmoud).
Name \w* does not use a group ...
Name (\w*) captures a series of characters \w*, which in your case yields Mahmoud.
For further information refer to https://docs.python.org/2/library/re.html#regular-expression-syntax
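A minimal sketch comparing the three patterns discussed above on the string from the question. The variable names are only illustrative.

```python
import re

text = 'Name Mahmoud'

# Repeated one-character group: only the last repetition is kept.
last_char = re.findall(r'Name (\w)*', text)

# No group: findall returns the whole match.
whole_match = re.findall(r'Name \w*', text)

# Group around the repetition: findall returns the group's text.
name_only = re.findall(r'Name (\w*)', text)

print(last_char)    # ['d']
print(whole_match)  # ['Name Mahmoud']
print(name_only)    # ['Mahmoud']
```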

What is a group in a regular expression?
A group is one matching pair of parentheses typically with stuff between them. Groups serve three primary purposes:
A group may have multiple alternatives separated by the "|" logical OR metacharacter.
A group allows applying a quantifier to repeat the contents of the group a specified number of times.
A capture group is a special type of group where the contents of the group are saved and available both inside the regex (using "\n" backreference syntax) and outside the regex (using "$n" syntax in many languages; in Python, match.group(n), or "\n" in a replacement string). Capture groups are numbered starting with 1 and are counted in order of the occurrence of the opening parentheses.
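The three purposes can be sketched in Python roughly as follows. The sample strings are made up for illustration; note that Python's re module writes references outside the pattern as \1, \2 in replacement strings.

```python
import re

# 1. Alternation inside a (non-capturing) group.
alt = re.fullmatch(r'gr(?:a|e)y', 'grey')

# 2. A quantifier applied to the whole group.
rep = re.fullmatch(r'(?:ab){3}', 'ababab')

# 3a. Backreference inside the regex: \1 repeats what group 1 captured.
doubled = re.search(r'(\w)\1', 'hello').group()

# 3b. Reference outside the regex, here in a replacement string.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world')

# Groups are numbered by the position of their opening parenthesis.
numbered = re.match(r'((a)(b))', 'ab').groups()

print(alt is not None, rep is not None)  # True True
print(doubled)   # ll
print(swapped)   # world hello
print(numbered)  # ('ab', 'a', 'b')
```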

Related

python regex, capturing a pattern with trimming repeated subpattern in string

Here is a list of input strings:
"collect_project_stage1_20220927_foot60cm_arm70cm_height170cm_......",
"collect_project_version_1_0927_foot60cm_height170cm_......",
"collect_project_ver1_20220927_arm70cm_height170cm_......",
These input strings are provided by many different users.
The leading "collect_" is fixed. It is followed by "${project_version}", which has no hard naming rule, so it varies a lot between users.
After that comes a repeating "${part}${length}cm_.......", but the number of repetitions is not fixed.
I'd like to capture the variable ${project_version}.
I tried the following re.match to capture it:
re.match(r'collect_(.*)_(?:(?:foot|arm|height)\d+cm_)+.*' , string)
However, the result is not as expected.
Can anyone give me a hint about what's wrong with my regular expression?
Assuming you were only planning to capture the part preceding the various cm suffixed components, the reason you're capturing so many of them instead of just checking and discarding them is that regexes are greedy by default.
You can narrow your capture group to only match what you really expect (e.g. just a name followed by a date), replacing (.*) with something like ((?:[a-z]+[0-9]*_)*\d{8}).
Alternatively, you can be lazy and enable non-greedy matching for the capture group, changing (.*) to (.*?) where the ? says to only take the minimal amount required to satisfy the regex. The latter is more brittle, but if you really can't impose any other restrictions on the expression for the capture group, it's what you've got.
Use a non-greedy quantifier. Otherwise, the capture group will match as much as it can, so it keeps going until the last match of (?:foot|arm|height)\d+cm_.
result = re.match(r'collect_(.*?)_(?:(?:foot|arm|height)\d+cm_)+' , string)
print(result.group(1)) # project_stage1_20220927
The regex "(.*)" will capture far too much.
re.match(r'collect_([a-z0-9]+_[a-z0-9]+_[a-z0-9]+)_(?:(?:foot|arm|height)\d+cm_)+' , string)
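To see the difference between the greedy and non-greedy capture, here is a small sketch that runs both variants over the three input strings from the question. The names PARTS, greedy, and lazy are only illustrative.

```python
import re

inputs = [
    "collect_project_stage1_20220927_foot60cm_arm70cm_height170cm_......",
    "collect_project_version_1_0927_foot60cm_height170cm_......",
    "collect_project_ver1_20220927_arm70cm_height170cm_......",
]

# One or more "<part><length>cm_" components.
PARTS = r'(?:(?:foot|arm|height)\d+cm_)+'

greedy = re.compile(r'collect_(.*)_' + PARTS)   # takes as much as possible
lazy = re.compile(r'collect_(.*?)_' + PARTS)    # takes as little as possible

greedy_results = [greedy.match(s).group(1) for s in inputs]
lazy_results = [lazy.match(s).group(1) for s in inputs]

for g, l in zip(greedy_results, lazy_results):
    print(g, '|', l)
# project_stage1_20220927_foot60cm_arm70cm | project_stage1_20220927
# project_version_1_0927_foot60cm | project_version_1_0927
# project_ver1_20220927_arm70cm | project_ver1_20220927
```

The greedy group swallows every cm component except the last one, because the rest of the pattern only needs a single component to succeed.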

Single regular expression for extracting different values

I have some inputs like
ID= 5657A
ID=PID=FSGDVD
IDS=5645SD
I have created a regex, i.e., IDS=[A-Za-z0-9]+|ID=[A-Za-z0-9]+|PID=[A-Za-z0-9]+. However, in the case of ID=PID=FSGDVD, I want PID=FSGDVD as output.
My outputs must look like
ID= 5657A
PID=FSGDVD
IDS=5645SD
How can I solve this problem?
Add an end-of-line anchor and use grouping and quantifiers to simplify the regex:
(?:IDS?|PID)=[A-Za-z0-9]+$
IDS? will match both ID and IDS
(?:IDS?|PID) will match ID or IDS or PID
(?:pattern) is a non-capturing group. Some functions, such as re.split and re.findall, change their behavior when capture groups are present, so a non-capturing group is ideal whenever backreferences aren't needed
$ is the end-of-line anchor, so the match is taken toward the end of the line instead of the start
Demo: https://regex101.com/r/e9uvmC/1
In case your input can be something like ID=PID=FSGDVD xyz then you could use lookarounds:
(?:IDS?|PID)=[A-Za-z0-9]+\b(?!=)
Here \b ensures that all word characters after the = sign are matched, and (?!=) is a negative lookahead assertion that rejects a match if = follows
Demo: https://regex101.com/r/e9uvmC/2
Another one could be
[A-Z]+=\s*[^=]+$
See a demo on regex101.com.
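A rough sketch of the anchored and lookahead patterns applied with re.search. The helper first_match and the spaced variant are assumptions added for illustration: the patterns as written do not allow whitespace after =, so the input ID= 5657A needs something like an optional \s*.

```python
import re

anchored = re.compile(r'(?:IDS?|PID)=[A-Za-z0-9]+$')
lookahead = re.compile(r'(?:IDS?|PID)=[A-Za-z0-9]+\b(?!=)')
# Assumption (not from the answers above): optional whitespace after "=".
spaced = re.compile(r'(?:IDS?|PID)=\s*[A-Za-z0-9]+$')


def first_match(pattern, line):
    """Return the first match of pattern in line, or None (hypothetical helper)."""
    m = pattern.search(line)
    return m.group() if m else None


print(first_match(anchored, 'ID=PID=FSGDVD'))       # PID=FSGDVD
print(first_match(anchored, 'IDS=5645SD'))          # IDS=5645SD
print(first_match(anchored, 'ID= 5657A'))           # None
print(first_match(spaced, 'ID= 5657A'))             # ID= 5657A
print(first_match(lookahead, 'ID=PID=FSGDVD xyz'))  # PID=FSGDVD
```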

Regex, capture using word boundaries without stopping at "dot" and/or other characters

Given for example a string like this:
random word, random characters##?, some dots. username bob.1234 other stuff
I'm currently using this regex to capture the username (bob.1234):
\busername (.+?)(,| |$)
But my code needs a regex with only one capture group, because Python's re.findall returns something different when there are multiple capture groups. Something like this would almost work, except that it captures the username "bob" instead of "bob.1234":
\busername (.+?)\b
Does anybody know a way to use the word boundary while ignoring the dot, without using more than one capture group?
NOTES:
Sometimes there is a comma after the username
Sometimes there is a space after the username
Sometimes the string ends with the username
The \busername (.+?)(,| |$) pattern contains 2 capturing groups, and re.findall will return a list of tuples once a match is found. See findall reference:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
So, there are three approaches here:
Use a (?:...) non-capturing group rather than the capturing one: re.findall(r'\busername (.+?)(?:,| |$)', s). It will consume a , or space, but since only captured part will be returned and no overlapping matches are expected, it is OK.
Use a positive lookahead instead: re.findall(r'\busername (.+?)(?=,| |$)', s). The space and comma will not be consumed, that is the only difference from the first approach.
You may turn the (.+?)(,| |$) into a single capturing group around a negated character class, ([^ ,]+), which matches one or more chars other than a space or comma. It will match until the end of the string if there is no , or space after the username.
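The three approaches can be compared side by side on the sample string from the question. This is a sketch; the variable names are illustrative, and the negated character class is wrapped in a capturing group so that findall returns only the username.

```python
import re

s = 'random word, random characters##?, some dots. username bob.1234 other stuff'

# Original pattern: two capturing groups, so findall returns tuples.
two_groups = re.findall(r'\busername (.+?)(,| |$)', s)

# 1. Non-capturing group for the delimiter.
non_capturing = re.findall(r'\busername (.+?)(?:,| |$)', s)

# 2. Positive lookahead: the delimiter is checked but not consumed.
with_lookahead = re.findall(r'\busername (.+?)(?=,| |$)', s)

# 3. Negated character class inside a single capturing group.
negated_class = re.findall(r'\busername ([^ ,]+)', s)

print(two_groups)      # [('bob.1234', ' ')]
print(non_capturing)   # ['bob.1234']
print(with_lookahead)  # ['bob.1234']
print(negated_class)   # ['bob.1234']
```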

Python Regular Expressions to match option of strings

I am new to Python and Python RE.
I am trying to make a parser for ARM assembly code. I want to make regular expression for matching conditional branch instructions which could be:
beq, bne, blt, bgt
I tried a regular expression of the form
'b[eq|ne|lt|gt]'
But this does not match. Can someone please help me with this?
You should be using parentheses for options, not square brackets:
b(eq|ne|lt|gt)
And you'd usually want a non-capture group:
b(?:eq|ne|lt|gt)
And you can also make it a little more optimised too:
b(?:eq|ne|[lg]t)
Square brackets will be understood as being any of the characters or range of characters. So [eq|ne|lt|gt] effectively means either one of e, q, |, n, e (again, so it becomes redundant), etc.
Try the following pattern: b(?:eq|ne|lt|gt)
[] Character set: matches exactly one character from inside the brackets. You can specify a range of characters by using the metacharacter -, e.g. [a-e], or negate the set by using the metacharacter ^, e.g. [^aeiou]
() Capturing parentheses: used for grouping and for creating a numbered capturing group. You can disable capturing by placing ?: right after the opening parenthesis, e.g. (?:...)
As mentioned above, you should use parentheses when the alternatives are more than one character long; that is why your pattern using brackets did not match your string.
Please note that non-capturing parentheses do not save the matched data; remove the ?: if you want to capture the group.
Python's re module supports Perl-style regular expression syntax, so you can use named capture groups and numbered backreferences. The main advantage of named groups is that they keep your expression easy to read, edit, and maintain.
E.g.:
(?P<opcode>b(?:eq|ne|lt|gt)) - Will capture the match of your pattern b(?:eq|ne|lt|gt) into the backreference name opcode
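As a quick sketch of the difference, the bracket version only matches b followed by one character from the set, while the group version matches the full mnemonics. The line 'bne loop' is a made-up assembly fragment used for illustration.

```python
import re

# Character class: "b" plus ONE of the characters e, q, |, n, l, t, g.
char_class = re.compile(r'b[eq|ne|lt|gt]')

# Named group wrapping a non-capturing alternation.
branch = re.compile(r'(?P<opcode>b(?:eq|ne|lt|gt))')

ops = ['beq', 'bne', 'blt', 'bgt']
matched = [branch.fullmatch(op).group('opcode') for op in ops]

print(char_class.fullmatch('beq'))             # None
print(char_class.fullmatch('b|') is not None)  # True
print(matched)                                 # ['beq', 'bne', 'blt', 'bgt']

# Hypothetical instruction line: read the opcode by name.
m = branch.match('bne loop')
print(m.group('opcode'))                       # bne
```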

String pattern Regular Expression python

I am a novice in regular expressions. I have written the following regex to find abababab9 in the given string. The call returns two results, but I was expecting one.
testing = re.findall(r'((ab)*[0-9])', temp)
Output: [('abababab9', 'ab')]
As I understand it, it should have returned only abababab9. Why has it also returned ab?
You didn't read the findall documentation:
Return a list of all non-overlapping matches in the string.
If one or more capturing groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
Empty matches are included in the result.
And if you take a look at the re module documentation, capturing groups are subpatterns enclosed in parentheses, like (ab).
If you want to only get the complete match you can use one of the following solutions:
re.findall(r'(?:ab)*[0-9]', temp) # use non-capturing groups
[groups[0] for groups in re.findall(r'((ab)*[0-9])', temp)] # take the first group of each tuple
[match.group() for match in re.finditer(r'(ab)*[0-9]', temp)] # use finditer
With (...) you have defined two capturing groups: the first is ((ab)*[0-9]) and the second is (ab). That is why you get these two results. To get only the first group, make the second one non-capturing with ?:, so its result is not returned.
((?:ab)*[0-9])
This one returns only abababab9.
Edit 1:
Here is an explanation of the grouping concept of regular expressions: groups and capturing
Make the second group (ab) non-capturing by adding ?: inside it:
testing = re.findall(r'((?:ab)*[0-9])', temp)
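Here is a short sketch of how the result of findall changes with the number of capturing groups. The question does not show the value of temp, so the string below is an assumption chosen to contain abababab9.

```python
import re

# Assumed input; the original value of temp is not shown in the question.
temp = 'xx abababab9 yy'

two_groups = re.findall(r'((ab)*[0-9])', temp)     # tuples: (outer, inner)
one_group = re.findall(r'(ab)*[0-9]', temp)        # only the inner group
no_capture = re.findall(r'(?:ab)*[0-9]', temp)     # whole match
outer_only = re.findall(r'((?:ab)*[0-9])', temp)   # outer group only
via_finditer = [m.group() for m in re.finditer(r'(ab)*[0-9]', temp)]

print(two_groups)    # [('abababab9', 'ab')]
print(one_group)     # ['ab']
print(no_capture)    # ['abababab9']
print(outer_only)    # ['abababab9']
print(via_finditer)  # ['abababab9']
```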
