String pattern Regular Expression python - python

I am a novice with regular expressions. I have written the following regex to find abababab9 in the given string. The call returns two results; however, I was expecting one.
testing = re.findall(r'((ab)*[0-9])', temp)
**Output**: [('abababab9', 'ab')]
As I understand it, it should have returned only abababab9. Why has it also returned ab on its own?

You didn't read the findall documentation:
Return a list of all non-overlapping matches in the string.
If one or more capturing groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has
more than one group.
Empty matches are included in the result.
And if you take a look at the re module documentation, capturing groups are subpatterns enclosed in parentheses, like (ab).
If you want to only get the complete match you can use one of the following solutions:
re.findall(r'(?:ab)*[0-9]', temp) # use non-capturing groups
[groups[0] for groups in re.findall(r'((ab)*[0-9])', temp)] # keep the outer group and take the first group of each tuple
[match.group() for match in re.finditer(r'(ab)*[0-9]', temp)] # use finditer
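As a minimal sketch of the three options above, the following compares what each call returns. The value of `temp` is an assumption, since the question does not show it.

```python
import re

# Hypothetical input: the question does not show the actual value of `temp`.
temp = "abababab9"

# Two capturing groups: findall returns one tuple of groups per match.
with_groups = re.findall(r'((ab)*[0-9])', temp)

# Non-capturing group: findall returns the full match as a plain string.
non_capturing = re.findall(r'(?:ab)*[0-9]', temp)

# Keep the outer group and pick element 0 of each tuple.
first_group = [groups[0] for groups in re.findall(r'((ab)*[0-9])', temp)]

# finditer yields match objects; group() with no argument is the whole match.
whole_matches = [m.group() for m in re.finditer(r'(ab)*[0-9]', temp)]

print(with_groups)    # [('abababab9', 'ab')]
print(non_capturing)  # ['abababab9']
print(first_group)    # ['abababab9']
print(whole_matches)  # ['abababab9']
```

Note that indexing with `[0]` only makes sense when the pattern has at least two groups; with a single group, findall returns strings and `[0]` would pick the first character.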

Each pair of parentheses (...) defines a capturing group, so your pattern has two: the first is ((ab)*[0-9]) and the second is (ab). That is why you get two results. To get only the first group, make the second one non-capturing by starting it with ?:. Its content is then no longer returned.
((?:ab)*[0-9])
This one only matches abababab9.
Edit 1:
Here is an explanation of the grouping concept of regular expressions: groups and capturing

Stop the second group (ab) from capturing by putting ?: inside it:
testing = re.findall(r'((?:ab)*[0-9])', temp)

Related

Single regular expression for extracting different values

I have some inputs like
ID= 5657A
ID=PID=FSGDVD
IDS=5645SD
I have created a regex, i.e. IDS=[A-Za-z0-9]+|ID=[A-Za-z0-9]+|PID=[A-Za-z0-9]+. But in the case of ID=PID=FSGDVD, I want PID=FSGDVD as output.
My outputs must look like
ID= 5657A
PID=FSGDVD
IDS=5645SD
How should I approach this problem?
Add end of line anchor and use grouping and quantifiers to simplify the regex:
(?:IDS?|PID)=[A-Za-z0-9]+$
IDS? will match both ID and IDS
(?:IDS?|PID) will match ID or IDS or PID
(?:pattern) is a non-capturing group. Some functions, like re.split and re.findall, change their behavior based on capture groups, so a non-capturing group is ideal whenever backreferences aren't needed
$ is end of line anchor, thus you'll get the match towards end of line instead of start of line
Demo: https://regex101.com/r/e9uvmC/1
In case your input can be something like ID=PID=FSGDVD xyz then you could use lookarounds:
(?:IDS?|PID)=[A-Za-z0-9]+\b(?!=)
Here \b ensures that all word characters after the = sign are matched, and (?!=) is a negative lookahead assertion that rejects the match if an = follows
Demo: https://regex101.com/r/e9uvmC/2
Another one could be
[A-Z]+=\s*[^=]+$
See a demo on regex101.com.
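A short sketch of how the three suggested patterns behave with re.search. The variable names are made up for illustration, and the first input is written without the space after = (an assumption), because [A-Za-z0-9]+ does not match a space; only the last pattern, which allows \s*, accepts "ID= 5657A".

```python
import re

# Sample inputs modelled on the question; the first one is written here
# without the space after "=" (an assumption made for this sketch).
lines = ["ID=5657A", "ID=PID=FSGDVD", "IDS=5645SD"]

# End-of-line anchor: the match must run to the end of the line.
anchored = re.compile(r'(?:IDS?|PID)=[A-Za-z0-9]+$')
anchored_results = [anchored.search(line).group() for line in lines]

# Lookahead variant for lines with trailing text after the value.
lookahead = re.compile(r'(?:IDS?|PID)=[A-Za-z0-9]+\b(?!=)')
lookahead_result = lookahead.search("ID=PID=FSGDVD xyz").group()

# Looser pattern from the last answer: allows whitespace after "=".
loose = re.compile(r'[A-Z]+=\s*[^=]+$')
loose_results = [loose.search(line).group()
                 for line in ["ID= 5657A", "ID=PID=FSGDVD"]]

print(anchored_results)  # ['ID=5657A', 'PID=FSGDVD', 'IDS=5645SD']
print(lookahead_result)  # PID=FSGDVD
print(loose_results)     # ['ID= 5657A', 'PID=FSGDVD']
```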

Regular Expression don't include Zero-Width assertions in match groups

I have updated the following re to not match when the string is B/C, B/O, S/C, or S/O.
old: (.*)/(.*)
new: (.*)(?<!^(B|S)(?=/(C|O)$))/(.*)
This regex is being used downstream with a list of other regex patterns and is expected to separate the data into two groups. Is there a way for my regex pattern (or a better one) to not count the zero-width assertions?
I've tried pushing the validation till the end with a single lookbehind assertion but that only has access to the group after the slash.
I've also tried enclosing the assertions in (?:...), but the inner parentheses are still counted as capturing groups.
Thanks to #user2357112
(.*)(?<!^(?:B|S)(?=/(?:C|O)$))/(.*)
I was using (?:...) incorrectly on my first attempts
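One way to check how many groups downstream code will see is the `groups` attribute of a compiled pattern. This is a sketch only; the variable names and the test strings other than B/C and S/O are invented for illustration.

```python
import re

old_pattern = re.compile(r'(.*)/(.*)')
# First attempt: (B|S) and (C|O) inside the assertions still capture.
first_attempt = re.compile(r'(.*)(?<!^(B|S)(?=/(C|O)$))/(.*)')
# Fixed: the inner groups are non-capturing, so only two groups remain.
fixed = re.compile(r'(.*)(?<!^(?:B|S)(?=/(?:C|O)$))/(.*)')

# Number of capturing groups in each compiled pattern.
group_counts = (old_pattern.groups, first_attempt.groups, fixed.groups)
print(group_counts)  # (2, 4, 2)

# Ordinary input still splits into two groups.
print(fixed.match("A/B").groups())  # ('A', 'B')

# The excluded strings no longer match at all.
print(fixed.match("B/C"))  # None
print(fixed.match("S/O"))  # None
```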

Regex, capture using word boundaries without stopping at "dot" and/or other characters

Given for example a string like this:
random word, random characters##?, some dots. username bob.1234 other stuff
I'm currently using this regex to capture the username (bob.1234):
\busername (.+?)(,| |$)
But my code needs a regex with only one capture group as python's re.findall returns something different when there are multiple capture groups. Something like this would almost work, except it will capture the username "bob" instead of "bob.1234":
\busername (.+?)\b
Does anybody know if there is a way to use the word boundary while ignoring the dot and without using more than one capture group?
NOTES:
Sometimes there is a comma after the username
Sometimes there is a space after the username
Sometimes the string ends with the username
The \busername (.+?)(,| |$) pattern contains 2 capturing groups, and re.findall will return a list of tuples once a match is found. See findall reference:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
So, there are three approaches here:
Use a (?:...) non-capturing group rather than the capturing one: re.findall(r'\busername (.+?)(?:,| |$)', s). It will consume a , or space, but since only captured part will be returned and no overlapping matches are expected, it is OK.
Use a positive lookahead instead: re.findall(r'\busername (.+?)(?=,| |$)', s). The space and comma will not be consumed, that is the only difference from the first approach.
You may turn the (.+?)(,| |$) into a simple negated character class, [^ ,]+, that matches one or more chars other than a space or comma: \busername ([^ ,]+). It will match to the end of the string if there is no , or space after the username.
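The three approaches can be compared side by side on the string from the question. This is an illustrative sketch; the variable names are not from the original post.

```python
import re

s = ("random word, random characters##?, some dots. "
     "username bob.1234 other stuff")

# Original pattern: two capturing groups, so findall returns tuples.
original = re.findall(r'\busername (.+?)(,| |$)', s)

# 1. Non-capturing group: the delimiter is consumed but not returned.
non_capturing = re.findall(r'\busername (.+?)(?:,| |$)', s)

# 2. Positive lookahead: the delimiter is checked but not consumed.
lookahead = re.findall(r'\busername (.+?)(?=,| |$)', s)

# 3. Negated character class: stop at the first space or comma.
negated = re.findall(r'\busername ([^ ,]+)', s)

print(original)       # [('bob.1234', ' ')]
print(non_capturing)  # ['bob.1234']
print(lookahead)      # ['bob.1234']
print(negated)        # ['bob.1234']
```

All three single-group variants also handle a trailing comma or a username at the very end of the string, which are the cases listed in the question's notes.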

Groups in regular expressions

I'm reading an online book on Python which explains regular expressions, but I can't understand what groups in regular expressions are.
For example what is the difference between :
regex = re.compile(r'Name (\w)*')
regex.findall('Name Mahmoud')
and:
regex = re.compile(r'Name \w*')
regex.findall('Name Mahmoud')
Why does the first call to findall() give me ['d'] while the second gives me ['Name Mahmoud']?
Regex groups are used to capture part of a regex.
Name (\w)* captures a single character \w, and that capture is repeated by *. Only the last capture appears in the result (the d of Mahmoud).
Name \w* does not use a group, so findall returns the whole match.
Name (\w*) captures a series of characters \w*, which in your case yields Mahmoud.
For further information refer to https://docs.python.org/2/library/re.html#regular-expression-syntax
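The three patterns discussed above can be run against the question's input to see the difference. A minimal sketch; the variable names are illustrative.

```python
import re

text = 'Name Mahmoud'

# Group around a single \w, repeated: only the last repetition is kept.
last_char = re.findall(r'Name (\w)*', text)

# No group: findall returns the entire match.
whole = re.findall(r'Name \w*', text)

# Group around the repeated \w*: the whole run of characters is captured.
name_only = re.findall(r'Name (\w*)', text)

print(last_char)  # ['d']
print(whole)      # ['Name Mahmoud']
print(name_only)  # ['Mahmoud']
```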
What is a group in a regular expression?
A group is one matching pair of parentheses typically with stuff between them. Groups serve three primary purposes:
A group may have multiple alternatives separated by the "|" logical OR metacharacter.
A group allows applying a quantifier to repeat the contents of the group a specified number of times.
A capture group is a special type of group where the contents of the group are saved and available both inside the regex (using "\n" backreference syntax), and outside the regex (using "$n" syntax). Capture groups are numbered starting with 1 and are counted in order of the occurrence of the opening parentheses.
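The three purposes listed above might look like this in Python. The sample strings are invented for illustration. Note that Python's re module writes references in replacement strings as \1 or \g<1> rather than $1.

```python
import re

# 1. Alternation: the group limits the scope of "|" to cat or dog.
alternation = re.fullmatch(r'(?:cat|dog)s', 'dogs')

# 2. Quantifier: {3} repeats the whole group, not just the last character.
repeated = re.fullmatch(r'(?:ab){3}', 'ababab')

# 3a. Capture used inside the regex: \1 must repeat what group 1 matched.
doubled = re.search(r'\b(\w+) \1\b', 'it was was fine')

# 3b. Capture used outside the regex: swap the two captured parts.
swapped = re.sub(r'(\w+)=(\w+)', r'\2=\1', 'key=value')

print(alternation is not None)  # True
print(repeated is not None)     # True
print(doubled.group(1))         # was
print(swapped)                  # value=key
```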

Why won't re.groups() give me anything for my one correctly-matched group?

When I run this code:
print(re.search(r'1', '1').groups())
I get a result of (). However, .group(0) gives me the match.
Shouldn't groups() give me something containing the match?
To the best of my knowledge, .groups() returns a tuple of remembered groups. I.e. those groups in the regular expression that are enclosed in parentheses. So if you were to write:
print(re.search(r'(1)', '1').groups())
you would get
('1',)
as your response. In general, .groups() will return a tuple of all the groups of objects in the regular expression that are enclosed within parentheses.
groups is empty since you do not have any capturing groups - http://docs.python.org/library/re.html#re.MatchObject.groups. group(0) always returns the whole text that was matched, regardless of whether it was captured in a group or not
You have no groups in your regex, therefore you get an empty tuple (()) as result.
Try
re.search(r'(1)', '1').groups()
With the parentheses you create a capturing group; the text that matches this part of the pattern is stored in that group.
Then you get
('1',)
as result.
The reason for this is that you have no capturing groups (since you don't use () in the pattern).
http://docs.python.org/library/re.html#re.MatchObject.groups
And group(0) returns the entire search result (even if it has no capturing groups at all):
http://docs.python.org/library/re.html#re.MatchObject.group
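A small sketch of the difference between groups() and group(0), using the patterns from the question and answers. The variable names are illustrative.

```python
import re

no_group = re.search(r'1', '1')     # pattern without parentheses
one_group = re.search(r'(1)', '1')  # pattern with one capturing group

# groups() lists only the capturing groups, so it is empty here.
print(no_group.groups())   # ()

# group(0) is always the whole match, with or without groups.
print(no_group.group(0))   # 1

# With a capturing group, groups() returns a one-element tuple.
print(one_group.groups())  # ('1',)
print(one_group.group(1))  # 1
```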
