Parsing a regex for optional sections - python

Similar to this question but with a difference subtle enough that I still need some help.
Currently I have:
'(.*)\[(\d+\-\d+)\]'
as my regex, which matches any number of characters followed by square brackets [] that contain two decimals separated by a dash. My issue is, I'd like it to also match with just one decimal number between the square brackets, and possibly even with nothing in between the square brackets. So:
word[1-5] = match
word[5] = match
word[] = match (not essential)
and ensuring
word[-5] = no match
Could anyone point me in the direction of the next step? I currently find regex to be a bit of a guessing game, though I would like to become better with them.

Keep your pattern and make the second number optional using ?:
(.*)\[(\d+(-\d+)?)\]
To also allow empty brackets, use ? again, this time on the whole bracket content:
(.*)\[(\d+(-\d+)?)?\]
The ? just before \] is the one that was added.
A working example http://rubular.com/r/t0MaHyHfeS
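As a quick check in Python, the following sketch runs the second pattern against the four inputs from the question. The helper name bracket_range is made up for illustration, and re.fullmatch is an assumption about how the pattern would be applied (the question does not say whether it uses match, search, or something else).

```python
import re

# Pattern from the answer above: the whole bracket content is optional,
# and so is the "-digits" part inside it.
BRACKET_RE = re.compile(r'(.*)\[(\d+(-\d+)?)?\]')

def bracket_range(text):
    """Return (prefix, bracket content), or None if text does not match."""
    m = BRACKET_RE.fullmatch(text)
    if m is None:
        return None
    return m.group(1), m.group(2)

results = {s: bracket_range(s)
           for s in ('word[1-5]', 'word[5]', 'word[]', 'word[-5]')}
# word[1-5] -> ('word', '1-5')
# word[5]   -> ('word', '5')
# word[]    -> ('word', None)   (the optional group did not participate)
# word[-5]  -> None             (no match)
```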

Use ? to match 0 or 1 occurrences.
So put a ? after the -\d+ part, and another after the group that holds both numbers:
(.*)\[(\d+(-\d+)?)?\]
There is no need to escape -. It has a special meaning only inside a character class.

(.*)\[((\d+(?:\-\d+)?)?)\]
This matches all of the cases, including empty brackets. For the input match[1-5], the first two capture groups are:
1- match
2- 1-5
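A small sketch of what Python reports for that pattern, assuming re.fullmatch is used and dropping the unnecessary backslash before the dash. Note that the pattern has a third capture group (the inner optional one), which the list above does not mention.

```python
import re

# Group 1: prefix, group 2: bracket content (may be empty),
# group 3: the inner optional part; (?:...) does not create a group.
NESTED_RE = re.compile(r'(.*)\[((\d+(?:-\d+)?)?)\]')

full_groups = NESTED_RE.fullmatch('match[1-5]').groups()
empty_groups = NESTED_RE.fullmatch('match[]').groups()
# full_groups  -> ('match', '1-5', '1-5')
# empty_groups -> ('match', '', None)
```

Because group 2 wraps the optional group, it yields an empty string rather than None for empty brackets.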

Not every regex engine supports this (Python's re module does), but you could try an "or" operator for the part inside the brackets:
'(.*)\[(\d+\-\d+|\d+)\]'

Related

How to make an empty or optional match group in Python regex? [duplicate]

This is an example string:
123456#p654321
Currently, I am using this pattern to capture 123456 and 654321 into two different groups:
([0-9].*)#p([0-9].*)
But on occasions, the #p654321 part of the string will not be there, so I will only want to capture the first group. I tried to make the second group "optional" by appending ? to it, which works, but only as long as there is a #p at the end of the remaining string.
What would be the best way to solve this problem?
You have the #p outside of the capturing group, which makes it a required piece of the result. You are also using the dot character (.) improperly. Dot (in most reg-ex variants) will match any character. Change it to:
([0-9]*)(?:#p([0-9]*))?
The (?:) syntax is how you get a non-capturing group. We then capture just the digits that you're interested in. Finally, we make the whole thing optional.
Also, most reg-ex variants have a \d character class for digits. So you could simplify even further:
(\d*)(?:#p(\d*))?
As another person has pointed out, the * operator could potentially match zero digits. To prevent this, use the + operator instead:
(\d+)(?:#p(\d+))?
Your regex will actually match no digits, because you've used * instead of +.
This is what (I think) you want:
(\d+)(?:#p(\d+))?
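A minimal sketch of that final pattern in Python. The helper name split_ids is hypothetical, and using re.fullmatch assumes the whole string should consist of the number(s) and nothing else.

```python
import re

# (?:...)? makes the "#p" separator and the second number optional together.
PAIR_RE = re.compile(r'(\d+)(?:#p(\d+))?')

def split_ids(text):
    """Return (first, second); second is None when '#p...' is absent."""
    m = PAIR_RE.fullmatch(text)
    return m.groups() if m else None

both = split_ids('123456#p654321')   # ('123456', '654321')
first_only = split_ids('123456')     # ('123456', None)
```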

How to search double digit numbers in python with regular expression?

I have a piece of code that records times in this format:
0.0-8.0
0.0-9.0
0.0-10.0
I want to use a regular expression that will find all of these strings and have checked here and here for help but am still confused. I understand how to do it if I only wanted to do single digit numbers, but I can't figure out how to handle double digit numbers like 10 or 20.
It is also important that the expression does not find the string
0.0-1.0
as it should be ignored.
So far my expression looks like this:
expression = re.compile(r',0\.0\-[0-2][0-9]')
If you want to match each line shown in your question, try an expression like this:
0\.0\-[0-2]?\d\.\d
\d is the same as [0-9]. The ? means 0 or 1 occurrences, so this will only match 1- or 2-digit numbers. If you need the comma at the start of the regex, add that in.
If you want to exclude 0.0-1.0, then you should do that in code, not in the regular expression, since that would make it less readable. But if you insist, I have included one that will exclude that string for you:
0\.0\-[0-2]?[0-9]\.(?<!0-1\.)\d
This uses a negative lookbehind to ensure the previous part is not 0-1., which would only occur in the match you didn't want.
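Both approaches can be compared side by side in Python. This is only a sketch: the comma-separated sample string is made up, since the question does not show the surrounding data.

```python
import re

# Plain pattern: matches every 0.0-N.N time, including the unwanted one.
TIME_RE = re.compile(r'0\.0-[0-2]?\d\.\d')
# Lookbehind variant from the answer: rejects the 0.0-1.0 entry.
LOOKBEHIND_RE = re.compile(r'0\.0-[0-2]?[0-9]\.(?<!0-1\.)\d')

sample = '0.0-8.0,0.0-9.0,0.0-10.0,0.0-1.0'  # hypothetical input

# Option 1: match everything, then filter in code.
all_times = TIME_RE.findall(sample)
wanted = [t for t in all_times if t != '0.0-1.0']

# Option 2: let the regex do the exclusion.
wanted_lookbehind = LOOKBEHIND_RE.findall(sample)
# Both give ['0.0-8.0', '0.0-9.0', '0.0-10.0']
```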

Python non-greedy regular expression is not exactly what I expected

string: XXaaaXXbbbXXcccXXdddOO
I want to match the shortest string that begins with 'XX' and ends with 'OO'.
So I wrote the non-greedy regex r'XX.*?OO':
>>> str = 'XXaaaXXbbbXXcccXXdddOO'
>>> re.findall(r'XX.*?OO', str)
['XXaaaXXbbbXXcccXXdddOO']
I thought it would return ['XXdddOO'], but it was so 'greedy'.
I now realize I was mistaken: the pattern first matches the leftmost 'XX', and only then does the non-greedy quantifier come into play.
But I still want to figure out how to get my result ['XXdddOO'] directly. Any reply is appreciated.
So the key point is not really non-greediness itself, but what I expected non-greedy to mean: matching as few characters as possible between the left delimiter (XX) and the right delimiter (OO).
And of course the fact is that the string is processed from left to right.
How about:
.*(XX.*?OO)
The match will be in group 1.
Regexes work from left to right: non-greedy means that it will match XXaaaXXdddOO and not XXaaaXXdddOOiiiOO. If your data structure is that fixed, you could do:
XX[a-z]{3}OO
to select all patterns like XXiiiOO. It can be adjusted to fit your needs: XX[^X]+?OO, for instance, selects everything from the last XX before an OO up to that OO, so in XXiiiXXdddFFcccOOlll it would match XXdddFFcccOO.
Indeed, the issue is not with greedy/non-greedy. The solution suggested by @devnull should work, provided you want to avoid even a single X between your XX and OO groups.
Otherwise, you'll have to use a lookahead (i.e. a piece of regex that scouts ahead in the string and checks whether it can match, without actually consuming any characters). Something like this:
re.findall(r'XX(?:.(?!XX))*?OO', str)
With this negative lookahead, you match (non-greedily) any char (.) not followed by XX…
The behaviour is due to the fact that the string is processed from left to right. A way to avoid the problem is to use a negated character class:
XX(?:(?=([^XO]+|O(?!O)|X(?!X)))\1)+OO
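To see how several of the suggestions above behave on the string from the question, here is a short Python sketch. The variable names are made up for illustration; the patterns are the ones quoted in the answers.

```python
import re

text = 'XXaaaXXbbbXXcccXXdddOO'

# The original attempt: the match starts at the leftmost XX.
lazy = re.findall(r'XX.*?OO', text)

# Greedy prefix: .* swallows as much as it can before the captured part.
greedy_prefix = re.search(r'.*(XX.*?OO)', text).group(1)

# Negated class: no X at all is allowed between XX and OO.
no_x_between = re.findall(r'XX[^X]+?OO', text)

# Negative lookahead: no consumed character may be followed by XX.
lookahead = re.findall(r'XX(?:.(?!XX))*?OO', text)

# lazy          -> ['XXaaaXXbbbXXcccXXdddOO']
# greedy_prefix -> 'XXdddOO'
# no_x_between  -> ['XXdddOO']
# lookahead     -> ['XXdddOO']
```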

Match any characters more than once, but stop at a given character

I am writing a regex that will be used for recognizing commands in a string. I have three possible words the commands could start with and they always end with a semi-colon.
I believe the regex pattern should look something like this:
(command1|command2|command3).+;
The problem, I have found, is that since . matches any character and + tells it to match one or more, it skips right over the first instance of a semi-colon and continues going.
Is there a way to get it to stop at the first instance of a semi-colon it comes across? Is there something other than . that I should be using instead?
The issue you are facing with (command1|command2|command3).+; is that + is greedy, meaning it will match as much as possible, all the way to the last semicolon.
To fix this, make it non-greedy by adding the ? operator, like so: (command1|command2|command3).+?;
Just as an FYI, the same applies to the * operator: adding a ? makes it non-greedy.
Tell it to find only non-semicolons.
[^;]+
What you are looking for is a non-greedy match.
.+?
The "?" after your greedy + quantifier will make it match as little as possible, instead of as much as possible, which it does by default.
Your regex would be
'(command1|command2|command3).+?;'
See Python RE documentation
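The three variants discussed above can be compared on a small input. This is a sketch with a made-up command string; the real command names and arguments are not shown in the question.

```python
import re

commands = 'command1 foo; command2 bar;'  # hypothetical input

# Greedy: .+ runs to the last semicolon, so one match spans both commands.
greedy = re.search(r'(command1|command2|command3).+;', commands).group(0)

# Non-greedy: .+? stops at the first semicolon it can.
lazy = [m.group(0) for m in
        re.finditer(r'(command1|command2|command3).+?;', commands)]

# Negated class: [^;]+ can never run past a semicolon.
negated = [m.group(0) for m in
           re.finditer(r'(command1|command2|command3)[^;]+;', commands)]

# greedy  -> 'command1 foo; command2 bar;'
# lazy    -> ['command1 foo;', 'command2 bar;']
# negated -> ['command1 foo;', 'command2 bar;']
```

One difference worth knowing: by default . does not match a newline, while [^;] does, so the two fixes behave differently on commands that span several lines.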

Regular expression capturing entire match consisting of repeated groups

I've looked through the forums but could not find exactly how to solve my problem.
Let's say I have a string like the following:
UDK .636.32/38.082.4454.2(575.3)
and I would like to match the expression with a regex, capturing the actual number (in this case the '.636.32/38.082.4454.2(575.3)').
There could be some garbage characters between the 'UDK' and the actual number, and characters like '.', '/' or '-' are valid parts of the number. Essentially the number is a sequence of digits separated by some allowed characters.
What I've come up with is the following regex:
'UDK.*(\d{1,3}[\.\,\(\)\[\]\=\'\:\"\+/\-]{0,3})+'
but it does not group the '.636.32/38.082.4454.2(575.3)'! It leaves me with nothing more than a last digit of the last group (3 in this case).
Any help would be greatly appreciated.
First, you need a non-greedy .*?.
Second, you don't need to escape some chars in [ ].
Third, you might simply treat the number as a sequence of digits and allowed characters. Why use \d{1,3} when the number contains 4454?
>>> re.match(r'UDK.*?([\d.,()\[\]=\':"+/-]+)', s).group(1)
'.636.32/38.082.4454.2(575.3)'
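The reason the original pattern kept only a small tail of the number is that a capture group with a quantifier on it remembers just its last repetition. A minimal sketch, with s defined as the example string from the question (the toy pattern (\d)+ is only there to show the effect):

```python
import re

# A repeated capture group keeps only its final iteration.
last_only = re.match(r'(\d)+', '123').group(1)   # '3'

# Putting the repetition inside the group captures the whole run instead.
s = 'UDK .636.32/38.082.4454.2(575.3)'
number = re.match(r'UDK.*?([\d.,()\[\]=\':"+/-]+)', s).group(1)
# number -> '.636.32/38.082.4454.2(575.3)'
```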
Not so much a direct answer to your problem, but a general regexp tip: use Kodos (http://kodos.sourceforge.net/). It is simply awesome for composing/testing out regexps. You can enter some sample text, and "try out" regular expressions against it, seeing what matches, groups, etc. It even generates Python code when you're done. Good stuff.
Edit: using Kodos I came up with:
UDK.*?(?P<number>[\d/.)(]+)
as a regexp which matches the given example. Code that Kodos produces is:
import re
rawstr = r"""UDK.*?(?P<number>[\d/.)(]+)"""
matchstr = """UDK .636.32/38.082.4454.2(575.3)"""
# method 1: using a compile object
compile_obj = re.compile(rawstr)
match_obj = compile_obj.search(matchstr)
# Retrieve group(s) by name
number = match_obj.group('number')
