How to repeat a pattern in python regular expression? - python

I'm doing a python regex and have a working expression:
\n(?P<curve>\w+)(?:.+)(?P<unit>\.\S*)(?:\s+.\s+)(?P<desc>:.+)|\n(?P<curve2>\w+)(?:.+)(?P<unit2>\.\S*)|\n(?P<curve3>\w+)
I would like to know whether I can reuse the pattern from the first alternative; the reason is that I would rather not define separate "curve" and "unit" groups (curve2, unit2, and so on) for each case.
My test data is as follows:
#-------------
MD
BMK_STA .Mpsi : Modulus
FANG . : Friction Angle
PR .unitless :
RHO .g/cm3
The idea is to have MD and RHO also in "curve" group.

There is no special syntax to avoid that kind of repetition in regexes, so in the general case you can't avoid a certain amount of repetition. However in your specific case you should be able to solve your problem using optional groups:
\n(?P<curve>\w+)((?:.+)(?P<unit>\.\S*)((?:\s+.\s+)(?P<desc>:.+))?)?
Which is probably better written in verbose mode as:
\n(?P<curve>\w+)
(
    .+
    (?P<unit>\.\S*)
    (
        \s+.\s+
        (?P<desc>:.+)
    )?
)?
to make the group nesting easier to read (this layout requires the re.VERBOSE flag). I've also removed the ?: groups, since in this case they are useless.
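As a rough sketch, the nested optional groups can be compiled with re.VERBOSE and run over the question's test data with finditer(). The sample text is the test data as it appears in the question; the names pattern, text and rows are only illustrative.

```python
import re

# The nested optional groups from the answer, laid out in verbose mode.
pattern = re.compile(r"""
    \n(?P<curve>\w+)            # curve name at the start of a line
    (
        .+
        (?P<unit>\.\S*)         # unit, beginning with a dot
        (
            \s+.\s+
            (?P<desc>:.+)       # description, beginning with a colon
        )?
    )?
""", re.VERBOSE)

text = (
    "#-------------\n"
    "MD\n"
    "BMK_STA .Mpsi : Modulus\n"
    "FANG . : Friction Angle\n"
    "PR .unitless :\n"
    "RHO .g/cm3"
)

# Groups that did not take part in a match come back as None.
rows = [(m.group("curve"), m.group("unit")) for m in pattern.finditer(text)]
for curve, unit in rows:
    print(curve, unit)
```

With this data every line lands in the single curve group, including MD (which has no unit) and RHO.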

I am not entirely sure what you mean, but the following may help:
If you want to find every match for a pattern, you can use re.findall(pattern, string)
It returns a list of the matches.
re module docs
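For instance, here is a minimal sketch with a cut-down pattern (only the curve name, an assumption made to keep the example short). When the pattern contains exactly one capturing group, findall() returns the contents of that group rather than the whole match.

```python
import re

text = (
    "#-------------\n"
    "MD\n"
    "BMK_STA .Mpsi : Modulus\n"
    "FANG . : Friction Angle\n"
    "PR .unitless :\n"
    "RHO .g/cm3"
)

# One capturing group, so findall() returns a list of the captured strings.
curves = re.findall(r"\n(\w+)", text)
print(curves)
```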

Assuming your regex is correct, use the finditer() function to iterate over all the matches.
Example:
import re
for m in re.finditer(r'REGEX_GOES_HERE', text):
    print(m.group('curve'))
    print(m.group('unit'))
This way you get every match, and the named groups stay intact, as you wanted.

Related

Prevent last duplicate character from string [duplicate]

My regex pattern looks something like
<xxxx location="file path/level1/level2" xxxx some="xxx">
I am only interested in the part in quotes assigned to location. Shouldn't it be as easy as below without the greedy switch?
/.*location="(.*)".*/
Does not seem to work.
You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will match all of "file path/level1/level2" xxx some="xxx".
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).
location="(.*)" will match from the " after location= until the " after some="xxx unless you make it non-greedy.
So you either need .*? (i.e. make it non-greedy by adding ?) or better replace .* with [^"]*.
[^"] Matches any character except for a " <quotation-mark>
More generic: [^abc] - Matches any character except for an a, b or c
How about
.*location="([^"]*)".*
This avoids the unlimited search with .* and will match exactly to the first quote.
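To make the difference concrete, here is a small Python sketch that runs the three variants discussed so far (greedy, lazy, and negated character class) against the tag from the question. The variable names are illustrative only.

```python
import re

tag = '<xxxx location="file path/level1/level2" xxxx some="xxx">'

# Greedy: .* runs to the LAST double quote in the input.
greedy = re.search(r'location="(.*)"', tag).group(1)

# Lazy: .*? stops at the first double quote that lets the match succeed.
lazy = re.search(r'location="(.*?)"', tag).group(1)

# Negated class: [^"]* can never cross a double quote at all.
negated = re.search(r'location="([^"]*)"', tag).group(1)

print(greedy)
print(lazy)
print(negated)
```

The negated class has the added benefit of working in engines that have no lazy quantifiers.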
Use non-greedy matching, if your engine supports it. Add the ? inside the capture.
/location="(.*?)"/
Using a lazy quantifier (?) without the global flag is the answer.
If you had the global flag /g, it would have matched every shortest-length match in the input instead of only the first one.
Here's another way, using [\s\S] to match any character, including newlines.
The first item (this is the one you want; [\s\S]*? is lazy):
[\s\S]*?(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explanation: https://regex101.com/r/ZcqcUm/2
For completeness, this gets the last item ([\s\S]* is greedy):
[\s\S]*(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explanation: https://regex101.com/r/LXSPDp/3
The only difference between these two regular expressions is the ?.
The other answers here fail to spell out a full solution for regex versions which don't support non-greedy matching. The non-greedy quantifiers (.*?, .+? etc) are a Perl 5 extension which isn't supported in traditional regular expressions.
If your stopping condition is a single character, the solution is easy; instead of
a(.*?)b
you can match
a[^ab]*b
i.e. specify a character class which excludes the starting and ending delimiters.
In the more general case, you can painstakingly construct an expression like
start(|[^e]|e(|[^n]|n(|[^d])))end
to capture a match between start and the first occurrence of end. Notice how the subexpression with nested parentheses spells out a number of alternatives which between them allow e only if it isn't followed by nd and so forth, and also take care to cover the empty string as one alternative which doesn't match whatever is disallowed at that particular point.
Of course, the correct approach in most cases is to use a proper parser for the format you are trying to parse, but sometimes, maybe one isn't available, or maybe the specialized tool you are using is insisting on a regular expression and nothing else.
Because you are using a quantified subpattern and, as described in the Perl documentation,
By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?" . Note that the meanings don't change, just the
"greediness":
*? //Match 0 or more times, not greedily (minimum matches)
+? //Match 1 or more times, not greedily
Thus, to make your quantified pattern match as little as possible, follow it with ? as in:
/location="(.*?)"/
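import regex  # third-party package; the standard-library re module works the same way here
text = 'ask her to call Mary back when she comes back'
p = r'(?i)(?s)call(.*?)back'
for match in regex.finditer(p, text):
    print(match.group(1))  # the captured text keeps its surrounding spaces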
Output:
Mary

python regex, capturing a pattern with trimming repeated subpattern in string

Here is a list of input strings:
"collect_project_stage1_20220927_foot60cm_arm70cm_height170cm_......",
"collect_project_version_1_0927_foot60cm_height170cm_......",
"collect_project_ver1_20220927_arm70cm_height170cm_......",
These input strings are provided by many different users.
The leading "collect_" is fixed. It is followed by "${project_version}", which has no strict naming rule, so it varies a lot between users.
After that comes a repeated "${part}${length}cm_.......", but the number of repetitions is not fixed.
I'd like to capture the variable ${project_version}.
I tried the following re.match to capture it:
re.match(r'collect_(.*)_(?:(?:foot|arm|height)\d+cm_)+.*' , string)
However, the result is not what I expected.
Can anyone give me a hint as to what is wrong with my regular expression?
Assuming you were only planning to capture the part preceding the various cm suffixed components, the reason you're capturing so many of them instead of just checking and discarding them is that regexes are greedy by default.
You can narrow your capture group to only match what you really expect (e.g. just a name followed by a date), replacing (.*) with something like ((?:[a-z]+[0-9]*_)*\d{8}).
Alternatively, you can be lazy and enable non-greedy matching for the capture group, changing (.*) to (.*?) where the ? says to only take the minimal amount required to satisfy the regex. The latter is more brittle, but if you really can't impose any other restrictions on the expression for the capture group, it's what you've got.
Use a non-greedy quantifier. Otherwise, the capture group will match as much as it can, so it keeps going until the last possible match for (?:foot|arm|height)\d+cm_.
result = re.match(r'collect_(.*?)_(?:(?:foot|arm|height)\d+cm_)+' , string)
print(result.group(1)) # project_stage1_20220927
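As a quick check, the non-greedy pattern can be run over all three input strings from the question. This is only a sketch: the trailing "......" is kept literally as the question wrote it, and the names inputs and versions are illustrative.

```python
import re

inputs = [
    "collect_project_stage1_20220927_foot60cm_arm70cm_height170cm_......",
    "collect_project_version_1_0927_foot60cm_height170cm_......",
    "collect_project_ver1_20220927_arm70cm_height170cm_......",
]

pattern = re.compile(r"collect_(.*?)_(?:(?:foot|arm|height)\d+cm_)+")

versions = []
for s in inputs:
    m = pattern.match(s)
    # match() returns None when the string does not fit the pattern at all.
    versions.append(m.group(1) if m else None)

print(versions)
```

The lazy group stops at the first underscore that is followed by a part/length component, so none of the cm-suffixed pieces leak into the capture.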
The regex "(.*)" will capture far too much.
re.match(r'collect_([a-z0-9]+_[a-z0-9]+_[a-z0-9]+)_(?:(?:foot|arm|height)\d+cm_)+' , string)

How can I extract two values from a string like this using a regular expression?

How can I get the value from the following strings using one regular expression?
/*##debug_string:value/##*/
or
/*##debug_string:1234/##*/
or
/*##debug_string:http://stackoverflow.com//##*/
The result should be
value
1234
http://stackoverflow.com/
Guessing at the intent behind your pattern:
re.findall(r"/\*##debug_string:(.*?)/##\*/", your_string)
Note that your variations cannot work because you didn't escape the *. In regular expressions, * means a repetition of the previous character/group. If you really mean the * character, you must use \*.
import re
print(re.findall(r"/\*##debug_string:(.*?)/##\*/", "/*##debug_string:value/##*/"))
print(re.findall(r"/\*##debug_string:(.*?)/##\*/", "/*##debug_string:1234/##*/"))
print(re.findall(r"/\*##debug_string:(.*?)/##\*/", "/*##debug_string:http://stackoverflow.com//##*/"))
Executes as:
['value']
['1234']
['http://stackoverflow.com/']
EDIT: Ok I see that you can have a URL. I've amended the pattern to take it into account.
Use this regex:
[^:]+:([^/]+)
And use capture group #1 for your value.
Live Demo: http://www.rubular.com/r/FxFnpfPHFn
Your regex will be something like: .*:(.*)/.+. Group 1 will be what you are looking for. However this is a REALLY inclusive regex, you might want to post some more details so that you can create some more restrictions.
Assuming that the format stays consistent:
re.findall(r'debug_string:([^/]+)/##', string)

Python Regex instantly replace groups

Is there any way to directly replace all groups using regex syntax?
The normal way:
re.match(r"(?:aaa)(_bbb)", string1).group(1)
But I want to achieve something like this:
re.match(r"(\d.*?)\s(\d.*?)", "(CALL_GROUP_1) (CALL_GROUP_2)")
I want to build the new string instantaneously from the groups the Regex just captured.
Have a look at re.sub:
result = re.sub(r"(\d.*?)\s(\d.*?)", r"\1 \2", string1)
This is Python's regex substitution (replace) function. The replacement string can be filled with so-called backreferences (backslash, group number) which are replaced with what was matched by the groups. Groups are counted the same as by the group(...) function, i.e. starting from 1, from left to right, by opening parentheses.
The accepted answer is perfect. I would add that group reference is probably better achieved by using this syntax:
r"\g<1> \g<2>"
for the replacement string. This way, you work around syntax limitations where a group may be followed by a digit. Again, this is all present in the doc, nothing new, just sometimes difficult to spot at first sight.
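A small sketch of both replacement styles follows; the pattern and input string are made up for illustration. The second call shows the case the \g<1> syntax exists for: a group reference immediately followed by a literal digit, where a plain \10 would be read as a reference to group 10.

```python
import re

pattern = r"(\d+)\s(\d+)"

# Plain numeric backreferences: swap the two captured numbers.
swapped = re.sub(pattern, r"\2 \1", "10 20")

# \g<1> followed by a literal 0: group 1, then the character "0".
appended = re.sub(pattern, r"\g<1>0", "10 20")

print(swapped)
print(appended)
```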

Grouping in Python Regular Expressions

So I'm playing around with regular expressions in Python. Here's what I've gotten so far (debugged through RegExr):
##(VAR|MVAR):([a-zA-Z0-9]+)+(?::([a-zA-Z0-9]+))*##
So what I'm trying to match is stuff like this:
##VAR:param1##
##VAR:param2:param3##
##VAR:param4:param5:param6:0##
Essentially, you have either VAR or MVAR followed by a colon then some param name, then followed by the end chars (##) or another : and a param.
So, what I've gotten for the groups on the regex is the VAR, the first param, and then the last thing in the parameter list (for the last example, the 3rd group would be 0). I understand that groups are created by (...), but is there any way for the regex to match the multiple groups, so that param5, param6, and 0 are in their own group, rather than only having a maximum of three groups?
I'd like to avoid having to match this string then having to split on :, as I think this is capable of being done with regex. Perhaps I'm approaching this the wrong way.
Essentially, I'm attempting to see if I can find and split in the matching process rather than a postprocess.
If this format is fixed, you don't need regex, it just makes it harder. Just use split:
text.strip('#').split(':')
should do it.
The number of groups in a regular expression is fixed. You will need to postprocess somehow.
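One possible way to combine the two answers, sketched under the assumption that the tokens are embedded in longer text: let the regex find each ##...## token and capture the whole parameter list as a single group, then split that group on ":" afterwards. The pattern below is an adaptation of the question's, and parse_tokens is a hypothetical helper name.

```python
import re

# Group 1: VAR or MVAR. Group 2: the entire colon-separated parameter list.
TOKEN = re.compile(r"##(VAR|MVAR):([a-zA-Z0-9]+(?::[a-zA-Z0-9]+)*)##")

def parse_tokens(text):
    """Return (kind, [params...]) for every token found in text."""
    return [(m.group(1), m.group(2).split(":")) for m in TOKEN.finditer(text)]

tokens = parse_tokens("a ##VAR:param1## b ##MVAR:param4:param5:param6:0## c")
print(tokens)
```

The regex still validates the token shape, while the variable-length part is handled by split(), which sidesteps the fixed group count.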
