Regular Expression don't include Zero-Width assertions in match groups - python

I have updated the following re to not match when the string is B/C, B/O, S/C, or S/O.
old (.*)/(.*)
new: (.*)(?<!^(B|S)(?=/(C|O)$))/(.*)
This regex is being used downstream with a list of other regex patterns and is expected to separate the data into two groups. Is there a way for my regex pattern (or a better one) to not count the zero-width assertions?
I've tried pushing the validation till the end with a single lookbehind assertion but that only has access to the group after the slash.
I've also tried enclosing the assertions in (?:...) but inner parenthesis are still counted towards matching groups.

Thanks to #user2357112
(.*)(?<!^(?:B|S)(?=/(?:C|O)$))/(.*)
I was using (?:...) incorrectly on my first attempts

Related

Regex Statement to only match parts of a string for comparison - Python

What I am trying to do is match values from one file to another, but I only need to match the first portion of the string and the last portion.
I am reading each file into a list, and manipulating these based on different Regex patterns I have created. Everything works, except when it comes to these type of values:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
In this example, I only want to match 'V-1\ZDS\R\EMBO-20' and then compare the '24' value at the end of the string. The number x in '20-x:', can vary and doesn't matter in terms of comparisons, as long as the first and last parts of this string match.
This is the Regex I am using:
re.compile(r"(?:.*V-1\\ZDS\\R\\EMBO-20-\d.*)(:\d*\w.*)")
Once I filter down the list, I use the following function to return the difference between the two sets:
funcDiff = lambda x, y: list((set(x)- set(y))) + list((set(y)- set(x)))
Is there a way to take the list of differences and filter out the ones that have matching values after the
:
as mentioned above?
I apologize is this is an obvious answer, I'm new to Python and Regex!
The output I get is the differences between the entire strings, so even if the first and last part of the string match, if the number following the 'EMBO-20-x' doesn't also match, it returns it as being different.
Before discussing your question, regex101 is an incredibly useful tool for this type of thing.
Your issue stems from two issues:
1.) The way you used .*
2.) Greedy vs. Nongreedy matches
.* kinda sucks
.* is a regex expression that is very rarely what you actually want.
As a quick aside, a useful regex expression is [^c]* or [^c]+. These expressions match any character except the letter c, with the first expression matching 0 or more, and the second matched 1 or more.
.* will match all characters as many times as it can. Instead, try to start your regex patterns with more concrete starting points. Two good ways to do this are lookbehind expressions and anchors.
Another quick aside, it's likely that you are misusing regex.match and regex.find. match will only return a match that begins at the start of the string, while find will return matches anywhere in the input string. This could be the reason you included the .* in the first place, to allow a .match call to return a match deeper in the string.
Lookbehind Expressions
There are more complete explanations online, but in short, regex patterns like:
(?<=test)foo
will match the text foo, but only if test is right in front of it. To be more clear, the following strings will not match that regex:
foo
test-foo
test foo
but the following string will match:
testfoo
This will only match the text foo, though.
Anchors
Another option is anchors. ^ and $ are special characters, matching the start and end of a line of text. If you know your regex pattern will match exactly one line of text, start it with ^ and end it with $.
Leading patterns with .* and ending with .* are likely the source of your issue. Although you did not include full examples of your input or your code, you likely used match as opposed to find.
In regex, . matches any character, and * means 0 or more times. This means that for any input, your pattern will match the entire string.
Greedy vs. Non-Greedy qualifiers
The second issue is related to greediness. When your regex patterns have a * in them, they can match 0 or more characters. This can hide problems, as entire * expressions can be skipped. Your regex is likely matched several lines of text as one match, and hiding multiple records in a single .*.
The Actual Answer
Taking all of this in to consideration, let's assume that your input data looks like this:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
V-1\ZDS\R\EMBO-20-3:93
V-1\ZDS\R\EMBO-20-6:22309
V-1\ZDS\R\EMBO-20-8:2238
V-1\ZDS\R\EMBO-20-3:28
A better regular expression would be:
^V-1\\ZDS\\R\\EMBO-20-\d:(\d+)$
To visualize this regex in action, follow this link.
There are several differences I would like to highlight:
Starting the expression with ^ and ending with $. This forces the regex to match exactly one line. Even though the pattern works without these characters, it's good practice when working with regex to be as explicit as possible.
No useless non-capturing group. Your example had a (?:) group at the start. This denotes a group that does not capture it's match. It's useful if you want to match a subpattern multiple times ((?:ab){5} matches ababababab without capturing anything). However, in your example, it did nothing :)
Only capturing the number. This makes it easier to extract the value of the capture groups.
No use of *, one use of +. + works like *, but it matches 1 or more. This is often more correct, as it prevents 'skipping' entire characters.

python regex, capturing a pattern with trimming repeated subpattern in string

Here is a list of input strings:
"collect_project_stage1_20220927_foot60cm_arm70cm_height170cm_......",
"collect_project_version_1_0927_foot60cm_height170cm_......",
"collect_project_ver1_20220927_arm70cm_height170cm_......",
These input strings are provided by many different users.
Leading "collect_" is fixed, and then follows "${project_version}" which doesn't have hard rule to set this variable, the naming will be very different by different users.
Then, there will be repeating "${part}${length}cm_.......", but the number of repeatence is not fixed.
I'd like to capture the the variable ${project_version}.
Then, I try using the following re.match to capture it.
re.match(r'collect_(.*)_(?:(?:foot|arm|height)\d+cm_)+.*' , string)
However, the result is not as expected.
Is there anyone give me a hint that what's wrong in my regular expression?
Assuming you were only planning to capture the part preceding the various cm suffixed components, the reason you're capturing so many of them instead of just checking and discarding them is that regexes are greedy by default.
You can narrow your capture group to only match what you really expect (e.g. just a name followed by a date), replacing (.*) with something like ((?:[a-z]+[0-9]*_)*\d{8}).
Alternatively, you can be lazy and enable non-greedy matching for the capture group, changing (.*) to (.*?) where the ? says to only take the minimal amount required to satisfy the regex. The latter is more brittle, but if you really can't impose any other restrictions on the expression for the capture group, it's what you've got.
Use a non-greedy quantifier. Otherwise, the capture group will match as far as it can, so it will keep going until the last match for (?:foot|arm|height)\d+cm_).
result = re.match(r'collect_(.*?)_(?:(?:foot|arm|height)\d+cm_)+' , string)
print(result.group(1)) # project_stage1_20220927
The regex "(.*)" will capture far too much.
re.match(r'collect_([a-z0-9]+_[a-z0-9]+_[a-z0-9]+)_(?:(?:foot|arm|height)\d+cm_)+' , string)

Python Regex: how to capture alternative groups with OR operator [duplicate]

Suppose I have the following regex that matches a string with a semicolon at the end:
\".+\";
It will match any string except an empty one, like the one below:
"";
I tried using this:
\".+?\";
But that didn't work.
My question is, how can I make the .+ part of the, optional, so the user doesn't have to put any characters in the string?
To make the .+ optional, you could do:
\"(?:.+)?\";
(?:..) is called a non-capturing group. It only does the matching operation and it won't capture anything. Adding ? after the non-capturing group makes the whole non-capturing group optional.
Alternatively, you could do:
\".*?\";
.* would match any character zero or more times greedily. Adding ? after the * forces the regex engine to do a shortest possible match.
As an alternative:
\".*\";
Try it here: https://regex101.com/r/hbA01X/1

NOT operator for regex

Using python script, I am cleaning a piece of text where I want to replace following words:
promocode, promo, code, coupon, coupon code, code.
However, I dont want to replace them if they start with a '#'. Thus, #promocode, #promo, #code, #coupon should remain the way they are.
I tried following regex for it:
1. \b(promocode|promo code|promo|coupon code|code|coupon)\b
2. (?<!#)(promocode|promo code|promo|coupon code|code|coupon)
None of them are working. I am basically looking something that will allow me to say "Does NOT start with # and" (promocode|promo code|promo|coupon code|code|coupon)
Any suggestions ?
You need to use a negative look-behind:
(?<!#)\b(?:promocode|promo code|promo|coupon code|code|coupon)\b
This (?<!#) will ensure you will only match these words if there is no # before them and \b will ensure you only match whole words. The non-capturing group (?:...) is used just for grouping purposes so as not to repeat \b around each alternative in the list (e.g. \bpromo\b|\bcode\b...). Why use non-capturing group? So that it does not interfere with the Match result. We do not need unnecessary overhead with digging out the values (=groups) we need.
See demo here
See IDEONE demo, only the first promo is deleted:
import re
p = re.compile(r'(?<!#)\b(?:promocode|promo code|promo|coupon code|code|coupon)\b')
test_str = "promo #promo "
print(p.sub('', test_str))
A couple of words about your regular expressions.
The \b(promocode|promo code|promo|coupon code|code|coupon)\b is good, but it also matches the words in the alternation group not preceded with #.
The (?<!#)(promocode|promo code|promo|coupon code|code|coupon) regex is better, but you still do not match whole words (see this demo).

Python Regular Expressions to match option of strings

I am new to Python and Python RE.
I am trying to make a parser for ARM assembly code. I want to make regular expression for matching conditional branch instructions which could be:
beq, bne, blt, bgt
I tried a regular expression of the form
'b[eq|ne|lt|gt]'
But this does not match. Can someone please help me with this?
You should be using parentheses for options, not square brackets:
b(eq|ne|lt|gt)
And you'd usually want a non-capture group:
b(?:eq|ne|lt|gt)
And you can also make it a little more optimised too:
b(?:eq|ne|[lg]t)
Square brackets will be understood as being any of the characters or range of characters. So [eq|ne|lt|gt] effectively means either one of e, q, |, n, e (again, so it becomes redundant), etc.
Try the following pattern: b(?:eq|ne|lt|gt)
[] Character set: Will only match any one character inside the brackets. You can specify a range of characters by using the metacharacter -, eg: [a-e] or even negate the expression by using the metacharacter ^, eg: [^aeiou]
() Capturing parentesis: Used for grouping part & for creating number capturing group, you can disable this feature by using the following char-set ?: within the capturing parentesis, eg(?:)
As mentioned above, you should be using the capturing parentesis for more than one character to be matched, so, that is why your pattern using brackets did not match your string.
Please note that using the non capturing parentesis was meant to no save any data being matched, however you can remove the metacharacters ?: in order to capture the group.
As python performs perl compatible regular expression engine, you are able to use named captured groups & numbered backreferences, the main advantage of using it, is to keep your expression easy to maintain, read, edit, etc.
Eg:
(?P<opcode>b(?:eq|ne|lt|gt)) - Will capture the match of your pattern b(?:eq|ne|lt|gt) into the backreference name opcode

Categories

Resources