python regex, capturing a pattern with trimming repeated subpattern in string

python regex, capturing a pattern with trimming repeated subpattern in string - python

Here is a list of input strings:
"collect_project_stage1_20220927_foot60cm_arm70cm_height170cm_......",
"collect_project_version_1_0927_foot60cm_height170cm_......",
"collect_project_ver1_20220927_arm70cm_height170cm_......",
These input strings are provided by many different users.
Leading "collect_" is fixed, and then follows "${project_version}" which doesn't have hard rule to set this variable, the naming will be very different by different users.
Then, there will be repeating "${part}${length}cm_.......", but the number of repeatence is not fixed.
I'd like to capture the the variable ${project_version}.
Then, I try using the following re.match to capture it.
re.match(r'collect_(.*)_(?:(?:foot|arm|height)\d+cm_)+.*' , string)
However, the result is not as expected.
Is there anyone give me a hint that what's wrong in my regular expression?

Assuming you were only planning to capture the part preceding the various cm suffixed components, the reason you're capturing so many of them instead of just checking and discarding them is that regexes are greedy by default.
You can narrow your capture group to only match what you really expect (e.g. just a name followed by a date), replacing (.*) with something like ((?:[a-z]+[0-9]*_)*\d{8}).
Alternatively, you can be lazy and enable non-greedy matching for the capture group, changing (.*) to (.*?) where the ? says to only take the minimal amount required to satisfy the regex. The latter is more brittle, but if you really can't impose any other restrictions on the expression for the capture group, it's what you've got.

Use a non-greedy quantifier. Otherwise, the capture group will match as far as it can, so it will keep going until the last match for (?:foot|arm|height)\d+cm_).
result = re.match(r'collect_(.*?)_(?:(?:foot|arm|height)\d+cm_)+' , string)
print(result.group(1)) # project_stage1_20220927

The regex "(.*)" will capture far too much.
re.match(r'collect_([a-z0-9]+_[a-z0-9]+_[a-z0-9]+)_(?:(?:foot|arm|height)\d+cm_)+' , string)

Related

How to empty or optional match group in Python RegEx= [duplicate]

This is an example string:
123456#p654321
Currently, I am using this match to capture 123456 and 654321 in to two different groups:
([0-9].*)#p([0-9].*)
But on occasions, the #p654321 part of the string will not be there, so I will only want to capture the first group. I tried to make the second group "optional" by appending ? to it, which works, but only as long as there is a #p at the end of the remaining string.
What would be the best way to solve this problem?

You have the #p outside of the capturing group, which makes it a required piece of the result. You are also using the dot character (.) improperly. Dot (in most reg-ex variants) will match any character. Change it to:
([0-9]*)(?:#p([0-9]*))?
The (?:) syntax is how you get a non-capturing group. We then capture just the digits that you're interested in. Finally, we make the whole thing optional.
Also, most reg-ex variants have a \d character class for digits. So you could simplify even further:
(\d*)(?:#p(\d*))?
As another person has pointed out, the * operator could potentially match zero digits. To prevent this, use the + operator instead:
(\d+)(?:#p(\d+))?

Your regex will actually match no digits, because you've used * instead of +.
This is what (I think) you want:
(\d+)(?:#p(\d+))?

Regex Statement to only match parts of a string for comparison - Python

What I am trying to do is match values from one file to another, but I only need to match the first portion of the string and the last portion.
I am reading each file into a list, and manipulating these based on different Regex patterns I have created. Everything works, except when it comes to these type of values:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
In this example, I only want to match 'V-1\ZDS\R\EMBO-20' and then compare the '24' value at the end of the string. The number x in '20-x:', can vary and doesn't matter in terms of comparisons, as long as the first and last parts of this string match.
This is the Regex I am using:
re.compile(r"(?:.*V-1\\ZDS\\R\\EMBO-20-\d.*)(:\d*\w.*)")
Once I filter down the list, I use the following function to return the difference between the two sets:
funcDiff = lambda x, y: list((set(x)- set(y))) + list((set(y)- set(x)))
Is there a way to take the list of differences and filter out the ones that have matching values after the
:
as mentioned above?
I apologize is this is an obvious answer, I'm new to Python and Regex!
The output I get is the differences between the entire strings, so even if the first and last part of the string match, if the number following the 'EMBO-20-x' doesn't also match, it returns it as being different.

Before discussing your question, regex101 is an incredibly useful tool for this type of thing.
Your issue stems from two issues:
1.) The way you used .*
2.) Greedy vs. Nongreedy matches
.* kinda sucks
.* is a regex expression that is very rarely what you actually want.
As a quick aside, a useful regex expression is [^c]* or [^c]+. These expressions match any character except the letter c, with the first expression matching 0 or more, and the second matched 1 or more.
.* will match all characters as many times as it can. Instead, try to start your regex patterns with more concrete starting points. Two good ways to do this are lookbehind expressions and anchors.
Another quick aside, it's likely that you are misusing regex.match and regex.find. match will only return a match that begins at the start of the string, while find will return matches anywhere in the input string. This could be the reason you included the .* in the first place, to allow a .match call to return a match deeper in the string.
Lookbehind Expressions
There are more complete explanations online, but in short, regex patterns like:
(?<=test)foo
will match the text foo, but only if test is right in front of it. To be more clear, the following strings will not match that regex:
foo
test-foo
test foo
but the following string will match:
testfoo
This will only match the text foo, though.
Anchors
Another option is anchors. ^ and $ are special characters, matching the start and end of a line of text. If you know your regex pattern will match exactly one line of text, start it with ^ and end it with $.
Leading patterns with .* and ending with .* are likely the source of your issue. Although you did not include full examples of your input or your code, you likely used match as opposed to find.
In regex, . matches any character, and * means 0 or more times. This means that for any input, your pattern will match the entire string.
Greedy vs. Non-Greedy qualifiers
The second issue is related to greediness. When your regex patterns have a * in them, they can match 0 or more characters. This can hide problems, as entire * expressions can be skipped. Your regex is likely matched several lines of text as one match, and hiding multiple records in a single .*.
The Actual Answer
Taking all of this in to consideration, let's assume that your input data looks like this:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
V-1\ZDS\R\EMBO-20-3:93
V-1\ZDS\R\EMBO-20-6:22309
V-1\ZDS\R\EMBO-20-8:2238
V-1\ZDS\R\EMBO-20-3:28
A better regular expression would be:
^V-1\\ZDS\\R\\EMBO-20-\d:(\d+)$
To visualize this regex in action, follow this link.
There are several differences I would like to highlight:
Starting the expression with ^ and ending with $. This forces the regex to match exactly one line. Even though the pattern works without these characters, it's good practice when working with regex to be as explicit as possible.
No useless non-capturing group. Your example had a (?:) group at the start. This denotes a group that does not capture it's match. It's useful if you want to match a subpattern multiple times ((?:ab){5} matches ababababab without capturing anything). However, in your example, it did nothing :)
Only capturing the number. This makes it easier to extract the value of the capture groups.
No use of *, one use of +. + works like *, but it matches 1 or more. This is often more correct, as it prevents 'skipping' entire characters.

Prevent last duplicate character from string [duplicate]

My regex pattern looks something like
<xxxx location="file path/level1/level2" xxxx some="xxx">
I am only interested in the part in quotes assigned to location. Shouldn't it be as easy as below without the greedy switch?
/.*location="(.*)".*/
Does not seem to work.

You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will match all of "file path/level1/level2" xxx some="xxx".
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).

location="(.*)" will match from the " after location= until the " after some="xxx unless you make it non-greedy.
So you either need .*? (i.e. make it non-greedy by adding ?) or better replace .* with [^"]*.
[^"] Matches any character except for a " <quotation-mark>
More generic: [^abc] - Matches any character except for an a, b or c

How about
.*location="([^"]*)".*
This avoids the unlimited search with .* and will match exactly to the first quote.

Use non-greedy matching, if your engine supports it. Add the ? inside the capture.
/location="(.*?)"/

Use of Lazy quantifiers ? with no global flag is the answer.
Eg,
If you had global flag /g then, it would have matched all the lowest length matches as below.

Here's another way.
Here's the one you want. This is lazy [\s\S]*?
The first item:
[\s\S]*?(?:location="[^"]*")[\s\S]* Replace with: $1
Explaination: https://regex101.com/r/ZcqcUm/2
For completeness, this gets the last one. This is greedy [\s\S]*
The last item:[\s\S]*(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explaination: https://regex101.com/r/LXSPDp/3
There's only 1 difference between these two regular expressions and that is the ?

The other answers here fail to spell out a full solution for regex versions which don't support non-greedy matching. The greedy quantifiers (.*?, .+? etc) are a Perl 5 extension which isn't supported in traditional regular expressions.
If your stopping condition is a single character, the solution is easy; instead of
a(.*?)b
you can match
a[^ab]*b
i.e specify a character class which excludes the starting and ending delimiiters.
In the more general case, you can painstakingly construct an expression like
start(|[^e]|e(|[^n]|n(|[^d])))end
to capture a match between start and the first occurrence of end. Notice how the subexpression with nested parentheses spells out a number of alternatives which between them allow e only if it isn't followed by nd and so forth, and also take care to cover the empty string as one alternative which doesn't match whatever is disallowed at that particular point.
Of course, the correct approach in most cases is to use a proper parser for the format you are trying to parse, but sometimes, maybe one isn't available, or maybe the specialized tool you are using is insisting on a regular expression and nothing else.

Because you are using quantified subpattern and as descried in Perl Doc,
By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?" . Note that the meanings don't change, just the
"greediness":
*? //Match 0 or more times, not greedily (minimum matches)
+? //Match 1 or more times, not greedily
Thus, to allow your quantified pattern to make minimum match, follow it by ? :
/location="(.*?)"/

import regex
text = 'ask her to call Mary back when she comes back'
p = r'(?i)(?s)call(.*?)back'
for match in regex.finditer(p, str(text)):
print (match.group(1))
Output:
Mary

Understanding regex pattern used to find string between strings in html

I have the following html file:
<!-- <div class="_5ay5"><table class="uiGrid _51mz" cellspacing="0" cellpadding="0"><tbody><tr class="_51mx"><td class="_51m-"><div class="_u3y"><div class="_5asl"><a class="_47hq _5asm" href="/Dev/videos/1610110089242029/" aria-label="Who said it?" ajaxify="/Dev/videos/1610110089242029/" rel="theater">
In order to pull the string of numbers between videos/ and /", I'm using the following method that I found:
import re
Source_file = open('source.html').read()
result = re.compile('videos/(.*?)/"').search(Source_file)
print result
I've tried Googling an explanation for exactly how the (.*?) works in this particular implementation, but I'm still unclear. Could someone explain this to me? Is this what's known as a "non-greedy" match? If yes, what does that mean?

The ? in this context is a special operator on the repetition operators (+, *, and ?). In engines where it is available this causes the repetition to be lazy or non-greedy or reluctant or other such terms. Typically repetition is greedy which means that it should match as much as possible. So you have three types of repetition in most modern perl-compatible engines:
.* # Match any character zero or more times
.*? # Match any character zero or more times until the next match (reluctant)
.*+ # Match any character zero or more times and don't stop matching! (possessive)
More information can be found here: http://www.regular-expressions.info/repeat.html#lazy for reluctant/lazy and here: http://www.regular-expressions.info/possessive.html for possessive (which I'll skip discussing in this answer).
Suppose we have the string aaaa. We can match all of the a's with /(a+)a/. Literally this is
match one or more a's followed by an a.
This will match aaaa. The regex is greedy and will match as many a's as possible. The first submatch is aaa.
If we use the regex /(a+?)a this is
reluctantly match one or more as followed by an a
or
match one or more as until we reach another a
That is, only match what we need. So in this case the match is aa and the first submatch is a. We only need to match one a to satisfy the repetition and then it is followed by an a.
This comes up a lot when using regex to match within html tags, quotes and the suchlike -- usually reserved for quick and dirty operations. That is to say using regex to extract from very large and complex html strings or quoted strings with escape sequence can cause a lot of problems but it's perfectly fine for specific use cases. So in your case we have:
/Dev/videos/1610110089242029/
The expression needs to match videos/ followed by zero or more characters followed by /". If there is only one videos URL there that's just fine without being reluctant.
However we have
/videos/1610110089242029/" ... ajaxify="/Dev/videos/1610110089242029/"
Without reluctance, the regex will match:
1610110089242029/" ... ajaxify="/Dev/videos/1610110089242029
It tries to match as much as possible and / and " satisfy . just fine. With reluctance, the matching stops at the first /" (actually it backtracks but you can read about that separately). Thus you only get the part of the url you need.

It can be explained in a simple way:
.: match anything (any character),
*: any number of times (at least zero times),
?: as few times as possible (hence non-greedy).
videos/(.*?)/"
as a regular expression matches (for example)
videos/1610110089242029/"
and the first capturing group returns 1610110089242029, because any of the digits is part of “any character” and there are at least zero characters in it.
The ? causes something like this:
videos/1610110089242029/" something else … "videos/2387423470237509/"
to properly match as 1610110089242029 and 2387423470237509 instead of as 1610110089242029/" something else … "videos/2387423470237509, hence “as few times as possible”, hence “non-greedy”.

The . means any character. The * means any number of times, including zero. The ? does indeed mean non-greedy; that means that it will try to capture as few characters as possible, i.e., if the regex encounters a /, it could match it with the ., but it would rather not because the . is non-greedy, and since the next character in the regex is happy to match /, the . doesn't have to. If you didn't have the ?, that . would eat up the whole rest of the file because it would be chomping at the bit to match as many things as possible, and since it matches everything, it would go on forever.

Regular expression pattern questions?

I am having a hard time understanding regular expression pattern. Could someone help me regular expression pattern to match all words ending in s. And start with a and end with a (like ana).
How do I write ending?

Word boundaries are given by \b so the following regex matches words ending with ing or s: "\b(\w+?(?:ing|s))\b" where as \b is a word boundary, \w+ is one or more "word character" and (?:ing|s) is an uncaptured group of either ing or s.
As you asked "how to develop a regex":
First: Don't use regex for complex tasks. They are hard to read, write and maintain. For example there is a regex that validates email addresses - but its computer generated and nothing you should use in practice.
Start simple and add edge cases. At the beginning plan what characters you need to use: You said you need words ending with s or ing. So you probably need something to represent a word, endings of words and the literal characters s and ing. What is a word? This might change from case to case, but at least every alphabetical character. Looking up in the python documentation on regexes you can find \w which is [a-zA-Z0-9_], which fits my impression of a word character. There you can also find \b which is a word boundary.
So the "first pseudo code try" is something like \b\w...\w\b which matches a word. We still need to "formalize" ... which we want to have the meaning of "one ore more characters", which directly translates to \b\w+\b. We can now match a word! We still need the s or ing. | translates to or, so how is the following: \b\w+ing|s\b? If you test this, you'll see that it will match confusing things like ingest which should not match our regex. What is happening? As you probably already saw the | can't know "which part it should or", so we need to introduce parenthesis: \b\w+(ing|s)\b. Congratulations, you have now arrived at a working regex!
Why (and how) does this differ from the example I gave first? First I wrote \w+? instead of \w+, the ? turns the + into a non-greedy version. If you know what the difference between greedy and non greedy is, skip this paragraph. Consider the following: AaAAbA and we want to match the things enclosed with big letter A. A naive try: A\w+A, so one or more word characters enclosed with A. This matches AaA, but also AaAAbA, A is still something that can be matched by \w. Without further config the *+? quantifier all try to match as much as possible. Sometimes, like in the A example, you don't want that, you can then use a ? after the quantifier to signal you want a non-greedy version, a version that matches as little as possible.
But in our case this isn't needed, the words are well seperated by whitespaces, which are not part of \w. So in fact you can just let + be greedy and everything will be alright. If you use . (any character) you often need to be careful not to match to much.
The other difference is using (?:s|ing) instead of (s|ing). What does the ?: do here? It changes a capturing group to a non capturing group. Generally you don't want to get "everything" from the regex. Consider the following regex: I want to go to \w+. You are not interested in the whole sentence, but only in the \w+, so you can capture it in a group: I want to go to (\w+). This means that you are interested in this specific piece of information and want to retrieve it later. Sometimes (like when using |) you need to group expressions together, but are not interested in their content, you can then declare it as non capturing. Otherwise you will get the group (s or ing) but not the actual word!
So to summarize:
* start small
* add one case after another
* always test with examples
In fact I just tried re.findall(\b\w+(?:ing|s)\b, "fishing words") and it didn't work. \w+(?:ing|s) works. I've no idea why, maybe someone else can explain that. Regex are an arcane thing, only use them for easy and easy to test tasks.

Generally speaking I'd use \b to match "word boundaries" with \w which matches word components (short cut for [A-Za-z0-9_]). Then you can do an or grouping to match "s" or "ing". Result is:
/\b\w+(s|ing)\b/

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.