I need to match two cases with one regular expression and do a replacement:
'long.file.name.jpg' -> 'long.file.name_suff.jpg'
'long.file.name_a.jpg' -> 'long.file.name_suff.jpg'
I'm trying to do the following
re.sub('(\_a)?\.[^\.]*$' , '_suff.',"long.file.name.jpg")
But this cuts off the extension '.jpg', and I'm getting
long.file.name_suff. instead of long.file.name_suff.jpg
I understand that this is because of the [^.]*$ part, but I can't leave it out, because
I have to find either the last occurrence of '_a' to replace or the last '.'.
Is there a way to replace only part of the match?
Put a capture group around the part that you want to preserve, and then include a reference to that capture group within your replacement text.
re.sub(r'(\_a)?\.([^\.]*)$' , r'_suff.\2',"long.file.name.jpg")
re.sub(r'(?:_a)?\.([^.]*)$', r'_suff.\1', "long.file.name.jpg")
(?: starts a non-capturing group (SO answer), so (?:_a) matches the _a without numbering it as a group; the question mark that follows makes it optional.
So in English, this says: match the ending .<anything>, whether or not it follows the pattern _a.
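As a quick check of the capture-group approach, here is a minimal sketch that runs the same pattern over both inputs from the question. The helper name add_suffix is just something made up for this illustration.

```python
import re

# Group 1 captures the extension; (?:_a)? is matched but not numbered.
EXT_PATTERN = re.compile(r'(?:_a)?\.([^.]*)$')

def add_suffix(filename):
    """Hypothetical helper: put '_suff' before the extension, dropping a trailing '_a'."""
    # \1 re-inserts the captured extension after the new suffix.
    return EXT_PATTERN.sub(r'_suff.\1', filename)

print(add_suffix("long.file.name.jpg"))    # long.file.name_suff.jpg
print(add_suffix("long.file.name_a.jpg"))  # long.file.name_suff.jpg
```

Because the pattern is anchored with $ and requires a dot, only the last extension is touched.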
Another way to do this would be to use a lookbehind (see here). Mentioning this because they're super useful, but I didn't know of them for 15 years of doing REs
Just put the expression for the extension into a group, capture it and reference the match in the replacement:
re.sub(r'(?:_a)?(\.[^\.]*)$' , r'_suff\1',"long.file.name.jpg")
Additionally, using the non-capturing group (?:…) keeps re from storing group information you don't need.
You can do it by excluding parts of the match from the replacement. In other words, you can tell the regex module: "match with this whole pattern, but replace only a piece of it".
re.sub(r'(?<=long.file.name)(\_a)?(?=\.([^\.]*)$)' , r'_suff',"long.file.name.jpg")
>>> 'long.file.name_suff.jpg'
The long.file.name and .jpg parts are used for matching (through the lookbehind and the lookahead), but they are excluded from the replacement.
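If the file name is not known in advance, the lookbehind can probably be dropped and only the lookahead kept. The following is a sketch under that assumption; add_suffix_lookahead is a made-up name, and count=1 is my own addition, because on recent Python versions a zero-width match directly after the replaced '_a' would otherwise be replaced as well.

```python
import re

# (?=\.[^.]*$) only checks that the final ".ext" follows; it consumes nothing,
# so the extension is never part of the text that gets replaced.
BEFORE_EXT = re.compile(r'(?:_a)?(?=\.[^.]*$)')

def add_suffix_lookahead(filename):
    """Hypothetical helper: insert '_suff' before the last extension."""
    # count=1 limits the substitution to the first (leftmost) match.
    return BEFORE_EXT.sub('_suff', filename, count=1)

print(add_suffix_lookahead("long.file.name.jpg"))
print(add_suffix_lookahead("long.file.name_a.jpg"))
```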
I wanted to use capture groups to replace a specific part of a string to help me parse it later. Consider the example below:
s= '<td> <address> 110 SOLANA ROAD, SUITE 102<br>PONTE VEDRA BEACH, FL32082 </address> </td>'
re.sub(r'(<address>\s.*?)(<br>)(.*?\<\/address>)', r'\1 -- \3', s)
##'<td> <address> 110 SOLANA ROAD, SUITE 102 -- PONTE VEDRA BEACH, FL32082 </address> </td>'
print(re.sub('name(_a)?','name_suff','long.file.name_a.jpg'))
# long.file.name_suff.jpg
print(re.sub('name(_a)?','name_suff','long.file.name.jpg'))
# long.file.name_suff.jpg
Here is a list of input strings:
"collect_project_stage1_20220927_foot60cm_arm70cm_height170cm_......",
"collect_project_version_1_0927_foot60cm_height170cm_......",
"collect_project_ver1_20220927_arm70cm_height170cm_......",
These input strings are provided by many different users.
The leading "collect_" is fixed. It is followed by "${project_version}", which has no hard naming rule, so the naming varies a lot between users.
After that comes a repeating "${part}${length}cm_.......", but the number of repetitions is not fixed.
I'd like to capture the variable ${project_version}.
I tried the following re.match to capture it.
re.match(r'collect_(.*)_(?:(?:foot|arm|height)\d+cm_)+.*' , string)
However, the result is not what I expected.
Can anyone give me a hint about what's wrong in my regular expression?
Assuming you were only planning to capture the part preceding the various cm suffixed components, the reason you're capturing so many of them instead of just checking and discarding them is that regexes are greedy by default.
You can narrow your capture group to only match what you really expect (e.g. just a name followed by a date), replacing (.*) with something like ((?:[a-z]+[0-9]*_)*\d{8}).
Alternatively, you can be lazy and enable non-greedy matching for the capture group, changing (.*) to (.*?) where the ? says to only take the minimal amount required to satisfy the regex. The latter is more brittle, but if you really can't impose any other restrictions on the expression for the capture group, it's what you've got.
Use a non-greedy quantifier. Otherwise, the capture group will match as far as it can, so it will keep going until the last match for (?:foot|arm|height)\d+cm_.
result = re.match(r'collect_(.*?)_(?:(?:foot|arm|height)\d+cm_)+' , string)
print(result.group(1)) # project_stage1_20220927
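To see the difference between the greedy and the non-greedy group side by side, here is a small sketch that runs both over the three sample strings from the question (kept exactly as posted, including the trailing dots). The variable names are only for this illustration.

```python
import re

samples = [
    "collect_project_stage1_20220927_foot60cm_arm70cm_height170cm_......",
    "collect_project_version_1_0927_foot60cm_height170cm_......",
    "collect_project_ver1_20220927_arm70cm_height170cm_......",
]

# One or more "<part><length>cm_" components, preceded by an underscore.
PARTS = r'_(?:(?:foot|arm|height)\d+cm_)+'

greedy = re.compile(r'collect_(.*)' + PARTS)   # takes as much as possible
lazy = re.compile(r'collect_(.*?)' + PARTS)    # takes as little as possible

greedy_results = [greedy.match(s).group(1) for s in samples]
lazy_results = [lazy.match(s).group(1) for s in samples]

for g, l in zip(greedy_results, lazy_results):
    print("greedy:", g)
    print("lazy:  ", l)
```

The greedy group swallows every cm component except the last one, while the lazy group stops at the first one.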
The regex "(.*)" will capture far too much.
re.match(r'collect_([a-z0-9]+_[a-z0-9]+_[a-z0-9]+)_(?:(?:foot|arm|height)\d+cm_)+' , string)
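One caveat with the stricter patterns: they only work if every user really follows the assumed shape. A quick sketch against the three sample strings suggests that both the three-field pattern above and the name-then-date pattern suggested earlier in this thread miss the second sample, which has four fields and a four-digit date. The capture helper is a made-up name for this check.

```python
import re

samples = [
    "collect_project_stage1_20220927_foot60cm_arm70cm_height170cm_......",
    "collect_project_version_1_0927_foot60cm_height170cm_......",
    "collect_project_ver1_20220927_arm70cm_height170cm_......",
]

PARTS = r'_(?:(?:foot|arm|height)\d+cm_)+'
three_fields = re.compile(r'collect_([a-z0-9]+_[a-z0-9]+_[a-z0-9]+)' + PARTS)
name_then_date = re.compile(r'collect_((?:[a-z]+[0-9]*_)*\d{8})' + PARTS)

def capture(pattern, s):
    """Hypothetical helper: return group 1, or None when the pattern does not match."""
    m = pattern.match(s)
    return m.group(1) if m else None

three_field_results = [capture(three_fields, s) for s in samples]
name_date_results = [capture(name_then_date, s) for s in samples]
print(three_field_results)
print(name_date_results)
```

If the naming really is free-form, the non-greedy group is probably the safer choice.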
I am trying the following regex: https://regex101.com/r/5dlRZV/1/ (I am aware that I am testing there with \author and not \maketitle).
In python, I try the following:
import re
# A raw triple-quoted string keeps the backslashes and the line breaks as typed.
text = r'''
\author{
\small
}
\maketitle
'''
regex = [re.compile(r'[\\]author*|[{]((?:[^{}]*|[{][^{}]*[}])*)[}]', re.M | re.S),
         re.compile(r'[\\]maketitle*|[{]((?:[^{}]*|[{][^{}]*[}])*)[}]', re.M | re.S)]
for p in regex:
    for m in p.finditer(text):
        print(m.group())
Python freezes. I suspect this has something to do with my pattern, and that the SRE engine fails on it.
EDIT: Is there something wrong with my regex? Can it be improved so that it actually works? I still get the same results on my machine.
EDIT 2: Can this be fixed so that the pattern supports an optional part followed by a (?: group or a (?= lookahead, so that one can capture both?
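I have not reproduced the freeze, so take this as a guess: a repeated group such as (?:[^{}]*|[{][^{}]*[}])* can match the same text in many overlapping ways (and can match the empty string), which is a common source of heavy backtracking. A possible rewrite, sketched below, "unrolls" the loop so each character can only be matched one way. Note also that author* means "autho" followed by any number of r characters, which is probably not intended; the sketch assumes plain \author and \maketitle are wanted, and it still handles only one level of nested braces.

```python
import re

text = r'''
\author{
\small
}
\maketitle
'''

# Either a command name, or a {...} block with at most one level of nesting.
# Unrolled form: plain text, then any number of ({inner} + plain text) pieces.
pattern = re.compile(
    r'\\(?:author|maketitle)'
    r'|\{([^{}]*(?:\{[^{}]*\}[^{}]*)*)\}'
)

matches = [m.group() for m in pattern.finditer(text)]
for item in matches:
    print(repr(item))
```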
After reading the heading, "Parentheses Create Numbered Capturing Groups", on this site: https://www.regular-expressions.info/brackets.html, I managed to find the answer which is:
Besides grouping part of a regular expression together, parentheses also create a
numbered capturing group. It stores the part of the string matched by the part of
the regular expression inside the parentheses.
The regex Set(Value)? matches Set or SetValue.
In the first case, the first (and only) capturing group remains empty.
In the second case, the first capturing group matches Value.
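For Python specifically, a small sketch of that Set(Value)? example: a group that did not take part in the match comes back as None rather than as an empty string, unless a default is given to groups().

```python
import re

pattern = re.compile(r'Set(Value)?')

m_short = pattern.fullmatch('Set')       # the optional group does not participate
m_long = pattern.fullmatch('SetValue')   # the optional group matches 'Value'

print(m_short.group(1))            # None
print(m_long.group(1))             # Value
print(m_short.groups(default=''))  # ('',)
```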
Using a Python script, I am cleaning a piece of text in which I want to replace the following words:
promocode, promo, code, coupon, coupon code, code.
However, I don't want to replace them if they start with a '#'. Thus, #promocode, #promo, #code, #coupon should remain the way they are.
I tried following regex for it:
1. \b(promocode|promo code|promo|coupon code|code|coupon)\b
2. (?<!#)(promocode|promo code|promo|coupon code|code|coupon)
Neither of them works. I am basically looking for something that will allow me to say "does NOT start with # and" (promocode|promo code|promo|coupon code|code|coupon)
Any suggestions ?
You need to use a negative look-behind:
(?<!#)\b(?:promocode|promo code|promo|coupon code|code|coupon)\b
This (?<!#) will ensure you will only match these words if there is no # before them and \b will ensure you only match whole words. The non-capturing group (?:...) is used just for grouping purposes so as not to repeat \b around each alternative in the list (e.g. \bpromo\b|\bcode\b...). Why use non-capturing group? So that it does not interfere with the Match result. We do not need unnecessary overhead with digging out the values (=groups) we need.
See IDEONE demo, only the first promo is deleted:
import re
p = re.compile(r'(?<!#)\b(?:promocode|promo code|promo|coupon code|code|coupon)\b')
test_str = "promo #promo "
print(p.sub('', test_str))
A couple of words about your regular expressions.
The \b(promocode|promo code|promo|coupon code|code|coupon)\b is good, but it also matches the words in the alternation group when they are preceded with #, because \b matches between # and a letter.
The (?<!#)(promocode|promo code|promo|coupon code|code|coupon) regex is better, but you still do not match whole words (see this demo).
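To make the difference between the three patterns concrete, here is a small sketch that applies each of them to one test string. The test string and the variable names are my own choices for illustration.

```python
import re

WORDS = r'promocode|promo code|promo|coupon code|code|coupon'

word_boundary_only = re.compile(r'\b(' + WORDS + r')\b')
lookbehind_only = re.compile(r'(?<!#)(' + WORDS + r')')
combined = re.compile(r'(?<!#)\b(?:' + WORDS + r')\b')

test_str = "promo #promocode code"

boundary_result = word_boundary_only.sub('', test_str)
lookbehind_result = lookbehind_only.sub('', test_str)
combined_result = combined.sub('', test_str)

print(repr(boundary_result))    # ' # '           -> the hashtag word is removed too
print(repr(lookbehind_result))  # ' #promo '      -> 'code' inside '#promocode' is removed
print(repr(combined_result))    # ' #promocode '  -> the hashtag is left alone
```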
I was trying to write a regex to identify names starting with
Mr.|Mrs.
for example
Mr. A, Mrs. B.
I tried several expressions. These regular expressions were checked on online tool at pythonregex.com. The test string used is:
"hey where is Mr A how are u Mrs. B tt`"
Outputs mentioned are of findall() function of Python, i.e.
regex.findall(string)
Their respective outputs with regex are below.
Regex: Mr.|Mrs. [a-zA-Z]+
Output: [u'Mr ', u'Mrs']
Why are A and B not appearing with Mr. and Mrs.?
Regex: [Mr.|Mrs.]+ [a-zA-Z]+
Output: [u's Mr', u'. B']
Why is s coming with Mr. instead of A?
I tried many more combinations, but these are the ones that confuse me. I know the regex for the name part has to cover more conditions, but I was starting from the basics.
Change your regex like below,
(?:Mr\.|Mrs\.) [a-zA-Z]+
You need to put Mr\. and Mrs\. inside a non-capturing or capturing group, so that the | (OR) applies only within the group instead of splitting the whole pattern.
You also need to escape the dot in your regex to match a literal dot; otherwise it would match any character. . is a special metacharacter in regex which matches any character except line breaks.
OR
Even shorter one,
Mrs?\. [a-zA-Z]+
The ? quantifier in the above makes the preceding character s optional.
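Here is a short sketch that runs the original pattern and the two corrected ones over the test string from the question. Note that the test string contains "Mr A" without a dot, so patterns that require the dot only find "Mrs. B". The last pattern, with an optional dot, is my own assumption about what may be wanted and is not part of the answer above.

```python
import re

test_str = "hey where is Mr A how are u Mrs. B tt`"

# Original attempt: the | splits the whole pattern and the dots match any character.
original = re.findall(r'Mr.|Mrs. [a-zA-Z]+', test_str)

# Corrected patterns from the answer: grouped alternation, escaped dots.
grouped = re.findall(r'(?:Mr\.|Mrs\.) [a-zA-Z]+', test_str)
shorter = re.findall(r'Mrs?\. [a-zA-Z]+', test_str)

# Assumption: also accept titles written without the dot.
optional_dot = re.findall(r'Mrs?\.? [a-zA-Z]+', test_str)

print(original)      # ['Mr ', 'Mrs']
print(grouped)       # ['Mrs. B']
print(shorter)       # ['Mrs. B']
print(optional_dot)  # ['Mr A', 'Mrs. B']
```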
There's a Python library for parsing human names:
https://github.com/derek73/python-nameparser
Much better than writing your own regex.