Stripping variable borders with python re - python

How does one replace a pattern when the substitution itself is a variable?
I have the following string:
s = '''[[merit|merited]] and [[eat|eaten]] and [[go]]'''
I would like to retain only the right-most word in the brackets ('merited', 'eaten', 'go'), stripping away what surrounds these words, thus producing:
merited and eaten and go
I have the regex:
p = '''\[\[[a-zA-Z]*\[|]*([a-zA-Z]*)\]\]'''
...which produces:
>>> re.findall(p, s)
['merited', 'eaten', 'go']
However, as this varies, I don't see a way to use re.sub() or s.replace().

s = '''[[merit|merited]] and [[eat|eaten]] and [[go]]'''
p = '''\[\[[a-zA-Z]*?[|]*([a-zA-Z]*)\]\]'''
re.sub(p, r'\1', s)
? so that for [[go]] first [a-zA-Z]* will match empty (shortest) string and second will get actual go string
\1 substitutes first (in this case the only) match group in a pattern for each non-overlapping match in the string s. r'\1' is used so that \1 is not interpreted as the character with code 0x1

well first you need to fix your regex to capture the whole group:
>>> s = '[[merit|merited]] and [[eat|eaten]] and [[go]]'
>>> p = '(\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\])'
>>> [('[[merit|merited]]', 'merited'), ('[[eat|eaten]]', 'eaten'), ('[[go]]', 'go')]
[('[[merit|merited]]', 'merited'), ('[[eat|eaten]]', 'eaten'), ('[[go]]', 'go')]
This matches the whole [[whateverisinhere]] and separates the whole match as group 1 and just the final word as group 2. You can than use \2 token to replace the whole match with just group 2:
>>> re.sub(p,r'\2',s)
'merited and eaten and go'
or change your pattern to:
p = '\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\]'
which gets rid of grouping the entire match as group 1 and only groups what you want. you can then do:
>>> re.sub(p,r'\1',s)
to have the same effect.
POST EDIT:
I forgot to mention that I actually changed your regex so here is the explanation:
\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\]
\[\[ \]\] #literal matches of brackets
(?: )* #non-capturing group that can match 0 or more of whats inside
[a-zA-Z]*\| #matches any word that is followed by a '|' character
( ... ) #captures into group one the final word
I feel like this is stronger than what you originally had because it will also change if there are more than 2 options:
>>> s = '[[merit|merited]] and [[ate|eat|eaten]] and [[go]]'
>>> p = '\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\]'
>>> re.sub(p,r'\1',s)
'merited and eaten and go'

Related

Replace a substring between two substrings

How can I replace a substring between page1/ and _type-A with 222.6 in the below-provided l string?
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
Expected result:
https://homepage.com/home/page1/222.6_type-A/go
I tried:
import re
re.sub('page1/.*?_type-A','',l, flags=re.DOTALL)
But it also removes page1/ and _type-A.
You may use re.sub like this:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub(r'(?<=page1/).*?(?=_type-A)', replace_with, l))
Output:
https://homepage.com/home/page1/222.6_type-A/go
RegEx Demo
RegEx Breakup:
(?<=page1/): Lookbehind to assert that we have page1/ at previous position
.*?: Match 0 or more of any string (lazy)
(?=_type-A): Lookahead to assert that we have _type-A at next position
You can use
import re
l = 'https://'+'homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub('(page1/).*?(_type-A)',fr'\g<1>{replace_with}\2',l, flags=re.DOTALL))
Output: https://homepage.com/home/page1/222.6_type-A/go
See the Python demo online
Note you used an empty string as the replacement argument. In the above snippet, the parts before and after .*? are captured and \g<1> refers to the first group value, and \2 refers to the second group value from the replacement pattern. The unambiguous backreference form (\g<X>) is used to avoid backreference issues since there is a digit right after the backreference.
Since the replacement pattern contains no backslashes, there is no need preprocessing (escaping) anything in it.
This works:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
pattern = r"(?<=page1/).*?(?=_type)"
replace_with = '222.6'
s = re.sub(pattern, replace_with, l)
print(s)
The pattern uses the positive lookahead and lookback assertions, ?<= and ?=. A match only occurs if a string is preceded and followed by the assertions in the pattern, but does not consume them. Meaning that re.sub looks for a string with page1/ in front and _type behind it, but only replaces the part in between.

Why doesn't re.compile(rf"(\b{re.escape('g(x)')}\b)+",re.I) find the string 'g(x)'?

I ran the following code in python 3.8
import re
a,b,c = 'g(x)g', '(x)g', 'g(x)'
a_re = re.compile(rf"(\b{re.escape(a)}\b)+",re.I)
b_re = re.compile(rf"(\b{re.escape(b)}\b)+",re.I)
c_re = re.compile(rf"(\b{re.escape(c)}\b)+",re.I)
a_re.findall('g(x)g')
b_re.findall('(x)g')
c_re.findall('g(x)')
c_re.findall(' g(x) ')
The result I want is below.
['g(x)g']
['(x)g']
['g(x)']
['g(x)']
But the actual result is below.
['g(x)g']
[]
[]
[]
The following conditions must be observed:
A combination of variables and f-string should be used.
\b must not be removed.
Because I want to know if there are certain characters in the sentence.
How can I get the results I want?
Regular characters have no problem using \b, but it won't work for words that start with '(' or end with ')'.
I was wondering if there is an alternative to \b that can be used in these words.
I must use the same function as \b because I want to make sure that the sentence contains a specific word.
\b is the boundary between \w and \W characters (Docs). That is why your first one gives the result (since it starts and ends with characters) but none of the others.
To get the expected result, your patterns should look like these:
a_re = re.compile(rf"(\b{re.escape(a)}\b)+",re.I) # No change
b_re = re.compile(rf"({re.escape(b)}\b)+",re.I) # No '\b' in the beginning
c_re = re.compile(rf"(\b{re.escape(c)})+",re.I) # No '\b' in the end
You can write your own \b by finding start, end, or separator and not capturing it
(^|[ .\"\']) start or boundary
($|[ .\"\']) end or boundary
(?:) non-capture group
>>> a_re = re.compile(rf"(?:^|[ .\"\'])({re.escape(a)})(?:$|[ .\"\'])", re.I)
>>> b_re = re.compile(rf"(?:^|[ .\"\'])({re.escape(b)})(?:$|[ .\"\'])", re.I)
>>> c_re = re.compile(rf"(?:^|[ .\"\'])({re.escape(c)})(?:$|[ .\"\'])", re.I)
>>> a_re.findall('g(x)g')
['g(x)g']
>>> b_re.findall('(x)g')
['(x)g']
>>> c_re.findall('g(x)')
['g(x)']
>>> c_re.findall(' g(x) ')
['g(x)']

Python: Non capturing group is not working in Regex

I'm using non-capturing group in regex i.e., (?:.*) but it's not working.
I can still able to see it in the result. How to ignore it/not capture in the result?
Code:
import re
text = '12:37:25.790 08/05/20 Something P LR 0.156462 sccm Pt 25.341343 psig something-else'
pattern = ['(?P<time>\d\d:\d\d:\d\d.\d\d\d)\s{1}',
'(?P<date>\d\d/\d\d/\d\d)\s',
'(?P<pr>(?:.*)Pt\s{3}\d*[.]?\d*\s[a-z]+)'
]
result = re.search(r''.join(pattern), text)
Output:
>>> result.group('pr')
'Something P LR 0.156462 sccm Pt 25.341343 psig'
Expected output:
'Pt 25.341343 psig'
More info:
>>> result.groups()
('12:37:25.790', '08/05/20', 'Something P LR 0.156462 sccm Pt 25.341343 psig')
The quantifier is inside the named group, you have to place it outside and possibly make it non greedy to.
The updated pattern could look like:
(?P<time>\d\d:\d\d:\d\d.\d\d\d)\s{1}(?P<date>\d\d/\d\d/\d\d)\s.*?(?P<pr>Pt\s{3}\d*[.]?\d*\s[a-z]+)
Note that with he current pattern, the number is optional as all the quantifiers are optional. You can omit {1} as well.
If the number after Pt can not be empty, you can update the pattern using \d+(?:\.\d+)? matching at least a single digit:
(?P<time>\d\d:\d\d:\d\d.\d{3})\s(?P<date>\d\d/\d\d/\d\d)\s.*?(?P<pr>Pt\s{3}\d+(?:\.\d+)?\s[a-z]+)
(?P<time> Group time
\d\d:\d\d:\d\d.\d{3} Match a time like format
)\s Close group and match a whitespace char
(?P<date> Group date
\d\d/\d\d/\d\d Match a date like pattern
)\s Close group and match a whitespace char
.*? Match any char except a newline, as least as possible
(?P<pr> Group pr
Pt\s{3} Match Pt and 3 whitespace chars
\d+(?:\.\d+)? Match 1+ digits with an optional decimal part
\s[a-z]+ Match a whitespace char an 1+ times a char a-z
) Close group
Regex demo
Remove the non-capturing group from your named group. Using a non-capturing group means that no new group will be created in the match, not that that part of the string will be removed from any including group.
import re
text = 'Something P LR 0.156462 sccm Pt 25.341343 psig something-else'
pattern = r'(?:.*)(?P<pr>Pt\s{3}\d*[.]?\d*\s[a-z]+)'
result = re.search(pattern, text)
print(result.group('pr'))
Output:
Pt 25.341343 psig
Note that the specific non-capturing group you used can be excluded completely, as it basically means that you want your regex to be preceded by anything, and that's what search will do anyway.
I think there is a confusion regarding the meaning of "non-capturing" here: It does not mean that the result omits this part, but that no match group is created in the result.
Example where the same regex is executed with a capturing, and a non-capturing group:
>>> import re
>>> match = re.search(r'(?P<grp>foo(.*))', 'foobar')
>>> match.groups()
('foobar', 'bar')
>>> match = re.search(r'(?P<grp>foo(?:.*))', 'foobar')
>>> match.groups()
('foobar',)
Note that match.group(0) is the same in both cases (group 0 contains the matching part of the string in full).

Regex : matching integers inside of brackets

I am trying to take off bracketed ends of strings such as version = 10.9.8[35]. I am trying to substitute the integer within brackets pattern
(so all of [35], including brackets) with an empty string using the regex [\[+0-9*\]+] but this also matches with numbers not surrounded by brackets. Am I not using the + quantifier properly?
You could match the format of the number and then match one or more digits between square brackets.
In the replacement using the first capturing group r'\1'
\b([0-9]+(?:\.[0-9]+)+)\[[0-9]+\]
\b Word boundary
( Capture group 1
[0-9]+ Match 1+ digits
(?:\.[0-9]+)+ Match a . and 1+ digits and repeat that 1 or more times
) Close group
\[[0-9]+\] Match 1+ digits between square brackets
Regex demo
For example
import re
regex = r"\b([0-9]+(?:\.[0-9]+)+)\[[0-9]+\]"
test_str = "version = 10.9.8[35]"
result = re.sub(regex, r'\1', test_str)
print (result)
Output
version = 10.9.8
No need for regex
s = '10.9.8[35]'
t = s[:s.rfind("[")]
print(t)
But if you insist ;-)
import re
s = '10.9.8[35]'
t = re.sub(r"^(.*?)[[]\d+[]]$", r"\1", s)
print(t)
Breakdown of regex:
^ - begins with
() - Capture Group 1 you want to keep
.*? - Any number of chars (non-greedy)
[[] - an opening [
\d+ 1+ digit
[]] - closing ]
$ - ends with
\1 - capture group 1 - used in replace part of regex replace. The bit you want to keep.
Output in both cases:
10.9.8
Use regex101.com to familiarise yourself more. If you click on any of the regex samples at bottom right of the website, it will give you more info. You can also use it to generate regex code in a variety of languages too. (not good for Java though!).
There's also a great series of Python regex videos on Youtube by PyMoondra.
A simpler regex solution:
import re
pattern = re.compile(r'\[\d+\]$')
s = '10.9.8[35]'
r = pattern.sub('', s)
print(r) # 10.9.8
The pattern matches square brackets at the end of a string with one or more number inside. The sub then replaces the square brackets and number with an empty string.
If you wanted to use the number in the square brackets just change the sub expression such as:
import re
pattern = re.compile(r'\[(\d+)\]$')
s = '10.9.8[35]'
r = pattern.sub(r'.\1', s)
print(r) # 10.9.8.35
Alternatively as said by the other answer you can just find it and splice to get rid of it.

Match first parenthesis with Python

From a string such as
70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30
I want to get the first parenthesized content linux;u;android4.2.1;zh-cn.
My code looks like this:
s=r'70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30'
re.search("(\d+)\s.+\((\S+)\)", s).group(2)
but the result is the last brackets' contents khtml,likegecko.
How to solve this?
The main issue you have is the greedy dot matching .+ pattern. It grabs the whole string you have, and then backtracks, yielding one character from the right at a time, trying to accommodate for the subsequent patterns. Thus, it matches the last parentheses.
You can use
^(\d+)\s[^(]+\(([^()]+)\)
See the regex demo. Here, the [^(]+ restricts the matching to the characters other than ( (so, it cannot grab the whole line up to the end) and get to the first pair of parentheses.
Pattern expalantion:
^ - string start (NOTE: If the number appears not at the start of the string, remove this ^ anchor)
(\d+) - Group 1: 1 or more digits
\s - a whitespace (if it is not a required character, it can be removed since the subsequent negated character class will match the space)
[^(]+ - 1+ characters other than (
\( - a literal (
([^()]+) - Group 2 matching 1+ characters other than ( and )
\)- closing ).
Debuggex Demo
Here is the IDEONE demo:
import re
p = re.compile(r'^(\d+)\s[^(]+\(([^()]+)\)')
test_str = "70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30"
print(p.findall(test_str))
# or using re.search if the number is not at the beginning of the string
m = re.search(r'(\d+)\s[^(]+\(([^()]+)\)', test_str)
if m:
print("Number: {0}\nString: {1}".format(m.group(1), m.group(2)))
# [('70849', 'linux;u;android4.2.1;zh-cn')]
# Number: 70849
# String: linux;u;android4.2.1;zh-cn
You can use a negated class \(([^)]*)\) to match anything between ( and ):
>>> s=r'70849 mozilla/5.0(linux;u;android4.2.1;zh-cn)applewebkit/534.30(khtml,likegecko)version/4.0mobilesafari/534.30'
>>> m = re.search(r"(\d+)[^(]*\(([^)]*)\)", s)
>>> print m.group(1)
70849
>>> print m.group(2)
linux;u;android4.2.1;zh-cn

Categories

Resources