Here's the scenario:
import re
if __name__ == '__main__':
    s = "s = \"456\";"
    ss = re.sub(r'(.*s\s+=\s+").*?(".*)', r"\1123\2", s)
    print(ss)
What I intend to do is replace 456 with 123, but the result is 'J3";'. When I print '\112', it turns out to be the character 'J'. Is there a way to specify that \1 refers to a regex group rather than an escape sequence? Thanks in advance.
Just change \1 to \g<1>
>>> re.sub(r'(.*s\s+=\s+").*?(".*)', r"\g<1>123\2", s)
's = "123";'
If no digit follows the backreference, you can use \1 or \2 as is. But if a digit comes right after it, as in \1123, the digits are read together with the backreference (here \112 is parsed as an octal escape, which is the character 'J'), so you get an unexpected result. To separate the backreference from the digits that follow, use \g<num>, where num is the index of the capturing group.
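The difference between the two replacement forms can be checked side by side. This is a minimal sketch that reuses the string and pattern from the question; the variable names ambiguous and explicit are only for illustration.

```python
import re

s = 's = "456";'
pattern = r'(.*s\s+=\s+").*?(".*)'

# "\1123" is read as the octal escape \112 (the character 'J') followed by "3"
ambiguous = re.sub(pattern, r"\1123\2", s)

# "\g<1>123" is read as group 1 followed by the literal digits 123
explicit = re.sub(pattern, r"\g<1>123\2", s)

print(repr(ambiguous))  # 'J3";'
print(repr(explicit))   # 's = "123";'
```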
How can I replace a substring between page1/ and _type-A with 222.6 in the below-provided l string?
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
Expected result:
https://homepage.com/home/page1/222.6_type-A/go
I tried:
import re
re.sub('page1/.*?_type-A','',l, flags=re.DOTALL)
But it also removes page1/ and _type-A.
You may use re.sub like this:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub(r'(?<=page1/).*?(?=_type-A)', replace_with, l))
Output:
https://homepage.com/home/page1/222.6_type-A/go
RegEx Breakup:
(?<=page1/): Lookbehind to assert that we have page1/ at previous position
.*?: Match 0 or more of any character, as few as possible (lazy)
(?=_type-A): Lookahead to assert that we have _type-A at next position
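To see what the lookarounds leave out of the match, one can inspect the match object directly. This is a small sketch using the string from the question; the names before and after are introduced here only for illustration.

```python
import re

l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
m = re.search(r'(?<=page1/).*?(?=_type-A)', l)

# Only the text between the two assertions is part of the match
print(repr(m.group(0)))  # '222.6 a'

# The asserted text stays outside the match, so re.sub leaves it untouched
before = l[:m.start()]
after = l[m.end():]
print(before.endswith('page1/'), after.startswith('_type-A'))
```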
You can use
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = '222.6'
print (re.sub('(page1/).*?(_type-A)',fr'\g<1>{replace_with}\2',l, flags=re.DOTALL))
Output: https://homepage.com/home/page1/222.6_type-A/go
Note that you used an empty string as the replacement argument. In the snippet above, the parts before and after .*? are captured; in the replacement pattern, \g<1> refers to the first group value and \2 to the second. The unambiguous backreference form (\g<X>) is needed because a digit comes right after the backreference.
Since replace_with contains no backslashes, there is no need to preprocess (escape) anything in it.
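If the replacement value could contain backslashes, two common ways to keep it literal are sketched below. The value of replace_with here is hypothetical and chosen only to include a backslash; the names via_function and via_template are illustrative.

```python
import re

l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
replace_with = r'v\1.0'  # hypothetical value that contains a backslash
pattern = r'(page1/).*?(_type-A)'

# Option 1: pass a function; its return value is inserted as-is,
# so backslashes in replace_with are not treated as escapes
via_function = re.sub(
    pattern,
    lambda m: m.group(1) + replace_with + m.group(2),
    l,
    flags=re.DOTALL,
)

# Option 2: double the backslashes before building the replacement template
escaped = replace_with.replace('\\', '\\\\')
via_template = re.sub(pattern, r'\g<1>' + escaped + r'\2', l, flags=re.DOTALL)

print(via_function)
print(via_template)
```

Both calls should produce the same string, with the backslash preserved.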
This works:
import re
l = 'https://homepage.com/home/page1/222.6 a_type-A/go'
pattern = r"(?<=page1/).*?(?=_type)"
replace_with = '222.6'
s = re.sub(pattern, replace_with, l)
print(s)
The pattern uses positive lookbehind and lookahead assertions, (?<=...) and (?=...). A match only occurs if the text is preceded and followed by what the assertions specify, but the assertions themselves are not consumed. This means re.sub looks for text with page1/ before it and _type after it, but only replaces the part in between.
I need to pad parts of a string value with extra zeroes where needed.
T-46-5-В,Г,6-В,Г ---> T-46-005-В,Г,006-В,Г or
T-46-55-В,Г,56-В,Г ---> T-46-055-В,Г,056-В,Г, for example.
I have a regex pattern ^\D-\d{1,2}-([\d,]+)-[а-яА-я,]+,([\d,]+)-[а-яА-я,]+$ that retrieves the 2 separate groups of the string that I must change. The problem is that I can't substitute the changed values back into exactly those groups if the text returned by re.search().group() occurs elsewhere in the string.
import re
my_string = "T-46-5-В,Г,6-В,Г"
my_pattern = r"^\D-\d{1,2}-([\d,]+)-[а-яА-я,]+,([\d,]+)-[а-яА-я,]+$"
new_string_parts = ["005", "006"]
new_string = re.sub(re.search(my_pattern, my_string).group(1), new_string_parts[0], my_string)
new_string = re.sub(re.search(my_pattern, my_string).group(2), new_string_parts[1], new_string)
print(new_string)
I get T-4006-005-В,Г,006-В,Г instead of T-46-005-В,Г,006-В,Г because there is another "6" in my_string. How can I solve this?
Thanks for your answers!
Capture the parts you need to keep and use a single re.sub pass with unambiguous backreferences in the replacement part (because they are mixed with numeric string variables):
import re
my_string = "T-46-5-В,Г,6-В,Г"
my_pattern = r"^(\D-\d{1,2}-)[\d,]+(-[а-яёА-ЯЁ,]+,)[\d,]+(-[а-яёА-ЯЁ,]+)$"
new_string_parts = ["005", "006"]
new_string = re.sub(my_pattern, fr"\g<1>{new_string_parts[0]}\g<2>{new_string_parts[1]}\3", my_string)
print(new_string)
# => T-46-005-В,Г,006-В,Г
Note that I also added ёЁ to the Russian letter ranges.
The pattern - ^(\D-\d{1,2}-)[\d,]+(-[а-яёА-ЯЁ,]+,)[\d,]+(-[а-яёА-ЯЁ,]+)$ - now contains parentheses around the parts you do not need to change, and \g<1> refers to the string captured with (\D-\d{1,2}-), \g<2> refers to the value captured with (-[а-яёА-ЯЁ,]+,) and \3 - to (-[а-яёА-ЯЁ,]+).
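If the padded values should be computed rather than hard-coded, a function replacement is one option. The sketch below assumes a width of 3 and that each numeric part is a comma-separated list of digit runs; the helper names pad_numbers, _pad and _repl are hypothetical.

```python
import re

# Same structure as the pattern above, but the numeric parts are captured too
PATTERN = re.compile(
    r"^(\D-\d{1,2}-)([\d,]+)(-[а-яёА-ЯЁ,]+,)([\d,]+)(-[а-яёА-ЯЁ,]+)$"
)

def _pad(numbers, width):
    # Left-pad every comma-separated number with zeroes
    return ",".join(n.zfill(width) for n in numbers.split(","))

def pad_numbers(value, width=3):
    def _repl(m):
        # Rebuild the string from the kept parts and the padded numbers
        return (m.group(1) + _pad(m.group(2), width) + m.group(3)
                + _pad(m.group(4), width) + m.group(5))
    return PATTERN.sub(_repl, value)

print(pad_numbers("T-46-5-В,Г,6-В,Г"))    # T-46-005-В,Г,006-В,Г
print(pad_numbers("T-46-55-В,Г,56-В,Г"))  # T-46-055-В,Г,056-В,Г
```

Because the replacement is built in a function, no backreference syntax is involved and the digit ambiguity does not arise.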
I have a sample string <alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card] ...>, created=1324336085, description='Customer for My Test App', livemode=False>
I only want the value cus_Y4o9qMEZAugtnW and NOT card (which is inside another [])
How could I do it in the easiest possible way in Python?
Maybe by using RegEx (which I am not good at)?
How about:
import re
s = "alpha.Customer[cus_Y4o9qMEZAugtnW] ..."
m = re.search(r"\[([A-Za-z0-9_]+)\]", s)
print(m.group(1))
For me this prints:
cus_Y4o9qMEZAugtnW
Note that the call to re.search(...) finds the first match to the regular expression, so it doesn't find the [card] unless you repeat the search a second time.
Edit: The regular expression here is a python raw string literal, which basically means the backslashes are not treated as special characters and are passed through to the re.search() method unchanged. The parts of the regular expression are:
\[ matches a literal [ character
( begins a new group
[A-Za-z0-9_] is a character set matching any letter (capital or lower case), digit or underscore
+ matches the preceding element (the character set) one or more times.
) ends the group
\] matches a literal ] character
Edit: As D K has pointed out, the regular expression could be simplified to:
m = re.search(r"\[(\w+)\]", s)
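since \w is a special sequence that means the same thing as [a-zA-Z0-9_] for ASCII matching; its exact meaning depends on the re.ASCII, re.LOCALE and re.UNICODE settings (in Python 3, \w in str patterns also matches non-ASCII word characters unless re.ASCII is set).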
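The points above (first match only, and the flag-dependent meaning of \w) can be checked with a short sketch. The shortened input string and the Cyrillic sample are assumptions made for illustration.

```python
import re

s = "<alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card] ...>"

# re.search stops at the first bracketed value
first = re.search(r"\[(\w+)\]", s)
print(first.group(1))  # cus_Y4o9qMEZAugtnW

# re.findall returns every bracketed value, including [card]
all_values = re.findall(r"\[(\w+)\]", s)
print(all_values)  # ['cus_Y4o9qMEZAugtnW', 'card']

# In Python 3, \w matches non-ASCII letters unless re.ASCII is set
sample = "[ключ]"  # hypothetical non-ASCII input
unicode_match = re.search(r"\[(\w+)\]", sample)
ascii_match = re.search(r"\[(\w+)\]", sample, flags=re.ASCII)
print(unicode_match is not None, ascii_match is None)
```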
You could use str.split to do this.
s = "<alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card]\
...>, created=1324336085, description='Customer for My Test App',\
livemode=False>"
val = s.split('[', 1)[1].split(']')[0]
Then we have:
>>> val
'cus_Y4o9qMEZAugtnW'
This should do the job:
re.match(r"[^[]*\[([^]]*)\]", yourstring).groups()[0]
your_string = "lnfgbdgfi343456dsfidf[my data] ljfbgns47647jfbgfjbgskj"
your_string[your_string.find("[")+1 : your_string.find("]")]
courtesy: Regular expression to return text between parenthesis
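The find-based slice above assumes both brackets are present; str.find returns -1 otherwise, which would silently produce a wrong slice. A guarded variant might look like the sketch below, where the function name first_bracketed is hypothetical.

```python
def first_bracketed(text):
    """Return the text inside the first [...] pair, or None if there is none."""
    start = text.find("[")
    if start == -1:
        return None
    # Look for the closing bracket only after the opening one
    end = text.find("]", start + 1)
    if end == -1:
        return None
    return text[start + 1:end]

your_string = "lnfgbdgfi343456dsfidf[my data] ljfbgns47647jfbgfjbgskj"
print(first_bracketed(your_string))    # my data
print(first_bracketed("no brackets"))  # None
```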
You can also use
re.findall(r"\[([A-Za-z0-9_]+)\]", string)
if there are many occurrences that you would like to find.
See also for more info:
How can I find all matches to a regular expression in Python?
You can use
import re
s = re.search(r"\[.*?]", string)
if s:
    print(s.group(0))
How about this? Example illustrated using a file:
import re

with open('abc.log', 'r') as f:
    for line in f:
        m = re.search(r"\[(.*?)\]", line)
        if m:  # skip lines that contain no bracketed value
            print(m.group(1))
Hope this helps:
Magic regex : \[(.*?)\]
Explanation:
\[ : [ is a meta char and needs to be escaped if you want to match it literally.
(.*?) : match everything in a non-greedy way and capture it.
\] : matches a literal ]. Outside a character class ] is not special, so the backslash is optional here but harmless.
This snippet should work too, but it will return any text enclosed within "[]"
re.findall(r"\[([a-zA-Z0-9 ._]*)\]", your_text)
I want to remove all the dots in a text that appear after a vowel character. How can I do that?
Here is the code I wish I had:
string = re.sub('[aeuio]\.', '[aeuio]', string)
Meaning: keep whatever vowel was matched and remove the '.' next to it.
Capture the vowel and replace with a backreference to it:
import re
s = "Se.hi.mo."
s = re.sub(r'([aeuio])\.', r'\1', s)
print(s) # => Sehimo
Here, ([aeuio]) forms a capturing group and \1 in the replacement pattern is a numbered backreference referencing the text captured into Group 1.
Mind the usage of raw string literals where a backslash does not form an escape sequence: r'\1' = '\\1'.
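As an alternative to capturing the vowel, a lookbehind can assert it without consuming it, so the replacement is simply an empty string. This is a sketch of that different technique, not the answer's method; the longer sample string is an assumption added to show that a dot after a consonant is kept.

```python
import re

s = "Se.hi.mo. St. end"  # hypothetical input; "St." has a dot after a consonant

# Capturing group plus backreference, as in the answer above
with_group = re.sub(r'([aeuio])\.', r'\1', s)

# Lookbehind: the vowel is asserted but not part of the match
with_lookbehind = re.sub(r'(?<=[aeuio])\.', '', s)

print(with_group)       # Sehimo St. end
print(with_lookbehind)  # Sehimo St. end
```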
How does one replace a pattern when the substitution itself is a variable?
I have the following string:
s = '''[[merit|merited]] and [[eat|eaten]] and [[go]]'''
I would like to retain only the right-most word in the brackets ('merited', 'eaten', 'go'), stripping away what surrounds these words, thus producing:
merited and eaten and go
I have the regex:
p = '''\[\[[a-zA-Z]*\[|]*([a-zA-Z]*)\]\]'''
...which produces:
>>> re.findall(p, s)
['merited', 'eaten', 'go']
However, as this varies, I don't see a way to use re.sub() or s.replace().
import re
s = '''[[merit|merited]] and [[eat|eaten]] and [[go]]'''
p = r'''\[\[[a-zA-Z]*?[|]*([a-zA-Z]*)\]\]'''
re.sub(p, r'\1', s)
The first [a-zA-Z]* is made lazy (*?) so that for [[go]] it matches the empty (shortest) string and the second one captures the actual go string.
\1 substitutes the first (in this case the only) match group in the pattern for each non-overlapping match in the string s. r'\1' is used so that \1 is not interpreted as the character with code 0x1.
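The effect of the r prefix on the replacement can be seen by running the substitution both ways. This is a small sketch using the string from the question; the names raw and not_raw are illustrative.

```python
import re

s = '[[merit|merited]] and [[eat|eaten]] and [[go]]'
p = r'\[\[[a-zA-Z]*?[|]*([a-zA-Z]*)\]\]'

# Raw string: re.sub receives backslash + "1" and treats it as group 1
raw = re.sub(p, r'\1', s)

# Non-raw string: Python turns '\1' into the single character with code 0x01
not_raw = re.sub(p, '\1', s)

print(repr(raw))      # 'merited and eaten and go'
print(repr(not_raw))  # '\x01 and \x01 and \x01'
```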
Well, first you need to fix your regex to capture the whole group:
>>> s = '[[merit|merited]] and [[eat|eaten]] and [[go]]'
>>> p = r'(\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\])'
>>> re.findall(p, s)
[('[[merit|merited]]', 'merited'), ('[[eat|eaten]]', 'eaten'), ('[[go]]', 'go')]
This matches the whole [[whateverisinhere]] and captures the whole match as group 1 and just the final word as group 2. You can then use the \2 token to replace the whole match with just group 2:
>>> re.sub(p,r'\2',s)
'merited and eaten and go'
or change your pattern to:
p = r'\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\]'
which gets rid of grouping the entire match as group 1 and only captures what you want. You can then do:
>>> re.sub(p,r'\1',s)
to have the same effect.
POST EDIT:
I forgot to mention that I actually changed your regex so here is the explanation:
\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\]
\[\[                           \]\] #literal matches of brackets
    (?:           )*                #non-capturing group that can match 0 or more of what's inside
       [a-zA-Z]*\|                  #matches any word that is followed by a '|' character
                    (   ...   )     #captures the final word into group one
I feel like this is more robust than what you originally had because it also works if there are more than 2 options:
>>> s = '[[merit|merited]] and [[ate|eat|eaten]] and [[go]]'
>>> p = r'\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\]'
>>> re.sub(p,r'\1',s)
'merited and eaten and go'
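Since the question asks about a substitution that varies per match, another option is to pass a function as the replacement and pick the last alternative in plain Python. This is a sketch of a different technique from the one above; the helper name last_option and the simpler pattern are assumptions made for illustration.

```python
import re

s = '[[merit|merited]] and [[ate|eat|eaten]] and [[go]]'

def last_option(match):
    # match.group(1) is everything between [[ and ]], e.g. 'ate|eat|eaten'
    return match.group(1).split('|')[-1]

# The replacement is computed per match by calling last_option
result = re.sub(r'\[\[([^\]]*)\]\]', last_option, s)
print(result)  # merited and eaten and go
```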