Regex - replace word having plus or brackets [duplicate] - python

This question already has answers here:
Escaping regex string
(4 answers)
Closed 6 years ago.
In Python, I am trying to do
text = re.sub(r'\b%s\b' % word, "replace_text", text)
to replace a word with some text. I am using re rather than text.replace so that, thanks to \b, the replacement happens only when the whole word matches. The problem comes when word contains characters such as +, ( or [, for example +91xxxxxxxx.
The regex engine treats this + as the "one or more" quantifier and fails with sre_constants.error: nothing to repeat. The same happens with (.
I couldn't find a fix for this after searching around a bit. Is there a way?

Just use re.escape(string):
word = re.escape(word)
text = re.sub(r'\b{}\b'.format(word), "replace_text", text)
re.escape() puts a backslash in front of every character that has a special meaning in regex patterns (e.g. \+ instead of +), so the word is matched literally.
Just a side note: formatting with the percent (%) character is not deprecated, but the .format() method of strings (or an f-string in Python 3.6+) is generally preferred in new code.
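One caveat with the pattern above: \b only matches between a word character and a non-word character. When the word itself starts or ends with a non-word character such as +, the \b next to it can fail (there is no word boundary between a space and +, for example). A minimal sketch of one common workaround, using lookarounds instead of \b; the helper name replace_whole_word is made up for illustration and is not part of the original answer:

```python
import re

def replace_whole_word(word, replacement, text):
    # Hypothetical helper, not from the original answer.
    # (?<!\w) / (?!\w): no word character directly before / after the match.
    pattern = r'(?<!\w){}(?!\w)'.format(re.escape(word))
    # Passing a function keeps any backslashes in `replacement` literal.
    return re.sub(pattern, lambda m: replacement, text)

print(replace_whole_word('+91xxxxxxxx', 'replace_text', 'call +91xxxxxxxx now'))
print(replace_whole_word('cat', 'dog', 'cat concat cat+'))
```

Whether "whole word" should mean "not adjacent to a word character" is an assumption here; adjust the lookarounds if the text uses other delimiters.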

Related

How can I use a variable as regex in python? [duplicate]

This question already has answers here:
How to use a variable inside a regular expression?
(12 answers)
Closed 4 years ago.
I use re to find a word on a file and I stored it as lattice_type
Now I want to use the word stored on lattice_type to make another regex
I tried using the name of the variable on this way
pnt_grp=re.match(r'+ lattice_type + (.*?) .*',line, re.M|re.I)
Here I look for the regex lattice_type= and store the group(1) in lattice_type
latt=open(cell_file,"r")
for types in latt:
    line = types
    latt_type = re.match(r'lattice_type = (.*)', line, re.M|re.I)
    if latt_type:
        lattice_type=latt_type.group(1)
Here is where I want to use the variable containing the word to find it on another file, but I got problems
pg=open(parameters,"r")
for lines in pg:
    line=lines
    pnt_grp=re.match(r'+ lattice_type + (.*?) .*',line, re.M|re.I)
    if pnt_grp:
        print(pnt_grp(1))
The r prefix is only needed when defining a string with a lot of backslashes, because both regex and Python string syntax attach meaning to backslashes. r'..' is just an alternative syntax that makes it easier to work with regex patterns. You don't have to use r'..' raw string literals. See The backslash plague in the Python regex howto for more information.
All of that means you certainly don't need the r prefix when you already have a string value. A regex pattern is just a string, so you can build it with normal string formatting or concatenation:
pnt_grp = re.match(lattice_type + '(.*?) .*', line, re.M|re.I)
I didn't use r in the string literal above, because there are no \ backslashes in the expression there to cause issues.
You may need to use the re.escape() function on your lattice_type value, if there is a possibility of that value containing regular expression meta-characters such as . or ? or [, etc. re.escape() escapes such metacharacters so that only literal text is matched:
pnt_grp = re.match(re.escape(lattice_type) + '(.*?) .*', line, re.M|re.I)
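For illustration, here is a small self-contained sketch of the same idea. The function name point_group_for, the sample values, and the assumption that the lattice type is followed by a single space in the parameters file are all mine, since the question does not show the file contents:

```python
import re

def point_group_for(lattice_type, line):
    # Hypothetical helper; assumes "<lattice_type> <value> <rest>" on the line.
    # re.escape() keeps any metacharacters in lattice_type literal.
    pattern = re.escape(lattice_type) + r' (.*?) .*'
    pnt_grp = re.match(pattern, line, re.I)
    # Match objects expose captured groups through .group().
    return pnt_grp.group(1) if pnt_grp else None

# Made-up sample data, for demonstration only.
print(point_group_for('hexagonal', 'hexagonal P6/mmm 191'))
```

Note that a match object is not callable, so the captured text has to be read with pnt_grp.group(1).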

Python/Regex: Get all strings between any two characters [duplicate]

This question already has answers here:
Match text between two strings with regular expression
(3 answers)
Closed 5 years ago.
I have a use case that requires the identification of many different pieces of text between any two characters.
For example,
String between a single space and (: def test() would return test
String between a word and space (paste), and a special character (/): #paste "game_01/01" would return "game_01
String between a single space and ( with multiple target strings: } def test2() { Hello(x, 1) would return test2 and Hello
To do this, I'm attempting to write something generic that will identify the shortest string between any two characters.
My current approach is (from chrisz):
pattern = '{0}(.*?){1}'.format(re.escape(separator_1), re.escape(separator_2))
And for the first use case, separator_1 = \s and separator_2 = (. This isn't working, so evidently I am missing something, but I am not sure what.
tl;dr How can I write a generic regex to parse the shortest string between any two characters?
Note: I know there are many examples of this but they seem quite specific and I'm looking for a general solution if possible.
Let me know if this is what you are looking for:
import re
def smallest_between_two(a, b, text):
    return min(re.findall(re.escape(a)+"(.*?)"+re.escape(b),text), key=len)
print(smallest_between_two(' ', '(', 'def test()'))
print(smallest_between_two('[', ']', '[this one][not this one]'))
print(smallest_between_two('paste ', '/', '#paste "game_01/01"'))
Output:
test
this one
"game_01
To add an explanation to what this does:
re.findall():
Return all non-overlapping matches of pattern in string, as a list of strings
re.escape()
Escape all the characters in pattern except ASCII letters and numbers. This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it
(.*?)
. matches any character (except for line terminators)
*? Quantifier: matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
So our regular expression matches any character (not including line terminators) between two arbitrary escaped strings, and then returns the shortest length string from the list that re.findall() returns.
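Two things are worth noting about the question's own attempt. re.escape() makes its argument literal, so passing \s as a separator searches for a backslash followed by s rather than for whitespace; pass an actual space instead. Also, the function above returns only the shortest match, whereas the third use case asks for every match. A rough sketch of a variant that returns all of them follows; the name all_between and the extra lookahead that stops the captured text from containing the opening separator are my additions, not part of the original answer:

```python
import re

def all_between(a, b, text):
    # Hypothetical variant: return every shortest run between a and b.
    a_esc, b_esc = re.escape(a), re.escape(b)
    # (?:(?!a).)*? consumes characters lazily but never steps over another
    # occurrence of the opening separator, so each capture starts right
    # after the closest preceding `a`.
    pattern = '{0}((?:(?!{0}).)*?){1}'.format(a_esc, b_esc)
    return re.findall(pattern, text)

print(all_between(' ', '(', '} def test2() { Hello(x, 1)'))
print(all_between(' ', '(', 'def test()'))
print(all_between('paste ', '/', '#paste "game_01/01"'))
```

Without the lookahead, the first capture for the third use case would be def test2, because the lazy match starts at the leftmost space.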

Python regex matching on strings I don't want [duplicate]

This question already has answers here:
Python- how do I use re to match a whole string [duplicate]
(4 answers)
Closed 5 years ago.
This is my first attempt at using regex, with Python or at all, and it is not working as expected. I want a regex to match any alphabetic character or underscore as the first character, then any number of alphanumeric characters or underscores after. The regex I am using is '^[a-z_A-Z][a-z_A-Z0-9]*', which seems to produce what I want at pythex.org, but in my code it is matching strings that I do not want.
My code is as follows:
isMatch = re.match('^[a-z_A-Z][a-z_A-Z0-9]*', someString)
return True if isMatch else False
Two examples of strings that are matching that I don't want are: "qq-q" and "va[r". What am I doing wrong?
I think that you just forgot the $ at the end of your regex to specify the end of the string.
isMatch = re.match('^[a-z_A-Z][a-z_A-Z0-9]*$', someString)
Without that, it will match the beginning of the string and not the entire string, which explains why it worked on "qq-q" ("qq" is a match) and "va[r" ("va" is a match).
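A small caveat: $ also matches just before a trailing newline, so "var\n" would still pass with the pattern above. If whole-string matching is the goal, re.fullmatch() (available since Python 3.4) is an alternative worth considering. A minimal sketch, where the names IDENTIFIER and is_identifier are only illustrative:

```python
import re

IDENTIFIER = re.compile(r'[a-z_A-Z][a-z_A-Z0-9]*')

def is_identifier(some_string):
    # fullmatch() succeeds only if the pattern consumes the entire string,
    # so no ^ or $ anchors are needed.
    return IDENTIFIER.fullmatch(some_string) is not None

for s in ['var_1', 'qq-q', 'va[r', 'var\n']:
    print(repr(s), is_identifier(s))
```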

understanding this python regular expression re.compile(r'[ :]') [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 8 years ago.
Hi, I am trying to understand Python code that contains the regular expression re.compile(r'[ :]'). I tried quite a few strings and couldn't find one that matches. Can someone please give an example of text that matches this pattern?
The expression simply matches a single space or a single : (or rather, a string containing either). That’s it. […] is a character class.
The [] matches any of the characters in the brackets. So [ :] will match one character that is either a space or a colon.
So these strings would have a match:
"Hello World"
"Field 1:"
etc...
These would not
"This_string_has_no_spaces_or_colons"
"100100101"
Edit:
For more info on regular expressions: https://docs.python.org/2/library/re.html
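To check the examples above quickly, something like the following could be used. It is only a sketch; the helper name has_space_or_colon is made up, and search() is used rather than match() because the space or colon may appear anywhere in the string:

```python
import re

pattern = re.compile(r'[ :]')

def has_space_or_colon(text):
    # search() scans the whole string; match() would only test position 0.
    return pattern.search(text) is not None

print(has_space_or_colon("Hello World"))
print(has_space_or_colon("100100101"))
# A character class like this is often used to split on several delimiters.
print(pattern.split("Field 1:"))
```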

Why do I need to add DOTALL to python regular expression to match new line in raw string [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 1 year ago.
Why does one need to add the DOTALL flag for a Python regular expression to match characters including the newline character in a raw string? I ask because a raw string is supposed to ignore the escaping of special characters such as the newline character. From the docs:
The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline.
This is my situation:
string = '\nSubject sentence is: Appropriate support for families of children diagnosed with hearing impairment\nCausal Verb is : may have\npredicate sentence is: a direct impact on the success of early hearing detection and intervention programs in reducing the negative effects of permanent hearing loss'
re.search(r"Subject sentence is:(.*)Causal Verb is :(.*)predicate sentence is:(.*)", string ,re.DOTALL)
results in a match. However, when I remove the DOTALL flag, I get no match.
In regex, . means any character except \n.
So if you have newlines in your string, .* will not go past a newline (\n).
But in Python, if you use the re.DOTALL flag (also known as re.S), the dot matches \n (newline) as well.
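A minimal demonstration of the difference, using a made-up two-line string rather than the text from the question:

```python
import re

text = 'first line\nsecond line'   # sample data with a real newline

# Without DOTALL, . cannot cross the newline, so there is no match.
without_flag = re.search(r'first(.*)second', text)
# With DOTALL, . also matches the newline.
with_flag = re.search(r'first(.*)second', text, re.DOTALL)

print(without_flag)
print(repr(with_flag.group(1)))
```

The r prefix on the pattern only affects how Python reads the pattern literal; it has no bearing on whether the text being searched contains newlines.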
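Your source string is not raw; only your pattern string is. The \n sequences in the source are therefore real newline characters, which . does not match without DOTALL.
If the source were a raw string, it would contain a literal backslash followed by n instead, and the pattern would match without the flag. Maybe try: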
string = r'\n...\n'
re.search("Subject sentence is:(.*)Causal Verb is :(.*)predicate sentence is:(.*)", string)
