Searching for an exact match that contains brackets [duplicate] - python

This question already has an answer here:
Python re - escape coincidental parentheses in regex pattern
(1 answer)
Closed 5 years ago.
I am reading in lines from a file each of which are formatted like this:
array_name[0]
array_name[1]
How can I do an exact match on this string in python? I've tried this:
if re.match(line, "array_name[0]")
but it seems to match all the time, without taking the bracketed parts ([0], [1], etc.) into account.

re.escape from the re module is a useful tool for automatically escaping characters that the regex engine considers special. From the docs:
re.escape(pattern)
Escape all the characters in pattern except ASCII
letters and numbers. This is useful if you want to match an arbitrary
literal string that may have regular expression metacharacters in it.
In [1]: re.escape("array_name[0]")
Out[1]: 'array_name\\[0\\]'
Also, you've reversed the order of your arguments. You'll need your pattern to come first, followed by the text you want to match:
re.match(re.escape("array_name[0]"), line)
Example:
In [2]: re.match(re.escape("array_name[0]"), 'array_name[0] in a line')
Out[2]: <_sre.SRE_Match object; span=(0, 13), match='array_name[0]'>
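As a minimal sketch of the difference (the sample lines and the helper name is_exact are made up for illustration, not taken from the question): re.match only anchors at the start, so it also accepts lines that merely begin with the target. If "exact match" means the whole line, re.fullmatch or a plain == comparison may be closer to what you want.

```python
import re

TARGET = "array_name[0]"  # literal text to look for

# Unescaped, "[0]" is a character class, so this pattern means "array_name0".
unescaped = re.compile(TARGET)
# Escaped, the brackets are matched literally.
escaped = re.compile(re.escape(TARGET))

def is_exact(line):
    """True only if the whole line, minus its trailing newline, equals TARGET."""
    return escaped.fullmatch(line.rstrip("\n")) is not None

# Hypothetical sample lines, as they might come from a file.
samples = ["array_name[0]\n", "array_name[1]\n",
           "array_name[0] in a line\n", "array_name0\n"]

for line in samples:
    print(repr(line),
          bool(unescaped.match(line)),  # True only for 'array_name0\n'
          bool(escaped.match(line)),    # True for any line starting with the target
          is_exact(line))               # True only for 'array_name[0]\n'
```

For a purely literal comparison like this one, line.rstrip("\n") == TARGET gives the same answer as is_exact without involving re at all.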

Related

Don't raw strings treat backslashes as a literal character? [duplicate]

This question already has answers here:
Raw string and regular expression in Python
(4 answers)
Closed 2 years ago.
I have a question about the backslashes when using the re module in python. Consider the code:
import re
message = 'My phone number is 345-298-2372'
num_reg = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
match = num_reg.search(message)
print(match.group())
In the code above, a raw string is passed into the re.compile method, but the backslash is still not treated as a literal character, as \d remains a placeholder for a digit. Why the raw string then?
The documentation for re and raw strings answers this question well.
In your example, the raw string means the value passed to re.compile() still contains the original backslashes: Python's string-literal parser leaves them alone, and the re module then interprets \d itself as "any digit". This is desirable because re has its own escape sequences, some of which conflict with Python's. It is usually more convenient to write r'foo' for a regex so that you don't have to double every backslash.
Without the raw string, you would have to double each backslash for it to reach re:
import re
message = 'My phone number is 345-298-2372'
num_reg = re.compile('\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d')
match = num_reg.search(message)
print(match.group())
You may also want to look at the regex quantifier/repetition syntax, as it generally makes patterns more readable:
import re
message = 'My phone number is 345-298-2372'
num_reg = re.compile(r'\d{3}-\d{3}-\d{4}')
match = num_reg.search(message)
print(match.group())
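A short sketch of what "raw" changes, under the assumption that the confusion is between Python's string-literal escapes and the escapes that re interprets. The word-boundary example ('a cat') is made up for illustration.

```python
import re

# Python has no \d escape, so both spellings produce the same two characters.
pattern_raw = r'\d{3}-\d{3}-\d{4}'
pattern_doubled = '\\d{3}-\\d{3}-\\d{4}'
print(pattern_raw == pattern_doubled)   # True

# Where Python does define the escape, raw and non-raw literals differ.
print(len('\n'), len(r'\n'))            # 1 2

# '\b' is a backspace character to Python, but r'\b' reaches re as a word boundary.
print(re.search('\bcat\b', 'a cat'))              # None
print(re.search(r'\bcat\b', 'a cat').group())     # cat
```

So the raw prefix does not switch escapes off inside re; it only stops Python from consuming the backslash before re ever sees it.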

Python's re.sub returns data in wrong encoding from unicode [duplicate]

This question already has answers here:
Handling backreferences to capturing groups in re.sub replacement pattern
(2 answers)
Closed 3 years ago.
>>> re.sub('\w', '\1', 'абвгдеёжз')
'\x01\x01\x01\x01\x01\x01\x01\x01\x01'
Why does re.sub return data in this format? I want it to return the unaltered string 'абвгдеёжз' in this case. Changing the string to u'абвгдеёжз' or passing flags=re.U doesn't do anything.
Because '\1' is the character with code point 1 (its repr form is '\x01'). re.sub never saw your backslash, per the rules on string literals. Even if you did escape it, as in r'\1' or '\\1', reference 1 isn't the right number: you need parentheses in the pattern to define a group before \1 can refer to it. r'\g<0>', which refers to the whole match, would work, as described in the re.sub documentation.
Perhaps you meant to:
>>> re.sub(r'(\w)', r'\1', 'абвгдеёжз')
'абвгдеёжз'
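The points above can be checked with a small sketch. The doubled-letter example at the end is an illustration added here, not something from the question.

```python
import re

text = 'абвгдеёжз'

# In a normal string literal, '\1' is an octal escape for the character U+0001.
print('\1' == '\x01')    # True
print(len(r'\1'))        # 2: a backslash and the digit 1

# \g<0> refers to the whole match, so no capturing group is required.
print(re.sub(r'\w', r'\g<0>', text))     # абвгдеёжз

# \1 needs a capturing group in the pattern.
print(re.sub(r'(\w)', r'\1', text))      # абвгдеёжз
print(re.sub(r'(\w)', r'\1\1', 'абв'))   # ааббвв

# Without a group, the backreference is rejected.
try:
    re.sub(r'\w', r'\1', text)
except re.error as exc:
    print('error:', exc)
```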

Python/Regex: Get all strings between any two characters [duplicate]

This question already has answers here:
Match text between two strings with regular expression
(3 answers)
Closed 5 years ago.
I have a use case that requires the identification of many different pieces of text between any two characters.
For example,
String between a single space and (: def test() would return test
String between a word and space (paste), and a special character (/): #paste "game_01/01" would return "game_01
String between a single space and ( with multiple target strings: } def test2() { Hello(x, 1) would return test2 and Hello
To do this, I'm attempting to write something generic that will identify the shortest string between any two characters.
My current approach is (from chrisz):
pattern = '{0}(.*?){1}'.format(re.escape(separator_1), re.escape(separator_2))
And for the first use case, separator_1 = \s and separator_2 = (. This isn't working so evidently I am missing something but am not sure what.
tl;dr How can I write a generic regex to parse the shortest string between any two characters?
Note: I know there are many examples of this but they seem quite specific and I'm looking for a general solution if possible.
Let me know if this is what you are looking for:
import re
def smallest_between_two(a, b, text):
    # Collect every lazy match between the two escaped delimiters, then keep the shortest.
    return min(re.findall(re.escape(a) + "(.*?)" + re.escape(b), text), key=len)
print(smallest_between_two(' ', '(', 'def test()'))
print(smallest_between_two('[', ']', '[this one][not this one]'))
print(smallest_between_two('paste ', '/', '#paste "game_01/01"'))
Output:
test
this one
"game_01
To add an explanation to what this does:
re.findall():
Return all non-overlapping matches of pattern in string, as a list of strings
re.escape()
Escape all the characters in pattern except ASCII letters and numbers. This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it
(.*?)
. matches any character (except for line terminators)
*? Quantifier: matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
So our regular expression matches any character (not including line terminators) between two arbitrary escaped strings, and then returns the shortest length string from the list that re.findall() returns.
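Two details are worth a hedged sketch. First, re.escape turns a separator such as \s into a pattern for a literal backslash followed by s, which is one plausible reason the approach in the question did not work with separator_1 = \s. Second, a lazy (.*?) only keeps the end of the match short; the match still starts at the first occurrence of the left separator. The helper below (the name between is made up here) additionally refuses to step over another copy of the left separator, which is one possible way to get test2 and Hello from the third use case.

```python
import re

def between(start, end, text):
    """Return each run of text between the literal strings start and end,
    where the run itself contains no further occurrence of start."""
    s, e = re.escape(start), re.escape(end)
    # (?:(?!s).)*? consumes one character at a time and refuses to
    # step onto another copy of the start separator.
    pattern = '{0}((?:(?!{0}).)*?){1}'.format(s, e)
    return re.findall(pattern, text)

# Escaping r'\s' yields a pattern for a literal backslash followed by 's'.
print(re.escape(r'\s'))    # \\s

sample = '} def test2() { Hello(x, 1)'

# Plain lazy matching starts at the first space it finds.
print(re.findall(r' (.*?)\(', sample))    # ['def test2', '{ Hello']

# The stricter helper returns only the text after the nearest space.
print(between(' ', '(', sample))                      # ['test2', 'Hello']
print(between('paste ', '/', '#paste "game_01/01"'))  # ['"game_01']
```

To match whitespace in general rather than a literal separator, the left side would have to be passed as a ready-made pattern and not run through re.escape.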

Python Regular Expression newline is matched [duplicate]

This question already has answers here:
REGEX - Differences between `^`, `$` and `\A`, `\Z`
(1 answer)
Checking whole string with a regex
(5 answers)
Closed 5 years ago.
I want to match a string that has alphanumerics and some special characters but not the newline. But whenever my string ends with a newline, the pattern still matches. I checked the documentation for flags, but none of them looked relevant.
The following is a sample code in Python 3.6.2 REPL
>>> import re
>>> s = "do_not_match\n"
>>> p = re.compile(r"^[a-zA-Z\+\-\/\*\%\_\>\<=]*$")
>>> p.match(s)
<_sre.SRE_Match object; span=(0, 12), match='do_not_match'>
The expected result is that it shouldn't match as I have newline at the end.
https://regex101.com/r/qyRw5s/1
I am a bit confused on what I am missing here.
The problem is that $ matches not only at the very end of the string but also just before a trailing newline, if there is one.
If you don't want a string with a trailing newline to be accepted, use \Z instead of $ in your regex.
See the re module's documentation:
'$'
Matches the end of the string or just before the newline at the end of the string,
\Z
Matches only at the end of the string.
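A minimal sketch comparing the two anchors on the string from the question. The character class is the question's, written without the unnecessary backslashes, and the variable names are chosen here for illustration. re.fullmatch is shown as an alternative that needs no anchors at all.

```python
import re

CHARS = r"[a-zA-Z+\-/*%_><=]*"

with_dollar = re.compile("^" + CHARS + "$")
with_big_z = re.compile("^" + CHARS + r"\Z")
body_only = re.compile(CHARS)

s = "do_not_match\n"

print(with_dollar.match(s) is not None)   # True: $ tolerates one trailing newline
print(with_big_z.match(s) is not None)    # False: \Z only matches at the very end
print(body_only.fullmatch(s) is not None) # False: the newline is not in the class
print(body_only.fullmatch("do_match") is not None)  # True
```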

Regex - replace word having plus or brackets [duplicate]

This question already has answers here:
Escaping regex string
(4 answers)
Closed 6 years ago.
In Python, I am trying to do
text = re.sub(r'\b%s\b' % word, "replace_text", text)
to replace a word with some text. I am using re rather than text.replace so that, with \b, the replacement happens only when the whole word matches. The problem comes when word contains characters such as +, ( or [, for example +91xxxxxxxx.
The regex engine treats this + as the one-or-more quantifier and fails with sre_constants.error: nothing to repeat. The same happens with (.
I couldn't find a fix for this after searching around a bit. Is there a way?
Just use re.escape(string):
word = re.escape(word)
text = re.sub(r'\b{}\b'.format(word), "replace_text", text)
It replaces every character that has a special meaning in a regex pattern with its escaped form (e.g. \+ instead of +).
Just a side note: formatting with the percent (%) operator is not deprecated, but the .format() method of strings is generally preferred in new code.
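One caveat, shown as a hedged sketch: \b only matches between a word character and a non-word character, so when the word itself starts with a non-word character such as +, the \b in front of it cannot match after a space or at the start of the text. The helper below (replace_whole is a hypothetical name, and the sample sentence is made up) uses lookarounds instead, which only require that no word character touches the match on either side.

```python
import re

def replace_whole(word, replacement, text):
    """Replace word only where it is not glued to a word character on either side."""
    pattern = r'(?<!\w)' + re.escape(word) + r'(?!\w)'
    # A function replacement keeps backslashes in `replacement` from being interpreted.
    return re.sub(pattern, lambda m: replacement, text)

word = '+91xxxxxxxx'
text = 'Call +91xxxxxxxx now'

# With \b the escaped pattern compiles, but it finds nothing in this text.
with_b = re.sub(r'\b{}\b'.format(re.escape(word)), 'replace_text', text)
print(with_b)                                      # Call +91xxxxxxxx now

print(replace_whole(word, 'replace_text', text))   # Call replace_text now
```

For ordinary words made only of letters, digits and underscores, the lookaround version and the \b version behave the same way.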
