How to escape certain characters of a string? - python

I have a string This is a Test and a passed parameter s.
How can I escape all s-characters of that first string?
It should result in Thi\s i\s a Te\sT
I tried it like this:
rstr = rstr.replace(esc, r"\\" + esc)
But it will result in \\ before each s

r'\\' produces a literal double backslash:
>>> r'\\'
'\\\\'
Don't use a raw string here, just use '\\':
>>> 'This is a Test'.replace('s', '\\s')
'Thi\\s i\\s a Te\\st'
Don't confuse the Python representation with the value. To make debugging and round-tripping easy, the Python interpreter uses the repr() function to echo back results.
The repr() of a string uses Python literal notation, including escaping any backslashes:
>>> r'\s'
'\\s'
>>> len(r'\s')
2
>>> print r'\s'
\s
The actual value contains just one backslash, but because a backslash can start characters with special meanings, such as \n (a newline), or \x00 (a null byte), they are represented escaped so that you can paste the value directly back into the interpreter.

Related

Searching for a raw string (r'\n') within a raw string full of ('\n') resulting in an empty result set [duplicate]

This question already has answers here:
Can't escape the backslash with regex?
(7 answers)
Closed 4 years ago.
I am confused with the backslash in regular expressions. Within a regex a \ has a special meaning, e.g. \d means a decimal digit. If you add a backslash in front of the backslash this special meaning gets lost. In the regex-howto one can read:
Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
So print(re.search('\d', '\d')) gives None because \d matches any decimal digit but there is none in \d.
I now would expect print(re.search('\\d', '\d')) to match \d but the answer is still None.
Only print(re.search('\\\d', '\d')) gives as output <_sre.SRE_Match object; span=(0, 2), match='\\d'>.
Does someone have an explanation?
The confusion is due to the fact that the backslash character \ is used as an escape at two different levels. First, the Python interpreter itself performs substitutions for \ before the re module ever sees your string. For instance, \n is converted to a newline character, \t is converted to a tab character, etc. To get an actual \ character, you can escape it as well, so \\ gives a single \ character. If the character following the \ isn't a recognized escape character, then the \ is treated like any other character and passed through, but I don't recommend depending on this. Instead, always escape your \ characters by doubling them, i.e. \\.
If you want to see how Python is expanding your string escapes, just print out the string. For example:
s = 'a\\b\tc'
print(s)
If s is part of an aggregate data type, e.g. a list or a tuple, and if you print that aggregate, Python will enclose the string in single quotes and will include the \ escapes (in a canonical form), so be aware of how your string is being printed. If you just type a quoted string into the interpreter, it will also display it enclosed in quotes with \ escapes.
Once you know how your string is being encoded, you can then think about what the re module will do with it. For instance, if you want to escape \ in a string you pass to the re module, you will need to pass \\ to re, which means you will need to use \\\\ in your quoted Python string. The Python string will end up with \\ and the re module will treat this as a single literal \ character.
An alternative way to include \ characters in Python strings is to use raw strings, e.g. r'a\b' is equivalent to "a\\b".
An r character before the regular expression in a call to search() specifies that the regular expression is a raw string. This allows backslashes to be used in the regular expression as regular characters rather than in an escape sequence of characters. Let me explain ...
Before the re module's search method processes the strings that are passed to it, the Python interpreter takes an initial pass over the string. If there are backslashes present in a string, the Python interpreter must decide if each is part of a Python escape sequence (e.g. \n or \t) or not.
Note: at this point Python does not care whether or not '\' is a regular expression meta-character.
If the '\' is followed by a recognized Python escape character (t,n, etc.), then the backslash and the escape character are replaced with the actual Unicode or 8-bit character. For example, '\t' would be replaced with the ASCII character for tab. Otherwise it is passed by and interpreted as a '\' character.
Consider the following.
>>> s = '\t'
>>> print ("[" + s + "]")
>>> [ ] // an actual tab character after preprocessing
>>> s = '\d'
>>> print ("[" + s + "]")
>>> [\d] // '\d' after preprocessing
Sometimes we want to include in a string a character sequence that includes '\' without it being interpreted by Python as an escape sequence. To do this we escape the '\' with a '\'. Now when Python sees '\' it replaces the two backslashes with a single '\' character.
>>> s = '\\t'
>>> print ("[" + s + "]")
>>> [\t] // '\t' after preprocessing
After the Python interpreter take a pass on both strings, they are passed to the re module's search method. The search method parses the regular expression string to identify the regular expression's meta-characters.
Now '\' is also a special regular expression meta-character and is interpreted as one UNLESS it is escaped at the time that the re search() method is executed.
Consider the following call.
>>> match = re.search('a\\t','a\\t') //Match is None
Here, match is None. Why? Lets look at the strings after the Python interpreter makes its pass.
String 1: 'a\t'
String 2: 'a\t'
So why is match equal to None? When search() interprets String 1, since it is a regular expression, the backslash is interpreted as a meta-character, not an ordinary character. The backslash in String 2 however is not in a regular expression and has already been processed by the Python interpreter, so it is interpreted as an ordinary character.
So the search() method is looking for 'a escape-t' in the string 'a\t' which are not a match.
To fix this we can tell the search() method to not interpret the '\' as a meta-character. We can do this by escaping it.
Consider the following call.
>>> match = re.search('a\\\\t','a\\t') // Match contains 'a\t'
Again, lets look at the strings after the Python interpreter has made its pass.
String 1: 'a\\t'
String 2: 'a\t'
Now when the search() method processes the regular expression, it sees that the second backslash is escaped by the first and should not be considered a meta-character. It therefore interprets the string as 'a\t', which matches String 2.
An alternate way to have search() consider '\' as a character is to place an r before the regular expression. This tells the Python interpreter to NOT preprocess the string.
Consider this.
>>> match = re.search(r'a\\t','a\\t') // match contains 'a\t'
Here the Python interpreter does not modify the first string but does process the second string. The strings passed to search() are:
String 1: 'a\\t'
String 2: 'a\t'
As in the previous example, search interprets the '\' as the single character '\' and not a meta-character, thus matches with String 2.
Python's own string parsing (partially) comes in your way.
If you want to see what re sees, type
print '\d'
print '\\d'
print '\\\d'
on the Python command prompt. You see that \d and \\d both result in \d, the latter one being taken care by the Python string parser.
If you want to avoid any hassle with these, use raw strings as suggested by the re module documentation: r'\\d' will result in \\d seen by the RE module.

regular expression problems in re.findall python [duplicate]

This question already has answers here:
Can't escape the backslash with regex?
(7 answers)
Closed 4 years ago.
I am confused with the backslash in regular expressions. Within a regex a \ has a special meaning, e.g. \d means a decimal digit. If you add a backslash in front of the backslash this special meaning gets lost. In the regex-howto one can read:
Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
So print(re.search('\d', '\d')) gives None because \d matches any decimal digit but there is none in \d.
I now would expect print(re.search('\\d', '\d')) to match \d but the answer is still None.
Only print(re.search('\\\d', '\d')) gives as output <_sre.SRE_Match object; span=(0, 2), match='\\d'>.
Does someone have an explanation?
The confusion is due to the fact that the backslash character \ is used as an escape at two different levels. First, the Python interpreter itself performs substitutions for \ before the re module ever sees your string. For instance, \n is converted to a newline character, \t is converted to a tab character, etc. To get an actual \ character, you can escape it as well, so \\ gives a single \ character. If the character following the \ isn't a recognized escape character, then the \ is treated like any other character and passed through, but I don't recommend depending on this. Instead, always escape your \ characters by doubling them, i.e. \\.
If you want to see how Python is expanding your string escapes, just print out the string. For example:
s = 'a\\b\tc'
print(s)
If s is part of an aggregate data type, e.g. a list or a tuple, and if you print that aggregate, Python will enclose the string in single quotes and will include the \ escapes (in a canonical form), so be aware of how your string is being printed. If you just type a quoted string into the interpreter, it will also display it enclosed in quotes with \ escapes.
Once you know how your string is being encoded, you can then think about what the re module will do with it. For instance, if you want to escape \ in a string you pass to the re module, you will need to pass \\ to re, which means you will need to use \\\\ in your quoted Python string. The Python string will end up with \\ and the re module will treat this as a single literal \ character.
An alternative way to include \ characters in Python strings is to use raw strings, e.g. r'a\b' is equivalent to "a\\b".
An r character before the regular expression in a call to search() specifies that the regular expression is a raw string. This allows backslashes to be used in the regular expression as regular characters rather than in an escape sequence of characters. Let me explain ...
Before the re module's search method processes the strings that are passed to it, the Python interpreter takes an initial pass over the string. If there are backslashes present in a string, the Python interpreter must decide if each is part of a Python escape sequence (e.g. \n or \t) or not.
Note: at this point Python does not care whether or not '\' is a regular expression meta-character.
If the '\' is followed by a recognized Python escape character (t,n, etc.), then the backslash and the escape character are replaced with the actual Unicode or 8-bit character. For example, '\t' would be replaced with the ASCII character for tab. Otherwise it is passed by and interpreted as a '\' character.
Consider the following.
>>> s = '\t'
>>> print ("[" + s + "]")
>>> [ ] // an actual tab character after preprocessing
>>> s = '\d'
>>> print ("[" + s + "]")
>>> [\d] // '\d' after preprocessing
Sometimes we want to include in a string a character sequence that includes '\' without it being interpreted by Python as an escape sequence. To do this we escape the '\' with a '\'. Now when Python sees '\' it replaces the two backslashes with a single '\' character.
>>> s = '\\t'
>>> print ("[" + s + "]")
>>> [\t] // '\t' after preprocessing
After the Python interpreter take a pass on both strings, they are passed to the re module's search method. The search method parses the regular expression string to identify the regular expression's meta-characters.
Now '\' is also a special regular expression meta-character and is interpreted as one UNLESS it is escaped at the time that the re search() method is executed.
Consider the following call.
>>> match = re.search('a\\t','a\\t') //Match is None
Here, match is None. Why? Lets look at the strings after the Python interpreter makes its pass.
String 1: 'a\t'
String 2: 'a\t'
So why is match equal to None? When search() interprets String 1, since it is a regular expression, the backslash is interpreted as a meta-character, not an ordinary character. The backslash in String 2 however is not in a regular expression and has already been processed by the Python interpreter, so it is interpreted as an ordinary character.
So the search() method is looking for 'a escape-t' in the string 'a\t' which are not a match.
To fix this we can tell the search() method to not interpret the '\' as a meta-character. We can do this by escaping it.
Consider the following call.
>>> match = re.search('a\\\\t','a\\t') // Match contains 'a\t'
Again, lets look at the strings after the Python interpreter has made its pass.
String 1: 'a\\t'
String 2: 'a\t'
Now when the search() method processes the regular expression, it sees that the second backslash is escaped by the first and should not be considered a meta-character. It therefore interprets the string as 'a\t', which matches String 2.
An alternate way to have search() consider '\' as a character is to place an r before the regular expression. This tells the Python interpreter to NOT preprocess the string.
Consider this.
>>> match = re.search(r'a\\t','a\\t') // match contains 'a\t'
Here the Python interpreter does not modify the first string but does process the second string. The strings passed to search() are:
String 1: 'a\\t'
String 2: 'a\t'
As in the previous example, search interprets the '\' as the single character '\' and not a meta-character, thus matches with String 2.
Python's own string parsing (partially) comes in your way.
If you want to see what re sees, type
print '\d'
print '\\d'
print '\\\d'
on the Python command prompt. You see that \d and \\d both result in \d, the latter one being taken care by the Python string parser.
If you want to avoid any hassle with these, use raw strings as suggested by the re module documentation: r'\\d' will result in \\d seen by the RE module.

Confused about backslashes in regular expressions [duplicate]

This question already has answers here:
Can't escape the backslash with regex?
(7 answers)
Closed 4 years ago.
I am confused with the backslash in regular expressions. Within a regex a \ has a special meaning, e.g. \d means a decimal digit. If you add a backslash in front of the backslash this special meaning gets lost. In the regex-howto one can read:
Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
So print(re.search('\d', '\d')) gives None because \d matches any decimal digit but there is none in \d.
I now would expect print(re.search('\\d', '\d')) to match \d but the answer is still None.
Only print(re.search('\\\d', '\d')) gives as output <_sre.SRE_Match object; span=(0, 2), match='\\d'>.
Does someone have an explanation?
The confusion is due to the fact that the backslash character \ is used as an escape at two different levels. First, the Python interpreter itself performs substitutions for \ before the re module ever sees your string. For instance, \n is converted to a newline character, \t is converted to a tab character, etc. To get an actual \ character, you can escape it as well, so \\ gives a single \ character. If the character following the \ isn't a recognized escape character, then the \ is treated like any other character and passed through, but I don't recommend depending on this. Instead, always escape your \ characters by doubling them, i.e. \\.
If you want to see how Python is expanding your string escapes, just print out the string. For example:
s = 'a\\b\tc'
print(s)
If s is part of an aggregate data type, e.g. a list or a tuple, and if you print that aggregate, Python will enclose the string in single quotes and will include the \ escapes (in a canonical form), so be aware of how your string is being printed. If you just type a quoted string into the interpreter, it will also display it enclosed in quotes with \ escapes.
Once you know how your string is being encoded, you can then think about what the re module will do with it. For instance, if you want to escape \ in a string you pass to the re module, you will need to pass \\ to re, which means you will need to use \\\\ in your quoted Python string. The Python string will end up with \\ and the re module will treat this as a single literal \ character.
An alternative way to include \ characters in Python strings is to use raw strings, e.g. r'a\b' is equivalent to "a\\b".
An r character before the regular expression in a call to search() specifies that the regular expression is a raw string. This allows backslashes to be used in the regular expression as regular characters rather than in an escape sequence of characters. Let me explain ...
Before the re module's search method processes the strings that are passed to it, the Python interpreter takes an initial pass over the string. If there are backslashes present in a string, the Python interpreter must decide if each is part of a Python escape sequence (e.g. \n or \t) or not.
Note: at this point Python does not care whether or not '\' is a regular expression meta-character.
If the '\' is followed by a recognized Python escape character (t,n, etc.), then the backslash and the escape character are replaced with the actual Unicode or 8-bit character. For example, '\t' would be replaced with the ASCII character for tab. Otherwise it is passed by and interpreted as a '\' character.
Consider the following.
>>> s = '\t'
>>> print ("[" + s + "]")
>>> [ ] // an actual tab character after preprocessing
>>> s = '\d'
>>> print ("[" + s + "]")
>>> [\d] // '\d' after preprocessing
Sometimes we want to include in a string a character sequence that includes '\' without it being interpreted by Python as an escape sequence. To do this we escape the '\' with a '\'. Now when Python sees '\' it replaces the two backslashes with a single '\' character.
>>> s = '\\t'
>>> print ("[" + s + "]")
>>> [\t] // '\t' after preprocessing
After the Python interpreter take a pass on both strings, they are passed to the re module's search method. The search method parses the regular expression string to identify the regular expression's meta-characters.
Now '\' is also a special regular expression meta-character and is interpreted as one UNLESS it is escaped at the time that the re search() method is executed.
Consider the following call.
>>> match = re.search('a\\t','a\\t') //Match is None
Here, match is None. Why? Lets look at the strings after the Python interpreter makes its pass.
String 1: 'a\t'
String 2: 'a\t'
So why is match equal to None? When search() interprets String 1, since it is a regular expression, the backslash is interpreted as a meta-character, not an ordinary character. The backslash in String 2 however is not in a regular expression and has already been processed by the Python interpreter, so it is interpreted as an ordinary character.
So the search() method is looking for 'a escape-t' in the string 'a\t' which are not a match.
To fix this we can tell the search() method to not interpret the '\' as a meta-character. We can do this by escaping it.
Consider the following call.
>>> match = re.search('a\\\\t','a\\t') // Match contains 'a\t'
Again, lets look at the strings after the Python interpreter has made its pass.
String 1: 'a\\t'
String 2: 'a\t'
Now when the search() method processes the regular expression, it sees that the second backslash is escaped by the first and should not be considered a meta-character. It therefore interprets the string as 'a\t', which matches String 2.
An alternate way to have search() consider '\' as a character is to place an r before the regular expression. This tells the Python interpreter to NOT preprocess the string.
Consider this.
>>> match = re.search(r'a\\t','a\\t') // match contains 'a\t'
Here the Python interpreter does not modify the first string but does process the second string. The strings passed to search() are:
String 1: 'a\\t'
String 2: 'a\t'
As in the previous example, search interprets the '\' as the single character '\' and not a meta-character, thus matches with String 2.
Python's own string parsing (partially) comes in your way.
If you want to see what re sees, type
print '\d'
print '\\d'
print '\\\d'
on the Python command prompt. You see that \d and \\d both result in \d, the latter one being taken care by the Python string parser.
If you want to avoid any hassle with these, use raw strings as suggested by the re module documentation: r'\\d' will result in \\d seen by the RE module.

Python putting r before unicode string variable

For static strings, putting an r in front of the string would give the raw string (e.g. r'some \' string'). Since it is not possible to put r in front of a unicode string variable, what is the minimal approach to dynamically convert a string variable to its raw form? Should I manually substitute all backslashes with double backslashes?
str_var = u"some text with escapes e.g. \( \' \)"
raw_str_var = ???
If you really need to escape a string, let's say you want to print a newline as \n, you can use the encode method with the Python specific string_escape encoding:
>>> s = "hello\nworld"
>>> e = s.encode("string_escape")
>>> e
"hello\\nworld"
>>> print s
hello
world
>>> print e
hello\nworld
You didn't mention anything about unicode, or which Python version you are using, but if you are dealing with unicode strings you should use unicode_escape instead.
>>> u = u"föö\nbär"
>>> print u
föö
bär
>>> print u.encode('unicode_escape')
f\xf6\xf6\nb\xe4r
Your post originally had the regex tag, maybe re.escape is what you're actually looking for?
>>> re.escape(u"foo\nbar\'baz")
u"foo\\\nbar\\'baz"
Not the "double escapes", ie printing the above string yields:
foo\
bar\'baz
There is nothing to convert - the r prefix is only significant in source code notation, not for program logic.
As a rule, if you use a single backslash in a normal string, it will automatically be converted to a double backslash if it doesn't start a valid escape sequence:
>>> "\n \("
'\n \\('
Since it may be difficult to remember all the valid/invalid escape sequences, raw string notation was introduced. But there is no way and no need to convert a string after it has been defined.
In your case, the correct approach would be to use
str_var = ur"some text with escapes e.g. \( \' \)"
which happens to result in the same string here, but is more explicit.

Adding backslashes without escaping [duplicate]

This question already has answers here:
Why do backslashes appear twice?
(2 answers)
Closed 8 years ago.
I need to escape a & (ampersand) character in a string. The problem is whenever I string = string.replace ('&', '\&') the result is '\\&'. An extra backslash is added to escape the original backslash. How do I remove this extra backslash?
The result '\\&' is only displayed - actually the string is \&:
>>> str = '&'
>>> new_str = str.replace('&', '\&')
>>> new_str
'\\&'
>>> print new_str
\&
Try it in a shell.
The extra backslash is not actually added; it's just added by the repr() function to indicate that it's a literal backslash. The Python interpreter uses the repr() function (which calls __repr__() on the object) when the result of an expression needs to be printed:
>>> '\\'
'\\'
>>> print '\\'
\
>>> print '\\'.__repr__()
'\\'
Python treats \ in literal string in a special way.
This is so you can type '\n' to mean newline or '\t' to mean tab
Since '\&' doesn't mean anything special to Python, instead of causing an error, the Python lexical analyser implicitly adds the extra \ for you.
Really it is better to use \\& or r'\&' instead of '\&'
The r here means raw string and means that \ isn't treated specially unless it is right before the quote character at the start of the string.
In the interactive console, Python uses repr to display the result, so that is why you see the double '\'. If you print your string or use len(string) you will see that it is really only the 2 characters
Some examples
>>> 'Here\'s a backslash: \\'
"Here's a backslash: \\"
>>> print 'Here\'s a backslash: \\'
Here's a backslash: \
>>> 'Here\'s a backslash: \\. Here\'s a double quote: ".'
'Here\'s a backslash: \\. Here\'s a double quote: ".'
>>> print 'Here\'s a backslash: \\. Here\'s a double quote: ".'
Here's a backslash: \. Here's a double quote ".
To Clarify the point Peter makes in his comment see this link
Unlike Standard C, all unrecognized
escape sequences are left in the
string unchanged, i.e., the backslash
is left in the string. (This behavior
is useful when debugging: if an escape
sequence is mistyped, the resulting
output is more easily recognized as
broken.) It is also important to note
that the escape sequences marked as
“(Unicode only)” in the table above
fall into the category of unrecognized
escapes for non-Unicode string
literals.
>>> '\\&' == '\&'
True
>>> len('\\&')
2
>>> print('\\&')
\&
Or in other words: '\\&' only contains one backslash. It's just escaped in the python shell's output for clarity.
printing a list can also cause this problem (im new in python, so it confused me a bit too):
>>>myList = ['\\']
>>>print myList
['\\']
>>>print ''.join(myList)
\
similarly:
>>>myList = ['\&']
>>>print myList
['\\&']
>>>print ''.join(myList)
\&
There is no extra backslash, it's just formatted that way in the interactive environment. Try:
print string
Then you can see that there really is no extra backslash.

Categories

Resources