Slash replacement inside a raw string - python

Just a simple question concerning raw string, regex pattern and replacement:
I have a string variable defined as follows:
> print repr(foo)
'\n\t\t\n\t\tIf (GUTIAttach>=1) //In case of GUTI attach Enodeb should not ask RRCUecapa again\n\t\tUECapInfo;//Mps("( \\"rat_Type\\":0 \\"ueCapabilitiesRAT_Container\\":hex:011c0000000080 )");
My problem is the characters "(" and ")": I want to replace them with "\(" and "\)" inside the raw string, because it will later be used as a regular expression pattern.
I tried to use this method:
foo_tmp= [inc.replace(')', '\)') for inc in foo]
foo_tmp= [inc.replace('(', '\(') for inc in foo_tmp]
foo = "".join(foo_tmp)
the result gives:
> print repr(foo)
'\n\t\t\n\t\tIf \\(GUTIAttach>=1\\) //In case of GUTI attach Enodeb should not ask RRCUecapa again\n\t\t{\n\t\t\tUECapInfo;//Mps\\("\\( \\"rat_Type\\":0 \\"ueCapabilitiesRAT_Container\\":hex:011c0000000080 \\)"\\);
The characters "(" and ")" appear to have been replaced by "\\(" and "\\)" instead of "\(" and "\)".
That is a bit unexpected to me. Do you know how I can get just a single backslash without changing the rest of the string?
Note: The method .decode('string_escape') does not work either, because of the rest of the string. Double backslashes already present in the original raw string must not change.
Thanks a lot for your help

Use the re.escape() function to escape regular expression meta characters for you.
What you are seeing is otherwise perfectly normal Python behaviour. You are looking at a Python literal representation: output that can be pasted back into a Python interpreter to recreate the value. Anything that could be interpreted as an escape code is therefore escaped for you, so a single \ is doubled to prevent it from being read as the start of an escape sequence:
>>> '\('
'\\('
>>> print '\\('
\(
You can see this at work in other places in your foo string; the \n character combination represents a newline character, not two separate characters \ and n. If you wanted to include a literal \ and n in the text, you'd have to double the backslash to \\n. Further on into the value of foo you'll find \\", which is a single backslash followed by a " quote.
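As a minimal sketch of both points, assuming Python 3 and a shortened, hypothetical stand-in for the asker's foo: re.escape() adds the backslashes, and comparing len(), print() and repr() suggests that the doubled backslash exists only in the repr output.

```python
import re

# Shortened, hypothetical stand-in for the asker's foo.
foo = 'If (GUTIAttach>=1) //Mps("( \\"rat_Type\\":0 )");'

# re.escape() backslash-escapes the regex metacharacters, parentheses included.
pattern = re.escape(foo)

# The escaped pattern matches the original text literally.
match = re.search(pattern, foo)
print(match.group(0) == foo)

# repr() doubles each backslash for display; the string itself holds one.
escaped_paren = re.escape('(')
print(repr(escaped_paren), len(escaped_paren))
print(escaped_paren)
```

The existing \\" sequences in foo are handled the same way: re.escape() escapes the backslash that is already in the string, so the pattern still matches it literally.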

Related

Escape Characters in Regex sub of Markdown Links to HTML Links

I'm trying to convert markdown of something like:
[Board Management](Boards/boardManagement.md)
to something like this using Python:
<a href='#' onclick='requestPage("Boards/boardManagement.md");'>Board Management</a>
I've found code for a re.sub as follows, but the only way I can get it to work is to not include any type of quotes around requestPage and the browser seems to automatically put them in...
filteredPage = re.sub('\[(.+)\]\((.+)\)', r"<a href='#' onclick=requestPage('\2');>\1</a>", pageContent)
where pageContent is the markdown. Though it seems to work, it would be better not to depend on the browser to insert the quotes, but every time I try to rewrite it with the quotes in, it doesn't produce the correct results. For example,
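filteredPage = re.sub('\[(.+)\]\((.+)\)', r"<a href='#' onclick='requestPage(\"\2\");'>\1</a>", pageContent)
results in
<a href='#' onclick='requestPage(\"Boards/boardManagement.md\");'>Board Management</a>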
Is there a way to accomplish the desired link with quotes around the onclick function, other than depending upon the browser to do it?
Summary
The problem you're having is that when you escape a quote in a raw string literal (r"..."), the backslash is not removed from the string. To see what I mean, look at what this code outputs:
print( "abc \" def") # abc " def (the backslash is gone)
print(r"abc \" def") # abc \" def (the backslash is in the string)
In most cases, the solution is to use a triple-quoted string:
print( """abc \" def""") # abc " def (this is the same as the first one)
print(r"""abc " def""" ) # abc " def (this is how to get quotes in a raw string)
So your code becomes this:
re.sub(r'\[(.+)\]\((.+)\)',
r"""<a href='#' onclick='requestPage("\2");'>\1</a>""",
pageContent)
Another option would be to use ' for your string, and put the href attribute in ": you could have something like r'<a href="#" onclick="request...">'.
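Here is a small runnable sketch of the triple-quoted approach, assuming Python 3. The sample input is the markdown link from the question; the variable names are only illustrative.

```python
import re

page_content = "[Board Management](Boards/boardManagement.md)"

# Raw triple-quoted replacement: both quote styles appear unescaped,
# while \1 and \2 still reach re.sub() as group references.
replacement = r"""<a href='#' onclick='requestPage("\2");'>\1</a>"""

filtered_page = re.sub(r'\[(.+)\]\((.+)\)', replacement, page_content)
print(filtered_page)
```

Note that (.+) is greedy, so a line holding several links may need the non-greedy (.+?) instead.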
Explanation
The key to understanding how raw string literals work may be this: if you use a backslash in a raw string literal, it will be included in the string.
Raw string literals are only mostly raw. The one exception is quotations. This lets you include quotation marks in your string. But unlike a regular string, if you escape a quotation in a raw string literal, the backslash will still be in the string.
This is specified in the last paragraph of the section on string literals:
Even in a raw literal, quotes can be escaped with a backslash, but the backslash remains in the result; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote
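The quoted rule can be checked directly. This is a minimal sketch, assuming Python 3; the variable names are made up for illustration.

```python
raw_escaped = r"\""                      # raw literal: the backslash stays
regular_escaped = "\""                   # regular literal: the backslash is consumed
raw_triple = r"""a "quoted" word"""      # raw, with unescaped quotes inside

print(len(raw_escaped), len(regular_escaped))
print(raw_triple)
```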
The solution to your problem is to use a triple-quoted raw string literal and not escape the quote, as shown above.
In more extreme cases, you can use string literal concatenation to help with escaping strings, but this probably isn't a good use case for it. I'd only use it if (a) the string needed to contain both """ and ''', or (b) I was already using string literal concatenation for another reason (like splitting a long string across multiple lines).
And one last thing: You should be using raw string literals for your regular expressions. It isn't necessary for the regex you have here, but it makes it much easier to write (and read) regular expressions, because every backslash is always in the string, so you get to read exactly what the regex engine will read.
More importantly, unrecognized escape sequences (which include \( and \[) are being phased out and will eventually raise a SyntaxError, so if you want your code to keep working in as many future versions of Python as possible, put your regular expressions in raw literals.
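One way to see this for yourself, sketched under the assumption of Python 3.6 or later: compile a one-line program containing \( in a regular literal and record the warnings. The warning category depends on the Python version (DeprecationWarning in older releases, SyntaxWarning in newer ones).

```python
import warnings

# Source text of a tiny program that uses the unrecognized escape \(
# in a regular (non-raw) string literal.
source = r"pattern = '\('"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    compile(source, "<example>", "exec")

for warning in caught:
    print(warning.category.__name__, warning.message)

# The same program written with a raw literal compiles without warnings.
with warnings.catch_warnings(record=True) as caught_raw:
    warnings.simplefilter("always")
    compile(r"pattern = r'\('", "<example>", "exec")

print(len(caught_raw))
```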

Searching for a raw string (r'\n') within a raw string full of ('\n') resulting in an empty result set [duplicate]

I am confused with the backslash in regular expressions. Within a regex a \ has a special meaning, e.g. \d means a decimal digit. If you add a backslash in front of the backslash this special meaning gets lost. In the regex-howto one can read:
Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
So print(re.search('\d', '\d')) gives None because \d matches any decimal digit but there is none in \d.
I now would expect print(re.search('\\d', '\d')) to match \d but the answer is still None.
Only print(re.search('\\\d', '\d')) gives as output <_sre.SRE_Match object; span=(0, 2), match='\\d'>.
Does someone have an explanation?
The confusion is due to the fact that the backslash character \ is used as an escape at two different levels. First, the Python interpreter itself performs substitutions for \ before the re module ever sees your string. For instance, \n is converted to a newline character, \t is converted to a tab character, etc. To get an actual \ character, you can escape it as well, so \\ gives a single \ character. If the character following the \ isn't a recognized escape character, then the \ is treated like any other character and passed through, but I don't recommend depending on this. Instead, always escape your \ characters by doubling them, i.e. \\.
If you want to see how Python is expanding your string escapes, just print out the string. For example:
s = 'a\\b\tc'
print(s)
If s is part of an aggregate data type, e.g. a list or a tuple, and if you print that aggregate, Python will enclose the string in single quotes and will include the \ escapes (in a canonical form), so be aware of how your string is being printed. If you just type a quoted string into the interpreter, it will also display it enclosed in quotes with \ escapes.
Once you know how your string is being encoded, you can then think about what the re module will do with it. For instance, if you want to escape \ in a string you pass to the re module, you will need to pass \\ to re, which means you will need to use \\\\ in your quoted Python string. The Python string will end up with \\ and the re module will treat this as a single literal \ character.
An alternative way to include \ characters in Python strings is to use raw strings, e.g. r'a\b' is equivalent to "a\\b".
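The two levels can be separated in a short sketch, assuming Python 3. Raw literals are used for the subject so that only the pattern varies; the variable names are illustrative.

```python
import re

subject = r'\d'                  # two characters: backslash, d

# Each pattern is annotated with what the re module actually receives.
digit_class = r'\d'              # \d   -> "any decimal digit"
literal_backslash_d = r'\\d'     # \\d  -> a literal backslash, then d
same_without_raw = '\\\\d'       # also \\d, after Python halves the backslashes

print(re.search(digit_class, subject))
print(re.search(literal_backslash_d, subject))
print(same_without_raw == literal_backslash_d)
```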
An r character before the regular expression in a call to search() specifies that the regular expression is a raw string. This allows backslashes to be used in the regular expression as regular characters rather than in an escape sequence of characters. Let me explain ...
Before the re module's search method processes the strings that are passed to it, the Python interpreter takes an initial pass over the string. If there are backslashes present in a string, the Python interpreter must decide if each is part of a Python escape sequence (e.g. \n or \t) or not.
Note: at this point Python does not care whether or not '\' is a regular expression meta-character.
If the '\' is followed by a recognized Python escape character (t,n, etc.), then the backslash and the escape character are replaced with the actual Unicode or 8-bit character. For example, '\t' would be replaced with the ASCII character for tab. Otherwise it is passed by and interpreted as a '\' character.
Consider the following.
>>> s = '\t'
>>> print("[" + s + "]")  # an actual tab character after preprocessing
[    ]
>>> s = '\d'
>>> print("[" + s + "]")  # '\d' after preprocessing
[\d]
Sometimes we want to include in a string a character sequence that includes '\' without it being interpreted by Python as an escape sequence. To do this we escape the '\' with a '\'. Now when Python sees '\\' it replaces the two backslashes with a single '\' character.
>>> s = '\\t'
>>> print("[" + s + "]")  # '\t' after preprocessing
[\t]
After the Python interpreter takes a pass over both strings, they are passed to the re module's search method. The search method parses the regular expression string to identify the regular expression's meta-characters.
Now '\' is also a special regular expression meta-character and is interpreted as one UNLESS it is escaped at the time that the re search() method is executed.
Consider the following call.
>>> match = re.search('a\\t', 'a\\t')  # match is None
Here, match is None. Why? Let's look at the strings after the Python interpreter makes its pass.
String 1: 'a\t'
String 2: 'a\t'
So why is match equal to None? When search() interprets String 1, since it is a regular expression, the backslash is interpreted as a meta-character, not an ordinary character. The backslash in String 2 however is not in a regular expression and has already been processed by the Python interpreter, so it is interpreted as an ordinary character.
So the search() method is looking for 'a' followed by a tab character in the string 'a\t' (a, backslash, t), which is not a match.
To fix this we can tell the search() method to not interpret the '\' as a meta-character. We can do this by escaping it.
Consider the following call.
>>> match = re.search('a\\\\t', 'a\\t')  # match contains 'a\t'
Again, let's look at the strings after the Python interpreter has made its pass.
String 1: 'a\\t'
String 2: 'a\t'
Now when the search() method processes the regular expression, it sees that the second backslash is escaped by the first and should not be considered a meta-character. It therefore interprets the string as 'a\t', which matches String 2.
An alternative way to have search() treat '\' as a character is to place an r before the regular expression. This tells the Python interpreter NOT to preprocess the string.
Consider this.
>>> match = re.search(r'a\\t', 'a\\t')  # match contains 'a\t'
Here the Python interpreter does not modify the first string but does process the second string. The strings passed to search() are:
String 1: 'a\\t'
String 2: 'a\t'
As in the previous example, search interprets the '\' as the single character '\' and not a meta-character, thus matches with String 2.
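The three calls walked through above can be run side by side. This is a sketch assuming Python 3; each pattern is annotated with what the re module receives after Python's own pass.

```python
import re

subject = 'a\\t'               # three characters: a, backslash, t

tab_pattern = 'a\\t'           # re sees a\t   -> "a" followed by a tab
doubled_pattern = 'a\\\\t'     # re sees a\\t  -> "a", literal backslash, "t"
raw_pattern = r'a\\t'          # re sees a\\t  (no Python-level processing)

print(list(subject))
print(re.search(tab_pattern, subject))
print(re.search(doubled_pattern, subject).group(0))
print(re.search(raw_pattern, subject).group(0))

# The first pattern does match a subject that contains a real tab.
print(re.search(tab_pattern, 'a\t') is not None)
```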
Python's own string parsing (partially) gets in your way.
If you want to see what re sees, type
print('\d')
print('\\d')
print('\\\d')
at the Python prompt. You will see that '\d' and '\\d' both result in \d (in the latter, the Python string parser has collapsed the doubled backslash), while '\\\d' results in \\d.
If you want to avoid any hassle with these, use raw strings as suggested by the re module documentation: r'\\d' will result in \\d seen by the re module.

Defining file paths in python with EOL string literal errors [duplicate]

Technically, any odd number of backslashes, as described in the documentation.
>>> r'\'
File "<stdin>", line 1
r'\'
^
SyntaxError: EOL while scanning string literal
>>> r'\\'
'\\\\'
>>> r'\\\'
File "<stdin>", line 1
r'\\\'
^
SyntaxError: EOL while scanning string literal
It seems like the parser could just treat backslashes in raw strings as regular characters (isn't that what raw strings are all about?), but I'm probably missing something obvious.
The common misconception about Python's raw strings is that a backslash within a raw string is just a regular character like any other. It is NOT. The key to understanding this is the following passage from the Python tutorial:
When an 'r' or 'R' prefix is present, a character following a
backslash is included in the string without change, and all
backslashes are left in the string
So any character following a backslash is part of the raw string. Once the parser enters a raw string (a non-Unicode one) and encounters a backslash, it knows there are two characters: the backslash and the character following it.
This way:
r'abc\d' comprises a, b, c, \, d
r'abc\'d' comprises a, b, c, \, ', d
r'abc\'' comprises a, b, c, \, '
and:
r'abc\' comprises a, b, c, \, ' but now there is no terminating quote.
The last case shows that, according to the documentation, the parser cannot find a closing quote, because the last quote you see above is part of the string. In other words, a backslash cannot come last, as it would 'devour' the closing quote character.
The reason is explained in this part of that section of the documentation:
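These cases can be checked with a short sketch, assuming Python 3. Since r'\' cannot appear in a source file that is meant to run, its source text is handed to compile() instead; the variable names are illustrative.

```python
# The backslash and the character after it both stay in the string.
print(list(r'abc\d'))
print(list(r'abc\'d'))

# Source text of the literal r'\' (four characters: r, quote, backslash, quote).
bad_source = "r'\\'"

try:
    compile(bad_source, "<example>", "eval")
    ends_in_backslash_ok = True
except SyntaxError:
    # The final quote was consumed as part of the string, so no closing quote is left.
    ends_in_backslash_ok = False

print(ends_in_backslash_ok)
```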
String quotes can be escaped with a
backslash, but the backslash remains
in the string; for example, r"\"" is a
valid string literal consisting of two
characters: a backslash and a double
quote; r"\" is not a valid string
literal (even a raw string cannot end
in an odd number of backslashes).
Specifically, a raw string cannot end
in a single backslash (since the
backslash would escape the following
quote character). Note also that a
single backslash followed by a newline
is interpreted as those two characters
as part of the string, not as a line
continuation.
So raw strings are not 100% raw, there is still some rudimentary backslash-processing.
That's the way it is! I see it as one of those small defects in Python!
I don't think there's a good reason for it, but it's definitely not a parsing limitation; it would be really easy to parse raw strings with \ as the last character.
The catch is that if you allow \ to be the last character in a raw string, you can no longer put an escaped " inside a raw string. It seems Python chose allowing " over allowing \ as the last character.
However, this shouldn't cause any trouble.
If you're worried about not being able to easily write Windows folder paths such as c:\mypath\, worry not: you can represent them as r"C:\mypath", and if you need to append a subdirectory name, don't do it with string concatenation, which is not the right way to do it anyway. Use os.path.join:
>>> import os
>>> os.path.join(r"C:\mypath", "subfolder")
'C:\\mypath\\subfolder'
To end a raw string with a backslash, you can use this trick:
>>> print r"c:\test"'\\'
c:\test\
It uses the implicit concatenation of string literals in Python and concatenates one string delimited with double quotes with another that is delimited by single quotes. Ugly, but works.
Another trick is to use chr(92) as it evaluates to "\".
I recently had to clean a string of backslashes and the following did the trick:
CleanString = DirtyString.replace(chr(92),'')
I realize that this does not take care of the "why" but the thread attracts many people looking for a solution to an immediate problem.
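The workarounds mentioned above can be compared in one sketch, assuming Python 3. It uses ntpath, the Windows flavour of os.path, in place of os.path itself so that the results are the same on any platform; that substitution is mine, not part of the answers above.

```python
import ntpath  # Windows flavour of os.path, importable on every platform

trailing = r"c:\test" "\\"                 # implicit concatenation of two literals
with_chr = r"c:\test" + chr(92)            # chr(92) is the backslash character
joined = ntpath.join(r"C:\mypath", "subfolder")
joined_trailing = ntpath.join(r"C:\mypath", "")   # empty last part gives a trailing separator
cleaned = trailing.replace(chr(92), '')    # strip every backslash

print(trailing)
print(joined)
print(joined_trailing)
print(cleaned)
```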
Since \" is allowed inside the raw string, it can't be used to identify the end of the string literal.
Why not stop parsing the string literal when you encounter the first "?
If that was the case, then \" wouldn't be allowed inside the string literal. But it is.
The reason why r'\' is syntactically incorrect is that, although the string expression is raw, the quote character used as the delimiter (single or double) always has to be escaped inside the string, since it would otherwise mark the end of the literal. So if you want to express a single quote inside a single-quoted string, there is no other way than using \'. The same applies to double quotes.
But you could use:
'\\'
Another user who has since deleted their answer (not sure if they'd like to be credited) suggested that the Python language designers may be able to simplify the parser design by using the same parsing rules and expanding escaped characters to raw form as an afterthought (if the literal was marked as raw).
I thought it was an interesting idea and am including it as community wiki for posterity.
Naive raw strings
The naive idea of a raw string is
If I put an r in front of a pair of quotes,
I can put whatever I want between the quotes
and it will mean itself.
Unfortunately, this does not work, because if the whatever
happens to contain a quote, the raw string would end at that point.
It is simply impossible that I can put "whatever I want"
between fixed delimiters, because some of it could look like
the terminating delimiter -- no matter what that delimiter is.
Real-world raw strings (variant 1)
One possible approach to this problem would be to say
If I put an r in front of a pair of quotes,
I can put whatever I want between the quotes
as long as it does not contain a quote
and it will mean itself.
This restriction sounds harsh, until one recognizes that
Python's large offering of quotes can accommodate most situations
with this rule. The following are all valid Python quotes:
'
"
'''
"""
With this many possibilities for the delimiter, almost anything
can be made to work.
About the only exception would be if the string
literal is supposed to contain a complete list of all allowed
Python quotes.
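As a rough illustration of variant 1 (the sample texts are invented), each raw literal below simply picks a delimiter that does not occur in its content, so no backslash is needed:

```python
# Each literal uses a delimiter that does not appear in the text itself.
with_single = r"It's a raw string"    # contains ', delimited by "
with_double = r'She said "hi"'        # contains ", delimited by '
with_both = r'''It's "raw"'''         # contains both, delimited by '''

for text in (with_single, with_double, with_both):
    print(text)
```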
Real-world raw strings (variant 2, as in Python)
Python, however, takes a different route using
an extended version of the above rule.
It effectively states
If I put an r in front of a pair of quotes,
I can put whatever I want between the quotes
as long as it does not contain a quote
and it will mean itself.
If I insist on including a quote, even that is allowed,
but I have to put a backslash before it.
So the Python approach is, in a sense, even more liberal
than variant 1 above -- but it has the side effect of
"mis"interpreting the closing quote as part of the string
if the last intended character of the string is a backslash.
Variant 2 is not helpful:
If I want the quote in my string,
but not the backslash, the allowed version of my string literal
will not be what I need.
However, given the three different other kinds of quotes I have
at my disposal, I will probably just pick one of those and my
problem will be solved -- so this is not a problematic case.
The problematic case is this one:
If I want my string to end with a backslash, I am at a loss.
I need to resort to concatenating a non-raw string literal
containing the backslash.
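The concatenation could look like the sketch below. The path is a made-up example; the technique relies on Python joining adjacent string literals at compile time.

```python
# The raw part holds everything except the final backslash; the adjacent
# non-raw literal '\\' contributes that last character.
folder = r'C:\data\logs' '\\'   # hypothetical path

print(len(folder))              # 13
print(folder.endswith('\\'))    # True
```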
Conclusion
After writing this, I agree with several of the other posters
that variant 1 would have been easier to understand and to accept
and therefore more pythonic. That's life!
Coming from C, it is pretty clear to me that a single \ works as an escape character, allowing you to put special characters such as newlines, tabs and quotes into strings.
That does indeed disallow \ as the last character, since it will escape the " and make the parser choke. But as pointed out earlier \ is legal.
Some tips:
1) If you need to manipulate backslashes in paths, the standard Python module os.path is your friend. For example:
os.path.normpath('c:/folder1/')
2) If you want to build strings containing backslashes, but without a backslash at the END of the string, a raw string is your friend (use the 'r' prefix before the string literal). For example:
r'\one \two \three'
3) If you need to prefix a string in a variable X with a backslash, you can do this:
X='dummy'
bs=r'\ ' # don't forget the space after the backslash or you will get an EOL error
X2=bs[0]+X # X2 now contains \dummy
4) If you need to create a string with a backslash at the end, combine tips 2 and 3:
voice_name='upper'
lilypond_display=r'\DisplayLilyMusic \ ' # don't forget the space at the end
lilypond_statement=lilypond_display[:-1]+voice_name
Now lilypond_statement contains "\DisplayLilyMusic \upper".
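A possible variation on tips 1 and 3, sketched for Python 3. It uses ntpath, the Windows flavour of os.path that can be imported on any platform, so the result does not depend on where it runs; the file name is a made-up example.

```python
import ntpath  # Windows implementation of os.path, available everywhere

# normpath converts forward slashes and drops the trailing separator.
folder = ntpath.normpath('c:/folder1/')
print(repr(folder))    # 'c:\\folder1'

# join inserts the backslash, so no literal has to end with one.
target = ntpath.join(folder, 'notes.txt')   # hypothetical file name
print(repr(target))    # 'c:\\folder1\\notes.txt'

# Prefixing with a backslash also works with a plain escaped literal.
x2 = '\\' + 'dummy'
print(x2)              # \dummy
```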
long live python ! :)
n3on
Despite its role, even a raw string cannot end in a single
backslash, because the backslash escapes the following quote
character—you still must escape the surrounding quote character to
embed it in the string. That is, r"...\" is not a valid string
literal—a raw string cannot end in an odd number of backslashes.
If you need to end a raw string with a single backslash, you can use
two and slice off the second.
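The slicing trick from that passage might look like this (a sketch; the sample text is arbitrary):

```python
# A raw literal may end in an even number of backslashes, so write two ...
doubled = r'1\nb\tc\\'
# ... and slice the second one off again.
trimmed = doubled[:-1]

print(len(doubled))   # 9
print(len(trimmed))   # 8
print(trimmed == '1\\nb\\tc\\')   # True
```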
Given the confusion around the arbitrary-seeming restriction against an odd number of backslashes at the end of a Python raw-string, it's fair to say that this is a design mistake or legacy issue originating in a desire to have a simpler parser.
While workarounds (such as r'C:\some\path' '\\' yielding 'C:\\some\\path\\' (in Python notation) or C:\some\path\ (verbatim)) are simple, it's counterintuitive to be needing them. For comparison, let's have a look at C++ and Perl.
In C++, we can straightforwardly use raw string literal syntax
#include <iostream>
int main() {
    std::cout << R"(Hello World!)" << std::endl;
    std::cout << R"(Hello World!\)" << std::endl;
    std::cout << R"(Hello World!\\)" << std::endl;
    std::cout << R"(Hello World!\\\)" << std::endl;
}
to get the following output:
Hello World!
Hello World!\
Hello World!\\
Hello World!\\\
If we want to use the closing delimiter (above: )) within the string literal, we can even extend the syntax in an ad-hoc way to R"delimiterString(quotedMaterial)delimiterString". For example, R"asdf(some random delimiters: ( } [ ] { ) < > just for fun)asdf" produces the string some random delimiters: ( } [ ] { ) < > just for fun in the output. (Ain't that a good use of "asdf"!)
In Perl, this code
my $str = q{This is a test.\\};
print ($str);
print ("This is another test.\n");
will output the following: This is a test.\This is another test.
Replacing the first line by
my $str = q{This is a test.\};
would lead to an error message: Can't find string terminator "}" anywhere before EOF at main.pl line 1.
However, Perl treating a pre-delimiter \ as an escape character doesn't prevent the user from having an odd number of backslashes at the end of the resulting string; e.g., to place 3 backslashes \\\ at the end of $str, simply end the code with 6 backslashes: my $str = q{This is a test.\\\\\\};. Importantly, while we need to double the backslashes in the input, there is no Python-like inconsistent-seeming syntactic restriction.
Another way of looking at things is that these 3 languages use different ways to address the parsing issue of interaction between escape characters and closing delimiters:
Python: disallows an odd number of backslashes just before the closing delimiter; a simple workaround is r'stringWithoutFinalBackslash' '\\'
C++: allows essentially¹ everything between the delimiters
Perl: allows essentially² everything between the delimiters, but backslashes need to be consistently doubled
¹ The custom delimiterString itself cannot be more than 16 characters long, but that's hardly a limitation.
² If you need the delimiter itself, just escape it with \.
However, to be fair in a comparison to Python, we need to acknowledge that (1) C++ didn't have such string literals until C++11 and is famously hard to parse and (2) Perl is even harder to parse.
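The Python side of this comparison can be checked directly. The following Python 3 sketch (is_valid_literal is a helper name made up for this example) asks the compiler whether raw literals ending in one to four backslashes are accepted:

```python
def is_valid_literal(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, '<literal>', 'eval')
    except SyntaxError:
        return False
    return True

backslash = chr(92)

# Build the source text r'abc\', r'abc\\', ... and test each one.
results = {}
for count in range(1, 5):
    source = "r'abc" + backslash * count + "'"
    results[count] = is_valid_literal(source)

print(results)  # {1: False, 2: True, 3: False, 4: True}
```

Only the even counts compile, which matches the restriction described above.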
I encountered this problem and found a partial solution which is good for some cases. Although a Python raw string literal cannot end with a single backslash, a string that ends with one can still be serialized and saved in a text file with a single backslash at the end. Therefore, if what you need is to save text with a single backslash on your computer, it is possible:
x = 'a string\\'
x
'a string\\'
# Now save it in a text file and it will appear with a single backslash:
with open("my_file.txt", 'w') as h:
    h.write(x)
BTW, this does not work with JSON if you dump it using Python's json library: the serialized text contains the backslash doubled, because JSON escapes backslashes.
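A sketch of both behaviours, writing to a temporary directory instead of the current one (that location and the variable names are choices made for this example):

```python
import json
import os
import tempfile

text = 'a string\\'   # 9 characters, the last one is a single backslash

# Plain text: the file holds exactly the characters of the string.
path = os.path.join(tempfile.mkdtemp(), 'my_file.txt')
with open(path, 'w') as handle:
    handle.write(text)
with open(path) as handle:
    from_file = handle.read()

# JSON: the backslash is escaped (doubled) in the serialized text ...
encoded = json.dumps(text)
# ... and restored when that text is parsed again.
decoded = json.loads(encoded)

print(from_file == text)   # True
print(len(encoded))        # 12
print(decoded == text)     # True
```

So the doubled backslash only exists in the JSON text on disk; loading it with json.loads gives back the original string.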
Finally, I work with Spyder, and I noticed that if I open the variable in Spyder's text editor by double-clicking on its name in the variable explorer, it is presented with a single backslash and can be copied to the clipboard that way (it's not very helpful for most needs, but maybe for some).
