I know raw string, like r'hello world', prevents escaping.
Is it a good practice to always prepend the r symbol even if the string doesn't have any escaping sequences?
Say my exception needs some string literal explanation, I need to connect to a website whose url is a string literal. They don't have backslash. Are there any performance differences between raw string and regular string?
The r sigil means "backslashes in this string are literal backslashes". Putting this sigil on a string which doesn't contain any backslashes is harmless but sometimes mildly confusing to a human reader. A better approach is probably to only use this sigil when you actually need it.
Situations where the string may not contain backslashes at the moment, but where you might expect to add one in the future, such as in regular expressions and Windows file paths, would probably qualify as useful exceptions.
re.findall(r'hello', string) # what if we switch to r'hello\.'?
with open(r'file.txt') as handle: # what if we switch to r'sub\file.txt'?
It would be easy to forget to add the r when you add a backslash, so supplying it in advance has some merit here.
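On the performance question, a minimal sketch suggesting there is nothing to measure: the r prefix only affects how the source text is read, so for a literal without backslashes both spellings should compile to the same constant. The file name "<demo>" passed to compile() is just a placeholder.

```python
# A raw and a regular literal without backslashes denote the same string.
plain = 'hello world'
raw = r'hello world'
same_value = plain == raw

# Both source spellings compile to code objects holding the same constants,
# so no trace of the prefix is left at run time.
plain_consts = compile("'hello world'", "<demo>", "eval").co_consts
raw_consts = compile("r'hello world'", "<demo>", "eval").co_consts
same_consts = plain_consts == raw_consts

print(same_value, same_consts)
```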
You can do that in Python, but I don't recommend it: if you later add an escape sequence such as '\n' to the string, the r prefix keeps it as a literal backslash followed by n instead of a newline. Reserve the prefix for regular expressions and Windows paths.
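To illustrate the '\n' point, a small sketch comparing the two spellings of the same text (the sample text is made up):

```python
regular = 'line one\nline two'   # \n becomes a single newline character
raw = r'line one\nline two'      # \n stays as a backslash followed by n

print(len(regular), len(raw))
print(raw == 'line one\\nline two')
```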
I am new to Python. Can someone tell me what is the difference between these two regex statements (re.findall(r"\d+","i am aged 35")) and (re.findall("\d+","i am aged 35")).
I had the understanding that the raw string in the first statement would make "\d+" inactive, because that is the primary role of a raw string: to make escape characters inactive. In other words, "\d+" would not be a metacharacter for finding/searching/matching digits if a raw string is used. However, I now see that both statements return the same result.
Both the Python parser and the regular expression parser handle escape sequences. This means that for any escape sequence that both engines support, you must either double the backslashes or use a raw string literal so the Python parser doesn't try to interpret the escape sequence.
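In this case, \d has no meaning to Python, so the backslash is left in place for the re module to handle. So here specifically, there is no difference in the result between the two snippets, although recent Python versions emit a warning for unrecognized escapes such as \d in a regular string.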
However, if you needed to match a literal backslash before other text like section in your regular expression, without raw strings, you'd have to use '\\\\section' to define the pattern! That's because the Python parser reads each '\\' as an escape sequence producing a single backslash, so the regular expression parser receives \\section: an escaped backslash followed by section. With only '\\section', the regular expression parser would receive \section and see the start of the escape sequence \s.
See the section on backslashes and raw string literals in the Python regular expression HOWTO.
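The two layers of escaping described above can be checked with a short sketch. The sample text is made up for illustration.

```python
import re

text = r'see \section one, aged 35'

# Doubling the backslash in a regular literal equals the raw spelling.
digits_same = '\\d+' == r'\d+'
digits = re.findall(r'\d+', text)

# Matching a literal backslash: the regex engine needs \\, so a regular
# literal needs four backslashes while a raw literal needs two.
pattern_same = '\\\\section' == r'\\section'
sections = re.findall(r'\\section', text)

print(digits, sections)
```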
I'm trying to understand why python has this unheard of behavior.
If I'm writing a raw string, it is much more likely that I won't want quotes to be escaped.
This behavior forces us to write this weird code:
s = r'something' + '\\' instead of just s = r'something\'
any ideas why python developers found this more sensible?
EDIT:
I'm not asking why it is so. I'm asking what motivated this design decision, or whether anyone finds anything good in it.
The r prefix for a string literal doesn't disable escaping, it changes escaping so that the sequence \x (where x is any character) is "converted" to itself. So then, \' emits \' and your string is unterminated because there's no ' at the end of it.
The decision to disallow an unpaired ending backslash in a raw string is explained in this faq:
Raw strings were designed to ease creating input for processors (chiefly regular expression engines) that want to do their own backslash escape processing. Such processors consider an unmatched trailing backslash to be an error anyway, so raw strings disallow that. In return, they allow you to pass on the string quote character by escaping it with a backslash. These rules work well when r-strings are used for their intended purpose.
Technically, any odd number of backslashes, as described in the documentation.
>>> r'\'
File "<stdin>", line 1
r'\'
^
SyntaxError: EOL while scanning string literal
>>> r'\\'
'\\\\'
>>> r'\\\'
File "<stdin>", line 1
r'\\\'
^
SyntaxError: EOL while scanning string literal
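The odd/even rule can also be checked programmatically. A minimal sketch, using compile() to ask whether a piece of source text is a valid expression; is_valid_literal is a hypothetical helper name, not part of any library.

```python
def is_valid_literal(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, "<literal>", "eval")
    except SyntaxError:
        return False
    return True

results = {}
for count in range(1, 5):
    # Build the source text r'\', r'\\', r'\\\', r'\\\\' in turn.
    literal = "r'" + "\\" * count + "'"
    results[count] = is_valid_literal(literal)

print(results)
```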
It seems like the parser could just treat backslashes in raw strings as regular characters (isn't that what raw strings are all about?), but I'm probably missing something obvious.
The common misconception about Python's raw strings is that a backslash within a raw string is just a regular character like any other. It is NOT. The key to understanding this is the following passage from Python's tutorial:
When an 'r' or 'R' prefix is present, a character following a
backslash is included in the string without change, and all
backslashes are left in the string
So any character following a backslash is part of the raw string. Once the parser enters a raw string (a non-Unicode one) and encounters a backslash, it knows there are 2 characters (the backslash and the character following it).
This way:
r'abc\d' comprises a, b, c, \, d
r'abc\'d' comprises a, b, c, \, ', d
r'abc\'' comprises a, b, c, \, '
and:
r'abc\' comprises a, b, c, \, ' but there is no terminating quote now.
The last case shows that, according to the documentation, the parser cannot find a closing quote, because the last quote you see above is part of the string. In other words, a backslash cannot be last here, as it would 'devour' the closing quote of the string.
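The breakdown above can be reproduced by listing the characters of each valid literal:

```python
# Each backslash stays in the string together with the character after it.
plain_escape = list(r'abc\d')
middle_quote = list(r'abc\'d')
trailing_quote = list(r'abc\'')

print(plain_escape)
print(middle_quote)
print(trailing_quote)
```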
The reason is explained in the part of that section which I highlighted in bold:
String quotes can be escaped with a
backslash, but the backslash remains
in the string; for example, r"\"" is a
valid string literal consisting of two
characters: a backslash and a double
quote; r"\" is not a valid string
literal (even a raw string cannot end
in an odd number of backslashes).
Specifically, a raw string cannot end
in a single backslash (since the
backslash would escape the following
quote character). Note also that a
single backslash followed by a newline
is interpreted as those two characters
as part of the string, not as a line
continuation.
So raw strings are not 100% raw, there is still some rudimentary backslash-processing.
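A short sketch of the two cases from the quoted passage: the escaped quote and the backslash followed by a newline. A triple-quoted raw string is used for the second case purely to keep the example easy to read.

```python
# r"\"" is two characters: the backslash and the double quote.
escaped_quote = r"\""
print(len(escaped_quote), list(escaped_quote))

# Inside a raw string, backslash + newline is kept as those two characters
# instead of acting as a line continuation.
two_lines = r'''first\
second'''
print(repr(two_lines))
```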
That's the way it is! I see it as one of those small defects in python!
I don't think there's a good reason for it, but it's definitely not a parsing limitation; it's really easy to parse raw strings with \ as the last character.
The catch is, if you allow \ to be the last character in a raw string then you won't be able to put " inside a raw string. It seems python went with allowing " instead of allowing \ as the last character.
However, this shouldn't cause any trouble.
If you're worried about not being able to easily write Windows folder paths such as c:\mypath\, worry not: you can represent them as r"C:\mypath", and if you need to append a subdirectory name, don't do it with string concatenation, which is not the right way to do it anyway. Use os.path.join:
>>> import os
>>> os.path.join(r"C:\mypath", "subfolder")
'C:\\mypath\\subfolder'
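The session above assumes Windows, where os.path is the ntpath module. A sketch that should behave the same on any platform imports ntpath directly, and also shows pathlib.PureWindowsPath as an alternative:

```python
import ntpath
from pathlib import PureWindowsPath

base = r"C:\mypath"

# ntpath is the Windows flavour of os.path and is importable on any platform.
joined = ntpath.join(base, "subfolder")

# pathlib offers the same result through the / operator.
joined_pathlib = str(PureWindowsPath(base) / "subfolder")

# A trailing separator can be requested by joining with an empty component.
with_trailing = ntpath.join(base, "")

print(joined, joined_pathlib, with_trailing)
```

Joining with an empty last component is one way to get the trailing backslash without writing it in a literal at all.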
To end a raw string with a backslash, I suggest this trick:
>>> print(r"c:\test" '\\')
c:\test\
It uses Python's implicit concatenation of adjacent string literals, joining a raw string delimited with double quotes to a regular string delimited with single quotes. Ugly, but it works.
Another trick is to use chr(92), which evaluates to a single backslash ('\\').
I recently had to clean a string of backslashes and the following did the trick:
CleanString = DirtyString.replace(chr(92),'')
I realize that this does not take care of the "why" but the thread attracts many people looking for a solution to an immediate problem.
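The workarounds mentioned above produce the same string, as this sketch shows (the path is just an example value):

```python
base = r"c:\test"

# Adjacent literals are concatenated at compile time, so a raw literal can be
# followed by a regular one that supplies the trailing backslash.
implicit = r"c:\test" '\\'

# chr(92) is the backslash character, so this builds the same string.
with_chr = base + chr(92)

print(implicit)
print(implicit == with_chr)

# Removing every backslash, as in the replace() call above.
cleaned = implicit.replace(chr(92), '')
print(cleaned)
```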
Since \" is allowed inside the raw string, it can't be used to identify the end of the string literal.
Why not stop parsing the string literal when you encounter the first "?
If that was the case, then \" wouldn't be allowed inside the string literal. But it is.
The reason r'\' is syntactically incorrect is that, although the string expression is raw, the quote character used as the delimiter (single or double) always has to be escaped inside the string, since it would otherwise mark the end of the literal. So if you want to express a single quote inside a single-quoted string, there is no other way than using \'. The same applies to double quotes.
But you could use:
'\\'
Another user who has since deleted their answer (not sure if they'd like to be credited) suggested that the Python language designers may be able to simplify the parser design by using the same parsing rules and expanding escaped characters to raw form as an afterthought (if the literal was marked as raw).
I thought it was an interesting idea and am including it as community wiki for posterity.
Naive raw strings
The naive idea of a raw string is
If I put an r in front of a pair of quotes,
I can put whatever I want between the quotes
and it will mean itself.
Unfortunately, this does not work, because if the whatever
happens to contain a quote, the raw string would end at that point.
It is simply impossible that I can put "whatever I want"
between fixed delimiters, because some of it could look like
the terminating delimiter -- no matter what that delimiter is.
Real-world raw strings (variant 1)
One possible approach to this problem would be to say
If I put an r in front of a pair of quotes,
I can put whatever I want between the quotes
as long as it does not contain a quote
and it will mean itself.
This restriction sounds harsh, until one recognizes that
Python's large offering of quotes can accommodate most situations
with this rule. The following are all valid Python quotes:
'
"
'''
"""
With this many possibilities for the delimiter, almost anything
can be made to work.
About the only exception would be if the string
literal is supposed to contain a complete list of all allowed
Python quotes.
Real-world raw strings (variant 2, as in Python)
Python, however, takes a different route using
an extended version of the above rule.
It effectively states
If I put an r in front of a pair of quotes,
I can put whatever I want between the quotes
as long as it does not contain a quote
and it will mean itself.
If I insist on including a quote, even that is allowed,
but I have to put a backslash before it.
So the Python approach is, in a sense, even more liberal
than variant 1 above -- but it has the side effect of
"mis"interpreting the closing quote as part of the string
if the last intended character of the string is a backslash.
Variant 2 is not helpful:
If I want the quote in my string,
but not the backslash, the allowed version of my string literal
will not be what I need.
However, given the three different other kinds of quotes I have
at my disposal, I will probably just pick one of those and my
problem will be solved -- so this is not a problematic case.
The problematic case is this one:
If I want my string to end with a backslash, I am at a loss.
I need to resort to concatenating a non-raw string literal
containing the backslash.
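The two cases just described can be sketched as follows; the sample strings are invented for illustration.

```python
# Picking a different delimiter keeps the backslash out of the string.
with_backslash = r'it\'s'      # 5 characters: i, t, \, ', s
without_backslash = r"it's"    # 4 characters: i, t, ', s

# Triple quotes cover text that contains both single and double quotes.
both_quotes = r'''say "it's"'''

# A trailing backslash still needs a non-raw literal appended.
trailing = r'C:\dir' '\\'

print(with_backslash, without_backslash, both_quotes, trailing)
```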
Conclusion
After writing this, I agree with several of the other posters
that variant 1 would have been easier to understand and to accept
and therefore more pythonic. That's life!
Coming from C, it is pretty clear to me that a single \ works as an escape character, allowing you to put special characters such as newlines, tabs and quotes into strings.
That does indeed disallow \ as the last character, since it will escape the " and make the parser choke. But as pointed out earlier, \\ is legal.
some tips :
1) if you need to manipulate backslash for path then standard python module os.path is your friend. for example :
os.path.normpath('c:/folder1/')
2) if you want to build strings with backslash in it BUT without backslash at the END of your string then raw string is your friend (use 'r' prefix before your literal string). for example :
r'\one \two \three'
3) if you need to prefix a string in a variable X with a backslash then you can do this :
X='dummy'
bs=r'\ ' # don't forget the space after backslash or you will get EOL error
X2=bs[0]+X # X2 now contains \dummy
4) if you need to create a string with a backslash at the end then combine tip 2 and 3 :
voice_name='upper'
lilypond_display=r'\DisplayLilyMusic \ ' # don't forget the space at the end
lilypond_statement=lilypond_display[:-1]+voice_name
now lilypond_statement contains "\DisplayLilyMusic \upper"
long live python ! :)
n3on
Despite its role, even a raw string cannot end in a single
backslash, because the backslash escapes the following quote
character—you still must escape the surrounding quote character to
embed it in the string. That is, r"...\" is not a valid string
literal—a raw string cannot end in an odd number of backslashes.
If you need to end a raw string with a single backslash, you can use
two and slice off the second.
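The "use two and slice off the second" suggestion looks like this in practice (the path is an example value):

```python
# r"C:\dir\\" ends in two backslashes; slicing drops the second one.
path = r"C:\dir\\"[:-1]

print(path)
print(path.endswith('\\'), path.endswith('\\\\'))
```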
Given the confusion around the arbitrary-seeming restriction against an odd number of backslashes at the end of a Python raw-string, it's fair to say that this is a design mistake or legacy issue originating in a desire to have a simpler parser.
While workarounds (such as r'C:\some\path' '\\' yielding 'C:\\some\\path\\' (in Python notation) or C:\some\path\ (verbatim)) are simple, it's counterintuitive to be needing them. For comparison, let's have a look at C++ and Perl.
In C++, we can straightforwardly use raw string literal syntax
#include <iostream>
int main() {
std::cout << R"(Hello World!)" << std::endl;
std::cout << R"(Hello World!\)" << std::endl;
std::cout << R"(Hello World!\\)" << std::endl;
std::cout << R"(Hello World!\\\)" << std::endl;
}
to get the following output:
Hello World!
Hello World!\
Hello World!\\
Hello World!\\\
If we want to use the closing delimiter (above: )) within the string literal, we can even extend the syntax in an ad-hoc way to R"delimiterString(quotedMaterial)delimiterString". For example, R"asdf(some random delimiters: ( } [ ] { ) < > just for fun)asdf" produces the string some random delimiters: ( } [ ] { ) < > just for fun in the output. (Ain't that a good use of "asdf"!)
In Perl, this code
my $str = q{This is a test.\\};
print ($str);
print ("This is another test.\n");
will output the following: This is a test.\This is another test.
Replacing the first line by
my $str = q{This is a test.\};
would lead to an error message: Can't find string terminator "}" anywhere before EOF at main.pl line 1.
However, Perl treating a pre-delimiter \ as an escape character doesn't prevent the user from having an odd number of backslashes at the end of the resulting string; eg to place 3 backslashes \\\ into the end of $str, simply end the code with 6 backslashes: my $str = q{This is a test.\\\\\\};. Importantly, while we need to double the backslashes in the input, there is no Python-like inconsistent-seeming syntactic restriction.
Another way of looking at things is that these 3 languages use different ways to address the parsing issue of interaction between escape characters and closing delimiters:
Python: disallows an odd number of backslashes just before the closing delimiter; a simple workaround is r'stringWithoutFinalBackslash' '\\'
C++: allows essentially¹ everything between the delimiters
Perl: allows essentially² everything between the delimiters, but backslashes need to be consistently doubled
¹ The custom delimiterString itself cannot be more than 16 characters long, but that's hardly a limitation.
² If you need the delimiter itself, just escape it with \.
However, to be fair in a comparison to Python, we need to acknowledge that (1) C++ didn't have such string literals until C++11 and is famously hard to parse and (2) Perl is even harder to parse.
I encountered this problem and found a partial solution that is good for some cases. Although a Python string literal cannot end with a single unescaped backslash, the string value itself can, and it can be saved in a text file with a single backslash at the end. So if what you need is to save text ending in a single backslash on your computer, it is possible:
x = 'a string\\'
x
'a string\\'
# Now save it in a text file and it will appear with a single backslash:
with open("my_file.txt", 'w') as h:
    h.write(x)
BTW, this does not carry over to JSON: if you dump the string with Python's json library, the backslash is escaped and appears doubled in the output.
Finally, I work with Spyder, and I noticed that if I open the variable in Spyder's text editor by double-clicking its name in the Variable Explorer, it is shown with a single backslash and can be copied to the clipboard that way (not very helpful for most needs, but maybe for some).
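On the file and JSON remarks in this answer, a small sketch of both. A temporary directory stands in for the location of my_file.txt so the example cleans up after itself.

```python
import json
import os
import tempfile

x = 'a string\\'  # nine characters, the last one a single backslash

# Write to a temporary file (a stand-in for my_file.txt) and read it back.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "my_file.txt")
    with open(path, 'w') as handle:
        handle.write(x)
    with open(path) as handle:
        on_disk = handle.read()

# JSON has its own escaping rules, so the encoded text doubles the backslash,
# and decoding restores the single one.
encoded = json.dumps(x)
decoded = json.loads(encoded)

print(on_disk)
print(encoded)
print(decoded == x)
```

The doubled backslash exists only in the JSON text; the value survives the round trip unchanged.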
I have a regular expression which works perfectly well (although I am sure it is weak) in .NET/C#:
((^|\s))(?<tag>\#(?<tagname>(\w|\+)+))(?($|\s|\.))
I am trying to move it over to Python, but I seem to be running into a formatting issue (invalid expression exception).
It is a lame question/request, but I have been staring at this for a while, but nothing obvious is jumping out at me.
Note: I am simply trying
r = re.compile('((^|\s))(?<tag>\#(?<tagname>(\w|\+)+))(?($|\s|\.))')
Thanks,
Scott
There are some syntax incompatibilities between .NET regexps and PCRE/Python regexps :
(?<name>...) is (?P<name>...)
(?...) does not exist, and as I don't know what it is used for in .NET, I can't guess an equivalent. A Google code search does not give me any pointer to what it could be used for.
Besides, you should use Python raw strings (r"I am a raw string") instead of normal strings when expressing regexps : raw strings do not interpret escape sequences (like \n). But it is not the problem in your example as you're not using any known escape sequence which could be replaced (\s does not mean anything as an escape sequence, so it is not replaced).
Is "(?" there to prevent creation of a separate group? In Python's re, a non-capturing group is written "(?:...)" and a named group is written "(?P<name>...)". Try this:
r = re.compile(r'((^|\s))(?P<tag>\#(?P<tagname>(\w|\+)+))(?:($|\s|\.))')
Also, note the use of a raw string literal (the "r" character just before the quotes). Raw literals suppress '\' escaping, so that your '\' characters pass straight through to re (otherwise, you'd need '\\' for every '\').
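Putting the points from these answers together, here is one possible Python version. It is a sketch and rests on an assumption: that the trailing .NET construct was meant to require end-of-string, whitespace, or a period after the tag, which is written here as a lookahead. TAG_PATTERN and the sample text are made up for illustration.

```python
import re

# Assumed intent: a '#tag' preceded by start-of-string or whitespace and
# followed by end-of-string, whitespace, or a period.
TAG_PATTERN = re.compile(
    r'(?:^|\s)'                           # non-capturing: (?:...)
    r'(?P<tag>#(?P<tagname>(?:\w|\+)+))'  # named groups: (?P<name>...)
    r'(?=$|\s|\.)'                        # lookahead, consumes nothing
)

sample = "a #python tag and #c++."
tagnames = [m.group('tagname') for m in TAG_PATTERN.finditer(sample)]
tags = [m.group('tag') for m in TAG_PATTERN.finditer(sample)]
print(tagnames, tags)
```

If the original intent was different, the last line of the pattern is the part to adjust.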