Related
While asking this question, I realized I didn't know much about raw strings. For somebody claiming to be a Django trainer, this sucks.
I know what an encoding is, and I know what u'' alone does since I get what is Unicode.
But what does r'' do exactly? What kind of string does it result in?
And above all, what the heck does ur'' do?
Finally, is there any reliable way to go back from a Unicode string to a simple raw string?
Ah, and by the way, if your system and your text editor charset are set to UTF-8, does u'' actually do anything?
There's not really any "raw string"; there are raw string literals, which are exactly the string literals marked by an 'r' before the opening quote.
A "raw string literal" is a slightly different syntax for a string literal, in which a backslash, \, is taken as meaning "just a backslash" (except when it comes right before a quote that would otherwise terminate the literal) -- no "escape sequences" to represent newlines, tabs, backspaces, form-feeds, and so on. In normal string literals, each backslash must be doubled up to avoid being taken as the start of an escape sequence.
This syntax variant exists mostly because the syntax of regular expression patterns is heavy with backslashes (but never at the end, so the "except" clause above doesn't matter) and it looks a bit better when you avoid doubling up each of them -- that's all. It also gained some popularity to express native Windows file paths (with backslashes instead of regular slashes like on other platforms), but that's very rarely needed (since normal slashes mostly work fine on Windows too) and imperfect (due to the "except" clause above).
r'...' is a byte string (in Python 2.*), ur'...' is a Unicode string (again, in Python 2.*), and any of the other three kinds of quoting also produces exactly the same types of strings (so for example r'...', r'''...''', r"...", r"""...""" are all byte strings, and so on).
Not sure what you mean by "going back" - there is no intrinsically back and forward directions, because there's no raw string type, it's just an alternative syntax to express perfectly normal string objects, byte or unicode as they may be.
And yes, in Python 2.*, u'...' is of course always distinct from just '...' -- the former is a unicode string, the latter is a byte string. What encoding the literal might be expressed in is a completely orthogonal issue.
E.g., consider (Python 2.6):
>>> sys.getsizeof('ciao')
28
>>> sys.getsizeof(u'ciao')
34
The Unicode object of course takes more memory space (very small difference for a very short string, obviously ;-).
There are two types of string in Python 2: the traditional str type and the newer unicode type. If you type a string literal without the u in front you get the old str type which stores 8-bit characters, and with the u in front you get the newer unicode type that can store any Unicode character.
The r doesn't change the type at all, it just changes how the string literal is interpreted. Without the r, backslashes are treated as escape characters. With the r, backslashes are treated as literal. Either way, the type is the same.
ur is of course a Unicode string where backslashes are literal backslashes, not part of escape codes.
You can try to convert a Unicode string to an old string using the str() function, but if there are any unicode characters that cannot be represented in the old string, you will get an exception. You could replace them with question marks first if you wish, but of course this would cause those characters to be unreadable. It is not recommended to use the str type if you want to correctly handle unicode characters.
'raw string' means it is stored as it appears. For example, '\' is just a backslash instead of an escaping.
Let me explain it simply:
In python 2, you can store string in 2 different types.
The first one is ASCII which is str type in python, it uses 1 byte of memory. (256 characters, will store mostly English alphabets and simple symbols)
The 2nd type is UNICODE which is unicode type in python. Unicode stores all types of languages.
By default, python will prefer str type but if you want to store string in unicode type you can put u in front of the text like u'text' or you can do this by calling unicode('text')
So u is just a short way to call a function to cast str to unicode. That's it!
Now the r part, you put it in front of the text to tell the computer that the text is raw text, backslash should not be an escaping character. r'\n' will not create a new line character. It's just plain text containing 2 characters.
If you want to convert str to unicode and also put raw text in there, use ur because ru will raise an error.
NOW, the important part:
You cannot store one backslash by using r, it's the only exception.
So this code will produce error: r'\'
To store a backslash (only one) you need to use '\\'
If you want to store more than 1 characters you can still use r like r'\\' will produce 2 backslashes as you expected.
I don't know the reason why r doesn't work with one backslash storage but the reason isn't described by anyone yet. I hope that it is a bug.
A "u" prefix denotes the value has type unicode rather than str.
Raw string literals, with an "r" prefix, escape any escape sequences within them, so len(r"\n") is 2. Because they escape escape sequences, you cannot end a string literal with a single backslash: that's not a valid escape sequence (e.g. r"\").
"Raw" is not part of the type, it's merely one way to represent the value. For example, "\\n" and r"\n" are identical values, just like 32, 0x20, and 0b100000 are identical.
You can have unicode raw string literals:
>>> u = ur"\n"
>>> print type(u), len(u)
<type 'unicode'> 2
The source file encoding just determines how to interpret the source file, it doesn't affect expressions or types otherwise. However, it's recommended to avoid code where an encoding other than ASCII would change the meaning:
Files using ASCII (or UTF-8, for Python 3.0) should not have a coding cookie. Latin-1 (or UTF-8) should only be used when a comment or docstring needs to mention an author name that requires Latin-1; otherwise, using \x, \u or \U escapes is the preferred way to include non-ASCII data in string literals.
Unicode string literals
Unicode string literals (string literals prefixed by u) are no longer used in Python 3. They are still valid but just for compatibility purposes with Python 2.
Raw string literals
If you want to create a string literal consisting of only easily typable characters like english letters or numbers, you can simply type them: 'hello world'. But if you want to include also some more exotic characters, you'll have to use some workaround.
One of the workarounds are Escape sequences. This way you can for example represent a new line in your string simply by adding two easily typable characters \n to your string literal. So when you print the 'hello\nworld' string, the words will be printed on separate lines. That's very handy!
On the other hand, sometimes you might want to include the actual characters \ and n into your string – you might not want them to be interpreted as a new line. Look at these examples:
'New updates are ready in c:\windows\updates\new'
'In this lesson we will learn what the \n escape sequence does.'
In such situations you can just prefix the string literal with the r character like this: r'hello\nworld' and no escape sequences will be interpreted by Python. The string will be printed exactly as you created it.
Raw string literals are not completely "raw"?
Many people expect the raw string literals to be raw in a sense that "anything placed between the quotes is ignored by Python". That is not true. Python still recognizes all the escape sequences, it just does not interpret them - it leaves them unchanged instead. It means that raw string literals still have to be valid string literals.
From the lexical definition of a string literal:
string ::= "'" stringitem* "'"
stringitem ::= stringchar | escapeseq
stringchar ::= <any source character except "\" or newline or the quote>
escapeseq ::= "\" <any source character>
It is clear that string literals (raw or not) containing a bare quote character: 'hello'world' or ending with a backslash: 'hello world\' are not valid.
Maybe this is obvious, maybe not, but you can make the string '\' by calling x=chr(92)
x=chr(92)
print type(x), len(x) # <type 'str'> 1
y='\\'
print type(y), len(y) # <type 'str'> 1
x==y # True
x is y # False
While asking this question, I realized I didn't know much about raw strings. For somebody claiming to be a Django trainer, this sucks.
I know what an encoding is, and I know what u'' alone does since I get what is Unicode.
But what does r'' do exactly? What kind of string does it result in?
And above all, what the heck does ur'' do?
Finally, is there any reliable way to go back from a Unicode string to a simple raw string?
Ah, and by the way, if your system and your text editor charset are set to UTF-8, does u'' actually do anything?
There's not really any "raw string"; there are raw string literals, which are exactly the string literals marked by an 'r' before the opening quote.
A "raw string literal" is a slightly different syntax for a string literal, in which a backslash, \, is taken as meaning "just a backslash" (except when it comes right before a quote that would otherwise terminate the literal) -- no "escape sequences" to represent newlines, tabs, backspaces, form-feeds, and so on. In normal string literals, each backslash must be doubled up to avoid being taken as the start of an escape sequence.
This syntax variant exists mostly because the syntax of regular expression patterns is heavy with backslashes (but never at the end, so the "except" clause above doesn't matter) and it looks a bit better when you avoid doubling up each of them -- that's all. It also gained some popularity to express native Windows file paths (with backslashes instead of regular slashes like on other platforms), but that's very rarely needed (since normal slashes mostly work fine on Windows too) and imperfect (due to the "except" clause above).
r'...' is a byte string (in Python 2.*), ur'...' is a Unicode string (again, in Python 2.*), and any of the other three kinds of quoting also produces exactly the same types of strings (so for example r'...', r'''...''', r"...", r"""...""" are all byte strings, and so on).
Not sure what you mean by "going back" - there is no intrinsically back and forward directions, because there's no raw string type, it's just an alternative syntax to express perfectly normal string objects, byte or unicode as they may be.
And yes, in Python 2.*, u'...' is of course always distinct from just '...' -- the former is a unicode string, the latter is a byte string. What encoding the literal might be expressed in is a completely orthogonal issue.
E.g., consider (Python 2.6):
>>> sys.getsizeof('ciao')
28
>>> sys.getsizeof(u'ciao')
34
The Unicode object of course takes more memory space (very small difference for a very short string, obviously ;-).
There are two types of string in Python 2: the traditional str type and the newer unicode type. If you type a string literal without the u in front you get the old str type which stores 8-bit characters, and with the u in front you get the newer unicode type that can store any Unicode character.
The r doesn't change the type at all, it just changes how the string literal is interpreted. Without the r, backslashes are treated as escape characters. With the r, backslashes are treated as literal. Either way, the type is the same.
ur is of course a Unicode string where backslashes are literal backslashes, not part of escape codes.
You can try to convert a Unicode string to an old string using the str() function, but if there are any unicode characters that cannot be represented in the old string, you will get an exception. You could replace them with question marks first if you wish, but of course this would cause those characters to be unreadable. It is not recommended to use the str type if you want to correctly handle unicode characters.
'raw string' means it is stored as it appears. For example, '\' is just a backslash instead of an escaping.
Let me explain it simply:
In python 2, you can store string in 2 different types.
The first one is ASCII which is str type in python, it uses 1 byte of memory. (256 characters, will store mostly English alphabets and simple symbols)
The 2nd type is UNICODE which is unicode type in python. Unicode stores all types of languages.
By default, python will prefer str type but if you want to store string in unicode type you can put u in front of the text like u'text' or you can do this by calling unicode('text')
So u is just a short way to call a function to cast str to unicode. That's it!
Now the r part, you put it in front of the text to tell the computer that the text is raw text, backslash should not be an escaping character. r'\n' will not create a new line character. It's just plain text containing 2 characters.
If you want to convert str to unicode and also put raw text in there, use ur because ru will raise an error.
NOW, the important part:
You cannot store one backslash by using r, it's the only exception.
So this code will produce error: r'\'
To store a backslash (only one) you need to use '\\'
If you want to store more than 1 characters you can still use r like r'\\' will produce 2 backslashes as you expected.
I don't know the reason why r doesn't work with one backslash storage but the reason isn't described by anyone yet. I hope that it is a bug.
A "u" prefix denotes the value has type unicode rather than str.
Raw string literals, with an "r" prefix, escape any escape sequences within them, so len(r"\n") is 2. Because they escape escape sequences, you cannot end a string literal with a single backslash: that's not a valid escape sequence (e.g. r"\").
"Raw" is not part of the type, it's merely one way to represent the value. For example, "\\n" and r"\n" are identical values, just like 32, 0x20, and 0b100000 are identical.
You can have unicode raw string literals:
>>> u = ur"\n"
>>> print type(u), len(u)
<type 'unicode'> 2
The source file encoding just determines how to interpret the source file, it doesn't affect expressions or types otherwise. However, it's recommended to avoid code where an encoding other than ASCII would change the meaning:
Files using ASCII (or UTF-8, for Python 3.0) should not have a coding cookie. Latin-1 (or UTF-8) should only be used when a comment or docstring needs to mention an author name that requires Latin-1; otherwise, using \x, \u or \U escapes is the preferred way to include non-ASCII data in string literals.
Unicode string literals
Unicode string literals (string literals prefixed by u) are no longer used in Python 3. They are still valid but just for compatibility purposes with Python 2.
Raw string literals
If you want to create a string literal consisting of only easily typable characters like english letters or numbers, you can simply type them: 'hello world'. But if you want to include also some more exotic characters, you'll have to use some workaround.
One of the workarounds are Escape sequences. This way you can for example represent a new line in your string simply by adding two easily typable characters \n to your string literal. So when you print the 'hello\nworld' string, the words will be printed on separate lines. That's very handy!
On the other hand, sometimes you might want to include the actual characters \ and n into your string – you might not want them to be interpreted as a new line. Look at these examples:
'New updates are ready in c:\windows\updates\new'
'In this lesson we will learn what the \n escape sequence does.'
In such situations you can just prefix the string literal with the r character like this: r'hello\nworld' and no escape sequences will be interpreted by Python. The string will be printed exactly as you created it.
Raw string literals are not completely "raw"?
Many people expect the raw string literals to be raw in a sense that "anything placed between the quotes is ignored by Python". That is not true. Python still recognizes all the escape sequences, it just does not interpret them - it leaves them unchanged instead. It means that raw string literals still have to be valid string literals.
From the lexical definition of a string literal:
string ::= "'" stringitem* "'"
stringitem ::= stringchar | escapeseq
stringchar ::= <any source character except "\" or newline or the quote>
escapeseq ::= "\" <any source character>
It is clear that string literals (raw or not) containing a bare quote character: 'hello'world' or ending with a backslash: 'hello world\' are not valid.
Maybe this is obvious, maybe not, but you can make the string '\' by calling x=chr(92)
x=chr(92)
print type(x), len(x) # <type 'str'> 1
y='\\'
print type(y), len(y) # <type 'str'> 1
x==y # True
x is y # False
While asking this question, I realized I didn't know much about raw strings. For somebody claiming to be a Django trainer, this sucks.
I know what an encoding is, and I know what u'' alone does since I get what is Unicode.
But what does r'' do exactly? What kind of string does it result in?
And above all, what the heck does ur'' do?
Finally, is there any reliable way to go back from a Unicode string to a simple raw string?
Ah, and by the way, if your system and your text editor charset are set to UTF-8, does u'' actually do anything?
There's not really any "raw string"; there are raw string literals, which are exactly the string literals marked by an 'r' before the opening quote.
A "raw string literal" is a slightly different syntax for a string literal, in which a backslash, \, is taken as meaning "just a backslash" (except when it comes right before a quote that would otherwise terminate the literal) -- no "escape sequences" to represent newlines, tabs, backspaces, form-feeds, and so on. In normal string literals, each backslash must be doubled up to avoid being taken as the start of an escape sequence.
This syntax variant exists mostly because the syntax of regular expression patterns is heavy with backslashes (but never at the end, so the "except" clause above doesn't matter) and it looks a bit better when you avoid doubling up each of them -- that's all. It also gained some popularity to express native Windows file paths (with backslashes instead of regular slashes like on other platforms), but that's very rarely needed (since normal slashes mostly work fine on Windows too) and imperfect (due to the "except" clause above).
r'...' is a byte string (in Python 2.*), ur'...' is a Unicode string (again, in Python 2.*), and any of the other three kinds of quoting also produces exactly the same types of strings (so for example r'...', r'''...''', r"...", r"""...""" are all byte strings, and so on).
Not sure what you mean by "going back" - there is no intrinsically back and forward directions, because there's no raw string type, it's just an alternative syntax to express perfectly normal string objects, byte or unicode as they may be.
And yes, in Python 2.*, u'...' is of course always distinct from just '...' -- the former is a unicode string, the latter is a byte string. What encoding the literal might be expressed in is a completely orthogonal issue.
E.g., consider (Python 2.6):
>>> sys.getsizeof('ciao')
28
>>> sys.getsizeof(u'ciao')
34
The Unicode object of course takes more memory space (very small difference for a very short string, obviously ;-).
There are two types of string in Python 2: the traditional str type and the newer unicode type. If you type a string literal without the u in front you get the old str type which stores 8-bit characters, and with the u in front you get the newer unicode type that can store any Unicode character.
The r doesn't change the type at all, it just changes how the string literal is interpreted. Without the r, backslashes are treated as escape characters. With the r, backslashes are treated as literal. Either way, the type is the same.
ur is of course a Unicode string where backslashes are literal backslashes, not part of escape codes.
You can try to convert a Unicode string to an old string using the str() function, but if there are any unicode characters that cannot be represented in the old string, you will get an exception. You could replace them with question marks first if you wish, but of course this would cause those characters to be unreadable. It is not recommended to use the str type if you want to correctly handle unicode characters.
'raw string' means it is stored as it appears. For example, '\' is just a backslash instead of an escaping.
Let me explain it simply:
In python 2, you can store string in 2 different types.
The first one is ASCII which is str type in python, it uses 1 byte of memory. (256 characters, will store mostly English alphabets and simple symbols)
The 2nd type is UNICODE which is unicode type in python. Unicode stores all types of languages.
By default, python will prefer str type but if you want to store string in unicode type you can put u in front of the text like u'text' or you can do this by calling unicode('text')
So u is just a short way to call a function to cast str to unicode. That's it!
Now the r part, you put it in front of the text to tell the computer that the text is raw text, backslash should not be an escaping character. r'\n' will not create a new line character. It's just plain text containing 2 characters.
If you want to convert str to unicode and also put raw text in there, use ur because ru will raise an error.
NOW, the important part:
You cannot store one backslash by using r, it's the only exception.
So this code will produce error: r'\'
To store a backslash (only one) you need to use '\\'
If you want to store more than 1 characters you can still use r like r'\\' will produce 2 backslashes as you expected.
I don't know the reason why r doesn't work with one backslash storage but the reason isn't described by anyone yet. I hope that it is a bug.
A "u" prefix denotes the value has type unicode rather than str.
Raw string literals, with an "r" prefix, escape any escape sequences within them, so len(r"\n") is 2. Because they escape escape sequences, you cannot end a string literal with a single backslash: that's not a valid escape sequence (e.g. r"\").
"Raw" is not part of the type, it's merely one way to represent the value. For example, "\\n" and r"\n" are identical values, just like 32, 0x20, and 0b100000 are identical.
You can have unicode raw string literals:
>>> u = ur"\n"
>>> print type(u), len(u)
<type 'unicode'> 2
The source file encoding just determines how to interpret the source file, it doesn't affect expressions or types otherwise. However, it's recommended to avoid code where an encoding other than ASCII would change the meaning:
Files using ASCII (or UTF-8, for Python 3.0) should not have a coding cookie. Latin-1 (or UTF-8) should only be used when a comment or docstring needs to mention an author name that requires Latin-1; otherwise, using \x, \u or \U escapes is the preferred way to include non-ASCII data in string literals.
Unicode string literals
Unicode string literals (string literals prefixed by u) are no longer used in Python 3. They are still valid but just for compatibility purposes with Python 2.
Raw string literals
If you want to create a string literal consisting of only easily typable characters like english letters or numbers, you can simply type them: 'hello world'. But if you want to include also some more exotic characters, you'll have to use some workaround.
One of the workarounds are Escape sequences. This way you can for example represent a new line in your string simply by adding two easily typable characters \n to your string literal. So when you print the 'hello\nworld' string, the words will be printed on separate lines. That's very handy!
On the other hand, sometimes you might want to include the actual characters \ and n into your string – you might not want them to be interpreted as a new line. Look at these examples:
'New updates are ready in c:\windows\updates\new'
'In this lesson we will learn what the \n escape sequence does.'
In such situations you can just prefix the string literal with the r character like this: r'hello\nworld' and no escape sequences will be interpreted by Python. The string will be printed exactly as you created it.
Raw string literals are not completely "raw"?
Many people expect the raw string literals to be raw in a sense that "anything placed between the quotes is ignored by Python". That is not true. Python still recognizes all the escape sequences, it just does not interpret them - it leaves them unchanged instead. It means that raw string literals still have to be valid string literals.
From the lexical definition of a string literal:
string ::= "'" stringitem* "'"
stringitem ::= stringchar | escapeseq
stringchar ::= <any source character except "\" or newline or the quote>
escapeseq ::= "\" <any source character>
It is clear that string literals (raw or not) containing a bare quote character: 'hello'world' or ending with a backslash: 'hello world\' are not valid.
Maybe this is obvious, maybe not, but you can make the string '\' by calling x=chr(92)
x=chr(92)
print type(x), len(x) # <type 'str'> 1
y='\\'
print type(y), len(y) # <type 'str'> 1
x==y # True
x is y # False
I'm creating a dictionary for which one of the value is a string with backslash,I know that python is automatically adding escape sequence.But at the end when you print the dictionary it's still printing that value with muliple backslashes,Now i have to pass this dictionary to another tool which is not expecting multiple backslashes,So currently I'm forced to manually remove a backslash is there a way to remove the backslash from the value of dictionary automatically.
(Pdb) print value2
"\x01\x02\x03\x04\x05"
(Pdb) value2
'"\\x01\\x02\\x03\\x04\\x05"'
You are confusing string representations with string values.
When you echo a string object in the Python interpreter, the output is really produced by printing the result of the repr() function. This function outputs debugging friendly representations, and for strings, that representation is valid Python syntax you can copy and paste back into Python.
print on the other hand, just writes the actual value in the string to the terminal. That's rather different from the Python syntax that creates the value. A backslash in the string prints as a backslash, you wouldn't see a backslash if there wasn't one in the value.
In Python string literal syntax, the \ backslash character has special meaning, it's the first character of an escape sequence. So if you want to have an actual backslash in the value of the string, you would need to use \\ to 'escape the escape'. There is other syntax where the backslash would not have special meaning, but the repr() representation of a string doesn't use that other syntax. So it'll output any backslash in the value as the escape sequence \\.
That doesn't mean that the value has two backslashes. It just means that you can copy the output, and paste it into Python, and it'll produce the same string value.
You can see that your string value doesn't have double backslashes by looking at individual characters:
>>> value2 = '"\\x01\\x02\\x03\\x04\\x05"'
>>> value2
'"\\x01\\x02\\x03\\x04\\x05"'
>>> print value2
"\x01\x02\x03\x04\x05"
>>> print value2[0]
"
>>> print value2[1]
\
>>> print value2[2]
x
>>> value2[0]
'"'
>>> value2[1]
'\\'
>>> value2[2]
'x'
Printing value2[1] shows that that single character is a backslash. Echoing that single character shows '\\', the Python syntax to recreate a string with a single character.
When you echo dictionaries or lists or other standard Python containers, they too are echoed using valid Python syntax, so their contents all are shown by using repr() on them, including strings:
>>> d = {'foo': value2}
>>> d
{'foo': '"\\x01\\x02\\x03\\x04\\x05"'}
Again, that's not the value, that's the representation of the string contents.
On top of that, container types have no string value, so printing a dictionary or list or other standard container type will only ever show their representation:
>>> print d # shows a dictionary representation
{'foo': '"\\x01\\x02\\x03\\x04\\x05"'}
>>> print d['foo'] # shows the value of the d['foo'] string
"\x01\x02\x03\x04\x05"
You'd have to print individual values (such as d['foo'] above), or create your own string value from the components (which involves accessing all the contents and building a new string from that). Containers are not meant to be end-user-friendly values, so Python doesn't provide you with a string value for them either.
Strings can also contain non-printable characters, characters that don't have a human-readable value, such as a newline, or a tab character, and even the BELL character that'll make most terminals go beep when you write one of those to them. And in Python 2, the str type holds bytes, really, and only printable characters in the ASCII range (values 00 - 7F) considered when producing the repr() output. Anything outside is always considered unprintable, even if you could decode those bytes as Latin-1 or another commonly used codec.
So when you do have special characters other than \ in the string, you'd see this in the representation:
>>> value_with_no_backslashes = "This is mostly ASCII with a \b bell and a newline:\nSome UTF-8 data: 🦊"
>>> print value_with_no_backslashes # works because my terminal accepts UTF-8
This is mostly ASCII with a bell and a newline:
Some UTF-8 data: 🦊
>>> value_with_no_backslashes
'This is mostly ASCII with a \x08 bell and a newline:\nSome UTF-8 data: \xf0\x9f\xa6\x8a'
Now, when I echo the value, there are backslashes, to make sure the non-printable characters can easily be copied and reproduce the same value again. Note that those backslashes are not doubled in the echoed syntax.
Note that representations are Python specific and should only be used to aid debugging. Writing them to logs is fine, using them to pass values between programs is not. Always use a serialisation format to communicate between programs, including command-line tools started as subprocesses or by writing output to the terminal. Python comes with JSON support built in, and for Python-to-Python serialisation with no chance of third-party interference, pickle can be used for almost any Python data structure.
While asking this question, I realized I didn't know much about raw strings. For somebody claiming to be a Django trainer, this sucks.
I know what an encoding is, and I know what u'' alone does since I get what is Unicode.
But what does r'' do exactly? What kind of string does it result in?
And above all, what the heck does ur'' do?
Finally, is there any reliable way to go back from a Unicode string to a simple raw string?
Ah, and by the way, if your system and your text editor charset are set to UTF-8, does u'' actually do anything?
There's not really any "raw string"; there are raw string literals, which are exactly the string literals marked by an 'r' before the opening quote.
A "raw string literal" is a slightly different syntax for a string literal, in which a backslash, \, is taken as meaning "just a backslash" (except when it comes right before a quote that would otherwise terminate the literal) -- no "escape sequences" to represent newlines, tabs, backspaces, form-feeds, and so on. In normal string literals, each backslash must be doubled up to avoid being taken as the start of an escape sequence.
This syntax variant exists mostly because the syntax of regular expression patterns is heavy with backslashes (but never at the end, so the "except" clause above doesn't matter) and it looks a bit better when you avoid doubling up each of them -- that's all. It also gained some popularity to express native Windows file paths (with backslashes instead of regular slashes like on other platforms), but that's very rarely needed (since normal slashes mostly work fine on Windows too) and imperfect (due to the "except" clause above).
r'...' is a byte string (in Python 2.*), ur'...' is a Unicode string (again, in Python 2.*), and any of the other three kinds of quoting also produces exactly the same types of strings (so for example r'...', r'''...''', r"...", r"""...""" are all byte strings, and so on).
Not sure what you mean by "going back" - there is no intrinsically back and forward directions, because there's no raw string type, it's just an alternative syntax to express perfectly normal string objects, byte or unicode as they may be.
And yes, in Python 2.*, u'...' is of course always distinct from just '...' -- the former is a unicode string, the latter is a byte string. What encoding the literal might be expressed in is a completely orthogonal issue.
E.g., consider (Python 2.6):
>>> sys.getsizeof('ciao')
28
>>> sys.getsizeof(u'ciao')
34
The Unicode object of course takes more memory space (very small difference for a very short string, obviously ;-).
There are two types of string in Python 2: the traditional str type and the newer unicode type. If you type a string literal without the u in front you get the old str type which stores 8-bit characters, and with the u in front you get the newer unicode type that can store any Unicode character.
The r doesn't change the type at all, it just changes how the string literal is interpreted. Without the r, backslashes are treated as escape characters. With the r, backslashes are treated as literal. Either way, the type is the same.
ur is of course a Unicode string where backslashes are literal backslashes, not part of escape codes.
You can try to convert a Unicode string to an old string using the str() function, but if there are any unicode characters that cannot be represented in the old string, you will get an exception. You could replace them with question marks first if you wish, but of course this would cause those characters to be unreadable. It is not recommended to use the str type if you want to correctly handle unicode characters.
'raw string' means it is stored as it appears. For example, '\' is just a backslash instead of an escaping.
Let me explain it simply:
In python 2, you can store string in 2 different types.
The first one is ASCII which is str type in python, it uses 1 byte of memory. (256 characters, will store mostly English alphabets and simple symbols)
The 2nd type is UNICODE which is unicode type in python. Unicode stores all types of languages.
By default, python will prefer str type but if you want to store string in unicode type you can put u in front of the text like u'text' or you can do this by calling unicode('text')
So u is just a short way to call a function to cast str to unicode. That's it!
Now the r part, you put it in front of the text to tell the computer that the text is raw text, backslash should not be an escaping character. r'\n' will not create a new line character. It's just plain text containing 2 characters.
If you want to convert str to unicode and also put raw text in there, use ur because ru will raise an error.
NOW, the important part:
You cannot store one backslash by using r, it's the only exception.
So this code will produce error: r'\'
To store a backslash (only one) you need to use '\\'
If you want to store more than 1 characters you can still use r like r'\\' will produce 2 backslashes as you expected.
I don't know the reason why r doesn't work with one backslash storage but the reason isn't described by anyone yet. I hope that it is a bug.
A "u" prefix denotes the value has type unicode rather than str.
Raw string literals, with an "r" prefix, escape any escape sequences within them, so len(r"\n") is 2. Because they escape escape sequences, you cannot end a string literal with a single backslash: that's not a valid escape sequence (e.g. r"\").
"Raw" is not part of the type, it's merely one way to represent the value. For example, "\\n" and r"\n" are identical values, just like 32, 0x20, and 0b100000 are identical.
You can have unicode raw string literals:
>>> u = ur"\n"
>>> print type(u), len(u)
<type 'unicode'> 2
The source file encoding just determines how to interpret the source file, it doesn't affect expressions or types otherwise. However, it's recommended to avoid code where an encoding other than ASCII would change the meaning:
Files using ASCII (or UTF-8, for Python 3.0) should not have a coding cookie. Latin-1 (or UTF-8) should only be used when a comment or docstring needs to mention an author name that requires Latin-1; otherwise, using \x, \u or \U escapes is the preferred way to include non-ASCII data in string literals.
Unicode string literals
Unicode string literals (string literals prefixed by u) are no longer used in Python 3. They are still valid but just for compatibility purposes with Python 2.
Raw string literals
If you want to create a string literal consisting of only easily typable characters like english letters or numbers, you can simply type them: 'hello world'. But if you want to include also some more exotic characters, you'll have to use some workaround.
One of the workarounds are Escape sequences. This way you can for example represent a new line in your string simply by adding two easily typable characters \n to your string literal. So when you print the 'hello\nworld' string, the words will be printed on separate lines. That's very handy!
On the other hand, sometimes you might want to include the actual characters \ and n into your string – you might not want them to be interpreted as a new line. Look at these examples:
'New updates are ready in c:\windows\updates\new'
'In this lesson we will learn what the \n escape sequence does.'
In such situations you can just prefix the string literal with the r character like this: r'hello\nworld' and no escape sequences will be interpreted by Python. The string will be printed exactly as you created it.
Raw string literals are not completely "raw"?
Many people expect the raw string literals to be raw in a sense that "anything placed between the quotes is ignored by Python". That is not true. Python still recognizes all the escape sequences, it just does not interpret them - it leaves them unchanged instead. It means that raw string literals still have to be valid string literals.
From the lexical definition of a string literal:
string ::= "'" stringitem* "'"
stringitem ::= stringchar | escapeseq
stringchar ::= <any source character except "\" or newline or the quote>
escapeseq ::= "\" <any source character>
It is clear that string literals (raw or not) containing a bare quote character: 'hello'world' or ending with a backslash: 'hello world\' are not valid.
Maybe this is obvious, maybe not, but you can make the string '\' by calling x=chr(92)
x=chr(92)
print type(x), len(x) # <type 'str'> 1
y='\\'
print type(y), len(y) # <type 'str'> 1
x==y # True
x is y # False