I found the sequence \newline in a list of escape sequences in the python documentation. I wonder how it is used and for what. At least in my interpreter it seems this is just interpreted as '\n' + 'ewline':
>>> print('\newline')
ewline
It refers to the actual newline character - the one with character code "10" (0x0a) - not the text sequence "newline".
So, an example is like:
print("a\
b")
Here, the backslash is followed by the newline inside a string, and what is printed is just "ab" with nothing in between.
This differs from \n: there, the character following the backslash is n (0x6e), and the sequence is translated to \x0a when the string is parsed. With \<newline>, the source itself contains the \x0a character, and the backslash-newline pair is replaced by an empty string.
Maybe the documentation on that page would be clearer if it read \<newline> instead of just \newline.
The documentation you are alluding to is explaining how a backslash followed by a literal newline is ignored, as if the next line were physically joined with the line on which the starting backslash was found.
The string '\newline' has no special meaning; it is exactly what you say you think it is.
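A quick way to check both behaviours side by side. This is only an illustrative sketch; the variable names are made up for the example:

```python
# A backslash directly followed by a real line break is a line continuation:
# both the backslash and the line break are dropped from the string.
joined = "a\
b"

# "\n" is the escape sequence for the newline character (code 10).
newline = "\n"

# "\newline" is simply "\n" followed by the ordinary text "ewline".
text = "\newline"

print(repr(joined))                # 'ab'
print(len(newline), ord(newline))  # 1 10
print(text == "\n" + "ewline")     # True
```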
In the following:
>>> r'\d+','\d+', '\\d+'
('\\d+', '\\d+', '\\d+')
Why does the backslash in '\d+' not need to be escaped? Why does this give the same result as the other two literals?
Similarly:
>>> r'[a-z]+\1', '[a-z]+\1'
('[a-z]+\\1', '[a-z]+\x01')
Why does the \1 get converted into a hex escape?
The String and Bytes literals section of the documentation has tables showing which backslash combinations are actually escape sequences with a special meaning. Combinations outside of these tables are not escapes, are not affected by the raw string rules, and are treated as regular characters. "\d" is two characters, as is r"\d". You'll find, for instance, that "\n" (a single newline character) behaves differently from "\d".
\1 is an \ooo octal escape. When printed, Python shows the same character value as a hex escape. Interestingly, \8 isn't octal, but instead of raising an error, Python just treats it as two characters because it's not an escape (recent Python versions do emit a warning for such unrecognized sequences).
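A small sketch of these rules. The unescaped spelling '\d' is deliberately left out of the code, since it triggers the warning mentioned above on recent interpreters; the raw and doubled spellings produce the same string:

```python
# r"\d" and "\\d" are two spellings of the same two-character string.
raw_d = r"\d"
escaped_d = "\\d"
print(raw_d == escaped_d, len(raw_d))  # True 2

# "\n" is a recognized escape, so it is ONE character; r"\n" is two.
print(len("\n"), len(r"\n"))           # 1 2

# "\1" is an octal escape for the character with code 1.
octal_one = "\1"
print(ord(octal_one), repr(octal_one))  # 1 '\x01'
```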
Because \d is not an escape code. So, however you type it, it is interpreted as a literal \ then a d.
If you type \\d, then the \\ is interpreted as an escaped \, followed by a d.
The situation is different if you choose a letter part of an escape code.
r'\n+','\n+', '\\n+'
⇒ ('\\n+', '\n+', '\\n+')
The first one (because it is raw) and the last one (because the \ is escaped) are 3-character strings containing a \, an n and a +.
The second one is a 2-character string containing a '\n' (a newline) and a +.
The second example is even more straightforward. Nothing strange here. r'\1' is a backslash then a one. '\1' is the character whose ASCII code is 1, whose canonical representation is '\x01'.
'\1', '\x01' and '\001' are the same thing. Python cannot remember what specific syntax you used to type it. All it knows is that it is the character with code 1. So, it displays it in the "canonical way".
Exactly as 'A', '\x41' and '\101' are the same thing, and would all be printed with the canonical representation, which is 'A'.
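To see the "canonical representation" idea in action, here is a short sketch (the list names are just for illustration):

```python
# Three spellings of the character with code 1 ...
one_variants = ['\1', '\x01', '\001']
# ... and three spellings of the character with code 65.
a_variants = ['A', '\x41', '\101']

# Python only stores the resulting character, so repr() shows each
# group in a single canonical form regardless of how it was typed.
print(one_variants)  # ['\x01', '\x01', '\x01']
print(a_variants)    # ['A', 'A', 'A']
```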
When I write print('\') or print("\") or print("'\'"), Python doesn't print the backslash \ symbol. Instead it errors for the first two and prints '' for the third. What should I do to print a backslash?
This question is about producing a string that has a single backslash in it. This is particularly tricky because it cannot be done with raw strings. For the related question about why such a string is represented with two backslashes, see Why do backslashes appear twice?. For including literal backslashes in other strings, see using backslash in python (not to escape).
You need to escape your backslash by preceding it with, yes, another backslash:
print("\\")
And for versions prior to Python 3:
print "\\"
The \ character is called an escape character: it changes how the character following it is interpreted. For example, n by itself is simply a letter, but when you precede it with a backslash, it becomes \n, which is the newline character.
As you can probably guess, \ also needs to be escaped so it doesn't function like an escape character. You have to... escape the escape, essentially.
See the Python 3 documentation for string literals.
A hacky way of printing a backslash that doesn't involve escaping is to pass its character code to chr:
>>> print(chr(92))
\
print(fr"\{''}")
or how about this
print(r"\ "[0])
For completeness: A backslash can also be escaped as a hex sequence: "\x5c"; or a short Unicode sequence: "\u005c"; or a long Unicode sequence: "\U0000005c". All of these will produce a string with a single backslash, which Python will happily report back to you in its canonical representation - '\\'.
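The spellings from the answers above can be compared in one go. This is a sketch for Python 3; the compile() call is only there to show, without breaking the script, that a raw string literal cannot end in a single backslash:

```python
backslash = "\\"
print(backslash)        # \
print(len(backslash))   # 1
print(repr(backslash))  # '\\'

# Other ways to obtain the same one-character string.
alternatives = [chr(92), "\x5c", "\u005c", "\U0000005c", r"\ "[0]]
print(all(s == backslash for s in alternatives))  # True

# The source text r"\" is not a valid literal: the backslash keeps the
# closing quote from terminating the string.
try:
    compile('r"\\"', "<example>", "eval")
    raw_literal_ok = True
except SyntaxError:
    raw_literal_ok = False
print(raw_literal_ok)   # False
```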
I'm writing a quick Python script to do a bit of inspection on some of our Hibernate mapping files. I'm trying to use this bit of Python to get the table name of a POJO, whether or not its class path is fully defined:
searchObj = re.search(r'<class name="(.*\\.|)' + pojo + '".*table="(.*?)"', contents)
However - say pojo is 'MyObject' - the regex is not matching it to this line:
<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">
If I print the string (while stopped in Pdb) I'm searching with, I see this:
'<class name="(.*\\\\.|)MyObject".*table="(.*?)"'
I'm quite confused as to what's going wrong here. For one, I was under the impression that the 'r' prefix made it so that the backslashes wouldn't be escaped. Even so, if I remove one of the backslashes such that my search string is this:
searchObj = re.search(r'<class name="(.*\.|)' + pojo + '".*table="(.*?)"', contents)
And the string searched becomes
'<class name="(.*\\.|)MyObject".*table="(.*?)"'
It still doesn't return a match. What's going wrong here? The regex expression I'm intending to use works on regex101.com (with just one backslash in the apparently problematic area.) Any idea what is going wrong here?
Given this:
re.search(r'<class name="(.*\\.|)' + pojo + '".*table="(.*?)"', contents)
The first part of the pattern is interpreted like this:
1. <class name=" a literal string beginning with < and ending with "
2. ( the beginning of a group
3. .* zero or more of any characters
4. \\ a literal single backslash
5. . any single character
6. OR
7. nothing
8. ) end of the group
Since the string you're searching for does not have a literal backslash, it won't match.
If what you intend is for \\. to mean "a literal period", you need a single backslash since it is inside a raw string: \.
Also, ending the group with a pipe seems weird. I'm not sure what you think that's accomplishing. If you mean to say "any number of characters ending in a dot, or nothing", you can do that with (.*\.)?, since the ? means "zero or one of the preceding match".
This seems to work for me:
import re
contents1 = '''<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">'''
contents2 = '''<class name="MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">'''
pojo="MyObject"
pattern = r'<class name="(.*\.)?' + pojo + r'".*table="(.*?)"'
assert(re.search(pattern, contents1))
assert(re.search(pattern, contents2))
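To go one step beyond the assertions, the following sketch contrasts the original pattern with the corrected one and pulls out the captured table name. The sample lines are shortened copies of the mapping line from the question, and wrapping pojo in re.escape() is an extra precaution not present in the original code:

```python
import re

qualified = '<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true">'
unqualified = '<class name="MyObject" table="my_cool_object" dynamic-insert="true">'
pojo = "MyObject"

# Original: in a raw string, \\. means "a literal backslash, then any character".
wrong = r'<class name="(.*\\.|)' + re.escape(pojo) + r'".*table="(.*?)"'
# Corrected: \. is a literal dot, and the package prefix is optional.
right = r'<class name="(.*\.)?' + re.escape(pojo) + r'".*table="(.*?)"'

print(re.search(wrong, qualified))  # None

for line in (qualified, unqualified):
    match = re.search(right, line)
    # group(1) is the package prefix (or None), group(2) the table name.
    print(match.group(1), match.group(2))
```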
On Pythex, I tried this regex:
<class name="(.*)\.MyObject" table="([^"]*)"
on this string:
<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">
and got these two match captures:
com.place.package
my_cool_object
So I think in your case, this line
searchObj = re.search(r'<class name="(.*)\.' + pojo + '" table="([^"]*)"', contents)
will produce the result you want.
About the confusing backslashes (you add two and then four show up): the Python documentation for re (7.2, “Regular expression operations”) explains that r'' is “raw string notation”, used to circumvent Python’s regular character escaping, which uses a backslash. So:
'\\' means “a string composed of one backslash”, since the first backslash in the string escapes the second backslash. Python sees the first backslash and thinks, ‘the next character is a special one’; then it sees the second and says, ‘the special character is an actual backslash’. It’s stored as a single character \. If you ask Python for the repr of this string (for example in Pdb or at the interactive prompt), it will escape the output and show you '\\'; print shows the single \.
r'\\' means “a string composed of two actual backslashes”. It’s stored as character \ followed by character \. If you ask Python for the repr of this string, it will escape the output and show you '\\\\'; print shows \\.
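The difference between what is stored and what is displayed can be checked directly; a minimal sketch with made-up variable names:

```python
one = '\\'    # normal literal: one backslash
two = r'\\'   # raw literal: two backslashes

print(len(one), len(two))  # 1 2

# print() shows the stored characters; repr() shows a literal that
# would recreate the string, so every backslash is doubled.
print(one)        # \
print(repr(one))  # '\\'
print(two)        # \\
print(repr(two))  # '\\\\'
```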
Just a simple question concerning raw string, regex pattern and replacement:
I have a string variable defined as follow:
> print repr(foo)
'\n\t\t\n\t\tIf (GUTIAttach>=1) //In case of GUTI attach Enodeb should not ask RRCUecapa again\n\t\tUECapInfo;//Mps("( \\"rat_Type\\":0 \\"ueCapabilitiesRAT_Container\\":hex:011c0000000080 )");
My problem is the characters "(" and ")": I want to replace them with "\(" and "\)" inside the raw string, because it will be used afterwards as a regular expression pattern.
I tried to use this method:
foo_tmp= [inc.replace(')', '\)') for inc in foo]
foo_tmp= [inc.replace('(', '\(') for inc in foo_tmp]
foo = "".join(foo_tmp)
the result gives:
> print repr(foo)
'\n\t\t\n\t\tIf \\(GUTIAttach>=1\\) //In case of GUTI attach Enodeb should not ask RRCUecapa again\n\t\t{\n\t\t\tUECapInfo;//Mps\\("\\( \\"rat_Type\\":0 \\"ueCapabilitiesRAT_Container\\":hex:011c0000000080 \\)"\\);
Characters "(" and ")" have been replaced by "\\(" and "\\)" instead of "\(" and "\)".
That's a bit unexpected for me, so do you know how I can get just a single backslash without changing the rest of the string?
Note: The method .decode('string_escape') is also not working due to the rest of the string. Double backslashes already present in the original raw string must not change.
Thanks a lot for your help
Use the re.escape() function to escape regular expression meta characters for you.
What you are seeing is otherwise perfectly normal Python behaviour; you are looking at a python literal representation; the output can be pasted back into a Python interpreter and recreate the value. As such, anything that could be interpreted as an escape code is escaped for you; a single \ would normally be doubled to prevent it being interpreted as the start of an escape sequence:
>>> '\('
'\\('
>>> print '\\('
\(
You can see this at work in other places in your foo string; the \n character combination represents a newline character, not two separate characters \ and n. If you wanted to include a literal \ and n in the text, you'd have to double the backslash to \\n. Further on into the value of foo you'll find \\", which is a single backslash followed by a " quote.
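As a rough illustration in Python 3 syntax (the thread itself uses Python 2), the sketch below uses a shortened, hypothetical stand-in for foo. It shows that re.escape() yields a pattern matching the original text literally, and that a manual replacement inserts only one backslash per parenthesis even though repr() displays two:

```python
import re

# Hypothetical, shortened stand-in for the foo string in the question.
foo = 'If (GUTIAttach>=1)\n\t\tUECapInfo;'

# re.escape() escapes every character that could be special in a regex.
pattern = re.escape(foo)
print(re.fullmatch(pattern, foo) is not None)  # True

# Manual replacement: '\\(' is a two-character string, backslash + '('.
manual = foo.replace('(', '\\(').replace(')', '\\)')
print(manual.count('\\'))  # 2

# repr() doubles the backslash for display; print() shows the real content.
print(repr('\\('))  # '\\('
print('\\(')        # \(
```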