This question already has answers here:
How can I put an actual backslash in a string literal (not use it for an escape sequence)?
(4 answers)
Closed 7 months ago.
I want to remove the line returns of a text that is wrapped to a certain width. e.g.
import re
x = 'the meaning\nof life'
re.sub("([,\w])\n(\w)", "\1 \2", x)
'the meanin\x01 \x02f life'
I want to return the meaning of life. What am I doing wrong?
You need to escape that \ like this:
>>> import re
>>> x = 'the meaning\nof life'
>>> re.sub("([,\w])\n(\w)", "\1 \2", x)
'the meanin\x01 \x02f life'
>>> re.sub("([,\w])\n(\w)", "\\1 \\2", x)
'the meaning of life'
>>> re.sub("([,\w])\n(\w)", r"\1 \2", x)
'the meaning of life'
>>>
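If you don't escape it, Python reads \1 in the string literal as an octal escape, so the replacement contains the character \x01 instead of a backreference: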
>>> '\1'
'\x01'
>>>
That's why we need to use '\\\\' or r'\\' to match a single \ in a Python regex.
On that point, from this answer:
If you're putting this in a string within a program, you may actually need to use four backslashes (because the string parser will remove two of them when "de-escaping" it for the string, and then the regex needs two for an escaped regex backslash).
And from the documentation:
As stated earlier, regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This conflicts with Python's usage of the same character for the same purpose in string literals.
Let's say you want to write a RE that matches the string \section, which might be found in a LaTeX file. To figure out what to write in the program code, start with the desired string to be matched. Next, you must escape any backslashes and other metacharacters by preceding them with a backslash, resulting in the string \\section. The resulting string that must be passed to re.compile() must be \\section. However, to express this as a Python string literal, both backslashes must be escaped again.
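A minimal sketch of the \section case from the quoted documentation. The sample text and variable names are made up for illustration:

```python
import re

# Hypothetical sample: a LaTeX line that starts with a literal backslash.
latex = r"\section{Intro}"

# Raw string: the regex engine receives \\section (an escaped backslash, then "section").
raw_pattern = r"\\section"
# Ordinary string literal: each of those two backslashes has to be doubled again.
plain_pattern = "\\\\section"

same_pattern = raw_pattern == plain_pattern
match = re.search(raw_pattern, latex)
matched_text = match.group() if match else None
print(same_pattern, repr(matched_text))
```

Both spellings produce the same pattern object contents; the raw form is just easier to read.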
Alternatively, as brittenb suggested, you don't need a regex in this case:
>>> x = 'the meaning\nof life'
>>> x.replace("\n", " ")
'the meaning of life'
>>>
Use raw string literals; both Python string literal syntax and regex interpret backslashes; \1 in a python string literal is interpreted as an octal escape, but not in a raw string literal:
re.sub(r"([,\w])\n(\w)", r"\1 \2", x)
The alternative would be to double all backslashes so that they reach the regex engine as such.
See the Backslash plague section of the Python regex HOWTO.
Demo:
>>> import re
>>> x = 'the meaning\nof life'
>>> re.sub(r"([,\w])\n(\w)", r"\1 \2", x)
'the meaning of life'
It might be easier just to split on newlines; use the str.splitlines() method, then re-join with spaces using str.join():
' '.join(x.splitlines())
but admittedly this won't distinguish between newlines between words and extra newlines elsewhere.
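To make that caveat concrete, here is a small comparison of the two approaches. The sample text with a blank line between paragraphs is an assumption added for illustration:

```python
import re

# Hypothetical wrapped text with a blank line marking a paragraph break.
wrapped = "the meaning\nof life\n\nsecond paragraph"

# Regex: only replaces a newline that sits directly between a word character
# (or comma) and another word character, so the blank line survives.
by_regex = re.sub(r"([,\w])\n(\w)", r"\1 \2", wrapped)

# splitlines/join: every line break becomes a space, including the blank line.
by_join = " ".join(wrapped.splitlines())

print(repr(by_regex))
print(repr(by_join))
```

Which behaviour you want depends on whether paragraph breaks in the input matter.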
I have a string: s = "we are \xaf\x06OK\x03family, good", and I want to replace \xaf, \x06 and \x03 with ''. The regular expression is pat = re.compile(r'\\[xX][0-9a-fA-F]+'), but it cannot match anything. The code is below:
pat = re.compile(r'\\[xX][0-9a-fA-F]+')
s = "we are \xaf\x06OK\x03family, good"
print(s)
print(re.sub(pat, '', s))
The result is
we are ¯OKfamily, good
we are ¯OKfamily, good
But how can I get we are OK family, good?
You are making the basic but common mistake of confusing the representation of a string in Python source code with its actual value.
There are a number of escape codes in Python which do not represent themselves verbatim in regular strings in source code. For example, "\n" represents a single newline character, even though the Python notation occupies two characters. The backslash is used to introduce this notation. There are a number of dedicated escape codes like \r, \a, etc, and a generalized notation \x01 which allows you to write any character code in hex notation (\n is equivalent to \x0a, \r is equivalent to \x0d, etc). To represent a literal backslash character, you need to escape it with another backslash: "\\".
In a "raw string", no backslash escapes are supported; so r"\n" represents a string containing two characters, a literal backslash \ and a literal lowercase n. You could equivalently write "\\n" using non-raw string notation. The r prefix is not part of the string, it just tells Python how to interpret the string between the following quotes (i.e. no interpretation at all; every character represents itself verbatim).
It is not clear from your question which of these interpretations you actually need, so I will present solutions for both.
Here is a literal string containing actual backslashes:
# {2} instead of +, so hex-looking letters after an escape (the "fa" in "family") are kept
pat = re.compile(r'\\[xX][0-9a-fA-F]{2}')
s = r"we are \xaf\x06OK\x03family, good"
print(s)
print(re.sub(pat, '', s))
Here is a string containing control characters and non-ASCII characters, and a regex substitution to remove them:
pat = re.compile(r'[\x00-\x1f\x80-\xff]+')
s = "we are \xaf\x06OK\x03family, good"
print(s)
print(re.sub(pat, '', s))
An additional complication is that the regex engine has its own internal uses for backslashes; we generally prefer to use raw strings for regexes in order to not have Python and the regex engine both interpreting backslashes (sometimes in incompatible ways).
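The distinction this answer draws between a string's source notation and its actual value can be checked directly. This is only a sketch; the variable names are made up, and the {2} quantifier is my choice for matching exactly two hex digits:

```python
import re

escaped = "we are \xaf\x06OK\x03family, good"    # each \xNN becomes one character
literal = r"we are \xaf\x06OK\x03family, good"   # backslashes are kept verbatim

# The first value contains no backslash at all; the second contains three.
print(len(escaped), "\\" in escaped)
print(len(literal), literal.count("\\"))

# Each value needs a different pattern.
cleaned_escaped = re.sub(r"[\x00-\x1f\x80-\xff]+", "", escaped)
cleaned_literal = re.sub(r"\\[xX][0-9a-fA-F]{2}", "", literal)
print(cleaned_escaped)
print(cleaned_literal)
```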
You have to treat your input string s as a raw string for this to work; see the example below:
pat = re.compile(r'\\[xX][0-9a-fA-F].')
s = r"we are \xaf\x06OK\x03family, good"
print(s)
print(re.sub(pat, '', s))
Another approach:
pat = re.compile(r'[^\w\d\s,]+')
s = "we are \xaf\x06OK\x03family, good"
print(' '.join(map(lambda x: x.strip(), pat.split(s))))
#=> we are OK family, good
This uses an inverted match: it splits on (and thereby removes) any characters that are not the ones you want.
>>> import re
>>> a='''\\n5
... 8'''
>>> b=re.findall('\\n[0-9]',a)
>>> print(b)
['\n8']
Why does it show \n8 and not \n5?
I used a \ in front of \n the first time.
I am finding the use of raw string in regex in python a bit confusing. To me it does not seem to be making any changes to the result
This is because, inside a string literal, \n stands for a single newline character.
When you write \\n5 you are escaping the \, so the string literally contains a backslash, an n and a 5 rather than a newline.
When you search with the pattern '\\n[0-9]', though, Python first turns \\ into a single backslash, so the regex engine receives \n[0-9]. The engine reads \n as a newline, which matches the actual newline before the 8 in your string, but not the \n before the 5, which is two separate characters: a backslash and an n.
\\n is not a newline, it's an escaped backslash with an n.
>>> import re
>>> a = '''\n5
... 8'''
>>> a=re.findall('\\n[0-9]',a)
>>> print(a)
['\n5', '\n8']
Because \\n5 is not a newline followed by 5, it prints as the literal text \n5.
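If the goal is to find the literal backslash-n rather than the real newline, a raw pattern with a doubled backslash should do it. A minimal sketch, with variable names chosen for illustration:

```python
import re

a = '\\n5\n8'   # same value as the triple-quoted string in the question

# '\\n[0-9]' reaches re as \n[0-9]: a newline followed by a digit.
newline_hits = re.findall('\\n[0-9]', a)

# r'\\n[0-9]' reaches re as \\n[0-9]: a literal backslash, "n", then a digit.
backslash_hits = re.findall(r'\\n[0-9]', a)

print(newline_hits, backslash_hits)
```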
I'm encountering confusing and seemingly contradictory rules regarding raw strings. Consider the following example:
>>> text = 'm\n'
>>> match = re.search('m\n', text)
>>> print match.group()
m
>>> print text
m
This works, which is fine.
>>> text = 'm\n'
>>> match = re.search(r'm\n', text)
>>> print match.group()
m
>>> print text
m
Again, this works. But shouldn't this throw an error, because the raw string contains the characters m\n and the actual text contains a newline?
>>> text = r'm\n'
>>> match = re.search(r'm\n', text)
>>> print match.group()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> print text
m\n
The above, surprisingly, throws an error, even though both are raw strings. This means both contain just the text m\n with no newlines.
>>> text = r'm\n'
>>> match = re.search(r'm\\n', text)
>>> print text
m\n
>>> print match.group()
m\n
The above works, surprisingly. Why do I have to escape the backslash in the re.search, but not in the text itself?
Then there's backslash with normal characters that have no special behavior:
>>> text = 'm\&'
>>> match = re.search('m\&', text)
>>> print text
m\&
>>> print match.group()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
This doesn't match, even though both the pattern and the string lack special characters.
In this situation, no combination of raw strings works (text as a raw string, patterns as a raw string, both or none).
However, consider the last example. Escaping in the text variable, 'm\\&', doesn't work, but escaping in the pattern does. This parallels the behavior above--even stranger, I feel, considering that \& is of no special meaning to either Python or re:
>>> text = 'm\&'
>>> match = re.search(r'm\\&', text)
>>> print text
m\&
>>> print match.group()
m\&
My understanding of raw strings is that they inhibit the behavior of the backslash in python. For regular expressions, this is important because it allows re.search to apply its own internal backslash behavior, and prevent conflicts with Python. However, in situations like the above, where backslash effectively means nothing, I'm not sure why it seems necessary. Worse yet, I don't understand why I need to backslash for the pattern, but not the text, and when I make both a raw string, it doesn't seem to work.
The docs don't provide much guidance in this regard. They focus on examples with obvious problems, such as '\section', where \s is a meta-character. Looking for a complete answer to prevent unanticipated behavior such as this.
In the regular Python string, 'm\n', the \n represents a single newline character, whereas in the raw string r'm\n' the \ and n are just themselves. So far, so simple.
If you pass the string 'm\n' as a pattern to re.search(), you're passing a two-character string (m followed by newline), and re will happily go and find instances of that two-character string for you.
If you pass the three-character string r'm\n', the re module itself will interpret the two characters \ n as having the special meaning "match a newline character", so that the whole pattern means "match an m followed by a newline", just as before.
In your third example, since the string r'm\n' doesn't contain a newline, there's no match:
>>> text = r'm\n'
>>> match = re.search(r'm\n', text)
>>> print(match)
None
With the pattern r'm\\n', you're passing two actual backslashes to re.search(), and again, the re module itself is interpreting the double backslash as "match a single backslash character".
In the case of 'm\&', something slightly different is going on. Python treats the backslash as a regular character, because it isn't part of an escape sequence. re, on the other hand, simply discards the \, so the pattern is effectively m&. You can see that this is true by testing the pattern against 'm&':
>>> re.search('m\&', 'm&').group()
'm&'
As before, doubling the backslash tells re to search for an actual backslash character:
>>> re.search(r'm\\&', 'm\&').group()
'm\\&'
... and just to make things a little more confusing, the single backslash is represented by Python doubled. You can see that it's actually a single backslash by printing it:
>>> print(re.search(r'm\\&', 'm\&').group())
m\&
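If the text to search for is known literally, one way to avoid counting backslashes by hand is re.escape, which is not mentioned in the question, so treat this as an optional alternative. A minimal sketch:

```python
import re

text = r'm\n'   # three characters: m, backslash, n

# The regex engine reads \n as "newline", so this finds nothing.
no_match = re.search(r'm\n', text)

# Doubling the backslash by hand matches the literal backslash.
manual = re.search(r'm\\n', text)

# re.escape builds an escaped pattern from the literal text itself.
escaped = re.search(re.escape(text), text)

print(no_match, manual.group(), escaped.group())
```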
To explain it in simple terms, \<character> has a special meaning in regular expressions. For example \s for whitespace characters, \d for decimal digits, \n for new-line characters, etc.
When you define a string as
s = 'foo\n'
This string contains the characters f, o, o and the new-line character (length 4).
However, when defining a raw string:
s = r'foo\n'
This string contains the characters f, o, o, \ and n (length 5).
When you compile a regexp with raw \n (i.e. r'\n'), it'll match all new lines. Similarly, just using the new-line character (i.e. '\n') it's going to match new-line characters just like a matches a and so on.
Once you understand this concept, you should be able to figure out the rest.
To elaborate a bit further: in order to match the backslash character \ using a regex, the valid regular expression is \\, which in Python is written r'\\' or, equivalently, '\\\\'.
text = r'm\n'
match = re.search(r'm\\n', text)
The r on the first line stops Python from interpreting \n as a single character.
The r on the second line plays the same role. The doubled \\ stops the regex engine from interpreting \n as a newline; the regex engine also uses \ for its own sequences such as \s and \d.
The following characters are the meta characters that give special meaning to the regular expression search syntax:
\ the backslash escape character.
The backslash gives special meaning to the character following it. For example, the combination "\n" stands for the newline, one of the control characters. The combination "\w" stands for a "word" character, one of the convenience escape sequences while "\1" is one of the substitution special characters.
Example: The regex "aa\n" tries to match two consecutive "a"s at the end of a line, inclusive the newline character itself.
Example: "a\+" matches "a+" and not a series of one or more "a"s.
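The difference between an escaped and an unescaped + is easy to check. A small sketch with a made-up sample string:

```python
import re

sample = "a+ aaa"

# Escaped plus: matches the two-character text "a+" only.
literal_plus = re.findall(r"a\+", sample)

# Unescaped plus: a quantifier meaning "one or more a".
quantifier = re.findall(r"a+", sample)

print(literal_plus, quantifier)
```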
To understand the internal representation of the strings you're confused about, I'd recommend using the repr and len built-in functions. With them you can see exactly what each string contains, which removes the guesswork from pattern matching. For instance, let's say you want to analyze the strings you're having trouble with:
use_cases = [
    'm\n',
    r'm\n',
    'm\\n',
    r'm\\n',
    'm\&',
    r'm\&',
    'm\\&',
    r'm\\&',
]
for u in use_cases:
    print('-' * 10)
    print(u, repr(u), len(u))
The output would be:
----------
m
'm\n' 2
----------
m\n 'm\\n' 3
----------
m\n 'm\\n' 3
----------
m\\n 'm\\\\n' 4
----------
m\& 'm\\&' 3
----------
m\& 'm\\&' 3
----------
m\& 'm\\&' 3
----------
m\\& 'm\\\\&' 4
So you can see exactly the differences between normal/raw strings.
Say, I have a simple string
str = "hello hello hello 123"
In Python, I want to replace all words called "hello" with "<>", I use
re.sub("\bhello\b",'<>',str)
In Ruby 1.8.7 , I use
str.gsub!(/\bhello\b/,'<>')
However, the Ruby Interpreter works as expected changing all WORDS called hello properly. But, Python doesn't - it doesn't even recognize a single word called hello.
My questions are:
Why the difference?
How do I get the same functionality in Python?
Python strings interpret backslashes as escape codes; \b is a backspace character. Either double the backslash or use a raw string literal:
re.sub("\\bhello\\b", '<>', inputstring)
or
re.sub(r"\bhello\b", '<>', inputstring)
Compare:
>>> print "\bhello\b"
hello
>>> print r"\bhello\b"
\bhello\b
>>> len("\bhello\b"), len(r"\bhello\b")
(7, 9)
See The Backslash Plague section of the Python regex HOWTO:
As stated earlier, regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This conflicts with Python’s usage of the same character for the same purpose in string literals.
[...]
The solution is to use Python’s raw string notation for regular expressions; backslashes are not handled in any special way in a string literal prefixed with 'r', so r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Regular expressions will often be written in Python code using this raw string notation.
Demo:
>>> import re
>>> inputstring = "hello hello hello 123"
>>> re.sub("\bhello\b", '<>', inputstring)
'hello hello hello 123'
>>> re.sub(r"\bhello\b", '<>', inputstring)
'<> <> <> 123'
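For what it's worth, the non-raw pattern is not invalid; it simply looks for real backspace characters. A minimal sketch, with an input string invented for illustration:

```python
import re

plain = "\bhello\b"    # \b becomes a backspace character (code point 8)
raw = r"\bhello\b"     # backslash and "b" are kept, so re sees a word boundary

print(ord(plain[0]), len(plain), len(raw))

# The non-raw pattern only matches text that really contains backspaces.
with_backspaces = "\x08hello\x08"
hit = re.sub(plain, "<>", with_backspaces)
miss = re.sub(plain, "<>", "hello hello")
print(repr(hit), repr(miss))
```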
You have to make the pattern a raw string, because otherwise Python interprets \b as a backspace character:
>>> s = "hello hello hello 123"
>>> import re
>>> re.sub(r"\bhello\b",r'<>',s)
'<> <> <> 123'
Note: never name a variable str, because that shadows the built-in str type.
I'm trying to use a Python regular expression to get the first token of a character-separated string. I don't want to treat backslashed separators as real separators, so I'm using a negative lookbehind assertion. When the separator is a comma, it works without problem.
>>> import re
>>> re.match("(.*?)(?<!\\\\),.*", "Hello\, world!,This is a comma separated string,Third value").groups(1)[0]
'Hello\\, world!'
Whereas the exact same code by replacing the comma with an apostrophe does not work at all.
>>> import re
>>> re.match("(.*?)(?<!\\\\)'.*", "Hello\' world!'This is an apostrophe separated string'Third value").groups(1)[0]
'Hello'
>>>
I'm using python 2.7.2, but I have the same behavior with Python 3 (tested on Ideone). The Python re documentation does not indicate that ' is a special character, so I'm really wondering, why is my ' treated differently?
(Please, no comments: Who would want to have an apostrophe-separated file. Well... I do...)
print(repr("\'"),repr("\,"))
Results in:
"'" '\\,'
As you can see "\'" doesn't actually have a \\ in it. Hence when you change it to "\\'" the pattern matches producing:
Hello\' world!
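The same fix can be written with raw strings, which keeps the backslash in the input and needs only two backslashes in the lookbehind. This is a sketch under that assumption; the variable names are made up, and .group(1) is used instead of .groups(1)[0]:

```python
import re

# Raw literal: the backslash before the first apostrophe stays in the string.
line = r"Hello\' world!'This is an apostrophe separated string'Third value"

# (?<!\\) rejects an apostrophe that is preceded by a backslash.
first_token = re.match(r"(.*?)(?<!\\)'.*", line).group(1)

print(first_token)
```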
"\'" is actually an escape sequence:
\' Single quote (')
The reason that
>>> ord("\'") == ord("'")
True
holds is that "\'" is equivalent to "'". It makes sense for \' to be an escape sequence, since it lets you put an apostrophe inside a single-quoted string:
>>> 'i\'ll'
"i'll"