Regular Expressions working differently in Python and Ruby - python

Say, I have a simple string
str = "hello hello hello 123"
In Python, to replace every whole word "hello" with "<>", I use
re.sub("\bhello\b",'<>',str)
In Ruby 1.8.7, I use
str.gsub!(/\bhello\b/,'<>')
Ruby works as expected and replaces every whole word hello. Python doesn't: it doesn't recognize even a single hello.
My questions are:
Why the difference?
How do I get the same functionality in Python?

Python strings interpret backslashes as escape codes; \b is a backspace character. Either double the backslash or use a raw string literal:
re.sub("\\bhello\\b", '<>', inputstring)
or
re.sub(r"\bhello\b", '<>', inputstring)
Compare:
>>> print "\bhello\b"
hello
>>> print r"\bhello\b"
\bhello\b
>>> len("\bhello\b"), len(r"\bhello\b")
(7, 9)
See The Backslash Plague section of the Python regex HOWTO:
As stated earlier, regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This conflicts with Python’s usage of the same character for the same purpose in string literals.
[...]
The solution is to use Python’s raw string notation for regular expressions; backslashes are not handled in any special way in a string literal prefixed with 'r', so r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Regular expressions will often be written in Python code using this raw string notation.
Demo:
>>> import re
>>> inputstring = "hello hello hello 123"
>>> re.sub("\bhello\b", '<>', inputstring)
'hello hello hello 123'
>>> re.sub(r"\bhello\b", '<>', inputstring)
'<> <> <> 123'
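As a quick check of the explanation above, here is a small sketch that compares the three spellings of the pattern. It assumes Python 3 (the session above is Python 2), and the variable names are only illustrative.

```python
import re

text = "hello hello hello 123"

plain = "\bhello\b"       # \b becomes the backspace character (0x08)
doubled = "\\bhello\\b"   # each \\ becomes one literal backslash
raw = r"\bhello\b"        # backslashes are left untouched

print(plain[0] == "\x08")          # True
print(doubled == raw)              # True
print(re.sub(plain, "<>", text))   # hello hello hello 123
print(re.sub(raw, "<>", text))     # <> <> <> 123
```

Only the last two spellings hand the regex engine a backslash followed by b, which it reads as a word boundary.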

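You have to make the pattern a raw string, because in a normal string literal Python interprets \b as a backspace character rather than passing it to the regex engine as a word boundary.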
>>> s = "hello hello hello 123"
>>> import re
>>> re.sub(r"\bhello\b",r'<>',s)
'<> <> <> 123'
Note: never name a string str, because that shadows the built-in str type.
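To illustrate why reusing the name str is risky, here is a minimal sketch, assuming Python 3. The function shadowing_demo is a hypothetical helper; it keeps the shadowing local so the built-in stays usable elsewhere.

```python
def shadowing_demo():
    # Hypothetical helper: the local name "str" hides the built-in type
    # for the whole body of this function.
    str = "hello hello hello 123"
    try:
        return str(123)              # tries to call a string object
    except TypeError as exc:
        return "TypeError: %s" % exc

print(shadowing_demo())   # a TypeError message, not "123"
print(str(123))           # the built-in still works outside the function
```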

Related

Regular expression cannot match special symbols in Python

I have a string: s = "we are \xaf\x06OK\x03family, good", and I want to replace \xaf, \x06 and \x03 with ''. The regular expression is pat = re.compile(r'\\[xX][0-9a-fA-F]+'), but it cannot match anything. The code is below:
pat = re.compile(r'\\[xX][0-9a-fA-F]+')
s = "we are \xaf\x06OK\x03family, good"
print(s)
print(re.sub(pat, '', s))
The result is
we are ¯OKfamily, good
we are ¯OKfamily, good
But how can I get we are OK family, good?
You are making the basic but common mistake of confusing the representation of a string in Python source code with its actual value.
There are a number of escape codes in Python which do not represent themselves verbatim in regular strings in source code. For example, "\n" represents a single newline character, even though the Python notation occupies two characters. The backslash is used to introduce this notation. There are a number of dedicated escape codes like \r, \a, etc, and a generalized notation \x01 which allows you to write any character code in hex notation (\n is equivalent to \x0a, \r is equivalent to \x0d, etc). To represent a literal backslash character, you need to escape it with another backslash: "\\".
In a "raw string", no backslash escapes are supported; so r"\n" represents a string containing two characters, a literal backslash \ and a literal lowercase n. You could equivalently write "\\n" using non-raw string notation. The r prefix is not part of the string, it just tells Python how to interpret the string between the following quotes (i.e. no interpretation at all; every character represents itself verbatim).
It is not clear from your question which of these interpretations you actually need, so I will present solutions for both.
Here is a literal string containing actual backslashes:
pat = re.compile(r'\\[xX][0-9a-fA-F]{2}')  # exactly two hex digits; with + the match would also eat the "fa" of "family"
s = r"we are \xaf\x06OK\x03family, good"
print(s)
print(re.sub(pat, '', s))
Here is a string containing control characters and non-ASCII characters, and a regex substitution to remove them:
pat = re.compile(r'[\x00-\x1f\x80-\xff]+')
s = "we are \xaf\x06OK\x03family, good"
print(s)
print(re.sub(pat, '', s))
An additional complication is that the regex engine has its own internal uses for backslashes; we generally prefer to use raw strings for regexes in order to not have Python and the regex engine both interpreting backslashes (sometimes in incompatible ways).
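The two interpretations can be put side by side. This is only a sketch, assuming Python 3; the variable names are made up for illustration, and the {2} quantifier is my choice for matching exactly two hex digits.

```python
import re

escaped = "we are \xaf\x06OK\x03family, good"    # three single characters
literal = r"we are \xaf\x06OK\x03family, good"   # backslashes kept as text

# Each \xNN is one character in `escaped` but four in `literal`.
print(len(literal) - len(escaped))   # 9

# The value holds the actual characters: match them by code point range.
by_value = re.sub(r"[\x00-\x1f\x80-\xff]+", "", escaped)

# The value holds backslashes: match backslash, x, exactly two hex digits.
by_text = re.sub(r"\\[xX][0-9a-fA-F]{2}", "", literal)

# With + instead of {2}, the match after \x03 also swallows "fa".
greedy = re.sub(r"\\[xX][0-9a-fA-F]+", "", literal)

print(by_value)   # we are OKfamily, good
print(by_text)    # we are OKfamily, good
print(greedy)     # we are OKmily, good
```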
You have to treat your input string s as a raw string for this to work; see the example below:
pat = re.compile(r'\\[xX][0-9a-fA-F].')
s = r"we are \xaf\x06OK\x03family, good"
print(s)
print(re.sub(pat, '', s))
Another approach:
pat = re.compile(r'[^\w\d\s,]+')
s = "we are \xaf\x06OK\x03family, good"
print(' '.join(map(lambda x: x.strip(), pat.split(s))))
#=> we are OK family, good
This uses an inverted match: it splits on (and thereby removes) any characters that are not ones you want.

Removing wrapped line returns [duplicate]

I want to remove the line returns of a text that is wrapped to a certain width. e.g.
import re
x = 'the meaning\nof life'
re.sub("([,\w])\n(\w)", "\1 \2", x)
'the meanin\x01 \x02f life'
I want to return the meaning of life. What am I doing wrong?
You need to escape that \ like this:
>>> import re
>>> x = 'the meaning\nof life'
>>> re.sub("([,\w])\n(\w)", "\1 \2", x)
'the meanin\x01 \x02f life'
>>> re.sub("([,\w])\n(\w)", "\\1 \\2", x)
'the meaning of life'
>>> re.sub("([,\w])\n(\w)", r"\1 \2", x)
'the meaning of life'
>>>
If you don't escape it, Python reads \1 in the string literal as an octal escape for the character \x01:
>>> '\1'
'\x01'
>>>
That's why we need to use '\\\\' or r'\\' to match a single literal \ in a Python regex.
On that point, from this answer:
If you're putting this in a string within a program, you may actually need to use four backslashes (because the string parser will remove two of them when "de-escaping" it for the string, and then the regex needs two for an escaped regex backslash).
And from the documentation:
As stated earlier, regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This conflicts with Python's usage of the same character for the same purpose in string literals.
Let's say you want to write a RE that matches the string \section, which might be found in a LaTeX file. To figure out what to write in the program code, start with the desired string to be matched. Next, you must escape any backslashes and other metacharacters by preceding them with a backslash, resulting in the string \\section. The resulting string that must be passed to re.compile() must be \\section. However, to express this as a Python string literal, both backslashes must be escaped again.
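The \section case from the quoted passage can be checked directly. A short sketch, assuming Python 3; the sample text and variable names are made up for illustration.

```python
import re

latex = r"\section{Introduction}"   # text to search; contains one backslash

pattern_doubled = "\\\\section"   # four backslashes in a normal literal
pattern_raw = r"\\section"        # two backslashes in a raw literal

print(pattern_doubled == pattern_raw)          # True
print(len(pattern_raw))                        # 9
print(re.match(pattern_raw, latex).group(0))   # \section
```

Both literals produce the same nine-character pattern, which the regex engine reads as one escaped backslash followed by section.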
Alternatively, as brittenb suggested, you don't need a regex in this case:
>>> x = 'the meaning\nof life'
>>> x.replace("\n", " ")
'the meaning of life'
>>>
Use raw string literals; both Python string literal syntax and regex interpret backslashes; \1 in a python string literal is interpreted as an octal escape, but not in a raw string literal:
re.sub(r"([,\w])\n(\w)", r"\1 \2", x)
The alternative would be to double all backslashes so that they reach the regex engine as such.
See the Backslash plague section of the Python regex HOWTO.
Demo:
>>> import re
>>> x = 'the meaning\nof life'
>>> re.sub(r"([,\w])\n(\w)", r"\1 \2", x)
'the meaning of life'
It might be easier just to split on newlines; use the str.splitlines() method, then re-join with spaces using str.join():
' '.join(x.splitlines())
but admittedly this won't distinguish between newlines between words and extra newlines elsewhere.
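To make that caveat concrete, here is a small comparison of the two approaches on text that also contains a blank line. It is a sketch assuming Python 3, with a made-up sample string.

```python
import re

wrapped = "the meaning\nof life,\nexplained\n\nA new paragraph"

# Regex: only joins a newline that sits directly between two words
# (or between a comma and a word).
joined_regex = re.sub(r"([,\w])\n(\w)", r"\1 \2", wrapped)

# splitlines/join: replaces every line break, including the blank line.
joined_split = " ".join(wrapped.splitlines())

print(repr(joined_regex))   # 'the meaning of life, explained\n\nA new paragraph'
print(repr(joined_split))   # 'the meaning of life, explained  A new paragraph'
```

The regex version keeps the paragraph break, while the splitlines version flattens it into two spaces.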

Python putting r before unicode string variable

For static strings, putting an r in front of the string would give the raw string (e.g. r'some \' string'). Since it is not possible to put r in front of a unicode string variable, what is the minimal approach to dynamically convert a string variable to its raw form? Should I manually substitute all backslashes with double backslashes?
str_var = u"some text with escapes e.g. \( \' \)"
raw_str_var = ???
If you really need to escape a string, let's say you want to print a newline as \n, you can use the encode method with the Python-specific string_escape encoding (available in Python 2 only):
>>> s = "hello\nworld"
>>> e = s.encode("string_escape")
>>> e
'hello\\nworld'
>>> print s
hello
world
>>> print e
hello\nworld
You didn't mention anything about unicode, or which Python version you are using, but if you are dealing with unicode strings you should use unicode_escape instead.
>>> u = u"föö\nbär"
>>> print u
föö
bär
>>> print u.encode('unicode_escape')
f\xf6\xf6\nb\xe4r
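The session above is Python 2. As a rough Python 3 equivalent (an assumption on my part about what a reader would run today), unicode_escape returns bytes, so a decode step is needed to get text back. The variable names are illustrative.

```python
text = "föö\nbär"

# encode() gives bytes in Python 3; decode as ASCII to get a str again.
escaped = text.encode("unicode_escape").decode("ascii")
print(escaped)   # f\xf6\xf6\nb\xe4r

# The reverse direction restores the original string.
restored = escaped.encode("ascii").decode("unicode_escape")
print(restored == text)   # True
```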
Your post originally had the regex tag, maybe re.escape is what you're actually looking for?
>>> re.escape(u"foo\nbar\'baz")
u"foo\\\nbar\\'baz"
Note the "double escapes", i.e. printing the above string yields:
foo\
bar\'baz
There is nothing to convert - the r prefix is only significant in source code notation, not for program logic.
As a rule, if a backslash in a normal string literal doesn't start a valid escape sequence, Python keeps it as a single literal backslash, and repr() then displays it doubled (recent Python 3 versions also emit a warning for such invalid escape sequences):
>>> "\n \("
'\n \\('
Since it may be difficult to remember all the valid/invalid escape sequences, raw string notation was introduced. But there is no way and no need to convert a string after it has been defined.
In your case, the correct approach would be to use
str_var = ur"some text with escapes e.g. \( \' \)"
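which is more explicit. It differs from the non-raw version only at \', which is an escape sequence for a plain apostrophe in a normal literal but keeps its backslash in a raw one. (The ur prefix exists only in Python 2; in Python 3, a plain r"..." literal is already a Unicode raw string.)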
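A short sketch of the point that the r prefix only affects how source code is read, assuming Python 3; the sample strings are made up for illustration.

```python
# Two spellings in source code, one and the same value at run time.
raw = r"some text with escapes e.g. \( \)"
doubled = "some text with escapes e.g. \\( \\)"
print(raw == doubled)    # True

# A value built at run time involves no literal at all:
# chr(92) is a backslash, and it is never reinterpreted.
built = "a" + chr(92) + "n"
print(built == r"a\n")   # True
print(len(built))        # 3
```

Once a string object exists, there is no raw or non-raw flavour left to convert between.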

How to escape certain characters of a string?

I have a string This is a Test and a passed parameter s.
How can I escape all s-characters of that first string?
It should result in Thi\s i\s a Te\st
I tried it like this:
rstr = rstr.replace(esc, r"\\" + esc)
But this results in \\ before each s.
r'\\' produces a literal double backslash:
>>> r'\\'
'\\\\'
Don't use a raw string here, just use '\\':
>>> 'This is a Test'.replace('s', '\\s')
'Thi\\s i\\s a Te\\st'
Don't confuse the Python representation with the value. To make debugging and round-tripping easy, the Python interpreter uses the repr() function to echo back results.
The repr() of a string uses Python literal notation, including escaping any backslashes:
>>> r'\s'
'\\s'
>>> len(r'\s')
2
>>> print r'\s'
\s
The actual value contains just one backslash, but because a backslash can start characters with special meanings, such as \n (a newline), or \x00 (a null byte), they are represented escaped so that you can paste the value directly back into the interpreter.
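The difference between the value and its repr() can be seen with a tiny helper. This is a sketch assuming Python 3; escape_char is a hypothetical name, not something from the question.

```python
def escape_char(text, char):
    # Hypothetical helper: put one backslash in front of every `char`.
    return text.replace(char, "\\" + char)

result = escape_char("This is a Test", "s")

print(result)                  # Thi\s i\s a Te\st
print(repr(result))            # 'Thi\\s i\\s a Te\\st'
print(len("\\"), len(r"\\"))   # 1 2
```

print() shows the value with its single backslashes, while repr() shows the literal you would type to get that value.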

Apostrophe within Python lookbehind assertion

I'm trying to use a Python regular expression to get the first token of a character-separated string. I don't want to treat backslashed separators as real separators, so I'm using a negative lookbehind assertion. When the separator is a comma, it works without problem.
>>> import re
>>> re.match("(.*?)(?<!\\\\),.*", "Hello\, world!,This is a comma separated string,Third value").groups(1)[0]
'Hello\\, world!'
Whereas the exact same code by replacing the comma with an apostrophe does not work at all.
>>> import re
>>> re.match("(.*?)(?<!\\\\)'.*", "Hello\' world!'This is an apostrophe separated string'Third value").groups(1)[0]
'Hello'
>>>
I'm using python 2.7.2, but I have the same behavior with Python 3 (tested on Ideone). The Python re documentation does not indicate that ' is a special character, so I'm really wondering, why is my ' treated differently?
(Please, no comments: Who would want to have an apostrophe-separated file. Well... I do...)
print(repr("\'"),repr("\,"))
Results in:
"'" '\\,'
As you can see, "\'" doesn't actually contain a backslash. Hence, when you change it to "\\'" in the input string, the pattern matches as intended, producing:
Hello\' world!
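Here is one way the fix could look, as a sketch assuming Python 3. The helper first_token is hypothetical; writing the input as a raw string is what keeps the backslash in front of the apostrophe.

```python
import re

def first_token(text, sep):
    # Hypothetical helper: return the text up to the first separator
    # that is not preceded by a backslash.
    match = re.match(r"(.*?)(?<!\\)" + re.escape(sep), text)
    return match.group(1) if match else text

# Raw literal: the backslash before the apostrophe is part of the value.
line = r"Hello\' world!'This is an apostrophe separated string'Third value"
print(first_token(line, "'"))                             # Hello\' world!

# Normal literal: \' collapses to a bare apostrophe before re sees it.
print(first_token("Hello\' world!'Third value", "'"))     # Hello

# The comma case behaves the same way once the input is raw.
print(first_token(r"Hello\, world!,Third value", ","))    # Hello\, world!
```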
"\'" is actually an escape sequence:
\' Single quote (')
This is why
>>> ord("\'") == ord("'")
True
holds: "\'" is equivalent to "'". It makes sense for \' to be an escape sequence, since it lets you put an apostrophe inside a single-quoted literal:
>>> 'i\'ll'
"i'll"
