Why do Python regex strings sometimes work without using raw strings? - python

Python recommends using raw strings when defining regular expressions in the re module. From the Python documentation:
Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.
However, in many cases this is not necessary, and you get the same result whether you use a raw string or not:
$ ipython
In [1]: import re
In [2]: m = re.search("\s(\d)\s", "a 3 c")
In [3]: m.groups()
Out[3]: ('3',)
In [4]: m = re.search(r"\s(\d)\s", "a 3 c")
In [5]: m.groups()
Out[5]: ('3',)
Yet in some cases the results differ:
In [6]: m = re.search("\s(.)\1\s", "a 33 c")
In [7]: m.groups()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-12-84a8d9c174e2> in <module>()
----> 1 m.groups()
AttributeError: 'NoneType' object has no attribute 'groups'
In [8]: m = re.search(r"\s(.)\1\s", "a 33 c")
In [9]: m.groups()
Out[9]: ('3',)
And without a raw string, you have to double the backslashes:
In [10]: m = re.search("\\s(.)\\1\\s", "a 33 c")
In [11]: m.groups()
Out[11]: ('3',)
My question is why do the non-escaped, non-raw regex strings work at all with special characters (as in command [2] above)?

The example above works because \s and \d are not escape sequences in Python. According to the docs:
Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the string. 
But it's best to just use raw strings and not worry about what is or isn't a Python escape, or about having to change it later if you change the regex.

It is because \s and \d are not escape sequences:
>>> print('\s')
\s
>>> print('\d')
\d
>>>
So, they are treated literally as \s and \d. \1 however is an escape sequence:
>>> print('\1')
☺
>>>
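This means the pattern ends up containing the control character \x01 (which some consoles display as ☺) instead of the two characters \ and 1, so the backreference is lost.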
For a complete list of Python's escape sequences, see String and Bytes literals in the documentation.
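As a rough sketch of the point above, the three patterns below differ only in how \1 is written. The variable names are mine, chosen for illustration.

```python
import re

text = "a 33 c"

raw_pattern = r"\s(.)\1\s"        # backslashes reach the regex engine untouched
escaped_pattern = "\\s(.)\\1\\s"  # same characters, written with doubled backslashes
octal_pattern = "\\s(.)\1\\s"     # "\1" is an octal escape: the single character chr(1)

print(raw_pattern == escaped_pattern)         # True
print(len(r"\1"), len("\1"))                  # 2 1
print(re.search(raw_pattern, text).groups())  # ('3',)
print(re.search(octal_pattern, text))         # None
```

The last search fails because the regex engine is asked to find a literal \x01 character in the text rather than a repeat of group 1.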

Related

Why is raw string performing inconsistently in parenthesis

For example:
a = (r'''\n1''')
b = (r'''
2''')
print(a)
print(b)
The output of this example is this:
\n1
2
So even though b is supposed to be a raw string, it does not seem to behave like one. Why is this?
I also checked:
if '\n' in b:
    print('yes')
This prints yes, meaning that b really does contain a \n (newline) character.
In the raw string syntax, escape sequences have no special meaning (apart from a backslash before a quote). The characters \ plus n form two characters in a raw string literal, unlike a regular string literal, where those two characters are replaced by a newline character.
An actual newline character, on the other hand, is not an escape sequence. It is just a newline character, and is included in the string as such.
Compare this to using 1 versus \x31; the latter is an escape sequence for the ASCII codepoint of the digit 1. In a regular string literal, both give you the character 1; in a raw string literal, the escape sequence is not interpreted:
>>> print('1\x31')
11
>>> print(r'1\x31')
1\x31
None of this has anything to do with parentheses. The parentheses do not alter the behaviour of an r'''...''' raw string. The exact same thing happens when you remove them:
>>> a = r'''\n1'''
>>> a
'\\n1'
>>> print(a)
\n1
>>> b = r'''
... 2'''
>>> b
'\n2'
>>> print(b)
2
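A quick way to confirm this, sketched here with list() to expose each character, is to compare what the two literals actually contain:

```python
a = r'''\n1'''   # backslash + n typed in the source: stays as two characters
b = r'''
2'''             # a real line break typed in the source: one newline character

print(list(a))  # ['\\', 'n', '1']
print(list(b))  # ['\n', '2']
print(len(a), len(b))  # 3 2
```

The r prefix only changes how backslash sequences in the source are read; it does nothing to a line break that is physically present between the quotes.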

Python re.sub not returning match

In my brain, the following:
>>> re.sub('([eo])', '_\1_', 'aeiou')
should return:
'a_e_i_o_u'
instead it returns:
'a_\x01_i_\x01_u'
I'm sure I'm having a brain cramp, but I can't for the life of me figure out what's wrong.
\1 produces \x01 in Python string literals. Double the slash, or use a raw string literal:
>>> import re
>>> re.sub('([eo])', '_\1_', 'aeiou')
'a_\x01_i_\x01_u'
>>> re.sub('([eo])', '_\\1_', 'aeiou')
'a_e_i_o_u'
>>> re.sub('([eo])', r'_\1_', 'aeiou')
'a_e_i_o_u'
See The Backslash Plague in the Python regex HOWTO:
As stated earlier, regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This conflicts with Python’s usage of the same character for the same purpose in string literals.
Use a raw string (the r prefix):
re.sub('([eo])', r'_\1_', 'aeiou')
Output:
In [3]: re.sub('([eo])', r'_\1_', 'aeiou')
Out[3]: 'a_e_i_o_u'
In [4]: "\1"
Out[4]: '\x01'
In [5]: r"\1"
Out[5]: '\\1'
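As a supplementary sketch (not part of the answers above), the re module also accepts the \g<1> form for group references in replacement strings, and a function can be passed as the replacement so that no backslash template is involved at all:

```python
import re

# \g<1> refers to group 1; a raw string keeps the backslash intact.
with_g = re.sub(r'([eo])', r'_\g<1>_', 'aeiou')

# A callable replacement receives the match object and returns the new text.
with_func = re.sub(r'([eo])', lambda m: '_' + m.group(1) + '_', 'aeiou')

print(with_g)     # a_e_i_o_u
print(with_func)  # a_e_i_o_u
```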

Regex Findall Hang in Linux

I wrote a program to find floats inside a string using re.findall, as follows:
string1 = 'Voltage = 3.0 - 4.0 V'
string2 = '3.66666'
float1 = re.findall('\d+.\d+', string1)
float2 = re.findall('\d+.\d+', string2)
This program runs well on Windows, but when I tried to run it on Linux, it keeps getting stuck on the second re.findall. Any idea what causes this problem, and how to solve it?
Thank you
You need to define your regex as a raw string, and you also need to escape the dot. The dot is a special metacharacter in regex that matches any character except line breaks. Escaping the dot in your regex makes it match a literal dot.
float1 = re.findall(r'\d+\.\d+', string1)
float2 = re.findall(r'\d+\.\d+', string2)
From the re documentation:
Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.
The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.
>>> string1 = 'Voltage = 3.0 - 4.0 V'
>>> string2 = '3.66666'
>>> float1 = re.findall(r'\d+\.\d+', string1)
>>> float2 = re.findall(r'\d+\.\d+', string2)
>>> float1
['3.0', '4.0']
>>> float2
['3.66666']
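To illustrate why escaping the dot matters, here is a small sketch on a made-up sample string (the sample text is my own, not from the question). It only demonstrates the matching difference and does not attempt to reproduce the reported hang.

```python
import re

sample = 'ID 12345, 3x7, 3.66666'

# Unescaped dot: "." matches any character, so plain integers and "3x7" slip through.
loose = re.findall(r'\d+.\d+', sample)

# Escaped dot: only digits, a literal period, then digits.
strict = re.findall(r'\d+\.\d+', sample)

print(loose)   # ['12345', '3x7', '3.66666']
print(strict)  # ['3.66666']
```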

python Regex match exact word

I am trying to match different expressions for addresses:
Example: '398 W. Broadway'
I would like to match W. or E. (East), or Pl. for Place, etc.
This seems very simple with a regex such as
(W.|West)
Yet Python's re module doesn't match anything when I try it:
>>> a
'398 W. Broadway'
>>> x = re.match('(W.|West)', a)
>>> x
>>> x == None
True
>>>
re.match matches at the beginning of the input string.
To match anywhere, use re.search instead.
>>> import re
>>> re.match('a', 'abc')
<_sre.SRE_Match object at 0x0000000001E18578>
>>> re.match('a', 'bac')
>>> re.search('a', 'bac')
<_sre.SRE_Match object at 0x0000000002654370>
See search() vs. match():
Python offers two different primitive operations based on regular
expressions: re.match() checks for a match only at the beginning of
the string, while re.search() checks for a match anywhere in the
string (this is what Perl does by default).
.match() constrains the search to begin at the first character of the string. Use .search() instead. Note too that . matches any character (except a newline). If you want to match a literal period, escape it (\. instead of plain .).
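Both points can be seen together in the following sketch. The address comes from the question; the other two strings and the variable names are hypothetical examples of mine.

```python
import re

address = '398 W. Broadway'
loose = '(W.|West)'      # unescaped dot: "W" followed by any character
strict = r'(W\.|West)'   # escaped dot: "W" followed by a literal period

print(re.match(loose, address))                # None: the string starts with "398"
print(re.search(loose, address).group())       # W.
print(re.search(loose, '5 West St').group())   # We
print(re.search(strict, '5 West St').group())  # West
print(re.search(strict, '12 Water St'))        # None
```

With the unescaped dot, the first alternative already matches "We", so the West alternative is never tried; escaping the dot avoids that.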

simplejson - encoding regexp \d+

I have a misunderstanding about how a regexp gets encoded:
>>> simplejson.dumps({'title':r'\d+'})
'{"title": "\\\\d+"}'
>>> simplejson.loads('{"title": "\\\\d+"}')
{u'title': u'\\d+'}
>>> print simplejson.loads('{"title": "\\\\d+"}')['title']
\d+
So without print I see \\, and with print I see \. Which does the loaded dict actually contain: \\ or \?
Here is a trick: Use list to see what characters are really in the string:
In [3]: list(u'\\d+')
Out[3]: [u'\\', u'd', u'+']
list breaks up the string into individual characters. So u'\\' is one character. (The double backslash in u'\\' is an escape sequence.) It represents one backslash character. This is correct since r'\d+' also has only one backslash:
In [4]: list(r'\d+')
Out[4]: ['\\', 'd', '+']
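The same check can be run on the full round trip. The sketch below uses the standard-library json module on Python 3 as a stand-in for simplejson; I am assuming the two behave the same way for this case.

```python
import json

encoded = json.dumps({'title': r'\d+'})

print(encoded)        # {"title": "\\d+"}      <- the JSON text itself: two backslashes
print(repr(encoded))  # '{"title": "\\\\d+"}'  <- repr escapes each of them again

decoded = json.loads(encoded)['title']
print(len(decoded), list(decoded))  # 3 ['\\', 'd', '+']
print(decoded == r'\d+')            # True
```

So the JSON text carries two backslashes (JSON's own escaping), while the loaded Python string contains just one; the doubled backslash seen without print is only how repr displays it.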
