Python re.sub not returning match

In my brain, the following:
>>> re.sub('([eo])', '_\1_', 'aeiou')
should return:
'a_e_i_o_u'
instead it returns:
'a_\x01_i_\x01_u'
I'm sure I'm having a brain cramp, but I can't for the life of me figure out what's wrong.

\1 produces \x01 in Python string literals. Double the slash, or use a raw string literal:
>>> import re
>>> re.sub('([eo])', '_\1_', 'aeiou')
'a_\x01_i_\x01_u'
>>> re.sub('([eo])', '_\\1_', 'aeiou')
'a_e_i_o_u'
>>> re.sub('([eo])', r'_\1_', 'aeiou')
'a_e_i_o_u'
See The Backslash Plague in the Python regex HOWTO:
As stated earlier, regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This conflicts with Python’s usage of the same character for the same purpose in string literals.
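To make the difference concrete, here is a small sketch that compares the spellings of the replacement string discussed above. The helper name `wrap_vowels` is made up for illustration; the `\g<1>` form is another group reference that `re.sub` accepts in replacement templates.

```python
import re

def wrap_vowels(text, replacement):
    """Hypothetical helper: wrap each 'e' or 'o' using the given replacement template."""
    return re.sub(r'([eo])', replacement, text)

# In a normal string literal, \1 is an octal escape for the character U+0001.
assert '_\1_' == '_\x01_'

# A doubled backslash and a raw string both yield the two characters \ and 1.
assert '_\\1_' == r'_\1_'

print(repr(wrap_vowels('aeiou', '_\1_')))   # 'a_\x01_i_\x01_u'
print(wrap_vowels('aeiou', r'_\1_'))        # a_e_i_o_u
print(wrap_vowels('aeiou', r'_\g<1>_'))     # a_e_i_o_u
```

The raw-string form is usually the easiest to read, because what you type is what the regex engine receives.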

Use a raw string (the r prefix):
re.sub('([eo])', r'_\1_', 'aeiou')
Output:
In [3]: re.sub('([eo])', r'_\1_', 'aeiou')
Out[3]: 'a_e_i_o_u'
In [4]: "\1"
Out[4]: '\x01'
In [5]: r"\1"
Out[5]: '\\1'

Related

regex unicode characters

The following regex works online but finds no matches in my Python code:
https://regex101.com/r/lY1kY8/2
s=re.sub(r'\x.+[0-9]',' ',s)
Required output:
re.sub(r'\x.+[0-9]* ',' ',r'cats\xe2\x80\x99 faces')
Out[23]: 'cats faces'
Basically, I want to remove the Unicode special characters "\xe2\x80\x99".
As another option that doesn't require regex, you could instead remove the Unicode characters by dropping anything not listed in string.printable:
>>> import string
>>> ''.join(i for i in 'cats\xe2\x80\x99 faces' if i in string.printable)
'cats faces'
print re.findall(r'\\x.*?[0-9]* ',r'cats\xe2\x80\x99 faces')
                   ^^
Use a raw string and escape the backslash, as marked. Use findall (or search) rather than match, because match only matches at the beginning of the string.
print re.sub(ur'\\x.*?[0-9]+','',r'cats\xe2\x80\x99 faces')
with re.sub
s=r'cats\xe2\x80\x99 faces'
print re.sub(r'\\x.+?[0-9]*','',s)
EDIT:
The correct way is to decode the UTF-8 bytes first and then apply the regex.
s='cats\xe2\x80\x99 faces'
\xe2\x80\x99 is the UTF-8 encoding of U+2019 (right single quotation mark):
print re.sub(u'\u2019','',s.decode('utf-8'))
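The snippets in this thread are Python 2. As a rough Python 3 equivalent, assuming the text arrives as UTF-8 encoded bytes, one could decode first and then delete U+2019. The function name `strip_right_quote` is hypothetical.

```python
import re

def strip_right_quote(raw_bytes):
    """Decode UTF-8 bytes, then delete U+2019 (right single quotation mark)."""
    text = raw_bytes.decode('utf-8')
    return re.sub('\u2019', '', text)

raw = b'cats\xe2\x80\x99 faces'
print(len(raw))                  # 13 bytes: the quotation mark occupies three of them
print(strip_right_quote(raw))    # cats faces
```

After decoding, the three bytes become a single character, so the pattern only has to name that one code point.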
Assuming you are using Python 2.x:
>>> s = 'cats\xe2\x80\x99 f'
>>> len(s), s[4]
(9, 'â')
This means a byte like \xe2 is a single character of length 1, not the literal text backslash, x, e, 2. That is why a pattern such as r'\\x.+?[0-9]*' cannot match it. Filtering by character value works instead:
>>> s = '\x63\x61\x74\x73\xe2\x80\x99 f'
>>> ''.join([c for c in s if c <= 'z'])
'cats f'
Hope this helps a bit.

Why do Python regex strings sometimes work without using raw strings?

Python recommends using raw strings when defining regular expressions in the re module. From the Python documentation:
Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.
However, in many cases this is not necessary, and you get the same result whether you use a raw string or not:
$ ipython
In [1]: import re
In [2]: m = re.search("\s(\d)\s", "a 3 c")
In [3]: m.groups()
Out[3]: ('3',)
In [4]: m = re.search(r"\s(\d)\s", "a 3 c")
In [5]: m.groups()
Out[5]: ('3',)
Yet in some cases the two forms behave differently:
In [6]: m = re.search("\s(.)\1\s", "a 33 c")
In [7]: m.groups()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-12-84a8d9c174e2> in <module>()
----> 1 m.groups()
AttributeError: 'NoneType' object has no attribute 'groups'
In [8]: m = re.search(r"\s(.)\1\s", "a 33 c")
In [9]: m.groups()
Out[9]: ('3',)
And you must escape the special characters when not using a raw string:
In [10]: m = re.search("\\s(.)\\1\\s", "a 33 c")
In [11]: m.groups()
Out[11]: ('3',)
My question is why do the non-escaped, non-raw regex strings work at all with special characters (as in command [2] above)?
The example above works because \s and \d are not escape sequences in Python. According to the docs:
Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the string. 
But it's best to just use raw strings and not worry about what is or isn't a python escape, or worry about changing it later if you change the regex.
It is because \s and \d are not escape sequences:
>>> print('\s')
\s
>>> print('\d')
\d
>>>
So, they are treated literally as \s and \d. \1 however is an escape sequence:
>>> print('\1')
☺
>>>
This means that it is being interpreted as ☺ instead of \1.
For a complete list of Python's escape sequences, see String and Bytes literals in the documentation.
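A quick way to check how Python treats a given escape is to evaluate the literal's source text and look at the result. The sketch below does that with a hypothetical helper, `evaluate_literal`. It also records warnings, because recent Python 3 releases warn about unrecognized escapes such as `\d` in non-raw literals (the warning category depends on the version).

```python
import warnings

def evaluate_literal(source):
    """Evaluate string-literal source text; return its value and any warning names."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        value = eval(source)
    return value, [w.category.__name__ for w in caught]

# \d is not a recognized escape, so the backslash survives (two characters).
value, warned = evaluate_literal(r"'\d'")
print(len(value), warned)

# \1 is an octal escape, so the literal collapses to one character.
value, warned = evaluate_literal(r"'\1'")
print(len(value), warned)

# With the r prefix nothing is interpreted, so \1 stays as two characters.
value, warned = evaluate_literal(r"r'\1'")
print(len(value), warned)
```

The warning is one more reason to prefer raw strings for patterns: a literal that works today by accident may become noisier in a later Python version.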

Why does my python regex not work?

I want to replace every run of a character that occurs more than once in a row. I used Python's re.sub, and my call looks like this: data=re.sub('(.)\1+','##',data). But nothing happened...
Here is my text:
※※※※※※※※※※※※※※※※※Chapter One※※※※※※※※※※※※※※※※※※
This is the begining...
You need to use a raw string here. Otherwise \1 is interpreted as an octal escape, and the character with that code (its ASCII value) is placed in the string:
>>> '\1'
'\x01'
>>> chr(0o1)
'\x01'
>>> '\101'
'A'
>>> chr(0o101)
'A'
Use raw string to fix this:
>>> '(.)\1+'
'(.)\x01+'
>>> r'(.)\1+' #Note the `r`
'(.)\\1+'
Use a raw string, so the regex engine interprets backslashes instead of the Python parser. Just put an r in front of the string:
data=re.sub(r'(.)\1+', '##', data)
            ^ this r is the important bit
Otherwise, \1 is interpreted as character value 1 instead of a backreference.
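As a minimal sketch of the fix applied to a shortened version of the sample text, the following wraps the raw-string pattern in a hypothetical helper named `collapse_runs` and contrasts it with the non-raw pattern from the question.

```python
import re

def collapse_runs(data, marker='##'):
    """Replace every run of two or more identical characters with marker."""
    return re.sub(r'(.)\1+', marker, data)

sample = '※※※Chapter One※※※'
print(collapse_runs(sample))                       # ##Chapter One##

# Without the r prefix the pattern looks for U+0001 and finds nothing to replace.
print(re.sub('(.)\1+', '##', sample) == sample)    # True
```

This also explains the "nothing happened" symptom: the non-raw pattern is still valid, it simply never matches ordinary text.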

python Regex match exact word

I am trying to match different expressions for addresses:
Example: '398 W. Broadway'
I would like to match W., E. (East), Pl. (Place), and so on.
It should be very simple with a regex such as
(W.|West)
yet Python's re module doesn't match anything when I try it:
>>> a
'398 W. Broadway'
>>> x = re.match('(W.|West)', a)
>>> x
>>> x == None
True
>>>
re.match matches at the beginning of the input string.
To match anywhere, use re.search instead.
>>> import re
>>> re.match('a', 'abc')
<_sre.SRE_Match object at 0x0000000001E18578>
>>> re.match('a', 'bac')
>>> re.search('a', 'bac')
<_sre.SRE_Match object at 0x0000000002654370>
See search() vs. match():
Python offers two different primitive operations based on regular
expressions: re.match() checks for a match only at the beginning of
the string, while re.search() checks for a match anywhere in the
string (this is what Perl does by default).
.match() constrains the search to begin at the first character of the string. Use .search() instead. Note too that . matches any character (except a newline). If you want to match a literal period, escape it (\. instead of plain .).
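Putting both points together (search anywhere, and escape the period), a sketch for the address example might look like this. The pattern and the name `find_direction` are illustrative assumptions, not something from the question; the `\b` anchors are added so that "West" is not found inside "Western".

```python
import re

# Hypothetical pattern: the period is escaped, and \b keeps matches on word boundaries.
DIRECTION = re.compile(r'\b(?:W\.|West\b)')

def find_direction(address):
    """Return the first direction token found anywhere in address, or None."""
    found = DIRECTION.search(address)
    return found.group() if found else None

print(find_direction('398 W. Broadway'))      # W.
print(find_direction('398 West Broadway'))    # West
print(find_direction('398 Western Ave'))      # None
print(DIRECTION.match('398 W. Broadway'))     # None, match() only tries position 0
```

Note that there is no `\b` after `W\.`: a period followed by a space is not a word boundary, so adding one there would make the abbreviation fail to match.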

simplejson - encoding regexp \d+

I'm confused about how a regexp gets encoded:
>>> simplejson.dumps({'title':r'\d+'})
'{"title": "\\\\d+"}'
>>> simplejson.loads('{"title": "\\\\d+"}')
{u'title': u'\\d+'}
>>> print simplejson.loads('{"title": "\\\\d+"}')['title']
\d+
So without print I see \\, and with print I see \. Which one does the loaded dict actually contain: \\ or \?
Here is a trick: Use list to see what characters are really in the string:
In [3]: list(u'\\d+')
Out[3]: [u'\\', u'd', u'+']
list breaks up the string into individual characters. So u'\\' is one character. (The double backslash in u'\\' is an escape sequence.) It represents one backslash character. This is correct since r'\d+' also has only one backslash:
In [4]: list(r'\d+')
Out[4]: ['\\', 'd', '+']
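The same check can be done as a round trip. This sketch uses the standard library's `json` module instead of simplejson, on the assumption that both escape backslashes the same way here, and counts characters rather than relying on how the string is displayed.

```python
import json

pattern = r'\d+'                        # three characters: \, d, +
encoded = json.dumps({'title': pattern})
decoded = json.loads(encoded)['title']

print(encoded)                          # {"title": "\\d+"}
print(encoded.count('\\'))              # 2, JSON escapes the backslash
print(len(decoded), decoded)            # 3 \d+
```

The JSON text carries two backslashes because JSON has its own escaping; after loading, the value is back to a single backslash, and the doubled form only appears in the repr.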
