simplejson - encoding regexp \d+ - python

I have a misunderstanding about how a regexp is encoded:
>>> simplejson.dumps({'title':r'\d+'})
'{"title": "\\\\d+"}'
>>> simplejson.loads('{"title": "\\\\d+"}')
{u'title': u'\\d+'}
>>> print simplejson.loads('{"title": "\\\\d+"}')['title']
\d+
So without print I see \\, and with print I see \. Which one does the loaded dict actually contain: \\ or \?

Here is a trick: Use list to see what characters are really in the string:
In [3]: list(u'\\d+')
Out[3]: [u'\\', u'd', u'+']
list breaks up the string into individual characters. So u'\\' is one character. (The double backslash in u'\\' is an escape sequence.) It represents one backslash character. This is correct since r'\d+' also has only one backslash:
In [4]: list(r'\d+')
Out[4]: ['\\', 'd', '+']
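The same distinction can be checked directly. The following is a minimal sketch in Python 3 that uses the standard-library json module in place of simplejson (an assumption made so that it runs without extra packages; both expose dumps and loads):

```python
import json

original = {"title": r"\d+"}      # the value is 3 characters: \ d +

encoded = json.dumps(original)    # JSON text escapes the backslash as \\
decoded = json.loads(encoded)     # decoding restores the single backslash
title = decoded["title"]

print(repr(title))   # repr shows the escaped form: '\\d+'
print(title)         # print shows the actual characters: \d+
print(len(title))    # 3
```

The doubled backslash only ever appears in the escaped display (the repr) and in the JSON text; the decoded value holds one backslash.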

Related

Regex: parsing differently if character is escaped

Given this string "foo-bar=369,337,234,123", I'm able to parse it to ['foo-bar', '369', '337', '234', '123'] with this regular expression:
re.findall(r'[a-zA-Z0-9\-_\+;]+', 'foo-bar=369,337,234,123')
Now, if I escape some of the , in the string, e.g. "foo-bar=369\,337\,234,123", I would like it to be parsed a bit differently: ['foo-bar', '369\,337\,234', '123']. I tried the below regex but it doesn't work:
r'[a-zA-Z0-9\-_\+;(\\,)]+'
basically trying to add the sequence of characters \, to the list of characters to match.
You may use
[a-zA-Z0-9_+;-]+(?:\\,[a-zA-Z0-9_+;-]+)*
See the regex demo
If you pass re.A or re.ASCII to re.compile, you may shorten it to
[\w+;-]+(?:\\,[\w+;-]+)*
Regex details
[\w+;-]+ - one or more word, +, ; or - chars
(?:\\,[\w+;-]+)* - 0 or more occurrences of a \, substring followed with 1+ word, +, ; or - chars.
Python demo:
import re
strings = [r'foo-bar=369,337,234,123', r'foo-bar=369\,337\,234,123']
rx = re.compile(r"[\w+;-]+(?:\\,[\w+;-]+)*", re.A)
for s in strings:
    print(f"Parsing {s}")
    print(rx.findall(s))
Output:
Parsing foo-bar=369,337,234,123
['foo-bar', '369', '337', '234', '123']
Parsing foo-bar=369\,337\,234,123
['foo-bar', '369\\,337\\,234', '123']
Note the double backslashes here, inside string literals, denote a single literal backslash.
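An alternative, not taken from the answer above, is to split on the separators instead of matching the tokens. The sketch below assumes that = and any comma not preceded by a backslash are the only separators, and uses a negative lookbehind for the comma. It does not handle an escaped backslash directly before a comma.

```python
import re

# Split on "=" or on a comma that is NOT preceded by a backslash.
separator = re.compile(r"=|(?<!\\),")

plain = r"foo-bar=369,337,234,123"
escaped = r"foo-bar=369\,337\,234,123"

plain_parts = separator.split(plain)
escaped_parts = separator.split(escaped)

print(plain_parts)
print(escaped_parts)
```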

regex match&replace for raw string

I don't understand the following behavior of re.sub. Can anybody explain how the strings are processed in re.sub? Why doesn't statement 2 match and replace? Thanks.
1. >>> re.sub(r'\$abc', 'ABC', r'\$abcdefg')
'\\ABCdefg'
2. >>> re.sub(r'\\$abc', 'ABC', r'\\$abcdefg')
'\\\\$abcdefg'
3. >>> r'\\$abc' in r'\\$abcdefg'
True
4. >>> re.sub(r'\\\$abc', 'ABC', r'\\\$abcdefg')
'\\\\ABCdefg'
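It is because in the pattern (first argument) a backslash escapes the character that follows it, so the pair \\ in r'\\$abc' matches just one literal backslash.
The $ after it is then left unescaped, so the regex engine treats it as the end-of-string anchor rather than a literal dollar sign, and the pattern can never match. (Statement 3 is True because in is a plain substring test, not a regex match.)
To make the regex match two literal backslashes followed by a literal $, write each backslash as \\ and escape the $:
>>> re.sub(r'\\\\\$abc', 'ABC', r'\\\$abcdefg')
'\\ABCdefg'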
Reference: re.sub
Hope it helps.
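To check how such a pattern is interpreted, and to avoid escaping by hand, re.escape can build the pattern from the literal text. A minimal sketch (the variable names are only illustrative):

```python
import re

text = r"\\$abcdefg"      # two backslashes, then $abcdefg
literal = r"\\$abc"       # the exact substring we want to replace

# Used directly as a pattern, "$" is an end-of-string anchor, so nothing matches.
unchanged = re.sub(literal, "ABC", text)

# re.escape() escapes the backslashes and the "$" so they match literally.
pattern = re.escape(literal)
replaced = re.sub(pattern, "ABC", text)

print(unchanged == text)
print(replaced)
```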

Why do Python regex strings sometimes work without using raw strings?

Python recommends using raw strings when defining regular expressions in the re module. From the Python documentation:
Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.
However, in many cases this is not necessary, and you get the same result whether you use a raw string or not:
$ ipython
In [1]: import re
In [2]: m = re.search("\s(\d)\s", "a 3 c")
In [3]: m.groups()
Out[3]: ('3',)
In [4]: m = re.search(r"\s(\d)\s", "a 3 c")
In [5]: m.groups()
Out[5]: ('3',)
Yet in some cases the results differ:
In [6]: m = re.search("\s(.)\1\s", "a 33 c")
In [7]: m.groups()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-12-84a8d9c174e2> in <module>()
----> 1 m.groups()
AttributeError: 'NoneType' object has no attribute 'groups'
In [8]: m = re.search(r"\s(.)\1\s", "a 33 c")
In [9]: m.groups()
Out[9]: ('3',)
And you must escape the special characters when not using a raw string:
In [10]: m = re.search("\\s(.)\\1\\s", "a 33 c")
In [11]: m.groups()
Out[11]: ('3',)
My question is why do the non-escaped, non-raw regex strings work at all with special characters (as in command [2] above)?
The example above works because \s and \d are not escape sequences in Python. According to the docs:
Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the string. 
But it's best to just use raw strings and not worry about what is or isn't a Python escape, or about having to change it later if you change the regex.
It is because \s and \d are not escape sequences:
>>> print('\s')
\s
>>> print('\d')
\d
>>>
So, they are treated literally as \s and \d. \1 however is an escape sequence:
>>> print('\1')
☺
>>>
This means that it is being interpreted as ☺ instead of \1.
For a complete list of Python's escape sequences, see String and Bytes literals in the documentation.
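The effect on the regex from the question can be verified with a short sketch. The non-raw pattern below doubles the backslashes for \s on purpose, so that the only difference left is the \1 escape:

```python
import re

# "\1" in a normal literal is the octal escape for the character U+0001.
assert "\1" == "\x01" and len("\1") == 1
# In a raw literal the backslash survives, so the regex engine sees a backreference.
assert len(r"\1") == 2

text = "a 33 c"
broken = re.search("\\s(.)\1\\s", text)   # pattern contains U+0001, not a backreference
working = re.search(r"\s(.)\1\s", text)   # pattern contains the backreference \1

print(broken)            # None
print(working.groups())  # ('3',)
```

Recent Python 3 versions also emit a warning for unrecognized escapes such as '\s' in non-raw literals, which is one more reason to prefer raw strings for patterns.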

Python regex to replace double backslash with single backslash

I'm trying to replace all double backslashes with just a single backslash. I want to replace 'class=\\"highlight' with 'class=\"highlight'. I thought that python treats '\\' as one backslash and r'\\+' as a string with two backslashes. But when I try
In [5]: re.sub(r'\\+', '\\', string)
sre_constants.error: bogus escape (end of line)
So I tried switching the replace string with a raw string:
In [6]: re.sub(r'\\+', r'\\', string)
Out [6]: 'class=\\"highlight'
Which isn't what I need. So I tried only one backslash in the raw string:
In [7]: re.sub(r'\\+', r'\', string)
SyntaxError: EOL while scanning string literal
Why not use string.replace()?
>>> s = 'some \\\\ doubles'
>>> print s
some \\ doubles
>>> print s.replace('\\\\', '\\')
some \ doubles
Or with "raw" strings:
>>> s = r'some \\ doubles'
>>> print s
some \\ doubles
>>> print s.replace('\\\\', '\\')
some \ doubles
Since the backslash is the escape character, you still need to escape it in a normal string literal so that it does not escape the closing '.
You only have one backslash in string:
>>> string = 'class=\\"highlight'
>>> print string
class=\"highlight
Now let's put another one in there:
>>> string = 'class=\\\\"highlight'
>>> print string
class=\\"highlight
and then remove it again:
>>> print re.sub('\\\\\\\\', r'\\', string)
class=\"highlight
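The approaches above can be compared in one check. This is a minimal sketch in Python 3 syntax; it counts backslashes instead of relying on how the interpreter displays them, and the lambda replacement is an addition here (not from the answers) that sidesteps escape processing in the replacement template:

```python
import re

s = 'class=\\\\"highlight'                # contains TWO backslashes

via_replace = s.replace('\\\\', '\\')     # plain string replacement
via_template = re.sub(r'\\+', r'\\', s)   # r'\\' in a replacement template means one backslash
via_function = re.sub(r'\\+', lambda m: '\\', s)  # a function's return value is used verbatim

for result in (via_replace, via_template, via_function):
    print(result, result.count('\\'))
```

All three leave exactly one backslash. This suggests the value shown at Out [6] in the question may already have contained a single backslash, displayed in its escaped repr form.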
Just use .replace() twice!
I had the following path: C:\\Users\\XXXXX\\Desktop\\PMI APP GIT\\pmi-app\\niton x5 test data
To convert each \\ to a single backslash, I just did the following:
path_to_file = path_to_file.replace('\\','*')
path_to_file = path_to_file.replace('**', '\\')
The first operation replaces every backslash with *, so each \\ becomes **; the second replaces each ** with a single \. (This assumes the path contains no * of its own.)
Result:
C:**Users**z0044wmy**Desktop**PMI APP GIT**pmi-app**GENERATED_REPORTS
C:\Users\z0044wmy\Desktop\PMI APP GIT\pmi-app\GENERATED_REPORTS
I just realized that this might be the simplest answer:
import os
os.getcwd()
In the interactive interpreter, the above shows the path with \\ (each backslash doubled), because the interpreter displays the repr of the string.
BUT if you wrap it in a print call, i.e.,
print(os.getcwd())
it will output single backslashes, so you can then copy and paste the path into an address bar!

Python regex - r prefix

Can anyone explain why example 1 below works, when the r prefix is not used?
I thought the r prefix must be used whenever escape sequences are used.
Example 2 and example 3 demonstrate this.
# example 1
import re
print (re.sub('\s+', ' ', 'hello there there'))
# prints 'hello there there' - not expected as r prefix is not used
# example 2
import re
print (re.sub(r'(\b\w+)(\s+\1\b)+', r'\1', 'hello there there'))
# prints 'hello there' - as expected as r prefix is used
# example 3
import re
print (re.sub('(\b\w+)(\s+\1\b)+', '\1', 'hello there there'))
# prints 'hello there there' - as expected as r prefix is not used
Because a \ begins an escape sequence only when what follows it forms a valid escape sequence.
>>> '\n'
'\n'
>>> r'\n'
'\\n'
>>> print '\n'
>>> print r'\n'
\n
>>> '\s'
'\\s'
>>> r'\s'
'\\s'
>>> print '\s'
\s
>>> print r'\s'
\s
Unless an 'r' or 'R' prefix is present, escape sequences in strings are interpreted according to rules similar to those used by Standard C. The recognized escape sequences are:
Escape Sequence    Meaning                                                         Notes
\newline           Ignored
\\                 Backslash (\)
\'                 Single quote (')
\"                 Double quote (")
\a                 ASCII Bell (BEL)
\b                 ASCII Backspace (BS)
\f                 ASCII Formfeed (FF)
\n                 ASCII Linefeed (LF)
\N{name}           Character named name in the Unicode database (Unicode only)
\r                 ASCII Carriage Return (CR)
\t                 ASCII Horizontal Tab (TAB)
\uxxxx             Character with 16-bit hex value xxxx (Unicode only)
\Uxxxxxxxx         Character with 32-bit hex value xxxxxxxx (Unicode only)
\v                 ASCII Vertical Tab (VT)
\ooo               Character with octal value ooo
\xhh               Character with hex value hh
Never rely on raw strings for path literals, as raw strings have some rather peculiar inner workings, known to have bitten people in the ass:
When an "r" or "R" prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string. For example, the string literal r"\n" consists of two characters: a backslash and a lowercase "n". String quotes can be escaped with a backslash, but the backslash remains in the string; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote; r"\" is not a valid string literal (even a raw string cannot end in an odd number of backslashes). Specifically, a raw string cannot end in a single backslash (since the backslash would escape the following quote character). Note also that a single backslash followed by a newline is interpreted as those two characters as part of the string, not as a line continuation.
To better illustrate this last point:
>>> r'\'
SyntaxError: EOL while scanning string literal
>>> r'\''
"\\'"
>>> '\'
SyntaxError: EOL while scanning string literal
>>> '\''
"'"
>>>
>>> r'\\'
'\\\\'
>>> '\\'
'\\'
>>> print r'\\'
\\
>>> print r'\'
SyntaxError: EOL while scanning string literal
>>> print '\\'
\
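For Windows-style path literals, two common ways around the trailing-backslash restriction are sketched below. The path C:\Users\demo and the file names are made-up examples, and ntpath (the Windows flavor of os.path) is used only so that the sketch behaves the same on any platform:

```python
import ntpath

base = r"C:\Users\demo"

# Option 1: append the trailing backslash from a normal (non-raw) literal.
with_separator = base + "\\"

# Option 2: let the path module insert the separators.
joined = ntpath.join(base, "data", "report.txt")

# r'\' really is rejected by the compiler; the source text is built here
# as a string so that this example itself stays valid Python.
try:
    compile("r'\\'", "<demo>", "eval")
    raw_single_backslash_ok = True
except SyntaxError:
    raw_single_backslash_ok = False

print(with_separator)
print(joined)
print(raw_single_backslash_ok)
```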
The 'r' means that what follows is a "raw string", i.e. backslash characters are treated literally instead of signifying special treatment of the following character.
http://docs.python.org/reference/lexical_analysis.html#literals
so '\n' is a single newline
and r'\n' is two characters - a backslash and the letter 'n'
another way to write it would be '\\n' because the first backslash escapes the second
an equivalent way of writing this
print (re.sub(r'(\b\w+)(\s+\1\b)+', r'\1', 'hello there there'))
is
print (re.sub('(\\b\\w+)(\\s+\\1\\b)+', '\\1', 'hello there there'))
Because of the way Python treats characters that are not valid escape characters, not all of those double backslashes are necessary, e.g. '\s'=='\\s'; however, the same is not true for '\b' and '\\b'. My preference is to be explicit and double all the backslashes.
Not all sequences involving backslashes are escape sequences. \t and \f are, for example, but \s is not. In a non-raw string literal, any \ that is not part of an escape sequence is seen as just another \:
>>> "\s"
'\\s'
>>> "\t"
'\t'
\b is an escape sequence, however, so example 3 fails. (And yes, some people consider this behaviour rather unfortunate.)
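A short check of that last point, as a sketch: in a normal literal \b is the backspace character, so the pattern only works when the backslashes actually reach the regex engine, either through raw strings or through doubling. The third call spells out explicitly the characters that example 3 ends up passing.

```python
import re

text = "hello there there"

# "\b" in a normal literal is one character: backspace (U+0008).
assert "\b" == "\x08"

raw = re.sub(r"(\b\w+)(\s+\1\b)+", r"\1", text)
doubled = re.sub("(\\b\\w+)(\\s+\\1\\b)+", "\\1", text)

# What example 3 effectively passes to re.sub: backspace and U+0001
# characters instead of the regex tokens \b and \1.
broken = re.sub("(\x08\\w+)(\\s+\x01\x08)+", "\x01", text)

print(raw)      # hello there
print(doubled)  # hello there
print(broken)   # hello there there
```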
Try this (each comment shows what printing a displays):
a = '\''    # '
a = r'\''   # \'
a = "\'"    # '
a = r"\'"   # \'
Check below example:
print r"123\n123"
#outputs>>>
123\n123
print "123\n123"
#outputs>>>
123
123
