Apostrophe within Python lookbehind assertion - python

I'm trying to use a Python regular expression to get the first token of a character-separated string. I don't want to treat backslashed separators as real separators, so I'm using a negative lookbehind assertion. When the separator is a comma, it works without problem.
>>> import re
>>> re.match("(.*?)(?<!\\\\),.*", "Hello\, world!,This is a comma separated string,Third value").groups(1)[0]
'Hello\\, world!'
Whereas the exact same code, with the comma replaced by an apostrophe, does not work at all.
>>> import re
>>> re.match("(.*?)(?<!\\\\)'.*", "Hello\' world!'This is an apostrophe separated string'Third value").groups(1)[0]
'Hello'
>>>
I'm using Python 2.7.2, but I see the same behavior with Python 3 (tested on Ideone). The Python re documentation does not indicate that ' is a special character, so I'm really wondering: why is my ' treated differently?
(Please, no comments: Who would want to have an apostrophe-separated file. Well... I do...)

print(repr("\'"),repr("\,"))
Results in:
"'" '\\,'
As you can see, "\'" doesn't actually have a backslash in it. Hence, when you change it to "\\'" in the subject string, the pattern matches, producing:
Hello\' world!
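A minimal sketch of the same point, assuming Python 3 and shortened sample strings that are not from the question: the first literal loses its backslash when Python parses it, while the raw literal keeps it, so only the second gives the lookbehind something to reject.

```python
import re

# Lazily capture everything up to the first apostrophe not preceded by a backslash.
pattern = r"(.*?)(?<!\\)'.*"

no_backslash = "Hello\' world!'second'third"     # \' collapses to a plain '
with_backslash = r"Hello\' world!'second'third"  # raw literal keeps the backslash

assert len("\'") == 1  # the escape sequence is a single character

first_plain = re.match(pattern, no_backslash).group(1)
first_escaped = re.match(pattern, with_backslash).group(1)

print(first_plain)    # Hello
print(first_escaped)  # Hello\' world!
```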
"\'" is actually an escape sequence:
\' Single quote (')
That is also why
>>> ord("\'") == ord("'")
True
holds: "\'" is equivalent to "'". It makes sense that \' is an escape sequence, since it lets you put a single quote inside a single-quoted string:
>>> 'i\'ll'
"i'll"

Related

Raw Strings, Python and re, Normal vs Special Characters

I'm encountering confusing and seemingly contradictory rules regarding raw strings. Consider the following example:
>>> text = 'm\n'
>>> match = re.search('m\n', text)
>>> print match.group()
m
>>> print text
m
This works, which is fine.
>>> text = 'm\n'
>>> match = re.search(r'm\n', text)
>>> print match.group()
m
>>> print text
m
Again, this works. But shouldn't this throw an error, because the raw string contains the characters m\n and the actual text contains a newline?
>>> text = r'm\n'
>>> match = re.search(r'm\n', text)
>>> print match.group()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> print text
m\n
The above, surprisingly, throws an error, even though both are raw strings. This means both contain just the text m\n with no newlines.
>>> text = r'm\n'
>>> match = re.search(r'm\\n', text)
>>> print text
m\n
>>> print match.group()
m\n
The above works, surprisingly. Why do I have to escape the backslash in the re.search, but not in the text itself?
Then there's backslash with normal characters that have no special behavior:
>>> text = 'm\&'
>>> match = re.search('m\&', text)
>>> print text
m\&
>>> print match.group()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
This doesn't match, even though both the pattern and the string lack special characters.
In this situation, no combination of raw strings works (text as a raw string, patterns as a raw string, both or none).
However, consider the last example. Escaping in the text variable, 'm\\&', doesn't work, but escaping in the pattern does. This parallels the behavior above--even stranger, I feel, considering that \& is of no special meaning to either Python or re:
>>> text = 'm\&'
>>> match = re.search(r'm\\&', text)
>>> print text
m\&
>>> print match.group()
m\&
My understanding of raw strings is that they inhibit the behavior of the backslash in python. For regular expressions, this is important because it allows re.search to apply its own internal backslash behavior, and prevent conflicts with Python. However, in situations like the above, where backslash effectively means nothing, I'm not sure why it seems necessary. Worse yet, I don't understand why I need to backslash for the pattern, but not the text, and when I make both a raw string, it doesn't seem to work.
The docs don't provide much guidance in this regard. They focus on examples with obvious problems, such as '\section', where \s is a meta-character. Looking for a complete answer to prevent unanticipated behavior such as this.
In the regular Python string, 'm\n', the \n represents a single newline character, whereas in the raw string r'm\n' the \ and n are just themselves. So far, so simple.
If you pass the string 'm\n' as a pattern to re.search(), you're passing a two-character string (m followed by newline), and re will happily go and find instances of that two-character string for you.
If you pass the three-character string r'm\n', the re module itself will interpret the two characters \ n as having the special meaning "match a newline character", so that the whole pattern means "match an m followed by a newline", just as before.
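As a small sketch of the two cases just described (assuming Python 3; the variable names are illustrative only), both patterns end up matching the same two characters even though the pattern strings themselves differ in length:

```python
import re

text = "m\n"             # two characters: 'm' and a newline

literal_pattern = "m\n"  # Python turns \n into a newline before re sees it
raw_pattern = r"m\n"     # re receives backslash + n and reads it as "newline"

print(len(literal_pattern), len(raw_pattern))  # 2 3

literal_hit = re.search(literal_pattern, text).group()
raw_hit = re.search(raw_pattern, text).group()
print(literal_hit == raw_hit == text)  # True
```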
In your third example, since the string r'm\n' doesn't contain a newline, there's no match:
>>> text = r'm\n'
>>> match = re.search(r'm\n', text)
>>> print(match)
None
With the pattern r'm\\n', you're passing two actual backslashes to re.search(), and again, the re module itself is interpreting the double backslash as "match a single backslash character".
In the case of 'm\&', something slightly different is going on. Python treats the backslash as a regular character, because it isn't part of an escape sequence. re, on the other hand, simply discards the \, so the pattern is effectively m&. You can see that this is true by testing the pattern against 'm&':
>>> re.search('m\&', 'm&').group()
'm&'
As before, doubling the backslash tells re to search for an actual backslash character:
>>> re.search(r'm\\&', 'm\&').group()
'm\\&'
... and just to make things a little more confusing, the single backslash is represented by Python doubled. You can see that it's actually a single backslash by printing it:
>>> print(re.search(r'm\\&', 'm\&').group())
m\&
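The m\& cases can be checked side by side. This is a sketch assuming Python 3, where writing '\&' in an ordinary literal triggers a warning about an invalid escape, so the subject is spelled "m\\&" and the patterns are raw strings; the variable names are illustrative.

```python
import re

subject = "m\\&"           # three characters: m, backslash, &

pattern_single = r"m\&"    # re reads \& as a plain '&'
pattern_double = r"m\\&"   # re reads \\ as one literal backslash

miss = re.search(pattern_single, subject)        # pattern means "m&"; subject is "m\&"
plain_hit = re.search(pattern_single, "m&").group()
backslash_hit = re.search(pattern_double, subject).group()

print(miss)           # None
print(plain_hit)      # m&
print(backslash_hit)  # m\&
```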
To explain it in simple terms, \<character> has a special meaning in regular expressions. For example \s for whitespace characters, \d for decimal digits, \n for new-line characters, etc.
When you define a string as
s = 'foo\n'
This string contains the characters f, o, o and the new-line character (length 4).
However, when defining a raw string:
s = r'foo\n'
This string contains the characters f, o, o, \ and n (length 5).
When you compile a regexp with raw \n (i.e. r'\n'), it'll match all new lines. Similarly, just using the new-line character (i.e. '\n') it's going to match new-line characters just like a matches a and so on.
Once you understand this concept, you should be able to figure out the rest.
To elaborate a bit further. In order to match the back-slash character \ using regex, the valid regular expression is \\ which in Python would be r'\\' or its equivalent '\\\\'.
text = r'm\n'
match = re.search(r'm\\n', text)
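In the first line, the r prefix stops Python from interpreting \n as a single newline character, so text holds m, a backslash and n.
In the second line, the r prefix plays the same role, and the doubled backslash \\ stops the regex engine from reading \n as a newline; it matches a literal backslash instead. The regex engine also uses \ for classes such as \s and \d.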
The following characters are the meta characters that give special meaning to the regular expression search syntax:
\ the backslash escape character.
The backslash gives special meaning to the character following it. For example, the combination "\n" stands for the newline, one of the control characters. The combination "\w" stands for a "word" character, one of the convenience escape sequences while "\1" is one of the substitution special characters.
Example: The regex "aa\n" tries to match two consecutive "a"s at the end of a line, including the newline character itself.
Example: "a\+" matches the literal text "a+" and not a series of one or more "a"s.
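To see the difference the backslash makes in front of a metacharacter, here is a short sketch (assuming Python 3; the sample text is made up for illustration):

```python
import re

sample = "aaa a+"

# Unescaped +: "one or more of the preceding character".
runs_of_a = re.findall(r"a+", sample)

# Escaped \+: a literal plus sign.
literal_a_plus = re.findall(r"a\+", sample)

print(runs_of_a)       # ['aaa', 'a']
print(literal_a_plus)  # ['a+']
```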
To understand the internal representation of the strings you're confused about, I'd recommend using the repr and len built-in functions. With those you can see exactly what each string contains, which removes the guesswork from pattern matching. For instance, say you want to analyze the strings you're having trouble with:
use_cases = [
    'm\n',
    r'm\n',
    'm\\n',
    r'm\\n',
    'm\&',
    r'm\&',
    'm\\&',
    r'm\\&',
]
for u in use_cases:
    print('-' * 10)
    print(u, repr(u), len(u))
The output would be:
----------
m
'm\n' 2
----------
m\n 'm\\n' 3
----------
m\n 'm\\n' 3
----------
m\\n 'm\\\\n' 4
----------
m\& 'm\\&' 3
----------
m\& 'm\\&' 3
----------
m\& 'm\\&' 3
----------
m\\& 'm\\\\&' 4
So you can see exactly the differences between normal/raw strings.

Removing wrapped line returns [duplicate]

This question already has answers here:
How can I put an actual backslash in a string literal (not use it for an escape sequence)?
(4 answers)
Closed 7 months ago.
I want to remove the line returns of a text that is wrapped to a certain width. e.g.
import re
x = 'the meaning\nof life'
re.sub("([,\w])\n(\w)", "\1 \2", x)
'the meanin\x01 \x02f life'
I want to return the meaning of life. What am I doing wrong?
You need to escape the backslashes, like this:
>>> import re
>>> x = 'the meaning\nof life'
>>> re.sub("([,\w])\n(\w)", "\1 \2", x)
'the meanin\x01 \x02f life'
>>> re.sub("([,\w])\n(\w)", "\\1 \\2", x)
'the meaning of life'
>>> re.sub("([,\w])\n(\w)", r"\1 \2", x)
'the meaning of life'
>>>
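If you don't escape it, Python reads \1 in the ordinary string literal as an octal escape (the character with code 1) before re.sub ever sees it: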
>>> '\1'
'\x01'
>>>
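The three spellings of the replacement string can be compared directly. A minimal sketch, assuming Python 3 (the names plain, escaped and raw are illustrative):

```python
import re

plain = "\1 \2"      # octal escapes: chr(1), a space, chr(2)
escaped = "\\1 \\2"  # backslash-1, a space, backslash-2
raw = r"\1 \2"       # same characters as `escaped`

assert plain == chr(1) + " " + chr(2)
assert escaped == raw

x = "the meaning\nof life"
broken = re.sub(r"([,\w])\n(\w)", plain, x)
fixed = re.sub(r"([,\w])\n(\w)", raw, x)

print(repr(broken))  # 'the meanin\x01 \x02f life'
print(fixed)         # the meaning of life
```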
That's why we need to use '\\\\' or r'\\' to match a single \ in a Python regex.
On that point, from this answer:
If you're putting this in a string within a program, you may actually need to use four backslashes (because the string parser will remove two of them when "de-escaping" it for the string, and then the regex needs two for an escaped regex backslash).
And from the documentation:
As stated earlier, regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This conflicts with Python's usage of the same character for the same purpose in string literals.
Let's say you want to write a RE that matches the string \section, which might be found in a LaTeX file. To figure out what to write in the program code, start with the desired string to be matched. Next, you must escape any backslashes and other metacharacters by preceding them with a backslash, resulting in the string \\section. The resulting string that must be passed to re.compile() must be \\section. However, to express this as a Python string literal, both backslashes must be escaped again.
Alternatively, as brittenb suggested, you don't need a regex in this case:
>>> x = 'the meaning\nof life'
>>> x.replace("\n", " ")
'the meaning of life'
>>>
Use raw string literals; both Python string literal syntax and regex interpret backslashes; \1 in a python string literal is interpreted as an octal escape, but not in a raw string literal:
re.sub(r"([,\w])\n(\w)", r"\1 \2", x)
The alternative would be to double all backslashes so that they reach the regex engine as such.
See the Backslash plague section of the Python regex HOWTO.
Demo:
>>> import re
>>> x = 'the meaning\nof life'
>>> re.sub(r"([,\w])\n(\w)", r"\1 \2", x)
'the meaning of life'
It might be easier just to split on newlines; use the str.splitlines() method, then re-join with spaces using str.join():
' '.join(x.splitlines())
but admittedly this won't distinguish between newlines between words and extra newlines elsewhere.
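The caveat about splitlines() can be illustrated with a sketch, assuming Python 3 and a made-up input that contains a blank line between paragraphs. The regex only joins a newline that sits between two word characters, while the split-and-join approach flattens every line break:

```python
import re

wrapped = "the meaning\nof life\n\nnext paragraph"

# Joins only newlines that sit directly between word characters (or after a comma).
regex_joined = re.sub(r"([,\w])\n(\w)", r"\1 \2", wrapped)

# Joins every line, including the empty one that marked the paragraph break.
split_joined = " ".join(wrapped.splitlines())

print(repr(regex_joined))  # 'the meaning of life\n\nnext paragraph'
print(repr(split_joined))  # 'the meaning of life  next paragraph'
```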

python regex search pattern

I'm searching a block of text for a newline followed by a period.
pat = '\n\.'
block = 'Some stuff here. And perhaps another sentence here.\n.Some more text.'
For some reason when I use regex to search for my pattern it changes the value of pat (using Python 2.7).
import re
mysrch = re.search(pat, block)
Now the value of pat has been changed to:
'\n\\.'
Which is messing with the next search that I use pat for. Why is this happening, and how can I avoid it?
Thanks very much in advance.
The extra slash isn't actually part of the string - the string itself hasn't changed at all.
Here's an example:
>>> pat = '\n\.'
>>> pat
'\n\\.'
>>> print pat
\.
As you can see, when you print pat, it's only got one \ in it. When you dump the value of a string it uses the __repr__ function which is designed to show you unambiguously what is in the string, so it shows you the escaped version of characters. Like \n is the escaped version of a newline, \\ is the escaped version of \.
Your regex is probably not matching how you expect because it has an actual newline character in it, not the literal string "\n" (as a repr: "\\n").
You should either make your regex a raw string (as suggested in the comments).
>>> pat = r"\n\."
>>> pat
'\\n\\.'
>>> print pat
\n\.
Or you could just escape the slashes and use
pat = "\\n\\."
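A quick way to confirm that the search leaves the pattern untouched is to compare it before and after. This is a sketch assuming Python 3; the pattern is written as "\n\\." so that it holds the same three characters as the question's '\n\.' without relying on an unrecognized escape.

```python
import re

pat = "\n\\."  # three characters: newline, backslash, period
block = "Some stuff here. And perhaps another sentence here.\n.Some more text."

before = pat
mysrch = re.search(pat, block)

print(pat == before)    # True: the string is unchanged
print(len(pat))         # 3
print(repr(pat))        # '\n\\.'  (repr doubles the single backslash for display)
print(repr(mysrch.group()))  # '\n.'
```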

python replace single backslash with double backslash [duplicate]

This question already has answers here:
How can I put an actual backslash in a string literal (not use it for an escape sequence)?
(4 answers)
Closed 7 months ago.
In Python, I am trying to replace a single backslash ("\") with a double backslash ("\\"). I have the following code:
directory = string.replace("C:\Users\Josh\Desktop\20130216", "\", "\\")
However, this gives an error message saying it doesn't like the double backslash. Can anyone help?
No need to use str.replace or string.replace here, just convert that string to a raw string:
>>> strs = r"C:\Users\Josh\Desktop\20130216"
           ^
           |
       notice the 'r'
Below is the repr version of the above string, that's why you're seeing \\ here.
But, in fact the actual string contains just '\' not \\.
>>> strs
'C:\\Users\\Josh\\Desktop\\20130216'
>>> s = r"f\o"
>>> s #repr representation
'f\\o'
>>> len(s) #length is 3, as there's only one `'\'`
3
But when you're going to print this string you'll not get '\\' in the output.
>>> print strs
C:\Users\Josh\Desktop\20130216
If you want the string to show '\\' during print then use str.replace:
>>> new_strs = strs.replace('\\','\\\\')
>>> print new_strs
C:\\Users\\Josh\\Desktop\\20130216
repr version will now show \\\\:
>>> new_strs
'C:\\\\Users\\\\Josh\\\\Desktop\\\\20130216'
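The same steps as a script, as a sketch assuming Python 3 (where print is a function); counting the backslashes makes it clear what the replacement actually changed:

```python
path = r"C:\Users\Josh\Desktop\20130216"

# "\\" is one backslash, "\\\\" is two: replace each single backslash with a pair.
doubled = path.replace("\\", "\\\\")

print(path)     # C:\Users\Josh\Desktop\20130216
print(doubled)  # C:\\Users\\Josh\\Desktop\\20130216
print(path.count("\\"), doubled.count("\\"))  # 4 8
```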
Let me make it simple and clear. Lets use the re module in python to escape the special characters.
Python script :
import re
s = "C:\Users\Josh\Desktop"
print s
print re.escape(s)
Output :
C:\Users\Josh\Desktop
C:\\Users\\Josh\\Desktop
Explanation :
Observe that re.escape adds a backslash before each backslash in the given string, so the printed result shows doubled backslashes, the desired output. Keep in mind that re.escape is meant for building regex patterns: it escapes other special characters as well, and exactly which ones depends on the Python version.
Hope this helps you.
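For context, re.escape is designed to produce a pattern that matches its input literally, which is a slightly different goal from doubling backslashes. A sketch assuming Python 3 and a raw-string path (the log line is made up for illustration):

```python
import re

path = r"C:\Users\Josh\Desktop"
pattern = re.escape(path)  # every backslash is escaped; other characters may be too

# The escaped pattern matches the original text exactly.
whole = re.fullmatch(pattern, path)

# It can also be embedded in a larger pattern safely.
log_line = r"opened C:\Users\Josh\Desktop at noon"
found = re.search(pattern + r" at (\w+)", log_line)

print(whole is not None)  # True
print(found.group(1))     # noon
```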
Use escape characters: "full\\path\\here", "\\" and "\\\\"
In python \ (backslash) is used as an escape character. What this means that in places where you wish to insert a special character (such as newline), you would use the backslash and another character (\n for newline)
With your example string, you would notice that when you put "C:\Users\Josh\Desktop\20130216" in the REPL you get "C:\\Users\\Josh\\Desktop\x8130216". This is because \201 is read as an octal escape sequence in a Python string (up to three octal digits may follow the backslash). If you wish to specify a literal \ then you need to put two, \\, in your string.
"C:\\Users\\Josh\\Desktop\\20130216"
The other option is to notify python that your entire string must NOT use \ as an escape character by pre-pending the string with r
r"C:\Users\Josh\Desktop\20130216"
This is a "raw" string, and very useful in situations where you need to use lots of backslashes such as with regular expression strings.
In case you still wish to replace that single \ with \\ you would then use:
directory = string.replace(r"C:\Users\Josh\Desktop\20130216", "\\", "\\\\")
Notice that I am not using r' in the last two strings above. This is because, when you use the r' form of strings you cannot end that string with a single \
Why can't Python's raw string literals end with a single backslash?
https://pythonconquerstheuniverse.wordpress.com/2008/06/04/gotcha-%E2%80%94-backslashes-are-escape-characters/
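The restriction on raw strings ending in a backslash can be checked without breaking the script itself by compiling the offending literal from text. This is a sketch assuming Python 3; bad_source and folder are illustrative names.

```python
# The source text of a raw literal that tries to end in one backslash: r"abc\"
bad_source = 'r"abc' + "\\" + '"'

try:
    compile(bad_source, "<sketch>", "eval")
    raw_ending_ok = True
except SyntaxError:
    # The backslash keeps the closing quote from ending the literal.
    raw_ending_ok = False

print(raw_ending_ok)  # False

# Common workaround: append the trailing backslash from an ordinary literal.
folder = r"C:\Users\Josh" + "\\"
print(folder)  # C:\Users\Josh\
```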
That may be a syntax error in your case.
You could change the line to:
directory = str(r"C:\Users\Josh\Desktop\20130216").replace('\\','\\\\')
which gives the following output when printed:
C:\\Users\\Josh\\Desktop\\20130216
The backslash introduces an escape sequence. Therefore, directory = path_to_directory.replace("\", "\\") would cause Python to think that the first argument to replace didn't end until the starting quotation mark of the second argument, since it read the closing quotation mark as an escaped character. Escape the backslashes instead:
directory=path_to_directory.replace("\\","\\\\")
Given the source string, manipulation with os.path might make more sense, but here's a string solution;
>>> s=r"C:\Users\Josh\Desktop\\20130216"
>>> '\\\\'.join(filter(bool, s.split('\\')))
'C:\\\\Users\\\\Josh\\\\Desktop\\\\20130216'
Note that split treats the \\ in the source string as a delimited empty string. Using filter gets rid of those empty strings so join won't double the already doubled backslashes. Unfortunately, if you have 3 or more, they get reduced to doubled backslashes, but I don't think that hurts you in a windows path expression.
You could use
os.path.abspath(path_with_backslash)
It returns the path with \ separators.
Use:
string.replace(r"C:\Users\Josh\Desktop\20130216", "\\", "\\\\")
Escape the \ character: "\\" is one backslash and "\\\\" is two.

Adding backslashes without escaping [duplicate]

This question already has answers here:
Why do backslashes appear twice?
(2 answers)
Closed 8 years ago.
I need to escape a & (ampersand) character in a string. The problem is whenever I string = string.replace ('&', '\&') the result is '\\&'. An extra backslash is added to escape the original backslash. How do I remove this extra backslash?
The result '\\&' is only displayed - actually the string is \&:
>>> str = '&'
>>> new_str = str.replace('&', '\&')
>>> new_str
'\\&'
>>> print new_str
\&
Try it in a shell.
The extra backslash is not actually added; it's just added by the repr() function to indicate that it's a literal backslash. The Python interpreter uses the repr() function (which calls __repr__() on the object) when the result of an expression needs to be printed:
>>> '\\'
'\\'
>>> print '\\'
\
>>> print '\\'.__repr__()
'\\'
Python treats \ in literal string in a special way.
This is so you can type '\n' to mean newline or '\t' to mean tab
Since '\&' doesn't mean anything special to Python, instead of causing an error, the Python lexical analyser leaves the backslash in the string, just as if you had written '\\&'.
Really it is better to use '\\&' or r'\&' instead of '\&'.
The r here means raw string and means that \ isn't treated specially unless it is right before the quote character at the start of the string.
In the interactive console, Python uses repr to display the result, so that is why you see the double '\'. If you print your string or use len(string) you will see that it is really only the 2 characters
Some examples
>>> 'Here\'s a backslash: \\'
"Here's a backslash: \\"
>>> print 'Here\'s a backslash: \\'
Here's a backslash: \
>>> 'Here\'s a backslash: \\. Here\'s a double quote: ".'
'Here\'s a backslash: \\. Here\'s a double quote: ".'
>>> print 'Here\'s a backslash: \\. Here\'s a double quote: ".'
Here's a backslash: \. Here's a double quote: ".
To Clarify the point Peter makes in his comment see this link
Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the string. (This behavior is useful when debugging: if an escape sequence is mistyped, the resulting output is more easily recognized as broken.) It is also important to note that the escape sequences marked as “(Unicode only)” in the table above fall into the category of unrecognized escapes for non-Unicode string literals.
>>> '\\&' == '\&'
True
>>> len('\\&')
2
>>> print('\\&')
\&
Or in other words: '\\&' only contains one backslash. It's just escaped in the python shell's output for clarity.
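The replacement from the question can be checked the same way. A sketch assuming Python 3, with the replacement written as "\\&" so that it does not depend on an unrecognized escape:

```python
original = "a & b"
escaped = original.replace("&", "\\&")  # insert one backslash before each &

print(repr(escaped))  # 'a \\& b'  (repr shows the backslash doubled)
print(escaped)        # a \& b
print(len(escaped) - len(original))  # 1: exactly one character was added
```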
Printing a list can also cause this confusion (I'm new to Python, so it confused me a bit too):
>>> myList = ['\\']
>>> print myList
['\\']
>>> print ''.join(myList)
\
Similarly:
>>> myList = ['\&']
>>> print myList
['\\&']
>>> print ''.join(myList)
\&
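The reason lists behave this way is that converting a list to text uses repr() on each element. A sketch assuming Python 3 (the list contents are illustrative):

```python
items = ["\\", "\\&"]  # one backslash; a backslash followed by &

# str() of a list is built from the repr() of each element.
as_text = str(items)
rebuilt = "[" + ", ".join(repr(item) for item in items) + "]"

print(as_text)             # ['\\', '\\&']
print(as_text == rebuilt)  # True
print("".join(items))      # \\&  (two backslashes in total, one from each element)
```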
There is no extra backslash, it's just formatted that way in the interactive environment. Try:
print string
Then you can see that there really is no extra backslash.
