I'm searching a block of text for a newline followed by a period.
pat = '\n\.'
block = 'Some stuff here. And perhaps another sentence here.\n.Some more text.'
For some reason when I use regex to search for my pattern it changes the value of pat (using Python 2.7).
import re
mysrch = re.search(pat, block)
Now the value of pat has been changed to:
'\n\\.'
Which is messing with the next search that I use pat for. Why is this happening, and how can I avoid it?
Thanks very much in advance.
The extra slash isn't actually part of the string - the string itself hasn't changed at all.
Here's an example:
>>> pat = '\n\.'
>>> pat
'\n\\.'
>>> print pat
\.
As you can see, when you print pat, it's only got one \ in it. When you dump the value of a string it uses the __repr__ function which is designed to show you unambiguously what is in the string, so it shows you the escaped version of characters. Like \n is the escaped version of a newline, \\ is the escaped version of \.
Your regex is probably not matching how you expect because it has an actual newline character in it, not the literal string "\n" (as a repr: "\\n").
You should either make your regex a raw string (as suggested in the comments).
>>> pat = r"\n\."
>>> pat
'\\n\\.'
>>> print pat
\n\.
Or you could just escape the slashes and use
pat = "\\n\\."
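To convince yourself that re.search leaves the pattern alone, here is a minimal sketch (written so it also runs on Python 3; the names snapshot and raw_pat are mine, not from the question). It compares the string before and after the search and shows that only the repr display carries the doubled backslash.

```python
import re

# '\n\\.' is the same 3-character string as the question's '\n\.':
# newline, backslash, period.
pat = '\n\\.'
block = 'Some stuff here. And perhaps another sentence here.\n.Some more text.'

snapshot = pat
match = re.search(pat, block)

print(pat == snapshot)          # True: searching did not modify the pattern
print(len(pat))                 # 3
print(repr(pat))                # '\n\\.' (repr doubles the one real backslash)
print(match.group() == '\n.')   # True: newline followed by a period

# The raw-string spelling has 4 characters and lets re interpret \n itself.
raw_pat = r'\n\.'
print(len(raw_pat))                                  # 4
print(re.search(raw_pat, block).group() == '\n.')    # True
```

Both spellings find the same text here; the raw string just makes it easier to see what the regex engine receives.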
Related
I have a list of strings and I want to print out the ones that don't match the regex but I'm having some trouble. The regex seems to match strings that it should not, if there is a substring that starts at the beginning of the string that matches the regex. I'm not sure how to fix this.
Example
>>> import re
>>> pattern = re.compile(r'\d+')
>>> string = u"1+*"
>>> bool(pattern.match(string))
True
I get true because of the 1 at the start. How should I change my regex to account for this?
Note I'm on python 2.6.6
Have your regex start with \A and end with \Z. This will make sure that the match begins at the start of the input string, and also make sure that the match ends at the end of the input string.
So for the example you gave, it would look like:
pattern = re.compile(r'\A\d+\Z')
You should append \Z to the end of the regex, so the regex pattern is '\d+\Z'.
Your code then becomes:
>>> import re
>>> pattern = re.compile(r'\d+\Z')
>>> string = u"1+*"
>>> bool(pattern.match(string))
False
This works because \Z forces matching at only the end of the string. You may also use $, which forces a match at a newline before the end of the string or at the end of the string. If you would like to force the string to only contain numeric values (irrelevant if using re.match, but maybe useful if using other regular expression libraries), you may add a ^ to the front of the pattern, forcing a match at the start of the string. The pattern would then be '^\d+\Z'.
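A small sketch comparing the unanchored pattern with the \Z and $ variants may make the difference concrete. The variable names and the extra test strings are assumptions added for illustration.

```python
import re

unanchored = re.compile(r'\d+')
end_z = re.compile(r'\d+\Z')       # must end exactly at the end of the string
end_dollar = re.compile(r'\d+$')   # may also end just before a trailing newline

results = {}
for s in (u'1+*', u'123', u'123\n'):
    results[s] = (
        bool(unanchored.match(s)),
        bool(end_z.match(s)),
        bool(end_dollar.match(s)),
    )
    print(repr(s), results[s])
```

Only the \Z version rejects the string with a trailing newline, which is usually what you want when validating input.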
I'm observing the following behaviour in python 2.7.5:
>>> import re
>>> re.match(r'[,-_]', '=') # This matches
<_sre.SRE_Match object at 0x7f24d4981308>
>>> re.match(r'[-,_]', '=') # This doesn't match
>>> re.match(r'[-_,]', '=') # Nor does this
Can someone explain what I'm seeing here? I can't seem to find anything about ,-_ being special in python regexes (or raw strings for that matter).
This is the same idiom as in [A-Z] which matches everything from A to Z. In this case, it will match everything from , (ASCII #44) to _ (ASCII #95), which includes = (ASCII #61).
See the full ASCII table.
Because the hyphen (-) defines a range, and = lies between , and _ in the ASCII table. You need to escape the hyphen so that the regex engine treats it as a literal, like so: r'[,\-_]'. A raw string only stops the Python interpreter from processing backslash escapes; the regex engine still interprets the pattern, which is why special characters need escaping there.
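Here is a short sketch of the range behaviour and of the usual ways to get a literal hyphen into a character class. The variable names are mine, and the last pattern (hyphen at the end of the class) is an extra variant not shown in the question.

```python
import re

# Every character from ',' (44) to '_' (95), which is what [,-_] covers.
in_range = ''.join(chr(code) for code in range(ord(','), ord('_') + 1))
print('=' in in_range)   # True

outcomes = {}
for pattern in (r'[,-_]', r'[,\-_]', r'[-,_]', r'[,_-]'):
    outcomes[pattern] = (
        bool(re.match(pattern, '=')),   # should only match for the range form
        bool(re.match(pattern, '-')),   # a literal hyphen matches in all four
    )
    print(pattern, outcomes[pattern])
```

Escaping the hyphen, or placing it first or last in the class, all make it a literal character.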
I'm encountering confusing and seemingly contradictory rules regarding raw strings. Consider the following example:
>>> text = 'm\n'
>>> match = re.search('m\n', text)
>>> print match.group()
m
>>> print text
m
This works, which is fine.
>>> text = 'm\n'
>>> match = re.search(r'm\n', text)
>>> print match.group()
m
>>> print text
m
Again, this works. But shouldn't this throw an error, because the raw string contains the characters m\n and the actual text contains a newline?
>>> text = r'm\n'
>>> match = re.search(r'm\n', text)
>>> print match.group()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> print text
m\n
The above, surprisingly, throws an error, even though both are raw strings. This means both contain just the text m\n with no newlines.
>>> text = r'm\n'
>>> match = re.search(r'm\\n', text)
>>> print text
m\n
>>> print match.group()
m\n
The above works, surprisingly. Why do I have to escape the backslash in the re.search, but not in the text itself?
Then there's backslash with normal characters that have no special behavior:
>>> text = 'm\&'
>>> match = re.search('m\&', text)
>>> print text
m\&
>>> print match.group()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
This doesn't match, even though both the pattern and the string lack special characters.
In this situation, no combination of raw strings works (text as a raw string, patterns as a raw string, both or none).
However, consider the last example. Escaping in the text variable, 'm\\&', doesn't work, but escaping in the pattern does. This parallels the behavior above--even stranger, I feel, considering that \& is of no special meaning to either Python or re:
>>> text = 'm\&'
>>> match = re.search(r'm\\&', text)
>>> print text
m\&
>>> print match.group()
m\&
My understanding of raw strings is that they inhibit the behavior of the backslash in python. For regular expressions, this is important because it allows re.search to apply its own internal backslash behavior, and prevent conflicts with Python. However, in situations like the above, where backslash effectively means nothing, I'm not sure why it seems necessary. Worse yet, I don't understand why I need to backslash for the pattern, but not the text, and when I make both a raw string, it doesn't seem to work.
The docs don't provide much guidance in this regard. They focus on examples with obvious problems, such as '\section', where \s is a meta-character. Looking for a complete answer to prevent unanticipated behavior such as this.
In the regular Python string, 'm\n', the \n represents a single newline character, whereas in the raw string r'm\n' the \ and n are just themselves. So far, so simple.
If you pass the string 'm\n' as a pattern to re.search(), you're passing a two-character string (m followed by newline), and re will happily go and find instances of that two-character string for you.
If you pass the three-character string r'm\n', the re module itself will interpret the two characters \ n as having the special meaning "match a newline character", so that the whole pattern means "match an m followed by a newline", just as before.
In your third example, since the string r'm\n' doesn't contain a newline, there's no match:
>>> text = r'm\n'
>>> match = re.search(r'm\n', text)
>>> print(match)
None
With the pattern r'm\\n', you're passing two actual backslashes to re.search(), and again, the re module itself is interpreting the double backslash as "match a single backslash character".
In the case of 'm\&', something slightly different is going on. Python treats the backslash as a regular character, because it isn't part of an escape sequence. re, on the other hand, simply discards the \, so the pattern is effectively m&. You can see that this is true by testing the pattern against 'm&':
>>> re.search('m\&', 'm&').group()
'm&'
As before, doubling the backslash tells re to search for an actual backslash character:
>>> re.search(r'm\\&', 'm\&').group()
'm\\&'
... and just to make things a little more confusing, Python's repr displays the single backslash doubled. You can see that it's actually a single backslash by printing it:
>>> print(re.search(r'm\\&', 'm\&').group())
m\&
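The two layers of interpretation described above (first Python's string literal rules, then re's own escape rules) can be checked in one go. This is only a sketch; the cases list and its variable names are assumptions added for illustration.

```python
import re

newline_text = 'm\n'      # 2 characters: m, newline
backslash_text = 'm\\n'   # 3 characters: m, backslash, n (same as r'm\n')

# (pattern, text, should it match?)
cases = [
    ('m\n',   newline_text,   True),   # Python already made a real newline
    (r'm\n',  newline_text,   True),   # re turns backslash + n into a newline
    (r'm\n',  backslash_text, False),  # the text has no newline in it
    (r'm\\n', backslash_text, True),   # re turns \\ into one literal backslash
]

outcomes = []
for pattern, text, expected in cases:
    found = re.search(pattern, text) is not None
    outcomes.append(found)
    print(repr(pattern), repr(text), found, found == expected)
```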
To explain it in simple terms, \<character> has a special meaning in regular expressions. For example \s for whitespace characters, \d for decimal digits, \n for new-line characters, etc.
When you define a string as
s = 'foo\n'
This string contains the characters f, o, o and the new-line character (length 4).
However, when defining a raw string:
s = r'foo\n'
This string contains the characters f, o, o, \ and n (length 5).
When you compile a regexp with raw \n (i.e. r'\n'), it'll match all new lines. Similarly, just using the new-line character (i.e. '\n') it's going to match new-line characters just like a matches a and so on.
Once you understand this concept, you should be able to figure out the rest.
To elaborate a bit further. In order to match the back-slash character \ using regex, the valid regular expression is \\ which in Python would be r'\\' or its equivalent '\\\\'.
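If you would rather not count backslashes by hand, re.escape can build the pattern from the literal text for you. A minimal sketch, assuming you want to match the text exactly as it is (the variable names are mine):

```python
import re

# The two spellings of "match one backslash" are the same 2-character pattern.
print(r'\\' == '\\\\')    # True

text = 'm\\&'             # 3 characters: m, backslash, &
pattern = re.escape(text) # escapes whatever re would treat specially

match = re.search(pattern, text)
print(match.group() == text)                  # True
print(len(re.search(r'\\', text).group()))    # 1: a single backslash matched
```

The exact escaped form that re.escape returns differs between Python versions, so it is safer to rely on what it matches than on how it looks.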
text = r'm\n'
match = re.search(r'm\\n', text)
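On the first line, the r prefix stops Python from interpreting \n as a single newline character.
On the second line, the r prefix plays the same role, and the doubled backslash (\\) stops the regex engine from interpreting \n as a newline. The regex engine uses the backslash for its own escapes too, such as \s and \d.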
The following characters are the meta characters that give special meaning to the regular expression search syntax:
\ the backslash escape character.
The backslash gives special meaning to the character following it. For example, the combination "\n" stands for the newline, one of the control characters. The combination "\w" stands for a "word" character, one of the convenience escape sequences while "\1" is one of the substitution special characters.
Example: The regex "aa\n" tries to match two consecutive "a"s at the end of a line, including the newline character itself.
Example: "a\+" matches "a+" and not a series of one or more "a"s.
To understand the internal representation of the strings you're confused about, I'd recommend using the repr and len built-in functions. With those you can see exactly what each string contains, so pattern matching stops being confusing. For instance, say you want to analyze the strings you're having trouble with:
use_cases = [
'm\n',
r'm\n',
'm\\n',
r'm\\n',
'm\&',
r'm\&',
'm\\&',
r'm\\&',
]
for u in use_cases:
print('-' * 10)
print(u, repr(u), len(u))
The output would be:
----------
m
'm\n' 2
----------
m\n 'm\\n' 3
----------
m\n 'm\\n' 3
----------
m\\n 'm\\\\n' 4
----------
m\& 'm\\&' 3
----------
m\& 'm\\&' 3
----------
m\& 'm\\&' 3
----------
m\\& 'm\\\\&' 4
So you can see exactly the differences between normal/raw strings.
I'm currently writing an application that uses a framework to match certain phrases, currently it is supposed to match the following regex pattern:
Say \"(.*)\"
However, I've noticed that my users are complaining that their OS sometimes copies and pastes 'curly quotes' in, so what ends up happening is that users provide the following sentences:
Say "Hello world!" <-- Matches
Say “Hello world!” <-- Doesn't match!
Is there any way I can tell Python's regular expressions to treat these curly quotes the same as regular quotes?
Edit:
Turns out you can very easily tell Python to read your Regular Expression with a unicode string, I changed my code to the following and it worked:
u'Say (?:["“”])(.*)(?:["“”])'
# (?:["“”]) <-- Start a non-capturing group: match one of the three possible quote types, but do not return it
# (.*) <-- Start a capture group, match anything and return it
# (?:["“”]) <-- Non-capturing group again: match the closing quote without returning it
You could just include the curly quotes in the regex:
Say [\"“”](.*)[\"“”]
As something you can replicate in the Python repl, it's like this:
>>> import re
>>> test_str = r'"Hello"'
>>> reg = r'["“”](.*)["“”]'
>>> m = re.search(reg, test_str)
>>> m.group(1)
'Hello'
>>> test_str = r'“Hello world!”'
>>> m = re.search(reg, test_str)
>>> m.group(1)
'\x80\x9cHello world!\xe2\x80'
As an alternative to Kyle's answer, you can prepare the string for your current regex by replacing the curly quotes:
string.replace('“', '"').replace('”', '"')
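The garbled bytes in the REPL output further up appear because a Python 2 byte-string pattern treats each curly quote as three separate UTF-8 bytes inside the character class. Below is a sketch of both approaches using Unicode strings throughout; the escapes \u201c and \u201d are the left and right curly double quotes, and the variable names are my own.

```python
import re

# Unicode pattern: each curly quote is one character in the class.
say = re.compile(u'Say ["\u201c\u201d](.*)["\u201c\u201d]')

straight = u'Say "Hello world!"'
curly = u'Say \u201cHello world!\u201d'

captured = [say.match(sentence).group(1) for sentence in (straight, curly)]
print(captured)

# Alternative: normalise the quotes first and keep the original regex.
normalised = curly.translate({0x201c: u'"', 0x201d: u'"'})
print(normalised == straight)
```

Normalising first has the advantage that every later step sees only straight quotes.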
I am working on a project (content-based search), for which I am using the 'pdftotext' command-line utility on Ubuntu; it writes all the text from a PDF to a text file.
But it also writes bullets, so when I read the file to index each word, some escape sequences get indexed too (like '\x01'). I know it's because of the bullets (•).
I want only the text, so is there any way to remove these escape sequences? I have done something like this:
escape_char = re.compile('\+x[0123456789abcdef]*')
re.sub(escape_char, " ", string)
But this does not remove the escape sequences.
Thanks in advance.
The problem is that \xXX is just a representation of a control character, not the character itself. Therefore, you can't literally match \x unless you're working with the repr of the string.
You can remove nonprintable characters using a character class:
re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', text)
Example:
>>> re.sub(r'[\x00-\x1f\x7f-\xff]', '', ''.join(map(chr, range(256))))
' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~'
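To make the distinction between the control character and its \x01 representation concrete, here is a small sketch with two hypothetical sample strings: one containing the real control character and one containing the four literal characters backslash, x, 0, 1. The pattern names are assumptions for illustration.

```python
import re

control = 'a\x01b'     # 3 characters; the middle one is the control character
literal = 'a\\x01b'    # 6 characters: a, backslash, x, 0, 1, b

by_value = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]')  # matches the character
by_text = re.compile(r'\\x[0-9a-f]{2}')                     # matches the written-out form

print(len(control), len(literal))            # 3 6
print(repr(by_value.sub('', control)))       # 'ab'
print(by_value.sub('', literal) == literal)  # True: nothing to remove
print(by_text.sub('', control) == control)   # True: no backslash in the string
print(repr(by_text.sub('', literal)))        # 'ab'
```

Text read from a file will normally look like the first string, so the character-class pattern is the one that applies.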
Your only real problem is that backslashes are tricky. In a string, a backslash might be treated specially; for example \t would turn into a tab. Since \+ isn't special in strings, the string was actually what you expected. So then the regular expression compiler looked at it, and \+ in a regular expression would just be a plain + character. Normally the + has a special meaning ("1 or more instances of the preceding pattern") and the backslash escapes it.
The solution is just to double the backslash, which makes a pattern that matches a single backslash.
I put the pattern into r'', to make it a "raw string" where Python leaves backslashes alone. If you don't do that, Python's string parser will turn the two backslashes into a single backslash; just as \t turns into a tab, \\ turns into a single backslash. So, use a raw string and put exactly what you want the regular expression compiler to see.
Also, a better pattern would be: backslash, then an x, then 1 or more instances of the character class matching a hex character. I rewrote the pattern to this.
import re
s = r'+\x01+'
escape_char = re.compile(r'\\x[0123456789abcdef]+')
s = re.sub(escape_char, " ", s)
Instead of using a raw string, you could use a normal string and just be very careful with backslashes. In this case we would have to put four backslashes! The string parser would turn each doubled backslash into a single backslash, and we want the regular expression compiler to see two backslashes. It's easier to just use the raw string!
Also, your original pattern would remove zero or more hex digits. My pattern removes one or more. But I think it is likely that there will always be exactly two hex digits, or perhaps with Unicode maybe there will be four. You should figure out how many there can be and put a pattern that ensures this. Here's a pattern that matches 2, 3, or 4 hex digits:
escape_char = re.compile(r'\\x[0123456789abcdef]{2,4}')
And here is one that matches exactly four or exactly two. We have to use a vertical bar to make two alternatives, and we need to make a group with parentheses. I'm using a non-capturing group here, with (?:pattern) instead of just (pattern) (where pattern means a pattern, not literally the word pattern). The four-digit alternative comes first because the regex engine tries alternatives from left to right; with the two-digit alternative first, the four-digit one would never get a chance to match.
escape_char = re.compile(r'\\x(?:[0123456789abcdef]{4}|[0123456789abcdef]{2})')
Here is example code. The bullet sequence is immediately followed by a 1 character, and this pattern leaves it alone.
import re
s = r'+\x011+'
pat = re.compile(r'\\x(?:[0123456789abcdef]{4}|[0123456789abcdef]{2})')
s = pat.sub("#", s)
print("Result: '%s'" % s)
This prints: Result: '+#1+'
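One thing to watch with a two-or-four alternation is the order of the alternatives, since re tries them left to right and stops at the first one that succeeds. A quick sketch (the sample strings and pattern names are made up for illustration):

```python
import re

two_first = re.compile(r'\\x(?:[0-9a-f]{2}|[0-9a-f]{4})')
four_first = re.compile(r'\\x(?:[0-9a-f]{4}|[0-9a-f]{2})')

four_digits = '+\\x2022+'   # backslash, x, then four hex digits
three_digits = '+\\x011+'   # two hex digits followed by a plain '1'

# With the two-digit alternative first, the four-digit case is never reached.
print(two_first.sub('#', four_digits))    # +#22+
print(four_first.sub('#', four_digits))   # +#+

# Both orders agree when only two hex digits can be consumed as a pair.
print(two_first.sub('#', three_digits))   # +#1+
print(four_first.sub('#', three_digits))  # +#1+
```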
NOTE: all of this is assuming that you actually are trying to match a backslash character followed by hex chars. If you are actually trying to match character byte values that might or might not be "printable" chars, then use the answer by @nneonneo instead of this one.
If you're working with 8-bit char values, it's possible to forgo regexes by building some simple tables beforehand and then using them in conjunction with the str.translate() method to remove unwanted characters from strings very quickly and easily:
import random
import string
allords = [i for i in xrange(256)]
allchars = ''.join(chr(i) for i in allords)
printableords = [ord(ch) for ch in string.printable]
deletechars = ''.join(chr(i) for i in xrange(256) if i not in printableords)
test = ''.join(chr(random.choice(allords)) for _ in xrange(10, 40)) # random string
print test.translate(allchars, deletechars)
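The code above relies on the Python 2 two-argument form of str.translate. On Python 3 the same idea can be sketched with a deletion table that maps unwanted code points to None; the sample string and variable names here are assumptions, not part of the original answer.

```python
import string

printable = set(string.printable)

# Map every non-printable code point below 256 to None, which deletes it.
delete_table = {code: None for code in range(256) if chr(code) not in printable}

test = 'bullet\x01 text\x7f here\xff'
cleaned = test.translate(delete_table)
print(repr(cleaned))   # 'bullet text here'
```

Like the original, this only considers the first 256 code points and treats everything outside string.printable as unwanted, including accented letters.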
I don't have enough reputation to comment, but the accepted answer removes printable non-ASCII characters as well:
s = "pörféct änßwer"
re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', s)
'prfct nwer'
For non-English strings, please use answer https://stackoverflow.com/a/62530464/3021668
import unicodedata
''.join(c for c in s if not unicodedata.category(c).startswith('C'))
'pörféct änßwer'
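Note that tabs and newlines are also in a 'C' category, so the one-liner above strips them too. If you need to keep them for word splitting, a variant along these lines might help; strip_control_chars and its keep parameter are hypothetical names of mine, and the sample string is made up.

```python
import unicodedata

def strip_control_chars(text, keep=u'\t\n\r'):
    """Drop characters in Unicode 'C' categories, except those listed in keep."""
    return u''.join(
        c for c in text
        if c in keep or not unicodedata.category(c).startswith('C')
    )

# \xf6, \xe9, \xe4 and \xdf are the accented letters from the example above.
sample = u'bullet\x01 p\xf6rf\xe9ct\t\xe4n\xdfwer\x0c\n'
cleaned = strip_control_chars(sample)
print(repr(cleaned))
```

The accented letters and the tab and newline survive, while \x01 and the form feed are removed.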