Python re: why does [,-_] match "="? - python

I'm observing the following behaviour in python 2.7.5:
>>> import re
>>> re.match(r'[,-_]', '=') # This matches
<_sre.SRE_Match object at 0x7f24d4981308>
>>> re.match(r'[-,_]', '=') # This doesn't match
>>> re.match(r'[-_,]', '=') # Nor does this
Can someone explain what I'm seeing here? I can't seem to find anything about ,-_ being special in python regexes (or raw strings for that matter).

This is the same idiom as in [A-Z] which matches everything from A to Z. In this case, it will match everything from , (ASCII #44) to _ (ASCII #95), which includes = (ASCII #61).
See the full ASCII table.

Because the hyphen (-) defines a range inside a character class, and = lies between , and _ in the ASCII table. Escape it so that the regex engine treats it as a literal hyphen, like so: r'[,\-_]'. Placing the hyphen first or last in the class, as in r'[-,_]', has the same effect. A raw string only stops the Python interpreter from processing backslash escapes; the regex engine still parses the pattern and applies its own special characters, which is why you need to escape them.
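The range behaviour described above can be checked directly. This is a minimal sketch written for Python 3 (the question used 2.7, where the result is the same); the variable names are only illustrative.

```python
import re

# Every character covered by the range ,-_ (code points 44 through 95).
covered = "".join(chr(c) for c in range(ord(","), ord("_") + 1))
print(covered)

range_class = re.compile(r"[,-_]")     # hyphen between two characters: a range
escaped_class = re.compile(r"[,\-_]")  # escaped hyphen: a literal "-"
leading_class = re.compile(r"[-,_]")   # hyphen placed first: also literal

# For each class, report whether it matches "=" and whether it matches "-".
for name, pattern in [("range", range_class),
                      ("escaped", escaped_class),
                      ("leading", leading_class)]:
    print(name, bool(pattern.match("=")), bool(pattern.match("-")))
```

Only the first class matches "=", while all three match "-", because the hyphen itself (code point 45) happens to fall inside the 44 to 95 range.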

Related

Python Regex - Matching mixed Unicode and ASCII characters in a string

I've tried in several different ways and none of them work.
Suppose I have a string s defined as follows:
s = '[မန္း],[aa]'.decode('utf-8')
Suppose I want to parse the two strings within the square brackets. I've compiled the following regex:
pattern = re.compile(r'\[(\w+)\]', re.UNICODE)
and then I look for occurrences using:
pattern.findall(s, re.UNICODE)
The result is basically just [] instead of the expected list of two matches. Furthermore if I remove the re.UNICODE from the findall call I get the single string [u'aa'], i.e. the non-unicode one:
pattern.findall(s)
Of course
s = '[bb],[aa]'.decode('utf-8')
pattern.findall(s)
returns [u'bb', u'aa']
And to make things even more interesting:
s = '[မနbb],[aa]'.decode('utf-8')
pattern.findall(s)
returns [u'\u1019\u1014bb', u'aa']
It's actually rather simple. \w matches all alphanumeric characters and not all of the characters in your initial string are alphanumeric.
If you still want to match all characters between the brackets, one solution is to match everything but a closing bracket (]). This can be made as
import re
s = '[မန္း],[aa]'.decode('utf-8')
pattern = re.compile('\[([^]]+)\]', re.UNICODE)
re.findall(pattern, s)
where [^]] is a negated character class: it matches any single character except the ones listed after the circumflex (^), here just the closing bracket.
Also, note that the re.UNICODE argument to re.compile is not necessary here: the flag only changes the meaning of \w, \W, \b, \B, \d, \D, \s and \S, and this pattern uses none of them.
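One likely reason the question's pattern.findall(s, re.UNICODE) call returned an empty list is that the second positional argument of a compiled pattern's findall() is pos, not flags. The sketch below, written for Python 3, shows that effect next to the negated-class pattern; the string is spelled with escapes so it does not depend on the file encoding.

```python
import re

# The string from the question, written with escapes (11 characters long).
s = "[\u1019\u1014\u1039\u1038],[aa]"
word_pattern = re.compile(r"\[(\w+)\]")

# pos is the index where scanning starts. re.UNICODE has the integer
# value 32, which is past the end of this string, so nothing is found.
with_flag_as_pos = word_pattern.findall(s, int(re.UNICODE))
without_pos = word_pattern.findall(s)

# The negated class matches anything up to the closing bracket.
bracket_pattern = re.compile(r"\[([^]]+)\]")
all_items = bracket_pattern.findall(s)

print(with_flag_as_pos)
print(without_pos)
print(all_items)
```

Flags belong in re.compile() (or in the module-level re.findall(pattern, string, flags)), never in the method call on an already compiled pattern.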
First, note that the following only works in Python 2.x if you've saved the source file in UTF-8 encoding, and you declare the source code encoding at the top of the file; otherwise, the default encoding of the source is assumed to be ascii:
#coding: utf8
s = '[မန္း],[aa]'.decode('utf-8')
A shorter way to write it is to code a Unicode string directly:
#coding: utf8
s = u'[မန္း],[aa]'
Next, \w matches alphanumeric characters. With the re.UNICODE flag it matches characters that are categorized as alphanumeric in the Unicode database.
Not all of the characters in မန္း are alphanumeric. If you want whatever is between the brackets, use something like the following. Note the use of .*? for a non-greedy match of everything. It's also a good habit to use Unicode strings for all text, and raw strings in particular for regular expressions.
#coding:utf8
import re
s = u'[မန္း],[aa],[မနbb]'
pattern = re.compile(ur'\[(.*?)\]')
print re.findall(pattern,s)
Output:
[u'\u1019\u1014\u1039\u1038', u'aa', u'\u1019\u1014bb']
Note that Python 2 displays an unambiguous version of the strings in lists with escape codes for non-ASCII and non-printable characters.
To see the actual string content, print the strings, not the list:
for item in re.findall(pattern,s):
    print item
Output:
မန္း
aa
မနbb
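To see which characters of the first item fail \w, one can inspect each code point. This is a small Python 3 sketch; it assumes the string consists of the four code points shown in the output above.

```python
import re
import unicodedata

word = "\u1019\u1014\u1039\u1038"

# Print the code point, its Unicode category, and whether \w accepts it.
for ch in word:
    print("U+%04X" % ord(ch), unicodedata.category(ch),
          bool(re.fullmatch(r"\w", ch)))

word_char_flags = [bool(re.fullmatch(r"\w", ch)) for ch in word]

# The non-greedy .*? does not care about categories at all.
non_greedy = re.findall(r"\[(.*?)\]", "[%s],[aa]" % word)
print(word_char_flags, non_greedy)
```

The first two code points are letters, while the last two are combining marks, which \w does not treat as alphanumeric. That is why \w+ stops partway through the word.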

How to write a regular expression in Python that accepts alphabets, numbers and a few selected special characters(,.-|;!_?)?

I need a regular expression in Python that accepts letters, numbers and only these special characters (,.-|;!_?).
I have tried solving the problem with the following regular expressions, but they didn't work:
'([a-zA-Z0-9,.-|;!_?])$'
'([a-zA-Z0-9][.-|;!_?])$'
Can someone please help me write the regular expression?
I think the following should work (tested on RegExr against Foo123,.-|;!_?):
^[\w,.\-|;!_?]*$
In your regular expressions, you forgot to escape the '-' character, so it is interpreted as defining a range of characters to match against.
Use this for only one character:
'[a-zA-Z0-9,.\-|;!_?]' or '[\w,.\-|;!_?]'
Use this for all characters:
'[a-zA-Z0-9,.\-|;!_?]*' or '[\w,.\-|;!_?]*'
Use this for an equal check:
'^[a-zA-Z0-9,.\-|;!_?]*$' or '^[\w,.\-|;!_?]*$'
Try this (you should escape - like this \-):
^[a-zA-Z0-9,.\-|;!_?]+$
The + prevents matching empty strings; to allow them, use * instead.
Examples:
>>> import re
>>>
>>> re.match('^[a-zA-Z0-9,.\-|;!_?]+$', '12.0')
<_sre.SRE_Match object at 0x00000000027EB850>
>>> re.match('^[a-zA-Z0-9,.\-|;!_?]+$', '')
>>>
>>> re.match('^[a-zA-Z0-9,.\-|;!_?]+$', 'test!?')
<_sre.SRE_Match object at 0x00000000027EB7E8>
You could use \w (bonus: unicode and locale support!):
matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]
See Python's documentation. Also, you might want to use a raw string when specifying your regular expression pattern:
m = re.match(r'[\w,.\-|;!?]+', your_string)
Notice the use of + (repeat once or more). You also used $ to match the end of the string but I did not include it in mine. YMMV.
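The effect of the unescaped hyphen is easy to see by comparing the two classes on a few inputs. This is a minimal Python 3 sketch; the sample strings are made up for illustration.

```python
import re

buggy = re.compile(r"^[a-zA-Z0-9,.-|;!_?]+$")   # .-| is a range from "." to "|"
fixed = re.compile(r"^[a-zA-Z0-9,.\-|;!_?]+$")  # \- is a literal hyphen

samples = ["Foo123,.-|;!_?", "a@b", "x<y>", "a-b", ""]

# Map each sample to (accepted by buggy, accepted by fixed).
results = {text: (bool(buggy.match(text)), bool(fixed.match(text)))
           for text in samples}

for text, (buggy_ok, fixed_ok) in results.items():
    print(repr(text), buggy_ok, fixed_ok)
```

The unescaped version fails in both directions: it accepts characters such as @, < and > that sit between "." and "|" in ASCII, and it rejects the hyphen itself, which sits just below ".".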

Regular expression including and excluding characters

I have the following regular expression that almost works fine.
WORD_REGEXP = re.compile(r"[a-zA-Zá-úÁ-Úñ]+")
It includes lower and upper case letters with and without an accent plus the Spanish letter «ñ». Unfortunately, it also includes (I don't know why) characters that are also used in Spanish like «¡» or «¿» which I would like to remove as well.
In a line like ¡España, olé! I would like to extract just España and olé, by means of the regular expression.
How can I exclude these two characters («¿», «¡») in the regular expression?
According to stribizhe, the regex itself seems to be OK, so the problem must lie elsewhere. Here is the full Python code:
import re
linea = "¡Arriba Éspáña, ¿olé!"
WORD_REGEXP = re.compile(r"([a-zA-Zá-úÁ-Úñ]+)", re.UNICODE)
palabras = WORD_REGEXP.findall(linea)
for pal in palabras:
    pal = unicode(pal,'latin1').encode('latin1', 'replace')
    print pal
The result is the following:
¡Arriba
Éspáña
¿olé
Use the special sequence '\w', according to documentation:
If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
Note, however that your string must be a unicode string:
import re
linea = u"¡Arriba Éspáña, ¿olé!"
regex = re.compile(r"\w+", re.UNICODE)
regex.findall(linea)
# [u'Arriba', u'\xc9sp\xe1\xf1a', u'ol\xe9']
NOTE: The cause of your error is that your pattern is a byte string containing UTF-8 encoded bytes, e.g.:
Your pattern r'([a-zA-Zá-úÁ-Úñ]+)' is not defined as a unicode string, so it is saved as UTF-8 by your text editor and read by Python as '([a-zA-Z\xc3\xa1-\xc3\xba\xc3\x81-\xc3\x9a\xc3\xb1]+)'. Note the repeated \xc3, which is the UTF-8 lead byte of each accented character.
You can confirm that by printing the repr of WORD_REGEXP.pattern. Viewed byte by byte, the actual pattern used by the re module is:
patt = r"([a-zA-Zá-úÁ-Úñ]+)"
print patt.decode('latin1')
Or:
a-z
A-Z
\xc3
\xa1-\xc3
\xba
\xc3
\x81-\xc3
\x9a
\xc3
\xb1
Simplifying it, you are actually using pattern
a-zA-Z\x81-\xc3
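That last range covers a lot of byte values, including the \xc2 lead byte and the \xa1 and \xbf continuation bytes that make up «¡» and «¿» in UTF-8.
It's better to use code points. The code points for those characters are
¡ - \xa1
¿ - \xbf
which fall outside the ranges of your accented characters. With a unicode pattern and a unicode subject string, the class becomes:
[a-zA-Z\xe1-\xfa\xc1-\xda\xf1]+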
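The difference between matching on text and matching on UTF-8 bytes can be reproduced in Python 3, where the two are separate types. This sketch only approximates the Python 2 situation: encoding the pattern to bytes stands in for the non-unicode literal in the question.

```python
import re

linea = "¡Arriba Éspáña, ¿olé!"
source = "[a-zA-Zá-úÁ-Úñ]+"

# Text pattern on text: each accented letter is one code point,
# so the ranges mean what they appear to mean.
text_words = re.findall(source, linea)

# Byte pattern on bytes: each accented letter becomes two bytes,
# and the hyphens now form ranges between individual bytes.
byte_pattern = re.compile(source.encode("utf-8"))
byte_words = [m.decode("utf-8")
              for m in byte_pattern.findall(linea.encode("utf-8"))]

print(text_words)
print(byte_words)
```

The byte version reproduces the unwanted «¡Arriba» and «¿olé», while the text version returns only the words.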

python regex search pattern

I'm searching a block of text for a newline followed by a period.
pat = '\n\.'
block = 'Some stuff here. And perhaps another sentence here.\n.Some more text.'
For some reason when I use regex to search for my pattern it changes the value of pat (using Python 2.7).
import re
mysrch = re.search(pat, block)
Now the value of pat has been changed to:
'\n\\.'
Which is messing with the next search that I use pat for. Why is this happening, and how can I avoid it?
Thanks very much in advance.
The extra slash isn't actually part of the string - the string itself hasn't changed at all.
Here's an example:
>>> pat = '\n\.'
>>> pat
'\n\\.'
>>> print pat

\.
As you can see, when you print pat, it has only one \ in it. When you display the value of a string at the interactive prompt, Python uses the __repr__ function, which is designed to show you unambiguously what is in the string, so it shows the escaped version of characters. Just as \n is the escaped version of a newline, \\ is the escaped version of \.
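Your pattern contains an actual newline character rather than the two-character sequence \n (as a repr: "\\n"). The re module matches a newline either way, so if a search fails the cause is elsewhere, but it is still a good habit to keep every backslash intact for the regex engine.
You can either make your regex a raw string (as suggested in the comments):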
>>> pat = r"\n\."
>>> pat
'\\n\\.'
>>> print pat
\n\.
Or you could just escape the slashes and use
pat = "\\n\\."

How to remove escape sequence like '\xe2' or '\x0c' in python

I am working on a project (content based search), for that I am using 'pdftotext' command line utility in Ubuntu which writes all the text from pdf to some text file.
But it also writes bullets. Now, when I read the file to index each word, some escape sequences get indexed too (like '\x01'). I know it's because of the bullets (•).
I want only text, so is there any way to remove these escape sequences? I have done something like this:
escape_char = re.compile('\+x[0123456789abcdef]*')
re.sub(escape_char, " ", string)
But this does not remove the escape sequences.
Thanks in advance.
The problem is that \xXX is just a representation of a control character, not the character itself. Therefore, you can't literally match \x unless you're working with the repr of the string.
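The distinction between a control character and its printed representation can be made concrete. This is a small Python 3 sketch with illustrative variable names.

```python
import re

control = "\x01"        # one character: the control character U+0001
spelled_out = r"\x01"   # four characters: backslash, x, 0, 1

# A pattern for a literal backslash followed by x and hex digits.
escape_text = re.compile(r"\\x[0-9a-f]+")

print(len(control), len(spelled_out))
print(escape_text.search(control))        # no backslash in the data
print(escape_text.search(spelled_out))    # matches the spelled-out text
print(escape_text.search(repr(control)))  # matches only in the repr
```

A pattern that looks for a backslash finds nothing in the real data, because the backslash exists only in the way Python displays the character.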
You can remove nonprintable characters using a character class:
re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', text)
Example:
>>> re.sub(r'[\x00-\x1f\x7f-\xff]', '', ''.join(map(chr, range(256))))
' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~'
Your only real problem is that backslashes are tricky. In a string, a backslash might be treated specially; for example \t would turn into a tab. Since \+ isn't special in strings, the string was actually what you expected. So then the regular expression compiler looked at it, and \+ in a regular expression would just be a plain + character. Normally the + has a special meaning ("1 or more instances of the preceding pattern") and the backslash escapes it.
The solution is just to double the backslash, which makes a pattern that matches a single backslash.
I put the pattern into r'', to make it a "raw string" where Python leaves backslashes alone. If you don't do that, Python's string parser will turn the two backslashes into a single backslash; just as \t turns into a tab, \\ turns into a single backslash. So, use a raw string and put exactly what you want the regular expression compiler to see.
Also, a better pattern would be: backslash, then an x, then 1 or more instances of the character class matching a hex character. I rewrote the pattern to this.
import re
s = r'+\x01+'
escape_char = re.compile(r'\\x[0123456789abcdef]+')
s = re.sub(escape_char, " ", s)
Instead of using a raw string, you could use a normal string and just be very careful with backslashes. In this case we would have to put four backslashes! The string parser would turn each doubled backslash into a single backslash, and we want the regular expression compiler to see two backslashes. It's easier to just use the raw string!
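The equivalence of the two spellings can be verified directly. A minimal Python 3 sketch:

```python
import re

raw_form = r"\\x[0123456789abcdef]+"       # two backslashes, left alone
plain_form = "\\\\x[0123456789abcdef]+"    # four backslashes, halved by the parser

same_pattern = raw_form == plain_form
replaced = re.sub(plain_form, " ", r"+\x01+")

print(same_pattern, repr(replaced))
```

Both literals produce the identical pattern string, so the choice between them is purely about readability.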
Also, your original pattern would remove zero or more hex digits. My pattern removes one or more. But I think it is likely that there will always be exactly two hex digits, or perhaps with Unicode maybe there will be four. You should figure out how many there can be and put a pattern that ensures this. Here's a pattern that matches 2, 3, or 4 hex digits:
escape_char = re.compile(r'\\x[0123456789abcdef]{2,4}')
And here is one that matches exactly two or exactly four. We have to use a vertical bar to make two alternatives, and we need to make a group with parentheses. I'm using a non-capturing group here, with (?:pattern) instead of just (pattern) (where pattern means a pattern, not literally the word pattern).
escape_char = re.compile(r'\\x(?:[0123456789abcdef]{2,2}|[0123456789abcdef]{4,4})')
Here is example code. The bullet sequence is immediately followed by a 1 character, and this pattern leaves it alone.
import re
s = r'+\x011+'
pat = re.compile(r'\\x(?:[0123456789abcdef]{2,2}|[0123456789abcdef]{4,4})')
s = pat.sub("#", s)
print("Result: '%s'" % s)
This prints: Result: '+#1+'
NOTE: all of this is assuming that you actually are trying to match a backslash character followed by hex chars. If you are actually trying to match character byte values that might or might not be "printable" chars, then use the answer by @nneonneo instead of this one.
If you're working with 8-bit char values, it's possible to forgo regexes by building some simple tables beforehand and then using them in conjunction with the str.translate() method to remove unwanted characters from strings very quickly and easily:
import random
import string
allords = [i for i in xrange(256)]
allchars = ''.join(chr(i) for i in allords)
printableords = [ord(ch) for ch in string.printable]
deletechars = ''.join(chr(i) for i in xrange(256) if i not in printableords)
test = ''.join(chr(random.choice(allords)) for _ in xrange(10, 40)) # random string
print test.translate(allchars, deletechars)
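The snippet above is Python 2, where str holds bytes. A rough Python 3 equivalent of the same idea works on bytes objects instead; the helper name strip_nonprintable is made up for this sketch.

```python
import string

# Byte values that count as printable ASCII according to string.printable.
printable = set(string.printable.encode("ascii"))

# Every other byte value is scheduled for deletion.
delete_bytes = bytes(b for b in range(256) if b not in printable)

def strip_nonprintable(data):
    """Return data with all non-printable byte values removed."""
    # A table of None means no remapping, only deletion.
    return data.translate(None, delete_bytes)

# A UTF-8 encoded bullet (three bytes) and a control byte are dropped.
cleaned = strip_nonprintable(b"item \xe2\x80\xa2 one\x01")
print(cleaned)
```

Note that string.printable includes whitespace such as tab, newline and form feed, so those bytes survive; adjust the set if they should be removed as well.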
I don't have enough reputation to comment, but the accepted answer also removes printable non-ASCII characters.
s = "pörféct änßwer"
re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', s)
'prfct nwer'
For non-English strings, please use answer https://stackoverflow.com/a/62530464/3021668
import unicodedata
''.join(c for c in s if not unicodedata.category(c).startswith('C'))
'pörféct änßwer'
