how to properly use a unicode string in python regex

how to properly use a unicode string in python regex - python

I am getting an input regular expression from a user which is saved as a unicode string. Do I have to turn the input string into a raw string before compliling it as a regex object? Or is it unnecessary? Am I converting it to raw string properly?
import re
input_regex_as_unicode = u"^(.){1,36}$"
string_to_check = "342342dedsfs"
# leave as unicode
compiled_regex = re.compile(input_regex_as_unicode)
match_string = re.match(compiled_regex, string_to_check)
# convert to raw
compiled_regex = re.compile(r'' + input_regex_as_unicode)
match_string = re.match(compiled_regex, string_to_check)
#Ahsanul Haque, my question is more regular expression specific, whether the regex handles the unicode string properly when converting it into a regex object

The re module handles both unicode strings and normal strings properly, you do not need to convert them to anything (but you should be consistent in your use of strings).
There is no such a thing like "raw strings". You can use raw string notation in your code if it helps you with strings containing backslashes. For instance to match a newline character you could use '\\n', u'\\n', r'\n' or ur'\n'.
Your use of the raw string notation in your example does nothing since r'' and '' evaluate to the same string.

Related

Python Regex - Matching mixed Unicode and ASCII characters in a string

I've tried in several different ways and none of them work.
Suppose I have a string s defined as follows:
s = '[မန္း],[aa]'.decode('utf-8')
Suppose I want to parse the two strings within the square brackes. I've compiled the following regex:
pattern = re.compile(r'\[(\w+)\]', re.UNICODE)
and then I look for occurrences using:
pattern.findall(s, re.UNICODE)
The result is basically just [] instead of the expected list of two matches. Furthermore if I remove the re.UNICODE from the findall call I get the single string [u'aa'], i.e. the non-unicode one:
pattern.findall(s)
Of course
s = '[bb],[aa]'.decode('utf-8')
pattern.findall(s)
returns [u'bb', u'aa']
And to make things even more interesting:
s = '[မနbb],[aa]'.decode('utf-8')
pattern.findall(s)
returns [u'\u1019\u1014bb', u'aa']

It's actually rather simple. \w matches all alphanumeric characters and not all of the characters in your initial string are alphanumeric.
If you still want to match all characters between the brackets, one solution is to match everything but a closing bracket (]). This can be made as
import re
s = '[မန္း],[aa]'.decode('utf-8')
pattern = re.compile('\[([^]]+)\]', re.UNICODE)
re.findall(pattern, s)
where the [^]] creates a matching pattern of all characters except the ones following the circumflex (^) character.
Also, note that the re.UNICODE argument to re.compile is not necessary, since the pattern itself does not contain any unicode characters.

First, note that the following only works in Python 2.x if you've saved the source file in UTF-8 encoding, and you declare the source code encoding at the top of the file; otherwise, the default encoding of the source is assumed to be ascii:
#coding: utf8
s = '[မန္း],[aa]'.decode('utf-8')
A shorter way to write it is to code a Unicode string directly:
#coding: utf8
s = u'[မန္း],[aa]'
Next, \w matches alphanumeric characters. With the re.UNICODE flag it matches characters that are categorized as alphanumeric in the Unicode database.
Not all of the characters in မန္း are alphanumeric. If you want whatever is between the brackets, use something like the following. Note the use of .*? for a non-greedy match of everything. It's also a good habit to use Unicode strings for all text, and raw strings in particular for regular expressions.
#coding:utf8
import re
s = u'[မန္း],[aa],[မနbb]'
pattern = re.compile(ur'\[(.*?)\]')
print re.findall(pattern,s)
Output:
[u'\u1019\u1014\u1039\u1038', u'aa', u'\u1019\u1014bb']
Note that Python 2 displays an unambiguous version of the strings in lists with escape codes for non-ASCII and non-printable characters.
To see the actual string content, print the strings, not the list:
for item in re.findall(pattern,s):
print item
Output:
မန္း
aa
မနbb

What does 'r' mean before a Regex pattern?

I found the following regex substitution example from the documentation for Regex. I'm a little bit confused as to what the prefix r does before the string?
re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
... r'static PyObject*\npy_\1(void)\n{',
... 'def myfunc():')

Placing r or R before a string literal creates what is known as a raw-string literal. Raw strings do not process escape sequences (\n, \b, etc.) and are thus commonly used for Regex patterns, which often contain a lot of \ characters.
Below is a demonstration:
>>> print('\n') # Prints a newline character
>>> print(r'\n') # Escape sequence is not processed
\n
>>> print('\b') # Prints a backspace character
>>> print(r'\b') # Escape sequence is not processed
\b
>>>
The only other option would be to double every backslash:
re.sub('def\\s+([a-zA-Z_][a-zA-Z_0-9]*)\\s*\\(\\s*\\):',
... 'static PyObject*\\npy_\\1(void)\\n{',
... 'def myfunc():')
which is just tedious.

The r means that the string is to be treated as a raw string, which means all escape codes will be ignored.
The Python document says this precisely:
String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for interpreting backslash escape sequences.

Current re module docs gives explanation regarding raw-string usage
Regular expressions use the backslash character ('\') to indicate
special forms or to allow special characters to be used without
invoking their special meaning. This collides with Python’s usage of
the same character for the same purpose in string literals; for
example, to match a literal backslash, one might have to write '\\\\'
as the pattern string, because the regular expression must be \\, and
each backslash must be expressed as \\ inside a regular Python string
literal. Also, please note that any invalid escape sequences in
Python’s usage of the backslash in string literals now generate a
DeprecationWarning and in the future this will become a SyntaxError.
This behaviour will happen even if it is a valid escape sequence for a
regular expression.
The solution is to use Python’s raw string notation for regular
expression patterns; backslashes are not handled in any special way in
a string literal prefixed with 'r'. So r"\n" is a two-character string
containing '\' and 'n', while "\n" is a one-character string
containing a newline. Usually patterns will be expressed in Python
code using this raw string notation.

python re find string that may contain brackets

I am trying to search for a string that may contain brackets or other characters that may not be interpreted as plain strings.
def findstring(string, text):
match = re.search(string, text)
I do not control the string as it is derived from another module. My problem is that the string may contain "xyz)", which raises an error telling me that there are unmatched brackets.
I already tried this without success
match = re.search(r'%s' % string, text)

You can use re.escape() to escape the string:
match = re.search(re.escape(string), text)
From docs:
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.

How to remove escape sequence like '\xe2' or '\x0c' in python

I am working on a project (content based search), for that I am using 'pdftotext' command line utility in Ubuntu which writes all the text from pdf to some text file.
But it also writes bullets, now when I'm reading the file to index each word, it also gets some escape sequence indexed(like '\x01').I know its because of bullets(•).
I want only text, so is there any way to remove this escape sequence.I have done something like this
escape_char = re.compile('\+x[0123456789abcdef]*')
re.sub(escape_char, " ", string)
But this do not remove escape sequence
Thanks in advance.

The problem is that \xXX is just a representation of a control character, not the character itself. Therefore, you can't literally match \x unless you're working with the repr of the string.
You can remove nonprintable characters using a character class:
re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', text)
Example:
>>> re.sub(r'[\x00-\x1f\x7f-\xff]', '', ''.join(map(chr, range(256))))
' !"#$%&\'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~'

Your only real problem is that backslashes are tricky. In a string, a backslash might be treated specially; for example \t would turn into a tab. Since \+ isn't special in strings, the string was actually what you expected. So then the regular expression compiler looked at it, and \+ in a regular expression would just be a plain + character. Normally the + has a special meaning ("1 or more instances of the preceding pattern") and the backslash escapes it.
The solution is just to double the backslash, which makes a pattern that matches a single backslash.
I put the pattern into r'', to make it a "raw string" where Python leaves backslashes alone. If you don't do that, Python's string parser will turn the two backslashes into a single backslash; just as \t turns into a tab, \\ turns into a single backslash. So, use a raw string and put exactly what you want the regular expression compiler to see.
Also, a better pattern would be: backslash, then an x, then 1 or more instances of the character class matching a hex character. I rewrote the pattern to this.
import re
s = r'+\x01+'
escape_char = re.compile(r'\\x[0123456789abcdef]+')
s = re.sub(escape_char, " ", s)
Instead of using a raw string, you could use a normal string and just be very careful with backslashes. In this case we would have to put four backslashes! The string parser would turn each doubled backslash into a single backslash, and we want the regular expression compiler to see two backslashes. It's easier to just use the raw string!
Also, your original pattern would remove zero or more hex digits. My pattern removes one or more. But I think it is likely that there will always be exactly two hex digits, or perhaps with Unicode maybe there will be four. You should figure out how many there can be and put a pattern that ensures this. Here's a pattern that matches 2, 3, or 4 hex digits:
escape_char = re.compile(r'\\x[0123456789abcdef]{2,4}')
And here is one that matches exactly two or exactly four. We have to use a vertical bar to make two alternatives, and we need to make a group with parentheses. I'm using a non-matching group here, with (?:pattern) instead of just (pattern) (where pattern means a pattern, not literally the word pattern).
escape_char = re.compile(r'\\x(?:[0123456789abcdef]{2,2}|[0123456789abcdef]{4,4})')
Here is example code. The bullet sequence is immediately followed by a 1 character, and this pattern leaves it alone.
import re
s = r'+\x011+'
pat = re.compile(r'\\x(?:[0123456789abcdef]{2,2}|[0123456789abcdef]{4,4})')
s = pat.sub("#", s)
print("Result: '%s'" % s)
This prints: Result: '+#1+'
NOTE: all of this is assuming that you actually are trying to match a backslash character followed by hex chars. If you are actually trying to match character byte values that might or might not be "printable" chars, then use the answer by #nneonneo instead of this one.

If you're working with 8-bit char values, it's possible to forgo regex's by building some simple tables beforehand and then use them inconjunction with str.translate() method to remove unwanted characters in strings very quickly and easily:
import random
import string
allords = [i for i in xrange(256)]
allchars = ''.join(chr(i) for i in allords)
printableords = [ord(ch) for ch in string.printable]
deletechars = ''.join(chr(i) for i in xrange(256) if i not in printableords)
test = ''.join(chr(random.choice(allords)) for _ in xrange(10, 40)) # random string
print test.translate(allchars, deletechars)

not enough reputation to comment, but the accepted answer removes printable characters as well.
s = "pörféct änßwer"
re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', s)
'prfct nwer'
For non-English strings, please use answer https://stackoverflow.com/a/62530464/3021668
import unicodedata
''.join(c for c in s if not unicodedata.category(c).startswith('C'))
'pörféct änßwer'

Regular expression and unicode literals

I'd like to remove some characters from a string (either byte string or unicode string) using a regular expression like this:
pattern = re.compile(ur'\u00AE|\u2122', re.UNICODE)
If the characters are specified as unicode literals the resulting regexp does not work properly on byte string.
q = 'Canon\xc2\xae EOS 7D'
pattern.sub('', q) # 'Canon\xc2 EOS 7D'
If I convert the argument of the substitution to a unicode string, however, it works as expected...
pattern.sub('', unicode(q)) # u'Canon EOS 7D'
Can someone please explain to me why this is the case?
thanks,
Peter

Because a standard (byte) string is not a Unicode string. Python does not know what encoding it's in (or if it's even Unicode at all!), and so has no way to determine whether a particular Unicode character matches some character in it. The solution is to tell Python it's Unicode, using the unicode() function, as you have figured out.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

how to properly use a unicode string in python regex - python

Related

Python Regex - Matching mixed Unicode and ASCII characters in a string

What does 'r' mean before a Regex pattern?

python re find string that may contain brackets

How to remove escape sequence like '\xe2' or '\x0c' in python

Regular expression and unicode literals

Categories

Resources