Python regex \w doesn't match combining diacritics? - python

I have a UTF-8 string with combining diacritics. I want to match it with the \w regex sequence. It matches characters that have precomposed accents, but not a Latin character followed by combining diacritics.
>>> re.match("a\w\w\wz", u"aoooz", re.UNICODE)
<_sre.SRE_Match object at 0xb7788f38>
>>> print u"ao\u00F3oz"
aoóoz
>>> re.match("a\w\w\wz", u"ao\u00F3oz", re.UNICODE)
<_sre.SRE_Match object at 0xb7788f38>
>>> re.match("a\w\w\wz", u"aoo\u0301oz", re.UNICODE)
>>> print u"aoo\u0301oz"
aóooz
(It looks like the SO markdown processor is having trouble with the combining diacritics above, but there is a combining acute accent (U+0301) after the second o on the last line.)
Is there any way to match combining diacritics with \w? I don't want to normalise the text, because it comes from a filename and I don't want to get into full 'filename Unicode normalization' yet. This is Python 2.5.

I've just noticed a new "regex" package on PyPI. (If I understand correctly, it is a test version of a new package that will someday replace the stdlib re module.)
It seems to have (among other things) more possibilities with regard to Unicode. For example, it supports \X, which matches a single grapheme (whether or not it uses combining characters). It also supports matching on Unicode properties, blocks and scripts, so you can use \p{M} to refer to combining marks. The \X mentioned before is equivalent to \P{M}\p{M}* (a character that is NOT a combining mark, followed by zero or more combining marks).
Note that this makes \X more or less the unicode equivalent of ., not of \w, so in your case, \w\p{M}* is what you need.
It is (for now) a non-stdlib package, and I don't know how ready it is (and it doesn't come in a binary distribution), but you might want to give it a try, as it seems to be the easiest/most "correct" answer to your question. (Otherwise, I think you're down to explicitly using character ranges, as described in my comment to the previous answer.)
See also this page on Unicode regular expressions; it might contain some useful information for you (and can serve as documentation for some of the things implemented in the regex package).
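As a rough sketch of the idea (assuming the regex package from PyPI is installed):
import regex

s = u'aoo\u0301oz'   # 'a', 'o', 'o' + combining acute accent, 'o', 'z'

# Treat "word character plus any trailing combining marks" as one unit:
print regex.match(ur'a(?:\w\p{M}*){3}z', s, regex.UNICODE)

# \X is the grapheme-cluster analogue of '.', not of \w:
print regex.findall(ur'\X', s)   # [u'a', u'o', u'o\u0301', u'o', u'z']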

You can use unicodedata.normalize to compose the combining diacritics into one unicode character.
>>> import re
>>> from unicodedata import normalize
>>> re.match(u"a\w\w\wz", normalize("NFC", u"aoo\u0301oz"), re.UNICODE)
<_sre.SRE_Match object at 0x00BDCC60>
I know you said you didn't want to normalize, but I don't think there will be a problem with this solution, as you're only normalizing the string you match against; you don't have to change the filename itself.
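For instance, a small helper along these lines (the function name is just for illustration) normalizes only the copy used for matching:
import re
from unicodedata import normalize

def match_filename(pattern, filename):
    # Normalize only the copy we match against; the caller's
    # filename string is left untouched.
    return re.match(pattern, normalize('NFC', filename), re.UNICODE)

print match_filename(u'a\\w\\w\\wz', u'aoo\u0301oz')   # now matches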

Related

Issues with re.search and unicode in python [duplicate]

I have been trying to extract certain text from PDFs converted into text files. The PDFs came from various sources and I don't know how they were generated.
The pattern I was trying to extract was simply two digits, followed by a hyphen, and then another two digits, e.g. 12-34. So I wrote a simple regex \d\d-\d\d and expected that to work.
However, when I tested it I found that it missed some hits. Later I noticed that there are at least two other hyphen-like characters, represented as \u2212 and \xad. So I changed my regex to \d\d[-\u2212\xad]\d\d and it worked.
My question is: since I am going to process so many PDFs that I can't know what other variations of hyphen are out there, is there a regex expression covering all "hyphens", ideally one that looks better than [-\u2212\xad]?
The solution you ask for in the question title implies a whitelisting approach: you need to enumerate the characters you consider hyphen-like.
You may refer to the Punctuation, Dash category (Pd); that Unicode category lists all the possible Unicode hyphens.
You may use the PyPI regex module and its \p{Pd} pattern to match any Unicode hyphen.
Or, if you can only work with re, use
[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]
You may expand this list with other Unicode chars that contain MINUS in their Unicode names; see this list.
A blacklisting approach means you do not want to match specific chars between the two pairs of digits. If you want to match any non-whitespace, you may use \S. If you want to match any punctuation or symbols, use (?:[^\w\s]|_).
Note that the "soft hyphen", U+00AD, is not included into the \p{Pd} category, and won't get matched with that construct. To include it, create a character class and add it:
[\xAD\p{Pd}]
[\xAD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]
This is also a possible solution, if your regex engine allows it:
/\p{Dash}/u
This will include all these characters.
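For illustration, here is a rough sketch of both options; the sample text is invented, and the third-party regex package is assumed for the \p{Pd} variant:
import re
import regex

# U+2010 is HYPHEN (category Pd); U+00AD is the soft hyphen (category Cf).
text = u'codes 12-34, 56\u201078 and 90\xad12'

# regex module: any dash-punctuation character, plus the soft hyphen added
# explicitly, as noted above.
print regex.findall(ur'\d\d[\xAD\p{Pd}]\d\d', text)   # finds all three

# stdlib re fallback with an explicit whitelist; the ASCII '-' is placed
# first so it is taken literally inside the class.
hyphens = (u'-\xAD\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A'
           u'\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D')
print re.findall(u'\\d\\d[' + hyphens + u']\\d\\d', text)   # finds all three

# U+2212 MINUS SIGN is a math symbol (Sm), not dash punctuation, so it
# would have to be added to either class by hand.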

Python regex sub function with matching group

I'm trying to get a Python regex sub function to work, but I'm having a bit of trouble. Below is the code that I'm using.
string = 'á:tdfrec'
newString = re.sub(ur"([aeioäëöáéíóàèìò])([aeioäëöáéíóúàèìò]):", ur"\1:\2", string)
#newString = re.sub(ur"([a|e|i|o|ä|ë|ö|á|é|í|ó|à|è|ì|ò])([a|e|i|o|ä|ë|ö|á|é|í|ó|ú|à|è|ì|ò]):", ur"\1:\2", string)
print newString
# a:́tdfrec is printed
So the above code is not working the way I intend. It doesn't display correctly here, but the printed string has the acute accent over the :. The regex is moving the acute accent from over the a to over the :. For the string I'm declaring, this regex is not supposed to be applied at all. My intention is for this regex to apply only to the following examples:
aä:dtcbd becomes a:ädtcbd
adfseì:gh becomes adfse:ìgh
éò:fdbh becomes é:òfdbh
but my regex statement is being applied and I don't want it to be. I think my problem is that the second character set followed by the : (i.e. á:) is what's causing the regex to be applied. I've been staring at this for a while and have tried a few other things, and I feel like this should work, but I'm missing something. Any help is appreciated!
The following code with the re.UNICODE flag also doesn't achieve the desired output:
>>> import re
>>> original = u'á:tdfrec'
>>> pattern = re.compile(ur"([aeioäëöáéíóàèìò])([aeioäëöáéíóúàèìò]):", re.UNICODE)
>>> print pattern.sub(ur'\1:\2', original)
á:tdfrec
Is it because of the diacritic, as in the "tony the pony" example with les misérable? The diacritic ends up on the wrong character after reversing:
>>> original = u'les misérable'
>>> print ''.join([i for i in reversed(original)])
elbarésim sel
Edit: this is definitely an issue with the combining diacritics; you need to normalize both the regular expression and the strings you are trying to match. For example:
import re
import unicodedata
regex = unicodedata.normalize('NFC', ur'([aeioäëöáéíóàèìò])([aeioäëöáéíóúàèìò]):')
string = unicodedata.normalize('NFC', u'aä:dtcbd')
newString = re.sub(regex, ur'\1:\2', string)
Here is an example that shows why you might hit an issue without the normalization. The string u'á' could either be the single code point LATIN SMALL LETTER A WITH ACUTE (U+00E1), or it could be two code points, LATIN SMALL LETTER A (U+0061) followed by COMBINING ACUTE ACCENT (U+0301). These will probably look the same, but they have very different behaviors in a regex, because you can match the combining accent as its own character. That is what is happening here with the string 'á:tdfrec': a regular 'a' is captured in group 1, and the combining diacritic is captured in group 2.
By normalizing both the regex and the string you are matching, you ensure this doesn't happen, because NFC normalization will replace the diacritic and the character before it with a single equivalent character.
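To make this concrete, here is a small demonstration (a sketch; the character class and sample strings are made up for illustration):
import re
import unicodedata

decomposed = u'a\u0301'  # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT
composed = unicodedata.normalize('NFC', decomposed)  # LATIN SMALL LETTER A WITH ACUTE

print len(decomposed), len(composed)   # 2 1
print decomposed == composed           # False, although they render the same

# A character class consumes only the base letter of the decomposed form;
# the combining accent is left over for the next token in the pattern.
print re.match(u'[a\u00E1]:', decomposed + u':', re.UNICODE)   # None
print re.match(u'[a\u00E1]:', composed + u':', re.UNICODE)     # match object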
Original answer below.
I think your issue here is that the string you are attempting to do the replacement on is a byte string, not a Unicode string.
If these are string literals make sure you are using the u prefix, e.g. string = u'aä:dtcbd'. If they are not literals you will need to decode them, e.g. string = string.decode('utf-8') (although you may need to use a different codec).
You should probably also normalize your string, because part of the issue may have something to do with combining diacritics.
Note that in this case the re.UNICODE flag will not make a difference, because that only changes the meaning of character class shorthands like \w and \d. The important thing here is that if you are using a Unicode regular expression, it should probably be applied to a Unicode string.

python re codecs ä ö, finnish language, define that is word

Is it possible to define that a specific language's characters should be considered word characters?
That is, re does not accept ä, ö as word characters if I search for them in the following way:
Ft=codecs.open('c:\\Python27\\Scripts\\finnish2\\textfields.txt','r','utf-8')
word=Ft.readlines()
word=smart_str(word, encoding='utf-8', strings_only=False, errors='replace')
word=re.sub('[^äÄöÖåÅA-Za-z0-9]',"""\[^A-Za-z0-9]*""", word) ; print 'word= ', word #works in skipping ö,ä,å characters
I would like these characters to be included in [A-Za-z].
How do I define this?
[A-Za-z0-9] will only match the characters listed here, but the docs also mention some other special constructs like:
\w which stands for alphanumeric characters (namely [a-zA-Z0-9_] plus all unicode characters which are declared to be alphanumeric)
\W which stands for all non-alphanumeric characters ([^a-zA-Z0-9_] plus unicode)
\d which stands for digits
\b which matches word boundaries (including all rules from the unicode tables)
So, you will want to (a) use these constructs instead (which are shorter and maybe easier to read), and (b) tell re that those classes should follow the Unicode character database by setting the UNICODE flag, like:
re_word = re.compile(r'\w+', re.U)
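For example, with a couple of made-up Finnish words:
import re

re_word = re.compile(r'\w+', re.U)

# With re.UNICODE, \w covers ä, ö and å as well as ASCII letters and digits.
print re_word.findall(u'hyvää päivää 123')
# [u'hyv\xe4\xe4', u'p\xe4iv\xe4\xe4', u'123']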
For a start, you appear to be slightly confused about the args for re.sub.
The first arg is the pattern. You have '[^äÄöÖåÅA-Za-z0-9]' which matches each character which is NOT in the Finnish alphabet nor a digit.
The second arg is the replacement. You have """[^A-Za-z0-9]*""" ... so each of those non-Finnish-alphanumeric characters is going to be replaced by the literal string [^A-Za-z0-9]*. It's reasonable to assume that this is not what you want.
What do you want to do?
You need to explain your third line; after your first 2 lines, word will be a list of unicode objects, which is A Good Thing. However the encoding= and the errors= indicate that the unknown (to us) smart_str() is converting your lovely unicode back to UTF-8. Processing data in UTF-8 bytes instead of Unicode characters is EVIL, unless you know what you are doing.
What encoding directive do you have at the top of your source file?
Advice: Get your data into unicode. Work on it in unicode. All your string constants should have the u prefix; if you consider that too much wear and tear on your typing fingers, at least put it on the non-ASCII constants e.g. u'[^äÄöÖåÅA-Za-z0-9]'. When you have done all the processing, encode your results for display or storage using an appropriate encoding.
When working with re, consider \w which will match any alphanumeric (and also the underscore) instead of listing out what is alphabetic in one language. Do use the re.UNICODE flag; docs here.
Something like this might do the trick:
pattern = re.compile("(?u)pattern")
or
pattern = re.compile("pattern", re.UNICODE)

Python regex matching Unicode properties

Perl and some other current regex engines support Unicode properties, such as the category, in a regex. E.g. in Perl you can use \p{Ll} to match an arbitrary lower-case letter, or \p{Zs} for any space separator. I don't see support for this in either the 2.x or the 3.x line of Python (with due regrets). Is anybody aware of a good strategy to get a similar effect? Homegrown solutions are welcome.
The regex module (an alternative to the standard re module) supports Unicode codepoint properties with the \p{} syntax.
Have you tried Ponyguruma, a Python binding to the Oniguruma regular expression engine? In that engine you can simply say \p{Armenian} to match Armenian characters. \p{Ll} or \p{Zs} work too.
You can painstakingly use unicodedata on each character:
import unicodedata
def strip_accents(x):
    return u''.join(c for c in unicodedata.normalize('NFD', x) if unicodedata.category(c) != 'Mn')
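For example, a quick usage check (NFD splits off the accents, which are then dropped as category Mn):
>>> print strip_accents(u'caf\xe9 cr\xe8me')
cafe creme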
Speaking of homegrown solutions, some time ago I wrote a small program to do just that - convert a unicode category written as \p{...} into a range of values, extracted from the unicode specification (v.5.0.0). Only categories are supported (ex.: L, Zs), and it is restricted to the BMP. I'm posting it here in case someone finds it useful (although Oniguruma really seems a better option).
Example usage:
>>> from unicode_hack import regex
>>> pattern = regex(r'^\p{Lu}(\p{L}|\p{N}|_)*')
>>> print pattern.match(u'疂_1+2').group(0)
疂_1
>>>
Here's the source. There is also a JavaScript version, using the same data.
You're right that Unicode property classes are not supported by the Python regex parser.
If you wanted to do a nice hack that would be generally useful, you could create a preprocessor that scans a string for such class tokens (\p{M} or whatever) and replaces them with the corresponding character sets, so that, for example, \p{M} would become [\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F], and \P{M} would become [^\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F].
People would thank you. :)
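A minimal sketch of such a preprocessor (the category-to-range table here is truncated to the combining-mark ranges mentioned above and is only illustrative):
import re

# Illustrative, incomplete table: only the combining-mark ranges listed above.
PROPERTY_RANGES = {
    'M': u'\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F',
}

def expand_properties(pattern):
    # Rewrite \p{X} as [ranges] and \P{X} as [^ranges].
    def repl(match):
        ranges = PROPERTY_RANGES[match.group(2)]
        if match.group(1) == 'P':
            return u'[^' + ranges + u']'
        return u'[' + ranges + u']'
    return re.sub(r'\\([pP])\{(\w+)\}', repl, pattern)

expanded = expand_properties(ur'\w\p{M}*')
print re.match(expanded + u'$', u'o\u0301', re.UNICODE)   # matches o + combining acute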
Note that while \p{Ll} has no equivalent in Python regular expressions, \p{Zs} should be covered by '(?u)\s'.
The (?u), as the docs say, “Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database.” and \s means any spacing character.
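For instance, U+00A0 NO-BREAK SPACE is in category Zs and is matched once the flag is set:
import re

# (?u) makes \s use the Unicode notion of whitespace, which includes
# the Zs space separators such as U+00A0 NO-BREAK SPACE.
print re.match(u'(?u)\\s', u'\u00A0') is not None   # True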
