Perl and some other current regex engines support Unicode properties, such as the general category, in a regex. E.g. in Perl you can use \p{Ll} to match an arbitrary lowercase letter, or \p{Zs} for any space separator. I don't see support for this in either the 2.x or 3.x lines of Python (with due regrets). Is anybody aware of a good strategy to get a similar effect? Homegrown solutions are welcome.
The regex module (an alternative to the standard re module) supports Unicode codepoint properties with the \p{} syntax.
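For example (a quick demonstration, assuming the regex package is installed; Python 2 syntax to match the rest of this thread):
>>> import regex
>>> regex.findall(r'\p{Ll}+', u'Hello World')
[u'ello', u'orld']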
Have you tried Ponyguruma, a Python binding to the Oniguruma regular expression engine? In that engine you can simply say \p{Armenian} to match Armenian characters. \p{Ll} or \p{Zs} work too.
You can painstakingly use unicodedata on each character:
import unicodedata

def strip_accents(x):
    # Decompose to NFD, then drop every combining mark (category 'Mn').
    return u''.join(c for c in unicodedata.normalize('NFD', x)
                    if unicodedata.category(c) != 'Mn')
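In the same spirit, a small hypothetical helper (the name is mine, not stdlib) can emulate a \p{...} test on a single character:
import unicodedata

def has_category(ch, prefix):
    # True if ch's general category starts with the given prefix,
    # e.g. has_category(c, 'Ll') or has_category(c, 'Zs').
    return unicodedata.category(ch).startswith(prefix)

lowercase = [c for c in u'Hello World' if has_category(c, 'Ll')]
# [u'e', u'l', u'l', u'o', u'o', u'r', u'l', u'd']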
Speaking of homegrown solutions, some time ago I wrote a small program to do just that: convert a Unicode category written as \p{...} into a range of values, extracted from the Unicode specification (v5.0.0). Only categories are supported (e.g. L, Zs), and it is restricted to the BMP. I'm posting it here in case someone finds it useful (although Oniguruma really seems a better option).
Example usage:
>>> from unicode_hack import regex
>>> pattern = regex(r'^\p{Lu}(\p{L}|\p{N}|_)*')
>>> print pattern.match(u'疂_1+2').group(0)
疂_1
>>>
Here's the source. There is also a JavaScript version, using the same data.
You're right that Unicode property classes are not supported by the Python regex parser.
If you wanted to do a nice hack that would be generally useful, you could create a preprocessor that scans a string for such class tokens (\p{M} or whatever) and replaces them with the corresponding character sets, so that, for example, \p{M} would become [\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F], and \P{M} would become [^\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F].
People would thank you. :)
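A minimal sketch of such a preprocessor, building the classes naively from unicodedata instead of hand-written ranges (BMP only, no range compression, surrogates skipped; Python 3 syntax):
import re
import unicodedata

def expand_properties(pattern):
    # Replace each \p{X} / \P{X} token with an explicit character class.
    def char_class(match):
        negate = match.group(1) == 'P'
        prefix = match.group(2)
        chars = ''.join(chr(cp) for cp in range(0x10000)
                        if unicodedata.category(chr(cp)).startswith(prefix)
                        and not 0xD800 <= cp <= 0xDFFF)
        return '[%s%s]' % ('^' if negate else '', re.escape(chars))
    return re.sub(r'\\([pP])\{(\w+)\}', char_class, pattern)

# e.g. re.compile(expand_properties(r'\p{Zs}+')) matches runs of space separators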
Note that while \p{Ll} has no equivalent in Python regular expressions, \p{Zs} should be covered by '(?u)\s'.
The (?u) flag, as the docs say, “Make[s] \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database”, and \s means any whitespace character.
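A quick check (Python 2):
>>> import re
>>> re.findall(r'(?u)\s', u'a\u00a0b\u2003c')
[u'\xa0', u'\u2003']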
Related
This question applies to Python 3 regular expressions. I think it might apply to other languages as well.
The question could easily be misunderstood so I'll be careful in describing it.
As background, \w means "a word character." In certain circumstances (a bytes pattern, or the re.ASCII flag) Python 3 will treat this as just [a-zA-Z0-9_], but if the regular expression is a str, it will be Unicode-aware, so that \w means "any Unicode word character." This is generally a good thing, as people use different languages and it would be hard to construct a range like [a-zA-Z0-9_] for all languages at once. I think \w is therefore most useful in a multilingual setting.
But there is a problem: What if you don't want to match underscores because you don't think they're really a word character (for your particular application)?
If you're only focused on English applications, the best solution is probably to skip \w entirely and just use [a-zA-Z0-9]. But if you're focused on global applications and you don't want underscores, it seems like you might be in a really unfortunate situation. I haven't done it, but I assume it would be really tough to write a range that represents 100 languages at once just so you can avoid that underscore.
So my question is: Is there any way to use \w to match any Unicode word character, but somehow also exclude underscores (or some other undesirable character) from the class? I don't think I've seen anything like this described, but it would be highly useful. Something like [\w^_]. Of course that won't actually work, but what I mean is "use a character class that starts with everything represented by \w, but then go ahead and remove underscores from that class."
Thoughts?
I have two options.
[^\W_]
This is very effective and does exactly what you want: \W matches anything that is not a word character, so the doubly negated class [^\W_] matches any word character except the underscore. It's also straightforward.
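A quick demonstration:
>>> import re
>>> re.findall(r'[^\W_]+', u'foo_bar-baz', re.UNICODE)
[u'foo', u'bar', u'baz']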
With regex you can use set difference: [[\w]--[_]]. Note that you need the "V1" flag set, so you need either
r = regex.compile(r"(?V1)[\w--_]")
or
r = regex.compile(r"[\w--_]", flags=regex.V1)
This is more readable, IMO, if you're familiar with Matthew Barnett's regex module, which is more powerful than Python's stock re.
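For example:
>>> import regex
>>> regex.findall(r'(?V1)[\w--_]+', u'foo_bar')
[u'foo', u'bar']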
With the stock Python 3.5-3.x regular expression engine, I have exhaustively tested that the regex
re.compile(r"[\x00-\x7F]", re.UNICODE)
matches all single characters with code points U+0000 through U+007F, and no others, and similarly, the regex
re.compile(r"[^\x00-\x7F]", re.UNICODE)
matches all single characters with code points U+0080 through U+10FFFF, and no others. However, what I do not know is whether this is guaranteed or just an accident. Have the Python maintainers made any kind of official statement about the meaning of range expressions in regex character classes in Unicode mode?
The official re module documentation is fairly vague about the exact semantics of ranges, and in other regex implementations, e.g. POSIX BREs and EREs, the interaction between range expressions and characters outside the ASCII range is explicitly unspecified.
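For reference, here is a sketch of the kind of exhaustive check described above (Python 3):
import re
import sys

ascii_only = re.compile(r"[\x00-\x7F]")
non_ascii = re.compile(r"[^\x00-\x7F]")
for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    # Every code point must fall in exactly one of the two classes.
    assert bool(ascii_only.fullmatch(ch)) == (cp <= 0x7F)
    assert bool(non_ascii.fullmatch(ch)) == (cp > 0x7F)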
I have a UTF-8 string with combining diacritics. I want to match it with the \w regex sequence. It matches precomposed accented characters, but not a Latin character followed by combining diacritics.
>>> re.match("a\w\w\wz", u"aoooz", re.UNICODE)
<_sre.SRE_Match object at 0xb7788f38>
>>> print u"ao\u00F3oz"
aoóoz
>>> re.match("a\w\w\wz", u"ao\u00F3oz", re.UNICODE)
<_sre.SRE_Match object at 0xb7788f38>
>>> re.match("a\w\w\wz", u"aoo\u0301oz", re.UNICODE)
>>> print u"aoo\u0301oz"
aoóoz
(Looks like the SO markdown processor is having trouble with the combining diacritics in the above, but there is a combining acute (U+0301) on the second "o" in the last line)
Is there any way to match combining diacritics with \w? I don't want to normalise the text, because it comes from a filename and I don't want to have to do a whole 'file name unicode normalization' yet. This is Python 2.5.
I've just noticed a new "regex" package on PyPI. (If I understand correctly, it is a test version of a new package that will someday replace the stdlib re package.)
It seems to have (among other things) more possibilities with regard to unicode. For example, it supports \X, which is used to match a single grapheme (whether it uses combining or not). It also supports matching on unicode properties, blocks and scripts, so you can use \p{M} to refer to combining marks. The \X mentioned before is equivalent to \P{M}\p{M}* (a character that is NOT a combining mark, followed by zero or more combining marks).
Note that this makes \X more or less the unicode equivalent of ., not of \w, so in your case, \w\p{M}* is what you need.
It is (for now) a non-stdlib package, and I don't know how ready it is (it doesn't come in a binary distribution), but you might want to give it a try, as it seems to be the easiest/most "correct" answer to your question. (Otherwise, I think you're down to explicitly using character ranges, as described in my comment to the previous answer.)
See also this page with information on unicode regular expressions, that might also contain some useful information for you (and can serve as documentation for some of the things implemented in the regex package).
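For example, with the regex package installed, something like this should handle the string from the question (a sketch; I haven't run it against every version of the package):
>>> import regex
>>> regex.match(r"a(?:\w\p{M}*){3}z", u"aoo\u0301oz").group(0)
u'aoo\u0301oz'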
You can use unicodedata.normalize to compose the combining diacritics into one unicode character.
>>> import re
>>> from unicodedata import normalize
>>> re.match(u"a\w\w\wz", normalize("NFC", u"aoo\u0301oz"), re.UNICODE)
<_sre.SRE_Match object at 0x00BDCC60>
I know you said you didn't want to normalize, but I don't think there will be a problem with this solution: you're only normalizing the string to match against, not changing the filename itself.
I have a regular expression which works perfectly well (although I am sure it is weak) in .NET/C#:
((^|\s))(?<tag>\#(?<tagname>(\w|\+)+))(?($|\s|\.))
I am trying to move it over to Python, but I seem to be running into a formatting issue (invalid expression exception).
It is a lame question/request, but I have been staring at this for a while, but nothing obvious is jumping out at me.
Note: I am simply trying
r = re.compile('((^|\s))(?<tag>\#(?<tagname>(\w|\+)+))(?($|\s|\.))')
Thanks,
Scott
There are some syntax incompatibilities between .NET regexps and PCRE/Python regexps:
(?<name>...) is (?P<name>...)
(?...) does not exist, and as I don't know what it is used for in .NET, I can't guess any equivalent. A Google Code Search does not give me any pointer to what it could be used for.
Besides, you should use Python raw strings (r"I am a raw string") instead of normal strings when expressing regexps: raw strings do not interpret escape sequences (like \n). But that is not the problem in your example, as you're not using any known escape sequence that would be replaced (\s does not mean anything as an escape sequence, so it is not replaced).
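Putting those points together, and guessing that the trailing (?...) was intended as a lookahead, a Python translation might look like this (an untested guess at the original intent):
import re

r = re.compile(r'((^|\s))(?P<tag>\#(?P<tagname>(\w|\+)+))(?=$|\s|\.)')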
Is "(?" there to prevent creation of a separate group? In Python's re's, this is "(:?". Try this:
r = re.compile(r'((^|\s))(:?<tag>\#(:?<tagname>(\w|\+)+))(:?($|\s|\.))')
Also, note the use of a raw string literal (the "r" character just before the quotes). Raw literals suppress '\' escaping, so that your '\' characters pass straight through to re (otherwise, you'd need '\\' for every '\').