With the stock Python 3.5-3.x regular expression engine, I have exhaustively tested that the regex
re.compile(r"[\x00-\x7F]", re.UNICODE)
matches all single characters with code points U+0000 through U+007F, and no others, and similarly, the regex
re.compile(r"[^\x00-\x7F]", re.UNICODE)
matches all single characters with code points U+0080 through U+10FFFF, and no others. However, what I do not know is whether this is guaranteed or just an accident. Have the Python maintainers made any kind of official statement about the meaning of range expressions in regex character classes in Unicode mode?
The official re module documentation is fairly vague about the exact semantics of ranges, and in other regex implementations, e.g. POSIX BREs and EREs, the interaction between range expressions and characters outside the ASCII range is explicitly unspecified.
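The exhaustive check described above can be reproduced directly; here is a sketch that spot-checks the boundary code points (the full loop over all of U+0000..U+10FFFF works the same way, just slower). It assumes Python 3.4+ for re.fullmatch:

```python
import re

ascii_re = re.compile(r"[\x00-\x7F]", re.UNICODE)
non_ascii_re = re.compile(r"[^\x00-\x7F]", re.UNICODE)

# Representative code points on either side of the ASCII divide.
for cp in (0x00, 0x41, 0x7F, 0x80, 0x100, 0xFFFF, 0x10000, 0x10FFFF):
    ch = chr(cp)
    assert (ascii_re.fullmatch(ch) is not None) == (cp <= 0x7F)
    assert (non_ascii_re.fullmatch(ch) is not None) == (cp > 0x7F)
```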
Related
Could anyone tell me what the [Cc] in this code is called? I know what it does but I have no idea what it is called.
#!/usr/bin/perl
$sentence = "Big cat sat.";
$sentence =~ /[Cc]at/;
print "$`, $&, $'\n"; # prints "Big , cat,  sat." ($` is the pre-match, $& the match, $' the post-match)
Also, does anyone know what the Perl equivalent of Python 2.7's re.search is? All I keep finding is something about Python's replace and mutability, which does not really say anything about search.
Bracketed groups of characters are called character classes or character sets.
Regular expressions have a simple formal definition with just a few operations. One of these operations is alternation. Alternations allow you to match against the union of two sets of strings. Character sets are syntax for an alternation over a group of single character strings. More commonly when we talk about alternations in regular expressions we are referring to the use of the vertical bar | which matches the union of the expressions on either side of the bar.
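That equivalence is easy to demonstrate in Python: a character set like [Cc] matches exactly the same strings as the explicit alternation (?:C|c):

```python
import re

# A character set is shorthand for an alternation over single characters.
for s in ("Big cat sat.", "Cat nap", "dog"):
    assert bool(re.search(r"[Cc]at", s)) == bool(re.search(r"(?:C|c)at", s))
```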
I don't really understand the close votes, but you've made the mistake of asking more than one question!
It's hard to know what's tripping you up, but this may help
The pattern /[Cc]at/ as a whole is a regular expression (regexp or regex), while the particular component [Cc] is called a character class, which matches any one of a set of characters; in this case an upper- or lower-case C. It's documented in the Python documentation for Regular Expression Syntax, which calls it just a "set of characters" and speaks about things like \d (numeric digits) and \w ("word" characters) as character classes. In Perl, the square-bracket construct is also called a character class.
The documentation for re.search on the same page is fairly simple, and you seem to have used its Perl equivalent in your code so I don't understand the problem you're having
In Python,
object = re.search(pattern, string)
checks for the occurrence of pattern anywhere in string and sets object to a match object if one is found, or None otherwise
This is the same in Perl as using the binding operator =~ like this
my $result = $string =~ /pattern/
which sets $result to a true value if a match was found, or false otherwise
Take a look at the Python documentation for search() vs. match()
re.match is identical to re.search, except that the match must occur at the very start of the string
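A minimal illustration of that difference, reusing the sentence from the question:

```python
import re

s = "Big cat sat."
assert re.search(r"cat", s) is not None  # "cat" occurs somewhere in s
assert re.match(r"cat", s) is None       # but not at the very start
assert re.match(r"Big", s) is not None   # match anchors at position 0
```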
Perl and some other current regex engines support Unicode properties, such as the category, in a regex. E.g. in Perl you can use \p{Ll} to match an arbitrary lower-case letter, or \p{Zs} for any space separator. I don't see support for this in either the 2.x or 3.x lines of Python (with due regrets). Is anybody aware of a good strategy to get a similar effect? Homegrown solutions are welcome.
The regex module (an alternative to the standard re module) supports Unicode codepoint properties with the \p{} syntax.
Have you tried Ponyguruma, a Python binding to the Oniguruma regular expression engine? In that engine you can simply say \p{Armenian} to match Armenian characters. \p{Ll} or \p{Zs} work too.
You can painstakingly use unicodedata on each character:
import unicodedata
def strip_accents(x):
    return u''.join(c for c in unicodedata.normalize('NFD', x)
                    if unicodedata.category(c) != 'Mn')
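For example (restating the function above in Python 3 syntax):

```python
import unicodedata

def strip_accents(x):
    # NFD splits each accented letter into base letter + combining mark(s);
    # category 'Mn' (nonspacing mark) filters the marks out.
    return ''.join(c for c in unicodedata.normalize('NFD', x)
                   if unicodedata.category(c) != 'Mn')

print(strip_accents('Crème brûlée'))  # -> Creme brulee
```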
Speaking of homegrown solutions, some time ago I wrote a small program to do just that - convert a Unicode category written as \p{...} into a range of values, extracted from the Unicode specification (v5.0.0). Only categories are supported (e.g. L, Zs), and it is restricted to the BMP. I'm posting it here in case someone finds it useful (although Oniguruma really seems a better option).
Example usage:
>>> from unicode_hack import regex
>>> pattern = regex(r'^\\p{Lu}(\\p{L}|\\p{N}|_)*')
>>> print pattern.match(u'疂_1+2').group(0)
疂_1
>>>
Here's the source. There is also a JavaScript version, using the same data.
You're right that Unicode property classes are not supported by the Python regex parser.
If you wanted to do a nice hack, that would be generally useful, you could create a preprocessor that scans a string for such class tokens (\p{M} or whatever) and replaces them with the corresponding character sets, so that, for example, \p{M} would become [\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F], and \P{M} would become [^\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F].
People would thank you. :)
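A minimal sketch of such a preprocessor, building the character classes from unicodedata instead of hard-coded ranges. The name expand_props is hypothetical, the token scan is naive (it would also fire inside character classes), and scanning the whole code space makes it slow, so results are worth caching:

```python
import re
import sys
import unicodedata

def expand_props(pattern):
    # Replace \p{X} / \P{X} tokens with explicit character classes
    # built from the unicodedata general categories.
    def chars_for(cat):
        return ''.join(re.escape(chr(cp))
                       for cp in range(sys.maxunicode + 1)
                       if unicodedata.category(chr(cp)).startswith(cat))
    def repl(m):
        negated = m.group(1) == 'P'
        return '[%s%s]' % ('^' if negated else '', chars_for(m.group(2)))
    return re.sub(r'\\([pP])\{(\w+)\}', repl, pattern)

marks = re.compile(expand_props(r'\p{M}+'))
assert marks.fullmatch('\u0301\u0302')  # combining acute + circumflex
assert marks.search('plain') is None
```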
Note that while \p{Ll} has no equivalent in Python regular expressions, \p{Zs} should be covered by '(?u)\s'.
The (?u), as the docs say, “Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database.” and \s means any spacing character.
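Note that this makes \s a superset of \p{Zs}, not an exact equivalent, since it also matches \t, \n and friends. A quick check:

```python
import re

# With the Unicode flag, \s covers the Zs space separators
# as well as ASCII whitespace.
assert re.match(r'(?u)\s', '\u00A0') is not None  # NO-BREAK SPACE (Zs)
assert re.match(r'(?u)\s', '\u2003') is not None  # EM SPACE (Zs)
assert re.match(r'(?u)\s', 'x') is None
```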
I recently learned a little Python and I couldn't find a good list of regexes (don't know if that is the correct plural...) with complete explanations that even a rookie will understand :)
Does anybody know of such a list?
Well, for starters - hit up the python docs on the re module. Good list of features and methods, as well as info about special regex characters such as \w. There's also a chapter in Dive into Python about regular expressions that uses the aforementioned module.
Check out the re module docs for some basic RegEx syntax.
For more, read Introduction To RegEx, or other of the many guides online. (or books!)
You could also try RegexBuddy, which helps you learn regular expressions by telling you what they do and parsing them for you.
The Django Book chapter on URLs/views (http://www.djangobook.com/en/2.0/chapter03/) has a great newbie-friendly table explaining the gist of regexes. Combine that with the info in the Python docs (http://docs.python.org/library/re.html) and you'll master regexes in no time.
an excerpt:
Regular Expressions
Regular expressions (or regexes) are a compact way of specifying patterns in text. While Django URLconfs allow arbitrary regexes for powerful URL matching, you’ll probably only use a few regex symbols in practice. Here’s a selection of common symbols:
Symbol      Matches
.  (dot)    Any single character
\d          Any single digit
[A-Z]       Any character between A and Z (uppercase)
[a-z]       Any character between a and z (lowercase)
[A-Za-z]    Any character between a and z (case-insensitive)
+           One or more of the previous expression (e.g., \d+ matches one or more digits)
?           Zero or one of the previous expression (e.g., \d? matches zero or one digit)
*           Zero or more of the previous expression (e.g., \d* matches zero, one or more than one digit)
{1,3}       Between one and three (inclusive) of the previous expression (e.g., \d{1,3} matches one, two or three digits)
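The table's symbols can be exercised directly, for instance with re.fullmatch (Python 3.4+):

```python
import re

assert re.fullmatch(r'\d+', '2024') is not None          # one or more digits
assert re.fullmatch(r'\d?', '') is not None              # zero or one digit
assert re.fullmatch(r'[A-Za-z]*', 'Django') is not None  # zero or more letters
assert re.fullmatch(r'\d{1,3}', '42') is not None        # one to three digits
assert re.fullmatch(r'\d{1,3}', '1234') is None          # four is too many
```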
But it's turtles all the way down!
I have a UTF-8 string with combining diacritics. I want to match it with the \w regex sequence. It matches characters that have accents, but not a Latin character followed by combining diacritics.
>>> re.match("a\w\w\wz", u"aoooz", re.UNICODE)
<_sre.SRE_Match object at 0xb7788f38>
>>> print u"ao\u00F3oz"
aoóoz
>>> re.match("a\w\w\wz", u"ao\u00F3oz", re.UNICODE)
<_sre.SRE_Match object at 0xb7788f38>
>>> re.match("a\w\w\wz", u"aoo\u0301oz", re.UNICODE)
>>> print u"aoo\u0301oz"
aóooz
(Looks like the SO markdown processor is having trouble with the combining diacritics in the above, but there is a U+0301 combining acute accent on the last line)
Is there any way to match combining diacritics with \w? I don't want to normalise the text, because this text comes from a filename, and I don't want to get into a whole 'file name Unicode normalisation' yet. This is Python 2.5.
I've just noticed a new "regex" package on PyPI. (If I understand correctly, it is a test version of a new package that will someday replace the stdlib re package.)
It seems to have (among other things) more possibilities with regard to unicode. For example, it supports \X, which is used to match a single grapheme (whether it uses combining or not). It also supports matching on unicode properties, blocks and scripts, so you can use \p{M} to refer to combining marks. The \X mentioned before is equivalent to \P{M}\p{M}* (a character that is NOT a combining mark, followed by zero or more combining marks).
Note that this makes \X more or less the unicode equivalent of ., not of \w, so in your case, \w\p{M}* is what you need.
It is (for now) a non-stdlib package, and I don't know how ready it is (and it doesn't come in a binary distribution), but you might want to give it a try, as it seems to be the easiest/most "correct" answer to your question. (Otherwise, I think you're down to explicitly using character ranges, as described in my comment to the previous answer.)
See also this page with information on unicode regular expressions, that might also contain some useful information for you (and can serve as documentation for some of the things implemented in the regex package).
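For completeness, the explicit-character-range fallback mentioned above can be spelled out with the stdlib alone. The ranges below cover only the main combining-mark blocks, so this is an approximation of \p{M}, not an exact equivalent:

```python
import re

# Rough stand-in for \w\p{M}*: a word character optionally followed
# by combining marks from the main combining blocks.
COMBINING = '[\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]'
word = r'(?:\w' + COMBINING + '*)'

pattern = 'a' + word * 3 + 'z'
assert re.match(pattern, 'aoo\u0301oz', re.UNICODE)  # combining acute
assert re.match(pattern, 'ao\u00F3oz', re.UNICODE)   # precomposed ó
```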
You can use unicodedata.normalize to compose the combining diacritics into one unicode character.
>>> import re
>>> from unicodedata import normalize
>>> re.match(u"a\w\w\wz", normalize("NFC", u"aoo\u0301oz"), re.UNICODE)
<_sre.SRE_Match object at 0x00BDCC60>
I know you said you didn't want to normalize, but I don't think there will be a problem with this solution, as you're only normalizing the string to match against, and do not have to change the filename itself or something.