Regular expression for names with national characters - python

I am looking for a regular expression to validate names (using Python standard module re).
The expression should work for names with standard Latin characters (a-z), spaces, and dashes, for names with Western European characters (æøåüöä etc.), and also for Chinese, Thai, Arabic, and so on.
All of these can be considered "letters" - they are fine, but special characters such as !##$%&*() and quotes etc. should fail.
I haven't found anything that can do this - does anybody know how?
PS: there are thousands of characters that qualify - it's not realistic to simply list them all.

Well, the question really is: what do you need this for? Maybe the opposite approach would be better for you, i.e. specify which characters are not allowed, e.g. [^ \t] etc.
You should also take a look in the manual at things like \s, \w and the others, combined with the LOCALE and UNICODE flags.
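A minimal sketch of that blacklist-flavoured idea using only the standard re module (the [^\W\d_] class and the is_valid_name helper are my own illustration, not from the question): since \w is Unicode-aware for str patterns in Python 3, subtracting digits and the underscore from it leaves, approximately, just letters.
import re

# \w matches Unicode word characters for str patterns in Python 3;
# removing \d and _ from it via a negated class leaves only letters.
# A name is one or more letter runs joined by single spaces or dashes.
NAME_RE = re.compile(r"[^\W\d_]+(?:[ -][^\W\d_]+)*\Z")

def is_valid_name(name):
    # Hypothetical helper, just for illustration.
    return NAME_RE.match(name) is not None

print(is_valid_name("Jean-Luc Picard"))  # True
print(is_valid_name("李小龙"))            # True - CJK characters are letters
print(is_valid_name("n4me!"))            # False - digits and punctuation fail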

You can create a character class that matches all the scripts you want to allow. For example,
[\p{Cyrillic}\p{Latin}]
will match all Cyrillic and Latin letters. Note that \p{...} property classes are not supported by the standard re module; you need the third-party regex module for them. Not sure if this is the best solution, but it works.
Here is the full reference
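A short sketch of that approach with the regex module (pip install regex); the character class is from the answer above, the surrounding code is my own illustration:
import regex

# \p{Cyrillic} and \p{Latin} are Unicode script classes,
# supported by the third-party regex module (not by stock re).
pattern = regex.compile(r"[\p{Cyrillic}\p{Latin}]+")

print(bool(pattern.fullmatch("Anna")))  # True (Latin)
print(bool(pattern.fullmatch("Анна")))  # True (Cyrillic)
print(bool(pattern.fullmatch("安娜")))   # False (Han is not in the class)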

Issues with re.search and unicode in python [duplicate]

I have been trying to extract certain text from PDFs converted into text files. The PDFs came from various sources and I don't know how they were generated.
The pattern I was trying to extract was simply two digits, followed by a hyphen, and then another two digits, e.g. 12-34. So I wrote the simple regex \d\d-\d\d and expected that to work.
However, when I tested it I found that it missed some hits. Later I noticed that there are at least two other hyphen-like characters involved, represented as \u2212 and \xad. So I changed my regex to \d\d[-\u2212\xad]\d\d and it worked.
My question is: since I am going to process so many PDFs that I don't know what other variations of hyphen are out there, is there a regex covering all "hyphens", one that hopefully looks better than the [-\u2212\xad] expression?
The solution you ask for in the question title implies a whitelisting approach: you need to enumerate the chars that you consider similar to hyphens.
You may refer to the Punctuation, Dash category (Pd); that Unicode category lists all possible Unicode hyphens.
You may use the PyPI regex module and its \p{Pd} pattern to match any Unicode hyphen.
Or, if you can only work with re, use
[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]
You may expand this list with other Unicode chars whose Unicode names contain "minus"; see this list.
A blacklisting approach means you specify which chars you do not want to match between the two pairs of digits. If you want to match any non-whitespace character there, you may use \S. If you want to match any punctuation or symbol, use (?:[^\w\s]|_).
Note that the "soft hyphen", U+00AD, is not included in the \p{Pd} category and won't be matched by that construct. To include it, create a character class and add it:
[\xAD\p{Pd}]
[\xAD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]
This is also a possible solution, if your regex engine supports it:
/\p{Dash}/u
This will include all of these characters.
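A quick sketch combining the suggestions above (the sample text is mine; \p{Pd} requires the third-party regex module, and the minus sign U+2212 is added by hand because it belongs to the Sm category, not to Pd):
import regex

text = "12-34 56\u221278 90\xad12"

# Dash_Punctuation, plus the soft hyphen (U+00AD) and the minus
# sign (U+2212), both of which fall outside the Pd category.
print(regex.findall(r"\d\d[\xAD\u2212\p{Pd}]\d\d", text))
# ['12-34', '56−78', '90\xad12']

With stock re, replace \p{Pd} with the explicit character class shown above.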

regex: \w EXCEPT underscore (add to class and then exclude from class)

This question applies to Python 3 regular expressions. I think it might apply to other languages as well.
The question could easily be misunderstood so I'll be careful in describing it.
As background, \w means "a word character." For bytes patterns (or with the re.ASCII flag) Python 3 treats this as just [a-zA-Z0-9_], but if the regular expression is a str, it is Unicode-aware, so \w means "any Unicode word character." This is generally a good thing, as people use different languages, and it would be hard to construct a range like [a-zA-Z0-9_] for all languages at once. \w is therefore most useful in a multilingual setting.
But there is a problem: What if you don't want to match underscores because you don't think they're really a word character (for your particular application)?
If you're only focused on English applications, the best solution is probably to skip \w entirely and just use [a-zA-Z0-9]. But if you're focused on global applications and you don't want underscores, it seems like you might be in a really unfortunate situation. I haven't done it, but I assume it would be really tough to write a range that represents 100 languages at once just so you can avoid that underscore.
So my question is: Is there any way to use \w to match any Unicode word character, but somehow also exclude underscores (or some other undesirable character) from the class? I don't think I've seen anything like this described, but it would be highly useful. Something like [\w^_]. Of course that won't actually work, but what I mean is "use a character class that starts with everything represented by \w, but then go ahead and remove underscores from that class."
Thoughts?
I have two options.
[^\W_]
This is very effective and does exactly what you want. It works by double negation: the class matches any character that is neither a non-word character (\W) nor an underscore, which is precisely \w minus the underscore. It's also straightforward.
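A minimal demonstration with the stock re module (the sample string is my own):
import re

# [^\W_]: not a non-word character and not an underscore,
# i.e. every \w character except the underscore itself.
print(re.findall(r"[^\W_]+", "foo_bar Bäume 日本語"))
# ['foo', 'bar', 'Bäume', '日本語']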
With the regex module you can use set difference: [[\w]--[_]] (or the shorter [\w--_]). Note that you need the "V1" flag set, so you need
r = regex.compile(r"(?V1)[\w--_]")
or
r = regex.compile(r"[\w--_]", flags=regex.V1)
This looks better (readability) IMO if you're familiar with Matthew Barnett's regex module, which is more powerful than Python's stock re.
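For comparison, the same demonstration with the regex module (sample string mine):
import regex

# Set difference, available under the V1 behaviour of the regex
# module: everything in \w, minus the underscore.
r = regex.compile(r"[\w--_]+", flags=regex.V1)
print(r.findall("foo_bar Bäume 日本語"))
# ['foo', 'bar', 'Bäume', '日本語']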

python regex with unicode to match a city name

I have a city name in Unicode, and I want to match it with a regex; it should also work when the name is a plain ASCII string like "New York".
I searched a little and tried what is attached below, but could not figure it out.
I tried the regex "([\u0000-\uFFFF]+)" on this website: http://regex101.com/#python and it works there, but I could not get it working in Python.
Thanks in advance!!
>>> import re
>>> city = u"H\u0101na"
>>> mcity = re.search(r"([\u0000-\uFFFFA-Za-z\s]+)", city, re.U)
>>> mcity.group(0)
u'H'
mcity = re.search(r"([\u0000-\uFFFFA-Za-z\s]+)", city, re.U)
Unlike \x, \u is not a special sequence in Python 2's regex syntax, so it is not interpreted as a Unicode escape; the character class does not contain the code-point range you intended, but a handful of literal characters (including an accidental '0'-'u' range), which is why only the 'H' matched.
To refer to non-ASCII characters in a regex you have to include them as raw characters in a Unicode string, for example as:
mcity=re.search(u"([\u0000-\uFFFFA-Za-z\\s]+)", city, re.U)
(If you don't want to double-backslash the \s, you could also use a ur string, in which \u still works as an escape but the other escapes like \x don't. This is a bit confusing though.)
This character group is redundant: including the range U+0000 to U+FFFF already covers all of A-Za-z\s, and indeed the whole Basic Multilingual Plane including control characters. On a narrow build of Python (including Windows Python 2 builds), where the characters outside the BMP are represented using surrogate pairs in the range U+D800 to U+DFFF, you are actually allowing every single character, so it's not much of a filter. (.+ would be a simpler way of putting it.)
Then again it's pretty difficult to express what might constitute a valid town name in different parts of the world. I'd be tempted to accept anything that, shorn of control characters and leading/trailing whitespace, wasn't an empty string.
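For what it's worth, on Python 3.3+ the re module does recognize \u escapes inside pattern strings, so the original idea works directly there (this snippet is my own):
import re

city = "H\u0101na"
# Since Python 3.3, re understands \u escapes in the pattern itself.
m = re.search(r"[\u0000-\uFFFF]+", city)
print(m.group(0))  # 'Hāna'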

Python regex matching Unicode properties

Perl and some other current regex engines support Unicode properties, such as the category, in a regex. E.g. in Perl you can use \p{Ll} to match an arbitrary lower-case letter, or \p{Zs} for any space separator. I don't see support for this in either the 2.x or the 3.x line of Python (with due regrets). Is anybody aware of a good strategy to get a similar effect? Homegrown solutions are welcome.
The regex module (an alternative to the standard re module) supports Unicode codepoint properties with the \p{} syntax.
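For example (a sketch of mine, assuming pip install regex):
import regex

# \p{Ll}: lower-case letters; \p{Zs}: space separators.
print(regex.findall(r"\p{Ll}+", "Foo BAR baz"))  # ['oo', 'baz']
print(regex.findall(r"\p{Zs}", "a b\u00a0c"))    # [' ', '\xa0']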
Have you tried Ponyguruma, a Python binding to the Oniguruma regular expression engine? In that engine you can simply say \p{Armenian} to match Armenian characters. \p{Ll} or \p{Zs} work too.
You can painstakingly use unicodedata on each character:
import unicodedata

def strip_accents(x):
    # Decompose to NFD, then drop all combining marks (category Mn).
    return u''.join(c for c in unicodedata.normalize('NFD', x)
                    if unicodedata.category(c) != 'Mn')
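Usage (my own example):
>>> print(strip_accents(u'Crème Brûlée'))
Creme Brulee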
Speaking of homegrown solutions, some time ago I wrote a small program to do just that - convert a Unicode category written as \p{...} into a range of values, extracted from the Unicode specification (v5.0.0). Only categories are supported (e.g. L, Zs), and it is restricted to the BMP. I'm posting it here in case someone finds it useful (although Oniguruma really seems a better option).
Example usage:
>>> from unicode_hack import regex
>>> pattern = regex(r'^\p{Lu}(\p{L}|\p{N}|_)*')
>>> print pattern.match(u'疂_1+2').group(0)
疂_1
>>>
Here's the source. There is also a JavaScript version, using the same data.
You're right that Unicode property classes are not supported by Python's built-in re parser.
If you wanted to do a nice hack, one that would be generally useful, you could create a preprocessor that scans a string for such class tokens (\p{M} or whatever) and replaces them with the corresponding character sets, so that, for example, \p{M} would become [\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F], and \P{M} would become [^\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F].
People would thank you. :)
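A toy sketch of that preprocessor idea (the category table below is truncated to the combining-mark ranges mentioned above; a real version would generate it from the Unicode Character Database):
import re

# Minimal, illustrative table: only the M (mark) category, BMP ranges only.
CATEGORY_CLASSES = {
    'M': r'\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F',
}

def expand_properties(pattern):
    # Replace \p{X} with [ranges] and \P{X} with [^ranges].
    # Caveat of this toy version: it mishandles tokens that are
    # already inside a character class.
    def repl(m):
        ranges = CATEGORY_CLASSES.get(m.group(2))
        if ranges is None:
            return m.group(0)  # leave unknown properties untouched
        neg = '^' if m.group(1) == 'P' else ''
        return '[' + neg + ranges + ']'
    return re.sub(r'\\([pP])\{(\w+)\}', repl, pattern)

pat = expand_properties(r'a\p{M}*')
# On Python 3.3+, re itself understands the \u escapes in the result.
print(re.findall(pat, 'a\u0301 b a'))  # ['á', 'a']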
Note that while \p{Ll} has no equivalent in Python regular expressions, \p{Zs} should be covered by '(?u)\s'.
The (?u) inline flag, as the docs say, will "Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database", and \s means any whitespace character.
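A quick check of that equivalence (my own snippet; in Python 3 the Unicode behaviour is already the default for str patterns, so (?u) is redundant there):
import re

# \s with Unicode semantics matches Zs characters such as
# the no-break space (U+00A0) and the em space (U+2003).
print(re.findall(r'(?u)\s', 'a\u00a0b\u2003c'))  # ['\xa0', '\u2003']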
