Python 3 regex with diacritics and ligatures,

Python 3 regex with diacritics and ligatures, - python

Names in the form: Ceasar, Julius are to be split into First_name Julius Surname Ceasar.
Names may contain diacritics (á à é ..), and ligatures (æ, ø)
This code seems to work OK in Python 3.3
import re
def doesmatch(pat, str):
try:
yup = re.search(pat, str)
print('Firstname {0} lastname {1}'.format(yup.group(2), yup.group(1)))
except AttributeError:
print('no match for {0}'.format(str))
s = 'Révèrberë, Harry'
t = 'Åapö, Renée'
u = 'C3po, Robby'
v = 'Mærsk, Efraïm'
w = 'MacDønald, Ron'
x = 'Sträßle, Mpopo'
pat = r'^([^\d\s]+), ([^\d\s]+)'
# matches any letter, diacritic or ligature, but not digits or punctuation inside the ()
for i in s, t, u, v, w, x:
doesmatch(pat, i)
All except u match. (no match for numbers in names), but I wonder if there isn't a better way than the non-digit non-space approach.
More important though: I'd like to refine the pattern so it distinquishes capitals from lowercase letters, but including capital diacritics and ligatures, preferably using regex also. As if ([A-Z][a-z]+), would match accented and combined characters.
Is this possible?
(what I've looked at so far:
Dive into python 3 on UTF-8 vs Unicode; This Regex tutorial on Unicode (which I'm not using); I think I don't need new regex but I admit I haven't read all its documentation)

If you want to distinguish uppercase and lowercase letters using the standard library's re module, then I'm afraid you'll have to build a character class of all the relevant Unicode codepoints manually.
If you don't really need to do this, use
[^\W\d_]
to match any Unicode letter. This character class matches anything that's "not a non-alphanumeric character" (which is the same as "an alphanumeric character") that's also not a digit nor an underscore.

Related

Regex to match capital/special/unicode/vietnamese characters

I'm facing an issue. Indeed, I work with vietnamese texts and I want to find every word containing uppercase(s) (capital letter).
When I use the 're' module, my function (temp) does not catch word like "Đà".
The other way (temp2) is to check each character at a time, it works but it is slow since I have to split the sentences into words.
Hence I would like to know if there is a way of the "re" module to catch all the special capital letter.
I have 2 ways :
def temp(sentence):
return re.findall(r'[a-z]*[A-Z]+[a-z]*', sentence)
lis=word_tokenize(sentence)
def temp2(lis):
proper_noun=[]
for word in lis:
for letter in word:
if letter.isupper():
proper_noun.append(word)
break
return proper_noun
Input:
'nous avons 2 Đồng et 3 Euro'
Expected output :
['Đồng','Euro']
Thank you!

You may use this regex:
\b\S*[AĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊOÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴAĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊOÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴAĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊOÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴAĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊOÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴAĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊOÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴAĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊOÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴA-Z]+\S*\b
Regex Demo

The answer of #Rizwan M.Tuman is correct. I want to share with you the speed of execution of the three functions for 100,000 sentences.
lis=word_tokenize(sentence)
def temp(lis):
proper_noun=[]
for word in lis:
for letter in word:
if letter.isupper():
proper_noun.append(word)
break
return proper_noun
def temp2(sentence):
return re.findall(r'[a-z]*[A-Z]+[a-z]*', sentence)
def temp3(sentence):
return re.findall(capital_letter,sentence)
By this way:
start_time = time.time()
for k in range(100000):
temp2(sentence)
print("%s seconds" % (time.time() - start_time))
Here are the results:
>>Check each character of a list of words if it is a capital letter (.isupper())
(sentence has already been splitted into words)
0.4416656494140625 seconds
>>Function with re module which finds normal capital letters [A-Z] :
0.9373950958251953 seconds
>>Function with re module which finds all kinds of capital letters :
1.0783331394195557 seconds

To match only 1+ letter chunks that contain at least 1 uppercase Unicode letter you may use
import re, sys, unicodedata
pLu = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()]))
p = re.compile(r"[^\W\d_]*{Lu}[^\W\d_]*".format(Lu=pLu))
sentence = 'nous avons 2 Đồng et 3 Ęułro.+++++++++++++++Next line'
print(p.findall(sentence))
# => ['Đồng', 'Ęułro', 'Next']
The pLu is a Unicode letter character class pattern built dynamically using unicodedata. It is dependent on the Python version, use the latest to include as many Unicode uppercase letters as possible (see this answer for more details, too). The [^\W\d_] is a construct matching any Unicode letter. So, the pattern matches any 0+ Unicode letters, followed with at least 1 Unicode uppercase letter, and then having any 0+ Unicode letters.
Note that your original r'[a-z]*[A-Z]+[a-z]*' will only find Next in this input:
print(re.findall(r'[a-z]*[A-Z]+[a-z]*', sentence)) # => ['Next']
See the Python demo
To match the words as whole words, use \b word boundary:
p = re.compile(r"\b[^\W\d_]*{Lu}[^\W\d_]*\b".format(Lu=pLu))
In case you want to use Python 2.x, do not forget to use re.U flag to make the \W, \d and \b Unicode aware. However, it is recommended to use the latest PyPi regex library and its [[:upper:]] / \p{Lu} constructs to match uppercase letters since it will support the up-to-date list of Unicode letters.

Match charactes and whitespaces, but not numbers

I am trying to create a regex that will match characters, whitespaces, but not numbers.
So hello 123 will not match, but hell o will.
I tried this:
[^\d\w]
but, I cannot find a way to add whitespaces here. I have to use \w, because my strings can contain Unicode characters.

Brief
It's unclear what exactly characters refers to, but, assuming you mean alpha characters (based on your input), this regex should work for you.
Code
See regex in use here
^(?:(?!\d)[\w ])+$
Note: This regex uses the mu flags for multiline and Unicode (multiline only necessary if input is separated by newline characters)
Results
Input
ÀÇÆ some words
ÀÇÆ some words 123
Output
This only shows matches
ÀÇÆ some words
Explanation
^ Assert position at the start of the line
(?:(?!\d)[\w ])+ Match the following one or more times (tempered greedy token)
(?!\d) Negative lookahead ensuring what follows doesn't match a digit. You can change this to (?![\d_]) if you want to ensure _ is also not used.
[\w ] Match any word character or space (matches Unicode word characters with u flag)`
$ Assert position at the end of the line

You can use a lookahead:
(?=^\D+$)[\w\s]+
In Python:
import re
strings = ['hello 123', 'hell o']
rx = re.compile(r'(?=^\D+$)[\w\s]+')
new_strings = [string for string in strings if rx.match(string)]
print(new_strings)
# ['hell o']

Python regex for unicode capitalized words

I have a set of words in different languages (English, Polish, Finnish, Russian etc.) and need to check, what of them written with a capital letter.
I tried to use simple regular expression: ^[A-Z], but it matches only latinate letters, then I've added the russian capital letters: ^[A-ZА-Я].
But many unicode letters with diacritics remains. How I can add all capital letters to my regex?
It's possible to make this without enumerations of symbols?
P.S. I know, how to make this in Ruby, but now I'm using Python.

If you need to use a regex, you have 2 options:
Install PyPi regex module and use \p{Lu} or [[:upper:]] (having more uppercase chars in it) class (make sure you have the latest version installed)
Use re with a character class containing all uppercase letter ranges, either using Python utilities (and then the amount of the Unicode letters matched will depend on the Python version, the latest having up-to-date data) or by manually creating/updating the range from the Unicode Utilities CLDR page.
Here is a solution with a regex containing all uppercase letter ranges taken from Unicode Utilities CLDR reference page:
import re
pLu = "[A-Z\u00C0-\u00D6\u00D8-\u00DE\u0100\u0102\u0104\u0106\u0108\u010A\u010C\u010E\u0110\u0112\u0114\u0116\u0118\u011A\u011C\u011E\u0120\u0122\u0124\u0126\u0128\u012A\u012C\u012E\u0130\u0132\u0134\u0136\u0139\u013B\u013D\u013F\u0141\u0143\u0145\u0147\u014A\u014C\u014E\u0150\u0152\u0154\u0156\u0158\u015A\u015C\u015E\u0160\u0162\u0164\u0166\u0168\u016A\u016C\u016E\u0170\u0172\u0174\u0176\u0178\u0179\u017B\u017D\u0181\u0182\u0184\u0186\u0187\u0189-\u018B\u018E-\u0191\u0193\u0194\u0196-\u0198\u019C\u019D\u019F\u01A0\u01A2\u01A4\u01A6\u01A7\u01A9\u01AC\u01AE\u01AF\u01B1-\u01B3\u01B5\u01B7\u01B8\u01BC\u01C4\u01C7\u01CA\u01CD\u01CF\u01D1\u01D3\u01D5\u01D7\u01D9\u01DB\u01DE\u01E0\u01E2\u01E4\u01E6\u01E8\u01EA\u01EC\u01EE\u01F1\u01F4\u01F6-\u01F8\u01FA\u01FC\u01FE\u0200\u0202\u0204\u0206\u0208\u020A\u020C\u020E\u0210\u0212\u0214\u0216\u0218\u021A\u021C\u021E\u0220\u0222\u0224\u0226\u0228\u022A\u022C\u022E\u0230\u0232\u023A\u023B\u023D\u023E\u0241\u0243-\u0246\u0248\u024A\u024C\u024E\u0370\u0372\u0376\u037F\u0386\u0388-\u038A\u038C\u038E\u038F\u0391-\u03A1\u03A3-\u03AB\u03CF\u03D2-\u03D4\u03D8\u03DA\u03DC\u03DE\u03E0\u03E2\u03E4\u03E6\u03E8\u03EA\u03EC\u03EE\u03F4\u03F7\u03F9\u03FA\u03FD-\u042F\u0460\u0462\u0464\u0466\u0468\u046A\u046C\u046E\u0470\u0472\u0474\u0476\u0478\u047A\u047C\u047E\u0480\u048A\u048C\u048E\u0490\u0492\u0494\u0496\u0498\u049A\u049C\u049E\u04A0\u04A2\u04A4\u04A6\u04A8\u04AA\u04AC\u04AE\u04B0\u04B2\u04B4\u04B6\u04B8\u04BA\u04BC\u04BE\u04C0\u04C1\u04C3\u04C5\u04C7\u04C9\u04CB\u04CD\u04D0\u04D2\u04D4\u04D6\u04D8\u04DA\u04DC\u04DE\u04E0\u04E2\u04E4\u04E6\u04E8\u04EA\u04EC\u04EE\u04F0\u04F2\u04F4\u04F6\u04F8\u04FA\u04FC\u04FE\u0500\u0502\u0504\u0506\u0508\u050A\u050C\u050E\u0510\u0512\u0514\u0516\u0518\u051A\u051C\u051E\u0520\u0522\u0524\u0526\u0528\u052A\u052C\u052E\u0531-\u0556\u10A0-\u10C5\u10C7\u10CD\u13A0-\u13F5\u1E00\u1E02\u1E04\u1E06\u1E08\u1E0A\u1E0C\u1E0E\u1E10\u1E12\u1E14\u1E16\u1E18\u1E1A\u1E1C\u1E1E\u1E20\u1E22\u1E24\u1E26\u1E28\u1E2A\u1E2C\u1E2E\u1E30\u1E32\u1E34\u1E36\u1E38\u1E3A\u1E3C\u1E3E\u1E40\u1E42\u1E44\u1E46\u1E48\u1E4A\u1E4C\u1E4E\u1E50\u1E52\u1E54\u1E56\u1E58\u1E5A\u1E5C\u1E5E\u1E60\u1E62\u1E64\u1E66\u1E68\u1E6A\u1E6C\u1E6E\u1E70\u1E72\u1E74\u1E76\u1E78\u1E7A\u1E7C\u1E7E\u1E80\u1E82\u1E84\u1E86\u1E88\u1E8A\u1E8C\u1E8E\u1E90\u1E92\u1E94\u1E9E\u1EA0\u1EA2\u1EA4\u1EA6\u1EA8\u1EAA\u1EAC\u1EAE\u1EB0\u1EB2\u1EB4\u1EB6\u1EB8\u1EBA\u1EBC\u1EBE\u1EC0\u1EC2\u1EC4\u1EC6\u1EC8\u1ECA\u1ECC\u1ECE\u1ED0\u1ED2\u1ED4\u1ED6\u1ED8\u1EDA\u1EDC\u1EDE\u1EE0\u1EE2\u1EE4\u1EE6\u1EE8\u1EEA\u1EEC\u1EEE\u1EF0\u1EF2\u1EF4\u1EF6\u1EF8\u1EFA\u1EFC\u1EFE\u1F08-\u1F0F\u1F18-\u1F1D\u1F28-\u1F2F\u1F38-\u1F3F\u1F48-\u1F4D\u1F59\u1F5B\u1F5D\u1F5F\u1F68-\u1F6F\u1FB8-\u1FBB\u1FC8-\u1FCB\u1FD8-\u1FDB\u1FE8-\u1FEC\u1FF8-\u1FFB\u2102\u2107\u210B-\u210D\u2110-\u2112\u2115\u2119-\u211D\u2124\u2126\u2128\u212A-\u212D\u2130-\u2133\u213E\u213F\u2145\u2160-\u216F\u2183\u24B6-\u24CF\u2C00-\u2C2E\u2C60\u2C62-\u2C64\u2C67\u2C69\u2C6B\u2C6D-\u2C70\u2C72\u2C75\u2C7E-\u2C80\u2C82\u2C84\u2C86\u2C88\u2C8A\u2C8C\u2C8E\u2C90\u2C92\u2C94\u2C96\u2C98\u2C9A\u2C9C\u2C9E\u2CA0\u2CA2\u2CA4\u2CA6\u2CA8\u2CAA\u2CAC\u2CAE\u2CB0\u2CB2\u2CB4\u2CB6\u2CB8\u2CBA\u2CBC\u2CBE\u2CC0\u2CC2\u2CC4\u2CC6\u2CC8\u2CCA\u2CCC\u2CCE\u2CD0\u2CD2\u2CD4\u2CD6\u2CD8\u2CDA\u2CDC\u2CDE\u2CE0\u2CE2\u2CEB\u2CED\u2CF2\uA640\uA642\uA644\uA646\uA648\uA64A\uA64C\uA64E\uA650\uA652\uA654\uA656\uA658\uA65A\uA65C\uA65E\uA660\uA662\uA664\uA666\uA668\uA66A\uA66C\uA680\uA682\uA684\uA686\uA688\uA68A\uA68C\uA68E\uA690\uA692\uA694\uA696\uA698\uA69A\uA722\uA724\uA726\uA728\uA72A\uA72C\uA72E\uA732\uA734\uA736\uA738\uA73A\uA73C\uA73E\uA740\uA742\uA744\uA746\uA748\uA74A\uA74C\uA74E\uA750\uA752\uA754\uA756\uA758\uA75A\uA75C\uA75E\uA760\uA762\uA764\uA766\uA768\uA76A\uA76C\uA76E\uA779\uA77B\uA77D\uA77E\uA780\uA782\uA784\uA786\uA78B\uA78D\uA790\uA792\uA796\uA798\uA79A\uA79C\uA79E\uA7A0\uA7A2\uA7A4\uA7A6\uA7A8\uA7AA-\uA7AE\uA7B0-\uA7B4\uA7B6\uFF21-\uFF3A\U00010400-\U00010427\U000104B0-\U000104D3\U00010C80-\U00010CB2\U000118A0-\U000118BF\U0001D400-\U0001D419\U0001D434-\U0001D44D\U0001D468-\U0001D481\U0001D49C\U0001D49E\U0001D49F\U0001D4A2\U0001D4A5\U0001D4A6\U0001D4A9-\U0001D4AC\U0001D4AE-\U0001D4B5\U0001D4D0-\U0001D4E9\U0001D504\U0001D505\U0001D507-\U0001D50A\U0001D50D-\U0001D514\U0001D516-\U0001D51C\U0001D538\U0001D539\U0001D53B-\U0001D53E\U0001D540-\U0001D544\U0001D546\U0001D54A-\U0001D550\U0001D56C-\U0001D585\U0001D5A0-\U0001D5B9\U0001D5D4-\U0001D5ED\U0001D608-\U0001D621\U0001D63C-\U0001D655\U0001D670-\U0001D689\U0001D6A8-\U0001D6C0\U0001D6E2-\U0001D6FA\U0001D71C-\U0001D734\U0001D756-\U0001D76E\U0001D790-\U0001D7A8\U0001D7CA\U0001E900-\U0001E921\U0001F130-\U0001F149\U0001F150-\U0001F169\U0001F170-\U0001F189]"
p = re.compile(pLu)
if p.match("Żółw"):
print("Capitalized!")
See the IDEONE demo. To make it work in Python 2.x, make sure you add u prefix to the string literals.
There are other ways to get the Unicode upper-case letter character class in Python using unicodedata and sys packages like
# Python 3
pLu = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()]))
# Python 2
pLu = u'[{}]'.format(u"".join([unichr(i) for i in xrange(sys.maxunicode) if unichr(i).isupper()]))
However, this range does not match all uppercase letters displayed at the Unicode Utilities: UnicodeSet page for [:upper:] POSIX character class.
Cf.:
Python 2.7 len([unichr(i) for i in xrange(sys.maxunicode) if unichr(i).isupper()]) displays 1427
Python 3.5 len([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()]) shows 1751
Python 3.6 len([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()]) shows 1822
Current Unicode Utilities CLDR page displays 1,822 uppercase letters for [:upper:] class, and 1,702 for the \p{Lu}.
With PyPi regex module, it is simpler:
import regex
p = regex.compile(r"\p{Lu}") # To support (currently) 1702 uppercase letters
# p = regex.compile(r"[[:upper:]]") # To support (currently) 1822 uppercase letters
if p.match("Żółw"):
print("Capitalized!")
In Python 2.x you should use:
p = regex.compile(ur"\p{Lu}")
p = regex.compile(ur"[[:upper:]]")
or
p = regex.compile(r"\p{Lu}", regex.U)
p = regex.compile(r"[[:upper:]]", regex.U)

re is excessive when you can just use word[0].isupper().
>>>> 'żółć'[0].isupper()
False
>>>> 'Żółw'[0].isupper()
True
>>>> 'ćMa'[0].isupper()
False

>>> words
'Does not match Äh Oh Äi Üs üx Öjjj'
>>> re.findall(r"(\b[A-ZÜÖÄ][a-z.-]+\b)", words, re.UNICODE)
['Does', 'Äh', 'Oh', 'Äi', 'Üs', 'Öjjj']
Just add to the list all the Unicode letters which are not in the range A-Z, I added the german umlauts only.
You can find all non ASCII letters (A-Z) like this:
>>> [c for c in words if not c.isalpha() and not c.isdigit() and not c.isspace()]
['\xc3', '\x84', '\xc3', '\x84', '\xc3', '\x9c', '\xc3', '\xbc', '\xc3', '\x96']
Now you'll have to figure which are capitals.

You can compare string with its capitalized version to tell if it's capitalized or not:
>>>> s = 'żółć Żółw ćMa'
>>>> l = s.split()
>>>> [word for word in l if word == word.capitalize()]
['Żółw']
>>>> frozenset(l).intersection(s.capitalize() for s in l)
frozenset({'Żółw'})
Note that in Python2 you need to use unicode strings for it to work correctly.

You should try \W and \w :
Example in pythex.org

Replacing punctuation except intra-word dashes with a space

There already is an approaching answer in R gsub("[^[:alnum:]['-]", " ", my_string), but it does not work in Python:
my_string = 'compactified on a calabi-yau threefold # ,.'
re.sub("[^[:alnum:]['-]", " ", my_string)
gives 'compactified on a calab yau threefold # ,.'
So not only does it remove the intra-word dash, it also removes the last letter of the word preceding the dash. And it does not remove punctuation
Expected result (string without any punctuation but intra-word dash): 'compactified on a calabi-yau threefold'

R uses TRE (POSIX) or PCRE regex engine depending on the perl option (or function used). Python uses a modified, much poorer Perl-like version as re library. Python does not support POSIX character classes, as [:alnum:] that matches alpha (letters) and num (digits).
In Python, [:alnum:] can be replaced with [^\W_] (or ASCII only [a-zA-Z0-9]) and the negated [^[:alnum:]] - with [\W_] ([^a-zA-Z0-9] ASCII only version).
The [^[:alnum:]['-] matches any 1 symbol other than alphanumeric (letter or digit), [, ', or -. That means the R question you refer to does not provide a correct answer.
You can use the following solution:
import re
p = re.compile(r"(\b[-']\b)|[\W_]")
test_str = "No - d'Ante compactified on a calabi-yau threefold # ,."
result = p.sub(lambda m: (m.group(1) if m.group(1) else " "), test_str)
print(result)
The (\b[-']\b)|[\W_] regex matches and captures intraword - and ' and we restore them in the re.sub by checking if the capture group matched and re-inserting it with m.group(1), and the rest (all non-word characters and underscores) are just replaced with a space.
If you want to remove sequences of non-word characters with one space, use
p = re.compile(r"(\b[-']\b)|[\W_]+")

How to check unicode string to be letters, spaces and dashes only

I've read a great solution for unicode strings here, but I need to check entire string to be letters or spaces or dashes and I can't think of any solution. The example is not working as I want.
name = u"Василий Соловьев-Седой"
r = re.compile(r'^([\s\-^\W\d_]+)$', re.U)
r.match(name) -> None

r = re.compile(r'^(?:[^\W\d_]|[\s-])+$', re.U)
[^\W\d_] matches any letter (by matching any alphanumeric character except for digits and underscore).
[\s-] of course matches whitespace and dashes.

if you ONLY want to check:
name = u"Василий Соловьев-Седой";
name = name.replace("-","").replace(" ","");
name.isalpha()

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python 3 regex with diacritics and ligatures, - python

Related

Regex to match capital/special/unicode/vietnamese characters

Match charactes and whitespaces, but not numbers

Python regex for unicode capitalized words

Replacing punctuation except intra-word dashes with a space

How to check unicode string to be letters, spaces and dashes only

Categories

Resources