Python regex for unicode capitalized words - python

I have a set of words in different languages (English, Polish, Finnish, Russian etc.) and need to check which of them are written with a capital letter.
I tried a simple regular expression, ^[A-Z], but it matches only Latin letters, so I added the Russian capital letters: ^[A-ZА-Я].
But many Unicode letters with diacritics remain. How can I add all capital letters to my regex?
Is it possible to do this without enumerating the symbols?
P.S. I know how to do this in Ruby, but now I'm using Python.

If you need to use a regex, you have two options:
Install the PyPI regex module and use the \p{Lu} or [[:upper:]] class (the latter matches more uppercase characters); make sure you have the latest version installed.
Use re with a character class containing all uppercase letter ranges. Build it either with Python utilities (the number of Unicode letters matched then depends on the Python version, the latest having the most up-to-date data) or by manually creating and updating the ranges from the Unicode Utilities CLDR page.
Here is a solution with a regex containing all uppercase letter ranges taken from Unicode Utilities CLDR reference page:
import re
pLu = "[A-Z\u00C0-\u00D6\u00D8-\u00DE\u0100\u0102\u0104\u0106\u0108\u010A\u010C\u010E\u0110\u0112\u0114\u0116\u0118\u011A\u011C\u011E\u0120\u0122\u0124\u0126\u0128\u012A\u012C\u012E\u0130\u0132\u0134\u0136\u0139\u013B\u013D\u013F\u0141\u0143\u0145\u0147\u014A\u014C\u014E\u0150\u0152\u0154\u0156\u0158\u015A\u015C\u015E\u0160\u0162\u0164\u0166\u0168\u016A\u016C\u016E\u0170\u0172\u0174\u0176\u0178\u0179\u017B\u017D\u0181\u0182\u0184\u0186\u0187\u0189-\u018B\u018E-\u0191\u0193\u0194\u0196-\u0198\u019C\u019D\u019F\u01A0\u01A2\u01A4\u01A6\u01A7\u01A9\u01AC\u01AE\u01AF\u01B1-\u01B3\u01B5\u01B7\u01B8\u01BC\u01C4\u01C7\u01CA\u01CD\u01CF\u01D1\u01D3\u01D5\u01D7\u01D9\u01DB\u01DE\u01E0\u01E2\u01E4\u01E6\u01E8\u01EA\u01EC\u01EE\u01F1\u01F4\u01F6-\u01F8\u01FA\u01FC\u01FE\u0200\u0202\u0204\u0206\u0208\u020A\u020C\u020E\u0210\u0212\u0214\u0216\u0218\u021A\u021C\u021E\u0220\u0222\u0224\u0226\u0228\u022A\u022C\u022E\u0230\u0232\u023A\u023B\u023D\u023E\u0241\u0243-\u0246\u0248\u024A\u024C\u024E\u0370\u0372\u0376\u037F\u0386\u0388-\u038A\u038C\u038E\u038F\u0391-\u03A1\u03A3-\u03AB\u03CF\u03D2-\u03D4\u03D8\u03DA\u03DC\u03DE\u03E0\u03E2\u03E4\u03E6\u03E8\u03EA\u03EC\u03EE\u03F4\u03F7\u03F9\u03FA\u03FD-\u042F\u0460\u0462\u0464\u0466\u0468\u046A\u046C\u046E\u0470\u0472\u0474\u0476\u0478\u047A\u047C\u047E\u0480\u048A\u048C\u048E\u0490\u0492\u0494\u0496\u0498\u049A\u049C\u049E\u04A0\u04A2\u04A4\u04A6\u04A8\u04AA\u04AC\u04AE\u04B0\u04B2\u04B4\u04B6\u04B8\u04BA\u04BC\u04BE\u04C0\u04C1\u04C3\u04C5\u04C7\u04C9\u04CB\u04CD\u04D0\u04D2\u04D4\u04D6\u04D8\u04DA\u04DC\u04DE\u04E0\u04E2\u04E4\u04E6\u04E8\u04EA\u04EC\u04EE\u04F0\u04F2\u04F4\u04F6\u04F8\u04FA\u04FC\u04FE\u0500\u0502\u0504\u0506\u0508\u050A\u050C\u050E\u0510\u0512\u0514\u0516\u0518\u051A\u051C\u051E\u0520\u0522\u0524\u0526\u0528\u052A\u052C\u052E\u0531-\u0556\u10A0-\u10C5\u10C7\u10CD\u13A0-\u13F5\u1E00\u1E02\u1E04\u1E06\u1E08\u1E0A\u1E0C\u1E0E\u1E10\u1E12\u1E14\u1E16\u1E18\u1E1A\u1E1C\u1E1E\u1E20\u1E22\u1E24\u1E26\u1E28\u1E2A\u1E2C\u1E2
E\u1E30\u1E32\u1E34\u1E36\u1E38\u1E3A\u1E3C\u1E3E\u1E40\u1E42\u1E44\u1E46\u1E48\u1E4A\u1E4C\u1E4E\u1E50\u1E52\u1E54\u1E56\u1E58\u1E5A\u1E5C\u1E5E\u1E60\u1E62\u1E64\u1E66\u1E68\u1E6A\u1E6C\u1E6E\u1E70\u1E72\u1E74\u1E76\u1E78\u1E7A\u1E7C\u1E7E\u1E80\u1E82\u1E84\u1E86\u1E88\u1E8A\u1E8C\u1E8E\u1E90\u1E92\u1E94\u1E9E\u1EA0\u1EA2\u1EA4\u1EA6\u1EA8\u1EAA\u1EAC\u1EAE\u1EB0\u1EB2\u1EB4\u1EB6\u1EB8\u1EBA\u1EBC\u1EBE\u1EC0\u1EC2\u1EC4\u1EC6\u1EC8\u1ECA\u1ECC\u1ECE\u1ED0\u1ED2\u1ED4\u1ED6\u1ED8\u1EDA\u1EDC\u1EDE\u1EE0\u1EE2\u1EE4\u1EE6\u1EE8\u1EEA\u1EEC\u1EEE\u1EF0\u1EF2\u1EF4\u1EF6\u1EF8\u1EFA\u1EFC\u1EFE\u1F08-\u1F0F\u1F18-\u1F1D\u1F28-\u1F2F\u1F38-\u1F3F\u1F48-\u1F4D\u1F59\u1F5B\u1F5D\u1F5F\u1F68-\u1F6F\u1FB8-\u1FBB\u1FC8-\u1FCB\u1FD8-\u1FDB\u1FE8-\u1FEC\u1FF8-\u1FFB\u2102\u2107\u210B-\u210D\u2110-\u2112\u2115\u2119-\u211D\u2124\u2126\u2128\u212A-\u212D\u2130-\u2133\u213E\u213F\u2145\u2160-\u216F\u2183\u24B6-\u24CF\u2C00-\u2C2E\u2C60\u2C62-\u2C64\u2C67\u2C69\u2C6B\u2C6D-\u2C70\u2C72\u2C75\u2C7E-\u2C80\u2C82\u2C84\u2C86\u2C88\u2C8A\u2C8C\u2C8E\u2C90\u2C92\u2C94\u2C96\u2C98\u2C9A\u2C9C\u2C9E\u2CA0\u2CA2\u2CA4\u2CA6\u2CA8\u2CAA\u2CAC\u2CAE\u2CB0\u2CB2\u2CB4\u2CB6\u2CB8\u2CBA\u2CBC\u2CBE\u2CC0\u2CC2\u2CC4\u2CC6\u2CC8\u2CCA\u2CCC\u2CCE\u2CD0\u2CD2\u2CD4\u2CD6\u2CD8\u2CDA\u2CDC\u2CDE\u2CE0\u2CE2\u2CEB\u2CED\u2CF2\uA640\uA642\uA644\uA646\uA648\uA64A\uA64C\uA64E\uA650\uA652\uA654\uA656\uA658\uA65A\uA65C\uA65E\uA660\uA662\uA664\uA666\uA668\uA66A\uA66C\uA680\uA682\uA684\uA686\uA688\uA68A\uA68C\uA68E\uA690\uA692\uA694\uA696\uA698\uA69A\uA722\uA724\uA726\uA728\uA72A\uA72C\uA72E\uA732\uA734\uA736\uA738\uA73A\uA73C\uA73E\uA740\uA742\uA744\uA746\uA748\uA74A\uA74C\uA74E\uA750\uA752\uA754\uA756\uA758\uA75A\uA75C\uA75E\uA760\uA762\uA764\uA766\uA768\uA76A\uA76C\uA76E\uA779\uA77B\uA77D\uA77E\uA780\uA782\uA784\uA786\uA78B\uA78D\uA790\uA792\uA796\uA798\uA79A\uA79C\uA79E\uA7A0\uA7A2\uA7A4\uA7A6\uA7A8\uA7AA-\uA7AE\uA7B0-\uA7B4\uA7B6\uFF21-\uFF3A\U00010400-\U00010427\U000104B0-\U000104D3\U00010C80-\
U00010CB2\U000118A0-\U000118BF\U0001D400-\U0001D419\U0001D434-\U0001D44D\U0001D468-\U0001D481\U0001D49C\U0001D49E\U0001D49F\U0001D4A2\U0001D4A5\U0001D4A6\U0001D4A9-\U0001D4AC\U0001D4AE-\U0001D4B5\U0001D4D0-\U0001D4E9\U0001D504\U0001D505\U0001D507-\U0001D50A\U0001D50D-\U0001D514\U0001D516-\U0001D51C\U0001D538\U0001D539\U0001D53B-\U0001D53E\U0001D540-\U0001D544\U0001D546\U0001D54A-\U0001D550\U0001D56C-\U0001D585\U0001D5A0-\U0001D5B9\U0001D5D4-\U0001D5ED\U0001D608-\U0001D621\U0001D63C-\U0001D655\U0001D670-\U0001D689\U0001D6A8-\U0001D6C0\U0001D6E2-\U0001D6FA\U0001D71C-\U0001D734\U0001D756-\U0001D76E\U0001D790-\U0001D7A8\U0001D7CA\U0001E900-\U0001E921\U0001F130-\U0001F149\U0001F150-\U0001F169\U0001F170-\U0001F189]"
p = re.compile(pLu)
if p.match("Żółw"):
    print("Capitalized!")
See the IDEONE demo. To make it work in Python 2.x, make sure you add u prefix to the string literals.
There are other ways to get the Unicode uppercase letter character class in Python, for example by scanning all code points with the sys module and str.isupper():
import sys
# Python 3
pLu = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()]))
# Python 2
pLu = u'[{}]'.format(u"".join([unichr(i) for i in xrange(sys.maxunicode) if unichr(i).isupper()]))
However, this range does not match all uppercase letters displayed at the Unicode Utilities: UnicodeSet page for [:upper:] POSIX character class.
Cf.:
Python 2.7 len([unichr(i) for i in xrange(sys.maxunicode) if unichr(i).isupper()]) displays 1427
Python 3.5 len([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()]) shows 1751
Python 3.6 len([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()]) shows 1822
Current Unicode Utilities CLDR page displays 1,822 uppercase letters for [:upper:] class, and 1,702 for the \p{Lu}.
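The gap between the two totals comes from characters that carry the Unicode Uppercase property without being in the Lu category (Roman numerals and circled letters, for example). A minimal sketch, assuming Python 3, that compares the two sets on whatever Unicode version the interpreter ships; the names upper_prop and lu_category are illustrative only:

```python
import sys
import unicodedata

# Every code point for which str.isupper() is true (the Uppercase property).
upper_prop = {chr(i) for i in range(sys.maxunicode + 1) if chr(i).isupper()}

# Every code point whose general category is Lu (uppercase letter).
lu_category = {chr(i) for i in range(sys.maxunicode + 1)
               if unicodedata.category(chr(i)) == 'Lu'}

# The totals depend on the Unicode data bundled with this Python version.
print(len(upper_prop), len(lu_category))

# Characters that isupper() accepts but a Lu-only class would not.
extra = sorted(upper_prop - lu_category)
print([unicodedata.name(c, '?') for c in extra[:5]])
```

The exact numbers will differ between Python releases, which is why the counts listed above vary.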
With the PyPI regex module, it is simpler:
import regex
p = regex.compile(r"\p{Lu}") # To support (currently) 1702 uppercase letters
# p = regex.compile(r"[[:upper:]]") # To support (currently) 1822 uppercase letters
if p.match("Żółw"):
    print("Capitalized!")
In Python 2.x you should use:
p = regex.compile(ur"\p{Lu}")
p = regex.compile(ur"[[:upper:]]")
or
p = regex.compile(r"\p{Lu}", regex.U)
p = regex.compile(r"[[:upper:]]", regex.U)
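One way to combine the two options is to prefer the regex module when it is installed and fall back to a generated class otherwise. This is only a sketch: UPPER_START and starts_with_upper are hypothetical names, and the two branches are not strictly equivalent, since \p{Lu} and str.isupper() cover slightly different sets, as the counts above show.

```python
import re
import sys

try:
    import regex  # third-party module from PyPI, optional here
    UPPER_START = regex.compile(r"\p{Lu}")
except ImportError:
    # Fallback: build the class from str.isupper(); coverage follows the
    # Unicode data bundled with the running Python version.
    upper_chars = "".join(chr(i) for i in range(sys.maxunicode + 1)
                          if chr(i).isupper())
    UPPER_START = re.compile("[{}]".format(upper_chars))


def starts_with_upper(word):
    """Return True if the first character of word is an uppercase letter."""
    return UPPER_START.match(word) is not None


print([w for w in ["Żółw", "żółć", "Москва", "talo"] if starts_with_upper(w)])
```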

The re module is overkill when you can just use word[0].isupper().
>>> 'żółć'[0].isupper()
False
>>> 'Żółw'[0].isupper()
True
>>> 'ćMa'[0].isupper()
False
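A small wrapper makes the edge cases explicit. The sketch below assumes Python 3 strings; starts_upper is a hypothetical helper name. Note that titlecase digraphs such as U+01C5 are not reported as uppercase by isupper().

```python
def starts_upper(word):
    """True if word is non-empty and its first character is uppercase."""
    # The emptiness check avoids an IndexError on "".
    return bool(word) and word[0].isupper()


words = ["Żółw", "żółć", "ćMa", "Москва", ""]
print([w for w in words if starts_upper(w)])
```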

>>> words
'Does not match Äh Oh Äi Üs üx Öjjj'
>>> re.findall(r"(\b[A-ZÜÖÄ][a-z.-]+\b)", words, re.UNICODE)
['Does', 'Äh', 'Oh', 'Äi', 'Üs', 'Öjjj']
Just add to the character class all the Unicode letters that are outside the A-Z range; I added only the German umlauts.
You can find all the non-ASCII characters (here in a Python 2 byte string, so each one shows up as its UTF-8 bytes) like this:
>>> [c for c in words if not c.isalpha() and not c.isdigit() and not c.isspace()]
['\xc3', '\x84', '\xc3', '\x84', '\xc3', '\x9c', '\xc3', '\xbc', '\xc3', '\x96']
Now you'll have to figure out which ones are capitals.
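In Python 3, where strings are Unicode by default, the same search can ask each character directly whether it is an uppercase letter outside ASCII. A minimal sketch; non_ascii_capitals is an illustrative name:

```python
def non_ascii_capitals(text):
    """Return the distinct non-ASCII uppercase characters in text, sorted."""
    return sorted({c for c in text if ord(c) > 127 and c.isupper()})


words = 'Does not match Äh Oh Äi Üs üx Öjjj'
print(non_ascii_capitals(words))
```

The characters it returns are the ones that would need to be added to the [A-Z...] class.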

You can compare a string with its capitalized version to tell whether it is capitalized:
>>> s = 'żółć Żółw ćMa'
>>> l = s.split()
>>> [word for word in l if word == word.capitalize()]
['Żółw']
>>> frozenset(l).intersection(s.capitalize() for s in l)
frozenset({'Żółw'})
Note that in Python 2 you need to use unicode strings for this to work correctly.
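One caveat with this comparison: capitalize() also lowercases everything after the first character, so words with inner capitals are rejected. A short sketch of the difference, assuming Python 3 strings and a made-up word list:

```python
words = ['Żółw', 'żółć', 'McDonald', 'NASA']

# Strict check: first letter upper, all remaining cased letters lower.
strict = [w for w in words if w == w.capitalize()]

# Looser check: only the first character has to be uppercase.
loose = [w for w in words if w[:1].isupper()]

print(strict)
print(loose)
```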

You should try \W and \w:
Example in pythex.org

Related

How to find word with subscript?

Input: s = "test1 this is a sample subscript o₁"
I've tried: re.compile(r'\b[^\W\d_]{2,}\b').findall(s)
It finds words that have at least 2 characters and contain no digits:
'this', 'is', 'sample', 'subscript', 'o₁',
but the result still includes the word with the subscript number.
Is there a way to exclude words that contain a subscript?
Desired output: 'this', 'is', 'sample', 'subscript'
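The point is that the Unicode-aware \d in Python 3 regex matches only decimal digits (the Nd category); it does not match characters from the No (other number) category, such as the subscript ₁.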
If you need to work with ASCII only letter words, use
r'\b[a-zA-Z]{2,}\b'
Or, make the pattern non-Unicode aware by using re.A / re.ASCII flag:
re.compile(r'\b[^\W\d_]{2,}\b', re.A)
See this Python 3 demo.
If you need to work with any Unicode letters, you can fix it either by adding all the No characters to the negated character class (which is tedious) or by adding a programmatic check after a match is found to see whether it contains any character from the No category.
See this Python 3 demo:
import re, sys, unicodedata
s = "test1 this is a sample subscript o₁"
No = [chr(i) for i in range(sys.maxunicode) if unicodedata.category(chr(i)) == 'No']
print([x for x in re.findall(r'\b[^\W\d_]{2,}\b', s) if not any(y in x for y in No)])
# => ['this', 'is', 'sample', 'subscript']
Make sure you are using the latest Python version to support the latest Unicode standard, or rely on the PyPI regex module:
import regex
p = regex.compile(r"\b\p{L}{2,}\b")
print(p.findall(s))
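Instead of materializing a list of every No character, the post-match check can inspect only the characters of each candidate. A hedged sketch using the standard library alone; letter_words is a hypothetical helper, and it keeps a candidate only if every character is in a letter category (L*):

```python
import re
import unicodedata

CANDIDATE = re.compile(r'\b[^\W\d_]{2,}\b')


def letter_words(text):
    """Return runs of 2+ word characters that consist of letters only."""
    return [w for w in CANDIDATE.findall(text)
            if all(unicodedata.category(c).startswith('L') for c in w)]


s = "test1 this is a sample subscript o₁"
print(letter_words(s))
```

This avoids scanning the whole code point range up front, at the cost of one category lookup per matched character.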

Regex to match capital/special/unicode/vietnamese characters

I'm facing an issue: I work with Vietnamese texts and I want to find every word containing at least one uppercase (capital) letter.
When I use the re module, my function (temp) does not catch words like "Đà".
The other way (temp2) is to check one character at a time; it works, but it is slow since I have to split the sentences into words.
Hence I would like to know whether there is a way to make the re module catch all the special capital letters.
I have two approaches:
def temp(sentence):
    return re.findall(r'[a-z]*[A-Z]+[a-z]*', sentence)
lis=word_tokenize(sentence)
def temp2(lis):
    proper_noun=[]
    for word in lis:
        for letter in word:
            if letter.isupper():
                proper_noun.append(word)
                break
    return proper_noun
Input:
'nous avons 2 Đồng et 3 Euro'
Expected output :
['Đồng','Euro']
Thank you!
You may use this regex:
\b\S*[AĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊOÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴA-Z]+\S*\b
Regex Demo
@Rizwan M.Tuman's answer is correct. I want to share the execution speed of the three functions for 100,000 sentences.
lis=word_tokenize(sentence)
def temp(lis):
    proper_noun=[]
    for word in lis:
        for letter in word:
            if letter.isupper():
                proper_noun.append(word)
                break
    return proper_noun
def temp2(sentence):
    return re.findall(r'[a-z]*[A-Z]+[a-z]*', sentence)
def temp3(sentence):
    # capital_letter: presumably the regex pattern from the answer above
    return re.findall(capital_letter,sentence)
Each function was timed this way:
import time
start_time = time.time()
for k in range(100000):
    temp2(sentence)
print("%s seconds" % (time.time() - start_time))
Here are the results:
Checking each character of a list of words with .isupper() (the sentence has already been split into words):
0.4416656494140625 seconds
Function with the re module that finds plain capital letters [A-Z]:
0.9373950958251953 seconds
Function with the re module that finds all kinds of capital letters:
1.0783331394195557 seconds
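For a more repeatable measurement, the standard timeit module can run the comparison. The sketch below is not the benchmark used above: it uses str.split() as a stand-in for word_tokenize, builds the uppercase class from str.isupper(), and the helper names are made up for illustration.

```python
import re
import sys
import timeit

# Character class of every code point that str.isupper() accepts.
UPPER = '[{}]'.format(''.join(chr(i) for i in range(sys.maxunicode + 1)
                              if chr(i).isupper()))
WORD_WITH_UPPER = re.compile(r'[^\W\d_]*{Lu}[^\W\d_]*'.format(Lu=UPPER))


def by_isupper(words):
    """Keep the words that contain at least one uppercase character."""
    return [w for w in words if any(c.isupper() for c in w)]


def by_regex(text):
    """Find letter runs that contain at least one uppercase letter."""
    return WORD_WITH_UPPER.findall(text)


sentence = 'nous avons 2 Đồng et 3 Euro'
words = sentence.split()  # stand-in for word_tokenize(sentence)

print(timeit.timeit(lambda: by_isupper(words), number=10000))
print(timeit.timeit(lambda: by_regex(sentence), number=10000))
```

Absolute timings depend on the machine and the Python version, so only the relative order is meaningful.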
To match only 1+ letter chunks that contain at least 1 uppercase Unicode letter you may use
import re, sys, unicodedata
pLu = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()]))
p = re.compile(r"[^\W\d_]*{Lu}[^\W\d_]*".format(Lu=pLu))
sentence = 'nous avons 2 Đồng et 3 Ęułro.+++++++++++++++Next line'
print(p.findall(sentence))
# => ['Đồng', 'Ęułro', 'Next']
The pLu is a Unicode uppercase letter character class built dynamically with str.isupper() over every code point up to sys.maxunicode. It depends on the Python version, so use the latest to include as many Unicode uppercase letters as possible (see this answer for more details, too). The [^\W\d_] construct matches any Unicode letter. So the pattern matches zero or more Unicode letters, followed by at least one Unicode uppercase letter, followed by zero or more Unicode letters.
Note that your original r'[a-z]*[A-Z]+[a-z]*' will only find Next in this input:
print(re.findall(r'[a-z]*[A-Z]+[a-z]*', sentence)) # => ['Next']
See the Python demo
To match the words as whole words, use \b word boundary:
p = re.compile(r"\b[^\W\d_]*{Lu}[^\W\d_]*\b".format(Lu=pLu))
If you want to use Python 2.x, do not forget the re.U flag to make \W, \d and \b Unicode-aware. However, it is recommended to use the latest PyPI regex library and its [[:upper:]] / \p{Lu} constructs to match uppercase letters, since it supports an up-to-date list of Unicode letters.

Python Regex Picking "not include" word

I am trying to find words in the string that do not contain any "a" characters. I wrote the code below, but it does not work. How can I tell the regex "do not include"? Can't I use the "^" sign as "not"?
import re
string2 = "asfdba12312sssdr1 12şljş1 kf"
t = re.findall(r'([^a]\w*) | \w*[^a] ', string2 )
print(t)
The result of that code is "['sfdba12312sssdr1', '12şljş1']"
You need to use a regex with word boundaries with a re.UNICODE flag:
r = re.compile(ur'\b[^\Wa]+\b', re.UNICODE)
The \W and \b will become Unicode aware then.
See the regex demo
[^\Wa] matches any Unicode letter, digit or underscore, but not a. Add the re.I flag to make it case-insensitive.
If you do not want to match words with digits, add \d to the char class: [^\W\da].
See Python demo:
# -*- coding: utf-8 -*-
import re
p = re.compile(ur'\b[^\Wa]+\b', re.UNICODE)
s = u"asfdba12312sssdr1 12şljş1 kf"
res = [x.encode('utf8') for x in p.findall(s)]
print(res)
[^a] is the single non-a character. [^a]\w* is a single non-a character followed by any number of word-characters. Note that a space is a non-a character, and word-characters can also include a...
The easiest and most intuitive way to do this in Python is not to use re.findall at all:
[word for word in string2.split() if 'a' not in word]
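Note that the ur'' prefix is Python 2 only. In Python 3, where str patterns are Unicode-aware by default, a sketch of both approaches could look like this (the function names are illustrative):

```python
import re


def words_without_a_regex(text):
    """Whole words made of word characters other than 'a'."""
    return re.findall(r'\b[^\Wa]+\b', text)


def words_without_a_split(text):
    """Whitespace-separated chunks that do not contain 'a'."""
    return [word for word in text.split() if 'a' not in word]


string2 = "asfdba12312sssdr1 12şljş1 kf"
print(words_without_a_regex(string2))
print(words_without_a_split(string2))
```

The two agree on this input, but they can differ when punctuation is attached to a word: split() keeps it in the chunk, while the regex stops at the word boundary.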

Python Regex - checking for a capital letter with a lowercase after

I am trying to check for a capital letter that has a lowercase letter coming directly after it. The trick is that there is going to be a bunch of garbage capital letters and numbers coming directly before it. For example:
AASKH317298DIUANFProgramming is fun
As you can see, there is a bunch of stuff we don't need coming directly before the phrase we do need, Programming is fun.
I am trying to use regex to do this by taking each string and replacing the garbage prefix with '', as the original string does not have to be kept.
re.sub(r'^[A-Z0-9]*', '', string)
The problem with this code is that it leaves us with rogramming is fun, as the P is a capital letter.
How would I go about making sure that, if the next letter is lowercase, the capital is left untouched (the P in Programming)?
Use a negative look-ahead:
re.sub(r'^[A-Z0-9]*(?![a-z])', '', string)
This matches the leading run of uppercase characters and digits, backtracking so that the match never ends right before a lowercase character.
Demo:
>>> import re
>>> string = 'AASKH317298DIUANFProgramming is fun'
>>> re.sub(r'^[A-Z0-9]*(?![a-z])', '', string)
'Programming is fun'
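The negative look-ahead strips everything when no capital-plus-lowercase pair follows. If the prefix should be removed only when such a pair is actually present, a positive look-ahead is one alternative. A sketch, with strip_prefix as a hypothetical helper and 'ABC123' as a made-up second input:

```python
import re

NEGATIVE = re.compile(r'^[A-Z0-9]*(?![a-z])')
POSITIVE = re.compile(r'^[A-Z0-9]*(?=[A-Z][a-z])')


def strip_prefix(pattern, text):
    """Remove the leading garbage matched by pattern from text."""
    return pattern.sub('', text)


for text in ['AASKH317298DIUANFProgramming is fun', 'ABC123']:
    print(repr(strip_prefix(NEGATIVE, text)), repr(strip_prefix(POSITIVE, text)))
```

Which behaviour is preferable depends on whether an all-garbage string should come back empty or unchanged.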
You can also use match like this:
>>> import re
>>> s = 'AASKH317298DIUANFProgramming is fun'
>>> r = r'^.*([A-Z][a-z].*)$'
>>> m = re.match(r, s)
>>> if m:
... print(m.group(1))
...
Programming is fun
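Because the leading .* is greedy, this pattern returns the last capital-plus-lowercase pair when several are present. Making it lazy keeps the first one. A small sketch with a made-up input:

```python
import re

s = 'XYZHello World'

# Greedy: .* grabs as much as possible, so the group starts at the last pair.
greedy = re.match(r'^.*([A-Z][a-z].*)$', s)

# Lazy: .*? grabs as little as possible, so the group starts at the first pair.
lazy = re.match(r'^.*?([A-Z][a-z].*)$', s)

print(greedy.group(1))  # World
print(lazy.group(1))    # Hello World
```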

Python 3 regex with diacritics and ligatures

Names in the form: Ceasar, Julius are to be split into First_name Julius Surname Ceasar.
Names may contain diacritics (á à é ..), and ligatures (æ, ø)
This code seems to work OK in Python 3.3
import re
def doesmatch(pat, str):
    try:
        yup = re.search(pat, str)
        print('Firstname {0} lastname {1}'.format(yup.group(2), yup.group(1)))
    except AttributeError:
        print('no match for {0}'.format(str))
s = 'Révèrberë, Harry'
t = 'Åapö, Renée'
u = 'C3po, Robby'
v = 'Mærsk, Efraïm'
w = 'MacDønald, Ron'
x = 'Sträßle, Mpopo'
pat = r'^([^\d\s]+), ([^\d\s]+)'
# matches anything except digits and whitespace inside the (), so letters with diacritics and ligatures pass
for i in s, t, u, v, w, x:
    doesmatch(pat, i)
All except u match (no match for numbers in names), but I wonder whether there is a better way than the non-digit, non-space approach.
More importantly, I'd like to refine the pattern so it distinguishes capitals from lowercase letters, including capital diacritics and ligatures, preferably also using regex; as if ([A-Z][a-z]+), would match accented and combined characters.
Is this possible?
(What I've looked at so far:
Dive into Python 3 on UTF-8 vs Unicode; this Regex tutorial on Unicode (which I'm not using); I think I don't need the new regex module, but I admit I haven't read all its documentation.)
If you want to distinguish uppercase and lowercase letters using the standard library's re module, then I'm afraid you'll have to build a character class of all the relevant Unicode codepoints manually.
If you don't really need to do this, use
[^\W\d_]
to match any Unicode letter. This character class matches anything that's "not a non-alphanumeric character" (which is the same as "an alphanumeric character") that's also not a digit nor an underscore.
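A middle ground is to let re find the letters and use str.isupper() for the case check. The sketch below assumes names of the form "Surname, Firstname" with no spaces or hyphens inside each part; split_name is a hypothetical helper, not part of any library:

```python
import re

# Two runs of Unicode letters separated by a comma and a space.
NAME = re.compile(r'^([^\W\d_]+), ([^\W\d_]+)$')


def split_name(text):
    """Return (first_name, surname), or None if text does not qualify."""
    m = NAME.match(text)
    if m is None:
        return None
    surname, first_name = m.groups()
    # re has no \p{Lu}, so the capital-letter check is done in Python.
    if not (surname[0].isupper() and first_name[0].isupper()):
        return None
    return first_name, surname


for name in ['Mærsk, Efraïm', 'C3po, Robby', 'mærsk, Efraïm']:
    print(name, '->', split_name(name))
```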
