Regex to match capital/special/unicode/vietnamese characters - python

I work with Vietnamese texts and want to find every word that contains at least one uppercase (capital) letter.
When I use the 're' module, my function (temp) does not catch words like "Đà".
The other approach (temp2) checks one character at a time. It works, but it is slow because I have to split the sentences into words first.
Hence I would like to know whether the "re" module can catch all the special capital letters.
I have two approaches:
def temp(sentence):
    return re.findall(r'[a-z]*[A-Z]+[a-z]*', sentence)
lis=word_tokenize(sentence)
def temp2(lis):
    proper_noun=[]
    for word in lis:
        for letter in word:
            if letter.isupper():
                proper_noun.append(word)
                break
    return proper_noun
Input:
'nous avons 2 Đồng et 3 Euro'
Expected output :
['Đồng','Euro']
Thank you!

You may use this regex:
\b\S*[AĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊOÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴAĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊOÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴAĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊOÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴAĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊOÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴAĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊOÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴAĂÂÁẮẤÀẰẦẢẲẨÃẴẪẠẶẬĐEÊÉẾÈỀẺỂẼỄẸỆIÍÌỈĨỊOÔƠÓỐỚÒỒỜỎỔỞÕỖỠỌỘỢUƯÚỨÙỪỦỬŨỮỤỰYÝỲỶỸỴA-Z]+\S*\b
Regex Demo

The answer by @Rizwan M.Tuman is correct. I want to share the execution times of the three functions over 100,000 sentences.
lis=word_tokenize(sentence)
def temp(lis):
    proper_noun=[]
    for word in lis:
        for letter in word:
            if letter.isupper():
                proper_noun.append(word)
                break
    return proper_noun
def temp2(sentence):
    return re.findall(r'[a-z]*[A-Z]+[a-z]*', sentence)
def temp3(sentence):
    # capital_letter is assumed to hold the pattern string from the answer above
    return re.findall(capital_letter,sentence)
I timed each function this way:
start_time = time.time()
for k in range(100000):
    temp2(sentence)
print("%s seconds" % (time.time() - start_time))
Here are the results:
>>Check each character of a list of words if it is a capital letter (.isupper())
(the sentence has already been split into words)
0.4416656494140625 seconds
>>Function with re module which finds normal capital letters [A-Z] :
0.9373950958251953 seconds
>>Function with re module which finds all kinds of capital letters :
1.0783331394195557 seconds
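For more repeatable numbers, the standard-library timeit module is a common alternative to a hand-rolled time.time() loop. The sketch below is only an illustration under assumptions: str.split() stands in for word_tokenize, the helper names by_isupper and by_ascii_regex are hypothetical, and absolute timings will differ from machine to machine.

```python
import re
import timeit

sentence = "nous avons 2 Đồng et 3 Euro"
words = sentence.split()  # assumption: whitespace split instead of word_tokenize

ASCII_CAPS = re.compile(r"[a-z]*[A-Z]+[a-z]*")  # the ASCII-only pattern from the question

def by_isupper(word_list):
    """Return the words that contain at least one uppercase character."""
    return [w for w in word_list if any(ch.isupper() for ch in w)]

def by_ascii_regex(text):
    """Return ASCII-only matches; accented capitals such as Đ are not covered."""
    return ASCII_CAPS.findall(text)

if __name__ == "__main__":
    candidates = [
        ("isupper", lambda: by_isupper(words)),
        ("ascii regex", lambda: by_ascii_regex(sentence)),
    ]
    for label, call in candidates:
        seconds = timeit.timeit(call, number=100_000)
        print(f"{label}: {seconds:.3f} seconds")
```

Pre-compiling the pattern keeps the regex compilation out of the timed loop, so the comparison is closer to the matching cost alone.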

To match only 1+ letter chunks that contain at least 1 uppercase Unicode letter you may use
import re, sys
pLu = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()]))
p = re.compile(r"[^\W\d_]*{Lu}[^\W\d_]*".format(Lu=pLu))
sentence = 'nous avons 2 Đồng et 3 Ęułro.+++++++++++++++Next line'
print(p.findall(sentence))
# => ['Đồng', 'Ęułro', 'Next']
The pLu is a character class pattern for uppercase Unicode letters, built dynamically from every code point for which str.isupper() returns True. It depends on the Python version, so use the latest to include as many Unicode uppercase letters as possible (see this answer for more details, too). The [^\W\d_] construct matches any Unicode letter. So, the pattern matches any 0+ Unicode letters, followed by at least 1 Unicode uppercase letter, and then any 0+ Unicode letters.
Note that your original r'[a-z]*[A-Z]+[a-z]*' will only find Next in this input:
print(re.findall(r'[a-z]*[A-Z]+[a-z]*', sentence)) # => ['Next']
See the Python demo
To match the words as whole words, use \b word boundary:
p = re.compile(r"\b[^\W\d_]*{Lu}[^\W\d_]*\b".format(Lu=pLu))
In case you want to use Python 2.x, do not forget to use the re.U flag to make \W, \d and \b Unicode aware. However, it is recommended to use the latest PyPI regex library and its [[:upper:]] / \p{Lu} constructs to match uppercase letters, since it supports the up-to-date list of Unicode letters.
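As a variation on the idea above, the uppercase class can also be built from the Unicode general category Lu via the standard-library unicodedata module. This is a sketch rather than the answer's own method: the names uppercase_chars, CAPITALIZED_WORD and words_with_uppercase are hypothetical, and the NFC normalization step is an added assumption so that accented letters arrive as single code points.

```python
import re
import sys
import unicodedata

# Collect every code point whose Unicode general category is Lu (uppercase letter).
uppercase_chars = "".join(
    chr(cp) for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Lu"
)

# Whole words made of letters only, containing at least one Lu character.
CAPITALIZED_WORD = re.compile(
    r"\b[^\W\d_]*[{}][^\W\d_]*\b".format(uppercase_chars)
)

def words_with_uppercase(text):
    # NFC composes base letters with their combining marks where possible.
    return CAPITALIZED_WORD.findall(unicodedata.normalize("NFC", text))

print(words_with_uppercase("nous avons 2 Đồng et 3 Euro"))
```

Building the class once at import time takes a moment because it scans the whole code point range, so it is worth keeping the compiled pattern at module level.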

Related

Python regex match string of 8 characters that contain both alphabets and numbers

I am trying to match a string of length 8 containing both digits and letters (it cannot have just digits or just letters) using re.findall. The string can start with either a letter or a digit, followed by any combination.
e.g.-
Input String: The reference number is 896av6uf and not 87987647 or ahduhsjs or hn0.
Output: ['896av6uf','a96bv6u0']
I came up with this regex r'([a-z]+[\d]+[\w]*|[\d]+[a-z]+[\w]*)'; however, it also gives me strings with fewer than 8 characters.
I need to modify the regex to return strings with exactly 8 chars that contain both letters and digits.
You can use
\b(?=[a-zA-Z]*[0-9])(?=[0-9]*[a-zA-Z])[a-zA-Z0-9]{8}\b
\b(?=[^\W\d_]*\d)(?=\d*[^\W\d_])[^\W_]{8}\b
The first one only supports ASCII letters, while the second one supports all Unicode letters and digits since [^\W\d_] matches any Unicode letter and \d matches any Unicode digit (as the re.UNICODE option is used by default in Python 3.x).
Details:
\b - a word boundary
(?=[a-zA-Z]*[0-9]) - after any 0+ ASCII letters, there must be a digit
(?=[0-9]*[a-zA-Z]) - after any 0+ digits, there must be an ASCII letter
[a-zA-Z0-9]{8} - eight ASCII alphanumeric chars
\b - a word boundary
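A quick way to check both patterns is to run them on the sentence from the question. The sketch below assumes Python 3 (so \d and \W are Unicode aware by default); the constant names are hypothetical.

```python
import re

# ASCII-only version and Unicode-aware version of the same idea.
ASCII_CODE = re.compile(r"\b(?=[a-zA-Z]*[0-9])(?=[0-9]*[a-zA-Z])[a-zA-Z0-9]{8}\b")
UNICODE_CODE = re.compile(r"\b(?=[^\W\d_]*\d)(?=\d*[^\W\d_])[^\W_]{8}\b")

text = "The reference number is 896av6uf and not 87987647 or ahduhsjs or hn0."
print(ASCII_CODE.findall(text))    # ['896av6uf']
print(UNICODE_CODE.findall(text))  # ['896av6uf']
```

The all-digit and all-letter tokens are rejected by the lookaheads, and hn0 is rejected by the {8} length requirement.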
First, let's write a pattern that finds words made of lowercase letters and digits that are 8 characters long:
\b[a-z\d]{8}\b
The next condition is that the word must contain both letters and numbers:
[a-z]\d
Now for the challenging part, combining these into one statement. The easiest way might be to just split them up, but we can use a lookahead to get this to work:
\b(?=.*[a-z]\d)[a-z\d]{8}\b
I'm sure there is a tidier way of doing this, but this will work.
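One thing worth testing with this pattern: the lookahead (?=.*[a-z]\d) is not confined to the current word, and it needs a letter directly followed by a digit. The sketch below compares it with a tightened variant; STRICT is a hypothetical alternative written for illustration, not part of the answer.

```python
import re

# Pattern from the text above.
LOOSE = re.compile(r"\b(?=.*[a-z]\d)[a-z\d]{8}\b")
# Hypothetical variant: both lookaheads stay inside the current word.
STRICT = re.compile(r"\b(?=[a-z]*\d)(?=\d*[a-z])[a-z\d]{8}\b")

# The "x1" later in the string satisfies the loose lookahead for "abcdefgh".
print(LOOSE.findall("abcdefgh x1"))   # ['abcdefgh']
print(STRICT.findall("abcdefgh x1"))  # []

# A trailing letter is never followed by a digit, so the loose pattern misses it.
print(LOOSE.findall("1234567a"))      # []
print(STRICT.findall("1234567a"))     # ['1234567a']
```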
You can use \b\w{8}\b
It does not guarantee that you will have both digits AND letters, but does guarantee that you will have exactly eight characters, surrounded by word boundaries (e.g. whitespace, start/end of line).
You can try it in one of the online playgrounds such as this one: https://regex101.com/
The meat of the matching is done with \w{8}, which means 8 word characters (letters of either case, digits and underscore). \b means "word boundary".
If you want only digits and lowercase letters, replace this by \b[a-z0-9]{8}\b
You can then further check for existence of both digits AND letter, e.g. by using filter:
list(filter(lambda s: re.search(r'[0-9]', s) and re.search(r'[a-z]', s), result))
result is what you get from re.findall() .
So bottom line, I would use:
list(filter(lambda s: re.search(r'[0-9]', s) and re.search(r'[a-z]', s), re.findall(r'\b[a-z0-9]{8}\b', str)))
A more compact solution than others have suggested is this:
((?![A-Za-z]{8}|[0-9]{8})[0-9A-Za-z]{8})
This guarantees that the found matches are 8 characters in length and that they cannot be only digits or only letters.
Breakdown:
(?![A-Za-z]{8}|[0-9]{8}) = This is a negative lookahead that means the match can't be a string of 8 digits or 8 letters.
[0-9A-Za-z]{8} = Simple regex saying the input needs to be alphanumeric of 8 characters in length.
Test Case:
Input: 12345678 abcdefgh i8D0jT5Yu6Ms1GNmrmaUjicc1s9D93aQBj3WWWjww54gkiKqOd7Ytkl0MliJy9xadAgcev8b2UKdfGRDOpxRPm30dw9GeEz3WPRO 1234567890987654321 qwertyuiopasdfghjklzxcvbnm
import re
pattern = re.compile(r'((?![A-Za-z]{8}|\d{8})[A-Za-z\d]{8})')
test = input()
match = pattern.findall(test)
print(match)
Output: ['i8D0jT5Y', 'u6Ms1GNm', 'maUjicc1', 's9D93aQB', 'j3WWWjww', '54gkiKqO', 'd7Ytkl0M', 'liJy9xad', 'Agcev8b2', 'DOpxRPm3', '0dw9GeEz']
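As the test case shows, this pattern has no word boundaries, so it also returns 8-character windows cut out of longer runs. If only whole 8-character tokens are wanted, one possible adjustment is to add \b anchors, as in this sketch (WHOLE_TOKEN is a hypothetical name, and the anchored form is an assumption about the intended behaviour):

```python
import re

# Same negative-lookahead idea, but anchored so the whole token must be 8 chars.
WHOLE_TOKEN = re.compile(r"\b(?![A-Za-z]{8}\b|[0-9]{8}\b)[0-9A-Za-z]{8}\b")

text = "The reference number is 896av6uf and not 87987647 or ahduhsjs or hn0."
print(WHOLE_TOKEN.findall(text))                # ['896av6uf']
print(WHOLE_TOKEN.findall("i8D0jT5Yu6Ms1GNm"))  # []
```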

What is the regex to match the words containing all the vowels?

I am learning regex in Python but can't seem to get the hang of it. I am trying to filter out all the words containing all the vowels in English, and this is my regex:
r'\b(\S*[aeiou]){5}\b'
It seems to be too vague, since any vowel (even a repeated one) can appear at any place and any number of times, so it returns words like 'actionable' and 'unfortunate', which do contain 5 vowels but not all the vowels. I looked around the internet and found this regex:
r'[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*'
But as it appears, it only handles vowels appearing in sequence, a much more limited task than the one I am trying to accomplish. Can someone 'think out loud' while crafting the regex for the problem that I have?
If you plan to match words as chunks of text only consisting of English letters you may use a regex like
\b(?=\w*?a)(?=\w*?e)(?=\w*?i)(?=\w*?o)(?=\w*?u)[a-zA-Z]+\b
See the regex demo
To support languages other than English, you may replace [a-zA-Z]+ with [^\W\d_]+.
If a "word" you want to match is a chunk of non-whitespace chars you may use
(?<!\S)(?=\S*?a)(?=\S*?e)(?=\S*?i)(?=\S*?o)(?=\S*?u)\S+
See this regex demo.
Define these patterns in Python using raw string literals, e.g.:
rx_AllVowelWords = r'\b(?=\w*?a)(?=\w*?e)(?=\w*?i)(?=\w*?o)(?=\w*?u)[a-zA-Z]+\b'
Details
\b(?=\w*?a)(?=\w*?e)(?=\w*?i)(?=\w*?o)(?=\w*?u)[a-zA-Z]+\b:
\b - a word boundary, here, a starting word boundary
(?=\w*?a)(?=\w*?e)(?=\w*?i)(?=\w*?o)(?=\w*?u) - a sequence of positive lookaheads that are triggered right after the word boundary position is detected, and require the presence of a, e, i, o and u after any 0+ word chars (letters, digits, underscores - you may replace \w*? with [^\W\d_]*? to only check letters)
[a-zA-Z]+ - 1 or more ASCII letters (replace with [^\W\d_]+ to match all letters)
\b - a word boundary, here, a trailing word boundary
The second pattern details:
(?<!\S)(?=\S*?a)(?=\S*?e)(?=\S*?i)(?=\S*?o)(?=\S*?u)\S+:
(?<!\S) - a position at the start of the string or after a whitespace
(?=\S*?a)(?=\S*?e)(?=\S*?i)(?=\S*?o)(?=\S*?u) - all English vowels must be present - in any order - after any 0+ chars other than whitespace
\S+ - 1+ non-whitespace chars.
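To see the first pattern in action, here is a small sketch. The re.IGNORECASE flag is an added assumption (the patterns above only look for lowercase vowels), and ALL_VOWELS is a hypothetical name.

```python
import re

ALL_VOWELS = re.compile(
    r"\b(?=\w*?a)(?=\w*?e)(?=\w*?i)(?=\w*?o)(?=\w*?u)[a-zA-Z]+\b",
    re.IGNORECASE,  # assumption: uppercase vowels should count as well
)

text = "Education is unfortunate but sequoia and actionable are words"
print(ALL_VOWELS.findall(text))  # ['Education', 'sequoia']
```

Here 'unfortunate' is skipped because it has no i, and 'actionable' because it has no u, which are exactly the false positives from the question.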
I can't think of an easy way to find "words with all vowels" with a single regexp, but it can easily be done by anding-together regex matches to a, e, i, o, and u separately. For example, something like the following Python script should determine whether a given English word has all vowels (in any order, any multiplicity) or not:
#! /usr/bin/python3
# all-vowels.py
import sys
import re
if len(sys.argv) != 2: sys.exit()
word=sys.argv[1]
if re.search(r'a', word) and re.search(r'e', word) and re.search(r'i', word) and re.search(r'o', word) and re.search(r'u', word):
    print("Word has all vowels!")
else:
    print("Word does NOT have all vowels.")

Find every two (non-overlapping) vowels inbetween consonants

Task
You are given a string. It consists of alphanumeric characters, spaces and symbols (+, -).
Your task is to find all the substrings of the original string that contain two or more vowels.
Also, these substrings must lie between consonants and should contain vowels only.
Input Format: a single line of input containing the string.
Output Format: print the matched substrings in their order of occurrence on separate lines. If no match is found, print -1.
Sample Input:
rabcdeefgyYhFjkIoomnpOeorteeeeet
Sample Output:
ee
Ioo
Oeo
eeeee
The challenge above was taken from https://www.hackerrank.com/challenges/re-findall-re-finditer
The following code passes all the test cases:
import re
sol = re.findall(r"[^aiueo]([aiueoAIUEO]{2,})(?=[^aiueo])", input())
if sol:
    for s in sol:
        print(s)
else:
    print(-1)
The following doesn't.
import re
sol = re.findall(r"[^aiueo]([aiueoAIUEO]{2,})[^aiueo]", input())
if sol:
    for s in sol:
        print(s)
else:
    print(-1)
The only difference between them is the final bit of the regex. I can't understand why the second code fails. I would argue that ?= is useless because, by grouping [aiueoAIUEO]{2,}, I'm already excluding the trailing character from the capture, but obviously I'm wrong and I can't tell why.
Any help?
The lookahead approach allows the consonant that ends one sequence of vowels to start the next sequence, whereas the non-lookahead approach requires at least two consonants between those sequences (one to end a sequence, another to start the next, as both are matched).
See
import re
print(re.findall(r'[^aiueo]([aiueoAIUEO]{2,})(?=[^aiueo])', 'moomoom'))
print(re.findall(r'[^aiueo]([aiueoAIUEO]{2,})[^aiueo]', 'moomoom'))
Which will output
['oo', 'oo']
['oo']
https://ideone.com/2Wn1TS
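Looking at the match spans makes the difference visible: the lookahead checks the trailing consonant without consuming it, so the next match can start on that same character. A minimal sketch with the same two patterns (the variable names are hypothetical):

```python
import re

with_lookahead = re.compile(r"[^aiueo]([aiueoAIUEO]{2,})(?=[^aiueo])")
without_lookahead = re.compile(r"[^aiueo]([aiueoAIUEO]{2,})[^aiueo]")

text = "moomoom"
# span() covers everything the match consumed, not just the captured group.
lookahead_spans = [m.span() for m in with_lookahead.finditer(text)]
consuming_spans = [m.span() for m in without_lookahead.finditer(text)]

print(lookahead_spans)  # [(0, 3), (3, 6)]
print(consuming_spans)  # [(0, 4)]
```

In the second case the middle m at index 3 is already part of the first match, so nothing is left to open the second vowel run.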
To be a bit picky, neither attempt is strictly correct with regard to your problem description. They allow uppercase vowels, spaces and symbols to act as separators. You might want to use [b-df-hj-np-tv-z] instead of [^aiueo] and pass flags=re.I.
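A sketch of that suggestion, using an explicit consonant class together with flags=re.I (CONSONANT, VOWEL_RUN and vowel_runs are hypothetical names):

```python
import re

CONSONANT = "[b-df-hj-np-tv-z]"
# A consonant, a captured run of 2+ vowels, then a consonant that is not consumed.
VOWEL_RUN = re.compile(
    CONSONANT + r"([aeiou]{2,})(?=" + CONSONANT + ")", flags=re.I
)

def vowel_runs(s):
    return VOWEL_RUN.findall(s)

print(vowel_runs("rabcdeefgyYhFjkIoomnpOeorteeeeet"))  # ['ee', 'Ioo', 'Oeo', 'eeeee']
print(vowel_runs("x ee+oo y"))                         # []
```

With this class, spaces and symbols no longer count as separators, so the second call finds nothing.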
Here's an alternative solution that doesn't require using the special () characters for grouping, relying instead on a "positive lookbehind assertion" with (?<=...) RE syntax:
import re
sol=re.findall(r"(?<=[QWRTYPSDFGHJKLZXCVBNMqwrtypsdfghjklzxcvbnm])[AEIOUaeiou]{2,}(?=[QWRTYPSDFGHJKLZXCVBNMqwrtypsdfghjklzxcvbnm])", input())
if sol:
    print(*sol, sep="\n")
else:
    print(-1)
You can use the re.IGNORECASE or re.I flag to make the match case-insensitive. Also, you can exclude vowels, digits, whitespace, and the + and - characters mentioned in the problem from the characters allowed around a match.
import re
vowels = re.findall(r"[^aeiou\d\s+-]([aeiou]{2,})(?=[^aeiou\d\s+-])", input(), re.I)
if len(vowels):
    for vowel in vowels:
        print(vowel)
else:
    print("-1")

Python regular expression to find letters and numbers

Given an input string, I used 'findall' to find words that consist only of letters and numbers (the number of words to be found is not specified).
I wrote:
words = re.findall(r"\w*\s", x) # x is the input string
If I enter "asdf1234 cdef11dfe a = 1 b = 2"
the string is separated into asdf1234, cdef11dfe, a =, 1, b =, 2
I would like to pick out only asdf1234, cdef11dfe
How do I write the regular expression?
Try /[a-zA-Z0-9]{2,}/.
This looks for any alphanumeric character ([a-zA-Z0-9]) at least 2 times in a row ({2,}). That would be the only way to filter out the one-letter words of the string.
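Applied to the input from the question, this could look like the sketch below (ALNUM_WORD is a hypothetical name). Note that the range is written A-Z; a range such as A-z would also admit the ASCII punctuation that sits between Z and a, including the underscore.

```python
import re

# Runs of two or more ASCII letters or digits.
ALNUM_WORD = re.compile(r"[a-zA-Z0-9]{2,}")

x = "asdf1234 cdef11dfe a = 1 b = 2"
print(ALNUM_WORD.findall(x))  # ['asdf1234', 'cdef11dfe']

# For comparison: the wider A-z range lets underscores through.
print(re.findall(r"[A-z]{2,}", "a__b"))  # ['a__b']
```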
The problem with \w is that it includes underscores.
This one should work : (?<![\"=\w])(?:[^\W_]+)(?![\"=\w])
Explanation
(?:[^\W_]+) One or more characters that are neither non-word characters nor underscores, i.e. letters and digits (non-capturing group)
(?<![\"=\w]) not preceded by ", = or a word character
(?![\"=\w]) not followed by ", = or a word character
Sample code:
import re
regex = r"(?<![\"=\w])(?:[^\W_]+)(?![\"=\w])"
test_str = "a01a b02 c03 e dfdfd abcdef=2 b=3 e=4 c=\"a b\" aaa=2f f=\"asdf 12af\""
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches):
    print (match.group())

Replacing punctuation except intra-word dashes with a space

There is already a close answer for R, gsub("[^[:alnum:]['-]", " ", my_string), but it does not work in Python:
my_string = 'compactified on a calabi-yau threefold # ,.'
re.sub("[^[:alnum:]['-]", " ", my_string)
gives 'compactified on a calab yau threefold # ,.'
So not only does it remove the intra-word dash, it also removes the last letter of the word preceding the dash. It also does not remove the punctuation.
Expected result (string without any punctuation but intra-word dash): 'compactified on a calabi-yau threefold'
R uses the TRE (POSIX) or PCRE regex engine depending on the perl option (or the function used). Python's re library uses a modified, more limited Perl-like flavor. Python does not support POSIX character classes such as [:alnum:], which matches alpha (letters) and num (digits).
In Python, [:alnum:] can be replaced with [^\W_] (or ASCII only [a-zA-Z0-9]) and the negated [^[:alnum:]] - with [\W_] ([^a-zA-Z0-9] ASCII only version).
The [^[:alnum:]['-] matches any 1 symbol other than alphanumeric (letter or digit), [, ', or -. That means the R question you refer to does not provide a correct answer.
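As a sketch of why the same pattern misbehaves in Python: re does not treat [:alnum:] as a class name, so, as far as I can tell, [^[:alnum:] is read as a negated set of the literal characters [, :, a, l, n, u, m, and the following ['-] becomes a second set containing ' and -. The pattern therefore matches two characters at a time, which is why "i-" disappears. Recent Python versions emit a FutureWarning for the nested-looking set, so the example silences it; the behaviour could change in a future release.

```python
import re
import warnings

with warnings.catch_warnings():
    # Python warns about a "possible nested set" for this pattern.
    warnings.simplefilter("ignore", FutureWarning)
    posix_style = re.compile(r"[^[:alnum:]['-]")

first = posix_style.search("calabi-yau")
print(first.group())                        # i-
print(posix_style.sub(" ", "calabi-yau"))   # calab yau
```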
You can use the following solution:
import re
p = re.compile(r"(\b[-']\b)|[\W_]")
test_str = "No - d'Ante compactified on a calabi-yau threefold # ,."
result = p.sub(lambda m: (m.group(1) if m.group(1) else " "), test_str)
print(result)
The (\b[-']\b)|[\W_] regex matches and captures intraword - and ' and we restore them in the re.sub by checking if the capture group matched and re-inserting it with m.group(1), and the rest (all non-word characters and underscores) are just replaced with a space.
If you want to remove sequences of non-word characters with one space, use
p = re.compile(r"(\b[-']\b)|[\W_]+")
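Wrapped in a function, the multi-character version can be tried on the string from the question. This is a sketch under assumptions: strip_punctuation is a hypothetical helper, and the final .strip() is added here only because the expected result has no trailing space.

```python
import re

PUNCT = re.compile(r"(\b[-']\b)|[\W_]+")

def strip_punctuation(text):
    # Keep an intra-word dash or apostrophe (group 1);
    # replace every other run of non-word characters with one space.
    return PUNCT.sub(lambda m: m.group(1) or " ", text).strip()

print(strip_punctuation("compactified on a calabi-yau threefold # ,."))
# compactified on a calabi-yau threefold
print(strip_punctuation("No - d'Ante"))
# No d'Ante
```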
