Regex for a third-person verb - python

I'm trying to create a regex that matches a third person form of a verb created using the following rule:
If the verb ends in e not preceded by i,o,s,x,z,ch,sh, add s.
So I'm looking for a regex matching a word consisting of some letters, then not i,o,s,x,z,ch,sh, and then "es". I tried this:
\b\w*[^iosxz(sh)(ch)]es\b
According to regex101 it matches "likes", "hates" etc. However, it does not match "bathes". Why not?

You may use
\b(?=\w*(?<![iosxz])(?<![cs]h)es\b)\w*
See the regex demo
Since Python re does not support variable length alternatives in a lookbehind, you need to split the conditions into two lookbehinds here.
Pattern details:
\b - a leading word boundary
(?=\w*(?<![iosxz])(?<![cs]h)es\b) - a positive lookahead requiring a sequence of:
\w* - 0+ word chars
(?<![iosxz]) - there must not be i, o, s, x, z chars right before the current location and...
(?<![cs]h) - no ch or sh right before the current location...
es - followed with es...
\b - at the end of the word
\w* - zero or more (maybe + is better here to match 1 or more) word chars.
See Python demo:
import re
r = re.compile(r'\b(?=\w*(?<![iosxz])(?<![cs]h)es\b)\w*')
s = 'it matches "likes", "hates" etc. However, it does not match "bathes", why doesn\'t it?'
print(re.findall(r, s))

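If you want to match an e that is not preceded by i, o, s, x, z, ch or sh, the idea is a negative lookbehind such as:
(?<!i|o|s|x|z|ch|sh)e
Note that Python's re rejects this form because the alternatives have different lengths; there it has to be split as (?<![iosxz])(?<![cs]h)e.
Your regex [^iosxz(sh)(ch)] is a single negated character class: the ^ negates it, and every other character inside the brackets, including the parentheses, is taken literally. Duplicates are ignored, so it's equivalent to:
[^iosxzch()]
which actually means: "match any one character that's not one of i, o, s, x, z, c, h, ( or )". That is why "bathes" fails: the h right before es is excluded.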
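A quick way to see the difference, as a minimal sketch: the original negated class is compared with a pattern that uses two fixed-width lookbehinds. The sample word list and the variable names are illustrative only.

```python
import re

# The original attempt: parentheses inside [...] are literal characters,
# so this class excludes i, o, s, x, z, c, h, "(" and ")".
naive = re.compile(r'\b\w*[^iosxz(sh)(ch)]es\b')

# Two fixed-width lookbehinds, which Python's re accepts.
fixed = re.compile(r'\b(?=\w*(?<![iosxz])(?<![cs]h)es\b)\w+')

words = "likes hates bathes watches goes"
naive_hits = naive.findall(words)
fixed_hits = fixed.findall(words)
print(naive_hits)  # "bathes" is missing: its h is in the excluded set
print(fixed_hits)  # "bathes" is found; "watches" and "goes" are still rejected
```

The lookbehind version only rejects h when it is part of ch or sh, which is what the rule asks for.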

Related

Regex - Regular expression for counting no.of digits between alphabets

Need to construct a regular expression that counts numbers between alphabets.
schowalte3rguss77ie85 - 2
xyz1zyx - 1
x1y1z1 - 2
I have constructed this, but it doesn't work for case 3.
[[a-z]+[0-9]+[a-z]]*
Any help would be appreciated. Thanks in advance.
Use this regex:
(?<=[a-z])\d+(?=[a-z])
Demo: https://regex101.com/r/tpss6x/1
[Javascript]
If you want a count only, the last part should be a lookahead assertion.
If you want to also match uppercase chars, you can make the pattern case insensitive.
[a-z]\d+(?=[a-z])
Explanation
[a-z] Match a single char a-z
\d+ Match 1+ digits
(?=[a-z]) Positive lookahead, assert a char a-z to the right
Regex demo
You can use
(?<=[^\W\d_])\d+(?=[^\W\d_])
See the regex demo. If you want to only support ASCII letters, replace [^\W\d_] (that matches any Unicode letter) with [a-zA-Z].
Details:
(?<=[^\W\d_]) - immediately before the current location, there must be any Unicode letter
\d+ - one or more digits
(?=[^\W\d_]) - immediately after the current location, there must be any Unicode letter.
Counting can be done with len(...), see this Python demo:
import re
text = "schowalte3rguss77ie85"
matches = re.findall(r'(?<=[^\W\d_])\d+(?=[^\W\d_])', text)
print(len(matches)) # => 2
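As a sketch of how the counting works on all three examples from the question (the helper name count_digit_runs is hypothetical): because the lookbehind and lookahead do not consume the letters, the y in x1y1z1 can serve as the right neighbour of the first digit and the left neighbour of the second.

```python
import re

# Digit runs that have a letter a-z (case-insensitive) directly on both sides.
DIGITS_BETWEEN_LETTERS = re.compile(r'(?<=[a-z])\d+(?=[a-z])', re.IGNORECASE)

def count_digit_runs(text):
    """Return how many digit runs in text are enclosed by letters."""
    return len(DIGITS_BETWEEN_LETTERS.findall(text))

for sample in ("schowalte3rguss77ie85", "xyz1zyx", "x1y1z1"):
    print(sample, count_digit_runs(sample))
```

A pattern that consumes the surrounding letters, such as [a-z]\d+[a-z], would find only one match in x1y1z1, since the y would already be used up by the first match.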

regex: don't match number preceded by certain character

Following code extracts the first sequence of numbers that appear in a string:
num = re.findall(r'^\D*(\d+)', string)
I'd like to change the regular expression so that it doesn't match numbers preceded by v or V.
Example:
string = 'foobarv2_34 423_wd'
Output: '34'
If you need to get the first match, you need to use re.search, not re.findall.
In this case, you can use a simpler regular expression like (?<!v)\d+ with re.I:
import re
m = re.search(r'(?<!v)\d+', 'foobarv2_34 423_wd', re.I)
if m:
    print(m.group()) # => 34
See the Python demo.
Details
(?<!v) - a negative lookbehind that fails the match if there is a v (or V since re.I is used) immediately to the left of the current location
\d+ - one or more digits.
If you cannot use re.search for some reason, you can use
^.*?(?<!v)(\d+)
See this regex demo. Note that \D* (zero or more non-digits) is replaced with .*?, which matches as few characters as possible (any char other than line break chars; with re.S or re.DOTALL it also matches line breaks). \D* can no longer be used because the pattern now has to skip over digits that are preceded by v.
More details:
^ - start of string
.*? - zero or more chars other than line break chars as few as possible
(?<!v) - a negative lookbehind that fails the match if there is a v (or V since re.I is used) immediately to the left of the current location
(\d+) - Group 1: one or more digits.
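One caveat worth checking, sketched here under an assumption the question does not state: the lookbehind (?<!v) inspects only a single character, so with a multi-digit number after v (say v23) the match can start at the second digit. If such numbers should be skipped entirely, adding \d to the lookbehind is one option. The helper name first_number is hypothetical.

```python
import re

# Skip numbers that start right after v/V, and refuse to start
# in the middle of a number (the \d inside the lookbehind).
NOT_AFTER_V = re.compile(r'(?<![v\d])\d+', re.IGNORECASE)

def first_number(text):
    """Return the first digit run not preceded by v/V, or None."""
    m = NOT_AFTER_V.search(text)
    return m.group() if m else None

print(first_number('foobarv2_34 423_wd'))
print(first_number('foobarV23_34'))
print(first_number('v7'))
```

With the plain (?<!v)\d+ pattern, the second input would yield 3, the tail of 23, instead of 34.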

regex match a word after a certain character

I would like to match a word when it is after a char m or b
So for example, when the word is men, I would like to return en (only the word that is following m), if the word is beetles then return eetles
Initially I tried (m|b)\w+ but it matches the entire men not en
How do I write regex expression in this case?
Thank you!
You could get the match only using a positive lookbehind asserting what is on the left is either m or b using character class [mb] preceded by a word boundary \b
(?<=\b[mb])\w+
(?<= Positive lookbehind, assert what is directly to the left is
\b[mb] Word boundary, match either m or b
) Close lookbehind
\w+ Match 1+ word chars
Regex demo
If there cannot be anything after the word characters, you can assert a whitespace boundary on the right using (?!\S)
(?<=\b[mb])\w+(?!\S)
Regex demo | Python demo
Example code
import re
test_str = ("beetles men")
regex = r"(?<=\b[mb])\w+"
print(re.findall(regex, test_str))
Output
['eetles', 'en']
You may use
\b[mb](\w+)
See the regex demo.
NOTE: When your known prefixes include multicharacter sequences, say, you want to find words starting with m or be, you will have to use a non-capturing group rather than a character class: \b(?:m|be)(\w+). The current solution can thus be written as \b(?:m|b)(\w+) (however, a character class here looks more natural, unless you have to build the regex dynamically).
Details
\b - a word boundary
[mb] - m or b
(\w+) - Capturing group 1: any one or more word chars, letters, digits or underscores. To match only letters, use ([^\W\d_]+) instead.
Python demo:
import re
rx = re.compile(r'\b[mb](\w+)')
text = "The words are men and beetles."
# First occurrence:
m = rx.search(text)
if m:
    print(m.group(1)) # => en
# All occurrences
print( rx.findall(text) ) # => ['en', 'eetles']
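(?<=[mb])\w+
You can use the regex above. It means "one or more word characters that directly follow an m or a b". Note that without a leading \b the m or b may sit anywhere inside a word; use (?<=\b[mb]) to restrict it to words that start with m or b.
(?<=[mb]): positive lookbehind, requires m or b immediately to the left
\w+: matches one or more word characters (equal to [a-zA-Z0-9_]+ for ASCII text)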
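The three variants from this thread can be compared side by side. This is a minimal sketch; the sample sentence is made up so that it contains an m and a b in the middle of a word.

```python
import re

text = "The number of men and beetles"

# Lookbehind anchored at a word boundary: only words that START with m or b.
anchored = re.findall(r'(?<=\b[mb])\w+', text)

# Capturing-group variant: findall returns the group, not the whole match.
grouped = re.findall(r'\b[mb](\w+)', text)

# Without \b the lookbehind also succeeds inside words, e.g. after the m in "number".
unanchored = re.findall(r'(?<=[mb])\w+', text)

print(anchored)
print(grouped)
print(unanchored)
```

The first two give the same result here; the unanchored pattern additionally returns the "ber" from "number".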

strange output regular expression r'[-.\:alnum:](.*)'

I expect to fetch all alphanumeric characters after "-"
For an example:
>>> str1 = "12 - mystr"
>>> re.findall(r'[-.\:alnum:](.*)', str1)
[' mystr']
First, it's strange that white space is considered alphanumeric, while I expected to get ['mystr'].
Second, I cannot understand why this can be fetched, if there is no "-":
>>> str2 = "qwertyuio"
>>> re.findall(r'[-.\:alnum:](.*)', str2)
['io']
First of all, Python re does not support POSIX character classes.
The white space is not considered alphanumeric, your first pattern matches - with [-.\:alnum:] and then (.*) captures into Group 1 all 0 or more chars other than a newline. The [-.\:alnum:] pattern matches one char that is either -, ., :, a, l, n, u or m. Thus, when run against the qwertyuio, u is matched and io is captured into Group 1.
Alphanumeric chars can be matched with the [^\W_] pattern. So, to capture all alphanumeric chars after - that is followed with 0+ whitespaces you may use
re.findall(r'-\s*([^\W_]+)', s)
See the regex demo
Details
- - a hyphen
\s* - 0+ whitespaces
([^\W_]+) - Capturing group 1: one or more (+) chars that are letters or digits.
Python demo:
print(re.findall(r'-\s*([^\W_]+)', '12 - mystr')) # => ['mystr']
print(re.findall(r'-\s*([^\W_]+)', 'qwertyuio')) # => []
Your regex says: "Find any one of the characters -.:alnum, then capture any amount of any characters into the first capture group".
In the first test, it found - for the first character, then captured " mystr" (including the leading space) in the first capture group. If the regex contains any groups, findall returns a list of the captured groups, not the whole matches, so the matched - is not included.
Your second test found u as one of the -.:alnum characters (as none of qwerty matched any), then captured and returned the rest after it, io.
As @revo notes in the comments, [....] is a character class, matching any one character in it. In order to include a POSIX character class (like [:alnum:]) inside it, you need two sets of brackets. Also, there is no order in a character class; the fact that you included - inside it just means it would be one of the matched characters, not that alphanumeric characters would be matched without it. Finally, if you want to match any number of alphanumerics, you have your quantifier * on the wrong thing.
Thus, "match -, then any number of alphanumeric characters" would be -([[:alnum:]]*), except... Python does not support POSIX character classes. So you have to write your own: -([A-Za-z0-9]*).
However, that will not match your string because the intervening space is, as you note, not an alphanumeric character. To account for that, use -\s*([A-Za-z0-9]*).
Not quite sure what you want to match. I'll assume you don't want to include '-' in any matches.
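If you want to get all alphanumeric chars after the first '-' and skip all other characters, note that Python's re does not accept variable-width lookbehinds such as (?<=\s+). One way that works is to take the text after the first '-' and collect the alphanumeric runs from it:
re.findall(r'[a-zA-Z\d]+', inputString.partition('-')[2])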
If you want to find each string of alphanumerics directly after each '-', you can do this:
re.findall(r'(?<=-)[a-zA-Z\d]+', inputString)
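To tie the answers together, here is a minimal sketch that runs the hand-written alphanumeric class against a few inputs and then repeats the original pattern for comparison. The third sample string and the variable names are illustrative, not from the question.

```python
import re

samples = ["12 - mystr", "qwertyuio", "a-b1 -c2"]

# Hyphen, optional whitespace, then a run of ASCII letters/digits.
after_hyphen = re.compile(r'-\s*([A-Za-z0-9]+)')

results = {s: after_hyphen.findall(s) for s in samples}
for s, found in results.items():
    print(repr(s), found)

# The original class [-.\:alnum:] is just a set of single characters
# (-, ., :, a, l, n, u, m), so the u in "qwertyuio" satisfies it.
original = re.findall(r'[-.\:alnum:](.*)', "qwertyuio")
print(original)
```

Using + instead of * inside the group avoids empty strings in the result when a hyphen is not followed by any alphanumeric character.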

What is the regex to match the words containing all the vowels?

I am learning regex in Python but can't seem to get the hang of it. I am trying to filter out all the words containing all the vowels in English, and this is my regex:
r'\b(\S*[aeiou]){5}\b'
It seems too vague, since any vowel (even a repeated one) can appear at any place and any number of times, so it returns words like 'actionable' and 'unfortunate', which do have five vowels but not all five distinct ones. I looked around the internet and found this regex:
r'[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*'
But as it appears, it only handles vowels appearing in sequence, which is a much more limited task than the one I am trying to accomplish. Can someone 'think out loud' while crafting the regex for the problem that I have?
If you plan to match words as chunks of text only consisting of English letters you may use a regex like
\b(?=\w*?a)(?=\w*?e)(?=\w*?i)(?=\w*?o)(?=\w*?u)[a-zA-Z]+\b
See the regex demo
To support languages other than English, you may replace [a-zA-Z]+ with [^\W\d_]+.
If a "word" you want to match is a chunk of non-whitespace chars you may use
(?<!\S)(?=\S*?a)(?=\S*?e)(?=\S*?i)(?=\S*?o)(?=\S*?u)\S+
See this regex demo.
Define these patterns in Python using raw string literals, e.g.:
rx_AllVowelWords = r'\b(?=\w*?a)(?=\w*?e)(?=\w*?i)(?=\w*?o)(?=\w*?u)[a-zA-Z]+\b'
Details
\b(?=\w*?a)(?=\w*?e)(?=\w*?i)(?=\w*?o)(?=\w*?u)[a-zA-Z]+\b:
\b - a word boundary, here, a starting word boundary
(?=\w*?a)(?=\w*?e)(?=\w*?i)(?=\w*?o)(?=\w*?u) - a sequence of positive lookaheads that are triggered right after the word boundary position is detected, and require the presence of a, e, i, o and u after any 0+ word chars (letters, digits, underscores - you may replace \w*? with [^\W\d_]*? to only check letters)
[a-zA-Z]+ - 1 or more ASCII letters (replace with [^\W\d_]+ to match all letters)
\b - a word boundary, here, a trailing word boundary
The second pattern details:
(?<!\S)(?=\S*?a)(?=\S*?e)(?=\S*?i)(?=\S*?o)(?=\S*?u)\S+:
(?<!\S) - a position at the start of the string or after a whitespace
(?=\S*?a)(?=\S*?e)(?=\S*?i)(?=\S*?o)(?=\S*?u) - all English vowels must be present - in any order - after any 0+ chars other than whitespace
\S+ - 1+ non-whitespace chars.
I can't think of an easy way to find "words with all vowels" with a single regexp, but it can easily be done by anding-together regex matches to a, e, i, o, and u separately. For example, something like the following Python script should determine whether a given English word has all vowels (in any order, any multiplicity) or not:
#! /usr/bin/python3
# all-vowels.py
import sys
import re
if len(sys.argv) != 2: sys.exit()
word=sys.argv[1]
if re.search(r'a', word) and re.search(r'e', word) and re.search(r'i', word) and re.search(r'o', word) and re.search(r'u', word):
    print("Word has all vowels!")
else:
    print("Word does NOT have all vowels.")
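Both approaches from this thread can be checked against the same text. The following is a sketch only: the sample sentence and the helper has_all_vowels are made up for illustration, and the set comparison is swapped in as a compact stand-in for the five separate searches.

```python
import re

# One lookahead per vowel, all evaluated from the start of the word.
ALL_VOWELS = re.compile(
    r'\b(?=\w*?a)(?=\w*?e)(?=\w*?i)(?=\w*?o)(?=\w*?u)[a-zA-Z]+\b'
)

def has_all_vowels(word):
    """Set-based check: every vowel occurs at least once, in any order."""
    return set('aeiou') <= set(word.lower())

text = "An unfortunate education is actionable, said the sequoia."
regex_hits = ALL_VOWELS.findall(text)
set_hits = [w for w in re.findall(r'[a-zA-Z]+', text) if has_all_vowels(w)]
print(regex_hits)
print(set_hits)
```

Note one difference: the helper lowercases the word first, while the regex as written only looks for lowercase vowels; passing re.IGNORECASE when compiling would align the two.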
