Detecting a repeated sequence with regex - python

I have a text example like
0s11 0s12 0s33 my name is 0sgfh 0s1 0s22 0s87
I want to detect the consecutive sequences of tokens that start with 0s.
So, the expected output should be 0s11 0s12 0s33, 0sgfh 0s1 0s22 0s87
I tried using regex
(0s\w+)
but that would detect each 0s11, 0s12, 0s33, etc. individually.
Any idea on how to modify the pattern?

To get those 2 matches where there are at least 2 consecutive parts:
\b0s\w+(?:\s+0s\w+)+
Explanation
\b A word boundary to prevent a partial word match
0s\w+ Match 0s and 1+ word chars
(?:\s+0s\w+)+ Repeat 1 or more times whitespace chars followed by 0s and 1+ word chars
If you also want to match a single occurrence:
\b0s\w+(?:\s+0s\w+)*
Note that \w+ matches 1 or more word characters so it would not match only 0s
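As a quick check in Python, both patterns above can be run with re.findall() on the sample text from the question. This is only a minimal sketch; the variable names and the second sample string are made up for illustration.

```python
import re

text = "0s11 0s12 0s33 my name is 0sgfh 0s1 0s22 0s87"

# At least two consecutive 0s tokens per match
runs = re.findall(r'\b0s\w+(?:\s+0s\w+)+', text)
print(runs)  # ['0s11 0s12 0s33', '0sgfh 0s1 0s22 0s87']

# Also accept a single 0s token on its own
other_text = "0s11 my 0s12 0s33"  # hypothetical second sample
runs_or_single = re.findall(r'\b0s\w+(?:\s+0s\w+)*', other_text)
print(runs_or_single)  # ['0s11', '0s12 0s33']
```

Since neither pattern contains a capturing group, re.findall() returns the whole match for each run.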

Should be doable with re.findall(). Your pattern was correct! :)
import re
testString = "0s11 0s12 0s33 my name is 0sgfh 0s1 0s22 0s87"
print(re.findall(r'0s\w+', testString))
['0s11', '0s12', '0s33', '0sgfh', '0s1', '0s22', '0s87']
Hope this helps!

Related

Regex - Regular expression for counting no.of digits between alphabets

Need to construct a regular expression that counts numbers between alphabets.
schowalte3rguss77ie85 - 2
xyz1zyx - 1
x1y1z1 - 2
I have constructed this, but it doesn't work for case 3.
[[a-z]+[0-9]+[a-z]]*
Any help would be appreciated. Thanks in advance.
Use this regex:
(?<=[a-z])\d+(?=[a-z])
Demo: https://regex101.com/r/tpss6x/1
[Javascript]
If you want a count only, the last part should be a lookahead assertion.
If you want to also match uppercase chars, you can make the pattern case insensitive.
[a-z]\d+(?=[a-z])
Explanation
[a-z] Match a single char a-z
\d+ Match 1+ digits
(?=[a-z]) Positive lookahead, assert a char a-z to the right
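To see why the trailing letter should be a lookahead rather than a consumed character, here is a small Python sketch run on the three cases from the question. The helper name count_digit_runs and the samples dictionary are illustrative only.

```python
import re

# Sample strings from the question mapped to their expected counts
samples = {"schowalte3rguss77ie85": 2, "xyz1zyx": 1, "x1y1z1": 2}

def count_digit_runs(text):
    # The trailing letter is only asserted, not consumed, so it can
    # also serve as the leading letter of the next match.
    return len(re.findall(r'[a-z]\d+(?=[a-z])', text))

for text, expected in samples.items():
    print(text, count_digit_runs(text), expected)

# If the trailing letter is consumed instead, case 3 is undercounted
consumed = len(re.findall(r'[a-z]\d+[a-z]', "x1y1z1"))
print(consumed)  # 1
```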
You can use
(?<=[^\W\d_])\d+(?=[^\W\d_])
See the regex demo. If you want to only support ASCII letters, replace [^\W\d_] (that matches any Unicode letter) with [a-zA-Z].
Details:
(?<=[^\W\d_]) - immediately before the current location, there must be any Unicode letter
\d+ - one or more digits
(?=[^\W\d_]) - immediately after the current location, there must be any Unicode letter.
Counting can be done with len(...), see this Python demo:
import re
text = "schowalte3rguss77ie85"
matches = re.findall(r'(?<=[^\W\d_])\d+(?=[^\W\d_])', text)
print(len(matches)) # => 2

FutureWarning: Possible nested set at position 1 Error Python

I was working on something and at some point, I needed to check whether the string satisfies this:
The string must contain at least 5 words and each separated by a hyphen(-) or an underscore(_).
Here is the code that I wrote:
password=eval(input('Password:'))
pattern=r'[[\w][-_]]{5,}'
import re
re.fullmatch(pattern,password)
But it gives this warning:
<ipython-input-32-7c87b09218f8>:4: FutureWarning: Possible nested set at position 1
re.fullmatch(pattern,password)
Why does that happen, any idea? Thanks in advance. By the way, I'm using Jupyter notebook.
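The warning appears because the pattern has a [ inside a character class ([[\w]...), which Python flags as a possible nested set. Character classes cannot be nested to group things; use a group instead.
You can match 1+ word characters, and then repeat at least 4 times matching either - or _ followed by 1 or more word characters.
\w+(?:[-_]\w+){4,}
Explanation
\w+ Match 1+ word characters
(?: Non-capturing group to repeat as a whole part
[-_] Character class matching either - or _
\w+ Match 1+ word characters
){4,} Close the non-capturing group and repeat it 4 or more times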
See a regex demo.
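A minimal Python sketch of this check, assuming the separators from the question (hyphen or underscore) and re.fullmatch() as in the original code. The helper name has_five_words and the sample strings are hypothetical.

```python
import re

# One word, then at least four more words, each preceded by - or _
PATTERN = re.compile(r'\w+(?:[-_]\w+){4,}')

def has_five_words(password):
    # fullmatch requires the whole string to fit the pattern
    return PATTERN.fullmatch(password) is not None

print(has_five_words("one-two_three-four-five"))  # True
print(has_five_words("one-two-three"))            # False
```

Note that \w also matches an underscore, so if words themselves must not contain one, a stricter variant could use [^\W_]+ in place of \w+.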

What is the regex to match the words containing all the vowels?

I am learning regex in Python but can't seem to get the hang of it. I am trying to filter out all the words containing all the vowels in English, and this is my regex:
r'\b(\S*[aeiou]){5}\b'
It seems too vague, since any vowel (even a repeated one) can appear at any place and any number of times, so it returns words like 'actionable' and 'unfortunate', which do contain 5 vowels but not all of the vowels. I looked around the internet and found this regex:
r'[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*'
But as it appears, it only handles vowels that occur in sequential order, a much more limited task than the one I am trying to accomplish. Can someone 'think out loud' while crafting the regex for the problem that I have?
If you plan to match words as chunks of text only consisting of English letters you may use a regex like
\b(?=\w*?a)(?=\w*?e)(?=\w*?i)(?=\w*?o)(?=\w*?u)[a-zA-Z]+\b
See the regex demo
To support languages other than English, you may replace [a-zA-Z]+ with [^\W\d_]+.
If a "word" you want to match is a chunk of non-whitespace chars you may use
(?<!\S)(?=\S*?a)(?=\S*?e)(?=\S*?i)(?=\S*?o)(?=\S*?u)\S+
See this regex demo.
Define these patterns in Python using raw string literals, e.g.:
rx_AllVowelWords = r'\b(?=\w*?a)(?=\w*?e)(?=\w*?i)(?=\w*?o)(?=\w*?u)[a-zA-Z]+\b'
Details
\b(?=\w*?a)(?=\w*?e)(?=\w*?i)(?=\w*?o)(?=\w*?u)[a-zA-Z]+\b:
\b - a word boundary, here, a starting word boundary
(?=\w*?a)(?=\w*?e)(?=\w*?i)(?=\w*?o)(?=\w*?u) - a sequence of positive lookaheads that are triggered right after the word boundary position is detected, and require the presence of a, e, i, o and u after any 0+ word chars (letters, digits, underscores - you may replace \w*? with [^\W\d_]*? to only check letters)
[a-zA-Z]+ - 1 or more ASCII letters (replace with [^\W\d_]+ to match all letters)
\b - a word boundary, here, a trailing word boundary
The second pattern details:
(?<!\S)(?=\S*?a)(?=\S*?e)(?=\S*?i)(?=\S*?o)(?=\S*?u)\S+:
(?<!\S) - a position at the start of the string or after a whitespace
(?=\S*?a)(?=\S*?e)(?=\S*?i)(?=\S*?o)(?=\S*?u) - all English vowels must be present - in any order - after any 0+ chars other than whitespace
\S+ - 1+ non-whitespace chars.
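As an illustration, the first pattern can be tried in Python like this. It is a sketch only: the sample sentence is made up, and the pattern is used as given, so it only looks for lowercase vowels.

```python
import re

rx_all_vowel_words = re.compile(
    r'\b(?=\w*?a)(?=\w*?e)(?=\w*?i)(?=\w*?o)(?=\w*?u)[a-zA-Z]+\b'
)

# Hypothetical sample: two words have all five vowels, the rest lack at least one
text = "education is actionable but unfortunate, unlike sequoia"
found = rx_all_vowel_words.findall(text)
print(found)  # ['education', 'sequoia']
```

Each lookahead starts from the same position at the beginning of the word, which is why the vowels may appear in any order.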
I can't think of an easy way to find "words with all vowels" with a single regexp, but it can easily be done by anding-together regex matches to a, e, i, o, and u separately. For example, something like the following Python script should determine whether a given English word has all vowels (in any order, any multiplicity) or not:
#! /usr/bin/python3
# all-vowels.py
import sys
import re

# Expect exactly one command-line argument: the word to check
if len(sys.argv) != 2:
    sys.exit()
word = sys.argv[1]
# The word qualifies only if each vowel is found somewhere in it
if re.search(r'a', word) and re.search(r'e', word) and re.search(r'i', word) and re.search(r'o', word) and re.search(r'u', word):
    print("Word has all vowels!")
else:
    print("Word does NOT have all vowels.")

Build regular expression to recognize at least a given interval

I have a regular expression given by a word and a range of words following.
For example:
pattern = 'word \\w+ \\w+ \\w+'
result = [text[match.start():match.end()] for match in re.finditer(pattern, text)]
How can I modify the regular expression so that it also matches when fewer words than the full interval follow the word? For example, if the word is at the end of the string, I would like that shorter interval to be returned too.
Whenever possible, it should return the longest possible match.
Your 'word \\w+ \\w+ \\w+' regex matches a word and then 3 more "words" (space separated). You want to match 0 to 3 of these words. Use
re.findall(r'word(?:\s+\w+){0,3}', s)
Or, to allow any non-word chars in between the "words", replace \s with \W:
re.findall(r'word(?:\W+\w+){0,3}', s)
Details:
word - word string
(?:\s+\w+){0,3} - 0 to 3 sequences (the {0,3} is a greedy version of the limiting quantifier, it will match as many occurrences as possible) of:
\s+ - 1+ whitespaces
\w+ - 1 or more word chars.
See the regex demo.
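For example, on a made-up sample string (the text below is hypothetical), the greedy {0,3} quantifier takes three following words when they are available and fewer when the string ends early:

```python
import re

s = "word one two three four and then word five"  # hypothetical sample
result = re.findall(r'word(?:\s+\w+){0,3}', s)
print(result)  # ['word one two three', 'word five']
```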

Regex for a third-person verb

I'm trying to create a regex that matches a third person form of a verb created using the following rule:
If the verb ends in e not preceded by i,o,s,x,z,ch,sh, add s.
So I'm looking for a regex matching a word consisting of some letters, then not i,o,s,x,z,ch,sh, and then "es". I tried this:
\b\w*[^iosxz(sh)(ch)]es\b
According to regex101 it matches "likes", "hates" etc. However, it does not match "bathes", why doesn't it?
You may use
\b(?=\w*(?<![iosxz])(?<![cs]h)es\b)\w*
See the regex demo
Since Python re does not support variable length alternatives in a lookbehind, you need to split the conditions into two lookbehinds here.
Pattern details:
\b - a leading word boundary
(?=\w*(?<![iosxz])(?<![cs]h)es\b) - a positive lookahead requiring a sequence of:
\w* - 0+ word chars
(?<![iosxz]) - there must not be i, o, s, x, z chars right before the current location and...
(?<![cs]h) - no ch or sh right before the current location...
es - followed with es...
\b - at the end of the word
\w* - zero or more (maybe + is better here to match 1 or more) word chars.
See Python demo:
import re
r = re.compile(r'\b(?=\w*(?<![iosxz])(?<![cs]h)es\b)\w*')
s = 'it matches "likes", "hates" etc. However, it does not match "bathes", why doesn\'t it?'
print(re.findall(r, s))
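If you want to match an e that is not preceded by i, o, s, x, z, ch or sh, and your regex flavor supports alternatives of different lengths inside a lookbehind (Python's re does not), you can use:
(?<!i|o|s|x|z|ch|sh)e
Your regex [^iosxz(sh)(ch)] consists of a single character class: the ^ simply negates it, and every other character inside it, including the parentheses, is taken literally, so it is equivalent to:
[^iosxzch()]
which actually means "match any character that is not one of i, o, s, x, z, c, h, ( or )". That is why "bathes" is not matched: the h before es is in the excluded set.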
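The difference can be checked in Python with a short sketch. The sample string is made up, and the second pattern is the two-lookbehind version from the earlier answer.

```python
import re

s = "likes hates bathes watches goes"  # hypothetical sample

# Original attempt: the parentheses, s, h and c are all literal
# members of the negated character class
original = re.findall(r'\b\w*[^iosxz(sh)(ch)]es\b', s)
print(original)  # ['likes', 'hates']

# Two fixed-width lookbehinds, which Python's re accepts
fixed = re.findall(r'\b(?=\w*(?<![iosxz])(?<![cs]h)es\b)\w*', s)
print(fixed)  # ['likes', 'hates', 'bathes']
```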
