I'm really new to regex. I've been able to find a regex that matches consecutively repeated vowels quite easily, but I am unsure how to match only the words without them.
I have a .txt file with words like
sheep
fleece
eggs
meat
potato
I want to make a regular expression that matches words in which vowels are not repeated consecutively, so it would return eggs, meat, and potato.
I'm not very experienced with regex and I've been unable to find anything online about how to do this, so it'd be awesome if someone with more experience could help me out.
I'm using Python and have been testing my regex with https://regex101.com.
Thanks!
EDIT: provided incorrect examples of results for the regular expression. Fixed.
Note that, since the desired output includes meat but not fleece, the desired words are allowed to contain consecutive vowels, just not the same vowel repeated consecutively.
To select lines in which no vowel is repeated consecutively:
>>> import re
>>> [w for w in open('file.txt') if not re.search(r'([aeiou])\1', w)]
['eggs\n', 'meat\n', 'potato\n']
The regex [aeiou] matches any vowel (you can include y if you like). The regex ([aeiou])\1 matches any vowel followed immediately by the same vowel. Thus, not re.search(r'([aeiou])\1', w) is true only for strings w that contain no consecutively repeated vowel.
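If you also want to drop the trailing newlines from the result, a small variation of the same filter works (still assuming the file is named file.txt):
>>> [w.strip() for w in open('file.txt') if not re.search(r'([aeiou])\1', w)]
['eggs', 'meat', 'potato']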
Addendum
If we wanted to exclude meat because it has two vowels in a row, even though they are not the same vowel, then:
>>> [w for w in open('file.txt') if not re.search(r'[aeiou]{2}', w)]
['eggs\n', 'potato\n']
#John1024's answer should work.
I would also try
"\w*(a{2,}|e{2,}|i{2,}|o{2,}|u{2,})\w*"ig
Related
In my program I'm using count=strr2.lower().count("ieee") to calculate the number of occurrences in the following string,
"i love ieee and ieeextream is the best coding competition ever"
Here it counts "ieeextream" as an occurrence too, which is not my expected result. The expected output is count=1.
So is there any method to check only for the "ieee" word, or can we change the same code to a different implementation? Thanks for your time.
If you are trying to find the sub-string as a whole word present in the original string, then I guess this is what you need:
count=strr2.lower().split().count("ieee")
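For example, with the string from the question (note that this relies on the word not being glued to punctuation, since split() only splits on whitespace):
>>> strr2 = "i love ieee and ieeextream is the best coding competition ever"
>>> strr2.lower().split().count("ieee")
1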
If you want to count only whole words, you can use a regular expression, wrapping the word to be found in word-boundary characters \b. This will also work if the word is surrounded by punctuation.
>>> import re
>>> s = "i love IEEE, and ieeextream is the best coding competition ever"
>>> len(re.findall(r"\bieee\b", s.lower()))
1
For the purposes of this project, I'm using more exact regex patterns rather than more general ones. I'm counting occurrences of words from a list stored in a text file, which I import into my script as vocabWords; each word in the list is in the format \bword\b.
When I run my script, \bwhat\b will pick up the words "what" and "what's", but \bwhat's\b will pick up no words. If I switch the order so the apostrophe word is before the root word, words are counted correctly. How can I change my regex list so the words are counted correctly? I understand the problem is using "\b", but I haven't been able to find how to fix this. I cannot have a more general regex, and I have to include the words themselves in the regex pattern.
vocabWords:
\bwhat\b
\bwhat's\b
\biron\b
\biron's\b
My code:
matched = []
regex_all = re.compile('|'.join(vocabWords))
for row in df['test']:
    matched.append(re.findall(regex_all, row))
There are at least two other solutions:
Test that the next symbol isn't an apostrophe: r"\bwhat(?!')\b"
Use the more general rule r"\bwhat(?:'s)?\b" to catch both variants, with and without the apostrophe (both shown in the quick check below).
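A quick check of both patterns (the test sentence here is just made up for illustration):
>>> import re
>>> s = "what's the plan and what do we need"
>>> re.findall(r"\bwhat(?!')\b", s)
['what']
>>> re.findall(r"\bwhat(?:'s)?\b", s)
["what's", 'what']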
If you sort your wordlist by length before turning it into a regexp, longer words (like "what's") will precede shorter words (like "what"). This should do the trick.
regex_all = re.compile('|'.join(sorted(vocabWords, key=len, reverse=True)))
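A quick check with the word list from the question and a made-up sentence:
>>> import re
>>> vocabWords = [r"\bwhat\b", r"\bwhat's\b", r"\biron\b", r"\biron's\b"]
>>> regex_all = re.compile('|'.join(sorted(vocabWords, key=len, reverse=True)))
>>> regex_all.findall("what's that? the iron's hot, and what now?")
["what's", "iron's", 'what']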
I have a string that can be illustrated by the following (extra spaces intended: there are two spaces between each group of words):
"words that don't matter START some words one  some words two  some words three END words that don't matter"
To grab each substring between START and END, i.e. ['some words one', 'some words two', 'some words three'], I wrote the following code:
result = re.search(r'(?<=START).*?(?=END)', string, flags=re.S).group()
result = re.findall(r'(\(?\w+(?:\s\w+)*\)?)', result)
Is it possible to achieve this with one single regex?
In theory you could just wrap your second regex in ()* and put it into your first. That would capture all occurrences of your inner expression in the bounds. Unfortunately the Python implementation only retains the last match of a group that is matched multiple times. The only implementation that I know that retains all matches of a group is the .NET one. So unfortunately not a solution for you.
On the other hand why can't you simply keep the two step approach that you have?
Edit:
You can compare the behaviour I described using online regex tools.
Pattern: (\w+\s*)* Input: aaa bbb ccc
Try it for example with https://pythex.org/ and http://regexstorm.net/tester.
You will see that Python returns one match/group which is ccc while .NET returns $1 as three captures aaa, bbb, ccc.
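You can also confirm Python's behaviour straight from the interpreter:
>>> import re
>>> re.match(r'(\w+\s*)*', 'aaa bbb ccc').group(1)
'ccc'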
Edit2: As #Jan says, there is also the newer regex module, which supports multiple captures per group. I had completely forgotten about that.
With the newer regex module, you can do it in one step:
(?:\G(?!\A)|START)\s*\K
(?!\bEND\b)
\w+\s+\w+\s+\w+
This looks complicated, but broken down, it says:
(?:\G(?!\A)|START) # look for START or the end of the last match
\s*\K # whitespaces, \K "forgets" all characters to the left
(?!\bEND\b) # neg. lookahead, do not overrun END
\w+\s+\w+\s+\w+ # your original expression
In Python this looks like:
import regex as re
rx = re.compile(r'''
(?:\G(?!\A)|START)\s*\K
(?!\bEND\b)
\w+\s+\w+\s+\w+''', re.VERBOSE)
string = "words that don't matter START some words one some words two some words three END words that don't matter"
print(rx.findall(string))
# ['some words one', 'some words two', 'some words three']
Additionally, see a demo on regex101.com.
This is an ideal situation where we could use a re.split, as #PeterE mentioned to circumvent the problem of having access only to the last captured group.
import re
s=r'"words that don\'t matter START some words one some words two some words three END words that don\'t matter" START abc a bc c END'
print('\n'.join(re.split(r'^.*?START\s+|\s+END.*?START\s+|\s+END.*?$|\s{2,}',s)[1:-1]))
Enable the re.MULTILINE/re.M flag if your input spans multiple lines, since the pattern uses ^ and $.
OUTPUT
some words one
some words two
some words three
abc
a bc c
I'm trying to get all 3 letter words. They end with double letters and start with the letter 'a'.
Like: app, add, all, arr, aoo, aee
I tried this, but it doesn't work very well...
words =re.findall(r" a(\w)\1* ",text)
You are using
words =re.findall(r" a(\w)\1* ",text)
and here is a demo of it.
You can see an improvement by using a word boundary as well as a specific limit on the number of repeats in your search:
\ba(\w)\1{1}\b
You want one and only one additional instance of the matched \w; this is achieved with {1}, which allows exactly one repeat of the backreference \1, i.e. one more copy of that letter.
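Note that re.findall returns only the captured group when the pattern contains one, so to get the whole words you can use finditer instead; a quick sketch with a made-up text string:
import re
text = "app add ago all arr aoo aee apple and"
print([m.group(0) for m in re.finditer(r'\ba(\w)\1{1}\b', text)])
# ['app', 'add', 'all', 'arr', 'aoo', 'aee']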
I think you need to use + instead of *:
words = re.findall(r"\ba(\w)\1+\b", text)
Otherwise you will also match words whose second letter is not doubled at all. Also use \b to detect word boundaries.
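The same findall caveat applies here, and \1+ also accepts longer runs of the repeated letter, not only three-letter words. A rough check with a made-up string:
import re
text = "app add azzz ab and"
print([m.group(0) for m in re.finditer(r"\ba(\w)\1+\b", text)])
# ['app', 'add', 'azzz']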
I have a sentence. I want to find all occurrences of a word that start with a specific character in that sentence. I am very new to programming and Python, but from the little I know, this sounds like a Regex question.
What is the pattern match code that will let me find all words that match my pattern?
Many thanks in advance,
Brock
import re
print(re.findall(r'\bv\w+', thesentence))
will print every word in the sentence that starts with 'v', for example.
Using the split method of strings, as another answer suggests, would not identify words, but space-separated chunks that may include punctuation. This re-based solution does identify words (letters and digits, net of punctuation).
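A small illustration of that difference, using an example sentence with punctuation attached to a word:
import re
thesentence = "a very vivid view, evidently"
print(re.findall(r'\bv\w+', thesentence))
# ['very', 'vivid', 'view']
print([w for w in thesentence.split() if w.startswith('v')])
# ['very', 'vivid', 'view,']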
I second the Dive Into Python recommendation. But it's basically:
m = re.findall(r'\bf.*?\b', 'a fast and friendly dog')
print(m)
\b means word boundary, and .*? matches as few characters as needed to reach the next word boundary (the ? makes the * quantifier lazy), so we capture the whole word without running past it.
You could do (doesn't use re though):
matching_words = [x for x in sentence.split() if x.startswith(CHAR_TO_FIND)]
Regular expressions work too (see the other answers) but I think this solution will be a little more readable, and as a beginner learning Python, you'll find list comprehensions (like the solution above) important to gain a comfort level with.
>>> sentence = "a quick brown fox for you"
>>> pattern = "fo"
>>> for word in sentence.split():
...     if word.startswith(pattern):
...         print(word)
...
fox
for
Split the sentence on spaces, use a loop to search for the pattern and print them out.
import re
s = "Your sentence that contains the word ROAD"
s = re.sub(r'\bROAD', 'RD.', s)
print(s)
Read: http://diveintopython3.org/regular-expressions.html