Regex for matching exact words that contain apostrophes in Python? - python

For this project I need exact regular expressions rather than more general ones. I'm counting occurrences of words from a list called vocabWords, which I import into my script from a text file; each word in the list is in the format \bword\b.
When I run my script, \bwhat\b will pick up the words "what" and "what's", but \bwhat's\b will pick up no words. If I switch the order so the apostrophe word is before the root word, words are counted correctly. How can I change my regex list so the words are counted correctly? I understand the problem is using "\b", but I haven't been able to find how to fix this. I cannot have a more general regex, and I have to include the words themselves in the regex pattern.
vocabWords:
\bwhat\b
\bwhat's\b
\biron\b
\biron's\b
My code:
import re

matched = []
regex_all = re.compile('|'.join(vocabWords))
for row in df['test']:
    matched.append(re.findall(regex_all, row))

There are at least two other solutions:
Test that the next character isn't an apostrophe: r"\bwhat(?!')\b"
Use a more general rule, r"\bwhat(?:'s)?\b", to catch both variants, with and without the apostrophe.
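A minimal sketch of how the two patterns above behave. The sample sentence and the variable names are made up for illustration, and re.IGNORECASE is added only so that the capitalized "What's" is picked up as well.

```python
import re

# Hypothetical sample text, not taken from the question.
text = "What's that? I know what iron's made of, and what's iron."

# Lookahead variant: "what" only when no apostrophe follows it.
bare_what = re.compile(r"\bwhat(?!')\b", re.IGNORECASE)

# Optional-suffix variant: "what" with or without a trailing 's.
any_what = re.compile(r"\bwhat(?:'s)?\b", re.IGNORECASE)

print(bare_what.findall(text))  # ['what']
print(any_what.findall(text))   # ["What's", 'what', "what's"]
```

The first pattern keeps the counts for "what" and "what's" separate, while the second merges them into one count.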

If you sort your wordlist by length before turning it into a regexp, longer words (like "what's") will precede shorter words (like "what"). This should do the trick.
regex_all = re.compile('|'.join(sorted(vocabWords, key=len, reverse=True)))
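To see why the order matters, here is a small sketch that compares the unsorted and the sorted alternation on one made-up sentence. The names vocab_words, text, unsorted_re and sorted_re are illustrative, and collections.Counter is used only to make the counts easy to read.

```python
import re
from collections import Counter

# Same patterns as in the question; the sample text is hypothetical.
vocab_words = [r"\bwhat\b", r"\bwhat's\b", r"\biron\b", r"\biron's\b"]
text = "what's iron and what iron's made of"

# Alternation tries the branches left to right, so the shorter
# pattern wins whenever it is listed first.
unsorted_re = re.compile("|".join(vocab_words))
sorted_re = re.compile("|".join(sorted(vocab_words, key=len, reverse=True)))

unsorted_counts = Counter(unsorted_re.findall(text))
sorted_counts = Counter(sorted_re.findall(text))

print(unsorted_counts)  # "what" and "iron" are each counted twice
print(sorted_counts)    # each of the four words is counted once
```

With the unsorted list, \bwhat\b matches the first four characters of "what's" because the apostrophe counts as a word boundary, so the longer branch is never tried at that position.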

Related

regex to match words but not if a certain phrase

In python, I want to match substring containing two terms with up to a certain number of words in between but not when it is equal to a certain substring.
I have this regular expression that does the first part of the job, matching two terms with up to a certain number of words in between.
But I want to add a condition that excludes a specific phrase, such as "my very funny words":
my(?:\s+\w+){0,2}\s+words
Expected results for this input:
I'm searching for my whatever funny words inside this text
should match with my whatever funny words
while for this input:
I'm searching for my very funny words inside this text
there should be no match
Thank you all for helping out
You may use the following regex pattern:
my(?! very funny)(?:\s+\w+){0,2}\s+words
This inserts a negative lookahead (?! very funny) into your existing pattern to exclude the matches you don't want.
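A short check of the lookahead pattern against the two inputs from the question. The variable names are only for illustration.

```python
import re

pattern = re.compile(r"my(?! very funny)(?:\s+\w+){0,2}\s+words")

allowed = "I'm searching for my whatever funny words inside this text"
excluded = "I'm searching for my very funny words inside this text"

allowed_match = pattern.search(allowed)
excluded_match = pattern.search(excluded)

print(allowed_match.group(0))  # my whatever funny words
print(excluded_match)          # None
```

Note that the lookahead contains a literal single space, so it only rejects the phrase when "my" and "very" are separated by exactly one space character.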

Extracting a section of a string using regex with repeating ending words

I am attempting to extract some raw strings using the re module in Python. The end of a to-be-extracted section is identified by a word that is repeated multiple times in the text. My current attempts always capture up to the last occurrence of the repeating word. How can I modify this behavior?
A text file has been extracted from a PDF. The entire PDF is stored as one string. The general format of the string is as below:
"***Start of notes: Collection of alphanumeric words and characters EndofsectionTopic A: string of words Endofsection"
The intended string to be captured is: "Collection of alphanumeric words and characters"
The attempted solution used in this situation was:
re.compile(r"*{3}Start of notes:(.+)\sEndofsection")
This attempt tends to match the whole string rather than just "Collection of alphanumeric words and characters" as intended.
One possible approach is to split with Endofsection and then extract the string from the first section only - this works, but I was hoping to find a more elegant solution using re.compile.
There are two problems in your regex:
You need to escape *, since it is a metacharacter: \*
You are using (.+), which is a greedy quantifier and will match as much as possible. Since you want the shortest match, change it to (.+?)
Fixing these two issues gives you the intended match.
Python code,
import re
s = "***Start of notes: Collection of alphanumeric words and characters EndofsectionTopic A: string of words Endofsection"
m = re.search(r'\*{3}Start of notes:(.+?)\sEndofsection', s)
if m:
    print(m.group(1))
Prints,
Collection of alphanumeric words and characters
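For comparison, a minimal sketch that runs the greedy and the non-greedy version side by side on the same string. The names greedy and lazy are illustrative, and .strip() is added only to drop the space that follows "Start of notes:".

```python
import re

s = ("***Start of notes: Collection of alphanumeric words and characters "
     "EndofsectionTopic A: string of words Endofsection")

# Greedy (.+) runs to the LAST "Endofsection" in the string.
greedy = re.search(r"\*{3}Start of notes:(.+)\sEndofsection", s)

# Non-greedy (.+?) stops at the FIRST "Endofsection".
lazy = re.search(r"\*{3}Start of notes:(.+?)\sEndofsection", s)

print(greedy.group(1).strip())
print(lazy.group(1).strip())  # Collection of alphanumeric words and characters
```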

Regular Expression in python how to find paired words

I'm doing the cipher for python. I'm confused on how to use Regular Expression to find a paired word in a text dictionary.
For example, there is dictionary.txt with many English words in it. I need to find the words that begin with "th", like they, them, the, their, and so on.
What kind of Regular Expression should I use to find "th" at the beginning?
Thank you!
If you have a list of words (so that every word is a string), you can find the words beginning with 'th' with this:
yourRegEx = re.compile(r'^th\w*') # ^ for beginning of string
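A minimal sketch of applying that pattern to a list of words. The list here is a made-up stand-in for the contents of dictionary.txt.

```python
import re

# Hypothetical word list standing in for the dictionary file.
words = ["they", "them", "the", "their", "other", "cat", "bath"]

starts_with_th = re.compile(r"^th\w*")

# Keep only the words where the pattern matches at the start.
th_words = [w for w in words if starts_with_th.match(w)]

print(th_words)  # ['they', 'them', 'the', 'their']
```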
^(th\w*)
gives you all results where the string begins with th. If there is more than one word in the string, you will only get the first.
(^|\s)(th\w*)
will give you all the words beginning with th, even if there is more than one.
(th)\w*
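The difference between the anchored and the whitespace-aware pattern can be sketched on one line of text. The sample line is made up, and the \b version at the end is an assumed alternative that is not part of the answer above.

```python
import re

line = "they said the other thing to them"

# Anchored at the start of the string: at most one result.
first_only = re.findall(r"^(th\w*)", line)

# Start of string or whitespace before "th"; group 2 holds the word itself.
after_space = [m.group(2) for m in re.finditer(r"(^|\s)(th\w*)", line)]

# Assumed alternative: a word boundary instead of the (^|\s) group.
with_boundary = re.findall(r"\bth\w*", line)

print(first_only)     # ['they']
print(after_space)    # ['they', 'the', 'thing', 'them']
print(with_boundary)  # ['they', 'the', 'thing', 'them']
```

In both multi-word versions "other" is skipped, because its "th" does not sit at the start of a word.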

Regex: How to match words without consecutive vowels?

I'm really new to regex and I've been able to find regex which can match this quite easily, but I am unsure how to only match words without it.
I have a .txt file with words like
sheep
fleece
eggs
meat
potato
I want to make a regular expression that matches words in which vowels are not repeated consecutively, so it would return eggs meat potato.
I'm not very experienced with regex and I've been unable to find anything about how to do this online, so it'd be awesome if someone with more experience could help me out. Thanks!
I'm using python and have been testing my regex with https://regex101.com.
Thanks!
EDIT: provided incorrect examples of results for the regular expression. Fixed.
Note that, since the desired output includes meat but not fleece, desired words are allowed to have repeated vowels, just not the same vowel repeated.
To select lines with no repeated vowel:
>>> [w for w in open('file.txt') if not re.search(r'([aeiou])\1', w)]
['eggs\n', 'meat\n', 'potato\n']
The regex [aeiou] matches any vowel (you can include y if you like). The regex ([aeiou])\1 matches any vowel followed by the same vowel. Thus, not re.search(r'([aeiou])\1', w) is true only for strings w that contain no repeated vowels.
Addendum
If we wanted to exclude meat because it has two vowels in a row, even though they are not the same vowel, then:
>>> [w for w in open('file.txt') if not re.search(r'[aeiou]{2}', w)]
['eggs\n', 'potato\n']
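The same two filters can be tried without a file. This is a small sketch that uses the word list from the question as an in-memory list; the variable names are illustrative.

```python
import re

words = ["sheep", "fleece", "eggs", "meat", "potato"]

# The same vowel twice in a row (backreference to group 1).
same_vowel_twice = re.compile(r"([aeiou])\1")

# Any two vowels in a row, whether identical or not.
any_two_vowels = re.compile(r"[aeiou]{2}")

no_doubled = [w for w in words if not same_vowel_twice.search(w)]
no_adjacent = [w for w in words if not any_two_vowels.search(w)]

print(no_doubled)   # ['eggs', 'meat', 'potato']
print(no_adjacent)  # ['eggs', 'potato']
```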
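@John1024's answer should work.
I would also try the pattern below. Note that it matches the words that do contain a doubled vowel, so the words you want are the ones it does not match. The trailing ig looks like JavaScript-style flags (case-insensitive, global) rather than Python syntax.
"\w*(a{2,}|e{2,}|i{2,}|o{2,}|u{2,})\w*"ig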

How to use a regex to find combined words with period delimiters?

I am trying to find all instances of words separated by period delimiters.
So for example, these would be valid:
word1.word2
word1.word2.word3
word1.word2.word3.word4
And so on. Words are composed of the characters a-zA-Z0-9-.
I tried [\w.]* but this does not appear to be accurate.
You can use the following:
[a-zA-Z0-9]\w+(?:\.\w+)+
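As a variation, here is a sketch that follows the character set stated in the question literally. It assumes the trailing "-" in "a-zA-Z0-9-" means that hyphens are allowed inside a word. Unlike \w, this class excludes the underscore, and it also accepts one-character words. The pattern and the sample text are illustrative, not the answer's own.

```python
import re

# Assumption: a word is one or more of a-z, A-Z, 0-9 or "-".
dotted = re.compile(r"[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+")

# Hypothetical sample text.
text = "see word1.word2 and a-b.c-d.e3, but not single or trailing."

dotted_words = dotted.findall(text)

print(dotted_words)  # ['word1.word2', 'a-b.c-d.e3']
```

The final "trailing." is not matched because a period must be followed by at least one more word character.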
