Python regexp to match full or partial word

Is there a way to get regexp to match as much of a specific word as is possible? For example, if I am looking for the following words: yesterday, today, tomorrow
I want the following full words to be extracted:
yest
yesterday
tod
toda
today
tom
tomor
tomorrow
The following whole words should fail to match (basically, spelling mistakes):
yesteray
tomorow
tommorrow
tody
The best I could come up with so far is:
\b((tod(a(y)?)?)|(tom(o(r(r(o(w)?)?)?)?)?)|(yest(e(r(d(a(y)?)?)?)?)?))\b
Note: I could implement this using a finite state machine but thought it would be a giggle to get regexp to do this. Unfortunately, anything I come up with is ridiculously complex and I'm hoping that I've just missed something.

The regex you are looking for should include optional groups with alternations.
\b(yest(?:e(?:r(?:d(?:ay?)?)?)?)?|tod(?:ay?)?|tom(?:o(?:r(?:r(?:ow?)?)?)?)?)\b
Note that \b word boundaries are very important since you want to match whole words only.
Regex explanation:
\b - leading word boundary
(yest(?:e(?:r(?:d(?:ay?)?)?)?)?|tod(?:ay?)?|tom(?:o(?:r(?:r(?:ow?)?)?)?)?) - a capturing group matching
yest(?:e(?:r(?:d(?:ay?)?)?)?)? - yest, yeste, yester, yesterd, yesterda or yesterday
tod(?:ay?)? - tod, toda or today
tom(?:o(?:r(?:r(?:ow?)?)?)?)? - tom, tomo, tomor, tomorr, tomorro or tomorrow
\b - trailing word boundary
See Python demo:
import re
p = re.compile(r'\b(yest(?:e(?:r(?:d(?:ay?)?)?)?)?|tod(?:ay?)?|tom(?:o(?:r(?:r(?:ow?)?)?)?)?)\b', re.IGNORECASE)
test_str = "yest\nyeste\nyester\nyesterd\nyesterda\nyesterday\ntod\ntoda\ntoday\ntom\ntomo\ntomor\ntomorr\ntomorro\ntomorrow\n\nyesteray\ntomorow\ntommorrow\ntody\nyesteday"
print(p.findall(test_str))
# => ['yest', 'yeste', 'yester', 'yesterd', 'yesterda', 'yesterday', 'tod', 'toda', 'today', 'tom', 'tomo', 'tomor', 'tomorr', 'tomorro', 'tomorrow']
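Writing the nested optional groups by hand gets tedious as the word list grows. As a rough sketch (the helper name prefix_pattern and the per-word minimum prefix lengths are assumptions made for illustration, not part of the answer above), the same kind of pattern can be generated from the words themselves:

```python
import re

def prefix_pattern(word, min_len):
    """Build a regex fragment matching `word` or any prefix of it
    that is at least `min_len` characters long (hypothetical helper)."""
    head, tail = word[:min_len], word[min_len:]
    fragment = ""
    # Wrap the remaining letters in nested optional groups,
    # starting from the last letter and working outwards.
    for ch in reversed(tail):
        fragment = "(?:" + re.escape(ch) + fragment + ")?"
    return re.escape(head) + fragment

# Assumed minimum prefix lengths: "yest", "tod", "tom".
words = {"yesterday": 4, "today": 3, "tomorrow": 3}
pattern = re.compile(
    r"\b(?:" + "|".join(prefix_pattern(w, n) for w, n in words.items()) + r")\b",
    re.IGNORECASE,
)

matches = pattern.findall("yest tod tomor tomorow tody yesterday")
print(matches)
```

The generated fragment for today is tod(?:a(?:y)?)?, which is equivalent to the hand-written tod(?:ay?)? above.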

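Separate all the valid words and prefixes with pipes, as below. This matches only the listed spellings, as desired. Note that (?| is PCRE branch-reset syntax, which Python's re module does not support, so a non-capturing group (?:...) is used here:
^(?:yest|yesterday|tod|today)\b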
Tested this already at https://regex101.com/

Related

How to find all words that end with a colon (:) using regex

I am new to regex. Given the following text, I want to find each word, or run of consecutive words, that ends with a colon (:) using a regular expression.
Incarcerate: imprison or confine, Strike down: if someone is struck down, especially by an illness, they are killed or severely harmed by it, Accost: approach and address.
The output should look like this: Incarcerate:, Strike down:, Accost:. I have written the following regex, but it does not capture everything.
My regex -> (\w+):+
It captures single words such as Incarcerate: and Accost:, but it does not capture Strike down:.
Please help me.
I want to do it in both TypeScript and Python.
You can optionally repeat a space and 1+ word chars. Note that the words are in group 1, and the : is outside of the group.
(\w+(?: \w+)*):
To include the : in the match:
\w+(?: \w+)*:
The pattern matches
\w+ Match 1 or more word characters
(?: \w+)* Repeat 0+ times matching a space and 1+ word characters
: Match a single :
Example in Python
import re
s = "Incarcerate: imprison or confine, Strike down: if someone is struck down, especially by an illness, they are killed or severely harmed by it, Accost: approach and address."
pattern = r"\w+(?: \w+)*:"
print(re.findall(pattern, s))
Output
['Incarcerate:', 'Strike down:', 'Accost:']
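For the first variant, where the colon stays outside group 1, a minimal sketch using the same sample string could look like this (the variable names are chosen only for illustration). With exactly one capturing group in the pattern, re.findall returns the contents of that group rather than the whole match:

```python
import re

s = ("Incarcerate: imprison or confine, Strike down: if someone is struck down, "
     "especially by an illness, they are killed or severely harmed by it, "
     "Accost: approach and address.")

# Group 1 holds the word(s); the trailing colon is matched but not captured.
terms = re.findall(r"(\w+(?: \w+)*):", s)
print(terms)
# ['Incarcerate', 'Strike down', 'Accost']
```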

Python 2.7 Regex capture groups not working as predicted

I am trying to pattern match and replace first person with second person with Python 2.7.
string = re.sub(r'(\W)I(\W)', '\g<1>you\g<2>',string)
string = re.sub(r'(\W)(me)(\W)', '\g<1>you\g<3>',string)
# but does NOT work
string = re.sub(r'(\W)I|(me)(\W)', '\g<1>you\g<3>',string)
I want to use the last regex, but somehow the capture groups are all messed up and even doing a \g<0> shows strange, irregular matches. I would think that capture group 3 would be the last word boundary, but it doesn't appear to be.
A sample sentence could be: I like candy.
I am not interested very much in the correctness of the replacement (me will never actually be selected since I goes first), but I don't know why the capture groups don't work as I would expect.
Thanks!
Try the following regex.
Regex: \b(I|me)\b
Explanation:
\b on both sides marks the word boundary.
(I|me) matches either I OR me.
Note: you can make it case-insensitive with the i flag (re.IGNORECASE in Python).
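As for why the combined pattern in the question behaves oddly: alternation has the lowest precedence in a regex, so (\W)I|(me)(\W) is read as "(\W)I" or "(me)(\W)", and the groups belonging to the branch that did not match are left unset. The sketch below, on a made-up sample sentence, prints the groups for each match and then applies the \b-based replacement:

```python
import re

sentence = "I like candy. Give me candy, said I."  # made-up sample text

# The pattern from the question: two independent branches, not one
# "non-word char, then I or me, then non-word char" sequence.
broken = re.compile(r'(\W)I|(me)(\W)')
broken_groups = [m.groups() for m in broken.finditer(sentence)]
for groups in broken_groups:
    print(groups)  # groups from the branch that did not match are None

# Word boundaries avoid consuming the neighbouring characters at all,
# so no groups need to be copied back into the replacement.
fixed = re.sub(r'\b(?:I|me)\b', 'you', sentence)
print(fixed)
# you like candy. Give you candy, said you.
```

Note that the leading I is missed by the original pattern because nothing precedes it for (\W) to match, which is another reason to prefer \b here.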

identify new line in regex

I would like to run a regex over the text of Macbeth.
My text is as follows:
Scena Secunda.
Alarum within. Enter King Malcome, Donalbaine, Lenox, with
attendants,
meeting a bleeding Captaine.
King. What bloody man is that? he can report,
As seemeth by his plight, of the Reuolt
The newest state
My intention is to get the text from Enter to the full-stop.
I am trying this regular expression Enter(.?)*\.
But it is showing no matches. Can anybody fix my regexp?
Since @Tushar has not explained the issue you had with your regex, I decided to explain it.
Your regex - Enter(.?)*\. - matches a word Enter (literally), then optionally matches any character except a newline 0 or more times, as many as possible, up to the last period.
The problem is that your string contains a newline between the Enter and the period. You'd need a regex pattern to match newlines, too. To force . to match newline symbols, you may use DOTALL mode. However, it won't get you the expected result as the * quantifier is greedy (will return the longest possible substring).
So, to get the substring from Enter till the closest period, you can use
Enter([^.]*)
If you do not need the capture group, remove it.
And an IDEONE demo:
import re
p = re.compile(r'Enter([^.]*)')
test_str = "Scena Secunda.\n\nAlarum within. Enter King Malcome, Donalbaine, Lenox, with\nattendants,\nmeeting a bleeding Captaine.\n\n King. What bloody man is that? he can report,\nAs seemeth by his plight, of the Reuolt\nThe newest state"
print(p.findall(test_str)) # if you need the capture group text, or
# print(p.search(test_str).group()) # to get the whole first match, or
# print(re.findall(r'Enter[^.]*', test_str)) # to return all substrings from Enter till the next period
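To see the DOTALL and greediness points from the explanation in action, here is a small sketch on a shortened version of the text (the variable names are illustrative). It compares the original pattern, a greedy DOTALL pattern, and a lazy DOTALL pattern, the last being an alternative to the [^.]* approach:

```python
import re

text = ("Alarum within. Enter King Malcome, Donalbaine, Lenox, with\n"
        "attendants,\nmeeting a bleeding Captaine.\n\n"
        "King. What bloody man is that?")

# Without DOTALL, . cannot cross the newline, so the period is never reached.
no_dotall = re.search(r'Enter(.?)*\.', text)

# With DOTALL, the greedy .* runs on to the LAST period in the text.
greedy = re.search(r'Enter.*\.', text, re.DOTALL)

# A lazy .*? stops at the FIRST period after Enter.
lazy = re.search(r'Enter.*?\.', text, re.DOTALL)

print(no_dotall)                 # None
print(repr(greedy.group()))      # ends with 'King.'
print(repr(lazy.group()))        # ends with 'Captaine.'
```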

NOT operator for regex

In a Python script, I am cleaning a piece of text in which I want to replace the following words:
promocode, promo, code, coupon, coupon code, code.
However, I don't want to replace them if they start with a '#'. Thus, #promocode, #promo, #code and #coupon should remain as they are.
I tried the following regexes:
1. \b(promocode|promo code|promo|coupon code|code|coupon)\b
2. (?<!#)(promocode|promo code|promo|coupon code|code|coupon)
Neither of them works. I am basically looking for something that lets me say "does NOT start with #" followed by (promocode|promo code|promo|coupon code|code|coupon).
Any suggestions?
You need to use a negative look-behind:
(?<!#)\b(?:promocode|promo code|promo|coupon code|code|coupon)\b
The (?<!#) lookbehind ensures that these words are matched only when no # precedes them, and \b ensures that only whole words are matched. The non-capturing group (?:...) is there purely for grouping, so that \b does not have to be repeated around each alternative (e.g. \bpromo\b|\bcode\b...). Why a non-capturing group? So that it does not interfere with the match result: there is no need for the overhead of storing groups we never read.
See IDEONE demo, only the first promo is deleted:
import re
p = re.compile(r'(?<!#)\b(?:promocode|promo code|promo|coupon code|code|coupon)\b')
test_str = "promo #promo "
print(p.sub('', test_str))
A couple of words about your regular expressions.
The \b(promocode|promo code|promo|coupon code|code|coupon)\b regex is good, but it also matches these words when they are preceded by #, because \b matches between # and a letter.
The (?<!#)(promocode|promo code|promo|coupon code|code|coupon) regex is better, but it still does not restrict matches to whole words.
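The difference between the three patterns can be checked with a short sketch. The sample text and variable names below are made up for illustration:

```python
import re

words = r'(?:promocode|promo code|promo|coupon code|code|coupon)'
text = "promo #promo promocodes #code"  # made-up sample

# Word boundaries only: \b also matches between '#' and a letter,
# so the hashtagged words are matched too.
only_boundaries = re.findall(r'\b' + words + r'\b', text)

# Lookbehind only: hashtags are skipped, but 'promocode' is matched
# inside the longer word 'promocodes'.
only_lookbehind = re.findall(r'(?<!#)' + words, text)

# Lookbehind plus word boundaries: only the standalone 'promo' remains.
both = re.findall(r'(?<!#)\b' + words + r'\b', text)

print(only_boundaries)   # ['promo', 'promo', 'code']
print(only_lookbehind)   # ['promo', 'promocode']
print(both)              # ['promo']
```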

extract a sentence using python

I would like to extract the exact sentence if a particular word is present in that sentence. Could anyone let me know how to do it with python. I used concordance() but it only prints lines where the word matches.
Just a quick reminder: sentence breaking is actually a pretty complex thing. There are exceptions to the period rule, such as "Mr." or "Dr.", and there is a variety of sentence-ending punctuation marks. There are also exceptions to the exceptions (if the next word is capitalized and is not a proper noun, then "Dr." can end a sentence, for example).
If you're interested in this (it's a natural language processing topic), you could check out the Natural Language Toolkit's (NLTK) punkt module.
If you have each sentence in a string, you can use find() on your word and return the sentence if it is found. Otherwise you could use a regex, something like this:
pattern = r"\.?(?P<sentence>.*?good.*?)\."
match = re.search(pattern, yourwholetext)
if match is not None:
    sentence = match.group("sentence")
I haven't tested this, but something along those lines.
My test:
import re
text = "muffins are good, cookies are bad. sauce is awesome, veggies too. fmooo mfasss, fdssaaaa."
pattern = r"\.?(?P<sentence>.*?good.*?)\."
match = re.search(pattern, text)
if match is not None:
    print(match.group("sentence"))
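The find()-per-sentence idea mentioned above can be sketched as follows. This is a deliberately naive version: it assumes sentences end with ., ! or ? followed by whitespace, so abbreviations such as "Dr." will break it, and the function name sentences_containing is made up for illustration:

```python
import re

def sentences_containing(text, word):
    """Return the sentences of `text` that contain `word` as a whole word
    (hypothetical helper using a naive punctuation-based split)."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    needle = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
    return [s for s in sentences if needle.search(s)]

text = "muffins are good, cookies are bad. sauce is awesome, veggies too. fmooo mfasss, fdssaaaa."
print(sentences_containing(text, "good"))
# ['muffins are good, cookies are bad.']
```

Unlike the single re.search above, this returns every matching sentence, including the first one in the text, and keeps the terminating punctuation.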
dutt did a good job answering this. I just wanted to add a couple of things.
import re
text = "go directly to jail. do not cross go. do not collect $200."
pattern = r"\.(?P<sentence>.*?(go).*?)\."
match = re.search(pattern, text)
if match is not None:
    sentence = match.group("sentence")
Obviously, you'll need to import the regex library (import re) before you begin. Here is a teardown of what the regular expression actually does (more info can be found on the Python re library page):
\. # looks for a period preceding sentence.
(?P<sentence>...) # sets the regex captured to variable "sentence".
.*? # selects all text (non-greedy) until the word "go".
Again, the library reference page is key.
