Building a regular expression to find text near each other - python

I'm having an issue getting this search to work:
import re
word1 = 'this'
word2 = 'that'
sentence = 'this and that'
print(re.search('(?:\b(word1)\b(?: +[^ \n]*){0,5} *\b(word2)\b)|(?:\b(word2)\b(?: +[^ \n]*){0,5} *\b(word1)\b)',sentence))
I need to build a regex search to find if a string has up to 5 different sub-strings in any order within a certain number of other words (so two strings could be 3 words apart, three strings a total of 6 words apart, etc).
I've found a number of similar questions such as Regular expression gets 3 words near each other. How to get their context? or How to check if two words are next to each other in Python?, but none of them quite do this.
So if the search words were 'this', 'that', 'these', and 'those' and they appeared within 9 words of each other in any order, then the script would output True.
It seems like writing an if/else block with all sorts of different regex statements to accommodate the different permutations would be rather cumbersome, so I'm hoping there is a more efficient way to code this in Python.

This can be done with engines that support conditionals, atomic groups,
and a capture group status that distinguishes EMPTY (matched nothing) from NULL (undefined, never took part in the match).
That covers almost all modern engines, though some, like JS, are incomplete.
Python can support this through the third-party regex module (import regex), a replacement for the built-in re.
Basically, this supports the words in any order and can be confined to the shortest
range of 4 to 9 total words.
The bottom (?= \1 \2 \3 \4 ) asserts that all the required items were found.
Using this without the atomic group might cause backtrack problems, but since it
is there, this regex is very fast.
update: added lookahead (?= this | that | these | those ) so it starts match on a special word.
Python code
>>> import regex
>>>
>>> targ = 'this sdgbsesfrgnh these meat ball those nhwsgfr that sfdng sfgnsefn sfgnndfsng'
>>> pat = r'(?=this|that|these|those)(?>\s*(?:(?(1)(?!))\bthis\b()|(?(2)(?!))\bthat\b()|(?(3)(?!))\bthese\b()|(?(4)(?!))\bthose\b()|(?(5)(?!))\b(.+?)\b|(?(6)(?!))\b(.+?)\b|(?(7)(?!))\b(.+?)\b|(?(8)(?!))\b(.+?)\b|(?(9)(?!))\b(.+?)\b)\s*){4,9}?(?=\1\2\3\4)'
>>>
>>> regex.search(pat, targ).group()
'this sdgbsesfrgnh these meat ball those nhwsgfr that '
General PCRE / Perl et al. (same regex)
(?=this|that|these|those)(?>\s*(?:(?(1)(?!))\bthis\b()|(?(2)(?!))\bthat\b()|(?(3)(?!))\bthese\b()|(?(4)(?!))\bthose\b()|(?(5)(?!))\b(.+?)\b|(?(6)(?!))\b(.+?)\b|(?(7)(?!))\b(.+?)\b|(?(8)(?!))\b(.+?)\b|(?(9)(?!))\b(.+?)\b)\s*){4,9}?(?=\1\2\3\4)
https://regex101.com/r/zhSa64/1
(?= this | that | these | those )
(?>
    \s*
    (?:
        (?(1)(?!))
        \b this \b ( )       # (1)
      |
        (?(2)(?!))
        \b that \b ( )       # (2)
      |
        (?(3)(?!))
        \b these \b ( )      # (3)
      |
        (?(4)(?!))
        \b those \b ( )      # (4)
      |
        (?(5)(?!))
        \b ( .+? ) \b        # (5)
      |
        (?(6)(?!))
        \b ( .+? ) \b        # (6)
      |
        (?(7)(?!))
        \b ( .+? ) \b        # (7)
      |
        (?(8)(?!))
        \b ( .+? ) \b        # (8)
      |
        (?(9)(?!))
        \b ( .+? ) \b        # (9)
    )
    \s*
){4,9}?
(?= \1 \2 \3 \4 )
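As a cross-check on the regex, the same proximity test can be sketched with a plain sliding window over the tokenised text. This is a different technique from the pattern above, not a translation of it, and the helper name all_within and its max_span parameter are hypothetical names chosen for illustration.

```python
import re

def all_within(text, targets, max_span):
    """Return True if every target word occurs inside some run of
    max_span consecutive words of text (case-insensitive)."""
    tokens = re.findall(r"\w+", text.lower())
    wanted = {t.lower() for t in targets}
    for start, token in enumerate(tokens):
        # A qualifying window can always be shifted to begin on a target word
        if token not in wanted:
            continue
        window = set(tokens[start:start + max_span])
        if wanted <= window:
            return True
    return False

targ = 'this sdgbsesfrgnh these meat ball those nhwsgfr that sfdng sfgnsefn sfgnndfsng'
targets = ['this', 'that', 'these', 'those']
print(all_within(targ, targets, 9))  # True: all four words sit inside 8 words
print(all_within(targ, targets, 7))  # False: no 7-word window holds all four
```

The window only counts whole tokens, so it sidesteps backtracking entirely, at the cost of not returning the matched text the way the regex does.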

ANSWER CHANGED because I found a way to do it with just a regular expression. The approach is to start with a lookahead that requires all target words to be present in the next N words. Then look for a pattern of target words (in any order) separated by 0 or more other words (up to the allowed maximum intermediate words)
The word span (N) is the greatest number of words that would allow all the target words to be at the maximum allowed distance.
For example, if we have 3 target words, and we allow a maximum of 4 other words between them, then the maximum word span will be 11. So 3 target words plus 2 intermediate series of maximum 4 other words 3+4+4=11.
The search pattern is formed by assembling parts that depend on the words and the maximum number of intermediate words allowed.
Pattern : \bALL(\b(ANY)(\W+\w+\W*){0,INTER}){COUNT,COUNT}
breakdown:
\b start on a word boundary
ALL will be substituted by multiple lookaheads that will ensure that every target word is found in the next N words.
each lookahead will have the form (?=(\w+\W*){0,SPAN}WORD\b) where WORD is a target word and SPAN is the number of other words in the longest possible sequence of words. There will be one such lookahead for each of the target words. Thus ensuring that the sequence of N words contains all of target words.
(\b(ANY)(\W+\w+\W*){0,INTER}) matches any target word followed by zero to maxInter intermediate words. Here, ANY will be replaced by a pattern that matches any of the target words (i.e. the words separated by pipes), and INTER will be replaced by the allowed number of intermediate words.
{COUNT,COUNT} ensures that there are as many repetitions of the above as there are target words. This corresponds to the pattern: targetWord+intermediates+targetWord+intermediates...+targetWord
With the look ahead placed before the repeating pattern, we are guaranteed to have all the target words in the sequence of words containing exactly the number of target words with no more intermediate words than is allowed.
...
import re
words = {"this","that","other"}
maxInter = 3 # maximum intermediate words between the target words
wordSpan = len(words)+maxInter*(len(words)-1)
anyWord = "|".join(words)
allWords = "".join(r"(?=(\w+\W*){0,SPAN}WORD\b)".replace("WORD",w)
                   for w in words)
allWords = allWords.replace("SPAN",str(wordSpan-1))
pattern = r"\bALL(\b(ANY)(\W+\w+\W*){0,INTER}){COUNT,COUNT}"
pattern = pattern.replace("COUNT",str(len(words)))
pattern = pattern.replace("INTER",str(maxInter))
pattern = pattern.replace("ALL",allWords)
pattern = pattern.replace("ANY",anyWord)
textList = [
    "looking for this and that and some other thing", # YES
    "that rod is longer than this other one", # NO: 4 words apart
    "other than this, I have nothing", # NO: missing "that"
    "ignore multiple words not before this and that or other", # YES
    "this and that or other, followed by a bunch of words", # YES
]
print(pattern)
for text in textList:
    found = bool(re.search(pattern,text))
    print(found,"\t:",text)
output (the order of the words inside the pattern can vary between runs because words is a set):
\b(?=(\w+\W*){0,8}this\b)(?=(\w+\W*){0,8}other\b)(?=(\w+\W*){0,8}that\b)(\b(this|other|that)(\W+\w+\W*){0,3}){3,3}
True : looking for this and that and some other thing
False : that rod is longer than this other one
False : other than this, I have nothing
True : ignore multiple words not before this and that or other
True : this and that or other, followed by a bunch of words
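The assembly steps above could also be wrapped in a small function. The sketch below follows the same lookahead-plus-repetition idea, with three changes that are my own assumptions rather than part of the answer: the words are sorted so the pattern text is reproducible, each word goes through re.escape, and the groups are non-capturing. The name build_proximity_pattern is hypothetical.

```python
import re

def build_proximity_pattern(words, max_inter):
    """Assemble the lookahead-plus-repetition pattern for the target words."""
    ordered = sorted(words)  # fixed order, so the pattern text is reproducible
    span = len(ordered) + max_inter * (len(ordered) - 1)
    # One lookahead per target word: at most span - 1 words may precede it
    lookaheads = "".join(
        r"(?=(?:\w+\W*){0,%d}%s\b)" % (span - 1, re.escape(w)) for w in ordered
    )
    any_word = "|".join(re.escape(w) for w in ordered)
    # A target word plus up to max_inter other words, once per target word
    return r"\b%s(?:\b(?:%s)(?:\W+\w+\W*){0,%d}){%d}" % (
        lookaheads, any_word, max_inter, len(ordered)
    )

pattern = build_proximity_pattern({"this", "that", "other"}, 3)
texts = [
    "looking for this and that and some other thing",
    "that rod is longer than this other one",
    "other than this, I have nothing",
    "ignore multiple words not before this and that or other",
    "this and that or other, followed by a bunch of words",
]
for text in texts:
    print(bool(re.search(pattern, text)), ":", text)
```

Escaping matters as soon as a target word contains a character such as "." or "+", which would otherwise be read as regex syntax.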

Related

Returning the word with multiple instances of a character using regex

I tried finding a solution by using only re.compile and findall but I can't seem to get it. I am looking to return the string which has multiple instances of any character inside (e.g macdonalds kfc burgerking) and the result should be
[(macdonalds, ad),(burgerking,rg)]
I tried the code
p = re.compile(r'\w+(.)\1{1,}\w+')
p.findall('macdonalds kfc burgerking')
but it could only find repeated characters that occur consecutively (e.g. Baaaabaaa sheep).
With Python re, you can use
import re
rx = re.compile(r'\w*?(\w)\w*\1\w*')
rx_extract_dupe = re.compile(r'(.)(?=.*\1)')
text = 'macdonalds kfc burgerking'
matches = rx.finditer(text)
print( [(x.group(), "".join(rx_extract_dupe.findall(x.group()))) for x in matches] )
# => [('macdonalds', 'ad'), ('burgerking', 'rg')]
See this Python demo. With \w*?(\w)\w*\1\w* regex, you extract all words with at least one char repetition, and then (.)(?=.*\1) with .findall() applied to the match gets the list of duplicated chars.
If you install the PyPI regex library, you can use
import regex
rx = regex.compile(r'(?:\w*?(\w)(?=\w*\1))+\w*')
text = 'macdonalds kfc burgerking'
print( [(x.group(), "".join(x.captures(1)) ) for x in rx.finditer(text)] )
# => [('macdonalds', 'ad'), ('burgerking', 'rg')]
See the Python demo.
Details:
(?:\w*?(\w)(?=\w*\1))+ - one or more sequences of
\w*? - zero or more word chars, as few as possible
(\w) - Group 1: any single word char
(?=\w*\1) - that is followed with zero or more word chars and then the same word char as captured in Group 1
\w* - any zero or more word chars.
The x.group() contains a match value and x.captures(1) contains all occurrences of the repeated characters in a word.
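For comparison, here is a sketch of the same task without any regex, using collections.Counter. It is a swapped-in technique, not the approach of the answer above, and the function name words_with_repeats is hypothetical. One behavioural difference to be aware of: it lists each repeated character once, whereas (.)(?=.*\1) lists a character once for every occurrence that has a later duplicate.

```python
from collections import Counter

def words_with_repeats(text):
    """Pair each whitespace-separated word with its repeated characters."""
    result = []
    for word in text.split():
        counts = Counter(word)
        # dict.fromkeys keeps first-appearance order while dropping duplicates
        repeated = "".join(ch for ch in dict.fromkeys(word) if counts[ch] > 1)
        if repeated:
            result.append((word, repeated))
    return result

print(words_with_repeats('macdonalds kfc burgerking'))
# => [('macdonalds', 'ad'), ('burgerking', 'rg')]
```

For a word like banana this returns 'an', while the two-regex version would return 'ana'.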

regex match a word after a certain character

I would like to match a word when it is after a char m or b
So for example, when the word is men, I would like to return en (only the word that is following m), if the word is beetles then return eetles
Initially I tried (m|b)\w+ but it matches the entire men not en
How do I write regex expression in this case?
Thank you!
You could get the match only using a positive lookbehind asserting what is on the left is either m or b using character class [mb] preceded by a word boundary \b
(?<=\b[mb])\w+
(?<= Positive lookbehind, assert what is directly to the left is
\b[mb] Word boundary, match either m or b
) Close lookbehind
\w+ Match 1+ word chars
Regex demo
If there cannot be anything after the word characters, you can assert a whitespace boundary at the right using (?!\S)
(?<=\b[mb])\w+(?!\S)
Regex demo | Python demo
Example code
import re
test_str = ("beetles men")
regex = r"(?<=\b[mb])\w+"
print(re.findall(regex, test_str))
Output
['eetles', 'en']
You may use
\b[mb](\w+)
See the regex demo.
NOTE: When your known prefixes include multicharacter sequences, say, you want to find words starting with m or be, you will have to use a non-capturing group rather than a character class: \b(?:m|be)(\w+). The current solution can thus be written as \b(?:m|b)(\w+) (however, a character class here looks more natural, unless you have to build the regex dynamically).
Details
\b - a word boundary
[mb] - m or b
(\w+) - Capturing group 1: any one or more word chars, letters, digits or underscores. To match only letters, use ([^\W\d_]+) instead.
Python demo:
import re
rx = re.compile(r'\b[mb](\w+)')
text = "The words are men and beetles."
# First occurrence:
m = rx.search(text)
if m:
print(m.group(1)) # => en
# All occurrences
print( rx.findall(text) ) # => ['en', 'eetles']
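When the prefixes are only known at run time, the non-capturing group mentioned in the note can be built from a list. This is a minimal sketch; the helper name strip_prefix_pattern is hypothetical, and sorting the prefixes longest-first is my own assumption so that a longer prefix such as be is tried before a shorter one such as b.

```python
import re

def strip_prefix_pattern(prefixes):
    """Compile \\b(?:p1|p2|...)(\\w+) from a list of literal prefixes."""
    # Longest prefixes first, so "be" wins over "b" when both are given
    ordered = sorted(prefixes, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    return re.compile(r"\b(?:%s)(\w+)" % alternation)

rx = strip_prefix_pattern(["m", "be"])
print(rx.findall("The words are men and beetles."))  # => ['en', 'etles']
```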
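(?<=[mb])\w+
You can use the regex above. It matches the word characters that follow an m or a b. Note that without a \b in the lookbehind it also matches after an m or b in the middle of a word.
(?<=[mb]): positive lookbehind, asserts that the previous character is m or b
\w+: matches one or more word characters (equal to [a-zA-Z0-9_]+)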
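A quick comparison of the lookbehind with and without the word boundary, on a made-up test string, shows why the \b matters:

```python
import re

text = "men beetles lamb"

# Without \b the lookbehind also fires after the "m" inside "lamb"
no_boundary = re.findall(r"(?<=[mb])\w+", text)
# With \b only an m or b at the start of a word counts
with_boundary = re.findall(r"(?<=\b[mb])\w+", text)

print(no_boundary)    # => ['en', 'eetles', 'b']
print(with_boundary)  # => ['en', 'eetles']
```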

Regex complete words pattern

I want to get patterns involving complete words, not pieces of words.
E.g. 12345 [some word] 1234567 [some word] 123 1679. Random text and the pattern appears again 1111 123 [word] 555.
This should return
[[12345, 1234567, 123, 1679],[1111, 123, 555]]
I am only tolerating one word between the numbers otherwise the whole string would match.
Also note that it is important to capture that 2 matches were found and so a two-element list was returned.
I am running this in python3.
I have tried:
\b(\d+)\b\s\b(\w+)?\b\s\b(\d+)\b
but I am not sure how to scale this to an unrestricted number of matches.
re.findall('\b(\d+)\b\s\b(\w+)?\b\s\b(\d+)\b', string)
This matches [number] [word] [number] but not any number that might follow with or without a word in between.
Are you expecting re.findall() to return a list of lists? It will only return a list - no matter what regex you use.
One approach is to split your input string into sentences and then loop through them
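import re
inputArray = re.split('<pattern>', inputText)
outputArray = []
for item in inputArray:
    outputArray.append(re.findall(r'\b(\d+)\b\s\b(\w+)?\b\s\b(\d+)\b', item))
The trick is to find a <pattern> to split your input. Note the raw string (r'...') in findall: without it, Python reads \b as a backspace character rather than a word boundary.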
You can't do this in one operation with the Python re engine.
But you could match the sequence with one match, then extract the
digits with another.
This matches the sequence
r"(?<!\w)\d+(?:(?:[^\S\r\n]+[a-zA-Z](?:\w*[a-zA-Z])*)?[^\S\r\n]+\d+)*(?!\w)"
https://regex101.com/r/73AYLU/1
Explained
(?<! \w ) # Not a word behind
\d+ # Many digits
(?: # Optional word block
(?: # Optional words
[^\S\r\n]+ # Horizontal whitespace
[a-zA-Z] # Starts with a letter
(?: \w* [a-zA-Z] )* # Can be digits in middle, ends with a letter
)? # End words, do once
[^\S\r\n]+ # Horizontal whitespace
\d+ # Many digits
)* # End word block, do many times
(?! \w ) # Not a word ahead
This gets the array of digits from the sequence matched above (use findall)
r"(?<!\S)(\d+)(?!\S)"
https://regex101.com/r/BHov38/1
Explained
(?<! \S ) # Whitespace boundary
( \d+ ) # (1)
(?! \S ) # Whitespace boundary
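Putting the two steps together could look like the sketch below. The two patterns are the ones given above; the function name number_runs and the sample sentence (with real words standing in for the [some word] placeholders) are assumptions for illustration.

```python
import re

# Step 1: one match per run of numbers with at most one word between them
SEQUENCE = re.compile(
    r"(?<!\w)\d+(?:(?:[^\S\r\n]+[a-zA-Z](?:\w*[a-zA-Z])*)?[^\S\r\n]+\d+)*(?!\w)"
)
# Step 2: whitespace-delimited numbers inside one such run
NUMBER = re.compile(r"(?<!\S)(\d+)(?!\S)")

def number_runs(text):
    """Return one list of ints per matched sequence."""
    return [
        [int(n) for n in NUMBER.findall(m.group())]
        for m in SEQUENCE.finditer(text)
    ]

sample = ("12345 apples 1234567 pears 123 1679. Random text and the "
          "pattern appears again 1111 123 word 555.")
print(number_runs(sample))
# => [[12345, 1234567, 123, 1679], [1111, 123, 555]]
```

A lone number with no neighbours would come back as a one-element list, so filter on length if only real sequences are wanted.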
This is a bit complicated, maybe this expression would be just something to look into:
(((\d+)\s*)*(?:\s*\[.*?\]\s*)((\d+)\s*)*)|([A-Za-z\s]+)
and script the rest of the problem for a valid solution.
Demo

Trying to repeat the regex breaks the regex

I have a working regex that matches ONE of the following lines:
A punctuation from the following list [.,!?;]
A word that is preceded by the beginning of the string or a space.
Here's the regex in question ([.,!?;] *|(?<= |\A)[\-'’:\w]+)
What I need it to do however is for it to match 3 instances of this. So, for example, the ideal end result would be something like this.
Sample text: "This is a test. Test"
Output
"This" "is" "a"
"is" "a" "test"
"a" "test" "."
"test" "." "Test"
I've tried simply adding {3} to the end in the hopes of it matching 3 times. This however results in it matching nothing at all or the occasional odd character. The other possibility I've tried is just repeating the whole regex 3 times like so ([.,!?;] *|(?<= |\A)[\-'’:\w]+)([.,!?;] *|(?<= |\A)[\-'’:\w]+)([.,!?;] *|(?<= |\A)[\-'’:\w]+) which is horrible to look at but I hoped it would work. This had the odd effect of working, but only if at least one of the matches was one of the previously listed punctuation.
Any insights would be appreciated.
I'm using the new regex module found here so that I can have overlapping searches.
What is wrong with your approach
The ([.,!?;] *|(?<= |\A)[\-'’:\w]+) pattern matches a single "unit" (either a word or a single punctuation mark from the specified set [.,!?;] followed by 0+ spaces). Thus, when you feed this pattern to regex.findall, it can only return the list of individual chunks: ['This', 'is', 'a', 'test', '. ', 'Test'].
Solution
You can use a slightly different approach: match all words, and all chunks that are not words. Here is a demo (note that C'est and AUX-USB are treated as single "words"):
>>> pat = r"((?:[^\w\s'-]+(?=\s|\b)|\b(?<!')\w+(?:['-]\w+)*))\s*((?1))\s*((?1))"
>>> results = regex.findall(pat, text, overlapped = True)
>>> results
[("C'est", 'un', 'test'), ('un', 'test', '....'), ('test', '....', 'aux-usb')]
Here, the pattern has 3 capture groups, and the second and third one contain the same pattern as in Group 1 ((?1) is a subroutine call used in order to avoid repeating the same pattern used in Group 1). Group 2 and Group 3 can be separated with whitespaces (not necessarily, or the punctuation glued to a word would not be matched). Also, note the negative lookbehind (?<!') that will ensure that C'est is treated as a single entity.
Explanation
The pattern details:
((?:[^\w\s'-]+(?=\s|\b)|\b(?<!')\w+(?:['-]\w+)*)) - Group 1 matching:
(?:[^\w\s'-]+(?=\s|\b) - 1+ characters other than [a-zA-Z0-9_], whitespace, ' and - immediately followed with a whitespace or a word boundary
| - or
\b(?<!')\w+(?:['-]\w+)*) - 1+ word characters not preceded with a ' (due to (?<!')) and preceded with a word boundary (\b) and followed with 0+ sequences of - or ' followed with 1+ word characters.
\s* - 0+ whitespaces
((?1)) - Group 2 (same pattern as for Group 1)
\s*((?1)) - see above
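If installing the regex module is not an option, overlapping triples can also be produced with the standard re module by tokenising first and then sliding a window of three over the tokens. This is a swapped-in technique rather than the subroutine-call pattern explained above; the token pattern is a simplification of the one in the question, and the name token_triples is hypothetical.

```python
import re

# A token is one punctuation mark from the list, or a run of word-like characters
TOKEN = re.compile(r"[.,!?;]|[-'’:\w]+")

def token_triples(text):
    """Return every run of three consecutive tokens, overlapping."""
    tokens = TOKEN.findall(text)
    return list(zip(tokens, tokens[1:], tokens[2:]))

for triple in token_triples("This is a test. Test"):
    print(triple)
# ('This', 'is', 'a')
# ('is', 'a', 'test')
# ('a', 'test', '.')
# ('test', '.', 'Test')
```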

regex continue only if positive lookahead has been matched at least once

Using Python: how do I get the regex to continue only if a positive lookahead has been matched at least once?
I'm trying to match:
Clinton-Orfalea-Brittingham Fellowship Program
Here's the code I'm using now:
dp2= r'[A-Z][a-z]+(?:-\w+|\s[A-Z][a-z]+)+'
print np.unique(re.findall(dp2, tt))
I'm matching the word, but it's also matching a bunch of other extraneous words.
My thought was that I'd like the \s[A-Z][a-z] to kick in ONLY IF -\w+ has been hit at least once (or maybe twice). I would appreciate any thoughts.
To clarify: I'm not aiming to match specifically this set of words, but to be able to generically match Proper noun- Proper noun- (indefinite number of times) and then a non-hyphenated Proper noun.
eg.
Noun-Noun-Noun Noun Noun
Noun-Noun Noun
Noun-Noun-Noun Noun
THE LATEST ITERATION:
dp5= r'(?:[A-Z][a-z]+-?){2,3}(?:\s\w+){2,4}'
The {m,n} notation can be used to force the regex to ONLY MATCH if the previous expression exists between m and n times. Maybe something like
(?:[A-Z][a-z]+-?){2,3}\s\w+\s\w+ # matches 'Clinton-Orfalea-Brittingham Fellowship Program'
If you're SPECIFICALLY looking for "Clinton-Orfalea-Brittingham Fellowship Program", why are you using Regex to find it? Just use word in string. If you're looking for things of the form: Name-Name-Name Noun Noun, this should work, but be aware that Name-Name-Name-Name Noun Noun won't, nor will Name-Name-Name Noun Noun Noun (In fact, something like "Alice-Bob-Catherine Program" will match not only that but whatever word comes after it!)
# Explanation
RE = r"""(?: # Begins the group so we can repeat it
[A-Z][a-z]+ # Matches one cap letter then any number of lowercase
-? # Allows a hyphen at the end of the word w/o requiring it
){2,3} # Ends the group and requires the group match 2 or 3 times in a row
\s\w+ # Matches a space and the next word
\s\w+ # Does so again
# those last two lines could just as easily be (?:\s\w+){2}
"""
RE = re.compile(RE, re.VERBOSE) # will compile the expression as written (the flag name is uppercase)
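To see both the intended match and the caveat about swallowing the following word, the verbose pattern can be run on two made-up strings. The constant name NAME_PATTERN is hypothetical, and the two trailing \s\w+ lines are folded into (?:\s\w+){2} as the comment above suggests.

```python
import re

NAME_PATTERN = re.compile(r"""
    (?:              # group, so it can be repeated
        [A-Z][a-z]+  # one capital letter, then lowercase letters
        -?           # optional trailing hyphen
    ){2,3}           # two or three such words in a row
    (?:\s\w+){2}     # then exactly two more space-separated words
""", re.VERBOSE)

first = NAME_PATTERN.search("Clinton-Orfalea-Brittingham Fellowship Program").group()
second = NAME_PATTERN.search("Alice-Bob-Catherine Program starts today").group()
print(first)   # Clinton-Orfalea-Brittingham Fellowship Program
print(second)  # Alice-Bob-Catherine Program starts
```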
If you're looking specifically for hyphenated proper nouns followed by non-hyphenated proper nouns, I would do this:
[A-Z][a-z]+-(?:[A-Z][a-z]+(?:-|\s))+
# Explanation
RE = r"""[A-Z][a-z]+- # Cap letter+small letters ending with a hyphen
(?: # start a non-cap group so we can repeat it
[A-Z][a-z]+# As before, but doesn't require a hyphen
(?:
-|\s # but if it doesn't have a hyphen, it MUST have a space
) # (this group is just to give precedence to the |
)+ # can match multiple of these.
"""

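One possible variant, offered as an assumption rather than as the answer's own pattern, requires at least one hyphenated continuation before any space-separated proper nouns are accepted. It also avoids the trailing separator that the pattern above needs after the last word. The constant name HYPHENATED_THEN_PLAIN is hypothetical.

```python
import re

HYPHENATED_THEN_PLAIN = re.compile(
    r"[A-Z][a-z]+"          # first proper noun
    r"(?:-[A-Z][a-z]+)+"    # one or more hyphen-joined proper nouns
    r"(?:\s[A-Z][a-z]+)+"   # then one or more space-separated proper nouns
)

text = "The Clinton-Orfalea-Brittingham Fellowship Program and the New York Times"
print(HYPHENATED_THEN_PLAIN.findall(text))
# => ['Clinton-Orfalea-Brittingham Fellowship Program']
```

Plain capitalised runs such as New York Times are skipped because the hyphenated part is mandatory.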