How to use Regular Expressions to find duplicate non-consecutive words? - python

I need to reprint lines of a poem which match specific rules. The rule I have been having trouble with is reprinting a line if it contains a word which appears more than once.
For example, "I have to go out with Jane" would not print, whereas "I have to go out to the movies with Jane" would print, because the word "to" is repeated in the line.
import re

Rules = ['']
Yip = open('poem.txt', 'r')
Lines = Yip.read().split('\n')
n = 1
for r in Rules:
    i = 1
    print("\nMatching rule", n)
    for ln in Lines:
        # print the line number of every line that matches the current rule
        if re.search(r, ln):
            print(i, end = ", ")
        i = i + 1
    n = n + 1
So far I have the pattern '(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+', which finds duplicate words, but only consecutive ones.
I also have '^(?=(.*?to){2}).*$', which I believe is my closest attempt. It prints the line above because it finds both instances of 'to', but the problem is that it only works for 'to'.
I'm trying to figure out whether there's a way to write a pattern that prints the line if it finds a non-consecutive duplicate of any word, so that it works on any given line.

The general regex that matches consecutive and non-consecutive duplicate words is
\b(\w+)\b(?=.*?\b\1\b)
See the regex demo
To make the pattern search for duplicate words across lines, make sure . matches line break chars, for example:
(?s)\b(\w+)\b(?=.*?\b\1\b)
^^^^
Or, use re.S or re.DOTALL in Python re.
To make it case insensitive, add i modifier, or use re.I / re.IGNORECASE:
(?si)\b(\w+)\b(?=.*?\b\1\b)
^^^^^
Pattern details
\b - word boundary
(\w+) - Group 1: one or more word chars (letters, digits, _)
\b - word boundary
(?=.*?\b\1\b) - a positive lookahead that matches a location immediately followed with
.*? - any 0+ chars, as few as possible
\b\1\b - Group 1 value as whole word (we need to use \b word boundaries again here since \1 does not "remember" the context where (\w+) matched).
Python demo:
import re
strs = ['I have to go out with Jane','I have to go out to the movies with Jane']
rx = re.compile(r'(?si)\b(\w+)\b(?=.*?\b\1\b)')
for s in strs:
    print(s, "=>", rx.findall(s))
Output:
I have to go out with Jane => []
I have to go out to the movies with Jane => ['to']
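To tie this back to the loop in the question, here is a minimal sketch that collects the numbers of the lines containing a repeated word. The `lines` list is a hypothetical stand-in for the contents of poem.txt, and the third line is made up for illustration.

```python
import re

# Hypothetical stand-in for the lines read from poem.txt.
lines = [
    "I have to go out with Jane",
    "I have to go out to the movies with Jane",
    "The cat saw the other cat",
]

# Same pattern as above; (?s) is dropped because each line is checked on its own.
dup_word = re.compile(r"(?i)\b(\w+)\b(?=.*?\b\1\b)")

# Collect the 1-based numbers of the lines that contain a repeated word.
matching = [i for i, ln in enumerate(lines, start=1) if dup_word.search(ln)]
print(matching)
```

Using `search` rather than `findall` is enough here, since the rule only needs to know whether a line has at least one repeated word.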

Related

A regex pattern that matches all words starting from a word with an s and stopping before a word that starts with an s

I'm trying to capture words in a string such that the first word starts with an s, and the regex stops matching if the next word also starts with an s.
For example. I have the string " Stack, Code and StackOverflow". I want to capture only " Stack, Code and " and not include "StackOverflow" in the match.
This is what I am thinking:
Start with a space followed by an s.
Match everything except if the group is a space and an s (I'm using negative lookahead).
The regex I have tried:
(?<=\s)S[a-z -,]*(?!(\sS))
I don't know how to make it work.
I think this should work. I adapted the regex from this thread. You can also test it out here. I have also included a non-regex solution. I basically track the first occurrence of a word starting with an 's' and the next word starting with an 's' and get the words in that range.
import re
teststring = " Stack, Code and StackOverflow"
extractText = re.search(r"(\s)[sS][^*\s]*[^sS]*", teststring)
print(extractText[0])
# non-regex solution
listwords = teststring.split(' ')
start = 0
end = 0
for i, word in enumerate(listwords):
    if word.startswith('s') or word.startswith('S'):
        if start == 0:
            start = i
        else:
            end = i
            break
newstring = " " + " ".join([word for word in listwords[start:end]])
print(newstring)
Output
Stack, Code and
Stack, Code and
You could use for example a capture group:
(S(?<!\S.).*?)\s*S(?<!\S.)
Explanation
( Capture group 1
S(?<!\S.) Match S and assert that to the left of the S there is a whitespace boundary (whitespace or the start of the string)
.*? Match any character, as few as possible
) Close group
\s* Match optional whitespace chars
S(?<!\S.) Match S and assert that to the left of the S there is a whitespace boundary
See a regex demo and a Python demo.
Example code:
import re
pattern = r"(S(?<!\S.).*?)\s*S(?<!\S.)"
s = "Stack, Code and StackOverflow"
m = re.search(pattern, s)
if m:
    print(m.group(1))
Output
Stack, Code and
Another option using a lookaround to assert the S to the right and not consume it to allow multiple matches after each other:
S(?<!\S.).*?(?=\s*S(?<!\S.))
Regex demo
import re
pattern = r"S(?<!\S.).*?(?=\s*S(?<!\S.))"
s = "Stack, Code and StackOverflow test Stack"
print(re.findall(pattern, s))
Output
['Stack, Code and', 'StackOverflow test']
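The `S(?<!\S.)` construct can be hard to read. As a rough sketch, the same whitespace boundary can also be written as a lookbehind placed before the S, `(?<!\S)S`. The variable names below are only for illustration, and the comparison is checked on this one sample string, not proven in general.

```python
import re

s = "Stack, Code and StackOverflow test Stack"

# Pattern from the answer: the lookbehind is placed after the S.
after_s = re.findall(r"S(?<!\S.).*?(?=\s*S(?<!\S.))", s)

# Assumed equivalent spelling: the lookbehind is placed before the S.
before_s = re.findall(r"(?<!\S)S.*?(?=\s*(?<!\S)S)", s)

print(after_s)
print(before_s)
```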

Returning the word with multiple instances of a character using regex

I tried finding a solution using only re.compile and findall, but I can't seem to get it. I want to return each word that contains multiple instances of any character (e.g. from macdonalds kfc burgerking), and the result should be
[(macdonalds, ad),(burgerking,rg)]
I tried the code
p = re.compile(r'\w+(.)\1{1,}\w+')
p.findall('macdonalds kfc burgerking')
but it only finds characters that are repeated consecutively (e.g. Baaaabaaa sheep).
With Python re, you can use
import re
rx = re.compile(r'\w*?(\w)\w*\1\w*')
rx_extract_dupe = re.compile(r'(.)(?=.*\1)')
text = 'macdonalds kfc burgerking'
matches = rx.finditer(text)
print( [(x.group(), "".join(rx_extract_dupe.findall(x.group()))) for x in matches] )
# => [('macdonalds', 'ad'), ('burgerking', 'rg')]
See this Python demo. With \w*?(\w)\w*\1\w* regex, you extract all words with at least one char repetition, and then (.)(?=.*\1) with .findall() applied to the match gets the list of duplicated chars.
If you install the PyPi regex library you can use
import regex
rx = regex.compile(r'(?:\w*?(\w)(?=\w*\1))+\w*')
text = 'macdonalds kfc burgerking'
print( [(x.group(), "".join(x.captures(1)) ) for x in rx.finditer(text)] )
# => [('macdonalds', 'ad'), ('burgerking', 'rg')]
See the Python demo.
Details:
(?:\w*?(\w)(?=\w*\1))+ - one or more sequences of
\w*? - zero or more word chars, as few as possible
(\w) - Group 1: any single word char
(?=\w*\1) - that is followed with zero or more word chars and then the same word char as captured in Group 1
\w* - any zero or more word chars.
The x.group() contains a match value and x.captures(1) contains all occurrences of the repeated characters in a word.
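For comparison, here is a small non-regex sketch that produces the same result for the sample input. The helper name `repeated_chars` is made up for illustration. Unlike `(.)(?=.*\1)`, it lists each repeated character only once, even if it occurs three or more times.

```python
text = "macdonalds kfc burgerking"

def repeated_chars(word):
    # dict.fromkeys keeps each character once, in first-seen order.
    return "".join(ch for ch in dict.fromkeys(word) if word.count(ch) > 1)

# Keep only the words that have at least one repeated character.
result = [(w, repeated_chars(w)) for w in text.split() if repeated_chars(w)]
print(result)
```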

Regex to find strings containing substring, but not ending on same substring

I'm trying to write a regex that checks if a string contains the substring "ing", but it must not end in "ing".
So the word sing would not match, but singer would.
I think I have figured out how to make sure that the string does not end with ing, for that I'm using
(!<?(ing))$
But I can't seem to get it to work when I want the word to contain "ing" as well. I was thinking something like
(\w+(ing))(!<?(ing))$
But that does not work, and the attempts that come close break when the input has more than one word: they match singer but not singer crafting. In the latter case it should still match singer, just not crafting.
You may use the pattern:
ing(?=\w)
This would only be true for words which contain ing which is also followed by another word character. Here is an example:
import re

inp = 'singer'
if re.search(r'ing(?=\w)', inp):
    print('singer is a MATCH')
inp = 'sing'
if re.search(r'ing(?=\w)', inp):
    print('sing is a MATCH')
This prints:
singer is a MATCH
Edit:
To match entire words containing non terminal ing, I suggest using re.findall:
inp = "Madonna is a singer who likes to sing."
matches = re.findall(r'\b\w*ing\w+\b', inp)
print(matches) # prints ['singer']
If the word cannot end with ing but must contain ing:
\b\w*ing(?!\w*ing\b)\w+
Explanation
\b A word boundary
\w* Match 0+ word characters
ing Match the required ing
(?!\w*ing\b) Negative lookahead, assert that the rest of the word does not end with ing
\w+ Match 1+ word chars so that there must be at least a single char following
Regex demo | Python demo
For example
import re
items = ["singer","singing","ing","This is a ing testing singalong"]
# \b inside the lookahead (rather than $) keeps the check per word
pattern = r"\b\w*ing(?!\w*ing\b)\w+\b"
for item in items:
    result = re.findall(pattern, item)
    if result:
        print(result)
Output
['singer']
['singalong']
You can use this pattern:
import re
pattern = re.compile(r'\w*ing\w+')
print(pattern.match('sing')) # No match
print(pattern.match('singer')) # Match
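The patterns in these answers differ on words such as singing, which contains a non-final ing but also ends in ing. A quick sketch comparing two of them on a made-up sentence (the sample text and variable names are assumptions for illustration):

```python
import re

text = "singer crafting singing singalong"

# Requires only that some "ing" is followed by another word character.
simple = re.findall(r"\b\w*ing\w+\b", text)

# Additionally rejects words that end with "ing".
strict = re.findall(r"\b\w*ing(?!\w*ing\b)\w+", text)

print(simple)
print(strict)
```

Which one is right depends on whether a word like singing should count as "not ending in ing".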

Python regular expression to find letters and numbers

Entering a string
I used findall to find words that consist only of letters and numbers (the number of words to be found is not specified).
I wrote:
words = re.findall(r"\w*\s", x) # x is the input string
If I enter "asdf1234 cdef11dfe a = 1 b = 2",
the string is separated into asdf1234, cdef11dfe, a =, 1, b =, 2.
I would like to pick out only asdf1234 and cdef11dfe.
How should I write the regular expression?
Try /[a-zA-Z0-9]{2,}/.
This looks for any alphanumeric character ([a-zA-Z0-9]) at least 2 times in a row ({2,}). That would be the only way to filter out the one letter words of the string.
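A minimal sketch of that suggestion applied to the input from the question (in Python the surrounding slashes are dropped). It also shows why the range should be spelled `A-Z`: a range written as `A-z` spans the ASCII punctuation between `Z` and `a`, such as `_` and `^`.

```python
import re

x = "asdf1234 cdef11dfe a = 1 b = 2"

# Runs of at least two ASCII letters or digits.
words = re.findall(r"[a-zA-Z0-9]{2,}", x)
print(words)

# The range A-z also covers [ \ ] ^ _ and the backtick.
stray = re.findall(r"[A-z]", "_^")
print(stray)
```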
The problem with \w is that it includes underscores.
This one should work : (?<![\"=\w])(?:[^\W_]+)(?![\"=\w])
Explanation
(?:[^\W_]+) One or more characters that are neither non-word characters nor underscores, i.e. letters and digits (non-capturing group)
(?<![\"=\w]) not preceded by ", = or a word character
(?![\"=\w]) not followed by ", = or a word character
RegEx Demo
Sample code:
import re
regex = r"(?<![\"=\w])(?:[^\W_]+)(?![\"=\w])"
test_str = "a01a b02 c03 e dfdfd abcdef=2 b=3 e=4 c=\"a b\" aaa=2f f=\"asdf 12af\""
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches):
    print(match.group())

Using Regex to find words with characters that are the same or that are different

I have a list of words such as:
l = """abca
bcab
aaba
cccc
cbac
babb
"""
I want to find the words that have the same first and last character, and that the two middle characters are different from the first/last character.
The desired final result:
['abca', 'bcab', 'cbac']
I tried this:
re.findall('^(.)..\\1$', l, re.MULTILINE)
But it returns all of the unwanted words as well.
I thought of using [^...] somehow, but I couldn't figure it out.
There's a way of doing this with sets (to filter the results from the search above), but I'm looking for a regex.
Is it possible?
Edit: fixed to use negative lookahead assertions instead of negative lookbehind assertions. Read the comments from @AlanMoore and @bukzor for explanations.
>>> [s for s in l.splitlines() if re.search(r'^(.)(?!\1).(?!\1).\1$', s)]
['abca', 'bcab', 'cbac']
The solution uses negative lookahead assertions, which mean 'match at the current position only if it isn't followed by a match for something else.' Here the lookahead assertion is (?!\1), which means 'consume the next character only if it is not the same as the first character captured in group 1.'
There are lots of ways to do this. Here's probably the simplest:
re.findall(r'''
\b #The beginning of a word (a word boundary)
([a-z]) #One letter
(?!\w*\1\B) #The rest of this word may not contain the starting letter except at the end of the word
[a-z]* #Any number of other letters
\1 #The starting letter we captured in step 2
\b #The end of the word (another word boundary)
''', l, re.IGNORECASE | re.VERBOSE)
If you want, you can loosen the requirements a bit by replacing [a-z] with \w. That will allow numbers and underscores as well as letters. You can also restrict it to 4-character words by changing the last * in the pattern to {2}.
Note also that I'm not very familiar with Python, so I'm assuming your usage of findall is correct.
Are you required to use regexes? This is a much more pythonic way to do the same thing:
l = """abca
bcab
aaba
cccc
cbac
babb
"""
for word in l.split():
    if word[-1] == word[0] and word[0] not in word[1:-1]:
        print(word)
Here's how I would do it:
result = re.findall(r"\b([a-z])(?:(?!\1)[a-z]){2}\1\b", subject)
This is similar to Justin's answer, except where that one does a one-time lookahead, this one checks each letter as it's consumed.
\b
([a-z]) # Capture the first letter.
(?:
(?!\1) # Unless it's the same as the first letter...
[a-z] # ...consume another letter.
){2}
\1
\b
I don't know what your real data looks like, so chose [a-z] arbitrarily because it works with your sample data. I limited the length to four characters for the same reason. As with Justin's answer, you may want to change the {2} to *, + or some other quantifier.
To heck with regexes.
[
    word
    for word in words.split()  # split() skips the empty string after the final newline
    if word[0] == word[-1]
    and word[0] not in word[1:-1]
]
You can do this with negative lookahead or lookbehind assertions; see http://docs.python.org/library/re.html for details.
Not a Python guru, but maybe this
re.findall(r'^(.)(?:(?!\1).)*\1$', l, re.MULTILINE)
expanded (use multi-line modifier):
^ # begin of line
(.) # capture grp 1, any char except newline
(?: # grouping
(?!\1) # Lookahead assertion, not what was in capture group 1 (backref to 1)
. # this is ok, grab any char except newline
)* # end grouping, do 0 or more times (could force length with {2} instead of *)
\1 # backref to group 1, this character must be the same
$ # end of line
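One caveat when running these patterns in Python: when the pattern contains a capture group, re.findall returns the captured group (here, the first letter) instead of the whole word. A minimal sketch, assuming the four-character variant with {2}, that uses finditer to recover the full words:

```python
import re

l = """abca
bcab
aaba
cccc
cbac
babb
"""

# Raw string so that \1 is a backreference and not the character \x01.
pattern = re.compile(r"^(.)(?:(?!\1).){2}\1$", re.MULTILINE)

# findall returns only what group 1 captured.
first_letters = pattern.findall(l)

# finditer gives match objects; group(0) is the whole matched word.
words = [m.group(0) for m in pattern.finditer(l)]

print(first_letters)
print(words)
```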
