In my regex pattern, I would like to make sure a certain substring only occurs once in between two other substrings.
So, let's take for example these strings:
string_a = "this and that"
string_b = "this and and that"
I want to return a match for string_a but not for string_b, because 'and' occurs twice there between this/that.
I would do that with a negative lookahead-tempered dot:
my_pattern = "this(?:(?!and.*and).)*that"
This matches string_a and not string_b, so far so good.
However, the following string is also not matched (just like string_b):
string_c = "this and that and"
Evidently, the negative lookahead occurs for the whole string, rather than between "this" and "that" as I had anticipated and hoped.
How can I do this instead?
You can use another tempered greedy token to temper the .* inside the lookahead:
this(?:(?!this|that|and(?:(?!that).)*?and).)*?that
See the regex demo.
Details:
this - a fixed string
(?:(?!this|that|and(?:(?!that).)*?and).)*? - any char other than line break chars, zero or more but as few as possible occurrences, that does not start a this or that char sequence, or a pattern that matches and, then any chars other than line break chars (zero or more, as few as possible) none of which starts a that char sequence, and then another and string
that - a fixed string.
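As a quick check of the pattern above, here is a minimal Python sketch that runs it against the three strings from the question. The names PATTERN, samples and results are only for illustration.

```python
import re

# Pattern from the answer above: the inner tempered token stops the
# "and ... and" check from running past the closing "that".
PATTERN = re.compile(r"this(?:(?!this|that|and(?:(?!that).)*?and).)*?that")

samples = {
    "string_a": "this and that",
    "string_b": "this and and that",
    "string_c": "this and that and",
}

# True when the pattern is found somewhere in the string.
results = {name: bool(PATTERN.search(text)) for name, text in samples.items()}
for name, matched in results.items():
    print(name, "matches" if matched else "does not match")
```

string_a and string_c should match, while string_b should not, because its second "and" sits before "that".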
I'm a regular expression newbie and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words such as:
Paris in the the spring.
Not that that is related.
Why are you laughing? Are my my regular expressions THAT bad??
Is there a single regular expression that will match ALL of the bold strings above?
Try this regular expression:
\b(\w+)\s+\1\b
Here \b is a word boundary and \1 references the captured match of the first group.
Regex101 example here
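For anyone who wants to try this outside regex101, a small Python sketch using the three sentences from the question follows. DOUBLED and doubled_words are just illustrative names.

```python
import re

DOUBLED = re.compile(r"\b(\w+)\s+\1\b")

sentences = [
    "Paris in the the spring.",
    "Not that that is related.",
    "Why are you laughing? Are my my regular expressions THAT bad??",
]

# findall returns the text of group 1, i.e. the word that was doubled.
doubled_words = [DOUBLED.findall(sentence) for sentence in sentences]
print(doubled_words)
```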
I believe this regex handles more situations:
/(\b\S+\b)\s+\b\1\b/
A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html
The below expression should work correctly to find any number of duplicated words. The matching can be case insensitive.
String regex = "\\b(\\w+)(\\s+\\1\\b)+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
    input = input.replaceAll(m.group(0), m.group(1));
}
Sample Input : Goodbye goodbye GooDbYe
Sample Output : Goodbye
Explanation:
The regex expression:
\b : Start of a word boundary
\w+ : Any number of word characters
(\s+\1\b)+ : One or more runs of whitespace followed by the same word again, ending at a word boundary. The + on the group lets it match more than one repetition.
Grouping :
m.group(0) : Shall contain the matched group in above case Goodbye goodbye GooDbYe
m.group(1) : Shall contain the first word of the matched pattern in above case Goodbye
Replace method shall replace all consecutive matched words with the first instance of the word.
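The same replacement can be sketched in Python with a single re.sub call, which avoids feeding the matched text back in as a regex. This is an illustration rather than a port of the Java code above: collapse_repeats is a made-up helper name, and the inner group is made non-capturing.

```python
import re

def collapse_repeats(text):
    """Replace each run of a repeated word with its first occurrence."""
    # re.IGNORECASE also makes the backreference \1 compare case-insensitively.
    return re.sub(r"\b(\w+)(?:\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)

print(collapse_repeats("Goodbye goodbye GooDbYe"))
```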
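Try this with the RE below:
\b : word boundary at the start of a word
\w+ : one or more word characters (captured as group 1)
\W+ : one or more non-word characters separating the words
\1 : the same word that was already matched
\b : end of word
()* : the whole group repeated zero or more times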
public static void main(String[] args) {
    String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";
    Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    Scanner in = new Scanner(System.in);
    int numSentences = Integer.parseInt(in.nextLine());
    while (numSentences-- > 0) {
        String input = in.nextLine();
        Matcher m = p.matcher(input);
        // Check for subsequences of input that match the compiled pattern
        while (m.find()) {
            input = input.replaceAll(m.group(0), m.group(1));
        }
        // Prints the modified sentence.
        System.out.println(input);
    }
    in.close();
}
Regex to Strip 2+ duplicate words (consecutive/non-consecutive words)
Try this regex that can catch 2 or more duplicate words and only leave behind one single word. And the duplicate words need not even be consecutive.
/\b(\w+)\b(?=.*?\b\1\b)/ig
Here, \b is used for Word Boundary, ?= is used for positive lookahead, and \1 is used for back-referencing.
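A rough Python equivalent, for illustration only: re.IGNORECASE stands in for the "i" flag, and findall plays the role of the "g" flag. The sample text and variable names are made up.

```python
import re

# Matches a word only when the same word occurs again later in the string.
EARLIER_DUPLICATE = re.compile(r"\b(\w+)\b(?=.*?\b\1\b)", re.IGNORECASE)

text = "one two one three two"
# Each hit is an earlier copy of a word that is repeated further on.
repeated_later = EARLIER_DUPLICATE.findall(text)
print(repeated_later)
```

Note that the pattern matches the earlier copies, so replacing the matches with an empty string keeps the last occurrence of each word and leaves the surrounding whitespace behind.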
The widely-used PCRE library can handle such situations (you won't achieve the same with POSIX-compliant regex engines, though):
(\b\w+\b)\W+\1
Here is one that catches multiple words multiple times:
(\b\w+\b)(\s+\1)+
No. That is an irregular grammar. There may be engine-/language-specific regular expressions that you can use, but there is no universal regular expression that can do that.
This is the regex I use to remove duplicate phrases in my twitch bot:
(\S+\s*)\1{2,}
(\S+\s*) looks for any run of non-whitespace characters, followed by optional whitespace.
\1{2,} then requires at least two more copies of that phrase immediately afterwards, so the pattern matches when three or more identical phrases appear in a row.
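To see the threshold in action, here is a small Python sketch with made-up sample strings. TRIPLED, three and two are illustrative names.

```python
import re

TRIPLED = re.compile(r"(\S+\s*)\1{2,}")

# Three copies of "no " in a row: the group plus two backreferences.
three = TRIPLED.search("no no no way")
# Only two copies: one backreference short of the {2,} minimum.
two = TRIPLED.search("no no way")

print(repr(three.group(0)))
print(two)
```

Because the trailing whitespace is part of the captured group, each repeated copy has to carry the same whitespace as the first one.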
Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.
Pattern: /(\b\S+)(?:\s+\1\b)+/ (Pattern Demo)
Replace: $1 (replaces the fullstring match with capture group #1)
This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).
Specifically:
\b (word boundary) characters are vital to ensure partial words are not matched.
The second parenthetical is a non-capturing group, because this variable width substring does not need to be captured -- only matched/absorbed.
the + (one or more quantifier) on the non-capturing group is more appropriate than * because * will "bother" the regex engine to capture and replace singleton occurrences -- this is wasteful pattern design.
*note if you are dealing with sentences or input strings with punctuation, then the pattern will need to be further refined.
The example in JavaScript: The Good Parts can be adapted to do this:
var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1(?:\s|$)/gi;
\b uses \w for word boundaries, where \w is equivalent to [0-9A-Z_a-z]. If you don't mind that limitation, the accepted answer is fine.
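As a side note for Python users (this is about Python 3, not JavaScript): for str patterns \w and \b are Unicode-aware by default, and the re.ASCII flag restores the narrower [0-9A-Z_a-z] behaviour described above. A small sketch with a made-up sample string:

```python
import re

# "\u00e9l \u00e9l ni\u00f1o ni\u00f1o" is the text "él él niño niño" written with escapes.
text = "\u00e9l \u00e9l ni\u00f1o ni\u00f1o"

# Default (Unicode) matching treats the accented letters as word characters.
unicode_hits = re.findall(r"\b(\w+)\s+\1\b", text)
# With re.ASCII the accented letters break the words apart, so nothing repeats.
ascii_hits = re.findall(r"\b(\w+)\s+\1\b", text, flags=re.ASCII)
print(unicode_hits, ascii_hits)
```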
This expression (inspired from Mike, above) seems to catch all duplicates, triplicates, etc, including the ones at the end of the string, which most of the others don't:
s.replace(/(^|\s+)(\S+)(($|\s+)\2)+/g, "$1$2")
I know the question asked to match duplicates only, but a triplicate is just 2 duplicates next to each other :)
First, I put (^|\s+) to make sure it starts with a full word, otherwise "child's steak" would go to "child'steak" (the "s"'s would match). Then, it matches all full words ((\b\S+\b)), followed by an end of string ($) or a number of spaces (\s+), the whole repeated more than once.
I tried it like this and it worked well:
var s = "here here here here is ahi-ahi ahi-ahi ahi-ahi joe's joe's joe's joe's joe's the result result result";
print( s.replace( /(\b\S+\b)(($|\s+)\1)+/g, "$1"))
--> here is ahi-ahi joe's the result
Try this regular expression; it handles all cases of repeated words:
\b(\w+)\s+\1(?:\s+\1)*\b
I think another solution would be to use named capture groups and backreferences like this:
.* (?<mytoken>\w+)\s+\k<mytoken> .*
OR
.*(?<mytoken>\w{3,}).+\k<mytoken>.*
Kotlin:
val regex = Regex(""".* (?<myToken>\w+)\s+\k<myToken> .*""")
val input = "This is a test test data"
val result = regex.find(input)
println(result!!.groups["myToken"]!!.value)
Java:
var pattern = Pattern.compile(".* (?<myToken>\\w+)\\s+\\k<myToken> .*");
var matcher = pattern.matcher("This is a test test data");
var isFound = matcher.find();
var result = matcher.group("myToken");
System.out.println(result);
JavaScript:
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/;
const input = "This is a test test data";
const result = regex.exec(input);
console.log(result.groups.myToken);
// OR
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/g;
const input = "This is a test test data";
const result = [...input.matchAll(regex)];
console.log(result[0].groups.myToken);
All the above detect the test as the duplicate word.
Tested with Kotlin 1.7.0-Beta, Java 11, Chrome and Firefox 100.
You can use this pattern:
\b(\w+)(?:\W+\1\b)+
This pattern can be used to match all duplicated word groups in sentences. :)
Here is a sample utility function written in Java 17, which replaces all duplications with the first occurrence:
public String removeDuplicates(String input) {
    var regex = "\\b(\\w+)(?:\\W+\\1\\b)+";
    var pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    var matcher = pattern.matcher(input);
    while (matcher.find()) {
        input = input.replaceAll(matcher.group(), matcher.group(1));
    }
    return input;
}
As far as I can see, none of these would match:
London in the
the winter (with the winter on a new line )
Although matching duplicates on the same line is fairly straightforward,
I haven't been able to come up with a solution for the situation in which they
stretch over two lines. ( with Perl )
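For what it's worth, here is how this behaves in Python when the whole text is searched as one string rather than line by line. The same idea may carry over to Perl if the input is read in one piece, but that is an assumption and is not tested here.

```python
import re

text = "London in the\nthe winter"

# \s+ also matches "\n", so the pair is found across the line break.
match = re.search(r"\b(\w+)\s+\1\b", text)
print(repr(match.group(0)))
```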
To find duplicate words that have no leading or trailing non whitespace character(s) other than a word character(s), you can use whitespace boundaries on the left and on the right making use of lookarounds.
The pattern will have a match in:
Paris in the the spring.
Not that that is related.
The pattern will not have a match in:
This is $word word
(?<!\S)(\w+)\s+\1(?!\S)
Explanation
(?<!\S) Negative lookbehind, assert not a non whitespace char to the left of the current location
(\w+) Capture group 1, match 1 or more word characters
\s+ Match 1 or more whitespace characters (note that this can also match a newline)
\1 Backreference to match the same as in group 1
(?!\S) Negative lookahead, assert not a non whitespace char to the right of the current location
See a regex101 demo.
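A small Python sketch comparing the whitespace-boundary pattern with the plain \b version on the example strings above. WORD_BOUNDARY, SPACE_BOUNDARY, wb and sb are illustrative names.

```python
import re

WORD_BOUNDARY = re.compile(r"\b(\w+)\s+\1\b")
SPACE_BOUNDARY = re.compile(r"(?<!\S)(\w+)\s+\1(?!\S)")

text = "This is $word word"
# \b sees a boundary between "$" and "w", so it reports "word word".
wb = WORD_BOUNDARY.search(text)
# The lookbehind rejects "word" because "$" is a non-whitespace character.
sb = SPACE_BOUNDARY.search(text)
print(wb.group(0) if wb else None, sb)
```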
To find 2 or more duplicate words:
(?<!\S)(\w+)(?:\s+\1)+(?!\S)
This part of the pattern (?:\s+\1)+ uses a non capture group to repeat 1 or more times matching 1 or more whitespace characters followed by the backreference to match the same as in group 1.
See a regex101 demo.
Alternatives without using lookarounds
You could also make use of a leading and trailing alternation matching either a whitespace char or assert the start/end of the string.
Then use a capture group 1 for the value that you want to get, and use a second capture group with a backreference \2 to match the repeated word.
Matching 2 duplicate words:
(?:\s|^)((\w+)\s+\2)(?:\s|$)
See a regex101 demo.
Matching 2 or more duplicate words:
(?:\s|^)((\w+)(?:\s+\2)+)(?:\s|$)
See a regex101 demo.
Use this in case you want case-insensitive checking for duplicate words.
(?i)\\b(\\w+)\\s+\\1\\b
I am trying to create a regex that will match characters, whitespaces, but not numbers.
So hello 123 will not match, but hell o will.
I tried this:
[^\d\w]
but, I cannot find a way to add whitespaces here. I have to use \w, because my strings can contain Unicode characters.
Brief
It's unclear what exactly characters refers to, but, assuming you mean alpha characters (based on your input), this regex should work for you.
Code
See regex in use here
^(?:(?!\d)[\w ])+$
Note: This regex uses the mu flags for multiline and Unicode (multiline only necessary if input is separated by newline characters)
Results
Input
ÀÇÆ some words
ÀÇÆ some words 123
Output
This only shows matches
ÀÇÆ some words
Explanation
^ Assert position at the start of the line
(?:(?!\d)[\w ])+ Match the following one or more times (tempered greedy token)
(?!\d) Negative lookahead ensuring what follows doesn't match a digit. You can change this to (?![\d_]) if you want to ensure _ is also not used.
[\w ] Match any word character or space (matches Unicode word characters with u flag)
$ Assert position at the end of the line
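A minimal Python sketch of the same pattern, assuming Python 3, where \w already matches Unicode word characters for str patterns, so only re.MULTILINE is needed. The variable names are illustrative.

```python
import re

LETTERS_AND_SPACES = re.compile(r"^(?:(?!\d)[\w ])+$", re.MULTILINE)

# "\u00c0\u00c7\u00c6" is "ÀÇÆ" written with escapes.
text = "\u00c0\u00c7\u00c6 some words\n\u00c0\u00c7\u00c6 some words 123"

# Only lines made entirely of non-digit word characters and spaces match.
matching_lines = LETTERS_AND_SPACES.findall(text)
print(matching_lines)
```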
You can use a lookahead:
(?=^\D+$)[\w\s]+
In Python:
import re
strings = ['hello 123', 'hell o']
rx = re.compile(r'(?=^\D+$)[\w\s]+')
new_strings = [string for string in strings if rx.match(string)]
print(new_strings)
# ['hell o']
I'm trying to create a regex that matches a third person form of a verb created using the following rule:
If the verb ends in e not preceded by i,o,s,x,z,ch,sh, add s.
So I'm looking for a regex matching a word consisting of some letters, then not i,o,s,x,z,ch,sh, and then "es". I tried this:
\b\w*[^iosxz(sh)(ch)]es\b
According to regex101 it matches "likes", "hates" etc. However, it does not match "bathes", why doesn't it?
You may use
\b(?=\w*(?<![iosxz])(?<![cs]h)es\b)\w*
See the regex demo
Since Python re does not support variable length alternatives in a lookbehind, you need to split the conditions into two lookbehinds here.
Pattern details:
\b - a leading word boundary
(?=\w*(?<![iosxz])(?<![cs]h)es\b) - a positive lookahead requiring a sequence of:
\w* - 0+ word chars
(?<![iosxz]) - there must not be i, o, s, x, z chars right before the current location and...
(?<![cs]h) - no ch or sh right before the current location...
es - followed with es...
\b - at the end of the word
\w* - zero or more (maybe + is better here to match 1 or more) word chars.
See Python demo:
import re
r = re.compile(r'\b(?=\w*(?<![iosxz])(?<![cs]h)es\b)\w*')
s = 'it matches "likes", "hates" etc. However, it does not match "bathes", why doesn\'t it?'
print(re.findall(r, s))
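If you want to match an e that is not preceded by i, o, s, x, z, ch or sh, you could use the following in an engine that accepts alternatives of different lengths inside a lookbehind (Python's re does not):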
(?<!i|o|s|x|z|ch|sh)e
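Your regex [^iosxz(sh)(ch)] is a single character class: the ^ negates it, and every other character, parentheses included, is taken literally, so it's equivalent to:
[^io)sxz(ch]
which actually means: "match anything that's not one of the characters io)sxz(ch". Since h is in that set, the h before es in "bathes" cannot be matched.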
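This can be checked directly in Python. The sketch below lists the characters the class really excludes and tests the three words from the question. ATTEMPT, excluded and results are illustrative names.

```python
import re

ATTEMPT = re.compile(r"\b\w*[^iosxz(sh)(ch)]es\b")

# Inside [...] the parentheses are literal and nothing is grouped, so the
# class is just a set of single characters, which includes "h".
excluded = set("iosxz(sh)(ch)")
print(sorted(excluded))

results = {word: bool(ATTEMPT.search(word)) for word in ("likes", "hates", "bathes")}
print(results)
```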
I have tried creating a regex pattern that matches only letters and allows a whitespace:
import re
user_input = raw_input('Input: ')
if re.match('[A-Za-z ]', user_input):
    print user_input
However,
When inputting o888, or something similar, a match seems to still occur
That happens because your regex allows partial matches.
Use
if re.match('[A-Za-z ]*$', user_input):
                      ^^
to anchor the pattern at the end and match 0+ chars. As re.match anchors the pattern at the start of the string, the ^ anchor is not necessary, but $ - end of string - is required to enforce the full string match.
If you do not want to allow an empty string, use + quantifier - one or more occurrences - rather than * (zero or more occurrences).
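Here is a short Python 3 sketch of the difference. It also shows re.fullmatch (available since Python 3.4) as an alternative, which is an addition to the answer above rather than part of it.

```python
import re

# re.match only anchors at the start, so one leading letter is enough to match.
partial = re.match(r"[A-Za-z ]", "o888")

# Anchoring the end rejects the same input.
anchored = re.match(r"[A-Za-z ]+$", "o888")

# "$" also matches just before a trailing newline; re.fullmatch does not allow that.
dollar_newline = re.match(r"[A-Za-z ]+$", "abc\n")
full_newline = re.fullmatch(r"[A-Za-z ]+", "abc\n")

print(bool(partial), bool(anchored), bool(dollar_newline), bool(full_newline))
```

If a trailing newline should be rejected as well, re.fullmatch (or \Z instead of $) is the stricter choice.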
I have a list of words such as:
l = """abca
bcab
aaba
cccc
cbac
babb
"""
I want to find the words that have the same first and last character, and that the two middle characters are different from the first/last character.
The desired final result:
['abca', 'bcab', 'cbac']
I tried this:
re.findall('^(.)..\\1$', l, re.MULTILINE)
But it returns all of the unwanted words as well.
I thought of using [^...] somehow, but I couldn't figure it out.
There's a way of doing this with sets (to filter the results from the search above), but I'm looking for a regex.
Is it possible?
Edit: fixed to use negative lookahead assertions instead of negative lookbehind assertions. Read comments for #AlanMoore and #bukzor explanations.
>>> [s for s in l.splitlines() if re.search(r'^(.)(?!\1).(?!\1).\1$', s)]
['abca', 'bcab', 'cbac']
The solution uses negative lookahead assertions, which mean 'match at the current position only if what follows does not match the given pattern.' Here the assertion is (?!\1): the next character is consumed only if it is not the same as the captured first character.
There are lots of ways to do this. Here's probably the simplest:
re.findall(r'''
\b #The beginning of a word (a word boundary)
([a-z]) #One letter
(?!\w*\1\B) #The rest of this word may not contain the starting letter except at the end of the word
[a-z]* #Any number of other letters
\1 #The starting letter we captured in step 2
\b #The end of the word (another word boundary)
''', l, re.IGNORECASE | re.VERBOSE)
If you want, you can loosen the requirements a bit by replacing [a-z] with \w. That will allow numbers and underscores as well as letters. You can also restrict it to 4-character words by changing the last * in the pattern to {2}.
Note also that I'm not very familiar with Python, so I'm assuming your usage of findall is correct.
Are you required to use regexes? This is a much more pythonic way to do the same thing:
l = """abca
bcab
aaba
cccc
cbac
babb
"""
for word in l.split():
    if word[-1] == word[0] and word[0] not in word[1:-1]:
        print(word)
Here's how I would do it:
result = re.findall(r"\b([a-z])(?:(?!\1)[a-z]){2}\1\b", subject)
This is similar to Justin's answer, except where that one does a one-time lookahead, this one checks each letter as it's consumed.
\b
([a-z]) # Capture the first letter.
(?:
(?!\1) # Unless it's the same as the first letter...
[a-z] # ...consume another letter.
){2}
\1
\b
I don't know what your real data looks like, so chose [a-z] arbitrarily because it works with your sample data. I limited the length to four characters for the same reason. As with Justin's answer, you may want to change the {2} to *, + or some other quantifier.
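One practical detail when running this in Python: because the pattern has a single capture group, re.findall returns only the captured first letter, not the whole word. A minimal sketch using the list from the question, with finditer to recover the words (PATTERN, first_letters and words are illustrative names):

```python
import re

l = """abca
bcab
aaba
cccc
cbac
babb
"""

PATTERN = re.compile(r"\b([a-z])(?:(?!\1)[a-z]){2}\1\b")

# With one capture group, findall returns only the captured first letter.
first_letters = PATTERN.findall(l)
# finditer with group(0) gives the whole matched word instead.
words = [m.group(0) for m in PATTERN.finditer(l)]
print(first_letters, words)
```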
To heck with regexes.
[
    word
    for word in words.split()
    if word[0] == word[-1]
    and word[0] not in word[1:-1]
]
You can do this with negative lookahead or lookbehind assertions; see http://docs.python.org/library/re.html for details.
Not a Python guru, but maybe this
re.findall(r'^(.)(?:(?!\1).)*\1$', l, re.MULTILINE)
expanded (use multi-line modifier):
^ # begin of line
(.) # capture grp 1, any char except newline
(?: # grouping
(?!\1) # Lookahead assertion, not what was in capture group 1 (backref to 1)
. # this is ok, grab any char except newline
)* # end grouping, do 0 or more times (could force length with {2} instead of *)
\1 # backref to group 1, this character must be the same
$ # end of line