Regex to match repeated (unknown) substrings

Regex to match repeated (unknown) substrings - python

I'm trying to find "laughter words" or similar such as hahaha, hihihi, hueheu within user messages. My current approach is as follows:
>>> substring_list = ['ha', 'ah', 'he', 'eh', 'hi', 'ih', 'ho', 'hu', 'hue']
>>> pattern_core = '|'.join(substring_list)
>>> self.regex_pattern = re.compile(r'\b[a-z]*(' + pattern_core + r'){2,}[a-z]*\b', re.IGNORECASE)
The [a-z]* allows for some leeway when it comes to typos (e.g., ahhahah). In principle this works reasonably well. The problem is that it needs to be maintained in the sense that substring_list needs to be updated to match new forms of "laughter words" (e.g., adding xi); "laughter words" seem to vary quite noticable between countries.
Now I wonder if I can somehow find words based on repeated patterns (of sizes, say, 2-4) without knowing the individual pattern. For example, hurrhurr contains hurr as repeated pattern. In the ideal case I can (a) match hurrhurr and (b) identify the core pattern hurr. I have no idea if this is possible with regular expressions.

This regex will do it:
\b[a-z]*?([a-z]{2,}?)\1+[a-z]*?\b
Usage:
self.regex_pattern = re.compile(r'\b[a-z]*?([a-z]{2,}?)\1+[a-z]*?\b', re.IGNORECASE)
Here's a working demo.
The gist is similar to what you were doing, but the "core" is different. The heart of the regex is this piece:
([a-z]{2,}?)\1+
The logic is to find a group consisting of 2 or more letters, then match the same group (\1) one or more additional times.

In the ideal case I can (a) match hurrhurr and (b) identify the core
pattern hurr. I have no idea if this is possible with regular expressions.
import re
string = """hahaha, huehue, heehee,
axaxaxax, x the theme, ------, hhxhhxhhx,
bananas, if I imagine, HahHaH"""
pattern = r"""
(
\b #Match a word boundary...
(
[a-z]{2,}? #Followed by a letter, 2 or more times, non-greedy...
) #Captured in group 2,
\2+ #Followed by whatever matched group 2, one or more times...
\b #Followed by a word boundary.
) #Capture in group 1.
"""
results = re.findall(pattern, string, re.X|re.I)
print(results)
--output:--
[('hahaha', 'ha'), ('huehue', 'hue'), ('heehee', 'hee'), ('axaxaxax', 'ax'), ('hhxhhxhhx', 'hhx'), ('HahHaH', 'Hah')]

Related

regex: how to get repeating blocks as groups()? [duplicate]

I need to capture multiple groups of the same pattern. Suppose, I have the following string:
HELLO,THERE,WORLD
And I've written the following pattern
^(?:([A-Z]+),?)+$
What I want it to do is to capture every single word, so that Group 1 is : "HELLO", Group 2 is "THERE" and Group 3 is "WORLD". What my regex is actually capturing is only the last one, which is "WORLD".
I'm testing my regular expression here and I want to use it with Swift (maybe there's a way in Swift to get intermediate results somehow, so that I can use them?)
UPDATE: I don't want to use split. I just need to now how to capture all the groups that match the pattern, not only the last one.

With one group in the pattern, you can only get one exact result in that group. If your capture group gets repeated by the pattern (you used the + quantifier on the surrounding non-capturing group), only the last value that matches it gets stored.
You have to use your language's regex implementation functions to find all matches of a pattern, then you would have to remove the anchors and the quantifier of the non-capturing group (and you could omit the non-capturing group itself as well).
Alternatively, expand your regex and let the pattern contain one capturing group per group you want to get in the result:
^([A-Z]+),([A-Z]+),([A-Z]+)$

The key distinction is repeating a captured group instead of capturing a repeated group.
As you have already found out, the difference is that repeating a captured group captures only the last iteration. Capturing a repeated group captures all iterations.
In PCRE (PHP):
((?:\w+)+),?
Match 1, Group 1. 0-5 HELLO
Match 2, Group 1. 6-11 THERE
Match 3, Group 1. 12-20 BRUTALLY
Match 4, Group 1. 21-26 CRUEL
Match 5, Group 1. 27-32 WORLD
Since all captures are in Group 1, you only need $1 for substitution.
I used the following general form of this regular expression:
((?:{{RE}})+)
Example at regex101

I think you need something like this....
b="HELLO,THERE,WORLD"
re.findall('[\w]+',b)
Which in Python3 will return
['HELLO', 'THERE', 'WORLD']

After reading Byte Commander's answer, I want to introduce a tiny possible improvement:
You can generate a regexp that will match either n words, as long as your n is predetermined. For instance, if I want to match between 1 and 3 words, the regexp:
^([A-Z]+)(?:,([A-Z]+))?(?:,([A-Z]+))?$
will match the next sentences, with one, two or three capturing groups.
HELLO,LITTLE,WORLD
HELLO,WORLD
HELLO
You can see a fully detailed explanation about this regular expression on Regex101.
As I said, it is pretty easy to generate this regexp for any groups you want using your favorite language. Since I'm not much of a swift guy, here's a ruby example:
def make_regexp(group_regexp, count: 3, delimiter: ",")
regexp_str = "^(#{group_regexp})"
(count - 1).times.each do
regexp_str += "(?:#{delimiter}(#{group_regexp}))?"
end
regexp_str += "$"
return regexp_str
end
puts make_regexp("[A-Z]+")
That being said, I'd suggest not using regular expression in that case, there are many other great tools from a simple split to some tokenization patterns depending on your needs. IMHO, a regular expression is not one of them. For instance in ruby I'd use something like str.split(",") or str.scan(/[A-Z]+/)

Just to provide additional example of paragraph 2 in the answer. I'm not sure how critical it is for you to get three groups in one match rather than three matches using one group. E.g., in groovy:
def subject = "HELLO,THERE,WORLD"
def pat = "([A-Z]+)"
def m = (subject =~ pat)
m.eachWithIndex{ g,i ->
println "Match #$i: ${g[1]}"
}
Match #0: HELLO
Match #1: THERE
Match #2: WORLD

The problem with the attempted code, as discussed, is that there is one capture group matching repeatedly so in the end only the last match can be kept.
Instead, instruct the regex to match (and capture) all pattern instances in the string, what can be done in any regex implementation (language). So come up with the regex pattern for this.
The defining property of the shown sample data is that the patterns of interest are separated by commas so we can match anything-but-a-comma, using a negated character class
[^,]+
and match (capture) globally, to get all matches in the string.
If your pattern need be more restrictive then adjust the exclusion list. For example, to capture words separated by any of the listed punctuation
[^,.!-]+
This extracts all words from hi,there-again!, without the punctuation. (The - itself should be given first or last in a character class, unless it's used in a range like a-z or 0-9.)
In Python
import re
string = "HELLO,THERE,WORLD"
pattern = r"([^,]+)"
matches = re.findall(pattern,string)
print(matches)
In Perl (and many other compatible systems)
use warnings;
use strict;
use feature 'say';
my $string = 'HELLO,THERE,WORLD';
my #matches = $string =~ /([^,]+)/g;
say "#matches";
(In this specific example the capturing () in fact aren't needed since we collect everything that is matched. But they don't hurt and in general they are needed.)
The approach above works as it stands for other patterns as well, including the one attempted in the question (as long as you remove the anchors which make it too specific). The most common one is to capture all words (usually meaning [a-zA-Z0-9_]), with the pattern \w+. Or, as in the question, get only the substrings of upper-case ascii letters[A-Z]+.

I know that my answer came late but it happens to me today and I solved it with the following approach:
^(([A-Z]+),)+([A-Z]+)$
So the first group (([A-Z]+),)+ will match all the repeated patterns except the final one ([A-Z]+) that will match the final one. and this will be dynamic no matter how many repeated groups in the string.

You actually have one capture group that will match multiple times. Not multiple capture groups.
javascript (js) solution:
let string = "HI,THERE,TOM";
let myRegexp = /([A-Z]+),?/g; // modify as you like
let match = myRegexp.exec(string); // js function, output described below
while (match != null) { // loops through matches
console.log(match[1]); // do whatever you want with each match
match = myRegexp.exec(string); // find next match
}
Syntax:
// matched text: match[0]
// match start: match.index
// capturing group n: match[n]
As you can see, this will work for any number of matches.

Sorry, not Swift, just a proof of concept in the closest language at hand.
// JavaScript POC. Output:
// Matches: ["GOODBYE","CRUEL","WORLD","IM","LEAVING","U","TODAY"]
let str = `GOODBYE,CRUEL,WORLD,IM,LEAVING,U,TODAY`
let matches = [];
function recurse(str, matches) {
let regex = /^((,?([A-Z]+))+)$/gm
let m
while ((m = regex.exec(str)) !== null) {
matches.unshift(m[3])
return str.replace(m[2], '')
}
return "bzzt!"
}
while ((str = recurse(str, matches)) != "bzzt!") ;
console.log("Matches: ", JSON.stringify(matches))
Note: If you were really going to use this, you would use the position of the match as given by the regex match function, not a string replace.

Design a regex that matches each particular element of the list rather then a list as a whole. Apply it with /g
Iterate throught the matches, cleaning them from any garbage such as list separators that got mixed in. You may require another regex, or you can get by with simple replace substring method.
The sample code is in JS, sorry :) The idea must be clear enough.
const string = 'HELLO,THERE,WORLD';
// First use following regex matches each of the list items separately:
const captureListElement = /^[^,]+|,\w+/g;
const matches = string.match(captureListElement);
// Some of the matches may include the separator, so we have to clean them:
const cleanMatches = matches.map(match => match.replace(',',''));
console.log(cleanMatches);

repeat the A-Z pattern in the group for the regular expression.
data="HELLO,THERE,WORLD"
pattern=r"([a-zA-Z]+)"
matches=re.findall(pattern,data)
print(matches)
output
['HELLO', 'THERE', 'WORLD']

How can I define a regex that matches multiple patterns anchored at the same position in my search text?

I'm trying to use Python's findall to try and find all the hypenated and non-hypenated identifiers in a string (this is to plug into existing code, so using any constructs beyond findall won't work). If you imagine code like this:
regex = ...
body = "foo-bar foo-bar-stuff stuff foo-word-stuff"
ids = re.compile(regex).findall(body)
I would like the ids value to be ['foo', 'bar', 'word', 'foo-bar', 'foo-bar-stuff', and 'stuff'] (although not bar-stuff, because it's hypenated, but does not appear as a standalone space-separated identifier). Order of the array/set is not important.
A simple regex which matches the non-hypenated identifiers is \w+ and one which matches the hypenated ones is [\w-]+. However, I cannot figure out one which does both simultaneously (I don't have total control over the code, so cannot concatenate the lists together - I would like to do this in one Regex if possible).
I have tried \w|[\w-]+ but since the expression is greedy, this misses out bar for example, only matching -bar since foo has already been matched and it won't retry the pattern from the same starting position. I would like to find matches for (for example) both foo and foo-bar which begin (are anchored) at the same string position (which I think findall simply doesn't consider).
I've been trying some tricks such as lookaheads/lookbehinds such as mentioned, but I can't find any way to make them applicable to my scenario.
Any help would be appreciated.

You may use
import re
s = "foo-bar foo-bar-stuff stuff" #=> {'foo-bar', 'foo', 'bar', 'foo-bar-stuff', 'stuff'}
# s = "A-B-C D" # => {'C', 'D', 'A', 'A-B-C', 'B'}
l = re.findall(r'(?<!\S)\w+(?:-\w+)*(?!\S)', s)
res = []
for e in l:
res.append(e)
res.extend(e.split('-'))
print(set(res))
Pattern details
(?<!\S) - no non-whitespace right before
\w+ - 1+ word chars
(?:-\w+)* - zero or more repetitions of
- - a hyphen
\w+ - 1+ word chars
(?!\S) - no non-whitespace right after.
See the pattern demo online.
Note that to get all items, I split the matches with - and add these items to the resulting list. Then, with set, I remove any eventual dupes.

If you don't have to use regex
Just use split(below is example)
result = []
for x in body.split():
if x not in result:
result.append(x)
for y in x.split('-'):
if y not in result:
result.append(y)

This is not possible with findall alone, since it finds all non-overlapping matches, as the documentation says.
All you can do is find all longest matches with \w[-\w]* or something like that, and then generate all valid spans out of them (most probably starting from their split on '-').
Please note that \w[-\w]* will also match 123, 1-a, and a--, so something like(?=\D)\w[-\w]* or (?=\D)\w+(?:-\w+)* could be preferable (but you would still have to filter out the 1 from a-1).

Python Regex Find match group of range of non digits after hyphen and if range is not present ignore rest of pattern

I'm newer to more advanced regex concepts and am starting to look into look behinds and lookaheads but I'm getting confused and need some guidance. I have a scenario in which I may have several different kind of release zips named something like:
v1.1.2-beta.2.zip
v1.1.2.zip
I want to write a one line regex that can find match groups in both types. For example if file type is the first zip, I would want three match groups that look like:
v1.1.2-beta.2.zip
Group 1: v1.1.2
Group 2: beta
Group 3. 2
or if the second zip one match group:
v1.1.2.zip
Group 1: v1.1.2
This is where things start getting confusing to me as I would assume that the regex would need to assert if the hyphen exists and if does not, only look for the one match group, if not find the other 3.
(v[0-9.]{0,}).([A-Za-z]{0,}).([0-9]).zip
This was the initial regex I wrote witch successfully matches the first type but does not have the conditional. I was thinking about doing something like match group range of non digits after hyphen but can't quite get it to work and don't not know to make it ignore the rest of the pattern and accept just the first group if it doesn't find the hyphen
([\D]{0,}(?=[-]) # Does not work
Can someone point me in the right right direction?

You can use re.findall:
import re
s = ['v1.1.2-beta.2.zip', 'v1.1.2.zip']
final_results = [re.findall('[a-zA-Z]{1}[\d\.]+|(?<=\-)[a-zA-Z]+|\d+(?=\.zip)', i) for i in s]
groupings = ["{}\n{}".format(a, '\n'.join(f'Group {i}: {c}' for i, c in enumerate(b, 1))) for a, b in zip(s, final_results)]
for i in groupings:
print(i)
print('-'*10)
Output:
v1.1.2-beta.2.zip
Group 1: v1.1.2
Group 2: beta
Group 3: 2
----------
v1.1.2.zip
Group 1: v1.1.2.
----------
Note that the result garnered from re.findall is:
[['v1.1.2', 'beta', '2'], ['v1.1.2.']]

Here is how I would approach this using re.search. Note that we don't need lookarounds here; just a fairly complex pattern will do the job.
import re
regex = r"(v\d+(?:\.\d+)*)(?:-(\w+)\.(\d+))?\.zip"
str1 = "v1.1.2-beta.2.zip"
str2 = "v1.1.2.zip"
match = re.search(regex, str1)
print(match.group(1))
print(match.group(2))
print(match.group(3))
print("\n")
match = re.search(regex, str2)
print(match.group(1))
v1.1.2
beta
2
v1.1.2
Demo
If you don't have a ton of experience with regex, providing an explanation of each step probably isn't going to bring you up to speed. I will comment, though, on the use of ?: which appears in some of the parentheses. In that context, ?: tells the regex engine not to capture what is inside. We do this because you only want to capture (up to) three specific things.

We can use the following regex:
(v\d+(?:\.\d+)*)(?:[-]([A-Za-z]+))?((?:\.\d+)*)\.zip
This thus produces three groups: the first one the version, the second is optional: a dash - followed by alphabetical characters, and then an optional sequence of dots followed by numbers, and finally .zip.
If we ignore the \.zip suffix (well I assume this is rather trivial), then there are still three groups:
(v\d+(?:\.\d+)*): a regex group that starts with a v followed by \d+ (one or more digits). Then we have a non-capture group (a group starting with (?:..) that captures \.\d+ a dot followed by a sequence of one or more digits. We repeat such subgroup zero or more times.
(?:[-]([A-Za-z]+))?: a capture group that starts with a hyphen [-] and then one or more [A-Za-z] characters. The capture group is however optional (the ? at the end).
((?:\.\d+)*): a group that again has such \.\d+ non-capture subgroup, so we capture a dot followed by a sequence of digits, and this pattern is repeated zero or more times.
For example:
rgx = re.compile(r'(v\d+(?:\.\d+)*)([-][A-Za-z]+)?((?:\.\d+)*)\.zip')
We then obtain:
>>> rgx.findall('v1.1.2-beta.2.zip')
[('v1.1.2', '-beta', '.2')]
>>> rgx.findall('v1.1.2.zip')
[('v1.1.2', '', '')]

Python regex: Alternation for sets of words

We know \ba\b|\bthe\b will match either word "a" or "the"
I want to build a regex expression to match a pattern like
a/the/one reason/reasons for/of
Which means I want to match a string s contains 3 words:
the first word of s should be "a", "the" or "one"
the second word should be "reason" or "reasons"
the third word of s should be "for" or "of"
The regex \ba\b|\bthe\b|\bone\b \breason\b|reasons\b \bfor\b|\bof\b doesn't help.
How can I do this? BTW, I use python. Thanks.

You need to use a capture group to refuse of mixing the OR's (|)
(\ba\b|\bthe\b|\bone\b) (\breason\b|reasons\b) (\bfor\b|\bof\b)
And then as a more elegant way you can put the word boundaries around the groups.Also note that when you are using space in your regex around the words there is no need to use word boundary.And for reasons and reason you can make the last s optional with ?. And note that if you don't want to match your words as a separate groups you can makes your groups to a none capture group by :?.
\b(?:a|the|one) reasons? (?:for|of)\b
Or use capture group if you want the words in group :
\b(a|the|one) (reasons?) (for|of)\b

The regular expression modifier A|B means that "if either A or B matches, then the whole thing matches". So in your case, the resulting regular expression matches if/where any of the following 5 regular expressions match:
\ba\b
\bthe\b
\bone\b \breason\b
reasons\b \bfor\b
\bof\b
To limit the extent to which | applies, use the non-capturing grouping for this, that is (?:something|something else). Also, for having an optional s at the end of reason you do not need to use alteration; this is exactly equal to reasons?.
Thus we get the regular expression \b(?:a|the|one) reasons? (?:for|of)\b.
Note that you do not need to use the word boundary operators \b within the regular expression, only at the beginning and end (otherwise it would match something like everyone reasons forever).

An interesting feature of the regex module is the named list. With it, you don't have to include several alternatives separated by | in a non capturing group. You only need to define the list before and to refer to it in the pattern by its name. Example:
import regex
words = [ ['a', 'the', 'one'], ['reason', 'reasons'], ['for', 'of'] ]
pattern = r'\m \L<word1> \s+ \L<word2> \s+ \L<word3> \M'
p = regex.compile(pattern, regex.X, word1=words[0], word2=words[1], word3=words[2])
s = 'the reasons for'
print(p.search(s))
Even if this feature isn't essential, It improves the readability.
You can achieve something similar with the re module if you join items with | before:
import re
words = [ ['a', 'the', 'one'], ['reason', 'reasons'], ['for', 'of'] ]
words = ['|'.join(x) for x in words]
pattern = r'\b ({}) \s+ ({}) \s+ ({}) \b'.format(*words)
p = re.compile(pattern, re.X)

As I understand you want some regex like this:
(?:a|the|one)\s+(?:reason|reasons)\s+(?:for|of)
It's so simple, just combine them by using groups.
see: DEMO
Note Your requirement above, its sound is not so strict for me, in case that you want to modify something by yourself, let's consider the explanation below
Explanation
(?:abc|ijk|xyz)
Any word abc, ijk or xyz which grouped by non-capture group (?:...) means this word will not capture to regex variable $1, $2, $3, ....
\s+
This is word delimiter which here I set it as any spaces, + stands for 1 or more.

Use parentheses for grouping:
'\b(a|the|one) reason(|s) (for|of)\b'
I left the sentence-internal \b's out since the spaces imply them: A space following a letter is always a word boundary. In general you should put the \b outside the alternatives; it's shorter and more readable.
If it matters, you can use "non-capturing groups" in all modern regexp engines: Use (?:stuff) instead of (stuff). But if it doesn't matter for your uses, or if you need to know which of the word alternatives are actually present, then go with simple parens.

you can just use:
r"\b(a|the)\b"

Confusing Behaviour of regex in Python

I'm trying to match a specific pattern using the re module in python.
I wish to match a full sentence (More correctly I would say that they are alphanumeric string sequences separated by spaces and/or punctuation)
Eg.
"This is a regular sentence."
"this is also valid"
"so is This ONE"
I'm tried out of various combinations of regular expressions but I am unable to grasp the working of the patterns properly, with each expression giving me a different yet inexplicable result (I do admit I am a beginner, but still).
I'm tried:
"((\w+)(\s?))*"
To the best of my knowledge this should match one or more alpha alphanumerics greedily followed by either one or no white-space character and then it should match this entire pattern greedily. This is not what it seems to do, so clearly I am wrong but I would like to know why. (I expected this to return the entire sentence as the result)
The result I get for the first sample string mentioned above is [('sentence', 'sentence', ''), ('', '', ''), ('', '', ''), ('', '', '')].
"(\w+ ?)*"
I'm not even sure how this one should work. The official documentation(python help('re')) says that the ,+,? Match x or x (greedy) repetitions of the preceding RE.
In such a case is simply space the preceding RE for '?' or is '\w+ ' the preceding RE? And what will be the RE for the '' operator? The output I get with this is ['sentence'].
Others such as "(\w+\s?)+)" ; "((\w*)(\s??)) etc. which are basically variation of the same idea that the sentence is a set of alpha numerics followed by a single/finite number of white spaces and this pattern is repeated over and over.
Can someone tell me where I go wrong and why, and why the above expressions do not work the way I was expecting them to?
P.S I eventually got "[ \w]+" to work for me but With this I cannot limit the number of white-space characters in continuation.

Your reasoning about the regex is correct, your problem is coming from using capturing groups with *. Here's an alternative:
>>> s="This is a regular sentence."
>>> import re
>>> re.findall(r'\w+\s?', s)
['This ', 'is ', 'a ', 'regular ', 'sentence']
In this case it might make more sense for you to use \b in order to match word boundries.
>>> re.findall(r'\w+\b', s)
['This', 'is', 'a', 'regular', 'sentence']
Alternatively you can match the entire sentence via re.match and use re.group(0) to get the whole match:
>>> r = r"((\w+)(\s?))*"
>>> s = "This is a regular sentence."
>>> import re
>>> m = re.match(r, s)
>>> m.group(0)
'This is a regular sentence'

Here's an awesome Regular Expression tutorial website:
http://regexone.com/
Here's a Regular Expression that will match the examples given:
([a-zA-Z0-9,\. ]+)

Why do you want to limit the number of white space character in continuation? Because a sentence can have any number of words (sequences of alphanumeric characters) and spaces in a row, but rather a sentence is the area of text that ends with a punctuation mark or rather something that is not in the above sequence including white space.
([a-zA-Z0-9\s])*
The above regex will match a sentence wherein it is a series or spaces in series zero or more times. You can refine it to be the following though:
([a-zA-Z0-9])([a-zA-Z0-9\s])*
Which simply states that the above sequence must be prefaced with a alphanumeric character.
Hope this is what you were looking for.

Maybe this will help:
import re
source = """
This is a regular sentence.
this is also valid
so is This ONE
how about this one followed by this one
"""
re_sentence = re.compile(r'[^ \n.].*?(\.|\n| +)')
def main():
i = 0
for s in re_sentence.finditer(source):
print "%d:%s" % (i, s.group(0))
i += 1
if __name__ == '__main__':
main()
I am using alternation in the expression (\.|\n| +) to describe the end-of-sentence condition. Note the use of two spaces in the third alternation. The second space has the '+' meta-character so that two or more spaces in a row will be an end-of-sentence.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Regex to match repeated (unknown) substrings - python

Related

regex: how to get repeating blocks as groups()? [duplicate]

How can I define a regex that matches multiple patterns anchored at the same position in my search text?

Python Regex Find match group of range of non digits after hyphen and if range is not present ignore rest of pattern

Python regex: Alternation for sets of words

Confusing Behaviour of regex in Python

Categories

Resources