Python regex: Alternation for sets of words

We know \ba\b|\bthe\b will match either word "a" or "the"
I want to build a regex expression to match a pattern like
a/the/one reason/reasons for/of
That is, I want to match a string s that contains 3 words:
the first word of s should be "a", "the" or "one"
the second word should be "reason" or "reasons"
the third word of s should be "for" or "of"
The regex \ba\b|\bthe\b|\bone\b \breason\b|reasons\b \bfor\b|\bof\b doesn't help.
How can I do this? BTW, I use python. Thanks.

You need to use groups to keep the alternations (|) from mixing:
(\ba\b|\bthe\b|\bone\b) (\breason\b|reasons\b) (\bfor\b|\bof\b)
As a more elegant alternative, you can put the word boundaries around the groups. Note that where a literal space separates the words in your regex, no word boundary is needed there. For reasons and reason you can make the final s optional with ?. And if you don't want to capture the words as separate groups, you can turn the groups into non-capturing groups with ?:.
\b(?:a|the|one) reasons? (?:for|of)\b
Or use capturing groups if you want the words in groups:
\b(a|the|one) (reasons?) (for|of)\b
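As a quick check in Python, here is a minimal sketch of the capturing version. The sample sentence and the variable names are made up for illustration and are not part of the question.

```python
import re

# Capturing version of the pattern; the groups hold the three matched words.
pattern = re.compile(r"\b(a|the|one) (reasons?) (for|of)\b")

# Hypothetical sample text, not taken from the question.
text = "This is one reason for the delay."

match = pattern.search(text)
if match:
    print(match.group(0))   # the whole matched phrase
    print(match.groups())   # the three words as a tuple
```

Swapping the parentheses for (?:...) would still find the phrase, but match.groups() would then be empty.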

The regular expression modifier A|B means that "if either A or B matches, then the whole thing matches". So in your case, the resulting regular expression matches if/where any of the following 5 regular expressions match:
\ba\b
\bthe\b
\bone\b \breason\b
reasons\b \bfor\b
\bof\b
To limit the extent to which | applies, use non-capturing grouping, that is (?:something|something else). Also, for an optional s at the end of reason you do not need alternation; reason|reasons is exactly equivalent to reasons?.
Thus we get the regular expression \b(?:a|the|one) reasons? (?:for|of)\b.
Note that you do not need to use the word boundary operators \b within the regular expression, only at the beginning and end (otherwise it would match something like everyone reasons forever).
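The points above can be checked with a small sketch. The variable names and the third, boundary-free pattern are only there for comparison and are assumptions of this example.

```python
import re

# The pattern from the question: five independent alternatives.
ungrouped = re.compile(r"\ba\b|\bthe\b|\bone\b \breason\b|reasons\b \bfor\b|\bof\b")
# The grouped pattern from this answer.
grouped = re.compile(r"\b(?:a|the|one) reasons? (?:for|of)\b")
# Same as grouped but without the outer \b, for comparison only.
unanchored = re.compile(r"(?:a|the|one) reasons? (?:for|of)")

# Any single alternative satisfies the ungrouped pattern, e.g. a lone "of".
print(ungrouped.fullmatch("of") is not None)
# The grouped pattern needs all three words.
print(grouped.fullmatch("of") is not None)
print(grouped.fullmatch("the reasons of") is not None)

# Without the outer \b, the pattern matches inside longer words.
tricky = "everyone reasons forever"
print(unanchored.search(tricky))
print(grouped.search(tricky))
```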

An interesting feature of the regex module is the named list. With it, you don't have to include several alternatives separated by | in a non capturing group. You only need to define the list before and to refer to it in the pattern by its name. Example:
import regex
words = [ ['a', 'the', 'one'], ['reason', 'reasons'], ['for', 'of'] ]
pattern = r'\m \L<word1> \s+ \L<word2> \s+ \L<word3> \M'
p = regex.compile(pattern, regex.X, word1=words[0], word2=words[1], word3=words[2])
s = 'the reasons for'
print(p.search(s))
Even if this feature isn't essential, it improves readability.
You can achieve something similar with the re module if you join items with | before:
import re
words = [ ['a', 'the', 'one'], ['reason', 'reasons'], ['for', 'of'] ]
words = ['|'.join(x) for x in words]
pattern = r'\b ({}) \s+ ({}) \s+ ({}) \b'.format(*words)
p = re.compile(pattern, re.X)
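If the word lists may contain characters that are special in a regex, one option is to escape each word before joining. The sketch below assumes the words are meant literally; build_pattern is a hypothetical helper name, and it does not use re.X so that the pattern stays a plain concatenation.

```python
import re

def build_pattern(word_groups):
    """Compile a pattern matching one word from each group, separated by whitespace."""
    parts = []
    for group in word_groups:
        # Escape each word so characters like "." or "+" are taken literally.
        alternatives = "|".join(re.escape(word) for word in group)
        parts.append("(" + alternatives + ")")
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b")

words = [["a", "the", "one"], ["reason", "reasons"], ["for", "of"]]
p = build_pattern(words)
m = p.search("the reasons for")
print(m.groups() if m else None)
```

The outer \b only behaves as intended when the first and last words start and end with word characters, which holds for the lists in the question.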

As I understand you want some regex like this:
(?:a|the|one)\s+(?:reason|reasons)\s+(?:for|of)
It's so simple, just combine them by using groups.
Note: your requirement above does not sound very strict to me. In case you want to modify the pattern yourself, consider the explanation below.
Explanation
(?:abc|ijk|xyz)
Matches any of the words abc, ijk or xyz. Grouping them with a non-capturing group (?:...) means the word will not be captured into the regex variables $1, $2, $3, ....
\s+
This is the word delimiter, which I set here to any whitespace; + stands for 1 or more.

Use parentheses for grouping:
r'\b(a|the|one) reason(|s) (for|of)\b'
I left the sentence-internal \b's out since the spaces imply them: A space following a letter is always a word boundary. In general you should put the \b outside the alternatives; it's shorter and more readable.
If it matters, you can use "non-capturing groups" in all modern regexp engines: Use (?:stuff) instead of (stuff). But if it doesn't matter for your uses, or if you need to know which of the word alternatives are actually present, then go with simple parens.

you can just use:
r"\b(a|the)\b"

Related

regex: how to get repeating blocks as groups()? [duplicate]

I need to capture multiple groups of the same pattern. Suppose, I have the following string:
HELLO,THERE,WORLD
And I've written the following pattern
^(?:([A-Z]+),?)+$
What I want it to do is to capture every single word, so that Group 1 is : "HELLO", Group 2 is "THERE" and Group 3 is "WORLD". What my regex is actually capturing is only the last one, which is "WORLD".
I'm testing my regular expression here and I want to use it with Swift (maybe there's a way in Swift to get intermediate results somehow, so that I can use them?)
UPDATE: I don't want to use split. I just need to know how to capture all the groups that match the pattern, not only the last one.

With one group in the pattern, you can only get one exact result in that group. If your capture group gets repeated by the pattern (you used the + quantifier on the surrounding non-capturing group), only the last value that matches it gets stored.
You have to use your language's regex implementation functions to find all matches of a pattern, then you would have to remove the anchors and the quantifier of the non-capturing group (and you could omit the non-capturing group itself as well).
Alternatively, expand your regex and let the pattern contain one capturing group per group you want to get in the result:
^([A-Z]+),([A-Z]+),([A-Z]+)$
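The question targets Swift, but the same behaviour can be illustrated with Python's re module. This is only a sketch of the three options described above; the variable names are chosen for the example.

```python
import re

s = "HELLO,THERE,WORLD"

# Repeated capture group: only the last iteration is kept.
repeated = re.match(r"^(?:([A-Z]+),?)+$", s)
print(repeated.groups())

# One match per word instead: drop the anchors and the outer quantifier.
all_words = re.findall(r"[A-Z]+", s)
print(all_words)

# Fixed number of groups: one capture group per expected word.
fixed = re.match(r"^([A-Z]+),([A-Z]+),([A-Z]+)$", s)
print(fixed.groups())
```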

The key distinction is repeating a captured group instead of capturing a repeated group.
As you have already found out, the difference is that repeating a captured group captures only the last iteration. Capturing a repeated group captures all iterations.
In PCRE (PHP):
((?:\w+)+),?
Match 1, Group 1. 0-5 HELLO
Match 2, Group 1. 6-11 THERE
Match 3, Group 1. 12-20 BRUTALLY
Match 4, Group 1. 21-26 CRUEL
Match 5, Group 1. 27-32 WORLD
Since all captures are in Group 1, you only need $1 for substitution.
I used the following general form of this regular expression:
((?:{{RE}})+)
Example at regex101

I think you need something like this....
import re
b = "HELLO,THERE,WORLD"
re.findall(r'\w+', b)
Which in Python3 will return
['HELLO', 'THERE', 'WORLD']

After reading Byte Commander's answer, I want to introduce a tiny possible improvement:
You can generate a regexp that will match up to n words, as long as your n is predetermined. For instance, if I want to match between 1 and 3 words, the regexp:
^([A-Z]+)(?:,([A-Z]+))?(?:,([A-Z]+))?$
will match the following strings, with one, two or three capturing groups.
HELLO,LITTLE,WORLD
HELLO,WORLD
HELLO
You can see a fully detailed explanation about this regular expression on Regex101.
As I said, it is pretty easy to generate this regexp for any groups you want using your favorite language. Since I'm not much of a swift guy, here's a ruby example:
def make_regexp(group_regexp, count: 3, delimiter: ",")
  regexp_str = "^(#{group_regexp})"
  (count - 1).times.each do
    regexp_str += "(?:#{delimiter}(#{group_regexp}))?"
  end
  regexp_str += "$"
  return regexp_str
end
puts make_regexp("[A-Z]+")
That being said, I'd suggest not using regular expression in that case, there are many other great tools from a simple split to some tokenization patterns depending on your needs. IMHO, a regular expression is not one of them. For instance in ruby I'd use something like str.split(",") or str.scan(/[A-Z]+/)
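For readers who work in Python, the generator idea can be sketched as follows. This is a loose translation of the approach, not the author's code; escaping the delimiter with re.escape is an addition of this sketch.

```python
import re

def make_regexp(group_regexp, count=3, delimiter=","):
    """Build a pattern that captures between 1 and `count` delimited items."""
    sep = re.escape(delimiter)  # treat the delimiter as literal text
    pattern = "^(" + group_regexp + ")"
    for _ in range(count - 1):
        # Each further item is optional and has its own capture group.
        pattern += "(?:" + sep + "(" + group_regexp + "))?"
    return pattern + "$"

pattern = make_regexp("[A-Z]+")
for line in ["HELLO,LITTLE,WORLD", "HELLO,WORLD", "HELLO"]:
    print(re.match(pattern, line).groups())
```

Groups that did not take part in the match come back as None, so the caller can tell how many items were present.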

Just to provide additional example of paragraph 2 in the answer. I'm not sure how critical it is for you to get three groups in one match rather than three matches using one group. E.g., in groovy:
def subject = "HELLO,THERE,WORLD"
def pat = "([A-Z]+)"
def m = (subject =~ pat)
m.eachWithIndex { g, i ->
    println "Match #$i: ${g[1]}"
}
Match #0: HELLO
Match #1: THERE
Match #2: WORLD

The problem with the attempted code, as discussed, is that there is one capture group matching repeatedly so in the end only the last match can be kept.
Instead, instruct the regex to match (and capture) all pattern instances in the string, which can be done in any regex implementation (language). So come up with the regex pattern for this.
The defining property of the shown sample data is that the patterns of interest are separated by commas so we can match anything-but-a-comma, using a negated character class
[^,]+
and match (capture) globally, to get all matches in the string.
If your pattern need be more restrictive then adjust the exclusion list. For example, to capture words separated by any of the listed punctuation
[^,.!-]+
This extracts all words from hi,there-again!, without the punctuation. (The - itself should be given first or last in a character class, unless it's used in a range like a-z or 0-9.)
In Python
import re
string = "HELLO,THERE,WORLD"
pattern = r"([^,]+)"
matches = re.findall(pattern,string)
print(matches)
In Perl (and many other compatible systems)
use warnings;
use strict;
use feature 'say';
my $string = 'HELLO,THERE,WORLD';
my @matches = $string =~ /([^,]+)/g;
say "@matches";
(In this specific example the capturing () in fact aren't needed since we collect everything that is matched. But they don't hurt and in general they are needed.)
The approach above works as it stands for other patterns as well, including the one attempted in the question (as long as you remove the anchors, which make it too specific). The most common one is to capture all words (usually meaning [a-zA-Z0-9_]), with the pattern \w+. Or, as in the question, get only the substrings of upper-case ASCII letters, [A-Z]+.

I know that my answer comes late, but I ran into this today and solved it with the following approach:
^(([A-Z]+),)+([A-Z]+)$
The first group (([A-Z]+),)+ matches all the repeated patterns except the final one, and ([A-Z]+) matches the final one. This works no matter how many repeated groups the string contains.

You actually have one capture group that will match multiple times. Not multiple capture groups.
javascript (js) solution:
let string = "HI,THERE,TOM";
let myRegexp = /([A-Z]+),?/g; // modify as you like
let match = myRegexp.exec(string); // js function, output described below
while (match != null) { // loops through matches
  console.log(match[1]); // do whatever you want with each match
  match = myRegexp.exec(string); // find next match
}
Syntax:
// matched text: match[0]
// match start: match.index
// capturing group n: match[n]
As you can see, this will work for any number of matches.
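The same loop can be sketched in Python with re.finditer, which yields one match object per match. The results list is only there to collect the output for this example.

```python
import re

s = "HI,THERE,TOM"
results = []
for m in re.finditer(r"([A-Z]+),?", s):
    # m.group(0) is the full match, m.start() its position, m.group(1) the word.
    results.append((m.start(), m.group(1)))
print(results)
```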

Sorry, not Swift, just a proof of concept in the closest language at hand.
// JavaScript POC. Output:
// Matches: ["GOODBYE","CRUEL","WORLD","IM","LEAVING","U","TODAY"]
let str = `GOODBYE,CRUEL,WORLD,IM,LEAVING,U,TODAY`
let matches = [];
function recurse(str, matches) {
  let regex = /^((,?([A-Z]+))+)$/gm
  let m
  while ((m = regex.exec(str)) !== null) {
    matches.unshift(m[3])
    return str.replace(m[2], '')
  }
  return "bzzt!"
}
while ((str = recurse(str, matches)) != "bzzt!") ;
console.log("Matches: ", JSON.stringify(matches))
Note: If you were really going to use this, you would use the position of the match as given by the regex match function, not a string replace.

Design a regex that matches each particular element of the list rather than the list as a whole. Apply it with /g.
Iterate through the matches, cleaning them of any garbage such as list separators that got mixed in. You may need another regex, or you can get by with a simple substring replace method.
The sample code is in JS, sorry :) The idea should be clear enough.
const string = 'HELLO,THERE,WORLD';
// First use following regex matches each of the list items separately:
const captureListElement = /^[^,]+|,\w+/g;
const matches = string.match(captureListElement);
// Some of the matches may include the separator, so we have to clean them:
const cleanMatches = matches.map(match => match.replace(',',''));
console.log(cleanMatches);

Use a single group for the A-Z pattern and let findall return every match.
import re
data = "HELLO,THERE,WORLD"
pattern = r"([a-zA-Z]+)"
matches = re.findall(pattern, data)
print(matches)
output
['HELLO', 'THERE', 'WORLD']

re.match never returns None? [duplicate]

There is a problem that I need to do, but there are some caveats that make it hard.
Problem: Match on all non-empty strings over the alphabet {a, b, c} that contain at most one a.
Examples
a
abc
bbca
bbcabb
Nonexample
aa
bbaa
Caveats: You cannot use a lookahead/lookbehind.
What I have is this:
^[bc]*a?[bc]*$
but it matches empty strings. Maybe a hint? Anything would help.
(And if it matters, I'm using python).

As I understand your question, the only problem is, that your current pattern matches empty strings. To prevent this you can use a word boundary \b to require at least one word character.
^\b[bc]*a?[bc]*$
See demo at regex101
Another option would be to alternate in a group. Match an a surrounded by any amount of [bc] or one or more [bc] from start to end which could look like: ^(?:[bc]*a[bc]*|[bc]+)$
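A short Python sketch shows why the word boundary rules out the empty string. The variable names are chosen for this example, and the inputs are the examples from the question.

```python
import re

with_boundary = re.compile(r"^\b[bc]*a?[bc]*$")
without_boundary = re.compile(r"^[bc]*a?[bc]*$")

# The original pattern accepts the empty string; the \b version does not,
# because an empty string contains no word character and so no word boundary.
print(without_boundary.match("") is not None)
print(with_boundary.match("") is not None)

for candidate in ["a", "abc", "bbca", "bbcabb", "aa", "bbaa"]:
    print(candidate, with_boundary.match(candidate) is not None)
```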

The way I understood the issue was that any character in the alphabet should match, just only one a character.
Match on all non-empty strings over the alphabet... at most one a
^[b-z]*a?[b-z]*$
If spaces can be included:
^([b-z]*\s?)*a?([b-z]*\s?)*$

You do not even need a regex here, you might as well use .count() and a list comprehension:
data = """a,abc,bbca,bbcabb,aa,bbaa,something without the bespoken letter,ooo"""
def filter(string, char):
    return [word
            for word in string.split(",")
            for c in [word.count(char)]
            if c in [0,1]]
print(filter(data, 'a'))
Yielding
['a', 'abc', 'bbca', 'bbcabb', 'something without the bespoken letter', 'ooo']

You've got to positively match something excluding the empty string,
using only a, b, or c letters. But can't use assertions.
Here is what you do.
The regex ^(?:[bc]*a[bc]*|[bc]+)$
The explanation
^ # BOS
(?: # Cluster choice
[bc]* a [bc]* # only 1 [a] allowed, arbitrary [bc]'s
| # or,
[bc]+ # no [a]'s only [bc]'s ( so must be some )
) # End cluster
$ # EOS
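In Python, one way to apply this is re.fullmatch, which makes the anchors unnecessary and also avoids $ matching just before a trailing newline. The sketch below is only an illustration; is_valid is a hypothetical helper name.

```python
import re

# Same alternation as above, without ^ and $.
pattern = re.compile(r"(?:[bc]*a[bc]*|[bc]+)")

def is_valid(s):
    # fullmatch requires the whole string to match, so no anchors are needed.
    return pattern.fullmatch(s) is not None

for s in ["a", "abc", "bbca", "bbcabb", "bc", "", "aa", "bbaa", "a\n"]:
    print(repr(s), is_valid(s))
```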

Regex find content in between single quotes, but only if contains certain word

I want to get the content between single quotes, but only if it contains a certain word (i.e 'sample_2'). It additionally should not match ones with white space.
Input example: (The following should match and return only: ../sample_2/file and sample_2/file)
['asdf', '../sample_2/file', 'sample_2/file', 'example with space', sample_2, sample]
Right now I only have a pattern that matches the first 3 items in the list:
'(.\S*?)'
I can't seem to find the right regex that would return those containing the word 'sample_2'

If you want specific words/characters you need to have them in the regular expression and not use the '\S'. The \S is the equivalent to [^\r\n\t\f\v ] or "any non-whitespace character".
import re
teststr = "['asdf', '../sample_2/file', 'sample_2/file', 'sample_2 with spaces','example with space', sample_2, sample]"
matches = re.findall(r"'([^\s']*sample_2[^\s]*?)',", teststr)
# ['../sample_2/file', 'sample_2/file']
Based on your wording, the desired word can change. In that case, I would recommend building the pattern string dynamically and compiling it with re.compile().
import re
word = 'sample_2'
teststr = "['asdf', '../sample_2/file', 'sample_2/file', ' sample_2 with spaces','example with space', sample_2, sample]"
regex = re.compile("'([^'\\s]*"+word+"[^\\s]*?)',")
matches = regex.findall(teststr)
# ['../sample_2/file', 'sample_2/file']
Also if you haven't heard of this tool yet, check out regex101.com. I always build my regular expressions here to make sure I get them correct. It gives you the references, explanation of what is happening and even lets you test it right there in the browser.
Explanation of regex
regex = r"'([^\s']*sample_2[^\s]*?)',"
Find first apostrophe, start group capture. Capture anything except a whitespace character or the corresponding ending apostrophe. It must see the letters "sample_2" before accepting any non-whitespace character. Stop group capture when you see the closing apostrophe and a comma.
Note: In Python, a string literal (" or ') prefixed with the character r is a raw string: backslashes are kept literally instead of starting escape sequences, so the pattern does not need double-escaped '\\' characters. The r prefix does not by itself compile the text as a regular expression.
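When the word comes from a variable, it is safer to escape it before inserting it into the pattern. The following is a minimal sketch under two assumptions that go beyond the answer: the word should be matched literally, and quotes are excluded on both sides of it and the trailing comma is not required. find_quoted_with_word is a hypothetical helper name.

```python
import re

def find_quoted_with_word(text, word):
    """Return single-quoted, whitespace-free items in text that contain word."""
    # re.escape keeps characters such as "." or "+" in word from acting as regex syntax.
    pattern = r"'([^\s']*" + re.escape(word) + r"[^\s']*)'"
    return re.findall(pattern, text)

teststr = "['asdf', '../sample_2/file', 'sample_2/file', 'sample_2 with spaces', 'example with space', sample_2, sample]"
found = find_quoted_with_word(teststr, "sample_2")
print(found)
```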

How can I define a regex that matches multiple patterns anchored at the same position in my search text?

I'm trying to use Python's findall to find all the hyphenated and non-hyphenated identifiers in a string (this is to plug into existing code, so using any constructs beyond findall won't work). If you imagine code like this:
regex = ...
body = "foo-bar foo-bar-stuff stuff foo-word-stuff"
ids = re.compile(regex).findall(body)
I would like the ids value to be ['foo', 'bar', 'word', 'foo-bar', 'foo-bar-stuff', 'stuff'] (although not bar-stuff, because it's hyphenated but does not appear as a standalone space-separated identifier). Order of the array/set is not important.
A simple regex which matches the non-hyphenated identifiers is \w+ and one which matches the hyphenated ones is [\w-]+. However, I cannot figure out one which does both simultaneously (I don't have total control over the code, so cannot concatenate the lists together - I would like to do this in one regex if possible).
I have tried \w|[\w-]+ but since the expression is greedy, this misses out bar for example, only matching -bar since foo has already been matched and it won't retry the pattern from the same starting position. I would like to find matches for (for example) both foo and foo-bar which begin (are anchored) at the same string position (which I think findall simply doesn't consider).
I've been trying some tricks such as lookaheads/lookbehinds such as mentioned, but I can't find any way to make them applicable to my scenario.
Any help would be appreciated.

You may use
import re
s = "foo-bar foo-bar-stuff stuff" #=> {'foo-bar', 'foo', 'bar', 'foo-bar-stuff', 'stuff'}
# s = "A-B-C D" # => {'C', 'D', 'A', 'A-B-C', 'B'}
l = re.findall(r'(?<!\S)\w+(?:-\w+)*(?!\S)', s)
res = []
for e in l:
    res.append(e)
    res.extend(e.split('-'))
print(set(res))
Pattern details
(?<!\S) - no non-whitespace right before
\w+ - 1+ word chars
(?:-\w+)* - zero or more repetitions of
- - a hyphen
\w+ - 1+ word chars
(?!\S) - no non-whitespace right after.
See the pattern demo online.
Note that to get all items, I split the matches with - and add these items to the resulting list. Then, with set, I remove any eventual dupes.

If you don't have to use regex, just use split (example below; body is the string from the question):
result = []
for x in body.split():
    if x not in result:
        result.append(x)
    for y in x.split('-'):
        if y not in result:
            result.append(y)

This is not possible with findall alone, since it finds all non-overlapping matches, as the documentation says.
All you can do is find all longest matches with \w[-\w]* or something like that, and then generate all valid spans out of them (most probably starting from their split on '-').
Please note that \w[-\w]* will also match 123, 1-a, and a--, so something like (?=\D)\w[-\w]* or (?=\D)\w+(?:-\w+)* could be preferable (but you would still have to filter out the 1 from a-1).

unexpected re.sub behavior

I defined
import re
s='f(x) has an occ of x but no y'
def italicize_math(line):
    p=r"(\W|^)(x|y|z|f|g|h)(\W|$)"
    repl=r"\1<i>\2</i>\3"
    return re.sub(p,repl,line)
and made the following call:
print(italicize_math(s))
The result is
'<i>f</i>(x) has an occ of <i>x</i> but no <i>y</i>'
which is not what I expected. I wanted this instead:
'<i>f</i>(<i>x</i>) has an occ of <i>x</i> but no <i>y</i>'
Can anyone tell me why the first occurrence of x was not enclosed inside the "i" tags?

You seem to be trying to match non-alphanumeric characters (\W) when you really want a word boundary (\b):
>>> p=r"(\b)(x|y|z|f|g|h)(\b)"
>>> re.sub(p,repl,s)
'<i>f</i>(<i>x</i>) has an occ of <i>x</i> but no <i>y</i>'
Of course, ( is non-alphanumeric. The reason your inner content doesn't match is that \W consumes a character in the match. So with a string like 'f(x)', you match the ( when you match f. Since ( was already matched, it won't match again when you try to match x. By contrast, word boundaries don't consume any characters.

Because the group construct is matching the position at the beginning of the string first and x would overlap the previous match. Also, the first and third groups are redundant since they can be replaced by word boundaries; and you can make use of a character class to combine letters.
p = r'\b([fghxyz])\b'
repl = r'<i>\1</i>'
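Putting the word-boundary pattern into the function from the question gives a runnable sketch; only the function body differs from the original.

```python
import re

def italicize_math(line):
    # \b asserts a word boundary without consuming any character,
    # so adjacent matches such as f and x in "f(x)" do not compete for "(".
    return re.sub(r"\b([fghxyz])\b", r"<i>\1</i>", line)

s = "f(x) has an occ of x but no y"
result = italicize_math(s)
print(result)
```

Letters inside longer words, such as the h in "has", are left alone because there is no word boundary on both sides of them.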

As the previous answers mention, it's because the ( character is consumed when matching f, which causes the subsequent x to fail to match.
Besides replacing with the word boundary \b, you could also use a lookahead, which only peeks and doesn't consume anything matched inside it. Since it doesn't consume anything, you don't need the \3 either:
p=r"(\W|^)(x|y|z|f|g|h)(?=\W|$)"
repl=r"\1<i>\2</i>"
re.sub(p,repl,line)
