I have a string that can be illustrated by the following (extra spaces intended):
"words that don't matter START some words one some words two some words three END words that don't matter"
To grab each substring between START and END ['some words one', 'some words two', 'some words three'], I wrote the following code:
result = re.search(r'(?<=START).*?(?=END)', string, flags=re.S).group()
result = re.findall(r'(\(?\w+(?:\s\w+)*\)?)', result)
Is it possible to achieve this with one single regex?
In theory you could just wrap your second regex in ()* and put it into your first. That would capture all occurrences of your inner expression in the bounds. Unfortunately the Python implementation only retains the last match of a group that is matched multiple times. The only implementation that I know that retains all matches of a group is the .NET one. So unfortunately not a solution for you.
On the other hand why can't you simply keep the two step approach that you have?
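As a minimal sketch of keeping the two-step approach, the two calls can be wrapped in one helper. The function name between_markers and the sample string are made up for illustration, the inner pattern is simplified from the question's (the optional parentheses are dropped), and the sample assumes at least two spaces between the groups.

```python
import re

def between_markers(text):
    """Return the word groups found between START and END (hypothetical helper)."""
    # Step 1: isolate the region between the two markers.
    region = re.search(r'(?<=START).*?(?=END)', text, flags=re.S)
    if region is None:
        return []
    # Step 2: collect runs of words separated by exactly one whitespace character.
    return re.findall(r'\w+(?:\s\w+)*', region.group())

sample = "noise START some words one   some words two   some words three END noise"
groups = between_markers(sample)
print(groups)
```

Returning an empty list when the markers are missing avoids the AttributeError that calling .group() on None would raise.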
Edit:
You can compare the behaviour I described using online regex tools.
Pattern: (\w+\s*)* Input: aaa bbb ccc
Try it for example with https://pythex.org/ and http://regexstorm.net/tester.
You will see that Python returns one match/group which is ccc while .NET returns $1 as three captures aaa, bbb, ccc.
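The same behaviour can be checked locally with the standard re module. This is only a small sketch using the pattern and input above; the variable names are arbitrary.

```python
import re

# A repeated capturing group keeps only its last iteration in Python's re.
m = re.fullmatch(r'(\w+\s*)*', 'aaa bbb ccc')
last_capture = m.group(1)
print(last_capture)  # the earlier iterations 'aaa ' and 'bbb ' are discarded
```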
Edit2: As @Jan says, there is also the newer regex module, which supports multiple captures. I had completely forgotten about that.
With the newer regex module, you can do it in one step:
(?:\G(?!\A)|START)\s*\K
(?!\bEND\b)
\w+\s+\w+\s+\w+
This looks complicated, but broken down, it says:
(?:\G(?!\A)|START) # look for START or the end of the last match
\s*\K # whitespaces, \K "forgets" all characters to the left
(?!\bEND\b) # neg. lookahead, do not overrun END
\w+\s+\w+\s+\w+ # your original expression
In Python this looks like:
import regex as re
rx = re.compile(r'''
(?:\G(?!\A)|START)\s*\K
(?!\bEND\b)
\w+\s+\w+\s+\w+''', re.VERBOSE)
string = "words that don't matter START some words one some words two some words three END words that don't matter"
print(rx.findall(string))
# ['some words one', 'some words two', 'some words three']
Additionally, see a demo on regex101.com.
This is an ideal situation for re.split, as @PeterE mentioned, to circumvent the problem of having access only to the last captured group.
import re
s=r'"words that don\'t matter START some words one some words two some words three END words that don\'t matter" START abc a bc c END'
print('\n'.join(re.split(r'^.*?START\s+|\s+END.*?START\s+|\s+END.*?$|\s{2,}',s)[1:-1]))
If the input spans several lines, pass the re.MULTILINE (re.M) flag, since the pattern uses ^ and $.
OUTPUT
some words one
some words two
some words three
abc
a bc c
Related
In python, I want to match substring containing two terms with up to a certain number of words in between but not when it is equal to a certain substring.
I have this regular expression (regex101) that does the first part of the job, matching two terms with up to a certain number of words in between.
But I want to add a part or condition with AND operator to exclude a specific sentence like "my very funny words"
my(?:\s+\w+){0,2}\s+words
Expected results for this input:
I'm searching for my whatever funny words inside this text
should match with my whatever funny words
while for this input:
I'm searching for my very funny words inside this text
there should be no match
Thank you all for helping out
You may use the following regex pattern:
my(?! very funny)(?:\s+\w+){0,2}\s+words
This inserts a negative lookahead (?! very funny) into your existing pattern to exclude the matches you don't want. Here is a working demo.
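A quick way to check the lookahead in Python, as a sketch using the two inputs from the question (the variable names are arbitrary):

```python
import re

pattern = re.compile(r'my(?! very funny)(?:\s+\w+){0,2}\s+words')

allowed = pattern.search("I'm searching for my whatever funny words inside this text")
blocked = pattern.search("I'm searching for my very funny words inside this text")

print(allowed.group() if allowed else None)
print(blocked)  # the lookahead rejects "my" when " very funny" follows it
```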
I am attempting to match paragraph numbers inside my block of text. Given the following sentence:
Refer to paragraph C.2.1a.5 for examples.
I would like to match the word C.2.1a.5.
My current code looks like this:
([0-9a-zA-Z]{1,2}\.)
Only matches C.2.1a. and es., which is not what I want. Is there a way to match the full C.2.1a.5 and not match es.?
https://regex101.com/r/cO8lqs/13723
I have attempted to use ^ and $, but doing so returns no matches.
You can use the following regex to match the paragraph numbers in your text.
\b(?:[0-9a-zA-Z]{1,2}\.)+[0-9a-zA-Z]\b
Try this demo
Here is the explanation,
\b - Matches a word boundary hence avoiding matching partially in a large word like examples.
(?:[0-9a-zA-Z]{1,2}\.)+ - This matches one or two alphanumeric characters followed by a literal dot, as you tried in your own regex, and the + repeats that group one or more times.
[0-9a-zA-Z] - Finally the match ends with one alphanumeric character at the end. In case you want it to match one or two alphanumeric characters at the end too, just add {1,2} after it
\b - Matches a word boundary again to ensure it doesn't match partially in a large word.
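To see the difference between the two patterns, here is a small sketch that runs the question's original pattern and the one explained above on the sample sentence. The variable names are arbitrary.

```python
import re

sentence = 'Refer to paragraph C.2.1a.5 for examples.'

# Original attempt: each "one or two characters plus dot" chunk is its own match.
pieces = re.findall(r'([0-9a-zA-Z]{1,2}\.)', sentence)

# Word boundaries plus a repeated non-capturing group give one whole match.
whole = re.findall(r'\b(?:[0-9a-zA-Z]{1,2}\.)+[0-9a-zA-Z]\b', sentence)

print(pieces)
print(whole)
```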
EDIT:
As someone pointed out, your text may contain strings like A.A.A.A.A.A. or A.A.A or even 1.2. If you don't want to match these and only want strings that have exactly three dots, use the following regex, which is more specific in matching your paragraph numbers.
(?<!\.)\b(?:[0-9a-zA-Z]{1,2}\.){3}[0-9a-zA-Z]\b(?!\.)
This new regex matches only paragraph numbers that have exactly three dots, and the negative lookahead and lookbehind ensure that it doesn't match part of a longer string like A.A.A.A.A.A
Updated regex demo
Here is a Python sample:
import re
s = 'Refer to paragraph C.2.1a.5 for examples. Refer to paragraph A.A.A.A.A.A.A for examples. Some more A.A.A or like 1.22'
print(re.findall(r'(?<!\.)\b(?:[0-9a-zA-Z]{1,2}\.){3}[0-9a-zA-Z]\b(?!\.)', s))
Output,
['C.2.1a.5']
As for ^ and $: these are the start and end anchors. When you use them, the regex must match from the start of the line to the end of the line, which is not what you intend here, so you should not use them. As you already saw, they give no matches in this case.
If a simple version is required, you can use this regex, which is easy to understand and modify: ([A-Z]{1}\.[0-9]{1,3}\.[0-9]{1,3}[a-z]{1}\.[0-9]{1,3})
I think we should keep the regex expression simple and readable.
You can use the regex
(?:[a-zA-Z0-9]+\.){3}[a-zA-Z0-9]+
Explanation -
The expression (?:[a-zA-Z0-9]+\.){3} ensures that the group (?:[a-zA-Z0-9]+\.) is repeated 3 times within the word. The group contains one or more alphanumeric characters followed by a dot. The class includes digits, since paragraph numbers such as C.2.1a.5 contain them.
The word then ends with one or more alphanumeric characters.
Output:
['C.2.1a.5']
I'm parsing a file which has text "$string1:$string2"
How do I regex match this string and extract "string1" and "string2" from it, basically regex match this pattern : "$*:$*"
You were nearly there with your own pattern, it needs three alterations in order to work as you want it.
First, the star in regexes isn't a glob, as you might expect from shell scripting; it's a Kleene star. That means it needs some character or group to which it can apply its "zero to n times" logic. In your case, the word character class \w (letters, digits and underscore) should work. If that's too restrictive, use . instead, which matches any character except line breaks.
Secondly, you need to apply the regex in a way that you can easily extract the results you want. The usual way to go about it is to define groups, using parentheses.
Last but not least, the $ sign is a meta-character in regexes, so if you want to match it literally, you need to write a backslash in front of it.
In working code, it'll look like this:
import re
s = "$string1:$string2"
r = re.compile(r"\$(\w*):\$(\w*)")
match = r.match(s)
print(match.group(1)) # print the first group that was matched
print(match.group(2)) # print the second group that was matched
Output:
string1
string2
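Since the question mentions parsing a file, here is a rough sketch of applying the same pattern line by line. The helper name extract_pairs and the sample lines are invented for illustration; it uses search instead of match so the pair may appear anywhere in a line, and it skips lines without a match rather than failing on None.

```python
import re

PAIR = re.compile(r"\$(\w*):\$(\w*)")

def extract_pairs(lines):
    """Collect (first, second) tuples from every line that contains a pair."""
    pairs = []
    for line in lines:
        m = PAIR.search(line)
        if m:  # search returns None when the line has no "$a:$b" pair
            pairs.append(m.groups())
    return pairs

sample_lines = ["$string1:$string2", "no pair on this line", "prefix $a:$b suffix"]
pairs = extract_pairs(sample_lines)
print(pairs)
```

An open file object can be passed in place of the list, since iterating over a file yields its lines.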
I want to match a specific order of letters with Python's re module. For example, how could I match things like
bob
ara
gag
but not
aaa
bal
ie: I want one letter, then another, and then the first again.
For two different letters, I could just loop over all 650 possibilities. However, when matching larger strings, that becomes impossible (and isn't really nice anyways).
You can use this regex with capturing group, lookahead and a back-reference:
^([a-zA-Z])(?!\1)[a-zA-Z]\1$
RegEx Demo
^ # line start
([a-zA-Z]) # match any letter and capture it as group #1
(?!\1)[a-zA-Z] # match any letter but make sure it is not what we have in group #1
\1 # match what we captured in capture group #1
$ # line end
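A short sketch of this pattern in Python, run against the words from the question (the variable names are arbitrary):

```python
import re

aba = re.compile(r'^([a-zA-Z])(?!\1)[a-zA-Z]\1$')

words = ['bob', 'ara', 'gag', 'aaa', 'bal']
# Keep only words of the form x-y-x where y differs from x.
matching = [w for w in words if aba.match(w)]
print(matching)
```

The same idea extends to longer patterns by adding more groups and back-references, for example a second group with \2 for an x-y-y-x shape.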
Use this regex ([a-zA-Z])(?!\1)[a-zA-Z]\1
Using python script, I am cleaning a piece of text where I want to replace following words:
promocode, promo, code, coupon, coupon code, code.
However, I don't want to replace them if they start with a '#'. Thus, #promocode, #promo, #code, #coupon should remain the way they are.
I tried following regex for it:
1. \b(promocode|promo code|promo|coupon code|code|coupon)\b
2. (?<!#)(promocode|promo code|promo|coupon code|code|coupon)
None of them are working. I am basically looking something that will allow me to say "Does NOT start with # and" (promocode|promo code|promo|coupon code|code|coupon)
Any suggestions ?
You need to use a negative look-behind:
(?<!#)\b(?:promocode|promo code|promo|coupon code|code|coupon)\b
This (?<!#) will ensure you will only match these words if there is no # before them and \b will ensure you only match whole words. The non-capturing group (?:...) is used just for grouping purposes so as not to repeat \b around each alternative in the list (e.g. \bpromo\b|\bcode\b...). Why use non-capturing group? So that it does not interfere with the Match result. We do not need unnecessary overhead with digging out the values (=groups) we need.
See demo here
See IDEONE demo, only the first promo is deleted:
import re
p = re.compile(r'(?<!#)\b(?:promocode|promo code|promo|coupon code|code|coupon)\b')
test_str = "promo #promo "
print(p.sub('', test_str))
A couple of words about your regular expressions.
The \b(promocode|promo code|promo|coupon code|code|coupon)\b regex is good, but it also matches the words when they are preceded with #, because \b matches between # and a letter.
The (?<!#)(promocode|promo code|promo|coupon code|code|coupon) regex is better, but you still do not match whole words (see this demo).
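The difference between the three patterns can be seen side by side in a small sketch. The test strings here are made up for illustration and are not from the question.

```python
import re

words = r'(?:promocode|promo code|promo|coupon code|code|coupon)'

only_boundaries = re.compile(r'\b' + words + r'\b')        # first attempt
only_lookbehind = re.compile(r'(?<!#)' + words)            # second attempt
combined = re.compile(r'(?<!#)\b' + words + r'\b')         # suggested pattern

# \b alone does not protect the hashtag: "#promo" loses its word.
r1 = only_boundaries.sub('', 'promo #promo')
# The lookbehind alone also eats the start of a longer word like "promotion".
r2 = only_lookbehind.sub('', 'promotion #promo')
# Both together leave "promotion" and "#promo" intact and drop only the bare word.
r3 = combined.sub('', 'promotion #promo promo')

print(repr(r1), repr(r2), repr(r3))
```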