Python 2.7 Regex capture groups not working as predicted

Python 2.7 Regex capture groups not working as predicted - python

I am trying to pattern match and replace first person with second person with Python 2.7.
string = re.sub(r'(\W)I(\W)', '\g<1>you\g<2>',string)
string = re.sub(r'(\W)(me)(\W)', '\g<1>you\g<3>',string)
# but does NOT work
string = re.sub(r'(\W)I|(me)(\W)', '\g<1>you\g<3>',string)
I want to use the last regex, but somehow the capture groups are all messed up and even doing a \g<0> shows strange, irregular matches. I would think that capture group 3 would be the last word boundary, but it doesn't appear to be.
A sample sentence could be: I like candy.
I am not interested very much in the correctness of the replacement (me will never actually be selected since I goes first), but I don't know why the capture groups don't work as I would expect.
Thanks!

Try with following regex.
Regex: \b(I|me)\b
Explanation:
\b on both sides marks the word boundary.
(I|me) matches either I OR me.
Note:- You can make it case insensitive using i flag.
Regex101 Demo

Related

Python regex expression example

I have an input that is valid if it has this parts:
starts with letters(upper and lower), numbers and some of the following characters (!,#,#,$,?)
begins with = and contains only of numbers
begins with "<<" and may contain anything
example: !!Hel##lo!#=7<<vbnfhfg
what is the right regex expression in python to identify if the input is valid?
I am trying with
pattern= r"([a-zA-Z0-9|!|#|#|$|?]{2,})([=]{1})([0-9]{1})([<]{2})([a-zA-Z0-9]{1,})/+"
but apparently am wrong.

For testing regex I can really recommend regex101. Makes it much easier to understand what your regex is doing and what strings it matches.
Now, for your regex pattern and the example you provided you need to remove the /+ in the end. Then it matches your example string. However, it splits it into four capture groups and not into three as I understand you want to have from your list. To split it into four caputre groups you could use this:
"([a-zA-Z0-9!##$?]{2,})([=]{1}[0-9]+)(<<.*)"
This returns the capture groups:
!!Hel##lo!#
=7
<<vbnfhfg
Notice I simplified your last group a little bit, using a dot instead of the list of characters. A dot matches anything, so change that back to your approach in case you don't want to match special characters.
Here is a link to your regex in regex101: link.

Python Regex: Match paragraph numbers

I am attempting to match paragraph numbers inside my block of text. Given the following sentence:
Refer to paragraph C.2.1a.5 for examples.
I would like to match the word C.2.1a.5.
My current code like so:
([0-9a-zA-Z]{1,2}\.)
Only matches C.2.1a. and es., which is not what I want. Is there a way to match the full C.2.1a.5 and not match es.?
https://regex101.com/r/cO8lqs/13723
I have attempted to use ^ and $, but doing so returns no matches.

You should use following regex to match the paragraph numbers in your text.
\b(?:[0-9a-zA-Z]{1,2}\.)+[0-9a-zA-Z]\b
Try this demo
Here is the explanation,
\b - Matches a word boundary hence avoiding matching partially in a large word like examples.
(?:[0-9a-zA-Z]{1,2}\.)+ - This matches an alphanumeric text with length one or two as you tried to match in your own regex.
[0-9a-zA-Z] - Finally the match ends with one alphanumeric character at the end. In case you want it to match one or two alphanumeric characters at the end too, just add {1,2} after it
\b - Matches a word boundary again to ensure it doesn't match partially in a large word.
EDIT:
As someone pointed out, in case your text has strings like A.A.A.A.A.A. or A.A.A or even 1.2 and you don't want to match these strings and only want to match strings that has exactly three dots within it, you should use following regex which is more specific in matching your paragraph numbers.
(?<!\.)\b(?:[0-9a-zA-Z]{1,2}\.){3}[0-9a-zA-Z]\b(?!\.)
This new regex matches only paragraph numbers having exactly three dots and those negative look ahead/behind ensures it doesn't match partially in large string like A.A.A.A.A.A
Updated regex demo
Check these python sample codes,
import re
s = 'Refer to paragraph C.2.1a.5 for examples. Refer to paragraph A.A.A.A.A.A.A for examples. Some more A.A.A or like 1.22'
print(re.findall(r'(?<!\.)\b(?:[0-9a-zA-Z]{1,2}\.){3}[0-9a-zA-Z]\b(?!\.)', s))
Output,
['C.2.1a.5']
Also for trying to use ^ and $, they are called start and end anchors respectively, and if you use them in your regex, then they will expect matching start of line and end of line which is not what you really intend to do hence you shouldn't be using them and like you already saw, using them won't work in this case.

If simple version is required, you can use this easy to understand and modify regex ([A-Z]{1}\.[0-9]{1,3}\.[0-9]{1,3}[a-z]{1}\.[0-9]{1,3})

I think we should keep the regex expression simple and readable.
You can use the regex
**(?:[a-zA-Z]+\.){3}[a-zA-Z]+**
Explanation -
The expression (?:[a-zA-Z]+.){3} ensures that the group (?:[a-zA-Z]+.) is to be repeated 3 times within the word. The group contains an alphabetic character followed a dot.
The word would end with an alphabetic character.
Output:
['C.2.1a.5']

NOT operator for regex

Using python script, I am cleaning a piece of text where I want to replace following words:
promocode, promo, code, coupon, coupon code, code.
However, I dont want to replace them if they start with a '#'. Thus, #promocode, #promo, #code, #coupon should remain the way they are.
I tried following regex for it:
1. \b(promocode|promo code|promo|coupon code|code|coupon)\b
2. (?<!#)(promocode|promo code|promo|coupon code|code|coupon)
None of them are working. I am basically looking something that will allow me to say "Does NOT start with # and" (promocode|promo code|promo|coupon code|code|coupon)
Any suggestions ?

You need to use a negative look-behind:
(?<!#)\b(?:promocode|promo code|promo|coupon code|code|coupon)\b
This (?<!#) will ensure you will only match these words if there is no # before them and \b will ensure you only match whole words. The non-capturing group (?:...) is used just for grouping purposes so as not to repeat \b around each alternative in the list (e.g. \bpromo\b|\bcode\b...). Why use non-capturing group? So that it does not interfere with the Match result. We do not need unnecessary overhead with digging out the values (=groups) we need.
See demo here
See IDEONE demo, only the first promo is deleted:
import re
p = re.compile(r'(?<!#)\b(?:promocode|promo code|promo|coupon code|code|coupon)\b')
test_str = "promo #promo "
print(p.sub('', test_str))
A couple of words about your regular expressions.
The \b(promocode|promo code|promo|coupon code|code|coupon)\b is good, but it also matches the words in the alternation group not preceded with #.
The (?<!#)(promocode|promo code|promo|coupon code|code|coupon) regex is better, but you still do not match whole words (see this demo).

Regular expression pattern questions?

I am having a hard time understanding regular expression pattern. Could someone help me regular expression pattern to match all words ending in s. And start with a and end with a (like ana).
How do I write ending?

Word boundaries are given by \b so the following regex matches words ending with ing or s: "\b(\w+?(?:ing|s))\b" where as \b is a word boundary, \w+ is one or more "word character" and (?:ing|s) is an uncaptured group of either ing or s.
As you asked "how to develop a regex":
First: Don't use regex for complex tasks. They are hard to read, write and maintain. For example there is a regex that validates email addresses - but its computer generated and nothing you should use in practice.
Start simple and add edge cases. At the beginning plan what characters you need to use: You said you need words ending with s or ing. So you probably need something to represent a word, endings of words and the literal characters s and ing. What is a word? This might change from case to case, but at least every alphabetical character. Looking up in the python documentation on regexes you can find \w which is [a-zA-Z0-9_], which fits my impression of a word character. There you can also find \b which is a word boundary.
So the "first pseudo code try" is something like \b\w...\w\b which matches a word. We still need to "formalize" ... which we want to have the meaning of "one ore more characters", which directly translates to \b\w+\b. We can now match a word! We still need the s or ing. | translates to or, so how is the following: \b\w+ing|s\b? If you test this, you'll see that it will match confusing things like ingest which should not match our regex. What is happening? As you probably already saw the | can't know "which part it should or", so we need to introduce parenthesis: \b\w+(ing|s)\b. Congratulations, you have now arrived at a working regex!
Why (and how) does this differ from the example I gave first? First I wrote \w+? instead of \w+, the ? turns the + into a non-greedy version. If you know what the difference between greedy and non greedy is, skip this paragraph. Consider the following: AaAAbA and we want to match the things enclosed with big letter A. A naive try: A\w+A, so one or more word characters enclosed with A. This matches AaA, but also AaAAbA, A is still something that can be matched by \w. Without further config the *+? quantifier all try to match as much as possible. Sometimes, like in the A example, you don't want that, you can then use a ? after the quantifier to signal you want a non-greedy version, a version that matches as little as possible.
But in our case this isn't needed, the words are well seperated by whitespaces, which are not part of \w. So in fact you can just let + be greedy and everything will be alright. If you use . (any character) you often need to be careful not to match to much.
The other difference is using (?:s|ing) instead of (s|ing). What does the ?: do here? It changes a capturing group to a non capturing group. Generally you don't want to get "everything" from the regex. Consider the following regex: I want to go to \w+. You are not interested in the whole sentence, but only in the \w+, so you can capture it in a group: I want to go to (\w+). This means that you are interested in this specific piece of information and want to retrieve it later. Sometimes (like when using |) you need to group expressions together, but are not interested in their content, you can then declare it as non capturing. Otherwise you will get the group (s or ing) but not the actual word!
So to summarize:
* start small
* add one case after another
* always test with examples
In fact I just tried re.findall(\b\w+(?:ing|s)\b, "fishing words") and it didn't work. \w+(?:ing|s) works. I've no idea why, maybe someone else can explain that. Regex are an arcane thing, only use them for easy and easy to test tasks.

Generally speaking I'd use \b to match "word boundaries" with \w which matches word components (short cut for [A-Za-z0-9_]). Then you can do an or grouping to match "s" or "ing". Result is:
/\b\w+(s|ing)\b/

Negating match if a string is just before another string

I'm struggling to get a regex to work where it matches a certain pattern, so long as isn't proceeded by another. For example,
Accessory for MyProduct01 <<< Should be classified as an accessory
MyProduct01 with accessory << Should be classified as a product
So I need to add something to my 'accessory' regex, something like 'match "accessory" so long as the word before isn't "with"'.
I have seen some examples where people are using negative lookaheads to find if a word is anywhere in the string, but I want to be a bit more specific regarding the position of the word to negate. Something like:
(?!with\s)accessory

Just use a negative look-behind in your regex:
(?<!with\s)accessory

Since Python doesn't support unbounded lookbehinds, I think you are going to have to use a lookahead similar to what you are currently using, but change the original pattern a bit.
^(?!\bwith\b.*\baccessory\b)(?=.*\b(accessory)\b)
Here, the negative lookahead is used to ensure that "accessory" doesn't come after the word "with". Then, the positive lookahead is used to ensure that the word "accessory" occurs within the string, captured with a group if you need to capture it for some reason.
Based on the way that I wrote the above, you'd want to use the search method and not the match method. In order to use match, which requires that the entire search string match the pattern, you'd need to add a bit more to the pattern:
^(?!\bwith\b.*\baccessory\b)(?=.*\b(accessory)\b).*$

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.