Non-greedy wild card appears to match greedily?

I need to understand why my regular expression is matching greedily when I am telling it not to.
Given string = '.GATA..GATA..ETS..ETS.'
I want the shortest substring of the form GATA...ETS.
I use this regex pattern:
import re
pattern = r'(GATA).*?(ETS)'
syntax_finder = re.compile(pattern, re.IGNORECASE)
for match in syntax_finder.finditer(string):
    print(match)
This returns <re.Match object; span=(1, 16), match='GATA..GATA..ETS'>
However, I want it to return 'GATA..ETS'.
Does anyone know why this is happening?
I am not looking for a solution to this exact matching problem. I will be doing a lot of these types of searches with more complicated patterns of GATA and ETS, but I will always want it to return the shortest match.
Thanks!

Does anyone know why this is happening?
The regex matches non-greedily. It finds the first GATA and then, because .*? is used rather than .*, matches until the first ETS after that. It just happens that there is another GATA in the way, which you don't want - but which non-greedy matching doesn't care about.
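To make the point above concrete, here is a minimal sketch comparing the lazy and greedy quantifiers on the string from the question. Both searches start at the first GATA; the quantifier only decides how far to the right the match extends. The variable names are mine, chosen for illustration.

```python
import re

s = '.GATA..GATA..ETS..ETS.'

# Lazy: starts at the leftmost GATA and stops at the first ETS after it.
lazy = re.search(r'GATA.*?ETS', s)
# Greedy: starts at the same GATA but runs on to the last ETS.
greedy = re.search(r'GATA.*ETS', s)

print(lazy.span(), lazy.group())
print(greedy.span(), greedy.group())
```

Neither version skips the first GATA, because the engine reports the first position at which the whole pattern can match.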
I will be doing a lot of these types of searches with more complicated patterns of GATA and ETS
Then regexes are probably underpowered for the job. My suggestion is to use them to split the string into GATA, ETS and intervening portions (tokenization), and then use other techniques to find the patterns in that sequence (parsing).
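As a rough sketch of that tokenize-then-parse idea, assuming that "shortest match" means a GATA whose next token is an ETS: the helper name shortest_gata_ets and the adjacency rule are my own illustration, not something from the question.

```python
import re

# Tokenizer: find every GATA or ETS, ignoring case.
TOKEN = re.compile(r'GATA|ETS', re.IGNORECASE)

def shortest_gata_ets(sequence):
    """Return substrings running from a GATA to the ETS token right after it."""
    tokens = [(m.group().upper(), m.start(), m.end())
              for m in TOKEN.finditer(sequence)]
    hits = []
    # "Parsing" step: walk over neighbouring token pairs.
    for (kind_a, start_a, _), (kind_b, _, end_b) in zip(tokens, tokens[1:]):
        if kind_a == 'GATA' and kind_b == 'ETS':
            hits.append(sequence[start_a:end_b])
    return hits

result = shortest_gata_ets('.GATA..GATA..ETS..ETS.')
print(result)
```

More complicated arrangements of GATA and ETS would then be rules over the token list rather than ever larger regexes.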
I am not looking for a solution to this exact matching problem.
But I can't resist :)
>>> re.search(r'(GATA)((?<!GAT)A|[^A])*?(ETS)', '.GATA..GATA..ETS..ETS.')
<_sre.SRE_Match object; span=(7, 16), match='GATA..ETS'>
Here we use a negative lookbehind assertion: while scanning the part between GATA and ETS, we only allow an A if it is not preceded by GAT.

Related

Python regex Doesn't Match a Simple Pattern

I am trying to match a very simple pattern using Python's regex package (I am new to regex). I don't understand the following behavior:
import regex
regex.match('economy', 'promising.\n\nARTICLE 4\n\nECONOMY The economy')
or
regex.match('ARTICLE', 'promising.\n\nARTICLE 4\n\nECONOMY The economy')
doesn't match anything. Of course if I do
regex.match('economy', 'economy')
it does match. Why is that the case?
Also, if I want to match case sensitive 'ARTICLE' in the above example, what would be the right way to do it?
I am using version 2016.1.10 of regex.

match only looks for a match at the start of the string. If you want to match anywhere else in the string, you need to use search.
I don't have regex installed here but it should be the same as re.
>>> re.search('economy', 'promising.\n\nARTICLE 4\n\nECONOMY The economy')
<_sre.SRE_Match object; span=(35, 42), match='economy'>
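A small sketch of the difference, using the standard re module as in the answer (the question uses the third-party regex package, which should behave the same way here). It also shows re.IGNORECASE for when case should not matter; the variable names are illustrative.

```python
import re

text = 'promising.\n\nARTICLE 4\n\nECONOMY The economy'

# match() is anchored at position 0, so this finds nothing.
at_start = re.match('economy', text)

# search() scans the whole string; matching is case sensitive by default.
anywhere = re.search('economy', text)
article = re.search('ARTICLE', text)

# With IGNORECASE the upper-case ECONOMY is found first.
any_case = re.search('economy', text, re.IGNORECASE)

print(at_start, anywhere.span(), article.span(), any_case.span())
```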

Finding all words: Negative Look Behind in Regex

I am currently using Python 2.7 (I'm working with some old code of mine). And I am trying to get all words via regex, where I can ignore words with apostrophes, like can't and Gary's. So far I have made all letters in the string lowercase and here's my current regex:
r"(?<=\s|^)([a-z]+)(?=\s|$)"
I get the following error:
raise error, v # invalid expression
error: look-behind requires fixed-width pattern
I also tried:
r"(?:\s|^)([a-z]+)(?=\s|$)"
But, as you can see on Regex101, it doesn't capture the last word.
I know that there are probably better alternatives to doing this, but now I am really curious as to how to do a negative look behind in this situation. However, if you could explain that and offer your own better solution, that'd be fine and appreciated.

In this case, just use a negative lookbehind with the opposite character class \S (same can be done with the lookahead):
r"(?<!\S)([a-z]+)(?!\S)"
See the regex demo.
A "positive" approach will look less pretty:
r"(?:(?<=\s)|^)([a-z]+)(?=\s|$)"
See another regex demo. The (?:(?<=\s)|^) non-capturing group combines 2 zero-width assertion alternatives, (?<=\s) that requires a whitespace before the current location, and ^, matching the start of string.
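Both patterns can be tried quickly with re.findall. This is a minimal sketch; the sample sentence is made up for illustration and is already lower-cased, as in the question.

```python
import re

sentence = "i can't find gary's red hat"

# Negative lookarounds: no non-whitespace directly before or after the word.
negative = re.findall(r"(?<!\S)([a-z]+)(?!\S)", sentence)

# "Positive" variant: whitespace or start before, whitespace or end after.
positive = re.findall(r"(?:(?<=\s)|^)([a-z]+)(?=\s|$)", sentence)

print(negative)
print(positive)
```

Words containing an apostrophe are skipped entirely, because neither half of can't or gary's has whitespace (or a string edge) on both sides.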

Python Regular Expression - Named Group Not Fully Matching

I have the following Python regex pattern:
(?P<key>.*)(?P<operator><=>|=|>=|>|<=|<|!=|<>)(?P<value>.*)
and my input string example is: this!=that, but the != is not getting matched as a group:
{u'operator': '=', u'key': 'this!', u'value': 'that'}
Can you please help me match against the full operator != in this example using the above regex pattern with some explanation on why my original pattern did not work? Thank you in advance!

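You need to use lazy matching with the first capturing group. Otherwise the greedy .* first consumes the whole string, since . can also match any of the symbols in your alternatives, and then gives back only as many characters as needed for some alternative to match. That happens as soon as the single = is available, so the ! stays in the key: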
(?P<key>.*?)(?P<operator><=>|!=|>=|<=|<>|[=><])(?P<value>.*)
See demo
I have also rearranged the alternatives so that they go from the longest to the shortest. This can be important because an alternation tries its branches from left to right and takes the first one that matches, so the longest option should be checked first.
And the last three alternatives can be shrunk into a character class [=><] to lessen the backtracking.
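Here is a short check of both patterns against the input from the question, plus two extra inputs that I made up to exercise the longer operators. The variable names are mine.

```python
import re

# Original pattern: greedy key.
greedy_key = re.compile(
    r'(?P<key>.*)(?P<operator><=>|=|>=|>|<=|<|!=|<>)(?P<value>.*)')
# Suggested pattern: lazy key, longest operators first.
lazy_key = re.compile(
    r'(?P<key>.*?)(?P<operator><=>|!=|>=|<=|<>|[=><])(?P<value>.*)')

before = greedy_key.match('this!=that').groupdict()
after = lazy_key.match('this!=that').groupdict()

print(before)
print(after)
print(lazy_key.match('a<=>b').group('operator'))
print(lazy_key.match('x>=1').group('operator'))
```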

Regular expression pattern questions?

I am having a hard time understanding regular expression patterns. Could someone help me write a regular expression pattern that matches all words ending in s, and one that matches words that start with a and end with a (like ana)?
How do I write the ending?

Word boundaries are given by \b, so the following regex matches words ending with ing or s: "\b(\w+?(?:ing|s))\b", where \b is a word boundary, \w+ is one or more "word characters" and (?:ing|s) is a non-capturing group of either ing or s.
As you asked "how to develop a regex":
First: Don't use regex for complex tasks. They are hard to read, write and maintain. For example, there is a regex that validates email addresses, but it's computer-generated and nothing you should use in practice.
Start simple and add edge cases. At the beginning, plan what characters you need to use: You said you need words ending with s or ing. So you probably need something to represent a word, endings of words and the literal characters s and ing. What is a word? This might change from case to case, but at least every alphabetical character. Looking in the Python documentation on regexes you can find \w, which for ASCII text is [a-zA-Z0-9_] and fits my impression of a word character. There you can also find \b, which is a word boundary.
So the "first pseudo code try" is something like \b\w...\w\b which matches a word. We still need to "formalize" the ..., which should mean "one or more characters"; that translates directly to \b\w+\b. We can now match a word! We still need the s or ing. | translates to or, so how about the following: \b\w+ing|s\b? If you test this, you'll see that it matches confusing things, like the sing in singer or just the final s of a word. What is happening? As you probably already saw, the | can't know "which part it should or", so it splits the whole pattern into \b\w+ing and s\b. We need to introduce parentheses: \b\w+(ing|s)\b. Congratulations, you have now arrived at a working regex!
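To see how | splits an ungrouped pattern in two, here is a small check. The sample text 'singer sings' is my own choice, and I use a non-capturing group in the second pattern so that re.findall returns whole matches.

```python
import re

text = 'singer sings'

# Without grouping, the alternatives are \b\w+ing  and  s\b.
ungrouped = re.findall(r'\b\w+ing|s\b', text)

# With grouping, only the ending is alternated.
grouped = re.findall(r'\b\w+(?:ing|s)\b', text)

print(ungrouped)
print(grouped)
```

The ungrouped pattern picks up the sing inside both words plus a lone trailing s, while the grouped one returns only the complete word sings.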
Why (and how) does this differ from the example I gave first? First, I wrote \w+? instead of \w+; the ? turns the + into its non-greedy version. If you know the difference between greedy and non-greedy, skip this paragraph. Consider the string AaAAbA, where we want to match the things enclosed by a capital A. A naive try is A\w+A: one or more word characters enclosed by A. We would like this to match AaA, but it matches all of AaAAbA, because A is itself something \w can match. By default the quantifiers *, + and ? all try to match as much as possible. Sometimes, as in the A example, you don't want that; you can then put a ? after the quantifier to signal that you want the non-greedy version, which matches as little as possible.
But in our case this isn't needed: the words are well separated by whitespace, which is not part of \w. So in fact you can just let + be greedy and everything will be all right. If you use . (any character) you often need to be careful not to match too much.
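The greedy and non-greedy behaviour described here can be checked directly. This is a minimal sketch with illustrative variable names.

```python
import re

# Greedy: \w+ grabs as much as it can and still lets the final A match.
greedy = re.search(r'A\w+A', 'AaAAbA').group()
# Non-greedy: \w+? stops at the first A that completes the match.
lazy = re.search(r'A\w+?A', 'AaAAbA').group()

# For whole words delimited by \b, both versions give the same result.
words_lazy = re.findall(r'\b\w+?(?:ing|s)\b', 'fishing words')
words_greedy = re.findall(r'\b\w+(?:ing|s)\b', 'fishing words')

print(greedy, lazy)
print(words_lazy, words_greedy)
```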
The other difference is using (?:s|ing) instead of (s|ing). What does the ?: do here? It changes a capturing group into a non-capturing group. Generally you don't want to get "everything" from the regex. Consider the following regex: I want to go to \w+. You are not interested in the whole sentence, but only in the \w+, so you can capture it in a group: I want to go to (\w+). This means that you are interested in this specific piece of information and want to retrieve it later. Sometimes (like when using |) you need to group expressions together but are not interested in their content; you can then declare the group as non-capturing. Otherwise re.findall will give you the group (s or ing) but not the actual word!
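A quick sketch of what capturing changes in practice. The sample sentence with Paris is invented for illustration.

```python
import re

text = 'fishing words'

# With a capturing group, findall returns only the group contents.
capturing = re.findall(r'\b\w+(ing|s)\b', text)
# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r'\b\w+(?:ing|s)\b', text)

# Capturing on purpose: pull one piece out of a longer match.
destination = re.search(r'I want to go to (\w+)', 'I want to go to Paris').group(1)

print(capturing)
print(non_capturing)
print(destination)
```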
So to summarize:
* start small
* add one case after another
* always test with examples
In fact I just tried re.findall("\b\w+(?:ing|s)\b", "fishing words") and it didn't work. \w+(?:ing|s) works. I've no idea why; maybe someone else can explain that. Regexes are an arcane thing: only use them for simple, easily tested tasks.
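A likely explanation for the failed call above, assuming the pattern was written as an ordinary (non-raw) string literal: Python turns "\b" into a backspace character before the regex engine ever sees it, so the pattern looks for literal backspaces instead of word boundaries. The sketch below spells the backspace as "\x08" to make that visible; writing the pattern as a raw string avoids the problem.

```python
import re

text = "fishing words"

# In an ordinary string literal, "\b" is the backspace character (0x08).
assert "\b" == "\x08"
# In a raw string it stays backslash + b, which the regex engine reads
# as a word boundary.
assert r"\b" == "\\b"

# What the regex engine receives from the non-raw literal:
# the boundaries have become literal backspace characters.
non_raw_pattern = "\x08" + r"\w+(?:ing|s)" + "\x08"
raw_pattern = r"\b\w+(?:ing|s)\b"

no_matches = re.findall(non_raw_pattern, text)
matches = re.findall(raw_pattern, text)

print(no_matches)
print(matches)
```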

Generally speaking I'd use \b to match "word boundaries" with \w which matches word components (short cut for [A-Za-z0-9_]). Then you can do an or grouping to match "s" or "ing". Result is:
/\b\w+(s|ing)\b/

multiple negative lookahead assertions

I can't figure out how to do multiple lookaround for the life of me. Say I want to match a variable number of numbers following a hash but not if preceded by something or followed by something else. For example I want to match #123 or #12345 in the following. The lookbehinds seem to be fine but the lookaheads do not. I'm out of ideas.
import re

matches = ["#123", "This is #12345",
           # But not
           "bad #123", "No match #12345", "This is #123-ubuntu",
           "This is #123 0x08"]
pat = '(?<!bad )(?<!No match )(#[0-9]+)(?! 0x0)(?!-ubuntu)'
for i in matches:
    print(i, re.search(pat, i))

You should have a look at the captures as well. I bet for the last two strings you will get:
#12
This is what happens:
The engine checks the two negative lookbehinds; neither forbidden prefix is present, so it continues with the capturing group #[0-9]+ and matches #123. Now it checks the lookaheads. They reject this position, as desired. But now there's backtracking! There is one variable-length part in the pattern and that is the +. So the engine gives back the last matched character (3) and tries again. Now the lookaheads no longer object, because the next character is 3, and you get a match. The simplest way to solve this is to add another lookahead that makes sure that you go to the last digit:
pat = r'(?<!bad )(?<!No match )(#[0-9]+)(?![0-9])(?! 0x0)(?!-ubuntu)'
Note the use of a raw string (the leading r) - it doesn't matter in this pattern, but it's generally a good practice, because things get ugly once you start escaping characters.
EDIT: If you are using or willing to use the regex package instead of re, you get possessive quantifiers which suppress backtracking:
pat = r'(?<!bad )(?<!No match )(#[0-9]++)(?! 0x0)(?!-ubuntu)'
It's up to you which you find more readable or maintainable. The latter will be marginally more efficient, though. (Credits go to nhahtdh for pointing me to the regex package.)
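The backtracking effect and the (?![0-9]) fix can be checked side by side with the standard re module. This is a minimal sketch; the helper first_hit and the variable names are mine, and the sample strings are the ones from the question.

```python
import re

samples = ["#123", "This is #12345", "bad #123", "No match #12345",
           "This is #123-ubuntu", "This is #123 0x08"]

original = r'(?<!bad )(?<!No match )(#[0-9]+)(?! 0x0)(?!-ubuntu)'
# Extra lookahead: the number must not be followed by another digit,
# so giving back a digit can never produce a match.
fixed = r'(?<!bad )(?<!No match )(#[0-9]+)(?![0-9])(?! 0x0)(?!-ubuntu)'

def first_hit(pattern, text):
    """Return the captured number, or None when nothing matches."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

original_hits = {s: first_hit(original, s) for s in samples}
fixed_hits = {s: first_hit(fixed, s) for s in samples}

for s in samples:
    print(repr(s), original_hits[s], fixed_hits[s])
```

With the original pattern the last two strings yield #12, which is the backtracking described above; with the extra lookahead they yield no match.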
