Python regex A|B|C matches C even though B should match - python

I've been sitting on this problem for several hours now and I really don't know anymore...
Essentially, I have an A|B|C-type alternation regex and, for whatever reason, C matches over B, even though the individual alternatives should be tested from left to right and matching should stop at the first one that succeeds (i.e. once a match is found, the other alternatives are not tested anymore).
This is my code:
import re
text = 'Patients with end stage heart failure fall into stage D of the ABCD classification of the American College of Cardiology (ACC)/American Heart Association (AHA), and class III–IV of the New York Heart Association (NYHA) functional classification; they are characterised by advanced structural heart disease and pronounced symptoms of heart failure at rest or upon minimal physical exertion, despite maximal medical treatment according to current guidelines.'
expansion = "American Heart Association"
re_exp = re.compile(expansion + "|" + r"(?<=\W)" + expansion + "|"\
+ expansion.split()[0] + r"[-\s].*?\s*?" + expansion.split()[-1])
m = re_exp.search(text)
print(m.group(0))
I want the regex to find the "expansion" string. In my dataset, the text sometimes contains a slightly edited form of the expansion string, for example with articles or prepositions like "for" or "the" between the main nouns. This is why I first try to match the string as is, then try to match it when it follows any non-word character (i.e. parentheses or, like in the example above, a whole lot of stuff because the space was omitted), and finally go full wildcard, searching for the beginning and the ending of the string with wildcards in between.
Either way, with the example above I would expect to get the following output:
American Heart Association
but what I'm getting is
American College of Cardiology (ACC)/American Heart Association
which is the match for the final regex.
If I delete the final regex or just call re.findall(r"(?<=\W)"+ expansion, text), I get the output I want, meaning the regex is in fact matching properly.
What gives?

The regex looks like this:
American Heart Association|(?<=\W)American Heart Association|American[-\s].*?\s*?Association
The first 2 alternatives match the same text, only the second one has a positive lookbehind prepended.
You can omit that second alternative: wherever it could match, the first alternative (which has no assertions) has already matched, and wherever the first alternative fails, the second one fails too.
As the pattern is tried from left to right through the text, the engine first reaches the earlier occurrence of American. At that position, the first and the second alternatives cannot match American College of Cardiology.
The third alternative can match there, and due to the .*? it stretches until the first occurrence of Association.
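To see the leftmost-match behaviour in isolation, here is a minimal sketch using a shortened version of the sentence from the question. The shortened text and the variable names are illustrative assumptions, not the asker's actual code. The engine reports the match that starts earliest in the text, and only the third alternative can start at the first American:

```python
import re

# Shortened version of the sentence from the question (illustrative).
text = "the American College of Cardiology (ACC)/American Heart Association (AHA)"

pattern = re.compile(
    r"American Heart Association"
    r"|(?<=\W)American Heart Association"
    r"|American[-\s].*?\s*?Association"
)

m = pattern.search(text)
exact = re.search(r"American Heart Association", text)

# The combined pattern succeeds at the first "American", which comes
# earlier in the text than the exact phrase does.
print(m.start(), repr(m.group(0)))
print(exact.start(), repr(exact.group(0)))
```

The order of the alternatives only decides between alternatives that can match at the same starting position. It does not make the engine skip ahead in the text to find a match for an earlier alternative.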
What you might do is for example exclude possible characters to match using a negated character class:
\bAmerican\b[^/,.]*\bAssociation\b
Or you might use a tempered greedy token approach to not allow specific words between the first and last part:
\bAmerican\b(?:(?!American\b|Association\b).)*\bHeart Association\b
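A quick way to try both suggestions from Python is sketched below. The sample text is a shortened form of the sentence in the question and the variable names are made up for illustration:

```python
import re

# Shortened form of the sentence from the question (illustrative).
text = ("the American College of Cardiology (ACC)/American Heart "
        "Association (AHA), and class III-IV of the New York Heart Association")

# Negated character class: the gap may not contain "/", "," or ".".
negated_class = re.compile(r"\bAmerican\b[^/,.]*\bAssociation\b")

# Tempered greedy token: the gap may not run into another "American"
# or "Association" before reaching "Heart Association".
tempered = re.compile(
    r"\bAmerican\b(?:(?!American\b|Association\b).)*\bHeart Association\b"
)

negated_hit = negated_class.search(text).group(0)
tempered_hit = tempered.search(text).group(0)
print(negated_hit)
print(tempered_hit)
```

Both patterns fail at the first American, because the gap cannot cross the "/" (first pattern) or the second American (second pattern), so the search moves on to the wanted phrase.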

So re.findall(r"(?<=\W)"+ expansion, text) works because the character before the match, "/", is a non-word character (denoted \W). The last alternative of your regex will match "American [whatever random stuff here] Association". This means you match "American College of Cardiology (ACC)/American Heart Association", which starts earlier in the text, before you can match the inner string "American Heart Association". E.g. if you deleted the first "American" in your string you would get the match you are looking for with your regex.
You need to be more restrictive with your regex to rule out situations like these.
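One possible way to be more restrictive, sketched here as an assumption rather than as something either answer proposed, is to stop combining the alternatives into one pattern: search the whole text for the exact phrase first, and only fall back to the wildcard form when that fails. The function name find_expansion and the second sample sentence are hypothetical:

```python
import re

def find_expansion(expansion, text):
    """Return the exact phrase if present, else a loose first...last match."""
    # Pass 1: the exact phrase anywhere in the text.
    m = re.search(re.escape(expansion), text)
    if m:
        return m.group(0)

    # Pass 2: first word ... last word, where the gap may not contain
    # another occurrence of the first word.
    words = expansion.split()
    first, last = re.escape(words[0]), re.escape(words[-1])
    loose = rf"\b{first}\b(?:(?!\b{first}\b).)*?\b{last}\b"
    m = re.search(loose, text)
    return m.group(0) if m else None

exact_case = find_expansion(
    "American Heart Association",
    "the American College of Cardiology (ACC)/American Heart Association (AHA)",
)
loose_case = find_expansion(
    "American Heart Association",
    "the American College of Cardiology (ACC)/American Heart and Stroke Association",
)
print(exact_case)
print(loose_case)
```

Running the exact search over the entire text before the fallback is what gives the exact phrase real priority. Inside a single alternation, priority only applies between alternatives at the same starting position.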

Related

Python Regex: Find specific phrase in any form in text (including if followed by . or ,)

I'm trying to find when a specific product name is mentioned in customer notes (i.e. un-standardized, messy text). The product name is "Lending QB." Within the text, the product name can appear in any of the follow ways:
str1 ='Lending QB is a great product.'
str2 ='lending qb is great.'
str3 ="I don't think lendingqb is great."
str4 ='I like Lending QB, but not always.'
str5 ='The best product is Lending qb.'
Here is the regex that mostly works:
df['lendingQB'] = df['Text'].str.findall(r'(?i)(?<!\S)lending\s?qb(?!\S)', re.IGNORECASE)
Using regex101.com to test, and confirming within my Python program, I can capture the product name in strings (str) 1-3, but not 4 and 5; which makes me believe the issue is with not finding the product name when it's followed by a punctuation mark.
My understanding is the \S would include commas and periods.
I tried adding |[,.] to the regex but then nothing matches:
'(?i)(?<!\S)lending\s?qb(?!\S|[,.])'
(I realize the IGNORECASE is redundant, but to test with regex101.com, I added the "(?i)")
Any suggestions?
The pattern (?!\S) is a negative lookahead that asserts that what follows is not a non-whitespace character, so it fails when a comma or a period comes right after qb.
What you could do is replace the (?!\S) with a word boundary \b, so that the match cannot be part of a larger word:
(?i)(?<!\S)lending\s?qb\b
Another way could be to use a positive lookahead to check for a whitespace character, a comma or a period, or the end of the string, using (?=[\s,.]|$)
For example:
import re

str5 ="The best product is Lending qb."
print(re.findall(r'(?<!\S)lending\s?qb(?=[\s,.]|$)', str5, re.IGNORECASE)) # ['Lending qb']
You have correctly identified one issue in the regex (punctuation immediately after QB), but there is a second edge case to consider given that the input is messy: what if there are multiple spaces in Lending QB?
I believe the most robust solution to your problem is:
(?i)(?<!\S)lending\s*qb\b
\b enforces that QB occurs at the end of a word, automatically handling punctuation.
\s? was replaced with \s* to allow any amount of whitespace to be a match, rather than just zero or one whitespace character.
PS. Another point to consider is that \b terminates on all punctuation, whereas (?=\s|[,.]) will only terminate on the given punctuation: , or . in this case. Given the wide range of possible punctuation (colon, semicolon, dash, hyphen, emdash...) I would strongly recommend \b over (?=\s|[,.]), unless of course you want precise control over the allowable terminating punctuation.
PPS. further test cases to illustrate my points
str6 ='Lending Qb: simply the best'
str7 ="I'm a fan of lending QB"
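The difference between the original pattern and the \s* plus \b version can be checked against all the test strings at once. This is a small sketch: the first seven strings come from the thread, while the last one (with two spaces) and the variable names are added here for illustration:

```python
import re

samples = [
    "Lending QB is a great product.",
    "lending qb is great.",
    "I don't think lendingqb is great.",
    "I like Lending QB, but not always.",
    "The best product is Lending qb.",
    "Lending Qb: simply the best",
    "I'm a fan of lending QB",
    "Notes mention LENDING  QB; follow up",  # hypothetical: two spaces
]

original = re.compile(r"(?<!\S)lending\s?qb(?!\S)", re.IGNORECASE)
robust = re.compile(r"(?<!\S)lending\s*qb\b", re.IGNORECASE)

# Strings the original pattern fails on (punctuation or extra whitespace).
missed_by_original = [s for s in samples if not original.search(s)]
# Strings the \s* + \b version matches.
found_by_robust = [s for s in samples if robust.search(s)]

print(len(missed_by_original), "missed by the original pattern")
print(len(found_by_robust), "of", len(samples), "found by the robust pattern")
```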
This (?!\S) is a forward whitespace boundary.
It is really this (?![^\s]) a negative of a negative
with the added benefit of it matching at the EOS (end of string).
What that means is you can use the negative class form to add characters
that qualify as a boundary.
So, just put the period and comma in with the whitespace.
(?i)(?<![^\s,.])lending\s?qb(?![^\s,.])
https://regex101.com/r/BrOj2J/1
As a tutorial point, this concept folds multiple assertions into a single
character-class test, which is basic engine Boolean class logic and is
typically cheaper for the engine than an alternation of separate assertions.
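A short sketch of how this custom boundary behaves in Python follows. The test strings are taken from the thread and the variable names are illustrative. Note that only whitespace, comma and period count as a boundary here, so a colon after the name prevents the match:

```python
import re

boundary_rx = re.compile(r"(?<![^\s,.])lending\s?qb(?![^\s,.])", re.IGNORECASE)

comma_hit = boundary_rx.search("I like Lending QB, but not always.")
period_hit = boundary_rx.search("The best product is Lending qb.")
end_hit = boundary_rx.search("I'm a fan of lending QB")       # end of string
colon_hit = boundary_rx.search("Lending Qb: simply the best")  # ":" is not in the class

print(comma_hit.group(0), period_hit.group(0), end_hit.group(0), colon_hit)
```

To accept more punctuation, add the characters to both negated classes, for example [^\s,.:;].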
Thank you "The fourth bird", "sln", and "Mark_Anderson". Your answers provided solutions and also were very educational. I went with Mark's answer since it seemed to be the most robust, which is where I'm trying to get to. Ideally, I do want to capture all cases when the product name is mentioned, no matter how messy it's typed.
I changed my code to this:
df['lendingQB'] = df['Text'].str.findall(r'(?i)(?<!\S)lending\s*qb\b', re.IGNORECASE)

regex catastrophic backtracking ; extracting words starts with capital before the specific word

I'm relatively new to Python world and having trouble with regex.
I'm trying to extract a firm's name that appears before the word 'sale(s)' (or 'Sale(s)').
I found that the firm names in my text data all start with a capital letter (the remaining characters can be lowercase or uppercase letters, digits, '-' or ', for example 'Abc Def' or 'ABC DEF' or just 'ABC' or 'Abc'),
and some of them take forms like 'Abc and Def' or 'Abc & Def'.
For example,
from the text,
;;;;;PRINCIPAL CUSTOMERS In fiscal 2005, the Company derived
approximately 21% ($4,782,852) of its consolidated revenues from
continuing operations from direct transactions with Kmart Corporation.
Sales of Computer products was good. However, Computer's Parts and Display
Segment sale has been decreasing.
I only want to extract 'Computer's Parts and Display Segment'.
So I tried to create a regex
((?:(?:[A-Z]+[a-zA-Z\-0-9\']*\.?\s?(?:and |\& )?)+)+?(?:[S|s]ales?\s))
[A-Z]+[a-zA-Z\-0-9\']*\.?\s? => this part finds words that start with a capital letter, with the remaining characters being a-z, A-Z, 0-9, -, ' or .
(?:and |\& )? => this part matches a word followed by 'and' or '&'
However, https://regex101.com/ reports catastrophic backtracking. I have read some related articles, but I still cannot find a way to solve this problem.
Could you help me?
Thanks!
Overview
Pointing out a few things in your pattern:
[a-zA-Z\-0-9\'] You don't need to escape ' here. Also, you can just place - at the start or end of the set and you won't need to escape it.
\& The ampersand character doesn't need to be escaped.
[S|s] Says to match either S, |, or s, thus you could potentially match |ales. The correct way to write this is [Ss].
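The [S|s] point is easy to confirm from Python. This is a tiny illustrative check, with made-up variable names:

```python
import re

buggy = re.compile(r"[S|s]ales?")  # "|" is a literal inside a character class
fixed = re.compile(r"[Ss]ales?")

buggy_accepts_pipe = buggy.fullmatch("|ales") is not None
fixed_accepts_pipe = fixed.fullmatch("|ales") is not None

print(buggy_accepts_pipe, fixed_accepts_pipe)
```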
Code
(?:(?:[A-Z][\w'-]*|and) +)+(?=[sS]ales?)
Results
Input
;;;;;PRINCIPAL CUSTOMERS In fiscal 2005, the Company derived approximately 21% ($4,782,852) of its consolidated revenues from continuing operations from direct transactions with Kmart Corporation. Sales of Computer products was good. However, Computer's Parts and Display Segment sale has been decreasing.
Output
Computer's Parts and Display Segment
Explanation
(?:(?:[A-Z][\w'-]*|and) +)+ Match this one or more times
(?:[A-Z][\w'-]*|and) Match either of the following
[A-Z][\w'-]* Match any uppercase ASCII character, followed by any number of word characters, apostrophes ' or hyphens -
and Match this literally
" +" Match one or more spaces (a literal space followed by +)
(?=[sS]ales?) Positive lookahead ensuring any of the words sale, Sale, sales, or Sales follows
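Used from Python, the pattern might be applied as in the sketch below. The text is the tail of the sample input written on a single line, and the variable names are illustrative. The match includes the trailing space consumed by " +", so it is stripped afterwards:

```python
import re

# Tail of the sample input, on one line (the pattern expects plain spaces).
text = ("Sales of Computer products was good. However, Computer's Parts "
        "and Display Segment sale has been decreasing.")

firm_rx = re.compile(r"(?:(?:[A-Z][\w'-]*|and) +)+(?=[sS]ales?)")

# Each match ends with the space(s) before "sale", hence rstrip().
names = [m.group(0).rstrip() for m in firm_rx.finditer(text)]
print(names)
```

Because every word in the repeated group must be followed by at least one space, there is only one way to split a run of words, which avoids the nested-quantifier ambiguity of the original pattern.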

python regex negative lookahead method

I'm now extracting firm names from text data (10-K statement data).
I first tried the NLTK StanfordTagger and extracted all the words tagged as organization. However, it quite often failed to recall all the firm names, and since I'm applying the tagger to every single related sentence, it took a very long time.
So, I'm trying to extract all the words that start with a capital letter (or that consist entirely of capital letters).
I found the regex below helpful.
(?:[A-Z]+[a-zA-Z\-0-9]*\.?\s?)+
However, it cannot distinguish the name of a segment from the name of a firm.
For example,
sentence :
The Company's customers include, among others, Conner Peripherals Inc.("Conner"),
Maxtor Corporation ("Maxtor"). The largest proportion of Applieds consolidated net sales and profitability has been and continues to be derived from sales of manufacturing equipment in the Silicon Systems segment to the global semiconductor industry.
I want to extract Conner Peripherals Inc, Conner, Maxtor Corporation, Maxtor, Applieds, but not 'Silicon Systems', since it is the name of a segment.
So, I tried using
(?:[A-Z]+[a-zA-Z\-0-9]*\.?\s?)+(?!segment|Segment)
However, it still extracts 'Silicon Systems'.
Could you help me solve this problem?
(Or do you have any idea of how to extract only the firm's name from the text data?)
Thanks a lot!!!
You need to capture the whole run of consecutive words: mark the group that matches an individual capitalized word as non-capturing, (?:...), and wrap the repeated run in a capturing group so that consecutive words are captured together.
>>> re.findall("((?:[A-Z]+[a-zA-Z\-0-9']*\.?\s?)+)+?(?![Ss]egment)",sentence)
["The Company's ", 'Conner Peripherals Inc.', 'Conner', 'Maxtor Corporation ', 'Maxtor', 'The ', 'Applieds ', '']
The NLTK approach, or any machine learning, seems to be a better approach here. I can only explain what the difficulty and current issue with the regex approach are.
The problem is that the matches expected can contain space separated phrases, and you want to avoid matching a certain phrase ending with segment. Even if you correct the negative lookahead as (?!\s*[Ss]egment), and make the pattern linear with something like \b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?(?!\s+[sS]egment), you will still match Silicon, a part of the unwanted match.
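The two failure modes described above can be reproduced with a short sketch. The sentence is the tail of the sample from the question and the variable names are illustrative. In the first pattern the optional \s? simply gives the space back, so the lookahead sees " segment" rather than "segment"; in the second, backtracking drops the last word instead:

```python
import re

s = ("sales of manufacturing equipment in the Silicon Systems segment "
     "to the global semiconductor industry.")

# Pattern from the question: the lookahead is checked right after "\s?".
naive = r"(?:[A-Z]+[a-zA-Z\-0-9]*\.?\s?)+(?!segment|Segment)"
# Linear pattern with the corrected lookahead.
linear = (r"\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?"
          r"(?!\s+[sS]egment)")

naive_hits = re.findall(naive, s)
linear_hits = re.findall(linear, s)
print(naive_hits)
print(linear_hits)
```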
What you might try to do is to match all these entities and discard after matching, and only keep those entities in other contexts by capturing them into Group 1.
See the sample regex for this:
\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?\s+[sS]egment\b|(\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?)
Since it is unwieldy, you should think of building it from blocks, dynamically:
import re
entity_rx = r"\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?"
rx = r"{0}\s+[sS]egment\b|({0})".format(entity_rx)
s = "The Company's customers include, among others, Conner Peripherals Inc.(\"Conner\"), Maxtor Corporation (\"Maxtor\"). The largest proportion of Applieds consolidated net sales and profitability has been and continues to be derived from sales of manufacturing equipment in the Silicon Systems segment to the global semiconductor industry."
matches = list(filter(None, re.findall(rx, s)))  # list(...) is needed in Python 3
print(matches)
# => ['The Company', 'Conner Peripherals Inc.', 'Conner', 'Maxtor Corporation', 'Maxtor', 'The', 'Applieds']
So,
\b - matches a word boundary
[A-Z][a-zA-Z0-9-]* - an uppercase letter followed with letters/digits/-
(?:\s+[A-Z][a-zA-Z0-9-]*)* - zero or more sequences of
\s+ - 1+ whitespaces
[A-Z][a-zA-Z0-9-]* - an uppercase letter followed with letters/digits/-
\b - trailing word boundary
\.? - an optional .
Then, this block is used to build
{0}\s+[sS]egment\b - the block we defined before followed with
\s+ - 1+ whitespaces
[sS]egment\b - either segment or Segment whole words
| - or
({0}) - Group 1 (what re.findall actually returns): the block we defined before.
filter(None, ...) drops the empty strings from the final list. In Python 3.x filter returns an iterator, so wrap it as list(filter(None, re.findall(rx, s))); in Python 2.x it already returns a list.

What is the capitol of X? Regex

What I am trying to do is very simple, I think but I can't seem to get it to work.
My regex is:
"(?wW)hat is the Capital of (\w*?\s?\w*?)\?"
Which I am hoping will allow in things like "Russia" and "Costa Rica" to be in the capture group. Basically, I want to read in a question such as "what is the capitol of Argentina" and then be able to grab the word "Argentina" even if the sentence has a bunch of other stuff in it.
But I tried it and I entered "what is the Capital of russia?" and it said that string didn't match.
I think you are looking for this:
[wW]hat is the Capital of ([\w\s]*)\?
Your fundamental mistake is the mixing up of character classes and capture groups.
To look for one of several characters (like w or W) you want to use a character class like [wW]. This means when we are looking for word characters (\w = [a-zA-Z0-9_]) or whitespace characters (\s = [\r\n\t\f ]), we can simply say [\w\s].
The final issue would be your use of ? and * (repetition). First of all, they have no special meaning inside character classes, so I removed them. * repeats the previous token 0+ times (+ requires 1+), and ? makes the previous token optional. Placed directly after another quantifier, as in \w*?, the ? instead makes that quantifier lazy (match as few characters as possible), which is unnecessary here.
Note that I used a capturing group (...) around the country name, meaning we can read the country from capture group 1.
Finally, we can use the i modifier to make our matches case-insensitive. The final expression may be simpler to understand:
/what is the capital of ([a-z ]+)\?/i
This should match:
[wW]hat is the Capital of ([^?]+)\?
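Assuming Python (the question does not say which language is used), the case-insensitive variant could be sketched like this. The spelling "capital" is used throughout, and the helper name country_from_question is made up for illustration:

```python
import re

# re.IGNORECASE plays the role of the /i modifier.
question_rx = re.compile(r"what is the capital of ([^?]+)\?", re.IGNORECASE)

def country_from_question(text):
    """Return the captured country name, or None if the question is absent."""
    m = question_rx.search(text)
    return m.group(1).strip() if m else None

print(country_from_question("what is the Capital of russia?"))
print(country_from_question("Quick one: What is the capital of Costa Rica? Thanks"))
print(country_from_question("what is the capital of Argentina"))  # no "?"
```

search is used instead of match so that the question can sit anywhere inside a longer sentence. The last call returns None because the pattern requires the closing question mark.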

Regex which matches the longer string in an OR

Motivation
I'm parsing addresses and need to get the address and the country in separated matches, but the countries might have aliases, e.g.:
UK == United Kingdom,
US == USA == United States,
Korea == South Korea,
and so on...
Explanation
So, what I do is create a big regex with all possible country names (at least the ones more likely to appear) separated by the OR operator, like this:
germany|us|france|chile
But the problem is with multi-word country names and their shorter versions, like:
Republic of Moldova and Moldova
Using this as example, we have the string:
'Somewhere in Moldova, bla bla, 12313, Republic of Moldova'
What I want to get from this:
'Somewhere in Moldova, bla bla, more bla, 12313'
'Republic of Moldova'
But this is what I get:
'Somewhere in Moldova, bla bla, 12313, Republic of'
'Moldova'
Regex
As there are several cases, here is what I'm using so far:
^(.*),? \(?(republic of moldova|moldova)\)?(.*[\d\-]+.*|,.*[:/].*)?$
As we might have fax, phone, zip codes or something else after the country name - which I don't care about - I use the last matching group to remove them:
(.*[\d\-]+.*|,.*[:/].*)?
Also, sometimes the country name comes enclosed in parenthesis, so I have \(? and \)? around the second match group, and all the countries go inside it:
(republic of moldova|moldova|...)
Question
The thing is, when there is an entry which is a subset of a bigger one, the shorter is chosen over the longer, and the remainder stays in the base_address string.
Is there a way to tell the regex to choose the longest possible match when two values match?
Edit
I'm using Python with built in re module
As suggested by m.buettner, changing the first matching group from (.*) to (.*?) indeed fixes the current issue, but it also creates another. Consider another example:
'Department of Chemistry, National University of Singapore, 4512436 Singapore'
Matches:
'Department of Chemistry, National University of'
'Singapore'
Here it matches too soon now.
Your problem is greediness.
The .* right at the beginning tries to match as much as possible. That is everything until the end of the string. But then the rest of your pattern fails. So the engine backtracks, and discards the last character matched with .* and tries the rest of the pattern again (which still fails). The engine will repeat this process (fail match, backtrack/discard one character, try again) until it can finally match with the rest of the pattern. The first time this occurs is when .* matches everything up to Moldova (so .* is still consuming Republic of). And then the alternation (which still cannot match republic of moldova) will gladly match moldova and return that as the result.
The simplest solution is to make the repetition ungreedy:
^(.*?)...
Note that the question mark right after a quantifier does not mean "optional", but makes it "ungreedy". This simply reverses the behaviour: the engine first tries to leave out the .* completely, and in the process of backtracking it includes one more character after every failed attempt to match the rest of the pattern.
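The effect of the greedy and ungreedy quantifiers can be seen with a simplified version of the pattern. This sketch is an assumption made for illustration: it keeps only the leading group and the country alternation, and drops the parentheses and the trailing fax/phone group from the question:

```python
import re

address = "Somewhere in Moldova, bla bla, 12313, Republic of Moldova"
countries = r"(republic of moldova|moldova)"

# Greedy: (.*) grabs everything, then backtracks only as far as needed.
greedy = re.match(r"^(.*),? " + countries + r"$", address, re.IGNORECASE)
# Ungreedy: (.*?) starts empty and grows only as far as needed.
lazy = re.match(r"^(.*?),? " + countries + r"$", address, re.IGNORECASE)

print(greedy.groups())
print(lazy.groups())
```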
EDIT:
There are usually better alternatives to ungreediness. As you stated in a comment, the ungreedy solution brings another problem that countries in earlier parts of the string might be matched. What you can do instead, is to use lookarounds that ensure that there are no word characters (letters, digits, underscore) before or after the country. That means, a country word is only matched, if it is surrounded by commas or either end of the string:
^(.*),?(?<!\w)[ ][(]?(c|o|u|n|t|r|i|e|s)[)]?(?![ ]*\w)(.*[\d\-]+.*|,.*[:/].*)?$
Since lookarounds are not actually part of the match, they do not interfere with the rest of your pattern - they simply check a condition at a specific position in the match. The two lookarounds I have added ensure that:
There is no word character before the mandatory space preceding the country.
There is no word character after the country that is separated by nothing but spaces.
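As a rough check of the lookaround version in Python, the sketch below substitutes two country names for the (c|o|u|n|t|r|i|e|s) placeholder. That substitution and the variable names are assumptions made for illustration. The greedy (.*) keeps the comma before the country, so it is stripped afterwards:

```python
import re

# Two names stand in for the full country alternation (illustrative only).
countries = "republic of moldova|moldova"

address_rx = re.compile(
    r"^(.*),?(?<!\w)[ ][(]?(" + countries + r")[)]?(?![ ]*\w)"
    r"(.*[\d\-]+.*|,.*[:/].*)?$",
    re.IGNORECASE,
)

m = address_rx.match("Somewhere in Moldova, bla bla, 12313, Republic of Moldova")
base_address = m.group(1).rstrip(",")  # greedy (.*) includes the trailing comma
country = m.group(2)
print(base_address)
print(country)
```

The shorter alternative can no longer win here: matching only Moldova at the end would leave the space before it preceded by the word character f of "of", which the lookbehind rejects.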
Note that I've wrapped spaces in a character class, as well as the literal parentheses (instead of escaping them). Neither is necessary, but I prefer these readability-wise, so they are just a suggestion.
EDIT 2:
As abarnert mentioned in a comment, how about not using a regex-only solution?
You could split the string on ,, then trim every result, and check these against your list of countries (possibly using regex). If any component of your address is the same as one of your countries, you can return that. If there are multiple ones, then at least you can detect the ambiguity and deal with it properly.
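A minimal sketch of that non-regex idea follows. The country set, the function name split_country and the choice to return None when zero or several components match are all assumptions made for illustration:

```python
# Hypothetical, tiny country list; a real one would be much longer.
COUNTRIES = {"moldova", "republic of moldova", "singapore"}

def split_country(address):
    """Split on commas and return (base_address, country).

    country is None when no component, or more than one component,
    is exactly a known country name.
    """
    parts = [part.strip() for part in address.split(",")]
    hits = [i for i, part in enumerate(parts)
            if part.strip("()").lower() in COUNTRIES]
    if len(hits) != 1:
        return address, None  # nothing found, or ambiguous
    i = hits[0]
    # Anything after the country (fax, phone, ...) is dropped.
    return ", ".join(parts[:i]), parts[i]

print(split_country("Somewhere in Moldova, bla bla, 12313, Republic of Moldova"))
print(split_country("Department of Chemistry, National University of Singapore, 4512436 Singapore"))
```

The second address returns None for the country because "4512436 Singapore" is not exactly a country name. Components like that would need an extra step, such as stripping a leading postal code, before the comparison.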
Sort all the alternatives: build the regex programmatically from the array of names sorted from longest to shortest. Then put the whole alternation in an atomic group (the PCRE engine has it; Python's re module supports (?>...) from version 3.11 on). Because of the atomic group, the regex engine never backtracks to try another alternative inside it, and since the alternatives are sorted, the match will always be the longest one.
Tada.
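Building the sorted alternation could look like the sketch below. The name list and the helper build_alternation are illustrative assumptions, and the atomic group is left out so that the code also runs on Python versions before 3.11:

```python
import re

# Hypothetical alias list, deliberately not in length order.
names = ["us", "usa", "united states", "moldova", "republic of moldova"]

def build_alternation(names):
    """Join the escaped names with |, longest name first."""
    ordered = sorted(names, key=len, reverse=True)
    return "|".join(re.escape(name) for name in ordered)

unsorted_rx = re.compile("|".join(map(re.escape, names)), re.IGNORECASE)
sorted_rx = re.compile(build_alternation(names), re.IGNORECASE)

unsorted_hit = unsorted_rx.search("Shipped to USA").group(0)
sorted_hit = sorted_rx.search("Shipped to USA").group(0)
print(unsorted_hit, sorted_hit)
```

Sorting only helps between alternatives that can start at the same position, such as us and usa. It does not stop an earlier, shorter name elsewhere in the string from matching first.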
