What is the capitol of X? Regex - python

What I am trying to do is very simple, I think, but I can't seem to get it to work.
My regex is:
"(?wW)hat is the Capital of (\w*?\s?\w*?)\?"
Which I am hoping will allow in things like "Russia" and "Costa Rica" to be in the capture group. Basically, I want to read in a question such as "what is the capitol of Argentina" and then be able to grab the word "Argentina" even if the sentence has a bunch of other stuff in it.
But I tried it and I entered "what is the Capital of russia?" and it said that string didn't match.

I think you are looking for this:
[wW]hat is the capitol of ([\w\s]*)\?
Your fundamental mistake is the mixing up of character classes and capture groups.
To look for one of several characters (like w or W), use a character class such as [wW]. Likewise, when we are looking for word characters (\w = [a-zA-Z0-9_]) or whitespace characters (\s = [\r\n\t\f ]), we can simply say [\w\s].
The final issue is your use of ? and * (repetition). Inside a character class they have no special meaning, so I removed them. * repeats the previous token zero or more times (+ requires at least one), and ? makes the previous token optional. When ? directly follows another quantifier, as in \w*?, it instead makes that quantifier lazy (match as few characters as possible), which is unnecessary here.
Note, I used a capturing group (...) around the capitol name meaning we can reference the capitol from capture group 1.
Finally, we can use the i modifier to make our matches case-insensitive. The final expression may be simpler to understand:
/what is the capitol of ([a-z ]+)\?/i
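In Python, the /.../i notation above is written with the re.IGNORECASE flag. A minimal sketch, assuming the intended spelling is "capital" (as in the asker's test sentence) and using a hypothetical helper name, extract_place, that is not part of the original answer:

```python
import re

# Assumption: "capital" is the intended spelling; change it if your input uses "capitol".
CAPITAL_QUESTION = re.compile(r"what is the capital of ([a-z ]+)\?", re.IGNORECASE)


def extract_place(question):
    """Return the place named in the question, or None if nothing matches."""
    match = CAPITAL_QUESTION.search(question)
    return match.group(1) if match else None


print(extract_place("what is the Capital of russia?"))
print(extract_place("Quick one: What is the capital of Costa Rica? Thanks!"))
```

Because search() is used rather than match(), the question may be surrounded by other text.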

This should match:
[wW]hat is the capitol of ([^?]+)\?


Python regex A|B|C matches C even though B should match

I've been sitting on this problem for several hours now and I really don't know anymore...
Essentially, I have an A|B|C-type alternation and, for whatever reason, C matches over B, even though the individual alternatives should be tested from left to right and stopped in a non-greedy fashion (i.e. once a match is found, the remaining alternatives are not tested anymore).
This is my code:
import re
text = 'Patients with end stage heart failure fall into stage D of the ABCD classification of the American College of Cardiology (ACC)/American Heart Association (AHA), and class III–IV of the New York Heart Association (NYHA) functional classification; they are characterised by advanced structural heart disease and pronounced symptoms of heart failure at rest or upon minimal physical exertion, despite maximal medical treatment according to current guidelines.'
expansion = "American Heart Association"
re_exp = re.compile(expansion + "|" + r"(?<=\W)" + expansion + "|"\
+ expansion.split()[0] + r"[-\s].*?\s*?" + expansion.split()[-1])
m = re_exp.search(text)
print(m.group(0))
I want the regex to find the "expansion" string. In my dataset, the text sometimes has the expansion string slightly edited, for example with articles or prepositions like "for" or "the" between the main nouns. This is why I first try to match the string as is, then try to match it after any non-word character (i.e. parentheses or, as in the example above, a whole lot of stuff because the space was omitted), and finally go full wildcard, searching for the beginning and end of the string with wildcards in between.
Either way, with the example above I would expect to get the following output:
American Heart Association
but what I'm getting is
American College of Cardiology (ACC)/American Heart Association
which is the match for the final regex.
If I delete the final regex or just call re.findall(r"(?<=\W)"+ expansion, text), I get the output I want, meaning the regex is in fact matching properly.
What gives?
The regex looks like this:
American Heart Association|(?<=\W)American Heart Association|American[-\s].*?\s*?Association
The first 2 alternatives match the same text, only the second one has a positive lookbehind prepended.
You can omit the second alternative: the first alternative has no assertion, so it already matches everything the second could, and wherever the first fails, the second fails too.
The engine scans from left to right and tries every alternative at each position before moving on. At the first occurrence of American, the first two alternatives cannot match, because what follows is College of Cardiology.
The third alternative can match at that position, though: its .*? stretches until the first occurrence of Association, so the match starts at the earlier American.
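The leftmost-match behaviour described here can be checked with a shortened version of the asker's text (the shortening is mine; the pattern omits the redundant second alternative):

```python
import re

text = "the American College of Cardiology (ACC)/American Heart Association (AHA)"
pattern = re.compile(r"American Heart Association|American[-\s].*?\s*?Association")

match = pattern.search(text)

# The match starts at the first "American", well before the exact phrase begins.
print(match.start(), text.index("American Heart Association"))
print(match.group(0))
```

The order of the alternatives only decides which one wins at a given starting position; an alternative that succeeds at an earlier position always beats one that would succeed later.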
What you might do is for example exclude possible characters to match using a negated character class:
\bAmerican\b[^/,.]*\bAssociation\b
Or you might use a tempered greedy token approach to not allow specific words between the first and last part:
\bAmerican\b(?:(?!American\b|Association\b).)*\bHeart Association\b
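Both suggested patterns can be tried against an excerpt of the asker's sentence. This is only a sketch: the excerpt is shortened, and the en dash in "III–IV" is written as a plain hyphen.

```python
import re

text = (
    "stage D of the ABCD classification of the American College of "
    "Cardiology (ACC)/American Heart Association (AHA), and class III-IV "
    "of the New York Heart Association (NYHA) functional classification"
)

# Variant 1: nothing between the two words may be '/', ',' or '.'.
negated_class = re.compile(r"\bAmerican\b[^/,.]*\bAssociation\b")

# Variant 2: tempered greedy token; the gap may not contain another
# "American" or "Association".
tempered = re.compile(
    r"\bAmerican\b(?:(?!American\b|Association\b).)*\bHeart Association\b"
)

print(negated_class.search(text).group(0))
print(tempered.search(text).group(0))
```

Which variant suits a dataset depends on what may legitimately appear between the words: the first excludes separator characters, the second excludes whole words.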
So re.findall(r"(?<=\W)" + expansion, text) works because the character before the match, "/", is a non-word character (denoted \W). The third alternative of your regex will match "American [whatever random stuff here] Association". This means you match "American College of Cardiology (ACC)/American Heart Association" before you match the inner string "American Heart Association", because that match starts earlier in the text. For example, if you deleted the first "American" in your string, you would get the match you are looking for with your regex.
You need to be more restrictive with your regex to rule out situations like these.

How to avoid mutations when using the sub() method in regex?

import re
namesRegex = re.compile(r"Agent \w+")
namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')
When I do this, it doesn't only change Agent, but Alice and Bob too. I mean, it changes one more word.
I tried to understand this. For example, when I want to change only Alice, it also changes "gave".
How can I change only one word with a regex?
One more question: we write it like this: re.compile(r".* etc")
but even if we don't write the "r", as in re.compile(".* etc"), it does the same thing. Why do we write the letter r there?
You can make 'Agent ' part of a positive lookbehind pattern instead so that re.sub only matches the agent's name and therefore substitutes only the agent's name with 'CENSORED':
namesRegex = re.compile(r"(?<=Agent )\w+")
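A short sketch of the lookbehind version in use. The second pattern is an alternative that is not part of the answer above: it captures the prefix and writes it back with \1.

```python
import re

sentence = "Agent Alice gave the secret documents to Agent Bob."

# Lookbehind: "Agent " must precede the name but is not part of the match.
lookbehind = re.compile(r"(?<=Agent )\w+")
print(lookbehind.sub("CENSORED", sentence))

# Alternative (my addition): match the prefix too, then restore it via \1.
with_group = re.compile(r"(Agent )\w+")
print(with_group.sub(r"\1CENSORED", sentence))
```

Both calls leave "Agent" in place and replace only the word that follows it.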
Check out regex101
You can test regex expressions against different inputs and see what matches. It even explains what rules were used in the match.
For instance, for Agent \w+ the explanation is:
Agent matches the characters Agent literally (case sensitive)
\w+ matches any word character (equal to [a-zA-Z0-9_])
+ Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed

Python Regex: Find specific phrase in any form in text (including if followed by . or ,)

I'm trying to find when a specific product name is mentioned in customer notes (i.e. un-standardized, messy text). The product name is "Lending QB." Within the text, the product name can appear in any of the following ways:
str1 ='Lending QB is a great product.'
str2 ='lending qb is great.'
str3 ="I don't think lendingqb is great."
str4 ='I like Lending QB, but not always.'
str5 ='The best product is Lending qb.'
Here is the regex that mostly works:
df['lendingQB'] = df['Text'].str.findall('(?i)(?<!\S)lending\s?qb(?!\S)', re.IGNORECASE)
Using regex101.com to test, and confirming within my Python program, I can capture the product name in strings (str) 1-3, but not 4 and 5; which makes me believe the issue is with not finding the product name when it's followed by a punctuation mark.
My understanding is the \S would include commas and periods.
I tried adding |[,.] to the regex but then nothing matches:
'(?i)(?<!\S)lending\s?qb(?!\S|[,.])'
(I realize the IGNORECASE is redundant, but to test with regex101.com, I added the "(?i)")
Any suggestions?
The pattern (?!\S) uses a negative lookahead to assert that what follows is not a non-whitespace character.
What you could do is replace the (?!\S) with a word boundary \b so that the match cannot be part of a longer word:
(?i)(?<!\S)lending\s?qb\b
Another way could be to use a positive lookahead to check for a whitespace character, a comma or a period, or the end of the string, using (?=[\s,.]|$)
For example:
str5 ="The best product is Lending qb."
print(re.findall(r'(?<!\S)lending\s?qb(?=[\s,.]|$)', str5, re.IGNORECASE)) # ['Lending qb']
You have correctly identified one issue in the regex (punctuation immediately after QB), but there is a second edge case to consider given that the input is messy: what if there are multiple spaces in Lending QB?
I believe the most robust solution to your problem is:
(?i)(?<!\S)lending\s*qb\b
\b enforces that QB occurs at the end of a word, automatically accounting for punctuation.
\s? was replaced with \s* to allow any amount of whitespace to match, rather than just zero or one whitespace character.
PS. Another point to consider is that \b terminates on all punctuation, (?=\s|[,.]) will only terminate on the given punctuation: , or . in this case. Given the wide range of possible punctuation (colon, semicolon, dash, hyphen, emdash...) I would strongly recommend \b over (?=\s|[,.]). Unless you want precise control over allowable terminating punctuation of course...
PPS. further test cases to illustrate my points
str6 ='Lending Qb: simply the best'
str7 ="I'm a fan of lending QB"
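The seven test strings can be run against the asker's original pattern and the \b version side by side. A sketch using plain re rather than pandas; the variable names are mine:

```python
import re

samples = [
    "Lending QB is a great product.",
    "lending qb is great.",
    "I don't think lendingqb is great.",
    "I like Lending QB, but not always.",
    "The best product is Lending qb.",
    "Lending Qb: simply the best",
    "I'm a fan of lending QB",
]

# The asker's pattern: must be followed by whitespace or end of string.
original = re.compile(r"(?<!\S)lending\s?qb(?!\S)", re.IGNORECASE)
# The suggested pattern: any word boundary ends the match.
with_boundary = re.compile(r"(?<!\S)lending\s*qb\b", re.IGNORECASE)

for s in samples:
    print(bool(original.search(s)), bool(with_boundary.search(s)), s)
```

The original pattern misses the strings where the name is followed by a comma, period or colon, while the \b version finds the name in all seven.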
This (?!\S) is a forward whitespace boundary. It is really (?![^\s]), a negative of a negative, with the added benefit that it also matches at the end of the string (EOS). That means you can use the negated-class form to add characters that qualify as a boundary.
So, just put the period and comma in with the whitespace.
(?i)(?<![^\s,.])lending\s?qb(?![^\s,.])
https://regex101.com/r/BrOj2J/1
As a tutorial point, this concept encapsulates multiple assertions
and is basic engine Boolean class logic which speeds up the engine
by a ten fold factor by comparison.
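A small sketch of how the negated-class boundary behaves, using three of the test strings from this thread (the variable name is mine). Only the characters listed in the class count as a boundary, so a colon does not:

```python
import re

# Boundary = whitespace, comma, period, or the start/end of the string.
custom_boundary = re.compile(r"(?<![^\s,.])lending\s?qb(?![^\s,.])", re.IGNORECASE)

print(custom_boundary.findall("I like Lending QB, but not always."))
print(custom_boundary.findall("The best product is Lending qb."))
print(custom_boundary.findall("Lending Qb: simply the best"))
```

The last call finds nothing because ":" is not in the class, which gives precise control over what may surround the name.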
Thank you "The fourth bird", "sln", and "Mark_Anderson". Your answers provided solutions and also were very educational. I went with Mark's answer since it seemed to be the most robust, which is where I'm trying to get to. Ideally, I do want to capture all cases when the product name is mentioned, no matter how messy it's typed.
I changed my code to this:
df['lendingQB'] = df['Text'].str.findall(r'(?i)(?<!\S)lending\s*qb\b', re.IGNORECASE)

Regular expression pattern questions?

I am having a hard time understanding regular expression patterns. Could someone help me write a regular expression that matches all words ending in s, and one that matches words that start with a and end with a (like ana)?
How do I write the "ending" part?
Word boundaries are given by \b, so the following regex matches words ending with ing or s: "\b(\w+?(?:ing|s))\b", where \b is a word boundary, \w+ is one or more "word characters", and (?:ing|s) is a non-capturing group matching either ing or s.
As you asked "how to develop a regex":
First: Don't use regex for complex tasks. They are hard to read, write and maintain. For example, there is a regex that validates email addresses, but it's computer generated and nothing you should use in practice.
Start simple and add edge cases. At the beginning plan what characters you need to use: You said you need words ending with s or ing. So you probably need something to represent a word, endings of words and the literal characters s and ing. What is a word? This might change from case to case, but at least every alphabetical character. Looking up in the python documentation on regexes you can find \w which is [a-zA-Z0-9_], which fits my impression of a word character. There you can also find \b which is a word boundary.
So the "first pseudo code try" is something like \b\w...\w\b, which matches a word. We still need to "formalize" the ..., which should mean "one or more characters"; this translates directly to \b\w+\b. We can now match a word! We still need the s or ing. | translates to or, so how about the following: \b\w+ing|s\b? If you test this, you'll see that it matches confusing things, like the sing in singer, which should not match. What is happening? As you probably already saw, the | can't know "which part it should or": it splits the whole pattern into \b\w+ing or s\b. So we need to introduce parentheses: \b\w+(ing|s)\b. Congratulations, you have now arrived at a working regex!
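The effect of the missing parentheses can be seen by listing the full text of each match. A sketch; the helper matched_text and the sample sentence are mine, not from the answer:

```python
import re


def matched_text(pattern, text):
    """Return the full text of every match (group 0), ignoring capture groups."""
    return [m.group(0) for m in re.finditer(pattern, text)]


sample = "singer sees fishing"

# Without parentheses the pattern means: (\b\w+ing) OR (s\b).
print(matched_text(r"\b\w+ing|s\b", sample))

# With parentheses the | only chooses between the two endings.
print(matched_text(r"\b\w+(ing|s)\b", sample))
```

The first pattern picks up fragments such as "sing" and a lone "s", while the second returns only whole words with the wanted endings.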
Why (and how) does this differ from the example I gave first? First, I wrote \w+? instead of \w+; the ? turns the + into a non-greedy version. If you know the difference between greedy and non-greedy, skip this paragraph. Consider the string AaAAbA, where we want to match the things enclosed by a capital A. A naive try is A\w+A: one or more word characters enclosed by A. We would like this to match AaA, but it matches all of AaAAbA, because A is itself something that \w can match. Without further configuration, the *, + and ? quantifiers all try to match as much as possible. Sometimes, as in the A example, you don't want that; you can then put a ? after the quantifier to signal that you want a non-greedy version, one that matches as little as possible.
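The AaAAbA example can be run directly to compare the two quantifiers (the variable names are mine):

```python
import re

text = "AaAAbA"

greedy = re.findall(r"A\w+A", text)   # \w+ grabs as much as it can
lazy = re.findall(r"A\w+?A", text)    # \w+? stops at the first possible A

print(greedy)
print(lazy)
```

The greedy pattern returns the whole string as one match, while the lazy one returns the two short A...A pieces.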
But in our case this isn't needed: the words are well separated by whitespace, which is not part of \w. So in fact you can just let + be greedy and everything will be all right. If you use . (any character), you often need to be careful not to match too much.
The other difference is using (?:s|ing) instead of (s|ing). What does the ?: do here? It changes a capturing group into a non-capturing group. Generally you don't want to get "everything" from the regex. Consider the following regex: I want to go to \w+. You are not interested in the whole sentence, only in the \w+, so you can capture it in a group: I want to go to (\w+). This means that you are interested in this specific piece of information and want to retrieve it later. Sometimes (as when using |) you need to group expressions together but are not interested in their content; you can then declare the group as non-capturing. Otherwise functions such as re.findall will return the group (s or ing) but not the actual word!
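A brief sketch of how the group type changes what re.findall returns, and of retrieving a captured piece with group(1). The sentence about Paris is a made-up sample:

```python
import re

sentence = "fishing words"

# With a capturing group, findall returns only what the group captured.
capturing = re.findall(r"\b\w+(ing|s)\b", sentence)
# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r"\b\w+(?:ing|s)\b", sentence)

print(capturing)
print(non_capturing)

# Capturing on purpose: keep only the destination, not the whole sentence.
destination = re.search(r"I want to go to (\w+)", "I want to go to Paris").group(1)
print(destination)
```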
So to summarize:
* start small
* add one case after another
* always test with examples
In fact I just tried re.findall(\b\w+(?:ing|s)\b, "fishing words") and it didn't work. \w+(?:ing|s) works. I've no idea why, maybe someone else can explain that. Regex are an arcane thing, only use them for easy and easy to test tasks.
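A likely explanation, assuming the pattern was passed as an ordinary (non-raw) string literal: in Python string literals "\b" is the backspace character, so the regex engine never sees a word-boundary token. A sketch of the difference:

```python
import re

# In an ordinary string literal, "\b" is the backspace character (\x08).
# "\\w" is written with a doubled backslash only to avoid Python's
# invalid-escape warning; it produces the same two characters as "\w".
non_raw = "\b\\w+(?:ing|s)\b"
raw = r"\b\w+(?:ing|s)\b"

print(re.findall(non_raw, "fishing words"))  # looks for literal backspaces
print(re.findall(raw, "fishing words"))      # \b reaches the regex engine intact
```

Writing patterns as raw strings (r"...") avoids this class of problem, which is also why \w+(?:ing|s) without the \b happened to work.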
Generally speaking I'd use \b to match "word boundaries" with \w which matches word components (short cut for [A-Za-z0-9_]). Then you can do an or grouping to match "s" or "ing". Result is:
/\b\w+(s|ing)\b/

Pattern for '.' separated words with arbitrary number of whitespaces

It's the first time that I'm using regular expressions in Python and I just can't get it to work.
Here is what I want to achieve: I want to find all strings, where there is a word followed by a dot followed by another word. After that an unknown number of whitespaces followed by either (off) or (on). For example:
word1.word2 (off)
Here is what I have come up with so far.
string_group = re.search(r'\w+\.\w+\s+[(\(on\))(\(off\))]', analyzed_string)
\w+ for the first word
\. for the dot
\w+ for the second word
\s+ for the whitespaces
[(\(on\))(\(off\))] for the (off) or (on)
I think that the last expression might not be doing what I need it to. With the implementation right now, the program does find the right place in the string, but the output of
string_group.group(0)
Is just
word1.word2 (
instead of the whole expression I'm looking for. Could you please give me a hint what I am doing wrong?
[ ... ] is used for a character class, and will match any one character inside it unless you add a quantifier: [ ... ]+ for one or more times.
But simply adding that won't work...
\w+\.\w+\s+[(\(on\))(\(off\))]+
Will match garbage stuff like word1.word2 )(fno(nofn too, so you actually don't want to use a character class, because it'll match the characters in any order. What you can use is a capturing group, and a non-capturing group along with an OR operator |:
\w+\.\w+\s+(\((?:on|off)\))
(?:on|off) will match either on or off
Now, if you don't want the parentheses to be captured in the first group too, you can change that to:
\w+\.\w+\s+\((on|off)\)
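The three patterns discussed above can be compared on a sample line. A sketch; the sample string with extra spaces is mine:

```python
import re

line = "word1.word2   (off)"

# Character class: matches exactly one of the characters ( ) o n f.
char_class = re.search(r"\w+\.\w+\s+[(\(on\))(\(off\))]", line)
print(repr(char_class.group(0)))

# Adding + to the class accepts those characters in any order.
garbage = re.fullmatch(r"\w+\.\w+\s+[(\(on\))(\(off\))]+", "word1.word2 )(fno(nofn")
print(garbage is not None)

# Literal parentheses around an alternation: the intended pattern.
grouped = re.search(r"\w+\.\w+\s+\((on|off)\)", line)
print(grouped.group(0), "|", grouped.group(1))
```

With the grouped pattern, group(0) holds the whole expression and group(1) holds just on or off.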
You've got your logical OR mixed up.
[(\(on\))(\(off\))]
should be
\((?:on|off)\)
[]s are just for matching single characters.
The square brackets are a character class, which matches any one of the characters in the brackets. You appear to be trying to use it to match one of the sub-regexes (\(on\)) and (\(off\)). The way to do that is with an alternation operation, the pipe symbol: (\(on\)|\(off\)).
I think your problem may be with the square brackets []
they indicate a set of single characters to match. So your expression would match a single instance of any of the following chars: "()ofn"
So for the string "word1.word2 (on)", you are matching only this part: "word1.word2 ("
Try using this one instead:
re.search(r'\w+\.\w+\s+\((on|off)\)', analyzed_string)
This match assumes that the () will be there, and looks for either "on" or "off" inside the parentheses.
