Python Regex Skipping Optional Groups - python

I am trying to extract a doctor's name and title from a string. If "dr" is in the string, I want it to use that as the title and then use the next word as the doctor's name. However, I also want the regex to be compatible with strings that do not have "dr" in them. In that case, it should just match the first word as the doctor's name and assume no title.
I have come up with the following regex pattern:
pattern = re.compile(r'(DR\.? )?([A-Z]*)', re.IGNORECASE)
As I understand it, this should optionally match the letters "dr" (with or without a following period) and then a space, followed by a series of letters, case-insensitive. The problem is, it seems to only pick up the optional "dr" title if it is at the beginning of the string.
import re
# Raw string so that \. reaches the regex engine unchanged
pattern = re.compile(r'(DR\.? )?([A-Z]*)', re.IGNORECASE)
test1 = "Dr Joseph Fox"
test2 = "Joseph Fox"
test3 = "Optometry by Dr Joseph Fox"
print(pattern.search(test1).groups())
print(pattern.search(test2).groups())
print(pattern.search(test3).groups())
The code returns this:
('Dr ', 'Joseph')
(None, 'Joseph')
(None, 'Optometry')
The first two scenarios make sense to me, but why does the third not find the optional "Dr"? Is there a way to make this work?

You're seeing this behavior because search() returns the leftmost possible match, not the longest or "best" one. Both groups in your pattern can match nothing, so the regex succeeds at the very start of your third string: the optional first group is skipped and the second group takes the first word. You can see this by using the findall regex function:
>>> print(pattern.findall(test3))
[('', 'Optometry'), ('', ''), ('', 'by'), ('', ''), ('Dr ', 'Joseph'), ('', ''), ('', 'Fox'), ('', '')]
It's immediately obvious that 'Dr Joseph' was successfully found, but just wasn't the first matching part of your string.
In my experience, trying to coerce regexes to express/capture multiple cases is often asking for inscrutable regexes. Specifically answering your question, I'd prefer to run the string through one regex requiring the 'Dr' title, and if I fail to get any matches, just split on spaces and take the first word (or however you want to go about getting the first word).
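The two-step approach described above might look like the following sketch. The names TITLE_RE and extract_doctor are made up for illustration, and the word boundary before "dr" is my own assumption about what should count as a title.

```python
import re

# Hypothetical helper pattern: "dr" as a whole word, optional period, then a name.
TITLE_RE = re.compile(r'\b(dr)\.?\s+([a-z]+)', re.IGNORECASE)

def extract_doctor(text):
    """Return (title, name); title is None when no 'Dr' is found."""
    m = TITLE_RE.search(text)
    if m:
        return m.group(1), m.group(2)
    # Fallback: no title anywhere, so take the first word as the name.
    words = text.split()
    return None, (words[0] if words else None)

for sample in ["Dr Joseph Fox", "Joseph Fox", "Optometry by Dr Joseph Fox"]:
    print(extract_doctor(sample))
```

Keeping the fallback outside the regex means each piece stays easy to read and to change independently.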

Regular expression engines match greedily from left to right. In other words: there is no "best" match and the first match will always be returned. You can do a global search, though...check out re.findall().
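To see the leftmost-match rule in action, here is a small sketch that reuses the pattern from the question. It prints where search() matched, then uses finditer() to keep only the matches that actually captured a title. Filtering with a list comprehension is just one way to do it.

```python
import re

pattern = re.compile(r'(DR\.? )?([A-Z]*)', re.IGNORECASE)
text = "Optometry by Dr Joseph Fox"

# search() stops at the first position where the pattern can match at all.
first = pattern.search(text)
print(first.span(), first.groups())

# finditer() walks every match; keep only those where group 1 took part.
titled = [m.groups() for m in pattern.finditer(text) if m.group(1)]
print(titled)
```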

Your regex basically accepts any word, so even with findall it will be difficult to choose which word is the doctor's name when "dr" is not present.
Is re.IGNORECASE really important? Are you only interested in the doctor's first name, or in both the first name and the surname?
I would recommend using a regex that matches two words starting with an uppercase letter with exactly one space in between, keeping the optional "dr" in front.
If re.IGNORECASE is really important, it may be better to search for "dr" first and, if that is unsuccessful, store the first word as the name, as proposed in the other answer.

Look into the (?<=...) lookbehind syntax in the Python regex documentation.
Your re pattern will look about like this:
(DR\.? )?(?<=DR\.? )([A-Z]*)
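One caveat with the pattern above: as far as I know, Python's re module only accepts fixed-width lookbehinds, and DR\.? can be two or three characters wide, so compiling it raises re.error. The sketch below checks that and shows a possible workaround using two fixed-width lookbehinds. The helper name compiles and the workaround pattern are my own, not part of the answer.

```python
import re

def compiles(pattern_text):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern_text, re.IGNORECASE)
        return True
    except re.error:
        return False

# The variable-width lookbehind is rejected at compile time.
print(compiles(r'(DR\.? )?(?<=DR\.? )([A-Z]*)'))

# Workaround sketch: one fixed-width lookbehind per spelling of the title.
fixed = re.compile(r'(?:(?<=dr )|(?<=dr\. ))([a-z]+)', re.IGNORECASE)
m = fixed.search("Optometry by Dr Joseph Fox")
print(m.group(1))
```

Note that this only finds names preceded by a title, so the no-title case still needs a separate fallback.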

Your pattern only picks up "Dr" when the match starts with it; it does not look for "Dr" later in the string.
Try:
pattern = re.compile(r'(.*DR\.? )?([A-Z]*)', re.IGNORECASE)
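A quick check of this suggestion shows that it finds the name, but the first group also swallows everything before "Dr". The variant below is my own sketch rather than part of the answer: it moves the leading text into a non-capturing group so that only the title is captured.

```python
import re

original = re.compile(r'(.*DR\.? )?([A-Z]*)', re.IGNORECASE)
# Variant (an assumption, not from the answer): skip leading text lazily,
# then capture only the title itself.
variant = re.compile(r'(?:.*?\b(DR\.?) )?([A-Z]*)', re.IGNORECASE)

tests = ["Dr Joseph Fox", "Joseph Fox", "Optometry by Dr Joseph Fox"]
for text in tests:
    print(original.search(text).groups(), variant.search(text).groups())
```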

Related

Python Regex how to match a substring without replace a part of that

I have the following sentence:
sentence = "Work \nExperience \n\n First Experience..."
Work
Experience
First Experience...
So, I want to remove the "\n" between Work and Experience, but at the same time I don't want to remove the "\n\n" after Experience.
Work Experience
First Experience...
I've tried different solutions like:
string = re.sub(" \n{1}[^\n]"," ",sentence)
but all of them remove the first character after \n (E).
Update: I managed to find the solution thanks to @Wiktor:
print(re.sub(r'\w .*?\w+', lambda x: x.group().replace('\n', ''), sentence, flags=re.S))
If you want to make it a generic solution to remove any amount of \n, a newline, in between two strings, you can use
import re
sentence = "Work \nExperience \n\n First Experience..."
print( re.sub(r'Work.*?Experience', lambda x: x.group().replace('\n', ''), sentence, flags=re.S) )
Output:
Work Experience
First Experience...
With re.S, Work.*?Experience matches the shortest substring that starts with Work and ends with Experience, including any newlines in between. For each match, the match object (x) is passed to the lambda, which removes all newlines using .replace('\n', ''), and the modified string is returned to re.sub as the replacement.
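The attempt in the question loses the "E" because [^\n] consumes the character after the newline. A possible alternative, sketched here under the assumption that only a single newline preceded by a space should be removed, is a negative lookahead, which checks the next character without consuming it.

```python
import re

sentence = "Work \nExperience \n\n First Experience..."

# " \n" is replaced only when it is NOT followed by another "\n".
# The lookahead (?!\n) inspects the next character but leaves it in place.
cleaned = re.sub(r" \n(?!\n)", " ", sentence)
print(repr(cleaned))
```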

Matching regex pattern where there is \n\r between starting and ending pattern

The red underscore is the desired string I want to match
I would like to match everything (including \n) between the two strings shown in the example.
However, in the first example, where there is a newline, I can't get anything to match.
In the second example, the regex works. It matches the string highlighted in green because it resides on a single line.
I'm not sure whether there is a notation I need to include so that \n\r is part of the pattern to match.
Use this:
output = re.search('This(.*?)\n\n(.*?)match', text)
>>> output.group(1)
'is a multiline expression'
>>> output.group(2)
'I would like to '
Try this one as well:
output = re.search(r"This ([\s\S]+) match", text).group(1).replace('\n', '')
That will find the entire thing as one group, then remove the newlines.
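Another option is the re.DOTALL flag, which lets the dot match line breaks too, so the pattern does not need to know how many newlines separate the two markers. The sample text below is an assumption, since the original was only shown as a screenshot.

```python
import re

# Assumed sample text with Windows-style line endings.
text = "This is a multiline expression\r\n\r\nI would like to match"

# With re.DOTALL, "." also matches "\r" and "\n".
m = re.search(r"This(.*?)match", text, re.DOTALL)
captured = m.group(1)

# Collapse all runs of whitespace (including line breaks) into single spaces.
flattened = " ".join(captured.split())
print(repr(flattened))
```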

Regex for extracting name starting with Mr.|Mrs

I was trying to write a regex for identifying names starting with
Mr.|Mrs.
for example
Mr. A, Mrs. B.
I tried several expressions. These regular expressions were checked with the online tool at pythonregex.com. The test string used is:
"hey where is Mr A how are u Mrs. B tt`"
The outputs shown are from Python's findall() function, i.e.
regex.findall(string)
The regexes and their respective outputs are below.
Regex: Mr.|Mrs. [a-zA-Z]+
Output: [u'Mr ', u'Mrs']
Why are A and B not appearing with Mr. and Mrs.?
Regex: [Mr.|Mrs.]+ [a-zA-Z]+
Output: [u's Mr', u'. B']
Why is s coming with Mr. instead of A?
I tried many more combinations, but these are the confusing ones, so here they are. I know the regex for the name part has to cover more conditions, but I was starting from the basics.
Change your regex as below:
(?:Mr\.|Mrs\.) [a-zA-Z]+
You need to put Mr\. and Mrs\. inside a group (capturing or non-capturing) so that the | (OR) applies only within the group. Without the group, the alternation splits the whole pattern into "Mr." on one side and "Mrs. [a-zA-Z]+" on the other.
You also need to escape the dot to match a literal dot; otherwise it matches any character. An unescaped . is a metacharacter that matches any character except line breaks.
Or, even shorter:
Mrs?\. [a-zA-Z]+
The ? quantifier makes the preceding character, s, optional.
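Running the shorter pattern against the test string from the question shows one thing to watch for: "Mr A" has no period, so a pattern that requires the dot skips it. The lenient variant below, with an optional dot and a leading word boundary, is my own assumption about what the asker wants.

```python
import re

text = "hey where is Mr A how are u Mrs. B tt`"

# Pattern from the answer: the period is required.
strict = re.findall(r"Mrs?\. [a-zA-Z]+", text)

# Variant (assumption): period optional, title must start a word.
lenient = re.findall(r"\bMrs?\.? [a-zA-Z]+", text)

print(strict)
print(lenient)
```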
There's a python library for parsing human names :
https://github.com/derek73/python-nameparser
Much better than writing your own regex.

Regular expressions: How do I find a sub-string that is between two regular expression matches?

Let's say I have a string like:
data = 'MESSAGE: Hello world!END OF MESSAGE'
And I want to get the string between 'MESSAGE: ' and the next capitalized word. There are never any fully capitalized words in the message.
I tried to get this by using this regular expression in re.search:
re.search('MESSAGE: (.*)([A-Z]{2,})', data).group(1)
Here I would like it to output 'Hello world!'- but it always returns the wrong result. It is very easy in regular expressions for one to find a sub-string that occurs between two other strings, but how do you find a substring between strings that are matches for a regular expression. I have tried making it a raw string but that didn't seem to work.
I hope I am expressing myself well- I have extensive experience in Python but am new to regular expressions. If possible, I would like an explanation along with an example of how to make my specific example code work. Any helpful posts are greatly appreciated.
BTW, I am using Python 3.3.
Your code doesn't work but for the opposite reason:
re.search('MESSAGE: (.*)([A-Z]{2,})', data).group(1)
would match
'Hello world!END OF MESSA'
because (.*) is "greedy", i.e. it matches the most that will allow the rest (two uppercase chars) to match. You need to use a non-greedy quantifier with
re.search('MESSAGE: (.*?)([A-Z]{2,})', data).group(1)
that correctly matches
'Hello world!'
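For a side-by-side view, this short sketch runs both versions on the same data and prints what each group captured. It only restates the patterns above as raw strings.

```python
import re

data = 'MESSAGE: Hello world!END OF MESSAGE'

# Greedy: .* grabs as much as possible, then backs off just enough
# for [A-Z]{2,} to match at the very end.
greedy = re.search(r'MESSAGE: (.*)([A-Z]{2,})', data)

# Lazy: .*? grows one character at a time and stops at the first
# place where [A-Z]{2,} can match.
lazy = re.search(r'MESSAGE: (.*?)([A-Z]{2,})', data)

print(greedy.groups())
print(lazy.groups())
```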
One little question mark:
re.search('MESSAGE: (.*?)([A-Z]{2,})', data).group(1)
Out[91]: 'Hello world!'
if you make the first capturing group lazy, it won't consume anything after the exclamation point.
You need your .* to be non-greedy (see the first ?), which means that it stops matching at the first point where the next item can match. You can also make the second group non-capturing (see the ?:), since you don't need its contents.
import re
data = 'MESSAGE: Hello world!END OF MESSAGE'
regex = r'MESSAGE: (.*?)(?:[A-Z]{2,})'
re.search(regex, data).group(1)
Returns:
'Hello world!'
Alternatively, you could use this:
regex = r'MESSAGE: (.*?)[A-Z]{2,}'
To break this down (I'll include the search line with the VERBOSE flag):
regex = r'''
MESSAGE:\s # first part, \s for the space (matches whitespace)
(.*?) # non-greedy, anything but a newline
(?:[A-Z]{2,}) # a secondary group, but non-capturing,
# good for alternatives separated by a pipe, |
'''
re.search(regex, data, re.VERBOSE).group(1)

Python: regex: find if exists, else ignore

I need help with the re module. I have this pattern:
pattern = re.compile(r'''first_condition\((.*)\)
extra_condition\((.*)\)
testing\((.*)\)
other\((.*)\)''', re.UNICODE)
This is what happens if I run the regex on the following text:
text = '''first_condition(enabled)
extra_condition(disabled)
testing(example)
other(something)'''
result = pattern.findall(text)
print(result)
[('enabled', 'disabled', 'example', 'something')]
But if one or two lines are missing, the regex returns an empty list. E.g. my text is:
text = '''first_condition(enabled)
other(other)'''
What I want to get:
[('enabled', '', '', 'something')]
I could do it in several commands, but I think that it will be slower than doing it in one regex. The original code uses sed, so it is very fast. I could do it using sed, but I need a cross-platform way to do it. Is it possible? Thanks!
P.S. It would also be great if the order of the lines could be free, not fixed:
text = '''other(other)
first_condition(enabled)'''
must return exactly the same result:
[('enabled', '', '', 'something')]
I would parse it to a dictionary first:
import re
keys = ['first_condition', 'extra_condition', 'testing', 'other']
d = dict(re.findall(r'^(.*)\((.*)\)$', text, re.M))
result = [d.get(key, '') for key in keys]
See it working online: ideone
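Wrapped in a function, the dictionary approach handles both missing lines and arbitrary line order. The function name extract and the tuple return type are my own choices for this sketch. Note that with other(other) in the input, the last field comes back as 'other'.

```python
import re

KEYS = ['first_condition', 'extra_condition', 'testing', 'other']

def extract(text, keys=KEYS):
    """Return one value per key, using '' for keys absent from the text."""
    # Each line "name(value)" becomes a (name, value) pair.
    found = dict(re.findall(r'^(.*)\((.*)\)$', text, re.M))
    return tuple(found.get(key, '') for key in keys)

print(extract('first_condition(enabled)\nother(other)'))
print(extract('other(other)\nfirst_condition(enabled)'))
```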
Use a non-capturing group for the optional part, and make the group optional by putting a question mark after it.
Example:
pat = re.compile(r'a\(([^)]+)\)(?:b\((?P<bgr>[^)]+)\))?')
Sorry but I can't test this right now.
The above requires a string like a(foo) and grabs the text in the parens as group 1.
Then it optionally matches a string like b(foo), and if it is matched, the text is saved as a named group with the name bgr.
Note that I didn't use .* to match inside the parens but [^)]+. This definitely stops matching when it reaches the closing paren, and requires at least one character. You could use [^)]* if the parens can be empty.
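A quick check of this idea, written with the optional group closed before its question mark. The sample inputs are made up for illustration.

```python
import re

# a(...) is required; b(...) is wrapped in an optional non-capturing group.
pat = re.compile(r'a\(([^)]+)\)(?:b\((?P<bgr>[^)]+)\))?')

both = pat.match('a(foo)b(bar)')
only_a = pat.match('a(foo)')

print(both.group(1), both.group('bgr'))
print(only_a.group(1), only_a.group('bgr'))  # bgr is None when b(...) is absent
```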
These patterns are getting complicated so you might want to use verbose patterns with comments.
To have several optional patterns that might appear in any order, put them all inside a non-capturing group and separate them with vertical bars. You will need to use named groups because you won't know the order. Put an asterisk after the non-capturing group to allow any number of the alternatives to be present (including zero if none are present).
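The any-order idea could be sketched as follows, using the key names from the question. Building the alternation from a list, allowing whitespace between entries, and the parse helper are all assumptions of this sketch. An unrecognised line simply ends the match at that point.

```python
import re

KEYS = ['first_condition', 'extra_condition', 'testing', 'other']

# One alternative per key, each with a named group: key\((?P<key>[^)]*)\)
alternatives = '|'.join(r'{0}\((?P<{0}>[^)]*)\)'.format(key) for key in KEYS)

# Any number of alternatives, in any order, separated by optional whitespace.
pattern = re.compile(r'(?:\s*(?:' + alternatives + r'))*')

def parse(text):
    """Return one value per key; keys that never matched yield ''."""
    m = pattern.match(text)
    return tuple(m.group(key) or '' for key in KEYS)

print(parse('other(other)\nfirst_condition(enabled)'))
```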
