Python Regex Inconsistency - python

For several different regular expressions I have found optional and conditional sections of the regex to behave differently for the first match and the subsequent matches. This is using python, but I found it to hold generically.
Here are two similar examples that illustrate the issue:
First Example:
expression:
(?:\w. )?([^,.]*).*(\d{4}\w?)
text:
J. wang Wang, X. Liu, and A. A. Chien. Empirical Study of Tolerating \nDenial-of-Service Attacks with a Proxy Network. In Proceedings of the USENIX Security Symposium, 2002.
R. wang Wang, X. Liu, and A. A. Chien. Empirical Study of Tolerating \nDenial-of-Service Attacks with a Proxy Network. In Proceedings of the USENIX Security Symposium, 2002.
matches:
Match 1
wang Wang
2002
Match 2
R
2002
Second example:
expression:
((?:\w\. )?[^,.]*).*(\d{4}\w?)
text:
J. wang Wang, X. Liu, and A. A. Chien. Empirical Study of Tolerating \nDenial-of-Service Attacks with a Proxy Network. In Proceedings of the USENIX Security Symposium, 2002.
R. wang Wang, X. Liu, and A. A. Chien. Empirical Study of Tolerating \nDenial-of-Service Attacks with a Proxy Network. In Proceedings of the USENIX Security Symposium, 2002.
matches:
Match 1
J. wang Wang
2002
Match 2
R
2002
What am I missing?
I would expect the matches to be consistent. This is what I think the output should be (and I don't yet understand why it isn't):
Example 1
Match 1
wang Wang
2002
Match 2
wang Wang
2002
Example 2
Match 1
J. wang Wang
2002
Match 2
R. wang Wang
2002

In your first example you expect the second line to match 'wang Wang', but that is not what happens.
After the first match, which ends with '2002', the regex tries to match the remaining part, which starts with '.\n\nR. wang Wang'. In your first regex the optional non-capturing group cannot match there, so group 1 takes over and matches from the newlines on, ending up with '\n\nR'.
(?: # non-capturing group
\w. # word char, followed by 1 char, followed by space
)? # read 0 or 1 times
( # start group 1
[^,.]* # read anything that's not a comma or dot, 0 or more times
) # end group 1
.* # read anything
( # start group 2
\d{4} # until there's 4 digits
\w? # optionally followed by a word char
) # end group 2
The same applies to your second regex: even here your non-capturing group (?:\w\. )? doesn't consume the R. because there are a dot and some newlines in front of the initials.
You could have solved it like this: ([A-Z]\.)\s([^.,]+).*(\d{4})
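To see the behaviour described above, here is a minimal sketch on a shortened two-line sample. The sample text and the `^` plus `re.MULTILINE` variant are assumptions added for illustration, not part of the original answer. Anchoring each match to the start of a line keeps group 1 from absorbing the newline and the initial.

```python
import re

# Shortened stand-in for the two citation lines (an assumption for illustration).
text = (
    "J. wang Wang, X. Liu. In Proceedings, 2002.\n"
    "R. wang Wang, X. Liu. In Proceedings, 2002."
)

# Original first pattern: the second search resumes at the '.' after the first
# '2002', so the optional initial group cannot match and group 1 absorbs '\nR'.
original = re.findall(r"(?:\w. )?([^,.]*).*(\d{4}\w?)", text)

# Variant anchored at line starts (assumed fix, not from the answer).
anchored = re.findall(r"^(?:\w\. )?([^,.]*).*(\d{4}\w?)", text, re.MULTILINE)

# Pattern suggested in the answer, with the initial in its own group.
suggested = re.findall(r"([A-Z]\.)\s([^.,]+).*(\d{4})", text)

print(original)   # second tuple starts with '\nR'
print(anchored)   # both tuples start with 'wang Wang'
print(suggested)  # three groups per line: initial, name, year
```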

Related

How to remove a specific pattern from re.findall() results

I have a re.findall() searching for a pattern in python, but it returns some undesired results and I want to know how to exclude them. The text is below, I want to get the names, and my statement (re.findall(r'([A-Z]{4,} \w. \w*|[A-Z]{4,} \w*)', text)) is returning this:
'ERIN E. SCHNEIDER',
'MONIQUE C. WINKLER',
'JASON M. HABERMEYER',
'MARC D. KATZ',
'JESSICA W. CHAN',
'RAHUL KOLHATKAR',
'TSPU or taken',
'TSPU or the',
'TSPU only',
'TSPU was',
'TSPU and']
I want to get rid of the "TSPU" pattern items. Does anyone know how to do it?
JINA L. CHOI (NY Bar No. 2699718)
ERIN E. SCHNEIDER (Cal. Bar No. 216114) schneidere#sec.gov
MONIQUE C. WINKLER (Cal. Bar No. 213031) winklerm#sec.gov
JASON M. HABERMEYER (Cal. Bar No. 226607) habermeyerj#sec.gov
MARC D. KATZ (Cal. Bar No. 189534) katzma#sec.gov
JESSICA W. CHAN (Cal. Bar No. 247669) chanjes#sec.gov
RAHUL KOLHATKAR (Cal. Bar No. 261781) kolhatkarr#sec.gov
The Investor Solicitation Process Generally Included a Face-to-Face Meeting, a Technology Demonstration, and a Binder of Materials [...]
You can use
\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?
See this regex demo. Details:
\b - a word boundary (else, the regex may "catch" a part of a word that contains TSPU)
(?!TSPU\b) - a negative lookahead that fails the match if there is TSPU string followed with a non-word char or end of string immediately to the right of the current location
[A-Z]{4,} - four or more uppercase ASCII letters
(?:(?:\s+\w\.)?\s+\w+)? - an optional occurrence of:
(?:\s+\w\.)? - an optional occurrence of one or more whitespaces, a word char and a literal . char
\s+ - one or more whitespaces
\w+ - one or more word chars.
In Python, you can use
re.findall(r'\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?', text)
You can also do some simple filtering. If your list is called results:
removed_TSPU_results = list(filter(lambda result: not result.startswith("TSPU"), results))
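Both approaches can be tried side by side. This is a small sketch on a shortened sample: the last line of `text` is an invented sentence containing "TSPU", and the `results` list is a hand-picked subset of the output shown in the question.

```python
import re

# Shortened sample; the last line is an invented sentence containing "TSPU".
text = (
    "JINA L. CHOI (NY Bar No. 2699718)\n"
    "ERIN E. SCHNEIDER (Cal. Bar No. 216114)\n"
    "RAHUL KOLHATKAR (Cal. Bar No. 261781)\n"
    "The TSPU was shown to investors."
)

# Approach 1: exclude TSPU inside the pattern with a negative lookahead.
names = re.findall(r"\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?", text)

# Approach 2: post-filter a result list that was produced by another pattern.
results = ["MARC D. KATZ", "TSPU or taken", "TSPU was"]
without_tspu = [r for r in results if not r.startswith("TSPU")]

print(names)
print(without_tspu)
```

The lookahead keeps everything in one pass, while the post-filter is easier to read when the original pattern should stay untouched.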

Regex Text Cleaning on Multiple forms of text formats

I have a dataframe with multiple forms of names:
JOSEPH W. JASON
Ralph Landau
RAYMOND C ADAMS
ABD, SAMIR
ABDOU TCHOUSNOU, BOUBACAR
ABDL-ALI, OMAR R
For first 3, the rule is last word. For the last three, or anything with comma, the first word is the last name. However, for name like Abdou Tchousnou, I only took the last word, which is Tchousnou.
The expected output is
JASON
LANDAU
ADAMS
ABD
TCHOUSNOU
ABDL-ALI
The left is the name, and the right is what I want to return.
str.extract(r'(^(?=[^,]*,?$)[\w-]+|(?<=, )[\w-]+)', expand=False)
Is there any way to solve this? The current code returns the first name instead of the surname, which is the part I want.
Something like this would work:
(.+(?=,)|\S+$)
( - start capture group #1
.+(?=,) - get everything before a comma
| - or
\S+$ - get everything which is not a whitespace before the end of the line
) - end capture group #1
https://regex101.com/r/myvyS0/1
Python:
str.extract(r'(.+(?=,)|\S+$)', expand=False)
You may use this regex to extract:
>>> print (df)
name
0 JOSEPH W. JASON
1 Ralph Landau
2 RAYMOND C ADAMS
3 ABD, SAMIR
4 ABDOU TCHOUSNOU, BOUBACA
5 ABDL-ALI, OMAR R
>>> df['name'].str.extract(r'([^,]+(?=,)|\w+(?:-\w+)*(?=$))', expand=False)
0 JASON
1 Landau
2 ADAMS
3 ABD
4 ABDOU TCHOUSNOU
5 ABDL-ALI
RegEx Details:
(: Start capture group
[^,]+(?=,): Match 1+ non-comma characters that are followed by a comma
|: OR
\w+: Match 1+ word characters
(?:-\w+)*: Match - followed by 1+ word characters. Match 0 or more of this group
(?=$): Lookahead to assert that we are at the end of the line
): End capture group
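Both answers above return ABDOU TCHOUSNOU for the two-word surname, while the question asks for TCHOUSNOU only. The following is one possible variation, sketched with the standard `re` module rather than pandas. The pattern, the `surname` helper, and the upper-casing step are assumptions for illustration and do not come from either answer.

```python
import re

# Assumed pattern: the word (letters, digits, '_' or '-') directly before a
# comma, or else the last such word in the string.
SURNAME_RX = re.compile(r"[\w-]+(?=,)|[\w-]+$")

def surname(name):
    """Hypothetical helper: return the upper-cased surname, or None."""
    match = SURNAME_RX.search(name)
    return match.group(0).upper() if match else None

names = [
    "JOSEPH W. JASON",
    "Ralph Landau",
    "RAYMOND C ADAMS",
    "ABD, SAMIR",
    "ABDOU TCHOUSNOU, BOUBACAR",
    "ABDL-ALI, OMAR R",
]
surnames = [surname(n) for n in names]
print(surnames)
```

Wrapped in a capture group, the same pattern should also work with `str.extract`, though that has not been checked against the asker's dataframe.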

Insert space after the second or third capital letter python

I have a pandas dataframe containing addresses. Some are formatted correctly, like 481 Rogers Rd York ON. Others have a space missing between the city quadrant and the city name, for example: 101 9 Ave SWCalgary AB or even possibly: 101 9 Ave SCalgary AB, where SW refers to south west and S to south.
I'm trying to find a regex that will add a space between second and third capital letters if they are followed by lowercase letters, or if there are only 2 capitals followed by lower case, add a space between the first and second.
So far, I've found that ([A-Z]{2,3}[a-z]) will match the situation correctly, but I can't figure out how to look back into it and sub at position 2 or 3. Ideally, I'd like to use an index to split the match at [-2:] but I can't figure out how to do this.
I found that re.findall('(?<=[A-Z][A-Z])[A-Z][a-z].+', '101 9 Ave SWCalgary AB')
will return the last part of the string and I could use a look forward regex to find the start and then join them but this seems very inefficient.
Thanks
You may use
df['Test'] = df['Test'].str.replace(r'\b([A-Z]{1,2})([A-Z][a-z])', r'\1 \2', regex=True)
See this regex demo
Details
\b - a word boundary
([A-Z]{1,2}) - Capturing group 1 (later referred with \1 from the replacement pattern): one or two uppercase letters
([A-Z][a-z]) - Capturing group 2 (later referred with \2 from the replacement pattern): an uppercase letter + a lowercase one.
If you want to specifically match city quadrants, you may use a bit more specific regex:
df['Test'] = df['Test'].str.replace(r'\b([NS][EW]|[NESW])([A-Z][a-z])', r'\1 \2', regex=True)
See this regex demo. Here, [NS][EW]|[NESW] matches N or S that are followed with E or W, or a single N, E, S or W.
Pandas demo:
import pandas as pd
df = pd.DataFrame({'Test':['481 Rogers Rd York ON',
'101 9 Ave SWCalgary AB',
'101 9 Ave SCalgary AB']})
>>> df['Test'].str.replace(r'\b([A-Z]{1,2})([A-Z][a-z])', r'\1 \2', regex=True)
0 481 Rogers Rd York ON
1 101 9 Ave SW Calgary AB
2 101 9 Ave S Calgary AB
Name: Test, dtype: object
You can use
([A-Z]{1,2})(?=[A-Z][a-z])
to capture the first (or first and second) capital letters, and then use lookahead for a capital letter followed by a lowercase letter. Then, replace with the first group and a space:
re.sub(r'([A-Z]{1,2})(?=[A-Z][a-z])', r'\1 ', text)
https://regex101.com/r/TcB4Ph/1
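The two substitution styles from the answers above can be compared without pandas. This is a minimal sketch using only the standard `re` module on the three sample addresses from the question.

```python
import re

addresses = [
    "481 Rogers Rd York ON",
    "101 9 Ave SWCalgary AB",
    "101 9 Ave SCalgary AB",
]

# Two capture groups: the quadrant, then the start of the city name.
QUADRANT_RX = re.compile(r"\b([NS][EW]|[NESW])([A-Z][a-z])")
with_groups = [QUADRANT_RX.sub(r"\1 \2", a) for a in addresses]

# One capture group plus a lookahead, so the city name is not consumed.
LOOKAHEAD_RX = re.compile(r"([A-Z]{1,2})(?=[A-Z][a-z])")
with_lookahead = [LOOKAHEAD_RX.sub(r"\1 ", a) for a in addresses]

for before, after in zip(addresses, with_groups):
    print(f"{before!r} -> {after!r}")
```

On these samples both variants give the same result, but the quadrant-specific pattern is less likely to split unrelated capitalised words.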

joining multiple regular expression for readability

I need to match dates that can be in either of the following formats:
mm/dd/yyyy or dd Mon YYYY
Few examples are shown below
04/20/2009 and 24 Jan 2001
A few sample texts are shown first, followed by the regular expression I wrote to handle them:
txt1 = '''Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox
+ fluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical
Review of Systems Constitutional:'''
txt2 = """s The patient is a 44 year old married Caucasian woman,
unemployed Decorator, living with husband and caring for two young
children, who is referred by Capitol Hill Hospital PCP, Dr. Heather
Zubia, for urgent evaluation/treatment till first visit with Dr. Toney
Winkler IN EIGHT WEEKS on 24 Jan 2001."""
date = re.findall(r'(?:\b(?<!\.)[\d{0,2}]+)'
'(?:[/-]\d{0,}[/-]\d{2,4}) | (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]*'
' (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}', txtData)
I am not getting 24 Jan 2001, whereas if I run the second alternative on its own, (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]* (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}, I do get the output.
Question 1: What is the bug in the above expression?
Question 2: I want to combine both patterns in a more readable way, because I will have to parse other formats too, so I used join as shown below:
RE1 = '(?:\b(?<!\.)[\d{0,2}]+) (?:[/-]\d{0,}[/-]\d{2,4})'
RE2 = '(?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]* (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}'
regex_all = '|'.join([RE1, RE2])
regex_all = re.compile(regex_all)
date = regex_all.findall(txtData)  # note: txtData can be any one of the strings above
In this case I am getting NaN as the output for date.
Please suggest what the mistake is when I join.
Thanks for your help.
Note that it is a very bad idea to join such long patterns that also match at the same location within the string. That would cause the regex engine to backtrack too much, and possibly lead to crashes and slowdown. If there is a way to re-write the alternations so that they could only match at different locations, or even get rid of them completely, do it.
Besides, you should use grouping constructs (...) to group sequences of patterns, and only use [...] character classes when you need to match specific chars.
Also, your alternatives are overlapping, you may combine them easily. See the fixed regex:
\b(?<!\.)\d{1,2}(?:[/-]\d+[/-]|(?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*))\s*(?:\d{4}|\d{2})\b
See the regex demo.
Details
\b - a word boundary
(?<!\.) - no . immediately to the left of the current location
\d{1,2} - 1 or 2 digits
(?: - start of a non-capturing alternation group:
[/-]\d+[/-] - / or -, 1+ digits, - or /
| - or
(?:th|st|[nr]d)?\s*(?:
(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)) - th, st, nd or rd (optionally), followed with 0+ whitespaces, and then month names
\s* - 0+ whitespaces
(?:\d{4}|\d{2}) - 2 or 4 digits
\b - trailing word boundary.
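Since the question asks about readability, the same fixed pattern can also be written with `re.VERBOSE`, which lets the breakdown above live inside the pattern as comments. This is a sketch under that assumption. The layout, the name `DATE_RX`, and the shortened sample text are illustrative and not part of the original answer.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows '#' comments.
DATE_RX = re.compile(r"""
    \b(?<!\.)\d{1,2}             # 1 or 2 digits, not preceded by a dot
    (?:
        [/-]\d+[/-]              # numeric form: delimiter, digits, delimiter
      |
        (?:th|st|[nr]d)?\s*      # optional ordinal suffix, optional spaces
        (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*
    )
    \s*(?:\d{4}|\d{2})\b         # 4-digit or 2-digit year
""", re.VERBOSE)

# Shortened sample built from fragments of the question's texts.
text = "Lithium 0.25 (7/11/77). TSH 3.28. First visit on 24 Jan 2001."
dates = DATE_RX.findall(text)
print(dates)
```

Because the pattern contains no capturing groups, `findall` returns the whole matches directly.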
Another note: if you want to match the date-like strings with two matching delimiters, you will need to capture the first one, and use a backreference to match the second one, see this regex demo. In Python, you would need a re.finditer to get those matches.
See this Python demo:
import re
rx = r"\b(?<!\.)\d{1,2}(?:([/-])\d+\1|(?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*))\s*(?:\d{4}|\d{2})\b"
s = "Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox\nfluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical\nReview of Systems Constitutional:\n\nThe patient is a 44 year old married Caucasian woman, unemployed Decorator, living with husband and caring for two young children, who is referred by Capitol Hill Hospital PCP, Dr. Heather Zubia, for urgent evaluation/treatment till first visit with Dr. Toney Winkler IN EIGHT WEEKS on 24 Jan 2001"
print([x.group(0) for x in re.finditer(rx, s, re.I)])
# => ['7/11/77', '24 Jan 2001']
I think your approach is too complicated. I suggest using a combination of a simple regex and strptime().
import re
from datetime import datetime
date_formats = ['%m/%d/%Y', '%d %b %Y']
pattern = re.compile(r'\b(\d\d?/\d\d?/\d{4}|\d\d? \w{3} \d{4})\b')
data = "... your string ..."
for match in re.findall(pattern, data):
    print("Trying to parse '%s'" % match)
    for fmt in date_formats:
        try:
            date = datetime.strptime(match, fmt)
            print("    OK:", date)
            break
        except ValueError:
            # this format did not fit; try the next one
            pass
The advantage of this approach is, besides a much more manageable regex, that it won't pick dates that look plausible but do not exist, like 2/29/2001 (whereas 2/29/2004 works).
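The validation step can be checked in isolation: `datetime.strptime` raises `ValueError` for a calendar date that does not exist. The helper name `is_real_date` below is hypothetical and only wraps that behaviour.

```python
from datetime import datetime

def is_real_date(text, fmt="%m/%d/%Y"):
    """Hypothetical helper: True if text parses as an existing calendar date."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False

print(is_real_date("2/29/2004"))                # leap year
print(is_real_date("2/29/2001"))                # not a leap year
print(is_real_date("24 Jan 2001", "%d %b %Y"))  # second format from the answer
```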
r'(?:\b(?<!\.)[\d{0,2}]+)'
'(?:[/-]\d{0,}[/-]\d{2,4}) | (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]*'
' (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}'
You should use raw strings (r'foo') for each string literal, not only the first one. Without the r prefix, Python itself interprets escapes such as \b (which becomes a backspace character), so they never reach the re library as written.
[abc|def] matches any character between the [], while (one|two|three) matches any expression (one, two, or three)
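Both points can be observed directly. This is a short sketch with made-up sample strings, showing what the missing r prefix does to `\b` and how a character class differs from a group with alternation.

```python
import re

plain = "\b24 Jan"   # no r prefix: "\b" is a single backspace character
raw = r"\b24 Jan"    # raw string: backslash + "b", a regex word boundary

plain_hits = re.findall(plain, "on 24 Jan 2001")
raw_hits = re.findall(raw, "on 24 Jan 2001")

# A character class matches ONE character from the set, '|' included.
class_hits = re.findall(r"[Jan|Feb]", "Feb|")

# A group with alternation matches whole alternatives.
group_hits = re.findall(r"(?:Jan|Feb)", "Jan, Feb, Jab")

print(len(plain), len(raw))
print(plain_hits, raw_hits)
print(class_hits, group_hits)
```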

Pattern of regular expressions while using Look Behind or Look Ahead Functions to find a match

I am trying to split a sentence correctly based on normal grammatical rules in python.
The sentence I want to split is
s = """Mr. Smith bought cheapsite.com for 1.5 million dollars,
i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a
probability of .9 it isn't."""
The expected output is
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
To achieve this I am using regular expressions. After a lot of searching I came upon the following regex, which does the trick. new_str is just s with some \n removed.
m = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s',new_str)
for i in m:
    print(i)
Mr. Smith bought cheapsite.com for 1.5 million dollars,i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with aprobability of .9 it isn't.
So the way I understand the reg ex is that we are first selecting
1) All the characters like i.e
2) From the filtered spaces from the first selection ,we select those characters
which dont have words like Mr. Mrs. etc
3) From the filtered 2nd step we select only those subjects where we have either dot or question and are preceded by a space.
So I tried to change the order as below
1) Filter out all the titles first.
2) From the filtered step select those that are preceded by space
3) remove all phrases like i.e
but when I do that the blank after is also split
m = re.split(r'(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.)',new_str)
for i in m:
    print(i)
Mr. Smith bought cheapsite.com for 1.5 million dollars,i.e.
he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with aprobability of .9 it isn't.
Shouldn't the last step in the modified procedure be capable of identifying phrases like i.e.? Why does it fail to detect them?
First, the last . in (?<!\w\.\w.) looks suspicious, if you need to match a literal dot with it, escape it ((?<!\w\.\w\.)).
Coming back to the question, when you use r'(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.)' regex, the last negative lookbehind checks if the position after a whitespace is not preceded with a word char, dot, word char, any char (since the . is unescaped). This condition is true, because there are a dot, e, another . and a space before that position.
To make the lookbehind work the same way as it did before \s, put the \s into the lookbehind pattern, too:
(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.\s)
See the regex demo
Another enhancement can be using a character class in the second lookbehind: (?<=\.|\?) -> (?<=[.?]).
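The effect of moving the lookbehind can be reproduced with a short script. This is a sketch on a single-line version of the sample sentence, where the line breaks are replaced by spaces (an assumption, so the pieces differ slightly from the asker's new_str).

```python
import re

# Single-line version of the sample; newlines replaced by spaces (assumption).
text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid "
        "a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any "
        "case, this isn't true... Well, with a probability of .9 it isn't.")

# Original order: every lookbehind is evaluated just before the whitespace.
before_space = r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s"
# Reordered: the last lookbehind now sits after \s, so its four characters
# end with the whitespace itself and no longer line up with "i.e.".
after_space = r"(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.)"
# Fix from the answer: include \s in the lookbehind to realign it.
after_space_fixed = r"(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.\s)"

parts_before = re.split(before_space, text)
parts_after = re.split(after_space, text)
parts_fixed = re.split(after_space_fixed, text)

for sentence in parts_fixed:
    print(sentence)
```

A lookbehind always inspects the characters immediately to the left of its own position, so it is its position in the pattern that matters, not the order in which the conditions are listed.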
