How to remove a specific pattern from re.findall() results

I have a re.findall() call searching for a pattern in Python, but it returns some undesired results and I want to know how to exclude them. The text is below. I want to get the names, but my statement (re.findall(r'([A-Z]{4,} \w. \w*|[A-Z]{4,} \w*)', text)) returns this:
'ERIN E. SCHNEIDER',
'MONIQUE C. WINKLER',
'JASON M. HABERMEYER',
'MARC D. KATZ',
'JESSICA W. CHAN',
'RAHUL KOLHATKAR',
'TSPU or taken',
'TSPU or the',
'TSPU only',
'TSPU was',
'TSPU and']
I want to get rid of the "TSPU" pattern items. Does anyone know how to do it?
JINA L. CHOI (NY Bar No. 2699718)
ERIN E. SCHNEIDER (Cal. Bar No. 216114) schneidere#sec.gov
MONIQUE C. WINKLER (Cal. Bar No. 213031) winklerm#sec.gov
JASON M. HABERMEYER (Cal. Bar No. 226607) habermeyerj#sec.gov
MARC D. KATZ (Cal. Bar No. 189534) katzma#sec.gov
JESSICA W. CHAN (Cal. Bar No. 247669) chanjes#sec.gov
RAHUL KOLHATKAR (Cal. Bar No. 261781) kolhatkarr#sec.gov
The Investor Solicitation Process Generally Included a Face-to-Face Meeting, a Technology Demonstration, and a Binder of Materials [...]

You can use
\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?
See this regex demo. Details:
\b - a word boundary (else, the regex may "catch" a part of a word that contains TSPU)
(?!TSPU\b) - a negative lookahead that fails the match if there is TSPU string followed with a non-word char or end of string immediately to the right of the current location
[A-Z]{4,} - four or more uppercase ASCII letters
(?:(?:\s+\w\.)?\s+\w+)? - an optional occurrence of:
(?:\s+\w\.)? - an optional occurrence of one or more whitespaces, a word char and a literal . char
\s+ - one or more whitespaces
\w+ - one or more word chars.
In Python, you can use
re.findall(r'\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?', text)
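As a quick check, the sketch below runs that pattern on a shortened sample. The sample string is an assumption made up for illustration (two name lines from the question plus an invented sentence containing TSPU), not the original document. Because the pattern uses only non-capturing groups, re.findall returns the whole matches.

```python
import re

# Hypothetical, shortened sample: two name lines from the question plus an
# invented sentence that contains the unwanted "TSPU" token.
text = (
    "JINA L. CHOI (NY Bar No. 2699718)\n"
    "RAHUL KOLHATKAR (Cal. Bar No. 261781)\n"
    "The TSPU was shown to investors, and TSPU only."
)

# The negative lookahead (?!TSPU\b) rejects a match that would start at TSPU.
pattern = r'\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?'
names = re.findall(pattern, text)
print(names)  # ['JINA L. CHOI', 'RAHUL KOLHATKAR']
```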

You can do some simple filtering. If your list is called results:
removed_TSPU_results = list(filter(lambda result: not result.startswith("TSPU"), results))
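The same post-filtering idea can also be written as a list comprehension, which some readers find easier to scan than filter with a lambda. This is only a sketch; the results list is a shortened copy of the output shown in the question, and names_only is a made-up variable name.

```python
# Shortened copy of the findall output shown in the question.
results = ['MARC D. KATZ', 'RAHUL KOLHATKAR', 'TSPU or taken', 'TSPU only']

# Keep only the items that do not start with the unwanted prefix.
names_only = [r for r in results if not r.startswith("TSPU")]
print(names_only)  # ['MARC D. KATZ', 'RAHUL KOLHATKAR']
```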

Related

joining multiple regular expression for readability

My requirement is that a date can be in either of the following formats:
mm/dd/yyyy or dd Mon YYYY
A few examples are shown below:
04/20/2009 and 24 Jan 2001
To handle this I have written the regular expression below.
A few text scenarios are mentioned below:
txt1 = 'Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox
+ fluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical
Review of Systems Constitutional:'
txt2 = "s The patient is a 44 year old married Caucasian woman,
unemployed Decorator, living with husband and caring for two young
children, who is referred by Capitol Hill Hospital PCP, Dr. Heather
Zubia, for urgent evaluation/treatment till first visit with Dr. Toney
Winkler IN EIGHT WEEKS on 24 Jan 2001."
date = re.findall(r'(?:\b(?<!\.)[\d{0,2}]+)'
'(?:[/-]\d{0,}[/-]\d{2,4}) | (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]*'
' (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}', txtData)
I am not getting 24 Jan 2001, whereas if I run (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]* (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4} on its own, I am able to get the output.
Question 1: What is the bug in the above expression?
Question 2: I want to combine both to make them more readable, as I have to parse other formats too, so I used join as shown below
RE1 = '(?:\b(?<!\.)[\d{0,2}]+) (?:[/-]\d{0,}[/-]\d{2,4})'
RE2 = '(?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]* (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}'
regex_all = '|'.join([RE1, RE2])
regex_all = re.compile(regex_all)
date = regex_all.findall(txtData)  # notice here txtData can be any one of the above strings
I am getting NaN as the output for date in the above case.
Please suggest what the mistake is when I join.
Thanks for your help.

Note that it is a very bad idea to join such long patterns that also match at the same location within the string. That would cause the regex engine to backtrack too much, and possibly lead to crashes and slowdown. If there is a way to re-write the alternations so that they could only match at different locations, or even get rid of them completely, do it.
Besides, you should use grouping constructs (...) to group sequences of patterns, and only use [...] character classes when you need to match specific chars.
Also, your alternatives are overlapping, you may combine them easily. See the fixed regex:
\b(?<!\.)\d{1,2}(?:[/-]\d+[/-]|(?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*))\s*(?:\d{4}|\d{2})\b
See the regex demo.
Details
\b - a word boundary
(?<!\.) - no . immediately to the left of the current location
\d{1,2} - 1 or 2 digits
(?: - start of a non-capturing alternation group:
[/-]\d+[/-] - / or -, 1+ digits, - or /
| - or
(?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)) - th, st, nd or rd (optionally), followed with 0+ whitespaces, and then month names
\s* - 0+ whitespaces
(?:\d{4}|\d{2}) - 2 or 4 digits
\b - trailing word boundary.
Another note: if you want to match the date-like strings with two matching delimiters, you will need to capture the first one, and use a backreference to match the second one, see this regex demo. In Python, you would then need re.finditer to get those matches, because re.findall returns only the captured group when the pattern contains one.
See this Python demo:
import re
rx = r"\b(?<!\.)\d{1,2}(?:([/-])\d+\1|(?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*))\s*(?:\d{4}|\d{2})\b"
s = "Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox\nfluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical\nReview of Systems Constitutional:\n\nThe patient is a 44 year old married Caucasian woman, unemployed Decorator, living with husband and caring for two young children, who is referred by Capitol Hill Hospital PCP, Dr. Heather Zubia, for urgent evaluation/treatment till first visit with Dr. Toney Winkler IN EIGHT WEEKS on 24 Jan 2001"
print([x.group(0) for x in re.finditer(rx, s, re.I)])
# => ['7/11/77', '24 Jan 2001']

I think your approach is too complicated. I suggest using a combination of a simple regex and strptime().
import re
from datetime import datetime
date_formats = ['%m/%d/%Y', '%d %b %Y']
pattern = re.compile(r'\b(\d\d?/\d\d?/\d{4}|\d\d? \w{3} \d{4})\b')
data = "... your string ..."
for match in re.findall(pattern, data):
    print("Trying to parse '%s'" % match)
    for fmt in date_formats:
        try:
            date = datetime.strptime(match, fmt)
            print("    OK:", date)
            break
        except ValueError:
            # This format did not fit; try the next one
            pass
The advantage of this approach is, besides a much more manageable regex, that it won't pick dates that look plausible but do not exist, like 2/29/2001 (whereas 2/29/2004 works).
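To illustrate the validation point, here is a minimal sketch that wraps the try/except loop in a helper. The name parse_date and its signature are hypothetical, chosen only for this example, and the %b month abbreviations assume an English (or C) locale.

```python
from datetime import datetime

def parse_date(candidate, formats=('%m/%d/%Y', '%d %b %Y')):
    """Return a datetime for the first format that fits, else None.

    parse_date is a hypothetical helper name used only for this sketch.
    """
    for fmt in formats:
        try:
            return datetime.strptime(candidate, fmt)
        except ValueError:
            continue  # wrong format or impossible date; try the next format
    return None

print(parse_date('2/29/2004'))    # 2004-02-29 00:00:00 (leap year)
print(parse_date('2/29/2001'))    # None (2001 had no 29 February)
print(parse_date('24 Jan 2001'))  # 2001-01-24 00:00:00
```

The regex only has to find candidates; strptime decides whether each one is a real calendar date.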

r'(?:\b(?<!\.)[\d{0,2}]+)'
'(?:[/-]\d{0,}[/-]\d{2,4}) | (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]*'
' (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}'
You should use raw strings (r'foo') for each string literal, not only the first one. This way backslashes (\) are kept as literal characters and passed through to the re library (in a non-raw string, \b becomes a backspace character).
[abc|def] matches any single character listed between the brackets, while (one|two|three) matches any one of the whole expressions (one, two, or three).
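Both points can be checked in a few lines. The sample strings below are made up for illustration, and the variable names are arbitrary.

```python
import re

# '\b' in a normal string literal is a backspace character (\x08),
# while r'\b' is the two characters the re module reads as a word boundary.
non_raw = re.findall('\bcat\b', 'a cat')
raw = re.findall(r'\bcat\b', 'a cat')
print(non_raw)  # []
print(raw)      # ['cat']

# A character class matches ONE character out of the listed set ...
char_class = re.findall(r'[th|st|nd]', '1st 2nd')
# ... while a group with | matches one of the whole alternatives.
group = re.findall(r'(?:th|st|nd)', '1st 2nd')
print(char_class)  # ['s', 't', 'n', 'd']
print(group)       # ['st', 'nd']
```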

Regex to match strings in quotes that contain only 3 or less capitalized words

I've searched and searched, but can't find any relief for my regex woes.
I wrote the following dummy sentence:
Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya and Genaddy Triple-G Golovkin for the WBO belt GGG. Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, New Jersey. Conor MacGregor will be there along with Adonis Superman Stevenson and Mr. Sugar Ray Robinson. "Here Goes a String". 'Money Mayweather'. "this is not a-string", "this is not A string", "This IS a" "Three Word String".
I'm looking for a regular expression that will return the following when used in Python 3.6:
Canelo, Money, Money Mayweather, Three Word String
The regex that has gotten me the closest is:
(["'])[A-Z](\\?.)*?\1
I want it to only match strings of 3 capitalized words or fewer, immediately surrounded by single or double quotes. Unfortunately, so far it seems to match any string in quotes, no matter what the length, no matter what the content, as long as it begins with a capital letter.
I've put a lot of time into trying to hack through it myself, but I've hit a wall. Can anyone with stronger regex kung-fu give me an idea of where I'm going wrong here?

Try to use this one: (["'])((?:[A-Z][a-z]+ ?){1,3})\1
(["']) - opening quote
((?:[A-Z][a-z]+ ?){1,3}) - capitalized word repeated 1 to 3 times, separated by spaces (captured as group 2)
[A-Z] - capital char (word beginning char)
[a-z]+ - non-capital chars (rest of the word)
_? - optional space separator between capitalized words (_ stands for a space); optional so that a single word or the last word needs no trailing space
{1,3} - 1 to 3 times
\1 - closing quote, same as opening
Group 2 is what you want.
Match 1
Full match 29-37 `"Canelo"`
Group 1. 29-30 `"`
Group 2. 30-36 `Canelo`
Match 2
Full match 146-153 `'Money'`
Group 1. 146-147 `'`
Group 2. 147-152 `Money`
Match 3
Full match 318-336 `'Money Mayweather'`
Group 1. 318-319 `'`
Group 2. 319-335 `Money Mayweather`
Match 4
Full match 398-417 `"Three Word String"`
Group 1. 398-399 `"`
Group 2. 399-416 `Three Word String`
RegEx101 Demo: https://regex101.com/r/VMuVae/4
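In Python, the same pattern could be applied as sketched below. The text is a shortened excerpt of the question's sentence, and the variable names are arbitrary. re.findall would return (quote, words) tuples here because the pattern has two capturing groups, so the sketch reads group 2 from each match instead.

```python
import re

# Excerpt of the question's sample sentence (shortened for this sketch).
text = (
    'Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya. '
    "Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City. "
    '"Here Goes a String". '
    "'Money Mayweather'. "
    '"this is not A string", "This IS a" "Three Word String".'
)

pattern = re.compile(r'''(["'])((?:[A-Z][a-z]+ ?){1,3})\1''')

# Group 1 is the quote character, group 2 the capitalized words.
quoted = [m.group(2) for m in pattern.finditer(text)]
print(quoted)  # ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
```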

Working with the text you've provided, I would try to use regular expression lookaround to get the words surrounded by quotes and then apply some conditions on those matches to determine which ones meet your criterion. The following is what I would do:
[p for p in re.findall('(?<=[\'"])[\w ]{2,}(?=[\'"])', txt) if all(x.istitle() for x in p.split(' ')) and len(p.split(' ')) <= 3]
txt is the text you've provided here. The output is the following:
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Cleaner:
matches = []
for m in re.findall('(?<=[\'"])[\w ]{2,}(?=[\'"])', txt):
    if all(x.istitle() for x in m.split(' ')) and len(m.split(' ')) <= 3:
        matches.append(m)
print(matches)
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']

Here's my go at it: ([\"'])(([A-Z][^ ]*? ?){1,3})\1

To find words (one or more than one consecutive) having first uppercase in capital?

I need to write a regular expression in Python that finds words whose first letter is uppercase; these can be single words or runs of consecutive words.
For example, for the sentence
Dallas Buyer Club is a great American biographical drama film,co-written by Craig Borten and Melisa Wallack, and Directed by Jean-Marc Vallee.
the expected output should be
'Dallas Buyer Club', 'American', 'Craig Borten', 'Melisa Wallack', 'Directed', 'Jean-Marc Vallee'
I have written a regular expression for this,
([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)
but the output of this is
'Dallas Buyer Club', 'Craig Borten', 'Melisa Wallack', 'Jean-Marc Vallee'
It only prints consecutive capitalized words, not single words like
'American', 'Directed'
Also, the regular expression
[A-Z][a-z]+
prints all the words, but individually:
'Dallas', 'Buyer', 'Club' and so on.
Please help me on this.

I think you mixed up the brackets (and made the regex a bit too complicated). Simply use:
[A-Z][a-z]+(?:\s[A-Z][a-z]+)*
So here we have a matching part [A-Z][a-z]+ and in order to match more groups, we simply use (...)* to repeat ... zero or more times. In the ... we include the separator(s) (here \s) and the group again ([A-Z][a-z]+).
This will however not include the hyphen between 'Jean' and 'Marc'. In order to include it as well, we can expand the \s:
[A-Z][a-z]+(?:[\s-][A-Z][a-z]+)*
Depending on other characters (or sequences of characters) that are allowed, you may have to alter the [\s-] part further.
This then generates:
>>> rgx = re.compile(r'[A-Z][a-z]+(?:[\s-][A-Z][a-z]+)*')
>>> txt = r'Dallas Buyer Club is a great American biographical drama film,co-written by Craig Borten and Melisa Wallack, and Directed by Jean-Marc Vallee.'
>>> rgx.findall(txt)
['Dallas Buyer Club', 'American', 'Craig Borten', 'Melisa Wallack', 'Directed', 'Jean-Marc Vallee']
EDIT: in case the remaining characters can be uppercase as well, you can use:
[A-Z][A-Za-z]+(?:[\s-][A-Z][A-Za-z]+)*
Note that this will match words with two or more characters. In case single-letter words should be matched as well, as in 'J R R Tolkien', you can write:
[A-Z][A-Za-z]*(?:[\s-][A-Z][A-Za-z]*)*
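A small comparison of the three variants, as a sketch: the sample sentence and the names strict, mixed and initials are made up for this illustration.

```python
import re

# The three variants from the answer above.
strict = re.compile(r'[A-Z][a-z]+(?:[\s-][A-Z][a-z]+)*')          # Xxx words only
mixed = re.compile(r'[A-Z][A-Za-z]+(?:[\s-][A-Z][A-Za-z]+)*')     # 2+ letters per word
initials = re.compile(r'[A-Z][A-Za-z]*(?:[\s-][A-Z][A-Za-z]*)*')  # single letters allowed

sample = 'J R R Tolkien wrote The Hobbit'  # made-up test sentence
print(strict.findall(sample))    # ['Tolkien', 'The Hobbit']
print(mixed.findall(sample))     # ['Tolkien', 'The Hobbit']
print(initials.findall(sample))  # ['J R R Tolkien', 'The Hobbit']
```

Only the last variant keeps the single-letter initials attached to the surname.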

Pattern of regular expressions while using Look Behind or Look Ahead Functions to find a match

I am trying to split a sentence correctly based on normal grammatical rules in Python.
The sentence I want to split is
s = """Mr. Smith bought cheapsite.com for 1.5 million dollars,
i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a
probability of .9 it isn't."""
The expected output is
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
To achieve this I am using a regular expression. After a lot of searching I came upon the following regex, which does the trick. The new_str variable is just s with some \n removed.
m = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s',new_str)
for i in m:
    print (i)
Mr. Smith bought cheapsite.com for 1.5 million dollars,i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with aprobability of .9 it isn't.
So the way I understand the reg ex is that we are first selecting
1) All the characters like i.e
2) From the filtered spaces from the first selection ,we select those characters
which dont have words like Mr. Mrs. etc
3) From the filtered 2nd step we select only those subjects where we have either dot or question and are preceded by a space.
So I tried to change the order as below
1) Filter out all the titles first.
2) From the filtered step select those that are preceded by space
3) remove all phrases like i.e
But when I do that, the blank after i.e. is also split:
m = re.split(r'(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.)',new_str)
for i in m:
    print (i)
Mr. Smith bought cheapsite.com for 1.5 million dollars,i.e.
he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with aprobability of .9 it isn't.
Shouldn't the last step in the modified procedure be capable of identifying phrases like i.e.? Why is it failing to detect it?

First, the last . in (?<!\w\.\w.) looks suspicious, if you need to match a literal dot with it, escape it ((?<!\w\.\w\.)).
Coming back to the question, when you use the r'(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.)' regex, the last negative lookbehind checks that the position after the whitespace is not preceded with a word char, dot, word char, any char (since the . is unescaped). That check passes, so the split happens: the four characters before that position are a dot, e, another dot and a space, and the leading dot is not a word char.
To make the lookbehind work the same way as it did when it was before \s, put the \s into the lookbehind pattern, too:
(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.\s)
See the regex demo
Another enhancement can be using a character class in the second lookbehind: (?<=\.|\?) -> (?<=[.?]).
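The difference between the two orderings can be seen with a short sketch. It assumes new_str is the question's text with the line breaks replaced by spaces (the question removed them without adding a space), and it already uses the [.?] character class suggested above; broken and fixed are arbitrary names.

```python
import re

# The question's sample text with the line breaks replaced by spaces.
new_str = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, "
           "i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks "
           "he didn't. In any case, this isn't true... Well, with a "
           "probability of .9 it isn't.")

# Reordered pattern from the question: the last lookbehind sees ".e. "
# and therefore does not block the split after "i.e.".
broken = r'(?<![A-Z][a-z]\.)(?<=[.?])\s(?<!\w\.\w.)'
# Same pattern with \s added inside the last lookbehind.
fixed = r'(?<![A-Z][a-z]\.)(?<=[.?])\s(?<!\w\.\w.\s)'

print(len(re.split(broken, new_str)))  # 6 (extra split after "i.e.")
sentences = re.split(fixed, new_str)
for sentence in sentences:
    print(sentence)
```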

python reg-ex pattern not matching

I have a regex matching problem with the following pattern and string. The pattern is basically a name, followed by any number of characters, followed by one of the phrases (see pattern below), followed by any number of characters, followed by the institution name.
pattern = "[David Maxwell|David|Maxwell] .* [educated at|graduated from|attended|studied at|graduate of] .* Eton College"
str = "David Maxwell was educated at Eton College, where he was a King's Scholar and Captain of Boats, and at Cambridge University where he rowed in the winning Cambridge boat in the 1971 and 1972 Boat Races."
match = re.search(pattern, str)
But the search method returns a no match for the above str? Is my reg-ex proper? I'm new to reg-ex. Any help is appreciated

[...] means "any character from this set of characters". If you want "any word in this group of words" you need to use parentheses: (...|...).
There's another problem in your expression, where you have " .* " (space, dot, star, space), which means "a space, followed by zero or more characters, followed by a space". In other words, the shortest possible match is two spaces. However, your text only has one space between "educated at" and "Eton College".
>>> pattern = '(David Maxwell|David|Maxwell).*(educated at|graduated from|attended|studied at|graduate of).*Eton College'
>>> str = "David Maxwell was educated at Eton College, where he was a King's Scholar and Captain of Boats, and at Cambridge University where he rowed in the winning Cambridge boat in the 1971 and 1972 Boat Races."
>>> re.search(pattern, str)
<_sre.SRE_Match object at 0x1006d10b8>
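For comparison, the sketch below runs both versions of the pattern on a shortened copy of the sentence. The variable is named text here rather than str so that the built-in str is not shadowed; bad and good are arbitrary names.

```python
import re

text = ("David Maxwell was educated at Eton College, where he was a "
        "King's Scholar and Captain of Boats.")  # shortened sample

# Character classes: each [...] matches a single character, and every
# " .* " needs two spaces, so this finds nothing.
bad = (r"[David Maxwell|David|Maxwell] .* "
       r"[educated at|graduated from|attended|studied at|graduate of] .* "
       r"Eton College")
# Groups with alternation, and no mandatory spaces around .*
good = (r"(David Maxwell|David|Maxwell).*"
        r"(educated at|graduated from|attended|studied at|graduate of).*"
        r"Eton College")

print(re.search(bad, text))  # None
match = re.search(good, text)
print(match.group(1), '|', match.group(2))  # David Maxwell | educated at
```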
