python reg-ex pattern not matching

python reg-ex pattern not matching - python

I have a reg-ex matching problem with the following pattern and the string. Pattern is basically a name followed by any number of characters followed by one of the phrases(see pattern below) follwed by any number of characters followed by institution name.
pattern = "[David Maxwell|David|Maxwell] .* [educated at|graduated from|attended|studied at|graduate of] .* Eton College"
str = "David Maxwell was educated at Eton College, where he was a King's Scholar and Captain of Boats, and at Cambridge University where he rowed in the winning Cambridge boat in the 1971 and 1972 Boat Races."
match = re.search(pattern, str)
But the search method returns a no match for the above str? Is my reg-ex proper? I'm new to reg-ex. Any help is appreciated

[...] means "any character from this set of characters". If you want "any word in this group of words" you need to use parenthesis: (...|...).
There's another problem in your expression, where you have .* (space, dot, star, space), which means "a space, followed by zero or more characters, followed by a space". In other words, the shortest possible match is two spaces. However, your text only has one space between "educated at" and "Eton College".
>>> pattern = '(David Maxwell|David|Maxwell).*(educated at|graduated from|attended|studied at|graduate of).*Eton College'
>>> str = "David Maxwell was educated at Eton College, where he was a King's Scholar and Captain of Boats, and at Cambridge University where he rowed in the winning Cambridge boat in the 1971 and 1972 Boat Races."
>>> re.search(pattern, str)
<_sre.SRE_Match object at 0x1006d10b8>

Related

How to extract all comma delimited numbers inside () bracked and ignore any text

I am trying to extract the comma delimited numbers inside () brackets from a string. I can get the numbers if that are alone in a line. But i cant seem to find a solution to get the numbers when other surrounding text is involved. Any help will be appreciated. Below is the code that I current use in python.
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
line = each.strip()
regex_criteria = r'"^([1-9][0-9]*|\([1-9][0-9]*\}|\(([1-9][0-9]*,?)+[1-9][0-9]*\))$"gm'
if (line.__contains__('(') and line.__contains__(')') and not re.search('[a-zA-Z]', refline)):
refline = line[line.find('(')+1:line.find(')')]
if not re.search('[a-zA-Z]', refline):

Remove the ^, $ is whats preventing you from getting all the numbers. And gm flags wont work in python re.
You can change your regex to :([1-9][0-9]*|\([1-9][0-9]*\}|\(?:([1-9][0-9]*,?)+[1-9][0-9]*\)) if you want to get each number separately.
Or you can simplify your pattern to (?<=[(,])[1-9][0-9]+(?=[,)])
Test regex here: https://regex101.com/r/RlGwve/1
Python code:
import re
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
print(re.findall(r'(?<=[(,])[1-9][0-9]+(?=[,)])', line))
# ['101065', '101066', '101067', '101065']
(?<=[(,])[1-9][0-9]+(?=[,)])
The above pattern tells to match numbers which begin with 1-9 followed by one or more digits, only if the numbers begin with or end with either comma or brackets.

Here's another option:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)*(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?<=\(): Lookbehind for (
[1-9]+\d*: At least one number (would \d+ work too?)
(?:,[1-9]\d*)*: Zero or multiple numbers after a ,
(?=\)): Lookahead for )
Result for your line:
[['101065', '101066', '101067'], ['101065']]
If you only want the comma separated numbers:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)+(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?:,[1-9]\d*)+: One or more numbers after a ,
Result:
[['101065', '101066', '101067']]
Now, if your line could also look like
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines ( 101065,101066, 101067 )
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
then you have to sprinkle the pattern with \s* and remove the whitespace afterwards (here with str.translate and str.maketrans):
pattern = re.compile(r"(?<=\()\s*[1-9]+\d*(?:\s*,\s*[1-9]\d*\s*)*(?=\))")
table = str.maketrans("", "", " ")
results = [match[0].translate(table).split(",") for match in pattern.finditer(line)]
Result:
[['101065', '101066', '101067'], ['101065']]

Using the pypi regex module you could also use capture groups:
\((?P<num>\d+)(?:,(?P<num>\d+))*\)
The pattern matches:
\( Match (
(?P<num>\d+) Capture group, match 1+ digits
(?:,(?P<num>\d+))* Optionally repeat matching , and 1+ digits in a capture group
\) Match )
Regex demo | Python demo
Example code
import regex
pattern = r"\((?P<num>\d+)(?:,(?P<num>\d+))*\)"
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
matches = regex.finditer(pattern, line)
for _, m in enumerate(matches, start=1):
print(m.capturesdict())
Output
{'num': ['101065', '101066', '101067']}
{'num': ['101065']}

How to remove a specific pattern from re.findall() results

I have a re.findall() searching for a pattern in python, but it returns some undesired results and I want to know how to exclude them. The text is below, I want to get the names, and my statement (re.findall(r'([A-Z]{4,} \w. \w*|[A-Z]{4,} \w*)', text)) is returning this:
'ERIN E. SCHNEIDER',
'MONIQUE C. WINKLER',
'JASON M. HABERMEYER',
'MARC D. KATZ',
'JESSICA W. CHAN',
'RAHUL KOLHATKAR',
'TSPU or taken',
'TSPU or the',
'TSPU only',
'TSPU was',
'TSPU and']
I want to get rid of the "TSPU" pattern items. Does anyone know how to do it?
JINA L. CHOI (NY Bar No. 2699718)
ERIN E. SCHNEIDER (Cal. Bar No. 216114) schneidere#sec.gov
MONIQUE C. WINKLER (Cal. Bar No. 213031) winklerm#sec.gov
JASON M. HABERMEYER (Cal. Bar No. 226607) habermeyerj#sec.gov
MARC D. KATZ (Cal. Bar No. 189534) katzma#sec.gov
JESSICA W. CHAN (Cal. Bar No. 247669) chanjes#sec.gov
RAHUL KOLHATKAR (Cal. Bar No. 261781) kolhatkarr#sec.gov
The Investor Solicitation Process Generally Included a Face-to-Face Meeting, a Technology Demonstration, and a Binder of Materials [...]

You can use
\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?
See this regex demo. Details:
\b - a word boundary (else, the regex may "catch" a part of a word that contains TSPU)
(?!TSPU\b) - a negative lookahead that fails the match if there is TSPU string followed with a non-word char or end of string immediately to the right of the current location
[A-Z]{4,} - four or more uppercase ASCII letters
(?:(?:\s+\w\.)?\s+\w+)? - an optional occurrence of:
(?:\s+\w\.)? - an optional occurrence of one or more whitespaces, a word char and a literal . char
\s+ - one or more whitespaces
\w+ - one or more word chars.
In Python, you can use
re.findall(r'\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?', text)

You can do some simple .filter-ing, if your array was results,
removed_TSPU_results = list(filter(lambda: not result.startswith("TSPU"), results))

Insert space after the second or third capital letter python

I have a pandas dataframe containing addresses. Some are formatted correctly like 481 Rogers Rd York ON. Others have a space missing between the city quandrant and the city name, for example: 101 9 Ave SWCalgary AB or even possibly: 101 9 Ave SCalgary AB, where SW refers to south west and S to south.
I'm trying to find a regex that will add a space between second and third capital letters if they are followed by lowercase letters, or if there are only 2 capitals followed by lower case, add a space between the first and second.
So far, I've found that ([A-Z]{2,3}[a-z]) will match the situation correctly, but I can't figure out how to look back into it and sub at position 2 or 3. Ideally, I'd like to use an index to split the match at [-2:] but I can't figure out how to do this.
I found that re.findall('(?<=[A-Z][A-Z])[A-Z][a-z].+', '101 9 Ave SWCalgary AB')
will return the last part of the string and I could use a look forward regex to find the start and then join them but this seems very inefficient.
Thanks

You may use
df['Test'] = df['Test'].str.replace(r'\b([A-Z]{1,2})([A-Z][a-z])', r'\1 \2')
See this regex demo
Details
\b - a word boundary
([A-Z]{1,2}) - Capturing group 1 (later referred with \1 from the replacement pattern): one or two uppercase letters
([A-Z][a-z]) - Capturing group 2 (later referred with \2 from the replacement pattern): an uppercase letter + a lowercase one.
If you want to specifically match city quadrants, you may use a bit more specific regex:
df['Test'] = df['Test'].str.replace(r'\b([NS][EW]|[NESW])([A-Z][a-z])', r'\1 \2')
See this regex demo. Here, [NS][EW]|[NESW] matches N or S that are followed with E or W, or a single N, E, S or W.
Pandas demo:
import pandas as pd
df = pd.DataFrame({'Test':['481 Rogers Rd York ON',
'101 9 Ave SWCalgary AB',
'101 9 Ave SCalgary AB']})
>>> df['Test'].str.replace(r'\b([A-Z]{1,2})([A-Z][a-z])', r'\1 \2')
0 481 Rogers Rd York ON
1 101 9 Ave SW Calgary AB
2 101 9 Ave S Calgary AB
Name: Test, dtype: object

You can use
([A-Z]{1,2})(?=[A-Z][a-z])
to capture the first (or first and second) capital letters, and then use lookahead for a capital letter followed by a lowercase letter. Then, replace with the first group and a space:
re.sub(r'([A-Z]{1,2})(?=[A-Z][a-z])', r'\1 ', str)
https://regex101.com/r/TcB4Ph/1

Regex to match strings in quotes that contain only 3 or less capitalized words

I've searched and searched, but can't find an any relief for my regex woes.
I wrote the following dummy sentence:
Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya and Genaddy Triple-G Golovkin for the WBO belt GGG. Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, New Jersey. Conor MacGregor will be there along with Adonis Superman Stevenson and Mr. Sugar Ray Robinson. "Here Goes a String". 'Money Mayweather'. "this is not a-string", "this is not A string", "This IS a" "Three Word String".
I'm looking for a regular expression that will return the following when used in Python 3.6:
Canelo, Money, Money Mayweather, Three Word String
The regex that has gotten me the closest is:
(["'])[A-Z](\\?.)*?\1
I want it to only match strings of 3 capitalized words or less immediately surrounded by single or double quotes. Unfortunately, so far it seem to match any string in quotes, no matter what the length, no matter what the content, as long is it begins with a capital letter.
I've put a lot of time into trying to hack through it myself, but I've hit a wall. Can anyone with stronger regex kung-fu give me an idea of where I'm going wrong here?

Try to use this one: (["'])((?:[A-Z][a-z]+ ?){1,3})\1
(["']) - opening quote
([A-Z][a-z]+ ?){1,3} - Capitalized word repeating 1 to 3 times separated by space
[A-Z] - capital char (word begining char)
[a-z]+ - non-capital chars (end of word)
_? - space separator of capitalized words (_ is a space), ? for single word w/o ending space
{1,3} - 1 to 3 times
\1 - closing quote, same as opening
Group 2 is what you want.
Match 1
Full match 29-37 `"Canelo"`
Group 1. 29-30 `"`
Group 2. 30-36 `Canelo`
Match 2
Full match 146-153 `'Money'`
Group 1. 146-147 `'`
Group 2. 147-152 `Money`
Match 3
Full match 318-336 `'Money Mayweather'`
Group 1. 318-319 `'`
Group 2. 319-335 `Money Mayweather`
Match 4
Full match 398-417 `"Three Word String"`
Group 1. 398-399 `"`
Group 2. 399-416 `Three Word String`
RegEx101 Demo: https://regex101.com/r/VMuVae/4

Working with the text you've provided, I would try to use regular expression lookaround to get the words surrounded by quotes and then apply some conditions on those matches to determine which ones meet your criterion. The following is what I would do:
[p for p in re.findall('(?<=[\'"])[\w ]{2,}(?=[\'"])', txt) if all(x.istitle() for x in p.split(' ')) and len(p.split(' ')) <= 3]
txt is the text you've provided here. The output is the following:
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Cleaner:
matches = []
for m in re.findall('(?<=[\'"])[\w ]{2,}(?=[\'"])', txt):
if all(x.istitle() for x in m.split(' ')) and len(m.split(' ')) <= 3:
matches.append(m)
print(matches)
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']

Here's my go at it: ([\"'])(([A-Z][^ ]*? ?){1,3})\1

regex for removing entity names

Given tweets like the following:
Brick Brewing Co Limited (BRB) Downgraded by Cormark to Market Perform
Brinker International Inc (EAT) Upgraded by Zacks Investment Research to Hold
How do I write a regex that removes both "by Cormark" and "by Zacks Investment Research"
I tried this:
"by ([A-Za-z ]+\w to)"
using python but it requires the word "to". I would like the regex to stop before capturing the word "to".
It would also be interesting if someone could show me how to write a regex that captures camel-case examples, like "Zacks Investment Research".

You can use a positive look-ahead in order to exclude the word to:
>>> s1 = "Brick Brewing Co Limited (BRB) Downgraded by Cormark to Market Perform"
>>>
>>> s2 = "Brinker International Inc (EAT) Upgraded by Zacks Investment Research to Hold"
>>>
>>> import re
>>> re.sub(r'by[\w\s]+(?=to)','',s1)
'Brick Brewing Co Limited (BRB) Downgraded to Market Perform'
>>> re.sub(r'by[\w\s]+(?=to)','',s2)
'Brinker International Inc (EAT) Upgraded to Hold'
>>>
Note that the regex [\w\s]+ will match any combination of word characters and white spaces. If you just want to match the alphabetical characters and white space you can use [a-z\s] with re.I flag (Ignore case).

To remove all capitalized words after by, you can use
by [A-Z][a-z]*(?: +[A-Z][a-z]*)*
See regex demo
Explanation:
by - literal sequence of 3 characters b, y and a space
[A-Z][a-z]* - a capitalized word (one uppercase followed by zero or more lowercase letters)
(?: +[A-Z][a-z]*)* - zero or more sequences of...
+[A-Z][a-z]* - 1 or more spaces followed by an uppercase letter followed by zero or more lowercase letters.
A regular space may be replaced with \s in the pattern to match any whitespace. Also, to match CaMeL words, you can replace all [a-z] with [a-zA-Z].

You could also do it with str method index then just slice and add up:
>>> def remove_name(s):
b = s.index(' by ')
t = s.index(' to ')
s = s[:b]+s[t:]
return s
>>>
>>> s = 'Brick Brewing Co Limited (BRB) Downgraded by Cormark to Market Perform'
>>> remove_name(s)
'Brick Brewing Co Limited (BRB) Downgraded to Market Perform'
>>>
>>> s = "Brinker International Inc (EAT) Upgraded by Zacks Investment Research to Hold"
>>> remove_name(s)
'Brinker International Inc (EAT) Upgraded to Hold'

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

python reg-ex pattern not matching - python

Related

How to extract all comma delimited numbers inside () bracked and ignore any text

How to remove a specific pattern from re.findall() results

Insert space after the second or third capital letter python

Regex to match strings in quotes that contain only 3 or less capitalized words

regex for removing entity names

Categories

Resources