regex for removing entity names - python

Given tweets like the following:
Brick Brewing Co Limited (BRB) Downgraded by Cormark to Market Perform
Brinker International Inc (EAT) Upgraded by Zacks Investment Research to Hold
How do I write a regex that removes both "by Cormark" and "by Zacks Investment Research"?
I tried this:
"by ([A-Za-z ]+\w to)"
using python but it requires the word "to". I would like the regex to stop before capturing the word "to".
It would also be interesting if someone could show me how to write a regex that captures camel-case examples, like "Zacks Investment Research".

You can use a positive look-ahead in order to exclude the word to:
>>> s1 = "Brick Brewing Co Limited (BRB) Downgraded by Cormark to Market Perform"
>>>
>>> s2 = "Brinker International Inc (EAT) Upgraded by Zacks Investment Research to Hold"
>>>
>>> import re
>>> re.sub(r'by[\w\s]+(?=to)','',s1)
'Brick Brewing Co Limited (BRB) Downgraded to Market Perform'
>>> re.sub(r'by[\w\s]+(?=to)','',s2)
'Brinker International Inc (EAT) Upgraded to Hold'
>>>
Note that the regex [\w\s]+ will match any combination of word characters and white spaces. If you just want to match the alphabetical characters and white space you can use [a-z\s] with re.I flag (Ignore case).
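One caveat with the greedy by[\w\s]+(?=to) form is that it backtracks to the last "to" it can find, even one inside a later word. A minimal sketch of a stricter variant follows, assuming the rating is always introduced by the standalone word "to". The third tweet and the helper name strip_analyst are made up for illustration.

```python
import re

# Lazy match after "by", stopping at the first whitespace that precedes
# the standalone word "to" (the \b keeps "to" from matching inside "tor").
ANALYST = re.compile(r'\bby\s[a-z\s]+?\s(?=to\b)', re.I)

def strip_analyst(tweet):
    """Remove the 'by <analyst> ' part of a rating tweet (hypothetical helper)."""
    return ANALYST.sub('', tweet)

tweets = [
    "Brick Brewing Co Limited (BRB) Downgraded by Cormark to Market Perform",
    "Brinker International Inc (EAT) Upgraded by Zacks Investment Research to Hold",
    "Acme Corp (ACM) Downgraded by Cormark to Sector Perform",  # made-up sample
]
cleaned = [strip_analyst(t) for t in tweets]
for line in cleaned:
    print(line)
```

On the made-up third tweet the greedy pattern stops before the "to" inside "Sector", while the lazy variant stops at the first standalone "to".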

To remove all capitalized words after by, you can use
by [A-Z][a-z]*(?: +[A-Z][a-z]*)*
Explanation:
by - literal sequence of 3 characters b, y and a space
[A-Z][a-z]* - a capitalized word (one uppercase followed by zero or more lowercase letters)
(?: +[A-Z][a-z]*)* - zero or more sequences of...
+[A-Z][a-z]* - 1 or more spaces followed by an uppercase letter followed by zero or more lowercase letters.
A regular space may be replaced with \s in the pattern to match any whitespace. Also, to match CaMeL words, you can replace all [a-z] with [a-zA-Z].
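As a rough illustration of how this pattern could be used from Python, the sketch below both removes the name and extracts it. The leading space in the substitution pattern and the variable names are assumptions added here so that no double space is left behind.

```python
import re

# One capitalized word followed by zero or more further capitalized words.
CAP_WORDS = r'[A-Z][a-z]*(?: +[A-Z][a-z]*)*'

s = "Brinker International Inc (EAT) Upgraded by Zacks Investment Research to Hold"

# Remove " by <Capitalized Words>"; matching stops at the lowercase "to".
removed = re.sub(r' by ' + CAP_WORDS, '', s)

# Capture the capitalized words instead of removing them.
name = re.search(r'by (' + CAP_WORDS + r')', s).group(1)

print(removed)  # Brinker International Inc (EAT) Upgraded to Hold
print(name)     # Zacks Investment Research
```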

You could also do it with str method index then just slice and add up:
>>> def remove_name(s):
...     b = s.index(' by ')
...     t = s.index(' to ')
...     s = s[:b]+s[t:]
...     return s
>>>
>>> s = 'Brick Brewing Co Limited (BRB) Downgraded by Cormark to Market Perform'
>>> remove_name(s)
'Brick Brewing Co Limited (BRB) Downgraded to Market Perform'
>>>
>>> s = "Brinker International Inc (EAT) Upgraded by Zacks Investment Research to Hold"
>>> remove_name(s)
'Brinker International Inc (EAT) Upgraded to Hold'
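Note that str.index raises ValueError when ' by ' or ' to ' is missing. A possible variant, sketched here with str.find and a hypothetical name remove_name_safe, returns the input unchanged in that case:

```python
def remove_name_safe(s):
    """Like remove_name, but leaves s untouched if ' by ' or ' to ' is absent."""
    b = s.find(' by ')
    # Only look for ' to ' after the ' by ' that was found.
    t = s.find(' to ', b + 1) if b != -1 else -1
    if b == -1 or t == -1:
        return s
    return s[:b] + s[t:]

with_name = remove_name_safe(
    'Brick Brewing Co Limited (BRB) Downgraded by Cormark to Market Perform')
without_name = remove_name_safe('Brick Brewing Co Limited (BRB) Downgraded')
print(with_name)
print(without_name)
```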

Related

How to extract all comma delimited numbers inside () brackets and ignore any text

I am trying to extract the comma-delimited numbers inside () brackets from a string. I can get the numbers if they are alone on a line, but I can't seem to find a solution for getting the numbers when other surrounding text is involved. Any help will be appreciated. Below is the code that I currently use in Python.
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
line = each.strip()
regex_criteria = r'"^([1-9][0-9]*|\([1-9][0-9]*\}|\(([1-9][0-9]*,?)+[1-9][0-9]*\))$"gm'
if (line.__contains__('(') and line.__contains__(')') and not re.search('[a-zA-Z]', refline)):
refline = line[line.find('(')+1:line.find(')')]
if not re.search('[a-zA-Z]', refline):
The ^ and $ anchors are what prevent you from getting all the numbers, so remove them. The trailing gm flags also won't work in Python's re module.
You can change your regex to ([1-9][0-9]*|\([1-9][0-9]*\}|\(?:([1-9][0-9]*,?)+[1-9][0-9]*\)) if you want to get each number separately.
Or you can simplify your pattern to (?<=[(,])[1-9][0-9]+(?=[,)])
Test regex here: https://regex101.com/r/RlGwve/1
Python code:
import re
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
print(re.findall(r'(?<=[(,])[1-9][0-9]+(?=[,)])', line))
# ['101065', '101066', '101067', '101065']
(?<=[(,])[1-9][0-9]+(?=[,)])
The above pattern matches a number that starts with 1-9 and is followed by one or more digits, but only if the number is preceded by ( or , and followed by , or ).
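Because [0-9]+ requires at least one digit after the leading 1-9, one-digit numbers are skipped. The small check below uses a made-up sample string to compare this with a [0-9]* variant, which is an assumption about what may be wanted if single-digit references can occur:

```python
import re

sample = "(7,42,101065)"  # made-up input containing a one-digit number

# Original form: leading 1-9 plus at least one more digit.
two_plus = re.findall(r'(?<=[(,])[1-9][0-9]+(?=[,)])', sample)

# Variant: leading 1-9 plus zero or more digits.
one_plus = re.findall(r'(?<=[(,])[1-9][0-9]*(?=[,)])', sample)

print(two_plus)  # ['42', '101065']
print(one_plus)  # ['7', '42', '101065']
```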
Here's another option:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)*(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?<=\(): Lookbehind for (
[1-9]+\d*: A number that starts with at least one non-zero digit (\d+ would also work if leading zeros are acceptable)
(?:,[1-9]\d*)*: Zero or multiple numbers after a ,
(?=\)): Lookahead for )
Result for your line:
[['101065', '101066', '101067'], ['101065']]
If you only want the comma separated numbers:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)+(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?:,[1-9]\d*)+: One or more numbers after a ,
Result:
[['101065', '101066', '101067']]
Now, if your line could also look like
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines ( 101065,101066, 101067 )
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
then you have to sprinkle the pattern with \s* and remove the whitespace afterwards (here with str.translate and str.maketrans):
pattern = re.compile(r"(?<=\()\s*[1-9]+\d*(?:\s*,\s*[1-9]\d*\s*)*(?=\))")
table = str.maketrans("", "", " ")
results = [match[0].translate(table).split(",") for match in pattern.finditer(line)]
Result:
[['101065', '101066', '101067'], ['101065']]
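An alternative sketch for the whitespace case: capture the whole bracket content first, then pull the digits out with a second findall, so no translate step is needed. This uses \d+ rather than [1-9]\d*, which is an assumption that leading zeros do not matter; the name BRACKETED is made up here.

```python
import re

line = """
It has also been used in traditional South American and Indian Ayurvedic medicines ( 101065,101066, 101067 )
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""

# A bracket that contains only numbers, commas and optional whitespace.
BRACKETED = re.compile(r'\(\s*(\d+(?:\s*,\s*\d+)*)\s*\)')

# For each bracket, collect the individual numbers.
results = [re.findall(r'\d+', m.group(1)) for m in BRACKETED.finditer(line)]
print(results)  # [['101065', '101066', '101067'], ['101065']]
```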
Using the pypi regex module you could also use capture groups:
\((?P<num>\d+)(?:,(?P<num>\d+))*\)
The pattern matches:
\( Match (
(?P<num>\d+) Capture group, match 1+ digits
(?:,(?P<num>\d+))* Optionally repeat matching , and 1+ digits in a capture group
\) Match )
Example code
import regex
pattern = r"\((?P<num>\d+)(?:,(?P<num>\d+))*\)"
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
matches = regex.finditer(pattern, line)
for m in matches:
    print(m.capturesdict())
Output
{'num': ['101065', '101066', '101067']}
{'num': ['101065']}

Regex for matching alphabet, numbers and special characters while looping in python

I am trying to find words and print them using the code below. Everything works, but the only issue is that I am unable to print the last word (which is a number).
words = ['Town of','Block No.','Lot No.','Premium (if any) Paid ']
import re
for i in words:
    y = re.findall('{} ([^ ]*)'.format(i), textfile)
    print(y)
Text file I am working with:
textfile = """1, REBECCA M. ROTH , COLLECTOR OF TAXES of the taxing district of the
township of MORRIS for Six Hundred Sixty Seven dollars andFifty Two cents, the land
in said taxing district described as Block No. 10303 Lot No. 10 :
and known as 239 E HANOVER AVE , on the tax Taxes For: 2012
Sewer
Assessments For Improvements
Total Cost of Sale 35.00
Total
Premium (if any) Paid 1,400.00 """
I would like to know where I am making a mistake.
Any suggestions are appreciated.
A couple of issues:
As others have mentioned, you need to escape special characters like parentheses ( ) and dots (.). The simplest way is to use re.escape.
Another issue is the trailing space in Premium \(if any\) Paid (it's trying to match two spaces instead of one as you're also checking for a space in your regex {} ([^ ]*))
You should instead change your code to the following:
words = ['Town of','Block No.','Lot No.','Premium (if any) Paid']
import re
for i in words:
    y = re.findall('{} ([^ ]*)'.format(re.escape(i)), textfile)
    print(y)
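To see why the escaping matters, here is a small side-by-side sketch on just the last line of the text (the variable names are made up for illustration). Without re.escape, (if any) is treated as a capture group that matches the bare text "if any", so the literal parentheses in the text never match:

```python
import re

text = "Premium (if any) Paid 1,400.00 "
label = "Premium (if any) Paid"  # no trailing space

# Unescaped: "(if any)" is a regex group, not literal parentheses.
unescaped = re.findall('{} ([^ ]*)'.format(label), text)

# Escaped: the parentheses are matched literally.
escaped = re.findall('{} ([^ ]*)'.format(re.escape(label)), text)

print(unescaped)  # []
print(escaped)    # ['1,400.00']
```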
Two problems:
Your current 'Premium (if any) Paid ' string ends on a space, and '{} ([^ ]*)' also has a space after {}, which adds them together. Delete the trailing space in 'Premium (if any) Paid '.
You need to escape parentheses, so if you want to keep your regular expression unchanged, the string in the list should be ['Premium \(if any\) Paid']. You can also use re.escape instead.
For your particular cases, this seems to be an optimal solution:
words = ['Town of','Block No.','Lot No.','Premium (if any) Paid']
import re
for i in words:
    y = re.findall(r'{}\s+(\S*)'.format(re.escape(i)), textfile, re.I)
    print(y)

Regex to match strings in quotes that contain only 3 or less capitalized words

I've searched and searched, but can't find any relief for my regex woes.
I wrote the following dummy sentence:
Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya and Genaddy Triple-G Golovkin for the WBO belt GGG. Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, New Jersey. Conor MacGregor will be there along with Adonis Superman Stevenson and Mr. Sugar Ray Robinson. "Here Goes a String". 'Money Mayweather'. "this is not a-string", "this is not A string", "This IS a" "Three Word String".
I'm looking for a regular expression that will return the following when used in Python 3.6:
Canelo, Money, Money Mayweather, Three Word String
The regex that has gotten me the closest is:
(["'])[A-Z](\\?.)*?\1
I want it to only match strings of 3 capitalized words or less immediately surrounded by single or double quotes. Unfortunately, so far it seems to match any string in quotes, no matter what the length, no matter what the content, as long as it begins with a capital letter.
I've put a lot of time into trying to hack through it myself, but I've hit a wall. Can anyone with stronger regex kung-fu give me an idea of where I'm going wrong here?
Try to use this one: (["'])((?:[A-Z][a-z]+ ?){1,3})\1
(["']) - opening quote
((?:[A-Z][a-z]+ ?){1,3}) - capitalized word repeated 1 to 3 times, separated by spaces (group 2)
  [A-Z] - capital letter (first character of the word)
  [a-z]+ - lowercase letters (rest of the word)
  _? - optional space separating the capitalized words (_ stands for a space); the ? allows a single word without a trailing space
  {1,3} - 1 to 3 times
\1 - closing quote, same as opening
Group 2 is what you want.
Match 1
Full match 29-37 `"Canelo"`
Group 1. 29-30 `"`
Group 2. 30-36 `Canelo`
Match 2
Full match 146-153 `'Money'`
Group 1. 146-147 `'`
Group 2. 147-152 `Money`
Match 3
Full match 318-336 `'Money Mayweather'`
Group 1. 318-319 `'`
Group 2. 319-335 `Money Mayweather`
Match 4
Full match 398-417 `"Three Word String"`
Group 1. 398-399 `"`
Group 2. 399-416 `Three Word String`
RegEx101 Demo: https://regex101.com/r/VMuVae/4
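A short sketch of how this pattern could be applied in Python, collecting group 2 from each match. The text below is a shortened version of the dummy sentence, and the variable names are chosen here for illustration.

```python
import re

text = '''Saul "Canelo" Alvarez and Floyd 'Money' Mayweather. "Here Goes a String". 'Money Mayweather'. "This IS a" "Three Word String".'''

# Group 1 is the opening quote, group 2 the 1-3 capitalized words,
# and \1 requires the same kind of quote to close the string.
QUOTED = re.compile(r'''(["'])((?:[A-Z][a-z]+ ?){1,3})\1''')

names = [m.group(2) for m in QUOTED.finditer(text)]
print(names)  # ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
```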
Working with the text you've provided, I would try to use regular expression lookaround to get the words surrounded by quotes and then apply some conditions on those matches to determine which ones meet your criterion. The following is what I would do:
[p for p in re.findall('(?<=[\'"])[\w ]{2,}(?=[\'"])', txt) if all(x.istitle() for x in p.split(' ')) and len(p.split(' ')) <= 3]
txt is the text you've provided here. The output is the following:
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Cleaner:
matches = []
for m in re.findall('(?<=[\'"])[\w ]{2,}(?=[\'"])', txt):
    if all(x.istitle() for x in m.split(' ')) and len(m.split(' ')) <= 3:
        matches.append(m)
print(matches)
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Here's my go at it: ([\"'])(([A-Z][^ ]*? ?){1,3})\1

How to split text into sentences when there is no space after full stop?

I have a text like
'A gas well near Surabaya in East Java operated by Lapindo Brantas Inc. has spewed steaming mud since May last year, submerging villages, industries and fields.A gas well near Surabaya in East Java operated by PT Lapindo Brantas has spewed steaming mud since May last year, submerging villages, factories and fields.Last week, Indonesia's coordinating minister for social welfare, Aburizal Bakrie, whose family firm controls Lapindo Brantas, said the volcano was a "natural disaster" unrelated to the drilling activities.President Susilo Bambang Yudhoyono last month ordered Lapindo to pay 3.8 trillion rupiah (420.7 million dollars) in compensation and costs'
I want to split it into sentences. NLTK or any standard regex which I find online fails.
You can use a regex positive lookahead to add spaces to the end of sentences and then pass it to the tool of your choice. This adds a space to periods that don't already have one, but skips non-alphanumerics like commas. By sticking to character classes instead of, say, A-Z, this works for any language.
>>> re.sub(r'\.(?=[^ \W\d])', '. ', 'Foo bar.Baz Inc., foobar. 1.1, and abc._')
'Foo bar. Baz Inc., foobar. 1.1, and abc. _'
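You can try to skip URLs by adding another lookahead that searches for slashes. Note that, as written below, the two lookaheads contradict each other (the first requires a letter or underscore right after the period, the second a non-word character followed by /), so the pattern never matches and the text comes back unchanged: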
>>> re.sub(r'\.(?=[^ \W\d])(?=[^\w*]/)', '. ', 'Foo bar.Baz Inc., foobar. 1.1, and abc._ http://www.example.com/whatever')
'Foo bar.Baz Inc., foobar. 1.1, and abc._ http://www.example.com/whatever'
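One possible way to add the spaces while leaving URLs alone is to match URLs as their own alternative and put them back untouched. This is only a sketch under the assumption that URLs start with http:// or https:// and run until the next whitespace; the names URL_OR_DOT and add_spaces are made up here.

```python
import re

# First alternative: a URL, captured so it can be put back untouched.
# Second alternative: a period directly followed by a letter or underscore.
URL_OR_DOT = re.compile(r'(https?://\S+)|\.(?=[^\W\d])')

def add_spaces(text):
    # If group 1 matched, this is a URL: return it as is. Otherwise it is a
    # sentence-ending period, which gets a space appended.
    return URL_OR_DOT.sub(lambda m: m.group(1) or '. ', text)

sample = 'Foo bar.Baz Inc., foobar. 1.1, and abc._ http://www.example.com/whatever'
fixed = add_spaces(sample)
print(fixed)
```

A URL that is immediately followed by sentence text without whitespace would be swallowed whole by \S+, so this only helps when URLs are already delimited by spaces.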
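You can use this regex to capture the dots that are followed by a new sentence:
(\.)(?=[A-Z])
The lookahead (?=[A-Z]) checks for the capital letter without consuming it; a plain group such as (?:[A-Z]) would be replaced along with the dot and drop the first letter of each sentence. You can pass it to re.sub with r'\1\n' as the replacement:
parsed_text = re.sub(r'(\.)(?=[A-Z])', r'\1\n', your_text)
You can also just split it into a list of sentences (but you lose the dots at the end):
sentence_list = re.split(r'\.(?=[A-Z])', your_text)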
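If the dots should be kept when splitting, one option is to split on the empty position between a dot and a capital letter. This is a sketch that assumes Python 3.7 or later, where re.split can split on zero-width matches; the sample text is shortened from the question.

```python
import re

text = ('submerging villages, industries and fields.A gas well operated by '
        'Lapindo Brantas Inc. has spewed steaming mud.Last week, the minister spoke.')

# Split at the position that has a dot behind it and a capital letter ahead.
# "Inc. has" is not split because the dot is followed by a space.
sentences = re.split(r'(?<=\.)(?=[A-Z])', text)
for s in sentences:
    print(s)
```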

python reg-ex pattern not matching

I have a reg-ex matching problem with the following pattern and string. The pattern is basically a name followed by any number of characters, followed by one of the phrases (see pattern below), followed by any number of characters, followed by the institution name.
pattern = "[David Maxwell|David|Maxwell] .* [educated at|graduated from|attended|studied at|graduate of] .* Eton College"
str = "David Maxwell was educated at Eton College, where he was a King's Scholar and Captain of Boats, and at Cambridge University where he rowed in the winning Cambridge boat in the 1971 and 1972 Boat Races."
match = re.search(pattern, str)
But the search method returns no match for the above str. Is my reg-ex correct? I'm new to reg-ex. Any help is appreciated.
[...] means "any character from this set of characters". If you want "any word in this group of words" you need to use parentheses: (...|...).
There's another problem in your expression, where you have .* (space, dot, star, space), which means "a space, followed by zero or more characters, followed by a space". In other words, the shortest possible match is two spaces. However, your text only has one space between "educated at" and "Eton College".
>>> pattern = '(David Maxwell|David|Maxwell).*(educated at|graduated from|attended|studied at|graduate of).*Eton College'
>>> str = "David Maxwell was educated at Eton College, where he was a King's Scholar and Captain of Boats, and at Cambridge University where he rowed in the winning Cambridge boat in the 1971 and 1972 Boat Races."
>>> re.search(pattern, str)
<_sre.SRE_Match object at 0x1006d10b8>
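To make the difference concrete, the sketch below runs the bracket version and a grouped version side by side on a shortened copy of the text. The named groups, the lazy .*? and the \b word boundaries are additions for illustration, not part of the answer above.

```python
import re

text = ("David Maxwell was educated at Eton College, where he was a King's "
        "Scholar and Captain of Boats.")

# Character classes: each [...] matches exactly one character from the set.
broken = ("[David Maxwell|David|Maxwell] .* "
          "[educated at|graduated from|attended|studied at|graduate of] .* Eton College")

# Groups: each (...|...) matches one of the alternative phrases.
fixed = re.compile(
    r"(?P<name>David Maxwell|David|Maxwell)\b.*?"
    r"\b(?P<phrase>educated at|graduated from|attended|studied at|graduate of)\b.*?"
    r"\bEton College"
)

print(re.search(broken, text))  # None
m = fixed.search(text)
print(m.group('name'), '|', m.group('phrase'))  # David Maxwell | educated at
```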
