Pulling sentences with combinations of keywords in python using regular expressions - python

Suppose I have the string
'apples are red. this apple is green. pears are sometimes red, but not usually. pears are green. apples are yummy. lizards are green.'
and I want to use regular expressions to pull the sentences in that string that mention either apple or pear first and then its color, red or green. So I basically want a list returned that has:
["apples are red.", "this apple is green.", "pears are sometimes red, but not usually.", pears are green."]
I can pull a regular expression for just apples and pears or green and red with something like
re.findall(r'([^.]*?apple[^.]*|[^.]*?pear[^.]*)', string)
and
re.findall(r'([^.]*?red[^.]*|[^.]*?green[^.]*)', string)
but how do I put these two together when I want the fruit (apple/pear) to come first in the string followed by the color and some later point in the sentence?

You can use parentheses to group sub-expressions:
re.findall(r"[^.]*\b(?:apple|pear)[^.]*\b(?:red|green)\b[^.]*\.", string)
For example:
>>> import re
>>> a = 'apples are red. this apple is green. pears are sometimes red, but not usually. pears are green. apples are yummy. lizards are green.'
>>> re.findall(r"[^.]*\b(?:apple|pear)[^.]*\b(?:red|green)\b[^.]*\.", a)
['apples are red.', ' this apple is green.',
' pears are sometimes red, but not usually.', ' pears are green.']

use this pattern (?:^|\b)(?=[^.]*(?:apple|pear)[^.]*(?:red|green))([^.]+\.) Demo

I recommend you read about NLTK (Natural Language Tool Kit). It's a python package for text processing

Related

Data Cleaning - removing trailing phrases

I'm cleaning some data and wondering how to remove trailing phrases. I don't want to get rid of all numbers as some flavors have numbers. The first table is the pre-cleaned data, the second table is what I want.
Flavor
Orange 5 ml
Cherry
Strawberry 5 mg/ml
#1 flavor
Passion fruit 1.
Cherry Blossom
Flavor
Orange
Cherry
Strawberry
#1 flavor
Passion fruit
Cherry Blossom
Like all data cleansing, this requires knowledge of the entire dataset, so the help you can get is minimal. However, I've cooked up a regular expression that you can use to remove numbers, whitespace, units (ml, mg), slashes (/) and periods (.) from the end of the strings:
\s*\b[/mgl\d\s.]+$
You can use it like this:
df['Flavor'] = df['Flavor'].str.replace(r'\s*\b[/mgl\d\s.]+$', '', regex=True)

Extract full sentence with list of words

I hope to extract the full sentence, if containing certain key words (like or love).
text = 'I like blueberry icecream. He has a green car. She has blue car.'
pattern = '[^.]* like|love [^.]*\.'
re.findall(pattern,text)
Using | for the divider , I was expected ['I like blueberry icecream.']
But only got ['I like']
I also tried pattern = '[^.]*(like|love)[^.]*\.' but got only ['like']
What did I do wrong as I know single word works with following RegEx - '[^.]* like [^.]*\.'
You need to put a group around like|love. Otherwise the | applies to the entire patterns on either side of it. So it's matching either a string ending with like or a string beginning with love.
pattern = '[^.]* (?:like|love) [^.]*\.'
Research more and found out I was missing ?:
text = 'I like blueberry icecream. He has a green car. She has blue car.'
pattern = '[^.]*(?:like|love)[^.]*\.'
Output
['I like blueberry icecream.']
Source: https://www.ocpsoft.org/tutorials/regular-expressions/or-in-regex/
I actually think it would be easier to do this without regex. Just my two cents.
text = 'I like blueberry icecream. He has a green car. She has blue car. I love dogs.'
print([x for x in text.split('.') if any(y in x for y in ['like', 'love'])])
You can use below regex
regex = /[^.]* (?:like|love) [^.]*\./g
Demo here

Regex to grab word before a certain character in python

I want to extract word before a certain character from the names column and append new colum as color
if there is no color before the name then I want to display empty string
I've been trying to extract the word before the match. For example, I have the following table:
import pandas as pd
import re
data = ['red apple','green topaz','black grapes','white grapes']
df = pd.DataFrame(data, columns = ['Names'])
Names
red apple
green apple
black grapes
white grapes
normal apples
red apple
The below code i was treid
I am geeting Partial getting output
df['Names'].apply(lambda x: ' '.join(re.findall(r'(\w+)\s+apple', x)))
df['Names'].apply(lambda x: ' '.join(re.findall(r'(\w+)\s+apple|grapes', x)))
Desired output:
Names color
red apple red
green apple green
black grapes black
white grapes white
normal apples
red apple red
Please help out me this issue
I found this solution:
gives me a color_column like ['red', 'green', 'black', 'white', '']
import re
data = ['red apple','green topaz','black grapes','white grapes','apples']
colors_column = list(map(lambda x: ' '.join(re.findall(r'(\S\w+)\s+\w+', x)) ,data))
One solution is just to remove the fruit names to get the color:
def remove_fruit_name(description):
return re.sub(r"apple|grapes", "", description)
df['Colors'] = df['Names'].apply(remove_fruit_name)
If you have many lines it may be faster to compile your regexp:
fruit_pattern = re.compile(r"apple|grapes")
def remove_fruit_name(description):
return fruit_pattern.sub("", description)
Another solution is to use a lookahead assertion, it's (probably) a bit faster, but the code is a bit more complex:
# That may be useful to have a set of fruits:
valid_fruit_names = {"apple", "grapes"}
any_fruit_pattern = '|'.join(valid_fruit_names)
fruit_pattern = re.compile(f"(\w*)\s*(?={any_fruit_pattern})")
def remove_fruit_name(description):
match = fruit_pattern.search(description)
if match:
return match.groups()[0]
return description
df['Colors'] = df['Names'].apply(remove_fruit_name)
Here is an example of lookahead quoted from the documentation:
(?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
Finally, if you want to make a difference between normal and green you'll need a dictionary of valid colors. Same goes for fruit names if you have non-fruit strings in your input, such as topaz.
Not necessarily an elegant trick, but this seems to work:
((re.search('(\w*) (apple|grape)',a)) or ['',''])[1]
Briefly, you search for the first word before apple or grape, but if there is no match, it returns None which is false. So you use or with a list of empty strings, but since you want to take the first element of the matched expression (index 1), I used a two element list of empty strings (to take the second element there).

Python: replace \n \r \t in a list excluding those starting \n\n and ends with \n\r\n\t

My List:
['\n\r\n\tThis article is about sweet bananas. For the genus to which
banana plants belong, see Musa (genus).\n\r\n\tFor starchier bananas
used in cooking, see Cooking banana. For other uses, see Banana
(disambiguation)\n\r\n\tMusa species are native to tropical Indomalaya
and Australia, and are likely to have been first domesticated in Papua
New Guinea.\n\r\n\tThey are grown in 135
countries.\n\n\n\r\n\tWorldwide, there is no sharp distinction between
"bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is
the largest herbaceous flowering plant.\n\r\n\tAll the above-ground
parts of a banana plant grow from a structure usually called a
"corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West
African origin, possibly from the Wolof word banaana, and passed into
English via Spanish or Portuguese.\n']
Example code:
import requests
from bs4 import BeautifulSoup
import re
re=requests.get('http://www.abcde.com/banana')
soup=BeautifulSoup(re.text.encode('utf-8'), "html.parser")
title_tag = soup.select_one('.page_article_title')
print(title_tag.text)
list=[]
for tag in soup.select('.page_article_content'):
list.append(tag.text)
#list=([c.replace('\n', '') for c in list])
#list=([c.replace('\r', '') for c in list])
#list=([c.replace('\t', '') for c in list])
print(list)
After I scraping a web page, I need to do data cleansing. I want to replace all the "\r", "\n", "\t" as "", but I found that I have subtitle in this, if I do this, subtitles and sentences are going to mix together.
Every subtitle always starts with \n\n and ends with \n\r\n\t, is it possible that I can do something to distinguish them in this list like \aEtymology\a. It's not going to work if I replace \n\n and \n\r\n\t separately to \a first cause other parts might have the same elements like this \n\n\r and it will become like \a\r. Thanks in advance!
Approach
Replace the subtitles to a custom string, <subtitles> in the list
Replace the \n, \r, \t etc. in the list
Replace the custom string with the actual subtitle
Code
l=['\n\r\n\tThis article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).\n\r\n\tFor starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)\n\r\n\tMusa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.\n\r\n\tThey are grown in 135 countries.\n\n\n\r\n\tWorldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.\n\r\n\tAll the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.\n']
import re
regex=re.findall("\n\n.*.\n\r\n\t",l[0])
print(regex)
for x in regex:
l = [r.replace(x,"<subtitles>") for r in l]
rep = ['\n','\t','\r']
for y in rep:
l = [r.replace(y, '') for r in l]
for x in regex:
l = [r.replace('<subtitles>', x, 1) for r in l]
print(l)
Output
['\n\nDescription\n\r\n\t', '\n\nEtymology\n\r\n\t']
['This article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).For starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)Musa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.They are grown in 135 countries.Worldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.All the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.']
import re
print([re.sub(r'[\n\r\t]', '', c) for c in list])
I think you may use regex
You can do this by using regular expressions:
import re
subtitle = re.compile(r'\n\n(\w+)\n\r\n\t')
new_list = [subtitle.sub(r"\a\g<1>\a", l) for l in li]
\g<1> is a backreference to the (\w+) in the first regex. It lets you reuse what ever is in there.

Splitting Paragraphs in Python using Regular Expression containing abbreaviations

Tried using this function on a paragraph consisting of 3 strings and abbreviations.
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
def splitParagraphIntoSentences(paragraph):
''' break a paragraph into sentences
and return a list '''
import re
# to split by multile characters
# regular expressions are easiest (and fastest)
sentenceEnders = re.compile('[.!?][\s]{1,2}[A-Z]')
sentenceList = sentenceEnders.split(paragraph)
return sentenceList
if __name__ == '__main__':
p = "While other species (e.g. horse mango, M. foetida) are also grown ,Mangifera indica – the common mango or Indian mango – is the only mango tree. Commonly cultivated in many tropical and subtropical regions, and its fruit is distributed essentially worldwide.In several cultures, its fruit and leaves are ritually used as floral decorations at weddings, public celebrations and religious "
sentences = splitParagraphIntoSentences(p)
for s in sentences:
print s.strip()
The first character of the next beggining sentence is eliminated,
O/p Recieved:
While other Mangifera species (e.g. horse mango, M. foetida) are also grown on a
more localized basis, Mangifera indica ΓÇô the common mango or Indian mango ΓÇô
is the only mango tree
ommonly cultivated in many tropical and subtropical regions, and its fruit is di
stributed essentially worldwide.In several cultures, its fruit and leaves are ri
tually used as floral decorations at weddings, public celebrations and religious.
Thus the string got spliited into only 2 strings and the first character of the next sentence got eliminated.Also some strange charactes can be seen, I guess python wasn`t able to convert the hypen.
Incase I alter the regex to [.!?][\s]{1,2}
While other species (e.g
horse mango, M
foetida) are also grown ,Mangifera indica ΓÇô the common mango or Indian mango Γ
Çô is the only mango tree
Commonly cultivated in many tropical and subtropical regions, and its fruit is d
istributed essentially worldwide.In several cultures, its fruit and leaves are r
itually used as floral decorations at weddings, public celebrations and religiou
s
Thus even the abbreviations get splitted.
The regex you want is:
[.!?][\s]{1,2}(?=[A-Z])
You want a positive lookahead assertion, which means you want to match the pattern if it's followed by a capital letter, but not match the capital letter.
The reason only the first one got matched is you don't have a space after the 2nd period.

Categories

Resources