I'm cleaning some data and wondering how to remove trailing phrases. I don't want to get rid of all numbers as some flavors have numbers. The first table is the pre-cleaned data, the second table is what I want.
Flavor
Orange 5 ml
Cherry
Strawberry 5 mg/ml
#1 flavor
Passion fruit 1.
Cherry Blossom
Flavor
Orange
Cherry
Strawberry
#1 flavor
Passion fruit
Cherry Blossom
Like all data cleansing, this requires knowledge of the entire dataset, so the help you can get is minimal. However, I've cooked up a regular expression that you can use to remove numbers, whitespace, units (ml, mg), slashes (/) and periods (.) from the end of the strings:
\s*\b[/mgl\d\s.]+$
You can use it like this:
df['Flavor'] = df['Flavor'].str.replace(r'\s*\b[/mgl\d\s.]+$', '', regex=True)
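To see what the pattern does without a DataFrame, here is a minimal sketch that applies the same regular expression to the sample values from the question with the standard `re` module. The names `TRAILING`, `flavors` and `cleaned` are only illustrative.

```python
import re

# Same pattern as above: optional whitespace, a word boundary, then a run of
# digits, whitespace, '/', '.', and the letters m, g, l up to the end.
TRAILING = re.compile(r'\s*\b[/mgl\d\s.]+$')

flavors = [
    "Orange 5 ml",
    "Cherry",
    "Strawberry 5 mg/ml",
    "#1 flavor",
    "Passion fruit 1.",
    "Cherry Blossom",
]

cleaned = [TRAILING.sub('', flavor) for flavor in flavors]
print(cleaned)
```

Note that the character class is a set of characters, not a list of units, so a trailing word made up only of the letters m, g and l (for example a value ending in " mg" with no number) would be removed too. Check the result against your full dataset.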
I want to extract the full sentence if it contains certain keywords (like or love).
text = 'I like blueberry icecream. He has a green car. She has blue car.'
pattern = '[^.]* like|love [^.]*\.'
re.findall(pattern,text)
Using | as the alternation operator, I expected ['I like blueberry icecream.']
But I only got ['I like']
I also tried pattern = '[^.]*(like|love)[^.]*\.' but got only ['like']
What did I do wrong? I know a single word works with the following regex: '[^.]* like [^.]*\.'
You need to put a group around like|love. Otherwise the | applies to the entire patterns on either side of it. So it's matching either a string ending with like or a string beginning with love.
pattern = r'[^.]* (?:like|love) [^.]*\.'
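The three variants discussed in this thread can be compared side by side. This is a small sketch using the text from the question; the variable names are illustrative only.

```python
import re

text = 'I like blueberry icecream. He has a green car. She has blue car.'

# No group: | splits the whole pattern into "[^.]* like" OR "love [^.]*\."
ungrouped = re.findall(r'[^.]* like|love [^.]*\.', text)

# Capturing group: findall returns only what the group captured
capturing = re.findall(r'[^.]*(like|love)[^.]*\.', text)

# Non-capturing group: findall returns the whole match
non_capturing = re.findall(r'[^.]*(?:like|love)[^.]*\.', text)

print(ungrouped)
print(capturing)
print(non_capturing)
```

The second result follows from how `re.findall` works: when the pattern contains exactly one capturing group, it returns the group's text rather than the full match.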
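After more research, I found out I was missing ?:, which makes the group non-capturing. With a capturing group, re.findall returns only the text captured by the group, which is why I got ['like'].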
text = 'I like blueberry icecream. He has a green car. She has blue car.'
pattern = '[^.]*(?:like|love)[^.]*\.'
Output
['I like blueberry icecream.']
Source: https://www.ocpsoft.org/tutorials/regular-expressions/or-in-regex/
I actually think it would be easier to do this without regex. Just my two cents.
text = 'I like blueberry icecream. He has a green car. She has blue car. I love dogs.'
print([x for x in text.split('.') if any(y in x for y in ['like', 'love'])])
You can use the regex below:
regex = /[^.]* (?:like|love) [^.]*\./g
My List:
['\n\r\n\tThis article is about sweet bananas. For the genus to which
banana plants belong, see Musa (genus).\n\r\n\tFor starchier bananas
used in cooking, see Cooking banana. For other uses, see Banana
(disambiguation)\n\r\n\tMusa species are native to tropical Indomalaya
and Australia, and are likely to have been first domesticated in Papua
New Guinea.\n\r\n\tThey are grown in 135
countries.\n\n\n\r\n\tWorldwide, there is no sharp distinction between
"bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is
the largest herbaceous flowering plant.\n\r\n\tAll the above-ground
parts of a banana plant grow from a structure usually called a
"corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West
African origin, possibly from the Wolof word banaana, and passed into
English via Spanish or Portuguese.\n']
Example code:
import requests
from bs4 import BeautifulSoup
import re

# Named "response" so that it does not shadow the re module
response = requests.get('http://www.abcde.com/banana')
soup = BeautifulSoup(response.text.encode('utf-8'), "html.parser")
title_tag = soup.select_one('.page_article_title')
print(title_tag.text)
list = []
for tag in soup.select('.page_article_content'):
    list.append(tag.text)
#list=([c.replace('\n', '') for c in list])
#list=([c.replace('\r', '') for c in list])
#list=([c.replace('\t', '') for c in list])
print(list)
After scraping a web page, I need to clean the data. I want to replace all the "\r", "\n" and "\t" with "", but the text contains subtitles, and if I do this the subtitles and sentences will run together.
Every subtitle starts with \n\n and ends with \n\r\n\t. Is it possible to mark them in this list, for example as \aEtymology\a? Replacing \n\n and \n\r\n\t separately with \a first won't work, because other parts may contain similar sequences such as \n\n\r, which would become \a\r. Thanks in advance!
Approach
Replace each subtitle in the list with a custom placeholder string, <subtitles>
Remove the \n, \r and \t characters from the list
Replace each placeholder with the actual subtitle
Code
l=['\n\r\n\tThis article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).\n\r\n\tFor starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)\n\r\n\tMusa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.\n\r\n\tThey are grown in 135 countries.\n\n\n\r\n\tWorldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.\n\r\n\tAll the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.\n']
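import re

# Find each subtitle together with its \n\n ... \n\r\n\t delimiters
regex = re.findall("\n\n.*.\n\r\n\t", l[0])
print(regex)
# Step 1: swap every subtitle for a placeholder
for x in regex:
    l = [r.replace(x, "<subtitles>") for r in l]
# Step 2: remove the remaining control characters
rep = ['\n', '\t', '\r']
for y in rep:
    l = [r.replace(y, '') for r in l]
# Step 3: put the subtitles back, one placeholder at a time
for x in regex:
    l = [r.replace('<subtitles>', x, 1) for r in l]
print(l)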
Output
['\n\nDescription\n\r\n\t', '\n\nEtymology\n\r\n\t']
['This article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).For starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)Musa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.They are grown in 135 countries.Worldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.All the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.']
import re
print([re.sub(r'[\n\r\t]', '', c) for c in list])
I think you may use regex
You can do this by using regular expressions:
import re
subtitle = re.compile(r'\n\n(\w+)\n\r\n\t')
new_list = [subtitle.sub(r"\a\g<1>\a", l) for l in li]
\g<1> is a backreference to the (\w+) group in the regex. It lets you reuse whatever that group matched.
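As a rough sketch of how this could be combined with the cleanup step, the snippet below marks the subtitles first and only then strips the remaining control characters. The `sample` string is a shortened, made-up stand-in for the scraped text, and `(\w+)` assumes each subtitle is a single word.

```python
import re

# Hypothetical, shortened stand-in for one scraped list entry
sample = ('Intro text.\n\r\n\tMore text.\n\nDescription\n\r\n\tThe plant.'
          '\n\nEtymology\n\r\n\tThe word.\n')

subtitle = re.compile(r'\n\n(\w+)\n\r\n\t')

# Step 1: wrap each subtitle in \a markers (\g<1> reuses the captured word)
marked = subtitle.sub(r'\a\g<1>\a', sample)

# Step 2: remove whatever \n, \r and \t characters are left
cleaned = re.sub(r'[\n\r\t]', '', marked)

print(repr(cleaned))
```

If subtitles can contain spaces, the group would need to be widened, for example to `([^\n\r\t]+)`; that is an assumption about the data, not something the question confirms.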
Consider this text:
...
bedeubedeu France The Provençal name for tripe
bee balmbee balm Bergamot
beechmastbeechmast Beech nut
beech nutbeech nut A small nut from the beech tree,
genus Fagus and Nothofagus, similar in
flavour to a hazelnut but not commonly used.
A flavoursome oil can be extracted from
them. Also called beechmast
beechwheatbeechwheat Buckwheat
beefbeef The meat of the animal known as a cow
(female) or bull (male) (NOTE: The Anglo-
saxon name ‘Ox’ is still used for some of what
were once the less desirable parts e.g. oxtail,
ox liver)
beef bourguignonnebeef bourguignonne See boeuf à la
bourguignonne
...
I would like to parse this text with Python and keep only the strings that appear exactly twice and are adjacent. For example, an acceptable result would be
bedeu
bee balm
beechmast
beech nut
beechwheat
beef
beef bourguignonne
because the trend is that each string appears adjacent to an identical one, just like this:
bedeubedeu
bee balmbee balm
beechmastbeechmast
beech nutbeech nut
beechwheatbeechwheat
beefbeef
beef bourguignonnebeef bourguignonne
So how can I search for adjacent, identical strings with a regular expression? Thanks!
You can use the following regex:
(\b.+)\1
Or, to just match and capture the unique substring part:
(\b.+)(?=\1)
The word boundary \b makes sure the match starts at a word boundary. Then .+ matches one or more characters other than a newline (in single-line mode, re.DOTALL in Python, . also matches a newline), and the backreference \1 matches exactly the same sequence of characters that was captured by (\b.+).
With the (?=\1) look-ahead version, the matched text does not contain the duplicate part, because look-aheads do not consume text.
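The difference between the two patterns can be checked on a few lines from the question. This is only a sketch: `sample_lines` is a hand-picked subset of the glossary text, and `re.search` is used so that only the first doubled string on each line is considered.

```python
import re

consuming = re.compile(r'(\b.+)\1')       # the match includes both copies
lookahead = re.compile(r'(\b.+)(?=\1)')   # the match stops before the second copy

sample_lines = [
    "bedeubedeu France The Provençal name for tripe",
    "bee balmbee balm Bergamot",
    "beef bourguignonnebeef bourguignonne See boeuf à la",
    "bourguignonne",  # continuation line with no doubled string
]

headwords = []
for line in sample_lines:
    match = lookahead.search(line)
    if match:
        headwords.append(match.group(0))

print(headwords)
print(consuming.search(sample_lines[1]).group(0))
```

With the look-ahead pattern, `group(0)` and `group(1)` are the same text, so either can be used to collect the headword.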
UPDATE
See Python demo:
#!/usr/bin/env python3
import re

p = re.compile(r'(\b.+)\1')
test_str = "zymezyme Yeast, the origin of the word enzyme, as the first enzymes were extracted from yeast Page 632 Thursday, August 19, 2004 7:50 PM\nabbrühenabbrühen"
# group(1) holds a single copy of each doubled string
for i in p.finditer(test_str):
    print(i.group(1))
Output:
zyme
abbrühen
Suppose I have the string
'apples are red. this apple is green. pears are sometimes red, but not usually. pears are green. apples are yummy. lizards are green.'
and I want to use regular expressions to pull the sentences in that string that mention either apple or pear first and then its color, red or green. So I basically want a list returned that has:
["apples are red.", "this apple is green.", "pears are sometimes red, but not usually.", "pears are green."]
I can pull a regular expression for just apples and pears or green and red with something like
re.findall(r'([^.]*?apple[^.]*|[^.]*?pear[^.]*)', string)
and
re.findall(r'([^.]*?red[^.]*|[^.]*?green[^.]*)', string)
but how do I put these two together when I want the fruit (apple/pear) to come first in the sentence, followed by the color at some later point?
You can use parentheses to group sub-expressions:
re.findall(r"[^.]*\b(?:apple|pear)[^.]*\b(?:red|green)\b[^.]*\.", string)
For example:
>>> import re
>>> a = 'apples are red. this apple is green. pears are sometimes red, but not usually. pears are green. apples are yummy. lizards are green.'
>>> re.findall(r"[^.]*\b(?:apple|pear)[^.]*\b(?:red|green)\b[^.]*\.", a)
['apples are red.', ' this apple is green.',
' pears are sometimes red, but not usually.', ' pears are green.']
Use this pattern: (?:^|\b)(?=[^.]*(?:apple|pear)[^.]*(?:red|green))([^.]+\.)
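Both suggestions can be tried on the string from the question. In this sketch the first pattern's results are stripped, since its matches keep the space that follows the previous period; the variable names are illustrative.

```python
import re

text = ('apples are red. this apple is green. pears are sometimes red, '
        'but not usually. pears are green. apples are yummy. lizards are green.')

# Grouped alternation: fruit first, color later, all inside one sentence
grouped = re.findall(
    r"[^.]*\b(?:apple|pear)[^.]*\b(?:red|green)\b[^.]*\.", text)
grouped = [sentence.strip() for sentence in grouped]

# Look-ahead variant: the check happens in the look-ahead and the sentence
# is captured by the single group, so findall returns the group's text
with_lookahead = re.findall(
    r"(?:^|\b)(?=[^.]*(?:apple|pear)[^.]*(?:red|green))([^.]+\.)", text)

print(grouped)
print(with_lookahead)
```

One difference worth keeping in mind: the look-ahead pattern has no \b around the keywords, so it would also accept words that merely contain them, such as "pineapple".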
I recommend you read about NLTK (Natural Language Toolkit). It's a Python package for text processing.
I tried using this function on a paragraph consisting of 3 sentences and some abbreviations.
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import re

def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    # to split by multiple characters
    # regular expressions are easiest (and fastest)
    sentenceEnders = re.compile(r'[.!?][\s]{1,2}[A-Z]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

if __name__ == '__main__':
    p = "While other species (e.g. horse mango, M. foetida) are also grown ,Mangifera indica – the common mango or Indian mango – is the only mango tree. Commonly cultivated in many tropical and subtropical regions, and its fruit is distributed essentially worldwide.In several cultures, its fruit and leaves are ritually used as floral decorations at weddings, public celebrations and religious "
    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print(s.strip())
The first character of the following sentence is removed.
Output received:
While other Mangifera species (e.g. horse mango, M. foetida) are also grown on a
more localized basis, Mangifera indica ΓÇô the common mango or Indian mango ΓÇô
is the only mango tree
ommonly cultivated in many tropical and subtropical regions, and its fruit is di
stributed essentially worldwide.In several cultures, its fruit and leaves are ri
tually used as floral decorations at weddings, public celebrations and religious.
Thus the string was split into only 2 strings, and the first character of the next sentence was removed. Also, some strange characters can be seen; I guess Python wasn't able to convert the hyphen.
If I alter the regex to [.!?][\s]{1,2}, I get:
While other species (e.g
horse mango, M
foetida) are also grown ,Mangifera indica ΓÇô the common mango or Indian mango Γ
Çô is the only mango tree
Commonly cultivated in many tropical and subtropical regions, and its fruit is d
istributed essentially worldwide.In several cultures, its fruit and leaves are r
itually used as floral decorations at weddings, public celebrations and religiou
s
Thus even the abbreviations get split.
The regex you want is:
[.!?][\s]{1,2}(?=[A-Z])
You want a positive lookahead assertion, which means you want to match the pattern if it's followed by a capital letter, but not match the capital letter.
Only the first sentence boundary was matched because there is no space after the second period ("worldwide.In").
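A short sketch of the difference, using a made-up paragraph rather than the one from the question: the original pattern consumes the capital letter, while the look-ahead version only checks for it.

```python
import re

# Hypothetical paragraph; note the missing space in "worldwide.In"
paragraph = ("Mango is a fruit. Commonly cultivated in tropical regions, "
             "it is sold worldwide.In several cultures (e.g. in India), "
             "its leaves are used as decorations. ")

# Original pattern: the capital letter is part of the separator and is lost
broken = re.split(r'[.!?]\s{1,2}[A-Z]', paragraph)

# Look-ahead pattern: the capital letter is checked but stays in the text
fixed = re.split(r'[.!?]\s{1,2}(?=[A-Z])', paragraph)

for sentence in fixed:
    print(sentence.strip())
```

In both cases the terminating punctuation is still part of the separator, so it is dropped from the pieces, and an abbreviation followed by a capitalized word (for example "M. Foetida") would still be split.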