Splitting paragraphs containing abbreviations into sentences in Python using a regular expression - python

I tried using this function on a paragraph consisting of 3 sentences, some containing abbreviations:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import re

def splitParagraphIntoSentences(paragraph):
    '''Break a paragraph into sentences and return a list.'''
    # To split on multiple characters, regular
    # expressions are easiest (and fastest).
    sentenceEnders = re.compile(r'[.!?][\s]{1,2}[A-Z]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

if __name__ == '__main__':
    p = "While other species (e.g. horse mango, M. foetida) are also grown ,Mangifera indica – the common mango or Indian mango – is the only mango tree. Commonly cultivated in many tropical and subtropical regions, and its fruit is distributed essentially worldwide.In several cultures, its fruit and leaves are ritually used as floral decorations at weddings, public celebrations and religious "
    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print s.strip()
The first character of each following sentence gets eliminated.
Output received:
While other Mangifera species (e.g. horse mango, M. foetida) are also grown on a more localized basis, Mangifera indica ΓÇô the common mango or Indian mango ΓÇô is the only mango tree
ommonly cultivated in many tropical and subtropical regions, and its fruit is distributed essentially worldwide.In several cultures, its fruit and leaves are ritually used as floral decorations at weddings, public celebrations and religious.
Thus the string got split into only 2 strings, and the first character of the next sentence got eliminated. Also some strange characters can be seen (ΓÇô); I guess Python wasn't able to convert the hyphen (an en dash in the source) for console output.
In case I alter the regex to [.!?][\s]{1,2}, I get:
While other species (e.g
horse mango, M
foetida) are also grown ,Mangifera indica ΓÇô the common mango or Indian mango ΓÇô is the only mango tree
Commonly cultivated in many tropical and subtropical regions, and its fruit is distributed essentially worldwide.In several cultures, its fruit and leaves are ritually used as floral decorations at weddings, public celebrations and religious
Thus even the abbreviations get split.

The regex you want is:
[.!?][\s]{1,2}(?=[A-Z])
You want a positive lookahead assertion, which means you want to match the pattern if it's followed by a capital letter, but not match the capital letter.
The reason only the first one got matched is that you don't have a space after the second period (worldwide.In).
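A minimal check of the corrected pattern (a sketch; the print statement matches the question's Python 2 code):
import re

# the lookahead asserts that a capital follows without consuming it,
# so the next sentence keeps its first character
sentenceEnders = re.compile(r'[.!?][\s]{1,2}(?=[A-Z])')
p = ("While other species (e.g. horse mango, M. foetida) are also grown, "
     "Mangifera indica is the only mango tree. Commonly cultivated in many "
     "tropical and subtropical regions.")
for s in sentenceEnders.split(p):
    print s.strip()
# While other species (e.g. horse mango, M. foetida) are also grown, Mangifera indica is the only mango tree
# Commonly cultivated in many tropical and subtropical regions.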


Data Cleaning - removing trailing phrases

I'm cleaning some data and wondering how to remove trailing phrases. I don't want to get rid of all numbers as some flavors have numbers. The first table is the pre-cleaned data, the second table is what I want.
Pre-cleaned data:
Flavor
Orange 5 ml
Cherry
Strawberry 5 mg/ml
#1 flavor
Passion fruit 1.
Cherry Blossom
What I want:
Flavor
Orange
Cherry
Strawberry
#1 flavor
Passion fruit
Cherry Blossom
Like all data cleansing, this requires knowledge of the entire dataset, so the help you can get is minimal. However, I've cooked up a regular expression that you can use to remove numbers, whitespace, units (ml, mg), slashes (/) and periods (.) from the end of the strings:
\s*\b[/mgl\d\s.]+$
You can use it like this:
df['Flavor'] = df['Flavor'].str.replace(r'\s*\b[/mgl\d\s.]+$', '', regex=True)
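For example, on a reconstruction of the question's column (a sketch; the real DataFrame is assumed to look like this):
import pandas as pd

# reconstruction of the question's 'Flavor' column for illustration
df = pd.DataFrame({'Flavor': ['Orange 5 ml', 'Cherry', 'Strawberry 5 mg/ml',
                              '#1 flavor', 'Passion fruit 1.', 'Cherry Blossom']})
df['Flavor'] = df['Flavor'].str.replace(r'\s*\b[/mgl\d\s.]+$', '', regex=True)
print(df['Flavor'].tolist())
# ['Orange', 'Cherry', 'Strawberry', '#1 flavor', 'Passion fruit', 'Cherry Blossom']
Note that '#1 flavor' survives because its trailing word contains letters outside the character class, so the pattern cannot anchor at the end of that string.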

Matching content and creating a new column

Hello, I have a dataset where I want to match my keywords with the location. The problem I am having is that a location such as "Afghanistan", "Kabul" or "Helmund" appears in my dataset in over 150 combinations, including spelling mistakes, different capitalization, and having a city or town attached to the name. What I want to do is create a separate column that returns the value 1 if any of the fragments "afg", "Afg", "kab" or "helm" are contained in the location. I am not sure if upper or lower case makes a difference.
For instance there are hundreds of location combinations like so: Jegdalak, Afghanistan, Afghanistan,Ghazni♥, Kabul/Afghanistan,
I have tried this code, and it works if it matches the phrase exactly, but there is too much variation to write every exception down:
keywords = ['Afghanistan', 'Kabul', 'Herat', 'Jalalabad', 'Kandahar',
            'Mazar-i-Sharif', 'Kunduz', 'Lashkargah', 'mazar',
            'afghanistan', 'kabul', 'herat', 'jalalabad', 'kandahar']

# how to make a column that shows rows with a certain keyword
def keyword_solution(value):
    strings = value.split()
    if any(word in strings for word in keywords):
        return 1
    else:
        return 0

taleban_2['keyword_solution'] = taleban_2['location'].apply(keyword_solution)

# below will return the rows with value 1
taleban_2[taleban_2['keyword_solution'].isin([1])].head(5)
I just need to replace this logic so that the "keyword_solution" column flags everything that matches "Afg"/"afg", "Kab"/"kab" or "Kund"/"kund".
Given the following:
- The sentences are from the New York Times.
- Remove all non-alphanumeric characters.
- Change everything to lowercase, thereby removing the need for different word variations.
- Split each sentence into a list or set. I used set because of the long sentences.
- Add to the keywords list as needed.
Matching words from two lists:
'afgh' in ['afghanistan']: False
'afgh' in 'afghanistan': True
Therefore, the list comprehension searches for each keyword in each word of word_list:
[True if word in y else False for y in x for word in keywords]
This allows the list of keywords to be shorter (i.e. given afgh, afghanistan is not required).
import re
import pandas as pd

keywords = ['jalalabad',
            'kunduz',
            'lashkargah',
            'mazar',
            'herat',
            'afgh',
            'kab',
            'kand']

df = pd.DataFrame({'sentences': ['The Taliban have wanted the United States to pull troops out of Afghanistan Turkey has wanted the Americans out of northern Syria and North Korea has wanted them to at least stop military exercises with South Korea.',
                                 'President Trump has now to some extent at least obliged all three — but without getting much of anything in return. The self-styled dealmaker has given up the leverage of the United States’ military presence in multiple places around the world without negotiating concessions from those cheering for American forces to leave.',
                                 'For a president who has repeatedly promised to get America out of foreign wars, the decisions reflect a broader conviction that bringing troops home — or at least moving them out of hot spots — is more important than haggling for advantage. In his view, decades of overseas military adventurism has only cost the country enormous blood and treasure, and waiting for deals would prolong a national disaster.',
                                 'The top American commander in Afghanistan, Gen. Austin S. Miller, said Monday that the size of the force in the country had dropped by 2,000 over the last year, down to somewhere between 13,000 and 12,000.',
                                 '“The U.S. follows its interests everywhere, and once it doesn’t reach those interests, it leaves the area,” Khairullah Khairkhwa, a senior Taliban negotiator, said in an interview posted on the group’s website recently. “The best example of that is the abandoning of the Kurds in Syria. It’s clear the Kabul administration will face the same fate.”',
                                 'afghan']})

# substitute non-alphanumeric characters with spaces
df['sentences'] = df['sentences'].apply(lambda x: re.sub(r'[\W_]+', ' ', x))
# create a new column with a set of all the words
df['word_list'] = df['sentences'].apply(lambda x: set(x.lower().split()))
# check each word set against the keywords
df['location'] = df.word_list.apply(lambda x: any([True if word in y else False for y in x for word in keywords]))
# final
print(df.location)
0 True
1 False
2 False
3 True
4 True
5 True
Name: location, dtype: bool
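An alternative worth noting (a sketch, not the approach above): pandas can run the substring test directly with str.contains and a joined, case-insensitive pattern, skipping the word_list column. This matches keywords anywhere in a sentence, not only at word starts, which may or may not be acceptable for your data:
import re

# assumes df['sentences'] and keywords as defined above
pattern = '|'.join(map(re.escape, keywords))
df['location'] = df['sentences'].str.contains(pattern, case=False, regex=True)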

Python: replace \n, \r, \t in a list, excluding substrings that start with \n\n and end with \n\r\n\t

My List:
['\n\r\n\tThis article is about sweet bananas. For the genus to which
banana plants belong, see Musa (genus).\n\r\n\tFor starchier bananas
used in cooking, see Cooking banana. For other uses, see Banana
(disambiguation)\n\r\n\tMusa species are native to tropical Indomalaya
and Australia, and are likely to have been first domesticated in Papua
New Guinea.\n\r\n\tThey are grown in 135
countries.\n\n\n\r\n\tWorldwide, there is no sharp distinction between
"bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is
the largest herbaceous flowering plant.\n\r\n\tAll the above-ground
parts of a banana plant grow from a structure usually called a
"corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West
African origin, possibly from the Wolof word banaana, and passed into
English via Spanish or Portuguese.\n']
Example code:
import requests
from bs4 import BeautifulSoup
import re

res = requests.get('http://www.abcde.com/banana')  # avoid calling this `re`, which shadows the re module
soup = BeautifulSoup(res.text, "html.parser")
title_tag = soup.select_one('.page_article_title')
print(title_tag.text)

content_list = []  # avoid calling this `list`, which shadows the built-in
for tag in soup.select('.page_article_content'):
    content_list.append(tag.text)
# content_list = [c.replace('\n', '') for c in content_list]
# content_list = [c.replace('\r', '') for c in content_list]
# content_list = [c.replace('\t', '') for c in content_list]
print(content_list)
After scraping a web page, I need to do data cleansing. I want to replace every "\r", "\n", "\t" with "", but I found that I have subtitles in the text, and if I do this, the subtitles and sentences will mix together.
Every subtitle always starts with \n\n and ends with \n\r\n\t. Is it possible to do something to distinguish them in this list, like \aEtymology\a? It won't work to replace \n\n and \n\r\n\t separately with \a first, because other parts may contain the same elements, like \n\n\r, which would then become \a\r. Thanks in advance!
Approach
1. Replace the subtitles with a custom string, <subtitles>, in the list.
2. Replace the \n, \r, \t etc. in the list.
3. Replace the custom string with the actual subtitle.
Code
l=['\n\r\n\tThis article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).\n\r\n\tFor starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)\n\r\n\tMusa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.\n\r\n\tThey are grown in 135 countries.\n\n\n\r\n\tWorldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.\n\r\n\tAll the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.\n']
import re

# find the subtitle patterns (\n\n ... \n\r\n\t)
regex = re.findall("\n\n.*.\n\r\n\t", l[0])
print(regex)
# swap each subtitle out for a placeholder
for x in regex:
    l = [r.replace(x, "<subtitles>") for r in l]
# strip the remaining control characters
rep = ['\n', '\t', '\r']
for y in rep:
    l = [r.replace(y, '') for r in l]
# put the subtitles back, one occurrence at a time
for x in regex:
    l = [r.replace('<subtitles>', x, 1) for r in l]
print(l)
Output
['\n\nDescription\n\r\n\t', '\n\nEtymology\n\r\n\t']
['This article is about sweet bananas. For the genus to which banana plants belong, see Musa (genus).For starchier bananas used in cooking, see Cooking banana. For other uses, see Banana (disambiguation)Musa species are native to tropical Indomalaya and Australia, and are likely to have been first domesticated in Papua New Guinea.They are grown in 135 countries.Worldwide, there is no sharp distinction between "bananas" and "plantains".\n\nDescription\n\r\n\tThe banana plant is the largest herbaceous flowering plant.All the above-ground parts of a banana plant grow from a structure usually called a "corm".\n\nEtymology\n\r\n\tThe word banana is thought to be of West African origin, possibly from the Wolof word banaana, and passed into English via Spanish or Portuguese.']
I think you may use a regex directly:
import re
print([re.sub(r'[\n\r\t]', '', c) for c in content_list])
You can do this by using regular expressions:
import re
subtitle = re.compile(r'\n\n(\w+)\n\r\n\t')
new_list = [subtitle.sub(r"\a\g<1>\a", l) for l in li]  # li is the scraped list from the question
\g<1> is a backreference to the (\w+) in the first regex. It lets you reuse whatever is in there.
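Applied to a shortened sample of the question's data (a sketch; li stands in for the scraped list, and \w+ assumes single-word subtitles such as Etymology):
import re

li = ['\n\r\n\tSome text.\n\nEtymology\n\r\n\tMore text.\n']  # shortened sample
subtitle = re.compile(r'\n\n(\w+)\n\r\n\t')
new_list = [subtitle.sub(r'\a\g<1>\a', l) for l in li]
# the remaining \n, \r, \t can now be stripped without touching the \a markers
new_list = [re.sub(r'[\n\r\t]', '', l) for l in new_list]
print(new_list)  # ['Some text.\x07Etymology\x07More text.']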

Discover identically adjacent strings with regex and python

Consider this text:
...
bedeubedeu France The Provençal name for tripe
bee balmbee balm Bergamot
beechmastbeechmast Beech nut
beech nutbeech nut A small nut from the beech tree,
genus Fagus and Nothofagus, similar in
flavour to a hazelnut but not commonly used.
A flavoursome oil can be extracted from
them. Also called beechmast
beechwheatbeechwheat Buckwheat
beefbeef The meat of the animal known as a cow
(female) or bull (male) (NOTE: The Anglo-
saxon name ‘Ox’ is still used for some of what
were once the less desirable parts e.g. oxtail,
ox liver)
beef bourguignonnebeef bourguignonne See boeuf à la
bourguignonne
...
I would like to parse this text with Python and keep only the strings that appear exactly twice and are adjacent. For example, an acceptable result should be:
bedeu
bee balm
beechmast
beech nut
beechwheat
beef
beef bourguignonne
because the trend is that each string appears adjacent to an identical one, just like this:
bedeubedeu
bee balmbee balm
beechmastbeechmast
beech nutbeech nut
beechwheatbeechwheat
beefbeef
beef bourguignonnebeef bourguignonne
So how can someone search for adjacent and identical strings with a regular expression? I am testing my trials here. Thanks!
You can use the following regex:
(\b.+)\1
See demo
Or, to just match and capture the unique substring part:
(\b.+)(?=\1)
Another demo
The word boundary \b makes sure we only match at the beginning of a word. We then match 1 or more characters other than a newline (in DOTALL mode, . would also match a newline), and then, with the help of a backreference, we match exactly the same sequence of characters that was captured with (\b.+).
When using the version with a (?=\1) look-ahead, the matched text does not contain the duplicate part because look-aheads do not consume text and the match does not contain those chunks.
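A quick sketch of the difference on one of the sample lines:
import re

line = 'bee balmbee balm'
print(re.search(r'(\b.+)\1', line).group())      # 'bee balmbee balm' (duplicate consumed)
print(re.search(r'(\b.+)(?=\1)', line).group())  # 'bee balm' (duplicate only asserted)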
UPDATE
See Python demo:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

p = re.compile(ur'(\b.+)\1')
test_str = u"zymezyme Yeast, the origin of the word enzyme, as the first enzymes were extracted from yeast Page 632 Thursday, August 19, 2004 7:50 PM\nabbrühenabbrühen"
for i in p.finditer(test_str):
    print i.group(1).encode('utf-8')
Output:
zyme
abbrühen
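For reference, a straightforward Python 3 port of the same demo (strings are Unicode by default, and ur'' literals are gone):
import re

p = re.compile(r'(\b.+)\1')
test_str = "zymezyme Yeast, the origin of the word enzyme\nabbrühenabbrühen"
for m in p.finditer(test_str):
    print(m.group(1))
# zyme
# abbrühen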

Python: Finding The Longest/Shortest Sentence In A Random Paragraph?

I am using Python 2.7 and need 2 functions to find the longest and shortest sentence (in terms of word count) in a random paragraph. For example, if I choose to put in this paragraph:
"Pair your seaside escape with the reds and whites of northern California's wine country in Jenner. This small coastal city in Sonoma County sits near the mouth of the Russian River, where, all summer long, harbor seals and barking California sea lions heave themselves onto the sand spit, sunning themselves for hours. You can swim and hike at Fort Ross State Historic Park and learn about early Russian hunters who were drawn to the area's herds of seal for their fur pelts. The fort's vineyard, with vines dating back to 1817, was one of the first places in California where grapes were planted."
The output for this should be 36 and 16: 36 words in the longest sentence and 16 words in the shortest sentence.
def MaxMinWords(paragraph):
    numWords = [len(sentence.split()) for sentence in paragraph.split('.')]
    return max(numWords), min(numWords)
EDIT: As many have pointed out in the comments, this solution is far from robust. The point of this snippet is simply to serve as a pointer for the OP.
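For instance, on a hypothetical input, both the abbreviation and the trailing period skew the counts:
print MaxMinWords("Dr. Smith arrived. He left.")  # (2, 0)
# 'Dr' is counted as a one-word sentence, and the trailing
# period leaves an empty zero-word "sentence" at the end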
You need a way to split the paragraph into sentences and to count words in a sentence. You could use nltk package for both:
from nltk.tokenize import sent_tokenize, word_tokenize # $ pip install nltk
sentences = sent_tokenize(paragraph)
word_count = lambda sentence: len(word_tokenize(sentence))
print(min(sentences, key=word_count)) # the shortest sentence by word count
print(max(sentences, key=word_count)) # the longest sentence by word count
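To report the counts themselves (the 36 and 16 the OP asks for), the same key function can be reused; note that word_tokenize counts punctuation as separate tokens, so its numbers may differ slightly from a plain whitespace split:
shortest = min(sentences, key=word_count)
longest = max(sentences, key=word_count)
print(word_count(shortest))  # word count of the shortest sentence
print(word_count(longest))   # word count of the longest sentence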
EDIT: As has been mentioned in the comments below, programmatically determining what constitutes a sentence in a paragraph is quite a complex task. However, given the example you provided, I have sketched a reasonable start toward solving your problem below.
First, we want to tokenize the paragraph into sentences. We do this by splitting the text on every occurrence of ". " (a period followed by a space). This returns a list of strings, each of which is a sentence.
We then want to break each sentence into its corresponding list of words. Then, using this list of lists, we want the sentence (represented as a list of words) whose length is a maximum and the sentence whose length is a minimum. Consider the following code:
par = "Pair your seaside escape with the reds and whites of northern California's wine country in Jenner. This small coastal city in Sonoma County sits near the mouth of the Russian River, where, all summer long, harbor seals and barking California sea lions heave themselves onto the sand spit, sunning themselves for hours. You can swim and hike at Fort Ross State Historic Park and learn about early Russian hunters who were drawn to the area's herds of seal for their fur pelts. The fort's vineyard, with vines dating back to 1817, was one of the first places in California where grapes were planted."
# split paragraph into sentences
sentences = par.split(". ")
# split each sentence into words
tokenized_sentences = [sentence.split(" ") for sentence in sentences]
# get longest sentence and its length
longest_sen = max(tokenized_sentences, key=len)
longest_sen_len = len(longest_sen)
# get shortest sentence and its length
shortest_sen = min(tokenized_sentences, key=len)
shortest_sen_len = len(shortest_sen)
print longest_sen_len
print shortest_sen_len
