I am using Python 2.7 and need 2 functions to find the longest and shortest sentence (in terms of word count) in a random paragraph. For example, if I choose to put in this paragraph:
"Pair your seaside escape with the reds and whites of northern California's wine country in Jenner. This small coastal city in Sonoma County sits near the mouth of the Russian River, where, all summer long, harbor seals and barking California sea lions heave themselves onto the sand spit, sunning themselves for hours. You can swim and hike at Fort Ross State Historic Park and learn about early Russian hunters who were drawn to the area's herds of seal for their fur pelts. The fort's vineyard, with vines dating back to 1817, was one of the first places in California where grapes were planted."
The output for this should be 36 and 16, meaning the longest sentence has 36 words and the shortest has 16.
def MaxMinWords(paragraph):
    numWords = [len(sentence.split()) for sentence in paragraph.split('.')]
    return max(numWords), min(numWords)
EDIT: As many have pointed out in the comments, this solution is far from robust. The point of this snippet is simply to serve as a pointer for the OP.
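For completeness, a minimal sketch that guards against the empty chunk produced by the trailing period (without the guard, min() would return 0, since splitting on '.' leaves an empty string at the end); this assumes paragraph holds the example text:
def MaxMinWords(paragraph):
    # skip empty chunks, e.g. the one after the final period
    numWords = [len(s.split()) for s in paragraph.split('.') if s.strip()]
    return max(numWords), min(numWords)

print(MaxMinWords(paragraph))  # (36, 16) for the example paragraph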
You need a way to split the paragraph into sentences and to count words in a sentence. You could use the nltk package for both:
from nltk.tokenize import sent_tokenize, word_tokenize # $ pip install nltk
sentences = sent_tokenize(paragraph)
word_count = lambda sentence: len(word_tokenize(sentence))
print(min(sentences, key=word_count)) # the shortest sentence by word count
print(max(sentences, key=word_count)) # the longest sentence by word count
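If you need the counts themselves rather than the sentences, the same key function works; note that word_tokenize also counts punctuation as tokens, so the numbers will be slightly higher than those from a plain whitespace split:
longest = max(sentences, key=word_count)
shortest = min(sentences, key=word_count)
print(word_count(longest), word_count(shortest))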
EDIT: As has been mentioned in the comments below, programmatically determining what constitutes a sentence in a paragraph is quite a complex task. However, given the example you provided, here is a reasonable starting point for solving your problem.
First, we want to tokenize the paragraph into sentences. We do this by splitting the text on every occurrence of ". " (a period followed by a space). This returns a list of strings, each of which is a sentence.
We then want to break each sentence into its corresponding list of words. Then, using this list of lists, we want the sentence (represented as a list of words) whose length is a maximum and the sentence whose length is a minimum. Consider the following code:
par = "Pair your seaside escape with the reds and whites of northern California's wine country in Jenner. This small coastal city in Sonoma County sits near the mouth of the Russian River, where, all summer long, harbor seals and barking California sea lions heave themselves onto the sand spit, sunning themselves for hours. You can swim and hike at Fort Ross State Historic Park and learn about early Russian hunters who were drawn to the area's herds of seal for their fur pelts. The fort's vineyard, with vines dating back to 1817, was one of the first places in California where grapes were planted."
# split paragraph into sentences
sentences = par.split(". ")
# split each sentence into words
tokenized_sentences = [sentence.split(" ") for sentence in sentences]
# get longest sentence and its length
longest_sen = max(tokenized_sentences, key=len)
longest_sen_len = len(longest_sen)
# get shortest sentence and its length
shortest_sen = min(tokenized_sentences, key=len)
shortest_sen_len = len(shortest_sen)
print longest_sen_len
print shortest_sen_len
I have a corpus that looks something like this
LETTER AGREEMENT N°5 CHINA SOUTHERN AIRLINES COMPANY LIMITED Bai Yun
Airport, Guangzhou 510405, People's Republic of China Subject: Delays
CHINA SOUTHERN AIRLINES COMPANY LIMITED (the ""Buyer"") and AIRBUS
S.A.S. (the ""Seller"") have entered into a purchase agreement (the
""Agreement"") dated as of even date
And a list of company names that looks like this
l = [ 'airbus', 'airbus internal', 'china southern airlines', ... ]
The elements of this list do not always have exact matches in the corpus, because of different formulations or just typos: for this reason I want to perform fuzzy matching.
What is the most efficient way of finding the best matches of l in the corpus? In theory the task is not super difficult but I don't see a way of solving it that does not entail looping through both the corpus and list of matches, which could cause huge slowdowns.
You can concatenate your list l into a single regex pattern, then use the regex module to fuzzy-match (https://github.com/mrabarnett/mrab-regex#approximate-fuzzy-matching-hg-issue-12-hg-issue-41-hg-issue-109) the words in the corpus.
Something like:
my_regex = ''
for pattern in l:
    # {1<=e<=3} permits at least 1 and at most 3 errors
    my_regex += f'(?:{pattern})' + '{1<=e<=3}|'
my_regex = my_regex[:-1]  # remove the trailing |
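A minimal sketch of how the combined pattern might be applied, assuming corpus holds the text above; BESTMATCH asks the engine for the best fuzzy match rather than the first acceptable one, and fuzzy_counts reports a (substitutions, insertions, deletions) tuple:
import regex  # pip install regex

for m in regex.finditer(my_regex, corpus, flags=regex.IGNORECASE | regex.BESTMATCH):
    print(m.group(), m.fuzzy_counts)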
Suppose we have sentence = "George coudn't play football in y. 1998 but plays football at θ. 226", where "y" and "θ" stand for any letter of the English or Greek alphabet. Is there any way to get the output = "George coudn't play football in but plays football at"?
I tried this, which removes only the numbers:
re_numb = re.compile(r'\d+')
sent = re_numb.sub('', sent)
Just use a Unicode range as in
\s+[\u03b1-\u03c9]+\.\s+\d+
See a demo on regex101.com and a Unicode table for Greek letters.
In Python this could be
import re
pattern = re.compile(r'\s+[\u03b1-\u03c9]+\.\s+\d+')
sentence = "George coudn't play football in γ. 1998 but plays football at θ. 226"
sentence = pattern.sub('', sentence)
print(sentence)
This yields:
George coudn't play football in but plays football at
In Python, \w also matches Greek letters, so you can use:
\b\w\. +\d+\b
where \b is a word boundary.
If you don't want \w to match also underscores:
\b[^_\W]\. +\d+\b
See demo
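As a quick sketch of the second pattern in use (Python 3, where \w, and hence [^_\W], is Unicode-aware by default):
import re

sentence = "George coudn't play football in γ. 1998 but plays football at θ. 226"
# [^_\W] = any word character except the underscore
cleaned = re.sub(r'\b[^_\W]\. +\d+\b', '', sentence)
print(' '.join(cleaned.split()))  # collapse the leftover double spaces
# George coudn't play football in but plays football at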
The following regex captures the text before y. 1998 and θ. 226:
(\D)+(?=\s.\.\s\d+)
Demo
If you want to capture only up to the first match, add ^ so the match is anchored at the start of the string:
^(\D)+(?=\s.\.\s\d+)
Demo
EDIT
Code sample
To extract each match
import re
text = "George couldn't play football in y. 1998 but plays football at θ. 226"
for match in re.finditer(r'(\D)+(?=\s.\.\s\d+)', text):
    print(match.group(), end='')  # print without a newline
Output
George couldn't play football in but plays football at
To extract only the first match
import re
text = "George couldn't play football in y. 1998 but plays football at θ. 226"
for match in re.finditer(r'^(\D)+(?=\s.\.\s\d+)', text):
    print(match.group(), end='')
Output
George couldn't play football in
Hello, I have a dataset where I want to match my keywords with the location. The problem I am having is that locations like "Afghanistan", "Kabul", or "Helmund" appear in my dataset in over 150 combinations, including spelling mistakes, different capitalization, and the city or town attached to the name. What I want to do is create a separate column that returns the value 1 if any of the substrings "afg", "Afg", "kab", or "helm" are contained in the location. I am not sure if upper or lower case makes a difference.
For instance there are hundreds of location combinations like so: Jegdalak, Afghanistan, Afghanistan,Ghazni♥, Kabul/Afghanistan,
I have tried the code below, which works if the phrase matches exactly, but there is too much variation to write every exception down:
keywords= ['Afghanistan','Kabul','Herat','Jalalabad','Kandahar','Mazar-i-Sharif', 'Kunduz', 'Lashkargah', 'mazar', 'afghanistan','kabul','herat','jalalabad','kandahar']
#how to make a column that shows rows with a certain keyword..
def keyword_solution(value):
    strings = value.split()
    if any(word in strings for word in keywords):
        return 1
    else:
        return 0
taleban_2['keyword_solution'] = taleban_2['location'].apply(keyword_solution)
# below will return the 1 values
taleban_2[taleban_2['keyword_solution'].isin(['1'])].head(5)
I just need to replace this logic so that the "keyword_solution" column flags any location containing "Afg", "afg", "kab", "Kab", "kund", or "Kund".
Given the following:
- Sentences from the New York Times
- Remove all non-alphanumeric characters
- Change everything to lowercase, thereby removing the need for different word variations
- Split each sentence into a list or set. I used set because of the long sentences.
- Add to the keywords list as needed
Matching words from two lists
'afgh' in ['afghanistan']: False
'afgh' in 'afghanistan': True
Therefore, the list comprehension searches for each keyword, in each word of word_list.
[True if word in y else False for y in x for word in keywords]
This allows the list of keywords to be shorter (i.e. given afgh, afghanistan is not required)
import re
import pandas as pd
keywords = ['jalalabad',
            'kunduz',
            'lashkargah',
            'mazar',
            'herat',
            'afgh',
            'kab',
            'kand']
df = pd.DataFrame({'sentences': ['The Taliban have wanted the United States to pull troops out of Afghanistan Turkey has wanted the Americans out of northern Syria and North Korea has wanted them to at least stop military exercises with South Korea.',
'President Trump has now to some extent at least obliged all three — but without getting much of anything in return. The self-styled dealmaker has given up the leverage of the United States’ military presence in multiple places around the world without negotiating concessions from those cheering for American forces to leave.',
'For a president who has repeatedly promised to get America out of foreign wars, the decisions reflect a broader conviction that bringing troops home — or at least moving them out of hot spots — is more important than haggling for advantage. In his view, decades of overseas military adventurism has only cost the country enormous blood and treasure, and waiting for deals would prolong a national disaster.',
'The top American commander in Afghanistan, Gen. Austin S. Miller, said Monday that the size of the force in the country had dropped by 2,000 over the last year, down to somewhere between 13,000 and 12,000.',
'“The U.S. follows its interests everywhere, and once it doesn’t reach those interests, it leaves the area,” Khairullah Khairkhwa, a senior Taliban negotiator, said in an interview posted on the group’s website recently. “The best example of that is the abandoning of the Kurds in Syria. It’s clear the Kabul administration will face the same fate.”',
'afghan']})
# substitute non-alphanumeric characters
df['sentences'] = df['sentences'].apply(lambda x: re.sub(r'[\W_]+', ' ', x))
# create a new column with a list of all the words
df['word_list'] = df['sentences'].apply(lambda x: set(x.lower().split()))
# check the list against the keywords
df['location'] = df.word_list.apply(lambda x: any([True if word in y else False for y in x for word in keywords]))
# final
print(df.location)
0 True
1 False
2 False
3 True
4 True
5 True
Name: location, dtype: bool
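If you want the 1/0 column from the question rather than booleans, the flag converts directly; a small addition, assuming the same df as above:
df['keyword_solution'] = df['location'].astype(int)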
I have text data to be cleaned using regex. However, some words in the text are immediately followed by numbers which I want to remove.
For example, one row of the text is:
Preface2 Contributors4 Abrreviations5 Acknowledgements8 Pes
terminology10 Lessons learnt from the RUPES project12 Payment for
environmental service and it potential and example in Vietnam16
Chapter Integrating payment for ecosystem service into Vietnams policy
and programmes17 Chapter Creating incentive for Tri An watershed
protection20 Chapter Sustainable financing for landscape beauty in
Bach Ma National Park 24 Chapter Building payment mechanism for carbon
sequestration in forestry a pilot project in Cao Phong district of Hoa
Binh province Vietnam26 Chapter 5 Local revenue sharing Nha Trang Bay
Marine Protected Area Vietnam28 Synthesis and Recommendations30
References32
The first word in the above text should be 'preface' instead of 'preface2' and so on.
line = re.sub(r"[A-Za-z]+(\d+)", "", line)
This, however, removes the words as well, as seen here:
Pes Lessons learnt from the RUPES Payment for environmental service
and it potential and example in Chapter Integrating payment for
ecosystem service into Vietnams policy and Chapter Creating incentive
for Tri An watershed Chapter Sustainable financing for landscape
beauty in Bach Ma National Park 24 Chapter Building payment mechanism
for carbon sequestration in forestry a pilot project in Cao Phong
district of Hoa Binh province Chapter 5 Local revenue sharing Nha
Trang Bay Marine Protected Area Synthesis and
How can I capture only the numbers that immediately follow words?
You can capture the text part and substitute the whole match with the captured part. It is simply:
re.sub(r"([A-Za-z]+)\d+", r"\1", line)
You could try a lookbehind assertion to check for a word before your numbers, and a word boundary (\b) at the end to force your regex to match only numbers at the end of a word (note that Python's re requires a fixed-width lookbehind, so (?<=\w) rather than (?<=\w+)):
re.sub(r'(?<=\w)\d+\b', '', line)
Hope this helps
EDIT:
Sorry about the glitch mentioned in the comments, where numbers that are NOT preceded by words were matched as well. That is because (sorry again) \w matches alphanumeric characters, not only alphabetic ones. Depending on what you would like to delete, you can use the positive version
re.sub(r'(?<=[a-zA-Z])\d+\b', '', line)
to check only for English alphabetic characters (you can add characters to the [a-zA-Z] class) preceding your number, or the negative version
re.sub(r'(?<![\d\s])\d+\b', '', line)
to match anything that is NOT \d (numbers) or \s (spaces) before your desired number. This will also match punctuation marks though.
Try this:
line = re.sub(r"([A-Za-z]+)(\d+)", "\\2", line) #just keep the number
line = re.sub(r"([A-Za-z]+)(\d+)", "\\1", line) #just keep the word
line = re.sub(r"([A-Za-z]+)(\d+)", r"\2", line) #same as first one
line = re.sub(r"([A-Za-z]+)(\d+)", r"\1", line) #same as second one
\\1 refers to the captured word, \\2 to the number. See: How to use python regex to replace using captured group?
Below, I'm proposing a working sample of code that might solve your problem.
Here's the snippet:
import re

# A function that takes the text data as input and returns the
# desired result as stated in your question.
def transform(data):
    """Strip the trailing numbers from words in a text."""
    # First, let's construct a pattern matching the words we're looking for
    pattern1 = r"[A-Za-z]+\d+"
    # Let's construct another pattern that strips the trailing digits
    # from each matched word
    pattern2 = r"\d+$"
    # Find all matching words
    matches = re.findall(pattern1, data)
    # Construct a replacement for each word by removing its digits
    replacements = [re.sub(pattern2, '', match) for match in matches]
    # Pair each matched word with its replacement for use with str.replace
    changers = zip(matches, replacements)
    # Now replace every matched word in turn
    output = data
    for changer in changers:
        output = output.replace(*changer)
    # The work is done; we can return the result
    return output
For testing purposes, we run the above function on your test data:
data = """
Preface2 Contributors4 Abrreviations5 Acknowledgements8 Pes terminology10 Lessons
learnt from the RUPES project12 Payment for environmental service and it potential and
example in Vietnam16 Chapter Integrating payment for ecosystem service into Vietnams
policy and programmes17 Chapter Creating incentive for Tri An watershed protection20
Chapter Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter
Building payment mechanism for carbon sequestration in forestry a pilot project in Cao
Phong district of Hoa Binh province Vietnam26 Chapter 5 Local revenue sharing Nha Trang
Bay Marine Protected Area Vietnam28 Synthesis and Recommendations30 References32
"""
result = transform(data)
print(result)
And the result looks like this:
Preface Contributors Abrreviations Acknowledgements Pes terminology Lessons learnt from
the RUPES project Payment for environmental service and it potential and example in
Vietnam Chapter Integrating payment for ecosystem service into Vietnams policy and
programmes Chapter Creating incentive for Tri An watershed protection Chapter
Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter Building
payment mechanism for carbon sequestration in forestry a pilot project in Cao Phong
district of Hoa Binh province Vietnam Chapter 5 Local revenue sharing Nha Trang Bay
Marine Protected Area Vietnam Synthesis and Recommendations References
You can use a character range for the digits as well, though note that this removes every digit in the text, including free-standing numbers like 24:
re.sub(r"[0-9]", "", line)
I tried using this function on a paragraph consisting of 3 sentences and some abbreviations.
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
    and return a list '''
    import re
    # to split by multiple characters
    # regular expressions are easiest (and fastest)
    sentenceEnders = re.compile(r'[.!?][\s]{1,2}[A-Z]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList
if __name__ == '__main__':
    p = "While other species (e.g. horse mango, M. foetida) are also grown ,Mangifera indica – the common mango or Indian mango – is the only mango tree. Commonly cultivated in many tropical and subtropical regions, and its fruit is distributed essentially worldwide.In several cultures, its fruit and leaves are ritually used as floral decorations at weddings, public celebrations and religious "
    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print s.strip()
The first character of the next sentence is eliminated. Output received:
While other Mangifera species (e.g. horse mango, M. foetida) are also grown on a
more localized basis, Mangifera indica ΓÇô the common mango or Indian mango ΓÇô
is the only mango tree
ommonly cultivated in many tropical and subtropical regions, and its fruit is di
stributed essentially worldwide.In several cultures, its fruit and leaves are ri
tually used as floral decorations at weddings, public celebrations and religious.
Thus the string got split into only 2 sentences, and the first character of the next sentence got eliminated. Also, some strange characters can be seen; I guess Python wasn't able to convert the hyphen.
In case I alter the regex to [.!?][\s]{1,2}, the output is:
While other species (e.g
horse mango, M
foetida) are also grown ,Mangifera indica ΓÇô the common mango or Indian mango Γ
Çô is the only mango tree
Commonly cultivated in many tropical and subtropical regions, and its fruit is d
istributed essentially worldwide.In several cultures, its fruit and leaves are r
itually used as floral decorations at weddings, public celebrations and religiou
s
Thus even the abbreviations get split.
The regex you want is:
[.!?][\s]{1,2}(?=[A-Z])
You want a positive lookahead assertion, which means you want to match the pattern if it's followed by a capital letter, but not match the capital letter.
The reason only the first one got matched is that you don't have a space after the second period.
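A minimal sketch of the fix on a shortened test paragraph of my own; because the lookahead checks for the capital letter without consuming it, abbreviations like M. foetida stay intact and the first character of each sentence is preserved:
import re

sentenceEnders = re.compile(r'[.!?][\s]{1,2}(?=[A-Z])')
p = ("While other species (e.g. horse mango, M. foetida) are also grown, "
     "Mangifera indica is the only mango tree. Commonly cultivated in many "
     "tropical and subtropical regions, its fruit is distributed worldwide.")
for s in sentenceEnders.split(p):
    print(s.strip())
# While other species (e.g. horse mango, M. foetida) are also grown, Mangifera indica is the only mango tree
# Commonly cultivated in many tropical and subtropical regions, its fruit is distributed worldwide.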