I'm trying to split a text into sentences with the PunktSentenceTokenizer from nltk. The text contains lists introduced by bullet points, but the bullet items are not recognized as new sentences. I tried to add some parameters, but that didn't work. Is there another way?
Here is some example code:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
params = PunktParameters()
params.sent_starters = set(['•'])
tokenizer = PunktSentenceTokenizer(params)
tokenizer.tokenize('• I am a sentence • I am another sentence')
['• I am a sentence • I am another sentence']
You can subclass PunktLanguageVars and adapt the sent_end_chars attribute to fit your needs like so:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktLanguageVars

class BulletPointLangVars(PunktLanguageVars):
    sent_end_chars = ('.', '?', '!', '•')

tokenizer = PunktSentenceTokenizer(lang_vars=BulletPointLangVars())
tokenizer.tokenize(u"• I am a sentence • I am another sentence")
This will result in the following output:
['•', 'I am a sentence •', 'I am another sentence']
However, this makes • a sentence end marker, while in your case it is more of a sentence start marker. Thus this example text:
I introduce a list of sentences.
• I am sentence one
• I am sentence two
And I am one, too!
Would, depending on the details of your text, result in something like the following:
>>> tokenizer.tokenize("""
Look at these sentences:
• I am sentence one
• I am sentence two
But I am one, too!
""")
['\nLook at these sentences:\n\n•', 'I am sentence one\n•', 'I am sentence two\n\nBut I am one, too!\n']
One reason why PunktSentenceTokenizer is used for sentence tokenization instead of a simple multi-delimiter split function is that it is able to learn to distinguish punctuation that ends a sentence from punctuation used for other purposes, as in "Mr.", for example.
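For instance, the pre-trained English Punkt model typically keeps abbreviations and decimals together where a naive split on '.' would not (a quick illustration with a made-up sentence):

from nltk.tokenize import sent_tokenize

text = "Mr. Smith bought cheapsite.com for 1.5 million dollars. Did he mind?"
print(sent_tokenize(text))
# With the standard English model this typically yields:
# ['Mr. Smith bought cheapsite.com for 1.5 million dollars.', 'Did he mind?']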
There should, however, be no such complications for •, so I would advise you to write a simple parser to preprocess the bullet point formatting instead of abusing PunktSentenceTokenizer for something it is not really designed for.
How this might be achieved in detail is dependent on how exactly this kind of markup is used in the text.
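For example, if a bullet always marks the start of a new item, a minimal preprocessing sketch could look like this (the splitting rule and the sample string are only assumptions about how your markup is used):

import re
from nltk.tokenize import sent_tokenize

def split_bullets(text):
    # Treat every bullet as the start of a new chunk, then let the ordinary
    # sentence tokenizer handle each chunk on its own.
    chunks = [chunk.strip() for chunk in re.split(r'\s*•\s*', text) if chunk.strip()]
    sentences = []
    for chunk in chunks:
        sentences.extend(sent_tokenize(chunk))
    return sentences

print(split_bullets("Look at these sentences: • I am sentence one • I am sentence two. But I am one, too!"))
# roughly: ['Look at these sentences:', 'I am sentence one', 'I am sentence two.', 'But I am one, too!']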
I am trying to identify every multi-word expression in a sentence and tokenize that sentence. For instance, the example input sentence is "In short, this merchandise is in short supply." and I wish the output could be shown as below:
['In short', ',', 'this', 'merchandise', 'is', 'in short supply', '.']
I have already achieved the aforesaid result by using a predefined list and the following python code.
from nltk import sent_tokenize, word_tokenize
from nltk.tokenize import MWETokenizer
multiwordExpressionList = [("In", "short"), ("in", "short", "supply" )] ## this is a predefined list
sentence = "In short, this merchandise is in short supply."
mwe = MWETokenizer(multiwordExpressionList, separator = ' ')
resultList = mwe.tokenize(word_tokenize(sentence))
print(resultList)
However, the drawback is quite clear. This program needs a predefined multi-word expression list to identify whether any multi-word expressions exist in a sentence. Is there any suggested Python package, module, or method that can identify multi-word expressions in a sentence?
You have two options: either use an existing model (e.g. Stanza, which also supports MWE: https://stanfordnlp.github.io/stanza/mwt.html), or, if you have enough data, build your own MWE model using, for example, Gensim: https://radimrehurek.com/gensim/models/phrases.html. I have used the second option fairly successfully.
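To sketch the Gensim route: you train a Phrases model on a tokenized corpus and then run new sentences through it. The toy corpus and the very low min_count/threshold below are made up purely for illustration; a real model needs far more data.

from gensim.models.phrases import Phrases

# Toy corpus of pre-tokenized sentences.
corpus = [
    ["in", "short", ",", "this", "merchandise", "is", "in", "short", "supply", "."],
    ["the", "item", "is", "in", "short", "supply", "everywhere", "."],
    ["in", "short", ",", "demand", "exceeds", "supply", "."],
]

# Learn which adjacent tokens co-occur often enough to be treated as one phrase.
bigrams = Phrases(corpus, min_count=1, threshold=1)

# Apply the learned phrase model to a new tokenized sentence.
print(bigrams[["this", "merchandise", "is", "in", "short", "supply", "."]])
# e.g. ['this', 'merchandise', 'is', 'in_short', 'supply', '.'] depending on the statistics

Running Phrases a second time, on the already-phrased corpus, is the usual way to get longer expressions such as in_short_supply.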
I'm trying to write code which passes in text that has been tokenized and had the stop words filtered out, and then stems and tags it. However, I'm not sure in what order I should stem and tag. This is what I have at the moment:
#### Stemming
ps = PorterStemmer()  # PorterStemmer imported from nltk.stem
stemText = []
for word in swFiltText:  # Tagged text w/o stop words
    stemText.append(ps.stem(word))

#### POS Tagging
def tagging():
    tagTot = []
    try:
        for i in stemText:
            words = nltk.word_tokenize(i)  # I need to tokenize again (idk why?)
            tagged = nltk.pos_tag(words)
            tagTot = tagTot + tagged  # Combine tagged words into list
    except Exception as e:
        print(str(e))
    return tagTot

tagText = tagging()
At first glance, this works just fine. However, because I stemmed first, pos_tag often mislabels words. For example, it marked "hous" as an adjective, when the original word was really the noun "house". But when I try to stem after tagging, it gives me an error about how pos_tag can't deal with 'tuples' - I'm guessing this has something to do with the way that the stemmer formats the word list as [('come', 'VB'), ('hous', 'JJ'), etc.
Should I be using a different stemmer/tagger? Or is the error in my code?
Thanks in advance!
You should tag the text before you apply stemming or lemmatisation to it.
Removing the endings of words takes away crucial clues about which part-of-speech tags a word can have.
The reason that you got hous as an adjective is that any tagger expects unprocessed tokens, and words ending in -ous in English are usually adjectives (nefarious, serious). If you tag it first, it can be recognised (even without context) as either a noun or a verb. The tagger can then use context (preceded by the? -> noun) to disambiguate which is the most likely one.
A good lemmatiser can take the part-of-speech into account, eg housing could be a noun (lemma: housing) or a verb (lemma: house). With p-o-s information a lemmatiser can make the correct choice there.
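As a minimal sketch of the tag-then-lemmatise order with NLTK's WordNetLemmatizer (the example sentence and the tag-mapping helper are just for illustration):

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def penn_to_wordnet(tag):
    # Map Penn Treebank tags to the coarse tags WordNetLemmatizer expects.
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
tokens = nltk.word_tokenize("They are housing the refugees in an old house.")
tagged = nltk.pos_tag(tokens)  # tag first, on the unprocessed tokens
lemmas = [lemmatizer.lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(lemmas)
# roughly: ['They', 'be', 'house', 'the', 'refugee', 'in', 'an', 'old', 'house', '.']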
Whether you use stemming or lemmatisation depends on your application. For many purposes they will be equivalent. The main differences from my experience are that:
Stemming is a lot faster, as stemmers have a few rules on how to handle various endings
Lemmatisation gives you 'proper' words which you can look up in dictionaries (if you want to get glosses in other languages or definitions)
Stemmed strings sometimes don't look anything like the original word, and if you present them to a human user they might get confused
Stemmers conflate words which have similar meanings but different lemmas, so for information retrieval they might be more useful
Stemmers don't need a word list, so if you want to write your own stemmer, it's quicker than writing a lemmatiser (if you're processing languages for which no ready-made tools exist)
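To make the contrast concrete, here is a small side-by-side comparison (the word list is arbitrary, and the lemmatiser is called without p-o-s information here, so it defaults to treating each word as a noun):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "housing", "better", "meetings"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word))
# e.g. "studies" stems to "studi" but lemmatises to "study";
# "housing" stems to "hous" but stays "housing" as a noun lemma.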
I would suggest using lemmatization over stemming; stemming just chops off letters from the end until the root/stem word is reached, whereas lemmatization also looks at the surrounding text to determine the given word's part of speech.
I'm having trouble figuring out how I would take a text file of a lengthy document and append each sentence within that text file to a list. Not all sentences will end in a period, so all end characters would have to be taken into consideration, but there could also be a '.' within a sentence, so I couldn't just cut off searching through a sentence at a period. I'm assuming this could be fixed by also adding a condition that the period should be followed by a space, but I have no idea how to set this up so I get each sentence from the text file put into a list as an element.
The program I'm writing is essentially going to allow for user input of a keyword search (key), and input for a number of sentences to be returned (value) before and after the sentence where the keyword is found. So it's more or less a research assistant so the user won't have to read a massive text file to find the information they want.
From what I've learned so far, putting the sentences into a list would be the easiest way to go about this, but I can't figure out the first part to it. If I could figure out this part, the rest should be easy to put together.
So I guess in short,
If I have a document of Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence.
I need a list of the document contents in the form of:
sentence_list = [Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence]
That's a pretty hard problem, and it doesn't have an easy answer. You could try to write a regular expression that captures all of the known cases, but complex regular expressions tend to be hard to maintain and debug. There are a number of existing libraries that may help you with this. Most notable is the Natural Language Toolkit (NLTK), which has many tokenizers built in. You can install it with pip, e.g.
pip install nltk
And then getting your sentences would be a fairly straightforward (although highly customizable) affair. Here's a simple example using the provided sentence tokenizer
import nltk

with open('text.txt', 'r') as in_file:
    text = in_file.read()

sents = nltk.sent_tokenize(text)
I'm not entirely clear how your sentences are delimited if not by normal punctuation, but running the above code on your text I get:
[
"I'm having trouble figuring out how I would take a text file of a lengthy document, and append each sentence within that text file to a list.",
"Not all sentences will end in a period, so all end characters would have to be taken into consideration, but there could also be a '.'",
"within a sentence, so I couldn't just cutoff searching through a sentence at a period.",
"I'm assuming this could be fixed by also adding a condition where after the period it should be followed by a space, but I have no idea how to set this up so I get each sentence from the text file put into a list as an element.\n\n"
]
But it fails on an input like "This is a sentence with. a period right in the middle.", which comes back wrongly split as ["This is a sentence with.", "a period right in the middle."],
while it passes on an input like "This is a sentence wit.h a period right in the middle", which comes back intact as a single sentence.
I don't know if you're going to get much better than that right out of the box, though. From the nltk code:
A sentence tokenizer which uses an unsupervised algorithm to build
a model for abbreviation words, collocations, and words that start
sentences; and then uses that model to find sentence boundaries.
This approach has been shown to work well for many European
languages.
So the nltk solution is actually using machine learning to build a model of a sentence. Much better than a regular expression, but still not perfect. Damn natural languages. >:(
Hope this helps :)
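Once you have the sents list, the keyword lookup with a window of surrounding sentences that you describe might look something like this (the function name and parameters are just an illustration, not part of nltk):

def context_window(sentences, keyword, window):
    # Return, for each sentence containing the keyword, that sentence plus
    # `window` sentences before and after it.
    results = []
    for i, sentence in enumerate(sentences):
        if keyword.lower() in sentence.lower():
            start = max(0, i - window)
            results.append(sentences[start:i + window + 1])
    return results

# e.g. context_window(sents, "keyword", 2) -> a list of up to 5-sentence windows around each hit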
First read the text file into a container.
Then use regular expressions to parse the document.
This is just a sample of how re.split() can be used to break the string into sentences:
import re

file = open("test.txt", "r")
doclist = [line for line in file]
docstr = ''.join(doclist)
sentences = re.split(r'[.!?]', docstr)
I'm using sklearn to perform cosine similarity.
Is there a way to consider all the words starting with a capital letter as stop words?
The following regex will take a string as input and replace every sequence of word characters that begins with an uppercase letter with the empty string. See http://docs.python.org/2.7/library/re.html for more options.
s1 = "The cat Went to The store To get Some food doNotMatch"
r1 = re.compile('\\b[A-Z]\w+')
r1.sub('',s1)
' cat to store get food doNotMatch'
Sklearn also has many great facilities for text feature generation, such as sklearn.feature_extraction.text. You might also want to consider NLTK to assist with sentence segmentation, removing stop words, etc.
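One way to combine the two ideas, if you go the sklearn route, is to pass the capital-stripping function as the preprocessor of the vectorizer you use for cosine similarity. This is only a sketch under that assumption; the documents and the extra lowercasing step are made up for illustration.

import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

capitalized = re.compile(r'\b[A-Z]\w+')

def drop_capitalized(text):
    # Treat every capitalized word as a stop word by removing it up front,
    # then lowercase the rest (a custom preprocessor replaces sklearn's own
    # lowercasing step).
    return capitalized.sub('', text).lower()

docs = ["The cat Went to The store", "the cat sat near the store"]

vectorizer = TfidfVectorizer(preprocessor=drop_capitalized)
tfidf = vectorizer.fit_transform(docs)
print(cosine_similarity(tfidf[0], tfidf[1]))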
Should I use NLTK or regular expressions to split it?
How can I do the selection for pronouns (he/she)? I want to select any sentence that has a pronoun.
This is a part of a larger project and I am new to Python. Could you please point me to any helpful code?
I am working on an NLP project which has similar needs. I suggest you use NLTK since it makes things really easy and gives us a lot of flexibility. Since you need to collect all sentences containing pronouns, you can split the text into sentences and hold them in a list. Then you can iterate over the list and look for sentences containing pronouns. Also make sure you note down the index of the sentence (in the list), or you can form a new list.
Sample code below:
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

sentences = ['alice loves to read crime novels.', 'she also loves to play chess with him']
sentences_with_pronouns = []
for sentence in sentences:
    words = word_tokenize(sentence)
    for word in words:
        word_pos = pos_tag([word])
        if word_pos[0][1] == 'PRP':
            sentences_with_pronouns.append(sentence)
            break
print(sentences_with_pronouns)
Output:
['she also loves to play chess with him']
P.S. Also check the pylucene and whoosh libraries, which are pretty useful NLP Python packages.
NLTK is your best bet. Given a string of sentences as input, you can obtain a list of those sentences containing pronouns by doing:
from nltk import pos_tag, sent_tokenize, word_tokenize

paragraph = "This is a sentence with no pronouns. Take it or leave it."
print([sentence for sentence in sent_tokenize(paragraph)
       if 'PRP' in {pos for _, pos in pos_tag(word_tokenize(sentence))}])
Returns:
['Take it or leave it.']
Basically we split the string into a list of sentences, split those sentences into lists of words, and convert the list of words for each sentence into a set of part-of-speech tags (this is important since if we don't, when we have multiple pronouns in a sentence, we would get duplicate sentences).