I have started learning NLTK and am following this tutorial. First we use the built-in tokenizer via sent_tokenize, and later we use PunktSentenceTokenizer. The tutorial mentions that PunktSentenceTokenizer is capable of unsupervised machine learning.
So does that mean it is better than the default one? Or what is the standard of comparison among various tokenizers?
Looking at the source code for sent_tokenize() reveals that this method currently uses the pre-trained punkt tokenizer, so it is equivalent to PunktSentenceTokenizer. Whether or not you will need to retrain your tokenizer depends on the nature of the text you are working with. If it is nothing too exotic, like newspaper articles, then you will likely find the pre-trained tokenizer sufficient. Tokenizing boils down to a categorization task, so different tokenizers can be compared on labelled data using the usual metrics such as precision, recall and F-score.
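As a quick sanity check, here is a minimal sketch showing both on the same text (assuming the punkt model has been downloaded; the resource path may differ slightly across NLTK versions):

import nltk
from nltk.tokenize import sent_tokenize

# nltk.download('punkt')  # one-time download of the pre-trained model

text = "Dr. Smith went to Washington. He arrived at 5 p.m. and left at noon."

# sent_tokenize() loads the pre-trained punkt model under the hood
print(sent_tokenize(text))

# Loading the same pre-trained PunktSentenceTokenizer explicitly
punkt = nltk.data.load('tokenizers/punkt/english.pickle')
print(punkt.tokenize(text))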
The punkt tokenizer is based on the work published in the following paper:
http://www.mitpressjournals.org/doi/abs/10.1162/coli.2006.32.4.485#.V2ouLXUrLeQ
It is fundamentally a heuristic-based approach geared towards disambiguating sentence boundaries from abbreviations - the bane of sentence tokenization. Calling it a heuristic approach is not meant to be disparaging. I have used the built-in sentence tokenizer before and it worked fine for what I was doing; of course, my task did not really depend on accurate sentence tokenization. Or rather, I was able to throw enough data at it that it did not really matter.
Here is an example of a question on SO where a user found the pre-trained tokenizer lacking, and needed to train a new one:
How to tweak the NLTK sentence tokenizer
The text in question was Moby Dick, and the odd sentence structure was tripping up the tokenizer. Some examples of where you might need to train your own tokenizer are social media (e.g. Twitter) or technical literature with lots of strange abbreviations not encountered by the pre-trained tokenizer.
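If you do end up in that situation, retraining is relatively painless because punkt learns from raw, unannotated text. A rough sketch (the file name is just a placeholder for your own corpus):

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

raw_text = open("my_domain_corpus.txt").read()  # placeholder: your own unannotated text

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True  # also learn collocations, which helps with abbreviations
trainer.train(raw_text)

tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(tokenizer.tokenize("Fig. 3 shows the results. See Sec. 2 for details."))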
Sentences and words are often manually annotated; there are various corpora that provide POS tags for words according to their sentence context. PunktSentenceTokenizer is employed when you want the tokenizer to be trained on your own data (sentences and words) so that it builds a consistent picture of how that text behaves. A data scientist could manually annotate a whole batch of sentences and then tell the machine to learn from them (supervised learning). PunktSentenceTokenizer, however, employs an ML algorithm to learn these patterns on its own (unsupervised); you just choose which data it trains on.
Depending on the data you are working with, the results of sent_tokenize, and consequently word_tokenize, may not be that different from PunktSentenceTokenizer. The choice between tokenizers is left up to the data scientist, but the standard of comparison is always manually annotated data (because those are the most correct labels).
Related
In my NLP project I built my own model to identify sentences in a PDF document. Now I would like to check whether my extracted sentences are complete sentences. During my research I already came across this question, but the solutions presented there allow quite a few false positives. Does anyone have a tip on how I can check whether a sentence is a complete sentence?
This is a non-trivial problem, so no approach will work in every case. You should also consider that whatever parser you use might merge or split sentences that were complete in the original document but are no longer complete after parsing.
As a general alternative to purely rule-based approaches: you could use a model that was pretrained on the CoLA (Corpus of Linguistic Acceptability) task. Such models try to classify sentences/documents into the classes "linguistically acceptable" and "linguistically unacceptable".
On Hugging Face's model hub there are several pretrained transformer models for this; see for example this inference API for a fine-tuned version of Facebook's RoBERTa model:
Correct Sentence
Incorrect Sentence
You should have a look at how the model was trained when it comes to bullet points/self-standing half sentences etc. though, as some scores might be surprising at first glance.
You might want to combine the model's results with a rule-based approach, for example: "The sentence is acceptable if the score is 0.95 or higher AND the sentence has at least 4 words AND ends with a ., ? or !". You can then look at which sentences your model + rules combination spits out and keep modifying the rules until the results are to your satisfaction.
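A rough sketch of that combination using the transformers pipeline (the checkpoint name below is only an example of a RoBERTa model fine-tuned on CoLA; which label means "acceptable" and how the scores are scaled depend on the specific model you pick from the hub):

from transformers import pipeline

# Example choice of a CoLA fine-tuned checkpoint; swap in whichever model you found on the hub
classifier = pipeline("text-classification", model="textattack/roberta-base-CoLA")

def is_complete_sentence(sentence, min_words=4, min_score=0.95):
    result = classifier(sentence)[0]           # e.g. {'label': 'LABEL_1', 'score': 0.98}
    acceptable = result["label"] == "LABEL_1"  # which label is "acceptable" depends on the model
    long_enough = len(sentence.split()) >= min_words
    ends_properly = sentence.rstrip().endswith((".", "?", "!"))
    return acceptable and result["score"] >= min_score and long_enough and ends_properly

print(is_complete_sentence("The extracted text ends abruptly because"))
print(is_complete_sentence("The parser recovered this sentence correctly."))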
I am currently trying to train my own word2vec model with my own training data and I am utterly confused about the training data preprocessing.
I ran a short script over my text which lemmatizes and also lower-cases the words, so that in the end a training sentence (in German) like:
"Er hat heute zwei Birnen gegessen."
the following comes out:
[er, haben, heute, zwei, birne, essen]
translated in English:
He ate two pears today.
results in:
[he, eat, two, pear, today]
Now the problem is: I haven't seen anyone else do this to their training data. The words are kept in their original case and are not lemmatized, and I don't understand how that can work. Especially in German there are so many inflections of verbs. Should I just leave the words as they are? I don't understand how it works without lemmatization, since gensim doesn't even know which language it is trained on, right?
So in short: Should I do lemmatization and/or lowercasing or just leave every word as it is?
Thanks a lot!
The answer depends on what you are going to use the embeddings for; however, they are usually trained on word forms. Word embeddings are typically pre-trained on very large datasets and cover vocabularies of up to 500k words, which is enough to cover most words in most forms, even in languages with much richer morphology than German.
You might also want to use FastText (bindings for Gensim exist) instead of Word2Vec. FastText considers character n-gram statistics for the embeddings, so it generalizes better over regularities in morphology.
But in general, the data preprocessing always depends on how you plan to use the embeddings. If you want to do a quantitative diachronic study of how the meaning of words shifted over the 20th century, then embedding lemmas is a good idea. If you work with a low-resource language that has a good lemmatizer, it also might be a good idea. If you plan to use the embeddings as input for a downstream NLP model, then you should probably use word forms and/or already pre-trained embeddings.
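For reference, a minimal Gensim sketch of both options on a toy corpus (parameter names follow Gensim 4.x, e.g. vector_size instead of the older size; in practice you would stream your own tokenized sentences):

from gensim.models import Word2Vec, FastText

# Toy corpus of already-tokenized, lower-cased sentences
sentences = [
    ["er", "hat", "heute", "zwei", "birnen", "gegessen"],
    ["sie", "isst", "gerne", "birnen", "und", "aepfel"],
]

# Plain word2vec on word forms
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=50)

# FastText adds character n-grams, so inflected forms like "gegessen"/"isst" share subword information
ft = FastText(sentences, vector_size=100, window=5, min_count=1, epochs=50)

print(w2v.wv.most_similar("birnen", topn=3))
print(ft.wv.most_similar("birnen", topn=3))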
I want to detect sentence type preferably with python.
For example, given a sentence, the program can detect whether a sentence is a question, or an assertion/statement, or a command etc.
This is different from sentiment/happiness analysis. Is there any tool/new research paper that works reasonably well to do that?
Note: I do not have labeled data to train libraries, what I want is a already built model.
Thanks!
If you have the data, you can train a classifier to do that for you. TextBlob is a very simple text processing module for Python. It has simple methods like train() that let you use classification algorithms like Naive Bayes, decision trees, etc. TextBlob also supports sentiment analysis out of the box.
Although TextBlob doesn't provide exactly this functionality, my intuition would be to use TextBlob to parse the corpus (if you have one, or use the Stanford NLP corpus) and build your own sentence type detector/classifier on top of it.
The basic process might be (a rough sketch follows the list):
load corpus
tokenise the corpus
remove stop words
tag parts of speech
extract all the tone words
calculate the overall sentence-type probability based on the individual tone words' sentence-type probabilities
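A rough sketch of the first few steps with NLTK (extracting "tone words" and the probability calculation are left out, since they depend on the lexicon you build yourself):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# one-time downloads: nltk.download('punkt'), nltk.download('stopwords'),
# nltk.download('averaged_perceptron_tagger')

sentence = "Could you please close the door?"

tokens = word_tokenize(sentence)   # tokenise
tagged = nltk.pos_tag(tokens)      # part-of-speech tagging
# Note: many cue words ("could", "what", "please") appear in standard stop-word lists,
# so it can make sense to tag before filtering them out.
content = [(w, t) for w, t in tagged if w.lower() not in stopwords.words("english")]
print(tagged)
print(content)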
You can use NLTK
First it is necessary to isolate the sentences:
from nltk.tokenize import sent_tokenize

my_text = "A first affirmation is that Python is useful. But can Python be useful to me?"
sentences = sent_tokenize(my_text)
print(sentences)
Then, to check the actual type, you will probably need to train a model:
NLTK- or spaCy-based approach (NLTK is the easier one): Detecting whether a sentence is Interrogative or Not in Machine-Learning
BERT or a derivative (better accuracy): BERT Text Classification Using Pytorch
A common natural language processing practice is to calculate the similarity between two words using WordNet. I start my question with the following Python code:
from nltk.corpus import wordnet
sport = wordnet.synsets("sport")[0]
badminton = wordnet.synsets("badminton")[0]
print(sport.wup_similarity(badminton))
We will get 0.8421
Now what if I look for "haha" and "lol" as following:
haha = wordnet.synsets("haha")
lol = wordnet.synsets("lol")
print(haha)
print(lol)
We will get
[]
[]
Then we cannot consider the similarity between them. What can we do in this case?
You can create a semantic space from cooccurrence matrices using a tool like Dissect (DIStributional SEmantics Composition Toolkit)
and then you are set to measure semantic similarity between words or phrases (if you compose words).
In your case, for haha and lol, you'll need to collect those cooccurrences.
Another thing to try is word2vec.
There are two variants of this approach:
CBOW: the continuous bag-of-words model
skip-gram: the reverse of the CBOW model, it predicts the context words from the target word
look at this: https://www.quora.com/What-are-the-continuous-bag-of-words-and-skip-gram-architectures-in-laymans-terms
These models are presented well here: https://www.tensorflow.org/tutorials/word2vec. Gensim is also a good Python library for this kind of thing.
Try looking at the TensorFlow examples, for example this one: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/word2vec/word2vec_basic.py
Or read up on word2vec itself: https://en.wikipedia.org/wiki/Word2vec
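A minimal Gensim sketch of the two architectures (the sg flag switches between CBOW and skip-gram; the corpus here is a toy placeholder, and you would need real chat or social-media text to get useful vectors for words like "haha" and "lol"):

from gensim.models import Word2Vec

# Toy placeholder corpus of tokenized sentences
corpus = [
    ["haha", "that", "was", "so", "funny"],
    ["lol", "that", "joke", "was", "funny"],
    ["haha", "lol", "good", "one"],
]

cbow = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=0)       # CBOW
skip_gram = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=1)  # skip-gram

print(skip_gram.wv.similarity("haha", "lol"))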
You can use other frameworks. I also tried NLTK but finally landed on spaCy (spacy.io), a very fast and functional framework. There is a method on words called similarity which compares a word to other words (it also works for sentences, docs, etc.). It is implemented using word2vec. I don't actually know how big their vocabulary is or how it behaves when a word is unknown, but it might be worth a try.
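A short sketch with spaCy (this assumes a model that ships with word vectors, e.g. the medium or large English model; the small models have no static vectors, so similarity scores there are less meaningful):

import spacy

# python -m spacy download en_core_web_md   (one-time download of a model with vectors)
nlp = spacy.load("en_core_web_md")

haha = nlp("haha")
lol = nlp("lol")
print(haha.similarity(lol))
print(haha[0].has_vector, lol[0].has_vector)  # check whether the words are in the vocabulary at all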
I was also playing a little bit with this one:
https://radimrehurek.com/gensim/models/word2vec.html
where in two lines of code you can load Google's big word2vec model (this project ports Google's word2vec C++ library to Python), accessible here:
https://docs.google.com/file/d/0B7XkCwpI5KDYaDBDQm1tZGNDRHc/edit
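A sketch of that two-line load with Gensim's KeyedVectors (the file name below is the usual name of the Google News download; adjust the path to wherever you saved it):

from gensim.models import KeyedVectors

# Assumed path: the standard Google News binary from the link above
vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
# Raises a KeyError if either word is out of vocabulary
print(vectors.similarity("haha", "lol"))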
There are different models for measuring similarity, such as word2vec or glove, but you seem to be looking more for a corpus which includes social, informal phrases like 'lol'.
However, I'm going to bring up word2vec because it leads to what I think is an answer to your question.
The foundational concept of word2vec (and other word embedding models like glove) is the representation of words in a vector space which incorporates relationships between words. This lends itself very well to measuring similarity, since vectors have lots of established math to draw from. You can read more about the technical details of word2vec in the original paper, but I quite like this blog post because it is well-written and concise.
Again, since word2vec is just a model, you need to pair it with the right training set to get the kind of scope you seek. There are some pre-trained models floating around on the web, such as this bunch. The training set is really what allows you to query a larger variety of terms, rather than the model.
You can certainly use those pre-trained models if they have social phrases like the ones you're seeking. However, if you don't see a model that has been trained on a suitable corpus, you could easily train a model yourself. I suggest Twitter or Wikipedia for corpora (training sets), and the implementation of word2vec in gensim as a word embedding model.
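If you go the pre-trained route, Gensim's downloader API gives quick access to several such models; for example, a GloVe model trained on Twitter (the model name is as listed in the gensim-data catalogue) should cover informal terms like "lol":

import gensim.downloader as api

# "glove-twitter-25": 25-dimensional GloVe vectors trained on Twitter, as listed in gensim-data
twitter_vectors = api.load("glove-twitter-25")
print(twitter_vectors.most_similar("lol", topn=5))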
I am working on Python NLTK tagging, and my input text is not Hindi.
In order to tag my input text, the tagger must first be trained.
My question is: how do I train it on my data?
I have this line of code, as suggested to me here on Stack Overflow:
train_data = indian.tagged_sents('hindi.pos')
What about non-Hindi input data?
The short answer is: Training a tagger requires a tagged corpus.
Assigning part of speech tags must be done according to some existing model.
Unfortunately, unlike for some problems such as finding sentence boundaries, there is no way to pull the tags out of thin air. There are some experimental approaches that try to assign parts of speech using parallel texts and machine-translation alignment algorithms, but all real POS taggers must be trained on text that has been tagged already.
Evidently you don't have a tagged corpus for your unnamed language, so you'll need to find or create one if you want to build a tagger. Creating a tagged corpus is a major undertaking, since you'll need a lot of training materials to get any sort of decent performance. There may be ways to "bootstrap" a tagged corpus (put together a poor-quality tagger that will make it easier to retag the results by hand), but all that depends on your situation.
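Once you do have a tagged corpus in a suitable format, the training step itself is the easy part. Here is a sketch with NLTK's built-in taggers, reusing the Hindi data from the question as a stand-in for whatever tagged corpus you end up with (use evaluate() instead of accuracy() on older NLTK versions):

import nltk
from nltk.corpus import indian

# nltk.download('indian')  # one-time download of the tagged corpora

tagged_sents = list(indian.tagged_sents('hindi.pos'))  # stand-in for your own tagged corpus
split = int(len(tagged_sents) * 0.9)
train_data, test_data = tagged_sents[:split], tagged_sents[split:]

# A unigram tagger that backs off to a default tag for unseen words
tagger = nltk.UnigramTagger(train_data, backoff=nltk.DefaultTagger('NN'))
print(tagger.accuracy(test_data))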