I am working on POS tagging with Python NLTK, and my input text is not Hindi.
In order to tag my input text, a tagger must first be trained.
My question is: how do I train it on my data?
I have this line of code, which was suggested to me here on Stack Overflow:
train_data = indian.tagged_sents('hindi.pos')
But what about non-Hindi input data?
The short answer is: Training a tagger requires a tagged corpus.
Assigning part of speech tags must be done according to some existing model.
Unfortunately, unlike some problems such as finding sentence boundaries, there is no way to pick tags out of thin air. There are some experimental approaches that try to assign parts of speech using parallel texts and machine-translation alignment algorithms, but all practical POS taggers must be trained on text that has already been tagged.
Evidently you don't have a tagged corpus for your unnamed language, so you'll need to find or create one if you want to build a tagger. Creating a tagged corpus is a major undertaking, since you'll need a lot of training materials to get any sort of decent performance. There may be ways to "bootstrap" a tagged corpus (put together a poor-quality tagger that will make it easier to retag the results by hand), but all that depends on your situation.
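For illustration, here is a minimal sketch of what training looks like once you do have a tagged corpus, using the Hindi data from the question (NLTK's UnigramTagger is just one of several trainable taggers; the same pattern applies to any language for which a tagged corpus exists):
from nltk.corpus import indian
from nltk.tag import UnigramTagger

# a tagged corpus: a list of sentences, each a list of (word, tag) pairs
train_data = indian.tagged_sents('hindi.pos')

# train a simple tagger that memorises the most frequent tag for each word
tagger = UnigramTagger(train_data)

# tag a tokenized sentence; here we just reuse the words of the first training sentence
print(tagger.tag([word for word, tag in train_data[0]]))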
In my NLP project I built my own model to identify sentences in a PDF document. Now I would like to check whether my extracted sentences are complete sentences. During my research I already came across this question, but the solutions presented there produce quite a few false positives. Does anyone have a tip on how I can check whether a sentence is a complete sentence?
This is a non-trivial problem, so no approach will work in each and every case. You should also consider that whatever parser you use might merge or split sentences that were complete in the original document but are no longer complete after parsing.
As an alternative to purely rule-based approaches, you could use a model that was pretrained on the CoLA (Corpus of Linguistic Acceptability) task. Such models try to classify sentences/documents into the classes "linguistically acceptable" and "linguistically unacceptable".
On huggingface's model hub there are several pretrained transformer models for this, see for example this inference API for one which is a fine-tuned version of Facebook's RoBERTa model:
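As a rough sketch of how such a model could be run locally with the transformers library (the checkpoint name textattack/roberta-base-CoLA is only an example of a CoLA fine-tuned model, not necessarily the exact one linked above):
from transformers import pipeline

# any CoLA-fine-tuned checkpoint from the model hub should work here
classifier = pipeline("text-classification", model="textattack/roberta-base-CoLA")

print(classifier("The cat sat on the mat."))    # expected to lean towards "acceptable"
print(classifier("Sat the on cat mat the."))    # expected to lean towards "unacceptable"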
You should have a look at how the model was trained when it comes to bullet points, freestanding sentence fragments, etc., though, as some scores might be surprising at first glance.
You might want to combine the model's results with a rule-based approach, for example: "The sentence is acceptable if the score is 0.95 or higher AND the sentence has at least 4 words AND ends with ., ? or !". You can then look at which sentences your combined model + rules approach spits out and keep modifying the rules until the results are to your satisfaction.
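A minimal sketch of such a combination, assuming score is the probability of the "acceptable" class returned by the CoLA model:
import re

def is_complete_sentence(sentence, score, threshold=0.95):
    # score: acceptability probability from the CoLA classifier
    long_enough = len(sentence.split()) >= 4
    ends_properly = re.search(r"[.!?]$", sentence.strip()) is not None
    return score >= threshold and long_enough and ends_properly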
I'm looking for a way to extract "important phrases" from text documents. I was hoping to do this using spaCy, but there is one caveat: my data consists mostly of product information, and therefore the important phrases are different from what they would be in natural spoken language. For this reason, I would like to train spaCy on my own corpus, but the only info I can find is for training spaCy using labelled data.
Does anyone know if what I want to do is possible?
If you are looking for a scheme to weight phrases according to "Importance" without any labeled data, you can try using TF-IDF.
For this answer, I will refer to terms - these can be phrases or words. It just represents a single entity of text.
A Brief Look at TF-IDF
TF-IDF stands for (Term Frequency) x (Inverse Document Frequency).
It is a measure of how often a term appears in a single document vs. how often that term appears across the entire corpus of documents.
It is commonly used as a statistical measure to determine how important terms are in a corpus.
For a longer, but readable explanation of it, check out the wiki: https://en.wikipedia.org/wiki/Tf%E2%80%93idf.
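For a rough intuition, the weight of a term t in a document d is usually computed as something like
tfidf(t, d) = tf(t, d) * log(N / df(t))
where tf(t, d) is how often t occurs in d, N is the number of documents in the corpus, and df(t) is the number of documents containing t (scikit-learn uses a smoothed variant of this formula).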
Code Implementation
Check out Scikit-Learn's TfidfVectorizer.
This has a fit_transform function that takes raw text as input and outputs the appropriate TF-IDF weights for words and/or n-grams.
If you prefer to do your own tokenization with spaCy, or only include doc.noun_chunks and doc.ents that satisfy len(span) >= 2 (i.e. phrases), there is a little hack for the TfidfVectorizer.
To use your own tokenization, do the following:
from sklearn.feature_extraction.text import TfidfVectorizer

# each document is assumed to already be a list of tokens/phrases
dummy = lambda x: x
vectorizer = TfidfVectorizer(analyzer=dummy)
tfidf = vectorizer.fit_transform(list_of_tokenized_docs)
This overrides the default tokenization and lets you use your own list of tokens.
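For example, here is a sketch of building list_of_tokenized_docs with spaCy, keeping only multi-token noun chunks and entities as phrases (en_core_web_sm and raw_texts are placeholders here):
import spacy

nlp = spacy.load("en_core_web_sm")      # placeholder model; use whatever pipeline fits your data

list_of_tokenized_docs = []
for doc in nlp.pipe(raw_texts):         # raw_texts: your iterable of document strings
    phrases = [span.text.lower()
               for span in list(doc.noun_chunks) + list(doc.ents)
               if len(span) >= 2]       # keep only multi-token spans, i.e. phrases
    list_of_tokenized_docs.append(phrases)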
From there you can find the terms that have the highest average TF-IDF score across all documents, and consider those as Important. You can try using those as input to the PhraseMatcher: https://spacy.io/usage/rule-based-matching#phrasematcher.
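A small sketch of pulling out the terms with the highest average TF-IDF weight, continuing from the vectorizer and tfidf matrix above:
import numpy as np

mean_scores = np.asarray(tfidf.mean(axis=0)).ravel()   # average TF-IDF weight per term
terms = vectorizer.get_feature_names_out()             # get_feature_names() on older scikit-learn

top_n = 20
top_terms = [terms[i] for i in mean_scores.argsort()[::-1][:top_n]]
print(top_terms)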
Or you can find some way to use these to automatically label documents. If you can locate them in your documents after determining they are important, you can then add an appropriate label and use that as training data to some training pipeline.
If you want exact phrases to be recognised, you can compile a list of those phrases and use spaCy's PhraseMatcher component to recognise them later.
https://spacy.io/usage/rule-based-matching#phrasematcher
The only thing is that it will only recognise the exact phrases supplied to it. This is in contrast to how NER works: an NER model can recognise additional phrases based on the training provided, but the PhraseMatcher will only recognise the ones you provide it.
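A minimal sketch of that workflow (the model name and the example phrases are made up; no training happens, it is pure token matching):
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")                # placeholder model
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # match case-insensitively

phrases = ["stainless steel water bottle", "noise cancelling headphones"]  # your compiled list
matcher.add("PRODUCT_PHRASES", [nlp(p) for p in phrases])                  # spaCy v3-style add()

doc = nlp("We sell noise cancelling headphones in three colours.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)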
I am trying to split financial documents into sentences. I have ~50,000 documents containing plain English text. The total file size is ~2.6 GB.
I am using NLTK's PunktSentenceTokenizer with the standard English pickle file. I additionally tweaked it by providing additional abbreviations, but the results are still not accurate enough.
Since the NLTK PunktSentenceTokenizer is based on the unsupervised algorithm by Kiss & Strunk (2006), I am trying to train the sentence tokenizer on my own documents, following the training data format for nltk punkt.
import nltk.tokenize.punkt
import pickle
import codecs

tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

# read the concatenated training text and train the tokenizer on it
text = codecs.open("someplain.txt", "r", "utf8").read()
tokenizer.train(text)

# persist the trained tokenizer for later use
out = open("someplain.pk", "wb")
pickle.dump(tokenizer, out)
out.close()
Unfortunately, when running the code I got an error saying that there was not sufficient memory. (Mainly because I first concatenated all the files into one big file.)
Now my questions are:
How can I train the algorithm batchwise, and would that lead to lower memory consumption?
Can I use the standard English pickle file and do further training with that already trained object?
I am using Python 3.6 (Anaconda 5.2) on Windows 10 on a Core I7 2600K and 16GB RAM machine.
I found this question after running into this problem myself. I figured out how to train the tokenizer batchwise and am leaving this answer for anyone else looking to do the same. I was able to train a PunktSentenceTokenizer on roughly 200GB of biomedical text in around 12 hours with a memory footprint no greater than 20GB at a time. Nevertheless, I'd like to second @colidyre's recommendation to prefer other tools over the PunktSentenceTokenizer in most situations.
There is a class PunktTrainer you can use to train the PunktSentenceTokenizer in a batchwise fashion.
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
Suppose we have a generator that yields a stream of training texts
texts = text_stream()
In my case, each iteration of the generator queries a database for 100,000 texts at a time, then yields all of these texts concatenated together.
We can instantiate a PunktTrainer and then begin training
trainer = PunktTrainer()
for text in texts:
    # accumulate statistics only; parameter finalization is deferred to finalize_training() below
    trainer.train(text, finalize=False)
    trainer.freq_threshold()
Notice the call to the freq_threshold method after processing each text. This reduces the memory footprint by cleaning up information about rare tokens that are unlikely to influence future training.
Once this is complete, call the finalize_training() method. Then you can instantiate a new tokenizer using the parameters found during training.
trainer.finalize_training()
tokenizer = PunktSentenceTokenizer(trainer.get_params())
@colidyre recommended using spaCy with added abbreviations. However, it can be difficult to know in advance which abbreviations will appear in your text domain. To get the best of both worlds, you can add the abbreviations found by Punkt. You can get a set of these abbreviations in the following way:
params = trainer.get_params()
abbreviations = params.abbrev_types
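For example, one way to merge those abbreviations into the standard pre-trained English tokenizer (a sketch that relies on the tokenizer's internal _params attribute, so treat it as unofficial):
import nltk

# load the standard pre-trained English Punkt model shipped with NLTK
standard_tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")

# add the abbreviations discovered during the batchwise training above
standard_tokenizer._params.abbrev_types.update(abbreviations)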
As described in the source code:
Punkt Sentence Tokenizer
This tokenizer divides a text into a list of sentences
by using an unsupervised algorithm to build a model for abbreviation
words, collocations, and words that start sentences. It must be
trained on a large collection of plaintext in the target language
before it can be used.
It is not very clear what a large collection really means. In the paper, no information is given about learning curves (i.e. when it is sufficient to stop the learning process because enough data has been seen). The Wall Street Journal corpus is mentioned there (it has approximately 30 million words). So it is very unclear whether you can simply trim your training corpus and get a smaller memory footprint.
There is also an open issue on this topic mentioning 200 GB of RAM and more. As you can see there, NLTK probably does not have a good implementation of the algorithm presented by Kiss & Strunk (2006).
I cannot see how to train it batchwise, as you can see from the signature of the train() method (NLTK version 3.3):
def train(self, train_text, verbose=False):
    """
    Derives parameters from a given training text, or uses the parameters
    given. Repeated calls to this method destroy previous parameters. For
    incremental training, instantiate a separate PunktTrainer instance.
    """
But there are probably more issues: for example, if you compare the signature of version 3.3 above with the current git version, there is a new parameter finalize, which might be helpful and indicates a possible batch process or a possible merge with an already trained model:
def train(self, text, verbose=False, finalize=True):
    """
    Collects training data from a given text. If finalize is True, it
    will determine all the parameters for sentence boundary detection. If
    not, this will be delayed until get_params() or finalize_training() is
    called. If verbose is True, abbreviations found will be listed.
    """
Anyway, I would strongly recommend not using NLTK's Punkt Sentence Tokenizer if you want to do sentence tokenization beyond the playground level. Nevertheless, if you want to stick with that tokenizer, I would simply recommend also using the provided models and not training new models unless you have a server with a huge amount of RAM.
I am just playing around with Doc2Vec from gensim, analysing a Stack Exchange dump to assess the semantic similarity of questions and identify duplicates.
The tutorial (Doc2Vec-Tutorial) seems to describe the input as tagged sentences.
But the original paper (Doc2Vec-Paper) claims that the method can be used to infer fixed-length vectors for paragraphs/documents.
Can someone explain the difference between a sentence and a document in this context, and how I would go about inferring paragraph vectors?
Since a question can sometimes span multiple sentences, I thought that during training I would give sentences arising from the same question the same tags, but then how would I do this with infer_vector on unseen questions?
And this notebook (Doc2Vec-Notebook) seems to be training vectors on both TRAIN and TEST docs. Can someone explain the rationale behind this, and should I do the same?
Gensim's Doc2Vec expects you to provide text examples of the same object-shape as the example TaggedDocument class: having both a words and a tags property.
The words are an ordered sequence of string tokens of the text – they might be a single sentence worth, or a paragraph, or a long document, it's up to you.
The tags are a list of tags to be learned from the text – such as plain ints, or string-tokens, that somehow serve to name the corresponding texts. In the original 'Paragraph Vectors' paper, they were just unique IDs for each text – such as integers monotonically increasing from 0. (So the first TaggedDocument might have a tags of just [0], the next [1], etc.)
The algorithm just works on chunks of text, without any idea of what a sentence/paragraph/document etc might be. (Just consider them all 'documents' for the purpose of Doc2Vec, with you deciding what's the right kind of 'document' from your corpus.) It's even common for the tokenization to retain punctuation, such as the periods between sentences, as standalone tokens.
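A minimal sketch of preparing such a corpus and training a model (the two example texts and all hyperparameter values are placeholders; parameter names follow recent gensim releases):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_questions = ["How do I sort a list in Python?",
                 "What is the fastest way to order a Python list?"]   # stand-in corpus

corpus = [TaggedDocument(words=text.lower().split(), tags=[i])
          for i, text in enumerate(raw_questions)]

model = Doc2Vec(corpus, vector_size=100, min_count=1, epochs=20)      # placeholder parameters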
Inference occurs via the infer_vector() method, which takes a mandatory parameter doc_words, which should be a list-of-string-tokens just like those that were supplied as text words during training.
You don't supply any tags on inferred text: Doc2Vec just gives you back a raw vector that, within the relationships learned by the model, fits the text well. (That is: the vector is good at predicting the text's words, in the same way that the vectors and internal model weights learned during bulk training were good at predicting the training texts' words.)
Note that many have found better results from inference by increasing the optional steps parameter (and possibly decreasing the inference starting alpha to be more like the bulk-training starting alpha, 0.025 to 0.05).
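Continuing the sketch above (attribute and parameter names follow gensim 4.x, where model.docvecs became model.dv and infer_vector's steps became epochs):
tokens = "How can I sort a Python list quickly?".lower().split()

# more inference passes and a bulk-training-like starting alpha often help
vector = model.infer_vector(tokens, epochs=50, alpha=0.025)

# look up the most similar training documents by their tags
print(model.dv.most_similar([vector], topn=2))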
The doc2vec-IMDB demo notebook tries to reproduce one of the experiments from the original Paragraph Vectors paper, so it's following what's described there, and a demo script that one of the authors (Mikolov) once released. Since 'test' documents (without their target labels/known sentiments) may still be available at training time to help improve the text modelling, it can be reasonable to include their raw texts during the unsupervised Doc2Vec training. (Their known labels are not used when training the classifier which uses the doc-vectors.)
(Note that at the moment, February 2017, the doc2vec-IMDB demo notebook is a little out-of-date compared to the current gensim Doc2Vec defaults & best practices – in particular, the models aren't given the right explicit iter=1 value to make the later manual loop-and-train() do just the right number of training passes.)
I have started learning nltk and following this tutorial. First we use the built-in tokenizer by using sent_tokenize and later we use PunktSentenceTokenizer. The tutorial mentions that PunktSentenceTokenizer is capable of unsupervised machine learning.
So does that mean it is better than the default one? Or what is the standard of comparison among various tokenizers?
Looking at the source code for sent_tokenize() reveals that this method currently uses the pre-trained punkt tokenizer, so it is equivalent to using PunktSentenceTokenizer. Whether or not you will need to retrain your tokenizer depends on the nature of the text you are working with. If it is nothing too exotic, like newspaper articles, then you will likely find the pre-trained tokenizer to be sufficient. Tokenizing boils down to a categorization task, and thus different tokenizers can be compared by using the typical metrics such as precision, recall, f-score etc. on labelled data.
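For example, assuming the punkt data has been downloaded via nltk.download('punkt'), these two calls currently give the same result:
import nltk
from nltk.tokenize import sent_tokenize

text = "Dr. Smith arrived at 5 p.m. He was late."

print(sent_tokenize(text))
print(nltk.data.load("tokenizers/punkt/english.pickle").tokenize(text))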
The punkt tokenizer is based on the work published in the following paper:
http://www.mitpressjournals.org/doi/abs/10.1162/coli.2006.32.4.485#.V2ouLXUrLeQ
It is fundamentally a heuristic based approach geared to disambiguating sentence boundaries from abbreviations - the bane of sentence tokenization. Calling it a heuristic approach is not meant to be disparaging. I have used the built-in sentence tokenizer before and it worked fine for what I was doing, of course, my task did not really depend on accurate sentence tokenizing. Or rather, I was able to throw enough data at it where it did not really matter.
Here is an example of a question on SO where a user found the pre-trained tokenizer lacking, and needed to train a new one:
How to tweak the NLTK sentence tokenizer
The text in question was Moby Dick, and the odd sentence structure was tripping up the tokenizer. Some examples of where you might need to train your own tokenizer are social media (e.g. twitter) or technical literature with lots of strange abbreviations not encountered by the pre-trained tokenizer.
Sentences and words are often manually tokenized. There exist various corpora that deal with POS tagging for words according to sentence context. PunktSentenceTokenizer is employed when your data (sentences and words) needs to be trained to achieve a uniform understanding of how the words should be tagged contextually. A data scientist could manually annotate word tags for a whole bunch of sentences and then tell the machine to learn them (supervised learning). However, PunktSentenceTokenizer employs ML algorithms to learn these tags on its own (unsupervised). You just choose which data it trains upon.
Depending on the data you are working with, the results of sent_tokenize, and consequently word_tokenize, may not be that different from PunktSentenceTokenizer. Choosing between tokenizers is left up to the data scientist, but the standard is always compared against manually annotated tags (because they are the most correct tags).