training data format for NLTK punkt - python

I would like to run NLTK's Punkt to split text into sentences. There is no pre-trained model for my data, so I am training one separately, but I am not sure whether the training data format I am using is correct.
My training data is one sentence per line. I wasn't able to find any documentation about this; only this thread (https://groups.google.com/forum/#!topic/nltk-users/bxIEnmgeCSM) sheds some light on the training data format.
What is the correct training data format for NLTK Punkt sentence tokenizer?

Ah yes, the Punkt tokenizer is the magical unsupervised sentence boundary detector, and the authors' names are pretty cool too: Kiss and Strunk (2006). The idea is to train a sentence boundary detector with NO annotation, so the input can be ANY sort of plain text (as long as the encoding is consistent).
To train a new model, simply use:
import nltk.tokenize.punkt
import pickle
import codecs
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
text = codecs.open("someplain.txt","r","utf8").read()
tokenizer.train(text)
out = open("someplain.pk","wb")
pickle.dump(tokenizer, out)
out.close()
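Once the pickle has been written, loading and reusing the tokenizer is straightforward. A minimal sketch, reusing the file name from the example above (the sample sentence is just an illustration):
import pickle

# Load the tokenizer trained above
with open("someplain.pk", "rb") as fin:
    tokenizer = pickle.load(fin)

# Split raw text into sentences
for sent in tokenizer.tokenize("Mr. Smith arrived at 10 a.m. He left an hour later."):
    print(sent)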
To achieve higher precision, and to be able to stop training at any time and still save a proper pickle for your tokenizer, look at this code snippet for training a German sentence tokenizer, https://github.com/alvations/DLTK/blob/master/dltk/tokenize/tokenizer.py:
import codecs
import pickle

from nltk.tokenize.punkt import PunktTrainer, PunktSentenceTokenizer

def train_punktsent(trainfile, modelfile):
    """ Trains an unsupervised NLTK punkt sentence tokenizer. """
    punkt = PunktTrainer()
    try:
        with codecs.open(trainfile, 'r', 'utf8') as fin:
            punkt.train(fin.read(), finalize=False, verbose=False)
    except KeyboardInterrupt:
        print('KeyboardInterrupt: Stopping the reading of the dump early!')
    ##HACK: Adds abbreviations from rb_tokenizer.
    abbrv_sent = " ".join([i.strip() for i in
                           codecs.open('abbrev.lex', 'r', 'utf8').readlines()])
    abbrv_sent = "Start" + abbrv_sent + "End."
    punkt.train(abbrv_sent, finalize=False, verbose=False)
    # Finalize and output the trained model.
    punkt.finalize_training(verbose=True)
    model = PunktSentenceTokenizer(punkt.get_params())
    with open(modelfile, mode='wb') as fout:
        pickle.dump(model, fout, protocol=pickle.HIGHEST_PROTOCOL)
    return model
However, do note that the period detection is very sensitive to the Latin full stop, question mark, and exclamation mark. If you're going to train a Punkt tokenizer for languages that don't use Latin orthography, you'll need to somehow hack the code to use the appropriate sentence boundary punctuation. If you're using NLTK's implementation of Punkt, edit the sent_end_chars variable (see the sketch below).
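In NLTK's Punkt, sent_end_chars lives on PunktLanguageVars, and both PunktSentenceTokenizer and PunktTrainer accept a lang_vars argument, so you may be able to avoid patching the library itself. A rough sketch, assuming a reasonably recent NLTK (the added Devanagari danda is just an illustrative example):
from nltk.tokenize.punkt import PunktLanguageVars, PunktSentenceTokenizer

class MyLanguageVars(PunktLanguageVars):
    # Add extra sentence-ending punctuation, e.g. the Devanagari danda
    sent_end_chars = ('.', '?', '!', '\u0964')

tokenizer = PunktSentenceTokenizer(lang_vars=MyLanguageVars())
tokenizer.train(text)   # text = your plaintext training corpus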
There are pre-trained models available other than the 'default' English tokenizer used by nltk.tokenize.sent_tokenize(). Here they are: https://github.com/evandrix/nltk_data/tree/master/tokenizers/punkt
Edit: note that the pre-trained models are currently not available, because the nltk_data GitHub repo linked above has been removed.

Related

Add preprocessing step to Huggingface tokenizer

I am training my huggingface tokenizer on my own corpora, and I want to save it with a preprocessing step. That is, if I pass some text to it, I want it to apply the preprocessing and then tokenize the text, instead of explicitly preprocessing it before that. A good example is BERTweet: https://github.com/VinAIResearch/BERTweet and their tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True) (here normalization=True indicates that the input will be preprocessed according to some function). I want the same to apply when I train a tokenizer with a custom preprocessing function. My code is:
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

def preprocess(text):
    return text

paths = [str(x) for x in Path('data').glob('*.txt')]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=paths, vocab_size=50_000, min_frequency=2,
                special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'])
tokenizer.save_model('CustomBertTokenizer')
Now, when I load the tokenizer:
from transformers import RobertaTokenizerFast
sentence = 'Hey'
tokenizer = RobertaTokenizerFast.from_pretrained('CustomBertTokenizer')
tokenizer(sentence)
I want sentence to be preprocessed with the preprocess function and then tokenized, i.e. I want to pass an argument like preprocessing=True, or something like that. How can I do it?
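One possible workaround, which is not the built-in normalization mechanism that BERTweet ships with, is to wrap the loaded tokenizer so your preprocess function runs before every call. A sketch under that assumption (the class name and the preprocessing body are placeholders):
from transformers import RobertaTokenizerFast

def preprocess(text):
    # placeholder preprocessing step
    return text.lower()

class PreprocessedTokenizer(RobertaTokenizerFast):
    """Applies preprocess() to the input before tokenizing."""
    def __call__(self, text, *args, **kwargs):
        if isinstance(text, str):
            text = preprocess(text)
        elif isinstance(text, (list, tuple)):
            text = [preprocess(t) for t in text]
        return super().__call__(text, *args, **kwargs)

tokenizer = PreprocessedTokenizer.from_pretrained('CustomBertTokenizer')
tokenizer('Hey')
Note that the preprocessing lives in your Python code rather than in the saved tokenizer files, so anyone loading a plain RobertaTokenizerFast from 'CustomBertTokenizer' will not get it automatically.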

How can I extract and store the text generated from an automatic speech recognition deep learning app

The app can be viewed on Hugging Face: https://huggingface.co/spaces/rowel/asr
import gradio as gr
from transformers import pipeline

model = pipeline(task="automatic-speech-recognition",
                 model="facebook/s2t-medium-librispeech-asr")

gr.Interface.from_pipeline(model,
                           title="Automatic Speech Recognition (ASR)",
                           description="Using pipeline with Facebook S2T for ASR.",
                           examples=['data/ljspeech.wav']
                           ).launch()
I don't know where the text output is stored with those very few lines of code. I would like to store the transcribed sentence in a string.
Honestly, I only know basic Python programming; I would just like to store the results in string variables and do something with them.
You can open up the Interface.from_pipeline abstraction and define your own Gradio interface. You need to define your own inputs, outputs, and prediction function, and thereby access the text prediction from the model. Here is an example.
You can test it here: https://huggingface.co/spaces/radames/Speech-Recognition-Example
import gradio as gr
from transformers import pipeline

model = pipeline(task="automatic-speech-recognition",
                 model="facebook/s2t-medium-librispeech-asr")

def predict_speech_to_text(audio):
    prediction = model(audio)
    # The text variable contains your voice-to-text string
    text = prediction['text']
    return text

gr.Interface(fn=predict_speech_to_text,
             title="Automatic Speech Recognition (ASR)",
             inputs=gr.inputs.Audio(
                 source="microphone", type="filepath", label="Input"),
             outputs=gr.outputs.Textbox(label="Output"),
             description="Using pipeline with Facebook S2T for ASR.",
             examples=['ljspeech.wav'],
             allow_flagging='never'
             ).launch()
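If all you want is the transcription in a plain Python string, you can also skip Gradio entirely and call the pipeline directly. A small sketch using the same model and the example audio file from the Space above:
from transformers import pipeline

model = pipeline(task="automatic-speech-recognition",
                 model="facebook/s2t-medium-librispeech-asr")

# The pipeline returns a dict; the transcription is under the 'text' key
result = model("data/ljspeech.wav")
transcript = result["text"]
print(transcript)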

loading a FastText model in MATLAB

I have trained a FastText model in Python and saved the files into a folder. These are the contents of the folder:
fasttext.model
fasttext.model.trainables.syn1neg.npy
fasttext.model.trainables.vectors_ngrams_lockf.npy
fasttext.model.trainables.vectors_vocab_lockf.npy
fasttext.model.wv.vectors.npy
fasttext.model.wv.vectors_ngrams.npy
fasttext.model.wv.vectors_vocab.npy
How can I load the model in MATLAB and extract the word embeddings of certain words?
This is what we do in Python:
from gensim.models.fasttext import FastText

model = FastText.load('fasttext.model')
vector = model.wv[word]
Is there a similar thing in MATLAB? How can I get the word embeddings generated by a FastText model in Python in MATLAB and work with them?
Use the trainWordEmbedding and readWordEmbedding functions.
Train and save your word embedding, "emb". A word embedding doesn't need a bag-of-words model; it just needs a tokenized document ("cleanDoc"):
emb = trainWordEmbedding(cleanDoc, "Dimension", 100)
writeWordEmbedding(emb, "medEmb.vec");
List the vocabulary in the embedding:
emb.Vocabulary
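If you would rather reuse the vectors already trained in Python instead of retraining in MATLAB, one possible route (a sketch, not a tested recipe; the output file name is a placeholder) is to export the gensim vectors to the plain word2vec text format, which readWordEmbedding can read:
from gensim.models.fasttext import FastText

# Load the model trained in Python and dump its word vectors as plain text
model = FastText.load('fasttext.model')
model.wv.save_word2vec_format('fasttext_vectors.vec')
In MATLAB, emb = readWordEmbedding("fasttext_vectors.vec") and word2vec(emb, "someword") should then give you the stored vectors, although FastText's subword handling for out-of-vocabulary words is lost in this export.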

SpaCy: how to load Google news word2vec vectors?

I've tried several methods of loading the google news word2vec vectors (https://code.google.com/archive/p/word2vec/):
en_nlp = spacy.load('en',vector=False)
en_nlp.vocab.load_vectors_from_bin_loc('GoogleNews-vectors-negative300.bin')
The above gives:
MemoryError: Error assigning 18446744072820359357 bytes
I've also tried with the .gz packed vectors; or by loading and saving them with gensim to a new format:
from gensim.models.word2vec import Word2Vec
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('googlenews2.txt')
This file then contains the words and their word vectors on each line.
I tried to load them with:
en_nlp.vocab.load_vectors('googlenews2.txt')
but it returns "0".
What is the correct way to do this?
Update:
I can load a file I created myself into spaCy.
I use a test.txt file with "string 0.0 0.0 ...." on each line, then compress this txt with bzip2 to test.txt.bz2.
Then I create a spacy compatible binary file:
spacy.vocab.write_binary_vectors('test.txt.bz2', 'test.bin')
That I can load into spacy:
nlp.vocab.load_vectors_from_bin_loc('test.bin')
This works!
However, when I do the same process for the googlenews2.txt, I get the following error:
lib/python3.6/site-packages/spacy/cfile.pyx in spacy.cfile.CFile.read_into (spacy/cfile.cpp:1279)()
OSError:
For spacy 1.x, load Google news vectors into gensim and convert to a new format (each line in .txt contains a single vector: string, vec):
from gensim.models.word2vec import Word2Vec
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model.wv.save_word2vec_format('googlenews.txt')
Remove the first line of the .txt:
tail -n +2 googlenews.txt > googlenews.new && mv -f googlenews.new googlenews.txt
Compress the txt as .bz2:
bzip2 googlenews.txt
Create a SpaCy compatible binary file:
spacy.vocab.write_binary_vectors('googlenews.txt.bz2','googlenews.bin')
Move the googlenews.bin to /lib/python/site-packages/spacy/data/en_google-1.0.0/vocab/googlenews.bin of your python environment.
Then load the wordvectors:
import spacy
nlp = spacy.load('en',vectors='en_google')
or load them later:
nlp.vocab.load_vectors_from_bin_loc('googlenews.bin')
I know that this question has already been answered, but I am going to offer a simpler solution. This solution will load google news vectors into a blank spacy nlp object.
import gensim
import spacy

# Path to google news vectors
google_news_path = "path\to\google\news\\GoogleNews-vectors-negative300.bin.gz"

# Load google news vecs in gensim
model = gensim.models.KeyedVectors.load_word2vec_format(google_news_path, binary=True)

# Init blank english spacy nlp object
nlp = spacy.blank('en')

# Loop through range of all indexes, get words associated with each index.
# The words in the keys list will correspond to the order of the google embed matrix
keys = []
for idx in range(3000000):
    keys.append(model.index2word[idx])

# Set the vectors for our nlp object to the google news vectors
nlp.vocab.vectors = spacy.vocab.Vectors(data=model.syn0, keys=keys)
>>> nlp.vocab.vectors.shape
(3000000, 300)
I am using spaCy v2.0.10.
Create a SpaCy compatible binary file:
spacy.vocab.write_binary_vectors('googlenews.txt.bz2','googlenews.bin')
I want to highlight that the specific code in the accepted answer no longer works. I encountered "AttributeError: ..." when I ran it.
This has changed in spaCy v2: write_binary_vectors was removed. According to the spaCy documentation, the current way to do this is as follows:
$ python -m spacy init-model en /path/to/output -v /path/to/vectors.bin.tar.gz
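Once init-model has written the model directory, you should be able to load it like any other spaCy model. A minimal sketch, reusing the placeholder output path from the command above:
import spacy

# Load the model directory created by `spacy init-model`
nlp = spacy.load('/path/to/output')

doc = nlp('king queen banana')
print(doc[0].similarity(doc[1]))  # similarity computed from the imported vectors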
It is much easier to use the gensim API for downloading the compressed word2vec model published by Google; it will be stored in /home/"your_username"/gensim-data/word2vec-google-news-300/. Load the vectors and play ball. I have 16GB of RAM, which is more than enough to handle the model.
import gensim.downloader as api
model = api.load("word2vec-google-news-300") # download the model and return as object ready for use
word_vectors = model.wv #load the vectors from the model
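A quick usage sketch for the object returned by the downloader; it behaves like a gensim KeyedVectors instance:
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")

# Look up a single embedding and query nearest neighbours
vec = vectors['king']                       # a 300-dimensional numpy array
print(vectors.most_similar('king', topn=5))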

Save Naive Bayes Trained Classifier in NLTK

I'm slightly confused about how to save a trained classifier. Re-training a classifier each time I want to use it is obviously really bad and slow, so how do I save it and then load it again when I need it? Code is below; thanks in advance for your help. I'm using Python with the NLTK Naive Bayes classifier.
classifier = nltk.NaiveBayesClassifier.train(training_set)

# look inside the classifier train method in the source code of the NLTK library
def train(labeled_featuresets, estimator=nltk.probability.ELEProbDist):
    # Create the P(label) distribution
    label_probdist = estimator(label_freqdist)
    # Create the P(fval|label, fname) distribution
    feature_probdist = {}
    return NaiveBayesClassifier(label_probdist, feature_probdist)
To save:
import pickle
f = open('my_classifier.pickle', 'wb')
pickle.dump(classifier, f)
f.close()
To load later:
import pickle
f = open('my_classifier.pickle', 'rb')
classifier = pickle.load(f)
f.close()
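Once reloaded, the classifier can be used directly; a minimal sketch (the feature dict is a placeholder that must match however your training_set features were built):
# `features` must look like the feature dicts used during training
features = {'contains(awesome)': True}
print(classifier.classify(features))
classifier.show_most_informative_features(5)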
I went through the same problem, and you cannot save the object since it is an ELEFreqDistr NLTK class. Anyhow, NLTK is hellishly slow. Training took 45 minutes on a decent set, so I decided to implement my own version of the algorithm (run it with PyPy, or rename it .pyx and install Cython). It takes about 3 minutes with the same set, and it can simply save data as JSON (I'll implement pickle, which is faster/better).
I started a simple GitHub project; check out the code here.
To retrain the pickled classifier:
f = open('originalnaivebayes5k.pickle','rb')
classifier = pickle.load(f)
classifier.train(training_set)
print('Accuracy:',nltk.classify.accuracy(classifier,testing_set)*100)
f.close()
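One caveat: nltk.NaiveBayesClassifier.train is a classmethod that builds a brand-new model from whatever feature sets it is given, so the call above does not incrementally update the unpickled classifier. If you still have the original feature sets, a sketch of retraining is simply to train on the combined data and re-pickle the result (original_training_set and new_training_set are placeholders):
import pickle
import nltk

# Both are lists of (featureset, label) pairs, as used for the original training
combined_set = original_training_set + new_training_set
classifier = nltk.NaiveBayesClassifier.train(combined_set)

with open('originalnaivebayes5k.pickle', 'wb') as f:
    pickle.dump(classifier, f)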
