I have trained a FastText model in Python and saved the files into a folder. These are the contents of the folder:
fasttext.model
fasttext.model.trainables.syn1neg.npy
fasttext.model.trainables.vectors_ngrams_lockf.npy
fasttext.model.trainables.vectors_vocab_lockf.npy
fasttext.model.wv.vectors.npy
fasttext.model.wv.vectors_ngrams.npy
fasttext.model.wv.vectors_vocab.npy
How can I load the model in MATLAB and extract the word embeddings of certain words?
This is what we do in Python:
from gensim.models.fasttext import FastText
model = FastText.load("fasttext.model")
vector = model.wv[word]
Is there a similar thing in MATLAB? How can I get the word embeddings generated by a FastText model in Python in MATLAB and work with them?
Use the trainWordEmbedding and readWordEmbedding functions.
Train and save your word embedding, "emb". A word embedding doesn't need a bag-of-words model; it just needs a tokenized document ("cleanDoc").
emb = trainWordEmbedding(cleanDoc, "Dimension",100)
writeWordEmbedding(emb,"medEmb.vec");
List the vocabulary in the embedding:
emb.Vocabulary
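If you would rather reuse the vectors you already trained in Python instead of retraining in MATLAB, one option (a minimal sketch, assuming the gensim model loads as in the question) is to export the word vectors to word2vec text format, which readWordEmbedding can read:
from gensim.models.fasttext import FastText

# Load the model saved in Python (same file as in the question)
model = FastText.load("fasttext.model")

# Export the in-vocabulary word vectors to word2vec text format (.vec);
# note this keeps only the full-word vectors, not the subword information
model.wv.save_word2vec_format("fasttext.vec")
In MATLAB, emb = readWordEmbedding("fasttext.vec") should then give an embedding whose vectors you can look up with word2vec(emb, word).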
Setup
I have an Anaconda virtual environment on a Windows machine, with Torch, transformers, TensorFlow, and CUDA installed. I previously used GPU acceleration from the transformers pipeline.
What I want to do ultimately
I want to use BERT to get word embeddings of the text in my dataset and feed them into LDA for topic modeling. The pseudo-code I intend to run:
import pandas as pd
import tensorflow as tf
import numpy as np
from transformers import BertTokenizer, TFBertModel
# Load your dataset into a pandas dataframe
df = pd.read_csv("topic_modeling_input_dataset.csv")
# Initialize the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Tokenize the reviews in the dataframe
df["tokenized_reviews"] = df["review"].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))
# Convert the tokenized reviews to tensors
input_ids = tf.constant(list(df["tokenized_reviews"]))
# Extract the word embeddings using the pre-trained BERT model
bert_model = TFBertModel.from_pretrained("bert-base-uncased")
word_embeddings = bert_model(input_ids)[0]  # last hidden state: (batch, seq_len, hidden_size)
# Convert the word embeddings from tensors to numpy arrays
word_embeddings = word_embeddings.numpy()
# Average the word embeddings for each review to obtain sentence embeddings
sentence_embeddings = np.mean(word_embeddings, axis=1)
# Use the sentence embeddings as input to Latent Dirichlet Allocation (LDA) for topic modeling
from sklearn.decomposition import LatentDirichletAllocation
# Initialize the LDA model
lda_model = LatentDirichletAllocation(n_components=10)
# Fit the LDA model on the sentence embeddings
lda_model.fit(sentence_embeddings)
# Print the topics learned by the LDA model
for index, topic in enumerate(lda_model.components_):
    print(f"Topic {index}:")
    words = [tokenizer.convert_ids_to_tokens(i) for i in np.argsort(topic)[::-1][:10]]
    print(words)
But I can't get past importing the libraries.
Problem
The command
from transformers import BertTokenizer, TFBertModel
gives the error:
RuntimeError: Failed to import transformers.models.bert.modeling_tf_bert because of the following error (look up to see its traceback):
Failed to import transformers.data.data_collator because of the following error (look up to see its traceback):
[WinError 182] The operating system cannot run %1. Error loading "C:\Users\myuser\Anaconda3\envs\text_mining\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.
Debugging Attempt
In the directory, I only have caffe2_detectron_ops_gpu.dll and no caffe2_detectron_ops.dll, which was the problem in all the reported cases I read online.
I also tried reinstalling caffe2 with conda, but I can't find a clean command or way to do it; the caffe2 documentation mentions that the commands could have unresolved bugs.
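As a quick sanity check (an assumption on my part, not something from the original post), importing torch directly in the same environment shows whether the DLL failure comes from the PyTorch installation itself rather than from transformers:
# Hypothetical check: if this import already fails with the same WinError,
# the torch install (not transformers) is what needs to be repaired.
import torch

print(torch.__version__)
print(torch.cuda.is_available())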
The app can be viewed on Hugging Face: https://huggingface.co/spaces/rowel/asr
import gradio as gr
from transformers import pipeline
model = pipeline(task="automatic-speech-recognition",
model="facebook/s2t-medium-librispeech-asr")
gr.Interface.from_pipeline(model,
title="Automatic Speech Recognition (ASR)",
description="Using pipeline with Facebook S2T for ASR.",
examples=['data/ljspeech.wav',]
).launch()
With those few lines of code, I don't know where the transcribed text is stored. I would like to store the sentence text in a string.
Honestly, I only know basic Python programming. I would just like to store the transcriptions in string variables and do something with them.
You can open up the Interface.from_pipeline abstraction and define your own Gradio interface. You need to define your own inputs, outputs, and prediction function, and can then access the text prediction from the model. Here is an example.
You can test it here: https://huggingface.co/spaces/radames/Speech-Recognition-Example
import gradio as gr
from transformers import pipeline
model = pipeline(task="automatic-speech-recognition",
model="facebook/s2t-medium-librispeech-asr")
def predict_speech_to_text(audio):
    prediction = model(audio)
    # text variable contains your voice-to-text string
    text = prediction['text']
    return text
gr.Interface(fn=predict_speech_to_text,
             title="Automatic Speech Recognition (ASR)",
             inputs=gr.inputs.Audio(
                 source="microphone", type="filepath", label="Input"),
             outputs=gr.outputs.Textbox(label="Output"),
             description="Using pipeline with Facebook S2T for ASR.",
             examples=['ljspeech.wav'],
             allow_flagging='never'
             ).launch()
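If you only want the string and don't need the web UI at all, you can also call the pipeline directly on an audio file; the result is a dict whose 'text' field holds the transcription (the file name below is just the example file from the Space):
from transformers import pipeline

model = pipeline(task="automatic-speech-recognition",
                 model="facebook/s2t-medium-librispeech-asr")

# Run ASR on a local audio file and keep the transcription as a plain string
result = model("ljspeech.wav")
text = result["text"]
print(text)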
I am using transformers.BertForMaskedLM to further pre-train the BERT model on my custom dataset. I first serialize all the text to a .txt file, separating the words by whitespace. Then I use transformers.TextDataset to load the serialized data, with a BERT tokenizer given as the tokenizer argument. Then I use BertForMaskedLM.from_pretrained() to load the pre-trained model (which is what the transformers library provides). Then I use transformers.Trainer to further pre-train the model on my custom dataset, i.e., domain adaptation, for 3 epochs, and I save the model with trainer.save_model(). Then I want to load the further pre-trained model to get the embeddings of the words in my custom dataset. To load the model, I use AutoModel.from_pretrained(), but this pops up a warning:
Some weights of the model checkpoint at {path to my further pre-trained model} were not used when initializing BertModel
So I know why this pops up: I further pre-trained using transformers.BertForMaskedLM, but when I load it with transformers.AutoModel, it is loaded as a transformers.BertModel. What I do not understand is whether this is a problem or not. I just want to get the embeddings, e.g., an embedding vector of size 768.
You saved a BERT model with the LM head attached, and now you are loading the serialized file into a standalone BERT structure without that extra element, so the warning is issued. This is perfectly normal and there is nothing fatal about it! You can check the list of unloaded params like below:
from transformers import BertModel, BertLMHeadModel, BertConfig
import torch

config = BertConfig.from_pretrained('bert-base-cased')
lmbert = BertLMHeadModel.from_pretrained('bert-base-cased', config=config)
lmbert.save_pretrained('your_desired_path/BertLMHeadModel')

lmbert_params = []
for name, param in lmbert.named_parameters():
    lmbert_params.append(name)

bert = BertModel.from_pretrained('your_desired_path/BertLMHeadModel')

bert_params = []
for name, param in bert.named_parameters():
    bert_params.append(name)

params_related_to_lm_head = [param_name for param_name in lmbert_params
                             if param_name.replace('bert.', '') not in bert_params]
params_related_to_lm_head
output:
['cls.predictions.bias',
'cls.predictions.transform.dense.weight',
'cls.predictions.transform.dense.bias',
'cls.predictions.transform.LayerNorm.weight',
'cls.predictions.transform.LayerNorm.bias']
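To actually pull out the 768-dimensional vectors from your further pre-trained checkpoint, a minimal sketch (the path below is a placeholder for wherever trainer.save_model() wrote the model) looks like this:
import torch
from transformers import AutoTokenizer, AutoModel

# Placeholder path: the directory written by trainer.save_model()
model_path = "path/to/your/further-pretrained-model"

# If the tokenizer was not saved alongside the model, load it from the base
# checkpoint (e.g. "bert-base-uncased") instead.
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)  # the LM-head warning is expected

inputs = tokenizer("some text from your domain", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per (sub)word token
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)  # (1, seq_len, 768)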
I've tried several methods of loading the google news word2vec vectors (https://code.google.com/archive/p/word2vec/):
en_nlp = spacy.load('en',vector=False)
en_nlp.vocab.load_vectors_from_bin_loc('GoogleNews-vectors-negative300.bin')
The above gives:
MemoryError: Error assigning 18446744072820359357 bytes
I've also tried with the .gz packed vectors; or by loading and saving them with gensim to a new format:
from gensim.models.word2vec import Word2Vec
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('googlenews2.txt')
This file then contains the words and their word vectors on each line.
I tried to load them with:
en_nlp.vocab.load_vectors('googlenews2.txt')
but it returns "0".
What is the correct way to do this?
Update:
I can load my own created file into spacy.
I use a test.txt file with "string 0.0 0.0 ..." on each line, then compress this txt with bzip2 to test.txt.bz2.
Then I create a spacy compatible binary file:
spacy.vocab.write_binary_vectors('test.txt.bz2', 'test.bin')
That I can load into spacy:
nlp.vocab.load_vectors_from_bin_loc('test.bin')
This works!
However, when I do the same process for the googlenews2.txt, I get the following error:
lib/python3.6/site-packages/spacy/cfile.pyx in spacy.cfile.CFile.read_into (spacy/cfile.cpp:1279)()
OSError:
For spacy 1.x, load Google news vectors into gensim and convert to a new format (each line in .txt contains a single vector: string, vec):
from gensim.models.word2vec import Word2Vec
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model.wv.save_word2vec_format('googlenews.txt')
Remove the first line of the .txt:
tail -n +2 googlenews.txt > googlenews.new && mv -f googlenews.new googlenews.txt
Compress the txt as .bz2:
bzip2 googlenews.txt
Create a SpaCy compatible binary file:
spacy.vocab.write_binary_vectors('googlenews.txt.bz2','googlenews.bin')
Move the googlenews.bin to /lib/python/site-packages/spacy/data/en_google-1.0.0/vocab/googlenews.bin of your python environment.
Then load the wordvectors:
import spacy
nlp = spacy.load('en',vectors='en_google')
or load them later:
nlp.vocab.load_vectors_from_bin_loc('googlenews.bin')
I know that this question has already been answered, but I am going to offer a simpler solution. This solution will load google news vectors into a blank spacy nlp object.
import gensim
import spacy
# Path to google news vectors
google_news_path = "path\to\google\news\\GoogleNews-vectors-negative300.bin.gz"
# Load google news vecs in gensim
model = gensim.models.KeyedVectors.load_word2vec_format(google_news_path, binary=True)
# Init blank english spacy nlp object
nlp = spacy.blank('en')
# Loop through range of all indexes, get words associated with each index.
# The words in the keys list will correspond to the order of the google embed matrix
keys = []
for idx in range(3000000):
    keys.append(model.index2word[idx])
# Set the vectors for our nlp object to the google news vectors
nlp.vocab.vectors = spacy.vocab.Vectors(data=model.syn0, keys=keys)
>>> nlp.vocab.vectors.shape
(3000000, 300)
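With the vectors attached, the blank pipeline can look up individual word vectors; a small usage sketch (assuming the setup above in spaCy v2):
# 300-dimensional vector for a single word
king_vector = nlp.vocab['king'].vector
print(king_vector.shape)  # (300,)

# Token similarities are now computed from the Google News vectors
doc = nlp("king queen")
print(doc[0].similarity(doc[1]))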
I am using spaCy v2.0.10.
I want to highlight that the step in the accepted answer that creates a SpaCy compatible binary file,
spacy.vocab.write_binary_vectors('googlenews.txt.bz2','googlenews.bin')
is not working now; I encountered "AttributeError: ..." when I ran the code.
This has changed in spaCy v2: write_binary_vectors was removed. According to the spaCy documentation, the current way to do this is as follows:
$ python -m spacy init-model en /path/to/output -v /path/to/vectors.bin.tar.gz
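The output directory can then be loaded like any other spaCy model; a short sketch (the path is the placeholder from the command above):
import spacy

# Load the model directory produced by `spacy init-model`
nlp = spacy.load('/path/to/output')
print(nlp.vocab.vectors.shape)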
It is much easier to use the gensim API for downloading the compressed word2vec model published by Google; it will be stored in /home/"your_username"/gensim-data/word2vec-google-news-300/. Load the vectors and play ball. I have 16 GB of RAM, which is more than enough to handle the model.
import gensim.downloader as api
model = api.load("word2vec-google-news-300")  # download the model and return it as an object ready for use
word_vectors = model  # api.load already returns the KeyedVectors, so no .wv attribute is needed
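Once loaded, the object behaves like a gensim KeyedVectors store, so you can query vectors and nearest neighbours directly; a small usage sketch:
# 300-dimensional vector for a single word
vec = word_vectors['dog']
print(vec.shape)  # (300,)

# Nearest neighbours in the embedding space
print(word_vectors.most_similar('dog', topn=5))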
I would like to run the NLTK Punkt tokenizer to split sentences. There is no trained model available, so I am training a model separately, but I am not sure if the training data format I am using is correct.
My training data is one sentence per line. I wasn't able to find any documentation about this; only this thread (https://groups.google.com/forum/#!topic/nltk-users/bxIEnmgeCSM) sheds some light on the training data format.
What is the correct training data format for NLTK Punkt sentence tokenizer?
Ah yes, the Punkt tokenizer is the magical unsupervised sentence boundary detection, and the authors' last names are pretty cool too: Kiss and Strunk (2006). The idea is to use NO annotation to train a sentence boundary detector, hence the input can be ANY sort of plaintext (as long as the encoding is consistent).
To train a new model, simply use:
import nltk.tokenize.punkt
import pickle
import codecs
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
text = codecs.open("someplain.txt","r","utf8").read()
tokenizer.train(text)
out = open("someplain.pk","wb")
pickle.dump(tokenizer, out)
out.close()
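Once the pickle is written, loading it back and splitting text is straightforward; a small usage sketch:
import pickle

with open("someplain.pk", "rb") as fin:
    tokenizer = pickle.load(fin)

# Split raw text into sentences with the trained model
print(tokenizer.tokenize("This is a sentence. And here is another one."))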
To achieve higher precision and allow you to stop training at any time and still save a proper pickle for your tokenizer, do look at this code snippet for training a German sentence tokenizer, https://github.com/alvations/DLTK/blob/master/dltk/tokenize/tokenizer.py :
import codecs
import pickle

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

def train_punktsent(trainfile, modelfile):
    """ Trains an unsupervised NLTK punkt sentence tokenizer. """
    punkt = PunktTrainer()
    try:
        with codecs.open(trainfile, 'r', 'utf8') as fin:
            punkt.train(fin.read(), finalize=False, verbose=False)
    except KeyboardInterrupt:
        print('KeyboardInterrupt: Stopping the reading of the dump early!')
    ##HACK: Adds abbreviations from rb_tokenizer.
    abbrv_sent = " ".join([i.strip() for i in
                           codecs.open('abbrev.lex', 'r', 'utf8').readlines()])
    abbrv_sent = "Start" + abbrv_sent + "End."
    punkt.train(abbrv_sent, finalize=False, verbose=False)
    # Finalize and output the trained model.
    punkt.finalize_training(verbose=True)
    model = PunktSentenceTokenizer(punkt.get_params())
    with open(modelfile, mode='wb') as fout:
        pickle.dump(model, fout, protocol=pickle.HIGHEST_PROTOCOL)
    return model
However, do note that the period detection is very sensitive to the Latin full stop, question mark, and exclamation mark. If you're going to train a punkt tokenizer for a language that doesn't use Latin orthography, you'll need to somehow hack the code to use the appropriate sentence boundary punctuation. If you're using NLTK's implementation of punkt, edit the sent_end_chars variable.
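A minimal sketch of that hack, assuming a language whose sentences end with the Devanagari danda '।' (the class name and the extra character are purely illustrative), is to subclass PunktLanguageVars and hand it to the trainer and tokenizer:
from nltk.tokenize.punkt import (PunktLanguageVars, PunktSentenceTokenizer,
                                 PunktTrainer)

class DandaLanguageVars(PunktLanguageVars):
    # Illustrative: also treat the Devanagari danda as a sentence-final character
    sent_end_chars = ('.', '?', '!', '\u0964')

trainer = PunktTrainer(lang_vars=DandaLanguageVars())
with open("someplain.txt", encoding="utf8") as fin:
    trainer.train(fin.read(), finalize=True)

tokenizer = PunktSentenceTokenizer(trainer.get_params(),
                                   lang_vars=DandaLanguageVars())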
There are pre-trained models available other than the 'default' English tokenizer using nltk.tokenize.sent_tokenize(). Here they are: https://github.com/evandrix/nltk_data/tree/master/tokenizers/punkt
Edit:
Note that the pre-trained models are currently not available, because the nltk_data GitHub repo listed above has been removed.