Loading a pre-trained fastText model - Python

I have a question about fasttext (https://fasttext.cc/). I want to download a pre-trained model and use it to retrieve the word vectors from text.
After downloading the pre-trained model (https://fasttext.cc/docs/en/english-vectors.html) I unzipped it and got a .vec file. How do I import this into fasttext?
I've tried to use the function mentioned there (load_vectors) as follows:
import fasttext
import io

def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        data[tokens[0]] = map(float, tokens[1:])
    return data

vectors = load_vectors('/Users/username/Downloads/wiki-news-300d-1M.vec')
model = fasttext.load_model(vectors)
However, I can't completely run this code because python crashes. How can I successfully load these pre-trained word vectors?
Thank you for your help.

FastText's advantage over word2vec or GloVe, for example, is that it uses subword information to return vectors for OOV (out-of-vocabulary) words.
So it offers two types of pretrained models: .vec and .bin.
.vec is a dictionary Dict[word, vector]; the word vectors are pre-computed for the words in the training vocabulary.
.bin is a binary fastText model that can be loaded with fasttext.load_model('file.bin') and that can provide word vectors for unseen (OOV) words, be trained further, etc.
In your case you are loading a .vec file, so vectors is already the "final form" of the data.
fasttext.load_model expects a .bin file, not a dictionary of vectors.
If you need more than a plain Python dictionary, you can use gensim.models.KeyedVectors (which handles any word vectors, such as word2vec, GloVe, etc.).
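For example, a minimal sketch of loading the .vec file with gensim's KeyedVectors (assuming a recent gensim is installed; the path is the one from the question and the lookup word is just an illustration):

from gensim.models import KeyedVectors

# The .vec file is plain word2vec text format, so KeyedVectors can read it directly.
vectors = KeyedVectors.load_word2vec_format(
    '/Users/username/Downloads/wiki-news-300d-1M.vec', binary=False)

print(vectors['king'])               # a 300-dimensional numpy vector
print(vectors.most_similar('king'))  # nearest neighbours by cosine similarity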

I use the following code to load the .vec file in Python 3, where PATH_TO_FASTTEXT is the path to the .vec file.
Most notably, the map needs to be explicitly cast to a list.
import io
from typing import Dict, List

def LoadFastText():
    input_file = io.open(PATH_TO_FASTTEXT, 'r', encoding='utf-8', newline='\n', errors='ignore')
    no_of_words, vector_size = map(int, input_file.readline().split())
    word_to_vector: Dict[str, List[float]] = dict()
    for i, line in enumerate(input_file):
        tokens = line.rstrip().split(' ')
        word = tokens[0]
        vector = list(map(float, tokens[1:]))
        assert len(vector) == vector_size
        word_to_vector[word] = vector
    return word_to_vector
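A quick usage sketch (the path is the one from the question; the lookup word is just an example):

PATH_TO_FASTTEXT = '/Users/username/Downloads/wiki-news-300d-1M.vec'
word_to_vector = LoadFastText()
vector = word_to_vector.get('king')  # a list of 300 floats, or None if 'king' is not in the file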

Related

How to get the probability of bigrams in a text of sentences?

I have a text which has many sentences. How can I use nltk.ngrams to process it?
This is my code:
import nltk
from nltk.util import ngrams

sequence = nltk.tokenize.word_tokenize(raw)
bigram = ngrams(sequence, 2)
freq_dist = nltk.FreqDist(bigram)
prob_dist = nltk.MLEProbDist(freq_dist)
number_of_bigrams = freq_dist.N()
However, the above code assumes that all sentences form one sequence. But the sentences are separated, and I guess the last word of one sentence is unrelated to the first word of the next sentence. How can I create bigrams for such a text? I also need prob_dist and number_of_bigrams, which are based on freq_dist.
There are similar questions, like What are ngram counts and how to implement using nltk?, but they are mostly about a single sequence of words.
You can use the new nltk.lm module. Here's an example, first get some data and tokenize it:
import os
import io
import requests
from nltk import word_tokenize, sent_tokenize

# Text version of https://kilgarriff.co.uk/Publications/2005-K-lineer.pdf
if os.path.isfile('language-never-random.txt'):
    with io.open('language-never-random.txt', encoding='utf8') as fin:
        text = fin.read()
else:
    url = "https://gist.githubusercontent.com/alvations/53b01e4076573fea47c6057120bb017a/raw/b01ff96a5f76848450e648f35da6497ca9454e4a/language-never-random.txt"
    text = requests.get(url).content.decode('utf8')
    with io.open('language-never-random.txt', 'w', encoding='utf8') as fout:
        fout.write(text)

# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]
Then the language modelling:
# Preprocess the tokenized text for 3-grams language modelling
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)
model = MLE(n) # Lets train a 3-grams maximum likelihood estimation model.
model.fit(train_data, padded_sents)
To get the counts:
model.counts['language'] # i.e. Count('language')
model.counts[['language']]['is'] # i.e. Count('is'|'language')
model.counts[['language', 'is']]['never'] # i.e. Count('never'|'language is')
To get the probabilities:
model.score('is', 'language'.split()) # P('is'|'language')
model.score('never', 'language is'.split()) # P('never'|'language is')
There are some kinks on the Kaggle platform when loading the notebook, but at some point this notebook should give a good overview of the nltk.lm module: https://www.kaggle.com/alvations/n-gram-language-model-with-nltk
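If all you need are the per-sentence bigram counts and probabilities from the original question, here is a minimal sketch that counts bigrams sentence by sentence so no bigram spans a sentence boundary (the padding and lowercasing choices are assumptions):

from nltk import FreqDist, MLEProbDist, word_tokenize, sent_tokenize
from nltk.util import bigrams
from nltk.lm.preprocessing import pad_both_ends

raw = "This is one sentence. This is another one."
sentences = [list(map(str.lower, word_tokenize(s))) for s in sent_tokenize(raw)]

all_bigrams = []
for sent in sentences:
    # Pad each sentence separately so bigrams never cross sentence boundaries.
    all_bigrams.extend(bigrams(pad_both_ends(sent, n=2)))

freq_dist = FreqDist(all_bigrams)
prob_dist = MLEProbDist(freq_dist)
number_of_bigrams = freq_dist.N()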

How to create a new entity and use it to find the entity in my test data? How to make my tokenizer work?

I would like to make a new entity, let's call it "medicine", and then train it using my corpora. From there, I want to identify all the "medicine" entities. Somehow my code is not working; could anyone help me?
import nltk

test = input("Please enter your file name")
test1 = input("Please enter your second file name")

with open(test, "r") as file:
    new = file.read().splitlines()
with open(test1, "r") as file2:
    new1 = file2.read().splitlines()

for s in new:
    for x in new1:
        sample = s.replace('value', x)
        sample1 = ''.join(str(v) for v in sample)
        print(sample1)

sentences = nltk.sent_tokenize(sample1)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)
print(sentences)

def extract_entity_names(t):
    entity_names = []
    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))
    return entity_names
How to create new entity and use it to find the entity in my test data?
Named entity recognizers are probabilistic, neural or linear models. In your code,
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)
does this prediction. So if you want it to recognize a new entity type, you should first train a classifier on annotated data containing that new entity type.
Somehow my code is not working,
As I said before, you did not train NLTK's model on your own data, so it is not working.
How to make my tokenizer work?
The tokenizer only splits text into word tokens, which your code does on this line:
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
However, the tokenizer does not predict named entities directly.
If you want to train a model to predict a custom named entity like "medicine" using NLTK, then try this tutorial.
From my personal experience, NLTK may not be suitable for this; have a look at spaCy.
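For reference, a minimal sketch of training a custom entity type with spaCy (assuming spaCy v3; the "MEDICINE" label, the toy sentence and its character offsets are made up for illustration, and real training needs many annotated examples):

import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("MEDICINE")  # hypothetical custom entity type

# One toy annotated example: "aspirin" spans characters 22-29.
train_data = [("The patient was given aspirin.", {"entities": [(22, 29, "MEDICINE")]})]

optimizer = nlp.initialize()
for _ in range(20):  # a few passes over the tiny toy dataset
    for text, annotations in train_data:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)

doc = nlp("She takes aspirin daily.")
print([(ent.text, ent.label_) for ent in doc.ents])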

Tokenization and lemmatization for TF-IDF over a bunch of .txt files using the NLTK library

I am doing text analysis of Italian text (tokenization, lemmatization) for future use of TF-IDF techniques and for constructing clusters based on that. For preprocessing I use NLTK, and for one text file everything works fine:
import nltk
import string
from nltk.stem.wordnet import WordNetLemmatizer

it_stop_words = nltk.corpus.stopwords.words('italian')
lmtzr = WordNetLemmatizer()

with open('3003.txt', 'r', encoding="latin-1") as myfile:
    data = myfile.read()

word_tokenized_list = nltk.tokenize.word_tokenize(data)
word_tokenized_no_punct = [str.lower(x) for x in word_tokenized_list if x not in string.punctuation]
word_tokenized_no_punct_no_sw = [x for x in word_tokenized_no_punct if x not in it_stop_words]
word_tokenized_no_punct_no_sw_no_apostrophe = [x.split("'") for x in word_tokenized_no_punct_no_sw]
word_tokenized_no_punct_no_sw_no_apostrophe = [y for x in word_tokenized_no_punct_no_sw_no_apostrophe for y in x]
word_tokenize_list_no_punct_lc_no_stowords_lemmatized = [lmtzr.lemmatize(x) for x in word_tokenized_no_punct_no_sw_no_apostrophe]
But the issue is that I need to perform the same steps on a bunch of .txt files in a folder. For that I'm trying to use PlaintextCorpusReader():
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
corpusdir = 'reports/'
newcorpus = PlaintextCorpusReader(corpusdir, '.txt')
Basically, I cannot just feed newcorpus into the previous functions because it's an object and not a string. So my questions are:
What should the functions look like (or how should I change the existing ones for a single file) to do tokenization and lemmatization for a corpus of files (using PlaintextCorpusReader())?
How would the TF-IDF approach (the standard sklearn approach of vectorizer = TfidfVectorizer()) look with PlaintextCorpusReader()?
Many thanks!
I think your question can be answered by reading this question, this other one, and the TfidfVectorizer docs. For completeness, I wrapped the answers up below:
First, you want to get the file ids; following the first question, you can get them as follows:
ids = newcorpus.fileids()
Then, based on the second question, you can retrieve the documents' words, sentences or paragraphs:
doc_words = []
doc_sents = []
doc_paras = []

for id_ in ids:
    # Get words
    doc_words.append(newcorpus.words(id_))
    # Get sentences
    doc_sents.append(newcorpus.sents(id_))
    # Get paragraphs
    doc_paras.append(newcorpus.paras(id_))
Now, at the i-th position of doc_words, doc_sents and doc_paras you have all the words, sentences and paragraphs (respectively) of every document in the corpus.
For tf-idf you probably just want the words. Since TfidfVectorizer.fit expects an iterable that yields str, unicode or file objects, you need to either join each document (an array of tokenized words) back into a single string, or use an approach similar to this one, which uses a dummy tokenizer to deal directly with arrays of words.
You can also pass your own tokenizer to TfidfVectorizer and use PlaintextCorpusReader simply for file reading.
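As an illustration, a minimal sketch of the dummy-tokenizer approach, feeding the already-tokenized documents (doc_words from the loop above) straight into TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer

def identity(tokens):
    # The documents are already lists of tokens, so preprocessing/tokenization are no-ops.
    return tokens

vectorizer = TfidfVectorizer(tokenizer=identity, preprocessor=identity,
                             token_pattern=None, lowercase=False)
tfidf_matrix = vectorizer.fit_transform(doc_words)  # one row per file in the corpus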

Gensim Word2Vec 'you must first build vocabulary before training the model'

I am trying to fit a Word2Vec model. According to the documentation for Gensim's Word2Vec, we do not need to call model.build_vocab before training it.
Yet it is asking me to do it. I have tried calling this function and it has not worked. I have also fit a Word2Vec model before without needing to call model.build_vocab.
Am I doing something wrong? Here is my code:
import pandas as pd
from gensim.models import Word2Vec

dataset = pd.read_table('genemap_copy.txt', delimiter='\t', lineterminator='\n')

def row_to_sentences(dataframe):
    columns = dataframe.columns.values
    corpus = []
    for index, row in dataframe.iterrows():
        if index == 1000:
            break
        sentence = ''
        for column in columns:
            sentence += ' ' + str(row[column])
        corpus.append([sentence])
    return corpus

corpus = row_to_sentences(dataset)
clean_corpus = [[sentence[0].lower()] for sentence in corpus]

# model = Word2Vec()
# model.build_vocab(clean_corpus)
model = Word2Vec(clean_corpus, size=100, window=5, min_count=5, workers=4)
Help is greatly appreciated!
Also I am using macOS Sierra.
There is not much support online for using Gensim with Mac D: .
I think my problem was the parameter min_count=5: it was dropping most of my words because they did not appear at least 5 times.
Try with LineSentence:
from gensim.models.word2vec import LineSentence
and then train your corpus with
model = Word2Vec(LineSentence(clean_corpus), size=100, window=5, min_count=5, workers=4)
Is it that you are appending a new list containing a single sentence string each time, with corpus.append([sentence])? You need to feed Word2Vec a series of tokenized sentences, though not necessarily sentences grouped by document. I'm also not clear on what is in your df, but have you tokenised the sentences already?
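For instance, a minimal sketch of that change applied to the code in the question, splitting each row's combined text into word tokens instead of wrapping the whole string in a one-element list (plain whitespace splitting is assumed to be good enough here):

# Each training item should be a list of word tokens, not [one long string].
clean_corpus = [sentence[0].lower().split() for sentence in corpus]
model = Word2Vec(clean_corpus, size=100, window=5, min_count=5, workers=4)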
Here is a generator class I've used before for Word2Vec:
import gensim
from nltk.tokenize import sent_tokenize
from gensim.utils import simple_preprocess

class MySentences(object):
    def __init__(self, docs):
        self.corpus = docs

    def __iter__(self):
        for doc in self.corpus:
            doc_sentences = sent_tokenize(doc)
            for sent in doc_sentences:
                yield simple_preprocess(sent)  # yields a tokenized sentence ['like','this','one','.']

sentences = MySentences(df['text'].tolist())
model = gensim.models.Word2Vec(sentences, min_count=5, workers=8, size=300, sg=1)

Spam filter using Python

I'm trying to make a simple spam filter using Python 2.7 and scikit-learn. I have a set of letters for training and a set of letters for testing. First, I want to vectorize the training set and fit a logistic regression on it; then I want to vectorize each letter in the test set and feed them into the classifier separately.
import codecs
import json
import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import linear_model

def classify(mail, vectorizer, logreg):
    vect_mail = vectorizer.transform(mail)
    res = logreg.predict(vect_mail)
    return res

def make_output(test_dir, vectorizer, logreg):
    with codecs.open('test.txt', 'w', 'utf-8') as out:
        for f in os.listdir(test_dir):
            mail = json.load(open(os.path.join(test_dir, f)), 'utf-8')
            result = classify(mail['body'].encode('ascii', 'ignore'), vectorizer, logreg)
            out.write(u'%s\t%s\n' % (f, result))

def read_train(train_dir):
    for f in os.listdir(train_dir):
        with open(os.path.join(train_dir, f), 'r') as fo:
            mail = json.load(fo, 'utf-8')
            yield mail

if __name__ == '__main__':
    train_mails = list(read_train('spam_data/train'))
    corpus = list()
    is_spam = list()
    for mail in train_mails:
        corpus.append(mail['body'].encode('ascii', 'ignore'))
        is_spam.append(mail['is_spam'])
    vectorizer = CountVectorizer()
    cnt_vect = vectorizer.fit_transform(corpus)
    logreg = linear_model.LogisticRegression()
    logreg.fit(cnt_vect, is_spam)
    make_output('spam_data/test', vectorizer, logreg)
But res = logreg.predict(vect_mail) returns a list, not a single value. So, I guess, the predictor interprets vect_mail as a collection of one-word documents rather than as a single document with many words. How should I rewrite this code?
According to sklearn's documentation, CountVectorizer.transform does not accept a single document to transform but an iterable of documents. Since a string in Python is an iterable of its characters, transform generates as many "documents" as there are characters in the string.
In order to fix this issue, pass a single-element list to transform:
vect_mail = vectorizer.transform([mail])
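For example, the classify helper from the question could be adjusted like this (a sketch; it also unwraps the single prediction that predict returns):

def classify(mail, vectorizer, logreg):
    # Wrap the single mail body in a list so transform sees one document.
    vect_mail = vectorizer.transform([mail])
    res = logreg.predict(vect_mail)
    return res[0]  # predict returns an array with one prediction per document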
