Merging two gensim Phrases models

I am trying to build a Phrases model over a large corpus, but I keep running into a memory error.
First I tried to feed the entire corpus through one big generator.
Then I tried saving and reloading the model after each file:
import codecs
import gensim
import os
import random
import string
import sys
def gencorp(file_path):
    with codecs.open(file_path, 'rb',encoding="utf8") as doc :
        for sentence in doc:
            yield sentence.split()
out_corpus_dir = "C:/Users/Administrator/Desktop/word2vec/1billionwords_corpus_preprocessed/"
file_nb = 0
bi_detector = gensim.models.Phrases()
for file in os.listdir(out_corpus_dir):
    file_nb += 1
    file_path = out_corpus_dir+file
    bi_detector.add_vocab(gencorp(file_path))
    bi_detector.save("generic_EN_bigrams_v%i"%(file_nb/10))
    bi_detector = gensim.models.Phrases.load("generic_EN_bigrams_v%i"%(file_nb/10))
bi_detector.save("generic_EN_bigrams")
Neither approach works, although generic_EN_bigrams_v0 does get generated and saved.
So I am wondering whether I can train one Phrases model per document and then merge them afterwards.
Thank you for any insight :)

According to the gensim documentation, adding sentences should simply work, and you shouldn't run into a memory problem, since add_vocab only updates the collected statistics. So this minor modification to your code should make it work; in other words, you don't need to recreate the bi_detector object.
import codecs
import gensim
import os
import random
import string
import sys
def gencorp(file_path):
    with codecs.open(file_path, 'rb',encoding="utf8") as doc :
        for sentence in doc:
            yield sentence.split()
out_corpus_dir = "C:/Users/Administrator/Desktop/word2vec/1billionwords_corpus_preprocessed/"
file_nb = 0
bi_detector = gensim.models.Phrases()
for file in os.listdir(out_corpus_dir):
    file_nb += 1
    file_path = out_corpus_dir+file
    bi_detector.add_vocab(gencorp(file_path))
    # The following two lines are not required.
    # bi_detector.save("generic_EN_bigrams_v%i"%(file_nb/10))
    # bi_detector = gensim.models.Phrases.load("generic_EN_bigrams_v%i"%(file_nb/10))
bi_detector.save("generic_EN_bigrams")
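On the original idea of training per file and merging afterwards: as far as I understand it, what a phrase detector accumulates is essentially a table of token and bigram counts, and count tables can simply be summed. The sketch below only illustrates that idea with plain collections.Counter objects. It does not touch gensim's internals, and the helper names count_ngrams and merge_counts are made up for this example.

```python
from collections import Counter

def count_ngrams(sentences):
    """Count unigrams and adjacent-word bigrams in an iterable of token lists."""
    counts = Counter()
    for tokens in sentences:
        counts.update(tokens)                   # unigram counts
        counts.update(zip(tokens, tokens[1:]))  # bigram counts, keyed by (w1, w2)
    return counts

def merge_counts(*counters):
    """Sum several count tables into one (hypothetical helper, not a gensim API)."""
    total = Counter()
    for counter in counters:
        total.update(counter)
    return total

# Two "files" counted separately, then merged.
part_a = count_ngrams([["new", "york", "city"], ["new", "york"]])
part_b = count_ngrams([["york", "city"], ["new", "york", "times"]])
merged = merge_counts(part_a, part_b)
print(merged[("new", "york")])
```

Merging the per-file tables gives the same counts as one pass over everything, which is why calling add_vocab repeatedly on a single model is usually the simpler route.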

Related

Question about Ranking of Documents using BM25

I'm trying to rank the task sentences of an occupation using BM25. I'm following this tutorial, but I get confused at this part: "Ranking of documents. Now that we've created our document indexes, we can give it queries and see which documents are the most relevant".
I want every sentence in my corpus column to be used as a query. How can I do that?
!pip install rank_bm25
import pandas as pd
from rank_bm25 import BM25Okapi
import string
corpus = pd.read_excel(r'/content/Job-occ.xlsx')
tokenized_corpus = [doc.split(" ") for doc in corpus['task']]
tokenized_corpus = []
for doc in corpus['task']:
    print(doc)
    doc_tokens = doc.split()
    tokenized_corpus.append(doc_tokens)
bm25 = BM25Okapi(tokenized_corpus)
here is my data

Basically you just need to iterate over your list of documents, for example like this:
import pandas as pd
from rank_bm25 import BM25Okapi
import string
def argsort(seq, reverse):
    # http://stackoverflow.com/questions/3071415/efficient-method-to-calculate-the-rank-vector-of-a-list-in-python
    return sorted(range(len(seq)), key=seq.__getitem__, reverse=reverse)
corpus = pd.read_excel(r'Job-occ.xlsx')
tokenized_corpus = [doc.split(" ") for doc in corpus['task']]
bm25 = BM25Okapi(tokenized_corpus)
for doc_as_query in tokenized_corpus:
    scores = bm25.get_scores(doc_as_query)
    # show top 3 (optional):
    print('\nQuery:',' '.join(doc_as_query))
    top_similar = argsort(scores, True)
    for i in range(3):
        print(' top', i+1,':', corpus['task'][top_similar[i]])
I added sorting and printing the top 3 similar documents, in case you want this (that's why I added the function argsort).
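One thing to watch for: when each document is used as a query against an index that also contains it, the document itself will usually come out on top. If that is not wanted, the ranking can skip the query's own position. A minimal sketch, assuming scores is the sequence returned by get_scores and self_index is the position of the query document; the helper name top_k_excluding_self is made up for this example.

```python
def top_k_excluding_self(scores, self_index, k=3):
    """Return the indices of the k highest scores, skipping self_index."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if i != self_index][:k]

# Made-up scores for a query that is document 1 of a 5-document corpus.
scores = [0.2, 3.1, 0.0, 1.7, 0.9]
best = top_k_excluding_self(scores, self_index=1, k=3)
print(best)
```

Inside the loop above this would be called with the loop position, for example by iterating with enumerate(tokenized_corpus).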

How to get the probability of bigrams in a text of sentences?

I have a text which has many sentences. How can I use nltk.ngrams to process it?
This is my code:
sequence = nltk.tokenize.word_tokenize(raw)
bigram = ngrams(sequence,2)
freq_dist = nltk.FreqDist(bigram)
prob_dist = nltk.MLEProbDist(freq_dist)
number_of_bigrams = freq_dist.N()
However, the code above treats all the sentences as one sequence. The sentences are separate, though, and I assume the last word of one sentence is unrelated to the first word of the next. How can I create bigrams for such a text? I also need prob_dist and number_of_bigrams, both of which are based on freq_dist.
There are similar questions, such as "What are ngram counts and how to implement using nltk?", but they are mostly about a single sequence of words.

You can use the new nltk.lm module. Here's an example, first get some data and tokenize it:
import os
import requests
import io #codecs
from nltk import word_tokenize, sent_tokenize
# Text version of https://kilgarriff.co.uk/Publications/2005-K-lineer.pdf
if os.path.isfile('language-never-random.txt'):
    with io.open('language-never-random.txt', encoding='utf8') as fin:
        text = fin.read()
else:
    url = "https://gist.githubusercontent.com/alvations/53b01e4076573fea47c6057120bb017a/raw/b01ff96a5f76848450e648f35da6497ca9454e4a/language-never-random.txt"
    text = requests.get(url).content.decode('utf8')
    with io.open('language-never-random.txt', 'w', encoding='utf8') as fout:
        fout.write(text)
# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]
Then the language modelling:
# Preprocess the tokenized text for 3-grams language modelling
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)
model = MLE(n) # Let's train a 3-gram maximum likelihood estimation model.
model.fit(train_data, padded_sents)
To get the counts:
model.counts['language'] # i.e. Count('language')
model.counts[['language']]['is'] # i.e. Count('is'|'language')
model.counts[['language', 'is']]['never'] # i.e. Count('never'|'language is')
To get the probabilities:
model.score('is', 'language'.split()) # P('is'|'language')
model.score('never', 'language is'.split()) # P('never'|'language is')
There are some kinks on the Kaggle platform when loading the notebook, but once it loads, this notebook should give a good overview of the nltk.lm module: https://www.kaggle.com/alvations/n-gram-language-model-with-nltk
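To make the sentence-boundary point concrete, here is a small sketch in plain Python (no nltk) that builds bigrams one sentence at a time, so no bigram ever spans two sentences. The padding symbols <s> and </s> and the helper names sentence_bigrams and mle_prob are choices made for this illustration, not part of any library.

```python
from collections import Counter

def sentence_bigrams(tokenized_sentences, pad=True):
    """Yield bigrams sentence by sentence, so none crosses a sentence boundary."""
    for tokens in tokenized_sentences:
        if pad:
            tokens = ["<s>"] + list(tokens) + ["</s>"]
        yield from zip(tokens, tokens[1:])

sents = [["language", "is", "never", "random"], ["language", "is", "fun"]]
bigram_counts = Counter(sentence_bigrams(sents))
number_of_bigrams = sum(bigram_counts.values())

# How often each word occurs as the first element of a bigram.
context_counts = Counter()
for (w1, _w2), n in bigram_counts.items():
    context_counts[w1] += n

def mle_prob(word, context):
    """Maximum likelihood estimate of P(word | context) from the counts above."""
    if context_counts[context] == 0:
        return 0.0
    return bigram_counts[(context, word)] / context_counts[context]

print(number_of_bigrams, mle_prob("never", "is"))
```

With nltk, the same per-sentence idea applies: generate the n-grams for each sentence separately and feed them all into one frequency distribution.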

Gensim LDA Multicore Python script runs much too slow

I'm running the following Python script on a large dataset (around 100,000 items). Currently the execution is unacceptably slow; it would probably take at least a month to finish (no exaggeration). Obviously I would like it to run faster.
I've added a comment below to highlight where I think the bottleneck is. I have written my own database functions, which are imported.
Any help is appreciated!
# -*- coding: utf-8 -*-
import database
from gensim import corpora, models, similarities, matutils
from gensim.models.ldamulticore import LdaMulticore
import pandas as pd
from sklearn import preprocessing
def getTopFiveSimilarAuthors(author, authors, ldamodel, dictionary):
    vec_bow = dictionary.doc2bow([researcher['full_proposal_text']])
    vec_lda = ldamodel[vec_bow]
    # normalization
    try:
        vec_lda = preprocessing.normalize(vec_lda)
    except:
        pass
    similar_authors = []
    for index, other_author in authors.iterrows():
        if(other_author['id'] != author['id']):
            other_vec_bow = dictionary.doc2bow([other_author['full_proposal_text']])
            other_vec_lda = ldamodel[other_vec_bow]
            # normalization
            try:
                other_vec_lda = preprocessing.normalize(vec_lda)
            except:
                pass
            sim = matutils.cossim(vec_lda, other_vec_lda)
            similar_authors.append({'id': other_author['id'], 'cosim': sim})
    similar_authors = sorted(similar_authors, key=lambda k: k['cosim'], reverse=True)
    return similar_authors[:5]
def get_top_five_similar(author, authors, ldamodel, dictionary):
    top_five_similar_authors = getTopFiveSimilarAuthors(author, authors, ldamodel, dictionary)
    database.insert_top_five_similar_authors(author['id'], top_five_similar_authors, cursor)
connection = database.connect()
authors = []
authors = pd.read_sql("SELECT id, full_text FROM author WHERE full_text IS NOT NULL;", connection)
# create the dictionary
dictionary = corpora.Dictionary([authors["full_text"].tolist()])
# create the corpus/ldamodel
author_text = []
for text in author_text['full_text'].tolist():
    word_list = []
    for word in text:
        word_list.append(word)
    author_text.append(word_list)
corpus = [dictionary.doc2bow(text) for text in author_text]
ldamodel = LdaMulticore(corpus, num_topics=50, id2word = dictionary, workers=30)
#BOTTLENECK: the script hangs after this point.
authors.apply(lambda x: get_top_five_similar(x, authors, ldamodel, dictionary), axis=1)

I noticed these problems in your code, though I'm not sure they are the reason for the slow execution.
This loop is useless and will never iterate: author_text is an empty list at that point, so author_text['full_text'] cannot produce any texts (as written it raises a TypeError):
for text in author_text['full_text'].tolist():
    word_list = []
    for word in text:
        word_list.append(word)
    author_text.append(word_list)
Also, there is no need to loop over the words of each text. Calling split() on the text is enough to get a list of words, and you can do that while looping over the authors' texts.
Try writing it like this.
First:
all_authors_text = []
for text in authors['full_text']:
    all_authors_text.append(text.split())
And after that, build the dictionary:
dictionary = corpora.Dictionary(all_authors_text)
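Something else in the question's code that looks expensive: for every author, the LDA vector of every other author is inferred again inside the inner loop, so inference runs on the order of n squared times. I can't say for sure that this is the bottleneck, but a likely improvement is to compute each author's vector once, keep it, and then only compare stored vectors. The sketch below shows that structure with plain Python. It assumes the vectors are sparse lists of (topic_id, weight) pairs, and the helper names cosine and top_k_similar are made up for this example.

```python
import math

def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as [(topic_id, weight), ...]."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(weight * b.get(topic, 0.0) for topic, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_similar(vectors, k=5):
    """vectors: {author_id: sparse vector}, each computed exactly once beforehand.

    Returns {author_id: [(other_id, similarity), ...]} with the k best matches.
    """
    result = {}
    for author_id, vec in vectors.items():
        sims = [(other_id, cosine(vec, other_vec))
                for other_id, other_vec in vectors.items()
                if other_id != author_id]
        sims.sort(key=lambda pair: pair[1], reverse=True)
        result[author_id] = sims[:k]
    return result

# Made-up topic vectors for three authors.
vectors = {
    "a": [(0, 0.9), (1, 0.1)],
    "b": [(0, 0.8), (1, 0.2)],
    "c": [(1, 1.0)],
}
top = top_k_similar(vectors, k=1)
print(top["a"][0][0])
```

The pairwise comparison is still quadratic, but comparing two short vectors is far cheaper than running model inference for each pair.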

NLTK POS tags extraction, tried key, values but not there yet

I have a list of names that I am POS tagging with NLTK. I use it together with wordsegment, because the names are jumbled together, like thisisme.
I have successfully POS tagged these names in a loop, but I am unable to extract the POS tags. The whole exercise is done from a CSV file.
This is what I want the CSV to look like at the end of the day.
name, length, pos
thisisyou 6 NN, ADJ
My code so far is
import pandas as pd
import nltk
import wordsegment
from wordsegment import segment
from nltk import pos_tag, word_tokenize
from nltk.tag.util import str2tuple
def readdata():
    datafileread = pd.read_csv('data.net.lint.csv')
    domain_names = datafileread.DOMAIN[0:5]
    for domain_name in domain_names:
        seg_words = segment(domain_name)
        postagged = nltk.pos_tag(seg_words)
        limit_names = postagged
        for keys,values in postagged:
            print (posttagged)
readdata()
And I get this result
NN
NN
ADJ
NN
This looks OK, but it is wrong: the tags belonging to one name should not each be on a new line. They should simply be run together, like NNNN.

The print function will insert a newline each time you use it. You need to avoid this. Try it like this:
for domain_name in domain_names:
    seg_words = segment(domain_name)
    postagged = nltk.pos_tag(seg_words)
    tags = ", ".join(t for w, t in postagged)
    print(domain_name, LENGTH, tags)
The join() method returns the POS tags as a single string, separated with ", ". I've just written LENGTH since I have no idea how you got the 6 in your example. Fill in whatever you meant.
PS. You don't need it here, but you can tell print() not to add the final newline like this: print(word, end=" ")
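Since the goal is a CSV file, here is a small sketch of turning the tagged output into a row with the standard csv module. The tagged list is made-up sample data standing in for the output of nltk.pos_tag, the helper name tags_row is hypothetical, and the length column is left out because it is unclear how it is computed.

```python
import csv
import io

def tags_row(name, tagged):
    """Build a [name, "TAG, TAG, ..."] row from a list of (word, tag) pairs."""
    return [name, ", ".join(tag for _word, tag in tagged)]

# Made-up sample standing in for nltk.pos_tag(segment("thisisme")).
tagged = [("this", "DT"), ("is", "VBZ"), ("me", "PRP")]
row = tags_row("thisisme", tagged)

# Write to an in-memory buffer here; a real script would open a file instead.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
print(buffer.getvalue())
```

The csv writer quotes the tag field because it contains commas, so the row still parses back into two columns.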

Create Information content corpora to be used by WordNet from a custom dump

I am using the Brown corpus file ic-brown.dat to calculate the information content of a word with NLTK's WordNet library, but the results do not look good. I was wondering how I can build my own custom.dat (information content file).
custom_ic = wordnet_ic.ic('custom.dat')

In (...)/nltk_data/corpora/wordnet_ic/ you will find IC-compute.sh, which contains calls to some Perl scripts that generate the IC .dat files from a given corpus. I found the instructions tricky, and I do not have the required Perl scripts, so I decided to write a Python script by analyzing the structure of the .dat files and the wordnet.ic() function.
You can compute your own IC counts by calling the wordnet.ic() function on a corpus reader object. In fact, you only need an object with a words() method that returns all the words in the corpus. For more details, check the ic function (lines 1729 to 1789) in the file ..../nltk/corpus/reader/wordnet.py.
For example, for the XML version of the BNC corpus (2007):
reader_bnc = nltk.corpus.reader.BNCCorpusReader(root='../Corpus/2554/2554/download/Texts/', fileids=r'[A-K]/\w*/\w*\.xml')
bnc_ic = wn.ic(reader_bnc, False, 0.0)
To generate the .dat file I created the following functions:
def is_root(synset_x):
    if synset_x.root_hypernyms()[0] == synset_x:
        return True
    return False
def generate_ic_file(IC, output_filename):
    """Dump in output_filename the IC counts.
    The expected format of IC is a dict
    {'v':defaultdict, 'n':defaultdict, 'a':defaultdict, 'r':defaultdict}"""
    with codecs.open(output_filename, 'w', encoding='utf-8') as fid:
        # Hash code of WordNet 3.0
        fid.write("wnver::eOS9lXC6GvMWznF1wkZofDdtbBU"+"\n")
        # We only stored nouns and verbs because those are the only POS tags
        # supported by wordnet.ic() function
        for tag_type in ['v', 'n']:#IC:
            for key, value in IC[tag_type].items():
                if key != 0:
                    synset_x = wn.of2ss(of="{:08d}".format(key)+tag_type)
                    if is_root(synset_x):
                        fid.write(str(key)+tag_type+" "+str(value)+" ROOT\n")
                    else:
                        fid.write(str(key)+tag_type+" "+str(value)+"\n")
    print("Done")
generate_ic_file(bnc_ic, "../custom.dat")
Then just load the generated file:
custom_ic = wordnet_ic.ic('../custom.dat')
The imports needed are:
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic
import codecs
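To sanity-check a generated file before handing it to wordnet_ic.ic(), it can help to read it back. The sketch below parses the layout that generate_ic_file writes above: a wnver:: header line, then one line per synset with the offset plus POS letter, the count, and an optional ROOT marker. The function name parse_ic_lines and the sample values are made up for illustration, and this is not the parser NLTK itself uses.

```python
def parse_ic_lines(lines):
    """Parse IC lines into (header, {(offset, pos): (count, is_root)})."""
    lines = iter(lines)
    header = next(lines).strip()
    if not header.startswith("wnver::"):
        raise ValueError("missing wnver:: header line")
    entries = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        key = fields[0]
        offset, pos = int(key[:-1]), key[-1]  # e.g. "1740n" -> (1740, "n")
        is_root = len(fields) == 3 and fields[2] == "ROOT"
        entries[(offset, pos)] = (float(fields[1]), is_root)
    return header, entries

# Made-up sample content in the same layout.
sample = [
    "wnver::eOS9lXC6GvMWznF1wkZofDdtbBU\n",
    "1740n 1915712.0 ROOT\n",
    "1930n 859272.0\n",
    "2137v 0.5\n",
]
header, entries = parse_ic_lines(sample)
print(len(entries), entries[(1740, "n")])
```

For a real file, pass an open file object instead of the sample list, for example the result of codecs.open('../custom.dat', encoding='utf-8').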
