I'm trying to rank the task sentences of an occupation using BM25. I'm following this tutorial, but I get confused at this part: "Ranking of documents
Now that we've created our document indexes, we can give it queries and see which documents are the most relevant." I want the queries to be every sentence in my corpus column. How can I do that?
!pip install rank_bm25
import pandas as pd
from rank_bm25 import BM25Okapi
import string

corpus = pd.read_excel(r'/content/Job-occ.xlsx')

tokenized_corpus = [doc.split(" ") for doc in corpus['task']]

tokenized_corpus = []
for doc in corpus['task']:
    print(doc)
    doc_tokens = doc.split()
    tokenized_corpus.append(doc_tokens)

bm25 = BM25Okapi(tokenized_corpus)
Here is my data:
Basically you just need to iterate over your list of documents, for example like this:
import pandas as pd
from rank_bm25 import BM25Okapi
import string
def argsort(seq, reverse):
    # http://stackoverflow.com/questions/3071415/efficient-method-to-calculate-the-rank-vector-of-a-list-in-python
    return sorted(range(len(seq)), key=seq.__getitem__, reverse=reverse)

corpus = pd.read_excel(r'Job-occ.xlsx')
tokenized_corpus = [doc.split(" ") for doc in corpus['task']]
bm25 = BM25Okapi(tokenized_corpus)

for doc_as_query in tokenized_corpus:
    scores = bm25.get_scores(doc_as_query)

    # show top 3 (optional):
    print('\nQuery:', ' '.join(doc_as_query))
    top_similar = argsort(scores, True)
    for i in range(3):
        print('  top', i + 1, ':', corpus['task'][top_similar[i]])
I added sorting and printing of the top 3 most similar documents in case you want that (which is why I added the argsort function).
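If you only need the texts of the best matches rather than the raw scores, rank_bm25 also provides a get_top_n helper. A minimal alternative sketch, reusing the corpus and tokenized_corpus from above:

for doc_as_query in tokenized_corpus:
    # get_top_n returns the n best-matching documents from the list you pass in
    top_3 = bm25.get_top_n(doc_as_query, list(corpus['task']), n=3)
    print('\nQuery:', ' '.join(doc_as_query))
    for rank, task in enumerate(top_3, start=1):
        print('  top', rank, ':', task)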
I'm running the following Python script on a large dataset (around 100,000 items). Currently the execution is unacceptably slow; it would probably take at least a month to finish (no exaggeration). Obviously I would like it to run faster.
I've added a comment below to highlight where I think the bottleneck is. I have written my own database functions, which are imported.
Any help is appreciated!
# -*- coding: utf-8 -*-
import database
from gensim import corpora, models, similarities, matutils
from gensim.models.ldamulticore import LdaMulticore
import pandas as pd
from sklearn import preprocessing
def getTopFiveSimilarAuthors(author, authors, ldamodel, dictionary):
    vec_bow = dictionary.doc2bow([researcher['full_proposal_text']])
    vec_lda = ldamodel[vec_bow]
    # normalization
    try:
        vec_lda = preprocessing.normalize(vec_lda)
    except:
        pass

    similar_authors = []
    for index, other_author in authors.iterrows():
        if other_author['id'] != author['id']:
            other_vec_bow = dictionary.doc2bow([other_author['full_proposal_text']])
            other_vec_lda = ldamodel[other_vec_bow]
            # normalization
            try:
                other_vec_lda = preprocessing.normalize(vec_lda)
            except:
                pass
            sim = matutils.cossim(vec_lda, other_vec_lda)
            similar_authors.append({'id': other_author['id'], 'cosim': sim})

    similar_authors = sorted(similar_authors, key=lambda k: k['cosim'], reverse=True)
    return similar_authors[:5]

def get_top_five_similar(author, authors, ldamodel, dictionary):
    top_five_similar_authors = getTopFiveSimilarAuthors(author, authors, ldamodel, dictionary)
    database.insert_top_five_similar_authors(author['id'], top_five_similar_authors, cursor)

connection = database.connect()
authors = []
authors = pd.read_sql("SELECT id, full_text FROM author WHERE full_text IS NOT NULL;", connection)

# create the dictionary
dictionary = corpora.Dictionary([authors["full_text"].tolist()])

# create the corpus/ldamodel
author_text = []
for text in author_text['full_text'].tolist():
    word_list = []
    for word in text:
        word_list.append(word)
    author_text.append(word_list)

corpus = [dictionary.doc2bow(text) for text in author_text]
ldamodel = LdaMulticore(corpus, num_topics=50, id2word=dictionary, workers=30)

# BOTTLENECK: the script hangs after this point.
authors.apply(lambda x: get_top_five_similar(x, authors, ldamodel, dictionary), axis=1)
I noticed these problems in your code, though I'm not sure they are the reason for the slow execution.
This loop is useless and will never run as intended, because author_text was just initialized as an empty list, so author_text['full_text'] is not valid:
for text in author_text['full_text'].tolist():
    word_list = []
    for word in text:
        word_list.append(word)
    author_text.append(word_list)
Also, there is no need to loop over the words of each text; calling split() on the text is enough, and it gives you a list of words. Loop over the authors instead.
Try to write it like this. First:
all_authors_text = []
for _, author in authors.iterrows():
    all_authors_text.append(author['full_text'].split())
and after that make the dictionary:
dictionary = corpora.Dictionary(all_authors_text)
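From there, a minimal sketch of the remaining setup (assuming the same variable names as in the question) would build the bag-of-words corpus from all_authors_text rather than from the empty author_text list:

# sketch only, reusing `all_authors_text` and `dictionary` from above
corpus = [dictionary.doc2bow(text) for text in all_authors_text]
ldamodel = LdaMulticore(corpus, num_topics=50, id2word=dictionary, workers=30)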
I have a dataset containing 250 data points of text information. Now I need to create an inverted index containing information on the occurrences of each word within each key. I have around 250 keys and a preprocessed collection of words.
I have tried playing around with the Whoosh library, but with little success; the documentation seems a little vague on how it works. I receive an UnknownFieldError.
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

schema = Schema(Title = ID(stored = True), Content = KEYWORD(stored = True))

import os, os.path
from whoosh import index

if not os.path.exists('inv_indexdir'):
    os.mkdir('inv_indexdir')
ix = index.create_in('inv_indexdir', schema)

import whoosh.index as index
ix = index.open_dir('inv_indexdir')

writer = ix.writer()
writer.add_document(title = u'Title', content = u'Content', path = u'json_data.json')
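For what it's worth, an UnknownFieldError in Whoosh usually means the keyword names passed to add_document don't match the field names defined in the Schema (they are case-sensitive, and the schema above defines no path field). A minimal sketch with consistent field names, where the Path field is only illustrative:

from whoosh.fields import Schema, ID, KEYWORD
from whoosh import index
import os

# field names used in add_document must match the Schema exactly: Title, Content, Path
schema = Schema(Title=ID(stored=True), Content=KEYWORD(stored=True), Path=ID(stored=True))

if not os.path.exists('inv_indexdir'):
    os.mkdir('inv_indexdir')
ix = index.create_in('inv_indexdir', schema)

writer = ix.writer()
writer.add_document(Title=u'Title', Content=u'Content', Path=u'json_data.json')
writer.commit()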
I'm doing text analysis of Italian text (tokenization, lemmatization) for later use of TF-IDF techniques and for constructing clusters based on that. For preprocessing I use NLTK, and for a single text file everything works fine:
import nltk
import string
from nltk.stem.wordnet import WordNetLemmatizer
it_stop_words = nltk.corpus.stopwords.words('italian')
lmtzr = WordNetLemmatizer()
with open('3003.txt', 'r', encoding="latin-1") as myfile:
    data = myfile.read()
word_tokenized_list = nltk.tokenize.word_tokenize(data)
word_tokenized_no_punct = [str.lower(x) for x in word_tokenized_list if x not in string.punctuation]
word_tokenized_no_punct_no_sw = [x for x in word_tokenized_no_punct if x not in it_stop_words]
word_tokenized_no_punct_no_sw_no_apostrophe = [x.split("'") for x in word_tokenized_no_punct_no_sw]
word_tokenized_no_punct_no_sw_no_apostrophe = [y for x in word_tokenized_no_punct_no_sw_no_apostrophe for y in x]
word_tokenize_list_no_punct_lc_no_stowords_lemmatized = [lmtzr.lemmatize(x) for x in word_tokenized_no_punct_no_sw_no_apostrophe]
The problem is that I need to perform the same steps on a bunch of .txt files in a folder. For that I'm trying to use PlaintextCorpusReader():
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
corpusdir = 'reports/'
newcorpus = PlaintextCorpusReader(corpusdir, '.txt')
Basically I cannot just pass newcorpus into the previous functions because it's an object and not a string. So my questions are:
How should the functions look (or how should I change the existing ones, written for a single file) to do tokenization and lemmatization for a corpus of files (using PlaintextCorpusReader())?
How would the TF-IDF approach (the standard sklearn vectorizer = TfidfVectorizer()) look with PlaintextCorpusReader()?
Many thanks!
I think your question can be answered by reading this question, this other one and the TfidfVectorizer docs. For completeness, I've summarized the answers below:
First, you want to get the file ids; following the first question, you can get them as follows:
ids = newcorpus.fileids()
Then, based on the second question, you can retrieve each document's words, sentences or paragraphs:
doc_words = []
doc_sents = []
doc_paras = []
for id_ in ids:
    # Get words
    doc_words.append(newcorpus.words(id_))
    # Get sentences
    doc_sents.append(newcorpus.sents(id_))
    # Get paragraphs
    doc_paras.append(newcorpus.paras(id_))
Now, at the i-th position of doc_words, doc_sents and doc_paras you have all the words, sentences and paragraphs (respectively) of the i-th document in the corpus.
For tf-idf you probably just want the words. Since TfidfVectorizer's fit method takes an iterable that yields str, unicode or file objects, you need to either join each document (an array of tokenized words) back into a single string, or use an approach similar to this one, which uses a dummy tokenizer to deal directly with arrays of words.
You can also pass your own tokenizer to TfidfVectorizer and use PlaintextCorpusReader simply for file reading.
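A minimal sketch of that dummy-tokenizer approach, assuming the doc_words list built above (each entry is already a list of words):

from sklearn.feature_extraction.text import TfidfVectorizer

def identity(tokens):
    # the documents are already tokenized, so preprocessing/tokenizing is a no-op
    return tokens

vectorizer = TfidfVectorizer(tokenizer=identity, preprocessor=identity, token_pattern=None)
tfidf_matrix = vectorizer.fit_transform(doc_words)  # one row per file in the corpus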
I am trying to build a Phrases model over a big corpus, but I keep running into a memory error.
First I tried to feed my entire corpus in as one big generator.
Then I tried saving the model between each document:
import codecs
import gensim
import os
import random
import string
import sys
def gencorp(file_path):
    with codecs.open(file_path, 'rb', encoding="utf8") as doc:
        for sentence in doc:
            yield sentence.split()

out_corpus_dir = "C:/Users/Administrator/Desktop/word2vec/1billionwords_corpus_preprocessed/"

file_nb = 0
bi_detector = gensim.models.Phrases()
for file in os.listdir(out_corpus_dir):
    file_nb += 1
    file_path = out_corpus_dir + file
    bi_detector.add_vocab(gencorp(file_path))
    bi_detector.save("generic_EN_bigrams_v%i" % (file_nb/10))
    bi_detector = gensim.models.Phrases.load("generic_EN_bigrams_v%i" % (file_nb/10))

bi_detector.save("generic_EN_bigrams")
But none of these solutions work. However, generic_EN_bigrams_v0 is generated and saved.
So I am wondering if I can train a Phrases model per document and then find a way to merge them after.
Thank you for any insight :)
According to gensim's documentation, adding sentences incrementally should simply work, and you shouldn't have a memory problem since it only updates the statistics. Therefore, this minor modification to your code should make it work, i.e. you don't need to save and reload the bi_detector object on every file.
import codecs
import gensim
import os
import random
import string
import sys
def gencorp(file_path):
    with codecs.open(file_path, 'rb', encoding="utf8") as doc:
        for sentence in doc:
            yield sentence.split()

out_corpus_dir = "C:/Users/Administrator/Desktop/word2vec/1billionwords_corpus_preprocessed/"

file_nb = 0
bi_detector = gensim.models.Phrases()
for file in os.listdir(out_corpus_dir):
    file_nb += 1
    file_path = out_corpus_dir + file
    bi_detector.add_vocab(gencorp(file_path))
    # The following two lines are not required.
    # bi_detector.save("generic_EN_bigrams_v%i" % (file_nb/10))
    # bi_detector = gensim.models.Phrases.load("generic_EN_bigrams_v%i" % (file_nb/10))

bi_detector.save("generic_EN_bigrams")
I have a list of names that I am POS tagging with NLTK. I use it along with wordsegment, since the names are jumbled up, like thisisme.
I have successfully POS tagged these names using a loop; however, I am unable to extract the POS tags. The whole exercise is done from a CSV.
This is what I want the CSV to look like at the end of the day:
name, length, pos
thisisyou 6 NN, ADJ
My code so far is
import pandas as pd
import nltk
import wordsegment
from wordsegment import segment
from nltk import pos_tag, word_tokenize
from nltk.tag.util import str2tuple
def readdata():
    datafileread = pd.read_csv('data.net.lint.csv')
    domain_names = datafileread.DOMAIN[0:5]
    for domain_name in domain_names:
        seg_words = segment(domain_name)
        postagged = nltk.pos_tag(seg_words)
        limit_names = postagged
        for keys, values in postagged:
            print(values)

readdata()
And I get this result
NN
NN
ADJ
NN
This seems OK, but it is wrong: the POS tags should not each be on a new line. They should be run together, like NNNN.
The print function will insert a newline each time you use it. You need to avoid this. Try it like this:
for domain_name in domain_names:
    seg_words = segment(domain_name)
    postagged = nltk.pos_tag(seg_words)
    tags = ", ".join(t for w, t in postagged)
    print(domain_name, LENGTH, tags)
The join() method returns the POS tags as a single string, separated with ", ". I've just written LENGTH since I have no idea how you got the 6 in your example. Fill in whatever you meant.
PS. You don't need it here, but you can tell print() not to add the final newline like this: print(word, end=" ")
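If you then want the CSV layout from your example (name, length, pos), a minimal sketch could collect the rows and write them with pandas; here length is simply assumed to be len(domain_name), since it's unclear how you got the 6:

import pandas as pd

rows = []
for domain_name in domain_names:
    seg_words = segment(domain_name)
    postagged = nltk.pos_tag(seg_words)
    tags = ", ".join(t for w, t in postagged)
    rows.append({"name": domain_name, "length": len(domain_name), "pos": tags})

# the output file name is just an example
pd.DataFrame(rows).to_csv("tagged_names.csv", index=False)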