NLTK title classifier - python

Apologies in advance if this has already been asked and answered, but I couldn't find any answer close to my problem. I am also somewhat of a noob when it comes to Python, so sorry too for the long post.
I am trying to build a Python script that, based on a user-given PubMed query (e.g., "cancer"), retrieves a file with N article titles and evaluates their relevance to the subject in question.
I have successfully built the "PubMed search and save" part, having it return a .txt file containing article titles (each line corresponding to a different article title), for instance:
Feasibility of an ovarian cancer quality-of-life psychoeducational intervention.
A randomized trial to increase physical activity in breast cancer survivors.
Having this file, the idea is to feed it into a classifier and have it say whether the titles in the .txt file are relevant to a subject, for which I have a "gold standard" of titles that I know are relevant (i.e., I want to know the precision and recall of the queried set of titles against my gold standard). For example: Title 1 has the word "neoplasm" X times and "study" N times, and is therefore considered relevant to "cancer" (Y/N).
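(Roughly, the evaluation I have in mind would be something like the sketch below, where gold_titles.txt is a hypothetical file holding my gold-standard titles, one per line.)
# Rough sketch of the intended evaluation; 'gold_titles.txt' is hypothetical.
retrieved = set(line.strip() for line in open('SR_titles.txt'))
gold = set(line.strip() for line in open('gold_titles.txt'))
true_positives = len(retrieved & gold)
precision = true_positives / len(retrieved)
recall = true_positives / len(gold)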
For this, I have been using NLTK to (try to) classify my text. I have pursued 2 different approaches, both unsuccessfully:
Approach 1
Loading the .txt file, preprocessing it (tokenization, lower-casing, removing stopwords), converting the text to NLTK text format, and finding the N most common words. All this runs without problems.
# imports assumed for the snippets below
import nltk
from nltk import word_tokenize, FreqDist
from nltk.corpus import stopwords

f = open('SR_titles.txt')
raw = f.read()
tokens = word_tokenize(raw)
words = [w.lower() for w in tokens]
words = [w for w in words if w not in stopwords.words("english")]
text = nltk.Text(words)
fdist = FreqDist(text)
>>><FreqDist with 116 samples and 304 outcomes>
I am also able to find collocations/bigrams in the text, which is something that might be important later on.
text.collocations()
>>>randomized controlled; breast cancer; controlled trial; physical
>>>activity; metastatic breast; prostate cancer; randomised study; early
>>>breast; cancer patients; feasibility study; psychosocial support;
>>>group psychosocial; group intervention; randomized trial
Following NLTK's tutorial, I built a feature extractor, so the classifier will know which aspects of the data it should pay attention to.
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features
This would, for instance, return something like this:
{'contains(series)': False, 'contains(disorders)': False,
'contains(group)': True, 'contains(neurodegeneration)': False,
'contains(human)': False, 'contains(breast)': True}
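For reference, word_features (used inside document_features but not shown above) is built from the frequency distribution, along the lines of the NLTK book example; the 2000 cut-off is just the book's value:
# word_features assumed to come from the FreqDist above
word_features = [w for (w, _) in fdist.most_common(2000)]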
The next thing would be to use the feature extractor to train a classifier to label new article titles, and following NLTK's example, I tried this:
featuresets = [(document_features(d), c) for (d,c) in text]
Which gives me the error:
ValueError: too many values to unpack
I quickly googled this and found that it has something to do with tuples, but I did not get how I can solve it (like I said, I'm somewhat of a noob at this), other than by creating a categorized corpus (I would still like to understand how I can solve this tuple problem).
Therefore, I tried approach 2, following Jacob Perkins's Text Processing with NLTK Cookbook:
I started by creating a corpus and assigning categories. This time I had two different .txt files, one for each subject of article titles.
reader = CategorizedPlaintextCorpusReader('.', r'.*\.txt',
    cat_map={'hd_titles.txt': ['HD'], 'SR_titles.txt': ['Cancer']})
With "reader.raw()" I get something like this:
u"A pilot investigation of a multidisciplinary quality of life intervention for men with biochemical recurrence of prostate cancer.\nA randomized controlled pilot feasibility study of the physical and psychological effects of an integrated support programme in breast cancer.\n"
The categories for the corpus seem to be right:
reader.categories()
>>>['Cancer', 'HD']
Then, I try to construct a list of documents, labeled with the appropriate categories:
documents = [(list(reader.words(fileid)), category)
             for category in reader.categories()
             for fileid in reader.fileids(category)]
Which returns me something like this:
[([u'A', u'pilot', u'investigation', u'of', u'a', u'multidisciplinary',
u'quality', u'of', u'life', u'intervention', u'for', u'men', u'with',
u'biochemical', u'recurrence', u'of', u'prostate', u'cancer', u'.'],
'Cancer'),
([u'Trends', u'in', u'the', u'incidence', u'of', u'dementia', u':',
u'design', u'and', u'methods', u'in', u'the', u'Alzheimer', u'Cohorts',
u'Consortium', u'.'], 'HD')]
The next step would be creating a list of labeled feature sets, for which I used the following function, which takes a corpus and a feature_detector function (the document_features defined above). It then constructs and returns a mapping of the form {label: [featureset]}.
def label_feats_from_corpus(corp, feature_detector=document_features):
    label_feats = collections.defaultdict(list)
    for label in corp.categories():
        for fileid in corp.fileids(categories=[label]):
            feats = feature_detector(corp.words(fileids=[fileid]))
            label_feats[label].append(feats)
    return label_feats
lfeats = label_feats_from_corpus(reader)
>>>defaultdict(<type 'list'>, {'HD': [{'contains(series)': True,
'contains(disorders)': True, 'contains(neurodegeneration)': True,
'contains(anilinoquinazoline)': True}], 'Cancer': [{'contains(cancer)':
True, 'contains(of)': True, 'contains(group)': True, 'contains(After)':
True, 'contains(breast)': True}]})
(the list is a lot bigger and everything is set to True).
Then I want to construct a list of labeled training instances and testing instances.
The split_label_feats() function takes a mapping returned from
label_feats_from_corpus() and splits each list of feature sets
into labeled training and testing instances.
def split_label_feats(lfeats, split=0.75):
    train_feats = []
    test_feats = []
    for label, feats in lfeats.items():
        cutoff = int(len(feats) * split)
        train_feats.extend([(feat, label) for feat in feats[:cutoff]])
        test_feats.extend([(feat, label) for feat in feats[cutoff:]])
    return train_feats, test_feats
train_feats, test_feats = split_label_feats(lfeats, split=0.75)
len(train_feats)
>>>0
len(test_feats)
>>>2
print(test_feats)
>>>[({'contains(series)': True, 'contains(China)': True,
'contains(disorders)': True, 'contains(neurodegeneration)': True},
'HD'), ({'contains(cancer)': True, 'contains(of)': True,
'contains(group)': True, 'contains(After)': True, 'contains(breast)':
True}, 'Cancer')]
I should've ended up with a lot more labeled training instances and labeled testing instances, I guess.
This brings me to where I am now. I searched Stack Overflow, Biostars, etc., and could not find out how to deal with either problem, so any help would be deeply appreciated.
TL;DR: Can't label a single .txt file to classify text, and can't get a corpus correctly labeled (again, to classify text).
If you've read this far, thank you as well.

You're getting an error on the following line:
featuresets = [(document_features(d), c) for (d,c) in text]
Here, you are supposed to convert each document (i.e., each title) to a dictionary of features. But to train with the results, the train() method needs both the feature dictionaries and the correct answer ("label"). So the normal workflow is to have a list of (document, label) pairs, which you transform to (features, label) pairs. It looks like your variable documents has the right structure, so if you just use it instead of text, this should work correctly:
featuresets = [(document_features(d), c) for (d,c) in documents]
As you go forward, get in the habit of inspecting your data carefully and figuring out what will (and should) happen to them. If text is a list of titles, it makes no sense to unpack each title to a pair (d, c). That should have pointed you in the right direction.
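For completeness, a rough sketch of the rest of the NLTK-book workflow, using your own documents and document_features (the shuffle and the 75% split are arbitrary choices, not something taken from your code):
import random
import nltk

random.shuffle(documents)
featuresets = [(document_features(d), c) for (d, c) in documents]
cutoff = int(len(featuresets) * 0.75)
train_set, test_set = featuresets[:cutoff], featuresets[cutoff:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)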

In featuresets = [(document_features(d), c) for (d,c) in text], I'm not sure what you expect to get from text. text seems to be an nltk class that is simply a wrapper around a generator. It appears to give you a single string on each iteration, which is why you are getting an error: you are asking for two items when it only has one to give.
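A toy illustration of the difference (hypothetical tokens):
import nltk

toy = nltk.Text(['ovarian', 'cancer', 'intervention'])
for item in toy:
    print(item)         # each iteration yields a single token string
# for (d, c) in toy:    # unpacking a multi-character string into two names
#     ...               # raises "ValueError: too many values to unpack"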

Related

How can I find the probability of a sentence using GPT-2?

I'm trying to write a program that, given a list of sentences, returns the most probable one. I want to use GPT-2, but I am quite new to using it (as in, I don't really know how to do it). I'm planning on finding the probability of a word given the previous words and multiplying all the probabilities together to get the overall probability of that sentence occurring; however, I don't know how to find the probability of a word occurring given the previous words. This is my (pseudo)code:
sentences = # my list of sentences
max_prob = 0
best_sentence = sentences[0]

for sentence in sentences:
    prob = 1  # probability of that sentence
    for idx, word in enumerate(sentence.split()[1:]):
        prob *= probability(word, " ".join(sentence.split()[:idx + 1]))  # this is where I need help
    if prob > max_prob:
        max_prob = prob
        best_sentence = sentence

print(best_sentence)
Can I have some help please?
You can also try lm-scorer, a tiny wrapper around transformers that allows you to get sentence probabilities using models that support it (only GPT-2 models are implemented at the time of writing).
https://github.com/simonepri/lm-scorer
I just used it myself and it works perfectly.
Warning: If you use other transformers / pipelines in the same environment, things may get messy.
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import numpy as np

model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

def score(tokens_tensor):
    loss = model(tokens_tensor, labels=tokens_tensor)[0]
    return np.exp(loss.cpu().detach().numpy())

texts = ['i would like to thank you mr chairman', 'i would liking to thanks you mr chair in', 'thnks chair']
for text in texts:
    tokens_tensor = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")
    print(text, score(tokens_tensor))
This code snippet could be an example of what you are looking for. You feed the model a list of sentences, and it scores each one, where lower is better.
The output of the code above is:
i would like to thank you mr chairman 122.3066
i would liking to thanks you mr chair in 1183.7637
thnks chair 14135.129
I wrote a set of functions that can do precisely what you're looking for. Recall that GPT-2 parses its input into tokens (not words): the last word in 'Joe flicked the grasshopper' is actually three tokens: ' grass', 'ho', and 'pper'. The cloze_finalword function takes this into account, and computes the probabilities of all tokens (conditioned on the tokens appearing before them). You can adapt part of this function so that it returns what you're looking for. I hope you find the code useful!
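Not cloze_finalword itself, but a minimal sketch of the same idea (per-token probabilities from one forward pass), assuming a standard Hugging Face transformers setup:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

def sentence_logprob(text):
    input_ids = tokenizer.encode(text, return_tensors='pt')
    with torch.no_grad():
        logits = model(input_ids)[0]                      # (1, seq_len, vocab_size)
    # Position t predicts token t+1, so shift logits and targets by one.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:].unsqueeze(-1)
    token_logprobs = log_probs.gather(2, targets).squeeze(-1)
    return token_logprobs.sum().item()                    # log-prob of tokens after the first

print(sentence_logprob('Joe flicked the grasshopper'))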
I think GPT-2 is a bit overkill for what you're trying to achieve. You can build a basic language model which will give you sentence probability using NLTK. A tutorial for this can be found here.
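For instance, a hedged sketch with nltk.lm (the two training sentences are placeholders for whatever corpus you train on):
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline

corpus = [['i', 'would', 'like', 'to', 'thank', 'you'],
          ['thank', 'you', 'mr', 'chairman']]
n = 2
train_data, vocab = padded_everygram_pipeline(n, corpus)
lm = Laplace(n)
lm.fit(train_data, vocab)

sentence = ['i', 'would', 'like', 'to', 'thank', 'you']
# Sum of bigram log-probabilities; higher (less negative) means more probable.
print(sum(lm.logscore(word, [prev]) for prev, word in zip(sentence, sentence[1:])))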

Doc2Vec not providing adequate results in most_similar

I'm trying to use Doc2Vec to go through the classic exercise of training on Wikipedia articles, using the article title as the tag.
Here's my code and the results. Is there something I'm missing that would explain why most_similar isn't giving matching results? I followed this tutorial, but used the wiki-english-20171001 dataset that comes with gensim.
import gensim.downloader as api
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import re

def cleanText(text):
    text = re.sub(r'\|\|\|', r' ', text)
    text = re.sub(r'http\S+', r'<URL>', text)
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return text

wiki = api.load("wiki-english-20171001")
data = [d for d in wiki]

for i in range(10):
    print(data[i])

def my_create_tagged_docs(data):
    for wikiidx in range(len(data)):
        yield TaggedDocument([i for i in data[wikiidx].get('section_texts') for i in cleanText(i).split()],
                             [data[wikiidx].get('title')])

wiki_data = my_create_tagged_docs(data)
del data
del wiki

model = Doc2Vec(dm=1, dm_mean=1, size=200, window=8, min_count=19, iter=10, epochs=40)
model.build_vocab(wiki_data)
model.train(wiki_data, total_examples=model.corpus_count, epochs=model.epochs)
model.docvecs.most_similar(positive=["Lady Gaga"], topn=10)
[('Chlorothrix', 0.35521823167800903),
("A Child's Garden of Verses", 0.3533579707145691),
('Fish Mooney', 0.35129639506340027),
('2000 Paris–Roubaix', 0.3463437855243683),
('Calvin C. Chaffee', 0.3439667224884033),
('Murders of Eve Stratford and Lynne Weedon', 0.3397218585014343),
('Black Air', 0.3396576941013336),
('Turzyn', 0.3312540054321289),
('Scott Baker', 0.33018186688423157),
('Amongst the Waves', 0.3297169804573059)]
model.docvecs.most_similar(positive=["Machine learning"], topn=10)
[('Wolf Rock, Connecticut', 0.3855834901332855),
('Amália Rodrigues', 0.3349645137786865),
('Victoria Park, Leicester', 0.33312514424324036),
('List of visual anthropology films', 0.3311382532119751),
('Sadqay Teri Mout Tun', 0.3287636637687683),
('T. Damodaran', 0.32876330614089966),
('Urqu Jawira (Aroma)', 0.32281631231307983),
('Tiggy Wiggy', 0.3226730227470398),
('Frédéric Brun (cyclist, born 1988)', 0.32106447219848633),
('Unholy Crusade', 0.3200794756412506)]
It looks like your wiki_data is a single-pass generator, as returned by my_create_tagged_docs(), which can be iterated over only once - not an iterable object capable of many iterations, as the many steps of Doc2Vec training require.
You can test your wiki_data object for whether it's multiply-iterable, just after it's been assigned, by executing:
print(sum(1 for _ in wiki_data))
print(sum(1 for _ in wiki_data))
If you see the same number twice – the total number of documents – all's well. If the 2nd number is 0, you've created a single-use iterator instead of a multiple-use iterable.
As a result, the build_vocab() call will work to initialize the known-vocabulary & model – but then the train() will see an empty iterable, completing instantly with no real training happening. (If you run with logging at the INFO level, this may be obvious in the log timestamps for the various steps.)
Two possible fixes:
If you're lucky enough to have enough RAM to hold the whole corpus as Python objects, converting it into an in-memory list would ensure it's multiply-iterable:
wiki_data = list(my_create_tagged_docs(data))
But most won't have that much RAM & shouldn't/needn't take that step. Instead, you can define a class for an iterable view on the data, which can return a fresh iterator every time it's needed. There's an example with further explanation in a blog post by the founder of the gensim project at:
https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/
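A minimal sketch of such a class, reusing cleanText, TaggedDocument, and the data list from the question (so you would keep data around rather than deleting it):
class TaggedWikiCorpus:
    def __init__(self, raw_records):
        self.raw_records = raw_records
    def __iter__(self):
        # A fresh generator is created on every pass, so build_vocab() and
        # train() can each iterate over the corpus as many times as needed.
        for record in self.raw_records:
            words = [w for section in record.get('section_texts')
                       for w in cleanText(section).split()]
            yield TaggedDocument(words, [record.get('title')])

wiki_data = TaggedWikiCorpus(data)   # multiply-iterable, unlike the generator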

NLP for multi feature data set using TensorFlow

I am just a beginner in this subject. I have tested some NNs for image recognition, as well as NLP for sequence classification.
This second topic is interesting to me.
Using
sentences = [
    'some test sentence',
    'and the second sentence'
]
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
sentences = tokenizer.texts_to_sequences(sentences)
will result in an array of size [n, 1], where n is the number of words in the sentence. And assuming I have implemented padding correctly, each training example in the set will be of size [n, 1], where n is the max sentence length.
That prepared training set I can pass into keras model.fit.
But what about when I have multiple features in my dataset?
Let's say I would like to build an event prioritization algorithm and my data structure would look like:
[event_description, event_category, event_location, label]
Trying to tokenize such an array would result in an [n, m] matrix, where n is the maximum word length and m is the number of features.
How do I prepare such a dataset so a model can be trained on it?
Would this approach be OK:
# Going through the training set to get all features into specific arrays
for data in dataset:
    training_sentence.append(data['event_description'])
    training_category.append(data['event_category'])
    training_location.append(data['event_location'])
    training_labels.append(data['label'])

# Tokenize each array so it contains tokenized values
tokenizer.fit_on_texts(training_sentence)
tokenizer.fit_on_texts(training_category)
tokenizer.fit_on_texts(training_location)
sequences = tokenizer.texts_to_sequences(training_sentence)
categories = tokenizer.texts_to_sequences(training_category)
locations = tokenizer.texts_to_sequences(training_location)

# Concatenating arrays with features into one
training_example = numpy.concatenate([sequences, categories, locations])

# omitting model definition, training the model
model.fit(training_example, training_labels, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))
I haven't tested it yet. I just want to make sure I understand everything correctly and that my assumptions are correct.
Is this a correct approach to doing NLP with a NN?
I know of two common ways to manage multiple input sequences, and your approach lands somewhere between them.
One approach is to design a multi-input model with each of your text columns as a different input. They can share the vocabulary and/or embedding layer later, but for now you still need a distinct input sub-model for each of description, category, etc.
Each of these becomes an input to the network, using the Model(inputs=[...], outputs=rest_of_nn) syntax. You will need to design rest_of_nn so it can take multiple inputs. This can be as simple as your current concatenation, or you could use additional layers to do the synthesis.
It could look something like this:
# Assuming tf.keras; adjust the imports if you use standalone Keras.
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Flatten, concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.text import Tokenizer
import tensorflow.keras.backend as K

# Build separate vocabularies. This could be shared.
desc_tokenizer = Tokenizer()
desc_tokenizer.fit_on_texts(training_sentence)
desc_vocab_size = len(desc_tokenizer.word_index)
categ_tokenizer = Tokenizer()
categ_tokenizer.fit_on_texts(training_category)
categ_vocab_size = len(categ_tokenizer.word_index)
# Inputs.
desc = Input(shape=(desc_maxlen,))
categ = Input(shape=(categ_maxlen,))
# Input encodings, opting for different embeddings.
# Descriptions go through an LSTM as a demo of extra processing.
embedded_desc = Embedding(desc_vocab_size, desc_embed_size, input_length=desc_maxlen)(desc)
encoded_desc = LSTM(categ_embed_size, return_sequences=True)(embedded_desc)
encoded_categ = Embedding(categ_vocab_size, categ_embed_size, input_length=categ_maxlen)(categ)
# Rest of the NN, which knows how to put everything together to get an output.
merged = concatenate([encoded_desc, encoded_categ], axis=1)
rest_of_nn = Dense(hidden_size, activation='relu')(merged)
rest_of_nn = Flatten()(rest_of_nn)
rest_of_nn = Dense(output_size, activation='softmax')(rest_of_nn)
# Create the model, assuming some sort of classification problem.
model = Model(inputs=[desc, categ], outputs=rest_of_nn)
model.compile(optimizer='adam', loss=K.categorical_crossentropy)
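Training then passes one array per input; a hedged usage sketch, where desc_padded and categ_padded stand for your padded integer sequences and training_labels_onehot for one-hot labels (all hypothetical names):
model.fit([desc_padded, categ_padded], training_labels_onehot,
          epochs=num_epochs, validation_split=0.1)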
The second approach is to concatenate all of your data before encoding it, and then treat everything as a more standard single-sequence problem after that. It is common to use a unique token to separate or define the different fields, similar to BOS and EOS for the beginning and end of the sequence.
It would look something like this:
XXBOS XXDESC This event will be fun. XXCATEG leisure XXLOC Seattle, WA XXEOS
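A hedged sketch of building such strings from the fields in your question (the XX* markers are just a convention, not a Keras requirement):
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def to_single_sequence(description, category, location):
    return "XXBOS XXDESC {} XXCATEG {} XXLOC {} XXEOS".format(description, category, location)

texts = [to_single_sequence(d, c, l)
         for d, c, l in zip(training_sentence, training_category, training_location)]
tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(texts)
padded = pad_sequences(tokenizer.texts_to_sequences(texts), padding='post')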
You can also do end tags for the fields like DESCXX, omit the BOS and EOS tokens, and generally mix and match however you want. You can even use this to combine some of your input sequences, but then use a multi-input model as above to merge the rest.
Speaking of mixing and matching, you also have the option to treat some of your inputs directly as an embedding. Low-cardinality fields like category and location do not need to be tokenized, and can be embedded directly without any need to split into tokens. That is, they don't need to be a sequence.
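A hedged sketch of that direct-embedding option for a low-cardinality field (reusing the hypothetical names from the earlier sketch):
categ_id = Input(shape=(1,))      # one integer id per example, no tokenization needed
encoded_categ = Flatten()(Embedding(categ_vocab_size, categ_embed_size)(categ_id))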
If you are looking for a reference, I enjoyed this paper on Large Scale Product Categorization using Structured and Unstructured Attributes. It tests all or most of the ideas I have just outlined, on real data at scale.

Finding closest related words using word2vec

My goal is to find the most relevant words given a set of keywords using word2vec. For example, if I have the set of words [girl, kite, beach], I would like word2vec to output relevant words such as: [flying, swimming, swimsuit...]
I understand that word2vec vectorizes a word based on the context of surrounding words. So what I did was use the following function:
most_similar_cosmul(['girl', 'kite', 'beach'])
However, it seems to give out words not very related to the set of keywords:
['charade', 0.30288437008857727]
['kinetic', 0.3002534508705139]
['shells', 0.29911646246910095]
['kites', 0.2987399995326996]
['7-9', 0.2962781488895416]
['showering', 0.2953910827636719]
['caribbean', 0.294752299785614]
['hide-and-go-seek', 0.2939240336418152]
['turbine', 0.2933803200721741]
['teenybopper', 0.29288050532341003]
['rock-paper-scissors', 0.2928623557090759]
['noisemaker', 0.2927709221839905]
['scuba-diving', 0.29180505871772766]
['yachting', 0.2907838821411133]
['cherub', 0.2905363440513611]
['swimmingpool', 0.290039986371994]
['coastline', 0.28998953104019165]
['Dinosaur', 0.2893030643463135]
['flip-flops', 0.28784963488578796]
['guardsman', 0.28728148341178894]
['frisbee', 0.28687697649002075]
['baltic', 0.28405341506004333]
['deprive', 0.28401875495910645]
['surfs', 0.2839275300502777]
['outwear', 0.28376665711402893]
['diverstiy', 0.28341981768608093]
['mid-air', 0.2829524278640747]
['kickboard', 0.28234976530075073]
['tanning', 0.281939834356308]
['admiration', 0.28123530745506287]
['Mediterranean', 0.281186580657959]
['cycles', 0.2807052433490753]
['teepee', 0.28070521354675293]
['progeny', 0.2775532305240631]
['starfish', 0.2775339186191559]
['romp', 0.27724218368530273]
['pebbles', 0.2771730124950409]
['waterpark', 0.27666303515434265]
['tarzan', 0.276429146528244]
['lighthouse', 0.2756190896034241]
['captain', 0.2755546569824219]
['popsicle', 0.2753356397151947]
['Pohoda', 0.2751699686050415]
['angelic', 0.27499720454216003]
['african-american', 0.27493417263031006]
['dam', 0.2747344970703125]
['aura', 0.2740659713745117]
['Caribbean', 0.2739778757095337]
['necking', 0.27346789836883545]
['sleight', 0.2733519673347473]
This is the code I used to train word2vec
# imports assumed by this script
import ast
import codecs
import csv
import logging
import multiprocessing
import os

import nltk
from gensim.models import word2vec as w2v


def train(data_filepath, epochs=300, num_features=300, min_word_count=2, context_size=7,
          downsampling=1e-3, seed=1, ckpt_filename=None):
    """
    Train word2vec model

    :param data_filepath: path of the data file in csv format
    :param epochs: number of times to train
    :param num_features: increase to improve generality, more computationally expensive to train
    :param min_word_count: minimum frequency of word. Words with lower frequency will not be included in training data
    :param context_size: context window length
    :param downsampling: reduce frequency for frequent keywords
    :param seed: make results reproducible for random generator. Same seed means, after training, the model produces the same results.
    :returns: path of the checkpoint after training
    """
    if ckpt_filename is None:
        data_base_filename = os.path.basename(data_filepath)
        data_filename = os.path.splitext(data_base_filename)[0]
        ckpt_filename = data_filename + ".wv.ckpt"

    num_workers = multiprocessing.cpu_count()
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    nltk.download("punkt")
    nltk.download("stopwords")

    print("Training %s ..." % data_filepath)
    sentences = _get_sentences(data_filepath)

    word2vec = w2v.Word2Vec(
        sg=1,
        seed=seed,
        workers=num_workers,
        size=num_features,
        min_count=min_word_count,
        window=context_size,
        sample=downsampling
    )
    word2vec.build_vocab(sentences)
    print("Word2vec vocab length: %d" % len(word2vec.wv.vocab))
    word2vec.train(sentences, total_examples=len(sentences), epochs=epochs)
    return _save_ckpt(word2vec, ckpt_filename)


def _save_ckpt(model, ckpt_filename):
    if not os.path.exists("checkpoints"):
        os.makedirs("checkpoints")
    ckpt_filepath = os.path.join("checkpoints", ckpt_filename)
    model.save(ckpt_filepath)
    return ckpt_filepath


def _get_sentences(data_filename):
    print("Found Data:")
    sentences = []
    print("Reading '{0}'...".format(data_filename))
    with codecs.open(data_filename, "r") as data_file:
        reader = csv.DictReader(data_file)
        for row in reader:
            sentences.append(ast.literal_eval(row["highscores"]))
    print("There are {0} sentences".format(len(sentences)))
    return sentences


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description='Train Word2vec model')
    parser.add_argument('data_filepath', help='path to training CSV file.')
    args = parser.parse_args()
    data_filepath = args.data_filepath
    train(data_filepath)
This is a sample of training data used for word2vec:
22751473,"[""lover"", ""sweetheart"", ""couple"", ""dietary"", ""meal""]"
28738542,"[""mallotus"", ""villosus"", ""shishamo"", ""smelt"", ""dried"", ""fish"", ""spirinchus"", ""lanceolatus""]"
25163686,"[""Snow"", ""Removal"", ""snow"", ""clearing"", ""female"", ""females"", ""woman"", ""women"", ""blower"", ""snowy"", ""road"", ""operate""]"
32837025,"[""milk"", ""breakfast"", ""drink"", ""cereal"", ""eating""]"
23828321,"[""jogging"", ""female"", ""females"", ""lady"", ""woman"", ""women"", ""running"", ""person""]"
22874156,"[""lover"", ""sweetheart"", ""heterosexual"", ""couple"", ""man"", ""and"", ""woman"", ""consulting"", ""hear"", ""listening""]
For prediction, I simply used the following function for a set of keywords:
most_similar_cosmul
I was wondering whether it is possible to find relevant keywords with word2vec. If it is not, then what machine learning model would be more suitable for this? Any insights would be very helpful.
When supplying multiple positive-word examples, like ['girl', 'kite', 'beach'], to most_similar()/most_similar_cosmul(), the vectors for those words are averaged together first, and then a list of words most similar to the average is returned. Those might not be as obviously related to any one of the words as the results of a simple single-word check would be. So:
When you try most_similar() (or most_similar_cosmul()) on a single word, what kind of results do you get? Are they words that seem related to the input word, in the way that you care about?
If not, you have deeper problems in your setup that should be fixed before trying a multi-word similarity.
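For example (word2vec being the model returned by your train() function):
# Quick sanity check on individual probe words.
for probe in ['girl', 'kite', 'beach']:
    print(probe, word2vec.wv.most_similar(positive=[probe], topn=5))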
Word2Vec gets its usual results from (1) lots of training data and (2) natural-language sentences. With enough data, a typical number of training epochs (and thus the default) is 5. You can sometimes somewhat make up for less data by using more epochs or a smaller vector size, but not always.
It's not clear how much data you have. Also, your example rows aren't real natural-language sentences – they appear to have had some other preprocessing/reordering applied. That may be hurting rather than helping.
Word-vectors often improve by throwing away more low-frequency words (increasing min_count above the default 5, rather than reducing it to 2). Low-frequency words don't have enough examples to get good vectors, and the few examples they have, even if repeated over many iterations, tend to be idiosyncratic examples of the words' usage, not the generalizable broad representations that you'd get from many varied examples. And by keeping these doomed-to-be-weak words in the training data, the training of other, more frequent words is interfered with. (When you get a word that you don't think belongs in a most-similar ranking, it may be a rare word that, given its few occurrence contexts, found its way to those coordinates as the least-bad location among plenty of other unhelpful coordinates.)
If you do get good results from single-word checks, but not from the average-of-multiple-words, the results might improve with more and better data, or adjusted training parameters – but to achieve that you'd need to more rigorously define what you consider good results. (Your existing list doesn't look that bad to me: it includes many words related to sun/sand/beach activities.)
On the other hand, your expectations of Word2Vec may be too high: it may not be that the average of ['girl', 'kite', 'beach'] is necessarily close to those desired words, compared to the individual words themselves, or that may only be achievable with lots of dataset/parameter tweaking.

Document topical distribution in Gensim LDA

I've derived an LDA topic model using a toy corpus as follows:
documents = ['Human machine interface for lab abc computer applications',
             'A survey of user opinion of computer system response time',
             'The EPS user interface management system',
             'System and human system engineering testing of EPS',
             'Relation of user perceived response time to error measurement',
             'The generation of random binary unordered trees',
             'The intersection graph of paths in trees',
             'Graph minors IV Widths of trees and well quasi ordering',
             'Graph minors A survey']

texts = [[word for word in document.lower().split()] for document in documents]
dictionary = corpora.Dictionary(texts)

id2word = {}
for word in dictionary.token2id:
    id2word[dictionary.token2id[word]] = word
I found that when I use a small number of topics to derive the model, Gensim yields a full report of topical distribution over all potential topics for a test document. E.g.:
test_lda = LdaModel(corpus,num_topics=5, id2word=id2word)
test_lda[dictionary.doc2bow('human system')]
Out[314]: [(0, 0.59751626959781134),
(1, 0.10001902477790173),
(2, 0.10001375856907335),
(3, 0.10005453508763221),
(4, 0.10239641196758137)]
However when I use a large number of topics, the report is no longer complete:
test_lda = LdaModel(corpus,num_topics=100, id2word=id2word)
test_lda[dictionary.doc2bow('human system')]
Out[315]: [(73, 0.50499999999997613)]
It seems to me that topics with a probability less than some threshold (0.01, to be more specific, from what I observed) are omitted from the output.
I'm wondering whether this behaviour is due to some aesthetic consideration. And how can I get the distribution of the residual probability mass over all the other topics?
Thank you for your kind answer!
I read the source, and it turns out that topics with probabilities smaller than a threshold are ignored. This threshold has a default value of 0.01.
I realise this is an old question but in case someone stumbles upon it, here is a solution (the issue has actually been fixed in the current development branch with a minimum_probability parameter to LdaModel but maybe you're running an older version of gensim).
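If your gensim already has it, the fix is a one-liner (mirroring the model from the question):
test_lda = LdaModel(corpus, num_topics=100, id2word=id2word, minimum_probability=0)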
Otherwise, define a new function (this is just copied from the source):
def get_doc_topics(lda, bow):
    gamma, _ = lda.inference([bow])
    topic_dist = gamma[0] / sum(gamma[0])  # normalize distribution
    return [(topicid, topicvalue) for topicid, topicvalue in enumerate(topic_dist)]
The above function does not filter the output topics based on probability, but outputs all of them. If you don't need the (topic_id, value) tuples but just the values, return topic_dist instead of the list comprehension (it'll be much faster as well).
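Usage would look something like this, reusing the dictionary and model from the question (note that doc2bow expects a list of tokens, so split the query first):
bow = dictionary.doc2bow('human system'.lower().split())
print(get_doc_topics(test_lda, bow))   # one (topic_id, probability) pair per topic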
