Document topical distribution in Gensim LDA - python

I've derived a LDA topic model using a toy corpus as follows:
documents = ['Human machine interface for lab abc computer applications',
'A survey of user opinion of computer system response time',
'The EPS user interface management system',
'System and human system engineering testing of EPS',
'Relation of user perceived response time to error measurement',
'The generation of random binary unordered trees',
'The intersection graph of paths in trees',
'Graph minors IV Widths of trees and well quasi ordering',
'Graph minors A survey']
texts = [[word for word in document.lower().split()] for document in documents]
dictionary = corpora.Dictionary(texts)
id2word = {}
for word in dictionary.token2id:
id2word[dictionary.token2id[word]] = word
I found that when I use a small number of topics to derive the model, Gensim yields a full report of topical distribution over all potential topics for a test document. E.g.:
test_lda = LdaModel(corpus,num_topics=5, id2word=id2word)
test_lda[dictionary.doc2bow('human system')]
Out[314]: [(0, 0.59751626959781134),
(1, 0.10001902477790173),
(2, 0.10001375856907335),
(3, 0.10005453508763221),
(4, 0.10239641196758137)]
However when I use a large number of topics, the report is no longer complete:
test_lda = LdaModel(corpus,num_topics=100, id2word=id2word)
test_lda[dictionary.doc2bow('human system')]
Out[315]: [(73, 0.50499999999997613)]
It seems to me that topics with a probability less than some threshold (I observed 0.01 to be more specific) are omitted form the output.
I'm wondering if this behaviour is due to some aesthetic considerations? And how can I get the distribution of the probability mass residual over all other topics?
Thank you for your kind answer!

Read the source and it turns out that topics with probabilities smaller than a threshold are ignored. This threshold is with a default value of 0.01.

I realise this is an old question but in case someone stumbles upon it, here is a solution (the issue has actually been fixed in the current development branch with a minimum_probability parameter to LdaModel but maybe you're running an older version of gensim).
define a new function (this is just copied from the source)
def get_doc_topics(lda, bow):
gamma, _ = lda.inference([bow])
topic_dist = gamma[0] / sum(gamma[0]) # normalize distribution
return [(topicid, topicvalue) for topicid, topicvalue in enumerate(topic_dist)]
the above function does not filter the output topics based on the probability but will output all of them. If you don't need the (topic_id, value) tuples but just values, just return the topic_dist instead of the list comprehension (it'll be much faster as well).

Related

How can I find the probability of a sentence using GPT-2?

I'm trying to write a program that, given a list of sentences, returns the most probable one. I want to use GPT-2, but I am quite new to using it (as in I don't really know how to do it). I'm planning on finding the probability of a word given the previous words and multiplying all the probabilities together to get the overall probability of that sentence occurring, however I don't know how to find the probability of a word occurring given the previous words. This is my (psuedo) code:
sentences = # my list of sentences
max_prob = 0
best_sentence = sentences[0]
for sentence in sentences:
prob = 1 #probability of that sentence
for idx, word in enumerate(sentence.split()[1:]):
prob *= probability(word, " ".join(sentence[:idx])) # this is where I need help
if prob > max_prob:
max_prob = prob
best_sentence = sentence
print(best_sentence)
Can I have some help please?
You can also try lm-scorer, a tiny wrapper around transformers that allows you to get sentences probabilities using models that support it (only GPT2 models are implemented at the time of writing).
https://github.com/simonepri/lm-scorer
I just used it myself and works perfectly.
Warning: If you use other transformers / pipelines in the same environment, things may get messy.
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import numpy as np
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
def score(tokens_tensor):
loss=model(tokens_tensor, labels=tokens_tensor)[0]
return np.exp(loss.cpu().detach().numpy())
texts = ['i would like to thank you mr chairman', 'i would liking to thanks you mr chair in', 'thnks chair' ]
for text in texts:
tokens_tensor = tokenizer.encode( text, add_special_tokens=False, return_tensors="pt")
print (text, score(tokens_tensor))
This code snippet could be an example of what are you looking for. You feed the model with a list of sentences, and it scores each whereas the lowest the better.
The output of the code above is:
i would like to thank you mr chairman 122.3066
i would liking to thanks you mr chair in 1183.7637
thnks chair 14135.129
I wrote a set of functions that can do precisely what you're looking for. Recall that GPT-2 parses its input into tokens (not words): the last word in 'Joe flicked the grasshopper' is actually three tokens: ' grass', 'ho', and 'pper'. The cloze_finalword function takes this into account, and computes the probabilities of all tokens (conditioned on the tokens appearing before them). You can adapt part of this function so that it returns what you're looking for. I hope you find the code useful!
I think GPT-2 is a bit overkill for what you're trying to achieve. You can build a basic language model which will give you sentence probability using NLTK. A tutorial for this can be found here.

How to group wikipedia categories in python?

For each concept of my dataset I have stored the corresponding wikipedia categories. For example, consider the following 5 concepts and their corresponding wikipedia categories.
hypertriglyceridemia: ['Category:Lipid metabolism disorders', 'Category:Medical conditions related to obesity']
enzyme inhibitor: ['Category:Enzyme inhibitors', 'Category:Medicinal chemistry', 'Category:Metabolism']
bypass surgery: ['Category:Surgery stubs', 'Category:Surgical procedures and techniques']
perth: ['Category:1829 establishments in Australia', 'Category:Australian capital cities', 'Category:Metropolitan areas of Australia', 'Category:Perth, Western Australia', 'Category:Populated places established in 1829']
climate: ['Category:Climate', 'Category:Climatology', 'Category:Meteorological concepts']
As you can see, the first three concepts belong to medical domain (whereas the remaining two terms are not medical terms).
More precisely, I want to divide my concepts as medical and non-medical. However, it is very difficult to divide the concepts using the categories alone. For example, even though the two concepts enzyme inhibitor and bypass surgery are in medical domain, their categories are very different to each other.
Therefore, I would like to know if there is a way to obtain the parent category of the categories (for example, the categories of enzyme inhibitor and bypass surgery belong to medical parent category)
I am currently using pymediawiki and pywikibot. However, I am not restricted to only those two libraries and happy to have solutions using other libraries as well.
EDIT
As suggested by #IlmariKaronen I am also using the categories of categories and the results I got is as follows (The small font near the category is the categories of the category).
However, I still could not find a way to use these category details to decide if a given term is a medical or non-medical.
Moreover, as pointed by #IlmariKaronen using Wikiproject details can be potential. However, it seems like the Medicine wikiproject do not seem to have all the medical terms. Therefore we also need to check other wikiprojects as well.
EDIT:
My current code of extracting categories from wikipedia concepts is as follows. This could be done using pywikibot or pymediawiki as follows.
Using the librarary pymediawiki
import mediawiki as pw
p = wikipedia.page('enzyme inhibitor')
print(p.categories)
Using the library pywikibot
import pywikibot as pw
site = pw.Site('en', 'wikipedia')
print([
cat.title()
for cat in pw.Page(site, 'support-vector machine').categories()
if 'hidden' not in cat.categoryinfo
])
The categories of categories can also be done in the same way as shown in the answer by #IlmariKaronen.
If you are looking for longer list of concepts for testing I have mentioned more examples below.
['juvenile chronic arthritis', 'climate', 'alexidine', 'mouthrinse', 'sialosis', 'australia', 'artificial neural network', 'ricinoleic acid', 'bromosulfophthalein', 'myelosclerosis', 'hydrochloride salt', 'cycasin', 'aldosterone antagonist', 'fungal growth', 'describe', 'liver resection', 'coffee table', 'natural language processing', 'infratemporal fossa', 'social withdrawal', 'information retrieval', 'monday', 'menthol', 'overturn', 'prevailing', 'spline function', 'acinic cell carcinoma', 'furth', 'hepatic protein', 'blistering', 'prefixation', 'january', 'cardiopulmonary receptor', 'extracorporeal membrane oxygenation', 'clinodactyly', 'melancholic', 'chlorpromazine hydrochloride', 'level of evidence', 'washington state', 'cat', 'newyork', 'year elevan', 'trituration', 'gold alloy', 'hexoprenaline', 'second molar', 'novice', 'oxygen radical', 'subscription', 'ordinate', 'approximal', 'spongiosis', 'ribothymidine', 'body of evidence', 'vpb', 'porins', 'musculocutaneous']
For a very long list please check the link below. https://docs.google.com/document/d/1BYllMyDlw-Rb4uMh89VjLml2Bl9Y7oUlopM-Z4F6pN0/edit?usp=sharing
NOTE: I am not expecting the solution to work 100% (if the proposed algorithm is able to detect many of the medical concepts that is enough for me)
I am happy to provide more details if needed.
Solution Overview
Okay, I would approach the problem from multiple directions. There are some great suggestions here and if I were you I would use an ensemble of those approaches (majority voting, predicting label which is agreed upon by more than 50% of classifiers in your binary case).
I'm thinking about following approaches:
Active learning (example approach provided by me below)
MediaWiki backlinks provided as an answer by #TavoGC
SPARQL ancestral categories provided as a comment to your question by #Stanislav Kralin and/or parent categories provided by #Meena Nagarajan (those two could be an ensemble on their own based on their differences, but for that you would have to contact both creators and compare their results).
This way 2 out of three would have to agree a certain concept is a medical one, which minimizes chance of an error further.
While we're at it I would argue against approach presented by #ananand_v.singh in this answer, because:
distance metric should not be euclidean, cosine similarity is much better metric (used by, e.g. spaCy) as it does not take into account magnitude of the vectors (and it shouldn't, that's how word2vec or GloVe were trained)
many artificial clusters would be created if I understood correctly, while we only need two: medicine and non-medicine one. Furthermore, centroid of medicine is not centered on the medicine itself. This poses additional problems, say centroid is moved far away from the medicine and other words like, say, computer or human (or any other not-fitting in your opinion into medicine) might get into the cluster.
it's hard to evaluate results, even more so, the matter is strictly subjective. Furthermore word vectors are hard to visualize and understand (casting them into lower dimensions [2D/3D] using PCA/TSNE/similar for so many words, would give us totally non-sensical results [yeah, I have tried to do it, PCA gets around 5% explained variance for your longer dataset, really, really low]).
Based on the problems highlighted above I have come up with solution using active learning, which is pretty forgotten approach to such problems.
Active Learning approach
In this subset of machine learning, when we have a hard time coming up with an exact algorithm (like what does it mean for a term to be a part of medical category), we ask human "expert" (doesn't actually have to be expert) to provide some answers.
Knowledge encoding
As anand_v.singh pointed out, word vectors are one of the most promising approach and I will use it here as well (differently though, and IMO in a much cleaner and easier fashion).
I'm not going to repeat his points in my answer, so I will add my two cents:
Do not use contextualized word-embeddings as currently available state of the art (e.g. BERT)
Check how many of your concepts have no representation (e.g. is represented as a vector of zeros). It should be checked (and is checked in my code,, there will be further discussion when the time comes) and you may use the embedding which has most of them present.
Measuring similarity using spaCy
This class measures similarity between medicine encoded as spaCy's GloVe word vector and every other concept.
class Similarity:
def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
# In our case it will be medicine
self.centroid = centroid
# spaCy's Language model (english), which will be used to return similarity to
# centroid of each concept
self.nlp = nlp
self.n_threads: int = n_threads
self.batch_size: int = batch_size
self.missing: typing.List[int] = []
def __call__(self, concepts):
concepts_similarity = []
# nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
for i, concept in enumerate(
self.nlp.pipe(
concepts, n_threads=self.n_threads, batch_size=self.batch_size
)
):
if concept.has_vector:
concepts_similarity.append(self.centroid.similarity(concept))
else:
# If document has no vector, it's assumed to be totally dissimilar to centroid
concepts_similarity.append(-1)
self.missing.append(i)
return np.array(concepts_similarity)
This code will return a number for each concept measuring how similar it is to centroid. Furthermore, it records indices of concepts missing their representation. It might be called like this:
import json
import typing
import numpy as np
import spacy
nlp = spacy.load("en_vectors_web_lg")
centroid = nlp("medicine")
concepts = json.load(open("concepts_new.txt"))
concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
concepts
)
You may substitute you data in place of new_concepts.json.
Look at spacy.load and notice I have used en_vectors_web_lg. It consists of 685.000 unique word vectors (which is a lot), and may work out of the box for your case. You have to download it separately after installing spaCy, more info provided in the links above.
Additionally you may want to use multiple centroid words, e.g. add words like disease or health and average their word vectors. I'm not sure whether that would affect positively your case though.
Other possibility might be to use multiple centroids and calculate similiarity between each concept and multiple of centroids. We may have a few thresholds in such case, this is likely to remove some false positives, but may miss some terms which one could consider to be similar to medicine. Furthermore it would complicate the case much more, but if your results are unsatisfactory you should consider two options above (and only if those are, don't jump into this approach without previous thought).
Now, we have a rough measure of concept's similarity. But what does it mean that a certain concept has 0.1 positive similarity to medicine? Is it a concept one should classify as medical? Or maybe that's too far away already?
Asking expert
To get a threshold (below it terms will be considered non medical), it's easiest to ask a human to classify some of the concepts for us (and that's what active learning is about). Yeah, I know it's a really simple form of active learning, but I would consider it such anyway.
I have written a class with sklearn-like interface asking human to classify concepts until optimal threshold (or maximum number of iterations) is reached.
class ActiveLearner:
def __init__(
self,
concepts,
concepts_similarity,
max_steps: int,
samples: int,
step: float = 0.05,
change_multiplier: float = 0.7,
):
sorting_indices = np.argsort(-concepts_similarity)
self.concepts = concepts[sorting_indices]
self.concepts_similarity = concepts_similarity[sorting_indices]
self.max_steps: int = max_steps
self.samples: int = samples
self.step: float = step
self.change_multiplier: float = change_multiplier
# We don't have to ask experts for the same concepts
self._checked_concepts: typing.Set[int] = set()
# Minimum similarity between vectors is -1
self._min_threshold: float = -1
# Maximum similarity between vectors is 1
self._max_threshold: float = 1
# Let's start from the highest similarity to ensure minimum amount of steps
self.threshold_: float = 1
samples argument describes how many examples will be shown to an expert during each iteration (it is the maximum, it will return less if samples were already asked for or there is not enough of them to show).
step represents the drop of threshold (we start at 1 meaning perfect similarity) in each iteration.
change_multiplier - if an expert answers concepts are not related (or mostly unrelated, as multiple of them are returned), step is multiplied by this floating point number. It is used to pinpoint exact threshold between step changes at each iteration.
concepts are sorted based on their similarity (the more similar a concept is, the higher)
Function below asks expert for an opinion and find optimal threshold based on his answers.
def _ask_expert(self, available_concepts_indices):
# Get random concepts (the ones above the threshold)
concepts_to_show = set(
np.random.choice(
available_concepts_indices, len(available_concepts_indices)
).tolist()
)
# Remove those already presented to an expert
concepts_to_show = concepts_to_show - self._checked_concepts
self._checked_concepts.update(concepts_to_show)
# Print message for an expert and concepts to be classified
if concepts_to_show:
print("\nAre those concepts related to medicine?\n")
print(
"\n".join(
f"{i}. {concept}"
for i, concept in enumerate(
self.concepts[list(concepts_to_show)[: self.samples]]
)
),
"\n",
)
return input("[y]es / [n]o / [any]quit ")
return "y"
Example question looks like this:
Are those concepts related to medicine?
0. anesthetic drug
1. child and adolescent psychiatry
2. tertiary care center
3. sex therapy
4. drug design
5. pain disorder
6. psychiatric rehabilitation
7. combined oral contraceptive
8. family practitioner committee
9. cancer family syndrome
10. social psychology
11. drug sale
12. blood system
[y]es / [n]o / [any]quit y
... parsing an answer from expert:
# True - keep asking, False - stop the algorithm
def _parse_expert_decision(self, decision) -> bool:
if decision.lower() == "y":
# You can't go higher as current threshold is related to medicine
self._max_threshold = self.threshold_
if self.threshold_ - self.step < self._min_threshold:
return False
# Lower the threshold
self.threshold_ -= self.step
return True
if decision.lower() == "n":
# You can't got lower than this, as current threshold is not related to medicine already
self._min_threshold = self.threshold_
# Multiply threshold to pinpoint exact spot
self.step *= self.change_multiplier
if self.threshold_ + self.step < self._max_threshold:
return False
# Lower the threshold
self.threshold_ += self.step
return True
return False
And finally whole code code of ActiveLearner, which finds optimal threshold of similiarity accordingly to expert:
class ActiveLearner:
def __init__(
self,
concepts,
concepts_similarity,
samples: int,
max_steps: int,
step: float = 0.05,
change_multiplier: float = 0.7,
):
sorting_indices = np.argsort(-concepts_similarity)
self.concepts = concepts[sorting_indices]
self.concepts_similarity = concepts_similarity[sorting_indices]
self.samples: int = samples
self.max_steps: int = max_steps
self.step: float = step
self.change_multiplier: float = change_multiplier
# We don't have to ask experts for the same concepts
self._checked_concepts: typing.Set[int] = set()
# Minimum similarity between vectors is -1
self._min_threshold: float = -1
# Maximum similarity between vectors is 1
self._max_threshold: float = 1
# Let's start from the highest similarity to ensure minimum amount of steps
self.threshold_: float = 1
def _ask_expert(self, available_concepts_indices):
# Get random concepts (the ones above the threshold)
concepts_to_show = set(
np.random.choice(
available_concepts_indices, len(available_concepts_indices)
).tolist()
)
# Remove those already presented to an expert
concepts_to_show = concepts_to_show - self._checked_concepts
self._checked_concepts.update(concepts_to_show)
# Print message for an expert and concepts to be classified
if concepts_to_show:
print("\nAre those concepts related to medicine?\n")
print(
"\n".join(
f"{i}. {concept}"
for i, concept in enumerate(
self.concepts[list(concepts_to_show)[: self.samples]]
)
),
"\n",
)
return input("[y]es / [n]o / [any]quit ")
return "y"
# True - keep asking, False - stop the algorithm
def _parse_expert_decision(self, decision) -> bool:
if decision.lower() == "y":
# You can't go higher as current threshold is related to medicine
self._max_threshold = self.threshold_
if self.threshold_ - self.step < self._min_threshold:
return False
# Lower the threshold
self.threshold_ -= self.step
return True
if decision.lower() == "n":
# You can't got lower than this, as current threshold is not related to medicine already
self._min_threshold = self.threshold_
# Multiply threshold to pinpoint exact spot
self.step *= self.change_multiplier
if self.threshold_ + self.step < self._max_threshold:
return False
# Lower the threshold
self.threshold_ += self.step
return True
return False
def fit(self):
for _ in range(self.max_steps):
available_concepts_indices = np.nonzero(
self.concepts_similarity >= self.threshold_
)[0]
if available_concepts_indices.size != 0:
decision = self._ask_expert(available_concepts_indices)
if not self._parse_expert_decision(decision):
break
else:
self.threshold_ -= self.step
return self
All in all, you would have to answer some questions manually but this approach is way more accurate in my opinion.
Furthermore, you don't have to go through all of the samples, just a small subset of it. You may decide how many samples constitute a medical term (whether 40 medical samples and 10 non-medical samples shown, should still be considered medical?), which let's you fine-tune this approach to your preferences. If there is an outlier (say, 1 sample out of 50 is non-medical), I would consider the threshold to still be valid.
Once again: This approach should be mixed with others in order to minimalize the chance for wrong classification.
Classifier
When we obtain the threshold from expert, classification would be instantenous, here is a simple class for classification:
class Classifier:
def __init__(self, centroid, threshold: float):
self.centroid = centroid
self.threshold: float = threshold
def predict(self, concepts_pipe):
predictions = []
for concept in concepts_pipe:
predictions.append(self.centroid.similarity(concept) > self.threshold)
return predictions
And for brevity, here is the final source code:
import json
import typing
import numpy as np
import spacy
class Similarity:
def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
# In our case it will be medicine
self.centroid = centroid
# spaCy's Language model (english), which will be used to return similarity to
# centroid of each concept
self.nlp = nlp
self.n_threads: int = n_threads
self.batch_size: int = batch_size
self.missing: typing.List[int] = []
def __call__(self, concepts):
concepts_similarity = []
# nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
for i, concept in enumerate(
self.nlp.pipe(
concepts, n_threads=self.n_threads, batch_size=self.batch_size
)
):
if concept.has_vector:
concepts_similarity.append(self.centroid.similarity(concept))
else:
# If document has no vector, it's assumed to be totally dissimilar to centroid
concepts_similarity.append(-1)
self.missing.append(i)
return np.array(concepts_similarity)
class ActiveLearner:
def __init__(
self,
concepts,
concepts_similarity,
samples: int,
max_steps: int,
step: float = 0.05,
change_multiplier: float = 0.7,
):
sorting_indices = np.argsort(-concepts_similarity)
self.concepts = concepts[sorting_indices]
self.concepts_similarity = concepts_similarity[sorting_indices]
self.samples: int = samples
self.max_steps: int = max_steps
self.step: float = step
self.change_multiplier: float = change_multiplier
# We don't have to ask experts for the same concepts
self._checked_concepts: typing.Set[int] = set()
# Minimum similarity between vectors is -1
self._min_threshold: float = -1
# Maximum similarity between vectors is 1
self._max_threshold: float = 1
# Let's start from the highest similarity to ensure minimum amount of steps
self.threshold_: float = 1
def _ask_expert(self, available_concepts_indices):
# Get random concepts (the ones above the threshold)
concepts_to_show = set(
np.random.choice(
available_concepts_indices, len(available_concepts_indices)
).tolist()
)
# Remove those already presented to an expert
concepts_to_show = concepts_to_show - self._checked_concepts
self._checked_concepts.update(concepts_to_show)
# Print message for an expert and concepts to be classified
if concepts_to_show:
print("\nAre those concepts related to medicine?\n")
print(
"\n".join(
f"{i}. {concept}"
for i, concept in enumerate(
self.concepts[list(concepts_to_show)[: self.samples]]
)
),
"\n",
)
return input("[y]es / [n]o / [any]quit ")
return "y"
# True - keep asking, False - stop the algorithm
def _parse_expert_decision(self, decision) -> bool:
if decision.lower() == "y":
# You can't go higher as current threshold is related to medicine
self._max_threshold = self.threshold_
if self.threshold_ - self.step < self._min_threshold:
return False
# Lower the threshold
self.threshold_ -= self.step
return True
if decision.lower() == "n":
# You can't got lower than this, as current threshold is not related to medicine already
self._min_threshold = self.threshold_
# Multiply threshold to pinpoint exact spot
self.step *= self.change_multiplier
if self.threshold_ + self.step < self._max_threshold:
return False
# Lower the threshold
self.threshold_ += self.step
return True
return False
def fit(self):
for _ in range(self.max_steps):
available_concepts_indices = np.nonzero(
self.concepts_similarity >= self.threshold_
)[0]
if available_concepts_indices.size != 0:
decision = self._ask_expert(available_concepts_indices)
if not self._parse_expert_decision(decision):
break
else:
self.threshold_ -= self.step
return self
class Classifier:
def __init__(self, centroid, threshold: float):
self.centroid = centroid
self.threshold: float = threshold
def predict(self, concepts_pipe):
predictions = []
for concept in concepts_pipe:
predictions.append(self.centroid.similarity(concept) > self.threshold)
return predictions
if __name__ == "__main__":
nlp = spacy.load("en_vectors_web_lg")
centroid = nlp("medicine")
concepts = json.load(open("concepts_new.txt"))
concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
concepts
)
learner = ActiveLearner(
np.array(concepts), concepts_similarity, samples=20, max_steps=50
).fit()
print(f"Found threshold {learner.threshold_}\n")
classifier = Classifier(centroid, learner.threshold_)
pipe = nlp.pipe(concepts, n_threads=-1, batch_size=4096)
predictions = classifier.predict(pipe)
print(
"\n".join(
f"{concept}: {label}"
for concept, label in zip(concepts[20:40], predictions[20:40])
)
)
After answering some questions, with threshold 0.1 (everything between [-1, 0.1) is considered non-medical, while [0.1, 1] is considered medical) I got the following results:
kartagener s syndrome: True
summer season: True
taq: False
atypical neuroleptic: True
anterior cingulate: False
acute respiratory distress syndrome: True
circularity: False
mutase: False
adrenergic blocking drug: True
systematic desensitization: True
the turning point: True
9l: False
pyridazine: False
bisoprolol: False
trq: False
propylhexedrine: False
type 18: True
darpp 32: False
rickettsia conorii: False
sport shoe: True
As you can see this approach is far from perfect, so the last section described possible improvements:
Possible improvements
As mentioned in the beginning using my approach mixed with other answers would probably leave out ideas like sport shoe belonging to medicine out and active learning approach would be more of a decisive vote in case of a draw between two heuristics mentioned above.
We could create an active learning ensemble as well. Instead of one threshold, say 0.1, we would use multiple of them (either increasing or decreasing), let's say those are 0.1, 0.2, 0.3, 0.4, 0.5.
Let's say sport shoe gets, for each threshold it's respective True/False like this:
True True False False False,
Making a majority voting we would mark it non-medical by 3 out of 2 votes. Furthermore, too strict threshold would me mitigated as well if thresholds below it out-vote it (case if True/False would look like this: True True True False False).
Final possible improvement I came up with: In the code above I'm using Doc vector, which is a mean of word vectors creating the concept. Say one word is missing (vectors consisting of zeros), in such case, it would be pushed further away from medicine centroid. You may not want that (as some niche medical terms [abbreviations like gpv or others] might be missing their representation), in such case you could average only those vectors which are different from zero.
I know this post is quite lengthy, so if you have any questions post them below.
"Therefore, I would like to know if there is a way to obtain the parent category of the categories (for example, the categories of enzyme inhibitor and bypass surgery belong to medical parent category)"
MediaWiki categories are themselves wiki pages. A "parent category" is just a category which the "child" category page belongs to. So you can get the parent categories of a category in exactly the same way as you'd obtain the categories of any other wiki page.
For example, using pymediawiki:
p = wikipedia.page('Category:Enzyme inhibitors')
parents = p.categories
You could try to classify the wikipedia categories by the mediawiki links and backlinks returned for each category
import re
from mediawiki import MediaWiki
#TermFind will search through a list a given term
def TermFind(term,termList):
responce=False
for val in termList:
if re.match('(.*)'+term+'(.*)',val):
responce=True
break
return responce
#Find if the links and backlinks lists contains a given term
def BoundedTerm(wikiPage,term):
aList=wikiPage.links
bList=wikiPage.backlinks
responce=False
if TermFind(term,aList)==True and TermFind(term,bList)==True:
responce=True
return responce
container=[]
wikipedia = MediaWiki()
for val in termlist:
cpage=wikipedia.page(val)
if BoundedTerm(cpage,'term')==True:
container.append('medical')
else:
container.append('nonmedical')
The idea is to try to guess a term that is shared by most of the categories, I try biology, medicine and disease with good results. Perhaps you can try to use mulpile calls of BoundedTerms to make the clasification, or a single call for multiple terms and combine the result for the classification. Hope it helps
There is a concept of word Vectors in NLP, what it basically does is by looking through mass volumes of text, it tries to convert words to multi-dimensional vectors and then lesser the distance between those vectors, greater the similarity between them, the good thing is that many people have already generated this word vectors and made them available under very permissive licences, and in your case you are working with Wikipedia and there exist word vectors for them here http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Now these would be the most suited for this task since they contain most words from Wikipedia's corpora, but in case they are not suited for you, or are removed in the future you can use one from I will list below more of these, with that said, there is a better way to do this, i.e. by passing them to tensorflow's universal language model embed module in which you don't have to do most of the heavy lifting, you can read more about that here. The reason I put it after the Wikipedia text dump is because I have heard people say that they are a bit hard to work with when working with medical samples. This paper does propose a solution to tackle that but I have never tried that so I cannot be sure of it's accuracies.
Now how you can use the word embeddings from tensorflow is simple, just do
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
embeddings = embed(["Input Text here as"," List of strings"])
session.run(embeddings)
Since you might not be familiar with tensorflow and trying to run just this piece of code you might run into some troubles, Follow this link where they have mentioned completely how to use this and from there you should be able to easily modify this to your needs.
With that said I would recommend first checking out he tensorlfow's embed module and their pre-trained word embedding's, if they don't work for you check out the Wikimedia link, if that also doesn't work then proceed to the concepts of the paper I have linked. Since this answer is describing an NLP approach, it will not be 100% accurate, so keep that in mind before you proceed.
Glove Vectors https://nlp.stanford.edu/projects/glove/
Facebook's fast text: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
Or this http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz
If you run into problems implementing this after following the colab tutorial add your problem to the question and comment below, from there we can proceed further.
Edit Added code to cluster topics
Brief, Rather than using words vector, I am encoding their summary sentences
file content.py
def AllTopics():
topics = []# list all your topics, not added here for space restricitons
for i in range(len(topics)-1):
yield topics[i]
File summaryGenerator.py
import wikipedia
import pickle
from content import Alltopics
summary = []
failed = []
for topic in Alltopics():
try:
summary.append(wikipedia.summary(tuple((topic,str(topic)))))
except Exception as e:
failed.append(tuple((topic,e)))
with open("summary.txt", "wb") as fp:
pickle.dump(summary , fp)
with open('failed.txt', 'wb') as fp:
pickle.dump('failed', fp)
File SimilartiyCalculator.py
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import os
import pandas as pd
import re
import pickle
import sys
from sklearn.cluster import AgglomerativeClustering
from sklearn import metrics
from scipy.cluster import hierarchy
from scipy.spatial import distance_matrix
try:
with open("summary.txt", "rb") as fp: # Unpickling
summary = pickle.load(fp)
except Exception as e:
print ('Cannot load the summary file, Please make sure that it exists, if not run Summary Generator first', e)
sys.exit('Read the error message')
module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3"
embed = hub.Module(module_url)
tf.logging.set_verbosity(tf.logging.ERROR)
messages = [x[1] for x in summary]
labels = [x[0] for x in summary]
with tf.Session() as session:
session.run([tf.global_variables_initializer(), tf.tables_initializer()])
message_embeddings = session.run(embed(messages)) # In message embeddings each vector is a second (1,512 vector) and is numpy.ndarray (noOfElemnts, 512)
X = message_embeddings
agl = AgglomerativeClustering(n_clusters=5, affinity='euclidean', memory=None, connectivity=None, compute_full_tree='auto', linkage='ward', pooling_func='deprecated')
agl.fit(X)
dist_matrix = distance_matrix(X,X)
Z = hierarchy.linkage(dist_matrix, 'complete')
dendro = hierarchy.dendrogram(Z)
cluster_labels = agl.labels_
This is also hosted on GitHub at https://github.com/anandvsingh/WikipediaSimilarity Where you can find the similarity.txt file, and other files, In my case I couldn't run it on all the topics, but I would urge you to run it on the full list of topics (Directly clone the repository and run SummaryGenerator.py), and upload the similarity.txt via a pull request in case you don't get expected result. And if possible also upload the message_embeddings in a csv file as topics and there embeddings.
Changes after edit 2
Switched the similarityGenerator to a hierarchy based clustering(Agglomerative) I would suggest you to keep the title names at the bottom of the dendrogram and for that look at the definition of dendrogram here, I verified viewing some samples and the results look quite good, you can change the n_clusters value to fine tune your model. Note: This requires you to run summary generator again. I think you should be able to take it from here, what you have to do is try a few values of n_cluster and see in which all medical terms are grouped together, then find the cluster_label for that cluster and you are done. Since here we group by summary, the clusters will be more accurate. If you run into any problems or don't understand something, comment below.
The wikipedia library is also a good bet to extract the categories from a given page, as wikipedia.WikipediaPage(page).categories returns a simple list. The library also lets you search multiple pages should they all have the same title.
In medicine there seems to be a lot of key roots and suffixes, so the approach of finding key words may be a good approach to finding medical terms.
import wikipedia
def categorySorter(targetCats, pagesToCheck, mainCategory):
targetList = []
nonTargetList = []
targetCats = [i.lower() for i in targetCats]
print('Sorting pages...')
print('Sorted:', end=' ', flush=True)
for page in pagesToCheck:
e = openPage(page)
def deepList(l):
for item in l:
if item[1] == 'SUBPAGE_ID':
deepList(item[2])
else:
catComparator(item[0], item[1], targetCats, targetList, nonTargetList, pagesToCheck[-1])
if e[1] == 'SUBPAGE_ID':
deepList(e[2])
else:
catComparator(e[0], e[1], targetCats, targetList, nonTargetList, pagesToCheck[-1])
print()
print()
print('Results:')
print(mainCategory, ': ', targetList, sep='')
print()
print('Non-', mainCategory, ': ', nonTargetList, sep='')
def openPage(page):
try:
pageList = [page, wikipedia.WikipediaPage(page).categories]
except wikipedia.exceptions.PageError as p:
pageList = [page, 'NONEXIST_ID']
return
except wikipedia.exceptions.DisambiguationError as e:
pageCategories = []
for i in e.options:
if '(disambiguation)' not in i:
pageCategories.append(openPage(i))
pageList = [page, 'SUBPAGE_ID', pageCategories]
return pageList
finally:
return pageList
def catComparator(pageTitle, pageCategories, targetCats, targetList, nonTargetList, lastPage):
# unhash to view the categories of each page
#print(pageCategories)
pageCategories = [i.lower() for i in pageCategories]
any_in = False
for i in targetCats:
if i in pageTitle:
any_in = True
if any_in:
print('', end = '', flush=True)
elif compareLists(targetCats, pageCategories):
any_in = True
if any_in:
targetList.append(pageTitle)
else:
nonTargetList.append(pageTitle)
# Just prints a pretty list, you can comment out until next hash if desired
if any_in:
print(pageTitle, '(T)', end='', flush=True)
else:
print(pageTitle, '(F)',end='', flush=True)
if pageTitle != lastPage:
print(',', end=' ')
# No more commenting
return any_in
def compareLists (a, b):
for i in a:
for j in b:
if i in j:
return True
return False
The code is really just comparing a lists of key words and suffixes to the titles of each page as well as their categories to determine if a page is medically related. It also looks at related pages/sub pages for the bigger topics, and determines if those are related as well. I am not well versed in my medicine so forgive the categories but here is an example to tag onto the bottom:
medicalCategories = ['surgery', 'medic', 'disease', 'drugs', 'virus', 'bact', 'fung', 'pharma', 'cardio', 'pulmo', 'sensory', 'nerv', 'derma', 'protein', 'amino', 'unii', 'chlor', 'carcino', 'oxi', 'oxy', 'sis', 'disorder', 'enzyme', 'eine', 'sulf']
listOfPages = ['juvenile chronic arthritis', 'climate', 'alexidine', 'mouthrinse', 'sialosis', 'australia', 'artificial neural network', 'ricinoleic acid', 'bromosulfophthalein', 'myelosclerosis', 'hydrochloride salt', 'cycasin', 'aldosterone antagonist', 'fungal growth', 'describe', 'liver resection', 'coffee table', 'natural language processing', 'infratemporal fossa', 'social withdrawal', 'information retrieval', 'monday', 'menthol', 'overturn', 'prevailing', 'spline function', 'acinic cell carcinoma', 'furth', 'hepatic protein', 'blistering', 'prefixation', 'january', 'cardiopulmonary receptor', 'extracorporeal membrane oxygenation', 'clinodactyly', 'melancholic', 'chlorpromazine hydrochloride', 'level of evidence', 'washington state', 'cat', 'year elevan', 'trituration', 'gold alloy', 'hexoprenaline', 'second molar', 'novice', 'oxygen radical', 'subscription', 'ordinate', 'approximal', 'spongiosis', 'ribothymidine', 'body of evidence', 'vpb', 'porins', 'musculocutaneous']
categorySorter(medicalCategories, listOfPages, 'Medical')
This example list gets ~70% of what should be on the list, at least to my knowledge.
The question appears a little unclear to me and does not seem like a straightforward problem to solve and may require some NLP model. Also,the words concept and categories are interchangeably used. What I understand is that the concepts such as enzyme inhibitor, bypass surgery and hypertriglyceridimia need to be combined together as medical and the rest as non medical. This problem will require more data than just the category names. A corpus is required to train an LDA model(for instance) where the entire text information is fed to the algorithm and it returns the most likely topics for each of the concepts.
https://www.analyticsvidhya.com/blog/2018/10/stepwise-guide-topic-modeling-latent-semantic-analysis/

How to plot a document topic distribution in structural topic modeling R-package?

If I am using python Sklearn for LDA topic modeling, I can use the transform function to get a "document topic distribution" of the LDA-results like here:
document_topic_distribution = lda_model.transform(document_term_matrix)
Now I tried also the R structural topic models (stm) package and i want get the same. Is there any function in the stm package, which can produce the same thing (document topic distribution)?
I have the stm-object created as follows:
stm_model <- stm(documents = out$documents, vocab = out$vocab,
K = number_of_topics, data = out$meta,
max.em.its = 75, init.type = "Spectral" )
But i didn't find out how I can get the desired distribution out of this object. The documentation didn't really help me aswell.
As emilliman5 pointed out, your stm_model provides access to the underlying parameters of the model, as is shown in the documentation.
Indeed, the theta parameter is a
Number of Documents by Number of Topics matrix of topic proportions.
This requires some linguistical parsing: it is an N_DOCS by N_TOPICS matrix, i.e. it has N_DOCS rows, one per document, and N_TOPICS columns, one per topic. The values are the topic proportions, i.e. if stm_model[1, ] == c(.3, .2, .5), that means Document 1 is 30% Topic 1, 20% Topic 2 and 50% Topic 3.
To find out what topic dominates a document, you have to find the (column!) index of the maximum value, which can be retrieved e.g. by calling apply with MARGIN=1, which basically says "do this row-wise"; which.max simply returns the index of the maximum value:
apply(stm_model$theta, MARGIN=1, FUN=which.max)

Finding closest related words using word2vec

My goal is to find most relevant words given set of keywords using word2vec. For example, if I have a set of words [girl, kite, beach], I would like relevants words to be output from word2vec: [flying, swimming, swimsuit...]
I understand that word2vec will vectorize a word based on the context of surround words. So what I did, was use the following function:
most_similar_cosmul([girl, kite, beach])
However, it seems to give out words not very related to the set of keywords:
['charade', 0.30288437008857727]
['kinetic', 0.3002534508705139]
['shells', 0.29911646246910095]
['kites', 0.2987399995326996]
['7-9', 0.2962781488895416]
['showering', 0.2953910827636719]
['caribbean', 0.294752299785614]
['hide-and-go-seek', 0.2939240336418152]
['turbine', 0.2933803200721741]
['teenybopper', 0.29288050532341003]
['rock-paper-scissors', 0.2928623557090759]
['noisemaker', 0.2927709221839905]
['scuba-diving', 0.29180505871772766]
['yachting', 0.2907838821411133]
['cherub', 0.2905363440513611]
['swimmingpool', 0.290039986371994]
['coastline', 0.28998953104019165]
['Dinosaur', 0.2893030643463135]
['flip-flops', 0.28784963488578796]
['guardsman', 0.28728148341178894]
['frisbee', 0.28687697649002075]
['baltic', 0.28405341506004333]
['deprive', 0.28401875495910645]
['surfs', 0.2839275300502777]
['outwear', 0.28376665711402893]
['diverstiy', 0.28341981768608093]
['mid-air', 0.2829524278640747]
['kickboard', 0.28234976530075073]
['tanning', 0.281939834356308]
['admiration', 0.28123530745506287]
['Mediterranean', 0.281186580657959]
['cycles', 0.2807052433490753]
['teepee', 0.28070521354675293]
['progeny', 0.2775532305240631]
['starfish', 0.2775339186191559]
['romp', 0.27724218368530273]
['pebbles', 0.2771730124950409]
['waterpark', 0.27666303515434265]
['tarzan', 0.276429146528244]
['lighthouse', 0.2756190896034241]
['captain', 0.2755546569824219]
['popsicle', 0.2753356397151947]
['Pohoda', 0.2751699686050415]
['angelic', 0.27499720454216003]
['african-american', 0.27493417263031006]
['dam', 0.2747344970703125]
['aura', 0.2740659713745117]
['Caribbean', 0.2739778757095337]
['necking', 0.27346789836883545]
['sleight', 0.2733519673347473]
This is the code I used to train word2vec
def train(data_filepath, epochs=300, num_features=300, min_word_count=2, context_size=7, downsampling=1e-3, seed=1,
ckpt_filename=None):
"""
Train word2vec model
data_filepath path of the data file in csv format
:param epochs: number of times to train
:param num_features: increase to improve generality, more computationally expensive to train
:param min_word_count: minimum frequency of word. Word with lower frequency will not be included in training data
:param context_size: context window length
:param downsampling: reduce frequency for frequent keywords
:param seed: make results reproducible for random generator. Same seed means, after training model produces same results.
:returns path of the checkpoint after training
"""
if ckpt_filename == None:
data_base_filename = os.path.basename(data_filepath)
data_filename = os.path.splitext(data_base_filename)[0]
ckpt_filename = data_filename + ".wv.ckpt"
num_workers = multiprocessing.cpu_count()
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
nltk.download("punkt")
nltk.download("stopwords")
print("Training %s ..." % data_filepath)
sentences = _get_sentences(data_filepath)
word2vec = w2v.Word2Vec(
sg=1,
seed=seed,
workers=num_workers,
size=num_features,
min_count=min_word_count,
window=context_size,
sample=downsampling
)
word2vec.build_vocab(sentences)
print("Word2vec vocab length: %d" % len(word2vec.wv.vocab))
word2vec.train(sentences, total_examples=len(sentences), epochs=epochs)
return _save_ckpt(word2vec, ckpt_filename)
def _save_ckpt(model, ckpt_filename):
if not os.path.exists("checkpoints"):
os.makedirs("checkpoints")
ckpt_filepath = os.path.join("checkpoints", ckpt_filename)
model.save(ckpt_filepath)
return ckpt_filepath
def _get_sentences(data_filename):
print("Found Data:")
sentences = []
print("Reading '{0}'...".format(data_filename))
with codecs.open(data_filename, "r") as data_file:
reader = csv.DictReader(data_file)
for row in reader:
sentences.append(ast.literal_eval((row["highscores"])))
print("There are {0} sentences".format(len(sentences)))
return sentences
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description='Train Word2vec model')
parser.add_argument('data_filepath',
help='path to training CSV file.')
args = parser.parse_args()
data_filepath = args.data_filepath
train(data_filepath)
This is a sample of training data used for word2vec:
22751473,"[""lover"", ""sweetheart"", ""couple"", ""dietary"", ""meal""]"
28738542,"[""mallotus"", ""villosus"", ""shishamo"", ""smelt"", ""dried"", ""fish"", ""spirinchus"", ""lanceolatus""]"
25163686,"[""Snow"", ""Removal"", ""snow"", ""clearing"", ""female"", ""females"", ""woman"", ""women"", ""blower"", ""snowy"", ""road"", ""operate""]"
32837025,"[""milk"", ""breakfast"", ""drink"", ""cereal"", ""eating""]"
23828321,"[""jogging"", ""female"", ""females"", ""lady"", ""woman"", ""women"", ""running"", ""person""]"
22874156,"[""lover"", ""sweetheart"", ""heterosexual"", ""couple"", ""man"", ""and"", ""woman"", ""consulting"", ""hear"", ""listening""]
For prediction, I simply used the following function for a set of keywords:
most_similar_cosmul
I was wondering whether it is possible to find relevant keywords with word2vec. If it is not, then what machine learning model would be more suitable for this. Any insights would be very helpful
When supplying multiple positive-word examples, like ['girl', 'kite', 'beach'], to most_similar()/most_similar_cosmul(), the vectors for those words will be averaged-together first, then a list of words most similar to the average returned. Those might not be as obviously related to any one of the words than a simple check of a single word. So:
When you try most_similar() (or most_similar_cosmul()) on a single word, what kind of results do you get? Are they words that seem related to the input word, in the way that you care about?
If not, you have deeper problems in your setup that should be fixed before trying a multi-word similarity.
Word2Vec gets its usual results from (1) lots of training data; and (2) natural-language sentences. With enough data, a typical number of epochs training-passes (and thus the default) is 5. You can sometimes, somewhat make up for less data by using more epoch iterations, or a smaller vector size, but not always.
It's not clear how much data you have. Also, your example rows aren't real natural-language sentences – they appear to have had some other preprocessing/reordering applied. That may be hurting rather than helping.
Word-vectors often improve by throwing away more low-frequency words (increasing min_count above the default 5, rather than reducing it to 2.) Low-frequency words don't have enough examples to get good vectors – and the few examples they have, even if repeated with many iterations, tend to be idiosyncratic examples of the words' usage, not the generalizable broad representations that you'd get from many varied examples. And by keeping these doomed-to-be-weak words still in the training-data, the training of other more-frequent words is interfered with. (When you get a word that you don't think belongs in a most-similar ranking, it may be a rare-word that, given its its few occurrence contexts, found its way to those coordinates as the least-bad location among plenty of other unhelpful coordinates.)
If you do get good results from single-word checks, but not from the average-of-multiple-words, the results might improve with more and better data, or adjusted training parameters – but to achieve that you'd need to more rigorously define what you consider good results. (Your existing list doesn't look that bad to me: it includes many words related to sun/sand/beach activities.)
On the other hand, your expectations of Word2Vec may be too high: it may not be that the average of ['girl', 'kite', 'beach'] is necessarily closed to those desired words, compared to the individual words themselves, or that may only be achievable with lots of dataset/parameter tweaking.

NLTK title classifier

Apologies in advance if this has already been questioned/answered, but I couldn't find any answer close to my problem. I am also somewhat noob as to dealing with Python, so sorry too for the long post.
I am trying to build a Python script that, based on a user-given Pubmed query (i.e., "cancer"), retrieves a file with N article titles, and evaluates their relevance to the subject in question.
I have successfully built the "pubmed search and save" part, having it return a .txt file containing titles of articles (each line corresponds to a different article title), for instance:
Feasibility of an ovarian cancer quality-of-life psychoeducational intervention.
A randomized trial to increase physical activity in breast cancer survivors.
Having this file, the idea is to use it into a classifier and get it to answer if the titles in the .txt file are relevant to a subject, for which I have a "gold standard" of titles that I know are relevant (i.e., I want to know the precision and recall of the queried set of titles against my gold standard). For example: Title 1 has the word "neoplasm" X times and "study" N times, therefore it is considered as relevant to "cancer" (Y/N).
For this, I have been using NLTK to (try to) classify my text. I have pursued 2 different approaches, both unsuccessfully:
Approach 1
Loading the .txt file, preprocessing it (tokenization, lower-casing, removing stopwords), converting the text to NLTK text format, finding the N most-common words. All this runs without problems.
f = open('SR_titles.txt')
raw = f.read()
tokens = word_tokenize(raw)
words = [w.lower() for w in tokens]
words = [w for w in words if not w in stopwords.words("english")]
text = nltk.Text(words)
fdist = FreqDist(text)
>>><FreqDist with 116 samples and 304 outcomes>
I am also able to find colocations/bigrams in the text, which is something that might be important afterward.
text.collocations()
>>>randomized controlled; breast cancer; controlled trial; physical
>>>activity; metastatic breast; prostate cancer; randomised study; early
>>>breast; cancer patients; feasibility study; psychosocial support;
>>>group psychosocial; group intervention; randomized trial
Following NLTKs tutorial, I built a feature extractor, so the classifier will know which aspects of the data it should pay attention to.
def document_features(document):
document_words = set(document)
features = {}
for word in word_features:
features['contains({})'.format(word)] = (word in document_words)
return features
This would, for instance, return something like this:
{'contains(series)': False, 'contains(disorders)': False,
'contains(group)': True, 'contains(neurodegeneration)': False,
'contains(human)': False, 'contains(breast)': True}
The next thing would be to use the feature extractor to train a classifier to label new article titles, and following NLTKs example, I tried this:
featuresets = [(document_features(d), c) for (d,c) in text]
Which gives me the error:
ValueError: too many values to unpack
Quickly googled this and found that this has something to do with tuples, but did not get how can I solve it (like I said, I'm somewhat noob in this), unless by creating a categorized corpus (I would still like to understand how can I solve this tuple problem).
Therefore, I tried approach 2, following Jacob Perkings Text Processing with NLTK Cookbook:
Started by creating a corpus and attributing categories. This time I had 2 different .txt files, one for each subject of title articles.
reader = CategorizedPlaintextCorpusReader('.', r'.*\,
cat_map={'hd_titles.txt': ['HD'], 'SR_titles.txt': ['Cancer']})
With "reader.raw()" I get something like this:
u"A pilot investigation of a multidisciplinary quality of life intervention for men with biochemical recurrence of prostate cancer.\nA randomized controlled pilot feasibility study of the physical and psychological effects of an integrated support programme in breast cancer.\n"
The categories for the corpus seem to be right:
reader.categories()
>>>['Cancer', 'HD']
Then, I try to construct a list of documents, labeled with the appropriate categories:
documents = [(list(reader.words(fileid)), category)
for category in reader.categories()
for fileid in reader.fileids(category)]
Which returns me something like this:
[([u'A', u'pilot', u'investigation', u'of', u'a', u'multidisciplinary',
u'quality', u'of', u'life', u'intervention', u'for', u'men', u'with',
u'biochemical', u'recurrence', u'of', u'prostate', u'cancer', u'.'],
'Cancer'),
([u'Trends', u'in', u'the', u'incidence', u'of', u'dementia', u':',
u'design', u'and', u'methods', u'in', u'the', u'Alzheimer', u'Cohorts',
u'Consortium', u'.'], 'HD')]
Next step would be creating a list of labeled feature sets, for which I used the next function, that takes a corpus and a feature_detector function (that would be document_features referred above). It then constructs and returns a mapping of the form {label: [featureset]}.
def label_feats_from_corpus(corp, feature_detector=document_features):
label_feats = collections.defaultdict(list)
for label in corp.categories():
for fileid in corp.fileids(categories=[label]):
feats = feature_detector(corp.words(fileids=[fileid]))
label_feats[label].append(feats)
return label_feats
lfeats = label_feats_from_corpus(reader)
>>>defaultdict(<type 'list'>, {'HD': [{'contains(series)': True,
'contains(disorders)': True, 'contains(neurodegeneration)': True,
'contains(anilinoquinazoline)': True}], 'Cancer': [{'contains(cancer)':
True, 'contains(of)': True, 'contains(group)': True, 'contains(After)':
True, 'contains(breast)': True}]})
(the list is a lot bigger and everything is set as True).
Then I want to construct a list of labeled training instances and testing instances.
The split_label_feats() function takes a mapping returned from
label_feats_from_corpus() and splits each list of feature sets
into labeled training and testing instances.
def split_label_feats(lfeats, split=0.75):
train_feats = []
test_feats = []
for label, feats in lfeats.items():
cutoff = int(len(feats) * split)
train_feats.extend([(feat, label) for feat in feats[:cutoff]])
test_feats.extend([(feat, label) for feat in feats[cutoff:]])
return train_feats, test_feats
train_feats, test_feats = split_label_feats(lfeats, split=0.75)
len(train_feats)
>>>0
len(test_feats)
>>>2
print(test_feats)
>>>[({'contains(series)': True, 'contains(China)': True,
'contains(disorders)': True, 'contains(neurodegeneration)': True},
'HD'), ({'contains(cancer)': True, 'contains(of)': True,
'contains(group)': True, 'contains(After)': True, 'contains(breast)':
True}, 'Cancer')]
I should've ended up with a lot more labeled training instances and labeled testing instances, I guess.
This brings me to where I am now. I searched stackoverflow, biostars, etc and could not find how to deal with both problems, so any help would be deeply appreciated.
TL;DR: Can't label a single .txt file to classify text, and can't get a corpus correctly labeled (again, to classify text).
If you've read this far, thank you as well.
You're getting an error on the following line:
featuresets = [(document_features(d), c) for (d,c) in text]
Here, you are supposed to convent each document (i.e. each title) to a dictionary of features. But to train with the results, the train() method needs both the feature dictionaries and the correct answer ("label"). So the normal workflow is to have a list of (document, label) pairs, which you transform to (features, label) pairs. It looks like your variable documents has the right structure, so if you just use it instead of text, this should work correctly:
featuresets = [(document_features(d), c) for (d,c) in documents]
As you go forward, get in the habit of inspecting your data carefully and figuring out what will (and should) happen to them. If text is a list of titles, it makes no sense to unpack each title to a pair (d, c). That should have pointed you in the right direction.
In featuresets = [(document_features(d), c) for (d,c) in text], I'm not sure what you are supposed to be getting from text. text seem to be a nltk class which is simply a wrapper around a generator. It seems it will give you a single string each iteration, which is why you are getting an error as you are asking for two items when it only has one to give.

Categories

Resources