I've trained an LDA model using gensim. I am under the impression that LDA reduces the data to two lower-level matrices (ref: https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/), but I cannot figure out how to access the term-topic matrix. The only reference I could find in gensim's documentation is the .get_topics() method, but the format it returns makes no sense to me.
It is easy enough to apply a transformation to retrieve the Document-topic matrix, like so:
doc_topic_matrix = lda_model[doc_term_matrix]
so I am hoping that there is a similarly functional method to generate the topic-term matrix.
Ideally, output should look like this:
word1 word2 word3 word4 word5
topic_a .12 .38 .07 .24 .19
topic_b .41 .11 .04 .14 .30
Any thoughts on whether or not this is possible?
It's easy; you can get it like this:
import numpy as np
import pandas as pd

# Get the raw topic > word estimates
topics_terms = model.state.get_lambda()

# Convert the estimates to probabilities (each topic's row sums to 1)
topics_terms_proba = np.apply_along_axis(lambda x: x / x.sum(), 1, topics_terms)

# Find the right word based on the column index
words = [model.id2word[i] for i in range(topics_terms_proba.shape[1])]

# Put everything together
pd.DataFrame(topics_terms_proba, columns=words)
You already mentioned the appropriate method get_topics().
Here is how you can interpret the results with pandas:
import pandas as pd
from gensim.models import LdaModel
from gensim.test.utils import common_dictionary, common_corpus
model = LdaModel(common_corpus, id2word=common_dictionary, num_topics=2)
pd.DataFrame(model.get_topics(), columns=model.id2word.values(), index=[f'topic {i}' for i in range(model.num_topics)])
The final result is a DataFrame with one row per topic and one column per word in the dictionary; each row is that topic's probability distribution over the vocabulary (summing to 1).
I'm using customized text with 'Prompt' and 'Completion' to train a new model.
Here's the tutorial I used to create a customized model from my data:
beta.openai.com/docs/guides/fine-tuning/advanced-usage
However, even after training the model and sending it prompt text, I'm still getting generic results which are not always suitable for me.
How can I make sure the completion results for my prompts come only from the text I used to train the model, and not from the generic OpenAI models?
Can I use some flags to eliminate results from generic models?
Wrong goal: OpenAI API should answer from the fine-tuning dataset if the prompt is similar to the one from the fine-tuning dataset
This is the wrong approach entirely. Forget about fine-tuning for this goal. As stated on the official OpenAI website:
Fine-tuning lets you get more out of the models available through the
API by providing:
Higher quality results than prompt design
Ability to train on more examples than can fit in a prompt
Token savings due to shorter prompts
Lower latency requests
Fine-tuning is not about answering with a specific answer from the fine-tuning dataset.
Fine-tuning helps the model gain more knowledge, but it has nothing to do with how the model answers. Why? The answer we get from the fine-tuned model is based on all knowledge (i.e., fine-tuned model knowledge = default knowledge + fine-tuning knowledge).
Although GPT-3 models have a lot of general knowledge, sometimes we want the model to answer with a specific answer (i.e., "fact").
Correct goal: Answer with a "fact" when asked about a "fact", otherwise answer with the OpenAI API
Note: For better (visual) understanding, the following code was run and tested in Jupyter.
STEP 1: Create a .csv file with "facts"
To keep things simple, let's add two companies (i.e., ABC and XYZ), each with some content. The content in our case will be a one-sentence description of the company.
companies.csv
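The exact file contents are not reproduced here; an illustrative version with a 'content' column (the descriptions below are made up for this walkthrough) could be:
content
Company ABC builds scheduling software for small restaurants.
Company XYZ manufactures electric bicycles for city commuters.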
Run print_dataframe.ipynb to print the dataframe.
print_dataframe.ipynb
import pandas as pd
df = pd.read_csv('companies.csv')
df
We should get the following output:
STEP 2: Calculate an embedding vector for every "fact"
An embedding is a vector of numbers that helps us understand how semantically similar or different the texts are. The closer two embeddings are to each other, the more similar are their contents (source).
Let's test the Embeddings endpoint first. Run get_embedding.ipynb with an input This is a test.
Note: In the case of the Embeddings endpoint, the parameter is called input rather than prompt.
get_embedding.ipynb
import openai
openai.api_key = '<OPENAI_API_KEY>'
def get_embedding(model: str, text: str) -> list[float]:
    result = openai.Embedding.create(
        model = model,
        input = text
    )
    return result['data'][0]['embedding']
print(get_embedding('text-embedding-ada-002', 'This is a test'))
We should get the following output:
What we see in the screenshot above is This is a test as an embedding vector. More precisely, we get a 1536-dimensional embedding vector (i.e., there are 1536 numbers inside). You are probably familiar with a 3-dimensional space (i.e., X, Y, Z). Well, this is a 1536-dimensional space which is very hard to imagine.
There are two things we need to understand at this point:
Why do we need to transform text into an embedding vector (i.e., numbers)? Because later on, we can compare embedding vectors and figure out how similar the two texts are. We can't compare texts as such.
Why are there exactly 1536 numbers inside the embedding vector? Because the text-embedding-ada-002 model has an output dimension of 1536. It's pre-defined.
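As a quick sanity check (reusing the get_embedding helper defined above), the returned list should contain exactly 1536 numbers:
embedding = get_embedding('text-embedding-ada-002', 'This is a test')
print(len(embedding))  # 1536 for text-embedding-ada-002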
Now we can create an embedding vector for each "fact". Run get_all_embeddings.ipynb.
get_all_embeddings.ipynb
import openai
from openai.embeddings_utils import get_embedding
import pandas as pd
openai.api_key = '<OPENAI_API_KEY>'
df = pd.read_csv('companies.csv')
df['embedding'] = df['content'].apply(lambda x: get_embedding(x, engine = 'text-embedding-ada-002'))
df.to_csv('companies_embeddings.csv')
The code above takes the first company's 'content' (i.e., its "fact", bound to x) and applies the get_embedding function to it using the text-embedding-ada-002 model. It saves the embedding vector of the first company in a new column named 'embedding'. Then it does the same for the second company, the third company, and so on. At the end, the code writes a new .csv file named companies_embeddings.csv.
Saving embedding vectors locally (i.e., in a .csv file) means we don't have to call the OpenAI API every time we need them. We calculate an embedding vector for a given "fact" once and that's it.
Run print_dataframe_embeddings.ipynb to print the dataframe with the new column named 'embedding'.
print_dataframe_embeddings.ipynb
import pandas as pd
import numpy as np
df = pd.read_csv('companies_embeddings.csv')
df['embedding'] = df['embedding'].apply(eval).apply(np.array)
df
We should get the following output:
STEP 3: Calculate an embedding vector for the input and compare it with embedding vectors from the companies_embeddings.csv using cosine similarity
We need to calculate an embedding vector for the input so that we can compare the input with a given "fact" and see how similar these two texts are. Actually, we compare the embedding vector of the input with the embedding vector of the "fact". Then we compare the input with the second "fact", the third "fact", the fourth "fact", etc. Run get_cosine_similarity.ipynb.
get_cosine_similarity.ipynb
import openai
from openai.embeddings_utils import cosine_similarity
import pandas as pd
import numpy as np
openai.api_key = '<OPENAI_API_KEY>'
my_model = 'text-embedding-ada-002'
my_input = '<INSERT_INPUT>'
def get_embedding(model: str, text: str) -> list[float]:
    result = openai.Embedding.create(
        model = model,
        input = text
    )
    return result['data'][0]['embedding']
input_embedding_vector = get_embedding(my_model, my_input)
df = pd.read_csv('companies_embeddings.csv')
df['embedding'] = df['embedding'].apply(eval).apply(np.array)
df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity(x, input_embedding_vector))
df
The code above will take the input and compare it with the first fact. It will save the calculated similarity of the two in a new column named 'similarity'. Then it will take the second fact, the third fact, the fourth fact, etc.
If my_input = 'Tell me something about company ABC':
If my_input = 'Tell me something about company XYZ':
If my_input = 'Tell me something about company Apple':
We can see that when we give Tell me something about company ABC as an input, it's the most similar to the first "fact". When we give Tell me something about company XYZ as an input, it's the most similar to the second "fact". Whereas, if we give Tell me something about company Apple as an input, it's the least similar to any of these two "facts".
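For reference, the cosine_similarity helper imported from openai.embeddings_utils computes the standard cosine similarity between two vectors; a minimal numpy equivalent (an illustrative sketch, not the library's own code) looks like this:
import numpy as np

def cosine_sim(a, b):
    a, b = np.asarray(a), np.asarray(b)
    # dot product divided by the product of the vector norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))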
STEP 4: Answer with the most similar "fact" if similarity is above our threshold, otherwise answer with the OpenAI API
Let's set our similarity threshold to >= 0.9. The code below should answer with the most similar "fact" if similarity is >= 0.9, otherwise answer with the OpenAI API. Run get_answer.ipynb.
get_answer.ipynb
# Imports
import openai
from openai.embeddings_utils import cosine_similarity
import pandas as pd
import numpy as np
# Insert your API key
openai.api_key = '<OPENAI_API_KEY>'
# Insert OpenAI text embedding model and input
my_model = 'text-embedding-ada-002'
my_input = '<INSERT_INPUT>'
# Calculate embedding vector for the input using OpenAI Embeddings endpoint
def get_embedding(model: str, text: str) -> list[float]:
    result = openai.Embedding.create(
        model = model,
        input = text
    )
    return result['data'][0]['embedding']
# Save embedding vector of the input
input_embedding_vector = get_embedding(my_model, my_input)
# Calculate similarity between the input and "facts" from companies_embeddings.csv file which we created before
df = pd.read_csv('companies_embeddings.csv')
df['embedding'] = df['embedding'].apply(eval).apply(np.array)
df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity(x, input_embedding_vector))
# Find the highest similarity value in the dataframe column 'similarity'
highest_similarity = df['similarity'].max()
# If the highest similarity value is equal or higher than 0.9 then print the 'content' with the highest similarity
if highest_similarity >= 0.9:
    fact_with_highest_similarity = df.loc[df['similarity'] == highest_similarity, 'content']
    print(fact_with_highest_similarity)
# Else pass the input to the OpenAI Completions endpoint
else:
    response = openai.Completion.create(
        model = 'text-davinci-003',
        prompt = my_input,
        max_tokens = 30,
        temperature = 0
    )
    content = response['choices'][0]['text'].replace('\n', '')
    print(content)
If my_input = 'Tell me something about company ABC' and the threshold is >= 0.9 we should get the following answer from the companies_embeddings.csv:
If my_input = 'Tell me something about company XYZ' and the threshold is >= 0.9 we should get the following answer from the companies_embeddings.csv:
If my_input = 'Tell me something about company Apple' and the threshold is >= 0.9 we should get the following answer from the OpenAI API:
My use case is that I have a corpus of documents, and when new docs come in I update Dictionary and vectorize. The result should be a sparse matrix of TFIDF vectors, which I'm using corpus2csc for.
I think I have a solution, but I have seen other answers that suggest my solution is impossible, and I've seen some unexpected behavior. I'm seeking a gut-check on the approach, with some specific questions below.
Overall approach:
Use Dictionary.doc2bow with allow_update=True
Construct TFIDF model using that updated Dictionary.
Questions and example code below.
from gensim import corpora
from gensim import models
from gensim import matutils
docs = [['definition', 'addition', 'term', 'defined'],
['term', 'sweet', 'generally', 'subject'],
['gene', 'extent', 'provided', 'gene'],
['additional', 'cost', 'gene', 'adequacy'],
['initial', 'condition', 'sweet', 'effectiveness']]
gensim_dict = corpora.Dictionary([], prune_at=None) # start with empty dict because I don't know what the corpus looks like at this point
BoW_corpus0 = [gensim_dict.doc2bow(d, allow_update=True) for d in docs] # Add to the dictionary by setting allow_update=True
# documentation says: corpus (iterable of iterable of (int, number)) – Input corpus in BoW format
matutils.corpus2csc(corpus=BoW_corpus0) # Expected. A 16x5 matrix with 19 stored elements
matutils.corpus2csc(corpus=BoW_corpus0, num_terms=len(gensim_dict.keys()), num_docs=len(BoW_corpus0), num_nnz=sum([len(doc) for doc in BoW_corpus0])) # Also expected for the 'efficient' path
model = models.TfidfModel(BoW_corpus0)
vecs = model[BoW_corpus0]
matutils.corpus2csc(corpus=vecs) # Unexpected - asks for a BoW format, but this is TransformedCorpus. But ignoring that, it's expected -- 16x5 with 19 stored elements.
matutils.corpus2csc(corpus=vecs, num_terms=len(gensim_dict.keys()), num_docs=vecs.obj.num_docs, num_nnz=vecs.obj.num_nnz) # Expected
corpus2csc documentation asks for a BoW format, not a TransformedCorpus. Does it accept a TransformedCorpus just by chance?
Now I add new docs:
new_docs = [['provided', 'communication', 'provided', 'writing']]
BoW_corpus1 = [gensim_dict.doc2bow(d, allow_update=True) for d in new_docs]
len(gensim_dict.cfs) # 18
gensim_dict.num_nnz # 22
len(BoW_corpus1) # 1
matutils.corpus2csc(corpus=BoW_corpus1) # Expected. An 18x1 matrix, with 3 stored elements
matutils.corpus2csc(corpus=BoW_corpus1, num_terms=len(gensim_dict.keys()), num_docs=len(BoW_corpus1), num_nnz=sum([len(doc) for doc in BoW_corpus1])) # Expected. An 18x1 matrix, with 3 stored elements
# Let's use the original model. Ideally we would update the model with the new document but I'm not sure of the best way to do that.
new_vecs = model[BoW_corpus1]
matutils.corpus2csc(corpus=new_vecs) # Unexpected. Using the original model, a 10x1 with 1 stored elements. Why not 18x1?
Why does this approach result in a 10x1 matrix?
# let's try using the Dictionary that's been updated on the fly
dict_based_model = models.TfidfModel(dictionary=gensim_dict)
new_vecs2 = dict_based_model[BoW_corpus1]
matutils.corpus2csc(corpus=new_vecs2) # Expected. 18x1 with 3 stored elements.
matutils.corpus2csc(corpus=new_vecs2, num_terms=len(gensim_dict.keys()), num_docs=len(new_vecs2), num_nnz=sum([len(v) for v in new_vecs2])) # Also expected. 18x1 with 3 stored elements.
# But why are these not the same?
assert new_vecs2.obj.num_docs == len(new_vecs)
assert new_vecs2.obj.num_nnz == sum([len(v) for v in new_vecs2])
# Finally, let's make a model based on the new corpus, I know this isn't right but curious why the output is what it is
new_corpus_based_model = models.TfidfModel(BoW_corpus1)
new_vecs3 = new_corpus_based_model[BoW_corpus1]
matutils.corpus2csc(corpus=new_vecs3) # 0x1 with 0 elements. Kinda expected. But I would have thought it would have produced a 2x1 or 3x1 matrix
Can you confirm that dict_based_model is the right approach?
What is the new_vecs2.obj all about?
Why do I get a 0x1 instead of a 2x1?
I'm trying to write a program that, given a list of sentences, returns the most probable one. I want to use GPT-2, but I am quite new to using it (as in, I don't really know how to do it). I'm planning on finding the probability of a word given the previous words and multiplying all those probabilities together to get the overall probability of the sentence occurring. However, I don't know how to find the probability of a word given the previous words. This is my (pseudo) code:
sentences = [...]  # my list of sentences
max_prob = 0
best_sentence = sentences[0]
for sentence in sentences:
    prob = 1  # probability of that sentence
    words = sentence.split()
    for idx, word in enumerate(words[1:]):
        prob *= probability(word, " ".join(words[:idx + 1]))  # this is where I need help
    if prob > max_prob:
        max_prob = prob
        best_sentence = sentence
print(best_sentence)
Can I have some help please?
You can also try lm-scorer, a tiny wrapper around transformers that allows you to get sentences probabilities using models that support it (only GPT2 models are implemented at the time of writing).
https://github.com/simonepri/lm-scorer
I just used it myself and it works perfectly.
Warning: If you use other transformers / pipelines in the same environment, things may get messy.
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import numpy as np

model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

def score(tokens_tensor):
    loss = model(tokens_tensor, labels=tokens_tensor)[0]
    return np.exp(loss.cpu().detach().numpy())

texts = ['i would like to thank you mr chairman', 'i would liking to thanks you mr chair in', 'thnks chair']
for text in texts:
    tokens_tensor = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")
    print(text, score(tokens_tensor))
This code snippet could be an example of what you are looking for: you feed the model a list of sentences, and it scores each one, where lower is better (the score is the exponential of the average loss, i.e., the perplexity).
The output of the code above is:
i would like to thank you mr chairman 122.3066
i would liking to thanks you mr chair in 1183.7637
thnks chair 14135.129
I wrote a set of functions that can do precisely what you're looking for. Recall that GPT-2 parses its input into tokens (not words): the last word in 'Joe flicked the grasshopper' is actually three tokens: ' grass', 'ho', and 'pper'. The cloze_finalword function takes this into account, and computes the probabilities of all tokens (conditioned on the tokens appearing before them). You can adapt part of this function so that it returns what you're looking for. I hope you find the code useful!
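The cloze_finalword function itself isn't reproduced here, but a minimal sketch of the same idea (summing the log probability of each token conditioned on the tokens before it, using Hugging Face transformers; the helper name and toy sentences are just for illustration) might look like this:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

def sentence_log_prob(text):
    input_ids = tokenizer.encode(text, return_tensors='pt')
    with torch.no_grad():
        logits = model(input_ids)[0]  # shape (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits, dim=-1)
    # log P(token_i | tokens before i), for every token except the first
    token_log_probs = log_probs[0, :-1].gather(1, input_ids[0, 1:].unsqueeze(1))
    return token_log_probs.sum().item()

sentences = ['Joe flicked the grasshopper', 'Joe flicked a grasshopper']
print(max(sentences, key=sentence_log_prob))
Summing log probabilities avoids the numerical underflow you would get from multiplying raw probabilities over a long sentence.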
I think GPT-2 is a bit overkill for what you're trying to achieve. You can build a basic language model which will give you sentence probability using NLTK. A tutorial for this can be found here.
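Should you go the NLTK route, a minimal sketch of that idea (a bigram model with Laplace smoothing trained on a toy corpus; the training sentences below are purely illustrative) could look like this:
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

corpus = [['i', 'would', 'like', 'to', 'thank', 'you'],
          ['i', 'would', 'like', 'to', 'see', 'you', 'again']]
train, vocab = padded_everygram_pipeline(2, corpus)
lm = Laplace(2)  # bigram model with add-one smoothing
lm.fit(train, vocab)

def sentence_prob(tokens):
    # multiply P(w2 | w1) over all bigrams, including sentence padding
    prob = 1.0
    for w1, w2 in ngrams(pad_both_ends(tokens, n=2), 2):
        prob *= lm.score(w2, [w1])
    return prob

print(sentence_prob(['i', 'would', 'like', 'to', 'thank', 'you']))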
I'm trying to implement a similarity function using
N-Grams
TF-IDF
Cosine Similarity
Example
Concept:
words = [...]
word = '...'
similarity = predict(words,word)
def predict(words, word):
    words_ngrams = create_ngrams(words, range=(2, 4))
    word_ngrams = create_ngrams(word, range=(2, 4))
    words_tokenizer = tfidf_tokenizer(words_ngrams)
    word_vec = words_tokenizer.transform(word)
    return cosine_similarity(word_vec, words_tokenizer)
I searched the web for a simple and safe implementation, but I couldn't find one that uses well-known Python packages such as sklearn, nltk, scipy, etc.;
most of them use "self-made" calculations.
I'm trying to avoid coding every step by hand, and I'm guessing there is an easy fix for that whole pipeline.
Any help (and code) would be appreciated. Thanks :)
Eventually I figured it out...
For whoever needs a solution for this question, here's a function I wrote that takes care of it...
'''
### N-Gram & TF-IDF & Cosine Similarity
Uses character n-grams on the 'from column' with TF-IDF to predict the 'to column'.
Adds a 'cosine_similarity' column to the df with the numeric result.
'''
def add_prediction_by_ngram_tfidf_cosine(from_column_name, ngram_range=(2, 4)):
    global df
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    # Fit a character n-gram TF-IDF vectorizer on the source column
    vectorizer = TfidfVectorizer(analyzer='char', ngram_range=ngram_range)
    vectorizer.fit(df.FromColumn)
    # Vectorize the query string and every row, then compute cosine similarity
    w = from_column_name
    vec_word = vectorizer.transform([w])
    df['vec'] = df.FromColumn.apply(lambda x: vectorizer.transform([x]))
    df['cosine_similarity'] = df.vec.apply(lambda x: cosine_similarity(x, vec_word)[0][0])
    df = df.drop(['vec'], axis=1)
Note: it's not production ready
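A hypothetical usage example (the dataframe and its values below are made up; the function assumes a global df with a FromColumn column):
import pandas as pd

df = pd.DataFrame({'FromColumn': ['apple inc', 'banana republic', 'apple computers']})
add_prediction_by_ngram_tfidf_cosine('apple')
print(df[['FromColumn', 'cosine_similarity']])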
I am working on a project where I need to apply topic modelling to a set of documents, and I need to create a matrix:
DT, a D × T matrix, where D is the number of documents and T is the number of topics. DT(ij) contains the number of times a word in document Di has been assigned to topic Tj.
So far I have followed this tut: https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html
I am new to gensim, and so far I have:
1. Created a document list.
2. Preprocessed and tokenized the documents.
3. Used corpora.Dictionary() to create an id -> term dictionary (id2word).
4. Converted the tokenized documents into a document-term matrix.
5. Generated an LDA model.
So now I get the topics.
How can I now get the matrix that I mentioned above?
I will be using this matrix to calculate similarity between 2 documents on topic t as :
sim(a,b) = 1- |DT(a,t) - DT(b, t)|
There is an implementation in the pyLDAvis source code which returns the lists that may be helpful for building the matrix you are interested in.
Snippet from the _extract_data method in gensim.py:
def _extract_data(topic_model, corpus, dictionary, doc_topic_dists=None):
...
...
...
return {'topic_term_dists': topic_term_dists, 'doc_topic_dists': doc_topic_dists,
'doc_lengths': doc_lengths, 'vocab': vocab, 'term_frequency': term_freqs}
The number of topics for your model will be static. Maybe you're interested in finding the document topic distribution for the T matrix. In that case, the DxT matrix would be doc_lengths x doc_topic_dists.
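Alternatively, a document-topic matrix can be built directly from a trained gensim model without going through pyLDAvis. A minimal sketch, assuming ldamodel and corpus as in the linked tutorial (note that this gives topic proportions per document rather than raw word-assignment counts; multiplying each row by the document length would approximate the count formulation in the question):
import numpy as np

def doc_topic_matrix(ldamodel, corpus):
    # D x T matrix of topic proportions, one row per document
    mat = np.zeros((len(corpus), ldamodel.num_topics))
    for d, bow in enumerate(corpus):
        for topic_id, prob in ldamodel.get_document_topics(bow, minimum_probability=0.0):
            mat[d, topic_id] = prob
    return mat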
Showing your code would be helpful, but if we were to go off of the example in the tutorial you linked then the model is identified by:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary, passes=20)
you could put into your script something like:
model_name = "name_of_my_model"
ldamodel.save(model_name)
Then when you run it, this will create a model in the same directory that the script is run from.
Then you can get topic probability distribution with:
print(ldamodel[doc_bow])
If you want to get similarity to this model then you need to create a model for the query document, too, and then get cosine similarity between the two:
dictionary = corpora.Dictionary.load('dictionary.dict')
corpus = corpora.MmCorpus("corpus.mm")
lda = models.LdaModel.load("name_of_my_model.lda")
index = similarities.MatrixSimilarity(lda[corpus])
index.save("simIndex.index")
docname = "docs/the_doc.txt"
doc = open(docname, 'r').read()
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lda = lda[vec_bow]
sims = index[vec_lda]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)