Search engine with TF-IDF in Python

Here is my code:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "this is first document",
    "this is second document",
    "this is third",
    "which document is first",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
X.toarray()

Now, this is what I want to do:

when I search "document", it should give me documents (sentences) [1, 2, 4]
when I search "first document", it should give me document (sentence) [1]
when I search "second", it should give me document (sentence) [2]

I want to do this with TF-IDF (I can't do plain keyword matching). How can I do that?

First of all, you have to ask yourself the question: what does the TfidfVectorizer do? The answer is: it transforms your documents into vectors. How can you proceed further? One solution is to transform your query also into a vector by using the vectorizer. Then, you can compare the cosine similarity between the transformed query vector and each of the vectors of the documents in your database. The document with the highest cosine similarity to your query vector is the most relevant one (at least according to the Vector space model).
Here https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089 is an example implementation.
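For example, a minimal sketch of that idea on the corpus from the question, using sklearn's cosine_similarity (the search function and its top_k parameter are just illustrative names):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "this is first document",
    "this is second document",
    "this is third",
    "which document is first",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)          # document-term matrix

def search(query, top_k=3):
    # Transform the query with the SAME fitted vectorizer
    query_vec = vectorizer.transform([query])
    # Cosine similarity between the query and every document
    scores = cosine_similarity(query_vec, X).ravel()
    # Return document indices ranked by similarity, dropping zero scores
    ranked = scores.argsort()[::-1]
    return [(i, scores[i]) for i in ranked[:top_k] if scores[i] > 0]

print(search("document"))        # the three sentences containing "document"
print(search("first document"))  # the first sentence ranks highest
print(search("second"))          # only the second sentence matches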

Related

Reduce Dimension of word-vectors from TFIDFVectorizer / CountVectorizer

I want to use the TFIDFVectorizer (or CountVectorizer followed by TFIDFTransformer) to get a vector representation of my terms. That means, I want a vector for a term where the documents are the features. That's simply the transpose of a TF-IDF matrix created by the TFIDFVectorizer.
>>> vectorizer = TfidfVectorizer()
>>> model = vectorizer.fit_transform(corpus)
>>> model.transpose()
However, I have 800k documents, which means my term vectors are very sparse and very large (800k dimensions). The flag max_features in the CountVectorizer would do exactly what I'm looking for: I can specify a dimension and the CountVectorizer tries to fit all information into this dimension. Unfortunately, this option applies to the document vectors rather than the terms in the vocabulary. Hence, it reduces the size of my vocabulary, because the terms are the features.
Is there any way to do the opposite? Like, perform a transpose on the TFIDFVectorizer object before it starts cutting and normalizing everything? And if such an approach exists, how can I do that? Something like this:
>>> countVectorizer = CountVectorizer(input='filename', max_features=300, transpose=True)
I was looking for such an approach for a while now but every guide, code example, whatever is talking about the document TF-IDF vectors rather than the term vectors.
Thank you so much in advance!
I am not aware of any straightforward way to do this, but let me propose how it could be achieved.
You are trying to represent each term in your corpus as a vector that uses the documents in your corpus as its component features. Because the number of documents (which are the features in your case) is very large, you would like to limit them in a way similar to what max_features does.
According to the CountVectorizer documentation (the same holds for TfidfVectorizer):
max_features int, default=None
If not None, build a vocabulary that only consider the top
max_features ordered by term frequency across the corpus.
In a similar way, you want to keep the top documents ordered by their "frequency across the terms", as confusing as this may sound. This could be rephrased simplistically as "keep those documents that contain the most unique terms".
One way I can think of doing that is by using inverse_transform and performing the following steps:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
model = vectorizer.fit_transform(corpus)

# We use inverse_transform, which returns the
# terms per document with nonzero entries
inverse_model = vectorizer.inverse_transform(model)

# Each entry in the inverse model corresponds to a document
# and contains a list of feature names (the terms).
# As we want to rank the documents, we transform the list
# of feature names into the number of features
# that each document is represented by.
inverse_model_count = list(map(lambda doc_vec: len(doc_vec), inverse_model))

# As we are going to sort the list, we need to keep track of the
# document id (its index in the corpus), so we create tuples with
# the list index of each item before we sort the list.
inverse_model_count_tuples = list(zip(range(len(inverse_model_count)),
                                      inverse_model_count))

# Then we sort the list by the count of terms
# in each document (the second component)
max_features = 100
top_documents_tuples = sorted(inverse_model_count_tuples,
                              key=lambda item: item[1],
                              reverse=True)[:max_features]

# We are interested only in the document ids (the first tuple component)
top_documents, _ = zip(*top_documents_tuples)

# Having the top_documents ids we can slice the initial model
# to keep only the documents indicated by the top_documents list
# (converted to a list so the sparse matrix is indexed by rows)
reduced_model = model[list(top_documents)]
Please note that this approach only takes into account the number of distinct terms per document, regardless of their count (CountVectorizer) or weight (TfidfVectorizer).
If the direction of this approach is acceptable to you, then with a little more code it is also possible to take the count or weight of the terms into account, for example as sketched below.
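A rough sketch of one such extension (my own assumption of how it could be done), ranking documents by their total TF-IDF weight instead of by the number of distinct terms:

import numpy as np

# Sum the tf-idf weights of each document (row) and keep the
# max_features documents with the largest total weight.
row_weights = np.asarray(model.sum(axis=1)).ravel()
top_documents = np.argsort(row_weights)[::-1][:max_features]
reduced_model = model[top_documents]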
I hope this helps!

Do I use TF-IDF correctly over a corpus of raw documents?

I am in doubt as to whether I am using my TF-IDF calculations correctly. I have a large corpus of different documents; each document is stored in its own row of a pandas DataFrame. I feed each row to TF-IDF in scikit-learn and store the feature_names (words) in a list.
I am using the following code:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

term_tdidf = []

def tdidf_f(vec, matrix):
    f_array = np.array(vec.get_feature_names())
    t_sort = np.argsort(matrix.toarray()).flatten()[::-1]
    n = 100
    top_term = f_array[t_sort][:n]
    term_tdidf.append(set(top_term))

for row in df.document:  # df is the DataFrame described above
    x = TfidfVectorizer(stop_words='english')
    tfidf_matrix = x.fit_transform(row)
    terms = x.get_feature_names()
    tdidf_f(x, tfidf_matrix)
After that I create a new DataFrame where the set of TF-IDF terms from each document is stored in a separate column.
Is that a correct use of TF-IDF? I am running it on a single document at a time, so the terms I get are only calculated within that one document, correct? As I understand it, TF-IDF should be computed across all documents to find one set of frequent terms, not multiple sets. Are there any consequences of such an application?
My manual review of the features extracted from each document indicates that the terms I am getting are appropriate. Afterwards I use those terms to calculate similarity between documents, and that also seems to work correctly.
To compute the IDF part of the weighting, you need to count, across the whole corpus, how many documents each term occurs in, so fitting a separate vectorizer on each document is incorrect.
Here is a minimal example of how to use it:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["My first document", "My second document"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
X is then a matrix where each row is a document and each column represents a term.
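To inspect the result (a small assumption here: get_feature_names_out requires a reasonably recent scikit-learn; older versions use get_feature_names instead):

import pandas as pd

# One row per document, one column per term from the fitted vocabulary
print(pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out()))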

How to run spaCy's sentence similarity function to an array of strings to get an array of scores?

I have to compare one spaCy document to a list of spaCy documents and want to get a list of similarity scores as output. Of course I can do this with a for loop, but I'm looking for an optimized solution, something like the broadcasting that NumPy offers.
I have one document against a list of documents:
oneDoc = 'Hello, I want to be compared with a list of documents'
listDocs = ["I'm the first one", "I'm the second one"]
spaCy offers us a document similarity function:
oneDoc = nlp(oneDoc)
listDocs = [nlp(doc) for doc in listDocs]
similarity_score = np.zeros(len(listDocs))
for i, doc in enumerate(listDocs):
    similarity_score[i] = oneDoc.similarity(doc)
Since one document is compared with a list of two documents, the similarity score would be like this:
[0.7, 0.8]
I'm looking for a way to avoid this for loop. In other words, I want to vectorize this function.
Use nlp.pipe to process all of your text documents, grab the embedding .vector from each document, and apply a pairwise distance function (e.g. SciPy's cdist) with cosine as the metric to build the score matrix.
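A minimal sketch of that approach, assuming a spaCy model that ships with word vectors (en_core_web_md is used here as an example):

import numpy as np
import spacy
from scipy.spatial.distance import cdist

nlp = spacy.load("en_core_web_md")  # assumed: a model with vectors

oneDoc = "Hello, I want to be compared with a list of documents"
listDocs = ["I'm the first one", "I'm the second one"]

query_vec = nlp(oneDoc).vector.reshape(1, -1)
doc_vecs = np.array([doc.vector for doc in nlp.pipe(listDocs)])

# cdist returns cosine *distances*; similarity = 1 - distance
similarity_scores = 1 - cdist(query_vec, doc_vecs, metric="cosine").ravel()
print(similarity_scores)  # e.g. something like [0.7, 0.8]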

Check the tf-idf scores of sklearn in python

I am following the example here to calculate the TF-IDF values using sklearn.
My code is as follows.
from sklearn.feature_extraction.text import TfidfVectorizer
myvocabulary = ['life', 'learning']
corpus = {1: "The game of life is a game of everlasting learning", 2: "The unexamined life is not worth living", 3: "Never stop learning"}
tfidf = TfidfVectorizer(vocabulary = myvocabulary, ngram_range = (1,3))
tfs = tfidf.fit_transform(corpus.values())
I want to calculate the tf-idf values for the two words life and learning for the 3 documents in corpus.
According to the article I am referring to (see the table below), I should get the following values for my example.
However, the values I get from my code are completely different. Please help me find what is wrong in my code and how to fix it.
The main point is that you should not restrict the vocabulary to just two words ('life', 'learning') before constructing the term frequency matrix. If you do that, all other words will be ignored and it will affect the term frequency counting.
There are also several other steps that need to be taken into account if one wants to get exactly the same numbers as in the example by using sklearn:

- The features in the example are unigrams (single words), so I have set ngram_range=(1,1).
- The example uses a different normalization for the term frequency part than sklearn does (in the example the term counts are normalized by document length, whereas sklearn uses raw term counts by default). Because of this, I have counted and normalized the term frequencies separately before calculating the idf part.
- The normalization used in the example for the idf part is also not sklearn's default. This can be adjusted to match the example by setting smooth_idf to False.
- Sklearn's vectorizers discard single-character words by default, but such words are kept in the example. In the code below, I have modified token_pattern to also allow 1-character words.

The final tfidf matrix is obtained by multiplying the normalized counts by the idf vector.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import normalize
import pandas as pd
corpus = {1: "The game of life is a game of everlasting learning", 2: "The unexamined life is not worth living", 3: "Never stop learning"}
cvect = CountVectorizer(ngram_range=(1,1), token_pattern='(?u)\\b\\w+\\b')
counts = cvect.fit_transform(corpus.values())
normalized_counts = normalize(counts, norm='l1', axis=1)
tfidf = TfidfVectorizer(ngram_range=(1,1), token_pattern='(?u)\\b\\w+\\b', smooth_idf=False)
tfs = tfidf.fit_transform(corpus.values())
new_tfs = normalized_counts.multiply(tfidf.idf_)
feature_names = tfidf.get_feature_names()
corpus_index = [n for n in corpus]
df = pd.DataFrame(new_tfs.T.todense(), index=feature_names, columns=corpus_index)
print(df.loc[['life', 'learning']])
However, in practice such modifications are rarely needed. One usually obtains good results just by using TfidfVectorizer directly.
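For comparison, the weights that TfidfVectorizer produces on its own (with the row-wise l2 normalization it applies by default, rather than the manual length normalization above) can be inspected the same way, reusing tfs, feature_names and corpus_index from the code above:

df_sklearn = pd.DataFrame(tfs.T.todense(), index=feature_names, columns=corpus_index)
print(df_sklearn.loc[['life', 'learning']])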

Using Sklearn's TfidfVectorizer transform

I am trying to get the tf-idf vector for a single document using Sklearn's TfidfVectorizer object. I create a vocabulary based on some training documents and use fit_transform to train the TfidfVectorizer. Then, I want to find the tf-idf vectors for any given testing document.
from sklearn.feature_extraction.text import TfidfVectorizer

self.vocabulary = "a list of words I want to look for in the documents".split()
self.vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word',
                            stop_words='english')
self.vect.fit_transform(self.vocabulary)
...
doc = "some string I want to get tf-idf vector for"
tfidf = self.vect.transform(doc)
The problem is that this returns a matrix with n rows where n is the size of my doc string. I want it to return just a single vector representing the tf-idf for the entire string. How can I make this see the string as a single document, rather than each character being a document? Also, I am very new to text mining so if I am doing something wrong conceptually, that would be great to know. Any help is appreciated.
If you want to compute tf-idf only for a given vocabulary, use the vocabulary argument of the TfidfVectorizer constructor:
vocabulary = "a list of words I want to look for in the documents".split()
vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word',
                       stop_words='english', vocabulary=vocabulary)
Then, to fit (i.e. calculate the counts) on a given corpus, i.e. an iterable of documents, use fit:
vect.fit(corpus)
The fit_transform method is a shorthand for
vect.fit(corpus)
corpus_tf_idf = vect.transform(corpus)
Finally, the transform method accepts a corpus, so for a single document you should pass it inside a list; otherwise the string is treated as an iterable of characters, each character being a document.
doc_tfidf = vect.transform([doc])
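Putting it together, a small end-to-end sketch (the vocabulary, corpus and doc strings here are illustrative placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer

vocabulary = "machine learning text mining".split()       # hypothetical word list
vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word',
                       stop_words='english', vocabulary=vocabulary)

corpus = ["machine learning with text", "mining text documents"]  # hypothetical training docs
vect.fit(corpus)

doc = "a new document about machine learning"
doc_tfidf = vect.transform([doc])   # note the list: one document, one row
print(doc_tfidf.shape)              # (1, 4) -> a single tf-idf vector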
