Creating a vector space - Python

I've got a question:
I have a lot of documents, and each line is built from some pattern.
I also have the array of these patterns.
I want to create a vector space and then vectorize the patterns by some rule (I have no idea yet what this rule should be), i.e. make these patterns the "centroids" of my vector space.
Then I want to vectorize each line of the current document (by the same rule) and find the closest centroid to that line (i.e. the minimum distance between the two vectors).
I don't know how to do this.
I know about sklearn and CountVectorizer/TfidfVectorizer/HashingVectorizer, but these depend on the vocabulary size. And again, I have a lot of documents, so the vocabulary would become too large, and the next new document could contain a word that the vocabulary doesn't have. That's why this seems like the wrong way to solve my problem.
The Keras Text Preprocessing utilities won't solve my problem either. For example, "one_hot" encodes a text into a list of word indexes of size n. But each document may have a different length and, of course, a different word order, so comparing two such vectors may give a large distance even though the texts they encode are actually very similar.
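For reference, a minimal sketch of the closest-centroid step described above, using TfidfVectorizer purely as a stand-in for whatever vectorization rule is eventually chosen (the patterns and lines are made up):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

patterns = ["error in module %s", "user %s logged in", "connection to %s lost"]
lines = ["user alice logged in", "connection to db01 lost"]

# Stand-in vectorization rule: TF-IDF fitted on the patterns only.
vectorizer = TfidfVectorizer()
centroids = vectorizer.fit_transform(patterns)   # one "centroid" per pattern
line_vectors = vectorizer.transform(lines)       # lines mapped into the same space

# For each line, pick the pattern whose vector is closest (cosine distance).
distances = cosine_distances(line_vectors, centroids)
closest = np.argmin(distances, axis=1)
for line, idx in zip(lines, closest):
    print(line, "->", patterns[idx])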

Related

How to create word embeddings using Word2Vec in Python?

I have seen many tutorials online on how to use Word2Vec (gensim).
Most tutorials show how to find the .most_similar word or the similarity between two words.
But what if I have text data X and I want to produce its word embedding vector X_vector?
So that this X_vector can be used for classification algorithms?
If X is a word (string token), you can look up its vector with word_model[X].
If X is a text - say, a list-of-words – well, a Word2Vec model only has vectors for words, not texts.
If you have some desired way to use a list-of-words plus per-word-vectors to create a text-vector, you should apply that yourself. There are many potential approaches, some simple, some complicated, but no one 'official' or 'best' way.
One easy popular baseline (a fair starting point especially on very small texts like titles) is to average together all the word vectors. That can be as simple as (assuming numpy is imported as np):
np.mean([word_model[word] for word in word_list], axis=0)
But, recent versions of Gensim also have a convenience .get_mean_vector() method for averaging together sets of vectors (specified as their word-keys, or raw vectors), with some other options:
https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.get_mean_vector
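For instance, a minimal sketch of that method, assuming gensim >= 4.2 (where get_mean_vector is available) and a KeyedVectors instance named word_model as above:

# Average the vectors of the given word keys; raw vectors are also accepted.
text_vector = word_model.get_mean_vector(
    ['machine', 'learning'],
    ignore_missing=True,   # skip words that are not in the vocabulary
)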

Construct a suffix tree of a concatenation of a million words and query it with a test set to find the closest match and classify

The problem I'm trying to solve: I have a million words (in multiple languages), each labelled with the class it belongs to, as my training corpus. Given a testing corpus of words (which is bound to grow over time), I want to find the closest match of each of those words in the training corpus and classify the word with the class of its closest match.
My solution: Initially I did this by brute force, which doesn't scale. Now I'm thinking of building a suffix tree over the concatenation of the training corpus (O(n)) and querying it with the testing corpus (roughly constant time per query). I'm trying to do this in Python.
I'm looking for tools or packages that get me started or for other more efficient ways to solve the problem at hand. Thanks in advance.
Edit 1: As for how I find the closest match, I was thinking of a combination of exact-match alignment (from the suffix tree) and then, for the part of the input string that is left over, a local alignment with an affine gap penalty.
What distance metric are you using for the closest match?
There are papers that cover how to do an edit-distance search using a suffix tree. For each suffix there is an extension of the edit matrix, and these can be ordered so as to let one do a ranked search of the suffix tree, finding the matching items in order of increasing distance.
An example for this is Top-k String Similarity Search with Edit-Distance Constraints (2013) https://doi.org/10.1109/ICDE.2013.6544886 https://scholar.google.com/scholar?cluster=13387662751776693983
The solution presented avoids computing all the entries in the table as columns are added.
In your problem it seems that each word has classes that apply to it. If the classes don't depend on context, then the above would work and a word-to-class map would be all that is needed. But if they do depend on context, then this seems closer to part-of-speech tagging.
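As a point of comparison, a minimal brute-force stand-in for the closest-match-then-classify step, using only Python's standard difflib rather than the ranked suffix-tree search from the paper (the training data here is a made-up word-to-class map):

import difflib

# Hypothetical labelled training data: word -> class
word_to_class = {"apple": "fruit", "apfel": "fruit", "tiger": "animal"}
training_words = list(word_to_class)

def classify(word):
    # difflib ranks candidates by a ratio-based similarity, not edit distance,
    # and scans every training word, so this does not scale like a suffix tree.
    matches = difflib.get_close_matches(word, training_words, n=1, cutoff=0.0)
    return word_to_class[matches[0]] if matches else None

print(classify("appel"))  # -> 'fruit'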

Filtering Word Embeddings from word2vec

I have downloaded Google's pretrained word embeddings as a binary file here (GoogleNews-vectors-negative300.bin.gz). I want to be able to filter the embedding based on some vocabulary.
I first tried loading the bin file as a KeyedVector object, and then creating a dictionary that uses its vocabulary along with another vocabulary as a filter. However, it takes a long time.
# X is the vocabulary we are interested in
embeddings = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True)
embeddings_filtered = dict((k, embeddings[k]) for k in X if k in list(embeddings.wv.vocab.keys()))
It takes a very long time to run. I am not sure if this is the most efficient solution. Should I filter it out in the load_word2vec_format step first?
Your dict won't have all the features of a KeyedVectors object, and it won't be stored as compactly. The KeyedVectors stores all vectors in a large contiguous native 2D array, with a dict indicating the row for each word's vector. Your second dict, with a separate vector for each word, will involve more overhead. (Further, since the vectors you get back from embeddings[k] are "views" into the full array, your subset may actually indirectly retain the larger array, even after you try to discard the KeyedVectors.)
Since it's likely that a reason you only want a subset of the original vectors is that the original set was too large, having a dict that takes as much or more memory probably isn't ideal.
You should consider two options:
load_word2vec_format() includes an optional limit parameter that only loads the first N words from the supplied file. As such files are typically sorted from most-frequent to least-frequent words, and the less-frequent words are both far less useful and of lower vector quality, it is often practical to just use the first 1 million, 500,000, or 100,000, etc. entries for a large memory & speed savings.
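For example (a sketch keeping only the 500,000 most frequent entries; limit is an actual parameter of load_word2vec_format):

from gensim.models import KeyedVectors

# Load only the first 500,000 (most frequent) vectors to cut memory and load time.
embeddings = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True, limit=500000)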
You could try filtering on load. You'd need to adapt the loading code to do this. Fortunately you can review the full source code for load_word2vec_format() (it's just a few dozen lines) inside your local gensim installation, or online at the project's source code hosting at:
https://github.com/RaRe-Technologies/gensim/blob/9c5215afe3bc4edba7dde565b6f2db982bba5113/gensim/models/utils_any2vec.py#L123
You'd write your own version of this routine that skips words not of interest. (It might have to do two passes over the file, one to count the words of interest, then a second to actually allocate the right-sized in-memory arrays and do the real reading.)
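As an illustration only, here is a minimal single-pass sketch of such a filtering loader. It reads the binary word2vec layout directly (a header line "vocab_size vector_size", then each word followed by a space and vector_size float32 values), collects the matches into plain Python/NumPy structures rather than a KeyedVectors, and favours clarity over speed:

import gzip
import numpy as np

def load_filtered_word2vec(path, wanted):
    # Load only the vectors for words in `wanted` from a gzipped binary
    # word2vec file, assuming the standard layout described above.
    wanted = set(wanted)
    words, vectors = [], []
    with gzip.open(path, 'rb') as f:
        header = f.readline().split()
        vocab_size, vector_size = int(header[0]), int(header[1])
        bytes_per_vector = 4 * vector_size  # float32
        for _ in range(vocab_size):
            # Read the word one byte at a time up to the separating space,
            # skipping any newline that precedes an entry.
            chars = []
            while True:
                ch = f.read(1)
                if ch == b' ':
                    break
                if ch != b'\n':
                    chars.append(ch)
            word = b''.join(chars).decode('utf-8', errors='ignore')
            vec_bytes = f.read(bytes_per_vector)
            if word in wanted:
                words.append(word)
                vectors.append(np.frombuffer(vec_bytes, dtype=np.float32))
    matrix = np.vstack(vectors) if vectors else np.empty((0, vector_size), dtype=np.float32)
    return words, matrix

# Usage (X is the vocabulary of interest, as in the question):
# kept_words, kept_vectors = load_filtered_word2vec(
#     'GoogleNews-vectors-negative300.bin.gz', X)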

Need of context while using Word2Vec

I have a large number of strings in a list:
A small example of the list contents is:
["machine learning","Apple","Finance","AI","Funding"]
I wish to convert these into vectors and use them for clustering purposes.
Is the context of these strings in the sentences considered while finding out their respective vectors?
How should I go about getting the vectors of these strings if I have just this list containing the strings?
This is the code I have so far:
from gensim.models import Word2Vec
vec = Word2Vec(mylist)
P.S. Also, can I get a good reference/tutorial on Word2Vec?
To find word vectors using word2vec you need a list of sentences, not a list of strings.
What word2vec does is go through every word in a sentence and, for each word, try to predict the words around it in a specified window (usually around 5), adjusting the vector associated with that word so that the error is minimized.
Obviously, this means that the order of words matters when finding word vectors. If you just supply a list of strings without a meaningful order, you will not get a good embedding.
I'm not sure, but I think you will find LDA better suited in this case, because your list of strings doesn't have an inherent order.
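To illustrate the point about sentences, here is a minimal sketch of feeding gensim's Word2Vec tokenized sentences (lists of words) rather than a flat list of strings; gensim 4.x parameter names are assumed and the sentences are made up:

from gensim.models import Word2Vec

# Word2Vec expects an iterable of tokenized sentences, not raw strings.
sentences = [
    ["apple", "raised", "new", "funding", "for", "machine", "learning"],
    ["the", "finance", "team", "evaluated", "the", "ai", "startup"],
]
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, epochs=50)

print(model.wv["funding"])           # the learned vector for one word
print(model.wv.most_similar("ai"))   # nearest neighbours in the vector space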
Answers to your 2 questions:
Is the context of these strings in the sentences considered while finding out their respective vectors?
Yes, word2vec creates one vector per word (or per string, since it can treat a multiword expression as a unique word, e.g. New York); this vector describes the word by its context. It assumes that similar words appear in similar contexts. The context is composed of the surrounding words (in a window, with a bag-of-words or skip-gram assumption).
How should I go about getting the vectors of these strings if I have just this list containing the strings?
You need more words. Word2Vec output quality depends on the size of the training set; training Word2Vec on your data alone makes no sense.
The links provided by @Beta are a good introduction/explanation.
Word2Vec is an artificial neural network method. It creates embeddings that reflect the relationships among the words. The links below will help you get the complete code to implement Word2Vec.
Some good links are this and this. For the second link, try the author's GitHub repo for the detailed code; the blog post explains only the major parts. The main article is this.
You can use the following code to convert words to their corresponding numerical values.
from collections import Counter

# `words` is assumed to be a flat list of tokens
word_counts = Counter(words)
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}
word2vec + context = doc2vec
Build sentences from text you have and tag them with labels.
Train doc2vec on tagged sentences to get vectors for each label embedded in the same space as words.
Then you can do vector inference and get labels for an arbitrary piece of text.
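A minimal sketch of that flow with gensim's Doc2Vec (gensim 4.x API assumed; the sentences and tags are made up):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tag each tokenized sentence with a label (an index or a topic label).
tagged = [
    TaggedDocument(words=["apple", "raised", "new", "funding"], tags=["Funding"]),
    TaggedDocument(words=["advances", "in", "machine", "learning"], tags=["AI"]),
]
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=100)

# Infer a vector for an unseen piece of text and find the closest tag.
vec = model.infer_vector(["new", "ai", "funding", "round"])
print(model.dv.most_similar([vec], topn=1))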

Detecting duplicates in text files

I am trying to find the best way to detect/remove duplicates in text data. By duplicates I mean texts that are highly similar, for example identical except for one sentence. Furthermore, the length can vary (by one or two sentences), so Hamming distance is not an option. Is there a way to compute a similarity factor? Should I use term-frequency matrices?
About my data: it is in a JSON file with date, title and body (content), so the similarity coefficient could take these three fields into account.
Since I am looking for the approach (not the code) I do not think presenting the data is necessary.
Kind regards,
You can use the TF-IDF ranking method. Look here for more details: Similarity between two text documents
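A minimal sketch of that idea with scikit-learn, flagging pairs of texts whose TF-IDF cosine similarity exceeds a threshold (the 0.8 threshold and the texts are arbitrary illustrations):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "The quick brown fox jumps over the lazy dog. It was a sunny day.",
    "The quick brown fox jumps over the lazy dog. It rained all night.",
    "A completely unrelated piece of text about finance.",
]

tfidf = TfidfVectorizer().fit_transform(texts)
sims = cosine_similarity(tfidf)

# Report pairs above the (arbitrary) similarity threshold as near-duplicates.
threshold = 0.8
for i in range(len(texts)):
    for j in range(i + 1, len(texts)):
        if sims[i, j] >= threshold:
            print(f"texts {i} and {j} look like near-duplicates ({sims[i, j]:.2f})")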
