Training a Machine Learning predictor - python

I have been trying to build a prediction model using a user's data. The model's input is document metadata (date published, title, etc.) and each document's label is that user's preference (like/dislike). I have come across some questions and am hoping for answers:
There are far more liked documents than disliked ones. I read somewhere that training a model with many more inputs of one label than the other hurts performance (the model tends to classify everything as the majority label).
Is it possible for the input to an ML algorithm, e.g. logistic regression, to be a hybrid of numbers and words, and how could that be done? Something like:
input = [18, 23, 1, 0, 'cryptography'] with label = ['Like']
Also can we use a vector ( that represents a word, using tfidf etc) as an input feature (e.g. 50-dimensions vector) ?
In order to construct a prediction model using textual data the only way to do so is by deriving a dictionary out of every word mentioned in our documents and then construct a binary input that will dictate if a term is mentioned or not? Using such a version though we lose the weight of the term in the collection right?
Can we use something as a word2vec vector as a single input in a supervised learning model?
Thank you for your time.

You either need to under-sample the bigger class (take a small random sample to match the size of the smaller class), over-sample the smaller class (bootstrap sample), or use an algorithm that supports unbalanced data - and for that you'll need to read the documentation.
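The two resampling options can be sketched in a few lines of plain Python. This is only an illustration: the function names `undersample` and `oversample` and the toy data are made up here, and in practice you would resample only the training split, never the data you evaluate on.

```python
import random
from collections import Counter


def _group_by_label(rows, labels):
    """Collect rows into one list per label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def undersample(rows, labels, seed=0):
    """Randomly drop rows so every class matches the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(rows, labels)
    target = min(len(group) for group in groups.values())
    pairs = [(row, label)
             for label, group in groups.items()
             for row in rng.sample(group, target)]
    rng.shuffle(pairs)
    return [r for r, _ in pairs], [l for _, l in pairs]


def oversample(rows, labels, seed=0):
    """Bootstrap (sample with replacement) smaller classes up to the largest class."""
    rng = random.Random(seed)
    groups = _group_by_label(rows, labels)
    target = max(len(group) for group in groups.values())
    pairs = []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        pairs.extend((row, label) for row in group + extra)
    rng.shuffle(pairs)
    return [r for r, _ in pairs], [l for _, l in pairs]


# Toy data: 8 liked documents and 2 disliked ones.
rows = [[i] for i in range(10)]
labels = ["like"] * 8 + ["dislike"] * 2

_, under_labels = undersample(rows, labels)
_, over_labels = oversample(rows, labels)
under_counts = Counter(under_labels)
over_counts = Counter(over_labels)
```

Under-sampling throws information away, while over-sampling repeats the minority rows, so which one works better depends on how much data you have.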
You need to turn your words into a document-term matrix (a bag-of-words representation). Columns are all the unique words in your corpus. Rows are the documents. Cell values are one of: whether the word appears in the document, the number of times it appears, the relative frequency of its appearance, or its TF-IDF score. You can then use these columns alongside your other non-word columns.
Now you probably have more columns than rows, meaning you'll get a singularity with matrix-based algorithms, in which case you need something like SVM or Naive Bayes.
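As a rough sketch of what "word columns alongside non-word columns" means, the snippet below builds count columns by hand and appends them to the numeric metadata. The helper names `build_vocabulary` and `to_row` and the three example documents are hypothetical; a real project would more likely use a library vectorizer.

```python
def build_vocabulary(texts):
    """Map every unique lower-cased token to a column index (sorted for stability)."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def to_row(numeric, text, vocab):
    """Concatenate the numeric columns with one count column per vocabulary word."""
    counts = [0] * len(vocab)
    for tok in text.lower().split():
        if tok in vocab:
            counts[vocab[tok]] += 1
    return list(numeric) + counts


# Hypothetical documents: (numeric metadata, title words).
docs = [
    ([18, 23, 1, 0], "cryptography"),
    ([5, 2, 0, 1], "applied cryptography basics"),
    ([7, 9, 1, 1], "cooking basics"),
]

vocab = build_vocabulary(text for _, text in docs)
X = [to_row(numeric, text, vocab) for numeric, text in docs]
```

Each row of `X` now has four numeric columns followed by one column per vocabulary word, which is a shape most classifiers accept.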


Contextual word embeddings from pretrained word2vec vectors

I would like to create word embeddings that take context into account, so the vector of the word Jaguar [animal] would be different from the word Jaguar [car brand].
As you know, word2vec only gives one representation for a given word, and I would like to take already pretrained embeddings and enrich them with context. So far I've tried a simple way with taking an average vector of the word and category word, for example like this.
Now I would like to try to create and train a neural network that would take entire sentences, e.g.
Jaguar F-PACE is a great SUV sports car.
Among cats, only tigers and lions are bigger than jaguars.
And then it would undertake the task of text classification (I have a dataset with several categories like animals, cars, etc.), but the result would be new representations for the word jaguar, but in different contexts, so two different embeddings.
Does anyone have any idea how I could create such a network? I don't hide that I'm a beginner and have no idea how to go about it.
If you've already been able to perform sense-disambiguation outside word2vec, then you can change the word-tokens to reflect your external judgement. For example, change some appearances of the token 'jaguar' to 'jaguar*car' and others to 'jaguar*animal'. Proceeding with normal word2vec training will then get your two different tokens two different word-vectors.
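A minimal sketch of that retokenization step, assuming each sentence already comes with an externally decided sense label. The helper `tag_senses` is hypothetical, and the sentences are lower-cased and simplified so that the target token is always exactly 'jaguar'.

```python
def tag_senses(tokens, target, sense):
    """Replace each occurrence of `target` with 'target*sense'; keep other tokens."""
    return [f"{tok}*{sense}" if tok == target else tok for tok in tokens]


# Each sentence is paired with the sense decided outside word2vec.
labelled_sentences = [
    ("jaguar f-pace is a great suv sports car".split(), "car"),
    ("among cats only tigers and lions are bigger than jaguar".split(), "animal"),
]

tagged_corpus = [tag_senses(tokens, "jaguar", sense)
                 for tokens, sense in labelled_sentences]
```

`tagged_corpus` can then be fed to ordinary word2vec training, which treats 'jaguar*car' and 'jaguar*animal' as unrelated tokens.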
If you're hoping for the training to discover these itself, as ~Erwan mentioned in a comment, that seems like an open research question, without a standard or off-the-shelf solution that a beginner could drop-in.
I'd once seen a paper (around the time of the original word2vec papers, but can't find the link now) that tried to do this in a word2vec-compatible way by first proceeding with traditional polysemy-oblivious training. Then, for every appearance of a word X, model its surrounding context via some combination of the word-vectors of neighbors within a certain number of positions. (That in itself is very similar to the preparation of a context-vector in the CBOW mode of word2vec.) Perform some clustering on that collection-of-all-contexts to come up with some idea of alternate senses – each associated with one cluster. Then, in a followup pass on the original corpus, replace word-tokens with those that also reflect their nearby-context cluster. (E.g., 'jaguar' might be replaced with 'jaguar*1', 'jaguar*2', etc. based on which discrete cluster its context suggested.) Then, repeat (or continue) word2vec training to get sense-specific word-vectors. Of course, the devil would be in the details of how contexts are defined, how clusters are deduced, and tough edge-cases (where potentially the text's author is themselves deploying the multiple senses).
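To make the context-clustering idea concrete, here is a toy sketch. It is not the method from that paper: the 2-D word-vectors are invented by hand instead of trained, the context is a plain average over a fixed window, and the clustering is a bare-bones k-means whose centroids start at the first k points. All function names are hypothetical.

```python
def context_vector(tokens, position, vectors, window=2):
    """Average the vectors of known words within `window` positions of `position`."""
    lo, hi = max(0, position - window), min(len(tokens), position + window + 1)
    neighbours = [vectors[tokens[i]] for i in range(lo, hi)
                  if i != position and tokens[i] in vectors]
    dims = len(next(iter(vectors.values())))
    if not neighbours:
        return [0.0] * dims
    return [sum(v[d] for v in neighbours) / len(neighbours) for d in range(dims)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Tiny k-means; centroids start at the first k points (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hand-made 2-D "word-vectors" (assumption: real ones would come from word2vec).
vectors = {
    "car": [1.0, 0.0], "suv": [0.9, 0.1], "engine": [1.0, 0.1],
    "cats": [0.0, 1.0], "lions": [0.1, 0.9], "jungle": [0.0, 0.9],
}
corpus = [
    ["the", "jaguar", "suv", "engine"],
    ["lions", "and", "jaguar", "cats"],
    ["a", "jaguar", "car"],
    ["jaguar", "in", "jungle"],
]

# 1. Find every appearance of the word and model its context.
occurrences = [(s, i) for s, sent in enumerate(corpus)
               for i, tok in enumerate(sent) if tok == "jaguar"]
points = [context_vector(corpus[s], i, vectors) for s, i in occurrences]

# 2. Cluster the contexts, then 3. rewrite tokens with their cluster id.
clusters = kmeans(points, k=2)
relabelled = [list(sent) for sent in corpus]
for (s, i), c in zip(occurrences, clusters):
    relabelled[s][i] = f"jaguar*{c}"
```

The relabelled corpus would then go through another round of word2vec training; choosing the number of clusters per word is one of the hard details this sketch ignores.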
Some other interesting efforts to model or deduce polysemy in word2vec models:
"Linear Algebraic Structure of Word Meanings"
"A Simple Approach to Learn Polysemous Word Embeddings"
But per above, I've not seen these sorts of techniques widely implemented/adopted in a form that's easy to drop-in to another project.

NLP - which technique to use to classify labels of a paragraph?

I'm fairly new to NLP and trying to learn the techniques that can help me get my job done.
Here is my task: I have to classify stages of a drilling process based on text memos.
I have to classify labels for "Activity", "Activity Detail", "Operation" based on what's written in "Com" column.
I've been reading a lot of articles online, and all the different techniques I've read about really confuse me.
The buzz words that I'm trying to understand are
Skip-gram (prediction based method, Word2Vec)
TF-IDF (frequency based method)
Co-Occurrence Matrix (frequency based method)
I am given about ~40,000 rows of data (pretty small, I know), and I came across an article that says neural-net based models like Skip-gram might not be a good choice if I have small number of training data. So I was also looking into frequency based methods too. Overall, I am unsure which technique is the best for me.
Here's what I understand:
Skip-gram: technique used to represent words in a vector space. But I don't understand what to do next once I vectorized my corpus
TF-IDF: tells how important each word is in each sentence. But I still don't know how it can be applied on my problem
Co-Occurrence Matrix: I don't really understand what it is.
All the three techniques are to numerically represent texts. But I am unsure what step I should take next to actually classify labels.
What approach & sequence of techniques should I use to tackle my problem? If there's any open source Jupyter notebook project, or link to an article (hopefully with codes) that did the similar job done, please share it here.
Let's get things a bit clearer. Your task is to create a system that will predict labels for given texts, right? Label prediction (classification) can't be done directly on unstructured data (texts), so you need to make your data structured, and then train your classifier and run inference with it. Therefore, you need to build two separate components:
Text vectorizer (as you said, it helps to numerically represent texts).
Classifier (to predict the labels for numerically represented texts).
Skip-Gram and co-occurrence matrix are ways to vectorize your texts (here is a nice article that explains their difference). In case of skip-gram you could download and use a 3rd party model that already has mapping of vectors to most of the words; in case of co-occurrence matrix you need to build it on your texts (if you have specific lexis, it will be a better way). In this matrix you could use different measures to represent the degree of co-occurrence of words with words or documents with documents. TF-IDF is one of such measures (that gives a score for every word-document pair); there are a lot of others (PMI, BM25, etc). This article should help to implement classification with co-occurrence matrix on your data. And this one gives an idea how to do the same with Word2Vec.
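To show how the two components fit together, here is a self-contained sketch with a bag-of-words vectorizer and a small multinomial Naive Bayes classifier. The choice of Naive Bayes, the class and function names, and the four example memos are all assumptions made for illustration; any vectorizer and classifier pair follows the same two-step shape.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Text vectorizer: turn a memo into bag-of-words counts."""
    return Counter(text.lower().split())


class NaiveBayes:
    """Classifier: multinomial Naive Bayes with add-one smoothing."""

    def fit(self, vectors, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for vec, label in zip(vectors, labels):
            self.word_counts[label].update(vec)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, vec):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # class prior
            for word, count in vec.items():
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                p = (self.word_counts[label][word] + 1) / (total_words + len(self.vocab))
                score += count * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical memos and labels, standing in for the "Com" and "Activity" columns.
memos = [
    "run casing to depth",
    "circulate mud and condition hole",
    "run casing and cement",
    "circulate mud at bottom",
]
labels = ["casing", "circulation", "casing", "circulation"]

model = NaiveBayes().fit([vectorize(m) for m in memos], labels)
prediction = model.predict(vectorize("run casing string"))
```

With three label columns to predict, one simple option is to train one such classifier per column on the same vectorized memos.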
Hope it helped!

Doc2vec and word2vec with negative sampling

My current doc2vec code is as follows.
from gensim.models import doc2vec
# Train doc2vec model
model = doc2vec.Doc2Vec(docs, size=100, window=300, min_count=1, workers=4, iter=20)
I also have a word2vec code as below.
from gensim.models import word2vec
# Train word2vec model
model = word2vec.Word2Vec(sentences, size=300, sample=1e-3, sg=1, iter=20)
I am interested in using both DM and DBOW in doc2vec AND both Skip-gram and CBOW in word2vec.
In Gensim I found the below mentioned sentence:
"Produce word vectors with deep learning via word2vec’s “skip-gram and CBOW models”, using either hierarchical softmax or negative sampling"
Thus, I am confused about whether to use hierarchical softmax or negative sampling. Please let me know what the differences between these two methods are.
Also, which parameters need to be changed to use hierarchical softmax and/or negative sampling with respect to DM, DBOW, skip-gram and CBOW?
P.s. my application is a recommendation system :)
Skip-gram or CBOW are different ways to choose the input contexts for the neural-network. Skip-gram picks one nearby word, then supplies it as input to try to predict a target word; CBOW averages together a bunch of nearby words, then supplies that average as input to try to predict a target word.
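The difference in how the two modes pick their inputs can be sketched by listing the training examples each one would generate from a short text. This is a simplification with hypothetical helper names; real implementations add details (such as frequent-word subsampling) that are left out here.

```python
def skipgram_examples(tokens, window=2):
    """Skip-gram: one (input_word, target_word) pair per nearby word."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                examples.append((tokens[j], target))
    return examples


def cbow_examples(tokens, window=2):
    """CBOW: one (context_words, target_word) example per position.

    The model would average the vectors of `context_words` into one input.
    """
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, target))
    return examples


tokens = ["the", "quick", "brown", "fox"]
sg_pairs = skipgram_examples(tokens, window=1)
cbow_pairs = cbow_examples(tokens, window=1)
```

Skip-gram produces more, smaller examples from the same text, while CBOW produces one averaged example per target word.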
DBOW is most similar to skip-gram, in that a single paragraph-vector for a whole text is used to predict individual target words, regardless of distance and without any averaging. It can mix well with simultaneous skip-gram training, where in addition to using the single paragraph-vector, individual nearby word-vectors are also used. The gensim option dbow_words=1 will add skip-gram training to a DBOW dm=0 training.
DM is most similar to CBOW: the paragraph-vector is averaged together with a number of surrounding words to try to predict a target word.
So in Word2Vec, you must choose between skip-gram (sg=1) and CBOW (sg=0) – they can't be mixed. In Doc2Vec, you must choose between DBOW (dm=0) and DM (dm=1) - they can't be mixed. But you can, when doing Doc2Vec DBOW, also add skip-gram word-training (with dbow_words=1).
The choice between hierarchical-softmax and negative-sampling is separate and independent of the above choices. It determines how target-word predictions are read from the neural-network.
With negative-sampling, every possible prediction is assigned a single output-node of the network. In order to improve what prediction a particular input context creates, it checks the output-nodes for the 'correct' word (of the current training example excerpt of the corpus), and for N other 'wrong' words (that don't match the current training example). It then nudges the network's internal weights and the input-vectors to make the 'correct' word output node activation a little stronger, and the N 'wrong' word output node activations a little weaker. (This is called a 'sparse' approach, because it avoids having to calculate every output node, which is very expensive in large vocabularies, instead just calculating N+1 nodes and ignoring the rest.)
You could set negative-sampling with 2 negative-examples with the parameter negative=2 (in Word2Vec or Doc2Vec, with any kind of input-context mode). The default, if no negative value is specified, is negative=5, following the default in the original Google word2vec.c code.
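A single negative-sampling update can be sketched as below. This is a toy illustration rather than gensim's actual code: the vectors are tiny hand-written lists, the 'wrong' words are passed in explicitly instead of being drawn at random, and the names `negative_sampling_step`, `input_vec` and `output_vecs` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_step(input_vec, output_vecs, correct, wrong, lr=0.1):
    """Nudge only the N+1 touched output nodes (1 correct + N wrong) and the input."""
    input_grad = [0.0] * len(input_vec)
    for word, label in [(correct, 1.0)] + [(w, 0.0) for w in wrong]:
        out = output_vecs[word]
        # Positive error pushes this node's activation up, negative pushes it down.
        error = label - sigmoid(dot(input_vec, out))
        for d in range(len(input_vec)):
            input_grad[d] += error * out[d]
            out[d] += lr * error * input_vec[d]
    for d in range(len(input_vec)):
        input_vec[d] += lr * input_grad[d]


input_vec = [0.1, -0.2, 0.3]
output_vecs = {
    "cat": [0.2, 0.1, -0.1],
    "car": [0.3, -0.1, 0.2],
    "tree": [-0.2, 0.4, 0.1],
    "sky": [0.5, 0.5, 0.5],   # not sampled in this step, so never touched
}

before = {w: sigmoid(dot(input_vec, v)) for w, v in output_vecs.items()}
sky_before = list(output_vecs["sky"])
negative_sampling_step(input_vec, output_vecs, correct="cat", wrong=["car", "tree"])
after = {w: sigmoid(dot(input_vec, v)) for w, v in output_vecs.items()}
```

Here N is 2, so only three output vectors are read and updated, however large the vocabulary is.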
With hierarchical-softmax, instead of every predictable word having its own output node, some pattern of multiple output-node activations is interpreted to mean specific words. Which nodes should be closer to 1.0 or 0.0 in order to represent a word is a matter of the word's encoding, which is calculated so that common words have short encodings (involving just a few nodes), while rare words will have longer encodings (involving more nodes). Again, this serves to save calculation time: to check if an input-context is driving just the right set of nodes to the right values to predict the 'correct' word (for the current training-example), just a few nodes need to be checked, and nudged, instead of the whole set.
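The "common words get short encodings" property comes from a frequency-based binary code. The sketch below builds a Huffman code from made-up word counts to show the effect; the function name `huffman_codes` and the frequencies are assumptions, and this only illustrates the encoding, not the network training itself.

```python
import heapq
from itertools import count


def huffman_codes(frequencies):
    """Return {word: bit-string}; frequent words never get longer codes than rare ones."""
    tiebreak = count()  # unique counter so the heap never compares tree payloads
    heap = [(freq, next(tiebreak), word) for word, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: a word
            codes[node] = prefix or "0"

    walk(heap[0][2], "")
    return codes


# Made-up corpus counts.
frequencies = {"the": 50, "cat": 20, "sat": 15, "on": 10, "zygote": 1}
codes = huffman_codes(frequencies)
code_lengths = {word: len(code) for word, code in codes.items()}
```

Each bit of a word's code corresponds to one node that has to be checked and nudged, which is why a frequent word like 'the' is cheaper to train on than a rare one.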
You enable hierarchical-softmax in gensim with the argument hs=1. By default, it is not used.
You should generally disable negative-sampling, by supplying negative=0, if enabling hierarchical-softmax – typically one or the other will perform better for a given amount of CPU-time/RAM.
(However, following the architecture of the original Google word2vec.c code, it is possible but not recommended to have them both active at once, for example negative=5, hs=1. This will result in a larger, slower model, which might appear to perform better since you're giving it more RAM/time to train, but it's likely that giving equivalent RAM/time to just one or the other would be better.)
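Putting the flags mentioned above side by side, the input-context choice and the output-layer choice can be combined freely. The dictionaries and the helper `training_kwargs` below are hypothetical conveniences; only the flag names (sg, dm, dbow_words, hs, negative) and the defaults quoted above come from this answer.

```python
# Input-context mode: how training examples are built.
MODE_FLAGS = {
    "skip-gram": {"sg": 1},                      # Word2Vec
    "cbow": {"sg": 0},                           # Word2Vec
    "dbow": {"dm": 0},                           # Doc2Vec
    "dbow+words": {"dm": 0, "dbow_words": 1},    # Doc2Vec DBOW plus skip-gram words
    "dm": {"dm": 1},                             # Doc2Vec
}

# Output layer: how target-word predictions are read from the network.
OUTPUT_FLAGS = {
    "negative-sampling": {"hs": 0, "negative": 5},
    "hierarchical-softmax": {"hs": 1, "negative": 0},
}


def training_kwargs(mode, output):
    """Merge one input-context choice with one output-layer choice."""
    return {**MODE_FLAGS[mode], **OUTPUT_FLAGS[output]}


# These would be passed as keyword arguments, for example:
#   Word2Vec(sentences, **training_kwargs("skip-gram", "negative-sampling"))
#   Doc2Vec(docs, **training_kwargs("dbow+words", "hierarchical-softmax"))
w2v_kwargs = training_kwargs("skip-gram", "negative-sampling")
d2v_kwargs = training_kwargs("dbow+words", "hierarchical-softmax")
```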
Hierarchical-softmax tends to get slower with larger vocabularies (because the average number of nodes involved in each training-example grows); negative-sampling does not (because it's always N+1 nodes). Projects with larger corpuses tend to trend towards preferring negative-sampling.

Doc2Vec: Differentiate Sentence and Document

I am just playing around with Doc2Vec from gensim, analysing a Stack Exchange dump to measure the semantic similarity of questions and identify duplicates.
The tutorial on Doc2Vec-Tutorial seems to describe the input as tagged sentences.
But the original paper: Doc2Vec-Paper claims that the method can be used to infer fixed length vectors of paragraphs/documents.
Can someone explain the difference between a sentence and a document in this context, and how I would go about inferring paragraph vectors?
Since a question can sometimes span multiple sentences,
I thought that during training I would give sentences arising from the same question the same tags, but then how would I use infer_vector on unseen questions?
And this notebook : Doc2Vec-Notebook
seems to be training vectors on TRAIN and TEST docs. Can someone explain the rationale behind this, and should I do the same?
Gensim's Doc2Vec expects you to provide text examples of the same object-shape as the example TaggedDocument class: having both a words and a tags property.
The words are an ordered sequence of string tokens of the text – they might be a single sentence worth, or a paragraph, or a long document, it's up to you.
The tags are a list of tags to be learned from the text – such as plain ints, or string-tokens, that somehow serve to name the corresponding texts. In the original 'Paragraph Vectors' paper, they were just unique IDs for each text – such as integers monotonically increasing from 0. (So the first TaggedDocument might have a tags of just [0], the next [1], etc.)
The algorithm just works on chunks of text, without any idea of what a sentence/paragraph/document etc might be. (Just consider them all 'documents' for the purpose of Doc2Vec, with you deciding what's the right kind of 'document' from your corpus.) It's even common for the tokenization to retain punctuation, such as the periods between sentences, as standalone tokens.
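As a sketch of that object-shape, the snippet below prepares whole questions as 'documents' with integer tags. To stay runnable without gensim it defines a stand-in namedtuple with the same two fields, words and tags; with gensim installed you would import the real TaggedDocument instead. The `tokenize` helper and the example questions are made up.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument (assumption:
# used here only so the sketch runs without gensim installed).
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def tokenize(text):
    """Naive tokenizer that keeps punctuation as standalone tokens."""
    for mark in ".?!,":
        text = text.replace(mark, f" {mark} ")
    return text.lower().split()


# Each multi-sentence question is treated as one 'document'.
questions = [
    "How do I reverse a list? I tried slicing.",
    "What is a paragraph vector?",
]

corpus = [TaggedDocument(words=tokenize(q), tags=[i])
          for i, q in enumerate(questions)]
```

Because each question is a single document with its own tag, there is no need to split it into sentences that share a tag.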
Inference occurs via the infer_vector() method, which takes a mandatory parameter doc_words, which should be a list-of-string-tokens just like those that were supplied as text words during training.
You don't supply any tags on inferred text: Doc2Vec just gives you back a raw vector that, within the relationships learned by the model, fits the text well. (That is: the vector is good at predicting the text's words, in the same way that the vectors and internal model weights learned during bulk training were good at predicting the training texts' words.)
Note that many have found better results from inference by increasing the optional steps parameter (and possibly decreasing the inference starting alpha to be more like the bulk-training starting alpha, 0.025 to 0.05).
The doc2vec-IMDB demo notebook tries to reproduce one of the experiments from the original Paragraph Vectors paper, so it's following what's described there, and a demo script that one of the authors (Mikolov) once released. Since 'test' documents (without their target-labels/known-sentiments) may still be available, at training time, to help improve the text-modelling, it can be reasonable to include their raw texts during the unsupervised Doc2Vec training. (Their known-labels are not used when training the classifier which uses the doc-vectors.)
(Note that at the moment, February 2017, the doc2vec-IMDB demo notebook is a little out-of-date compared to the current gensim Doc2Vec defaults & best-practices – in particular the models aren't given the right explicit iter=1 value to make the later manual loop-and-train() do just the right number of training passes.)

Review spam detection using SVM

I have a dataset of reviews from various e-commerce sites.
My task is to classify them into spam or not using SVM in Python.
How should I convert the text dataset into SVM features? Are there other features that need to be considered, and if so, how do I convert them into SVM feature vectors?
Is there any sample code or tutorial available to do this task? I need to implement this task, so please guide me on this.
A classic way of converting text input to input you can provide to a machine learning algorithm like SVM:
Divide your text into a list of tokens (for instance each word, each group of 2 words, etc.)
Represent the number of occurrences of your tokens according to a given model. For instance, TF-IDF is a model that weighs each token according to its frequency across the whole corpus of documents.
Each document is therefore represented by a vector where each component corresponds to one word of your corpus vocabulary, and the associated weight is a statistical indicator of that word relative to the document considered.
See scikit-learn for more information about it, and an implementation of the most classic methods for representing a text as a valid input for machine learning algorithms.
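The steps above can be sketched in plain Python as follows. This uses the textbook weighting (term count times log of N over document frequency) with word unigrams and bigrams; library implementations such as scikit-learn's apply smoothing and normalization by default, so their numbers will differ. The function names and the three example reviews are made up.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Split text into word unigrams and bigrams (up to n_max words per token)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]


def tfidf_vectors(texts):
    """One vector per document; component = term count * log(N / document frequency)."""
    docs = [Counter(ngrams(t)) for t in texts]
    vocab = sorted({tok for d in docs for tok in d})
    # Iterating a Counter yields each distinct token once, so this counts documents.
    df = Counter(tok for d in docs for tok in d)
    n = len(docs)
    vectors = [[d[tok] * math.log(n / df[tok]) for tok in vocab] for d in docs]
    return vocab, vectors


# Hypothetical reviews.
reviews = [
    "great product great price",
    "buy now great deal",
    "buy now buy cheap",
]
vocab, X = tfidf_vectors(reviews)
great_weight = X[0][vocab.index("great")]
```

Each row of `X` is a fixed-length feature vector that can be passed, together with the spam/not-spam labels, to an SVM implementation.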

