I have a dataset of reviews from various e-commerce sites.
My task is to classify them into spam or not using SVM in Python.
How should I convert a text dataset into SVM features? Are there other features that need to be considered and, if so, how do I convert them into SVM feature vectors?
Is there any sample code or tutorial available to do this task? I need to implement this task, so please guide me on this.
A classic way of converting text input to input you can provide to a machine learning algorithm like SVM:
Divide your text into a list of tokens (for instance each word, each group of 2 words, etc.)
Represent the number of occurrences of your tokens according to a given model. For instance, TF-IDF is a model that weights each token by its frequency in the document, discounted by how common the token is across the whole corpus of documents.
Each document is therefore represented by a vector in which each component corresponds to one word of your corpus vocabulary, and the associated weight is a statistical indicator of that word's importance relative to the document considered.
See scikit-learn http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction for more information about it, and an implementation of the most classic methods for representing a text as a valid input for machine learning algorithms.
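The steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not a tuned spam filter: the reviews, labels and variable names are invented, and the choice of unigrams plus bigrams with LinearSVC is just one reasonable assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data, invented for illustration: 1 = spam, 0 = not spam.
reviews = [
    "Great phone, battery lasts two days",
    "Click here to win a free gift card now",
    "The shoes fit well and arrived on time",
    "Free money!!! Visit my site for cheap pills",
    "Decent laptop but the screen is a bit dim",
    "Win cash now, click this link for a free prize",
]
labels = [0, 1, 0, 1, 0, 1]

# Tokenize into words and word pairs, weight with TF-IDF, then fit a linear SVM.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
    LinearSVC(),
)
model.fit(reviews, labels)

predictions = model.predict(["Click now to win a free prize", "The battery is decent"])
print(predictions)
```

With real data you would hold out a test set and evaluate before trusting the predictions.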
I'm looking to find a way to extract "important phrases" from text documents. Was hoping to do this using Spacy, but there is one caveat: my data contains mostly product information and therefore the important phrases are different from what they would be in natural spoken language. For this reason, I would like to train spacy on my own corpus, but the only info I can find is for training spacy using labeled data.
Does anyone know if what I want to do is possible?
If you are looking for a scheme to weight phrases according to "Importance" without any labeled data, you can try using TF-IDF.
For this answer, I will refer to terms - these can be phrases or words. It just represents a single entity of text.
A Brief Look at TF-IDF
TF-IDF stands for (Term Frequency) x (Inverse Document Frequency).
It is a measure of how often a term appears in a single document vs. how often that term appears across the entire corpus of documents.
It is commonly used as a statistical measure to determine how important terms are in a corpus.
For a longer, but readable explanation of it, check out the wiki: https://en.wikipedia.org/wiki/Tf%E2%80%93idf.
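To make the definition concrete, here is a small hand-rolled version using the textbook form tf * log(N / df). It is only a sketch: the function name and the toy documents are made up, and scikit-learn uses a smoothed, normalized variant, so its numbers will differ.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    ["usb", "cable", "fast", "charging"],
    ["usb", "hub", "four", "ports"],
    ["usb", "cable", "braided"],
]
scores = tf_idf(docs)
print(scores[0])
```

A term that occurs in every document ("usb" here) gets weight 0, while a term unique to one document gets the highest weight in that document.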
Code Implementation
Check out Scikit-Learn's TfidfVectorizer.
This has a fit_transform function that takes raw text as input and outputs the appropriate TF-IDF weights for words and/or n-grams.
If you prefer to do your own tokenization with spaCy, or only include doc.noun_chunks and doc.ents that satisfy len(span) >= 2 (i.e. phrases), there is a little hack for the TfidfVectorizer.
To use your own tokenization, do the following:
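from sklearn.feature_extraction.text import TfidfVectorizer
# Identity analyzer: each document is already a list of tokens.
dummy = lambda x: x
vectorizer = TfidfVectorizer(analyzer=dummy)
tfidf = vectorizer.fit_transform(list_of_tokenized_docs)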
This overrides the default tokenization and lets you use your own list of tokens.
From there you can find the terms that have the highest average TF-IDF score across all documents, and consider those as Important. You can try using those as input to the PhraseMatcher: https://spacy.io/usage/rule-based-matching#phrasematcher.
Or you can find some way to use these to automatically label documents. If you can locate them in your documents after determining they are important, you can then add an appropriate label and use that as training data to some training pipeline.
If you want exact phrases to be recognised, you can compile a list of those phrases and use spaCy's PhraseMatcher to find them in new text.
https://spacy.io/usage/rule-based-matching#phrasematcher
The only caveat is that it will recognise only the exact phrases supplied to it. This is in contrast to NER, which can recognise additional phrases based on the training provided; PhraseMatcher will only recognise the ones you give it.
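A minimal PhraseMatcher sketch, assuming spaCy v3 (the add call takes a list of patterns) and a blank English pipeline so that no model download is needed. The phrase list, the label name and the sentence are invented.

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank pipeline only tokenizes, which is all PhraseMatcher needs here.
nlp = spacy.blank("en")

# attr="LOWER" makes the match case-insensitive.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
phrases = ["stainless steel", "battery life"]
matcher.add("IMPORTANT_PHRASE", [nlp.make_doc(p) for p in phrases])

doc = nlp("The Stainless Steel body is nice, but battery life is short.")
matches = [doc[start:end].text for _, start, end in matcher(doc)]
print(matches)
```

Anything not in the phrase list, such as "steel body", is ignored, which is the limitation described above.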
I am trying to see what pre-trained model has included common phrases in news and I thought GoogleNews-vectors-negative300.bin should be a comprehensive one but it turned out that it does not even include deep_learning, machine_learning, social_network, social_responsibility. What pre-trained model could include those words that often occur in news, public reports?
import gensim
# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
model.similarity('deep_learning', 'machine_learning')
These are MWEs (multi-word expressions) that are unlikely to be included. You could, in theory, model them by averaging the vectors of the words that make up the MWE.
The trade-offs between the different ways of combining vectors, and the results they give, are discussed in: word2vec - what is best? add, concatenate or average word vectors?
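One way to sketch the averaging fallback: use the stored vector if the phrase is in the vocabulary, otherwise average its parts. The helper name phrase_vector and the toy 3-dimensional vectors are made up; the function only relies on `in` and `[]`, which a plain dict and gensim's KeyedVectors both support.

```python
import numpy as np

def phrase_vector(vectors, phrase, sep="_"):
    """Return the stored vector for `phrase`, or the mean of its word vectors."""
    if phrase in vectors:
        return np.asarray(vectors[phrase])
    parts = [w for w in phrase.split(sep) if w in vectors]
    if not parts:
        raise KeyError(f"no vectors found for any word in {phrase!r}")
    return np.mean([vectors[w] for w in parts], axis=0)

# Toy 3-dimensional vectors standing in for a real model.
toy_vectors = {
    "deep": np.array([1.0, 0.0, 2.0]),
    "learning": np.array([3.0, 2.0, 0.0]),
}
vec = phrase_vector(toy_vectors, "deep_learning")
print(vec)  # [2. 1. 1.]
```

An averaged vector is only an approximation: it cannot capture a meaning the phrase has beyond its individual words.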
The GoogleNews vectors were trained by Google, circa 2012-2013, on a large internal corpus of news articles.
Further, the promotion of individual words into multiword phrases seems to have been done using a purely-statistical co-occurrence analysis (similar to that implemented by gensim Phrases class) - so often won't match human-level perception of entities/concepts, missing some word combos, over-combining others.
So, concepts that were obscure (or not even yet coined!) then, or seldom covered in news articles, will be missing or underrepresented.
Training your own vectors on text from your own domain of interest is often best, for both coverage, and to ensure the vectors reflect word/phrase senses that are dominant in your texts – not general news or reference materials.
Background: I have been evaluating a variety of text classification methods on my dataset, including using feature vectors derived from word counts and TF-IDF, and then running these through various classifiers. My dataset is very small (about 2300 sentences and about 5 classes), and since the above approaches treat different words as completely separate, I would like to use a word vector approach to classification. I have used pretrained word vectors with a shallow NN with little success.
Problem: I am looking for an alternative method of using word vectors to classify my sentences and have thought of taking the word vectors for a sentence, combining them into a single vector, then taking the centroid of each class of sentence vectors - classification would then happen via a distance measure between a new sentence and the centroid.
How can I combine the word vectors into a "sentence vector" given my small dataset?
A great feature of word2vec vectors is that you can perform simple operations on them. One common way of getting from words to sentences is to simply take the average of the word vectors for all words in your sentence.
Since your sample data is small, I'd use a pretrained embedding from Gensim Data, retrain it on your own sample, and finally use a simpler classifier like logistic regression.
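The averaging above, combined with the class-centroid idea from the question, might look like the sketch below. All function names and the 2-dimensional toy vectors are invented. In practice the lookup would be a pretrained model, and cosine similarity is only one possible distance measure.

```python
import numpy as np

def sentence_vector(tokens, vectors, dim):
    """Average the vectors of the known tokens; zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def fit_centroids(sentences, labels, vectors, dim):
    """Return {label: mean sentence vector of that class}."""
    grouped = {}
    for tokens, label in zip(sentences, labels):
        grouped.setdefault(label, []).append(sentence_vector(tokens, vectors, dim))
    return {label: np.mean(vecs, axis=0) for label, vecs in grouped.items()}

def predict(tokens, centroids, vectors, dim):
    """Pick the class whose centroid is most similar to the sentence vector."""
    vec = sentence_vector(tokens, vectors, dim)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

toy_vectors = {
    "refund": np.array([1.0, 0.0]), "money": np.array([0.9, 0.1]),
    "login": np.array([0.0, 1.0]), "password": np.array([0.1, 0.9]),
}
train = [["refund", "money"], ["login", "password"]]
centroids = fit_centroids(train, ["billing", "account"], toy_vectors, dim=2)
label = predict(["password", "reset"], centroids, toy_vectors, dim=2)
print(label)  # account
```

Unknown words ("reset" here) are simply skipped, so a sentence made up entirely of out-of-vocabulary words falls back to the zero vector.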
To Nathan's point if you want to classify documents, Doc2Vec is a great extension of Word2Vec which reduces a lot of the steps. With a few iterations, you can actually achieve really good results. Here is a great implementation of Doc2Vec.
Basically you need to know where to split your sentences first, then you can use a doc2vec model for those sentences.
https://radimrehurek.com/gensim/models/doc2vec.html
Determine where your sentence boundaries are
Model sentence splitting
Train Doc2Vec model on sentences
Input sentence vectors to NN model
I've done this with limited success. Your corpus is small, but you can always try it out and then test/validate/evaluate!
Good luck
I would use gensim's implementation of Paragraph Vector, Doc2Vec, for this. I've just written an article describing how to use it to classify movie reviews, which might help you!
I have been trying to build a prediction model using a user’s data. Model’s input is documents’ metadata (date published, title etc) and document label is that user’s preference (like/dislike). I would like to ask some questions that I have come across hoping for some answers:
There are way more liked documents than disliked ones. I read somewhere that if somebody trains a model using far more inputs of one label than the other, performance suffers (the model tends to classify everything as the majority label).
Is it possible for the input to an ML algorithm, e.g. logistic regression, to be a hybrid of numbers and words, and how could that be done? Something like:
input = [18, 23, 1, 0, 'cryptography'] with label = ['Like']
Also, can we use a vector that represents a word (using TF-IDF etc.) as an input feature (e.g. a 50-dimensional vector)?
To construct a prediction model from textual data, is the only way to derive a dictionary of every word mentioned in our documents and then construct a binary input that indicates whether each term is mentioned or not? With such a representation, though, we lose the weight of the term in the collection, right?
Can we use something as a word2vec vector as a single input in a supervised learning model?
Thank you for your time.
You either need to under-sample the bigger class (take a small random sample to match the size of the smaller class), over-sample the smaller class (bootstrap sample), or use an algorithm that supports unbalanced data - and for that you'll need to read the documentation.
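As a sketch of the over-sampling option, the helper below (its name and the toy rows are invented) resamples the smaller classes with replacement until every class has as many rows as the largest one. For the third option, several scikit-learn estimators, including LogisticRegression and SVC, accept class_weight="balanced".

```python
import random
from collections import Counter

def oversample(rows, labels, seed=0):
    """Resample minority classes with replacement until all classes are equal in size."""
    rng = random.Random(seed)
    by_label = {}
    for row, label in zip(rows, labels):
        by_label.setdefault(label, []).append(row)
    target = max(len(group) for group in by_label.values())
    out_rows, out_labels = [], []
    for label, group in by_label.items():
        # Draw the missing number of rows at random from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

rows = [[18, 23], [20, 5], [7, 7], [3, 9], [11, 2]]
labels = ["Like", "Like", "Like", "Like", "Dislike"]
balanced_rows, balanced_labels = oversample(rows, labels)
print(Counter(balanced_labels))
```

Resample only the training split; the test set should keep the original class balance so the evaluation stays honest.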
You need to turn your words into a document-term matrix (a bag-of-words representation). Columns are all the unique words in your corpus. Rows are the documents. Cell values are one of: whether the word appears in the document, the number of times it appears, the relative frequency of its appearance, or its TF-IDF score. You can then have these columns along with your other non-word columns.
Now you probably have more columns than rows, meaning you'll get a singularity with matrix-based algorithms, in which case you need something like SVM or Naive Bayes.
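Putting word columns next to the non-word columns can be sketched as below. The numeric metadata, titles and labels are invented, and stacking with scipy.sparse.hstack is one assumption about how to combine them (scikit-learn's ColumnTransformer is another route).

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented examples: three numeric metadata columns plus a title per document.
numeric = np.array([[18, 23, 1], [2, 40, 0], [15, 30, 1], [1, 55, 0]], dtype=float)
titles = [
    "intro to cryptography",
    "celebrity gossip weekly",
    "applied cryptography notes",
    "gossip and fashion news",
]
labels = ["Like", "Dislike", "Like", "Dislike"]

vectorizer = TfidfVectorizer()
text_features = vectorizer.fit_transform(titles)  # sparse, one column per word

# Numeric columns first, then the word columns, side by side.
features = hstack([csr_matrix(numeric), text_features]).tocsr()

clf = LogisticRegression(max_iter=1000).fit(features, labels)

new = hstack([
    csr_matrix([[17.0, 25.0, 1.0]]),
    vectorizer.transform(["cryptography basics"]),
]).tocsr()
print(clf.predict(new))
```

The numeric columns are left unscaled here for brevity; in practice you would usually scale them so they do not dominate the TF-IDF columns.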
The title says it all; I have an SQL database bursting at the seams with online conversation text. I've already done most of this project in Python, so I would like to do this using Python's NLTK library (unless there's a strong reason not to).
The data is organized by Thread, Username, and Post. Each thread more or less focuses on discussing one "product" of the Category that I am interested in analyzing. Ultimately, when this is finished, I would like to have an estimated opinion (like/dislike sort of deal) from each user for any of the products they had discussed at some point.
So, what I would like to know:
1) How can I go about determining what product each thread is about? I was reading about keyword extraction... is that the correct method?
2) How do I determine a specific user's sentiment based on their posts? From my limited understanding, I must first "train" NLTK to recognize certain indicators of opinion, and then do I simply determine the context of those words when they appear in the text?
As you may have guessed by now, I have no prior experience with NLP. From my reading so far, I think I can handle learning it though. Even just a basic and crude working model for now would be great if someone can point me in the right direction. Google was not very helpful to me.
P.S. I have permission to analyze this data (in case it matters)
Training any classifier requires a training set of labeled data and a feature extractor to obtain feature sets for each text. After you have a trained classifier, you can apply it to previously unseen text (unlabeled) and obtain a classification based on the machine learning algorithm used. NLTK gives a good explanation and some samples to play around with.
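A minimal sketch of that workflow with NLTK's built-in Naive Bayes classifier. The feature extractor name and the four training sentences are invented; a real training set needs far more labeled examples.

```python
import nltk

def word_features(text):
    """Bag-of-words feature extractor: {word: True} for every lowercase token."""
    return {word: True for word in text.lower().split()}

# Tiny invented training set of (text, label) pairs.
train = [
    ("i love this camera", "pos"),
    ("great value and great sound", "pos"),
    ("terrible battery and poor build", "neg"),
    ("i hate the awful screen", "neg"),
]
train_set = [(word_features(text), label) for text, label in train]

# Train on labeled feature sets, then classify previously unseen text.
classifier = nltk.NaiveBayesClassifier.train(train_set)
prediction = classifier.classify(word_features("great camera"))
print(prediction)
```

The same feature extractor must be applied to both the training texts and the new text.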
If you are interested in building a classifier for positive/negative sentiment, using your own training dataset, I would avoid simple keyword counts, as they aren't accurate for a number of reasons (eg. negation of positive words: "not happy"). An alternative, where you can still use a large training set without having to manually label anything, is distant supervision. Basically, this approach uses emoticons or other specific text elements as noisy labels. You still have to choose which features are relevant but many studies have had good results with simply using unigrams or bigrams (individual words or pairs of words, respectively).
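Distant supervision with emoticons can be sketched as follows. The emoticon patterns, the function name and the posts are assumptions for illustration. Posts with no emoticon, or with conflicting ones, are left unlabeled, and the emoticons are stripped so a classifier cannot simply memorise them.

```python
import re

POSITIVE = re.compile(r"(:-?\)|:-?D|;-?\))")
NEGATIVE = re.compile(r"(:-?\(|:'\()")

def noisy_label(text):
    """Return ('pos'|'neg', text without emoticons), or None if unlabeled/ambiguous."""
    has_pos = bool(POSITIVE.search(text))
    has_neg = bool(NEGATIVE.search(text))
    if has_pos == has_neg:  # no emoticon, or conflicting emoticons
        return None
    cleaned = NEGATIVE.sub("", POSITIVE.sub("", text)).strip()
    return ("pos" if has_pos else "neg", cleaned)

posts = [
    "Finally got the new headphones :)",
    "Shipping took three weeks :(",
    "It arrived today.",
]
labeled = [pair for pair in map(noisy_label, posts) if pair is not None]
print(labeled)
```

The resulting (label, text) pairs are noisy, but they can serve as training data for a unigram or bigram classifier.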
All of this can be done relatively easily with Python and NLTK. You can also choose to use a tool like NLTK-trainer, which is a wrapper for NLTK and requires less code.
I think this study by Go et al. is one of the easiest to understand. You can also read other studies for distant supervision, distant supervision sentiment analysis, and sentiment analysis.
There are a few built-in classifiers in NLTK with both training and classification methods (Naive Bayes, MaxEnt, etc.), but if you are interested in using Support Vector Machines (SVM) then you should look elsewhere. Technically NLTK provides you with an SVM class, but it's really just a wrapper for PySVMLight, which itself is a wrapper for SVMLight, written in C. I had numerous problems with this approach though, and would instead recommend LIBSVM.
For determining the topic, many have used simple keywords but there are some more complex methods available.
You could train any classifier with similar datasets and see what the results are when you apply it to your data. For example, the NLTK contains the Movie Reviews Corpus that contains 1000 positive and 1000 negative reviews. Here is an example on how to train a Naive Bayes Classifier with it. Some other review datasets like Amazon Product Review data are available here.
Another possibility is to take a list of positive and negative words like this one and count their frequencies in your dataset. If you want a complete list, use SentiWordNet.
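The word-list approach can be sketched like this. The two tiny word sets are invented stand-ins for a real lexicon, and the scoring rule (positive count minus negative count) is just one simple choice.

```python
import re

# Tiny invented word lists; a real lexicon would be much larger.
POSITIVE_WORDS = {"good", "great", "love", "excellent", "happy"}
NEGATIVE_WORDS = {"bad", "poor", "hate", "terrible", "awful"}

def lexicon_score(text):
    """Count positive minus negative words; > 0 leans positive, < 0 negative."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(token in POSITIVE_WORDS for token in tokens)
    neg = sum(token in NEGATIVE_WORDS for token in tokens)
    return pos - neg

scores = [lexicon_score(t) for t in [
    "Great sound, I love it",
    "Terrible battery and poor build",
    "I am not happy with this",
]]
print(scores)  # [2, -2, 1]
```

The last sentence scores positive even though it is a complaint, because plain counting ignores negation.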