I'm performing different sentiment analysis techniques on a set of Twitter data I have acquired. They are lexicon-based (VADER and SentiWordNet) and as such require no pre-labeled data.
I was wondering if there was a method (like F-Score, ROC/AUC) to calculate the accuracy of the classifier. Most of the methods I know require a target to compare the result to.
What I did for my research was take a small random sample of those tweets and manually label them as either positive or negative. You can then calculate the normalized scores using VADER or SentiWordNet and compute the confusion matrix for each, which will give you your F-score etc.
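For concreteness, here is a minimal sketch of that workflow in Python, assuming the vaderSentiment and scikit-learn packages; the example tweets, labels, and the 0.0 compound-score cut-off are placeholders you would replace or tune:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.metrics import confusion_matrix, f1_score

# Your small hand-labelled random sample (placeholders).
tweets = ["I love this!", "This is terrible..."]
manual_labels = ["positive", "negative"]

analyzer = SentimentIntensityAnalyzer()

def vader_label(text, threshold=0.0):
    # Threshold VADER's normalized 'compound' score; the cut-off is a choice you may tune.
    return "positive" if analyzer.polarity_scores(text)["compound"] >= threshold else "negative"

predicted = [vader_label(t) for t in tweets]

print(confusion_matrix(manual_labels, predicted, labels=["positive", "negative"]))
print(f1_score(manual_labels, predicted, pos_label="positive"))
```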
This may not be a particularly good test, though, as it depends on the sample of tweets you use. For example, you may find that SentiWordNet classes more things as negative than VADER does, and thus appears to have the higher accuracy simply because your random sample is mostly negative.
The short answer is no, I don't think so. (So, I'd be very interested if someone else posts a method.)
With some unsupervised machine learning techniques you can get some measurement of error. E.g. an autoencoder gives you an MSE (representing how accurately the lower-dimensional representation can be reconstructed back to the original higher-dimensional form).
But for sentiment analysis, all I can think of is to use multiple algorithms and measure agreement between them on the same data. Where they all agree on a particular sentiment, you mark it as a more reliable prediction; where they disagree, you mark it as an unreliable prediction. (This relies on none of the algorithms sharing the same biases, which is probably not the case.)
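As a rough illustration of that agreement idea, here is a sketch that treats VADER and TextBlob as two independent lexicon-based scorers; both choices are only stand-ins, so substitute whichever analyzers you actually use:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

analyzer = SentimentIntensityAnalyzer()

def vader_label(text):
    return "positive" if analyzer.polarity_scores(text)["compound"] >= 0 else "negative"

def textblob_label(text):
    return "positive" if TextBlob(text).sentiment.polarity >= 0 else "negative"

for tweet in ["great game today", "worst service ever"]:  # placeholder tweets
    labels = {vader_label(tweet), textblob_label(tweet)}
    # Agreement between the scorers is treated as a more reliable prediction.
    reliability = "reliable" if len(labels) == 1 else "unreliable"
    print(tweet, labels, reliability)
```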
The usual approach is to label some percentage of your data, and assume/hope it is representative of the whole data.
I am dealing with patient data. I want to predict the top N diseases given a set of symptoms.
In total I have around 1,200 unique symptoms and around 200 unique diagnoses. This is a sample of my dataset:
ID             Symptom combination              Diagnosis
Patient1       fever, loss of appetite, cold    Flu
Patient2       hair loss, blood pressure        Thyroid
Patient3       hair loss, blood pressure        Flu
Patient4       throat pain, joint pain          Viral Fever
...
Patient30000   vomiting, nausea                 Diarrhoea
What I am planning to do with this dataset is to use the Symptoms column to generate word vectors using Word2vec for each row of patient data. After generating the vectors I want to build a classifier, with the vectors in each row being my independent variable and the Diagnosis being the target categorical variable.
Should I take the average of the word2vec vectors to generate a feature vector for each patient? If so, could you clarify how best to do it?
You can average a bunch of word-vectors for symptoms together to get a single feature-vector of the same dimensionality. (If your word-vectors are 100d each, averaging them together gets a single 100d summary vector.)
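For example, here is a minimal sketch of that averaging with gensim and NumPy; the tiny corpus and the parameter values are only illustrative:

```python
import numpy as np
from gensim.models import Word2Vec

# Tokenized symptom lists, one per patient (illustrative stand-ins for your data).
patients = [["fever", "loss_of_appetite", "cold"],
            ["hair_loss", "blood_pressure"]]

w2v = Word2Vec(sentences=patients, vector_size=100, min_count=1, epochs=50)

def average_vector(symptoms, model):
    # Average the word-vectors of the symptoms present in the vocabulary.
    vecs = [model.wv[s] for s in symptoms if s in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([average_vector(p, w2v) for p in patients])
print(X.shape)   # (n_patients, 100)
```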
But such averaging is fairly crude, and has some risk of diluting the information of each symptom in the averaging.
(As a simplified, stylized example, imagine a nurse took a patient's temperature at 9pm, and found it to be 102.6°F. Then again, at 7am, and found it to be 94.6°F. A doctor asks, "how's our patient's temperature?", and the nurse says the average, "98.6°F". "Wow," says the doctor, "it's rare for someone to be so on-the-dot for the normal healthy temperature. Next patient!" Averaging hid the important information: that the patient had both a fever and dangerous hypothermia.)
It sounds like you have a controlled-vocabulary of symptoms, with just some known, capped, and not-very-large number of symptom tokens: about 1200.
In such a case, turning those into a categorical vector for the presence/absence of each symptom may work far better than word2vec-based approaches. Maybe you have 100 different symptoms or 10,000 different symptoms. Either way, you can turn them into a large vector of 1s and 0s representing each possible symptom in order, and lots of classifiers will do pretty well with that input.
If treating the list-of-symptoms like a text-of-words, a simple "bag of words" representation of the text will essentially be this categorical representation: a 1200-dimensional 'one-hot' vector.
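A minimal sketch of that presence/absence encoding, using scikit-learn's MultiLabelBinarizer on lists of symptom strings (the sample rows are placeholders):

```python
from sklearn.preprocessing import MultiLabelBinarizer

patients = [["fever", "loss of appetite", "cold"],
            ["hair loss", "blood pressure"],
            ["throat pain", "joint pain"]]

mlb = MultiLabelBinarizer()
X = mlb.fit_transform(patients)   # shape: (n_patients, n_unique_symptoms), entries 0/1
print(mlb.classes_)               # the ~1200-symptom "vocabulary" in your real data
print(X)
```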
And unless this is some academic exercise where you've been required to use word2vec, it's not a good place to start, and may not be a part of the best solution. To train good word-vectors, you need more data than you have. (To re-use word-vectors from elsewhere, they should be well-matched to your domain.)
Word-vectors are most likely to help if you've got tens-of-thousands to hundreds-of-thousands of terms, and many contextual examples of each of their uses, to plot their subtle variations-of-meaning in a dense shared space. Only 30,000 'texts', of ~3-5 tokens each, and only ~1200 unique tokens, is fairly small for word2vec.
(I made similar points in my comments on one of your earlier questions.)
Once you've turned each row into a feature vector – whether it's by averaging symptom word-vectors, or probably better creating a bag-of-words representation – you can and should try many different classifiers to see which works best.
Many are drop-in replacements for each other, and with the size of your data, testing many against each other in a loop may take less than an hour or few.
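A rough sketch of such a loop, with a random toy matrix standing in for your real (30,000 × 1,200) binary symptom matrix and diagnosis labels:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 50))   # toy stand-in for your binary symptom matrix
y = rng.integers(0, 5, size=300)         # toy stand-in for your ~200 diagnosis labels

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "bernoulli naive bayes": BernoulliNB(),
    "random forest": RandomForestClassifier(n_estimators=200),
    "linear SVM": LinearSVC(),
}

for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean cross-validated accuracy {scores.mean():.3f}")
```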
If you're at a total loss about where to start, anything listed in the 'classifiers' upper-left area of this scikit-learn graphical guide is worth trying.
If you want to consider an even wider range of possibilities, and get a vaguely intuitive idea of which ones can best discover certain kinds of "shapes" in the underlying high-dimensional data, you can look at all the classifiers demonstrated in the scikit-learn "classifier comparison" page, which gives graphical representations of how well they handle a noisy 2d classification challenge (instead of your 1200d challenge).
I would suggest you start without word2vec and instead use a binary vectorizer. You will get a sparse binary matrix for your data. Then apply any of the multi-class classifiers. Both are available from scikit-learn.
It is not clear how the word2vec vectors would add to the power of your model. They may even be counterproductive if the word2vec model is trained on an irrelevant dataset: close vectors learned from that dataset may actually represent contrasting symptoms for your target.
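A minimal sketch of the binary-vectorizer suggestion above, treating each comma-separated symptom as one feature; the rows, labels, and the choice of LogisticRegression as the multi-class classifier are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

symptom_texts = ["fever, loss of appetite, cold",
                 "hair loss, blood pressure",
                 "throat pain, joint pain"]
diagnoses = ["Flu", "Thyroid", "Viral Fever"]

model = make_pipeline(
    CountVectorizer(binary=True,
                    tokenizer=lambda s: [t.strip() for t in s.split(",")],
                    token_pattern=None),   # sparse 0/1 matrix, one column per symptom
    LogisticRegression(max_iter=1000),     # handles multi-class out of the box
)
model.fit(symptom_texts, diagnoses)
print(model.predict(["fever, cold"]))
```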
I have a general question on machine learning that can be applied to any algorithm. Suppose I have a particular problem, let us say soccer team winning/losing prediction. The features I choose are the amount of sleep each player gets before the game, sentiment analysis on news coverage, etc etc.
In this scenario, there is a pattern or correlation (something only a machine learning algorithm can pick up on) that only occurs around 5% of the time. But when it occurs, it is very predictive of the upcoming match.
How do you set up a machine learning algorithm to handle such a case, so that it has the ability to discard most samples as noise? For example, consider a binary SVM. If there were a way to discard most of the “noisy” samples, a lot less overfitting would occur, because the hyperplane would not have to eliminate error from these samples.
Regularization would help in this case, but due to the very low percentage of predictive information, is there a way we can code the algorithm to discard these samples in training and refuse to predict certain test data samples?
I have also read into confidence intervals but they seem more of an analytic tool to me than something to use in the algorithm.
I was thinking that using another ml algorithm which uses the same features to decide which testing samples are keepers might be a good idea.
Any answers using any machine learning algorithm (e.g. svm, neural net, random forest) as an example would be much appreciated. Any suggestions on where to look would be great as well (google is usually my friend, but not this time). Please let me know if I can rephrase the question better. Thanks.
I have around 20k documents with 60-150 words each. Out of these 20k documents, there are 400 documents for which the similar document is known. These 400 documents serve as my test data.
At present I am removing those 400 documents and using the remaining 19,600 documents for training the Doc2Vec model. Then I extract the vectors of the train and test data. Now, for each test document, I find its cosine distance to all 19,600 train documents and select the top 5 with the least cosine distance. If the marked similar document is present in this top 5, I count it as accurate. Accuracy% = number of accurate records / total number of records.
The other way I find similar documents is by using the Doc2Vec most_similar method, then calculating accuracy using the above formula.
The two accuracies don't match: with each epoch, one increases while the other decreases.
For training the Doc2Vec model, I am using the code given here: https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e
I would like to know how to tune the hyperparameters so that I can get maximum accuracy by the above-mentioned formula. Should I use cosine distance to find the most similar documents, or should I use gensim's most_similar function?
The article you've referenced has a reasonable exposition of the Doc2Vec algorithm, but its example code includes a very damaging anti-pattern: calling train() multiple times in a loop, while manually managing alpha. This is hardly ever a good idea, and very error-prone.
Instead, don't change the default min_alpha, and call train() just once with the desired epochs, and let the method smoothly manage the alpha itself.
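A minimal sketch of that recommended pattern with gensim; the toy corpus and parameter values are only illustrative:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=doc.split(), tags=[str(i)])
          for i, doc in enumerate(["first short document", "second short document"])]

model = Doc2Vec(vector_size=100, min_count=2, epochs=40)   # leave alpha/min_alpha at their defaults
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)  # one call only
```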
Your general approach is reasonable: develop a repeatable way of scoring your models against some prior idea of what a good result looks like, then try a wide range of model parameters and pick the one that scores best.
When you say that your own two methods of accuracy calculation don't match, that's a little concerning, because the most_similar() method does in fact check your query-point against all known doc-vectors, and returns those with the greatest cosine-similarity. Those should be identical to the ones you've calculated to have the least cosine-distance. If you added to your question your exact code – how you're calculating cosine-distances, and how you're calling most_similar() – it would probably become clear what subtle differences or errors are causing the discrepancy. (There shouldn't be any essential difference, but given that there is, you'll likely want to use the most_similar() results, because they're known to be non-buggy and use efficient bulk array operations that are probably faster than whatever loop you've authored.)
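For example, a sketch of top-N accuracy scoring built on most_similar(), assuming a trained Doc2Vec model whose documents were tagged with string IDs and a dict known_pairs mapping each test document's tag to the tag of its known-similar document (both names are placeholders):

```python
def top_n_accuracy(model, known_pairs, n=5):
    hits = 0
    for query_tag, similar_tag in known_pairs.items():
        # most_similar() returns (tag, cosine-similarity) pairs, highest similarity first.
        # (model.dv is the gensim 4.x name; older versions use model.docvecs.)
        neighbours = [tag for tag, _ in model.dv.most_similar(query_tag, topn=n)]
        hits += similar_tag in neighbours
    return hits / len(known_pairs)

# e.g. known_pairs = {"doc_12": "doc_847", "doc_31": "doc_1093"}  (illustrative tags)
# print(top_n_accuracy(model, known_pairs, n=5))
```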
Note that you don't necessarily have to hold back your set of known-highly-similar document pairs. Since Doc2Vec is an unsupervised algorithm, you're not feeding it the preferred "make sure these documents are similar" results during training. It's fairly reasonable to train on the full set of documents, then pick the model that best captures your desired most-similar relationships, and believe that the inclusion of more documents actually helped you find the best parameters.
(Such a process might, however, slightly over-estimate the expected accuracy on future unseen docs, or some other hypothetical "other 20K" training documents. But it would still be plausibly finding the "best possible" metaparameters given your training data.)
(If you don't feed them all during training, then during testing you'll need to be using infer_vector() for the unseen docs, rather than just looking up the learned vectors from training. You haven't shown your code for such scoring/inference, but that's another step that might be done wrong. If you just train vectors for all available docs together, that possibility for error is eliminated.)
Checking if desired docs are in the top-5 (or top-N) most-similar is just one way to score a model. Another way, that was used in a couple of the original 'Paragraph Vector' (Doc2Vec) papers, is for each such pair, also pick another random document. Count the model as accurate each time it reports the known-similar docs as closer to each other than the 3rd randomly-chosen document. In the original 'Paragraph Vector' papers, existing search-ranking systems (which reported certain text snippets in response to the same probe queries) or hand-curated categories (as in Wikipedia or Arxiv) were used to generate such evaluation pairs: texts in the same search-results-page, or same category, were checked to see if they were 'closer' inside a model to each other than other random docs.
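A rough sketch of that pair-versus-random-document check, again assuming a trained model with string tags and a list of known-similar tag pairs (all names are placeholders):

```python
import random

def triplet_accuracy(model, known_pairs, all_tags):
    correct = 0
    for tag_a, tag_b in known_pairs:
        tag_c = random.choice([t for t in all_tags if t not in (tag_a, tag_b)])
        # Count the model as accurate when the known-similar pair is closer
        # to each other than to the randomly chosen third document.
        if model.dv.similarity(tag_a, tag_b) > model.dv.similarity(tag_a, tag_c):
            correct += 1
    return correct / len(known_pairs)
```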
If your question were expanded to describe more about some of the initial parameters you've tried (such as the full parameters you're supplying to Doc2Vec and train()), and what has seemed to help or hurt, it might then be possible to suggest other ranges of parameters worth checking.
I had trained my model with the KNN classification algorithm and was getting around 97% accuracy. However, I later noticed that I had missed normalising my data, so I normalised it and retrained my model; now I am getting an accuracy of only 87%. What could be the reason? And should I stick to using data that is not normalised, or should I switch to the normalised version?
To answer your question, you first need to understand how KNN works. Here is a simple diagram:
Suppose the ? is the point you are trying to classify as either red or blue. For this case, let's assume you haven't normalized any of the data. As you can see, the ? is clearly closer to more red dots than blue dots; therefore, this point would be classified as red. Let's also assume the correct label is red, so this is a correct match!
Now, to discuss normalization. Normalization is a way of taking data that is slightly dissimilar and giving it a common scale (in your case, think of it as making the features more comparable). Assume in the above example that you normalize the ?'s features, and as a result its y value becomes smaller. This would place the question mark below its current position, surrounded by more blue dots. Therefore, your algorithm would label it as blue, and it would be incorrect. Ouch!
Now to answer your questions. Sorry, but there is no single answer! Sometimes normalizing data removes important feature differences, causing accuracy to go down. Other times, it helps to eliminate noise in your features which causes incorrect classifications. Also, just because accuracy goes up for the data set you are currently working with doesn't mean you will get the same results with a different data set.
Long story short, instead of trying to label normalization as good/bad, instead consider the feature inputs you are using for classification, determine which ones are important to your model, and make sure differences in those features are reflected accurately in your classification model. Best of luck!
That's a pretty good question, and the result is unexpected at first glance, because normalization usually helps a KNN classifier do better. Good KNN performance generally requires preprocessing the data so that all variables are similarly scaled and centered; otherwise KNN will often be inappropriately dominated by scaling factors.
In this case the opposite effect is seen: KNN gets WORSE with scaling, seemingly.
However, what you may be witnessing could be overfitting. The KNN may be overfit, which is to say it memorized the data very well, but does not work well at all on new data. The first model might have memorized more data due to some characteristic of that data, but it's not a good thing. You would need to check your prediction accuracy on a different set of data than what was trained on, a so-called validation set or test set.
Then you will know whether the KNN accuracy is OK or not.
Look into learning curve analysis in the context of machine learning. Please go learn about bias and variance. It's a deeper subject than can be detailed here. The best, cheapest, and fastest sources of instruction on this topic are videos on the web, by the following instructors:
Andrew Ng, in the online Coursera course Machine Learning
Tibshirani and Hastie, in the online Stanford course Statistical Learning
If you use normalized feature vectors, the distances between your data points are likely to be different than when you used unnormalized features, particularly when the ranges of the features differ. Since kNN typically uses Euclidean distance to find the k nearest points to any given point, using normalized features may select a different set of k neighbors than the ones chosen when unnormalized features were used, hence the difference in accuracy.
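If you want to see this effect concretely (and, as suggested above, measure it on held-out data rather than on the training set), here is a small sketch using a stand-in scikit-learn dataset whose features have very different scales; swap in your own X and y:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)     # stand-in dataset with very differently scaled features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = KNeighborsClassifier().fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_train, y_train)

print("unscaled test accuracy:", raw.score(X_test, y_test))
print("scaled test accuracy:  ", scaled.score(X_test, y_test))
```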
I'm using scikit-learn to perform cross-validation with StratifiedKFold to compute the F1 score, but it warns that, for some of my labels, the sum of true positives and false positives is equal to zero. I thought using StratifiedKFold should prevent this? Why am I getting this problem?
Also, is there a way to get the confusion matrix from the cross_val_score function?
Your classifier is probably classifying all data points as negative for those labels, so there are no predicted positives. You can check whether that is the case by looking at the confusion matrix (docs and example here). It's hard to tell what is happening without information about your data and choice of classifier, but common causes include:
A bug in your code. Check that your training data contains negative data points, and that these data points contain non-zero features.
Inappropriate classifier parameters. If using Naive Bayes, check your class biases. If using SVM, try a grid search over parameter values.
The sklearn classification_report function may come in handy (docs).
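Regarding your second question: cross_val_score itself only returns scores, but you can get the out-of-fold predictions with cross_val_predict and build the confusion matrix (and classification report) from those. A minimal sketch, with a synthetic imbalanced dataset standing in for yours:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Placeholder data and classifier; substitute your own.
X, y = make_classification(n_samples=500, n_classes=3, n_informative=5,
                           weights=[0.8, 0.15, 0.05], random_state=0)
clf = LogisticRegression(max_iter=1000)

y_pred = cross_val_predict(clf, X, y, cv=StratifiedKFold(n_splits=5))
print(confusion_matrix(y, y_pred))
print(classification_report(y, y_pred))
```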
As for StratifiedKFold: stratification ensures that each fold contains roughly the same proportion of data points from all classes. This does not mean your classifier will perform sensibly on every class.
Update:
In a classification task (and especially when class imbalance is present) you are trading off precision for recall. Depending on your application, you can set your classifier so it does well most of the time (i.e. high accuracy) or so that it can detect the few points that you care about (i.e. high recall of the smaller classes). For example, if the task is to forward support emails to the right department, you want high accuracy. It is somewhat acceptable to misclassify the kind of email you get once a year, because you only upset one person. If your task is to detect posts by sexual predators on a children's forum, you definitely do not want to miss any of them, even if the price is that a few posts will get incorrectly flagged. Bottom line: you should optimise for your application.
Are you micro or macro averaging recall? In the former case, more weight will be given to the frequent classes (which is similar to optimising for accuracy), and in the latter all classes will have the same weight.
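A small illustration of the difference, with made-up labels containing one rare class:

```python
from sklearn.metrics import recall_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]   # misses one of the two rare-class points

print(recall_score(y_true, y_pred, average="micro"))  # 0.9  (dominated by the frequent class)
print(recall_score(y_true, y_pred, average="macro"))  # 0.75 (each class weighted equally)
```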