I have some images with which I will test an object detection classifier.
I'll have the classifier output the coordinates of the rectangles where it believes the target objects are, but I'm wondering how such results are usually evaluated.
I'm guessing I should have a reference file of coordinates of true object positions against which I can compare the classifier's results.
What if the classifier does make a correct classification, just with the coordinates not exactly the same as the ones in the reference file?
How's this usually solved?
The answer depends on the method you are going to use. One method is the one you've proposed: in that case I would set a fixed error tolerance, so that if the classifier's coordinates lie within that tolerance of one of the entries in the reference file, the detection is counted as correct. Of course, this tolerance should be small enough that you don't "over-detect" matches in the data set.
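A minimal sketch of that fixed-tolerance idea, assuming each box is an (x, y, w, h) tuple and an illustrative pixel tolerance (both are assumptions, not from your setup):

```python
# Sketch of the fixed-tolerance matching idea. The (x, y, w, h) box format
# and the 10-pixel tolerance are illustrative assumptions.

def boxes_match(predicted, reference, tolerance=10):
    """True if every coordinate of the predicted box is within
    `tolerance` of the corresponding reference coordinate."""
    return all(abs(p - r) <= tolerance for p, r in zip(predicted, reference))

def detection_accuracy(predictions, references, tolerance=10):
    """Count a predicted box as correct if it matches any reference box."""
    correct = sum(
        any(boxes_match(p, r, tolerance) for r in references)
        for p in predictions
    )
    return correct / len(predictions) if predictions else 0.0

print(detection_accuracy([(10, 12, 50, 40)], [(8, 14, 52, 38)]))  # -> 1.0
```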
I would also suggest trying cross-validation to test the classifier. From the data set, it chooses some vectors (images) as a test set and the rest as a training set. Repeat this a few times and average the errors to get an estimate of the classifier's error. That way you don't need a separate test set, and you don't have to worry about the problem you stated.
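For example, with scikit-learn (the features, labels, and classifier below are placeholders, just to show the mechanics):

```python
# K-fold cross-validation sketch: repeatedly hold out part of the data,
# train on the rest, and average the resulting error estimates.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(100, 8)           # placeholder feature vectors
y = np.random.randint(0, 2, 100)     # placeholder labels

kf = KFold(n_splits=5, shuffle=True, random_state=0)
errors = []
for train_idx, test_idx in kf.split(X):
    clf = KNeighborsClassifier()
    clf.fit(X[train_idx], y[train_idx])
    errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))

print("estimated classification error:", np.mean(errors))
```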
Firstly I want to apologize if any of my questions seem obvious or dumb. I'm still doing a lot of learning in this field so I would still appreciate any help I can get, even if the answer is obvious. I'm not sure if this belongs in stack overflow or not, but I thought I would give it a try here.
My understanding of clustering has always been that you never need to split the data into training and testing sets, since you don't have any labels or are ignoring them. But from some recent reading I have done online, I see that some people say a train/test split is still necessary for some clustering algorithms (those that generate centroids). So I have some scenarios listed below, and I was hoping some people could help me better understand what I have done wrong in them and what I need to do to fix it.
For all scenarios, my dataset is 95% unlabeled and 5% labeled, with 2 features. Some scaling was also done on the entire dataset prior to these scenarios. I know you're supposed to train/test split before scaling, but again, I thought you don't need to split for clustering.
Scenario 1
Using sklearn.cluster.KMeans, I ran KMeans on the dataset via the fit_predict() method. Then, for the 5% of labeled data, I take the clusters that were returned and compare them to the labels I have, to see how close the clusters are to the labels. I now feel that what I have done is incorrect because it introduces data leakage. Was I supposed to train/test split first with the unlabeled/labeled data, then do the scaling separately, then fit() on the unlabeled data and predict() on the labeled data?
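Roughly, I think the split-then-scale-then-fit version I'm asking about would look like this (the arrays below are just placeholders for my data):

```python
# Fit the scaler and KMeans on the unlabeled portion only, then scale and
# predict the held-out labeled portion. Data and n_clusters are placeholders.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X_unlabeled = np.random.rand(950, 2)      # stand-in for the 95% unlabeled rows
X_labeled = np.random.rand(50, 2)         # stand-in for the 5% labeled rows
y_labeled = np.random.randint(0, 3, 50)   # their known labels

scaler = StandardScaler().fit(X_unlabeled)              # scaler fit on "train" only
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(scaler.transform(X_unlabeled))               # learn centroids

clusters = kmeans.predict(scaler.transform(X_labeled))  # assign held-out points
# `clusters` would then be compared against `y_labeled`
```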
Scenario 2
Using scipy.cluster.hierarchy.linkage, I ran linkage on the dataset to get my linkage array. I drew the dendrogram, selected a distance cutoff point and saw how many clusters I would get, and generated the cluster labels via the fcluster method. Then for the 5% of the labeled data, I use the clusters that were returned and compare it to the labels I have to see how close the clusters are to the labels. Am I doing anything wrong in this situation so far? I'm not sure what my next steps are from here. Since hierarchical clustering doesn't generate any centroids, how can it be used to properly classify/predict data?
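Roughly, what I did looks like this (data and cutoff are placeholders):

```python
# Hierarchical clustering steps described above: build the linkage array,
# inspect the dendrogram to choose a cutoff, then flatten into cluster labels.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.rand(200, 2)              # stand-in for the scaled features
Z = linkage(X, method="ward")           # linkage array

# dendrogram(Z)                         # plotted this to pick a distance cutoff
cutoff = 5.0                            # placeholder cutoff read off the dendrogram
cluster_labels = fcluster(Z, t=cutoff, criterion="distance")
print("number of clusters:", len(np.unique(cluster_labels)))
```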
Thank you.
I have a problem that I have been treating as a classification problem. I am trying to predict whether a machine will pass or fail a particular test based on a number of input features.
What I am really interested in is actually whether a new machine is predicted to pass or fail the test. It can pass or fail the test by having certain signatures (such as speed, vibration etc) go out of range.
Therefore, I could either:
1) Treat it as a pure regression problem; try to predict the actual values of speed, vibration etc
2) Treat it as a pure classification problem; for each observation, feed in whether it passed or failed on the labels, and try to predict this in the tool I am making
3) Treat it as a pseudo problem; where I predict the actual value, and come up with some measure of how confident I am that it is a pass or fail based on distance from the threshold of pass/fail
To be clear: I am working on a real problem. I am not interested in getting a super precise prediction of a certain value, just whether a machine is predicted to pass or fail (and, as a bonus extension, how likely that prediction is to be true).
I have been working with a classification model, as I only have a couple of hundred observations and some previous research suggested this might be the best way to treat the problem. However, I am now wondering whether this is the right thing to do.
What would you do!?
Many thanks.
Without having the data and running both classification and regression, a comparison would be hard, because the metric you use for each family is different.
For example, comparing the RMSE of a regression with the F1 score (or accuracy) of a classification model would be an apples-to-oranges comparison.
It would be ideal if you could train a good regression model (low RMSE), because that would give you more information than the original pass/fail question. From my past experience with industrial customers, I would first train all three models you have mentioned, then present the outcomes to your customer and let them give you more direction on which models/outputs are most meaningful for them.
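For example, one way to score the regression model on the same footing as the classifier (essentially your option 3) is to threshold its predicted signature value into pass/fail and then apply a classification metric; a rough sketch with made-up numbers:

```python
# Turn regression predictions into pass/fail via the out-of-range threshold,
# then score with a classification metric. Threshold and values are made up.
import numpy as np
from sklearn.metrics import f1_score

THRESHOLD = 100.0                                      # illustrative limit

y_true_value = np.array([90.0, 110.0, 95.0, 130.0])    # measured signature values
y_pred_value = np.array([88.0, 105.0, 102.0, 125.0])   # regression predictions

y_true_fail = y_true_value > THRESHOLD                 # True = fail
y_pred_fail = y_pred_value > THRESHOLD

print("F1 on pass/fail derived from the regression:",
      f1_score(y_true_fail, y_pred_fail))
```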
I have around 20k documents with 60-150 words each. Out of these 20k documents, there are 400 for which the similar document is known. These 400 documents serve as my test data.
At present I am removing those 400 documents and using the remaining 19,600 documents for training doc2vec. Then I extract the vectors of the train and test data. Now, for each test document, I find its cosine distance to all 19,600 train documents and select the top 5 with the smallest cosine distance. If the marked similar document is present in these top 5, I take it to be accurate. Accuracy% = No. of accurate records / Total number of records.
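Roughly, the top-5 check I'm doing looks like this (the array names and the random data at the bottom are placeholders):

```python
# For each test vector, rank all train vectors by cosine similarity and check
# whether the known-similar document is among the 5 most similar.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top5_accuracy(test_vectors, train_vectors, true_match_index):
    sims = cosine_similarity(test_vectors, train_vectors)   # (n_test, n_train)
    top5 = np.argsort(-sims, axis=1)[:, :5]                 # 5 most similar per row
    hits = sum(true_match_index[i] in top5[i] for i in range(len(test_vectors)))
    return hits / len(test_vectors)

# placeholder usage with random vectors
train_vectors = np.random.rand(1000, 50)
test_vectors = np.random.rand(40, 50)
true_match_index = np.random.randint(0, 1000, size=40)
print(top5_accuracy(test_vectors, train_vectors, true_match_index))
```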
The other way I find similar documents is by using doc2vec's most_similar method, and then calculating accuracy using the above formula.
The two accuracies don't match: with each epoch, one increases while the other decreases.
For training the Doc2Vec I am using the code given here: https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e.
I would like to know how to tune the hyperparameters so that I get the best accuracy according to the above formula. Should I use cosine distance to find the most similar documents, or should I use gensim's most_similar function?
The article you've referenced has a reasonable exposition of the Doc2Vec algorithm, but its example code includes a very damaging anti-pattern: calling train() multiple times in a loop, while manually managing alpha. This is hardly ever a good idea, and very error-prone.
Instead, don't change the default min_alpha, and call train() just once with the desired epochs, and let the method smoothly manage the alpha itself.
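For example, with gensim (the tiny corpus and parameter values here are only illustrative):

```python
# Build the vocabulary once, then call train() a single time with `epochs`,
# leaving alpha/min_alpha at their defaults so gensim manages the decay.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [["some", "example", "tokens"], ["another", "short", "document"]]  # stand-in corpus
tagged_docs = [TaggedDocument(words=d, tags=[i]) for i, d in enumerate(docs)]

model = Doc2Vec(vector_size=100, min_count=1, epochs=40)
model.build_vocab(tagged_docs)
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.epochs)
```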
Your general approach is reasonable: develop a repeatable way of scoring your models based on some prior idea of what a good result looks like, then try a wide range of model parameters and pick the one that scores best.
When you say that your own two methods of accuracy calculation don't match, that's a little concerning, because the most_similar() method does in fact check your query-point against all known doc-vectors, and returns those with the greatest cosine-similarity. Those should be identical to the ones you've calculated to have the least cosine-distance. If you added your exact code to your question – how you're calculating cosine-distances, and how you're calling most_similar() – it would probably become clear what subtle differences or errors are causing the discrepancy. (There shouldn't be any essential difference, but given that there is: you'll likely want to use the most_similar() results, because they're known to be non-buggy and use efficient bulk array operations that are probably faster than whatever loop you've authored.)
Note that you don't necessarily have to hold back your set of known-highly-similar document pairs. Since Doc2Vec is an unsupervised algorithm, you're not feeding it the preferred "make sure these documents are similar" results during training. It's fairly reasonable to train on the full set of documents, then pick the model that best captures your desired most-similar relationships, and believe that the inclusion of more documents actually helped you find the best parameters.
(Such a process might, however, slightly over-estimate the expected accuracy on future unseen docs, or some other hypothetical "other 20K" training documents. But it would still be plausibly finding the "best possible" metaparameters given your training data.)
(If you don't feed them all during training, then during testing you'll need to be using infer_vector() for the unseen docs, rather than just looking up the learned vectors from training. You haven't shown your code for such scoring/inference, but that's another step that might be done wrong. If you just train vectors for all available docs together, that possibility for error is eliminated.)
Checking if desired docs are in the top-5 (or top-N) most-similar is just one way to score a model. Another way, that was used in a couple of the original 'Paragraph Vector' (Doc2Vec) papers, is for each such pair, also pick another random document. Count the model as accurate each time it reports the known-similar docs as closer to each other than the 3rd randomly-chosen document. In the original 'Paragraph Vector' papers, existing search-ranking systems (which reported certain text snippets in response to the same probe queries) or hand-curated categories (as in Wikipedia or Arxiv) were used to generate such evaluation pairs: texts in the same search-results-page, or same category, were checked to see if they were 'closer' inside a model to each other than other random docs.
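A rough sketch of that pair-versus-random-third-document check, assuming a trained model and a list of known-similar tag pairs (this uses gensim 4.x's model.dv; older versions expose the same thing as model.docvecs):

```python
# For each known-similar pair (a, b), draw a random third document c and count
# the model as correct when a is closer to b than to c.
import random

def triplet_accuracy(model, similar_pairs, all_tags, seed=0):
    rng = random.Random(seed)
    correct = 0
    for tag_a, tag_b in similar_pairs:
        tag_c = rng.choice(all_tags)              # random third document
        while tag_c in (tag_a, tag_b):
            tag_c = rng.choice(all_tags)
        if model.dv.similarity(tag_a, tag_b) > model.dv.similarity(tag_a, tag_c):
            correct += 1
    return correct / len(similar_pairs)
```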
If your question were expanded to describe more about some of the initial parameters you've tried (such as the full parameters you're supplying to Doc2Vec and train()), and what has seemed to help or hurt, it might then be possible to suggest other ranges of parameters worth checking.
This might sound silly but I'm just wondering about the possibility of modifying a neural network to obtain a probability density function rather than a single value when you are trying to predict a scalar. I know that when you are trying to classify images or words you can get a probability for each class, so I'm thinking there might be a way to do something similar with a continuous value and plot it. (Similar to the posterior plot with bayesian optimisation)
Such details could be interesting when deploying a model for prediction and could provide more flexibility than a single value.
Does anyone know a way to obtain such an output?
Thanks!
OK, so I found a solution to this issue, though it adds a lot of overhead.
Initially I thought a Keras callback could be of use, since it provides the flexibility I wanted (e.g. running only on the test data, or only on a subset, and not for every test), but it seems that callbacks are only given summary data from the logs.
So the first step was to create a custom metric that does the same calculation as any other metric, using the two arrays (the true values and the predicted values), and once those calculations are done, writes them to a file for later use.
Then, once I had a way to gather all the data for every sample, the next step was to implement a method that could give a good measure of the error. I'm currently implementing a handful of methods, but the most fitting one seems to be Bayesian bootstrapping (user lmc2179 has a great Python implementation). I also implemented ensemble methods and Gaussian processes as alternatives or additional metrics, along with some other Bayesian methods.
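For reference, the core of the Bayesian bootstrap idea is just reweighting the collected per-sample errors with Dirichlet-distributed weights and looking at the spread of the reweighted mean (this is a minimal sketch of the idea, not lmc2179's implementation):

```python
# Bayesian bootstrap sketch: draw Dirichlet weights over the collected
# per-sample errors and compute the weighted mean for each draw.
import numpy as np

def bayesian_bootstrap_mean(errors, n_draws=1000, seed=0):
    rng = np.random.default_rng(seed)
    errors = np.asarray(errors, dtype=float)
    weights = rng.dirichlet(np.ones(len(errors)), size=n_draws)  # (n_draws, n)
    return weights @ errors                      # posterior draws of the mean error

draws = bayesian_bootstrap_mean([0.2, -0.5, 0.1, 0.7, -0.3])     # placeholder errors
print("mean:", draws.mean(), "95% interval:", np.percentile(draws, [2.5, 97.5]))
```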
I'll try to find out whether there are internals in Keras that are set during the training and testing phases, so that I can set a trigger for my metric. The main issue with using all the data is that you obtain a lot of unreliable data points at the start, since the network is not yet optimized. Some data filtering could be useful to remove a good portion of those points and improve the results of the error predictors.
I'll update if I find anything interesting.
I trained my model with the KNN classification algorithm and was getting around 97% accuracy. However, I later noticed that I had forgotten to normalise my data, so I normalised it and retrained the model; now I am getting an accuracy of only 87%. What could be the reason? And should I stick with the unnormalised data, or switch to the normalised version?
To answer your question, you first need to understand how KNN works. Here is a simple diagram:
Suppose the ? is the point you are trying to classify into either red or blue. For this case, let's assume you haven't normalized any of the data. As you can clearly see, the ? is closer to more red dots than blue dots. Therefore, this point would be assumed to be red. Let's also assume the correct label is red, so this is a correct match!
Now, to discuss normalization. Normalization is a way of taking data that is slightly dissimilar and giving it a common state (in your case, think of it as making the features more similar). Assume in the above example that you normalize the ?'s features, and as a result its y value becomes smaller. This would place the question mark below its current position, surrounded by more blue dots. Therefore, your algorithm would label it as blue, and it would be incorrect. Ouch!
Now to answer your questions. Sorry, but there is no one answer! Sometimes normalizing data removes important feature differences, causing accuracy to go down. Other times, it helps to eliminate noise in your features which would otherwise cause incorrect classifications. Also, just because accuracy goes up for the data set you are currently working with doesn't mean you will get the same results with a different data set.
Long story short, instead of trying to label normalization as good/bad, instead consider the feature inputs you are using for classification, determine which ones are important to your model, and make sure differences in those features are reflected accurately in your classification model. Best of luck!
That's a pretty good question, and it is unexpected at first glance, because normalization usually helps a KNN classifier do better. Generally, good KNN performance requires preprocessing the data so that all variables are similarly scaled and centered; otherwise KNN will often be inappropriately dominated by scaling factors.
In this case the opposite effect is seen: KNN gets WORSE with scaling, seemingly.
However, what you may be witnessing could be overfitting. The KNN may be overfit, which is to say it memorized the data very well, but does not work well at all on new data. The first model might have memorized more data due to some characteristic of that data, but it's not a good thing. You would need to check your prediction accuracy on a different set of data than what was trained on, a so-called validation set or test set.
Then you will know whether the KNN accuracy is OK or not.
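For illustration, that check might look like the following, with a stand-in dataset and illustrative parameters (substitute your own data and split):

```python
# Evaluate the same kNN model with and without feature scaling on a held-out
# test split; the wine dataset here is only a stand-in for your data.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(),
                       KNeighborsClassifier(n_neighbors=5)).fit(X_train, y_train)

print("unscaled test accuracy:", raw.score(X_test, y_test))
print("scaled test accuracy:  ", scaled.score(X_test, y_test))
```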
Look into learning curve analysis in the context of machine learning. Please go learn about bias and variance. It's a deeper subject than can be detailed here. The best, cheapest, and fastest sources of instruction on this topic are videos on the web, by the following instructors:
Andrew Ng, in the online Coursera course Machine Learning
Tibshirani and Hastie, in the online Stanford course Statistical Learning.
If you use normalized feature vectors, the distances between your data points are likely to be different from when you used unnormalized features, particularly when the ranges of the features differ. Since kNN typically uses Euclidean distance to find the k nearest points to any given point, using normalized features may select a different set of k neighbors than the ones chosen with unnormalized features, hence the difference in accuracy.