Deciding on the best algorithm for a classification problem - python

I have a data set consisting of census data (age, sex, employment type, race, education level etc.). My task is to write an algorithm that predicts whether a data point (30, male, white etc.) will have a gross annual income of above $50k.
So far I have implemented a KNN algorithm that takes about 30 hours to run but achieves ~90% accuracy on the test data. I was hoping to achieve higher accuracy using an SVM, Naive Bayes, or anything else that might work here.
I'm looking for an algorithm that will be relatively simple to implement (about as hard as KNN) in Python and is likely to achieve good accuracy. What is the best choice in this case? If KNN is the best choice, which algorithm will be easiest to implement for comparison purposes?

It is hard to tell a priori which algorithm will perform best. For traditional classification tasks such as yours, random forests, gradient boosted machines, and SVMs often give the best results.
I don't know what you mean by an algorithm that is "relatively simple to implement", but if you use scikit-learn, a lot of algorithms are already implemented and will fit in one or two lines of code, so you can try them all!
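For example, here is a quick, untuned sketch of trying several scikit-learn classifiers on a census-style table. The CSV path, the target column name, and the simple preprocessing are placeholders you would replace with your own:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import make_pipeline
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

    # "census.csv" and the target column name are placeholders for your data
    df = pd.read_csv("census.csv")
    X = df.drop(columns=["income_over_50k"])
    y = df["income_over_50k"]

    # one-hot encode the categorical columns, pass numeric columns through
    categorical = X.select_dtypes(include="object").columns
    preprocess = ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
        remainder="passthrough",
    )

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    for clf in [LogisticRegression(max_iter=1000), LinearSVC(),
                RandomForestClassifier(), GradientBoostingClassifier()]:
        model = make_pipeline(preprocess, clf)
        model.fit(X_train, y_train)
        print(type(clf).__name__, model.score(X_test, y_test))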

Related

What to do when only a portion of training/testing data generates confident predictions?

I have a general question on machine learning that can be applied to any algorithm. Suppose I have a particular problem, let us say soccer team winning/losing prediction. The features I choose are the amount of sleep each player gets before the game, sentiment analysis on news coverage, and so on.
In this scenario, there is a pattern or correlation (something only a machine learning algorithm can pick up on) that only occurs around 5% of the time. But when it occurs, it is very predictive of the upcoming match.
How do you set up a machine learning algorithm to handle such a case, in which it has the ability to discard most samples as noise? For example, consider a binary SVM. If there were a way to discard most of the “noisy” samples, a lot less overfitting would occur, because the hyperplane would not have to eliminate error from these samples.
Regularization would help in this case, but due to the very low percentage of predictive information, is there a way we can code the algorithm to discard these samples in training and refuse to predict certain test data samples?
I have also read into confidence intervals but they seem more of an analytic tool to me than something to use in the algorithm.
I was thinking that using another ML algorithm, trained on the same features, to decide which test samples are worth predicting on might be a good idea.
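To make that idea concrete, here is a rough sketch of what I mean by a "reject option": a wrapper that abstains whenever the underlying classifier is not confident enough. The threshold value and the -1 "abstain" marker are just illustrative choices, not part of any particular library:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    def predict_with_rejection(model, X, threshold=0.9):
        """Return the model's prediction, or -1 when the model is not confident enough."""
        proba = model.predict_proba(X)
        top_confidence = proba.max(axis=1)
        preds = model.classes_[proba.argmax(axis=1)]
        return np.where(top_confidence >= threshold, preds, -1)

    # synthetic stand-in data; replace with your own features and labels
    X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2, random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(X[:800], y[:800])
    y_hat = predict_with_rejection(clf, X[800:], threshold=0.8)
    print("abstained on", (y_hat == -1).sum(), "of", len(y_hat), "samples")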
Any answers using any machine learning algorithm (e.g. SVM, neural net, random forest) as an example would be much appreciated. Any suggestions on where to look would be great as well (Google is usually my friend, but not this time). Please let me know if I can rephrase the question better. Thanks.

I want to implement a machine learning or deep learning model for text classification (100 classes)

I have a dataset that is similar to the one where we have movie plots and their genres. The number of classes is around 100. What algorithm should I choose for this 100-class classification? The classification is multi-label, because one movie can have multiple genres.
Please recommend anyone from the following. You are free to suggest any other model if you want to.
1. Naive Bayes
2. Neural networks
3. SVM
4. Random forest
5. k-nearest neighbours
It would be useful if you also gave the necessary library in Python.
An important step in machine learning engineering is to properly inspect the data. This gives you insight that determines which algorithm to choose. Often you will try more than one algorithm and compare the resulting models, to be sure you got the best out of the data.
Since you did not disclose your data, I can only give you the following advice: if your data is "easy", meaning that only a few features and a simple combination of them are needed to solve the task, use Naive Bayes or k-nearest neighbors. If your data is of "medium" difficulty, use a random forest or an SVM. If solving the task requires a very complicated decision boundary that combines many feature dimensions in a non-linear fashion, choose a neural network architecture.
I suggest you use Python with the scikit-learn package for the SVM, random forest, or k-NN.
For neural networks, use Keras.
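For example, here is a minimal scikit-learn sketch of the multi-label setup. The tiny plots/genres lists are placeholders for your real data; with 100 genres the binarized label matrix simply has ~100 columns:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # placeholder data: plot summaries and their genre lists
    plots = ["a retired hitman returns for one last job",
             "two friends road-trip across the country"]
    genres = [["Action", "Thriller"], ["Comedy", "Drama"]]

    # turn the label lists into a binary indicator matrix (one column per genre)
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(genres)

    # one binary classifier per genre on top of TF-IDF features
    model = make_pipeline(
        TfidfVectorizer(),
        OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    )
    model.fit(plots, Y)

    pred = model.predict(["an ex-cop chases a serial killer"])
    print(mlb.inverse_transform(pred))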
I am sorry that I cannot give you THE recipe you might expect for solving your problem; your question is posed very broadly.

How to incorporate uncertainty of features into machine learning algorithms?

I am using decision trees from Scikit Learn to do regression on a data set.
I am getting very good results, but one issue that concerns me is that the relative uncertainty on many of the features is very high.
I have tried just dropping the cases with high uncertainty, but that reduces the performance of the model significantly.
The features themselves are experimentally determined, so they have associated experimental uncertainty. The data itself is not noisy.
So my question is: is there a good way to incorporate the uncertainty associated with the features into machine learning algorithms?
Thanks for all the help!
If the uncertain features are improving the model, that suggests that, taken together, they are useful. However, some of them may not be. My suggestion would be to get rid of the features that don't improve the model. You could use a greedy feature elimination algorithm such as recursive feature elimination (RFE):
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
This begins by training a model on all of the features, then removes the feature deemed least useful and retrains the model with one fewer feature, repeating until the desired number of features remains.
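For example, a small illustration with a tree regressor (the synthetic data is just a stand-in for your features and target):

    from sklearn.datasets import make_regression
    from sklearn.feature_selection import RFE
    from sklearn.tree import DecisionTreeRegressor

    # synthetic stand-in for your experimentally measured features and target
    X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

    # drop one feature at a time until 5 remain
    selector = RFE(DecisionTreeRegressor(random_state=0), n_features_to_select=5, step=1)
    selector.fit(X, y)

    print(selector.support_)   # boolean mask of the features RFE kept
    print(selector.ranking_)   # 1 = kept; larger numbers were eliminated earlier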
Hope that helps

What classification model should I use? New to machine learning. Recommendation needed

the goal:
Hey guys, I'm trying to create a classification model in Python to predict when a bike-share station will have too much relative inflow or outflow per hour.
what we're workin with:
The first 5 rows of my dataframe (over 200,000 rows in all) look like this, and I've assigned values 0, 1, 2 in the 'flux' column - 0 if no significant action, 1 if too much inflow, 2 if too much outflow.
And I'm thinking of using the station_name (over 300 stations), hour of day, and day of week as the predictor variables to classify 'flux'.
the model choice:
What should I go with? Naive Bayes? KNN? Random forest? GBMs? SVMs? Anything else that would be a good fit?
FYI: the baseline of always predicting 0 is already pretty high, at 92.8%. Unfortunately, the accuracy of logistic regression and a decision tree is right on par with that and doesn't improve it much, and KNN just takes forever...
Recommendations from those more experienced with machine learning in dealing with a classification question like this?
The Azure Machine Learning team has an article on how to choose algorithms, which could help even if you aren't using Azure ML. From that article:
How large is your training data? If your training set is small, and you're going to train a supervised classifier, then machine learning theory says you should stick to a classifier with high bias/low variance, such as Naive Bayes. These have an advantage over low bias/high variance classifiers such as kNN since the latter tends to overfit. But low bias/high variance classifiers are more appropriate if you have a larger training set because they have a smaller asymptotic error - in these cases a high bias classifier isn't powerful enough to provide an accurate model. There are theoretical and empirical results that indicate that Naive Bayes does well in such circumstances. But note that having better data and good features usually can give you a greater advantage than having a better algorithm. Also, if you have a very large dataset classification performance may not be affected as much by the algorithm you use, so in that case it's better to choose your algorithm based on such things as its scalability, speed, or ease of use.
Do you need to train incrementally or in a batched mode? If you have a lot of data, or your data is updated frequently, you probably want to use Bayesian algorithms that update well. Both neural nets and SVMs need to work on the training data in batch mode.
Is your data exclusively categorical or exclusively numeric or a mixture of both kinds? Bayesian works best with categorical/binomial data. Decision trees can't predict numerical values.
Do you or your audience need to understand how the classifier works? Bayesian or decision trees are more easily explained. It's much harder to see or explain how neural networks and SVMs classify data.
How fast does your classification need to be generated? Decision trees can be slow when the tree is complex. SVMs, on the other hand, classify more quickly since they only need to determine which side of the "line" your data is on.
How much complexity does the problem present or require? Neural nets and SVMs can handle complex non-linear classification.
Now, regarding your comment that "the baseline prediction of always 0 is pretty high at 92.8%": there are anomaly detection algorithms for exactly this situation, where the classes are highly unbalanced and one class is an "anomaly" that occurs very rarely, as in credit card fraud detection (true fraud is hopefully a very small percentage of the total dataset). In Azure Machine Learning, we use one-class support vector machine (SVM) and PCA-based anomaly detection algorithms. Hope that helps!
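Outside of Azure ML, a rough sketch of the same one-class SVM idea in scikit-learn could look like this. The synthetic data is only a stand-in; the intent is to fit on the "normal" (flux == 0) rows and then flag rows that don't look like them:

    import numpy as np
    from sklearn.svm import OneClassSVM
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X_normal = rng.normal(0, 1, size=(500, 3))             # stand-in for the "flux == 0" rows
    X_test = np.vstack([rng.normal(0, 1, size=(20, 3)),    # mostly normal rows...
                        rng.normal(6, 1, size=(5, 3))])    # ...plus a few obvious outliers

    # fit only on normal behaviour, then flag rows that don't look like it
    detector = make_pipeline(StandardScaler(), OneClassSVM(nu=0.05, gamma="scale"))
    detector.fit(X_normal)
    print(detector.predict(X_test))   # +1 = looks normal, -1 = flagged as an anomaly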
With data this unbalanced, just use anything other than plain accuracy for model evaluation: precision, recall, F1, or the confusion matrix:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
Try different models and choose the best one according to the chosen metrics on the test set.
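For example (y_true and y_pred are placeholders for your test labels and model predictions):

    from sklearn.metrics import classification_report, confusion_matrix

    # placeholders: true labels and model predictions on the test set
    y_true = [0, 0, 0, 0, 0, 1, 2, 0, 1, 0]
    y_pred = [0, 0, 0, 0, 0, 0, 2, 0, 1, 0]

    print(confusion_matrix(y_true, y_pred))
    print(classification_report(y_true, y_pred, digits=3))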

Random Forest for multi-label classification

I am making an application for multi-label text classification.
I've tried different machine learning algorithms.
Without a doubt, the SVM with a linear kernel gets the best results.
I also tried Random Forest, and the results I obtained were very bad: both recall and precision are very low.
The fact that the linear kernel gives better results suggests that the different categories are linearly separable.
Is there any reason the Random Forest results are so low?
Random forest ensembles perform well across many domains and types of data. They are excellent at reducing error from variance and don't overfit if the trees are kept simple enough.
I would expect a forest to perform comparably to a SVM with a linear kernel.
The SVM will tend to overfit more because it does not benefit from being an ensemble.
If you are not using cross-validation of some kind, or at minimum measuring performance on unseen data with a train/test split, then I could see you obtaining this type of result.
Go back and make sure performance is measured on unseen data, and you will likely see the RF perform more comparably.
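As a rough sketch of that check (the synthetic single-label data stands in for your vectorized text; swap in your own features and a multi-label-aware metric as needed):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC
    from sklearn.ensemble import RandomForestClassifier

    # synthetic stand-in for your vectorized text features and labels
    X, y = make_classification(n_samples=1000, n_features=50, n_informative=10, random_state=0)

    for clf in [LinearSVC(), RandomForestClassifier(n_estimators=200, random_state=0)]:
        scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
        print(type(clf).__name__, scores.mean())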
Good luck.
It is very hard to answer this question without looking at the data in question.
SVMs do have a history of working better for text classification, but machine learning is by definition context dependent.
Consider the parameters with which you are running the random forest algorithm. What are the number and depth of your trees? Are you pruning branches? Are you searching a larger parameter space for the SVM, and therefore more likely to find a better optimum?
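As a rough illustration of that kind of check, a small grid search over the forest's size and depth (the synthetic multi-label data is a placeholder for your TF-IDF features and binarized labels):

    from sklearn.datasets import make_multilabel_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import RandomForestClassifier

    # synthetic stand-in for your TF-IDF features and binary label matrix
    X, Y = make_multilabel_classification(n_samples=500, n_features=40, n_classes=5, random_state=0)

    grid = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 30]},
        cv=3,
        scoring="f1_micro",
    )
    grid.fit(X, Y)
    print(grid.best_params_, grid.best_score_)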
