the goal:
Hey guys, I'm trying to create a classification model in Python to predict when a bike-share station will have too much relative inflow or outflow per hour.
what we're working with:
The first 5 rows of my dataframe (over 200,000 rows in all) look like this, and I've assigned values 0, 1, 2 in the 'flux' column - 0 if no significant action, 1 if too much inflow, 2 if too much outflow.
And I'm thinking of using the station_name (over 300 stations), hour of day, and day of week as the predictor variables to classify 'flux'.
the model choice:
What should I go with? Naive Bayes? KNN? Random Forest? GBMs? SVMs? Anything else that would be a good fit?
FYI: the baseline of always predicting 0 is already pretty high at 92.8% accuracy. Unfortunately, logistic regression and a decision tree are right on par with that and don't improve on it much, and KNN just takes forever to run.
Any recommendations from those more experienced with machine learning on how to approach a classification problem like this?
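For reference, here's a minimal sketch of the kind of setup I have in mind (the file name is a placeholder, the column names are just how my data happens to be laid out, and the random forest is only a stand-in model):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# hypothetical file; columns: station_name, hour, day_of_week, flux (0/1/2)
df = pd.read_csv("station_hourly_flux.csv")

# one-hot encode the ~300 station names; hour and day_of_week stay numeric
X = pd.get_dummies(df[["station_name", "hour", "day_of_week"]],
                   columns=["station_name"])
y = df["flux"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # plain accuracy, which is misleading here
```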
The Azure machine learning team has an article on how to choose algorithms which could help even if you aren't using AzureML. From that article:
How large is your training data? If your training set is small, and you're going to train a supervised classifier, then machine learning theory says you should stick to a classifier with high bias/low variance, such as Naive Bayes. These have an advantage over low bias/high variance classifiers such as kNN since the latter tends to overfit. But low bias/high variance classifiers are more appropriate if you have a larger training set because they have a smaller asymptotic error - in these cases a high bias classifier isn't powerful enough to provide an accurate model. There are theoretical and empirical results that indicate that Naive Bayes does well in such circumstances. But note that having better data and good features usually can give you a greater advantage than having a better algorithm. Also, if you have a very large dataset, classification performance may not be affected as much by the algorithm you use, so in that case it's better to choose your algorithm based on such things as its scalability, speed, or ease of use.

Do you need to train incrementally or in a batched mode? If you have a lot of data, or your data is updated frequently, you probably want to use Bayesian algorithms that update well. Both neural nets and SVMs need to work on the training data in batch mode.

Is your data exclusively categorical, exclusively numeric, or a mixture of both kinds? Bayesian works best with categorical/binomial data. Decision trees can't predict numerical values.

Do you or your audience need to understand how the classifier works? Bayesian or decision trees are more easily explained. It's much harder to see or explain how neural networks and SVMs classify data.

How fast does your classification need to be generated? Decision trees can be slow when the tree is complex. SVMs, on the other hand, classify more quickly since they only need to determine which side of the "line" your data is on.

How much complexity does the problem present or require? Neural nets and SVMs can handle complex non-linear classification.
Now, regarding your comment that "the baseline prediction of always 0 is pretty high at 92.8%": this is the kind of situation anomaly detection algorithms are designed for - the classes are highly unbalanced, with one class being an "anomaly" that occurs very rarely, like credit card fraud detection (true fraud is hopefully a very small percentage of your total dataset). In Azure Machine Learning, we use one-class support vector machine (SVM) and PCA-based anomaly detection algorithms for this. Hope that helps!
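For what it's worth, a minimal sketch of a one-class SVM outside of Azure ML, using scikit-learn and purely synthetic data (the numbers are arbitrary):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
# train only on "normal" observations (e.g. hours with no unusual flux)
normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 3))
# a few points far from the normal cluster stand in for anomalies
anomalies = rng.uniform(low=4.0, high=6.0, size=(20, 3))

# nu roughly bounds the fraction of training points treated as outliers
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
detector.fit(normal)

# predict() returns +1 for inliers and -1 for outliers
print(detector.predict(normal[:5]))     # mostly +1
print(detector.predict(anomalies[:5]))  # mostly -1
```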
With data this unbalanced, use anything other than plain accuracy for model evaluation: precision, recall, F1, or a confusion matrix:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
Try different models and choose the best one according to your chosen metric on a held-out test set.
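For example, a quick sketch with made-up labels just to show where those functions live:

```python
from sklearn.metrics import (classification_report, confusion_matrix,
                             precision_recall_fscore_support)

y_true = [0, 0, 0, 0, 0, 1, 2, 0, 1, 0]   # imbalanced toy labels
y_pred = [0, 0, 0, 0, 0, 0, 2, 0, 1, 0]

print(confusion_matrix(y_true, y_pred, labels=[0, 1, 2]))
print(precision_recall_fscore_support(y_true, y_pred, labels=[0, 1, 2],
                                      average=None))
# classification_report prints per-class precision/recall/F1 in one table
print(classification_report(y_true, y_pred, labels=[0, 1, 2]))
```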
Related
I am working with a health dataset.
The dataset is about body signals (8 features) and the target variable is the failing body temperature.
There are 6 different temperatures, i.e. 6 target classes (multi-class).
My dataset is of shape (1500 x 9), all numerical data.
I fitted my data with a RandomForestClassifier, but it shows an accuracy of around 80%.
I need my accuracy and F1 score to improve even more.
In the meantime I am tweaking some parameters for better accuracy.
Apart from Random Forest, I would like some suggestions on which model would be the best choice for my problem. Since my dataset is small, I am not sure how to select the best ML model.
I thought of going with boosting, SVM, or neural nets.
Kindly share your thoughts.
To find the best model for your problem you can use GridSearchCV from scikit-learn. Use a Pipeline and configure GridSearchCV to experiment with different learning methods and their hyper-parameters; it will find the best model for you.
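A rough sketch of that idea, assuming a dataset shaped like the one described (1500 x 8 numeric features, 6 classes); the candidate models and parameter values are only examples:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in for the health dataset
X, y = make_classification(n_samples=1500, n_features=8, n_classes=6,
                           n_informative=6, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", RandomForestClassifier())])

# each dict swaps a different estimator into the "clf" step
param_grid = [
    {"clf": [RandomForestClassifier(random_state=0)],
     "clf__n_estimators": [100, 300],
     "clf__max_depth": [None, 10]},
    {"clf": [SVC()],
     "clf__C": [0.1, 1, 10],
     "clf__kernel": ["rbf", "linear"]},
    {"clf": [LogisticRegression(max_iter=1000)],
     "clf__C": [0.1, 1, 10]},
]

search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=5)
search.fit(X, y)
print(search.best_estimator_)
print(search.best_score_)
```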
A group of researchers found that, with data of good quality and quantity, the performance of different ML models varies little (Hands-On Machine Learning with Scikit-Learn and TensorFlow, first edition, page 23). You should also spend some effort on feature engineering to see if you can increase the number of useful features. You can get some ideas from this Titanic solution.
I have a general question on machine learning that can be applied to any algorithm. Suppose I have a particular problem, let us say predicting whether a soccer team will win or lose. The features I choose are the amount of sleep each player gets before the game, sentiment analysis on news coverage, and so on.
In this scenario, there is a pattern or correlation (something only a machine learning algorithm can pick up on) that only occurs around 5% of the time. But when it occurs, it is very predictive of the upcoming match.
How do you set up a machine learning algorithm to handle such a case, so that it has the ability to discard most samples as noise? For example, consider a binary SVM. If there were a way to discard most of the "noisy" samples, a lot less overfitting would occur because the hyperplane would not have to account for the error from these samples.
Regularization would help in this case, but due to the very low percentage of predictive information, is there a way we can code the algorithm to discard these samples in training and refuse to predict certain test data samples?
I have also read into confidence intervals but they seem more of an analytic tool to me than something to use in the algorithm.
I was thinking that using another ML algorithm that uses the same features to decide which test samples are keepers might be a good idea.
Any answers using any machine learning algorithm (e.g. svm, neural net, random forest) as an example would be much appreciated. Any suggestions on where to look would be great as well (google is usually my friend, but not this time). Please let me know if I can rephrase the question better. Thanks.
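One crude way to approximate the "refuse to predict" part is to train an ordinary classifier and simply abstain on test samples whose predicted probability is not confidently near 0 or 1. A purely illustrative sketch on synthetic data follows (the model and thresholds are arbitrary, and this is only one of several ways to gate samples):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(5000, 10))
# a rare, noisy signal: positives are roughly 2% of samples
y = (X[:, 0] + 0.1 * rng.normal(size=5000) > 2.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]
confident = (proba < 0.1) | (proba > 0.9)   # arbitrary abstention cut-offs

pred = clf.predict(X_te)
print("fraction of test samples predicted on:", confident.mean())
print("accuracy on the confident subset:",
      (pred[confident] == y_te[confident]).mean())
```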
I am using decision trees from Scikit Learn to do regression on a data set.
I am getting very good results, but one issue that concerns me is that the relative uncertainty on many of the features is very high.
I have tried just dropping the cases with high uncertainty, but that reduces the performance of the model significantly.
The features themselves are experimentally determined, so they have associated experimental uncertainty. The data itself is not noisy.
So my question is: is there a good way to incorporate the uncertainty associated with the features into machine learning algorithms?
Thanks for all the help!
If the uncertain features are improving the model, that suggests that, taken together, they are useful. However, some of them may not be. My suggestion would be to get rid of the features that don't improve the model. You could use a greedy feature elimination algorithm, such as recursive feature elimination (RFE):
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
This begins by training a model on all of the features, then removes the feature deemed to be the least useful and trains the model again with one fewer feature, repeating until the desired number of features remains.
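A short sketch of what that looks like with RFE (the estimator, data, and number of features to keep are only placeholders):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

# synthetic regression data standing in for your experimental features
X, y = make_regression(n_samples=500, n_features=12, n_informative=5,
                       noise=10.0, random_state=0)

# drop one feature per iteration until 5 remain
selector = RFE(estimator=DecisionTreeRegressor(random_state=0),
               n_features_to_select=5, step=1)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # 1 = kept; larger numbers were dropped earlier
```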
Hope that helps
I need advice choosing a model and machine learning algorithm for a classification problem.
I'm trying to predict a binary outcome for a subject. I have 500,000 records in my data set and 20 continuous and categorical features. Each subject has 10-20 records. The data is labeled with its outcome.
So far I'm thinking logistic regression model and kernel approximation, based on the cheat-sheet here.
I am unsure where to start when implementing this in either R or Python.
Thanks!
Choosing an algorithm and optimizing its parameters is a difficult task in any data mining project, because it must be customized for your data and problem. Try different algorithms like SVM, Random Forest, Logistic Regression, KNN, and so on, run cross-validation for each of them, and then compare them.
You can use GridSearchCV in scikit-learn to try different parameter values and optimize the parameters for each algorithm. Also try this project,
which tests a range of parameters with a genetic algorithm.
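A small sketch of that comparison (synthetic data stands in for your 500,000 records and 20 features; the models and settings are only examples):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

models = {
    "logistic": make_pipeline(StandardScaler(),
                              LogisticRegression(max_iter=1000)),
    "linear_svm": make_pipeline(StandardScaler(), LinearSVC()),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15)),
}

# 5-fold cross-validation, compared on ROC AUC rather than accuracy
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```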
Features
If your categorical features don't have too many possible different values, you might want to have a look at sklearn.preprocessing.OneHotEncoder.
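For example, a tiny sketch with made-up categories:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([["red"], ["green"], ["blue"], ["green"]])

# handle_unknown="ignore" keeps transform() from failing on unseen categories
enc = OneHotEncoder(handle_unknown="ignore")
one_hot = enc.fit_transform(colors).toarray()

print(enc.categories_)  # [array(['blue', 'green', 'red'], ...)]
print(one_hot)          # one 0/1 column per category
```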
Model choice
The choice of "the best" model depends mainly on the amount of available training data and the simplicity of the decision boundary you expect to get.
You can try dimensionality reduction to 2 or 3 dimensions. Then you can visualize your data and see if there is a nice decision boundary.
With 500,000 training examples you can think about using a neural network. I can recommend Keras for beginners and TensorFlow for people who know how neural networks work.
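If you do try a neural network, a minimal Keras sketch for a binary outcome with ~20 features might look like the following (the layer sizes, epochs, and random stand-in data are arbitrary):

```python
import numpy as np
from tensorflow import keras

# random stand-in data: 20 features, binary target
X = np.random.rand(5000, 20).astype("float32")
y = (np.random.rand(5000) > 0.5).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # probability of the positive class
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=256, validation_split=0.2)
```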
You should also know that there are Ensemble methods.
A nice cheat sheet on what to use is in the sklearn tutorial you already found:
[scikit-learn algorithm cheat-sheet flowchart] (source: scikit-learn.org)
Just try it, compare different results. Without more information it is not possible to give you better advice.
I am making an application for multilabel text classification.
I've tried different machine learning algorithms.
Without a doubt, the SVM with a linear kernel gets the best results.
I have also tried the Random Forest algorithm, and the results I have obtained have been very bad; both the recall and precision are very low.
The fact that the linear kernel gives better results suggests to me that the different categories are linearly separable.
Is there any reason the Random Forest results are so low?
Random forest ensembles perform well across many domains and types of data. They are excellent at reducing error from variance and don't overfit if the trees are kept simple enough.
I would expect a forest to perform comparably to an SVM with a linear kernel.
The SVM will tend to overfit more because it does not benefit from being an ensemble.
If you are not using cross-validation of some kind, or at minimum measuring performance on unseen data with a train/test split, then I could see you obtaining this type of result.
Go back and make sure performance is measured on unseen data, and you'll likely see the RF performing more comparably.
Good luck.
It is very hard to answer this question without looking at the data in question.
SVMs do have a history of working better with text classification, but machine learning is by definition context-dependent.
Consider the parameters with which you are running the random forest algorithm. What are your number and depth of trees? Are you pruning branches? Are you searching a larger parameter space for the SVM, and are you therefore more likely to find a better optimum?