Python multiclass classification

I have a working example of a multiclass classifier (using sklearn.svm) on text data. In one pass, I can only train/test one feature. Is it possible to stack several features in one classifier? For concreteness, my data has the following characteristics:
feature 1: 1c1, 1c2, 1c3, 1c4
feature 2: 2c1,2c2
feature 3: 3c1,3c2,3c3,3c4,3c5
feature 4: 4c1,4c2,4c3
Currently, I run a training pass for feature 1, then repeat for feature 2, and so on.
How can I stack them together to get an output vector like [1c4, 2c1, 3c5, 4c2]? This is not a multi-label problem, because the feature sets {1..n} are mutually exclusive.

Apparently, there is no direct way to do this with a single SVM, per Alan Sz's answer:
One obvious advantage of artificial neural networks over support vector machines is that artificial neural networks may have any number of outputs, while support vector machines have only one. The most direct way to create an n-ary classifier with support vector machines is to create n support vector machines and train each of them one by one. On the other hand, an n-ary classifier with neural networks can be trained in one go.
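That said, scikit-learn ships a wrapper that automates exactly this "one SVM per output" strategy. A minimal sketch, assuming the text has already been vectorized into a numeric matrix (the data below is synthetic, with one target column per feature set):

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import LinearSVC

# Hypothetical vectorized text data: 6 samples, 10 features.
X = np.random.rand(6, 10)
# One column per output feature; classes within a column are mutually exclusive.
# Columns stand in for feature 1 (4 classes), feature 2 (2), feature 3, feature 4.
y = np.array([
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [2, 0, 2, 2],
    [3, 1, 3, 0],
    [0, 0, 4, 1],
    [1, 1, 0, 2],
])

clf = MultiOutputClassifier(LinearSVC())  # fits one SVM per output column
clf.fit(X, y)
print(clf.predict(X[:2]))  # one predicted label per feature set, per sample
```

Under the hood this is still n independent SVMs, matching the quote above; it just does the bookkeeping for you and returns the stacked output vector in one call.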

Related

Multiple Input - One output Neural Network in small dataset

The dataset I am working on has 7 input features and 4 output classes. The length of my dataset is 160. Will a neural network be a good choice here? If so, how should I feed my inputs to the neural network? Since I have 4 output classes, I am going to use Softmax in the final layer.
If a neural network makes no sense on such a small dataset, what are some good machine learning algorithms for getting a great result on this kind of problem?
Thanks 😊
What kind of dataset do you have? I am assuming a tabular dataset.
You can use a neural network if you must. However, for such a small dataset, a neural network isn't usually advisable. You should instead look into the following classifiers (a quick cross-validation comparison is sketched after the list):
Decision Tree
Naive Bayes
Multi-class Logistic Regression
Support Vector Machine
Ensemble models (Random Forest and/or Gradient Boosting)
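A minimal sketch of such a comparison, using synthetic data with the same shape as yours (160 samples, 7 features, 4 classes) as a stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the asker's data: 160 samples, 7 features, 4 classes.
X, y = make_classification(n_samples=160, n_features=7, n_informative=5,
                           n_classes=4, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# 5-fold cross-validation keeps the evaluation honest on a small dataset.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")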

I want to implement a machine learning or deep learning model for text classification (100 classes)

I have a dataset similar to the classic one of movie plots and their genres. The number of classes is around 100. Which algorithm should I choose for this 100-class classification? The classification is multi-label, because one movie can have multiple genres.
Please recommend one from the following. You are free to suggest any other model if you want to.
1. Naive Bayes
2. Neural networks
3. SVM
4. Random forest
5. k-nearest neighbours
It would be useful if you also named the necessary Python library.
An important step in machine learning engineering is to properly inspect the data. That way you gain insight into which algorithm to choose. Sometimes you might try out more than one algorithm and compare the models, to be sure you tried your best on the data.
Since you did not disclose your data, I can only give you the following advice: if your data is "easy", meaning that you need only a few features and a simple combination of them to solve the task, use Naive Bayes or k-nearest neighbours. If your data is of "medium" difficulty, use Random Forest or SVM. If solving the task requires a very complicated decision boundary combining many feature dimensions in a non-linear fashion, choose a neural network architecture.
I suggest you use Python and the scikit-learn package for SVM, Random Forest, or k-NN. For neural networks, use Keras.
I am sorry that I cannot give you THE recipe you might expect for solving your problem; your question is posed very broadly.
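As a hedged starting point, here is a minimal multi-label sketch in scikit-learn: TF-IDF features plus one binary logistic regression per genre via OneVsRestClassifier. The tiny dataset is purely illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical stand-in for the movie-plot data.
plots = ["a detective hunts a killer", "lovers reunite after a war",
         "a robot rebels against its makers", "a spy falls in love"]
genres = [["crime", "thriller"], ["romance", "drama"],
          ["sci-fi"], ["thriller", "romance"]]

# Binarize the label sets: one indicator column per genre (~100 in practice).
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genres)

# TF-IDF features + one binary classifier per genre.
clf = make_pipeline(TfidfVectorizer(),
                    OneVsRestClassifier(LogisticRegression(max_iter=1000)))
clf.fit(plots, Y)

pred = clf.predict(["a spy hunts a rebel robot"])
print(mlb.inverse_transform(pred))  # tuple of predicted genres
```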

Multi-output regression

I have been looking into multi-output regression for the last few weeks. I am working with the scikit-learn package. My machine learning problem has an input of 3 features and needs to predict two output variables. Some ML models in the sklearn package support multi-output regression natively. If a model does not support this, sklearn's multi-output wrapper can be used to convert it; the multioutput class fits one regressor per target.
Does the multioutput regressor class, or the natively supported multi-output regression algorithms, take the underlying relationship of the input variables into account?
Instead of a multi-output regression algorithm, should I use a neural network?
1) I have divided your first question into two parts.
First part has the answer written in the documentation you linked and also in this user guide topic, which states explicitly that:
As MultiOutputRegressor fits one regressor per target it can not take
advantage of correlations between targets.
The second part of your first question asks about other algorithms that support this. For that, you can look at the "inherently multiclass" part of the user guide. Inherently multi-class means that the algorithm doesn't need a One-vs-Rest or One-vs-One strategy to handle multiple classes (OvO and OvR fit multiple models, and so may not use the relationship between targets); instead, it handles the multi-class setting within a single model. The guide lists the following:
sklearn.naive_bayes.BernoulliNB
sklearn.tree.DecisionTreeClassifier
sklearn.tree.ExtraTreeClassifier
sklearn.ensemble.ExtraTreesClassifier
sklearn.naive_bayes.GaussianNB
sklearn.neighbors.KNeighborsClassifier
sklearn.semi_supervised.LabelPropagation
sklearn.semi_supervised.LabelSpreading
sklearn.discriminant_analysis.LinearDiscriminantAnalysis
sklearn.svm.LinearSVC (setting multi_class="crammer_singer")
sklearn.linear_model.LogisticRegression (setting multi_class="multinomial")
...
...
...
Try replacing the 'Classifier' at the end with 'Regressor' and look at the documentation of the fit() method there. For example, take DecisionTreeRegressor.fit():
y : array-like, shape = [n_samples] or [n_samples, n_outputs]
The target values (real numbers).
Use dtype=np.float64 and order='C' for maximum efficiency.
You see that it supports a 2-d array for targets (y), so it may be able to use the correlations and underlying relationships between the targets.
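A minimal sketch contrasting the two approaches, on synthetic data with correlated targets (shapes mirror your problem: 3 inputs, 2 outputs):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data: 3 input features, 2 correlated targets.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = np.column_stack([X.sum(axis=1),
                     X.sum(axis=1) * 2 + rng.rand(100) * 0.1])

# Natively multi-output: a single tree fits the 2-d target directly.
tree = DecisionTreeRegressor().fit(X, y)

# Wrapper: one independent regressor per target, ignoring target correlations.
wrapped = MultiOutputRegressor(LinearRegression()).fit(X, y)

print(tree.predict(X[:2]))     # shape (2, 2): both targets from one model
print(wrapped.predict(X[:2]))  # same shape, but fitted per-target
```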
2) As for your second question about whether to use a neural network: it depends on personal preference, the type of problem, the amount and type of data you have, and how much training you want to do. Maybe you can try multiple algorithms and choose what gives the best output for your data and problem.

Keras Neural Networks and SKlearn SVM.SVC

Lately I was at a Data Science meetup in my city, where there was a talk about connecting neural networks with SVMs. Unfortunately, the presenter had to leave right after the presentation, so I wasn't able to ask any questions.
I was wondering: how is that possible? He was talking about using neural networks for his classification, and later on he was using an SVM classifier to improve his accuracy and precision by about 10%.
I am using Keras for neural networks and sklearn for the rest of my ML.
This is completely possible and actually quite common. You just select the output of a layer of the neural network and use that as a feature vector to train an SVM. Generally one normalizes the feature vectors as well.
Features learned by (Convolutional) Neural Networks are powerful enough that they generalize to different kinds of objects and even completely different images. For examples see the paper CNN Features off-the-shelf: an Astounding Baseline for Recognition.
As for implementation: you just have to train a neural network, then select one of its layers (usually one right before the fully connected layers, or the first fully connected one), run the neural network on your dataset, store all the feature vectors, and then train an SVM with a different library (e.g. sklearn), as sketched below.
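A minimal sketch of this pipeline with Keras and sklearn; the network, layer names, and synthetic data are all illustrative stand-ins for your own trained model:

```python
import numpy as np
from tensorflow import keras
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Build and train a small network (stand-in for your own trained model).
inputs = keras.Input(shape=(20,))
h = keras.layers.Dense(64, activation="relu")(inputs)
feats = keras.layers.Dense(32, activation="relu", name="feature_layer")(h)
outputs = keras.layers.Dense(3, activation="softmax")(feats)
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

X = np.random.rand(200, 20)
y = np.random.randint(0, 3, 200)
model.fit(X, y, epochs=5, verbose=0)

# Cut the network at the chosen layer; its activations become the features.
extractor = keras.Model(inputs, model.get_layer("feature_layer").output)
features = extractor.predict(X, verbose=0)

# Normalize the feature vectors, then train the SVM on them.
svm = SVC().fit(StandardScaler().fit_transform(features), y)
```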

How to predict a binary outcome with categorical and continuous features using scikit-learn?

I need advice choosing a model and machine learning algorithm for a classification problem.
I'm trying to predict a binary outcome for a subject. I have 500,000 records in my data set, with 20 continuous and categorical features. Each subject has 10-20 records. The data is labeled with its outcome.
So far I'm thinking logistic regression model and kernel approximation, based on the cheat-sheet here.
I am unsure where to start when implementing this in either R or Python.
Thanks!
Choosing an algorithm and optimizing its parameters is a difficult task in any data mining project, because it must be customized for your data and problem. Try different algorithms like SVM, Random Forest, Logistic Regression, KNN and so on, run cross-validation for each of them, and then compare them.
You can use GridSearchCV in scikit-learn to try different parameters and optimize them for each algorithm. Also try this project, which tests a range of parameters with a genetic algorithm.
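A minimal GridSearchCV sketch; the synthetic data and the Random Forest grid are illustrative assumptions, not a recommendation for your exact problem:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic stand-in for the 500,000-record dataset (smaller for speed).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Illustrative grid; in practice tune it to your data.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 30],
}

# Exhaustive search over the grid with 5-fold cross-validation.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```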
Features
If your categorical features don't have too many possible different values, you might want to have a look at sklearn.preprocessing.OneHotEncoder.
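A minimal sketch of one-hot encoding a categorical column (the data is illustrative):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column alongside your continuous features.
cities = np.array([["London"], ["Paris"], ["London"], ["Tokyo"]])

# Note: the argument is named sparse=False on older scikit-learn versions.
enc = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
onehot = enc.fit_transform(cities)
print(enc.categories_)  # [array(['London', 'Paris', 'Tokyo'], ...)]
print(onehot)           # one indicator column per city
```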
Model choice
The choice of "the best" model depends mainly on the amount of available training data and the simplicity of the decision boundary you expect to get.
You can try dimensionality reduction to 2 or 3 dimensions. Then you can visualize your data and see if there is a nice decision boundary.
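For example, a quick PCA projection to two dimensions (synthetic data as a stand-in for yours):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Project the 20-dimensional data onto its first two principal components.
X2 = PCA(n_components=2).fit_transform(X)
plt.scatter(X2[:, 0], X2[:, 1], c=y, s=5)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```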
With 500,000 training examples you can think about using a neural network. I can recommend Keras for beginners and TensorFlow for people who know how neural networks work.
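A minimal Keras sketch for a binary outcome, assuming the 20 features have already been encoded numerically (layer sizes are illustrative):

```python
from tensorflow import keras

# Minimal binary classifier for ~20 preprocessed features.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # binary outcome
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=10, batch_size=256,
#           validation_split=0.1)
```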
You should also know that there are Ensemble methods.
A nice cheat sheet for what to use is in the sklearn tutorial you already found (source: scikit-learn.org).
Just try it and compare the different results. Without more information it is not possible to give you better advice.
