A project I'm working on requires me to combine 2 kernels and obtain results. The first kernel is formed using the features, and the second kernel needs to be the correlation between the features. Combining these kernels and using an SVM classifier is the requirement. Can anyone help me implement this in python?
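A minimal sketch of one common approach, assuming an RBF kernel on the features and a Pearson-correlation kernel between the samples' feature vectors (both choices are assumptions; substitute whatever the project actually specifies): compute both kernel matrices explicitly, sum them, and pass the result to an SVM with kernel='precomputed'.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def correlation_kernel(A, B):
    # pairwise Pearson correlation between the rows of A and the rows of B
    A_c = A - A.mean(axis=1, keepdims=True)
    B_c = B - B.mean(axis=1, keepdims=True)
    A_n = A_c / np.linalg.norm(A_c, axis=1, keepdims=True)
    B_n = B_c / np.linalg.norm(B_c, axis=1, keepdims=True)
    return A_n @ B_n.T

# toy data standing in for the real features
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# combined kernel: train matrix is (n_train, n_train), test is (n_test, n_train)
K_train = rbf_kernel(X_train) + correlation_kernel(X_train, X_train)
K_test = rbf_kernel(X_test, X_train) + correlation_kernel(X_test, X_train)

clf = SVC(kernel='precomputed')
clf.fit(K_train, y_train)
print(clf.score(K_test, y_test))

A weighted sum (alpha * K1 + (1 - alpha) * K2) works the same way and lets you tune the relative contribution of the two kernels.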
I'm trying to cluster a dataframe with 36 features and a lot (88%) of zeros. It's my first job in ML. I started with K-Means, but for any K I choose, 99.5% of my data stays in cluster 0. I've tried some PCA to reduce the features, but the same problem appeared.
Any thoughts on approaches I can try?
Have you tried techniques such as sequential feature selection? These are so-called 'wrapper methods', where you add (for forward selection) or eliminate (for backward elimination) one feature at a time and assess the model's performance accordingly. I tend to use supervised learning models in my job, but I believe you can use sequential selection algorithms to assess unsupervised models as well. I have used the sklearn library for this: https://scikit-learn.org/stable/modules/feature_selection.html
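As a small sketch of what forward selection looks like in scikit-learn (SequentialFeatureSelector needs sklearn >= 0.24; the KNN estimator, the 10-feature target, and the synthetic supervised task are all just for illustration, since this is how I have used it in the supervised case; adapting the scoring to an unsupervised model is a separate exercise):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

# toy data with 36 features, mimicking the shape of the problem above
X, y = make_classification(n_samples=500, n_features=36, random_state=0)

selector = SequentialFeatureSelector(
    KNeighborsClassifier(),
    n_features_to_select=10,   # illustrative target size
    direction='forward',
)
selector.fit(X, y)
print(selector.get_support())        # boolean mask of the kept features
X_reduced = selector.transform(X)    # data restricted to those features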
I am using python 3.5 with tensorflow 0.11.
I have a dataset with a large number of features (>5000) and a relatively small number of samples (<200). I am using the skflow wrapper function DNNClassifier for deep learning.
It seems to work well for the classification task, but I want to find some important features out of the large number of features.
Internally, DNNClassifier seems to perform feature selection (or feature extraction). Is there any way to perform feature selection with TensorFlow?
Or, is there some function to extract the weights of the features?
(There was a function DNNClassifier.weights_, but it seems to be deprecated.)
If TensorFlow does not support feature selection or access to the weights, would it be reasonable to conduct feature selection using another method (such as univariate feature selection) and then try deep learning?
Thank you for the help.
You can eval the weights.
For example, if your variable is defined by
weights = tf.Variable(np.ones([100, 10], dtype='float32'), name='weights')
you can get its value inside a TensorFlow session:
value = weights.eval()
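For completeness, a self-contained version in the graph-and-session style that matches the TensorFlow version in the question (in 0.11 the initializer is tf.initialize_all_variables(); later 1.x releases renamed it tf.global_variables_initializer()):

import numpy as np
import tensorflow as tf

weights = tf.Variable(np.ones([100, 10], dtype='float32'), name='weights')

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    value = weights.eval()   # equivalent to sess.run(weights)
    print(value.shape)       # (100, 10)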
I need advice choosing a model and machine learning algorithm for a classification problem.
I'm trying to predict a binary outcome for a subject. I have 500,000 records in my data set and 20 continuous and categorical features. Each subject has 10-20 records. The data is labeled with its outcome.
So far I'm thinking logistic regression model and kernel approximation, based on the cheat-sheet here.
I am unsure where to start when implementing this in either R or Python.
Thanks!
Choosing an algorithm and optimizing its parameters is a difficult task in any data mining project, because it must be customized for your data and problem. Try different algorithms such as SVM, Random Forest, Logistic Regression, KNN, and so on, run cross-validation for each of them, and then compare the results.
You can use GridSearchCV in scikit-learn to try different parameters and optimize them for each algorithm. You can also try this project, which tests a range of parameters with a genetic algorithm.
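A minimal GridSearchCV sketch (the random-forest estimator and the parameter grid are placeholders, not recommendations for your data):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# toy data standing in for your 500,000 x 20 dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring='roc_auc')
search.fit(X, y)
print(search.best_params_, search.best_score_)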
Features
If your categorical features don't have too many distinct values, you might want to have a look at sklearn.preprocessing.OneHotEncoder.
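For example (a toy column with made-up categories):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([['red'], ['green'], ['blue'], ['green']])
encoder = OneHotEncoder(handle_unknown='ignore')
one_hot = encoder.fit_transform(colors).toarray()
print(one_hot)   # one 0/1 indicator column per category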
Model choice
The choice of "the best" model depends mainly on the amount of available training data and the simplicity of the decision boundary you expect to get.
You can try dimensionality reduction to 2 or 3 dimensions. Then you can visualize your data and see if there is a nice decision boundary.
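Something like this (PCA and 2 components are just one possible choice; t-SNE would work too):

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# stand-in for your 20-feature dataset
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

X_2d = PCA(n_components=2).fit_transform(X)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5, alpha=0.5)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.show()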
With 500,000 training examples you can think about using a neural network. I can recommend Keras for beginners and TensorFlow for people who know how neural networks work.
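A bare-bones Keras sketch for binary classification on 20 features (layer sizes and training settings are placeholders, not tuned values):

import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 20).astype('float32')   # stand-in for your data
y = np.random.randint(0, 2, size=1000)

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=5, batch_size=64, validation_split=0.2)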
You should also know that there are Ensemble methods.
A nice cheat sheet for what to use is in the sklearn tutorial you already found (source: scikit-learn.org).
Just try it, compare different results. Without more information it is not possible to give you better advice.
I've been tasked with solving a sentiment classification problem using scikit-learn, python, and mapreduce. I need to use mapreduce to parallelize the project, thus creating multiple SVM classifiers. I am then supposed to "average" the classifiers together, but I am not sure how that works or if it is even possible. The result of the classification should be one classifier, the trained, averaged classifier.
I have written the code using scikit-learn SVM Linear kernel, and it works, but now I need to bring it into a map-reduce, parallelized context, and I don't even know how to begin.
Any advice?
Make sure that all of the required libraries (scikit-learn, NumPy, pandas) are installed on every node in your cluster.
Your mapper will process each line of input (i.e., a training row) and emit a key that represents the fold for which that row will be used to train a classifier.
Your reducer will collect the lines for each fold and then run the sklearn classifier on all lines for that fold.
You can then average the results from each fold.
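One way to "average" the per-fold classifiers, assuming a linear kernel: average their weight vectors and intercepts into a single linear model. This is a sketch of the idea outside of MapReduce; clfs stands in for whatever models your reducers produce.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
folds = np.array_split(np.arange(len(y)), 3)   # pretend these are the per-fold splits

# per-fold classifiers (what the reducers would emit)
clfs = [LinearSVC(dual=False).fit(X[idx], y[idx]) for idx in folds]

# build one "averaged" linear model from the fold models
avg = LinearSVC(dual=False).fit(X[folds[0]], y[folds[0]])   # template, weights replaced below
avg.coef_ = np.mean([c.coef_ for c in clfs], axis=0)
avg.intercept_ = np.mean([c.intercept_ for c in clfs], axis=0)

print(avg.predict(X[:5]))

Whether parameter averaging is statistically appropriate for your task is a separate question; voting or averaging decision scores over the fold models is a common alternative.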
I've got BOW vectors and I'm wondering if there's a supervised dimensionality reduction algorithm in sklearn or gensim capable of taking high-dimensional, labeled data and projecting it into a lower-dimensional space that preserves the variance between the classes.
Actually, I'm trying to find a proper metric for the classification/regression, and I believe dimensionality reduction can help me. I know there are unsupervised methods, but I want to keep the label information along the way.
fastText, an implementation from Facebook Research, essentially helps you achieve what you are asking for. Since you were asking about gensim, I assume you might be aware of word2vec in gensim.
word2vec was proposed by Mikolov while at Google. Mikolov and his team at Facebook have since come up with fastText, which takes both word and sub-word information into consideration. It also allows for classification of text.
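A tiny gensim sketch of training fastText vectors (gensim 4.x API; older versions use size= instead of vector_size=; the toy corpus is just for illustration):

from gensim.models import FastText

sentences = [['the', 'movie', 'was', 'great'],
             ['the', 'film', 'was', 'terrible'],
             ['an', 'awful', 'plot']]

model = FastText(sentences, vector_size=50, window=3, min_count=1, epochs=10)
print(model.wv['movie'][:5])            # dense vector for a word
print(model.wv.most_similar('movie'))   # nearest words by cosine similarity

For supervised text classification itself, the standalone fasttext package from Facebook Research provides train_supervised; gensim's FastText covers the embedding side.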
You can only perform dimensionality reduction in an unsupervised manner, or supervised but with labels different from your target labels.
For example, you could train a logistic regression classifier on a dataset containing 100 topics. The output of this classifier (100 values) on your training data could then be your dimensionality-reduced feature set.
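A sketch of that idea: train a multi-class logistic regression on auxiliary topic labels and use its predicted probabilities as the compact feature set (the 20-topic auxiliary task and the data sizes here are made up):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# auxiliary task: 20 topic labels over high-dimensional BOW-like features
X, topics = make_classification(n_samples=2000, n_features=1000,
                                n_informative=50, n_classes=20,
                                random_state=0)

topic_model = LogisticRegression(max_iter=1000)
topic_model.fit(X, topics)

X_reduced = topic_model.predict_proba(X)   # 2000 x 20 instead of 2000 x 1000
print(X_reduced.shape)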