Which feature extractor (CountVectorizer or TF-IDF) would be best for sentiment analysis of tweets?
Can someone please explain the difference between them and which is most relevant for different classifiers?
I plan to use three different classifiers: Naive Bayes, SVM, and MaxEnt.
You can try using the SelectKBest method for selecting the top k most informative features for sentiment analysis. This is present in the scikit-learn library in Python.
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
You can import it as:
from sklearn.feature_selection import SelectKBest, chi2, f_classif
Once you've read the documentation, you can try using both the chi2 and the f_classif scores for feature selection. SelectKBest is a good method because it selects the features that have the strongest association with the output variable. You can keep changing the value of k to experiment and see which value gives you the best results.
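As a rough sketch of how this can be combined with a vectorizer (here tweets and labels are placeholder names for your raw tweet strings and sentiment labels; adjust k to your vocabulary size):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),           # or CountVectorizer() for raw counts
    ('select', SelectKBest(chi2, k=1000)),  # keep the 1000 most informative features
    ('clf', MultinomialNB()),
])
pipeline.fit(tweets, labels)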
I want to create a model that can predict who is speaking, even when they say different words.
In this case I am trying to use these features:
MFCC
Mel spectrogram
Tempo
Chroma STFT
Spectral centroid
Spectral bandwidth
To train it I am using RandomForestRegressor.
Is it possible to create a model like that?
For the sound processing and feature extraction part, librosa will definitely provide you with everything you need.
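For instance, a minimal sketch of extracting the features you listed (the file path audio.wav is just a placeholder):
import librosa
import numpy as np

y, sr = librosa.load('audio.wav', sr=None)            # load the waveform at its native sample rate
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # MFCCs
mel = librosa.feature.melspectrogram(y=y, sr=sr)      # Mel spectrogram
chroma = librosa.feature.chroma_stft(y=y, sr=sr)      # chroma STFT
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)        # global tempo estimate

# One common approach: average each feature over time to get one fixed-length vector per file
feature_vector = np.concatenate([mfcc.mean(axis=1), chroma.mean(axis=1),
                                 centroid.mean(axis=1), bandwidth.mean(axis=1),
                                 np.atleast_1d(tempo)])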
For the machine learning part, however, speaker identification (also called "voice recognition") is a relatively complex task. You will probably have more success with techniques from deep learning. You can certainly try random forests if you like (note that speaker identification is a classification problem, so you would want RandomForestClassifier rather than RandomForestRegressor), but you will probably get lower accuracy and will have to spend more time on feature engineering. In fact, it would be a good exercise to compare the results you get with the various techniques.
For an example tutorial on speaker identification using Keras, see e.g. this article.
This question is specific to the XGBClassifier API using the "gblinear" booster.
As mentioned here, the .coef_ property returns, as the xgboost docs say here, an array of shape [n_classes, n_features].
Using this array, how can I order the features by importance?
The short answer is no: although the base learner is a linear model, the magnitude of the coefficients will not indicate how important the features are, even more so when the predictors are not scaled. You can look at it this way: the magnitude of a coefficient depends on the scale / variation of its predictor, but does not tell you how useful that predictor will be in predicting the correct value. You can check this post for more details on how the base learner works.
If you are already using xgboost through its scikit-learn interface, there is a help page on plotting the importance of the variables, and you can work with that.
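If you still want to rank features by the raw coefficients (keeping the caveat above in mind, and ideally after standardizing the predictors first), a minimal sketch assuming a fitted XGBClassifier named model:
import numpy as np

# coef_ has shape [n_classes, n_features]; aggregate the absolute values over classes
coefs = np.abs(np.atleast_2d(model.coef_)).mean(axis=0)
ranking = np.argsort(coefs)[::-1]        # feature indices, largest |coefficient| first
for idx in ranking[:10]:                 # print the top 10
    print(idx, coefs[idx])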
My dataset has over 200 variables and I am running a classification model on it, which is leading to overfitting. What is suggested for reducing the number of features? I started with feature importance, however, due to the large number of variables I am unable to visualise it. Is there a way I can plot or showcase these values with respect to each variable?
Below is the code I am trying:
from sklearn.ensemble import ExtraTreesClassifier

F_Select = ExtraTreesClassifier(n_estimators=50)
F_Select.fit(X_train, y_train)
print(F_Select.feature_importances_)
You could try plotting the feature importances from largest to smallest and seeing which features capture a certain amount (say 95%) of the total importance, similar to a scree plot used in PCA. Ideally, this should be a small number of features:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(features, labels)

# Sort importances from largest to smallest and compute their cumulative sum
importances = np.sort(model.feature_importances_)[::-1]
cumsum = np.cumsum(importances)

plt.bar(range(len(importances)), importances)   # individual importances (scree-style plot)
plt.plot(cumsum, color='red')                   # cumulative importance
plt.show()
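As a possible follow-up (feature_importances_ sums to 1, so the cumulative sum can be compared directly to a 0.95 threshold), you could then keep only the top features, assuming features is a NumPy array:
n_keep = np.searchsorted(cumsum, 0.95) + 1                        # number of features covering 95%
top_idx = np.argsort(model.feature_importances_)[::-1][:n_keep]   # indices of those features
features_reduced = features[:, top_idx]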
I'm attempting a text classification task where I have training data of around 500 restaurant reviews labelled across 12 categories. I spent longer than I should have implementing TF-IDF and cosine similarity for classifying the test data, only to get some very poor results (0.4 F-measure). With time not on my side, I need to implement something significantly more effective that doesn't have a steep learning curve. I am considering using the TF-IDF values in conjunction with Naive Bayes. Does this sound sensible? I know that if I can get my data in the right format, I can do this with scikit-learn. Is there anything else you recommend I consider?
Thank you.
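For reference, a minimal sketch of the TF-IDF + Naive Bayes idea in scikit-learn (reviews and categories are placeholder names for the raw texts and their 12-class labels):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score

X_train, X_test, y_train, y_test = train_test_split(reviews, categories, test_size=0.2)
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(X_train, y_train)
print(f1_score(y_test, clf.predict(X_test), average='macro'))   # macro-averaged F-measure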
You should try to use fastText: https://pypi.python.org/pypi/fasttext . It can be used to classify text like this (don't forget to download a pretrained model here https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip, changing the language if it's not English):
import fasttext

model = fasttext.load_model('wiki.en.bin')  # the name of the pretrained model
classifier = fasttext.supervised('train.txt', 'model', label_prefix='__label__')  # train on train.txt
result = classifier.test('test.txt')        # evaluate on test.txt
print('P@1:', result.precision)
print('R@1:', result.recall)
print('Number of examples:', result.nexamples)
Every line in your training and test sets should be like this:
__label__classname Your restaurant review blah blah blah
I have two arrays, one with sizes and one with prices. How can I train a model, or use a cost function (I'm a beginner, yeah), so I can predict the price for an arbitrary size?
Maybe I'm confused with the terms, but I hope someone can understand. Thanks.
You must use a regressor and fit it to your data. Once fitted, you can use this regressor to predict unseen samples.
Here is a link that shows all the regressors available on sklearn.
Amongst the regressors you could use, I can cite: OLS, Ridge, k-NN, decision trees, random forests...
The documentation is very clear, so you shouldn't (a priori) run into any difficulty.
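For example, a minimal sketch with ordinary least squares (sizes and prices below are placeholder data standing in for your two arrays):
import numpy as np
from sklearn.linear_model import LinearRegression

sizes = np.array([50, 60, 75, 80, 100]).reshape(-1, 1)   # features must be shaped (n_samples, 1)
prices = np.array([150, 180, 220, 240, 300])             # targets

reg = LinearRegression().fit(sizes, prices)
print(reg.predict([[90]]))                                # predicted price for a size of 90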
NB: a training dataset with 14 elements is clearly not sufficient. Try to find more samples to add to your dataset.