Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
I'm attempting a text classification task, where I have training data of around 500 restaurant reviews that are labelled across 12 categories. I spent longer than I should have implementing TF-IDF and cosine similarity to classify the test data, only to get very poor results (0.4 F-measure). With time not on my side now, I need to implement something significantly more effective that doesn't have a steep learning curve. I am considering using the TF-IDF values in conjunction with Naive Bayes. Does this sound sensible? I know that if I can get my data into the right format, I can do this with scikit-learn. Is there anything else you recommend I consider?
Thank you.
You could try fastText: https://pypi.python.org/pypi/fasttext . It can be used to classify text like this:
(If you want to start from pretrained vectors, download them first, e.g. https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip for English; change the language code in the URL if your text is not English.)
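import fasttext

# Train a supervised classifier; labels are recognised by the '__label__' prefix.
# Optionally add pretrainedVectors='wiki.en.vec', dim=300 to start from the
# downloaded vectors (the dimension must match the vector file).
model = fasttext.train_supervised(input='train.txt', label='__label__')
# test() returns (number of examples, precision at 1, recall at 1)
n_examples, precision, recall = model.test('test.txt')
print('P@1:', precision)
print('R@1:', recall)
print('Number of examples:', n_examples)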
Every line in your training and test sets should be like this:
__label__classname Your restaurant review blah blah blah
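If your reviews currently live in Python as (label, text) pairs, a small helper along these lines could write them out in that one-example-per-line format. This is only a sketch: the function names and the sample reviews are made up for illustration, and replacing spaces in labels with underscores is my own assumption, since fastText splits each line on whitespace.

```python
def to_fasttext_line(label, text, prefix="__label__"):
    # Replace whitespace in the label and collapse whitespace in the text,
    # so that each example stays on a single line.
    clean_label = "_".join(label.split())
    clean_text = " ".join(text.split())
    return f"{prefix}{clean_label} {clean_text}"


def write_fasttext_file(path, examples, prefix="__label__"):
    # examples: iterable of (label, review_text) pairs
    with open(path, "w", encoding="utf-8") as handle:
        for label, text in examples:
            handle.write(to_fasttext_line(label, text, prefix) + "\n")


# Made-up reviews, for illustration only.
reviews = [
    ("food quality", "The pasta was fresh\nand the sauce was great."),
    ("service", "Waited forty minutes for a table."),
]
lines = [to_fasttext_line(label, text) for label, text in reviews]
print(lines[0])
```

Calling write_fasttext_file('train.txt', reviews) would then produce a file you can pass to the training call.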
Closed last year.
I want to create a model that can predict who is speaking, even when the speakers say different words.
For this I am trying to use the following features:
MFCC
Mel spectrogram
Tempo
Chroma STFT
Spectral centroid
Spectral bandwidth
For training I am using RandomForestRegressor.
Is it possible to create a model like that?
For the sound processing and feature extraction part, librosa is definitely going to provide you all you need.
For the machine learning part however, speaker identification (also called "voice recognition") is a relatively complex task. You probably will get more success using techniques from deep learning. You can certainly try to use random forests if you like, but you'll probably get a lower accuracy and will have to spend more time doing feature engineering. In fact, it will be a good exercise for you to compare the results you can get with the various techniques.
For an example tutorial on speaker identification using Keras, see e.g. this article.
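If you do want to try the random forest route first, here is a rough sketch of what the training step could look like. Note that identifying a speaker is a classification problem, so it uses RandomForestClassifier rather than the RandomForestRegressor mentioned in the question. The feature matrix is synthetic stand-in data (one row per clip, as if each row held per-clip averages of features such as MFCCs); the speaker count, clip count, and feature count are all assumptions, and with real audio you would build the rows from librosa output instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in data: 3 hypothetical speakers, 40 clips each, 20 features per clip.
n_speakers, clips_per_speaker, n_features = 3, 40, 20
centers = rng.normal(scale=3.0, size=(n_speakers, n_features))
X = np.vstack(
    [center + rng.normal(size=(clips_per_speaker, n_features)) for center in centers]
)
y = np.repeat(np.arange(n_speakers), clips_per_speaker)

# Hold out a quarter of the clips, keeping the speakers balanced in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print("held-out accuracy:", accuracy)
```

The held-out accuracy on this toy data says nothing about real recordings; it only shows the shape of the workflow you would compare against a deep learning model.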
Closed 2 years ago.
I have seen this on Shutterstock and on many other websites: when you upload an image, suggested tags are generated automatically.
That's commonly done using (Deep) Artificial Neural Networks (NNs).
The idea is that you feed an image into a trained NN model, which can then predict a classification for the entire image, detect objects present in it, or even label regions inside it. There's a lot of freedom in what can be achieved. Good models are not easy to obtain without large amounts of data and substantial training resources, so pretrained models exist that you can fine-tune to work on your own dataset. Unfortunately, these models are often somewhat overfit to the dataset they were trained on, so fine-tuning is necessary most of the time. I think this link will point you further in the direction of how these automatically suggested tags can be generated.
Closed 5 years ago.
I am planning on building a gender classifier. I know the two popular models are tf-idf and word2vec.
While tf-idf focuses on the importance of a word in a document and similarity of documents, word2vec focuses more on the relationship between words and similarity between them.
However, neither of them seems to be ideal for building feature vectors for gender classification. Is there an alternative vectorization model that might suit this task?
Yes, there is another alternative to w2v: GloVe.
GloVe stands for Global Vectors for Word Representation.
As someone who has used this technique before to good effect, I would recommend GloVe.
GloVe learns word embeddings from global word-word co-occurrence statistics gathered over the whole corpus, rather than only from individual local context windows, which lets the vectors capture broader semantic relationships.
With GloVe, it is easy to model relationships such as: X[man] - X[woman] ≈ X[king] - X[queen], where these are all vectors.
Credits: GloVe GitHub page (linked below).
You can train your own GloVe embeddings, or you can use the pretrained models they make available. Even for specific domains, the general models seem to work reasonably well, although you would get a lot more out of your models if you trained them yourself. Please look at the GitHub page for instructions on how to train your own models. It is very easy.
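To turn word vectors into features for a classifier, one simple option is to average the vectors of the words in each document. The sketch below assumes the plain-text format used by the pretrained GloVe files (one word per line followed by its space-separated vector components). The function names and the tiny three-dimensional "embeddings" are made up for illustration, and averaging is just one possible way of pooling, not something the GloVe authors prescribe.

```python
import io

import numpy as np


def load_glove_text(handle):
    # Parse lines of the form "word v1 v2 ... vd" into a {word: vector} dict.
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors


def document_vector(tokens, vectors, dim):
    # Average the vectors of the known tokens; return zeros if none are known.
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)


# Tiny made-up embeddings standing in for a real GloVe file.
toy_file = io.StringIO("hello 1.0 0.0 2.0\nworld 3.0 2.0 0.0\n")
toy_vectors = load_glove_text(toy_file)
doc = document_vector(["hello", "world", "unseen"], toy_vectors, dim=3)
print(doc)
```

With a real file you would pass an opened file handle instead of the StringIO object, and feed the resulting document vectors to whichever classifier you choose.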
Additional reading:
GloVe: Global Vectors for Word Representation
GloVe repository
Closed 6 years ago.
I have 2 arrays, one with sizes and one with prices. How can I train a model, or use a cost function (I'm a beginner, yeah), so that I can predict the price for an arbitrary size?
Maybe I'm confusing the terms, but I hope someone can understand. Thanks.
You must use a regressor and fit it to your data. Once fitted, you can use this regressor to predict unseen samples.
Here is a link that shows all the regressors available on sklearn.
Among the regressors you could use are OLS, Ridge, k-NN, decision trees, and random forests.
The documentation is very clear, so you should not run into any difficulty.
NB:
A training dataset with 14 elements is clearly not sufficient.
Try to find more samples to add to your dataset.
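As a minimal sketch of the fit-then-predict workflow, here is ordinary least squares with scikit-learn's LinearRegression. The sizes and prices are made-up numbers that lie exactly on a straight line, so they only illustrate the mechanics; the variable names are my own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on price = 2 * size + 10.
sizes = np.array([50, 60, 80, 100, 120], dtype=float)
prices = 2.0 * sizes + 10.0

# scikit-learn expects a 2-D feature matrix: one row per sample, one column per feature.
X = sizes.reshape(-1, 1)

model = LinearRegression()
model.fit(X, prices)

# Predict the price for a size that was not in the training data.
new_size = np.array([[90.0]])
predicted = model.predict(new_size)[0]
print(predicted)
```

Swapping in another regressor such as Ridge only changes the line that constructs the model; the fit and predict calls stay the same.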
Closed 6 years ago.
Which feature extractor (CountVectorizer, TfIdf) will be best for sentiment analysis of tweets?
Can someone please explain the difference between them and which is most relevant for different classifiers?
I plan to use 3 different classifiers: Naive Bayes, SVM and MaxEnt.
You can try using the SelectKBest method for selecting the top k most informative features for sentiment analysis. This is present in the scikit-learn library in Python.
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
You can import it as:
from sklearn.feature_selection import SelectKBest, chi2, f_classif
Once you've read the documentation, you can try both the chi2 and f_classif scoring functions for feature selection. SelectKBest is a good way to select your features because it keeps the k features that score highest on association with the output variable. You can keep changing the value of k to experiment and see which value gives you the best results.
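Here is a rough sketch of how SelectKBest could sit between a vectorizer and a classifier in a scikit-learn Pipeline. The tweets, labels, step names, and k=4 are all made up for illustration. chi2 needs non-negative feature values, which both raw counts and TF-IDF weights satisfy, so you can swap TfidfVectorizer for CountVectorizer to compare the two.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up tweets, for illustration only.
tweets = [
    "I love this phone, great battery",
    "what a great day, love it",
    "awful service, I hate waiting",
    "hate this awful weather",
]
labels = ["pos", "pos", "neg", "neg"]

pipeline = Pipeline([
    ("vectorize", TfidfVectorizer()),               # text -> TF-IDF features
    ("select", SelectKBest(score_func=chi2, k=4)),  # keep the 4 highest-scoring features
    ("classify", MultinomialNB()),
])
pipeline.fit(tweets, labels)

n_selected = pipeline.named_steps["select"].get_support().sum()
prediction = pipeline.predict(["love this great phone"])
print(n_selected, prediction)
```

In practice you would tune k with cross-validation (for example via GridSearchCV over "select__k") rather than fixing it by hand.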