How to automatically classify app user reviews? - python

I have received tens of thousands of user reviews for my app.
I know that many of the comments mean the same thing,
and I cannot read all of these comments myself.
Therefore, I would like to use a Python program to analyze all of the comments
and identify the most frequent and most important feedback.
How can I do that?
I can download all of the app's comments, and I have a preliminary understanding of the Google Prediction API.

You can use the Google Prediction API to characterize your comments as important or unimportant. What you'd want to do is manually classify a subset of your comments. Then you upload that manually classified training data to Google Cloud Storage and, using the Prediction API, train your model. This step is asynchronous and can take some time. Once the trained model is ready, you can use it to programmatically classify the remaining (and any future) comments.
Note that the more comments you classify manually (i.e. the larger your training set), the more accurate your programmatic classifications will be. Also, you can extend this idea as follows: instead of a binary classification (important/unimportant), you could use grades of importance, e.g. on a 1-5 scale. Of course, that entails more manual labor in constructing your model so the best strategy will be a function of your needs and how much time you can spend building the model.
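For illustration, here is a minimal local sketch of that workflow (manually label a subset, train on it, then classify the rest automatically), using scikit-learn as a stand-in for the Prediction API; the tiny labeled sample and the important/unimportant labels are just placeholders.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # A handful of manually classified reviews (your hand-built training set)
    labeled_reviews = [
        ("important",   "App crashes every time I open the camera"),
        ("important",   "Login fails after the latest update"),
        ("unimportant", "Nice app"),
        ("unimportant", "ok I guess"),
    ]
    labels = [label for label, _ in labeled_reviews]
    texts = [text for _, text in labeled_reviews]

    # Turn each review into word features and fit a simple classifier
    vectorizer = TfidfVectorizer()
    clf = LogisticRegression(max_iter=1000).fit(vectorizer.fit_transform(texts), labels)

    # Classify the remaining (and any future) reviews programmatically
    new_reviews = ["The app freezes when I upload photos", "love it!!!"]
    print(clf.predict(vectorizer.transform(new_reviews)))

The same structure extends to a 1-5 importance scale by simply using those grades as the labels.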

Related

Evaluate recommendations after training a model

First of all, I would like to create a recommender system. With the help of a neural network, it is supposed to predict which items user X is most likely to buy.
I have already trained a model on the right datasets with the help of the neuMF model (the different layers can be seen in the architecture figure from the paper referenced below).
[Source https://arxiv.org/abs/1708.05031]
My dataset contains the following:
The column event contains whether the user has looked at an item (view), placed it in the shopping cart (addtocart) or bought it (transaction).
I have already found example implementations of how they determine the recommendations. The following was written about it:
Now that I've trained my model, I'm ready to recommend songs for a given playlist! However, one issue that I encountered (see below) is that I need the embedding of that new playlist (as stored in my model) in order to find the closest relevant playlists in that embedding space using kmeans. I am not sure how to get around this issue- as is, it seems that I have to retrain my whole model each time I get an input playlist in order to get that playlist embedding. Therefore, I just test my model on a randomly chosen playlist (which happens to be rock and oldies, mostly!) from the training set.
To recommend songs, I first cluster the learned embeddings for all of the training playlists, and then select "neighbor" playlists for my given test playlist as all of the other playlists in that same cluster. I then take all of the tracks from these playlists and feed the test playlist embedding and these "neighboring" tracks into my model for prediction. This ranks the "neighboring" tracks by how likely they are (under my model) to occur next in the given test playlist.
[Source https://github.com/caravanuden/spotify_recsys]
I've just trained my model and now I'd like to make a recommendation as to which items User X is most likely to buy.
Do I have to carry out another implementation of an algorithm that determines, for example, the nearest neighbors (kNN), or is it sufficient to train the model and then derive the recommendations from it?
How do I proceed after I have trained the model with the data, and how do I get the recommendations from it? What is the state of the art in this area for obtaining recommendations from a trained model?
Thanks in advance. Looking forward to suggestions, ideas and answers.
It depends on your use case for the model. This is twofold: firstly because of the performance (speed) required for your specific use case, and secondly because of what is, in my opinion, the main weakness of the neuMF model: if a user interacts with some more items, the predictions will not change, since those interactions were not part of the training. Because of this, if the model is used in a real-time online setting, the recommendations will essentially be based on previous behavior and will not take the current session into account unless the model is retrained.
The neuMF model is particularly good at batch predictions for interval recommendations. If, for example, you would like to recommend items to users in a weekly email, then for each user you would predict the output probability for each item, select the top n (e.g. 10) probabilities, and recommend those. (You would have to retrain the model the next week in order to get other predictions based on the users' latest item interactions.) So if there are 10,000 unique items, you make 10,000 individual predictions for each user and recommend n items based on those. The main drawback is of course that these 10,000 predictions take a while to perform, so it might not be suitable for real-time online predictions. On the other hand, if you are clever with parallelization of the predictions, this limitation can be overcome as well, although that might be unnecessary because, as explained previously, the predictions will not change depending on current user interactions.
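As a rough sketch of that weekly batch step, assuming a Keras-style neuMF model whose predict() takes parallel arrays of user IDs and item IDs and returns one interaction probability per pair (all names here are illustrative):

    import numpy as np

    def top_n_for_user(model, user_id, all_item_ids, seen_items, n=10):
        # Score every item the user has not interacted with yet
        candidates = np.array([i for i in all_item_ids if i not in seen_items])
        users = np.full(len(candidates), user_id)
        scores = model.predict([users, candidates], batch_size=4096).ravel()
        best = np.argsort(scores)[::-1][:n]   # indices of the n highest probabilities
        return candidates[best]

    # Weekly batch job:
    # recommendations = {u: top_n_for_user(model, u, item_ids, seen[u]) for u in user_ids}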
Using kNN to cluster users in the embedding space, then taking those users' items and feeding them into the model, seems unnecessary and, in my opinion, defeats the purpose of the whole model architecture. The whole point of the neuMF model is to generalize a given user's interactions with items across all the other users' interactions and to base the recommendations on that, so that, given a user and an item, you get the probability for that specific item.

Is predicting with a saved model more CPU-consuming than training and predicting in a Python app?

I recently made a disease prediction API (still not solved), but that's not the point here.
In the same app, I first deployed it so that it trains and predicts on every request, and that worked fine. But when I saved a model and used that saved model to predict the value, I got a 500 Internal Server Error.
I believe that would directly affect the response time of the API.
So I was curious whether predicting with a saved model is the more CPU-consuming task, or training and predicting together, so that I can plan my API accordingly, since cloud machines have specific CPU performance, etc.
Of course, it also depends on the tier we choose, and I am working on the free tier of Heroku.
It would be really nice if you guys could answer this.
Regards,
Roshan
If I understand it correctly, you are hitting some API endpoint with your request, and the code that runs when that endpoint is hit trains a model and then returns some prediction.
I can't really imagine how this would work in general. Training is a time-consuming process that can take hours or months (who knows how long). Also, how are you sending the training data to your backend (assuming this data can be arbitrarily large)?
The general approach is to build/train a model offline and then perform only predictions via your API (unless you are building some very low-level cloud API that is to be consumed by other ML developers).
But to answer your question: no, predicting can't take more time than training-and-predicting (assuming that you are making the prediction on the same data). You are just adding one more (much more computationally intensive) operation to the equation. And since training and predicting are two separate steps that do not influence each other directly, your prediction time stays the same whether you are just predicting or training-and-predicting.
Training + predicting is definitely more intensive than only predicting.
Typically, we train a model and save it as a binary file. Once saved, we use it for predicting.
Keep in mind that you need to perform the same pre-processing steps at prediction time that you used during training.
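A minimal sketch of that train-offline / predict-online split, assuming the pre-processing is wrapped in a scikit-learn Pipeline so exactly the same steps run at training and at prediction time (the iris data, file name, and route are stand-ins for your disease data and API):

    # --- offline, run once ---
    import joblib
    from sklearn.datasets import load_iris
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    X_train, y_train = load_iris(return_X_y=True)        # placeholder for your training data
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    pipeline.fit(X_train, y_train)
    joblib.dump(pipeline, "model.joblib")                 # saved as a binary file

    # --- in the API process: load once at startup, predict per request ---
    from flask import Flask, request, jsonify

    app = Flask(__name__)
    model = joblib.load("model.joblib")                   # no training at request time

    @app.route("/predict", methods=["POST"])
    def predict():
        features = request.get_json()["features"]         # e.g. [5.1, 3.5, 1.4, 0.2]
        prediction = model.predict([features])[0]
        return jsonify({"prediction": int(prediction)})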
As for the error, I'd suggest you do the following step by step to pinpoint what is causing it -
Try to access the API endpoint and return a simple JSON reply.
Send the input data to the API endpoint and try to return the input as JSON, just to verify that your server is receiving the data as intended. You can also print it out instead of sending back JSON.
Now, once you have the data, perform the same pre-processing steps as in training, make a prediction, and send it back to your frontend.

Training a model from multiple corpora

Imagine I have a fastText model that has been trained on Wikipedia articles (as explained on the official website).
Would it be possible to train it again with another corpus (scientific documents) that could add new / more pertinent links between words, especially for the scientific ones?
To summarize, I need the classic links that exist between all the English words, coming from Wikipedia. But I would like to enhance this model with new documents about specific sectors. Is there a way to do that? And if so, is there a way to weight the training so that relations coming from my custom documents would be 'more important'?
My final goal is to compute cosine similarity between documents that can be very scientific (that's why I thought about adding more scientific documents, to get better results).
Adjusting more-generic models with your specific domain training data is often called "fine-tuning".
The gensim implementation of FastText allows an existing model to expand its known-vocabulary via what's seen in new training data (via build_vocab(..., update=True)) and then for further training cycles including that new vocabulary to occur (through train()).
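A minimal sketch of that update path in gensim (the file name, toy corpus, and epoch count are assumptions; load_facebook_model() is for models in Facebook's .bin format, while FastText.load() is for gensim-native models):

    from gensim.models.fasttext import load_facebook_model

    model = load_facebook_model("wiki.en.bin")            # the generic, Wikipedia-trained model

    scientific_corpus = [                                 # your tokenized domain documents
        ["protein", "folding", "dynamics", "simulation"],
        ["quantum", "entanglement", "measurement", "experiment"],
    ]

    model.build_vocab(scientific_corpus, update=True)     # expand the known vocabulary
    model.train(scientific_corpus,
                total_examples=len(scientific_corpus),
                epochs=5)                                 # how many extra passes is a judgment call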
But, doing this particular form of updating introduces murky issues of balance between older and newer training data, with no clear best practices.
As just one example, to the extent there are tokens/ngrams in the original model that don't recur in the new data, new training pulls the ones that do appear in the new data into new positions that are optimal for that new data... but potentially arbitrarily far from comparable compatibility with the older tokens/ngrams.
Further, it's likely some model modes (like negative-sampling versus hierarchical-softmax), and some mixes of data, have a better chance of net-benefiting from this approach than others – but you pretty much have to hammer out the tradeoffs yourself, without general rules to rely upon.
(There may be better fine-tuning strategies for other kinds of models; this is just speaking to the ability of the gensim FastText implementation to update its vocabulary and repeat training.)
But perhaps, your domain of interest is scientific texts. And maybe you also have a lot of representative texts – perhaps even, at training time, the complete universe of papers you'll want to compare.
In that case, are you sure you want to deal with the complexity of starting with a more-generic word-model? Why would you want to contaminate your analysis with any of the dominant word-senses in generic reference material, like Wikipedia, if in fact you already have sufficiently-varied and representative examples of your domain words in your domain contexts?
So I would recommend 1st trying to train your own model, from your own representative data. And only if you then fear you're missing important words/senses, try mixing in Wikipedia-derived senses. (At that point, another way to mix in that influence would be to mix Wikipedia texts with your other corpus. And you should also be ready to test whether that really helps or hurts – because it could be either.)
Also, to the extent your real goal is comparing full papers, you might want to look into other document-modeling strategies, including bag-of-words representations, the Doc2Vec ('Paragraph Vector') implementation in gensim, or others. Those approaches will not necessarily require per-word vectors as an input, but might still work well for quantifying text-to-text similarities.
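For example, a minimal Doc2Vec sketch of the paper-to-paper similarity route (gensim 4.x, where the document vectors live under model.dv; the toy corpus and parameters are illustrative):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    papers = [
        ["deep", "learning", "for", "protein", "structure", "prediction"],
        ["bayesian", "inference", "methods", "in", "observational", "cosmology"],
    ]
    tagged = [TaggedDocument(words, [i]) for i, words in enumerate(papers)]

    model = Doc2Vec(tagged, vector_size=100, min_count=1, epochs=40)

    print(model.dv.similarity(0, 1))                      # cosine similarity of two papers
    new_vec = model.infer_vector(["protein", "folding", "simulation"])
    print(model.dv.most_similar([new_vec], topn=1))       # closest known paper to an unseen text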

What's the purpose of the different kinds of TensorFlow SignatureDefs?

It seems like the Predict SignatureDef encompasses all the functionality of the Classification and Regression SignatureDefs. When would there be an advantage to using Classification or Regression SignatureDefs rather than just using Predict for everything? We're looking to keep complexity down in our production environment, and if it's possible to use just Predict SignatureDefs in all cases, that would seem like a good idea.
From what I can see in the documentation (https://www.tensorflow.org/serving/signature_defs), the "Classify" and "Regress" SignatureDefs try to enforce a simple, consistent interface for the simple cases (classification and regression): respectively, "inputs" --> "classes + scores" and "inputs" --> "outputs". There seems to be an added benefit that the "Classify" and "Regress" SignatureDefs don't require a serving function to be constructed as part of the model export.
The Predict SignatureDef, on the other hand, allows a more generic interface with the benefit of being able to swap models in and out. From the docs:
Predict SignatureDefs enable portability across models. This means that you can swap in different SavedModels, possibly with different underlying Tensor names (e.g. instead of x:0 perhaps you have a new alternate model with a Tensor z:0), while your clients can stay online continuously querying the old and new versions of this model without client-side changes.
Predict SignatureDefs also allow you to add optional additional Tensors to the outputs, that you can explicitly query. Let's say that in addition to the output key below of scores, you also wanted to fetch a pooling layer for debugging or other purposes.
However, the docs don't explain, aside from the minor benefit of not having to export a serving function, why one wouldn't just use the Predict SignatureDef for everything, since it appears to be a superset with plenty of upside. I'd love to see a definitive answer on this, as the benefits of the specialized functions (classify, regress) seem quite minimal.
The differences I've seen so far are...
1) If using tf.feature_column.indicator_column wrapping tf.feature_column.categorical_column_with_vocabulary_* in a DNNClassifier model, I've had problems with the Predict API sometimes not being able to parse/map string inputs according to the vocabulary file/list when querying the TensorFlow server. The Classify API, on the other hand, properly mapped strings to their index in the vocabulary (categorical_column), then to the one-hot/multi-hot encoding (indicator_column), and provided (what seems to be) the correct classification response to the query.
2) The response format: [[class, score], [class, score], ...] for the Classify API vs. [class[], score[]] for the Predict API. One or the other may be preferable if you need to parse the data in some way afterwards (see the request sketch below).
TL;DR: With indicator_column wrapped around categorical_column_with_vocabulary_*, I've experienced issues with the vocabulary mapping when serving with the Predict API, so I'm using the Classify API.
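For reference, this is roughly what the two request shapes look like against TensorFlow Serving's REST API; the host, model name, signature names, and feature keys are assumptions, and the exact response layout can vary with the exported model.

    import requests

    BASE = "http://localhost:8501/v1/models/my_model"

    # Predict API: raw tensors in ("instances" keyed by input name), raw tensors out
    predict_resp = requests.post(
        f"{BASE}:predict",
        json={"signature_name": "serving_default",
              "instances": [{"x": [1.0, 2.0, 3.0]}]})
    print(predict_resp.json())        # e.g. {"predictions": [...]}

    # Classify API: tf.Example-style feature maps in, (class, score) pairs out
    classify_resp = requests.post(
        f"{BASE}:classify",
        json={"signature_name": "classification",
              "examples": [{"age": 42.0, "occupation": "engineer"}]})
    print(classify_resp.json())       # nested lists of [class, score] pairs, one per example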

twitter/facebook comments classification into various categories

I have a dataset of comments which I want to classify into five categories:
jewelries, clothes, shoes, electronics, food & beverages
So if someone is talking about pork, steak, wine, soda, or eating, it is classified into food & beverages,
whereas if somebody is talking about, say, gold, pendants, lockets, etc., it is classified into jewelries.
I want to know what tags/tokens I should be looking for in a comment/tweet so as to classify it into any of these categories, and finally which classifier to use. I just need some guidance and suggestions; I'll take it from there.
Please help. Thanks.
This answer can be a bit long and perhaps I abstract a few things away, but it's just to give you an idea and some advice.
Supervised Vs Unsupervised
As others have already mentioned, in the land of machine learning there are 2 main roads: supervised and unsupervised learning. As you probably already know by now, if your corpus (documents) is labeled, you are talking about supervised learning. The labels are the categories and are in this case boolean values.
For instance, if a text is related to clothes and shoes, the labels for those categories should be true.
Since a text can be related to multiple categories (multiple labels), we are looking at multi-label classification.
What to use?
I presume that the dataset is not yet labeled, since Twitter does not do this categorisation for you. So here comes a big decision on your part.
You label the data manually, which means you look at as many tweets/FB messages in your dataset as you can and, for each of them, consider the 5 categories and answer them with True/False.
You decide to use an unsupervised learning algorithm and hope that you discover these 5 categories. Approaches like clustering will just try to find categories on their own, and these don't have to match your 5 predefined categories by default.
I've used supervised learning quite a bit in the past and have had good experience with this type of learning, so I will continue explaining this path.
Feature Engineering
You have to come up with the features that you want to use. For text classification, a good approach is to use each possible word in the documents as a feature. A value of True represents that the word is present in the document; False represents its absence.
Before doing this, you need to do some preprocessing. This can be done by using various features provided by the NLTK library.
Tokenization: this will break your text up into a list of words. You can use this module.
Stopword removal: this will remove common words from the tokens, words like 'a', 'the', ... You can take a look at this.
Stemming: this will transform words to their stem form. For example, the words 'working', 'worked', 'works' will all be transformed to 'work'. Take a look at this.
Once you have preprocessed the data, generate a feature set for each document from the words that exist in the documents. There exist automatic methods and filters for this, but I'm not sure how to do this in Python.
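A minimal sketch of those preprocessing steps and the boolean word features with NLTK (the download calls and the example comment are just for illustration):

    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    nltk.download("punkt")
    nltk.download("stopwords")

    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    def preprocess(text):
        tokens = word_tokenize(text.lower())                                  # tokenization
        tokens = [t for t in tokens if t.isalpha() and t not in stop_words]   # stopword removal
        return [stemmer.stem(t) for t in tokens]                              # stemming

    def featureset(tokens, vocabulary):
        # True if the word occurs in the document, False otherwise
        return {word: (word in tokens) for word in vocabulary}

    comment = "Loving the new leather shoes I bought, they go great with jeans!"
    print(preprocess(comment))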
Classification
There are multiple classifiers that you can use for this purpose. I suggest taking a deeper look at the ones that exist and their benefits. You can use the NLTK classifier, which supports multi-label classification, but to be honest I have never tried that one before. In the past I've used Logistic Regression and SVMs.
Training & testing
You will use part of your data for training and part for validating whether the trained model performs well. I suggest you use cross-validation, because you will have a small dataset (you have to manually label the data, which is cumbersome). The benefit of cross-validation is that you don't have to split your dataset into a fixed training set and testing set. Instead it runs in multiple rounds, each time using part of the data for training and part for testing, so that all of the data is used at least once as training data.
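A small cross-validation sketch with scikit-learn (the tiny labeled texts stand in for your manually labeled comments):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    texts = ["gold pendant and silver locket", "diamond ring on sale", "new gold bracelet",
             "best steak and red wine", "soda and snacks combo", "grilled pork with beer"]
    labels = ["jewelries", "jewelries", "jewelries", "f&b", "f&b", "f&b"]

    model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(model, texts, labels, cv=3)   # 3 rounds; every sample is tested once
    print(scores.mean())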
Predicting
Once your model is built and the outcome of the predictions on test data is plausible, you can use your model in the wild to predict the categories of new Facebook messages/tweets.
Tools
The NLTK library is great for preprocessing and natural language processing, but I have never used it for classification. I've heard a lot of great things about the scikit-learn Python library. But to be honest, I prefer to use Weka, a data mining tool written in Java that offers a great UI and speeds up your task a lot!
From a different angle: Topic modelling
In your question you state that you want to classify the dataset into five categories. I would like to show you the idea of topic modelling. It might not be useful in your scenario if you are really only targeting those categories (that's why I leave this part at the end of my answer). However, if your goal is to categorise the tweets/FB messages into non-predefined categories, topic modelling is the way to go.
Topic modelling is an unsupervised learning method where you decide in advance the number of topics (categories) you want to 'discover'. This number can be high (e.g. 40). Now the cool thing is that the algorithm will find 40 topics, each containing words that are related to each other. It will also output, for each document, a distribution indicating which topics the document is related to. This way you can discover a lot more categories than your 5 predefined ones.
I'm not going to go much deeper into this, but just google it if you want more information. In addition, you could consider using MALLET, which is an excellent tool for topic modelling.
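A minimal topic-modelling sketch with gensim's LDA (MALLET is an alternative); the toy corpus is a placeholder, and num_topics would be much higher (e.g. 40) on real data:

    from gensim import corpora
    from gensim.models import LdaModel

    docs = [["gold", "pendant", "locket", "silver"],
            ["steak", "wine", "soda", "pork"],
            ["smartwatch", "battery", "charger", "phone"]]

    dictionary = corpora.Dictionary(docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

    lda = LdaModel(bow_corpus, num_topics=3, id2word=dictionary, passes=10)
    print(lda.print_topics())                      # the words that characterize each topic
    print(lda.get_document_topics(bow_corpus[0]))  # topic distribution for one document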
Well this is kind of a big subject.
You mentioned Python, so you should have a look at the NLTK library, which allows you to process natural language such as your comments.
After this step, you should have a classifier which maps the words you retrieved to a certain class. NLTK also has tools for classification that are linked to knowledge databases. If you are lucky, the categories you are looking for are already available; otherwise you may have to build them yourself. You can have a look at this example, which uses NLTK and the WordNet database. You can get access to the Synsets, which seem to be pretty broad, and you can also have a look at the hypernyms (see for example list(dog.closure(hyper))).
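That WordNet lookup, spelled out with NLTK (the 'wordnet' corpus has to be downloaded first):

    import nltk
    nltk.download("wordnet")
    from nltk.corpus import wordnet as wn

    dog = wn.synset("dog.n.01")
    hyper = lambda s: s.hypernyms()
    print(list(dog.closure(hyper)))   # walks up the hypernym hierarchy: canine, carnivore, ...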
Basically you should consider using a multi-label classifier on the whole tokenized text (comments on Facebook and tweets are usually short; you might also decide to only consider FB comments below 200 characters, your choice). The choice of a multi-label classifier is motivated by the non-orthogonality of your classification set (clothes, shoes and jewelries can be the same object; you could have electronic jewelry, e.g. smartwatches, etc.). This is a fairly simple setup, but it's an interesting first step whose strengths and weaknesses will allow you to iterate easily (if needed).
Good luck!
What you're looking for falls under two subjects:
Natural Language Processing (NLP): processing text data, and
Machine learning: where the classification models are built.
First I would suggest going through NLP tutorials and then text classification tutorials, the most appropriate being https://class.coursera.org/nlp/lecture
If you're looking for libraries available in python or java, take a look at Java or Python for Natural Language Processing
If you're new to text processing, please take a look at the NLTK library that provides a nice introduction to doing NLP, see http://www.nltk.org/book/ch01.html
Now to the hard core details:
First, ask yourself whether you have twitter/facebook comments (let's call them documents from now on) that are manually labelled with the categories you want.
1a. If YES, look at supervised machine learning, see http://scikit-learn.org/stable/tutorial/basic/tutorial.html
1b. If NO, look at UNsupervised machine learning; I suggest clustering and topic modelling, http://radimrehurek.com/gensim/
After knowing which kind of machine learning you need, split the documents up into at least a training set (70-90%) and a testing set (10-30%).
Note: I say 'at least' because there are other ways to split up your documents, e.g. for development or cross-validation. (If you don't understand this, it's all right, just follow step 2.)
Finally, Train and Test your model
3a. If supervised, use the training set to train your supervised model. Apply the model to the test set and then see how well it performed (a small example follows at the end of this answer).
3b. If unsupervised, use the training set to generate document clusters (that means grouping similar documents), but they still have no labels. So you need to think of some smart way to label the groups of documents correctly. (To date, there is no really good solution to this; even super-effective neural networks cannot tell what their neurons are firing on, they just know each neuron is firing on something specific.)
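To make steps 2 and 3a concrete, here is a small supervised sketch with scikit-learn (the toy documents and labels are placeholders for your labelled comments):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    docs = ["gold pendant locket", "cheap smartwatch deal", "steak and wine tonight",
            "new leather boots", "silver ring sale", "phone charger bundle",
            "pork and soda combo", "denim jacket restock"]
    labels = ["jewelries", "electronics", "f&b", "shoes",
              "jewelries", "electronics", "f&b", "clothes"]

    # Step 2: split into a training set and a testing set (~75% / 25%)
    train_docs, test_docs, train_y, test_y = train_test_split(
        docs, labels, test_size=0.25, random_state=0)

    # Step 3a: train on the training set, then see how well it performs on the test set
    clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    clf.fit(train_docs, train_y)
    print(clf.score(test_docs, test_y))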
