Is there any way to get recommendations for a new user from an already trained WALS model, given the list of items the user liked?
Currently, to get a recommendation you must provide a user ID, and that user must be among the users the model was trained on. I would like to get recommendations by providing the list of items that a new user liked.
There is a similar feature in the implicit Python library.
What you are describing is called the "cold start problem": you want a recommendation with no information on the user's past behaviour.
Since matrix factorization needs historical data for the user, the best recommendation you can make without it is a most-popular recommendation.
To tackle this problem, you can add another type of model that handles this case. For example, you can bring the user's own data into the algorithm so that it is not entirely dependent on past training data, or do the recommendation online by analysing data gathered during the session.
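As a concrete illustration of the second option: if you can extract the trained item factors from the WALS model, you can estimate a latent vector for a new user with a least-squares "fold-in" against the items they liked, which is essentially what the recalculate_user option in the implicit library does. A minimal NumPy sketch; the regularization value and variable names are assumptions:

import numpy as np

def fold_in_user(item_factors, liked_item_ids, reg=0.1):
    # Ridge least-squares fit of the new user's latent vector:
    # solve (V^T V + reg*I) u = V^T 1, where V holds the liked items' factor rows
    V = item_factors[liked_item_ids]
    A = V.T @ V + reg * np.eye(V.shape[1])
    b = V.sum(axis=0)
    return np.linalg.solve(A, b)

def recommend_for_new_user(item_factors, liked_item_ids, top_n=10):
    u = fold_in_user(item_factors, liked_item_ids)
    scores = item_factors @ u          # score every item for the new user
    scores[liked_item_ids] = -np.inf   # exclude items already liked
    return np.argsort(-scores)[:top_n]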
I am working with Indonesian data for NER, and as far as I can tell there is no pretrained NLTK model for this language. So, to do it manually, I tried to extract all the unique words used in the entire data frame. I still don't know how to apply tags to the words, but this is what I did so far.
(screenshots of the first through fourth steps of the code were attached here)
Please let me know if there is a more convenient way to do what I did in the code shown. Also, let me know how to add tags to each row (if possible) and how to do NER from there.
(I am new to coding, which is why I don't know how to ask this well, but I am trying my best to provide as much information as possible.)
Depending on what you want to do: if results are all that matter, you could use a pretrained transformer model from Hugging Face instead of NLTK. This will be more computationally heavy but will also give you better performance.
There is one suitable model I could find (I don't speak Indonesian, obviously, so excuse any errors in the sample sentence):
https://huggingface.co/cahya/xlm-roberta-large-indonesian-NER?text=Nama+saya+Peter+dan+saya+tinggal+di+Berlin.
The easiest way to use this would probably be either the hosted API or an inference-only pipeline; check out this guide. All you would have to do to get this running for the Indonesian model is replace the model path from the guide (dslim/bert-base-NER) with cahya/xlm-roberta-large-indonesian-NER.
Note that this Indonesian model is quite large, so you need some decent hardware. If you don't have it, you could alternatively use a (free) cloud computing service such as Google Colab.
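For reference, a minimal sketch of the pipeline approach; the model weights are downloaded on first use, and the aggregation_strategy value is just my assumption of a sensible default:

from transformers import pipeline

ner = pipeline(
    "ner",
    model="cahya/xlm-roberta-large-indonesian-NER",
    aggregation_strategy="simple",  # merge sub-word tokens into whole entities
)

print(ner("Nama saya Peter dan saya tinggal di Berlin."))
# Expect entity spans such as "Peter" (person) and "Berlin" (location)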
Thank you for taking the time to check this question.
I am interested in creating a profile of each customer's buying pattern.
Once a profile exists for everyone, we take unseen data and check it against the profile to see whether the customer followed their own pattern; if not, we raise a flag. This way we do not set one fixed alert for all buyers, but detect anomalies per buyer, benchmarked against that buyer's own profile.
Any thoughts or input on how to approach this problem?
If you have a course or tutorial on this matter, please feel free to suggest it.
Thanks in advance.
You could go with a supervised machine learning method. For the buying pattern itself, I would suggest exploring the RFM rule, i.e. recency, frequency, and monetary value. This will help you create features for a model or to profile customers.
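A minimal sketch of computing RFM features with pandas; the column names (customer_id, order_date, amount) are assumptions about your transaction data:

import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["order_date"])
now = orders["order_date"].max()

# Recency, frequency, and monetary value per customer
rfm = orders.groupby("customer_id").agg(
    recency=("order_date", lambda d: (now - d.max()).days),  # days since last order
    frequency=("order_date", "count"),                       # number of orders
    monetary=("amount", "sum"),                              # total spend
)

Each row of rfm is then a per-customer profile that unseen transactions can be benchmarked against, e.g. by flagging purchases far outside that customer's historical range.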
Not sure if the title makes complete sense so sorry about that.
I'm new to Machine Learning and I'm using Scikit and decision trees.
Here's what I want to do: I want to take all of my inputs and include one unique feature, a client ID. The client ID is unique and can't be handled the way a feature normally is in decision tree analysis. What's happening now is that the tree treats the client IDs as any other integer value and branches on them, saying for instance that client IDs less than 430 go down a different path than those over 430. This isn't correct and isn't what I want. What I want is to make the decision tree understand that this specific field can't be analysed in such a way, and that each client should get its own branch. Is this possible with decision trees?
I do have a couple of workarounds. One would be to build a separate decision tree for each client, but training those would be a nightmare. Alternatively, with say 800 clients, I could create 800 one-hot (bit) features, but this is also crazy.
This is a fairly common problem in machine learning. A feature that is unique to each instance can't be used in any case; intuitively this makes sense, since the algorithm learns nothing from a feature it can't extrapolate from.
What you can do is simply separate that column out before you pass the rest of the features to the decision tree, and re-merge the ID with the prediction after it is made.
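A minimal sketch of that split-and-re-merge with pandas and scikit-learn; the file and column names are assumptions:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("clients.csv")
ids = df["client_id"]                        # held aside, never seen by the tree
X = df.drop(columns=["client_id", "label"])  # everything else is a real feature
y = df["label"]

clf = DecisionTreeClassifier().fit(X, y)

# Re-attach the IDs to the predictions afterwards
result = pd.DataFrame({"client_id": ids, "prediction": clf.predict(X)})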
I would strongly discourage any manipulation of the feature vector to include the ID in any form. Features should only be things the algorithm is meant to use to make decisions; don't give it information you don't want it to use. You're right to avoid using an ID as a feature, because (most likely) the ID has no bearing on whatever you're trying to predict.
If you do want individual models (and have enough data for each user to fit them), it's not as big a pain as you might be thinking. You can use scikit-learn's model-saving feature and this answer on saving pickles to MySQL to easily create and store personalized models. Unless you have a very large number of users, creating personalized decision trees shouldn't take very long.
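If you do go that route, here is a minimal sketch of per-client models persisted with joblib (the helper scikit-learn recommends for pickling models); the toy data and file-naming scheme are assumptions:

import numpy as np
from joblib import dump, load
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for your data: client_id -> (features, labels)
data = {
    101: (np.random.rand(50, 4), np.random.randint(0, 2, 50)),
    102: (np.random.rand(50, 4), np.random.randint(0, 2, 50)),
}

for client_id, (X_c, y_c) in data.items():
    model = DecisionTreeClassifier().fit(X_c, y_c)
    dump(model, f"model_client_{client_id}.joblib")  # one file per client

# Later, load back a specific client's model
model_101 = load("model_client_101.joblib")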
I'm working on topic modelling of Twitter data to define profiles of individual Twitter users. I'm using the Gensim module to generate an LDA model. My question is about choosing good input data: I'd like to generate topics which I would then assign to specific users. Right now I'm manually picking users from different categories (sports, IT, politics, etc.) and putting their tweets into the model, but this is neither efficient nor effective.
What would be a good method for generating meaningful topics of the whole Twitter?
Here is one kind of profiling I used to perform when I worked for a social media company.
Let's say you want to profile "sports" followers.
First, using the Twitter API, download all the followers of one famous sports handle, say "ESPN". It looks like this:
"ESPN": 51879246, #These are IDs who follow ESPN
2361734293,
778094964,
23000618,
2828513313,
2687406674,
2402689721,
2209802017,
Then you also download all the handles that 51879246, 2361734293, ... are following. Those "topics" will be your features.
Now all you need to do is create a matrix X of shape (number of followers) × (number of features), and fill it with 1 wherever a follower follows the given topic (feature) from your feature dictionary.
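A minimal sketch of that construction; the toy followed_by mapping stands in for the data collected via the API in the previous steps:

import numpy as np

# follower ID -> set of handles that follower follows (collected as above)
followed_by = {
    51879246: {"ESPN", "NBA"},
    2361734293: {"NBA", "FIFAcom"},
}
all_handles = sorted(set().union(*followed_by.values()))  # the "topics"

feature_index = {handle: j for j, handle in enumerate(all_handles)}
follower_ids = sorted(followed_by)

X = np.zeros((len(follower_ids), len(feature_index)), dtype=np.int64)
for i, follower in enumerate(follower_ids):
    for handle in followed_by[follower]:
        X[i, feature_index[handle]] = 1  # binary "follows this topic" flag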
Then here are a few simple lines to start playing with (using the lda package):

import lda

model = lda.LDA(n_topics=5, n_iter=1000, random_state=1)
model.fit(X)
I am building a couple of web applications to store data using Django. The data would be generated from lab tests and might have up to 100 parameters logged against time, leaving me with a large parameters-by-timestamps matrix of data.
I'm struggling to see how this would fit into a Django model, as the number of parameters logged may change each time, and it seems inefficient to create a new model for each dataset.
What would be a good way of storing data like this? Would it be best to store it as a separate file and then just use a model to link a test to a datafile? If so, what would be the best format for fast access, and for being able to quickly render and search through the data, generate graphs, etc., in the application?
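For what it's worth, a minimal sketch of the "model links a test to a datafile" layout described above; the field names and the file-format suggestion are assumptions:

from django.db import models

class LabTest(models.Model):
    name = models.CharField(max_length=200)
    run_at = models.DateTimeField(auto_now_add=True)
    # The raw parameter-vs-time matrix lives outside the relational schema
    # (e.g. Parquet, HDF5, or CSV) and is only linked from the model
    datafile = models.FileField(upload_to="lab_tests/")

Summary statistics or key parameters worth searching on can then be duplicated into regular model fields, so cross-dataset queries stay in the database while the bulk data stays in files.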
In answer to the question below:
It would be useful to search through datasets generated from the same test for trend analysis etc.
As I'm still at the beginning with this site I'm using SQLite, but I plan to move to a full SQL database as it grows.