Applying machine learning to recommend items from an existing database - python

I've got an existing database full of objects (I'll use books as an example). When users login to a website I'd like to recommend books to them.
I can recommend books based on other people they follow etc., but I'd like to be more accurate, so I've collected a set of training data for each user.
The data is collected by repeatedly presenting each user with a book and asking them if they like the look of it or not.
The training data is stored in mongodb, the books are stored in a postgres database.
I've written code to predict whether or not a given user will like a given book based on their training data, but my question is this:
How should I apply the data / probability to query books in the postgres database?
Saving the probability a user likes a book for every user and every book would be inefficient.
Loading all of the books from the database and calculating the probability for each one would also be inefficient.

I've written code to predict whether or not a given user will like a given book based on their training data
What does this code look like? Ideally it's some kind of decision tree based on attributes of the book (genre, length, etc.), and is technically called a classifier. A simple example:
    # a toy decision tree over attributes of the book
    if book.genre in user.genres:
        if book.length < user.maxLength:
            print("10% off, today only!")
    print("how about some garden tools?")
Saving the probability a user likes a book for every user and every book would be inefficient.
True. Note that the above decision tree may be formulated as a database query:
SELECT * FROM Books WHERE Genre IN [user.genres] AND Length < [user.maxLength]
This query gives you all the books that have the highest probability of being liked by the user, with respect to the training data.
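To make that concrete, here is a minimal sketch of how the learned per-user preferences could drive a single parameterized query against Postgres. It assumes psycopg2 and an illustrative books table with genre and length columns; adapt the names to your schema.

    import psycopg2  # assumed driver; any DB-API-compliant driver works the same way

    def recommend_books(conn, liked_genres, max_length, limit=20):
        """Pull only the candidate books the classifier says the user is likely to like."""
        query = """
            SELECT id, title, genre, length
            FROM books
            WHERE genre = ANY(%s)   -- genres the user's training data favours
              AND length < %s       -- below the user's preferred maximum length
            LIMIT %s;
        """
        with conn.cursor() as cur:
            cur.execute(query, (list(liked_genres), max_length, limit))
            return cur.fetchall()

    # Usage (connection parameters are placeholders):
    # conn = psycopg2.connect("dbname=books user=app")
    # recommend_books(conn, {"sci-fi", "fantasy"}, max_length=400)

This way the per-user preferences are stored once (per user, not per user-book pair) and the database does the filtering.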

Related

Recommendation system for frequently changing data in MongoDB

I have a website built with Node.js and MongoDB. Documents are structured something like this:
    {
        price: 500,
        location: [40.23, 49.52],
        category: "A"
    }
Now I want to create a recommendation system, so when a user is watching item "A" I can suggest to him/her similar items "B", "C" and "D".
The thing is, the collection of items changes relatively often. New items are created every hour and they only exist for about a month.
So my questions are:
What algorithm should I use? Cosine similarity seems to be the most suitable one.
Is there a way to create such a recommendation system with Node.js, or is it better to use Python/R?
When must the similarity score be calculated? Only once (when a new item is created), or should I recalculate it every time a user visits an item page?
What algorithm should I use? Cosine similarity seems to be the most suitable one.
No one can really answer this for you: what makes one product similar to another is 100% a product decision. It sounds like this is more of a pet side project, and in that case I'd say use whatever you'd like.
If that's not the case, I would assume the best recommendations would be based on purchase correlation, i.e. users who bought (or looked at) product "A" also bought product "B" most often, so "B" should be the top recommendation. Obviously you can build a much more complex model in the future.
Is there a way to create such a recommendation system with Node.js, or is it better to use Python/R?
If it's a basic rule-based system it can be done in Node with ease; for any more data-science-oriented approach it will be more natural to implement it in Python/R.
When must the similarity score be calculated? Only once (when a new item is created), or should I recalculate it every time a user visits an item page?
Again, it depends on what your score is, how many resources you can invest, what the scale is, etc.
As I mentioned before, it sounds like this is a personal project. If that's the case, I would try to choose the simplest solution for each of these questions. Once you have the entire project up and running, it'll be easier to improve on it.
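That said, if you do want to experiment with cosine similarity, a minimal self-contained sketch in Python could look like this. The way price, location and category are turned into a vector here is purely illustrative; real data would need proper scaling and encoding.

    import numpy as np

    def to_vector(item, categories):
        # Rough feature vector: scaled price, raw coordinates, one-hot category.
        one_hot = [1.0 if item["category"] == c else 0.0 for c in categories]
        return np.array([item["price"] / 1000.0, *item["location"], *one_hot])

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    items = [
        {"_id": 1, "price": 500, "location": [40.23, 49.52], "category": "A"},
        {"_id": 2, "price": 480, "location": [40.20, 49.55], "category": "A"},
        {"_id": 3, "price": 900, "location": [41.00, 50.00], "category": "B"},
    ]
    categories = sorted({i["category"] for i in items})
    target = to_vector(items[0], categories)

    # Rank the remaining items by similarity to the item being viewed
    ranked = sorted(items[1:],
                    key=lambda i: cosine_similarity(target, to_vector(i, categories)),
                    reverse=True)
    print([i["_id"] for i in ranked])  # most similar first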

Python Classification Model

I have a df with many columns of info about Home Depot customer accounts. Some fields are accountname, industry, territory, country, state, city, services, etc...
I need to build a model using Python that will let me put in a customer accountname and get as output the customer accounts most similar to the one I put in.
So let’s say I put in customeraccount ‘Jon Doe’
I want to get other customer accounts similar to Jon Doe based on features like industry, country, and other categorical variables.
How can I approach this? What kind of a model would I need to build?
You need to create some metric for "closeness": your definition of distance.
You need a way to compare all (or all of the relevant) fields of one record with those of another.
The best/easiest skeletal function I can come up with right now is:
    def rowDist(rowA, rowB):
        # weighted sum of per-field distances; the helper functions and weights are yours to define
        return (industryDistance(rowA.industry, rowB.industry) * industryDistanceWeight
                + geographicalDistance(rowA, rowB) * geographicalDistanceWeight)
Then you just search for rows with lowest distance.
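To make the "search for rows with the lowest distance" step concrete, here is a rough self-contained sketch over a pandas DataFrame. The per-field distances, the weights and the column names are illustrative placeholders, not the right choices for your data.

    import pandas as pd

    # Illustrative per-field distances; replace with whatever fits your data
    def industryDistance(a, b):
        return 0.0 if a == b else 1.0

    def geographicalDistance(rowA, rowB):
        return 0.0 if (rowA.country == rowB.country and rowA.state == rowB.state) else 1.0

    industryDistanceWeight = 1.0
    geographicalDistanceWeight = 0.5

    def rowDist(rowA, rowB):
        return (industryDistance(rowA.industry, rowB.industry) * industryDistanceWeight
                + geographicalDistance(rowA, rowB) * geographicalDistanceWeight)

    def most_similar(df, accountname, k=5):
        """Return the k accounts with the smallest distance to the given account."""
        target = df[df.accountname == accountname].iloc[0]
        others = df[df.accountname != accountname].copy()
        others["distance"] = others.apply(lambda row: rowDist(target, row), axis=1)
        return others.nsmallest(k, "distance")

    # Usage: most_similar(df, "Jon Doe")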

Storing unstructured data for sentiment analysis

I am doing an NLP term project and am analyzing over 100,000 news articles from this corpus. https://github.com/philipperemy/financial-news-dataset
I am looking to perform sentiment analysis on this dataset using NLTK. However, I am a bit confused about how this pipeline should look for storing and accessing all of these articles.
The articles are text files that I read and perform some preprocessing on in order to extract some metadata and the main article text. Currently, I am storing the data from each article in a Python object such as this:
    {
        'title': title,
        'author': author,
        'date': date,
        'text': text,
    }
I would like to store these objects in a database so I don't have to read all of these files every time I want to do analysis. My problem is, I'm not really sure which database to use. I want to be able to use regexes on certain fields such as date and title so I can isolate documents by date and company names. I was thinking of going the NoSQL route and using a DB like MongoDB or CouchDB, or maybe even a search engine such as ElasticSearch.
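For example, the MongoDB option I'm considering would look roughly like this (a sketch assuming pymongo; the connection string and document values are placeholders):

    from pymongo import ASCENDING, MongoClient

    client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
    articles = client["news"]["articles"]

    # Index the fields I want to filter on
    articles.create_index([("date", ASCENDING)])
    articles.create_index([("title", ASCENDING)])

    # Store one preprocessed article (placeholder values)
    articles.insert_one({
        "title": "Example Corp beats earnings estimates",
        "author": "Staff Writer",
        "date": "2015-10-27",
        "text": "Full article text ...",
    })

    # Later: isolate documents by date range and a company-name regex on the title
    cursor = articles.find({
        "date": {"$gte": "2015-01-01", "$lt": "2016-01-01"},
        "title": {"$regex": "Example Corp", "$options": "i"},
    })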
After I query for the documents I want to use for analysis, I will tokenize the text, POS-tag it, and perform NER using NLTK. I have already implemented this part of the pipeline. Is it smart to do this after the data is already indexed in the database? Or should I look at storing the processed data in the database as well?
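For reference, that processing step is essentially the standard NLTK calls, roughly like this (the sample sentence is a placeholder):

    import nltk

    # One-time downloads (uncomment on first run)
    # nltk.download("punkt")
    # nltk.download("averaged_perceptron_tagger")
    # nltk.download("maxent_ne_chunker")
    # nltk.download("words")

    text = "Example Corp reported record revenue in its fourth quarter."
    tokens = nltk.word_tokenize(text)   # tokenization
    tagged = nltk.pos_tag(tokens)       # POS tagging
    entities = nltk.ne_chunk(tagged)    # NER; organizations/persons appear as labelled chunks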
Finally, I will use this processed data to classify each article, using a trained model I've already developed. I already have a gold standard, so I will compare the classification against the gold standard.
Does this pipeline generally look correct? I don't have much experience with using large datasets like this.

Generating a good LDA model of Twitter in Python with correct input data

I'm dealing with topic modelling of Twitter to define profiles of individual Twitter users. I'm using the Gensim module to generate an LDA model. My question is about choosing good input data: I'd like to generate topics which I'd then assign to specific users. Right now I'm using a supervised method of choosing users from different categories on my own (sports, IT, politics, etc.) and putting their tweets into the model, but it's not very efficient or effective.
What would be a good method for generating meaningful topics of the whole Twitter?
Here is one profiling approach I used when I worked for a social media company.
Let's say you want to profile "sports" followers.
First, using the Twitter API, download all the followers of one famous sports handle, say "ESPN". It looks like this:
"ESPN": 51879246, #These are IDs who follow ESPN
2361734293,
778094964,
23000618,
2828513313,
2687406674,
2402689721,
2209802017,
Then you also download all handles that 51879246, 2361734293... are following. Those "topics" will be your features.
Now all you need to do is create the matrix X, whose size is the number of followers by the number of features. Then fill that matrix with a 1 whenever a follower follows a specific topic (feature) in your feature dictionary.
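A minimal sketch of that matrix construction (the `following` dictionary below is placeholder data standing in for what you collected from the Twitter API):

    import numpy as np

    # follower ID -> handles they follow (placeholder data)
    following = {
        51879246:   {"ESPN", "NBA", "nytimes"},
        2361734293: {"ESPN", "NFL"},
        778094964:  {"ESPN", "NBA", "NFL"},
    }

    # The handles followed by ESPN's followers become the feature dictionary
    features = sorted({handle for handles in following.values() for handle in handles})
    feature_index = {handle: j for j, handle in enumerate(features)}

    # Binary follower-by-feature matrix: X[i, j] = 1 if follower i follows feature j
    X = np.zeros((len(following), len(features)), dtype=np.int64)
    for i, handles in enumerate(following.values()):
        for handle in handles:
            X[i, feature_index[handle]] = 1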
Then here are a few simple lines to start playing with (using the lda package):

    import lda  # pip install lda

    model = lda.LDA(n_topics=5, n_iter=1000, random_state=1)
    model.fit(X)  # X is the follower-by-feature matrix built above

How to store multi-dimensional data

I am building a couple of web applications to store data using Django. This data would be generated from lab tests and might have up to 100 parameters being logged against time, which would leave me with a large matrix of data (time points by parameters).
I'm struggling to see how this would fit into a Django model as the number of parameters logged may change each time, and it seems inefficient to create a new model for each dataset.
What would be a good way of storing data like this? Would it be best to store it as a separate file and then just use a model to link a test to a datafile? If so, what would be the best format for fast access and for being able to quickly render and search through the data, generate graphs, etc. in the application?
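To illustrate the second option, the "model that links a test to a datafile" idea would look something like this (the model, the field names and the Parquet format are just assumptions for the sketch):

    from django.db import models

    class LabTest(models.Model):
        name = models.CharField(max_length=200)
        run_at = models.DateTimeField(auto_now_add=True)
        # The logged parameters live outside the relational schema,
        # in a file referenced by the model (Parquet here; could be HDF5/CSV).
        datafile = models.FileField(upload_to="lab_tests/")

        def load_dataframe(self):
            """Load the logged parameters for searching/plotting (assumes pandas + pyarrow)."""
            import pandas as pd
            return pd.read_parquet(self.datafile.path)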
In answer to the question below:
It would be useful to search through datasets generated from the same test for trend analysis etc.
As I'm still at the beginning with this site I'm using SQLite, but I'm planning to move to a full SQL database as it grows.
