Think about a platform where a user chooses which factors he gives more importance to. For example, 5 criteria: A, B, C, D, E.
Then each product review has a weighting for A1, B1, C1, D1, E1. So, if he gave more importance to A, the weighting takes that into consideration. The result is that each review can have a different overall score for each user.
My problem is the algorithm behind this: the processing is currently slow.
For each category summary, I need to iterate over all companies of that category, and all reviews for each company.
# Step 1: find companies of category X with more than 1 published review
companies_X = [1, 2, 3, 5, n]
# Step 2: iterate over all those companies and all of their reviews
for company in companies:
    for review in company:
        # calculate the weighting of the review for the current user's criteria
        # give more importance to recent reviews
# Step 3: average all reviews of each company
# Step 4: average over all companies of this category to create the final score for category X
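To make the weighting concrete, here is roughly what one review's score looks like in plain Python; the criterion names, the weights dict and the recency decay are only illustrative, not my exact formula:

from datetime import datetime, timezone

def review_score(review, user_weights):
    # weighted sum over the criteria, e.g. user_weights = {"A": 0.4, "B": 0.3, "C": 0.1, "D": 0.1, "E": 0.1}
    weighted = sum(review["scores"][crit] * w for crit, w in user_weights.items())
    # give more importance to recent reviews (simple decay by age in days, purely illustrative)
    age_days = (datetime.now(timezone.utc) - review["published_at"]).days
    return weighted / (1 + age_days / 365)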
This works, but I can't have a page that takes 30 seconds to load.
I am thinking about caching this page, but in that case I would need to process it for every user in the background. Definitely not a good solution.
Any ideas about improvements? Any insight will be welcome.
First option: using numpy and pandas could improve your speed if leveraged in a smart way, i.e. by avoiding loops whenever possible. This can be done with vectorised column operations or the apply method of pandas, along with a condition or a lambda function.
The loop

for company in companies:
    for review in company:

can be replaced by a single column assignment: review_data["note"] = note_formula(review_data["number_reviews"])
Edit: here note_formula is a function returning the weighting of the review, as indicated in the comments of the question:
# calculate the weighing of the review for the current user criteria
# give more importance to recent reviews
Your steps 3 and 4 can be performed with the groupby method from pandas followed by an average.
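For example, assuming a reviews DataFrame with one row per review, columns A-E for the criterion scores, plus company_id and age_days columns (the names are assumptions), steps 2 to 4 could look roughly like this, with no Python-level loop:

import pandas as pd

def category_score(reviews: pd.DataFrame, user_weights: dict) -> float:
    # step 2: weighted note per review, computed column-wise
    note = sum(reviews[crit] * w for crit, w in user_weights.items())
    note = note / (1 + reviews["age_days"] / 365)   # favour recent reviews
    # step 3: average note per company
    per_company = note.groupby(reviews["company_id"]).mean()
    # step 4: average over the companies of the category
    return per_company.mean()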
Second option: where is your data stored? If it is in a database, a good rule to boost performance is to move the data as little as possible: perform the request directly in the database (I think all your operations can be written in SQL) and send only the result to the Python script. If your data is stored in another way, consider using a database engine, for instance SQLite at the beginning if you don't aim to scale fast.
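For instance, with SQLite the whole computation can be pushed into a single query; the reviews table and its columns below are of course only assumptions about your schema:

import sqlite3

def category_score(conn: sqlite3.Connection, category_id: int, w: dict) -> float:
    query = """
        SELECT AVG(company_note) FROM (
            SELECT AVG(:wa * A + :wb * B + :wc * C + :wd * D + :we * E) AS company_note
            FROM reviews
            WHERE category_id = :cat
            GROUP BY company_id
            HAVING COUNT(*) > 1  -- only companies with more than one review
        ) AS per_company
    """
    params = {"cat": category_id, "wa": w["A"], "wb": w["B"],
              "wc": w["C"], "wd": w["D"], "we": w["E"]}
    return conn.execute(query, params).fetchone()[0]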
I have a website built with Node.js and MongoDB. Documents are structured something like this:
{
  price: 500,
  location: [40.23, 49.52],
  category: "A"
}
Now I want to create a recommendation system, so when a user is watching item "A" I can suggest to him/her similar items "B", "C" and "D".
The thing is, the collection of items changes relatively often. New items are created every hour and they exist only for about a month.
So my questions are:
What algorithm should I use? Cosine similarity seems to be the most suitable one.
Is there a way to create such a recommendation system with Node.js, or is it better to use Python/R?
When must the similarity score be calculated? Only once (when a new item is created), or should I recalculate it every time a user visits an item page?
What algorithm should I use? Cosine similarity seems to be the most suitable one.
No one can really answer this for you: what makes a product "similar" is 100% a product decision. It sounds like this is more of a pet side project, and in that case I'd say use whatever you'd like.
If this is not the case, I would assume the best recommendations would be based on purchase correlation, i.e. previous users that bought product "A" also bought (or looked at) product "B" the most, hence "B" should be the top recommendation. Obviously you can create a much more complex model in the future.
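A toy sketch of that co-purchase counting (shown in Python, though the same logic is trivial in Node; the data layout is just an assumption):

from collections import Counter, defaultdict

def co_purchase_recommendations(purchases, top_n=3):
    # purchases is assumed to be an iterable of (user_id, item_id) pairs
    items_per_user = defaultdict(set)
    for user_id, item_id in purchases:
        items_per_user[user_id].add(item_id)
    co_counts = defaultdict(Counter)
    for items in items_per_user.values():
        for a in items:
            for b in items:
                if a != b:
                    co_counts[a][b] += 1
    # for every item, the items most often bought together with it
    return {item: [other for other, _ in counts.most_common(top_n)]
            for item, counts in co_counts.items()}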
Is there a way to create such a recommendation system with Node.js, or is it better to use Python/R?
If it's a basic rule-based system, it can be done in Node with ease; for any more data-science-related approach it will be more natural to implement it in Python/R.
When must the similarity score be calculated? Only once (when a new item is created), or should I recalculate it every time a user visits an item page?
Again, it depends on what your score is, how many resources you can invest, what the scale is, etc.
As I mentioned before, it sounds like this is a personal project. If that is the case, I would try to choose the simplest solution for each of these questions. Once you have the entire project up and running, it'll be easier to improve on.
I collected some product reviews of a website from different users, and I'm trying to find similarities between products through the use of the embeddings of the words used by the users.
I grouped each review per product, so that I can have different reviews succeeding one after the other in my dataframe (i.e. different authors for one product). Furthermore, I also already tokenized the reviews (and applied all other pre-processing methods). Below is a mock-up dataframe of what I have (the number of tokens per product is actually very high, as is the number of products):
===========  ====================================================
Product      reviews_tokenized
===========  ====================================================
XGame3000    absolutely amazing simulator feel inaccessible ...
Poliamo      production value effect tend cover rather ...
Artemis      absolutely fantastic possibly good oil ...
Ratoiin      ability simulate emergency operator town ...
===========  ====================================================
However, I'm not sure which would be more efficient between Doc2Vec and Word2Vec. I would initially go for Doc2Vec, since it has the ability to find similarities by taking the paragraph/sentence into account and to find its topic (which I'd like to have, since I'm trying to cluster products by topic), but I'm a bit worried that the reviews being from different authors might bias the embeddings. Note that I'm quite new to NLP and embeddings, so some notions may escape me. Below is my code for Doc2Vec, which gives me a quite good silhouette score (~0.7).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np

# one TaggedDocument per product, tagged with its row index
product_doc = [TaggedDocument(doc.split(' '), [i]) for i, doc in enumerate(df.tokens)]
model3 = Doc2Vec(min_count=1, seed=SEED, ns_exponent=0.5)
model3.build_vocab(product_doc)
model3.train(product_doc, total_examples=model3.corpus_count, epochs=model3.epochs)
# one inferred vector per product, used afterwards for clustering
product2vec = [model3.infer_vector(df['tokens'][i].split(' ')) for i in range(len(df['tokens']))]
dtv = np.array(product2vec)
What do you think would be the most efficient method to tackle this? If something is not clear enough, please tell me.
Thank you for your help.
EDIT: [plot of the clusters I'm obtaining]
There's no way to tell which particular mix of methods will work best for a specific dataset and particular end-goal: you really have to try them against each other, in your own reusable pipeline for scoring them against your desired results.
It looks like you've already stripped the documents down to keywords rather than original natural text, which could hurt with these algorithms - you may want to try it both ways.
Depending on the size & format of your texts, you may also want to look at doing "Word Mover's Distance" (WMD) comparisons between sentences (or other small logical chunks of your data). Some work has demo'd interesting results in finding "similar concerns" (even with different wording) in the review domain, eg: https://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/
Note, though, WMD gets quite costly to calculate in bulk with larger texts.
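If you want to experiment with that, a minimal sketch with gensim could look like this (reusing the df and SEED from your code; depending on the gensim version, wmdistance needs an extra optimal-transport package installed):

from gensim.models import Word2Vec

tokenised = [doc.split(' ') for doc in df['tokens']]
w2v = Word2Vec(tokenised, min_count=1, seed=SEED)
# WMD between the reviews of the first two products (lower = more similar)
distance = w2v.wv.wmdistance(tokenised[0], tokenised[1])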
I have two different product datasets of 5.4 million and 4.5 million products, scraped from the competitor's website. Most products are non-branded and don't have any unique standard SKU. I want to compare 300K products with the similar products our competitor is selling and find out the price difference.
I have tried comparing the datasets using two different Sphinx indexes with similar words, but I was not able to get good results, because for non-branded products without a standard brand name, title or SKU the titles are not similar.
Is there any way to get the result using ML or some big-data algorithm?
If you use Sphinx/Manticore you can:
take each of your products from dataset 1
convert it into a query using a quorum operator with a percentage threshold and a ranking formula of your choice (see the rough sketch below)
run the query against dataset 2
find results
take top K
There are some additional tricks that can help, like:
IDF boosting
skipping stop-words
use of atc-based ranking
The tricks and the concept of finding similar content in general are described in this interactive course - https://play.manticoresearch.com/mlt/
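To make the quorum idea a bit more concrete, here is a rough sketch over Manticore's SQL endpoint (pymysql on port 9306, the products2 index name and the 0.7 threshold are all assumptions about your setup):

import pymysql

conn = pymysql.connect(host='127.0.0.1', port=9306)
title = 'wireless optical mouse 1600dpi black'   # a product title from dataset 1
with conn.cursor() as cur:
    # quorum operator: match documents containing at least ~70% of the query words
    cur.execute(
        "SELECT id, title, WEIGHT() AS w FROM products2 "
        "WHERE MATCH(%s) ORDER BY w DESC LIMIT 5",
        ('"' + title + '"/0.7',)
    )
    top_matches = cur.fetchall()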
I'm trying to prepare a dataset for scikit-learn, planning to build a pandas dataframe to feed to a decision tree classifier.
The data represents different companies with varying criteria, but some criteria can have multiple values - such as "Customer segment" - which, for any given company, could be any, or all of: SMB, midmarket, enterprise, etc. There are other criteria/columns like this with multiple possible values. I need decisions made upon individual values, not the aggregate - so company A for SMB, company A for Midmarket, and not for the "grouping" of customer A for SMB AND midmarket.
Is there guidance on how to handle this? Do I need to generate rows for every variant for a given company to be fed into the learning routine? Such that an input of:
Company,Segment
A,SMB:MM:ENT
becomes:
A, SMB
A, MM
A, ENT
As well as for any other variants that may come from additional criteria/columns - for example "customer vertical" which could also include multiple values? It seems like this will greatly increase the dataset size. Is there a better way to structure this data and/or handle this scenario?
My ultimate goal is to let users complete a short survey with simple questions, and map their responses to values to get a prediction of the "right" company, for a given segment, vertical, product category, etc. But I'm struggling to build the right learning dataset to accomplish that.
Let's try.
import pandas as pd

df = pd.DataFrame({'company': ['A', 'B'], 'segment': ['SMB:MM:ENT', 'SMB:MM']})
# split the multi-valued column into one column per value
expanded_segment = df.segment.str.split(':', expand=True)
expanded_segment.columns = ['segment' + str(i) for i in range(len(expanded_segment.columns))]
wide_df = pd.concat([df.company, expanded_segment], axis=1)
# melt back to long format: one row per (company, segment value) pair
result = pd.melt(wide_df, id_vars=['company'], value_vars=list(set(wide_df.columns) - set(['company'])))
result = result.dropna()
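On recent pandas (0.25+) the same reshaping can be done a bit more directly with explode:

# one row per (company, segment) pair, no NaN rows to drop
df.assign(segment=df.segment.str.split(':')).explode('segment')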
Let's say I have a set of users, a set of songs, and a set of votes on each song:
=========== =========== =======
User Song Vote
=========== =========== =======
user1 song1 [score]
user1 song2 [score]
user1 song3 [score]
user2 song1 [score]
user2 song2 [score]
user2 song3 [score]
user3 song1 [score]
user3 song2 [score]
user3 song3 [score]
user-n song-n [score]
=========== =========== =======
What's the most efficient way to calculate user similarity based on song votes? Is there a better way than iterating over every user and every vote for every song?
There are two common metrics that can be used to find similarities between users:
Euclidean Distance, which is exactly what you are thinking of: imagine an n-dimensional space with one axis per song reviewed by both involved users (u1 and u2), where the value on each axis is the score. You can easily calculate similarity as follows:
for every song reviewed by both u1 and u2, calculate pow(u1.song.score - u2.song.score, 2) and add them all together into sum_of_powers. The similarity coefficient is then given by 1 / (1 + sqrt(sum_of_powers)).
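In code that formula is simply (votes assumed to be dicts mapping song -> score):

from math import sqrt

def euclidean_similarity(votes_u1, votes_u2):
    common = set(votes_u1) & set(votes_u2)   # songs rated by both users
    if not common:
        return 0.0
    sum_of_powers = sum((votes_u1[s] - votes_u2[s]) ** 2 for s in common)
    return 1 / (1 + sqrt(sum_of_powers))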
Pearson Correlation (or correlation coefficient): a better approach that measures how strongly two data sets are related to one another. This approach uses more complex formulas and a little statistics background; check it here: wiki. You will have a graph for every pair of users, on which you plot points according to the scores: for example, if aSong has been voted 2 by u1 and 4 by u2, you plot the point (2, 4) (assuming u1 is the x-axis and u2 is the y-axis).
Just to clarify: you use linear regression to find two coefficients A and B that describe the line minimizing the distance from all the points of the graph. This line has the formula y = Ax + B. If the two sets are similar, the points should lie near the main diagonal, so A should tend to 1 and B to 0. Don't take this explanation as complete or as a reference, because it lacks soundness and typical mathematical formalism; it's just to give you an idea.
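You don't have to implement the formula yourself; for two users' scores on the same songs, something like this gives the coefficient directly (scipy assumed available, numpy.corrcoef works too):

from scipy.stats import pearsonr

u1_scores = [2, 4, 5, 1]   # both lists aligned on the same songs
u2_scores = [4, 5, 5, 2]
r, _ = pearsonr(u1_scores, u2_scores)   # r close to 1 means very similar taste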
EDIT:
As others have written, more complex algorithms to cluster data exist, like k-means, but I suggest you start with the easy ones (you only need something more difficult once you realize the results are not enough).
I recommend the book Programming Collective Intelligence from Toby Segaran. Chapter 3 describes different clustering methods like Hierarchical Clustering and K-means Clustering.
The source code for the examples is available here
If you want the most accurate results, then no, you'd have to iterate over everything.
If your database is large enough, you could just take a statistical sample, say between 1,000 and 10,000 users, and match against that.
You would also be better off to add some more tables to the database, store the results, and only update it every so often, instead of calculating this on the fly.
Ilya Grigorik did a series on recommendation algorithms, though he was focusing on Ruby. It appears to be under the machine learning section in his archives, but there isn't a direct section link.
If you want to do it in an approximate way without visiting all the records, you can use the Jaccard Coefficient. It probably needs some adaptation if you want to take the scores into account, but I guess that's the best solution if your system is too big and you don't have time to check all the records.
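In its plain form (scores ignored, only which songs each user voted on) it is just:

def jaccard_similarity(songs_u1: set, songs_u2: set) -> float:
    # fraction of all songs either user voted on that both voted on
    if not (songs_u1 or songs_u2):
        return 0.0
    return len(songs_u1 & songs_u2) / len(songs_u1 | songs_u2)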
I think a lot of people on here are missing the simplicity of the question. He didn't say anything about creating a rating prediction system. He just wants to compute the similarity between each user's song rating behavior and each other user's song rating behavior. The Pearson correlation coefficient gives exactly that. Yes, you must iterate over every user/user pair.
EDIT:
After thinking about this a little more:
Pearson is great if you want the similarity between two users' tastes, but not their level of "opinionatedness"... one user who rates a series of songs 4, 5, and 6 will correlate perfectly with another user who rates the same songs 3, 6, and 9. In other words, they have the same "taste" (they would rank the songs in the same order), but the second user is much more opinionated. Put another way, the correlation coefficient treats any two rating vectors with a linear relationship as equal.
However, if you want the similarity between the actual ratings the users gave each song, you should use the root mean squared error between the two rating vectors. This is a purely distance based metric (linear relationships do not play into the similarity score), so the 4,5,6 and 3,6,9 users would not have a perfect similarity score.
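A quick check with numpy on the 4,5,6 vs 3,6,9 example makes the difference obvious:

import numpy as np

u1 = np.array([4, 5, 6])
u2 = np.array([3, 6, 9])
pearson = np.corrcoef(u1, u2)[0, 1]        # 1.0: identical "taste"
rmse = np.sqrt(np.mean((u1 - u2) ** 2))    # ~1.91: the actual ratings clearly differ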
The decision comes down to what you mean by "similar"...
That is all.
You should be able to find a good algorithm in this book: The Algorithm Design Manual by Steven Skiena.
The book has a whole bunch of algorithms for various purposes. You want a graph clustering algorithm, I think. I don't have my copy of the book handy, so I can't look it up for you.
A quick Google search found a Wikipedia page: http://en.wikipedia.org/wiki/Cluster_analysis Perhaps that will help, but I think the book explains algorithms more clearly.