Let's say I have a set of users, a set of songs, and a set of votes on each song:
=========== =========== =======
User        Song        Vote
=========== =========== =======
user1       song1       [score]
user1       song2       [score]
user1       song3       [score]
user2       song1       [score]
user2       song2       [score]
user2       song3       [score]
user3       song1       [score]
user3       song2       [score]
user3       song3       [score]
user-n      song-n      [score]
=========== =========== =======
What's the most efficient way to calculate user similarity based on song votes? Is there a better way than iterating over every user and every vote for every song?
There are two common metrics that can be used to find similarities between users:
Euclidean Distance, which is exactly what you are thinking: imagine an n-dimensional graph that has one axis for each song reviewed by the two involved users (u1 and u2), where the value on that axis is the score. You can easily calculate similarity using this formula:
for every song reviewed by both u1 and u2, calculate pow(u1.song.score - u2.song.score, 2) and add them all together into sum_of_powers. The similarity coefficient is then given by 1 / (1 + sqrt(sum_of_powers)).
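To make that concrete, here is a minimal Python sketch, assuming each user's votes are available as a dict mapping song -> score (a hypothetical structure; adapt it to however your votes are actually stored):

from math import sqrt

def euclidean_similarity(votes_u1, votes_u2):
    # Only songs rated by both users contribute to the distance.
    common = set(votes_u1) & set(votes_u2)
    if not common:
        return 0.0
    sum_of_powers = sum((votes_u1[s] - votes_u2[s]) ** 2 for s in common)
    return 1.0 / (1.0 + sqrt(sum_of_powers))

For example, euclidean_similarity({'song1': 3, 'song2': 5}, {'song1': 4, 'song2': 5}) returns 0.5.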
Pearson Correlation (or correlation coefficient): a better approach that measures how strongly two data sets are related to each other. It uses more complex formulas and a little statistics background; check it here: wiki. You have a graph for every pair of users, and you plot points according to the scores: for example, if aSong has been voted 2 by u1 and 4 by u2, it plots the point (2,4) (assuming u1 is the x-axis and u2 is the y-axis).
Just to clarify, you use linear regression to find two coefficients A and B that describe the line y = Ax + B minimizing the distance from all the points of the graph. If the two sets are similar, the points should lie near the main diagonal, so A should tend to 1 and B to 0. Don't take this explanation as complete or as a reference, because it lacks soundness and the usual mathematical formalism; it is just to give you an idea.
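A rough sketch of the Pearson approach in the same style (same assumed dict-of-scores structure as above; no claim of statistical rigor):

from math import sqrt

def pearson_similarity(votes_u1, votes_u2):
    common = sorted(set(votes_u1) & set(votes_u2))
    n = len(common)
    if n < 2:
        return 0.0
    x = [votes_u1[s] for s in common]
    y = [votes_u2[s] for s in common]
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # one user gave the same score to every common song
    return cov / sqrt(var_x * var_y)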
EDIT:
As others have written, more complex algorithms for clustering data exist, like k-means, but I suggest you start with the easy ones (you only really need something more sophisticated once you find that the results are not good enough).
I recommend the book Programming Collective Intelligence by Toby Segaran. Chapter 3 describes different clustering methods like Hierarchical Clustering and K-means Clustering.
The source code for the examples is available here
If you want the most accurate results, then no, you'd have to iterate over everything.
If your database is large enough, you could just take a statistical sample, say between 1,000 and 10,000 users, and match against that.
You would also be better off adding some more tables to the database, storing the results, and only updating them every so often, instead of calculating this on the fly.
Ilya Grigorik did a series on recommendation algorithms, though he was focusing on Ruby. It appears to be under the machine learning section in his archives, but there isn't a direct section link.
If you want to do it in an approximate way without visiting all the records, you can use the Jaccard Coefficient. It probably needs some adaptation if you want to take the scores into account, but I guess it's the best solution if your system is too big and you don't have the time to check all the records.
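For illustration, a minimal sketch of the Jaccard coefficient over the sets of songs two users voted on (scores ignored, as noted above):

def jaccard_similarity(songs_u1, songs_u2):
    # songs_u1 / songs_u2: iterables of the song IDs each user voted on
    s1, s2 = set(songs_u1), set(songs_u2)
    if not (s1 | s2):
        return 0.0
    return len(s1 & s2) / len(s1 | s2)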
I think a lot of people on here are missing the simplicity of the question. He didn't say anything about creating a rating prediction system. He just wants to compute the similarity between each user's song rating behavior and each other user's song rating behavior. The Pearson correlation coefficient gives exactly that. Yes, you must iterate over every user/user pair.
EDIT:
After thinking about this a little more:
Pearson is great if you want the similarity between two users' tastes, but not their level of "opinionatedness"... one user who rates a series of songs 4, 5, and 6 will correlate perfectly with another user who rates the same songs 3, 6, and 9. In other words, they have the same "taste" (they would rank the songs in the same order), but the second user is much more opinionated. Put another way, the correlation coefficient treats any two rating vectors with a linear relationship as equal.
However, if you want the similarity between the actual ratings the users gave each song, you should use the root mean squared error between the two rating vectors. This is a purely distance based metric (linear relationships do not play into the similarity score), so the 4,5,6 and 3,6,9 users would not have a perfect similarity score.
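A quick numeric illustration of the difference, using the 4,5,6 vs 3,6,9 example (a sketch with numpy; the 1/(1+RMSE) mapping at the end is just one common way to turn a distance into a similarity score):

import numpy as np

u1 = np.array([4, 5, 6])
u2 = np.array([3, 6, 9])

pearson = np.corrcoef(u1, u2)[0, 1]        # 1.0 -- identical "taste"
rmse = np.sqrt(np.mean((u1 - u2) ** 2))    # ~1.91 -- the actual ratings differ
rmse_similarity = 1.0 / (1.0 + rmse)       # ~0.34 -- far from a perfect score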
The decision comes down to what you mean by "similar"...
That is all.
You should be able to find a good algorithm in this book: The Algorithm Design Manual by Steven Skiena.
The book has a whole bunch of algorithms for various purposes. You want a graph clustering algorithm, I think. I don't have my copy of the book handy, so I can't look it up for you.
A quick Google search found a Wikipedia page: http://en.wikipedia.org/wiki/Cluster_analysis Perhaps that will help, but I think the book explains algorithms more clearly.
I am using TF-IDF and DBSCAN to cluster similar human names in a database. The goal of the project is to be able to cluster together names that belong to the same person but may not necessarily be formatted or spelt the same. For example, John Smith can also be labeled in the database as J. Smith or Smith, John. Ideally the model would be able to cluster these instances together.
The dataset I'm working with has over 250K records. I understand that DBSCAN will label records that are noise as -1. However, the model is also producing an additional cluster that almost always has around 200K records in it and the vast majority of the records within seem like they should be in their own individual clusters. Is there a reason why this may be happening? I'm considering running another model on this large cluster to see what happens.
Any advice would be greatly appreciated. Thanks!
First off, DBSCAN is a reasonable method for unsupervised clustering when the number of clusters you have is unknown.
You need to pass a matrix of pairwise distances between the strings you are clustering. Which string similarity metric you use is your choice. Here is an example with Levenshtein distance, where names is a list or array of your strings to cluster:
import Levenshtein as Lev
import numpy as np
from sklearn.cluster import DBSCAN

# names is your list (or array) of name strings to cluster
lev_distance = np.array([[Lev.distance(v1, v2) for v1 in names] for v2 in names])

# metric='precomputed' tells DBSCAN the matrix already holds pairwise distances
dbscan = DBSCAN(eps=5, min_samples=1, metric='precomputed')
dbscan.fit(lev_distance)
Because we are using Levenshtein distance, eps is the number of edits needed to turn one string into the other. Tune it for your use case. The biggest concern is longer names being shortened ('malala yousafzai' vs 'malala y.' is more edits than 'jane doe' vs 'jane d.').
My assumption as to why your current code has most of your dataset clustered: your eps value is tuned too high.
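One quick way to check this (assuming the dbscan object from the snippet above) is to print the sizes of the biggest clusters while you tune eps:

from collections import Counter

# One huge cluster usually means eps is too large for your distance scale.
print(Counter(dbscan.labels_).most_common(5))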
You called it 'DBSCAN', and I know what you're talking about because I'm doing this at work right now, but your description sounds much more like fuzzy matching. Check out the link below and see if that helps you get to your end game.
https://medium.com/analytics-vidhya/matching-messy-pandas-columns-with-fuzzywuzzy-4adda6c7994f
Also, below is a link to a canonical example of DBSCAN, but again, I don't think that's what you actually want to do.
https://towardsdatascience.com/dbscan-clustering-for-data-shapes-k-means-cant-handle-well-in-python-6be89af4e6ea
I have a reasonably technical background and have done a fair bit of node development, but I’m a bit of a novice when it comes to statistics and a complete novice with python, so any advice on a synthetic data generation experiment I’m trying my hand at would be very welcome :)
I’ve set myself the problem of generating some realistic(ish) sales data for a bricks and mortar store (old school, I know).
I’ve got a smallish real-world transactional dataset (~500k rows) from the internet that I was planning on analysing with a tool of some sort, to provide the input to a PRNG.
Hopefully if I explain my thinking across a couple of broad problem domains, someone(s?!) can help me:
PROBLEM 1
I think I should be able to use the real data I have to either:
a) generate a probability distribution curve or
b) identify an ‘out of the box’ distribution that’s the closest match to the actual data
I'm assuming there's a tool or library in Python or Node that will do one or both of those things if fed the data and, further, give me the right values to plug into a PRNG to produce a series of data points that are not only distributed like the original's, but also within the same sort of ranges.
I suspect b) would be less expensive computationally and, also, better supported by tools - my need for absolute ‘realness’ here isn’t that high - it’s only an experiment :)
Which leads me to…
QUESTION 1: What tools could I use to do the analysis and generate the data points? As I said, my maths is OK, but my statistics isn't great (and the docs for the tools I've seen are a little dense and, to me at least, somewhat impenetrable), so some guidance on using the tool would also be welcome :)
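To make QUESTION 1 a bit more concrete, this is roughly the workflow I'm picturing (just a sketch: I'm assuming scipy.stats is the right sort of tool, the exponential distribution is only an example of an 'out of the box' choice, and the inter-arrival times here are made up):

import numpy as np
from scipy import stats

# Made-up gaps (seconds between transactions) standing in for the values
# I'd derive from the real dataset's timestamps.
real_gaps = np.array([30.0, 95.0, 12.0, 240.0, 60.0, 18.0, 150.0, 45.0])

# Option b): fit an off-the-shelf distribution to the observed gaps.
loc, scale = stats.expon.fit(real_gaps)

# Sample new gaps with the fitted parameters and turn them into timestamps.
synthetic_gaps = stats.expon.rvs(loc=loc, scale=scale, size=1000)
synthetic_timestamps = np.cumsum(synthetic_gaps)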
And then there’s my next, I think more fundamental, problem, which I’m not even sure how to approach…
PROBLEM 2
While I think the approach above will work well for generating timestamps for each row, I’m going round in circles a little bit on how to model what the transaction is actually for.
I’d like each transaction to be relatable to a specific product from a list of products.
Now the products don’t need to be ‘real’ (I reckon I can just use something like Faker to generate random words for the brand, product name etc), but ideally the distribution of what is being purchased should be a bit real-ey (if that’s a word).
My first thought was just to do the same analysis for price as I'm doing for timestamp and then 'make up' a product for each price that's generated, but I discarded that for a couple of reasons: it might be consistent 'within' a generated dataset, but not 'across' datasets, and I imagine it would double count quite a bit on largish sets.
So my next thought was I would create some sort of lookup table with a set of pre-defined products that persists across generation jobs, but I'm struggling with two aspects of that:
I’d need to generate the list itself. I would imagine I could filter the original dataset to unique products (it has stock codes) and then use the spread of unit costs in that list to do the same thing as I would have done with the timestamp (i.e. generate a set of products that have a similar spread of unit cost to the original data and then Faker the rest of the data).
QUESTION 2: Is that a sensible approach? Is there something smarter I could do?
When generating the transactions, I would also need some way to work out what product to select. I thought maybe I could generate some sort of bucketed histogram to work out what the frequency of purchases was within a range of costs (say $0-1, $1-2, etc.). I could then use that frequency to define the probability that a given transaction's cost would fall within one of those ranges, and then randomly select a product whose cost falls within that range...
QUESTION 3: Again, is that a sensible approach? Is there a way I could do that lookup with a reasonably easy to understand tool (or at least one that’s documented in plain English :))
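In case it helps clarify QUESTIONS 2 and 3, this is roughly the lookup I have in mind (a sketch only: the products are hypothetical, the bucket edges are arbitrary, and the example prices are just the UnitPrice values from the sample rows at the end of this post):

import numpy as np

rng = np.random.default_rng()

# Hypothetical persistent product list: (product_id, unit_cost).
products = [("p1", 0.55), ("p2", 1.25), ("p3", 1.85), ("p4", 3.40), ("p5", 7.65)]

# Bucket the original transactions' unit prices to get the purchase
# frequency per price range.
original_prices = np.array([2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85])
bins = np.array([0.0, 1.0, 2.0, 4.0, 100.0])
counts, _ = np.histogram(original_prices, bins=bins)
bucket_probs = counts / counts.sum()

def pick_product():
    # Choose a price bucket by observed frequency, then a product whose
    # cost falls inside that bucket (fall back to any product if empty).
    b = rng.choice(len(bucket_probs), p=bucket_probs)
    in_bucket = [pid for pid, cost in products if bins[b] <= cost < bins[b + 1]]
    return rng.choice(in_bucket) if in_bucket else rng.choice([pid for pid, _ in products])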
This is all quite high level I know, but any help anyone could give me would be greatly appreciated as I’ve hit a wall with this.
Thanks in advance :)
The synthesised dataset would simply have timestamp, product_id and item_cost columns.
The source dataset looks like this:
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850,United Kingdom
536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850,United Kingdom
536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,22752,SET 7 BABUSHKA NESTING BOXES,2,12/1/2010 8:26,7.65,17850,United Kingdom
536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,12/1/2010 8:26,4.25,17850,United Kingdom
536366,22633,HAND WARMER UNION JACK,6,12/1/2010 8:28,1.85,17850,United Kingdom
I have created a 4-cluster k-means customer segmentation in scikit learn (Python). The idea is that every month, the business gets an overview of the shifts in size of our customers in each cluster.
My question is how to make these clusters 'durable'. If I rerun my script with updated data, the 'boundaries' of the clusters may slightly shift, but I want to keep the old clusters (even though they fit the data slightly worse).
My guess is that there should be a way to extract the parameters that decide which cluster each case is assigned to, but I haven't found the solution yet.
Got the answer in a different topic:
Just record the cluster means. Then when new data comes in, compare it to each mean and put it in the one with the closest mean.
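A minimal sketch of that idea with scikit-learn, assuming kmeans is the fitted model from the existing script and new_data is next month's customer matrix, scaled the same way as the original fit:

import numpy as np
from sklearn.metrics import pairwise_distances_argmin

# Persist the centroids once, right after the original fit.
np.save("cluster_means.npy", kmeans.cluster_centers_)

# Each month: load the frozen means and assign new cases to the nearest one,
# instead of re-fitting (which would let the boundaries drift).
means = np.load("cluster_means.npy")
labels = pairwise_distances_argmin(new_data, means)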
I got this Prospects dataset:
ID Company_Sector Company_size DMU_Final Joining_Date Country
65656 Finance and Insurance 10 End User 2010-04-13 France
54535 Public Administration 1 End User 2004-09-22 France
and Sales dataset:
ID linkedin_shared_connections online_activity did_buy Sale_Date
65656 11 65 1 2016-05-23
54535 13 100 1 2016-01-12
I want to build a model which assigns to each prospect in the Prospects table the probability of becoming a customer; the model should predict whether a prospect is going to buy and return that probability. The Sales table gives info about 2015 sales. My approach: the 'did_buy' column should be the label in the model, because 1 represents that the prospect bought in 2016 and 0 means no sale. Another interesting column is online_activity, which ranges from 5 to 685; the higher it is, the more active the prospect is around the product. So I'm thinking of building a Random Forest model and then somehow putting the probability for each prospect into a new 'intent' column. Is a Random Forest an efficient model in this case, or should I use another one? And how can I apply the model results to the new 'intent' column for each prospect in the first table?
Well, first, please see the How to ask and On-topic guidelines. This is more of a consulting question than a practical or specific one. Maybe a more appropriate topic is machine learning.
TL;DR: Random forests are nice but seem inappropriate here due to the unbalanced data. You should read about recommender systems, and about more fashionable, well-performing models like Wide and Deep.
The answer depends on: How much data do you have? What data is available during inference? Can you see the current "online_activity" attribute of a potential sale before the customer buys? Many such questions may change the whole approach that fits your task.
Suggestion:
Generally speaking, this is the kind of business where you usually deal with very unbalanced data: a low number of "did_buy"=1 against a huge number of potential customers.
On the data science side, you should define a valuable metric for success that can be mapped to money as directly as possible. Here, "did_buy" / "was_approached" seems like a great success metric: you take action by advertising to, or approaching, the more probable customers, and over time you succeed if you raise that number.
Another thing to take into account is that your data may be sparse. I do not know how many buys you usually get, but it could be that you have only one from some countries, etc. That should also be taken into consideration, since a simple random forest can easily end up targeting such a column in most of its random trees, and overfitting will become a big issue. Decision trees suffer from unbalanced datasets. However, taking the probability of each label in a leaf, instead of a hard decision, can sometimes help for simple, interpretable models, and it reflects the unbalanced data. To be honest, I do not truly believe this is the right approach.
If I were you:
I would first embed the Prospects columns into a feature vector (see the sketch after this list) by:
Converting categories to random vectors (for each category) or one-hot encoding.
Normalizing or bucketizing company sizes into numbers that fit the prediction model (next).
The same idea applies to dates. Here the year may be problematic, but months/days should be useful.
Country is definitely categorical, maybe add another "unknown" country class.
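A minimal sketch of those embedding steps, assuming the Prospects table is loaded into a pandas DataFrame called prospects with the column names from the question (the bin edges are arbitrary):

import pandas as pd

# One-hot encode the categorical columns; dummy_na adds an implicit
# "unknown" class for missing sectors/countries.
features = pd.get_dummies(
    prospects[["Company_Sector", "DMU_Final", "Country"]], dummy_na=True)

# Bucketize company size into a few ordinal bins.
features["size_bucket"] = pd.cut(
    prospects["Company_size"], bins=[0, 1, 10, 50, 250, float("inf")], labels=False)

# Split the joining date; keep month and day-of-week, drop the raw year.
joined = pd.to_datetime(prospects["Joining_Date"])
features["join_month"] = joined.dt.month
features["join_dayofweek"] = joined.dt.dayofweek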
Then,
I would use a model that can actually be optimized according to different costs. Logistic regression is a wide one, a deep neural network is another option, or see Google's Wide and Deep for a combination.
Set the cost to be my golden number (the money-metric in terms of labels), or something as close as possible.
Run experiment
Finally,
Inspect my results and why it failed.
Suggest another model/feature
Repeat.
Go eat lunch.
Ask a bunch of data questions.
Try to answer at least some.
Discover new interesting relations in the data.
Suggest something interesting.
Repeat (tomorrow).
Of course there is a lot more to it than just the above, but that is for you to discover in your data and business.
Hope I helped! Good luck.
For a music project I want to find which groups of artists users listen to. I have extracted three columns from the database: the ID of the artist, the ID of the user, and the percentage of all the user's streams that are connected to that artist.
E.g., half of the plays from user 15 are of artist 12:
12 | 15 | 0.5
What I hope to find is a methodology for grouping artists into clusters, so that I can, e.g., find out that users who tend to listen to artist 12 also listen to 65, 74, and 34.
I wonder what kinds of methodologies can be used for this grouping, and whether there are any good sources for this approach (Python or Ruby would be great).
Imagine your data as a matrix with users as rows and artists as columns, with each cell containing the ratio.
A straightforward analysis would be to use clustering on the (possibly very large) column vectors. Check out the Python library scikit-learn. I can also recommend the IPython notebook for interactive data analysis.
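A small sketch of that setup, assuming the three extracted columns are loaded into a pandas DataFrame called plays with columns artist_id, user_id and ratio (placeholder names):

import pandas as pd
from sklearn.cluster import KMeans

# Users as rows, artists as columns, cells holding the listening ratio.
matrix = plays.pivot_table(index="user_id", columns="artist_id",
                           values="ratio", fill_value=0)

# Cluster the artist column vectors: transpose so each row is one artist's
# profile across users, then group artists that tend to be listened to together.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)  # 10 is arbitrary; tune it
artist_labels = kmeans.fit_predict(matrix.T)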
Your problem is known as "market-basket analysis" or "affinity correlation", check out Best Python clustering library to use for product data analysis
Sounds like a classic matrix factorization task to me.
With a weighted matrix instead of a binary one, so some fast algorithms may not be applicable because they support binary matrices only.
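For example, non-negative matrix factorization in scikit-learn handles a weighted matrix directly (a sketch, assuming 'matrix' is a users x artists array of listening ratios like the one described in the earlier answer):

from sklearn.decomposition import NMF

nmf = NMF(n_components=20, init="nndsvda", max_iter=500)
user_factors = nmf.fit_transform(matrix)   # users x latent components
artist_factors = nmf.components_.T         # artists x latent components
# Artists whose rows in artist_factors are close tend to be listened to together.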
Don't ask for sources on Stack Overflow: asking for off-site resources (tools, libraries, ...) is off-topic.