I have a prediction problem that I am working on and need some help on how to approach it. I have a CSV with two columns, user_id and ratings, where a user is giving a rating on something in the ratings column. A user can repeat in the user_id column with different unique ratings. For example:
user_id rating
1 5
4 6
1 6
7 6
2 7
4 7
Now the prediction data set has users who have already given previous ratings, in the same format as above:
user_id rating
11 6
12 10
13 8
13 9
14 4
14 5
The goal is to predict what these specific users will rate next time. Secondly, let's say we add a user '15' with no rating history: how can we predict, in order, the first two ratings that user will provide?
I'm not sure how to train a model with just user_id and ratings, where ratings is also the target column. Any help is appreciated!
First and foremost, you have to state what the user is rating, i.e. the category. For example, in a movie rating system you can note that for movie A, an action movie, the user gives a rating of 1, meaning the user hates action, while for movie B, a comedy, the user gives a rating of 9, meaning the user loves comedy. The next time something in a similar category comes along, you can predict the user's rating quite easily. You can extend this with many movie categories (thriller, romance, drama, etc.) and other features such as movie length, leading actor, director, and language, since all of these broadly govern a user's rating.
But if you do not provide the basis on which the user is rating, prediction is very hard and of little use. For example, if I am a user and my past ratings are 1, 5, 2, 6, 8, 1, 9, 3, 4, 10, can you predict my next rating? No, because this looks like a random generator between 0 and 10. In the movie case, where my past ratings clearly show that I love comedy and hate action, you can easily predict my rating when a new comedy movie comes out.
Still, if this really is your problem, you can use simple statistical methods: take the per-user mean and round to the nearest integer, or take the per-user mode.
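For instance, with the example data from the question, the mean and mode baselines look like this in pandas:

```python
import pandas as pd

# Example data from the question
df = pd.DataFrame({"user_id": [1, 4, 1, 7, 2, 4],
                   "rating":  [5, 6, 6, 6, 7, 7]})

# Per-user mean, rounded to the nearest integer, as a naive next-rating prediction
pred_mean = df.groupby("user_id")["rating"].mean().round().astype(int)

# Per-user mode (taking the first mode if there are ties)
pred_mode = df.groupby("user_id")["rating"].agg(lambda s: s.mode().iloc[0])
```

Either series can then be looked up by user_id to produce a prediction for that user.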
I can also suggest plotting each user's ratings and visualising them. If they follow some pattern, for example the rating rises to a peak, falls to a minimum, and then repeats, you could predict on that basis (though, believe me, this is going to be very impractical given your constraints).
The best of all these options is to build a simple statistical model: give the highest weight to the last rating, a smaller weight to the second-to-last, smaller still to the one before that, and so on, then combine them into a weighted average, e.g.
predicted_rating = w1*(last_rating) + w2*(second_last_rating) + w3*(third_last_rating) + ...
This can give good results, and it is indeed machine learning: finding the best-suited weights is multivariate linear regression on the lagged ratings (an autoregressive model). Under the given constraints, this is probably the best model available.
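The weighted scheme above can be fit with ordinary least squares: build one training row per prediction from each user's lagged ratings and solve for the weights. A minimal numpy sketch with made-up rating histories (it assumes every user has more ratings than the number of lags):

```python
import numpy as np

# Toy rating histories, one list per user (oldest -> newest); made up for illustration
histories = [
    [5, 6, 6, 7, 7, 8],
    [3, 3, 4, 4, 5, 5],
    [8, 7, 7, 6, 6, 5],
]

LAGS = 3  # use the last three ratings as features

X, y = [], []
for h in histories:
    for i in range(LAGS, len(h)):
        X.append(h[i - LAGS:i][::-1])  # [last, second-last, third-last]
        y.append(h[i])
X, y = np.array(X, float), np.array(y, float)

# Ordinary least squares finds the weights w1, w2, w3 directly
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict_next(history):
    """Predict the next rating as a weighted sum of the last LAGS ratings."""
    last = np.array(history[-LAGS:][::-1], float)
    return last @ w
```

In practice you would build `histories` by grouping the CSV by user_id and collecting each user's ratings in order.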
Related
My problem looks like this: a movie theatre is showing a set of n films over a 5-day period. Each movie has a corresponding IMDb score. I want to watch one movie per day over the 5-day period, maximising the cumulative IMDb score whilst making sure that I watch the best movies first (i.e. Monday's movie will have a higher score than Tuesday's, Tuesday's higher than Wednesday's, etc.). An extra constraint is that the theatre doesn't show every movie every day. For example:
Showings:
Monday showings: Sleepless in Seattle, Multiplicity, Jaws, The Hobbit
Tuesday showings: Sleepless in Seattle, Kramer vs Kramer, Jack Reacher
Wednesday showings: The Hobbit, A Star is Born, Joker
etc.
Scores:
Sleepless in Seattle: 7.0
Multiplicity: 10
Jaws: 9.2
The Hobbit: 8.9
A Star is Born: 6.2
Joker: 5.8
Kramer vs Kramer: 8.7
etc.
The way I've thought about this is that each day is a variable: a, b, c, d, e, and we are maximising (a+b+c+d+e), where each variable represents a day of the week. To make sure I watch the movies in descending order of IMDb score, I would add the constraint a > b > c > d > e. However, as far as I can tell, with the linear solver you cannot restrict a variable to a set of discrete values, only to a continuous range (in an ideal world the problem would look like: "solve for a, b, c, d, e, maximising their cumulative sum, while ensuring a > b > c > d > e, where a can take this set of possible values, b this set, etc."). I'm wondering if someone can point me in the right direction as to which OR-Tools solver (or another library) would be best for this problem?
I tried to use the GLOP linear solver to solve this problem but failed. I was expecting it to solve for a,b,c,d, and e but I couldn't write the necessary constraints with this paradigm.
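The formulation with discrete per-day domains points at a constraint solver such as OR-Tools' CP-SAT rather than the GLOP LP solver. As a sanity check of the model on the example's first three days, here is a brute-force sketch (Jack Reacher is omitted because no score was given for it):

```python
from itertools import product

scores = {"Sleepless in Seattle": 7.0, "Multiplicity": 10, "Jaws": 9.2,
          "The Hobbit": 8.9, "A Star is Born": 6.2, "Joker": 5.8,
          "Kramer vs Kramer": 8.7}

# One list of available movies per day (Monday, Tuesday, Wednesday)
showings = [
    ["Sleepless in Seattle", "Multiplicity", "Jaws", "The Hobbit"],
    ["Sleepless in Seattle", "Kramer vs Kramer"],
    ["The Hobbit", "A Star is Born", "Joker"],
]

best_total, best_plan = -1.0, None
for plan in product(*showings):
    day_scores = [scores[m] for m in plan]
    # Enforce strictly decreasing scores: Monday's pick beats Tuesday's, etc.
    if all(a > b for a, b in zip(day_scores, day_scores[1:])):
        total = sum(day_scores)
        if total > best_total:
            best_total, best_plan = total, plan
```

A CP-SAT model would replace the enumeration with one boolean variable per (day, movie) pair, an exactly-one constraint per day, and a strict inequality between consecutive days' chosen scores (scaled to integers), which scales far better than brute force.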
I have a dataset with the distances between 4 cities. Each city has a sales store from the same company, and in another dataset I have the number of sales from the last month for each store.
What I want to know is the best possible route between the cities to make the most profit (each product is sold for 5), knowing that I only produce in the first city and then have a truck with a maximum truckload of 5000 supplying the other 3 cities.
I can't find anything similar to my problem; the closest things I could find were search algorithms. Can someone tell me what approach to take?
Sorry if my question is a bit confusing.
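With only 4 cities this is small enough to brute-force as a tiny vehicle-routing problem. A sketch under stated assumptions: the distance matrix, the per-city demands, and the per-km transport cost below are all made up, since the question only fixes the unit price of 5 and the truckload of 5000, and profit is assumed to be revenue minus distance cost:

```python
from itertools import permutations

PRICE = 5          # revenue per unit sold (given in the question)
COST_PER_KM = 1    # assumed transport cost per distance unit (not in the question)
CAPACITY = 5000    # maximum truckload (given in the question)

# Hypothetical symmetric distances; "A" is the production city
dist = {frozenset(p): d for p, d in [
    (("A", "B"), 10), (("A", "C"), 20), (("A", "D"), 15),
    (("B", "C"), 12), (("B", "D"), 25), (("C", "D"), 30)]}

# Hypothetical demand per store, e.g. last month's sales
demand = {"B": 2000, "C": 2500, "D": 1500}

best_profit, best_route = None, None
for order in permutations(["B", "C", "D"]):
    load, revenue, km, here = CAPACITY, 0, 0, "A"
    for city in order:
        km += dist[frozenset((here, city))]
        sold = min(load, demand[city])   # deliver what fits in the remaining load
        revenue += PRICE * sold
        load -= sold
        here = city
        if load == 0:                    # truck is empty, no point driving on
            break
    profit = revenue - COST_PER_KM * km
    if best_profit is None or profit > best_profit:
        best_profit, best_route = profit, ("A",) + order
```

With real data you would plug in your two datasets; for more cities, this becomes a capacitated vehicle-routing problem that solvers like OR-Tools handle directly.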
I have a rating dataset with userid, restaurantid, and rating (the rating given by the user for a particular restaurant; a rating can be 0, 1, or 2).
When I am doing pivoting to create a user-item matrix, I am replacing the nan (when the user has not rated a restaurant) values with zero.
My question is am I assigning a zero rating here? Am I saying that user1 has rated res1 as zero when he has not rated res1 at all? Isn't it going to affect my final prediction?
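To see the concern concretely: after fillna(0), a cell the user never rated is indistinguishable from a genuine 0 rating, which matters precisely because 0 is a valid value on this scale. A small pandas sketch with made-up data, using the column names from the question:

```python
import pandas as pd

ratings = pd.DataFrame({
    "userid": [1, 1, 2],
    "restaurantid": ["r1", "r2", "r1"],
    "rating": [2, 0, 1],   # user 1 genuinely rated r2 as 0; user 2 never rated r2
})

# Pivot WITHOUT filling: unrated cells stay NaN, so "not rated" is still visible
ui = ratings.pivot_table(index="userid", columns="restaurantid", values="rating")

# After fillna(0), user 2's missing rating for r2 looks identical to user 1's real 0
ui_filled = ui.fillna(0)
```

So yes, filling with zero does assign a zero rating, and it will bias any model that treats 0 as "disliked". Common alternatives are keeping NaN (and using algorithms that ignore missing entries) or filling with a neutral value such as the user's mean.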
I have this data set:
I am being asked: Define a metric and a corresponding function that determines which are the fastest growing users in terms of positive customer engagement over the last year. Report the top 10 users based on the metric that defines “fastest growing user”.
So far I have created a correlation matrix:
user_id content_count total_engagement date_Delta
user_id 1.000000 -0.056683 0.027150 -0.000014
content_count -0.056683 1.000000 0.215149 -0.007097
total_engagement 0.027150 0.215149 1.000000 0.002337
date_Delta -0.000014 -0.007097 0.002337 1.000000
As you can see, content_count and total_engagement have the strongest correlation.
What I am thinking of doing next is to plot each user_id against their total_engagement to see the overall trend, which should give some indication of which users have a strongly increasing total_engagement.
However, I am still a bit confused about how to define a metric for the question posed. I guess I just want to make this post to see if anyone can propose some ideas.
You should add a new field that captures daily engagement per user_id, something like total_engagement / date_Delta, which gives the actual per-day engagement.
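A minimal pandas sketch of that rate metric, using the column names from the question (the numbers are made up); `nlargest` then reports the top users:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "total_engagement": [300, 150, 500],
    "date_Delta": [30, 10, 100],  # e.g. days since the user's first activity
})

# Engagement per day: normalises total engagement by how long the user has been active
df["engagement_per_day"] = df["total_engagement"] / df["date_Delta"]

# Top users by the rate metric (top 10 in the real data; only 3 rows here)
top10 = df.nlargest(10, "engagement_per_day")
```

Note the rate rewards recent, intense activity: user 2 here ranks first despite the smallest total, because their engagement accrued over the shortest window.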
I'm currently working on a mock analysis of a mock MMORPG's microtransaction data. This is an example of a few lines of the CSV file:
PID Username Age Gender ItemID Item Name Price
0 Jack78 20 Male 108 Spikelord 3.53
1 Aisovyak 40 Male 143 Blood Scimitar 1.56
2 Glue42 24 Male 92 Final Critic 4.88
Here's where things get dicey: I successfully use the groupby function to get a result where purchases are grouped by the gender of their buyers.
test = purchase_data.groupby(['Gender', "Username"])["Price"].mean().reset_index()
gets me the result (truncated for readability)
Gender Username Price
0 Female Adastirin33 $4.48
1 Female Aerithllora36 $4.32
2 Female Aethedru70 $3.54
...
29 Female Heudai45 $3.47
.. ... ... ...
546 Male Yadanu52 $2.38
547 Male Yadaphos40 $2.68
548 Male Yalae81 $3.34
What I'm aiming for currently is to find the average amount of money spent by each gender as a whole. How I imagine this would be done is by creating a method that checks for the male/female/other tag in front of a username, and then adds the average spent by that person to a running total which I can then manipulate later. Unfortunately, I'm very new to Python: I have no clue where to even begin, or whether I'm even on the right track.
Addendum: jezrael misunderstood the intent of this question. While he provided me with a method to clean up my output series, he did not provide me a method or even a hint towards my main goal, which is to group together the money spent by gender (Females are shown in all but my first snippet, but there are males further down the csv file and I don't want to clog the page with too much pasta) and put them towards a single variable.
Addendum2: Another solution suggested by jezrael,
purchase_data.groupby(['Gender'])["Price"].sum().reset_index()
creates
Gender Price
0 Female $361.94
1 Male $1,967.64
2 Other / Non-Disclosed $50.19
Sadly, using figures from this new series (which would yield the average price per purchase recorded in this csv) isn't quite what I'm looking for, due to the fact that certain users have purchased multiple items in the file. I'm hunting for a solution that lets me pull from my test frame the average amount of money spent per user, separated and grouped by gender.
It sounds to me like you think in terms of database tables. By default, groupby() does not return one: the group labels are presented as row indices, not as a column. But you can make it behave that way instead (note the as_index argument to groupby()):
mean = purchase_data.groupby(['Gender', 'Username'], as_index=False)['Price'].mean()
gender = mean.groupby(['Gender'], as_index=False)['Price'].mean()
Then what you want is probably gender[['Gender','Price']]
Basically, average up per user, then average (mean) up per gender.
In one line
print(df.groupby(['Gender', 'Username'])['Price'].sum().reset_index().groupby('Gender')['Price'].mean())
Or in a few lines
df1 = df.groupby(['Gender', 'Username'])['Price'].sum().reset_index()
df2 = df1.groupby('Gender')['Price'].mean()
print(df2)
Some notes,
I read your example from the clipboard
import pandas as pd
df = pd.read_clipboard()
which required a separator or the item names to be without spaces.
I put an extra space into "space lord" for the test. Normally, you should provide an example file good enough to do the test, so you'd need one with at least one female user in it.
To get the average spent per person, first take the mean per username.
Then, to get the average of those per-user averages per gender, group by gender again:
df1 = df.groupby(['Gender', 'Username'])['Price'].mean().groupby(level='Gender').mean().reset_index()
df1[['Gender', 'Price']]