I have a ratings dataset with userid, restaurantid, and rating (given by the user for a particular restaurant; the rating can be 0, 1, or 2).
When I pivot it to create a user-item matrix, I am replacing the NaN values (where the user has not rated a restaurant) with zero.
My question is: am I assigning a zero rating here? Am I saying that user1 has rated res1 as zero when he has not rated res1 at all? Isn't that going to affect my final prediction?
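For concreteness, a minimal sketch of the pivot step being described (column names and data are made up):

import pandas as pd

# toy ratings in the format described above
ratings = pd.DataFrame({'userid': [1, 1, 2],
                        'restaurantid': [10, 11, 10],
                        'rating': [2, 0, 1]})

# pivot to a user-item matrix; unrated pairs come out as NaN
matrix = ratings.pivot(index='userid', columns='restaurantid', values='rating')

# fillna(0) makes 'never rated' indistinguishable from a genuine rating of 0,
# which is exactly the concern raised above
matrix_filled = matrix.fillna(0)
print(matrix_filled)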
I have trained a model which predicts the rating of a product for a given user:
rating = model(user, item)
I want to know how to get multiple recommendations for all users. Do we run two loops, one over users and another over items, to get the rating of every user for every item, and then, for each user, sort the ratings and take the top 10/20 highest-rated products?
import numpy as np

interaction = np.zeros((len(users), len(items)))
for i in range(len(users)):
    for j in range(len(items)):
        interaction[i][j] = model(users[i], items[j])
Is there a more efficient way to do this, as my data has a large number of users and a large number of products?
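If your model can score a whole batch of (user, item) pairs in one call (common for neural-network models, but an assumption about model here), you can avoid the Python-level double loop and take the top-k per user without fully sorting each row. A sketch, reusing users and items from above:

import numpy as np

k = 10  # recommendations per user

# assumption: model accepts arrays and returns an array of scores
user_grid, item_grid = np.meshgrid(users, items, indexing='ij')
scores = model(user_grid.ravel(), item_grid.ravel()).reshape(len(users), len(items))

# indices of each user's k best items, without sorting whole rows
top_k_unsorted = np.argpartition(-scores, k, axis=1)[:, :k]

# order those k by score so the best item comes first
rows = np.arange(len(users))[:, None]
top_k = top_k_unsorted[rows, np.argsort(-scores[rows, top_k_unsorted], axis=1)]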
I'm using some data from Kaggle about blue plaques in Europe. Many of these plaques describe famous people, but others describe places, events, or animals. The dataframe includes the years of both birth and death for those famous people, and I have added a new column that holds the age of the lead subject at their time of death with the following code:
plaques['subject_age'] = plaques['lead_subject_died_in'] - plaques['lead_subject_born_in']
This works for some of the dataset, but since some of the subjects don't have values in the 'lead_subject_died_in' and 'lead_subject_born_in' columns, some of my results are funky.
I was trying to determine the most common age of death with this:
agecount = plaques['subject_age'].value_counts()
print(agecount)
and I got some crazy stuff: negative numbers, 600+, etc. How do I make it count only the values for people who actually have data in both of those columns?
By the way, I'm a beginner, so if the operations you suggest are very difficult, please explain what they're doing so that I can learn and use it in the future!
You can use the dropna function to remove the NaN values in certain columns:
# remove rows with NaN values in these 2 columns
plaques = plaques.dropna(subset=['lead_subject_died_in', 'lead_subject_born_in'])
plaques['subject_age'] = plaques['lead_subject_died_in'] - plaques['lead_subject_born_in']

# get the most frequent age
plaques['subject_age'].value_counts().idxmax()

# get the five most common ages
plaques['subject_age'].value_counts().head()
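As an aside, dropping the rows is not strictly required for the count: the subtraction already produces NaN wherever either year is missing, and value_counts ignores NaN by default, so this non-destructive variant counts only the people with both dates:

# keep all rows; ages are NaN where either year is missing
plaques['subject_age'] = plaques['lead_subject_died_in'] - plaques['lead_subject_born_in']

# value_counts skips NaN by default, so only real ages are counted
agecount = plaques['subject_age'].value_counts()
print(agecount.head())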
I am currently working on the IMDB 5000 movie dataset for a class project. The budget variable has a lot of zero values.
They are missing entries. I cannot drop them, because they make up 22% of my entire data.
What should I do in Python? Some people suggested binning; could you provide more details?
Well there are a few options.
Take an average of the non-zero values and fill all the zeros with the average. This yields 'tacky' results and is not best practice; a few outliers can throw off the whole mean.
Use the median of the non-zero values. Also not a super option, but less likely to be thrown off by outliers (a short sketch of the median fill and of binning appears after this list of options).
Binning would mean splitting the movies into a certain number of budget groups, say budgets over or under a million: take the range of budgets, divide it by the number of groups you want, and use the resulting intervals as bins; if a movie falls in group 0 give it a zero, if in group 1 a one, etc.
I think finding the actual budgets for the movies and replacing the bad itemized budgets with the real budgets would be a good option, depending on the analysis you are doing. You could also take the median or average of each itemized budget column as a percentage of the total budget, and then fill in the zeros with that percentage of the movie's budget. For example, if the median of the non-zero actor_pay column works out to actor_pay/budget = 60%, then filling a zeroed actor_pay value with 60 percent of that movie's budget would be an option.
Hard option: create a function that takes the non-zero values of a movie's budget and attempts to interpolate the movie's budget based upon the other movies' data in the table. This option is more like its own project, and the above options should really be tried first.
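A minimal sketch of the median fill and the binning idea, assuming a pandas DataFrame movies with a budget column (names and data are made up):

import pandas as pd

# hypothetical stand-in for the IMDB budget column
movies = pd.DataFrame({'budget': [0, 5_000_000, 0, 250_000, 40_000_000, 1_200_000]})

# median of the non-zero budgets only
median_budget = movies.loc[movies['budget'] > 0, 'budget'].median()

# fill the zero (missing) budgets with that median
movies['budget_filled'] = movies['budget'].replace(0, median_budget)

# binning: split the filled budgets into 4 equal-width groups labelled 0..3
movies['budget_bin'] = pd.cut(movies['budget_filled'], bins=4, labels=False)
print(movies)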
I am trying to implement the Gale-Shapley algorithm in Python to deliver stable matchings of doctors and hospitals. To do so, I gave every doctor and every hospital a random preference represented by a number.
Dataframe consisting of preferences
Afterwards I created a function that rates every hospital for one specific doctor (represented by an ID), followed by a ranking of this rating, creating two new columns. In rating the match, I took the absolute value of the difference between the preferences, where a lower absolute value means a better match. This is the formula for the first doctor:
doctors_sorted_by_preference['Rating of Hospital by Doctor 1'] = abs(doctors_sorted_by_preference['Preference Doctor'].iloc[0] - doctors_sorted_by_preference['Preference Hospital'])
doctors_sorted_by_preference['Rank of Hospital by Doctor 1'] = doctors_sorted_by_preference['Rating of Hospital by Doctor 1'].rank()
which leads to the following table:
Dataframe consisting of preferences and rating + ranking of doctor
Hence, doctor 1 prefers the first hospital over all other hospitals as represented by the ranking.
Now I want to repeat this function for every different doctor by creating a loop (creating two new columns for every doctor and adding them to my dataframe), but I don't know how to do this. I could type out the same function for all 10 different doctors, but if I increase the dataset to include 1000 doctors and hospitals this would become impossible; there must be a better way...
This would be the same function for doctor 2:
doctors_sorted_by_preference['Rating of Hospital by Doctor 2'] = abs(doctors_sorted_by_preference['Preference Doctor'].iloc[1] - doctors_sorted_by_preference['Preference Hospital'])
doctors_sorted_by_preference['Rank of Hospital by Doctor 2'] = doctors_sorted_by_preference['Rating of Hospital by Doctor 2'].rank()
Thank you in advance!
You can also append the values to a list and then write it to the dataframe. Appending to lists is faster if you have a large dataset.
I named my dataframe df for the sake of viewing:
for i in range(len(df['Preference Doctor'])):
    list1 = []
    for j in df['Preference Hospital']:
        list1.append(abs(df['Preference Doctor'].iloc[i] - j))
    df['Rating of Hospital by Doctor_' + str(i+1)] = list1
    df['Rank of Hospital by Doctor_' + str(i+1)] = df['Rating of Hospital by Doctor_' + str(i+1)].rank()
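For larger tables, a sketch of a vectorized alternative (same df; note the broadcast builds the full doctors-by-hospitals matrix in memory, which is assumed to fit):

import numpy as np
import pandas as pd

doc = df['Preference Doctor'].to_numpy()
hosp = df['Preference Hospital'].to_numpy()

# |doc_i - hosp_j| for every doctor/hospital pair in one broadcast
ratings = np.abs(doc[:, None] - hosp[None, :])  # shape: (n_doctors, n_hospitals)

# rank within each doctor's row (1 = best match)
ranks = pd.DataFrame(ratings).rank(axis=1).to_numpy()

for i in range(len(doc)):
    df['Rating of Hospital by Doctor_' + str(i+1)] = ratings[i]
    df['Rank of Hospital by Doctor_' + str(i+1)] = ranks[i]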
I have a prediction problem that I am working on and need some help on how to approach it. I have a CSV with two columns, user_id and rating, where a user gives a rating on something in the rating column. A user can appear multiple times in the user_id column with different ratings. For example:
user_id rating
1 5
4 6
1 6
7 6
2 7
4 7
Now the prediction dataset has users who have already given previous ratings, similar to those above:
user_id rating
11 6
12 10
13 8
13 9
14 4
14 5
The goal is to predict what these specific users will rate the next time. Secondly, let's say we add a user '15' with no rating history: how can we predict the first two ratings that user will provide, in order?
I'm not sure how to train a model with just user_id and rating, which also happens to be the target column. Any help is appreciated!
First and foremost, you have to mention what the user is rating, i.e., the category. For example, in a movie rating system you can record that for a particular movie A, which is an action movie, the user gives a rating of 1, meaning the user hates action, and that for a movie B, which is a comedy, the user gives a rating of 9, meaning the user is a comedy lover. The next time a similar category comes along, you can predict the user's rating very easily. You can do so by including many movie categories like thriller, romance, drama, etc., and you can even take many accompanying features like movie length, leading actor, director, language, and so on, as all of these broadly govern a user's rating.
But if you do not provide the basis on which the user gives a rating, then prediction is very hard and of little use. For example, if I am a user and I give ratings like 1, 5, 2, 6, 8, 1, 9, 3, 4, 10, can you predict my next rating? The answer is no, because it looks just like a random generator between 0 and 10. But in the movie case, where my past ratings clearly show that I love comedy and hate action, when a new comedy movie comes out you can easily predict my rating for it.
Still, if your problem really is just this, then you can use simple statistical methods: either take the mean and round to the nearest integer, or take the mode.
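A minimal sketch of those two baselines, assuming a pandas DataFrame df in the question's format (the data below is made up):

import pandas as pd

# made-up history in the question's format
df = pd.DataFrame({'user_id': [1, 4, 1, 7, 2, 4],
                   'rating':  [5, 6, 6, 6, 7, 7]})

# per-user mean, rounded to the nearest integer rating
mean_pred = df.groupby('user_id')['rating'].mean().round()

# per-user mode (most frequent rating; ties broken by taking the first)
mode_pred = df.groupby('user_id')['rating'].agg(lambda s: s.mode().iloc[0])
print(mean_pred, mode_pred)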
I can also suggest plotting the ratings for each user and visualising them. If they follow some pattern, for example the rating first increases, peaks, decreases to a minimum, and then increases again, you can predict the next rating on the basis of that pattern (believe me, this is going to be very impractical given your constraints).
But the best of all these is to make a statistical model: give a high weight to the last rating, a lesser weight to the second-to-last rating, an even lesser weight to the one before that, and so on, then take the weighted mean, e.g.:
predict_rating = w1*(last_rating) + w2*(second_last_rating) + w3*(third_last_rating) + ...
This will give you very good results, and it is indeed machine learning: the particular algorithm in which you find the best-suited weights is multivariate linear regression, and it is arguably the best model for the given constraints.
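A minimal sketch of learning those weights with scikit-learn, assuming the rows of df are ordered in time within each user (the data below is made up):

import pandas as pd
from sklearn.linear_model import LinearRegression

# made-up history in the question's format, ordered in time per user
df = pd.DataFrame({'user_id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'rating':  [5, 6, 4, 6, 7, 7, 6, 8, 7, 9]})

# lag features: the last, second-to-last, and third-to-last rating per user
g = df.groupby('user_id')['rating']
df['lag1'] = g.shift(1)
df['lag2'] = g.shift(2)
df['lag3'] = g.shift(3)

train = df.dropna()  # keep only rows with a full lag history
X, y = train[['lag1', 'lag2', 'lag3']], train['rating']

# the fitted coefficients are the weights w1, w2, w3 from the formula above
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)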