I have this data:
Game 1: 7.0/10.0, Reviewed: 1000 times
Game 2: 7.5/10.0, Reviewed: 3000 times
Game 3: 8.9/10.0, Reviewed: 140,000 times
Game 4: 10.0/10.0, Reviewed: 5 times
...
I want to adjust this data so that each rating reflects how many times the game has been reviewed.
For example, Game 3 should carry a little more weight than Game 4, since it has been reviewed far more often.
And Game 2's 7.5 should be weighted more than Game 1's 7.0.
Is there a proper function for this kind of scaling, something such that
ScaledGameRating = OldGameRating * (some exponential function?)
How about simply normalizing the average scores (i.e. subtracting 5, the midpoint of the scoring interval) and multiplying by the number of reviews? That weights positive and negative scores according to the number of reviews.
Using this approach, you get the following values for your four games:
Game 1: 2,000 (7-5)*1000
Game 2: 7,500 (7.5-5)*3000
Game 3: 546,000 (8.9-5)*140000
Game 4: 25 (10-5)*5
Normalizing works well with negatively reviewed games because a game with a large number of negative (<5) reviews will not beat a game with a small number of positive (>5) reviews. That won't be the case if you use the absolute scores without normalizing.
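A minimal sketch of this normalization in Python, using the numbers from the question:

# (average score, review count) pairs from the question
games = [(7.0, 1000), (7.5, 3000), (8.9, 140000), (10.0, 5)]

MIDPOINT = 5  # midpoint of the 0-10 scoring interval

for i, (score, reviews) in enumerate(games, start=1):
    weighted = (score - MIDPOINT) * reviews
    print("Game {}: {:,.0f}".format(i, weighted))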
You can do this:
Find the total number of reviews across all games.
Then, for a rating out of 10, you can compute:
Game x Rating = ((Number of times Game x was reviewed) / (Total Reviews)) * 10
That gives a rating out of 10 representing the weight of that particular game's reviews among all the games present.
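For what it's worth, a quick sketch of that computation with the review counts from the question (note it reflects only each game's share of reviews, not its original score):

# review counts taken from the question
review_counts = {1: 1000, 2: 3000, 3: 140000, 4: 5}
total_reviews = sum(review_counts.values())

for game, count in review_counts.items():
    print("Game {} rating: {:.4f}".format(game, count / total_reviews * 10))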
My take on this problem is different. If the review count is low, the remaining reviews are unknown and could have been anywhere between 1 and 10. So we can draw a random distribution over the missing range and find the average over the entire maximum review population:
from operator import itemgetter
from random import randint

# (rating, review_count) pairs from the question; ratings normalized to 0-1
rating = [(0.7, 1000), (0.75, 3000), (0.89, 140000), (1.0, 5)]

max_freq = max(rating, key=itemgetter(1))[-1]
for r, f in rating:
    missing = max_freq - f
    actual_rating = r
    if missing:
        # simulate the missing reviews uniformly over 1-10 and average them
        actual_rating = sum(randint(1, 10) for _ in range(missing)) / (10.0 * missing)
    print("Original Rating {}, Scaled Rating {}".format(r, actual_rating))
Original Rating 0.7, Scaled Rating 0.550225179856
Original Rating 0.75, Scaled Rating 0.550952554745
Original Rating 0.89, Scaled Rating 0.89
Original Rating 1, Scaled Rating 0.54975249116
Original Rating 0.7, Scaled Rating 0.550576978417
Original Rating 0.75, Scaled Rating 0.549582481752
Original Rating 0.89, Scaled Rating 0.89
Original Rating 1, Scaled Rating 0.550458230651
Related
I'm trying to predict the winning team of a Destiny2 PvP match based on the player stats.
The dataset has multiple columns:
player_X_stat
where X ranges from 1 to 12, one per player; players 1 to 6 are on team A and players 7 to 12 are on team B
stat is a specific player stat, like the average number of kills per game, the average number of deaths per game, etc.
So there are 12 * (number of different stats) columns (around 100 in total)
winning_team, which is always 0 before shuffling (see note below). The winning team is always composed of players 1 to 6 (team A) and the losing team of players 7 to 12 (team B).
note: To set up a balanced dataset, I shuffle the teams so that the winning_team column has as many 0s as 1s. For example, player_1_kills becomes player_7_kills, player_7_kills becomes player_1_kills, and winning_team becomes 1.
However, when training a CatBoost classifier or a simple SVC, the confusion matrix is almost perfectly balanced:
[[87486 31592]
[31883 87196]]
It seems to me that this is an anomaly but I have no idea where it could come from.
Edit: link to Google Colab notebook
I have a population of national teams (32), and a parameter (a mean) that I want to measure for each team, aggregated per match.
For example: I get the mean scout score for all strikers of each team, per match, and then I take the mean (or median) over all the team's matches.
Now, one group of teams has played 18 matches and another group has played only 8 matches in World Cup qualifying.
I have a hypothesis that, for two teams with an equal mean value, the one with the larger sample size (18) should be ranked higher.
less_than_8 = all_stats[all_stats['games']<=8]
I get values:
3 0.610759
7 0.579832
14 0.537579
20 0.346510
25 0.403606
27 0.536443
and with:
sns.displot(less_than_8, x="avg_attack", kind='kde', bw_adjust=2)
I plot the distribution (figure not shown), which has a mean of 0.5024547681196802.
Now, for:
more_than_18 = all_stats[all_stats['games']>=18]
I get values:
0 0.148860
1 0.330585
4 0.097578
6 0.518595
8 0.220798
11 0.200142
12 0.297721
15 0.256037
17 0.195157
18 0.176994
19 0.267094
21 0.295228
22 0.248932
23 0.420940
24 0.148860
28 0.297721
30 0.350516
31 0.205128
and I plot the curve (figure not shown), with a lower mean of 0.25982701104003497.
It seems clear that sample size does affect the mean, diminishing it as size increases.
Is there a way I can adjust the means of the larger sample sizes AS IF they were calculated on a smaller sample size, or vice versa, using prior and posterior assumptions?
NOTE: I have the std for all teams.
There is a proposed solution for a similar problem, using empirical Bayes estimation and a beta distribution, which can be seen here: Understanding empirical Bayes estimation (using baseball statistics). But I'm not sure how prior means could be extrapolated from successful attempts.
Sample size does affect the mean, but it's not exactly that the mean should increase or decrease as the sample size grows. Rather, as the sample size increases, the sample mean and standard deviation get closer and closer to the population mean μ and standard deviation σ.
I cannot give an exact proposal without more information, like how many datapoints there are per team and per match, and what the standard deviations of these values are. But just looking at the details, I have to presume the 6 teams that qualified with only 8 matches somehow smashed whatever stat you measured. (Probably this is why they only played 8 matches?)
I can make a few simple proposals based on the fact that you want to rank these teams:
Proposal 1:
Extend these stats and calculate a population mean and std for a season. (If you have prior seasons, use them as well.)
Use this mean value to rank teams (without any sample adjustments). This would likely result in the 6 teams landing on top.
Proposal 2:
Calculate the per-game mean across all teams (call it mean_gt) [the mean for game 01, for game 02, ... or for Week 01, Week 02, ...]. I recommend basing it on weeks, since the 6 teams will only have 8 games, and using game numbers would bias datapoints at the beginning or end.
Plot mean_gt and compare each team's mean for a given week with mean_gt [call this difference diff_gt].
diff_gt gives a better perspective on a team's performance in each week, so you can take the mean of this value to rank teams (see the sketch below).
When filling datapoints for the 6 teams with 8 matches, I suggest using the population mean rather than extrapolating, to keep things simple.
But it's possible to get creative, like also using the difference of the aggregate total for the 32 teams: [32 * mean_gt_of_week_1 - total of (32 - x) teams] / x.
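As a rough pandas sketch of that weekly comparison (the 'team' and 'week' column names are assumptions; 'avg_attack' comes from your code):

import pandas as pd

def rank_by_weekly_diff(all_stats: pd.DataFrame) -> pd.Series:
    # mean_gt: per-week mean of avg_attack across all teams that played
    mean_gt = all_stats.groupby('week')['avg_attack'].transform('mean')
    # diff_gt: how far each team sits above or below that week's mean
    stats = all_stats.assign(diff_gt=all_stats['avg_attack'] - mean_gt)
    # rank teams by their average weekly difference, best first
    return stats.groupby('team')['diff_gt'].mean().sort_values(ascending=False)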
I do have another idea, but I'd rather wait for feedback, as I am way off from the simple solution of adjusting a sample mean. :)
I am looking for:
the percentage of players who are in weight type thin,
the percentage of players who are in weight type normal,
the percentage of players who are in weight type overweight,
the percentage of players who are in weight type obese.
All of them are listed in the IMC column.
This is my dataset (screenshot omitted).
I don't know Python well. My current code gives me a percentage for each row, but it does not group by category (thin, normal, overweight, obese).
This is my current code:
(df.groupby('IMC').size() / df['IMC'].count()) * 100
that's because your groupby is reversed:
df.groupby('IMC')
will group by the weight, not by the category. If you flip your code to
df.groupby('clasificacion_oms')
you will then get all the players grouped by their weight type. Then you can divide the count in each group by the total number of players to get your percentage.
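For example, a minimal version of the corrected line (assuming the categories live in the clasificacion_oms column and using the whole frame's length as the denominator):

(df.groupby('clasificacion_oms').size() / len(df)) * 100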
Does that give you what you need? If not, please post your code so we can provide more detail.
I have a financial dataset with monthly aggregates. I know the real world average for each measure.
I am trying to build some dummy transactional data using Python. I do not want the dummy transactional data to be entirely random. I want to model it around the real world averages I have.
E.g., if in the real data the monthly total profit is $1000 over 5 total transactions, then the average profit per transaction is $200.
I want to create dummy transactions that are modelled around this real world average of $200.
This is how I did it:
import pandas as pd
from random import gauss

# draw 5 transaction amounts around the real-world average of $200
bucket = []
for _ in range(5):
    bucket.append(int(gauss(200, 50)))

transactions = pd.DataFrame({'Amount': bucket})
Now, the challenge for me is that I have to randomize the identifiers too.
For example, I know for a fact that there are three buyers in total. Let's call them A, B and C.
These three have done those 5 transactions, and I want to assign them randomly when I create the dummy transactional data. However, I also know that A is very likely to do far more transactions than B and C. To bring my dummy data closer to real-life scenarios, I want to assign probabilities to the occurrence of these buyers in my dummy transactional data.
Let's say I want it like this:
A : 60% appearance
B : 20% appearance
C : 20% appearance
How can I achieve this?
What you are asking for is not quite a probability: you want A to reliably account for 60% of purchases. To do that, take as input a dict holding each user's buying share, then build a list that reflects those shares over your base and randomly pick a buyer from the list. Something like below:
import random

# buy percentages of the users
buy_percentage = {'A': 0.6, 'B': 0.2, 'C': 0.2}

# number of purchases
base = 100

# build a list in which each buyer appears in proportion to their percentage
buy_list = []
for buyer, percentage in buy_percentage.items():
    buy_list.extend(buyer for _ in range(int(percentage * base)))

for _ in range(base):
    # randomly picks a buyer while the overall ratio is maintained
    buyer = random.choice(buy_list)
    # your code to get the buying price goes below
UPDATE:
Alternatively, the answer given in the link below can be used. That solution is better, in my opinion:
A weighted version of random.choice
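For reference, on Python 3.6+ the standard library covers this directly: random.choices accepts relative weights. A minimal sketch with the percentages from the question:

import random

buy_percentage = {'A': 0.6, 'B': 0.2, 'C': 0.2}

# draw 5 buyers in one call, weighted by the given percentages
buyers = random.choices(list(buy_percentage), weights=list(buy_percentage.values()), k=5)
print(buyers)  # e.g. ['A', 'A', 'C', 'A', 'B']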
I have a situation where I am given a total ticket count and cumulative ticket sale data, as follows:
Total Tickets Available: 300
Day 1: 15 tickets sold to date
Day 2: 20 tickets sold to date
Day 3: 25 tickets sold to date
Day 4: 30 tickets sold to date
Day 5: 46 tickets sold to date
The number of tickets sold per day is nonlinear, and I'm asked: if someone plans to buy a ticket on Day 23, what is the probability he will get one?
I've been looking at quite a few libraries used for curve fitting, like numpy, PyLab, and sage, but I've been a bit overwhelmed, since statistics is not in my background. How would I easily calculate a probability given this set of data? If it helps, I also have ticket sale data for other locations; the curve should be somewhat different there.
The best answer to this question would require more information about the problem: are people more or less likely to buy a ticket as the date approaches (and how much)? Are there advertising events that will transiently affect the rate of sales? And so on.
We don't have access to that information, though, so let's just assume, as a first approximation, that the rate of ticket sales is constant. Since sales occur basically at random, they might be best modeled as a Poisson process. Note that this does not account for the fact that many people will buy more than one ticket, but I don't think that will make much difference to the results; perhaps a real statistician could chime in here. Also: I'm going to discuss the constant-rate Poisson process here, but note that since you mentioned the rate is decidedly NOT constant, you could look into variable-rate Poisson processes as a next step.
To model a Poisson process, all you need is the average rate of ticket sales. In your example data, sales-per-day are [15, 5, 5, 5, 16], so the average rate is about 9.2 tickets per day. We've already sold 46 tickets, so there are 254 remaining.
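As a quick check of that arithmetic (a sketch; numpy is only used for the differencing):

import numpy as np

cumulative = [15, 20, 25, 30, 46]      # tickets sold to date, days 1 through 5
per_day = np.diff([0] + cumulative)    # array([15, 5, 5, 5, 16])
rate = per_day.mean()                  # 9.2 tickets per day
remaining = 300 - cumulative[-1]       # 254 tickets left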
From here, it is simple to ask, "Given a rate of 9.2 tpd, what is the probability of selling fewer than 254 tickets in 23 days?" (ignoring the fact that you can't sell more than 300 tickets). The way to calculate this is with a cumulative distribution function (see here for the CDF of a Poisson distribution).
On average, we would expect to sell 23 * 9.2 = 211.6 tickets after 23 days, so in the language of probability distributions, the expectation value is 211.6. The CDF tells us, "given an expectation value λ, what is the probability of seeing a value <= x". You can do the math yourself or ask scipy to do it for you:
>>> import scipy.stats
>>> scipy.stats.poisson(9.2 * 23).cdf(254-1)
0.99747286634158705
So this tells us: IF ticket sales can be accurately represented as a Poisson process and IF the average rate of ticket sales really is 9.2 tpd, then the probability of at least one ticket being available after 23 more days is 99.7%.
Now let's say someone wants to bring a group of 50 friends and wants to know the probability of getting all 50 tickets if they buy them in 25 days (rephrase the question as "If we expect on average to sell 9.2 * 25 tickets, what is the probability of selling <= (254-50) tickets?"):
>>> scipy.stats.poisson(9.2 * 25).cdf(254-50)
0.044301801145630537
So the probability of having 50 tickets available after 25 days is about 4%.