I'm trying to predict the winning team of a Destiny 2 PvP match based on player stats.
The dataset has multiple columns:
player_X_stat
where X runs from 1 to 12, one per player; players 1 to 6 are in team A and players 7 to 12 are in team B.
stat is a specific player stat, like the average number of kills per game, the average number of deaths per game, etc.
So there are 12 * (number of different stats) columns (around 100 in total).
winning_team, which is always 0 in the raw data (see the note below): the winning team is always composed of players 1 to 6 (team A) and the losing team of players 7 to 12 (team B).
Note: to set up a balanced dataset, I swap the two teams in half of the rows so that the winning_team column has as many 0s as 1s. For example, player_1_kills becomes player_7_kills, player_7_kills becomes player_1_kills, and winning_team becomes 1.
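A minimal sketch of that swap, assuming the data sits in a pandas DataFrame and the stat names are available in a list called stats (both names are illustrative, not from the actual notebook):

import numpy as np

def balance_teams(df, stats, seed=0):
    # Swap team A (players 1-6) and team B (players 7-12) in a random half
    # of the rows, and set winning_team = 1 for the swapped rows.
    rng = np.random.default_rng(seed)
    swap = rng.random(len(df)) < 0.5
    out = df.copy()
    for stat in stats:
        for p in range(1, 7):
            a, b = f'player_{p}_{stat}', f'player_{p + 6}_{stat}'
            out.loc[swap, [a, b]] = df.loc[swap, [b, a]].to_numpy()
    out['winning_team'] = swap.astype(int)
    return out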
However, when I train a CatBoost classifier or a simple SVC, the confusion matrix comes out almost perfectly symmetric:
[[87486 31592]
[31883 87196]]
This looks like an anomaly to me, but I have no idea where it could come from.
Edit: link to google colab notebook
I have a population of national teams (32) and a parameter (a mean) that I want to measure for each team, aggregated per match.
For example: I take the mean scout score over all strikers of each team, per match, and then I take the mean (or median) over all of that team's matches.
Now, one group of teams has played 18 matches and another group has played only 8 matches in World Cup qualifying.
My hypothesis is that, for two teams with the same mean value, the one with the larger sample size (18) should be ranked higher.
less_than_8 = all_stats[all_stats['games']<=8]
I get values:
3 0.610759
7 0.579832
14 0.537579
20 0.346510
25 0.403606
27 0.536443
and with:
sns.displot(less_than_8, x="avg_attack",kind='kde',bw_adjust=2)
I plot the density (figure not shown), with a mean of 0.5024547681196802.
Now, for:
more_than_18 = all_stats[all_stats['games']>=18]
I get values:
0 0.148860
1 0.330585
4 0.097578
6 0.518595
8 0.220798
11 0.200142
12 0.297721
15 0.256037
17 0.195157
18 0.176994
19 0.267094
21 0.295228
22 0.248932
23 0.420940
24 0.148860
28 0.297721
30 0.350516
31 0.205128
and I plot the curve (figure not shown), with a lower mean of 0.25982701104003497.
It seems clear that sample size does affect the mean, diminishing it as size increases.
Is there a way I can adjust the means of the larger samples AS IF they were calculated on a smaller sample size, or vice versa, using prior and posterior assumptions?
NOTE: I have the std for all teams.
There is a proposed solution for a similar problem, using empirical Bayes estimation and a beta distribution, which can be seen here: Understanding empirical Bayes estimation (using baseball statistics), but I'm not sure how prior means could be extrapolated from successful attempts.
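As I understand it, the linked approach boils down to beta-binomial shrinkage: fit a beta prior to all the observed rates, then pull each team's rate toward the prior mean, with less shrinkage for larger samples. A rough sketch (method-of-moments prior fit; it assumes the stat can be expressed as successes out of attempts, which is exactly the part I'm not sure how to map onto my per-match means):

import numpy as np

def empirical_bayes_shrink(successes, attempts):
    # Fit a beta prior to the raw per-team rates (method of moments).
    rates = successes / attempts
    m, v = rates.mean(), rates.var()
    common = m * (1 - m) / v - 1
    alpha0, beta0 = m * common, (1 - m) * common
    # Posterior mean per team: small samples are pulled strongly toward the prior mean m.
    return (successes + alpha0) / (attempts + alpha0 + beta0)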
Sample size does affect the mean estimate, but it is not the case that the mean should systematically increase or decrease as the sample size grows. Rather, as the sample size increases, the sample mean and standard deviation get closer and closer to the population mean μ and standard deviation σ.
I cannot give an exact proposal without more information, such as how many data points there are per team and per match, and what the standard deviations of these values are. But just looking at the details, I have to presume the 6 teams that qualified with only 8 matches somehow smashed whatever stat you have measured. (Probably this is why they only played 8 matches?)
I can make a few simple proposals, based on the fact that you want to rank these teams:
Proposal 1:
Extend these stats and calculate a population mean and std for the season (if you have prior seasons, use them as well).
Use this mean value to rank the teams (without any sample-size adjustment). This would likely result in the 6 teams landing on top.
Proposal 2:
Calculate the per-game mean across all teams (call it mean_gt), i.e. the mean for game 01, game 02, ..., or the mean for the game in Week 01, Week 02, ... I recommend grouping by week, since the 6 teams will only have 8 games and grouping by game number would bias the data points at the beginning or end.
Plot mean_gt and compare each team's mean for a given week with mean_gt (call this difference diff_gt).
diff_gt gives a better perspective on a team's performance in each week, so you can take the mean of this value to rank teams (see the sketch below).
When filling in the missing data points for the 6 teams with 8 matches, I suggest using the population mean rather than extrapolating, to keep things simple.
But it's possible to get creative, e.g. also using the aggregate total over the 32 teams: something like (32 * mean_gt_of_week_1 - total of the other [32 - x] teams) / x.
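A rough sketch of Proposal 2 in pandas, assuming a long-format frame with team, week and avg_attack columns (the column names are illustrative):

import pandas as pd

def rank_by_weekly_diff(df):
    # mean_gt: league-wide mean of the stat, computed per week.
    mean_gt = df.groupby('week')['avg_attack'].transform('mean')
    # diff_gt: each team's weekly value relative to that week's league-wide mean.
    df = df.assign(diff_gt=df['avg_attack'] - mean_gt)
    # Average diff_gt over the weeks each team actually played, highest first.
    return df.groupby('team')['diff_gt'].mean().sort_values(ascending=False)

Filling the missing weeks for the 6 teams with the population mean, as suggested above, would simply amount to appending zeros to their diff_gt values before averaging.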
I do have another idea, but I'd rather wait for feedback, as I am already way off from the simple solution of adjusting a sample mean. :)
I have 10K lists of 20 values each, like the ones below, in a .dat file:
Medical, 24673, 23578, Orange, USA, Green, 25980, Canada, IT, 2M376, Engineering, 50925, 39421, Apple, India, Red, 77789, Mexico, FIN, 3R376
24673, 23578, HongKong, Green, Management, 77789, Canada, HR, HL238, Engineering, 34009, 22173, Netflix, India, Jio, 77789, Mexico, OPS, 3R376, Orange
I ran the program below to find the top 3 most frequently occurring combinations of 5 values, e.g. [Medical, 24673, 23578, India, Mexico] - occurred 1000 times. But the program kept running for a very long time - more than 10 hours - so I stopped it. Can you please suggest a more optimized way to do the same?
from itertools import combinations
from collections import Counter
import ast
def most_frequent_combs(fn):
    counter1 = Counter()
    for ln in open(fn):
        unique_tokens = sorted(set(ast.literal_eval(ln)))
        combos = combinations(unique_tokens, 5)
        counter1 += Counter(combos)
    return counter1
fn = '/data/unix_framework/landing/summit/finalcollection.dat'
p = most_frequent_combs(fn)
print(p.most_common(3))
print(p.most_common()[-3])
Here is my solution; it is a bit long, so bear with me. Let's assume that the vocabulary size of the data is $V$, meaning that there are $V$ distinct words to choose from. Now represent the data as a matrix of $N$ rows (10K here) and $V$ columns; call it $A$. Then $A_{i,j} = 1$ if the i-th data row contains the j-th word, and 0 otherwise.
Now pick the 5 columns of the matrix $A$ that have the highest $l_1$ norm, and consider a vector $x$ of size $V \times 1$ with sparsity level 5, meaning it is 1 at 5 locations and 0 everywhere else. In the first round, $x$ is 1 at the locations corresponding to the top-5 $l_1$-norm columns. Compute the $l_1$ norm of the vector $Ax$. Store it in an array, then replace one of the 5 previously selected columns with the column having the 6th-highest $l_1$ norm and compute the norm of $Ax$ again. Keep track of the norms while replacing the selected columns one by one. Do this until all the columns of $A$ are exhausted.
This method should be faster (I am talking orders of magnitude faster) than looping over all possible combinations.
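A rough sketch of this matrix formulation in NumPy, under the assumption that each line of the file parses into a list of tokens as in the original code (the function names and the greedy swap loop are illustrative, not a drop-in replacement):

import ast
import numpy as np

def build_matrix(fn):
    # Binary document-term matrix A: N rows (records) x V columns (distinct tokens).
    rows = [set(ast.literal_eval(ln)) for ln in open(fn)]
    col = {}
    for row in rows:
        for tok in row:
            col.setdefault(tok, len(col))
    A = np.zeros((len(rows), len(col)), dtype=np.int8)
    for i, row in enumerate(rows):
        A[i, [col[tok] for tok in row]] = 1
    return A, col

def greedy_norm_search(A):
    # Columns ordered by decreasing l1 norm (number of rows containing that token).
    order = np.argsort(A.sum(axis=0))[::-1]
    selected = list(order[:5])                              # start with the 5 heaviest columns
    scores = {tuple(selected): int(A[:, selected].sum())}   # ||Ax||_1 for this selection
    for nxt in order[5:]:
        for i in range(5):                                  # swap nxt in for each selected column in turn
            trial = selected.copy()
            trial[i] = nxt
            scores[tuple(trial)] = int(A[:, trial].sum())
    return scores

Note that ||Ax||_1 counts the total token hits across rows; if you want the exact number of rows containing all five selected tokens, count the rows where the row sum of the selected columns equals 5 instead.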
I am trying to use the statsmodels package to do Maximum Likelihood Estimation.
My goal is to compute a set of soccer team ratings given a bunch of game outcomes.
I'm using a Logistic model:
1/(1 + e^(HFA + HT - AT))
That has the following parameters:
HFA - home field advantage
HT - home team rating
AT - away team rating
The goal here is to compute each team's rating (there are 18 teams) plus a constant home-field advantage factor.
The data I'm given is simply a list of game outcomes - home team, away team, 1 if the home team won the game, 0 if not.
My current thinking is to enrich this data so that there is a column for each team. I would then set a particular team's column to 1 if that team is playing in that particular game and 0 if not, so there would be two 1's per row. Then I would add another column called 'hfa', which is always 1 and represents the home-field advantage.
Because of how the model works, for any given game I need to know which team was home and which was away so that I can compute the prediction properly. And to do that, I believe I need the data labels so I can determine which of the two teams in the game was the home team. However, any time I include non-numeric data (e.g., the team name) in my X columns, I get an error from statsmodels.
Here is the gist of the code I'm trying to write. Any guidance on how to do what I'm trying to do would be much appreciated - thank you!
from statsmodels.base.model import GenericLikelihoodModel
class MyLogit(GenericLikelihoodModel):
    def nloglikeobs(self, params):
        """
        This function should return one evaluation of the negative log-likelihood
        function per observation in your dataset (i.e. rows of the endog/X matrix).
        """
        ???
X_cols = ['hfa'] + list(teams)
X = results[X_cols]
y = results['game_result']
model = MyLogit(X, y)
result = model.fit()
print(result.summary())
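One possible way to fill this in, sketched under two assumptions that are not in the question: the team columns use a signed encoding (+1 for the home team, -1 for the away team, 0 otherwise, with hfa always 1), so that X @ params equals HFA + HT - AT for each game, and the exponent carries a minus sign so that higher ratings mean stronger teams. The class and variable names are illustrative.

import numpy as np
from statsmodels.base.model import GenericLikelihoodModel

class TeamRatingLogit(GenericLikelihoodModel):
    def nloglikeobs(self, params):
        # Linear predictor: with the signed encoding above, X @ params = HFA + HT - AT.
        z = np.dot(self.exog, params)
        p = 1.0 / (1.0 + np.exp(-z))      # P(home team wins)
        y = self.endog
        # One negative log-likelihood term per observation (per game).
        return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Illustrative usage, assuming `results` has one row per game and `teams` lists the team names:
# X = results[['hfa'] + list(teams)].astype(float)
# y = results['game_result'].astype(float)
# fit = TeamRatingLogit(endog=y, exog=X).fit(start_params=np.zeros(X.shape[1]))
# print(fit.summary())

Note that with one column per team, the ratings are only identified up to an additive constant (adding the same constant to every rating cancels in HT - AT), so in practice you would drop one team's column or pin one rating to zero.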
Days    Adjusted stock price
0       100
1       50
2       200
3       210
4       220
5       34
6       35
7       36
8       89
Assume this table is a pandas DataFrame. Can someone help me write a function that shows the probability of the up and down moves of the stock price? For example, what is the probability of the stock price having two up days in a row?
Thanks! I am new to Python and have been trying to figure this out for a while.
Actual stock price movement prediction is both a broad and a deep subject usually associated with time series analysis which I would consider out of the scope of this question.
However, the naive approach would be to assume the Bernoulli model where each price move is considered independent both of any previous moves and of time.
In this case, the probability of the price moving up can be inferred by measuring all the up moves against all moves recorded.
# df is a single-column pandas DataFrame storing the price
((df['price'] - df['price'].shift(1)) > 0).sum()/(len(df) - 1)
which for the data you posted gives 0.75.
Given the above, the probability of the price going up on two consecutive days would be 0.75 * 0.75 = 0.5625, i.e. approximately 0.56.
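A minimal sketch wrapping this into a function (the function name and the 'price' column name are just for illustration):

import pandas as pd

def prob_consecutive_up(df, k=2, col='price'):
    # Empirical probability of a single up move.
    p_up = (df[col].diff() > 0).sum() / (len(df) - 1)
    # Under the Bernoulli (independence) assumption, k consecutive up days have probability p_up**k.
    return p_up ** k

df = pd.DataFrame({'price': [100, 50, 200, 210, 220, 34, 35, 36, 89]})
print(prob_consecutive_up(df))   # 0.75**2 = 0.5625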
I have this data:
Game 1: 7.0/10.0, Reviewed: 1000 times
Game 2: 7.5/10.0, Reviewed: 3000 times
Game 3: 8.9/10.0, Reviewed: 140,000 times
Game 4: 10.0/10.0, Reviewed: 5 times
.
.
.
I want to manipulate this data in a way that makes each rating reflective of how many times it has been reviewed.
For example, Game 3 should carry a little more weight than Game 4, since it has been reviewed far more often.
And Game 2's 7.5 should be weighted more than Game 1's 7.0.
Is there a proper function to do this scaling, in such a way that
ScaledGameRating = OldGameRating * (some exponential function?)
How about simply normalizing the average scores (i.e. subtracting 5, the midpoint of the scoring interval) and multiplying by the number of reviews? That will weight positive or negative scores according to the number of reviews.
Using this approach, you get the following values for your four games:
Game 1: 2,000 = (7 - 5) * 1000
Game 2: 7,500 = (7.5 - 5) * 3000
Game 3: 546,000 = (8.9 - 5) * 140000
Game 4: 25 = (10 - 5) * 5
Normalizing works well with negatively reviewed games because a game with a large number of negative (<5) reviews will not beat a game with a small number of positive (>5) reviews. That won't be the case if you use the absolute scores without normalizing.
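A quick sketch of this scoring in Python (the data structure is just an illustration of the four games above):

games = {
    'Game 1': (7.0, 1000),
    'Game 2': (7.5, 3000),
    'Game 3': (8.9, 140000),
    'Game 4': (10.0, 5),
}

# Centre each score on the midpoint (5) and weight it by the review count.
scores = {name: (rating - 5) * n_reviews for name, (rating, n_reviews) in games.items()}
print(scores)  # {'Game 1': 2000.0, 'Game 2': 7500.0, 'Game 3': 546000.0, 'Game 4': 25.0}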
You can do the following:
Find the total number of reviews.
Then, for a rating out of 10, you can just compute
Game x Rating = ((number of times Game x was reviewed) / (total reviews)) * 10
which will give you an out-of-10 rating. That is the weight of the particular game's reviews within all the reviews present.
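In code, using the review counts from the question as an illustration:

reviews = {'Game 1': 1000, 'Game 2': 3000, 'Game 3': 140000, 'Game 4': 5}
total = sum(reviews.values())

# Each game's share of all reviews, rescaled to an out-of-10 value.
weights = {name: (n / total) * 10 for name, n in reviews.items()}
print(weights)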
My take on this problem is different. If the review count is low, the remaining reviews are unknown and could have been anywhere between 1 and 10. So we can draw a random value for each missing review over that range and compute the average over the full maximum review population.
from operator import itemgetter
from random import randint

# (rating rescaled to 0-1, review count) -- inferred from the question's data
rating = [(0.7, 1000), (0.75, 3000), (0.89, 140000), (1, 5)]

max_freq = max(rating, key=itemgetter(1))[-1]
for r, f in rating:
    missing = max_freq - f
    actual_rating = r
    if missing:
        # replace the rating by the average of `missing` uniform random scores in 1..10, rescaled to 0-1
        actual_rating = sum(randint(1, 10) for e in range(missing)) / (10.0 * missing)
    print("Original Rating {}, Scaled Rating {}".format(r, actual_rating))
Original Rating 0.7, Scaled Rating 0.550225179856
Original Rating 0.75, Scaled Rating 0.550952554745
Original Rating 0.89, Scaled Rating 0.89
Original Rating 1, Scaled Rating 0.54975249116
Original Rating 0.7, Scaled Rating 0.550576978417
Original Rating 0.75, Scaled Rating 0.549582481752
Original Rating 0.89, Scaled Rating 0.89
Original Rating 1, Scaled Rating 0.550458230651