I have some data about a video game.
Data:
A match up has a matchId. Each match up includes two teams, and each team sizes varies. For instance, 3v3, 4v4, 5v5, ... Data simplified as follow:
matchId playerId teamId victory
100 200 14 1
100 201 14 1
100 212 14 1
100 220 14 1
100 202 15 0
100 206 15 0
100 214 15 0
100 217 15 0
Task:
I like to use a binary classifier in Scikit to predict the victory value (0/1) based on players' feature.
Questions:
I'm looking for a way to present feature that the classifier detect which two teams played against each other, because the result of a match up depends on the opponent team.
Later, I would like to see which players had more effect on the match up result, and which skills are more effective in victory. Can I use importance rates in Gradient Boosting Classifier?
One approach would be using a sparse-vector of all player_ids multiplied with 2 (2 teams), where the chosen ones are indicated with a nonzero value like 1.
If there are N players 0, ..., N-1, and the team A consists of 1, 3, 5, team B consists of 0, 2, 4, the input looks like the following:
x_sample_0 = [0, 1, 0, 1, 0, 1, 0, ...N-1, 1, 0, 1, 0, 1, ...]
...team A... ... team B...
This should be a quite powerful representation (in regards to the information represented) of the task with two obvious drawbacks:
the vector-size will get large (linear in the size of players)
this does not matter if the classifier is able to work with sparse data-structures (scipy.sparse) as there are mostly zeros
there is the problem of symmetry, as each pairing A vs. B can also be encoded as B vs. A
at the cost of doubling the input sample-size, i would highly recommend adding the symmetric pairing as extra sample (sample-size is doubled)
remember too flip the output (meaning: encode y as the win/loss from first teams perspective)!
maybe enforcing a lexicographic-ordering which will always output a well-defined order between two pairings (if players can't play themself) can be used as an alternative (i would try the above first)
Edit:
One more alternative:
introduce one variable for each possible team (e.g. f_0 = team of 0, 2, 4) and use this representation which has a different vector-size depending on the statistics
this kind of loses some information and would need much more data
Related
I'm trying to predict the winning team of a Destiny2 PvP match based on the player stats.
The dataset has multiple columns:
player_X_stat
X from 1 to 12 for each player, players from 1 to 6 are in team A and players from 7 to 12 are in team B
stat is a specific player stat like average number of kills per game or average number of deaths per game etc.
So there are X * (number of different stat) columns (around 100 at the end)
winning_team being always 0 (see note below). The winning team is always composed of players from 1 to 6 (team A) and the losing team is composed of players from 7 to 12 (team B).
note: To set up a balanced dataset, I shuffle the teams so that the winning_team column has as many 0s as 1s. For example, player_1_kills becomes player_7_kills, player_7_kills becomes player_1_kills and winning_team becomes 1.
However when training a catboost classifier or a simple SVC, the confusion matrix is almost perfectly balanced:
[[87486 31592]
[31883 87196]]
It seems to me that this is an anomaly but I have no idea where it could come from.
Edit: link to google colab notebook
I have a population of National Teams (32), and parameter (mean) that I want to measure for each team, aggregated per match.
For example: I get the mean scouts for all strikers for each team, per match, and then I get the the mean (or median) for all team matches.
Now, one group of teams have played 18 matches and another group has played only 8 matches for the World Cup Qualifying.
I have an hypothesis that, for two teams with equal mean value, the one with larger sample size (18) should be ranked higher.
less_than_8 = all_stats[all_stats['games']<=8]
I get values:
3 0.610759
7 0.579832
14 0.537579
20 0.346510
25 0.403606
27 0.536443
and with:
sns.displot(less_than_8, x="avg_attack",kind='kde',bw_adjust=2)
I plot:
with a mean of 0.5024547681196802
Now, for:
more_than_18 = all_stats[all_stats['games']>=18]
I get values:
0 0.148860
1 0.330585
4 0.097578
6 0.518595
8 0.220798
11 0.200142
12 0.297721
15 0.256037
17 0.195157
18 0.176994
19 0.267094
21 0.295228
22 0.248932
23 0.420940
24 0.148860
28 0.297721
30 0.350516
31 0.205128
and I plot the curve:
with a lower mean, of 0.25982701104003497.
It seems clear that sample size does affect the mean, diminishing it as size increases.
Is there a way I can adjust the means of larger sample size AS IF they were being calculated on a smaller sample size, or vice versa, using prior and posteriori assumptions?
NOTE. I have std for all teams.
There is a proposed solution for a similar matter, using Empirical Bayes estimation and a beta distribution, which can be seen here Understanding empirical Bayes estimation (using baseball statistics), but I'm not sure as to how it could prior means could be extrapolated from successful attempts.
Sample size does affect the mean, but it's not exactly like mean should increase or decrease when sample size is increased. Moreover; sample size will get closer and closer to the population mean μ and standard deviation σ.
I cannot give an exact proposal without more information like; how many datapoints per team, per match, what are the standard deviations in these values. But just looking at the details I have to presume the 6 teams qualified with only 8 matches somehow smashed whatever the stat you have measured. (Probably this is why they only played 8 matches?)
I can make a few simple proposals based on the fact that you would want to rank these teams;
Proposal 1:
Extend these stats and calculate a population mean, std for a season. (If you have prior seasons use them as well)
Use this mean value to rank teams (Without any sample adjustments) - This would likely result in the 6 teams landing on top
Proposal 2:
Calculate per game mean across all teams(call it mean_gt) [ for game 01. mean for game 02.. or mean for game in Week 01, Week 02 ] - I recommend based on week as 6 teams will only have 8 games and this would bias datapoints in the beginning or end.
plot mean_gt and compare each team's given Week's mean with mean_gt [ call this diff diff_gt]
diff_gt gives a better perspective of a team's performance on each week. So you can get a mean of this value to rank teams.
When filling datapoint for 6 teams with 8 matches I suggest using the population mean rather than extrapolating to keep things simple.
But it's possible to get creative; like using the difference of aggregate total for 32 teams also. Like [ 32*mean_gt_of_week_1 - total of [32-x] teams]/x
I do have another idea. But rather wait for a feedback as I am way off the simple solution for adjusting a sample mean. :)
This may be a basic question, I have a categorical data and I want to feed this into my machine learning model. my ML model accepts only numerical data. What is the correct way to convert this categorical data into numerical data.
My Sample DF:
T-size Gender Label
0 L M 1
1 L M 1
2 M F 1
3 S F 0
4 M M 1
5 L M 0
6 S F 1
7 S F 0
8 M M 1
I know this following code convert my categorical data into numerical
Type-1:
df['T-size'] = df['T-size'].cat.codes
Above line simply converts category from 0 to N-1. It doesn't follow any relationship between them.
For this example I know S < M < L. What should I do when I have want to convert data like above.
Type-2:
In this type I No relationship between M and F. But I can tell that When M has more probability than F. i.e., sample to be 1 / Total number of sample
for Male,
(4/5)
for Female,
(2/4)
WKT,
(4/5) > (2/4)
How should I replace for this kind of column?
Can I replace M with (4/5) and F with (2/4) for this problem?
What is the proper way to dealing with column?
help me to understand this better.
There are many ways to encode categorical data, some of them depend on exactly what you plan to do with it. For example, one-hot-encoding which is easily the most popular choice is an extremely poor choice if you're planning on using a decision tree / random forest / GBM.
Regarding your t-shirts above, you can give a pandas categorical type an order:
df['T-size'].astype(pd.api.types.CategoricalDtype(['S','M','L'],ordered=True)).
if you had set up your tshirt categorical like that then your .cat.codes method would work perfectly. It also means you can easily use scikit-learn's LabelEconder which fits neatly into pipelines.
Regarding you encoding of gender, you need to be very careful when using your target variable (your Label). You don't want to do this encoding before your train-test split otherwise you're using knowledge of your unseen data making it not truly unseen. This gets even more complicated if you're using cross-validation as you'll need to do the encoding with in each CV iteration (i.e. new encoding per fold). If you want to do this, I recommend you check out TargetEncoder from skcontribs Category Encoders but again, be sure to use this within an sklearn Pipeline or you will mess up the train-test splits and leak information from your test set into you training set.
If you want to have a hierarchy in your size parameter, you may consider using a linear mapping for it. This would be :
size_mapping = {"S": 1, "M":2 , "L":3}
#mapping to the DataFrame
df['T-size_num'] = df['T-size'].map(size_mapping)
This allows you to treat the input as numerical data while preserving the hierarchy
And as for the gender, you are misconceiving the repartition and the preproces. If you already put the repartition as an input, you will introduce a bias in your data. You must consider that Male and female as two distinct categories regardless of their existing repartition. You should map it with two different numbers, but without introducing proportions.
df['Gender_num'] = df['Gender'].map({'M':0 , 'F':1})
For a more detailed explanation and a coverage of more specificities than your question, I suggest reading this article regarding categorical data in Machine Learning
For the first question, if you have a small number of categories, you could map the column with a dictionary. In this way you can set an order:
d = {'L':2, 'M':1, 'S':0}
df['T-size'] = df['T-size'].map(d)
Output:
T-size Gender Label
0 2 M 1
1 2 M 1
2 1 F 1
3 0 F 0
4 1 M 1
5 2 M 0
6 0 F 1
7 0 F 0
8 1 M 1
For the second question, you can use the same method, but i would leave the 2 values for males and females 0 and 1. If you need just the category and you dont have to make operations with the values, a values is equal to another.
It might be overkill for the M/F example, since it's binary - but if you are ever concerned about mapping a categorical into a numerical form, then consider one hot encoding. It basically stretches your single column containing n categories, into n binary columns.
So a dataset of:
Gender
M
F
M
M
F
Would become
Gender_M Gender_F
1 0
0 1
1 0
1 0
0 1
This takes away any notion of one thing being more "positive" than another - an absolute must for categorical data with more than 2 options, where there's no transitive A > B > C relationship and you don't want to smear your results by forcing one into your encoding scheme.
I am a beginner in machine learning. I am confused how to combine different features of a data set into one single feature.
For example, I have a data set in Python Pandas data frame of features like this:
movie unknown action adventure animation fantasy horror romance sci-fi
Toy Story 0 1 1 0 1 0 0 1
Golden Eye 0 1 0 0 0 0 1 0
Four Rooms 1 0 0 0 0 0 0 0
Get Shorty 0 0 0 1 1 0 1 0
Copy Cat 0 0 1 0 0 1 0 0
I would like to convert this n features into one single feature named "movie_genre". One solution would be assign an integer value to each genre (unknown = 0, action = 1, adventure = 2 ..etc) and create a data frame like this:
movie genre
Toy Story 1,2,4,7
Golden Eye 1,6
Four Rooms 0
Get Shorty 3,4,6
Copy Cat 2,5
But in this case the entries in the column will be no longer an integer/ float value. Will that affect my future steps in machine learning process like fitting model and evaluating the algorithms?
convert each series of zeros and ones into an 8-bit number
toy story = 01101001
in binary, that's 105
similarly, Golden Eye=01000010 = 26946
you can do the rest here manually: http://www.binaryhexconverter.com/binary-to-decimal-converter
it's relatively straight forward to do programatically - just look through each label, and assign it the appropriate power of two then sum them up
It may be effective to leave them in their current multi-feature format and perform some sort of dimensionality reduction technique on that data.
This is very similar to a classic question: how do we treat categorical variables? One answer is one-hot or dummy encoding, which your original DataFrame is very similar to. With one-hot encoding, you start with a single, categorical feature. Using that feature, you make a column for each level, and assign a binary value to that column. The encoded result looks quite similar to what you are starting with. This sort of encoding is popular and many find it quite effective. Yours takes this one step further as each movie could be multiple genres. I'm not sure reversing that is a good idea.
Simply having more features is not always a bad thing if it is representing the data appropriately, and if you have enough observations. If you end up with a prohibitive number of features, there are many ways of reducing dimensionality. There is a wealth of knowledge on this topic out there, but one common technique is to apply principal component analysis (PCA) to a higher-dimensional dataset to find a lower-dimensional representation.
Since you're using python, you might want to check out what's available in scikit-learn for more ideas. A few resources in their documentation can be found here and here.
One thing you can do is to make a matrix of all possible combinations and reshape it into a single vector. If you want to account for all combinations it will have the same length as the original. If there are combinations that you don't need simply don't take them into account. Your network is label-agnostic and it won't mind.
But why is that a problem? Your dataset looks small enough.
INTRO
I have a Pandas DataFrame that represents a segmented time series of different users (i.e., user1 & user2). I want to train a scikit-learn classifier with the mentioned DataFrames, but I can't understand the shape of the scikit-learn dataset that I must create.
Since my series are segmented, my DataFrame has a 'segID' column that contains IDs of a specific segment. I'll skip the description of the segmentation since it is provided by an algorithm.
Let's take an example where both user1 and user2 has 2 segments: print df
username voltage segID
0 user1 -0.154732 0
1 user1 -0.063169 0
2 user1 0.554732 1
3 user1 -0.641311 1
4 user1 -0.653732 1
5 user2 0.446469 0
6 user2 -0.655732 0
7 user2 0.646769 0
8 user2 -0.646369 1
9 user2 0.257732 1
10 user2 -0.346369 1
QUESTIONS:
scikit-learn dataset API says to create a dict containing data and target, but how can I shape my data since they are segments and not just a list?
I can't figure out my segments fitting into the n_samples * n_features structure.
I have two ideas:
1) every data sample is a list representing a segment, on the other hand, target is different for each data entry since they're grouped. What about target_names? Could this work?
{
'data': array([
[[-0.154732, -0.063169]],
[[ 0.554732, -0.641311, -0.653732],
[[ 0.446469, -0.655732, 0.646769]],
[[-0.646369, 0.257732, -0.346369]]
]),
'target':
array([0, 1, 2, 3]),
'target_names': array(['user1seg1', 'user1seg2', 'user2seg1', 'user2seg2'], dtype='|S10')
}
2) data is (simply) the nparray returned by df.values. target contains segments' IDs different for each user.... does it make sense?
{
'data': array([
[-0.154732],
[-0.063169],
[ 0.554732],
[-0.641311],
[-0.653732],
[ 0.446469],
[-0.655732],
[ 0.646769],
[-0.646369],
[ 0.257732],
[-0.346369]
]),
'target':
array([0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]),
'target_names': array(['user1seg1', 'user1seg1', 'user1seg2', 'user1seg2', .....], dtype='|S10')
}
I think the main problem is that I can't figure out what to use as labels...
EDIT:
ok it's clear... labels are given by my ground truth, they are just the user's names.
elyase's answer is exactly what i was looking for.
In order to better state the problem, I'm going to explain here the segID meaning.
In time series pattern recognition, segmenting could be useful in order to isolate meaningful segments.
At testing time I want to recognize segments and not the entire series, because series is rather long and segments are supposed to be meaningful in my context.
Have a look at the following example from this implementation based on "An Online Algorithm for Segmenting Time Series".
My segID is just a column representing the id of a chunk.
This is not trivial and there might be several way of formulating the problem for consumption by a ML algorithm. You should try them all and find how you get the best results.
As you already found you need two things, a matrix X of shape n_samples * n_features and a column vector y of length 'n_samples'. Lets start with the target y.
Target:
As you want to predict a user from a discrete pool of usernames, you have a classification problem an your target will be a vector with np.unique(y) == ['user1', 'user2', ...]
Features
Your features are the information that you provide the ML algorithm for each label/user/target. Unfortunately most algorithms require this information to have a fixed length, but variable length time series don't fit well into this description. So if you want to stick to classic algorithms, you need some way to condense the time series information for a user into a fixed length vector. Some possibilities are the mean, min, max, sum, first, last values, histogram, spectral power, etc. You will need to come up with the ones that make sense for your given problem.
So if you ignore the SegID information your X matrix will look like this:
y/features
min max ... sum
user1 0.1 1.2 ... 1.1 # <-first time series for user 1
user1 0.0 1.3 ... 1.1 # <-second time series for user 1
user2 0.3 0.4 ... 13.0 # <-first time series for user 2
As SegID is itself a time series you also need to encode it as fixed length information, for example a histogram/counts of all possible values, most frequent value, etc
In this case you will have:
y/features
min max ... sum segID_most_freq segID_min
user1 0.1 1.2 ... 1.1 1 1
user1 0.3 0.4 ... 13 2 1
user2 0.3 0.4 ... 13 5 3
The algorithm will look at this data and will "think": so for user1 the minimum segID is always 1 so if I see a user a prediction time, whose time series has a minimum ID of 1 then it should be user1. If it is around 3 it is probably user2, and so on.
Keep in mind that this is only a possible approach. Sometimes it is useful to ask, what info will I have at prediction time that will allow me to find which user is the one I am seeing and why will this info lead to the given user?