INTRO
I have a Pandas DataFrame that represents a segmented time series of different users (i.e., user1 & user2). I want to train a scikit-learn classifier with the mentioned DataFrames, but I can't understand the shape of the scikit-learn dataset that I must create.
Since my series are segmented, my DataFrame has a 'segID' column that contains IDs of a specific segment. I'll skip the description of the segmentation since it is provided by an algorithm.
Let's take an example where both user1 and user2 have 2 segments: print df
    username   voltage  segID
0      user1 -0.154732      0
1      user1 -0.063169      0
2      user1  0.554732      1
3      user1 -0.641311      1
4      user1 -0.653732      1
5      user2  0.446469      0
6      user2 -0.655732      0
7      user2  0.646769      0
8      user2 -0.646369      1
9      user2  0.257732      1
10     user2 -0.346369      1
QUESTIONS:
scikit-learn dataset API says to create a dict containing data and target, but how can I shape my data since they are segments and not just a list?
I can't figure out my segments fitting into the n_samples * n_features structure.
I have two ideas:
1) Every data sample is a list representing a segment; on the other hand, the target is different for each data entry since they're grouped. What about target_names? Could this work?
{
 'data': array([
     [-0.154732, -0.063169],
     [ 0.554732, -0.641311, -0.653732],
     [ 0.446469, -0.655732,  0.646769],
     [-0.646369,  0.257732, -0.346369]
 ]),
 'target': array([0, 1, 2, 3]),
 'target_names': array(['user1seg1', 'user1seg2', 'user2seg1', 'user2seg2'], dtype='|S10')
}
2) data is (simply) the ndarray returned by df.values; target contains the segments' IDs, different for each user.... does that make sense?
{
'data': array([
[-0.154732],
[-0.063169],
[ 0.554732],
[-0.641311],
[-0.653732],
[ 0.446469],
[-0.655732],
[ 0.646769],
[-0.646369],
[ 0.257732],
[-0.346369]
]),
'target':
array([0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]),
'target_names': array(['user1seg1', 'user1seg1', 'user1seg2', 'user1seg2', .....], dtype='|S10')
}
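(For reference, a sketch of how the arrays in option 2 could be built straight from the DataFrame; the variable names and the use of pd.factorize are just illustrative.)

import pandas as pd

data = df[['voltage']].values                                 # shape (11, 1)
seg_labels = df['username'] + 'seg' + (df['segID'] + 1).astype(str)
target, target_names = pd.factorize(seg_labels)               # codes 0..3 and their names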
I think the main problem is that I can't figure out what to use as labels...
EDIT:
OK, it's clear... the labels are given by my ground truth; they are just the usernames.
elyase's answer is exactly what I was looking for.
In order to better state the problem, I'm going to explain here the segID meaning.
In time series pattern recognition, segmenting could be useful in order to isolate meaningful segments.
At testing time I want to recognize segments and not the entire series, because the series is rather long and segments are supposed to be meaningful in my context.
Have a look at the following example from this implementation based on "An Online Algorithm for Segmenting Time Series".
My segID is just a column representing the id of a chunk.
This is not trivial and there might be several ways of formulating the problem for consumption by an ML algorithm. You should try them all and see which gives you the best results.
As you already found, you need two things: a matrix X of shape n_samples * n_features and a column vector y of length n_samples. Let's start with the target y.
Target:
As you want to predict a user from a discrete pool of usernames, you have a classification problem, and your target will be a vector with np.unique(y) == ['user1', 'user2', ...]
Features:
Your features are the information that you provide the ML algorithm for each label/user/target. Unfortunately most algorithms require this information to have a fixed length, but variable length time series don't fit well into this description. So if you want to stick to classic algorithms, you need some way to condense the time series information for a user into a fixed length vector. Some possibilities are the mean, min, max, sum, first, last values, histogram, spectral power, etc. You will need to come up with the ones that make sense for your given problem.
So if you ignore the SegID information your X matrix will look like this:
y/features    min   max  ...   sum
user1         0.1   1.2  ...   1.1    # <- first time series for user 1
user1         0.0   1.3  ...   1.1    # <- second time series for user 1
user2         0.3   0.4  ...  13.0    # <- first time series for user 2
As SegID is itself a time series, you also need to encode it as fixed-length information, for example a histogram/counts of all possible values, the most frequent value, etc.
In this case you will have:
y/features    min   max  ...   sum   segID_most_freq   segID_min
user1         0.1   1.2  ...   1.1                 1           1
user1         0.3   0.4  ...  13                   2           1
user2         0.3   0.4  ...  13                   5           3
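A rough sketch of how such a feature matrix could be built with pandas, assuming the DataFrame from the question (one series per user in this toy case; the particular aggregations are only examples):

import pandas as pd

# df has columns: username, voltage, segID (as in the question)
feats = df.groupby('username').agg(
    volt_min=('voltage', 'min'),
    volt_max=('voltage', 'max'),
    volt_mean=('voltage', 'mean'),
    volt_sum=('voltage', 'sum'),
    segID_most_freq=('segID', lambda s: s.mode().iloc[0]),
    segID_min=('segID', 'min'),
)

X = feats.values          # n_samples x n_features matrix
y = feats.index.values    # targets: the usernames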
The algorithm will look at this data and will "think": so for user1 the minimum segID is always 1, so if at prediction time I see a user whose time series has a minimum segID of 1, then it should be user1. If it is around 3, it is probably user2, and so on.
Keep in mind that this is only a possible approach. Sometimes it is useful to ask, what info will I have at prediction time that will allow me to find which user is the one I am seeing and why will this info lead to the given user?
Related
I'm new to Python, and I've got problems in calculating correlation coefficients for multiple participants.
I've got a dataframe just like this :
|Index|Participant|Condition|ReactionTime1|ReactionTime2|
|:---:|:---------:|:-------:|:-----------:|:-------------:|
|1|1|A|320|542|
|2|1|A|250|623|
|3|1|B|256|547|
|4|1|B|301|645|
|5|2|A|420|521|
|6|2|A|123|456|
|7|2|B|265|362|
|8|2|B|402|631|
I am wondering how to calculate the correlation coefficient between ReactionTime1 and ReactionTime2 for Participant 1 and for Participant 2 in each condition. My real dataset is way bigger than this (hundreds of reaction times for each participant, and there are a lot of participants too). Is there a general way to calculate this and put the coefficients in a new df like this?
|Index|Participant|Condition|Correlation coeff|
|:---:|:---------:|:-------:|:-----------:|
|1|1|A|?|
|2|1|B|?|
|3|2|A|?|
|4|2|B|?|
Thanks :)
You can try groupby and apply with np.corrcoef, and reset_index afterwards:
import numpy as np

result = (df.groupby(["Participant", "Condition"])
            .apply(lambda gr: np.corrcoef(gr["ReactionTime1"], gr["ReactionTime2"])[0, 1])
            .reset_index(name="Correlation coeff"))
which gives
   Participant Condition  Correlation coeff
0            1         A               -1.0
1            1         B                1.0
2            2         A                1.0
3            2         B                1.0
We use [0, 1] on the value returned by np.corrcoef because it returns a symmetric matrix whose diagonal elements are normalized to 1 and whose off-diagonal elements are equal, each giving the desired coefficient (so we might as well index with [1, 0]). That is,
array([[1. , 0.25691558],
[0.25691558, 1. ]])
is an example returned value and we are interested in the off-diagonal entry.
Why it returned all +/- 1 in your case: since each participant & condition pair only has 2 entries for each reaction, they are always perfectly correlated, and the sign is determined by their orientation, i.e., if one increases from one coordinate to the other, does the other one increase or decrease.
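A quick check with the two points from Participant 1 / Condition A (values taken from the question's table):

import numpy as np

# RT1 goes down (320 -> 250) while RT2 goes up (542 -> 623),
# so two points give a perfect negative correlation
np.corrcoef([320, 250], [542, 623])[0, 1]   # -1.0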
This may be a basic question. I have categorical data and I want to feed it into my machine learning model, but my ML model accepts only numerical data. What is the correct way to convert this categorical data into numerical data?
My Sample DF:
  T-size Gender  Label
0      L      M      1
1      L      M      1
2      M      F      1
3      S      F      0
4      M      M      1
5      L      M      0
6      S      F      1
7      S      F      0
8      M      M      1
I know the following code converts my categorical data into numerical data:
Type-1:
df['T-size'] = df['T-size'].cat.codes
The above line simply converts the categories to codes 0 to N-1. It doesn't capture any relationship between them.
For this example I know S < M < L. What should I do when I want to convert data like this?
Type-2:
In this case there is no ordering between M and F. But I can tell that M has a higher probability of Label being 1 than F, i.e., (number of samples with label 1) / (total number of samples):
for Male,
(4/5)
for Female,
(2/4)
We know that
(4/5) > (2/4)
How should I replace the values for this kind of column?
Can I replace M with (4/5) and F with (2/4) for this problem?
What is the proper way of dealing with such a column?
Help me understand this better.
There are many ways to encode categorical data, some of them depend on exactly what you plan to do with it. For example, one-hot-encoding which is easily the most popular choice is an extremely poor choice if you're planning on using a decision tree / random forest / GBM.
Regarding your t-shirts above, you can give a pandas categorical type an order:
df['T-size'] = df['T-size'].astype(pd.api.types.CategoricalDtype(['S', 'M', 'L'], ordered=True))
If you had set up your t-shirt categorical like that, then your .cat.codes method would work perfectly. It also means you can easily use scikit-learn's LabelEncoder, which fits neatly into pipelines.
Regarding your encoding of gender, you need to be very careful when using your target variable (your Label). You don't want to do this encoding before your train-test split, otherwise you're using knowledge of your unseen data, making it not truly unseen. This gets even more complicated if you're using cross-validation, as you'll need to do the encoding within each CV iteration (i.e. a new encoding per fold). If you want to do this, I recommend you check out TargetEncoder from scikit-learn-contrib's category_encoders, but again, be sure to use it within an sklearn Pipeline or you will mess up the train-test splits and leak information from your test set into your training set.
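If you do want to try that, here is a minimal sketch of keeping the target encoding inside a Pipeline (this assumes the third-party category_encoders package and uses LogisticRegression purely as a placeholder model):

import pandas as pd
import category_encoders as ce                      # pip install category_encoders
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# ordered encoding for T-size, as described above
df['T-size'] = df['T-size'].astype(
    pd.api.types.CategoricalDtype(['S', 'M', 'L'], ordered=True)).cat.codes

X = df[['T-size', 'Gender']]
y = df['Label']

# the target encoding is re-fit inside each CV fold, so the held-out fold
# never leaks into the encoding
pipe = Pipeline([
    ('encode', ce.TargetEncoder(cols=['Gender'])),
    ('clf', LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=3)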
If you want to have a hierarchy in your size parameter, you may consider using a linear mapping for it. This would be:
size_mapping = {"S": 1, "M": 2, "L": 3}
# mapping to the DataFrame
df['T-size_num'] = df['T-size'].map(size_mapping)
This allows you to treat the input as numerical data while preserving the hierarchy.
As for the gender, you are confusing the class distribution with the preprocessing. If you feed that distribution in as an input, you will introduce a bias into your data. You must treat male and female as two distinct categories regardless of their distribution in the data. You should map them to two different numbers, but without encoding the proportions.
df['Gender_num'] = df['Gender'].map({'M': 0, 'F': 1})
For a more detailed explanation covering more cases than your question, I suggest reading this article on categorical data in machine learning.
For the first question, if you have a small number of categories, you could map the column with a dictionary. In this way you can set an order:
d = {'L':2, 'M':1, 'S':0}
df['T-size'] = df['T-size'].map(d)
Output:
   T-size Gender  Label
0       2      M      1
1       2      M      1
2       1      F      1
3       0      F      0
4       1      M      1
5       2      M      0
6       0      F      1
7       0      F      0
8       1      M      1
For the second question, you can use the same method, but I would leave the two values for males and females as 0 and 1. If you only need the category and you don't have to do arithmetic with the values, one value is as good as another.
It might be overkill for the M/F example, since it's binary - but if you are ever concerned about mapping a categorical variable into a numerical form, then consider one-hot encoding. It basically stretches your single column containing n categories into n binary columns.
So a dataset of:
Gender
M
F
M
M
F
Would become
Gender_M Gender_F
1 0
0 1
1 0
1 0
0 1
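This is what pandas' get_dummies produces - a minimal sketch (note that get_dummies orders the new columns alphabetically, so the column order may differ from the table above):

import pandas as pd

genders = pd.DataFrame({'Gender': ['M', 'F', 'M', 'M', 'F']})
one_hot = pd.get_dummies(genders['Gender'], prefix='Gender')
#    Gender_F  Gender_M
# 0         0         1
# 1         1         0
# 2         0         1
# 3         0         1
# 4         1         0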
This takes away any notion of one thing being more "positive" than another - an absolute must for categorical data with more than 2 options, where there's no transitive A > B > C relationship and you don't want to smear your results by forcing one into your encoding scheme.
I am using a VAR model to forecast a multivariate time series with lag 2. I have three features and would like to forecast several timestamps forward. Instead of forecasting all three features, I actually know the values of two of the features and would like to forecast only one feature.
If I wanted to forecast all three features 5 timestamps ahead, I could do that as follows (this is a toy example):
import pandas as pd
from statsmodels.tsa.api import VAR
data = pd.DataFrame({'Date': ['1959-06-01', '1959-06-02', '1959-06-03', '1959-06-04'],
                     'a': [1, 2, 3, 5], 'b': [2, 3, 5, 8], 'c': [3, 4, 7, 11]})
data.set_index('Date', inplace=True)
model = VAR(data)
results = model.fit(2)
results.forecast(data.values[-2:], 5)
Note that data is
            a  b   c
Date
1959-06-01  1  2   3
1959-06-02  2  3   4
1959-06-03  3  5   7
1959-06-04  5  8  11
And the forecast gives me
array([[ 8.01388889, 12.90277778, 17.79166667],
[ 12.93113426, 20.67650463, 28.421875 ],
[ 20.73343461, 33.12405961, 45.51468461],
[ 33.22366195, 52.98948789, 72.75531383],
[ 53.15895736, 84.72805652, 116.29715569]])
Let's say I knew that the next 5 values for a should have actually been 8,13,21,34,55 and the next 5 values for b should have been 13,21,34,55,89. Is there a way to incorporate that into the model in statsmodels.tsa (or any other python package) to forecast only the 5 values of c? I know that R has such an option, by incorporating "hard" conditions into cpredict.VAR, but I was wondering if this can be done in python as well.
The above is a toy example. In reality I have several dozens of features, but I still know all of them and would like predict only one of them using VAR model.
I had a similar issue when solving this problem. This is a makeshift method to accomplish what you are asking.
# model_fit is the fitted VARResults object (results in the toy example above);
# test is the held-out test set
# (in some statsmodels versions the .y attribute is called .endog instead)
prediction = model_fit.forecast(model_fit.y, steps=len(test))
predictions = prediction[:, 0]
The 0 in prediction[:, 0] refers to the column that contains the desired forecast value.
I am a beginner in machine learning. I am confused about how to combine different features of a data set into one single feature.
For example, I have a data set of features in a Python pandas DataFrame like this:
movie       unknown  action  adventure  animation  fantasy  horror  romance  sci-fi
Toy Story         0       1          1          0        1       0        0       1
Golden Eye        0       1          0          0        0       0        1       0
Four Rooms        1       0          0          0        0       0        0       0
Get Shorty        0       0          0          1        1       0        1       0
Copy Cat          0       0          1          0        0       1        0       0
I would like to convert these n features into one single feature named "movie_genre". One solution would be to assign an integer value to each genre (unknown = 0, action = 1, adventure = 2, etc.) and create a data frame like this:
movie       genre
Toy Story   1,2,4,7
Golden Eye  1,6
Four Rooms  0
Get Shorty  3,4,6
Copy Cat    2,5
But in this case the entries in the column will no longer be a single integer/float value. Will that affect my future steps in the machine learning process, like fitting a model and evaluating algorithms?
Convert each series of zeros and ones into an 8-bit number:
Toy Story = 01101001
which is 105 in decimal.
Similarly, Golden Eye = 01000010, which is 66.
You can do the rest manually here: http://www.binaryhexconverter.com/binary-to-decimal-converter
It's relatively straightforward to do programmatically - just loop through each label, assign it the appropriate power of two, then sum them up.
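A small sketch of doing that programmatically with pandas, assuming df is the question's DataFrame and the genre columns are named exactly as shown there:

import pandas as pd

genres = ['unknown', 'action', 'adventure', 'animation',
          'fantasy', 'horror', 'romance', 'sci-fi']

# treat each row of 0/1 flags as the bits of one integer (leftmost column = highest bit)
weights = [2 ** (len(genres) - 1 - i) for i in range(len(genres))]
df['genre_code'] = (df[genres] * weights).sum(axis=1)
# Toy Story -> 0b01101001 -> 105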
It may be effective to leave them in their current multi-feature format and perform some sort of dimensionality reduction technique on that data.
This is very similar to a classic question: how do we treat categorical variables? One answer is one-hot or dummy encoding, which your original DataFrame is very similar to. With one-hot encoding, you start with a single, categorical feature. Using that feature, you make a column for each level, and assign a binary value to that column. The encoded result looks quite similar to what you are starting with. This sort of encoding is popular and many find it quite effective. Yours takes this one step further as each movie could be multiple genres. I'm not sure reversing that is a good idea.
Simply having more features is not always a bad thing if it is representing the data appropriately, and if you have enough observations. If you end up with a prohibitive number of features, there are many ways of reducing dimensionality. There is a wealth of knowledge on this topic out there, but one common technique is to apply principal component analysis (PCA) to a higher-dimensional dataset to find a lower-dimensional representation.
Since you're using python, you might want to check out what's available in scikit-learn for more ideas. A few resources in their documentation can be found here and here.
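As a rough illustration of the PCA idea mentioned above (the column names follow the example DataFrame; the choice of 3 components is arbitrary):

from sklearn.decomposition import PCA

genre_cols = ['unknown', 'action', 'adventure', 'animation',
              'fantasy', 'horror', 'romance', 'sci-fi']

pca = PCA(n_components=3)
reduced = pca.fit_transform(df[genre_cols].values)   # shape (n_movies, 3)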
One thing you can do is to make a matrix of all possible combinations and reshape it into a single vector. If you want to account for all combinations it will have the same length as the original. If there are combinations that you don't need simply don't take them into account. Your network is label-agnostic and it won't mind.
But why is that a problem? Your dataset looks small enough.
I have some data about a video game.
Data:
A matchup has a matchId. Each matchup includes two teams, and each team's size varies, for instance 3v3, 4v4, 5v5, ... The data, simplified, looks as follows:
matchId  playerId  teamId  victory
    100       200      14        1
    100       201      14        1
    100       212      14        1
    100       220      14        1
    100       202      15        0
    100       206      15        0
    100       214      15        0
    100       217      15        0
Task:
I would like to use a binary classifier in scikit-learn to predict the victory value (0/1) based on the players' features.
Questions:
I'm looking for a way to represent the features so that the classifier can detect which two teams played against each other, because the result of a matchup depends on the opposing team.
Later, I would like to see which players had more effect on the matchup result, and which skills are more effective for victory. Can I use the feature importances of a GradientBoostingClassifier for that?
One approach would be a sparse vector over all player ids, repeated twice (once per team), where the selected players are indicated with a nonzero value like 1.
If there are N players 0, ..., N-1, and team A consists of players 1, 3, 5 while team B consists of players 0, 2, 4, the input looks like the following:
x_sample_0 = [0, 1, 0, 1, 0, 1, 0, ..., 0,   1, 0, 1, 0, 1, 0, ..., 0]
              |---- team A (N slots) ----|   |---- team B (N slots) ----|
This should be a quite powerful representation (in regards to the information represented) of the task with two obvious drawbacks:
- the vector size will get large (linear in the number of players)
  - this does not matter if the classifier is able to work with sparse data structures (scipy.sparse), as the vector is mostly zeros
- there is the problem of symmetry, as each pairing A vs. B can also be encoded as B vs. A
  - at the cost of doubling the sample size, I would highly recommend adding the symmetric pairing as an extra sample
    - remember to flip the output (meaning: encode y as the win/loss from the first team's perspective)!
  - alternatively, enforcing a lexicographic ordering that always yields a well-defined order between the two teams (if players can't play against themselves) could be used (I would try the above first)
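A minimal sketch of building that two-block indicator encoding (N, the team lists, and the use of scipy.sparse here are just assumptions for illustration):

import numpy as np
from scipy import sparse

N = 6                        # total number of player ids (toy value)
team_a = [1, 3, 5]
team_b = [0, 2, 4]

x = np.zeros(2 * N)
x[team_a] = 1                          # first block: team A indicators
x[[N + p for p in team_b]] = 1         # second block: team B indicators

# symmetric duplicate with the teams swapped (flip y for this sample accordingly)
x_sym = np.concatenate([x[N:], x[:N]])

X = sparse.csr_matrix(np.vstack([x, x_sym]))   # sparse feature matrix for the classifier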
Edit:
One more alternative:
- introduce one variable for each possible team (e.g. f_0 = the team of players 0, 2, 4) and use this representation, whose vector size depends on the statistics (how many distinct teams appear in the data)
  - this loses some information and would need much more data