dependent variable One hot encoder - python

I am new to machine learning. My question is: do we need to encode the dependent variable y if it contains three classes labelled 1, 2, 3? In other words, is there any need to encode the dependent variable when it already contains numbers?

OneHotEncoder will create k columns if a variable has k classes.
For example, it will create 2 columns if the gender values in the dataset are Male/Female, and 3 columns if the gender values are Male/Female/PreferNotToSay.
You don't want multiple columns in your target y, so it is better to go with LabelEncoder (from sklearn.preprocessing) or some other mechanism that keeps the dimensionality intact.
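A minimal sketch of encoding a three-class target with LabelEncoder (the example labels are taken from the question):

from sklearn.preprocessing import LabelEncoder

y = [1, 2, 3, 2, 1]              # three class labels, as in the question
le = LabelEncoder()
y_encoded = le.fit_transform(y)  # maps them to 0, 1, 2 in a single column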

It is not clear to me what the dependent variable is in your case.
If you are talking about 'y', the output, then there is no need for one-hot encoding.
If you instead mean a feature column that is a combination of, or depends on, other columns: in machine learning one column usually has some relationship with another, so it is better to one-hot encode categorical variables.
Below is an example of what one hot encoding does:
Before:
name gender
a M
b F
c O
After:
name M F O
a 1 0 0
b 0 1 0
c 0 0 1
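For reference, pandas can produce the same expansion directly; a small sketch using the example above:

import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c"], "gender": ["M", "F", "O"]})
# 'gender' is replaced by binary columns M, F, O (in alphabetical order)
encoded = pd.get_dummies(df, columns=["gender"], prefix="", prefix_sep="")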

Related

Last member of each element of an id list of indices in relational dataset

Suppose I have two datasets in Python: households and people (individuals). A key or id (int64) connects a household with one or more individuals. I want to create a binary variable called "last_member" that takes a value of 0 if there are more individuals in the same household, and 1 if this individual is the last member of the household.
A trivial example would be the following:
last_member id ...
0 1 ...
0 1 ...
1 1 ...
1 2 ...
0 3 ...
1 3 ...
...
I can get the number of unique ids either from the households dataset or from the individuals dataset itself.
I get the feeling that either numpy's where function or pandas' aggregate functions are strong candidates for a solution. Still, I can't wrap my head around an efficient approach that doesn't involve, say, looping over the list of indices.
I coded a function that runs efficiently and solves the problem. The idea is to create the "last_member" variable full of zeros, compute the number of members per id using pandas' groupby, and then take the cumulative sum (minus 1, because of Python's zero-based indexing) to find the row indices where the value of "last_member" should change to 1.
import numpy as np

def create_last_member_variable(data):
    """Creates a last_member variable based on the index of the id variable.

    Assumes the rows are sorted by id and the frame has a default RangeIndex.
    """
    data["last_member"] = 0
    # number of members per household id
    n_members = data.groupby(["id"]).count()["last_member"]
    # row position of each household's last member = cumulative count - 1
    row_idx = np.cumsum(n_members) - 1
    data.loc[row_idx, "last_member"] = 1
    return data
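For what it's worth, a shorter pandas-only alternative (a sketch, assuming the same sorted layout and column names) marks the last occurrence of each id directly:

# True exactly on the last row of each id, then cast to 0/1
data["last_member"] = (~data["id"].duplicated(keep="last")).astype(int)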

How to transform rows into a list inside a pandas DataFrame

I have two related variables (columns): one represents the name of a person, the other counts the times this person worked out in a week. The problem is about visualizing that data.
When I look at the data it looks like this:
x name wrk
0 0 E 1
1 1 A 2
2 2 B 5
3 3 A 3
4 4 C 6
Now, a letter is repeated for every time that person appears in the "wrk" variable, but I just want to see that letter once, without repetitions. For example, when I take the mean for every person I see one letter and its mean of "wrk":
wrk
name
A 4.625000
B 5.142857
C 5.400000
D 3.833333
E 4.785714
I just want to see every value of wrk and only one letter per name, so I thought the solution is to transform wrk into a list so the output looks like this:
work
name
A 1:2:3:5:7:8:10
B 1:2:4:7:8
C 1:6:9
D 1:2:3:5:7:8:10
E 1:2:3:5:7:8:10
The thing is, I've searched how to do this but I haven't found code that helps me do it. Can someone help me?
(Sorry for my English, I'm learning.)
Perhaps this?
df['wrk'] = df['wrk'].astype(str)
df = df.groupby('name')[['wrk']].agg(':'.join)
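If you would rather have actual Python lists than a joined string, a variation on the same groupby (same assumed column names; skip the astype(str) step if you want numbers in the lists) would be:

df = df.groupby('name')['wrk'].agg(list)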

How to handle categorical data for preprocessing in Machine Learning

This may be a basic question. I have categorical data and I want to feed it into my machine learning model, but the model accepts only numerical data. What is the correct way to convert this categorical data into numerical data?
My Sample DF:
T-size Gender Label
0 L M 1
1 L M 1
2 M F 1
3 S F 0
4 M M 1
5 L M 0
6 S F 1
7 S F 0
8 M M 1
I know the following code converts my categorical data into numerical data.
Type-1:
df['T-size'] = df['T-size'].cat.codes
The line above simply maps the categories to codes 0 to N-1; it doesn't capture any relationship between them.
For this example I know that S < M < L. What should I do when I want to convert data like this?
Type-2:
In this type there is no ordering between M and F, but I can tell that M has a higher probability than F of the sample being 1, i.e., (samples with Label 1) / (total number of samples):
for Male, 4/5
for Female, 2/4
We know that 4/5 > 2/4.
How should I replace the values in this kind of column? Can I replace M with 4/5 and F with 2/4 for this problem?
What is the proper way of dealing with such a column?
Help me understand this better.
There are many ways to encode categorical data, and some of them depend on exactly what you plan to do with it. For example, one-hot encoding, easily the most popular choice, is an extremely poor choice if you're planning to use a decision tree / random forest / GBM.
Regarding your t-shirts above, you can give a pandas categorical type an order:
df['T-size'] = df['T-size'].astype(pd.api.types.CategoricalDtype(['S', 'M', 'L'], ordered=True))
If you set up your t-shirt categorical like that, then your .cat.codes method would work perfectly. It also means you can easily use scikit-learn's OrdinalEncoder, which fits neatly into pipelines.
Regarding your encoding of gender, you need to be very careful when using your target variable (your Label). You don't want to do this encoding before your train-test split, otherwise you're using knowledge of your unseen data, making it not truly unseen. This gets even more complicated if you're using cross-validation, as you'll need to redo the encoding within each CV iteration (i.e. a new encoding per fold). If you want to do this, I recommend you check out TargetEncoder from the scikit-learn-contrib category_encoders package, but again, be sure to use it within an sklearn Pipeline or you will mess up the train-test splits and leak information from your test set into your training set.
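A minimal sketch of that idea, assuming the category_encoders package is installed and the DataFrame from the question (the logistic-regression model is just a placeholder):

from category_encoders import TargetEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

X = df[['Gender']]
y = df['Label']
# the encoder is fit inside the pipeline, so cross-validation refits it per fold
pipe = Pipeline([('encode', TargetEncoder(cols=['Gender'])),
                 ('model', LogisticRegression())])
pipe.fit(X, y)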
If you want a hierarchy in your size parameter, you may consider using a linear mapping for it. That would be:
size_mapping = {"S": 1, "M": 2, "L": 3}
# apply the mapping to the DataFrame
df['T-size_num'] = df['T-size'].map(size_mapping)
This allows you to treat the input as numerical data while preserving the hierarchy.
And as for the gender, you are conflating the class distribution with the preprocessing. If you feed the distribution in as an input, you will introduce a bias into your data. You must treat Male and Female as two distinct categories regardless of their observed distribution, so map them to two different numbers without introducing proportions.
df['Gender_num'] = df['Gender'].map({'M':0 , 'F':1})
For a more detailed explanation covering more specifics than your question, I suggest reading this article on categorical data in machine learning.
For the first question: if you have a small number of categories, you can map the column with a dictionary. This way you can set an order:
d = {'L':2, 'M':1, 'S':0}
df['T-size'] = df['T-size'].map(d)
Output:
T-size Gender Label
0 2 M 1
1 2 M 1
2 1 F 1
3 0 F 0
4 1 M 1
5 2 M 0
6 0 F 1
7 0 F 0
8 1 M 1
For the second question you can use the same method, but I would keep the two values for males and females as 0 and 1. If you need just the category and you don't have to perform operations on the values, one value is as good as another.
It might be overkill for the M/F example, since it's binary, but if you are ever concerned about mapping a categorical variable into a numerical form, consider one-hot encoding. It basically stretches your single column containing n categories into n binary columns.
So a dataset of:
Gender
M
F
M
M
F
Would become
Gender_M Gender_F
1 0
0 1
1 0
1 0
0 1
This takes away any notion of one thing being more "positive" than another - an absolute must for categorical data with more than 2 options, where there's no transitive A > B > C relationship and you don't want to smear your results by forcing one into your encoding scheme.
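A brief sketch with scikit-learn's OneHotEncoder (note: on scikit-learn versions before 1.2 the keyword is sparse rather than sparse_output):

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(sparse_output=False)
encoded = enc.fit_transform([["M"], ["F"], ["M"], ["M"], ["F"]])
# column order follows enc.categories_, here ['F', 'M']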

How to interpret the output of H2O .predict method for random forest classification?

When I use the predict method on my trained model I get an output that is 1 row and 206 columns, with values ranging from 0 to 1. This sort of makes sense, as the model's output is a categorical variable with 0 and 1 as possible values. But I don't understand the 206 values; as I understand it, the output should be a single 0 or 1. What do the 206 values mean?
I've spent the past hour or so browsing the H2O documentation but can't find an explanation of the 206 values output by predict when I was expecting one value that is either a 0 or 1.
Thanks.
UPDATE AFTER YOUR COMMENT: The first column is the answer that your model is choosing. The remaining 205 columns are the prediction confidences for each of the 205 categories. (It implies whatever you are trying to predict is a factor (aka enum) column with 205 levels.) Those 205 columns should be summing to 1.0.
The column names should be a good clue: the first column is "predict", but the others are the labels of each of your 205 categories.
(Old answer, based on assuming it was 206 rows, 1 column!)
If predict is giving you a single column of output you have done a regression, not a classification.
This sort of makes sense as the model's output is categorical variable with values 0 and 1 as possible values.
H2O has seen those 0s and 1s and assumed they are numbers, not categories. To do a classification you simply need to change that column to be an enum (H2O's internal term for it), aka factor (the R/Python H2O API term for it). (Do this step immediately after loading your data into H2O, and before splitting it or making any models.)
E.g. if data is your H2O Frame, and answer is the name of the column with the 0 and 1 categories in it, you would do:
data["answer"] = data["answer"].asfactor()
If any of your other columns look numeric but should actually be treated as factors, you can do multiple columns at once like this:
factorsList = ["cat1", "cat2", "answer"]
data[factorsList] = data[factorsList].asfactor()
You can also set the column types at the time you import the data with the col_types argument.
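For instance, a sketch of setting the type at import time (the file path and column name here are assumptions):

import h2o
h2o.init()
data = h2o.import_file("my_data.csv", col_types={"answer": "enum"})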

Machine Learning: combining features into single feature

I am a beginner in machine learning and I am confused about how to combine different features of a data set into one single feature.
For example, I have a data set in Python Pandas data frame of features like this:
movie unknown action adventure animation fantasy horror romance sci-fi
Toy Story 0 1 1 0 1 0 0 1
Golden Eye 0 1 0 0 0 0 1 0
Four Rooms 1 0 0 0 0 0 0 0
Get Shorty 0 0 0 1 1 0 1 0
Copy Cat 0 0 1 0 0 1 0 0
I would like to convert these n features into one single feature named "movie_genre". One solution would be to assign an integer value to each genre (unknown = 0, action = 1, adventure = 2, etc.) and create a data frame like this:
movie genre
Toy Story 1,2,4,7
Golden Eye 1,6
Four Rooms 0
Get Shorty 3,4,6
Copy Cat 2,5
But in this case the entries in the column will no longer be integer/float values. Will that affect my future steps in the machine learning process, like fitting models and evaluating algorithms?
Convert each series of zeros and ones into an 8-bit number:
toy story = 01101001
in binary, that's 105
Similarly, Golden Eye = 01000010, which is 66.
You can do the rest manually here: http://www.binaryhexconverter.com/binary-to-decimal-converter
It's relatively straightforward to do programmatically: just loop through each label, assign it the appropriate power of two, then sum them up.
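A small sketch of that packing in pandas (genre column names taken from the question's table; df is the assumed DataFrame):

genre_cols = ["unknown", "action", "adventure", "animation",
              "fantasy", "horror", "romance", "sci-fi"]
# join each row's 0/1 flags into a bit string, then parse it as base-2
df["genre_code"] = df[genre_cols].astype(str).agg("".join, axis=1).apply(lambda bits: int(bits, 2))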
It may be effective to leave them in their current multi-feature format and perform some sort of dimensionality reduction technique on that data.
This is very similar to a classic question: how do we treat categorical variables? One answer is one-hot or dummy encoding, which your original DataFrame is very similar to. With one-hot encoding, you start with a single, categorical feature. Using that feature, you make a column for each level, and assign a binary value to that column. The encoded result looks quite similar to what you are starting with. This sort of encoding is popular and many find it quite effective. Yours takes this one step further as each movie could be multiple genres. I'm not sure reversing that is a good idea.
Simply having more features is not always a bad thing if it is representing the data appropriately, and if you have enough observations. If you end up with a prohibitive number of features, there are many ways of reducing dimensionality. There is a wealth of knowledge on this topic out there, but one common technique is to apply principal component analysis (PCA) to a higher-dimensional dataset to find a lower-dimensional representation.
Since you're using python, you might want to check out what's available in scikit-learn for more ideas. A few resources in their documentation can be found here and here.
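As one illustration of that route, a hedged sketch assuming the genre flags sit in a DataFrame df like the one above:

from sklearn.decomposition import PCA

genre_cols = ["unknown", "action", "adventure", "animation",
              "fantasy", "horror", "romance", "sci-fi"]
pca = PCA(n_components=2)                    # pick a target dimensionality
reduced = pca.fit_transform(df[genre_cols])  # each movie becomes a 2-D point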
One thing you can do is make a matrix of all possible combinations and reshape it into a single vector. If you want to account for all combinations, it will have the same length as the original. If there are combinations you don't need, simply leave them out; your network is label-agnostic and won't mind.
But why is this a problem in the first place? Your dataset looks small enough.
