One hot encode correlations and decision trees - python

I have a few questions about preparing data for learning.
I'm very confused about how to convert columns to categorical and binary columns when I want to use them for correlations and a decision tree classifier.
For example, in NBA_df, to convert the position column to a categorical column for use with a decision tree, can I use .astype('category').cat.codes? (I know that in basketball you can denote the position by a number from 1 to 5.)
NBA_df
And in students_df, why is it more correct to convert the 'gender', 'race/ethnicity', 'lunch', and 'test preparation course' columns to new binary columns with .get_dummies rather than doing the categorical conversion in the same column?
students_df
Is it the same for correlations and trees?

I'm not sure I totally understand what you mean by converting to categorical "in the same column", but I assume you mean replacing the categorical response from positions into numbers 1 through 5 and keeping those numbers in the same column.
Assuming this is what you meant, you have to think about how the computer will interpret the input. Is a Small Forward (position 3 in basketball) 3 times a Point Guard (1 * 3)? Of course not, but a computer will see it that way. It will determine relationships with the target that are not realistic. For this reason, you need separate columns with a binary indicator like .get_dummies is doing. That way, the computer will not see the positions as numeric values that can be operated on, but it will see the positions as separate entities.
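To make the difference concrete, here is a minimal sketch comparing the two encodings. The small nba_df frame and its values are made up for illustration; only the position column name comes from the question.

```python
import pandas as pd

# Hypothetical stand-in for NBA_df: only the "position" column matters here.
nba_df = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "position": ["PG", "SF", "C", "PG"],
})

# Integer codes kept in a single column: categories are sorted alphabetically
# (C=0, PG=1, SF=2), so the numbers carry an order that means nothing.
codes = nba_df["position"].astype("category").cat.codes

# One binary indicator column per position: no artificial ordering.
dummies = pd.get_dummies(nba_df, columns=["position"], dtype=int)

print(codes.tolist())
print(dummies)
```

With the integer codes, a correlation would treat SF (2) as "twice" PG (1); with the indicator columns, each position is simply present or absent.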

Related

Dataframe: handling data with different numbers of decimal places

I have a column in my dataset where the values are not consistent.
Some values have just one decimal place and others have four.
I need this column to calculate some means.
How should I handle this column?

how to cluster values of continuous time series

In the picture I plot the values from an array of shape (400,8)
I wish to reorganize the points in order to get 8 series of "continuous" points. Let's call them a(t), b(t), ..., h(t), with a(t) being the series with the smallest values and h(t) the series with the largest values. They are unknown and I am trying to obtain them.
I have some missing values replaced by 0.
When there is a 0, I do not know which series it belongs to. The zeros are always stored at the highest indices in the array.
For instance, at time t=136 I have only 4 valid values, so array[t,i] > 0 for i <= 3 and array[t,i] = 0 for i > 3.
How can I cluster the points so that I get "continuous" time series? That is, at time t=136, array[136,0] should go into d, array[136,1] should go into e, array[136,2] should go into f and array[136,3] should go into g.
I tried AgglomerativeClustering and DBSCAN with scikit-learn with no success.
Data are available at https://drive.google.com/file/d/1DKgx95FAqAIlabq77F9f-5vO-WPj7Puw/view?usp=sharing
My interpretation is that you mean that you have the data in 400 columns and 8 rows. The data values are assigned to the correct columns, but not necessarily to the correct rows. Your figure shows that the 8 signals do not cross each other, so you should be able to simply sort each column individually. But now the missing data is the problem, because the zeros representing missing data will all sort to the bottom rows, forcing the real data into the wrong rows.
I don't know if this is a good answer, but my first hunch is to start by sorting each column individually. Then begin in a place where there are several adjacent columns with full spans of real data, and work away from that location, first to the left and then to the right, one column at a time.
If the column contains no zeros, it is OK. If it contains zeros, compute local row averages of the immediately adjacent columns, using only non-zero values (the number of columns depends on the density of missing data and the resolution between the signals). Then put each valid value in the current column into the row with the closest 'local row average' value, and put zeros in the remaining rows.
How to code that depends on what you have done so far. If you are using numpy, it would be convenient to first convert the zeros to NaNs, because numpy.nanmean() will ignore the NaNs.
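A rough sketch of that idea follows, under several assumptions of mine: the array keeps the question's (time, series) layout, so "columns" in the description above are rows here; the function name assign_series and the window size are hypothetical; and the matching is a simple greedy pass that preserves value order, which is one possible reading of "closest local average" rather than the only one.

```python
import numpy as np

def assign_series(data, window=5):
    """Place each valid value into the series whose local average is closest.

    data: array of shape (n_times, n_series); zeros mark missing values.
    Returns an array of the same shape with NaN where a series has no value.
    """
    n_t, n_s = data.shape
    vals = np.where(data > 0, data, np.nan)
    vals = np.sort(vals, axis=1)                  # NaNs are sorted to the end
    full_rows = np.flatnonzero(~np.isnan(vals).any(axis=1))
    if full_rows.size == 0:
        raise ValueError("need at least one time step without missing values")
    start = int(full_rows[0])
    out = np.full((n_t, n_s), np.nan)
    out[start] = vals[start]

    for step in (1, -1):                          # sweep forward, then backward
        ref = out[start].copy()                   # current local averages
        t = start + step
        while 0 <= t < n_t:
            valid = vals[t][~np.isnan(vals[t])]
            prev = -1
            for j, v in enumerate(valid):
                # Keep the order and leave room for the values still to place.
                last = n_s - (valid.size - j)
                cand = np.arange(prev + 1, last + 1)
                prev = int(cand[np.argmin(np.abs(ref[cand] - v))])
                out[t, prev] = v
            # Update local averages from nearby assigned rows, ignoring NaNs.
            lo, hi = sorted((t, t - step * (window - 1)))
            block = out[max(lo, 0):min(hi, n_t - 1) + 1]
            has_data = ~np.isnan(block).all(axis=0)
            ref[has_data] = np.nanmean(block[:, has_data], axis=0)
            t += step
    return out

# Tiny made-up example with 3 series; at t=2 the lowest series is missing.
demo = np.array([[1.0, 2.0, 3.0],
                 [1.1, 2.1, 3.1],
                 [2.0, 3.0, 0.0],
                 [1.0, 2.2, 3.2]])
result = assign_series(demo)
print(result)
```

In the demo, the two valid values at t=2 end up in the second and third series, and the first series gets NaN at that time step. On real data the window size would need tuning to the density of missing values.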

Python - NaN values in df.corr

I am finishing a project and trying to check the correlation between some variables.
Basically, I have data on the survivors of an incident and I want to know the correlation between the other variables and survival.
So, I have the main df with all the information, then:
# Create one df for those who did not survive (0) and another for those who survived (1)
df_s0 = df.query("Survived == 0")
df_s1 = df.query("Survived == 1")
df_s0.corr()
Based on correlation formula:
cor(a,b) = cov(a,b)/(stdev(a) * stdev(b))
If either a or b is constant (zero variance), then the correlation between the two is not defined (the division by zero produces NaN).
In your example, the Survived column of df_s0 is constant (all zeros) and hence correlation is undefined for this column with other columns.
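A small sketch that reproduces the effect. The data is invented, and only the Survived column name is taken from the question; Age and Fare are assumed feature names.

```python
import pandas as pd

# Made-up data; "Age" and "Fare" are hypothetical feature columns.
df = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 1],
    "Age": [40, 35, 50, 22, 30, 28],
    "Fare": [7.0, 8.0, 7.5, 30.0, 50.0, 40.0],
})

df_s0 = df.query("Survived == 0")

# Survived is constant in the subset, so its standard deviation is zero ...
print(df_s0["Survived"].std())

# ... and every correlation involving it comes out as NaN.
corr_s0 = df_s0.corr()
print(corr_s0["Survived"])

# On the full frame, Survived varies, so the correlations are defined.
corr_all = df.corr()
print(corr_all["Survived"])
```

Computing the correlation on the full frame, rather than on each subset, is what gives a usable number for Survived.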
If you want to figure out the relationship between a discrete variable (Survived) and the rest of your features, you can look at the box plots (to be able to compare different statistics like mean, IQR,...) of your features across different groups of Survived 0 and 1. If you want to go a step further you can use ANOVA to characterize the importance of your features based on their variance within and across different groups!
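As a starting point for that kind of comparison, per-group summary statistics can be computed with a groupby. This is only a sketch on invented data; Age and Fare are assumed feature names.

```python
import pandas as pd

# Made-up data; "Age" and "Fare" are hypothetical feature columns.
df = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 1],
    "Age": [40, 35, 50, 22, 30, 28],
    "Fare": [7.0, 8.0, 7.5, 30.0, 50.0, 40.0],
})

# The same statistics a box plot would show, side by side for each group.
summary = df.groupby("Survived")[["Age", "Fare"]].agg(["mean", "median", "std"])
print(summary)

# With matplotlib installed, df.boxplot(column="Fare", by="Survived")
# draws the corresponding box plots.
```

Features whose group means differ by much more than their within-group spread are the ones an ANOVA would flag as informative.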

How to interpret the output of H2O .predict method for random forest classification?

When I use the predict method on my trained model, I get an output that is 1 row and 206 columns. It seems to have 206 values ranging from 0 to 1. This sort of makes sense as the model's output is categorical variable with values 0 and 1 as possible values. But I don't understand the 206 values; as I understand it, the output should be a single value of 0 or 1. What do the 206 values mean?
I've spent the past hour or so browsing the H2O documentation but can't find an explanation of the 206 values output by predict, when I was expecting a single value of either 0 or 1.
thanks.
UPDATE AFTER YOUR COMMENT: The first column is the answer that your model is choosing. The remaining 205 columns are the prediction confidences for each of the 205 categories. (It implies whatever you are trying to predict is a factor (aka enum) column with 205 levels.) Those 205 columns should be summing to 1.0.
The column names should be a good clue: the first column is "predict", but the others are the labels of each of your 205 categories.
(Old answer, based on assuming it was 206 rows, 1 column!)
If predict is giving you a single column of output you have done a regression, not a classification.
You wrote: "This sort of makes sense as the model's output is categorical variable with values 0 and 1 as possible values."
H2O has seen those 0s and 1s and assumed they are numbers, not categories. To do a classification you simply need to change that column to be an enum (H2O's internal term for it), aka factor (the R/Python H2O API term for it). (Do this step immediately after loading your data into H2O, and before splitting it or making any models.)
E.g. if data is your H2O Frame, and answer is the name of your column with the 0 and 1 categories in it, you would do:
data["answer"] = data["answer"].asfactor()
If any of your other columns look numeric but should actually be treated as factors, you can do multiple columns at once like this:
factorsList = ["cat1", "cat2", "answer"]
data[factorsList] = data[factorsList].asfactor()
You can also set the column types at the time you import the data with the col_types argument.

GradientBoostingClassifier and many columns

I use a GradientBoosting classifier to predict the gender of users. The data has a lot of predictors, and one of them is the country. For each country I have a binary column, and exactly one of the country columns is set to 1 in each row. But this representation is very slow computationally. Is there a correct way to represent the country columns with only one column?
You can replace the binary variable with the actual country name then collapse all of these columns into one column. Use LabelEncoder on this column to create a proper integer variable and you should be all set.
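A minimal sketch of that collapse, on invented data. The country_ prefix and the column names are assumptions. To keep the example pandas-only, the integer codes come from .cat.codes, which plays the same role as scikit-learn's LabelEncoder here (both number the sorted labels from 0).

```python
import pandas as pd

# Made-up frame with one binary column per country (hypothetical names).
df = pd.DataFrame({
    "country_DE": [0, 0, 1, 0],
    "country_FR": [1, 0, 0, 1],
    "country_US": [0, 1, 0, 0],
    "age": [30, 25, 41, 36],
})

country_cols = [c for c in df.columns if c.startswith("country_")]

# Each row has exactly one 1, so idxmax returns the name of that column.
df["country"] = (
    df[country_cols].idxmax(axis=1).str.replace("country_", "", regex=False)
)

# Integer-encode the single country column and drop the binary columns.
df["country_code"] = df["country"].astype("category").cat.codes
df = df.drop(columns=country_cols)
print(df)
```

The resulting country_code column can be passed to the classifier in place of the many binary columns.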
