In my Pandas DataFrame I have lots of boolean features (True/False). Pandas correctly represents them as bool when I check df.dtypes. If I pass my DataFrame to H2O (h2o.H2OFrame(df)), the boolean features are represented as enum, so they are interpreted as categorical features with 2 categories.
Is there a way to change the type of these features from enum to bool? In Pandas I can use df.astype('bool'); is there an equivalent in H2O?
One idea was to encode True/False as their numeric representation (1/0) before converting df to an H2OFrame, but then H2O recognises the columns as int64.
Thanks in advance for any help!
The enum type is used for categorical variables with two or more categories, so it covers boolean as well. In other words, there is no distinct bool type in H2O, and there is nothing you need to fix here.
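For illustration, a minimal sketch of what the conversion looks like (this assumes a local H2O cluster can be started; the column names and values are made up):
import pandas as pd
import h2o

h2o.init()
df = pd.DataFrame({"is_big": [True, False, True], "x": [1.0, 2.0, 3.0]})
hf = h2o.H2OFrame(df)
print(hf.types)  # e.g. {'is_big': 'enum', 'x': 'real'} -- enum is simply H2O's categorical type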
By the way, if you have a lot of boolean features because you have manually done one-hot encoding, don't do that. Instead, give H2O the original (multi-level categorical) data and it will do one-hot encoding when needed, behind the scenes. This is better because algorithms like decision trees can use multi-level categorical data directly, so it will be more efficient.
See http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/categorical_encoding.html for some alternatives you can try. The missing category is added for when that column is missing in production.
(But "What happens when you try to predict on a categorical level not seen during training?" at http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/deep-learning.html#faq does not seem to describe the behaviour you see?)
Also see http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/use_all_factor_levels.html (I cannot work out from that description if you want it to be true or false, so try both ways!)
UPDATE: set use_all_factor_levels = F and it will only create one input neuron (plus the NA one) for each boolean input, instead of two. If your categorical inputs are almost all boolean types, I'd recommend setting this. If your categorical inputs mostly have quite a lot of levels, I wouldn't (because, overall, it won't make much difference in the number of input neurons, but it might make the network easier to train).
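A minimal sketch of where that flag goes, assuming H2O's deep learning estimator (the frame and column names are placeholders):
from h2o.estimators import H2ODeepLearningEstimator

# one input neuron per boolean column (plus the NA neuron), instead of two
model = H2ODeepLearningEstimator(use_all_factor_levels=False)
# model.train(x=["isBig", "isRed"], y="response", training_frame=hf)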
WHY MISSING(NA)?
If I have a boolean input, e.g. "isBig", there will be 3 input neurons created for it. If you look at varimp() you can see they are named:
isBig.1
isBig.0
isBig.missing(NA)
Imagine you now put it into production, and the user does not give a value (or gives an NA, or gives an illegal value such as "2") for the isBig input. This is when the NA input neuron gets fired, to signify that we don't know if it is big or not.
To be honest, I think this cannot be any more useful than firing both the .0 and the .1 neurons, or firing neither of them. But if you are using use_all_factor_levels=F then it is useful. Otherwise all NA data gets treated as "not-big" rather than "could be big or not-big".
Related
For instance, I have thousands of rows with a column 'cow_id', where each cow ID appears in several rows. I want to replace those IDs with numbers starting from 1, just to make them easier to remember.
df['cow_id'].unique().tolist()
resulting in:
[5603, 5606, 5619, 4330, 5587, 4967, 5554, 4879, 4151, 5501, 4723, 4908, 3963, 4023, 4573, 3986, 5668, 4882, 5645, 5548]
How do I change each unique ID into a new number, such as:
5603 -> 1
5606 -> 2
Try groupby with ngroup:
df.groupby('cow_id').ngroup()+1
Or try pd.factorize:
pd.factorize(df['cow_id'])[0]+1
As stated in the documentation, pd.factorize encodes the object as an enumerated type or categorical variable.
Note that pd.factorize returns two values (the codes and the unique values), which is why the [0] is needed above.
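A minimal sketch with made-up data showing both approaches (note they may number the IDs in a different order):
import pandas as pd

df = pd.DataFrame({"cow_id": [5603, 5606, 5603, 5619, 5606]})

# groupby().ngroup() numbers the groups in sorted cow_id order
df["code_groupby"] = df.groupby("cow_id").ngroup() + 1

# pd.factorize numbers the values in order of first appearance
codes, uniques = pd.factorize(df["cow_id"])
df["code_factorize"] = codes + 1

print(df)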
What you are looking for is called categorical encoding.
The sklearn library in Python has many preprocessing methods, and LabelEncoder should do the job for you. Refer to this link:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder
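A minimal sketch of LabelEncoder applied to the cow_id column (the data is made up; LabelEncoder numbers classes from 0, so add 1 if you want to start at 1):
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"cow_id": [5603, 5606, 5603, 5619]})

le = LabelEncoder()
df["cow_code"] = le.fit_transform(df["cow_id"]) + 1  # 0-based codes, shifted to start at 1

# le.classes_ keeps the original IDs, so you can map codes back with inverse_transform
print(le.inverse_transform(df["cow_code"] - 1))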
Also keep in mind that using encodings like these might introduce some bias into your dataset, as some algorithms can treat one label as greater than another, i.e., 1 > 2 > ... > 54.
Refer to this blog to learn more about encodings and when to use which:
https://towardsdatascience.com/encoding-categorical-features-21a2651a065c
Let me know if you have any questions.
Here is the result using pandas.Categorical. The benefit is that you keep the original data and can flip back and forth. Here I create a variable called "c" that holds both the original categories and the new codes:
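A minimal sketch of what that looks like (the data is made up; "c" holds the categories and codes as described above):
import pandas as pd

df = pd.DataFrame({"cow_id": [5603, 5606, 5603, 5619]})

c = pd.Categorical(df["cow_id"])
df["cow_code"] = c.codes + 1                          # new 1-based codes

# flip back to the original IDs whenever needed
original_ids = c.categories[df["cow_code"].to_numpy() - 1]
print(list(original_ids))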
It looks like TensorForest, the Random Forest implementation inside TensorFlow, somehow supports categorical features as input (without one-hot encoding).
See
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/tensor_forest/python/ops/data_ops.py#L32
https://github.com/tensorflow/tensorflow/issues/4025#issuecomment-242814047
However it's not clear how to use them.
If you look at this example
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/random_forest_mnist.py
the 'x' parameter at line 65
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/random_forest_mnist.py#L65
must be a float array.
How could I pass categorical features (e.g. strings)?
When using the SKCompat wrapper to the estimator, the 'x' and 'y' parameters do need to be floats (because with that interface you can only pass in one object). However, using the estimator's input function interface (input_fn=...) that most examples use, the feature dictionary that input_fn returns can be a mix of float, int, and string Tensors. Floats are treated as continuous; ints and strings are treated as categorical (creating x[i] == T decision nodes instead of x[i] <= T), and no one-hot encoding is needed. So you need to create an input function that returns batches of data, which is essentially what the SKCompat interface does for you.
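A rough sketch of that input_fn style (the feature names, data, and estimator wiring are made up, and this follows the old tf.contrib-era TF 1.x interface the question refers to, so treat it as illustrative rather than exact):
import tensorflow as tf

def input_fn():
    # a dict mixing float, int and string tensors -- no one-hot encoding needed
    features = {
        "weight": tf.constant([72.5, 81.0, 65.3], dtype=tf.float32),  # float -> continuous split
        "color":  tf.constant(["red", "blue", "red"]),                 # string -> categorical split
        "rooms":  tf.constant([3, 5, 2], dtype=tf.int64),              # int -> categorical split
    }
    labels = tf.constant([0, 1, 1], dtype=tf.int64)
    return features, labels

# estimator.fit(input_fn=input_fn)   # instead of fit(x=..., y=...) through SKCompat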
I am working on a dataset which contains missing values in certain columns. I am trying to use XGBRegressor from the Scikit-Learn wrapper interface for XGBoost. It provides a parameter called 'missing', in which you can enter a float value; otherwise it uses Python's NaN as the default. I need help understanding how to use this parameter to handle missing values in the columns of my dataset. A simple example would be helpful as well.
The 'missing' parameter works like this: whatever value you provide for 'missing' is treated as a missing value. For example, if you provide 0.5 as the missing value, then wherever XGBoost finds 0.5 in your data it treats it as missing. The default is NaN. What XGBoost then does is define, based on the data, one branch of each split as the default path. For example, a split on one feature can go in two directions, left or right, and one of them is made the default based on the data. Whenever a missing value (here, 0.5) comes in as input for that feature, it takes the default path. Initially I thought it imputes the missing value, but it does not: it just defines one branch as the default and routes any missing value down that path. This is described in the paper XGBoost: A Scalable Tree Boosting System.
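A minimal sketch of that behaviour (the sentinel value -999.0 and the data are made up):
import numpy as np
from xgboost import XGBRegressor

X = np.array([[1.0, -999.0],
              [2.0,    3.5],
              [-999.0, 4.0]])
y = np.array([10.0, 20.0, 30.0])

# Every -999.0 in X is treated as missing: it is NOT imputed, it is simply routed
# down the learned default branch of each tree split.
model = XGBRegressor(missing=-999.0, n_estimators=10)
model.fit(X, y)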
It's my understanding that you got it mixed up.
The missing parameter only tells XGBoost which value (or list of values) to treat as missing (aka NaN) - the default is np.nan.
If you want to replace the actual missing values with some different value, let's say "X", you have to do it on your data before applying the model.
If you have a DataFrame "df" you can:
df.fillna(X)
If you have a np.array "array" you can:
np.nan_to_num(array)
But note that the above will replace the np.nan values with zeros.
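If you want a different placeholder instead of zero, a minimal sketch (the -999.0 value is arbitrary; newer NumPy versions also accept np.nan_to_num(array, nan=-999.0)):
import numpy as np

array = np.array([[1.0, np.nan],
                  [np.nan, 4.0]])

filled = np.where(np.isnan(array), -999.0, array)  # replace NaN with an arbitrary placeholder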
hope that helps,
I was learning logistic regression in python by comparing it with SAS.
Dataset: http://www.ats.ucla.edu/stat/data/binary.csv
Here, admit is the response variable and has categories 0 and 1.
SAS by default models the probability that ADMIT = 0, and if I specify the DESC option it models ADMIT = 1.
Ref: http://www.ats.ucla.edu/stat/sas/faq/logistic_descending.htm
Now in Python, using statsmodels, it models ADMIT = 1 by default. How can I make it model ADMIT = 0 (change the event of interest) so that I don't see a difference in the coefficients and predicted probabilities?
Thanks.
The only robust way is to create a new 0-1 dummy variable with 1 representing the desired level.
for example:
not_admit = (ADMIT == 0).astype(int)
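In full, a minimal sketch with statsmodels (the column names follow the linked UCLA binary.csv dataset; the formula itself is just one reasonable choice):
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("binary.csv")                     # columns: admit, gre, gpa, rank
df["not_admit"] = (df["admit"] == 0).astype(int)   # 1 = not admitted, matching SAS's default event

model = smf.logit("not_admit ~ gre + gpa + C(rank)", data=df).fit()
print(model.params)                                # coefficients now correspond to modelling ADMIT = 0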
"robust" here refers to current ambiguities in the interaction between pandas, patsy and statsmodels which might change a categorical variable if the dtype is not integer or float, e.g. string, boolean or object. This treatment of categorical dependent variables will have to change at some point in a backwards incompatible way to make it consistent between formula and non-formula versions.
There are some issues about this, for example
https://github.com/statsmodels/statsmodels/issues/2733
I am trying to understand the following case:
When I create a new xgboost DMatrix
xgX = xgb.DMatrix(X, label=Y, missing=np.nan)
based on input data X with 64 features,
I get a new DMatrix with 55 features.
What magic is happening here? Any advice would be great!
Take a look at
xgboost issue #1223
There, khotilov makes the comment:
The problem with CSR is that when you have completely sparse columns at the end, you cannot figure out that they exist by just looking at CSR's indices and pointers.
The consequence of this is that the function that creates the DMatrix from X, XGDMatrixCreateFromCSR, does not account for the empty columns at the end, which in your case amounts to 9 columns. You may want to check that in your case and determine whether or not you really have 64 features in X.
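A minimal sketch of how to check this (it assumes X is a SciPy CSR matrix; whether the trailing empty columns are dropped can depend on the xgboost version):
import numpy as np
import scipy.sparse as sp
import xgboost as xgb

dense = np.random.rand(100, 64)
dense[:, 55:] = 0.0                 # make the last 9 columns completely empty
X = sp.csr_matrix(dense)            # CSR stores no entries for those columns

print(X.shape)                      # (100, 64) -- the matrix itself knows its true width
dm = xgb.DMatrix(X, label=np.random.rand(100), missing=np.nan)
print(dm.num_col())                 # if this prints 55, the empty trailing columns were dropped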