How to use missing parameter of XGBRegressor of scikit-learn - python

I am working on a dataset which contains missing values in certain columns. I am trying to use XGBRegressor from the scikit-learn wrapper interface for XGBoost. It provides a parameter called 'missing', in which you can enter a float value; otherwise it takes Python's NaN as the default. I need help understanding how I can use this parameter to handle the missing values in the columns of my dataset. It would be helpful if someone could provide a simple example as well.

The 'missing' parameter tells XGBoost which value in your data to treat as missing. For example, if you pass 0.5 as the missing value, then wherever XGBoost finds 0.5 in your data it treats it as a missing value. The default is NaN. XGBoost does not impute these values. Instead, when it learns a split, it uses the data to choose one of the two branches (left or right) as the default direction; whenever the missing value (0.5 in this example) arrives at that split, the sample is routed down the default branch. Initially I thought it imputes the missing value, but it does not: it just learns a default path per split and sends missing values along it. This is described in the paper "XGBoost: A Scalable Tree Boosting System".
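For illustration, a minimal sketch of how the parameter is passed (the sentinel value -999.0 and the toy data are made up, not from the question):

import numpy as np
from xgboost import XGBRegressor

X = np.array([[1.0, -999.0],
              [2.0, 3.0],
              [-999.0, 4.0],
              [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# Every -999.0 in X is treated as missing; XGBoost learns a default branch
# for those samples at each split instead of imputing a value.
model = XGBRegressor(missing=-999.0, n_estimators=10)
model.fit(X, y)
print(model.predict(X))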

It's my understanding that you got it mixed up.
The missing parameter only declares which value in your data stands for a missing value (aka NaN); the default is np.nan.
If you want to replace the actual missing values with some different value, let's say "X", you have to do it on your data before applying the model.
If you have a DataFrame "df" you can:
df = df.fillna(X)
If you have a np.array "array" you can:
array = np.nan_to_num(array)
but note that nan_to_num will replace the np.nan with zeros.
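A runnable version of the two snippets above (the column name and fill value are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({'height': [1.2, np.nan, 3.4]})
df = df.fillna(-1)          # replace NaN with -1; fillna returns a new frame

arr = np.array([1.2, np.nan, 3.4])
arr = np.nan_to_num(arr)    # replaces NaN with 0.0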
Hope that helps,

Related

What does `C()` do in argument formula_like of function patsy.dmatrices()?

I am studying an example of calling function patsy.dmatrices().
The input argument formula_like contains the items C(sales) and C(salary), and the function translates these columns into dummy (indicator) variables depending on the specific input data. For example, C(salary) produces indicator columns such as C(salary)[T.low], C(salary)[T.medium], etc.
So, I wonder:
What is the terminology for C()? Should we call it a function or something else? I didn't find a clear description on the official documentation page, but I could have missed something.
What is the purpose of wrapping the column name with C()? I tried removing it, e.g. changing the item from C(salary) to plain salary, and the function still translates the column into dummy variables.
I am new to this area, and I highly appreciate any hints or suggestions.
y, X = dmatrices(
    formula_like=
        'left~satisfaction_level+last_evaluation+number_project+average_montly_hours'
        '+time_spend_company+Work_accident+promotion_last_5years+C(sales)+C(salary)',
    data=data,
    return_type='dataframe')
X.head()
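For reference, here is a tiny made-up example of the behaviour I am describing (the toy data is hypothetical, not my real dataset):

import pandas as pd
from patsy import dmatrices

toy = pd.DataFrame({'y': [1.0, 2.0, 3.0],
                    'salary': ['low', 'medium', 'high']})

# Both formulas give indicator columns for 'salary', which is why I am
# unsure what C() adds for a string column.
_, X1 = dmatrices('y ~ C(salary)', data=toy, return_type='dataframe')
_, X2 = dmatrices('y ~ salary', data=toy, return_type='dataframe')
print(X1.columns.tolist())
print(X2.columns.tolist())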

How to replace missing values in a column with ndarray/model prediction

I would like to use a model's predictions (let's say RandomForestRegressor) to replace the missing values in the column Age of a dataframe. I checked that the data type of the model prediction is numpy.ndarray.
Here’s what I do:
a = RandomForestRegressor()
a.fit(train_data, target)
result = a.predict(test_data)
df[df.Age.isna()].Age.iloc[:] = result
But it doesn't work and can't replace the NaN values. May I ask why?
I saw some people use the same method, and it works for them.
Do not use chained indexing. It is explicitly discouraged in the docs. The inconsistency you may be seeing may be linked to copy versus view discrepancies as described in the docs.
Instead, use a single pd.DataFrame.loc call:
df.loc[df['Age'].isna(), 'Age'] = result
See also Indexing and Selecting Data.
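A runnable sketch of the full pattern (the toy data and column names are made up, not from the question):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.DataFrame({'Fare': [10.0, 20.0, 30.0, 40.0],
                   'Age': [22.0, np.nan, 35.0, np.nan]})

known = df[df['Age'].notna()]
unknown = df[df['Age'].isna()]

a = RandomForestRegressor(n_estimators=10, random_state=0)
a.fit(known[['Fare']], known['Age'])
result = a.predict(unknown[['Fare']])

# A single .loc call assigns the predictions back in row order.
df.loc[df['Age'].isna(), 'Age'] = result
print(df)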

Convert Categorical Features (Enum) in H2o to Boolean

In my pandas DataFrame I have lots of boolean features (True/False). Pandas correctly represents them as bool if I do df.dtypes. If I pass my DataFrame to H2O (h2o.H2OFrame(df)), the boolean features are represented as enum, so they are interpreted as categorical features with 2 categories.
Is there a way to change the type of the features from enum to bool? In pandas I can use df.astype('bool'); is there an equivalent in H2O?
One idea was to encode True/False as their numeric representation (1/0) before converting df to an H2OFrame, but H2O then recognises this as int64.
Thanks in advance for any help!
The enum type is used for categorical variables with two or more categories. So it includes boolean. I.e. there is no distinct bool category in H2O, and there is nothing you need to fix here.
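If you do want the 1/0 numeric representation anyway, something like this should work (a sketch assuming the H2OFrame asnumeric() method; the column name is made up):

import h2o
h2o.init()

hf = h2o.H2OFrame(df)                      # boolean columns arrive as enum
hf['is_big'] = hf['is_big'].asnumeric()    # converted to a numeric 0/1 column
print(hf.types)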
By the way, if you have a lot of boolean features because you have manually done one-hot encoding, don't do that. Instead give H2O the original (multi-level categorical) data, and it will do one-hot encoding when needed, behind the scenes. This is better because algorithms like decision trees can use multi-level categorical data directly, so it will be more efficient.
See http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/categorical_encoding.html for some alternatives you can try. The missing category is added for when that column is missing in production.
(But "What happens when you try to predict on a categorical level not seen during training?" at http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/deep-learning.html#faq does not seem to describe the behaviour you see?)
Also see http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/use_all_factor_levels.html (I cannot work out from that description if you want it to be true or false, so try both ways!)
UPDATE: set use_all_factor_levels = F and it will only have one input neuron (plus the NA one) for each boolean input, instead of two. If your categorical inputs are almost all boolean types I'd recommend setting this. If your categorical inputs mostly have quite a lot of levels I wouldn't (because, overall, it won't make much difference in the number of input neurons, but it might make the network easier to train).
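A sketch of where that flag goes in the Python API (assuming H2ODeepLearningEstimator exposes use_all_factor_levels; the predictor list, response name and frame hf are placeholders):

from h2o.estimators import H2ODeepLearningEstimator

model = H2ODeepLearningEstimator(use_all_factor_levels=False, epochs=5)
model.train(x=predictor_columns, y='target', training_frame=hf)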
WHY MISSING(NA)?
If I have a boolean input, e.g. "isBig", there will be 3 input neurons created for it. If you look at varimp() you can see they are named:
isBig.1
isBig.0
isBig.missing(NA)
Imagine you now put it into production, and the user does not give a value (or gives an NA, or gives an illegal value such as "2") for the isBig input. This is when the NA input neuron gets fired, to signify that we don't know if it is big or not.
To be honest, I think this cannot be any more useful than firing both the .0 and the .1 neurons, or firing neither of them. But if you are using use_all_factor_levels=F then it is useful. Otherwise all NA data gets treated as "not-big" rather than "could be big or not-big".

NetCDF variables that have the fill value/ missing value

I have a variable in a NetCDF file that is written with a default fill value wherever the variable has no data. How do you remove this value or change it to 0 when the variable is missing a value?
It sounds like the problem is that when the variable is populated into the NetCDF file, it is set to insert some default value for values that are missing. Now, I am assuming that you need to remove these default values after the file has been written and you are working with the data.
So (depending on how you are accessing the variable) I would pull the variable out of the NetCDF file and assign it to a python variable. This is the first method that comes to mind.
Use a for loop to step through and replace that default value with 0
variable = NetCDF_variable  # assume the default (fill) value is 1e10
cleaned_list = []
for i in variable:
    if i == 1e10:
        cleaned_list.append(0)  # 0 or whatever you want to fill in here
    else:
        cleaned_list.append(i)
If the default value is a float, you may want to look into numpy.isclose if the above code isn't working. You might also be interested in masking your data in case any computations you do would be thrown off by inserting a 0.
EDIT: User N1B4 provided a much cleaner and efficient way of doing the exact same thing as above.
variable[variable == 1e10] = 0
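Put together, a minimal sketch of the whole workflow with the netCDF4 package (the file name, variable name and 1e10 fill value are made up):

import numpy as np
from netCDF4 import Dataset

nc = Dataset('data.nc')                       # hypothetical file
variable = np.array(nc.variables['temp'][:])  # hypothetical variable name

# Vectorised replacement of the fill value; np.isclose is safer for floats.
variable[np.isclose(variable, 1e10)] = 0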

Specifying which category to treat as the base with 'statsmodels'

I understand that when I have a categorical variable in a model passed to a statsmodels fit, dummy variables will automatically be generated for the categories. For example, if I have a variable 'Location' with values 'IndianOcean', 'Thailand', 'China' and 'Mars', I will get variables in my model of the form
Location[T.Thailand]
with one of the values not represented. By default the excluded value seems to be the least common one. Is there a way to specify (ideally within the model specification) which value is treated as the "base value" and excluded?
You can pass a reference argument to the Treatment contrast, using syntax like
"y ~ C(Location, Treatment(reference='China'))"
http://patsy.readthedocs.org/en/latest/API-reference.html#patsy.Treatment
If you have a better suggestion for naming conventions please file an issue with patsy.
If you use single quotes to wrap your string, the reference argument needs to be wrapped in double quotes. It is a very easy mistake to make; I was using single quotes on both.
For example:
'y ~ C(Location, Treatment(reference="China"))'
is correct.
'y ~ C(Location, Treatment(reference='China'))'
is not correct.
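A runnable sketch putting it together with statsmodels' formula API (the toy data is made up):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    'y': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'Location': ['China', 'Thailand', 'Mars', 'China', 'Thailand', 'Mars'],
})

# 'China' becomes the base level; the fit reports coefficients only for the
# remaining levels, e.g. C(Location, Treatment(reference="China"))[T.Mars].
model = smf.ols('y ~ C(Location, Treatment(reference="China"))', data=df).fit()
print(model.params)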
OK, maybe someone will find this one helpful. I needed to set a new baseline category for the dependent variable, and I had no idea how to do it. I searched and found nothing, so I simply added a "_" prefix to the other categories. If you have 3 categories A, B, C, and you want your baseline to be C, you just change the labels of A and B to _A and _B. It works. It appears that the baseline category is chosen by sorted().
Maybe someone knows a proper way to do it; this is not very Pythonic, ha.
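For what it's worth, a sketch of that relabeling trick (the series is made up):

import pandas as pd

s = pd.Series(['A', 'B', 'C', 'A'])
s = s.replace({'A': '_A', 'B': '_B'})
# sorted(['_A', '_B', 'C']) == ['C', '_A', '_B'], so 'C' now sorts first
# and gets picked as the baseline category.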
