I would like to use a model's predictions (let's say RandomForestRegressor) to replace the missing values in the Age column of a DataFrame. I checked that the data type of the model prediction is numpy.ndarray.
Here’s what I do:
a = RandomForestRegressor()
a.fit(train_data, target)
result = a.predict(test_data)
df[df.Age.isna()].Age.iloc[:] = result
But it doesn't work: the NaN values are not replaced. May I ask why?
I have seen other people use the same method, and it works for them.
Do not use chained indexing. It is explicitly discouraged in the docs. The inconsistency you are seeing is likely due to the copy-versus-view discrepancies described there.
Instead, use a single pd.DataFrame.loc call:
df.loc[df['Age'].isna(), 'Age'] = result
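Putting it all together, here is a minimal, self-contained sketch of the whole workflow; the Fare column, the toy values, and the single-feature model are hypothetical stand-ins for your own training setup:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data: Age is missing in two rows
df = pd.DataFrame({
    'Fare': [7.25, 71.28, 8.05, 53.10, 8.46],
    'Age':  [22.0, np.nan, 26.0, np.nan, 35.0],
})

missing = df['Age'].isna()
features = ['Fare']  # assumed predictor columns

model = RandomForestRegressor(random_state=0)
model.fit(df.loc[~missing, features], df.loc[~missing, 'Age'])
result = model.predict(df.loc[missing, features])

# A single .loc call assigns into the original DataFrame, not a copy
df.loc[missing, 'Age'] = result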
See also Indexing and Selecting Data.
Related
For instance, I have thousands of rows with a column 'cow_id', where each cow ID appears in several rows. I want to replace those IDs with numbers starting from 1, just to make them easier to remember.
df['cow_id'].unique().tolist()
resulting in:
[5603, 5606, 5619, 4330, 5587, 4967, 5554, 4879, 4151, 5501, 4723, 4908, 3963, 4023, 4573, 3986, 5668, 4882, 5645, 5548]
How do I change each unique ID into a new number, such as:
5603 -> 1
5606 -> 2
Take a look at
df.groupby('cow_id').ngroup()+1
Or try pd.factorize:
pd.factorize(df['cow_id'])[0]+1
As the documentation says, pd.factorize encodes the object as an enumerated type or categorical variable.
Note that pd.factorize returns two values (the codes and the unique values), which is why only the first element, [0], is used above.
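For illustration, a small self-contained sketch (with toy cow IDs) showing both approaches side by side:
import pandas as pd

df = pd.DataFrame({'cow_id': [5603, 5606, 5603, 5619, 5606]})

# Option 1: groupby + ngroup numbers the groups in sorted-key order
df['new_id'] = df.groupby('cow_id').ngroup() + 1

# Option 2: factorize numbers the groups in order of first appearance;
# it returns both the integer codes and the unique values
codes, uniques = pd.factorize(df['cow_id'])
df['new_id2'] = codes + 1

print(df)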
What you are looking for is called categorical encoding.
The sklearn library in Python has many preprocessing methods, out of which LabelEncoder should do the job for you. See this link:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder
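A minimal sketch of what that could look like (the toy DataFrame and column names are hypothetical; note that LabelEncoder assigns 0-based codes in sorted order):
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'cow_id': [5603, 5606, 5603, 5619]})

le = LabelEncoder()
df['cow_id_new'] = le.fit_transform(df['cow_id']) + 1  # shift codes to start at 1

# The mapping is reversible:
df['cow_id_back'] = le.inverse_transform(df['cow_id_new'] - 1)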
Also keep in mind that encodings like these might introduce some bias into your dataset, as some algorithms can consider one label greater than another, i.e., 1 > 2 > ... > 54.
Refer to this blog post to learn more about encodings and when to use which:
https://towardsdatascience.com/encoding-categorical-features-21a2651a065c
Let me know if you have any questions.
Here is the result using pandas.Categorical. The benefit is that you keep the original data and can flip back and forth. Here I create a variable called "c" that holds both the original categories and the new codes.
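The code itself was not included above; here is a minimal sketch of what it likely looked like (the toy cow IDs are hypothetical):
import pandas as pd

df = pd.DataFrame({'cow_id': [5603, 5606, 5603, 5619]})
c = pd.Categorical(df['cow_id'])

# The new codes (0-based; add 1 if you want them to start at 1)
df['code'] = c.codes + 1

# Flip back: recover the original IDs from the codes
df['cow_id_back'] = c.categories[(df['code'] - 1).to_numpy()]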
I need to filter the DataFrame with pandas' str.contains() function.
However, I want to pass in a list, target, that can later be customized by the user, rather than a fixed string. Is there a way to do that?
I have tried df.filter(like=...), but it would not work for me due to its complete-fit nature.
target = ['food', 'tasty', 'avocado', 'mint']
df1 = df[df['text'].str.contains('food')]
You can use df.isin() (docs):
target = {'food', 'tasty', 'avocado', 'mint'}
df1 = df[df['text'].isin(target)]
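Note that isin() matches whole cell values, not substrings. A small self-contained sketch (toy data), including a substring-matching variant since the question asked about str.contains:
import pandas as pd

df = pd.DataFrame({'text': ['food', 'tasty snack', 'avocado', 'bread']})
target = {'food', 'tasty', 'avocado', 'mint'}

# Exact matching: keeps 'food' and 'avocado' only
df1 = df[df['text'].isin(target)]

# Substring matching: str.contains accepts a regex, so join the targets with '|'
df2 = df[df['text'].str.contains('|'.join(target))]  # also keeps 'tasty snack'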
With the code below I'm trying to update the column df_test['placed'] to 1 when the if statement is triggered and a prediction is placed. I haven't been able to get this to update correctly, though: the code runs, but the value never updates to 1 for the respective predictions placed.
df_test['placed'] = np.zeros(len(df_test))
for i in set(df_test['id']):
    mask = df_test['id'] == i
    predictions = lm.predict(X_test[mask])
    j = np.argmax(predictions)
    if predictions[j] > 0:
        df_test['placed'][mask][j] = 1
        print(df_test['placed'][mask][j])
Answering your question
Edit: changed suggestion based on comments
The assignment part of your code, df_test['placed'][mask][j] = 1, uses what is called chained indexing. In short, your assignment only changes a temporary copy of the DataFrame that gets immediately thrown away, and never changes the original DataFrame.
To avoid this, the rule of thumb when doing assignment is: use only one set of square brackets on a single DataFrame. For your problem, that should look like:
df_test.loc[mask.to_numpy().nonzero()[0][j], 'placed'] = 1
(I know the mask.to_numpy().nonzero() part uses two sets of square brackets; actually nonzero() returns a tuple, and the first element of that tuple is an ndarray. Series.nonzero() itself has been removed from pandas, which is why the mask is converted to a NumPy array first. But the DataFrame indexing uses only one set of brackets, and that's the important part.)
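Here is a self-contained sketch of the fixed loop; the toy data and the stand-in for lm.predict are hypothetical:
import numpy as np
import pandas as pd

df_test = pd.DataFrame({'id': [1, 1, 2, 2]})
X_test = np.array([[0.2], [0.9], [-0.5], [-0.1]])

def predict(rows):
    # stand-in for lm.predict
    return rows[:, 0]

df_test['placed'] = 0  # broadcasting; see the notes below
for i in set(df_test['id']):
    mask = (df_test['id'] == i).to_numpy()
    predictions = predict(X_test[mask])
    j = np.argmax(predictions)
    if predictions[j] > 0:
        # One .loc call on the original DataFrame: no chained indexing.
        # (This assumes the default RangeIndex, where positions equal labels.)
        df_test.loc[mask.nonzero()[0][j], 'placed'] = 1

print(df_test)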
Some other notes
There are a couple of notes I have on using pandas (and NumPy).
Pandas & NumPy both have a feature called broadcasting. Basically, if you're assigning a single value to an entire array, you don't need to make an array of the same size first; you can just assign the single value, and pandas/NumPy automagically figures out for you how to apply it. So the first line of your code can be replaced with df_test['placed'] = 0, and it accomplishes the same thing.
Generally speaking when working with pandas & numpy objects, loops are bad; usually you can find a way to use some combination of broadcasting, element-wise operations and boolean indexing to do what a loop would do. And because of the way those features are designed, it'll run a lot faster too. Unfortunately I'm not familiar enough with the lm.predict method to say, but you might be able to avoid the whole for-loop entirely for this code.
In my pandas DataFrame I have lots of boolean features (True/False). Pandas correctly represents them as bool if I do df.dtypes. If I pass my DataFrame to H2O (h2o.H2OFrame(df)), the boolean features are represented as enum, so they are interpreted as categorical features with two categories.
Is there a way to change the type of these features from enum to bool? In pandas I can use df.astype('bool'); is there an equivalent in H2O?
One idea was to encode True/False to their numeric representation (1/0) before converting df to an H2OFrame, but H2O then recognises these columns as int64.
Thanks in advance for your help!
The enum type is used for categorical variables with two or more categories. So it includes boolean. I.e. there is no distinct bool category in H2O, and there is nothing you need to fix here.
By the way, if you have a lot of boolean features because you have manually done one-hot encoding, don't do that. Instead, give H2O the original (multi-level categorical) data and it will do one-hot encoding when needed, behind the scenes. This is better because algorithms like decision trees can use multi-level categorical data directly, so it will be more efficient.
See http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/categorical_encoding.html for some alternatives you can try. The missing category is added for when that column is missing in production.
(But "What happens when you try to predict on a categorical level not seen during training?" at http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/deep-learning.html#faq does not seem to describe the behaviour you see?)
Also see http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/use_all_factor_levels.html (I cannot work out from that description if you want it to be true or false, so try both ways!)
UPDATE: set use_all_factor_levels = F and it will only have one input neuron (plus the NA one) for each boolean input, instead of two. If your categorical inputs are almost all boolean types, I'd recommend setting this. If your categorical inputs mostly have quite a lot of levels, I wouldn't (because, overall, it won't make much difference in the number of input neurons, but it might make the network easier to train).
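For reference, a rough sketch of setting that parameter from Python; the tiny DataFrame, the column names, and the choice of the deep learning estimator are all hypothetical:
import h2o
import pandas as pd
from h2o.estimators import H2ODeepLearningEstimator

df = pd.DataFrame({'isBig': [True, False, True, False],
                   'y':     [1.0, 0.0, 1.0, 0.5]})

h2o.init()
hf = h2o.H2OFrame(df)  # the boolean column arrives as enum

# One input neuron per boolean feature (plus the NA one) instead of two
model = H2ODeepLearningEstimator(use_all_factor_levels=False)
model.train(x=['isBig'], y='y', training_frame=hf)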
WHY MISSING(NA)?
If I have a boolean input, e.g. "isBig", there will be 3 input neurons created for it. If you look at varimp() you can see they are named:
isBig.1
isBig.0
isBig.missing(NA)
Imagine you now put it into production, and the user does not give a value (or gives an NA, or gives an illegal value such as "2") for the isBig input. This is when the NA input neuron gets fired, to signify that we don't know if it is big or not.
To be honest, I think this cannot be any more useful than firing both the .0 and the .1 neurons, or firing neither of them. But if you are using use_all_factor_levels=F then it is useful. Otherwise all NA data gets treated as "not-big" rather than "could be big or not-big".
I am working on a dataset which contains missing values in certain columns. I am trying to use XGBRegressor from the Scikit-Learn wrapper interface for XGBoost. It provides a parameter called 'missing', in which you can enter a float value; otherwise it takes Python's NaN as the default. I need help with how to use this parameter to handle the missing values in the columns of my dataset. It would be helpful if someone could provide a simple example as well.
Whatever value you provide for the 'missing' parameter is treated as a missing value. For example, if you provide 0.5 as the missing value, then wherever XGBoost finds 0.5 in your data, it treats it as missing. The default is NaN.
What XGBoost does is define, based on the data, one branch of each tree split as the default path. For example, at a given split a sample can go in two directions, left or right, and one of them is made the default based on the training data. Whenever a missing value comes as input for that feature (say the 0.5 you defined), the sample takes the default path.
Initially I thought XGBoost imputes the missing value, but it does not. It just defines one of the paths as the default, and whenever a missing value comes in, it takes that default path. This is described in the paper XGBoost: A Scalable Tree Boosting System.
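As a simple illustration (the toy data is hypothetical, with 0.5 as the stand-in missing marker):
import numpy as np
from xgboost import XGBRegressor

X = np.array([[1.0, 0.5],
              [2.0, 3.0],
              [0.5, 1.5],
              [3.0, 0.5]])
y = np.array([1.0, 2.0, 1.5, 3.0])

# Every 0.5 in X is treated as missing; at each split those samples
# follow the learned default direction instead of being compared.
model = XGBRegressor(missing=0.5, n_estimators=10)
model.fit(X, y)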
It's my understanding that you got it mixed up.
The missing parameter only tells XGBoost which value should be treated as missing (aka NaN); the default is np.nan. (Note that it takes a single value, not a list.)
If you want to replace the actual missing values with some different value, let's say "X", you have to do it on your data before applying the model.
If you have a DataFrame df, you can use:
df = df.fillna(X)
(fillna returns a new DataFrame, so assign the result back)
If you have a np.array called array, you can use:
array = np.nan_to_num(array)
but that will replace np.nan with zeros.
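For completeness, a tiny runnable sketch of both options (toy data; the fill value -999 is arbitrary):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0]})
df = df.fillna(-999)  # replace NaN with a chosen value; assign the result back

arr = np.array([1.0, np.nan, 3.0])
arr = np.nan_to_num(arr)  # replaces NaN with 0.0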
hope that helps,