In my dataset, one of the columns is boolean, and there are missing values both in that column and in other, continuous columns. The missing values in the continuous columns are successfully replaced with their mean, but the mean is not a sensible replacement for a missing boolean. How can I replace those values?
Note that the boolean is 1 or 0 in my dataset.
Below is the code for replacing continuous missing values:
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x)
x = imputer.transform(x)
Thank You
There are several ways to attack this issue:
If you can afford it (i.e. you have enough data), drop the rows with missing booleans.
Replace the missing values with the majority (most frequent) value; this is the boolean analogue of replacing a continuous value with its mean.
For time series, replace the cell with the mean of the x cells before and after it, then apply a threshold: above it the imputed value becomes 1, otherwise 0. A sketch of this follows below.
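As a minimal sketch of the time-series approach (assuming the column is a pandas Series named s; the window size of 3 and the 0.5 threshold are assumptions):
import numpy as np
import pandas as pd
# hypothetical boolean series with gaps
s = pd.Series([1, 0, np.nan, 1, 1, np.nan, 0, 1], dtype=float)
# mean of the neighbouring cells (NaNs are ignored by the rolling mean)
window_mean = s.rolling(window=3, center=True, min_periods=1).mean()
# above the threshold the gap becomes 1, otherwise 0
s = s.fillna((window_mean > 0.5).astype(float))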
You can treat this boolean variable as a categorical feature and then use a SimpleImputer with the most_frequent strategy instead of mean.
You can do it as follows:
from sklearn.impute import SimpleImputer
import numpy as np
#Create sample data with nans
X = np.random.randint(2, size=100).reshape(-1, 1).astype(float)  # 100 samples, 1 boolean feature
X[::4, 0] = np.nan  # knock out every fourth value
SimpleImputer(strategy="most_frequent").fit_transform(X)
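If the column lives in a pandas DataFrame instead, a one-line sketch of the same most-frequent idea (the column name 'flag' is hypothetical):
import numpy as np
import pandas as pd
df = pd.DataFrame({'flag': [1.0, 0.0, np.nan, 1.0, 1.0]})
df['flag'] = df['flag'].fillna(df['flag'].mode()[0])  # fills the gap with 1.0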
I have two dataframes, train and test. The test set has missing values on a column.
import numpy as np
import pandas as pd
train = [[0,1],[0,2],[0,3],[0,7],[0,7],[1,3],[1,5],[1,2],[1,2]]
test = [[0,0],[0,np.nan],[1,0],[1,np.nan]]
train = pd.DataFrame(train, columns = ['A','B'])
test = pd.DataFrame(test, columns = ['A','B'])
The test set has two missing values in column B. If the groupby column is A:
If the imputing strategy is mode, then the missing values should be imputed with 7 and 2.
If the imputing strategy is mean, then the missing values should be (1+2+3+7+7)/5 = 4 and (3+5+2+2)/4 = 3.
What is a good way to do this?
This question is related, but uses only one dataframe instead of two.
IIUC, here's one way:
from statistics import mode
test_mode = test.set_index('A').fillna(train.groupby('A').agg(mode)).reset_index()
test_mean = test.set_index('A').fillna(train.groupby('A').mean()).reset_index()
If you want a function:
from statistics import mode
def evaluate_nan(strategy='mean'):
    return test.set_index('A').fillna(train.groupby('A').agg(strategy)).reset_index()
test_mean = evaluate_nan()
test_mode = evaluate_nan(strategy = mode)
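As a quick check against the expected values above:
print(test_mode['B'].tolist())  # [0.0, 7.0, 0.0, 2.0]
print(test_mean['B'].tolist())  # [0.0, 4.0, 0.0, 3.0]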
I'm using IterativeImputer from the sklearn library to impute missing values in rain datasets containing rain stations and rain data (each station is a column, and the index is a DateTime). I was able to run the IterativeImputer and get an output with all my missing values filled. The problem is that the output contains negative values. It's possible to change the min_value that it imputes, but that sets a single value for all the columns. I want to set a min_value based on the minimum value of each column before the imputation. There is a related answer here on Stack Overflow, but I have no clue how to do it.
The code I'm using:
import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector
#Babitonga's region stations
babi_ana = pd.read_csv(all_csv_files[0]).set_index("Time") # Here I read the csv data
# Transforming my index to datetime
babi_ana.index = pd.to_datetime(babi_ana.index)
mask = (babi_ana.index > ini1) & (babi_ana.index <= fim1) #Selecting the date range
babi_ana1 = babi_ana.loc[mask]
# Applying the imputer
imputer_data = IterativeImputer(random_state = 0,skip_complete=True,sample_posterior=True, max_iter = 10, missing_values = np.nan)
data = babi_ana1
minimum = data.iloc[:,:].min(axis=0) #No negative values from the original
imputer_data.fit(data.iloc[:,:].values)
data_imputed = imputer_data.transform(data.iloc[:,:].values)
# Here I realize the output has negative values
data_imputed = pd.DataFrame(data_imputed)
minimum_after = data_imputed.iloc[:,:].min(axis=0) # several negative values, except for 2 stations
I want to be able to set min_value and max_value based on the max and min of each station before the imputation, like this:
max_imputer = data.iloc[:,:].max(axis = 0)
min_imputer = data.iloc[:,:].min(axis = 0)
Great improvements on the question :).
I've read a bit more about the IterativeImputer here: https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer.
It seems that it can take a min_value parameter in the constructor; it expects either a float or an array. If you have a single minimum value that applies to all features (columns), you can just use the float alternative.
For example, if you want the minimum possible value to be 0 in all features (columns), you could change your code to:
imputer_data = IterativeImputer(random_state = 0, skip_complete = True,sample_posterior = True, max_iter = 10, missing_values = np.nan, min_value = 0)
On the other hand, if you want different minimum values for different features, you need to use an array as long as the number of features. For example: if you have 2 features and the minimum values should be 0 and 5, respectively, you would change your code to:
imputer_data = IterativeImputer(random_state = 0, skip_complete = True,sample_posterior = True, max_iter = 10, missing_values = np.nan, min_value = [0, 5])
You can do the same for the max_value parameter.
The first change should make sure you don't get any more negative imputed values.
If you want to use min and max values based on the data you already have, first compute the per-column minimum and maximum of the observed values (the same as getting the min and max of an array) and pass those arrays to the constructor.
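A minimal sketch of that idea, assuming data is the DataFrame from the question:
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# per-column bounds computed from the observed (non-missing) values
min_imputer = data.min(axis=0).values
max_imputer = data.max(axis=0).values
imputer_data = IterativeImputer(random_state=0, skip_complete=True, sample_posterior=True, max_iter=10, missing_values=np.nan, min_value=min_imputer, max_value=max_imputer)
data_imputed = imputer_data.fit_transform(data.values)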
As a final note, it's still a bit odd to me that the Imputer outputs negative data after fitting on only positive data (though note that with sample_posterior=True the imputer draws from a Gaussian posterior, which can produce values outside the observed range). So I'd double-check that data.iloc[:,:].values really is the data you want, in the format the Imputer expects.
I was using sklearn.impute.SimpleImputer(strategy='constant', fill_value=0) to impute all columns with missing values with a constant value (0 being that constant value here).
But it sometimes makes sense to impute different constant values in different columns. For example, I might like to replace all NaN values of a certain column with the maximum value from that column, or another column's NaN values with the minimum, or perhaps the median/mean of that particular column's values.
How can I achieve this?
Also, I'm actually new to this field, so I'm not really sure if doing this will improve my model's results. Your opinions are welcome.
If you want to impute different features with different arbitrary values, or the median, you need to set up several SimpleImputer steps within a pipeline and then join them with the ColumnTransformer:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
# first we need to make lists, indicating which features
# will be imputed with each method
features_numeric = ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']
features_categoric = ['BsmtQual', 'FireplaceQu']
# then we instantiate the imputers, within a pipeline
# we create one imputer for numerical and one imputer
# for categorical
# this imputer imputes with the mean
imputer_numeric = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
])
# this imputer imputes with an arbitrary value
imputer_categoric = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='Missing')),
])
# then we put the feature lists and the transformers together
# using the column transformer
preprocessor = ColumnTransformer(transformers=[
    ('imputer_numeric', imputer_numeric, features_numeric),
    ('imputer_categoric', imputer_categoric, features_categoric),
])
# now we fit the preprocessor
preprocessor.fit(X_train)
# and now we can impute the data
# remember it returns a numpy array
X_train = preprocessor.transform(X_train)
X_test = preprocessor.transform(X_test)
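If you want the result back as a DataFrame, a small sketch (this assumes the default remainder='drop', so the output columns follow the order in which the transformers were listed):
import pandas as pd
X_train = pd.DataFrame(X_train, columns=features_numeric + features_categoric)
X_test = pd.DataFrame(X_test, columns=features_numeric + features_categoric)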
Alternatively, you can use the Feature-engine package, whose transformers allow you to specify the features to impute:
from feature_engine import imputation as msi
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    # add a binary variable indicating missing information for the 2 variables below
    ('continuous_var_imputer', msi.AddMissingIndicator(variables=['LotFrontage', 'GarageYrBlt'])),
    # replace NA with the median in the 3 numerical variables below
    ('continuous_var_median_imputer', msi.MeanMedianImputer(imputation_method='median', variables=['LotFrontage', 'GarageYrBlt', 'MasVnrArea'])),
    # replace NA with the label "Missing" in categorical variables
    # (the transformer skips variables that have no NA)
    ('categorical_imputer', msi.CategoricalImputer(variables=['var1', 'var2'])),
    # an additional median-imputation step for the remaining numerical variables
    ('additional_median_imputer', msi.MeanMedianImputer(imputation_method='median', variables=['var4', 'var5'])),
])
pipe.fit(X_train)
X_train_t = pipe.transform(X_train)
Feature-engine returns dataframes. More info in this link.
To install Feature-Engine do:
pip install feature-engine
Hope that helps
I'm trying to normalize data with missing (i.e. nan) values before processing it, using scikit-learn preprocessing.
Apparently, some scalers (e.g. StandardScaler) handle the missing values the way I want - by which I mean normalize the existing values while keeping the nans - while other (e.g. Normalizer) just raise an error.
I've looked around and haven't found an answer: how can I use the Normalizer with missing values, or replicate its behavior (with norm='l1' and norm='l2'; I need to test several normalization options) some other way?
from sklearn.preprocessing import Normalizer, StandardScaler
import numpy as np
data = np.array([0,1,2,np.nan, 3,4])
scaler = StandardScaler(with_mean=True, with_std=True)
scaler.fit_transform(data.reshape(-1,1))
normalizer = Normalizer(norm='l2')
normalizer.fit_transform(data.reshape(-1,1))
The problem with your request is that Normalizer operates in this fashion, according to the documentation:
Normalize samples individually to unit norm.
Each sample (i.e. each row of the data matrix) with at least one non-zero component is rescaled independently of other samples so that its norm (l1 or l2) equals one (source here)
That means each row has to have unit norm. How should a missing value be handled? Ideally, it seems you don't want it to count towards the norm and you want the row to be normalized regardless of it, but the internal function check_array prevents this by throwing an error.
You need to circumvent such a situation. The most reasonable way to do it is to:
first create a mask in order to record which elements were missing in your array
create a response array filled with missing values
apply the Normalizer to your array after selecting only the valid entries
record on your response array the normalized values based on their original position
Here is some code detailing the process, based on your example:
from sklearn.preprocessing import Normalizer, StandardScaler
import numpy as np
data = np.array([0,1,2,np.nan, 3,4])
# set valid mask
nan_mask = np.isnan(data)
valid_mask = ~nan_mask
normalizer = Normalizer(norm='l2')
# create a result array
result = np.full(data.shape, np.nan)
# assign the normalized values only to the valid positions
result[valid_mask] = normalizer.fit_transform(data[valid_mask].reshape(-1,1)).reshape(data[valid_mask].shape)
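Note that with reshape(-1,1) each valid value is treated as its own single-feature sample, so every entry is rescaled to ±1 on its own (this matches the reshaping in your question). If instead you want the valid values to share a joint unit norm, which is an assumption about your intent, reshape them into a single row:
result2 = np.full(data.shape, np.nan)
result2[valid_mask] = normalizer.fit_transform(data[valid_mask].reshape(1, -1)).ravel()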
I am trying to use Imputer on a single column called Age to replace missing values, but I get the error "Expected 2D array, got 1D array instead".
Following is my code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import Imputer
dataset = pd.read_csv("titanic_train.csv")
dataset.drop('Cabin',axis = 1,inplace = True)
x = dataset.drop('Survived',axis = 1)
y = dataset['Survived']
imputer = Imputer(missing_values ="nan",strategy = "mean",axis = 1)
imputer=imputer.fit(x['Age'])
x['Age']=imputer.transform(x['Age'])
The Imputer is expecting a 2-dimensional array as input, even if one of those dimensions is of length 1. This can be achieved with reshape:
imputer = Imputer(missing_values='NaN', strategy='mean')
imputer.fit(x['Age'].values.reshape(-1, 1))
x['Age'] = imputer.transform(x['Age'].values.reshape(-1, 1))
That said, if you are not doing anything more complicated than filling in missing values with the mean, you might find it easier to skip the Imputer altogether and just use Pandas fillna instead:
x['Age'].fillna(x['Age'].mean(), inplace=True)
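In recent pandas versions, assigning the result back is preferred over inplace=True on a column selection (which can trigger chained-assignment warnings):
x['Age'] = x['Age'].fillna(x['Age'].mean())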
Although #thesilkworkm beat me to it, it may be useful to know why exactly your own code doesn't work.
So, apart from the reshape issue, there are two more mistakes in your code. The first is that you erroneously ask for axis=1 in your imputer, while you should ask for axis=0 (which is the default value, and that's why it works when omitted completely, as in #thesilkworkm's answer); from the docs:
axis : integer, optional (default=0)
The axis along which to impute.
If axis=0, then impute along columns.
If axis=1, then impute along rows.
The second mistake is your missing_values argument, which should be 'NaN', and not 'nan'; from the docs again:
missing_values : integer or “NaN”, optional (default=”NaN”)
The placeholder for the missing values. All occurrences of missing_values will be imputed. For missing values encoded as np.nan,
use the string value “NaN”.
So, just to offer an alternative but equivalent solution (beyond the one already provided by #thesilkworm), you can also fit & transform in one line:
imp = Imputer(missing_values ="NaN",strategy = "mean",axis = 0)
x['Age'] = imp.fit_transform(x['Age'].values.reshape(-1,1))
When you fit-transform, use reshape(-1,1), because the method expects a 2D array as input but you are giving it a 1D array.
Ex: x['Age'] = imputer.transform(x['Age'].values.reshape(-1,1))