I'm getting negative values as output of IterativeImputer from sklearn - python

I'm using the IterativeImputer from the sklearn library to impute missing values in rainfall datasets containing rain stations and rain data (one station per column, with a DateTime index). I was able to run the IterativeImputer and get an output with all my missing values filled. The problem is that the output contains negative values. It's possible to change the min_value that it imputes, but that sets a single value for all the columns. I want to set a min_value based on the minimum value of each column before the imputation. There is an answer here on Stack Overflow for that, but I have no clue how to do it.
The code I'm using:
import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector
#Babitonga's region stations
babi_ana = pd.read_csv(all_csv_files[0]).set_index("Time") # Here I read the CSV data
# Transforming my index to datetime
babi_ana.index = pd.to_datetime(babi_ana.index)
mask = (babi_ana.index > ini1) & (babi_ana.index <= fim1) #Selecting the date range
babi_ana1 = babi_ana.loc[mask]
# Applying the imputer
imputer_data = IterativeImputer(random_state = 0,skip_complete=True,sample_posterior=True, max_iter = 10, missing_values = np.nan)
data = babi_ana1
minimum = data.iloc[:,:].min(axis=0) #No negative values from the original
imputer_data.fit(data.iloc[:,:].values)
data_imputed = imputer_data.transform(data.iloc[:,:].values)
# Here I realize the output has negative values
data_imputed = pd.DataFrame(data_imputed)
minimum_after = data_imputed.iloc[:,:].min(axis=0) # several negative values, except for 2 stations
I want to be able to set min_value and max_value based on the max and min of each station before the imputation, like this:
max_imputer = data.iloc[:,:].max(axis = 0)
min_imputer = data.iloc[:,:].min(axis = 0)

Great improvements on the question :).
I've read a bit more about the IterativeImputer here: https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer.
It seems it can take a min_value parameter in the constructor, which expects either a float or an array. If you have a single minimum value for all features (columns) of your data, you can just use the float alternative.
For example, if you want the minimum possible value to be 0 in all features (columns), you could change your code to:
imputer_data = IterativeImputer(random_state = 0, skip_complete = True,sample_posterior = True, max_iter = 10, missing_values = np.nan, min_value = 0)
On the other hand, if you want different minimum values for different features, you need to pass an array whose length equals the number of features. For example, if you have 2 features and the minimum values should be 0 and 5, respectively, you would change your code to:
imputer_data = IterativeImputer(random_state = 0, skip_complete = True,sample_posterior = True, max_iter = 10, missing_values = np.nan, min_value = [0, 5])
You can do the same for the max_value parameter.
The first change should make sure you don't get any more negative imputed values.
If you want to use min and max values based on the data you already have, the first step is to compute, for each feature, the minimum and maximum of the values that are present, and pass those arrays to the constructor, as sketched below. It is the same as getting min and max values in an array, so you can find plenty of Python examples if you aren't sure how to do it.
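For instance, here is a minimal sketch of that idea, reusing the variables from the question (note that array-valued min_value/max_value need a reasonably recent scikit-learn, 0.23+):
# Per-column bounds taken from the observed data (pandas skips NaNs in min/max)
min_imputer = data.min(axis=0).values
max_imputer = data.max(axis=0).values
imputer_data = IterativeImputer(random_state=0, skip_complete=True, sample_posterior=True, max_iter=10, missing_values=np.nan, min_value=min_imputer, max_value=max_imputer)
# Fit and transform in one step, keeping the original index and column names
data_imputed = pd.DataFrame(imputer_data.fit_transform(data.values), index=data.index, columns=data.columns)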
As a final note, it's still a bit weird to me how the imputer outputs negative data after fitting on only positive data (although with sample_posterior=True the imputations are drawn from a Gaussian posterior, so values outside the observed range are not surprising). I'd still double-check that data.iloc[:,:].values really is the data you want, in the format the imputer is expecting.

Related

How can I replace missing boolean values using python?

In my dataset, one of the columns is a boolean value. There are missing values both in that column and in other, continuous columns; the continuous ones are successfully replaced with their mean, but a mean cannot stand in for a missing boolean. So how can I replace those values?
Note that the boolean is 1 or 0 in my dataset.
Below is the code for replacing continuous missing values:
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x)
x = imputer.transform(x)
Thank You
There are several methods to attack this issue:
if you can afford it (i.e. you have enough data), exclude those rows
replace those missing values with the majority value (the analogue of replacing with the mean for a continuous value)
for time series: replace the cell with the mean of the x cells before and after it, and set a threshold; above the threshold the imputed value becomes 1, otherwise 0 (see the sketch below)
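A minimal sketch of that time-series option (my own illustration; the Series s, the window size of 5 and the 0.5 threshold are all arbitrary choices):
import numpy as np
import pandas as pd
s = pd.Series([1, 0, np.nan, 1, 1, np.nan, 0, 0], dtype=float)  # toy boolean series with gaps
# Centered rolling mean over the cells before and after each gap (NaNs are skipped)
neighbour_mean = s.rolling(window=5, center=True, min_periods=1).mean()
# Threshold the mean: above 0.5 -> 1, otherwise 0
s_filled = s.fillna((neighbour_mean > 0.5).astype(float))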
You can treat this boolean variable as a categorical feature and then use a SimpleImputer with the most_frequent strategy instead of mean.
You can do as follows:
from sklearn.impute import SimpleImputer
import numpy as np
# Create sample data with NaNs: one feature column of 100 boolean samples
X = np.random.randint(2, size=100).reshape(-1, 1).astype(float)
X[::4, 0] = np.nan  # knock out every fourth value
SimpleImputer(strategy="most_frequent").fit_transform(X)

Using inverse_transform MinMaxScaler from scikit_learn to force a dataframe be in a range of another

I was following this answer to apply an inverse transformation over a scaled dataframe. My question is: how can I transform a new dataframe into the range of values of the original dataframe?
So far, I did this:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
cols = ['A', 'B']
data = pd.DataFrame(np.array([[2,3],[1.02,1.2],[0.5,0.3]]),columns=cols)
scaler = MinMaxScaler() # default min and max values are 0 and 1, respectively
scaled_data = scaler.fit_transform(data)
orig_data = scaler.inverse_transform(scaled_data) # obtain same as `data`
new_data = pd.DataFrame(np.array([[8,20],[11,2],[5,3]]),columns=cols)
inver_new_data = scaler.inverse_transform(new_data)
I want inver_new_data to be a dataframe whose columns lie in the same range of values as the columns of data, for instance column A between 0.5 and 2, and so on. However, for column A I get values between 8 and 17.
Any ideas?
MinMaxScaler applies to each column the following transformation:
Subtract column minimum;
Divide by column range (i.e. column max - column min).
The inverse transform applies the "inverse" operation in "inverse" order:
Multiply by the column range (learned at fit time);
Add the column minimum.
Therefore for column A it is doing:
(df['A'] - df['A'].min())/(df['A'].max() - df['A'].min())
In particular, the scaler stores the min 0.5 and the range 1.5 for column A.
When you apply the inverse_transform to [8, 11, 5] this becomes:
[8*1.5 + 0.5, 11*1.5 + 0.5, 5*1.5 + 0.5] = [12.5, 17.0, 8.0]
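You can check the stored statistics directly; here is a quick sketch using the question's data (data_min_ and data_range_ are attributes of the fitted MinMaxScaler):
scaler = MinMaxScaler().fit(data)
print(scaler.data_min_)    # [0.5 0.3]  -> stored per-column minimums
print(scaler.data_range_)  # [1.5 2.7]  -> stored per-column ranges
# inverse_transform of [8, 11, 5] in column A: 8*1.5 + 0.5, 11*1.5 + 0.5, 5*1.5 + 0.5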
Now, this is not something you'd generally want to do for machine learning; however, to map the ranges of the new columns onto those of the original ones, you can do something like the following:
data = pd.DataFrame(np.array([[2,3],[1.02,1.2],[0.5,0.3]]),columns=cols)
# Create a Scaler for the initial data
scaler_data = MinMaxScaler()
# Fit the scaler with these data, but there is no need to transform them.
scaler_data.fit(data)
#Create new data
new_data = pd.DataFrame(np.array([[8,20],[11,2],[5,3]]),columns=cols)
# Create a Scaler for the new data
scaler_new_data = MinMaxScaler()
# Transform new data into the [0, 1] range
scaled_new_data = scaler_new_data.fit_transform(new_data)
# Inverse transform new data from [0-1] to [min, max] of data
inver_new_data = scaler_data.inverse_transform(scaled_new_data)
This will always map the min and max of the new dataframe's columns to the min and max of the initial dataframe's columns, respectively.
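As a quick sanity check (a sketch that just prints the column ranges of the round-tripped result):
print(pd.DataFrame(inver_new_data, columns=cols).agg(['min', 'max']))
# column A spans [0.5, 2.0] and column B spans [0.3, 3.0] -- the ranges of `data`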
To explain what MinMaxScaler is doing:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
So basically every feature of your data will be between 0 and 1. The moment you run fit_transform(data), the scaler is trained.
For transformation you have:
X_scaled = scale * X + min - X.min(axis=0) * scale
where scale = (max - min) / (X.max(axis=0) - X.min(axis=0))
The scale was learned during fitting.
So if you run inverse_transform(new_data), this does not help you at all.
Also inver_new_data = scaler.transform(new_data) will not help you.
You need to specify what "the same range" means for you. The approach with MinMaxScaler will not help you as is; you could only map the columns to the min and max of the original dataframe. So for example:
dataA = new_data[['A']]
scalerA = MinMaxScaler(feature_range=(data['A'].min(), data['A'].max()))
inver_new_data_A = scalerA.fit_transform(dataA)
but this is also not exactly the same thing; MinMaxScaler also preserves the relative distances between the points within that range.

I have converted a continuous feature to categorical. I am getting NaN in Pandas

I have converted a continuous dataset to categorical. I am getting NaN values whenever the value of the continuous data is 0.0 after conversion. Below is my code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('NSL-KDD/KDDTrain+.txt',header=None)
data = df[33]
bins = [0.000,0.05,0.10,0.15,0.20,0.25,0.30,0.35,0.40,0.45,0.50,0.55,0.60,0.65,0.70,0.75,0.80,0.85,0.90,0.95,1.00]
category = pd.cut(data,bins)
category = category.to_frame()
print (category)
How do I convert the values so that I don't get NaN values? I have attached two screenshots for a better understanding of how the actual data looks and how the converted data looks. This is the main dataset. This is what it becomes after using bins and pd.cut(). How can those "0.00" values stay like the other values in the dataset?
When using pd.cut, you can specify the parameter include_lowest=True. This will make the first interval left-inclusive (it will include the 0 value, as your first interval starts at 0).
So in your case, you can adjust your code to be
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('NSL-KDD/KDDTrain+.txt',header=None)
data = df[33]
bins = [0.000,0.05,0.10,0.15,0.20,0.25,0.30,0.35,0.40,0.45,0.50,0.55,0.60,0.65,0.70,0.75,0.80,0.85,0.90,0.95,1.00]
category = pd.cut(data,bins,include_lowest=True)
category = category.to_frame()
print (category)
Documentation Reference for pd.cut
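To see the difference on a value of exactly 0.0, here is a small standalone sketch (toy data, not the questioner's dataset):
import pandas as pd
s = pd.Series([0.0, 0.03, 0.5])
print(pd.cut(s, [0.0, 0.05, 1.0]))                       # 0.0 falls outside (0.0, 0.05] -> NaN
print(pd.cut(s, [0.0, 0.05, 1.0], include_lowest=True))  # 0.0 lands in the first interval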

Handling missing (nan) values on sklearn.preprocessing

I'm trying to normalize data with missing (i.e. nan) values before processing it, using scikit-learn preprocessing.
Apparently, some scalers (e.g. StandardScaler) handle the missing values the way I want - by which I mean normalize the existing values while keeping the nans - while other (e.g. Normalizer) just raise an error.
I've looked around and haven't found a way. How can I use the Normalizer with missing values, or replicate its behavior (with norm='l1' and norm='l2'; I need to test several normalization options) some other way?
from sklearn.preprocessing import Normalizer, StandardScaler
import numpy as np
data = np.array([0,1,2,np.nan, 3,4])
scaler = StandardScaler(with_mean=True, with_std=True)
scaler.fit_transform(data.reshape(-1,1))
normalizer = Normalizer(norm='l2')
normalizer.fit_transform(data.reshape(-1,1))
The problem with your request is that Normalizer operates in this fashion, according to the documentation:
Normalize samples individually to unit norm.
Each sample (i.e. each row of the data matrix) with at least one non-zero component is rescaled independently of other samples so that its norm (l1 or l2) equals one (source here).
That means each row has to be rescaled to unit norm. How to deal with a missing value? Ideally, it seems you don't want it to count in the norm and you want the row to be normalized regardless of it, but the internal function check_array prevents this by throwing an error.
You need to circumvent such a situation. The most reasonable way to do it is to:
first create a mask in order to record which elements were missing in your array
create a response array filled with missing values
apply the Normalizer to your array after selecting only the valid entries
write the normalized values into your response array based on their original positions
Here is some code detailing the process, based on your example:
from sklearn.preprocessing import Normalizer, StandardScaler
import numpy as np
data = np.array([0,1,2,np.nan, 3,4])
# set valid mask
nan_mask = np.isnan(data)
valid_mask = ~nan_mask
normalizer = Normalizer(norm='l2')
# create a result array
result = np.full(data.shape, np.nan)
# assign the normalized values back to the valid positions only
result[valid_mask] = normalizer.fit_transform(data[valid_mask].reshape(-1,1)).reshape(data[valid_mask].shape)
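If you need the same behaviour over a 2D array with one sample per row, here is a plain NumPy sketch (my own workaround, not part of scikit-learn) that computes the row norm over the non-NaN entries only:
import numpy as np

def nan_normalize(X, norm='l2'):
    # Rescale each row to unit l1/l2 norm, ignoring NaNs when computing the norm
    X = np.asarray(X, dtype=float)
    if norm == 'l1':
        norms = np.nansum(np.abs(X), axis=1, keepdims=True)
    else:  # 'l2'
        norms = np.sqrt(np.nansum(X ** 2, axis=1, keepdims=True))
    norms[norms == 0] = 1.0  # leave all-zero (or all-NaN) rows unchanged
    return X / norms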

Imputer on some columns in a Dataframe

I am trying to use Imputer on a single column called Age to replace missing values, but I get the error "Expected 2D array, got 1D array instead". Following is my code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import Imputer
dataset = pd.read_csv("titanic_train.csv")
dataset.drop('Cabin',axis = 1,inplace = True)
x = dataset.drop('Survived',axis = 1)
y = dataset['Survived']
imputer = Imputer(missing_values ="nan",strategy = "mean",axis = 1)
imputer=imputer.fit(x['Age'])
x['Age']=imputer.transform(x['Age'])
The Imputer is expecting a 2-dimensional array as input, even if one of those dimensions is of length 1. This can be achieved using np.reshape:
imputer = Imputer(missing_values='NaN', strategy='mean')
imputer.fit(x['Age'].values.reshape(-1, 1))
x['Age'] = imputer.transform(x['Age'].values.reshape(-1, 1))
That said, if you are not doing anything more complicated than filling in missing values with the mean, you might find it easier to skip the Imputer altogether and just use Pandas fillna instead:
x['Age'].fillna(x['Age'].mean(), inplace=True)
Although @thesilkworkm beat me to the punch, it may be useful to know exactly why your own code doesn't work.
So, apart from the reshape issue, there are two more mistakes in your code. The first is that you erroneously ask for axis=1 in your imputer, while you should ask for axis=0 (which is the default value, and that's why it works when omitted completely, as in @thesilkworkm's answer); from the docs:
axis : integer, optional (default=0)
The axis along which to impute.
If axis=0, then impute along columns.
If axis=1, then impute along rows.
The second mistake is your missing_values argument, which should be 'NaN', and not 'nan'; from the docs again:
missing_values : integer or “NaN”, optional (default=”NaN”)
The placeholder for the missing values. All occurrences of missing_values will be imputed. For missing values encoded as np.nan, use the string value “NaN”.
So, just to offer an alternative but equivalent solution (beyond the one already provided by @thesilkworm), you can also fit & transform in one line:
imp = Imputer(missing_values ="NaN",strategy = "mean",axis = 0)
x['Age'] = imp.fit_transform(x['Age'].values.reshape(-1,1))
When you fit-transform, use reshape(-1, 1), because the method expects a 2D array as input but you are giving it a 1D array.
Ex: x['Age'] = imputer.transform(x['Age'].values.reshape(-1,1))

Categories

Resources