Imputer on some columns in a Dataframe - python

I am trying to use Imputer on a single column called Age to replace missing values, but I get the error "Expected 2D array, got 1D array instead".
Following is my code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import Imputer
dataset = pd.read_csv("titanic_train.csv")
dataset.drop('Cabin',axis = 1,inplace = True)
x = dataset.drop('Survived',axis = 1)
y = dataset['Survived']
imputer = Imputer(missing_values ="nan",strategy = "mean",axis = 1)
imputer=imputer.fit(x['Age'])
x['Age']=imputer.transform(x['Age'])

The Imputer expects a 2-dimensional array as input, even if one of those dimensions has length 1. This shape can be achieved with reshape:
imputer = Imputer(missing_values='NaN', strategy='mean')
imputer.fit(x['Age'].values.reshape(-1, 1))
x['Age'] = imputer.transform(x['Age'].values.reshape(-1, 1))
That said, if you are not doing anything more complicated than filling in missing values with the mean, you might find it easier to skip the Imputer altogether and just use Pandas fillna instead:
x['Age'].fillna(x['Age'].mean(), inplace=True)
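As an aside, Imputer was deprecated in scikit-learn 0.20 and removed in 0.22; in current versions the replacement is SimpleImputer from sklearn.impute. A minimal sketch of the same fix with it:
from sklearn.impute import SimpleImputer
import numpy as np

# SimpleImputer takes np.nan directly rather than the string 'NaN'.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Double brackets select a one-column DataFrame, which is already 2D,
# so no reshape is needed.
x[['Age']] = imputer.fit_transform(x[['Age']])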

Although @thesilkworkm beat me to it, it may be useful to know why exactly your own code doesn't work.
So, apart from the reshape issue, there are two more mistakes in your code; the first is that you erroneously ask for axis=1 in your imputer, while you should ask for axis=0 (which is the default value, and that's why it works when omitted completely, as in @thesilkworkm's answer); from the docs:
axis : integer, optional (default=0)
The axis along which to impute.
If axis=0, then impute along columns.
If axis=1, then impute along rows.
The second mistake is your missing_values argument, which should be 'NaN', and not 'nan'; from the docs again:
missing_values : integer or “NaN”, optional (default=”NaN”)
The placeholder for the missing values. All occurrences of missing_values will be imputed. For missing values encoded as np.nan,
use the string value “NaN”.
So, just to offer an alternative but equivalent solution (beyond the one already provided by @thesilkworkm), you can also fit & transform in one line:
imp = Imputer(missing_values="NaN", strategy="mean", axis=0)
x['Age'] = imp.fit_transform(x['Age'].values.reshape(-1, 1))

When you fit or transform, use reshape(-1, 1), because the method expects a 2D array as input but you are giving it a 1D array.
Ex: x['Age'] = imputer.transform(x['Age'].values.reshape(-1, 1))

Related

How can I replace missing boolean values using python?

In my dataset, one of the columns is a boolean, and there are missing values both in that column and in other, continuous columns. The continuous columns are successfully imputed with their mean, but a mean makes no sense for a missing boolean. So how can I replace those values?
Note that the boolean is 1 or 0 in my dataset.
Below is the code for replacing continuous missing values:
from sklearn.impute import SimpleImputer
import numpy as np
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x)
x = imputer.transform(x)
Thank You
There are several methods to attack this issue:
if you can afford it (i.e. you have enough data), exclude those rows
replace those rows with the majority value (the analogue of replacing with the mean of a continuous variable)
for time series: replace the cell with the mean of x cells before and after it, and set a threshold above which the mean becomes 1, otherwise 0 (see the sketch below)
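A minimal pandas sketch of that last, time-series idea, assuming the boolean column is a float Series s with NaN gaps; the window size and the 0.5 threshold here are illustrative choices, not prescribed by the answer:
import pandas as pd
import numpy as np

# Hypothetical boolean series (1.0/0.0) with missing entries.
s = pd.Series([1, 0, np.nan, 1, 1, np.nan, 0, 0], dtype=float)

# Mean over a centered 5-cell window (2 cells before and after),
# then threshold at 0.5 to map the fill values back to 0/1.
window_mean = s.rolling(window=5, center=True, min_periods=1).mean()
s = s.fillna((window_mean >= 0.5).astype(float))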
You can treat this boolean variable as a categorical feature and then use a SimpleImputer with the most_frequent strategy instead of mean.
You can do as follows:
from sklearn.impute import SimpleImputer
import numpy as np
# Create sample data with NaNs: one feature, 100 samples
X = np.random.randint(2, size=100).reshape(-1, 1).astype(float)
X[::4, 0] = np.nan
SimpleImputer(strategy="most_frequent").fit_transform(X)

How to scale all columns except last column?

I'm using python 3.7.6.
I'm working on classification problem.
I want to scale my dataframe's (df) feature columns.
The dataframe contains 56 columns (55 feature columns and the last column is the target column).
I want to scale the feature columns.
I'm doing it as follows:
y = df.iloc[:,-1]
target_name = df.columns[-1]
from FeatureScaling import feature_scaling
df = feature_scaling.scale(df.iloc[:,0:-1], standardize=False)
df[target_name] = y
but this seems inefficient, because I need to recreate the dataframe (adding the target column back onto the scaling result).
Is there a way to scale just some columns without changing the others, in an efficient way?
(i.e. the result from scale would contain the scaled columns plus the one column that is not scaled)
Using column indices for scaling or other pre-processing operations is not a very good idea, as the code breaks every time you create a new feature. Rather, use column names, e.g.
using scikit-learn:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
features = [<features to standardize>]
scaler = StandardScaler()
# fit_transform returns a 2d numpy.array; we cast it back to a pd.DataFrame,
# keeping the original index so the later concat aligns correctly
standardized_features = pd.DataFrame(scaler.fit_transform(df[features].copy()), columns=features, index=df.index)
old_shape = df.shape
# drop the unnormalized features from the dataframe
df.drop(features, axis = 1, inplace = True)
# join back the normalized features
df = pd.concat([df, standardized_features], axis= 1)
assert old_shape == df.shape, "something went wrong!"
Or you can use a function like this if you prefer not to split and join the data back:
import numpy as np

def normalize(x):
    if np.std(x) == 0:
        raise ValueError('Constant column')
    return (x - np.mean(x)) / np.std(x)

for col in features:
    df[col] = normalize(df[col])
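Another route, if you want to stay entirely inside scikit-learn, is ColumnTransformer with remainder='passthrough'; a sketch, assuming features holds the names of the columns to scale:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Scale only the listed feature columns; remainder='passthrough'
# leaves every other column (e.g. the target) untouched.
ct = ColumnTransformer(
    [('scale', StandardScaler(), features)],
    remainder='passthrough')
transformed = ct.fit_transform(df)
Note that the output is a numpy array with the scaled columns first, followed by the passthrough columns.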
You can slice the columns you want:
df.iloc[:, :-1] = feature_scaling.scale(df.iloc[:, :-1], standardize=False)
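If you'd rather not depend on the custom FeatureScaling module, the same in-place slicing pattern works with scikit-learn directly; a sketch, with MinMaxScaler standing in as an assumed substitute for feature_scaling.scale(..., standardize=False):
from sklearn.preprocessing import MinMaxScaler

# Scale every column except the last, writing the result back in place;
# the target column is left untouched.
df.iloc[:, :-1] = MinMaxScaler().fit_transform(df.iloc[:, :-1])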

sklearn: Need to reshape array but don't know where

I've tried almost everything and I know there is something I'm missing; I'm a real noob in ML, but I would really appreciate any help or explanations.
df["Date"] and df["Open"] are arrays like: [1,2, ..., 10]
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
df = pd.read_csv('AAPL.csv')
clf = LinearRegression()
i = 0
for date in df["Date"]:
    s = date
    s = s.replace("-", "")
    df["Date"][i] = s
    i += 1
clf.fit(df["Date"],df["Open"])
print("Prediction:", clf.predict(df["Date"][-1]))
Here is the error that Python throws me:
ValueError: Expected 2D array, got 1D array instead:
array=[19801212. 19801215. 19801216. ... 20191127. 20191129. 20191202.].
Reshape your data either using array.reshape(-1, 1) if your data has a single
feature or array.reshape(1, -1) if it contains a single sample. line 16
After some tries, errors and googling, I figured out how to reshape df["Date"] by doing this:
clf.fit(np.array(df["Date"]).reshape(-1,1),df["Open"])
But now it throws me this:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I really appreciate any help, thanks in advance.
For reshaping:
clf.fit(df["Date"].values.reshape(-1,1),df["Open"].values.reshape(-1,1))
But I'm not sure you have the correct datetime-type column for df["Date"], since pandas may read it as a string. You could do:
df["Date"] = pd.to_numeric(pd.to_datetime(df["Date"]))
for the type conversion (to an integer in the end). Lastly, if you have NaN rows, you can eliminate them with:
df = df.dropna(how='any',axis=0, subset=['Date','Open'])
Hope this works.
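Putting the pieces together, a minimal end-to-end sketch, assuming the same AAPL.csv with Date and Open columns:
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv('AAPL.csv')
# Drop rows with missing values first, then convert the dates to integers.
df = df.dropna(how='any', axis=0, subset=['Date', 'Open'])
df["Date"] = pd.to_numeric(pd.to_datetime(df["Date"]))

clf = LinearRegression()
clf.fit(df["Date"].values.reshape(-1, 1), df["Open"].values)
# predict also expects a 2D array, even for a single sample.
print("Prediction:", clf.predict(df["Date"].values[-1:].reshape(-1, 1)))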

Handling missing (nan) values on sklearn.preprocessing

I'm trying to normalize data with missing (i.e. nan) values before processing it, using scikit-learn preprocessing.
Apparently, some scalers (e.g. StandardScaler) handle the missing values the way I want, by which I mean they normalize the existing values while keeping the nans, while others (e.g. Normalizer) just raise an error.
I've looked around and haven't found an answer: how can I use the Normalizer with missing values, or replicate its behavior (with norm='l1' and norm='l2'; I need to test several normalization options) some other way?
from sklearn.preprocessing import Normalizer, StandardScaler
import numpy as np
data = np.array([0,1,2,np.nan, 3,4])
scaler = StandardScaler(with_mean=True, with_std=True)
scaler.fit_transform(data.reshape(-1,1))
normalizer = Normalizer(norm='l2')
normalizer.fit_transform(data.reshape(-1,1))
The problem with your request is that Normalizer operates in this fashion, according to the documentation:
Normalize samples individually to unit norm.
Each sample (i.e. each row of the data matrix) with at least one non
zero component is rescaled independently of other samples so that its
norm (l1 or l2) equals one (source here)
That means that each row has to have unit norm. How to deal with a missing value? Ideally, it seems, you don't want it to count in the norm, and you want the row to be normalized regardless of it, but the internal function check_array prevents this by throwing an error.
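Before the workaround, a quick illustration of that row-wise behavior (my example, not from the original answer): the row [3, 4] has l2 norm 5, so Normalizer rescales it to [0.6, 0.8].
from sklearn.preprocessing import Normalizer
import numpy as np

# One sample with two features; its l2 norm is sqrt(3**2 + 4**2) = 5.
Normalizer(norm='l2').fit_transform(np.array([[3.0, 4.0]]))
# -> array([[0.6, 0.8]])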
You need to circumvent such a situation. The most reasonable way to do it is to:
first create a mask in order to record which elements were missing in your array
create a response array filled with missing values
apply the Normalizer to your array after selecting only the valid entries
record on your response array the normalized values based on their original position
Here is some code detailing the process, based on your example:
from sklearn.preprocessing import Normalizer, StandardScaler
import numpy as np
data = np.array([0,1,2,np.nan, 3,4])
# set valid mask
nan_mask = np.isnan(data)
valid_mask = ~nan_mask
normalizer = Normalizer(norm='l2')
# create a result array
result = np.full(data.shape, np.nan)
# assign the normalized values back only to the valid positions
result[valid_mask] = normalizer.fit_transform(data[valid_mask].reshape(-1,1)).reshape(data[valid_mask].shape)

How to use Imputer in a DataFrameMapper on a dataframe?

I want to use a DataFrameMapper with an Imputer+Scaler mapping on all float64 columns of a dataframe. My code works with the StandardScaler, but when I add the Imputer, the mapper returns just one row of all zeros.
I saw the question "Imputer on some Dataframe columns in Python" and the tutorial at https://github.com/paulgb/sklearn-pandas, and there is a warning:
site-packages\sklearn\utils\validation.py:386: DeprecationWarning:
Passing 1d arrays as data is deprecated in 0.17 and will raise
ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if
your data has a single feature or X.reshape(1, -1) if it contains a
single sample.
So I understand that there is a shape mismatch. How should the below example dataframe be reshaped?
import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler, Imputer
# just a random dataframe from http://pandas.pydata.org/pandas-docs/stable/10min.html
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
print "Starting with a random dataframe of 6 rows and 4 columns of floats:"
print df.shape
print df
mapping=[('A', [Imputer(), StandardScaler()]), ('C', [Imputer(), StandardScaler()])]
mapper = DataFrameMapper(mapping)
result = mapper.fit_transform(df)
print "I get an unexpected result of all zeroes in just one row."
print result.shape
print result
print "Expected is a dataframe of 2 columns and 6 rows of scaled floats."
print "something like this:"
mapping=[('A', [StandardScaler()]), ('C', [StandardScaler()])]
mapper = DataFrameMapper(mapping)
result_scaler = mapper.fit_transform(df)
print(result_scaler.shape)
print(result_scaler)
This is the output:
Starting with a random dataframe of 6 rows and 4 columns of floats.
(6, 4)
A B C D
2013-01-01 -0.070551 0.039074 0.513491 -0.830585
2013-01-02 -0.313069 -1.028936 2.359338 -0.830518
2013-01-03 -1.264926 -0.830575 0.461515 0.427228
2013-01-04 -0.374400 0.619986 0.318128 0.361712
2013-01-05 -0.235587 -1.647786 -0.819940 -1.036435
2013-01-06 1.436073 0.312183 1.566990 -0.272224
Unexpected result is all zeroes in just one row.
(1L, 12L)
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
Expected is a dataframe of 2 columns and 6 rows of scaled floats.
something like this
(6L, 2L)
[[ 0.08306789 -0.21892275]
[-0.21975387 1.61986719]
[-1.40829622 -0.27069922]
[-0.29633508 -0.4135387 ]
[-0.12300572 -1.54725542]
[ 1.964323 0.83054889]]
And a follow-up question: my original dataframe is a combination of floats, booleans and objects (labels). So when I have a list of
floats = list(df.select_dtypes(include=['float64']).columns)
mapping=[(f, [Imputer(missing_values=0,strategy="mean"), StandardScaler()]) for f in floats]
how could I prepare the dataframe (shape it for Imputer) just for those columns?
As of now (sklearn-pandas version 1.1.0) there are much easier ways to do this, without the need to create an additional wrapper class.
The first is the specification of the column selector, which defines the shape of the array that is passed to the transformer:
simple string (like 'A') - a one dimensional array will be passed
list with one element (like ['A']) - a two dimensional array with one column will be passed
So, in your case, it should be enough to change the mapping definition (note the brackets around the column names):
mapping=[(['A'], [Imputer(), StandardScaler()]), (['C'], [Imputer(), StandardScaler()])]
mapper = DataFrameMapper(mapping)
Another option, in case you want to apply the same transformation to all selected columns, is the gen_features function. You could do the following:
from sklearn_pandas import DataFrameMapper, gen_features
feature_def = gen_features(columns=[['A'], ['C']], classes=[Imputer, StandardScaler])
mapper = DataFrameMapper(feature_def)
This also answers your second question. Just select your columns, use the right column selector type and combine it with gen_features.
float_cols = list(df.select_dtypes(include=['float64']).columns)
# Use brackets for every column for 2D input shape
float_cols_2d = [[f] for f in float_cols]
The last "trick", if you prefer DataFrame output instead of numpy array, you can us df_out=True option for DataFrameMapper. The final example could look like following (note that I replaced Imputer with current SimpleImputer):
import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper, gen_features
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
float_cols = list(df.select_dtypes(include=['float64']).columns)
float_cols_2d = [[f] for f in float_cols]
feature_def = gen_features(columns=float_cols_2d, classes=[SimpleImputer, StandardScaler])
mapper = DataFrameMapper(feature_def, df_out=True)
result = mapper.fit_transform(df)
The standard Imputer doesn't work with the DataFrameMapper because the orientation of the input/output in the DataFrameMapper is the transpose of what is expected. Creating a wrapper class around Imputer should solve the problem:
from sklearn.preprocessing import Imputer

class SeriesImputer(Imputer):
    def fit(self, X, y=None):
        return super(SeriesImputer, self).fit(X.reshape(-1, 1), y=y)

    def transform(self, X):
        return super(SeriesImputer, self).transform(X.reshape(-1, 1))
Then simply replace the occurrences of Imputer with SeriesImputer in the DataFrameMapper.
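Hypothetical usage, mirroring the mapping from the question with SeriesImputer swapped in:
mapping = [('A', [SeriesImputer(), StandardScaler()]),
           ('C', [SeriesImputer(), StandardScaler()])]
mapper = DataFrameMapper(mapping)
result = mapper.fit_transform(df)  # 6 rows x 2 columns, as expected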
