depreciation warning when I use sklearn imputer - python

My code is quite simple, but it always pops out the warning like this:
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will
raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if
your data has a single feature or X.reshape(1, -1) if it contains a single sample.
(DeprecationWarning)
I don't know why it does not work even if I add s.reshape(-1,1) in the parentheses of fit_transforms.
The code is following:
import pandas as pd
s = pd.Series([1,2,3,np.nan,5,np.nan,7,8])
imp = Imputer(missing_values='NAN', strategy='mean', axis=0)
x = pd.Series(imp.fit_transform(s).tolist()[0])
x

Related

Why am i getting index error on this one hot encoding?

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('netflixprice.csv')
x = dataset.iloc[:,0].values
y = dataset.iloc[:, 1:6].values
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
IndexError Traceback (most recent call last)
Input In [8], in <cell line: 4>()
2 from sklearn.preprocessing import OneHotEncoder
3 ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
----> 4 x = np.array(ct.fit_transform(x))
data structure
New to this. Also anywhere i can learn more about data processing ?
It's hard to tell anything without knowing the structure of your data. However, it seems like you may want to reshape your x:
x = dataset.iloc[:, 0].values.reshape(-1, 1)
I could find a dataset that might be similar to yours and tried it, it worked.
As for learning how to process the data: I personally try to refer to the documentation of a method I want to apply. In your case it's here. However, a clue to where the problem was I could find in the error message:
def _get_column_indices(X, key):
"""Get feature column indices for input data X and key.
For accepted values of `key`, see the docstring of
:func:`_safe_indexing_column`.
"""
--> n_columns = X.shape[1] # this is where the problem is
key_dtype = _determine_key_type(key)
if isinstance(key, (list, tuple)) and not key:
# we get an empty list
IndexError: tuple index out of range
That made me suspect that you got an ndarray shaped (n,) when sliced x, which doesn't have columns that were required.
It also seems like you intended x to be the target rather than the only feature. With 6 other columns assigned to y you may want to swap x and y. You may still encode your target like you planned.

ValueError using dask QuantileTransformer : unknown shape (1, nan)

I'm trying to use the QuantileTransformer from das-ml
For that, I have the following DF:
When I try:
from dask_ml.preprocessing import StandardScaler,QuantileTransformer,MinMaxScaler
scaler = QuantileTransformer()
scaler.fit_transform(df[['LotFrontage','LotArea']])
I get this error:
ValueError: Tried to concatenate arrays with unknown shape (1, nan).
To force concatenation pass allow_unknown_chunksizes=True.
And I don't find where to set the parameter: allow_unknown_chunksizes=True
since in the transformer raises and error.
The first error disappears if I compute the df beforehand:
scaler = QuantileTransformer()
scaler.fit_transform(df[['LotFrontage','LotArea']].compute())
But I don't why is this necessary or even if it is the right thing to do.
Also, in contrast to the StandardScaler this returns an array instead of a dataframe.
This was a limitation of the previous Dask-ML implementation. It's fixed in https://github.com/dask/dask-ml/pull/533.

Imputer on some columns in a Dataframe

I am trying to use Imputer on a singe column called age to replace missing values.But I get the error as " Expected 2D array, got 1D array instead:"
Following is my code
import pandas as pd
import numpy as np
from sklearn.preprocessing import Imputer
dataset = pd.read_csv("titanic_train.csv")
dataset.drop('Cabin',axis = 1,inplace = True)
x = dataset.drop('Survived',axis = 1)
y = dataset['Survived']
imputer = Imputer(missing_values ="nan",strategy = "mean",axis = 1)
imputer=imputer.fit(x['Age'])
x['Age']=imputer.transform(x['Age'])
The Imputer is expecting a 2-dimensional array as input, even if one of those dimensions is of length 1. This can be achieved using np.reshape:
imputer = Imputer(missing_values='NaN', strategy='mean')
imputer.fit(x['Age'].values.reshape(-1, 1))
x['Age'] = imputer.transform(x['Age'].values.reshape(-1, 1))
That said, if you are not doing anything more complicated than filling in missing values with the mean, you might find it easier to skip the Imputer altogether and just use Pandas fillna instead:
x['Age'].fillna(x['Age'].mean(), inplace=True)
Although #thesilkworkm beat me in the curb, it may be useful to know why exactly your own code doesn't work.
So, apart from the reshape issue, there are two more mistakes in your code; the first is that you erroneously ask for axis=1 in your imputer, while you should ask for axis=0 (which is the default value, and that's why it works when omitted completely, as in #thesilkworkm'a answer); from the docs:
axis : integer, optional (default=0)
The axis along which to impute.
If axis=0, then impute along columns.
If axis=1, then impute along rows.
The second mistake is your missing_values argument, which should be 'NaN', and not 'nan'; from the docs again:
missing_values : integer or “NaN”, optional (default=”NaN”)
The placeholder for the missing values. All occurrences of missing_values will be imputed. For missing values encoded as np.nan,
use the string value “NaN”.
So, just for offering an alternative but equivalent solution (beyond the one already provided by #thesilkworm), you can also fit & transform in one line:
imp = Imputer(missing_values ="NaN",strategy = "mean",axis = 0)
x['Age'] = imp.fit_transform(x['Age'].reshape(-1,1))
When you are fit tranforming it use reshape(-1,1). Because method is expecting a 2D array as input but you are giving 1D array.
Ex: x['Age']=imputer.transform(x['Age'].reshape(-1,1))

Reshaping data for Linear regression [duplicate]

On a fresh installation of Anaconda under Ubuntu... I am preprocessing my data in various ways prior to a classification task using Scikit-Learn.
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler().fit(train)
train = scaler.transform(train)
test = scaler.transform(test)
This all works fine but if I have a new sample (temp below) that I want to classify (and thus I want to preprocess in the same way then I get
temp = [1,2,3,4,5,5,6,....................,7]
temp = scaler.transform(temp)
Then I get a deprecation warning...
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17
and will raise ValueError in 0.19. Reshape your data either using
X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1)
if it contains a single sample.
So the question is how should I be rescaling a single sample like this?
I suppose an alternative (not very good one) would be...
temp = [temp, temp]
temp = scaler.transform(temp)
temp = temp[0]
But I'm sure there are better ways.
Just listen to what the warning is telling you:
Reshape your data either X.reshape(-1, 1) if your data has a single feature/column
and X.reshape(1, -1) if it contains a single sample.
For your example type(if you have more than one feature/column):
temp = temp.reshape(1,-1)
For one feature/column:
temp = temp.reshape(-1,1)
Well, it actually looks like the warning is telling you what to do.
As part of sklearn.pipeline stages' uniform interfaces, as a rule of thumb:
when you see X, it should be an np.array with two dimensions
when you see y, it should be an np.array with a single dimension.
Here, therefore, you should consider the following:
temp = [1,2,3,4,5,5,6,....................,7]
# This makes it into a 2d array
temp = np.array(temp).reshape((len(temp), 1))
temp = scaler.transform(temp)
This might help
temp = ([[1,2,3,4,5,6,.....,7]])
.values.reshape(-1,1) will be accepted without alerts/warnings
.reshape(-1,1) will be accepted, but with deprecation war
I faced the same issue and got the same deprecation warning. I was using a numpy array of [23, 276] when I got the message. I tried reshaping it as per the warning and end up in nowhere. Then I select each row from the numpy array (as I was iterating over it anyway) and assigned it to a list variable. It worked then without any warning.
array = []
array.append(temp[0])
Then you can use the python list object (here 'array') as an input to sk-learn functions. Not the most efficient solution, but worked for me.
You can always, reshape like:
temp = [1,2,3,4,5,5,6,7]
temp = temp.reshape(len(temp), 1)
Because, the major issue is when your, temp.shape is:
(8,)
and you need
(8,1)
-1 is the unknown dimension of the array. Read more about "newshape" parameters on numpy.reshape documentation -
# X is a 1-d ndarray
# If we want a COLUMN vector (many/one/unknown samples, 1 feature)
X = X.reshape(-1, 1)
# you want a ROW vector (one sample, many features/one/unknown)
X = X.reshape(1, -1)
from sklearn.linear_model import LinearRegression
X = df[['x_1']]
X_n = X.values.reshape(-1, 1)
y = df['target']
y_n = y.values
model = LinearRegression()
model.fit(X_n, y)
y_pred = pd.Series(model.predict(X_n), index=X.index)

Error with NaN using sklearn in python

My Code:
### Working with NaN using sklearn
import numpy as np
from sklearn.preprocessing import Imputer
### Mean strategy
imp = Imputer(missing_values='NaN', strategy='mean', axis=1)
imp.fit([1,5,9,np.NaN])
X = [1,5,9,np.NaN]
y = imp.transform(X)
print (y)
After running I am getting below warning message:
C:\Users\Admin\Anaconda3\lib\site-packages\sklearn\utils\validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample. DeprecationWarning)
How to solve it? I tried the reshape but it is giving error message saying:
'list' object has no attribute 'reshape'
Please help.
so i ran your code and changed X do a 2d list... Turns out that because you were passing a 1D array to transform so it was throwing you the error... So i made it a 2D lisst
import numpy as np
from sklearn.preprocessing import Imputer
### Mean strategy
imp = Imputer(missing_values='NaN', strategy='mean', axis=1)
imp.fit([1,5,9,np.NaN])
X = [[1,5,9,np.NaN]] # < =========== The change that I made
y = imp.transform(X)
print(y)
enter code here

Categories

Resources