I want to use DataFrameMapper Imputer+Scaler mapping on all float64 columns of a dataframe. My code works with the StandardScaler but when I add the Imputer the mapper returns just one row with all zeros.
I saw this question: Imputer on some Dataframe columns in Python, and the tutorial at https://github.com/paulgb/sklearn-pandas. And there is a warning:
site-packages\sklearn\utils\validation.py:386: DeprecationWarning:
Passing 1d arrays as data is deprecated in 0.17 and will raise
ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if
your data has a single feature or X.reshape(1, -1) if it contains a
single sample.
So I understand that there is a shape mismatch. How should the below example dataframe be reshaped?
import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler, Imputer
# just a random dataframe from http://pandas.pydata.org/pandas-docs/stable/10min.html
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
print "Starting with a random dataframe of 6 rows and 4 columns of floats:"
print df.shape
print df
mapping=[('A', [Imputer(), StandardScaler()]), ('C', [Imputer(), StandardScaler()])]
mapper = DataFrameMapper(mapping)
result = mapper.fit_transform(df)
print "I get an unexpected result of all zeroes in just one row."
print result.shape
print result
print "Expected is a dataframe of 2 columns and 6 rows of scaled floats."
print "something like this:"
mapping=[('A', [StandardScaler()]), ('C', [StandardScaler()])]
mapper = DataFrameMapper(mapping)
result_scaler = mapper.fit_transform(df)
print result_scaler.shape
print result_scaler
This is the output:
Starting with a random dataframe of 6 rows and 4 columns of floats.
(6, 4)
A B C D
2013-01-01 -0.070551 0.039074 0.513491 -0.830585
2013-01-02 -0.313069 -1.028936 2.359338 -0.830518
2013-01-03 -1.264926 -0.830575 0.461515 0.427228
2013-01-04 -0.374400 0.619986 0.318128 0.361712
2013-01-05 -0.235587 -1.647786 -0.819940 -1.036435
2013-01-06 1.436073 0.312183 1.566990 -0.272224
Unexpected result is all zeroes in just one row.
(1L, 12L)
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
Expected is a dataframe of 2 columns and 6 rows of scaled floats.
something like this
(6L, 2L)
[[ 0.08306789 -0.21892275]
[-0.21975387 1.61986719]
[-1.40829622 -0.27069922]
[-0.29633508 -0.4135387 ]
[-0.12300572 -1.54725542]
[ 1.964323 0.83054889]]
And a follow-up question: my original dataframe is a combination of floats, booleans and objects (labels). So when I have a list of
floats = list(df.select_dtypes(include=['float64']).columns)
mapping=[(f, [Imputer(missing_values=0,strategy="mean"), StandardScaler()]) for f in floats]
how could I prepare the dataframe (shape it for Imputer) just for those columns?
As of now (version 1.1.0) there are much easier ways to do this, without the need to create an additional wrapper class.
The first is the specification of the column selector, which defines the shape of the array that is passed to the transformer:
simple string (like 'A') - a one dimensional array will be passed
list with one element (like ['A']) - a two dimensional array with one column will be passed
So, in your case, it should be enough to change the mapping definition (note the brackets around the column names):
mapping=[(['A'], [Imputer(), StandardScaler()]), (['C'], [Imputer(), StandardScaler()])]
mapper = DataFrameMapper(mapping)
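With the 2-D selectors the mapper should now return the expected shape. A quick sanity check (a sketch, reusing df from the question):
result = mapper.fit_transform(df)
print(result.shape)  # expected: (6, 2) -- six rows, one column each for 'A' and 'C'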
Another option, in case you want to use the same transformation for all selected columns, is to use the gen_features function. You could do the following:
from sklearn_pandas import DataFrameMapper, gen_features
feature_def = gen_features(columns=[['A'], ['C']], classes=[Imputer, StandardScaler])
mapper = DataFrameMapper(feature_def)
This also answers your second question: just select your columns, use the right column selector type, and combine it with gen_features.
float_cols = list(df.select_dtypes(include=['float64']).columns)
# Use brackets for every column for 2D input shape
float_cols_2d = [[f] for f in float_cols]
The last "trick": if you prefer a DataFrame output instead of a numpy array, you can use the df_out=True option for DataFrameMapper. The final example could look like the following (note that I replaced Imputer with the current SimpleImputer):
import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper, gen_features
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
float_cols = list(df.select_dtypes(include=['float64']).columns)
float_cols_2d = [[f] for f in float_cols]
feature_def = gen_features(columns=float_cols_2d, classes=[SimpleImputer, StandardScaler])
mapper = DataFrameMapper(feature_def, df_out=True)
result = mapper.fit_transform(df)
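Because of df_out=True, result should now come back as a pandas DataFrame rather than a numpy array; a quick check (output column names assumed to match the input):
print(type(result))   # <class 'pandas.core.frame.DataFrame'>
print(result.shape)   # (6, 4) -- one scaled column per float column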
The standard Imputer doesn't work with the DataFrameMapper because the mapper passes each selected column as a 1-D array, which is the transpose of the single-column 2-D input the Imputer expects. Creating a wrapper class around Imputer that reshapes the input should solve the problem:
from sklearn.preprocessing import Imputer

class SeriesImputer(Imputer):
    def fit(self, X, y=None):
        return super(SeriesImputer, self).fit(X.reshape(-1, 1), y=y)

    def transform(self, X):
        return super(SeriesImputer, self).transform(X.reshape(-1, 1))
Then simply replace the occurrences of Imputer with SeriesImputer in the DataFrameMapper.
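For example, the mapping from the question would then become (a sketch, reusing the question's df):
mapping = [('A', [SeriesImputer(), StandardScaler()]),
           ('C', [SeriesImputer(), StandardScaler()])]
mapper = DataFrameMapper(mapping)
result = mapper.fit_transform(df)  # now (6, 2) instead of a single row of zeros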
I have a dataframe:
df = pd.DataFrame({'Company': ['abc', 'xyz', 'def'],
                   'Q1-2019': [9.05, 8.64, 6.3],
                   'Q2-2019': [8.94, 8.56, 7.09],
                   'Q3-2019': [8.86, 8.45, 7.09],
                   'Q4-2019': [8.34, 8.61, 7.25]})
The data is an average response of the same question asked across 4 quarters.
I am trying to create a benchmark index from this data. To do so I wanted to preprocess it first using either standardize or normalize.
How would I standardize/normalize across the entire dataframe? What is the best way to go about this?
I can do this for a single row or column, but I'm struggling to do it across the whole dataframe.
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
#define scaler
scaler = MinMaxScaler() #or StandardScaler
X = df.loc[1].T
X = X.to_numpy()
#transform data
scaled = scaler.fit_transform(X)
If I understood your need correctly, you can use ColumnTransformer to apply the same transformation (e.g. scaling) separately to different columns.
As you can read from the linked documentation, you need to provide inside a tuple:
a name for the step
the chosen transformer (e.g. StandardScaler) or a Pipeline as well
a list of columns to which apply the selected transformations
Code example
# ColumnTransformer lives in sklearn.compose
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# specify columns
columns = ['Q1-2019', 'Q2-2019', 'Q3-2019', 'Q4-2019']

# create a ColumnTransformer instance
ct = ColumnTransformer([
    ('scaler', StandardScaler(), columns)
])
# fit and transform the input dataframe
ct.fit_transform(df)
array([[ 0.86955718, 0.93177476, 0.96056682, 0.46493449],
[ 0.53109031, 0.45544147, 0.41859563, 0.92419906],
[-1.40064749, -1.38721623, -1.37916245, -1.38913355]])
ColumnTransformer outputs a numpy array with the transformed values, fitted on the input dataframe df. Even though the column names are gone, the array columns are still ordered the same way as in the input dataframe, so it is easy to convert the array back to a pandas dataframe if you need to.
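For instance, a minimal sketch of that conversion, reusing the columns list from above:
import pandas as pd
scaled_df = pd.DataFrame(ct.fit_transform(df), columns=columns, index=df.index)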
In addition to @RicS's answer, note that what the scikit-learn transformer returns is a numpy array, not a dataframe anymore. Also, the Company column is not included. You may consider this to convert the results back into a dataframe:
scaler = StandardScaler()
x = scaler.fit_transform(df.drop("Company",axis=1)) # scale all columns except Company
y = pd.concat([df["Company"],pd.DataFrame(x, columns=df.columns[1:])],axis=1) # adds results and company into dataframe again
y.head()
I need help understanding this line of code:
y_train2 = train_target2.astype('category').cat.codes
Am I right in saying that y_train2 is being changed to a categorical variable using astype('category') and then cat.codes is used to change it into integers?
Below is the full block of code.
# Train data pre-processing
train_target2 = df_train_01['class_2']
train_target5 = df_train_01['class_5']
df_train_02.drop(['class_2', 'class_5'], axis=1, inplace=True)
# convert text labels to integers
y_train2 = train_target2.astype('category').cat.codes
y_train5 = train_target5.astype('category').cat.codes
# Test data pre-processing
test_target2 = df_test_01['class_2']
test_target5 = df_test_01['class_5']
# drop 'class_2' and 'class_5' columns
df_test_02.drop(['class_2', 'class_5'], axis=1, inplace=True)
y_test2 = test_target2.astype('category').cat.codes
y_test5 = test_target5.astype('category').cat.codes
I think your understanding of the dataframe method and attribute is correct: Series.astype('category') turns the values into categorical data, and the Series.cat.codes attribute converts the values into a set of integers that start with 0.
Try the simple snippet below to see how they work.
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

iris = load_iris()
pdf = pd.DataFrame(iris.data, columns=['s-length', 's-width', 'p-length', 'p-width'])

print(pdf['s-length'].astype('category'))
print(len(np.unique(pdf['s-length'])))                          # -> 35
print(len(set(pdf['s-length'].astype('category').cat.codes)))   # -> 35
print(np.unique(pdf['s-length'].astype('category').cat.codes))  # -> array([ 0, 1, ..., 34], dtype=int8)
In essence, a pandas categorical data type is a mapping between values that do not have a numeric interpretation and a unique number for each value.
Let's break down your code:
# Take the series `train_target2` and convert it to categorical type
train_target2.astype('category')
# Access the attributes or methods of a categorical series
train_target2.astype('category').cat
# Take the `codes` attribute
train_target2.astype('category').cat.codes
In reality, .codes is not converting the data into numbers. Rather, you are only taking the numeric equivalent of each category. Strictly speaking, .astype('category') is the part that converted your data to categorical.
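For example, a tiny illustration (with made-up labels) of how the codes index into the categories:
import pandas as pd

s = pd.Series(['low', 'high', 'low', 'medium']).astype('category')
print(s.cat.categories)      # Index(['high', 'low', 'medium'], dtype='object')
print(s.cat.codes.tolist())  # [1, 0, 1, 2] -- each code is the position of the value in categories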
You can find the attributes and methods of this data type here.
I've tried almost everything; I know there is a way, or something that I'm missing. I'm a real noob in ML, but I would really appreciate any help or explanations.
df["Date"] and df["Open"] are arrays like: [1,2, ..., 10]
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
df = pd.read_csv('AAPL.csv')
clf = LinearRegression()
i = 0
for date in df["Date"]:
    s = date
    s = s.replace("-", "")
    df["Date"][i] = s
    i += 1
clf.fit(df["Date"],df["Open"])
print("Prediction:", clf.predict(df["Date"][-1]))
Here is the error that Python throws me:
ValueError: Expected 2D array, got 1D array instead:
array=[19801212. 19801215. 19801216. ... 20191127. 20191129. 20191202.].
Reshape your data either using array.reshape(-1, 1) if your data has a single
feature or array.reshape(1, -1) if it contains a single sample. line 16
After some tries, errors and googling I figured out how to reshape df["Date"] by doing this:
clf.fit(np.array(df["Date"]).reshape(-1,1),df["Open"])
But now throws me this:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I really appreciate any help, thanks in advance.
For reshaping:
clf.fit(df["Date"].values.reshape(-1,1),df["Open"].values.reshape(-1,1))
But I'm not sure you have a correct datetime type column for df["Date"], since pandas could read it as a string. You could do:
df["Date"] = pd.to_numeric(pd.to_datetime(df["Date"]))
for type conversion (to an integer in the end). Lastly, if you have NaN rows, you can eliminate them with:
df = df.dropna(how='any',axis=0, subset=['Date','Open'])
Hope this works.
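Putting the pieces together, a minimal sketch of the whole fix (assuming AAPL.csv contains Date and Open columns):
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv('AAPL.csv')
df["Date"] = pd.to_numeric(pd.to_datetime(df["Date"]))      # string dates -> integers
df = df.dropna(how='any', axis=0, subset=['Date', 'Open'])  # drop NaN rows

clf = LinearRegression()
clf.fit(df["Date"].values.reshape(-1, 1), df["Open"].values)
print("Prediction:", clf.predict(df["Date"].values[-1:].reshape(-1, 1)))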
I am trying to use Imputer on a single column called Age to replace missing values. But I get the error "Expected 2D array, got 1D array instead:"
Following is my code
import pandas as pd
import numpy as np
from sklearn.preprocessing import Imputer
dataset = pd.read_csv("titanic_train.csv")
dataset.drop('Cabin',axis = 1,inplace = True)
x = dataset.drop('Survived',axis = 1)
y = dataset['Survived']
imputer = Imputer(missing_values ="nan",strategy = "mean",axis = 1)
imputer=imputer.fit(x['Age'])
x['Age']=imputer.transform(x['Age'])
The Imputer is expecting a 2-dimensional array as input, even if one of those dimensions is of length 1. This can be achieved using np.reshape:
imputer = Imputer(missing_values='NaN', strategy='mean')
imputer.fit(x['Age'].values.reshape(-1, 1))
x['Age'] = imputer.transform(x['Age'].values.reshape(-1, 1))
That said, if you are not doing anything more complicated than filling in missing values with the mean, you might find it easier to skip the Imputer altogether and just use Pandas fillna instead:
x['Age'].fillna(x['Age'].mean(), inplace=True)
Although @thesilkworkm beat me to it, it may be useful to know why exactly your own code doesn't work.
So, apart from the reshape issue, there are two more mistakes in your code; the first is that you erroneously ask for axis=1 in your imputer, while you should ask for axis=0 (which is the default value, and that's why it works when omitted completely, as in @thesilkworkm's answer); from the docs:
axis : integer, optional (default=0)
The axis along which to impute.
If axis=0, then impute along columns.
If axis=1, then impute along rows.
The second mistake is your missing_values argument, which should be 'NaN', and not 'nan'; from the docs again:
missing_values : integer or “NaN”, optional (default=”NaN”)
The placeholder for the missing values. All occurrences of missing_values will be imputed. For missing values encoded as np.nan,
use the string value “NaN”.
So, just for offering an alternative but equivalent solution (beyond the one already provided by @thesilkworkm), you can also fit & transform in one line:
imp = Imputer(missing_values ="NaN",strategy = "mean",axis = 0)
x['Age'] = imp.fit_transform(x['Age'].values.reshape(-1, 1))
When you are fit-transforming, use reshape(-1, 1), because the method expects a 2D array as input but you are giving it a 1D array.
Ex: x['Age'] = imputer.transform(x['Age'].values.reshape(-1, 1))
Why would numpy.corrcoef() return NaN values?
I am working with high dimensional data and it is infeasible to go through every datum to test values.
# Import
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
# Delete all zero columns
df = df.loc[:, (df != 0).any(axis=0)]
# Delete any NaN columns
df = df.dropna(axis='columns', how='any', inplace=False)
# Standardise
X_std = StandardScaler().fit_transform(df.values)
print(X_std.dtype) # Returns "float64"
# Correlation
cor_mat1 = np.corrcoef(X_std.T)
cor_mat1.max() # Returns nan
And then
cor_mat1.max()
Returns
nan
When computing cor_mat1 = np.corrcoef(X_std.T) I get this warning:
/Users/kimrants/anaconda3/lib/python3.6/site-packages/numpy/lib/function_base.py:3183:
RuntimeWarning:
invalid value encountered in true_divide
A snippet of the input dataframe was shown as an image in the original post.
To try and fix it myself, I started removing all zero-columns and columns that contained any NaN values. I thought this would solve the problem, but it didn't. Am I missing something? I don't see why else it would return NaN values?
My end goal is to compute eigen-values and -vectors.
If you have a column where all rows have the same value, that column's variance is 0. np.corrcoef() thus divides that column's correlation coefficients by 0, which doesn't throw an error but only the warning "invalid value encountered in true_divide" under standard numpy settings. Those columns' correlation coefficients get replaced by nan instead:
import numpy as np
print(np.divide(0,0))
C:\Users\anaconda3\lib\site-packages\ipykernel_launcher.py:1: RuntimeWarning: invalid value encountered in true_divide
"""Entry point for launching an IPython kernel.
nan
Removing all columns with Series.nunique() == 1 should solve your problem.
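For example, a minimal sketch of that filter before computing the correlations:
# keep only columns with more than one distinct value (i.e. non-zero variance)
df = df.loc[:, df.nunique() > 1]
X_std = StandardScaler().fit_transform(df.values)
cor_mat1 = np.corrcoef(X_std.T)  # no division by zero -> no NaNs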
This solved the problem for reasons I cannot explain:
# Import
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
# Delete any NaN columns
df = df.dropna(axis='columns', how='any', inplace=False)
# Keep track of index / columns to reproduce dataframe
cols = df.columns
index = df.index
# Standardise
X_std = StandardScaler().fit_transform(df.values)
X_std = StandardScaler().fit_transform(X_std)
print(X_std.dtype) # Return "float64"
# Turn to dataFrame again to drop values easier
df = pd.DataFrame(data=X_std, columns= cols, index=index)
# Delete all zero columns
df = df.loc[:, (df != 0).any(axis=0)]
# Delete any NaN columns
df = df.dropna(axis='columns', how='any', inplace=False)
Standardising twice in a row works, but it is weird.