sklearn: Need to reshape array but don't know where - python

I've tried almost everything. I know there is something I'm missing. I'm new to ML, so I would really appreciate any help or explanation.
df["Date"] and df["Open"] are arrays like: [1,2, ..., 10]
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
df = pd.read_csv('AAPL.csv')
clf = LinearRegression()
i = 0
for date in df["Date"]:
    s = date
    s = s.replace("-","")
    df["Date"][i] = s
    i += 1
clf.fit(df["Date"],df["Open"])
print("Prediction:", clf.predict(df["Date"][-1]))
Here is the error that Python throws me:
ValueError: Expected 2D array, got 1D array instead:
array=[19801212. 19801215. 19801216. ... 20191127. 20191129. 20191202.].
Reshape your data either using array.reshape(-1, 1) if your data has a single
feature or array.reshape(1, -1) if it contains a single sample. line 16
After some trial and error and googling, I figured out how to reshape df["Date"] by doing this:
clf.fit(np.array(df["Date"]).reshape(-1,1),df["Open"])
But now it throws this:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I really appreciate any help, thanks in advance.

For reshaping:
clf.fit(df["Date"].values.reshape(-1,1),df["Open"].values.reshape(-1,1))
However, I'm not sure df["Date"] has a proper datetime type, since pandas may have read it as a string. You could do:
df["Date"] = pd.to_numeric(pd.to_datetime(df["Date"]))
to convert it to datetime and then to integers. Lastly, if you have NaN rows, you can drop them with:
df = df.dropna(how='any',axis=0, subset=['Date','Open'])
Hope this works.
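Putting those steps together, here is a minimal sketch. It assumes a small made-up frame in place of AAPL.csv, and it encodes dates as days since the first date, which is my own choice for readability rather than the pd.to_numeric call above. Note that predict needs a 2D array too, so a single sample is reshaped with reshape(1, -1).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for AAPL.csv: string dates and one missing price
df = pd.DataFrame({
    "Date": ["2019-11-27", "2019-11-28", "2019-11-29", "2019-12-02", "2019-12-03"],
    "Open": [10.0, 11.0, np.nan, 15.0, 16.0],
})

# Drop rows with missing values before fitting
df = df.dropna(how="any", axis=0, subset=["Date", "Open"])

# Parse the strings and turn them into a numeric feature (days since first date)
dates = pd.to_datetime(df["Date"])
days = (dates - dates.min()).dt.days

# One feature, many samples: shape (n_samples, 1)
X = days.to_numpy().reshape(-1, 1)
y = df["Open"].to_numpy()

clf = LinearRegression().fit(X, y)

# One sample, one feature: shape (1, 1)
last_sample = X[-1].reshape(1, -1)
pred = clf.predict(last_sample)
print("Prediction:", pred[0])
```

The target y can stay 1D; only the feature matrix has to be 2D.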

Related

sktime ARIMA invalid frequency

I am trying to fit an ARIMA model from the sktime package. I import a dataset and convert it to a pandas Series. Then I fit the model on the training sample, and when I try to predict, an error occurs.
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.forecasting.arima import ARIMA
import numpy as np, pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv',
                 parse_dates=['date']).set_index('date').T.iloc[0]
p, d, q = 3, 1, 2
y_train, y_test = temporal_train_test_split(df, test_size=24)
model = ARIMA((p, d, q))
results = model.fit(y_train)
fh = ForecastingHorizon(y_test.index, is_relative=False)
# the error is here !!
y_pred_vals, y_pred_int = results.predict(fh, return_pred_int=True)
The error message is the following:
ValueError: Invalid frequency. Please select a frequency that can be converted to a regular
`pd.PeriodIndex`. For other frequencies, basic arithmetic operation to compute durations
currently do not work reliably.
I tried to use .asfreq("M") while reading the dataset, however, all the values in the series become NaN.
What is interesting is that this code works with the default load_airline dataset from sktime.datasets but not with my dataset from github.
I get a different error (ValueError: ``unit`` missing), possibly due to a version difference. Anyhow, I'd say it is better to have the index as a pd.PeriodIndex instead of a pd.DatetimeIndex. The former is, I think, more explicit (e.g. a monthly series has periods as its time steps, not exact dates) and works more smoothly. So after reading the CSV,
df.index = pd.PeriodIndex(df.index, freq="M")
should clear the error (it does in my version, 0.5.1).
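A minimal pandas-only sketch of that conversion follows. The three-point series is made up and stands in for the a10 data, and sktime itself is not called, so this only shows what the index looks like before and after.

```python
import pandas as pd

# Hypothetical monthly series stamped on the first day of each month
idx = pd.to_datetime(["2008-01-01", "2008-02-01", "2008-03-01"])
y = pd.Series([1.0, 2.0, 3.0], index=idx)
print(type(y.index).__name__, y.index.freq)  # DatetimeIndex None

# Replace the timestamps with monthly periods; the values are untouched
y.index = pd.PeriodIndex(y.index, freq="M")
print(y.index)
```

As for .asfreq("M") producing NaN: on a DatetimeIndex that call reindexes to month-end dates, and this dataset appears to be stamped on the first of each month, so probably no rows matched.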

Imputer on some columns in a Dataframe

I am trying to use Imputer on a single column called Age to replace missing values, but I get the error "Expected 2D array, got 1D array instead:".
Here is my code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import Imputer
dataset = pd.read_csv("titanic_train.csv")
dataset.drop('Cabin',axis = 1,inplace = True)
x = dataset.drop('Survived',axis = 1)
y = dataset['Survived']
imputer = Imputer(missing_values ="nan",strategy = "mean",axis = 1)
imputer=imputer.fit(x['Age'])
x['Age']=imputer.transform(x['Age'])
The Imputer is expecting a 2-dimensional array as input, even if one of those dimensions is of length 1. This can be achieved using np.reshape:
imputer = Imputer(missing_values='NaN', strategy='mean')
imputer.fit(x['Age'].values.reshape(-1, 1))
x['Age'] = imputer.transform(x['Age'].values.reshape(-1, 1))
That said, if you are not doing anything more complicated than filling in missing values with the mean, you might find it easier to skip the Imputer altogether and just use pandas' fillna instead:
x['Age'] = x['Age'].fillna(x['Age'].mean())
Although @thesilkworm beat me to it, it may be useful to know exactly why your own code doesn't work.
So, apart from the reshape issue, there are two more mistakes in your code. The first is that you erroneously ask for axis=1 in your imputer, while you should ask for axis=0 (which is the default value, and that's why it works when omitted completely, as in @thesilkworm's answer); from the docs:
axis : integer, optional (default=0)
The axis along which to impute.
If axis=0, then impute along columns.
If axis=1, then impute along rows.
The second mistake is your missing_values argument, which should be 'NaN', and not 'nan'; from the docs again:
missing_values : integer or “NaN”, optional (default=”NaN”)
The placeholder for the missing values. All occurrences of missing_values will be imputed. For missing values encoded as np.nan,
use the string value “NaN”.
So, to offer an alternative but equivalent solution (beyond the one already provided by @thesilkworm), you can also fit and transform in one line:
imp = Imputer(missing_values="NaN", strategy="mean", axis=0)
x['Age'] = imp.fit_transform(x['Age'].values.reshape(-1, 1))
When you fit or transform, use reshape(-1, 1), because the method expects a 2D array as input but you are passing a 1D array.
Ex: x['Age'] = imputer.transform(x['Age'].values.reshape(-1, 1))
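For readers on a recent scikit-learn, where Imputer is no longer available, the same idea can be sketched with SimpleImputer from sklearn.impute. The four-row frame below is made up; selecting the column with double brackets gives the 2D input directly, so no explicit reshape is needed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the Titanic Age column
x = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 30.0]})

imp = SimpleImputer(missing_values=np.nan, strategy="mean")

# x[["Age"]] is a one-column DataFrame (2D); ravel() flattens the result
# back to 1D so it can be assigned to a single column
x["Age"] = imp.fit_transform(x[["Age"]]).ravel()
print(x)
```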

could not convert categorical data to number OneHotEncoder

I have some simple categorical data that I want to convert into a one-hot encoding in Python:
a,1,p
b,3,r
a,5,t
I tried to convert them with python OneHotEncoder:
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np
data = pd.read_csv("C:\\test.txt", sep=",", header=None)
one_hot_encoder = OneHotEncoder(categorical_features=[0,2])
one_hot_encoder.fit(data.values)
This piece of code does not work and throws an error:
ValueError: could not convert string to float: 't'
Can you please help me?
Try this:
from sklearn import preprocessing
# Label-encode every column, then cast the integer codes to float
for c in df.columns:
    df[c] = df[c].apply(str)
    le = preprocessing.LabelEncoder().fit(df[c])
    df[c] = le.transform(df[c])
    df[c] = pd.to_numeric(df[c]).astype(float)
@user3104352,
I encountered the same behavior and found it frustrating.
Scikit-Learn requires all data to be numerical before it even considers selecting the columns provided in the categorical_features parameter.
Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py and the very first line of that method is
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES).
This check fails if any of the data in the provided dataframe X cannot be successfully converted to a float.
I agree that the documentation of sklearn.preprocessing.OneHotEncoder is very misleading in that regard.
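In more recent scikit-learn releases (0.20 and later, as far as I know), OneHotEncoder accepts string columns directly, and ColumnTransformer takes over the role of the categorical_features parameter. A minimal sketch, assuming the three sample rows above and made-up column names c0, c1 and c2:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# The three sample rows from the question; column names are hypothetical
data = pd.DataFrame(
    [["a", 1, "p"], ["b", 3, "r"], ["a", 5, "t"]],
    columns=["c0", "c1", "c2"],
)

# One-hot encode the two string columns and pass the numeric one through.
# sparse_threshold=0 asks for a dense array as the result.
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), ["c0", "c2"])],
    remainder="passthrough",
    sparse_threshold=0,
)
encoded = ct.fit_transform(data)
print(encoded.shape)  # (3, 6)
print(encoded)
```

The encoded columns come first (two for c0, three for c2), followed by the untouched numeric column.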

ufunc 'add' did not contain a loop with signature matching types dtype('<U23') dtype('<U23') dtype('<U23')

When trying to convert the sklearn dataset into a pandas DataFrame with the following code, I get the error "ufunc 'add' did not contain a loop with signature matching types dtype('<U23') dtype('<U23') dtype('<U23')":
import numpy as np
from sklearn.datasets import load_breast_cancer
import pandas as pd
cancer = load_breast_cancer()
data = pd.DataFrame(data= np.c_[cancer['data'], cancer['target']],columns= cancer['feature_names'] + cancer['target'])
Here is how I converted the sklearn dataset to a pandas dataframe. The target column name needs to be appended.
bostonData = pd.DataFrame(data= np.c_[boston['data'], boston['target']],
                          columns= np.append(boston['feature_names'],['target']))
You have a NumPy array of strings. Please post the full error so we can figure out what's missing.
For example, I am assuming you got dtype('U9'); if so, try adding
dtype=float to your array. Something like this (not certain):
data = pd.DataFrame(data= np.c_[cancer['data'], cancer['target']],columns= cancer['feature_names'] + cancer['target'], dtype=float)
Sometimes it's just easier to keep it simple. Create a DF for both data and target, then merge using pandas.
data_df = pd.DataFrame(data=cancer['data'] ,columns=cancer['feature_names'])
target_df = pd.DataFrame(data=cancer['target'], columns=['target']).reset_index(drop=True)
target_df.rename_axis(None)
df = pd.concat([data_df, target_df], axis=1)
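The error most likely comes from the `+` in the columns argument: between two NumPy arrays it means element-wise addition, not concatenation, and NumPy cannot add an array of strings to an array of integers. A short sketch of the np.append approach applied to the breast cancer data from the question (the column name 'target' is just a label chosen here):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Append one extra column name instead of "adding" the two arrays
columns = np.append(cancer["feature_names"], "target")

# np.c_ stacks the target vector as a last column next to the features
data = pd.DataFrame(np.c_[cancer["data"], cancer["target"]], columns=columns)
print(data.shape)  # (569, 31)
```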

How to use Imputer in a DataFrameMapper on a dataframe?

I want to use DataFrameMapper Imputer+Scaler mapping on all float64 columns of a dataframe. My code works with the StandardScaler but when I add the Imputer the mapper returns just one row with all zeros.
I saw the question "Imputer on some Dataframe columns in Python" and the tutorial at https://github.com/paulgb/sklearn-pandas. I also get this warning:
site-packages\sklearn\utils\validation.py:386: DeprecationWarning:
Passing 1d arrays as data is deprecated in 0.17 and will raise
ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if
your data has a single feature or X.reshape(1, -1) if it contains a
single sample.
So I understand that there is a shape mismatch. How should the below example dataframe be reshaped?
import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler, Imputer
# just a random dataframe from http://pandas.pydata.org/pandas-docs/stable/10min.html
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
print "Starting with a random dataframe of 6 rows and 4 columns of floats:"
print df.shape
print df
mapping=[('A', [Imputer(), StandardScaler()]), ('C', [Imputer(), StandardScaler()])]
mapper = DataFrameMapper(mapping)
result = mapper.fit_transform(df)
print "I get an unexpected result of all zeroes in just one row."
print result.shape
print result
print "Expected is a dataframe of 2 columns and 6 rows of scaled floats."
print "something like this:"
mapping=[('A', [StandardScaler()]), ('C', [StandardScaler()])]
mapper = DataFrameMapper(mapping)
result_scaler = mapper.fit_transform(df)
print result_scaler.shape
print result_scaler
This is the output:
Starting with a random dataframe of 6 rows and 4 columns of floats.
(6, 4)
A B C D
2013-01-01 -0.070551 0.039074 0.513491 -0.830585
2013-01-02 -0.313069 -1.028936 2.359338 -0.830518
2013-01-03 -1.264926 -0.830575 0.461515 0.427228
2013-01-04 -0.374400 0.619986 0.318128 0.361712
2013-01-05 -0.235587 -1.647786 -0.819940 -1.036435
2013-01-06 1.436073 0.312183 1.566990 -0.272224
Unexpected result is all zeroes in just one row.
(1L, 12L)
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
Expected is a dataframe of 2 columns and 6 rows of scaled floats.
something like this
(6L, 2L)
[[ 0.08306789 -0.21892275]
[-0.21975387 1.61986719]
[-1.40829622 -0.27069922]
[-0.29633508 -0.4135387 ]
[-0.12300572 -1.54725542]
[ 1.964323 0.83054889]]
And a followup question - my original dataframe is a combination of floats, booleans and objects (labels). So when I have a list of
floats = list(df.select_dtypes(include=['float64']).columns)
mapping=[(f, [Imputer(missing_values=0,strategy="mean"), StandardScaler()]) for f in floats]
how could I prepare the dataframe (shape it for Imputer) just for those columns?
As of now (version 1.1.0) there are much easier ways to do this without creating an additional wrapper class.
The first is a specification of the column selector which defines the shape of the array that is passed to the transformer:
simple string (like 'A') - a one dimensional array will be passed
list with one element (like ['A']) - a two dimensional array with one column will be passed
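The difference between the two selector forms mirrors how plain pandas indexing behaves. The sketch below does not call sklearn_pandas at all; it only shows the 1D and 2D shapes that correspond to 'A' and ['A'].

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6, 4), columns=list("ABCD"))

# String selector: a Series, i.e. a one-dimensional array
one_d = df["A"].to_numpy()
print(one_d.shape)  # (6,)

# List selector: a one-column DataFrame, i.e. a two-dimensional array
two_d = df[["A"]].to_numpy()
print(two_d.shape)  # (6, 1)
```

Transformers such as Imputer and StandardScaler expect the second shape.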
So, in your case, it should be enough to change the mapping definition (note brackets around column names):
mapping=[(['A'], [Imputer(), StandardScaler()]), (['C'], [Imputer(), StandardScaler()])]
mapper = DataFrameMapper(mapping)
Another option, in case you want to apply the same transformation to all selected columns, is the gen_features function. You could do the following:
from sklearn_pandas import DataFrameMapper, gen_features
feature_def = gen_features(columns=[['A'], ['C']], classes=[Imputer, StandardScaler])
mapper = DataFrameMapper(feature_def)
This also answers your second question. Just select your columns, use the right column selector type and combine it with gen_features.
float_cols = list(df.select_dtypes(include=['float64']).columns)
# Use brackets for every column for 2D input shape
float_cols_2d = [[f] for f in float_cols]
A last trick: if you prefer DataFrame output instead of a NumPy array, you can use the df_out=True option of DataFrameMapper. The final example could look like the following (note that I replaced Imputer with the current SimpleImputer):
import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper, gen_features
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
float_cols = list(df.select_dtypes(include=['float64']).columns)
float_cols_2d = [[f] for f in float_cols]
feature_def = gen_features(columns=float_cols_2d, classes=[SimpleImputer, StandardScaler])
mapper = DataFrameMapper(feature_def, df_out=True)
result = mapper.fit_transform(df)
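If sklearn_pandas is not installed, a similar result can be sketched with scikit-learn's own ColumnTransformer and a Pipeline. This is a different tool from DataFrameMapper, not a drop-in replacement, and the NaN inserted into column A is made up so that the imputer has something to fill.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))
df.loc[dates[2], "A"] = np.nan  # hypothetical missing value

float_cols = ["A", "C"]

# Impute with the column mean, then scale; applied only to float_cols
numeric = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
ct = ColumnTransformer([("num", numeric, float_cols)])

# Wrap the array result back into a DataFrame with the original index
result = pd.DataFrame(ct.fit_transform(df), index=df.index, columns=float_cols)
print(result.shape)  # (6, 2)
print(result)
```

Columns not listed in float_cols are dropped by default, which matches selecting only the float columns for the mapper.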
The standard Imputer doesn't work with the DataFrameMapper because the orientation of the input/output in the DataFrameMapper is the transpose of what is expected. Creating a wrapper class around Imputer should solve the problem:
from sklearn.preprocessing import Imputer
class SeriesImputer(Imputer):
    def fit(self, X, y=None):
        return super(SeriesImputer, self).fit(X.reshape(-1, 1), y=y)
    def transform(self, X):
        return super(SeriesImputer, self).transform(X.reshape(-1, 1))
Then simply replace the occurrences of Imputer with SeriesImputer in the DataFrameMapper.
