ValueError using dask QuantileTransformer: unknown shape (1, nan) - python

I'm trying to use the QuantileTransformer from dask-ml.
For that, I have the following DF:
When I try:
from dask_ml.preprocessing import StandardScaler,QuantileTransformer,MinMaxScaler
scaler = QuantileTransformer()
scaler.fit_transform(df[['LotFrontage','LotArea']])
I get this error:
ValueError: Tried to concatenate arrays with unknown shape (1, nan).
To force concatenation pass allow_unknown_chunksizes=True.
And I can't find where to set the parameter allow_unknown_chunksizes=True,
since the error is raised inside the transformer.
The first error disappears if I compute the df beforehand:
scaler = QuantileTransformer()
scaler.fit_transform(df[['LotFrontage','LotArea']].compute())
But I don't know why this is necessary, or even whether it is the right thing to do.
Also, in contrast to StandardScaler, this returns an array instead of a dataframe.

This was a limitation of the previous Dask-ML implementation. It's fixed in https://github.com/dask/dask-ml/pull/533.
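If upgrading past that PR is not an option, a lighter workaround than a full .compute() is to make the chunk sizes known up front. Here is a minimal sketch (the pdf frame below is a hypothetical stand-in for the question's DF, and it assumes dask-ml's QuantileTransformer accepts dask arrays):
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask_ml.preprocessing import QuantileTransformer

# hypothetical stand-in for the question's dataframe
pdf = pd.DataFrame({'LotFrontage': np.random.rand(100) * 100,
                    'LotArea': np.random.rand(100) * 10000})
df = dd.from_pandas(pdf, npartitions=4)

# lengths=True computes each partition's length once, so the resulting
# dask array has known chunk sizes and the concatenation error is avoided
X = df[['LotFrontage', 'LotArea']].to_dask_array(lengths=True)

scaler = QuantileTransformer()
result = scaler.fit_transform(X)  # a dask array with known chunks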

Related

sktime ARIMA invalid frequency

I'm trying to fit an ARIMA model from the sktime package. I import a dataset and convert it to a pandas Series. Then I fit the model on the training sample, and when I try to predict, the error occurs.
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.forecasting.arima import ARIMA
import numpy as np, pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv',
parse_dates=['date']).set_index('date').T.iloc[0]
p, d, q = 3, 1, 2
y_train, y_test = temporal_train_test_split(df, test_size=24)
model = ARIMA((p, d, q))
results = model.fit(y_train)
fh = ForecastingHorizon(y_test.index, is_relative=False,)
# the error is here !!
y_pred_vals, y_pred_int = results.predict(fh, return_pred_int=True)
The error message is the following:
ValueError: Invalid frequency. Please select a frequency that can be converted to a regular
`pd.PeriodIndex`. For other frequencies, basic arithmetic operation to compute durations
currently do not work reliably.
I tried to use .asfreq("M") while reading the dataset; however, all the values in the series then become NaN.
What is interesting is that this code works with the default load_airline dataset from sktime.datasets but not with my dataset from github.
I get a different error (ValueError: ``unit`` missing), possibly due to a version difference. Anyhow, I'd say it is better to have your dataframe's index as a pd.PeriodIndex instead of a pd.DatetimeIndex. The former is more explicit (e.g. a monthly series has its time steps as periods, not exact dates) and works more smoothly. So after reading the csv,
df.index = pd.PeriodIndex(df.index, freq="M")
should clear the error (it does in my version, 0.5.1).
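For context, a sketch of the full corrected pipeline (following the answerer's sktime 0.5.1; note that return_pred_int from the question's API was removed in later sktime releases):
import pandas as pd
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.forecasting.arima import ARIMA

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv',
                 parse_dates=['date']).set_index('date').T.iloc[0]
# convert the DatetimeIndex to a monthly PeriodIndex before splitting
df.index = pd.PeriodIndex(df.index, freq="M")

y_train, y_test = temporal_train_test_split(df, test_size=24)
model = ARIMA((3, 1, 2))
results = model.fit(y_train)
fh = ForecastingHorizon(y_test.index, is_relative=False)
y_pred_vals, y_pred_int = results.predict(fh, return_pred_int=True)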

Trying to Predict Missing Values

I am trying to impute missing values, not with zeros or means, but with ML predicted results. I'm testing my idea on the standard 'Titanic' dataset, which has around 80% of the age records filled in, but around 20% are missing. How can I fill in missing values with the predicted results from a simple linear regression model? Here is the code that I am testing.
import pandas as pd
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
data = pd.read_csv('C:\\Users\\ryans\\seaborn-data\\titanic.csv')
print(data)
list(data)
data.dtypes
# keep the candidate feature columns plus the target 'age'
data_with_null = data[['survived','pclass','sibsp','parch','fare','age']]
# train only on rows where 'age' is known
data_without_null = data_with_null.dropna()
train_data_x = data_without_null.iloc[:,:5]
train_data_y = data_without_null.iloc[:,5]
linreg.fit(train_data_x,train_data_y)
# predict 'age' for every row, including those where it is missing
test_data = data_with_null.iloc[:,:5]
age = pd.DataFrame(linreg.predict(test_data))
# check for nulls
data_with_null.apply(lambda x: sum(x.isnull()),axis=0)
Everything works up to this point, but when I try to 'fillna', I get errors.
data_with_null.age.fillna(age,inplace=True)
The line of code directly above shows this error:
TypeError: "value" parameter must be a scalar, dict or Series, but you passed a "DataFrame"
data_with_null['age'].fillna(data_with_null[data_with_null['age'].isnull()].apply(axis=1),inplace=True)
Similarly, the line of code above shows this error:
TypeError: apply() missing 1 required positional argument: 'func'
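The first error message points at the fix: fillna accepts a scalar, dict, or Series, not a DataFrame. One way to sidestep fillna entirely (a sketch, not from the original thread) is to keep the predictions as a Series aligned to the frame's index and assign them only where 'age' is missing:
# assumes linreg, test_data and data_with_null from the question's code;
# creating data_with_null with .copy() avoids a SettingWithCopyWarning here
age_pred = pd.Series(linreg.predict(test_data), index=data_with_null.index)
mask = data_with_null['age'].isnull()
data_with_null.loc[mask, 'age'] = age_pred[mask]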

Handling missing (nan) values on sklearn.preprocessing

I'm trying to normalize data with missing (i.e. nan) values before processing it, using scikit-learn preprocessing.
Apparently, some scalers (e.g. StandardScaler) handle the missing values the way I want - by which I mean normalize the existing values while keeping the nans - while others (e.g. Normalizer) just raise an error.
I've looked around and haven't found an answer: how can I use the Normalizer with missing values, or replicate its behavior (with norm='l1' and norm='l2'; I need to test several normalization options) some other way?
from sklearn.preprocessing import Normalizer, StandardScaler
import numpy as np
data = np.array([0,1,2,np.nan, 3,4])
scaler = StandardScaler(with_mean=True, with_std=True)
scaler.fit_transform(data.reshape(-1,1))
normalizer = Normalizer(norm='l2')
normalizer.fit_transform(data.reshape(-1,1))
The problem with your request is that Normalizer operates in this fashion, according to the documentation:
Normalize samples individually to unit norm.
Each sample (i.e. each row of the data matrix) with at least one non-zero component is rescaled independently of other samples so that its norm (l1 or l2) equals one. (from the scikit-learn docs)
That means that each row has to have unit norm. How do you deal with a missing value? Ideally, you don't want it to count in the norm, and you want the row normalized regardless of it, but the internal check_array function prevents this by raising an error.
You need to work around this. The most reasonable way to do it is to:
first create a mask in order to record which elements were missing in your array
create a response array filled with missing values
apply the Normalizer to your array after selecting only the valid entries
record on your response array the normalized values based on their original position
Here is some code detailing the process, based on your example:
from sklearn.preprocessing import Normalizer, StandardScaler
import numpy as np
data = np.array([0,1,2,np.nan, 3,4])
# set valid mask
nan_mask = np.isnan(data)
valid_mask = ~nan_mask
normalizer = Normalizer(norm='l2')
# create a result array
result = np.full(data.shape, np.nan)
# write the normalized values back only at the valid positions
result[valid_mask] = normalizer.fit_transform(data[valid_mask].reshape(-1,1)).reshape(data[valid_mask].shape)
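Note that with reshape(-1, 1) each valid value is its own one-element row, so every non-zero entry simply becomes 1. If the intent is instead to treat the whole 1-D array as a single sample, so the valid entries jointly get unit l2 norm, a small variation (same assumptions as above) is to reshape to a single row:
result_row = np.full(data.shape, np.nan)
result_row[valid_mask] = normalizer.fit_transform(
    data[valid_mask].reshape(1, -1)).ravel()
print(result_row)  # approximately [0. 0.183 0.365 nan 0.548 0.730]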

Imputer on some columns in a Dataframe

I am trying to use Imputer on a single column called Age to replace missing values, but I get the error "Expected 2D array, got 1D array instead:"
Following is my code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import Imputer
dataset = pd.read_csv("titanic_train.csv")
dataset.drop('Cabin',axis = 1,inplace = True)
x = dataset.drop('Survived',axis = 1)
y = dataset['Survived']
imputer = Imputer(missing_values ="nan",strategy = "mean",axis = 1)
imputer=imputer.fit(x['Age'])
x['Age']=imputer.transform(x['Age'])
The Imputer is expecting a 2-dimensional array as input, even if one of those dimensions is of length 1. This can be achieved using np.reshape:
imputer = Imputer(missing_values='NaN', strategy='mean')
imputer.fit(x['Age'].values.reshape(-1, 1))
x['Age'] = imputer.transform(x['Age'].values.reshape(-1, 1))
That said, if you are not doing anything more complicated than filling in missing values with the mean, you might find it easier to skip the Imputer altogether and just use Pandas fillna instead:
x['Age'].fillna(x['Age'].mean(), inplace=True)
Although @thesilkworkm beat me to it, it may be useful to know why exactly your own code doesn't work.
So, apart from the reshape issue, there are two more mistakes in your code. The first is that you erroneously ask for axis=1 in your imputer, while you should ask for axis=0 (which is the default value, and that's why it works when omitted completely, as in @thesilkworkm's answer); from the docs:
axis : integer, optional (default=0)
The axis along which to impute.
If axis=0, then impute along columns.
If axis=1, then impute along rows.
The second mistake is your missing_values argument, which should be 'NaN', and not 'nan'; from the docs again:
missing_values : integer or “NaN”, optional (default=”NaN”)
The placeholder for the missing values. All occurrences of missing_values will be imputed. For missing values encoded as np.nan,
use the string value “NaN”.
So, just to offer an alternative but equivalent solution (beyond the one already provided by @thesilkworm), you can also fit & transform in one line:
imp = Imputer(missing_values="NaN", strategy="mean", axis=0)
x['Age'] = imp.fit_transform(x['Age'].values.reshape(-1, 1))
When you are fit-transforming, use reshape(-1, 1), because the method expects a 2D array as input but you are giving it a 1D array.
Ex: x['Age'] = imputer.transform(x['Age'].values.reshape(-1, 1))
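As a side note for anyone on a newer scikit-learn: Imputer was deprecated in 0.20 and removed in 0.22 in favour of sklearn.impute.SimpleImputer, which always imputes along columns and takes np.nan directly. A rough equivalent of the above:
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
# double brackets keep the input 2D; ravel() flattens the (n, 1) output
x['Age'] = imp.fit_transform(x[['Age']]).ravel()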

Error in acorr_ljungbox from statsmodel

So I am trying to do a Box-Ljung test on residuals, but I am getting a strange error and am not able to figure out why.
from statsmodels.stats import diagnostic as diag
import numpy as np
x = diag.acorr_ljungbox(np.random.random(20))
I tried doing the same with a random array also, still the same error:
ValueError: operands could not be broadcast together with shapes (19,) (40,)
This looks like a bug in the default lag setting, which is set to 40 independent of the length of the data.
As a workaround, and to get a proper statistic, the lags need to be restricted, e.g. using 5 lags below:
>>> from statsmodels.stats import diagnostic as diag
>>> diag.acorr_ljungbox(np.random.random(50))[0].shape
(40,)
>>> diag.acorr_ljungbox(np.random.random(20), lags=5)
(array([ 0.36718151, 1.02009595, 1.23734092, 3.75338034, 4.35387236]),
array([ 0.54454461, 0.60046677, 0.74406305, 0.44040973, 0.49966951]))
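If you'd rather derive the cap from the data, a common rule of thumb (not from the original answer) is to keep the number of lags well below the sample size, e.g. min(10, n // 5):
import numpy as np
from statsmodels.stats import diagnostic as diag

x = np.random.random(20)
n_lags = min(10, len(x) // 5)  # rule-of-thumb cap on the number of lags
# older statsmodels returns a (statistics, pvalues) tuple as above;
# newer releases return a DataFrame instead
res = diag.acorr_ljungbox(x, lags=n_lags)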
