Mixed data imputation with MissForest in Python

I am trying to use MissForest imputation from the missingpy package but am getting different errors. My dataset contains 14 categorical variables and about 70 numerical variables. I know it is possible to impute categorical variables this way, but is it possible to impute both numeric and categorical variables in one go? I keep getting errors when I try to implement it.
I've done the following:
cat = df[df.select_dtypes('object').columns]
cat_ind = [df.columns.get_loc(c) for c in cat]
cat_ind = [x - 1 for x in cat_ind] # get indices for categorical variables
from sklearn import preprocessing
cat = df[df.select_dtypes('object').columns.values]
le = preprocessing.LabelEncoder()
for column in cat:
    df[column] = le.fit_transform(cat[column])
import sys
import sklearn.neighbors._base
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
from missingpy import MissForest
imputer = MissForest(n_estimators=10, max_iter=5)  # missforest
X_imputed = imputer.fit_transform(df, cat_vars=cat_ind)
I then get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-162-612c7ea4a463> in <module>
9 imputer = MissForest(n_estimators=10,max_iter=5) #miss forest
---> 10 X_imputed = imputer.fit_transform(X, cat_vars=cat_ind)
5 frames
/usr/local/lib/python3.8/dist-packages/sklearn/utils/multiclass.py in check_classification_targets(y)
195 "multilabel-sequences",
196 ]:
--> 197 raise ValueError("Unknown label type: %r" % y_type)
198
199
ValueError: Unknown label type: 'continuous'
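One likely cause (an editor's note, since no answer is recorded above): df.columns.get_loc already returns zero-based column positions, so the x - 1 shift points every index at the preceding, numeric, column. MissForest then tries to fit a classifier on continuous values, which raises exactly this error. A minimal sketch of the corrected call, reusing the LabelEncoder step from the question:
import sys
import sklearn.neighbors._base
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
from missingpy import MissForest

# get_loc positions are already zero-based: keep them as-is
cat_ind = [df.columns.get_loc(c) for c in cat]

imputer = MissForest(n_estimators=10, max_iter=5)
X_imputed = imputer.fit_transform(df, cat_vars=cat_ind)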

Related

RandomOverSampler doesn't seem to accept a log-transformed y target variable

I am trying to do random oversampling on a small dataset for linear regression. However, it seems the scikit-learn sampling API doesn't work with float values as its target variable. Is there any way to solve this?
This is a sample of my y_train values, which are log transformed.
3.688879
3.828641
3.401197
3.091042
4.624973
from imblearn.over_sampling import RandomOverSampler
X_over, y_over = RandomOverSampler(random_state=42).fit_sample(X_train,y_train)
--------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-53-036424abd2bd> in <module>
1 from imblearn.over_sampling import RandomOverSampler
~\Anaconda3\lib\site-packages\imblearn\base.py in fit_resample(self, X, y)
73 The corresponding label of `X_resampled`.
74 """
---> 75 check_classification_targets(y)
76 arrays_transformer = ArraysTransformer(X, y)
77 X, y, binarize_y = self._check_X_y(X, y)
~\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py in check_classification_targets(y)
170 if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
171 'multilabel-indicator', 'multilabel-sequences']:
--> 172 raise ValueError("Unknown label type: %r" % y_type)
173
174
ValueError: Unknown label type: 'continuous'
Re-sampling strategies are not meant for regression problems, so RandomOverSampler will not accept float-type targets. There are approaches to re-sample data with continuous targets, though. One example is the reg_resampler package, which can be used like the following:
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_regression
from reg_resampler import resampler
import numpy as np
# Create some dummy data for demonstration
X, y = make_regression(n_features=10)
df = np.append(X, y.reshape(100, 1), axis=1)
# Initialize the resampler object and generate pseudo-classes
rs = resampler()
y_classes = rs.fit(df, target=10)
# Now resample
X_res, y_res = rs.resample(
    sampler_obj=RandomOverSampler(random_state=27),
    trainX=df,
    trainY=y_classes
)
The resampler object will generate pseudo-classes based on your target values and then use a classic re-sampling object from the imblearn package to re-sample your data. Note that the data you pass to the resampler object should contain all data, including the targets.
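If you would rather avoid the extra dependency, the same pseudo-class idea takes only a few lines with NumPy and imblearn directly (a sketch assuming the X_train/y_train arrays from the question; the quartile bin edges are an arbitrary choice):
import numpy as np
from imblearn.over_sampling import RandomOverSampler

# Bin the continuous target into pseudo-classes (quartiles, chosen arbitrarily)
bins = np.quantile(y_train, [0.25, 0.5, 0.75])
y_classes = np.digitize(y_train, bins)

# Keep y next to X so both are resampled together, then split them back apart
Xy = np.column_stack([X_train, y_train])
Xy_res, _ = RandomOverSampler(random_state=42).fit_resample(Xy, y_classes)
X_res, y_res = Xy_res[:, :-1], Xy_res[:, -1]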

While using AutoML from EvalML getting error AttributeError: 'DataTable' object has no attribute 'to_series'

I am running EvalML's AutoML on a dataset and made a class column as below:
df.loc[(df.quality<6), 'flag_class'] = 1
df.loc[(df.quality==6), 'flag_class'] = 2
df.loc[(df.quality>6), 'flag_class'] = 3
then splitting it as below:
X = df[['several columns inside']].copy()
y = df[['flag_class']].copy()
but when running the code below I get an error:
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, problem_type = 'multiclass')
error:
AttributeError Traceback (most recent call last)
<ipython-input-37-dffcb1214932> in <module>
----> 1 X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, problem_type = 'multiclass')
~\AppData\Roaming\Python\Python38\site-packages\evalml\preprocessing\utils.py in split_data(X, y, problem_type, problem_configuration, test_size, random_seed)
75 data_splitter = StratifiedShuffleSplit(n_splits=1, test_size=test_size, random_state=random_seed)
76
---> 77 train, test = next(data_splitter.split(X.to_dataframe(), y.to_series()))
78
79 X_train = X.iloc[train]
AttributeError: 'DataTable' object has no attribute 'to_series'
Any support will be highly appreciated. Thanks in advance.
DataTable is a class from the Woodwork framework, which is built on top of the pandas DataFrame.
dumie = y
type(dumie)
y_train = dumie.squeeze()
type(y_train)
Explanation: the dependent variable (y) has to be in Series format, and DataTable can't convert it for you, so first convert y into a Series using squeeze().
X should be in DataFrame format and y should be in Series format.
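Put together, a minimal sketch of the corrected split, reusing the names from the question:
# Select y with single brackets so it is a Series, not a one-column DataFrame
y = df['flag_class']

X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(
    X, y, problem_type='multiclass')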

ValueError: Input contains NaN in Python

My Code
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
mnb=MultinomialNB()
svm=SGDClassifier(max_iter=1000, tol=0.2)
mnb_bow_predictions = train_predict_evaluate_model(classifier=mnb,
                                                   train_features=bow_train_features,
                                                   train_labels=train_labels,
                                                   test_features=bow_test_features,
                                                   test_labels=test_labels)
and it raises the error:
~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan)
58 elif X.dtype == np.dtype('object') and not allow_nan:
59 if _object_dtype_isnan(X).any():
---> 60 raise ValueError("Input contains NaN")
61
62
ValueError: Input contains NaN
What makes my program raise this error? Is the error in the dataset or in the function?
All feature and label values must be finite. If bow_train_features, train_labels, bow_test_features, and test_labels are DataFrames or NumPy arrays, you can filter for only the fully-finite observations in the train/test sets using the code below:
import numpy as np
# Create finite-observation masks for the train/test sets
# (collapse the 2-D feature arrays to one boolean per row)
train_finite_filter = np.isfinite(bow_train_features).all(axis=1) & np.isfinite(train_labels)
test_finite_filter = np.isfinite(bow_test_features).all(axis=1) & np.isfinite(test_labels)
# Filter for finite training observations
bow_train_features_finite = bow_train_features[train_finite_filter]
train_labels_finite = train_labels[train_finite_filter]
# Filter for finite test observations
bow_test_features_finite = bow_test_features[test_finite_filter]
test_labels_finite = test_labels[test_finite_filter]
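If dropping rows loses too much data, a common alternative (an editor's addition, not part of the original answer) is to impute the missing feature values instead, for example with scikit-learn's SimpleImputer:
from sklearn.impute import SimpleImputer

# Replace NaNs in the features with the column mean; fit on train, reuse on test
imputer = SimpleImputer(strategy='mean')
bow_train_features_imputed = imputer.fit_transform(bow_train_features)
bow_test_features_imputed = imputer.transform(bow_test_features)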

Python regressors library summary function returns ValueError for Logistic regression

I'm using the built-in Boston dataset from sklearn with CHAS as my target variable.
I built a logistic regression model with the sklearn package. I'm using the regressors library to get the summary statistics of the model output, but I'm facing the following error. Please help me with this, and kindly let me know if you need further information.
You can find more about the regressors library at the link below: [1]:
https://regressors.readthedocs.io/en/latest/usage.html
Please find below the Python code which I used for model building:
import numpy as np
from sklearn import datasets
import pandas as pd

bostonn = datasets.load_boston()
boston = pd.DataFrame(bostonn.data, columns=bostonn['feature_names'])
print(boston.head())

X = boston.drop('CHAS', axis=1)
y = boston.CHAS.astype('category')

from sklearn.linear_model import LogisticRegression
from regressors import stats

log_mod = LogisticRegression(random_state=123)
model = log_mod.fit(X, y)
stats.summary(model, X, y, xlabels=None)
I'm getting the following error:
ValueError                                Traceback (most recent call last)
in ()
      1 #xlabels = boston.feature_names[which_betas]
----> 2 stats.summary(model, X, y, xlabels=None)
    251                 )
    252         coef_df['Estimate'] = np.concatenate(
--> 253             (np.round(np.array([clf.intercept_]), 6), np.round((clf.coef_), 6)))
    254         coef_df['Std. Error'] = np.round(coef_se(clf, X, y), 6)
    255         coef_df['t value'] = np.round(coef_tval(clf, X, y), 4)
ValueError: all the input array dimensions except for the concatenation axis must match exactly
There are other posts with a similar error, but those solutions didn't help in my case. The link above has the information about how the summary function actually works. Kindly let me know if you need further information.
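A likely explanation (an editor's note; this question has no answer in the thread above): stats.summary in the regressors library is written for linear models whose coef_ attribute is one-dimensional. LogisticRegression stores coef_ as a 2-D array of shape (1, n_features) and intercept_ as an array of shape (1,), so the np.concatenate call in the traceback receives arrays whose dimensions don't line up. If a coefficient summary table for a logistic model is the goal, one hedged alternative sketch uses statsmodels instead (X and y are the variables defined in the question):
import statsmodels.api as sm

# CHAS is a 0/1 indicator, so the category codes reproduce the raw labels
X_const = sm.add_constant(X)                  # add an explicit intercept column
logit_model = sm.Logit(y.cat.codes, X_const).fit()
print(logit_model.summary())                  # estimates, std errors, z-values, p-values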

Inverse Transform Predicted Results

I have a training data CSV with three columns (two for data and a third for targets) and I successfully predicted the target column for my test CSV. The problem is I need to inverse transform the results back to strings for further analysis. Below is the code and error.
from sklearn import datasets
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import defaultdict
df_train = pd.read_csv('/Users/justinchristensen/Documents/Python_Education/SKLearn/Path_Training_Data.csv')
df_test = pd.read_csv('/Users/justinchristensen/Documents/Python_Education/SKLearn/Path_Test_Data.csv')
#Separate columns in training data set
x_train = df_train.iloc[:,:-1]
y_train = df_train.iloc[:,-1:]
#Separate columns in test data set
x_test = df_test.iloc[:,:-1]
#Initiate classifier
clf = svm.SVC(gamma=0.001, C=100)
le = LabelEncoder()
#Transform strings into integers
x_train_encoded = x_train.apply(LabelEncoder().fit_transform)
y_train_encoded = y_train.apply(LabelEncoder().fit_transform)
x_test_encoded = x_test.apply(LabelEncoder().fit_transform)
#Fit the model into the classifier
clf.fit(x_train_encoded,y_train_encoded)
#Predict test values
y_pred = clf.predict(x_test_encoded)
The error
NotFittedError                            Traceback (most recent call last)
<ipython-input-38-09840b0071d5> in <module>()
1
----> 2 y_pred_inverse = le.inverse_transform(y_pred)
~/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/label.py in inverse_transform(self, y)
146 y : numpy array of shape [n_samples]
147 """
--> 148 check_is_fitted(self, 'classes_')
149
150 diff = np.setdiff1d(y, np.arange(len(self.classes_)))
~/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
766
767 if not all_or_any([hasattr(estimator, attr) for attr in attributes]):
--> 768 raise NotFittedError(msg % {'name': type(estimator).__name__})
769
770
NotFittedError: This LabelEncoder instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
You need to use the same LabelEncoder object which you used for transforming your targets to get them back. Each time you call LabelEncoder() you instantiate a new object. Use the same object.
Change the following line so that it fits with the le object you already instantiated (ravel() flattens the one-column DataFrame into the 1-D array LabelEncoder expects):
y_train_encoded = le.fit_transform(y_train.values.ravel())
Then use the same object to reverse the transformation:
y_pred_inverse = le.inverse_transform(y_pred)
You can check the first example in the LabelEncoder documentation for reference as well.
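For reference, a minimal round-trip mirroring the example in the scikit-learn documentation (the values are purely illustrative):
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])     # classes_ = ['amsterdam', 'paris', 'tokyo']
encoded = le.transform(["tokyo", "tokyo", "paris"])  # array([2, 2, 1])
le.inverse_transform(encoded)                        # ['tokyo', 'tokyo', 'paris']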
