Imputation on the test set with fancyimpute - python

The Python package fancyimpute provides several methods for imputing missing values. The documentation provides examples such as:
from fancyimpute import IterativeImputer

# X is the complete data matrix.
# X_incomplete has the same values as X, except a subset have been replaced with NaN.
# Model each feature with missing values as a function of the other features,
# and use that estimate for imputation.
X_filled_ii = IterativeImputer().fit_transform(X_incomplete)
This works fine when applying the imputation method to a single dataset X. But what if a training/test split is necessary? Once
X_train_filled = IterativeImputer().fit_transform(X_train_incomplete)
is called, how do I impute the test set and create X_test_filled? The test set needs to be imputed using only information from the training set. I would guess that IterativeImputer() should return an object that can transform X_test_incomplete. Is that possible?
Please note that imputing on the whole dataset and then splitting into training and test sets is not correct.

The package looks like it mimics scikit-learn's API, and after looking at the source code, it does have a transform method.
my_imputer = IterativeImputer()
X_train_filled = my_imputer.fit_transform(X_train_incomplete)
# Now transform the test set with the fitted imputer.
X_test_filled = my_imputer.transform(X_test_incomplete)
The imputer will apply the same imputations that it learned from the training set.
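For a self-contained illustration, here is a minimal sketch with synthetic data (the array shapes and missingness rate below are made up):
import numpy as np
from fancyimpute import IterativeImputer

rng = np.random.RandomState(0)
X_train_incomplete = rng.rand(100, 4)
X_train_incomplete[rng.rand(100, 4) < 0.1] = np.nan  # knock out ~10% of entries
X_test_incomplete = rng.rand(20, 4)
X_test_incomplete[rng.rand(20, 4) < 0.1] = np.nan

imputer = IterativeImputer()
X_train_filled = imputer.fit_transform(X_train_incomplete)  # fit on training data only
X_test_filled = imputer.transform(X_test_incomplete)        # reuse the fitted imputer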

UserWarning: X does not have valid feature names, but Linear Regression was fitted with feature names [duplicate]

I'm getting the following warning after upgrading to version 1.0 of scikit-learn:
UserWarning: X does not have valid feature names, but IsolationForest was
fitted with feature names
I cannot find in the docs what a "valid feature name" is. How do I deal with this warning?
I got the same warning message with another sklearn model. I realized it was showing up because I fitted the model with data in a DataFrame and then predicted using only the values. Once I fixed that, the warning disappeared.
Here is an example:
model_reg.fit(scaled_x_train, y_train[vp].values)
data_pred = model_reg.predict(scaled_x_test.values)
This first snippet produced the warning, because scaled_x_train is a DataFrame with feature names, while scaled_x_test.values contains only the values, without feature names. Then, I changed it to this:
model_reg.fit(scaled_x_train.values, y_train[vp].values)
data_pred = model_reg.predict(scaled_x_test.values)
And now there are no more warnings in my code.
I was getting a very similar warning, but with DecisionTreeClassifier's fit and predict.
Initially I was passing a DataFrame with headers as the input to fit, and I got the warning.
When I trimmed off the headers and passed only the values, the warning disappeared.
Sample code before and after changes.
Code with Warning:
model = DecisionTreeClassifier()
model.fit(x, y)  # here x is the DataFrame, with headers
predictions = model.predict([[20, 1], [20, 0]])
print(predictions)
Code without Warning:
model = DecisionTreeClassifier()
model.fit(x.values, y)  # here x.values has only the values, without headers
predictions = model.predict([[20, 1], [20, 0]])
print(predictions)
I also had the same problem. It occurred because I fitted the model with the X train data as a DataFrame (model.fit(X, Y)) but made predictions with the X test data as an array (model.predict([[20, 0]])). To solve it, I converted the X train DataFrame into an array, as illustrated below.
BEFORE
model = DecisionTreeClassifier()
model.fit(X, Y)  # X train here is a DataFrame
predictions = model.predict([[20, 0]])  # generates the warning
AFTER
model = DecisionTreeClassifier()
X = X.values  # convert X into an array
model.fit(X, Y)
model.predict([[20, 0]])  # now OK, no warning
The other answers so far recommend (re)training using a numpy array instead of a DataFrame for the training data. The warning is a sort of safety feature, to ensure you're passing the data you meant to, so I would suggest passing a DataFrame (with correct column labels!) to the predict function instead.
Also, note that it's just a warning, not an error. You can ignore it and proceed with the rest of your code without problems; just be sure the columns are in the same order as they were during training!
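As a minimal sketch of that suggestion (the column names, data, and model below are made up):
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Train on a DataFrame, so the model remembers the feature names.
X_train = pd.DataFrame({"age": [20, 30, 40, 50], "flag": [1, 0, 1, 0]})
y_train = [0, 1, 1, 0]
model = DecisionTreeClassifier().fit(X_train, y_train)

# Wrap the test rows in a DataFrame with the same columns, in the same
# order, and no warning is raised.
X_test = pd.DataFrame([[20, 1], [20, 0]], columns=X_train.columns)
predictions = model.predict(X_test)
print(predictions)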
I got the same warning while using DataFrames, but after passing only the values it went away.
Use:
predictions = reg.predict(x[['data']].values)
The warning appears because our DataFrame has feature names, but we should fit and predict with the data as a 2D array (or matrix) of values.

Converting a pandas Interval into a string (and back again)

I'm relatively new to Python and am trying to get some data prepped to train a RandomForest. For various reasons, we want the data to be discrete, so there are a few continuous variables that need to be discretized. I found qcut in pandas, which seems to do what I want - I can set a number of bins, and it will discretize the variable into that many bins, trying to keep the counts in each bin even.
However, the output of pandas.qcut is a list of Intervals, and the RandomForest classifier in scikit-learn needs a string. I found that I can convert an interval into a string by using .astype(str). Here's a quick example of what I'm doing:
import pandas as pd
from random import sample
vals = sample(range(0,100), 100)
cuts = pd.qcut(vals, q=5)
str_cuts = pd.qcut(vals, q=5).astype(str)
and then str_cuts is one of the variables passed into a random forest.
However, the intent of this system is to train a RandomForest, save it to a file, and then allow someone to load it at a later date and get a classification for a new test instance, that is not available at training time. And because the classifier was trained on discretized data, the new test instance will need to be discretized before it can be used. So what I want to be able to do is read in a new instance, apply the already-established discretization scheme to it, convert it to a string, and run it through the random forest. However, I'm getting hung up on the best way to 'apply the discretization scheme'.
Is there an easy way to handle this? I assume there's no straight-forward way to convert a string back into an Interval. I can get the list of all Interval values from the discretization (ex: cuts.unique()) and apply that at test-time, but that would require saving/loading a discretization dictionary alongside the random forest, which seems clunky, and I worry about running into issues trying to recreate a categorical variable (coming mostly from R, which is extremely particular about the format of categorical variables). Or is there another way around this that I'm not seeing?
Use the labels argument in qcut, or use a pandas Categorical.
Either of those can help you create categories instead of intervals for your variable. Then, you can use a form of encoding, for example label encoding or ordinal encoding, to convert the categories (the factors, if you're used to R) to numerical values which the forest will be able to use.
Then the process goes:
cutting => categoricals => encoding
and you don't need to do it by hand anymore.
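A minimal sketch of the labels route (the bin labels below are made up):
import pandas as pd
from random import sample

vals = sample(range(0, 100), 100)
# Name the bins directly instead of keeping Interval objects.
binned = pd.qcut(vals, q=5, labels=["q1", "q2", "q3", "q4", "q5"])
codes = binned.codes  # ordinal encoding: 0..4, ready for the forest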
Lastly, some gradient-boosted tree libraries have support for categorical variables, though it's not a silver bullet and will depend on your goal. See catboost and lightgbm.
For future searchers, there are benefits to using transformers from scikit-learn instead of pandas. In this case, KBinsDiscretizer is the scikit-learn equivalent of qcut.
It can be used in a pipeline, which will handle applying the previously-learned discretization to unseen data without storing a discretization dictionary separately or round-tripping through string conversion. Here's an example:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
pipeline = make_pipeline(
    KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile'),
    RandomForestClassifier(),
)
X, y = make_classification()
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
If you really need to convert back and forth between pandas IntervalIndex and string, you'll probably need to do some parsing as described in this answer: https://stackoverflow.com/a/65296110/3945991 and either use FunctionTransformer or write your own Transformer for pipeline integration.
While it may not be the cleanest-looking method, converting a string back into an interval is indeed possible:
import pandas as pd

# Strip the brackets, then split "(lo, hi]" into its two endpoints.
str_intervals = [i.replace("(", "").replace("]", "").split(", ") for i in str_cuts]
original_cuts = [pd.Interval(float(lo), float(hi)) for lo, hi in str_intervals]

Hold out tensorflow 1.4 new dataset API

With the new dataset object, is there a way to divide a dataset into training and test datasets according to a certain ratio, to get a hold-out set? And to do k-fold cross-validation?
In my case I wrote all the data into a single TFRecord file and then imported it with tf.data.TFRecordDataset.
Now, for the hold-out, I'd like a way to split this dataset into two datasets according to a ratio. I solved this with data.take() and data.skip(), but for the ratio I need the dataset's length, which is not graceful.
import tensorflow as tf

def split_dataset(dataset, ratio, n):
    count_train = (n * ratio) // 100
    train = dataset.take(count_train)
    test = dataset.skip(count_train)
    return train, test

filenames = ["dataset_breast.tfrecords"]
dataset = tf.data.TFRecordDataset(filenames)
train_dataset, test_dataset = split_dataset(dataset, 80, 3360)
For k-fold, I have only found solutions that apply a scikit-learn workaround to the data before importing it with tf.data.TFRecordDataset.
I do not know of any built-in feature like what you're describing. There are, of course, ways to achieve the functionality you're after. Here are two:
"Source" Placeholder
This one comes straight from the API docs. Though it is originally intended for TFRecordDataset, I imagine it could be adapted to other types. I'll copy/paste from the link:
filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...) # Parse the record into tensors.
dataset = dataset.repeat() # Repeat the input indefinitely.
dataset = dataset.batch(32)
iterator = dataset.make_initializable_iterator()
# You can feed the initializer with the appropriate filenames for the current
# phase of execution, e.g. training vs. validation.
# Initialize `iterator` with training data.
training_filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
sess.run(iterator.initializer, feed_dict={filenames: training_filenames})
# Initialize `iterator` with validation data.
validation_filenames = ["/var/data/validation1.tfrecord", ...]
sess.run(iterator.initializer, feed_dict={filenames: validation_filenames})
A little discussion: this works with just one training TFRecord and one validation TFRecord, too. So, you could split your data using sklearn.model_selection.train_test_split before writing the TFRecords: write one TFRecord dedicated to training data, and one dedicated to validation data. With sklearn, you can specify a ratio (or an absolute number).
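A rough sketch of that split-before-writing idea, assuming TF 1.x (the record payloads and file names below are made up):
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Serialize some dummy examples; in practice these are your real records.
examples = [
    tf.train.Example(features=tf.train.Features(feature={
        "x": tf.train.Feature(float_list=tf.train.FloatList(value=[float(i)])),
    })).SerializeToString()
    for i in range(100)
]

train_records, val_records = train_test_split(examples, test_size=0.2)
for path, records in [("train.tfrecords", train_records),
                      ("validation.tfrecords", val_records)]:
    with tf.python_io.TFRecordWriter(path) as writer:
        for record in records:
            writer.write(record)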
Two Datasets
Exactly like the name says: forget the filenames = tf.placeholder and just create two iterators, one for testing and one for training. I usually use TFRecords, but you're free to try another type of dataset. Typically, I put the get_next calls inside a tf.cond on a boolean tf.placeholder. If you're especially interested in this method, I could provide an MWE. But the source placeholder seems to be the preferred method (seeing as it's in the docs...).
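In that spirit, a rough TF 1.x sketch of the two-iterator approach (the file names and batch size below are made up):
import tensorflow as tf

train_ds = tf.data.TFRecordDataset(["train.tfrecords"]).repeat().batch(32)
test_ds = tf.data.TFRecordDataset(["validation.tfrecords"]).repeat().batch(32)

train_iter = train_ds.make_one_shot_iterator()
test_iter = test_ds.make_one_shot_iterator()

# Choose which iterator feeds the graph at run time. Caveat: tf.cond may
# run stateful ops from both branches, so verify on your TF version that
# each iterator only advances when you expect it to.
is_training = tf.placeholder(tf.bool, shape=[])
next_batch = tf.cond(is_training,
                     lambda: train_iter.get_next(),
                     lambda: test_iter.get_next())

with tf.Session() as sess:
    train_batch = sess.run(next_batch, feed_dict={is_training: True})
    test_batch = sess.run(next_batch, feed_dict={is_training: False})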

CountVectorizer matrix varies with new test data for classification?

I have created a model for text classification using Python. I used CountVectorizer, and it produced a document-term matrix of 2034 rows and 4063 columns (unique words). I saved the model to use on new test data. My new test data is:
test_data = ['Love', 'python', 'every', 'time']
The problem is that when I convert the above test-data tokens into a feature vector, it differs in shape, because the model expects a 4063-dimensional vector. I know how to solve this by taking the vocabulary of the CountVectorizer, searching for each token of the test data, and putting it at the corresponding index. But is there an easier way to handle this problem in scikit-learn itself?
You should not fit a new CountVectorizer on the test data; you should use the one you fit on the training data and call transform(test_data) on it.
You have two ways to solve this:
1. You can use the same CountVectorizer that you used for your train features, like this:
cv = CountVectorizer(parameters desired)
X_train = cv.fit_transform(train_data)
X_test = cv.transform(test_data)
2. You can also create another CountVectorizer, if you really want to (though this is not advisable, since you would be wasting space and you'd still want to use the same parameters for your CV), and reuse the same features:
cv_train = CountVectorizer(parameters desired)
X_train = cv_train.fit_transform(train_data)
cv_test = CountVectorizer(vocabulary=cv_train.get_feature_names(),desired params)
X_test = cv_test.fit_transform(test_data)
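A self-contained sketch of option 1 (the documents below are made up):
from sklearn.feature_extraction.text import CountVectorizer

train_data = ["I love python", "python every time", "love every time"]
test_data = ["Love python every time"]

cv = CountVectorizer()
X_train = cv.fit_transform(train_data)  # learns the vocabulary (4063 words in your case)
X_test = cv.transform(test_data)        # produces the same columns as X_train

print(X_train.shape[1] == X_test.shape[1])  # True: feature dimensions match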
Try using:
test_features = cv.inverse_transform(test_data)
This should return what you wish for.
I added .toarray() to the whole command in order to see the results as a matrix.
So you should write:
X_test_analyst = pipeline.named_steps['count_vectorizer'].transform(X_test).toarray()
I'm mega late to this discussion, but I just want to leave something for people who come here from a search engine.
As mentioned by @Andreas Mueller, you shouldn't create a new CountVectorizer with your new data. You can think of what CountVectorizer does as building a 2D array (picture an Excel table): every column is a unique word, every row represents a document (or sentence), and the value at (i, j) is the frequency of the j-th word in the i-th sentence.
If you make a new CountVectorizer from your new data, the unique words will probably (if not certainly) be different. When you then call model.predict on this data, it will report an error telling you the dimensions are not correct.
What I did in my code is the following:
If you train your model in a different .py/.ipynb file, you can use import pickle and its dump function to save your fitted CountVectorizer. You can follow the details in this post.
If you train your model in the same .py/.ipynb file, you can directly follow what @Andreas Mueller said.
code:
import pickle as pk

pk.dump(vectorizer, open(r'/relative path', 'wb'))
pk.dump(pca, open(r'/relative path', 'wb'))
# ...

# When you want to use them:
import pickle as pk

vectorizer = pk.load(open(r'/relative path', 'rb'))
pca = pk.load(open(r'/relative path', 'rb'))
# ...
Side note:
If I remember correctly, you can also pickle classes and other objects, but when you do so, make sure the class is already defined by the time you load the object. I'm not sure whether that matters in this case, but I still import PCA and CountVectorizer before calling pk.load.
I'm just a beginner in coding, so please test my code before using it in your project.

Scikit-learn feature selection for regression data

I am trying to apply a univariate feature selection method using the Python module scikit-learn to a regression (i.e. continuous-valued response) dataset in svmlight format.
I am working with scikit-learn version 0.11.
I have tried two approaches - the first of which failed and the second of which worked for my toy dataset but I believe would give meaningless results for a real dataset.
I would like advice regarding an appropriate univariate feature selection approach I could apply to select the top N features for a regression dataset. I would either like (a) to work out how to make the f_regression function work or (b) to hear alternative suggestions.
The two approaches mentioned above:
I tried using sklearn.feature_selection.f_regression(X,Y).
This failed with the following error message:
"TypeError: copy() takes exactly 1 argument (2 given)"
I tried using chi2(X,Y). This "worked" but I suspect this is because the two response values 0.1 and 1.8 in my toy dataset were being treated as class labels? Presumably, this would not yield a meaningful chi-squared statistic for a real dataset for which there would be a large number of possible response values and the number in each cell [with a particular response value and value for the attribute being tested] would be low?
Please find my toy dataset pasted into the end of this message.
The following code snippet should give the results I describe above.
from sklearn.datasets import load_svmlight_file
X_train_data, Y_train_data = load_svmlight_file(svmlight_format_train_file) #i.e. change this to the name of my toy dataset file
from sklearn.feature_selection import SelectKBest
featureSelector = SelectKBest(score_func="one of the two functions I refer to above",k=2) #sorry, I hope this message is clear
featureSelector.fit(X_train_data,Y_train_data)
print [1+zero_based_index for zero_based_index in list(featureSelector.get_support(indices=True))] #This should print the indices of the top 2 features
Thanks in advance.
Richard
Contents of my contrived svmlight file - with additional blank lines inserted for clarity:
1.8 1:1.000000 2:1.000000 4:1.000000 6:1.000000#mA
1.8 1:1.000000 2:1.000000#mB
0.1 5:1.000000#mC
1.8 1:1.000000 2:1.000000#mD
0.1 3:1.000000 4:1.000000#mE
0.1 3:1.000000#mF
1.8 2:1.000000 4:1.000000 5:1.000000 6:1.000000#mG
1.8 2:1.000000#mH
As larsmans noted, chi2 cannot be used for feature selection with regression data.
Upon updating to scikit-learn version 0.13, the following code selected the top two features (according to the f_regression test) for the toy dataset described above.
def f_regression(X, Y):
    import sklearn.feature_selection
    # center=True (the default) failed here with "ValueError: center=True only
    # allowed for dense data", but should presumably work in general.
    return sklearn.feature_selection.f_regression(X, Y, center=False)
from sklearn.datasets import load_svmlight_file
X_train_data, Y_train_data = load_svmlight_file(svmlight_format_train_file) #i.e. change this to the name of my toy dataset file
from sklearn.feature_selection import SelectKBest
featureSelector = SelectKBest(score_func=f_regression,k=2)
featureSelector.fit(X_train_data,Y_train_data)
print [1+zero_based_index for zero_based_index in list(featureSelector.get_support(indices=True))]
You could also try feature selection via L1/Lasso regularization. The class specifically designed for this is RandomizedLasso, which trains Lasso regression on multiple subsamples of your data and selects the features that are chosen most frequently across those models. You can also just use Lasso, LassoLars or SGDRegressor (with an L1 penalty) to do the same thing without the benefit of resampling, but faster.
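A minimal sketch of the plain-Lasso variant (note that RandomizedLasso was removed in later scikit-learn releases; the data and alpha below are made up):
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.rand(50, 6)
y = 3 * X[:, 0] - 2 * X[:, 3] + 0.1 * rng.randn(50)

lasso = Lasso(alpha=0.05).fit(X, y)
selected = np.nonzero(lasso.coef_)[0]  # indices of features with non-zero coefficients
print(selected)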
