Passing the argument from a previous step in sklearn pipelines - python

I have a dataset with numerical and categorical features on which I am trying to fit a classifier. My idea was to preprocess the categorical data first using Pandas such that my dataset can be written as (to borrow MATLAB's concatenation notation)
X_train = [ X_train_num, X_train_cat ]
and
X_test = [ X_test_num, X_test_cat ].
To deal with numerical data, I did the following:
# define concatenation of arrays so we can assemble the various parts
# that are preprocessed differently in the pipelines
def concat(a1, a2):
    return np.concatenate((a1, a2), axis=1)

# pipeline to preprocess, reassemble, and fit our models
trainPipeline = Pipeline([
    ('preprocessing', numPipeline),  # scale numerical data
    ('assembling', FunctionTransformer(concat, kw_args={'a2': X_train[nominalFeatures]})),  # wrong, but how?
    ('classifying', LogisticRegression())
])
The issue here is that when I pass X_train to the pipeline, it only extracts X_train_num to scale it in the first step, which is why I need to reassemble X_train_num_scaled with X_train_cat = X_train[nominalFeatures] together in the second step. The code above will obviously not work when I use X_test as an input for prediction unless I find a way to access the initial input from the first step and use that in the concatenation step.
I have tried to look at trainPipeline.steps[0] and down the list for the initial variable name but found nothing that could help me. What am I missing?

As @Vivek Kumar states, you should use FeatureUnion in order to construct that pipeline. It is usually used to concatenate inputs so that the model can train on the extended data. So, in your case the pipeline should look like the following:
def concat(a1, a2):
    return np.concatenate((a1, a2), axis=1)

subpipe = Pipeline(
    [('concat', FunctionTransformer(concat, kw_args={'a2': X_train[nominalFeatures]})),
     ('preproc', numPipeline())])
union = FeatureUnion(
    [('prep_data', subpipe),
     ('raw_data', FunctionTransformer(concat, kw_args={'a1': X_train_num}))])
pipe = Pipeline(
    [('union', union),
     ('logreg', LogisticRegression())])
Then, after fitting with pipe.fit(X_train, y), you should be able to call pipe.predict(X_test), provided X_test is already preprocessed.
Quick check: I applied the numPipeline() preprocessing to X_train[nominalFeatures] and left X_train_num as it is. I hope that is what you intended.
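For reference, here is a more self-contained sketch of the FeatureUnion idea. The column names, the StandardScaler standing in for numPipeline, and the selector transformers are assumptions for illustration; the point is that column selection happens inside the pipeline, so nothing from X_train is hard-wired in and the same object works for X_test:

# Self-contained sketch only; column names and StandardScaler are placeholders,
# and the nominal columns are assumed to be numerically encoded already.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.linear_model import LogisticRegression

numericFeatures = ['num1', 'num2']   # assumed numeric column names
nominalFeatures = ['cat1', 'cat2']   # assumed (already encoded) categorical column names

# Selectors pull the relevant columns from whatever DataFrame is passed in.
pick_num = FunctionTransformer(lambda df: df[numericFeatures].values, validate=False)
pick_cat = FunctionTransformer(lambda df: df[nominalFeatures].values, validate=False)

union = FeatureUnion([
    ('num', Pipeline([('select', pick_num), ('scale', StandardScaler())])),
    ('cat', Pipeline([('select', pick_cat)])),   # categorical columns pass through untouched
])

pipe = Pipeline([('union', union), ('logreg', LogisticRegression())])
# pipe.fit(X_train, y_train); pipe.predict(X_test)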

Related

Is there a way to combine these sklearn Pipelines/ColumnTransformers so I don't have to make multiple fit_transform() calls?

I'd like to create a Pipeline where I can call fit_transform() just once on my train dataset (train_df) and receive a fully preprocessed dataset. I don't think I can currently do that, however, because I have to call PCA() on the output of one ColumnTransformer and then concatenate that output with the result of a separate ColumnTransformer called on train_df. Basically, I think I'm going too high up the abstraction ladder, with one too many pipelines/ColumnTransformers embedded within each other. Is there really no way to streamline the entire preprocessing process by passing train_df to a single Pipeline or ColumnTransformer, or am I missing something? I've spent hours wrestling with this problem and have finally accepted that I'm just spinning my wheels. Any help or solutions would be greatly appreciated.
Thank you!
num_ct = ColumnTransformer([
    ('non_skewed_num', non_skewed_num_pipe, non_skewed_vars),
    ('skewed_num', skewed_num_pipe, skewed_vars)
], remainder='drop')

total_num_pipe = Pipeline([('num_ct', num_ct),
                           ('dim_reduc', PCA(n_components=5))])

cat_ct = ColumnTransformer([
    ('cat_pipe1', cat_pipe1, cat_vars1),
    ('cat_pipe2', cat_pipe2, cat_vars2)
], remainder='drop')

final_num = total_num_pipe.fit_transform(train_df)
final_cat = cat_ct.fit_transform(train_df)
final_X_train = np.c_[final_num, final_cat]
I finally found a solution to this, thanks to @Alexander's suggestion of chaining ColumnTransformers into a Pipeline. (TLDR: Don't forget that you can create a Pipeline of ColumnTransformers, using remainder='passthrough' to your advantage.)
I first created a ColumnTransformer that concatenates the transformations for both numeric and categorical variables, but without the PCA.
ct = ColumnTransformer([
    ('non_skewed_num', non_skewed_num_pipe, non_skewed_vars),
    ('skewed_num', skewed_num_pipe, skewed_vars),
    ('cat_pipe1', cat_pipe1, cat_vars1),
    ('cat_pipe2', cat_pipe2, cat_vars2)
], remainder='drop')
Then, I created a ColumnTransformer just for the PCA, and when I specified which columns to apply this to, I used a slice object since this ColumnTransformer will be fed a NumPy array--not a DataFrame--in the eventual Pipeline (it will be the second ColumnTransformer in the Pipeline). I also set remainder='passthrough' so the non-numeric variables will be retained untransformed after the PCA.
ct2 = ColumnTransformer([('dim_reduc', PCA(n_components=5), slice(0, 37))], remainder='passthrough') # 37 is number of numeric variables
Finally, I created a Pipeline chaining these two ColumnTransformers
final_pipe = Pipeline([('ct', ct),
                       ('ct2', ct2)])
Calling final_pipe.fit_transform(train_df) yields the cleaned array I wanted. Hope this helps!
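As a side note, since the whole flow is now a single Pipeline, an estimator could be appended as a final step so that fitting and prediction also happen in one call. This is a sketch only; the LogisticRegression below is a placeholder, not part of the original question:

from sklearn.linear_model import LogisticRegression

full_pipe = Pipeline([('ct', ct),
                      ('ct2', ct2),
                      ('clf', LogisticRegression())])   # placeholder estimator
# full_pipe.fit(train_df, y_train)
# preds = full_pipe.predict(test_df)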

Why is sklearn.metrics support value changing every time?

I'm working on training a supervised learning keras model to categorize data into one of 3 categories. After training, I run this:
import numpy as np
import pandas
import sklearn.metrics
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.utils import to_categorical

dataset = pandas.read_csv(filename, header=[0], encoding='utf-8-sig', sep=',')
# split X and Y (last column)
array = dataset.values
columns = array.shape[1] - 1
np.random.shuffle(array)
x_orig = array[:, 1:columns]
testy = array[:, columns]
columns -= 1
# normalize data
scaler = StandardScaler()
testx = scaler.fit_transform(x_orig)
# one-hot encode the labels
testy = to_categorical(testy)
# load weights
save_path = "[filepath]"
model = tf.keras.models.load_model(save_path)
# get the per-class breakdown
y_pred = model.predict(testx, verbose=1)
y_pred_bool = np.argmax(y_pred, axis=1)
y_true = np.argmax(testy, axis=1)
print(sklearn.metrics.precision_recall_fscore_support(y_true, y_pred_bool))
sklearn.metrics.precision_recall_fscore_support prints, among other metrics, the support for each class. Per this link, support is the number of occurrences of each class in y_true, which is the true labels.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
My problem: each run, support is different. I'm using the same data, and the support values for the classes always add up to the same total (though that sum differs from the total in the file, which I also don't understand), but the number per class differs.
As an example, one run might say [16870, 16299, 7807] and the next might say [17169, 15923, 7884]. They add up the same, but each class differs.
Since my data isn't changing between runs, I'd expect support to be identical every time. Am I wrong? If not, what's going on? I've tried googling, but didn't get any useful results.
Potentially useful information: when I run sklearn.metrics.classification_report, I have the same issue, and the numbers from that match the numbers from precision_recall_fscore_support.
Sidenote: this is unrelated to the above question, but I couldn't google-fu an answer to it either, so I hope it's ok to include here. When I run model.evaluate, part of the printout is e.g. 74us/sample. What does us/sample mean?
Add:
np.random.seed(42)
before you shuffle the array at
np.random.shuffle(array)
The reason is that, without seeding, np.random.shuffle produces a different ordering each time. Thus, when you feed the array into the model, you get a different result on every run. Seeding shuffles it the same way each time, making the results reproducible.
Alternatively, you can skip the shuffle entirely and feed the same array into the model each time. Either approach will make the results reproducible.
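In context, a minimal sketch of the seeded version, reusing the variables from the question (the seed value 42 is arbitrary):

import numpy as np

np.random.seed(42)        # fix the RNG state so the shuffle is identical on every run
np.random.shuffle(array)  # shuffles the rows in place
x_orig = array[:, 1:columns]
testy = array[:, columns]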

Predicting a single value in classification network

I am currently trying to get into machine learning and neural networks, but my lack of programming skills is kind of hindering me at the moment. I am following an online tutorial in which these lines of code were made to evaluate the created model:
pred_fn = tf.estimator.inputs.pandas_input_fn(x=X_test, batch_size=len(X_test), shuffle=False)
predictions = list(model.predict(input_fn=pred_fn))
predictions[0]

final_preds = []
for pred in predictions:
    final_preds.append(pred['class_ids'][0])
final_preds[:10]

from sklearn.metrics import classification_report
print(classification_report(y_test, final_preds))
This works very well for me and tells me the precision I achieved on these 10 inputs I chose from X_test. Unfortunately, I can't really figure out how to predict a particular, single value from X_test, or maybe even a manually input value that has the same dimensions as an element of X_test.
X_test is a pandas.core.frame.DataFrame and includes 15 columns and thousands of rows. Therefore, I would find it helpful to maybe predict or evaluate a certain value.
If I missed any essential information, that I should have included, let me know. Thanks in advance!
Why don't you just take sections of the X_test dataframe, or pass in single values as a dataframe with a single row?
Sectioning a dataframe:
temp = X_test[i:i+1]
To test with the ith row, use temp now instead of X_test.
Or create a new dataframe with required data:
temp = pandas.DataFrame(data, columns = X_test.columns)
where data is input as a list (iterable) [[a1,a2,a3...a15]].
Again, use temp instead of X_test in your code.
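For example, a short sketch of predicting just one row this way; it assumes the same TF 1.x estimator setup as in the question, and the row index 5 is arbitrary:

single_row = X_test[5:6]   # slicing keeps a one-row DataFrame rather than a Series
single_fn = tf.estimator.inputs.pandas_input_fn(x=single_row, batch_size=1, shuffle=False)
single_pred = list(model.predict(input_fn=single_fn))[0]
print(single_pred['class_ids'][0])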

Spark transformations and preservation of RDD element ordering

I'm trying to understand how (and if) the piece of code below works. In particular, what I don't understand is why this code assumes, maybe correctly, that the order of elements in the RDD is preserved across mappings. This is essentially the same question asked here: Mind blown: RDD.zip() method. I don't understand why/how the last line guarantees that the zip actually pairs the correct prediction with the corresponding label from the testData RDD. One of the comments mentions that if the RDD, testData in this case, is ordered in some way, then map will preserve that order. However, predictions is an entirely different RDD, so I can't see how or why this works!
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

## Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = labeledDataRDD.randomSplit([0.7, 0.3])

## Train a RandomForest model
model = RandomForest.trainClassifier(trainingData, numClasses=2510,
                                     categoricalFeaturesInfo={}, numTrees=100,
                                     featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)

SelectKBest based on (estimated) amount of features

I'm trying to implement a hierarchical text classifier with scikit-learn, with one "root" classifier that arranges all input strings in one (or more) of ~50 categories. For each of these categories, I'm going to train a new classifier, which solves the actual task.
The reason for this two-layer approach is training performance and memory issues (a classifier which is supposed to separate >1k classes does not perform very well...).
This is what my pipeline looks like for each of these "subclassifiers"
pipeline = Pipeline([
    ('vect', CountVectorizer(strip_accents=None, lowercase=True, analyzer='char_wb', ngram_range=(3, 8), max_df=0.1)),
    ('tfidf', TfidfTransformer(norm='l2')),
    ('feat', SelectKBest(chi2, k=10000)),
    ('clf', OneVsRestClassifier(SGDClassifier(loss='log', penalty='elasticnet', alpha=0.0001, n_iter=10))),
])
Now to my problem: I'm using SelectKBest to limit the model size to a reasonable amount, but for the subclassifiers, there is sometimes not enough input data available so I don't even get to the 10k feature limit, which causes
(...)
File "/usr/local/lib/python3.4/dist-packages/sklearn/feature_selection/univariate_selection.py", line 300, in fit
self._check_params(X, y)
File "/usr/local/lib/python3.4/dist-packages/sklearn/feature_selection/univariate_selection.py", line 405, in _check_params
% self.k)
ValueError: k should be >=0, <= n_features; got 10000.Use k='all' to return all features.
I don't know how many features I will have before applying the CountVectorizer, but I have to define the pipeline in advance. My preferred solution would be to skip the SelectKBest step if there are fewer than k features anyway, but I don't know how to implement this behaviour without calling CountVectorizer twice (once in advance, once as part of the pipeline).
Any thoughts on this?
I followed the advice of Martin Krämer and created a subclass of SelectKBest which implements the desired functionality:
class SelectAtMostKBest(SelectKBest):
    def _check_params(self, X, y):
        if not (self.k == "all" or 0 <= self.k <= X.shape[1]):
            # set k to "all" (skip feature selection) if fewer than k features are available
            self.k = "all"
I tried to add this snippet to his answer, but the edit was rejected, so there you are...
I think the cleanest option would be to subclass SelectKBest and fall back to an identity transformation in your implementation if k exceeds the number of input features; otherwise, just call the super implementation.
You could use SelectPercentile, which is more meaningful if you don't have a fixed number of features.
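For illustration, a sketch of the SelectPercentile variant dropped into the question's pipeline; percentile=10 is just an example value, and the SGDClassifier arguments match the older scikit-learn API used in the question:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier

pipeline = Pipeline([
    ('vect', CountVectorizer(strip_accents=None, lowercase=True, analyzer='char_wb', ngram_range=(3, 8), max_df=0.1)),
    ('tfidf', TfidfTransformer(norm='l2')),
    # keep the top 10% of features, however many features the vectorizer produces
    ('feat', SelectPercentile(chi2, percentile=10)),
    ('clf', OneVsRestClassifier(SGDClassifier(loss='log', penalty='elasticnet', alpha=0.0001, n_iter=10))),
])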
