I am tasked with a supervised learning problem on a dataset and want to create a full Pipeline from beginning to end.
Starting with the train-test splitting: I wrote a custom class to plug sklearn's train_test_split into an sklearn pipeline. Its fit_transform returns the training set. Later I still want to access the test set, so I made it an instance variable in the custom transformer class, like this:
self.test_set = test_set
from sklearn.model_selection import train_test_split

class train_test_splitter([...]):
    [...
    ...]

    def transform(self, X):
        train_set, test_set = train_test_split(X, test_size=0.2)
        self.test_set = test_set
        return train_set
from sklearn.pipeline import Pipeline

split_pipeline = Pipeline([
    ('splitter', train_test_splitter()),
])

df_train = split_pipeline.fit_transform(df)
Now I want to get the test set like this:
df_test = splitter.test_set
It's not working. How do I get at the instance variables of "splitter"? Where are they stored?
You can access the steps of a pipeline in a number of ways. For example,
split_pipeline['splitter'].test_set
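Equivalently, the named_steps attribute exposes the same fitted step object, so this also works:

split_pipeline.named_steps['splitter'].test_set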
That said, I don't think this is a good approach. When you fill out the pipeline with more steps, everything will work the way you want at fit time; but when you predict or transform on other data, your transform method will still be called, so it will generate a new train-test split, forget the old one, and send the new train set down the pipe for the remaining steps.
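If all you need is a held-out test set, a more conventional pattern is to split before fitting the pipeline. A minimal sketch, assuming df and a separate pipeline (here called prep_pipeline) that holds the remaining preprocessing steps:

from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
# fit the preprocessing steps on the training portion only
prep_pipeline.fit(df_train)

This keeps the test set out of the pipeline's transform path entirely.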
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
pipe4 = Pipeline([('ss', StandardScaler()), ('clf', knn)])
grid2 = GridSearchCV(pipe4, {'clf': [knn, LogisticRegression()]})
grid2.fit(X_train, y_train)
pd.DataFrame(grid2.cv_results_).T
I made a KNN classifier and a logistic regression model and wanted to check which model is better using the pipeline approach.
As you can see in the code above, I put only the KNN in pipe4, but in the grid search both KNN and logistic regression are run and I could check the results.
Does that mean I can add models in GridSearchCV even though I put only one model in the pipeline?
Sure. As long as the estimator given to the GridSearchCV (in your example: pipe4) supports the parameters passed to param_grid (in your example: 'clf'), you can pass any values to the estimator's parameters in the grid search (in your example: [knn, LogisticRegression()]).
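If you also want to tune each candidate model's own hyperparameters, a list of parameter grids works as well. A brief sketch (the hyperparameter values and the grid3 name are just for illustration):

param_grid = [
    {'clf': [KNeighborsClassifier()], 'clf__n_neighbors': [3, 5, 7]},
    {'clf': [LogisticRegression()], 'clf__C': [0.1, 1.0, 10.0]},
]
grid3 = GridSearchCV(pipe4, param_grid, cv=5)
grid3.fit(X_train, y_train)

Each dict is searched separately, so clf__n_neighbors is only ever combined with KNeighborsClassifier and clf__C only with LogisticRegression.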
I'm using LogisticRegressionCV on my data in a pipeline. After fitting to the data, I'd like to return my optimal C value. How do I do this? I can't use .best_params_, since that is an attribute of GridSearchCV. I know that .C_ is the correct attribute of LogisticRegressionCV, but my estimator is in a pipeline, so that doesn't work right now.
lr_cv2 = Pipeline(steps=[('preprocessor', preprocessor),
                         ('classifier', LogisticRegressionCV(solver='liblinear', cv=10,
                                                             Cs=np.logspace(-5, 8, 15)))])
lr_cv2.fit(X_train, y_train)
lr_cv2.C_
AttributeError: 'Pipeline' object has no attribute 'C_'
By using the named_steps attribute of your Pipeline instance, you can access the individual estimators that make up your pipeline, along with their fitted attributes:
print(lr_cv2.named_steps['classifier'].C_ )
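In recent scikit-learn versions (0.21+), indexing the pipeline by step name returns the same fitted estimator, so this is equivalent:

print(lr_cv2['classifier'].C_)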
I have a set of data on which I would like to train a Neural Net, although I believe my question pertains to any type of machine learning.
My data falls into two classes; however, I have many more examples of class one than of class two. Before I go ahead and train a neural net on my data, I intend to split the data into 3 independent groups (training, validation and testing), and within each one duplicate the data I have for class two enough times so that I have equal amounts of data from each class in that group.
This is really tedious to do, and I'm willing to bet that other people have had the same problem. Is there a Python library that does this for me? Or at least part of it?
tl;dr: I want a Python library that splits my data into 3 parts and equalizes the amount of data I have in each class, without throwing away data.
Yes, use scikit-learn. The following is based on KeironO's answer at https://github.com/fchollet/keras/issues/1711:
from sklearn.model_selection import StratifiedKFold

def load_data():
    # load your data using this function
    ...

def create_model():
    # create your model using this function
    ...

def train_and_evaluate_model(model, X_train, y_train, X_test, y_test):
    # fit and evaluate here, e.g. model.fit(X_train, y_train, ...)
    ...

if __name__ == "__main__":
    n_folds = 10
    data, labels, header_info = load_data()
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True)
    for i, (train, test) in enumerate(skf.split(data, labels)):
        print("Running Fold", i + 1, "/", n_folds)
        model = None  # clearing the NN
        model = create_model()
        train_and_evaluate_model(model, data[train], labels[train],
                                 data[test], labels[test])
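As a quick sanity check (a sketch, assuming labels is an integer array), you can confirm that each fold keeps roughly the original class proportions:

import numpy as np

for train, test in skf.split(data, labels):
    print(np.bincount(labels[train]), np.bincount(labels[test]))

Stratification preserves class proportions per fold; any duplication of the minority class would still need to happen within each training fold.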
I'm practicing with some text using scikit-learn.
Towards getting more familiar with GridSearch, I am starting with some example code found here:
###############################################################################
# define a pipeline combining a text feature extractor with a simple
# classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
])

parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
}

grid_search = GridSearchCV(pipeline, parameters)
grid_search.fit(X_train, y_train)
print("Best score: %0.3f" % grid_search.best_score_)
Notice I am being very careful here, and I've only got one estimator and one parameter!
I'm finding that when I run this, I get the error:
TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None))]) does not.
Hummmm...why am I missing some sort of 'score' attribute?
When I check the possible parameters,
print(CountVectorizer().get_params().keys())
I don't see anything where I can score, as was implied by this answer.
The documentation says "By default, parameter search uses the score function of the estimator to evaluate a parameter setting." So why do I need to specify a score method?
Regardless, I thought I might need to explicitly pass a scoring argument, but this didn't help and gave me an error: grid_search.fit(X_train, y_train, scoring=None)
I don't understand this error!
GridSearch maximizes a score over the grid of parameters. You have to specify what kind of score to use because there are many different types of scores possible. For example, for classification problems, you could use accuracy, f1-score, etc. Usually, score type is specified by passing a string in the scoring argument (see scoring parameter). Alternatively, model classes, like SVC or RandomForestRegressor, will have a .score() method. GridSearch will call that if no scoring argument is provided. However, that may or may not be the type of score that you want to optimize. There is also an option of passing in a function as the scoring argument if you have an unusual metric that you want GridSearch to use.
Transformers, like CountVectorizer, do not implement a score method, because they are just deterministic feature transformations. For the same reason, there aren't any scoring methods that make sense to apply to that type of object. You need a model class (or possibly a clustering algorithm) at the end of your pipeline for scoring to make sense.
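For what it's worth, the scoring argument also belongs in the GridSearchCV constructor rather than in fit, which is why the grid_search.fit(..., scoring=None) call above raised a separate error. A minimal sketch, assuming a pipeline whose final step is a classifier:

grid_search = GridSearchCV(pipeline, parameters, scoring='f1_macro')
grid_search.fit(X_train, y_train)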
Aha! I figured it out.
I wasn't understanding how the pipeline works. Sure, I could create a CountVectorizer, but why? There is no way you can get a score out of it, or basically do anything with it other than have a sparse matrix just sitting there.
I need to create a regressor (SGDRegressor) or a classifier (SGDClassifier) as the final step.
I didn't realize that the pipeline will go
CV --> Regressor
or
CV --> Classifier
The pipeline does what its name implies... it pipes the objects together in series.
In other words, this works:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', SGDRegressor()),
])
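With a model at the end of the pipeline, the grid search from the start of the question now has something to score. A brief sketch reusing the names defined above:

grid_search = GridSearchCV(pipeline, parameters)
grid_search.fit(X_train, y_train)
print("Best score: %0.3f" % grid_search.best_score_)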
I'm using:
sklearn.cross_validation.cross_val_score
to make a cross validation and get the results of each run.
The output of this function is the scores.
Is there a way to get the folds (partitions) themselves that are generated internally by the cross_val_score function?
There isn't a way to extract the internal cross validation splits used by cross_val_score, since the function does not expose any state about them. As mentioned in the documentation, either a k-fold or stratified k-fold split with k=3 will be used by default.
However, if you need to keep track of the cross validation splits used, you can create your own cross validation iterator and pass it explicitly via the cv argument of cross_val_score:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
kf = KFold(n_splits=5, shuffle=True, random_state=0)
clf = SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=kf)
so that it uses the splits you specified exactly instead of rolling its own.
The default cross validator for cross_val_score is a StratifiedKFold with k=3 for classification. You can get a cross validation iterator instead by constructing StratifiedKFold (or KFold) yourself, as in the example above, and looping over its splits if you need the indices (see the sketch below).
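A small sketch of that loop, reusing the kf and iris objects from the snippet above:

for fold, (train_idx, test_idx) in enumerate(kf.split(iris.data)):
    print("Fold", fold, "train size:", len(train_idx), "test size:", len(test_idx))

Because random_state is fixed, these are the same index arrays cross_val_score generates internally when you pass cv=kf, so recording them here lets you reconstruct every partition.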