How to train-test split and cross-validate in Surprise (Python)?

I wrote the following code, which works:
from surprise.model_selection import cross_validate
cross_validate(algo,dataset,measures=['RMSE', 'MAE'],cv=5, verbose=False, n_jobs=-1)
However, when I do this (notice that the trainset is passed to cross_validate instead of the whole dataset):
from surprise.model_selection import train_test_split
trainset, testset = train_test_split(dataset, test_size=test_size)
cross_validate(algo, trainset, measures=['RMSE', 'MAE'],cv=5, verbose=False, n_jobs=-1)
It gives the following error:
AttributeError: 'Trainset' object has no attribute 'raw_ratings'
I looked it up, and the Surprise documentation says that Trainset objects are not the same as Dataset objects, which makes sense. However, the documentation does not say how to convert a Trainset to a Dataset.
My question is:
1. Is it possible to convert a Surprise Trainset to a Surprise Dataset?
2. If not, what is the correct way to train-test split the whole dataset and cross-validate?

From my understanding, cross_validate performs the trainset/testset splits for you. So your first snippet is correct: it will split the data into 5 folds (cv=5), and each fold serves once as the test set while the other 4 form the training set.
If you want a simple train/test split instead, see the example from the docs, sketched below.
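Here is a minimal sketch of a single train/test split along the lines of the Surprise docs, reusing the dataset and algo objects from the question (the 0.25 test size is just an example value):

from surprise import accuracy
from surprise.model_selection import train_test_split

# Single hold-out split: fit on the Trainset, evaluate on the test list.
trainset, testset = train_test_split(dataset, test_size=0.25)
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

Note that cross_validate, by contrast, always takes the full Dataset object and does its own splitting internally.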

Related

My pipeline not imputing values correctly?

I'm new to Python and have been learning about pipelines from DataCamp. I have been experimenting with some FIFA data that has missing NaN values. I have tried to create a pipeline whose steps impute any missing data (replacing it with the mean) and then fit a logistic regression. I don't seem to get any errors in the output. However, when I print things such as print(x_train) and print(y_pred), the output still shows NaN values. Would that indicate that my pipeline is not working and that the data was not correctly imputed, since surely I should be seeing the mean values rather than NaN? I would appreciate it if someone could answer the question in layman's terms, as I am new to the topic.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

fif_data = pd.read_csv("fifa_draft_1.csv")
df_Foot_Dummy = pd.get_dummies(fif_data, drop_first=True)
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
logreg = LogisticRegression()
x = df_Foot_Dummy["passing"].values.reshape(-1, 1)
y = df_Foot_Dummy["preferred_foot_Right"]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
steps = [("imputation", imp), ("logistic_regression", logreg)]
pipe = Pipeline(steps)
pipe.fit(x_train, y_train)
y_pred = pipe.predict(x_test)
print(x_train)
print(y_pred)
Pipelines do not change your data in place; at each step the data is transformed and passed along, but the intermediate results are not stored back into your variables (with a partial exception when the Pipeline's memory parameter is set to cache fitted transformers). So print(x_train) will always show the original array, NaNs included.
That the logistic regression doesn't complain indicates that the imputation has in fact happened: scikit-learn's LogisticRegression raises an error if it is given NaN values.
y_pred shouldn't have any missing values; if it does, please let us know and provide an example dataset.
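If you want to see the imputed values, one option (a small sketch reusing the names from the question, with numpy imported as np) is to ask the fitted imputation step for its output directly:

# The pipeline never rewrites x_train; ask the fitted step for its output instead.
x_train_imputed = pipe.named_steps["imputation"].transform(x_train)
print(np.isnan(x_train_imputed).any())             # expected: False after imputation
print(pipe.named_steps["imputation"].statistics_)  # the column mean used for filling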

Using sklearn's roc_auc_score for OneVsOne Multi-Classification?

So I am working on a model that uses RandomForest to classify samples into 1 of 7 classes. I'm able to build and train the model, but when it comes to evaluating it with roc_auc_score, I can use 'ovr' (one-vs-rest) but 'ovo' is giving me some trouble.
roc_auc_score(y_test, rf_probs, multi_class = 'ovr', average = 'weighted')
The above works wonderfully and I get my output; however, when I switch multi_class to 'ovo', which I understand might be better with class imbalances, I get the following error:
roc_auc_score(y_test, rf_probs, multi_class = 'ovo')
IndexError: too many indices for array
(I pasted the whole traceback below!)
Currently my data is set up as follows:
y_test: shape (61, 1)
y_probs: shape (61, 7)
Do I need to reshape my data in a special way to use 'ovo'?
In the documentation, https://thomasjpfan.github.io/scikit-learn-website/modules/generated/sklearn.metrics.roc_auc_score.html, it says "binary y_true, y_score is supposed to be the score of the class with greater label. The multiclass case expects shape = [n_samples, n_classes] where the scores correspond to probability estimates."
Additionally, the whole traceback seems to hint at using a more binary array (hopefully that's the right term; I'm new to this!).
Very, very thankful for any ideas/thoughts!
@Tirth Patel provided the right answer: I needed to reshape my test set using one-hot encoding. Thank you!
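For reference, here is a minimal, self-contained sketch (synthetic data, not the poster's) of the shapes that roc_auc_score's multiclass 'ovo' mode accepts, per the documentation quoted above: a 1-D array of labels for y_true and an (n_samples, n_classes) probability matrix for y_score:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy 7-class problem standing in for the real data.
X, y = make_classification(n_samples=700, n_classes=7, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
rf_probs = rf.predict_proba(X_test)             # shape (n_samples, 7)

# y_true must be 1-D labels; ravel() flattens an (n, 1) column vector.
print(roc_auc_score(y_test.ravel(), rf_probs, multi_class='ovo'))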

Feature Names Mismatch when Passing X_test to .predict() Function (Again, Still)

Ok I'm still having this issue and I'm at a loss as to where I'm going wrong. I thought I had a working solution, but I was wrong.
After finding a regression pipeline through TPOT, I go to use the .predict(X_test) function and I get the following error message:
ValueError: Number of features of the model must match the input. Model n_features is 117 and input n_features is 118
I read somewhere on GitHub that XGBoost likes to have the X features passed to it as a NumPy array rather than a pandas DataFrame. So I did that, and now I receive this error message whenever a RandomForestRegressor ends up in my pipeline.
So I investigate:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed, shuffle=False)
# Here is where I convert the features to numpy arrays
X_train=X_train.values
X_test=X_test.values
print('[INFO] Printing the shapes of the training/testing feature/label sets...')
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
[INFO] Printing the shapes of the training/testing feature/label sets...
(1366, 117)
(456, 117)
(1366,)
(456,)
# Notice 117 columns in the X sets...
# Now print the X_test shape just before the predict function...
print(X_test.shape)
(456, 117)
# Still 117 columns, so call predict:
predictions = best_model.predict(X_test)
ValueError: Number of features of the model must match the input. Model n_features is 117 and input n_features is 118
WHY!!!!!!?????
Now the tricky thing is, I'm using a custom tpot_config that only includes the regressors XGBRegressor, ExtraTreesRegressor, GradientBoostingRegressor, AdaBoostRegressor, DecisionTreeRegressor, and RandomForestRegressor, so I need a way to train and predict in which all of them handle the data the same way, so that no matter which pipeline TPOT comes up with, I won't hit this issue every time I run my code!
Similar questions have been asked on SO before, but I don't understand why my model is not predicting when I AM passing it the same number of (X) features as was used in training the model!? Where am I going wrong here?
EDIT
I should also mention that leaving the features as DataFrames and not converting them to NumPy arrays sometimes gives me a "feature names mismatch" error when XGBRegressor is in the pipeline as well. So I'm at a loss as to how to handle both the list of tree regressors (which like DataFrames) and XGBoost (which likes NumPy arrays). I have also tried "re-arranging" the columns to make sure that the X_train and X_test DataFrames are in the same order, as some have suggested, but that didn't do anything.
I have posted my full code in a Google Colab notebook here where you can make comments on it. How can I pass the testing data to the .predict() function no matter what pipeline TPOT comes up with??????
Thanks to weixuanfu on GitHub, I may have found a solution: moving the feature_importance code section down to the bottom of my code, and yes, using NumPy arrays for the features. If I run into this issue again, I will post it below:
https://github.com/EpistasisLab/tpot/issues/738
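Not the poster's actual fix, but one defensive pattern worth sketching (assuming X_train and X_test are still DataFrames at this point and best_model is the fitted TPOT pipeline): align the test columns to the training columns once, right after the split, and only then convert both to NumPy arrays.

# Same columns, same order as training, before any array conversion.
X_test = X_test[X_train.columns]
X_train_arr = X_train.values
X_test_arr = X_test.values

# Sanity check before predicting.
assert X_train_arr.shape[1] == X_test_arr.shape[1], "train/test column counts differ"
predictions = best_model.predict(X_test_arr)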

Is it possible to split the training DataLoader (and dataset) into training and validation datasets?

The torchvision package provides easy access to commonly used datasets. You would use them like this:
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)
testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)
Apparently, you can only switch between train=True and train=False. The docs explain:
train (bool, optional) – If True, creates dataset from training.pt,
otherwise from test.pt.
But this goes against the common practice of having a three-way split. For serious work, I need another DataLoader with a validation set. Also, it would be nice to specify the split proportions myself; the docs don't say what percentage of the dataset is reserved for testing, and maybe I would like to change that.
I assume that this is a conscious design decision. Everyone working on one of these datasets is supposed to use the same testset. That makes results comparable. But I still need to get a validation set out of the trainloader. Is it possible to split a DataLoader into two separate streams of data?
Meanwhile, I stumbled upon the method random_split. So you don't split the DataLoader; you split the Dataset:
torch.utils.data.random_split(dataset, lengths)
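A minimal sketch of how that could look with the CIFAR10 setup above (the 40,000/10,000 split is just an example proportion):

import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, random_split

transform = transforms.ToTensor()
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)

# Split the 50,000 training images into train and validation subsets.
train_subset, val_subset = random_split(trainset, [40000, 10000])

trainloader = DataLoader(train_subset, batch_size=4, shuffle=True, num_workers=2)
valloader = DataLoader(val_subset, batch_size=4, shuffle=False, num_workers=2)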

How can I do a scikit-learn grid search with training data that is an iterator

I am working on a text classification problem, using a pipeline that looks like this:
self.full_classifier = Pipeline([
    ('vectorize', CountVectorizer()),
    ('tf-idf', TfidfTransformer()),
    ('classifier', SVC(kernel='linear', class_weight='balanced'))
])
The full corpus is too large to fit in memory, but small enough that after the vectorization step I have no memory issues. I can successfully fit a classifier by using
self.full_classifier.fit(
    self._all_data(max_samples=train_data_length),
    self.dataset.head(train_data_length)['target'].values
)
where self._all_data is an iterator that yields the documents per training example (while self.dataset just includes document IDs and targets). Here, max_samples is optional; I am using it to do a training/testing split. I now want to use grid search to optimize parameters, for which I am using this code:
parameters = {
    'vectorize__stop_words': (None, 'english'),
    'tf-idf__use_idf': (True, False),   # key must match the 'tf-idf' step name above
    'classifier__class_weight': (None, 'balanced')
}
gridsearch_classifier = GridSearchCV(self.full_classifier, parameters, n_jobs=-1)
gridsearch_classifier.fit(self._all_data(), self.dataset['target'].values)
My problem is that this generates the following error:
TypeError: Expected sequence or array-like, got <type 'generator'>
with the traceback pointing at the gridsearch_classifier.fit method (and then into scikit-learn's code; the error is raised in _num_samples(x)). Since it is possible to fit with a generator as input, I was wondering if there is also a way to do this with the grid search that I am currently missing.
Any help is appreciated!
Not without materializing the generator as a list. While various fit methods can often be structured to consume one item at a time, and thus accept an iterator, grid search additionally performs cross-validation and builds its CV splits by indexing into a realized collection of the data, so the whole training set has to exist in an indexable form (see the sketch below).
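A minimal, self-contained sketch of that workaround, with toy documents, a stand-in all_data generator, and a reduced parameter grid rather than the poster's actual class (cv=2 only to keep the toy example valid):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

def all_data():
    # Stand-in for the question's self._all_data generator.
    docs = ["first document", "second document here", "third text sample",
            "fourth text sample", "fifth document here", "sixth text"]
    for doc in docs:
        yield doc

targets = [0, 1, 0, 1, 0, 1]

full_classifier = Pipeline([
    ('vectorize', CountVectorizer()),
    ('tf-idf', TfidfTransformer()),
    ('classifier', SVC(kernel='linear', class_weight='balanced'))
])
parameters = {'vectorize__stop_words': (None, 'english')}

gridsearch_classifier = GridSearchCV(full_classifier, parameters, cv=2, n_jobs=-1)
# Materialize the generator so GridSearchCV can index it when building CV folds.
gridsearch_classifier.fit(list(all_data()), targets)
print(gridsearch_classifier.best_params_)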
