Error when attempting cross validation in Python

I am currently trying to implement cross validation with linear regression. The linear regression works, but when I try cross validation I get this error:
TypeError: only integer scalar arrays can be converted to a scalar index
I get this error on line 5 of my code.
Here is my code:
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    linreg.fit(X_train, Y_train)
    # p = np.array([linreg.predict(xi) for xi in x[test]])
    p = linreg.predict(X_test)
    e = p-Y_test
    xval_err += np.dot(e,e)
rmse_10cv = np.sqrt(xval_err/len(X_train))
Can someone please help me with this problem?
Thanks in advance!

There are a few problems with your code.
On line 5, Y_train is not defined. I think you want the lowercase y_train.
Similarly, you want e = p-y_test on line 8.
In rmse_10cv = np.sqrt(xval_err/len(X_train)), X_train is defined inside your loop, so len(X_train) takes its value from the last iteration. Check the training indices printed for each fold to make sure the length of X_train is always the same; otherwise your calculation of rmse_10cv will not be valid. Note also that xval_err sums squared errors over the test samples of every fold, so the number of terms in that sum is len(X), not len(X_train).
I ran your code with the fixes I described and with the following before the loop:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
linreg = LinearRegression()
xval_err = 0
and I did not receive any errors.
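For reference, here is a minimal self-contained sketch of the loop with those fixes applied. Dividing by len(X) rather than len(X_train) is my assumption about what the RMSE is meant to be (every sample lands in exactly one test fold), and the name rmse_cv is only illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
linreg = LinearRegression()
xval_err = 0.0  # running sum of squared test errors over all folds

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    linreg.fit(X_train, y_train)   # lowercase y_train
    p = linreg.predict(X_test)
    e = p - y_test                 # lowercase y_test
    xval_err += np.dot(e, e)

# Assumption: normalise by the total number of test predictions, len(X).
rmse_cv = np.sqrt(xval_err / len(X))
print("cross-validated RMSE:", rmse_cv)
```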

Related

Split the dataset in train and test based on the group value?

I have the following dataset (after grouping the data by 'group_name', it looks like this):
I want to split the dataset into train and test sets based on the **group_name** feature. For example, for an 80:20 ratio, the train and test sets would look like this (again shown grouped):
Train Set:
Test Set:
Thus, the 80:20 ratio is respected in the above example. The examples shown are the results of the groupby function applied to the actual dataset.
Get the training set with
training = df.groupby('group_name').apply(lambda x: x.sample(frac=0.8))
Then get the testing set from the rows whose index was not sampled
testing = df.drop(training.index.get_level_values(1))
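The same idea can be sketched a little more directly with GroupBy.sample, which keeps the original row index. This assumes pandas 1.1 or later and a unique index; the toy frame and its 'value' column are hypothetical, only the 'group_name' column comes from the question.

```python
import pandas as pd

# Hypothetical toy data: two groups of different sizes.
df = pd.DataFrame({
    "group_name": ["a"] * 5 + ["b"] * 10,
    "value": range(15),
})

# Draw 80% of the rows of every group; the original index is preserved.
training = df.groupby("group_name").sample(frac=0.8, random_state=0)
# Every row that was not drawn becomes the test set.
testing = df.drop(training.index)

print(training["group_name"].value_counts().to_dict())
print(testing["group_name"].value_counts().to_dict())
```

Passing random_state makes the split reproducible; drop it if a fresh random split is wanted on each run.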
sklearn has several ways to do a train/test split; one of them is a stratified split.
Simply pass the grouping column as the stratify argument:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify = df['column_to_groupby'], random_state=0)
Each group will then be split according to the test_size you defined.
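Try the sklearn library. Depending on your purpose, the following splitters could be tried: GroupKFold or GroupShuffleSplit. Note that both keep all samples of a group on the same side of the split rather than splitting within a group.
Here is an example for GroupKFold: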
import numpy as np
from sklearn.model_selection import GroupKFold
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
groups = np.array([0, 0, 2, 2])
group_kfold = GroupKFold(n_splits=2)
group_kfold.get_n_splits(X, y, groups)
print(group_kfold)
for train_index, test_index in group_kfold.split(X, y, groups):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train, X_test, y_train, y_test)
For more detail, see the scikit-learn article "Visualizing cross-validation behavior in scikit-learn": https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#sphx-glr-auto-examples-model-selection-plot-cv-indices-py
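GroupShuffleSplit is mentioned above without an example, so here is a small sketch on made-up data (10 hypothetical groups of 2 samples each). With this splitter, test_size refers to the share of groups, not of rows, and a group never appears on both sides.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(40).reshape(20, 2)
y = np.arange(20)
groups = np.repeat(np.arange(10), 2)  # hypothetical: 10 groups, 2 rows each

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_index, test_index = next(gss.split(X, y, groups))

train_groups = set(groups[train_index])
test_groups = set(groups[test_index])
print("train groups:", sorted(train_groups))
print("test groups:", sorted(test_groups))
```

If the goal is instead to put 80% of each group's rows in the training set, a stratified split on the group column is the closer fit.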

Function for cross validation and oversampling (SMOTE)

I wrote the below code. X is a dataframe with the shape (1000,5) and y is a dataframe with shape (1000,1). y is the target data to predict, and it is imbalanced. I want to apply cross validation and SMOTE.
def Learning(n, est, X, y):
    s_k_fold = StratifiedKFold(n_splits = n)
    acc_scores = []
    rec_scores = []
    f1_scores = []
    for train_index, test_index in s_k_fold.split(X, y):
        X_train = X[train_index]
        y_train = y[train_index]
        sm = SMOTE(random_state=42)
        X_resampled, y_resampled = sm.fit_resample(X_train, y_train)
        X_test = X[test_index]
        y_test = y[test_index]
        est.fit(X_resampled, y_resampled)
        y_pred = est.predict(X_test)
        acc_scores.append(accuracy_score(y_test, y_pred))
        rec_scores.append(recall_score(y_test, y_pred))
        f1_scores.append(f1_score(y_test, y_pred))
    print('Accuracy:',np.mean(acc_scores))
    print('Recall:',np.mean(rec_scores))
    print('F1:',np.mean(f1_scores))
Learning(3, SGDClassifier(), X_train_s_pca, y_train)
When I run the code, I get the below error:
"None of [Int64Index([ 4231, 4235, 4246, 4250, 4255, 4295, 4317, 4344, 4381, 4387, ... 13122, 13123, 13124, 13125, 13126, 13127, 13128, 13129, 13130, 13131], dtype='int64', length=8754)] are in the [columns]"
Help to make it run is appreciated.
If you look carefully at the error stack trace (which is important, but you did not include it), you should see that the error comes from this line (and will come from the other similar lines):
X_train = X[train_index]
This way of selecting rows is only applicable to NumPy arrays. On a pandas DataFrame, X[...] selects columns, which is why the message says the values are not in the columns. Since the indices returned by split are positions, you should use iloc (loc gives the same result only when the DataFrame has the default 0..n-1 index):
X_train = X.iloc[train_index]
Alternatively, you can convert the DataFrames to NumPy arrays instead (to minimize code changes) by using values:
Learning(3, SGDClassifier(), X_train_s_pca.values, y_train.values)
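To illustrate the positional selection on its own, here is a small sketch with made-up data. The frame deliberately has a non-default index, as it would after an earlier shuffle or train_test_split. SMOTE and the estimator are left out so that the example only depends on pandas and scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-in data with index labels 100..111 instead of 0..11.
X = pd.DataFrame(np.arange(24).reshape(12, 2), columns=["a", "b"],
                 index=np.arange(100, 112))
y = pd.Series([0, 1] * 6, index=X.index)

skf = StratifiedKFold(n_splits=3)
fold_shapes = []
for train_index, test_index in skf.split(X, y):
    # split() yields positions (0..n-1), so select rows by position.
    X_train, y_train = X.iloc[train_index], y.iloc[train_index]
    X_test, y_test = X.iloc[test_index], y.iloc[test_index]
    fold_shapes.append((X_train.shape, X_test.shape))

print(fold_shapes)
```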

How to fix sklearn multiple linear regression ValueError in python (inconsistent numbers of samples: [2, 1])

I had my linear regression working perfectly with a single feature. Ever since trying to use two I get the following error: ValueError: Found input variables with inconsistent numbers of samples: [2, 1]
The first print statement is printing the following:
(2, 6497) (1, 6497)
Then the code crashes at the train_test_split phase.
Any ideas?
feat_scores = {}
X = df[['alcohol','density']].values.reshape(2,-1)
y = df['quality'].values.reshape(1,-1)
print (X.shape, y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)
reg = LinearRegression()
reg.fit(X_train, y_train)
reg.predict(y_train)
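The problem is in these lines:
X = df[['alcohol','density']].values.reshape(2,-1)
y = df['quality'].values.reshape(1,-1)
Don't reshape the data into (2, 6497) and (1, 6497). sklearn expects one row per sample, that is, shapes (6497, 2) and (6497,). Note that reshape(2,-1) does not transpose the array; it only regroups the same values in row-major order, so the two features would get mixed together.
sklearn accepts DataFrames and Series directly, so you could pass:
X = df[['alcohol','density']]
y = df['quality']
Also, predict takes feature values (X), not targets, hence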
reg.predict(X_train)
or
reg.predict(X_test)
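Put together, a minimal sketch of the corrected flow could look like this. The DataFrame here is randomly generated stand-in data with the same column names as in the question (100 rows instead of 6497), so the fitted model itself is meaningless; the point is only the shapes.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "alcohol": rng.uniform(8, 14, size=100),
    "density": rng.uniform(0.98, 1.01, size=100),
})
df["quality"] = rng.integers(3, 9, size=100)

X = df[["alcohol", "density"]]  # shape (n_samples, 2), no reshape needed
y = df["quality"]               # shape (n_samples,)
print(X.shape, y.shape)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

reg = LinearRegression()
reg.fit(X_train, y_train)
pred = reg.predict(X_test)      # predict from features, not from y
print(X_train.shape, X_test.shape, pred.shape)
```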

split into train and test by group+ sklearn cross_val_score

I have a dataframe in python as shown below:
data  labels  group
aa    1       x
bb    1       x
cc    2       y
dd    1       y
ee    3       y
ff    3       x
gg    3       z
hh    1       z
ii    2       z
It is straightforward to randomly split the data 70:30 into training and test sets. Here, I need to split it so that 70% of the data within each group goes to training and 30% of the data within each group goes to testing, then predict and find the accuracy on the test data within each group.
I find that cross_val_score does the splitting, model fitting and predicting with the call below:
>>> from sklearn.model_selection import cross_val_score
>>> model = LogisticRegression(random_state=0)
>>> scores = cross_val_score(model, data, labels, cv=5)
>>> scores
The documentation of cross_val_score has a groups parameter, which says:
groups : array-like, with shape (n_samples,), optional
Group labels for the samples used while splitting the dataset into
train/test set.
As stated above, I need 70% of the data within each group in training and 30% in testing, and then the accuracy on the test data within each group. Does using the groups parameter in the way shown below split the data within each group into training and test data and make the predictions?
>>> scores = cross_val_score(model, data, labels, groups= group, cv=5)
Any help is appreciated.
The stratify parameter of train_test_split takes the labels on which to stratify the selection to maintain proper class balance.
X_train, X_test, y_train, y_test = train_test_split(df['data'], df['labels'],stratify=df['group'])
On your toy dataset, it seems to be what you want, but I would try it on your full dataset and verify whether the classes are balanced by checking the counts of data in your train and test sets.
There is no way that I know of to do this directly with the function, but you could apply train_test_split to each group and then concatenate the splits with pd.concat, like:
def train_test_split_group(x):
    X_train, X_test, y_train, y_test = train_test_split(x['data'],x['labels'])
    return pd.Series([X_train, X_test, y_train, y_test], index=['X_train', 'X_test', 'y_train', 'y_test'])
final = df.groupby('group').apply(train_test_split_group).apply(lambda x: pd.concat(x.tolist()))
final['X_train'].dropna()
1 bb
3 dd
4 ee
5 ff
6 gg
7 hh
Name: X_train, dtype: object
To specify your train and validation sets in this way you will need to create a cross-validation object and not use the cv=5 argument to cross_val_score. The trick is you want to stratify the folds but not based on the classes in y, rather based on another column of data. I think you can use StratifiedShuffleSplit for this like the following.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4],
              [1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1])
groups_to_stratify = np.array([1,2,3,1,2,3,1,2,3,1,2,3])
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
sss.get_n_splits()
print(sss)
# Note groups_to_stratify is used in the split() function not y as usual
for train_index, test_index in sss.split(X, groups_to_stratify):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("TRAIN indices:", train_index,
          "train groups", groups_to_stratify[train_index],
          "TEST indices:", test_index,
          "test groups", groups_to_stratify[test_index])

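The question also asks for the accuracy within each group. One possible sketch, on randomly generated stand-in data, is to stratify a single 70:30 split on the group column and then score the test rows group by group. The feature columns f1 and f2, the binary labels and the choice of LogisticRegression are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical data: 3 groups of 30 rows, two numeric features, binary labels.
rng = np.random.default_rng(0)
n = 90
df = pd.DataFrame({
    "f1": rng.normal(size=n),
    "f2": rng.normal(size=n),
    "group": np.repeat(["x", "y", "z"], n // 3),
})
df["labels"] = (df["f1"] + df["f2"] > 0).astype(int)

# Stratifying on 'group' puts 70% of each group in train and 30% in test.
train, test = train_test_split(
    df, test_size=0.3, stratify=df["group"], random_state=0)

model = LogisticRegression(random_state=0)
model.fit(train[["f1", "f2"]], train["labels"])
test = test.assign(pred=model.predict(test[["f1", "f2"]]))

# Accuracy computed separately on the test rows of each group.
per_group_acc = {
    name: accuracy_score(part["labels"], part["pred"])
    for name, part in test.groupby("group")
}
print(per_group_acc)
```
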
TypeError: only length-1 arrays can be converted to Python scalars, while using Kfold cross Validation

I am trying to use KFold cross validation for my model, but I get this error while doing so. I know that KFold only accepts 1D arrays, but even after converting the length input to an array it still gives me this problem.
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import KFold
if __name__ == "__main__":
    np.random.seed(1335)
    verbose = True
    shuffle = False
    n_folds = 5
    y = np.array(y)
    if shuffle:
        idx = np.random.permutation(y.size)
        X_train = X_train[idx]
        y = y[idx]
    skf = KFold(y, n_folds)
    models = [RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='entropy')]
    print("Stacking in progress")
    A = []
    for j, clf in enumerate(models):
        print(j, clf)
        for i, (itrain, itest) in enumerate(skf):
            print("Fold :", i)
            x_train = X_train[itrain]
            x_test = X_train[itest]
            y_train = y[itrain]
            y_test = y[itest]
            print(x_train.shape, x_test.shape)
            print(len(x_train), len(x_test))
            clf.fit(x_train, y_train)
            pred = clf.predict_proba(x_test)
            A.append(pred)
I get the error for the line "skf = KFold(y, n_folds)". Any help with this will be appreciated.
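According to its documentation, KFold() in sklearn.model_selection does not expect y as an input, only the number of splits (n_splits). The sklearn.cross_validation module you are importing from was deprecated in scikit-learn 0.18 and removed in 0.20.
Once you have an instance of KFold, call myKfold.split(x) (x being all of your input data) to obtain an iterator yielding train and test indices. Example copied from the sklearn documentation:
>>> import numpy as np
>>> from sklearn.model_selection import KFold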
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4])
>>> kf = KFold(n_splits=2)
>>> kf.get_n_splits(X)
2
>>> print(kf)
KFold(n_splits=2, random_state=None, shuffle=False)
>>> for train_index, test_index in kf.split(X):
... print("TRAIN:", train_index, "TEST:", test_index)
... X_train, X_test = X[train_index], X[test_index]
... y_train, y_test = y[train_index], y[test_index]
TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1] TEST: [2 3]
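Applied to the stacking loop from the question, a sketch with the model_selection API might look like the following. The data comes from make_classification as a stand-in for the real X_train and y, and n_estimators is reduced to 10 only to keep the example fast; both are assumptions, not part of the original code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import KFold

# Hypothetical stand-in data for the question's X_train and y.
X, y = make_classification(n_samples=100, n_features=8, random_state=1335)

n_folds = 5
kf = KFold(n_splits=n_folds, shuffle=False)  # no y here, only the fold count
models = [
    RandomForestClassifier(n_estimators=10, criterion="gini", random_state=0),
    ExtraTreesClassifier(n_estimators=10, criterion="entropy", random_state=0),
]

A = []  # out-of-fold class probabilities, one array per (model, fold)
for j, clf in enumerate(models):
    # kf.split(X) replaces iterating over the old KFold object directly.
    for i, (itrain, itest) in enumerate(kf.split(X)):
        clf.fit(X[itrain], y[itrain])
        A.append(clf.predict_proba(X[itest]))

print(len(A), A[0].shape)
```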
