I have applied SVM on my dataset. My dataset is multi-label, meaning each observation has more than one label.
During KFold cross-validation it raises a "not in index" error.
It says the indices from 601 to 6007 are not in the index (I have 1...6008 data samples).
This is my code:
df = pd.read_csv("finalupdatedothers.csv")
categories = ['ADR', 'WD', 'EF', 'INF', 'SSI', 'DI', 'others']
X = df[['sentences']]
y = df[['ADR', 'WD', 'EF', 'INF', 'SSI', 'DI', 'others']]

kf = KFold(n_splits=10)
kf.get_n_splits(X)

for train_index, test_index in kf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    SVC_pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words=stop_words)),
        ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
    ])

    for category in categories:
        print('... Processing {}'.format(category))
        # train the model using X_dtm & y
        SVC_pipeline.fit(X_train['sentences'], y_train[category])
        prediction = SVC_pipeline.predict(X_test['sentences'])
        print('SVM Linear Test accuracy is {}'.format(accuracy_score(X_test[category], prediction)))
        print('SVM Linear f1 measurement is {}'.format(f1_score(X_test[category], prediction, average='weighted')))
        print([{X_test[i]: categories[prediction[i]]} for i in range(len(list(prediction)))])
Actually, I do not know how to apply KFold cross-validation in a way that gives me the F1 score and accuracy of each label separately.
Having looked at this and this, they did not help me figure out how to apply it successfully to my case.
For reproducibility, this is a small sample of the data frame.
The last seven columns are my labels, including ADR, WD, ...
,sentences,ADR,WD,EF,INF,SSI,DI,others
0,"extreme weight gain, short-term memory loss, hair loss.",1,0,0,0,0,0,0
1,I am detoxing from Lexapro now.,0,0,0,0,0,0,1
2,I slowly cut my dosage over several months and took vitamin supplements to help.,0,0,0,0,0,0,1
3,I am now 10 days completely off and OMG is it rough.,0,0,0,0,0,0,1
4,"I have flu-like symptoms, dizziness, major mood swings, lots of anxiety, tiredness.",0,1,0,0,0,0,0
5,I have no idea when this will end.,0,0,0,0,0,0,1
Update
When I did what Vivek Kumar said, it raises the error
ValueError: Found input variables with inconsistent numbers of samples: [1, 5408]
in the classifier part. Do you have any idea how to resolve it?
There are a couple of links about this error on Stack Overflow which say I need to reshape the training data. I did that as well, but with no success (link).
Thanks :)
train_index and test_index are integer indices based on the number of rows, but pandas indexing doesn't work like that; newer versions of pandas are more strict in how you slice or select data from them.
You need to use .iloc to access the data. More information is available here.
This is what you need:
for train_index, test_index in kf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    ...
    ...
    # TfidfVectorizer doesn't work with a DataFrame,
    # because iterating a DataFrame yields the column names, not the actual data.
    # So specify the column name explicitly to get the sentences.
    SVC_pipeline.fit(X_train['sentences'], y_train[category])
    prediction = SVC_pipeline.predict(X_test['sentences'])
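To also get the accuracy and F1 score of each label separately, note that the metric calls should compare the predictions against y_test, not X_test (X_test only contains the 'sentences' column). A minimal sketch of the corrected inner loop, reusing the pipeline and categories defined in the question:

from sklearn.metrics import accuracy_score, f1_score

for category in categories:
    print('... Processing {}'.format(category))
    # one binary classifier per label, trained on the raw sentences
    SVC_pipeline.fit(X_train['sentences'], y_train[category])
    prediction = SVC_pipeline.predict(X_test['sentences'])
    # compare against the label column of y_test, not X_test
    print('SVM Linear Test accuracy is {}'.format(accuracy_score(y_test[category], prediction)))
    print('SVM Linear f1 measurement is {}'.format(f1_score(y_test[category], prediction, average='weighted')))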
Related
I hope this message finds you well. I have been working with a dataframe and had to remove the rows which contained any null values. I used the following command to delete such rows:
df.dropna(axis=0,how="any",inplace=True)
Then when I apply k-fold cross validation like this:
# Using k-fold cross-validation
from sklearn.model_selection import KFold, cross_val_predict

kf = KFold(shuffle=True, random_state=42, n_splits=5)
for train_index, test_index in kf.split(X):
    X_train, X_test, y_train, y_test = (X.iloc[train_index, :],
                                        X.iloc[test_index, :],
                                        y[train_index],
                                        y[test_index])
I face the following error:
KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Int64Index([ 0, 149, 151, 156, 157,\n ...\n 26474, 26987, 27075, 27157, 27345],\n dtype='int64', length=1764). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
I do not know how to fix this. It's probably giving me an error because those rows no longer exist, and I probably have to reindex the frame again, starting from zero, so it has a proper index. I do not know how to do that. Can anyone suggest a good approach? Thanks.
What I think you want is:
for train_index, test_index in kf.split(X):
    X_train, X_test, y_train, y_test = (X.iloc[train_index],
                                        X.iloc[test_index],
                                        y.iloc[train_index],
                                        y.iloc[test_index])
I think your problem comes from the fact that you are using the positional indices generated by kf.split(X) as index labels in y[train_index] and y[test_index]. Your original code could only work, by chance, if those positions happened to coincide with the index labels of the X and y DataFrames.
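If you would rather keep label-based indexing working, a minimal alternative sketch is to reset the index right after dropna, so that the positions generated by kf.split line up with the index labels again (assuming X and y are derived from df):

# rebuild a clean 0..n-1 index after dropping rows
df.dropna(axis=0, how="any", inplace=True)
df.reset_index(drop=True, inplace=True)
# re-derive X and y from the reindexed frame; y[train_index] now works,
# although y.iloc[train_index] remains the safer choice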
I'm new to this field and I'm currently working with gene expression data. I have to do a classification where my data are counts in matrix form. The features are the genes and the samples to classify are the patients (7 types of cancer and healthy donors). The book whose experiment I'm replicating says the following:
For the multi-class SVM classification algorithm, a One-Versus-One (OVO) approach was used. To cross validate the algorithm for all samples in the training cohort, the SVM algorithm was trained by all samples in the training cohort minus one, while the remaining sample was used for (blind) classification. This process was repeated for all samples until each sample was predicted once (leave-one-out cross-validation [LOOCV] procedure).
Now, I actually know how to use LOOCV in Python, and I know how to use OVO from looking online. But I don't get what is meant to be done here. I made an attempt and the results came out quite similar, but I'm pretty sure I'm making a horrible mistake somewhere. Please don't flame me, I need help. Here below is my interpretation (I copied this from the internet and added OVO instead of only SVM):
from sklearn.model_selection import LeaveOneOut
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

# Function for training
def loocv(train_X, train_y):
    # define X and y
    X = train_X
    y = train_y
    # define LOOCV
    loo = LeaveOneOut()
    loo.get_n_splits(X)
    # define true and predicted lists
    y_true, y_pred = [], []
    # run
    for train_index, test_index in loo.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        model = SVC(kernel='linear', random_state=0)
        ovo_classifier = OneVsOneClassifier(model)
        ovo_classifier.fit(X_train, y_train)
        yhat = ovo_classifier.predict(X_test)
        y_true.append(y_test[0])
        y_pred.append(yhat[0])
    return y_true, y_pred, ovo_classifier
Validation:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=0)
y_true,y_pred,model = loocv(X_train,y_train)
pred_y = model.predict(X_test)
training_accuracy = accuracy_score(y_true,y_pred)
accuracy = accuracy_score(y_test,pred_y)
print(accuracy)
print(training_accuracy)
Results:
0.6918604651162791
0.6658291457286432
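A plausible reading of the book's description is that the LOOCV runs over the entire training cohort, with no extra train_test_split, so that every sample is blindly predicted exactly once. A minimal sketch of that interpretation (assuming X and y are the full training cohort as numpy arrays):

from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

# each sample is held out once and predicted by a model
# trained on all the remaining samples
model = OneVsOneClassifier(SVC(kernel='linear', random_state=0))
y_pred = cross_val_predict(model, X, y, cv=LeaveOneOut())
print(accuracy_score(y, y_pred))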
I am new to H2O. So far for the train-test split I have used the StratifiedKFold() of sklearn.
skf = StratifiedKFold(n_splits=n, random_state=None, shuffle=False)
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
I need the indexes for some further processing later.
In H2O I can't figure out how to get the indexes while doing cross validation. From what I have gathered via videos and blogs, this is how we do CV in H2O:
gbm_model = H2OGradientBoostingEstimator(model_id = 'gbm_model',nfolds=5)
How do I get the train and test indexes of each fold?
Also, how do I get the indexes while doing a simple split?
data_split = data.split_frame(ratios=[0.8],seed = 1234)
train_df = data_split[0]
test_df = data_split[1]
How do I get the indexes that went into train and test?
You could use stratified_kfold_column(n_folds=3, seed=-1) or stratified_split(test_frac=0.2, seed=-1), which create a column with the fold/split assignments that you can later use to subset the frame.
See more about these in the docs.
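A minimal sketch of how those two methods might be used; the frame data, the response column 'y', and the predictor list predictors are placeholders, not tested against your data:

# a stratified fold column for CV: pass it to train() via fold_column
data["fold"] = data["y"].stratified_kfold_column(n_folds=5, seed=1234)
gbm_model = H2OGradientBoostingEstimator(model_id='gbm_model')
gbm_model.train(x=predictors, y='y', training_frame=data, fold_column='fold')

# a stratified single split: the returned column holds "train"/"test" labels
split = data["y"].stratified_split(test_frac=0.2, seed=1234)
train_df = data[split == "train"]
test_df = data[split == "test"]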
I have a dataset with 155 features and 40143 samples. It is sorted by date (oldest to newest); I then deleted the date column from the dataset.
The label is in the first column.
Cross-validation results are around 65% (mean accuracy of the scores, +/- 0.01) with the code below:
def cross(dataset):
    dropz = ["result"]
    X = dataset.drop(dropz, axis=1)
    X = preprocessing.normalize(X)
    y = dataset["result"]
    clf = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1)
    scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
Also I get similar accuracy with the code below:
def train(dataset):
    dropz = ["result"]
    X = dataset.drop(dropz, axis=1)
    X = preprocessing.normalize(X)
    y = dataset["result"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000, random_state=42)
    clf = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1).fit(X_train, y_train)
    clf.score(X_test, y_test)
But if I don't shuffle in the code below, the result is around 49%; if I do shuffle, it is around 65%.
I should mention that I tried every 1000-sample consecutive split of the whole set, from end to beginning, and the result is the same.
dataset = pd.read_csv("./dataset.csv", header=0,sep=";")
dataset = shuffle(dataset) #!!!???
X_train = dataset.iloc[:-1000,1:]
X_train = preprocessing.normalize(X_train)
y_train = dataset.iloc[:-1000,0]
X_test = dataset.iloc[-1000:,1:]
X_test = preprocessing.normalize(X_test)
y_test = dataset.iloc[-1000:,0]
clf = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1).fit(X_train, y_train)
clf.score(X_test, y_test)
Assuming your question is "Why does it happen":
In both your first and second code snippets the training data is drawn from across the whole time range (the cross-validation folds interleave past and future data, and train_test_split shuffles by default), so they are effectively equivalent, both in score and in setup, to your last snippet with shuffling "on".
Since your original dataset is ordered by date, there is likely some data that changes over time. Because your classifier never sees data from the last 1000 time points, it is unaware of the change in the underlying distribution and therefore fails to classify it.
Addendum, to address the further data given in the comment:
This suggests that there might be some indicative process that is captured in smaller time frames. Two interesting ways to explore it (see also the sketch after this list):
reduce the size of the test set until you find a window size at which the difference between shuffle/no shuffle is negligible;
this process essentially manifests as some dependence between your features, so you could check whether, within a small time frame, there is a dependence between your features.
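One way to probe this without ever training on future data is scikit-learn's TimeSeriesSplit, where each fold trains on an initial segment and tests on the block that follows it. A minimal sketch, assuming X and y are still in time order:

from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1)
# every test block lies strictly after its training data,
# mimicking the "predict the future" setting of the last snippet
tscv = TimeSeriesSplit(n_splits=10)
scores = cross_val_score(clf, X, y, cv=tscv, scoring='accuracy')
print(scores.mean(), scores.std())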
I need to split my data into a training set (75%) and test set (25%). I currently do that with the code below:
X, Xt, userInfo, userInfo_train = sklearn.cross_validation.train_test_split(X, userInfo)
However, I'd like to stratify my training dataset. How do I do that? I've been looking into the StratifiedKFold method, but it doesn't let me specify the 75%/25% split and only stratify the training dataset.
[update for 0.17]
See the docs of sklearn.model_selection.train_test_split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
stratify=y,
test_size=0.25)
[/update for 0.17]
There is a pull request here.
But you can simply do train, test = next(iter(StratifiedKFold(...)))
and use the train and test indices if you want.
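With the current model_selection API, that one-liner would look roughly like this (a sketch; X and y are assumed to be numpy arrays):

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=4)  # 4 folds give a 75%/25% split
train_index, test_index = next(skf.split(X, y))
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]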
TL;DR : Use StratifiedShuffleSplit with test_size=0.25
Scikit-learn provides two modules for Stratified Splitting:
StratifiedKFold : This module is useful as a direct k-fold cross-validation operator: it will set up n_folds training/testing sets such that classes are equally balanced in both.
Here's some code (directly from the above documentation):
>>> skf = cross_validation.StratifiedKFold(y, n_folds=2) #2-fold cross validation
>>> len(skf)
2
>>> for train_index, test_index in skf:
... print("TRAIN:", train_index, "TEST:", test_index)
... X_train, X_test = X[train_index], X[test_index]
... y_train, y_test = y[train_index], y[test_index]
... #fit and predict with X_train/test. Use accuracy metrics to check validation performance
StratifiedShuffleSplit : This module creates a single training/testing set with equally balanced (stratified) classes. Essentially, this is what you want with n_iter=1. You can specify the test size here, the same as in train_test_split.
Code:
>>> sss = StratifiedShuffleSplit(y, n_iter=1, test_size=0.5, random_state=0)
>>> len(sss)
1
>>> for train_index, test_index in sss:
... print("TRAIN:", train_index, "TEST:", test_index)
... X_train, X_test = X[train_index], X[test_index]
... y_train, y_test = y[train_index], y[test_index]
>>> # fit and predict with your classifier using the above X/y train/test
You can simply do it with the train_test_split() method available in scikit-learn:
from sklearn.model_selection import train_test_split
train, test = train_test_split(X, test_size=0.25, stratify=X['YOUR_COLUMN_LABEL'])
I have also prepared a short GitHub Gist which shows how stratify option works:
https://gist.github.com/SHi-ON/63839f3a3647051a180cb03af0f7d0d9
Here's an example for continuous/regression data (until this issue on GitHub is resolved).
import numpy as np
from sklearn.model_selection import train_test_split

# use y_min/y_max rather than shadowing the built-in min/max
y_min = np.amin(y)
y_max = np.amax(y)

# 5 bins may be too few for larger datasets.
bins = np.linspace(start=y_min, stop=y_max, num=5)
y_binned = np.digitize(y, bins, right=True)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    stratify=y_binned
)
Where start is min and stop is max of your continuous target.
If you don't set right=True, it will more or less make your max value a separate bin, and your split will always fail because too few samples land in that extra bin.
In addition to the accepted answer by @Andreas Mueller, I just want to add that, as @tangy mentioned above:
StratifiedShuffleSplit most closely resembles train_test_split(stratify = y)
with added features of:
stratify by default
by specifying n_splits, it repeatedly splits the data (see the sketch below)
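For instance, the repeated-split behaviour could look like this (a minimal sketch; X and y are assumed to be numpy arrays):

from sklearn.model_selection import StratifiedShuffleSplit

# five independent stratified 75%/25% splits of the same data
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]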
StratifiedShuffleSplit is used after we choose the column that should be evenly represented in all the subsets we are about to generate.
'The folds are made by preserving the percentage of samples for each class.'
Suppose we've got a dataset 'data' with a column 'season' and we want to get an even representation of 'season'; then it looks like this:
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
for train_index, test_index in sss.split(data, data["season"]):
    sss_train = data.iloc[train_index]
    sss_test = data.iloc[test_index]
As such, it is desirable to split the dataset into train and test sets in a way that preserves the same proportions of examples in each class as observed in the original dataset.
This is called a stratified train-test split.
We can achieve this by setting the “stratify” argument to the y component of the original dataset. This will be used by the train_test_split() function to ensure that both the train and test sets have the proportion of examples in each class that is present in the provided “y” array.
# train_size is 1 - tst_size - vld_size
tst_size = 0.15
vld_size = 0.15

# assuming the target column is named 'y'; stratify both splits on the labels
X_train_test, X_valid, y_train_test, y_valid = train_test_split(
    df.drop('y', axis=1), df['y'],
    test_size=vld_size, stratify=df['y'], random_state=13903)

X_train_test_V = pd.DataFrame(X_train_test)
X_valid = pd.DataFrame(X_valid)

X_train, X_test, y_train, y_test = train_test_split(
    X_train_test, y_train_test,
    test_size=tst_size, stratify=y_train_test, random_state=13903)
Updating #tangy answer from above to the current version of scikit-learn: 0.23.2 (StratifiedShuffleSplit documentation).
from sklearn.model_selection import StratifiedShuffleSplit

n_splits = 1  # we only want a single split in this case
sss = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.25, random_state=0)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]