Splitting dataset into two non-redundant numpy arrays? - python

I have a numpy array "my_data". I am trying to split this dataset randomly. However, when I do this using the following code, I get a "train" array and a "test" array that have some rows in common.
training_idx = np.random.randint(my_data.shape[0], size=split_size)
test_idx = np.random.randint(my_data.shape[0], size=len(my_data)-split_size)
train, test = my_data[training_idx,:], my_data[test_idx,:]
My intention is to build the train array randomly first, and then have whatever rows of my_data that are not in the train array become the test array.
Is there a way in numpy to do so? (I am refraining from using sklearn to split my data.)
I referred to this post to get this far with my dataset:
How to split/partition a dataset into training and test datasets for, e.g., cross validation?
If I code per this post's logic I end up getting train and test datasets that share some redundant rows. I intend to make train and test datasets where no rows are common.

Following this answer you can do:
train_idx = np.random.choice(my_data.shape[0], size=split_size, replace=False)  # sample row indices without repeats
mask = np.ones(my_data.shape[0], dtype=bool)  # one boolean per row, not per element
mask[train_idx] = False
train, test = my_data[~mask], my_data[mask]
A more natural way, though, would be to slice a permutation of your data, as Poojan suggested.
permuted = np.random.permutation(my_data)
train, test = permuted[:split_size], permuted[split_size:]
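As a quick sanity check, here is a minimal sketch (with made-up toy data) verifying that the two parts obtained from the permutation share no rows:
import numpy as np

my_data = np.arange(20).reshape(10, 2)   # toy dataset with 10 unique rows
split_size = 7

permuted = np.random.permutation(my_data)
train, test = permuted[:split_size], permuted[split_size:]

# every row ends up in exactly one of the two sets
train_rows = {tuple(row) for row in train}
assert all(tuple(row) not in train_rows for row in test)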

Related

Can I use StandardScaler() on whole data set, or should I calculate on train and test sets separately?

I'm developing an SVR for ~100 continuous features and a continuous label.
For scaling the data, I wrote:
import pandas as pd
from sklearn.preprocessing import StandardScaler

#Read in
df = pd.read_csv(data_path, sep='\t')
features = df.iloc[:, 1:-1]  #100 features
target = df.iloc[:, -1]      #The label
names = df.iloc[:, 0]        #Column names
#Scale features
scaler = StandardScaler()
scaled_df = scaler.fit_transform(features)
# restore the column names (fit_transform returns a numpy array)
scaled_df = pd.DataFrame(scaled_df, columns=features.columns)
So now I have a scaled data frame, and my next step was to split into train and test, and then develop a model (SVR):
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

X_train, X_test, y_train, y_test = train_test_split(scaled_df, target, test_size=0.2)
model = SVR()
...and then I fit the model to the data.
But I noticed that other people don't fit StandardScaler() on the whole data frame: they split the dataframe into train and test first, and then apply StandardScaler() to each part separately.
Is there a difference between whether you apply the StandardScaler to the whole data frame, or train and test separately?
The previous answer says that you should scale the training and test sets separately, otherwise the test set might bias the transformation of the training set. This is half correct and half wrong.
If you do the transformation separately, it may well be that the test set gets scaled to the wrong proportions (e.g. if it comes from a narrow continuous time range and therefore covers only a subset of the overall value range). You would end up with wrong values for the variables of the test set.
In general, what you must do is fit the scaling on the training set and transfer that scale to the test set. This is done by using the fit and transform methods separately, as seen in the documentation.
You need to fit StandardScaler on the training set only, to prevent the distribution of the test set from leaking into the model. If you fit the scaler on the full dataset before splitting, information from the test set is used to transform the training set and therefore leaks into the trained model.
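As a minimal sketch of the recommended pattern (reusing the variable names from the question, with SVR as the example estimator): fit the scaler on the training split only, then apply the same transformation to the test split.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics computed from the training set only
X_test_scaled = scaler.transform(X_test)        # the same statistics reused for the test set

model = SVR()
model.fit(X_train_scaled, y_train)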

dividing big dataset python

My dataset's features have shape (80102, 2592) and labels have shape (80102, 2). I want to use only a few rows for training, as training the CNN model is taking a lot of time. How can I divide the dataset in Python and keep only a few rows for both training and testing?
If your data is in the form of arrays, let X be the array containing the data and y be the array containing the labels. You can use sklearn's train_test_split function to create new samples of the data, per the code below:
from sklearn.model_selection import train_test_split

percent = .1  # the fraction of data you want to use, in this case 10%
X_data, X_dummy, y_labels, y_dummy = train_test_split(X, y, train_size=percent, random_state=123, shuffle=True)
X_data will contain 10% of the original data and will be shuffled.
y_labels will contain the corresponding 10% of the labels.
If you want to specify the number of samples exactly, set train_size to an integer value (an example follows the pandas snippet below). If you need further information, the documentation is located here. If your data is a pandas dataframe, you can use the pandas function pandas.DataFrame.sample. Documentation for that is here. Assume your data frame is called data. The code below will produce a new data frame with a specified fraction of the original rows:
percent = .1
new_data = data.sample(n=None, frac=percent, replace=False, weights=None, random_state=123, axis=0)
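Returning to train_test_split: as mentioned above, you can pass an integer instead of a fraction to select an exact number of rows. A small sketch, assuming X and y are the arrays from before:
# keep exactly 5000 rows (and the matching labels) for the experiment
X_small, X_rest, y_small, y_rest = train_test_split(X, y, train_size=5000, random_state=123, shuffle=True)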

Generate Test data using TfIdfVectorizer

I have separated my data into train and test parts. My data table has a 'text' column; consider that I have ten other columns representing numerical features. I used TfidfVectorizer on the training data to generate the term matrix and combined it with the numerical features to create the training dataframe.
tfidf_vectorizer=TfidfVectorizer(use_idf=True, max_features=5000, max_df=0.95)
tfidf_vectorizer_train = tfidf_vectorizer.fit_transform(X_train['text'].values)
df1_tfidf_train = pd.DataFrame(tfidf_vectorizer_train.toarray(), columns=tfidf_vectorizer.get_feature_names())
df2_train = df_main_ques.iloc[train_index][traffic_metrics]#to collect numerical features
df_combined_train = pd.concat([df1_tfidf_train, df2_train], axis=1)
To calculate the tf-idf scores for the test part I need to reuse the vectorizer fitted on the training data, but I am not sure how to generate the test data part.
Related post:
[1] Append tfidf to pandas dataframe: discusses only creating the training dataset.
[2] How does TfidfVectorizer compute scores on test data: discusses the test data part, but it is not clear how to generate a test dataframe that contains both the terms and the numerical features.
You can use the transform method of the trained vectorizer to transform your test data; this reuses the vocabulary and IDF weights already learned from the training set to generate the TF-IDF scores for the test set:
tfidf_vectorizer_test = tfidf_vectorizer.transform(X_test['text'].values)
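From there, the test dataframe can be assembled the same way as the training one. A sketch that mirrors the code in the question (df_main_ques, test_index and traffic_metrics are assumed to exist as above):
tfidf_vectorizer_test = tfidf_vectorizer.transform(X_test['text'].values)
df1_tfidf_test = pd.DataFrame(tfidf_vectorizer_test.toarray(), columns=tfidf_vectorizer.get_feature_names())
df2_test = df_main_ques.iloc[test_index][traffic_metrics]  # numerical features for the test rows
df_combined_test = pd.concat([df1_tfidf_test.reset_index(drop=True), df2_test.reset_index(drop=True)], axis=1)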

using x.reshape on a 1D array in sklearn

I tried to use sklearn's simple decision tree classifier, and it complained that using a 1D array is now deprecated and that I must use X.reshape(1,-1). So I did, but it turned my labels list into a list of lists with only one element, so the number of labels and samples no longer match. In other words, my list of labels=[0,0,1,1] turns into [[0 0 1 1]]. Thanks
This is the simple code that I used:
from sklearn import tree
import numpy as np
features =[[140,1],[130,1],[150,0],[170,0]]
labels=[0,0,1,1]
labels = np.array(labels).reshape(1,-1)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features,labels)
print(clf.predict([150,0]))
You are reshaping the wrong thing. Reshape the data your are predicting on, not your labels.
>>> clf.predict(np.array([150,0]).reshape(1,-1))
array([1])
Your labels have to align with your training data (features), so the length of both arrays should be the same. If labels is reshaped, you are right: it becomes a list of lists of length 1, which does not equal the length of your features.
You have to reshape your test data because prediction needs an array that looks like your training data, i.e. each entry needs to be an example with the same number of features as in training. You'll see that the following two commands return a list of lists and just a list, respectively.
>>> np.array([150,0]).reshape(1,-1)
array([[150, 0]])
>>> np.array([150,0])
array([150, 0])
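Putting it together, a minimal sketch of the corrected script: keep the labels 1-D and reshape only the sample being predicted.
from sklearn import tree
import numpy as np

features = np.array([[140, 1], [130, 1], [150, 0], [170, 0]])
labels = np.array([0, 0, 1, 1])          # 1-D: one label per training example

clf = tree.DecisionTreeClassifier()
clf.fit(features, labels)

print(clf.predict(np.array([150, 0]).reshape(1, -1)))  # -> [1]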

How to generate a custom cross-validation generator in scikit-learn?

I have an unbalanced dataset, so I have a strategy for oversampling that I only apply during training. I'd like to use scikit-learn classes like GridSearchCV or cross_val_score to explore or cross-validate some parameters on my estimator (e.g. SVC). However, I see that you either pass the number of cv folds or a standard cross-validation generator.
I'd like to create a custom cv generator so that I get a stratified 5-fold split, oversample only my training data (4 folds), and let scikit-learn search through the grid of parameters of my estimator, scoring on the remaining fold for validation.
The cross-validation generator returns an iterable of length n_folds, each element of which is a 2-tuple of numpy 1-d arrays (train_index, test_index) containing the indices of the training and test sets for that cross-validation run.
So for 10-fold cross-validation, your custom cross-validation generator needs to contain 10 elements, each of which contains a tuple with two elements:
An array of the indices for the training subset for that run, covering 90% of your data
An array of the indices for the testing subset for that run, covering 10% of the data
I was working on a similar problem in which I created integer labels for the different folds of my data. My dataset is stored in a Pandas dataframe myDf which has the column cvLabel for the cross-validation labels. I construct the custom cross-validation generator myCViterator as follows:
myCViterator = []
for i in range(nFolds):
    trainIndices = myDf[myDf['cvLabel'] != i].index.values.astype(int)
    testIndices = myDf[myDf['cvLabel'] == i].index.values.astype(int)
    myCViterator.append((trainIndices, testIndices))
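The resulting list of (train_indices, test_indices) tuples can then be passed directly as the cv argument. A sketch, assuming X and y hold the features and labels of the rows in myDf and SVC is the example estimator:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

scores = cross_val_score(SVC(), X, y, cv=myCViterator)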
I had a similar problem and this quick hack is working for me:
import numpy as np
from sklearn.model_selection import StratifiedKFold

class UpsampleStratifiedKFold:
    def __init__(self, n_splits=3):
        self.n_splits = n_splits

    def split(self, X, y, groups=None):
        for rx, tx in StratifiedKFold(n_splits=self.n_splits).split(X, y):
            nix = np.where(y[rx] == 0)[0]  # majority-class positions within the training fold
            pix = np.where(y[rx] == 1)[0]  # minority-class positions within the training fold
            pixu = np.random.choice(pix, size=nix.shape[0], replace=True)  # resample minority with replacement
            ix = np.append(nix, pixu)
            rxm = rx[ix]
            yield rxm, tx

    def get_n_splits(self, X, y, groups=None):
        return self.n_splits
This upsamples (with replacement) the minority class to give a balanced (k-1)-fold training set, but leaves the k-th test fold unbalanced. This appears to play well with sklearn.model_selection.GridSearchCV and other similar classes requiring a CV generator.
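For instance, a minimal sketch of plugging it into a grid search (assuming X and y are numpy arrays with a binary 0/1 label, as the class expects):
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

grid = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]}, cv=UpsampleStratifiedKFold(n_splits=5))
grid.fit(X, y)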
Scikit-Learn provides a workaround for this, with their Label k-fold iterator:
LabelKFold is a variation of k-fold which ensures that the same label is not in both testing and training sets. This is necessary for example if you obtained data from different subjects and you want to avoid over-fitting (i.e., learning person specific features) by testing and training on different subjects.
To use this iterator in a case of oversampling, first, you can create a column in your dataframe (e.g. cv_label) which stores the index values of each row.
df['cv_label'] = df.index
Then, you can apply your oversampling, making sure you copy the cv_label column in the oversampling as well. This column will contain duplicate values for the oversampled data. You can create a separate series or list from these labels for handling later:
cv_labels = df['cv_label']
Be aware that you will need to remove this column from your dataframe before running your cross-validator/classifier.
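That separation might look like the following sketch (the target column name 'label' is only an assumption; adjust it to your dataframe):
labels = df['label']                               # hypothetical target column
features = df.drop(columns=['cv_label', 'label'])  # drop the helper column before training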
After separating your data into features (not including cv_label) and labels, you create the LabelKFold iterator and run the cross validation function you need with it:
from sklearn import svm, cross_validation
from sklearn.cross_validation import LabelKFold  # older scikit-learn API, as used in this answer

clf = svm.SVC(C=1)
lkf = LabelKFold(cv_labels, n_folds=5)
predicted = cross_validation.cross_val_predict(clf, features, labels, cv=lkf)
import numpy as np

class own_custom_CrossValidator:  # like those in sklearn/model_selection/_split.py
    def __init__(self):  # coordinates, meter
        pass  # self.coordinates = coordinates; self.meter = meter

    def split(self, X, y=None, groups=None):
        # for compatibility with cross_val_predict / cross_val_score, yield (train, test) index
        # pairs; the original yield was truncated, so this completion is only a guess
        all_indices = np.array(list(range(0, len(X))))
        for i in range(0, len(X)):
            yield all_indices, np.array([i])
