I have a training dataset with more than 20,000 instances, split into 3 classes with a distribution of roughly A=10%, B=20%, C=70%. Is there a way in sklearn, pandas, or anything else to take a 10% sample of this data while respecting the class distribution? I need to run a grid search on the data, but the original dataset is too large and high-dimensional (20,000 instances x 12,000 features).
The train_test_split will keep the distribution but it only splits the entire dataset into two sets, which are still too large.
Thanks
You should use StratifiedKFold. The folds are made by preserving the percentage of samples for each class.
See the documentation for using it.
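A minimal sketch of how that could look for pulling one ~10% stratified slice, assuming X and y are NumPy arrays holding the features and labels from the question:

from sklearn.model_selection import StratifiedKFold

# With 10 folds, each test fold is a ~10% slice that preserves the class ratios.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
_, sample_idx = next(skf.split(X, y))
X_sample, y_sample = X[sample_idx], y[sample_idx]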
The train_test_split function allows you to define the size of the split:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
See the docs
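If you just need a single stratified 10% subset rather than folds, a sketch of one option (again assuming X and y from the question) is to combine test_size with stratify and keep only the small piece:

from sklearn.model_selection import train_test_split

# The 10% "test" piece is the stratified sample we want; the 90% piece is discarded here.
_, X_sample, _, y_sample = train_test_split(X, y, test_size=0.1, stratify=y, random_state=42)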
Hi everyone,
I am doing a binary classification on a huge dataset (190 columns, 500K records). The target values are 0 and 1. However, when I do the oversampling with SMOTE, new target values in the y-vector are created (0, 1, 2 for example). I do not know how to avoid that. I would like to know how to limit the oversampling values to 0 and 1.
This is what I am doing:
from imblearn.over_sampling import SMOTE

over = SMOTE(sampling_strategy='minority')  # oversample only the minority class
X, y = over.fit_resample(X, y)
y's vector type is int64.
Also, I am modeling with an ANN using Keras. The dataset available to me is split into three parts: train, val and test, for training, validation and testing purposes. I wonder which one should be OVERsampled (or UNDERsampled), or whether I should OVERsample (or UNDERsample) BEFORE splitting.
# First carve out 20% of the data for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20)
# Then carve 15% of the remaining training data out as a test set
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.15)
WARNING: my model will be evaluated later on using a much larger dataset (1000K records) which is the REAL test dataset from the company I am working for, and whose target values are UNKNOWN for me and my model.
Thanx
Johnny
Since you have 190 columns and 500K records, SMOTE will not work well: SMOTE is essentially an interpolation technique, so it can suffer from the curse of dimensionality. OVERsampling and UNDERsampling approaches also have their own limitations. I would prefer to use class weights. Also, make sure that all data pre-processing is performed separately on the train and test data. Here is a paper outlining such a framework: https://www.sciencedirect.com/science/article/pii/S2666827022000585
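A minimal sketch of the class-weight approach, assuming the Keras model and the X_train/y_train/X_val/y_val splits from the question (the epoch count and batch size are placeholders):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Weight each class inversely proportional to its frequency in the training split.
classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weights = dict(zip(classes, weights))

# Keras applies the weights inside the loss, so no resampling of the data is needed.
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=20, batch_size=256,
          class_weight=class_weights)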
I have a dataset of drugs, their associated chemical features, and whether they are "responsive" or "unresponsive". I need to ensure that once I split the dataset into train and test sets, both have the same proportion of responsive to unresponsive. I know how to randomly split the data so that training is 80% and test is 20%. I'm not sure how to do the stratified sampling needed here; is this what I'm meant to use: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html?
The train_test_split function already has a parameter that lets you keep the class proportions of y. The parameter is stratify, and it is defined in the documentation as "If not None, data is split in a stratified fashion, using this as the class labels".
An example of code would be:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)  # 80% train / 20% test
I have a text classification task based on documents, where I expect the classes to be related to word frequencies. Because of the specific nature of my application, where I have a corpus that will grow over time and I want to classify new documents as they arrive, I have used FeatureHasher rather than the existing TfidfVectorizer (which both vectorizes and applies the IDF adjustment), since the vocabulary size can grow with new documents.
As discussed here for instance (https://stats.stackexchange.com/questions/154660/tfidfvectorizer-should-it-be-used-on-train-only-or-traintest), it seems correct to me that the term frequencies for TF-IDF should be calculated relative to the train set only and then used to rescale the test set, rather than first rescaling the entire corpus and then splitting. This is because using the test dataset for the frequency calculations violates the principle that you shouldn't use that information during training.
Let's assume you start with a matrix X of raw term frequencies (not adjusted yet) and y, a vector of classes. The typical order that many code examples show is:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
vec = TfidfTransformer()
#rescale X by its own frequencies, then split
X = vec.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
#...now fit a model
but the correct thing should be the following:
vec = TfidfTransformer()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
#store rescaling based on X_train frequencies alone
vec.fit(X_train)
#rescale each (transform) with the same fitted transformer
X_train = vec.transform(X_train)
X_test = vec.transform(X_test)
#...now fit a model
Okay, now the main question: I want to conduct some kind of cross-validation, perhaps with GridSearchCV, where I can feed it a set of potential model parameters and have it perform several splits of the data for each one. The typical way to do this is to build a model pipeline and then feed it into the cross-validation utility. Since pipelines are somewhat of a black box whose details are hard to inspect, I just wanted to verify that, if TfidfTransformer is included as a step in the pipeline, it does the adjustment correctly, as I've described above, by fitting only on the training data of each split.
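Yes: when a Pipeline is passed to GridSearchCV, the whole pipeline is re-fit on the training portion of each split, so TfidfTransformer is fitted on training-fold frequencies only and then used to transform the held-out fold. A sketch of how that could look (the parameter grid values are just illustrative):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier()),
])

# For each parameter combination and each CV split, the pipeline is cloned,
# fitted on the training fold only, then scored on the held-out fold.
param_grid = {
    'tfidf__use_idf': [True, False],
    'clf__n_estimators': [100, 300],
}

search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
search.fit(X, y)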
I am working with a highly imbalanced dataset and using train_test_split from sklearn.model_selection.
I have 10,000 items in this dataset and the class ratio is about 10/2/2/1. What I am looking for is a way to make the train split balanced,
and I would like to stop adding elements to the largest class once it reaches a maximum number.
Is it possible to limit the number of items per class? I know it is possible to remove the extra items after splitting, but I wonder if there is such an option.
Use the stratify parameter when calling the train_test_split function. Follow the documentation for more info.
For a 30% test set, you can do it like this:
X_train,X_test, y_train, y_test = train_test_split(data, y_true, stratify=y_true, test_size=0.3)
Here data is your full dataset and y_true contains the ground-truth labels.
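As for capping the largest class: train_test_split itself has no option for that, so one approach is to downsample before splitting. A sketch, assuming the data lives in a pandas DataFrame df with a 'label' column and a hypothetical cap MAX_PER_CLASS:

from sklearn.model_selection import train_test_split

MAX_PER_CLASS = 2000  # hypothetical cap for the largest class

# Cap each class at MAX_PER_CLASS rows; classes already below the cap are kept in full.
df_capped = (
    df.groupby('label', group_keys=False)
      .apply(lambda g: g.sample(min(len(g), MAX_PER_CLASS), random_state=42))
)

X_train, X_test, y_train, y_test = train_test_split(
    df_capped.drop(columns=['label']), df_capped['label'],
    stratify=df_capped['label'], test_size=0.3, random_state=42
)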
import numpy as np
from sklearn.model_selection import train_test_split  # the old sklearn.cross_validation module was removed

# Features: everything except the label column; hold back the last forecast_out rows
X = np.array(df.drop(columns=[label]))
X_lately = X[-forecast_out:]
X = X[:-forecast_out]

df.dropna(inplace=True)
y = np.array(df[label])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
linReg.fit(X_train, y_train)
I've been fitting my linear regression classifier over and over again with data from different spreadsheets under the assumption that every time I fit the same model with a new spreadsheet, it is adding points and making the model more robust.
Was this assumption correct? Or am I just wiping the model every time I fit it?
If so, is there a way for me to fit my model multiple times for this 'cumulative' type effect?
Linear regression is a batch (a.k.a. offline) training method; you can't incrementally add knowledge from new patterns. So sklearn is re-fitting the whole model each time. The only way to add data is to append the new patterns to your original training X and y matrices and re-fit.
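A sketch of that append-and-refit pattern, assuming X_old/y_old hold the data seen so far and X_new/y_new come from the latest spreadsheet (names are placeholders):

import numpy as np

# Stack the new rows under the old ones and re-fit on everything so far.
X_all = np.vstack([X_old, X_new])
y_all = np.concatenate([y_old, y_new])

linReg.fit(X_all, y_all)  # each call to fit() starts from scratch on the data it is given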
You're almost certainly wiping your model and starting from scratch. To do what you want, you need to append the additional data to the bottom of your data frame and re-fit using that.