I have my validation and train datasets, and I am trying to remove the rows of the validation dataset that also exist in the train dataset, to create a new subset.
val_data = data from a previous experiment
train_data = current training data
new_data = the rows of train_data that are not in val_data
I am trying to get new_data. How do I go about it?
NOTE: I am not using train_test_split to generate the validation dataset, because it was already generated beforehand.
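A minimal sketch of one way to do this, assuming both sets are pandas DataFrames with identical columns (the toy frames below are placeholders for your own data): a left merge with an indicator column keeps only the rows of train_data that have no exact match in val_data.

import pandas as pd

# toy stand-ins for the question's train_data and val_data
train_data = pd.DataFrame({"text": ["a", "b", "c", "d"], "label": [0, 1, 0, 1]})
val_data = pd.DataFrame({"text": ["b", "d"], "label": [1, 1]})

# rows flagged "left_only" exist in train_data but not in val_data
merged = train_data.merge(val_data.drop_duplicates(), how="left", indicator=True)
new_data = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(new_data)

If the validation rows still carry their original index from train_data, then train_data.drop(val_data.index) is a simpler alternative.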
I am loading a linear SVM model and then predicting new data using the stored trained SVM model. I used TF-IDF while training, such as:
vector = TfidfVectorizer(ngram_range=(1, 3)).fit(data['text'])
When I apply new data, I get an error at prediction time:
ValueError: X has 2 features, but SVC is expecting 472082 features as input.
Code for the prediction of the new data:
Linear_SVC_classifier = joblib.load("/content/drive/MyDrive/dataset/Classifers/Linear_SVC_classifier.sav")
test_data = input("Enter Data for Testing: ")
newly_testing_data = vector.transform(test_data)
SVM_Prediction_NewData = Linear_SVC_classifier.predict(newly_testing_data)
I want to predict new data with the stored SVM model without having to re-apply TF-IDF to the training data every time I give the model data to predict. When I use the new data for prediction, the prediction line raises the error above. Is there any way to remove this error?
The problem is due to your creation of a new TfidfVectorizer fitted on the test dataset. As the classifier has been trained on a matrix generated by the TfidfVectorizer fitted on the training dataset, it expects the test dataset to be transformed to exactly the same dimensions.
In order to do so, you need to transform your test dataset with the same vectorizer that was used during training, rather than fit a new one on the test set.
The vectorizer fitted on the training set can be pickled and stored for later use, to avoid any re-fitting at inference time.
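A minimal sketch of that workflow, using joblib as in the question; the vectorizer file path is hypothetical, and data and Linear_SVC_classifier are the objects from the question:

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# at training time: fit the vectorizer once and persist it next to the model
vector = TfidfVectorizer(ngram_range=(1, 3)).fit(data['text'])
joblib.dump(vector, "tfidf_vectorizer.sav")

# at inference time: load the already-fitted vectorizer instead of creating a new one
vector = joblib.load("tfidf_vectorizer.sav")
test_data = input("Enter Data for Testing: ")
# transform expects an iterable of documents, so wrap the single string in a list;
# a bare string would be iterated character by character
newly_testing_data = vector.transform([test_data])
SVM_Prediction_NewData = Linear_SVC_classifier.predict(newly_testing_data)

Loading the fitted vectorizer guarantees the transformed matrix has the same 472082 columns the classifier was trained on.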
I am working on a flight-ticket price prediction dataset using PySpark MLlib; it contains both a train and a test dataset. I have successfully fit my model on the train dataset and predicted the price (i.e., the label column), but I don't know how to apply the same model to the test dataset to predict the ticket price.
The following code trains the model on the train dataset (which contains both the features and the label column) and then applies the fitted model to the test dataset:
from pyspark.ml.regression import GBTRegressor

gbt = GBTRegressor(featuresCol="features", labelCol="Price", maxIter=10)
gbtModel = gbt.fit(training_data)

# the fitted model can be applied to any DataFrame that has a "features" column
predictions_gbt = gbtModel.transform(testing_data)
predictions_gbt.select("features", "Price", "prediction").show()
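If the test dataset also carries the true Price column, the predictions can be scored as well; a minimal sketch using PySpark's RegressionEvaluator (the choice of RMSE is an assumption, any supported metric works):

from pyspark.ml.evaluation import RegressionEvaluator

# compare the "prediction" column produced by transform against the true label
evaluator = RegressionEvaluator(labelCol="Price", predictionCol="prediction",
                                metricName="rmse")
rmse = evaluator.evaluate(predictions_gbt)
print("RMSE on the test set:", rmse)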
I have built a model to predict whether a customer is a business or a private customer. After training the model, I predict the class of 1,000 records that I didn't use for training. This prediction is saved in a CSV file.
Now I have two different behaviours:
Splitting sample data in the program
When I create the sample with train, sample = train_test_split(train, test_size=1000, random_state=seed), the prediction reaches the same accuracy as during training (the same value as on the validation set).
Splitting sample data in advance and then loading it
But when I split the data manually beforehand, by taking 1,000 records from the original CSV file and copying them into a new sample CSV file that I load for the prediction after training, I get a much worse result (e.g. 76% instead of 90%).
This behaviour doesn't make sense to me, since the original data (the CSV file used for training) was also shuffled in advance, so I should get the same result.
Here is the relevant code of the mentioned case distinction:
1. Splitting sample data in the program
Splitting
def getPreProcessedDatasetsWithSamples(filepath, batch_size):
    path = filepath
    data = __getPreprocessedDataFromPath(path)
    train, test = train_test_split(data, test_size=0.2, random_state=42)
    train, val = train_test_split(train, test_size=0.2, random_state=42)
    train, sample = train_test_split(train, test_size=1000, random_state=seed)
    train_ds = __df_to_dataset(train, shuffle=False, batch_size=batch_size)
    val_ds = __df_to_dataset(val, shuffle=False, batch_size=batch_size)
    test_ds = __df_to_dataset(test, shuffle=False, batch_size=batch_size)
    sample_ds = __df_to_dataset(sample, shuffle=False, batch_size=batch_size)
    return (train_ds, val_ds, test_ds, sample, sample_ds)
Prediction with sample, sample_ds
def savePredictionWithSampleToFileKeras(model, outputName, sample, sample_ds):
    predictions = model.predict(sample_ds)
    loss, accuracy = model.evaluate(sample_ds)
    print("Accuracy of sample", accuracy)
    sample['prediction'] = predictions
    sample.to_csv("./saved_samples/" + outputName + ".csv")
Accuracy of sample: 90%
2. Splitting sample data in advance and then loading it
Prediction by loading the CSV file
def savePredictionToFileKeras(model, sampleFilePath, outputName, batch_size):
    sample_ds = preprocessing.getPreProcessedSampleDataSets(sampleFilePath, batch_size)
    sample = preprocessing.getPreProcessedSampleDataFrames(sampleFilePath)
    predictions = model.predict(sample_ds)
    loss, accuracy = model.evaluate(sample_ds)
    print("Accuracy of sample", accuracy)
    sample['prediction'] = predictions
    sample.to_csv("./saved_samples/" + outputName + ".csv")
Accuracy of sample: 77%
EDIT
Observation: When I load the whole dataset as sample data, I get the same value as the validation accuracy, as expected (ca. 90%). But when I merely randomize the line order of the same file, I get a value of 82%. As I understand it, the accuracy should be identical, since the files contain the same data.
Some additional information:
I have changed the implementation from the Sequential to the functional API. I'm using embeddings in the pre-processing (I also tried one-hot encoding, without success).
Finally I found the problem: I am using a Tokenizer to preprocess the NAME and STREET columns, converting each word to a value that reflects how often the word occurs. When I use train_test_split, the tokenizer is fitted on the words of all the data; but when I load the sample dataset afterwards, it is fitted only on the words that occur in the sample dataset. For instance, the word "family" could be the most frequent word overall but only the third most frequent in the sample dataset, so its encoding would be completely wrong.
After using the same tokenizer instance for all the data, I get the same high accuracy everywhere.
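A minimal sketch of that fix with the Keras Tokenizer, assuming train and sample are the DataFrames from the question and the pickle path is hypothetical:

import pickle
from tensorflow.keras.preprocessing.text import Tokenizer

# fit the tokenizer once, on the full corpus used for training
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train["NAME"].astype(str) + " " + train["STREET"].astype(str))

# persist it so the exact same word index is reused at prediction time
with open("tokenizer.pickle", "wb") as f:
    pickle.dump(tokenizer, f)

# later, for the externally loaded sample: load the fitted tokenizer, never re-fit
with open("tokenizer.pickle", "rb") as f:
    tokenizer = pickle.load(f)
sample_sequences = tokenizer.texts_to_sequences(sample["NAME"].astype(str))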
Neither of the above methods reports a reliable accuracy. Accuracy is a good metric only if your data is balanced; for unbalanced data it is not a good measure and will not be consistent: each time you change the split, the accuracy changes.
You should use k-fold cross-validation first, so that all data points are used for training the model. If your dataset is not balanced, you can also try balancing techniques such as oversampling or undersampling on the training data, and then validate the model.
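A minimal sketch of stratified k-fold scoring with scikit-learn; the classifier, the toy data, and the F1 metric are all assumptions for illustration:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# toy imbalanced data as a stand-in for the question's features and labels
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# stratification keeps the 90/10 class ratio in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y,
                         cv=cv, scoring="f1")  # F1 is more informative than accuracy here
print(scores.mean())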
For a small dataset, I was using scikit-learn's train_test_split on a dataframe of the whole dataset:
from sklearn.model_selection import train_test_split

train, test = train_test_split(features_dataframe, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
And it simply creates a train/test/validation split of my dataset.
Now I want to perform the data loading from disk, i.e., from my CSV files. So I'm using the experimental tf.data function make_csv_dataset. What I have done is:
import tensorflow as tf

defaults = [float()] * len(selected_columns)
data_set = tf.data.experimental.make_csv_dataset(
    file_pattern="./processed/*/*/*.csv",
    column_names=all_columns,         # list with all column labels
    select_columns=selected_columns,  # list with the desired column labels
    column_defaults=defaults,         # default column values
    label_name="Target",
    batch_size=10,
    num_epochs=1,
    num_parallel_reads=20,
    shuffle_buffer_size=10000,
    ignore_errors=True)
As far as I can tell, I have the dataset, but when I try to perform scikit-learn's train_test_split on it, it doesn't work, and the reason is obvious: data_set is not loaded yet, it is just configured to be loaded.
How do I perform a train/test/validation split on this data?
I have gone through some guides, and everyone (as far as I have come across) is loading the training data:
overfit_and_underfit
custom_training_walkthrough
estimator
First of all, to have better control over my dataset, I used a lower-level, similar API, i.e., CsvDataset. Then I manually split the dataset into two different folders for the test and train splits and loaded them separately:
import pathlib

training_csvs = sorted(str(p) for p in pathlib.Path('./../Datasets/path-to-dataset/Train').glob("*/*.csv"))
testing_csvs = sorted(str(p) for p in pathlib.Path('./../Datasets/path-to-dataset/Test').glob("*/*.csv"))

training_dataset = tf.data.experimental.CsvDataset(
    training_csvs,
    record_defaults=defaults,
    compression_type=None,
    buffer_size=None,
    header=True,
    field_delim=',',
    use_quote_delim=True,
    na_value="",
    select_cols=selected_indices
)
print(type(training_dataset))

testing_dataset = tf.data.experimental.CsvDataset(
    testing_csvs,
    record_defaults=defaults,
    compression_type=None,
    buffer_size=None,
    header=True,
    field_delim=',',
    use_quote_delim=True,
    na_value="",
    select_cols=selected_indices
)
print(training_dataset.element_spec)
print(testing_dataset.element_spec)

training_dataset = training_dataset.shuffle(50000)
validate_ds = training_dataset.batch(300).take(100)
train_ds = training_dataset.batch(300, drop_remainder=True).skip(100)
test_ds = testing_dataset.batch(300, drop_remainder=True)
Now it works, but one problem remains, and that is the validation dataset. Ideally, the validation dataset should be different for each epoch, but in this case it is the same, so training for multiple epochs is not improving performance. If anybody can help resolve this issue, I would be grateful.
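One caveat worth flagging in the snippet above: shuffle defaults to reshuffle_each_iteration=True, so the take/skip windows cut through a different ordering on every pass, and validation and training examples can leak into each other. A sketch of a fixed, element-level split, where the validation size of 30000 is a placeholder:

# shuffle once with a fixed order, so take/skip always carve out the same rows
shuffled = training_dataset.shuffle(50000, seed=42, reshuffle_each_iteration=False)

validate_ds = shuffled.take(30000).batch(300)  # hypothetical validation size
train_ds = (shuffled.skip(30000)
                    .shuffle(20000)            # per-epoch reshuffling of the train part only
                    .batch(300, drop_remainder=True))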
I'm trying to train/validate a CNN using PyTorch on an unbalanced image dataset (class 1: 250 images, class 0: about 4,000 images), and right now I've tried augmentation solely on my training set (thanks @jodag). However, my model is still learning to favor the class with significantly more images.
I want to find ways to compensate for my unbalanced data set.
I thought about oversampling/undersampling using the imbalanced dataset sampler (https://github.com/ufoym/imbalanced-dataset-sampler), but I already use a sampler to select the indices for my 5-fold cross-validation. Is there a way I could implement cross-validation using the code below and also add this sampler? Similarly, is there a way to augment one label more frequently than the other? Along the same lines, are there any alternative, easier ways to address my unbalanced dataset that I haven't looked into yet?
Here's an example of what I have so far:
total_set = datasets.ImageFolder(PATH)
KF_splits = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, valid_idx) in enumerate(KF_splits.split(total_set)):
    # samplers to get the indices for cross-validation;
    # train_sampler is currently unused, since ImbalancedDatasetSampler
    # below ignores the fold split
    train_sampler = SubsetRandomSampler(train_idx)
    valid_sampler = SubsetRandomSampler(valid_idx)

    # Use a wrapper to apply augmentation only to the training set.
    # These dataloaders pull images from the same folder but sort them into
    # validation and training sets. Although the transforms augment only the
    # training set, this doesn't address the underlying issue of a heavily
    # unbalanced dataset.
    train_loader = torch.utils.data.DataLoader(
        WrapperDataset(total_set, transform=data_transforms['train']),
        batch_size=32, sampler=ImbalancedDatasetSampler(total_set))
    valid_loader = torch.utils.data.DataLoader(
        WrapperDataset(total_set, transform=data_transforms['val']),
        batch_size=32, sampler=valid_sampler)  # restrict validation to this fold

    print("Fold: " + str(fold))

    for epoch in range(epochs):
        # Train/validate model below
        ...
Thank you for your time and help!
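One way to combine the fold split with oversampling, sketched below under assumptions (total_set, WrapperDataset, and data_transforms as in the question; this uses PyTorch's built-in WeightedRandomSampler rather than the linked library): weights outside the current fold's training indices are set to zero, so the sampler oversamples the rare class while never drawing validation images.

import numpy as np
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, WeightedRandomSampler

targets = np.array(total_set.targets)  # ImageFolder stores one class index per image

for fold, (train_idx, valid_idx) in enumerate(KF_splits.split(total_set)):
    # inverse-frequency class weights, computed on this fold's training set only
    class_counts = np.bincount(targets[train_idx])
    weights = np.zeros(len(total_set))
    weights[train_idx] = 1.0 / class_counts[targets[train_idx]]

    # draws len(train_idx) images per epoch with replacement, oversampling the
    # rare class; indices outside train_idx have weight 0 and are never selected
    train_sampler = WeightedRandomSampler(
        torch.as_tensor(weights, dtype=torch.double),
        num_samples=len(train_idx), replacement=True)

    train_loader = DataLoader(
        WrapperDataset(total_set, transform=data_transforms['train']),
        batch_size=32, sampler=train_sampler)
    valid_loader = DataLoader(
        WrapperDataset(total_set, transform=data_transforms['val']),
        batch_size=32, sampler=SubsetRandomSampler(valid_idx))

An alternative that avoids sampling tricks altogether is to weight the loss instead, e.g. torch.nn.CrossEntropyLoss(weight=...) with inverse class frequencies.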