How to use the fraction of training dataset in the testing dataset - python

I have recently started studying neural networks and pandas DataFrames. My dataset is split into several .csv files, and I load the training dataset as follows:
df1 = pd.read_csv("/home/path/to/file/data1.csv")
df2 = pd.read_csv("/home/path/to/file/data2.csv")
df3 = pd.read_csv("/home/path/to/file/data3.csv")
df4 = pd.read_csv("/home/path/to/file/data4.csv")
df5 = pd.read_csv("/home/path/to/file/data5.csv")
trainDataset = pd.concat([df1, df2, df3, df4, df5])
Then, as suggested by many articles, the test dataset should be around 20% of the train dataset. My questions are:
How can I define the test dataset to be 20% of the train dataset?
When I load both train and test dataset, what is the best way to randomize the data?
I tried this solution, and wrote the following code but it didn't work:
testDataset = train_test_split(trainDataset, test_size=0.2)
I appreciate any tips or help for this matter.

The function train_test_split does what you need, but the call in your example is unusual: when it is given a single DataFrame it returns two objects (the train part and the test part), not one.
It is more common to write something like this, where x holds the features (the x in y = f(x), with f being the real function that your model tries to mimic) and y holds the responses (the y in y = f(x)):
from sklearn.model_selection import train_test_split
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.2)
For more explanations, please see https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation
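For the single DataFrame in the question, a minimal sketch might look like the following. It uses a small synthetic frame in place of the concatenated CSV files, and the column names (`feature_a`, `feature_b`, `label`) are hypothetical. `train_test_split` shuffles the rows by default, so no separate randomization step should be needed; `random_state` only makes the shuffle reproducible. Note that the 20% is taken from the full dataset, leaving 80% for training.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the concatenated CSV files (column names are hypothetical).
full_dataset = pd.DataFrame({
    "feature_a": range(100),
    "feature_b": [i * 0.5 for i in range(100)],
    "label": [i % 2 for i in range(100)],
})

# Given a single DataFrame, train_test_split returns two frames: train and test.
train_df, test_df = train_test_split(
    full_dataset, test_size=0.2, shuffle=True, random_state=42
)

# Separate features and target after the split.
x_train, y_train = train_df.drop(columns="label"), train_df["label"]
x_test, y_test = test_df.drop(columns="label"), test_df["label"]

print(len(train_df), len(test_df))
```

Unpacking the result into two names is the part that the original one-variable call was missing.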

Related

How to avoid data leakage when using data augmentation?

I am working on a classification problem that uses data augmentation. I have already extracted features from the augmented copies, which were created by adding noise and other modifications. However, I want to avoid data leakage, which can happen when, for example, a copy is in the training set and its original is in the test set.
I started testing some solutions, and I arrived at the code below. However, I do not know if the current solution can prevent this problem.
Basically, I have the original base (df) and the base with the characteristics of the copies (df2). When I split the df in training and testing, I look for the copies in df2 so that they are together with the original data, both in training and in testing.
Can someone help me?
Here is the code:
df = pd.read_excel("/content/drive/MyDrive/data/audio.xlsx")
df2 = pd.read_excel("/content/drive/MyDrive/data/audioAUG.xlsx")
X = df.drop('emotion', axis = 1)
y = df['emotion']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state= 42, stratify=y)
X_train_AUG = df2[df2['id'].isin(X_train.id.to_list())]
X_test_AUG = df2[df2['id'].isin(X_test.id.to_list())]
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
X_train = pd.concat([X_train, X_train_AUG.loc[:, ~X_train_AUG.columns.isin(['emotion'])]])
X_test = pd.concat([X_test, X_test_AUG.loc[:, ~X_test_AUG.columns.isin(['emotion'])]])
y_train_AUG = X_train_AUG.loc[:, X_train_AUG.columns.isin(['emotion'])]
y_test_AUG = X_test_AUG.loc[:, X_test_AUG.columns.isin(['emotion'])]
y_train_AUG = y_train_AUG.squeeze()
y_test_AUG = y_test_AUG.squeeze()
y_train = pd.concat([y_train, y_train_AUG])
y_test = pd.concat([y_test, y_test_AUG])
Short answer: your splitting procedure is fine. However, I would personally split both df and df2 at the same 75%/25% position (assuming both have the same length), because I don't know how df2 was generated as an augmented version of df. If the ids are in the same order in both data frames (for example, if both are sorted in ascending order), this works.
For example:
train_len = int(0.75 * len(df))
train_data = df[:train_len]  # something like this
data_AUG = df2[:train_len]
Then apply the same steps you described to df2 for the augmented data. As far as I can tell the rows correspond one-to-one, so this should prevent data leakage.
Or, perhaps better, generate the augmented data only after splitting, from the 75% of the data that will be used to train the model.
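Another way to keep each original together with its copies is a group-aware split. The sketch below is not the asker's code: it uses synthetic data, assumes augmented rows reuse the `id` of their original, and swaps in scikit-learn's `GroupShuffleSplit` in place of the manual `isin` lookup. Unlike the `stratify=y` call in the question, `GroupShuffleSplit` does not stratify by class.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: each original row has an 'id'; augmented copies reuse that id.
original = pd.DataFrame({
    "id": range(20),
    "feature": [float(i) for i in range(20)],
    "emotion": ["happy" if i % 2 else "sad" for i in range(20)],
})
augmented = original.assign(feature=original["feature"] + 0.1)  # noisy copies

combined = pd.concat([original, augmented], ignore_index=True)

# Split by group so an original and its copies always land on the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(combined, groups=combined["id"]))

train_df = combined.iloc[train_idx]
test_df = combined.iloc[test_idx]

# No id should appear in both sets.
shared_ids = set(train_df["id"]) & set(test_df["id"])
print(len(shared_ids))
```

Checking that the set of shared ids is empty is a cheap sanity test that also works for the approach in the question.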

How to use multiclassification model to make predictions in entire dataframe

I have trained multiclassification models using my training and test sets and have achieved good results with SVC. Now, I want to use the model to make predictions on my entire dataframe, but I get the following error: ValueError: X has 36976 features, but SVC is expecting 8989 features as input.
My dataframe has two columns: one with the categories (which I manually labeled for around 1/5 of the dataframe) and the text columns with all the texts (including those that have not been labeled).
data={'categories':['1','NaN','3', 'NaN'], 'documents':['Paragraph 1.\nParagraph 2.\nParagraph 3.', 'Paragraph 1.\nParagraph 2.', 'Paragraph 1.\nParagraph 2.\nParagraph 3.\nParagraph 4.', 'Paragraph 1.\nParagraph 2.']}
df=pd.DataFrame(data)
First, I drop the rows with NaN values in the 'categories' column. I then create the document-term matrix, define 'y', and split into training and test sets.
tf = CountVectorizer(tokenizer=word_tokenize)
X = tf.fit_transform(df['documents'])
y = df['categories']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
Second, I run the SVC model getting good results:
from sklearn.svm import SVC
svm = SVC(C=0.1, class_weight='balanced', kernel='linear', probability=True)
model = svm.fit(X_train, y_train)
print('accuracy:', model.score(X_test, y_test))
y_pred = model.predict(X_test)
print(metrics.classification_report(y_test, y_pred))
Finally, I try to apply the SVC model to predict the categories of the entire 'documents' column of my dataframe. To do so, I create the document-term matrix of the entire 'documents' column and then apply the model:
tf_entire_df = CountVectorizer(tokenizer=word_tokenize)
X_entire_df = tf_entire_df.fit_transform(df['documents'])
y_pred_entire_df = model.predict(X_entire_df)
But then I get the error that my X_entire_df has more features than the SVC model is expecting as input. I imagine that this is because I am now trying to apply the model to the whole 'documents' column, but I do not know how to fix this.
I would appreciate your help!
These issues usually come from feeding the model data whose features differ from the ones used for training (more or fewer features). Here, fitting a second CountVectorizer on the whole column builds a new, larger vocabulary, so the resulting matrix no longer matches what the SVC was trained on.
I would strongly suggest using sklearn.pipeline to combine the preprocessing (CountVectorizer) and your machine learning model (SVC) in a single object.
From experience, this helps a lot in avoiding tedious preprocessing and fitting issues.
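A minimal sketch of such a pipeline is shown here. The documents and labels are toy data, and the default tokenizer is used in place of `word_tokenize` to keep the example self-contained. The vectorizer is fit once, on the labelled text only, and `predict` then reuses that fitted vocabulary for any new text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy labelled documents standing in for the labelled rows of the dataframe.
labelled_docs = [
    "the cat sat on the mat",
    "dogs chase cats in the yard",
    "stocks fell sharply today",
    "the market rallied after earnings",
]
labels = ["animals", "animals", "finance", "finance"]

# Unlabelled documents may contain words never seen during training.
all_docs = labelled_docs + ["a kitten chased the dog", "investors sold stocks"]

# The pipeline fits the vectorizer once, on the training text only.
model = make_pipeline(CountVectorizer(), SVC(kernel="linear", C=0.1))
model.fit(labelled_docs, labels)

# predict() transforms with the fitted vocabulary, so the feature count matches.
predictions = model.predict(all_docs)
print(len(predictions))
```

Without a pipeline, the equivalent fix would be to call `transform` (not `fit_transform`) on the vectorizer that was fitted before training.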

X has 8 features, but RandomForestRegressor is expecting 67 features as input

I want to build a house price prediction app. The app has input fields where the user can enter values; a predictive model then predicts the price and displays it to the user. I am using a dataset from Kaggle to do the prediction. When I run the code, it shows an error message that says
X has 8 features, but RandomForestRegressor is expecting 67 features as input.
Below is the code. Xy contains the data from Kaggle and df is the user input. Xy is the train set and df is the test set. Xy has 8 variables including the target. df only receives 7 inputs (so it has 7 variables, because no target variable is received from the user).
# Assign to X for input features and Y for target
X = Xy.drop('Price', axis=1)
Y = Xy['Price'].values
# Build Regression Model
model = RandomForestRegressor()
model.fit(X, Y)
df = pd.get_dummies(df, columns=['Location', 'Furnishing', 'Property_Type_Supergroup', 'Size_Type'])
# Apply Model to Make Prediction
prediction = model.predict(df)
I tried to search the solutions online but nothing works for my code. Hope someone can help.
It's a little difficult to tell without seeing the data that you're fitting the model on. Between the error and your code though, it seems like possibly you're fitting the model on a data frame of 67 features. The data frame that you call fit on needs to be the same as the data frame you call predict on (at least in terms of features).
Sorry if this answer is redundant, it is difficult to tell without seeing the data and the exact error.
"X has 8 features, but RandomForestRegressor is expecting 67 features as input."
I assume that this is the standard dataset you used; after unzipping and loading, it has the following files:
sample_submission.csv
test.csv
data_description.txt
train.csv
if you check the shape of train.csv and test.csv:
train = pd.read_csv('./house_prices/train.csv')
test = pd.read_csv('./house_prices/test.csv')
print(f'Train shape : {train.shape}')
print(f'Test shape : {test.shape}')
#Train shape : (1460, 81)
#Test shape : (1459, 80)
That shows you dropped some columns (features) and reduced them from 81 to 67, so there is no problem up to this point. The problem is the order of the steps: you should convert the categorical variables into numeric ones with pd.get_dummies() in the pre-processing stage, then split that same df into x_train and y_train and use them to fit() your model, and finally predict on x_test via y_pred = model.predict(x_test). Otherwise, the shape of df does not match X (one has 8 columns and the other has 67 columns in your case).
So I suggest that df should first be split:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
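# Choosing features for predicting the target variable
# (assumes the target column is named 'Price', as in the question)
x = df.drop('Price', axis=1)
y = df['Price']
# Data split on df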
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2 , random_state=42)
# Apply RandomForestRegressor
model = RandomForestRegressor(n_estimators=300, max_depth=13, random_state=0)
model.fit(x_train,y_train)
# Predicting the data using the model
y_pred = model.predict(x_test)
# Evaluating the model
print(metrics.r2_score(y_test,y_pred))
I included following posts for your reference:
post1
post2
post3
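When the prediction input really does come from a separate source, such as a single row typed in by a user, one common approach is to align its dummy columns with the columns seen at training time. The sketch below is only an illustration under assumptions: the data is synthetic, the column names are hypothetical, and `reindex` is one possible way to do the alignment.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy training data (column names and values are hypothetical).
train_raw = pd.DataFrame({
    "Rooms": [2, 3, 4, 3, 5, 2],
    "Location": ["A", "B", "C", "A", "B", "C"],
    "Furnishing": ["Full", "Unfurnished", "Partial", "Full", "Unfurnished", "Partial"],
    "Price": [100.0, 150.0, 220.0, 160.0, 260.0, 110.0],
})
categorical = ["Location", "Furnishing"]

# One-hot encode the training features and remember the resulting columns.
X = pd.get_dummies(train_raw.drop(columns="Price"), columns=categorical, dtype=int)
y = train_raw["Price"]
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# A single user input only yields dummy columns for the categories it contains.
user_input = pd.DataFrame({"Rooms": [3], "Location": ["B"], "Furnishing": ["Full"]})
user_encoded = pd.get_dummies(user_input, columns=categorical, dtype=int)

# Align to the training columns: missing dummies become 0, in the same order.
user_aligned = user_encoded.reindex(columns=X.columns, fill_value=0)

prediction = model.predict(user_aligned)
print(user_encoded.shape[1], user_aligned.shape[1])
```

The encoded user row has fewer columns than the training matrix until it is reindexed, which is the same kind of mismatch the error message reports.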

Random Forest on Panel Data using Python

So I am having some troubles running a random forest regression on panel data.
The data currently looks like this:
I want to conduct a random forest regression which predicts KwH for each ID over time based on the variables I have. I have split my data into training and test samples using the following code:
from sklearn.model_selection import train_test_split
X = df[['hour', 'day', 'month', 'dayofweek', 'apparentTemperature',
'summary', 'household_size', 'work_from_home', 'num_rooms',
'int_in_renew', 'int_in_gen', 'conc_abt_cc', 'feel_abt_lifestyle',
'smrt_meter_help', 'avg_gender', 'avg_age', 'house_type', 'sum_insul',
'total_lb', 'total_fridges', 'bigg_apps', 'small_apps',
'look_at_meter']]
y = df[['KwH']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
I then wish to train my model and test it against the testing sample however I am unsure of how to do this. I have tried this code:
from sklearn.ensemble import RandomForestRegressor
rfc = RandomForestRegressor(n_estimators=200)
rfc.fit(X_train, y_train)
However I get the following error message:
A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
I'm not sure if the error is fundamentally in the way my data is arranged or in the way I am running the random forest, so any help with this, and then with testing the model against the test sample afterwards, would be greatly appreciated.
Thanks in advance.
Simply switching y = df[['KwH']] to y = df['KwH'] or y = df.KwH should solve this.
This is because scikit-learn expects y to be one-dimensional here, not a DataFrame, and selecting columns with double brackets [[...]] returns exactly that: a DataFrame with a single column.
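The difference in shape can be seen in a small sketch. Only the `KwH` column name comes from the question; the other columns and all values are synthetic.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Small synthetic frame; only 'KwH' comes from the question, the rest is illustrative.
df = pd.DataFrame({
    "hour": [0, 1, 2, 3, 4, 5],
    "apparentTemperature": [10.0, 9.5, 9.0, 8.5, 8.0, 8.5],
    "KwH": [0.30, 0.25, 0.20, 0.22, 0.28, 0.35],
})

X = df[["hour", "apparentTemperature"]]

y_frame = df[["KwH"]]   # double brackets: DataFrame with shape (6, 1)
y_series = df["KwH"]    # single brackets: Series with shape (6,)
print(y_frame.shape, y_series.shape)

# Fitting with the 1-D target avoids the column-vector warning.
model = RandomForestRegressor(n_estimators=20, random_state=0)
model.fit(X, y_series)

# An existing (n, 1) frame can also be flattened explicitly.
y_flat = y_frame.to_numpy().ravel()
print(y_flat.shape)
```

To test against the held-out sample afterwards, `rfc.predict(X_test)` returns the predictions and `rfc.score(X_test, y_test)` returns the R² score on the test set.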

How to make train_test_split on a dataset created with make_csv_dataset

For a small dataset, I was using scikit-learn's train_test_split on a dataframe of the whole dataset:
from sklearn.model_selection import train_test_split
train, test = train_test_split(features_dataframe, test_size=0.2)
train, test = train_test_split(train, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
This simply creates a train, validation, and test split of my dataset.
Now, I want to perform data-loading from the disk i.e., my csv files. So, I'm using the experimental tf.data function make_csv_dataset. What I have done is
import tensorflow as tf
defaults=[float()]*len(selected_columns)
data_set = tf.data.experimental.make_csv_dataset(
    file_pattern="./processed/*/*/*.csv",
    column_names=all_columns,  # array with all columns labels
    select_columns=selected_columns,  # array with desired column labels
    column_defaults=defaults,  # default column values
    label_name="Target",
    batch_size=10,
    num_epochs=1,
    num_parallel_reads=20,
    shuffle_buffer_size=10000,
    ignore_errors=True)
My guess is that I now have the dataset, but when I try to apply scikit-learn's train_test_split to it, it doesn't work. The reason is obvious: data_set is not loaded yet, it is only configured to be loaded.
How can I perform a train, test, validation split on this data?
I have gone through some guides, and every one I have come across loads the training data:
overfit_and_underfit
custom_training_walkthrough
estimator
First of all, to have better control over my dataset, I used a similar lower-level API, CsvDataset. Then I manually split the dataset into two different folders for the test and train splits and loaded them separately:
import pathlib
training_csvs = sorted(str(p) for p in pathlib.Path('./../Datasets/path-to-dataset/Train').glob("*/*.csv"))
testing_csvs = sorted(str(p) for p in pathlib.Path('./../Datasets//path-to-dataset/Test').glob("*/*.csv"))
training_dataset = tf.data.experimental.CsvDataset(
    training_csvs,
    record_defaults=defaults,
    compression_type=None,
    buffer_size=None,
    header=True,
    field_delim=',',
    use_quote_delim=True,
    na_value="",
    select_cols=selected_indices
)
print(type(training_dataset))
testing_dataset = tf.data.experimental.CsvDataset(
    testing_csvs,
    record_defaults=defaults,
    compression_type=None,
    buffer_size=None,
    header=True,
    field_delim=',',
    use_quote_delim=True,
    na_value="",
    select_cols=selected_indices
)
print(training_dataset.element_spec)
print(testing_dataset.element_spec)
training_dataset= training_dataset.shuffle(50000)
validate_ds = training_dataset.batch(300).take(100)
train_ds = training_dataset.batch(300, drop_remainder=True).skip(100)
test_ds = testing_dataset.batch(300, drop_remainder=True)
Now it's working, but one problem is left: the validation dataset. Ideally, the validation dataset should be different for each epoch, but in this case it is the same, so training for multiple epochs does not improve performance. If anybody can help resolve this issue, I would be grateful.
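One alternative to moving files into folders by hand is to split the list of CSV paths in code and build one dataset per list. The sketch below uses only the standard library; `split_files` is a hypothetical helper and the file names are placeholders. The proportions apply to files, so they only approximate row-level proportions when the files are of similar size.

```python
import random

def split_files(paths, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle file paths reproducibly and cut them into train/val/test lists."""
    shuffled = sorted(paths)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical file names standing in for the real glob results.
csv_paths = [f"processed/part_{i:03d}.csv" for i in range(10)]
train_files, val_files, test_files = split_files(csv_paths)
print(len(train_files), len(val_files), len(test_files))
```

Each list could then be passed to CsvDataset in place of the folder-based lists. Splitting before any shuffle call may also matter: as far as I know, a tf.data shuffle reshuffles on every iteration by default, so take and skip applied after it are not guaranteed to select the same rows in each epoch.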
