How to avoid data leakage when using data augmentation?

I am working on a classification problem that uses data augmentation. I have already extracted features from the augmented copies (created by adding noise and other changes). However, I want to avoid data leakage, which can happen when, for example, a copy ends up in the training set while its original is in the test set.
I started testing some solutions and arrived at the code below, but I do not know whether it actually prevents this problem.
Basically, I have the original dataset (df) and a dataset with the features of the copies (df2). After splitting df into training and test sets, I look up the copies in df2 so that they end up in the same set as their originals, in both training and testing.
Can someone help me?
Here is the code:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_excel("/content/drive/MyDrive/data/audio.xlsx")      # originals
df2 = pd.read_excel("/content/drive/MyDrive/data/audioAUG.xlsx")  # augmented copies
X = df.drop('emotion', axis=1)
y = df['emotion']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
# Put each augmented copy on the same side of the split as its original (matched on 'id')
X_train_AUG = df2[df2['id'].isin(X_train['id'])]
X_test_AUG = df2[df2['id'].isin(X_test['id'])]
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
X_train = pd.concat([X_train, X_train_AUG.drop(columns='emotion')])
X_test = pd.concat([X_test, X_test_AUG.drop(columns='emotion')])
y_train = pd.concat([y_train, X_train_AUG['emotion']])
y_test = pd.concat([y_test, X_test_AUG['emotion']])

Short answer: your splitting procedure is fine. However, because I don't know how df2 was generated from df, I would personally split both df and df2 at the same position, at 75%/25% of their length. This only works if both data frames have the same size and their rows correspond one to one, for example if both are sorted by ['id'] in ascending order.
For example:
train_len = int(0.75 * len(df))
train_data = df[:train_len]  # something like this
data_AUG = df2[:train_len]
Then apply the same steps you described to whatever is in df2 for your data augmentation. As long as the rows really do correspond one to one, this prevents leakage between the two sets.
Or, maybe better, split first and generate the augmented data afterwards, only from the 75% of the data that will be used to train the model.
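
As an alternative to matching ids by hand, scikit-learn's group-aware splitters keep all rows that share a group label on the same side of the split. The sketch below is only an illustration under assumptions: the toy frames and the feat column are hypothetical stand-ins for the real audio features, and df2 is assumed to reuse the id of the original each copy was made from.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: 8 originals plus one augmented copy each, sharing 'id'
df = pd.DataFrame({
    "id": range(8),
    "feat": [0.1 * i for i in range(8)],
    "emotion": ["happy", "sad"] * 4,
})
df2 = df.assign(feat=df["feat"] + 0.01)  # stand-in for the augmented copies

# Stack originals and copies, then split on 'id' so that an original
# and its copies can never be separated
full = pd.concat([df, df2], ignore_index=True)
X = full.drop(columns="emotion")
y = full["emotion"]

gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=full["id"]))

X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

# Ids present on both sides of the split would indicate leakage
shared_ids = set(X_train["id"]) & set(X_test["id"])
print(shared_ids)
```

Note that GroupShuffleSplit does not stratify by class, and test_size here is a fraction of the groups rather than of the rows. If class balance matters, StratifiedGroupKFold may be worth a look.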

Related

Retaining the target class during PCA in the auto dataset

I am trying to find the correct way, or to make sure that I have retained the target class during a PCA. I tried to do the scaling before and after splitting the data, but the issue is still the same.
I am sorry that I can't use the seaborn.load_dataset(name, cache=True, data_home=None, **kws) to load the dataset so here we go
Loading the data
# loading the dataframe
import numpy as np
import pandas as pd
auto = pd.read_csv('auto.csv')
Make a target class by saying that any mileage lower than the median is 0 and higher is 1
med=np.median(auto["mpg"])
auto["mpg01"]=auto["mpg"].apply(lambda x: 1 if x>med else 0)
Splitting the data
from sklearn.model_selection import train_test_split
X = auto[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin']]
y = auto["mpg01"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=101, test_size=0.3, shuffle=True)
Start the PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
pca2 = PCA(n_components=2)
X_train_reduced2 = pca2.fit_transform(scale(X_train))
Make a DF that joins the pcs and the target class
pca_df2 = pd.DataFrame(X_train_reduced2, columns =["PC1", "PC2"])
pca_df2["mpg01"]=y_train
pca_df2
I noticed that there are some NaNs in this new dataframe. The length of the dataframe makes sense. The only explanation I can think of is that the index no longer matches, but I expected it to, and I don't know how to verify it.
The 2D plot of the PCA shows no separation between the target classes. I am just wondering whether I got all the steps right.

As you said, indexes are no longer matching.
You need to modify the line:
pca_df2 = pd.DataFrame(X_train_reduced2, columns=["PC1", "PC2"], index=X_train.index)
Note that PCA does not return a pd.DataFrame but a plain NumPy array, so the new DataFrame gets a fresh 0..n-1 index. You need to set its index so that it matches the labels in y_train.
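
The effect can be reproduced on synthetic data. The sketch below uses made-up columns rather than the auto dataset, so the values are hypothetical; it only shows that assigning y_train to a frame with a default index produces NaNs, while reusing X_train.index does not.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

# Synthetic stand-in for the auto data (hypothetical values)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
y = (X["a"] > 0).astype(int).rename("mpg01")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=101, test_size=0.3, shuffle=True
)
reduced = PCA(n_components=2).fit_transform(scale(X_train))

# Default index 0..69: column assignment aligns on index labels, so every
# label that is missing from y_train's shuffled index becomes NaN
misaligned = pd.DataFrame(reduced, columns=["PC1", "PC2"])
misaligned["mpg01"] = y_train

# Reusing X_train's index keeps each row paired with its own label
aligned = pd.DataFrame(reduced, columns=["PC1", "PC2"], index=X_train.index)
aligned["mpg01"] = y_train

print(misaligned["mpg01"].isna().sum(), aligned["mpg01"].isna().sum())
```

Comparing `pca_df2.index` with `y_train.index` is one way to verify the mismatch directly.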

Creating a confusion matrix out of various data files but getting ValueError: x and y must be the same size

I'm new to python and trying to create a sentiment analysis using VADER
I pulled the data for 13 artists into individual dataframes, converted the lyrics to words, kept only the unique words, removed stopwords and so on, then put it all into a single df
# for each artist: clean the data, keep a single occurrence of each word, and add the result to the list
df_allocate = []
for df in df_all:
    df_clean = cleaning(df)
    df_words = to_unique_words(df_clean)
    df_allocate.append(df_words)
frames = df_allocate
# create the new dataframe with the information from the word lists
df_main = pd.concat(frames, ignore_index=True)
df_main = df_main.reset_index(drop=True)
Now I'm trying to train a logistic regression model, predict test results and get a confusion matrix.
I think I'm getting confused about how data frames work and also how to train_test_split the data correctly.
Right now, I have:
for column_name in df_all:
    cv = CountVectorizer(max_features=100000)
    X = cv.fit_transform(df_main['Artist']).toarray()
    y = column_name.sentiment
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=20)
    classifier = LogisticRegression(random_state=25)
    classifier.fit(X_train, y_train)
    y_predict = classifier.predict(X_test)
    print_confusionMatrix = confusion_matrix(y_test, y_predict)
    print(print_confusionMatrix)
    print("accuracy score : ", accuracy_score(y_test, y_predict))
When I debug the program, I can see why it's complaining, but I don't know how to fix it. I looked up how to iterate through a dataframe and tried
for df in df_all.index
but it didn't work.
The columns are Artist, Title, Album, Date, Lyric, Year, and sentiment. What I want to accomplish is to iterate through each artist (df_all has the data frames of each individual artist, and that is why I use it), and get a prediction of the sentiment analysis of their lyrics to build a confusion matrix for all the 13 artists.
In a previous attempt, I changed X and y to use df_main directly, so it was:
X = cv.fit_transform(df_main).toarray()
y = df_main.sentiment
however, this is where I get the error that x and y must be the same size.
Please push me in the right direction. I'm quite lost.
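
One likely source of the size mismatch, shown here on made-up data: iterating over a DataFrame yields its column labels, so passing the whole frame to CountVectorizer produces one row per column rather than one row per lyric. The sketch below assumes the text to vectorize lives in the Lyric column; the three-column frame and its contents are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical miniature of df_main
df_main = pd.DataFrame({
    "Artist": ["a", "a", "b", "b"],
    "Lyric": [
        "love sunny day",
        "sad lonely night",
        "happy bright morning",
        "dark cold rain",
    ],
    "sentiment": [1, 0, 1, 0],
})

# Passing the whole frame vectorizes the three column names,
# not the four lyrics, so X and y end up with different lengths
X_wrong = CountVectorizer().fit_transform(df_main)
print(X_wrong.shape[0], len(df_main))

# Vectorizing a single text column gives one row per sample, matching y
X = CountVectorizer().fit_transform(df_main["Lyric"])
y = df_main["sentiment"]
print(X.shape[0], len(y))
```

The same reasoning applies inside the loop: X and y have to come from the same rows, so building X from df_main while taking y from one artist's frame also gives arrays of different lengths.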

How to use the fraction of training dataset in the testing dataset

I've recently started studying neural networks and pandas DataFrames. The dataset I have is split into several .csv files, and for the training dataset I load them as follows:
df1 = pd.read_csv("/home/path/to/file/data1.csv")
df2 = pd.read_csv("/home/path/to/file/data2.csv")
df3 = pd.read_csv("/home/path/to/file/data3.csv")
df4 = pd.read_csv("/home/path/to/file/data4.csv")
df5 = pd.read_csv("/home/path/to/file/data5.csv")
trainDataset = pd.concat([df1, df2, df3, df4, df5])
Then, as suggested by many articles, the test dataset should be around 20% of the train dataset. My questions are:
How can I define the test dataset to be 20% of the train dataset?
When I load both train and test dataset, what is the best way to randomize the data?
I tried this solution, and wrote the following code but it didn't work:
testDataset = train_test_split(trainDataset, test_size=0.2)
I appreciate any tips or help for this matter.

The function train_test_split will do what you want, but I'm a bit surprised by the call in your example.
It is more common to write something like this, with x being the features (the x in y = f(x), where f is the real function you are trying to approximate with your model) and y being the responses (the y in y = f(x)).
from sklearn.model_selection import train_test_split
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.2)
For more explanations, please see https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation
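
For a single concatenated DataFrame, a minimal sketch could look like the following. The feature and label columns are hypothetical placeholders for the real CSV contents. When given one input, train_test_split returns two pieces, so the result needs to be unpacked into two variables rather than assigned to one.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the concatenated CSV files
dataset = pd.DataFrame({
    "feature": range(100),
    "label": [i % 2 for i in range(100)],
})

# Rows are shuffled by default; random_state makes the shuffle reproducible
trainDataset, testDataset = train_test_split(dataset, test_size=0.2, random_state=0)

# Separate features and labels afterwards if the model needs them apart
xTrain, yTrain = trainDataset.drop(columns="label"), trainDataset["label"]
xTest, yTest = testDataset.drop(columns="label"), testDataset["label"]

print(len(trainDataset), len(testDataset))
```

Here test_size=0.2 means 20% of the full dataset goes to the test set, leaving 80% for training.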

How can we predict target values for new data, based on a different dataset? scikit learn / gaussianNB

I am struggling to understand how training our algorithms connects with making predictions on new data.
My situation: I have an algorithm that I use on a labeled dataset. After the steps of importing it, encoding it, fit_transforming it and fitting it to make predictions on the data_test of the train_test_split function I get a really nice prediction from using the labeled dataset.
I am stumped as to how I need to feed a new dataset (unlabeled this time) to the trained model, which has learned from the labeled dataset. I know that technically the data used to train withheld the labels from itself to predict, but I am unaware how I have to provide the gaussianNB algorithm new data features to predict unknown labels.
My code for the training:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv(chosen_file, sep=',')
# Label-encode the categorical (object) columns
cat_cols = df.select_dtypes(include=['object'])
cat_cols_filled = cat_cols.fillna('0')
le = LabelEncoder()
cat_cols_fitted = cat_cols_filled.apply(lambda col: le.fit_transform(col))
# Label-encode the remaining columns as well
non_cat_cols = df.select_dtypes(exclude=['object'])
non_cat_cols_filled = non_cat_cols.fillna('0')
non_cat_cols_fitted = non_cat_cols_filled.apply(lambda col: le.fit_transform(col))
# The target is the last column of df
target_prep = df.iloc[:,-1]
target = le.fit_transform(target_prep.astype(str))
data = pd.concat([cat_cols_fitted, non_cat_cols_fitted], axis=1)
data_train, data_test, target_train, target_test = train_test_split(data, target, train_size=0.3)
alg = GaussianNB()
pred = alg.fit(data_train, target_train).predict(data_test)
This is all fine and dandy. But I cannot understand how I have to give something in place of data_test. Do I need to provide the new dataset with some placeholder values for the label column? My label column from the beginning dataframe is the last one.
My attempt:
new_df = pd.read_csv(new_chosen_file, sep=',')
new_cat_cols = new_df.select_dtypes(include=['object'])
new_cat_cols_filled = new_cat_cols.fillna('0')
new_cat_cols_fitted = new_cat_cols_filled.apply(lambda col: le.fit_transform(col))
new_non_cat_cols = new_df.select_dtypes(exclude=['object'])
new_non_cat_cols_filled = new_non_cat_cols.fillna('0')
new_non_cat_cols_fitted = new_non_cat_cols_filled.apply(lambda col: le.fit_transform(col))
new_data = pd.concat([new_cat_cols_fitted, new_non_cat_cols_fitted], axis=1)
print(new_data)
new_pred = alg.predict(new_data)
new_prediction = pd.DataFrame({'NEW ML prediction':new_pred})
print(new_pred)
print(new_prediction)
Notice that I do not provide the target column in the new dataset. However, the program errors out if my column count does not match, so I am forced to add at least the label of that column to stop it from doing so.
Am I way off in my understanding of how this works? Please let me know.
EDIT:
I found my major screw-up in the code. I had not isolated my target column out of the data DataFrame. This was why data was 10 column shape.
I can finally appreciate the simplicity of the code.

alg = GaussianNB() creates an unfitted model. Calling alg.fit(data_train, target_train) fits that object in place and returns it, so alg is already trained after your chained call; the variable pred only holds the predictions, not the model.
Chaining multiple methods, as in
alg.fit(data_train, target_train).predict(data_test), is known as method chaining and can obscure this.
A cleaner and more readable alternative is:
alg = GaussianNB()  # instantiate the model
alg = alg.fit(data_train, target_train)  # fit the model on the training data
pred = alg.predict(data_test)  # evaluate on the held-out test data
new_pred = alg.predict(new_data)  # predict labels for the new data
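
One further point when predicting on a new file: the new data needs the same feature columns as the training data, without the target, and generally should be encoded with the encoder that was fitted on the training data rather than a freshly fitted one. The sketch below is only an illustration under assumptions: the color, size and label columns are hypothetical, and OrdinalEncoder is swapped in for the LabelEncoder used in the question because it is designed for feature columns.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical labeled training data and unlabeled new data
train = pd.DataFrame({
    "color": ["red", "red", "blue", "blue", "red", "blue"],
    "size": [1.0, 1.2, 3.1, 2.9, 0.9, 3.3],
    "label": ["small", "small", "large", "large", "small", "large"],
})
new = pd.DataFrame({"color": ["blue", "red"], "size": [3.0, 1.1]})

feature_cols = ["color", "size"]  # the target column is deliberately excluded

# Fit the encoder on the training features only
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_train = train[feature_cols].copy()
X_train["color"] = enc.fit_transform(X_train[["color"]]).ravel()
y_train = train["label"]

alg = GaussianNB()
alg.fit(X_train, y_train)  # fits alg in place

# Reuse the fitted encoder (transform, not fit_transform) on the new data
X_new = new[feature_cols].copy()
X_new["color"] = enc.transform(X_new[["color"]]).ravel()
new_pred = alg.predict(X_new)
print(new_pred)
```

No placeholder label column is needed for the new data; predict only expects the features the model was trained on.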

Use scaled data on test file

I want to fit a logistic regression model on my first data file (F1) and test it on another file named F2 (the same exercise from another year).
Code on F1:
sc = preprocessing.StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
logistic = LogisticRegression(random_state =0,max_iter = 300 ,penalty = 'l2')
model = logistic.fit(X_train, y_train)  # fit on the scaled training data rather than the unscaled X
ScaledObj = X_train
How can I use the scaled data with my test file, please?
I tried this, but I don't know how to use ScaledObj on my test data.
Code for my test file (F2):
F2 = pd.read_csv("F2.csv", sep =',')
y_test = F2['y']
X_test = F2.copy()
del X_test['y']
y_pred = model.predict(X_test)
proba= model.predict_proba(X_test)[:, 1]
Auc_Test = metrics.roc_auc_score(y_test, proba)

As a best practice in a machine learning project, the typical workflow goes like this:
1. Fit the scaler to the training data, kept separate from the testing data.
2. Transform the training data (you did this already with your fit_transform step).
3. Transform the test data using the already-fitted scaler*. This prevents any data leakage between your training and testing data.
4. Use the same fitted scaler* to transform any other validation or production data.
*Note that the scaler only lives in memory, so if you want to use it in another script, you can use something like pickle or joblib to save the object for later use.
You've done steps 1-3 correctly in your code above, and you can execute step 4 the same way. One thing I would recommend, however, is not to overwrite your variables, as this can be confusing when reading the code later.
F2 = pd.read_csv("F2.csv", sep =',')
y_test1 = F2['y']
X_test1 = F2.copy()
del X_test1['y']
#add this line, same as you did before
X_test1 = sc.transform(X_test1)
y_pred = model.predict(X_test1)
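
If the F2 evaluation runs in a separate script, the fitted scaler and model can be saved and reloaded. This is a minimal sketch with joblib on synthetic data; the file name, the dictionary keys and the array shapes are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the F1 training data and the F2 features
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = (X_train[:, 0] > 0).astype(int)
X_f2 = rng.normal(size=(20, 3))

# Fit the scaler on the training data only, then fit the model on the scaled data
sc = StandardScaler()
model = LogisticRegression(max_iter=300).fit(sc.fit_transform(X_train), y_train)

# Save both objects together so they cannot get out of sync
path = os.path.join(tempfile.mkdtemp(), "scaler_and_model.joblib")
joblib.dump({"scaler": sc, "model": model}, path)

# In the other script: load, transform with the fitted scaler, then predict
loaded = joblib.load(path)
X_f2_scaled = loaded["scaler"].transform(X_f2)
proba = loaded["model"].predict_proba(X_f2_scaled)[:, 1]
```

Wrapping the scaler and the classifier in a single sklearn Pipeline is another option, since one object then handles both steps.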
