Retaining the target class during PCA in the auto dataset - python

I am trying to find the correct way to retain the target class during a PCA, or at least to verify that I have done so. I tried doing the scaling both before and after splitting the data, but the issue is the same either way.
Unfortunately I can't use seaborn.load_dataset(name, cache=True, data_home=None, **kws) to load the dataset, so here we go.
Loading the data
import pandas as pd

# loading the dataframe
auto = pd.read_csv('auto.csv')
Make a target class by saying that any mileage lower than the median is 0 and higher is 1
import numpy as np

med = np.median(auto["mpg"])
auto["mpg01"] = auto["mpg"].apply(lambda x: 1 if x > med else 0)
Splitting the data
from sklearn.model_selection import train_test_split

X = auto[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin']]
y = auto["mpg01"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=101, test_size=0.3, shuffle=True)
Start the PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
pca2 = PCA(n_components=2)
X_train_reduced2 = pca2.fit_transform(scale(X_train))
Make a DF that joins the pcs and the target class
pca_df2 = pd.DataFrame(X_train_reduced2, columns =["PC1", "PC2"])
pca_df2["mpg01"]=y_train
pca_df2
I noticed that there are some NaNs in this new dataframe. The length of the dataframe makes sense. The only thing I can think of is that the index no longer matches, but it should, and I have no way to verify it.
The 2D plot of the PCA shows no separation between the target classes. I am just wondering if I got all the steps right.

As you suspected, the indexes no longer match.
You need to modify the line to:
pca_df2 = pd.DataFrame(X_train_reduced2, columns=["PC1", "PC2"], index=X_train.index)
Note that PCA does not return a pd.DataFrame but a plain np.array, so you need to set the index yourself so that it matches the labels in y_train.
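A minimal sketch of the full corrected block, assuming the same variable names as in the question:
# keep X_train's row index so the assignment of y_train aligns row by row
pca_df2 = pd.DataFrame(X_train_reduced2, columns=["PC1", "PC2"], index=X_train.index)
pca_df2["mpg01"] = y_train
# quick check that the join introduced no NaNs
assert pca_df2["mpg01"].isna().sum() == 0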

Related

How to avoid data leakage when using data augmentation?

I am working on a classification problem that uses data augmentation. To do this, I have already extracted features from copies generated by adding noise and other augmentations. However, I want to avoid data leakage, which can happen when a copy is in the training set and its original is in the test set, for example.
I started testing some solutions, and I arrived at the code below. However, I do not know if the current solution can prevent this problem.
Basically, I have the original dataset (df) and the dataset with the features of the copies (df2). When I split df into training and testing, I look up the copies in df2 so that they stay together with their original data, both in training and in testing.
Can someone help me?
Here is the code:
import pandas as pd
from sklearn.model_selection import train_test_split

# original data and the augmented copies (linked by the 'id' column)
df = pd.read_excel("/content/drive/MyDrive/data/audio.xlsx")
df2 = pd.read_excel("/content/drive/MyDrive/data/audioAUG.xlsx")
X = df.drop('emotion', axis=1)
y = df['emotion']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
# pick the augmented copies whose original ended up in each split
X_train_AUG = df2[df2['id'].isin(X_train.id.to_list())]
X_test_AUG = df2[df2['id'].isin(X_test.id.to_list())]
X_train = X_train.append(X_train_AUG.loc[:, ~X_train_AUG.columns.isin(['emotion'])])
X_test = X_test.append(X_test_AUG.loc[:, ~X_test_AUG.columns.isin(['emotion'])])
y_train_AUG = X_train_AUG.loc[:, X_train_AUG.columns.isin(['emotion'])]
y_test_AUG = X_test_AUG.loc[:, X_test_AUG.columns.isin(['emotion'])]
y_train_AUG = y_train_AUG.squeeze()
y_test_AUG = y_test_AUG.squeeze()
y_train = y_train.append(y_train_AUG)
y_test = y_test.append(y_test_AUG)
Short answer: your splitting procedure is fine. However, I would personally split both df and df2 by 75-25% of their length (if both have the same size), because I don't know how your augmented df2 was generated from df. I think it is fine if those ['id'] values are in order (for example, if all of the data are sorted in ascending order in both data frames).
e.g.
train_len = int(0.75 * len(df))
train_data = df[:train_len]  # something like this
data_AUG = df2[:train_len]
Then apply the same thing you mentioned for whatever is in df2 for your data augmentation. This would guarantee that there is no data leakage (as far as I can tell, these are one-to-one data).
Or, maybe a better way: generate the augmented data from the already-split data from the start (i.e. generate it only from the 75% of the data that will be used to train the model).
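A rough sketch of keeping each original and its copies on the same side of the split by splitting on the id column first, assuming (as in the question) that df and df2 share an 'id' column and the same column layout:
import pandas as pd
from sklearn.model_selection import train_test_split

# split the unique original ids, so an original and its copies always land on the same side
ids = df['id'].unique()
train_ids, test_ids = train_test_split(ids, test_size=0.25, random_state=42)

train = pd.concat([df[df['id'].isin(train_ids)], df2[df2['id'].isin(train_ids)]])
test = pd.concat([df[df['id'].isin(test_ids)], df2[df2['id'].isin(test_ids)]])

X_train, y_train = train.drop('emotion', axis=1), train['emotion']
X_test, y_test = test.drop('emotion', axis=1), test['emotion']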

Is there a KFold splitter that avoids putting labels (unique, only occur once) into the test set that are unseen in the training set?

Essentially, my test set sometimes contains labels (feature values) that are not present in the training set.
This is because there are roughly ~100 labels overall, in a 500 observation dataset.
Being a student from a different field, I made a low-quality solution for it, but I was wondering whether there is a part of SKLearn or a similar package that deals with this.
This was my solution for the time being:
i, a = 1, 1
while a is not None:
    X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.95)
    # count how many validation labels are (not) also present in the training labels
    b = y_validation.isin(y_train).value_counts()
    print(f"While labels between sets are mismatched splitting will repeat. Current iteration: {i}")
    i = i + 1
    a = b.get(False)  # None once every validation label also appears in training
print("Splitting Completed Successfully\n")
Without this, if I ran some ML algorithms they would fail for obvious reasons: essentially, "feature in test set is not found in training set".
I tried to search if there was an elegant solution to this but I don't think I came across anything (other than a solution in R via caret) for this.
I should add, the issue is some labels are only seen once in the entire dataset - so it's impossible to stratify them across both.
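I am not aware of a built-in scikit-learn splitter for this, but one deterministic alternative to the retry loop, sketched under the assumption that X is a DataFrame and y a Series sharing the same index, is to pin the first occurrence of every label to the training set and split only the rest:
import pandas as pd
from sklearn.model_selection import train_test_split

# pin the first row of every label to the training set, so no label can be unseen at train time
pinned_idx = y.drop_duplicates().index
pinned_mask = y.index.isin(pinned_idx)

X_rest, y_rest = X[~pinned_mask], y[~pinned_mask]
X_train, X_validation, y_train, y_validation = train_test_split(X_rest, y_rest, train_size=0.95)

X_train = pd.concat([X_train, X[pinned_mask]])
y_train = pd.concat([y_train, y[pinned_mask]])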

Can I use StandardScaler() on whole data set, or should I calculate on train and test sets separately?

I'm developing an SVR for ~100 continuous features and a continuous label.
For scaling the data, I wrote:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Read in
df = pd.read_csv(data_path, sep='\t')
features = df.iloc[:, 1:-1]  # 100 features
target = df.iloc[:, -1]      # the label
names = df.iloc[:, 0]        # names column
# Scale features
scaler = StandardScaler()
scaled_df = scaler.fit_transform(features)
# rebuild a DataFrame (fit_transform returns an np array), keeping the original column names
scaled_df = pd.DataFrame(scaled_df, columns=features.columns)
So now I have a scaled data frame, and my next step was to split into train and test, and then develop a model (SVR):
X_train, X_test, y_train, y_test = train_test_split(scaled_df, target, test_size=0.2)
model = SVR()
...and then I fit the model to the data.
But I noticed other people don't fit the StandardScaler() on the whole data frame; they split the dataframe into train and test first, and then apply StandardScaler() to each separately.
Is there a difference between whether you apply the StandardScaler to the whole data frame, or train and test separately?
The previous answer says that you should separate the training and testing set when scaling, otherwise the testing one might bias the transformation of the training one. This is half correct and half wrong.
If you fit a separate scaler on the test set, it may well be scaled to the wrong proportions (e.g. if it comes from a narrow continuous time range and therefore covers only a subset of the value range). You would end up with inconsistent values for the variables of the test set.
In general, what you must do is scale on the training set and transfer the scale over to the testing set. This is done by using the methods fit and transform separately, as seen in the documentation.
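A minimal sketch of that pattern, assuming the features and target from the question:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# split first, on the unscaled features
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2)

# fit the scaler on the training set only, then apply the same scale to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)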
You need to fit the StandardScaler on the training set only, to prevent the distribution of the test set from leaking into the model. If you fit the scaler on the full dataset before splitting, information from the test set is used to transform the training set, which is then used to train the model.

Multi labeled image classification with imbalanced data, how to split it?

I am working on multi-label image classification. This is my data frame:
[UPDATED]
As you can see, images are labeled with 26 attributes: "1" means the attribute is present, "0" means it is not.
My problem is that many of the labels are imbalanced. For example:
[1] train_df.value_counts('Eyeglasses')
Output:
Eyeglasses
0 54735
1 1265
dtype: int64
[2] train_df.value_counts('Double_Chin')
Output:
Double_Chin
0 55464
1 536
dtype: int64
How can I split it into training and validation data so that both are balanced?
[UPDATE]
I tried:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

smote = SMOTE()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train_smote, y_train_smote = smote.fit_sample(X_train, y_train)
ValueError: Imbalanced-learn currently supports binary, multiclass and
binarized encoded multiclasss targets. Multilabel and multioutput
targets are not supported.
Your question mixes two concepts: splitting a multi-class, multi-label image dataset into subsets which have proportional representation, and resampling methods to deal with class imbalance. I am going to focus on just the splitting part of the problem, since that's what the title is about.
I would use a stratified shuffle split to make sure that each subset has equal representation (Wikipedia has a handy visual of stratified sampling).
For this I recommend skmultilearn's IterativeStratification method. It supports multi-label datasets.
from skmultilearn.model_selection.iterative_stratification import IterativeStratification

train_fraction = 0.8  # for example: the fraction of samples to put in the training split

stratifier = IterativeStratification(
    n_splits=2, order=2, sample_distribution_per_fold=[1.0 - train_fraction, train_fraction],
)
# this class is a generator that produces k-folds; we just want to iterate it once to make a single static split
# NOTE: needs to be computed on hard labels.
train_indexes, everything_else_indexes = next(stratifier.split(X=img_urls, y=labels))
# img_urls array, shape (N_samp,)
x_train, x_else = img_urls[train_indexes], img_urls[everything_else_indexes]
# labels array, shape (N_samp, n_classes)
Y_train, Y_else = labels[train_indexes, :], labels[everything_else_indexes, :]
I wrote a more complete solution, including unit tests, in a blog post.
One downside with skmultilearn is that it is not very well maintained and has some broken functionality. I documented a few of these sharp corners and gotchas in my blog post. Also note that this stratification procedure is painfully slow when you get to several million images because the stratifier only uses a single CPU.

Why do I get different results when I do a manual split of test and train data as opposed to using the Python splitting function

If I run a simple dtree regression model using data via the train_test_split function, I get nice r2 scores and low mse values.
import pandas
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

training_data = pandas.read_csv('data.csv', usecols=['y', 'x1', 'x2', 'x3'])
y = training_data.iloc[:, 0]
x = training_data.iloc[:, 1:]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
regressor = DecisionTreeRegressor(random_state=0)
# fit the regressor with X and Y data
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
Yet if I split the data file manually into two files, 2/3 train and 1/3 test, I get negative r2 scores and high mse. (There is a column called human which takes a value from 1 to 9 indicating which human it is; I use humans 1-6 for training and 7-9 for test.)
training_data = pandas.read_csv("train"+".csv",usecols=['y','x1','x2','x3'])
testing_data = pandas.read_csv("test"+".csv", usecols=['y','x1','x2','x3'])
y_train = training_data.iloc[:,training_data.columns.str.contains('y')]
X_train = training_data.iloc[:,training_data.columns.str.contains('|'.join(['x1','x2','x3']))]
y_test = testing_data.iloc[:,testing_data.columns.str.contains('y')]
X_test = testing_data.iloc[:,testing_data.columns.str.contains('|'.join(l_vars))]
y_train = pandas.Series(y_train['y'], index=y_train.index)
y_test = pandas.Series(y_test['y'], index=y_test.index)
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
I was expecting more or less the same results, and all the data types seem the same for both calls.
What am I missing?
I'm assuming that both methods here actually do what you intend and that the shapes of your X_train/test and y_train/test are the same coming from both methods. You can either plot the underlying distributions of your datasets or compare your second implementation against a cross-validated model (for better rigour).
Plot the distributions (i.e. make bar charts/density plots) of the labels (y) in the initial train-test sets vs. in the second ones (from the manual implementation). You can dive deeper and also plot your other columns in the data, and see if anything about the distributions differs between the resulting sets of the two implementations. If the distributions are different, then it makes sense that you get discrepancies between your two models. If the discrepancy is huge, it could be that your labels (or other columns) are actually sorted in your manual implementation, so you get very different distributions in the datasets you're comparing.
Also, if you want to make sure that your manual split results in a 'representative' set (one that would generalise well) based on model results instead of underlying data distributions, I would compare it against the results of a cross-validated model, not one single set of results.
Essentially, although the probability is small and train_test_split does some shuffling, you could get a 'train/test' pair that performs well just out of luck. (To reduce the chance of that without doing cross validation, I'd suggest making use of the stratify argument of the train_test_split function; then at least you're sure the first implementation 'tries harder' to get balanced train/test pairs.)
If you decide to cross validate (with train_test_split), you get an average model score over the folds and a confidence interval around it, and you can check whether your second model's results fall within that interval. If they don't, again it just means your split is actually 'corrupted' somehow (e.g. by having sorted values).
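One way to do that comparison, sketched with cross_val_score and assuming the x and y from the first code block; the manual split's r2 can then be checked against the spread of the cross-validated scores:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# r2 over 5 shuffled folds gives a range of scores the manual split can be compared against
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeRegressor(random_state=0), x, y, cv=cv, scoring='r2')
print(scores.mean(), scores.std())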
P.S. I'd also add that decision trees are models that are known to overfit massively [1]. Maybe use a random forest instead? (You should get more stable results due to bootstrapping/bagging, which would act similarly to cross-validating and reduce the chance of overfitting.)
1 - http://cv.znu.ac.ir/afsharchim/AI/lectures/Decision%20Trees%203.pdf
The train_test_split function from scikit-learn uses sklearn.model_selection.ShuffleSplit as per the documentation, and this means the method randomizes your data when splitting.
When you split manually, you didn't randomize it, so if your labels are not spread evenly throughout your dataset you'll of course have performance issues, since your model won't generalize well when the training data doesn't contain enough samples of the other labels.
If my suspicion is correct, you should get similar results by passing shuffle=False into train_test_split.
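A quick way to test that suspicion, assuming the x and y from the first snippet; with shuffle=False the split takes the first rows as train and the last rows as test, much like the manual file split:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, shuffle=False)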
Suppose your dataset contains this data:
1 + 1 = 2
2 + 2 = 4
4 - 4 = 0
2 - 2 = 0
So suppose you want a 50% train split. train_test_split shuffles it like this so it generalizes better:
1 + 1 = 2
2 - 2 = 0
So it knows what to do when it sees this data:
2 + 2
4 - 4  # since it learned both addition and subtraction
But when you split it manually without shuffling, like this:
1 + 1 = 2
2 + 2 = 4  # only learned addition
it doesn't know what to do when it sees this data:
2 - 2
4 - 4  # test data is subtraction
Hope this answers your question.
It may sound like a simple check but..
In the first example you are reading data from 'data.csv', in the second example you are reading from 'train.csv' and 'test.csv'. Since you say you split the file manually, I have a question about how that was done. If you simply cut the file at the 2/3's mark and saved as 'train.csv' and the remaining as 'test.csv' then you have unknowingly made an assumption about the uniformity of the data in the file. Data files can have an ordered structure which would skew the training or testing, which is why the train_test_split randomizes the rows. If you haven't already done it, try to randomize the rows first and then write to your train and test csv file to ensure you have a homogeneous dataset.
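A small sketch of that shuffle-then-write step, assuming the original data.csv and the train.csv/test.csv file names used in the question:
import pandas

# shuffle the rows once, then write the first 2/3 to train.csv and the rest to test.csv
data = pandas.read_csv('data.csv')
shuffled = data.sample(frac=1, random_state=0).reset_index(drop=True)
split_at = int(len(shuffled) * 2 / 3)
shuffled.iloc[:split_at].to_csv('train.csv', index=False)
shuffled.iloc[split_at:].to_csv('test.csv', index=False)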
The other line that might be out of place is line 6:
X_test = testing_data.iloc[:,testing_data.columns.str.contains('|'.join(l_vars))]
Perhaps the l_vars contains something other than what you expect. Maybe it should read the following to be more consistent.
X_test = testing_data.iloc[:,testing_data.columns.str.contains('|'.join(['x1','x2','x3']))]
good luck, let us know if this helps.
