Python: prepare a training data set with an evenly distributed response variable

I am working on a small machine learning project.
The dataset I use has 56 input parameters and one categorical response variable (0/1). My problem is that the response values are not evenly distributed. I want to prepare the training data set so that the responses are evenly distributed. How can this be done?
This is what the data looks like:
-> the training dataset should contain the same number of 1s and 0s in the response.
Thanks for your help. As you can imagine, I am really a beginner...

I am the same person who asked the question, sorry about that.
First I load the data from a CSV file (not shown in the code here) and store it as data. Next, I create a new column named "response_class" based on the value in the column "response": if it is at or below 0.045, response_class = 1, otherwise 0. Second, I randomly sample 10000 rows from the data (due to computation limits), and here I want to make sure that I get the same number of 1s and 0s in response_class. At the end I split the data to make it ready for a correlation matrix and for test and train data.
Here is my code:
import numpy as np
import pandas as pd

# data is loaded from a CSV file beforehand (not shown)
data = data[data.response != 0].copy()  # copy so the new column is added to an independent frame
data['response_class'] = np.where(data['response'] <= 0.045, 1, 0)
# 1 = at or below 0.045 / 0 = above 0.045
# reduce the amount of data by picking random samples
data = data.sample(n=10000)
# split data
data.drop(['response'], axis=1, inplace=True)
y = data['response_class']
X = data.drop('response_class', axis=1)
X_names = X.columns
data.head()

Found a solution:
from sklearn.utils import resample

# separate based on the response variable in response_class
df_zero = data[data.response_class == 0]
df_one = data[data.response_class == 1]
# resample class 0 with replacement to the size of class 1
# (this is upsampling only if class 0 is the minority class)
df_zero_min = resample(df_zero,
                       replace=True,
                       n_samples=len(df_one),
                       random_state=123)
df_upsampled = pd.concat([df_one, df_zero_min])
df_upsampled.response_class.value_counts()
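An alternative that fits the "sample 10000 rows" step is to draw the same number of rows from each class without replacement, so no row is duplicated. The following is only a minimal sketch under assumptions: the helper name balanced_sample and the toy data are made up for illustration, and it requires every class to contain at least n_total / (number of classes) rows.

```python
import numpy as np
import pandas as pd


def balanced_sample(df, class_col, n_total, random_state=None):
    """Draw n_total rows with an equal number of rows from each class.

    Hypothetical helper for illustration, not part of pandas or sklearn.
    """
    n_classes = df[class_col].nunique()
    n_per_class = n_total // n_classes
    # Sample the same number of rows from every class, without replacement.
    sampled = df.groupby(class_col).sample(n=n_per_class, random_state=random_state)
    # Shuffle so the classes are not stored in contiguous blocks.
    return sampled.sample(frac=1, random_state=random_state)


# Toy data standing in for the real dataset: roughly 20% ones, 80% zeros.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "response_class": (rng.random(1000) < 0.2).astype(int),
})
balanced = balanced_sample(toy, "response_class", n_total=200, random_state=123)
print(balanced["response_class"].value_counts())
```

With the real data this would be called as balanced_sample(data, "response_class", n_total=10000) in place of data.sample(n=10000). If the minority class has fewer than 5000 rows, pandas raises an error and resampling with replacement (as in the solution above) is needed instead.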

Related

Taking two samples from the data but with different observations

My data consists of about 9000 observations and 20 features (edit: it is a pandas DataFrame). I took a sample of 200 observations like this and conducted some analysis on it:
sample_data = data.sample(n = 200)
Now I want to randomly take a sample of 1000 observations from the original data, with none of the observations that showed up in the previous n = 200 sample. How do I do that?

If you are using pandas.DataFrame, you can simply do it by dropping the old ones and sampling 1000 new ones from the remaining data:
prev_sample_index = sample_data.index
filtered_data = data.drop(prev_sample_index)
new_sample = filtered_data.sample(n = 1000)
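To check that the two samples really are disjoint, the index labels can be compared. This is a small sketch on made-up toy data; it assumes the DataFrame index is unique, because drop removes every row that carries a given label.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the original data: 9000 observations, 20 features.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(9000, 20)))

sample_data = data.sample(n=200, random_state=1)
# Drop the rows already drawn, then sample from what is left.
new_sample = data.drop(sample_data.index).sample(n=1000, random_state=2)

# The two samples share no index labels.
overlap = sample_data.index.intersection(new_sample.index)
print(len(overlap))  # 0
```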

Splitting a dataset into train and test in Python

I have a dataset whose label is 0 or 1.
I want to divide my data into test and train sets. For this, I used the
train_test_split method from sklearn at first,
but I want to select the test data in such a way that 10% of them are from class 0 and 90% are from class 1.
How can I do this?

Refer to the official documentation sklearn.model_selection.train_test_split.
You want to specify the response variable with the stratify parameter when performing the split.
Stratification preserves the ratio of the class variable when the split is performed.
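As a rough illustration of the stratify parameter, the sketch below uses a toy DataFrame whose column name "Label" is an assumption. Note that stratification keeps whatever class ratio the data already has; it yields a 10%/90% test set here only because the toy data is itself 10%/90%.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows of class 0 and 900 rows of class 1.
df = pd.DataFrame({
    "feature": np.arange(1000),
    "Label": np.repeat([0, 1], [100, 900]),
})

# stratify keeps the class ratio of df["Label"] in both parts of the split.
train, test = train_test_split(
    df, test_size=0.2, stratify=df["Label"], random_state=0
)
print(test["Label"].value_counts(normalize=True))
```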

You can write your own function to do this.
One way is to select the rows by index and shuffle them after taking them.
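A function along those lines could look like the sketch below. The name split_with_test_ratio and its parameters are hypothetical; it assumes a binary 0/1 label, a unique index, and enough rows of each class to fill the requested test set.

```python
import numpy as np
import pandas as pd


def split_with_test_ratio(df, label_col, test_size, frac_class_0, random_state=None):
    """Return (train, test) where frac_class_0 of the test rows have label 0.

    Hypothetical helper for illustration only.
    """
    n_test_0 = int(round(test_size * frac_class_0))
    n_test_1 = test_size - n_test_0
    idx_0 = df.index[df[label_col] == 0]
    idx_1 = df.index[df[label_col] == 1]
    rng = np.random.default_rng(random_state)
    # Pick the test rows by index label, separately for each class.
    test_idx = np.concatenate([
        rng.choice(idx_0, size=n_test_0, replace=False),
        rng.choice(idx_1, size=n_test_1, replace=False),
    ])
    rng.shuffle(test_idx)  # shuffle so the two classes are mixed
    test = df.loc[test_idx]
    train = df.drop(test_idx)
    return train, test


# Toy data with 500 rows per class.
toy = pd.DataFrame({"x": np.arange(1000), "Label": np.repeat([0, 1], 500)})
train, test = split_with_test_ratio(toy, "Label", test_size=100,
                                    frac_class_0=0.1, random_state=0)
print(test["Label"].value_counts())
```

Because the test set is built to a fixed composition, the training set ends up with a different class ratio than the original data, which may or may not be what you want.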

Split your dataset in class 1 and class 0, then split as you want:
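from sklearn.model_selection import train_test_split

# "class" is a Python keyword, so the column must be accessed with brackets
df_0 = df.loc[df['class'] == 0]
df_1 = df.loc[df['class'] == 1]
# train_test_split returns (train, test); test_size must be passed by keyword
train_0, test_0 = train_test_split(df_0, test_size=0.1)
train_1, test_1 = train_test_split(df_1, test_size=0.9)
# axis=0 stacks the rows of both classes
test = pd.concat((test_0, test_1),
                 axis=0,
                 ignore_index=True).sample(frac=1)  # sample(frac=1) shuffles the df
train = pd.concat((train_0, train_1),
                  axis=0,
                  ignore_index=True).sample(frac=1)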

How can I partition a data set (CSV file) with the systematic sampling method? (Python)

Here are the requirements:
Partitioning data set into train data set and test data set.
Systematic sampling should be used when partitioning data.
The train data set should be about 80% of all data points and the test data set should be 20% of them.
I have tried some code:
def systematic_sampling(df, step):
    indexes = np.arange(0, len(df), step=step)
    systematic_sample = df.iloc[indexes]
    return systematic_sample
and
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.2)
Each snippet does either systematic sampling or data partitioning, but I'm not sure how to satisfy both conditions at the same time.

Systematic sampling:
It is a sampling technique in which the first element is selected at random and the others are selected at a fixed sampling interval. For instance, consider a population of size 20 (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20).
Suppose the randomly selected first element is number 3 and we want a sample size of 5. The sampling interval is 20/5 = 4, so the next selection is 3 + 4 = 7, giving 3, 7, 11, 15, 19.
So you want this, and you also want to partition your data into two separate parts, one with 80% and the other with 20% of the original data.
You may use the following:
import numpy as np
import pandas as pd

def systematic_sampling(df, step):
    indexes = np.arange(0, len(df), step=step)
    systematic_sample = df.iloc[indexes]
    return systematic_sample

# Sampling the 80% part directly would need a step of 1.25, which int()
# truncates to 1 (every row), so sample the 20% test set systematically instead.
test_fraction = 0.2
step = int(1 / test_fraction)  # every 5th row
test_df = systematic_sampling(df, step)
# The rows that were not selected form the training set, i.e. "df - test_df"
# (this assumes the index labels of df are unique)
train_df = df.drop(test_df.index)
Now train_df holds about 80% of the original data and test_df holds the remaining 20% as the test data.
For others reading this, it was a good reference to read about this question: Read Me!
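Since the definition above says the first element is chosen at random, a variant with a random starting offset can be sketched as follows. The function name systematic_split and its parameters are hypothetical. It works on row positions, so it does not depend on the index being unique.

```python
import numpy as np
import pandas as pd


def systematic_split(df, test_fraction=0.2, random_state=None):
    """Systematically sample the test set; the remaining rows form the train set.

    Hypothetical helper for illustration only.
    """
    step = int(round(1 / test_fraction))      # e.g. every 5th row for 20%
    rng = np.random.default_rng(random_state)
    start = int(rng.integers(0, step))        # random first element in [0, step)
    test_pos = np.arange(start, len(df), step)
    mask = np.zeros(len(df), dtype=bool)
    mask[test_pos] = True
    return df[~mask], df[mask]


toy = pd.DataFrame({"value": np.arange(1, 21)})  # population of size 20
train_df, test_df = systematic_split(toy, test_fraction=0.2, random_state=0)
print(len(train_df), len(test_df))  # 16 4
```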

Multiple-series training input gives NaN loss while the same data as single-series training input does not

I want to train an N-BEATS time series model using Darts. I have a time series DataFrame for each user, so I want to use multiple-series training, but when I feed the list of TimeSeries I directly get NaN losses during training. If I concatenate all users' TimeSeries into one, I get a normal loss. In both cases the data is scaled, filled and cast to float32:
data = scaler.transform(filler.transform(data)).astype(np.float32)
Here is the code that I use to combine the list of TimeSeries into a single TimeSeries. I also have pure Darts code for that, but it is much slower for the same result.
SPLIT = 0.8
if concatenate_to_one_ts:
    all_dfs = []
    all_dfs_cov = []
    for i in range(len(list_of_target_ts)):
        all_dfs.append(list_of_target_ts[i].pd_series())
        all_dfs_cov.append(list_of_cov_ts[i].pd_dataframe())
    all_dfs = pd.concat(all_dfs)
    all_dfs_cov = pd.concat(all_dfs_cov)
    nbr_train_sample = int(len(all_dfs) * SPLIT)
    all_dfs_train = all_dfs[:nbr_train_sample]
    all_dfs_test = all_dfs[nbr_train_sample:]
    list_of_target_ts_train = TimeSeries.from_series(all_dfs_train.reset_index(drop=True))
    list_of_target_ts_test = TimeSeries.from_series(all_dfs_test.reset_index(drop=True))
    all_dfs_cov_train = all_dfs_cov[:nbr_train_sample]
    all_dfs_cov_test = all_dfs_cov[nbr_train_sample:]
    list_of_cov_ts_train = TimeSeries.from_dataframe(all_dfs_cov_train.reset_index(drop=True))
    list_of_cov_ts_test = TimeSeries.from_dataframe(all_dfs_cov_test.reset_index(drop=True))
else:
    nbr_train_sample = int(len(list_of_target_ts) * SPLIT)
    list_of_target_ts_train = list_of_target_ts[:nbr_train_sample]
    list_of_target_ts_test = list_of_target_ts[nbr_train_sample:]
    list_of_cov_ts_train = list_of_cov_ts[:nbr_train_sample]
    list_of_cov_ts_test = list_of_cov_ts[nbr_train_sample:]
model = NBEATSModel(input_chunk_length=4,
                    output_chunk_length=1,
                    batch_size=512,
                    n_epochs=5,
                    nr_epochs_val_period=1,
                    model_name="NBEATS_test",
                    generic_architecture=True,
                    force_reset=True,
                    save_checkpoints=True,
                    show_warnings=True,
                    log_tensorboard=True,
                    torch_device_str='cuda:0'
                    )
model.fit(series=list_of_target_ts_train,
          past_covariates=list_of_cov_ts_train,
          val_series=list_of_target_ts_val,
          val_past_covariates=list_of_cov_ts_val,
          verbose=True,
          num_loader_workers=20)
With multiple-series training I get:
Epoch 0: 8%|██████████▉ | 2250/27807 [03:00<34:11, 12.46it/s, loss=nan, v_num=logs, train_loss=nan.0
With single-series training I get:
Epoch 0: 24%|█████████████████████████▋ | 669/2783 [01:04<03:24, 10.33it/s, loss=0.00758, v_num=logs, train_loss=0.00875]
I am also confused by the number of samples per epoch with the same batch size. From what I read here: https://unit8.com/resources/training-forecasting-models/ the single series should have more samples, since the window cut does not happen separately for each of the multiple series.

Regarding the NaNs, I would try reducing the learning rate if I were you. Also double check that there are no NaNs remaining in your data (see corresponding entry here 1).
Regarding the number of samples, each of the separate time series is split into several (input, output) slices. For the single series, this split is done once overall, whereas for the multiple series, this split is done once per series and then all the resulting samples are regrouped in a common training set. So it is expected to have more training samples with multiple series (and each training sample will have fewer dimensions compared to the single-multivariate-series case).

Thanks Julien Herzen, your answer helped me a lot to find the issue. I want to add more details on what was happening.
Regarding the NaNs: the filler from Darts uses pandas interpolation by default. That interpolation was not possible in the multi-series case because some of the series had only NaN in those columns, so there was nothing to interpolate from and the returned series still contained NaN values. It did not happen for the single concatenated series because, once all the series were concatenated, there were values to interpolate from. If you do not need interpolation, just pass fill=0.0, as in MissingValuesFiller(fill=0.0).
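The effect described above can be reproduced with plain pandas. This is only a sketch of the mechanism on made-up toy series; it does not call the Darts filler itself, and the interpolation arguments Darts uses internally may differ from the ones assumed here.

```python
import numpy as np
import pandas as pd

# Two toy series for the same column: one with values, one that is entirely NaN.
series_a = pd.Series([1.0, np.nan, 3.0])
series_b = pd.Series([np.nan, np.nan, np.nan])

# Interpolated separately, the all-NaN series has nothing to interpolate from.
separate = [s.interpolate(limit_direction="both") for s in (series_a, series_b)]
print([bool(s.isna().any()) for s in separate])

# Concatenated first, the NaN block borrows values from the other series.
combined = pd.concat([series_a, series_b], ignore_index=True)
combined = combined.interpolate(limit_direction="both")
print(bool(combined.isna().any()))

# A constant fill removes the NaNs regardless of what else is in the series.
filled_b = series_b.fillna(0.0)
print(bool(filled_b.isna().any()))
```

A quick check of this kind on every series in the list, after filling and before training, would have revealed which users still carried NaN values.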
Regarding the number of samples: after digging into the Darts code, I found out that the N-BEATS model uses GenericShiftedDataset, which for multiple series computes the length of the dataset by
getting the length of the longest sub-series and multiplying it by the number of series.
self.max_samples_per_ts = (max(len(ts) for ts in self.target_series) - self.size_of_both_chunks + 1)
Then, when __getitem__ is called:
target_idx = idx // self.max_samples_per_ts
target_series = self.target_series[target_idx]
It selects a series by dividing idx by the maximum number of samples per series, so shorter series are sampled more often than longer ones: they have less data but the same chance of being selected.
Here is my smallest example with input_chunk_length = 4 and output_chunk_length = 1:
Multi series with lengths: [71, 19] -> number of samples: (71 * 2) - (2 * input_chunk_length) = 134
Concatenated into a single series: 90 -> number of samples: 90 - input_chunk_length = 86
In the multi-series case, the samples in the short sub-series will likely be drawn more often.
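The two counts above can be recomputed from the max_samples_per_ts formula quoted earlier. This is a sketch only: the function name is hypothetical, and it assumes that size_of_both_chunks equals input_chunk_length + output_chunk_length, which is consistent with the numbers in this example but is not taken from the Darts source shown here.

```python
def dataset_length(series_lengths, input_chunk_length, output_chunk_length):
    """Number of training samples, following the formula quoted above.

    Hypothetical helper; size_of_both_chunks is an assumption (see lead-in).
    """
    size_of_both_chunks = input_chunk_length + output_chunk_length
    max_samples_per_ts = max(series_lengths) - size_of_both_chunks + 1
    # Every series is treated as if it had as many samples as the longest one.
    return max_samples_per_ts * len(series_lengths)


multi = dataset_length([71, 19], input_chunk_length=4, output_chunk_length=1)
single = dataset_length([90], input_chunk_length=4, output_chunk_length=1)
print(multi, single)  # 134 86
```

With these lengths the longest series alone yields 67 windows, and the 19-step series is assigned 67 dataset indices as well even though it only has 15 distinct windows under the same assumption, which is why its windows get drawn repeatedly.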

NearestNeighbors kneighbors method returns different output for different sample sizes

I've built a NearestNeighbors model with scikit-learn. The clusters seem fine when I get them with the kneighbors method right after fitting the model.
model = NearestNeighbors(n_jobs=-1, n_neighbors=5).fit(np.array(df))
distance, indices = model.kneighbors(np.array(df)) ## one of the distances is always 0, as expected. And clusters are acceptable.
But when I save the model and then load it again for the training data, the outputs are not acceptable.
model = pickle.load(f)
distance, indices = model.kneighbors(np.array(df)) ## same dataset, average/bad results. None of distances are 0.
And, the biggest problem: the indices and distances change depending on the size of df.
model = pickle.load(f)
df_1 = df[df["id"] == "1"] # Trying for just one user
distance, indices = model.kneighbors(np.array(df_1)) ## one row, same output for every user.
df_2 = df[df["id"] == "2"]
distance, indices = model.kneighbors(np.array(df_2)) ## same output
df = df[(df["id"] == "2") | (df["id"] == "1")]  # parentheses are required: | binds tighter than ==
distance, indices = model.kneighbors(np.array(df)) ## different output for both
The train/test dataset looks like this:
feature1 | feature2 | feature3
0        | 1        | 1
1        | 1        | 1
0        | 0        | 0
Why do we train and save a model if it's not possible to use it afterwards with a different dataset? Is this the expected behavior of the model, or am I missing something?

Well, it was a horrible mistake I made, and I want to share the problem and the solution. It is very simple, but it may be hard to see.
I read the docs a thousand times and then noticed that they use np.array, not a DataFrame. I was using a DataFrame for prediction and its column order was randomized, so the values did not line up with the columns the model was fitted on and it was not working correctly.
If you have a problem like this, be careful about the column positions in your NumPy arrays!
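One way to guard against this is to store the column order used for fitting and reapply it before every kneighbors call. The sketch below uses made-up random data and a hypothetical feature_order list; in practice that list would need to be saved together with the pickled model.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
train = pd.DataFrame(rng.random((50, 3)),
                     columns=["feature1", "feature2", "feature3"])

# Remember the column order the model was fitted with.
feature_order = list(train.columns)
model = NearestNeighbors(n_neighbors=5).fit(train[feature_order].to_numpy())

# The same rows, but with the columns in a different order.
shuffled = train[["feature3", "feature1", "feature2"]]

# Selecting by the saved order restores the layout the model expects.
dist_ok, _ = model.kneighbors(shuffled[feature_order].to_numpy())
# Passing the shuffled columns as-is compares the wrong features with each other.
dist_bad, _ = model.kneighbors(shuffled.to_numpy())

self_match_ok = bool(np.allclose(dist_ok[:, 0], 0.0, atol=1e-6))
self_match_bad = bool(np.allclose(dist_bad[:, 0], 0.0, atol=1e-6))
print(self_match_ok, self_match_bad)
```

When the column order matches, every row finds itself as its nearest neighbour at distance 0, which is the behaviour described in the question right after fitting.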
