Taking two samples from the data but with different observations - python

My data consists of about 9000 observations and 20 features (Edit: it is a pandas DataFrame). I've taken a sample of 200 observations like this and conducted some analysis on it:
sample_data = data.sample(n = 200)
Now I want to randomly take a sample of 1000 observations from the original data, with none of the observations that showed up in the previous n = 200 sample. How do I do that?

If you are using a pandas DataFrame, you can simply do it by dropping the previously sampled rows and drawing 1000 new ones from the remaining data:
prev_sample_index = sample_data.index          # index labels of the first (n = 200) sample
filtered_data = data.drop(prev_sample_index)   # remove those rows from the original data
new_sample = filtered_data.sample(n = 1000)    # sample 1000 rows from what is left
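A quick sanity check (illustrative only) that the two samples are disjoint and have the expected sizes:
# new_sample was drawn after the first sample's rows were dropped, so the index sets should not overlap
assert new_sample.index.intersection(prev_sample_index).empty
print(len(sample_data), len(new_sample))   # 200, 1000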


How can I partition a data set (CSV file) with the systematic sampling method? (Python)

Here are the requirements:
Partitioning data set into train data set and test data set.
Systematic sampling should be used when partitioning data.
The train data set should be about 80% of all data points and the test data set should be 20% of them.
I have tried some code:
def systematic_sampling(df, step):
    indexes = np.arange(0, len(df), step=step)
    systematic_sample = df.iloc[indexes]
    return systematic_sample
and
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.2)
These snippets either do systematic sampling or partition the data, but I'm not sure how to satisfy both conditions at the same time.
Systematic sampling:
It is a sampling technique in which the first element is selected at random and the others are selected based on a fixed sampling interval. For instance, consider a population of size 20 (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20).
Suppose the first randomly selected element is number 3 and we want a sample size of 5. The sampling interval is 20/5 = 4, so the next selection is 3 + 4 = 7, giving 3, 7, 11, 15, 19.
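A quick illustration of that arithmetic in Python (assuming element 3 is the random start):
population = list(range(1, 21))             # population of size 20
start, sample_size = 3, 5
interval = len(population) // sample_size   # 20 / 5 = 4
print(population[start - 1::interval])      # [3, 7, 11, 15, 19]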
So, you want to apply this and also partition your data into two separate sets, one containing 80% and the other 20% of the original data.
You may use the following:
import numpy as np
import pandas as pd

def systematic_sampling(df, step):
    # step can be fractional; cast the positions to integers so iloc accepts them
    indexes = np.arange(0, len(df), step=step).astype(int)
    systematic_sample = df.iloc[indexes]
    return systematic_sample

train_size = 0.8 * len(df)
step = len(df) / train_size  # 1.25 for an 80% sample; int() here would round it down to 1 and select every row
train_df = systematic_sampling(df, step)
# First, concat both data frames, so the output will have some duplicates!
remaining_df = pd.concat([df, train_df])
# Then, drop the rows that appear twice; the effect is "df - train_df"
remaining_df = remaining_df.drop_duplicates(keep=False)
Now train_df holds 80% of the original data and remaining_df holds the remaining 20% as test data.
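As a quick sanity check (a sketch, assuming df is your original DataFrame):
print(len(train_df), len(remaining_df), len(df))
print(round(len(train_df) / len(df), 2))  # should be close to 0.8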
For others reading this, this reference was helpful for understanding the question: Read Me!

Multiple-series training input gives NaN loss while the same data as a single-series training input does not

I want to train an N-BEATS time series model using Darts. I have a time series DataFrame for each user, so I want to use multiple-series training, but when I feed the list of TimeSeries I immediately get NaN as losses during training. If I concatenate all users' TimeSeries into one, I get a normal loss. In both cases the data is scaled, filled and cast to float32:
data = scaler.transform(filler.transform(data)).astype(np.float32)
Here is the code I use to combine the list of TimeSeries into a single TimeSeries. I also have pure Darts code for that, but it is much slower for the same result.
SPLIT = 0.8
if concatenate_to_one_ts:
    all_dfs = []
    all_dfs_cov = []
    for i in range(len(list_of_target_ts)):
        all_dfs.append(list_of_target_ts[i].pd_series())
        all_dfs_cov.append(list_of_cov_ts[i].pd_dataframe())
    all_dfs = pd.concat(all_dfs)
    all_dfs_cov = pd.concat(all_dfs_cov)
    nbr_train_sample = int(len(all_dfs) * SPLIT)
    all_dfs_train = all_dfs[:nbr_train_sample]
    all_dfs_test = all_dfs[nbr_train_sample:]
    list_of_target_ts_train = TimeSeries.from_series(all_dfs_train.reset_index(drop=True))
    list_of_target_ts_test = TimeSeries.from_series(all_dfs_test.reset_index(drop=True))
    all_dfs_cov_train = all_dfs_cov[:nbr_train_sample]
    all_dfs_cov_test = all_dfs_cov[nbr_train_sample:]
    list_of_cov_ts_train = TimeSeries.from_dataframe(all_dfs_cov_train.reset_index(drop=True))
    list_of_cov_ts_test = TimeSeries.from_dataframe(all_dfs_cov_test.reset_index(drop=True))
else:
    nbr_train_sample = int(len(list_of_target_ts) * SPLIT)
    list_of_target_ts_train = list_of_target_ts[:nbr_train_sample]
    list_of_target_ts_test = list_of_target_ts[nbr_train_sample:]
    list_of_cov_ts_train = list_of_cov_ts[:nbr_train_sample]
    list_of_cov_ts_test = list_of_cov_ts[nbr_train_sample:]

model = NBEATSModel(input_chunk_length=4,
                    output_chunk_length=1,
                    batch_size=512,
                    n_epochs=5,
                    nr_epochs_val_period=1,
                    model_name="NBEATS_test",
                    generic_architecture=True,
                    force_reset=True,
                    save_checkpoints=True,
                    show_warnings=True,
                    log_tensorboard=True,
                    torch_device_str='cuda:0'
                    )
model.fit(series=list_of_target_ts_train,
          past_covariates=list_of_cov_ts_train,
          val_series=list_of_target_ts_val,
          val_past_covariates=list_of_cov_ts_val,
          verbose=True,
          num_loader_workers=20)
With multiple-series training I get:
Epoch 0: 8%|██████████▉ | 2250/27807 [03:00<34:11, 12.46it/s, loss=nan, v_num=logs, train_loss=nan.0
With single-series training I get:
Epoch 0: 24%|█████████████████████████▋ | 669/2783 [01:04<03:24, 10.33it/s, loss=0.00758, v_num=logs, train_loss=0.00875]
I am also confused by the number of samples per epoch with the same batch size: from what I read here: https://unit8.com/resources/training-forecasting-models/ the single series should have more samples, as the sliding-window slicing does not happen for each of the multiple series.
Regarding the NaNs, I would try reducing the learning rate if I were you. Also double check that there is no NaN remaining in your data (see the corresponding entry here).
Regarding the number of samples, each of the separate time series is split into several (input, output) slices. For the single series, this split is done once overall, whereas for the multiple series, it is done once per series and all the resulting samples are then regrouped into a common training set. So it is expected to have more training samples with multiple series (and each training sample will have fewer dimensions compared to the single-multivariate-series case).
Thanks Julien Herzen, your answer helped me a lot in finding the issue. I want to add more details on what was happening.
Regarding the NaNs: by default the Darts filler uses pandas interpolation. That interpolation was not possible for the multiple series, because some of the series had only NaN in those columns, so there was nothing to interpolate from and the filler returned series that still contained NaN values. This did not happen for the single concatenated series, because once all the series were concatenated there were values to interpolate from. If you do not need interpolation, just pass fill=0.0, i.e. MissingValuesFiller(fill=0.0).
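As a minimal sketch of that workaround (mirroring the snippet above), MissingValuesFiller accepts a constant fill value instead of the default interpolation:
from darts.dataprocessing.transformers import MissingValuesFiller

filler = MissingValuesFiller(fill=0.0)   # replace NaNs with 0.0 instead of interpolating
data = scaler.transform(filler.transform(data)).astype(np.float32)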
Regarding the number of samples: after digging into the Darts code, I found out that the NBEATS model uses GenericShiftedDataset, which for multiple series computes the length of the dataset by taking the length of the longest sub-series and multiplying it by the number of series:
self.max_samples_per_ts = (max(len(ts) for ts in self.target_series) - self.size_of_both_chunks + 1)
Then, when __getitem__ is called:
target_idx = idx // self.max_samples_per_ts
target_series = self.target_series[target_idx]
It selects a series by dividing idx by the max number of samples, so shorter series will be sampled more often than longer ones: they have less data but the same chance of being sampled.
Here is my smallest example with input_chunk_length = 4 and output = 1:
Multiple series with lengths [71, 19] -> number of samples: (71 * 2) - (2 * input_chunk_length) = 134
Concatenated into a single series of length 90 -> number of samples: 90 - input_chunk_length = 86
In the multiple-series case, the samples in the short sub-series will likely be drawn more often.
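The arithmetic above can be double-checked with a few lines (illustrative only; it applies the formula quoted from GenericShiftedDataset):
input_chunk_length, output_chunk_length = 4, 1
size_of_both_chunks = input_chunk_length + output_chunk_length

lengths = [71, 19]                                           # multiple series
max_samples_per_ts = max(lengths) - size_of_both_chunks + 1
print(max_samples_per_ts * len(lengths))                     # 134

print(90 - size_of_both_chunks + 1)                          # single concatenated series of length 90 -> 86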

Preparing Grid weather data for ConvLSTM2d

I am attempting to use a ConvLSTM2D model with hourly grid weather data. I can get the data into a 4-D array with dimensions (num_hours, lat, lon, num_features). ConvLSTM2D requires 5-D input, and I was planning on setting a sequence length of maybe 24 hours. My question is: how do I create the additional sequence-length dimension in this array, i.e. (num_hours, sequence_length, lat, lon, num_features)? Is there a smarter, more efficient way to get the data into the correct form from a pandas DataFrame that has columns for lat, lon, time, feature type and value?
I realize it is always easier to have a sample dataset when asking a question, so I created one to mimic the issue.
import pandas as pd
import numpy as np
weather_variables = ['windspeed', 'temp','pressure']
lats = [x/10 for x in range(400,500,5)]
lons = [x/10 for x in range(900,1000,5)]
hours = pd.date_range('1/1/2021', '9/28/2021', freq= 'H')
df = []
for i in range(0, len(hours)):
    for weather in weather_variables:
        temp_df = pd.DataFrame(index=lats, columns=lons, data=np.random.randint(0, 100, size=(len(lats), len(lons))))
        temp_df = temp_df.unstack().to_frame()
        temp_df.reset_index(inplace=True)
        temp_df['weather_variable'] = weather
        temp_df['ts'] = hours[i]
        df.append(temp_df)
df = pd.concat(df)
df.columns = ['lon', 'lat', 'value', 'weather_variable', 'ts']
So this code creates a dummy dataset containing 3 grids (one per weather variable) for each hour. The goal is to convert this into a 5-D array of overlapping 24-hour sequences. I think the array would look like (len(hours), 24, 20, 20, 3).
From the ConvLSTM paper,
The weather radar data is recorded every 6 minutes, so there
are 240 frames per day. To get disjoint subsets for training, testing and validation, we partition each
daily sequence into 40 non-overlapping frame blocks and randomly assign 4 blocks for training, 1
block for testing and 1 block for validation. The data instances are sliced from these blocks using
a 20-frame-wide sliding window. Thus our radar echo dataset contains 8148 training sequences,
2037 testing sequences and 2037 validation sequences and all the sequences are 20 frames long (5
for the input and 15 for the prediction).
If my calculations are correct, each of the "non-overlapping frame blocks" should have 6 frames in it (240 frames per day / 40 blocks per day = 6 frames per block), so I'm not sure how you create a 20-frame-wide sliding window within a given block. Nonetheless, you could take a similar approach: divide your data into non-overlapping windows of a specific length. Perhaps you use 6 hours of data to predict the next 6. I'm not sure that you need to keep the windows within a given day; a change from 11 pm to 1 am seems just as valid a time window as one from, say, 3 am to 5 am.
I don't think Pandas will be an efficient way to massage the data. I would stick with NumPy or probably a TensorFlow data structure.
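To answer the sequence-dimension part of the question, here is a minimal sketch (array sizes are illustrative; it assumes you already have the 4-D array (num_hours, lat, lon, num_features) and NumPy >= 1.20). sliding_window_view returns a view, so building the overlapping 24-hour sequences does not copy the data:
import numpy as np

num_hours, n_lat, n_lon, n_feat = 500, 20, 20, 3
grid = np.random.rand(num_hours, n_lat, n_lon, n_feat).astype(np.float32)

seq_len = 24
# the window axis is appended last, so move it next to the batch axis
windows = np.lib.stride_tricks.sliding_window_view(grid, seq_len, axis=0)
sequences = np.moveaxis(windows, -1, 1)
print(sequences.shape)   # (num_hours - seq_len + 1, 24, 20, 20, 3)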

Python: prepare training data set with evenly distributed response variable

I am working on a small machine learning project.
The dataset I use has 56 input parameters and one categorical response variable (0/1). My problem is that the response variable is not evenly distributed. I want to prepare the training data set so that the responses are evenly distributed. How can this be done?
This is what the data looks like:
-> the training dataset should contain the same number of 1s and 0s in the response.
Thanks for your help; as you can imagine, I am really a beginner...
I am the same person who asked the question, sorry for that.
First I load the data from a CSV file (not shown in the code here); it is stored as data. Next, I create a new column named "response_class" based on the value in the "response" column: if it is below 0.045, response_class = 1, otherwise 0. Second, I randomly sample 10000 rows from the data (due to computation limits), and here I want to make sure that I get the same number of 1s and 0s in response_class. At the end I split the data to make it ready for a correlation matrix and for train and test data.
Here is my code:
data = data[data.response != 0]
pd.DataFrame(data)
data['response_class'] = np.where(data['response'] <= 0.045, 1, 0)
#1=below .045 / 0=above 0.045
#reduce amount of data by picking random samples
data= data.sample(n=10000)
#split data
data.drop(['response'], axis=1, inplace=True)
y = data['response_class']
X = data.drop('response_class', axis=1)
X_names = X.columns
data.head()
Found a solution:
from sklearn.utils import resample

# separate based on the response variable response_class
df_zero = pd.DataFrame(data[data.response_class == 0])
df_one = pd.DataFrame(data[data.response_class == 1])
# upsample the minority class
df_zero_min = resample(df_zero,
                       replace=True,
                       n_samples=len(df_one),
                       random_state=123)
df_upsampled = pd.concat([df_one, df_zero_min])
df_upsampled.response_class.value_counts()
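If the goal is instead a balanced 10000-row sample drawn before any upsampling, a hedged alternative (assuming each class has at least 5000 rows) is to sample an equal number from each class:
# draw 5000 rows per class so the 10000-row sample is evenly split
balanced = data.groupby('response_class').sample(n=5000, random_state=123)
print(balanced['response_class'].value_counts())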

Custom Histogram Input for a very large dataset

I have a use case for creating a histogram that is more meaningful than the default ones.
I use Elasticsearch as the datastore where all my numerical data is kept.
It has a numeric price field with highly varying values: most prices in the range 100 to 999 are centered around 399 to 500, from 501 to 999 there are few, then again from 999 to 1299 there is a huge count, and so on.
example:
399-500: 1542
501-999: 7501
1000-1299: 10214
1299-2000: 154
...
When creating a histogram with bucket size 200, only 2 of the 8 buckets account for about 75% of the total bar height; the others are very small.
If I choose a small bucket size, the chart becomes heavy to render, with 1000+ buckets.
If I choose a big bucket size, the plotted chart gives no useful insights.
I want intelligent bucketing where I can split the big buckets into small ranges, say of 50-70, and at the same time merge the small buckets into a single big one, say of 1000, so that the charts are more meaningful.
Is there Python code for such a use case?
Edited:
Due to the two spikes I can neither visualize the flat regions to show the actual variation, nor show the distribution within the spikes, e.g. to say that the price range 449 to 499 contributes the most to the spike of 399 to 500.
Correct me if I am wrong, but if you change the bin width depending on the amount of data located in a bin, there is no longer much reason to use a histogram, since it will no longer convey the same information.
Why don't you instead use a normal line plot to show the result? Simple code to do so would be, for example, the following.
import numpy as np
import matplotlib.pyplot as plt

data_1 = np.random.normal(450, 50, 1542)
data_2 = np.random.normal(700, 200, 7501)
data_3 = np.random.normal(1150, 150, 10214)
data_4 = np.random.normal(1650, 350, 154)
data = np.concatenate((data_1, data_2, data_3, data_4))
nr_of_samples = len(data)
nr_of_bins = 1000
offset = min(data)
value_range = max(data) - min(data)   # renamed from "range" to avoid shadowing the builtin
bins = np.zeros(nr_of_bins)
for d in data:
    # clamp to the last bin so the maximum value does not fall out of range
    bin_index = min(int(((d - offset) / value_range) * nr_of_bins), nr_of_bins - 1)
    bins[bin_index] += 1
plt.plot(np.linspace(min(data), max(data), nr_of_bins), bins)
plt.xlabel("Value")
plt.ylabel(f"Nr of values, binwidth = {round(value_range / nr_of_bins, 2)}")
plt.show()
Giving the following final result:
