Applying k-means clustering to 3-dimensional data - Python

I am trying to apply k-means clustering from sklearn to a (52, 168, 2)-dimensional dataset. As expected, it raises a dimension error from the estimator, since 2D data is expected. What should be the way forward?
I have wind and load data in two separate files for a year, with one week of data (one-hour resolution) in each row of both files. The wind and load data are correlated (i.e., week 1 wind data corresponds to week 1 load data). I am trying to apply k-means clustering to reduce the operating periods from 52 weeks to an appropriate number of weeks (ideally 12). Hence, each data point in this case is a 168x2 NumPy array combining the weekly wind and load data.
The dimension of the data comes out to (52, 168, 2), since I have 52 weeks and each data point is 168x2. However, I can't feed this to sklearn's k-means, as it requires 2D data. I am wondering: if I reshape the data as data.reshape(52, 168*2), will it preserve what I am aiming to do?
import pandas as pd
import numpy as np

Load_data = pd.read_csv('Scenario_Load_Data.csv', header=None)
Load_data_final = Load_data.to_numpy()
Wind_data = pd.read_csv('Scenario_Wind_Data.csv', header=None)
Wind_data_final = Wind_data.to_numpy()

create_list = []
for i in range(len(Load_data_final)):
    # pair each week's 168 load values with its 168 wind values -> (168, 2)
    intermediate_v = np.column_stack((Load_data_final[i, :], Wind_data_final[i, :]))
    create_list.append(intermediate_v)
data = np.array(create_list)   # shape (52, 168, 2)
ValueError: Found array with dim 3. Estimator expected <= 2.

Since you want to group by week, I believe you can concatenate the wind and load data in the same array: one week becomes one row, and the 168 + 168 values become its attributes. So you would have something like:
Week_1: at1, at2, at3, ..., at336
Week_2: at1, at2, at3, ..., at336
...
Week_52: at1, at2, at3, ..., at336
So, I think that is pretty much what you are intending to do with the reshape.
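If it helps, here is a minimal sketch of that idea with scikit-learn, assuming data has shape (52, 168, 2) as built above and that 12 clusters is the target:
from sklearn.cluster import KMeans
# flatten each week into one row of 336 attributes; the week (row) order is preserved,
# only the within-week layout changes (load and wind values end up interleaved)
X = data.reshape(52, 168 * 2)
# cluster the 52 weeks into 12 representative operating periods
kmeans = KMeans(n_clusters=12, random_state=0).fit(X)
print(kmeans.labels_)   # cluster index assigned to each week
K-means only sees each week as a point in a 336-dimensional space, so the column order does not matter as long as it is the same for every week.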

Related

Multiple-series training input is giving NaN loss while the same data as a single-series training input is not

I want to train an N-BEATS time series model using Darts. I have a time series DataFrame for each user, so I want to use multiple-series training, but when I feed the list of TimeSeries I immediately get NaN losses during training. If I concatenate all users' TimeSeries into one, I get a normal loss. In both cases the data is scaled, filled, and cast to float32:
data = scaler.transform(filler.transform(data)).astype(np.float32)
Here is the code that I use to combine the list of TimeSeries into a single TimeSeries. I also have pure Darts code for that, but it is much slower for the same result.
SPLIT = 0.8
if concatenate_to_one_ts:
    all_dfs = []
    all_dfs_cov = []
    for i in range(len(list_of_target_ts)):
        all_dfs.append(list_of_target_ts[i].pd_series())
        all_dfs_cov.append(list_of_cov_ts[i].pd_dataframe())
    all_dfs = pd.concat(all_dfs)
    all_dfs_cov = pd.concat(all_dfs_cov)
    nbr_train_sample = int(len(all_dfs) * SPLIT)
    all_dfs_train = all_dfs[:nbr_train_sample]
    all_dfs_test = all_dfs[nbr_train_sample:]
    list_of_target_ts_train = TimeSeries.from_series(all_dfs_train.reset_index(drop=True))
    list_of_target_ts_test = TimeSeries.from_series(all_dfs_test.reset_index(drop=True))
    all_dfs_cov_train = all_dfs_cov[:nbr_train_sample]
    all_dfs_cov_test = all_dfs_cov[nbr_train_sample:]
    list_of_cov_ts_train = TimeSeries.from_dataframe(all_dfs_cov_train.reset_index(drop=True))
    list_of_cov_ts_test = TimeSeries.from_dataframe(all_dfs_cov_test.reset_index(drop=True))
else:
    nbr_train_sample = int(len(list_of_target_ts) * SPLIT)
    list_of_target_ts_train = list_of_target_ts[:nbr_train_sample]
    list_of_target_ts_test = list_of_target_ts[nbr_train_sample:]
    list_of_cov_ts_train = list_of_cov_ts[:nbr_train_sample]
    list_of_cov_ts_test = list_of_cov_ts[nbr_train_sample:]

model = NBEATSModel(input_chunk_length=4,
                    output_chunk_length=1,
                    batch_size=512,
                    n_epochs=5,
                    nr_epochs_val_period=1,
                    model_name="NBEATS_test",
                    generic_architecture=True,
                    force_reset=True,
                    save_checkpoints=True,
                    show_warnings=True,
                    log_tensorboard=True,
                    torch_device_str='cuda:0'
                    )

model.fit(series=list_of_target_ts_train,
          past_covariates=list_of_cov_ts_train,
          val_series=list_of_target_ts_val,
          val_past_covariates=list_of_cov_ts_val,
          verbose=True,
          num_loader_workers=20)
With multiple-series training I get:
Epoch 0: 8%|██████████▉ | 2250/27807 [03:00<34:11, 12.46it/s, loss=nan, v_num=logs, train_loss=nan.0
With single-series training I get:
Epoch 0: 24%|█████████████████████████▋ | 669/2783 [01:04<03:24, 10.33it/s, loss=0.00758, v_num=logs, train_loss=0.00875]
I am also confused by the number of samples per epoch given the same batch size: from what I read here, https://unit8.com/resources/training-forecasting-models/, the single series should have more samples, since the window-size cut does not happen for each of the multiple series.
Regarding the NaNs, I would try reducing the learning rate if I were you. Also double-check that there are no NaNs remaining in your data (see the corresponding entry here).
Regarding the number of samples, each of the separate time series are split into several (input, output) slices. For the single series, this split is done once overall, whereas for the multiple series, this split is done once per series and then all the resulting samples are regrouped in a common training set. So it is expected to have more training samples with multiple series (and each training sample will have fewer dimensions compared to the single-multivariate-series case).
Thanks Julien Herzen, your answer helped me a lot in finding the issue. I want to add more details on what was happening.
Regarding the NaNs: the filler from Darts uses pandas interpolation by default. That interpolation was not possible for the multiple series, because some of the series had only NaN in certain columns, so there was nothing to interpolate from, and the filler returned series that still contained NaN values. This did not happen for the single concatenated series because, with all the series concatenated, there were values to interpolate from. If you do not need interpolation, just pass fill=0.0, i.e. MissingValuesFiller(fill=0.0).
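For reference, a minimal sketch of that fix (a constant fill via Darts' MissingValuesFiller instead of the default interpolation):
from darts.dataprocessing.transformers import MissingValuesFiller
# constant fill, so sub-series whose columns are entirely NaN still come back NaN-free
filler = MissingValuesFiller(fill=0.0)
list_of_cov_ts = [filler.transform(ts) for ts in list_of_cov_ts]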
Regarding the number of samples: after digging into the Darts code, I found out that the N-BEATS model uses GenericShiftedDataset, which for multiple series computes the length of the dataset by taking the length of the longest sub-series and multiplying it by the number of series:
self.max_samples_per_ts = (max(len(ts) for ts in self.target_series) - self.size_of_both_chunks + 1)
Then, when __getitem__ is called:
target_idx = idx // self.max_samples_per_ts
target_series = self.target_series[target_idx]
It selects a series by dividing idx by the max number of samples, so shorter series will be sampled more often than longer ones: they contain less data but have the same chance of being selected.
Here is my smallest example, with input_chunk_length = 4 and output_chunk_length = 1:
Multiple series with lengths [71, 19] -> number of samples: (71 * 2) - (2 * input_chunk_length) = 134
Concatenated into a single series of length 90 -> number of samples: 90 - input_chunk_length = 86
In the multiple-series case, the samples in the short sub-series will likely be drawn more often.
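A small sketch of that index arithmetic (not the actual Darts code, just the logic described above) makes the oversampling visible:
input_chunk_length, output_chunk_length = 4, 1
lengths = [71, 19]                       # lengths of the two sub-series
size_of_both_chunks = input_chunk_length + output_chunk_length
max_samples_per_ts = max(lengths) - size_of_both_chunks + 1   # 67
n_samples = max_samples_per_ts * len(lengths)                 # 134
# every series gets the same share of indices, so the 19-step series
# is revisited far more often per data point than the 71-step one
for idx in [0, 66, 67, 133]:
    target_idx = idx // max_samples_per_ts
    print(idx, "-> series", target_idx)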

Preparing grid weather data for ConvLSTM2D

I am attempting to use a ConvLSTM2D model with hourly grid weather data. I can get the data into a 4D array with dimensions (num_hours, lat, lon, num_features). ConvLSTM2D requires 5D input, and I was planning on setting a variable for a sequence length of maybe 24 hours. My question is: how do I create the additional sequence-length dimension in this array to get (num_hours, sequence_length, lat, lon, num_features)? Is there a smarter, more efficient way to get the data into the correct form from a pandas DataFrame that has columns for lat, lon, time, feature type and value?
I realize it is always easier to have a sample dataset when asking a question, so I created one to mimic the issue.
import pandas as pd
import numpy as np

weather_variables = ['windspeed', 'temp', 'pressure']
lats = [x / 10 for x in range(400, 500, 5)]
lons = [x / 10 for x in range(900, 1000, 5)]
hours = pd.date_range('1/1/2021', '9/28/2021', freq='H')

df = []
for i in range(0, len(hours)):
    for weather in weather_variables:
        temp_df = pd.DataFrame(index=lats, columns=lons,
                               data=np.random.randint(0, 100, size=(len(lats), len(lons))))
        temp_df = temp_df.unstack().to_frame()
        temp_df.reset_index(inplace=True)
        temp_df['weather_variable'] = weather
        temp_df['ts'] = hours[i]
        df.append(temp_df)
df = pd.concat(df)
df.columns = ['lon', 'lat', 'value', 'weather_variable', 'ts']
So this code will create a dummy dataset containing 3 grids (one per weather variable) for each hour. The goal is to convert this into a 5D array of overlapping 24-hour sequences. The array would look like this, I think: (len(hours)?, 24, 20, 20, 3).
From the ConvLSTM paper,
The weather radar data is recorded every 6 minutes, so there
are 240 frames per day. To get disjoint subsets for training, testing and validation, we partition each
daily sequence into 40 non-overlapping frame blocks and randomly assign 4 blocks for training, 1
block for testing and 1 block for validation. The data instances are sliced from these blocks using
a 20-frame-wide sliding window. Thus our radar echo dataset contains 8148 training sequences,
2037 testing sequences and 2037 validation sequences and all the sequences are 20 frames long (5
for the input and 15 for the prediction).
If my calculations are correct, each of the "non-overlapping frame blocks" should have 6 frames in it (240 frames per day / 40 blocks per day = 6 frames per block). So I'm not sure how you create a 20-frame-wide sliding window in a given block. Nonetheless, you could take a similar approach: divide your data into non-overlapping windows of a specific length. Perhaps you use 6 hours of data to predict the next 6. I'm not sure that you need to keep the windows within a given day--a change from 11 pm to 1 am seems just as valid a time window as from say 3 am to 5 am.
I don't think Pandas will be an efficient way to massage the data. I would stick with NumPy or probably a TensorFlow data structure.
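If it helps, here is a minimal NumPy sketch of that conversion, assuming the dummy df above (every (ts, lat, lon, weather_variable) combination present exactly once): pivot to a dense (num_hours, lat, lon, num_features) grid, then take overlapping 24-hour windows with sliding_window_view so no data is copied.
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
seq_len = 24
n_hours = df['ts'].nunique()
n_lat = df['lat'].nunique()
n_lon = df['lon'].nunique()
n_feat = df['weather_variable'].nunique()
# sort so that a plain reshape lines up with the (hour, lat, lon, feature) order
grid = (df.sort_values(['ts', 'lat', 'lon', 'weather_variable'])['value']
          .to_numpy()
          .reshape(n_hours, n_lat, n_lon, n_feat))
# overlapping 24-hour sequences: (n_hours - 23, 24, n_lat, n_lon, n_feat)
windows = sliding_window_view(grid, seq_len, axis=0)   # window axis is appended last
windows = np.moveaxis(windows, -1, 1)
print(windows.shape)   # e.g. (len(hours) - 23, 24, 20, 20, 3)
Note that sliding_window_view returns a read-only view; call np.ascontiguousarray(windows) (at a memory cost) if the model needs a contiguous, writable array.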

Combine the arrays in the datasets of an HDF5 file and finally get a 2D array

I have an HDF5 file that contains 500 datasets (named A000, A001, A002, A003, ..., A499), and each dataset contains an array of size (200, 5400). I want to combine the arrays in these datasets and finally get a single 2D array. I can achieve the result by doing certain manipulations in a for loop, but this process takes a long time. What I did is this:
import numpy as np

# f is the already-open h5py.File; my_list holds the dataset names A000 ... A499
power_list = []
for datasets in my_list:
    dataset = f[datasets][:]
    i_data = dataset['real']
    q_data = dataset['imag']
    # power = 10*log10(10*(I^2 + Q^2) + 1)
    power = np.log10(((np.add(np.square(np.abs(i_data)), np.square(np.abs(q_data)))) * 10) + 1) * 10
    power = np.rot90(power)
    power_list.append(power)
    print("Dataset: ", datasets)
power = np.concatenate(power_list, 1)
So, is there any way of doing this in a shorter time, maybe without a for loop?
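For what it's worth, here is a hedged sketch of one way to vectorize the math (the reads still loop over the dataset names, and it assumes all 500 blocks fit in memory at once):
import numpy as np
# read everything into one structured array of shape (500, 200, 5400)
stack = np.stack([f[name][:] for name in my_list])
i_data = stack['real']
q_data = stack['imag']
# same power formula as above, computed once for the whole stack
power = 10 * np.log10(10 * (i_data ** 2 + q_data ** 2) + 1)
power = np.rot90(power, axes=(1, 2))        # rotate each (200, 5400) block -> (5400, 200)
result = np.concatenate(power, axis=1)      # final 2D array of shape (5400, 200 * 500)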

Apply Mann-Whitney U test on a multidimensional array and replace single values of a variable of an xarray DataArray in Python

I'm new to Python and need some help with xarray.
I have two 3-dimensional data arrays (rlon, rlat, time) for future and past climate. I want to compute the Mann-Whitney U test for each grid point to analyse the significance of temperature change in the future compared to the past. I already got the Mann-Whitney U test to work by selecting a time series from one grid point of the historical and future data each. Example:
import numpy as np
import xarray as xr
import scipy.stats as sts
#selecting time period and grid point of past and future data
tp = fileHis['tas']
tf = fileFut['tas']
gridpoint_past=tp.sel(rlon=-6.375, rlat=1.375, time=slice('1999-01-01', '1999-01-31'))
gridpoint_future=tf.sel(rlon=-6.375, rlat=1.375, time=slice('2099-01-01', '2099-01-31'))
# Mann-Whitney U test
result=sts.mannwhitneyu(gridpoint_past, gridpoint_future, alternative='two-sided')
print('pvalue =',result[1])
Output:
pvalue = 0.05922372345359562
My problem now is that I need to do this for each grid point and each month, and in the end I would like to have a data array with p-values for each grid point and each month of the year.
I was thinking about looping through all rlat, rlon and months and running the Mann-Whitney U test for each, unless there is a better way to do this?
And how can I write the p-values one by one into a new data array with the same rlat, rlon dimensions?
I was trying this, but it does not work:
I created a data array pvalue_mon, which has the same rlat, rlon as tp and tf and has 12 months as time steps.
pvalue_mon.sel(rlon=-6.375, rlat=1.375, time=th.time.dt.month.isin([1])) = result[1]
SyntaxError: can't assign to function call
or this:
pvalue_mon.sel(rlon=-6.375, rlat=1.375, time=pvalue_mon.time.dt.month.isin([1])).update(result[1])
TypeError: 'numpy.float64' object is not iterable
How can I replace a single value of an existing variable?
Instead of using the .sel() function, try using .loc[ ] as described here:
http://xarray.pydata.org/en/stable/indexing.html#assigning-values-with-indexing
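A minimal sketch of that .loc-style assignment, assuming pvalue_mon has dims (time, rlat, rlon) and that the boolean month indexer works with .loc the same way it worked with .sel in the snippet above:
# write the January p-value into the matching grid point of pvalue_mon
month_mask = pvalue_mon.time.dt.month.isin([1])
pvalue_mon.loc[dict(rlon=-6.375, rlat=1.375, time=month_mask)] = result[1]
The same assignment can sit inside loops over rlon, rlat and month; alternatively, you could look into xarray.apply_ufunc with scipy.stats.mannwhitneyu to vectorize the whole grid at once.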

Put a fixed quantity of missing values in a dataset - Azure ML

I'm dealing with Azure ML, and my goal is to see what happens if I have a fixed quantity (as a percentage) of missing values in my dataset.
My idea could be:
Starting from the dataset (take the Adult dataset as an example), duplicate the original dataset and call it X by convention. Dataset X will contain randomly placed missing values amounting to 20% of the data. Once we have the original dataset and the duplicated dataset X, we can use a neural-net algorithm, create training and test sets, and then train this neural net with dataset X as input. What would be interesting to see is the global error produced. Afterwards we can imagine expanding the share of missing values in dataset X: starting from 20%, then 40%, and so on. I think the hardest part is duplicating the original dataset to create dataset X with these missing values.
How can I do it? Using modules in Azure ML, or maybe R/Python scripts?
Just sharing my idea; please see the sample code and comments below.
import numpy as np
import pandas as pd
# Original DataFrame
df = pd.DataFrame(np.random.randn(6, 4))
# Copy the data by flattening the data matrix into an array
array = df.values.flatten()
# Insert missing data by percentage
# Define the percentage of missing data
percent = 0.2
size = len(array)
# Generate a random list of indices whose values will be set to NaN
# (replace=False so exactly int(size*percent) distinct cells are blanked)
chosen = np.random.choice(size, int(size * percent), replace=False)
array[chosen] = np.nan
# Create a new DataFrame with missing data
df2 = pd.DataFrame(np.reshape(array, (6, 4)))
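A quick sanity check on df2: the fraction of NaN cells equals int(size * percent) / size, close to the requested percent.
print(df2.isna().to_numpy().mean())   # 4/24, about 0.17, for the 6x4 example above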
Hope it helps.
