Optimizing data pipeline for a large time series dataset - python

I have the following file structure for my time-series dataset.
directory 1, directory 2, ...
Each directory contains a variable number of CSV files, and each CSV file contains (time, X, y) triples and has a varying length (1 day, 3 days, etc.).
Modelling process/requirement
I am training a single-output regression model that takes an N-dimensional vector as input, and the data is fed to the model in a sliding-window fashion.
Problem
With a small number of CSV files, tf.keras.utils.timeseries_dataset_from_array works fine (I merge all the files into one long vector beforehand). However, with a large number of CSV files this approach fails because it consumes all of the available memory.
Following this, I tried to write a custom data generator; for the sake of simplicity I have skipped the unimportant code:
def __getitem__(self, idx):
    random_tpls = []
    while len(random_tpls) < self.batch_size:
        ...
        # file_path: randomly chosen CSV path
        # rand_line_num: randomly chosen line number within that CSV file
        tmp_tpl = (file_path, rand_line_num)
        random_tpls.append(tmp_tpl)
    X, y = self.__data_generation(random_tpls=random_tpls)
    return X, y

def __data_generation(self, random_tpls):
    ...
    for idx, tpl in enumerate(random_tpls):
        fp, num = tpl[0], tpl[1]
        tmp_df = pd.read_csv(fp, skiprows=num, nrows=600, names=['col1', 'col2'])
        X[idx] = tmp_df['col1'].values
        y[idx] = tmp_df['col2'].loc[0]
    X, y = self.__transform(X, y)  # some transformation
    return X, y
In a nutshell, every time the __getitem__ method is called, it builds a list of random tuples, each containing a CSV path and a line number. Windows of data are then loaded from the corresponding files to form a batch. This solves the out-of-memory issue, but training becomes painfully slow, with data loading as the bottleneck.
Is it possible to improve this? How else can I optimize the data loading part?
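One direction worth trying is to keep the random (file, line) sampling but move the reads into tf.data, so they can run in parallel and be prefetched while the model trains. The sketch below is only illustrative, not a drop-in fix: the window length WINDOW, the column names, and the precomputed file_paths / line_nums index are assumptions standing in for the real bookkeeping.

import numpy as np
import pandas as pd
import tensorflow as tf

WINDOW = 600  # assumed window length, matching nrows=600 above

def load_window(file_path, line_num):
    # Plain-Python loader; runs eagerly inside tf.py_function.
    fp = file_path.numpy().decode("utf-8")
    num = int(line_num.numpy())
    df = pd.read_csv(fp, skiprows=num, nrows=WINDOW, names=['col1', 'col2'])
    x = df['col1'].to_numpy(dtype=np.float32)
    y = np.float32(df['col2'].iloc[0])
    return x, y

def tf_load_window(file_path, line_num):
    x, y = tf.py_function(load_window, [file_path, line_num], [tf.float32, tf.float32])
    x.set_shape([WINDOW])
    y.set_shape([])
    return x, y

# file_paths, line_nums: a precomputed index of valid (file, start line) pairs (assumption)
ds = (tf.data.Dataset.from_tensor_slices((file_paths, line_nums))
      .shuffle(10_000)
      .map(tf_load_window, num_parallel_calls=tf.data.AUTOTUNE)
      .batch(32)
      .prefetch(tf.data.AUTOTUNE))

Parsing CSV text repeatedly is still expensive, so converting each CSV to a binary format once up front (for example one .npy file per CSV, read with np.load) usually buys more than any tf.data tuning.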

Related

How to normalize data in a tensorflow time series with respect to the last element

I am trying to follow the TensorFlow tutorial on time series forecasting. I have some temperature data that varies a lot, so I don't actually want to forecast the value itself, just how much it has changed compared to the last element.
Let's imagine that I have a time series training data point of type (input, label): ([8, 7, 5, 9, 11, 10], [12]) and I would like to convert this to a data point where everything is normalized by the last entry in the input. In this concrete example, I would like a data point that looks like this: ([0.8, 0.7, 0.5, 0.9, 1.1, 1.0], [1.2]).
Following the tutorial I am able to create a dataset of windows from the original temperature data collected at a fixed interval. This creates a dataset of tensors, however I am not allowed to modify the values in these tensors. As such, I can't just loop over all the windows in the created dataset and then normalize by dividing by the last value in the input array. Furthermore, I can't normalize the data before the dataset is created as I need the window structure. How can I normalize my data in such a way with respect to each window?
My guess is that it needs to be done after the dataset, and therefore the windows, have been created. So somewhere in this function, maybe:
def make_dataset(self, data):
    data = np.array(data, dtype=np.float32)
    ds = tf.keras.utils.timeseries_dataset_from_array(
        data=data,
        targets=None,
        sequence_length=self.total_window_size,
        sequence_stride=1,
        shuffle=True,
        batch_size=32,
    )
    ds = ds.map(self.split_window)
    return ds
One way to do this is to use a data generator; inside it you can implement whatever logic you want, such as dividing by the largest input in each window. Here is an example of how to do so:
import numpy as np

my_array = np.arange(1024)

def generate_sequence(generate_data, batch_size=16, sequence_length=8):
    list_of_x = []
    list_of_y = []
    while True:
        for i in range(len(generate_data) // sequence_length):
            born_inf = i * sequence_length
            born_sup = (i + 1) * sequence_length
            current_data = generate_data[born_inf:born_sup]
            max_n = np.max(current_data)
            list_of_x.append(current_data / max_n)
            list_of_y.append(generate_data[min(born_sup + 1, len(generate_data) - 1)] / max_n)
            if len(list_of_x) == batch_size:
                yield np.array(list_of_x), np.array(list_of_y)
                list_of_x = []
                list_of_y = []
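If you want to stay inside the tf.data pipeline from make_dataset instead of switching to a generator, a normalization step can also be applied with map after split_window. This is only a sketch, assuming the tutorial's layout where inputs have shape (batch, time, features) and labels have shape (batch, label_width, features):

import tensorflow as tf

def normalize_by_last(inputs, labels):
    # Divide each window and its label by the last time step of the inputs;
    # broadcasting handles the (batch, 1, features) divisor.
    last = inputs[:, -1:, :]
    return inputs / last, labels / last

# inside make_dataset:
# ds = ds.map(self.split_window).map(normalize_by_last)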

Perform same data augmentation in tf.data for CSV image files for both input and reference images

For the training of my model I want to perform data augmentation on the data set to improve performance. My input data instances are snapshots of wave height saved as CSV files (a 160x160 grid), and my labels/reference data have the same format.
For that reason I want to apply the same augmentation procedure to each input-label pair. However, I am not finding a good way to load my CSV data (basically just as a 160x160 tensor) into the tf.data API, so I tried a detour with pandas.
Here is some code of how I had expected it to work:
...
input_files = ['./Data/input_{}.csv'.format(i) for i in range(1, 200)]
label_files = ['./Data/label_{}.csv'.format(i) for i in range(1, 200)]

def add_offset(img):
    offset = tf.random.uniform([1], 0, 5) * tf.ones(shape)
    return img + offset

# Load a data set instance and perform the same data augmentation on input and label
def augment(file_path1, file_path2):
    img = tf.stack([load(file_path1), load(file_path2)])
    # Augment only a certain percentage
    if tf.random.uniform(1) < 0.1:
        img = add_offset(img)
    return img[0], img[1]

def load(file_path):
    # Detour with pandas' read_csv -> tf.io.read_file + decode_csv don't work for arrays
    img = pd.read_csv(file_path, header=None, delim_whitespace=False).values
    img = tf.convert_to_tensor(img)
    return img

train_data = tf.data.Dataset.from_tensor_slices((input_files, label_files))
train_data = train_data.map(lambda x, y: augment(x, y)).batch(1)
...
The error I get is that the file_path passed to pd.read_csv is not a valid file path; indeed, it appears to be some kind of tensor. Any ideas on how to load the CSV data as arrays and perform the same data augmentation on both inputs and labels?
Edit: The error message reads
ValueError: Invalid file path or buffer object type: <class 'tensorflow.python.framework.ops.Tensor'>
And here is a small part of one of my csv files. The values range in general from 0.5 to 5.
1.393020,1.389099,1.386132,1.383231,1.380271
1.394938,1.392077,1.389846,1.387650,1.385154
1.395655,1.393057,1.391297,1.389553,1.387415
1.396266,1.393928,1.392303,1.390782,1.388740
1.396814,1.394653,1.393193,1.391659,1.389615
2nd Edit: Some additional information on how I obtain my CSV files. They come from a coastal wave simulation that gives me the wave heights in a 160x160 grid/array, as seen above. I can easily open them with pandas or NumPy without any problem. I had to change my script because I have too much data to simply load it all into memory. That is why I would like to use tf.data, but I got stuck on the problem mentioned above.
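No answer is quoted here, but as a rough sketch of one possible direction (an assumption-based illustration, not an answer from the original thread): the string tensor can be handed to a NumPy loader through tf.numpy_function, and the random offset can be drawn once and added to both tensors so the input and the label receive identical augmentation. The np.loadtxt loader and the 160x160 shapes are assumptions based on the sample data above.

import numpy as np
import tensorflow as tf

def load_pair(input_path, label_path):
    # Runs eagerly inside tf.numpy_function, so the arguments arrive as plain bytes.
    x = np.loadtxt(input_path.decode('utf-8'), delimiter=',', dtype=np.float32)
    y = np.loadtxt(label_path.decode('utf-8'), delimiter=',', dtype=np.float32)
    return x, y

def load_and_augment(input_path, label_path):
    x, y = tf.numpy_function(load_pair, [input_path, label_path],
                             [tf.float32, tf.float32])
    x.set_shape([160, 160])
    y.set_shape([160, 160])
    # Add the same random offset to both tensors roughly 10% of the time.
    apply_aug = tf.random.uniform([]) < 0.1
    offset = tf.where(apply_aug, tf.random.uniform([], 0.0, 5.0), 0.0)
    return x + offset, y + offset

train_data = (tf.data.Dataset.from_tensor_slices((input_files, label_files))
              .map(load_and_augment, num_parallel_calls=tf.data.AUTOTUNE)
              .batch(1))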

How to combine many numpy arrays efficiently?

I am having difficulty loading 18k training data files for training with TensorFlow. The files are .npy files named 0.npy, 1.npy, ..., 18000.npy.
I looked around the web and came up with simple code that first reads the files in the correct sequence and then tries to concatenate the training data, but it takes forever.
import numpy as np
import glob
import re
import tensorflow as tf

print("TensorFlow version: {}".format(tf.__version__))

files = glob.glob('D:/project/train/*.npy')
files.sort(key=lambda var: [int(x) if x.isdigit() else x for x in
                            re.findall(r'[^0-9]|[0-9]+', var)])
# print(files)

final_dataset = []
i = 0
for file in files:
    dataset = np.load(file, mmap_mode='r')
    print(i)
    # print("Size of dataset: {} ".format(dataset.shape))
    if i == 0:
        final_dataset = dataset
    else:
        final_dataset = np.concatenate((final_dataset, dataset), axis=0)
    i = i + 1

print("Size of final_dataset: {} ".format(final_dataset.shape))
np.save('combined_train.npy', final_dataset)
'Combining' two arrays in any way involves (1) creating a new array with their combined size and (2) copying both arrays' contents into it. If you do this every time you load a file, it repeats 18,000 times, with the time per iteration growing each time (because final_dataset keeps getting larger).
A simple workaround is to append the arrays to a list - and then combine them all once at the end:
dataset = []
for file in files:
    data = np.load(file, mmap_mode='r')
    dataset.append(data)

final_dataset = np.concatenate(dataset, axis=0)
But beware: make sure final_dataset can actually fit in your RAM, or the program will crash. You can estimate this via ram_required = size_per_file * number_of_files. (To speed things up even further, you can look into multiprocessing, but it is not simple to get working.)
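If the combined array does not fit in RAM, one possible variant (a sketch under the assumption that all files share the same dtype and the same shape after the first axis) is to stream the files directly into a memory-mapped .npy output instead of concatenating in memory:

import numpy as np

# Pass 1: collect shapes and the common dtype without loading the data.
shapes = [np.load(f, mmap_mode='r').shape for f in files]
dtype = np.load(files[0], mmap_mode='r').dtype
out_shape = (sum(s[0] for s in shapes),) + shapes[0][1:]

# Pass 2: copy each file into a memory-mapped combined_train.npy.
out = np.lib.format.open_memmap('combined_train.npy', mode='w+',
                                dtype=dtype, shape=out_shape)
row = 0
for f, s in zip(files, shapes):
    out[row:row + s[0]] = np.load(f, mmap_mode='r')
    row += s[0]
out.flush()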

Applying a simple function to CSV and save multiple csv files

I am trying to replicate the data by multiplying every value by values drawn from a range and saving the results as CSV files.
I have created a function Replicate_Data which takes an input NumPy array and multiplies it by a random value from a range. What is the best way to create 100 files and save them as P3D1, P4D1, and so on?
def Replicate_Data(data: np.ndarray) -> np.ndarray:
    Rep_factor = random.uniform(-3, 7)
    data1 = data * Rep_factor
    return data1

P2D1 = Replicate_Data(P1D1)
np.savetxt("P2D1.csv", P2D1, delimiter=",", dtype=complex)
Here is an example you can use as reference.
I generate toy data named toy, then make n random values using np.random.uniform and call them randos, and then multiply these two objects to form out using NumPy broadcasting. You could also do this multiplication in a loop (the same loop you save in, in fact); as written here, it could be very memory-intensive depending on the size of your input array. A more complete answer probably depends on the shape of your input data.
import numpy as np

toy = np.random.random(size=(2, 2))        # a toy input array
n = 100                                    # number of random values
randos = np.random.uniform(-3, 7, size=n)  # generate 100 uniform randoms

# now multiply all elements in toy by the randoms in randos
out = toy[None, ...] * randos[..., None, None]  # this depends on the shape;
# it will work only if toy has two dimensions, otherwise it requires modification
# it will take a lot of memory... 100*toy.nbytes worth

# now save in the loop
for i, o in enumerate(out):
    name = 'P{}D1'.format(str(i + 1))
    np.savetxt(name, o, delimiter=",")

# a second way without the broadcasting (slow, better on memory:
# more like 2*toy.nbytes)
# for i, r in enumerate(randos):
#     name = 'P{}D1'.format(str(i + 1))
#     np.savetxt(name, r * toy, delimiter=",")

Is there a way to use tf.data.Dataset inside of another Dataset in Tensorflow?

I'm doing segmentation. Each training sample has multiple images with segmentation masks. I'm trying to write an input_fn that merges all mask images into one for each training sample.
I was planning on using two Datasets: one that iterates over the sample folders, and another that reads all masks as one large batch and then merges them into one tensor.
I'm getting an error when the nested make_one_shot_iterator is called. I know that this approach is a bit of a stretch and that Datasets most likely weren't designed for such usage. But then how should I approach this problem while avoiding tf.py_func?
Here is a simplified version of the dataset:
def read_sample(sample_path):
    masks_ds = (tf.data.Dataset
                .list_files(sample_path + "/masks/*.png")
                .map(tf.read_file)
                .map(lambda x: tf.image.decode_image(x, channels=1))
                .batch(1024))  # maximum number of objects
    masks = masks_ds.make_one_shot_iterator().get_next()
    return tf.reduce_max(masks, axis=0)

ds = tf.data.Dataset.from_tensor_slices(tf.glob("../input/stage1_train/*"))
ds.map(read_sample)
# ...
sample = ds.make_one_shot_iterator().get_next()
# ...
If the nested dataset has only a single element, you can use tf.contrib.data.get_single_element() on the nested dataset instead of creating an iterator:
def read_sample(sample_path):
    masks_ds = (tf.data.Dataset.list_files(sample_path + "/masks/*.png")
                .map(tf.read_file)
                .map(lambda x: tf.image.decode_image(x, channels=1))
                .batch(1024))  # maximum number of objects
    masks = tf.contrib.data.get_single_element(masks_ds)
    return tf.reduce_max(masks, axis=0)

ds = tf.data.Dataset.from_tensor_slices(tf.glob("../input/stage1_train/*"))
ds = ds.map(read_sample)
sample = ds.make_one_shot_iterator().get_next()
In addition, you can use the tf.data.Dataset.flat_map(), tf.data.Dataset.interleave(), or tf.contrib.data.parallel_interleave() transformations to perform a nested Dataset computation inside a function and flatten the result into a single Dataset. For example, to get all of the samples in a single Dataset:
def read_all_samples(sample_path):
    return (tf.data.Dataset.list_files(sample_path + "/masks/*.png")
            .map(tf.read_file)
            .map(lambda x: tf.image.decode_image(x, channels=1))
            .batch(1024))  # maximum number of objects

ds = tf.data.Dataset.from_tensor_slices(tf.glob("../input/stage1_train/*"))
ds = ds.flat_map(read_all_samples)
sample = ds.make_one_shot_iterator().get_next()
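As a small follow-up illustration (a sketch in the same TF 1.x style, not part of the original answer): tf.contrib.data.parallel_interleave() is applied through Dataset.apply() and lets several sample directories be read concurrently, with cycle_length controlling how many nested datasets are open at once.

# starting from the same ds of sample directories as above:
ds = ds.apply(tf.contrib.data.parallel_interleave(read_all_samples, cycle_length=4))
sample = ds.make_one_shot_iterator().get_next()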
