How to combine many numpy arrays efficiently? - python

I am having difficulty trying to load 18k files of training data for training with TensorFlow. The files are .npy files named as such: 0.npy, 1.npy...18000.npy.
I looked around the web and came up with simple code that first reads the files in the correct sequence and then tries to concatenate the training data together, but it takes forever.
import numpy as np
import glob
import re
import tensorflow as tf

print("TensorFlow version: {}".format(tf.__version__))

files = glob.glob('D:/project/train/*.npy')
files.sort(key=lambda var: [int(x) if x.isdigit() else x
                            for x in re.findall(r'[^0-9]|[0-9]+', var)])
# print(files)

final_dataset = []
i = 0
for file in files:
    dataset = np.load(file, mmap_mode='r')
    print(i)
    # print("Size of dataset: {} ".format(dataset.shape))
    if i == 0:
        final_dataset = dataset
    else:
        final_dataset = np.concatenate((final_dataset, dataset), axis=0)
    i = i + 1

print("Size of final_dataset: {} ".format(final_dataset.shape))
np.save('combined_train.npy', final_dataset)

'Combining' two arrays in any way involves (1) creating a new array with the two arrays' total size, and (2) copying their contents into that array. If you do this each time you load an array, it repeats 18000 times, with the time per iteration growing at each iteration (due to the ever-larger final_dataset).
A simple workaround is to append the arrays to a list - and then combine them all once at the end:
dataset = []
for file in files:
    data = np.load(file, mmap_mode='r')
    dataset.append(data)

final_dataset = np.concatenate(dataset, axis=0)
But beware: make sure final_dataset can actually fit in your RAM, else the program will crash. You can find out via ram_required = size_per_file * number_of_files. Relevant SO. (To speed things up even further, you can look into multiprocessing, but it is not simple to get working.)
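As a rough illustration of that check (a sketch, not part of the original answer; it assumes the psutil package is available and that the .npy files are mostly raw array data):

import glob
import os

import psutil  # assumed available; `pip install psutil` otherwise

files = glob.glob('D:/project/train/*.npy')
# .npy files are mostly raw array bytes, so total file size approximates the RAM needed
ram_required = sum(os.path.getsize(f) for f in files)
ram_available = psutil.virtual_memory().available

print("need ~{:.1f} GB, have ~{:.1f} GB free".format(ram_required / 1e9, ram_available / 1e9))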

Related

Optimizing data pipeline for a large time series dataset

I have the following file structure for my time-series dataset.
directory 1, directory 2, ...
Each directory contains a variable number of CSV files. Each CSV file holds (time, X, y) triples and has a varying length (1 day or 3 days, etc.).
Modelling process/requirement
I am training a single-output regression model which takes an N-dimensional vector as input, and the data is fed to the model in a sliding-window fashion.
Problem
With a smaller number of CSV files, tf.keras.utils.timeseries_dataset_from_array works fine (I merge all the files into one long vector beforehand). However, with a large number of CSV files this solution fails because it consumes all of the memory.
Following this, I tried to create a custom data generator; for the sake of simplicity I have skipped the unimportant code:
def __getitem__(self, idx):
    random_tpls = []
    while len(random_tpls) < self.batch_size:
        ...
        # file_path: randomly chosen csv path
        # rand_line_num: a randomly chosen line number in the csv file
        tmp_tpl = (file_path, rand_line_num)
        random_tpls.append(tmp_tpl)
    X, y = self.__data_generation(random_tpls=random_tpls)
    return X, y

def __data_generation(self, random_tpls):
    ...
    for idx, tpl in enumerate(random_tpls):
        fp, num = tpl[0], tpl[1]
        tmp_df = pd.read_csv(fp, skiprows=num, nrows=600, names=['col1', 'col2'])
        X[idx] = tmp_df['col1'].values
        y[idx] = tmp_df['col2'].loc[0]
    X, y = self.__transform(X, y)  # some transformation
    return X, y
In a nutshell, every time the __getitem__ method is called, it creates a list of random tuples, each containing a CSV path and a line number. Windows of data are then loaded from the corresponding files to form a batch. This solves the out-of-memory issue, but the training process becomes painfully slow, with data loading as the bottleneck.
Is it possible to improve this? How else can I optimize the data loading part?
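One possible direction, not from the original thread, is to wrap such a generator in a tf.data pipeline so that CSV reading overlaps with training. The sketch below assumes TensorFlow 2.4+, a window length of 600 and float32 data; the shapes and dtypes would need adjusting to the real dataset:

import tensorflow as tf

def wrap_in_tf_data(sequence, window=600):
    # `sequence` is assumed to be the custom generator above (it has __len__ and __getitem__)
    def gen():
        for i in range(len(sequence)):
            yield sequence[i]

    ds = tf.data.Dataset.from_generator(
        gen,
        output_signature=(
            tf.TensorSpec(shape=(None, window), dtype=tf.float32),  # X batch (assumed shape)
            tf.TensorSpec(shape=(None,), dtype=tf.float32),         # y batch (assumed shape)
        ),
    )
    # prefetch lets the next batch be read from disk while the current one is being trained on
    return ds.prefetch(tf.data.AUTOTUNE)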

Background isolation from complementing images

I'm trying to isolate the background from multiple images that each have something different overlapping it.
The images I have are listed individually here: https://imgur.com/a/Htno7lm
but there is a preview of all 6 of them combined here:
I want to do this over a sequence of images, as if I were reading a video feed and processing the last frames to isolate the background, like this:
import os
import cv2

first = True
bwand = None

for filename in os.listdir('images'):
    curImage = cv2.imread('images/%s' % filename)
    if first:
        first = False
        bwand = curImage
        continue
    bwand = cv2.bitwise_and(bwand, curImage)

cv2.imwrite("and.png", bwand)
With this code I keep accumulating into my buffer with bitwise operations, but the result I get is not what I'm looking for:
Bitwise and:
Incrementally adding to a buffer is the best approach for me in terms of video filtering and performance, but if I treat the frames as a list instead, I can look for the median value like so:
import os
import cv2
import numpy as np

sequence = []
for filename in os.listdir('images'):
    curImage = cv2.imread('images/%s' % filename)
    sequence.append(curImage)

imgs = np.asarray(sequence)
median = np.median(imgs, axis=0)
cv2.imwrite("res.png", median)
which results in:
This is still not perfect, because I'm taking the median value; if I looked for the mode value instead, performance would decrease significantly.
Is there an approach that works as a buffer, like the first alternative, but gives the best result with good performance?
--Edit
As suggested by @Christoph Rackwitz, I used OpenCV's background subtractor. It works as a buffer, which is one of the requested features, but the result is not the most pleasant:
code:
import os
import cv2

mog = cv2.createBackgroundSubtractorMOG2()

for filename in os.listdir('images'):
    curImage = cv2.imread('images/%s' % filename)
    mog.apply(curImage)

x = mog.getBackgroundImage()
cv2.imwrite("res.png", x)
Since scipy.stats.mode takes ages to do its thing, I did the same manually:
- calculate a histogram (for every channel of every pixel of every row of every image)
- argmax gets the mode
- reshape and cast
Still not video speed but oh well. numba can probably speed this up.
import cv2 as cv
import numpy as np

filenames = ...
assert len(filenames) < 256, "need larger dtype for histogram"

stack = np.array([cv.imread(fname) for fname in filenames])

sheet = stack[0]
hist = np.zeros((sheet.size, 256), dtype=np.uint8)
index = np.arange(sheet.size)

for sheet in stack:
    hist[index, sheet.flat] += 1

result = np.argmax(hist, axis=1).astype(np.uint8).reshape(sheet.shape)
del hist  # because it's huge

cv.imshow("result", result); cv.waitKey()
And if I didn't use histograms and extensive amounts of memory, but a fixed number of sheets and data access that's cache-friendly, it could likely be even faster.
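As a rough sketch of the numba remark above (assuming the whole uint8 stack fits in memory), a jitted per-pixel mode could look like this:

import numpy as np
from numba import njit

@njit(cache=True)
def per_pixel_mode(stack):
    # stack: (n_images, height, width, channels), dtype uint8
    n, h, w, c = stack.shape
    out = np.empty((h, w, c), dtype=np.uint8)
    hist = np.zeros(256, dtype=np.int32)
    for i in range(h):
        for j in range(w):
            for ch in range(c):
                hist[:] = 0
                for k in range(n):
                    hist[stack[k, i, j, ch]] += 1
                out[i, j, ch] = np.argmax(hist)  # most frequent value = mode
    return out

# result = per_pixel_mode(stack)  # stack as built in the snippet above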

How to calculate user-similarity matrix in a more efficient manner?

I have a set of 10 users, each with their own folder/directory, containing 25-30 images shared by them (on some social media, say). I want to calculate the similarities between the users based on the images they shared.
For that, I use a feature extractor to convert each image into a 224x224x3 array, then loop through each user and each of the images in their folders to find the cosine similarity between each pair of images, and then take the average of all those pairwise image similarities for each pair of users to get the user similarity. (Please let me know if there's some mistake in this logic, by the way.)
My code to do all this is as follows:
from tensorflow.keras.applications.imagenet_utils import preprocess_input
from tensorflow.keras.applications import vgg16
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model
import os
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# load the model
vgg_model = vgg16.VGG16(weights='imagenet')
# remove the last layers in order to get features instead of predictions
feat_extractor = Model(inputs=vgg_model.input, outputs=vgg_model.get_layer("fc2").output)

def processed_image(image):
    original = load_img(image, target_size=(224, 224))
    numpy_image = img_to_array(original)
    image_batch = np.expand_dims(numpy_image, axis=0)
    processed_image = preprocess_input(image_batch.copy())
    img_features = feat_extractor.predict(processed_image)
    return img_features

def image_similarity(image1, image2):
    image1 = processed_image(image1)
    image2 = processed_image(image2)
    sim = cosine_similarity(image1, image2)
    return sim[0][0]

user_list = ['User ' + str(i) for i in range(1, 11)]
user_sim_df = pd.DataFrame(columns=user_list, index=user_list)

for user1 in user_list:
    for user2 in user_list:
        sum_img_sim = 0
        # imgs_path is assumed to be defined elsewhere, e.g. imgs_path = 'All_Users/' + user1 + '/'
        user1_files = [imgs_path + x for x in os.listdir('All_Users/' + user1) if "jpg" in x]
        user2_files = [imgs_path + x for x in os.listdir('All_Users/' + user2) if "jpg" in x]
        for image1 in user1_files:
            for image2 in user2_files:
                sum_img_sim += image_similarity(image1, image2)
        user_sim_df[user1][user2] = 2 * sum_img_sim / (len(user1_files) + len(user2_files))
Now, because there are 4 nested for loops involved in calculating the user similarity matrix, the code takes a long time to run (it has been running for more than 30 minutes as of typing this question, for 10 users with 25-30 images each).
So, how do I rewrite the last portion of this to make the code run faster?
Nested for loops are particularly bad in Python, but some work can be done here to improve things.
First of all, you are doing the work twice in the comparisons: user_sim_df[user_i][user_j] has the same value as user_sim_df[user_j][user_i] for every pair i, j, so you could reuse the already-calculated values instead of computing them again in later iterations. Besides this, is computing the values on the diagonal (user_sim_df[user_i][user_i]) necessary for your application?
These simple changes will cut the execution time in half. Is that enough? Maybe not. Further lines of improvement:
- The img_to_array() operation is applied to every image many times (once for every similarity calculation with another image). Is it a bottleneck? If so, performance could also improve if you first run a single loop over all images and save the preprocessed features output by the TensorFlow model to files that numpy can load later, e.g. with numpy.save() (see the sketch after this list).
- If you're using the standard Python interpreter, switching to PyPy can help (in general). You could also try adapting the code to consist only of operations on numpy structures (e.g. adapt the pandas parts) and use Numba in a way similar to this SO link. Using Numba you can also benefit from parallelism. See some practical guidelines here.
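To make the first two points concrete, here is a minimal sketch (not the answer's own code) that reuses processed_image(), user_list and user_sim_df from the question and keeps its averaging formula:

import itertools
import os

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# 1) extract every image's feature vector exactly once
user_feats = {}
for user in user_list:
    folder = 'All_Users/' + user
    images = [os.path.join(folder, x) for x in os.listdir(folder) if "jpg" in x]
    user_feats[user] = np.vstack([processed_image(img) for img in images])

# 2) compute each user pair only once and mirror it across the diagonal
for user1, user2 in itertools.combinations_with_replacement(user_list, 2):
    sims = cosine_similarity(user_feats[user1], user_feats[user2])  # all image pairs at once
    value = 2 * sims.sum() / (sims.shape[0] + sims.shape[1])  # same formula as the question
    user_sim_df.loc[user1, user2] = value
    user_sim_df.loc[user2, user1] = value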

Applying a simple function to CSV and save multiple csv files

I am trying to replicate the data by multiplying every value by a value from a range and saving the results as CSV files.
I have created a function Replicate_Data which takes an input numpy array and multiplies it by a random value from a range. What is the best way to create 100 files and save them as P3D1, P4D1 and so on?
import random
import numpy as np

def Replicate_Data(data: np.ndarray) -> np.ndarray:
    Rep_factor = random.uniform(-3, 7)
    data1 = data * Rep_factor
    return data1

P2D1 = Replicate_Data(P1D1)
# note: np.savetxt has no dtype parameter; output formatting is controlled via the fmt argument
np.savetxt("P2D1.csv", P2D1, delimiter=",", dtype=complex)
Here is an example you can use as reference.
I generate toy data named toy, then make n random values using np.random.uniform and call them randos, then multiply these two objects to form out using numpy broadcasting. You could also do this multiplication in a loop (the same one you save in, in fact); as I've written it, it could be very memory intensive depending on the size of your input array. A more complete answer probably depends on the shape of your input data.
import numpy as np

toy = np.random.random(size=(2, 2))        # a toy input array
n = 100                                    # number of random values
randos = np.random.uniform(-3, 7, size=n)  # generate 100 uniform randoms

# now multiply all elements in toy by the randoms in randos
out = toy[None, ...] * randos[..., None, None]  # this depends on the shape.
# this will work only if toy has two dimensions. Otherwise requires modification
# it will take a lot of memory... 100*toy.nbytes worth

# now save in the loop..
for i, o in enumerate(out):
    name = 'P{}D1'.format(str(i + 1))
    np.savetxt(name, o, delimiter=",")

# a second way without the broadcasting (slow, better on memory)
# more like 2*toy.nbytes
# for i, r in enumerate(randos):
#     name = 'P{}D1'.format(str(i + 1))
#     np.savetxt(name, r*toy, delimiter=",")

knn search using HDF5

I'm trying to do knn search on big data with limited memory.
I'm using HDF5 and python.
I tried brute-force linear search (using pytables) and kd-tree search (using sklearn).
Surprisingly, the kd-tree method takes more time (maybe the kd-tree would work better if we increased the batch size? But I don't know the optimal size, and it is also limited by memory).
Now I'm looking for ways to speed up the calculations. I think the HDF5 file can be tuned for an individual PC, and the norm calculation could perhaps be sped up using numexpr or some Python tricks.
import numpy as np
import time
import tables
import cProfile
from sklearn.neighbors import NearestNeighbors

rows = 10000
cols = 1000
batches = 100
k = 10

# USING HDF5
vec = np.random.rand(1, cols)
data = np.random.rand(rows, cols)
fileName = 'C:\carray1.h5'
shape = (rows * batches, cols)  # predefined size
atom = tables.UInt8Atom()  # ?
filters = tables.Filters(complevel=5, complib='zlib')  # ?

# create
# h5f = tables.open_file(fileName, 'w')
# ca = h5f.create_carray(h5f.root, 'carray', atom, shape, filters=filters)
# for i in range(batches):
#     ca[i*rows:(i+1)*rows] = data[:] + i  # +i to modify data
# h5f.close()

# can be parallel?
def test_bruteforce_knn():
    h5f = tables.open_file(fileName)
    t0 = time.time()
    d = np.empty((rows * batches,))
    for i in range(batches):
        d[i*rows:(i+1)*rows] = ((h5f.root.carray[i*rows:(i+1)*rows] - vec)**2).sum(axis=1)
    print(time.time() - t0)
    ndx = d.argsort()
    print(ndx[:k])
    h5f.close()

def test_tree_knn():
    h5f = tables.open_file(fileName)

    # it will not work
    # t0 = time.time()
    # nbrs = NearestNeighbors(n_neighbors=k, algorithm='ball_tree').fit(h5f.root.carray)
    # distances, indices = nbrs.kneighbors(vec)
    # print(time.time() - t0)

    # need to concatenate distances, indices somehow
    t0 = time.time()
    d = np.empty((rows * batches,))
    for i in range(batches):
        nbrs = NearestNeighbors(n_neighbors=k, algorithm='ball_tree').fit(h5f.root.carray[i*rows:(i+1)*rows])
        distances, indices = nbrs.kneighbors(vec)  # put in dict?
        # d[i*rows:(i+1)*rows] =
    print(time.time() - t0)
    # ndx = d.argsort()
    # print(ndx[:k])
    h5f.close()

cProfile.run('test_bruteforce_knn()')
cProfile.run('test_tree_knn()')
If I understand correctly, your data has 1000 dimensions? If that is the case, it's expected that a kd-tree won't fare well, as it suffers from the curse of dimensionality.
You might want to have a look at approximate nearest neighbour (ANN) search methods instead. For instance, have a look at FLANN.
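For instance, a minimal sketch with pyflann (FLANN's Python binding) could look like the following; it assumes pyflann is installed, and the exact tuning keywords may differ between FLANN versions:

import numpy as np
from pyflann import FLANN

rows, cols, k = 10000, 1000, 10
data = np.random.rand(rows, cols)  # in practice this would come from the HDF5 carray
query = np.random.rand(1, cols)

flann = FLANN()
# build a randomized kd-tree forest and return the k approximate nearest neighbours
indices, dists = flann.nn(data, query, num_neighbors=k,
                          algorithm="kdtree", trees=8, checks=64)
print(indices[0], dists[0])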
