Combining Numpy Arrays Feature Matrices In Python - python

I have the data with 38910 rows and 2 columns. As its a string data, so I used two feature creation methods A and B.
Method A gives me data of numpy arrays of the shape as:
a.shape = (38910, 17, 21)
Method B gives me data of numpy arrays of the shape as:
b.shape = (38910, 16, 441)
Now, for applying Convolution Neural Network and other methods, I need to combine both the features to make a numpy array of the shape = (38910, 17, 21, 16, 441). What is the best way I can do that such that I don't face memory issues.

One one to avoid memory issues is to process the rows in batches. Assuming that you have a function combine_features(a, b) that combines the outputs of method A and method B, here's a rough outline of a solution:
a_batches = np.array_split(a, 500)
b_batches = np.array_split(b, 500)
for i, batch in enumerate(zip(a_batches, b_batches)):
a_batch, b_batch = batch
output = combine_features(a_batch, b_batch)
np.save(f"{destination_folder}/data-{i}.npy", output)
Then as you are training, you can iterate through the saved files and load one at a time.

Related

Tensorflow: Is there a way to create multiple gather() outputs and stack them parallely in a compute-and-memory-efficient manner?

I'm trying to essentially create a 3-D tensor from the indexed rows of a 2-D tensor. For example, assuming I have:
A = tensor(shape=[200, 256]) # 2-D Tensor.
Aidx = tensor(shape=[1000, 10]) # 2-D Tensor holding row indices of A for each of 1000 batches.
I wish to create:
B = tensor(shape=[1000, 10, 256]) # 3-D Tensor with each batch being of dims (10, 256) selected from A.
Right now, I'm doing this in a memory inefficient manner by doing a tf.broadcast() and then using a tf.gather(). This is very fast, but also takes up a lot of RAM:
A = tf.broadcast_to(A, [1000, A.shape[0], A.shape[1]])
A = tf.gather(A, Aidx, axis=1, batch_dims=1)
Is there a more memory efficient way of doing the above operation? Naively, one can make use of a for loop, but that is very compute inefficient for my use case. Thanks in advance!
You have to extract 10,000 rows correct? (10 rows 1000 different times)
make this [1000, 10] array into 1 dimensional array [10000] with reshape
See this answer
How to fetch specific rows from a tensor in Tensorflow?
This will give you output [10000, 256]
Then reshape the output into your final form. [1000, 10, 256]
I haven't tried it.

Pytorch batch row-wise application of function

I would like to figure out a way to apply a function which calculates pairwise distances, let's call it dists(A, B), row-wise for every input element in a batch, meaning:
(100, 16, 3) -- input, 100 is the batch size so 100 instances, 16 is let's say image size, and 3 filters (asking for Conv2D)
(5, 3) -- tensor for which I want to calculate the row-wise distance (assume it's A in dists(A, B) and is fixed)
Now, for every instance I am supposed to get back a matrix of shape (5, 16). Naturally, I could use a for to span the batch and get my final (100,5,16) result. However, I would love to know if there is an easier way to apply my function row-wise, in parallel, using GPU.
Thank you very much for your time.
Suppose we are using the L1 distance:
import torch
# data and target
a = torch.randn(100, 16, 3)
b = torch.randn(5, 3)
# Reshape the tensors
a = a.unsqueeze(1)
b = b.unsqueeze(0).unsqueeze(2)
print(a.shape, b.shape)
# Compute distance
dist = (a-b).abs().sum(3)
print(dist.shape)

Fast and efficient way of serializing and retrieving a large number of numpy arrays from HDF5 file

I have a huge list of numpy arrays, specifically 113287, where each array is of shape 36 x 2048. In terms of memory, this amounts to 32 Gigabytes.
As of now, I have serialized these arrays as a giant HDF5 file. Now, the problem is that retrieving individual arrays from this hdf5 file takes excruciatingly long time (north of 10 mins) for each access.
How can I speed this up? This is very important for my implementation since I have to index into this list several thousand times for feeding into Deep Neural Networks.
Here's how I index into hdf5 file:
In [1]: import h5py
In [2]: hf = h5py.File('train_ids.hdf5', 'r')
In [5]: list(hf.keys())[0]
Out[5]: 'img_feats'
In [6]: group_key = list(hf.keys())[0]
In [7]: hf[group_key]
Out[7]: <HDF5 dataset "img_feats": shape (113287, 36, 2048), type "<f4">
# this is where it takes very very long time
In [8]: list(hf[group_key])[-1].shape
Out[8]: (36, 2048)
Any ideas where I can speed things up? Is there any other way of serializing these arrays for faster access?
Note: I'm using a Python list since I want the order to be preserved (i.e. to retrieve in the same order as I put it when I created the hdf5 file)
According to Out[7], "img_feats" is a large 3d array. (113287, 36, 2048) shape.
Define ds as the dataset (doesn't load anything):
ds = hf[group_key]
x = ds[0] # should be a (36, 2048) array
arr = ds[:] # should load the whole dataset into memory.
arr = ds[:n] # load a subset, slice
According to h5py-reading-writing-data :
HDF5 datasets re-use the NumPy slicing syntax to read and write to the file. Slice specifications are translated directly to HDF5 “hyperslab” selections, and are a fast and efficient way to access data in the file.
I don't see any point in wrapping that in list(); that is, in splitting the 3d array in a list of 113287 2d arrays. There's a clean mapping between 3d datasets on the HDF5 file and numpy arrays.
h5py-fancy-indexing warns that fancy indexing of a dataset is slower. That is, seeking to load, say [1, 1000, 3000, 6000] subarrays of that large dataset.
You might want to experiment with writing and reading some smaller datasets if working with this large one is too confusing.
One way would be to put each sample into its own group and index directly into those. I am thinking the conversion takes long because it tries to load the entire data set into a list (which it has to read from disk). Re-organizing the h5 file such that
group
sample
36 x 2048
may help in indexing speed.

assigning different weights to every numpy column

I have the following numpy array:
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize
import numpy as np
# NumPy array comprising associate metrics
# i.e. Open TA's, Open SR's, Open SE's
associateMetrics = np.array([[11, 28, 21],
[27, 17, 20],
[19, 31, 3],
[17, 24, 17]]).astype(np.float64)
print("raw metrics=", associateMetrics)
Now, I want to assign different weights to every column in the above array & later normalize this. For eg. lets say i want to assign higher weight to 1st column by multiplying by 5, multiple column 2 by 3 and the last column by 2.
How do i do this in python? Sorry a bit new to python and numpy.
I have tried this for just 1 column but it wont work:
# Assign weights to metrics
weightedMetrics = associateMetrics
np.multiply(2, weightedMetrics[:,0])
print("weighted metrics=", weightedMetrics)
You should make use of numpy's array broadcasting. This means that lower-dimensional arrays can be automatically expanded to perform a vectorized operation with an array of higher (but compatible) dimensions. In your specific case, you can multiply your (4,3)-shaped array with a 1d weight array of shape (3,) and obtain what you want:
weightedMetrics = associateMetrics * np.array([5,3,2])
The trick is that you can imagine numpy ndarrays to have leading singleton dimensions, along which broadcasting is automatic. By this I mean that your 1d numpy weight array of shape (3,) can be thought to have a leading singleton dimension (but only from the point of view of broadcasting!). And it's easy to see how the array of shape (4,3) and (1,3) should be multiplied: each element of the latter has to be used for full columns of the former.
In the very general case, you can even use arithmetic operations on, say, an array of shape (3,1,3,1,4) and one of shape (2,3,4,4). What's important that dimensions that meet should either agree, or one of the arrays should have a singleton dimension at that place, and one of the arrays is allowed to be longer (in the front).
i found my answer. This is what i used:
print("weighted metrics=", np.multiply([ 1, 2, 3], associateMetrics))

How to create numpy ndarray from numpy ndarrays?

I used the MNIST dataset for training a neural network, where the training data is returned as a tuple with two entries. The first entry contains the actual training images. This is a numpy ndarray with 50,000 entries. Each entry is, in turn, a numpy ndarray with 784 values, representing the 28 * 28 = 784 pixels in a single MNIST image.
I would like to create a new training set, however I do not know how to create an ndarray from other ndarrays. For instance, if I have the following two ndarrays:
a = np.ndarray((3,1), buffer=np.array([0.9,1.0,1.0]), dtype=float)
b = np.ndarray((3,1), buffer=np.array([0.8,1.0,1.0]), dtype=float)
how to make a third one containing these two?
I tried the following but it creates only one entry.
c = np.ndarray((1,6,1), buffer=np.array(([a],[b])), dtype=float)
I would need it to be two entries.
Thanks, in the meanwhile I figured out it is simply:
c = np.array((a, b))

Categories

Resources