As I am learning to write neural networks with Python, I have just written the following linear association network that takes K input vectors x_1, ..., x_K, each of length L, and K output vectors, each of length N, and finds optimal weights using gradient descent.
As the calculation time explodes really quickly when K, L and N grow, I was searching for ways to speed this up. I discovered cupy, but cupy is much, much slower than numpy in this case. Why would this be? When changing the code to the cupy variation, I did nothing but substitute every np with cp, having imported cupy as cp.
I have also used f = njit()(ManyAssociations.fit), but then I had to return W in fit instead of writing ManyAssociations.weights = W. Is there any way to use njit inside the class, or, apart from that, is there a better way to use numba/cuda? It turns out to be much quicker after "warming up" with a first function call, but it still reaches its limit with vectors of the mentioned shapes at around K = L = N = 9.
What are some other good ways to speed up code like the one below? Could I be writing it more efficiently? Could I be using the GPU better?
import numpy as np

class ManyAssociations:

    def fit(x_train, y_train, learning_rate, tol):
        L_L = x_train.shape[1]
        L_N = y_train.shape[1]
        W = np.zeros((L_N, L_L))
        for n in range(L_N):
            learning = True
            w = np.random.rand(L_L)
            while learning:
                delta = (x_train @ w - y_train[:, n])
                grad_E = delta @ x_train
                w = w - learning_rate * grad_E
                if (grad_E @ grad_E) < tol:
                    W[n] = w
                    learning = False
        ManyAssociations.weights = W

    def predict(x_pred, W):
        preds = []
        for k in range(x_pred.shape[0]):
            preds.append(W @ x_pred[k])
        return np.array(preds)
I discovered cupy, but cupy is much, much slower than numpy in this case. Why would this be?
Computations on a GPU are split into basic, computationally intensive building blocks called kernels. Kernels are submitted to the GPU by the CPU. Each kernel call takes some time: the CPU has to communicate with the GPU, often over the relatively slow PCI interconnect (and the two have to be synchronized), allocations have to be made on the GPU so that the resulting data can be written, etc. CuPy naively transforms each basic NumPy instruction into a GPU kernel. Since your loop executes a lot of small kernels, the overhead is huge. So if you want your code to be faster on the GPU with CuPy, you either need to work on huge chunks of data or write your own kernel directly (which is hard, since GPUs are quite complex).
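For instance, a rough sketch (not the question's exact algorithm, with made-up sizes) of updating the whole weight matrix at once, so that each iteration launches only a few large kernels instead of many tiny ones:

import cupy as cp

K, L, N = 512, 256, 256                        # CuPy only pays off for large arrays
x = cp.random.rand(K, L).astype(cp.float32)
y = cp.random.rand(K, N).astype(cp.float32)
W = cp.zeros((N, L), dtype=cp.float32)
lr = 1e-5
for _ in range(1000):
    delta = x @ W.T - y                        # (K, N): one big GEMM kernel
    grad = delta.T @ x                         # (N, L): one big GEMM kernel
    W -= lr * grad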
Is there any way to use njit inside of the class or apart from that is there a better way to use numba/cuda?
You can use @jitclass. You can find more information in the documentation.
Moreover, you can take advantage of parallelism to speed your code up. To do that, you can for example replace range with prange and pass parallel=True to Numba's njit, as in the sketch below. You can find more information here.
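For illustration, a possible sketch (written as a free function rather than a method, and assuming x_train and y_train are float arrays) of the question's fit loop compiled with njit and parallelized over the output units:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def fit(x_train, y_train, learning_rate, tol):
    L_L = x_train.shape[1]
    L_N = y_train.shape[1]
    W = np.zeros((L_N, L_L))
    for n in prange(L_N):                      # the outputs are independent, so parallelize over them
        w = np.random.rand(L_L)
        while True:
            delta = x_train @ w - y_train[:, n]
            grad_E = delta @ x_train
            w = w - learning_rate * grad_E
            if grad_E @ grad_E < tol:
                break
        W[n] = w
    return W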
What are some other good ways to speed up code like the below one? Could I be writing more efficiently? Could I be using the GPU better?
Neural networks are generally very computationally intensive. Numba should be quite good for getting reasonably high performance. But if you want really fast code, you will either need to use a higher-level library or get your hands dirty and rewrite what those libraries do yourself (likely with much lower-level code).
The standard way to work with neural networks is to use dedicated libraries like TensorFlow, PyTorch or Keras. AFAIK, the former is flexible and highly optimized, although it is a bit lower-level than the others.
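As a rough illustration only (sizes and hyperparameters are made up), the same linear associator could be written as a single Linear layer in PyTorch and trained with gradient descent, running on the GPU when one is available:

import torch

K, L, N = 32, 9, 9                                  # made-up sizes
x = torch.rand(K, L)
y = torch.rand(K, N)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(L, N, bias=False).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

x, y = x.to(device), y.to(device)
for _ in range(5000):                               # full-batch gradient descent
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()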
Related
The documentation for JAX says,
Not all JAX code can be JIT compiled, as it requires array shapes to be static & known at compile time.
Now I am somewhat surprised, because TensorFlow has operations like tf.boolean_mask that do what JAX seems incapable of doing when compiled.
Why is there such a regression from TensorFlow? I was under the assumption that the underlying XLA representation was shared between the two frameworks, but I may be mistaken. I don't recall TensorFlow ever having trouble with dynamic shapes, and functions such as tf.boolean_mask have been around forever.
Can we expect this gap to close in the future? If not, what makes it impossible to do in JAX's jit what TensorFlow (among others) enables?
EDIT
The gradient passes through tf.boolean_mask (obviously not on mask values, which are discrete); case in point here using TF1-style graphs where values are unknown, so TF cannot rely on them:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
x1 = tf.placeholder(tf.float32, (3,))
x2 = tf.placeholder(tf.float32, (3,))
y = tf.boolean_mask(x1, x2 > 0)
print(y.shape) # prints "(?,)"
dydx1, dydx2 = tf.gradients(y, [x1, x2])
assert dydx1 is not None and dydx2 is None
Currently, you can't (as discussed here)
This is not a limitation of JAX jit vs TensorFlow, but a limitation of XLA or rather how the two compile.
JAX simply uses XLA to compile the function, and XLA needs to know the static shapes. That's an inherent design choice within XLA.
TensorFlow uses tf.function: this creates a graph which can have shapes that are not statically known. This is not as efficient as using XLA, but still fine. However, tf.function offers the option jit_compile, which will compile the graph inside the function with XLA. While this often gives a decent speedup (for free), it comes with restrictions: shapes need to be statically known (surprise, surprise, ...).
This is overall not too surprising behavior: computations are in general faster (given a decent optimizer went over them) the more is known up front, because more parameters (memory layout, ...) can be optimally scheduled. The less is known, the slower the code (at the far end of that scale is plain Python).
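A minimal sketch of the jit_compile point (the exact error behavior depends on the TF/XLA version):

import tensorflow as tf

@tf.function
def masked_sum(x, m):
    # plain graph mode: a data-dependent output shape is fine
    return tf.reduce_sum(tf.boolean_mask(x, m))

@tf.function(jit_compile=True)
def masked_sum_xla(x, m):
    # compiled with XLA: ops whose output shape depends on the data,
    # like boolean_mask, can fail here because XLA wants static shapes
    return tf.reduce_sum(tf.boolean_mask(x, m))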
I don't think JAX is any more incapable of doing this than TensorFlow. Nothing forbids you from doing this in JAX:
new_array = my_array[mask]
However, mask should contain indices (integers) and not booleans. This way, JAX is aware of the shape of new_array (the same as mask's). In that sense, I'm pretty sure that tf.boolean_mask is not differentiable, i.e. it will raise an error if you try to compute its gradient at some point.
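For instance, something like this works under jit, because the shape of idx is static:

import jax
import jax.numpy as jnp

@jax.jit
def select(x, idx):
    # idx has a fixed shape, so the shape of the result is known at trace time
    return x[idx]

x = jnp.arange(10.0)
idx = jnp.array([1, 3, 4])        # integer indices instead of a boolean mask
print(select(x, idx))             # [1. 3. 4.]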
More generally, if you need to mask an array, whatever library you are using, there are two approaches:
if you know in advance which indices need to be selected, provide these indices so that the library can compute the shape before compilation;
if you can't define these indices, for whatever reason, then you need to design your code so that the padding does not affect your result.
Examples for each situation
Let's say you're writing a simple embedding layer in JAX. The input is a batch of token indices corresponding to several sentences. To get the word embeddings corresponding to these indices, I will simply write word_embeddings = embeddings[input]. Since I don't know the length of the sentences in advance, I need to pad all token sequences to the same length beforehand, so that input is of shape (number_of_sentences, sentence_max_length). Now, JAX will recompile the masking operation every time this shape changes. To minimize the number of compilations, you can provide the same number of sentences (also called batch size) every time and set sentence_max_length to the maximum sentence length in the entire corpus. This way, there will be only one compilation during training. Of course, you need to reserve one row in embeddings that corresponds to the pad index. But still, the masking works.
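A tiny sketch of that setup, with made-up sizes:

import jax.numpy as jnp

vocab_size, embed_dim, pad_index = 100, 8, 0      # made-up sizes
embeddings = jnp.ones((vocab_size, embed_dim))    # row 0 reserved for the pad token

# two sentences padded to the same length with the pad index
input = jnp.array([[5, 7, 2, 0, 0],
                   [3, 9, 4, 8, 1]])
word_embeddings = embeddings[input]               # shape (2, 5, 8), static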
Later in the model, let's say you want to express each word of each sentence as a weighted average of all the other words in the sentence (like a self-attention mechanism). The weights are computed in parallel for the entire batch and are stored in a matrix A of shape (number_of_sentences, sentence_max_length, sentence_max_length). The weighted averages are computed with the formula A @ word_embeddings. Now, you need to make sure the pad tokens don't affect this formula. To do so, you can zero out the entries of A corresponding to the pad indices to remove their influence from the averaging. If the pad token index is 0, you would do:
mask = jnp.array(input > 0, dtype=jnp.float32)
A = A * mask[:, jnp.newaxis, :]
weighted_mean = A @ word_embeddings
So here we used a boolean mask, but the masking is still differentiable in some sense, since we multiply the mask with another matrix instead of using it as an index. Note that we should proceed the same way to remove the rows of weighted_mean that correspond to pad tokens.
I am going through Andrew Ng’s tutorial from the CS230 Stanford course, and in every epoch of the training, evaluation is performed by calculating the metrics.
But before calculating the metrics, they are sending the batches to CPU and converting them to numpy arrays (code here).
# extract data from torch Variable, move to cpu, convert to numpy arrays
output_batch = output_batch.data.cpu().numpy()
labels_batch = labels_batch.data.cpu().numpy()
# compute all metrics on this batch
summary_batch = {metric: metrics[metric](output_batch, labels_batch) for metric in metrics}
My question is: why do they do that? Why don’t they just calculate the metrics (which is done here) on GPU using torch methods (e.g. torch.sum as opposed to np.sum)?
I would think that GPU to CPU transfers would slow things down, so there should be a very good reason for doing them?
I am new to PyTorch so I might be missing something.
Correct me if I'm wrong. Sending the data back to the CPU reduces the GPU load, even though that memory is replaced on the next loop iteration anyway. Furthermore, I believe converting to NumPy has the advantage of freeing memory, since you are detaching your data from the computation graph: you end up manipulating labels_batch.cpu().numpy(), a plain fixed array, versus labels_batch, a tensor attached to the entire network through linked backward_fn callbacks.
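For comparison, a hypothetical sketch of a metric computed entirely with torch ops on the GPU, detaching from the graph without converting to NumPy:

import torch

def accuracy_on_gpu(output_batch, labels_batch):
    # detach() cuts the tensors out of the autograd graph without leaving the GPU;
    # only the final scalar is moved to the CPU via .item()
    preds = output_batch.detach().argmax(dim=1)
    return (preds == labels_batch).float().mean().item()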
I am benchmarking knn with sklearn. Here is sys info.
sys info
Intel(R) Xeon(R) L5640 (6 cores 12 siblings);
Ubuntu 18.04, Python 3.7.3, numpy 1.16.4, sklearn 0.21.2;
There are no other jobs/tasks occupying the CPU cores.
dataset
The benchmark runs on sklearn's digits dataset (the small MNIST-like set), which has 1797 samples, 10 classes and 8*8 dimensionality, with integer feature values from 0 to 16.
Each square in the sample image stands for one pixel, 8*8 pixels in total, and each pixel value ranges from 0 to 16.
code
here is the code.
snippet_1:
n_neighbors=5; n_jobs=1; algorithm = 'brute'
model = KNeighborsClassifier(n_neighbors=n_neighbors, n_jobs=n_jobs, algorithm = algorithm)
model.fit(trainData, trainLabels)
predictions = model.predict(testData)
takes about 0.1s
snippet_2:
n_neighbors=5; n_jobs=1; algorithm = 'kd_tree'
model = KNeighborsClassifier(n_neighbors=n_neighbors, n_jobs=n_jobs, algorithm = algorithm)
model.fit(trainData, trainLabels)
predictions = model.predict(testData)
takes about 0.2s
I repeated the benchmark multiple times; no matter which one I ran first, snippet_1 was always about 2 times faster than snippet_2.
question
Why does 'kd_tree' take more time than 'brute'?
I know "curse of dimensionality", since the doc says it clearly, what I am asking is why is that?
The answer seems to be related to the dimensionality of your model, also known as the curse of dimensionality. A KD-tree scales very poorly once the dimension goes above roughly 15-20 (close to exponential), whereas brute force follows a more linear-like pattern. When run on GPUs, for higher dimensions, brute force can indeed be faster. Another researcher found a similar problem here: Comparison search time between K-D tree and Brute-force
In general, a KD-tree will be slower than brute force if N < 2**k, where k is the number of dimensions (in this case 8 * 8 = 64) and N is the number of samples. Here 2**64 ≈ 1.8E19 >> 1797, so the KD-tree is far slower.
Basically, a KD-tree does binary splits of the data along each dimension as a first step. If it has enough data to do that, it can guess the closest neighbors by the number of splits they have in common. If N < 2**k, it runs out of data before it runs out of dimensions to split. It then has no distance information about the remaining dimensions. With no good guess, it still has to brute-force the rest of the dimensions, which makes the KD-tree pure overhead.
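A quick sanity check of that rule of thumb for this dataset:

n_samples, n_dims = 1797, 8 * 8
print(n_samples < 2 ** n_dims)    # True: far too few samples for 64 binary splits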
A more in-depth discussion of the issues and possible solutions can be found here. For this application, the third answer, which suggests using PCA first to reduce the dimensionality, is probably your best bet.
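For instance, a sketch of that PCA-first idea as a scikit-learn pipeline (the number of components here is an arbitrary choice):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
# reduce to a low dimension first so the tree has enough samples per dimension
model = make_pipeline(PCA(n_components=15),
                      KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree"))
model.fit(X, y)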
I am trying to perform PCA on an image dataset with 100,000 images, each of size 224x224x3.
I was hoping to project the images into a space of dimension 1000 (or somewhere around that).
I am doing this on my laptop (16gb ram, i7, no GPU) and already set svd_solver='randomized'.
However, fitting takes forever. Is the dataset and the image dimension just too large or is there some trick I could be using?
Thanks!
Edit:
This is the code:
pca = PCA(n_components=1000, svd_solver='randomized')
pca.fit(X)
Z = pca.transform(X)
X is a 100000 x 150528 matrix whose rows represent a flattened image.
You should really reconsider your choice of dimensionality reduction if you think you need 1000 principal components. With that many components you no longer have interpretability, so you might as well use other, more flexible dimensionality-reduction algorithms (e.g. variational autoencoders, t-SNE, kernel PCA). A key benefit of PCA is the interpretability of the principal components.
If you have a video stream of the same place, then you should be fine with fewer than 10 components (though principal component pursuit might be better). Moreover, if your image dataset is not comprised of similar-ish images, then PCA is probably not the right choice.
Also, for images, non-negative matrix factorisation (NMF) might be better suited. For NMF, you can perform stochastic gradient optimisation, subsampling both pixels and images for each gradient step.
However, if you still insist on performing PCA, then I think the randomised solver provided by Facebook is the best shot you have. Run pip install fbpca and run the following code:
from fbpca import pca
# load data into X
U, s, Vh = pca(X, 1000)
It's not possible to get faster than that without utilising some matrix structure, e.g. sparsity or block composition (which your dataset is unlikely to have).
Also, if you need help picking the correct number of principal components, I recommend using this code:
import numpy as np
import fbpca
from bisect import bisect_left

def compute_explained_variance(singular_values):
    return np.cumsum(singular_values**2) / np.sum(singular_values**2)

def ideal_number_components(X, wanted_explained_variance):
    singular_values = fbpca.svd(X, compute_uv=False)  # This line is a bottleneck.
    explained_variance = compute_explained_variance(singular_values)
    return bisect_left(explained_variance, wanted_explained_variance)

def auto_pca(X, wanted_explained_variance):
    num_components = ideal_number_components(X, wanted_explained_variance)
    return fbpca.pca(X, num_components)  # This line is a bottleneck if the number of components is high
Of course, the above code doesn't support cross validation, which you really should use to choose the correct number of components.
You can try to set
svd_solver="svd_solver"
The training should be much faster.
You could also try to use:
from sklearn.decomposition import FastICA
which is more scalable.
A last-resort solution could be to convert your images to black & white to reduce the dimension by a factor of 3; this might be a good step if your task is not color-sensitive (for instance, optical character recognition).
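A rough sketch of that reduction, using a random stand-in for the real image matrix:

import numpy as np

X = np.random.rand(100, 224 * 224 * 3)       # stand-in for the real data
# average the three colour channels: 224*224*3 features -> 224*224 features
X_gray = X.reshape(-1, 224, 224, 3).mean(axis=-1).reshape(len(X), -1)
print(X_gray.shape)                          # (100, 50176)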
Try to experiment with the iterated_power parameter of PCA.
I am trying to train the SGDClassifier with text data using the HashingVectorizer. I wonder how I could assemble the batches which are passed to partial_fit() reading from multiple files.
Is the following code an appropriate way to get the data in batches via an iterable? Is there any best practice or recommended way for doing this?
import fileinput

class MyIterable:

    def __init__(self, files, batch_size):
        self.files = files
        self.batch_size = batch_size

    def __iter__(self):
        batchstartmark = 0
        for line in fileinput.input(self.files):
            while batchstartmark < self.batch_size:
                yield line.split('\t')
                batchstartmark += 1
Thanks in advance!
Just judging the theory of this approach here:
That's a very very bad approach!
As SGDClassifier uses stochastic gradient descent (with mini-batches, if you want), you should try to fulfill the assumptions of SGD's mathematical analysis.
The basic idea of SGD is: pick some random element and descend. Your code obviously diverges from that on two points:
A) You pick your elements in the same order in every epoch
B) You sample (not really) without replacement
So x17 will not get picked again until every other x has been picked in this epoch
Ignoring A will lead to very bad performance with high probability.
Point B is harder to analyze. There are different theoretical views, mostly dependent on the specific problem (of course there are differences between convex and non-convex problems), and while sampling with replacement is the classic setting (with the most general convergence proofs), sampling without replacement (aka shuffle and iterate during the epoch / cycling) is sometimes used and is often faster in convergence.
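For what it's worth, a possible sketch of the shuffle-every-epoch pattern (assuming the data fits in memory or is memory-mapped), to be combined with partial_fit:

import numpy as np

def minibatches(X, y, batch_size, rng):
    order = rng.permutation(len(X))                 # fresh shuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

# per epoch:
# rng = np.random.default_rng(0)
# for X_batch, y_batch in minibatches(X, y, 1000, rng):
#     clf.partial_fit(X_batch, y_batch, classes=classes)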