Python multithreading slow af on neural network activation function

I am making a neural network WITHOUT the use of any neural network libraries. Because of this, my code is extremely unoptimized, which is why I decided to use multithreading. However, that seemed to slow down my code by at least 3 times. Why is multithreading slower in my case, and what optimizations can I do to speed it up?
For the code below: inputs are each layer's inputs while weights are simply the allocated weights of that layer. weightedInputs are 3 dimensional like this : [[[1,2,3],[1,2,3],[1,2,3]],[[1,2,3],[1,2,3],[1,2,3]]]
Here is my activation function:
def sigmoid(x):
    return 1 / (1 + np.exp(-np.sum(x)))
Here is my initial layer function:
def layerSingle(inputs, weights):
    biasedInputs = np.append(np.array(inputs), 1)
    neuronInputs = np.repeat(np.array([biasedInputs]), len(weights), axis = 0)
    weightedInputs = neuronInputs * np.array(weights)
    out = list(map(sigmoid, weightedInputs))
    return np.array(out)
Here is with multithreading:
def layerMulti(inputs, weights):
    biasedInputs = np.append(np.array(inputs), 1)
    neuronInputs = np.repeat(np.array([biasedInputs]), len(weights), axis = 0)
    weightedInputs = neuronInputs * np.array(weights)
    with concurrent.futures.ThreadPoolExecutor() as executor:
        result = executor.map(sigmoid, weightedInputs)
    out = np.array([])
    for row in result:
        out = np.append(out, row)
    return np.array(out)
Here are the libraries I import:
import numpy as np
import concurrent.futures
from copy import deepcopy
Any help is extremely appreciated.
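One observation that may help, offered as a sketch rather than a definitive fix: because sigmoid sums each row of weightedInputs, the whole layer is equivalent to a single matrix-vector product followed by an elementwise sigmoid. Threads tend not to help for tasks this small, since CPython's global interpreter lock serializes the Python-level calls and every submitted task adds scheduling overhead. The name layer_vectorized is hypothetical, and the sketch assumes weights is 2-dimensional with one row per neuron and the bias weight last, as in layerSingle.

```python
import numpy as np

def layer_vectorized(inputs, weights):
    # Append the constant bias input, as layerSingle does.
    biased = np.append(np.asarray(inputs, dtype=float), 1.0)
    # One matrix-vector product replaces the repeat / multiply / row-sum steps.
    z = np.asarray(weights, dtype=float) @ biased
    # Elementwise sigmoid over all neurons at once.
    return 1.0 / (1.0 + np.exp(-z))

# Two neurons, two inputs plus one bias weight each (made-up values).
weights = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
out = layer_vectorized([1.0, 1.0], weights)
print(out)
```

This keeps the per-layer work inside NumPy, so there is no Python-level loop over neurons left to parallelize.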


Computation of second derivatives with batch_jacobian in tensorflow is really slow during training

I am trying to compute the Hessian of the output of a neural network with respect to its inputs. To give you an idea, this is the matrix I am trying to compute:
I am running Tensorflow 2.5.0 and my code to calculate the M matrix looks like this:
def get_Mass_Matrix(self, q, dq):
    nDof = dq.shape[1]
    with tf.GradientTape(persistent = True) as t2:
        with tf.GradientTape(persistent = True) as t1:
            T = self.kinetic(q, dq)
        g = t1.gradient(T, dq)
    h = t2.batch_jacobian(g, dq)
    return h
The function self.kinetic() calls a multilayer perceptron. When I compute M like this, I get the correct answer but my neural network training slows down significantly, even when running on a GPU.
I was wondering if there is a more efficient way to perform the same computation that doesn't result in so much overhead? Thank you.
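To make the target quantity concrete, here is a small NumPy sanity check, not TensorFlow code and not the model above. It assumes a hypothetical quadratic kinetic energy T = 0.5 * dq^T M dq, for which the Hessian with respect to dq should recover M. The helper names kinetic_quadratic and hessian_fd are made up for illustration, and the Hessian is approximated with finite differences.

```python
import numpy as np

def kinetic_quadratic(dq, M):
    # Hypothetical stand-in for the learned kinetic energy: T = 0.5 * dq^T M dq.
    return 0.5 * dq @ M @ dq

def hessian_fd(f, x, eps=1e-3):
    # Forward finite-difference approximation of the Hessian of a scalar function.
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n)
            ej = np.zeros(n)
            ei[i] = eps
            ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei) - f(x + ej) + f(x)) / eps**2
    return H

M = np.array([[2.0, 0.5], [0.5, 1.0]])   # made-up symmetric mass matrix
dq = np.array([0.3, -0.7])               # made-up velocities
H = hessian_fd(lambda v: kinetic_quadratic(v, M), dq)
print(H)
```

A check like this can be useful for validating the output of get_Mass_Matrix on a known function before worrying about its speed.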
For reference, I am using the subclassing approach to build the model (it inherits from tf.keras.Model).
Adding more details about the self.kinetic function:
def kinetic(self, q, qdot):
    nDof = q.shape[1]
    qdq = tf.concat([tf.reshape(q, ((-1, nDof))),
                     tf.reshape(qdot, ((-1, nDof)))], axis = -1)
    return self.T_layers(qdq)
T_layers is defined as:
self.T_layers = L(nlayers = 4, n = 8, input_dim = (latent_dim, 1), nlact = 'swish', oact = 'linear')
Which is calling:
class L(tf.keras.layers.Layer):
    def __init__(self, nlayers, n, nlact, input_dim, oact = 'linear'):
        super(L, self).__init__()
        self.layers = nlayers
        self.dense_in = tf.keras.layers.Dense(n, activation = nlact, input_shape = input_dim)
        self.dense_lays = []
        for lay in range(nlayers):
            self.dense_lays.append(tf.keras.layers.Dense(n, activation = nlact, kernel_regularizer = 'l1'))
        self.dense_out = tf.keras.layers.Dense(1, activation = oact, use_bias = False)
    def call(self, inputs):
        x = self.dense_in(inputs)
        for lay in range(self.layers):
            x = self.dense_lays[lay](x)
        return self.dense_out(x)
I suspect part of the problem might be that I am not "building" the layers? Any advice is appreciated!
In order to get reasonable performance from TensorFlow, especially when computing gradients, you have to decorate your get_Mass_Matrix with @tf.function to make sure it runs in graph mode. To do this, everything inside the function has to be graph-mode compatible.
In the call function of class L, it is better to iterate the list directly instead of indexing it, i.e.:
class L(tf.keras.layers.Layer):
    def call(self, inputs):
        x = self.dense_in(inputs)
        for l in self.dense_lays:
            x = l(x)
        return self.dense_out(x)
Then, you can decorate your get_Mass_Matrix.
@tf.function
def get_Mass_Matrix(self, q, dq):
    with tf.GradientTape() as t2:
        with tf.GradientTape() as t1:
            T = self.kinetic(q, dq)
        g = t1.gradient(T, dq)
    return t2.batch_jacobian(g, dq)
Remark: q and dq that are passed into get_Mass_Matrix must be tensors of constant shape (constant between calls). Otherwise, tf.function will retrace every time it sees a new shape, which slows things down instead of speeding them up.

Short circuit computation in mixture of experts model using tensorflow keras functional api

I am trying to swap between multiple different "expert" layers based on the output of a "gating" layer (as a mixture of experts).
I created a custom layer that takes in the outputs of the expert and gating layers, but this ends up throwing away some outputs rather than not computing them in the first place.
How can I make the model "short circuit" to only evaluate the gating layer and the selected expert layer(s) to save computation time?
I am using tensorflow 2.0 gpu and the keras functional api
Keras models can be implemented fully dynamically, to support the efficient routing that you mentioned. The following example shows one way in which this can be done. The example is written with the following premises:
It assumes there are two experts (LayerA and LayerB)
It assumes that a mix-of-experts model (MixOfExpertsModel) switches dynamically between the two expert layer classes depending on the per-example output of a Keras Dense layer
It satisfies the need to run training on the model in a batch fashion.
Pay attention to the comments in the code to see how the switching is done.
import numpy as np
import tensorflow as tf
# This is your Expert A class.
class LayerA(tf.keras.layers.Layer):
    def build(self, input_shape):
        self.weight = self.add_weight("weight_a", shape=input_shape[1:])
    def call(self, x):
        return x + self.weight
# This is your Expert B class.
class LayerB(tf.keras.layers.Layer):
    def build(self, input_shape):
        self.weight = self.add_weight("weight_b", shape=input_shape[1:])
    def call(self, x):
        return x * self.weight
class MixOfExpertsModel(tf.keras.models.Model):
    def __init__(self):
        super(MixOfExpertsModel, self).__init__()
        self._expert_a = LayerA()
        self._expert_b = LayerB()
        self._gating_layer = tf.keras.layers.Dense(1, activation="sigmoid")
    def call(self, x):
        z = self._gating_layer(x)
        # The switching logic:
        # - examples with gating output <= 0.5 are routed to expert A
        # - examples with gating output > 0.5 are routed to expert B.
        mask_a = tf.squeeze(tf.less_equal(z, 0.5), axis=-1)
        mask_b = tf.squeeze(tf.greater(z, 0.5), axis=-1)
        # `input_a` is a subset of slices of the original input (`x`).
        # So is `input_b`. As such, no compute is wasted.
        input_a = tf.boolean_mask(x, mask_a, axis=0)
        input_b = tf.boolean_mask(x, mask_b, axis=0)
        # Only call an expert when at least one example is routed to it.
        if tf.size(input_a) > 0:
            output_a = self._expert_a(input_a)
        else:
            output_a = tf.zeros_like(input_a)
        if tf.size(input_b) > 0:
            output_b = self._expert_b(input_b)
        else:
            output_b = tf.zeros_like(input_b)
        # Return `mask_a` and `mask_b`, so that the caller can know
        # which example is routed to which expert and whether its output
        # appears in `output_a` or `output_b`. This is necessary
        # for writing a (custom) loss function for this class.
        return output_a, output_b, mask_a, mask_b
# Create an instance of the mix-of-experts model.
mix_of_experts_model = MixOfExpertsModel()
# Generate some dummy data.
num_examples = 32
xs = np.random.random([num_examples, 8]).astype(np.float32)
# Call the model.
print(mix_of_experts_model(xs))
I didn't write a custom loss function that would support the training of this class. But that's doable by using the return values of MixOfExpertsModel.call(), namely the outputs and masks.
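As a rough illustration of how the outputs and masks could be used together, the following NumPy sketch scatters the two per-expert outputs back into the original batch order. It uses plain arrays as stand-ins for the tensors, and the function name combine_expert_outputs and the sample values are hypothetical. A loss could then be computed on the combined array, or alternatively the targets could be masked with the same masks and compared per expert.

```python
import numpy as np

def combine_expert_outputs(output_a, output_b, mask_a, mask_b):
    # Allocate a full-batch array and write each expert's rows back
    # to the positions selected by its boolean mask.
    batch = mask_a.shape[0]
    combined = np.zeros((batch,) + output_a.shape[1:], dtype=output_a.dtype)
    combined[mask_a] = output_a
    combined[mask_b] = output_b
    return combined

x = np.arange(8, dtype=np.float32).reshape(4, 2)
mask_a = np.array([True, False, True, False])  # examples routed to expert A
mask_b = ~mask_a                               # examples routed to expert B
output_a = x[mask_a] + 1.0   # stand-in for expert A
output_b = x[mask_b] * 2.0   # stand-in for expert B
combined = combine_expert_outputs(output_a, output_b, mask_a, mask_b)
print(combined)
```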

Efficient way to avoid modifying parameter by inplace operation

I have a model which has noisy linear layers (for which you can sample values from a mu and sigma parameter) and need to create two decorrelated outputs of it.
This means I have something like:
output_1 = model(input)
with torch.no_grad():
    model.sample_noise()
    output_2 = model(input)
sample_noise actually modifies weights attached to the model according to a normal distribution.
But in the end this leads to
RuntimeError: one of the variables needed for gradient computation has been
modified by an inplace operation
The question actually is, what's the best way to avoid modifying these parameters. I could actually deepcopy the model every iteration and then use it for the second forward pass, but this does not sound very efficient to me.
If I understand your problem correctly, you want to have a linear layer with matrix M and then create two outputs
y_1 = (M + μ_1) * x + b
y_2 = (M + μ_2) * x + b
where μ_1, μ_2 ~ P. The simplest way would be, in my opinion, to create a custom class
import torch
import torch.nn.functional as F
from torch import nn
class NoisyLinear(nn.Module):
    def __init__(self, n_in, n_out):
        super(NoisyLinear, self).__init__()
        # or any other initialization you want
        self.weight = nn.Parameter(torch.randn(n_out, n_in))
        self.bias = nn.Parameter(torch.randn(n_out))
    def sample_noise(self):
        # implement your noise generation here
        return torch.randn(*self.weight.shape) * 0.01
    def forward(self, x):
        # The noise is added out of place, so the parameter itself is never modified.
        noise = self.sample_noise()
        return F.linear(x, self.weight + noise, self.bias)
nl = NoisyLinear(4, 3)
x = torch.randn(2, 4)
y1 = nl(x)
y2 = nl(x)
print(y1, y2)

Unknown bug in neural network. Is it because matrices are not commutative?

I'm having trouble with my first neural network. I simply cannot find the source of the error.
Reading the book "Make your own neural network" by Tariq Rashid I tried to implement Handwriting recognition using NN which would classify images and determine which digit from 0 to 9 is written down.
After training the NN, the tests show that each of the digits has a ~99% match, which is obviously wrong.
In the book, the author approaches the NN matrices a bit differently than I do. For example, he multiplies the input-hidden layer weights by the input, whereas I do it the other way around and multiply the input by the input-hidden weights.
Here is illustration of the way I do matrix multiplication while querying NN (feedforward):
I'm aware that matrices do not possess the commutative property for the dot product, but I don't see where I have made an error there.
Should I take a different approach, i.e. transpose all matrices and multiply them in a different order?
Is there a de facto standard for the dimensions of an input and output matrix, i.e. should they be shaped as 1×n or n×1?
If this is the wrong approach, then it has certainly manifested itself in the backpropagation with gradient descent used for training.
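On the ordering question, the two conventions are related by a transpose: for a row-vector input x and weights W shaped (inputs, hidden), x·W equals (Wᵀ·xᵀ)ᵀ. The following small check, with made-up sizes and random values, is only meant to illustrate that identity and does not reproduce the book's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 3
W = rng.normal(size=(n_in, n_hidden))  # weights shaped (inputs, hidden)
x_row = rng.normal(size=(1, n_in))     # input as a 1×n row vector

row_result = x_row @ W        # input times weights, shape (1, n_hidden)
col_result = W.T @ x_row.T    # transposed weights times column input, shape (n_hidden, 1)

same = np.allclose(row_result, col_result.T)
print(row_result.shape, col_result.shape, same)
```

Either shape convention can work as long as every product, including those in backpropagation, is transposed consistently.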
Source code
import numpy as np
import matplotlib.pyplot
from matplotlib.pyplot import imshow
import scipy.special as scipy
from PIL import Image
class NeuralNetwork(object):
    def __init__(self):
        self.input_neuron_count = 28*28 # One for each pixel, 28*28 = 784 in total.
        self.hidden_neuron_count = 100 # Arbitrary.
        self.output_neuron_count = 10 # One for each digit from 0 to 9.
        self.learning_rate = 0.1 # Arbitrary.
        # Sampling the weights from a normal probability distribution
        # centered around zero and with standard deviation
        # that is related to the number of incoming links into a node,
        # 1/√(number of incoming links).
        generate_random_weight_matrix = lambda input_neuron_count, output_neuron_count: (
            np.random.normal(0.0, pow(input_neuron_count, -0.5), (input_neuron_count, output_neuron_count))
        )
        self.input_x_hidden_weights = generate_random_weight_matrix(self.input_neuron_count, self.hidden_neuron_count)
        self.hidden_x_output_weights = generate_random_weight_matrix(self.hidden_neuron_count, self.output_neuron_count)
        self.activation_function = lambda value: scipy.expit(value) # Sigmoid function
    def train(self, input_array, target_array):
        inputs = np.array(input_array, ndmin=2)
        targets = np.array(target_array, ndmin=2)
        hidden_layer_input = np.dot(inputs, self.input_x_hidden_weights)
        hidden_layer_output = self.activation_function(hidden_layer_input)
        output_layer_input = np.dot(hidden_layer_output, self.hidden_x_output_weights)
        output_layer_output = self.activation_function(output_layer_input)
        output_errors = targets - output_layer_output
        self.hidden_x_output_weights += self.learning_rate * np.dot(hidden_layer_output.T, (output_errors * output_layer_output * (1 - output_layer_output)))
        hidden_errors = np.dot(output_errors, self.hidden_x_output_weights.T)
        self.input_x_hidden_weights += self.learning_rate * np.dot(inputs.T, (hidden_errors * hidden_layer_output * (1 - hidden_layer_output)))
    def query(self, input_array):
        inputs = np.array(input_array, ndmin=2)
        hidden_layer_input = np.dot(inputs, self.input_x_hidden_weights)
        hidden_layer_output = self.activation_function(hidden_layer_input)
        output_layer_input = np.dot(hidden_layer_output, self.hidden_x_output_weights)
        output_layer_output = self.activation_function(output_layer_input)
        return output_layer_output
Replication (Training and testing)
The original source of the training and testing data is The MNIST Database. I used the CSV version, which I downloaded from the book author's web page, The MNIST Dataset of Handwitten Digits.
Here is the code I have used for training and testing so far:
def prepare_data(handwritten_digit_array):
    return ((handwritten_digit_array / 255.0 * 0.99) + 0.0001).flatten()
def create_target(digit_target):
    target = np.zeros(10) + 0.01
    target[digit_target] = target[digit_target] + 0.98
    return target
# Training
neural_network = NeuralNetwork()
training_data_file = open('mnist_train.csv', 'r')
training_data = training_data_file.readlines()
for data in training_data:
    handwritten_digit_raw = data.split(',')
    handwritten_digit_array = np.asfarray(handwritten_digit_raw[1:]).reshape((28, 28))
    handwritten_digit_target = int(handwritten_digit_raw[0])
    neural_network.train(prepare_data(handwritten_digit_array), create_target(handwritten_digit_target))
# Testing
test_data_file = open('mnist_test_10.csv', 'r')
test_data = test_data_file.readlines()
for data in test_data:
    handwritten_digit_raw = data.split(',')
    handwritten_digit_array = np.asfarray(handwritten_digit_raw[1:]).reshape((28, 28))
    handwritten_digit_target = int(handwritten_digit_raw[0])
    output = neural_network.query(handwritten_digit_array.flatten())
    print('target', handwritten_digit_target)
    print('output', output)
This is one of those facepalm moments. The neural network has been working as expected all along. The truth is that I misread the test results: the numbers were printed in scientific notation and I read them incorrectly.
Measured on the 10,000 test records from The MNIST Database, this NN has an accuracy of 94.01%.
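For anyone who trips over the same thing, one way to make such output easier to read is to turn off scientific notation in NumPy's printing and to compare the index of the largest output against the target. This is a sketch with made-up output values; the helper name accuracy is hypothetical and is not the code used to obtain the figure above.

```python
import numpy as np

# Made-up network output for a single example with four classes.
output = np.array([[1.2e-03, 4.0e-04, 9.9e-01, 2.5e-03]])
print(output)  # default formatting may use scientific notation

# Suppress scientific notation for small values to make the output easier to read.
np.set_printoptions(suppress=True, precision=4)
print(output)

# The predicted class is the index of the largest output.
predicted = int(np.argmax(output))

def accuracy(predictions, targets):
    # Fraction of predictions that match their targets.
    predictions = np.asarray(predictions)
    targets = np.asarray(targets)
    return float(np.mean(predictions == targets))

acc = accuracy([2, 1, 0, 3], [2, 1, 1, 3])
print(predicted, acc)
```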

Evaluating Tensorflow operation is very slow in a loop

I'm trying to learn tensorflow by coding up some simple problems: I was trying to find the value of pi using a direct sampling Monte Carlo method.
The run time is much longer than I thought it would be when using a for loop to do this. I've seen other posts about similar things and I've tried to follow the solutions, but I think I still must be doing something wrong.
Attached below is my code:
import tensorflow as tf
import numpy as np
import time
n_trials = 50000
x = tf.random_uniform(shape=(), name='x')
y = tf.random_uniform(shape=(), name='y')
r = tf.sqrt(x**2 + y**2)
hit = tf.Variable(0, name='hit')
# perform the monte carlo step
is_inside = tf.cast(tf.less(r, 1), tf.int32)
hit_op = hit.assign_add(is_inside)
with tf.Session() as sess:
    init_op = tf.global_variables_initializer()
    sess.run(init_op)
    # Make sure no new nodes are added to the graph
    sess.graph.finalize()
    start = time.time()
    # Run monte carlo trials -- This is very slow
    for _ in range(n_trials):
        sess.run(hit_op)
    hits = hit.eval()
    print("Pi is {}".format(4*hits/n_trials))
    print("Tensorflow operation took {:.2f} s".format((time.time()-start)))
>>> Pi is 3.15208
>>> Tensorflow operation took 8.98 s
In comparison, doing a for loop type solution in numpy is an order of magnitude faster
start = time.time()
hits = [ 1 if np.sqrt(np.sum(np.square(np.random.uniform(size=2)))) < 1 else 0 for _ in range(n_trials) ]
a = 0
for hit in hits:
    a += hit
print("numpy operation took {:.2f} s".format((time.time()-start)))
print("Pi is {}".format(4*a/n_trials))
>>> Pi is 3.14032
>>> numpy operation took 0.75 s
Attached below is a plot of the difference in overall execution times for various numbers of trials.
Please note: my question is not about "how to perform this task the fastest", I recognize there are much more effective ways of calculating Pi. I've only used this as a benchmarking tool to check the performance of tensorflow against something I'm familiar with (numpy).
The slowdown has to do with communication overhead between Python and TensorFlow in sess.run, which is executed many times inside your loop. I would suggest using tf.while_loop to execute the computation within TensorFlow. That would be a fairer comparison against numpy.
import tensorflow as tf
import numpy as np
import time
n_trials = 50000
hit = tf.Variable(0, name='hit')
def body(ctr):
    x = tf.random_uniform(shape=[2], name='x')
    r = tf.sqrt(tf.reduce_sum(tf.square(x)))
    is_inside = tf.cond(tf.less(r,1), lambda: tf.constant(1), lambda: tf.constant(0))
    hit_op = hit.assign_add(is_inside)
    # Make sure the counter only advances after the hit update has run.
    with tf.control_dependencies([hit_op]):
        return ctr + 1
def condition(ctr):
    return ctr < n_trials
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    result = tf.while_loop(condition, body, [tf.constant(0)])
    start = time.time()
    # A single call runs the whole loop inside TensorFlow.
    sess.run(result)
    hits = hit.eval()
    print("Pi is {}".format(4.*hits/n_trials))
    print("Tensorflow operation took {:.2f} s".format((time.time()-start)))
Simply put, sess.run has a lot of overhead, and it is not designed to be used that way. Normally, with e.g. a neural net, you would call a single sess.run for a dozen multiplications of big matrices, and then the 0.2 ms it takes would not matter at all.
As for your case, you probably wanted something like the code below. It runs 5 times faster than the numpy version on my machine.
By the way, you do exactly the same thing in numpy: if you used a loop to reduce instead of np.sum, it would be much slower.
import tensorflow as tf
import numpy as np
import time
n_trials = 50000
# Sample all trials at once; x and y both need one value per trial.
x = tf.random_uniform(shape=(n_trials,), name='x')
y = tf.random_uniform(shape=(n_trials,), name='y')
r = tf.sqrt(x**2 + y**2)
hit = tf.Variable(0, name='hit')
# perform the monte carlo step
is_inside = tf.cast(tf.less(r, 1), tf.int32)
hit2= tf.reduce_sum(is_inside)
#hit_op = hit.assign_add(is_inside)
with tf.Session() as sess:
    # init_op = tf.global_variables_initializer()
    # Make sure no new nodes are added to the graph
    start = time.time()
    # Run all monte carlo trials in a single evaluation, no Python loop needed
    #for _ in range(n_trials):
    hits = hit2.eval()
    print("Pi is {}".format(4*hits/n_trials))
    print("Tensorflow operation took {:.2f} s".format((time.time()-start)))
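The same batching idea applies on the numpy side of the comparison. As a sketch (the function name estimate_pi and the fixed seed are assumptions for illustration, not part of the original benchmark), all trials can be sampled in one call and reduced with a single vectorized operation instead of a Python-level loop:

```python
import numpy as np

def estimate_pi(n_trials, seed=0):
    # Draw all (x, y) points in one call instead of one pair per loop iteration.
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_trials, 2))
    # A point is a hit when it falls inside the unit quarter circle.
    inside = np.sum(points ** 2, axis=1) < 1.0
    return 4.0 * np.count_nonzero(inside) / n_trials

pi_estimate = estimate_pi(50000)
print("Pi is {}".format(pi_estimate))
```

Comparing this against the batched TensorFlow version above would be closer to a like-for-like benchmark than comparing the two loop-based versions.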

