I'm trying to learn tensorflow by coding up some simple problems: I was trying to find the value of pi using a direct sampling Monte Carlo method.
The run time is much longer than I thought it would be when using a for loop to do this. I've seen other posts about similar things and I've tried to follow the solutions, but I think I still must be doing something wrong.
Attached below is my code:
import tensorflow as tf
import numpy as np
import time
n_trials = 50000
tf.reset_default_graph()
x = tf.random_uniform(shape=(), name='x')
y = tf.random_uniform(shape=(), name='y')
r = tf.sqrt(x**2 + y**2)
hit = tf.Variable(0, name='hit')
# perform the monte carlo step
is_inside = tf.cast(tf.less(r, 1), tf.int32)
hit_op = hit.assign_add(is_inside)
with tf.Session() as sess:
    init_op = tf.global_variables_initializer()
    sess.run(init_op)
    # Make sure no new nodes are added to the graph
    sess.graph.finalize()

    start = time.time()
    # Run monte carlo trials -- This is very slow
    for _ in range(n_trials):
        sess.run(hit_op)

    hits = hit.eval()
    print("Pi is {}".format(4*hits/n_trials))
    print("Tensorflow operation took {:.2f} s".format((time.time()-start)))
>>> Pi is 3.15208
>>> Tensorflow operation took 8.98 s
In comparison, a for-loop-style solution in NumPy is an order of magnitude faster:
start = time.time()
hits = [ 1 if np.sqrt(np.sum(np.square(np.random.uniform(size=2)))) < 1 else 0 for _ in range(n_trials) ]
a = 0
for hit in hits:
    a += hit
print("numpy operation took {:.2f} s".format((time.time()-start)))
print("Pi is {}".format(4*a/n_trials))
>>> Pi is 3.14032
>>> numpy operation took 0.75 s
Attached below is a plot of the difference in overall execution times for various numbers of trials.
Please note: my question is not about "how to perform this task the fastest", I recognize there are much more effective ways of calculating Pi. I've only used this as a benchmarking tool to check the performance of tensorflow against something I'm familiar with (numpy).
The slowdown is due to communication overhead between Python and TensorFlow in sess.run, which is executed many times inside your loop. I would suggest using tf.while_loop to execute the computations within TensorFlow. That would make for a better comparison against numpy.
import tensorflow as tf
import numpy as np
import time
n_trials = 50000
tf.reset_default_graph()
hit = tf.Variable(0, name='hit')
def body(ctr):
    # one Monte Carlo trial per loop iteration, entirely inside the graph
    x = tf.random_uniform(shape=[2], name='x')
    r = tf.sqrt(tf.reduce_sum(tf.square(x)))
    is_inside = tf.cond(tf.less(r, 1), lambda: tf.constant(1), lambda: tf.constant(0))
    hit_op = hit.assign_add(is_inside)
    with tf.control_dependencies([hit_op]):
        return ctr + 1

def condition(ctr):
    return ctr < n_trials

with tf.Session() as sess:
    tf.global_variables_initializer().run()
    result = tf.while_loop(condition, body, [tf.constant(0)])

    start = time.time()
    sess.run(result)
    hits = hit.eval()
    print("Pi is {}".format(4.*hits/n_trials))
    print("Tensorflow operation took {:.2f} s".format((time.time()-start)))
Simply put, session.run has a lot of overhead, and it is not designed to be used that way. Normally, with e.g. a neural net you would call a single session.run for a dozen multiplications of big matrices, and then the 0.2 ms it takes would not matter at all.
As for your case, you probably wanted something like this. It runs 5 times faster than the numpy version on my machine.
By the way, you do exactly the same thing in numpy. If you used a loop to reduce instead of np.sum it would be much slower (see the vectorized NumPy sketch after the code below).
import tensorflow as tf
import numpy as np
import time
n_trials = 50000
tf.reset_default_graph()
x = tf.random_uniform(shape=(n_trials,), name='x')
y = tf.random_uniform(shape=(n_trials,), name='y')
r = tf.sqrt(x**2 + y**2)
hit = tf.Variable(0, name='hit')
# perform the monte carlo step for all trials at once
is_inside = tf.cast(tf.less(r, 1), tf.int32)
hit2 = tf.reduce_sum(is_inside)
#hit_op = hit.assign_add(is_inside)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Make sure no new nodes are added to the graph
    sess.graph.finalize()

    start = time.time()
    # Run all monte carlo trials in a single call -- this is fast
    #for _ in range(n_trials):
    hits = sess.run(hit2)
    print("Pi is {}".format(4*hits/n_trials))
    print("Tensorflow operation took {:.2f} s".format((time.time()-start)))
Related
I am making a neural network WITHOUT the use of any neural network libraries. Because of this, my code is extremely unoptimized, which is why I decided to use multithreading. However, that seemed to slow down my code by at least 3 times. Why is multithreading slower in my case, and what optimizations can I do to speed it up?
For the code below: inputs are each layer's inputs, while weights are simply the allocated weights of that layer. weightedInputs is 3-dimensional, like this: [[[1,2,3],[1,2,3],[1,2,3]],[[1,2,3],[1,2,3],[1,2,3]]]
Here is my activation function:
def sigmoid(x):
    return 1 / (1 + np.exp(-np.sum(x)))
Here is my initial layer function:
def layerSingle(inputs, weights):
    biasedInputs = np.append(np.array(inputs), 1)
    neuronInputs = np.repeat(np.array([biasedInputs]), len(weights), axis = 0)
    weightedInputs = neuronInputs * np.array(weights)
    out = list(map(sigmoid, weightedInputs))
    return np.array(out)
Here is with multithreading:
def layerMulti(inputs, weights):
    biasedInputs = np.append(np.array(inputs), 1)
    neuronInputs = np.repeat(np.array([biasedInputs]), len(weights), axis = 0)
    weightedInputs = neuronInputs * np.array(weights)
    with concurrent.futures.ThreadPoolExecutor() as executor:
        result = executor.map(sigmoid, weightedInputs)
    out = np.array([])
    for row in result:
        out = np.append(out, row)
    return np.array(out)
Here are the libraries I import:
import numpy as np
import concurrent.futures
from copy import deepcopy
Any help is extremely appreciated.
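One plausible explanation is that each sigmoid call does very little work, so the thread pool's scheduling overhead (plus Python's GIL, since these are threads rather than processes) outweighs any gain. A vectorized sketch of the same layer, dropping the Python-level map entirely, might look like the following (layerVectorized is a hypothetical name; it assumes the intended math is sigmoid(sum(row)) per neuron, as in the original sigmoid):
import numpy as np

def layerVectorized(inputs, weights):
    # Same math as layerSingle: one sigmoid(sum of weighted inputs) per neuron,
    # computed with array operations instead of map/threads.
    biasedInputs = np.append(np.array(inputs), 1)
    weightedInputs = np.array(weights) * biasedInputs   # broadcasts over each row
    summed = weightedInputs.sum(axis=1)                 # one value per neuron
    return 1 / (1 + np.exp(-summed))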
I have a very simple TensorFlow setup, but one aspect of it (calculating the accuracy) keeps taking longer and longer to run. I'm confused about why this is. I've simplified the code down as much as I can while still keeping the error. Here is the code:
import time
import tensorflow as tf
import numpy as np
# dummy data
data = np.zeros((12, 784))
labels = np.zeros((12, 10))
xs = tf.placeholder(tf.float32, [12, 784])
ys = tf.placeholder(tf.float32, [12, 10])
weights = tf.Variable(tf.truncated_normal([784, 10], stddev=0.1))
prediction = tf.nn.softmax(tf.matmul(xs, weights))
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
while True:
    y_pre = sess.run(prediction, feed_dict={xs: data})
    correct_prediction = tf.equal(tf.argmax(y_pre, 1), tf.argmax(labels, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

    start_time = time.time()
    r = sess.run(accuracy, feed_dict={xs: data, ys: labels})
    time_taken = time.time() - start_time
    # why does time_taken keep growing?
    print("time_taken", time_taken)
I suspect it's something I'm doing wrong in the while True loop. From my experience, time_taken starts off low, around 0.01, but then seemingly grows indefinitely to 0.30 and beyond if you leave it long enough. Is there some way to keep time_taken constant? Any help would be appreciated, thanks.
Can you take a look at your RAM during execution?
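A likely cause is the same graph-growth issue discussed in the answers further down: tf.argmax, tf.equal and tf.reduce_mean create new graph nodes on every pass through the loop. A minimal sketch that builds the accuracy ops once, before the loop, reusing the prediction tensor, xs/ys placeholders, data, labels and sess already defined above, might look like this:
# Build the accuracy ops once, outside the loop (a sketch, not the original code)
correct_prediction = tf.equal(tf.argmax(prediction, 1), tf.argmax(ys, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

while True:
    start_time = time.time()
    r = sess.run(accuracy, feed_dict={xs: data, ys: labels})
    print("time_taken", time.time() - start_time)  # should now stay roughly constant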
I am new to TensorFlow. The following code runs successfully, without any error. In the first 10 lines of output, the computation is fast, and the output (defined in the last line) flies by line by line. However, as the iterations go on, the computation becomes slower and slower, and finally becomes intolerable. So I wonder whether there are any modifications that can speed this up.
Here is a brief description of this code:
This code applies a single-hidden-layer neural network to the dataset. It aims to find the best parameters for rate[0] and rate[1], which are parameters that affect the loss function. During each step of training, one tuple is fed to the model, and the accuracy on that tuple is immediately evaluated (this kind of data comes as a stream in the real world).
import tensorflow as tf
import numpy as np
n_hidden=50
n_input=37
n_output=2
data_raw=np.genfromtxt(r'data.csv',delimiter=",",dtype=None)
data_info=np.genfromtxt(r'data2.csv',delimiter=",",dtype=None)
def pre_process(tuple):
    ans = []
    temp = [0 for i in range(24)]
    temp[int(tuple[0])] = 1
    # np.append(ans,np.array(temp))
    ans.extend(temp)
    temp = [0 for i in range(7)]
    temp[int(tuple[1]) - 1] = 1
    ans.extend(temp)
    # np.append(ans,np.array(temp))
    temp = [0 for i in range(3)]
    temp[int(tuple[3])] = 1
    ans.extend(temp)
    temp = [0 for i in range(2)]
    temp[int(tuple[4])] = 1
    ans.extend(temp)
    ans.extend([int(tuple[5])])
    return np.array(ans)
x=tf.placeholder(tf.float32, shape=[1,n_input])
y_=tf.placeholder(tf.float32,shape=[n_output])
y_r=tf.placeholder(tf.float32,shape=[n_output])
W1=tf.Variable(tf.random_uniform([n_input, n_hidden]))
b1=tf.Variable(tf.zeros([n_hidden]))
W2=tf.Variable(tf.zeros([n_hidden,n_output]))
b2=tf.Variable(tf.zeros([n_output]))
logits_1 = tf.matmul(x, W1) + b1
relu_layer= tf.nn.relu(logits_1)
logits_2 = tf.matmul(relu_layer, W2) + b2
correct_prediction = tf.equal(tf.argmax(logits_2,1), tf.argmax(y_,0))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
rate=[0,0]
for i in range(-100,200,10):
    rate[0]=i
    for j in range(-100,i,10):
        rate[1]=j
        loss=tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits_2)*[rate[0],rate[1]])
        # loss2=tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(labels=y_r, logits=logits_2)*[rate[2],rate[3]])
        # loss=loss1+loss2
        train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

        data_line=1
        accur=0
        local_local=0
        remote_remote=0
        local_remote=0
        remote_local=0
        total=0
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            for i in range(200):
                # print(int(data_raw[data_line][0]),data_info[i][0])
                if i>100:
                    total+=1
                if int(data_raw[data_line][0])==data_info[i][0]:
                    sess.run(train_step,feed_dict={x:pre_process(data_info[i]).reshape(1,-1),y_:[1,0],y_r:[0,1]})
                    # print(sess.run(logits_2,{x:pre_process(data_info[i]).reshape(1,-1), y_: [1,0]}))
                    data_line+=1
                    if data_line==len(data_raw):
                        break
                    if i>100:
                        acc=accuracy.eval(feed_dict={x: pre_process(data_info[i]).reshape(1,-1), y_: [1,0], y_r:[0,1]})
                        local_local+=acc
                        local_remote+=1-acc
                        accur+=acc
                else:
                    sess.run(train_step,feed_dict={x:pre_process(data_info[i]).reshape(1,-1),y_:[0,1], y_r:[1,0]})
                    # print(sess.run(logits_2,{x: pre_process(data_info[i]).reshape(1,-1), y_: [0,1]}))
                    if i>100:
                        acc=accuracy.eval(feed_dict={x: pre_process(data_info[i]).reshape(1,-1), y_: [0,1], y_r:[1,0]})
                        remote_remote+=acc
                        remote_local+=1-acc
                        accur+=acc
            print("correctness: (%.3d,%.3d): \t%.2f %.2f %.2f %.2f %.2f" % (rate[0],rate[1],accur/total,local_local/total,local_remote/total,remote_local/total,remote_remote/total))
Although GPhilo's answer addresses why running the code gets slower and slower, in reality that solution will recreate the computation graph again and again, which is not good.
The following two lines of code (which GPhilo also mentioned) keep adding operations to your graph on every iteration.
loss=tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits( \
labels=y_, logits=logits_2)*[rate[0],rate[1]])
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
As I can see, you have two values, rate[0] and rate[1], which need to be supplied to your graph. Why not supply these two values through placeholders and define your graph only once? Once you start running a Session, you shouldn't add more operations to your graph. Also, you shouldn't create a new Session for every iteration.
Check this modified code (only important parts)
# To clear previously created graph (if any) present in memory.
tf.reset_default_graph()
x=tf.placeholder(tf.float32, shape=[1,n_input])
y_=tf.placeholder(tf.float32,shape=[n_output])
y_r=tf.placeholder(tf.float32,shape=[n_output])
# Add these two placeholders (Assuming they are single float value)
rate0 = tf.placeholder(tf.float32, shape = [])
rate1 = tf.placeholder(tf.float32, shape = [])
W1=tf.Variable(tf.random_uniform([n_input, n_hidden]))
....
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
# Bring this code outside from loop (Note replacement of rate[0] with placeholder)
loss=tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(labels=y_, \
logits=logits_2) * [rate0, rate1])
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
# Instantiate session only once.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # Move the subsequent looping code inside.
    rate=[0,0]
    for i in range(-100,200,10):
        rate[0]=i
        ....
After this modification, whenever your Session runs train_step, you need to supply these two extra placeholders in your feed_dict.
Ex:
sess.run(train_step,feed_dict={x:pre_process(data_info[i]).reshape(1,-1),
y_:[1,0],y_r:[0,1], rate0: rate[0], rate1: rate[1]})
This way, you will not be creating the graph on every iteration, and in fact this code will be faster than GPhilo's solution.
Every time you run train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss) you're adding (quite a few) operations to your graph, which becomes bigger and bigger with each loop of your program. The bigger the graph, the slower the execution.
Put your model definition in the loops' body and call tf.reset_default_graph() each time you start a new iteration:
rate=[0,0]
for i in range(-100,200,10):
    rate[0]=i
    for j in range(-100,i,10):
        tf.reset_default_graph()
        x=tf.placeholder(tf.float32, shape=[1,n_input])
        y_=tf.placeholder(tf.float32,shape=[n_output])
        y_r=tf.placeholder(tf.float32,shape=[n_output])
        W1=tf.Variable(tf.random_uniform([n_input, n_hidden]))
        b1=tf.Variable(tf.zeros([n_hidden]))
        W2=tf.Variable(tf.zeros([n_hidden,n_output]))
        b2=tf.Variable(tf.zeros([n_output]))
        logits_1 = tf.matmul(x, W1) + b1
        relu_layer= tf.nn.relu(logits_1)
        logits_2 = tf.matmul(relu_layer, W2) + b2
        correct_prediction = tf.equal(tf.argmax(logits_2,1), tf.argmax(y_,0))
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
        rate[1]=j
        #...
Maybe my question will seem stupid.
I'm studying the Q-learning algorithm. In order to better understand it, I'm trying to rewrite the TensorFlow code of this FrozenLake example in Keras.
My code:
import gym
import numpy as np
import random
from keras.layers import Dense
from keras.models import Sequential
from keras import backend as K
import matplotlib.pyplot as plt
%matplotlib inline
env = gym.make('FrozenLake-v0')
model = Sequential()
model.add(Dense(16, activation='relu', kernel_initializer='uniform', input_shape=(16,)))
model.add(Dense(4, activation='softmax', kernel_initializer='uniform'))
def custom_loss(yTrue, yPred):
    return K.sum(K.square(yTrue - yPred))
model.compile(loss=custom_loss, optimizer='sgd')
# Set learning parameters
y = .99
e = 0.1
#create lists to contain total rewards and steps per episode
jList = []
rList = []
num_episodes = 2000
for i in range(num_episodes):
    current_state = env.reset()
    rAll = 0
    d = False
    j = 0
    while j < 99:
        j+=1
        current_state_Q_values = model.predict(np.identity(16)[current_state:current_state+1], batch_size=1)
        action = np.reshape(np.argmax(current_state_Q_values), (1,))
        if np.random.rand(1) < e:
            action[0] = env.action_space.sample() #random action
        new_state, reward, d, _ = env.step(action[0])
        rAll += reward
        jList.append(j)
        rList.append(rAll)
        new_Qs = model.predict(np.identity(16)[new_state:new_state+1], batch_size=1)
        max_newQ = np.max(new_Qs)
        targetQ = current_state_Q_values
        targetQ[0,action[0]] = reward + y*max_newQ
        model.fit(np.identity(16)[current_state:current_state+1], targetQ, verbose=0, batch_size=1)
        current_state = new_state
        if d == True:
            #Reduce chance of random action as we train the model.
            e = 1./((i/50) + 10)
            break
print("Percent of succesful episodes: " + str(sum(rList)/num_episodes) + "%")
When I run it, it doesn't work well: Percent of succesful episodes: 0.052%
plt.plot(rList)
The original TensorFlow code does much better: Percent of succesful episodes: 0.352%
plt.plot(rList)
What have I done wrong?
Besides setting use_bias=False as @Maldus mentioned in the comments, another thing you can try is to start with a higher epsilon value (e.g. 0.5 or 0.75). A trick might be to only decrease the epsilon value IF you reach the goal, i.e. don't decrease epsilon at the end of every episode. That way your player can keep exploring the map randomly until it starts to converge on a good route, and then it will be a good idea to reduce the epsilon parameter.
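To make the "only decrease epsilon when the goal is reached" idea concrete, a small sketch (update_epsilon is a hypothetical helper; reward > 0 stands in for reaching the FrozenLake goal):
def update_epsilon(epsilon, reached_goal, decay=0.95, floor=0.05):
    # Hypothetical helper: decay epsilon only after successful episodes.
    if reached_goal:
        return max(floor, epsilon * decay)
    return epsilon

# Inside the episode loop, replacing the unconditional decay:
# if d == True:
#     e = update_epsilon(e, reached_goal=(reward > 0))
#     break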
I've actually implemented a similar model in keras in this gist using Convolutional layers instead of Dense layers. Managed to get it to work in under 2000 episodes. Might be of some help to others :)
I am training a neural network in a Python while loop which continues until some stopping condition is reached. I've noticed that when I train my network, I can see "sawtooth"/wave-like memory usage patterns, like this:
I've managed to reproduce this using a much simpler example than my production model. Obviously this is somewhat different as I don't update parameters, but I believe it replicates the behavior I'm seeing.
import tensorflow as tf
import numpy as np
def main(x_vals):
    x = tf.placeholder(tf.float32, [500, 1000, 1000])
    rs = tf.reduce_sum(x)
    sess = tf.Session()
    v = sess.run(rs, feed_dict={x: x_vals})
    print(v)

if __name__ == "__main__":
    x_vals = np.random.rand(500, 1000, 1000)
    while True:
        main(x_vals)
The size of the sawtooth seems to approximately scale with the size of the input data. Understandably, there appears to be one cycle per loop iteration.
What is happening here? Is Tensorflow copying over all of my data on each session evaluation? This isn't a problem per se, but if I could avoid copying my data on each training loop iteration (since my entire dataset fits in memory), I'd like to do that, as I imagine the allocations are quite expensive. Have I diverged from best practices somewhere?
Using feed_dict will typically copy the data. Recently, new functionality was added that avoids the copy, but you have to make sure that your data is word-aligned; see the discussion in
https://github.com/tensorflow/tensorflow/issues/9690
I wanted to post a follow-up after a couple days of investigation. @Yaroslav pointed me in the correct direction, but there's a bit more color to the full answer.
One can limit the amount of memory allocated per session call by avoiding feed_dict, for example by using a preloaded variable (if your dataset fits in memory). However, some dynamic allocation is also done by the optimizer's computation graph, presumably to store gradients.
I've included code which demonstrates this. Here's a snippet of what the memory use looks like. The left portion shows repeated calls to the preloaded function, while the right shows repeated calls to loaded_each_iteration.
import tensorflow as tf
import numpy as np
A_shape = [100000, 3000]
b_shape = [100000, 1]
x_shape = [3000, 1]
def loaded_each_iteration(A_vals):
    A = tf.placeholder(tf.float32, A_shape)
    b = tf.constant(np.random.rand(*b_shape), name='b', dtype=tf.float32)
    x = tf.Variable(np.zeros(x_shape, dtype=np.float32), name='x')
    diff = tf.matmul(A, x) - b
    cost = tf.nn.l2_loss(diff)
    train_op = tf.train.AdagradOptimizer(0.00001).minimize(cost, var_list=[x])
    sess = tf.Session()
    sess.run(x.initializer)
    sess.run(tf.global_variables_initializer())
    while True:
        _, c = sess.run([train_op, cost], feed_dict={A: A_vals})
        print(c)

def preloaded(A_vals):
    A_init = tf.placeholder(tf.float32, A_shape)
    A = tf.Variable(A_init, trainable=False, collections=[], name='A', dtype=tf.float32)
    b = tf.constant(np.random.rand(*b_shape), name='b', dtype=tf.float32)
    x = tf.Variable(np.zeros(x_shape, dtype=np.float32), name='x')
    diff = tf.matmul(A, x) - b
    cost = tf.nn.l2_loss(diff)
    train_op = tf.train.AdagradOptimizer(0.00001).minimize(cost, var_list=[x])
    sess = tf.Session()
    sess.run([A.initializer, x.initializer], feed_dict={A_init: A_vals})
    sess.run(tf.global_variables_initializer())
    while True:
        _, c = sess.run([train_op, cost])
        print(c)

if __name__ == "__main__":
    A_vals = np.random.rand(*A_shape)
    while True:
        loaded_each_iteration(A_vals)
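As a usage note, and presumably how the flat left-hand memory profile was produced (the __main__ block above only exercises the other variant), one would run the preloaded version instead:
if __name__ == "__main__":
    A_vals = np.random.rand(*A_shape)
    preloaded(A_vals)   # data copied into a Variable once, then reused every step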