TensorFlow: minimalist program fails on distributed mode

TensorFlow: minimalist program fails on distributed mode - python

I wrote a very simple program that runs just fine without distribution but hangs on CheckpointSaverHook in distributed mode (everything on my localhost though!). I've seen there's been a few questions about hanging in distributed mode, but none seem to match my question.
Here's the script (made to toy with the new layers API):
import numpy as np
import tensorflow as tf
from tensorflow.contrib.learn.python.learn import learn_runner
from tensorflow.contrib import layers
DATA_SIZE=10
DIMENSION=5
FEATURES='features'
def generate_input_fn():
def _input_fn():
mid = int(DATA_SIZE/2)
data = np.array([np.ones(DIMENSION) if x < mid else -np.ones(DIMENSION) for x in range(DATA_SIZE)])
labels = ['0' if x < mid else '1' for x in range(DATA_SIZE)]
table = tf.contrib.lookup.string_to_index_table_from_tensor(tf.constant(['0', '1']))
label_tensor = table.lookup(tf.convert_to_tensor(labels, dtype=tf.string))
return dict(zip([FEATURES], [tf.convert_to_tensor(data, dtype=tf.float32)])), label_tensor
return _input_fn
def build_estimator(model_dir):
features = layers.real_valued_column(FEATURES, dimension=DIMENSION)
return tf.contrib.learn.DNNLinearCombinedClassifier(
model_dir=model_dir,
dnn_feature_columns=[features],
dnn_hidden_units=[20,20])
def generate_exp_fun():
def _exp_fun(output_dir):
return tf.contrib.learn.Experiment(
build_estimator(output_dir),
train_input_fn=generate_input_fn(),
eval_input_fn=generate_input_fn(),
train_steps=100
)
return _exp_fun
if __name__ == '__main__':
tf.logging.set_verbosity(tf.logging.DEBUG)
learn_runner.run(generate_exp_fun(), 'job_dir')
To test distributed mode, I simply launch it with the environment variable TF_CONFIG={"cluster": {"ps":["localhost:5040"], "worker":["localhost:5041"]}, "task":{"type":"worker","index":0}, "environment": "local"} (this is for the worker, the same with ps type is used to launch the parameter server.
I use tensorflow-1.0.1 (but had the same behavior with 1.0.0) on windows-64, only CPU. I actually never get any error, it just hang on after INFO:tensorflow:Create CheckpointSaverHook. forever... I've tried to attach VisualStudio C++ debugger to the process but with little success so far, so I can't print a stack for what's happening in the native part.
P.S.: it's not a problem with DNNLinearCombinedClassifier because it fails as well with a simple tf.contrib.learn.LinearClassifier. And as noted in the comments, it's not due to both process running on localhost, since it fails also when running on separate VMs.
EDIT: I think there's actually an issue with server launching. It looks like the server is not launched when you're in local mode (no matter if distributed or not), cf. tensorflow/contrib/learn/python/learn/experiment.py l.250-258:
# Start the server, if needed. It's important to start the server before
# we (optionally) sleep for the case where no device_filters are set.
# Otherwise, the servers will wait to connect to each other before starting
# to train. We might as well start as soon as we can.
config = self._estimator.config
if (config.environment != run_config.Environment.LOCAL and
config.environment != run_config.Environment.GOOGLE and
config.cluster_spec and config.master):
self._start_server()
This will prevent the server from being started in local mode for the workers... Anyone has an idea if it's a bug or there's something I'm missing?

So this has been answered in: https://github.com/tensorflow/tensorflow/issues/8796. Finally, one should use CLOUD for any distributed operation.

Related

What is the right way to check mp.managers.DictProxy' size?

I'm trying to learn multiprocessing and keep the memory usage low, so I'd like to check the size of my objects now and then.
I'm using a mp.managers.DictProxy to store results from multiple mp.Process() 'workers'(?) that put their results in an output mp.Queue().
I tried to see what size the dictionary got, but it kept returning 56 when tried to test it.
EDIT: after testing it in python 3.9.7 (same device) the output now magically is 48.
import sys
import multiprocessing as mp
if __name__ == '__main__': # needed if running on windows
manager = mp.Manager()
result_storage = manager.dict({k:k for k in range(100000)})
print(sys.getsizeof(result_storage))
# print USED to be 56 for me when i created this question
# print is now 48 consistently. same device. I didn't record the previous python version
# current python version is 3.9.7
print(sys.getsizeof(result_storage.copy()))
# prints 5242968
I've tried to print(sys.getsizeof(result_storage.copy())). That returned 5242976 5242968 Can I reliably use this to check the size or is there a better way? I've never used .copy() and I don't know what the impact is. The docs are beyond me.

Unable to set TIMEOUT for ldap in Python 2.7

I'd like to setup a "timeout" for the ldap library (python-ldap-2.4.15-2.el7.x86_64) and python 2.7
I'm forcing my /etc/hosts to resolve an non existing IP address in
order to raise the timeout.
I've followed several examples and took a look at the documentation or questions like this one without luck.
By now I've tried forcing global timeouts before the initialize:
ldap.set_option(ldap.OPT_NETWORK_TIMEOUT, 1)
ldap.set_option(ldap.OPT_TIMEOUT, 1)
ldap.protocol_version=ldap.VERSION3
Forced the same values at object level:
ldap_obj = ldap.initialize("ldap://%s:389" % LDAPSERVER ,trace_level=9)
ldap_obj.set_option(ldap.OPT_NETWORK_TIMEOUT, 1)
ldap_obj.set_option(ldap.OPT_TIMEOUT, 1)
I've also tried with:
ldap.network_timeout = 1
ldap.timelimit = 1
And used both methods, search and search_st (The synchronous form with timeout)
Finally this is the code:
def testLDAPConex( LDAPSERVER ) :
"""
Check ldap
"""
ldap.set_option(ldap.OPT_NETWORK_TIMEOUT, 1)
ldap.set_option(ldap.OPT_TIMEOUT, 1)
ldap.protocol_version=ldap.VERSION3
ldap.network_timeout = 1
ldap.timelimit = 1
try:
ldap_obj = ldap.initialize("ldap://%s:389" % LDAPSERVER ,trace_level=9)
ldap_result_id = ldap_obj.search("ou=people,c=this,o=place", ldap.SCOPE_SUBTREE, "cn=user")
I've printed the constants of the object OPT_NETWORK_TIMEOUT and OPT_TIMEOUT, the values were assigned correctly.
The execution time is 56s everytime and I'm not able to control the amount of seconds for the timeout.
Btw, the same code in python3 does work as intended:
real 0m10,094s
user 0m0,072s
sys 0m0,013s

After some testing I've decided to rollback the VM to a previous state.
The Networking Team were doing changes across the network configuration, that might have changed the behaviour over the interface and some kind of bug when the packages were being sent over the wire in:
ldap_result_id = ldap_obj.search("ou=people,c=this,o=place", ldap.SCOPE_SUBTREE, "cn=user")
After restoring the VM, which included a restart, the error of not being able to reach LDAP raises as expected.

Distributed tensorflow monopolizes GPUs after running server.init

I have two computers with two GPUs each. I am trying to start with distributed tensorflow and very confused about how it all works. On computer A I would like to have one ps tasks (I have the impression this should go on the CPU) and two worker tasks (one per GPU). And I would like to have two 'worker' tasks on computer B. Here's how I have tried to implement this, in test.py
import tensorflow as tf
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--job_name', required = True, type = str)
parser.add_argument('--task_idx', required = True, type = int)
args, _ = parser.parse_known_args()
JOB_NAME = args.job_name
TASK_INDEX = args.task_idx
ps_hosts = ["computerB-i9:2222"]
worker_hosts = ["computerA-i7:2222", "computerA-i7:2223", "computerB-i9:2223", "computerB-i9:2224"]
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
server = tf.train.Server(cluster, job_name = JOB_NAME, task_index = TASK_INDEX)
if JOB_NAME == "ps":
server.join()
elif JOB_NAME == "worker":
is_chief = (TASK_INDEX == 0)
with tf.device(tf.train.replica_device_setter(
worker_device = "/job:worker/task:%d" % FLAGS.task_index, cluster = cluster)):
a = tf.constant(8)
b = tf.constant(9)
with tf.Session(server.target) as sess:
sess.run(tf.multiply(a, b))
What I am finding by running python3 test.py --job_name ps == task_idx 0 on computer A, is that I see that both GPUs on computer A have immediately been reserved by the script and that computer B shows no activity. This is not what I expected. I thought that since for the ps job I simply run server.join() that this should not use the GPU. However I can see by setting pdb break points that as soon as the server is initialized, the GPUs are taken. This leaves me with several questions:
- Why does the server immediately take all the GPU capacity?
- How am I supposed to allocate GPU and launch different processes?
- Does my original plan even make sense? (I am still a little confused by tasks vs. clusters vs. servers etc...)
I have watched the Tensorflow Developer Summit 2017 video on distributed Tensorflow and I have also been looking around on Github and blogs. I have not been able to find a working code example using the latest or even relatively recent distributed tensorflow functions. Likewise, I notice that many questions on Stack Overflow are not answered, so I have read related questions but not any that resolve my questions. I would appreciate any guidance or recommendations about other resources. Thanks!

I found that the following will work when invoking from command line:
CUDA_VISIBLE_DEVICES="" python3 test.py --job_name ps --task_idx 0 --dir_name TEST
Since I found this in a lot of code examples it seems like this may be the standard way to control an individual server's access to GPU resources.

TensorFlow nullptr check failed on GPU

I am using the python API of TensorFlow to train a variant of an LSTM.
For that purpose I use the tf.while_loop function to iterate over the time steps.
When running my script on the cpu, it does not produce any error messages, but on the gpu python crashes due to:
...tensorflow/tensorflow/core/framework/tensor.cc:885] Check failed: nullptr != b.buf_ (nullptr vs. 00...)
The part of my code, that causes this failure (when commenting it out, it works) is in the body of the while loop:
...
h_gathered = h_ta.gather(tf.range(time))
h_gathered = tf.transpose(h_gathered, [1, 0, 2])
syn_t = self.syntactic_weights_ta.read(time)[:, :time]
syn_t = tf.expand_dims(syn_t, 1)
syn_state_t = tf.squeeze(tf.tanh(tf.matmul(syn_t, h_gathered)), 1)
...
where time is zero based and incremented after each step, h_ta is a TensorArray
h_ta = tf.TensorArray(
dtype=dtype,
size=max_seq_len,
clear_after_read=False,
element_shape=[batch_size, num_hidden],
tensor_array_name="fw_output")
and self.syntactic_weights_ta is also a TensorArray
self.syntactic_weights_ta = tf.TensorArray(
dtype=dtype,
size=max_seq_len,
tensor_array_name="fw_syntactic_weights")
self.syntactic_weights_ta = self.syntactic_weights_ta.unstack(syntactic_weights)
What I am trying to achieve in the code snippet is basically a weighted sum over the past outputs, stored in h_ta.
In the end I train the network with tf.train.AdamOptimizer.
I have tested the script again, but this time with swap_memory parameter in the while loop set to False and it works on GPU as well, though I'd really like to know why it does not work with swap_memory=True.

This looks like a bug in the way that TensorArray's tensor storage mechanisms interact with the allocation magic that is performed by while_loop when swap_memory=True.
Can you open an issue on TF's github? Please also include:
A full stack trace (TF built with -c dbg preferrable)
A minimal code example to reproduce
Describe whether the issue requires you to be calling backprop.
Whether this is reproducible in TF 1.2 / nightlies / master branch.
And respond here with the link to the github issue?

Python rpy2 and matplotlib conflict when using multiprocessing

I am trying to calculate and generate plots using multiprocessing. On Linux the code below runs correctly, however on the Mac (ML) it doesn't, giving the error below:
import multiprocessing
import matplotlib.pyplot as plt
import numpy as np
import rpy2.robjects as robjects
def main():
pool = multiprocessing.Pool()
num_figs = 2
# generate some random numbers
input = zip(np.random.randint(10,1000,num_figs),
range(num_figs))
pool.map(plot, input)
def plot(args):
num, i = args
fig = plt.figure()
data = np.random.randn(num).cumsum()
plt.plot(data)
main()
The Rpy2 is rpy2==2.3.1 and R is 2.13.2 (I could not install R 3.0 and rpy2 latest version on any mac without getting segmentation fault).
The error is:
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
I have tried everything to understand what the problem is with no luck. My configuration is:
Danials-MacBook-Pro:~ danialt$ brew --config
HOMEBREW_VERSION: 0.9.4
ORIGIN: https://github.com/mxcl/homebrew
HEAD: 705b5e133d8334cae66710fac1c14ed8f8713d6b
HOMEBREW_PREFIX: /usr/local
HOMEBREW_CELLAR: /usr/local/Cellar
CPU: dual-core 64-bit penryn
OS X: 10.8.3-x86_64
Xcode: 4.6.2
CLT: 4.6.0.0.1.1365549073
GCC-4.2: build 5666
LLVM-GCC: build 2336
Clang: 4.2 build 425
X11: 2.7.4 => /opt/X11
System Ruby: 1.8.7-358
Perl: /usr/bin/perl
Python: /usr/local/bin/python => /usr/local/Cellar/python/2.7.4/Frameworks/Python.framework/Versions/2.7/bin/python2.7
Ruby: /usr/bin/ruby => /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby
Any ideas?

This error occurs on Mac OS X when you perform a GUI operation outside the main thread, which is exactly what you are doing by shifting your plot function to the multiprocessing.Pool (I imagine that it will not work on Windows either for the same reason - since Windows has the same requirement). The only way that I can imagine it working is using the pool to generate the data, then have your main thread wait in a loop for the data that's returned (a queue is the way I usually handle it...).
Here is an example (recognizing that this may not do what you want - plot all the figures "simultaneously"? - plt.show() blocks so only one is drawn at a time and I note that you do not have it in your sample code - but without I don't see anything on my screen - however, if I take it out - there is no blocking and no error because all GUI functions are happening in the main thread):
import multiprocessing
import matplotlib.pyplot as plt
import numpy as np
import rpy2.robjects as robjects
data_queue = multiprocessing.Queue()
def main():
pool = multiprocessing.Pool()
num_figs = 10
# generate some random numbers
input = zip(np.random.randint(10,10000,num_figs), range(num_figs))
pool.map(worker, input)
figs_complete = 0
while figs_complete < num_figs:
data = data_queue.get()
plt.figure()
plt.plot(data)
plt.show()
figs_complete += 1
def worker(args):
num, i = args
data = np.random.randn(num).cumsum()
data_queue.put(data)
print('done ',i)
main()
Hope this helps.

I had a similar issue with my worker, which was loading some data, generating a plot, and saving it to a file. Note that this is slightly different than what the OP's case, which seems to be oriented around interactive plotting. Still, I think it's relevant.
A simplified version of my code:
def worker(id):
data = load_data(id)
plot_data_to_file(data) # Generates a plot and saves it to a file.
def plot_something_parallel(ids):
pool = multiprocessing.Pool()
pool.map(worker, ids)
plot_something_parallel(ids=[1,2,3])
This caused the same error others mention:
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
Following #bbbruce's train of thought, I solved my problem by switching the matplotlib backend from TKAgg to the default. Specifically, I commented out the following line in my matplotlibrc file:
#backend : TkAgg

This might be rpy2-specific.
There are reports of a similar problem with OS X and multiprocessing here and there.
I think that using an initializer that imports the packages needed to run the code in plot could solve the problem (multiprocessing-doc).

I had a similar issue and found that setting the start method in multiprocessing to use forkserver works as long as it comes after your if name == main: statement.
if __name__ == '__main__':
multiprocessing.set_start_method('forkserver')
first_process = multiprocessing.Process(target = targetOne)
second_process = multiprocessing.Process(target = targetTwo)
first_process.start()
second_process.start()

Try to upgrade matplotlib to 3.0.3:
pip3 install matplotlib --upgrade
Then everything goes fine.
=======================================================================
No need to read below anymore.
Yesterday, my multiprocess plot works on my MacBook Air. But not working on my MacBook Pro tomorrow morning with the same code, displaying many:
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
They are all using 4th gen i intel CPU (i5-4xxx with air and i7-4xxx with pro). So if there are no difference on hardware, it must be on software.
So I just tried update matplot to 3.0.3 on MacBook Pro( was 3.0.1), every thing goes fine.
Also, no need to do pool.apply_async anymore.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

TensorFlow: minimalist program fails on distributed mode - python

So this has been answered in: https://github.com/tensorflow/tensorflow/issues/8796. Finally, one should use CLOUD for any distributed operation.

Related

What is the right way to check mp.managers.DictProxy' size?

Unable to set TIMEOUT for ldap in Python 2.7

Distributed tensorflow monopolizes GPUs after running server.init

TensorFlow nullptr check failed on GPU

Python rpy2 and matplotlib conflict when using multiprocessing

Categories

Resources

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

TensorFlow: minimalist program fails on distributed mode - python

So this has been answered in: https://github.com/tensorflow/tensorflow/issues/8796. Finally, one should use CLOUD for any distributed operation.

Related

What is the right way to check mp.managers.DictProxy' size?

Unable to set TIMEOUT for ldap in Python 2.7

Distributed tensorflow monopolizes GPUs after running server.__init__

TensorFlow nullptr check failed on GPU

Python rpy2 and matplotlib conflict when using multiprocessing

Categories

Resources

Distributed tensorflow monopolizes GPUs after running server.init