Run distributed TensorFlow, UnavailableError: Endpoint read failed - python

I am new to TensorFlow and do not have much experience; I am now trying out distributed TensorFlow.
Following the official guide, I first create two servers by running the following code in two separate terminals:
import sys
import tensorflow as tf
task_number = int(sys.argv[1])
cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster, job_name="local", task_index=task_number)
print("Starting server #{}".format(task_number))
server.start()
server.join()
The servers are set up successfully:
2018-01-25 20:05:37.651802: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job local -> {0 -> localhost:2222, 1 -> localhost:2223}
2018-01-25 20:05:37.652881: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
Starting server #0
2018-01-25 20:05:37.652938: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:328] Server already started (target: grpc://localhost:2222)
Then I run the following program:
import tensorflow as tf

x = tf.constant(2)

with tf.device("/job:local/task:1"):
    y2 = x - 66

with tf.device("/job:local/task:0"):
    y1 = x + 300
    y = y1 + y2

with tf.Session("grpc://localhost:2223") as sess:
    result = sess.run(y)
    print(result)
Then it gives me the following error message:
E0125 20:05:49.573488650 10292 ev_epoll1_linux.c:1051] grpc epoll fd: 5
Traceback (most recent call last):
File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
return fn(*args)
File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1293, in _run_fn
self._extend_graph()
File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1354, in _extend_graph
self._session, graph_def.SerializeToString(), status)
File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: Endpoint read failed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/****/Documents/intern/sample_data/try.py", line 25, in <module>
result = sess.run(y)
File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/home/****/miniconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: Endpoint read failed
I googled the error, and some answers suggest it might be a problem with the proxy, so I disabled the proxy, but nothing changed.
Does anyone have any idea what the problem might be? Many thanks in advance.

Never mind, problem solved. It was the proxy settings: the proxy has to be unset on both the servers and the client for the program to work.
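For anyone hitting the same issue, here is a minimal sketch of what unsetting the proxy can look like from Python, assuming the proxy is configured through the usual environment variables; run this (or the equivalent shell unset) before starting both the servers and the client:

import os

# Clear the common proxy environment variables so gRPC connects to
# localhost directly instead of routing through the proxy.
for var in ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY"):
    os.environ.pop(var, None)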

Related

Querying a MySQL docker container via Python throws a timeout error after a few hours

I am inserting into a MySQL database (brought up in a docker container) via a Debezium connector.
Querying works fine for some hours, but after that the same query starts throwing the exception below.
export JAVA_HOME=/tmp/tests/artifacts/java-17/jdk-17; export PATH=$PATH:/tmp/tests/artifacts/java-17/jdk-17/bin; docker exec -i mysql_be1e6a mysql --user=demo --password=demo -D demo -e "select count(k) from test_cdc_f0bf84 where uuid = 'd1e5cd6d-8f7a-457c-b2ea-880c2be52f69'"
2023-01-02 16:27:43,812:ERROR: failed to execute query MySQL rows count by uuid:
Traceback (most recent call last):
File "/home/ubuntu/workspace/stress_tests/run_test_with_universe/src/env/lib/python3.11/site-packages/paramiko/channel.py", line 699, in recv
out = self.in_buffer.read(nbytes, self.timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/workspace/stress_tests/run_test_with_universe/src/env/lib/python3.11/site-packages/paramiko/buffered_pipe.py", line 164, in read
raise PipeTimeout()
paramiko.buffered_pipe.PipeTimeout
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/workspace/stress_tests/run_test_with_universe/src/suites/cdc/abstract.py", line 667, in try_query
res = query_function()
^^^^^^^^^^^^^^^^
File "/home/ubuntu/workspace/stress_tests/run_test_with_universe/src/suites/cdc/test_cdc.py", line 635, in <lambda>
query = lambda: self.mysql_query(
^^^^^^^^^^^^^^^^^
File "/home/ubuntu/workspace/stress_tests/run_test_with_universe/src/suites/cdc/abstract.py", line 544, in mysql_query
result = self.ssh.exec_on_host(host, [
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/workspace/stress_tests/run_test_with_universe/src/main/connection.py", line 335, in exec_on_host
return self._exec_on_host(host, commands, fetch, timeout=timeout, limit_output=limit_output)[host]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/workspace/stress_tests/run_test_with_universe/src/main/connection.py", line 321, in _exec_on_host
res = list(out)
^^^^^^^^^
File "/home/ubuntu/workspace/stress_tests/run_test_with_universe/src/env/lib/python3.11/site-packages/paramiko/file.py", line 125, in __next__
line = self.readline()
^^^^^^^^^^^^^^^
File "/home/ubuntu/workspace/stress_tests/run_test_with_universe/src/env/lib/python3.11/site-packages/paramiko/file.py", line 291, in readline
new_data = self._read(n)
^^^^^^^^^^^^^
File "/home/ubuntu/workspace/stress_tests/run_test_with_universe/src/env/lib/python3.11/site-packages/paramiko/channel.py", line 1361, in _read
return self.channel.recv(size)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/workspace/stress_tests/run_test_with_universe/src/env/lib/python3.11/site-packages/paramiko/channel.py", line 701, in recv
raise socket.timeout()
TimeoutError
After some time I logged in to the machine manually and ran the same query; it still works fine there, so I am not sure what this issue means.
As explained, I queried the database via Python and expected it to return the row count, which it did until a certain point; after that it threw the timeout and socket errors.
The default value for interactive_timeout and wait_timeout is 28800 seconds (8 hours). You can disable this behavior by setting these system variables to zero in your MySQL config.
source: Configuring session timeouts
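As an alternative to editing the MySQL config, the timeouts can also be raised per session from the querying side. A minimal sketch, assuming mysql-connector-python and the connection parameters from the question:

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="demo",
                               password="demo", database="demo")
cur = conn.cursor()
# Raise the idle timeouts for this session (values in seconds) so the
# server does not drop the connection during long gaps between queries.
cur.execute("SET SESSION wait_timeout = 86400")
cur.execute("SET SESSION interactive_timeout = 86400")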

Runtime Error when running a simple cuML code in a Dask environment

I'm trying to test a simple piece of code using two remote workers. I don't know what is going on or what the error refers to.
The code is simple:
#!/usr/bin/python3
from cuml.dask.cluster import KMeans
from cuml.dask.datasets import make_blobs
from dask.distributed import Client
c = Client("dask-scheduler:8786")
centers = 5
X, _ = make_blobs(n_samples=10000, centers=centers)
k_means = KMeans(n_clusters=centers)
k_means.fit(X)
labels = k_means.predict(X)
It connects, but when it tries to execute the clustering algorithm, it throws the following error:
Traceback (most recent call last):
File "test_cuml.py", line 15, in <module>
k_means.fit(X)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/common/memory_utils.py", line 93, in cupy_rmm_wrapper
return func(*args, **kwargs)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/dask/cluster/kmeans.py", line 161, in fit
comms.init(workers=data.workers)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/raft/dask/common/comms.py", line 209, in init
wait=True,
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/client.py", line 2506, in run
return self.sync(self._run, function, *args, **kwargs)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/client.py", line 869, in sync
self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/utils.py", line 332, in sync
raise exc.with_traceback(tb)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/utils.py", line 315, in f
result[0] = yield future
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/client.py", line 2443, in _run
raise exc.with_traceback(tb)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/raft/dask/common/comms.py", line 429, in _func_init_all
_func_init_nccl(sessionId, uniqueId)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/raft/dask/common/comms.py", line 484, in _func_init_nccl
n.init(nWorkers, uniqueId, wid)
File "cuml/raft/dask/common/nccl.pyx", line 151, in cuml.raft.dask.common.nccl.nccl.init
The workers are reporting this issue:
distributed.worker - INFO - Run out-of-band function '_func_init_all'
distributed.worker - WARNING - Run Failed
Function: _func_init_all
args: (b'\x95d$\x89\x9beI\xf5\xa7\x8c7M\xe8V[v', b'\x02\x00\xc8\xdd\x8fj\x07\x90\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', False, {'tcp://dask-scheduler:40439': {'rank': 0}, 'tcp://dask-scheduler:39645': {'rank': 1}}, False, 0)
kwargs: {}
Traceback (most recent call last):
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/worker.py", line 4553, in run
result = await function(*args, **kwargs)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/raft/dask/common/comms.py", line 429, in _func_init_all
_func_init_nccl(sessionId, uniqueId)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/raft/dask/common/comms.py", line 484, in _func_init_nccl
n.init(nWorkers, uniqueId, wid)
File "cuml/raft/dask/common/nccl.pyx", line 151, in cuml.raft.dask.common.nccl.nccl.init
RuntimeError: NCCL_ERROR: b'invalid usage'
Does anyone know what is happening or how to mitigate this? The error is not clear to me. I have tried several versions of RAPIDS. IMPORTANT: I'm running in a Docker environment, sharing all GPUs (--gpus all) and the host network (--network host).

Failed to create multiple containers using the Docker Python SDK

I am using the Docker Python SDK to create multiple Chrome containers. Below is my script.
First I pull the Docker image, then I try to create two containers from it. It fails with an error message that the port is already in use, even though I increment the port value for each container.
import docker, sys

class CreateContainer:
    def __init__(self):
        self.client = CreateContainer.create_client()

    @staticmethod
    def create_client():
        client = docker.from_env()
        return client

    def pull_image(self, image_name):
        image = self.client.images.pull(image_name)
        print(image.tags)

    def create_containers(self, image, container_name, expose_port, container_count=1):
        container = self.client.containers.run(
            image,
            name=container_name,
            hostname=container_name,
            ports=expose_port,
            detach=True
        )
        for line in container.logs():
            print(line)
        return container

if __name__ == '__main__':
    threads = int(sys.argv[1])
    c_obj = CreateContainer()
    for i in range(1, threads+1):
        c_obj.create_containers("selenium/standalone-chrome", "Chrome_{0}".format(i), expose_port={5550+i: 4444})
------Run----------
python test.py 2
------error-----
Traceback (most recent call last):
File "C:\Program Files\Python39\lib\site-packages\docker\api\client.py", line 268, in _raise_for_status
response.raise_for_status()
File "C:\Program Files\Python39\lib\site-packages\requests\models.py", line 943, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://localnpipe/v1.40/containers/260894dbaec6946e5f31fdbfb5307182d2f621c12a38f328f6efac58df58854d/start
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Desktop\FreeLance\Utility\src\docker_engine\create_container.py", line 53, in <module>
c_obj.create_containers("selenium/standalone-chrome", "Chrome_{0}".format(i), expose_port={5550+i:4444})
File "C:\Users\Desktop\FreeLance\Utility\src\docker_engine\create_container.py", line 20, in create_containers
container = self.client.containers.run(
File "C:\Program Files\Python39\lib\site-packages\docker\models\containers.py", line 818, in run
container.start()
File "C:\Program Files\Python39\lib\site-packages\docker\models\containers.py", line 404, in start
return self.client.api.start(self.id, **kwargs)
File "C:\Program Files\Python39\lib\site-packages\docker\utils\decorators.py", line 19, in wrapped
return f(self, resource_id, *args, **kwargs)
File "C:\Program Files\Python39\lib\site-packages\docker\api\container.py", line 1111, in start
self._raise_for_status(res)
File "C:\Program Files\Python39\lib\site-packages\docker\api\client.py", line 270, in _raise_for_status
raise create_api_error_from_http_exception(e)
File "C:\Program Files\Python39\lib\site-packages\docker\errors.py", line 31, in create_api_error_from_http_exception
raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 500 Server Error for http+docker://localnpipe/v1.40/containers/260894dbaec6946e5f31fdbfb5307182d2f621c12a38f328f6efac58df58854d/start: Internal Server Error ("driver failed programming external connectivity on endpoint Chrome_2 (c0c77528743a1e3153201565b2cb520243b66adbb903bb69e91a00c4399aca62): Bind for 0.0.0.0:4444 failed: port is already allocated")
While providing ports, the syntax is {container_port: host_port}, so the call should be:
c_obj.create_containers("selenium/standalone-chrome", "Chrome_{0}".format(i), expose_port={4444: 5550+i})
Documentation: https://docker-py.readthedocs.io/en/stable/containers.html
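As a self-contained sketch of the corrected call (image and names as in the question), each container publishes container port 4444 on a distinct host port, so the bindings no longer collide:

import docker

client = docker.from_env()
for i in range(1, 3):
    # ports={container_port: host_port} -> 4444 maps to host 5551, 5552, ...
    client.containers.run(
        "selenium/standalone-chrome",
        name="Chrome_{0}".format(i),
        ports={4444: 5550 + i},
        detach=True,
    )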

How to handle "Redis.exceptions.ConnectionError: Connection has data"

I receive the following output:
Traceback (most recent call last):
File "/home/ec2-user/env/lib64/python3.7/site-packages/redis/connection.py", line 1192, in get_connection
raise ConnectionError('Connection has data')
redis.exceptions.ConnectionError: Connection has data
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ec2-user/env/lib64/python3.7/site-packages/eventlet/hubs/hub.py", line 457, in fire_timers
timer()
File "/home/ec2-user/env/lib64/python3.7/site-packages/eventlet/hubs/timer.py", line 58, in __call__
cb(*args, **kw)
File "/home/ec2-user/env/lib64/python3.7/site-packages/eventlet/greenthread.py", line 214, in main
result = function(*args, **kwargs)
File "crawler.py", line 53, in fetch_listing
url = dequeue_url()
File "/home/ec2-user/WebCrawler/helpers.py", line 109, in dequeue_url
return redis.spop("listing_url_queue")
File "/home/ec2-user/env/lib64/python3.7/site-packages/redis/client.py", line 2255, in spop
return self.execute_command('SPOP', name, *args)
File "/home/ec2-user/env/lib64/python3.7/site-packages/redis/client.py", line 875, in execute_command
conn = self.connection or pool.get_connection(command_name, **options)
File "/home/ec2-user/env/lib64/python3.7/site-packages/redis/connection.py", line 1197, in get_connection
raise ConnectionError('Connection not ready')
redis.exceptions.ConnectionError: Connection not ready
I couldn't find any issue related to this particular error. I emptied/flushed all Redis databases, so there should be no data there. I assume it has something to do with eventlet and monkey patching, but even when I put the following code right at the beginning of the file, the error still appears:
import eventlet
eventlet.monkey_patch()
What does this error mean?
Finally, I came up with the answer to my problem.
When connecting to Redis from Python, I had specified database number 0:
redis = redis.Redis(host="example.com", port=6379, db=0)
After changing the database to number 1, it worked:
redis = redis.Redis(host="example.com", port=6379, db=1)
Another way is to set protected-mode to no in /etc/redis/redis.conf; this is only recommended when running Redis locally.
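As a quick sanity check, here is a sketch of verifying the connection against the chosen database (the hostname is a placeholder):

import redis

r = redis.Redis(host="example.com", port=6379, db=1)
print(r.ping())  # True once the connection is usable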

Celery kombu fails after self.connections.acquire

After my Celery service has been running for 7-10 days, I receive this exception out of nowhere, and it causes my tasks not to be processed. A restart of Celery fixes the problem.
INTERNAL ERROR: RuntimeError('Acquire on closed pool',)
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 253, in trace_task
I, R, state, retval = on_error(task_request, exc, uuid)
File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 201, in on_error
R = I.handle_error_state(task, eager=eager)
File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 85, in handle_error_state
}[self.state](task, store_errors=store_errors)
File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 118, in handle_failure
req.id, exc, einfo.traceback, request=req,
File "/usr/lib/python2.7/dist-packages/celery/backends/base.py", line 121, in mark_as_failure
traceback=traceback, request=request)
File "/usr/lib/python2.7/dist-packages/celery/backends/amqp.py", line 124, in store_result
with self.app.amqp.producer_pool.acquire(block=True) as producer:
File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 868, in acquire
R = self.prepare(R)
File "/usr/lib/python2.7/dist-packages/kombu/pools.py", line 63, in prepare
conn = self._acquire_connection()
File "/usr/lib/python2.7/dist-packages/kombu/pools.py", line 38, in _acquire_connection
return self.connections.acquire(block=True)
File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 859, in acquire
raise RuntimeError('Acquire on closed pool')
RuntimeError: Acquire on closed pool
Software versions
software -> celery:3.1.20 (Cipater) kombu:3.0.35 py:2.7.6
billiard:3.3.0.22 py-amqp:1.4.9
platform -> system:Linux arch:64bit, ELF imp:CPython
loader -> celery.loaders.default.Loader
settings -> transport:amqp results:amqp
CELERY_ACCEPT_CONTENT: ['json', 'pickle', 'yaml']
CELERY_ENABLE_UTC: True
CELERY_IGNORE_RESULT: False
CELERY_IMPORTS:
('catalogue.app.voice.cluster.deploy_cluster',
'catalogue.app.common.install_uc',
'hypervisor.app.deploy_esx',
'hypervisor.app.vm_operations',
'tools.deploy_tools')
CELERYD_CHDIR: '/usr/local/src/imbue/application/app'
CELERY_TASK_RESULT_EXPIRES: 18000
CELERY_RESULT_PERSISTENT: True
CELERY_TIMEZONE: 'US/Eastern'
BROKER_URL: 'amqp://******:********@rabbitmq:5672//'
CELERY_RESULT_BACKEND: 'amqp'
The only workaround for now is to restart.
Environment: Ubuntu 14.04, 2 GB RAM, 2 CPUs, 40 GB HDD.
This looks like a bug in Celery; asksol fixed it a few days back.
You can install Celery from source and try it. If it still causes problems, please create a new issue on GitHub.
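For example, one way to try the development version is to install it straight from the repository (a sketch; pin a known-good commit for production use):

pip install git+https://github.com/celery/celery.git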
