Unable to flush task queue within 1200 seconds - python

When running my AutoML pipeline, I am consistently getting an error during the MetricsAndSaveModel activity, which causes my model training run to fail:
2019-12-06 22:48:01,233 - INFO - 295 : ActivityCompleted: Activity=MetricsAndSaveModel, HowEnded=Failure, Duration=1200977.92[ms]
2019-12-06 22:48:01,235 - CRITICAL - 295 : Type: Unclassified
Class: AzureMLException
Message: AzureMLException:
Message: Failed to flush task queue within 1200 seconds
InnerException None
ErrorResponse
{
"error": {
"message": "Failed to flush task queue within 1200 seconds"
}
}
Traceback:
File "fit_pipeline.py", line 222, in fit_pipeline
automl_run_context.batch_save_artifacts(strs_to_save, models_to_upload)
File "automl_run_context.py", line 201, in batch_save_artifacts
timeout_seconds=ARTIFACT_UPLOAD_TIMEOUT_SECONDS)
File "run.py", line 49, in wrapped
return func(self, *args, **kwargs)
File "run.py", line 1824, in upload_files
timeout_seconds=timeout_seconds)
File "artifacts_client.py", line 167, in upload_files
results.append(task)
File "task_queue.py", line 53, in __exit__
self.flush(self.identity)
File "task_queue.py", line 126, in flush
raise AzureMLException("Failed to flush task queue within {} seconds".format(timeout_seconds))

The current timeout limit is set to 20 minutes in the AutoML service, and our product team is working to make it a configurable setting in a future release. For now, to increase the limit you can modify the script automl_run_context.py, set ARTIFACT_UPLOAD_TIMEOUT_SECONDS to a higher value, and rerun the pipeline.
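If you would rather not edit the installed file, a rough alternative is to override the constant from your driver script before kicking off the run. This is only a sketch: the import path below is an assumption (it differs between SDK versions), and it only works if the constant is defined as a module-level name in automl_run_context.py rather than imported from somewhere else.

# Hedged sketch: the module path is an assumption; locate automl_run_context.py in your
# installed azureml SDK and adjust the import to match.
from azureml.train.automl.runtime import automl_run_context  # hypothetical path

# Raise the artifact-upload timeout from the default 1200 s (20 min) to, e.g., one hour.
automl_run_context.ARTIFACT_UPLOAD_TIMEOUT_SECONDS = 3600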

Related

Why is the connection between a Python Motor client and a MongoDB being dropped for about 50% of queries?

I'm seeing odd behavior in my application.
I have a GraphQL Python application that lives in a Docker container and uses FastAPI and Uvicorn. It connects to a MongoDB instance in another Docker container using Motor. All of this runs in Kubernetes.
About 50% of the time, with no real pattern, I don't get results back. The other 50% of the time it works fine.
Based on the MongoDB logs, it looks like the Python application is closing the connection before the response can be sent back. This is not a long-running query (< 10 ms). I've also tried setting the various timeouts in Motor/PyMongo very high (~3600 s), but I still get the same behavior.
Client Initialization:
try:
    # kwargs contains username, password, connectTimeoutMS, socketTimeoutMS, serverSelectionTimeoutMS
    return AsyncIOMotorClient(host, port, appname=appname, connect=True, **kwargs)
except Exception:
    # log it
    raise
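(For reference, spelled out, those kwargs look roughly like the sketch below; the values are illustrative. Note that Motor/PyMongo timeouts are given in milliseconds, so ~3600 s is 3,600,000 ms.)

client = AsyncIOMotorClient(
    host,
    port,
    appname=appname,
    connect=True,
    username=username,
    password=password,
    connectTimeoutMS=3_600_000,
    socketTimeoutMS=3_600_000,
    serverSelectionTimeoutMS=3_600_000,
)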
The Query Code:
async def callQuery(vendor, name):
    match_query = {
        "$match": {
            "_id.product_vendor": vendor,
            "_id.product_name": name,
        }
    }
    add_fields = {
        "$addFields": {
            "product_name": "$_id.product_name",
            "product_vendor": "$_id.product_vendor",
        }
    }
    project = {"$project": {"_id": 0}}
    cursor = collection.aggregate([match_query, add_fields, project])
    products = await cursor.to_list(length=None)  # line 173 in product_lookup, to compare with the backtrace below
    return products

def submit_aggregation(pipeline: List[dict], collection, **kwargs):
    return collection.aggregate(pipeline, **kwargs)
MongoDB logs:
{"t":{"$date":"2021-11-23T20:59:25.690+00:00"},"s":"D3", "c":"NETWORK", "id":22934, "ctx":"conn5","msg":"Starting server-side compression negotiation"}
{"t":{"$date":"2021-11-23T20:59:25.690+00:00"},"s":"D3", "c":"NETWORK", "id":22935, "ctx":"conn5","msg":"Compression negotiation not requested by client"}
{"t":{"$date":"2021-11-23T20:59:25.690+00:00"},"s":"D3", "c":"FTDC", "id":23905, "ctx":"conn5","msg":"Using exhaust for isMaster or hello protocol"}
{"t":{"$date":"2021-11-23T20:59:25.690+00:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn5","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","appName":"mappingapi","command":{"hello":1,"topologyVersion":{"processId":{"$oid":"619d4e56e9dc669ff6ade24b"},"counter":0},"maxAwaitTimeMS":10000,"$db":"admin","$readPreference":{"mode":"primary"}},"numYields":0,"reslen":313,"locks":{},"remote":"127.0.0.1:41430","protocol":"op_msg","durationMillis":0}}
{"t":{"$date":"2021-11-23T20:59:25.690+00:00"},"s":"D2", "c":"QUERY", "id":22783, "ctx":"conn5","msg":"Received interrupt request for unknown op","attr":{"opId":25787,"knownOps":[]}}
{"t":{"$date":"2021-11-23T20:59:25.690+00:00"},"s":"D3", "c":"-", "id":5127803, "ctx":"conn5","msg":"Released the Client","attr":{"client":"conn5"}}
{"t":{"$date":"2021-11-23T20:59:25.690+00:00"},"s":"D3", "c":"-", "id":5127801, "ctx":"conn5","msg":"Setting the Client","attr":{"client":"conn5"}}
{"t":{"$date":"2021-11-23T20:59:25.690+00:00"},"s":"D2", "c":"COMMAND", "id":21965, "ctx":"conn5","msg":"About to run the command","attr":{"db":"admin","commandArgs":{"hello":1,"topologyVersion":{"processId":{"$oid":"619d4e56e9dc669ff6ade24b"},"counter":0},"maxAwaitTimeMS":10000,"$db":"admin","$readPreference":{"mode":"primary"}}}}
{"t":{"$date":"2021-11-23T20:59:25.690+00:00"},"s":"D3", "c":"FTDC", "id":23904, "ctx":"conn5","msg":"Using maxAwaitTimeMS for awaitable hello protocol","attr":{"maxAwaitTimeMS":10000}}
Backtrace of the disconnect from the python client:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/promise/promise.py", line 844, in handle_future_result
resolve(future.result())
File "./mappingdb/api.py", line 247, in resolve_productdbProducts
client=get_pdb_client(),
File "/usr/local/lib/python3.7/site-packages/productdata/_utils.py", line 42, in func_future
return await func(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/productdata/query.py", line 173, in product_lookup
products = await cursor.to_list(length=None)
File "/usr/local/lib/python3.7/site-packages/motor/core.py", line 1372, in _to_list
result = get_more_result.result()
File "/usr/local/lib/python3.7/site-packages/motor/core.py", line 1611, in _on_started
pymongo_cursor = future.result()
File "/usr/local/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.7/site-packages/pymongo/collection.py", line 2507, in aggregate
**kwargs)
File "/usr/local/lib/python3.7/site-packages/pymongo/collection.py", line 2421, in _aggregate
retryable=not cmd._performs_write)
File "/usr/local/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1525, in _retryable_read
return func(session, server, sock_info, secondary_ok)
File "/usr/local/lib/python3.7/site-packages/pymongo/aggregation.py", line 149, in get_cursor
user_fields=self._user_fields)
File "/usr/local/lib/python3.7/site-packages/pymongo/pool.py", line 726, in command
self._raise_connection_failure(error)
File "/usr/local/lib/python3.7/site-packages/pymongo/pool.py", line 929, in _raise_connection_failure
_raise_connection_failure(self.address, error)
File "/usr/local/lib/python3.7/site-packages/pymongo/pool.py", line 238, in _raise_connection_failure
raise NetworkTimeout(msg)
graphql.error.located_error.GraphQLLocatedError: productdb:27017: timed out

Azure ML: Upload File to Step Run's Output - Authentication Error

During a PythonScriptStep in an Azure ML pipeline, I'm saving a model as a joblib pickle dump to a directory in a blob container in Azure Blob Storage, which I created during the setup of the Azure ML workspace. Afterwards I'm trying to upload this model file to the step run's output directory using
Run.upload_file(name, path_or_stream)
(for the function's documentation, see https://learn.microsoft.com/en-us/python/api/azureml-core/azureml.core.run(class)?view=azure-ml-py#upload-file-name--path-or-stream--datastore-name-none-)
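The call itself looks roughly like this (a sketch; the paths are illustrative):

from azureml.core import Run

run = Run.get_context()
model_path = "models/Model.pkl"  # local path the model was dumped to earlier in the step (illustrative)
# Upload the pickled model into the step run's outputs.
run.upload_file(name="outputs/models/Model.pkl", path_or_stream=model_path)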
Some time ago, when I created the script with azureml-sdk version 1.18.0, everything worked fine. I have since updated the script's functionality and, in the process, upgraded azureml-sdk to version 1.33.0, and the upload now fails with the following error:
Traceback (most recent call last):
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_file_utils/upload.py", line 64, in upload_blob_from_stream
validate_content=True)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_restclient/clientbase.py", line 93, in execute_func_with_reset
return ClientBase._execute_func_internal(backoff, retries, module_logger, func, reset_func, *args, **kwargs)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_restclient/clientbase.py", line 367, in _execute_func_internal
left_retry = cls._handle_retry(back_off, left_retry, total_retry, error, logger, func)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_restclient/clientbase.py", line 399, in _handle_retry
raise error
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_restclient/clientbase.py", line 358, in _execute_func_internal
response = func(*args, **kwargs)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_vendor/azure_storage/blob/blockblobservice.py", line 614, in create_blob_from_stream
initialization_vector=iv
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_vendor/azure_storage/blob/_upload_chunking.py", line 98, in _upload_blob_chunks
range_ids = [f.result() for f in futures]
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_vendor/azure_storage/blob/_upload_chunking.py", line 98, in <listcomp>
range_ids = [f.result() for f in futures]
File "/opt/miniconda/lib/python3.7/concurrent/futures/_base.py", line 435, in result
return self.__get_result()
File "/opt/miniconda/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/opt/miniconda/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_vendor/azure_storage/blob/_upload_chunking.py", line 210, in process_chunk
return self._upload_chunk_with_progress(chunk_offset, chunk_bytes)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_vendor/azure_storage/blob/_upload_chunking.py", line 224, in _upload_chunk_with_progress
range_id = self._upload_chunk(chunk_offset, chunk_data)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_vendor/azure_storage/blob/_upload_chunking.py", line 269, in _upload_chunk
timeout=self.timeout,
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_vendor/azure_storage/blob/blockblobservice.py", line 1013, in _put_block
self._perform_request(request)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_vendor/azure_storage/common/storageclient.py", line 432, in _perform_request
raise ex
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_vendor/azure_storage/common/storageclient.py", line 357, in _perform_request
raise ex
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_vendor/azure_storage/common/storageclient.py", line 343, in _perform_request
HTTPError(response.status, response.message, response.headers, response.body))
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_vendor/azure_storage/common/_error.py", line 115, in _http_error_handler
raise ex
azure.common.AzureHttpError: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature. ErrorCode: AuthenticationFailed
<?xml version="1.0" encoding="utf-8"?><Error><Code>AuthenticationFailed</Code><Message>Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
RequestId:5d4e1b7e-c01e-0070-0d47-9bf8a0000000
Time:2021-08-27T13:30:02.2685991Z</Message><AuthenticationErrorDetail>Signature did not match. String to sign used was rcw
2021-08-27T13:19:56Z
2021-08-28T13:29:56Z
/blob/mystorage/azureml/ExperimentRun/dcid.98d11a7b-2aac-4bc0-bd64-bb4d72e0e0be/outputs/models/Model.pkl
2019-07-07
b
</AuthenticationErrorDetail></Error>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/mnt/batch/tasks/shared/LS_root/jobs/.../azureml-setup/context_manager_injector.py", line 243, in execute_with_context
runpy.run_path(sys.argv[0], globals(), run_name="__main__")
File "/opt/miniconda/lib/python3.7/runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "/opt/miniconda/lib/python3.7/runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "/opt/miniconda/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "401_AML_Pipeline_Time_Series_Model_Training_Azure_ML_CPU.py", line 318, in <module>
main()
File "401_AML_Pipeline_Time_Series_Model_Training_Azure_ML_CPU.py", line 286, in main
path_or_stream=model_path)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/core/run.py", line 53, in wrapped
return func(self, *args, **kwargs)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/core/run.py", line 1989, in upload_file
datastore_name=datastore_name)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_restclient/artifacts_client.py", line 114, in upload_artifact
return self.upload_artifact_from_path(artifact, *args, **kwargs)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_restclient/artifacts_client.py", line 107, in upload_artifact_from_path
return self.upload_artifact_from_stream(stream, *args, **kwargs)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_restclient/artifacts_client.py", line 99, in upload_artifact_from_stream
content_type=content_type, session=session)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_restclient/artifacts_client.py", line 88, in upload_stream_to_existing_artifact
timeout=TIMEOUT, backoff=BACKOFF_START, retries=RETRY_LIMIT)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_file_utils/upload.py", line 71, in upload_blob_from_stream
raise AzureMLException._with_error(azureml_error, inner_exception=e)
azureml._common.exceptions.AzureMLException: AzureMLException:
Message: Encountered authorization error while uploading to blob storage. Please check the storage account attached to your workspace. Make sure that the current user is authorized to access the storage account and that the request is not blocked by a firewall, virtual network, or other security setting.
StorageAccount: mystorage
ContainerName: azureml
StatusCode: 403
InnerException Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature. ErrorCode: AuthenticationFailed
<?xml version="1.0" encoding="utf-8"?><Error><Code>AuthenticationFailed</Code><Message>Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
RequestId:5d4e1b7e-c01e-0070-0d47-9bf8a0000000
Time:2021-08-27T13:30:02.2685991Z</Message><AuthenticationErrorDetail>Signature did not match. String to sign used was rcw
2021-08-27T13:19:56Z
2021-08-28T13:29:56Z
/blob/mystorage/azureml/ExperimentRun/dcid.98d11a7b-2aac-4bc0-bd64-bb4d72e0e0be/outputs/models/Model.pkl
2019-07-07
b
</AuthenticationErrorDetail></Error>
ErrorResponse
{
"error": {
"code": "UserError",
"message": "Encountered authorization error while uploading to blob storage. Please check the storage account attached to your workspace. Make sure that the current user is authorized to access the storage account and that the request is not blocked by a firewall, virtual network, or other security setting.\n\tStorageAccount: mystorage\n\tContainerName: azureml\n\tStatusCode: 403",
"inner_error": {
"code": "Auth",
"inner_error": {
"code": "Authorization"
}
}
}
}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "401_AML_Pipeline_Time_Series_Model_Training_Azure_ML_CPU.py", line 318, in <module>
main()
File "401_AML_Pipeline_Time_Series_Model_Training_Azure_ML_CPU.py", line 286, in main
path_or_stream=model_path)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/core/run.py", line 53, in wrapped
return func(self, *args, **kwargs)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/core/run.py", line 1989, in upload_file
datastore_name=datastore_name)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_restclient/artifacts_client.py", line 114, in upload_artifact
return self.upload_artifact_from_path(artifact, *args, **kwargs)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_restclient/artifacts_client.py", line 107, in upload_artifact_from_path
return self.upload_artifact_from_stream(stream, *args, **kwargs)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_restclient/artifacts_client.py", line 99, in upload_artifact_from_stream
content_type=content_type, session=session)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_restclient/artifacts_client.py", line 88, in upload_stream_to_existing_artifact
timeout=TIMEOUT, backoff=BACKOFF_START, retries=RETRY_LIMIT)
File "/opt/miniconda/lib/python3.7/site-packages/azureml/_file_utils/upload.py", line 71, in upload_blob_from_stream
raise AzureMLException._with_error(azureml_error, inner_exception=e)
UserScriptException: UserScriptException:
Message: Encountered authorization error while uploading to blob storage. Please check the storage account attached to your workspace. Make sure that the current user is authorized to access the storage account and that the request is not blocked by a firewall, virtual network, or other security setting.
StorageAccount: mystorage
ContainerName: azureml
StatusCode: 403
InnerException AzureMLException:
Message: Encountered authorization error while uploading to blob storage. Please check the storage account attached to your workspace. Make sure that the current user is authorized to access the storage account and that the request is not blocked by a firewall, virtual network, or other security setting.
StorageAccount: mystorage
ContainerName: azureml
StatusCode: 403
InnerException Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature. ErrorCode: AuthenticationFailed
<?xml version="1.0" encoding="utf-8"?><Error><Code>AuthenticationFailed</Code><Message>Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
RequestId:5d4e1b7e-c01e-0070-0d47-9bf8a0000000
Time:2021-08-27T13:30:02.2685991Z</Message><AuthenticationErrorDetail>Signature did not match. String to sign used was rcw
2021-08-27T13:19:56Z
2021-08-28T13:29:56Z
/blob/mystorage/azureml/ExperimentRun/dcid.98d11a7b-2aac-4bc0-bd64-bb4d72e0e0be/outputs/models/Model.pkl
2019-07-07
b
</AuthenticationErrorDetail></Error>
ErrorResponse
{
"error": {
"code": "UserError",
"message": "Encountered authorization error while uploading to blob storage. Please check the storage account attached to your workspace. Make sure that the current user is authorized to access the storage account and that the request is not blocked by a firewall, virtual network, or other security setting.\n\tStorageAccount: verovisionstorage\n\tContainerName: azureml\n\tStatusCode: 403",
"inner_error": {
"code": "Auth",
"inner_error": {
"code": "Authorization"
}
}
}
}
ErrorResponse
{
"error": {
"code": "UserError",
"message": "Encountered authorization error while uploading to blob storage. Please check the storage account attached to your workspace. Make sure that the current user is authorized to access the storage account and that the request is not blocked by a firewall, virtual network, or other security setting.\n\tStorageAccount: mystorage\n\tContainerName: azureml\n\tStatusCode: 403"
}
}
As far as I can tell from the code of the azureml.core.Run class and the subsequent function calls, the Run object tries to upload the file to the step run's output directory using SAS token authentication (which fails). This documentation article is linked in the code (though I don't know whether it relates to the issue): https://learn.microsoft.com/en-us/rest/api/storageservices/create-service-sas#service-sas-example
Has anybody encountered this error as well and knows what causes it or how it can be resolved?
Best,
Jonas
We've seen this before; it's annoying. I think the answer is to go to the datastores page of the AML Studio UI and manually enter the storage account key again.
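If you prefer to do the same thing from the SDK rather than the Studio UI, re-registering the workspace blob datastore with a fresh copy of the account key looks roughly like the sketch below (the datastore, container, and account names are placeholders for your own):

from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

# Re-register the workspace blob datastore with the storage account key copied from the portal.
Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="workspaceblobstore",   # placeholder
    container_name="azureml",              # placeholder
    account_name="mystorage",              # placeholder
    account_key="<storage-account-key>",
    overwrite=True,
)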

How to add retry for celery backend connection?

I am using Celery 5.0.1 with CELERY_BACKEND_URL set to redis://:password@redisinstance1:6379/0. It works fine, but when the Redis instance briefly loses the connection, the task fails with an error.
Exception: Error while reading from socket: (104, 'Connection reset by peer')
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/redis/connection.py", line 198, in _read_from_socket
data = recv(self._sock, socket_read_size)
File "/usr/local/lib/python3.7/dist-packages/redis/_compat.py", line 72, in recv
return sock.recv(*args, **kwargs)
ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/celery/app/trace.py", line 477, in trace_task
uuid, retval, task_request, publish_result,
File "/usr/local/lib/python3.7/dist-packages/celery/backends/base.py", line 154, in mark_as_done
self.store_result(task_id, result, state, request=request)
File "/usr/local/lib/python3.7/dist-packages/celery/backends/base.py", line 439, in store_result
request=request, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/celery/backends/base.py", line 855, in _store_result
current_meta = self._get_task_meta_for(task_id)
File "/usr/local/lib/python3.7/dist-packages/celery/backends/base.py", line 873, in _get_task_meta_for
meta = self.get(self.get_key_for_task(task_id))
File "/usr/local/lib/python3.7/dist-packages/celery/backends/redis.py", line 346, in get
return self.client.get(key)
File "/usr/local/lib/python3.7/dist-packages/redis/client.py", line 1606, in get
return self.execute_command('GET', name)
File "/usr/local/lib/python3.7/dist-packages/redis/client.py", line 901, in execute_command
return self.parse_response(conn, command_name, **options)
File "/usr/local/lib/python3.7/dist-packages/redis/client.py", line 915, in parse_response
response = connection.read_response()
File "/usr/local/lib/python3.7/dist-packages/redis/connection.py", line 739, in read_response
response = self._parser.read_response()
File "/usr/local/lib/python3.7/dist-packages/redis/connection.py", line 324, in read_response
raw = self._buffer.readline()
File "/usr/local/lib/python3.7/dist-packages/redis/connection.py", line 256, in readline
self._read_from_socket()
File "/usr/local/lib/python3.7/dist-packages/redis/connection.py", line 223, in _read_from_socket
(ex.args,))
redis.exceptions.ConnectionError: Error while reading from socket: (104, 'Connection reset by peer')
Celery worker: None
Celery task id: 244b56af-7c96-56cf-a01a-9256cfd98ade
Celery retry attempt: 0
Task args: []
Task kwargs: {'address': 'ipadd', 'uid': 'uid', 'hexID': 'hexID', 'taskID': '244b56af-7c96-56cf-a01a-9256cfd98ade'}
When I run the task a second time, it works fine; the connection glitch only lasts a short period.
Is there something I can set so that, when Celery tries to write the result to Redis and gets an error, it retries after 2-5 seconds?
I know how to set retries on the task itself, but this is not a task failure: my task runs fine and returns its data, and Celery loses the connection while writing the result to the backend.
To deal with connection timeouts you can have the following in your Celery configuration:
app.conf.broker_transport_options = {
    'retry_policy': {
        'timeout': 5.0
    }
}
app.conf.result_backend_transport_options = {
    'retry_policy': {
        'timeout': 5.0
    }
}
There are a few other Redis backend settings that you may want to consider adding to your configuration, such as redis_retry_on_timeout, for example.
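As a hedged sketch of what that might look like (setting names are from the Celery 5.x configuration docs; the values are illustrative, not recommendations):

# Additional result-backend settings worth considering; values are examples only.
app.conf.redis_retry_on_timeout = True       # retry reads that hit a socket timeout
app.conf.redis_socket_keepalive = True       # keep idle backend connections alive
app.conf.redis_socket_timeout = 120.0        # seconds before a socket operation times out
app.conf.result_backend_always_retry = True  # retry recoverable errors when storing results
app.conf.result_backend_max_retries = 10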

dask: How do I avoid timeout for a task?

In my dask-based application (using the distributed scheduler), I'm seeing failures that start with this error text:
tornado.application - ERROR - Exception in Future <Future cancelled> after timeout
Traceback (most recent call last):
File "/miniconda/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 970, in error_callback
future.result()
concurrent.futures._base.CancelledError
They are followed by a second traceback which (I think) indicates which line my task was running when the timeout occurred. (Exactly how distributed manages to do this is not clear to me -- maybe via a signal?)
Here's the dask portion of the second traceback:
... my code...
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/base.py", line 156, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/base.py", line 397, in compute
results = schedule(dsk, keys, **kwargs)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 2308, in get
direct=direct)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 1647, in gather
asynchronous=asynchronous)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 665, in sync
return sync(self.loop, func, *args, **kwargs)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/utils.py", line 277, in sync
six.reraise(*error[0])
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/utils.py", line 262, in f
result[0] = yield future
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 1492, in _gather
traceback)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/six.py", line 692, in reraise
raise value.with_traceback(tb)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/bag/core.py", line 1562, in reify
seq = list(seq)
File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/bag/core.py", line 1722, in map_chunk
yield f(*a)
... my code ....
Does after timeout indicate that the task has taken too long, or is there some other timeout that is triggering the cancellation, such as a nanny or heartbeat timeout? (From what I can tell, there is no explicit timeout on the length of a task in dask, but maybe I'm confused.)
I see that the task was cancelled. But I would like to know why. Is there any easy way to figure out which line of code (in dask or distributed) is cancelling my task, and why?
I expect these tasks to take a long time -- they are uploading large buffers to a cloud store. How can I increase the timeout of a particular task in dask?
Dask does not impose a timeout on tasks by default.
The cancelled future that you're seeing isn't a Dask future, it's a Tornado future (Tornado is the library that Dask uses for network communication). So unfortunately all this is saying is that something failed.
The subsequent traceback hopefully includes information about exactly which code failed. Ideally it points to a line in your functions where the failure occurred. Perhaps that helps?
In general we recommend these steps when debugging code run through Dask: http://docs.dask.org/en/latest/debugging.html
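If you suspect the cancellation is driven by a network-level timeout rather than anything task-specific, one thing to try is raising distributed's communication timeouts. This is only a sketch (config keys as documented by distributed; the values are illustrative), not a confirmed fix for the error above:

import dask

# Raise the connect/TCP timeouts before creating the Client; values are examples.
dask.config.set({
    "distributed.comm.timeouts.connect": "60s",
    "distributed.comm.timeouts.tcp": "120s",
})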

Celery upgrade (3.1->4.1) - Connection reset by peer

We have been working with Celery for the last year, with ~15 workers, each defined with a concurrency between 1 and 4.
Recently we upgraded our Celery from v3.1 to v4.1.
Now we are seeing the following error in each of the workers' logs. Any idea what could cause such an error?
2017-08-21 18:33:19,780 94794 ERROR Control command error: error(104, 'Connection reset by peer') [file: pidbox.py, line: 46]
Traceback (most recent call last):
File "/srv/dy/venv/lib/python2.7/site-packages/celery/worker/pidbox.py", line 42, in on_message
self.node.handle_message(body, message)
File "/srv/dy/venv/lib/python2.7/site-packages/kombu/pidbox.py", line 129, in handle_message
return self.dispatch(**body)
File "/srv/dy/venv/lib/python2.7/site-packages/kombu/pidbox.py", line 112, in dispatch
ticket=ticket)
File "/srv/dy/venv/lib/python2.7/site-packages/kombu/pidbox.py", line 135, in reply
serializer=self.mailbox.serializer)
File "/srv/dy/venv/lib/python2.7/site-packages/kombu/pidbox.py", line 265, in _publish_reply
**opts
File "/srv/dy/venv/lib/python2.7/site-packages/kombu/messaging.py", line 181, in publish
exchange_name, declare,
File "/srv/dy/venv/lib/python2.7/site-packages/kombu/messaging.py", line 203, in _publish
mandatory=mandatory, immediate=immediate,
File "/srv/dy/venv/lib/python2.7/site-packages/amqp/channel.py", line 1748, in _basic_publish
(0, exchange, routing_key, mandatory, immediate), msg
File "/srv/dy/venv/lib/python2.7/site-packages/amqp/abstract_channel.py", line 64, in send_method
conn.frame_writer(1, self.channel_id, sig, args, content)
File "/srv/dy/venv/lib/python2.7/site-packages/amqp/method_framing.py", line 178, in write_frame
write(view[:offset])
File "/srv/dy/venv/lib/python2.7/site-packages/amqp/transport.py", line 272, in write
self._write(s)
File "/usr/lib64/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
error: [Errno 104] Connection reset by peer
BTW, our tasks have the form:
@app.task(name='EXAMPLE_TASK',
          bind=True,
          base=ConnectionHolderTask)
def example_task(self, arg1, arg2, **kwargs):
    # task code
We are also having massive issues with Celery... I spend 20% of my time just dancing around weird idle-hang/crash issues with our workers, sigh.
We had a similar case that was caused by high concurrency combined with a high worker_prefetch_multiplier; as it turns out, fetching thousands of tasks is a good way to frack the connection.
If that's not the case, try disabling the broker pool by setting broker_pool_limit to None (see the sketch below).
Just some quick ideas that might (hopefully) help :-)
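For reference, the two settings mentioned above as they would appear in a Celery 4.x configuration (a sketch; the values are illustrative):

# Prefetch fewer tasks per worker and disable the broker connection pool.
app.conf.worker_prefetch_multiplier = 1
app.conf.broker_pool_limit = None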
