I'm looking at the HTTP callback tasks in Celery - http://celeryproject.org/docs/userguide/remote-tasks.html. They work well enough when the remote endpoint is available, but when it is unavailable (or when the remote endpoint returns a failure) I would like the task to retry, as per the retry policy. At the moment it just seems to raise an error and the task is ignored.
Any ideas ?
You should be able to define your task like:
class RemoteCall(Task):
    default_retry_delay = 30 * 60  # retry in 30 minutes

    def run(self, arg, **kwargs):
        try:
            res = URL("http://example.com/").get_async(arg)
        except (InvalidResponseError, RemoteExecuteError) as exc:
            self.retry([arg], exc=exc, **kwargs)
This will keep trying it up to max_retries tries, once every 30 minutes.
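If you also want to cap how many times this happens, set max_retries next to default_retry_delay in the class above. A hypothetical dispatch then looks like this (the argument value is just a placeholder):

# Queue the task; a worker will retry it every 30 minutes, up to max_retries,
# and raise MaxRetriesExceededError once that cap is reached.
result = RemoteCall.delay("some-argument")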
I have an app deployed on GKE, separated into different microservices.
One of the microservices, let's call it "worker", receives tasks to execute from pubsub messages.
The tasks can take up to 1 hour to execute. Since the regular acknowledgement deadline for Google Pub/Sub messages is pretty short, we renew the deadline 10 seconds before it ends. Here is the piece of code responsible for that:
def watchdog(businessDoneEvent, subscription, ack_deadline, message, ack_id):
    '''
    Prevents message from being republished as long as computation is
    running
    '''
    while True:
        # Wait (ack_deadline - 10) seconds before renewing if ack_deadline
        # is > 10 seconds; renew every second otherwise
        sleepTime = ack_deadline - 10 if ack_deadline > 10 else 1
        startTime = time.time()
        while time.time() - startTime < sleepTime:
            LOGGER.info('Sleeping time: {} - ack_deadline: {}'.format(time.time() - startTime, ack_deadline))
            if businessDoneEvent.isSet():
                LOGGER.info('Business done!')
                return
            time.sleep(1)
        subscriber = SubscriberClient()
        LOGGER.info('Modifying ack deadline for message ' +
                    str(message.data) + ' processing to ' +
                    str(ack_deadline))
        subscriber.modify_ack_deadline(subscription, [ack_id],
                                       ack_deadline)
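For context, the watchdog runs in its own thread next to the business logic; here is a rough sketch of how it gets wired up (do_the_long_running_work and the 30-second starting deadline are placeholders, not the real code):

import threading

def process_with_watchdog(subscription, message, ack_id, ack_deadline=30):
    businessDoneEvent = threading.Event()
    watchdogThread = threading.Thread(
        target=watchdog,
        args=(businessDoneEvent, subscription, ack_deadline, message, ack_id))
    watchdogThread.start()
    try:
        do_the_long_running_work(message)   # placeholder for the ~1 hour task
    finally:
        businessDoneEvent.set()             # stops the deadline-renewal loop
        watchdogThread.join()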
Once the execution is over, we reach this piece of code:
def callbackWrapper(callback,
                    subscription,
                    message,
                    ack_id,
                    endpoint,
                    context,
                    subscriber,
                    postAcknowledgmentCallback=None):
    '''
    Pub/sub message acknowledgment if everything ran correctly
    '''
    try:
        callback(message.data, endpoint, context, **message.attributes)
    except Exception as e:
        LOGGER.info(message.data)
        LOGGER.error(traceback.format_exc())
        raise e
    else:
        LOGGER.info("Trying to acknowledge...")
        my_retry = Retry(predicate=if_exception_type(ServiceUnavailable), deadline=3600)
        subscriber.acknowledge(subscription, [ack_id], retry=my_retry)
        LOGGER.info(str(ack_id) + ' has been acknowledged')
        if postAcknowledgmentCallback is not None:
            postAcknowledgmentCallback(message.data,
                                       **message.attributes)
Note that we also use this code in most of our microservices and it works just fine.
My problem is that even though I do not get any error from this code and the acknowledgement request seems to be sent properly, the message is actually acknowledged much later. For example, according to the GCP console, right now I have 8 unacknowledged messages when I should only have 3, and for about an hour it reported 12 when I should only have had 5.
I have a horizontal pod autoscaler driven by the Pub/Sub metric. When the pods are done, they are not scaled down, or only an hour or more later. This creates useless costs that I would like to avoid.
Does anyone have an idea about why this is happening?
I am totally new to Kafka and Docker, and have been handed a problem to fix. Our continuous integration tests for Apache Kafka queues run just fine on local machines, but on the Jenkins CI server they occasionally fail with this sort of error:
%3|1508247800.270|FAIL|art#producer-1| [thrd:localhost:9092/bootstrap]: localhost:9092/bootstrap: Connect to ipv4#127.0.0.1:9092 failed: Connection refused
%3|1508247800.270|ERROR|art#producer-1| [thrd:localhost:9092/bootstrap]: localhost:9092/bootstrap: Connect to ipv4#127.0.0.1:9092 failed: Connection refused
%3|1508247800.270|ERROR|art#producer-1| [thrd:localhost:9092/bootstrap]: 1/1 brokers are down
The working theory is that the Docker image takes time to get started, by which time the Kafka producer has given up. The offending code is
producer_properties = {
    'bootstrap.servers': self._job_queue.bootstrap_server,
    'client.id': self._job_queue.client_id
}
try:
    self._producer = kafka.Producer(**producer_properties)
except:
    print("Bang!")
with the error lines above appearing during the creation of the producer. However, no exception is raised, and the call returns an otherwise valid-looking producer, so I can't programmatically test the presence of the broker endpoint. Is there an API to check the status of a broker?
It seems the client doesn't throw an exception if the connection to the broker fails. It actually tries to connect to the bootstrap servers only when the producer first tries to send a message. If that connection fails, it repeatedly tries to connect to any of the brokers passed in the bootstrap list. Eventually, if the brokers come up, the send will happen (and we can check the status in the callback function).
The confluent-kafka Python library uses the librdkafka library underneath, and that client doesn't seem to have thorough documentation. Some of the Kafka producer options specified by the Kafka protocol do not seem to be supported by librdkafka.
Here is the sample code with callback I used:
from confluent_kafka import Producer

def notifyme(err, msg):
    print(err, msg.key(), msg.value())

p = Producer({'bootstrap.servers': '127.0.0.1:9092',
              'retry.backoff.ms': 100,
              'message.send.max.retries': 20,
              'reconnect.backoff.jitter.ms': 2000})
try:
    p.produce(topic='sometopic', value='this is data', on_delivery=notifyme)
except Exception as e:
    print(e)

p.flush()
Also, to check for the presence of the broker, you can just telnet to the broker IP on its port (9092 in this example). And on the ZooKeeper used by the Kafka cluster, you can check the contents of the znodes under /brokers/ids.
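The same port check can be scripted instead of using telnet; a minimal sketch (the host and port are the ones from the example above, the function name is mine):

import socket

def broker_port_open(host='127.0.0.1', port=9092, timeout=5.0):
    """Return True if something is listening on the broker's host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if not broker_port_open():
    print('Kafka broker does not appear to be reachable yet')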
Here is the code that seems to work for me. If it looks a bit Frankenstein, then you are right, it is! If there is a clean solution, I look forward to seeing it:
import time
import uuid
from threading import Event
from typing import Dict

import confluent_kafka as kafka
# pylint: disable=no-name-in-module
from confluent_kafka.cimpl import KafkaError
# more imports...

LOG = # ...

# Default number of times to retry connection to Kafka Broker
_DEFAULT_RETRIES = 3

# Default time in seconds to wait between connection attempts
_DEFAULT_RETRY_DELAY_S = 5.0

# Number of times to scan for an error after initiating the connection. It appears that calling
# flush() once on a producer after construction isn't sufficient to catch the 'broker not available'
# error. At least twice seems to work.
_NUM_ERROR_SCANS = 2


class JobProducer(object):
    def __init__(self, connection_retries: int=_DEFAULT_RETRIES,
                 retry_delay_s: float=_DEFAULT_RETRY_DELAY_S) -> None:
        """
        Constructs a producer.
        :param connection_retries: how many times to retry the connection before raising a
                                   RuntimeError. If 0, retry forever.
        :param retry_delay_s: how long to wait between retries in seconds.
        """
        self.__error_event = Event()
        self._job_queue = JobQueue()
        self._producer = self.__wait_for_broker(connection_retries, retry_delay_s)
        self._topic = self._job_queue.topic

    def produce_job(self, job_definition: Dict) -> None:
        """
        Produce a job definition on the queue
        :param job_definition: definition of the job to be executed
        """
        value = ...  # Conversion to JSON
        key = str(uuid.uuid4())
        LOG.info('Produced message: %s', value)
        self.__error_event.clear()
        self._producer.produce(self._topic,
                               value=value,
                               key=key,
                               on_delivery=self._on_delivery)
        self._producer.flush(self._job_queue.flush_timeout)

    @staticmethod
    def _on_delivery(error, message):
        if error:
            LOG.error('Failed to produce job %s, with error: %s', message.key(), error)

    def __create_producer(self) -> kafka.Producer:
        producer_properties = {
            'bootstrap.servers': self._job_queue.bootstrap_server,
            'error_cb': self.__on_error,
            'client.id': self._job_queue.client_id,
        }
        return kafka.Producer(**producer_properties)

    def __wait_for_broker(self, retries: int, delay: float) -> kafka.Producer:
        retry_count = 0
        while True:
            self.__error_event.clear()
            producer = self.__create_producer()
            # Need to call flush() several times with a delay between to ensure errors are caught.
            if not self.__error_event.is_set():
                for _ in range(_NUM_ERROR_SCANS):
                    producer.flush(0.1)
                    if self.__error_event.is_set():
                        break
                    time.sleep(0.1)
                else:
                    # Success: no errors.
                    return producer

            # If we get to here, the error callback was invoked.
            retry_count += 1
            if retries == 0:
                msg = '({})'.format(retry_count)
            else:
                if retry_count <= retries:
                    msg = '({}/{})'.format(retry_count, retries)
                else:
                    raise RuntimeError('JobProducer timed out')

            LOG.warn('JobProducer: could not connect to broker, will retry %s', msg)
            time.sleep(delay)

    def __on_error(self, error: KafkaError) -> None:
        LOG.error('KafkaError: %s', error.str())
        self.__error_event.set()
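For what it's worth, this is roughly how the class gets used; the payload keys below are placeholders (JobQueue supplies the real topic and broker details, as in the elided code above):

producer = JobProducer(connection_retries=5, retry_delay_s=2.0)
producer.produce_job({'job_id': 42, 'action': 'process'})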
I am using Tornado to asynchronously scrape data from many thousands of URLs. Each of them is 5-50 MB, so they take a while to download. I keep getting "Exception: HTTP 599: Connection closed http:…" errors, despite the fact that I am setting both connect_timeout and request_timeout to a very large number.
Why, despite the large timeout settings, am I still timing out on some requests after only a few minutes of running the script? Is there a way to instruct httpclient.AsyncHTTPClient to NEVER time out? Or is there a better solution to prevent timeouts?
This is how I'm calling the fetch (each worker calls this request_and_save_url() sub-coroutine from the Worker() coroutine):
@gen.coroutine
def request_and_save_url(url, q_all):
    try:
        response = yield httpclient.AsyncHTTPClient().fetch(
            url, partial(handle_request, q_all=q_all),
            connect_timeout=60*24*3*999999999,
            request_timeout=60*24*3*999999999)
    except Exception as e:
        print('Exception: {0} {1}'.format(e, url))
        raise gen.Return([])
As you note, HTTPError 599 is raised on a connection or request timeout, but that is not the only case. It is also raised when the connection has been closed by the server before the request finishes (including fetching the entire response), e.g. because of the server's own timeout for handling the request.
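One pragmatic way to cope with that is to catch the 599 and retry the fetch a few times. A minimal sketch, not from the question's code (fetch_with_retries, the timeouts, and the backoff are my own choices):

from tornado import gen, httpclient

@gen.coroutine
def fetch_with_retries(url, max_attempts=3):
    client = httpclient.AsyncHTTPClient()
    for attempt in range(1, max_attempts + 1):
        try:
            response = yield client.fetch(url,
                                          connect_timeout=600,
                                          request_timeout=600)
            raise gen.Return(response)
        except httpclient.HTTPError as e:
            # 599 covers timeouts and connections closed early by the server.
            if e.code == 599 and attempt < max_attempts:
                yield gen.sleep(2 ** attempt)   # simple exponential backoff
                continue
            raise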
I'm posting data to a web service from Celery. Sometimes the data is not posted because the internet is down, and the task is retried endlessly until it is posted. Retrying the task is unnecessary in that situation, because the net is down and there is no point in retrying right away.
I thought of a better solution: if a task fails three times (retrying a minimum of 3 times), it is shifted to another queue. This queue contains the list of all failed tasks.
Now when the internet is up and the data is posted over the net, i.e. the tasks from the normal queue have completed, it then starts processing the tasks from the queue holding the failed tasks.
This will not waste CPU and memory on retrying the task again and again.
Here's my code. As of right now, I'm just retrying the task again, but I doubt that's the right way of doing it.
@shared_task(default_retry_delay=1 * 60, max_retries=10)
def post_data_to_web_service(data, url):
    try:
        client = SoapClient(
            location=url,
            action='http://tempuri.org/IService_1_0/',
            namespace="http://tempuri.org/",
            soap_ns='soap', ns=False
        )
        response = client.UpdateShipment(
            Weight=Decimal(data['Weight']),
            Length=Decimal(data['Length']),
            Height=Decimal(data['Height']),
            Width=Decimal(data['Width']),
        )
    except Exception as exc:
        raise post_data_to_web_service.retry(exc=exc)
How do I maintain two queues simultaneously and execute tasks from both of them?
Settings.py
BROKER_URL = 'redis://localhost:6379/0'
CELERY_ACCEPT_CONTENT = ['json']
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
By default Celery adds all tasks to a queue named celery. So you can run your task there, and when an exception occurs it retries; once it reaches the maximum number of retries, you can shift it to a new queue, say foo:
from celery.exceptions import MaxRetriesExceededError

@shared_task(default_retry_delay=1 * 60, max_retries=10)
def post_data_to_web_service(data, url):
    try:
        # do something with given args
        ...
    except Exception:
        try:
            # Without an exc argument, retry() raises MaxRetriesExceededError
            # once max_retries has been reached.
            raise post_data_to_web_service.retry()
        except MaxRetriesExceededError:
            # Out of retries: re-queue the task on the 'foo' queue instead.
            post_data_to_web_service.apply_async(args=[data, url], queue='foo')
When you start your worker, this task will try to do something with the given data. If it fails, it will retry 10 times with a delay of 60 seconds. When it then encounters MaxRetriesExceededError, it posts the same task to the new queue foo.
To consume these tasks you have to start a new worker
celery worker -l info -A my_app -Q foo
or you can also consume this task from the default worker if you start it with
celery worker -l info -A my_app -Q celery,foo
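If you prefer declaring the queues in settings rather than only on the command line, a sketch along these lines should work with the old-style settings shown above (the queue names are the ones used in this answer):

# settings.py (hypothetical addition): declare both queues explicitly;
# workers still pick what to consume with -Q.
from kombu import Queue

CELERY_QUEUES = (
    Queue('celery'),   # default queue
    Queue('foo'),      # queue for tasks that exhausted their retries
)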
I have a task:
@celery.task(name='request_task', default_retry_delay=2, acks_late=True)
def request_task(data):
    try:
        if some_condition:
            request_task.retry()
    except Exception as e:
        request_task.retry()
I use Celery with a MongoDB broker and the MongoDB results backend enabled.
When the task's retry() method is called, whether from the conditional statement or after catching the exception, the task is not retried.
In the worker's terminal I get a message like this:
[2012-08-10 19:21:54,909: INFO/MainProcess] Task request_task[badb3131-8964-41b5-90a7-245a8131e68d] retry: Task can be retried
What could be wrong?
UPDATE: In the end I did not solve this and had to use a while loop inside the task, so my tasks never get retried.
I know this answer is too late, but the log message you're seeing means that you're calling the task directly, request_task(), and not queuing it, so it's not running on a worker. Doing this will raise the original exception if there is one, or a Retry exception otherwise. This is the code from the Task.retry method, if you want to take a look:
# Not in worker or emulated by (apply/always_eager),
# so just raise the original exception.
if request.called_directly:
    # raises orig stack if PyErr_Occurred,
    # and augments with exc' if that argument is defined.
    raise_with_context(exc or Retry('Task can be retried', None))
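In other words, a quick illustration (hypothetical call sites, assuming the task is registered with a running worker):

request_task(data)        # runs in the calling process; retry() only raises
request_task.delay(data)  # sent to a worker; there retry() re-queues the task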
Using task.retry() does not retry the task on the same worker; it sends a new message using task.apply_async(), so it might be retried on another worker. That is something to take into account when handling retries. You can access the retry count using task.request.retries, and you can also set the max_retries option on the task decorator.
By the way, using bind=True on the task decorator makes the task instance available as the first argument:
@app.task(bind=True, name='request_task', max_retries=3)
def request_task(self, data):
    # Task retries count
    self.request.retries
    try:
        # Your code here
        ...
    except SomeException as exc:
        # Retry the task.
        # Passing the exc argument will make the task fail
        # with this exception instead of MaxRetriesExceededError on max retries.
        self.retry(exc=exc)
You should read the section on retrying in the Celery docs.
http://celery.readthedocs.org/en/latest/userguide/tasks.html#retrying
It looks like in order to retry, you must raise a retry exception.
raise request_task.retry()
That seems to let the retry be handled by the decorator that wraps your task.