Google pubsub late acknowledgement - python

I have an app deployed on GKE, separated in different microservices.
One of the microservices, let's call it "worker", receives tasks to execute from pubsub messages.
The tasks can take up to 1 hour to execute. Since the regular acknowledgement deadline for Google Pub/Sub messages is pretty short, we renew the deadline 10 seconds before it expires. Here is the piece of code responsible for that:
def watchdog(businessDoneEvent, subscription, ack_deadline, message, ack_id):
    '''
    Prevents message from being republished as long as computation is
    running
    '''
    while True:
        # Wait (ack_deadline - 10) seconds before renewing if ack_deadline
        # is > 10 seconds; renew every second otherwise
        sleepTime = ack_deadline - 10 if ack_deadline > 10 else 1
        startTime = time.time()
        while time.time() - startTime < sleepTime:
            LOGGER.info('Sleeping time: {} - ack_deadline: {}'.format(
                time.time() - startTime, ack_deadline))
            if businessDoneEvent.isSet():
                LOGGER.info('Business done!')
                return
            time.sleep(1)
        subscriber = SubscriberClient()
        LOGGER.info('Modifying ack deadline for message ' +
                    str(message.data) + ' processing to ' +
                    str(ack_deadline))
        subscriber.modify_ack_deadline(subscription, [ack_id],
                                       ack_deadline)
Once the execution is over, we reach this piece of code:
def callbackWrapper(callback,
                    subscription,
                    message,
                    ack_id,
                    endpoint,
                    context,
                    subscriber,
                    postAcknowledgmentCallback=None):
    '''
    Pub/Sub message acknowledgment if everything ran correctly
    '''
    try:
        callback(message.data, endpoint, context, **message.attributes)
    except Exception as e:
        LOGGER.info(message.data)
        LOGGER.error(traceback.format_exc())
        raise e
    else:
        LOGGER.info("Trying to acknowledge...")
        my_retry = Retry(predicate=if_exception_type(ServiceUnavailable), deadline=3600)
        subscriber.acknowledge(subscription, [ack_id], retry=my_retry)
        LOGGER.info(str(ack_id) + ' has been acknowledged')
        if postAcknowledgmentCallback is not None:
            postAcknowledgmentCallback(message.data,
                                       **message.attributes)
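For context, here is a minimal sketch of how these two pieces might be wired together. This glue code is not from the original application: the handleMessage function, the 60-second deadline, and the argument plumbing are assumptions.

import threading

def handleMessage(subscriber, subscription, message, ack_id,
                  callback, endpoint, context):
    # hypothetical glue: run the watchdog in a background thread while the
    # business logic executes, then signal it to stop
    businessDoneEvent = threading.Event()
    ack_deadline = 60  # assumed subscription ack deadline in seconds
    watcher = threading.Thread(
        target=watchdog,
        args=(businessDoneEvent, subscription, ack_deadline, message, ack_id))
    watcher.start()
    try:
        callbackWrapper(callback, subscription, message, ack_id,
                        endpoint, context, subscriber)
    finally:
        businessDoneEvent.set()  # stop renewing the deadline
        watcher.join()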
Note that we also use the watchdog and callbackWrapper code in most of our other microservices, and it works just fine.
My problem is that even though I do not get any error from this code, and the acknowledgement request seems to be sent properly, the message is actually acknowledged later. For example, according to the GCP console, right now I have 8 unacknowledged messages when I should only have 3. Earlier it reported 12 for over an hour when I should only have had 5.
I have a horizontal pod autoscaler using the Pub/Sub metric. When the pods are done, they are not scaled down, or only an hour or more later. This creates useless costs that I would like to avoid.
Does anyone have an idea about why this is happening?

Related

Near Lake NFT indexer access is slow

We are using the NEAR Lake NFT indexer to listen to mainnet block information. We find that obtaining data is very slow: sometimes streamer_messages_queue.get() waits for hundreds of seconds to get data, and sometimes it stalls at that call entirely. We would like to know whether there is a configuration error or some restriction that causes us to get data so slowly.
stream_handle, streamer_messages_queue = streamer(config)
while True:
    start_time = int(time.time())
    logger.info("Start listening time:{}", start_time)
    streamer_message = await streamer_messages_queue.get()
    end_time = int(time.time())
    logger.info("streamer_messages_queue.get() consuming time:{}", end_time - start_time)
    logger.info(f"Block #{streamer_message.block.header.height} Shards: {len(streamer_message.shards)}")
    start_time = int(time.time())
    await handle_streamer_message(streamer_message)
    end_time = int(time.time())
    logger.info("handle_streamer_message consuming time:{}", end_time - start_time)
Output log:
Start listening time:1660711985
streamer_messages_queue.get() consuming time:282
Block #71516637 Shards: 4
handle_streamer_message consuming time:0
When using the method provided by near_lake_framework to obtain data, execution waits for a long time at the streamer_messages_queue.get() call shown above.
I have heard such reports from people in some regions where AWS S3 is rate-limited. Try using a VPN or running it on a server elsewhere. You could also try the Rust or JS versions of near-lake-framework (pick any of the tutorials).

Sleep python script execution until process completes

I know this question has been asked before, but I found no answer that addressed my particular problem.
I have a Python script to create Kubernetes clusters and nodes in Azure, which takes anywhere between 5 and 10 minutes. There is a function (get_cluster_end) to get the cluster endpoint, but it fails because the endpoint is not yet ready when the function is called. The function I wrote (wait_for_endpoint) does not seem to be correct.
def wait_for_endpoint(timeout=None):
    endpoint = None
    start = time.time()
    while not endpoint:
        if timeout is not None and (time.time() - start > timeout):
            break
        endpoint = get_cluster_end()
        time.sleep(5)
    return endpoint
My main func:
def main():
    create_cluster()
    start = time.time()
    job.set_progress("Waiting for cluster IP address...")
    endpoint = wait_for_endpoint(timeout=TIMEOUT)
    if not endpoint:
        return ("FAILURE",
                "No IP address returned after {} seconds".format(TIMEOUT),
                "")
The script fails because no endpoint has yet been created. How do I make the script sleep after the cluster has been created and before wait_for_endpoint() is called?
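One thing worth noting: if get_cluster_end() raises an exception while the endpoint is not ready (rather than returning None), the loop above dies on its first call instead of polling. A minimal sketch of a polling loop that tolerates either behavior; the try/except around get_cluster_end() is an assumption about how it fails:

import time

def wait_for_endpoint(timeout=None, poll_interval=5):
    start = time.time()
    while timeout is None or time.time() - start < timeout:
        try:
            endpoint = get_cluster_end()  # assumed to raise while provisioning
            if endpoint:
                return endpoint
        except Exception:
            pass  # cluster not ready yet; keep polling
        time.sleep(poll_interval)
    return None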

Script from cron sends >50 mails on errors, 1 on success

I have an issue which I really cannot figure out.
The following snippet from my Python script zips a directory and sends a mail on success. It also sends a mail if an error occurred. And here is the issue:
When I execute the script manually, everything works fine:
1 mail on success, 1 mail if an error occurred.
If the script is run from cron though, I receive over 50 emails if an error occurs (on success, only one)! All mails have the same content (the error message), and all mails are sent at the same time (exactly the same "hh:mm").
This is the script snippet:
def backup(pathMedia, pathZipMedia):
    [...]
    try:
        createArchive(pathMedia, pathZipMedia)
    except Exception as e:
        sendMail('Error in zipping the media dir: ' + str(e))
        sys.exit()
    sendMail('Backup successfully created!')

def sendMail(msg):
    sent = 0
    SMTPserver = '[...]'
    sender = '[...]'
    destination = ['...']
    USERNAME = '[...]'
    PASSWORD = '[...]'
    text_subtype = 'plain'
    subject = 'Backup notification'
    content = msg
    try:
        msg = MIMEText(content, text_subtype)
        msg['Subject'] = subject
        msg['From'] = sender
        conn = SMTP(SMTPserver)
        conn.set_debuglevel(False)
        conn.login(USERNAME, PASSWORD)
        try:
            if (sent == 0):
                conn.sendmail(sender, destination, msg.as_string())
                sent = 1
        finally:
            conn.quit()
    except Exception as e:
        sys.exit()
My crontab is the following:
## run the backup script every 3 days at 4am
* 4 */3 * * /root/backup.py >/dev/null 2>&1
I fixed the occurring errors for now, but they might happen again.
And I'm really curious about why this issue occurs!
Thanks!
The * at the beginning of your crontab line says "run this job every minute".
Presumably a successful run of the first job at 4:00 causes the following 59 runs to find that no work needs to be done, therefore they don't attempt to create a backup and they exit quietly without sending email. But an unsuccessful run at 4:00 will leave work to be done by the next job at 4:01, and again the minute after that, and so on until 4:59. All of those jobs try to create a backup and all of them fail, so you get something like 60 failure emails. (Or fewer if one of the jobs manages to succeed, breaking the chain of failures.)
To fix the crontab line to run the job only one time at 4:00am, change the first * to a 0.
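Applied to the line above, the fixed entry is:

0 4 */3 * * /root/backup.py >/dev/null 2>&1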
I don't know why your failure emails all have the same timestamp. Are you certain that they're all exactly the same? If so, perhaps they're being batched by your mail system and are assigned a Date header at the time the batch is processed. Or perhaps all of the jobs are started by cron and then they all wait, blocked until some system timeout or other event occurs, and then they all experience the failure simultaneously and all send emails at the same time.

Kafka timing out because of Docker latency

I am totally new to Kafka and Docker, and have been handed a problem to fix. Our Continuous Integration tests for (Apache) Kafka queues run just fine on local machines, but occasionally fail on the Jenkins CI server with this sort of error:
%3|1508247800.270|FAIL|art#producer-1| [thrd:localhost:9092/bootstrap]: localhost:9092/bootstrap: Connect to ipv4#127.0.0.1:9092 failed: Connection refused
%3|1508247800.270|ERROR|art#producer-1| [thrd:localhost:9092/bootstrap]: localhost:9092/bootstrap: Connect to ipv4#127.0.0.1:9092 failed: Connection refused
%3|1508247800.270|ERROR|art#producer-1| [thrd:localhost:9092/bootstrap]: 1/1 brokers are down
The working theory is that the Docker image takes time to get started, by which time the Kafka producer has given up. The offending code is
producer_properties = {
    'bootstrap.servers': self._job_queue.bootstrap_server,
    'client.id': self._job_queue.client_id
}
try:
    self._producer = kafka.Producer(**producer_properties)
except:
    print("Bang!")
with the error lines above appearing in the creation of the producer. However, no exception is raised, and the call returns an otherwise valid looking producer, so I can't programmatically test the presence of the broker endpoint. Is there an API to check the status of a broker?
It seems the client doesn't throw an exception if the connection to the broker fails. It actually tries to connect to the bootstrap servers the first time the producer tries to send a message. If the connection fails, it repeatedly tries to connect to any of the brokers passed in the bootstrap list. Eventually, if the brokers come up, the send will happen (and we can check the status in the callback function).
The confluent-kafka Python library uses the librdkafka library, and this client doesn't seem to have proper documentation. Some of the Kafka producer options specified by the Kafka protocol seem not to be supported by librdkafka.
Here is the sample code with callback I used:
from confluent_kafka import Producer

def notifyme(err, msg):
    print(err, msg.key(), msg.value())

p = Producer({'bootstrap.servers': '127.0.0.1:9092',
              'retry.backoff.ms': 100,
              'message.send.max.retries': 20,
              'reconnect.backoff.jitter.ms': 2000})
try:
    p.produce(topic='sometopic', value='this is data', on_delivery=notifyme)
except Exception as e:
    print(e)
p.flush()
Also, to check for the presence of the broker, you can just telnet to the broker IP on its port (in this example, 9092). And on the ZooKeeper used by the Kafka cluster, you can check the contents of the znodes under /brokers/ids.
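If you want that check in code rather than via telnet, one possibility (a sketch, not part of the original answer) is to request cluster metadata with a timeout, which confluent-kafka exposes as list_topics():

from confluent_kafka.admin import AdminClient
from confluent_kafka import KafkaException

def broker_is_up(bootstrap_servers, timeout_s=5.0):
    # Returns True if cluster metadata can be fetched within the timeout.
    admin = AdminClient({'bootstrap.servers': bootstrap_servers})
    try:
        admin.list_topics(timeout=timeout_s)  # raises KafkaException if no broker responds
        return True
    except KafkaException:
        return False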
Here is the code that seems to work for me. If it looks a bit Frankenstein, then you are right, it is! If there is a clean solution, I look forward to seeing it:
import time
import uuid
from threading import Event
from typing import Dict

import confluent_kafka as kafka
# pylint: disable=no-name-in-module
from confluent_kafka.cimpl import KafkaError
# more imports...

LOG = # ...

# Default number of times to retry connection to Kafka Broker
_DEFAULT_RETRIES = 3
# Default time in seconds to wait between connection attempts
_DEFAULT_RETRY_DELAY_S = 5.0
# Number of times to scan for an error after initiating the connection. It appears that calling
# flush() once on a producer after construction isn't sufficient to catch the 'broker not available'
# error. At least twice seems to work.
_NUM_ERROR_SCANS = 2


class JobProducer(object):
    def __init__(self, connection_retries: int = _DEFAULT_RETRIES,
                 retry_delay_s: float = _DEFAULT_RETRY_DELAY_S) -> None:
        """
        Constructs a producer.
        :param connection_retries: how many times to retry the connection before raising a
                                   RuntimeError. If 0, retry forever.
        :param retry_delay_s: how long to wait between retries in seconds.
        """
        self.__error_event = Event()
        self._job_queue = JobQueue()
        self._producer = self.__wait_for_broker(connection_retries, retry_delay_s)
        self._topic = self._job_queue.topic

    def produce_job(self, job_definition: Dict) -> None:
        """
        Produce a job definition on the queue
        :param job_definition: definition of the job to be executed
        """
        value = ...  # Conversion to JSON
        key = str(uuid.uuid4())
        LOG.info('Produced message: %s', value)
        self.__error_event.clear()
        self._producer.produce(self._topic,
                               value=value,
                               key=key,
                               on_delivery=self._on_delivery)
        self._producer.flush(self._job_queue.flush_timeout)

    @staticmethod
    def _on_delivery(error, message):
        if error:
            LOG.error('Failed to produce job %s, with error: %s', message.key(), error)

    def __create_producer(self) -> kafka.Producer:
        producer_properties = {
            'bootstrap.servers': self._job_queue.bootstrap_server,
            'error_cb': self.__on_error,
            'client.id': self._job_queue.client_id,
        }
        return kafka.Producer(**producer_properties)

    def __wait_for_broker(self, retries: int, delay: float) -> kafka.Producer:
        retry_count = 0
        while True:
            self.__error_event.clear()
            producer = self.__create_producer()
            # Need to call flush() several times with a delay between to ensure errors are caught.
            if not self.__error_event.is_set():
                for _ in range(_NUM_ERROR_SCANS):
                    producer.flush(0.1)
                    if self.__error_event.is_set():
                        break
                    time.sleep(0.1)
                else:
                    # Success: no errors.
                    return producer
            # If we get to here, the error callback was invoked.
            retry_count += 1
            if retries == 0:
                msg = '({})'.format(retry_count)
            else:
                if retry_count <= retries:
                    msg = '({}/{})'.format(retry_count, retries)
                else:
                    raise RuntimeError('JobProducer timed out')
            LOG.warning('JobProducer: could not connect to broker, will retry %s', msg)
            time.sleep(delay)

    def __on_error(self, error: KafkaError) -> None:
        LOG.error('KafkaError: %s', error.str())
        self.__error_event.set()
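A hypothetical usage of the class above (assuming JobQueue supplies the broker, topic, and flush-timeout configuration):

producer = JobProducer(connection_retries=5, retry_delay_s=2.0)
producer.produce_job({'job': 'example'})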

PyCurl request hangs infinitely on perform

I have written a script to fetch scan results from Qualys to be run each week for the purpose of metrics gathering.
The first part of this script involves fetching a list of references for each of the scans that were run in the past week for further processing.
The problem is that, while this will work perfectly sometimes, other times the script will hang on the c.perform() line. This is manageable when running the script manually as it can just be re-run until it works. However, I am looking to run this as a scheduled task each week without any manual interaction.
Is there a foolproof way that I can detect if a hang has occurred and resend the PyCurl request until it works?
I have tried setting the c.TIMEOUT and c.CONNECTTIMEOUT options but these don't seem to be effective. Also, as no exception is thrown, simply putting it in a try-except block also won't fly.
The function in question is below:
# Retrieve a list of all scans conducted in the past week
# Save this to refs_raw.txt
def getScanRefs(usr, pwd):
    print("getting scan references...")
    with open('refs_raw.txt', 'wb') as refsraw:
        today = DT.date.today()
        week_ago = today - DT.timedelta(days=7)
        strtoday = str(today)
        strweek_ago = str(week_ago)
        c = pycurl.Curl()
        c.setopt(c.URL, 'https://qualysapi.qualys.eu/api/2.0/fo/scan/?action=list&launched_after_datetime=' + strweek_ago + '&launched_before_datetime=' + strtoday)
        c.setopt(c.HTTPHEADER, ['X-Requested-With: pycurl', 'Content-Type: text/xml'])
        c.setopt(c.USERPWD, usr + ':' + pwd)
        c.setopt(c.POST, 1)
        c.setopt(c.PROXY, 'companyproxy.net:8080')
        c.setopt(c.CAINFO, certifi.where())
        c.setopt(c.SSL_VERIFYPEER, 0)
        c.setopt(c.SSL_VERIFYHOST, 0)
        c.setopt(c.CONNECTTIMEOUT, 3)
        c.setopt(c.TIMEOUT, 3)
        refsbuffer = BytesIO()
        c.setopt(c.WRITEDATA, refsbuffer)
        c.perform()
        body = refsbuffer.getvalue()
        refsraw.write(body)
        c.close()
    print("Got em!")
I fixed the issue myself by using multiprocessing to launch the API call in a separate process, killing and restarting it if it runs for longer than 5 seconds. It's not very pretty, but it is cross-platform. For those looking for a solution that is more elegant but only works on *nix, look into the signal library, specifically SIGALRM (a sketch follows the code below).
Code below:
# As this request for scan references sometimes hangs, it is run in a separate process here.
# It will be terminated and relaunched if no response is received within 5 seconds.
def performRequest(usr, pwd):
    today = DT.date.today()
    week_ago = today - DT.timedelta(days=7)
    strtoday = str(today)
    strweek_ago = str(week_ago)
    c = pycurl.Curl()
    c.setopt(c.URL, 'https://qualysapi.qualys.eu/api/2.0/fo/scan/?action=list&launched_after_datetime=' + strweek_ago + '&launched_before_datetime=' + strtoday)
    c.setopt(c.HTTPHEADER, ['X-Requested-With: pycurl', 'Content-Type: text/xml'])
    c.setopt(c.USERPWD, usr + ':' + pwd)
    c.setopt(c.POST, 1)
    c.setopt(c.PROXY, 'companyproxy.net:8080')
    c.setopt(c.CAINFO, certifi.where())
    c.setopt(c.SSL_VERIFYPEER, 0)
    c.setopt(c.SSL_VERIFYHOST, 0)
    refsBuffer = BytesIO()
    c.setopt(c.WRITEDATA, refsBuffer)
    c.perform()
    c.close()
    body = refsBuffer.getvalue()
    refsraw = open('refs_raw.txt', 'wb')
    refsraw.write(body)
    refsraw.close()
# Retrieve a list of all scans conducted in the past week
# Save this to refs_raw.txt
def getScanRefs(usr, pwd):
    print("Getting scan references...")
    # Occasionally the request will hang indefinitely. Launch it in a separate process and retry if no response in 5 seconds.
    success = False
    while not success:
        sendRequest = multiprocessing.Process(target=performRequest, args=(usr, pwd))
        sendRequest.start()
        for seconds in range(5):
            print("...")
            time.sleep(1)
        if sendRequest.is_alive():
            print("Maximum allocated time reached... Resending request")
            sendRequest.terminate()
            del sendRequest
        else:
            success = True
    print("Got em!")
The question is old, but I will add this answer; it might help someone.
The only way to terminate a running curl transfer after calling perform() is by using callbacks:
1- Using CURLOPT_WRITEFUNCTION:
As stated in the docs:
Your callback should return the number of bytes actually taken care of. If that amount differs from the amount passed to your callback function, it'll signal an error condition to the library. This will cause the transfer to get aborted and the libcurl function used will return CURLE_WRITE_ERROR.
The drawback with this method is that curl calls the write function only when it receives new data from the server, so if the server stops sending data, curl will just keep waiting and will never receive your kill signal.
2- The alternative, and the best option so far, is using a progress callback:
The beauty of the progress callback is that curl will call it at least once per second, even if no data is coming from the server, which gives you the opportunity to return a non-zero value as a kill switch to curl.
Use the option CURLOPT_XFERINFOFUNCTION;
note that it is better than using CURLOPT_PROGRESSFUNCTION, as quoted in the docs:
We encourage users to use the newer CURLOPT_XFERINFOFUNCTION instead, if you can.
You also need to set the option CURLOPT_NOPROGRESS:
CURLOPT_NOPROGRESS must be set to 0 to make this function actually get called.
This is an example showing both write and progress function implementations in Python:
# example of using write and progress functions to terminate curl
import pycurl

f = open('mynewfile', 'wb')  # used to save downloaded data
counter = 0

# define callback functions which will be used by curl
def my_write_func(data):
    """write to file"""
    global counter
    f.write(data)
    counter += len(data)
    # an example to terminate curl: tell curl to abort if the downloaded data
    # exceeded 1024 bytes by returning -1 or any number not equal to len(data)
    if counter >= 1024:
        return -1

def progress(d_size, downloaded, u_size, uploaded):
    """it receives progress from curl and can be used as a kill switch
    Returning a non-zero value from this callback will cause curl to abort the transfer
    """
    # an example to terminate curl: tell curl to abort if the downloaded data
    # exceeded 1024 bytes by returning a non-zero value
    if downloaded >= 1024:
        return -1

# initialize curl object and options
c = pycurl.Curl()
c.setopt(pycurl.URL, 'https://example.com')  # the URL to fetch (placeholder)
# callback options
c.setopt(pycurl.WRITEFUNCTION, my_write_func)
c.setopt(pycurl.NOPROGRESS, 0)  # required to make the progress function actually get called
c.setopt(pycurl.XFERINFOFUNCTION, progress)
# c.setopt(pycurl.PROGRESSFUNCTION, progress)  # you can use this option but pycurl.XFERINFOFUNCTION is recommended
# put other curl options as required
# executing curl
c.perform()
c.close()
f.close()
