Google PubSub thread safety and ignoring duplicate messages - python

I am listening to financial data published by Google Cloud Platform, Monday to Friday. I would like to save all messages to disk. I am doing this in Python.
I need to recover any missing packets if my application goes down. I understand Google will automatically resend un-ack'd messages.
The GCP documentation lists many available subscription techniques (asynchronous/synchronous, push/pull, streaming pull, etc.). Here is the asynchronous sample code:
def callback(message):
    print(f"Received {message}.")
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
print(f"Listening for messages on {subscription_path}..\n")

# Wrap subscriber in a 'with' block to automatically call close() when done.
with subscriber:
    try:
        # When `timeout` is not set, result() will block indefinitely,
        # unless an exception is encountered first.
        streaming_pull_future.result(timeout=5)
    except TimeoutError:
        streaming_pull_future.cancel()
https://cloud.google.com/pubsub/docs/pull
Is the callback thread-safe / is there only one thread calling back?
What is the best way to ignore already-received messages? Does the client need to maintain a map?
UPDATE for Kamal Aboul-Hosn
I think I can persist OK, but my problem is that I need to manually check that all messages have indeed been received. To do this I enabled ordered delivery. Our message data contains a sequence number, so I wanted to add a global variable like next_expected_seq_num. After I receive each message I will process and ack the message and increment next_expected_seq_num.
However, if I have say 10 threads invoking the callback method, I assume any of the 10 could be holding the next message? And I'd have to make my callback method smart enough to block processing on the other 9 threads whilst the 10th thread processes the next message. Something like:
(pseudo code)
def callback(msg):
    global next_expected_seq_num
    seq_num = getSeqNum(msg.data)
    while seq_num != next_expected_seq_num:   # make atomic
        pass
    # When we reach here, we have the next message
    assert not db.exists(seq_num)
    # persist message
    next_expected_seq_num += 1   # make atomic / cannot happen earlier
    msg.ack()
Should I just disable multiple callback threads, given I'm preventing multithreading anyway?
Is there a better way to check/guarantee we process every message?
I'm wondering if we should trust GCP the way we trust TCP, enable multithreading, and just lock around the database write?
def callback(msg):
    seq_num = getSeqNum(msg.data)
    with lock:
        if not db.exists(seq_num):
            # persist message
            ...
    msg.ack()

The callback is not thread safe if you are running in a Python environment that doesn't have a global interpreter lock. Multiple callbacks could be executed in parallel in that case, and you would have to guard any shared data structures with locks.
Since Cloud Pub/Sub has at-least-once delivery semantics, if you need to ignore duplicate messages then yes, you will need to maintain some kind of data structure with the already-received messages. Note that duplicates could be delivered across subscriber restarts. Therefore, you will probably need this to be some kind of persistent storage. Redis tends to be a popular choice for this type of deduplication.
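As a rough illustration of that idea (not part of the original answer), a dedup check against Redis could look something like the following; get_seq_num and persist are hypothetical helpers, and the key naming is arbitrary:

import redis

r = redis.Redis(host="localhost", port=6379)

def callback(message):
    seq_num = get_seq_num(message.data)          # hypothetical helper
    # SET ... NX succeeds only if the key did not exist yet, so the
    # persist step runs at most once per sequence number, even with
    # several subscriber processes.
    if r.set(f"seen:{seq_num}", 1, nx=True):
        persist(message)                         # hypothetical persistence step
    message.ack()

Because the check-and-mark is a single Redis command, no extra locking is needed around it.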
With ordered delivery, it is guaranteed that the callback will only run for one message at a time per ordering key. Therefore, you would not have to program expecting multiple messages for the same key to be processed simultaneously. Note that in general, using ordering keys to totally order all messages in the topic will only work if your throughput is no more than 1MB/s, as that is the publish limit for messages with ordering keys. Also, only use ordering keys if it is important to process the messages in order.
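For reference, ordering is a property of the subscription and has to be requested when the subscription is created; a sketch with the 2.x Python client, using hypothetical project/topic/subscription names:

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()
topic_path = publisher.topic_path("my-project", "my-topic")                        # hypothetical
subscription_path = subscriber.subscription_path("my-project", "my-ordered-sub")   # hypothetical

# enable_message_ordering is fixed at creation time; it cannot be
# switched on for an existing subscription.
subscriber.create_subscription(
    request={
        "name": subscription_path,
        "topic": topic_path,
        "enable_message_ordering": True,
    }
)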
With regard to when to use multithreading or not, it really depends on the nature of the processing. If most of the callback would need to be guarded with a lock, then multithreading won't help much. If, though, only small portions need to be guarded by locks, e.g., checking for duplicates, while most of the processing can safely be done in parallel, then multithreading could result in better performance.
If all you want to do is prevent duplicates, then you probably don't need to guard the writes to the database with a lock unless the database doesn't guarantee consistency. Also, keep in mind that the locking only helps if you have a single subscriber client.
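If you decide you would rather rule out concurrent callbacks entirely than lock inside them, the client library lets you hand subscribe a scheduler backed by a single-worker executor. A sketch, with hypothetical project and subscription names:

from concurrent.futures import ThreadPoolExecutor

from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.subscriber.scheduler import ThreadScheduler

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-sub")  # hypothetical

def callback(message):
    # process and persist the message here
    message.ack()

# A single worker thread means callbacks run one at a time, so the
# callback body itself needs no locking.
scheduler = ThreadScheduler(executor=ThreadPoolExecutor(max_workers=1))
flow_control = pubsub_v1.types.FlowControl(max_messages=100)

streaming_pull_future = subscriber.subscribe(
    subscription_path,
    callback=callback,
    flow_control=flow_control,
    scheduler=scheduler,
)

The trade-off is throughput: with a single worker you give up the parallelism discussed above.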

Related

Kafka consume single message on request

I want to make a Flask application/API with gunicorn that, on every request:
reads a single value from a Kafka topic
does some processing
and returns the processed value to the user(or any application calling the API).
So far I couldn't find any examples of it. Is the following function the correct way of doing this?
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "first_topic",
    bootstrap_servers='xxxxxx',
    auto_offset_reset='xxxx',
    group_id="my_group")

def get_value_from_topic():
    for msg in consumer:
        return msg

if __name__ == "__main__":
    print(get_value_from_topic())
Or is there any better way of doing this using any library like Faust?
My reason for using Kafka is to avoid all the hassle of synchronization among the flask workers(in the case of traditional database) because I want to use a value from Kafka only once.
This seems okay at first glance. Your consumer iterator is iterated once, and you return that value.
The more idiomatic way to do that would be like this, however
def get_value_from_topic():
    return next(consumer)
With your other settings, though, there's no guarantee this only polls one message, because Kafka consumers poll in batches and will auto-commit those batches of offsets. Therefore, you'll want to disable auto-commit and handle offsets on your own: committing after handling the HTTP request gives you at-least-once delivery, and committing before gives you at-most-once. Since you're interacting with an HTTP server, Kafka can't give you exactly-once processing.
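To sketch that with kafka-python (the broker address is a placeholder, and max_poll_records is just one way to keep each fetch small), disable auto-commit and commit only after the request has been handled:

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "first_topic",
    bootstrap_servers="localhost:9092",   # placeholder
    group_id="my_group",
    enable_auto_commit=False,             # commit offsets ourselves
    max_poll_records=1,                   # fetch at most one record per poll
)

def get_value_from_topic():
    msg = next(consumer)
    # ... handle the HTTP request here ...
    consumer.commit()                     # commit after processing: at-least-once
    return msg

Committing before processing instead would give at-most-once behaviour, as described above.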

RabbitMQ cleanup after consumers

I am having a problem handling the following scenario:
I have one publisher which wants to upload a lot of binary information (like images), so instead I want it to save the image and publish a path or some reference to that file.
I have multiple different consumers which are reading from this MQ and do different things.
In order to do that, I simply send the information fan-out to some exchange and define a separate queue for each consumer.
This could work just fine, except that it trashes the FS, since no one is responsible for deleting the saved images. I need some way of hooking into the moment when every consumer is done consuming a message from the exchange. Maybe some callback for cleaning up the message in the exchange?
Few notes:
Everything happens locally, we can assume that everything is on the same FS for simplicity.
I know that I can simply let the publisher save the image and hand FS links to the different consumers, but this solution is problematic, since I want the publisher to be oblivious to the consumers. I don't want to update the publisher's code every time a new consumer is added (or one is removed).
I am working with python. (pika module)
I am new to Message Queues, so if you have a better suggestion to get things done, I would love to learn about it.
Once the image is processed by a consumer, publish a FileProcessed message with the information related to the file. That message can be picked up by another consumer which is in charge of cleanup, and that consumer will remove the file.
Additionally, make sure that your messages are re-queued in case of failure, so they will be picked up later and their processing will be retried. Make sure the retry count is limited; when the limit is reached, route the message to a dead letter exchange.
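A minimal sketch of such a cleanup consumer with pika (queue name and message format are made up; if several consumers process the same file, you would also want to count their FileProcessed messages and only delete after the last one):

import json
import os

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="file_processed")   # hypothetical queue name

def on_file_processed(ch, method, properties, body):
    info = json.loads(body)                     # e.g. {"path": "/tmp/img123.png"}
    if os.path.exists(info["path"]):
        os.remove(info["path"])                 # the cleanup consumer owns deletion
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="file_processed", on_message_callback=on_file_processed)
channel.start_consuming()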
Some useful links below:
pika.BasicProperties for handling retries.
RabbitMQ tutorial
Pika DLX Implementation

What happens if I subscribe to the same topic multiple times? (Python, Google Pubsub)

If I have the following code, will anything bad happen? Will it try to create new subscriptions? Is subscribe an idempotent operation?
import time

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()

def f(msg):
    print(msg.data)
    print(msg)
    msg.ack()

def create_subscriptions():
    results = []  # some sql query
    for result in results:
        path = subscriber.subscription_path('PROJECT', result)
        subscriber.subscribe(path, callback=f)

while True:
    time.sleep(60)
    create_subscriptions()
I need to be able to update my subscriptions based on when people create new ones. Is there any problem with this approach?
You should avoid repeatedly calling “subscribe” for the same subscription -- even though you will most likely not increase the number of duplicate messages that are delivered, you would create multiple instances of the receiving infrastructure. This is both inefficient and defeats some of the flow control properties that Pub/Sub provides, since these are only computed per instance of the subscriber; it can cause your subscriber job to run out of memory and fail, for example.
Instead, I would suggest keeping track of which subscribers you’ve already created. Note that the “subscribe” method returns a future that you can use for this purpose, or to cancel message receiving when necessary. You can find more details in the documentation.
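Reusing the names from the question, keeping a dict of the futures returned by subscribe is enough to make the loop idempotent (a sketch, not the answerer's code; f is the callback defined in the question):

import time

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
futures = {}  # subscription path -> StreamingPullFuture

def create_subscriptions():
    results = []  # some sql query
    for result in results:
        path = subscriber.subscription_path('PROJECT', result)
        if path not in futures:   # subscribe only once per path
            futures[path] = subscriber.subscribe(path, callback=f)
    # optionally: futures.pop(path).cancel() for subscriptions that no longer exist

while True:
    time.sleep(60)
    create_subscriptions()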

RabbitMQ Queued messages keep increasing

We have a Windows based Celery/RabbitMQ server that executes long-running python tasks out-of-process for our web application.
What this does, for example, is take a CSV file and process each line. For every line it books one or more records in our database.
This seems to work fine; I can see the records being booked by the worker processes. However, when I check the RabbitMQ server with the management plugin (the web-based management tool), I see the queued messages increasing and not coming back down.
Under connections I see 116 connections, about 10-15 per virtual host, all "running" but when I click through, most of them have 'idle' as State.
I'm also wondering why these connections are still open, and if there is something I need to change to make them close themselves:
Under 'Queues' I can see more than 6200 items with state 'idle', and not decreasing.
So concretely I'm asking if these are normal statistics or if I should worry about the Queues increasing but not coming back down and the persistent connections that don't seem to close...
Other than the rather concise help inside the management tool, I can't seem to find any information about what these stats mean and if they are good or bad.
I'd also like to know why the messages are still visible in the queues and why they are not removed, as the tasks seem to be completed just fine.
Any help is appreciated.
Answering my own question;
Celery sends a result message back for every task invoked by the calling code. This message is sent back via the same AMQP queue.
This is why the tasks were working, but the queue kept filling up. We were not handling these results, or even interested in them.
I added ignore_result=True to the celery task, so the task does not send result messages back into the queue. This was the main solution to the problem.
Furthermore, the configuration option CELERY_SEND_EVENTS=False was added to speed up Celery. If set to True, this option has Celery send events for external monitoring tools.
On top of that, CELERY_TASK_RESULT_EXPIRES=3600 now makes sure that even if results are sent back, they expire after one hour if not picked up/acknowledged.
Finally, CELERY_RESULT_PERSISTENT was set to False, which configures Celery not to store these result messages on disk. They will vanish when the server crashes, which is fine in our case, as we don't use them.
So in short: if you don't need feedback in your app about whether and when the tasks are finished, use ignore_result=True on the Celery task, so that no messages are sent back.
If you do need that information, make sure you pick up and handle the results, so that the queue stops filling up.
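Putting those settings together (old-style setting names, as used in this answer; the app name, broker URL and task are placeholders), the configuration looks roughly like:

from celery import Celery

app = Celery('tasks', broker='amqp://guest@localhost//')   # placeholder broker

app.conf.update(
    CELERY_SEND_EVENTS=False,          # no events for external monitoring tools
    CELERY_TASK_RESULT_EXPIRES=3600,   # any stored results expire after an hour
    CELERY_RESULT_PERSISTENT=False,    # don't persist result messages to disk
)

@app.task(ignore_result=True)          # the main fix: no result message at all
def process_csv_line(line):            # placeholder task
    ...  # book one or more records in the database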
If you don't need the reliability then you can make your queues transient.
http://celery.readthedocs.org/en/latest/userguide/optimizing.html#optimizing-transient-queues
CELERY_DEFAULT_DELIVERY_MODE = 'transient'

Celery Task Grouping/Aggregation

I'm planning to use Celery to handle sending push notifications and emails triggered by events from my primary server.
These tasks require opening a connection to an external server (GCM, APS, email server, etc). They can be processed one at a time, or handled in bulk with a single connection for much better performance.
Often there will be several instances of these tasks triggered separately in a short period of time. For example, in the space of a minute, there might be several dozen push notifications that need to go out to different users with different messages.
What's the best way of handling this in Celery? It seems like the naïve way is to simply have a different task for each message, but that requires opening a connection for each instance.
I was hoping there would be some sort of task aggregator allowing me to process e.g. 'all outstanding push notification tasks'.
Does such a thing exist? Is there a better way to go about it, for example like appending to an active task group?
Am I missing something?
Robert
I recently discovered and have implemented the celery.contrib.batches module in my project. In my opinion it is a nicer solution than Tommaso's answer, because you don't need an extra layer of storage.
Here is an example straight from the docs:
A click counter that flushes the buffer every 100 messages, or every
10 seconds. Does not do anything with the data, but can easily be
modified to store it in a database.
from celery.contrib.batches import Batches

# Flush after 100 messages, or 10 seconds.
@app.task(base=Batches, flush_every=100, flush_interval=10)
def count_click(requests):
    from collections import Counter
    count = Counter(request.kwargs['url'] for request in requests)
    for url, count in count.items():
        print('>>> Clicks: {0} -> {1}'.format(url, count))
Be wary though: it works fine for my usage, but the documentation calls it an "Experimental task class". This might deter some from using a feature with such a volatile description :)
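For what it's worth, the batched task is still invoked like a normal task; the Batches base class buffers the individual calls and hands them to count_click as a list of requests:

# Each call becomes one entry in the `requests` list the task receives.
count_click.delay(url='http://example.org')
count_click.delay(url='http://example.org/other')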
An easy way to accomplish this is to write all the actions a task should take to persistent storage (e.g. a database) and let a periodic job do the actual processing in one batch (with a single connection).
Note: make sure you have some locking in place to prevent the queue from being processed twice!
There is a nice example of how to do something similar at the kombu level (http://ask.github.com/celery/tutorials/clickcounter.html)
Personally I like the way Sentry does something like this to batch increments at the db level (sentry.buffers module)
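A rough sketch of that buffer-then-flush pattern (everything here is hypothetical: the db helper, the push connection, and the task names; the flush task would be triggered periodically by celery beat or cron):

from celery import Celery

app = Celery('notifications', broker='amqp://guest@localhost//')   # placeholder broker

@app.task(ignore_result=True)
def queue_push_notification(user_id, message):
    # hypothetical db helper: just record what has to be sent
    db.insert('pending_push', {'user_id': user_id, 'message': message})

@app.task(ignore_result=True)
def flush_push_notifications():
    # hypothetical atomic "read and delete" -- this is the locking the
    # note above warns about, so the same rows are never processed twice
    rows = db.take_all('pending_push')
    if rows:
        with open_push_connection() as conn:   # one connection for the whole batch
            for row in rows:
                conn.send(row['user_id'], row['message'])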
