I want to make a Flask application/API with Gunicorn that, on every request:
reads a single value from a Kafka topic,
does some processing,
and returns the processed value to the user (or any application calling the API).
So far I couldn't find any examples of this. So, is the following function the correct way of doing it?
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "first_topic",
    bootstrap_servers='xxxxxx',
    auto_offset_reset='xxxx',
    group_id="my_group")

def get_value_from_topic():
    for msg in consumer:
        return msg

if __name__ == "__main__":
    print(get_value_from_topic())
Or is there a better way of doing this, using a library like Faust?
My reason for using Kafka is to avoid all the hassle of synchronization among the Flask workers (as would arise with a traditional database), because I want to use a value from Kafka only once.
This seems okay at first glance: your consumer iterator is advanced once, and you return that value.
The more idiomatic way to do that, however, would be:

def get_value_from_topic():
    return next(consumer)

With your other settings, though, there's no guarantee this polls only one message, because Kafka consumers poll in batches and will auto-commit those batches of offsets. Therefore, you'll want to disable auto-commit and handle offsets on your own: committing after handling the HTTP request gives you at-least-once delivery, and committing before gives you at-most-once. Since you're interacting with an HTTP server, Kafka can't give you exactly-once processing.
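For example, a minimal sketch of the manual-commit approach inside a Flask view (the /value route and the processing step are illustrative; enable_auto_commit and max_poll_records are kafka-python options):

from flask import Flask, jsonify
from kafka import KafkaConsumer

app = Flask(__name__)

consumer = KafkaConsumer(
    "first_topic",
    bootstrap_servers='xxxxxx',
    group_id="my_group",
    enable_auto_commit=False,  # manage offsets ourselves
    max_poll_records=1)        # fetch as few records as possible per poll

@app.route("/value")
def get_value():
    msg = next(consumer)         # blocks until a record arrives
    result = msg.value.decode()  # stand-in for the real processing
    consumer.commit()            # commit after handling -> at-least-once
    return jsonify(result)

Committing before the processing line instead would give at-most-once behaviour. Under Gunicorn, each worker process builds its own consumer, and with a shared group_id the group coordinator splits the topic's partitions among them.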
I am listening to financial data published by Google Cloud Platform, Monday to Friday. I would like to save all messages to disk. I am doing this in Python.
I need to recover any missing packets if my application goes down. I understand Google will automatically resend un-ack'd messages.
The GCP documentation lists many subscription techniques (asynchronous/synchronous, push/pull, streaming pull, etc.). There is an asynchronous code sample:
from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

# project_id = "your-project-id"
# subscription_id = "your-subscription-id"

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

def callback(message):
    print(f"Received {message}.")
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
print(f"Listening for messages on {subscription_path}..\n")

# Wrap subscriber in a 'with' block to automatically call close() when done.
with subscriber:
    try:
        # When `timeout` is not set, result() will block indefinitely,
        # unless an exception is encountered first.
        streaming_pull_future.result(timeout=5)
    except TimeoutError:
        streaming_pull_future.cancel()
https://cloud.google.com/pubsub/docs/pull
Is the callback thread-safe, i.e., is there only one thread calling back?
What is the best way to ignore already-received messages? Does the client need to maintain a map?
UPDATE for Kamal Aboul-Hosn
I think I can persist OK, but my problem is that I need to manually check that all messages have indeed been received. To do this I enabled ordered delivery. Our message data contains a sequence number, so I wanted to add a global variable like next_expected_seq_num. After I receive each message, I will process and ack it and increment next_expected_seq_num.
However, if I have, say, 10 threads invoking the callback method, I assume any of the 10 could receive the next message? I'd then have to make my callback method smart enough to block processing on the other 9 threads while the 10th thread processes the next message. Something like:
(pseudo code)
def callback(msg):
    global next_expected_seq_num
    seq_num = get_seq_num(msg.data)
    while seq_num != next_expected_seq_num:  # busy-wait; make atomic
        pass
    # When we reach here, we have the next message
    assert not db.exists(seq_num)
    # persist message
    next_expected_seq_num += 1  # make atomic / cannot happen earlier
    msg.ack()
Should I just disable multiple callback threads, given that I'm preventing multithreading anyway?
Is there a better way to check/guarantee we process every message?
I'm wondering if we should trust GCP the way we trust TCP, enable multithreading, and just lock around the database write?
def callback(msg):
    seq_num = get_seq_num(msg.data)
    with lock:
        if not db.exists(seq_num):
            persist(msg)  # persist message
    msg.ack()
The callback is not thread-safe if you are running in a Python environment that doesn't have a global interpreter lock. Multiple callbacks could be executed in parallel in that case, and you would have to guard any shared data structures with locks.
Since Cloud Pub/Sub has at-least-once delivery semantics, if you need to ignore duplicate messages then yes, you will need to maintain some kind of data structure with the already-received messages. Note that duplicates could be delivered across subscriber restarts. Therefore, you will probably need this to be some kind of persistent storage. Redis tends to be a popular choice for this type of deduplication.
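For illustration, a minimal Redis-based deduplication sketch (assumes the redis-py package; the key naming and TTL are arbitrary choices, and get_seq_num/persist are hypothetical helpers in the spirit of the question):

import redis

r = redis.Redis()

def callback(message):
    seq_num = get_seq_num(message.data)  # hypothetical helper
    # SET with nx=True is atomic: it succeeds only for the first writer of a
    # key, so duplicates across threads and subscriber restarts are skipped.
    if r.set('seen:%d' % seq_num, 1, nx=True, ex=86400):
        persist(message)  # hypothetical persistence step
    message.ack()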
With ordered delivery, it is guaranteed that the callback will only run for one message for an ordering key at a time. Therefore, you would not have to program expecting multiple messages to be running simultaneously for the key. Note that in general, using ordering keys to totally order all messages in the topic will only work if your throughput is no more than 1MB/s as that is the publish limit for messages with ordering keys. Also, only use ordering keys if it is important to process the messages in order.
With regard to when to use multithreading or not, it really depends on the nature of the processing. If most of the callback would need to be guarded with a lock, then multithreading won't help much. If, though, only small portions need to be guarded by locks, e.g., checking for duplicates, while most of the processing can safely be done in parallel, then multithreading could result in better performance.
If all you want to do is prevent duplicates, then you probably don't need to guard the writes to the database with a lock unless the database doesn't guarantee consistency. Also, keep in mind that the locking only helps if you have a single subscriber client.
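For example, a sketch that guards only the duplicate check with a lock so the expensive persistence work can run in parallel (in-memory only; per the answer above, a persistent store is needed to survive restarts):

import threading

seen_lock = threading.Lock()
seen = set()

def callback(message):
    seq_num = get_seq_num(message.data)  # hypothetical helper
    # Hold the lock only for the membership check, not the whole callback.
    with seen_lock:
        if seq_num in seen:
            message.ack()  # duplicate: acknowledge and drop
            return
        seen.add(seq_num)
    persist(message)  # hypothetical; safe to run concurrently
    message.ack()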
I have hosted a Flask app on Heroku, written in Python. I have a function which is something like this:
@app.route("/execute")
def execute():
    doSomething()
    return Response()
Now, the problem is that doSomething() takes more than 30 seconds to execute, exceeding Heroku's 30-second request timeout, which kills the request.
I could make another thread and execute doSomething() inside it, but the Response object needs to return a file that will be made available only after doSomething() has finished execution.
I also tried working with generators and yield, but couldn't get them to work either. Something like:
@app.route("/execute")
def execute():
    def generate():
        yield ''
        doSomething()
        yield file
    return Response(generate())
but the app requires me to refresh the page in order to get the second yielded object.
What I basically need to do is return an empty Response object initially, start the execution of doSomething(), and then return another Response object. How do I accomplish this?
Usually with HTTP, one request means one response; that's it.
For your issue you might want to look into:
Streaming responses, which are used for large responses with many parts (see the sketch below).
Sockets, to allow multiple "responses" for a single "request".
Making multiple queries from your client; if you have control over your client code, this is most likely the easiest solution.
I'd recommend reading this; it gets a bit technical, but it helped me understand a lot of things.
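For the streaming option, a minimal sketch with Flask's Response wrapping a generator (the /execute route and doSomething() come from the question; the chunk contents are illustrative):

from flask import Flask, Response

app = Flask(__name__)

@app.route("/execute")
def execute():
    def generate():
        yield 'processing...\n'  # first chunk is sent immediately
        data = doSomething()     # the long-running work
        yield data               # second chunk arrives when the work is done
    return Response(generate(), mimetype='text/plain')

A plain HTTP client such as curl prints both chunks as they arrive; browsers often buffer streamed output, which may explain the refresh behaviour described in the question.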
What you are trying to make is an asynchronous job. For that I recommend you use Celery (here you have a good example: https://blog.miguelgrinberg.com/post/using-celery-with-flask/page/7) or another tool for asynchronous jobs. In the front end you can do simple polling to wait for the response, or use SocketIO (https://socket.io/). It's a simple and efficient solution.
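A minimal Celery sketch of this pattern (the Redis broker URL, the route names, and the doSomething() stand-in are assumptions, not from the original post); the front end polls /result/<task_id> until the job finishes:

from celery import Celery
from flask import Flask, jsonify

app = Flask(__name__)
celery = Celery(__name__,
                broker='redis://localhost:6379/0',
                backend='redis://localhost:6379/0')

@celery.task
def do_something_task():
    return doSomething()  # the long-running work from the question

@app.route("/execute")
def execute():
    task = do_something_task.delay()  # enqueue and return immediately
    return jsonify({'task_id': task.id}), 202

@app.route("/result/<task_id>")
def result(task_id):
    res = do_something_task.AsyncResult(task_id)
    if res.ready():
        return jsonify({'result': res.get()})
    return jsonify({'status': 'pending'}), 202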
It's basically an asynchronous job. You can use Celery or asyncio for these operations. You should never ask a user to wait more than 3 to 10 seconds for any operation.
1) Make an AJAX request.
2) Initialize a socket that listens for your operation.
3) As soon as the operation finishes, the socket sends the message back, and you can show it to the user through a popup.
This is the best approach you can take.
If you could share what computation you are making, you might get more alternative approaches.
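A minimal sketch of this flow with Flask-SocketIO (assumes the flask-socketio package; the event name and the doSomething() stand-in are illustrative):

from flask import Flask
from flask_socketio import SocketIO

app = Flask(__name__)
socketio = SocketIO(app)

@app.route("/execute")
def execute():
    # Kick off the long operation and respond immediately.
    socketio.start_background_task(long_operation)
    return '', 202

def long_operation():
    result = doSomething()  # the slow work
    # Push the result to connected clients once it is ready.
    socketio.emit('operation_done', {'result': result})

if __name__ == '__main__':
    socketio.run(app)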
I have a Django application that uses large in-memory data structures (due to performance constraints). This wouldn't be a problem, but I'm using Heroku, where if the Python web process takes more than 30s to start, it is stopped and counted as a timeout error. Because of the aforementioned problem, I've used a daemon process (a worker in Heroku) to handle the construction of the data structures, and Redis to handle the message passing between processes.
When the worker finishes(approx 1 minute), it stores the data structures(50Mb or so) in Redis.
And now comes the crux of the matter... Django follows the request/response paradigm and it's synchronous. This implies a Django view must exist to handle the callback from the worker announcing it's done. Even if I use something fancier like Redis pub/sub, I'm still forced to evaluate the queue populated by a publisher inside a view.
How can I circumvent the necessity of using a Django view? Isn't there an async way of doing this?
Below is the solution where I use a pub/sub inside a view. This seems bad, but I can't think of another way.
views.py
...
# data_handler can enqueue tasks on the default queue
data_handler = DataHandler()
strict_redis = redis.from_url(settings.DEFAULT_QUEUE)
pub_sub = strict_redis.pubsub()

# this puts the job of constructing the large data structures
# on the default queue so a worker can pick it up. Being async,
# it returns with an empty set of data structures.
data_structures = data_handler.start()
pub_sub.subscribe(settings.FINISHED_DATA_STRUCTURES_CHANNEL)

@require_http_methods(['POST'])
def store_and_fetch(request):
    user_data = json.loads(request.body.decode('utf8'))
    message = pub_sub.get_message()
    if message:
        command = message['data'] if 'data' in message else ''
        if command == settings.FINISHED_DATA_STRUCTURES_INIT.encode('utf-8'):
            # this takes the data from redis and updates data_structures
            data_handler.update(data_structures)
    return HttpResponse(compute_response(user_data, data_structures))
Update: After working with this for multiple months, I can now say it's definitely better (and wiser) NOT to fiddle with Django's request/response cycle. There are tools like Django RQ Scheduler or Celery that handle async tasks just fine. If you want to update the main web process after some repeatable job completes, it's simpler to use something like the Python requests package, sending a POST to the web process from the worker that ran the scheduled job. This way we don't circumvent Django's mechanisms, and, more importantly, it's simpler overall.
Regarding the Heroku constraints I mentioned at the beginning of the post: at the time I wrote this question I was quite a newbie with Heroku and didn't know much about the release phase. In the release phase we can set up all the complex logic we need for the main process. At the end of the release phase, we simply notify the web process in the manner described above and use some distributed memory buffer (even Redis will work just fine).
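A minimal sketch of that worker-to-web notification (assumes the requests package; the endpoint URL and the helper names are hypothetical):

import requests

def run_scheduled_job():
    data_structures = build_data_structures()  # hypothetical long-running step
    store_in_redis(data_structures)            # hypothetical, as described above
    # Tell the web process the job finished; a normal Django view handles this.
    requests.post('https://example.herokuapp.com/data-structures-ready',
                  json={'status': 'done'}, timeout=10)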
I am trying to solve the problem of sending mails (or running any long task) in a web.py project. What I want is to start sending a mail and return the HTTP response immediately, but the sending takes a long time. Is there any solution?
Example:
import web

# some settings, urls, etc.

class Index:
    def GET(self):
        # task
        sending_mail()
        return 'response'
I found many examples about async tasks, but I think that if this task is put in the background and 'response' is returned, it will fail.
You could get away with sending email in a separate thread (you can spawn one when you need to send an email):

import threading
threading.Thread(target=sending_mail).start()
However, the all-around best (and standard) solution would be to use an asynchronous task processor, such as Celery. In your web thread, simply create a new task, and Celery will asynchronously execute it.
There is no reason why "returning response" would fail when using a message queue, unless your response depends on the email being sent prior to sending the response (but in that case, you have an architectural problem).
Moving the sending_email() task to a background queue would be the best solution. This would allow you to return the response immediately and get the results of the sending_email task later on.
Let me also suggest taking a look at RQ
It is a lightweight alternative to Celery that I find easier to get up and running. I have used it in the past for sending emails in the background and it didn't disappoint.
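For illustration, a minimal RQ version of the web.py handler from the question (assumes a local Redis instance, and an `rq worker` process running to execute the job):

from redis import Redis
from rq import Queue

q = Queue(connection=Redis())

class Index:
    def GET(self):
        # Enqueue the slow task; the worker process sends the mail later.
        q.enqueue(sending_mail)
        return 'response'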
When I get a GET request from a user, I send them the response and then spend maybe a second logging stuff about that request. Is there a way to close the connection when I have the response ready, but continue doing that logging part, so that the user wouldn't have to wait for it to complete?
From the Google App Engine docs for the Response object:
App Engine does not support sending data to the user's browser before exiting the handler. Some web servers use this technique to "stream" data to the user's browser over a period of time in response to a single request. App Engine does not support this streaming technique.
So there's no easy way. If you have a bundle of data that you can pass to a longer-running "process and log" method, try using the deferred library. Note that this will require bundling your data up and sending it to the task queue to do your processing and logging, so
you may not save much time, and
the results may not look much like you'd want; for example, you'd be logging from a different request, so you might need to radically alter the logging.
Still, you could try; a minimal sketch follows.
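A sketch of the deferred approach on the first-generation Python App Engine runtime (process_and_log and the handler are hypothetical stand-ins):

import webapp2
from google.appengine.ext import deferred

def process_and_log(data):
    # Runs later, in a separate task queue request; do the slow logging here.
    pass

class MainHandler(webapp2.RequestHandler):
    def get(self):
        data = dict(self.request.GET)          # bundle what the task needs
        deferred.defer(process_and_log, data)  # enqueue; returns quickly
        self.response.write('done')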
You have two options:
Use the Task Queue API. Enqueueing a task should be fast, so long as you have less than 10k of data (which is the limit on a Task Queue payload); a sketch follows this list.
Use the 'sneaky' trick described by Rafe in this video to do processing after the response completes.
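For the first option, a minimal Task Queue sketch (the /worker/log URL and the handler names are hypothetical):

import webapp2
from google.appengine.api import taskqueue

class MainHandler(webapp2.RequestHandler):
    def get(self):
        # Enqueue the logging work; the payload must stay under the size limit.
        taskqueue.add(url='/worker/log', params={'path': self.request.path})
        self.response.write('done')  # the user gets the response right away

class LogWorker(webapp2.RequestHandler):
    def post(self):
        # Runs later as a separate request; do the slow logging here.
        path = self.request.get('path')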