Python producer / consumer with data persistence in database?

I'm writing a producer/consumer pipeline for work.
A producer thread fetches log data from a remote server and puts it in a queue. One or more consumer threads read data from the queue and do some work on it. Afterwards, both the data and the result need to be saved (e.g. in an sqlite3 database) for later analysis.
To make sure each piece of log is processed only once, I query the database before consuming each item to see whether it has already been handled. Is there a better way to accomplish this? With more than one consumer thread, database locking looks like it will be a problem.
Relevant code:
import queue  # this module is named Queue in Python 2
import threading
import time

import requests

out_queue = queue.Queue()


class ProducerThread(threading.Thread):
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            # Read the remote log and put the chunk in out_queue.
            resp = requests.get("http://example.com")
            self.out_queue.put(resp)
            # Sleep for some time before fetching again.
            time.sleep(10)


class ConsumerThread(threading.Thread):
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            # Consume the data.
            chunk = self.out_queue.get()
            try:
                # Check whether the chunk has been consumed before by querying the database.
                flag = query_database(chunk)
                if not flag:
                    do_something_with(chunk)
                    # Persist the data and other info to the database.
                    data_persist()
                else:
                    print("data has been consumed before.")
            finally:
                # Signal the queue that this item is done, whichever branch ran.
                self.out_queue.task_done()


def main():
    # Just one producer thread.
    t = ProducerThread(out_queue)
    t.daemon = True
    t.start()
    for i in range(3):
        ct = ConsumerThread(out_queue)
        ct.daemon = True
        ct.start()
    # Wait on the queue until everything has been processed.
    out_queue.join()


main()

If the logs read from the remote server are never duplicated, there is no need to check whether a log has already been processed: the Queue class implements all the required locking semantics, so Queue.get() hands each item to exactly one ConsumerThread.
If the logs could be duplicated (I guess not), do the check in ProducerThread, before adding the logs to the queue, rather than in ConsumerThread. That way you don't need to think about locking.
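A minimal sketch of the producer-side check, assuming a chunk can be identified by a hash of its bytes and that duplicates only need to be caught within one run of the program. The helper name enqueue_if_new and the seen set are made up for illustration and are not part of the original code:

```python
import hashlib
import queue


def enqueue_if_new(chunk, seen, out_queue):
    """Put chunk on the queue unless an identical chunk was enqueued before."""
    digest = hashlib.sha256(chunk).hexdigest()
    if digest in seen:
        return False
    seen.add(digest)
    out_queue.put(chunk)
    return True


# Only the single producer thread touches `seen`, so no lock is needed.
seen_digests = set()
q = queue.Queue()
first = enqueue_if_new(b"log line A", seen_digests, q)
second = enqueue_if_new(b"log line A", seen_digests, q)  # duplicate, skipped
third = enqueue_if_new(b"log line B", seen_digests, q)
```

Because the check runs in the one producer thread, the consumers never have to ask whether an item was seen before.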
Update, based on @dofine's confirmation of my understanding of the requirements in the comments below:
For points #2 and #3, you may need a lightweight persistent queue such as FifoDiskQueue in queuelib. To be honest, I haven't used this library myself, but I think it should work for you. Please check it out.
For point #1, I guess you can achieve it with any (non-memory) database in combination with a second FifoDiskQueue:
- The second queue is for re-queueing a log immediately if a consumer thread fails to process it. Please see my first comment below for the idea.
- There is a single table in the db. The producer thread always adds new records to it but never updates any; a consumer thread only updates the records it has picked from the queue.
- With the above logic, you should never need to lock the table.
- On application startup (before starting the consumers), have the producer query the db for logs that were lost track of because the application terminated unexpectedly.
I typed this update on mobile SO, so it is inconvenient to extend it. If needed, I will update it again when I get a chance.
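The single-table idea could look roughly like this with sqlite3. The table name, column names and status values are assumptions made for the sketch, and an in-memory database is used only so the example runs on its own; a real setup would use a file on disk:

```python
import sqlite3

# Hypothetical schema: one row per log chunk, keyed by a caller-supplied id.
SCHEMA = """
CREATE TABLE IF NOT EXISTS logs (
    log_id  TEXT PRIMARY KEY,
    payload TEXT NOT NULL,
    result  TEXT,
    status  TEXT NOT NULL DEFAULT 'pending'
)
"""


def producer_insert(conn, log_id, payload):
    """Producer: add a new record; return False if log_id already exists."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO logs (log_id, payload) VALUES (?, ?)",
        (log_id, payload),
    )
    conn.commit()
    return cur.rowcount == 1


def consumer_mark_done(conn, log_id, result):
    """Consumer: update only the record it picked from the queue."""
    conn.execute(
        "UPDATE logs SET result = ?, status = 'done' WHERE log_id = ?",
        (result, log_id),
    )
    conn.commit()


def pending_on_startup(conn):
    """Startup: find records that were never marked as done."""
    rows = conn.execute(
        "SELECT log_id, payload FROM logs WHERE status = 'pending' ORDER BY log_id"
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
inserted = producer_insert(conn, "a1", "first chunk")
duplicate = producer_insert(conn, "a1", "first chunk")
producer_insert(conn, "b2", "second chunk")
consumer_mark_done(conn, "a1", "ok")
leftover = pending_on_startup(conn)
```

Note that by default a sqlite3 connection may only be used from the thread that created it, so in the threaded program each thread would open its own connection.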

Related

RabbitMQ bulk consuming messages solution

I'm using RabbitMQ as a queue for different messages. I consume these messages with two different consumers from one queue, process them, and insert the processing results into a DB:
def consumer_callback(self, channel, delivery_tag, properties, message):
    result = make_some_processing(message)
    insert_to_db(result)
    channel.basic_ack(delivery_tag)
I want to consume messages from the queue in bulk to reduce DB load. Since RabbitMQ does not support bulk reads by consumers, I'm going to do something like this:
some_messages_list = []

def consumer_callback(self, channel, delivery_tag, properties, message):
    # Keep the delivery tag next to each message so it can be acked later.
    some_messages_list.append((delivery_tag, message))
    if len(some_messages_list) > 1000:
        results_list = make_some_processing_bulk(some_messages_list)
        insert_to_db_bulk(results_list)
        # Ack only after the bulk insert has succeeded.
        for tag, _ in some_messages_list:
            channel.basic_ack(tag)
        some_messages_list.clear()
- Messages stay in the queue until they have all been fully processed
- If the consumer crashes or disconnects, the messages stay safe
What do you think about this solution?
If it's okay, how can I get all the unacknowledged messages again if the consumer crashes?

I've tested this solution for several months and can say that it works pretty well. Until AMQP provides a feature for bulk consuming, we have to use workarounds like this one.
Note: if you decide to use this solution, beware of concurrent consumption with several consumers (threads), or use locks (I used Python's threading.Lock) to guarantee that no race conditions occur.
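A rough sketch of the locking idea, assuming the callback receives a channel object with a basic_ack(delivery_tag) method as in the question. BatchBuffer, handle_message and FakeChannel are names invented for this example; FakeChannel only stands in for a real channel so the snippet can run without a broker:

```python
import threading


class BatchBuffer:
    """Collect (delivery_tag, message) pairs and hand them out in batches."""

    def __init__(self, batch_size):
        self._batch_size = batch_size
        self._items = []
        self._lock = threading.Lock()

    def add(self, delivery_tag, message):
        """Store one message; return a full batch, or None if not full yet."""
        with self._lock:
            self._items.append((delivery_tag, message))
            if len(self._items) < self._batch_size:
                return None
            # Swap the list out under the lock so no other thread sees it.
            batch, self._items = self._items, []
            return batch


def handle_message(buffer, channel, delivery_tag, message, insert_bulk):
    """Buffer the message; on a full batch, insert in bulk and then ack."""
    batch = buffer.add(delivery_tag, message)
    if batch is None:
        return
    insert_bulk([msg for _, msg in batch])
    for tag, _ in batch:
        channel.basic_ack(tag)


class FakeChannel:
    """Stand-in for a broker channel; it only records acked tags."""

    def __init__(self):
        self.acked = []

    def basic_ack(self, delivery_tag):
        self.acked.append(delivery_tag)


stored = []
channel = FakeChannel()
buffer = BatchBuffer(batch_size=3)
for tag in range(1, 8):
    handle_message(buffer, channel, tag, "msg-%d" % tag, stored.extend)
```

Only the buffer manipulation happens under the lock; the bulk insert and the acks run outside it, so a slow database write does not block other threads from buffering.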

ECS task only able to pick one message from SQS queue

My architecture works like this:
As soon as a message is sent to an SQS queue, an ECS task picks up the message and processes it.
This means that if X messages are sent to the queue, X ECS tasks are spun up in parallel. Each ECS task fetches only one message at a time (per the code below).
The ECS task runs a dockerized Python container and uses the boto3 SQS client to retrieve and parse the SQS message:
import logging

import boto3


def get_sqs_task_data(queue_url):
    client = boto3.client('sqs')
    response = client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1
    )
    return response


def parse_sqs_message(response_sqs_message):
    if 'Messages' not in response_sqs_message:
        logging.info('No messages found in queue')
        return None
    # ... parse it and return a dict
    return {
        'data_1': ...,
        'data_2': ...
    }


sqs_response = get_sqs_task_data('<sqs_queue_url>')
sqs_message = parse_sqs_message(sqs_response)
while sqs_message is not None:
    # Process it
    # Delete it from the queue
    # Get the next message in the queue
    sqs_response = get_sqs_task_data('<sqs_queue_url>')
    sqs_message = parse_sqs_message(sqs_response)
All in all, pretty straightforward.
In get_sqs_task_data(), I explicitly specify that I want to retrieve only one message (because one ECS task has to process only one message).
In parse_sqs_message(), I test whether there are any messages left in the queue with
if 'Messages' not in response_sqs_message:
    logging.info('No messages found in queue')
    return None
When there is only one message in the queue (meaning one ECS task has been triggered), everything is working fine. The ECS task is able to pick the message, process it and delete it.
However, when the queue is populated with X messages (X > 1) at the same time, X ECS tasks are triggered, but only one ECS task is able to fetch a message and process it.
All the other ECS tasks exit with No messages found in queue, although there are X - 1 messages left to be processed.
Why is that? Why are the other tasks not able to pick up the remaining messages?
If it matters, the VisibilityTimeout of the SQS queue is set to 30 minutes.
Any help would be greatly appreciated!
Feel free to ask for more details if needed.

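I forgot to post an answer to this question.
The problem was that the SQS queue was set up as a FIFO queue.
A FIFO queue only allows one consumer at a time per message group (to preserve the order of the messages), so when all messages share the same message group ID, the other tasks receive nothing while one message is in flight. Changing it to a normal (standard) queue fixed the issue.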

I'm not sure I understand how the tasks are triggered from SQS, but from what I read in the SQS SDK documentation, this can happen with short polling when the number of messages is small. From the get_sqs_task_data definition, I see that you are using short polling.
Short poll is the default behavior where a weighted random set of machines
is sampled on a ReceiveMessage call. Thus, only the messages on the
sampled machines are returned. If the number of messages in the queue
is small (fewer than 1,000), you most likely get fewer messages than you requested
per ReceiveMessage call.
If the number of messages in the queue is extremely small, you might not receive any messages in a particular ReceiveMessage response.
If this happens, repeat the request.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sqs.html#SQS.Client.receive_message
You might want to try long polling by passing WaitTimeSeconds to receive_message (the maximum value is 20 seconds).
I hope it helps
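A minimal sketch of the long-polling variant, assuming the client is created with boto3.client('sqs') as in the question. The function name receive_one is invented for this example, and FakeSqsClient is only a stand-in so the snippet runs without AWS credentials:

```python
def receive_one(client, queue_url, wait_seconds=20):
    """Long-poll for a single message; return it, or None if none arrived."""
    response = client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=wait_seconds,  # a value from 1 to 20 enables long polling
    )
    messages = response.get("Messages", [])
    return messages[0] if messages else None


class FakeSqsClient:
    """Stand-in for boto3.client('sqs'), used only to exercise the sketch."""

    def __init__(self, messages):
        self._messages = list(messages)
        self.calls = []

    def receive_message(self, **kwargs):
        self.calls.append(kwargs)
        if not self._messages:
            return {}
        return {"Messages": [self._messages.pop(0)]}


fake = FakeSqsClient([{"MessageId": "1", "Body": "hello"}])
first = receive_one(fake, "<sqs_queue_url>")
second = receive_one(fake, "<sqs_queue_url>")
```

With a real client, the call waits up to wait_seconds for a message to arrive instead of returning an empty response right away.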

How can I asynchronously receive processed messages in Celery?

I'm writing a data processing pipeline using Celery because this speeds things up considerably.
Consider the following pseudo-code:
from celery.result import ResultSet
from some_celery_app import processing_task  # decorated with @app.task

def crunch_data():
    results = ResultSet([])
    for document in mongo.find():  # Around 100K - 1M documents
        job = processing_task.delay(document)
        results.add(job)
    return results.get()

collected_data = crunch_data()
# Do some stuff with this collected data
I successfully spawn four workers with concurrency enabled and when I run this script, the data is processed accordingly and I can do whatever I want.
I'm using RabbitMQ as message broker and rpc as backend.
What I see when I open the RabbitMQ management UI:
First, all the documents are processed
then, and only then, are the documents retrieved by the collective results.get() call.
My question: Is there a way to do the processing and subsequent retrieval simultaneously? In my case, as all documents are atomic entities that do not rely on each other, there seems to be no need to wait for the job to be processed completely.

You could try the callback parameter of ResultSet.get(), i.e. results.get(callback=cbResult), and process each result in the callback.
def cbResult(task_id, value):
    print(value)

results.get(callback=cbResult)

Sync message to twitter in background in a web application

I'm writing a web app. Users can post text, and I need to store the posts in my DB as well as sync them to a Twitter account.
The problem is that I'd like to respond to the user immediately after inserting the message into the DB, and run the "sync to Twitter" process in the background.
How could I do that? Thanks

Either you choose zrxq's solution, or you can do it with a thread, if you take care of two things:
- you don't tamper with objects from the main thread (be careful with iterators),
- you take good care of killing your thread once the job is done.
Something that would look like:
import threading

class TwitterThreadQueue(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        # One list per instance rather than a list shared by the whole class.
        self.queue = []

    def run(self):
        while len(self.queue) != 0:
            post_on_twitter(self.queue.pop())  # here is your code to post on twitter

    def add_to_queue(self, msg):
        self.queue.append(msg)

and then you instantiate it in your code:
tweetQueue = TwitterThreadQueue()
# ...
tweetQueue.add_to_queue(message)
tweetQueue.start() # you can check if it's not already started
# ...
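A variation on the same idea, sketched with the standard library's queue.Queue, which takes care of the locking between the request handler and the background thread. The function name start_sync_worker is made up for this example, and posted.append stands in for the real Twitter call:

```python
import queue
import threading


def start_sync_worker(post_func):
    """Start a daemon thread that calls post_func for every queued message."""
    messages = queue.Queue()

    def worker():
        while True:
            msg = messages.get()  # blocks until a message is available
            try:
                post_func(msg)
            except Exception as exc:
                # A failed post must not kill the worker thread.
                print("sync failed: %r" % (exc,))
            finally:
                messages.task_done()

    threading.Thread(target=worker, daemon=True).start()
    return messages


posted = []
sync_queue = start_sync_worker(posted.append)
sync_queue.put("hello")
sync_queue.put("world")
sync_queue.join()  # only for the demo; a web handler would return right after put()
```

The worker is started once and keeps waiting for new messages, so there is no need to check whether the thread is already running before each post.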

django,fastcgi: how to manage a long running process?

I have inherited a django+fastcgi application which needs to be modified to perform a lengthy computation (up to half an hour or more). What I want to do is run the computation in the background and return a "your job has been started" -type response. While the process is running, further hits to the url should return "your job is still running" until the job finishes at which point the results of the job should be returned. Any subsequent hit on the url should return the cached result.
I'm an utter novice at django and haven't done any significant web work in a decade so I don't know if there's a built-in way to do what I want. I've tried starting the process via subprocess.Popen(), and that works fine except for the fact it leaves a defunct entry in the process table. I need a clean solution that can remove temporary files and any traces of the process once it has finished.
I've also experimented with fork() and threads and have yet to come up with a viable solution. Is there a canonical solution to what seems to me to be a pretty common use case? FWIW this will only be used on an internal server with very low traffic.

I have to solve a similar problem now. It is not going to be a public site, but similarly, an internal server with low traffic.
Technical constraints:
all input data to the long running process can be supplied on its start
long running process does not require user interaction (except for the initial input to start a process)
the time of the computation is long enough so that the results cannot be served to the client in an immediate HTTP response
some sort of feedback (sort of progress bar) from the long running process is required.
Hence, we need at least two web “views”: one to initiate the long running process, and the other, to monitor its status/collect the results.
We also need some sort of interprocess communication: send user data from the initiator (the web server, on an HTTP request) to the long running process, and then send its results to the receiver (again the web server, driven by HTTP requests). The former is easy, the latter is less obvious. Unlike in normal Unix programming, the receiver is not known initially. The receiver may be a different process from the initiator, and it may start while the long running job is still in progress or after it has finished. So pipes do not work, and we need some permanence for the results of the long running process.
I see two possible solutions:
dispatch launches of the long running processes to the long running job manager (this is probably what the above-mentioned django-queue-service is);
save the results permanently, either in a file or in DB.
I preferred to use temporary files and to remember their location in the session data. I don't think it can be made any simpler.
A job script (this is the long running process), myjob.py:
import sys
from time import sleep

i = 0
while i < 1000:
    print('myjob: %d' % i)
    i = i + 1
    # Flush before sleeping so each line reaches the output file right away.
    sys.stdout.flush()
    sleep(0.1)
django urls.py mapping:
urlpatterns = patterns('',
    (r'^startjob/$', 'mysite.myapp.views.startjob'),
    (r'^showjob/$', 'mysite.myapp.views.showjob'),
    (r'^rmjob/$', 'mysite.myapp.views.rmjob'),
)
django views:
from tempfile import mkstemp
from os import fdopen, unlink, kill
from subprocess import Popen
import signal

from django.http import HttpResponse, HttpResponseRedirect


def startjob(request):
    """Start a new long running process unless already started."""
    if 'job' not in request.session:
        # create a temporary file to save the results
        outfd, outname = mkstemp()
        request.session['jobfile'] = outname
        outfile = fdopen(outfd, 'a+')
        proc = Popen("python myjob.py", shell=True, stdout=outfile)
        # remember pid to terminate the job later
        request.session['job'] = proc.pid
    return HttpResponse('A new job has started.')


def showjob(request):
    """Show the last result of the running job."""
    if 'job' not in request.session:
        return HttpResponse('Not running a job. '
                            'Start a new one?')
    filename = request.session['jobfile']
    with open(filename) as results:
        lines = results.readlines()
    try:
        return HttpResponse(lines[-1] +
                            '<p>Terminate?')
    except IndexError:
        # the job has not written any output yet
        return HttpResponse('No results yet.' +
                            '<p>Terminate?')


def rmjob(request):
    """Terminate the running job."""
    if 'job' in request.session:
        job = request.session['job']
        filename = request.session['jobfile']
        try:
            kill(job, signal.SIGKILL)  # unix only
            unlink(filename)
        except OSError:
            pass  # probably the job has finished already
        del request.session['job']
        del request.session['jobfile']
    return HttpResponseRedirect('/startjob/')  # start a new one
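On the defunct process entry mentioned in the question: a finished child stays in the process table until its parent collects the exit status. The sketch below assumes, unlike the views above, that the Popen object itself is kept alive in the server process (for example in a module-level dict) rather than only its pid. The names start_job and job_status are invented for this example:

```python
import os
import subprocess
import sys
import tempfile


def start_job(args):
    """Start args as a child process that writes its stdout to a temp file."""
    out = tempfile.NamedTemporaryFile(mode="w+", suffix=".log", delete=False)
    proc = subprocess.Popen(args, stdout=out)
    out.close()  # the child keeps its own handle to the file
    return proc, out.name


def job_status(proc, path):
    """Return ('running' or 'finished', last output line or None)."""
    # poll() returns None while the child runs. Once the child has exited,
    # the call also collects its exit status, so no defunct entry remains.
    state = "running" if proc.poll() is None else "finished"
    with open(path) as results:
        lines = results.read().splitlines()
    return state, (lines[-1] if lines else None)


proc, path = start_job([sys.executable, "-c", "print('step 1'); print('step 2')"])
proc.wait()  # only for the demo; a status view would just call job_status()
status = job_status(proc, path)
os.unlink(path)  # remove the temporary file once the result has been read
```

This only works while the same server process handles the follow-up requests, so with several worker processes the pid-in-session approach above (or a separate job manager) is still needed.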

Maybe you could look at the problem the other way around.
Maybe you could try DjangoQueueService and have a "daemon" listening to the queue, checking whether there's something new and processing it.
