I need different messages to be sent to the queue - each by its personal schedule. So I have message list and related interval to resend each one. I use rabbitMQ/pika and apscheduler.
According to numerous examples, I created the simplest BlockingConnection/channel/queue. When immediately after that I try to push messages - everything works fine, I can see in rabbitmq web-interface that all the messages become in the queue. Here is the piece of code that works:
self.cr = Queue('DIRECT_C_QUEUE', True, ex_type='direct')
for i in range(1,10000):
self.cr.channel.basic_publish(exchange='', routing_key='DIRECT_C_QUEUE', body='hello_world')
But if I try to push messages (in exactly same way) via apscheduler callback function - only few (about 1-10) messages appear in the queue (but callbacks are fired all the time and there are no any exception when publishing message!).
Finally I begin to receive such warnings:
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pika/connection.py:642: UserWarning: Pika: Write buffer exceeded warning threshold at 1125 bytes and an estimated 43 frames behind
warn(message % (self.outbound_buffer.size, est_frames_behind))
and still no new messages in the queue.
I am new in python, any help is much appreciate.
I found source of the problem:
apscheduler runs basic_publish calls in separate thread, and pika doesn't recommend share connections between threads - http://pika.github.com/faq.html
So I had choice either create new connection each time, or put new messages in some queue and publish them from the main thread (where connection was created).
I fixed that problem by increasing ulimit
edit /etc/default/rabbitmq-server and set
ulimit -n 4096
then restart rabbitmq
sudo /etc/init.d/rabbitmq-server restart
Related
I'm having some issues with the pyghmi python library, which is used for sending IPMI commands with python scripts. My goal is to implement an HTTP API to send IPMI commands through HTTP requests.
I am already able to create a Session and send a few commands with the library, but if the Session remains IDLE for 30 seconds, it logged itself out.
When the Session is logged out, I can't create a new one : I get an error "Session is logged out", or a deadlock.
How can I do if I want to have a server that is always up and create Session when it receives requests, if I can't create new Session when the previous one is logged out ?
What I've tried :
from pyghmi.ipmi import command
ipmi = command.Command(ip, user, passwd)
res = ipmi.get_power()
print(res)
# wait 30 seconds
res2 = ipmi.get_power() # get "Session logged out" error
ipmi2 = command.Command(ip, user, paswd) # Deadlock if wait < 30 seconds, else no error
res3 = ipmi2.get_power() # get "Session logged out" error
# Impossible to create new command.Command() Session, every command will give "logged out" error
The other problem is that I can't use the asynchronous way by giving an "onlogon callback" function in the command.Command() call, because I will need the callback return value in the caller and that's not possible with this sort of thread behavior.
Edit: I already tried some examples provided here but it's always one-time run scripts, whereas I'm looking for something that can stay "up" forever.
So I finally achieved a sort of solution. I emailed the Pyghmi's main contributor and he said that this lib was not suited for a multi- and reuseable- Session implementation (there is currently an open issue "Session reuse" on Pyghmi repository).
First "solution": use processes
My goal was to create an HTTP API. To avoid the Session timeout issue, I create a new Process (not Thread) for every new request. That works fine, but I did not keep this solution because it is to heavy and sockets consuming. It seems that by creating processes, the memory used by Pyghmi is not shared between processes (that's the goal of processes) so every Session utilisation is not a reuse but a creation.
Second "solution" : use Confluent
Confluent is a tool developed by Lenovo that allow to control hardware via HTTP. It uses a sort of patched version of Pyghmi as backend for IPMI calls. Confluent documentation here.
Once installed and configured on a server, Confluent worked well to control IPMI devices via HTTP. I packaged it in a Docker image along with an ipmi_simulator for testing purposes : confluent dockerized.
The solution today is to run Command.eventloop() after creating the connection. It is documented in ipmi/command.py, which has a very trivial Housekeeper class which in the current version 1.5.53 is actually just a renamed Thread class, with no additional features. It merely runs the eventloop.
The implementation looks like this. One of those mentioned house keeping tasks is sending keepalive messages, if enabled which it is by default and can be influence by supplying keepalive=True at Command instantiation:
class Housekeeper(threading.Thread):
"""A Maintenance thread for housekeeping
Long lived use of pyghmi may warrant some recurring asynchronous behavior.
This stock thread provides a simple minimal context for these housekeeping
tasks to run in. To use, do 'pyghmi.ipmi.command.Maintenance().start()'
and from that point forward, pyghmi should execute any needed ongoing
tasks automatically as needed. This is an alternative to calling
wait_for_rsp or eventloop in a thread of the callers design.
"""
def run(self):
Command.eventloop()
There is a receive function at https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sqs.html#SQS.Client.receive_message to get SQS message,
Is there a function that I can just take a Peek at the SQS messages, without actually receiving it. Because If I receive the messages, it will be deleted from the queue. But I want the messages to be stay in the queue after peeking.
you can check
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html
aws sqs sdk's (and client libraries written on top of them) by default they dont delete messages. but they have 'Visibility Timeout' which is 30 seconds by default. That means after you read the message , it wont be visible to other consumers for 30 seconds. It is up to a client to delete it within that time frame so that no one else will get that message ever.
So You can reduce that visility time out to something really small like 1 second. So you can download the message and within 1 second it will be available to other consumers . you can even set it to 0 so everyone can read the message at any point.
But this still means you will receieve the message. SQS is pretty simple queue system. you might want to check other queue system like Kafka or different way to design your system like using Notification services such as SNS
I have a Celery task that takes a message from an SQS queue and tries to run it. If it fails it is supposed to retry every 10 seconds at least 144 times. What I think is happening is that it fails and gets back into the queue, and at the same time it creates a new one, duplicating it to 2. These 2 fail again and follow the same pattern to create 2 new and becoming 4 messages in total. So if I let it run for some time the queue gets clogged.
What I am not getting is the proper way to retry it without duplicating. Following is the code that retries. Please see if someone can guide me here.
from celery import shared_task
from celery.exceptions import MaxRetriesExceededError
#shared_task
def send_br_update(bgc_id, xref_id, user_id, event):
from myapp.models.mappings import BGC
try:
bgc = BGC.objects.get(pk=bgc_id)
return bgc.send_br_update(user_id, event)
except BGC.DoesNotExist:
pass
except MaxRetriesExceededError:
pass
except Exception as exc:
# retry every 10 minutes for at least 24 hours
raise send_br_update.retry(exc=exc, countdown=600, max_retries=144)
Update:
More explanation of the issue...
A user creates an object in my database. Other users act upon that object and as they change the state of that object, my code emits signals. The signal handler then initiates a celery task, which means that it connects to the desired SQS queue and submits the message to the queue. The celery server, running the workers, see that new message and try to execute the task. This is where it fails and the retry logic comes in.
According to celery documentation to retry a task all we need to do is to raise self.retry() call with countdown and/or max_retries. If a celery task raises an exception it is considered as failed. I am not sure how SQS handles this. All I know is that one task fails and there are two in the queue, both of these fail and then there are 4 in the queue and so on...
This is NOT celery nor SQS issues.
The real issues is the workflow , i.e. way of you sending message to MQ service and handle it that cause duplication. You will face the same problem using any other MQ service.
Imagine your flow
script : read task message. MQ Message : lock for 30 seconds
script : task fail. MQ Message : locking timeout, message are now free to be grab again
script : create another task message
Script : Repeat Step 1. MQ Message : 2 message with the same task, so step 1 will launch 2 task.
So if the task keep failing, it will keep multiply, 2,4,8,16,32....
If celery script are mean to "Recreate failed task and send to message queue", you want to make sure these message can only be read ONCE. **You MUST discard the task message after it already been read 1 time, even if the task failed. **
There are at least 2 ways to do this, choose one.
Delete the message before recreate the task. OR
In SQS, you can enforce this by create DeadLetter Queue, configure the Redrive Policy, set Maximum Receives to 1. This will make sure the message
with the task that have been read never recycle.
You may prefer method 2, because method 1 require you to configure celery to "consume"(read and delete) ASAP it read the message, which is not very practical. (and you must make sure you delete it before create a new message for failed task)
This dead letter queue is a way to let you to check if celery CRASH, i.e. message that have been read once but not consumed (delete) means program stop somewhere.
This is probably a little bit late, I have written a backoff policy for Celery + SQS as a patch.
You can see how it is implemented in this repository
https://github.com/galCohen88/celery_sqs_retry_policy/blob/master/svc/celery.py
We have a Windows based Celery/RabbitMQ server that executes long-running python tasks out-of-process for our web application.
What this does, for example, is take a CSV file and process each line. For every line it books one or more records in our database.
This seems to work fine, I can see the records being booked by the worker processes. However, when I check the rabbitMQ server with the management plugin (the web based management tool) I see the Queued messages increasing, and not coming back down.
Under connections I see 116 connections, about 10-15 per virtual host, all "running" but when I click through, most of them have 'idle' as State.
I'm also wondering why these connections are still open, and if there is something I need to change to make them close themselves:
Under 'Queues' I can see more than 6200 items with state 'idle', and not decreasing.
So concretely I'm asking if these are normal statistics or if I should worry about the Queues increasing but not coming back down and the persistent connections that don't seem to close...
Other than the rather concise help inside the management tool, I can't seem to find any information about what these stats mean and if they are good or bad.
I'd also like to know why the messages are still visible in the queues, and why they are not removed, as the tasks seem t be completed just fine.
Any help is appreciated.
Answering my own question;
Celery sends a result message back for every task in the calling code. This message is sent back via the same AMPQ queue.
This is why the tasks were working, but the queue kept filling up. We were not handling these results, or even interested in them.
I added ignore_result=True to the celery task, so the task does not send result messages back into the queue. This was the main solution to the problem.
Furthermore, the configuration option CELERY_SEND_EVENTS=False was added to speed up celery. If set to TRUE, this option has Celery send events for external monitoring tools.
On top of that CELERY_TASK_RESULT_EXPIRES=3600 now makes sure that even if results are sent back, that they expire after one hour if not picked up/acknowledged.
Finally CELERY_RESULT_PERSISTENT was set to False, this configures celery to not store these result messages on disk. They will vanish when the server crashes, which is fine in our case, as we don't use them.
So in short; if you don't need feedback in your app about if and when the tasks are finished, use ignore_result=True on the celery task, so that no messages are sent back.
If you do need that information, make sure you pick up and handle the results, so that the queue stops filling up.
If you don't need the reliability then you can make your queues transient.
http://celery.readthedocs.org/en/latest/userguide/optimizing.html#optimizing-transient-queues
CELERY_DEFAULT_DELIVERY_MODE = 'transient'
I'm following this tutorial: http://boto.s3.amazonaws.com/sqs_tut.html
When there's something in the queue, how do I assign one of my 20 workers to process it?
I'm using Python.
Unfortunately, SQS lacks some of the semantics we've often come to expect in queues. There's no notification or any sort of blocking "get" call.
Amazon's related SNS/Simple Notification Service may be useful to you in this effort. When you've added work to the queue, you can send out a notification to subscribed workers.
See also:
http://aws.amazon.com/sns/
Best practices for using Amazon SQS - Polling the queue
This is (now) possible with Long polling on a SQS queue.
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/Query_QueryReceiveMessage.html
Long poll support (integer from 1 to 20) - the duration (in seconds) that the ReceiveMessage action call will wait until a message is in the queue to include in the response, as opposed to returning an empty response if a message is not yet available.
If you do not specify WaitTimeSeconds in the request, the queue attribute ReceiveMessageWaitTimeSeconds is used to determine how long to wait.
Type: Integer from 0 to 20 (seconds)
Default: The ReceiveMessageWaitTimeSeconds of the queue.
Further to point out a problem with SQS - You must poll for new notifications, and there is no guarantee that on any particular poll you will receive an event that exists in the queue (this is due to the redundancy of their architecture). This means you need to consider the possibility that your polling didn't return a message that existed (which for me meant I needed to increase the polling rate).
All in all I found too many limitations with SQS (as I've found with some other AWS tools such as SimpleDB). But that's just my injected opinion.
Actual if you dont require a low latency, you can try this:
Create an cloudwatch alarm on your queue, like messages visible or messages received > 0.
As an action you will send a message to an sns topic, which then can send the message to your workers via an http/s endpoint.
normally this kind of approach is used for autoscaling.
There is now an JMS wrapper for SQS from Amazon that will let you create listeners that are automatically triggered when a new message is available.
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/jmsclient.html#jmsclient-gsg