What does GearmanWorker.set_client_id(client_id) do?
http://packages.python.org/gearman/worker.html#gearman.worker.GearmanWorker.set_client_id
Does it mean that the worker only serves clients with the specified ID?
If so, how can I find a client's ID?
From the docs of Gearman protocol:
SET_CLIENT_ID
This sets the worker ID in a job server so monitoring and reporting
commands can uniquely identify the various workers, and different
connections to job servers from the same worker.
So it has nothing to do with the worker-client relationship. That relationship is handled only by the function name that the client submits and the worker registers for. The ID shows up in the output of administrative commands (such as the "workers" command) and can help you debug and monitor your application. As a matter of fact, some interfaces (e.g. PHP) do not support this setting and are still fully usable.
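As a rough sketch of that distinction, assuming the python-gearman API linked in the question (set_client_id, register_task) and a hypothetical task name "reverse": the ID only labels the worker's connection, while the registered function name is what clients actually target. The configure_worker helper is illustrative and not part of the library.

```python
def reverse_task(gearman_worker, gearman_job):
    # python-gearman calls task callbacks with the worker and the job;
    # the job's payload is available as gearman_job.data.
    return gearman_job.data[::-1]


def configure_worker(worker, worker_id):
    # Label this worker for monitoring / administrative output only.
    worker.set_client_id(worker_id)
    # Clients reach this worker through the function name, not the ID.
    worker.register_task("reverse", reverse_task)
    return worker


# Usage, assuming a job server listening on localhost:4730 (an assumption):
#   import gearman
#   worker = configure_worker(
#       gearman.GearmanWorker(["localhost:4730"]), "reverse-worker-1")
#   worker.work()
```

A client that submits a "reverse" job is served by any worker registered for "reverse", whatever ID that worker has set.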
For this question, I'm particularly struggling with how to structure this:
User accesses website
User clicks button
Value x in database increments
My issue is that multiple people could potentially be on the website at the same time and click the button - I want to make sure each user is able to click the button, and update the value and read the incremented value too, but I don't know how to circumvent any synchronisation/concurrency issues.
I'm using Flask to run my website backend, and I'm thinking of using MongoDB or Redis to store the single value that needs to be updated.
Please comment if there is any lack of clarity in my question, but this is a problem I've really been struggling with how to solve.
Thanks :)
Redis: you can use an atomic increment command such as INCR (or HINCRBY if the counter lives in a hash field), which returns the new value and needs no extra locking for a single counter. Alternatively, create a distributed lock so that there is only one writer at a time and only the writer holding the lock can make the update from your Flask app. Make sure the lock is released once the writer is done with it, and that it expires after a certain period in case the writer never releases it.
MySQL: you can start a transaction, make the update, and commit the change to make sure the data stays consistent.
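A minimal sketch of the transactional variant, using the standard-library sqlite3 module as a stand-in for MySQL so it runs without a server. The table name counter and the helper names are hypothetical. The increment is done by the database itself (value = value + 1) inside one transaction, so concurrent callers cannot overwrite each other's updates.

```python
import sqlite3


def init_db(path):
    """Create the single-row counter table if it does not exist yet."""
    conn = sqlite3.connect(path)
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS counter "
            "(id INTEGER PRIMARY KEY, value INTEGER NOT NULL)"
        )
        conn.execute("INSERT OR IGNORE INTO counter (id, value) VALUES (1, 0)")
    conn.close()


def increment(path):
    """Atomically add 1 to the counter and return the new value."""
    # timeout: how long to wait for another writer's lock before failing.
    conn = sqlite3.connect(path, timeout=30)
    try:
        with conn:  # commits on success, rolls back on an exception
            # The UPDATE opens the transaction and takes the write lock.
            conn.execute("UPDATE counter SET value = value + 1 WHERE id = 1")
            # Read back inside the same transaction, before anyone else writes.
            (value,) = conn.execute(
                "SELECT value FROM counter WHERE id = 1"
            ).fetchone()
        return value
    finally:
        conn.close()
```

A Flask route would call increment() once per click and return the value to the user. The same UPDATE statement works in MySQL, although connection handling will differ.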
To solve this problem I would suggest you follow a microservice architecture.
A service called worker would handle the flask route that's called when the user clicks on the link/button on the website. It would generate a message to be sent to another service called queue manager that maintains a queue of increment/decrement messages from the worker service.
There can be multiple worker service instances running concurrently, but the queue manager is a singleton service that takes the messages from each worker and adds them to the queue. If the queue manager is busy, the worker service will either time out and retry or return a failure message to the user. If the queue is full, a response is sent back to the worker telling it to retry up to n times, counting down n on each attempt.
A third service called storage manager runs whenever the queue is not empty. It sends the messages to the storage solution (whether MongoDB, Redis, or good ol' SQL) and ensures the increment/decrement messages are handled in the order they were received in the queue. You could also include a timestamp from the worker service in the message if you wanted to use that to sort the queue.
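The three roles above can be sketched in a single process, assuming threads stand in for the separate services and an in-memory Counter stands in for the real storage. All names here are hypothetical. The point it illustrates is that many producers can enqueue at once while one consumer applies the updates in FIFO order.

```python
import queue
import threading


class Counter:
    """Stand-in for the storage layer (MongoDB, Redis, SQL, ...)."""

    def __init__(self):
        self.value = 0

    def apply(self, delta):
        self.value += delta


def storage_manager(messages, storage):
    # Single consumer: messages are applied one at a time, in queue order.
    while True:
        delta = messages.get()
        try:
            if delta is None:  # sentinel value: shut down
                return
            storage.apply(delta)
        finally:
            messages.task_done()


def start_storage_manager(messages, storage):
    thread = threading.Thread(
        target=storage_manager, args=(messages, storage), daemon=True
    )
    thread.start()
    return thread


def handle_click(messages, delta=1, timeout=1.0):
    # Called by any number of concurrent workers (e.g. request handlers).
    # Raises queue.Full if the queue stays full for `timeout` seconds,
    # which is where the caller would retry or report a failure.
    messages.put(delta, timeout=timeout)
```

A bounded queue.Queue(maxsize=...) gives the "queue is full" behaviour described above. In a real deployment the queue would be an external broker rather than an in-process object.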
Most hosting environments for Flask run a production web server such as Gunicorn, which supports multiple concurrent worker instances to handle the HTTP requests, and these would naturally be your worker service.
How you build and coordinate the queue manager and storage manager is down to implementation preference. For instance, you could use something like Google Cloud Pub/Sub to send messages between the deployed services, but that's just off the top of my head. There are plenty of ways to do it, and you're in the best position to decide.
Without knowing more about what you're trying to achieve and what the requirements for concurrent traffic are, I can't go into greater detail, but that's roughly how I've approached this type of problem in the past. If you need to handle more concurrent users on the website, you can pick a hosting solution with more concurrent workers. If you need the queue to be longer, you can pick a host with more memory, or else write the queue to intermediate storage. This will slow it down but will make recovering from a crash easier.
You also need to consider handling when messages fail between different services, how to recover from a service crashing or the queue filling up.
EDIT: I've been thinking about this over the weekend, and a much simpler solution is to create a new record in a table directly from the Flask route that handles user clicks. To get your total, you then just count the rows in this table. Your bottlenecks are going to be how many concurrent workers your Flask hosting environment supports and how many concurrent connections your storage supports. Both of these can be solved by throwing more resources at them.
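A small sketch of that simpler approach, again using the standard-library sqlite3 module as a stand-in for whatever storage you choose. The clicks table and the function names are hypothetical. Each click only inserts a row, so there is no shared value to read, modify, and write back.

```python
import sqlite3


def init_clicks(conn):
    """Create the append-only click log if it does not exist yet."""
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS clicks ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "clicked_at TEXT DEFAULT CURRENT_TIMESTAMP)"
        )


def record_click(conn):
    """Insert one row per click; the database assigns id and timestamp."""
    with conn:
        conn.execute("INSERT INTO clicks DEFAULT VALUES")


def total_clicks(conn):
    """The counter value is simply the number of rows."""
    (count,) = conn.execute("SELECT COUNT(*) FROM clicks").fetchone()
    return count
```

The trade-off is that the table grows with every click, so a very busy site may eventually want to roll old rows up into a stored total.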
I have a Django application that uses Celery with Redis broker for asynchronous task execution. Currently, the app has 3 queues (& 3 workers) that connect to a single Redis instance for communication. Here, the first two workers are prefork-based workers and the third one is a gevent-based worker.
The Celery setting variables regarding the broker and backend look like this:
CELERY_BROKER_URL="redis://localhost:6379/0"
CELERY_RESULT_BACKEND="redis://localhost:6379/1"
Since Celery uses rpush-blpop to implement the FIFO queue, I was wondering if it would be correct, or even possible, to use different Redis databases for different queues. For example, q1 would use database .../1 and q2 would use database .../2 for messaging. This way each worker would listen only to its dedicated database and pick up tasks from the queue with less competition.
Does this even make any sense?
If so, how do you implement something like this in Celery?
First, if you are worried about the load, please specify your expected numbers/rates.
In my opinion, you shouldn't be concerned about the Redis capability to handle your load.
Redis has its own scale-out / scale-in capabilities for whenever you need them.
You can also use RabbitMQ as your broker (running RabbitMQ in Docker is dead simple as well), which again has its own scale-out capabilities to support a high load, so I don't think you should be worried about this point.
As far as I know, there's no way to use different DBs for different queues with the Redis broker. You can create different Celery applications with different DBs, but then you cannot set dependencies between tasks (canvas: group, chain, etc.). I wouldn't recommend such an option.
I'm building a web application (Using Python/Django) that is hosted on two machines connected to a load balancer.
I have a central storage server, and I have a central Redis server, single celery beat, and two celery workers on each hosting machine.
I receive files from an API endpoint (on any of the hosting machines) and then schedule a task to copy to the storage server.
The problem is that the task is scheduled using:
task.delay(args)
and then any worker can receive it, while the received files exist only on one of the 2 machines, and have to be copied from it.
I tried finding if there's a unique id for the worker that I can assign the task to but didn't find any help in the docs.
Is there a solution to this, given that the number of hosting machines can scale to more than two?
The best solution is to put the task onto a named queue and have each worker look for jobs on its own specific queue. So if you have Machine A and Machine B, you could have Queue A, Queue B and Queue Shared. Machine A would watch for jobs on Queue A and Queue Shared, while Machine B would watch for jobs on Queue B and Queue Shared.
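A minimal sketch of per-machine routing, assuming a queue naming scheme of "files.<hostname>" (hypothetical) and a Celery task object passed in as task. It relies on Celery's apply_async(queue=...) argument, which selects the queue for a single call. The helper names are illustrative.

```python
import socket


def local_queue_name(hostname=None):
    """Name of the queue dedicated to one machine (hypothetical scheme)."""
    host = hostname or socket.gethostname()
    return f"files.{host}"


def enqueue_copy(task, path, hostname=None):
    # Instead of task.delay(path), send the task to the queue that only
    # the workers on the machine holding the file are consuming.
    return task.apply_async(args=[path], queue=local_queue_name(hostname))
```

Each machine's workers would then be started so that they consume their own queue plus the shared one, for example with the worker's -Q option: celery -A proj worker -Q files.machine-a,shared (proj and the queue names are placeholders). Because the name is derived from the hostname, adding a third machine needs no code change.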
The best way to do this is to have a dedicated queue for each worker.
When I was learning Celery I did exactly this, and after a few years I completely abandoned the approach, as it creates more problems than it solves.
Instead, I would recommend the following: any resource that you may need to share among tasks should be on a shared filesystem (NFS), or in some sort of in-memory caching service like Redis, KeyDB or memcached. We use a combination of S3 and Redis, depending on the type of resource.
Sure, if you do not really care about scalability the queue-per-worker approach will work fine.
I have a project in which the user sends an audio file from Android or the web to a server.
I need to perform speech-to-text processing on the server and return some files to the user on Android or the web. The server side has to be written in Python.
How could this be done?
Alongside your web application, you can have a queue of tasks that need to be run and worker process(es) to run and track those tasks. This is a popular pattern when web requests need to either start tasks in the background, check in on tasks, or get the result of a task. An introduction to this pattern can be found in the Task Queues section of the Full Stack Python open book. Celery and RQ are two popular projects that supply task queue management and can plug into an existing Python web application, such as one built with Django or Flask.
Once you have task management, you'll have to decide how to keep the user up to date on the status of a task. If you're stuck with having to use RPC-style web service calls only, then you can have clients (e.g. Android or browser) poll for the status by making a call to a web service you've created that checks on the task via your task queue manager's API.
If you want the user to be informed faster or want to reduce wasteful overhead from constant polling, consider supplying a websocket instead. Through a websocket connection, clients could subscribe to notifications of events such as the completion of a speech-to-text job. The Autobahn|Python library provides server code for implementing websockets as well as support for a protocol on top called WAMP that can be used to communicate subscriptions and messages or call upon services. If you need to stick with Django, consider something like django-websocket-redis instead.
Edit: I posted this to python-list and tutor-list with no responses. Any advice would be much appreciated.
What is the best approach to writing a concurrent daemon that can execute callbacks for different types of events (AMQP messages, parsed output of a subprocess, HTTP requests)?
I am considering twisted, the built-in threading module, and greenlet. I must admit that I am very unfamiliar with concurrent programming and Python programming in general (formerly a data analysis driven procedural programmer). Any resources on threaded/concurrent programming (specifically daemons...not just multi-threading a single task) would be much appreciated.
Thanks!
Details:
1) Listen to AMQP messaging queues and execute callbacks when messages arrive.
Example: Immediately after startup, the daemon continuously listens to the OpenStack Notifications messaging queue. When a virtual machine is launched, a notification is generated by OpenStack with the hostname, IP address, etc. The daemon should read this message and write some info to a log (or POST the info to a server, or notify the user...something simple).
2) Parse the output of a subprocess and execute callbacks based on the output.
Example: Every 30 seconds, a system command "qstat" is run to query a job resource manager (e.g. TORQUE). Similar callbacks to 1).
3) Receive requests from a user and process them. I think this will be via WSGI HTTP.
Example: User submits an XML template with virtual machine templates. The daemon does some simple XML parsing and writes a job script for the job resource manager. The job is submitted to the resource manager and the daemon continually checks for the status of the job with "qstat" and for messages from AMQP. It should return "live" feedback to the user and write to a log.
You may want to look at the OpenStack Oslo project.
Start here:
https://wiki.openstack.org/wiki/Oslo
Oslo is basically a shared resource for all OpenStack applications. The focus here is providing re-usable code, and standardizing on methods that many applications create or use.
Messaging, being a fundamental component of OpenStack, has been broken out into its own library. Also, since OpenStack supports many messaging protocols, maybe doing direct AMQP isn't the right answer for you.
The messaging code specifically is being placed here:
https://github.com/openstack/oslo.messaging
I'd go dig into that repository and play with some of the methods made available there.