Alternative to Amazon SQS for Python

I have a collection of EC2 instances with a process installed on them that all make use of the same SQS queue, call it my_queue. This queue is extremely active: more than 250 messages a minute are written and then deleted one after another. The problem I've encountered is that it is starting to get slow, so my system is not working properly; some processes hang because SQS closes the connection, and writes from remote machines are affected as well.
The big advantages I get from SQS are that 1) it's very easy to use, with no local files to install or configure, and 2) it's reliable, since I only need a key and key_secret in order to start pushing and pulling messages.
My questions are:
What alternatives exist to SQS? I know of Redis and RabbitMQ, but both need local deployment and configuration, and that might lead to unreliable behaviour: if, for example, the box running the queue suddenly crashes, the other boxes are no longer able to write messages to it.
If I choose something like Redis, deployed on my own box, is it worth it over SQS, or should I just stay with SQS and look for another solution?
Thanks

You may have solved this by now since the question is old; I had the same issues a while back and the cost ($$) of polling SQS in my application was sizable. I migrated it to Redis successfully without much effort. I eventually migrated the Redis host to ElastiCache's Redis and have been very pleased with the solution. I am able to snapshot it and scale it as needed. Hope that helps.
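For reference, here is a minimal sketch of the kind of Redis-backed queue such a migration amounts to, assuming redis-py and a reachable Redis/ElastiCache endpoint (the hostname and queue key are made up): LPUSH enqueues, and BRPOP blocks until a message is available, so there is no per-poll request cost.

```python
# Hedged sketch of a simple Redis list used as a queue (hostname and key are illustrative).
import json
import redis

r = redis.StrictRedis(host="my-redis.example.com", port=6379, db=0)

def push_message(payload):
    # LPUSH adds the message to the head of the list.
    r.lpush("my_queue", json.dumps(payload))

def pop_message(timeout=30):
    # BRPOP blocks for up to `timeout` seconds waiting for a message.
    item = r.brpop("my_queue", timeout=timeout)
    if item is None:
        return None
    _queue_name, raw = item
    return json.loads(raw)
```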

Related

Update single database value on a website with many users

For this question, I'm particularly struggling with how to structure this:
User accesses website
User clicks button
Value x in database increments
My issue is that multiple people could potentially be on the website at the same time and click the button. I want to make sure each user is able to click the button, update the value, and read the incremented value too, but I don't know how to avoid the resulting synchronisation/concurrency issues.
I'm using flask to run my website backend, and I'm thinking of using MongoDB or Redis to store my single value that needs to be updated.
Please comment if there is any lack of clarity in my question, but this is a problem I've really been struggling to solve.
Thanks :)
Redis: I think you can use the Redis HINCRBY command (or INCR for a plain counter), or create a distributed lock to make sure there is only one writer at a time and only the lock-holding writer can make the update from your Flask app. Make sure you release the lock after a certain period of time, or as soon as the writer is done with it.
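A rough sketch of the Redis option, assuming redis-py, a Redis instance on localhost, and a made-up Flask route; INCR (and HINCRBY for a hash field) is atomic on the Redis server, so concurrent clicks cannot lose updates:

```python
# Hedged sketch: atomic counter increment from Flask (route and key names are illustrative).
import redis
from flask import Flask

app = Flask(__name__)
r = redis.StrictRedis(host="localhost", port=6379, db=0)

@app.route("/click", methods=["POST"])
def click():
    # INCR is executed atomically by the Redis server.
    new_value = r.incr("button_clicks")
    # Hash variant: r.hincrby("counters", "button_clicks", 1)
    return str(new_value)
```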
MySQL: you can start a transaction, make the update, and commit the change to make sure the data stays consistent.
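And a hedged sketch of the MySQL option, assuming mysql-connector-python and a hypothetical counters(name, value) table; SELECT ... FOR UPDATE inside the transaction locks the row until COMMIT, so only one writer at a time can read-modify-write it:

```python
# Hedged sketch of the transactional read-modify-write (table and credentials are illustrative).
import mysql.connector

conn = mysql.connector.connect(user="app", password="secret", database="mydb")
try:
    cur = conn.cursor()
    # FOR UPDATE holds a row lock until the transaction commits.
    cur.execute("SELECT value FROM counters WHERE name = %s FOR UPDATE", ("clicks",))
    (value,) = cur.fetchone()
    cur.execute("UPDATE counters SET value = %s WHERE name = %s", (value + 1, "clicks"))
    conn.commit()
finally:
    conn.close()
```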
To solve this problem I would suggest you follow a microservice architecture.
A service called worker would handle the flask route that's called when the user clicks on the link/button on the website. It would generate a message to be sent to another service called queue manager that maintains a queue of increment/decrement messages from the worker service.
There can be multiple worker service instances running concurrently, but the queue manager is a singleton service that takes the messages from each worker and adds them to the queue. If the queue manager is busy, the worker service will either time out and retry or return a failure message to the user. If the queue is full, a response is sent back to the worker telling it to retry up to n times, counting n down on each attempt.
A third service called storage manager runs whenever the queue is not empty; this service sends the messages to the storage solution (whether Mongo, Redis, or good ol' SQL) and ensures the increment/decrement messages are handled in the order they were received in the queue. You could also include a timestamp from the worker service in the message if you wanted to use it to sort the queue.
Generally, whatever hosting environment you use for Flask will run gunicorn as the production web server and support multiple concurrent worker instances to handle the HTTP requests, and these would naturally be your worker service.
How you build and coordinate the queue manager and storage manager is down to implementation preference; for instance, you could use something like Google Cloud Pub/Sub to send messages between the different deployed services, but that's just off the top of my head. There are loads of different ways to do it, and you're in the best position to decide.
Without knowing more details about what you're trying to achieve and what the requirements for concurrent traffic are, I can't go into greater detail, but that's roughly how I've approached this type of problem in the past. If you need to handle more concurrent users at the website, you can pick a hosting solution with more concurrent workers. If you need the queue to be longer, you can pick a host with more memory, or else write the queue to intermediate storage. This will slow it down but will make recovering from a crash easier.
You also need to consider how to handle messages failing between the different services, and how to recover from a service crashing or the queue filling up.
EDIT: I've been thinking about this over the weekend, and a much simpler solution is to just create a new record in a table directly from the Flask route that handles user clicks. Then, to get your total, you just take a count from that table. Your bottlenecks are going to be how many concurrent workers your Flask hosting environment supports and how many concurrent connections your storage supports. Both of these can be solved by throwing more resources at them.
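A minimal sketch of that "one row per click" idea, assuming SQLite for simplicity; the table, database path, and route are made up for illustration:

```python
# Hedged sketch: each click inserts a row, and the total is derived with COUNT,
# so there is no read-modify-write race on a single counter value.
import sqlite3
from flask import Flask

app = Flask(__name__)
DB_PATH = "clicks.db"

def get_conn():
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS clicks (id INTEGER PRIMARY KEY, clicked_at TEXT)"
    )
    return conn

@app.route("/click", methods=["POST"])
def click():
    conn = get_conn()
    try:
        conn.execute("INSERT INTO clicks (clicked_at) VALUES (datetime('now'))")
        conn.commit()
        (total,) = conn.execute("SELECT COUNT(*) FROM clicks").fetchone()
    finally:
        conn.close()
    return str(total)
```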

Ensuring at most a single instance of job executing on Kubernetes and writing into Postgresql

I have a Python program that I am running as a Job on a Kubernetes cluster every 2 hours. I also have a webserver that starts the job whenever user clicks a button on a page.
I need to ensure that at most only one instance of the Job is running on the cluster at any given time.
Given that I am using Kubernetes to run the job and connecting to Postgresql from within the job, the solution should somehow leverage these two. I thought about it a bit and came up with the following ideas:
Find a setting in Kubernetes that would set this limit; attempts to start a second instance would then fail. I was unable to find such a setting.
Create a shared lock, or mutex. The disadvantage is that if the job crashes, I may not release the lock before quitting.
Kubernetes is running etcd, maybe I can use that.
Create a 'lock' table in Postgresql; when a new instance connects, it checks whether it is the only one running. Use transactions somehow so that one wins and proceeds while the others quit. I have not yet thought this out, but it should work.
Query the Kubernetes API for a label I use on the job and see if any instances are already running. This may not be atomic, so more than one instance may slip through.
What are the usual solutions to this problem given the platform choice I made? What should I do, so that I don't reinvent the wheel and have something reliable?
A completely different approach would be to run a (web) server that executes the job functionality. At a high level, the idea is that the webserver can contact this new job server to execute functionality. In addition, this new job server will have an internal cron to trigger the same functionality every 2 hours.
There could be 2 approaches to implementing this:
You can put the checking mechanism inside the jobserver code to ensure that even if 2 API calls happen simultaneously to the job server, only one executes, while the other waits. You could use the language platform's locking features to achieve this, or use a message queue.
You can put the checking mechanism outside the jobserver code (in the database) to ensure that only one API call executes. Similar to what you suggested. If you use a postgres transaction, you don't have to worry about your job crashing and the value of the lock remaining set.
The pros/cons of both approaches are straightforward. The major difference in my mind between 1 & 2 is that if you update the job server code, you might have a situation where 2 job servers are running at the same time. This would destroy the isolation property you want. Hence, the database might work better, or be more idiomatic in the k8s sense (all servers are stateless, so all the k8s goodies work; put any shared state in a database that can handle concurrency).
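As a hedged sketch of that database-side check, here is what it could look like with psycopg2 and a Postgres session-level advisory lock (the lock number and run_job are placeholders); the lock is released automatically when the connection closes, even if the job crashes:

```python
# Hedged sketch: only the instance that wins the advisory lock does the work.
import sys
import psycopg2

JOB_LOCK_ID = 42  # arbitrary application-wide lock number (illustrative)

def run_job():
    ...  # placeholder for the actual job logic

conn = psycopg2.connect("dbname=jobs user=app")
cur = conn.cursor()
cur.execute("SELECT pg_try_advisory_lock(%s)", (JOB_LOCK_ID,))
(got_lock,) = cur.fetchone()

if not got_lock:
    print("another instance is already running, exiting")
    sys.exit(0)

try:
    run_job()
finally:
    conn.close()  # closing the session releases the advisory lock
```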
Addressing your ideas, here are my thoughts:
Find a setting in k8s that will limit this: k8s will not start things with the same name (in the metadata of the spec). But anything else goes for a job, and k8s will start another job.
a) etcd3 supports distributed locking primitives. However, I've never used this and I don't really know what to watch out for.
b) postgres lock value should work. Even in case of a job crash, you don't have to worry about the value of the lock remaining set.
Querying the k8s API server for things that should be atomic is not a good idea, as you said. I've used a system that reacts to k8s events (like an annotation change on an object spec), but I've had bugs where my 'operator' suddenly stops getting k8s events and needs to be restarted, or, again, if I want to push an update to the event-handler server, there might be 2 event handlers running at the same time.
I would recommend sticking with what you are most familiar with. In my case that would be implementing a job-server-like k8s deployment that runs as a server and listens to events/API calls.

In Django, what are the best ways to handle scheduled jobs where only one job is spun up instead of multiple?

[TLDR] What are the best ways to handle scheduled jobs in development where only one job is spun up instead of two? Do you use the 'noreload' option? Do you test to see if a file is locked and then stop the second instance of the job if it is? Are there better alternatives?
[Context]
[edit] We are still in a development environment and are looking at next steps for a production environment.
My team and I are currently developing a Django project, Django 1.9 with Python 3.5. We just discovered that Django spins up two instances of itself to allow for real-time code changes.
APScheduler 3.1.0 is being used to schedule a DB ping every few minutes to see if there is new data for us to process. However, when Django spins up, we noticed that we were pinging twice and that there were two instances of our functions running. We tried to shut down the second job through APS, but as they are in two different processes, APS is unable to see the other job.
After researching this, we discovered the 'noreload' option and another suggestion to test if a file has been locked.
The noreload option prevents Django from spinning up the second instance. This solution works but feels weird; we haven't come across any documentation or guides suggesting whether this is something you should or shouldn't do in production.
Filelock 2.0.6 is another option that we have tested. In this solution, the two scheduled tasks ping a local file to see if it is locked. If it isn't locked, then that task will lock it and run while the other one will stop running. If the task crashes, then the locked file will remain locked until a server restart. This feels like a hack.
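For illustration, a hedged sketch of that file-lock guard, assuming the filelock package and APScheduler's BackgroundScheduler (the lock path and job function are made up); only the process that acquires the lock starts the scheduler:

```python
# Hedged sketch: guard scheduler startup with a file lock so only one process runs it.
from filelock import FileLock, Timeout
from apscheduler.schedulers.background import BackgroundScheduler

lock = FileLock("/tmp/scheduler.lock")  # illustrative path

def ping_db():
    ...  # placeholder for the DB ping

try:
    lock.acquire(timeout=0)  # non-blocking attempt to take the lock
    scheduler = BackgroundScheduler()
    scheduler.add_job(ping_db, "interval", minutes=5)
    scheduler.start()  # this process owns the scheduler
except Timeout:
    pass  # another process already runs the scheduler
```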
In a production environment, are these good solutions? Are there other alternatives that we should look at for handling scheduled tasks that are better for this? Are there cons to either of these solutions that we haven't thought of?
'noreload' - is this something that is done normally in a production environment?

Redis server responding intermittently to Python client

Using Redis with our Python WSGI application, we are noticing that at certain intervals, Redis stops responding to requests. After some time, however, we are able to fetch the values stored in it again.
When this happens we check the status of the Redis service, and it shows as still online.
If it is of any help, we are using the redis Python package and using StrictRedis as a connection class, with a default ConnectionPool. Any ideas on this will be greatly appreciated. If there is more information which will help better diagnose the problem, let me know and I'll update ASAP.
Thanks very much!
More data about your Redis setup and the size of the data set would be useful. That said, I would venture to guess your Redis server is configured to persist data to disk (the default). If so, you could be seeing your Redis node getting a bit overwhelmed when it forks off a copy of itself to save the data set to disk.
If this is the case and you do need to persist to disk, then I would recommend standing up a second instance configured as a slave of the first one that persists to disk, and configuring the master not to persist to disk. In this configuration you should see the writable node be fully responsive at all times. You could even go so far as to set up a non-persisting slave for read-only access.
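A rough sketch of what that split could look like from redis-py, with hypothetical hostnames: the master stops snapshotting to disk and the second node is made a persisting slave of it.

```python
# Hedged sketch: inspect and adjust persistence, and point a replica at the master
# (hostnames and the snapshot schedule are illustrative).
import redis

master = redis.StrictRedis(host="redis-master.example.com", port=6379)
replica = redis.StrictRedis(host="redis-replica.example.com", port=6379)

print(master.config_get("save"))       # inspect the current RDB snapshot schedule
master.config_set("save", "")          # stop the master from forking for RDB saves
master.config_set("appendonly", "no")  # and keep AOF off on the master

replica.slaveof("redis-master.example.com", 6379)  # replicate from the master
replica.config_set("save", "900 1")                # let the replica persist to disk
```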
But without more details on your configuration, resources, and usage it is merely an educated guess.

A question about site downtime during updates

I have a server where I run a Django application, but I have a little problem:
when I commit and push new changes to the server with Mercurial, there is a tiny window (like 1 microsecond) where the home page is unreachable.
I have Apache on the server.
How can I solve this?
You could run multiple instances of the Django app (either on the same machine with different ports or on different machines) and use Apache to reverse-proxy requests to each instance. It can fail over to instance B whilst instance A is rebooting. See mod_proxy.
If the downtime is as short as you say, though, it is unlikely to be an issue worth worrying about.
Also note that there are likely to be better (and easier) proxies than Apache. Nginx is popular, as is HAProxy.
If you have any significant traffic, and downtime measured in microseconds matters, it's probably best to push new changes to your web servers one at a time, removing each machine from the load balancer rotation for the moment you're doing the upgrade on it.
When using apachectl graceful, you minimize the time the website is unavailable when 'restarting' Apache. All children are 'kindly' requested to restart and get their new configuration when they're not doing anything.
The USR1 or graceful signal causes the parent process to advise the children to exit after their current request (or to exit immediately if they're not serving anything). The parent re-reads its configuration files and re-opens its log files. As each child dies off the parent replaces it with a child from the new generation of the configuration, which begins serving new requests immediately.
At a heavy-traffic website, you will notice some performance loss, as some children will temporarily not accept new connections. It's my experience, however, that TCP recovers perfectly from this.
Considering that some websites take several minutes or hours to update, that is completely acceptable. If it is a really big issue, you could use a proxy, running multiple instances and updating them one at a time, or update at an off-peak moment.
If you're at the point of complaining about a 1/1,000,000th of a second outage, then I suggest the following approach:
Front end load balancers pointing to multiple backend servers.
Remove one backend server from the load balancer to ensure no traffic will go to it.
Wait until all requests that the server was processing have been served.
Shutdown the webserver on that instance.
Update the django instance on that machine.
Add that instance back to the load balancers.
Repeat for every other server.
This will ensure that the 1/1,000,000th of a second gap is removed.
I think it's normal, since Django may need to restart its server after your update.
