Can someone explain a bit the statement that Redis is single-threaded?
What I want to do...
I'm writing a Flask web site. There will be a lot of background work, so I thought about splitting it across multiple threads. I read that it's best to use Celery, and I would like to use Redis as the broker, since I'd also like to use Redis for some key-value storage.
So my questions are:
Can multiple threads connect to the Redis DB (in a thread-safe way) at the same time to retrieve and store data?
Also, can Redis be used for site caching?
Multiple threads can connect to Redis in a thread-safe way (assuming that the Redis client is thread-safe and that the code itself is as well).
Because Redis is (mostly) single-threaded, every request to it blocks all others while it is executed. However, because Redis is so fast - requests are usually returned in under a millisecond - it can still serve a considerable number of concurrent requests, so having multiple connections to it isn't an issue.
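For example, with the redis-py client (assuming that is the client in use), a single instance is backed by a connection pool and is documented as safe to share across threads, so a rough sketch looks like this:

    import threading
    import redis

    # One shared client; redis-py checks a connection out of its internal
    # pool per command, so the instance can be used from several threads.
    r = redis.Redis(host="localhost", port=6379, db=0)

    def worker(n):
        # Each thread issues its own commands; Redis executes them one at a time.
        r.set(f"task:{n}", f"result-{n}")
        print(r.get(f"task:{n}"))

    threads = [threading.Thread(target=worker, args=(n,)) for n in range(5)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()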
As for whether it can be used for caching a website, that's definitely so (just Google it ;)).
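For instance, assuming the Flask-Caching extension (one common choice, not something from the question itself), wiring Redis in as the page cache looks roughly like this:

    from flask import Flask
    from flask_caching import Cache

    app = Flask(__name__)

    # Point the cache at a Redis database; db 2 here is an arbitrary choice.
    cache = Cache(config={
        "CACHE_TYPE": "RedisCache",  # "redis" on older Flask-Caching versions
        "CACHE_REDIS_URL": "redis://localhost:6379/2",
    })
    cache.init_app(app)

    @app.route("/expensive")
    @cache.cached(timeout=60)  # serve the cached response for 60 seconds
    def expensive_view():
        return compute_something_slow()  # hypothetical slow function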
Related
I have built a webserver written in python using the flask framework and psycopg2 and I have some questions about concurrent processing as it relates to dbs and the server itself. I am using gunicorn to start my app with
web: gunicorn app:app
From my understanding, a web server such as this processes requests one at a time: if someone makes a GET or POST request, the server must finish responding to it before it can move on to another request. If that's the case, why would I need more than one connection/cursor object? For example, if someone makes a POST request that requires me to update something in the DB, my server can't process other requests until I return from that endpoint anyway, so that single connection object isn't bottlenecking anything, is it?
Ultimately, I am trying to allow my server to process a large number of requests simultaneously. In order to do so, I think I would first have to make multiple instances of my server, and THEN the connection pool comes into play, right? I think in order to make multiple instances of my server (apologies if any terminology is being used incorrectly here), I would do one of these things:
One way: use multiple threads, and if the cloud machine my application is deployed on has multiple CPU cores, then it can do this(?). However, I have read that Python does not support "true multithreading", meaning a multi-threaded program isn't actually running concurrently; it's just switching back and forth between those threads really quickly. So would this really be any different from my current setup?
The second way: use multiple gunicorn workers, or multiple dynos. I think this route is the solution here, but I don't understand the theory of how to set it up at all. If I spawn additional gunicorn workers, what is happening behind the scenes? Would this still all run on my Heroku application instance? Does the number of cores I have access to on Heroku affect this in any way? Also, regardless of which way I pick, what would I need to change in the app.py code, or would the change be solely inside the Procfile?
Assuming I manage to set up multithreading or gunicorn workers, how would this then affect the connection pool setup, and what should I do with regard to the connection pool? If anyone familiar with this can provide some theory, explanations, or resources, I would greatly appreciate it. Thanks all!
From my experience with Python, here's what I've learned...
If you are using multiple threads or async code, then you need a connection pool or an async connection.
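A rough sketch of the threaded case with psycopg2's built-in pool (connection details are illustrative):

    from psycopg2.pool import ThreadedConnectionPool

    # One pool per process; each thread borrows a connection and returns it.
    pool = ThreadedConnectionPool(
        minconn=1, maxconn=10,
        dsn="dbname=app user=app password=secret host=localhost",
    )

    def run_query(sql, params=None):
        conn = pool.getconn()
        try:
            with conn.cursor() as cur:
                cur.execute(sql, params)
                conn.commit()
                return cur.fetchall() if cur.description else None
        finally:
            pool.putconn(conn)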
If you have multiple processes and your code is strictly synchronous with no threads, then a pool is not necessary. You can reuse a single connection in each process, since connections are not shared between processes.
Threads usually don't speed up execution in Python, since the interpreter will only ever run one thread at a time. They can still help when threads spend their time blocked on I/O.
For web servers the true bottleneck is usually I/O: connecting to the database, reading files, and so on. Multiple processes, with those processes made async, give the greatest performance. Starlette is an async version of Flask... kinda, and is usually much faster when set up properly and used with async libraries.
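For the multiple-process part, gunicorn's config file is itself Python, so a sketch of a gunicorn.conf.py (worker counts are illustrative, and on Heroku the dyno size bounds what is sensible) might look like:

    # gunicorn.conf.py -- pass with "gunicorn -c gunicorn.conf.py app:app"
    # (recent gunicorn versions also pick this file up automatically).
    import multiprocessing

    # A common rule of thumb; tune for your dyno or instance size.
    workers = multiprocessing.cpu_count() * 2 + 1

    # Threads per worker can help when request handlers block on I/O.
    threads = 2

    bind = "0.0.0.0:8000"

With this route, nothing in app.py needs to change; the worker count lives entirely in the gunicorn command or config file (or the Procfile that invokes it).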
I have a Django application that uses Celery with Redis broker for asynchronous task execution. Currently, the app has 3 queues (& 3 workers) that connect to a single Redis instance for communication. Here, the first two workers are prefork-based workers and the third one is a gevent-based worker.
The Celery setting variables regarding the broker and backend look like this:
CELERY_BROKER_URL="redis://localhost:6379/0"
CELERY_RESULT_BACKEND="redis://localhost:6379/1"
Since Celery uses RPUSH/BLPOP to implement the FIFO queue, I was wondering if it'd be correct or even possible to use different Redis databases for different queues, e.g. q1 uses database .../1 and q2 uses database .../2 for messaging. That way each worker would only listen on its dedicated database and pick up tasks from its queue with less competition.
Does this even make any sense?
If so, how do you implement something like this in Celery?
First, if you are worried about the load, please specify your expected numbers/rates.
In my opinion, you shouldn't be concerned about the Redis capability to handle your load.
Redis has its own scale-out / scale-in capabilities whenever you need them.
You could also use RabbitMQ as your broker (running RabbitMQ in Docker is dead simple as well, and there are examples around), which again has its own scale-out capabilities to support high load, so I don't think you should be worried about this point.
As far as I know, there's no way to use different DBs for the Redis broker. You can create different Celery applications with different DBs, but then you cannot set dependencies between tasks (canvas: group, chain, etc.). I wouldn't recommend that option.
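If the goal is simply to keep workers from competing over the same list, the usual pattern (a sketch within a single broker database; task and queue names are illustrative) is to declare named queues and route tasks to them:

    # celery_app.py -- one broker DB, several named queues (Redis lists) inside it
    from celery import Celery

    app = Celery(
        "proj",
        broker="redis://localhost:6379/0",
        backend="redis://localhost:6379/1",
    )

    # Each routed queue becomes its own Redis list, so workers bound to
    # different queues never pop from the same key.
    app.conf.task_routes = {
        "proj.tasks.heavy_*": {"queue": "q1"},
        "proj.tasks.light_*": {"queue": "q2"},
        "proj.tasks.io_*": {"queue": "q3"},
    }

Each worker is then started against its own queue, e.g. celery -A proj worker -Q q1 for one of the prefork workers and celery -A proj worker -Q q3 -P gevent for the gevent one, so they pick up tasks without competing even though they share one database.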
I am building a REST API with Flask-restplus. One of my endpoints takes a file uploaded from the client and runs some analysis. The job takes up to 30 seconds. I don't want the job to block the main process, so the endpoint should return a response with 200 or 201 right away while the job is still running. Results will be saved to the database and retrieved later.
It seems I have two options for long-running jobs.
Threading
Task-queue
Threading is relatively simpler, but the problem is that there is a limit on the number of threads for a Flask app. In a standalone Python app, I could use a queue for the threads, but this is a REST API and each request call is independent. I don't know if there is a way to maintain a global queue for that, so if the requests exceed the thread limit, the app won't be able to take more requests.
A task queue with Celery and Redis is probably the better option, but this is just a proof of concept and the timeline is kind of tight. Setting up Celery and Redis with Flask is not easy; I am having lots of trouble on my dev machine, which is Windows. It will be deployed on AWS, which adds more complexity.
I wonder if there is a third option for this case?
I would HIGHLY recommend using Celery, as you have already mentioned in your post. It is built exactly for this use case. Its docs are really informative, and there is no shortage of examples online that can get you up and running quickly.
Additionally, I would say THIS would be an excellent first resource for you to start with.
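To give a feel for the shape of it, here is a minimal sketch (module, task, and helper names are illustrative) where the endpoint enqueues the analysis and returns immediately, and a worker writes the result to the database later:

    # tasks.py
    from celery import Celery

    celery_app = Celery(
        "analysis",
        broker="redis://localhost:6379/0",
        backend="redis://localhost:6379/1",
    )

    @celery_app.task
    def analyze_file(path):
        # ... the ~30 second analysis; save the result to the database here
        return {"status": "done", "path": path}

    # app.py (Flask endpoint)
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    @app.route("/analyze", methods=["POST"])
    def analyze():
        path = save_upload(request.files["file"])  # hypothetical helper
        task = analyze_file.delay(path)            # enqueue and return at once
        return jsonify({"task_id": task.id}), 202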
Celery is a fantastic solution to this problem, one I have used quite successfully in the past to manage millions of jobs per day.
The only real downside is the initial learning curve and complexity of debugging when things go sour (it can happen, especially with millions of jobs).
I have a Flask microservice which serves user requests via an endpoint, say /getdata.
The data can be fetched in one of two ways: 1) from the cache, or 2) from the database directly, if the cache is in the process of being updated.
Another service updates the database (thus making the cache stale). Once that service is done updating the database, it publishes a message to RabbitMQ stating: "update done".
Back to the microservice: I'd like it to have two threads:
Thread 1: runs the app.run()
Thread 2: subscribes to the queue - where "update done" messages are published
Given the two threads, I don't want /getdata to fetch data from the cache while the cache is being updated. At the same time, I don't want to update the cache while data is being fetched from the endpoint.
Here's one solution I can think of:
1) Have a threading.Lock() as a "global"
2) /getdata checks if the lock is available; if so, it acquires it, fetches data from the cache, and releases the lock. If the lock is unavailable, it fetches the data from the database directly, thereby incurring a performance hit but still getting the "latest" data.
3) The RabbitMQ "subscriber" checks the state of the lock; if it is available, it acquires the lock, updates the cache from the database, and releases the lock. If not, it adds the request to a local "queue" and waits for, say, one minute before trying to acquire the lock again. When it does, it pops the first item from the queue and updates the cache from the database.
My questions:
1) Given the multitude of libraries and options in Python/Flask, is there a library that allows me to do a task like this in a "safe" way? (I am using pika for RabbitMQ access.)
2) Is it possible to launch the Flask app.run() in one thread and the queue subscriber in another (i.e. in if __name__ == "__main__":)?
3) How do I declare a "global" threading.Lock() which can coordinate the two threads?
Notes:
I expect that in the worst case the lock won't be held for more than one minute.
Pika is not thread safe. You should avoid sharing the connection object across Flask's contexts. Writing your own Flask plugin wouldn't take that much boilerplate though. It would be very similar to the documentation example plugin. Otherwise, you could do a quick search with flask pika on a search engine and you'll find some existing plugins for this purpose. I have not tried them and they don't seem really popular, but maybe you should give them a go?
I don't see why it wouldn't be possible; Flask knows how to deal with this. However, I reckon it would severely degrade performance. Moreover, you might hit some corner cases if the plugins you use are not perfectly written.
Just like you would declare any lock for threading. Nothing much. You put it at the module level (not in Flask's context) so that it is global, that's it.
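Concretely, a sketch along those lines (helper names are placeholders, and the RabbitMQ callback is simplified):

    # cache_lock.py -- module-level lock, importable from both threads
    import threading

    cache_lock = threading.Lock()

    # In the Flask view thread:
    def getdata():
        if cache_lock.acquire(blocking=False):
            try:
                return read_from_cache()       # hypothetical helper
            finally:
                cache_lock.release()
        return read_from_database()            # cache busy: fall back to the DB

    # In the RabbitMQ consumer thread, on an "update done" message:
    def on_update_done(channel, method, properties, body):
        with cache_lock:                       # blocks until the lock is free
            refresh_cache_from_database()      # hypothetical helper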
That being said, I think you shouldn't proceed this way. You should rather run the update-job in a different process from the Web Server (using Flask CLI or whatever if you need to re-use some functions). It will be better performance-wise, it's easier to reason about, it's more loosely coupled.
Also, you should avoid running into locking headaches as long as possible. Believe me, it's a real source of problems. It's a nightmare to test properly, to debug, to maintain and quite risky when it comes to real-production use-cases. And if you really, really need a lock, don't hold it for one minute, it's way too long.
I don't know your exact requirements, but there surely is a solution that is OK and that does not involve such complexity.
I have some confusion about Redis. I am self-learning Redis.
I have learned that Redis is single-threaded and works on the concept of an event loop, so read/write operations are serialized in Redis and there are no race conditions.
My confusion is this: when I naively think about a single-threaded architecture, I imagine a buffer where all read/write requests gather and the thread schedules them one by one. But in a real-life internet application where thousands or millions of requests have to be processed, how does Redis handle them without significant latency? If some write operation takes, say, a few milliseconds, does it block other read/write operations during that time?
Does Redis implement any locking concept like a relational DB? If not, how does Redis handle thousands of reads/writes without significant latency?
Any internals / examples would be great for my further study.
Your understanding of Redis internals is quite correct. There is no locking system; all operations are atomic and blocking.
The recommendation when using Redis is to make multiple short requests instead of one long one. Take into account the time complexity mentioned in the Redis commands documentation when writing your requests, especially if you work on a large number of keys or a large data structure. Avoid the KEYS command and prefer the SCAN family of commands. Be even more careful when writing a Lua script that will be sent to Redis using the EVAL command.
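For example, with the redis-py client (a sketch), iterating with SCAN keeps each individual command short, instead of blocking the server with one O(N) KEYS call over the whole keyspace:

    import redis

    r = redis.Redis()

    # KEYS walks the entire keyspace in a single blocking command:
    # all_keys = r.keys("session:*")   # avoid on large datasets

    # SCAN returns a cursor and a small batch per call, so other clients'
    # commands get served between batches.
    for key in r.scan_iter(match="session:*", count=500):
        r.expire(key, 3600)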
Because each request has a very short execution time, in most use cases clients won't be impacted by the fact that Redis won't respond to any other command while a given one is executing.
Most of the time, the limiting factor won't be Redis itself, but the network.
However, in some use cases, you may hit Redis limits (which are very high). In these cases, you can use multiple Redis instances in master-slave mode (replication, monitored by Redis Sentinel) and do some kind of load balancing between the instances for read requests. You can also use a tool like twemproxy in front of several Redis instances.