AWS and Python threading scalability

I have a service running on a local server, written using the Python threading library. Think of it as a kind of web crawler. It uses 50 threads. I want to deploy it on the Amazon Web Services cloud and scale it up so it uses more threads.
Simply put, I have two queues: Qinput with URLs and Qoutput with page content. The threads pick URLs from Qinput, fetch the content of the web page, and put it into Qoutput.
Question: is it enough to simply increase the number of threads to, say, 500, 5,000 or 50,000, and AWS + Python will handle it? Should I expect the service to run seamlessly, or are there some "standard" design pitfalls that I should be aware of when porting a multithreaded service to AWS?
I am aware of the Global Interpreter Lock, although it should not be an issue here, as the main task of the threads is to call outside the interpreter while crawling / scraping pages.
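For reference, the pattern described above looks roughly like this; details such as timeouts and error handling are assumptions for illustration:

```python
import queue
import threading

import requests

Qinput = queue.Queue()   # URLs to fetch
Qoutput = queue.Queue()  # (url, page content) results

def worker():
    while True:
        url = Qinput.get()
        try:
            Qoutput.put((url, requests.get(url, timeout=10).text))
        except requests.RequestException:
            pass  # in this sketch, failed fetches are simply skipped
        finally:
            Qinput.task_done()

for _ in range(50):
    threading.Thread(target=worker, daemon=True).start()

# Enqueue URLs with Qinput.put(url), then Qinput.join() to wait for the workers.
```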

Any single instance has its limits. You will probably be able to spawn quite a lot of threads on your instance, especially if you choose one of the larger types. But you will get diminishing returns on the additional threads, until adding more no longer buys you any performance.
However, if you want your system to scale beyond the limitations of a single instance, it is best to be able to run it on multiple instances. Then your decision is only operational, not technical. And since you are running in the AWS environment, which gives you almost endless operational resources, you should look into it.
You can also check out SQS, which is basically a distributed queue service. It will allow you to coordinate the work of as many instances as you need.
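A hedged sketch of the SQS variant of the crawler, where each instance pulls URLs from a shared input queue and pushes results to an output queue. The queue URLs are hypothetical, and boto3 needs AWS credentials and a region configured:

```python
import boto3
import requests

sqs = boto3.client("sqs")
QINPUT_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/Qinput"    # hypothetical
QOUTPUT_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/Qoutput"  # hypothetical

def crawl_forever():
    while True:
        resp = sqs.receive_message(QueueUrl=QINPUT_URL,
                                   MaxNumberOfMessages=10,
                                   WaitTimeSeconds=20)  # long polling
        for msg in resp.get("Messages", []):
            try:
                page = requests.get(msg["Body"], timeout=10).text
            except requests.RequestException:
                continue  # leave the message; it becomes visible again for a retry
            # SQS message bodies are limited to 256 KB, hence the truncation
            sqs.send_message(QueueUrl=QOUTPUT_URL, MessageBody=page[:250_000])
            sqs.delete_message(QueueUrl=QINPUT_URL,
                               ReceiptHandle=msg["ReceiptHandle"])
```

Because a message is deleted only after its result is sent, a crashed instance simply lets its in-flight URLs become visible to other workers again.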

Related

How to scale a CPU-bound Twisted application?

I'm working on a Twisted web application which uploads files and encrypts them, returning the URL + key to the user.
I've been tasked with scaling this application. At the moment, when there are more than 3-4 concurrent upload requests, performance drops off significantly.
I'm no Twisted expert, but I assume this is because it runs in a single Python process, it is a CPU-heavy application, and the GIL?
How could I go about scaling this?
If this were a different framework, such as Flask, I would just put uWSGI in front of it and scale the number of processes. Would something similar work for Twisted, and if so, what tools are generally used for this?
If you think you could throw uwsgi in front of the application, I suppose it is pretty close to shared-nothing. So you can run multiple instances of the program and gain a core's worth of performance from each.
There are a couple really obvious options for exactly how to run the multiple instances. You could have a load balancer in front. You could have the processes share a listening port. There are probably more possibilities, too.
Since your protocol seems to be HTTP, any old HTTP load balancer should be applicable. It needn't be Twisted or Python based itself (though certainly it could be).
If you'd rather share a listening port, Twisted has APIs for passing file descriptors between processes (IReactorSocket) and for launching new processes that inherit a file descriptor from the parent (IReactorProcess).
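As an illustration of the shared-listening-port idea, here is a minimal sketch using plain sockets with SO_REUSEPORT on Linux rather than Twisted's fd-passing APIs; each forked worker gets its own core's worth of CPU:

```python
import os
import socket

N_WORKERS = os.cpu_count() or 2

def serve():
    # SO_REUSEPORT lets every worker bind the same port; the kernel
    # distributes incoming connections among them.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("0.0.0.0", 8080))
    sock.listen(128)
    while True:
        conn, _addr = sock.accept()
        conn.recv(4096)  # read (and ignore) the request in this sketch
        conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
        conn.close()

if __name__ == "__main__":
    for _ in range(N_WORKERS):
        if os.fork() == 0:  # child process: run one single-core worker
            serve()
    for _ in range(N_WORKERS):
        os.wait()
```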

Daemon background tasks on flask (uwsgi) application

Edit to clarify my question:
I want to attach a Python service to uWSGI using this feature (I can't understand the examples), and I also want to be able to communicate results between them. Below I present some context and my first thought on the communication matter, expecting maybe some advice or another approach to take.
I have an already developed Python application that uses multiprocessing.Pool to run on-demand tasks. The main reason for using the pool of workers is that I need to share several objects between them.
On top of that, I want to have a flask application that triggers tasks from its endpoints.
I've read several questions here on SO looking for possible drawbacks of using flask with python's multiprocessing module. I'm still a bit confused but this answer summarizes well both the downsides of starting a multiprocessing.Pool directly from flask and what my options are.
This answer shows an uWSGI feature to manage daemon/services. I want to follow this approach so I can use my already developed python application as a service of the flask app.
One of my main problems is that I look at the examples and do not know what I need to do next. In other words, how would I start the python app from there?
Another problem is the communication between the flask app and the daemon process/service. My first thought is to use Flask-SocketIO to communicate, but then, if my server stops, I need to deal with the connection... Is this a good way to communicate between server and service? What are other possible solutions?
Note:
I'm well aware of Celery, and I intend to use it in the near future. In fact, I have an already developed node.js app, on which users perform actions that should trigger specific tasks from the (also) already developed Python application. The thing is, I need a production-ready version as soon as possible, and instead of modifying the Python application, which uses multiprocessing, I thought it would be faster to create a simple flask server to communicate with node.js through HTTP. This way I would only need to implement a flask app that instantiates the Python app.
Edit:
Why do I need to share objects?
Simply because the creation of the objects in question takes too long. Actually, the creation takes an acceptable amount of time if done once; but since I'm expecting (maybe) hundreds to thousands of simultaneous requests, having to load every object again is something I want to avoid.
One of the objects is a scikit-learn classifier model, persisted in a pickle file, which takes 3 seconds to load. Each user can create several "job spots", each of which will take over 2k documents to be classified; each document will be uploaded at an unknown point in time, so I need to have this model loaded in memory (loading it again for every task is not acceptable).
This is one example of a single task.
Edit 2:
I've asked some questions related to this project before:
Bidirectional python-node communication
Python multiprocessing within node.js - Prints on sub process not working
Adding a shared object to a manager.Namespace
As stated, but to clarify: I think the best solution would be to use Celery, but in order to quickly have a production-ready solution, I'm trying to use this uWSGI attach-daemon approach.
I can see the temptation to hang on to multiprocessing.Pool. I'm using it in production as part of a pipeline. But Celery (which I'm also using in production) is much better suited to what you're trying to do, which is to distribute work across cores to a resource that's expensive to set up. Have N cores? Start N Celery workers, each of which can load (or maybe lazy-load) the expensive model as a global. When a request comes in to the app, launch a task (e.g., task = predict.delay(args)), wait for it to complete (e.g., result = task.get()), and return a response. You're trading a little bit of time learning Celery for not having to write a bunch of coordination code.
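A minimal sketch of that worker-side pattern, assuming a Redis broker and a pickled scikit-learn model at model.pkl (both hypothetical):

```python
import pickle

from celery import Celery

app = Celery("tasks",
             broker="redis://localhost:6379/0",   # hypothetical broker
             backend="redis://localhost:6379/0")

_model = None  # loaded once per worker process, not once per task

def get_model():
    global _model
    if _model is None:
        with open("model.pkl", "rb") as f:  # the 3-second load happens here
            _model = pickle.load(f)
    return _model

@app.task
def predict(document):
    # classify a single document with the lazily loaded model
    return get_model().predict([document]).tolist()
```

The web app then calls predict.delay(document) and blocks on .get() (or polls) for the result, so the model never has to be loaded inside the Flask process at all.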

Python multithreading - Global Interpreter Lock

The Python threading module documentation says something like this:
In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
Can someone explain whether I can use the threading module in my situation or not?
I'm going to detect the frameworks used by websites.
So here is how my app works. My MySQL database contains around 10 million domains (id, domain, frameworks). For each batch, the app:
1. Fetches 1000 rows from the database
2. Scrapes the domains one by one using the requests module
3. Detects the frameworks
4. Updates the database rows with the results
Since I have 10 million domains, it's going to take a very long time, so I would like to speed up the process by using threads.
But I'm not sure whether my app is I/O-bound or not. Can someone explain?
Thank you!
I guess the most time-consuming activity will be fetching all the URLs.
So the answer to your question is: yes, your app is very likely to be I/O-bound.
You plan to scrape domains one by one; this would lead to a really long processing time. You should definitely do it concurrently, as sketched below. One solution is described in my answer to a similar question related to scraping web sites.
Anyway, the number of your URLs seems really large, so you might need to split the work across multiple workers; for this purpose you might use e.g. the Celery framework. However, as your task is really I/O-bound, you would only gain speed if your workers run on multiple computers, ideally with independent connectivity. I did a similar task on DigitalOcean machines and it worked very well.
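A minimal sketch of the concurrent variant using a thread pool; detect_frameworks() here is a trivial hypothetical stand-in for the real detection logic:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

def detect_frameworks(html):
    # hypothetical stand-in for the real framework-detection step
    return ["wordpress"] if "wp-content" in html else []

def check(domain):
    resp = requests.get(f"http://{domain}", timeout=10)
    return domain, detect_frameworks(resp.text)

def crawl(domains, workers=50):
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(check, d): d for d in domains}
        for future in as_completed(futures):
            try:
                domain, frameworks = future.result()
                results[domain] = frameworks
            except requests.RequestException:
                pass  # skip unreachable domains
    return results
```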

Fastest, simplest way to handle long-running upstream requests for Django

I'm using Django with uWSGI. We have 8 processes running, and I have no real indication that our code is particularly thread-safe, as it was never designed with threads in mind.
Recently, we added the ability to get live rates from vendors of a service through their various APIs and display them all at once for the user. The problem is that these requests use old web-service technologies and, due to their response times, the time needed before all the rates from vendors are acquired (or we give up) can be up to 10 seconds.
This presents a problem. We have a pretty decent amount of traffic on our site, and customers need to look at these rates pretty often. With only 8 processes, it's quite easy to see how the server can get tied up waiting on these upstream requests, especially when other optimizations still need to be made to make the site's baseline faster anyway (we're working on that).
We made a separate library for the rate requests (which should be mostly thread-safe and, if not, should be easy enough to make so), and we can separate out its configuration. So I was thinking of making a separate service with its own threads, perhaps in Twisted, and having the browser contact that service for JSON instead of having it run in the main Django server.
Is this solution a good one? Can you think of a better or simpler way to do it? Should I use something other than Twisted, and if so, why?
If you want to use your code in-process with Django, you can simply call out to Twisted using Crochet, which can automatically manage the creation, running, and shutdown of the reactor within whatever WSGI implementation you choose (presuming that it behaves like a regular Python process, at least).
Obviously it might be less complex to just run within the Twisted WSGI container :-).
It might also be worth looking at treq for issuing your service-client requests; your new "thread-safe" library will still have the disadvantage of tying up an entire thread for each blocking client, which is a non-trivial amount of memory and additional concurrency overhead, whereas with Twisted you will only need to worry about a couple of objects.
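A hedged sketch of the Crochet + treq combination: the reactor runs in a background thread inside the WSGI process, and the view blocks only on the aggregated result. The vendor URLs are hypothetical:

```python
import treq
from crochet import setup, wait_for
from twisted.internet.defer import gatherResults

setup()  # start the Twisted reactor in a background thread, once per process

@wait_for(timeout=10.0)  # block the calling (Django) thread for at most 10 s
def fetch_rates(urls):
    # issue all vendor requests concurrently; fails fast if any vendor errors
    deferreds = [treq.get(url).addCallback(treq.json_content) for url in urls]
    return gatherResults(deferreds, consumeErrors=True)

# In a Django view, something like:
# rates = fetch_rates(["https://vendor-a.example/rates",
#                      "https://vendor-b.example/rates"])
```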

Python script for load testing a web page

I want to load test a web page. I want to do it in Python with multiple threads.
The first POST request would log the user in (set cookies).
Then I need to know how many users making the same POST request simultaneously the server can take.
So I'm thinking about spawning threads in which requests would be made in a loop.
I have a couple of questions:
1. Is it possible, CPU-wise, to run 1000-1500 requests at the same time? I mean, wouldn't it slow down the system so much that the test is no longer reliable?
2. What about bandwidth limitations? How good does the channel have to be for this test to be reliable?
The server hosting the test site is on Amazon EC2; the script would be run from another server (also on Amazon).
Thanks!
CPython does not take advantage of multiple cores when running multiple threads. That means, basically, you will have only one core doing the testing job.
There are dedicated tools that do what you want. Let me suggest two:
FunkLoad is a functional and load web tester, written in Python, whose main use cases are:
- Functional testing of web projects, and thus regression testing as well.
- Performance testing: by loading the web application and monitoring your servers, it helps you pinpoint bottlenecks, giving a detailed report of performance measurements.
- Load testing, to expose bugs that do not surface in cursory testing, like volume testing or longevity testing.
- Stress testing, to overwhelm the web application's resources and test the application's recoverability.
- Writing web agents, by scripting any repetitive web task, like checking whether a site is alive.
Tsung is an open-source multi-protocol distributed load testing tool. Its purpose is to simulate users in order to test the scalability and performance of IP-based client/server applications. You can use it to do load and stress testing of your servers. Many protocols have been implemented and tested, and it can be easily extended; WebDAV, LDAP and MySQL support have been added recently (experimental). It can be distributed across several client machines and is able to simulate hundreds of thousands of virtual users concurrently (or even millions if you have enough hardware...).
If you decide to write your own tool, you will probably want to use Python's multiprocessing module, as it would let you use multiple cores. You should also take a look at Twisted, as it would let you easily handle many sockets with a limited number of threads; that would be much better than spawning a new thread for each socket.
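A minimal sketch of that multiprocessing approach, with the target URL and payload as assumptions for illustration:

```python
import time
from multiprocessing import Pool

import requests

URL = "https://test-site.example/login"  # hypothetical target

def worker(n_requests):
    session = requests.Session()  # keeps the login cookie between requests
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        session.post(URL, data={"user": "test", "password": "test"})
        latencies.append(time.perf_counter() - start)
    return latencies

if __name__ == "__main__":
    n_procs = 8  # roughly one process per core
    with Pool(processes=n_procs) as pool:
        results = pool.map(worker, [100] * n_procs)
    flat = [t for sub in results for t in sub]
    print(f"{len(flat)} requests, mean latency {sum(flat) / len(flat):.3f}s")
```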
Since you work with Amazon EC2, I would recommend using Tsung. You can rent a dozen multicore servers for a few hours and run some really heavy load tests with Tsung. It scales very well in this kind of configuration.
As for bandwidth, it's usually not a problem, but it depends on the application. You will have to monitor all your resources closely while performing a load test.
Too many variables. 1000 at the same time... no. In the same second... possibly. Bandwidth may well be the bottleneck. This is something best solved by experimentation.
