How to grab contents asynchronously using Python (gevent)?

The scenario is to save the response of an API request that takes an IMDb id as a parameter.
I want to grab all the movie info from IMDb id tt0000001 to tt9999999.
Right now I'm using gevent to run several greenlets (gevent.joinall(threads)), but it's not very fast.
Are there other solutions for this kind of problem, like using Celery + RabbitMQ?

For one, you must make sure that you aren't making any blocking calls in your code,
as they will block everything else from running and slow the entire system down.
Common causes of blocking are tight loops or IO that hasn't been patched by gevent's monkey patch (e.g. C extensions).
Celery supports both eventlet and gevent, and that is probably the recommended concurrency
option for what you are doing (web request IO). Celery may not make your code run faster, but it makes it easy to distribute the work across many machines.
To optimize, you should always profile your code to find out what the bottleneck is. It could be many things: a slow network, a slow host, slow DNS, or something else entirely.
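To make the non-blocking setup concrete, here is a minimal gevent sketch that monkey-patches the standard library and fetches ids through a bounded pool. The endpoint URL, pool size, and id range are illustrative assumptions, not part of the question:

import gevent.monkey
gevent.monkey.patch_all()  # must run before requests/urllib3 are imported

import requests
from gevent.pool import Pool

def fetch(n):
    # Hypothetical endpoint; substitute the API you are actually calling.
    url = "https://www.omdbapi.com/?i=tt%07d" % n
    return n, requests.get(url, timeout=10).status_code

pool = Pool(100)  # bound concurrency instead of spawning one greenlet per id
for n, status in pool.imap_unordered(fetch, range(1, 1001)):
    print(n, status)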

Daemon background tasks on flask (uwsgi) application

Edit to clarify my question:
I want to attach a Python service to uWSGI using this feature (I can't understand the examples), and I also want to be able to communicate results between the two. Below I present some context and my first thought on the communication matter, hoping for advice or another approach to take.
I have an already developed python application that uses multiprocessing.Pool to run on-demand tasks. The main reason for using the pool of workers is that I need to share several objects between them.
On top of that, I want to have a flask application that triggers tasks from its endpoints.
I've read several questions here on SO looking for possible drawbacks of using flask with python's multiprocessing module. I'm still a bit confused but this answer summarizes well both the downsides of starting a multiprocessing.Pool directly from flask and what my options are.
This answer shows an uWSGI feature to manage daemon/services. I want to follow this approach so I can use my already developed python application as a service of the flask app.
One of my main problems is that I look at the examples and do not know what I need to do next. In other words, how would I start the python app from there?
Another problem is the communication between the flask app and the daemon process/service. My first thought is to use Flask-SocketIO to communicate, but then, if my server stops, I need to deal with the connection... Is this a good way to communicate between server and service? What are other possible solutions?
Note:
I'm well aware of Celery, and I intend to use it in the near future. In fact, I have an already developed node.js app in which users perform actions that should trigger specific tasks in the (also already developed) python application. The thing is, I need a production-ready version as soon as possible, and instead of modifying the python application, which uses multiprocessing, I thought it would be faster to create a simple flask server to communicate with node.js through HTTP. This way I would only need to implement a flask app that instantiates the python app.
Edit:
Why do I need to share objects?
Simply because the creation of the objects in question takes too long. The creation actually takes an acceptable amount of time if done once, but, since I'm expecting (maybe) hundreds to thousands of simultaneous requests, having to load every object again is something I want to avoid.
One of the objects is a scikit-learn classifier model, persisted in a pickle file, which takes 3 seconds to load. Each user can create several "job spots", each of which will take over 2k documents to be classified, and each document will be uploaded at an unknown point in time, so I need to keep this model loaded in memory (loading it again for every task is not acceptable).
This is one example of a single task.
Edit 2:
I've asked some questions related to this project before:
Bidirectional python-node communication
Python multiprocessing within node.js - Prints on sub process not working
Adding a shared object to a manager.Namespace
As stated, but to clarify: I think the best solution would be to use Celery, but in order to have a production-ready solution quickly, I'm trying to use this uWSGI attach-daemon solution.
I can see the temptation to hang on to multiprocessing.Pool. I'm using it in production as part of a pipeline. But Celery (which I'm also using in production) is much better suited to what you're trying to do, which is distribute work across cores to a resource that's expensive to set up. Have N cores? Start N Celery workers, each of which can load (or maybe lazy-load) the expensive model as a global. When a request comes in to the app, launch a task (e.g., task = predict.delay(args)), wait for it to complete (e.g., result = task.get()), and return a response. You're trading a little time spent learning Celery for not having to write a bunch of coordination code.
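A minimal sketch of that pattern, to make it concrete. The module name, broker URL, and model path are all assumptions for illustration:

# tasks.py -- hypothetical module; broker URL and model path are assumptions
import pickle
from celery import Celery

app = Celery("tasks",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

_model = None  # loaded once per worker process, not once per task

def get_model():
    global _model
    if _model is None:
        with open("model.pkl", "rb") as f:  # the ~3 s load happens only here
            _model = pickle.load(f)
    return _model

@app.task
def predict(features):
    return get_model().predict([features]).tolist()

The flask endpoint then just does task = predict.delay(features) and, if it needs the answer synchronously, result = task.get(timeout=30).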

Concurrent server in Golang

First of all I have to admit that I am a beginner concerning concurrency in general, but I have been reading a lot about it recently, and I heard that Golang is strong in that area. So I wanted to ask how (concurrent) servers are written in this language.
I mean, there are different ways to write a server that can handle multiple requests/connections concurrently. You can use threads or asynchronous programming (async/asyncio in Python, for example), and in Golang there are goroutines, which are more or less lightweight threads.
With Python and async/asyncio you can have a single process and a single thread and still handle concurrency, but the code is complicated (at least for me, without any background in it).
My question:
What is the way to go to write a concurrent server in Golang? Just a new goroutine for every connection, or are there asynchronous ways? What's the "best practice"?
I mean, isn't it expensive to have LOTS of goroutines on a heavily used server? How do you write a well-built server in Golang?
For a beginner, the best way to start is to just use https://golang.org/pkg/net/http/ and write http handlers. You don't need to spawn goroutines yourself - the http.Server will do it for you.
The code will be straightforward, with blocking calls. You don't need to think about concurrency at this stage, as Go will do it for you. For example, when you make a call like
record, err := someDb.GetRecordByID(123)
it is effectively asynchronous: it blocks the current flow but releases the underlying thread to other goroutines. The flow continues once the data is returned and a thread (possibly a different one from before) becomes available.
If you need to make concurrent calls within one HTTP request, you can start goroutines. But leave that for a later stage and do the Go tour on concurrency first.
If you really need a high-load solution for HTTP requests, consider using https://github.com/valyala/fasthttp instead of the standard http package.
For HTTP, #icza's comments and Alexander's answer give a fair idea. Just to add: goroutines are not expensive, because they are lighter than normal threads. They have variable-sized stacks (starting as low as 2 KB) and hence scale very well with little operating overhead.
Also for http, there are third-party libraries like Gorilla mux that can make life better, as can frameworks like Buffalo. While I haven't used the latter, I have heard it makes life easier.
Now if you are going to be writing your own custom server (something other than http), then again Go is a great choice. The program can start out as simple as https://golang.org/pkg/net/#example_Listener (to try running that echo server, you can use netcat from another terminal):
$ nc localhost 2000
Hellow
Hellow
And finally, channels in Go make sharing data and communicating between goroutines much easier and safer, taking care of the synchronization aspects for you. Hope this helps.
My question: What is the way to go to write a concurrent server in Golang? Just a new goroutine for every connection or are there any asynchronous ways? What's "best practice"?
Golang's http package does request concurrency handling for you, and I really like that the code looks synchronous: you don't need to add any async/await keywords. Here is how you start (a complete program in package main, importing fmt, log, and net/http):
func helloHandler(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintf(w, "Hello")
}

func main() {
    http.HandleFunc("/hello", helloHandler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}

How to run multithreaded Python scripts

I wrote a Python web scraper yesterday and ran it in my terminal overnight; it only got through 50k pages. So now I just have a bunch of terminals open, concurrently running the script at different start and end points. This works fine because the main lag is obviously opening web pages, not actual CPU load. Is there a more elegant way to do this, especially if it can be done locally?
You have an I/O bound process, so to speed it up you will need to send requests concurrently. This doesn't necessarily require multiple processors, you just need to avoid waiting until one request is done before sending the next.
There are a number of solutions for this problem. Take a look at this blog post or check out gevent, asyncio (backports to pre-3.4 versions of Python should be available) or another async IO library.
However, when scraping other sites, you must remember: you can send requests very fast with concurrent programming, but depending on what site you are scraping, this may be very rude. You could easily bring a small site serving dynamic content down entirely, forcing the administrators to block you. Respect robots.txt, try to spread your efforts between multiple servers at once rather than focusing your entire bandwidth on a single server, and carefully throttle your requests to single servers unless you're sure you don't need to.
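As a concrete illustration, here is a minimal asyncio sketch of concurrent, throttled fetching (Python 3.7+); aiohttp, the concurrency cap of 20, and the URL list are assumptions:

import asyncio
import aiohttp

async def fetch(session, sem, url):
    async with sem:                      # throttle in-flight requests
        async with session.get(url) as resp:
            return url, resp.status

async def main(urls):
    sem = asyncio.Semaphore(20)          # be polite to the target server
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, sem, u) for u in urls))

urls = ["https://example.com/page/%d" % i for i in range(100)]
for url, status in asyncio.run(main(urls)):
    print(status, url)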

Fastest, simplest way to handle long-running upstream requests for Django

I'm using Django with uWSGI. We have 8 processes running, and I have no real indication that our code is particularly thread safe, as it was never designed with threads in mind.
Recently, we added the ability to get live rates from vendors of a service through their various APIs and display them all at once for the user. The problem is that these requests use old web-service technologies, and due to their response times, the time needed before all rates from vendors are acquired (or the request gives up) can be up to 10 seconds.
This presents a problem. We have a pretty decent amount of traffic on our site, and the customers need to look at these rates pretty often. With only 8 processes, it's quite easy to see how the server can get tied up waiting on these upstream requests, especially when other optimizations still need to be made to make the site faster at baseline anyway (we're working on that).
We made a separate library for the rates requesting (which should be mostly threadsafe, and if not, should be easy enough to make so), and we can separate out its configuration. So I was thinking of making a separate service with its own threads, perhaps in Twisted, and having the browser contact that service for JSON instead of having it run in the main Django server.
Is this solution a good one? Can you think of a better or simpler way to do it? Should I use something other than Twisted, and if so, why?
If you want to use your code in-process with Django, you can simply call out to your Twisted code by using Crochet, which can automatically manage the creation, running, and shutdown of the reactor within whatever WSGI implementation you choose (presuming it behaves like a regular Python process, at least).
Obviously it might be less complex to just run within the Twisted WSGI container :-).
It might also be worth looking at treq for issuing your service-client requests; your new "thread safe" library will still have the disadvantage of tying up an entire thread for each blocking client, which is a non-trivial amount of memory and additional concurrency overhead, whereas with Twisted you only need to worry about a couple of objects.
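Roughly, the Crochet + treq combination looks like this; the function name and the rates URL are made up for illustration:

from crochet import setup, wait_for
setup()  # start the Twisted reactor in a background thread

import treq

@wait_for(timeout=10.0)  # callable from ordinary (blocking) Django view code
def fetch_rates(url):
    d = treq.get(url)
    d.addCallback(treq.json_content)  # parse the JSON body asynchronously
    return d

A Django view can then call rates = fetch_rates("https://vendor.example/rates") like any blocking function, while the actual IO happens on the reactor thread.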

Twisted or Celery? Which is right for my application with lots of SOAP calls?

I'm writing a Python application that needs both concurrency and asynchronicity. I've had a few recommendations each for Twisted and Celery, but I'm having trouble determining which is the better choice for this application (I have no experience with either).
The application (which is not a web app) primarily centers around making SOAP calls out to various third-party APIs. To process a given piece of data, I'll need to call several APIs sequentially. And I'd like to have a pool of "workers" for each of these APIs so I can make more than one call at a time to each API. Nothing about this should be very CPU-intensive.
More specifically, an external process will add a new "Message" to this application's database. I will need a job that watches for new messages and then pushes them through the process. The process will contain 4-5 steps that need to happen in order, but that can happen completely asynchronously. Each step will take the message and act upon it in some way, typically adding details to the message. Each subsequent step will require the output from the step that precedes it. For most of these steps, the work involved centers around calling out to a third-party API, typically with a SOAP client, parsing the response, and updating the message. A few cases will involve the creation of a binary file (harder to pickle, if that's a factor). Ultimately, once the last step has completed, I'll need to update a flag in the database to indicate the entire process is done for this message.
Also, since each step will involve waiting for a network response, I'd like to increase overall throughput by making multiple simultaneous requests at each step.
Is either Celery or Twisted a more generally appropriate framework here? If they'll both solve the problem adequately, are there pros/cons to using one vs the other? Is there something else I should consider instead?
Is either Celery or Twisted a more generally appropriate framework here?
Depends on what you mean by "generally appropriate".
If they'll both solve the problem adequately, are there pros/cons to using one vs the other?
Not an exhaustive list.
Celery Pros:
Ready-made distributed task queue, with rate limiting, retries, and remote workers
Rapid development
Comparatively shallow learning curve
Celery Cons:
Heavyweight: multiple processes, external dependencies
Have to run a message passing service
Application "processes" will need to fit Celery's design
Twisted Pros:
Lightweight: single process and not dependent on a message passing service
Rapid development (for those familiar with it)
Flexible
Probably faster, no "internal" message passing required.
Twisted Cons:
Steep learning curve
Not necessarily as easy to add processing capacity later.
I'm familiar with both, and from what you've said, if it were me I'd pick Twisted.
I'd say you'll get it done quicker using Celery, but you'd learn more while doing it by using Twisted. If you have the time and inclination to follow the steep learning curve, I'd recommend you do this in Twisted.
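For reference, if you do go the Celery route, the 4-5 ordered steps map naturally onto a Celery chain. A rough sketch, with made-up task names and a Redis broker as assumptions:

from celery import Celery, chain

app = Celery("pipeline", broker="redis://localhost:6379/0")

@app.task
def call_api_a(message_id):
    # call the first SOAP API, add details to the message, pass the id on
    return message_id

@app.task
def call_api_b(message_id):
    # next API in the sequence; receives the previous task's return value
    return message_id

@app.task
def mark_done(message_id):
    # set the "entire process is done" flag in the database
    return message_id

# steps run in order per message; the worker pool processes many messages at once
chain(call_api_a.s(42), call_api_b.s(), mark_done.s()).delay()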
Celery also allows you to use the asynchronous behavior of various async libraries like gevent and eventlet, so you can have the best of both worlds.
Example using eventlet
https://github.com/celery/celery/tree/master/examples/eventlet
Example using gevent
https://github.com/celery/celery/tree/master/examples/gevent
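For instance (assuming a Celery app in a module named proj), a worker using the gevent pool can be started with:
$ celery -A proj worker -P gevent -c 100
so each worker process can keep many SOAP calls in flight at once.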
