Require Help Structuring Parallel HTTP Requests - python

Here's my case. I have three tables Book, Publisher and Price. I have a management command that does loops over each book and for each book, it queries the publisher to get the price which it then stores into the Prices table. It's a very simple HTTP GET or UDP request that I make to get the price. Here what the skeleton of my code looks like:
#transaction.commit_on_success
def handle(self, *args, **options):
for book in Book.objects.all():
for publisher book.publisher_set.objects.all():
price = check_the_price(publisher.url, book.isbn)
Price.objects.create(book=book, publisher=publisher, price=price)
The code is simple, but it gets really slow and time consuming when I have 10000 books. I could easily speed this up by making parallel HTTP requests. I could make 50 parallel requests this would be done in a jiffy but I don't know how to structure this code.
My site itself is very and small and light-weight site and I'm trying to stay away from RabbitMQ/Celery stuff. I just feel it's a big thing to take on right now.
Any recommendations on how to do this while maintaining transactional integrity?
Edit #1: This is used as an analogy for what I'm actually doing. In writing this analogy I forgot to mention that I also need to make a few UDP requests.

You could use the requests package which provides quasi-parallel request processing based on gevent's green threads. requests lets you build a number of request objects which are then executed in "parallel". See this example.
Green threads do not actually run in parallel, but cooperatively yield execution control. gevent can patch the standard library's I/O functions (e.g. the ones used by urllib2) to yield control whenever they would block on I/O otherwise. The request package wraps that into a single function call which takes a number of requests and returns a number of response objects. It doesn't get much easier than that.

Related

Best way to create a queue for handling request to REST Api created via Django

I have the following scenario:
Back end => a geospatial database and the Open Data Cube tool
API => Users can define parameters (xmin,xmax,ymin,ymax) to make GET
requests
Process => On each requests analytics are calculated and satellite
images pixels' values are given back to the user
My question is the following: As the process is quite heavy (it can reserve many GB of RAM) how it is possible to handle multiple requests at the same time? Is there any queue that I can save the requests and serve each one sequently?
Language/frameworks => Python 3.8 and Django
Thanks in advance
Celery + Rabbitmq/Redis is probably what you need.
In this configuration, your heavy processes become "tasks". When called with .delay() they go in the queue and are not handle by your main process anymore.
You might want to check the tuto
https://docs.celeryproject.org/en/stable/django/first-steps-with-django.html
There are many asynchronous message queueing technologies that allow you to do this, lots of which have Python APIs too.
You probably want to use request-response messaging, to correlate the requests you get with the replies you want to send.
A message queueing technology will allow you to take the requests, store them on a queue, and have your server handle them when it's ready. Storing requests on a queue means that they won't get lost. This also allows your application to scale - as more requests come in, they can be dealt with by multiple application instances and still return only one result!
The answer above recommends celery, which is a great choice for this kind of project. Depending on the requirements you have, you can also use pymqi: https://dsuch.github.io/pymqi/examples.html or ZeroMQ: (example here for using a request-response pattern) ZeroMQ - Multiple Publishers and Listener if you need technologies that are more heavy-duty.

Class for making asynchronous API requests with Twisted

I am working on development of a system that collect data from rest servers and manipulates it.
One of the requirements is multiple and frequent API requests. We currently implement this in a somewhat synchronous way. I can easily implement this using threads but considering the system might need to support thousands of requests per second I think it would be wise to utilize Twisted's ability to efficiently implement the above. I have seen this blog post and the whole idea of deferred list seems to do the trick. But I am kind of stuck with how to structure my class (can't wrap my mind around how Twisted works).
Can you try to outline the structure of the class that will run the event-loop and will be able to get a list of URLs and headers and return a list of results after making the requests asynchronously?
Do you know of a better way of implementing this in python?
Sounds like you want to use a Twisted project called treq which allows you to send requests to HTTP endpoints. It works alot like requests. I recently helped a friend here in this thread. My answer there might be of some use to you. If you still need more help, just make a comment and I'll try my best to update this answer.

Fastest, simplest way to handle long-running upstream requests for Django

I'm using Django with Uwsgi. We have 8 processes running, and I have no real indication that our code is particularly thread safe, as it was never designed with threads in mind.
Recently, we added the ability to get live rates from vendors of a service through their various APIs and display them at once for the user. The problem is these requests are old web services technologies, and due to their response times, the time needed before the all rates from vendors are acquired (or it gives up), can be up to 10 seconds.
This presents a problem. We have a pretty decent amount of traffic on our site, and the customers need to look at these rates pretty often. With only 8 processes, it's quite easy to see how the server can get tied up waiting on these upstream requests. Especially when other optimizations need to be made to make the site baseline faster anyway (we're working on that).
We made a separate library (which should be mostly threadsafe, and if not, should be converted to it easily enough) for the rates requesting, and we can separate out its configuration. So I was thinking of making a separate service with its own threads, perhaps in Twisted, and having the browser contact that service for JSON instead of having it run in the main Django server.
Is this solution a good one? Can you think of a better or simpler way to do it? Should I use something other than Twisted, and if so, why?
If you want to use your code in-process with Django, you can simply call out to your Twisted by using Crochet, which can automatically manage the creation, running, and shutdown of the reactor within whatever WSGI implementation you choose (presuming that it behaves like a regular Python process, at least).
Obviously it might be less complex to just run within the Twisted WSGI container :-).
It might also be worth looking at TReq to issue your service client requests; your new "thread safe" library will still have the disadvantage of tying up an entire thread for each blocking client, which is a non-trivial amount of memory and additional concurrency overhead, whereas with Twisted you will only need to worry about a couple of objects.

How to create a polling script in Python?

I was trying to create a polling script in python that starts when another python script starts and then keeps supplying data back to this script.
I can obviously write an infinite loop but is that the right way to go about it? I might loose control over how the functions work and how many times a function should be called in an hour.
Edit:
What I am trying to accomplish is to poll the REST API of twitter and get new mentions and people who follow me. I obviously can't keep polling because I will run out of API requests per hour. Thus, the issue. This poller, will send the new mention and follower id/user to the main script that would be listening to any such update.
I highly suggest looking into Twisted, one of the most popular async frameworks using the reactor pattern.
The "infinite loop" you are looking for is really an application pattern that Twisted implements to respond to events asynchronously, and it almost never makes sense to roll your own.
Twisted is largely used for networking requirements, but the it has a LoopingCall interface to set up the kind of functionality you require. Using the core Twisted deferred as your request model allows you to set up a long-polling server that can perform the kind of conditional network test you need. It can intially be a little intimidating, but once you understand the core components (Factories, Reactors, Protocols etc) that you need to inherit it becomes much easier to visualize your problem.
This also might be a good tutorial to start looking at the basics of the "push" model:
http://carloscarrasco.com/simple-http-pubsub-server-with-twisted.html

Python Socket and Thread pooling, how to get more performance?

I am trying to implement a basic lib to issue HTTP GET requests. My target is to receive data through socket connections - minimalistic design to improve performance - usage with threads, thread pool(s).
I have a bunch of links which I group by their hostnames, so here's a simple demonstration of input URLs:
hostname1.com - 500 links
hostname2.org - 350 links
hostname3.co.uk - 100 links
...
I intend to use sockets because of performance issues. I intend to use a number of sockets which keeps connected (if possible and it usually is) and issue HTTP GET requests. The idea came from urllib low performance on continuous requests, then I met urllib3, then I realized it uses httplib and then I decided to try sockets. So here's what I accomplished till now:
GETSocket class, SocketPool class, ThreadPool and Worker classes
GETSocket class is a minified, "HTTP GET only" version of Python's httplib.
So, I use these classes like that:
sp = Comm.SocketPool(host,size=self.poolsize, timeout=5)
for link in linklist:
pool.add_task(self.__get_url_by_sp, self.count, sp, link, results)
self.count += 1
pool.wait_completion()
pass
__get_url_by_sp function is a wrapper which calls sp.urlopen and saves the result to results list. I am using a pool of 5 threads which has a socket pool of 5 GETSocket classes.
What I wonder is, is there any other possible way that I can improve performance of this system?
I've read about asyncore here, but I couldn't figure out how to use same socket connection with class HTTPClient(asyncore.dispatcher) provided.
Another point, I don't know if I'm using a blocking or a non-blocking socket, which would be better for performance or how to implement which one.
Please be specific about your experiences, I don't intend to import another library to do just HTTP GET so I want to code my own tiny library.
Any help appreciated, thanks.
Do this.
Use multiprocessing. http://docs.python.org/library/multiprocessing.html.
Write a worker Process which puts all of the URL's into a Queue.
Write a worker Process which gets a URL from a Queue and does a GET, saving a file and putting the File information into another Queue. You'll probably want multiple copies of this Process. You'll have to experiment to find how many is the correct number.
Write a worker Process which reads file information from a Queue and does whatever it is that you're trying do.
I finally found a well chosen path to solve my problems. I was using Python 3 for my project and my only option was to use pycurl, so this made me have to port my project back to Python 2.7 series.
Using pycurl, I gained:
- Consistent responses to my requests (actually my script has to deal with minimum 10k URLs)
- With the usage of ThreadPool class I am receiving responses as fast as my system can (received data is processed later - so multiprocessing is not much of a possibility here)
I tried httplib2 first, I realized that it is not acting as solid as it acts on Python 2, by switching to pycurl I lost caching support.
Final conclusion: When it comes to HTTP communication, one could need a tool like (py)curl at his disposal. It is a lifesaver, especially when one is dealing with loads of URLs (try sometimes for fun: you will get lots of weird responses from them)
Thanks for the replies, folks.

Categories

Resources