Python: running multiple web requests in parallel - python

I'm new to Python and I have a basic question but I'm struggling to find an answer online, because a lot of the examples online seem to refer to deprecated APIs, so sorry if this has been asked before.
I'm looking for a way to execute multiple (similar) web requests in parallel, and retrieve the result in a list.
The synchronous version I have right now is something like:
from requests import get  # assumed: get() is requests.get
urls = ['http://example1.org', 'http://example2.org', '...']
def getResult(urls):
    result = []
    for url in urls:
        result.append(get(url).json())
    return result
I'm looking for the asynchronous equivalent (where all the requests are made in parallel, but I then wait for all of them to be finished before returning the global result).
From what I saw I have to use async/await and aiohttp but the examples seemed way too complicated for the simple task I'm looking for.
Thanks

I am going to try to explain the simplest possible way to achieve what you want. I'm sure there are cleaner or better ways to do this, but here it goes.
You could perform what you want using the Python "threading" library. You can use it to create a separate thread for each request, run all the threads concurrently, and collect the answers.
Since you are new to Python, to simplify things further I am using a global list called RESULTS to store the results of get(url) rather than returning them from the function.
import threading
from requests import get  # assumed: get() is requests.get, as in the question
RESULTS = []  # List to store the results
# Request a single URL and store the result in the global RESULTS
def getSingleResult(url):
    global RESULTS
    RESULTS.append((url, get(url).json()))
# Your original function
def getResult(urls):
    ths = []
    for url in urls:
        th = threading.Thread(target=getSingleResult, args=(url,))  # Create a thread
        th.start()  # Start it
        ths.append(th)  # Add it to the thread list
    for th in ths:
        th.join()  # Wait for all threads to finish
The global RESULTS list is used because it is easier than collecting results from the threads directly. If you wish to do that, you can check out this answer: How to get the return value from a thread in python?
One thing to note is that multi-threading in Python doesn't provide true parallelism but rather concurrency, especially if you are using the standard Python implementation (CPython), because of what is known as the Global Interpreter Lock.
However, for your use case it would still provide the speed-up you need, because the threads spend most of their time waiting on the network.
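For reference, the same fan-out can be written with the standard-library concurrent.futures module, which returns the results directly instead of going through a global list. This is only a minimal sketch: fetch_json, get_results and the max_workers value are illustrative names that do not appear in the answer above, and fetch_json is a stand-in for get(url).json() so that the snippet runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_json(url):
    # Stand-in for get(url).json(); replace with a real HTTP call.
    return {"url": url}

def get_results(urls, fetch=fetch_json, max_workers=8):
    # map() runs fetch in worker threads and yields results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))

results = get_results(["http://example1.org", "http://example2.org"])
print(results)
```

Leaving the with block waits for all worker threads, so the function only returns once every request has finished.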

Related

Concurrency questions with non-concurrent code

I have a library that does calls to smart contracts in the ethereum chain to read data
So for simplicity, my code is like this:
import library
items = [
    "address1",
    "address2",
    "address3",
]
def main():
    for item in items:
        data = library.get_smartcontractinfo(item)
        print(data)
if __name__ == '__main__':
    main()
I am new to concurrency and this is a topic I need to explore further, as there are many options for concurrency, but asyncio seems to be the one most people go for.
The library I am using is not built with asyncio or any sort of concurrency in mind. This means that each time I call library.get_smartcontractinfo(), I need to wait until the query completes before the next iteration can start, which limits the speed.
Let's say that I cannot modify the library, although maybe I will in the future, but I want to get something done ASAP with the existing code.
What would be the easiest way to do simultaneous queries so I can get the info as fast as I can in an efficient way?
What about being rate limited? And would it be possible to group these calls into one without rewriting the library code?
Thank you.
Assuming that library.get_smartcontractinfo() does a lot of network I/O, you could use a ThreadPoolExecutor from concurrent.futures to run more of them in parallel.
The documentation has a good example.
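A minimal sketch of that approach, assuming the library call is network-bound. get_smartcontractinfo below is a local stand-in for library.get_smartcontractinfo so the snippet runs on its own, and query_all and max_workers are illustrative names. Keeping max_workers small is a crude way to limit how many queries are in flight at once; it caps concurrency, not requests per second.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def get_smartcontractinfo(item):
    # Stand-in for library.get_smartcontractinfo (hypothetical behavior).
    if item == "bad":
        raise ValueError("lookup failed")
    time.sleep(0.05)  # simulate network latency
    return f"info for {item}"

def query_all(items, max_workers=4):
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the calls overlap, then collect.
        futures = {item: pool.submit(get_smartcontractinfo, item) for item in items}
        for item, future in futures.items():
            try:
                results[item] = future.result()
            except Exception as exc:
                errors[item] = exc  # keep going if one query fails
    return results, errors

results, errors = query_all(["address1", "address2", "bad"])
print(results, errors)
```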
Assuming the function library.get_smartcontractinfo() is I/O bound, you have multiple options to go with asyncio. If you want to use pure asyncio you can go with something like
import asyncio
async def main():
    loop = asyncio.get_running_loop()
    all_runs = [loop.run_in_executor(None, library.get_smartcontractinfo, item) for item in items]
    results = await asyncio.gather(*all_runs)
Basically, this runs the sync function in a thread. To run the calls concurrently, you first create all the awaitables without awaiting them, and finally pass them into gather.
If you want to use an additional library, I can recommend anyio, or asyncer, which is basically a nice wrapper around anyio. With `asyncer`, you only need to change the one line where you turn a sync function into an async one to
from asyncer import asyncify
...
all_runs = [asyncify(library.get_smartcontractinfo)(item) for item in items]
the rest stays the same.
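On the rate-limiting part of the question, one possible approach is to cap how many calls are in flight with asyncio.Semaphore. The sketch below assumes Python 3.9+ (for asyncio.to_thread); get_smartcontractinfo is a local stand-in for the library call, and fetch_all, fetch_one and limit are illustrative names.

```python
import asyncio
import time

def get_smartcontractinfo(item):
    # Stand-in for the blocking library call (hypothetical behavior).
    time.sleep(0.05)
    return f"info for {item}"

async def fetch_all(items, limit=2):
    semaphore = asyncio.Semaphore(limit)  # at most `limit` calls at once

    async def fetch_one(item):
        async with semaphore:
            # Run the blocking function in a worker thread.
            return await asyncio.to_thread(get_smartcontractinfo, item)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(fetch_one(item) for item in items))

results = asyncio.run(fetch_all(["address1", "address2", "address3"]))
print(results)
```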

Python concurrency isn't concurrent

My employer uses Box. Its API is very slow. Fortunately our files are largely static. Nightly I can iterate (recursively) over Box folders and store the URLs in a local file. Using the local file during the day substantially improves the performance of our scripts that read from and write to Box.
Starting the recursive search (spider) at level 0 includes folders we don't care about. So we have a named list of starting points from level 1. I'd like to recursively search them in parallel.
When I observe the code below (via logging/print statements I have hidden) it does not seem to search under the starting points in parallel. It instead searches the entire tree under starting point 1, then the tree under starting point 2, etc.
My question is: why does the code below not execute the spider method concurrently for each item in for storage_dict, starting_point in zip(cache_dict_list, starting_dir_list)?
import asyncio
@asyncio.coroutine
def spider(storage_dict, dir_list):
    """Recursive storage of Box information in storage_dict."""
    storage_dict = {"key": "value"}
cache_dict_list = [dict() for x in starting_dir_list]
task_list = list()
async def main():
    for storage_dict, starting_point in zip(cache_dict_list, starting_dir_list):
        task_list.append(asyncio.create_task(spider(storage_dict, [starting_point])))
    await asyncio.gather(*task_list)
asyncio.run(main())
total_dict = dict()
total_dict.update([cache_dict.update(x) for x in cache_dict_list])
The reason is basically that async isn't multithreading (more on threading later). Async queues up tasks which are executed one at a time by the event loop, and a task only gives up control when it reaches an await. So when you await asyncio.gather(*task_list) you are basically saying "put all these tasks in the queue(ish) and wait until they are done." If spider() never awaits anything (as in the code shown), each task runs to completion before the next one starts. If you used more async and await statements within spider() you could split it up more in the queue, but unless those awaits wait on non-blocking I/O it would still take about as long, since only one item in the queue is processed at a time.
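The effect can be reproduced with a small self-contained sketch. Both workers below are hypothetical stand-ins for spider(); the only difference is that the second one awaits inside its loop, which hands control back to the event loop.

```python
import asyncio

async def walk_without_await(name, log):
    for step in range(2):
        log.append((name, step))  # never yields to the event loop

async def walk_with_await(name, log):
    for step in range(2):
        log.append((name, step))
        await asyncio.sleep(0)  # yield so other tasks can run

async def run_two(worker):
    log = []
    await asyncio.gather(worker("a", log), worker("b", log))
    return log

sequential = asyncio.run(run_two(walk_without_await))
interleaved = asyncio.run(run_two(walk_with_await))
print(sequential)
print(interleaved)
```

In the first log all of a's steps come before b's; in the second the two workers alternate.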
Then, we have threading. This (kinda) allows for concurrency. However, it isn't much better if you are CPU-bound, because CPython uses a global interpreter lock (GIL). The GIL means that, basically, only one thread in a Python process can execute Python bytecode at a time, which avoids issues that can happen when multiple threads try to access and modify data at the same time. Threads can still overlap time spent waiting on I/O such as network requests, because the GIL is released while a thread waits.
However, if you want true parallelism, you can use the multiprocessing module. How you implement this probably depends on exactly how you want to get and store your data (in order to avoid the issues with multiple cores that are the reason for the GIL), but basically it will allow you to use multiple cores at the same time.

Best way to download files simultaneously with Python?

I'm trying to send simultaneous get requests with the Python requests module.
While searching for a solution I've come across lots of different approaches, including grequests, gevent.monkey, requests futures, threading, multi-processing...
I'm a little overwhelmed and not sure which one to pick with regard to speed and code readability.
The task is to download < 400 files as fast as possible, all from the same server. Ideally it should output the status of the downloads in the terminal, e.g. print an error or success message per request.
import threading
import requests
def download(webpage):
    requests.get(webpage)
    # Whatever else you need to do to download your resource, put it in here
urls = ['https://www.example.com', 'https://www.google.com', 'https://yahoo.com']  # Populate with resources you wish to download
threads = {}
if __name__ == '__main__':
    for i in urls:
        print(i)
        threads[i] = threading.Thread(target=download, args=(i,))
    for i in threads:
        threads[i].start()
    for i in threads:
        threads[i].join()
    print('successfully done.')
The above code contains a function called download that represents whatever code you have to run to download the resource you're looking for. Then a list is populated with the urls you wish to download; change these values as you please. From this list, a dictionary is assembled that contains one thread per url, so you can have as many urls in the list as you want and a separate thread is made for each of them. The threads are each started, then joined.
I would use threading, as it is not necessary to run the downloads on multiple cores like multiprocessing does.
So write a function that calls requests.get() and then start it as a thread.
But remember that your internet connection has to be fast enough, otherwise it won't be worth it.
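Since the question also asks for a success or error message per request, here is one way this could look with a bounded thread pool rather than one thread per URL. It is a sketch under assumptions: download_all, fetch and max_workers are illustrative names, and fake_fetch stands in for a real call such as requests.get(url) followed by raise_for_status(), so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_fetch(url):
    # Stand-in for a real download; fails for URLs containing "missing".
    if "missing" in url:
        raise IOError("404 Not Found")
    return b"file contents"

def download_all(urls, fetch=fake_fetch, max_workers=16):
    statuses = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        # as_completed() yields each future as soon as it finishes.
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                future.result()
                statuses[url] = "ok"
            except Exception as exc:
                statuses[url] = f"error: {exc}"
            print(url, statuses[url])
    return statuses

statuses = download_all([
    "https://www.example.com/a",
    "https://www.example.com/missing",
])
```

Capping max_workers also keeps the number of simultaneous connections to the single server under control.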

speeding up urllib.urlretrieve

I am downloading pictures from the internet, and as it turns out, I need to download lots of pictures. I am using a version of the following code fragment (actually looping through the links I intend to download and downloading the pictures):
import urllib
urllib.urlretrieve(link, filename)
I am downloading roughly 1000 pictures every 15 minutes, which is awfully slow based on the number of pictures I need to download.
For efficiency, I set a timeout every 5 seconds (still many downloads last much longer):
import socket
socket.setdefaulttimeout(5)
Besides running a job on a computer cluster to parallelize downloads, is there a way to make the picture download faster / more efficient?
My code above was very naive, as I did not take advantage of multi-threading. It obviously takes time for url requests to be answered, but there is no reason why the computer cannot make further requests while the proxy server responds.
With the following adjustments, you can improve efficiency by 10x, and there are further ways to improve efficiency, with packages such as scrapy.
To add multi-threading, do something like the following, using the thread-backed pool from the multiprocessing package (multiprocessing.dummy):
1) encapsulate the url retrieving in a function:
import urllib.request
def geturl(link, i):
    try:
        urllib.request.urlretrieve(link, str(i) + ".jpg")
    except Exception:
        pass  # skip downloads that fail
2) then create a collection with all urls as well as names you want for the downloaded pictures:
urls = [url1,url2,url3,urln]
names = [i for i in range(0,len(urls))]
3)Import the Pool class from the multiprocessing package and create an object using such class (obviously you would include all imports in the first line of your code in a real program):
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(100)
then use the pool.starmap() method and pass the function and the arguments of the function.
results = pool.starmap(geturl, zip(urls, names))
note: pool.starmap() works only in Python 3
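Put together, the steps above might look like the following sketch. To keep it runnable without network access, save_picture is a hypothetical stand-in for geturl that returns the filename it would write instead of downloading anything, and the pool size of 4 is arbitrary.

```python
from multiprocessing.dummy import Pool as ThreadPool  # thread-based pool

def save_picture(link, i):
    # Stand-in for urllib.request.urlretrieve(link, str(i) + ".jpg")
    return str(i) + ".jpg"

urls = ["http://example.org/a", "http://example.org/b", "http://example.org/c"]
names = list(range(len(urls)))

# starmap() unpacks each (link, i) pair into the function's arguments
# and returns the results in input order.
with ThreadPool(4) as pool:
    results = pool.starmap(save_picture, zip(urls, names))

print(results)
```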
When a program enters I/O wait, the execution is paused so that the kernel can perform the low-level operations associated with the I/O request (this is called a context switch) and is not resumed until the I/O operation is completed.
Context switching is quite a heavy operation. It requires us to save the state of our program (losing any sort of caching we had at the CPU level) and give up the use of the CPU. Later, when we are allowed to run again, we must spend time reinitializing our program on the motherboard and getting ready to resume (of course, all this happens behind the scenes).
With concurrency, on the other hand, we typically have a thing called an “event loop” running that manages what gets to run in our program, and when. In essence, an event loop is simply a list of functions that need to be run. The function at the top of the list gets run, then the next, etc.
The following shows a simple example of an event loop:
from Queue import Queue
from functools import partial
eventloop = None
class EventLoop(Queue):
    def start(self):
        while True:
            function = self.get()
            function()
def do_hello():
    global eventloop
    print "Hello"
    eventloop.put(do_world)
def do_world():
    global eventloop
    print "world"
    eventloop.put(do_hello)
if __name__ == "__main__":
    eventloop = EventLoop()
    eventloop.put(do_hello)
    eventloop.start()
If the above seems like something you may use, and you'd also like to see how gevent, tornado, and AsyncIO, can help with your issue, then head out to your (University) library, check out High Performance Python by Micha Gorelick and Ian Ozsvald, and read pp. 181-202.
Note: above code and text are from the book mentioned.
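For readers on Python 3, a variant of the same idea could look like the sketch below. It is not from the book: the module is named queue rather than Queue, and a max_steps argument (an addition for illustration) stops the loop so the example terminates. The words are also collected in a list named output, another illustrative addition, instead of being printed one by one.

```python
from queue import Queue

output = []

class EventLoop(Queue):
    def start(self, max_steps):
        # Run at most max_steps queued functions, in FIFO order.
        for _ in range(max_steps):
            if self.empty():
                break
            function = self.get()
            function()

eventloop = EventLoop()

def do_hello():
    output.append("Hello")
    eventloop.put(do_world)  # schedule the next function

def do_world():
    output.append("world")
    eventloop.put(do_hello)

eventloop.put(do_hello)
eventloop.start(max_steps=4)
print(output)
```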

long running job in flask

I have created a module that does some heavy computations and returns some data to be stored in a NoSQL database. The computation process is started via a POST request in my Flask application. The Flask function will execute the computation code, and then the returned results will be stored in the db. I was thinking of Celery. But I am wondering, and haven't found any clear info on this, whether it would be possible to use Python threading, e.g.
from mysci_module import heavy_compute
@route('/initiate_task/', methods=['POST',])
def run_computation():
    import thread
    thread.start_new_thread(heavy_compute, post_data)
    return response
It's very abstract, I know. The only problem I see in this method is that my function will have to know about, and be responsible for, storing data in the database, so it is not very independent of the database used. Correct? Why is Celery better (is it really?) than the method above?
Since CPython is restricted from true parallelism using threads by the GIL, all the computations will in fact happen serially. Instead you could use the Python multiprocessing module and create a pool of processes to complete your heavy computation task.
There are a few microframeworks such as twisted klein apart from celery that can also help achieve that concurrency and independence that you're looking for. They aren't necessarily better, but are available for those who don't want to get their hands messy with various issues that are likely to come up when one gets into synchronizing flask and the actual business logic, especially when response is based on that activity.
I would suggest the following method: start a thread for the long procedure first, then let Flask communicate with the procedure from time to time as your requirements dictate:
from mysci_module import heavy_compute
import thread
thread.start_new_thread(heavy_compute, post_data)
@route('/initiate_task/', methods=['POST',])
def check_computation():
    response = heavy_compute.status
    return response
The best part of this method is that you have a thread running in the background all the time, while it is still possible to get the necessary result, and even to pass some parameters to the task.
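In Python 3 the thread module was renamed to _thread, and threading.Thread is the usual way to start background work. The sketch below shows the background-job idea without Flask itself: start_job and job_status are hypothetical helpers that two routes (one to start a task, one to poll it) could call, and heavy_compute is a stand-in for the real computation.

```python
import threading
import uuid

jobs = {}                      # job_id -> {"status": ..., "result": ...}
jobs_lock = threading.Lock()   # protects access to jobs

def heavy_compute(n):
    # Stand-in for the real computation.
    return sum(i * i for i in range(n))

def _run(job_id, n):
    result = heavy_compute(n)
    with jobs_lock:
        jobs[job_id] = {"status": "done", "result": result}

def start_job(n):
    job_id = uuid.uuid4().hex
    with jobs_lock:
        jobs[job_id] = {"status": "running", "result": None}
    thread = threading.Thread(target=_run, args=(job_id, n), daemon=True)
    thread.start()
    return job_id, thread

def job_status(job_id):
    with jobs_lock:
        return dict(jobs[job_id])  # return a copy

job_id, thread = start_job(1000)
thread.join()  # a web handler would return immediately instead of joining
print(job_status(job_id))
```

An in-memory dict like this is only visible inside one process, so it would not be shared if the application runs with several worker processes.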
