Concurrency questions with non-concurrent code - python

I have a library that makes calls to smart contracts on the Ethereum chain to read data.
Simplified, my code looks like this:
import library

def main():
    items = [
        "address1",
        "address2",
        "address3",
    ]
    for item in items:
        data = library.get_smartcontractinfo(item)
        print(data)

if __name__ == '__main__':
    main()
I am new to concurrency and it is a topic I need to explore further; there are many options for concurrency, but asyncio seems to be the one most people go for.
The library I am using is not built with asyncio or any sort of concurrency in mind. This means that each time I call library.get_smartcontractinfo() I have to wait until the query completes before the next iteration can start, which limits the speed.
Let's say that I cannot modify the library (although maybe I will in the future), but I want to get something done ASAP with the existing code.
What would be the easiest way to run these queries simultaneously so I can get the info as fast and as efficiently as possible?
What about being rate limited? And would it be possible to group these calls into one without rewriting the library code?
Thank you.

Assuming that library.get_smartcontractinfo() does a lot of network I/O, you could use a ThreadPoolExecutor from concurrent.futures to run several calls in parallel.
The documentation has a good example.
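For illustration, a minimal sketch of that approach, reusing the item list and function from the question; the worker count is just an assumption you should tune against any rate limits:
import concurrent.futures
import library

items = ["address1", "address2", "address3"]

# A small pool; increase carefully if the API rate-limits you.
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    # map() runs the calls concurrently and yields results in the order of `items`.
    for data in executor.map(library.get_smartcontractinfo, items):
        print(data)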

Assuming the function library.get_smartcontractinfo() is I/O bound, you have multiple options with asyncio. If you want to use pure asyncio you can go with something like
import asyncio

async def main():
    loop = asyncio.get_running_loop()
    all_runs = [loop.run_in_executor(None, library.get_smartcontractinfo, item) for item in items]
    results = await asyncio.gather(*all_runs)
    return results
Basically, this runs the sync function in a thread pool. To run the calls concurrently, you first create all the awaitables without awaiting them, and finally pass them into gather.
If you want to use an additional library, I can recommend anyio or asyncer, which is basically a nice wrapper around anyio. With asyncer, you only have to change the one line that turns the sync function into an async one:
from asyncer import asyncify
...
all_runs = [asyncify(library.get_smartcontractinfo)(item) for item in items]
The rest stays the same.
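Put together, a runnable sketch of the asyncer variant might look like this; library, items and the function name come from the question, the rest is just glue:
import asyncio
from asyncer import asyncify

import library

items = ["address1", "address2", "address3"]

async def main():
    # asyncify() wraps the blocking call so it runs in a worker thread.
    all_runs = [asyncify(library.get_smartcontractinfo)(item) for item in items]
    results = await asyncio.gather(*all_runs)
    for data in results:
        print(data)

asyncio.run(main())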

Related

Python concurrency isn't concurrent

My employer uses Box. Its API is very slow. Fortunately our files are largely static. Nightly I can iterate (recursively) over Box folders and store the URLs in a local file. Using the local file during the day substantially improves the performance of our scripts that read from and write to Box.
Starting the recursive search (spider) at level 0 includes folders we don't care about. So we have a named list of starting points from level 1. I'd like to recursively search them in parallel.
When I observe the code below (via logging/print statements I have hidden) it does not seem to search under the starting points in parallel. It instead searches the entire tree under starting point 1, then the tree under starting point 2, etc.
My question is: why does the code below not execute the spider method concurrently for each item in zip(cache_dict_list, starting_dir_list)?
import asyncio

@asyncio.coroutine
def spider(storage_dict, dir_list):
    """Recursive storage of Box information in storage_dict."""
    storage_dict = {"key": "value"}

cache_dict_list = [dict() for x in starting_dir_list]
task_list = list()

async def main():
    for storage_dict, starting_point in zip(cache_dict_list, starting_dir_list):
        task_list.append(asyncio.create_task(spider(storage_dict, [starting_point])))
    await asyncio.gather(*task_list)

asyncio.run(main())

total_dict = dict()
total_dict.update([cache_dict.update(x) for x in cache_dict_list])
The reason is basically that async isn't multithreading (more on threading later). Async basically queues up tasks which are executed by the event loop. So when you await asyncio.gather(*task_list) you are basically saying "put all these tasks in the queue(ish) and wait until they are done." If you used more async and await statements within spider() you could split it up more in the queue, but ultimately it would still take about as long since only one item in the queue will be processed at a time.
Then, we have threading. This (kinda) allows for concurrency. However, it isn't much better if you are resource-capped, because cpython uses a global interpreter lock (GIL). The GIL means that basically, a single python process can only utilize one core at a time, which avoids issues that can happen when multiple cores try to access and modify data at the same time.
However, if you want true concurrency, you can use the multiprocessing module. How you implement this probably depends on exactly how you want to get and store your data (in order to avoid the issues with multiple cores that are the reason for the GIL), but basically it will allow you to use multiple cores concurrently.
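A minimal sketch of the multiprocessing route, assuming spider() can be turned into an ordinary synchronous function and that starting_dir_list holds the starting points from the question; since processes do not share memory, each worker returns its own dict instead of writing into a shared one (the function and folder names here are hypothetical):
from multiprocessing import Pool

def spider_sync(starting_point):
    """Synchronous Box traversal; returns the dict it builds."""
    storage_dict = {}
    # ... recursive Box calls would go here ...
    return storage_dict

if __name__ == "__main__":
    starting_dir_list = ["folder_a", "folder_b"]  # hypothetical starting points
    with Pool(len(starting_dir_list)) as pool:
        cache_dict_list = pool.map(spider_sync, starting_dir_list)
    total_dict = {}
    for cache_dict in cache_dict_list:
        total_dict.update(cache_dict)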

Python 3.8 Convert for loop to multiprocessing/multithreading

I am new to multiprocessing, so I would really appreciate it if someone could guide/help me here. I have the following for loop which gets some data from two functions. The code looks like this:
for a in accounts:
    dl_users[a['Email']] = get_dl_users(a['Email'], adConn)
    group_users[a['Email']] = get_group_users(a['Id'], adConn)

print(f"Users part of DL - {dl_users}")
print(f"Users part of groups - {group_users}")
adConn.unbind()
This works fine and gets all the results, but recently I have noticed it takes a lot of time to build the lists of users, i.e. dl_users and group_users. It takes almost 14-15 minutes to complete. I am looking for ways to speed this up and would like to convert this for loop to multiprocessing. get_group_users and get_dl_users make calls to LDAP, so I am not 100% sure if I should be converting this to multiprocessing or multithreading. Any suggestion would be a big help.
As mentioned in the comments, multithreading is appropriate for I/O operations (reading/writing from/to files, sending HTTP requests, communicating with databases), while multiprocessing is appropriate for CPU-bound tasks (such as transforming data, making calculations...). Depending on which kind of operation your functions perform, you want one or the other. If they do a mix, separate them internally and profile which of the two really needs optimisation, since both multiprocessing and multithreading introduce overhead that might not be worth adding.
That said, the way to apply multiprocessing or multithreading is pretty simple in recent Python versions (including your 3.8).
Multiprocessing
from multiprocessing import Pool
# Pick the amount of processes that works best for you
processes = 4
with Pool(processes) as pool:
    processed = pool.map(your_func, your_data)
Where your_func is a function to apply to each element of your_data, which is an iterable. If you need to provide some other parameters to the callable, use functools.partial rather than a lambda (multiprocessing has to pickle the callable, and lambdas cannot be pickled):
from functools import partial
processed = pool.map(partial(your_func, some_kwarg="some value"), your_data)
Multithreading
The API for multithreading is very similar:
from concurrent.futures import ThreadPoolExecutor
# Pick the amount of workers that works best for you.
# Most likely equal to the amount of threads of your machine.
workers = 4
with ThreadPoolExecutor(workers) as pool:
    processed = pool.map(your_func, your_data)
If you need some attribute of the items rather than the items themselves, and you want to avoid materialising that in memory, you can pass a generator:
processed = pool.map(your_func, (account["Email"] for account in accounts))
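Applied to the loop from the question, a rough sketch could look like the following. The fetch_users helper is an assumption, and so is sharing adConn across threads; whether the LDAP connection object is thread-safe depends on the library you use, so you may need one connection per worker.
from concurrent.futures import ThreadPoolExecutor

def fetch_users(a):
    # Return everything keyed by email so the results can be merged afterwards.
    return a['Email'], get_dl_users(a['Email'], adConn), get_group_users(a['Id'], adConn)

dl_users, group_users = {}, {}
with ThreadPoolExecutor(max_workers=8) as pool:
    for email, dl, grp in pool.map(fetch_users, accounts):
        dl_users[email] = dl
        group_users[email] = grp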

Python: running multiple web requests in parallel

I'm new to Python and I have a basic question but I'm struggling to find an answer online, because a lot of the examples online seem to refer to deprecated APIs, so sorry if this has been asked before.
I'm looking for a way to execute multiple (similar) web requests in parallel, and retrieve the result in a list.
The synchronous version I have right now is something like:
from requests import get  # assuming the requests library

urls = ['http://example1.org', 'http://example2.org', '...']

def getResult(urls):
    result = []
    for url in urls:
        result.append(get(url).json())
    return result
I'm looking for the asynchronous equivalent (where all the requests are made in parallel, but I then wait for all of them to be finished before returning the global result).
From what I saw I have to use async/await and aiohttp but the examples seemed way too complicated for the simple task I'm looking for.
Thanks
I am going to try to explain the simplest possible way to achieve what you want. I'm sure there are cleaner/better ways to do this, but here it goes.
You can do what you want using the Python threading library: create a separate thread for each request, run all the threads concurrently, and then collect the answers.
Since you are new to Python, to simplify things further I am using a global list called RESULTS to store the results of get(url) rather than returning them from the function.
import threading

RESULTS = []  # List to store the results

# Request a single URL result and store it in the global RESULTS
def getSingleResult(url):
    global RESULTS
    RESULTS.append((url, get(url).json()))

# Your original function
def getResult(urls):
    ths = []
    for url in urls:
        th = threading.Thread(target=getSingleResult, args=(url,))  # Create a thread
        th.start()  # Start it
        ths.append(th)  # Add it to a thread list
    for th in ths:
        th.join()  # Wait for all threads to finish
Using the global RESULTS list keeps things simple compared to collecting return values from the threads directly. If you wish to do the latter, you can check out this answer: How to get the return value from a thread in python?
Of course, one thing to note is that multithreading in Python doesn't provide true parallelism but rather concurrency, especially if you are using the standard Python implementation, due to what is known as the Global Interpreter Lock.
However, for your use case it would still provide the speed-up you need.
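If you would rather avoid the global list and get the results back in the same order as the input URLs, a sketch using concurrent.futures (which manages the threads for you) could look like this; it assumes get comes from the requests library, as in the question:
from concurrent.futures import ThreadPoolExecutor
from requests import get  # assumed, as in the question

def getResult(urls):
    with ThreadPoolExecutor() as pool:
        # map() runs the requests concurrently and yields results in input order.
        return list(pool.map(lambda url: get(url).json(), urls))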

speeding up urllib.urlretrieve

I am downloading pictures from the internet, and as it turns out, I need to download lots of pictures. I am using a version of the following code fragment (actually looping through the links I intend to download and downloading the pictures):
import urllib
urllib.urlretrieve(link, filename)
I am downloading roughly 1000 pictures every 15 minutes, which is awfully slow based on the number of pictures I need to download.
For efficiency, I set a timeout of 5 seconds (still, many downloads last much longer):
import socket
socket.setdefaulttimeout(5)
Besides running a job on a computer cluster to parallelize downloads, is there a way to make the picture download faster / more efficient?
My code above was very naive, as I did not take advantage of multi-threading. It obviously takes time for URL requests to be answered, but there is no reason why the computer cannot make further requests while waiting for responses.
With the following adjustments, you can improve efficiency by 10x, and there are further ways to improve it with packages such as scrapy.
To add multi-threading, do something like the following, using the multiprocessing package:
1) Encapsulate the URL retrieval in a function:
import urllib.request

def geturl(link, i):
    try:
        urllib.request.urlretrieve(link, str(i) + ".jpg")
    except Exception:
        pass
2) Then create a collection with all the URLs, as well as the names you want for the downloaded pictures:
urls = [url1,url2,url3,urln]
names = [i for i in range(0,len(urls))]
3) Import the Pool class from the multiprocessing package and create an object of that class (obviously you would put all imports at the top of your code in a real program):
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(100)
Then use the pool.starmap() method, passing the function and its arguments:
results = pool.starmap(geturl, zip(urls, names))
Note: pool.starmap() works only in Python 3.
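Assembled into one minimal script under the same assumptions (the links are placeholders, the pool size and file naming come from the steps above):
import urllib.request
from multiprocessing.dummy import Pool as ThreadPool

def geturl(link, i):
    try:
        urllib.request.urlretrieve(link, str(i) + ".jpg")
    except Exception:
        pass

urls = ["http://example.org/pic1.jpg", "http://example.org/pic2.jpg"]  # placeholder links
names = range(len(urls))

with ThreadPool(100) as pool:
    pool.starmap(geturl, zip(urls, names))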
When a program enters I/O wait, the execution is paused so that the kernel can perform the low-level operations associated with the I/O request (this is called a context switch) and is not resumed until the I/O operation is completed.
Context switching is quite a heavy operation. It requires us to save the state of our program (losing any sort of caching we had at the CPU level) and give up the use of the CPU. Later, when we are allowed to run again, we must spend time reinitializing our program on the motherboard and getting ready to resume (of course, all this happens behind the scenes).
With concurrency, on the other hand, we typically have a thing called an “event loop” running that manages what gets to run in our program, and when. In essence, an event loop is simply a list of functions that need to be run. The function at the top of the list gets run, then the next, etc.
The following shows a simple example of an event loop:
from queue import Queue

eventloop = None

class EventLoop(Queue):
    def start(self):
        while True:
            function = self.get()
            function()

def do_hello():
    global eventloop
    print("Hello")
    eventloop.put(do_world)

def do_world():
    global eventloop
    print("world")
    eventloop.put(do_hello)

if __name__ == "__main__":
    eventloop = EventLoop()
    eventloop.put(do_hello)
    eventloop.start()
If the above seems like something you may use, and you'd also like to see how gevent, tornado, and AsyncIO, can help with your issue, then head out to your (University) library, check out High Performance Python by Micha Gorelick and Ian Ozsvald, and read pp. 181-202.
Note: the code and text above are adapted from the book mentioned.

Run Multiple BigQuery Jobs via Python API

I've been working off of Google Cloud Platform's Python API library. I've had much success with these API samples out-of-the-box, but I'd like to streamline it a bit further by combining the three queries I need to run (and subsequent tables that will be created) into a single file. Although the documentation mentions being able to run multiple jobs asynchronously, I've been having trouble figuring out the best way to accomplish that.
Thanks in advance!
The idea of running multiple jobs asynchronously is to create/prepare as many jobs as you need and kick them off using the jobs.insert API (important: you should either collect all the respective job IDs or set your own, they just need to be unique). That API returns immediately, so you can kick them all off "very quickly" in one loop.
Meanwhile, you check the status of those jobs repeatedly (in a loop), and as soon as a job is done you can kick off processing of its result as needed.
You can check the details in Running asynchronous queries.
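For illustration, a minimal sketch of that kick-off-then-poll pattern using the google-cloud-bigquery client; the queries and the polling interval are placeholders, not part of the original answer:
import time
from google.cloud import bigquery

client = bigquery.Client()

statements = ["SELECT 1", "SELECT 2", "SELECT 3"]  # placeholder queries

# Kick off all jobs; client.query() returns immediately with a job handle.
jobs = [client.query(sql) for sql in statements]

# Poll until every job reports it is done, then fetch the results.
while not all(job.done() for job in jobs):
    time.sleep(1)

for job in jobs:
    for row in job.result():
        print(row)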
BigQuery jobs are always async by default; this being said, requesting the result of the operation isn't. As of Q4 2021, the Python API does not support a proper async way to collect results. Each call to job.result() blocks the thread, making it impossible to use with a single-threaded event loop like asyncio. Thus, the best way to collect multiple job results is by using multithreading:
from typing import Dict
from concurrent.futures import ThreadPoolExecutor
from google.cloud import bigquery

client: bigquery.Client = bigquery.Client()

def run(name, statement):
    return name, client.query(statement).result()  # blocks the thread

def run_all(statements: Dict[str, str]):
    with ThreadPoolExecutor() as executor:
        jobs = []
        for name, statement in statements.items():
            jobs.append(executor.submit(run, name, statement))
        result = dict([job.result() for job in jobs])
    return result
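A hypothetical usage example, with placeholder statements and table names:
results = run_all({
    "count_users": "SELECT COUNT(*) AS n FROM `my_dataset.users`",    # placeholder query
    "count_orders": "SELECT COUNT(*) AS n FROM `my_dataset.orders`",  # placeholder query
})
for name, rows in results.items():
    print(name, list(rows))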
P.S.: Some credits are due to @Fredrik Håård for this answer :)
