My employer uses Box. Its API is very slow. Fortunately, our files are largely static. Nightly, I can iterate (recursively) over Box folders and store the URLs in a local file. Using the local file during the day substantially improves the performance of our scripts that read from and write to Box.
Starting the recursive search (spider) at level 0 includes folders we don't care about. So we have a named list of starting points from level 1. I'd like to recursively search them in parallel.
When I observe the code below (via logging/print statements I have hidden), it does not seem to search under the starting points in parallel. It instead searches the entire tree under starting point 1, then the tree under starting point 2, etc.
My question is: why does the code below not execute the spider method concurrently for each item in for storage_dict, starting_point in zip(cache_dict_list, starting_dir_list)?
import asyncio
@asyncio.coroutine  # generator-based coroutine decorator; removed in Python 3.11
def spider(storage_dict, dir_list):
    """Recursive storage of Box information in storage_dict."""
    storage_dict["key"] = "value"  # mutate in place so the caller sees the entry
cache_dict_list = [dict() for x in starting_dir_list]
task_list = list()
async def main():
    for storage_dict, starting_point in zip(cache_dict_list, starting_dir_list):
        task_list.append(asyncio.create_task(spider(storage_dict, [starting_point])))
    await asyncio.gather(*task_list)
asyncio.run(main())
total_dict = dict()
# Merge the per-starting-point caches into one dictionary
for cache_dict in cache_dict_list:
    total_dict.update(cache_dict)
The reason is that async isn't multithreading (more on threading later). Async queues up tasks, and the event loop executes them one at a time, switching between tasks only at an await. So when you await asyncio.gather(*task_list), you are saying "put all these tasks in the queue(ish) and wait until they are done." Because spider() never awaits anything, each task runs to completion before the next one starts. If you used more async and await statements within spider(), you could split it up more in the queue, but as long as the Box calls themselves block, it would still take about as long, since only one item in the queue is processed at a time.
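To make the scheduling behaviour concrete, here is a minimal sketch. The spider() below is a hypothetical stand-in that only appends to a log, not the real Box code. Without an await inside it, the first task finishes before the second starts; with an await, the two interleave.

```python
import asyncio

async def spider(name, log, cooperative):
    """Hypothetical stand-in for the real spider: three steps of work."""
    for step in range(3):
        log.append(f"{name}{step}")
        if cooperative:
            # An await is the only place the event loop can switch tasks.
            await asyncio.sleep(0)

async def crawl(cooperative):
    log = []
    await asyncio.gather(
        spider("A", log, cooperative),
        spider("B", log, cooperative),
    )
    return log

blocking_order = asyncio.run(crawl(cooperative=False))
cooperative_order = asyncio.run(crawl(cooperative=True))
print(blocking_order)     # ['A0', 'A1', 'A2', 'B0', 'B1', 'B2']
print(cooperative_order)  # ['A0', 'B0', 'A1', 'B1', 'A2', 'B2']
```

The first ordering matches what the question describes: the whole tree under starting point 1, then the whole tree under starting point 2.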
Then we have threading. This (kinda) allows for concurrency. However, it isn't much better if you are resource-capped (CPU-bound), because CPython uses a global interpreter lock (GIL). The GIL means that a single Python process can only execute Python code on one core at a time, which avoids issues that can happen when multiple cores try to access and modify data at the same time. Threads do still help when the time goes to waiting on the network, because the GIL is released during blocking I/O.
However, if you want true parallelism, you can use the multiprocessing module. How you implement this probably depends on exactly how you want to get and store your data (in order to avoid the issues with multiple cores that are the reason for the GIL), but it will allow you to use multiple cores at the same time.
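Since the Box calls spend most of their time waiting on the network, threads may already be enough before reaching for multiprocessing. The following is a sketch under assumptions, not the asker's real code: spider() is a hypothetical blocking function, the folder names and URLs are made up, and asyncio.to_thread requires Python 3.9 or newer.

```python
import asyncio
import time

def spider(storage_dict, dir_list):
    """Hypothetical blocking spider: pretends to query Box for each folder."""
    for folder in dir_list:
        time.sleep(0.05)  # stands in for a slow, blocking API call
        storage_dict[folder] = f"https://example.invalid/{folder}"

starting_dir_list = ["reports", "invoices", "contracts"]
cache_dict_list = [dict() for _ in starting_dir_list]

async def main():
    # Each blocking spider runs in its own worker thread, so the waits overlap.
    await asyncio.gather(*(
        asyncio.to_thread(spider, storage_dict, [starting_point])
        for storage_dict, starting_point in zip(cache_dict_list, starting_dir_list)
    ))

asyncio.run(main())

# Merge the per-starting-point caches once all spiders have finished.
total_dict = {}
for cache_dict in cache_dict_list:
    total_dict.update(cache_dict)
```

Each spider writes only to its own dictionary, so no locking is needed until the merge at the end.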
I'm running Python code in a SageMaker Processing job, specifically SKLearnProcessor. The code runs a for-loop 200 times (each iteration is independent), and each iteration takes 20 minutes.
For example, script.py:
for i in list:
    run_function(i)
I'm kicking off the job from a notebook:
import os
from sagemaker import Session
from sagemaker.processing import ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
sklearn_processor = SKLearnProcessor(
    framework_version="1.0-1", role=role,
    instance_type="ml.m5.4xlarge", instance_count=1,
    sagemaker_session=Session()
)
out_path = 's3://' + os.path.join(bucket, prefix, 'outpath')
sklearn_processor.run(
    code="script.py",
    outputs=[
        ProcessingOutput(output_name="load_training_data",
                         source='/opt/ml/processing/output',
                         destination=out_path),
    ],
    arguments=["--some-args", "args"]
)
I want to parallelize this code and make the SageMaker Processing job use its full capacity to run as many concurrent jobs as possible.
How can I do that?
There are basically 3 paths you can take, depending on the context.
Parallelising function execution
This solution has nothing to do with SageMaker. It is applicable to any python script, regardless of the ecosystem, as long as you have the necessary resources to parallelise a task.
Based on the needs of your software, you have to work out whether to parallelise with multiple threads or multiple processes. This question may clarify some doubts in this regard: Multiprocessing vs. Threading Python
Here is a simple example on how to parallelise:
from multiprocessing import Pool
import os
POOL_SIZE = os.cpu_count()
your_list = [...]
def run_function(i):
    # ...
    return your_result
if __name__ == '__main__':
    with Pool(POOL_SIZE) as pool:
        print(pool.map(run_function, your_list))
Splitting input data into multiple instances
This solution is dependent on the quantity and size of the data. If they are completely independent of each other and have a considerable size, it may make sense to split the data over several instances. This way, execution will be faster and there may also be a reduction in costs based on the instances chosen over the initial larger instance.
In your case, the parameter to set is instance_count. As the documentation says:
instance_count (int or PipelineVariable) - The number of instances to
run the Processing job with. Defaults to 1.
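This should be combined with splitting the input via ProcessingInput (its s3_data_distribution_type parameter set to ShardedByS3Key), so that each instance receives a different subset of the S3 objects.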
P.S.: This approach makes sense to use if the data can be retrieved before the script is executed. If the data is generated internally, the generation logic must be changed so that it is multi-instance.
Combined approach
One can undoubtedly combine the two previous approaches, i.e. create a script that parallelises the execution of a function on a list and have several parallel instances.
An example of use could be to process a number of csvs. If there are 100 csvs, we may decide to instantiate 5 instances so as to pass 20 files per instance. And in each instance decide to parallelise the reading and/or processing of the csvs and/or rows in the relevant functions.
With such an approach, monitor carefully whether you are really improving the system rather than wasting resources.
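When the list of work items is generated inside the script rather than delivered through a sharded input, each instance can pick its own slice. The sketch below assumes the Processing container exposes /opt/ml/config/resourceconfig.json with current_host and hosts keys; the helper names and file names are hypothetical, and the fallback covers local runs where the file is absent.

```python
import json
from pathlib import Path

# Assumed location of the cluster description inside a Processing container.
RESOURCE_CONFIG = Path("/opt/ml/config/resourceconfig.json")

def load_hosts(config_path=RESOURCE_CONFIG):
    """Return (current_host, sorted list of all hosts)."""
    if config_path.exists():
        config = json.loads(config_path.read_text())
        return config["current_host"], sorted(config["hosts"])
    return "algo-1", ["algo-1"]  # single-host fallback for local runs

def shard_for_host(items, current_host, hosts):
    """Give each host every len(hosts)-th item, starting at its own index."""
    index = hosts.index(current_host)
    return items[index::len(hosts)]

all_files = [f"file_{i}.csv" for i in range(100)]  # hypothetical work list
current_host, hosts = load_hosts()
my_files = shard_for_host(all_files, current_host, hosts)
```

With 5 instances and 100 files, each instance ends up with 20 files, which it can then hand to a local Pool as in the first approach.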
I have a library that calls smart contracts on the Ethereum chain to read data.
So for simplicity, my code is like this:
import library
items = [
    "address1",
    "address2",
    "address3",
]
def main():
    for item in items:
        data = library.get_smartcontractinfo(item)
        print(data)
if __name__ == '__main__':
    main()
I am new to concurrency, and this is a topic I need to explore further. There are many options for concurrency, but asyncio seems to be the one most people go for.
The library I am using is not built with asyncio or any sort of concurrency in mind. This means that each time I call library.get_smartcontractinfo(), I need to wait until it completes the query before the next iteration can start, which limits the speed.
Let's say that I cannot modify the library, although maybe I will in the future, but I want to get something done ASAP with the existing code.
What would be the easiest way to run simultaneous queries so I can get the info as fast as possible in an efficient way?
What about being rate limited? And would it be possible to group these calls into one without rewriting the library code?
Thank you.
Assuming that library.get_smartcontractinfo() does a lot of network I/O, you could use a ThreadPoolExecutor from concurrent.futures to run more of them in parallel.
The documentation has a good example.
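A minimal sketch of that approach follows. get_smartcontractinfo() here is a hypothetical stand-in for library.get_smartcontractinfo(), and the max_workers value is an arbitrary assumption to tune.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def get_smartcontractinfo(address):
    """Hypothetical stand-in for library.get_smartcontractinfo()."""
    time.sleep(0.05)  # simulate waiting on the network
    return {"address": address}

items = ["address1", "address2", "address3"]

def fetch_all(addresses, max_workers=8):
    # max_workers caps the number of simultaneous calls, which is a crude
    # way to avoid hammering a rate-limited endpoint.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the inputs.
        return list(executor.map(get_smartcontractinfo, addresses))

results = fetch_all(items)
```

Because the library function is called unchanged, this needs no modification to the library itself.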
Assuming the function library.get_smartcontractinfo() is I/O bound, you have multiple options with asyncio. If you want to use pure asyncio, you can go with something like:
import asyncio
async def main():
    loop = asyncio.get_running_loop()
    # run_in_executor(None, ...) uses the loop's default thread pool
    all_runs = [loop.run_in_executor(None, library.get_smartcontractinfo, item) for item in items]
    results = await asyncio.gather(*all_runs)
Basically, this runs the sync function in a thread. To run the calls concurrently, you first create all the futures without awaiting them, and finally pass them to gather.
If you want to use an additional library, I can recommend anyio, or asyncer, which is a nice wrapper around anyio. With asyncer, you only change the one line where you turn a sync function into an async one:
from asyncer import asyncify
...
all_runs = [asyncify(library.get_smartcontractinfo)(item) for item in items]
The rest stays the same.
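On the rate-limiting part of the question, one option is to cap how many calls are in flight at once with an asyncio.Semaphore. This is only a sketch: get_smartcontractinfo() is a hypothetical stand-in that also records the peak concurrency for demonstration, the limit of 2 is arbitrary, and asyncio.to_thread needs Python 3.9 or newer. Note that it bounds concurrency, not requests per second.

```python
import asyncio
import threading
import time

_lock = threading.Lock()
_active = 0
max_active = 0  # highest number of simultaneous calls observed

def get_smartcontractinfo(address):
    """Hypothetical stand-in for the blocking library call."""
    global _active, max_active
    with _lock:
        _active += 1
        max_active = max(max_active, _active)
    time.sleep(0.02)  # simulate network latency
    with _lock:
        _active -= 1
    return f"info for {address}"

async def fetch_limited(addresses, limit=2):
    semaphore = asyncio.Semaphore(limit)

    async def fetch_one(address):
        async with semaphore:  # at most `limit` calls in flight
            return await asyncio.to_thread(get_smartcontractinfo, address)

    return await asyncio.gather(*(fetch_one(a) for a in addresses))

results = asyncio.run(fetch_limited([f"address{i}" for i in range(6)], limit=2))
```

gather returns the results in input order, regardless of which call finished first.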
I am new to multiprocessing, and I would really appreciate it if someone could guide me here. I have the following for loop, which gets some data from two functions. The code looks like this:
for a in accounts:
    dl_users[a['Email']] = get_dl_users(a['Email'], adConn)
    group_users[a['Email']] = get_group_users(a['Id'], adConn)
print(f"Users part of DL - {dl_users}")
print(f"Users part of groups - {group_users}")
adConn.unbind()
This works fine and gets all the results, but recently I have noticed it takes a lot of time to get the lists of users, i.e. dl_users and group_users. It takes almost 14-15 minutes to complete. I am looking for ways to speed this up and would like to convert this for loop to multiprocessing. get_group_users and get_dl_users make calls to LDAP, so I am not 100% sure whether I should be converting this to multiprocessing or multithreading. Any suggestion would be of big help.
As mentioned in the comments, multithreading is appropriate for I/O operations (reading/writing from/to files, sending http requests, communicating with databases), while multiprocessing is appropriate for CPU-bound tasks (such as transforming data, making calculations...). Depending on which kind of operation your functions perform, you want one or the other. If they do a mix, separate them internally and profile which of the two really needs optimisation, since both multiprocessing and -threading introduce overhead that might not be worth adding.
That said, the way to apply multiprocessing or multithreading is pretty simple in recent Python versions (including your 3.8).
Multiprocessing
from multiprocessing import Pool
# Pick the amount of processes that works best for you
processes = 4
with Pool(processes) as pool:
    processed = pool.map(your_func, your_data)
Where your_func is a function to apply to each element of your_data, which is an iterable. If you need to provide some other parameters to the callable, use functools.partial rather than a lambda, because multiprocessing has to pickle the callable and lambdas cannot be pickled:
from functools import partial
processed = pool.map(partial(your_func, some_kwarg="some value"), your_data)
Multithreading
The API for multithreading is very similar:
from concurrent.futures import ThreadPoolExecutor
# Pick the amount of workers that works best for you.
# A reasonable starting point is the number of threads of your machine;
# I/O-bound work often benefits from more.
workers = 4
with ThreadPoolExecutor(workers) as pool:
    processed = pool.map(your_func, your_data)
If you only need some attribute of the items rather than the items themselves, you can pass a generator to avoid building another list in memory:
processed = pool.map(your_func, (account["Email"] for account in accounts))
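Applied to the loop in the question, the threaded version could look like the sketch below. The two lookup functions and the account data are hypothetical stand-ins, and sharing one LDAP connection across threads is an assumption that may not hold for every LDAP library; if it does not, open one connection per worker instead.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the LDAP lookups in the question.
def get_dl_users(email, conn):
    return [f"dl-member-of-{email}"]

def get_group_users(account_id, conn):
    return [f"group-member-of-{account_id}"]

accounts = [
    {"Email": "a@example.com", "Id": 1},
    {"Email": "b@example.com", "Id": 2},
]
adConn = None  # placeholder for the real LDAP connection

def lookup(account):
    """Run both lookups for one account and return them keyed by email."""
    email = account["Email"]
    return email, get_dl_users(email, adConn), get_group_users(account["Id"], adConn)

dl_users, group_users = {}, {}
with ThreadPoolExecutor(max_workers=8) as pool:
    # Only the main thread writes to the dictionaries, so no lock is needed.
    for email, dl, groups in pool.map(lookup, accounts):
        dl_users[email] = dl
        group_users[email] = groups
```

Returning the values from the worker and filling the dictionaries in the main thread keeps shared state out of the threads.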
I am downloading pictures from the internet, and as it turns out, I need to download lots of pictures. I am using a version of the following code fragment (actually looping through the links I intend to download and downloading the pictures):
import urllib.request
urllib.request.urlretrieve(link, filename)
I am downloading roughly 1000 pictures every 15 minutes, which is awfully slow based on the number of pictures I need to download.
For efficiency, I set a timeout of 5 seconds (still, many downloads last much longer):
import socket
socket.setdefaulttimeout(5)
Besides running a job on a computer cluster to parallelize downloads, is there a way to make the picture download faster / more efficient?
My code above was very naive, as I did not take advantage of multi-threading. It obviously takes time for URL requests to be answered, but there is no reason why the computer cannot make further requests while the proxy server responds.
With the following adjustments, you can improve efficiency by about 10x, and there are further ways of improving efficiency, with packages such as scrapy.
To add multi-threading, do something like the following, using the multiprocessing.dummy module, which exposes the multiprocessing API backed by threads:
1) Encapsulate the URL retrieval in a function:
import urllib.request
def geturl(link, i):
    try:
        urllib.request.urlretrieve(link, str(i) + ".jpg")
    except Exception:
        pass  # skip links that fail to download
2) Then create a collection with all URLs as well as the names you want for the downloaded pictures:
urls = [url1, url2, url3, urln]
names = list(range(len(urls)))
3) Import the Pool class from the multiprocessing.dummy module and create an object of that class (obviously, you would put all imports at the top of your code in a real program):
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(100)
Then use the pool.starmap() method, passing the function and the arguments of the function:
results = pool.starmap(geturl, zip(urls, names))
note: pool.starmap() works only in Python 3
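A variation on the steps above is to return each outcome instead of silently passing, so that failed links can be retried later. This is a sketch: download() is a hypothetical stand-in for urllib.request.urlretrieve() so the example runs without network access, and the URLs are made up.

```python
from multiprocessing.dummy import Pool as ThreadPool

def download(link, name):
    """Hypothetical stand-in for urllib.request.urlretrieve(link, name)."""
    if not link.startswith("http"):
        raise ValueError(f"unsupported link: {link}")
    return name

def geturl(link, i):
    # Return (link, filename, error) so the caller can see what failed.
    try:
        return link, download(link, f"{i}.jpg"), None
    except Exception as exc:
        return link, None, exc

urls = [
    "http://example.com/a.jpg",
    "ftp://example.com/b.jpg",
    "http://example.com/c.jpg",
]
names = list(range(len(urls)))

with ThreadPool(4) as pool:
    # starmap unpacks each (link, name) pair into geturl's two arguments.
    results = pool.starmap(geturl, zip(urls, names))

failed = [link for link, filename, error in results if error is not None]
```

The list in failed can then be fed back into another starmap call for a retry pass.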
When a program enters I/O wait, the execution is paused so that the kernel can perform the low-level operations associated with the I/O request (this is called a context switch) and is not resumed until the I/O operation is completed.
Context switching is quite a heavy operation. It requires us to save the state of our program (losing any sort of caching we had at the CPU level) and give up the use of the CPU. Later, when we are allowed to run again, we must spend time reinitializing our program on the motherboard and getting ready to resume (of course, all this happens behind the scenes).
With concurrency, on the other hand, we typically have a thing called an “event loop” running that manages what gets to run in our program, and when. In essence, an event loop is simply a list of functions that need to be run. The function at the top of the list gets run, then the next, etc.
The following shows a simple example of an event loop:
# Adapted to Python 3 (queue module, print function)
from queue import Queue
from functools import partial
eventloop = None
class EventLoop(Queue):
    def start(self):
        while True:
            function = self.get()
            function()
def do_hello():
    global eventloop
    print("Hello")
    eventloop.put(do_world)
def do_world():
    global eventloop
    print("world")
    eventloop.put(do_hello)
if __name__ == "__main__":
    eventloop = EventLoop()
    eventloop.put(do_hello)
    eventloop.start()
If the above seems like something you may use, and you'd also like to see how gevent, tornado, and asyncio can help with your issue, then head out to your (university) library, check out High Performance Python by Micha Gorelick and Ian Ozsvald, and read pp. 181-202.
Note: above code and text are from the book mentioned.
I've been working off of Google Cloud Platform's Python API library. I've had much success with these API samples out-of-the-box, but I'd like to streamline it a bit further by combining the three queries I need to run (and subsequent tables that will be created) into a single file. Although the documentation mentions being able to run multiple jobs asynchronously, I've been having trouble figuring out the best way to accomplish that.
Thanks in advance!
The idea of running multiple jobs asynchronously is to create/prepare as many jobs as you need and kick them off using the jobs.insert API (importantly, you should either collect all the respective job IDs or set your own; they just need to be unique). That API returns immediately, so you can kick them all off "very quickly" in one loop.
Meanwhile, you need to check the status of those jobs repeatedly (in a loop), and as soon as a job is done you can kick off processing of its result as needed.
You can check for details in Running asynchronous queries
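The kick-off-then-poll pattern described above can be sketched as follows. FakeJob is a hypothetical stand-in so the example runs without credentials; in the Python client, client.query() returns a job object with done() and result() methods of the same names. The polling interval is an arbitrary assumption.

```python
import time

class FakeJob:
    """Hypothetical stand-in for a BigQuery job; finishes after N polls."""
    def __init__(self, job_id, polls_needed):
        self.job_id = job_id
        self._polls_left = polls_needed

    def done(self):
        self._polls_left -= 1
        return self._polls_left <= 0

    def result(self):
        return f"rows for {self.job_id}"

def wait_for_jobs(jobs, poll_interval=0.01):
    """Poll every pending job and collect each result as soon as it is done."""
    pending = {job.job_id: job for job in jobs}
    results = {}
    while pending:
        for job_id, job in list(pending.items()):
            if job.done():
                results[job_id] = job.result()
                del pending[job_id]
        if pending:
            time.sleep(poll_interval)
    return results

results = wait_for_jobs([FakeJob("job_a", 3), FakeJob("job_b", 1)])
```

Results are collected in completion order, so a fast job does not wait behind a slow one.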
BigQuery jobs are always async by default; this being said, requesting the result of the operation isn't. As of Q4 2021, the Python API does not support a proper async way to collect results. Each call to job.result() blocks the thread, hence making it impossible to use with a single threaded event loop like asyncio. Thus, the best way to collect multiple job results is by using multithreading:
from typing import Dict
from concurrent.futures import ThreadPoolExecutor
from google.cloud import bigquery
client: bigquery.Client = bigquery.Client()
def run(name, statement):
    return name, client.query(statement).result()  # blocks the thread
def run_all(statements: Dict[str, str]):
    with ThreadPoolExecutor() as executor:
        jobs = []
        for name, statement in statements.items():
            jobs.append(executor.submit(run, name, statement))
        result = dict([job.result() for job in jobs])
    return result
P.S.: Some credits are due to @Fredrik Håård for this answer :)