Difference between IOLoop.current().run_in_executor() and ThreadPoolExecutor().submit() - python

I'm quite new to Python Tornado and have been trying to start a new thread to run some IO blocking code whilst allowing the server to continue to handle new requests. I've been doing some reading but still can't seem to figure out what the difference is between these two functions?
For example calling a method like this:
from concurrent.futures import ThreadPoolExecutor
with ThreadPoolExecutor(1) as executor:
future = executor.submit(report.write_gresb_workbook)
print(future.result())
compared to:
from concurrent.futures import ThreadPoolExecutor
from tornado import ioloop
with ThreadPoolExecutor(1) as executor:
my_success = await ioloop.IOLoop.current().run_in_executor(executor, report.write_gresb_workbook)
print(my_success)
write_gresb_workbook takes some information from the object report and writes it to an excel spreadsheet (however I'm using openpyxl which takes ~20s to load an appropriately formatted workbook and another ~20s to save it which stops the server from handling new requests!)
The function simply returns True or False (which is what my_success is) as the report object has the path of the output file attached to it.
I haven't quite gotten either of these methods to work yet so they might be incorrect but was just looking for some background information.
Cheers!

IOLoop.run_in_executor and Executor.submit do essentially the same thing, but return different types. IOLoop.run_in_executor returns an asyncio.Future, while Executor.submit returns a concurrent.futures.Future.
The two Future types have nearly identical interfaces, with one important difference: Only asyncio.Future can be used with await in a coroutine. The purpose of run_in_executor is to provide this conversion.

Related

Concurrency questions with non-concurrent code

I have a library that does calls to smart contracts in the ethereum chain to read data
So for simplicity, my code is like this:
import library
items = [
"address1",
"address2",
"address3",
]
for item in items:
data = library.get_smartcontractinfo(item)
print(data)
if __name__ == '__main__':
main()
I am new to concurrency and this is a topic I need to explore further, as there are many options to do concurrency but seems asyncio is the one most people go for
The library I a musing is not built with asyncio or any sort of concurrency in mind. This means that each time I call the library.get_smartcontractinfo() function then I need to wait until it completes the query so it can do the next iteration, which is blocking the speed.
Lets say that I cannot modify the library, althought maybe I will in the future, but I wanto get something done asap with the existing code
What would be the easiest way to do simultaneous queries so I can get the info as fast as I can in an efficient way?
What about being rate limited? And would it be possible to group these calls into one without rewriting the library code?
Thank you.
Assuming that library.get_smartcontractinfo() does a lot of network I/O, you could use a ThreadPoolExecutor from concurrent.futures to run more of them in parallel.
The documentation has a good example.
Assuming the function library.get_smartcontractinfo() is a I/O bound, you have multiple options to go with asyncio. If you want to use pure asyncio you can go with something like
async def main():
loop = asyncio.get_running_loop()
all_runs = [loop.run_in_executor(None, library.get_smartcontractinfo, item) for item in items]
results = await asyncio.gather(*all_runs)
Bascially running the sync function in a thread. To run those concurrently, you first create all coroutines without awaiting them, and finally pass those into gather.
If you want to use some additional library, I can recommend using anyio or asyncer which basically is a nice wrapper around anyio. With `asyncer?, you basically can change the one line where you transfer a sync function into an async one to
from asyncer import asyncify
...
all_runs = [asyncify(library.get_smartcontractinfo)(item) for item in items]
the rest stays the same.

Python 3.8 Convert for loop to multiprocessing/multithreading

I am new to multiprocessing I would really appreciate it if someone can guide/help me here. I have the following for loop which gets some data from the two functions. The code looks like this
for a in accounts:
dl_users[a['Email']] = get_dl_users(a['Email'], adConn)
group_users[a['Email']] = get_group_users(a['Id'], adConn)
print(f"Users part of DL - {dl_users}")
print(f"Users part of groups - {group_users}")
adConn.unbind()
This works fine and gets all the results but recently I have noticed it takes a lot of time to get the list of users i.e. dl_users and group_users. It takes almost 14-15 mins to complete. I am looking for ways where I can speed up the function and would like to convert this for loop to multiprocessing. get_group_users and get_dl_users makes calls for LDAP, so I am not 100% sure if I should be converting this to multiprocessing or multithreading. Any suggestion would be of big help
As mentioned in the comments, multithreading is appropriate for I/O operations (reading/writing from/to files, sending http requests, communicating with databases), while multiprocessing is appropriate for CPU-bound tasks (such as transforming data, making calculations...). Depending on which kind of operation your functions perform, you want one or the other. If they do a mix, separate them internally and profile which of the two really needs optimisation, since both multiprocessing and -threading introduce overhead that might not be worth adding.
That said, the way to apply multiprocessing or multithreading is pretty simple in recent Python versions (including your 3.8).
Multiprocessing
from multiprocessing import Pool
# Pick the amount of processes that works best for you
processes = 4
with Pool(processes) as pool:
processed = pool.map(your_func, your_data)
Where your_func is a function to apply to each element of your_data, which is an iterable. If you need to provide some other parameters to the callable, you can use a lambda function:
processed = pool.map(lambda item: your_func(item, some_kwarg="some value"), your_data)
Multithreading
The API for multithreading is very similar:
from concurrent.futures import ThreadPoolExecutor
# Pick the amount of workers that works best for you.
# Most likely equal to the amount of threads of your machine.
workers = 4
with ThreadPoolExecutor(workers) as pool:
processed = pool.map(your_func, your_data)
If you want to avoid having to store your_data in memory if you need some attribute of the items instead of the items itself, you can use a generator:
processed = pool.map(your_func, (account["Email"] for account in accounts))

Python 3.5 non blocking functions

I have a fairly large python package that interacts synchronously with a third party API server and carries out various operations with the server. Additionally, I am now also starting to collect some of the data for future analysis by pickling the JSON responses. After profiling several serialisation/database methods, using pickle was the fastest in my case. My basic pseudo-code is:
While True:
do_existing_api_stuff()...
# additional data pickling
data = {'info': []} # there are multiple keys in real version!
if pickle_file_exists:
data = unpickle_file()
data['info'].append(new_data)
pickle_data(data)
if len(data['info']) >= 100: # file size limited for read/write speed
create_new_pickle_file()
# intensive section...
# move files from "wip" (Work In Progress) dir to "complete"
if number_of_pickle_files >= 100:
compress_pickle_files() # with lzma
move_compressed_files_to_another_dir()
My main issue is that the compressing and moving of the files takes several seconds to complete and is therefore slowing my main loop. What is the easiest way to call these functions in a non-blocking way without any major modifications to my existing code? I do not need any return from the function, however it will raise an error if anything fails. Another "nice to have" would be for the pickle.dump() to also be non-blocking. Again, I am not interested in the return beyond "did it raise an error?". I am aware that unpickle/append/re-pickle every loop is not particularly efficient, however it does avoid data loss when the api drops out due to connection issues, server errors, etc.
I have zero knowledge on threading, multiprocessing, asyncio, etc and after much searching, I am currently more confused than I was 2 days ago!
FYI, all of the file related functions are in a separate module/class, so that could be made asynchronous if necessary.
EDIT:
There may be multiple calls to the above functions, so I guess some sort of queuing will be required?
Easiest solution is probably the threading standard library package. This will allow you to spawn a thread to do the compression while your main loop continues.
There is almost certainly quite a bit of 'dead time' in your existing loop waiting for the API to respond and conversely there is quite a bit of time spent doing the compression when you could be usefully making another API call. For this reason I'd suggest separating these two aspects. There are lots of good tutorials on threading so I'll just describe a pattern which you could aim for
Keep the API call and the pickling in the main loop but add a step which passes the file path to each pickle to a queue after it is written
Write a function which takes a the queue as its input and works through the filepaths performing the compression
Before starting the main loop, start a thread with the new function as its target

speeding up urlib.urlretrieve

I am downloading pictures from the internet, and as it turns out, I need to download lots of pictures. I am using a version of the following code fragment (actually looping through the links I intend to download and downloading the pictures :
import urllib
urllib.urlretrieve(link, filename)
I am downloading roughly 1000 pictures every 15 minutes, which is awfully slow based on the number of pictures I need to download.
For efficiency, I set a timeout every 5 seconds (still many downloads last much longer):
import socket
socket.setdefaulttimeout(5)
Besides running a job on a computer cluster to parallelize downloads, is there a way to make the picture download faster / more efficient?
my code above was very naive as I did not take advantage of multi-threading. It obviously takes for url requests to be responded but there is no reason why the computer cannot make further requests while the proxy server responds.
Doing the following adjustments, you can improve efficiency by 10x - and there are further ways for improving efficiency, with packages such as scrapy.
To add multi-threading, do something like the following, using the multiprocessing package:
1) encapsulate the url retrieving in a function:
import import urllib.request
def geturl(link,i):
try:
urllib.request.urlretrieve(link, str(i)+".jpg")
except:
pass
2) then create a collection with all urls as well as names you want for the downloaded pictures:
urls = [url1,url2,url3,urln]
names = [i for i in range(0,len(urls))]
3)Import the Pool class from the multiprocessing package and create an object using such class (obviously you would include all imports in the first line of your code in a real program):
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(100)
then use the pool.starmap() method and pass the function and the arguments of the function.
results = pool.starmap(geturl, zip(links, d))
note: pool.starmap() works only in Python 3
When a program enters I/O wait, the execution is paused so that the kernel can perform the low-level operations associated with the I/O request (this is called a context switch) and is not resumed until the I/O operation is completed.
Context switching is quite a heavy operation. It requires us to save the state of our program (losing any sort of caching we had at the CPU level) and give up the use of the CPU. Later, when we are allowed to run again, we must spend time reinitializing our program on the motherboard and getting ready to resume (of course, all this happens behind the scenes).
With concurrency, on the other hand, we typically have a thing called an “event loop” running that manages what gets to run in our program, and when. In essence, an event loop is simply a list of functions that need to be run. The function at the top of the list gets run, then the next, etc.
The following shows a simple example of an event loop:
from Queue import Queue
from functools import partial
eventloop = None
class EventLoop(Queue):
def start(self):
while True:
function = self.get()
function()
def do_hello():
global eventloop
print "Hello"
eventloop.put(do_world)
def do_world():
global eventloop
print "world"
eventloop.put(do_hello)
if __name__ == "__main__":
eventloop = EventLoop()
eventloop.put(do_hello)
eventloop.start()
If the above seems like something you may use, and you'd also like to see how gevent, tornado, and AsyncIO, can help with your issue, then head out to your (University) library, check out High Performance Python by Micha Gorelick and Ian Ozsvald, and read pp. 181-202.
Note: above code and text are from the book mentioned.

long running running job in flask

I have created a module that does some heavy computations, and returns some data to be stored in a nosqldatabase. The computation process is started via a post request in my flask application. The flask function will execute the cumputation code and after the code and then the returned results will be stored in db. I was thinking of celery. But I am wondering and haven't found any clear info on that if it would be possible to use python trheading E.g
from mysci_module import heavy_compute
#route('/initiate_task/', methods=['POST',])
def run_computation():
import thread
thread.start_new_thread(heavy_compute, post_data)
return reponse
Its very abstract I know. The only problem I see in this method is that my function will have to know and be responsible in storing data in the database, so It is not very independant on the database used. Correct? Why is Celery a better (is it really?) than the method above?
Since CPython is restricted from true concurrency using threads by the GIL, all computations will infact happen serially. Instead you could use the python multiprocessing module and create a pool of processes to complete your heavy computation task.
There are a few microframeworks such as twisted klein apart from celery that can also help achieve that concurrency and independence that you're looking for. They aren't necessarily better, but are available for those who don't want to get their hands messy with various issues that are likely to come up when one gets into synchronizing flask and the actual business logic, especially when response is based on that activity.
I would suggest the following method to start a thread for the long procedure first. Then leave Flask to communicate with the procedure time by time upon your requirements:
from mysci_module import heavy_compute
import thread
thread.start_new_thread(heavy_compute, post_data)
#route('/initiate_task/', methods=['POST',])
def check_computation():
response = heave_compute.status
return response
The best part of this method is to make sure you have a callable thread in the background all the time while it's possible to get the necessary result even passing some parameters to the task.

Categories

Resources