How to properly transform a sync function to an async one? - python

I'm writing a telegram bot and I need the bot to be available to users even when it is processing some previous request. My bot downloads some videos and compresses them if it exceeds the size limit, so it takes some time to process the request. I want to turn my sync functions to async ones and handle them within another process to make this happen.
I found a way to do this, using this article but it doesn't work for me. That's my code to test the solution:
import asyncio
from concurrent.futures import ProcessPoolExecutor
from functools import wraps, partial
executor = ProcessPoolExecutor()
def async_wrap(func):
#wraps(func)
async def run(*args, **kwargs):
loop = asyncio.get_running_loop()
pfunc = partial(func, *args, **kwargs)
return await loop.run_in_executor(executor, pfunc)
return run
#async_wrap
def sync_func(a):
import time
time.sleep(10)
if __name__ == "__main__":
asyncio.run(sync_func(4))
As a result, I've got the following error message:
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/mikhail/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/queues.py", line 245, in _feed
obj = _ForkingPickler.dumps(obj)
File "/home/mikhail/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function sync_func at 0x7f2e333625f0>: it's not the same object as __main__.sync_func
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/mikhail/Projects/social_network_parsing_bot/processes.py", line 34, in <module>
asyncio.run(sync_func(4))
File "/home/mikhail/.pyenv/versions/3.10.4/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/home/mikhail/.pyenv/versions/3.10.4/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
return future.result()
File "/home/mikhail/Projects/social_network_parsing_bot/processes.py", line 18, in run
return await loop.run_in_executor(executor, pfunc)
File "/home/mikhail/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/queues.py", line 245, in _feed
obj = _ForkingPickler.dumps(obj)
File "/home/mikhail/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function sync_func at 0x7f2e333625f0>: it's not the same object as __main__.sync_func
As I understand, the error arises because decorator changes the function and as a result returns a new object. What I need to change in my code to make it work. Maybe I don't understand some crucial concepts and there is some simple method to achieve the desired. Thanks for help

The article runs a nice experiment, but it really is just meant to work with a threaded-pool exercutor - not a multi-processing one.
If you see its code, at some point it passes executor=None to the .run_in_executor call, and asyncio creates a default executor which is a ThreadPoolExecutor.
The main difference to a ProcessPoolExecutor is that all data moved cross-process (and therefore, all data sent to the workers, including the target functions) have to be serialized - and it is done via Python's pickle.
Now, Pickle serialization of functions do not really send the function objects, along with its bytecode, down the wire: rather, it just sends the function qualname, and it is expected that the function with the same qualname on the other end is the same as the original function.
In the case of your code, the func which is the target for the executor-pool is the declared function, prior to it being wrapped in the decorator ( __main__.sync_func) . But what exists with this name in the target process is the post-decorated function. So, if Python would not block it due to the functions not being the same, you'd get into an infinite-loop creating hundreds of nested subprocess and never actually calling your function - as the entry-point in the target would be the wrapped function. That is just an error in the article you viewed.
All this said, the simpler way to make all this work, is instead of using this decorator in the usual fashion, just keep the original, undecorated function, in the module namespace, and create a new name for the wrapped function - this way, the "raw" code can be the target for the executor:
(...)
def sync_func(a):
import time
time.sleep(2)
print(f"finished {a}")
# this creates the decorated function with a new name,
# instead of replacing the original:
wrapped_sync = async_wrap(sync_func)
if __name__ == "__main__":
asyncio.run(wrapped_sync("go go go"))

Related

Dagster cannot connect to mongodb locally

I was going through Dagster tutorials and thought it be a good exercise to connect to my local mongodb.
from dagster import get_dagster_logger, job, op
from pymongo import MongoClient
#op
def connection():
client = MongoClient("mongodb://localhost:27017/")
return client["development"]
#job
def execute():
client = connection()
get_dagster_logger().info(f"Connection: {client} ")
Dagster error:
dagster.core.errors.DagsterExecutionHandleOutputError: Error occurred while handling output "result" of step "connection":
File "/usr/local/lib/python3.9/site-packages/dagster/core/execution/plan/execute_plan.py", line 232, in dagster_event_sequence_for_step
for step_event in check.generator(step_events):
File "/usr/local/lib/python3.9/site-packages/dagster/core/execution/plan/execute_step.py", line 348, in core_dagster_event_sequence_for_step
for evt in _type_check_and_store_output(step_context, user_event, input_lineage):
File "/usr/local/lib/python3.9/site-packages/dagster/core/execution/plan/execute_step.py", line 405, in _type_check_and_store_output
for evt in _store_output(step_context, step_output_handle, output, input_lineage):
File "/usr/local/lib/python3.9/site-packages/dagster/core/execution/plan/execute_step.py", line 534, in _store_output
for elt in iterate_with_context(
File "/usr/local/lib/python3.9/site-packages/dagster/utils/__init__.py", line 400, in iterate_with_context
return
File "/usr/local/Cellar/python#3.9/3.9.12/Frameworks/Python.framework/Versions/3.9/lib/python3.9/contextlib.py", line 137, in __exit__
self.gen.throw(typ, value, traceback)
File "/usr/local/lib/python3.9/site-packages/dagster/core/execution/plan/utils.py", line 73, in solid_execution_error_boundary
raise error_cls(
The above exception was caused by the following exception:
TypeError: cannot pickle '_thread.lock' object
File "/usr/local/lib/python3.9/site-packages/dagster/core/execution/plan/utils.py", line 47, in solid_execution_error_boundary
yield
File "/usr/local/lib/python3.9/site-packages/dagster/utils/__init__.py", line 398, in iterate_with_context
next_output = next(iterator)
File "/usr/local/lib/python3.9/site-packages/dagster/core/execution/plan/execute_step.py", line 524, in _gen_fn
gen_output = output_manager.handle_output(output_context, output.value)
File "/usr/local/lib/python3.9/site-packages/dagster/core/storage/fs_io_manager.py", line 124, in handle_output
pickle.dump(obj, write_obj, PICKLE_PROTOCOL)
I have tested this locally in a ipython and it works so the issue is related to dagster.
The default IOManager requires that inputs and outputs to ops be pickleable - it's likely that your MongoClient is not. You might want to try refactoring this to use Dagster's #resource method. This allows you to define resources externally to your #op, and makes mocking those resources later in tests really easy. You code would look something like this:
from dagster import get_dagster_logger, job, op, resource
from pymongo import MongoClient
#resource
def mongo_client():
client = MongoClient("mongodb://localhost:27017/")
return client["development"]
#op(
required_resource_keys={'mongo_client'}
)
def test_client(context):
client = context.resources.mongo_client
get_dagster_logger().info(f"Connection: {client} ")
#job(
resource_defs={'mongo_client': mongo_client}
)
def execute():
test_client()
Notice too that I moved the testing code into another #op, and then only called that op from within the execute #job. This is because the code within a job definition gets compiled at load time, and is only used to describe the graph of ops to execute. All general programming to carry out tasks needs to be contained within #op code.
The really neat thing about the #resource pattern is that this makes testing with mock resources or more generally swapping resources incredibly easy. Lets say you wanted a mocked client so you could run your job code without actually hitting the database. You could do something like the following:
#resource
def mocked_mongo_client():
from unittest.mock import MagicMock
return MagicMock()
#graph
def execute_graph():
test_client()
execute_live = execute_graph.to_job(name='execute_live',
resource_defs={'mongo_client': mongo_client,})
execute_mocked = execute_graph.to_job(name='execute_mocked',
resource_defs={'mongo_client': mocked_mongo_client,})
This uses Dagster's #graph pattern to describe a DAG of ops, then use the .to_job() method on the GraphDefinition object to configure the graph in different ways. This way you can have the same exact underlying op structure, but pass different resources, tags, executors, etc.

How to use asyncio with ProcessPoolExecutor

I am searching for huge number of addresses on web, I want to use both asyncio and ProcessPoolExecutor in my task to quickly search the addresses.
async def main():
n_jobs = 3
addresses = [list of addresses]
_addresses = list_splitter(data=addresses, n=n_jobs)
with ProcessPoolExecutor(max_workers=n_jobs) as executor:
futures_list = []
for _address in _addresses:
futures_list +=[asyncio.get_event_loop().run_in_executor(executor, execute_parallel, _address)]
for f in tqdm(as_completed(futures_list, loop=asyncio.get_event_loop()), total=len(_addresses)):
results = await f
asyncio.get_event_loop().run_until_complete(main())
expected:
I want to execute_parallel function should run in parallel.
error:
Traceback (most recent call last):
File "/home/awaish/danamica/scraping/skraafoto/aerial_photos_scraper.py", line 228, in <module>
asyncio.run(main())
File "/usr/local/lib/python3.7/asyncio/runners.py", line 43, in run
return loop.run_until_complete(main)
File "/usr/local/lib/python3.7/asyncio/base_events.py", line 584, in run_until_complete
return future.result()
File "/home/awaish/danamica/scraping/skraafoto/aerial_photos_scraper.py", line 224, in main
results = await f
File "/usr/local/lib/python3.7/asyncio/tasks.py", line 533, in _wait_for_one
return f.result() # May raise f.exception().
TypeError: can't pickle coroutine objects
I'm not sure I'm answering the correct question, but it appears the intent of your code is to run your execute_parallel function across several processes using Asyncio. As opposed to using ProcessPoolExecutor, why not try something like using a normal multiprocessing Pool and setting up separate Asyncio loops to run in each. You might set up one process per core and let Asyncio work its magic within each process.
async def run_loop(addresses):
loop = asyncio.get_event_loop()
loops = [loop.create_task(execute_parallel, address) for address in addresses]
loop.run_until_complete(asyncio.wait(loops))
def main():
n_jobs = 3
addresses = [list of addresses]
_addresses = list_splitter(data=addresses, n=n_jobs)
with multiprocessing.Pool(processes=n_jobs) as pool:
pool.imap_unordered(run_loop, _addresses)
I've used Pool.imap_unordered with great success, but depending on your needs you may prefer Pool.map or some other functionality. You can play around with chunksize or with the number of addresses in each list to achieve optimal results (ie, if you're getting a lot of timeouts you may want to reduce the number of addresses being processed concurrently)

How to pass classes into Pool.map as arguments - Pickling Error

I am trying to process a file by cutting it up into chunks and running them through a function which processes the chunks and returns a numpy array. After looking around it seems the best method would be to use the Pool.map method by passing through classes as the arguments. These classes are initiated with the chunk sections as a variable, and another variable to store the numpy array. The output list of classes can then be parsed to get out the information I need to continue with the problem. Here is a simplified version of the script I am trying to write:
from multiprocessing import Pool
class container():
def __init__(self, k):
self.input_section = k
self.ouput_answer = 0
def compute(object_class):
# Main operation would go on in here....
object_class.output_answer = object_class.input_section
return object_class
def Main():
# Create list of classes to path as arguments
sections = [container(k) for k in range(10)]
# Create pool and compute modified classes
with Pool(4) as p:
results = p.map(compute, sections)
# Decode here to get answers
sections = [k.output_answer for k in results]
# Print answers
print(sections)
if __name__ == '__main__':
Main()
This is the error that I get when I run the script:
Exception in thread Thread-9: Traceback (most recent call last):
File "C:\Users\rbernon\AppData\Local\Continuum\Anaconda3\lib\threading.py", line 916, in _bootstrap_inner
self.run()
File "C:\Users\rbernon\AppData\Local\Continuum\Anaconda3\lib\threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\rbernon\AppData\Local\Continuum\Anaconda3\lib\multiprocessing\pool.py", line 463, in _handle_results
task = get()
File "C:\Users\rbernon\AppData\Local\Continuum\Anaconda3\lib\multiprocessing\connection.py", line 251, in recv
return _ForkingPickler.loads(buf.getbuffer())
AttributeError: Can't get attribute 'container' on module '__main__' from
'C:\\Users\\rbernon\\AppData\\Local\\Continuum\\Anaconda3\\lib\\site-packages\\spyder\\utils\\ipython\\start_kernel.py'>
Any help would be greatly apprectiated!
Keep in mind that every piece of data you want to have processed needs to be pickled and sent to the worker processes.
The overhead of this will reduce (and might even eliminate) the advantages of using multiple processes.
If the data file is large, it is probably better to send each worker a start and end offset as a 2-tuple of numbers, so each worker can read part of the file and process it.

Why does this function not work when used as a decorator?

UPDATE: As noted by Mr. Fooz, the functional version of the wrapper has a bug, so I reverted to the original class implementation. I've put the code up on GitHub:
https://github.com/nofatclips/timeout/commits/master
There are two commits, one working (using the "import" workaround) the second one broken.
The source of the problem seems to be the pickle#dumps function, which just spits out an identifier when called on an function. By the time I call Process, that identifier points to the decorated version of the function, rather than the original one.
ORIGINAL MESSAGE:
I was trying to write a function decorator to wrap a long task in a Process that would be killed if a timeout expires. I came up with this (working but not elegant) version:
from multiprocessing import Process
from threading import Timer
from functools import partial
from sys import stdout
def safeExecution(function, timeout):
thread = None
def _break():
#stdout.flush()
#print (thread)
thread.terminate()
def start(*kw):
timer = Timer(timeout, _break)
timer.start()
thread = Process(target=function, args=kw)
ret = thread.start() # TODO: capture return value
thread.join()
timer.cancel()
return ret
return start
def settimeout(timeout):
return partial(safeExecution, timeout=timeout)
##settimeout(1)
def calculatePrimes(maxPrimes):
primes = []
for i in range(2, maxPrimes):
prime = True
for prime in primes:
if (i % prime == 0):
prime = False
break
if (prime):
primes.append(i)
print ("Found prime: %s" % i)
if __name__ == '__main__':
print (calculatePrimes)
a = settimeout(1)
calculatePrime = a(calculatePrimes)
calculatePrime(24000)
As you can see, I commented out the decorator and assigned the modified version of calculatePrimes to calculatePrime. If I tried to reassign it to the same variable, I'd get a "Can't pickle : attribute lookup builtins.function failed" error when trying to call the decorated version.
Anybody has any idea of what is happening under the hood? Is the original function being turned into something different when I assign the decorated version to the identifier referencing it?
UPDATE: To reproduce the error, I just change the main part to
if __name__ == '__main__':
print (calculatePrimes)
a = settimeout(1)
calculatePrimes = a(calculatePrimes)
calculatePrimes(24000)
#sleep(2)
which yields:
Traceback (most recent call last):
File "c:\Users\mm\Desktop\ING.SW\python\thread2.py", line 49, in <module>
calculatePrimes(24000)
File "c:\Users\mm\Desktop\ING.SW\python\thread2.py", line 19, in start
ret = thread.start()
File "C:\Python33\lib\multiprocessing\process.py", line 111, in start
self._popen = Popen(self)
File "C:\Python33\lib\multiprocessing\forking.py", line 241, in __init__
dump(process_obj, to_child, HIGHEST_PROTOCOL)
File "C:\Python33\lib\multiprocessing\forking.py", line 160, in dump
ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'function'>: attribute lookup builtin
s.function failed
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Python33\lib\multiprocessing\forking.py", line 344, in main
self = load(from_parent)
EOFError
P.S. I also wrote a class version of safeExecution, which has exactly the same behaviour.
Move the function to a module that's imported by your script.
Functions are only picklable in python if they're defined at the top level of a module. Ones defined in scripts are not picklable by default. Module-based functions are pickled as two strings: the name of the module, and the name of the function. They're unpickled by dynamically importing the module then looking up the function object by name (hence the restriction on top-level-only functions).
It's possible to extend the pickle handlers to support semi-generic function and lambda pickling, but doing so can be tricky. In particular, it can be difficult to reconstruct the full namespace tree if you want to properly handle things like decorators and nested functions. If you want to do this, it's best to use Python 2.7 or later or Python 3.3 or later (earlier versions have a bug in the dispatcher of cPickle and pickle that's unpleasant to work around).
Is there an easy way to pickle a python function (or otherwise serialize its code)?
Python: pickling nested functions
http://bugs.python.org/issue7689
EDIT:
At least in Python 2.6, the pickling works fine for me if the script only contains the if __name__ block, the script imports calculatePrimes and settimeout from a module, and if the inner start function's name is monkey-patched:
def safeExecution(function, timeout):
...
def start(*kw):
...
start.__name__ = function.__name__ # ADD THIS LINE
return start
There's a second problem that's related to Python's variable scoping rules. The assignment to the thread variable inside start creates a shadow variable whose scope is limited to one evaluation of the start function. It does not assign to the thread variable found in the enclosing scope. You can't use the global keyword to override the scope because you want and intermediate scope and Python only has full support for manipulating the local-most and global-most scopes, not any intermediate ones. You can overcome this problem by placing the thread object in a container that's housed in the intermediate scope. Here's how:
def safeExecution(function, timeout):
thread_holder = [] # MAKE IT A CONTAINER
def _break():
#stdout.flush()
#print (thread)
thread_holder[0].terminate() # REACH INTO THE CONTAINER
def start(*kw):
...
thread = Process(target=function, args=kw)
thread_holder.append(thread) # MUTATE THE CONTAINER
...
start.__name__ = function.__name__ # MAKES THE PICKLING WORK
return start
Not sure really why you get that problem, but to answer your title question: Why does the decorator not work?
When you pass arguments to a decorator, you need to structure the code slightly different. Essentially you have to implement the decorator as a class with an __init__ and an __call__.
In the init, you collect the arguments that you send to the decorator, and in the call, you'll get the function you decorate:
class settimeout(object):
def __init__(self, timeout):
self.timeout = timeout
def __call__(self, func):
def wrapped_func(n):
func(n, self.timeout)
return wrapped_func
#settimeout(1)
def func(n, timeout):
print "Func is called with", n, 'and', timeout
func(24000)
This should get you going on the decorator front at least.

Is it possible to serialize tasklet code (not just exec state) using SPickle without doing a RPC?

Trying to use stackless python (2.7.2) with SPickle to send a test method over celery for execution on a different machine. I would like the test method (code) to be included with the pickle and not forced to exist on the executing machines python path.
Been referencing following presentation:
https://ep2012.europython.eu/conference/talks/advanced-pickling-with-stackless-python-and-spickle
Trying to use the technique shown in the checkpointing slide 11. The RPC example doesn't seem right given that we are using celery:
Client code:
from stackless import run, schedule, tasklet
from sPickle import SPickleTools
def test_method():
print "hello from test method"
tasks = []
test_tasklet = tasklet(test_method)()
tasks.append(test_tasklet)
pt = SPickleTools(serializeableModules=['__test_method__'])
pickled_task = pt.dumps(tasks)
Server code:
pt = sPickle.SPickleTools()
unpickledTasks = pt.loads(pickled_task)
Results in:
[2012-03-09 14:24:59,104: ERROR/MainProcess] Task
celery_tasks.test_exec_method[8f462bd6-7952-4aa1-9adc-d84ee4a51ea6] raised exception:
AttributeError("'module'
object has no attribute 'test_method'",)
Traceback (most recent call last):
File "c:\Python27\lib\site-packages\celery\execute\trace.py", line 153, in trace_task
R = retval = task(*args, **kwargs)
File "c:\Python27\celery_tasks.py", line 16, in test_exec_method
unpickledTasks = pt.loads(pickled_task)
File "c:\Python27\lib\site-packages\sPickle\_sPickle.py", line 946, in loads
return unpickler.load()
AttributeError: 'module' object has no attribute 'test_method'
Any suggestions on what I am doing incorrect or if this is even possible?
Alternative suggestions for doing dynamic module loading in a celeryd would also be good (as an alternative for using sPickle). I have experimented with doing:
py_mod = imp.load_source(module_name,'some script path')
sys.modules.setdefault(module_name,py_mod)
but the dynamically loaded module does not seem to persist through different calls to celeryd, i.e. different remote calls.
You must define test_method within its own module. Currently sPickle detects whether test_method is defined in a module that can be imported. An alternative way is to set the __module__ attribute of the function to None.
def test_method():
pass
test_method.__module__ = None

Categories

Resources