Dagster cannot connect to mongodb locally - python

I was going through the Dagster tutorials and thought it would be a good exercise to connect to my local MongoDB.
from dagster import get_dagster_logger, job, op
from pymongo import MongoClient

@op
def connection():
    client = MongoClient("mongodb://localhost:27017/")
    return client["development"]

@job
def execute():
    client = connection()
    get_dagster_logger().info(f"Connection: {client} ")
Dagster error:
dagster.core.errors.DagsterExecutionHandleOutputError: Error occurred while handling output "result" of step "connection":
File "/usr/local/lib/python3.9/site-packages/dagster/core/execution/plan/execute_plan.py", line 232, in dagster_event_sequence_for_step
for step_event in check.generator(step_events):
File "/usr/local/lib/python3.9/site-packages/dagster/core/execution/plan/execute_step.py", line 348, in core_dagster_event_sequence_for_step
for evt in _type_check_and_store_output(step_context, user_event, input_lineage):
File "/usr/local/lib/python3.9/site-packages/dagster/core/execution/plan/execute_step.py", line 405, in _type_check_and_store_output
for evt in _store_output(step_context, step_output_handle, output, input_lineage):
File "/usr/local/lib/python3.9/site-packages/dagster/core/execution/plan/execute_step.py", line 534, in _store_output
for elt in iterate_with_context(
File "/usr/local/lib/python3.9/site-packages/dagster/utils/__init__.py", line 400, in iterate_with_context
return
File "/usr/local/Cellar/python#3.9/3.9.12/Frameworks/Python.framework/Versions/3.9/lib/python3.9/contextlib.py", line 137, in __exit__
self.gen.throw(typ, value, traceback)
File "/usr/local/lib/python3.9/site-packages/dagster/core/execution/plan/utils.py", line 73, in solid_execution_error_boundary
raise error_cls(
The above exception was caused by the following exception:
TypeError: cannot pickle '_thread.lock' object
File "/usr/local/lib/python3.9/site-packages/dagster/core/execution/plan/utils.py", line 47, in solid_execution_error_boundary
yield
File "/usr/local/lib/python3.9/site-packages/dagster/utils/__init__.py", line 398, in iterate_with_context
next_output = next(iterator)
File "/usr/local/lib/python3.9/site-packages/dagster/core/execution/plan/execute_step.py", line 524, in _gen_fn
gen_output = output_manager.handle_output(output_context, output.value)
File "/usr/local/lib/python3.9/site-packages/dagster/core/storage/fs_io_manager.py", line 124, in handle_output
pickle.dump(obj, write_obj, PICKLE_PROTOCOL)
I have tested this locally in IPython and it works, so the issue is related to Dagster.

The default IOManager requires that inputs and outputs of ops be pickleable - it's likely that your MongoClient is not. You might want to try refactoring this to use Dagster's @resource method. This allows you to define resources externally to your @op, and makes mocking those resources later in tests really easy. Your code would look something like this:
from dagster import get_dagster_logger, job, op, resource
from pymongo import MongoClient

@resource
def mongo_client():
    client = MongoClient("mongodb://localhost:27017/")
    return client["development"]

@op(
    required_resource_keys={'mongo_client'}
)
def test_client(context):
    client = context.resources.mongo_client
    get_dagster_logger().info(f"Connection: {client} ")

@job(
    resource_defs={'mongo_client': mongo_client}
)
def execute():
    test_client()
Notice too that I moved the testing code into another @op, and then only called that op from within the execute @job. This is because the code within a job definition gets compiled at load time and is only used to describe the graph of ops to execute. All general programming to carry out tasks needs to be contained within @op code.
The really neat thing about the @resource pattern is that it makes testing with mock resources, or more generally swapping resources, incredibly easy. Let's say you wanted a mocked client so you could run your job code without actually hitting the database. You could do something like the following:
@resource
def mocked_mongo_client():
    from unittest.mock import MagicMock
    return MagicMock()

@graph
def execute_graph():
    test_client()

execute_live = execute_graph.to_job(name='execute_live',
                                    resource_defs={'mongo_client': mongo_client})
execute_mocked = execute_graph.to_job(name='execute_mocked',
                                      resource_defs={'mongo_client': mocked_mongo_client})
This uses Dagster's @graph pattern to describe a DAG of ops, then uses the .to_job() method on the GraphDefinition object to configure the graph in different ways. This way you can have the exact same underlying op structure, but pass different resources, tags, executors, etc.
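If you want to sanity-check either job without launching the UI, JobDefinition objects can also be run directly with execute_in_process(). A minimal sketch, assuming the execute_mocked job defined above:

# Minimal sketch: run the mocked job in-process and assert it succeeded.
result = execute_mocked.execute_in_process()
assert result.success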

Related

How to properly transform a sync function to an async one?

I'm writing a Telegram bot and I need the bot to be available to users even when it is processing some previous request. My bot downloads videos and compresses them if they exceed the size limit, so it takes some time to process a request. I want to turn my sync functions into async ones and handle them in another process to make this happen.
I found a way to do this using this article, but it doesn't work for me. Here's my code to test the solution:
import asyncio
from concurrent.futures import ProcessPoolExecutor
from functools import wraps, partial

executor = ProcessPoolExecutor()

def async_wrap(func):
    @wraps(func)
    async def run(*args, **kwargs):
        loop = asyncio.get_running_loop()
        pfunc = partial(func, *args, **kwargs)
        return await loop.run_in_executor(executor, pfunc)
    return run

@async_wrap
def sync_func(a):
    import time
    time.sleep(10)

if __name__ == "__main__":
    asyncio.run(sync_func(4))
As a result, I've got the following error message:
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/mikhail/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/queues.py", line 245, in _feed
obj = _ForkingPickler.dumps(obj)
File "/home/mikhail/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function sync_func at 0x7f2e333625f0>: it's not the same object as __main__.sync_func
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/mikhail/Projects/social_network_parsing_bot/processes.py", line 34, in <module>
asyncio.run(sync_func(4))
File "/home/mikhail/.pyenv/versions/3.10.4/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/home/mikhail/.pyenv/versions/3.10.4/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
return future.result()
File "/home/mikhail/Projects/social_network_parsing_bot/processes.py", line 18, in run
return await loop.run_in_executor(executor, pfunc)
File "/home/mikhail/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/queues.py", line 245, in _feed
obj = _ForkingPickler.dumps(obj)
File "/home/mikhail/.pyenv/versions/3.10.4/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function sync_func at 0x7f2e333625f0>: it's not the same object as __main__.sync_func
As I understand it, the error arises because the decorator changes the function and as a result returns a new object. What do I need to change in my code to make it work? Maybe I don't understand some crucial concepts and there is a simple way to achieve this. Thanks for the help.
The article runs a nice experiment, but it really is just meant to work with a thread-pool executor - not a multiprocessing one.
If you look at its code, at some point it passes executor=None to the .run_in_executor call, and asyncio creates a default executor, which is a ThreadPoolExecutor.
The main difference from a ProcessPoolExecutor is that all data moved cross-process (and therefore all data sent to the workers, including the target functions) has to be serialized - and that is done via Python's pickle.
Now, pickle serialization of functions does not really send the function object, along with its bytecode, down the wire: rather, it just sends the function's qualified name, and it is expected that the function with the same qualname on the other end is the same as the original function.
In the case of your code, the func which is the target for the executor pool is the declared function, prior to it being wrapped by the decorator (__main__.sync_func). But what exists under this name in the target process is the post-decoration function. So, if Python did not block it due to the functions not being the same, you'd get into an infinite loop, creating hundreds of nested subprocesses and never actually calling your function - as the entry point in the target would be the wrapped function. That is simply an error in the article you read.
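To see that only a reference travels in the pickle stream, you can disassemble the payload for a plain module-level function. This is just an illustrative sketch, not part of the original answer; the exact opcodes vary by pickle protocol:

import pickle
import pickletools

def plain_func():
    pass

# The disassembly shows only a (module, qualified name) reference for the
# function - no bytecode is embedded in the stream.
pickletools.dis(pickle.dumps(plain_func))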
All this said, the simpler way to make this work is, instead of using the decorator in the usual fashion, to keep the original, undecorated function in the module namespace and create a new name for the wrapped function - this way, the "raw" code can be the target for the executor:
(...)

def sync_func(a):
    import time
    time.sleep(2)
    print(f"finished {a}")

# this creates the decorated function with a new name,
# instead of replacing the original:
wrapped_sync = async_wrap(sync_func)

if __name__ == "__main__":
    asyncio.run(wrapped_sync("go go go"))
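As a side note (my sketch, not part of the original answer), keeping the undecorated function importable also means you can fan several blocking calls out over the process pool concurrently:

# Sketch: dispatch several blocking calls to the ProcessPoolExecutor at
# once, assuming async_wrap, sync_func and wrapped_sync defined as above.
async def main():
    await asyncio.gather(*(wrapped_sync(i) for i in range(4)))

if __name__ == "__main__":
    asyncio.run(main())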

AWS CDK add port mappings

I've been trying a lot of different things, but I can't seem to get this to work for me.
I'm trying to declare ports for my container in AWS CDK within the ecs.TaskDefinition construct.
I keep getting an error that an array type was expected, even though I'm using the specified ecs.PortMapping construct that is required for the port_mappings parameter.
File "/home/user/.local/share/virtualenvs/AWS_Automation--K5ZV1iW/lib/python3.9/site-packages/aws_cdk/aws_ecs/__init__.py", line 27675, in add_container
return typing.cast(ContainerDefinition, jsii.invoke(self, "addContainer", [id, props]))
File "/home/user/.local/share/virtualenvs/AWS_Automation--K5ZV1iW/lib/python3.9/site-packages/jsii/_kernel/__init__.py", line 143, in wrapped
return _recursize_dereference(kernel, fn(kernel, *args, **kwargs))
File "/home/user/.local/share/virtualenvs/AWS_Automation--K5ZV1iW/lib/python3.9/site-packages/jsii/_kernel/__init__.py", line 355, in invoke
response = self.provider.invoke(
File "/home/user/.local/share/virtualenvs/AWS_Automation--K5ZV1iW/lib/python3.9/site-packages/jsii/_kernel/providers/process.py", line 359, in invoke
return self._process.send(request, InvokeResponse)
File "/home/user/.local/share/virtualenvs/AWS_Automation--K5ZV1iW/lib/python3.9/site-packages/jsii/_kernel/providers/process.py", line 326, in send
raise JSIIError(resp.error) from JavaScriptError(resp.stack)
jsii.errors.JSIIError: Expected array type, got {"$jsii.struct":{"fqn":"aws-cdk-lib.aws_ecs.PortMapping","data":{"containerPort":8501,"hostPort":null,"protocol":null}}}
Any help would be appreciated. My relevant code is below.
from aws_cdk import (aws_ec2 as ec2, aws_ecs as ecs,
                     aws_ecs_patterns as ecs_patterns,
                     aws_ecs as ecs,
                     aws_ecr as ecr,
                     aws_route53 as route53,
                     aws_certificatemanager as certificatemanager,
                     aws_elasticloadbalancingv2 as elbv2)

container_port_mappings = ecs.PortMapping(container_port=8501)

task_def = ecs.TaskDefinition(self,
                              'TD',
                              compatibility=ecs.Compatibility.FARGATE,
                              cpu='512',
                              memory_mib='1024'
                              )

task_def.add_container("SL_container",
                       image=ecs.ContainerImage.from_ecr_repository(_repo),
                       port_mappings=container_port_mappings
                       )
port_mappings accepts a list of PortMapping objects:
container_port_mappings = [ecs.PortMapping(container_port = 8501)]
BTW, CDK supports Python type annotations, which help avoid these types of errors.
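For reference, the corrected call could look roughly like this (a sketch; SL_container, _repo and the port number are taken from the question's code):

# Sketch of the corrected add_container call: port_mappings is a list.
task_def.add_container("SL_container",
                       image=ecs.ContainerImage.from_ecr_repository(_repo),
                       port_mappings=[ecs.PortMapping(container_port=8501)])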

How to integrate APScheduler and Imp?

I have built a plugin-based application where "plugins" (Python modules) can be loaded by imp and then scheduled for later execution by APScheduler. I was able to successfully integrate them, but I want to implement persistence in case of crashes or application restarts, so I changed the default memory job store to the SQLAlchemyJobStore. It works quite well the first time you execute the program: tasks are loaded, scheduled, saved to the database and executed at the right time.
Problem is when I try to load the application again I get this traceback:
ERROR:apscheduler.jobstores.default:Unable to restore job "d3e0f0068df54d15986e9b7b6757f665" -- removing it
Traceback (most recent call last):
File "/home/jesus/.local/lib/python2.7/site-packages/apscheduler/jobstores/sqlalchemy.py", line 126, in _get_jobs
jobs.append(self._reconstitute_job(row.job_state))
File "/home/jesus/.local/lib/python2.7/site-packages/apscheduler/jobstores/sqlalchemy.py", line 114, in _reconstitute_job
job.__setstate__(job_state)
File "/home/jesus/.local/lib/python2.7/site-packages/apscheduler/job.py", line 228, in __setstate__
self.func = ref_to_obj(self.func_ref)
File "/home/jesus/.local/lib/python2.7/site-packages/apscheduler/util.py", line 257, in ref_to_obj
raise LookupError('Error resolving reference %s: could not import module' % ref)
LookupError: Error resolving reference __init__:run: could not import module
So it is obvious that there is a problem when attempting to import the function again.
Here is my scheduler initialization:
executors = {'default': ThreadPoolExecutor(5)}
jobstores = {'default': SQLAlchemyJobStore(url='sqlite:///jobs.sqlite')}
self.scheduler = BackgroundScheduler(executors=executors, jobstores=jobstores)
I have a "tests" dictionary containing the "plugins" that should be loaded and some parameters, "load_plugin" uses imp to load a plugin by it's name.
for test, parameters in tests.items():
    if test in pluggins:
        module = load_plugin(pluggins[test])
        self.jobs[test] = self.scheduler.add_job(module.run, "interval", seconds=parameters["interval"], name=test)
Any idea about how can I handle reconstituting jobs?
Something in the automatic detection of the module name is going wrong. Hard to say what, but the alternative is to manually give it the proper lookup path as a string (e.g. "package.module:function"). If you can do this, you can avoid this problem.
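With the loop from the question, passing a textual reference instead of the function object could look like this. This is only a sketch: the "plugins.<name>:run" path is an assumption and must match however the plugin modules are actually importable after a restart.

# Sketch: a textual reference lets APScheduler re-import the callable
# after a restart. The "plugins." package prefix is an assumption.
for test, parameters in tests.items():
    if test in pluggins:
        self.jobs[test] = self.scheduler.add_job(
            "plugins.%s:run" % pluggins[test],
            "interval",
            seconds=parameters["interval"],
            name=test,
        )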

AttributeError for custom types with mixer

I have stumbled into a pretty interesting bug in the klen mixer library for Python.
https://github.com/klen/mixer
This bug occurs whenever you try to set up a model with a column using sqlalchemy.dialects.postgresql.INET. Trying to blend a model with this column will produce the following trace...
mixer: ERROR: Traceback (most recent call last):
File "/home/cllamach/PythonProjects/mixer/mixer/main.py", line 612, in blend
return type_mixer.blend(**values)
File "/home/cllamach/PythonProjects/mixer/mixer/main.py", line 130, in blend
for name, value in defaults.items()
File "/home/cllamach/PythonProjects/mixer/mixer/main.py", line 130, in <genexpr>
for name, value in defaults.items()
File "/home/cllamach/PythonProjects/mixer/mixer/mix_types.py", line 220, in gen_value
return type_mixer.gen_field(field)
File "/home/cllamach/PythonProjects/mixer/mixer/main.py", line 209, in gen_field
return self.gen_value(field.name, field, unique=unique)
File "/home/cllamach/PythonProjects/mixer/mixer/main.py", line 254, in gen_value
gen = self.get_generator(field, field_name, fake=fake)
File "/home/cllamach/PythonProjects/mixer/mixer/main.py", line 304, in get_generator
field.scheme, field_name, fake, kwargs=field.params)
File "/home/cllamach/PythonProjects/mixer/mixer/backend/sqlalchemy.py", line 178, in make_generator
stype, field_name=field_name, fake=fake, args=args, kwargs=kwargs)
File "/home/cllamach/PythonProjects/mixer/mixer/main.py", line 324, in make_generator
fabric = self.__factory.gen_maker(scheme, field_name, fake)
File "/home/cllamach/PythonProjects/mixer/mixer/factory.py", line 157, in gen_maker
if not func and fcls.__bases__:
AttributeError: Mixer (<class 'tests.test_flask.IpAddressUser'>): 'NoneType' object has no attribute '__bases__'
I debugged this error all the way down to a couple of methods in the code. The first method, get_generator, tries the following...
if key not in self.__generators:
    self.__generators[key] = self.make_generator(
        field.scheme, field_name, fake, kwargs=field.params)
And here comes the weird part. In this statement field.scheme has a value, specifically a Column object from SQLAlchemy, but when it is passed down to the make_generator method it arrives as None. So far I have seen no other piece of code in between these two methods, and I have debugged with ipdb and others. I have tried calling the method manually with ipdb and the scheme is still passed as None.
I know this can be deemed too particular an issue, but I would like to know if someone has encountered this kind of issue before, as this is a first for me.
Mixer is choking on an unknown column type. It stores all the ones it knows in GenFactory.types as a dict and calls types.get(column_type), which of course will return None for an unrecognized type. I ran into this because I defined a couple custom SQLAlchemy types with sqlalchemy.types.TypeDecorator.
To solve this problem, you'll have to monkey-patch your types into Mixer's type system. Here's how I did it:
def _setup_mixer_with_custom_types():
    from mixer._faker import faker
    from mixer.backend.sqlalchemy import (
        GenFactory,
        mixer,
    )
    from myproject.customcolumntypes import (
        IntegerTimestamp,
        UTCDateTimeTimestamp,
    )

    def arrow_generator():
        return arrow.get(faker.date_time())

    GenFactory.generators[IntegerTimestamp] = arrow_generator
    GenFactory.generators[UTCDateTimeTimestamp] = arrow_generator
    return mixer

mixer = _setup_mixer_with_custom_types()
Note that you don't actually have to touch GenFactory.types because it's just an intermediary step that Mixer skips if it can find your type directly on GenFactory.generators.
In my case, I also had to define a custom generator (to accommodate Arrow), but you may not need to. Mixer uses the fake-factory library to generate fake data, and you can see what they're using by looking at the GenFactory.generators dict.
You have to get the column type into GenFactory.generators, which by default only contains some standard types. Instead of monkey-patching, you might subclass GenFactory and then specify your own class when creating the Mixer.
In this case, we'll customize the already subclassed GenFactory and Mixer variants from backend.sqlalchemy:
from mixer.backend.sqlalchemy import Mixer, GenFactory

from customtypes import CustomType  # The column type

def get_mixer():
    class CustomFactory(GenFactory):
        # No need to preserve entries, the parent class attribute is
        # automatically extended through GenFactory's metaclass
        generators = {
            CustomType: lambda: 42  # Or any other function
        }
    return Mixer(factory=CustomFactory)
You can use whatever function you like as generator, it just has to return the desired value. Sometimes, directly using something from faker might be enough.
In the same way, you can also customize the other attributes of GenFactory, i.e. fakers and types.
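For completeness, the customized mixer is then used like the stock one. A sketch, where MyModel stands in for whatever mapped class carries the CustomType column:

# Hypothetical usage: MyModel is a placeholder for a mapped class that
# has a CustomType column.
mixer = get_mixer()
instance = mixer.blend(MyModel)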

Is it possible to serialize tasklet code (not just exec state) using SPickle without doing a RPC?

I'm trying to use Stackless Python (2.7.2) with sPickle to send a test method over Celery for execution on a different machine. I would like the test method (code) to be included with the pickle rather than being required to exist on the executing machine's Python path.
I've been referencing the following presentation:
https://ep2012.europython.eu/conference/talks/advanced-pickling-with-stackless-python-and-spickle
I'm trying to use the technique shown on the checkpointing slide (slide 11). The RPC example doesn't seem right given that we are using Celery:
Client code:
from stackless import run, schedule, tasklet
from sPickle import SPickleTools

def test_method():
    print "hello from test method"

tasks = []
test_tasklet = tasklet(test_method)()
tasks.append(test_tasklet)

pt = SPickleTools(serializeableModules=['__test_method__'])
pickled_task = pt.dumps(tasks)
Server code:
pt = sPickle.SPickleTools()
unpickledTasks = pt.loads(pickled_task)
Results in:
[2012-03-09 14:24:59,104: ERROR/MainProcess] Task
celery_tasks.test_exec_method[8f462bd6-7952-4aa1-9adc-d84ee4a51ea6] raised exception:
AttributeError("'module'
object has no attribute 'test_method'",)
Traceback (most recent call last):
File "c:\Python27\lib\site-packages\celery\execute\trace.py", line 153, in trace_task
R = retval = task(*args, **kwargs)
File "c:\Python27\celery_tasks.py", line 16, in test_exec_method
unpickledTasks = pt.loads(pickled_task)
File "c:\Python27\lib\site-packages\sPickle\_sPickle.py", line 946, in loads
return unpickler.load()
AttributeError: 'module' object has no attribute 'test_method'
Any suggestions on what I am doing incorrect or if this is even possible?
Alternative suggestions for doing dynamic module loading in a celeryd would also be welcome (as an alternative to using sPickle). I have experimented with:
py_mod = imp.load_source(module_name, 'some script path')
sys.modules.setdefault(module_name, py_mod)
but the dynamically loaded module does not seem to persist through different calls to celeryd, i.e. different remote calls.
You must define test_method within its own module. Currently sPickle detects whether test_method is defined in a module that can be imported. An alternative way is to set the __module__ attribute of the function to None.
def test_method():
    pass

test_method.__module__ = None
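Applied to the client code from the question, that would look roughly like this (a sketch, assuming __module__ is cleared before the dump):

from stackless import tasklet
from sPickle import SPickleTools

def test_method():
    print "hello from test method"

# Clearing __module__ makes sPickle serialize the function itself instead
# of a reference to an importable module (per the note above).
test_method.__module__ = None

pt = SPickleTools(serializeableModules=['__test_method__'])
pickled_task = pt.dumps([tasklet(test_method)()])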
