Understanding Dask Distributed behavior with regard to DataFrame operations - python

I would like to better understand how dask.distributed works. I have a simple CSV that I read into a Dask DataFrame as shown below. This operation executes fine and returns an integer representing the length of the dataframe, which is the behavior I would expect.
import dask.dataframe as dd
gdf = dd.read_csv(filepath)
len(gdf)
# returns some int value
But once I introduce an instance of Client from dask.distributed, I receive the following error:
distributed.utils - ERROR - 'LocalFileSystem' object has no attribute 'cwd'
Here is an example code block:
from dask.distributed import Client
import dask.dataframe as dd
client_db = Client(remote_addr)
gdf = dd.read_csv(filepath)
len(gdf)
# throws the above error
I'm confused - does Client "inject itself" into all Dask operations once it has been instantiated? I thought I would need to do something like gdf = client_db.persist(gdf) to ask that Client connection to manage operations on that dataframe.
Some context on what's happening here would be much appreciated! I can see from the traceback that it has to do with Tornado, a Python web framework that supports web sockets, long polling, etc. I assume that it is attempting to store something... somewhere... but my familiarity drops off here.
If needed, the traceback:
Traceback (most recent call last):
File "/.../geopandas_opt/venv/lib/python3.6/site-packages/distributed/utils.py", line 223, in f
result[0] = yield make_coro()
File "/.../geopandas_opt/venv/lib/python3.6/site-packages/tornado/gen.py", line 1015, in run
value = future.result()
File "/.../geopandas_opt/venv/lib/python3.6/site-packages/tornado/concurrent.py", line 237, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
File "/.../geopandas_opt/venv/lib/python3.6/site-packages/tornado/gen.py", line 1021, in run
yielded = self.gen.throw(*exc_info)
File "/.../geopandas_opt/venv/lib/python3.6/site-packages/distributed/client.py", line 1156, in _gather
traceback)
File "/.../geopandas_opt/venv/lib/python3.6/site-packages/six.py", line 685, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.5/site-packages/dask/bytes/core.py", line 212, in read_block_from_file
File "/usr/local/lib/python3.5/site-packages/dask/bytes/core.py", line 314, in __enter__
File "/usr/local/lib/python3.5/site-packages/dask/bytes/local.py", line 64, in open
File "/usr/local/lib/python3.5/site-packages/dask/bytes/local.py", line 36, in _trim_filename
AttributeError: 'LocalFileSystem' object has no attribute 'cwd'

Yes, when you create a Client it registers itself as the default global scheduler. You can avoid this behavior with the set_as_default= keyword
client = Client(..., set_as_default=False)
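To make the contrast explicit, here is a minimal sketch (not from the original answer; it reuses the remote_addr and filepath placeholders from the question) of directing work to a specific client rather than relying on the default registration:

import dask.dataframe as dd
from dask.distributed import Client

client = Client(remote_addr, set_as_default=False)  # does not become the default scheduler

gdf = dd.read_csv(filepath)

len(gdf)                       # still runs on the default local scheduler
gdf = client.persist(gdf)      # explicitly ask this client to load and hold the data
future = client.compute(gdf)   # or explicitly submit a computation; returns a Future
result = future.result()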
Regarding the exception you've run into, I suspect that it is a version mismatch. You might want to upgrade with either conda or pip:
conda install dask distributed
or
pip install dask distributed

Related

Multiprocessing using Pool class in Python giving Pickling error

I am trying to run a simple multiprocessing example in Python 3.6 in a Zeppelin notebook (on Windows), but I am not able to execute it. Below is the code that I used:
from multiprocessing import Pool

def sqrt(x):
    return x**0.5

numbers = [i for i in range(1000000)]

with Pool() as pool:
    sqrt_ls = pool.map(sqrt, numbers)
After running this code I am getting the following error:
Traceback (most recent call last):
File "/tmp/zeppelin_python-3196160128578820301.py", line 315, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 6, in <module>
File "/usr/lib64/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib64/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
File "/usr/lib64/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
put(task)
File "/usr/lib64/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function sqrt at 0x7f6f84f1a620>: attribute lookup sqrt on __main__ failed
I am not sure if it's just me who is facing this issue, as I have seen so many articles where people run this code easily. If you know the solution, please help.
Thanks
From the multiprocessing documentation:
Note: Functionality within this package requires that the main module be importable by the children. This is covered in Programming guidelines however it is worth pointing out here. This means that some examples, such as the Pool examples will not work in the interactive interpreter.
Notebooks run a Python interactive interpreter behind the scenes, which is probably why you get this error. You can try to run your code from within an if __name__ == '__main__': block.
A Zeppelin notebook does not emulate a normal module well enough to support the pickling that is used to identify the correct operation to another process. You can put all the functions you want to call into a proper module that you import in the usual fashion.
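A minimal sketch of that arrangement (file names are illustrative, not from the original post): the worker function lives in an importable module and the pool is created under the __main__ guard.

# worker.py (hypothetical module)
def sqrt(x):
    return x ** 0.5

# main.py (or the notebook cell)
from multiprocessing import Pool
from worker import sqrt   # child processes can re-import this

if __name__ == '__main__':
    numbers = [i for i in range(1000000)]
    with Pool() as pool:
        sqrt_ls = pool.map(sqrt, numbers)
    print(sqrt_ls[:5])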

assert group is None

I am trying to run a preprocessing pipeline using nipype and I get the following error message:
Traceback (most recent call last):
File "preprocscript.py", line 211, in <module>
preproc.run('MultiProc', plugin_args={'n_procs': 8})
File "/sw/anaconda/3/lib/python3.6/site-packages/nipype/pipeline/engine/workflows.py", line 579, in run
runner = plugin_mod(plugin_args=plugin_args)
File "/sw/anaconda/3/lib/python3.6/site-packages/nipype/pipeline/plugins/multiproc.py", line 162, in __init__
initargs=(self._cwd,)
File "/sw/anaconda/3/lib/python3.6/multiprocessing/pool.py", line 175, in __init__
self._repopulate_pool()
File "/sw/anaconda/3/lib/python3.6/multiprocessing/pool.py", line 236, in _repopulate_pool
self._wrap_exception)
File "/sw/anaconda/3/lib/python3.6/multiprocessing/pool.py", line 250, in _repopulate_pool_static
wrap_exception)
File "/sw/anaconda/3/lib/python3.6/multiprocessing/process.py", line 73, in __init__
assert group is None, 'group argument must be None for now'
AssertionError: group argument must be None for now
and I am not sure what exactly might be wrong in my code that leads to this, or if this is an issue with my software.
I am on a Linux system and use Python 3.6.
The module you are using has a ProcessPoolExecutor being used in it. In Python 3.7 they added some additional arguments to that class, namely initargs, which is what is being passed in the nipype multiproc module you are using. Unfortunately it is not backwards compatible with 3.6, and they did not write in another way to use that module.
Your options are to upgrade or not use the multiprocessing portion of nipype.
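For reference, a minimal sketch (not from nipype itself) of the version difference the answer describes: ProcessPoolExecutor only accepts initializer/initargs from Python 3.7 onward.

import sys
from concurrent.futures import ProcessPoolExecutor

def _announce(cwd):
    print("worker starting in", cwd)

if sys.version_info >= (3, 7):
    # Python 3.7+: initializer/initargs are accepted
    executor = ProcessPoolExecutor(max_workers=2,
                                   initializer=_announce,
                                   initargs=('/tmp',))
else:
    # Python 3.6: these keyword arguments do not exist and raise a TypeError
    executor = ProcessPoolExecutor(max_workers=2)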

Error pickling a `matlab` object in joblib `Parallel` context

I'm running some Matlab code in parallel from inside a Python context (I know, but that's what's going on), and I'm hitting an import error involving matlab.double. The same code works fine in a multiprocessing.Pool, so I am having trouble figuring out what the problem is. Here's a minimal reproducing test case.
import matlab
from multiprocessing import Pool
from joblib import Parallel, delayed
# A global object that I would like to be available in the parallel subroutine
x = matlab.double([[0.0]])
def f(i):
    print(i, x)

with Pool(4) as p:
    p.map(f, range(10))
# This prints 1, [[0.0]]\n2, [[0.0]]\n... as expected

for _ in Parallel(4, backend='multiprocessing')(delayed(f)(i) for i in range(10)):
    pass
# This also prints 1, [[0.0]]\n2, [[0.0]]\n... as expected

# Now run with default `backend='loky'`
for _ in Parallel(4)(delayed(f)(i) for i in range(10)):
    pass
# ^ this crashes.
So, the only problematic one is the one using the 'loky' backend.
The full traceback is:
exception calling callback for <Future at 0x7f63b5a57358 state=finished raised BrokenProcessPool>
joblib.externals.loky.process_executor._RemoteTraceback:
'''
Traceback (most recent call last):
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 391, in _process_worker
call_item = call_queue.get(block=True, timeout=timeout)
File "~/miniconda3/envs/myenv/lib/python3.6/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/matlab/mlarray.py", line 31, in <module>
from _internal.mlarray_sequence import _MLArrayMetaClass
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/matlab/_internal/mlarray_sequence.py", line 3, in <module>
from _internal.mlarray_utils import _get_strides, _get_size, \
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/matlab/_internal/mlarray_utils.py", line 4, in <module>
import matlab
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/matlab/__init__.py", line 24, in <module>
from mlarray import double, single, uint8, int8, uint16, \
ImportError: cannot import name 'double'
'''
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
callback(self)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 309, in __call__
self.parallel.dispatch_next()
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 731, in dispatch_next
if not self.dispatch_one_batch(self._original_iterator):
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 759, in dispatch_one_batch
self._dispatch(tasks)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 716, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 510, in apply_async
future = self._workers.submit(SafeFunction(func))
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/externals/loky/reusable_executor.py", line 151, in submit
fn, *args, **kwargs)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 1022, in submit
raise self._flags.broken
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
joblib.externals.loky.process_executor._RemoteTraceback:
'''
Traceback (most recent call last):
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 391, in _process_worker
call_item = call_queue.get(block=True, timeout=timeout)
File "~/miniconda3/envs/myenv/lib/python3.6/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/matlab/mlarray.py", line 31, in <module>
from _internal.mlarray_sequence import _MLArrayMetaClass
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/matlab/_internal/mlarray_sequence.py", line 3, in <module>
from _internal.mlarray_utils import _get_strides, _get_size, \
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/matlab/_internal/mlarray_utils.py", line 4, in <module>
import matlab
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/matlab/__init__.py", line 24, in <module>
from mlarray import double, single, uint8, int8, uint16, \
ImportError: cannot import name 'double'
'''
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "test.py", line 20, in <module>
for _ in Parallel(4)(delayed(f)(i) for i in range(10)):
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 934, in __call__
self.retrieve()
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 833, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 521, in wrap_future_result
return future.result(timeout=timeout)
File "~/miniconda3/envs/myenv/lib/python3.6/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
File "~/miniconda3/envs/myenv/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
callback(self)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 309, in __call__
self.parallel.dispatch_next()
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 731, in dispatch_next
if not self.dispatch_one_batch(self._original_iterator):
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 759, in dispatch_one_batch
self._dispatch(tasks)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/parallel.py", line 716, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 510, in apply_async
future = self._workers.submit(SafeFunction(func))
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/externals/loky/reusable_executor.py", line 151, in submit
fn, *args, **kwargs)
File "~/miniconda3/envs/myenv/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 1022, in submit
raise self._flags.broken
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
Looking at the traceback, it seems like the root cause is an issue importing the matlab package in the child process.
It's probably worth noting that this all runs just fine if instead I had defined x = np.array([[0.0]]) (after importing numpy as np). And of course the main process has no problem with any matlab imports, so I am not sure why the child process would.
I'm not sure if this error has anything in particular to do with the matlab package, or if it's something to do with global variables and cloudpickle or loky. In my application it would help to stick with loky, so I'd appreciate any insight!
I should also note that I'm using the official Matlab engine for Python: https://www.mathworks.com/help/matlab/matlab-engine-for-python.html. I suppose that might make it hard for others to try out the test cases, so I wish I could reproduce this error with a type other than matlab.double, but I haven't found another yet.
Digging around more, I've noticed that the process of importing the matlab package is more circular than I would expect, and I'm speculating that this could be part of the problem? The issue is that when import matlab is run by loky's _ForkingPickler, first some file matlab/mlarray.py is imported, which imports some other files, one of which contains import matlab, and this causes matlab/__init__.py to be run, which internally has from mlarray import double, single, uint8, ... which is the line that causes the crash.
Could this circularity be the issue? If so, why can I import this module in the main process but not in the loky backend?
The error is caused by incorrect loading order of global objects in the child processes. It can be seen clearly in the traceback
_ForkingPickler.loads(res) -> ... -> import matlab -> from mlarray import ...
that matlab is not yet imported when the global variable x is loaded by cloudpickle.
joblib with loky seems to treat modules as normal global objects and send them dynamically to the child processes. joblib doesn't record the order in which those objects/modules were defined. Therefore they are loaded (initialized) in a random order in the child processes.
A simple workaround is to manually pickle the matlab object and load it after importing matlab inside your function.
import matlab
import pickle

px = pickle.dumps(matlab.double([[0.0]]))

def f(i):
    import matlab
    x = pickle.loads(px)
    print(i, x)
Of course, you can also use joblib's dump and load to serialize the objects.
Use initializer
Thanks to the suggestion from @Aaron, you can also use an initializer (for loky) to import matlab before loading x.
Currently there's no simple API to specify an initializer, so I wrote a simple function:
def with_initializer(self, f_init):
    # Overwrite initializer hook in the Loky ProcessPoolExecutor
    # https://github.com/tomMoral/loky/blob/f4739e123acb711781e46581d5ed31ed8201c7a9/loky/process_executor.py#L850
    hasattr(self._backend, '_workers') or self.__enter__()
    origin_init = self._backend._workers._initializer
    def new_init():
        origin_init()
        f_init()
    self._backend._workers._initializer = new_init if callable(origin_init) else f_init
    return self
It is a little bit hacky but works well with the current version of joblib and loky.
Then you can use it like:
import matlab
from joblib import Parallel, delayed

x = matlab.double([[0.0]])

def f(i):
    print(i, x)

def _init_matlab():
    import matlab

with Parallel(4) as p:
    for _ in with_initializer(p, _init_matlab)(delayed(f)(i) for i in range(10)):
        pass
I hope the developers of joblib will add an initializer argument to the constructor of Parallel in the future.

OrientDB Gremlin server not working in python

I am using OrientDB and Gremlin Server from Python. The Gremlin server starts successfully, but when I try to add one vertex to OrientDB through Gremlin code, I get an error.
query = """graph.addVertex(label, "Test", "title", "abc", "title", "abc")"""
Following is the traceback:
/usr/bin/python3.6 /home/admin-12/Documents/bitbucket/ecodrone/ecodrone/test/test1.py
Traceback (most recent call last):
File "/home/admin-12/Documents/bitbucket/ecodrone/ecodrone/test/test1.py", line 27, in <module>
result = execute_query("""graph.addVertex(label, "Test", "title", "abc", "title", "abc")""")
File "/home/admin-12/Documents/bitbucket/ecodrone/ecodrone/GremlinConnector.py", line 21, in execute_query
results = future_results.result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/home/admin-12/.local/lib/python3.6/site-packages/gremlin_python/driver/resultset.py", line 81, in cb
f.result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
return self.__get_result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/admin-12/.local/lib/python3.6/site-packages/gremlin_python/driver/connection.py", line 77, in _receive
self._protocol.data_received(data, self._results)
File "/home/admin-12/.local/lib/python3.6/site-packages/gremlin_python/driver/protocol.py", line 106, in data_received
"{0}: {1}".format(status_code, data["status"]["message"]))
gremlin_python.driver.protocol.GremlinServerError: 599: Error during serialization: Infinite recursion (StackOverflowError) (through reference chain: com.orientechnologies.orient.core.id.ORecordId["record"]->com.orientechnologies.orient.core.record.impl.ODocument["schemaClass"]->com.orientechnologies.orient.core.metadata.schema.OClassImpl["document"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"]->com.orientechnologies.orient.core.record.impl.ODocument["owners"])
Process finished with exit code 1
First of all, I very much recommend that you do not use the Graph API to make mutations. Prefer the Traversal API for that and do:
g.addV('Test').
  property('title1', 'abc').
  property('title2', 'abc')
Second, I think that your error is occurring because you are returning a Vertex which contains an ORecordId (the vertex identifier), and Gremlin Server doesn't know how to handle that. I don't know if OrientDB has serializers built to handle that, but if they do then you would want to add them to the Gremlin Server configuration, which is described in a bit more detail here - basically, you would want to know whether OrientDB exposes a TinkerPop IORegistry for all their custom classes that might be sent back over the wire.
If they do not, then you would want to avoid returning those or convert them yourself. TinkerPop already recommends that you not return full Vertex objects and only return the data that you need. So, rather than g.V() you would want to convert that Vertex into a Map with g.V().valueMap('title') or something similar (perhaps use the project() step). If you definitely need the vertex identifier then you would need to convert it to something TinkerPop serializers understand. That might mean something as simple as:
g.V().has("title1","abc").id().next().toString()
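As a concrete illustration of the valueMap() suggestion, here is a hedged gremlin_python sketch (the connection URL and property keys are assumptions, not taken from the question): returning plain values means no OrientDB-specific identifier has to be serialized.

from gremlin_python.structure.graph import Graph
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# hypothetical connection details
g = Graph().traversal().withRemote(
    DriverRemoteConnection('ws://localhost:8182/gremlin', 'g'))

# only property values come back, so no ORecordId crosses the wire
rows = g.V().has('Test', 'title1', 'abc').valueMap('title1', 'title2').toList()
print(rows)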

CloudDataflow can not use "google.cloud.datastore" package?

I want to write to Datastore with a transaction from Cloud Dataflow.
So I wrote the code below.
def exe_dataflow():
    ....

from google.cloud import datastore

# call from pipeline
def ds_test(content):
    datastore_client = datastore.Client()
    kind = 'test_out'
    name = 'change'
    task_key = datastore_client.key(kind, name)
    for _ in range(3):
        with datastore_client.transaction():
            current_value = datastore_client.get(task_key)
            current_value['v'] += content['v']
            datastore_client.put(current_value)

# pipeline
....
| 'datastore test' >> beam.Map(ds_test)
But an error occurred and the following log message was displayed:
(7b75e0ef2db229da): Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
work_executor.execute()
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 167, in execute
op.start()
...(SNIP)...
File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 767, in _import_module
return getattr(__import__(module, None, None, [obj]), obj)
AttributeError: 'module' object has no attribute 'datastore'
Can Cloud Dataflow not use the "google.cloud.datastore" package?
Added 2018/2/28:
I added --requirements_file to MyOptions
options = MyOptions(flags = ["--requirements_file", "./requirements.txt"])
and I made a requirements.txt:
google-cloud-datastore==1.5.0
But another error occurred:
(366397598dcf7f02): Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
work_executor.execute()
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 167, in execute
op.start()
...(SNIP)...
File "my_dataflow.py", line 66, in to_entity
File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/__init__.py", line 60, in <module>
from google.cloud.datastore.batch import Batch
File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/batch.py", line 24, in <module>
from google.cloud.datastore import helpers
File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/helpers.py", line 29, in <module>
from google.cloud.datastore_v1.proto import datastore_pb2
File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore_v1/__init__.py", line 17, in <module>
from google.cloud.datastore_v1 import types
File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore_v1/types.py", line 21, in <module>
from google.cloud.datastore_v1.proto import datastore_pb2
File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore_v1/proto/datastore_pb2.py", line 17, in <module>
from google.cloud.datastore_v1.proto import entity_pb2 as google_dot_cloud_dot_datastore__v1_dot_proto_dot_entity__pb2
File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore_v1/proto/entity_pb2.py", line 28, in <module>
dependencies=[google_dot_api_dot_annotations__pb2.DESCRIPTOR,google_dot_protobuf_dot_struct__pb2.DESCRIPTOR,google_dot_protobuf_dot_timestamp__pb2.DESCRIPTOR,google_dot_type_dot_latlng__pb2.DESCRIPTOR,])
File "/usr/local/lib/python2.7/dist-packages/google/protobuf/descriptor.py", line 824, in __new__
return _message.default_pool.AddSerializedFile(serialized_pb)
TypeError: Couldn't build proto file into descriptor pool!
Invalid proto descriptor for file "google/cloud/datastore_v1/proto/entity.proto":
google.datastore.v1.PartitionId.project_id: "google.datastore.v1.PartitionId.project_id" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
...(SNIP)...
google.datastore.v1.Entity.properties: "google.datastore.v1.Entity.PropertiesEntry" seems to be defined in "google/cloud/proto/datastore/v1/entity.proto", which is not imported by "google/cloud/datastore_v1/proto/entity.proto". To use it here, please add the necessary import.
The recommended way to interact with Cloud Datastore from a Cloud Dataflow Pipeline is to use the Datastore I/O API, which is available through the Dataflow SDK and provides some methods to read and write data to a Cloud Datastore database.
You can find detailed documentation for the Datastore I/O package for Dataflow SDK 2.x for Python in this other link. The datastore.v1.datastoreio module is the specific module that you want to use. There is plenty of information in the links I am sharing, but in short, it is a connector to Datastore that uses PTransform to read / write / delete a PCollection from Datastore using the classes ReadFromDatastore() / WriteToDatastore() / DeleteFromDatastore() respectively.
You should try using it instead of implementing the calls yourself. I suspect this may be the reason for the error you are seeing, as a Datastore implementation already exists in the Dataflow SDK:
"google.datastore.v1.PartitionId.project_id" is already defined in file "google/cloud/proto/datastore/v1/entity.proto".
UPDATE:
It looks like those three classes collect several mutations and execute them in a single transaction. You can check that in the code describing the classes.
If the aim is to retrieve (get()) and then update (put()) a Datastore entity, you can probably work with the write_mutations() function, which is described in the documentation, and work with a full batch of mutations performing the operations you are interested in.
