Circular references in dask with out-of-core computations

Circular references in dask with out-of-core computations - python

I am writing a library that implements various out of core algorithms and have run into the issue that it is possible to build a circularly dependent computation graph by source and storing to the same memmapped object. For example
import numpy as np
import dask.array as da
array_shape = (8000, 8000)
chunk_shape = (100, 100)
### create the example data
adisk = np.memmap('/tmp/a.npy', dtype=np.float32, shape=array_shape)
bdisk = np.memmap('/tmp/b.npy', dtype=np.float32, shape=array_shape)
a = da.ones(mainshape, chunks=chunkshape, dtype=np.float32)
b = 2*da.ones(mainshape, chunks=chunkshape, dtype=np.float32)
a.store(adisk)
b.store(bdisk)
adisk.flush()
bdisk.flush()
### Begin demonstration of issue
c = da.from_array(adisk, chunks=chunkshape)
d = da.from_array(bdisk, chunks=chunkshape)
e = c#d
e.store(adisk)
# Assert fails because source data is overwritten before re-read
assert np.all(adisk[:] == adisk[0,0])
My testing suggests that if the data is too large to cache in memory the dask will complete the operation but the behaviour is undefined as source data can be overwritten by result data before it is reused for other parts of the computation. Thus the above code ceases to produce correct results above a certain (machine dependent) matrix size.
Does dask provide any methods to help detect and mitigate such circular dependencies?
I have looked into potential solutions and I am currently thinking that a plugin function could be useful here to check that the target of a store operation is not the same as any source further back in the chain.

I believe that dask.array tokenizes memmapped objects by filename and last modification time. Cases like this where it is yet to be written to seem like a fail case.
I would expect this to fail somewhere in the task optimization phase.
Short term I recommend explicitly providing a name= keyword to the da.from_array function to provide your own unique ID.

I created a solution for my purposes that can be used to check that there are no circular dependencies. In my case because of the code structure I can easily guard the potential issues with e.g assert no_circular_dependencies(e, adisk) but in general this would be tricky to achieve without Monkeypatching Dask.
Because each dask array keeps a list of all tasks in the taskgraph it is straightforward to check the graph to find original data sources and ensure that they are not the same as the destination array.
def get_true_base(ndarr):
if not isinstance(ndarr, np.ndarray):
raise TypeError("expected numpy array")
base = ndarr
while True:
try:
base = base.base
except AttributeError:
break
return base
def get_memmap_bases(dsk):
# better to build a set of names and then grab arrays later
nameset = set()
for name, task in dsk.items():
if isinstance(task[0], np.memmap):
nameset.add(name)
bases = [get_true_base(dsk[name][0]) for name in nameset]
return bases
def no_circular_dependency(dask_arr, target):
if get_true_base(target) in get_memmap_bases(dask_arr.dask):
return False
return True

Related

Rare error when using tf.function in an abusive setting

I have programmend a framework that concatenate different ( quite complicated ) linear operators in an abstract manner. It overrides the operators, "+,*,#,-" and chooses a path through the graph of compositions of functions. It isn't easy to debug to say the least, however the control flow isn't depending on the data itself and of course any operation is done with tensorflow. I was hoping to use tf.function to compile it and get an ( hopefully much faster) tf.function by XLA. However I get the following error:
TypeError: An op outside of the function building code is being passed
a "Graph" tensor. It is possible to have Graph tensors
leak out of the function building context by including a
tf.init_scope in your function building code.
For example, the following function will fail:
#tf.function
def has_init_scope():
my_constant = tf.constant(1.)
with tf.init_scope():
added = my_constant * 2
The graph tensor has name: Reshape_2:0
I do not use the tf.init_scope anywhere and there are 8 (!) google results regarding to this error - while none of them provides me any clue how to debug it.
# initilize linear operators, these are python objects that override __matmul__ etc.
P = ...
A = ...
# initilize vectors, these are compatible python objects to P and A
x = ...
y = ...
# This function recreates the python object from its raw tensorflow data.
# Since it might be dependend on the spaces and
# they also need to be set up for deserializaton the method is returned by a function of x.
# But since many vectors share the same spaces I was hoping to reuse it.
deserialize = x.deserialize()
# We want to compile the action on x to a function now
def meth( data ):
result = P # ( A.T # A # deserialize( data ) )
# we return only the raw data
return result.serialize()
meth = tf.function( meth,
#experimental_compile = True ,
input_signature = (x.serialize_signature,),
).get_concrete_function()
# we want to use meth now for many vectors
# executing this line throws the error
meth(x1)
meth(x2)
meth(x3)
Needless to say that works without the tf.function.
Did anyone stumble across the error and can help me to understand it better ? Or is the hole setup I'm trying not suitable for tensorflow ?
Edit:
The Error was caused by implicitly capturing a constant tensor in the linear operator class by a local lambda. To be honest, the error message suggest something like that, however it was difficult to understand which line in the code caused it and it wasn't easy to find the bug in the end.

Broadcast variables and mapPartitions

Context
In pySpark I broadcast a variable to all nodes with the following code:
sc = spark.sparkContext # Get context
# Extract stopwords from a file in hdfs
# The result looks like stopwords = {"and", "foo", "bar" ... }
stopwords = set([line[0] for line in csv.reader(open(SparkFiles.get("stopwords.txt"), 'r'))])
# The set of stopwords is broadcasted now
stopwords = sc.broadcast(stopwords)
After broadcasting the stopwords I want to make it accessible in mapPartitions:
# Some dummy-dataframe
df = spark.createDataFrame([(["TESTA and TESTB"], ), (["TESTB and TESTA"], )], ["text"])
# The method which will be applied to mapPartitions
def stopwordRemoval(partition, passed_broadcast):
"""
Removes stopwords from "text"-column.
#partition: iterator-object of partition.
#passed_stopwords: Lookup-table for stopwords.
"""
# Now the broadcast is passed
passed_stopwords = passed_broadcast.value
for row in partition:
yield [" ".join((word for word in row["text"].split(" ") if word not in passed_stopwords))]
# re-partitioning in order to get mapPartitions working
df = df.repartition(2)
# Now apply the method
df = df.select("text").rdd \
.mapPartitions(lambda partition: stopwordRemoval(partition, stopwords)) \
.toDF()
# Result
df.show()
#Result:
+------------+
| text |
+------------+
|TESTA TESTB |
|TESTB TESTA |
+------------+
Questions
Even though it works I'm not quite sure if this is the right usage of broadcasting variables. So my questions are:
Is the broadcast correctly executed when I pass it to mapParitions in the demonstrated way?
Is using broadcasting within mapParitions useful since stopwords would be distributed with the function to all nodes anyway (stopwords is never reused)?
The second question relates to this question which partly answers my own. Anyhow, within the specifics it differs; that's why I've chosen to also ask this question.

Some time went by and I read some additional information which answered the question for me. Thus, I wanted to share my insights.
Question 1: Is the broadcast correctly executed when I pass it to mapParitions in the demonstrated way?
First it is of note that a SparkContext.broadcast() is a wrapper around the variable to broadcast as can be read in the docs. This wrapper serializes the variable and adds the information to the execution graph to distribute the this serialized form over the nodes. Calling the broadcasts .value-argument is the command to deserialize the variable again when used.
Additionally, the docs state:
After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v [the variable] is not shipped to the nodes more than once.
Secondly, I found several sources stating that this works with UDFs (User Defined Functions), e.g. here. mapPartitions() and udf()s should be considered analogous since they both, in case of pySpark, pass the data to a Python instance on the respective nodes.
Regarding this, here is the important part: Deserialization has to be part of the Python function (udf() or whatever function passed to mapPartitions()) itself, meaning its .value argument must not be passed as function-parameter.
Thus, the broadcast done the right way: The braodcasted wrapper is passed as parameter and the variable is deserialized inside stopwordRemoval().
Question 2: Is using broadcasting within mapParitions useful since stopwords would be distributed with the function to all nodes anyway (stopwords is never reused)?
Its documented that there is only an advantage if serialization yields any value for the task at hand.
The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
This might be the case when you have a large reference to broadcast to your cluster:
[...] to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
If this applies to your broadcast, broadcasting has an advantage.

How to make statsmodels GLM.fit_constrained result picklable/store-and-reloadable

A GLS (or thus also OLS) regression with constraints on parameters can readily be run using statsmodels GLM.fit_constrained() method, as with the code below (or here).
How can I make the GLMresults object resulting from such a statsmodels GLM.fit_constrained() regression picklable, so that the estimation result can be stored for re-use for prediction in a new session anytime later?
The GLMresults object obtained from fit_constrained() and containing the relevant estimation result has its .save() method that would normally readily pickle the object into a file.
This .save() works for the result from a standard (unconstrained) GLM regression, sm.glm.fit(). However, it doesn't work with the result for sm.glm.fit_unconstrained(). Instead, it throws a pickling error, seemingly because patsy DesignMatrixBuilder is not Picklable, so it links to the never resolved issue here. This at least for my Python 3.6.3 (running on Windows).
An example:
import statsmodels
import statsmodels.api as sm
import pandas as pd
# Define exapmle data & Constraints:
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(100, 5)), columns=list('ABCDF'))
y = df['A']
X = df[['B','C','D','F']]
constraints = ['B + C + D', 'C - F'] # Add two linear constraints on parameters: B+C+D = 0 & C-F = 0
statsmodels.genmod.families.links.identity()
OLS_from_GLM = sm.GLM(y, X)
# Unconstrained regression:
result_u = OLS_from_GLM.fit()
result_u.save('myfile_u.pickle') # This works
# Constrained regression - save() fails
result_c = OLS_from_GLM.fit_constrained(constraints)
result_c.save('myfile_c.pickle') # This fails with pickling error (tested in Python 3.6.3 on Windows): "NotImplementedError: Sorry, pickling not yet supported. See https://github.com/pydata/patsy/issues/26 if you want to help."
Is there a way to readily make the result from fit_unconstrained() picklable i.e./or storable?
I below suggest a first workaround answer; it is trivial and works well for me so far. I do not know, however, whether it is truly advisable or whether its risks are large and/or any preferable alternative solution exists.

I got this to work by simply removing (commenting out) the line
res._results.constraints = lc
in the function definition of fit_constrained() within statsmodels' active generalized_linear_model.py script (in my case in the virtualenv folder \env\Lib\site-packages\statsmodels\genmod\generalized_linear_model.py).
Idling this line seems to have created no problem for my work; I can now readily save and reload the pickled file and use it to make correct predictions based on the stored estimation; the imposed parameter constraints remain respected and predictions made using .predict() remain unchanged after reloading.
I wonder though whether there is any major risk attached to this procedure. I am not familiar with the inner workings of the statsmodels library, or with its glm.fit_constrained() method in particular. i reckon it's unadvisable to change anything in a pre-existing module one does not understand. However, it is the only way I am conveniently able to impose various constraints to my GLM parameters and to be able to save the regression results to readily re-use it for prediction in a later session.

Share large data between processes within a class

I'm trying to share a large numpy array between processes using pool.imap_unordered. Should be easy but I'm trying to do it from within a class. Right now I'm just passing the data everytime and everything works well until the data gets sufficiently large and pool just hangs and doesn't launch the parallel processes. Since only a subset of the large data is needed for each parallel process, an alternative is to only pass a subset, but I don't know how to in my current framework.
Since functions used for multiprocessing must be in the global namespace, I'm placing my function outside of the class as follows (toy example of real problem):
import numpy as np
import mutliprocessing.Pool
import itertools
def process(args):
large_data, index = args
return some_costly_operation(large_data[index])
class MyClass:
def __init__(self):
# Let's pretend this is large
self.data = np.zeros(10)
def do(self):
p = Pool()
for result in p.imap_unordered(process,
itertools.izip(itertools.repeat(self.data), xrange(10)))):
print result
I know this is a hack-y way to do multiprocessing and theoretically you shouldn't do it from within a class and should protect by checking if you're in main... Any alternatives or suggestions?

I think you should use binary/compact memory layout and mmap specifically for numpy arrays.
Code left as exercise to the reader, but I might try to hack something up :)

Can you serialize the data to disk from the caller, and just pass the filename to the worker process? If the response can be large the worker could serialize it and return the filename to caller. This is what I have used when I was working with large data sets.

Pickling cv2.KeyPoint causes PicklingError

I want to search surfs in all images in a given directory and save their keypoints and descriptors for future use. I decided to use pickle as shown below:
#!/usr/bin/env python
import os
import pickle
import cv2
class Frame:
def __init__(self, filename):
surf = cv2.SURF(500, 4, 2, True)
self.filename = filename
self.keypoints, self.descriptors = surf.detect(cv2.imread(filename, cv2.CV_LOAD_IMAGE_GRAYSCALE), None, False)
if __name__ == '__main__':
Fdb = open('db.dat', 'wb')
base_path = "img/"
frame_base = []
for filename in os.listdir(base_path):
frame_base.append(Frame(base_path+filename))
print filename
pickle.dump(frame_base,Fdb,-1)
Fdb.close()
When I try to execute, I get a following error:
File "src/pickle_test.py", line 23, in <module>
pickle.dump(frame_base,Fdb,-1)
...
pickle.PicklingError: Can't pickle <type 'cv2.KeyPoint'>: it's not the same object as cv2.KeyPoint
Does anybody know, what does it mean and how to fix it? I am using Python 2.6 and Opencv 2.3.1
Thank you a lot

The problem is that you cannot dump cv2.KeyPoint to a pickle file. I had the same issue, and managed to work around it by essentially serializing and deserializing the keypoints myself before dumping them with Pickle.
So represent every keypoint and its descriptor with a tuple:
temp = (point.pt, point.size, point.angle, point.response, point.octave,
point.class_id, desc)
Append all these points to some list that you then dump with Pickle.
Then when you want to retrieve the data again, load all the data with Pickle:
temp_feature = cv2.KeyPoint(x=point[0][0],y=point[0][1],_size=point[1], _angle=point[2],
_response=point[3], _octave=point[4], _class_id=point[5])
temp_descriptor = point[6]
Create a cv2.KeyPoint from this data using the above code, and you can then use these points to construct a list of features.
I suspect there is a neater way to do this, but the above works fine (and fast) for me. You might have to play around with your data format a bit, as my features are stored in format-specific lists. I tried to present the above using my idea at its generic base. I hope that this may help you.

Part of the issue is cv2.KeyPoint is a function in python that returns a cv2.KeyPoint object. Pickle is getting confused because, literally, "<type 'cv2.KeyPoint'> [is] not the same object as cv2.KeyPoint". That is, cv2.KeyPoint is a function object, while the type was cv2.KeyPoint. Why OpenCV is like that, I can only make guesses at unless I go digging. I have a feeling it has something to do with it being a wrapper around a C/C++ library.
Python does give you the ability to fix this yourself. I found the inspiration on this post about pickling methods of classes.
I actually use this clip of code, highly modified from the original in the post
import copyreg
import cv2
def _pickle_keypoints(point):
return cv2.KeyPoint, (*point.pt, point.size, point.angle,
point.response, point.octave, point.class_id)
copyreg.pickle(cv2.KeyPoint().__class__, _pickle_keypoints)
Key points of note:
In Python 2, you need to use copy_reg instead of copyreg and point.pt[0], point.pt[1] instead of *point.pt.
You can't directly access the cv2.KeyPoint class for some reason, so you make a temporary object and use that.
The copyreg patching will use the otherwise problematic cv2.KeyPoint function as I have specified in the output of _pickle_keypoints when unpickling, so we don't need to implement an unpickling routine.
And to be nauseatingly complete, cv2::KeyPoint::KeyPoint is an overloaded function in C++, but in Python, this isn't exactly a thing. Whereas in the C++, there's a function that takes the point for the first argument, in Python, it would try to interpret that as an int instead. The * unrolls the point into two arguments, x and y to match the only int argument constructor.
I had been using casper's excellent solution until I realized this was possible.

A similar solution to the one provided by Poik. Just call this once before pickling.
def patch_Keypoint_pickiling(self):
# Create the bundling between class and arguments to save for Keypoint class
# See : https://stackoverflow.com/questions/50337569/pickle-exception-for-cv2-boost-when-using-multiprocessing/50394788#50394788
def _pickle_keypoint(keypoint): # : cv2.KeyPoint
return cv2.KeyPoint, (
keypoint.pt[0],
keypoint.pt[1],
keypoint.size,
keypoint.angle,
keypoint.response,
keypoint.octave,
keypoint.class_id,
)
# C++ Constructor, notice order of arguments :
# KeyPoint (float x, float y, float _size, float _angle=-1, float _response=0, int _octave=0, int _class_id=-1)
# Apply the bundling to pickle
copyreg.pickle(cv2.KeyPoint().__class__, _pickle_keypoint)
More than for the code, this is for the incredibly clear explanation available there : https://stackoverflow.com/a/50394788/11094914
Please note that if you want to expand this idea to other "unpickable" class of openCV, you only need to build a similar function to "_pickle_keypoint". Be sure that you store attributes in the same order as the constructor. You can consider copying the C++ constructor, even in Python, as I did. Mostly C++ and Python constructors seems not to differ too much.
I has issue with the "pt" tuple. However, a C++ constructor exists for X and Y separated coordinates, and thus, allow this fix/workaround.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Circular references in dask with out-of-core computations - python

Related

Rare error when using tf.function in an abusive setting

Broadcast variables and mapPartitions

How to make statsmodels GLM.fit_constrained result picklable/store-and-reloadable

Share large data between processes within a class

Pickling cv2.KeyPoint causes PicklingError

Categories

Resources