Is it possible in pyspark to use the parallelize function over python objects? I want to run on parallel on a list of objects, modified them using a function, and then print these objects.
def init_spark(appname):
spark = SparkSession.builder.appName(appname).getOrCreate()
sc = spark.sparkContext
return spark,sc
def run_on_configs_spark(object_list):
spark,sc = init_spark(appname="analysis")
p_configs_RDD = sc.parallelize(object_list)
p_configs_RDD=p_configs_RDD.map(func)
p_configs_RDD.foreach(print)
def func(object):
return do-somthing(object)
When I run the above code, I encounter an error of "AttributeError: Can't get attribute 'Object' on <module 'pyspark.daemon' from...> ". How can I solve it?
I did the following workaround. But I don't think it is a good solution in general, and it assumes I can change the constructor of the object.
I have converted the object into a dictionary, and construed the object from the directory.
def init_spark(appname):
spark = SparkSession.builder.appName(appname).getOrCreate()
sc = spark.sparkContext
return spark,sc
def run_on_configs_spark(object_list):
spark,sc = init_spark(appname="analysis")
p_configs_RDD = sc.parallelize([x.__dict__() for x in object_list])
p_configs_RDD=p_configs_RDD.map(func)
p_configs_RDD.foreach(print)
def func(dict):
object=CreateObject(create_from_dict=True,dictionary=dict)
return do-something(object)
In the constructor of the Object:
class Object:
def __init__(create_from_dict=False,dictionary=None, other_params...):
if(create_from_dict):
self.__dict__.update(dictionary)
return
Are there any better solutions?
Well for better answer I suggest you post a sample of the object_list and your desired output so we can test with real code.
According to pyspark docs (as above) parallelize function should accept any collection, so I think the problem might be the object_list. I see the workaround can work since the input type is a list of dictionary (or other mapping object)
As for a modular way to run over general created objects, it's depend on how you want the RDD to be, but the general way should be converting the whole object you want into a collection type object. One solution without modifying constructor / structure can be
sc.parallelize([object_list])
The key point is to ensure that the input is in collection type.
Related
I am trying to write a testing program for a python program that takes data, does calculations on it, then puts the output in a class instance object. This object contains several other objects, each with their own attributes. I'm trying to access all the attributes and sub-attributes dynamically with a one size fits all solution, corresponding to elements in a dictionary I wrote to cycle through and get all those attributes for printing onto a test output file.
Edit: this may not be clear from the above but I have a list of the attributes I want, so using something to actually get those attributes is not a problem, although I'm aware python has methods that accomplish this. What I need to do is to be able to get all of those attributes with the same function call, regardless of whether they are top level object attributes or attributes of object attributes.
Python is having some trouble with this - first I tried doing something like this:
for string in attr_dictionary:
...
outputFile.print(outputclass.string)
...
But Python did not like this, and returned an AttributeError
After checking SE, I learned that this is a supposed solution:
for string in attr_dictionary:
...
outputFile.print(getattr(outputclass, string))
...
The only problem is - I want to dynamically access the attributes of objects that are attributes of outputclass. So ideally it would be something like outputclass.objectAttribute.attribute, but this does not work in python. When I use getattr(outputclass, objectAttribute.string), python returns an AttributeError
Any good solution here?
One thing I have thought of trying is creating methods to return those sub-attributes, something like:
class outputObject:
...
def attributeIWant(self,...):
return self.subObject.attributeIWant
...
Even then, it seems like getattr() will return an error because attributeIWant() is supposed to be a function call, it's not actually an attribute. I'm not certain that this is even within the capabilities of Python to make this happen.
Thank you in advance for reading and/or responding, if anyone is familiar with a way to do this it would save me a bunch of refactoring or additional code.
edit: Additional Clarification
The class for example is outputData, and inside that class you could have and instance of the class furtherData, which has the attribute dataIWant:
class outputData:
example: furtherData
example = furtherData()
example.dataIWant = someData
...
with the python getattr I can't access both attributes directly in outputData and attributes of example unless I use separate calls, the attribute of example needs two calls to getattr.
Edit2: I have found a solution I think works for this, see below
I was able to figure this out - I just wrote a quick function that splits the attribute string (for example outputObj.subObj.propertyIWant) then proceeds down the resultant array, calling getattr on each subobject until it reaches the end of the array and returns the actual attribute.
Code:
def obtainAttribute(sample, attributeString: str):
baseObj = sample
attrArray = attributeString.split(".")
for string in attrArray:
if(attrArray.index(string) == (len(attrArray) - 1)):
return getattr(baseObj,string)
else:
baseObj = getattr(baseObj,string)
return "failed"
sample is the object and attributeString is, for example object.subObject.attributeYouWant
When declaring a new instance of an object in python, why would someone use the names of the variables from the parameters at instatntiation time? Say you have the following object:
class Thing:
def __init__(self,var1=None,var2=None):
self.var1=var1
self.var2=var2
The programmer from here decides to create an instance of this object at some point later and enters it in the following way:
NewObj = Thing(var1=newVar,var2=otherVar)
Is there a reason why someone would enter it that way vs. just entering the newVar/otherVar variables into the constructor parameters without using "var1=" and "var2="? Like below:
NewObj = Thing(newVar,otherVar)
I'm fairly novice at using python, and I couldn't find anything about this specific sort of syntax even if it seems like a fairly simple/straightforward question
The reason is clarity, not for the computer, but for yourself and other humans.
class Calculation:
def __init__(self, low=None, high=None, mean=None):
self.low=low
self.high=high
self.mean=mean
...
# with naming (notice how ordering is not important)
calc = Calculation(mean=0.5, low=0, high=1)
# without naming (now order is important and it is less clear what the numbers are used for)
calc = Calculation(0, 1, 0.5)
Note that the same can be done for any function, not only when initializing an object.
I am trying to create a stateful ParDo in apache beam that stores a dict of values and updates that dict with data from subsequent windows.
The equivalent being MapState in java.
I have tried to implement it using a custom CombineFn
class DictCombineFn(beam.CombineFn):
def create_accumulator(self):
return {}
def add_input(self, accumulator, element):
accumulator[element["key"]] = element["value"]
return accumulator
def merge_accumulators(self, accumulators):
return accumulators
def extract_output(self, accumulator):
return accumulator
Which is used in the CombiningValueStateSpec of the following ParDo:
class EnrichDoFn(beam.DoFn):
DICT_STATE = CombiningValueStateSpec(
'dict',
PickleCoder(),
DictCombineFn()
)
def process(
self,
element,
w=beam.DoFn.WindowParam,
dict_state=beam.DoFn.StateParam(DICT_STATE)
):
asks_state.add(element)
However I get the following error during :
TypeError: '_ConcatIterable' object does not support item assignment
I think this might be as a result of using the wrong coder?
What would be the optimal strategy to implement the aforementioned logic?
Thanks
I am not 100% sure about what this error means, but it does feel like dict type is not supported somehow in this particular process. Did you try to get a list of string i.e. "key:value", and then parse and convert it to a dict in one shot?
Merge accumulators should return one element, not iterable like in your case. You would do similar processing like in add element.
I have an object method which changes an attribute of an object. Another method (the one I'm trying to test) calls the first method multiple times and afterward uses the attribute that was modified. How can I test the second method while explicitly saying how the first method changed that attribute?
For example:
def method_to_test(self):
output = []
for _ in range(5):
self.other_method()
output.append(self.attribute_changed_by_other_method)
return output
I want to specify some specific values that attribute_changed_by_other_method will become due to other_method (and the real other_method uses probabilities in deciding on how to change attribute_changed_by_other_method).
I'm guessing the best way to do this would be to "mock" the attribute attribute_changed_by_other_method so that on each time the value is read it gives back a different value of my specification. I can't seem to find how to do this though. The other option I see would be to make sure other_method is mocked to update the attribute in a defined way each time, but I don't know of a particularly clean way of doing this. Can someone suggest a reasonable way of going about this? Thank you much.
What you can actually do is use flexmock for other_method. What you can do with flexmock is set a mock on an instance of your class. Here is an example of how to use it:
class MyTestClass(unittest.TestCase):
def setUp(self):
self.my_obj = MyClass()
self.my_obj_mock = flexmock(self.my_obj)
def my_test_case(self):
self.my_obj_mock.should_receive('other_method').and_return(1).and_return(2).and_return(3)
self.my_obj.method_to_test()
So, what is happening here is that on your instance of MyClass, you are creating a flexmock object out of self.my_obj. Then in your test case, you are stating that when you make your call to method_to_test, you should receive other_method, and each call to it should return 1, 2, 3 respectively.
Furthermore, if you are still interested in knowing how to mock out attribute_changed_by_other_method, you can use Mock's PropertyMock:
Hope this helps. Let me know how it goes!
For anyone still looking for a straightforward answer, this can be done easily with PropertyMock as the accepted answer suggests. Here is one way to do it.
from unittest.mock import patch, PropertyMock
with patch("method_your_class_or_method_calls", new_callable=PropertyMock) as mock_call:
mock_call.side_effect = [111, 222]
class_or_method()
Each subsequent call of that patched method will return that list in sequence.
I want to search surfs in all images in a given directory and save their keypoints and descriptors for future use. I decided to use pickle as shown below:
#!/usr/bin/env python
import os
import pickle
import cv2
class Frame:
def __init__(self, filename):
surf = cv2.SURF(500, 4, 2, True)
self.filename = filename
self.keypoints, self.descriptors = surf.detect(cv2.imread(filename, cv2.CV_LOAD_IMAGE_GRAYSCALE), None, False)
if __name__ == '__main__':
Fdb = open('db.dat', 'wb')
base_path = "img/"
frame_base = []
for filename in os.listdir(base_path):
frame_base.append(Frame(base_path+filename))
print filename
pickle.dump(frame_base,Fdb,-1)
Fdb.close()
When I try to execute, I get a following error:
File "src/pickle_test.py", line 23, in <module>
pickle.dump(frame_base,Fdb,-1)
...
pickle.PicklingError: Can't pickle <type 'cv2.KeyPoint'>: it's not the same object as cv2.KeyPoint
Does anybody know, what does it mean and how to fix it? I am using Python 2.6 and Opencv 2.3.1
Thank you a lot
The problem is that you cannot dump cv2.KeyPoint to a pickle file. I had the same issue, and managed to work around it by essentially serializing and deserializing the keypoints myself before dumping them with Pickle.
So represent every keypoint and its descriptor with a tuple:
temp = (point.pt, point.size, point.angle, point.response, point.octave,
point.class_id, desc)
Append all these points to some list that you then dump with Pickle.
Then when you want to retrieve the data again, load all the data with Pickle:
temp_feature = cv2.KeyPoint(x=point[0][0],y=point[0][1],_size=point[1], _angle=point[2],
_response=point[3], _octave=point[4], _class_id=point[5])
temp_descriptor = point[6]
Create a cv2.KeyPoint from this data using the above code, and you can then use these points to construct a list of features.
I suspect there is a neater way to do this, but the above works fine (and fast) for me. You might have to play around with your data format a bit, as my features are stored in format-specific lists. I tried to present the above using my idea at its generic base. I hope that this may help you.
Part of the issue is cv2.KeyPoint is a function in python that returns a cv2.KeyPoint object. Pickle is getting confused because, literally, "<type 'cv2.KeyPoint'> [is] not the same object as cv2.KeyPoint". That is, cv2.KeyPoint is a function object, while the type was cv2.KeyPoint. Why OpenCV is like that, I can only make guesses at unless I go digging. I have a feeling it has something to do with it being a wrapper around a C/C++ library.
Python does give you the ability to fix this yourself. I found the inspiration on this post about pickling methods of classes.
I actually use this clip of code, highly modified from the original in the post
import copyreg
import cv2
def _pickle_keypoints(point):
return cv2.KeyPoint, (*point.pt, point.size, point.angle,
point.response, point.octave, point.class_id)
copyreg.pickle(cv2.KeyPoint().__class__, _pickle_keypoints)
Key points of note:
In Python 2, you need to use copy_reg instead of copyreg and point.pt[0], point.pt[1] instead of *point.pt.
You can't directly access the cv2.KeyPoint class for some reason, so you make a temporary object and use that.
The copyreg patching will use the otherwise problematic cv2.KeyPoint function as I have specified in the output of _pickle_keypoints when unpickling, so we don't need to implement an unpickling routine.
And to be nauseatingly complete, cv2::KeyPoint::KeyPoint is an overloaded function in C++, but in Python, this isn't exactly a thing. Whereas in the C++, there's a function that takes the point for the first argument, in Python, it would try to interpret that as an int instead. The * unrolls the point into two arguments, x and y to match the only int argument constructor.
I had been using casper's excellent solution until I realized this was possible.
A similar solution to the one provided by Poik. Just call this once before pickling.
def patch_Keypoint_pickiling(self):
# Create the bundling between class and arguments to save for Keypoint class
# See : https://stackoverflow.com/questions/50337569/pickle-exception-for-cv2-boost-when-using-multiprocessing/50394788#50394788
def _pickle_keypoint(keypoint): # : cv2.KeyPoint
return cv2.KeyPoint, (
keypoint.pt[0],
keypoint.pt[1],
keypoint.size,
keypoint.angle,
keypoint.response,
keypoint.octave,
keypoint.class_id,
)
# C++ Constructor, notice order of arguments :
# KeyPoint (float x, float y, float _size, float _angle=-1, float _response=0, int _octave=0, int _class_id=-1)
# Apply the bundling to pickle
copyreg.pickle(cv2.KeyPoint().__class__, _pickle_keypoint)
More than for the code, this is for the incredibly clear explanation available there : https://stackoverflow.com/a/50394788/11094914
Please note that if you want to expand this idea to other "unpickable" class of openCV, you only need to build a similar function to "_pickle_keypoint". Be sure that you store attributes in the same order as the constructor. You can consider copying the C++ constructor, even in Python, as I did. Mostly C++ and Python constructors seems not to differ too much.
I has issue with the "pt" tuple. However, a C++ constructor exists for X and Y separated coordinates, and thus, allow this fix/workaround.