Derived AccumulatorParam in PySpark throws EOFException in addInPlace - python

I have a set of JSON objects in which one key holds a nested object whose keys are not common across all the objects. An example object is:
{
  "id": "AJDFAKDF",
  "class": "class1",
  "nested_object": {
    "key1": "value1",
    "key2": "value2",
    "key3": "value3"
  }
}
The id is the unique identifier of the object, and key1, key2, ... will not be the same across all objects, i.e. one object may have key1 in nested_object while another may not. Also, class varies per record.
I have hundreds of millions of such records and I want to find the distribution of records by the keys of nested_object, per class. Sample output would be:
{
  class1: {
    key1: 32,
    key45: 45
  },
  class2: {
    key2: 39,
    key45: 100
  }
}
I am using PySpark accumulators to generate this distribution. I have derived from dict to create a CounterOfCounters class, along the lines of the Counter class in Python's collections library. Much of the code is copied from that source.
from collections import Counter

class CounterOfCounters(dict):
    def __init__(*args, **kwds):
        '''Create a new, empty CounterOfCounters object. Or, initialize
        the count from another mapping of elements to their Counter.
        >>> c = CounterOfCounters()                  # a new, empty CoC
        >>> c = CounterOfCounters({'a': 4, 'b': 2})  # a new counter from a mapping
        '''
        if not args:
            raise TypeError("descriptor '__init__' of 'CounterOfCounters' object "
                            "needs an argument")
        self, *args = args
        if len(args) > 1:
            raise TypeError('expected at most 1 argument, got %d' % len(args))
        super(CounterOfCounters, self).__init__()
        self.update(*args, **kwds)

    def __missing__(self, key):
        # Needed so that self[missing_item] does not raise KeyError
        return Counter()

    def update(*args, **kwds):
        '''Like dict.update() but add counts instead of replacing them.
        Source can be a dictionary.
        '''
        # The regular dict.update() operation makes no sense here because the
        # replace behavior would mix some of the original untouched counts
        # with all of the other counts, a mishmash that has no straightforward
        # interpretation in most counting contexts. Instead, we implement
        # straight addition. Both the inputs and outputs are allowed to
        # contain zero and negative counts.
        # print("In update. [0] ", args)
        if not args:
            raise TypeError("descriptor 'update' of 'CounterOfCounters' object "
                            "needs an argument")
        self, *args = args
        # print("In update. [1] ", self, args)
        if len(args) > 1:
            raise TypeError('expected at most 1 argument, got %d' % len(args))
        iterable = args[0] if args else None
        if iterable is not None:
            if isinstance(iterable, dict):
                if self:
                    self_get = self.get
                    for elem, count in iterable.items():
                        if isinstance(count, Counter):
                            self[elem] = count + self_get(elem, Counter())
                        else:
                            raise TypeError("values can only be of type Counter")
                else:
                    for elem, count in iterable.items():
                        if isinstance(count, Counter):
                            self[elem] = count
                        else:
                            raise TypeError("values can only be of type Counter")
            else:
                raise TypeError("values can only be of type Counter")
        if kwds:
            raise TypeError("Can not process **kwds as of now")
I then derive my custom accumulator parameter class from AccumulatorParam.
from pyspark import AccumulatorParam

class KeyAccumulatorParam(AccumulatorParam):
    def zero(self, value):
        c = CounterOfCounters(
            dict.fromkeys(
                value.keys(),
                Counter()
            )
        )
        return c

    def addInPlace(self, ac1, ac2):
        ac1.update(ac2)
        return ac1
Then I define the accumulator variable and the function that adds to it.
keyAccum = sc.accumulator(
    CounterOfCounters(
        dict.fromkeys(
            ['class1', 'class2', 'class3', 'class4'],
            Counter()
        )
    ),
    KeyAccumulatorParam()
)

def accumulate_keycounts(rec):
    global keyAccum
    c = Counter(list(rec['nested_object'].keys()))
    keyAccum.add(CounterOfCounters({
        rec['class']: c
    }))
After which I call accumulate_keycounts for each record in the RDD:
test_rdd.foreach(accumulate_keycounts)
On doing this, I get an end-of-file exception:
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.spark.api.python.PythonRunner$$anon$1$$anonfun$read$1.apply$mcVI$sp(PythonRDD.scala:200)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:199)
... 25 more
Some additional info: I am using Spark 2.2.1 on my MacBook Pro Retina (2013, 8 GB) to test the function. test_rdd has only ~20 records; it is a sample from a larger RDD of ~200,000 records.
I am not sure why this error is happening. I have enough memory to finish the task. The accumulator is commutative and associative, as required for accumulators.
What I have figured out from debugging so far is that the error happens after the first record is processed, i.e. the first time addInPlace is called. CounterOfCounters.update returns normally, but after that the EOFException occurs.
Any debugging tips or pointers to the problem would be much appreciated.
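For reference, this is the kind of standalone sanity check I plan to run next (just a sketch, not a fix): it exercises zero() and addInPlace outside Spark and does a pickle round trip, since accumulator values are serialized between the workers and the driver.
import pickle
from collections import Counter

param = KeyAccumulatorParam()
start = CounterOfCounters(dict.fromkeys(['class1', 'class2'], Counter()))

# Simulate what one record would contribute.
update = CounterOfCounters({'class1': Counter(['key1', 'key2'])})

merged = param.addInPlace(param.zero(start), update)
print(merged)

# Accumulator values travel between worker and driver in serialized form,
# so check that the custom dict subclass survives a pickle round trip.
restored = pickle.loads(pickle.dumps(merged))
print(restored, type(restored))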

Related

Instantiate an object with a property of custom type using dictionary

I have a few other nested classes in my design, and I am instantiating an object using a dictionary that is itself converted from a JSON file. In the JSON file I have a nested relationship, e.g., Employee and a list of Education (one to many). For simplicity, I present the following example:
I have the following classes defined:
def ensure_type(value, types):
    if isinstance(value, list):  # when value is a list
        for element in value:
            ensure_type(element, types)
        return value
    elif isinstance(value, dict):  # when value is a dict
        for k, v in value.items():
            ensure_type(v, types)
        return value
    elif isinstance(value, types):
        return value
    else:
        raise TypeError('Value {value} is {value_type}, but should be {types}!'.format(
            value=value, value_type=type(value), types=types))

class Education:
    def __init__(self, **kwargs):
        print('im here')
        self.school_name = ensure_type(kwargs.get('school_name'), str)

class Employee:
    def __init__(self, **kwargs):
        self.fname = ensure_type(kwargs.get('fname'), str)
        self.education = ensure_type(kwargs.get('education'), Education)
where I validate the type using my custom function ensure_type.
I would like to instantiate an employee using the following:
if __name__ == "__main__":
emp_dict = {'fname': 'Bob', 'education': [{'school_name':'foo'}, {'school_name':'bar'}]}
employee1 = Employee(**emp_dict)
When I try the approach above, I get the following error:
File "test.py", line 32, in <module>
employee1 = Employee(**emp_dict)
File "test.py", line 26, in __init__
self.education = ensure_type(kwargs.get('education'), Education)
File "test.py", line 5, in ensure_type
ensure_type(element, types)
File "test.py", line 9, in ensure_type
ensure_type(v, types)
File "test.py", line 14, in ensure_type
raise TypeError('Value {value} is {value_type}, but should be {types}!'.format(value=value, value_type=type(value), types=types))
TypeError: Value foo is <class 'str'>, but should be <class '__main__.Education'>!
When I update the Employee class with the following line:
self.education = Education(kwargs.get('education'))
I get the following error:
Traceback (most recent call last):
File "testt.py", line 32, in <module>
employee1 = Employee(**emp_dict)
File "testt.py", line 26, in __init__
self.education = Education(kwargs.get('education'))
TypeError: __init__() takes 1 positional argument but 2 were given
I would appreciate if you could kindly guide me on how to resolve this issue.
NOTE:
Initially I defined my constructor similar to the following:
def __init__(self, iterable=(), **kwargs):
    self.__dict__.update(iterable, **kwargs)
It's a very powerful and flexible approach, but I need some (or all) properties to be required, i.e., the user must provide their values when instantiating an Employee object. That's why I chose not to pursue that approach.
I think the following code illustrates how to avoid the error. It modifies the ensure_type() function so that if an element of a list isn't the proper type (kind), it performs a further check to see if the element is a dict that can be used to construct an instance of that type. It also replaces the dictionary with the instance created, but whether you want this to happen is unclear.
If you want something similar to happen for dictionaries, you'll need to do something very similar to each of the values in one.
Note: As written, the code requires at least Python 3.8 due to its use of the := assignment expression (aka “the walrus operator”).
def ensure_type(value, kind):
    if isinstance(value, list):
        # Make sure elements of list are instances of kind or can be
        # used to create an instance of one.
        for i, element in enumerate(value):
            try:
                ensure_type(element, kind)
            except TypeError:
                # Unless element is dict that can be used to create an
                # instance of kind.
                if (not isinstance(element, dict) or
                        not isinstance(inst := kind(**element), kind)):
                    raise
                else:
                    value[i] = inst  # Replace element with instance (OPTIONAL)
        return value
    elif isinstance(value, dict):
        for k, v in value.items():
            # Make sure the value of each item in dict is an instance of kind.
            ensure_type(v, kind)
        return value
    elif isinstance(value, kind):
        return value
    else:
        raise TypeError(
            'Value {value} is {value_type}, but should be {kind}!'.format(
                value=value, value_type=type(value), kind=kind))
class Printable:  # Added to print test results.
    """ Class which can print a representation of itself. """
    def __repr__(self):
        typename = type(self).__name__
        args = ', '.join("%s=%r" % item for item in vars(self).items())
        return '{typename}({args})'.format(typename=typename, args=args)

class Education(Printable):
    def __init__(self, **kwargs):
        # print("I'm here")
        self.school_name = ensure_type(kwargs.get('school_name'), str)

class Employee(Printable):
    def __init__(self, **kwargs):
        self.fname = ensure_type(kwargs.get('fname'), str)
        self.education = ensure_type(kwargs.get('education'), Education)

if __name__ == "__main__":
    emp_dict = {'fname': 'Bob',
                'education': [{'school_name': 'foo'}, {'school_name': 'bar'}]}
    employee1 = Employee(**emp_dict)
    print(employee1)
Output:
Employee(fname='Bob', education=[Education(school_name='foo'), Education(school_name='bar')])

enforce arguments to a specific list of values

What is the Pythonic way to make a function accept only a specific set of values for a given parameter? For instance, there is a function like:
def results(status, data):
I want to restrict the parameter status to a set of values like 0, 1 or 99.
You need to check the value inside the function:
def results(status, data):
    valid = {0, 1, 99}
    if status not in valid:
        raise ValueError("results: status must be one of %r." % valid)
Here, valid is a set, because the only thing we care about is whether status is a member of the collection (we aren't interested in order, for example). To avoid recreating the set each time you use the function, you'd probably define it as a "constant"1 global:
VALID_STATUS = {0, 1, 99}

def results(status, data):
    if status not in VALID_STATUS:
        raise ValueError("results: status must be one of %r." % VALID_STATUS)
Example usage:
>>> results(7, [...])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 3, in results
ValueError: results: status must be one of {0, 1, 99}.
Always try to raise the most appropriate exception you can - ValueError tells the caller of the function what's going on better than Exception does, for example.
1 It's not really constant, but by convention, ALL_UPPERCASE variable names in Python are considered to be intended as constants.
You can check within the function itself if status is a valid value and if it is not then raise an exception.
def results(status, data):
    list_valid_status = [0, 1, 99]
    # list_valid_status = (0, 1, 99)  # could be a tuple so it doesn't get modified by accident
    if status not in list_valid_status:
        raise ValueError("Wrong status")
An alternative option would be to go with Enums, for instance:
from enum import IntEnum

class STATUSES(IntEnum):
    ZERO = 0
    ONE = 1
    OTHER = 99

def results(status, data):
    status = STATUSES(status)
Calling STATUSES(status) would raise a ValueError similar to the other answers, but with this approach you have some benefits:
Adding the type annotation status: STATUSES makes it easier for others to discover the set of valid values (especially when working with an IDE)
You can give STATUSES members human-readable names (I don't have the broader context of this particular use case)
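A minimal usage sketch under these assumptions (the function body is illustrative only):
from enum import IntEnum

class STATUSES(IntEnum):
    ZERO = 0
    ONE = 1
    OTHER = 99

def results(status, data):
    status = STATUSES(status)  # raises ValueError for anything outside {0, 1, 99}
    ...

results(99, [])             # accepted: 99 maps to STATUSES.OTHER
results(STATUSES.ONE, [])   # accepted
results(7, [])              # raises ValueError: 7 is not a valid STATUSES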
I put together a decorator validateParamList that does this check for keyword arguments specifically. It checks for valid parameter options within a specified key-value mapping (combinations) and/or names-options pairs. The required field was added to make the optional parameters not-so-optional (maybe some won't like this so much...).
This particular decorator works with async functions. If async is not needed, remove async and await from the decorator.
Decorator:
from functools import wraps
from typing import Dict, List

def validateParamList(
    combinations: Dict = {},
    names: List[str] = [],
    options: List[List[str]] = [[]],
    required: set = set(),
):
    def innerFunction(func):
        @wraps(func)
        async def validateParamsWrapper(*args, **kwargs):
            allowables = {
                **combinations,
                **{key: val for (key, val) in zip(names, options)},
            }
            missing_required = required.difference(kwargs.keys())
            if len(missing_required) != 0:
                raise Exception(
                    f"Required keyword values: Keyword argument{'s' if len(missing_required) > 1 else ''} {missing_required} was not provided."
                )
            for key, vals in allowables.items():
                if key in kwargs.keys() and kwargs[key] not in vals:
                    arg_type = "required" if key in required else "optional"
                    raise ValueError(
                        f"Invalid {arg_type} keyword value: Valid values for '{key}' are {vals}. '{kwargs[key]}' was provided."
                    )
            return await func(*args, **kwargs)
        return validateParamsWrapper
    return innerFunction
Application:
@validateParamList(
    combinations={
        "var2": range(10),
        "var3": [True, False],
        "var4": ["ok_val_1", "ok_val_2", "ok_val_3", "ok_val_4"],
    },
    required={"var3", "var4"},
)
async def myFunc(
    self,
    var1: str = "unchecked_default",
    var2: int = -1,
    var3: bool = None,
    var4: str = "",
):
    pass
For this example:
If var2 is not provided on the myFunc call, there will be no ValueError and var2 will be -1
If var3 or var4 are not provided on the myFunc call, there will be an Exception
If var2, var3, and var4 are provided on the myFunc call and any is invalid, there will be a ValueError
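To make the behavior concrete, here is a small usage sketch under the same assumptions (self is dropped so the coroutine can stand alone, and asyncio is used only to drive it; the names and values are illustrative):
import asyncio

@validateParamList(
    combinations={"var3": [True, False], "var4": ["ok_val_1", "ok_val_2"]},
    required={"var3", "var4"},
)
async def my_func(var3: bool = None, var4: str = ""):
    return var3, var4

print(asyncio.run(my_func(var3=True, var4="ok_val_1")))  # (True, 'ok_val_1')
# asyncio.run(my_func(var3=True))                  # raises Exception: var4 not provided
# asyncio.run(my_func(var3=True, var4="bad_val"))  # raises ValueError: invalid value for 'var4'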

Optimizing modifiable named list based on namedtuple

My goal is to optimize a framework based on a stack of modifiers for CSV-sourced lists. Each modifier uses a header list to work on a named basis.
CSV example (including header):
date;place
13/02/2013;New York
15/04/2012;Buenos Aires
29/10/2010;Singapour
I have written some code based on namedtuple in order to be able to use lists generated by the csv module without reorganizing the data every time. The generated code is below:
class MyNamedList(object):
    __slots__ = ("__values")
    _fields = ['date', 'ignore', 'place']

    def __init__(self, values):
        self.__values = values
        if len(self.__values) <= 151:
            for i in range(len(self.__values), 151):
                self.__values += [None,]

    @property
    def date(self):
        return self.__values[0]

    @date.setter
    def date(self, val):
        self.__values[0] = val

    @property
    def ignore(self):
        return self.__values[150]

    @ignore.setter
    def ignore(self, val):
        self.__values[150] = val

    @property
    def place(self):
        return self.__values[1]

    @place.setter
    def place(self, val):
        self.__values[1] = val
I must say I am very disappointed with the performance of this class. Calling a simple modifier function (which sets "ignore" to True 100 times; yes, I know it is useless) for each line of a 70,000-line CSV file takes 9 seconds with PyPy (5.5 with plain Python), whereas equivalent code using a plain list takes 1.1 seconds (same with PyPy and plain Python).
Is there anything I could do to get comparable performance between both approaches? To me, record.ignore = True could be inlined (or close to it) and thus translated into record[150] = True. Is there some blocking point I am not seeing that prevents this from happening?
Note that the record I am modifying is actually (for now) not created for each line of the CSV file, so adding extra items to the list happens only once, before the iteration.
Update: sample code
--> Using namedlist
import namedlist

MyNamedList = namedlist.namedlist("MyNamedList", {"a": 1, "b": 2, "ignore": 150})
test = MyNamedList([0, 1])

def foo(a):
    test.ignore = True  # x100 times

import csv
stream = csv.reader(open("66666.csv", "rb"))
for i in stream:
    foo(i)
--> Not using namedlist
import namedlist
import csv

MyNamedList = namedlist.namedlist("MyNamedList", {"a": 1, "b": 2, "ignore": 150})
test = MyNamedList([0, 1])

sample_data = []
for i in range(len(sample_data), 151):
    sample_data += [None,]

def foo(a):
    sample_data[150] = True  # x100 times

stream = csv.reader(open("66666.csv", "rb"))
for i in stream:
    foo(i)
Update #2: code for namedlist.py (heavily based on namedtuple.py)
# Retrieved from http://code.activestate.com/recipes/500261/
# Licensed under the PSF license
from keyword import iskeyword as _iskeyword
import sys as _sys

def namedlist(typename, field_indices, verbose=False, rename=False):
    # Parse and validate the field names. Validation serves two purposes,
    # generating informative error messages and preventing template injection attacks.
    field_names = field_indices.keys()
    for name in [typename,] + field_names:
        if not min(c.isalnum() or c == '_' for c in name):
            raise ValueError('Type names and field names can only contain alphanumeric characters and underscores: %r' % name)
        if _iskeyword(name):
            raise ValueError('Type names and field names cannot be a keyword: %r' % name)
        if name[0].isdigit():
            raise ValueError('Type names and field names cannot start with a number: %r' % name)
    seen_names = set()
    for name in field_names:
        if name.startswith('_') and not rename:
            raise ValueError('Field names cannot start with an underscore: %r' % name)
        if name in seen_names:
            raise ValueError('Encountered duplicate field name: %r' % name)
        seen_names.add(name)
    # Create and fill in the class template
    numfields = len(field_names)
    argtxt = repr(field_names).replace("'", "")[1:-1]  # tuple repr without parens or quotes
    reprtxt = ', '.join('%s=%%r' % name for name in field_names)
    max_index = -1
    for name in field_names:
        index = field_indices[name]
        if max_index < index:
            max_index = index
    max_index += 1
    template = '''class %(typename)s(object):
    __slots__ = ("__values") \n
    _fields = %(field_names)r \n
    def __init__(self, values):
        self.__values = values
        if len(self.__values) <= %(max_index)s:
            for i in range(len(self.__values), %(max_index)s):
                self.__values += [None,]''' % locals()
    for name in field_names:
        index = field_indices[name]
        template += ''' \n
    @property
    def %s(self):
        return self.__values[%d]
    @%s.setter
    def %s(self, val):
        self.__values[%d] = val''' % (name, index, name, name, index)
    if verbose:
        print template
    # Execute the template string in a temporary namespace
    namespace = {'__name__': 'namedtuple_%s' % typename,
                 '_property': property, '_tuple': tuple}
    try:
        exec template in namespace
    except SyntaxError, e:
        raise SyntaxError(e.message + ':\n' + template)
    result = namespace[typename]
    # For pickling to work, the __module__ variable needs to be set to the frame
    # where the named tuple is created. Bypass this step in environments where
    # sys._getframe is not defined (Jython for example) or sys._getframe is not
    # defined for arguments greater than 0 (IronPython).
    try:
        result.__module__ = _sys._getframe(1).f_globals.get('__name__', '__main__')
    except (AttributeError, ValueError):
        pass
    return result

Python recursive setattr()-like function for working with nested dictionaries [duplicate]

This question already has answers here:
Is it possible to index nested lists using tuples in python?
(7 answers)
Closed 7 months ago.
There are a lot of good getattr()-like functions for parsing nested dictionary structures, such as:
Finding a key recursively in a dictionary
Suppose I have a python dictionary , many nests
https://gist.github.com/mittenchops/5664038
I would like to make a parallel setattr(). Essentially, given:
cmd = 'f[0].a'
val = 'whatever'
x = {"a":"stuff"}
I'd like to produce a function such that I can assign:
x['f'][0]['a'] = val
More or less, this would work the same way as:
setattr(x,'f[0].a',val)
to yield:
>>> x
{"a":"stuff","f":[{"a":"whatever"}]}
I'm currently calling it setByDot():
setByDot(x,'f[0].a',val)
One problem with this is that if a key in the middle doesn't exist, you need to check for it and create the intermediate key---i.e., for the above:
>>> x = {"a":"stuff"}
>>> x['f'][0]['a'] = val
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'f'
So, you first have to make:
>>> x['f']=[{}]
>>> x
{'a': 'stuff', 'f': [{}]}
>>> x['f'][0]['a']=val
>>> x
{'a': 'stuff', 'f': [{'a': 'whatever'}]}
Another is that the keying when the next item is a list will be different from the keying when the next item is a string, i.e.:
>>> x = {"a":"stuff"}
>>> x['f']=['']
>>> x['f'][0]['a']=val
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
...fails because the assignment was for a null string instead of a null dict. The null dict will be the right assignment for every non-list in dict until the very last one---which may be a list, or a value.
A second problem, pointed out in the comments below by #TokenMacGuy, is that when you have to create a list that does not exist, you may have to create an awful lot of blank values. So,
setattr(x,'f[10].a',val)
---may mean the algorithm will have to make an intermediate like:
>>> x['f']=[{},{},{},{},{},{},{},{},{},{},{}]
>>> x['f'][10]['a']=val
to yield
>>> x
{"a":"stuff","f":[{},{},{},{},{},{},{},{},{},{},{"a":"whatever"}]}
such that this is the setter associated with the getter...
>>> getByDot(x,"f[10].a")
"whatever"
More importantly, the intermediates should /not/ overwrite values that already exist.
Below is the junky idea I have so far---I can identify the lists versus dicts and other data types, and create them where they do not exist. However, I don't see (a) where to put the recursive call, or (b) how to 'build' the deep object as I iterate through the list, and (c) how to distinguish the /probing/ I'm doing as I construct the deep object from the /setting/ I have to do when I reach the end of the stack.
def setByDot(obj, ref, newval):
    ref = ref.replace("[", ".[")
    cmd = ref.split('.')
    numkeys = len(cmd)
    count = 0
    for c in cmd:
        count = count + 1
        while count < numkeys:
            if c.find("["):
                idstart = c.find("[")
                numend = c.find("]")
                try:
                    deep = obj[int(c[idstart + 1:numend - 1])]
                except:
                    obj[int(c[idstart + 1:numend - 1])] = []
                    deep = obj[int(c[idstart + 1:numend - 1])]
            else:
                try:
                    deep = obj[c]
                except:
                    if isinstance(obj[c], dict):
                        obj[c] = {}
                    else:
                        obj[c] = ''
                    deep = obj[c]
            setByDot(deep, c, newval)
This seems very tricky because you kind of have to look-ahead to check the type of the /next/ object if you're making place-holders, and you have to look-behind to build a path up as you go.
UPDATE
I recently had this question answered, too, which might be relevant or helpful.
I have separated this out into two steps. In the first step, the query string is broken down into a series of instructions. This way the problem is decoupled, we can view the instructions before running them, and there is no need for recursive calls.
def build_instructions(obj, q):
    """
    Breaks down a query string into a series of actionable instructions.

    Each instruction is a (_type, arg) tuple.

    arg -- The key used for the __getitem__ or __setitem__ call on
           the current object.
    _type -- Used to determine the data type for the value of
             obj.__getitem__(arg)

    If a key/index is missing, _type is used to initialize an empty value.
    In this way _type provides the ability to
    """
    arg = []
    _type = None
    instructions = []
    for i, ch in enumerate(q):
        if ch == "[":
            # Begin list query
            if _type is not None:
                arg = "".join(arg)
                if _type == list and arg.isalpha():
                    _type = dict
                instructions.append((_type, arg))
                _type, arg = None, []
            _type = list
        elif ch == ".":
            # Begin dict query
            if _type is not None:
                arg = "".join(arg)
                if _type == list and arg.isalpha():
                    _type = dict
                instructions.append((_type, arg))
                _type, arg = None, []
            _type = dict
        elif ch.isalnum():
            if i == 0:
                # Query begins with alphanum, assume dict access
                _type = type(obj)
            # Fill out args
            arg.append(ch)
        else:
            TypeError("Unrecognized character: {}".format(ch))
    if _type is not None:
        # Finish up last query
        instructions.append((_type, "".join(arg)))
    return instructions
For your example
>>> x = {"a": "stuff"}
>>> print(build_instructions(x, "f[0].a"))
[(<type 'dict'>, 'f'), (<type 'list'>, '0'), (<type 'dict'>, 'a')]
The expected return value is simply the _type (first item) of the next tuple in the instructions. This is very important because it allows us to correctly initialize/reconstruct missing keys.
This means that our first instruction operates on a dict, either sets or gets the key 'f', and is expected to return a list. Similarly, our second instruction operates on a list, either sets or gets the index 0 and is expected to return a dict.
Now let's create our _setattr function. This gets the proper instructions and goes through them, creating key-value pairs as necessary. Finally, it also sets the val we give it.
def _setattr(obj, query, val):
    """
    This is a special setattr function that will take in a string query,
    interpret it, add the appropriate data structure to obj, and set val.

    We only define two actions that are available in our query string:
    .x -- dict.__setitem__(x, ...)
    [x] -- list.__setitem__(x, ...) OR dict.__setitem__(x, ...)
           the calling context determines how this is interpreted.
    """
    instructions = build_instructions(obj, query)
    for i, (_, arg) in enumerate(instructions[:-1]):
        _type = instructions[i + 1][0]
        obj = _set(obj, _type, arg)
    _type, arg = instructions[-1]
    _set(obj, _type, arg, val)

def _set(obj, _type, arg, val=None):
    """
    Helper function for calling obj.__setitem__(arg, val or _type()).
    """
    if val is not None:
        # Time to set our value
        _type = type(val)
    if isinstance(obj, dict):
        if arg not in obj:
            # If key isn't in obj, initialize it with _type()
            # or set it with val
            obj[arg] = (_type() if val is None else val)
        obj = obj[arg]
    elif isinstance(obj, list):
        n = len(obj)
        arg = int(arg)
        if n > arg:
            obj[arg] = (_type() if val is None else val)
        else:
            # Need to amplify our list, initialize empty values with _type()
            obj.extend([_type() for x in range(arg - n + 1)])
        obj = obj[arg]
    return obj
And just because we can, here's a _getattr function.
def _getattr(obj, query):
    """
    Very similar to _setattr. Instead of setting attributes they will be
    returned. As expected, an error will be raised if a __getitem__ call
    fails.
    """
    instructions = build_instructions(obj, query)
    for i, (_, arg) in enumerate(instructions[:-1]):
        _type = instructions[i + 1][0]
        obj = _get(obj, _type, arg)
    _type, arg = instructions[-1]
    return _get(obj, _type, arg)

def _get(obj, _type, arg):
    """
    Helper function for calling obj.__getitem__(arg).
    """
    if isinstance(obj, dict):
        obj = obj[arg]
    elif isinstance(obj, list):
        arg = int(arg)
        obj = obj[arg]
    return obj
In action:
>>> x = {"a": "stuff"}
>>> _setattr(x, "f[0].a", "test")
>>> print x
{'a': 'stuff', 'f': [{'a': 'test'}]}
>>> print _getattr(x, "f[0].a")
"test"
>>> x = ["one", "two"]
>>> _setattr(x, "3[0].a", "test")
>>> print x
['one', 'two', [], [{'a': 'test'}]]
>>> print _getattr(x, "3[0].a")
"test"
Now for some cool stuff. Unlike python, our _setattr function can set unhashable dict keys.
x = []
_setattr(x, "1.4", "asdf")
print x
[{}, {'4': 'asdf'}] # A list, which isn't hashable
>>> y = {"a": "stuff"}
>>> _setattr(y, "f[1.4]", "test") # We're indexing f with 1.4, which is a list!
>>> print y
{'a': 'stuff', 'f': [{}, {'4': 'test'}]}
>>> print _getattr(y, "f[1.4]") # Works for _getattr too
"test"
We aren't really using unhashable dict keys, but it looks like we are in our query language so who cares, right!
Finally, you can run multiple _setattr calls on the same object, just give it a try yourself.
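For instance, a quick sketch of what repeated calls might look like (the expected output is my reading of the code above; key order may differ on older Pythons):
x = {"a": "stuff"}
_setattr(x, "f[0].a", "test")
_setattr(x, "f[1].b", "other")
_setattr(x, "g.c", "third")
print(x)
# {'a': 'stuff', 'f': [{'a': 'test'}, {'b': 'other'}], 'g': {'c': 'third'}}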
>>> class D(dict):
... def __missing__(self, k):
... ret = self[k] = D()
... return ret
...
>>> x=D()
>>> x['f'][0]['a'] = 'whatever'
>>> x
{'f': {0: {'a': 'whatever'}}}
You can hack something together by fixing two problems:
List that automatically grows when accessed out of bounds (PaddedList)
A way to delay the decision of what to create (list or dict) until it is accessed for the first time (DictOrList)
So the code will look like this:
import collections

class PaddedList(list):
    """ List that grows automatically up to the max index ever passed"""
    def __init__(self, padding):
        self.padding = padding

    def __getitem__(self, key):
        if isinstance(key, int) and len(self) <= key:
            self.extend(self.padding() for i in xrange(key + 1 - len(self)))
        return super(PaddedList, self).__getitem__(key)

class DictOrList(object):
    """ Object proxy that delays the decision of being a List or Dict """
    def __init__(self, parent):
        self.parent = parent

    def __getitem__(self, key):
        # Type of the structure depends on the type of the key
        if isinstance(key, int):
            obj = PaddedList(MyDict)
        else:
            obj = MyDict()
        # Update parent references with the selected object
        parent_seq = (self.parent if isinstance(self.parent, dict)
                      else xrange(len(self.parent)))
        for i in parent_seq:
            if self == parent_seq[i]:
                parent_seq[i] = obj
                break
        return obj[key]

class MyDict(collections.defaultdict):
    def __missing__(self, key):
        ret = self[key] = DictOrList(self)
        return ret

def pprint_mydict(d):
    """ Helper to print MyDict as dicts """
    print d.__str__().replace('defaultdict(None, {', '{').replace('})', '}')
x = MyDict()
x['f'][0]['a'] = 'whatever'
y = MyDict()
y['f'][10]['a'] = 'whatever'
pprint_mydict(x)
pprint_mydict(y)
And the output of x and y will be:
{'f': [{'a': 'whatever'}]}
{'f': [{}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {'a': 'whatever'}]}
The trick consists of creating a defaultdict of objects that can be either a dict or a list depending on how you access them.
So when you have the assignment x['f'][10]['a'] = 'whatever' it will work the following way:
Get x['f']. It won't exist, so it returns a DictOrList object for the key 'f'
Get x['f'][10]. DictOrList.__getitem__ is called with an integer index, and the DictOrList object replaces itself in the parent collection with a PaddedList
Accessing the 11th element of the PaddedList grows it to 11 elements and returns the MyDict element at that position
Assign "whatever" to x['f'][10]['a']
Both PaddedList and DictOrList are a bit hacky, but after all the assignments there is no more magic; you have a structure of dicts and lists.
It is possible to synthesize recursively setting items/attributes by overriding __getitem__ to return a proxy that can set a value in the original dictionary.
I happen to be working on a library that does a few things similar to this, so I was working on a class that can dynamically assign its own subclasses at instantiation. It makes working with this sort of thing easier, but if that kind of hacking makes you squeamish, you can get similar behavior by creating a ProxyObject similar to the one I create and by creating the individual classes used by the ProxyObject dynamically in a function. Something like:
class ProxyObject(object):
    ...  # see below

def instantiateProxyObject(val):
    class ProxyClassForVal(ProxyObject, val.__class__):
        pass
    return ProxyClassForVal(val)
Using a dictionary of classes like the one in FlexibleObject below would make that implementation significantly more efficient if this is the way you implement it. The code I am providing uses FlexibleObject, though. Right now it only supports classes that, like almost all of Python's builtin classes, can be constructed from an instance of themselves as the sole argument to __init__/__new__. In the next week or two, I'll add support for anything pickleable and link to a GitHub repository that contains it. Here's the code:
class FlexibleObject(object):
    """ A FlexibleObject is a baseclass for allowing type to be declared
    at instantiation rather than in the declaration of the class.

    Usage:
    class DoubleAppender(FlexibleObject):
        def append(self, x):
            super(self.__class__, self).append(x)
            super(self.__class__, self).append(x)

    instance1 = DoubleAppender(list)
    instance2 = DoubleAppender(bytearray)
    """
    classes = {}

    def __new__(cls, supercls, *args, **kws):
        if isinstance(supercls, type):
            supercls = (supercls,)
        else:
            supercls = tuple(supercls)
        if (cls, supercls) in FlexibleObject.classes:
            return FlexibleObject.classes[(cls, supercls)](*args, **kws)
        superclsnames = tuple([c.__name__ for c in supercls])
        name = '%s%s' % (cls.__name__, superclsnames)
        d = dict(cls.__dict__)
        d['__class__'] = cls
        if cls == FlexibleObject:
            d.pop('__new__')
        try:
            d.pop('__weakref__')
        except:
            pass
        d['__dict__'] = {}
        newcls = type(name, supercls, d)
        FlexibleObject.classes[(cls, supercls)] = newcls
        return newcls(*args, **kws)
Then to use this to synthesize looking up attributes and items of a dictionary-like object, you can do something like this:
class ProxyObject(FlexibleObject):
    @classmethod
    def new(cls, obj, quickrecdict, path, attribute_marker):
        self = ProxyObject(obj.__class__, obj)
        self.__dict__['reference'] = quickrecdict
        self.__dict__['path'] = path
        self.__dict__['attr_mark'] = attribute_marker
        return self

    def __getitem__(self, item):
        path = self.__dict__['path'] + [item]
        ref = self.__dict__['reference']
        return ref[tuple(path)]

    def __setitem__(self, item, val):
        path = self.__dict__['path'] + [item]
        ref = self.__dict__['reference']
        ref.dict[tuple(path)] = ProxyObject.new(val, ref,
                                                path, self.__dict__['attr_mark'])

    def __getattribute__(self, attr):
        if attr == '__dict__':
            return object.__getattribute__(self, '__dict__')
        path = self.__dict__['path'] + [self.__dict__['attr_mark'], attr]
        ref = self.__dict__['reference']
        return ref[tuple(path)]

    def __setattr__(self, attr, val):
        path = self.__dict__['path'] + [self.__dict__['attr_mark'], attr]
        ref = self.__dict__['reference']
        ref.dict[tuple(path)] = ProxyObject.new(val, ref,
                                                path, self.__dict__['attr_mark'])

class UniqueValue(object):
    pass

class QuickRecursiveDict(object):
    def __init__(self, dictionary={}):
        self.dict = dictionary
        self.internal_id = UniqueValue()
        self.attr_marker = UniqueValue()

    def __getitem__(self, item):
        if item in self.dict:
            val = self.dict[item]
            try:
                if val.__dict__['path'][0] == self.internal_id:
                    return val
                else:
                    raise TypeError
            except:
                return ProxyObject.new(val, self, [self.internal_id, item],
                                       self.attr_marker)
        try:
            if item[0] == self.internal_id:
                return ProxyObject.new(KeyError(), self, list(item),
                                       self.attr_marker)
        except TypeError:
            pass  # Item isn't iterable
        return ProxyObject.new(KeyError(), self, [self.internal_id, item],
                               self.attr_marker)

    def __setitem__(self, item, val):
        self.dict[item] = val
The particulars of the implementation will vary depending on what you want. It's obviously significantly easier to just override __getitem__ in the proxy than it is to override both __getitem__ and __getattribute__ or __getattr__. The syntax you are using in setByDot makes it look like you would be happiest with some solution that overrides a mixture of the two.
If you are just using the dictionary to compare values, using ==, <=, >=, etc., overriding __getattribute__ works really nicely. If you want to do something more sophisticated, you will probably be better off overriding __getattr__ and doing some checks in __setattr__ to determine whether you want to synthesize setting the attribute by setting a value in the dictionary, or whether you want to actually set the attribute on the item you've obtained. Or you might want to handle it so that if your object has an attribute, __getattribute__ returns a proxy to that attribute and __setattr__ always just sets the attribute in the object (in which case, you can completely omit it). All of these things depend on exactly what you are trying to use the dictionary for.
You also may want to create __iter__ and the like. It takes a little bit of effort to make them, but the details should follow from the implementation of __getitem__ and __setitem__.
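Since the difference between these two hooks is easy to mix up, here is a tiny standalone illustration (not part of the code above): __getattr__ fires only when normal lookup fails, while __getattribute__ intercepts every lookup.
class GetattrDemo:
    existing = 1

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        return "missing:%s" % name

class GetattributeDemo:
    existing = 1

    def __getattribute__(self, name):
        # Called for every attribute access, even existing ones.
        return "intercepted:%s" % name

print(GetattrDemo().existing)       # 1
print(GetattrDemo().absent)         # missing:absent
print(GetattributeDemo().existing)  # intercepted:existing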
Finally, I'm going to briefly summarize the behavior of the QuickRecursiveDict in case it's not immediately clear from inspection. The try/excepts are just shorthand for checking whether the ifs can be performed. The one major defect of synthesizing the recursive setting, rather than finding a way to do it directly, is that you can no longer raise KeyErrors when you try to access a key that hasn't been set. However, you can come pretty close by returning a subclass of KeyError, which is what I do in the example. I haven't tested it so I won't add it to the code, but you may want to pass in some human-readable representation of the key to KeyError.
But aside from all that it works rather nicely.
>>> qrd = QuickRecursiveDict()
>>> qrd[0][13] # returns an instance of a subclass of KeyError
>>> qrd[0][13] = 9
>>> qrd[0][13] # 9
>>> qrd[0][13]['forever'] = 'young'
>>> qrd[0][13] # 9
>>> qrd[0][13]['forever'] # 'young'
>>> qrd[0] # returns an instance of a subclass of KeyError
>>> qrd[0] = 0
>>> qrd[0] # 0
>>> qrd[0][13]['forever'] # 'young'
One more caveat: the thing being returned is not quite what it looks like; it's a proxy to what it looks like. If you want the int 9, you need int(qrd[0][13]), not qrd[0][13]. For ints this doesn't matter much, since +, -, == and all that bypass __getattribute__, but for lists you would lose attributes like append if you didn't recast them. (You'd keep len and other builtin methods, just not attributes of list. You lose __len__.)
So that's it. The code's a little bit convoluted, so let me know if you have any questions. I probably can't answer them until tonight unless the answer's really brief. I wish I saw this question sooner, it's a really cool question, and I'll try to update a cleaner solution soon. I had fun trying to code a solution into the wee hours of last night. :)

Returning an Object (class) in Parallel Python

I have created a function that takes a value, does some calculations and returns the different answers as an object. However, when I try to parallelize the code using pp, I get the following error:
File "trmm.py", line 8, in __getattr__
return self.header_array[name]
RuntimeError: maximum recursion depth exceeded while calling a Python object
Here is a simple version of what I am trying to do.
class DataObject(object):
    """
    Class to handle data objects with several arrays.
    """

    def __getattr__(self, name):
        try:
            return self.header_array[name]
        except KeyError:
            try:
                return self.line[name]
            except KeyError:
                raise AttributeError("%s instance has no attribute '%s'" % (self.__class__.__name__, name))

    def __setattr__(self, name, value):
        if name in ('header_array', 'line'):
            object.__setattr__(self, name, value)
        elif name in self.line:
            self.line[name] = value
        else:
            self.header_array[name] = value

class TrmmObject(DataObject):
    def __init__(self):
        DataObject.__init__(self)
        self.header_array = {
            'header': None
        }
        self.line = {
            'longitude': None,
            'latitude': None
        }

if __name__ == '__main__':
    import pp
    ppservers = ()
    job_server = pp.Server(2, ppservers=ppservers)

    def get_monthly_values(value):
        tplObj = TrmmObject()
        tplObj.longitude = value
        tplObj.latitude = value * 2
        return tplObj

    job1 = job_server.submit(get_monthly_values, (5,), (DataObject, TrmmObject,), ("numpy",))
    result = job1()
If I change return tplObj to return [tplObj.longitude, tplObj.latitude] there is no problem. However, as I said before this is a simple version, in reality this change would complicate the program a lot.
I am very grateful for any help.
You almost never need to use __getattr__ and __setattr__, and it almost always ends up with something blowing up; infinite recursion is a typical effect. I can't really see any reason for using them here either. Be explicit and use the line and header_array dictionaries directly.
If you want a function that looks up a value over all arrays, create a function for that and call it explicitly. Calling the function __getitem__ and using [] is explicit. :-)
(And please don't call a dictionary "header_array", it's confusing).
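For illustration only, here is a minimal sketch of the explicit style suggested above, reusing the question's attribute names (this is not code from the question or a verified fix for the pp error):
class TrmmObject(object):
    """ Same data as in the question, but accessed explicitly. """
    def __init__(self):
        self.header_array = {'header': None}
        self.line = {'longitude': None, 'latitude': None}

    def __getitem__(self, name):
        # Explicit lookup across both dictionaries.
        if name in self.header_array:
            return self.header_array[name]
        return self.line[name]

def get_monthly_values(value):
    tplObj = TrmmObject()
    tplObj.line['longitude'] = value
    tplObj.line['latitude'] = value * 2
    return tplObj

obj = get_monthly_values(5)
print(obj['longitude'], obj['latitude'])  # 5 10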
