scipy.io.loadmat nested structures (i.e. dictionaries) - python

Using the given routines (how to load Matlab .mat files with scipy), I could not access deeper nested structures and recover them as dictionaries.
To present the problem in more detail, here is a toy example:
import scipy.io as spio
a = {'b':{'c':{'d': 3}}}
# my dictionary: a['b']['c']['d'] = 3
spio.savemat('xy.mat',a)
Now I want to read the mat-File back into python. I tried the following:
vig=spio.loadmat('xy.mat',squeeze_me=True)
If I now want to access the fields I get:
>> vig['b']
array(((array(3),),), dtype=[('c', '|O8')])
>> vig['b']['c']
array(array((3,), dtype=[('d', '|O8')]), dtype=object)
>> vig['b']['c']['d']
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/<ipython console> in <module>()
ValueError: field named d not found.
However, by using the option struct_as_record=False the field could be accessed:
v=spio.loadmat('xy.mat',squeeze_me=True,struct_as_record=False)
Now it was possible to access it by
>> v['b'].c.d
array(3)

Here are the functions that reconstruct the dictionaries; just use this loadmat instead of scipy.io's loadmat:
import scipy.io as spio
def loadmat(filename):
    '''
    this function should be called instead of direct spio.loadmat
    as it cures the problem of not properly recovering python dictionaries
    from mat files. It calls the function _check_keys to cure all entries
    which are still mat-objects
    '''
    data = spio.loadmat(filename, struct_as_record=False, squeeze_me=True)
    return _check_keys(data)

def _check_keys(dict):
    '''
    checks if entries in dictionary are mat-objects. If yes
    _todict is called to change them to nested dictionaries
    '''
    for key in dict:
        if isinstance(dict[key], spio.matlab.mio5_params.mat_struct):
            dict[key] = _todict(dict[key])
    return dict

def _todict(matobj):
    '''
    A recursive function which constructs from matobjects nested dictionaries
    '''
    dict = {}
    for strg in matobj._fieldnames:
        elem = matobj.__dict__[strg]
        if isinstance(elem, spio.matlab.mio5_params.mat_struct):
            dict[strg] = _todict(elem)
        else:
            dict[strg] = elem
    return dict
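For example, with the xy.mat file saved in the question, the wrapper could be used roughly like this (a minimal sketch):
data = loadmat('xy.mat')
print(data['b']['c']['d'])   # 3 -- the nested structs are now plain nested dicts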

Just an enhancement to mergen's answer, which unfortunately will stop recursing if it reaches a cell array of objects. The following version will make lists of them instead, and continue the recursion into the cell array elements if possible.
import scipy.io as spio
import numpy as np
def loadmat(filename):
    '''
    this function should be called instead of direct spio.loadmat
    as it cures the problem of not properly recovering python dictionaries
    from mat files. It calls the function _check_keys to cure all entries
    which are still mat-objects
    '''
    def _check_keys(d):
        '''
        checks if entries in dictionary are mat-objects. If yes
        _todict is called to change them to nested dictionaries
        '''
        for key in d:
            if isinstance(d[key], spio.matlab.mio5_params.mat_struct):
                d[key] = _todict(d[key])
        return d

    def _todict(matobj):
        '''
        A recursive function which constructs from matobjects nested dictionaries
        '''
        d = {}
        for strg in matobj._fieldnames:
            elem = matobj.__dict__[strg]
            if isinstance(elem, spio.matlab.mio5_params.mat_struct):
                d[strg] = _todict(elem)
            elif isinstance(elem, np.ndarray):
                d[strg] = _tolist(elem)
            else:
                d[strg] = elem
        return d

    def _tolist(ndarray):
        '''
        A recursive function which constructs lists from cellarrays
        (which are loaded as numpy ndarrays), recursing into the elements
        if they contain matobjects.
        '''
        elem_list = []
        for sub_elem in ndarray:
            if isinstance(sub_elem, spio.matlab.mio5_params.mat_struct):
                elem_list.append(_todict(sub_elem))
            elif isinstance(sub_elem, np.ndarray):
                elem_list.append(_tolist(sub_elem))
            else:
                elem_list.append(sub_elem)
        return elem_list

    data = spio.loadmat(filename, struct_as_record=False, squeeze_me=True)
    return _check_keys(data)

As of scipy >= 1.5.0 this functionality now comes built-in using the simplify_cells argument.
from scipy.io import loadmat
mat_dict = loadmat(file_name, simplify_cells=True)
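With the toy xy.mat from the question, this should give direct nested-dictionary access (a minimal sketch):
mat_dict = loadmat('xy.mat', simplify_cells=True)
print(mat_dict['b']['c']['d'])   # 3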

I was advised on the scipy mailing list (https://mail.python.org/pipermail/scipy-user/) that there are two more ways to access this data.
This works:
import scipy.io as spio
vig=spio.loadmat('xy.mat')
print vig['b'][0, 0]['c'][0, 0]['d'][0, 0]
Output on my machine:
3
The reason for this kind of access: "For historic reasons, in Matlab everything is at least a 2D array, even scalars."
So scipy.io.loadmat mimics Matlab behavior per default.

Found a solution: the content of a scipy.io.matlab.mio5_params.mat_struct object can be investigated via:
v['b'].__dict__['c'].__dict__['d']

Another method that works:
import scipy.io as spio
vig=spio.loadmat('xy.mat',squeeze_me=True)
print vig['b']['c'].item()['d']
Output:
3
I learned this method on the scipy mailing list, too. I certainly don't understand (yet) why '.item()' has to be added in, and:
print vig['b']['c']['d']
will throw an error instead:
IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
but I'll be back to supplement the explanation when I know it. Explanation of numpy.ndarray.item (from the numpy reference):
Copy an element of an array to a standard Python scalar and return it.
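As a rough illustration of why .item() is needed here (a sketch based on the outputs shown in the question): with the default struct_as_record=True, each nested struct comes back as a record array whose fields are 0-dimensional object arrays, and .item() unwraps such a 0-d array into the record array it contains, which can then be indexed by field name again.
import scipy.io as spio

vig = spio.loadmat('xy.mat', squeeze_me=True)
c = vig['b']['c']        # a 0-d object array wrapping the inner record array
inner = c.item()         # unwrap the 0-d array to the record array with field 'd'
print(inner['d'])        # 3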
(Please notice that this answer is basically the same as the comment of hpaulj to the initial question, but I felt that the comment is not 'visible' or understandable enough. I certainly did not notice it when I searched for a solution for the first time, some weeks ago).


Difference between a numpy.array and numpy.array[:]

Me again... :)
I tried finding an answer to this question but again I was not fortunate enough. So here it is.
What is the difference between referring to a numpy array (let's say "iris") and referring to the whole group of data in this array (by using iris[:], for instance)?
I'm asking this because of the error that I get when I run the first example (below), while the second example works fine.
Here is the code:
At this first part I load the library and import the dataset from the internet.
import statsmodels.api as sm
iris = sm.datasets.get_rdataset(dataname='iris',
                                package='datasets')['data']
If I run this code I get an error:
iris.columns.values = [iris.columns.values[x].lower() for x in range( len( iris.columns.values ) ) ]
print(iris.columns.values)
Now if I run this code it works fine:
iris.columns.values[:] = [iris.columns.values[x].lower() for x in range( len( iris.columns.values ) ) ]
print(iris.columns.values)
Best regards,
The difference is that when you do iris.columns.values = ... you try to rebind the values property of iris.columns, which is read-only (see the pandas implementation in pandas.core.frame.DataFrame), whereas when you do iris.columns.values[:] = ... you access the data of the underlying np.ndarray and overwrite it with the new values. The second assignment statement does not overwrite the reference to the numpy object: the [:] is a slice object that is passed to the __setitem__ method of the numpy array.
EDIT:
The exact implementation (there are multiple, here is the pd.Series implementation) of such property is:
@property
def values(self):
    """ return the array """
    return self.block.values
thus you try to overwrite a property that is constructed with the decorator @property followed by a getter function, and it cannot be replaced since it only provides a getter and not a setter. See Python's docs on builtins - property()
iris.columns.values = val
calls
type(iris.columns).__setattr__(iris.columns, 'values', val)
This is running pandas' code, because type(iris.columns) is a pandas Index
iris.columns.values[:] = val
calls
type(iris.columns.values).__setitem__(iris.columns.values, slice(None), val)
This is running numpy's code, because type(iris.columns.values) is np.ndarray
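The same distinction can be demonstrated with a plain numpy array, independent of pandas (a small illustrative sketch):
import numpy as np

a = np.array([1, 2, 3])
view = a                      # another name bound to the same array object
view = np.array([9, 9, 9])    # rebinds the name 'view'; a is unchanged
print(a)                      # [1 2 3]

view = a
view[:] = [9, 9, 9]           # calls ndarray.__setitem__ and overwrites a's data in place
print(a)                      # [9 9 9]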

Convert python objects to python AST-nodes

I need to dump a modified python object back into source code, so I am trying to find something that converts a real python object into a python ast.Node (to use later with the astor lib to dump source).
Example of usage I want, Python 2:
import ast
import importlib
import astor
m = importlib.import_module('something')
# modify an object
m.VAR.append(123)
ast_nodes = some_magic(m)
source = astor.dump(ast_nodes)
Please help me to find that some_magic
There's no way to do what you want, because that's not how ASTs work.
When the interpreter runs your code, it will generate an AST out of the source files, and interpret that AST to generate python objects.
What happens to those objects once they've been generated has nothing to do with the AST.
It is however possible to get the AST of what generated the object in the first place.
The module inspect lets you get the source code of some python objects:
import ast
import importlib
import inspect
m = importlib.import_module('pprint')
s = inspect.getsource(m)
a = ast.parse(s)
print(ast.dump(a))
# Prints the AST of the pprint module
But getsource() is aptly named.
If I were to change the value of some variable (or any other object) in m, it wouldn't change its source code.
Even if it was possible to regenerate an AST out of an object, there wouldn't be a single solution some_magic() could return.
Imagine I have a variable x in some module, that I reassign in another module:
# In some_module.py
x = 0
# In __main__.py
m = importlib.import_module('some_module')
m.x = 1 + 227
Now, the value of m.x is 228, but there's no way to know what kind of expression led to that value (well, without reading the AST of __main__.py but this would quickly get out of hand). Was it a mere literal? The result of a function call?
If you really have to get a new AST after modifying some value of a module, the best solution would be to transform the original AST by yourself.
You can find where your identifier got its value, and replace the value of the assignment with whatever you want.
For instance, in my small example x = 0 is represented by the following AST:
Assign(targets=[Name(id='x', ctx=Store())], value=Num(n=0))
And to get the AST matching the reassignment I did in __main__.py, I would have to change the value of the above Assign node to the following:
value=BinOp(left=Num(n=1), op=Add(), right=Num(n=227))
If you'd like to go that way, I recommend you check Python's documentation of the AST node transformer (ast.NodeTransformer), as well as this excellent manual that documents all the nodes you can meet in Python ASTs Green Tree Snakes - the missing Python AST docs.
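As a rough sketch of that transformer approach (illustrative only; the source string and the variable name x are assumptions, and this is not the some_magic the question asks for):
import ast

source = "x = 0"

class ReplaceAssign(ast.NodeTransformer):
    def visit_Assign(self, node):
        # rewrite 'x = 0' into 'x = 1 + 227'
        if any(isinstance(t, ast.Name) and t.id == 'x' for t in node.targets):
            node.value = ast.BinOp(left=ast.Constant(1), op=ast.Add(),
                                   right=ast.Constant(227))
        return node

tree = ReplaceAssign().visit(ast.parse(source))
ast.fix_missing_locations(tree)
print(ast.unparse(tree))  # x = 1 + 227  (ast.unparse needs Python >= 3.9)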
What Vladimir is asking about is certainly useful for compiler optimizations. Indeed, there are ways to accomplish that using the ast library. Here is a simple example demonstrating evaluation of constant functions:
from ast import *
import numpy as np
PURE_FUNS = {'arange' : np.arange}
PROG = '''
A=arange(5)
B=[0, 1, 2, 3, 4]
A[2:3] = 1
C = [A[1], 2, m]
'''
def py_to_ast(o):
    if type(o) == np.ndarray:
        return List(elts=[py_to_ast(e) for e in o], ctx=Load())
    elif type(o) == np.int64:
        return Constant(value=o)
    # Add elifs for more types here
    else:
        assert False
class EvalPureFuns(NodeTransformer):
    def visit_Call(self, node):
        is_const_args = all(type(a) == Constant for a in node.args)
        if node.func.id in PURE_FUNS and is_const_args:
            res = eval(unparse(node), PURE_FUNS)
            return py_to_ast(res)
        return node
node = parse(PROG)
node = EvalPureFuns().visit(node)
print(unparse(node))

Python 2.6 numpy interaction array objects error

I have a multi-dimensional array of objects. I want to iterate over the objects using the nditer iterator.
Here is a code example:
import numpy as np
class Test:
    def __init__(self, a):
        self.a = a
    def get_a(self):
        return self.a

b = np.empty((2,3), dtype=object)
t_00 = Test(0)
t_01 = Test(1)
t_11 = Test(11)
b[0,0] = t_00
b[0,1] = t_01
b[1,1] = t_11

for item in np.nditer(b, flags=["refs_ok"]):
    if item:
        print item.get_a()
I would expect the "item" to contain the object reference that I can use to access data.
However I am getting the following error: AttributeError: 'numpy.ndarray' object has no attribute 'get_a'
My question is: how can I go through the array to access the objects in it?
The array.flat method of iteration will work, and I can confirm that it behaves as you'd expect:
for item in b.flat:
    if item:
        print item.get_a()
Iterating over an array with nditer gives you views of the original array's cells as 0-dimensional arrays. For non-object arrays, this is almost equivalent to producing scalars, since 0-dimensional arrays usually behave like scalars, but that doesn't work for object arrays.
If you were determined to go through nditer for this, you could extract the elements from the 0-dimensional views with the item() method:
for element in np.nditer(b,flags = ["refs_ok"]):
    element = element.item()
    if element:
        print(element.get_a())

Hashing a dictionary?

For caching purposes I need to generate a cache key from GET arguments which are present in a dict.
Currently I'm using sha1(repr(sorted(my_dict.items()))) (sha1() is a convenience method that uses hashlib internally) but I'm curious if there's a better way.
Using sorted(d.items()) isn't enough to get us a stable repr. Some of the values in d could be dictionaries too, and their keys will still come out in an arbitrary order. As long as all the keys are strings, I prefer to use:
json.dumps(d, sort_keys=True)
That said, if the hashes need to be stable across different machines or Python versions, I'm not certain that this is bulletproof. You might want to add the separators and ensure_ascii arguments to protect yourself from any changes to the defaults there. I'd appreciate comments.
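For example, such a stable hash could look like the following sketch (dict_hash is a made-up helper name; separators and ensure_ascii are pinned as suggested above):
import hashlib
import json

def dict_hash(d):
    # sort_keys gives a canonical key order at every nesting level;
    # pinning separators and ensure_ascii guards against changes to json.dumps defaults
    payload = json.dumps(d, sort_keys=True, separators=(',', ':'), ensure_ascii=True)
    return hashlib.sha1(payload.encode('utf-8')).hexdigest()

print(dict_hash({'b': 2, 'a': {'y': 1, 'x': 0}}) ==
      dict_hash({'a': {'x': 0, 'y': 1}, 'b': 2}))   # True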
If your dictionary is not nested, you could make a frozenset with the dict's items and use hash():
hash(frozenset(my_dict.items()))
This is much less computationally intensive than generating the JSON string or representation of the dictionary.
UPDATE: Please see the comments below on why this approach might not produce a stable result.
EDIT: If all your keys are strings, then before continuing to read this answer, please see Jack O'Connor's significantly simpler (and faster) solution (which also works for hashing nested dictionaries).
Although an answer has been accepted, the title of the question is "Hashing a python dictionary", and the answer is incomplete as regards that title. (As regards the body of the question, the answer is complete.)
Nested Dictionaries
If one searches Stack Overflow for how to hash a dictionary, one might stumble upon this aptly titled question, and leave unsatisfied if one is attempting to hash multiply nested dictionaries. The answer above won't work in this case, and you'll have to implement some sort of recursive mechanism to retrieve the hash.
Here is one such mechanism:
import copy
def make_hash(o):
"""
Makes a hash from a dictionary, list, tuple or set to any level, that contains
only other hashable types (including any lists, tuples, sets, and
dictionaries).
"""
if isinstance(o, (set, tuple, list)):
return tuple([make_hash(e) for e in o])
elif not isinstance(o, dict):
return hash(o)
new_o = copy.deepcopy(o)
for k, v in new_o.items():
new_o[k] = make_hash(v)
return hash(tuple(frozenset(sorted(new_o.items()))))
Bonus: Hashing Objects and Classes
The hash() function works great when you hash classes or instances. However, here is one issue I found with hash, as regards objects:
class Foo(object): pass
foo = Foo()
print (hash(foo)) # 1209812346789
foo.a = 1
print (hash(foo)) # 1209812346789
The hash is the same, even after I've altered foo. This is because the identity of foo hasn't changed, so the hash is the same. If you want foo to hash differently depending on its current definition, the solution is to hash off whatever is actually changing. In this case, the __dict__ attribute:
class Foo(object): pass
foo = Foo()
print (make_hash(foo.__dict__)) # 1209812346789
foo.a = 1
print (make_hash(foo.__dict__)) # -78956430974785
Alas, when you attempt to do the same thing with the class itself:
print (make_hash(Foo.__dict__)) # TypeError: unhashable type: 'dict_proxy'
The class __dict__ property is not a normal dictionary:
print (type(Foo.__dict__)) # type <'dict_proxy'>
Here is a mechanism similar to the previous one that will handle classes appropriately:
import copy
DictProxyType = type(object.__dict__)
def make_hash(o):
"""
Makes a hash from a dictionary, list, tuple or set to any level, that
contains only other hashable types (including any lists, tuples, sets, and
dictionaries). In the case where other kinds of objects (like classes) need
to be hashed, pass in a collection of object attributes that are pertinent.
For example, a class can be hashed in this fashion:
make_hash([cls.__dict__, cls.__name__])
A function can be hashed like so:
make_hash([fn.__dict__, fn.__code__])
"""
if type(o) == DictProxyType:
o2 = {}
for k, v in o.items():
if not k.startswith("__"):
o2[k] = v
o = o2
if isinstance(o, (set, tuple, list)):
return tuple([make_hash(e) for e in o])
elif not isinstance(o, dict):
return hash(o)
new_o = copy.deepcopy(o)
for k, v in new_o.items():
new_o[k] = make_hash(v)
return hash(tuple(frozenset(sorted(new_o.items()))))
You can use this to return a hash tuple of however many elements you'd like:
# -7666086133114527897
print (make_hash(func.__code__))
# (-7666086133114527897, 3527539)
print (make_hash([func.__code__, func.__dict__]))
# (-7666086133114527897, 3527539, -509551383349783210)
print (make_hash([func.__code__, func.__dict__, func.__name__]))
NOTE: all of the above code assumes Python 3.x. I did not test it in earlier versions, although I assume make_hash() will work in, say, 2.7.2. As far as making the examples work, I do know that
func.__code__
should be replaced with
func.func_code
The code below avoids using the Python hash() function because it will not provide hashes that are consistent across restarts of Python (see hash function in Python 3.3 returns different results between sessions). make_hashable() will convert the object into nested tuples and make_hash_sha256() will also convert the repr() to a base64 encoded SHA256 hash.
import hashlib
import base64
def make_hash_sha256(o):
    hasher = hashlib.sha256()
    hasher.update(repr(make_hashable(o)).encode())
    return base64.b64encode(hasher.digest()).decode()

def make_hashable(o):
    if isinstance(o, (tuple, list)):
        return tuple((make_hashable(e) for e in o))
    if isinstance(o, dict):
        return tuple(sorted((k, make_hashable(v)) for k, v in o.items()))
    if isinstance(o, (set, frozenset)):
        return tuple(sorted(make_hashable(e) for e in o))
    return o
o = dict(x=1,b=2,c=[3,4,5],d={6,7})
print(make_hashable(o))
# (('b', 2), ('c', (3, 4, 5)), ('d', (6, 7)), ('x', 1))
print(make_hash_sha256(o))
# fyt/gK6D24H9Ugexw+g3lbqnKZ0JAcgtNW+rXIDeU2Y=
Here is a clearer solution.
def freeze(o):
    if isinstance(o, dict):
        return frozenset({k: freeze(v) for k, v in o.items()}.items())
    if isinstance(o, list):
        return tuple([freeze(v) for v in o])
    return o

def make_hash(o):
    """
    makes a hash out of anything that contains only list, dict and hashable types including string and numeric types
    """
    return hash(freeze(o))
MD5 HASH
The method which gave me the most stable results was using md5 hashes and json.dumps:
from typing import Dict, Any
import hashlib
import json
def dict_hash(dictionary: Dict[str, Any]) -> str:
"""MD5 hash of a dictionary."""
dhash = hashlib.md5()
# We need to sort arguments so {'a': 1, 'b': 2} is
# the same as {'b': 2, 'a': 1}
encoded = json.dumps(dictionary, sort_keys=True).encode()
dhash.update(encoded)
return dhash.hexdigest()
While hash(frozenset(x.items())) and hash(tuple(sorted(x.items()))) work, that's doing a lot of work allocating and copying all the key-value pairs. A hash function really should avoid a lot of memory allocation.
A little bit of math can help here. The problem with most hash functions is that they assume that order matters. To hash an unordered structure, you need a commutative operation. Multiplication doesn't work well as any element hashing to 0 means the whole product is 0. Bitwise & and | tend towards all 0's or 1's. There are two good candidates: addition and xor.
from functools import reduce
from operator import xor
class hashable(dict):
    def __hash__(self):
        return reduce(xor, map(hash, self.items()), 0)

    # Alternative
    def __hash__(self):
        return sum(map(hash, self.items()))
One point: xor works, in part, because dict guarantees keys are unique. And sum works because Python truncates the result of __hash__ to the machine word size.
If you want to hash a multiset, sum is preferable. With xor, {a} would hash to the same value as {a, a, a} because x ^ x ^ x = x.
If you really need the guarantees that SHA makes, this won't work for you. But to use a dictionary in a set, this will work fine; Python containers are resilient to some collisions, and the underlying hash functions are pretty good.
Updated from 2013 reply...
None of the above answers seem reliable to me. The reason is the use of items(). As far as I know, this comes out in a machine-dependent order.
How about this instead?
import hashlib
def dict_hash(the_dict, *ignore):
    if ignore:  # Sometimes you don't care about some items
        interesting = the_dict.copy()
        for item in ignore:
            if item in interesting:
                interesting.pop(item)
        the_dict = interesting
    result = hashlib.sha1(
        ('%s' % sorted(the_dict.items())).encode('utf-8')  # encode for Python 3
    ).hexdigest()
    return result
Use DeepHash from DeepDiff Module
from deepdiff import DeepHash
obj = {'a': '1', 'b': '2'}
hashes = DeepHash(obj)[obj]
To preserve key order, instead of hash(str(dictionary)) or hash(json.dumps(dictionary)), I would prefer a quick-and-dirty solution:
from pprint import pformat
h = hash(pformat(dictionary))
It will work even for types like DateTime and more that are not JSON serializable.
You can use the maps library to do this. Specifically, maps.FrozenMap
import maps
fm = maps.FrozenMap(my_dict)
hash(fm)
To install maps, just do:
pip install maps
It handles the nested dict case too:
import maps
fm = maps.FrozenMap.recurse(my_dict)
hash(fm)
Disclaimer: I am the author of the maps library.
You could use the third-party frozendict module to freeze your dict and make it hashable.
from frozendict import frozendict
my_dict = frozendict(my_dict)
For handling nested objects, you could go with:
import collections.abc
def make_hashable(x):
    if isinstance(x, collections.abc.Hashable):
        return x
    elif isinstance(x, collections.abc.Sequence):
        return tuple(make_hashable(xi) for xi in x)
    elif isinstance(x, collections.abc.Set):
        return frozenset(make_hashable(xi) for xi in x)
    elif isinstance(x, collections.abc.Mapping):
        return frozendict({k: make_hashable(v) for k, v in x.items()})
    else:
        raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))
If you want to support more types, use functools.singledispatch (Python 3.7):
import functools

@functools.singledispatch
def make_hashable(x):
    raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))

@make_hashable.register
def _(x: collections.abc.Hashable):
    return x

@make_hashable.register
def _(x: collections.abc.Sequence):
    return tuple(make_hashable(xi) for xi in x)

@make_hashable.register
def _(x: collections.abc.Set):
    return frozenset(make_hashable(xi) for xi in x)

@make_hashable.register
def _(x: collections.abc.Mapping):
    return frozendict({k: make_hashable(v) for k, v in x.items()})

# add your own types here
One way to approach the problem is to make a tuple of the dictionary's items:
hash(tuple(my_dict.items()))
This is not a general solution (i.e. only trivially works if your dict is not nested), but since nobody here suggested it, I thought it might be useful to share it.
One can use a (third-party) immutables package and create an immutable 'snapshot' of a dict like this:
from immutables import Map
map = dict(a=1, b=2)
immap = Map(map)
hash(immap)
This seems to be faster than, say, stringification of the original dict.
I learned about this from this nice article.
For nested structures with string keys at the top-level dict, you can use pickle (protocol=5) and hash the resulting bytes object. If you need safety, you can use a safe serializer.
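A minimal sketch of that idea (the helper name is made up, and deterministic pickling of your particular data is an assumption, not a guarantee):
import hashlib
import pickle

def dict_hash(d):
    # pickle the structure and hash the resulting bytes
    return hashlib.sha256(pickle.dumps(d, protocol=5)).hexdigest()

print(dict_hash({'a': 1, 'b': [2, 3]}))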
I do it like this:
hash(str(my_dict))

How to modify a NumPy.recarray using its two views

I am new to Python and Numpy, and I am facing a problem: I cannot modify a numpy.recarray when applying two masked views. I read the recarray from a file, then create two masked views, then try to modify the values in a for loop. Here is an example code.
import numpy as np
import matplotlib.mlab as mlab
dat = mlab.csv2rec(args[0], delimiter=' ')
m_Obsr = dat.is_observed == 1
m_ZeroScale = dat[m_Obsr].scale_mean < 0.01
for d in dat[m_Obsr][m_ZeroScale]:
    d.scale_mean = 1.0
But when I print the result
newFile = args[0] + ".no-zero-scale"
mlab.rec2csv(dat[m_Obsr][m_ZeroScale], newFile, delimiter=' ')
All the scale_means in the file are still zero.
I must be doing something wrong. Is there a proper way of modifying values of the view? Is it because I am applying two views one after the other?
Thank you.
I think you have a misconception in this term "masked views" and should (re-)read The Book (now freely downloadable) to clarify your understanding.
I quote from section 3.4.2:
Advanced selection is triggered when the selection object, obj, is a non-tuple sequence object, an ndarray (of data type integer or bool), or a tuple with at least one sequence object or ndarray (of data type integer or bool). There are two types of advanced indexing: integer and Boolean. Advanced selection always returns a copy of the data (contrast with basic slicing that returns a view).
What you're doing here is advanced selection (of the Boolean kind) so you're getting a copy and never binding it anywhere -- you make your changes on the copy and then just let it go away, then write a new fresh copy from the original.
Once you understand the issue the solution should be simple: make your copy once, make your changes on that copy, and write that same copy. I.e.:
dat = mlab.csv2rec(args[0], delimiter=' ')
m_Obsr = dat.is_observed == 1
m_ZeroScale = dat[m_Obsr].scale_mean < 0.01
the_copy = dat[m_Obsr][m_ZeroScale]
for d in the_copy:
    d.scale_mean = 1.0
newFile = args[0] + ".no-zero-scale"
mlab.rec2csv(the_copy, newFile, delimiter=' ')
