Saving a generic python object externally - python

I moved over from matlab to python about a year ago and am still getting used to the differences. In matlab you could save a structure (the slightly less nice equivalent of an object from a generic class) to an external file. As a result - if you were working with something that generated structures that you didn't understand, you could save them to a file and then compare them.
This file would look something like
structure 1
property1: value or values of property 1
property2: value or values of property 2
property3: value or values of property 3
structure 2
property1: value or values of property 1
property2: value or values of property 2
property3: value or values of property 3
structure 3
property1: value or values of property 1
property2: value or values of property 2
property3: value or values of property 3
and so on.
Is ther a python way of doing this? Right now i just care about printing them externally - so i can read and compare them. Matlab lets you read them back in, and i don't need that kind of functionality in this case.

Is ther a python way of doing this? Right now i just care about printing them externally - so i can read and compare them. Matlab lets you read them back in, and i don't need that kind of functionality in this case.
There is no really generic and readable serialisation scheme.
pickle works with everything but it's bad, it's unreadable (in fact it's very much specified as not human-readable, and even version 0 is at best to be qualified as "textual") and it's very unsafe.
json can be pretty readable, but it won't work out of the box with most types (though it will work with simple structures of builtin types e.g. if you're just creating lists and dicts it's fine).
You could use data classes or attrs, they make structures printable by default. They use the standard "repr" of
TypeName(field1=value1, field2=value2, ...)
which is somewhat less conducive to easy comparisons via most standard textual diffs, but both provide an asdict function which converts instances to dicts, which you can then easily serialise as JSON (with linebreaks and indentation) or some other generic scheme.

Related

How to see the content of a Gensim-generated dictionary?

I am running topic modeling using Gensim. Before creating the document-term matrix, one needs to create a dictionary of tokens.
dictionary = corpora.Dictionary(tokenized_reviews)
doc_term_matrix = [dictionary.doc2bow(rev) for rev in tokenized_reviews]
But, I don't understand what kind of object "dictionary" is.
So, when I type:
type(dictionary)
I get
gensim.corpora.dictionary.Dictionary
Is this a dictionary ( a kind of data structure)? If so, why can't I see the content (I am just curious)?
When I type
dictionary
I get:
<gensim.corpora.dictionary.Dictionary at 0x1bac985ebe0>
The same issue exists with some of the objects in NLTK.
If this is a dictionary (as a data structure), why I am not able to see the keys and values like any other Python dictionary?
Thanks,
Navid
This is a specific Dictionary class implemented by the Gensim project.
It will be very similar in interface to the standard Python dict (and other various Dictionary/HashMap/etc types you may have used elsewhere).
However, to see exactly what it can do, you should consult the class-specific documentation:
https://radimrehurek.com/gensim/corpora/dictionary.html
Like a dict, you can do typical operations:
len(dictionary) # gets number of entries
dictionary[key] # gets the value at a certain key (word)
dictionary.keys() # gets all stored keys
The reason you see a generic <gensim.corpora.dictionary.Dictionary at 0x1bac985ebe0> when you try to display the value of the dictionary itself is that it hasn't defined any convenience display-string with more info - so you're seeing the default for any random Python object. (Such dictionaries are usually far too large to usefull dump their full contents whenever asked, generically, to "show yourself".

Two identical dictionaries differ (by using diff) after being pickled

I have a dictionary whose keys are tuples like (int, str, int, str, int), and the corresponding values are lists of floats of the same size.
I pickled the dictionary twice by the same script:
import pickle
with open(name, 'wb') as source:
pickle.dump(the_dict, source)
For the two resulting binary files test_1 and test_2, I run
diff test_1 test_2
in a terminal (I'm using macOS) to see whether I can use diff to tell the difference. However, I received
Binary files test_1 and test_2 differ
Why? Was the same dictionary being pickled in different ways? Does it mean I cannot use diff to tell whether two dictionaries are identical?
Depending on what version of Python you are using, Python versions before v3.6 do not remember the order of insertion. Python v3.6 made this an implementation detail and v3.7 made it a language feature.
For backwards compatibility, you shouldn't depend on the dictionary remembering the order of inserted keys. Instead, you can use OrderedDict from the Collections module.
Also, using diff on pickled dict data may show differences in the data even though the actual dictionaries are equivalent -- since dicts, unlike lists, generally make no assurances on order state (see above for when that is not the case).

Simplest way to find and link repeating strings in an XML with Python

I have to parse an XML file with a large number of string values. For example:
<value>Foo</value>
<value>Bar</value>
<value>Baz</value>
<value>Foo</value>
Some of them are equal. There are multiple recurring strings, not just one as in the example above. Hence I would like to detect such values, and link them with XLink: to create a reference at one of the instances of a recurring string (doesn't have to be at the first one), and to link the rest (I can use UUIDs), like here:
<value id="D5494447-A010-4F81-9DDA-E5DFFBD616FF">Foo</value>
<value>Bar</value>
<value>Baz</value>
<value href="#D5494447-A010-4F81-9DDA-E5DFFBD616FF"/>
I am starting with XLinks so perhaps the above doesn't make sense. If that is not possible, another possibility is that I can create a dictionary containing such values:
{'D5494447-A010-4F81-9DDA-E5DFFBD616FF' : 'Foo'}
And then somehow put them in the XML. What is the simplest way to achieve these? I don't care much about the most efficient way as long as the method is correct and simple to implement, since I am a Python beginner and not a computer scientist, and computational complexity is not an issue. Parsing and writing XMLs is not an issue (I figured it out with lxml), so the question here is only about the detection of recurring strings and their linking.
One way is to maintain a dict (a mapping from arbitrary keys to values) of all strings that you have seen before. So, let's assume you are at the point where you have the value in a variable val, and that there is a dict valdict that's initially empty. The code you would need is something like this:
import uuid
if val in valdict: # We have seen this reference before
print '<value href="#%s"/>' % valdict[val]
else: # We need to add this reference
valdict[val] = str(uuid.uuid4()).upper()
print '<value id="%s">%s</value>' % (valdict[val], val)
I wouldn't really recommend this simplistic method for forming the XML iself, but it sounds like you are already well-prepared to handle that side of things.

How to create a memoryview for a non-contiguous memory location?

I have a fragmented structure in memory and I'd like to access it as a contiguous-looking memoryview. Is there an easy way to do this or should I implement my own solution?
For example, consider a file format that consists of records. Each record has a fixed length header, that specifies the length of the content of the record. A higher level logical structure may spread over several records. It would make implementing the higher level structure easier if it could see it's own fragmented memory location as a simple contiguous array of bytes.
Update:
It seems that python supports this 'segmented' buffer type internally, at least based on this part of the documentation. But this is only the C API.
Update2:
As far as I see, the referenced C API - called old-style buffers - does what I need, but it's now deprecated and unavailable in newer version of Python (3.X). The new buffer protocol - specified in PEP 3118 - offers a new way to represent buffers. This API is more usable in most of the use cases (among them, use cases where the represented buffer is not contiguous in memory), but does not support this specific one, where a one dimensional array may be laid out completely freely (multiple differently sized chunks) in memory.
First - I am assuming you are just trying to do this in pure python rather than in a c extension. So I am assuming you have loaded in the different records you are interested in into a set of python objects and your problem is that you want to see the higher level structure that is spread across these objects with bits here and there throughout the objects.
So can you not simply load each of the records into a byte arrays type? You can then use python slicing of arrays to create a new array that has just the data for the high level structure you are interested in. You will then have a single byte array with just the data you are interested in and can print it out or manipulate it in any way that you want to.
So something like:
a = bytearray(b"Hello World") # put your records into byte arrays like this
b = bytearray(b"Stack Overflow")
complexStructure = bytearray(a[0:6]+b[0:]) # Slice and join arrays to form
# new array with just data from your
# high level entity
print complexStructure
Of course you will still ned to know where within the records your high level structure is to slice the arrays correctly but you would need to know this anyway.
EDIT:
Note taking a slice of a list does not copy the data in the list it just creates a new set of references to the data so:
>>> a = [1,2,3]
>>> b = a[1:3]
>>> id(a[1])
140268972083088
>>> id(b[0])
140268972083088
However changes to the list b will not change a as b is a new list. To have the changes automatically change in the original list you would need to make a more complicated object that contained the lists to the original records and hid them in such a way as to be able to decide which list and which element of a list to change or view when a user look to modify/view the complex structure. So something like:
class ComplexStructure():
def add_records(self,record):
self.listofrecords.append(record)
def get_value(self,position):
listnum,posinlist = ... # formula to figure out which list and where in
# list element of complex structure is
return self.listofrecords[listnum][record]
def set_value(self,position,value):
listnum,posinlist = ... # formula to figure out which list and where in
# list element of complex structure is
self.listofrecords[listnum][record] = value
Granted this is not the simple way of doing things you were hoping for but it should do what you need.

strange output from yaml.dump

I've just started using yaml and I love it. However, the other day I came across a case that seemed really odd and I am not sure what is causing it. I have a list of file path locations and another list of file path destinations. I create a dictionary out of them and then use yaml to dump it out to read later (I work with artists and use yaml so that it is human readable as well).
sorry for the long lists:
source = ['/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_diff.exr', '/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_maskTapeFloor.1051.exr', '/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_maskBurnt.1031.exr']
dest = ['/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_diff_diffuse_v0006.exr', '/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_maskTapeFloor_diffuse_v0006.1051.exr', '/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_maskBurnt_diffuse_v0006.1031.exr']
dictionary = dict(zip(source, dest))
print yaml.dump(dictionary)
this is the output that I get:
{/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_diff.exr: /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhaw
k_diff_diffuse_v0006.exr,
/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_maskBurnt.1031.exr: /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v00
06/blackhawk_maskBurnt_diffuse_v0006.1031.exr,
? /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_maskTapeFloor.1051.exr
: /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_maskTapeFloor_diffuse_v0006.1051.exr}
It comes back in fine with a yaml.load, but this is not useful for artists to be able to edit if need be.
This is the first question in the FAQ.
By default, PyYAML chooses the style of a collection depending on whether it has nested collections. If a collection has nested collections, it will be assigned the block style. Otherwise it will have the flow style.
If you want collections to be always serialized in the block style, set the parameter default_flow_style of dump() to False.
So:
>>> print yaml.dump(dictionary, default_flow_style=False)
/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_diff.exr: /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_diff_diffuse_v0006.exr
/data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_maskBurnt.1031.exr: /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_maskBurnt_diffuse_v0006.1031.exr
? /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/model/v026_03/blackhawk_maskTapeFloor.1051.exr
: /data/job/maze/build/vehicle/blackhawk/blackhawkHelicopter/work/data/map/tasks/texture/v0006/blackhawk_maskTapeFloor_diffuse_v0006.1051.exr
Still not exactly beautiful, but when you have strings longer than 80 characters as keys, it's about as good as you can reasonably expect.
If you model (part of) the filesystem hierarchy in your object hierarchy, or create aliases (or dynamic aliasers) for parts of the tree, etc., the YAML will look a lot nicer. But that's something you have to actually do at the object-model level; as far as YAML is concerned, those long paths full of repeated prefixes are just strings.

Categories

Resources