easy save/load of data in python - python

What is the easiest way to save and load data in python, preferably in a human-readable output format?
The data I am saving/loading consists of two vectors of floats. Ideally, these vectors would be named in the file (e.g. X and Y).
My current save() and load() functions use file.readline(), file.write() and string-to-float conversion. There must be something better.

The most simple way to get a human-readable output is by using a serialisation format such a JSON. Python contains a json library you can use to serialise data to and from a string. Like pickle, you can use this with an IO object to write it to a file.
import json
file = open('/usr/data/application/json-dump.json', 'w+')
data = { "x": 12153535.232321, "y": 35234531.232322 }
json.dump(data, file)
If you want to get a simple string back instead of dumping it to a file, you can use json.dumps() instead:
import json
print json.dumps({ "x": 12153535.232321, "y": 35234531.232322 })
Reading back from a file is just as easy:
import json
file = open('/usr/data/application/json-dump.json', 'r')
print json.load(file)
The json library is full-featured, so I'd recommend checking out the documentation to see what sorts of things you can do with it.

There are several options -- I don't exactly know what you like. If the two vectors have the same length, you could use numpy.savetxt() to save your vectors, say x and y, as columns:
# saving:
f = open("data", "w")
f.write("# x y\n") # column names
numpy.savetxt(f, numpy.array([x, y]).T)
# loading:
x, y = numpy.loadtxt("data", unpack=True)
If you are dealing with larger vectors of floats, you should probably use NumPy anyway.

If it should be human-readable, I'd
also go with JSON. Unless you need to
exchange it with enterprise-type
people, they like XML better. :-)
If it should be human editable and
isn't too complex, I'd probably go
with some sort of INI-like format,
like for example configparser.
If it is complex, and doesn't need to
be exchanged, I'd go with just
pickling the data, unless it's very
complex, in which case I'd use ZODB.
If it's a LOT of data, and needs to
be exchanged, I'd use SQL.
That pretty much covers it, I think.

A simple serialization format that is easy for both humans to computers read is JSON.
You can use the json Python module.

Here is an example of Encoder until you probably want to write for Body class:
# add this to your code
class BodyEncoder(json.JSONEncoder):
def default(self, obj):
if isinstance(obj, np.ndarray):
return obj.tolist()
if hasattr(obj, '__jsonencode__'):
return obj.__jsonencode__()
if isinstance(obj, set):
return list(obj)
return obj.__dict__
# Here you construct your way to dump your data for each instance
# you need to customize this function
def deserialize(data):
bodies = [Body(d["name"],d["mass"],np.array(d["p"]),np.array(d["v"])) for d in data["bodies"]]
axis_range = data["axis_range"]
timescale = data["timescale"]
return bodies, axis_range, timescale
# Here you construct your way to load your data for each instance
# you need to customize this function
def serialize(data):
file = open(FILE_NAME, 'w+')
json.dump(data, file, cls=BodyEncoder, indent=4)
print("Dumping Parameters of the Latest Run")
print(json.dumps(data, cls=BodyEncoder, indent=4))
Here is an example of the class I want to serialize:
class Body(object):
# you do not need to change your class structure
def __init__(self, name, mass, p, v=(0.0, 0.0, 0.0)):
# init variables like normal
self.name = name
self.mass = mass
self.p = p
self.v = v
self.f = np.array([0.0, 0.0, 0.0])
def attraction(self, other):
# not important functions that I wrote...
Here is how to serialize:
# you need to customize this function
def serialize_everything():
bodies, axis_range, timescale = generate_data_to_serialize()
data = {"bodies": bodies, "axis_range": axis_range, "timescale": timescale}
BodyEncoder.serialize(data)
Here is how to dump:
def dump_everything():
data = json.loads(open(FILE_NAME, "r").read())
return BodyEncoder.deserialize(data)

Since we're talking about a human editing the file, I assume we're talking about relatively little data.
How about the following skeleton implementation. It simply saves the data as key=value pairs and works with lists, tuples and many other things.
def save(fname, **kwargs):
f = open(fname, "wt")
for k, v in kwargs.items():
print >>f, "%s=%s" % (k, repr(v))
f.close()
def load(fname):
ret = {}
for line in open(fname, "rt"):
k, v = line.strip().split("=", 1)
ret[k] = eval(v)
return ret
x = [1, 2, 3]
y = [2.0, 1e15, -10.3]
save("data.txt", x=x, y=y)
d = load("data.txt")
print d["x"]
print d["y"]

As I commented in the accepted answer, using numpy this can be done with a simple one-liner:
Assuming you have numpy imported as np (which is common practice),
np.savetxt('xy.txt', np.array([x, y]).T, fmt="%.3f", header="x y")
will save the data in the (optional) format and
x, y = np.loadtxt('xy.txt', unpack=True)
will load it.
The file xy.txt will then look like:
# x y
1.000 1.000
1.500 2.250
2.000 4.000
2.500 6.250
3.000 9.000
Note that the format string fmt=... is optional, but if the goal is human-readability it may prove quite useful. If used, it is specified using the usual printf-like codes (In my example: floating-point number with 3 decimals).

Related

Python - Passing functions to another function where the arguments may be modified

I've written what's effectively a parser for a large amount of sequential data chunks, and I need to write a number of functions to analyze the data chunks in various ways. The parser contains some useful functionality for me such as frequency of reading data into (previously-instantiated) objects, conditional filtering of the data, and when to stop reading the file.
I would like to write external analysis functions in separate modules, import the parser, and pass the analysis function into the parser to evaluate at the end of every data chunk read. In general, the analysis functions will require variables modified within the parser itself (i.e. the data chunk that was read), but it may need additional parameters from the module where it's defined.
Here's essentially what I would like to do for the parser:
def parse_chunk(dat_file, dat_obj1, dat_obj2, parse_arg1=None, fun=None, **fargs):
# Process optional arguments to parser...
with open(dat_file,'r') as dat:
# Parse chunk of dat_file based on parse_arg1 and store data in dat_obj1, dat_obj2, etc.
dat_obj1.attr = parsed_data
local_var1 = dat_obj1.some_method()
# Call analysis function passed to parser
if fun != None:
return fun(**fargs)
In another module, I would have something like:
from parsemod import parse_chunk
def main_script():
# Preprocess data from other files
dat_obj1 = ...
dat_obj2 = ...
script_var1 = ...
# Parse data and analyze
result = parse_chunk(dat_file, dat_obj1, dat_obj2, fun=eval_prop,
dat_obj1=None, local_var1=None, foo=script_var1)
def eval_data(dat_obj1, local_var1, foo):
# Analyze data
...
return result
I've looked at similar questions such as this and this, but the issue here is that eval_data() has arguments which are modified or set in parse(), and since **fargs provides a dictionary, the variable names themselves are not in the namespace for parse(), so they aren't modified prior to calling eval_data().
I've thought about modifying the parser to just return all variables after every chunk read and call eval_data() from main_script(), but there are too many different possible variables needed for the different eval_data() functional forms, so this gets very clunky.
Here's another simplified example that's even more general:
def my_eval(fun, **kwargs):
x = 6
z = 1
return fun(**kwargs)
def my_fun(x, y, z):
return x + y + z
my_eval(my_fun, x=3, y=5, z=None)
I would like the result of my_eval() to be 12, as x gets overwritten from 3 to 6 and z gets set to 1. I looked into functools.partial but it didn't seem to work either.
To override kwargs you need to do
kwargs['variable'] = value # instead of just variable = value
in your case, in my_eval you need to do
kwargs['x'] = 6
kwargs['z'] = 1

Updating variables across files persistently [duplicate]

So, I want to store a dictionary in a persistent file. Is there a way to use regular dictionary methods to add, print, or delete entries from the dictionary in that file?
It seems that I would be able to use cPickle to store the dictionary and load it, but I'm not sure where to take it from there.
If your keys (not necessarily the values) are strings, the shelve standard library module does what you want pretty seamlessly.
Use JSON
Similar to Pete's answer, I like using JSON because it maps very well to python data structures and is very readable:
Persisting data is trivial:
>>> import json
>>> db = {'hello': 123, 'foo': [1,2,3,4,5,6], 'bar': {'a': 0, 'b':9}}
>>> fh = open("db.json", 'w')
>>> json.dump(db, fh)
and loading it is about the same:
>>> import json
>>> fh = open("db.json", 'r')
>>> db = json.load(fh)
>>> db
{'hello': 123, 'bar': {'a': 0, 'b': 9}, 'foo': [1, 2, 3, 4, 5, 6]}
>>> del new_db['foo'][3]
>>> new_db['foo']
[1, 2, 3, 5, 6]
In addition, JSON loading doesn't suffer from the same security issues that shelve and pickle do, although IIRC it is slower than pickle.
If you want to write on every operation:
If you want to save on every operation, you can subclass the Python dict object:
import os
import json
class DictPersistJSON(dict):
def __init__(self, filename, *args, **kwargs):
self.filename = filename
self._load();
self.update(*args, **kwargs)
def _load(self):
if os.path.isfile(self.filename)
and os.path.getsize(self.filename) > 0:
with open(self.filename, 'r') as fh:
self.update(json.load(fh))
def _dump(self):
with open(self.filename, 'w') as fh:
json.dump(self, fh)
def __getitem__(self, key):
return dict.__getitem__(self, key)
def __setitem__(self, key, val):
dict.__setitem__(self, key, val)
self._dump()
def __repr__(self):
dictrepr = dict.__repr__(self)
return '%s(%s)' % (type(self).__name__, dictrepr)
def update(self, *args, **kwargs):
for k, v in dict(*args, **kwargs).items():
self[k] = v
self._dump()
Which you can use like this:
db = DictPersistJSON("db.json")
db["foo"] = "bar" # Will trigger a write
Which is woefully inefficient, but can get you off the ground quickly.
Unpickle from file when program loads, modify as a normal dictionary in memory while program is running, pickle to file when program exits? Not sure exactly what more you're asking for here.
Assuming the keys and values have working implementations of repr, one solution is that you save the string representation of the dictionary (repr(dict)) to file. YOu can load it using the eval function (eval(inputstring)). There are two main disadvantages of this technique:
1) Is will not work with types that have an unuseable implementation of repr (or may even seem to work, but fail). You'll need to pay at least some attention to what is going on.
2) Your file-load mechanism is basically straight-out executing Python code. Not great for security unless you fully control the input.
It has 1 advantage: Absurdly easy to do.
My favorite method (which does not use standard python dictionary functions): Read/write YAML files using PyYaml. See this answer for details, summarized here:
Create a YAML file, "employment.yml":
new jersey:
mercer county:
pumbers: 3
programmers: 81
middlesex county:
salesmen: 62
programmers: 81
new york:
queens county:
plumbers: 9
salesmen: 36
Step 3: Read it in Python
import yaml
file_handle = open("employment.yml")
my__dictionary = yaml.safe_load(file_handle)
file_handle.close()
and now my__dictionary has all the values. If you needed to do this on the fly, create a string containing YAML and parse it wth yaml.safe_load.
If using only strings as keys (as allowed by the shelve module) is not enough, the FileDict might be a good way to solve this problem.
pickling has one disadvantage. it can be expensive if your dictionary has to be read and written frequently from disk and it's large. pickle dumps the stuff down (whole). unpickle gets the stuff up (as a whole).
if you have to handle small dicts, pickle is ok. If you are going to work with something more complex, go for berkelydb. It is basically made to store key:value pairs.
Have you considered using dbm?
import dbm
import pandas as pd
import numpy as np
db = b=dbm.open('mydbm.db','n')
#create some data
df1 = pd.DataFrame(np.random.randint(0, 100, size=(15, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.randint(101,200, size=(10, 3)), columns=list('EFG'))
#serialize the data and put in the the db dictionary
db['df1']=df1.to_json()
db['df2']=df2.to_json()
# in some other process:
db=dbm.open('mydbm.db','r')
df1a = pd.read_json(db['df1'])
df2a = pd.read_json(db['df2'])
This tends to work even without a db.close()

How to remove a key/value pair from yaml dump, in Python?

Suppose I have a naive class definition:
import yaml
class A:
def __init__(self):
self.abc = 1
self.hidden = 100
self.xyz = 2
def __repr__(self):
return yaml.dump(self)
A()
printing
!!python/object:__main__.A
abc: 1
hidden: 100
xyz: 2
Is there a clean way to remove a line containing hidden: 100 from yaml dump's printed output? The key name hidden is known in advance, but its numeric value may change.
Desired output:
!!python/object:__main__.A
abc: 1
xyz: 2
FYI: This dump is for display only and will not be loaded.
I suppose one can suppress key/value pair with key=hidden with use of yaml.representative. Another way is find hidden: [number] with RegEx in a string output.
I looked at the documentation for pyyaml and did not find a way to achieve your objective. A work-around would be to delete the attribte hidden, call yaml.dump, then add it back in:
def __repr__(self):
hidden = self.hidden
del self.hidden
return yaml.dump(self)
self.hidden = hidden
Taking a step back, why do you want to use yaml for __repr__? Can you just roll your own instead of relying on yaml?
json is mature solution and (at the moment of writing) have much better docs than pyyaml;
I'd use it instead while pyyaml's docs are hard to fully understand. As a bonus, YAML is (almost) superset of JSON, so you'll be able to read your data as YAML without converting it.
However, to easily use all goodies of YAML you will probably have to convert the data to YAML
json module is unable to serialize custom objects by default, but it can be easily extended:
import json
def default(o):
if isinstance(o, A):
result = vars(o).copy()
del result['hidden']
result['__class__'] = o.__class__.__name__
return result
else:
return o
json.dumps(A(), default=default) # => '{"__class__": "A", "xyz": 2, "abc": 1}'
If you don't want to write default=default everywhere you dumps, you can create custom serializer:
dumper = json.JSONEncoder(default=default)
dumper.encode(A()) # => '{"__class__": "A", "xyz": 2, "abc": 1}'
Or, to be able to easily extend it even further via subclassing:
class Dumper(json.JSONEncoder):
__slots__ = ()
def default(self, o):
if isinstance(o, A):
result = vars(o).copy()
del result['hidden']
result['__class__'] = o.__class__.__name__
return result
else:
return super().default(o)
dumper = Dumper()
dumper.encode(A()) # => '{"__class__": "A", "xyz": 2, "abc": 1}'
Note that fields in JSON are unordered.
Also, if you want to use this, I'd advise you not to serialize dict with key __class__, because it might be hard to distinguish it from serialized object.
See it working online

Can I "detect" a slicing expression in a python class method?

I am developing an application where I have defined a "variable" object containing data in the form of a numpy array. These variables are linked to (netcdf) data files, and I would like to dynamically load the variable values when needed instead of loading all data from the sometimes huge files at the start.
The following snippet demonstrates the principle and works well, including access to data portions with slices. For example, you can write:
a = var() # empty variable
print a.values[7] # values have been automatically "loaded"
or even:
a = var()
a[7] = 0
However, this code still forces me to load the entire variable data at once. Netcdf (with the netCDF4 library) would allow me to directly access data slices from the file. Example:
f = netCDF4.Dataset(filename, "r")
print f.variables["a"][7]
I cannot use the netcdf variable objects directly, because my application is tied to a web service which cannot remember the netcdf file handler, and also because the variable data don't always come from netcdf files, but may originate from other sources such as OGC web services.
Is there a way to "capture" the slicing expression in the property or setter methods and use them? The idea would be to write something like:
#property
def values(self):
if self._values is None:
self._values = np.arange(10.)[slice] # load from file ...
return self._values
instead of the code below.
Working demo:
import numpy as np
class var(object):
def __init__(self, values=None, metadata=None):
if values is None:
self._values = None
else:
self._values = np.array(values)
self.metadata = metadata # just to demonstrate that var has mor than just values
#property
def values(self):
if self._values is None:
self._values = np.arange(10.) # load from file ...
return self._values
#values.setter
def values(self, values):
self._values = values
First thought: Should I perhaps create values as a separate class and then use __getitem__? See In python, how do I create two index slicing for my own matrix class?
No, you cannot detect what will be done to the object after returning from .values. The result could be stored in a variable and only (much later on) be sliced, or sliced in different places, or used in its entirety, etc.
You indeed should instead return a wrapper object and hook into object.__getitem__; it would let you detect slicing and load data as needed. When slicing, Python passes in a slice() object.
Thanks to the guidance of Martijn Pieters and with a bit more reading, I came up with the following code as demonstration. Note that the Reader class uses a netcdf file and the netCDF4 library. If you want to try out this code yourself you will either need a netcdf file with variables "a" and "b", or replace Reader with something else that will return a data array or a slice from a data array.
This solution defines three classes: Reader does the actual file I/O handling, Values manages the data access part and invokes a Reader instance if no data have been stored in memory, and var is the final "variable" which in real life will contain a lot more metadata. The code contains a couple of extra print statements for educational purposes.
"""Implementation of a dynamic variable class which can read data from file when needed or
return the data values from memory if they were read already. This concepts supports
slicing for both memory and file access."""
import numpy as np
import netCDF4 as nc
FILENAME = r"C:\Users\m.schultz\Downloads\data\tmp\MACC_20141224_0001.nc"
VARNAME = "a"
class Reader(object):
"""Implements the actual data access to variable values. Here reading a
slice from a netcdf file.
"""
def __init__(self, filename, varname):
"""Final implementation will also have to take groups into account...
"""
self.filename = filename
self.varname = varname
def read(self, args=slice(None, None, None)):
"""Read a data slice. Args is a tuple of slice objects (e.g.
numpy.index_exp). The default corresponds to [:], i.e. all data
will be read.
"""
with nc.Dataset(self.filename, "r") as f:
values = f.variables[self.varname][args]
return values
class Values(object):
def __init__(self, values=None, reader=None):
"""Initialize Values. You can either pass numerical (or other) values,
preferrably as numpy array, or a reader instance which will read the
values on demand. The reader must have a read(args) method, where
args is a tuple of slices. If no args are given, all data should be
returned.
"""
if values is not None:
self._values = np.array(values)
self.reader = reader
def __getattr__(self, name):
"""This is only be called if attribute name is not present.
Here, the only attribute we care about is _values.
Self.reader should always be defined.
This method is necessary to allow access to variable.values without
a slicing index. If only __getitem__ were defined, one would always
have to write variable.values[:] in order to make sure that something
is returned.
"""
print ">>> in __getattr__, trying to access ", name
if name == "_values":
print ">>> calling reader and reading all values..."
self._values = self.reader.read()
return self._values
def __getitem__(self, args):
print "in __getitem__"
if not "_values" in self.__dict__:
values = self.reader.read(args)
print ">>> read from file. Shape = ", values.shape
if args == slice(None, None, None):
self._values = values # all data read, store in memory
return values
else:
print ">>> read from memory. Shape = ", self._values[args].shape
return self._values[args]
def __repr__(self):
return self._values.__repr__()
def __str__(self):
return self._values.__str__()
class var(object):
def __init__(self, name=VARNAME, filename=FILENAME, values=None):
self.name = name
self.values = Values(values, Reader(filename, name))
if __name__ == "__main__":
# define a variable and access all data first
# this will read the entire array and save it in memory, so that
# subsequent access with or without index returns data from memory
a = var("a", filename=FILENAME)
print "1: a.values = ", a.values
print "2: a.values[-1] = ", a.values[-1]
print "3: a.values = ", a.values
# define a second variable, where we access a data slice first
# In this case the Reader only reads the slice and no data are stored
# in memory. The second access indexes the complete array, so Reader
# will read everything and the data will be stored in memory.
# The last access will then use the data from memory.
b = var("b", filename=FILENAME)
print "4: b.values[0:3] = ", b.values[0:3]
print "5: b.values[:] = ", b.values[:]
print "6: b.values[5:8] = ",b.values[5:8]

With Python, can I keep a persistent dictionary and modify it?

So, I want to store a dictionary in a persistent file. Is there a way to use regular dictionary methods to add, print, or delete entries from the dictionary in that file?
It seems that I would be able to use cPickle to store the dictionary and load it, but I'm not sure where to take it from there.
If your keys (not necessarily the values) are strings, the shelve standard library module does what you want pretty seamlessly.
Use JSON
Similar to Pete's answer, I like using JSON because it maps very well to python data structures and is very readable:
Persisting data is trivial:
>>> import json
>>> db = {'hello': 123, 'foo': [1,2,3,4,5,6], 'bar': {'a': 0, 'b':9}}
>>> fh = open("db.json", 'w')
>>> json.dump(db, fh)
and loading it is about the same:
>>> import json
>>> fh = open("db.json", 'r')
>>> db = json.load(fh)
>>> db
{'hello': 123, 'bar': {'a': 0, 'b': 9}, 'foo': [1, 2, 3, 4, 5, 6]}
>>> del new_db['foo'][3]
>>> new_db['foo']
[1, 2, 3, 5, 6]
In addition, JSON loading doesn't suffer from the same security issues that shelve and pickle do, although IIRC it is slower than pickle.
If you want to write on every operation:
If you want to save on every operation, you can subclass the Python dict object:
import os
import json
class DictPersistJSON(dict):
def __init__(self, filename, *args, **kwargs):
self.filename = filename
self._load();
self.update(*args, **kwargs)
def _load(self):
if os.path.isfile(self.filename)
and os.path.getsize(self.filename) > 0:
with open(self.filename, 'r') as fh:
self.update(json.load(fh))
def _dump(self):
with open(self.filename, 'w') as fh:
json.dump(self, fh)
def __getitem__(self, key):
return dict.__getitem__(self, key)
def __setitem__(self, key, val):
dict.__setitem__(self, key, val)
self._dump()
def __repr__(self):
dictrepr = dict.__repr__(self)
return '%s(%s)' % (type(self).__name__, dictrepr)
def update(self, *args, **kwargs):
for k, v in dict(*args, **kwargs).items():
self[k] = v
self._dump()
Which you can use like this:
db = DictPersistJSON("db.json")
db["foo"] = "bar" # Will trigger a write
Which is woefully inefficient, but can get you off the ground quickly.
Unpickle from file when program loads, modify as a normal dictionary in memory while program is running, pickle to file when program exits? Not sure exactly what more you're asking for here.
Assuming the keys and values have working implementations of repr, one solution is that you save the string representation of the dictionary (repr(dict)) to file. YOu can load it using the eval function (eval(inputstring)). There are two main disadvantages of this technique:
1) Is will not work with types that have an unuseable implementation of repr (or may even seem to work, but fail). You'll need to pay at least some attention to what is going on.
2) Your file-load mechanism is basically straight-out executing Python code. Not great for security unless you fully control the input.
It has 1 advantage: Absurdly easy to do.
My favorite method (which does not use standard python dictionary functions): Read/write YAML files using PyYaml. See this answer for details, summarized here:
Create a YAML file, "employment.yml":
new jersey:
mercer county:
pumbers: 3
programmers: 81
middlesex county:
salesmen: 62
programmers: 81
new york:
queens county:
plumbers: 9
salesmen: 36
Step 3: Read it in Python
import yaml
file_handle = open("employment.yml")
my__dictionary = yaml.safe_load(file_handle)
file_handle.close()
and now my__dictionary has all the values. If you needed to do this on the fly, create a string containing YAML and parse it wth yaml.safe_load.
If using only strings as keys (as allowed by the shelve module) is not enough, the FileDict might be a good way to solve this problem.
pickling has one disadvantage. it can be expensive if your dictionary has to be read and written frequently from disk and it's large. pickle dumps the stuff down (whole). unpickle gets the stuff up (as a whole).
if you have to handle small dicts, pickle is ok. If you are going to work with something more complex, go for berkelydb. It is basically made to store key:value pairs.
Have you considered using dbm?
import dbm
import pandas as pd
import numpy as np
db = b=dbm.open('mydbm.db','n')
#create some data
df1 = pd.DataFrame(np.random.randint(0, 100, size=(15, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.randint(101,200, size=(10, 3)), columns=list('EFG'))
#serialize the data and put in the the db dictionary
db['df1']=df1.to_json()
db['df2']=df2.to_json()
# in some other process:
db=dbm.open('mydbm.db','r')
df1a = pd.read_json(db['df1'])
df2a = pd.read_json(db['df2'])
This tends to work even without a db.close()

Categories

Resources