Closing files with dask's from_array function

I happened to use Dask's "from_array" method. As in the docs at https://docs.dask.org/en/latest/array-creation.html, I do as follows:
>>> import h5py
>>> f = h5py.File('myfile.hdf5') # HDF5 file
>>> d = f['/data/path'] # Pointer on on-disk array
>>> x = da.from_array(d, chunks=(1000, 1000))
But in this example, do you agree that I should close the HDF5 file after processing the data?
If yes, it may be useful to add a feature to Dask array that accepts the file pointer and the dataset key directly, so that Dask array could close the source file, if any, when the dask array object is destroyed.
I know that a good way to proceed would be like this:
>>> import h5py
>>> with h5py.File('myfile.hdf5') as f: # HDF5 file
...     d = f['/data/path'] # Pointer on on-disk array
...     x = da.from_array(d, chunks=(1000, 1000))
But sometimes this is not very handy. For example, in my code, I have a function that returns a dask array from a filepath, with some sanity checks in between, a bit like:
>>> import h5py
>>> def get_dask_array(filepath, key):
...     f = h5py.File(filepath) # HDF5 file
...     # ... some sanity checks here
...     d = f[key] # Pointer on on-disk array
...     # ... some sanity checks here
...     return da.from_array(d, chunks=(1000, 1000))
In this case, I find it ugly to return the file pointer as well and keep it aside for the duration of the processing, before closing it.
Any suggestion on how I should proceed?
Thank you in advance for your answers,
Regards,
Edit: for now I am keeping the open file handles in a module-level list (SOURCE_FILES) inside the package and closing them at exit, as follows:
@atexit.register
def clean_files():
    for f in SOURCE_FILES:
        if f:  # a closed h5py File evaluates to False
            f.close()
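A possible alternative (just a sketch, assuming the caller can work inside a with block; open_dask_array is a hypothetical name, not a Dask API) would be to make the helper itself a context manager, so the file is closed as soon as the block ends:
import contextlib
import dask.array as da
import h5py

@contextlib.contextmanager
def open_dask_array(filepath, key, chunks=(1000, 1000)):
    # Yield a dask array backed by the HDF5 dataset, closing the file on exit
    f = h5py.File(filepath, 'r')
    try:
        # ... some sanity checks here
        yield da.from_array(f[key], chunks=chunks)
    finally:
        f.close()

# Usage: the file stays open exactly as long as the block runs
# with open_dask_array('myfile.hdf5', '/data/path') as x:
#     result = x.sum().compute()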

Related

How to create numpy arrays automatically?

I wanted to create arrays in a for loop so that the array names would be assigned automatically.
But using a for loop it didn't work, and creating a dictionary with numpy.array() in it didn't work either. Currently I have no more ideas...
I am not very experienced with Python.
import numpy as np
for file_name in folder:
    file_name = np.array()
    file_name.extend((blabla, blabla1))
I expected to get arrays with automatically assigned names, like file_name1, file_name2, ...
But I got the warning "redeclared file_name defined above without usage", and the line file_name = np.array() raised:
TypeError: array() missing required argument 'object' (pos 1) ...
You can do it with globals() if you really want to use the strings as named variables.
globals()[file_name] = np.array((blabla, blabla1))
Example:
>>> globals()['test'] = 1
>>> test
1
Of course this populates the global namespace. Otherwise, you can use locals().
As @Mark Meyer said in a comment, you should use a dictionary (dict in Python), with file_name as the key.
As for your error: when you create a numpy array, you should provide an iterable (e.g. a list).
For example:
>>> folder = ['file1', 'file2']
>>> blabla = 0
>>> blabla1 = 1
>>> {f: np.array((blabla, blabla1)) for f in folder}
{'file1': array([0, 1]), 'file2': array([0, 1])}

Ctypes read data from a double pointer

I am working on a C++ DLL with a C wrapper, and I am creating a Python wrapper for future users (I discovered ctypes on Monday). One of the methods of my DLL (it wraps a class) returns an unsigned short **, called data, which corresponds to an image. In C++, I get the value of a pixel using data[row][column].
In Python, I declare the function like this:
mydll.cMyFunction.argtypes = [c_void_p]
mydll.cMyFunction.restype = POINTER(POINTER(c_ushort))
When I call this function, I get result = <__main__.LP_LP_c_ushort at 0x577fac8>,
and when I try to see the data at this address (using result.contents.contents) I get the correct value of the first pixel. But I don't know how to access the values for the rest of my image. Is there an easy way to do something like the C++ data[i][j]?
Yes, just use result[i][j]. Here's a contrived example:
>>> from ctypes import *
>>> ppus = POINTER(POINTER(c_ushort))
>>> ppus
<class '__main__.LP_LP_c_ushort'>
>>> # This creates an array of pointers to ushort[5] arrays
>>> x=(POINTER(c_ushort)*5)(*[cast((c_ushort*5)(n,n+1,n+2,n+3,n+4),POINTER(c_ushort)) for n in range(0,25,5)])
>>> a = cast(x,ppus) # gets a ushort**
>>> a
<__main__.LP_LP_c_ushort object at 0x00000000026F39C8>
>>> a[0] # deref to get the first ushort[5] array
<__main__.LP_c_ushort object at 0x00000000026F33C8>
>>> a[0][0] # get an item from a row
0
>>> a[0][1]
1
>>>
>>> a[1][0]
5
So if you are returning the ushort** correctly from C, it should "just work".
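If you also know the image dimensions, a hedged sketch for pulling the whole image into a numpy array could look like this (rows and cols are placeholders that have to come from your DLL's API; result is the pointer returned by cMyFunction):
import numpy as np

rows, cols = 480, 640  # placeholder dimensions; your DLL's API must provide these
# result is the LP_LP_c_ushort returned by mydll.cMyFunction
image = np.array([[result[i][j] for j in range(cols)] for i in range(rows)],
                 dtype=np.uint16)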

Updating variables across files persistently [duplicate]

So, I want to store a dictionary in a persistent file. Is there a way to use regular dictionary methods to add, print, or delete entries from the dictionary in that file?
It seems that I would be able to use cPickle to store the dictionary and load it, but I'm not sure where to take it from there.
If your keys (not necessarily the values) are strings, the shelve standard library module does what you want pretty seamlessly.
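For example, a minimal shelve sketch ('db.shelf' is just an illustrative filename):
import shelve

db = shelve.open('db.shelf')    # behaves like a dict persisted to disk
db['hello'] = 123               # keys must be strings; values can be any picklable object
db['foo'] = [1, 2, 3, 4, 5, 6]
db.close()

db = shelve.open('db.shelf')
print(db['hello'])              # 123
db.close()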
Use JSON
Similar to Pete's answer, I like using JSON because it maps very well to python data structures and is very readable:
Persisting data is trivial:
>>> import json
>>> db = {'hello': 123, 'foo': [1,2,3,4,5,6], 'bar': {'a': 0, 'b':9}}
>>> fh = open("db.json", 'w')
>>> json.dump(db, fh)
>>> fh.close()
and loading it is about the same:
>>> import json
>>> fh = open("db.json", 'r')
>>> db = json.load(fh)
>>> db
{'hello': 123, 'bar': {'a': 0, 'b': 9}, 'foo': [1, 2, 3, 4, 5, 6]}
>>> del db['foo'][3]
>>> db['foo']
[1, 2, 3, 5, 6]
In addition, JSON loading doesn't suffer from the same security issues that shelve and pickle do, although IIRC it is slower than pickle.
If you want to write on every operation:
If you want to save on every operation, you can subclass the Python dict object:
import os
import json

class DictPersistJSON(dict):
    def __init__(self, filename, *args, **kwargs):
        self.filename = filename
        self._load()
        self.update(*args, **kwargs)

    def _load(self):
        if os.path.isfile(self.filename) and os.path.getsize(self.filename) > 0:
            with open(self.filename, 'r') as fh:
                self.update(json.load(fh))

    def _dump(self):
        with open(self.filename, 'w') as fh:
            json.dump(self, fh)

    def __getitem__(self, key):
        return dict.__getitem__(self, key)

    def __setitem__(self, key, val):
        dict.__setitem__(self, key, val)
        self._dump()

    def __repr__(self):
        dictrepr = dict.__repr__(self)
        return '%s(%s)' % (type(self).__name__, dictrepr)

    def update(self, *args, **kwargs):
        for k, v in dict(*args, **kwargs).items():
            self[k] = v
        self._dump()
Which you can use like this:
db = DictPersistJSON("db.json")
db["foo"] = "bar" # Will trigger a write
Which is woefully inefficient, but can get you off the ground quickly.
Unpickle from file when program loads, modify as a normal dictionary in memory while program is running, pickle to file when program exits? Not sure exactly what more you're asking for here.
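A minimal sketch of that approach ('db.pickle' is just an illustrative filename):
import pickle

# pickle to file when the program exits (or whenever you choose to save)
db = {'hello': 123, 'foo': [1, 2, 3, 4, 5, 6]}
with open('db.pickle', 'wb') as fh:
    pickle.dump(db, fh)

# unpickle from file when the program loads
with open('db.pickle', 'rb') as fh:
    db = pickle.load(fh)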
Assuming the keys and values have working implementations of repr, one solution is to save the string representation of the dictionary (repr(dict)) to a file. You can load it back using the eval function (eval(inputstring)). There are two main disadvantages of this technique:
1) It will not work with types that have an unusable implementation of repr (or may even seem to work, but fail). You'll need to pay at least some attention to what is going on.
2) Your file-load mechanism is basically straight-out executing Python code. Not great for security unless you fully control the input.
It has one advantage: it's absurdly easy to do.
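A minimal sketch of that repr/eval round trip ('db.repr' is just an illustrative filename):
db = {'hello': 123, 'foo': [1, 2, 3, 4, 5, 6]}

# save the string representation of the dictionary
with open('db.repr', 'w') as fh:
    fh.write(repr(db))

# load it back by evaluating the string; only do this with input you fully control
with open('db.repr', 'r') as fh:
    db = eval(fh.read())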
My favorite method (which does not use standard python dictionary functions): Read/write YAML files using PyYaml. See this answer for details, summarized here:
Create a YAML file, "employment.yml":
new jersey:
  mercer county:
    plumbers: 3
    programmers: 81
  middlesex county:
    salesmen: 62
    programmers: 81
new york:
  queens county:
    plumbers: 9
    salesmen: 36
Read it in Python:
import yaml
file_handle = open("employment.yml")
my__dictionary = yaml.safe_load(file_handle)
file_handle.close()
and now my__dictionary has all the values. If you needed to do this on the fly, create a string containing YAML and parse it with yaml.safe_load.
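A small sketch of that on-the-fly case (the YAML string here is just an example):
import yaml

snippet = "new york:\n  queens county:\n    plumbers: 9\n"
my__dictionary = yaml.safe_load(snippet)
print(my__dictionary['new york']['queens county']['plumbers'])  # 9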
If using only strings as keys (as allowed by the shelve module) is not enough, the FileDict might be a good way to solve this problem.
Pickling has one disadvantage: it can be expensive if your dictionary is large and has to be read and written frequently from disk, because pickle dumps and loads the whole thing at once.
If you only have to handle small dicts, pickle is fine. If you are going to work with something more complex, go for BerkeleyDB. It is basically made to store key:value pairs.
Have you considered using dbm?
import dbm
import pandas as pd
import numpy as np
db = dbm.open('mydbm.db','n')
#create some data
df1 = pd.DataFrame(np.random.randint(0, 100, size=(15, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.randint(101,200, size=(10, 3)), columns=list('EFG'))
#serialize the data and put it in the db dictionary
db['df1']=df1.to_json()
db['df2']=df2.to_json()
# in some other process:
db=dbm.open('mydbm.db','r')
df1a = pd.read_json(db['df1'])
df2a = pd.read_json(db['df2'])
This tends to work even without a db.close()

Unpacking data with h5py

I want to write numpy arrays to a file and easily load them in again.
I would like to have a function save() that preferably works in the following way:
data = [a, b, c, d]
save('data.h5', data)
which then does the following
h5f = h5py.File('data.h5', 'w')
h5f.create_dataset('a', data=a)
h5f.create_dataset('b', data=b)
h5f.create_dataset('c', data=c)
h5f.create_dataset('d', data=d)
h5f.close()
Then subsequently I would like to easily load this data with for example
a, b, c, d = load('data.h5')
which does the following:
h5f = h5py.File('data.h5', 'r')
a = h5f['a'][:]
b = h5f['b'][:]
c = h5f['c'][:]
d = h5f['d'][:]
h5f.close()
I can think of the following for saving the data:
h5f = h5py.File('data.h5', 'w')
data_str = ['a', 'b', 'c', 'd']
for name in data_str:
    h5f.create_dataset(name, data=eval(name))
h5f.close()
I can't think of a similar way of using data_str to then load the data again.
Rereading the question (was this edited or not?), I see load is supposed to function as:
a, b, c, d = load('data.h5')
This eliminates the global variable names issue that I worried about earlier. Just return the 4 arrays (as a tuple), and the calling expression takes care of assigning names. Of course this way, the global variable names do not have to match the names in the file, nor the names used inside the function.
def load(filename):
    h5f = h5py.File(filename, 'r')
    a = h5f['a'][:]
    b = h5f['b'][:]
    c = h5f['c'][:]
    d = h5f['d'][:]
    h5f.close()
    return a, b, c, d
Or using a data_str parameter:
def load(filename, data_str=['a', 'b', 'c', 'd']):
    h5f = h5py.File(filename, 'r')
    arrays = []
    for name in data_str:
        var = h5f[name][:]
        arrays.append(var)
    h5f.close()
    return arrays
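For the save side, a hedged counterpart that avoids the eval in the question's loop by passing the arrays in a dict (a sketch, not part of the original answer; save is a hypothetical name):
def save(filename, arrays):
    # arrays maps dataset names to numpy arrays, e.g. {'a': a, 'b': b, 'c': c, 'd': d}
    h5f = h5py.File(filename, 'w')
    for name, arr in arrays.items():
        h5f.create_dataset(name, data=arr)
    h5f.close()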
For loading all the variables in the file, see Reading ALL variables in a .mat file with python h5py
An earlier answer that assumed you wanted to take the variable names from the file key names.
This isn't an h5py issue. It's about creating global (or local) variables using names from a dictionary (or other structure), in other words, how to create a variable using a string as its name.
This issue has come up often in connection with argparse, a command-line parser. It gives an object like args=Namespace(a=1, b='value'). It is easy to turn that into a dictionary (with vars(args)), {'a': 1, 'b': 'value'}. But you have to do something tricky, and not Pythonic, to create a and b variables from it.
It's even worse if you create that dictionary inside a function, and then want to create global variables (i.e. outside the function).
The trick involves assigning to locals() or globals(). But since it's un-pythonic I'm reluctant to be more specific.
In so many words I'm saying the same thing as the accepted answer in https://stackoverflow.com/a/4467517/901925
For loading variables from a file into an IPython environment, see
https://stackoverflow.com/a/28258184/901925 ipython-loading-variables-to-workspace
I would use deepdish (deepdish.io):
import deepdish as dd
dd.io.save(filename, {'dict1': dict1, 'obj2': obj2}, compression=('blosc', 9))

How do I split a string and rejoin it without creating an intermediate list in Python?

Say I have something like the following:
dest = "\n".join( [line for line in src.split("\n") if line[:1]!="#"] )
(i.e. strip any lines starting with # from the multi-line string src)
src is very large, so I'm assuming .split() will create a large intermediate list. I can change the list comprehension to a generator expression, but is there some kind of "xsplit" I can use to only work on one line at a time? Is my assumption correct? What's the most (memory) efficient way to handle this?
Clarification: This arose due to my code running out of memory. I know there are ways to entirely rewrite my code to work around that, but the question is about Python: Is there a version of split() (or an equivalent idiom) that behaves like a generator and hence doesn't make an additional working copy of src?
Here's a way to do a general type of split using itertools
>>> import itertools as it
>>> src="hello\n#foo\n#bar\n#baz\nworld\n"
>>> line_gen = (''.join(j) for i,j in it.groupby(src, "\n".__ne__) if i)
>>> '\n'.join(s for s in line_gen if s[0]!="#")
'hello\nworld'
groupby treats each char in src separately, so the performance probably isn't stellar, but it does avoid creating any intermediate huge data structures
Probably better to spend a few lines and make a generator
>>> src="hello\n#foo\n#bar\n#baz\nworld\n"
>>>
>>> def isplit(s, t): # iterator to split string s at character t
...     i = j = 0
...     while True:
...         try:
...             j = s.index(t, i)
...         except ValueError:
...             if i < len(s):
...                 yield s[i:]
...             raise StopIteration
...         yield s[i:j]
...         i = j + 1
...
>>> '\n'.join(x for x in isplit(src, '\n') if x[0]!='#')
'hello\nworld'
re has a function called finditer that could be used for this purpose too:
>>> import re
>>> src="hello\n#foo\n#bar\n#baz\nworld\n"
>>> line_gen = (m.group(1) for m in re.finditer("(.*?)(\n|$)",src))
>>> '\n'.join(s for s in line_gen if not s.startswith("#"))
'hello\nworld'
comparing the performance is an exercise for the OP to try on the real data
from StringIO import StringIO

buffer = StringIO(src)
dest = "".join(line for line in buffer if line[:1]!="#")
Of course, this really makes the most sense if you use StringIO throughout. It works mostly the same as files. You can seek, read, write, iterate (as shown), etc.
In your existing code you can change the list to a generator expression:
dest = "\n".join(line for line in src.split("\n") if line[:1]!="#")
This very small change avoids the construction of one of the two temporary lists in your code, and requires no effort on your part.
A completely different approach that avoids the temporary construction of both lists is to use a regular expression:
import re
regex = re.compile('^#.*\n?', re.M)
dest = regex.sub('', src)
This will not only avoid creating temporary lists, it will also avoid creating temporary strings for each line in the input. Here are some performance measurements of the proposed solutions:
init = r'''
import re, StringIO
regex = re.compile('^#.*\n?', re.M)
src = ''.join('foo bar baz\n' for _ in range(100000))
'''
method1 = r'"\n".join([line for line in src.split("\n") if line[:1] != "#"])'
method2 = r'"\n".join(line for line in src.split("\n") if line[:1] != "#")'
method3 = 'regex.sub("", src)'
method4 = '''
buffer = StringIO.StringIO(src)
dest = "".join(line for line in buffer if line[:1] != "#")
'''
import timeit
for method in [method1, method2, method3, method4]:
    print timeit.timeit(method, init, number = 100)
Results:
9.38s # Split then join with temporary list
9.92s # Split then join with generator
8.60s # Regular expression
64.56s # StringIO
As you can see the regular expression is the fastest method.
From your comments I can see that you are not actually interested in avoiding creating temporary objects. What you really want is to reduce the memory requirements for your program. Temporary objects don't necessarily affect the memory consumption of your program as Python is good about clearing up memory quickly. The problem comes from having objects that persist in memory longer than they need to, and all these methods have this problem.
If you are still running out of memory then I'd suggest that you shouldn't be doing this operation entirely in memory. Instead store the input and output in files on the disk and read from them in a streaming fashion. This means that you read one line from the input, write a line to the output, read a line, write a line, etc. This will create lots of temporary strings but even so it will require almost no memory because you only need to handle the strings one at a time.
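A minimal sketch of that streaming approach ('input.txt' and 'output.txt' are placeholder filenames):
with open('input.txt') as infile, open('output.txt', 'w') as outfile:
    for line in infile:
        # read one line, write one line; only a single line is held in memory at a time
        if not line.startswith('#'):
            outfile.write(line)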
If I understand your question about "more generic calls to split()" correctly, you could use re.finditer, like so:
import re

output = ""
for i in re.finditer("^.*\n", input, re.M):
    i = i.group(0).strip()
    if i.startswith("#"):
        continue
    output += i + "\n"
Here you can replace the regular expression by something more sophisticated.
The problem is that strings are immutable in python, so it's going to be very difficult to do anything at all without intermediate storage.
