In our lab we store our data in hdf5 files trough the python package h5py.
At the beginning of an experiment we create an hdf5 file and store array after array of array of data in the file (among other things). When an experiment fails or is interrupted the file is not correctly closed.
Because our experiments run from iPython the reference to the data object remains (somewhere) in memory.
Is there a way to scan for all open h5py data objects and close them?
This is how it could be done (I could not figure out how to check for closed-ness of the file without exceptions, maybe you will find):
import gc
for obj in gc.get_objects(): # Browse through ALL objects
if isinstance(obj, h5py.File): # Just HDF5 files
try:
obj.close()
except:
pass # Was already closed
Another idea:
Dpending how you use the files, what about using the context manager and the with keyword like this?
with h5py.File("some_path.h5") as f:
f["data1"] = some_data
When the program flow exits the with-block, the file is closed regardless of what happens, including exceptions etc.
pytables (which h5py uses) keeps track of all open files and provides an easy method to force-close all open hdf5 files.
import tables
tables.file._open_files.close_all()
That attribute _open_files also has helpful methods to give you information and handlers for the open files.
I've found that hFile.bool() returns True if open, and False otherwise. This might be the simplest way to check.
In other words, do this:
hFile = h5py.File(path_to_file)
if hFile.__bool__():
hFile.close()
Related
I'm attempting to read in a series of files for processing contained in a single directory using RedVox:
input_directory = "/home/ben/Documents/Data/F1D1/21" # file location
rdvx_data = DataWindow(input_dir=input_directory, apply_correction=False, debug=True) # using RedVox to read in the files
print(os.listdir(input_directory)) # verifying the files actually exist...
# returns "['file1.rdvxz', 'file2.rdvxz', file3.rdvxz', ...etc]", they exist
# write audio portion to file
rdvx_data.to_json_file(base_dir=output_rpd_directory,
file_name=output_filename)
# this never runs, because rdvx_data.stations = [] (verified through debugging)
for station in rdvx_data.stations:
# some code here
Enabling debugging through arguments as seen above does not provide an extra details. In fact, there is no error message whatsoever. It writes the JSON file and pickle to disk, but the JSON file is full of null values and the pickle object is just a shell, no contents. So the files definitely exist, os.listdir() sees them, but RedVox does not.
I assume this is some very silly error or lack of understanding on my part. Any help is greatly appreciated. I have not worked with RedVox previously, nor do I have much understanding of what these files contain other than some audio data and some other data. I've simply been tasked with opening them to work on a model to analyze the data within.
SOLVED: Not sure why the previous code doesn't work (it was handed to me), however, I worked around the DataWindow call and went straight to calling the "redvox.api900.reader" object:
from redvox.api900 import reader
dataset_dir = "/home/*****/Documents/Data/F1D1/21/"
rdvx_files = glob(dataset_dir+"*.rdvxz")
for file in rdvx_files:
wrapped_packet = reader.read_rdvxz_file(file)
From here I can view all of the sensor data within:
if wrapped_packet.has_microphone_sensor():
microphone_sensor = wrapped_packet.microphone_sensor()
print("sample_rate_hz", microphone_sensor.sample_rate_hz())
Hope this helps anyone else who's confused.
I have a lot of pickle files. Currently I read them in a loop but it takes a lot of time. I would like to speed it up but don't have any idea how to do that.
Multiprocessing wouldn't work because in order to transfer data from a child subprocess to the main process data need to be serialized (pickled) and deserialized.
Using threading wouldn't help either because of GIL.
I think that the solution would be some library written in C that takes a list of files to read and then runs multiple threads (without GIL). Is there something like this around?
UPDATE
Answering your questions:
Files are partial products of data processing for the purpose of ML
There are pandas.Series objects but the dtype is not known upfront
I want to have many files because we want to pick any subset easily
I want to have many smaller files instead of one big file because deserialization of one big file takes more memory (at some point in time we have serialized string and deserialized objects)
The size of the files can vary a lot
I use python 3.7 so I believe it's cPickle in fact
Using pickle is very flexible because I don't have to worry about underlying types - I can save anything
I agree with what has been noted in the comments, namely that due to the constraint of python itself (chiefly, the GIL lock, as you noted) and there may simply be no faster loading the information beyond what you are doing now. Or, if there is a way, it may be both highly technical and, in the end, only gives you a modest increase in speed.
That said, depending on the datatypes you have, it may be faster to use quickle or pyrobuf.
I think that the solution would be some library written in C that
takes a list of files to read and then runs multiple threads (without
GIL). Is there something like this around?
In short: no. pickle is apparently good enough for enough people that there are no major alternate implementations fully compatible with the pickle protocol. As of sometime in python 3, cPickle was merged with pickle, and neither release the GIL anyway which is why threading won't help you (search for Py_BEGIN_ALLOW_THREADS in _pickle.c and you will find nothing).
If your data can be re-structured into a simpler data format like csv, or a binary format like numpy's npy, there will be less cpu overhead when reading your data. Pickle is built for flexibility first rather than speed or compactness first. One possible exception to the rule of more complex less speed is the HDF5 format using h5py, which can be fairly complex, and I have used to max out the bandwidth of a sata ssd.
Finally you mention you have many many pickle files, and that itself is probably causing no small amount of overhead. Each time you open a new file, there's some overhead involved from the operating system. Conveniently you can combine pickle files by simply appending them together. Then you can call Unpickler.load() until you reach the end of the file. Here's a quick example of combining two pickle files together using shutil
import pickle, shutil, os
#some dummy data
d1 = {'a': 1, 'b': 2, 1: 'a', 2: 'b'}
d2 = {'c': 3, 'd': 4, 3: 'c', 4: 'd'}
#create two pickles
with open('test1.pickle', 'wb') as f:
pickle.Pickler(f).dump(d1)
with open('test2.pickle', 'wb') as f:
pickle.Pickler(f).dump(d2)
#combine list of pickle files
with open('test3.pickle', 'wb') as dst:
for pickle_file in ['test1.pickle', 'test2.pickle']:
with open(pickle_file, 'rb') as src:
shutil.copyfileobj(src, dst)
#unpack the data
with open('test3.pickle', 'rb') as f:
p = pickle.Unpickler(f)
while True:
try:
print(p.load())
except EOFError:
break
#cleanup
os.remove('test1.pickle')
os.remove('test2.pickle')
os.remove('test3.pickle')
I think you should try and use mmap(memory mapped files) that is similar to open() but way faster.
Note: If your each file is big in size then use mmap otherwise If the files are small in size use regular methods.
I have written a sample that you can try.
import mmap
from time import perf_counter as pf
def load_files(filelist):
start = pf() # for rough time calculations
for filename in filelist:
with open(filename, mode="r", encoding="utf8") as file_obj:
with mmap.mmap(file_obj.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_file_obj:
data = pickle.load(mmap_file_obj)
print(data)
print(f'Operation took {pf()-start} sec(s)')
Here mmap.ACCESS_READ is the mode to open the file in binary. The file_obj returned by open is just used to get the file descriptor which is used to open the stream to the file via mmap as a memory mapped file.
As you can see below in the documentation of python open returns the file descriptor or fd for short. So we don't have to do anything with the file_obj operation wise. We just need its fileno() method to get its file descriptor. Also we are not closing the file_obj before the mmap_file_obj. Please take a proper look. We are closing the the mmap block first.
As you said in your comment.
open (file, flags[, mode])
Open the file file and set various flags according to flags and possibly its mode according to mode.
The default mode is 0777 (octal), and the current umask value is first masked out.
Return the file descriptor for the newly opened file.
Give it a try and see how much impact does it do on your operation
You can read more about mmap here. And about file descriptor here
You can try multiprocessing:
import os,pickle
pickle_list=os.listdir("pickles")
output_dict=dict.fromkeys(pickle_list, '')
def pickle_process_func(picklename):
with open("pickles/"+picklename, 'rb') as file:
dapickle=pickle.load(file)
#if you need previus files output wait for it
while(!output_dict[pickle_list[pickle_list.index(picklename)-1]]):
continue
#thandosomesh
print("loaded")
output_dict[picklename]=custom_func_i_dunno(dapickle)
from multiprocessing import Pool
with Pool(processes=10) as pool:
pool.map(pickle_process_func, pickle_list)
Consider using HDF5 via h5py instead of pickle. The performance is generally much better than pickle with numerical data in Pandas and numpy data structures and it supports most common data types and compression.
I'm trying to save and load a list of tuples of 2 ndarrays and an int to and from a .csv file.
In my current implementation, when I save and load a list l, there is some error in the recovered list on the order of 10^-10. Is there a way to save and recover values more precisely? I would also appreciate comments on my code in general. Thanks!
This is what I have now:
def save_l(l,path):
tup=()
for X in l:
u=X[0].reshape(784*9)
v=X[2]*np.ones(1)
w=np.concatenate((u,X[1],v))
tup+=(w,)
L=np.row_stack(tup)
df=pd.DataFrame(L)
df.to_csv(path)
def load_l(path):
df=pd.read_csv(path)
L=df.values
l=[]
for v in L:
tup=()
for i in range(784):
tup+=(v[9*i+1:9*(i+1)+1],)
T=np.row_stack(tup)
Q=v[9*784+1:10*784+1]
i=v[7841]
l.append((T,Q,i))
return(l)
It may be possible that the issue you are experiencing is due to absence of .csv file protection during save and load.
a good way to make sure that your file is locked until all data are saved/loaded completely is using a context manager. In this way, you won't lose any data in case your system stops execution due to whatever reason, because all results are saved in the moment when they are available.
I recommend to use the with-statement, whose primary use is an exception-safe cleanup of the object used inside (in this case your .csv). In other words, with makes sure that files are closed, locks released, contexts restored etc.
with open("myfile.csv", "a") as reference: # Drop to csv w/ context manager
df.to_csv(reference, sep = ",", index = False) # Same goes for read_csv
# As soon as you are here, reference is closed
If you try this and still see your error, it's not due to save/load issues.
I am trying to change the value of a keyword in the header of a FITS file.
Quite simple, this is the code:
import pyfits
hdulist = pyfits.open('test.fits') # open a FITS file
prihdr = hdulist[1].header
print prihdr['AREASCAL']
effarea = prihdr['AREASCAL']/5.
print effarea
prihdr['AREASCAL'] = effarea
print prihdr['AREASCAL']
I print the steps many times to check the values are correct. And they are.
The problem is that, when I check the FITS file afterwards, the keyword value in the header is not changed. Why does this happen?
You are opening the file in read-only mode. This won't prevent you from modifying any of the in-memory objects, but closing or flushing to the file (as suggested in other answers to this question) won't make any changes to the file. You need to open the file in update mode:
hdul = pyfits.open(filename, mode='update')
Or better yet use the with statement:
with pyfits.open(filename, mode='update') as hdul:
# Make changes to the file...
# The changes will be saved and the underlying file object closed when exiting
# the 'with' block
You need to close the file, or explicitly flush it, in order to write the changes back:
hdulist.close()
or
hdulist.flush()
Interestingly, there's a tutorial for that in the astropy tutorials github. Here is the ipython notebook viewer version of that tutorial that explains it all.
Basically, you are noticing that the python instance does not interact with disk instance. You have to save a new file or overwrite the old one explicitly.
I'm using xlwt in python to create a Excel spreadsheet. You could interchange this for almost anything else that generates a file; it's what I want to do with the file that's important.
from xlwt import *
w = Workbook()
#... do something
w.save('filename.xls')
I want to I have two use cases for the file: I stream it out to the user's browser or I attach it to an email. In both cases the file only needs to exist the duration of the web request that generates it.
What I'm getting at, the reason for starting this thread is saving to a real file on the filesystem has its own hurdles (stopping overwriting, cleaning up the file once done). Is there somewhere I could "save" it where it lives only in memory and only for the duration of the request?
cStringIO
(or mmap if it should be mutable)
Generalising the answer, as you suggested: If the "anything else that generates a file" won't accept a file-like object as well as a filepath, then you can reduce the hassle by using tempfile.NamedTemporaryFile