saving pandas dataframe as hdf5 - python

Using pandas version 0.19.1 (with py27-tables-3.2.2_1 and hdf5-1.10.0 installed on my system), I am trying to save a pandas dataframe as a .h5 with:
import pandas as pd
df = pd.DataFrame(dict(A=range(5), B=range(5)))
df.to_hdf('savefile.h5', 'table', mode='w')
However the following error results:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/pandas/core/generic.py", line 1138, in to_hdf
return pytables.to_hdf(path_or_buf, key, self, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/io/pytables.py", line 270, in to_hdf
f(store)
File "/usr/local/lib/python2.7/site-packages/pandas/io/pytables.py", line 264, in <lambda>
f = lambda store: store.put(key, value, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/io/pytables.py", line 873, in put
self._write_to_group(key, value, append=append, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/io/pytables.py", line 1315, in _write_to_group
s.write(obj=value, append=append, complib=complib, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/io/pytables.py", line 2864, in write
self.attrs.ndim = data.ndim
File "/usr/local/lib/python2.7/site-packages/tables/attributeset.py", line 461, in __setattr__
self._g__setattr(name, value)
File "/usr/local/lib/python2.7/site-packages/tables/attributeset.py", line 403, in _g__setattr
self._g_setattr(self._v_node, name, stvalue)
File "tables/hdf5extension.pyx", line 696, in tables.hdf5extension.AttributeSet._g_setattr (tables/hdf5extension.c:7549)
tables.exceptions.HDF5ExtError: HDF5 error back trace
File "H5A.c", line 634, in H5Awrite
not an attribute
End of HDF5 error back trace
Can't set attribute 'ndim' in node:
/table (Group) ''.
Could someone provide a simple working example of how to save a pandas dataframe in hdf5 format?

PyTables is currently not compatible with hdf5-1.10, as reported in an issue on GitHub. Downgrading to hdf5-1.8 is the recommended solution.
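Since the question asks for a simple working example, here is a minimal sketch of a round trip. It assumes a pandas and PyTables installation built against an HDF5 version that PyTables supports; the temporary directory is only there for illustration.

```python
import os
import tempfile

import pandas as pd

# Requires the optional PyTables dependency (the "tables" package).
df = pd.DataFrame({"A": range(5), "B": range(5)})

# Illustrative location only; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "savefile.h5")

# Write the frame under the key "table", replacing any existing file.
df.to_hdf(path, key="table", mode="w")

# Read it back to confirm that the file is usable.
restored = pd.read_hdf(path, key="table")
print(restored)
```

If the write still fails with an HDF5ExtError, the installed HDF5 and PyTables versions are the first thing to check.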

Related

system error when creating date_range with pandas

I would like to create a date_range() using pandas. I am fairly sure it worked before I updated the pandas package.
With the following line of code, I am trying to create the date_range():
date_time_index = pd.date_range(start='1/1/2018', periods=8760, freq='H')
and here is the error message:
ValueError: Error parsing datetime string "1/1/2018" at position 1
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "main.py", line 36, in <module>
date_time_index = pd.date_range(start='1/1/2018', periods=8760, freq='H')
File "/usr/local/lib/python3.6/dist-packages/pandas/tseries/index.py", line 2024, in date_range
closed=closed, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/pandas/util/decorators.py", line 91, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/pandas/tseries/index.py", line 301, in __new__
ambiguous=ambiguous)
File "/usr/local/lib/python3.6/dist-packages/pandas/tseries/index.py", line 403, in _generate
start = Timestamp(start)
File "pandas/tslib.pyx", line 406, in pandas.tslib.Timestamp.__new__ (pandas/tslib.c:9940)
File "pandas/tslib.pyx", line 1401, in pandas.tslib.convert_to_tsobject (pandas/tslib.c:25239)
File "pandas/tslib.pyx", line 1516, in pandas.tslib.convert_str_to_tsobject (pandas/tslib.c:26859)
File "pandas/src/datetime.pxd", line 141, in datetime._string_t
SystemError: <class 'str'> returned a result with an error set
What am I doing wrong?
In pandas 0.19.1, date_range() did not work with the input I gave. After I updated pandas to 0.23.4, everything works.
To upgrade:
pip3 install --upgrade pandas
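If upgrading is not an option, one thing to try is bypassing the string parser, since the traceback fails inside convert_str_to_tsobject. The sketch below passes a datetime object and a timedelta instead of strings. Whether this avoids the error on 0.19.1 is an assumption and has not been verified against that version.

```python
import datetime

import pandas as pd

# Pass a datetime instead of the string '1/1/2018', so no string parsing is needed.
start = datetime.datetime(2018, 1, 1)

# An hourly frequency expressed as a timedelta rather than a frequency alias.
one_hour = datetime.timedelta(hours=1)

date_time_index = pd.date_range(start=start, periods=8760, freq=one_hour)
print(date_time_index[0], date_time_index[-1])
```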

Reading in csv file as dataframe from hdfs

I'm using pydoop to read in a file from hdfs, and when I use:
import pydoop.hdfs as hd
with hd.open("/home/file.csv") as f:
    print f.read()
It shows me the file in stdout.
Is there any way for me to read in this file as dataframe? I've tried using pandas' read_csv("/home/file.csv"), but it tells me that the file cannot be found. The exact code and error is:
>>> import pandas as pd
>>> pd.read_csv("/home/file.csv")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 498, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 275, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 590, in __init__
self._make_engine(self.engine)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 731, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 1103, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "pandas/parser.pyx", line 353, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3246)
File "pandas/parser.pyx", line 591, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6111)
IOError: File /home/file.csv does not exist
I know next to nothing about hdfs, but I wonder if the following might work:
with hd.open("/home/file.csv") as f:
    df = pd.read_csv(f)
I assume read_csv works with a file handle, or in fact any iterable that will feed it lines. I know the numpy csv readers do.
pd.read_csv("/home/file.csv") would work if the regular Python file open works, i.e. if it can read the file as a regular local file:
with open("/home/file.csv") as f:
    print f.read()
But evidently hd.open is using some other location or protocol, so the file is not local. If my suggestion doesn't work, then you (or we) need to dig more into the hdfs documentation.
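The assumption that read_csv accepts an open file-like object can be checked without an HDFS cluster. In this minimal sketch, io.StringIO stands in for the handle an HDFS client would return; the column names and values are made up.

```python
import io

import pandas as pd

# Stand-in for an open HDFS file handle; the CSV contents are invented for the demo.
handle = io.StringIO("id,value\n1,10\n2,20\n")

# read_csv accepts any object with a read() method, not only a path string.
df = pd.read_csv(handle)
print(df)
```

So the open question is only whether the object returned by hd.open behaves enough like a regular file for the parser.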
You can use the following code to read a csv from hdfs:
import pandas as pd
import pyarrow as pa
hdfs_config = {
    "host": "XXX.XXX.XXX.XXX",
    "port": 8020,
    "user": "user"
}
fs = pa.hdfs.connect(hdfs_config['host'], hdfs_config['port'],
                     user=hdfs_config['user'])
df = pd.read_csv(fs.open("/home/file.csv"))
Use read instead of open; it works:
with hd.read("/home/file.csv") as f:
    df = pd.read_csv(f)

Python layer can't read hdf5 file in caffe framework

I am writing a Python layer for Caffe that reads HDF5 data with some manipulation I need. However, I have a problem opening and reading the HDF5 file in the layer's setup method.
When I use tables with this code in setup:
def setup(self, bottom, top):
    h5file = tables.open_file("/home/titan/models/hdf5/train_small.h5", driver="H5FD_CORE")
I get this error when running the net:
Traceback (most recent call last):
File "/home/titan/scripts/python_layers/pydata_hdf5.py", line 37, in setup
h5file = tables.open_file("/home/titan/models/hdf5/train_small.h5", driver="H5FD_CORE")
File "/home/titan/anaconda/lib/python2.7/site-packages/tables/file.py", line 318, in open_file
return File(filename, mode, title, root_uep, filters, **kwargs)
File "/home/titan/anaconda/lib/python2.7/site-packages/tables/file.py", line 784, in __init__
self._g_new(filename, mode, **params)
File "tables/hdf5extension.pyx", line 465, in tables.hdf5extension.File._g_new (tables/hdf5extension.c:4872)
tables.exceptions.HDF5ExtError: HDF5 error back trace
File "../../../src/H5FDcore.c", line 273, in H5Pset_fapl_core
not a file access property list
File "../../../src/H5Pint.c", line 3371, in H5P_object_verify
property list is not a member of the class
File "../../../src/H5Pint.c", line 3321, in H5P_isa_class
not a property list
End of HDF5 error back trace
When I use h5py:
def setup(self, bottom, top):
    self.data = h5py.File('/home/titan/models/hdf5_nose_mouth/train_small.h5', 'r')
I get a similar error:
Traceback (most recent call last):
File "/home/titan/scripts/python_layers/pydata_hdf5.py", line 11, in <module>
import h5py
File "/home/titan/anaconda/lib/python2.7/site-packages/h5py/__init__.py", line 31, in <module>
from .highlevel import *
File "/home/titan/anaconda/lib/python2.7/site-packages/h5py/highlevel.py", line 13, in <module>
from ._hl.base import is_hdf5, HLObject
File "/home/titan/anaconda/lib/python2.7/site-packages/h5py/_hl/base.py", line 78, in <module>
dlapl = default_lapl()
File "/home/titan/anaconda/lib/python2.7/site-packages/h5py/_hl/base.py", line 65, in default_lapl
lapl = h5p.create(h5p.LINK_ACCESS)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2458)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2415)
File "h5py/h5p.pyx", line 130, in h5py.h5p.create (-------src-dir-------/h5py/h5p.c:2491)
ValueError: Not a property list class (Not a property list class)
When I use deepdish:
def setup(self, bottom, top):
    self.data = dd.io.load('/home/titan/models/hdf5/train_smallest.h5')
I get an error too:
Traceback (most recent call last):
File "/home/titan/scripts/python_layers/pydata_hdf5.py", line 36, in setup
self.data = dd.io.load('/home/titan/models/hdf5/train_smallest.h5')
File "/home/titan/anaconda/lib/python2.7/site-packages/deepdish/io/hdf5io.py", line 476, in load
with tables.open_file(path, mode='r') as h5file:
File "/home/titan/anaconda/lib/python2.7/site-packages/tables/file.py", line 318, in open_file
return File(filename, mode, title, root_uep, filters, **kwargs)
File "/home/titan/anaconda/lib/python2.7/site-packages/tables/file.py", line 784, in __init__
self._g_new(filename, mode, **params)
File "tables/hdf5extension.pyx", line 488, in tables.hdf5extension.File._g_new (tables/hdf5extension.c:5081)
tables.exceptions.HDF5ExtError: HDF5 error back trace
File "../../../src/H5F.c", line 1582, in H5Fopen
not file access property list
File "../../../src/H5Pint.c", line 3321, in H5P_isa_class
not a property list
End of HDF5 error back trace
But when I simply read a *.txt file, everything works. I can also read this HDF5 file from the console and use it in Caffe's HDF5 layer. How can I read an HDF5 file from a Python layer?
Try installing another version of h5py. I just solved it with:
pip install h5py==prev_version
I guess it has something to do with linking, but it would be interesting to know the exact cause of this problem.
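If the linking guess is right, a first diagnostic step is to compare which HDF5 version each Python binding reports. The sketch below is only that, a diagnostic under the assumption that a version mismatch is involved; hdf5_versions is a hypothetical helper name, and it does not inspect the library Caffe itself links against.

```python
def hdf5_versions():
    """Return the HDF5 version reported by each installed binding, keyed by package name."""
    found = {}
    try:
        import h5py
        # HDF5 version h5py was built against.
        found["h5py"] = h5py.version.hdf5_version
    except Exception as exc:  # the import itself may fail, as in the traceback above
        found["h5py"] = "import failed: %r" % (exc,)
    try:
        import tables
        # HDF5 version PyTables is using.
        found["tables"] = tables.hdf5_version
    except Exception as exc:
        found["tables"] = "import failed: %r" % (exc,)
    return found


print(hdf5_versions())
```

Running this both from a plain console and from inside the layer's setup method may show whether the two environments load different libraries.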

Create HDF5 file using pytables with table format and data columns

I want to read an h5 file previously created with PyTables.
The file is read using pandas, with some conditions, like this:
pd.read_hdf('myH5file.h5', 'anyTable', where='some_conditions')
In another question, I was told that for an h5 file to be "queryable" with read_hdf's where argument, it must be written in table format and, in addition, some columns must be declared as data columns.
I cannot find anything about this in the PyTables documentation.
The documentation for PyTables' create_table method does not mention it.
So, right now, if I try to use something like that on my h5 file created with PyTables, I get the following:
>>> d = pd.read_hdf('test_file.h5','basic_data', where='operation==1')
C:\Python27\lib\site-packages\pandas\io\pytables.py:3070: IncompatibilityWarning:
where criteria is being ignored as this version [0.0.0] is too old (or
not-defined), read the file in and write it out to a new file to upgrade (with
the copy_to method)
warnings.warn(ws, IncompatibilityWarning)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 323, in read_hdf
return f(store, True)
File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 305, in <lambda>
key, auto_close=auto_close, **kwargs)
File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 665, in select
return it.get_result()
File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 1359, in get_result
results = self.func(self.start, self.stop, where)
File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 658, in func
columns=columns, **kwargs)
File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 3968, in read
if not self.read_axes(where=where, **kwargs):
File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 3196, in read_axes
values = self.selection.select()
File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 4482, in select
start=self.start, stop=self.stop)
File "C:\Python27\lib\site-packages\tables\table.py", line 1567, in read_where
self._where(condition, condvars, start, stop, step)]
File "C:\Python27\lib\site-packages\tables\table.py", line 1528, in _where
compiled = self._compile_condition(condition, condvars)
File "C:\Python27\lib\site-packages\tables\table.py", line 1366, in _compile_condition
compiled = compile_condition(condition, typemap, indexedcols)
File "C:\Python27\lib\site-packages\tables\conditions.py", line 430, in compile_condition
raise _unsupported_operation_error(nie)
NotImplementedError: unsupported operand types for *eq*: int, bytes
EDIT:
The traceback mentions something about IncompatibilityWarning and version [0.0.0], however if I check my versions of Pandas and Tables I get:
>>> import pandas
>>> pandas.__version__
'0.15.2'
>>> import tables
>>> tables.__version__
'3.1.1'
So, I am totally confused.
I had the same issue, and this is what I did:
1. Create an HDF5 file with PyTables.
2. Read this HDF5 file with pandas.read_hdf, using parameters like "where = where_string, columns = selected_columns".
I got the warning message below, along with other error messages:
D:\Program Files\Anaconda3\lib\site-packages\pandas\io\pytables.py:3065: IncompatibilityWarning: where criteria is being ignored as this version [0.0.0] is too old (or not-defined), read the file in and write it out to a new file to upgrade (with the copy_to method)
warnings.warn(ws, IncompatibilityWarning)
I then tried commands like this:
hdf5_store = pd.HDFStore(hdf5_file, mode = 'r')
h5cpt_store_new = hdf5_store.copy(hdf5_new_file, complevel=9, complib='blosc')
h5cpt_store_new.close()
After that, running the command exactly as in step 2 works.
Versions used:
pandas.__version__
'0.17.1'
tables.__version__
'3.2.2'
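For comparison, here is a minimal sketch of what "table format with data columns" looks like when pandas writes the file itself. The column names follow the question (operation, basic_data), while the values and the temporary path are made up. It assumes PyTables is installed and does not cover files written directly with PyTables' create_table.

```python
import os
import tempfile

import pandas as pd

# Made-up data; only the "operation" column name comes from the question.
df = pd.DataFrame({"operation": [1, 2, 1, 3], "value": [0.1, 0.2, 0.3, 0.4]})

# Illustrative location only.
path = os.path.join(tempfile.mkdtemp(), "test_file.h5")

# format="table" makes the store queryable; data_columns lists the columns
# that may appear in a where expression.
df.to_hdf(path, key="basic_data", mode="w", format="table", data_columns=["operation"])

# The where clause is evaluated on disk, so only matching rows are loaded.
subset = pd.read_hdf(path, key="basic_data", where="operation == 1")
print(subset)
```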

pandas HDFStore - how to reopen?

I created a file by using:
store = pd.HDFStore('/home/.../data.h5')
and stored some tables using:
store['firstSet'] = df1
store.close()
I closed down python and reopened in a fresh environment.
How do I reopen this file?
When I go:
store = pd.HDFStore('/home/.../data.h5')
I get the following error.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/misc/apps/linux/python-2.6.1/lib/python2.6/site-packages/pandas-0.10.0-py2.6-linux-x86_64.egg/pandas/io/pytables.py", line 207, in __init__
self.open(mode=mode, warn=False)
File "/misc/apps/linux/python-2.6.1/lib/python2.6/site-packages/pandas-0.10.0-py2.6-linux-x86_64.egg/pandas/io/pytables.py", line 302, in open
self.handle = _tables().openFile(self.path, self.mode)
File "/apps/linux/python-2.6.1/lib/python2.6/site-packages/tables/file.py", line 230, in openFile
return File(filename, mode, title, rootUEP, filters, **kwargs)
File "/apps/linux/python-2.6.1/lib/python2.6/site-packages/tables/file.py", line 495, in __init__
self._g_new(filename, mode, **params)
File "hdf5Extension.pyx", line 317, in tables.hdf5Extension.File._g_new (tables/hdf5Extension.c:3039)
tables.exceptions.HDF5ExtError: HDF5 error back trace
File "H5F.c", line 1582, in H5Fopen
unable to open file
File "H5F.c", line 1373, in H5F_open
unable to read superblock
File "H5Fsuper.c", line 334, in H5F_super_read
unable to find file signature
File "H5Fsuper.c", line 155, in H5F_locate_signature
unable to find a valid file signature
End of HDF5 error back trace
Unable to open/create file '/home/.../data.h5'
What am I doing wrong here? Thank you.
In my experience, the following approach works best:
df = pd.DataFrame(...)
# write
with pd.HDFStore('test.h5', mode='w') as store:
    store.append('df', df, data_columns=df.columns, format='table')
# read
with pd.HDFStore('test.h5', mode='r') as newstore:
    df_restored = newstore.select('df')
You could try doing instead:
store = pd.io.pytables.HDFStore('/home/.../data.h5')
df1 = store['firstSet']
or use the read method directly:
df1 = pd.read_hdf('/home/.../data.h5', 'firstSet')
Either way, you should have pandas 0.12.0 or higher...
I had the same problem and finally fixed it by installing the pytables module (next to the pandas modules which I was using):
conda install pytables
which got me numexpr-2.4.3 and pytables-3.2.0
After that it worked. I am using pandas 0.16.2 under python 2.7.9
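The "unable to find a valid file signature" line in the traceback means the HDF5 library did not find its 8-byte format signature in the file. The HDF5 file format places that signature at byte 0 or at offsets 512, 1024, 2048 and so on. A minimal sketch of a check using only the standard library follows; has_hdf5_signature is a hypothetical helper name, not part of pandas or PyTables.

```python
import os

# The 8-byte signature that begins an HDF5 superblock.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def has_hdf5_signature(path):
    """Return True if the signature is found at offset 0, 512, 1024, 2048, ..."""
    size = os.path.getsize(path)
    offset = 0
    with open(path, "rb") as handle:
        while offset + len(HDF5_SIGNATURE) <= size:
            handle.seek(offset)
            if handle.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE:
                return True
            # Candidate offsets are 0, then 512 doubling each time.
            offset = 512 if offset == 0 else offset * 2
    return False
```

If has_hdf5_signature returns False for the data.h5 file from the question, the file on disk is empty, truncated, or not an HDF5 file at all, and no pandas call will be able to open it.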
