Create HDF5 file using pytables with table format and data columns - python

I want to read a h5 file previously created with PyTables.
The file is read using Pandas, and with some conditions, like this:
pd.read_hdf('myH5file.h5', 'anyTable', where='some_conditions')
From another question, I have been told that, in order for a h5 file to be "queryable" with read_hdf's where argument it must be writen in table format and, in addition, some columns must be declared as data columns.
I cannot find anything about it in PyTables documentation.
The documentation on PyTable's create_table method does not indicate anything about it.
So, right now, if I try to use something like that on my h5 file createed with PyTables I get the following:
>>> d = pd.read_hdf('test_file.h5','basic_data', where='operation==1')
C:\Python27\lib\site-packages\pandas\io\pytables.py:3070: IncompatibilityWarning:
where criteria is being ignored as this version [0.0.0] is too old (or
not-defined), read the file in and write it out to a new file to upgrade (with
the copy_to method)
warnings.warn(ws, IncompatibilityWarning)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 323, in read_hdf
return f(store, True)
File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 305, in <lambda>
key, auto_close=auto_close, **kwargs)
File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 665, in select
return it.get_result()
File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 1359, in get_result
results = self.func(self.start, self.stop, where)
File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 658, in func
columns=columns, **kwargs)
File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 3968, in read
if not self.read_axes(where=where, **kwargs):
File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 3196, in read_axes
values = self.selection.select()
File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 4482, in select
start=self.start, stop=self.stop)
File "C:\Python27\lib\site-packages\tables\table.py", line 1567, in read_where
self._where(condition, condvars, start, stop, step)]
File "C:\Python27\lib\site-packages\tables\table.py", line 1528, in _where
compiled = self._compile_condition(condition, condvars)
File "C:\Python27\lib\site-packages\tables\table.py", line 1366, in _compile_condition
compiled = compile_condition(condition, typemap, indexedcols)
File "C:\Python27\lib\site-packages\tables\conditions.py", line 430, in compile_condition
raise _unsupported_operation_error(nie)
NotImplementedError: unsupported operand types for *eq*: int, bytes
EDIT:
The traceback mentions something about IncompatibilityWarning and version [0.0.0], however if I check my versions of Pandas and Tables I get:
>>> import pandas
>>> pandas.__version__
'0.15.2'
>>> import tables
>>> tables.__version__
'3.1.1'
So, I am totally confused.

I had the same issue, and this is what I have done.
Create a HDF5 file by PyTables;
Read this HDF5 file by pandas.read_hdf and use parameters like "where = where_string, columns = selected_columns"
I got the warning message like below and other error messages:
D:\Program
Files\Anaconda3\lib\site-packages\pandas\io\pytables.py:3065:
IncompatibilityWarning: where criteria is being ignored as this
version [0.0.0] is too old (or not-defined), read the file in and
write it out to a new file to upgrade (with the copy_to method)
warnings.warn(ws, IncompatibilityWarning)
I tried commands like this:
hdf5_store = pd.HDFStore(hdf5_file, mode = 'r')
h5cpt_store_new = hdf5_store.copy(hdf5_new_file, complevel=9, complib='blosc')
h5cpt_store_new.close()
And run the command exactly like step 2, it works.
pandas.version
'0.17.1'
tables.version
'3.2.2'

Related

Pandas giving file not found error while trying to access csv via anaconda prompt

Beginner here. trying to load this table via python so i can figure out how i can manipulate it and gain some insight with the eventual intention of calculating the WOE and/or running a regression.
The command ran fine on a test db of two rows i created so it must be something to do with the format of the csv im trying to use. Its a file with 8000 customers and 50 associated variables including some dates and then counts, sums and averages for 30, 60 and 90 day windows of a number of different factors. Could any of this be the reason i get the error message at the bottom?
(* are just redaction's)
data = pd.read_csv("C:\Users\******\Desktop\*******.csv")
>>> data = pd.read_csv(r"C:\Users\******\Desktop\**************")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\******\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\******\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 429, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\Users\******\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 895, in __init__
self._make_engine(self.engine)
File "C:\Users\******\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1122, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "C:\Users\******\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1853, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 387, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 705, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File b'C:\\Users\\******\\Desktop\\**************' does not exist: b'C:\\Users\\******\\Desktop\\**************'
....
add r(raw string) before ":
data = pd.read_csv(r"C:\Users******\Desktop*******.csv")
You should replace single backslash with double backslash, like so
data = pd.read_csv("C:\\Users******\\Desktop*******.csv")
or prefix path with r
data = pd.read_csv(r"C:\Users******\Desktop*******.csv")
See here for full description on which characters need escaping in python strings.
Its better to create a separate folder where keep both data and your csv file...
Then just read by only file name... Try to press tab when you are in parenthesis
because it will give you also suggestion where you will see if the file is available or not.
df = pd.read_csv('filename.csv)

How can read Minecraft .mca files so that in python I can extract individual blocks?

I can't find a way of reading the Minecraft world files in a way that i could use in python
I've looked around the internet but can find no tutorials and only a few libraries that claim that they can do this but never actually work
from nbt import *
nbtfile = nbt.NBTFile("r.0.0.mca",'rb')
I expected this to work but instead I got errors about the file not being compressed or something of the sort
Full error:
Traceback (most recent call last):
File "C:\Users\rober\Desktop\MinePy\MinecraftWorldReader.py", line 2, in <module>
nbtfile = nbt.NBTFile("r.0.0.mca",'rb')
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nbt\nbt.py", line 628, in __init__
self.parse_file()
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nbt\nbt.py", line 652, in parse_file
type = TAG_Byte(buffer=self.file)
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nbt\nbt.py", line 99, in __init__
self._parse_buffer(buffer)
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nbt\nbt.py", line 105, in _parse_buffer
self.value = self.fmt.unpack(buffer.read(self.fmt.size))[0]
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\gzip.py", line 276, in read
return self._buffer.read(size)
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\gzip.py", line 463, in read
if not self._read_gzip_header():
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\gzip.py", line 411, in _read_gzip_header
raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'\x00\x00')
Use anvil parser. (Install with pip install anvil-parser)
Reading
import anvil
region = anvil.Region.from_file('r.0.0.mca')
# You can also provide the region file name instead of the object
chunk = anvil.Chunk.from_region(region, 0, 0)
# If `section` is not provided, will get it from the y coords
# and assume it's global
block = chunk.get_block(0, 0, 0)
print(block) # <Block(minecraft:air)>
print(block.id) # air
print(block.properties) # {}
https://pypi.org/project/anvil-parser/
According to this page, the .mca files is not totally kind of of NBT file. It begins with an 8KiB header which includes the offsets of chunks in the region file itself and the timestamps for the last updates of those chunks.
I recommend you to see the offical announcement and this page for more information.

Inserting data to impala table using Ibis python

I'm trying to insert df into a ibis created impala table with partition. I am running this on remote kernel using spyder 3.2.4 on windows 10 machine and python 3.6.2 on edge node machine running CentOS.
I get following error:
Writing DataFrame to temporary file
Writing CSV to: /tmp/ibis/pandas_0032f9dd1916426da62c8b4d8f4dfb92/0.csv
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in
insert = target_table.insert(df3)
File "/usr/local/lib/python3.6/site-packages/ibis/impala/client.py", line 1674, in insert
writer, expr = write_temp_dataframe(self._client, obj)
File "/usr/local/lib/python3.6/site-packages/ibis/impala/pandas_interop.py", line 225, in write_temp_dataframe
return writer, writer.delimited_table(path)
File "/usr/local/lib/python3.6/site-packages/ibis/impala/pandas_interop.py", line 188, in delimited_table
schema = self.get_schema()
File "/usr/local/lib/python3.6/site-packages/ibis/impala/pandas_interop.py", line 184, in get_schema
return pandas_to_ibis_schema(self.df)
File "/usr/local/lib/python3.6/site-packages/ibis/impala/pandas_interop.py", line 219, in pandas_to_ibis_schema
return schema(pairs)
File "/usr/local/lib/python3.6/site-packages/ibis/expr/api.py", line 105, in schema
return Schema.from_tuples(pairs)
File "/usr/local/lib/python3.6/site-packages/ibis/expr/datatypes.py", line 109, in from_tuples
return Schema(names, types)
File "/usr/local/lib/python3.6/site-packages/ibis/expr/datatypes.py", line 55, in init
self.types = [validate_type(typ) for typ in types]
File "/usr/local/lib/python3.6/site-packages/ibis/expr/datatypes.py", line 55, in
self.types = [validate_type(typ) for typ in types]
File "/usr/local/lib/python3.6/site-packages/ibis/expr/datatypes.py", line 1040, in validate_type
return TypeParser(t).parse()
File "/usr/local/lib/python3.6/site-packages/ibis/expr/datatypes.py", line 901, in parse
t = self.type()
File "/usr/local/lib/python3.6/site-packages/ibis/expr/datatypes.py", line 1033, in type
raise SyntaxError('Type cannot be parsed: {}'.format(self.text))
File "", line unknown
SyntaxError: Type cannot be parsed: integer
Error was coming due to structure and security of the hadoop system. Ibis package tries to create temp_db & temp_hdfs_location in __ibis_tmp & /tmp/ibis/ respectively. Since in our system default locations are not open to any user other than root/system admin... insert command was erroring out when getting data from /tmp/ibis/ to actual db (still not clear but may be via __ibis_tmp dbase). Once we edited the config_init.py file for ibis package to a allowed temp location/db. It worked like a charm.
instead of editing the config_init.py mentioned
https://stackoverflow.com/a/47543691/5485370
It is easier to assign the temp db and path using the ibis.options:
ibis.options.impala.temp_db = 'your_temp_db'
ibis.options.impala.temp_hdfs_path = 'your_temp_hdfs_path'

saving pandas dataframe as hdf5

Using pandas version 0.19.1 (with py27-tables-3.2.2_1 and hdf5-1.10.0 installed on my system), I am trying to save a pandas dataframe as a .h5 with:
import pandas as pd
df = pd.DataFrame(dict(A=range(5), B=range(5)))
df.to_hdf('savefile.h5', 'table', mode='w')
However the following error results:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/pandas/core/generic.py", line 1138, in to_hdf
return pytables.to_hdf(path_or_buf, key, self, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/io/pytables.py", line 270, in to_hdf
f(store)
File "/usr/local/lib/python2.7/site-packages/pandas/io/pytables.py", line 264, in <lambda>
f = lambda store: store.put(key, value, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/io/pytables.py", line 873, in put
self._write_to_group(key, value, append=append, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/io/pytables.py", line 1315, in _write_to_group
s.write(obj=value, append=append, complib=complib, **kwargs)
File "/usr/local/lib/python2.7/site-packages/pandas/io/pytables.py", line 2864, in write
self.attrs.ndim = data.ndim
File "/usr/local/lib/python2.7/site-packages/tables/attributeset.py", line 461, in __setattr__
self._g__setattr(name, value)
File "/usr/local/lib/python2.7/site-packages/tables/attributeset.py", line 403, in _g__setattr
self._g_setattr(self._v_node, name, stvalue)
File "tables/hdf5extension.pyx", line 696, in tables.hdf5extension.AttributeSet._g_setattr (tables/hdf5extension.c:7549)
tables.exceptions.HDF5ExtError: HDF5 error back trace
File "H5A.c", line 634, in H5Awrite
not an attribute
End of HDF5 error back trace
Can't set attribute 'ndim' in node:
/table (Group) ''.
Could someone provide a simple working example of how to save a pandas dataframe in hdf5 format.
pytables is currently not compatible with hdf5-1.10 as reported in this issue on github, downgrading to hdf5-0.8 is the recommended solution.

pandas HDFStore - how to reopen?

I created a file by using:
store = pd.HDFStore('/home/.../data.h5')
and stored some tables using:
store['firstSet'] = df1
store.close()
I closed down python and reopened in a fresh environment.
How do I reopen this file?
When I go:
store = pd.HDFStore('/home/.../data.h5')
I get the following error.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/misc/apps/linux/python-2.6.1/lib/python2.6/site-packages/pandas-0.10.0-py2.6-linux-x86_64.egg/pandas/io/pytables.py", line 207, in __init__
self.open(mode=mode, warn=False)
File "/misc/apps/linux/python-2.6.1/lib/python2.6/site-packages/pandas-0.10.0-py2.6-linux-x86_64.egg/pandas/io/pytables.py", line 302, in open
self.handle = _tables().openFile(self.path, self.mode)
File "/apps/linux/python-2.6.1/lib/python2.6/site-packages/tables/file.py", line 230, in openFile
return File(filename, mode, title, rootUEP, filters, **kwargs)
File "/apps/linux/python-2.6.1/lib/python2.6/site-packages/tables/file.py", line 495, in __init__
self._g_new(filename, mode, **params)
File "hdf5Extension.pyx", line 317, in tables.hdf5Extension.File._g_new (tables/hdf5Extension.c:3039)
tables.exceptions.HDF5ExtError: HDF5 error back trace
File "H5F.c", line 1582, in H5Fopen
unable to open file
File "H5F.c", line 1373, in H5F_open
unable to read superblock
File "H5Fsuper.c", line 334, in H5F_super_read
unable to find file signature
File "H5Fsuper.c", line 155, in H5F_locate_signature
unable to find a valid file signature
End of HDF5 error back trace
Unable to open/create file '/home/.../data.h5'
What am I doing wrong here? Thank you.
In my hands, following approach works best:
df = pd.DataFrame(...)
"write"
with pd.HDFStore('test.h5', mode='w') as store:
store.append('df', df, data_columns= df.columns, format='table')
"read"
with pd.HDFStore('test.h5', mode='r') as newstore:
df_restored = newstore.select('df')
You could try doing instead:
store = pd.io.pytables.HDFStore('/home/.../data.h5')
df1 = store['firstSet']
or use the read method directly:
df1 = pd.read_hdf('/home/.../data.h5', 'firstSet')
Either way, you should have pandas 0.12.0 or higher...
I had the same problem and finally fixed it by installing the pytables module (next to the pandas modules which I was using):
conda install pytables
which got me numexpr-2.4.3 and pytables-3.2.0
After that it worked. I am using pandas 0.16.2 under python 2.7.9

Categories

Resources