HDF5 Attributes of External Links only accessible in “r” mode - python

Python 3.8
h5py 2.10.0
Windows 10
I have found that in any mode other than "r", the attributes are not accessible and an error is raised.
_casa.h5 is the HDF file which contains numerous links to external files.
"/149901/20118/VRM_DATA" is the path to a group within one of the external files.
This works:
# Open the file in read-only mode
hfile = h5py.File(r"..\casa\_data\_casa.h5", "r")
hfile["/149901/20118/VRM_DATA"].attrs
Out[21]: <Attributes of HDF5 object at 1660502855488>
hfile.close()
This does not work:
# Open the file in read/write mode (the file must exist)
hfile = h5py.File(r"..\casa\_data\_casa.h5", "r+")
hfile["/149901/20118/VRM_DATA"].attrs
Traceback (most recent call last):
File "C:\Users\HIAPRC\Miniconda3\envs\py38\lib\site-packages\IPython\core\interactiveshell.py", line 3418, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in <module>
hfile["/149901/20118/VRM_DATA"].attrs
File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "C:\Users\HIAPRC\Miniconda3\envs\py38\lib\site-packages\h5py\_hl\group.py", line 264, in __getitem__
oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py\h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (unable to open external file, external link file name = '.\sub_files\casa_2021-01-13_10-03-51.h5')"
I consulted the documentation and Googled around and did not find any answer to this.
Is this a result of design or a bug?
If this is by design then I guess I will have to edit attributes by opening the external file directly and make the changes there.
[EDIT] - HDFView 3.1.1 has no problem opening the _casa.h5 file, editing and saving edits done on the attributes of groups in external files.
Thanks in advance.

I pulled code from other SO answers into a single example. The code below creates 3 files, each with a single dataset with an attribute. It then creates another HDF5 file with ExternalLinks to the previously created files. The link objects also have attributes. All 4 files are then reopened in 'r' mode and attributes are accessed and printed. Maybe this will help you diagnose your problem. See below. Note: I am running Python 3.8.3 with h5py 2.10.0 on Windows 7 (Anaconda distribution to be precise).
import h5py
import numpy as np
# Create three files, each holding one dataset with an attribute
for fcnt in range(1, 4):
    fname = 'file' + str(fcnt) + '.h5'
    arr = np.random.random(50).reshape(10, 5)
    with h5py.File(fname, 'w') as h5fw:
        h5fw.create_dataset('data_' + str(fcnt), data=arr)
        h5fw['data_' + str(fcnt)].attrs['ds_attr'] = 'attribute ' + str(fcnt)
# Create a fourth file with an ExternalLink to the root group of each file
with h5py.File('SO_65705770.h5', mode='w') as h5fw:
    for fcnt in range(1, 4):
        h5name = 'file' + str(fcnt) + '.h5'
        link_obj = h5py.ExternalLink(h5name, '/')
        h5fw['link' + str(fcnt)] = link_obj
        # Set an attribute on the object reached through each link
        h5fw['link' + str(fcnt)].attrs['link_attr'] = 'attr ' + str(fcnt)
# Reopen the three data files in 'r' mode and print the dataset attributes
for fcnt in range(1, 4):
    fname = 'file' + str(fcnt) + '.h5'
    print(fname)
    with h5py.File(fname, 'r') as h5fr:
        print(h5fr['data_' + str(fcnt)].attrs['ds_attr'])
# Reopen the file with the links in 'r' mode and read attributes through the links
with h5py.File('SO_65705770.h5', mode='r') as h5fr:
    for fcnt in range(1, 4):
        print('file', fcnt, ":")
        print('link attr:', h5fr['link' + str(fcnt)].attrs['link_attr'])
        print('linked ds attr:', h5fr['link' + str(fcnt)]['data_' + str(fcnt)].attrs['ds_attr'])

The h5py documentation on external links was either unclear or I didn't understand it.
Anyway, it turned out that I had not assigned the external links properly.
Why it worked in one mode and not in another is beyond me. If my external link assignment had failed in all modes, I probably would have found the cause sooner.
EDIT: What I was doing wrong, and how to do it correctly to get what I wanted.
External File and what I want it to look like in the file containing the links.
What I did wrong:
with h5py.File("file_with_links.h5", mode="w") as h5fw:
    h5fw["/901/20111/VRM_DATA"] = h5py.ExternalLink(h5name, "/901/20111/VRM_DATA")
What I should have done:
with h5py.File("file_with_links.h5", mode="w") as h5fw:
    h5fw["/901/20111"] = h5py.ExternalLink(h5name, "/901/20111")

Related

Iterate over pathlib paths and python-docx: zipfile.BadZipFile

My Python skills are a bit rusty since I have recently been using mostly R. I ran into the following problem: I want to recursively iterate over all .docx files in a directory and change some of the core attributes with the python-docx package.
For the loop, I first created a list with pathlib and glob:
from docx import Document
from docx.shared import Inches
import pathlib
# Reading the stats dir
root_dir = pathlib.Path(r"C:\some\Björn\PycharmProjects\mre_docx")
# Get all word files in the stats directory
files = [x for x in root_dir.glob("**/*.docx") if x.is_file()]
files
Output of files looks fine.
[WindowsPath('C:/Users/Björn/PycharmProjects/mre_docx/test1.docx'),
WindowsPath('C:/Users/Björn/PycharmProjects/mre_docx/test2.docx')]
When I now try to read a document from the list, I get a zip error (see the full traceback below):
document = Document(files[1])
Traceback (most recent call last):
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\IPython\core\interactiveshell.py", line 3441, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-26-482c5438fa33>", line 1, in <module>
document = Document(files[1])
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\api.py", line 25, in Document
document_part = Package.open(docx).main_document_part
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\package.py", line 128, in open
pkg_reader = PackageReader.from_file(pkg_file)
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\pkgreader.py", line 32, in from_file
phys_reader = PhysPkgReader(pkg_file)
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\phys_pkg.py", line 101, in __init__
self._zipf = ZipFile(pkg_file, 'r')
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1257, in __init__
self._RealGetContents()
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1324, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
However, running the same line of code without the list works fine (apart from the different path separators, / versus \, which I thought should not matter because the list contains pathlib.Path objects).
document = Document(pathlib.Path(r"C:\Users\Björn\PycharmProjects\mre_docx\test1.docx"))
Edit in response to a comment
I created a total of 4 new Word files for this MRE. I entered text in two of them and left the other two empty. To my surprise, the empty ones are the ones that result in the error.
for file in files:
    try:
        document = Document(file)
    except Exception:
        # Report the file that could not be opened and continue with the next one
        print(f"The file: {file} appears to be corrupted")
Output:
The file: C:\Users\Björn\PycharmProjects\mre_docx\new_file.docx appears to be corrupted
The file: C:\Users\Björn\PycharmProjects\mre_docx\test2.docx appears to be corrupted
Semi-solution for future readers
Add a try/except block around the call to Document("Path/to/file.docx") and print the file for which the call failed. In my case it was just a few files, which I could easily edit manually.
You are not doing anything wrong; you get this error because the documents are empty. If you open those files and type something, the error goes away.
According to https://python-docx.readthedocs.io/en/latest/user/documents.html, you can open Word documents in several ways.
First:
document = Document()
document.save(files[1])
Second:
document = Document(files[1])
document.save(files[1])
Also, according to the docs, you can open them as file objects:
with open(files[1], 'rb') as f:
    document = Document(f)
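As a complement to the try/except approach, the files can be screened before they are handed to python-docx. This is only a sketch, assuming the failing files are simply not valid zip archives (which the BadZipFile message suggests); the helper name split_docx_files and the demo file names are made up for illustration. It uses only the standard library, so the demo builds a stand-in zip archive rather than a real Word document.

```python
import tempfile
import zipfile
from pathlib import Path


def split_docx_files(root_dir):
    """Return (readable, unreadable) lists of .docx paths found under root_dir.

    A path counts as readable here only if it is a file that passes
    zipfile.is_zipfile, the same check whose failure raises BadZipFile.
    """
    readable, unreadable = [], []
    for path in sorted(Path(root_dir).glob("**/*.docx")):
        if path.is_file() and zipfile.is_zipfile(path):
            readable.append(path)
        else:
            unreadable.append(path)
    return readable, unreadable


# Demo with hypothetical files: one zero-byte file and one minimal zip archive.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "empty.docx").touch()
    with zipfile.ZipFile(tmp_path / "filled.docx", "w") as zf:
        zf.writestr("word/document.xml", "<w:document/>")

    good, bad = split_docx_files(tmp_path)
    good_names = [p.name for p in good]
    bad_names = [p.name for p in bad]

print("readable:", good_names)
print("unreadable:", bad_names)
```

Only the paths in the first list would then be passed to Document(); the second list can be printed for manual repair.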

How to write on top of pandas HDF5 'read-only mode' files?

I am storing data using pandas built-in HDF5 methods.
Somehow, these HDF5 files were turned into read-only files: I get a lot of "Opening xxx in read-only mode" messages when I open them in write mode, and I cannot write to them, which is something I really need to do.
What I really don't understand is how those files became read-only, as I am not aware of any code I wrote that would cause that. (I checked whether the data stored in the HDF5 files is corrupt, but I can read and manipulate it, so it seems to be fine.)
I have 2 questions:
1. How can I append data to those 'read-only mode' HDF5 files? (Can I convert them back to write mode, or is there another clever solution?)
2. Is there any pandas method that changes an HDF5 file to 'read-only mode' by default, so that I can avoid turning those files read-only in the first place?
Code:
The code that raises this issue is the piece I use to save the output I generate:
with pd.HDFStore('data/observer/' + self._currency + '_' + str(ts)) as hdf:
    hdf.append(key='observers', value=df, format='table', data_columns=True)
I also use this piece of code to manipulate the outputs that were generated previously:
for the_file in list_dir:
    if currency in the_file:
        temp_df = pd.read_hdf(folder + the_file)
...
I use some select commands as well to get specific columns from the data files:
with pd.HDFStore('data/observer/' + self.currency + '_' + timestamp) as hdf:
    df = hdf.select(key='observers', columns=[x, y])
Error Traceback:
File ".../data_processing/observer_data.py", line 52, in save_obs_to_pandas
hdf.append(key='observers', value=df, format='table', data_columns=True)
File ".../venv/lib/python3.5/site-packages/pandas/io/pytables.py", line 963, in append
**kwargs)
File ".../venv/lib/python3.5/site-packages/pandas/io/pytables.py", line 1341, in _write_to_group
s.write(obj=value, append=append, complib=complib, **kwargs)
File ".../venv/lib/python3.5/site-packages/pandas/io/pytables.py", line 3930, in write
self.set_info()
File ".../venv/lib/python3.5/site-packages/pandas/io/pytables.py", line 3163, in set_info
self.attrs.info = self.info
File ".../venv/lib/python3.5/site-packages/tables/attributeset.py", line 464, in __setattr__
nodefile._check_writable()
File ".../venv/lib/python3.5/site-packages/tables/file.py", line 2119, in _check_writable
raise FileModeError("the file is not writable")
tables.exceptions.FileModeError: the file is not writable
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File ".../general_manager.py", line 144, in <module>
gm.run()
File ".../general_manager.py", line 114, in run
list_of_observer_managers = self.load_all_observer_managers()
File ".../general_manager.py", line 64, in load_all_observer_managers
observer = currency_pool.map(self.load_observer_manager, list_of_currencies)
File "/usr/lib/python3.5/multiprocessing/pool.py", line 260, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
raise self._value
tables.exceptions.FileModeError: the file is not writable
The issue was that I had messed up the OS file permissions. The files I was trying to access belonged to root (I had run the code that generated them as root), and I was trying to access them from a regular user account.
I am running Debian, and the following command (run as root) solved my issue:
chown -R user.user folder
This command recursively changes the owner and group of all files inside that folder to user. (The colon form, user:user, is the standard way to write the same thing.)
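To catch this kind of ownership problem before pandas falls back to read-only mode, the write permission can be checked up front. The following is a minimal sketch using only the standard library; the helper name is_writable and the file names in the demo are hypothetical and not part of the original code. Note that os.access only reports what the current user is allowed to do at the moment of the call, so it is a diagnostic aid, not a replacement for handling the error.

```python
import os
import tempfile


def is_writable(path):
    """Return True if the current user may write to path.

    For a path that does not exist yet, check whether its parent
    directory is writable, since that is what creating it requires.
    """
    if os.path.exists(path):
        return os.access(path, os.W_OK)
    parent = os.path.dirname(os.path.abspath(path))
    return os.access(parent, os.W_OK)


# Demo with hypothetical store names inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "EUR_1500000000")
    open(existing, "w").close()          # a file owned by the current user
    missing = os.path.join(tmp, "USD_1500000000")

    existing_ok = is_writable(existing)
    missing_ok = is_writable(missing)

print(existing_ok, missing_ok)
```

A file created by root and checked from a regular account would typically make is_writable return False, which points at ownership or permissions rather than at pandas.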

cx_freeze and docx - problems when freezing

I have a simple program that takes input from the user and then does some scraping with selenium. Since the user doesn't have a Python environment installed, I would like to convert it to an .exe. I usually use cx_freeze for that and have successfully converted .py programs to .exe before. At first some modules were missing (like lxml), but I was able to solve that. Now I think the only remaining problem is with the docx package.
This is how I initiate the new document in my program (I guess this is what causes me problems):
doc = Document()
#then I do some stuff to it and add paragraph and in the end...
doc.save('results.docx')
When I run it from Python everything works fine, but when I convert it to an .exe I get this error:
Traceback (most recent call last):
File "C:\Users\tyszkap\AppData\Local\Continuum\Anaconda3\lib\site-packages\cx_Freeze\initscripts\Console.py", line 27, in <module>
exec(code, m.__dict__)
File "tribunalRio.py", line 30, in <module>
File "C:\Users\tyszkap\AppData\Local\Continuum\Anaconda3\lib\site-packages\docx\api.py", line 25, in Document
document_part = Package.open(docx).main_document_part
File "C:\Users\tyszkap\AppData\Local\Continuum\Anaconda3\lib\site-packages\docx\opc\package.py", line 116, in open
pkg_reader = PackageReader.from_file(pkg_file)
File "C:\Users\tyszkap\AppData\Local\Continuum\Anaconda3\lib\site-packages\docx\opc\pkgreader.py", line 32, in from_file
phys_reader = PhysPkgReader(pkg_file)
File "C:\Users\tyszkap\AppData\Local\Continuum\Anaconda3\lib\site-packages\docx\opc\phys_pkg.py", line 31, in __new__
"Package not found at '%s'" % pkg_file
docx.opc.exceptions.PackageNotFoundError: Package not found at 'C:\Users\tyszkap\Dropbox (Dow Jones)\Python Projects\build\exe.win-amd64-3.4\library.zip\docx\templates\default.docx'
This is my setup.py program:
from cx_Freeze import setup, Executable
executable = Executable(script="tribunalRio.py")
# Include the template file and the required packages in the build
options = {
    "build_exe": {
        "include_files": ["default.docx"],
        "packages": ["lxml._elementpath", "inspect", "docx", "selenium"],
    }
}
setup(
    version="0",
    requires=[],
    options=options,
    executables=[executable],
)
I thought that explicitly adding default.docx to the package would solve the problem, but it didn't (I even tried adding it to library.zip, which gave me even more errors). I have seen this post, but I don't know what they mean by:
"copying the docx document.py module inside my function (instead of using Document()"
Any ideas? I know that freezing is not the best solution, but I really don't want to build a web interface for such a simple program...
EDIT:
I have just tried this solution:
import os
import sys
def find_data_file(filename):
    if getattr(sys, 'frozen', False):
        # The application is frozen
        datadir = os.path.dirname(sys.executable)
    else:
        # The application is not frozen
        # Change this bit to match where you store your data files:
        datadir = os.path.dirname(__file__)
    return os.path.join(datadir, filename)
doc = Document(find_data_file('default.docx'))
but I again receive a traceback (even though the file is in this location):
Traceback (most recent call last):
File "C:\Users\tyszkap\AppData\Local\Continuum\Anaconda3\lib\site-packages\cx_Freeze\initscripts\Console.py", line 27, in <module>
exec(code, m.__dict__)
File "tribunalRio.py", line 43, in <module>
File "C:\Users\tyszkap\AppData\Local\Continuum\Anaconda3\lib\site-packages\docx\api.py", line 25, in Document
document_part = Package.open(docx).main_document_part
File "C:\Users\tyszkap\AppData\Local\Continuum\Anaconda3\lib\site-packages\docx\opc\package.py", line 116, in open
pkg_reader = PackageReader.from_file(pkg_file)
File "C:\Users\tyszkap\AppData\Local\Continuum\Anaconda3\lib\site-packages\docx\opc\pkgreader.py", line 32, in from_file
phys_reader = PhysPkgReader(pkg_file)
File "C:\Users\tyszkap\AppData\Local\Continuum\Anaconda3\lib\site-packages\docx\opc\phys_pkg.py", line 31, in __new__
"Package not found at '%s'" % pkg_file
docx.opc.exceptions.PackageNotFoundError: Package not found at 'C:\Users\tyszkap\Dropbox (Dow Jones)\Python Projects\build\exe.win-amd64-3.4\default.docx'
What am I doing wrong?
I expect you'll find that the problem is your freezing operation not placing the default Document() template in the expected location. It's stored as package data in the python-docx package as docx/templates/default.docx (see setup.py here: https://github.com/python-openxml/python-docx/blob/master/setup.py#L37)
I don't know how to fix that in your case, but that looks like where the problem is.
I had the same problem and managed to get around it as follows. First, I located the default.docx file in site-packages and copied it into the same directory as my .py file. Document() accepts a docx argument, so I passed it the path to that copy: doc = Document(docx=os.path.join(os.getcwd(), 'default.docx')). The final step was to include the file in the freezing process. Et voilà! So far I have had no problems.

Read contents of .tar.gz file from website into a python 3.x object

I am new to Python. I can't figure out what I am doing wrong when trying to read the contents of a .tar.gz file into Python. The tarfile I would like to read is hosted at the following web address:
ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz
more info on file at this site (just so you can trust contents)
http://www.pubmedcentral.nih.gov/utils/oa/oa.fcgi?id=PMC13901
The tarfile contains .pdf and .nxml copies of the journal article, and also a couple of image files.
If I paste the address into my browser, I can save the file to a location on my PC and open the tarfile fine with the following commands (note: WinZip changes the file from .tar.gz to simply .tar when I save it):
import tarfile
thetarfile = "C:/Users/dfcm/Documents/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar"
tfile = tarfile.open(thetarfile)
tfile
However, if I try to access the file directly using similar commands:
thetarfile = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
bbb = tarfile.open(thetarfile)
That results in the following error:
Traceback (most recent call last):
File "<pyshell#137>", line 1, in <module>
bbb = tarfile.open(thetarfile)
File "C:\Python30\lib\tarfile.py", line 1625, in open
return func(name, "r", fileobj, **kwargs)
File "C:\Python30\lib\tarfile.py", line 1687, in gzopen
fileobj = bltn_open(name, mode + "b")
File "C:\Python30\lib\io.py", line 278, in __new__
return open(*args, **kwargs)
File "C:\Python30\lib\io.py", line 222, in open
closefd)
File "C:\Python30\lib\io.py", line 615, in __init__
_fileio._FileIO.__init__(self, name, mode, closefd)
IOError: [Errno 22] Invalid argument: 'ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar'
Can anyone explain what I am doing wrong when trying to read the .tar.gz file directly from the web address? Thanks in advance. Chris
Unfortunately, you cannot open a file on the network just by passing its URL to tarfile.open. You have to instruct the interpreter to make a network request and create an object representing the connection. This can be done with the urllib.request module.
import urllib.request
import tarfile
thetarfile = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
ftpstream = urllib.request.urlopen(thetarfile)
thetarfile = tarfile.open(fileobj=ftpstream, mode="r|gz")
The ftpstream object is a file-like object that represents the connection to the FTP server, and the tarfile module can read from this stream. Since we do not pass a filename, we have to specify the compression in the mode parameter.
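One detail worth keeping in mind is that the stream mode "r|gz" only supports reading the archive sequentially, so each member should be read while iterating over it. The sketch below illustrates that pattern without touching the network: it builds a small .tar.gz in memory and reads it back through a file-like object, which stands in for the FTP stream. The function names and the sample member names are invented for the example.

```python
import io
import tarfile


def build_sample_targz():
    """Build a small gzip-compressed tar archive in memory (stand-in for the download)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in (("article.nxml", b"<article/>"),
                              ("article.pdf", b"%PDF-sample")):
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_members(stream):
    """Read every regular file from a gzip-compressed tar stream, in order."""
    contents = {}
    with tarfile.open(fileobj=stream, mode="r|gz") as tar:
        for member in tar:
            if member.isfile():
                # In stream mode the data must be read before moving on.
                contents[member.name] = tar.extractfile(member).read()
    return contents


members = read_members(io.BytesIO(build_sample_targz()))
print(sorted(members))
```

With a real download, the object returned by urllib.request.urlopen would be passed to read_members in place of the io.BytesIO object. If random access to members is needed, one option is to read the whole response into an io.BytesIO first and open that with mode "r:gz".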

Python Zipfile - Invalid Argument Errno 22

I have a .zip file and would like to know the names of the files within it. Here's the code:
import glob
import zipfile
zip_path = glob.glob(path + '/*.zip')[0]
file = open(zip_path, 'r')  # opens without error
if zipfile.is_zipfile(file):
    print(str(file))  # prints to console
    my_zipfile = zipfile.ZipFile(zip_path)  # throws IOError
Here is the traceback:
<open file u'/Users/me/Documents/project/uploads/assets/peter/offline_message/offline_imgs.zip', mode 'r' at 0x107b2a150>
Traceback (most recent call last):
File "/Users/me/Documents/project/admin_dev/proj_name/views.py", line 1680, in get_dps_app_builder_assets
link_to_assets_zip = zip_dps_app_builder_assets(server_url, app_slug, button_slugs)
File "/Users/me/Documents/project/admin_dev/proj_name/views.py", line 1724, in zip_dps_app_builder_assets
my_zipfile = zipfile.ZipFile(zip_path)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 712, in __init__
self._GetContents()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 746, in _GetContents
self._RealGetContents()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 779, in _RealGetContents
fp.seek(self.start_dir, 0)
IOError: [Errno 22] Invalid argument
I am very confused as to why this is happening, since the file is clearly there and is a valid .zip file. The documentation clearly states that you can pass either the path to the file or a file-like object, neither of which works in my case:
http://docs.python.org/2/library/zipfile#zipfile-objects
I was not able to figure this issue out and ended up doing it a different way entirely.
EDIT: In the Django app I work with, users needed to be able to upload assets in the form of .zip files and later download everything they had uploaded (plus other content we generate dynamically) in another zip with a different structure. So I wanted to unzip a previously uploaded file and zip its contents up in another zip, which I couldn't do because of the error. Instead of reading the zip file when the user requested the download, I ended up unzipping it from a Django InMemoryUploadedFile (whose contents I was able to read successfully) and just leaving the unzipped files on the file system to work with later. The zip contains only two smallish image files, so this workaround of unzipping ahead of time worked OK for my purposes.
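For reference, this is roughly what listing the names looks like when the archive is read from a binary file-like object, similar in spirit to reading from the in-memory upload. It is a sketch written for Python 3 (the traceback above is from Python 2.7), it does not diagnose the Errno 22 itself, and the helper name list_zip_names and the member names are made up for the demo.

```python
import io
import zipfile


def list_zip_names(source):
    """Return the member names of a zip archive.

    source may be a filesystem path or a binary, seekable file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


# Build a small archive in memory to stand in for the uploaded file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("img_one.png", b"first image bytes")
    archive.writestr("img_two.png", b"second image bytes")
buffer.seek(0)

names = list_zip_names(buffer)
print(names)
```

The same function accepts a path such as zip_path directly, so no separate open() call is needed just to list the contents.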
