Can't access external link with python + h5py

Recently I have started working with .hdf5 files and still can't figure out how to properly use external links.
I have a few .hdf5 files, each with the same structure (same keys and data types). I want to merge them into one file but keep their contents separate, with a different key for each file.
Here's what I do:
myfile = h5py.File("/path_to_the_directory/merged_images.hdf5", 'w')
myfile['0.000'] = h5py.ExternalLink("img_000.hdf5", "/path_to_the_directory/images")
myfile['0.001'] = h5py.ExternalLink("img_001.hdf5", "/path_to_the_directory/images")
myfile.flush()
Then I try to read it with:
myfile = h5py.File("/path_to_the_directory/merged_images.hdf5", 'r')
keys = list(myfile.keys())
print(keys)
print(list(myfile[keys[0]]))
The line print(keys) gives me ['0.000', '0.001']. So, I believe the file's structure is okay.
And the next line gives me an exception:
KeyError: "Unable to open object (unable to open external file, external link file name = 'img_000.hdf5')"
Am I doing something wrong? The documentation is pretty poor and I haven't found a relevant use case there.

The problem is that you are mixing up paths. It is important to distinguish between two types of paths:
File path: the location of the file on your hard drive.
Dataset path: the path internal to the HDF5 file; it does not depend on where you store the file.
The syntax of h5py.ExternalLink, as mentioned in the documentation, is:
myfile['/path/of/link'] = h5py.ExternalLink('/path/to/file.hdf5', '/path/to/dataset')
I would therefore encourage you to use a relative file path for the ExternalLink. If you do, everything will keep working even if you move the collection of files to a new location on your hard drive (or give them to somebody else).
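Applied to the code from the question, the two arguments simply need to be swapped. A minimal sketch, assuming the external files sit next to merged_images.hdf5 and that the data inside each of them lives at the internal path /images:
import h5py

# ExternalLink takes the file path first, then the dataset path inside that file.
myfile = h5py.File("/path_to_the_directory/merged_images.hdf5", 'w')
myfile['0.000'] = h5py.ExternalLink("img_000.hdf5", "/images")
myfile['0.001'] = h5py.ExternalLink("img_001.hdf5", "/images")
myfile.close()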
With the correct paths, your example works, as shown below.
Note that, to illustrate my remark about relative file paths, I have made all the dataset paths absolute (these are only internal to the file and do not depend on where the file is stored on the hard drive), while keeping the file paths relative.
import h5py
import numpy as np

myfile = h5py.File('test_a.hdf5', 'w')
myfile['/path/to/data'] = np.array([0, 1, 2])
myfile.close()

myfile = h5py.File('test_b.hdf5', 'w')
myfile['/path/to/data'] = np.array([3, 4, 5])
myfile.close()

myfile = h5py.File('test.hdf5', 'w')
myfile['/a'] = h5py.ExternalLink('test_a.hdf5', '/path/to/data')
myfile['/b'] = h5py.ExternalLink('test_b.hdf5', '/path/to/data')
myfile.close()

myfile = h5py.File('test.hdf5', 'r')
keys = list(myfile.keys())
print(keys)
print(list(myfile[keys[0]]))
print(list(myfile[keys[1]]))
myfile.close()
Prints (as expected):
['a', 'b']
[0, 1, 2]
[3, 4, 5]
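As a side note, when a link fails to resolve you can still inspect it without dereferencing it, since h5py's Group.get supports getlink=True. A small sketch using the file created above:
with h5py.File('test.hdf5', 'r') as myfile:
    link = myfile.get('a', getlink=True)
    print(link.filename, link.path)  # prints: test_a.hdf5 /path/to/data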

Related

Edit multiple text files, and save as new files

My first post on StackOverflow, so please be nice; in other words, I'm a super beginner at Python.
I want to read multiple files from a folder, split the text, and save the output as a new file. I have figured out this part of the code, but it only works on one file at a time. I have tried googling, but can't figure out how to apply this code to multiple text files in a folder and save each result as "output" plus a number, one per file. Is this something that's doable?
with open("file_path") as fReader:
corpus = fReader.read()
loc = corpus.find("\n\n")
print(corpus[:loc], file=open("output.txt","a"))
You could work with a list, like:
from pathlib import Path

source_dir = Path("./")  # path to the directory
files = [x for x in source_dir.iterdir() if x.is_file()]

for i in range(len(files)):
    file = files[i]
    outfile = "output_" + str(i) + file.suffix
    with open(file) as fReader, open(outfile, "w") as fOut:
        corpus = fReader.read()
        loc = corpus.find("\n\n")
        fOut.write(corpus[:loc])
Welcome to the site. Yes, what you are asking is completely doable, and you are on the right track. You will need to do a little research/practice with the os module, which is highly useful when working with files. The two functions you will want to look into are:
os.path.join()
os.listdir()
I would suggest you put two folders in the same directory as your Python file, one called data and the other called output to catch the results. Start by seeing whether you can list all the files in your data directory, then keep building up that loop. Something like this should list all the files:
# folder file lister/test writer
import os

source_folder_name = 'data'    # the folder to be read, in the SAME directory as this file
output_folder_name = 'output'  # will be used later...

files = os.listdir(source_folder_name)

# get this working first
for f in files:
    print(f)

# make output file names and just write a 1-liner into each file...
for f in files:
    output_filename = f.split('.')[0]  # the part before the period
    output_filename += '_output.csv'
    output_path = os.path.join(output_folder_name, output_filename)
    with open(output_path, 'w') as writer:
        writer.write('some data')
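Putting the pieces together with the splitting logic from the question, a minimal sketch might look like this (it assumes plain-text files in data/ and an existing output/ folder, as described above):
import os

source_folder_name = 'data'
output_folder_name = 'output'

for f in os.listdir(source_folder_name):
    with open(os.path.join(source_folder_name, f)) as fReader:
        corpus = fReader.read()
    loc = corpus.find("\n\n")
    output_path = os.path.join(output_folder_name, f.split('.')[0] + '_output.txt')
    with open(output_path, 'w') as fOut:
        fOut.write(corpus[:loc])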

json dump() into specific folder

This seems like it should be simple enough, but I haven't been able to find a working example of how to approach it. Simply put, I am generating a JSON file based on a list that a script generates. What I would like to do is use some variables when calling dump() and produce the JSON file in a specific folder. By default it dumps into the same place the .py file is located, but I can't seem to find a way to produce the JSON file in a new folder of my choice:
import json
name = 'Best'
season = '2019-2020'
blah = ['steve','martin']
with open(season + '.json', 'w') as json_file:
    json.dump(blah, json_file)
Take for example the above. What I'd want to do is the following:
Take the variable 'name' and use it to generate a folder of the same name inside the folder holding the .py file itself. The JSON file would then be placed in that folder, where I can manipulate it.
Right now my issue is that I can't find a way to produce the file in a specific folder. Any suggestions? This does seem simple enough, but nothing I've found has a method to do it. Thanks!
Python's pathlib is quite convenient to use for this task:
import json
from pathlib import Path
data = ['steve','martin']
season = '2019-2020'
Paths of the new directory and json file:
base = Path('Best')
jsonpath = base / (season + ".json")
Create the directory if it does not exist, and write the json file:
base.mkdir(exist_ok=True)
jsonpath.write_text(json.dumps(data))
This will create the directory relative to the directory you started the script in. If you want an absolute path, you could use Path('/somewhere/Best').
If you want to start the script while being in some other directory and still create the new directory next to the script itself, use: Path(__file__).resolve().parent / 'Best'.
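Put together, the whole script is just a few lines (a sketch using the same names as above):
import json
from pathlib import Path

data = ['steve', 'martin']
season = '2019-2020'

base = Path('Best')
jsonpath = base / (season + ".json")

base.mkdir(exist_ok=True)
jsonpath.write_text(json.dumps(data))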
First of all, instead of doing everything in one place, have a separate function that creates the folder (if not already present) and dumps the JSON data, as below:
import os
import json

def write_json(target_path, target_file, data):
    if not os.path.exists(target_path):
        try:
            os.makedirs(target_path)
        except Exception as e:
            print(e)
            raise
    with open(os.path.join(target_path, target_file), 'w') as f:
        json.dump(data, f)
Then call your function like:
write_json('/usr/home/target', 'my_json.json', my_json_data)
Use string formatting:
import json
import os

name = 'Best'
season = '2019-2020'
blah = ['steve', 'martin']

try:
    os.mkdir(name)
except OSError as error:
    print(error)

with open("{}/{}.json".format(name, season), 'w') as json_file:
    json.dump(blah, json_file)
Use os.path.join():
with open(os.path.join(name, season + '.json'), 'w') as json_file:
The advantage over writing a literal slash is that os.path.join automatically picks the right separator for the operating system you are on (forward slash on Linux, backslash on Windows).
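Note that the folder still has to exist before open() can create a file inside it. A minimal sketch combining this with the question's variables (os.makedirs with exist_ok=True is one way to ensure that):
import os
import json

name = 'Best'
season = '2019-2020'
blah = ['steve', 'martin']

os.makedirs(name, exist_ok=True)  # create the folder if it does not exist yet
with open(os.path.join(name, season + '.json'), 'w') as json_file:
    json.dump(blah, json_file)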

MetaData of downloaded zipped file

import zipfile
import urllib
from io import BytesIO

url = 'http://www.test.com/test.zip'
z = zipfile.ZipFile(BytesIO(urllib.urlopen(url).read()))
z.extractall(path='D:')
I am using the above code to download a zipped file from a URL; it downloads and extracts all the files to a specified drive, and it is working fine.
Is there a way I can get the metadata of all the files extracted from z, for example
filenames, file sizes and file extensions, etc.?
ZipFile objects actually have built-in tools for this that you can use without even extracting anything. infolist() returns a list of ZipInfo objects from which you can read certain information, including the full file name and the uncompressed size.
import os
import zipfile
import urllib
from io import BytesIO

url = 'http://www.test.com/test.zip'
z = zipfile.ZipFile(BytesIO(urllib.urlopen(url).read()))
info = z.infolist()
data = []
for obj in info:
    name = os.path.splitext(obj.filename)
    data.append((name[0], name[1], obj.file_size))  # (name, extension, uncompressed size)
I also used os.path.splitext to separate the file's name from its extension, since you asked for the file type separately from the name.
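You can then print the collected metadata, for example:
for name, extension, size in data:
    print("%s%s: %d bytes" % (name, extension, size))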
I don't know of a built-in way to do that using the zipfile module; however, it is easily done using os.path:
import os
import zipfile
import urllib
from io import BytesIO

EXTRACT_PATH = "D:"
url = 'http://www.test.com/test.zip'
z = zipfile.ZipFile(BytesIO(urllib.urlopen(url).read()))
z.extractall(path=EXTRACT_PATH)
extracted_files = [os.path.join(EXTRACT_PATH, filename) for filename in z.namelist()]
for extracted_file in extracted_files:
    # All metadata operations here, such as:
    print(os.path.getsize(extracted_file))

Opening a file in a folder in Python

I want to open a file to write to.
with open('test.txt', 'a') as textfile:
    ...
It works like this.
Now I want this file to be opened/created in a directory named by args.runkeyword.
with open(os.path.join(args.runkeyword, 'test.txt'),'a') as textfile:
It says it can't find test/test.txt (supposing runkeyword is test).
I also tried joining the path with os.getcwd(), but it still can't find or create the file.
Any ideas?
os.getcwd() is actually irrelevant to your problem. Use os.listdir() to see every entry in a directory; if something was already named test beforehand, that may be the problem.
A recursive function like this may be useful for you:
import os

def tara(directory):
    start = os.getcwd()
    files = []
    os.chdir(directory)
    for oge in os.listdir(os.curdir):
        if not os.path.isdir(oge):
            files.append(oge)
        else:
            files.extend(tara(oge))
    os.chdir(start)
    return files
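Usage is then simply (hypothetical directory name):
print(tara('test'))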
file = open('test.txt', 'a+')
Note that 'a' already opens the file for appending; the + in 'a+' additionally allows reading.

pypdf Merging multiple pdf files into one pdf

If I have 1000+ PDF files that need to be merged into one PDF:
from PyPDF2 import PdfReader, PdfWriter

writer = PdfWriter()
for i in range(1000):
    filepath = f"my/pdfs/{i}.pdf"
    reader = PdfReader(open(filepath, "rb"))
    for page in reader.pages:
        writer.add_page(page)

with open("document-output.pdf", "wb") as fh:
    writer.write(fh)
Executing the above code, when it reaches reader = PdfReader(open(filepath, "rb")),
an error message appears:
IOError: [Errno 24] Too many open files
I think this is a bug. If not, what should I do?
I recently came across this exact same problem, so I dug into PyPDF2 to see what's going on, and how to resolve it.
Note: I am assuming that filename is a well-formed file path string; assume the same for all of my code.
The Short Answer
Use the PdfFileMerger() class instead of the PdfFileWriter() class. I've tried to make the following resemble your code as closely as I could:
from PyPDF2 import PdfFileMerger, PdfFileReader

[...]

merger = PdfFileMerger()
for filename in filenames:
    merger.append(PdfFileReader(file(filename, 'rb')))
merger.write("document-output.pdf")
The Long Answer
The way you're using PdfFileReader and PdfFileWriter keeps each file open, eventually causing Python to raise IOError 24. To be more specific, when you add a page to the PdfFileWriter, you are adding references to the page in the open PdfFileReader (hence the noted IOError if you close the file). Python detects that the file is still referenced and doesn't do any garbage collection / automatic file closing, despite the file handle being re-used. The files remain open until PdfFileWriter no longer needs access to them, which is when writer.write(fh) is called in your code.
To solve this, create in-memory copies of the content and allow the files to be closed. I noticed in my adventures through the PyPDF2 code that the PdfFileMerger() class already has this functionality, so instead of re-inventing the wheel, I opted to use it. I learned, though, that my initial look at PdfFileMerger wasn't close enough, and that it only creates copies under certain conditions.
My initial attempts looked like the following, and resulted in the same IO problems:
merger = PdfFileMerger()
for filename in filenames:
    merger.append(filename)
merger.write(output_file_path)
Looking at the PyPDF2 source code, we see that append() requires fileobj to be passed, and then uses the merge() function, passing in its last page as the new file's position. merge() does the following with fileobj (before opening it with PdfFileReader(fileobj)):
if type(fileobj) in (str, unicode):
    fileobj = file(fileobj, 'rb')
    my_file = True
elif type(fileobj) == file:
    fileobj.seek(0)
    filecontent = fileobj.read()
    fileobj = StringIO(filecontent)
    my_file = True
elif type(fileobj) == PdfFileReader:
    orig_tell = fileobj.stream.tell()
    fileobj.stream.seek(0)
    filecontent = StringIO(fileobj.stream.read())
    fileobj.stream.seek(orig_tell)
    fileobj = filecontent
    my_file = True
We can see that append() does accept a string, and when given one, assumes it is a file path and creates a file object at that location. The end result is exactly the thing we're trying to avoid: a PdfFileReader() object holding a file open until the output is eventually written!
However, if we make either a file object of the file path string or a PdfFileReader (see Edit 2) object of the path string before it gets passed into append(), it will automatically be copied for us into a StringIO object, allowing Python to close the file.
I would recommend the simpler merger.append(file(filename, 'rb')), as others have reported that a PdfFileReader object may stay open in memory, even after calling writer.close().
Hope this helped!
EDIT: I assumed you were using PyPDF2, not PyPDF. If you aren't, I highly recommend switching, as PyPDF is no longer maintained, with the author giving his official blessing to Phaseit in developing PyPDF2.
If for some reason you cannot swap to PyPDF2 (licensing, system restrictions, etc.), then PdfFileMerger won't be available to you. In that situation you can re-use the code from PyPDF2's merge() function (provided above) to create a copy of the file as a StringIO object, and use that in your code in place of the file object.
EDIT 2: Previous recommendation of using merger.append(PdfFileReader(file(filename, 'rb'))) changed based on comments (Thanks #Agostino).
The pdfrw package reads each file all in one go, so it will not suffer from the problem of too many open files. Here is an example concatenation script.
The relevant part, assuming inputs is a list of input filenames and outfn is an output file name:
from pdfrw import PdfReader, PdfWriter

writer = PdfWriter()
for inpfn in inputs:
    writer.addpages(PdfReader(inpfn).pages)
writer.write(outfn)
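For instance, matching the hypothetical file layout from the question:
inputs = [f"my/pdfs/{i}.pdf" for i in range(1000)]
outfn = "document-output.pdf"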
Disclaimer: I am the primary pdfrw author.
The problem is that you are only allowed to have a certain number of files open at any given time. There are ways to change this limit (http://docs.python.org/3/library/resource.html#resource.getrlimit), but I don't think you need that here.
What you could try is closing the files in the for loop:
output = PdfFileWriter()

for file in filenames:
    f = open(file, 'rb')
    input = PdfFileReader(f)
    # Some code
    f.close()
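For reference, if you do want to inspect or raise the limit, the resource module linked above can do it on POSIX systems (a sketch; not available on Windows):
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)
# raise the soft limit as far as the hard limit allows
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))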
I have written this code to help with the answer:
import sys
import os
import PyPDF2

merger = PyPDF2.PdfFileMerger()

# get PDF files and path
path = sys.argv[1]
pdfs = sys.argv[2:]
os.chdir(path)

# iterate over the documents
for pdf in pdfs:
    try:
        # if the doc exists, then merge
        if os.path.exists(pdf):
            input = PyPDF2.PdfFileReader(open(pdf, 'rb'))
            merger.append(input)
        else:
            print(f"problem with file {pdf}")
    except:
        print("can't merge !! sorry")
    else:
        print(f" {pdf} Merged !!! ")

merger.write("Merged_doc.pdf")
Here I have used PyPDF2.PdfFileMerger and PyPDF2.PdfFileReader, instead of explicitly converting the file name to a file object.
It may be just what it says: you are opening too many files.
You can explicitly use f = file(filename) ... f.close() in the loop, or use the with statement, so that each opened file is properly closed.
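A sketch of the with variant, reading each file fully into memory so it can be closed immediately (this sidesteps the reference problem described in the accepted answer; the filenames list is assumed from the question):
from io import BytesIO
from PyPDF2 import PdfFileReader, PdfFileWriter

output = PdfFileWriter()
for filename in filenames:
    with open(filename, 'rb') as f:
        reader = PdfFileReader(BytesIO(f.read()))  # copy into memory so the file can close
    for i in range(reader.numPages):
        output.addPage(reader.getPage(i))

with open("document-output.pdf", "wb") as out:
    output.write(out)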
