Iterate over pathlib paths and python-docx: zipfile.BadZipFile - python

My Python skills are a bit rusty, since I have recently been working mostly in R. I ran into the following problem: my goal is to recursively iterate over all .docx files in a directory and change some of the core attributes with the python-docx package.
For the loop, I first created a list with pathlib and glob
from docx import Document
from docx.shared import Inches
import pathlib
# Reading the stats dir
root_dir = pathlib.Path(r"C:\Users\Björn\PycharmProjects\mre_docx")
# Get all word files in the stats directory
files = [x for x in root_dir.glob("**/*.docx") if x.is_file()]
files
Output of files looks fine.
[WindowsPath('C:/Users/Björn/PycharmProjects/mre_docx/test1.docx'),
WindowsPath('C:/Users/Björn/PycharmProjects/mre_docx/test2.docx')]
When I now want to read in a document with the list I get a zip error (see full traceback below)
document = Document(files[1])
Traceback (most recent call last):
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\IPython\core\interactiveshell.py", line 3441, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-26-482c5438fa33>", line 1, in <module>
document = Document(files[1])
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\api.py", line 25, in Document
document_part = Package.open(docx).main_document_part
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\package.py", line 128, in open
pkg_reader = PackageReader.from_file(pkg_file)
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\pkgreader.py", line 32, in from_file
phys_reader = PhysPkgReader(pkg_file)
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\phys_pkg.py", line 101, in __init__
self._zipf = ZipFile(pkg_file, 'r')
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1257, in __init__
self._RealGetContents()
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1324, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
However, just running the same line of code without the list works fine (except for the difference in path separators, / versus \, which I thought should not matter because the list contains pathlib.Path objects).
document = Document(pathlib.Path(r"C:\Users\Björn\PycharmProjects\mre_docx\test1.docx"))
Edit in response to a comment
I created a total of four new Word files for this MRE: I entered text in two of them and left the other two empty. To my surprise, the empty ones are the ones that produce the error.
for file in files:
    try:
        document = Document(file)
    except:
        print(f"The file: {file} appears to be corrupted")
Output:
The file: C:\Users\Björn\PycharmProjects\mre_docx\new_file.docx appears to be corrupted
The file: C:\Users\Björn\PycharmProjects\mre_docx\test2.docx appears to be corrupted
Semi-Solution for Future Readers
Wrap the call to Document("Path/to/file.docx") in a try/except block and print the file for which the call failed. In my case it was only a few files, which I could easily fix manually.
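For reference, here is a slightly tighter version of that check as a sketch: it catches only the zip error and collects the offending paths (files is the pathlib list built above).
import zipfile
from docx import Document

bad_files = []
for file in files:
    try:
        document = Document(file)
    except zipfile.BadZipFile:
        # empty or otherwise non-OPC .docx files end up here
        bad_files.append(file)
        print(f"Could not open {file}: not a valid .docx package")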

You are not doing anything wrong; you get this error because the documents are empty. If you open those files and type something in them, you will no longer get the error.
According to https://python-docx.readthedocs.io/en/latest/user/documents.html you can open Word documents in a few different ways.
First:
document = Document()
document.save(files[1])
Second:
document = Document(files[1])
document.save(files[1])
The docs also show that you can open them from a file-like object:
with open(files[1], 'rb') as f:
    document = Document(f)
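To come back to the original goal (changing the core properties of every .docx under a directory), here is a minimal sketch along those lines, with hypothetical property values; core_properties is the python-docx accessor for document metadata.
import pathlib
import zipfile
from docx import Document

root_dir = pathlib.Path(r"C:\Users\Björn\PycharmProjects\mre_docx")
for path in root_dir.glob("**/*.docx"):
    if not path.is_file():
        continue
    try:
        document = Document(path)
    except zipfile.BadZipFile:
        print(f"Skipping {path}: empty or not a valid .docx")
        continue
    props = document.core_properties
    props.author = "Björn"      # hypothetical values; set whichever attributes you need
    props.title = path.stem
    document.save(path)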

Related

HDF5 Attributes of External Links only accessible in “r” mode

Python 3.8
h5py 2.10.0
Windows 10
I have found that when using any mode other than "r", attributes are not accessible and an error is raised.
_casa.h5 is the HDF file which contains numerous links to external files.
"/149901/20118/VRM_DATA" is the path to a group within one of the external files.
This works:
# open file in read only mode
hfile = h5py.File("..\casa\_data\_casa.h5", "r")
hfile["/149901/20118/VRM_DATA"].attrs
Out[21]: <Attributes of HDF5 object at 1660502855488>
hfile.close()
This does not work:
# Open file in Read/write, file must exist mode
hfile = h5py.File("..\casa\_data\_casa.h5", "r+")
hfile["/149901/20118/VRM_DATA"].attrs
Traceback (most recent call last):
File "C:\Users\HIAPRC\Miniconda3\envs\py38\lib\site-packages\IPython\core\interactiveshell.py", line 3418, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-...>", line 1, in <module>
hfile["/149901/20118/VRM_DATA"].attrs
File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "C:\Users\HIAPRC\Miniconda3\envs\py38\lib\site-packages\h5py\_hl\group.py", line 264, in __getitem__
oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py\h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (unable to open external file, external link file name = '.\sub_files\casa_2021-01-13_10-03-51.h5')"
I consulted the documentation and Googled around and did not find any answer to this.
Is this a result of design or a bug?
If this is by design then I guess I will have to edit attributes by opening the external file directly and make the changes there.
[EDIT] - HDFView 3.1.1 has no problem opening the _casa.h5 file, editing the attributes of groups in external files, and saving those edits.
Thanks in advance.
I pulled code from other SO answers into a single example. The code below creates 3 files, each with a single dataset with an attribute. It then creates another HDF5 file with ExternalLinks to the previously created files. The link objects also have attributes. All 4 files are then reopened in 'r' mode and attributes are accessed and printed. Maybe this will help you diagnose your problem. See below. Note: I am running Python 3.8.3 with h5py 2.10.0 on Windows 7 (Anaconda distribution to be precise).
import h5py
import numpy as np

for fcnt in range(1, 4, 1):
    fname = 'file' + str(fcnt) + '.h5'
    arr = np.random.random(50).reshape(10, 5)
    with h5py.File(fname, 'w') as h5fw:
        h5fw.create_dataset('data_' + str(fcnt), data=arr)
        h5fw['data_' + str(fcnt)].attrs['ds_attr'] = 'attribute ' + str(fcnt)

with h5py.File('SO_65705770.h5', mode='w') as h5fw:
    for fcnt in range(1, 4, 1):
        h5name = 'file' + str(fcnt) + '.h5'
        h5fw['link' + str(fcnt)] = h5py.ExternalLink(h5name, '/')
        h5fw['link' + str(fcnt)].attrs['link_attr'] = 'attr ' + str(fcnt)

for fcnt in range(1, 4, 1):
    fname = 'file' + str(fcnt) + '.h5'
    print(fname)
    with h5py.File(fname, 'r') as h5fr:
        print(h5fr['data_' + str(fcnt)].attrs['ds_attr'])

with h5py.File('SO_65705770.h5', mode='r') as h5fr:
    for fcnt in range(1, 4, 1):
        print('file', fcnt, ":")
        print('link attr:', h5fr['link' + str(fcnt)].attrs['link_attr'])
        print('linked ds attr:', h5fr['link' + str(fcnt)]['data_' + str(fcnt)].attrs['ds_attr'])
The h5py documentation on external links was either not clear or I didn't understand it.
Anyway, it turned out that I had not assigned the external links properly.
Why it worked in one mode and not another is beyond me. If the way I did the external link assignment had failed in all modes, I probably would have figured out the cause sooner.
EDIT: What I was doing wrong and how to do it correctly to get what I wanted...
External File and what I want it to look like in the file containing the links.
What I did wrong.
with h5py.File("file_with_links.h5", mode="w") as h5fw:
    h5fw["/901/20111/VRM_DATA"] = h5py.ExternalLink(h5name, "/901/20111/VRM_DATA")
What I should have done:
with h5py.File("file_with_links.h5", mode="w") as h5fw:
    h5fw["/901/20111"] = h5py.ExternalLink(h5name, "/901/20111")

Extracting files from stream-mode tarfile

I've got a stream that contains the contents of a .tar file, so I work with it using tarfile.open(mode='r|').
What I need to do is look at the list of files inside it, read some of them, and then upload the whole tar somewhere else.
When I call tarfile.extractfile() after tarfile.getnames(), it raises a tarfile.StreamError. But I cannot extract a file whose name I don't know.
How can I get the list of files without breaking the tarfile stream? I cannot save the whole tar to RAM or disk, because some files inside it can be larger than 10 GB.
>>> tf = tarfile.open(fileobj=open('Downloads/clean-alpine.ova', 'rb'), mode='r|')
>>> tfn = tf.getnames()
>>> tfn
['clean-alpine.ovf', 'clean-alpine.mf', 'clean-alpine-disk1.vmdk']
>>> tf.fileobj
<tarfile._Stream object at 0x7ff878dac7b8>
>>> tf.fileobj.pos
33595392
>>> ovf = tf.extractfile('clean-alpine.ovf')
>>> ovf
<ExFileObject name=''>
>>> d = ovf.read().decode()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.6/tarfile.py", line 696, in read
self.fileobj.seek(offset + (self.position - start))
File "/usr/lib/python3.6/tarfile.py", line 522, in seek
raise StreamError("seeking backwards is not allowed")
tarfile.StreamError: seeking backwards is not allowed
Looking at the source of TarFile.extractall(), the important bit is to use the TarFile object as an iterable, like I did in my use case:
from pathlib import Path

for member in tf:
    if not member.isfile():
        continue
    dest = Path.cwd() / member.name  # This is vulnerable to, like, 5 things
    with tf.extractfile(member) as tfobj:
        dest.write_bytes(tfobj.read())
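If some members are too large to hold in memory, the same iteration pattern works with chunked reads. A rough sketch, assuming a 1 MiB chunk size and the same local .ova file as above, with member paths flattened to sidestep the path-traversal concern noted in the comment:
import tarfile
from pathlib import Path

with tarfile.open(fileobj=open('Downloads/clean-alpine.ova', 'rb'), mode='r|') as tf:
    for member in tf:
        if not member.isfile():
            continue
        src = tf.extractfile(member)
        dest = Path.cwd() / Path(member.name).name  # keep only the base name
        with open(dest, 'wb') as out:
            while True:
                chunk = src.read(1024 * 1024)  # 1 MiB at a time; huge members never sit fully in RAM
                if not chunk:
                    break
                out.write(chunk)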

Read multiple xml file from a folder using ElementTree

I am very new to coding in Python, and there is an issue I have been trying to solve for some hours:
I have 1600+ XML files (0000.xml, 0001.xml, etc.) that need to be parsed for a text-mining project.
An error occurs when I run the following code:
from os import listdir, path
import xml.etree.ElementTree as ET

mypath = '../project/content'
files = [f for f in listdir(mypath) if f.endswith('.xml')]
for file in files:
    tree = ET.parse("../project/content/"+file)
    root = tree.getroot()
The error message is the following:
Traceback (most recent call last):
File "/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-13-cdc3ee6c3989>", line 6, in <module>
tree = ET.parse("../project/content/"+file)
File "/anaconda3/lib/python3.6/xml/etree/ElementTree.py", line 1196, in parse
tree.parse(source, parser)
File "/anaconda3/lib/python3.6/xml/etree/ElementTree.py", line 597, in parse
self._root = parser._parse_whole(source)
File "<string>", line unknown ParseError: no element found: line 1, column 0
Where did I make a mistake?
Also, I want to extract the text from only one element of each XML file. Is it sufficient to simply append this line to the code? And how can I save each of the results to a .txt file?
maintext = root.find("mainText").text
Thank you very much!
The right way to build paths is with os.path.join, as in the code below.
Add a print statement before you create the tree so you can see which file fails.
Is the XML you are trying to parse valid? The error "no element found: line 1, column 0" usually means the parser hit an empty file.
Once you solve the parsing issue, you can use multiprocessing to parse many files at the same time.
from os import listdir, path
import xml.etree.ElementTree as ET

mypath = '../project/content'
files = [path.join(mypath, f) for f in listdir(mypath) if f.endswith('.xml')]
for file in files:
    print(file)
    tree = ET.parse(file)
    root = tree.getroot()
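For the second part of the question (pulling the text out of one element and saving it per file), here is a sketch that assumes mainText is an element reachable with find() from the root, and writes one .txt per XML file into a hypothetical output folder:
from os import listdir, path
import xml.etree.ElementTree as ET

mypath = '../project/content'
outpath = '../project/text'  # hypothetical output folder; create it beforehand
for fname in listdir(mypath):
    if not fname.endswith('.xml'):
        continue
    tree = ET.parse(path.join(mypath, fname))
    node = tree.getroot().find("mainText")  # element name taken from the question
    if node is None or node.text is None:
        continue  # skip files without that element
    with open(path.join(outpath, fname[:-4] + '.txt'), 'w', encoding='utf-8') as out:
        out.write(node.text)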

How to turn a comma-separated TXT file into a CSV for machine learning

How do I turn this format of TXT file into a CSV file?
Date,Open,high,low,close
1/1/2017,1,2,1,2
1/2/2017,2,3,2,3
1/3/2017,3,4,3,4
As you can see, it already contains comma-separated values.
I tried using numpy.
>>> import numpy as np
>>> table = np.genfromtxt("171028 A.txt", comments="%")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Smith\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\lib\npyio.py", line 1551, in genfromtxt
fhd = iter(np.lib._datasource.open(fname, 'rb'))
File "C:\Users\Smith\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\lib\_datasource.py", line 151, in open
return ds.open(path, mode)
File "C:\Users\Smith\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\lib\_datasource.py", line 501, in open
raise IOError("%s not found." % path)
OSError: 171028 A.txt not found.
I have (S&P) 500 txt files to do this with.
You can use the csv module. You can find more information in the csv module documentation.
import csv
txt_file = 'mytext.txt'
csv_file = 'mycsv.csv'
in_txt = csv.reader(open(txt_file, "r"), delimiter=',')
out_csv = csv.writer(open(csv_file, 'w+'))
out_csv.writerows(in_txt)
Per @dclarke's comment, check the directory from which you run the code. As you coded the call, the file must be in that directory. When I have it there, the code runs without error (although the resulting table is a single line with four nan values). When I move the file elsewhere, I reproduce your error quite nicely.
Either move the file to be local, add a local link to the file, or change the file name in your program to use the proper path to the file (either relative or absolute).
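Putting the two answers together for the batch job (all 500 files), here is a sketch that assumes the .txt files sit in one hypothetical folder and writes a .csv next to each one:
import csv
import glob
import os

for txt_file in glob.glob(r"C:\path\to\stock_data\*.txt"):  # hypothetical folder
    csv_file = os.path.splitext(txt_file)[0] + '.csv'
    # read the comma-separated text and write it back out as a .csv
    with open(txt_file, 'r', newline='') as src, open(csv_file, 'w', newline='') as dst:
        csv.writer(dst).writerows(csv.reader(src, delimiter=','))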

Python Zipfile - Invalid Argument Errno 22

I have a .zip file and would like to know the names of the files within it. Here's the code:
zip_path = glob.glob(path + '/*.zip')[0]
file = open(zip_path, 'r')  # opens without error
if zipfile.is_zipfile(file):
    print str(file)  # prints to console
    my_zipfile = zipfile.ZipFile(zip_path)  # throws IOError
Here is the traceback:
<open file u'/Users/me/Documents/project/uploads/assets/peter/offline_message/offline_imgs.zip', mode 'r' at 0x107b2a150>
Traceback (most recent call last):
File "/Users/me/Documents/project/admin_dev/proj_name/views.py", line 1680, in get_dps_app_builder_assets
link_to_assets_zip = zip_dps_app_builder_assets(server_url, app_slug, button_slugs)
File "/Users/me/Documents/project/admin_dev/proj_name/views.py", line 1724, in zip_dps_app_builder_assets
my_zipfile = zipfile.ZipFile(zip_path)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 712, in __init__
self._GetContents()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 746, in _GetContents
self._RealGetContents()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 779, in _RealGetContents
fp.seek(self.start_dir, 0)
IOError: [Errno 22] Invalid argument
I am very confused as to why this is happening, since the file is clearly there and is a valid .zip file. The documentation clearly states that you can pass it either the path to the file or a file-like object, neither of which works in my case:
http://docs.python.org/2/library/zipfile#zipfile-objects
I was not able to figure this issue out and ended up doing it a different way entirely.
EDIT: In the Django app I work with, users needed to be able to upload assets in the form of .zip files and later download everything they had uploaded (plus other content we generate dynamically) in another zip with a different structure. So, I wanted to unzip a previously uploaded file and zip up the contents of that file up in another zip, which I couldn't do because of the error. Instead of reading the zip file when the user requested the download, I ended up unzipping it from a Django InMemoryUploadedFile (whose contents I was able to successfully read) and just leaving the unzipped files on the file system to work with later. The contents of the zip are only two smallish image files, so this workaround of unzipping the zip ahead of time to be used later worked OK for my purposes.
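For anyone hitting the same wall, here is a sketch of the workaround described above: read the uploaded file's bytes into an in-memory buffer and hand that to ZipFile, which accepts any seekable file-like object. The uploaded_file parameter stands in for the Django InMemoryUploadedFile and is hypothetical.
import io
import zipfile

def unzip_upload(uploaded_file, dest_dir):
    # uploaded_file is any object with .read(), e.g. a Django InMemoryUploadedFile
    data = io.BytesIO(uploaded_file.read())
    with zipfile.ZipFile(data) as zf:
        names = zf.namelist()      # names of the files inside the archive
        zf.extractall(dest_dir)    # leave the unzipped files on disk for later use
    return names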
