Read multiple xml file from a folder using ElementTree

I am very new to coding in Python, and there is an issue I have been trying to solve for some hours:
I have 1600+ XML files (0000.xml, 0001.xml, etc.) that need to be parsed for a text mining project.
But I get an error when I run the following code:
from os import listdir, path
import xml.etree.ElementTree as ET
mypath = '../project/content'
files = [f for f in listdir(mypath) if f.endswith('.xml')]
for file in files:
    tree = ET.parse("../project/content/"+file)
    root = tree.getroot()
The error message is the following:
Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-13-cdc3ee6c3989>", line 6, in <module>
    tree = ET.parse("../project/content/"+file)
  File "/anaconda3/lib/python3.6/xml/etree/ElementTree.py", line 1196, in parse
    tree.parse(source, parser)
  File "/anaconda3/lib/python3.6/xml/etree/ElementTree.py", line 597, in parse
    self._root = parser._parse_whole(source)
  File "<string>", line unknown
ParseError: no element found: line 1, column 0
Where did I make a mistake?
Also, I only want to extract the text from one element of each XML file. Is it sufficient to simply append this line to the code? And how can I save each result to a .txt file?
maintext = root.find("mainText").text
Thank you very much!

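The right way to build paths is with os.path.join (imported here as path.join), as in the code below.
Add print statements before you create the tree, so you can see which file fails.
Is the XML you are trying to parse valid? "no element found: line 1, column 0" usually means the parser reached the end of the input without finding a root element, for example because the file is empty.
Once you have solved the parsing issue, you can use multiprocessing to parse many files at the same time.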
from os import listdir, path
import xml.etree.ElementTree as ET
mypath = '../project/content'
files = [path.join(mypath, f) for f in listdir(mypath) if f.endswith('.xml')]
for file in files:
    print(file)
    tree = ET.parse(file)
    root = tree.getroot()
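
The question also asks how to pull one element's text out of each file and save it. A minimal sketch of that step follows; the function name extract_main_text and the output folder are hypothetical, and it assumes mainText is a direct child of the root element, as the asker's root.find("mainText") implies (use ".//mainText" if it is nested deeper). Files that fail to parse are reported and skipped rather than stopping the loop.

```python
import os
import xml.etree.ElementTree as ET


def extract_main_text(src_dir, out_dir, tag="mainText"):
    """Write the text of the first `tag` element of each XML file to a .txt file.

    Returns the names of files that could not be parsed or lacked the element.
    """
    os.makedirs(out_dir, exist_ok=True)
    skipped = []
    for name in sorted(os.listdir(src_dir)):
        if not name.endswith(".xml"):
            continue
        src = os.path.join(src_dir, name)
        try:
            root = ET.parse(src).getroot()
        except ET.ParseError as err:
            # Empty or malformed files end up here.
            print(f"skipping {src}: {err}")
            skipped.append(name)
            continue
        elem = root.find(tag)
        if elem is None or elem.text is None:
            skipped.append(name)
            continue
        # 0000.xml -> 0000.txt in the output folder
        out_path = os.path.join(out_dir, os.path.splitext(name)[0] + ".txt")
        with open(out_path, "w", encoding="utf-8") as out:
            out.write(elem.text)
    return skipped
```

Calling extract_main_text('../project/content', '../project/text') would then leave one .txt file per parseable XML file and return the names that need a closer look.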

Related

File "<string>", line unknown ParseError: not well-formed (invalid token): line 1, column 5

After running this, I get the errors below. I have tried everything else mentioned here and cannot get past this error.
from xml.etree import ElementTree as ET
from os import path, listdir
path__ = "blogs/"
files = [path.join(path__, f) for f in listdir(path__)
         if f.endswith('.xml')]
for file in files:
    print(file)
    parse = ET.XMLParser(encoding="unicode_escape")
    tree = ET.fromstring(file, parser=parse)
blogs/1000331.female.37.indUnk.Leo.xml
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/IPython/core/interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-3-40e3bf76804f>", line 8, in <module>
    tree = ET.fromstring(file, parser=parse)
  File "/usr/lib/python3.9/xml/etree/ElementTree.py", line 1347, in XML
    parser.feed(text)

fromstring, as the name says, takes a string containing XML, not a file name. Hence it tries to parse the text
blogs/foo.xml
as an XML document. You want to use ET.parse instead:
for file in files:
    print(file)
    # Create a new parser for each file; an XMLParser cannot be reused once it has parsed a document.
    parser = ET.XMLParser(encoding="unicode_escape")
    tree = ET.parse(file, parser=parser)
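
To make the distinction concrete, here is a small self-contained sketch. The XML content and the file name example.xml are made up for illustration. It parses the same document once from a string and once from a file, and shows that handing the file name to fromstring raises a ParseError.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

xml_text = "<blog><post>hello</post></blog>"  # made-up sample content

# fromstring() parses XML *content* held in a string and returns the root element.
root_from_text = ET.fromstring(xml_text)

error_message = None
with tempfile.TemporaryDirectory() as tmp:
    file_name = os.path.join(tmp, "example.xml")  # hypothetical file
    with open(file_name, "w", encoding="utf-8") as fh:
        fh.write(xml_text)

    # parse() takes a file name (or file object) and returns an ElementTree.
    root_from_file = ET.parse(file_name).getroot()

    # Passing the file *name* to fromstring() fails: the path itself is not XML.
    try:
        ET.fromstring(file_name)
    except ET.ParseError as err:
        error_message = str(err)

print(root_from_text.tag, root_from_file.tag, error_message)
```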

Iterate over pathlib paths and python-docx: zipfile.BadZipFile

My Python skills are a bit rusty since I have recently been using mostly R. However, I ran into the following problem: I want to recursively iterate over all .docx files in a directory and change some of the core attributes with the python-docx package.
For the loop, I first created a list with pathlib and glob:
from docx import Document
from docx.shared import Inches
import pathlib
# Reading the stats dir
root_dir = pathlib.Path(r"C:\some\Björn\PycharmProjects\mre_docx")
# Get all word files in the stats directory
files = [x for x in root_dir.glob("**/*.docx") if x.is_file()]
files
Output of files looks fine.
[WindowsPath('C:/Users/Björn/PycharmProjects/mre_docx/test1.docx'),
WindowsPath('C:/Users/Björn/PycharmProjects/mre_docx/test2.docx')]
When I now want to read in a document with the list I get a zip error (see full traceback below)
document = Document(files[1])
Traceback (most recent call last):
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\IPython\core\interactiveshell.py", line 3441, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-26-482c5438fa33>", line 1, in <module>
    document = Document(files[1])
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\api.py", line 25, in Document
    document_part = Package.open(docx).main_document_part
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\package.py", line 128, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\pkgreader.py", line 32, in from_file
    phys_reader = PhysPkgReader(pkg_file)
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\phys_pkg.py", line 101, in __init__
    self._zipf = ZipFile(pkg_file, 'r')
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1257, in __init__
    self._RealGetContents()
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1324, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
However, running the same line of code without the list works fine (apart from the difference in path separator, / versus \, which I thought should not matter since the list contains pathlib.Path objects).
document = Document(pathlib.Path(r"C:\Users\Björn\PycharmProjects\mre_docx\test1.docx"))
Edit in response to a comment
I created a total of 4 new Word files for this MRE. I entered text in two of them and left two empty. To my surprise, the empty ones cause the error.
for file in files:
    try:
        document = Document(file)
    except:
        print(f"The file: {file} appears to be corrupted")
Output:
The file: C:\Users\Björn\PycharmProjects\mre_docx\new_file.docx appears to be corrupted
The file: C:\Users\Björn\PycharmProjects\mre_docx\test2.docx appears to be corrupted
Partial solution for future readers
Wrap the call to Document("Path/to/file.docx") in a try/except block and print the file for which it failed. In my case there were just a few, which I could easily fix manually.

You are not doing anything wrong; you get this error because the documents are empty. If you open those files and type something, the error goes away.
According to https://python-docx.readthedocs.io/en/latest/user/documents.html, there are several ways to open Word documents.
First (Document() with no argument starts a new blank document from the default template, so saving it to the path overwrites the empty file):
document = Document()
document.save(files[1])
Second (opens an existing, valid document and saves it back):
document = Document(files[1])
document.save(files[1])
Also, according to the docs, you can open them as file-like objects:
with open(files[1], 'rb') as f:
    document = Document(f)
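
As the traceback in the question shows, python-docx opens the file through zipfile.ZipFile, so a zero-byte .docx fails before any document content is read. One way to find such files up front, sketched here with only the standard library, is zipfile.is_zipfile. The function name split_docx_by_validity is made up for this example, and passing the check only means the file is a zip archive, not that it is a valid Word document.

```python
import zipfile
from pathlib import Path


def split_docx_by_validity(paths):
    """Return (openable, broken): paths that are zip archives and paths that are not."""
    openable, broken = [], []
    for p in paths:
        p = Path(p)
        # is_zipfile() is False for zero-byte files and other non-zip content.
        if zipfile.is_zipfile(p):
            openable.append(p)
        else:
            broken.append(p)
    return openable, broken
```

With the list from the question, openable, broken = split_docx_by_validity(files) would let you call Document(...) only on the first list and report the second.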

Merge PDF files with same prefix using PyPDF2 Python

I have multiple PDF files with different prefixes. I want to merge these PDF files based on the third prefix (the third underscore-separated value in the file name). I want to do this using the Python library PyPDF2.
This is the error message:
Traceback (most recent call last):
  File "C:/test2.py", line 12, in <module>
    merger.append(filename)
  File "C:\py\lib\site-packages\PyPDF2\merger.py", line 203, in append
    self.merge(len(self.pages), fileobj, bookmark, pages, import_bookmarks)
  File "C:\py\lib\site-packages\PyPDF2\merger.py", line 114, in merge
    fileobj = file(fileobj, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '0_2021_564495_12345.pdf'
Process finished with exit code 1
For example:
0_2021_1_123.pdf
0_2021_1_1234.pdf
0_2021_1_12345.pdf
0_2021_2_123.pdf
0_2021_2_1234.pdf
0_2021_2_12345.pdf
Expected outcome
1_merged.pdf
2_merged.pdf
Here is what I tried, but I am getting an error and it is not working. Any help is much appreciated.
from PyPDF2 import PdfFileMerger
import io
import os
files = os.listdir("C:\\test\\raw")
x=0
merger = PdfFileMerger()
for filename in files:
    print(filename.split('_')[2])
    prefix = filename.split('_')[2]
    if filename.split('_')[2] == prefix:
        merger.append(filename)
merger.write("C:\\test\\result" + prefix + "_merged.pdf")
merger.close()
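
The FileNotFoundError is consistent with os.listdir returning bare file names, which merger.append then looks up relative to the current working directory instead of C:\test\raw. A possible sketch of the grouping step, using only the standard library, is shown below. The function names are hypothetical, and merge_groups assumes a PyPDF2 version older than 3.0, where PdfFileMerger (as used in the question) is still available.

```python
import os
from collections import defaultdict


def group_by_third_field(folder):
    """Map the third underscore-separated field to the full paths that share it."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(folder)):
        parts = name.split("_")
        if not name.lower().endswith(".pdf") or len(parts) < 3:
            continue
        # os.listdir() returns bare names, so join them with the folder.
        groups[parts[2]].append(os.path.join(folder, name))
    return dict(groups)


def merge_groups(folder, out_dir):
    """Write one '<field>_merged.pdf' per group (assumes PyPDF2 < 3.0)."""
    # Imported here so the grouping function works without PyPDF2 installed.
    from PyPDF2 import PdfFileMerger

    os.makedirs(out_dir, exist_ok=True)
    for field, paths in group_by_third_field(folder).items():
        merger = PdfFileMerger()  # a fresh merger for each group
        for pdf_path in paths:
            merger.append(pdf_path)
        merger.write(os.path.join(out_dir, field + "_merged.pdf"))
        merger.close()
```

For the example file names in the question, group_by_third_field would produce the keys "1" and "2", so merge_groups would write 1_merged.pdf and 2_merged.pdf.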

python parse/process all xml files in folder

I am trying to run my code on all XML files in a folder.
I get a few errors when I run the code; it generates some files,
but not all.
Here is my code:
import xml.etree.ElementTree as ET
import os
import glob
path = 'C:/xml/'
for infile in glob.glob( os.path.join(path, '*.xml') ):
    tree = ET.parse(infile)
    root = tree.getroot()
    with open(infile+'new.csv','w') as outfile:
        for elem in root.findall('.//event[@type="MEDIA"]'):
            mediaidelem = elem.find('./mediaid')
            if mediaidelem is not None:
                outfile.write("{}\n".format(mediaidelem.text))
Here is the error log:
Traceback (most recent call last):
  File "C:\xml\2.py", line 8, in <module>
    tree = ET.parse(infile)
  File "C:\Python34\lib\xml\etree\ElementTree.py", line 1187, in parse
    tree.parse(source, parser)
  File "C:\Python34\lib\xml\etree\ElementTree.py", line 598, in parse
    self._root = parser._parse_whole(source)
  File "<string>", line None
xml.etree.ElementTree.ParseError: no element found: line 1, column 0

Judging by the error message, you may have some empty (or malformed) files.
I would add some error handling here to warn the user about such errors and then skip the file. Something like:
for infile in glob.glob( os.path.join(path, '*.xml') ):
    try:
        tree = ET.parse(infile)
    except ET.ParseError as e:
        # Report the offending file and move on to the next one.
        print(infile, str(e))
        continue
    ...
I did not try to reproduce this here; it is just a guess.
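
Putting the two pieces together, a hedged sketch of the per-file step could look like this. The function name media_ids is made up, and the event/mediaid structure is taken from the question's code. Note that ElementTree's XPath subset selects on attributes with @, as in [@type="MEDIA"].

```python
import xml.etree.ElementTree as ET


def media_ids(xml_path):
    """Return the mediaid texts of all MEDIA events, or None if the file cannot be parsed."""
    try:
        root = ET.parse(xml_path).getroot()
    except ET.ParseError as err:
        # Empty or malformed file: report it and let the caller skip it.
        print(f"{xml_path}: {err}")
        return None
    ids = []
    # [@type="MEDIA"] keeps only event elements whose type attribute equals MEDIA.
    for event in root.findall('.//event[@type="MEDIA"]'):
        mediaid = event.find("./mediaid")
        if mediaid is not None and mediaid.text:
            ids.append(mediaid.text)
    return ids
```

The calling loop can then write the CSV only when media_ids returns a list, which also avoids creating output files for inputs that failed to parse.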

Parse XML Tag value for all files in directory using Python

I can't quite make the leap despite similar existing questions. Help would be appreciated!
I am trying to recursively parse all XML files in a directory and its subdirectories.
I am looking for the value that appears for the tag "Operator id".
Example source XML:
<Operators>
<Operator id="OId_LD">
<OperatorCode>LD</OperatorCode>
<OperatorShortName>ARRIVA THE SHIRES LIMIT</OperatorShortName>
This is the code I have thus far:
from xml.dom.minidom import parse
import os
def jarv(target_folder):
    for root,dirs,files in os.walk(target_folder):
        for targetfile in files:
            if targetfile.endswith(".xml"):
                print targetfile
                dom=parse(targetfile)
                name = dom.getElementsByTagName('Operator_id')
                print name[0].firstChild.nodeValue
This is the terminal command I am running:
python -c "execfile('xml_tag.py'); jarv('/Users/admin/Projects/AtoB_GTFS')"
And this is the error I receive:
tfl_64-31_-37434-y05.xml
encodings.xml
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "xml_tag.py", line 8, in jarv
    dom=parse(targetfile)
  File "/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/minidom.py", line 1918, in parse
    return expatbuilder.parse(file)
  File "/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/expatbuilder.py", line 922, in parse
    fp = open(file, 'rb')
IOError: [Errno 2] No such file or directory: 'encodings.xml'
(frigo)andytmac:AtoB_GTFS admin$ python -c "execfile('xml_tag.py'); jarv('/Users/admin/Projects/AtoB_GTFS')"
tfl_64-31_-37434-y05.xml
If I comment out the code after the 'print target file' line it does list all the xml files I have.
Thanks for your assistance,
Andy

You're not looking in the right place (relative path): when you use for root, dirs, files in os.walk(target_folder):, files is a list of the file names in the directory root, not their full paths, so parse(targetfile) looks for the file in the current working directory.
Try replacing dom=parse(targetfile) with dom = parse(os.path.join(root, targetfile))
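
A minimal sketch of the whole walk follows, written for Python 3 (the question uses Python 2) with a hypothetical function name operator_ids. It also assumes that, as in the sample XML, "Operator id" is the id attribute of an Operator element rather than a tag called Operator_id, so it reads the value with getAttribute.

```python
import os
from xml.dom.minidom import parse


def operator_ids(target_folder):
    """Yield (path, id) for every Operator element in XML files under target_folder."""
    for root, dirs, files in os.walk(target_folder):
        for name in files:
            if not name.endswith(".xml"):
                continue
            # Join with the directory that os.walk is currently visiting.
            full_path = os.path.join(root, name)
            dom = parse(full_path)
            for operator in dom.getElementsByTagName("Operator"):
                yield full_path, operator.getAttribute("id")
```

For example, for path, op_id in operator_ids('/Users/admin/Projects/AtoB_GTFS'): print(path, op_id) would list each file together with the ids it contains.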
