Python: parse/process all XML files in a folder

I am trying to run my code on all XML files in a folder.
When I run it I get a few errors, and it generates some of the output files but not all of them.
Here is my code:
import xml.etree.ElementTree as ET
import os
import glob
path = 'C:/xml/'
for infile in glob.glob(os.path.join(path, '*.xml')):
    tree = ET.parse(infile)
    root = tree.getroot()
    with open(infile + 'new.csv', 'w') as outfile:
        # select only <event type="MEDIA"> elements
        for elem in root.findall('.//event[@type="MEDIA"]'):
            mediaidelem = elem.find('./mediaid')
            if mediaidelem is not None:
                outfile.write("{}\n".format(mediaidelem.text))
Here is the error log:
Traceback (most recent call last):
  File "C:\xml\2.py", line 8, in <module>
    tree = ET.parse(infile)
  File "C:\Python34\lib\xml\etree\ElementTree.py", line 1187, in parse
    tree.parse(source, parser)
  File "C:\Python34\lib\xml\etree\ElementTree.py", line 598, in parse
    self._root = parser._parse_whole(source)
  File "<string>", line None
xml.etree.ElementTree.ParseError: no element found: line 1, column 0

Considering the error message, you may have some empty (or malformed) files.
I would add error handling here to warn the user about such files and then skip them. Something like:
for infile in glob.glob(os.path.join(path, '*.xml')):
    try:
        tree = ET.parse(infile)
    except ET.ParseError as e:
        # empty or malformed file: report it and move on to the next one
        print(infile, str(e))
        continue
    ...
I have not tried to reproduce it here; it is just a guess.
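For completeness, here is a rough sketch of the whole loop with the skip-on-error handling folded in. The function name export_media_ids and the returned (written, skipped) lists are my own additions for illustration, not part of the original script; the output naming (infile + 'new.csv') and the event/mediaid lookup are copied from the question.

```python
import glob
import os
import xml.etree.ElementTree as ET


def export_media_ids(folder):
    """Write one CSV of <mediaid> values per parsable XML file in `folder`.

    Returns (written, skipped): the input paths that were converted and
    the input paths that could not be parsed.
    """
    written, skipped = [], []
    for infile in sorted(glob.glob(os.path.join(folder, '*.xml'))):
        try:
            root = ET.parse(infile).getroot()
        except ET.ParseError as e:
            # Empty or malformed file: report it and keep going.
            print('skipping {}: {}'.format(infile, e))
            skipped.append(infile)
            continue
        with open(infile + 'new.csv', 'w') as outfile:
            for elem in root.findall('.//event[@type="MEDIA"]'):
                mediaid = elem.find('./mediaid')
                if mediaid is not None:
                    outfile.write('{}\n'.format(mediaid.text))
        written.append(infile)
    return written, skipped
```

Calling export_media_ids('C:/xml/') and printing the second list should tell you which files are the empty or broken ones.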

Related

File Opening - Script still finds file in cwd even when absolute path specified

I have looked at "How to open a list of files in Python". This problem is similar but not covered there.
path = "C:\\test\\test5\\"
files = os.listdir(path)
fileNames = []
for f in files:
    fileNames.append(f)
for fileName in fileNames:
    pathFileName = path + fileName
    print(f"This is the path: {pathFileName}")
    fin = open(pathFileName, 'rt')
    texts = []
    with open(fileName) as file_in:
        # read file text lines into an array
        for text in file_in:
            texts.append(text)
    for text in texts:
        print(text)
The file aaaatest.txt is in C:\test\test5. The output is:
This is the path: C:\test\test5\aaaatest.txt
Traceback (most recent call last):
  File "c:\Users\david\source\repos\python-street-spell\diffLibFieldFix.py", line 30, in <module>
    with open(fileName) as file_in:
FileNotFoundError: [Errno 2] No such file or directory: 'aaaatest.txt'
So here's the point: if I take a copy of aaaatest.txt (leaving the original where it is), put it in the current working directory, and run the script again, I get:
This is the path: C:\test\test5\aaaatest.txt
A triple AAA test
This is the path: C:\test\test5\AALTONEN-ALLAN_PENCARROW_PAGE_1.txt
Traceback (most recent call last):
  File "c:\Users\david\source\repos\python-street-spell\diffLibFieldFix.py", line 30, in <module>
    with open(fileName) as file_in:
FileNotFoundError: [Errno 2] No such file or directory: 'AALTONEN-ALLAN_PENCARROW_PAGE_1.txt'
The file aaaatest.txt is opened and the single line of text it contains is printed. An attempt is then made to open the next file in C:\test\test5, and the same error occurs again.
It seems to me that while the path says C:\test\test5, the file is only being read from the cwd?
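One reading of the script above, offered as a sketch rather than a confirmed diagnosis: pathFileName is built and opened as fin, but the with block opens the bare fileName, and a bare name is resolved against the current working directory. The hypothetical helper read_all_lines below (my own name, not from the question) opens the joined path instead.

```python
import os


def read_all_lines(folder):
    """Return {file name: list of text lines} for every file in `folder`."""
    contents = {}
    for file_name in sorted(os.listdir(folder)):
        # os.listdir gives bare names, so join them with the folder first.
        full_path = os.path.join(folder, file_name)
        if not os.path.isfile(full_path):
            continue  # skip sub-directories
        with open(full_path, 'rt') as file_in:
            contents[file_name] = file_in.readlines()
    return contents
```

With this shape the result no longer depends on which directory the script is started from.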

Read multiple xml file from a folder using ElementTree

I am very new to coding in Python, and there is an issue I have been trying to solve for some hours:
I have 1600+ XML files (0000.xml, 0001.xml, etc.) that need to be parsed for a text mining project.
But an error occurs when I run the following code:
from os import listdir, path
import xml.etree.ElementTree as ET
mypath = '../project/content'
files = [f for f in listdir(mypath) if f.endswith('.xml')]
for file in files:
    tree = ET.parse("../project/content/"+file)
    root = tree.getroot()
The error message is the following:
Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-13-cdc3ee6c3989>", line 6, in <module>
    tree = ET.parse("../project/content/"+file)
  File "/anaconda3/lib/python3.6/xml/etree/ElementTree.py", line 1196, in parse
    tree.parse(source, parser)
  File "/anaconda3/lib/python3.6/xml/etree/ElementTree.py", line 597, in parse
    self._root = parser._parse_whole(source)
  File "<string>", line unknown
ParseError: no element found: line 1, column 0
Where did I make a mistake?
Also, I only want to extract the text from one element of each XML file. Is it sufficient to simply add this line to the code? Moreover, how can I save each of the results to a txt file?
maintext = root.find("mainText").text
Thank you very much!
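The right way to build paths is with join, as in the code below.
Add print messages to the code before you try to create the tree, so you can see which file fails.
Is the XML you are trying to parse valid? "no element found: line 1, column 0" means the parser reached the end of the input without finding any element, which is what an empty file produces.
Once you solve the parsing issue, you can use multiprocessing to parse many files at the same time.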
from os import listdir, path
import xml.etree.ElementTree as ET
mypath = '../project/content'
files = [path.join(mypath, f) for f in listdir(mypath) if f.endswith('.xml')]
for file in files:
    print(file)
    tree = ET.parse(file)
    root = tree.getroot()
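On the second part of the question (extracting one element and saving it to txt files), something along these lines might work. It is only a sketch: the function name extract_main_text, the out_dir parameter and the UTF-8 output encoding are my assumptions, and it assumes mainText is a direct child of the root element, which is what root.find("mainText") looks for.

```python
import os
import xml.etree.ElementTree as ET


def extract_main_text(src_dir, out_dir):
    """Save the <mainText> content of each XML file in src_dir as a .txt file.

    Returns the names of the files that were skipped, either because they
    could not be parsed or because they have no <mainText> child.
    """
    os.makedirs(out_dir, exist_ok=True)
    skipped = []
    for name in sorted(os.listdir(src_dir)):
        if not name.endswith('.xml'):
            continue
        try:
            root = ET.parse(os.path.join(src_dir, name)).getroot()
        except ET.ParseError:
            skipped.append(name)
            continue
        node = root.find('mainText')
        if node is None:
            # find() returns None when no direct child matches,
            # so calling .text on it directly would raise AttributeError.
            skipped.append(name)
            continue
        out_name = os.path.splitext(name)[0] + '.txt'
        with open(os.path.join(out_dir, out_name), 'w', encoding='utf-8') as out:
            out.write(node.text or '')
    return skipped
```

The returned list doubles as a report of which of the 1600+ files need a closer look.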

Why do I get an IOError saying my file doesn't exist even though it does exist in the directory?

I am trying to loop over a directory in Python, and I get an IOError for one specific file, which happens to be the last file in the directory.
The error I get is:
IOError: [Errno 2] No such file or directory: 'nod_gyro_instance_11_P_4.csv'
My script:
for filename in os.listdir("/Users/my_name/PycharmProjects/My_Project/Data/Nod/Gyro"):
    data = []
    if filename.endswith(".csv"):
        data.append(k_fold(filename))
        continue
    else:
        continue
k_fold does this:
def k_fold(myfile, myseed=11109, k=20):
    # Load data
    data = open(myfile).readlines()
The entire traceback:
Traceback (most recent call last):
  File "/Users/my_name/PycharmProjects/MY_Project/Cross_validation.py", line 30, in <module>
    data.append(k_fold(filename))
  File "/Users/my_name/PycharmProjects/My_Project/Cross_validation.py", line 8, in k_fold
    data = open(myfile).readlines()
IOError: [Errno 2] No such file or directory: 'nod_gyro_instance_11_P_4.csv'
My CSV files are named like this:
nod_gyro_instance_0_P_4.csv
nod_gyro_instance_0_P_3.csv
nod_gyro_instance_0_P_2.csv
nod_gyro_instance_0_P_5.csv
...
nod_gyro_instance_11_P_4.csv
nod_gyro_instance_10_P_6.csv
nod_gyro_instance_10_P_5.csv
nod_gyro_instance_10_P_4.csv
Why doesn't it recognize my nod_gyro_instance_10_P_4.csv file?
os.listdir returns just filenames, not absolute paths. If you're not currently in that same directory, trying to read the file will fail.
You need to join the dirname onto the filename returned:
data_dir = "/Users/my_name/PycharmProjects/My_Project/Data/Nod/Gyro"
for filename in os.listdir(data_dir):
    k_fold(os.path.join(data_dir, filename))
Alternatively, you could use glob to do both the listing (with full paths) and extension filtering:
import glob
for filename in glob.glob("/Users/my_name/PycharmProjects/My_Project/Data/Nod/Gyro/*.csv"):
    k_fold(filename)
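To see the difference between the two approaches for yourself, here is a small throwaway demonstration. The temporary directory and the file names a.csv and notes.txt are made up for the example.

```python
import glob
import os
import tempfile

# Build a scratch directory with one CSV and one non-CSV file.
data_dir = tempfile.mkdtemp()
for name in ('a.csv', 'notes.txt'):
    with open(os.path.join(data_dir, name), 'w') as f:
        f.write('x\n')

# os.listdir: bare names, which only open reliably after joining.
bare_names = sorted(os.listdir(data_dir))
joined = [os.path.join(data_dir, n) for n in bare_names if n.endswith('.csv')]

# glob.glob: paths that already include the directory, filtered by pattern.
globbed = glob.glob(os.path.join(data_dir, '*.csv'))

print(bare_names)
print(joined)
print(globbed)
```

The first list contains names only, while the other two contain the directory as well, so either of them can be passed straight to open().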

Object has no attribute error while using ElementTree

I'm a python newbie and have the following problem. If I uncomment the last line of this code:
# traverse all directories
for root, dirs, files in os.walk(args.rootfolder):
    for file in files:
        print('Current file name: ' + file)
        if file.endswith('.savx'):
            # read single xml savx file
            currfilename = os.path.join(root,file)
            print('Current full file name: ' + currfilename)
            tree = ET.parse(currfilename)
            # root = tree.getroot() <-- if I uncomment this line I get errors
I get this error:
Traceback (most recent call last):
  File "convert.py", line 30, in <module>
    currfilename = os.path.join(root,file)
  File "C:\progs\develop\Python34\lib\ntpath.py", line 108, in join
    result_drive, result_path = splitdrive(path)
  File "C:\progs\develop\Python34\lib\ntpath.py", line 161, in splitdrive
    normp = p.replace(_get_altsep(p), sep)
AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'replace'
It looks like the error appears on the second iteration of the for file in files loop.
You used root twice in your code, for different things:
for root, dirs, files in os.walk(args.rootfolder):
#   ^^^^
and
root = tree.getroot()
So when you try to use root as a path string when building the next filename to load:
currfilename = os.path.join(root,file)
you'll find you have replaced the root path name with an ElementTree Element object instead.
Use a different name for either the root directory name or your ElementTree object. Use dirname for example:
for dirname, dirs, files in os.walk(args.rootfolder):
    for file in files:
        print('Current file name: ' + file)
        if file.endswith('.savx'):
            # read single xml savx file
            currfilename = os.path.join(dirname, file)
            print('Current full file name: ' + currfilename)
            tree = ET.parse(currfilename)
            root = tree.getroot()
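If you want something you can run without the argparse setup, here is a self-contained sketch of the renamed loop. The function name root_tags and the idea of collecting each file's root tag are mine, purely for illustration; the question does not say what is done with root afterwards.

```python
import os
import xml.etree.ElementTree as ET


def root_tags(rootfolder, suffix='.savx'):
    """Map each matching file path under `rootfolder` to its XML root tag."""
    tags = {}
    for dirname, dirs, files in os.walk(rootfolder):
        for file in files:
            if file.endswith(suffix):
                currfilename = os.path.join(dirname, file)
                # `root` only ever names the XML element here, so
                # `dirname` is still a string on the next iteration.
                root = ET.parse(currfilename).getroot()
                tags[currfilename] = root.tag
    return tags
```

Because the directory name and the XML root now live in different variables, the second and later files in a directory join correctly.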

Parse XML Tag value for all files in directory using Python

I can't quite make the leap despite pre-existing similar questions. Help would be valued!
I am trying to recursively parse all XML files in a directory and its subdirectories.
I am looking for the value that appears for the tag "Operator id"
Example source XML:
<Operators>
  <Operator id="OId_LD">
    <OperatorCode>LD</OperatorCode>
    <OperatorShortName>ARRIVA THE SHIRES LIMIT</OperatorShortName>
This is the code I have thus far:
from xml.dom.minidom import parse
import os
def jarv(target_folder):
    for root,dirs,files in os.walk(target_folder):
        for targetfile in files:
            if targetfile.endswith(".xml"):
                print targetfile
                dom=parse(targetfile)
                name = dom.getElementsByTagName('Operator_id')
                print name[0].firstChild.nodeValue
This is the terminal command I am running:
python -c "execfile('xml_tag.py'); jarv('/Users/admin/Projects/AtoB_GTFS')"
And this is the error I receive:
tfl_64-31_-37434-y05.xml
encodings.xml
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "xml_tag.py", line 8, in jarv
    dom=parse(targetfile)
  File "/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/minidom.py", line 1918, in parse
    return expatbuilder.parse(file)
  File "/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/expatbuilder.py", line 922, in parse
    fp = open(file, 'rb')
IOError: [Errno 2] No such file or directory: 'encodings.xml'
(frigo)andytmac:AtoB_GTFS admin$ python -c "execfile('xml_tag.py'); jarv('/Users/admin/Projects/AtoB_GTFS')"
tfl_64-31_-37434-y05.xml
If I comment out the code after the 'print targetfile' line, it does list all the XML files I have.
Thanks for your assistance,
Andy
You're not looking in the right place (relative path): when you use for root, dirs, files in os.walk(target_folder):, files is a list of the file names in the directory root, not their absolute paths.
Try replacing dom=parse(targetfile) with dom = parse(os.path.join(root, targetfile))
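Putting that together, a possible version of the function looks like the sketch below. Two things are my own reading and not confirmed by the asker: it is written in Python 3 syntax (the question uses Python 2), and it treats id as an attribute of the Operator element, as in the sample XML, instead of looking for a tag called 'Operator_id'. The name operator_ids is made up for the example.

```python
import os
from xml.dom.minidom import parse


def operator_ids(target_folder):
    """Return the id attribute of every <Operator> element found in the
    .xml files under target_folder (searched recursively)."""
    ids = []
    for root, dirs, files in os.walk(target_folder):
        for targetfile in sorted(files):
            if targetfile.endswith('.xml'):
                # Join the name with the directory currently being walked.
                dom = parse(os.path.join(root, targetfile))
                for node in dom.getElementsByTagName('Operator'):
                    ids.append(node.getAttribute('id'))
    return ids
```

For the sample file in the question this should collect 'OId_LD', provided the real files follow the same layout.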
