Merging multiple XML files - python

I have a directory of XML files and I'm trying to merge them all into one big XML file:
full = ET.Element('dataset')
for filename in glob.glob(os.path.join(path, '*.xml')):
    tree = ET.parse(filename, parser=xmlp)
    root = tree.getroot()
    for pair in root:  # root.iter('pair'):
        full.append(pair)
I tried the above code and get this trivial error:
ParseError: parsing finished: line 330, column 0
The problem is that only the first file is appended to the new xml doc, how can I avoid this? Or is there a better way of merging? (The structures are identical)
Edit: they are of this structure:
<dataset>
    <pair>
        <t1></t1>
        <t2></t2>
    </pair>
    ...
</dataset>
Update: I tried XML Copy Editor, which could not open the file and reported an unknown encoding, MS932, even though the file is in ISO-8859-1. I got the same error when opening it with lxml instead of xml in Python. I ended up manually recreating the XML file, which is not really a solution, but oh well.
Thanks
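One possible cause of "parsing finished", assuming xmlp is a single parser object created once before the loop: a parser cannot be fed a second document after it has finished the first, so only the first file gets merged. A minimal sketch that builds a new parser for each file is below. The function name merge_xml_dir and the demo files are made up for illustration.

```python
import glob
import os
import tempfile
import xml.etree.ElementTree as ET


def merge_xml_dir(path, root_tag="dataset"):
    """Collect the children of every *.xml root in `path` under one new root."""
    merged = ET.Element(root_tag)
    for filename in sorted(glob.glob(os.path.join(path, "*.xml"))):
        # Assumption: the original xmlp was one shared parser. A parser
        # cannot be reused for a second document, so create one per file.
        parser = ET.XMLParser()
        root = ET.parse(filename, parser=parser).getroot()
        merged.extend(list(root))  # copy the child list, then append each child
    return merged


# Demo with two throwaway files (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    for i in range(2):
        with open(os.path.join(tmp, f"part{i}.xml"), "w", encoding="utf-8") as fh:
            fh.write(f"<dataset><pair><t1>a{i}</t1><t2>b{i}</t2></pair></dataset>")
    merged = merge_xml_dir(tmp)

pair_count = len(merged.findall("pair"))
print(pair_count)
```

If the files declare an encoding that does not match their bytes, that is a separate problem and this sketch does not address it.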

Related

How to change XML file's root element so that the file parses from different root

I have an XML file that, when parsed, gives me a different root from the one I want to read from.
The file starts with a tree that summarizes the data in the file, followed by another tree with the actual data. However, when I parse it, it does not read the tree that contains the data at all; it only reads the summary tree.
My question is how to modify my code so that it starts reading from the tree that contains the data I need.
The XML file starts like that
I tried parsing it with xml.etree.ElementTree, and it reads the root "WorkflowSearch". What I want is for it to read from "SearchResults" as the root so that I can load the fields into a pandas DataFrame.
`tree = ET.parse('Workflows.xml')`
`root = tree.getroot()`
`root.tag, root.attrib`
I appreciate your help
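A rough sketch of one way to do this, assuming "SearchResults" is a descendant of "WorkflowSearch" and that the file uses no XML namespaces. Only those two tag names come from the question; the sample document and the other tag names are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for Workflows.xml; only the tag names WorkflowSearch
# and SearchResults come from the question, everything else is invented.
SAMPLE = """
<WorkflowSearch>
  <Summary><Count>2</Count></Summary>
  <SearchResults>
    <Workflow><Name>alpha</Name><Status>done</Status></Workflow>
    <Workflow><Name>beta</Name><Status>open</Status></Workflow>
  </SearchResults>
</WorkflowSearch>
"""

root = ET.fromstring(SAMPLE)  # for a file: ET.parse('Workflows.xml').getroot()
results = root.find(".//SearchResults")  # first SearchResults element below the root

# One dict per record, one key per child tag.
rows = [{field.tag: field.text for field in record} for record in results]
print(rows)
```

A list of dicts like rows can be passed to pandas.DataFrame(rows) to build the table.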

Editing xml file without creating a new one

I tried to read the XML file into the following variable so that I could work on it later as a normal XML file, but this is not working. How can I approach this? I need to edit the XML without changing the original file and without creating a new XML file. Is that possible?
comment_2 = open("cool.xml").read()
Thanks and Regards
You can use xml.etree.ElementTree to parse the XML file into a variable and work on that in-memory copy:
import xml.etree.ElementTree as ET
tree = ET.parse('cool.xml')  # reads the file; the file itself is not modified
root = tree.getroot()
root_variable = root  # edit root_variable; changes stay in memory
The tree variable is an ElementTree instance, and the file on disk only changes if you call tree.write().
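If the content is already in a string, as with open("cool.xml").read() in the question, a minimal sketch of editing it purely in memory could look like this. The sample XML and the comment tag are made up.

```python
import xml.etree.ElementTree as ET

# Stands in for open("cool.xml").read(); the content is hypothetical.
xml_text = "<notes><comment>old</comment></notes>"

root = ET.fromstring(xml_text)     # parse the string, not the file
root.find("comment").text = "new"  # edit the in-memory tree
# Serialize back to a str; nothing is written to disk.
edited_text = ET.tostring(root, encoding="unicode")
print(edited_text)
```

Here xml_text still holds the original content and edited_text holds the modified copy.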

Removing sub-tags from XML and create new XML files

I have an XML input file which I need to split into multiple files based on MAPPING and WORKFLOW tags.
Since I have two MAPPING tags in my input XML and one WORKFLOW tag, I need to generate three files:
m_demo_trans_agg.XML
m_demo_trans_exp.XML
wf_m_demo_trans_agg_exp.XML
So, my mapping file (starting with m_) will have tags SOURCE, TARGET, and MAPPINGS. The workflow file will have tags WORKFLOW and CONFIG.
Please let me know how I can create the mapping XML.
I started with workflow XML creation.
My code looks like:
import xml.etree.ElementTree as ET
tree = ET.parse('input.xml')
root = tree.getroot()
target_node_first_parent = 'FOLDER'
target_nodes = ['SOURCE', 'TARGET', 'MAPPING']
for node in root.iter(target_node_first_parent):
    for subnode in node.iter():
        if subnode.tag in target_nodes:
            print(subnode.tag)
            node.remove(subnode)
out_tree = ET.ElementTree(root)
out_tree.write('output.xml')
I am getting the TARGET tags in my output.xml.
I am open to using any libraries apart from xml.etree.ElementTree.
Please assist.
Thanks
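A likely reason TARGET survives: the loop removes children from FOLDER while iterating over it, so after SOURCE is removed the remaining children shift and the next one is skipped. Here is a sketch that iterates over a snapshot of the children instead. The sample XML, including the REPOSITORY root and the NAME attributes, is a made-up miniature of the real input.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the input; the real file has more nesting.
SAMPLE = """
<REPOSITORY>
  <FOLDER NAME="demo">
    <SOURCE NAME="s1"/>
    <TARGET NAME="t1"/>
    <MAPPING NAME="m_demo_trans_agg"/>
    <MAPPING NAME="m_demo_trans_exp"/>
    <CONFIG NAME="default"/>
    <WORKFLOW NAME="wf_m_demo_trans_agg_exp"/>
  </FOLDER>
</REPOSITORY>
"""

root = ET.fromstring(SAMPLE)
unwanted = {"SOURCE", "TARGET", "MAPPING"}

for folder in list(root.iter("FOLDER")):
    # list(folder) is a snapshot of the direct children, so removing from
    # folder inside the loop cannot make the iteration skip a sibling.
    for child in list(folder):
        if child.tag in unwanted:
            folder.remove(child)

remaining = [child.tag for child in root.find("FOLDER")]
print(remaining)
```

Note that remove() only works on direct children of the element it is called on, which is why the inner loop walks list(folder) rather than folder.iter().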

How to concatenate two xml files in python?

Using the Python module xml.etree.ElementTree, how can I concatenate two files? Assume each file is written to disk and NOT hard-coded. To illustrate, on Linux, cat is used below to print the contents of each file.
[<user/path>]$ cat file1.xml
<file>has_content</file>
[<user/path>]$ cat file2.xml
<root>more_content</root>
After both files have been opened and the second concatenated to the first, the result should be:
[<user/path>]$ cat new_file.xml
<new_root><file>has_content</file><root>more_content</root></new_root>
I would like to simply 'merge' these two files together but I have been struggling. All I have been really able to find is about appending to a child or adding a SubElement.
Could you try this and check whether it meets your requirement? I used the xml.etree.ElementTree module to read the XML data from file1.xml and file2.xml, then wrote their contents to a new file whose root node is
<new_root>
...
</new_root>
Code:
import xml.etree.ElementTree as ET
# Serialize each file's root element back to a string
data1 = ET.tostring(ET.parse('file1.xml').getroot()).decode("utf-8")
data2 = ET.tostring(ET.parse('file2.xml').getroot()).decode("utf-8")
# Open with "w" so rerunning the script does not append a second copy
with open("new_file.xml", "w") as f:
    f.write('<new_root>')
    f.write(data1)
    f.write(data2)
    f.write('</new_root>')
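An alternative sketch that stays inside the ElementTree API instead of writing tag strings by hand: create the new root element and append each file's root to it. The function name concatenate is made up, and the demo recreates file1.xml and file2.xml from the question in a temporary directory so that it runs on its own.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def concatenate(paths, root_tag="new_root"):
    """Return a new element whose children are the roots of the given files."""
    new_root = ET.Element(root_tag)
    for path in paths:
        new_root.append(ET.parse(path).getroot())
    return new_root


# Demo: recreate the two files from the question in a temp directory.
with tempfile.TemporaryDirectory() as tmp:
    contents = {"file1.xml": "<file>has_content</file>",
                "file2.xml": "<root>more_content</root>"}
    paths = []
    for name, text in contents.items():
        paths.append(os.path.join(tmp, name))
        with open(paths[-1], "w", encoding="utf-8") as fh:
            fh.write(text)
    combined = concatenate(paths)
    out_path = os.path.join(tmp, "new_file.xml")
    ET.ElementTree(combined).write(out_path)  # serialize the merged tree
    with open(out_path, encoding="utf-8") as fh:
        written = fh.read()

print(written)
```

Because the serializer produces the output, the result is always well-formed, and the same function handles more than two files.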

Getting file name while reading files from local system using pyspark

Additional update:
I tried the same code on files stored in HDFS and it works there, but when I use it on my local file system I get this error: Caused by: java.io.FileNotFoundException: File file:/root/cd/parsed_cd_5.xml does not exist
Original question and initial update
I am using ElementTree to parse XML files. I ran the code in Python and it worked like a charm, but when I try to run the same code using Spark I get the error below.
Error:
File "/root/sparkCD.py", line 82, in
for filename in glob.glob(os.path.join(path, '*.xml')):
File "/usr/lib64/python2.6/posixpath.py", line 67, in join
elif path == '' or path.endswith('/'):
From the error it is clear that the issue is with "for filename in glob.glob(os.path.join(path, '*.xml'))", but I don't know how to achieve the same thing in PySpark.
Since I can't share my full code, I will only share the snippet where I get the error, next to the Python code where I do not.
Python:
path = '/root/cd'
for filename in glob.glob(os.path.join(path, '*.xml')):
    tree = ET.parse(filename)
    doc = tree.getroot()
Pyspark:
path = sc.textFile("file:///root/cd/")
for filename in glob.glob(os.path.join(path, '*.xml')):
    tree = ET.parse(filename)
    doc = tree.getroot()
How can I resolve this issue? All I want is the name of the file I am currently processing, from the cd directory on my local system, using PySpark.
Forgive me if this sounds stupid to you.
Update:
I tried the suggestion given below, but I am not getting the file name.
Below is my code:
filenme = sc.wholeTextFiles("file:///root/cd/")
nameoffile = filenme.map(lambda (name, text): name.split("/").takeRight(1)(0)).take(0)
print (nameoffile)
The result I am getting is:
PythonRDD[22] at RDD at PythonRDD.scala:43
Update:
I wrote the code below using textFile instead of wholeTextFiles, but I get the same error. I also want to point out that my question is about getting the name of my file, so textFile will not help me with that. I tried running the code you suggested, but I get the same result.
path = sc.textFile("file:///root/cd/")
print (path)
If the input directory contains many small files, then wholeTextFiles would help.
>>pairRDD = sc.wholeTextFiles('<path>')
>>pairRDD.map(lambda x:x[0]).collect() #print all file names
Each record in pairRDD is a pair whose key is the absolute file path and whose value is the entire file content.
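Building on the (path, content) pairs described above, here is a rough sketch of a function that turns one such pair into the bare file name plus the parsed root tag. It assumes Python 3, which differs from the Python 2.6 shown in the traceback, and the function name name_and_root_tag and the sample record are made up. The Spark call is left as a comment because it needs a running SparkContext.

```python
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def name_and_root_tag(record):
    """Map one (uri, text) pair from wholeTextFiles to (file name, root tag)."""
    uri, text = record
    # "file:/root/cd/a.xml" -> "/root/cd/a.xml" -> "a.xml"
    filename = posixpath.basename(urlparse(uri).path)
    doc = ET.fromstring(text)  # parse the file content held in memory
    return filename, doc.tag


# Plain-Python check with a made-up record; no Spark needed for this part.
sample = ("file:/root/cd/parsed_cd_5.xml", "<dataset><pair/></dataset>")
result = name_and_root_tag(sample)
print(result)

# With a SparkContext named sc (assumed to exist), it would be applied as:
#     sc.wholeTextFiles("file:///root/cd/").map(name_and_root_tag).collect()
```

Calling collect() (or take(n) with n > 0) is what returns actual values; printing the RDD itself only shows its description, such as PythonRDD[22].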
Not a full solution, but this appears to be a clear problem with your code.
In python you have:
path = '/root/cd'
Now path should contain the location that you are interested in.
In pySpark however, you do this:
path = sc.textFile("file:///root/cd/")
Now path is an RDD holding the text of the files at the location that you are interested in, not the location itself.
If you try to run your followup command on that, it makes sense that it tries to do something strange (and thus fails).
