Accessing non tree structured xml data in python

I have several XML files that I want to parse in Python. I am aware of the ElementTree package, but my XML files aren't stored in a tree-like structure. Below is an example:
<tag1 attribute1="at1" attribute2="at2">My files are text that I annotated with a tool
to create these xml files.</tag1>
Some parts of the text are enclosed in an xml tag, whereas others are not.
<tag1 attribute1="at1" attribute2="at2"><tag2 attribute3="at3" attribute4="at4">Some
are even enclosed in multiple tags.</tag1></tag2>
And some have overlapping tags:
<tag1 attribute1="at1" attribute2="at2">This is an example sentence
<tag3 attribute5="at5">containing a nested example sentence</tag3></tag1>
Whenever I use an ElementTree-like function to parse the file, I can only access the very first tag. I am looking for a way to parse all the tags, and I don't need a tree-like structure. Any help is greatly appreciated.

If you have one XML fragment per line, just parse each line individually.
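import xml.etree.ElementTree as ET
for line in some_file:
    if line.strip():  # skip blank lines
        root = ET.fromstring(line)  # the line must be a single well-formed fragment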
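When fragments span several lines or sit between stretches of plain text, as in the question's example, a different approach may be easier. The following is a minimal sketch, not the answerer's method: it wraps the whole text in a synthetic root element and then walks every element in document order. The name list_annotations and the SAMPLE text are hypothetical, and the sketch assumes the tags are properly nested; genuinely overlapping or mis-ordered closing tags would still raise a ParseError.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape of the question: plain text mixed with tags.
SAMPLE = (
    '<tag1 attribute1="at1" attribute2="at2">Annotated text.</tag1>\n'
    'Text outside any tag.\n'
    '<tag1 attribute1="at1" attribute2="at2">An example sentence '
    '<tag3 attribute5="at5">containing a nested one</tag3></tag1>\n'
)


def list_annotations(text):
    """Return (tag, attributes, enclosed text) for every element in text."""
    # A synthetic root turns the mixed content into one well-formed document.
    root = ET.fromstring("<root>" + text + "</root>")
    return [
        (elem.tag, dict(elem.attrib), "".join(elem.itertext()))
        for elem in root.iter()
        if elem is not root
    ]


annotations = list_annotations(SAMPLE)
for tag, attrs, enclosed in annotations:
    print(tag, attrs, repr(enclosed))
```

The result is a flat list rather than a tree, which matches what the question asks for; nested tags simply appear as separate entries.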

Related

Parse deeply nested XML to pandas dataframe

I'm trying to fetch particular parts of an XML file and move them into a pandas dataframe. I have followed some xml.etree tutorials, but I'm still stuck at getting the output. So far I've managed to find the child nodes, but I can't access them (i.e. I can't get the actual data out of them). Here is what I've got so far.
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
root.tag
# find all nodes within the XML data
tree.findall(".//")
# access the node
tree.findall(".//{http://someUrl.nl/schema/enterprise/program}programSummaryText")
What I want is the data from the node programDescriptions, specifically its child programDescriptionText with xml:lang="nl", plus a couple of others. But let's focus on this one first.
Some data to work with:
<?xml version="1.0" encoding="UTF-8"?>
<programs xmlns="http://someUrl.nl/schema/enterprise/program">
<program xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://someUrl.nl/schema/enterprise/program http://someUrl.nl/schema/enterprise/program.xsd">
<customizableOnRequest>true</customizableOnRequest>
<editor>webmaster#url</editor>
<expires>2019-04-21</expires>
<format>Edu-dex 1.0</format>
<generator>www.Url.com</generator>
<includeInCatalog>Catalogs</includeInCatalog>
<inPublication>true</inPublication>
<lastEdited>2019-04-12T20:03:09Z</lastEdited>
<programAdmission>
<applicationOpen>true</applicationOpen>
<applicationType>individual</applicationType>
<maxNumberOfParticipants>12</maxNumberOfParticipants>
<minNumberOfParticipants>8</minNumberOfParticipants>
<paymentDue>up-front</paymentDue>
<requiredLevel>academic bachelor</requiredLevel>
<startDateDetermination>fixed starting date</startDateDetermination>
</programAdmission>
<programCurriculum>
<instructionMode>training</instructionMode>
<teacher>
<id>{D83FFC12-0863-44A6-BDBB-ED618627F09D}</id>
<name>SomeName</name>
<summary xml:lang="nl">
Long text of the summary. Not needed.
</summary>
</teacher>
<studyLoad period="hour">26</studyLoad>
</programCurriculum>
<programDescriptions>
<programName xml:lang="nl">Program Course Name</programName>
<programSummaryText xml:lang="nl">short Program Course Name summary</programSummaryText>
<programSummaryHtml xml:lang="nl">short Program Course Name summary in HTML format</programSummaryHtml>
<programDescriptionText xml:lang="nl">This part is needed from the XML.
Big program description text. This part is needed to parse from the XML file.
</programDescriptionText>
<programDescriptionHtml xml:lang="nl">Not needed;
Not needed as well;
</programDescriptionHtml>
<subjectText>
<subject>curriculum</subject>
<header1 xml:lang="nl">Beschrijving</header1>
<descriptionHtml xml:lang="nl">Yet another HTML desscription;
Not necessarily needed;</descriptionHtml>
</subjectText>
<searchword xml:lang="nl">search word</searchword>
<webLink xml:lang="nl">website-url</webLink>
</programDescriptions>
<programSchedule>
<programRun>
<id>PR-019514</id>
<status>application opened</status>
<startDate isFinal="true">2019-06-26</startDate>
<endDate isFinal="true">2020-02-11</endDate>
</programRun>
</programSchedule>
</program>
</programs>

Try the code below (55703748.xml contains the XML you posted):
import xml.etree.ElementTree as ET
tree = ET.parse('55703748.xml')
root = tree.getroot()
nodes = root.findall(".//{http://someUrl.nl/schema/enterprise/program}programSummaryText")
for node in nodes:
    print(node.text)
Output
short Program Course Name summary
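The question asks for programDescriptionText with xml:lang="nl" rather than the summary. A minimal sketch of that, under a few assumptions: the SAMPLE string is a trimmed-down copy of the posted data, extract_rows is a hypothetical helper name, and the prefix "p" in the namespace mapping is arbitrary. It collects one dict per program, a shape that pandas.DataFrame(rows) should accept directly.

```python
import xml.etree.ElementTree as ET

# Prefix "p" is arbitrary; only the namespace URI has to match the document.
NS = {"p": "http://someUrl.nl/schema/enterprise/program"}
# ElementTree exposes xml:lang under the fixed XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Trimmed-down copy of the data posted in the question.
SAMPLE = """<programs xmlns="http://someUrl.nl/schema/enterprise/program">
  <program>
    <programDescriptions>
      <programName xml:lang="nl">Program Course Name</programName>
      <programDescriptionText xml:lang="nl">This part is needed from the XML.
      </programDescriptionText>
    </programDescriptions>
  </program>
</programs>"""


def extract_rows(root, lang="nl", tags=("programName", "programDescriptionText")):
    """Build one dict per <program>, keeping only nodes in the given language."""
    rows = []
    for program in root.findall("p:program", NS):
        row = {}
        for tag in tags:
            for node in program.findall(f"p:programDescriptions/p:{tag}", NS):
                if node.get(XML_LANG) == lang:
                    row[tag] = (node.text or "").strip()
        rows.append(row)
    return rows


rows = extract_rows(ET.fromstring(SAMPLE))
print(rows)
```

Passing a prefix mapping to find and findall avoids repeating the full {namespace} string in every path.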

How to extract some text from json file without loading it?

python lxml can be used to extract text (e.g., with xpath) from XML files without having to fully parse XML. For example, I can do the following which is faster than BeautifulSoup, especially for large input. I'd like to have some equivalent code for JSON.
from lxml import etree
tree = etree.XML('<foo><bar>abc</bar></foo>')
print(type(tree))
r = tree.xpath('/foo/bar')
print([x.tag for x in r])
I have seen http://goessner.net/articles/JsonPath/, but I can't find example Python code that extracts some text from a JSON file without having to use json.load(). Could anybody show me an example? Thanks.

I'm assuming you don't want to load the entire JSON for performance reasons.
If that's the case, perhaps ijson is what you need. I used it to search huge JSON files (> 8 GB) and it works well.
However, you will have to implement the search code yourself.

Python 3.x: parse ATOM XML and convert to dict

I'm struggling to parse an ATOM XML file, coming from an API, into a common data structure such as a dict, a pandas dataframe or JSON.
I understand XML files are more complex than JSON files, and hence there won't be a very simple, generic solution to this. I hope the fact that I'm dealing with an ATOM structure makes it easier to parse the file into a more general data structure.
The structure of the XML data: http://opendata.cbs.nl/ODataFeed/OData/70266ned/TypedDataSet
And similar for JSON here: http://opendata.cbs.nl/ODataFeed/OData/70266ned/TypedDataSet
The reason I can't use the JSON file is that it is often not available.
I played around with libraries like xml.etree, xmltodict, lxml, xmljson and feedparser, but I keep getting errors.
For example, using feedparser:
import requests
from xml.etree import ElementTree
r = requests.get('http://opendata.cbs.nl/ODataFeed/OData/70266ned/TypedDataSet')
tree = ElementTree.fromstring(r.content)
This yields the error:
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0
Help would be highly appreciated!

I don't know if you have solved it yet, but have you tried the following?
tree = ElementTree.fromstring(r.text)
r.content returns the content as bytes, whereas r.text returns it as a decoded string (see: http://docs.python-requests.org/en/master/api/#requests.Response)
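Once the feed parses, turning it into a list of dicts is mostly a matter of namespaces. The sketch below is an illustration under assumptions, not a tested solution for this API: it assumes the feed follows the usual OData Atom layout (entry/content/m:properties with d:-prefixed fields), and the SAMPLE document, its field names and the helper feed_to_records are all hypothetical stand-ins for the real response text.

```python
import xml.etree.ElementTree as ET

# Namespaces used by Atom and by OData (v2/v3) Atom payloads.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
}

# Hypothetical stand-in for the response text; field names are made up.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:ID>0</d:ID>
        <d:Perioden>1995JJ00</d:Perioden>
      </m:properties>
    </content>
  </entry>
</feed>"""


def feed_to_records(xml_text):
    """Return one dict per Atom entry, keyed by property name without namespace."""
    root = ET.fromstring(xml_text)
    records = []
    for props in root.iterfind("atom:entry/atom:content/m:properties", NS):
        # child.tag looks like "{namespace}Name"; keep only the local name.
        records.append({child.tag.split("}", 1)[-1]: child.text for child in props})
    return records


records = feed_to_records(SAMPLE)
print(records)
```

A list of flat dicts like this converts directly to JSON with json.dumps, and should also be accepted by pandas.DataFrame.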

Updating an Existing XML Document in Python

I have a large XML file whose structure is approximately as follows:
<GROUNDTRUTH>
<thing fileName="1" attrib="2">
<SUBSUB moreStuff="12" otherStuff="13"/>
</thing>
<thing fileName="2" attrib="2">
<SUBSUB moreStuff="12" otherStuff="13"/>
</thing>
<thing fileName="3" attrib="2">
<SUBSUB moreStuff="12" otherStuff="13"/>
</thing>
</GROUNDTRUTH>
I don't think I was clear enough in the original posting of this question. I have an xml document called GROUNDTRUTH, and inside of that I have several thousand "things". I want to search through all of the things in the document via filename and then change an attribute. So if I was searching for fileName="2", I would change its attribute to attrib=x. And for some thing, perhaps I'd go down to the sub level and change moreStuff.
My plan is to store into a csv file the names of the 'things' I need to change, and what I want to change the value of 'attrib' to. What function or module will provide this kind of functionality? Or am I just missing an easy/obvious approach? Ultimately I'd like to have a working script that will take a csv file with the thing identifier, and value to be updated, and take the xml file to make those changes onto.
Thanks for your help and suggestions!

First, you can transform the original XML file into a new XML file with an XSLT stylesheet, which can modify XML in almost any way: changing, restructuring, or reordering attributes and elements. Note that XSLT is a declarative, special-purpose language for transforming XML documents.
Then, you can use Python's lxml library to run the transformation:
#!/usr/bin/python
import lxml.etree as ET

dom = ET.parse('originalfile.xml')
xslt = ET.parse('transformfile.xsl')
transform = ET.XSLT(xslt)
newdom = transform(dom)
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True)
# tostring() returns bytes here, so write in binary mode;
# 'wb' overwrites, where 'ab' would append a second document to an existing file
with open('finalfile.xml', 'wb') as xmlfile:
    xmlfile.write(tree_out)
By the way, PHP, Java, C, VB, or pretty much any language, and even your everyday browser, can run XSLT transformations. To have the browser run one, simply add a stylesheet reference at the top of the XML file:
<?xml-stylesheet type="text/xsl" href="transformfile.xsl"?>
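For the CSV-driven update described in the question, plain ElementTree plus the csv module may be enough. This is a minimal sketch of that alternative, not the XSLT approach above: the CSV column names fileName and attrib, the inline CSV_TEXT, and the helper apply_updates are assumptions made for illustration. A real script would read both inputs from disk and finish with ET.ElementTree(root).write(...).

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened copy of the structure shown in the question.
XML = """<GROUNDTRUTH>
  <thing fileName="1" attrib="2"><SUBSUB moreStuff="12" otherStuff="13"/></thing>
  <thing fileName="2" attrib="2"><SUBSUB moreStuff="12" otherStuff="13"/></thing>
</GROUNDTRUTH>"""

# Hypothetical CSV layout: one row per thing to change.
CSV_TEXT = "fileName,attrib\n2,x\n"


def apply_updates(root, csv_file):
    """Set attrib on every <thing> whose fileName appears in the CSV."""
    updates = {row["fileName"]: row["attrib"] for row in csv.DictReader(csv_file)}
    changed = 0
    for thing in root.iter("thing"):
        new_value = updates.get(thing.get("fileName"))
        if new_value is not None:
            thing.set("attrib", new_value)
            changed += 1
    return changed


root = ET.fromstring(XML)
changed = apply_updates(root, io.StringIO(CSV_TEXT))
print(changed, ET.tostring(root, encoding="unicode"))
```

Loading the CSV into a dict first means the XML is traversed only once, which matters with several thousand things. Changing moreStuff on the sub level would follow the same pattern with thing.find("SUBSUB").set(...).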

convert an xmi file to xml file using python

I need to convert an activity diagram in XMI format to XML format. Is this conversion possible using Python? Are there any tools to convert XMI files to XML?

Converting XML to XML is usually called XML transformation. For Python you can use libxsltmod to perform XML transformations by using XSLT 'stylesheets'.

As Ignacio says, the problem may not be that the target tool expects XML, but that it probably expects a different XMI format.
Unfortunately, each tool follows its own interpretation of the XMI standard, so two modeling tools will most likely generate two incompatible XMI files for the same model. See an example in this "model once open anywhere not true" post.

You can get the information you need (classes, attributes, ...) from any file.xmi. This may help:
from xml.dom import minidom

xmldoc = minidom.parse('file.xmi')
# print each UML class, followed by the attributes declared inside it
for element in xmldoc.getElementsByTagName("UML:Class"):
    print(" -> UML:Class ", element.getAttribute('name'))
    for a in element.getElementsByTagName("UML:Attribute"):
        print(" -> UML:Attr : ", a.getAttribute('name'))
