Python XML DOM unique object? - python

I have a problem understanding Python's way of handling references in lists. I tried googling and reading Python books but did not find a suitable answer to my problem.
If I have a file called test.py with the following code:
from lxml import etree as ET
__check = ET.Entity('check')
def test():
    entries = []
    for c in range(2):
        row = []
        row.append(ET.Element('entry'))
        a = ET.Element('entry')
        a.append(__check)
        row.append(a)
        entries.append(row)
    for row in entries:
        for e in row:
            ET.dump(e)
When executing the test() method the output is:
<entry/>
<entry/>
<entry/>
<entry>&check;</entry>
The expected output would be:
<entry/>
<entry>&check;</entry>
<entry/>
<entry>&check;</entry>
What am I missing? I know I can change the line a.append(__check) to a.append(copy.deepcopy(__check)) and it works, but I don't understand why the original version does not behave the way I expect.
Edit: I am using Python 2.7.6

You're appending the same element over and over. XML DOM does not allow the same element to exist in two places in the tree (it wouldn't be a tree if it did), so your second append moves the __check element to the new place in the tree.
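The move described above can be reproduced in isolation. This is a minimal sketch, assuming lxml is installed; the variable names (first, second, third, fourth) are made up for illustration. It appends one Entity node to two parents in turn, then shows that giving each parent its own node avoids the move.

```python
from lxml import etree

check = etree.Entity("check")
first = etree.Element("entry")
second = etree.Element("entry")

first.append(check)
second.append(check)  # lxml moves the node: it now belongs to `second` only

moved_first = etree.tostring(first)    # `first` no longer contains the entity
moved_second = etree.tostring(second)  # `second` now holds it

# Giving each parent its own node avoids the move.
third = etree.Element("entry")
third.append(etree.Entity("check"))
fourth = etree.Element("entry")
fourth.append(etree.Entity("check"))
```

Creating a fresh Entity per parent has the same effect as the copy.deepcopy(__check) workaround mentioned in the question: every parent ends up with a distinct node.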

Related

Modifying element in xml using python

Can anyone please explain how to modify an XML element in Python using ElementTree?
I want to keep the rego AD-4214 and change the make 'Tata' to 'Nissan' and the model 'Sumo' to 'Skyline'.
If rewriting the entire file is acceptable [1], the easiest way would be to turn the XML file into a dictionary (see, for example, "How to convert an XML string to a dictionary?"), make your modifications on that dictionary, and convert the dict back to XML (for example with https://pypi.org/project/dicttoxml/).
[1] Consider lost formatting: whitespace, number formats, etc. may not be preserved by this.
This should work:
import xml.etree.ElementTree as ET
tree = ET.parse('your_xml_source.xml')
root = tree.getroot()
root[1][1].text = "Nissan"
root[1][2].text = "Skyline"
getroot() gives you the root element (<motorvehicle>), [1] selects its second child, the <vehicle> with rego AD-4214. The secondary indexing, [1] and [2], gives you AD-4214's <make> and <model> respectively. Then using the text attribute, you can change their text content.
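Positional indexing breaks if the order of the vehicles changes, so a lookup by rego may be more robust. The sketch below is only an illustration: the question's XML is not shown, so the SAMPLE document (with <rego>, <make> and <model> as child elements of <vehicle>) and the helper name update_vehicle are assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed layout; the original file from the question is not shown.
SAMPLE = """<motorvehicle>
  <vehicle><rego>AB-1234</rego><make>Ford</make><model>Focus</model></vehicle>
  <vehicle><rego>AD-4214</rego><make>Tata</make><model>Sumo</model></vehicle>
</motorvehicle>"""


def update_vehicle(root, rego, make, model):
    """Set make and model on the first <vehicle> whose <rego> text matches."""
    for vehicle in root.iter("vehicle"):
        if vehicle.findtext("rego") == rego:
            vehicle.find("make").text = make
            vehicle.find("model").text = model
            return True
    return False  # no vehicle with that rego was found


root = ET.fromstring(SAMPLE)
updated = update_vehicle(root, "AD-4214", "Nissan", "Skyline")
result = ET.tostring(root, encoding="unicode")
```

When working with a parsed file instead of a string, the modified tree still has to be saved, for example with tree.write('your_xml_source.xml').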

Using ElementTree to find a node - invalid predicate

I'm very new to this area so I'm sure it's just something obvious. I'm trying to change a python script so that it finds a node in a different way but I get an "invalid predicate" error.
import xml.etree.ElementTree as ET
tree = ET.parse("/tmp/failing.xml")
doc = tree.getroot()
thingy = doc.find(".//File/Diag[@id='53']")
print(thingy.attrib)
thingy = doc.find(".//File/Diag[BaseName = 'HTTPHeaders']")
print(thingy.attrib)
That should find the same node twice but the second find gets the error. Here is an extract of the XML:
<Diag id="53">
    <Formatted>xyz</Formatted>
    <BaseName>HTTPHeaders</BaseName>
    <Column>17</Column>
I hope I've not cut it down too much. Basically, finding it with "@id" works but I want to search on that BaseName tag instead.
Actually, I want to search on a combination of tags so I have a more complicated expression lined up but I can't get the simple one to work!
The code in the question works when using Python 3.7. If the spaces before and after the equals sign in the predicate are removed, it also works with earlier Python versions.
thingy = doc.find(".//File/Diag[BaseName='HTTPHeaders']")
See https://bugs.python.org/issue31648.
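The predicate forms discussed here can be tried on a small document. This is a sketch only: the SAMPLE XML (including the <Report> root and the second <Diag>) is invented to surround the extract from the question. It also shows one way to search on a combination of child tags, by chaining predicates, which ElementTree's XPath subset allows.

```python
import xml.etree.ElementTree as ET

# Invented document built around the extract shown in the question.
SAMPLE = """<Report>
  <File>
    <Diag id="52"><Formatted>abc</Formatted><BaseName>Cookies</BaseName><Column>3</Column></Diag>
    <Diag id="53"><Formatted>xyz</Formatted><BaseName>HTTPHeaders</BaseName><Column>17</Column></Diag>
  </File>
</Report>"""

doc = ET.fromstring(SAMPLE)

# Attribute predicate.
by_id = doc.find(".//File/Diag[@id='53']")

# Child-text predicate, written without spaces around "=" so it also
# works on Python versions older than 3.7.
by_name = doc.find(".//File/Diag[BaseName='HTTPHeaders']")

# Chained predicates: both conditions must hold.
combined = doc.find(".//File/Diag[BaseName='HTTPHeaders'][Column='17']")
```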

Get inner xml from lxml

I have the following string, which is part of a bigger XML document:
content = '<odvNameElem stopID="9001002"><itdMapItemList/>Rathaus</odvNameElem>'
And I want to access Rathaus. My current approach is to parse it with lxml and trying to access the text of the element 'odvNameElem':
from lxml import etree
content = '<odvNameElem stopID="9001002"><itdMapItemList/>Rathaus</odvNameElem>'
root = etree.fromstring(content)
print(root.text)
This however results in None. What am I doing wrong?
etree.__version__ = '4.2.5'
I am not sure why root.xpath("string()") works but root.xpath("//text()") only returns an empty list. Can somebody please explain this?
The "Rathaus" string is the value of the tail property of the itdMapItemList element. Examples:
root.xpath("itdMapItemList")[0].tail
root.find("itdMapItemList").tail
See https://lxml.de/tutorial.html#elements-contain-text.
root.xpath("string()") returns the concatenation of the string values of the root node and its descendants, which indeed is "Rathaus" in this case.
See https://www.w3.org/TR/xpath-10/#function-string.
root.xpath("//test") does not make sense (there is no test element). Did you mean root.xpath("//text()")?
root.xpath("//text()") returns a list of all text nodes, which in this case is ['Rathaus'].
If the input XML is changed to
<odvNameElem stopID="9001002">ABC<itdMapItemList/>Rathaus</odvNameElem>
then the result is ['ABC', 'Rathaus']
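The difference between text, tail, string() and //text() can be checked side by side. A small sketch, assuming lxml is installed and using the modified input from the answer:

```python
from lxml import etree

content = '<odvNameElem stopID="9001002">ABC<itdMapItemList/>Rathaus</odvNameElem>'
root = etree.fromstring(content)

# Text before the first child belongs to the parent's .text
leading_text = root.text

# Text after a child's closing tag is stored as that child's .tail
tail_text = root.find("itdMapItemList").tail

# string() concatenates all descendant text nodes into one string
string_value = root.xpath("string()")

# //text() returns every text node as a separate list item
text_nodes = root.xpath("//text()")
```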

Python XML 'TypeError: must be xml.etree.ElementTree.Element, not str'

I currently am trying to build an XML file from a CSV file. Currently my code reads the CSV file to data and begins creating the XML from the data that is stored within the CSV.
CSV Example:
Element,XMLFile
SubElement,XMLName,XMLFile
SubElement,XMLDate,XMLName
SubElement,XMLInformation,XMLDate
SubElement,XMLTime,XMLName
Expected Output:
<XMLFile>
<XMLName>
<XMLDate>
<XMLInformation />
</XMLDate>
<XMLTime />
</XMLName>
</XMLFile>
Currently my code attempts to look at the CSV to see what the parent is for the new subelement:
# Defines main element
# xmlElement = xml.Element(XMLFile)
xmlElement = xml.Element(csvData[rowNumber][columnNumber])
# Should Define desired parent (FAIL) and SubElement name (PASS)
# xmlSubElement = xml.SubElement(XMLFile, XMLName)
xmlSubElement = xml.SubElement(csvData[rowNumber][columnNumber + 2], csvData[rowNumber][columnNumber + 1])
When the code attempts to use the CSV source string as the parent parameter, Python 3.5 generates the following error:
TypeError: must be xml.etree.ElementTree.Element, not str
The known cause of the error is that the parent parameter is passed as a string when it is expected to be an Element or SubElement.
Is it possible to recall the stored value from the CSV and have it reference the Element or SubElement, instead of a string? The goal is to allow the code to read the CSV file and assign any SubElement to the parent listed in the CSV.
I cannot tell for sure, but it looks like you are doing:
ElementTree.SubElement(str, str)
when you should be doing:
ElementTree.SubElement(Element, str)
It also seems like you already know this. The real question, then, is how are you going to reference the parent object when you only know its tag string? You could search for Elements in the ElementTree with that particular tag string, but this is generally not a good idea as XML allows multiple instances of similar elements.
I would suggest you either:
Find a strategy to store references to parent elements
See if there is a way to uniquely identify the parent element using XPath
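The first suggestion, storing references to parent elements, could look like the sketch below. It assumes that every tag name in the CSV is unique and that a parent row always appears before its children; the function name build_tree and the inline CSV_TEXT are illustrative, not taken from the question's code.

```python
import csv
import io
import xml.etree.ElementTree as ET

CSV_TEXT = """Element,XMLFile
SubElement,XMLName,XMLFile
SubElement,XMLDate,XMLName
SubElement,XMLInformation,XMLDate
SubElement,XMLTime,XMLName
"""


def build_tree(csv_text):
    """Build an element tree, looking parents up by tag name in a dict."""
    elements = {}  # tag name -> Element created so far (assumes unique tags)
    root = None
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        kind, tag = row[0], row[1]
        if kind == "Element":
            root = ET.Element(tag)
            elements[tag] = root
        elif kind == "SubElement":
            parent = elements[row[2]]  # KeyError if the parent is not defined yet
            elements[tag] = ET.SubElement(parent, tag)
    return root


root = build_tree(CSV_TEXT)
xml_text = ET.tostring(root, encoding="unicode")
```

If tag names can repeat, the dictionary would need a different key, such as a row identifier or a full path, which is where the XPath suggestion comes in.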

Xml creation using ElementTree

I am new to Python. I want to create an XML tree with one parent, several children and several subchildren. The child tags are stored in the list 'TAG' and the subchild tags in the list 'SUB'.
I have come up with the following code, but I am not able to achieve the desired result!
def make_xml(tag,sub):
    '''
    Takes in two lists and Returns a XML object.
    The first list has to contain all the tag objects
    The Second list has to contain child data's
    '''
    from xml.etree.ElementTree import Element, SubElement, Comment, tostring
    top = Element("Grand Parent")
    comment = Comment('This is the ccode parse tree')
    top.append(comment)
    i=0
    try:
        for ee in tag:
            child = SubElement(top, 'Tag'+str(i))
            child.text = str(tag[i]).encode('utf-8',errors = 'ignore')
            subchild = SubElement(child, 'Content'+str(i))
            subchild.text = str(sub[i]).encode('utf-8',errors = 'ignore')
            i = i+1;
    except UnicodeDecodeError:
        print 'oops'
    return top
EDIT:
I have two lists like these:
TAG = ['HAPPY','GO','LUCKY']
SUB = ['ED','EDD','EDDY']
What I want is:
<G_parent>
    <parent1>
        HAPPY
        <child1>
            ED
        </child1>
    </parent1>
    <parent2>
        GO
        <child2>
            EDD
        </child2>
    </parent2>
    <parent3>
        LUCKY
        <child3>
            EDDY
        </child3>
    </parent3>
</G_parent>
The actual list has many more items than this. I want to achieve this using a for loop or similar.
EDIT:
Oops, my bad!
The code works as expected when I pass the example lists. But in my real application the list is long. It contains text fragments extracted from a PDF file. Somewhere in that text I get a UnicodeDecodeError (reason: the text extracted from the PDF is messy; proof: 'oops' gets printed once) and the returned XML object is incomplete.
So I need to figure out a way to process my complete list even when a UnicodeDecodeError occurs. Is that possible? I'm using .decode('utf-8',errors='ignore') and even then the parsing does not complete!
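One possible direction, sketched under assumptions rather than offered as a confirmed fix: because the try block wraps the whole loop, the first error ends the loop. Converting each fragment on its own lets every item be processed. The sketch is written for Python 3 (the question's code is Python 2), assumes fragments may arrive as bytes, and uses a made-up helper name, to_text.

```python
from xml.etree.ElementTree import Element, SubElement, Comment, tostring


def to_text(value):
    """Return value as str; bytes are decoded as UTF-8, dropping invalid bytes."""
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="ignore")
    return str(value)


def make_xml(tags, subs):
    """Build <G_parent> with one <parentN>/<childN> pair per list item."""
    top = Element("G_parent")
    top.append(Comment("This is the ccode parse tree"))
    for i, (tag, sub) in enumerate(zip(tags, subs), start=1):
        parent = SubElement(top, "parent%d" % i)
        parent.text = to_text(tag)  # each item is converted independently
        child = SubElement(parent, "child%d" % i)
        child.text = to_text(sub)
    return top


# The second tag contains an invalid UTF-8 byte to imitate messy PDF text.
tree = make_xml(["HAPPY", b"G\xffO", "LUCKY"], ["ED", "EDD", "EDDY"])
xml_text = tostring(tree, encoding="unicode")
```

Note that zip stops at the shorter list, so tags without a matching sub entry are skipped silently.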
