I've got a large XML file that I need to parse and look for a specific node. Once it has been found, I need to make a copy, edit a couple of values and write the file again.
So far I've managed to get the DOM element that I want. There are actually two of these elements already in the XML, so after I'm finished there will be three. Once I've made a copy of the DOM element and edited the value, how do I then write this into the DOM (and thus the file)?
I'm using Python's minidom (from xml.dom import minidom) at the moment.
In minidom you start by creating a Document:
from xml.dom.minidom import Document
doc = Document()
then, if it is a text node you want to add, create it and append it to an element (the Document itself only accepts a single root element, not text):
text_node = doc.createTextNode(str(some_content))
elem.appendChild(text_node)
if you had, for example, <some_elem key="my value">some my text</some_elem>, do it like this:
text_node = doc.createTextNode('some my text')
elem.appendChild(text_node)
elem.setAttribute('key', 'my value')
if it is a complex element, create it with:
elem = doc.createElement('your_elem')
if you need to set attributes do:
elem.setAttribute("some-attribute",your_attr)
if you need to append something to it:
elem.appendChild( some_other_elem )
then append the element:
doc.appendChild( elem )
if you need a string representation do:
doc.toxml()
or
doc.toprettyxml()
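Put together, the steps above might look like the following minimal sketch. It reuses the element name, attribute and text from the example in this answer and assumes nothing beyond the standard library.

```python
from xml.dom.minidom import Document

doc = Document()                        # an empty document, no root element yet
elem = doc.createElement("some_elem")   # element name taken from the example above
elem.setAttribute("key", "my value")
elem.appendChild(doc.createTextNode("some my text"))
doc.appendChild(elem)                   # the document accepts exactly one root element

# Serialize only the root element, which leaves out the XML declaration
xml_text = doc.documentElement.toxml()
print(xml_text)
```

Calling doc.toxml() instead would serialize the whole document, including the XML declaration.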
From the minidom documentation:
from xml.dom.minidom import getDOMImplementation
impl = getDOMImplementation()
newdoc = impl.createDocument(None, "some_tag", None)
top_element = newdoc.documentElement
text = newdoc.createTextNode('Some textual content.')
top_element.appendChild(text)
So I guess appendChild is what you ask for?
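For the copy-and-edit case in the question, one possible approach is cloneNode followed by appendChild on the parent. This is only a sketch: the question does not show the real file, so the item and name elements, the id attribute and the SOURCE string are all made up for illustration.

```python
from xml.dom import minidom

# Hypothetical input; the real element names are not given in the question.
SOURCE = (
    '<items>'
    '<item id="1"><name>first</name></item>'
    '<item id="2"><name>second</name></item>'
    '</items>'
)

doc = minidom.parseString(SOURCE)
original = doc.getElementsByTagName("item")[1]

# cloneNode(True) makes a deep copy, including attributes and child nodes
copy = original.cloneNode(True)
copy.setAttribute("id", "3")
copy.getElementsByTagName("name")[0].firstChild.data = "third"

# Attach the copy next to the element it was cloned from
original.parentNode.appendChild(copy)

result = doc.documentElement.toxml()
print(result)

# To write the document back to disk (the file name is an assumption):
# with open("output.xml", "w", encoding="utf-8") as fh:
#     doc.writexml(fh)
```

The copy is not part of the document until it is appended, so edits made to it before that point do not touch the original element.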
Can anyone please explain how to modify an XML element in Python using ElementTree?
I want to keep the rego AD-4214 and change make 'Tata' into 'Nissan' and model 'Sumo' into 'Skyline'.
If rewriting the entire file is acceptable [1], the easiest way would be to turn the XML file into a dictionary (see for example here: How to convert an XML string to a dictionary?), do your modifications on that dictionary, and convert this dict back to XML (for example with https://pypi.org/project/dicttoxml/).
[1] Consider lost formatting: whitespace, number formats, etc. may not be preserved by this.
This should work:
import xml.etree.ElementTree as ET
tree = ET.parse('your_xml_source.xml')
root = tree.getroot()
root[1][1].text = "Nissan"
root[1][2].text = "Skyline"
getroot() gives you the root element (<motorvehicle>), [1] selects its second child, the <vehicle> with rego AD-4214. The secondary indexing, [1] and [2], gives you AD-4214's <make> and <model> respectively. Then using the text attribute, you can change their text content.
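Positional indexing breaks as soon as the order of the vehicles changes, so a lookup by rego may be more robust. The sketch below assumes a layout that matches the description above (a <motorvehicle> root whose <vehicle> children each hold <rego>, <make> and <model>, in that order); the first vehicle's values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed layout; only the AD-4214 / Tata / Sumo values come from the question.
SOURCE = """<motorvehicle>
  <vehicle><rego>AB-1234</rego><make>Ford</make><model>Focus</model></vehicle>
  <vehicle><rego>AD-4214</rego><make>Tata</make><model>Sumo</model></vehicle>
</motorvehicle>"""

root = ET.fromstring(SOURCE)

# Find the vehicle by its rego instead of relying on its position
for vehicle in root.findall("vehicle"):
    if vehicle.findtext("rego") == "AD-4214":
        vehicle.find("make").text = "Nissan"
        vehicle.find("model").text = "Skyline"

updated = ET.tostring(root, encoding="unicode")
print(updated)

# To save the result to a file:
# ET.ElementTree(root).write("your_xml_source.xml", encoding="utf-8", xml_declaration=True)
```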
I have the following string which is part of a bigger XML document:
content = '<odvNameElem stopID="9001002"><itdMapItemList/>Rathaus</odvNameElem>'
And I want to access Rathaus. My current approach is to parse it with lxml and try to access the text of the element 'odvNameElem':
from lxml import etree
content = '<odvNameElem stopID="9001002"><itdMapItemList/>Rathaus</odvNameElem>'
root = etree.fromstring(content)
print(root.text)
This however results in None. What am I doing wrong?
etree.__version__ = '4.2.5'
I am not sure why the following works:
root.xpath("string()") but root.xpath("//text()") only returns an empty list. Can somebody please explain this?
The "Rathaus" string is the value of the tail property of the itdMapItemList element. Examples:
root.xpath("itdMapItemList")[0].tail
root.find("itdMapItemList").tail
See https://lxml.de/tutorial.html#elements-contain-text.
root.xpath("string()") returns the concatenation of the string values of the root node and its descendants, which indeed is "Rathaus" in this case.
See https://www.w3.org/TR/xpath-10/#function-string.
root.xpath("//test") does not make sense (there is no test element). Did you mean root.xpath("//text()")?
root.xpath("//text()") returns a list of all text nodes, which in this case is ['Rathaus'].
If the input XML is changed to
<odvNameElem stopID="9001002">ABC<itdMapItemList/>Rathaus</odvNameElem>
then the result is ['ABC', 'Rathaus']
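The text/tail split can be checked with a short sketch. It uses the standard library's xml.etree.ElementTree rather than lxml (an assumption made so it runs without extra packages); both libraries share the same text and tail model, although ElementTree has no xpath() method.

```python
import xml.etree.ElementTree as ET

content = '<odvNameElem stopID="9001002">ABC<itdMapItemList/>Rathaus</odvNameElem>'
root = ET.fromstring(content)

leading = root.text                          # text before the first child
trailing = root.find("itdMapItemList").tail  # text that follows the child element
all_text = list(root.itertext())             # every text fragment, in document order

print(leading, trailing, all_text)
```

With the original string from the question (no ABC), root.text is None because nothing precedes the first child, which is why print(root.text) showed None.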
I'm attempting to save data from several lists in XML format, but I cannot understand how to make the XML display properly. An example of my code right now is as follows:
from lxml import etree
#Create XML Root
articles = etree.Element('root')
#Create Lists & Data
t_list = ['title1', 'title2', 'title3', 'title4', 'title5']
c_list = ['content1', 'content2', 'content3', 'content4', 'content5']
sum_list = ['summary1', 'summary2', 'summary3', 'summary4', 'summary5']
s_list = ['source1', 'source2', 'source3', 'source4', 'source5']
i = 0
for i in range(len(t_list)):
    #Create SubElements of XML Root
    article = etree.SubElement(articles, 'Article')
    titles = etree.SubElement(article, 'Title')
    summary = etree.SubElement(article, 'Summary')
    source = etree.SubElement(article, 'Source')
    content = etree.SubElement(article, 'Content')
    #Add List Data to SubElements
    titles.text = t_list[i]
    summary.text = sum_list[i]
    source.text = s_list[i]
    content.text = c_list[i]
print(etree.tostring(articles, pretty_print=True))
My current output comes out jumbled, all on a single line, as follows:
b'<root>\n <Article>\n <Title>title1</Title>\n <Summary>summary1</Summary>\n <Source>source1</Source>\n <Content>content1</Content>\n </Article>\n
It looks like the pretty_print option in lxml is adding proper indentation, as well as the \n breaks I want, but they don't seem to be interpreted correctly during output; everything is written on a single line.
The output I'm trying to get is as follows:
<root>
  <Article>
    <Title>title1</Title>
    <Summary>summary1</Summary>
    <Source>source1</Source>
    <Content>content1</Content>
  </Article>
Ideally, I'd like for my output to be viewed as a valid XML document, and display in proper nested format.
Your "Current Output" is the representation (the internal Python repr) of the bytestring generated by etree.tostring(); in Python 3, print(somebytestring) prints that representation instead of the decoded text.
Fortunately the solution is quite simple: just pass the desired encoding to etree.tostring(), i.e.:
xml = etree.tostring(articles, encoding="unicode", pretty_print=True)
print(xml)
I've only used the base ET module in Python and can't find an lxml download for Python 3.5 (which I'm on) to test it, but the b before the line indicates bytes, and a quick glance at the documentation indicates that tostring() has an encoding keyword, so you should just need to set that to unicode (with utf-8 you still get bytes back, which you would then have to decode).
I'll also mention that you don't need to set "i" before your for-loop (Python will create the "i" it needs for the for-loop), though I, personally, would zip the lists and iterate over the items themselves (that won't have any real impact on the code in this situation).
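Both suggestions (asking for text instead of bytes, and zipping the lists) can be tried out with the standard library alone. The sketch below swaps lxml for xml.etree.ElementTree, which is an assumption made so it runs without extra packages; ElementTree has no pretty_print argument, so ET.indent (Python 3.9+) stands in for it, and the lists are shortened.

```python
import xml.etree.ElementTree as ET

titles = ['title1', 'title2']
summaries = ['summary1', 'summary2']
sources = ['source1', 'source2']
contents = ['content1', 'content2']

root = ET.Element('root')
# zip walks the four lists in parallel, so no index variable is needed
for title, summary, source, content in zip(titles, summaries, sources, contents):
    article = ET.SubElement(root, 'Article')
    ET.SubElement(article, 'Title').text = title
    ET.SubElement(article, 'Summary').text = summary
    ET.SubElement(article, 'Source').text = source
    ET.SubElement(article, 'Content').text = content

if hasattr(ET, 'indent'):   # ET.indent only exists on Python 3.9 and later
    ET.indent(root)

as_bytes = ET.tostring(root)                     # bytes: print() would show b'...'
as_text = ET.tostring(root, encoding='unicode')  # str: print() shows real line breaks
print(as_text)
```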
I'm trying to get Python to parse this XML code from an HTML page:
<weather>
  <loc mobiurl="http://foreca.mobi/?lon=-8.6110&lat=41.1496&source=navi/" url="http://foreca.com/?lon=-8.6110&lat=41.1496&source=navi/">
    <obs station="Porto / Pedras Rubras" dist="11 km NW" dt="2013-03-06 17:00:00" t="14" tf="14" s="d320" wn="S" ws="8" p="997" rh="94" v="5000"/>
    <fc dt="2013-03-07" tx="16" tn="11" s="d220"/>
    <fc dt="2013-03-08" tx="15" tn="10" s="d220"/>
    <fc dt="2013-03-09" tx="15" tn="10" s="d220"/>
  </loc>
</weather>
I want to get the information in the dt, s, tx and tn fields, but I don't know how to do it with XML functions. I tried to read the HTML file and then create an array to store the content of the fields mentioned above, but I can't get it working.
Is there any easy way to get the data with python?
Some HTML scraping is easily done with pyparsing, using that library's makeHTMLTags method (makeHTMLTags returns a pair of expressions, for opening and closing tags, but in your example, only the opening tag is needed):
from pyparsing import makeHTMLTags
fcTag = makeHTMLTags("fc")[0]
tagAttrs = 'dt s tx tn'.split()
for match in fcTag.searchString(htmltext):
    print(' '.join("%s:%s" % (attr, match[attr]) for attr in tagAttrs))
Prints:
dt:2013-03-07 s:d220 tx:16 tn:11
dt:2013-03-08 s:d220 tx:15 tn:10
dt:2013-03-09 s:d220 tx:15 tn:10
This makes it easy to incorporate this fragment parser with pyparsing's other features, such as run-time parse actions, semantic checking, etc.
EDIT
If you want all the dt's, s's, etc. in their own respective lists (in Python, we call them "lists", not "vectors"), do this:
dtArray = []
sArray = []
txArray = []
tnArray = []
for match in fcTag.searchString(htmltext):
    dtArray.append(match.dt)
    sArray.append(match.s)
    txArray.append(match.tx)
    tnArray.append(match.tn)
    print(' '.join("%s:%s" % (attr, match[attr]) for attr in tagAttrs))
I've seen code like this before, and it is a poor data structure pattern. You access the value of the i'th entry of the original table by getting dtArray[i], sArray[i], etc.
Please consider instead one of the several structured types offered by Python. You have several to choose from:
A. Use dicts.
fcArray = []
for match in fcTag.searchString(htmltext):
    fcArray.append(dict((attr, match[attr]) for attr in tagAttrs))
Now to get at the i'th entry, just get fc = fcArray[i], and access the fc['dt'], fc['s'] etc. values from that dict.
B. Use namedtuples.
from collections import namedtuple
FCData = namedtuple("FCData", tagAttrs)
fcArray = []
for match in fcTag.searchString(htmltext):
    fcArray.append(FCData(*(match[attr] for attr in tagAttrs)))
You again use fc = fcArray[i] to get the i'th entry, but now you access the values using fc.dt, fc.s, etc. I find this form to be cleaner-looking than the dict form, but there are some restrictions. All the tag names have to be legal Python identifiers, so if you have a tag "rise/run", then you can't use a namedtuple. Also, namedtuples are immutable - you can't take an existing FCData fc and assign into its dt field with fc.dt = "new datetime value". dicts on the other hand would allow this.
C. Use objects. The simplest is a "bag"-type object that creates empty object instances, which you then add attributes to through simple assignment or setattr calls:
class FCData(object): pass
fcArray = []
for match in fcTag.searchString(htmltext):
    fc = FCData()
    for attr in tagAttrs:
        setattr(fc, attr, match[attr])
    fcArray.append(fc)
You get the i'th entry with fc = fcArray[i], and like the namedtuple, you get the attributes using fc.dt and so on. But you can also modify the attributes if need be, and the assignment fc.dt = "new datetime value" would work.
D. Just use the objects created by pyparsing's searchString method.
fcArray = fcTag.searchString(htmltext)
pyparsing returns ParseResults, and it combines the behavior of both dicts and namedtuples. Just like before you access the i'th entry with fc = fcArray[i]. You can read the dt attribute with fc.dt or fc['dt']. You can read fc.dt, but you can't assign to it, just like the namedtuple. You can assign to fc['dt'], just like the dict.
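The trade-off between options A and B can be seen without pyparsing installed. In the sketch below, rows is a hypothetical stand-in for the parser matches, filled with the three forecast lines printed earlier; FCRecord is an illustrative name.

```python
from collections import namedtuple

# Hypothetical stand-in for the parser matches: one tuple per <fc> tag
tag_attrs = ['dt', 's', 'tx', 'tn']
rows = [
    ('2013-03-07', 'd220', '16', '11'),
    ('2013-03-08', 'd220', '15', '10'),
    ('2013-03-09', 'd220', '15', '10'),
]

# Option A: one dict per row, fields can be reassigned
as_dicts = [dict(zip(tag_attrs, row)) for row in rows]
as_dicts[0]['tx'] = '17'

# Option B: one namedtuple per row, attribute access but immutable
FCRecord = namedtuple('FCRecord', tag_attrs)
as_tuples = [FCRecord(*row) for row in rows]

try:
    as_tuples[0].tx = '17'
    assigned = True
except AttributeError:
    assigned = False          # namedtuple fields cannot be assigned

# _replace returns a modified copy and leaves the original untouched
updated = as_tuples[0]._replace(tx='17')
print(as_dicts[0], as_tuples[0], updated)
```

Either way each record stays together as one object, rather than being spread across four parallel lists.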
If you can extract just the weather tags easily, you can use the xml.etree.ElementTree API which comes with Python.
import xml.etree.ElementTree as ET
tree = ET.fromstring(weatherdata)
for fcelem in tree.findall('.//fc'):
    print(fcelem.attrib['tx'], fcelem.attrib['tn'])
If you want to extract it from the HTML document, then it depends on how well-formed the HTML is. If it is a XHTML document, the ElementTree API can handle it fine.
Otherwise, you'll need to switch to a HTML parser instead. You could install the lxml library; that library supports the same ElementTree API but has a dedicated HTML parser included.
You could also use BeautifulSoup for an alternate HTML API. In fact, lxml and BeautifulSoup can work in concert giving you a choice of APIs for your tasks; use whichever is easier for you.
Both lxml and BeautifulSoup are external libraries.
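As a runnable sketch of the ElementTree route, the example below collects the four requested attributes from each <fc> element. It assumes the weather block has already been cut out of the page as a string, and that the ampersands in the URL are escaped as &amp;amp; because ElementTree rejects a raw & inside an attribute value; the sample is shortened to one URL attribute.

```python
import xml.etree.ElementTree as ET

# Shortened sample; note the escaped ampersands in the url attribute
weatherdata = """<weather>
  <loc url="http://foreca.com/?lon=-8.6110&amp;lat=41.1496&amp;source=navi/">
    <fc dt="2013-03-07" tx="16" tn="11" s="d220"/>
    <fc dt="2013-03-08" tx="15" tn="10" s="d220"/>
    <fc dt="2013-03-09" tx="15" tn="10" s="d220"/>
  </loc>
</weather>"""

tree = ET.fromstring(weatherdata)

wanted = ('dt', 's', 'tx', 'tn')
# One dict per <fc> element, holding only the requested attributes
forecasts = [{name: fc.get(name) for name in wanted} for fc in tree.findall('.//fc')]

for forecast in forecasts:
    print(forecast)
```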
I am new to Python. I want to create an XML tree with one parent, several children and several subchildren. I've stored the child tags in the list 'TAG' and the subchild tags in the list 'SUB'.
I have come up with the following code, but I am not able to achieve the desired result!
def make_xml(tag,sub):
    '''
    Takes in two lists and Returns a XML object.
    The first list has to contain all the tag objects
    The Second list has to contain child data's
    '''
    from xml.etree.ElementTree import Element, SubElement, Comment, tostring
    top = Element("Grand Parent")
    comment = Comment('This is the ccode parse tree')
    top.append(comment)
    i=0
    try:
        for ee in tag:
            child = SubElement(top, 'Tag'+str(i))
            child.text = str(tag[i]).encode('utf-8',errors = 'ignore')
            subchild = SubElement(child, 'Content'+str(i))
            subchild.text = str(sub[i]).encode('utf-8',errors = 'ignore')
            i = i+1;
    except UnicodeDecodeError:
        print 'oops'
    return top
EDIT:
I have two lists like these:
TAG = ['HAPPY','GO','LUCKY']
SUB = ['ED','EDD','EDDY']
What i want is:
<G_parent>
  <parent1>
    HAPPY
    <child1>
      ED
    </child1>
  </parent1>
  <parent2>
    GO
    <child2>
      EDD
    </child2>
  </parent2>
  <parent3>
    LUCKY
    <child3>
      EDDY
    </child3>
  </parent3>
</G_parent>
The actual lists have many more entries than this. I want to achieve this using a for loop or similar.
EDIT:
Oops, my bad!
The code works as expected when I pass the example lists. But in my real application the list is long and contains text fragments extracted from a PDF file. Somewhere in that text I get a UnicodeDecodeError (reason: the text extracted from the PDF is messy; proof: 'oops' gets printed once) and the returned XML object is incomplete.
So I need to figure out a way to process my complete list even when a UnicodeDecodeError occurs. Is that possible? I'm using .decode('utf-8',errors='ignore'), but even then the parsing does not complete!
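One likely reason the result is incomplete is that the try block wraps the whole loop, so the first error ends the loop. A possible restructuring, sketched here for Python 3 rather than the Python 2 of the code above, handles errors per item instead. The to_text helper, the G_parent/parentN/childN names (taken from the desired output) and the bytes item in TAG are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def to_text(value):
    """Return value as str; bytes are decoded as UTF-8, dropping undecodable bytes."""
    if isinstance(value, bytes):
        return value.decode('utf-8', errors='ignore')
    return str(value)

def make_xml(tags, subs):
    top = ET.Element('G_parent')
    for i, (tag, sub) in enumerate(zip(tags, subs), start=1):
        try:
            parent_text, child_text = to_text(tag), to_text(sub)
        except UnicodeError:
            # Defensive only: with errors='ignore' decoding should not raise.
            # If it does, skip this pair and keep going with the rest.
            continue
        parent = ET.SubElement(top, 'parent%d' % i)
        parent.text = parent_text
        child = ET.SubElement(parent, 'child%d' % i)
        child.text = child_text
    return top

# The second tag contains an invalid UTF-8 byte to mimic messy PDF text
TAG = ['HAPPY', b'G\xffO', 'LUCKY']
SUB = ['ED', 'EDD', 'EDDY']

xml_text = ET.tostring(make_xml(TAG, SUB), encoding='unicode')
print(xml_text)
```

Because the try/except sits inside the loop, a bad fragment costs at most one pair of elements instead of everything after it.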