Python XML Parsing can't find children of children - python

I'm trying to parse XML returned as a string from an HTTP GET request. I need to get a specific link inside the XML structure, but for some reason I can't reach it.
I tried **enumerating** the XML and printing child.attrib, but the link I need is not displayed.
I need to find an element called Vm that is a child of a child, and then get the .attrib of that element.
So I did some more research and tried finding the element I needed by node name.
The XML structure is:
<vapp>
<link></link>
<othertags></othertags>
<Children>
<Vm href='link I need'>
<other tag options>
</other tag options>
</Vm>
</Children>
</vapp>
Python code:
for i, child in enumerate(vappXML):
    if 'href' in child.attrib and 'name' in child.attrib:
        vapp_url = child.attrib['href']
        r=requests.get(vapp_url, headers = new_headers)
        vmlinkXML = fromstring(r.content)
        for VM in vmlinkXML.findall('Children'):
            print VM
        for i, child in enumerate(vmlinkXML):
            if 'vm-' in child:
                print child.attrib
            if 'href' in child.attrib:
                vm_url = child.attrib['href']
                if 'vm-' in vm_url:
                    print vm_url
I can't get to the URL no matter what I try. I only get the top-level children of the vApp; the parser never reaches the tag, or rather my code never goes further than the first level of children of the vApp, and I don't know why.
I guess I wasn't clear. I'm parsing vCloud Director REST API XML that is returned as a string. The first level is the vApp link, which is essentially a container for VMs. I need to get the VM link under each vApp. The first loop selects the vApp links and queries them.
Once it does a GET request on the vApp link, it gets the next level of XML data, which is the structure I put above. So it passes the initial XML statement and returns vApp information.
Even when I print out every child.attrib from vmlinkXML, the link with vm doesn't get printed. However, if I just print r.content, the link is there. It's almost like the XML parser doesn't see the tag.
I'm using Python's xml.etree:
from lxml import etree
from xml.etree.ElementTree import XML, fromstring, tostring
So to be clear, the sequence is:
To get the vApp links, I query /api/admin/extension/vapps/query.
The returned information contains links to each vApp in vCloud.
Then I call the vApp link
https://vcloud.test.co/api/vApp/vapp-3b4980e7-c5ab-4462-9cfe-abc6292c15748
and it returns a structure similar to this:
<vapp>
<link></link>
<othertags></othertags>
<Children>
<Vm href='link I need'>
<other tag options>
</other tag options>
</Vm>
</Children>
</vapp>
The tag contains the next level of link I need to query. However, the XML parser with child.attrib never outputs anything under the tag.

Solved:
from xml.dom.minidom import parseString  # needed for parseString below
r = requests.get(url + '/api/admin/extension/vapps/query', headers=new_headers)
vappXML = fromstring(r.content)
for i, child in enumerate(vappXML):
    if 'href' in child.attrib and 'name' in child.attrib:
        vapp_url = child.attrib['href']
        r = requests.get(vapp_url, headers=new_headers)
        # getElementsByTagName searches all descendants, not only direct children
        DOMTree = parseString(r.content)
        vmElements = DOMTree.documentElement
        VMS = vmElements.getElementsByTagName("Vm")
        for vm in VMS:
            if vm.hasAttribute("href"):
                vm_link = vm.getAttribute("href")
                print vm_link
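For comparison, here is a minimal sketch of the same nested lookup done with the standard library's ElementTree instead of minidom. The sample XML, the example URLs, and the vm_links helper are made up for illustration; they are not the real vCloud response. Looping over an element only yields its direct children, which matches the symptom described above, whereas iter() walks the whole subtree. If the real response declares an XML namespace (an assumption worth checking against r.content), the tag name would also need to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the vApp response described in the question.
SAMPLE = """
<vapp>
  <link/>
  <Children>
    <Vm href="https://vcloud.example/api/vApp/vm-1">
      <other/>
    </Vm>
    <Vm href="https://vcloud.example/api/vApp/vm-2"/>
  </Children>
</vapp>
"""


def vm_links(xml_text):
    """Return the href of every Vm element at any depth."""
    root = ET.fromstring(xml_text)
    # Iterating over `root` directly yields only its direct children
    # (link and Children here); iter() visits every descendant.
    return [vm.get("href") for vm in root.iter("Vm") if vm.get("href")]


print(vm_links(SAMPLE))
```

root.findall('Children/Vm') would be an equivalent choice when the nesting depth is known in advance.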

Related

"None" result at Parsing XML with ElementTree

I've tried everything to get the XML content, but all I get back is None. Could anybody help me?
The code I'm trying is:
import xml.etree.cElementTree as ET
parsedXML = ET.parse("C:\\Users\\denis\\Documents\\Projetos\\NFe\\Arquivos\\33180601279711000100550020001554261733208443-nfeo.xml")
for node in parsedXML.getroot():
    email = node.find('cNF')
    phone = node.find('natOp')
    street = node.find('nNF')
    print(email)
Part of the XML (the full content is bigger than this) is right below:
<?xml version="1.0" encoding="ISO-8859-1"?>
<nfeProc xmlns="http://www.portalfiscal.inf.br/nfe" versao="3.10">
<NFe xmlns="http://www.portalfiscal.inf.br/nfe">
<infNFe versao="3.10" Id="NFe33180601279711000100550020001554261733208443">
<ide>
<cUF>33</cUF>
<cNF>73320844</cNF>
<natOp>VENDA DE PRODUCAO DO ESTABELECIMENTO</natOp>
<indPag>1</indPag>
<mod>55</mod>
<serie>2</serie>
<nNF>155426</nNF>
<dhEmi>2018-06-25T16:06:33-03:00</dhEmi>
<dhSaiEnt>2018-06-25T16:06:08-03:00</dhSaiEnt>
<tpNF>1</tpNF>
<idDest>2</idDest>
<cMunFG>3304557</cMunFG>
<tpImp>2</tpImp>
<tpEmis>1</tpEmis>
<cDV>3</cDV>
<tpAmb>1</tpAmb>
<finNFe>1</finNFe>
<indFinal>1</indFinal>
<indPres>9</indPres>
<procEmi>0</procEmi>
<verProc>NeoGrid NFe 1.63.4</verProc>
</ide>
<emit>
I appreciate your help!
You are using an XML document with namespaces, so you need to provide the namespace in your call, as shown in this answer.
Here, we use
namespaces = {'n': 'http://www.portalfiscal.inf.br/nfe'}
root = parsedXML.getroot()
root.find('n:NFe', namespaces)
to return the element, while root.find('NFe') returns None.
Also note that find and findall with a plain tag name only search the direct children, not nested children (cf. documentation), which means that you will have to iterate over the children (see e.g. here for an example) or use a path such as './/n:cNF' to search all descendants.
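Putting the namespace map and a descendant search together, a minimal sketch could look like the following. The XML is a shortened copy of the excerpt from the question, and the field helper is a hypothetical name introduced only for this example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the document from the question.
SAMPLE = """<nfeProc xmlns="http://www.portalfiscal.inf.br/nfe" versao="3.10">
  <NFe>
    <infNFe versao="3.10">
      <ide>
        <cNF>73320844</cNF>
        <natOp>VENDA DE PRODUCAO DO ESTABELECIMENTO</natOp>
        <nNF>155426</nNF>
      </ide>
    </infNFe>
  </NFe>
</nfeProc>"""

# The prefix "n" is arbitrary; only the URI has to match the document.
NS = {"n": "http://www.portalfiscal.inf.br/nfe"}
root = ET.fromstring(SAMPLE)


def field(name):
    """Return the text of the first descendant called `name`, or None."""
    node = root.find(".//n:" + name, NS)
    return node.text if node is not None else None


print(root.find("cNF"))  # None: no namespace and not a direct child
print(field("cNF"), field("natOp"), field("nNF"))
```

The './/' prefix makes the search cover all descendants, so there is no need to loop over the intermediate NFe, infNFe and ide levels by hand.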

Parse element's tail with requests-html

I want to parse an HTML document like this with requests-html 0.9.0:
from requests_html import HTML
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = html.find('.data', first=True)
print(data.html)
# <span class="data">important data</span> and some rubbish
print(data.text)
# important data and some rubbish
I need to distinguish the text inside the tag (enclosed by it) from the tag's tail (the text that follows the element up to the next tag). This is the behaviour I initially expected:
data.text == 'important data'
data.tail == ' and some rubbish'
But tail is not defined for Elements. Since requests-html provides access to inner lxml objects, we can try to get it from lxml.etree.Element.tail:
from lxml.etree import tostring
print(tostring(data.lxml))
# b'<html><span class="data">important data</span></html>'
print(data.lxml.tail is None)
# True
There's no tail in lxml representation! The tag with its inner text is OK, but the tail seems to be stripped away. How do I extract 'and some rubbish'?
Edit: I discovered that full_text provides the inner text only (so much for “full”). This enables a dirty hack of subtracting full_text from text, although I'm not positive it will work if there are any links.
print(data.full_text)
# important data
I'm not sure I've understood your problem, but if you just want to get 'and some rubbish' you can use the code below:
from requests_html import HTML
from lxml.html import fromstring
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = fromstring(html.html)
# or without using requests_html.HTML: data = fromstring('<span><span class="data">important data</span> and some rubbish</span>')
print(data.xpath('//span[span[@class="data"]]/text()')[-1]) # " and some rubbish"
NOTE that data = html.find('.data', first=True) returns the <span class="data">important data</span> node, which doesn't contain " and some rubbish" - that text is a text child node of the parent span!
The tail property exists on objects of type 'lxml.html.HtmlElement'.
I think what you are asking for is very easy to implement.
Here is a very simple example using requests_html and lxml:
from requests_html import HTML
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = html.find('span')
print (data[0].text) # important data and some rubbish
print (data[-1].text) # important data
print (data[-1].element.tail) # and some rubbish
The element attribute points to the 'lxml.html.HtmlElement' object.
Hope this helps.
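To make the text/tail distinction concrete, here is a small sketch using the standard library's ElementTree rather than requests-html or lxml, so that it runs without extra packages. lxml elements expose the same text and tail attributes, but this is only an illustration of the model, not of what requests-html does internally.

```python
import xml.etree.ElementTree as ET

# Same markup as in the question, parsed as XML.
outer = ET.fromstring(
    '<span><span class="data">important data</span> and some rubbish</span>'
)
inner = outer.find("span[@class='data']")

# text is what the element encloses; tail is what follows its closing tag.
print(repr(inner.text))            # 'important data'
print(repr(inner.tail))            # ' and some rubbish'

# The outer span has no text of its own before its first child.
print(repr(outer.text))            # None

# itertext() stitches text and tails back together in document order.
print("".join(outer.itertext()))   # important data and some rubbish
```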

Remove element in a XML file with Python

I'm a newbie with Python and I'd like to remove the element openingHours and the child elements from the XML.
I have this input
<Root>
<stations>
<station id= "1">
<name>whatever</name>
<openingHours>
<openingHour>
<entrance>main</entrance>
<timeInterval>
<from>05:30</from>
<to>21:30</to>
</timeInterval>
<openingHour/>
<openingHours>
<station/>
<station id= "2">
<name>foo</name>
<openingHours>
<openingHour>
<entrance>main</entrance>
<timeInterval>
<from>06:30</from>
<to>21:30</to>
</timeInterval>
<openingHour/>
<openingHours>
<station/>
<stations/>
<Root/>
I'd like this output
<Root>
<stations>
<station id= "1">
<name>whatever</name>
<station/>
<station id= "2">
<name>foo</name>
<station/>
<stations/>
<Root/>
So far I've tried this, taken from another thread (How to remove elements from XML using Python):
from lxml import etree
doc=etree.parse('stations.xml')
for elem in doc.xpath('//*[attribute::openingHour]'):
    parent = elem.getparent()
    parent.remove(elem)
print(etree.tostring(doc))
However, it doesn't seem to be working.
Thanks
I took your code for a spin, but at first the parser rejected your XML: the / in a closing tag has to come at the beginning (like </...>) instead of at the end (<.../>).
That aside, the reason your code isn't working is that the XPath expression looks for an attribute named openingHour, while in reality you want elements called openingHours. I got it to work by changing the expression to //openingHours, which makes the entire code:
from lxml import etree
doc=etree.parse('stations.xml')
for elem in doc.xpath('//openingHours'):
    parent = elem.getparent()
    parent.remove(elem)
print(etree.tostring(doc))
You want to remove the tags <openingHours> and not some attribute with name openingHour:
from lxml import etree
doc = etree.parse('stations.xml')
for elem in doc.findall('.//openingHours'):
    parent = elem.getparent()
    parent.remove(elem)
print(etree.tostring(doc))
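getparent() is specific to lxml. If only the standard library is available, a sketch along these lines may work instead: it selects the parents with the './/tag/..' path and removes the matching children from each. The sample XML is a corrected, shortened version of the input from the question, and strip_elements is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Corrected (well-formed) and shortened version of the question's input.
SAMPLE = """<Root>
  <stations>
    <station id="1">
      <name>whatever</name>
      <openingHours>
        <openingHour><entrance>main</entrance></openingHour>
      </openingHours>
    </station>
    <station id="2">
      <name>foo</name>
      <openingHours/>
    </station>
  </stations>
</Root>"""


def strip_elements(root, tag):
    """Remove every `tag` element (and its subtree) found below `root`."""
    # './/tag/..' selects each element that has a direct child named `tag`.
    for parent in root.findall(".//{}/..".format(tag)):
        # Copy the matches into a list before removing anything.
        for child in list(parent.findall(tag)):
            parent.remove(child)
    return root


root = strip_elements(ET.fromstring(SAMPLE), "openingHours")
print(ET.tostring(root, encoding="unicode"))
```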

XML parsing specific values - Python

I've been attempting to parse a list of xml files. I'd like to print specific values such as the userName value.
<?xml version="1.0" encoding="utf-8"?>
<Drives clsid="{8FDDCC1A-0C3C-43cd-A6B4-71A6DF20DA8C}"
disabled="1">
<Drive clsid="{935D1B74-9CB8-4e3c-9914-7DD559B7A417}"
name="S:"
status="S:"
image="2"
changed="2007-07-06 20:57:37"
uid="{4DA4A7E3-F1D8-4FB1-874F-D2F7D16F7065}">
<Properties action="U"
thisDrive="NOCHANGE"
allDrives="NOCHANGE"
userName=""
cpassword=""
path="\\scratch"
label="SCRATCH"
persistent="1"
useLetter="1"
letter="S"/>
</Drive>
</Drives>
My script works fine at collecting the list of XML files. The function below is meant to print the relevant values, following the approach suggested in this post. However, I'm clearly doing something wrong, as I get an error saying that the elm object has no attribute text. Any help would be appreciated.
Current Code
from lxml import etree as ET
def read_files(files):
    for fi in files:
        doc = ET.parse(fi)
        elm = doc.find('userName')
        print elm.text
doc.find looks for a tag with the given name. You are looking for an attribute with the given name.
elm.text is giving you an error because doc.find doesn't find any tags, so it returns None, which has no text property.
Read the lxml.etree docs some more, and then try something like this:
doc = ET.parse(fi)
root = doc.getroot()
prop = root.find(".//Properties") # finds the first <Properties> tag anywhere
elm = prop.attrib['userName']
userName is an attribute, not an element. Attributes don't have text nodes attached to them at all.
for el in doc.xpath('//*[@userName]'):
    print el.attrib['userName']
You can look the element up by its tag name and then read its attribute (userName is an attribute of Properties). Note that getElementsByTagName and attributes belong to the minidom API, so the file has to be parsed with xml.dom.minidom rather than lxml:
from xml.dom import minidom
def read_files(files):
    for fi in files:
        doc = minidom.parse(fi)
        props = doc.getElementsByTagName('Properties')
        elm = props[0].attributes['userName']
        print elm.value
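As a self-contained illustration of reading attributes rather than element text, here is a sketch using the standard library's ElementTree. The XML is a trimmed-down version of the file from the question, and property_values is a hypothetical helper. Using get() instead of attrib[...] returns None for a missing attribute instead of raising KeyError.

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of the Drives file from the question.
SAMPLE = """<Drives disabled="1">
  <Drive name="S:" status="S:">
    <Properties action="U" userName="" label="SCRATCH" letter="S"/>
  </Drive>
</Drives>"""


def property_values(xml_text, attr):
    """Return the value of `attr` on every Properties element."""
    root = ET.fromstring(xml_text)
    # iter() finds Properties at any depth; get() reads an attribute.
    return [props.get(attr) for props in root.iter("Properties")]


print(property_values(SAMPLE, "userName"))  # [''] (present but empty)
print(property_values(SAMPLE, "label"))     # ['SCRATCH']
print(property_values(SAMPLE, "missing"))   # [None]
```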

Unable to Access Child Node in Parsing XML with Python Language

I am very new to the python scripting language and am recently working on a parser which parses a web-based xml file.
I am able to retrieve all but one of the elements using minidom in python with no issues however I have one node which I am having trouble with. The last node that I require from the XML file is 'url' within the 'image' tag and this can be found within the following xml file example:
<events>
<event id="abcde01">
<title> Name of event </title>
<url> The URL of the Event <- the url tag I do not need </url>
<image>
<url> THE URL I DO NEED </url>
</image>
</event>
Below I have copied brief sections of my code which I feel may be relevant. I would really appreciate any help with retrieving this last image url node. I will also include what I have tried and the error I received when I ran this code in GAE. I am using Python 2.7, and I should probably also point out that I am saving the values in an array (for later input to a database).
class XMLParser(webapp2.RequestHandler):
    def get(self):
        base_url = 'http://api.eventful.com/rest/events/search?location=Dublin&date=Today'
        #downloads data from xml file:
        response = urllib.urlopen(base_url)
        #converts data to string
        data = response.read()
        unicode_data = data.decode('utf-8')
        data = unicode_data.encode('ascii','ignore')
        #closes file
        response.close()
        #parses xml downloaded
        dom = mdom.parseString(data)
        node = dom.documentElement #needed for declaration of variable
        #print out all event names (titles) found in the eventful xml
        event_main = dom.getElementsByTagName('event')
        #URLs list parsing - MY ATTEMPT -
        urls_list = []
        for im in event_main:
            image_url = im.getElementsByTagName("image")[0].childNodes[0]
            urls_list.append(image_url)
The error I receive is the following. Any help is much appreciated, Karen
image_url = im.getElementsByTagName("image")[0].childNodes[0]
IndexError: list index out of range
First of all, do not re-encode the content. There is no need to do so; XML parsers are perfectly capable of handling encoded content.
Next, I'd use the ElementTree API for a task like this:
from xml.etree import ElementTree as ET
response = urllib.urlopen(base_url)
tree = ET.parse(response)
urls_list = []
for event in tree.findall('.//event[image]'):
    # find the text content of the first <image><url> tag combination:
    image_url = event.find('.//image/url')
    if image_url is not None:
        urls_list.append(image_url.text)
This only considers event elements that have a direct image child element.
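The same idea can be tried offline on a literal string. In this sketch the XML, the example.invalid URLs, and the image_urls helper are all made up for illustration; the second event has no image element, which is the situation that makes an unguarded [0] index raise IndexError.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape shown in the question.
SAMPLE = """<events>
  <event id="abcde01">
    <title>Name of event</title>
    <url>http://example.invalid/event/1</url>
    <image>
      <url>http://example.invalid/image/1.png</url>
    </image>
  </event>
  <event id="abcde02">
    <title>Event without image</title>
    <url>http://example.invalid/event/2</url>
  </event>
</events>"""


def image_urls(xml_text):
    """Collect the <image><url> text of every event that has an image."""
    root = ET.fromstring(xml_text)
    urls = []
    # The [image] predicate skips events without an image child.
    for event in root.findall(".//event[image]"):
        # 'image/url' cannot match the event's own <url> child.
        node = event.find("image/url")
        if node is not None and node.text:
            urls.append(node.text.strip())
    return urls


print(image_urls(SAMPLE))
```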
