Unable to Access Child Node When Parsing XML with Python - python

I am very new to Python and have recently been working on a parser for a web-based XML file.
I am able to retrieve all but one of the elements using minidom in Python with no issues; however, there is one node I am having trouble with. The last node I require from the XML file is the 'url' within the 'image' tag, which can be seen in the following XML example:
<events>
  <event id="abcde01">
    <title> Name of event </title>
    <url> The URL of the Event <- the url tag I do not need </url>
    <image>
      <url> THE URL I DO NEED </url>
    </image>
  </event>
</events>
Below I have copied the sections of my code which I feel are relevant. I would really appreciate any help retrieving this last image url node. I will also include what I have tried and the error I received when I ran this code in GAE. The Python version I am using is 2.7, and I should probably also point out that I am saving the values in a list (for later input to a database).
class XMLParser(webapp2.RequestHandler):
    def get(self):
        base_url = 'http://api.eventful.com/rest/events/search?location=Dublin&date=Today'
        # downloads data from xml file:
        response = urllib.urlopen(base_url)
        # converts data to string
        data = response.read()
        unicode_data = data.decode('utf-8')
        data = unicode_data.encode('ascii', 'ignore')
        # closes file
        response.close()
        # parses xml downloaded
        dom = mdom.parseString(data)
        node = dom.documentElement  # needed for declaration of variable
        # print out all event names (titles) found in the eventful xml
        event_main = dom.getElementsByTagName('event')
        # URLs list parsing - MY ATTEMPT -
        urls_list = []
        for im in event_main:
            image_url = im.getElementsByTagName("image")[0].childNodes[0]
            urls_list.append(image_url)
The error I receive is the following; any help is much appreciated. Karen
image_url = im.getElementsByTagName("image")[0].childNodes[0]
IndexError: list index out of range

First of all, do not re-encode the content. There is no need to do so; XML parsers are perfectly capable of handling encoded content.
Next, I'd use the ElementTree API for a task like this:
from xml.etree import ElementTree as ET

response = urllib.urlopen(base_url)
tree = ET.parse(response)

urls_list = []
for event in tree.findall('.//event[image]'):
    # find the text content of the first <image><url> tag combination:
    image_url = event.find('.//image/url')
    if image_url is not None:
        urls_list.append(image_url.text)
This only considers event elements that have a direct image child element.
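If you would rather stick with minidom, the IndexError can also be avoided by guarding against events that have no image element. A minimal sketch along the lines of the original loop (assuming the same event_main list as above):

urls_list = []
for im in event_main:
    images = im.getElementsByTagName("image")
    if not images:
        # some events carry no <image> element at all, which is what raised the IndexError
        continue
    urls = images[0].getElementsByTagName("url")
    if urls and urls[0].firstChild is not None:
        # the first child of <url> is its text node; .data holds the URL string
        urls_list.append(urls[0].firstChild.data.strip())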

Related

Ahref html issue, html tags not showing

I don't seem to be having much luck solving this issue. I am pulling data from an XML URL that looks like:
<?xml version="1.0" encoding="utf-8"?>
<task>
<tasks>
<taskId>46</taskId>
<taskUserId>4</taskUserId>
<taskName>test</taskName>
<taskMode>2</taskMode>
<taskSite>thetest.org</taskSite>
<taskUser>NULL</taskUser>
<taskPass>NULL</taskPass>
<taskEmail>NULL</taskEmail>
<taskUrl>https://www.thetest.com/</taskUrl>
<taskTitle>test</taskTitle>
<taskBody>This is a test using html tags.</taskBody>
<taskCredentials>...</taskProxy>
</tasks>
</task>
This part is where I'm having issues:
<taskBody>This is a test using html tags.</taskBody>
I pull the data using BeautifulSoup like:
# beautifulsoup setup
soup = BeautifulSoup(projects.text, 'xml')
xml_task_id = soup.find('taskId')
xml_task_user_id = soup.find('taskUserId')
xml_task_name = soup.find('taskName')
xml_mode = soup.find('taskMode')
xml_site_name = soup.find('taskSite')
xml_username = soup.find('taskUser')
xml_password = soup.find('taskPass')
xml_email = soup.find('taskEmail')
xml_url = soup.find('taskUrl')
xml_content_title = soup.find('taskTitle')
xml_content_body = soup.find('taskBody')
xml_credentials = soup.find('taskCredentials')
xml_proxy = soup.find('taskProxy')
print(xml_content_body.get_text())
When I print out this part, it prints like: This is a test using html tags.
Instead of showing the a href tag in full like: This is a test
I literally need the full string printed as is, but it keeps executing the html code instead of printing the string.
Any help would be appreciated.
Python doesn't "execute" HTML code. I'm guessing you're viewing it in a web browser, and that's interpreting the <a> tags just like it's designed to.
Use the html.escape method to turn all tags into escape sequences (with &gt; and &lt; and the like), which stops the browser from interpreting them.
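A minimal sketch of that approach, assuming Python 3 (where escape lives in the standard html module) and the xml_content_body tag from the code above:

import html

raw = xml_content_body.get_text()
# html.escape turns <, > and & into &lt;, &gt; and &amp;, so a browser
# displays the tags as literal text instead of rendering them
print(html.escape(raw))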
For Python - Escaping HTML tags: https://wiki.python.org/moin/EscapingHtml
For PHP - Escaping HTML tags: https://www.w3schools.com/php/showphp.asp?filename=demo_func_string_strip_tags2

How to parse out xml from noisy file using python

I have a file which contains a bunch of logging information, including XML. I'd like to parse out the XML portion into a string object so I can then run some XPaths on it to verify the existence of certain information on the 'data' element.
File to parse:
Requesting event notifications...
Receiving command objects...
<?xml version="1.0" encoding="UTF-8"?><Root xmlns="http://schemas.com/service" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><data id="123" interface="2017.1" implementation="2016.122-SNAPSHOT" Version="2016.1.2700-SNAPSHOT"></data></Root>
All information has been collected
Command execution successful...
Python:
import re

with open('./output.out', 'r') as outFile:
    data = outFile.read().replace('\n', '')
    regex = re.escape("<.*?>.*?<\/Root>");
    p = re.compile(regex)
    m = p.match(data)
    if m:
        print(m.group())
    else:
        print('No match')
Output:
No match
What am I doing wrong? How can I accomplish my goal? Any help would be much appreciated.
Thou shalt never use regular expressions for parsing XML/HTML; there is BeautifulSoup for this daunting task. (Incidentally, in the code above re.escape turns every regex metacharacter into a literal, and p.match only matches at the start of the string, so that pattern can never succeed.)
import bs4
soup = bs4.BeautifulSoup(open("output.out").read(), "lxml")
roots = soup.findAll('root')
#[<root xmlns="http://schemas.com/service"
# xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
# <data id="123" implementation="2016.122-SNAPSHOT" interface="2017.1"
# version="2016.1.2700-SNAPSHOT"></data></root>]
roots[0] is the parsed XML as a BeautifulSoup Tag object; you can do anything you want with it.
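For example, a small hypothetical usage sketch against the document shown above, reading attributes off the data element:

data = roots[0].find('data')
print(data['id'])              # 123
print(data['implementation'])  # 2016.122-SNAPSHOT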

Python XML Parsing can't find children of children

I'm trying to parse XML returned as a string from an HTTP GET request. I need to get a specific link inside the XML structure, but for some reason I can't get to the link I need.
I tried enumerating the XML and printing child.attrib, but the link I need is not displayed.
I need to find an element that is a child of a child; the element is called Vm, and I need to get the .attrib of that element.
So I did some more research and tried finding the XML I needed by node name.
The XML structure is:
<vapp>
  <link></link>
  <othertags></othertags>
  <Children>
    <Vm href='link I need'>
      <other tag options>
      </other tag options>
    </Vm>
  </Children>
</vapp>
Python code:
for i, child in enumerate(vappXML):
    if 'href' in child.attrib and 'name' in child.attrib:
        vapp_url = child.attrib['href']
        r = requests.get(vapp_url, headers=new_headers)
        vmlinkXML = fromstring(r.content)
        for VM in vmlinkXML.findall('Children'):
            print VM
        for i, child in enumerate(vmlinkXML):
            if 'vm-' in child:
                print child.attrib
            if 'href' in child.attrib:
                vm_url = child.attrib['href']
                if 'vm-' in vm_url:
                    print vm_url
I can't get to the URL no matter what I try. I only get the main child of the vApp; it never parses the <Children> tag, or rather my code never goes further than the first child of the vApp, and I don't know why.
I guess I wasn't clear. I'm parsing vCloud Director REST API XML that is returned as a string. The first level is the vApp link, which is essentially a container for VMs. I need to get the VM link under each vApp. The first loop selects the vApp links and queries them.
Once it does a GET request on the vApp link, it gets the next level of XML data, which is the structure I put above; so it gets past the initial XML statement and returns the vApp information.
Even when I print out every child.attrib from vmlinkXML, the link with vm doesn't get printed. However, if I just print r.content the link is there. It's almost like the XML parser doesn't see the tag.
I'm using Python's xml.etree:
from lxml import etree
from xml.etree.ElementTree import XML, fromstring, tostring
So, to be clear, the flow is:
to get the vApp links I query /api/admin/extension/vapps/query;
the returned information contains links to each vApp in vCloud;
then I call a vApp link, for example
https://vcloud.test.co/api/vApp/vapp-3b4980e7-c5ab-4462-9cfe-abc6292c15748
and it returns a structure similar to this:
<vapp>
  <link></link>
  <othertags></othertags>
  <Children>
    <Vm href='link I need'>
      <other tag options>
      </other tag options>
    </Vm>
  </Children>
</vapp>
The <Children> tag contains the next level of link I need to query. However, iterating and printing child.attrib never outputs anything under that tag.
Solved:
from xml.dom.minidom import parseString

r = requests.get(url + '/api/admin/extension/vapps/query', headers=new_headers)
vappXML = fromstring(r.content)
for i, child in enumerate(vappXML):
    if 'href' in child.attrib and 'name' in child.attrib:
        vapp_url = child.attrib['href']
        r = requests.get(vapp_url, headers=new_headers)
        DOMTree = parseString(r.content)
        vmElements = DOMTree.documentElement
        VMS = vmElements.getElementsByTagName("Vm")
        for vm in VMS:
            if vm.hasAttribute("href"):
                vm_link = vm.getAttribute("href")
                print vm_link
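For what it's worth, a likely reason findall('Children') and the child.attrib loop found nothing is that the vCloud response declares a default XML namespace, so the bare tag name 'Children' never matches. A minimal ElementTree sketch (a hypothetical helper, assuming the response structure shown above) that matches on the local tag name regardless of namespace:

from xml.etree.ElementTree import fromstring

def find_vm_hrefs(xml_text):
    root = fromstring(xml_text)
    hrefs = []
    for el in root.iter():
        # with a namespace in play, el.tag looks like '{http://...}Vm';
        # strip the '{...}' prefix and compare only the local name
        local_name = el.tag.split('}')[-1]
        if local_name == 'Vm' and 'href' in el.attrib:
            hrefs.append(el.attrib['href'])
    return hrefs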

XML parsing specific values - Python

I've been attempting to parse a list of xml files. I'd like to print specific values such as the userName value.
<?xml version="1.0" encoding="utf-8"?>
<Drives clsid="{8FDDCC1A-0C3C-43cd-A6B4-71A6DF20DA8C}"
disabled="1">
<Drive clsid="{935D1B74-9CB8-4e3c-9914-7DD559B7A417}"
name="S:"
status="S:"
image="2"
changed="2007-07-06 20:57:37"
uid="{4DA4A7E3-F1D8-4FB1-874F-D2F7D16F7065}">
<Properties action="U"
thisDrive="NOCHANGE"
allDrives="NOCHANGE"
userName=""
cpassword=""
path="\\scratch"
label="SCRATCH"
persistent="1"
useLetter="1"
letter="S"/>
</Drive>
</Drives>
My script is working fine collecting a list of XML files, etc. However, the function below is meant to print the relevant values. I'm trying to achieve this as suggested in this post, but I'm clearly doing something incorrectly, as I'm getting errors saying that the elm object has no attribute text. Any help would be appreciated.
Current Code
from lxml import etree as ET

def read_files(files):
    for fi in files:
        doc = ET.parse(fi)
        elm = doc.find('userName')
        print elm.text
doc.find looks for a tag with the given name. You are looking for an attribute with the given name.
elm.text is giving you an error because doc.find doesn't find any tags, so it returns None, which has no text property.
Read the lxml.etree docs some more, and then try something like this:
doc = ET.parse(fi)
root = doc.getroot()
prop = root.find(".//Properties") # finds the first <Properties> tag anywhere
elm = prop.attrib['userName']
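As a small aside (not part of the original answer), lxml elements also expose attributes through .get(), which returns None instead of raising a KeyError when the attribute is missing. For the sample file above it returns an empty string, since userName is present but blank:

elm = prop.get('userName')   # '' for the sample file; None if the attribute were absent
print elm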
userName is an attribute, not an element. Attributes don't have text nodes attached to them at all.
for el in doc.xpath('//*[@userName]'):
    print el.attrib['userName']
You can try to take the element using its tag name and then take its attribute (userName is an attribute of Properties):
from xml.dom.minidom import parse

def read_files(files):
    for fi in files:
        doc = parse(fi)
        # getElementsByTagName / .attributes / .value are part of the minidom DOM API
        props = doc.getElementsByTagName('Properties')
        elm = props[0].attributes['userName']
        print elm.value

How to find the value of a particular tag element in XML using Python?

I am trying to parse XML data received from a RESTful interface. In error conditions (when the query does not find anything on the server), I get the following text back. Now I want to parse this string to search for the value of status, present in the fifth line of the example below. How can I find whether status is present, and if it is, what its value is?
content = """
<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="/3.0/style/exchange.xsl"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink">
<ops:meta name="elapsed-time" value="3"/>
<exchange-documents>
<exchange-document system="ops.epo.org" country="US" doc-number="20060159695" status="not found">
<bibliographic-data>
<publication-reference>
<document-id document-id-type="epodoc">
<doc-number>US20060159695</doc-number>
</document-id>
</publication-reference>
<parties/>
</bibliographic-data>
</exchange-document>
</exchange-documents>
</ops:world-patent-data>
"""
import xml.etree.ElementTree as ET
root = ET.fromstring(content)
res = root.iterfind(".//{http://www.epo.org/exchange}exchange-documents[@status='not found']/..")
Just use BeautifulSoup:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(open('xml.txt', 'r'))
print soup.find('exchange-document')["status"]
#> not found
If you store several XML outputs in a single file, it would be useful to iterate over them:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(open('xml.txt', 'r'))
for tag in soup.findAll('exchange-document'):
    print tag["status"]
#> not found
This will display the status attribute of every exchange-document element.
Plus, if you only want useful statuses, you can do:
for tag in soup.findAll('exchange-document'):
    if tag["status"] != "not found":
        print tag["status"]
Try this:
from xml.dom.minidom import parse
xmldoc = parse(filename)
elementList = xmldoc.getElementsByTagName(tagName)
elementList will contain all elements with the tag name you specify, then you can iterate over those.
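For example, a minimal usage sketch against the content string from the question (using parseString, since the data is a string rather than a file):

from xml.dom.minidom import parseString

# strip the leading newline so the XML declaration sits at the very start of the data
xmldoc = parseString(content.strip())
for element in xmldoc.getElementsByTagName('exchange-document'):
    if element.hasAttribute('status'):
        print element.getAttribute('status')
        #> not found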
