XML not parsing correctly using requests and lxml - python

I'm trying to get content out of XML from an API call. I'm able to use requests to get the xml content, but can't seem to parse it correctly. Here is the code that has been semi-successful so far:
import requests
from lxml import etree
data = requests.get('http://elections.huffingtonpost.com/pollster/api/polls.xml', params={'sort':'updated'})
tree = etree.XML(data.content)
The tree is showing the line breaks from the xml as text, and some of the nodes that are more than 3 levels deep are gone.
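The line breaks that show up as text are most likely the indentation between tags, which ElementTree-style APIs keep in each element's .text and .tail. Below is a minimal sketch of that behaviour, assuming a made-up payload (SAMPLE is hypothetical; the real feed's structure is not shown in the question). It uses the standard library's xml.etree.ElementTree, whose .text/.tail model lxml follows. lxml also accepts etree.XMLParser(remove_blank_text=True) as a parser option, which is not exercised here.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the API response body (not the real feed).
SAMPLE = b"""<polls>
  <poll>
    <id>1</id>
    <questions>
      <question>
        <name>Example</name>
      </question>
    </questions>
  </poll>
</polls>"""

root = ET.fromstring(SAMPLE)

# Indentation between tags is kept as whitespace-only text on the element.
raw_root_text = root.text
print(repr(raw_root_text))


def strip_whitespace(elem):
    """Drop whitespace-only .text and .tail values, recursively."""
    if elem.text is not None and not elem.text.strip():
        elem.text = None
    if elem.tail is not None and not elem.tail.strip():
        elem.tail = None
    for child in elem:
        strip_whitespace(child)


strip_whitespace(root)

# Deeply nested nodes are still in the tree; iter() walks every level.
names = [node.text for node in root.iter("name")]
print(names)
```

If nested nodes appear to be missing, walking the tree with iter() as above is a quick way to check whether they were parsed and are only hidden by how the tree is being printed.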

Related

parsing xml with namespace from request with lxml in python

I am trying to get some text out of a table from an online xml file. I can find the tables:
from lxml import etree
import requests
main_file = requests.get('https://training.gov.au/TrainingComponentFiles/CUA/CUAWRT601_R1.xml')
main_file.encoding = 'utf-8-sig'
root = etree.fromstring(main_file.content)
tables = root.xpath('//foo:table', namespaces={"foo": "http://www.authorit.com/xml/authorit"})
print(tables)
But I can't get any further than that. The text that I am looking for is:
Prepare to write scripts
Write draft scripts
Produce final scripts
When I paste the xml in here: http://xpather.com/
I can get it using the following expression:
//table[1]/tr/td[@width="2700"]/p[@id="4"][not(*)]/text()
but that doesn't work here and I'm out of ideas. How can I get that text?
Use the namespace prefix you declared (with namespaces={"foo": "http://www.authorit.com/xml/authorit"}) on every element step of the path. For example, instead of //table[1]/tr/td[@width="2700"]/p[@id="4"][not(*)]/text() use //foo:table[1]/foo:tr/foo:td[@width="2700"]/foo:p[@id="4"][not(*)]/text().
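Here is a small sketch of the same idea that runs without downloading anything. The SAMPLE document is hypothetical (the real file's layout is only guessed from the XPath in the question), and it uses the standard library's xml.etree.ElementTree rather than lxml. ElementTree's limited XPath has no text() or not(*), so the sketch selects the p elements and filters in Python instead.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping; the prefix "foo" is arbitrary, only the URI must match.
NS = {"foo": "http://www.authorit.com/xml/authorit"}

# Hypothetical document using the same default namespace as the real file.
SAMPLE = """<root xmlns="http://www.authorit.com/xml/authorit">
  <table>
    <tr>
      <td width="2700"><p id="4">Prepare to write scripts</p></td>
      <td width="6300"><p id="4">Other cell</p></td>
    </tr>
    <tr>
      <td width="2700"><p id="4">Write draft scripts</p></td>
    </tr>
  </table>
</root>"""

root = ET.fromstring(SAMPLE)

# Every step carries the prefix, because every element is in the namespace.
paragraphs = root.findall(
    ".//foo:table/foo:tr/foo:td[@width='2700']/foo:p[@id='4']", NS
)

# len(p) == 0 keeps only paragraphs without child elements, like [not(*)].
texts = [p.text for p in paragraphs if len(p) == 0]
print(texts)
```

With lxml, the full expression from the answer can be passed to root.xpath(...) together with namespaces=NS.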

xPath with ElementTree (python) to parse XML from string

I'm using ElementTree to parse some XML retrieved from a website, but somehow I can't seem to use ".find" or ".findall". I tried ElementTree and I tried lxml.etree, and neither works for me. My goal is to retrieve //COURSE from the XML retrieved from a URL.
import requests
import xml.etree.ElementTree as ET
res = requests.get(COURSES_URL).text #Storing the XML into res
XML = ET.fromstring(res)
print(XML.findall('//COURSE'))
COURSES_URL is my own URL which I am retrieving the XML from, and yes it is working since I got the output XML that I want (sample):
<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated by Oracle Reports version 11.1.2.1.0 -->
<SYRSPOS_REP>
<LIST_G_PROGRAM>
<G_PROGRAM>
<SPRIDEN_ID>U712214</SPRIDEN_ID>
<STUDENT_NAME>Mark Adam Johns</STUDENT_NAME>
<SMBPOGN_PIDM>98</SMBPOGN_PIDM>
<SMBPOGN_REQUEST_NO>46</SMBPOGN_REQUEST_NO>
<COURSE ID=1411001>PASS</COURSE>
<COURSE ID=1411023>PASS</COURSE>
<COURSE ID=1411136>PASS</COURSE>
</G_PROGRAM>
</LIST_G_PROGRAM>
</SYRSPOS_REP>
Solved:
Apparently I had 2 issues.
First, findall returns a list of elements, so printing it directly only shows the element objects. I had to loop over it (for i in XML.findall(...)) and print i.text, which is an attribute, not a method.
Second, I had to add a dot at the start of the path, right after the opening quotation mark, as in ".//COURSE".
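Both fixes can be seen in a short sketch. SAMPLE is a trimmed, hypothetical version of the XML in the question, with the ID attribute values quoted so that it is well-formed (that quoting is an assumption; the sample as posted would not parse).

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical version of the sample, with quoted attribute values.
SAMPLE = """<SYRSPOS_REP>
  <LIST_G_PROGRAM>
    <G_PROGRAM>
      <SPRIDEN_ID>U712214</SPRIDEN_ID>
      <COURSE ID="1411001">PASS</COURSE>
      <COURSE ID="1411023">PASS</COURSE>
      <COURSE ID="1411136">PASS</COURSE>
    </G_PROGRAM>
  </LIST_G_PROGRAM>
</SYRSPOS_REP>"""

root = ET.fromstring(SAMPLE)

# An absolute path is rejected when searching from an element.
try:
    root.findall("//COURSE")
    absolute_path_error = None
except SyntaxError as exc:
    absolute_path_error = exc

# The leading dot makes the search relative to the current element.
courses = [(c.get("ID"), c.text) for c in root.findall(".//COURSE")]
for course_id, result in courses:
    print(course_id, result)
```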

Parsing XML Object Python 3.4

Basically what I am doing is using urllib.request to make an API call to pubmed, receive an XML file in return, and am trying to parse it with no luck.
I have tried using Element Tree and other modules with no luck. I believe there may be an issue with XML object itself.
#Importing URL Request Modules for API Calls
#Also importing ElementTree as it seems to be best for XML parsing
import urllib.request
import urllib.parse
import re
import xml.etree.ElementTree as ET
from urllib import request
#Now I can make the API call.
id_request = urllib.request.urlopen('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=17570568')
#id_request will be an object that I'm not sure I understand?
#id_request Returns: "<http.client.HTTPResponse object at 0x0000000003693FD0>"
#Let's now read this baby in XML format!
id_pubmed = id_request.read()
#If I look at the id_pubmed object, I now have the XML file I want to parse.
You can see what the XML file id_pubmed is calling/prints here: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=17570568
My issue is I can't get Element Tree to parse this at all. I have tried:
tree = ET.parse(id_pubmed)
root = tree.getroot()
as well as various other suggestions from https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree
ET.parse() requires either the path to an XML file on the local file system or a file-like object, but your id_pubmed holds the content that was already read from the response (a string, or bytes in Python 3).
In that case, you should use ET.fromstring(). Example:
root = ET.fromstring(id_pubmed)
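To make the difference concrete, here is a sketch that avoids the network. PAYLOAD is a hypothetical stand-in for the response body, and io.BytesIO plays the role of the file-like object returned by urlopen(). It shows that ET.parse() accepts the unread file-like object, while ET.fromstring() accepts the bytes that read() returns.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the bytes returned by id_request.read().
PAYLOAD = b"""<?xml version="1.0" encoding="UTF-8"?>
<eSummaryResult>
  <DocSum>
    <Id>17570568</Id>
    <Item Name="Title" Type="String">Example title</Item>
  </DocSum>
</eSummaryResult>"""

# Option 1: pass a file-like object (here BytesIO, in place of the response).
response_like = io.BytesIO(PAYLOAD)
tree = ET.parse(response_like)
root_from_file = tree.getroot()

# Option 2: pass the already-read content to fromstring().
root_from_bytes = ET.fromstring(PAYLOAD)

print(root_from_file.tag, root_from_bytes.findtext("DocSum/Id"))
```

Either route gives the same root element. The original attempt failed because it passed already-read content to ET.parse(), which expects a path or a file-like object.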

Python lxml.etree - Is it more effective to parse XML from string or directly from link?

With the lxml.etree python framework, is it more efficient to parse xml directly from a link to an online xml file or is it better to say, use a different framework (such as urllib2), to return a string and then parse from that? Or does it make no difference at all?
Method 1 - Parse directly from link
from lxml import etree as ET
parsed = ET.parse(url_link)
Method 2 - Parse from string
from lxml import etree as ET
import urllib2
xml_string = urllib2.urlopen(url_link).read()
parsed = ET.fromstring(xml_string)
# note: I do not have access to python
# at the moment, so not sure whether
# the .fromstring() function is correct
Or is there a more efficient method than either of these, e.g. save the xml to a .xml file on desktop then parse from those?
I ran the two methods with a simple timing wrapper.
Method 1 - Parse XML Directly From Link
from lxml import etree as ET
#timing
def parseXMLFromLink():
    parsed = ET.parse(url_link)
    print parsed.getroot()
for n in range(0,100):
    parseXMLFromLink()
Average of 100 = 98.4035 ms
Method 2 - Parse XML From String Returned By Urllib2
from lxml import etree as ET
import urllib2
#timing
def parseXMLFromString():
    xml_string = urllib2.urlopen(url_link).read()
    parsed = ET.fromstring(xml_string)
    print parsed
for n in range(0,100):
    parseXMLFromString()
Average of 100 = 286.9630 ms
So, anecdotally, parsing directly from the link with lxml appears to be the faster method. It's not clear whether downloading large XML documents to disk and then parsing them would be faster, but unless the document is huge and the parsing more intensive, parseXMLFromLink() would presumably remain quicker, since it is urllib2 that seems to slow the second function down.
I ran this a few times and the results stayed the same.
If by 'effective' you mean 'efficient', I'm relatively certain you will see no difference between the two at all (unless ET.parse(link) is horribly implemented).
The reason is that the network time is going to be the most significant part of parsing an online XML file, a lot longer than storing the file to disk or keeping it in memory, and a lot longer than actually parsing it.
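One way to check that claim on your own machine is to time the parsing step in isolation and compare it with the totals measured above. The sketch below does only the parsing half, on a synthetic in-memory document. The document, its size, and the helper name time_parse are all made up for illustration, and it uses the standard library parser rather than lxml.

```python
import time
import xml.etree.ElementTree as ET

# Synthetic document built in memory; size and shape are arbitrary.
xml_bytes = (
    "<items>"
    + "".join(f"<item id='{i}'>value {i}</item>" for i in range(10000))
    + "</items>"
).encode("utf-8")


def time_parse(data, repeats=20):
    """Return (average seconds per parse, number of top-level children)."""
    start = time.perf_counter()
    for _ in range(repeats):
        root = ET.fromstring(data)
    elapsed = time.perf_counter() - start
    return elapsed / repeats, len(root)


avg_seconds, item_count = time_parse(xml_bytes)
print(f"average parse time: {avg_seconds * 1000:.3f} ms for {item_count} items")
```

If the parse time printed here is small next to the per-request averages, the difference between the two methods is coming from how the document is fetched, not from how it is parsed.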

Creating an XML document with BeautifulSoup

In all the examples and tutorials I have seen of BeautifulSoup, an HTML/XML document is passed in and a soup object is returned, which can then be used to modify the document. However, how can I use BeautifulSoup to create an HTML/XML document from scratch? In other words, I have data that I would like to put in an XML file, but the XML file does not exist yet and I would like to build it from scratch. How can I go about it?
Just create an empty BeautifulSoup() object:
soup = BeautifulSoup()
and start adding elements:
soup.append(soup.new_tag("a", href="http://www.example.com"))
For XML you could start out with an XML header by using the xml tree builder:
soup = BeautifulSoup(features='xml')
This requires lxml to be installed first. This sets the .is_xml flag on the BeautifulSoup object (which can also be set manually).
