Parsing XML from API response - python

I've been trying for some hours to grab the response from the imgur API. I got the XML in the terminal, but I don't know how to grab it and parse it. Here's my code.
import pycurl
c = pycurl.Curl()
values = [
("key", "Super Secret API Number"),
("image", (c.FORM_FILE, "pic.jpg"))]
c.setopt(c.URL, "http://api.imgur.com/2/upload.xml")
c.setopt(c.HTTPPOST, values)
c.perform()
c.close()
I'm a big noob with python, this is my first time. Python virgin. I read that you can parse the xml with ElementTree, but I can't find any cool documentation.
Hope you can help me. Thanks.

Store the response from the imgur API in a file. Then use an XML parser to parse the XML response/file you are getting from the imgur API.
There are lots of options available, such as lxml or BeautifulSoup.
Here is an example of how to use lxml with XPath expressions.
>>> from lxml import etree
>>> xml = """<foo>baz!</foo>"""
>>> xp = etree.fromstring(xml)
>>> values = xp.xpath("//foo/text()")
>>> values
['baz!']
If you need to parse an XML file:
# parse from file
et = etree.parse(source_xml)
value = et.xpath("your xpath xpr here")
If you need to parse directly from a URL:
# parse from URL
etree.parse("http://example.com/somefile.xml")
To work out XPath expressions, use Firefox's Firebug extension or install FirePath.
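As an alternative to writing the response to a file, the body can be captured in memory and handed straight to a parser. The following is only a minimal sketch: it assumes pycurl's WRITEDATA option is pointed at an io.BytesIO buffer, it swaps in the standard library's ElementTree rather than lxml, and the helper names and the sample response are made up for illustration (the real imgur layout may differ).

```python
import io
import xml.etree.ElementTree as ET


def fetch_upload_response(api_key, image_path):
    """POST the image and return the raw response body as bytes."""
    # pycurl is third-party; it is imported here so the parsing code
    # below can be tried without it installed.
    import pycurl

    buffer = io.BytesIO()
    c = pycurl.Curl()
    c.setopt(c.URL, "http://api.imgur.com/2/upload.xml")
    c.setopt(c.HTTPPOST, [("key", api_key), ("image", (c.FORM_FILE, image_path))])
    c.setopt(c.WRITEDATA, buffer)  # collect the body instead of printing it
    c.perform()
    c.close()
    return buffer.getvalue()


def leaf_values(xml_bytes):
    """Map the tag of every leaf element (no children) to its text."""
    root = ET.fromstring(xml_bytes)
    return {el.tag: el.text for el in root.iter() if len(el) == 0}


# Hypothetical response body, for illustration only.
sample = b"<upload><links><original>http://example.com/pic.jpg</original></links></upload>"
print(leaf_values(sample))
```

With a real key, leaf_values(fetch_upload_response("your key", "pic.jpg")) would give the same kind of dictionary for the actual response.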

When I started using the included ElementTree module I found the documentation lacking good examples (currently there are only 3, and only one of those shows anything immediately practical).
I've answered a couple of questions here on SO related to lxml/ElementTree, and I usually see people getting stuck trying to write these weird list comprehensions to deal with something XPath handles in one line much more clearly:
Parsing lxml.etree._Element contents
lxml classic: Get text content except for that of nested tags?
If you have a more specific question, please post some source XML and desired effect.
I hope this helps.

Related

parsing xml with namespace from request with lxml in python

I am trying to get some text out of a table from an online xml file. I can find the tables:
from lxml import etree
import requests
main_file = requests.get('https://training.gov.au/TrainingComponentFiles/CUA/CUAWRT601_R1.xml')
main_file.encoding = 'utf-8-sig'
root = etree.fromstring(main_file.content)
tables = root.xpath('//foo:table', namespaces={"foo": "http://www.authorit.com/xml/authorit"})
print(tables)
But I can't get any further than that. The text that I am looking for is:
Prepare to write scripts
Write draft scripts
Produce final scripts
When I paste the xml in here: http://xpather.com/
I can get it using the following expression:
//table[1]/tr/td[@width="2700"]/p[@id="4"][not(*)]/text()
but that doesn't work here and I'm out of ideas. How can I get that text?
Use the namespace prefix you declared (with namespaces={"foo": "http://www.authorit.com/xml/authorit"}) on every element step of the path. For example, instead of //table[1]/tr/td[@width="2700"]/p[@id="4"][not(*)]/text() use //foo:table[1]/foo:tr/foo:td[@width="2700"]/foo:p[@id="4"][not(*)]/text().
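The effect of the prefix can be seen on a small document. This is a sketch only: it uses the standard library's ElementTree instead of lxml (its path syntax is a subset of XPath, so text() and not(*) are left out), and the sample XML is made up. The same namespaces mapping is what gets passed to lxml's xpath().

```python
import xml.etree.ElementTree as ET

NS = {"foo": "http://www.authorit.com/xml/authorit"}

# Made-up document that declares the same default namespace as the real file.
doc = """<root xmlns="http://www.authorit.com/xml/authorit">
  <table>
    <tr><td width="2700"><p id="4">Prepare to write scripts</p></td></tr>
    <tr><td width="2700"><p id="4">Write draft scripts</p></td></tr>
    <tr><td width="900"><p id="4">ignored</p></td></tr>
  </table>
</root>"""

root = ET.fromstring(doc)

# Without a prefix, the path looks for elements in no namespace and finds nothing.
unprefixed = root.findall(".//table")

# With the prefix on every step, the elements are found.
prefixed = root.findall(".//foo:td[@width='2700']/foo:p[@id='4']", NS)
texts = [p.text for p in prefixed]

print(len(unprefixed), texts)
```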

How to extract some text from json file without loading it?

python lxml can be used to extract text (e.g., with xpath) from XML files without having to fully parse XML. For example, I can do the following which is faster than BeautifulSoup, especially for large input. I'd like to have some equivalent code for JSON.
from lxml import etree
tree = etree.XML('<foo><bar>abc</bar></foo>')
print(type(tree))
r = tree.xpath('/foo/bar')
print([x.tag for x in r])
I see http://goessner.net/articles/JsonPath/. But I don't see any example Python code that extracts some text from a JSON file without having to use json.load(). Could anybody show me an example? Thanks.
I'm assuming you don't want to load the entire JSON for performance reasons.
If that's the case, perhaps ijson is what you need. I used it to search huge JSON files (> 8 GB) and it works well.
However, you will have to implement the search code yourself.

Python 3.x: parse ATOM XML and convert to dict

I'm struggling to parse an ATOM XML file, coming from an API, into a common data structure such as a dict, a Pandas dataframe or JSON.
I understand XML files are more complex than JSON files, and hence there won't be a very simple, generic solution to this. I hope the fact that I'm dealing with an ATOM structure makes it easier to parse the file into a more general data structure.
The structure of the XML data: http://opendata.cbs.nl/ODataFeed/OData/70266ned/TypedDataSet
And similar for JSON here: http://opendata.cbs.nl/ODataFeed/OData/70266ned/TypedDataSet
The reason I can't use the JSON file is that it is often not available.
I played around with libraries like xml.etree, xmltodict, lxml, xmljson and feedparser, but I keep getting errors.
For example, using xml.etree.ElementTree:
import requests
from xml.etree import ElementTree
r = requests.get('http://opendata.cbs.nl/ODataFeed/OData/70266ned/TypedDataSet')
tree = ElementTree.fromstring(r.content)
Yields the error
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0
Help would be highly appreciated!
I don't know if you've solved it yet, but have you tried parsing r.text instead?
tree = ElementTree.fromstring(r.text)
r.content returns the response body as bytes, while r.text returns it decoded to a string (see: http://docs.python-requests.org/en/master/api/#requests.Response)
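Once the document parses, one possible way to get from the feed to a list of dicts with the standard library is sketched below. Everything specific here is an assumption: the namespace URIs are the ones commonly used by OData Atom feeds and should be checked against the real response, the entry/content/properties layout may differ, and the sample feed and its property names are made up.

```python
import xml.etree.ElementTree as ET

# Namespace URIs assumed for an OData Atom feed; verify against the real response.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
}


def entries_to_dicts(xml_text):
    """Return one dict per <entry>, keyed by property name without its namespace."""
    root = ET.fromstring(xml_text)
    rows = []
    for props in root.iterfind("atom:entry/atom:content/m:properties", NS):
        # Tags look like "{namespace}Name"; keep only the part after "}".
        rows.append({child.tag.split("}", 1)[-1]: child.text for child in props})
    return rows


# Made-up two-entry feed with hypothetical property names.
sample = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
      xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices">
  <entry><content type="application/xml"><m:properties>
    <d:ID>0</d:ID><d:Name>first</d:Name>
  </m:properties></content></entry>
  <entry><content type="application/xml"><m:properties>
    <d:ID>1</d:ID><d:Name>second</d:Name>
  </m:properties></content></entry>
</feed>"""

print(entries_to_dicts(sample))
```

A list of dicts like this can be passed to pandas.DataFrame or json.dumps if one of those formats is the end goal.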

How do you extract feed urls from an OPML file exported from Google Reader?

I have a piece of software called Rss-Aware that I'm trying to use. It's basically a desktop feed checker that checks whether RSS feeds have been updated and gives a notification through Ubuntu's Notify-OSD system.
However, to know which feeds to check, you have to list the feed URLs in a text file at ~/.rss-aware/rssfeeds.txt, one per line. Something like:
http://example.com/feed.xml
http://othersite.org/feed.xml
http://othergreatsite.net/rss.xml
...Seems pretty simple, right? Well, the list of feeds I'd like to use was exported from Google Reader as an OPML file (it's a type of XML) and I have no clue how to parse it to output just the feed URLs. It seems like it should be pretty straightforward, yet I'm stumped.
I'd love it if anyone could give an implementation in Python or Ruby or something I could do quickly from a prompt. A bash script would be awesome.
Thank you so much for the help. I'm a really weak programmer and would love to learn how to do this basic parsing.
EDIT: Also, here is the OPML file I'm trying to extract the feed urls from.
I wrote a subscription list parser for this very purpose. It's called listparser, and it's written in Python. I just tested your OPML file, and it appears to parse the file perfectly. It will also make your feeds' labels available.
If you've ever used feedparser, the interface should be familiar:
>>> import listparser as lp
>>> d = lp.parse('https://dl.dropbox.com/u/670189/google-reader-subscriptions.xml')
>>> len(d.feeds)
112
>>> d.feeds[100].url
u'http://longreads.com/rss'
>>> d.feeds[100].tags
[u'reading']
It's possible to create the file with feed URLs using a script similar to:
import listparser as lp
d = lp.parse('https://dl.dropbox.com/u/670189/google-reader-subscriptions.xml')
# write one feed URL per line
with open('/home/USERNAME/.rss-aware/rssfeeds.txt', 'w') as f:
    for feed in d.feeds:
        f.write(feed.url + '\n')
Just replace USERNAME with your actual username. Done!
XML parsing was so easy to implement and worked great for me.
from xml.etree import ElementTree
def extract_rss_urls_from_opml(filename):
    urls = []
    with open(filename, 'rt') as f:
        tree = ElementTree.parse(f)
    # every feed is an <outline> element carrying an xmlUrl attribute
    for node in tree.findall('.//outline'):
        url = node.attrib.get('xmlUrl')
        if url:
            urls.append(url)
    return urls
urls = extract_rss_urls_from_opml('your_file')
Since it's an XML file, you can use an XPath query to extract the urls.
In the XML file, it looks like the RSS feed URLs are stored in xmlUrl attributes. The XPath expression //@xmlUrl will select all values of that attribute.
If you want to test this out in your web-browser, you can use an online XPath tester. If you want to perform this XPath query in Python, this question explains how to use XPath in Python. Additionally, the lxml docs have a page on using XPath in lxml that might be helpful.
You could also use a regex. I used the following search-and-replace regex to convert my Google Reader OPML export to a Firefox HTML live-bookmark import:
^\s+<outline.*?title="(.*?)".*?xmlUrl="(.*?)".*?htmlUrl="(.*?)".*?/>
<DT><A FEEDURL="$2" HREF="$3">$1</A>
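The same search-and-replace can also be run from Python with re.sub. This is a minimal sketch, assuming each outline element sits on its own indented line with title, xmlUrl and htmlUrl in that order; the function name and the sample line are made up for illustration.

```python
import re

# Same pattern as above; MULTILINE lets ^ match at the start of every line.
PATTERN = re.compile(
    r'^\s+<outline.*?title="(.*?)".*?xmlUrl="(.*?)".*?htmlUrl="(.*?)".*?/>',
    re.MULTILINE,
)
# Python writes backreferences as \1, \2, \3 instead of $1, $2, $3.
REPLACEMENT = r'<DT><A FEEDURL="\2" HREF="\3">\1</A>'


def opml_to_bookmarks(opml_text):
    """Rewrite every matching <outline .../> line as a live-bookmark entry."""
    return PATTERN.sub(REPLACEMENT, opml_text)


# Hypothetical outline line, for illustration only.
sample = (
    '    <outline text="Longreads" title="Longreads" type="rss" '
    'xmlUrl="http://longreads.com/rss" htmlUrl="http://longreads.com"/>'
)
print(opml_to_bookmarks(sample))
```

A regex like this depends on the attribute order and line layout of the export, so an XML parser is the safer choice when the file's shape is not known in advance.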

Basic Python file searching and I/O

I'm trying to complete a simple task in Python and I'm new to the language (I come from C++). I hope someone might be able to point me in the right direction.
Problem:
I have an XML file (12 MB) full of data, and within the file there are start tags 'xmltag' and end tags '/xmltag' that mark the start and end of the data sections I would like to pull out.
I would like to navigate through this open file with a loop and for each instance locate a start tag and copy the data within the section to a new file until the end tag. I would then like to repeat this to the end of the file.
I'm happy with the file I/O but not the most efficient looping, searching and extracting of the data.
I really like the look of the language and hopefully I'm going to get more involved so I can give back to the community.
Big thanks!
Check out BeautifulSoup:
from bs4 import BeautifulSoup
with open('bigfile.xml', 'r') as xml:
    # the 'xml' parser needs lxml to be installed
    soup = BeautifulSoup(xml, 'xml')
# calling the soup is shorthand for find_all()
for xmltag in soup('xmltag'):
    print(xmltag.contents)
Dive Into Python 3 has a great chapter about this:
http://diveintopython3.org/xml.html#xml-parse
It's a great free book about Python, worth reading!
The BeautifulSoup answer is good but this executes faster and doesn't require an external library:
import xml.etree.ElementTree as ET
tree = ET.parse('xmlfile.xml')
# iter() replaces getiterator(), which was deprecated in Python 2.7 and removed in Python 3.9
results = tree.iter('xmltag')
There's no need to install BeautifulSoup; Python includes the ElementTree parser in its standard library.
from xml.etree import ElementTree as ET
tree = ET.parse('myfilename')
# collect every matching element under a new root
new_root = ET.Element('new_root_element')
for element in tree.findall('.//xmltag'):
    new_root.append(element)
print(ET.tostring(new_root, encoding='unicode'))
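Or, without a parser, split the text on the tags (this assumes the tags carry no attributes):
with open("xmlfile") as f:
    xml = f.read()
for block in xml.split("</xmltag>"):
    if "<xmltag>" in block:
        # keep only what follows the opening tag
        print(block.split("<xmltag>")[-1])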
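The answers above print the matching sections; since the question asks for them to be copied into a new file, here is one way that step could look. It is a sketch only, using ElementTree's iterparse so the input is read incrementally. The function name is made up, and in-memory buffers stand in for the real input and output files.

```python
import io
import xml.etree.ElementTree as ET


def copy_sections(source, dest, tag="xmltag"):
    """Write every <tag> element read from source to dest, one per line.

    Returns the number of sections written.
    """
    count = 0
    # "end" events fire once an element has been read completely.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            dest.write(ET.tostring(elem, encoding="unicode").strip() + "\n")
            count += 1
            elem.clear()  # drop the element's contents once written
    return count


# In-memory stand-ins for the input and output files.
source = io.BytesIO(b"<root><xmltag>one</xmltag><other/><xmltag>two</xmltag></root>")
dest = io.StringIO()
written = copy_sections(source, dest)
print(written)
print(dest.getvalue())
```

With real files, source would be open('xmlfile.xml', 'rb') and dest would be open('sections.xml', 'w'), where both file names are placeholders.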
