BeautifulSoup4 not recognizing xml tag - python

I'm using BeautifulSoup4 (with the lxml parser) to parse XML that looks like this:
<?xml version="1.0" encoding="UTF-8" ?>
<data>
<metadata id="8735180" name="Dauphin Island" lat="30.2500" lon="-88.0750"/>
<observations>
<wl t="2013-12-14 00:00" v="0.725" s="0.059" f="0,0,0,0" q="v" />
<wl t="2013-12-14 00:06" v="0.771" s="0.066" f="0,0,0,0" q="v" />
<wl t="2013-12-14 00:12" v="0.764" s="0.085" f="0,0,0,0" q="v" />
....etc
The python code is like so:
obs_soup = BeautifulSoup(urllib2.urlopen('http://tidesandcurrents.noaa.gov/api/datagetter?product=water_level&application=NOS.COOPS.TAC.WL&begin_date=20131214&end_date=20131216&datum=MSL&station=8735180&time_zone=GMT&units=english&interval=&format=xml'), 'lxml')
for l in obs_soup.findall('wl'):
    obs.append(l['v'])
I keep getting the error:
for l in obs_soup.findall('wl'):
TypeError: 'NoneType' object is not callable
I tried the solution here (except instead of looking for 'html', I looked for 'data'), but that didn't work. Any suggestions?

There are two problems here.
First, there is no such method as findall in BeautifulSoup. Change that to:
for l in obs_soup.find_all('wl'):
    obs.append(l['v'])
… and it will work.
So, why are you getting this TypeError: 'NoneType' object is not callable instead of the more usual AttributeError? Because of BeautifulSoup's magic lookup: the same mechanism that lets you write obs_soup.wl as a shortcut for finding a <wl> tag also treats obs_soup.findall as a shortcut for finding a <findall> tag. Because there is no <findall> node, it returns None, and then you're trying to call that None object as a function, which of course is nonsense.
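You can see the magic lookup in action with a tiny sketch (a made-up two-line document, not the question's data; assumes lxml is installed so the 'xml' parser is available):
from bs4 import BeautifulSoup

soup = BeautifulSoup('<data><wl v="0.725"/></data>', 'xml')
print(soup.wl)       # <wl v="0.725"/>: attribute access finds the first <wl>
print(soup.findall)  # None: there is no <findall> tag in the document
# soup.findall('wl') would therefore raise:
# TypeError: 'NoneType' object is not callable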
Also, if you actually had copied and pasted the code from here as you claimed, you wouldn't have had this problem. That code uses findAll, with a capital "A", which is a deprecated synonym for find_all. (You shouldn't use the deprecated synonyms, of course.)
Second, you're explicitly asking for lxml's HTML parser instead of its XML parser. Don't do that. See the docs:
BeautifulSoup(markup, ["lxml", "xml"])
Since you didn't give us a complete XML document, I don't know whether this will affect you, or whether you'll happen to get lucky. But you shouldn't rely on happening to get lucky when it's so easy to actually do things right.
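Putting both fixes together, a corrected version of the question's code might look like this (a minimal sketch assuming BeautifulSoup 4 with lxml installed; passing 'xml' selects lxml's XML parser):
import urllib2
from bs4 import BeautifulSoup

url = ('http://tidesandcurrents.noaa.gov/api/datagetter?product=water_level'
       '&application=NOS.COOPS.TAC.WL&begin_date=20131214&end_date=20131216'
       '&datum=MSL&station=8735180&time_zone=GMT&units=english&interval='
       '&format=xml')

obs_soup = BeautifulSoup(urllib2.urlopen(url), 'xml')  # lxml's XML parser

obs = [wl['v'] for wl in obs_soup.find_all('wl')]  # find_all, not findall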

Related

Parsing XML with undeclared prefixes in Python

I am trying to parse XML data with Python that uses prefixes, but not every file has the declaration of the prefix. Example XML:
<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>
I have been using xml.etree.ElementTree to parse these files, but whenever the prefix is not properly declared, ElementTree throws a parse error (unbound prefix, right at the start of <abc:thing2>).
Searching for this error leads me to solutions that suggest I fix the namespace declaration. However, I do not control the XML that I need to work with, so modifying the input files is not a viable option.
Searching for namespace parsing in general leads me to many questions about searching in namespace-agnostic way, which is not what I need.
I am looking for some way to automatically parse these files, even if the namespace declaration is broken. I have thought about doing the following:
tell ElementTree what namespaces to expect beforehand, because I do know which ones can occur. I found register_namespace, but that does not seem to work.
have the full DTD read in before parsing, and see if that solves it. I could not find a way to do this with ElementTree.
tell ElementTree to not bother about namespaces at all. It should not cause issues with my data, but I found no way to do this.
use some other parsing library that can handle this issue, though I prefer to avoid installing extra libraries. I have difficulty seeing from the documentation whether any others would be able to solve my issue.
some other route that I am currently not seeing?
UPDATE:
After Har07 put me on the path of lxml, I tried to see if this would let me perform the different solutions I had thought of, and what the result would be:
telling the parser what namespaces to expect beforehand: I still could not find any 'official' way to do this, but in my earlier searches I had found the suggestion to simply add the requisite declaration to the data programmatically (for a different programming situation; unfortunately I can't find the link anymore). It seemed terribly hacky to me, but I tried it anyway. It involves loading the data as a string, changing the enclosing element to have the right xmlns declarations, and then handing it off to lxml.etree's fromstring method. Unfortunately, that also requires removing any encoding declaration from the string. It works, though (see the sketch after this list).
Reading in the DTD before parsing: it is possible with lxml (through attribute_defaults, dtd_validation, or load_dtd), but unfortunately it does not solve the namespace issue.
Telling lxml not to bother about namespaces: possible through the recover option. Unfortunately, that also ignores other ways in which the XML may be broken (see Har07's answer for details)
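For reference, here is a minimal sketch of the declaration-injection hack from the first point above; the namespace URI is a made-up placeholder, since the real one depends on what the abc: prefix is supposed to mean:
from lxml import etree

# Broken input: the abc: prefix is never declared.
broken = """<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>"""

# Inject an xmlns declaration for the known prefix into the root tag.
fixed = broken.replace('<item ', '<item xmlns:abc="urn:placeholder:abc" ', 1)

root = etree.fromstring(fixed)  # parses cleanly now
print(etree.tostring(root, pretty_print=True))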
One possible way is to use an ElementTree-compatible library, lxml. For example:
from lxml import etree as ElementTree
xml = """<?xml version="1.0" encoding="UTF-8"?>
<item subtype="bla">
<thing>Word</thing>
<abc:thing2>Another Word</abc:thing2>
</item>"""
parser = ElementTree.XMLParser(recover=True)
tree = ElementTree.fromstring(xml, parser)
thing = tree.xpath("//thing")[0]
print(ElementTree.tostring(thing))
All you need to do to parse non-well-formed XML with lxml is pass recover=True to the XMLParser constructor. lxml also has full support for XPath 1.0, which is very useful when you need to get part of an XML document using more complex criteria.
UPDATE:
I don't know all the types of XML error that the recover=True option can tolerate, but here is another type of error that I know of besides an unbound namespace prefix: an unclosed tag. lxml will fix, rather than ignore, an unclosed tag by adding the corresponding closing tag automatically. For example, given the following broken XML:
xml = """<item subtype="bla">
<thing>Word</thing>
<bad>
<abc:thing2>Another Word</abc:thing2>
</item>"""
parser = ElementTree.XMLParser(recover=True)
tree = ElementTree.fromstring(xml, parser)
print(ElementTree.tostring(tree))
The final XML, after being parsed by lxml, is as follows:
<item subtype="bla">
<thing>Word</thing>
<bad>
<abc:thing2>Another Word</abc:thing2>
</bad></item>

Parsing XML with namespaces using ElementTree in Python

I have an XML document; a small part of it looks like this:
<?xml version="1.0" ?>
<i:insert xmlns:i="urn:com:xml:insert" xmlns="urn:com:xml:data">
<data>
<image imageId="1"></image>
<content>Content</content>
</data>
</i:insert>
When I parse it using ElementTree and save it to a file, I see the following:
<ns0:insert xmlns:ns0="urn:com:xml:insert" xmlns:ns1="urn:com:xml:data">
<ns1:data>
<ns1:image imageId="1"></ns1:image>
<ns1:content>Content</ns1:content>
</ns1:data>
</ns0:insert>
Why does it change prefixes and put them everywhere? Using minidom I don't have this problem. Is it configurable? The documentation for ElementTree is very poor.
The problem is that I can't find any node after such parsing, for example image: I can't find it with or without a namespace, whether I write it as {namespace}image or just image. Why is that? Any suggestions are strongly appreciated.
What I already tried:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
for a in root.findall('ns1:image'):
    print a.attrib
This returns an error and the other one returns nothing:
for a in root.findall('{urn:com:xml:data}image'):
    print a.attrib
I also tried building a namespace mapping like this and using it:
namespaces = {'ns1': 'urn:com:xml:data'}
for a in root.findall('ns1:image', namespaces):
    print a.attrib
It returns nothing. What am I doing wrong?
This snippet from your question,
for a in root.findall('{urn:com:xml:data}image'):
    print a.attrib
does not output anything because it only looks for direct {urn:com:xml:data}image children of the root of the tree.
This slightly modified code,
for a in root.findall('.//{urn:com:xml:data}image'):
    print a.attrib
will print {'imageId': '1'} because it uses .//, which selects matching subelements on all levels.
Reference: https://docs.python.org/2/library/xml.etree.elementtree.html#supported-xpath-syntax.
It is a bit annoying that ElementTree does not just retain the original namespace prefixes by default, but keep in mind that it is not the prefixes that matter anyway. The register_namespace() function can be used to set the wanted prefix when serializing the XML. The function does not have any effect on parsing or searching.
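For example, to keep the prefixes from the question's document when writing it back out (a minimal sketch; register_namespace must be called before serializing):
import xml.etree.ElementTree as ET

# Re-register the prefixes used in the original document:
ET.register_namespace('i', 'urn:com:xml:insert')
ET.register_namespace('', 'urn:com:xml:data')  # '' sets the default namespace

tree = ET.parse('test.xml')
tree.write('out.xml')  # written with i: and the default namespace, not ns0:/ns1: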
From what I gather, it has something to do with the namespace recognition in ET.
From http://effbot.org/zone/element-namespaces.htm:
When you save an Element tree to XML, the standard Element serializer generates unique prefixes for all URI:s that appear in the tree. The prefixes usually have the form “ns” followed by a number. For example, the above elements might be serialized with the prefix ns0 for “http://www.w3.org/1999/xhtml” and ns1 for “http://effbot.org/namespace/letters”.
If you want to use specific prefixes, you can add prefix/uri mappings to a global table in the ElementTree module. In 1.3 and later, you do this by calling the register_namespace function. In earlier versions, you can access the internal table directly:
ElementTree 1.3
ET.register_namespace(prefix, uri)
ElementTree 1.2 (Python 2.5)
ET._namespace_map[uri] = prefix
Note the argument order; the function takes the prefix first, while the raw dictionary maps from URI:s to prefixes.

Navigating in html by xpath in python

So, I am accessing some url that is formatted something like the following:
<DOCUMENT>
<TYPE>A
<SEQUENCE>1
<TEXT>
<HTML>
<BODY BGCOLOR="#FFFFFF" LINK=BLUE VLINK=PURPLE>
</BODY>
</HTML>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>B
<SEQUENCE>2
...
As you can see, it starts a document (which is sequence number 1), then finishes the document, then the document with sequence 2 starts, and so on.
So, what I want to do is write an XPath expression in Python to just get the document with sequence value 1 (or, equivalently, TYPE A).
I supposed that something like this would work:
import lxml
from lxml import html
page = html.fromstring(pagehtml)
type_a = page.xpath("//document[sequence=1]/descendant::*/text()")
However, it just gives me an empty list for the type_a variable.
Could someone please let me know what my mistake is in this code? I am really new to this XML stuff.
It might be because that's highly dubious HTML. The <SEQUENCE> tag is unclosed, so it could well be interpreted by lxml as containing all of the code until the next </DOCUMENT>, so it does not end up just containing the 1. When your XPath code then looks for a <SEQUENCE> containing 1, there isn't one.
Additionally, XML is case-sensitive, but HTML isn't. XPath is designed for XML, so it is also case sensitive, which would also stop your document matching <DOCUMENT>.
Try //DOCUMENT[starts-with(SEQUENCE,'1')]. That uses XPath's starts-with function.
Ideally, if the input is under your control, you should instead just close the type and sequence tags (with </TYPE> and </SEQUENCE>) to make the input valid.
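To see how lxml actually arranges this markup, you can round-trip a trimmed copy of the snippet (a minimal sketch; the nesting it prints is what the XPath expressions below run against):
from lxml import html

snippet = """<DOCUMENT>
<TYPE>A
<SEQUENCE>1
<TEXT></TEXT>
</DOCUMENT>"""

page = html.fromstring(snippet)
# Tag names come back lowercased, and the unclosed <TYPE> and <SEQUENCE>
# swallow everything up to </DOCUMENT>:
print(html.tostring(page, pretty_print=True))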
I'd like to point out that, apart from the great answer provided by @GKFX, the lxml.html module is capable of parsing broken HTML or a fragment of HTML. In fact, it will parse your string just fine and handle it well.
fromstring(string): Returns document_fromstring or
fragment_fromstring, based on whether the string looks like a full
document, or just a fragment.
The problem, which perhaps comes from your other code generating the string, also lies in the fact that you haven't given the true path to access the SEQUENCE node.
type_a = page.xpath("//document[sequence=1]/descendant::*/text()")
Your XPath above tries to find all document nodes with a direct child node called sequence whose value is 1; however, your document's first child node is type, not sequence, so you will never get what you want.
Rewriting it to this will get what you need:
page.xpath('//document[type/sequence=1]/descendant::*/text()')
['A\n ', '1\n ']
Since your HTML string is missing the closing tag for sequence, you cannot, however, get the correct result with another XPath like this:
page.xpath('//document[type/sequence=1]/../..//text()')
['A\n ', '1\n ', 'B\n ', '2']
That is because your sequence=1 has no closing tag, so sequence=2 becomes a child node of it.
I should point out that your HTML string is still invalid, but lxml's tolerant parser can handle your case just fine.
Try using a relative path, explicitly specifying the correct path to your element (not skipping type):
page.xpath("//document[./type/sequence = 1]")
See: http://pastebin.com/ezQXtKcr
Output:
Trying original post (novice_007): //document[sequence=1]/descendant::*/text()
[]
Using GKFX's answer: //DOCUMENT[starts-with(SEQUENCE,'1')]
[]
My answer: //document[./type/sequence = 1]
[<Element document at 0x1bfcb30>]
Currently, the XPath I provided is the only one that ... to just get the document with sequence value 1

Parse xml from file using etree works when reading string, but not a file

I am a relative newbie to Python and SO. I have an XML file from which I need to extract information. I've been struggling with this for several days, but I think I finally found something that will extract the information properly. Now I'm having trouble getting the right output. Here is my code:
from xml.etree import ElementTree as etree
node = etree.fromstring('<dataObject><identifier>5e1882d882ec530069d6d29e28944396</identifier><description>This is a paragraph about a shark.</description></dataObject>')
identifier = node.findtext('identifier')
description = node.findtext('description')
print identifier, description
The result that I get is "5e1882d882ec530069d6d29e28944396 This is a paragraph about a shark.", which is what I want.
However, what I really need is to be able to read from a file instead of a string. So I try this code:
from xml.etree import ElementTree as etree
node = etree.parse('test3.xml')
identifier = node.findtext('identifier')
description = node.findtext('description')
print identifier, description
Now my result is "None None". I have a feeling I'm either not reading the file in correctly or something is wrong with the output. Here are the contents of test3.xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response xmlns="http://www.eol.org/transfer/content/0.3" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dwc="http://rs.tdwg.org/dwc/dwcore/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:dwct="http://rs.tdwg.org/dwc/terms/" xsi:schemaLocation="http://www.eol.org/transfer/content/0.3 http://services.eol.org/schema/content_0_3.xsd">
<identifier>5e1882d822ec530069d6d29e28944369</identifier>
<description>This is a paragraph about a shark.</description>
</response>
Your XML file uses a default namespace. You need to qualify your searches with the correct namespace:
identifier = node.findtext('{http://www.eol.org/transfer/content/0.3}identifier')
for ElementTree to match the correct elements.
You could also give the .find(), findall() and iterfind() methods an explicit namespace dictionary. This is not documented very well:
namespaces = {'eol': 'http://www.eol.org/transfer/content/0.3'} # add more as needed
root.findall('eol:identifier', namespaces=namespaces)
Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the eol: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.eol.org/transfer/content/0.3}identifier instead.
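Applied to the test3.xml from the question, a minimal sketch would be:
from xml.etree import ElementTree as etree

namespaces = {'eol': 'http://www.eol.org/transfer/content/0.3'}

node = etree.parse('test3.xml')
identifier = node.findtext('eol:identifier', namespaces=namespaces)
description = node.findtext('eol:description', namespaces=namespaces)
print identifier, description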
If you can switch to the lxml library things are better; that library supports the same ElementTree API, but collects namespaces for you in a .nsmap attribute on elements.
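With lxml you could, for example, read the mapping straight off the root element instead of hard-coding the URL (a minimal sketch; note that the default namespace appears under the None key and needs a real prefix before searching):
from lxml import etree

tree = etree.parse('test3.xml')
root = tree.getroot()
print root.nsmap  # e.g. {None: 'http://www.eol.org/transfer/content/0.3', ...}

# The None key cannot be used in searches; give it a prefix of your choosing:
namespaces = {'eol': root.nsmap[None]}
print root.findtext('eol:identifier', namespaces=namespaces)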
Have you thought of trying BeautifulSoup to parse your XML with Python?
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing%20XML
There is some good documentation and a healthy online community, so support is quite good.

How to retrieve attributes of xml tag in Python?

I'm looking for a way to add attributes to XML tags in Python, or to create a new tag with new attributes.
for example, I have the following xml file:
<types name='character' shortName='chrs'>
....
...
</types>
and I want to add an attribute to make it look like this:
<types name='character' shortName='chrs' fullName='MayaCharacters'>
....
...
</types>
How do I do that with Python? By the way, I use Python and minidom for this. Please help; thanks in advance.
You can use the attributes property of the respective Node object.
For example:
from xml.dom.minidom import parseString
documentNode = parseString("<types name='character' shortName='chrs'></types>")
typesNode = documentNode.firstChild
# Getting an attribute
print typesNode.attributes["name"].value # will print "character"
# Setting an attribute
typesNode.attributes["mynewattribute"] = u"mynewvalue"
print documentNode.toprettyxml()
The last print statement will output this XML document:
<?xml version="1.0" ?>
<types mynewattribute="mynewvalue" name="character" shortName="chrs"/>
I learned XML parsing in Python from this great article. It has an attributes section, which can't be linked to directly, but just search for "Attributes" on that page and you'll find it; it holds the information you need.
But in short (snippet stolen from said page):
>>> building_element.setAttributeNS("http://www.boddie.org.uk/paul/business", "business:name", "Ivory Tower")
>>> building_element.getAttributeNS("http://www.boddie.org.uk/paul/business", "name")
'Ivory Tower'
You probably want to skip the handling of namespaces, to make the code cleaner, unless you need them.
It would seem that you can just call setAttribute on the parsed DOM objects.
http://developer.taboca.com/cases/en/creating_a_new_xml_document_with_python/
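For the question's example, a minimal sketch along those lines (minidom, as the question uses; parsing from a string here rather than a file):
from xml.dom.minidom import parseString

doc = parseString("<types name='character' shortName='chrs'></types>")
types = doc.documentElement

# setAttribute creates the attribute if missing and overwrites it if present:
types.setAttribute('fullName', 'MayaCharacters')
print doc.toprettyxml()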
