I have been using Python recently and I want to extract information from a given XML file. The problem is that the information is stored really badly, in a format like this:
<Content>
<tags>
....
</tags>
<![CDATA["string1"; "string2"; ....
]]>
</Content>
I cannot post the entire data here, since it is about 20,000 lines.
I just want to receive the list containing ["string1", "string2", ...], and this is the code I have been using so far:
import xml.etree.ElementTree as ET
tree = ET.parse(xmlfile)
for node in tree.iter('Content'):
    print(node.text)
However, my output is None. How can I retrieve the comment data? (Again, I am using Python.)
You'll want to create a SAX-based parser instead of a DOM-based parser, especially with a document as large as yours.
A SAX-based parser requires you to write your own control logic for how data is stored. It's more complicated than simply loading the document into a DOM, but it needs far less memory because it streams through the document as a sequence of events instead of loading it all at once. That also lets it deal with squirrelly cases like yours, where the data sits in a comment-like section.
When you build your handler, you'll probably want to register a LexicalHandler with your parser to pull out those sections.
I'd give you a working example of how to build one, but it's been a long time since I've done it myself. There are plenty of guides on how to build a SAX-based parser online, so I'll defer that discussion to another thread.
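A rough sketch of what such a handler might look like with Python's built-in xml.sax, assuming the data sits in a CDATA section as in the question. The class name CDataCollector, the function extract_cdata_strings and the sample document are made up for illustration. The handler is registered through the parser's lexical-handler property so that it is notified when a CDATA section starts and ends:

```python
import io
import re
import xml.sax
from xml.sax.handler import ContentHandler, property_lexical_handler


class CDataCollector(ContentHandler):
    """Keeps only the character data found inside CDATA sections."""

    def __init__(self):
        super().__init__()
        self.in_cdata = False
        self.chunks = []

    # ContentHandler callback; text may arrive in several pieces.
    def characters(self, content):
        if self.in_cdata:
            self.chunks.append(content)

    # Lexical-handler callbacks.
    def startCDATA(self):
        self.in_cdata = True

    def endCDATA(self):
        self.in_cdata = False

    def comment(self, content):
        pass  # real <!-- ... --> comments are ignored in this sketch

    def startDTD(self, name, public_id, system_id):
        pass

    def endDTD(self):
        pass


def extract_cdata_strings(xml_bytes):
    handler = CDataCollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.setProperty(property_lexical_handler, handler)
    parser.parse(io.BytesIO(xml_bytes))
    # Pull the double-quoted items out of the collected text.
    return re.findall(r'"([^"]*)"', "".join(handler.chunks))


# Hypothetical sample shaped like the document in the question.
sample = b"""<Content>
<tags>
ignored
</tags>
<![CDATA["string1"; "string2";
]]>
</Content>"""
print(extract_cdata_strings(sample))
```

For a real file, the same parser.parse() call should also accept a filename instead of the in-memory stream used here.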
The problem is that your comment does not seem to be standard: <![CDATA[...]]> is a CDATA section, whereas a standard comment looks like <!--Comment here-->.
Standard comments can be parsed with BeautifulSoup, for example:
from bs4 import BeautifulSoup, Comment
xml = """<Content>
<tags>
...
</tags>
<!--[CDATA["string1"; "string2"; ....]]-->
</Content>"""
soup = BeautifulSoup(xml, "html.parser")  # name the parser explicitly to avoid a warning
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
print(comments)
This returns ['[CDATA["string1"; "string2"; ....]]'], from which it is easy to parse out the required strings.
If you have non-standard comments, I would recommend a regular expression like:
import re
xml = """<Content>
<tags>
asd
</tags>
<![CDATA["string1"; "string2"; ....]]>
</Content>"""
for i in re.findall("<!.+>", xml):
    for j in re.findall('\".+\"', i):
        print(j)
This returns: "string1"; "string2"
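Because <![CDATA[...]]> is a CDATA section rather than a comment, ElementTree also exposes it as ordinary text. In the layout from the question the section follows the closing </tags>, so the text lands in that child's tail attribute rather than in the text of Content. A minimal sketch under that assumption (the sample document and the helper name strings_after_children are illustrative only):

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the document in the question.
SAMPLE = """<Content>
<tags>
ignored
</tags>
<![CDATA["string1"; "string2";
]]>
</Content>"""


def strings_after_children(content):
    # Text that follows a child element is stored in child.tail.
    tails = "".join(child.tail or "" for child in content)
    # Pull out everything between double quotes.
    return re.findall(r'"([^"]*)"', tails)


root = ET.fromstring(SAMPLE)  # root is the <Content> element
print(strings_after_children(root))
```

With a file on disk, the same helper could be applied to each node yielded by tree.iter('Content').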
Since Python 3.8, ElementTree can keep comments in the tree if you build it with TreeBuilder(insert_comments=True).
Sample code to read the attributes, value, tag and comment of each node:
import xml.etree.ElementTree as ET
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))  # Python 3.8+
tree = ET.parse(infile_path, parser)  # infile_path: path to your XML file
COMMENT = ""
TAG = ""
NAME = ""
VALUE = ""
for node in tree.iter():
    if node.tag is ET.Comment:
        # comment node: remember its text for the elements that follow
        COMMENT = node.text
    else:
        TAG = node.tag  # tag name (string)
        NAME = node.attrib.get("name")  # value of the "name" attribute, if any
        VALUE = node.text  # element text
        print(TAG, NAME, VALUE, COMMENT)
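A self-contained variant of the same idea, parsing from a string so it can be tried without a file. The sample XML and the function name rows_with_comments are invented for illustration, and Python 3.8 or newer is assumed:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with one standard comment.
SAMPLE = '<root><!-- first note --><item name="a">1</item></root>'


def rows_with_comments(xml_text):
    # insert_comments=True keeps comment nodes in the tree (Python 3.8+).
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xml_text, parser=parser)
    rows = []
    last_comment = None
    for node in root.iter():
        if node.tag is ET.Comment:
            last_comment = node.text.strip()
        else:
            rows.append((node.tag, node.attrib.get("name"), node.text, last_comment))
    return rows


for row in rows_with_comments(SAMPLE):
    print(row)
```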
I'm trying to parse some old SGML code using BeautifulSoup4 and build an Element Tree with the data. It's mostly working fine, but some of the tags that should be self-closing aren't marked as such. For example:
<element1>
<element2 attr="0">
<element3>Data</element3>
</element1>
When I parse the data, it ends up like:
<element1>
<element2 attr="0">
<element3>Data</element3>
</element2>
</element1>
What I'd like is for it to assume that if it doesn't find a closing tag for such elements, it should treat it as self-closing tag instead of assuming that everything after it is a child and putting the closing tag as late as possible, like so:
<element1>
<element2 attr="0"/>
<element3>Data</element3>
</element1>
Can anyone point me to a parser that could do this, or some way to modify an existing one to act this way? I've dug through a few parsers (lxml, lxml-xml, html5lib) but I can't figure out how to get these results.
What I ended up doing was extracting all empty elements where the end tag can be omitted from the DTD (eg. <!ELEMENT elem_name - o EMPTY >), creating a list from those elements, then using regex to close all the tags in the list. The resulting text is then passed to the XML parser.
Here's a boiled down version of what I'm doing:
import re
from lxml.html import soupparser
from lxml import etree as ET
empty_tags = ['elem1', 'elem2', 'elem3']
markup = """
<elem1 attr="some value">
<elem2/>
<elem3></elem3>
"""
for t in empty_tags:
    # self-close <t ...> and collapse an empty <t></t> pair; tags already written as <t/> do not match
    markup = re.sub(r'(<{0}(?:\s+[^>/]*)?)>\s*(?:</{0}>)?\n?'.format(t), r'\1/>\n', markup)
tree = soupparser.fromstring(markup)
print(ET.tostring(tree, pretty_print=True).decode("utf-8"))
The output should be:
<elem1 attr="some value"/>
<elem2/>
<elem3/>
(This will actually be enclosed in <html> tags, but the parser adds those in.)
It will leave attributes alone, and won't touch tags that are already self-closed. If the tag has a closing tag, but is empty, it will remove the closing tag and self-close the tag instead, just so it's standardized.
It's not a very generic solution but, as far as I can tell, there's no other way to do this without knowing which tags should be closed. Even OpenSP needs the DTD to know which tags it should be closing.
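The first step described above, collecting the EMPTY elements from the DTD, might be sketched like this. It assumes declarations of the simple form <!ELEMENT name - o EMPTY > with one element name per declaration (no name groups); the function name empty_elements and the sample DTD text are hypothetical:

```python
import re

# Matches declarations such as: <!ELEMENT elem_name - o EMPTY >
EMPTY_DECL = re.compile(r'<!ELEMENT\s+([\w.\-]+)\s+-\s+o\s+EMPTY\s*>', re.IGNORECASE)


def empty_elements(dtd_text):
    """Return the names of elements declared EMPTY with an omissible end tag."""
    return EMPTY_DECL.findall(dtd_text)


# Hypothetical DTD fragment for illustration.
dtd = """
<!ELEMENT elem1 - o EMPTY >
<!ELEMENT para  - - (#PCDATA) >
<!ELEMENT elem2 - O EMPTY>
"""
print(empty_elements(dtd))
```

The resulting list can then play the role of empty_tags in the snippet above.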
I'm new to Python 2.7, and I'm trying to parse an XML file that contains HTML. I want to parse the custom XML tags without parsing any HTML content whatsoever. What's the best way to do this? (If it's helpful, my list of custom XML tags is small, so if there's an XML parser that has an option to only parse specified tags that would probably work fine.)
E.g. I have an XML file that looks like
<myTag1 myAttrib="value">
<myTag2>
<p>My what a lovely day.</p>
</myTag2>
</myTag1>
I'd like to be able to parse apart everything except the HTML, and in particular to extract the value of myTag2 as un-parsed HTML.
EDIT: Here's more information to answer a question below. I had previously tried using ElementTree. This is what happened:
root = ET.fromstring(xmlstring)
root.tag # returns 'myTag1'
root[0].tag # returns 'myTag2'
root[0].text # returns None, but I want it to return the HTML string
The HTML string I want has been parsed and is stored as a tag and text:
root[0][0].tag # returns 'p', but I don't even want root[0][0] to exist
root[0][0].text # returns 'My ... day.'
But really I'd like to be able to do something like this...
root[0].unparsedtext # returns '<p>My ... day.</p>'
SOLUTION:
har07's answer works great. I modified that code slightly to account for an edge case. Here's what I'm implementing:
def _getInner(element):
    # element.text is None when the element starts directly with a child tag
    if element.text is None:
        textStr = ''
    else:
        textStr = element.text
    return textStr + ''.join(ET.tostring(e) for e in element)
Then if
element = ET.fromstring('<myTag>Let us be <b>gratuitous</b> with tags</myTag>')
the original code will only return the text starting with the first XML-formatted tag, but the modified version will capture the desired text:
''.join(ET.tostring(e) for e in element) # returns '<b>gratuitous</b> with tags'
_getInner(element) # returns 'Let us be <b>gratuitous</b> with tags'
I don't think there is an easy way to modify an XML parser's behavior so that it ignores some predefined tags. A much easier way is to let the parser parse the XML normally, then write a function that returns the unparsed content of an element, for example:
import xml.etree.ElementTree as ET
def getUnparsedContent(element):
    # serialize every child element back to markup and concatenate
    return ''.join(ET.tostring(e) for e in element)
xmlstring = """<myTag1 myAttrib="value">
<myTag2>
<p>My what a lovely day.</p>
</myTag2>
</myTag1>"""
root = ET.fromstring(xmlstring)
print(getUnparsedContent(root[0]))
output :
<p>My what a lovely day.</p>
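On Python 3, ET.tostring() returns bytes by default, so joining its results with a str fails. A minimal Python 3 sketch of the same idea, assuming encoding="unicode" is acceptable for your data; the name inner_xml is made up here, and the leading-text handling follows the edge case described in the question:

```python
import xml.etree.ElementTree as ET


def inner_xml(element):
    """Return the content of element as markup, without its own start and end tags."""
    parts = [element.text or ""]  # text before the first child, if any
    for child in element:
        # encoding="unicode" makes tostring() return str; the child's tail is included
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


element = ET.fromstring('<myTag>Let us be <b>gratuitous</b> with tags</myTag>')
print(inner_xml(element))  # Let us be <b>gratuitous</b> with tags
```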
You should be able to implement this through the built-in minidom xml parser.
from xml.dom import minidom
xmldoc = minidom.parse("document.xml")
rootNode = xmldoc.firstChild
firstNode = rootNode.childNodes[0]
In your example case, once whitespace-only text nodes are skipped and you descend into myTag2, the node you reach would serialize (via toxml()) as:
<p>My what a lovely day.</p>
Note that minidom (and probably any other xml-parsing library you might use) won't recognize HTML by default. This is by design, because XML documents do not have predefined tags.
You could then use a series of if or try statements to determine whether you have reached an HTML-formatted node while extracting data:
for rowNode in rootNode.childNodes:
    if "<p>" in rowNode.toxml():
        # this is an HTML-formatted node: extract the value and continue
        html = rowNode.toxml()
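A small runnable sketch of the minidom route, assuming the document from the question. It looks the element up by name instead of by position, because the line breaks in the file show up as whitespace-only text nodes. The helper name inner_markup is hypothetical:

```python
from xml.dom import minidom

XML = """<myTag1 myAttrib="value">
<myTag2>
<p>My what a lovely day.</p>
</myTag2>
</myTag1>"""


def inner_markup(xml_text, tag_name):
    doc = minidom.parseString(xml_text)
    element = doc.getElementsByTagName(tag_name)[0]
    # Serialize every child (text nodes included) and trim the surrounding whitespace.
    return "".join(child.toxml() for child in element.childNodes).strip()


print(inner_markup(XML, "myTag2"))  # <p>My what a lovely day.</p>
```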
I am a relative newbie to Python and SO. I have an XML file from which I need to extract information. I've been struggling with this for several days, but I think I finally found something that will extract the information properly. Now I'm having trouble getting the right output. Here is my code:
from xml.etree import ElementTree as etree
node = etree.fromstring('<dataObject><identifier>5e1882d882ec530069d6d29e28944396</identifier><description>This is a paragraph about a shark.</description></dataObject>')
identifier = node.findtext('identifier')
description = node.findtext('description')
print(identifier, description)
The result that I get is "5e1882d882ec530069d6d29e28944396 This is a paragraph about a shark.", which is what I want.
However, what I really need is to be able to read from a file instead of a string. So I tried this code:
from xml.etree import ElementTree as etree
node = etree.parse('test3.xml')
identifier = node.findtext('identifier')
description = node.findtext('description')
print(identifier, description)
Now my result is "None None". I have a feeling I'm either not reading the file in correctly or something is wrong with the output. Here are the contents of test3.xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response xmlns="http://www.eol.org/transfer/content/0.3" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dwc="http://rs.tdwg.org/dwc/dwcore/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:dwct="http://rs.tdwg.org/dwc/terms/" xsi:schemaLocation="http://www.eol.org/transfer/content/0.3 http://services.eol.org/schema/content_0_3.xsd">
<identifier>5e1882d822ec530069d6d29e28944369</identifier>
<description>This is a paragraph about a shark.</description>
Your XML file uses a default namespace. You need to qualify your searches with the correct namespace:
identifier = node.findtext('{http://www.eol.org/transfer/content/0.3}identifier')
for ElementTree to match the correct elements.
You could also give the .find(), findall() and iterfind() methods an explicit namespace dictionary. This is not documented very well:
namespaces = {'eol': 'http://www.eol.org/transfer/content/0.3'} # add more as needed
root.findall('eol:identifier', namespaces=namespaces)
Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the eol: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.eol.org/transfer/content/0.3}identifier instead.
If you can switch to the lxml library things are better; that library supports the same ElementTree API, but collects namespaces for you in a .nsmap attribute on elements.
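Both forms can be tried with a short, self-contained sketch. The XML below is a trimmed, hypothetical version of the file from the question (only the default namespace is kept and the identifier is shortened), and the eol prefix is an arbitrary choice:

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for test3.xml; only the default namespace is kept.
XML = """<response xmlns="http://www.eol.org/transfer/content/0.3">
<identifier>5e18</identifier>
<description>This is a paragraph about a shark.</description>
</response>"""

NS = {"eol": "http://www.eol.org/transfer/content/0.3"}  # the prefix is our own choice

root = ET.fromstring(XML)

# 1) Fully qualified name in {namespace}tag form.
identifier = root.findtext("{http://www.eol.org/transfer/content/0.3}identifier")

# 2) Prefix resolved through the namespaces mapping.
description = root.findtext("eol:description", namespaces=NS)

print(identifier, description)
```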
Have you thought of trying BeautifulSoup to parse your XML with Python?
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing%20XML
There is some good documentation and a healthy online group, so support is quite good.
I have a corpus with tens of thousands of small XML files, and I'm trying to use Python to extract the text contained in one of the XML tags, for example, everything between the body tags for something like:
<body> sample text here with <bold> nested </bold> tags in this paragraph </body>
and then write a text document that contains this string, and move on down the list of XML files.
I'm using effbot's ElementTree but couldn't find the right commands/syntax to do this. I found a website that uses minidom's dom.getElementsByTagName, but I'm not sure what the corresponding method is for ElementTree. Any ideas would be greatly appreciated.
A better answer, showing how to actually use XML parsing to do this:
import xml.etree.ElementTree as ET
stringofxml = "<body> sample text here with <bold> nested </bold> tags in this paragraph </body>"
def extractTextFromElement(elementName, stringofxml):
    tree = ET.fromstring(stringofxml)
    # look only at the direct children of the root element
    for child in tree:
        if child.tag == elementName:
            return child.text.strip()
print(extractTextFromElement('bold', stringofxml))
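If the goal is the full text of the body element, including the text inside nested tags, ElementTree's itertext() may be closer to what the question asks for. A minimal sketch; the helper name element_text and the whitespace normalization are choices made here, not part of the answer above:

```python
import xml.etree.ElementTree as ET

stringofxml = "<body> sample text here with <bold> nested </bold> tags in this paragraph </body>"


def element_text(xml_string, tag):
    root = ET.fromstring(xml_string)
    # The wanted element may be the root itself or a descendant.
    elem = root if root.tag == tag else root.find(".//" + tag)
    if elem is None:
        return None
    # itertext() yields the element's text and the text of all nested elements, in order.
    text = "".join(elem.itertext())
    return " ".join(text.split())  # collapse runs of whitespace


print(element_text(stringofxml, "body"))
```

For the corpus described in the question, the same function could be called once per file and its result written to a text document.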
I would just use re:
import re
body_txt = re.match('<body>(.*)</body>',body_txt).groups()[0]
then to remove the inner tags:
body_txt = re.sub('<.*?>','',body_txt)
You shouldn't use regexp when they are not needed, it's true... but there's nothing wrong with using them when they are.
I want to get the whole text of an Element to parse some xhtml:
<div id='asd'>
<pre>skdsk</pre>
</div>
Given E = the div element in the example above, I want to get
<pre>skdsk</pre>
How?
Strictly speaking:
from xml.dom.minidom import parse, parseString
tree = parseString("<div id='asd'><pre>skdsk</pre></div>")
root = tree.firstChild
node = root.childNodes[0]
print(node.toxml())
In practice, though, I'd recommend looking at the http://www.crummy.com/software/BeautifulSoup/ library. Finding the right childNode in an XHTML document and skipping "whitespace nodes" is a pain. BeautifulSoup is a robust HTML/XHTML parser with fantastic tree-search capabilities.
Edit: The example above compresses the HTML into one string. If you use the HTML as in the question, the line breaks and so-forth will generate "whitespace" nodes, so the node you want won't be at childNodes[0].
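One way to cope with those whitespace nodes in plain minidom is to keep only the element children before indexing. A small sketch using the markup exactly as formatted in the question (the variable names are illustrative):

```python
from xml.dom.minidom import parseString

XHTML = """<div id='asd'>
<pre>skdsk</pre>
</div>"""

root = parseString(XHTML).documentElement  # the <div> element

# Keep element children only; whitespace between tags shows up as text nodes.
elements = [n for n in root.childNodes if n.nodeType == n.ELEMENT_NODE]

print(elements[0].toxml())  # <pre>skdsk</pre>
```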