The input file is actually multiple XML files appended into one file (sourced from Google Patents). This is an example:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.2 2006-08-23">
<applicants>
<applicant sequence="001" app-type="applicant-inventor" designation="us-only">
<addressbook><last-name>Beyer</last-name>
<first-name>Daniel Lee</first-name>
<address><city>Franklin</city>
<state>TN</state>
<country>US</country></address></addressbook>
<nationality><country>omitted</country></nationality>
<residence><country>US</country></residence>
</applicant>
<applicant sequence="002" app-type="applicant-inventor" designation="us-only">
<addressbook><last-name>Friedland</last-name>
<first-name>Jason Michael</first-name>
<address><city>Franklin</city>
<state>TN</state>
<country>US</country></address></addressbook>
<nationality><country>omitted</country></nationality>
<residence><country>US</country></residence>
</applicant>
</applicants>
</us-patent-grant>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
I'm trying to build a string by applying "-".join to an XPath result that covers all of the children and grandchildren within <applicant>, using the following in Python with lxml:
import urllib2, os, zipfile
from lxml import etree
count = 0
for item in xmlSplitter(zf.open(xml_file)):
    count += 1
    if count > 1: break
    doc = etree.XML(item)
    docID = "-".join(doc.xpath('//publication-reference/document-id/*/text()'))
    title = first(doc.xpath('//invention-title/text()'))
    applicant = "-".join(doc.xpath('//applicants/applicant/*/text()'))
    print "DocID: {0}\nTitle: {1}\nApplicant: {2}\n".format(docID,title,applicant)
    outFile.write(str(docID) +"|"+ str(title) +"|"+ str(applicant) +"\n")
I've tried multiple XPath combinations, but I can't produce a hyphen-joined string for <applicants>, and while //text() can get to the grandchildren, it doesn't help with the stringing. What is the appropriate XPath syntax to select all text within the children and grandchildren of <applicant> and still punch it out as a string? While not shown in this example, is there also a way to ignore Unicode characters that might be present at the beginning of a text line (I believe they appear in some of the later XML docs)? The 'applicant' output I'm hoping to get should look something like:
Beyer-Daniel Lee-Franklin-TN-US-omitted-US-Friedland-Jason Michael-Franklin-TN-US-omitted-US
This question is very similar to this other question of yours.
There are two problems here:
1. How to get from "non-standard XML" to "standard XML"?
2. How to use XPath to get text values of descendant elements and concatenate them?
You need to solve 1 before attacking 2. If you need help with that, ask a separate question.
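For what it's worth, problem 1 could be sketched roughly like this, assuming every embedded document starts with its own <?xml ...?> declaration at the beginning of a line (as in the sample). The function name split_concatenated_xml and the two-document sample are made up for illustration, and the standard library's ElementTree stands in for lxml:

```python
import xml.etree.ElementTree as ET


def split_concatenated_xml(text):
    """Yield one string per embedded document, split at each XML declaration."""
    current = []
    for line in text.splitlines():
        # A new declaration marks the start of the next document.
        if line.startswith("<?xml") and current:
            yield "\n".join(current)
            current = []
        current.append(line)
    if current:
        yield "\n".join(current)


# Hypothetical two-document input, shaped like the concatenated patent file.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<us-patent-grant><id>1</id></us-patent-grant>\n"
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<us-patent-grant><id>2</id></us-patent-grant>\n"
)

# Encode to bytes so the parser reads the encoding declaration itself.
roots = [ET.fromstring(doc.encode("utf-8")) for doc in split_concatenated_xml(SAMPLE)]
ids = [root.findtext("id") for root in roots]
print(ids)
```

Each piece is then an ordinary single-root document, so the usual parsing and XPath tools apply to it.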
"Non-standard XML" is the same as not XML at all. You can't parse it as XML, and you can't use XPath on it. But you have phrased the question in a way that makes it look like you are trying to do that anyway.
Assuming that your question is actually about working with "standard XML", how about using the same approach as in my answer to your other question?
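For problem 2, here is a minimal sketch under a few assumptions: the document has already been split into well-formed XML, the sample below is a trimmed copy of the question's data, and the standard library's ElementTree is used instead of lxml. The helper name join_applicant_text is invented for this example. With lxml, the XPath //applicants/applicant//text() should select the same descendant text nodes, but whitespace-only nodes would still need to be filtered out before joining.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<us-patent-grant>
<applicants>
<applicant sequence="001">
<addressbook><last-name>Beyer</last-name>
<first-name>Daniel Lee</first-name>
<address><city>Franklin</city>
<state>TN</state>
<country>US</country></address></addressbook>
<nationality><country>omitted</country></nationality>
<residence><country>US</country></residence>
</applicant>
</applicants>
</us-patent-grant>"""


def join_applicant_text(xml_string, sep="-"):
    """Join all non-blank descendant text of every <applicant> element."""
    root = ET.fromstring(xml_string)
    parts = []
    for applicant in root.iter("applicant"):
        # itertext() walks children, grandchildren, and deeper, in document order.
        for text in applicant.itertext():
            text = text.strip()
            if text:  # skip the whitespace between tags
                parts.append(text)
    return sep.join(parts)


applicant = join_applicant_text(SAMPLE)
print(applicant)
```

Stripping each text node is what keeps the newlines between tags from showing up as empty segments in the joined string.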
I'm trying to parse an xml file using Python through root.findall.
Basically my file looks like this - and I'm trying to access elements under "Level3".
Edit: @trincot already provided a solution, but now I've added a namespace to the sample data (xmlns="http://xyz.abc/forms"), which is causing the trouble. Why would adding 'xmlns=' cause the issue?
<?xml version="1.0" encoding="UTF-8"?>
<env:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://xyz.abc/forms" xmlns:abc="http://bus-message-envelope" xmlns:env="http://www.w3.org/2003/05/soap-envelope" abc:version="1-2">
<env:Header>
<abc:col1>col1Text</abc:col1>
<abc:col2>col2Text</abc:col2>
<abc:col3>col3Text</abc:col3>
</env:Header>
<env:Body>
<Level1>
<Level2 schemaVersion="1-1">
<Level3>
<cell1>cell1Text</cell1>
<cell2>cell2Text</cell2>
<cell3>cell3Text</cell3>
<cell4>cell4Text</cell4>
</Level3>
</Level2>
</Level1>
</env:Body>
</env:Envelope>
I tried this, but it doesn't return anything:
from xml.etree import ElementTree
tree = ElementTree.parse("/tmp/test.xml")
root = tree.getroot()
for form in root.findall(".//Level3"):
    print(form.text)
    print("Inside Loop")  # not even hitting this
Expected Output:
cell1Text
cell2Text
cell3Text
cell4Text
I was able to access the same elements through code below. But, how to achieve this using findall?
for x in root[1][0][0][0]:
    print(x.text)
Output:
cell1Text
cell2Text
cell3Text
cell4Text
I did go through most of Stack Overflow, but couldn't get an answer to this. Tried many things but failed :( .
In the first code snippet you access form.text, but form corresponds to the Level3 element which has no other text than just white space. The actual text you want to output is sitting in its child nodes. So print(form.text) prints white space only.
The working code iterates the children of that same Level3 element:
for x in root[1][0][0][0]:
    print(x.text)
Here x is the deeper cellX element, which does have the text you expect.
To achieve this with findall do:
for x in root.findall(".//Level3/*"):
    print(x.text)
Note the extra level /* in the argument of findall, which means: any child element of Level3 elements.
See both the original and corrected code run on repl.it
If you didn't get any output with the first version, then please check the spelling. It looks suspicious that the elements in your XML sometimes start with a capital (like Level3) and sometimes not (like cell1). This could be a reason for not getting output. However, I loaded your code and XML as-is, and it produced the message "Inside Loop", as you can see when you follow the link above.
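On the edit about xmlns: a default namespace declaration puts every unprefixed element beneath it into that namespace, so ElementTree knows the element as {http://xyz.abc/forms}Level3 and a bare .//Level3 no longer matches. A minimal sketch, using a trimmed copy of the sample and a prefix f that is chosen arbitrarily for the lookup:

```python
import xml.etree.ElementTree as ET

XML = """<env:Envelope xmlns="http://xyz.abc/forms"
    xmlns:env="http://www.w3.org/2003/05/soap-envelope">
  <env:Body>
    <Level1>
      <Level2 schemaVersion="1-1">
        <Level3>
          <cell1>cell1Text</cell1>
          <cell2>cell2Text</cell2>
        </Level3>
      </Level2>
    </Level1>
  </env:Body>
</env:Envelope>"""

root = ET.fromstring(XML)

# The prefix used here only has to map to the same URI as the document's
# default namespace; it does not have to appear in the document.
ns = {"f": "http://xyz.abc/forms"}

unqualified = root.findall(".//Level3/*")  # ignores the namespace, finds nothing
cells = [x.text for x in root.findall(".//f:Level3/*", ns)]

print(len(unqualified))
print(cells)
```

The /* step needs no prefix because it matches child elements regardless of their name.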
I have an 'XML' file, which I do not control, which I am trying to parse with etree.ElementTree which contains two root elements:
<?xml version="1.0"?>
<meta>
... data I do not care about
</meta>
<database>
... data I wish to parse
</database>
Trying to parse the file I'm getting the error: 'junk after document element' which I understand is related to the fact that it isn't valid xml, since xml can only have one root element. I've been reading around for a solution, and while I have found a few posts addressing this issue they have all been different enough or difficult enough that I could not, as a beginner, get my head round them.
As I understand it, the solution would be either to encase everything in a new root element and parse that, or to somehow ignore or split off the <meta> element and its children. Any guidance on how best to accomplish this would be appreciated.
Beautiful Soup might ease your problem (although it is the lxml parser inside that does the work), but it's a long-term downgrade, for instance when you want to use XPath.
Stick to ET. It is strict and won't let you parse XML that is not well-formed; well-formed XML requires exactly one root element and nothing else outside of it.
If you manage to parse your XML file, you can be sure it is well-formed. All options are legit:
1) Read the file as a string, remove the declaration and put the root tags around it. Then parse from the string. (Clear the string variable after that.) Or you could edit the file first.
2) Create a new root element ( new_root = ET.Element('new_root') ), read the top-level elements in the file and append them with SubElement.
The second option requires more coding and maintenance if the file gets changed.
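Option 1 might look roughly like the sketch below. It assumes the only thing ahead of the first element is the XML declaration (no DOCTYPE), that the file fits in memory, and that the wrapper name new_root does not clash with anything in the data; parse_with_wrapper is a made-up helper name.

```python
import re
import xml.etree.ElementTree as ET

DATA = """<?xml version="1.0"?>
<meta>
  <somedata>1</somedata>
</meta>
<database>
  <important>100</important>
</database>"""


def parse_with_wrapper(text, wrapper="new_root"):
    """Strip the XML declaration and wrap the remaining content in a single root."""
    # The declaration is only legal at the very start, so it must go
    # before anything is placed in front of it.
    body = re.sub(r"^\s*<\?xml[^>]*\?>", "", text, count=1)
    return ET.fromstring("<{0}>{1}</{0}>".format(wrapper, body))


root = parse_with_wrapper(DATA)
top_level = [child.tag for child in root]
important = root.find("database/important").text

print(top_level)
print(important)
```

When reading from disk, the same function can be fed the result of open(path).read().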
Here is one solution using BeautifulSoup, where data holds the malformed XML. BeautifulSoup will process it like any other document, so you can access both parts:
from bs4 import BeautifulSoup
data = """<?xml version="1.0"?>
<meta>
<somedata>1</somedata>
</meta>
<database>
<important>100</important>
</database>"""
soup = BeautifulSoup(data, 'lxml')
print(soup.database.important.text)
Prints:
100
I have the following xml:
<?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23">
<document><![CDATA["#username: That boner came at the wrong time ???? http://t.co/5X34233gDyCaCjR" HELP I'M DYING ]]></document>
<document><![CDATA[Ugh ]]></document>
<document><![CDATA[YES !!!! WE GO FOR IT. http://t.co/fiI23324E83b0Rt ]]></document>
<document><![CDATA[#username Shout out to me???? ]]></document>
</author>
What is the most efficient way to parse and extract the <![CDATA[content]]> into a list. Let's say:
[#username: That boner came at the wrong time ???? http://t.co/5X34233gDyCaCjR" HELP I'M DYING Ugh YES !!!! WE GO FOR IT. http://t.co/fiI23324E83b0Rt #username Shout out to me???? ]
This is what I tried:
from bs4 import BeautifulSoup
x='/Users/user/PycharmProjects/TratandoDeMejorarPAN/test.xml'
y = BeautifulSoup(open(x), 'xml')
out = [y.author.document]
print out
And this is the output:
[<document>"#username: That boner came at the wrong time ???? http://t.co/5XgDyCaCjR" HELP I'M DYING </document>]
The problem with this output is that it still includes the <document></document> tags, and only the first element. How can I remove the tags and get the text of all the <document> elements of this XML in a list?
There are several things wrong here. (Asking questions on selecting a library is against the rules here, so I'm ignoring that part of the question).
You need to pass in a file handle, not a file name.
That is: y = BeautifulSoup(open(x))
You need to tell BeautifulSoup that it's dealing with XML.
That is: y = BeautifulSoup(open(x), 'xml')
CDATA sections don't create elements. You can't search for them in the DOM, because they don't exist in the DOM; they're just syntactic sugar. Just look at the text directly under the document, don't try to search for something named CDATA.
To state it again, somewhat differently: <doc><![CDATA[foo]]></doc> is exactly the same as <doc>foo</doc>. What's different about a CDATA section is that everything inside it is automatically escaped, meaning that <![CDATA[<hello>]]> is interpreted as &lt;hello&gt;. However -- you can't tell from the parsed object tree whether your document contained a CDATA section with literal < and > or a raw text section with &lt; and &gt;. This is by design, and true of any compliant XML DOM implementation.
Now, how about some code that actually works:
import bs4
doc="""
<?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23">
<document><![CDATA["#username: That came at the wrong time ????" HELP I'M DYING ]]></document>
<document><![CDATA[Ugh ]]></document>
<document><![CDATA[YES !!!! WE GO FOR IT. ]]></document>
<document><![CDATA[#username Shout out to me???? ]]></document>
</author>
"""
doc_el = bs4.BeautifulSoup(doc, 'xml')
print [ el.text for el in doc_el.findAll('document') ]
If you want to read from a file, replace doc with open(filename, 'r').
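The point about CDATA can also be checked without BeautifulSoup. This is only a sketch using the standard library's ElementTree, with shortened made-up sample text; the third element uses escaped entities instead of CDATA to show that both forms come out as plain text:

```python
import xml.etree.ElementTree as ET

DOC = """<author id="user23">
<document><![CDATA[Ugh ]]></document>
<document><![CDATA[YES !!!! WE GO <b>FOR</b> IT. ]]></document>
<document>plain &lt;text&gt; </document>
</author>"""

root = ET.fromstring(DOC)

# CDATA content is exposed as ordinary element text; there is no CDATA node to look for.
texts = [el.text.strip() for el in root.findall("document")]
print(texts)
```

Nothing in the parsed result says which element was written with CDATA and which with entities.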
So, I am accessing some url that is formatted something like the following:
<DOCUMENT>
<TYPE>A
<SEQUENCE>1
<TEXT>
<HTML>
<BODY BGCOLOR="#FFFFFF" LINK=BLUE VLINK=PURPLE>
</BODY>
</HTML>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>B
<SEQUENCE>2
...
As you can see, it starts a document, (which is the sequence number 1), and then finishes the document, and then document with sequence 2 starts and so on.
What I want to do is write an XPath expression in Python that returns just the document with sequence value 1 (or, equivalently, TYPE A).
I supposed that such a thing would work:
import lxml
from lxml import html
page = html.fromstring(pagehtml)
type_a = page.xpath("//document[sequence=1]/descendant::*/text()")
however, it just gives me an empty list as type_a variable.
Could someone please let me know what is my mistake in this code? I am really new to this xml stuff.
It might be because that's highly dubious HTML. The <SEQUENCE> tag is unclosed, so it could well be interpreted by lxml as containing all of the code until the next </DOCUMENT>, so it does not end up just containing the 1. When your XPath code then looks for a <SEQUENCE> containing 1, there isn't one.
Additionally, XML is case-sensitive, but HTML isn't. XPath is designed for XML, so it is also case sensitive, which would also stop your document matching <DOCUMENT>.
Try //DOCUMENT[starts-with(SEQUENCE,'1')]. That's based on Xpath using starts-with function.
Ideally, if the input is under your control, you should instead just close the type and sequence tags (with </TYPE> and </SEQUENCE>) to make the input valid.
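If the input is only available as a string, closing those tags can be approximated with a regular expression before parsing. This is a sketch, assuming each <TYPE> and <SEQUENCE> value sits alone on its own line as in the sample; close_header_tags is a made-up helper name, and the body is left empty here so that the result happens to be parseable as XML (a real HTML body would still need an HTML parser).

```python
import re
import xml.etree.ElementTree as ET

RAW = """<DOCUMENT>
<TYPE>A
<SEQUENCE>1
<TEXT>
</TEXT>
</DOCUMENT>"""

# Matches a line that consists of an unclosed <TYPE> or <SEQUENCE> tag and its value.
UNCLOSED = re.compile(r"^<(TYPE|SEQUENCE)>([^<\n]*)$", re.MULTILINE)


def close_header_tags(text):
    """Append the missing </TYPE> and </SEQUENCE> end tags."""
    return UNCLOSED.sub(r"<\1>\2</\1>", text)


fixed = close_header_tags(RAW)
print(fixed)

# With the tags closed, SEQUENCE is a direct child of DOCUMENT again.
doc = ET.fromstring(fixed)
sequence = doc.findtext("SEQUENCE")
print(sequence)
```

After this step a predicate such as DOCUMENT[SEQUENCE='1'] has something to match, because the value is no longer mixed up with the rest of the document.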
I'd like to point out, apart from the great answer provided by @GKFX, that the lxml.html module is capable of parsing broken HTML or a fragment of HTML. In fact, it will parse your string and handle it just fine.
fromstring(string): Returns document_fromstring or
fragment_fromstring, based on whether the string looks like a full
document, or just a fragment.
The problem you have, apart from whatever your other code does when generating the string, also lies in the fact that you haven't given the true path to the SEQUENCE node.
type_a = page.xpath("//document[sequence=1]/descendant::*/text()")
Your XPath above tries to find all document nodes that have a child node called sequence whose value is 1. However, your document's first child node is type, not sequence, so you will never get what you want.
Consider rewriting it like this, which will get what you need:
page.xpath('//document[type/sequence=1]/descendant::*/text()')
['A\n ', '1\n ']
Since your HTML string is missing the closing tag for sequence, you cannot, however, get the correct result with another XPath like this:
page.xpath('//document[type/sequence=1]/../..//text()')
['A\n ', '1\n ', 'B\n ', '2']
That is because your sequence=1 has no closing tag, so sequence=2 becomes a child node of it.
I have to point out that your HTML string is still invalid, but the tolerance of lxml's parser can handle your case just fine.
Try using a relative path that explicitly specifies the correct path to your element (not skipping type):
page.xpath("//document[./type/sequence = 1]")
See: http://pastebin.com/ezQXtKcr
Output:
Trying original post (novice_007): //document[sequence=1]/descendant::*/text()
[]
Using GKFX's answer: //DOCUMENT[starts-with(SEQUENCE,'1')]
[]
My answer: //document[./type/sequence = 1]
[<Element document at 0x1bfcb30>]
Currently, the xpath I provided is the only one that ... to just get the document with sequence value 1
I am a total python newb and am trying to parse an XML document that is being returned from google as a result of a post request.
The document returned looks like the one outlined in this doc
http://code.google.com/apis/documents/docs/3.0/developers_guide_protocol.html#Archives
where it says 'The response contains information about the archive.'
The only part I am interested in is the Id attribute right near the beginning. There will only ever be 1 entry, and 1 id attribute. How can I extract it to be used later? I've been fighting with this for a while and I feel like I've tried everything from minidom to elementtree. No matter what I do, my search comes back blank, loops don't iterate, or methods are missing. Any assistance is much appreciated. Thank you.
I would highly recommend the Python package BeautifulSoup. It is awesome. Here is a simple example using their example data (assuming you've installed BeautifulSoup already):
from BeautifulSoup import BeautifulSoup
data = """<?xml version='1.0' encoding='utf-8'?>
<entry xmlns='http://www.w3.org/2005/Atom'
xmlns:docs='http://schemas.google.com/docs/2007'
xmlns:gd='http://schemas.google.com/g/2005'>
<id>
https://docs.google.com/feeds/archive/-228SJEnnmwemsiDLLxmGeGygWrvW1tMZHHg6ARCy3Uj3SMH1GHlJ2scb8BcHSDDDUosQAocwBQOAKHOq3-0gmKA</id>
<published>2010-11-18T18:34:06.981Z</published>
<updated>2010-11-18T18:34:07.763Z</updated>
<app:edited xmlns:app='http://www.w3.org/2007/app'>
2010-11-18T18:34:07.763Z</app:edited>
<category scheme='http://schemas.google.com/g/2005#kind'
term='http://schemas.google.com/docs/2007#archive'
label='archive' />
<title>Document Archive - someuser@somedomain.com</title>
<link rel='self' type='application/atom+xml'
href='https://docs.google.com/feeds/default/private/archive/-228SJEnnmwemsiDLLxmGeGygWrvW1tMZHHg6ARCy3Uj3SMH1GHlJ2scb8BcHSDDDUosQAocwBQOAKHOq3-0gmKA' />
<link rel='edit' type='application/atom+xml'
href='https://docs.google.com/feeds/default/private/archive/-228SJEnnmwemsiDLLxmGeGygWrvW1tMZHHg6ARCy3Uj3SMH1GHlJ2scb8BcHSDDDUosQAocwBQOAKHOq3-0gmKA' />
<author>
<name>someuser</name>
<email>someuser@somedomain.com</email>
</author>
<docs:archiveNotify>someuser@somedomain.com</docs:archiveNotify>
<docs:archiveStatus>flattening</docs:archiveStatus>
<docs:archiveResourceId>
0Adj-hQNOVsTFSNDEkdk2221OTJfMWpxOGI5OWZu</docs:archiveResourceId>
<docs:archiveResourceId>
0Adj-hQNOVsTFZGZodGs2O72NFMllMQDN3a2Rq</docs:archiveResourceId>
<docs:archiveConversion source='application/vnd.google-apps.document'
target='text/plain' />
</entry>"""
soup = BeautifulSoup(data, fromEncoding='utf8')
print soup('id')[0].text
There is also expat, which is built into Python, but it is worth learning BeautifulSoup, because it will respond way better to real-world XML (and HTML).
Assuming the variable response contains a string representation of the returned XML document, let me tell you the WRONG way to solve your problem:
id = response.split("</id>")[0].split("<id>")[1]
The right way to do it is with xml.sax or xml.dom or expat, but personally, I wouldn't be bothered unless I wanted to have robust error handling of exception cases when response contains something unexpected.
EDIT: I forgot about BeautifulSoup, it is indeed as awesome as Travis describes.
If you'd like to use minidom, you can do the following (replace gd.xml with your xml input):
from xml.dom import minidom
dom = minidom.parse("gd.xml")
id = dom.getElementsByTagName("id")[0].childNodes[0].nodeValue
print id
Also, I assume you meant id element, and not id attribute.
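For completeness, the same lookup with ElementTree, as a sketch. The response below is a shortened stand-in for the real one, with a placeholder id value. The entry declares a default Atom namespace, so the element has to be looked up under that namespace; a plain find("id") returns None, which may be why earlier ElementTree attempts came back blank.

```python
import xml.etree.ElementTree as ET

# Shortened, made-up stand-in for the response; the id value is a placeholder.
RESPONSE = """<?xml version='1.0' encoding='utf-8'?>
<entry xmlns='http://www.w3.org/2005/Atom'
    xmlns:docs='http://schemas.google.com/docs/2007'>
  <id>
    https://docs.google.com/feeds/archive/EXAMPLE-ID</id>
  <docs:archiveStatus>flattening</docs:archiveStatus>
</entry>"""

ATOM = "{http://www.w3.org/2005/Atom}"

# Parse from bytes so the encoding declaration is handled by the parser.
root = ET.fromstring(RESPONSE.encode("utf-8"))

without_namespace = root.find("id")  # None: the element lives in the Atom namespace
archive_id = root.find(ATOM + "id").text.strip()

print(without_namespace)
print(archive_id)
```

The strip() call removes the newline and indentation that precede the URL inside the element.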