Parsing of xml in Python - python

I am having issue parsing an xml result using python. I tried using etree.Element(text), but the error says Invalid tag name. Does anyone know if this is actually an xml and any way of parsing the result using a standard package? Thank you!
import requests, sys, json
from lxml import etree
response = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=1593319917&report=XML")
text=response.text
print(text)
<?xml version="1.0" ?>
<ExchangeSet xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xmlns="https://www.ncbi.nlm.nih.gov/SNP/docsum" xsi:schemaLocation="https://www.ncbi.nlm.nih.gov/SNP/docsum ftp://ftp.ncbi.nlm.nih.gov/snp/specs/docsum_eutils.xsd" ><DocumentSummary uid="1593319917"><SNP_ID>1593319917</SNP_ID><ALLELE_ORIGIN/><GLOBAL_MAFS><MAF><STUDY>SGDP_PRJ</STUDY><FREQ>G=0.5/1</FREQ></MAF></GLOBAL_MAFS><GLOBAL_POPULATION/><GLOBAL_SAMPLESIZE>0</GLOBAL_SAMPLESIZE><SUSPECTED/><CLINICAL_SIGNIFICANCE/><GENES><GENE_E><NAME>FLT3</NAME><GENE_ID>2322</GENE_ID></GENE_E></GENES><ACC>NC_000013.11</ACC><CHR>13</CHR><HANDLE>SGDP_PRJ</HANDLE><SPDI>NC_000013.11:28102567:G:A</SPDI><FXN_CLASS>upstream_transcript_variant</FXN_CLASS><VALIDATED>by-frequency</VALIDATED><DOCSUM>HGVS=NC_000013.11:g.28102568G>A,NC_000013.10:g.28676705G>A,NG_007066.1:g.3001C>T|SEQ=[G/A]|LEN=1|GENE=FLT3:2322</DOCSUM><TAX_ID>9606</TAX_ID><ORIG_BUILD>154</ORIG_BUILD><UPD_BUILD>154</UPD_BUILD><CREATEDATE>2020/04/27 06:19</CREATEDATE><UPDATEDATE>2020/04/27 06:19</UPDATEDATE><SS>3879653181</SS><ALLELE>R</ALLELE><SNP_CLASS>snv</SNP_CLASS><CHRPOS>13:28102568</CHRPOS><CHRPOS_PREV_ASSM>13:28676705</CHRPOS_PREV_ASSM><TEXT/><SNP_ID_SORT>1593319917</SNP_ID_SORT><CLINICAL_SORT>0</CLINICAL_SORT><CITED_SORT/><CHRPOS_SORT>0028102568</CHRPOS_SORT><MERGED_SORT>0</MERGED_SORT></DocumentSummary>
</ExchangeSet>

You're using the wrong method to parse your XML. The etree.Element
class is for creating a single XML element. For example:
>>> a = etree.Element('a')
>>> a
<Element a at 0x7f8c9040e180>
>>> etree.tostring(a)
b'<a/>'
As Jayvee has pointed how, to parse XML contained in a string you use
the etree.fromstring method (to parse XML content in a file you
would use the etree.parse method):
>>> response = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=1593319917&report=XML")
>>> doc = etree.fromstring(response.text)
>>> doc
<Element {https://www.ncbi.nlm.nih.gov/SNP/docsum}ExchangeSet at 0x7f8c9040e180>
>>>
Note that because this XML document sets a default namespace, you'll
need properly set namespaces when looking for elements. E.g., this
will fail:
>>> doc.find('DocumentSummary')
>>>
But this works:
>>> doc.find('docsum:DocumentSummary', {'docsum': 'https://www.ncbi.nlm.nih.gov/SNP/docsum'})
<Element {https://www.ncbi.nlm.nih.gov/SNP/docsum}DocumentSummary at 0x7f8c8e987200>

You can check if the xml is well formed by try converting it:
import requests, sys, json
from lxml import etree
response = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=1593319917&report=XML")
text=response.text
try:
doc=etree.fromstring(text)
print("valid")
except:
print("not a valid xml")

Related

How to process xml response from flickr

import flickrapi
from xml.etree import ElementTree as ET
from lxml import etree
flickr = flickrapi.FlickrAPI(api_key,secret=api_secret)
r = flickr.photos_search(tags='e-waste', has_geo="1", per_page='100')
tree = ET.ElementTree(r)
xml_input = etree.parse("response_clean.xml")
transform = etree.XSLT(xslt_root)
links = str(transform(xml_input))
The idea of this little script is to get xml response from Flickr, and then use xsl file to process it further.
I want to convert r object (which is of type lxml.etree._Element)
to xml_input (of type lxml.etree._ElementTree).
I used tree = ET.ElementTree(r) but result is of type xml.etree.ElementTree.ElementTree.
I see that this is not exactly the same, but I don't understand the difference.
How should r be converted to xml_input ?
The code creates xml.etree.ElementTree.ElementTree because ET in the corresponding import statement references xml.etree.ElementTree. You should've used etree.ElementTree instead, which was imported from lxml :
>>> from xml.etree import ElementTree as ET
>>> from lxml import etree
>>> raw ='''<root></root>'''
>>> r = etree.fromstring(raw)
>>> root = etree.ElementTree(r)
>>> type(r)
<type 'lxml.etree._Element'>
>>> type(root)
<type 'lxml.etree._ElementTree'>

Python, extract date from XML

Apologies, my Python knowledge is pretty non-existant. I need to extract a date from some XML which is in a format similar to:
<Header>
<Version>1.0</Version>
....
<cd:Data>...</Data>
.....
<cd:DateReceived>20070620171524</cd:DateReceived>
From looking around here I found something similar
#!/usr/bin/python
from xml.dom.minidom import parse
import xml.dom.minidom
# Open XML document using minidom parser
DOMTree = xml.dom.minidom.parse("date.xml")
collection = DOMTree.documentElement
print collection.getElementsByTagName("cd:DateReceived").item(0)
However this only prints the Hex value:
<DOM Element: cd:DateReceived at 0x1529e0>
How can I get the date 20070620171524?
I've tried using the following
#!/usr/bin/python
from xml.dom.minidom import parse
import xml.dom.minidom
# Open XML document using minidom parser
DOMTree = xml.dom.minidom.parse("date.xml")
collection = DOMTree.documentElement
date = cd:DateReceived[0].firstChild.nodeValue
print date
but it gives an error as it doesn't like the "cd" part of the tag
date = cd:DateReceived[0].firstChild.nodeValue
^
SyntaxError: invalid syntax
Any help would be appreciated. Thanks!
collection.getElementsByTagName("cd:DateReceived").item(0) returns a node. from that node, you can get nodeValue

Lxml: Return a part of the xml file as string

I am parsing my xml file using lxml parser which looks like this:
# some elements above
<contact>
<phonenumber>
#something
</phonenumber>
</contact>
I want to be able to return only a part of the xml file.
Like Suppose if I am on phonenumber, I want lxml to return the everything between as a string .
I dont want to return textb/w phonenumber but the entire string :
<phonenumebr>something</phonenumber>
Is it possible ?
To print a part of the XML tree, you can use lxml.etree.tostring. On Python 2:
In [1]: from lxml.etree import tostring, parse
In [2]: tree = parse('test.xml')
In [3]: elem = tree.xpath('//phonenumber')[0]
In [4]: print tostring(elem)
<phonenumber>
something
</phonenumber>
For more information you can refer to the "Serialisation" section of the lxml tutorial.

How to print Element as correct xml with xml tag?

So I have this function in my view:
from django.http import HttpResponse
from xml.etree.ElementTree import Element, SubElement, Comment, tostring
def helloworld(request):
root_element = Element("root_element")
comment = Comment("Hello World!!!")
root_element.append(comment)
foo_element = Element("foo")
foo_element.text = "bar"
bar_element = Element("bar")
bar_element.text = "foo"
root_element.append(foo_element)
root_element.append(bar_element)
return HttpResponse(tostring(root_element), "application/xml")
What it does it prints something like this:
<root_element><!--Hello World!!!--><foo>bar</foo><bar>foo</bar></root_element>
As you can see, it is missing the xml tag at the beginning. How to output proper XML beginning with xml declaration?
If you can add a dependency in your project, I suggest you to use lxml which is more complete and optimized than the basic xml module that come with Python.
For doing this, you just have to change your import statement to :
from lxml.etree import Element, SubElement, Comment, tostring
And then, you'll have a tostring() with a 'xml_declaration' option :
>>> tostring(root, xml_declaration=False)
'<root_element><!--Hello World!!!--><foo>bar</foo><bar>foo</bar></root_element>'
>>> tostring(root, xml_declaration=True)
"<?xml version='1.0' encoding='ASCII'?>\n<root_element><!--Hello World!!!--><foo>bar</foo><bar>foo</bar></root_element>"
In the standard lib, only the write() method of ElementTree have a xml_declaration option. An other solution would be to create a wrapper which use ElementTree.write() to write into a StringIO and then, to return the content of the StringIO.

What's the best way to handle -like entities in XML documents with lxml?

Consider the following:
from lxml import etree
from StringIO import StringIO
x = """<?xml version="1.0" encoding="utf-8"?>\n<aa> â</aa>"""
p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
r = etree.parse(StringIO(x), p)
This would fail with:
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 2, column 11
This is because resolve_entities=False doesn't ignore them, it just doesn't resolve them.
If I use etree.HTMLParser instead, it creates html and body tags, plus a lot of other special handling it tries to do for HTML.
What's the best way to get a â text child under the aa tag with lxml?
You can't ignore entities as they are part of the XML definition. Your document is not well-formed if it doesn't have a DTD or standalone="yes" or if it includes entities without an entity definition in the DTD. Lie and claim your document is HTML.
https://mailman-mail5.webfaction.com/pipermail/lxml/2008-February/003398.html
You can try lying and putting an XHTML DTD on your document. e.g.
from lxml import etree
try:
from StringIO import StringIO
except ImportError:
from io import StringIO
x = """<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >\n<aa> â</aa>"""
p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
r = etree.parse(StringIO(x), p)
etree.tostring(r) # '<aa> â</aa>'
#Alex is right: your document is not well-formed XML, and so XML parsers will not parse it. One option is to pre-process the text of the document to replace bogus entities with their utf-8 characters:
entities = [
(' ', u'\u00a0'),
('â', u'\u00e2'),
...
]
for before, after in entities:
x = x.replace(before, after.encode('utf8'))
Of course, this can be broken by sufficiently weird "xml" also.
Your best bet is to fix your input XML documents to be well-formed XML.
When I was trying to do something similar, I just used x.replace('&', '&') before parsing the string.

Categories

Resources