lxml: Converting XML to HTML through XSLT and get HtmlElements - python

I have data that comes as an XML file. I have also been provided an XSLT to transform the XML to HTML. I can use lxml to perform the conversion; however, I want to alter some of the HTML tags after the transformation. How do I convert this new etree into HtmlElements so that I can use methods such as .cssselect()?

>>> import lxml.etree
>>> import lxml.html
>>>
>>> xmlstring = '''\
... <?xml version='1.0' encoding='ASCII'?>
... <root><a class="here">link1</a><a class="there">link2</a></root>
... '''
>>> root = lxml.etree.fromstring(xmlstring)
>>> root.cssselect('a.here')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'lxml.etree._Element' object has no attribute 'cssselect'
Serialize the tree with lxml.etree.tostring(root), then re-parse the result with lxml.html.fromstring(...):
>>> root = lxml.html.fromstring(lxml.etree.tostring(root))
>>> root.cssselect('a.here')
[<Element a at 0x2989308>]
Get XML output:
>>> print lxml.etree.tostring(root, xml_declaration=True)
<?xml version='1.0' encoding='ASCII'?>
<root><a class="here">link1</a><a class="there">link2</a></root>

Related

Parsing of xml in Python

I am having an issue parsing an XML result using Python. I tried using etree.Element(text), but the error says "Invalid tag name". Does anyone know whether this is actually XML, and whether there is a way of parsing the result using a standard package? Thank you!
import requests, sys, json
from lxml import etree
response = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=1593319917&report=XML")
text=response.text
print(text)
<?xml version="1.0" ?>
<ExchangeSet xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xmlns="https://www.ncbi.nlm.nih.gov/SNP/docsum" xsi:schemaLocation="https://www.ncbi.nlm.nih.gov/SNP/docsum ftp://ftp.ncbi.nlm.nih.gov/snp/specs/docsum_eutils.xsd" ><DocumentSummary uid="1593319917"><SNP_ID>1593319917</SNP_ID><ALLELE_ORIGIN/><GLOBAL_MAFS><MAF><STUDY>SGDP_PRJ</STUDY><FREQ>G=0.5/1</FREQ></MAF></GLOBAL_MAFS><GLOBAL_POPULATION/><GLOBAL_SAMPLESIZE>0</GLOBAL_SAMPLESIZE><SUSPECTED/><CLINICAL_SIGNIFICANCE/><GENES><GENE_E><NAME>FLT3</NAME><GENE_ID>2322</GENE_ID></GENE_E></GENES><ACC>NC_000013.11</ACC><CHR>13</CHR><HANDLE>SGDP_PRJ</HANDLE><SPDI>NC_000013.11:28102567:G:A</SPDI><FXN_CLASS>upstream_transcript_variant</FXN_CLASS><VALIDATED>by-frequency</VALIDATED><DOCSUM>HGVS=NC_000013.11:g.28102568G>A,NC_000013.10:g.28676705G>A,NG_007066.1:g.3001C>T|SEQ=[G/A]|LEN=1|GENE=FLT3:2322</DOCSUM><TAX_ID>9606</TAX_ID><ORIG_BUILD>154</ORIG_BUILD><UPD_BUILD>154</UPD_BUILD><CREATEDATE>2020/04/27 06:19</CREATEDATE><UPDATEDATE>2020/04/27 06:19</UPDATEDATE><SS>3879653181</SS><ALLELE>R</ALLELE><SNP_CLASS>snv</SNP_CLASS><CHRPOS>13:28102568</CHRPOS><CHRPOS_PREV_ASSM>13:28676705</CHRPOS_PREV_ASSM><TEXT/><SNP_ID_SORT>1593319917</SNP_ID_SORT><CLINICAL_SORT>0</CLINICAL_SORT><CITED_SORT/><CHRPOS_SORT>0028102568</CHRPOS_SORT><MERGED_SORT>0</MERGED_SORT></DocumentSummary>
</ExchangeSet>
You're using the wrong method to parse your XML. The etree.Element
class is for creating a single XML element. For example:
>>> a = etree.Element('a')
>>> a
<Element a at 0x7f8c9040e180>
>>> etree.tostring(a)
b'<a/>'
As Jayvee has pointed out, to parse XML contained in a string you use
the etree.fromstring method (to parse XML content in a file you
would use the etree.parse method):
>>> response = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=1593319917&report=XML")
>>> doc = etree.fromstring(response.text)
>>> doc
<Element {https://www.ncbi.nlm.nih.gov/SNP/docsum}ExchangeSet at 0x7f8c9040e180>
>>>
Note that because this XML document sets a default namespace, you'll
need to set namespaces properly when looking for elements. E.g., this
finds nothing (find returns None):
>>> doc.find('DocumentSummary')
>>>
But this works:
>>> doc.find('docsum:DocumentSummary', {'docsum': 'https://www.ncbi.nlm.nih.gov/SNP/docsum'})
<Element {https://www.ncbi.nlm.nih.gov/SNP/docsum}DocumentSummary at 0x7f8c8e987200>
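The default-namespace behaviour can also be reproduced offline. The sketch below is an assumption-laden illustration: it uses the standard library's xml.etree.ElementTree instead of lxml (its find() accepts the same prefix mapping), and SAMPLE is a hand-trimmed stand-in for the NCBI response, not the real payload. The prefix name docsum is arbitrary.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the efetch response (assumption: only two fields kept).
SAMPLE = (
    '<ExchangeSet xmlns="https://www.ncbi.nlm.nih.gov/SNP/docsum">'
    '<DocumentSummary uid="1593319917">'
    '<SNP_ID>1593319917</SNP_ID><CHR>13</CHR>'
    '</DocumentSummary>'
    '</ExchangeSet>'
)

# Prefix-to-URI mapping; the prefix is a local choice, the URI must match.
NS = {"docsum": "https://www.ncbi.nlm.nih.gov/SNP/docsum"}

doc = ET.fromstring(SAMPLE)

# Without a namespace the lookup silently returns None.
missing = doc.find("DocumentSummary")

# With the prefix mapping the element is found.
summary = doc.find("docsum:DocumentSummary", NS)
chromosome = summary.find("docsum:CHR", NS).text

# Clark notation ({uri}tag) works too and needs no mapping.
snp_id = doc.find(
    "{https://www.ncbi.nlm.nih.gov/SNP/docsum}DocumentSummary/"
    "{https://www.ncbi.nlm.nih.gov/SNP/docsum}SNP_ID"
).text

print(missing, summary.get("uid"), chromosome, snp_id)
```

Every step of a path needs the namespace, not just the first one, because child elements inherit the default namespace from the root.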
You can check whether the XML is well formed by trying to parse it:
import requests
from lxml import etree
response = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=1593319917&report=XML")
text = response.text
try:
    # fromstring raises XMLSyntaxError when the input is not well formed
    doc = etree.fromstring(text)
    print("valid")
except etree.XMLSyntaxError:
    print("not a valid xml")

extract a specific tag from xml file using beautiful soup in python

I have an XML file (let's call it abc.xml) which looks like this.
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<product name="XYZ" version="123"/>
<application-links>
<application-links>
<id>111111111111111</id>
<name>Link_1</name>
<primary>true</primary>
<type>applinks.ABC</type>
<display-url>http://ABC.displayURL</display-url>
<rpc-url>http://ABC.displayURL</rpc-url>
</application-links>
</application-links>
</properties>
My Python code looks like this:
f = open ('file.xml', 'r')
from bs4 import BeautifulSoup
soup = BeautifulSoup(f,'lxml')
print(soup.product)
for applinks in soup.application-links:
    print(applinks)
which prints the following
<product name="XYZ" version="123"></product>
Traceback (most recent call last):
File "parse.py", line 7, in <module>
for applinks in soup.application-links:
NameError: name 'links' is not defined
Please can you help me understand how to print lines which have tags including a dash/hyphen '-'
I don't know if beautifulsoup is the best option here, but I really suggest using the ElementTree module in python like so:
>>> import xml.etree.ElementTree as ET
>>> root = ET.parse('file.xml').getroot()
>>> for app in root.findall('*/application-links/'):
...     print(app.text)
111111111111111
Link_1
true
applinks.ABC
http://ABC.displayURL
http://ABC.displayURL
So, to print only the value inside the <name> tag, narrow the path:
>>> for app in root.findall('*/application-links/name'):
...     print(app.text)
Link_1
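As for why soup.application-links raised a NameError in the first place: a hyphen is not valid inside a Python attribute name, so the expression is parsed as the subtraction soup.application - links. The sketch below uses the standard library's ast module to show that parse without needing BeautifulSoup or the XML file. In BeautifulSoup itself, passing the tag name as a string, for example soup.find('application-links') or soup.find_all('application-links'), sidesteps the issue.

```python
import ast

# Parse the expression from the question without evaluating it.
node = ast.parse("soup.application-links", mode="eval").body

# The top-level node is a binary operation whose operator is subtraction.
kind = type(node).__name__
operator = type(node.op).__name__

# Left operand: the attribute lookup soup.application
left = f"{node.left.value.id}.{node.left.attr}"

# Right operand: a bare name, which Python then fails to resolve.
right = node.right.id

print(kind, operator, left, right)
```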

ExpatError: junk after document element xml python error

I have a project that needs to convert XML to a dict in Python. I am using the xmltodict library; however, when I convert the XML to a dict it raises this error:
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/Users/deanchristianarmada/Desktop/projects/asian_gaming/radar/lib/python2.7/site-packages/xmltodict.py", line 311, in parse
parser.Parse(xml_input, True)
ExpatError: junk after document element: line 2, column 0
my code is:
import xmltodict
xml = '<row dataType="TR" ID="3B6B408870BA7AC3E05381010A0A5849" agentCode="690001001001001" transferId="G87_AGIN160901115820S441XB" tradeNo="160831287638239" platformType="AGIN" playerName="mubuuvu2" transferType="IN" transferAmount="28" previousAmount="0" currentAmount="28" currency="CNY" exchangeRate="1" IP="0" flag="0" creationTime="2016-08-31 23:58:16" gameCode="" />\r\n<row dataType="TR" ID="3B6B408870BB7AC3E05381010A0A5849" agentCode="690001001001001" transferId="160831231227456_Hunter_Out" tradeNo="160831287639025" platformType="AGIN" playerName="zxh123" transferType="OUT" transferAmount="-50" previousAmount="50" currentAmount="0" currency="CNY" exchangeRate="1" IP="0" flag="0" creationTime="2016-08-31 23:58:18" gameCode="" />\r\n<row dataType="TR" ID="3B6B408870BC7AC3E05381010A0A5849" agentCode="690001001001001" transferId="160831231227452_Hunter_In" tradeNo="160831287639507" platformType="AGIN" playerName="qqq19qq32b" transferType="IN" transferAmount="71" previousAmount="0" currentAmount="71" currency="CNY" exchangeRate="1" IP="0" flag="0" creationTime="2016-08-31 23:58:19" gameCode="" />\r\n'
_dict = xmltodict.parse(xml, attr_prefix="")
I can't seem to find a way to fix it. I'm not used to XML; I'm used to JSON.
If you add a starting root tag in the beginning and an ending root tag in the end of the xml string, it should work.
import xmltodict
xml = 'xml string here'
xml = '<root>'+xml+'</root>'
_dict = xmltodict.parse(xml, attr_prefix="")
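Basically, it's just missing the <root> tag. An XML document must have exactly one root element, and this string contains several top-level <row> elements, so the parser reports everything after the first one as junk.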
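The same wrapping trick can be sketched with only the standard library, using xml.etree.ElementTree in place of xmltodict. The two rows are shortened copies of the data in the question, and rows_to_dicts is a hypothetical helper name introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copies of the rows from the question (most attributes omitted).
fragment = (
    '<row dataType="TR" playerName="mubuuvu2" transferType="IN" transferAmount="28" />\r\n'
    '<row dataType="TR" playerName="zxh123" transferType="OUT" transferAmount="-50" />\r\n'
)


def rows_to_dicts(xml_fragment):
    """Wrap sibling elements in a synthetic root and return their attributes."""
    root = ET.fromstring("<root>" + xml_fragment + "</root>")
    return [dict(row.attrib) for row in root.findall("row")]


# Parsing the bare fragment fails because it has more than one top-level element.
try:
    ET.fromstring(fragment)
    bare_parse_ok = True
except ET.ParseError:
    bare_parse_ok = False

rows = rows_to_dicts(fragment)
print(bare_parse_ok, rows)
```

Attribute values come back as strings, so numeric fields such as transferAmount still need an explicit conversion if arithmetic is required.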

How to change element text of only one element with python elementtree

I have the following xml example file:
<Book>
<Location>page10</Location>
<Chapter>
<Location>page11</Location>
</Chapter>
</Book>
I want to change the text value of the <Location> element directly beneath <Book>.
Using findall gives both 'Location' elements.
Using find gives the first, which could be right, but if 'Chapter' is placed before 'Location' then I get the wrong element.
Any suggestions?
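Use a path. './Location' matches only direct children of the current element, so the one nested inside <Chapter> is skipped: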
>>> import xml.etree.ElementTree as etree
>>> frag = '<Book><Chapter><Location>page11</Location></Chapter><Location>page10</Location></Book>'
>>> tree = etree.fromstring(frag)
>>> tree.findall('./Location')[0].text
'page10'
>>> tree.findall('./Location')[1].text
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: list index out of range
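To actually change the text, assign to the .text attribute of the element the path returns. A minimal sketch, written for Python 3 and reusing the fragment above; the replacement value page12 is made up for illustration.

```python
import xml.etree.ElementTree as ET

frag = (
    "<Book><Chapter><Location>page11</Location></Chapter>"
    "<Location>page10</Location></Book>"
)
book = ET.fromstring(frag)

# './Location' matches direct children of <Book> only.
direct = book.find("./Location")
direct.text = "page12"  # hypothetical new value

# iter() walks the whole tree in document order, so the nested
# <Location> is still there and still holds its original text.
all_locations = [loc.text for loc in book.iter("Location")]

result = ET.tostring(book, encoding="unicode")
print(all_locations)
print(result)
```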

XML Not Parsing in Python 2.7 with ElementTree

I have the following XML file, which I get from a REST API:
<?xml version="1.0" encoding="utf-8"?>
<boxes>
<home id="1" name="productname"/>
<server>111.111.111.111</server>
<approved>yes</approved>
<creation>2007 handmade</creation>
<description>E-Commerce, buying and selling both attested</description>
<boxtype>
<sizes>large, medium, small</sizes>
<vendor>Some Organization</vendor>
<version>ANY</version>
</boxtype>
<method>Handmade, Handcrafted</method>
<time>2014</time>
</boxes>
I am able to get the above output, store it in a string variable and print it to the console,
but when I pass it to ElementTree:
import base64
import urllib2
from xml.dom.minidom import Node, Document, parseString
from xml.etree import ElementTree as ET
from xml.etree.ElementTree import XML, fromstring, tostring
print outputxml ##Printing xml correctly, outputxml contains xml above
content = ET.fromstring(outputxml)
boxes = content.find('boxes')
print boxes
boxtype = boxes.find("boxes/boxtype")
Printing boxes gives None, which leads to the error below:
boxtype = boxes.find("boxes/boxtype")
AttributeError: 'NoneType' object has no attribute 'find'
ET.fromstring returns the root element, which here is already boxes, and it cannot find boxes within itself.
boxtype = content.find("boxtype")
should be sufficient.
DEMO:
>>> import base64
>>> import urllib2
>>> from xml.dom.minidom import Node, Document, parseString
>>> from xml.etree import ElementTree as ET
>>> from xml.etree.ElementTree import XML, fromstring, tostring
>>>
>>> print outputxml ##Printing xml correctly, outputxml contains xml above
<?xml version="1.0" encoding="utf-8"?>
<boxes>
<home id="1" name="productname"/>
<server>111.111.111.111</server>
<approved>yes</approved>
<creation>2007 handmade</creation>
<description>E-Commerce, buying and selling both attested</description>
<boxtype>
<sizes>large, medium, small</sizes>
<vendor>Some Organization</vendor>
<version>ANY</version>
</boxtype>
<method>Handmade, Handcrafted</method>
<time>2014</time>
</boxes>
>>> content = ET.fromstring(outputxml)
>>> boxes = content.find('boxes')
>>> print boxes
None
>>>
>>> boxes
>>> content #note that the content is the root level node - boxes
<Element 'boxes' at 0x1075a9250>
>>> content.find('boxtype')
<Element 'boxtype' at 0x1075a93d0>
>>>
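The same point as a self-contained Python 3 sketch (the session above is Python 2.7). The XML is a shortened copy of the response in the question; nothing else is assumed.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the response from the question.
outputxml = """<?xml version="1.0" encoding="utf-8"?>
<boxes>
  <server>111.111.111.111</server>
  <boxtype>
    <sizes>large, medium, small</sizes>
    <vendor>Some Organization</vendor>
  </boxtype>
</boxes>"""

# fromstring returns the root element itself, not a wrapper around it.
root = ET.fromstring(outputxml)
root_tag = root.tag

# There is no child named boxes, so this lookup returns None.
self_lookup = root.find("boxes")

# Paths are relative to the root, so they start at its children.
vendor = root.find("boxtype/vendor").text
children = [child.tag for child in root]

print(root_tag, self_lookup, vendor, children)
```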
