I have this XML file, and I only need to get the value of steamID64 (76561198875082603).
<profile>
<steamID64>76561198875082603</steamID64>
<steamID>...</steamID>
<onlineState>online</onlineState>
<stateMessage>...</stateMessage>
<privacyState>public</privacyState>
<visibilityState>3</visibilityState>
<avatarIcon>...</avatarIcon>
<avatarMedium>...</avatarMedium>
<avatarFull>...</avatarFull>
<vacBanned>0</vacBanned>
<tradeBanState>None</tradeBanState>
<isLimitedAccount>0</isLimitedAccount>
<customURL>...</customURL>
<memberSince>December 8, 2018</memberSince>
<steamRating/>
<hoursPlayed2Wk>0.0</hoursPlayed2Wk>
<headline>...</headline>
<location>...</location>
<realname>
<![CDATA[ THEMakci7m87 ]]>
</realname>
<summary>...</summary>
<mostPlayedGames>...</mostPlayedGames>
<groups>...</groups>
</profile>
So far I only have this code:
xml_url = f'{url}?xml=1'
and I don't know what to do from there.
It's fairly simple with lxml (the HTML parser lowercases tag names, which is why the XPath below uses steamid64):
import lxml.html as lh
steam = """your html above"""
doc = lh.fromstring(steam)
doc.xpath('//steamid64/text()')
Output:
['76561198875082603']
Edit:
With the actual url, it's clear that the underlying data is xml; so the better way to do it is:
import requests
from lxml import etree
url = 'https://steamcommunity.com/id/themakci7m87/?xml=1'
req = requests.get(url)
doc = etree.XML(req.text.encode())
doc.xpath('//steamID64/text()')
Same output.
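Tying this back to the xml_url line in the question, a minimal end-to-end sketch (the profile URL is the one used above; the list index is guarded in case the element is missing):
import requests
from lxml import etree

url = 'https://steamcommunity.com/id/themakci7m87/'
xml_url = f'{url}?xml=1'

req = requests.get(xml_url)
doc = etree.XML(req.content)  # bytes, so an encoding declaration in the response is handled

ids = doc.xpath('//steamID64/text()')
steam_id_64 = ids[0] if ids else None
print(steam_id_64)  # 76561198875082603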
You would be better off using the built-in XML library, ElementTree.
lxml is an external XML library that requires a separate installation.
See below
import requests
import xml.etree.ElementTree as ET
r = requests.get('https://steamcommunity.com/id/themakci7m87/?xml=1')
if r.status_code == 200:
    root = ET.fromstring(r.text)
    steam_id_64 = root.find('./steamID64').text
    print(steam_id_64)
else:
    print('Failed to read data.')
Output:
76561198875082603
I am having an issue parsing an XML result using Python. I tried using etree.Element(text), but it raises an "Invalid tag name" error. Does anyone know whether this is actually XML, and if so, how to parse the result using a standard package? Thank you!
import requests, sys, json
from lxml import etree
response = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=1593319917&report=XML")
text = response.text
print(text)
<?xml version="1.0" ?>
<ExchangeSet xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xmlns="https://www.ncbi.nlm.nih.gov/SNP/docsum" xsi:schemaLocation="https://www.ncbi.nlm.nih.gov/SNP/docsum ftp://ftp.ncbi.nlm.nih.gov/snp/specs/docsum_eutils.xsd" ><DocumentSummary uid="1593319917"><SNP_ID>1593319917</SNP_ID><ALLELE_ORIGIN/><GLOBAL_MAFS><MAF><STUDY>SGDP_PRJ</STUDY><FREQ>G=0.5/1</FREQ></MAF></GLOBAL_MAFS><GLOBAL_POPULATION/><GLOBAL_SAMPLESIZE>0</GLOBAL_SAMPLESIZE><SUSPECTED/><CLINICAL_SIGNIFICANCE/><GENES><GENE_E><NAME>FLT3</NAME><GENE_ID>2322</GENE_ID></GENE_E></GENES><ACC>NC_000013.11</ACC><CHR>13</CHR><HANDLE>SGDP_PRJ</HANDLE><SPDI>NC_000013.11:28102567:G:A</SPDI><FXN_CLASS>upstream_transcript_variant</FXN_CLASS><VALIDATED>by-frequency</VALIDATED><DOCSUM>HGVS=NC_000013.11:g.28102568G>A,NC_000013.10:g.28676705G>A,NG_007066.1:g.3001C>T|SEQ=[G/A]|LEN=1|GENE=FLT3:2322</DOCSUM><TAX_ID>9606</TAX_ID><ORIG_BUILD>154</ORIG_BUILD><UPD_BUILD>154</UPD_BUILD><CREATEDATE>2020/04/27 06:19</CREATEDATE><UPDATEDATE>2020/04/27 06:19</UPDATEDATE><SS>3879653181</SS><ALLELE>R</ALLELE><SNP_CLASS>snv</SNP_CLASS><CHRPOS>13:28102568</CHRPOS><CHRPOS_PREV_ASSM>13:28676705</CHRPOS_PREV_ASSM><TEXT/><SNP_ID_SORT>1593319917</SNP_ID_SORT><CLINICAL_SORT>0</CLINICAL_SORT><CITED_SORT/><CHRPOS_SORT>0028102568</CHRPOS_SORT><MERGED_SORT>0</MERGED_SORT></DocumentSummary>
</ExchangeSet>
You're using the wrong method to parse your XML. The etree.Element
class is for creating a single XML element. For example:
>>> a = etree.Element('a')
>>> a
<Element a at 0x7f8c9040e180>
>>> etree.tostring(a)
b'<a/>'
As Jayvee has pointed out, to parse XML contained in a string you use
the etree.fromstring method (to parse XML content in a file you
would use the etree.parse method):
>>> response = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=1593319917&report=XML")
>>> doc = etree.fromstring(response.text)
>>> doc
<Element {https://www.ncbi.nlm.nih.gov/SNP/docsum}ExchangeSet at 0x7f8c9040e180>
>>>
Note that because this XML document sets a default namespace, you'll
need to set namespaces properly when looking for elements. E.g., this
will fail:
>>> doc.find('DocumentSummary')
>>>
But this works:
>>> doc.find('docsum:DocumentSummary', {'docsum': 'https://www.ncbi.nlm.nih.gov/SNP/docsum'})
<Element {https://www.ncbi.nlm.nih.gov/SNP/docsum}DocumentSummary at 0x7f8c8e987200>
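If you need a specific value rather than the element object, the same namespace mapping works for nested lookups as well. A minimal sketch, reusing the doc parsed above (the docsum prefix name is arbitrary):
# Map an arbitrary prefix to the document's default namespace.
ns = {'docsum': 'https://www.ncbi.nlm.nih.gov/SNP/docsum'}
summary = doc.find('docsum:DocumentSummary', ns)
# Nested find calls accept the same mapping.
snp_id = summary.find('docsum:SNP_ID', ns).text
print(snp_id)  # expected: 1593319917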
You can check whether the XML is well formed by trying to parse it:
import requests, sys, json
from lxml import etree
response = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=1593319917&report=XML")
text = response.text

try:
    doc = etree.fromstring(text)
    print("valid")
except:
    print("not a valid xml")
Hello, I have never worked with XML. Can someone help me create a list or dictionary in Python that maps an ID to a specific name (string) from the XML file?
Here is my xml file:
api.brain-map.org/api/v2/data/query.xml?num_rows=10000&start_row=10001&&criteria=model::Gene,rma::criteria,products[abbreviation$eq%27Mouse%27]
I can show you a snippet:
<Response success="true" start_row="10001" num_rows="9990" total_rows="19991">
<objects>
<object>
<acronym>Hdac4</acronym>
<alias-tags>4932408F19Rik AI047285</alias-tags>
<chromosome-id>34</chromosome-id>
<ensembl-id nil="true"/>
<entrez-id>208727</entrez-id>
<genomic-reference-update-id>491928275</genomic-reference-update-id>
<homologene-id>55946</homologene-id>
<id>84010</id>
<legacy-ensembl-gene-id nil="true"/>
<name>histone deacetylase 4</name>
<organism-id>2</organism-id>
<original-name>histone deacetylase 4</original-name>
<original-symbol>Hdac4</original-symbol>
<reference-genome-id nil="true"/>
<sphinx-id>188143</sphinx-id>
<version-status>no change</version-status>
</object>
<object>
<acronym>Prss54</acronym>
<alias-tags>4931432M23Rik Klkbl4</alias-tags>
<chromosome-id>53</chromosome-id>
<ensembl-id nil="true"/>
<entrez-id>70993</entrez-id>
<genomic-reference-update-id>491928275</genomic-reference-update-id>
<homologene-id>19278</homologene-id>
<id>46834</id>
<legacy-ensembl-gene-id nil="true"/>
<name>protease, serine 54</name>
<organism-id>2</organism-id>
<original-name>protease, serine, 54</original-name>
<original-symbol>Prss54</original-symbol>
<reference-genome-id nil="true"/>
<sphinx-id>65991</sphinx-id>
<version-status>updated</version-status>
</object>
<object>
...
So in the end I want to have a dictionary or list that says:
208727 is Hdac4, and the same for everything in my two XML files.
So I need the entrez ID and the original symbol.
I want to get that out of these two XML files:
http://api.brain-map.org/api/v2/data/query.xml?num_rows=10000&start_row=1&&criteria=model::Gene,rma::criteria,products[abbreviation$eq%27Mouse%27]
and
http://api.brain-map.org/api/v2/data/query.xml?num_rows=10000&start_row=10001&&criteria=model::Gene,rma::criteria,products[abbreviation$eq%27Mouse%27]
Can someone help me with that?
I am not sure which format I should store it in. In the end I want to search by ID and get the original name.
There is a similar question about parsing XML that you can borrow from.
You can use the Python lxml library (see the linked docs).
You can start with:
import requests
from lxml import etree, html
# edit: yes, BeautifulSoup works too, as mentioned in the other answer
from bs4 import BeautifulSoup
url = "http://api.brain-map.org/api/v2/data/query.xml?num_rows=10000&start_row=10001&&criteria=model::Gene,rma::criteria,products[abbreviation$eq%27Mouse%27]"
req = requests.get(url)
doc = req.text
root = etree.XML(doc) # Works with this or ...
soup = BeautifulSoup(doc) # works with this
Then you need to read the docs to see how to navigate by tags.
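For instance, continuing from the root element above, something along these lines could collect the mapping the question asks for (a sketch, not tested against the live API; the entrez-id and original-symbol tag names are taken from the snippet in the question):
# Continue from root = etree.XML(doc) above.
# If etree.XML() complains about an encoding declaration, pass req.content (bytes) instead of req.text.
id_to_symbol = {}
for obj in root.findall('.//object'):
    entrez = obj.findtext('entrez-id')
    symbol = obj.findtext('original-symbol')
    if entrez and symbol:
        id_to_symbol[entrez.strip()] = symbol.strip()
print(id_to_symbol.get('208727'))  # expected: 'Hdac4', per the snippet in the question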
If you have the XML stored in a file called results.xml, then using BeautifulSoup is as simple as:
from bs4 import BeautifulSoup
with open('results.xml') as f:
    soup = BeautifulSoup(f.read(), 'xml')

final_dictionary = {}
for object in soup.find_all('object'):
    final_dictionary[object.find('acronym').string] = object.find('entrez-id').string

print(final_dictionary)
If, instead, you want to retrieve the XML from a URL, that is also simple:
import requests
from bs4 import BeautifulSoup
url = "<your_url>"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'xml')
# Once you have the 'soup' variable assigned
# It's the same code as above example from here on
Output
{'Hdac4': '208727', 'Prss54': '70993'}
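Since the question ultimately wants to look up the original symbol by entrez ID, and to do so across both URLs, the same approach works with the keys flipped and a loop over the two URLs. A sketch under those assumptions (it keys on entrez-id and stores original-symbol, which differs slightly from the acronym mapping above):
import requests
from bs4 import BeautifulSoup

urls = [
    "http://api.brain-map.org/api/v2/data/query.xml?num_rows=10000&start_row=1&&criteria=model::Gene,rma::criteria,products[abbreviation$eq%27Mouse%27]",
    "http://api.brain-map.org/api/v2/data/query.xml?num_rows=10000&start_row=10001&&criteria=model::Gene,rma::criteria,products[abbreviation$eq%27Mouse%27]",
]

id_to_symbol = {}
for url in urls:
    soup = BeautifulSoup(requests.get(url).content, 'xml')
    for obj in soup.find_all('object'):
        entrez = obj.find('entrez-id')
        symbol = obj.find('original-symbol')
        if entrez and symbol:
            id_to_symbol[entrez.string] = symbol.string

print(id_to_symbol.get('208727'))  # 'Hdac4' for the snippet shown in the question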
I'm trying to parse Evernote Markup Language (ENML) with lxml in Python 2.7. ENML is a superset of XHTML.
from StringIO import StringIO
import lxml.etree as etree
if __name__ == '__main__':
xml_str = StringIO('<?xml version="1.0" encoding="UTF-8"?>\r\n<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">\r\n\r\n<en-note style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">\nA really simple example. Another sentence.\n</en-note>')
tree = etree.parse(xml_str)
The code above errors out with:
XMLSyntaxError: Entity 'nbsp' not defined, line 5, column 32
How do I successfully parse ENML?
The &nbsp; entity is understood by the HTML parser, but not by the XML parser:
from StringIO import StringIO
import lxml.html as LH
if __name__ == '__main__':
xml_str = StringIO('<?xml version="1.0" encoding="UTF-8"?>\r\n<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">\r\n\r\n<en-note style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">\nA really simple example. Another sentence.\n</en-note>')
tree = LH.parse(xml_str)
print(LH.tostring(tree))
You can try replacing the entity names with their numerical values.
http://www.w3schools.com/tags/ref_entities.asp
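For example, staying with the XML parser, substituting the numeric character reference before parsing is enough for this case. A small sketch, assuming the note body contains the &nbsp; the parser complains about (only that one entity is handled here; other named HTML entities would need the same treatment):
from StringIO import StringIO
import lxml.etree as etree

# Hypothetical ENML snippet containing the entity the XML parser rejects.
xml_str = ('<?xml version="1.0" encoding="UTF-8"?>\r\n'
           '<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">\r\n'
           '<en-note>\nA really simple&nbsp;example.\n</en-note>')

# &#160; is the numeric character reference for a non-breaking space.
cleaned = xml_str.replace('&nbsp;', '&#160;')
tree = etree.parse(StringIO(cleaned))
print(etree.tostring(tree))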
I'm trying to figure out how to use lxml to parse XML from a URL and return the value of the title attribute. Does anyone know what I have wrong, or what would return the title value/text? In the example below I want to return 'Weeds - S05E05 - Van Nuys - HD TV'.
XML from URL:
<?xml version="1.0" encoding="UTF-8"?>
<subsonic-response xmlns="http://subsonic.org/restapi" status="ok" version="1.8.0">
<song id="11345" parent="11287" title="Weeds - S05E05 - Van Nuys - HD TV" album="Season 5" artist="Weeds" isDir="false" created="2009-07-06T22:21:16" duration="1638" bitRate="384" size="782304110" suffix="mkv" contentType="video/x-matroska" isVideo="true" path="Weeds/Season 5/Weeds - S05E05 - Van Nuys - HD TV.mkv" transcodedSuffix="flv" transcodedContentType="video/x-flv"/>
</subsonic-response>
My current Python code:
import lxml
from lxml import html
from urllib2 import urlopen
url = 'https://myurl.com'
tree = html.parse(urlopen(url))
songs = tree.findall('{*}song')
for song in songs:
print song.attrib['title']
With the above code I get no data returned. Any ideas?
Printing tree gives:
<lxml.etree._ElementTree object at 0x0000000003348F48>
Printing songs gives:
[]
First of all, you are not actually using lxml in your code. You import the lxml HTML parser, but otherwise ignore it and just use the standard library xml.etree.ElementTree module instead.
Secondly, you search for data/song but you do not have any data elements in your document, so no matches will be found. And last, but not least, you have a document there that uses namespaces. You'll have to include those when searching for elements, or use a {*} wildcard search.
The following finds songs for you:
from lxml import etree
tree = etree.parse(URL) # lxml can load URLs for you
songs = tree.findall('{*}song')
for song in songs:
print song.attrib['title']
To use an explicit namespace, you'd have to replace the {*} wildcard with the full namespace URL; the default namespace is available in the .nsmap dict on the root element:
namespace = tree.getroot().nsmap[None]
songs = tree.findall('{%s}song' % namespace)
The whole issue is that the subsonic-response tag has an xmlns attribute, indicating that an XML namespace is in effect. The code below takes that into account and correctly picks up the song tags.
import xml.etree.ElementTree as ET
root = ET.parse('test.xml').getroot()
print root.findall('{http://subsonic.org/restapi}song')
Thanks for the help, guys. I used a combination of both of your answers to get it working.
import xml.etree.ElementTree as ET
from urllib2 import urlopen
url = 'https://myurl.com'
root = ET.parse(urlopen(url)).getroot()
for song in root:
print song.attrib['title']
I've written a simple script to parse XML chat logs using the BeautifulSoup module. The standard soup.prettify() works OK, except that chat logs have a lot of fluff in them. You can see both the script code and some of the XML input file I'm working with below:
Code
import sys
from BeautifulSoup import BeautifulSoup as Soup
def parseLog(file):
    file = sys.argv[1]
    handler = open(file).read()
    soup = Soup(handler)
    print soup.prettify()

if __name__ == "__main__":
    parseLog(sys.argv[1])
Test XML Input
<?xml version="1.0"?>
<?xml-stylesheet type='text/xsl' href='MessageLog.xsl'?>
<Log FirstSessionID="1" LastSessionID="2"><Message Date="10/31/2010" Time="3:43:48 PM" DateTime="2010-10-31T20:43:48.937Z" SessionID="1"><From><User FriendlyName="Jon"/></From> <To><User FriendlyName="Bill"/></To><Text Style="font-family:Segoe UI; color:#000000; ">hey, what's up?</Text></Message>
<Message Date="10/31/2010" Time="3:44:03 PM" DateTime="2010-10-15T20:44:03.421Z" SessionID="1"><From><User FriendlyName="Jon"/></From><To><User FriendlyName="Bill"/></To><Text Style="font-family:Segoe UI; color:#000000; ">Got your message</Text></Message>
<Message Date="10/31/2010" Time="3:44:31 PM" DateTime="2010-10-15T20:44:31.390Z" SessionID="2"><From><User FriendlyName="Bill"/></From><To><User FriendlyName="Jon"/></To><Text Style="font-family:Segoe UI; color:#000000; ">oh, great</Text></Message>
<Message Date="10/31/2010" Time="3:44:59 PM" DateTime="2010-10-15T20:44:59.281Z" SessionID="2"><From><User FriendlyName="Bill"/></From><To><User FriendlyName="Jon"/></To><Text Style="font-family:Segoe UI; color:#000000; ">hey, i gotta run</Text></Message>
I want to output this in a format like the following, or at least something more readable than raw XML:
Jon:
Hey, what's up? [10/31/10 # 3:43p]
Jon:
Got your message [10/31/10 # 3:44p]
Bill:
oh, great [10/31/10 # 3:44p]
etc. I've heard some decent things about the PyParsing module; maybe it's time to give it a shot.
BeautifulSoup makes getting at attributes and values in XML really simple. I tweaked your example function to use these features.
import sys
from BeautifulSoup import BeautifulSoup as Soup
def parseLog(file):
    file = sys.argv[1]
    handler = open(file).read()
    soup = Soup(handler)
    for message in soup.findAll('message'):
        msg_attrs = dict(message.attrs)
        f_user = message.find('from').user
        f_user_dict = dict(f_user.attrs)
        print "%s: %s [%s # %s]" % (f_user_dict[u'friendlyname'],
                                    message.find('text').decodeContents(),
                                    msg_attrs[u'date'],
                                    msg_attrs[u'time'])

if __name__ == "__main__":
    parseLog(sys.argv[1])
I'd recommend using the built-in ElementTree module. BeautifulSoup is meant to handle badly formed markup like hacked-up HTML, whereas XML is well-formed and meant to be read by an XML library.
Update: some of my recent reading here suggests lxml as a library built on and enhancing the standard ElementTree.
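For completeness, here is a rough ElementTree version of the same formatting (a sketch assuming the log file parses as-is; the element and attribute names come from the sample above, and the output only approximates the desired format):
import sys
import xml.etree.ElementTree as ET

def parse_log(path):
    root = ET.parse(path).getroot()
    for message in root.findall('Message'):
        # FriendlyName lives on the <User> element under <From>.
        sender = message.find('From/User').get('FriendlyName')
        text = message.find('Text').text
        print "%s:\n    %s [%s # %s]" % (sender, text,
                                         message.get('Date'),
                                         message.get('Time'))

if __name__ == "__main__":
    parse_log(sys.argv[1])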