Extracting list or dictionary from xml file

Hello, I have never worked with XML. Can someone help me create a list or dictionary in Python that maps each ID to a specific name (string) from the XML file?
Here is my XML file:
api.brain-map.org/api/v2/data/query.xml?num_rows=10000&start_row=10001&&criteria=model::Gene,rma::criteria,products[abbreviation$eq%27Mouse%27]
Here is a snippet:
<Response success="true" start_row="10001" num_rows="9990" total_rows="19991">
<objects>
<object>
<acronym>Hdac4</acronym>
<alias-tags>4932408F19Rik AI047285</alias-tags>
<chromosome-id>34</chromosome-id>
<ensembl-id nil="true"/>
<entrez-id>208727</entrez-id>
<genomic-reference-update-id>491928275</genomic-reference-update-id>
<homologene-id>55946</homologene-id>
<id>84010</id>
<legacy-ensembl-gene-id nil="true"/>
<name>histone deacetylase 4</name>
<organism-id>2</organism-id>
<original-name>histone deacetylase 4</original-name>
<original-symbol>Hdac4</original-symbol>
<reference-genome-id nil="true"/>
<sphinx-id>188143</sphinx-id>
<version-status>no change</version-status>
</object>
<object>
<acronym>Prss54</acronym>
<alias-tags>4931432M23Rik Klkbl4</alias-tags>
<chromosome-id>53</chromosome-id>
<ensembl-id nil="true"/>
<entrez-id>70993</entrez-id>
<genomic-reference-update-id>491928275</genomic-reference-update-id>
<homologene-id>19278</homologene-id>
<id>46834</id>
<legacy-ensembl-gene-id nil="true"/>
<name>protease, serine 54</name>
<organism-id>2</organism-id>
<original-name>protease, serine, 54</original-name>
<original-symbol>Prss54</original-symbol>
<reference-genome-id nil="true"/>
<sphinx-id>65991</sphinx-id>
<version-status>updated</version-status>
</object>
<object>
...
So in the end I want a dictionary or list that says
208727 is Hdac4, and likewise for every entry in my two XML files.
So I need the Entrez ID and the original symbol.
I want to extract that from two XML files:
http://api.brain-map.org/api/v2/data/query.xml?num_rows=10000&start_row=1&&criteria=model::Gene,rma::criteria,products[abbreviation$eq%27Mouse%27]
and
http://api.brain-map.org/api/v2/data/query.xml?num_rows=10000&start_row=10001&&criteria=model::Gene,rma::criteria,products[abbreviation$eq%27Mouse%27]
Can someone help me with that?
I am not sure which format I should store it in. In the end I want to search for the ID and get the original name.

I have seen a similar question about XML, and the same approach should work here.
You can use the Python library lxml (see its documentation).
You can start with:
import requests
from lxml import etree
# Edit: yes, BeautifulSoup works too, as the other answer says
from bs4 import BeautifulSoup
url = "http://api.brain-map.org/api/v2/data/query.xml?num_rows=10000&start_row=10001&&criteria=model::Gene,rma::criteria,products[abbreviation$eq%27Mouse%27]"
req = requests.get(url)
doc = req.content  # bytes: lxml refuses a str that carries an XML encoding declaration
root = etree.XML(doc)  # works with this, or ...
soup = BeautifulSoup(doc, 'xml')  # ... with this
Then you need to read the docs to see how to navigate by tag.

If you have the XML stored in a file called results.xml,
then using BeautifulSoup is as simple as:
from bs4 import BeautifulSoup
with open('results.xml') as f:
    soup = BeautifulSoup(f.read(), 'xml')
final_dictionary = {}
# 'obj' rather than 'object', so the built-in name is not shadowed
for obj in soup.find_all('object'):
    final_dictionary[obj.find('acronym').string] = obj.find('entrez-id').string
print(final_dictionary)
If instead, you want to retrieve XML from a URL, then that is also simple
import requests
from bs4 import BeautifulSoup
url = "<your_url>"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'xml')
# Once you have the 'soup' variable assigned
# It's the same code as above example from here on
Output
{'Hdac4': '208727', 'Prss54': '70993'}
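The question asks for the lookup in the other direction (Entrez ID to original symbol) across two result pages. A minimal sketch of that, using the standard-library ElementTree instead of BeautifulSoup and assuming the element names shown in the question's snippet. The helper name `id_to_symbol` and the inline sample are illustrative only; in practice each page would be the text downloaded from one of the two URLs.

```python
import xml.etree.ElementTree as ET

# Abridged sample based on the snippet in the question.
SAMPLE = """<Response success="true">
  <objects>
    <object>
      <entrez-id>208727</entrez-id>
      <original-symbol>Hdac4</original-symbol>
    </object>
    <object>
      <entrez-id>70993</entrez-id>
      <original-symbol>Prss54</original-symbol>
    </object>
  </objects>
</Response>"""


def id_to_symbol(xml_text):
    """Map each Entrez ID to its original symbol (hypothetical helper)."""
    mapping = {}
    for obj in ET.fromstring(xml_text).iter("object"):
        entrez_id = obj.findtext("entrez-id")
        symbol = obj.findtext("original-symbol")
        # Skip records where either value is missing or empty. Whether
        # entrez-id can be nil="true" in the real data is an assumption.
        if entrez_id and symbol:
            mapping[entrez_id] = symbol
    return mapping


# Merge the dictionaries built from each page of results.
pages = [SAMPLE]  # assumption: one XML string per downloaded URL
lookup = {}
for page in pages:
    lookup.update(id_to_symbol(page))

print(lookup["208727"])  # prints: Hdac4
```

A plain dictionary keyed by ID fits the stated goal of searching by ID, since each lookup is a single `lookup[some_id]`.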

Related

use pyquery to filter html

I'm trying to parse HTML with pyquery, and I've run into an issue I don't understand. My code is below:
from pyquery import PyQuery as pq
document = pq('<p id="hello">Hello</p><p id="world">World !!</p>')
p = document('p')
print(p.filter("#hello"))
I expect the printed result to be:
<p id="hello">Hello</p>
But the actual output is:
<p id="hello">Hello</p><p id="world">World !!</p></div></html>
If I want only the specified part of the HTML rather than the rest of the document, how should I write it?
Thanks

You can use the built-in ElementTree library:
import xml.etree.ElementTree as ET
html = '''<html><p id="hello">Hello</p><p id="world">World !!</p></html>'''
root = ET.fromstring(html)
p = root.find('.//p[@id="hello"]')
print(ET.tostring(p))
output
b'<p id="hello">Hello</p>'

Parsing of xml in Python

I am having an issue parsing an XML result with Python. I tried etree.Element(text), but the error says "Invalid tag name". Does anyone know whether this is actually XML, and is there a way to parse the result with a standard package? Thank you!
import requests, sys, json
from lxml import etree
response = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=1593319917&report=XML")
text=response.text
print(text)
<?xml version="1.0" ?>
<ExchangeSet xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xmlns="https://www.ncbi.nlm.nih.gov/SNP/docsum" xsi:schemaLocation="https://www.ncbi.nlm.nih.gov/SNP/docsum ftp://ftp.ncbi.nlm.nih.gov/snp/specs/docsum_eutils.xsd" ><DocumentSummary uid="1593319917"><SNP_ID>1593319917</SNP_ID><ALLELE_ORIGIN/><GLOBAL_MAFS><MAF><STUDY>SGDP_PRJ</STUDY><FREQ>G=0.5/1</FREQ></MAF></GLOBAL_MAFS><GLOBAL_POPULATION/><GLOBAL_SAMPLESIZE>0</GLOBAL_SAMPLESIZE><SUSPECTED/><CLINICAL_SIGNIFICANCE/><GENES><GENE_E><NAME>FLT3</NAME><GENE_ID>2322</GENE_ID></GENE_E></GENES><ACC>NC_000013.11</ACC><CHR>13</CHR><HANDLE>SGDP_PRJ</HANDLE><SPDI>NC_000013.11:28102567:G:A</SPDI><FXN_CLASS>upstream_transcript_variant</FXN_CLASS><VALIDATED>by-frequency</VALIDATED><DOCSUM>HGVS=NC_000013.11:g.28102568G>A,NC_000013.10:g.28676705G>A,NG_007066.1:g.3001C>T|SEQ=[G/A]|LEN=1|GENE=FLT3:2322</DOCSUM><TAX_ID>9606</TAX_ID><ORIG_BUILD>154</ORIG_BUILD><UPD_BUILD>154</UPD_BUILD><CREATEDATE>2020/04/27 06:19</CREATEDATE><UPDATEDATE>2020/04/27 06:19</UPDATEDATE><SS>3879653181</SS><ALLELE>R</ALLELE><SNP_CLASS>snv</SNP_CLASS><CHRPOS>13:28102568</CHRPOS><CHRPOS_PREV_ASSM>13:28676705</CHRPOS_PREV_ASSM><TEXT/><SNP_ID_SORT>1593319917</SNP_ID_SORT><CLINICAL_SORT>0</CLINICAL_SORT><CITED_SORT/><CHRPOS_SORT>0028102568</CHRPOS_SORT><MERGED_SORT>0</MERGED_SORT></DocumentSummary>
</ExchangeSet>

You're using the wrong method to parse your XML. The etree.Element
class is for creating a single XML element. For example:
>>> a = etree.Element('a')
>>> a
<Element a at 0x7f8c9040e180>
>>> etree.tostring(a)
b'<a/>'
As Jayvee has pointed out, to parse XML contained in a string you use
the etree.fromstring method (to parse XML content in a file you
would use the etree.parse method):
>>> response = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=1593319917&report=XML")
>>> doc = etree.fromstring(response.text)
>>> doc
<Element {https://www.ncbi.nlm.nih.gov/SNP/docsum}ExchangeSet at 0x7f8c9040e180>
>>>
Note that because this XML document sets a default namespace, you'll
need to supply that namespace when looking for elements. E.g., this
will fail:
>>> doc.find('DocumentSummary')
>>>
But this works:
>>> doc.find('docsum:DocumentSummary', {'docsum': 'https://www.ncbi.nlm.nih.gov/SNP/docsum'})
<Element {https://www.ncbi.nlm.nih.gov/SNP/docsum}DocumentSummary at 0x7f8c8e987200>
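The same namespace mapping can be reused for nested lookups. A small sketch, using the standard-library ElementTree rather than lxml and an abridged copy of the response from the question so it runs offline. The prefix `d` is an arbitrary choice; only the URI has to match the document.

```python
import xml.etree.ElementTree as ET

# Abridged from the response shown in the question.
SAMPLE = (
    '<ExchangeSet xmlns="https://www.ncbi.nlm.nih.gov/SNP/docsum">'
    '<DocumentSummary uid="1593319917">'
    "<SNP_ID>1593319917</SNP_ID>"
    "<GENES><GENE_E><NAME>FLT3</NAME><GENE_ID>2322</GENE_ID></GENE_E></GENES>"
    "</DocumentSummary>"
    "</ExchangeSet>"
)

# Map an arbitrary prefix to the document's default namespace URI.
NS = {"d": "https://www.ncbi.nlm.nih.gov/SNP/docsum"}

root = ET.fromstring(SAMPLE)
summary = root.find("d:DocumentSummary", NS)

# Every step of the path needs the prefix, not just the first one.
gene_name = summary.findtext("d:GENES/d:GENE_E/d:NAME", namespaces=NS)

print(summary.get("uid"), gene_name)  # prints: 1593319917 FLT3
```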

You can check whether the XML is well formed by trying to parse it:
import requests
from lxml import etree
response = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=1593319917&report=XML")
text = response.text
try:
    doc = etree.fromstring(text)
    print("valid")
except etree.XMLSyntaxError:
    # raised by lxml when the input is not well-formed XML
    print("not a valid xml")

How to get value from XML file?

I have this XML file, and I only need to get the value of steamID64 (76561198875082603).
<profile>
<steamID64>76561198875082603</steamID64>
<steamID>...</steamID>
<onlineState>online</onlineState>
<stateMessage>...</stateMessage>
<privacyState>public</privacyState>
<visibilityState>3</visibilityState>
<avatarIcon>...</avatarIcon>
<avatarMedium>...</avatarMedium>
<avatarFull>...</avatarFull>
<vacBanned>0</vacBanned>
<tradeBanState>None</tradeBanState>
<isLimitedAccount>0</isLimitedAccount>
<customURL>...</customURL>
<memberSince>December 8, 2018</memberSince>
<steamRating/>
<hoursPlayed2Wk>0.0</hoursPlayed2Wk>
<headline>...</headline>
<location>...</location>
<realname>
<![CDATA[ THEMakci7m87 ]]>
</realname>
<summary>...</summary>
<mostPlayedGames>...</mostPlayedGames>
<groups>...</groups>
</profile>
So far I only have this code:
xml_url = f'{url}?xml=1'
After that, I don't know what to do.

It's fairly simple with lxml:
import lxml.html as lh
steam = """your html above"""
doc = lh.fromstring(steam)
doc.xpath('//steamid64/text()')
Output:
['76561198875082603']
Edit:
With the actual url, it's clear that the underlying data is xml; so the better way to do it is:
import requests
from lxml import etree
url = 'https://steamcommunity.com/id/themakci7m87/?xml=1'
req = requests.get(url)
doc = etree.XML(req.text.encode())
doc.xpath('//steamID64/text()')
Same output.

It is better to use the built-in XML library, ElementTree;
lxml is an external XML library that requires a separate installation.
See below:
import requests
import xml.etree.ElementTree as ET
r = requests.get('https://steamcommunity.com/id/themakci7m87/?xml=1')
if r.status_code == 200:
    root = ET.fromstring(r.text)
    steam_id_64 = root.find('./steamID64').text
    print(steam_id_64)
else:
    print('Failed to read data.')
Output:
76561198875082603
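If a tag might be absent, `root.find(...).text` raises an AttributeError because `find` returns None. A minimal sketch of a more forgiving variant using `findtext`, run against an abridged copy of the profile from the question. The tag name `noSuchTag` is made up to show the fallback.

```python
import xml.etree.ElementTree as ET

# Abridged from the profile in the question.
SAMPLE = """<profile>
  <steamID64>76561198875082603</steamID64>
  <realname>
    <![CDATA[ THEMakci7m87 ]]>
  </realname>
</profile>"""

root = ET.fromstring(SAMPLE)

# findtext() returns the default instead of raising when the tag is absent.
steam_id_64 = root.findtext("steamID64", default="")

# CDATA content is returned as ordinary text; strip the padding around it.
real_name = root.findtext("realname", default="").strip()

# Hypothetical tag that does not exist in the document.
missing = root.findtext("noSuchTag", default="(not present)")

print(steam_id_64, real_name, missing)
# prints: 76561198875082603 THEMakci7m87 (not present)
```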

How to parse out xml from noisy file using python

I have a file which contains a bunch of logging information, including XML. I'd like to extract the XML portion into a string object so I can then run some XPath queries on it to ensure the existence of certain information on the 'data' element.
File to parse:
Requesting event notifications...
Receiving command objects...
<?xml version="1.0" encoding="UTF-8"?><Root xmlns="http://schemas.com/service" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><data id="123" interface="2017.1" implementation="2016.122-SNAPSHOT" Version="2016.1.2700-SNAPSHOT"></data></Root>
All information has been collected
Command execution successful...
Python:
import re
with open('./output.out', 'r') as outFile:
    data = outFile.read().replace('\n','')
regex = re.escape("<.*?>.*?<\/Root>");
p = re.compile(regex)
m = p.match(data)
if m:
    print(m.group())
else:
    print('No match')
Output:
No match
What am I doing wrong? How can I accomplish my goal? Any help would be much appreciated.

Thou shalt never use regular expressions for parsing XML/HTML. There is BeautifulSoup for this daunting task.
import bs4
soup = bs4.BeautifulSoup(open("output.out").read(), "lxml")
roots = soup.findAll('root')
#[<root xmlns="http://schemas.com/service"
# xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
# <data id="123" implementation="2016.122-SNAPSHOT" interface="2017.1"
# version="2016.1.2700-SNAPSHOT"></data></root>]
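roots[0] is the parsed root element (a bs4 Tag). You can do anything you want with it. Note that the "lxml" HTML parser lowercases tag and attribute names, as the output above shows.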
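As for why the original attempt printed "No match": `re.escape` turns every metacharacter into a literal, so the pattern looks for the literal text `<.*?>.*?</Root>`, and `match` only tries at position 0, where the log text begins. A sketch of an alternative that uses a regular expression only to locate the document inside the noise and leaves the actual parsing to ElementTree. It assumes the log contains a single XML document whose root element is named Root.

```python
import re
import xml.etree.ElementTree as ET

# The log from the question, inlined so the example runs without a file.
LOG = """Requesting event notifications...
Receiving command objects...
<?xml version="1.0" encoding="UTF-8"?><Root xmlns="http://schemas.com/service" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><data id="123" interface="2017.1" implementation="2016.122-SNAPSHOT" Version="2016.1.2700-SNAPSHOT"></data></Root>
All information has been collected
Command execution successful...
"""

# search() scans the whole string; match() would only try position 0.
# DOTALL lets "." cross newlines in case the XML spans several lines.
found = re.search(r"<Root\b.*?</Root>", LOG, re.DOTALL)
if found is None:
    raise SystemExit("No XML found")

# Hand the extracted text to a real XML parser for the queries.
root = ET.fromstring(found.group())
ns = {"s": "http://schemas.com/service"}  # the document's default namespace
data = root.find("s:data", ns)

print(data.get("id"), data.get("interface"))  # prints: 123 2017.1
```

Parsing with ElementTree keeps the original attribute case (`Version`), which an HTML parser would lowercase.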

How to find the value in a particular tag element in XML using Python?

I am trying to parse XML data received from a RESTful interface. In error conditions (when the query returns nothing on the server), I get the following text back. I want to parse this string and find the value of the status attribute on the fifth line of the example below. How can I tell whether status is present and, if it is, what its value is?
content = """
<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="/3.0/style/exchange.xsl"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink">
<ops:meta name="elapsed-time" value="3"/>
<exchange-documents>
<exchange-document system="ops.epo.org" country="US" doc-number="20060159695" status="not found">
<bibliographic-data>
<publication-reference>
<document-id document-id-type="epodoc">
<doc-number>US20060159695</doc-number>
</document-id>
</publication-reference>
<parties/>
</bibliographic-data>
</exchange-document>
</exchange-documents>
</ops:world-patent-data>
"""
import xml.etree.ElementTree as ET
root = ET.fromstring(content)
res = root.iterfind(".//{http://www.epo.org/exchange}exchange-documents[@status='not found']/..")

Just use BeautifulSoup:
from bs4 import BeautifulSoup
with open('xml.txt') as f:
    # html.parser ships with Python and tolerates the leading blank line
    soup = BeautifulSoup(f, 'html.parser')
# find() returns the first matching tag; find_all() returns a list
print(soup.find('exchange-document')['status'])
#> not found
If you store every XML response in a single file, it is useful to iterate over them:
for tag in soup.find_all('exchange-document'):
    print(tag['status'])
#> not found
This displays the status attribute of every exchange-document element.
And if you only want the useful statuses:
for tag in soup.find_all('exchange-document'):
    if tag['status'] != 'not found':
        print(tag['status'])

Try this:
from xml.dom.minidom import parse
xmldoc = parse(filename)
elementList = xmldoc.getElementsByTagName(tagName)
elementList will contain all elements with the tag name you specify, then you can iterate over those.
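Applied to the response in the question, the minidom approach might look like the sketch below. The sample is abridged from the question, and `parseString` stands in for `parse` because the data is already a string rather than a file.

```python
from xml.dom.minidom import parseString

# Abridged from the response in the question, including the leading newline.
CONTENT = """
<?xml version="1.0" encoding="UTF-8"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org">
  <exchange-documents>
    <exchange-document system="ops.epo.org" country="US" doc-number="20060159695" status="not found"/>
  </exchange-documents>
</ops:world-patent-data>
"""

# strip() matters: the XML declaration must be the very first thing in the input.
doc = parseString(CONTENT.strip())

statuses = []
for element in doc.getElementsByTagName("exchange-document"):
    # hasAttribute() distinguishes "no status" from an empty status value.
    if element.hasAttribute("status"):
        statuses.append(element.getAttribute("status"))

print(statuses)  # prints: ['not found']
```

An empty list then means that no exchange-document element carried a status attribute.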
