How to extract one value from an xml url using xml.etree - python

I'm trying to print the value of just one field of an XML tree. Here is an example of the XML I get back from the request:
<puco>
<resultado>OK</resultado>
<coberturaSocial>O.S.P. TIERRA DEL FUEGO(IPAUSS)</coberturaSocial>
<denominacion>DAMIAN GUTIERREZ DEL RIO</denominacion>
<nrodoc>32443324</nrodoc>
<rnos>924001</rnos>
<tipodoc>DNI</tipodoc>
</puco>
Now I just want to print the "coberturaSocial" value. Here is the request I have in my views.py:
def get(request):
    r = requests.get('https://sisa.msal.gov.ar/sisa/services/rest/puco/38785898')
    dom = r.content
    asd = etree.fromstring(dom)
If I print "asd" I get this error: The view didn't return an HttpResponse object. It returned None instead.
I also get output in the console.
I just want to print coberturaSocial. Please help, I'm new to XML parsing!

You need to extract the contents of the tag and then return it wrapped in a response, like so:
return HttpResponse(asd.find('coberturaSocial').text)

I'm guessing etree comes from: import xml.etree.ElementTree as etree
You can use:
text = r.content
dom = etree.fromstring(text)
el = dom.find('coberturaSocial')
el.text # this is where the string is
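
A minimal, offline sketch of the same idea, assuming the response body matches the sample XML in the question. The helper name extract_field is hypothetical, and the bytes literal stands in for r.content so the snippet runs without a network call:

```python
import xml.etree.ElementTree as etree

# Stand-in for r.content: the sample XML from the question, as bytes.
xml_bytes = b"""<puco>
<resultado>OK</resultado>
<coberturaSocial>O.S.P. TIERRA DEL FUEGO(IPAUSS)</coberturaSocial>
<denominacion>DAMIAN GUTIERREZ DEL RIO</denominacion>
<nrodoc>32443324</nrodoc>
<rnos>924001</rnos>
<tipodoc>DNI</tipodoc>
</puco>"""


def extract_field(payload, tag):
    """Return the text of the first direct child named `tag`, or None if absent."""
    root = etree.fromstring(payload)
    # findtext avoids an AttributeError when the element is missing
    return root.findtext(tag)


cobertura = extract_field(xml_bytes, "coberturaSocial")
missing = extract_field(xml_bytes, "noSuchField")
print(cobertura)
```

In a Django view the extracted string would still need to be wrapped in an HttpResponse before being returned, as the first answer notes.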

Related

Get the URL from XML tag

My XML file:
<xml
xmlns="http://www.myweb.org/2003/instance"
xmlns:link="http://www.myweb.org/2003/linkbase"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:iso4217="http://www.myweb.org/2003/iso4217"
xmlns:utr="http://www.myweb.org/2009/utr">
<link:schemaRef xlink:type="simple" xlink:href="http://www.myweb.com/form/2020-01-01/test.xsd"></link:schemaRef>
I want to get the URL: http://www.myweb.com/folder/form/1/2020-01-01/test.xsd from the <link:schemaRef> tag.
My Python code below finds the <link:schemaRef> tag, but I am unable to retrieve the URL.
from lxml import etree
with open(filepath,'rb') as f:
    file = f.read()
root = etree.XML(file)
print(root.nsmap["link"]) #http://www.myweb.org/2003/linkbase
print(root.find(".//{"+root.nsmap["link"]+"}"+"schemaRef"))
Try it this way and see if it works:
for i in root.xpath('//*/node()'):
    # etree is the name imported in the question (from lxml import etree)
    if isinstance(i, etree._Element):
        print(i.values()[1])
Output:
http://www.myweb.com/form/2020-01-01/test.xsd
Use:
>>> child = root.getchildren()[0]
>>> child.attrib
{'{http://www.w3.org/1999/xlink}type': 'simple', '{http://www.w3.org/1999/xlink}href': 'http://www.myweb.com/form/2020-01-01/test.xsd'}
>>> url = child.attrib['{http://www.w3.org/1999/xlink}href']
However, I believe the real challenge is knowing which key to use (i.e. {http://www.w3.org/1999/xlink}href). If that is the issue, build the key from the namespace map:
>>> key_prefix = root.nsmap['xlink'] # the requested URL is the href attribute in the xlink namespace
>>> print(key_prefix)
http://www.w3.org/1999/xlink
>>> key_url = "{"+key_prefix+"}href"
>>> print(child.attrib[key_url])
http://www.myweb.com/form/2020-01-01/test.xsd
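
For comparison, here is a rough sketch of the same lookup with the standard library's xml.etree.ElementTree instead of lxml. It assumes the document is closed with a matching </xml> tag (the question's sample is truncated) and keeps only the namespace declarations the lookup needs:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's XML; the closing </xml> tag is an assumption.
SAMPLE = """<xml
    xmlns="http://www.myweb.org/2003/instance"
    xmlns:link="http://www.myweb.org/2003/linkbase"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <link:schemaRef xlink:type="simple" xlink:href="http://www.myweb.com/form/2020-01-01/test.xsd"></link:schemaRef>
</xml>"""

# Prefix-to-URI map used for lookups; the prefixes here are local to this script.
NS = {
    "link": "http://www.myweb.org/2003/linkbase",
    "xlink": "http://www.w3.org/1999/xlink",
}

root = ET.fromstring(SAMPLE)
schema_ref = root.find("link:schemaRef", NS)

# Attribute names are stored in {namespace-uri}localname form.
href = schema_ref.get("{%s}href" % NS["xlink"])
print(href)
```

Unlike lxml, the standard library parser has no root.nsmap, so the namespace URIs have to be supplied by hand.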

Getting the xml soap response instead of raw data

I am trying to get a SOAP response, read a few tags, and put the keys and values in a dictionary.
It would be better if I could use the response directly and perform the required operations on it.
Since I was not able to do that, I tried storing the response in an XML file and using that file for the operations.
My problem is that the response comes back in raw form. How can I resolve this?
Example: <medical:totEeCnt val="2" />
<medical:totMbrCnt val="2" />
<medical:totDepCnt val="0" />
def soapTest():
    request = """<soapenv:Envelope.......
    auth = HTTPBasicAuth('', '')
    headers = {'content-type': 'application/soap+xml', 'SOAPAction': "", 'Host': 'bfx-b2b....com'}
    url = "https://bfx-b2b....com/B2BWEB/services/IProductPort"
    response = requests.post(url, data=request, headers=headers, auth=auth, verify=True)
    # Open local file
    fd = os.open('planRates.xml', os.O_RDWR|os.O_CREAT)
    # Convert response object into string
    response_str = str(response.content)
    # Write response to the file
    os.write(fd, response_str)
    # Close the file
    os.close(fd)
    tree = ET.parse('planRates.xml')
    root = tree.getroot()
    dict = {}
    print root
    for plan in root.findall('.//{http://services.b2b.../types/rates/dental}dentPln'): # type: object
        plan_id = plan.get('cd')
        print plan
        print plan_id
        for rtGroup in plan.findall('.//{http://services.b2b....com/types/rates/dental}censRtGrp'):
            #print rtGroup
            for amt in rtGroup.findall('.//{http://services.b2b....com/types/rates/dental}totAnnPrem'):
                # print amt
                print amt.get('val')
                amount = amt.get('val')
                dict[plan_id] = amount
    print dict
Update:
I tried a few things. What I don't understand is that
when I use this, the subsequent operations work:
tree = ET.parse('data/planRates.xml')
root = tree.getroot()
dict = {}
print tree
print root
for plan in root.findall(..
output -
<xml.etree.ElementTree.ElementTree object at 0x100d7b910>
<Element '{http://schemas.xmlsoap.org/soap/envelope/}Envelope' at 0x101500450>
But after switching to this, it does not work:
tree = ET.fromstring(response.text)
print tree
for plan in tree.findall(..
output-:
<Element '{http://schemas.xmlsoap.org/soap/envelope/}Envelope' at 0x10d624910>
Basically, I am using the same object in both cases.
Suppose you get a response that you want to turn into a proper XML object:
rt = resp.text.encode('utf-8')
# printed, rt is something like:
# '<soap:Envelope xmlns:soap="http://...">
# <soap:Envelope
# <AirShoppingRS xmlns="http://www.iata.org/IATA/EDIST" Version="16.1">
# <Document>...</soap:Envelope</soap:Envelope>'
# stripping the SOAP envelope (bytes literals, because rt is bytes after encode)
startTag = b'<AirShoppingRS '
endTag = b'</AirShoppingRS>'
trimmed = rt[rt.find(startTag): rt.find(endTag) + len(endTag)]
# parsing
from lxml import etree as et
root = et.fromstring(trimmed)
With this root element you can use the find method, XPath, or whatever you prefer.
Obviously you need to change the start and end tags to extract the correct element from the response, but you get the idea, right?
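
As an alternative to trimming the string, the response bytes can usually be parsed directly and searched with namespace-qualified paths. The sketch below is only an illustration: the envelope, the http://example.com/types/rates/dental namespace, and the plan codes are made-up stand-ins for response.content, since the real service URLs in the question are elided.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for response.content; structure and namespace are invented.
SOAP = b"""<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:dental="http://example.com/types/rates/dental">
  <soapenv:Body>
    <dental:dentPln cd="PLAN_A">
      <dental:censRtGrp><dental:totAnnPrem val="120.50"/></dental:censRtGrp>
    </dental:dentPln>
    <dental:dentPln cd="PLAN_B">
      <dental:censRtGrp><dental:totAnnPrem val="98.00"/></dental:censRtGrp>
    </dental:dentPln>
  </soapenv:Body>
</soapenv:Envelope>"""

NS = {"dental": "http://example.com/types/rates/dental"}

# Parse the bytes directly; no temporary file and no str() conversion needed.
root = ET.fromstring(SOAP)

premiums = {}
for plan in root.findall(".//dental:dentPln", NS):
    amt = plan.find(".//dental:totAnnPrem", NS)
    if amt is not None:
        # Map the plan code to its total annual premium attribute
        premiums[plan.get("cd")] = amt.get("val")

print(premiums)
```

Passing the raw bytes (response.content) rather than str(response.content) lets the parser handle the encoding itself.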

OpenXML Tag Formatting

I am attempting to parse Open XML from a Microsoft Word document. However, whenever I look at any tag or attribute, I receive the tag I want preceded by the openxmlformats namespace. Examples below. Does anybody know how I can remove this and receive only my tag id and value?
Current format:
for content in root.iter():
    print(content.tag)
returns:
'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}tag'
and
for content in root.iter('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}tag'):
    print(content.attrib)
returns
'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': 'Orange'
Desired Output:
for content in root.iter():
    print(content.tag)
returns
tag
and
for content in root.iter('tag'):
    print(content.attrib)
returns
val : 'Orange'
You can write your own version of the iterator that does what you want:
from collections import namedtuple
import re
my_content = namedtuple('my_content', ['tag', 'attrib'])
def remove_namespace(name):
    # strip a leading {namespace-uri} prefix of any length
    return re.sub(r'^\{[^}]*\}', '', name)
def my_iterator(root, tag=None, namespace='{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'):
    iterator = root.iter() if tag is None else root.iter(namespace + tag)
    for content in iterator:
        tag = remove_namespace(content.tag)
        attrib = {remove_namespace(key): val for key, val in content.attrib.items()}
        yield my_content(tag, attrib)
This will return objects that only have the tag and attrib attributes. You will have to write a more complex proxy object if you want more detailed functionality. You can use the generator as a replacement for the previous:
for content in my_iterator(root):
    print(content.tag)
and
for content in my_iterator(root, 'tag'):
    print(content.attrib)
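
To check the namespace-stripping step in isolation, here is a small self-contained sketch. The two-element document is a made-up fragment in the WordprocessingML namespace, not real Word output, and local_name is a hypothetical helper name:

```python
import re
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Invented fragment, just enough to exercise a namespaced tag and attribute.
SAMPLE = '<w:sdtPr xmlns:w="%s"><w:tag w:val="Orange"/></w:sdtPr>' % W_NS


def local_name(qualified):
    """Drop a leading {namespace-uri} prefix, if there is one."""
    return re.sub(r'^\{[^}]*\}', '', qualified)


root = ET.fromstring(SAMPLE)

# Collect (tag, attributes) pairs with namespaces removed from both.
stripped = [
    (local_name(el.tag), {local_name(k): v for k, v in el.attrib.items()})
    for el in root.iter()
]
print(stripped)
```

Note that dropping namespaces can make two different attributes collide if they share a local name, so this is best kept for display rather than for lookups.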

Parsing HTML using LXML Python

I'm trying to parse Oxford Dictionary in order to obtain the etymology of a given word.
class SkipException (Exception):
    def __init__(self, value):
        self.value = value
try:
    doc = lxml.html.parse(urlopen('https://en.oxforddictionaries.com/definition/%s' % "good"))
except SkipException:
    doc = ''
if doc:
    table = []
    trs = doc.xpath("//div[1]/div[2]/div/div/div/div[1]/section[5]/div/p")
I cannot seem to work out how to obtain the string of text I need. I know the code I have copied is missing some lines, but I don't fully understand how HTML or LXML works. I would much appreciate it if someone could show me the correct way to solve this.
You don't want to do web scraping, especially when probably every dictionary has an API. In the case of Oxford, create an account at https://developer.oxforddictionaries.com/. Get the API credentials from your account and do something like this:
import requests
import json
api_base = 'https://od-api.oxforddictionaries.com:443/api/v1/entries/{}/{}'
language = 'en'
word = 'parachute'
headers = {
    'app_id': '',
    'app_key': ''
}
url = api_base.format(language, word)
reply = requests.get(url, headers=headers)
if reply.ok:
    reply_dict = json.loads(reply.text)
    results = reply_dict.get('results')
    if results:
        headword = results[0]
        entries = headword.get('lexicalEntries')[0].get('entries')
        if entries:
            entry = entries[0]
            senses = entry.get('senses')
            if senses:
                sense = senses[0]
                print(sense.get('short_definitions'))
Here's a sample to get you started scraping Oxford dictionary pages:
import lxml.html as lh
from urllib.request import urlopen
url = 'https://en.oxforddictionaries.com/definition/parachute'
html = urlopen(url)
root = lh.parse(html)
body = root.find("body")
# select every span whose class attribute is exactly 'ind'
elements = body.xpath("//span[@class='ind']")
for element in elements:
    print(element.text)
To find the correct search string you need to format the html so you can see the structure. I used the html formatter at https://www.freeformatter.com/html-formatter.html. Looking at the formatted HTML, I could see the definitions were in the span elements with the 'ind' class attribute.

How to parse shrink the web xml with ElementTree

I am trying to use the Shrink The Web service for site thumbnails. They have an API that returns XML telling you whether the site thumbnail can be created. I am trying to use ElementTree to parse the XML, but I'm not sure how to get to the information I need. Here is an example of the XML response:
<?xml version="1.0" encoding="UTF-8"?>
<stw:ThumbnailResponse xmlns:stw="http://www.shrinktheweb.com/doc/stwresponse.xsd">
<stw:Response>
<stw:ThumbnailResult>
<stw:Thumbnail Exists="false"></stw:Thumbnail>
<stw:Thumbnail Verified="false">fix_and_retry</stw:Thumbnail>
</stw:ThumbnailResult>
<stw:ResponseStatus>
<stw:StatusCode>Blank Detected</stw:StatusCode>
</stw:ResponseStatus>
<stw:ResponseTimestamp>
<stw:StatusCode></stw:StatusCode>
</stw:ResponseTimestamp>
<stw:ResponseCode>
<stw:StatusCode></stw:StatusCode>
</stw:ResponseCode>
<stw:CategoryCode>
<stw:StatusCode>none</stw:StatusCode>
</stw:CategoryCode>
<stw:Quota_Remaining>
<stw:StatusCode>1</stw:StatusCode>
</stw:Quota_Remaining>
</stw:Response>
</stw:ThumbnailResponse>
I need to get the "stw:StatusCode". If I try to do a find on "stw:StatusCode" I get an "expected path separator" syntax error. Is there a way to just get the status code?
Grrr namespaces ....try this:
STW_PREFIX = "{http://www.shrinktheweb.com/doc/stwresponse.xsd}"
(see line 2 of your sample XML)
Then when you want a tag like stw:StatusCode, use STW_PREFIX + "StatusCode"
Update: That XML response isn't the most brilliant design. It's not possible to guess from your single example whether there can be more than 1 2nd-level node. Note that each 3rd-level node has a "StatusCode" child. Here is some rough-and-ready code that shows you (1) why you need that STW_PREFIX caper (2) an extract of the usable info.
import xml.etree.ElementTree as et  # cElementTree was removed in Python 3.9
def showtag(elem):
    # drop the {namespace} prefix and keep the local tag name
    return repr(elem.tag.rsplit("}")[1])
def showtext(elem):
    return None if elem.text is None else repr(elem.text.strip())
root = et.fromstring(xml_response) # xml_response is your input string
print(repr(root.tag)) # see exactly what tag is in the element
for child in root[0]:
    print(showtag(child), showtext(child))
    for gc in child:
        print("...", showtag(gc), showtext(gc), gc.attrib)
Result:
'{http://www.shrinktheweb.com/doc/stwresponse.xsd}ThumbnailResponse'
'ThumbnailResult' ''
... 'Thumbnail' None {'Exists': 'false'}
... 'Thumbnail' 'fix_and_retry' {'Verified': 'false'}
'ResponseStatus' ''
... 'StatusCode' 'Blank Detected' {}
'ResponseTimestamp' ''
... 'StatusCode' None {}
'ResponseCode' ''
... 'StatusCode' None {}
'CategoryCode' ''
... 'StatusCode' 'none' {}
'Quota_Remaining' ''
... 'StatusCode' '1' {}
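
If only one status code is needed, a namespace map passed to find or findtext avoids building the prefixed tag by hand. This is a minimal sketch against a shortened copy of the sample response (several empty sections are dropped for brevity); the "stw" key in the map is a local label and need not match the prefix used in the document:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample response from the question.
XML_RESPONSE = b"""<?xml version="1.0" encoding="UTF-8"?>
<stw:ThumbnailResponse xmlns:stw="http://www.shrinktheweb.com/doc/stwresponse.xsd">
  <stw:Response>
    <stw:ThumbnailResult>
      <stw:Thumbnail Exists="false"></stw:Thumbnail>
      <stw:Thumbnail Verified="false">fix_and_retry</stw:Thumbnail>
    </stw:ThumbnailResult>
    <stw:ResponseStatus>
      <stw:StatusCode>Blank Detected</stw:StatusCode>
    </stw:ResponseStatus>
    <stw:Quota_Remaining>
      <stw:StatusCode>1</stw:StatusCode>
    </stw:Quota_Remaining>
  </stw:Response>
</stw:ThumbnailResponse>"""

NS = {"stw": "http://www.shrinktheweb.com/doc/stwresponse.xsd"}

root = ET.fromstring(XML_RESPONSE)

# Every section has its own StatusCode child, so name the parent explicitly.
status_text = root.findtext("stw:Response/stw:ResponseStatus/stw:StatusCode", namespaces=NS)
quota = root.findtext(".//stw:Quota_Remaining/stw:StatusCode", namespaces=NS)

print(status_text, quota)
```

Searching for a bare ".//stw:StatusCode" would return whichever section comes first in document order, which is why the parent element is spelled out here.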
