How to parse shrink the web xml with ElementTree

I am trying to use the ShrinkTheWeb service for site thumbnails. They have an API that returns XML telling you whether the site thumbnail can be created. I am trying to use ElementTree to parse the XML, but I am not sure how to get to the information I need. Here is an example of the XML response:
<?xml version="1.0" encoding="UTF-8"?>
<stw:ThumbnailResponse xmlns:stw="http://www.shrinktheweb.com/doc/stwresponse.xsd">
<stw:Response>
<stw:ThumbnailResult>
<stw:Thumbnail Exists="false"></stw:Thumbnail>
<stw:Thumbnail Verified="false">fix_and_retry</stw:Thumbnail>
</stw:ThumbnailResult>
<stw:ResponseStatus>
<stw:StatusCode>Blank Detected</stw:StatusCode>
</stw:ResponseStatus>
<stw:ResponseTimestamp>
<stw:StatusCode></stw:StatusCode>
</stw:ResponseTimestamp>
<stw:ResponseCode>
<stw:StatusCode></stw:StatusCode>
</stw:ResponseCode>
<stw:CategoryCode>
<stw:StatusCode>none</stw:StatusCode>
</stw:CategoryCode>
<stw:Quota_Remaining>
<stw:StatusCode>1</stw:StatusCode>
</stw:Quota_Remaining>
</stw:Response>
</stw:ThumbnailResponse>
I need to get the "stw:StatusCode". If I try to do a find on "stw:StatusCode", I get an "expected path separator" syntax error. Is there a way to get just the status code?

Grrr, namespaces... Try this:
STW_PREFIX = "{http://www.shrinktheweb.com/doc/stwresponse.xsd}"
(see line 2 of your sample XML)
Then when you want a tag like stw:StatusCode, use STW_PREFIX + "StatusCode"
Update: That XML response isn't the most brilliant design. It's not possible to guess from your single example whether there can be more than 1 2nd-level node. Note that each 3rd-level node has a "StatusCode" child. Here is some rough-and-ready code that shows you (1) why you need that STW_PREFIX caper (2) an extract of the usable info.
import xml.etree.ElementTree as et  # Python 3; cElementTree was removed in 3.9
def showtag(elem):
    # keep only the part of the tag after the "{namespace}" prefix
    return repr(elem.tag.rsplit("}")[1])
def showtext(elem):
    return None if elem.text is None else repr(elem.text.strip())
root = et.fromstring(xml_response) # xml_response is your input string
print(repr(root.tag)) # see exactly what tag is in the element
for child in root[0]:
    print(showtag(child), showtext(child))
    for gc in child:
        print("...", showtag(gc), showtext(gc), gc.attrib)
Result:
'{http://www.shrinktheweb.com/doc/stwresponse.xsd}ThumbnailResponse'
'ThumbnailResult' ''
... 'Thumbnail' None {'Exists': 'false'}
... 'Thumbnail' 'fix_and_retry' {'Verified': 'false'}
'ResponseStatus' ''
... 'StatusCode' 'Blank Detected' {}
'ResponseTimestamp' ''
... 'StatusCode' None {}
'ResponseCode' ''
... 'StatusCode' None {}
'CategoryCode' ''
... 'StatusCode' 'none' {}
'Quota_Remaining' ''
... 'StatusCode' '1' {}
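
As an alternative to concatenating the prefix by hand, ElementTree's find methods also accept a prefix-to-URI mapping. The following is a minimal sketch, assuming Python 3 and a trimmed copy of the sample response from the question; the names NS and status_code are only illustrative and are not part of the ShrinkTheWeb API.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample response from the question.
xml_response = """<?xml version="1.0" encoding="UTF-8"?>
<stw:ThumbnailResponse xmlns:stw="http://www.shrinktheweb.com/doc/stwresponse.xsd">
<stw:Response>
<stw:ResponseStatus>
<stw:StatusCode>Blank Detected</stw:StatusCode>
</stw:ResponseStatus>
<stw:Quota_Remaining>
<stw:StatusCode>1</stw:StatusCode>
</stw:Quota_Remaining>
</stw:Response>
</stw:ThumbnailResponse>"""

# Illustrative mapping: the prefix on the left is our choice,
# the URI must match the xmlns declaration in the document.
NS = {"stw": "http://www.shrinktheweb.com/doc/stwresponse.xsd"}

# Parse from bytes, which is what an HTTP client typically hands back.
root = ET.fromstring(xml_response.encode("utf-8"))

# The path is relative to the root element (stw:ThumbnailResponse).
status_code = root.findtext(
    "stw:Response/stw:ResponseStatus/stw:StatusCode", namespaces=NS
)
print(status_code)
```

Because every third-level node has its own StatusCode child, the path spells out ResponseStatus explicitly instead of searching for the first StatusCode anywhere in the tree.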

Related

Parsing zap xml report (or how to get child node with xpath in python)

I have a simple question (probably not difficult for many developers).
I would like to parse a ZAP XML report and get all the nodes of an alert.
Here is an extract of the ZAP report in XML:
<?xml version="1.0"?><OWASPZAPReport version="D-2020-10-20" generated="Wed, 18 Nov 2020 16:51:34">
<site name="http://webmail.example.com" host="webmail.example.com" port="80" ssl="false"><alerts>
<alertitem>
<pluginid>3</pluginid>
<alert>Session ID in URL Rewrite</alert>
<name>Session ID in URL Rewrite</name>
<instances>
<instance>
<uri>http://webmail.example.com/dyn/login.seam;jsessionid=c2e851a8c7f47dcd4dea016fd1e0?cid=47</uri>
<method>GET</method>
<evidence>jsessionid=c2e851a8c7f47dcd4dea016fd1e0</evidence>
</instance>
<instance>
<uri>http://webmail.example.com/dyn/portal/index.seam;jsessionid=c2e851a8c7f47dcd4dea016fd1e0?aloId=21152&cid=47&page=alo</uri>
<method>GET</method>
<evidence>jsessionid=c2e851a8c7f47dcd4dea016fd1e0</evidence>
</instance>
</instances>
<count>6</count>
<solution><p>For secure content, put session ID in a cookie. To be even more secure consider using a combination of cookie and URL rewrite.</p></solution>
<reference><p>http://seclists.org/lists/webappsec/2002/Oct-Dec/0111.html</p></reference>
<cweid>200</cweid>
<wascid>13</wascid>
<sourceid>3</sourceid>
</alertitem>
</alerts>
I would like to get the content of the child nodes (such as uri, method, and evidence).
Currently I am using this code (in Python 3) and am able to get all the alert items:
tree = etree.parse(report_file)
root = tree.getroot()
for site in tree.findall('site'):
    sitename = site.attrib['name']
    for alert in site.findall('.//alertitem'):
        name_alert = alert.find('name').text
        ...
But I would like to parse the child nodes
and get the content of uri, for example:
http://webmail.example.com/[...]
Could you help me?

Try this:
from lxml import etree
alerts = """[your xml above (make sure it's well formed)]"""
doc = etree.XML(alerts)
for instance in doc.xpath('//instance'):
    print(instance.xpath('./uri')[0].text)
    print(instance.xpath('./method')[0].text)
    print(instance.xpath('./evidence')[0].text)
Output:
http://webmail.example.com/dyn/login.seam;jsessionid=c2e851a8c7f47dcd4dea016fd1e0?cid=47
GET
jsessionid=c2e851a8c7f47dcd4dea016fd1e0
http://webmail.example.com/dyn/portal/index.seam;jsessionid=c2e851a8c7f47dcd4dea016fd1e0?aloId=21152&cid=47&page=alo
GET
jsessionid=c2e851a8c7f47dcd4dea016fd1e0
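
If lxml is not available, roughly the same traversal can be done with the standard library. This is a minimal sketch, assuming a shortened, well-formed stand-in for the report (the raw & characters in the URIs have to be written as &amp; for the XML to parse); the variable names report and instances are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the ZAP report in the question.
report = """<OWASPZAPReport version="D-2020-10-20">
<site name="http://webmail.example.com"><alerts>
<alertitem>
<name>Session ID in URL Rewrite</name>
<instances>
<instance>
<uri>http://webmail.example.com/dyn/login.seam?cid=47</uri>
<method>GET</method>
<evidence>jsessionid=c2e851a8c7f47dcd4dea016fd1e0</evidence>
</instance>
<instance>
<uri>http://webmail.example.com/dyn/portal/index.seam?cid=47&amp;page=alo</uri>
<method>GET</method>
</instance>
</instances>
</alertitem>
</alerts></site>
</OWASPZAPReport>"""

root = ET.fromstring(report)

instances = []
# iter() walks the whole tree, so nesting depth does not matter.
for instance in root.iter("instance"):
    # Map each child tag (uri, method, evidence when present) to its text.
    instances.append({child.tag: child.text for child in instance})

for row in instances:
    print(row)
```

Collecting each instance into a dictionary keeps optional children such as evidence from raising an error when they are missing; they are simply absent from the dictionary.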

Thank you, Jack! The final code (in case somebody needs it):
tree = etree.parse(report_file)
root = tree.getroot()
for site in tree.xpath('site'):
    sitename = site.attrib['name']
    # './/' keeps the search inside this site; '//' would scan the whole document
    for alert in site.xpath('.//alertitem'):
        nom = alert.find('name').text
        criticite = alert.find('riskdesc').text
        # find() returns None for a missing element, so .text raises AttributeError
        try:
            otherinfo = alert.find('otherinfo').text
        except AttributeError:
            otherinfo = ""
        desc = alert.find('desc').text
        try:
            url_trace = alert.find('uri').text
        except AttributeError:
            url_trace = ""
        try:
            method = alert.find('method').text
        except AttributeError:
            method = ""
        try:
            parametre = alert.find('param').text
        except AttributeError:
            parametre = ""
        for instance in alert.xpath('.//instance'):
            print(instance.xpath('./uri')[0].text)
            #print(instance.xpath('./method')[0].text)
            try:
                print(instance.xpath('./evidence')[0].text)
            except IndexError:
                pass
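
The repeated try/except blocks for optional elements can also be expressed with findtext() and a default value, which both xml.etree.ElementTree and lxml elements provide. This is a minimal sketch using the standard library; the sample alert is a made-up fragment and its riskdesc value is a placeholder, not real ZAP output.

```python
import xml.etree.ElementTree as ET

# Made-up alert fragment; the riskdesc value is only a placeholder.
alert = ET.fromstring(
    "<alertitem>"
    "<name>Session ID in URL Rewrite</name>"
    "<riskdesc>Medium (Medium)</riskdesc>"
    "<instances><instance>"
    "<uri>http://webmail.example.com/</uri>"
    "</instance></instances>"
    "</alertitem>"
)

# findtext() returns the default when the element does not exist,
# so no exception handling is needed for optional children.
name = alert.findtext("name", default="")
otherinfo = alert.findtext("otherinfo", default="")

rows = []
for instance in alert.findall(".//instance"):
    rows.append(
        {
            "uri": instance.findtext("uri", default=""),
            "evidence": instance.findtext("evidence", default=""),
        }
    )

print(name, repr(otherinfo), rows)
```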

Getting the xml soap response instead of raw data

I am trying to get the SOAP response, read a few tags, and then put the keys and values in a dictionary.
It would be better if I could use the generated response directly and perform the required operations on it.
Since I was not able to do that, I tried storing the response in an XML file and then using that file for the operations.
My problem is that the generated response is in raw form. How can I resolve this?
Example: <medical:totEeCnt val="2" />
<medical:totMbrCnt val="2" />
<medical:totDepCnt val="0" />
def soapTest():
    request = """<soapenv:Envelope.......
    auth = HTTPBasicAuth('', '')
    headers = {'content-type': 'application/soap+xml', 'SOAPAction': "", 'Host': 'bfx-b2b....com'}
    url = "https://bfx-b2b....com/B2BWEB/services/IProductPort"
    response = requests.post(url, data=request, headers=headers, auth=auth, verify=True)
    # Open local file
    fd = os.open('planRates.xml', os.O_RDWR|os.O_CREAT)
    # Convert response object into string
    response_str = str(response.content)
    # Write response to the file
    os.write(fd, response_str)
    # Close the file
    os.close(fd)
    tree = ET.parse('planRates.xml')
    root = tree.getroot()
    dict = {}
    print root
    for plan in root.findall('.//{http://services.b2b.../types/rates/dental}dentPln'): # type: object
        plan_id = plan.get('cd')
        print plan
        print plan_id
        for rtGroup in plan.findall('.//{http://services.b2b....com/types/rates/dental}censRtGrp'):
            #print rtGroup
            for amt in rtGroup.findall('.//{http://services.b2b....com/types/rates/dental}totAnnPrem'):
                # print amt
                print amt.get('val')
                amount = amt.get('val')
                dict[plan_id] = amount
    print dict
Update:
I tried a few things. What I am not able to understand is that with this, the subsequent operations work:
tree = ET.parse('data/planRates.xml')
root = tree.getroot()
dict = {}
print tree
print root
for plan in root.findall(..
Output:
<xml.etree.ElementTree.ElementTree object at 0x100d7b910>
<Element '{http://schemas.xmlsoap.org/soap/envelope/}Envelope' at 0x101500450>
But after switching to this, it does not work:
tree = ET.fromstring(response.text)
print tree
for plan in tree.findall(..
Output:
<Element '{http://schemas.xmlsoap.org/soap/envelope/}Envelope' at 0x10d624910>
Basically, I am using the same object in both cases.

Supposing you get a response that you want as a proper XML object:
rt = resp.text.encode('utf-8')
# printed rt is something like:
'<soap:Envelope xmlns:soap="http://...">
<soap:Envelope
<AirShoppingRS xmlns="http://www.iata.org/IATA/EDIST" Version="16.1">
<Document>...</soap:Envelope</soap:Envelope>'
# stripping the SOAP envelope (rt is a byte string, so the markers are bytes too)
startTag = b'<AirShoppingRS '
endTag = b'</AirShoppingRS>'
trimmed = rt[rt.find(startTag): rt.find(endTag) + len(endTag)]
# parsing
from lxml import etree as et
root = et.fromstring(trimmed)
With this root element you can use find method, xpath or whatever you prefer.
Obviously you need to change the start and endtags to extract the correct element from the response but you get the idea, right?
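
It may also be possible to skip both the temporary file and the string trimming by parsing the response body directly and searching with a namespace mapping. Note that ET.parse() returns an ElementTree (hence the getroot() call), while ET.fromstring() returns the root Element itself; both support findall(). The following is a minimal sketch, assuming Python 3; response_bytes stands in for response.content, and the dental namespace URI and the plan values are made up because the real ones are elided in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for response.content; the dental namespace URI
# and the plan values are invented for illustration.
response_bytes = b"""<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Body>
<dental:dentPln xmlns:dental="http://example.com/types/rates/dental" cd="PLAN1">
<dental:censRtGrp><dental:totAnnPrem val="120.50"/></dental:censRtGrp>
</dental:dentPln>
</soapenv:Body>
</soapenv:Envelope>"""

NS = {"dental": "http://example.com/types/rates/dental"}

# fromstring() already returns the root Element, so there is no getroot().
root = ET.fromstring(response_bytes)

premiums = {}
for plan in root.findall(".//dental:dentPln", NS):
    amt = plan.find(".//dental:censRtGrp/dental:totAnnPrem", NS)
    if amt is not None:
        # Attributes without a prefix are not namespaced, so plain get() works.
        premiums[plan.get("cd")] = amt.get("val")

print(premiums)
```

Passing the bytes from response.content rather than a str built with str(...) avoids writing a Python representation of the bytes into the parser.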

OpenXML Tag Formatting

I am attempting to parse Open XML from a Microsoft Word document. However, whenever I look at any tag or attribute, I receive the tag I want preceded by the openxmlformats namespace. Examples below. Does anybody know how I can remove this and receive only my tag name and value?
Current format:
for content in root.iter():
    print(content.tag)
returns:
'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}tag'
and
for content in root.iter('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}tag'):
    print(content.attrib)
returns
'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': 'Orange'
Desired Output:
for content in root.iter():
    print(content.tag)
returns
tag
and
for content in root.iter('tag'):
    print(content.attrib)
returns
val : 'Orange'

You can write your own version of the iterator that does what you want:
from collections import namedtuple
import re
my_content = namedtuple('my_content', ['tag', 'attrib'])
def remove_namespace(name):
    # strip a leading "{namespace}"; the "+" is needed to match the whole URI
    return re.sub(r'^\{[^\}]+\}', '', name)
def my_iterator(root, tag=None, namespace='{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'):
    iterator = root.iter() if tag is None else root.iter(namespace + tag)
    for content in iterator:
        tag = remove_namespace(content.tag)
        attrib = {remove_namespace(key): val for key, val in content.attrib.items()}
        yield my_content(tag, attrib)
This will return objects that only have the tag and attrib attributes. You will have to write a more complex proxy object if you want more detailed functionality. You can use the generator as a replacement for the previous:
for content in my_iterator(root):
    print(content.tag)
and
for content in my_iterator(root, 'tag'):
    print(content.attrib)
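
A variant without regular expressions is to search with a namespace mapping and strip the namespace part from the names only when displaying them. This is a minimal sketch, assuming Python 3; the XML fragment is a tiny made-up stand-in for a real document.xml, and local_name is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Tiny made-up fragment; a real document.xml is much larger.
xml_text = (
    '<w:document xmlns:w="%s"><w:body><w:sdt><w:sdtPr>'
    '<w:tag w:val="Orange"/>'
    '</w:sdtPr></w:sdt></w:body></w:document>' % W_NS
)


def local_name(qualified):
    """Drop a leading '{namespace}' from a tag or attribute name."""
    # rpartition leaves names without a namespace unchanged.
    return qualified.rpartition("}")[2]


root = ET.fromstring(xml_text)

found = [
    (local_name(el.tag), {local_name(k): v for k, v in el.attrib.items()})
    for el in root.findall(".//w:tag", {"w": W_NS})
]
print(found)
```

The search itself still uses the namespace, which avoids matching an unrelated element that happens to share the local name in another namespace.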

How to extract one value from an xml url using xml.etree

I'm trying to print the value of just one field of an XML tree. Here is an example of the XML tree I get when I request it:
<puco>
<resultado>OK</resultado>
<coberturaSocial>O.S.P. TIERRA DEL FUEGO(IPAUSS)</coberturaSocial>
<denominacion>DAMIAN GUTIERREZ DEL RIO</denominacion>
<nrodoc>32443324</nrodoc>
<rnos>924001</rnos>
<tipodoc>DNI</tipodoc>
</puco>
Now, I just want to print the "coberturaSocial" value. Here is the request that I have in my views.py:
def get(request):
    r = requests.get('https://sisa.msal.gov.ar/sisa/services/rest/puco/38785898')
    dom = r.content
    asd = etree.fromstring(dom)
If I print "asd", I get this error: The view didn't return an HttpResponse object. It returned None instead.
and also in the console I get this
I just want to print coberturaSocial. Please help, I am new to XML parsing!

You need to extract the contents of the tag and then return it wrapped in a response, like so:
return HttpResponse(asd.find('coberturaSocial').text)

I'm guessing etree comes from import xml.etree.ElementTree as etree.
You can use:
text = r.content
dom = etree.fromstring(text)
el = dom.find('coberturaSocial')
el.text # this is where the string is
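
The extraction step can be tried on its own, without Django or a network call. This is a minimal sketch, assuming Python 3; payload is a shortened copy of the sample XML from the question, and extract_cobertura is a hypothetical helper name. In the view, its return value would still have to be wrapped in an HttpResponse as the other answer shows.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample response from the question.
payload = b"""<puco>
<resultado>OK</resultado>
<coberturaSocial>O.S.P. TIERRA DEL FUEGO(IPAUSS)</coberturaSocial>
<nrodoc>32443324</nrodoc>
</puco>"""


def extract_cobertura(xml_bytes):
    """Return the text of <coberturaSocial>, or "" if the element is missing."""
    root = ET.fromstring(xml_bytes)
    # coberturaSocial is a direct child of the root, so a bare tag name is enough.
    return root.findtext("coberturaSocial", default="")


cobertura = extract_cobertura(payload)
print(cobertura)
```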

Trying to parse an XML file with Python - what am I doing wrong?

I'm working with XML and Python for the first time. The ultimate goal is to send a request to a REST service, receive a response in XML, and parse the values and send emails depending on what was returned. However, the REST service is not yet in place, so for now I'm experimenting with an XML file saved on my C drive.
I have a simple bit of code, and I'm confused about why it isn't working.
This is my xml file ("XMLTest.xml"):
<Response>
<exitCode>1</exitCode>
<fileName>C:/Something/</fileName>
<errors>
<error>Error generating report</error>
</errors>
</Response>
This is my code so far:
from xml.dom import minidom
something = open("C:/XMLTest.xml")
something = minidom.parse(something)
nodeList = []
for node in something.getElementsByTagName("Response"):
    nodeList.extend(t.nodeValue for t in node.childNodes)
print nodeList
But the results that print out are...
[u'\n\t', None, u'\n\t', None, u'\n\t', None, u'\n']
What am I doing wrong?
I'm trying to get the node values. Is there a better way to do this? Is there a built-in method in Python to convert an xml file to an object or dictionary? I'd like to get all the values, preferably with the names attached.

Does this help?
doc = '''<Response>
<exitCode>1</exitCode>
<fileName>C:/Something/</fileName>
<errors>
<error>Error generating report</error>
</errors>
</Response>'''
from xml.dom import minidom
something = minidom.parseString( doc )
nodeList = [ ]
for node in something.getElementsByTagName( "Response" ):
    response = { }
    response[ "exit code" ] = node.getElementsByTagName( "exitCode" )[ 0 ].childNodes[ 0 ].nodeValue
    response[ "file name" ] = node.getElementsByTagName( "fileName" )[ 0 ].childNodes[ 0 ].nodeValue
    errors = node.getElementsByTagName( "errors" )[ 0 ].getElementsByTagName( "error" )
    response[ "errors" ] = [ error.childNodes[ 0 ].nodeValue for error in errors ]
    nodeList.append( response )
import pprint
pprint.pprint( nodeList )
yields
[{'errors': [u'Error generating report'],
'exit code': u'1',
'file name': u'C:/Something/'}]

If you are just starting out with XML and Python, and have no compelling reason to use DOM, I strongly suggest you have a look at the ElementTree API (implemented in the standard library in xml.etree.ElementTree).
To give you a taste:
import xml.etree.cElementTree as etree
tree = etree.parse('C:/XMLTest.xml')
response = tree.getroot()
exitcode = response.find('exitCode').text
filename = response.find('fileName').text
errors = [i.text for i in response.find('errors')]
(if you need more power - xpath, validation, xslt etc... - you can even switch to lxml, which implements the same API, but with a lot of extras)
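
On the question of getting all the values with their names attached: ElementTree has no built-in conversion to a dictionary, but a small recursive function gets close for simple documents like this one. This is a minimal sketch, assuming Python 3; element_to_dict is a hypothetical helper that ignores attributes and mixed content.

```python
import xml.etree.ElementTree as ET

xml_text = """<Response>
<exitCode>1</exitCode>
<fileName>C:/Something/</fileName>
<errors>
<error>Error generating report</error>
</errors>
</Response>"""


def element_to_dict(elem):
    """Convert an element to nested dicts; leaves become stripped strings."""
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # A repeated tag is collected into a list of values.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


data = element_to_dict(ET.fromstring(xml_text))
print(data)
```

One caveat of this design is that a tag yields a plain value when it occurs once and a list when it repeats, so code that expects several error entries should handle both shapes.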

You're not thinking about the XML from the DOM standpoint. Namely, 'C:/Something' isn't the nodevalue of the element whose tagname is 'fileName'; it's the nodevalue of the text node that is the first child of the element whose tagname is 'fileName'.
What I recommend you do is play around with it a little more in python itself: start python.
from xml.dom import minidom
x = minidom.parseString('<Response><fileName>C:/</fileName></Response>')
x.getElementsByTagName('Response')
...
x.getElementsByTagName('Response')[0].childNodes[0]
...
and so forth. You'll get a quick appreciation for how the document is being parsed.
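
The point about text nodes can be checked in a few lines. This is a minimal sketch, assuming Python 3; the variable names are illustrative.

```python
from xml.dom import minidom

doc = minidom.parseString("<Response><fileName>C:/Something/</fileName></Response>")
file_name = doc.getElementsByTagName("fileName")[0]

# An element node has no value of its own.
element_value = file_name.nodeValue

# The text lives in a separate text node, the element's first child.
text_node = file_name.firstChild
text_value = text_node.nodeValue

print(element_value, text_value)
```

This is also why the list in the question contains None entries (the element nodes) mixed with whitespace strings (the text nodes between the elements).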

I recommend my library xml2obj. It is way cleaner than DOM. The "library" has only 84 lines of code you can embed anywhere.
In [185]: resp = xml2obj(something)
In [186]: resp.exitCode
Out[186]: u'1'
In [187]: resp.fileName
Out[187]: u'C:/Something/'
In [188]: len(resp.errors)
Out[188]: 1
In [189]: for node in resp.errors:
   .....:     print node.error
   .....:
   .....:
Error generating report
