Fix namespace with regular expression - python

I have the following namespaces coming from a certain service:
<soapenv:Envelope xmlns:soapenv=http://schemas.xmlsoap.org/soap/envelope/ xmlns:soap=http://www.4cgroup.co.za/soapauth xmlns:gen=http://www.4cgroup.co.za/genericsoap>
When I try to parse this request, I receive the following error:
xml.etree.ElementTree.ParseError: not well-formed
I noticed there are no double quotes around the namespace values. How can I add them with a regular expression?
Proper format:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:soap="http://www.4cgroup.co.za/soapauth" xmlns:gen="http://www.4cgroup.co.za/genericsoap">
Note the double quotes.

Using regex:
import re
namespace = "<soapenv:Envelope xmlns:soapenv=http://schemas.xmlsoap.org/soap/envelope/ xmlns:soap=http://www.4cgroup.co.za/soapauth xmlns:gen=http://www.4cgroup.co.za/genericsoap>"
FIND_URL = re.compile(r"((?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+)")
print(FIND_URL.sub(r'"\1"', namespace))
Output:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:soap="http://www.4cgroup.co.za/soapauth" xmlns:gen="http://www.4cgroup.co.za/genericsoap">
Note that the regex isn't perfect. It works for this case, but it may fail on URLs containing characters outside its character class, such as ':' (a port number), '&' or '#'.
Credit to this answer

This regex seems to do the trick:
import re
nsmap = "<soapenv:Envelope xmlns:soapenv=http://schemas.xmlsoap.org/soap/envelope/ xmlns:soap=http://www.4cgroup.co.za/soapauth xmlns:gen=http://www.4cgroup.co.za/genericsoap>"
nsmap = re.sub(r"(https?://.*?)(?=\sxmlns|>)", r'"\1"', nsmap)
print(nsmap)
Output:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:soap="http://www.4cgroup.co.za/soapauth" xmlns:gen="http://www.4cgroup.co.za/genericsoap">
Check it out online here.
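Both regexes above key on the shape of the URL. A minimal alternative sketch, assuming the only defect is missing quotes on xmlns declarations, keys on the attribute name instead and then checks the result with ElementTree. The names UNQUOTED_NS and quote_namespaces are illustrative, and the closing tag is appended only so that the fragment parses.

```python
import re
import xml.etree.ElementTree as ET

raw = ("<soapenv:Envelope xmlns:soapenv=http://schemas.xmlsoap.org/soap/envelope/ "
       "xmlns:soap=http://www.4cgroup.co.za/soapauth "
       "xmlns:gen=http://www.4cgroup.co.za/genericsoap>")

# Match xmlns or xmlns:prefix followed by a value that does not start with a quote.
# The value runs until whitespace, a quote character, or '>'.
UNQUOTED_NS = re.compile(r'''(xmlns(?::[\w.-]+)?)=([^\s"'>]+)''')

def quote_namespaces(xml_text):
    """Wrap unquoted xmlns values in double quotes (hypothetical helper)."""
    return UNQUOTED_NS.sub(r'\1="\2"', xml_text)

fixed = quote_namespaces(raw)
print(fixed)

# Sanity check: the repaired start tag now parses once it is closed.
root = ET.fromstring(fixed + "</soapenv:Envelope>")
print(root.tag)
```

Values that are already quoted are left alone, so the function can be applied more than once. A self-closing tag whose unquoted value ends in '/' would still be ambiguous, so this remains a workaround; fixing the service that emits the XML is the more robust option.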

Related

Extract XML data from SOAP Response in Python

I am using a Python script to receive an XML response from a SOAP web service, and I'd like to extract specific values from the XML response. I'm trying to use the 'untangle' library, but keep getting the following error:
AttributeError: 'None' has no attribute 'Envelope'
Below is a sample of my code. I'm trying to extract the RequestType value from the below
<soap:Envelope
xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Header/>
<soap:Body>
<Response>\n
<RequestType>test</RequestType>
</Response>
</soap:Body>
</soap:Envelope>
Sample use of untangle
parsed_xml = untangle.parse(xml)
print(parsed_xml.Envelope.Response.RequestType.cdata)
I've also tried parsed_xml.Envelope.Body.Response.RequestType.cdata
This will solve your problem, assuming you want to extract 'test'. By the way, I think your response should not contain 'soap:Header/':
import xmltodict
stack_d = xmltodict.parse(response.content)
stack_d['soap:Envelope']['soap:Body']['Response']['RequestType']
I think you will find the xml.etree library to be more usable in this context.
import requests
from xml.etree import ElementTree
Then we need to define the namespaces for the SOAP Response
namespaces = {
'soap': 'http://schemas.xmlsoap.org/soap/envelope/',
'a': 'http://www.etis.fskab.se/v1.0/ETISws',
}
dom = ElementTree.fromstring(response.content)
Then simply find all the matching elements:
names = dom.findall('./soap:Body',namespaces)
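Applied to the response in the question, the same approach can be sketched as follows. The soap_response string is a stand-in for response.content, so that the example runs without a web service; only the soap prefix needs a mapping, because Response and RequestType carry no namespace in that sample.

```python
import xml.etree.ElementTree as ET

# Stand-in for response.content from the question (assumed shape).
soap_response = """<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Header/>
  <soap:Body>
    <Response>
      <RequestType>test</RequestType>
    </Response>
  </soap:Body>
</soap:Envelope>"""

namespaces = {"soap": "http://schemas.xmlsoap.org/soap/envelope/"}

root = ET.fromstring(soap_response)

# Body is in the SOAP namespace; Response and RequestType have no namespace,
# so they are written without a prefix.
request_type = root.findtext("./soap:Body/Response/RequestType", namespaces=namespaces)
print(request_type)  # test
```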

Parsing XML for specific item using ElementTree

I am making a request to the Salesforce merge API and getting a response like this:
xml_result = '''<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns="urn:partner.soap.sforce.com">
<soapenv:Header>
<LimitInfoHeader>
<limitInfo>
<current>62303</current>
<limit>2680000</limit><type>API REQUESTS</type></limitInfo>
</LimitInfoHeader>
</soapenv:Header>
<soapenv:Body>
<mergeResponse>
<result>
<errors>
<message>invalid record type</message>
<statusCode>INSUFFICIENT_ACCESS_ON_CROSS_REFERENCE_ENTITY</statusCode>
</errors>
<id>003skdjf494244</id>
<success>false</success>
</result>
</mergeResponse>
</soapenv:Body>
</soapenv:Envelope>'''
I'd like to be able to parse this response and if success=false, return the errors, statusCode, and the message text.
I've tried the following:
import xml.etree.ElementTree as ET
root = ET.fromstring(xml_result)
root.find('mergeResponse')
root.find('{urn:partner.soap.sforce.com}mergeResponse')
root.findtext('mergeResponse')
root.findall('{urn:partner.soap.sforce.com}mergeResponse')
...and a bunch of other variations of find, findtext and findall but I can't seem to get these to return any results. Here's where I get stuck. I've tried to follow the ElementTree docs, but I don't understand how to parse the tree for specific elements.
Element.find() finds the first child with a particular tag
https://docs.python.org/2/library/xml.etree.elementtree.html#finding-interesting-elements
Since mergeResponse is a descendant, not a child, you should use XPath-syntax in this case:
root.find('.//{urn:partner.soap.sforce.com}mergeResponse')
will return your node. .// searches all descendants starting with the current node (in this case the root).
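To go from finding the node to returning the errors when success is false, one possible sketch is shown below. The prefix sf, the function name merge_errors and the shortened sample document are all illustrative choices, not part of the Salesforce API.

```python
import xml.etree.ElementTree as ET

# Map an arbitrary prefix to the default namespace used in the response.
NS = {"sf": "urn:partner.soap.sforce.com"}

sample = """<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns="urn:partner.soap.sforce.com">
<soapenv:Body>
<mergeResponse>
<result>
<errors>
<message>invalid record type</message>
<statusCode>INSUFFICIENT_ACCESS_ON_CROSS_REFERENCE_ENTITY</statusCode>
</errors>
<id>003skdjf494244</id>
<success>false</success>
</result>
</mergeResponse>
</soapenv:Body>
</soapenv:Envelope>"""

def merge_errors(xml_text):
    """Return (statusCode, message) pairs for every result with success=false."""
    root = ET.fromstring(xml_text)
    failures = []
    for result in root.iterfind(".//sf:mergeResponse/sf:result", NS):
        if result.findtext("sf:success", namespaces=NS) == "false":
            for err in result.iterfind("sf:errors", NS):
                failures.append((
                    err.findtext("sf:statusCode", namespaces=NS),
                    err.findtext("sf:message", namespaces=NS),
                ))
    return failures

print(merge_errors(sample))
```

Every element below the envelope inherits the default namespace, which is why each step of the path needs the prefix.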

lxml: How do I search for fields without adding a xmlns (localhost) path to each search term?

I'm trying to locate fields in a SOAP xml file using lxml (3.6.0)
...
<soap:Body>
<Request xmlns="http://localhost/">
<Test>
<field1>hello</field1>
<field2>world</field2>
</Test>
</Request>
</soap:Body>
...
In this example I'm trying to find field1 and field2.
I need to add a path to the search term, to find the field:
print (myroot.find(".//{http://localhost/}field1").tag) # prints '{http://localhost/}field1'
without it, I don't find anything
print (myroot.find("field1")) # prints 'None'
Is there any other way to search for the field tag (here field1) without giving path info?
Full example below:
from lxml import etree
example = b"""<?xml version="1.0" encoding="utf-8"?><soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body><Request xmlns="http://localhost/">
<Test><field1>hello</field1><field2>world</field2></Test>
</Request></soap:Body></soap:Envelope>
"""
myroot = etree.fromstring(example)
# this works
print (myroot.find(".//{http://localhost/}field1").text)
print (myroot.find(".//{http://localhost/}field2").text)
# this fails
print (myroot.find(".//field1").text)
print (myroot.find("field1").text)
Comment: The input of the SOAP request is given; in real life I can't change any of it to make things easier.
There is a way to ignore namespaces when selecting elements with XPath, but that isn't good practice: namespaces are there for a reason. A cleaner way to reference an element in a namespace is to use a prefix mapped to the namespace URI, instead of writing out the URI every time:
.....
>>> ns = {'d': 'http://localhost/'}
>>> print (myroot.find(".//d:field1", ns).text)
hello
>>> print (myroot.find(".//d:field2", ns).text)
world
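For completeness, here is a sketch of the namespace-ignoring route mentioned above. It swaps in the standard library's xml.etree.ElementTree rather than lxml, and assumes Python 3.8 or later, where the {*} wildcard in a path matches a tag in any namespace (or none). The sample document is a shortened copy of the one in the question.

```python
import xml.etree.ElementTree as ET

example = """<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body><Request xmlns="http://localhost/">
<Test><field1>hello</field1><field2>world</field2></Test>
</Request></soap:Body></soap:Envelope>"""

root = ET.fromstring(example)

# {*} matches the local name in any namespace (Python 3.8+ only).
field1 = root.find(".//{*}field1")
field2 = root.find(".//{*}field2")
print(field1.text, field2.text)
```

The trade-off is the one already noted: a wildcard match cannot tell apart two elements that share a local name but live in different namespaces.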

How to parse tiered XML String

I have an xml string that I need to parse in python that looks like this:
<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
<s:Body>
<PostLoadsResponse xmlns="http://webservices.truckstop.com/v11">
<PostLoadsResult xmlns:a="http://schemas.datacontract.org/2004/07/WebServices.Objects" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Errors xmlns="http://schemas.datacontract.org/2004/07/WebServices">
<Error>
<ErrorMessage>Invalid Location</ErrorMessage>
</Error>
</Errors>
</PostLoadsResult>
</PostLoadsResponse>
</s:Body>
</s:Envelope>
I'm having trouble using ElementTree to get to the error message in this tree without resorting to something like:
import xml.etree.ElementTree as ET
ET.fromstring(text).findall('{http://schemas.xmlsoap.org/soap/envelope/}Body')[0].getchildren()[0].getchildren()[0].getchildren()
You need to handle namespaces and you can do it with xml.etree.ElementTree:
tree = ET.fromstring(data)
namespaces = {
's': 'http://schemas.xmlsoap.org/soap/envelope/',
'd': "http://schemas.datacontract.org/2004/07/WebServices"
}
print(tree.find(".//d:ErrorMessage", namespaces=namespaces).text)
Prints Invalid Location.
Using the partial XPath support:
ET.fromstring(text).find('.//{http://schemas.datacontract.org/2004/07/WebServices}ErrorMessage')
That will instruct it to find the first element named ErrorMessage with namespace http://schemas.datacontract.org/2004/07/WebServices at any depth.
However, it may be faster to use something like
ET.fromstring(text).find('{http://schemas.xmlsoap.org/soap/envelope/}Body').find('{http://webservices.truckstop.com/v11}PostLoadsResponse').find('{http://webservices.truckstop.com/v11}PostLoadsResult').find('{http://schemas.datacontract.org/2004/07/WebServices}Errors').find('{http://schemas.datacontract.org/2004/07/WebServices}Error').find('{http://schemas.datacontract.org/2004/07/WebServices}ErrorMessage')
If you know your message will always contain those elements.
You can use the iter method on the tree (called getiterator in older code; that name was removed in Python 3.9) to iterate through the items in it. You can check the tag on each item to see if it's the right one.
>>> err = [node.text for node in tree.iter() if node.tag.endswith('ErrorMessage')]
>>> err
['Invalid Location']
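A suffix test would also match a tag such as OtherErrorMessage. A slightly stricter sketch compares the local name exactly by splitting off the {namespace} part of each tag; local_name is a hypothetical helper and the document is a trimmed copy of the one in the question.

```python
import xml.etree.ElementTree as ET

text = """<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
<s:Body>
<PostLoadsResponse xmlns="http://webservices.truckstop.com/v11">
<PostLoadsResult>
<Errors xmlns="http://schemas.datacontract.org/2004/07/WebServices">
<Error>
<ErrorMessage>Invalid Location</ErrorMessage>
</Error>
</Errors>
</PostLoadsResult>
</PostLoadsResponse>
</s:Body>
</s:Envelope>"""

def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rpartition("}")[2]

root = ET.fromstring(text)

# Walk every element and keep those whose local name matches exactly.
messages = [el.text for el in root.iter() if local_name(el.tag) == "ErrorMessage"]
print(messages)  # ['Invalid Location']
```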

xml parsing error special characters

I have the following XML that I want to parse with the xml.dom.minidom module:
<?xml version="1.0" encoding="UTF-8"?>
<RootTag>
<InnerTag>
<MyValue>"< here is special char."</MyValue>
</InnerTag>
</RootTag>
I have the following snippet for parsing the above XML:
import xml.dom.minidom
xml.dom.minidom.parse('input_xml')
But I get following error:
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 4, column 26
The error occurs only when '&' or '<' appears inside the MyValue tags.
So,
How can I resolve this issue?
I don't want to change my XML by using escape sequences such as &lt;
and I want to keep the quotes ("").
Your example is not well-formed XML. A literal < is not allowed in XML text content; it may only open markup. Your data needs to be wrapped in a CDATA section or escaped as &lt;:
<![CDATA[< here is special char.]]>
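A short sketch of both options with xml.dom.minidom, assuming you control the code that writes the document. The helper my_value and the inline documents are illustrative; xml.sax.saxutils.escape from the standard library does the escaping.

```python
import xml.dom.minidom
from xml.sax.saxutils import escape

value = '"< here is special char."'

# Option 1: escape the text before it goes into the document.
escaped_doc = ("<RootTag><InnerTag><MyValue>%s</MyValue></InnerTag></RootTag>"
               % escape(value))

# Option 2: wrap the raw text in a CDATA section.
cdata_doc = ("<RootTag><InnerTag><MyValue><![CDATA[%s]]></MyValue></InnerTag></RootTag>"
             % value)

def my_value(doc):
    """Return the text content of the first MyValue element."""
    dom = xml.dom.minidom.parseString(doc)
    node = dom.getElementsByTagName("MyValue")[0]
    # Text and CDATA section nodes both expose their content as .data.
    return "".join(child.data for child in node.childNodes)

print(my_value(escaped_doc))
print(my_value(cdata_doc))
```

Either way the parser hands back the original string, quotes and '<' included, so the escaping is visible only in the serialized file. A CDATA section cannot itself contain the sequence ]]>, so escaping is the safer choice for arbitrary text.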
