Matching the XML Req and Res - python

I need advice on the following.
Below are the request and response XMLs. The request XML contains the phrases to be translated from the foreign language [the string elements inside the Texts node], and the response XML contains the English translations of these phrases [inside TranslatedText].
REQUEST XML
<TranslateArrayRequest>
<AppId />
<From>ru</From>
<Options>
<Category xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" ></Category>
<ContentType xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2">text/plain</ContentType>
<ReservedFlags xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" />
<State xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" ></State>
<Uri xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" ></Uri>
<User xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" ></User>
</Options>
<Texts>
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">вк азиза и ринат</string>
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">скачать кайда кайдк кайрат нуртас бесплатно</string>
</Texts>
<To>en</To>
</TranslateArrayRequest>
RESPONSE XML
<ArrayOfTranslateArrayResponse xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<TranslateArrayResponse>
<From>ru</From>
<OriginalTextSentenceLengths xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays"><a:int>16</a:int>
</OriginalTextSentenceLengths>
<State/>
<TranslatedText>BK Aziza and Rinat</TranslatedText>
<TranslatedTextSentenceLengths xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays"><a:int>18</a:int>
</TranslatedTextSentenceLengths>
</TranslateArrayResponse>
<TranslateArrayResponse>
<From>ru</From>
<OriginalTextSentenceLengths xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays"><a:int>43</a:int> </OriginalTextSentenceLengths>
<State/>
<TranslatedText>Kairat kajdk Qaeda nurtas download free</TranslatedText>
<TranslatedTextSentenceLengths xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays"><a:int>39</a:int></TranslatedTextSentenceLengths>
</TranslateArrayResponse>
</ArrayOfTranslateArrayResponse>

So there are two ways to relate the translated text to the original text:
Length of the original text; and
Order in the XML file
Relating by length is probably unreliable, because the probability of two or more phrases having the same number of characters is relatively significant.
So it comes down to order. I think it is relatively safe to assume that the files were processed and written in the same order. So I'll show you a way to relate the phrases using the order of the XML files.
This is relatively simple. We simply iterate through the trees and grab the words in the list. Also, for the translated XML due to its structure, we need to grab the root's namespace:
import re
import xml.etree.ElementTree as ElementTree

def map_translations(origin_file, translate_file):
    # Original phrases: every child of each <Texts> node, in document order
    origin_tree = ElementTree.parse(origin_file)
    origin_root = origin_tree.getroot()
    origin_text = [string.text for text_elem in origin_root.iter('Texts')
                   for string in text_elem]
    # Translations: the response root carries a default namespace,
    # so take the '{uri}' prefix from the root tag and reuse it
    translate_tree = ElementTree.parse(translate_file)
    translate_root = translate_tree.getroot()
    namespace = re.match('{.*}', translate_root.tag).group()
    translate_text = [text.text for text in translate_root.findall(
                      './/{}TranslatedText'.format(namespace))]
    # Pair the two lists by position
    return dict(zip(origin_text, translate_text))

origin_file = 'some_file_path.xml'
translate_file = 'some_other_path.xml'
mapping = map_translations(origin_file, translate_file)
print(mapping)
Update
The above code is applicable for Python 2.7+. In Python 2.6 it changes slightly:
ElementTree objects do not have an iter function. Instead they have a getiterator function.
Change the appropriate line above to this:
origin_text = [string.text for text_elem in origin_root.getiterator('Texts')
               for string in text_elem]
XPath syntax is (most likely) not supported. In order to get down to the TranslatedText nodes we need to use the same strategy as we do above:
Change the appropriate line above to this:
translate_text = [string.text for text in translate_root.getiterator(
                      '{0}TranslateArrayResponse'.format(namespace))
                  for string in text.getiterator(
                      '{0}TranslatedText'.format(namespace))]
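The two ideas above (order and length) can also be combined: pair by order, then use the lengths reported in OriginalTextSentenceLengths as a sanity check on each pair. The following is only a sketch for Python 3, using XML trimmed from the question; the helper name `pair_with_check` and the `NS` prefix names are made up for illustration, and it assumes the reported lengths count characters of the original string.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the request and response from the question.
REQUEST = """<TranslateArrayRequest>
  <Texts>
    <string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">вк азиза и ринат</string>
    <string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">скачать кайда кайдк кайрат нуртас бесплатно</string>
  </Texts>
</TranslateArrayRequest>"""

RESPONSE = """<ArrayOfTranslateArrayResponse xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2">
  <TranslateArrayResponse>
    <OriginalTextSentenceLengths xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays"><a:int>16</a:int></OriginalTextSentenceLengths>
    <TranslatedText>BK Aziza and Rinat</TranslatedText>
  </TranslateArrayResponse>
  <TranslateArrayResponse>
    <OriginalTextSentenceLengths xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays"><a:int>43</a:int></OriginalTextSentenceLengths>
    <TranslatedText>Kairat kajdk Qaeda nurtas download free</TranslatedText>
  </TranslateArrayResponse>
</ArrayOfTranslateArrayResponse>"""

# Prefix names are arbitrary; only the URIs have to match the documents.
NS = {
    "mt": "http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2",
    "arr": "http://schemas.microsoft.com/2003/10/Serialization/Arrays",
}

def pair_with_check(request_xml, response_xml):
    """Pair originals and translations by order, verifying reported lengths."""
    originals = [s.text for s in
                 ET.fromstring(request_xml).findall("Texts/arr:string", NS)]
    responses = ET.fromstring(response_xml).findall(
        "mt:TranslateArrayResponse", NS)
    if len(originals) != len(responses):
        raise ValueError("request and response contain different item counts")
    pairs = []
    for original, resp in zip(originals, responses):
        # A response may list one length per sentence, so add them up.
        reported = sum(int(n.text) for n in resp.findall(
            "mt:OriginalTextSentenceLengths/arr:int", NS))
        if reported != len(original):
            raise ValueError("length mismatch for %r" % original)
        pairs.append((original, resp.findtext("mt:TranslatedText",
                                              namespaces=NS)))
    return pairs

for original, translated in pair_with_check(REQUEST, RESPONSE):
    print(original, "->", translated)
```

A length mismatch here does not prove the order is wrong, but it is a cheap signal that the two files may not line up.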

Related

Modify XML Custom Part Word Document Server Properties using XML Element Tree and or XML Minidom

I am reformatting and restructuring our document management to use SharePoint. Our SOPs, Forms and Records were previously kept in SharePoint, were migrated to a major Document Management System, and now need to be migrated back into SharePoint. The other DMS used Document Variables to store key document information; previously this information was stored in the custom XML Part "documentManagement" properties. I have already developed Python scripts to modify the existing core_properties, extended_properties and custom_properties. However, my attempts with the docx, aspose and xml.dom.minidom libraries have yet to produce a script that can read or edit the XML Part "documentManagement" properties.
I have unzipped the Word document and located the XML Part "documentManagement" properties in the \customXML\item1.xml, \customXML\item2.xml, \customXML\item3.xml and sometimes \customXML\item4.xml files. The schema, elements, and restrictions for these properties are usually in \customXML\item1.xml, and the property values are usually stored in \customXML\item2.xml. I have included the item2.xml file here for reference.
Item2.xml
<p:properties xmlns:p="http://schemas.microsoft.com/office/2006/metadata/properties" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:pc="http://schemas.microsoft.com/office/infopath/2007/PartnerControls">
<documentManagement>
<qetp xmlns="71220325-c405-4751-a4f1-a91992783649">
<UserInfo>
<DisplayName/>
<AccountId xsi:nil="true" />
<AccountType/>
</UserInfo>
</qetp>
<IconOverlay xmlns="http://schemas.microsoft.com/sharepoint/v4" xsi:nil="true" />
<Revision xmlns="8db16272-67aa-4515-a58c-977707b42560">1</Revision>
<Review_x0020_Date xmlns="8db16272-67aa-4515-a58c-977707b42560">2019-04-10T07:00:00+00:00</Review_x0020_Date>
<Site xmlns="8db16272-67aa-4515-a58c-977707b42560">
<Value>Franklin</Value>
</Site>
<Status xmlns="8db16272-67aa-4515-a58c-977707b42560">Approved</Status>
<Effective_x0020_Date xmlns="8db16272-67aa-4515-a58c-977707b42560">2017-04-10T07:00:00+00:00</Effective_x0020_Date>
<Document_x0020_Number xmlns="8db16272-67aa-4515-a58c-977707b42560">EQIP-0033-00</Document_x0020_Number>
<Module xmlns="8db16272-67aa-4515-a58c-977707b42560">4</Module>
</documentManagement>
</p:properties>
Libraries such as docx and aspose.word have not been able to access these custom XML Part properties, even though they were used to access/edit the core, extended and custom properties. I am new to the xml.etree.ElementTree library and running into many failures. I hope someone might give me a starting point and direction.
I would like to change the value of Revision from 1 to a value of 5
I would like to change the value of Site from Franklin to Liverpool
I would like to change the value of Document_x0020_Number from EQIP-0033-00 to GOV-0112
I would like to change the Value of Status from Approved to Effective
I made progress using etree.ElementTree, but it has caused a problem I now need help with.
I used the following code to parse and edit the element text values in the tree. However, since the XML file was using namespaces, the parse resulted in the "tag" being {url}name instead of the tag being name xmlns={url}.
Code:
import xml.etree.ElementTree as ET

# Raw string so the backslashes in the Windows path are taken literally
path = r'D:\DTONAS01_DATAPART2_DriveE_Shares\Data\Quality\Veeva Export\XLSX_docProps\extracted\customXml\item2.xml'
tree = ET.parse(path)
root = tree.getroot()
print(root.tag, root.attrib, root.text)
for child in root:
    print(child.tag, child.attrib, child.text)
    for grandchild in child:
        print(grandchild.tag, grandchild.attrib, grandchild.text)
        if grandchild.tag == '{8db16272-67aa-4515-a58c-977707b42560}Revision':
            grandchild.text = str(5)
            print(grandchild.tag, grandchild.attrib, grandchild.text)
        if grandchild.tag == '{8db16272-67aa-4515-a58c-977707b42560}Document_x0020_Number':
            grandchild.text = "GOV-0112"
            print(grandchild.tag, grandchild.attrib, grandchild.text)
        if grandchild.tag == '{8db16272-67aa-4515-a58c-977707b42560}Status':
            grandchild.text = "Effective"
            print(grandchild.tag, grandchild.attrib, grandchild.text)
        if grandchild.tag == '{8db16272-67aa-4515-a58c-977707b42560}Site':
            # The site name is stored in the nested <Value> element
            for subelement in grandchild:
                print(subelement.tag, subelement.attrib, subelement.text)
                subelement.text = "Liverpool, England"
                print(grandchild.tag, grandchild.attrib, grandchild.text, subelement.text)
# Overwrite the original file with the edited tree
tree.write(path, encoding="utf-8")
Output:
<ns0:properties xmlns:ns0="http://schemas.microsoft.com/office/2006/metadata/properties" xmlns:ns1="71220325-c405-4751-a4f1-a91992783649" xmlns:ns3="http://schemas.microsoft.com/sharepoint/v4" xmlns:ns4="8db16272-67aa-4515-a58c-977707b42560" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<documentManagement>
<ns1:qetp>
<ns1:UserInfo>
<ns1:DisplayName/>
<ns1:AccountId xsi:nil="true" />
<ns1:AccountType/>
</ns1:UserInfo>
</ns1:qetp>
<ns3:IconOverlay xsi:nil="true" />
<ns4:Revision>5</ns4:Revision>
<ns4:Review_x0020_Date>2019-04-10T07:00:00+00:00</ns4:Review_x0020_Date>
<ns4:Site>
<ns4:Value>Liverpool, England</ns4:Value>
</ns4:Site>
<ns4:Status>Effective</ns4:Status>
<ns4:Effective_x0020_Date>2017-04-10T07:00:00+00:00</ns4:Effective_x0020_Date>
<ns4:Document_x0020_Number>GOV-0112</ns4:Document_x0020_Number>
<ns4:Module>4</ns4:Module>
</documentManagement>
</ns0:properties>
As a result the XML file I write back out looks very different than the original XML file. Unfortunately, Word does not like the new XML when zipped back together. Word is able to open the document, but Word no longer displays the properties in File\Info\Properties and shows an error in File\Info\Properties.
What do I need to do so that the output XML file looks the same as the input XML file regarding namespace notation?
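For context on why the output changes: the standard-library ElementTree keeps only the {uri}name form of each tag and generates ns0, ns1, ... prefixes when writing. A minimal sketch of that behaviour, using a trimmed copy of item2.xml, is below. `ET.register_namespace` lets you choose the prefix used for a URI, but as far as this sketch goes it does not restore the original per-element xmlns="..." declarations, and it makes no claim about whether Word accepts the result.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of item2.xml from the question.
SAMPLE = (
    '<p:properties xmlns:p="http://schemas.microsoft.com/office/2006/metadata/properties">'
    '<documentManagement>'
    '<Revision xmlns="8db16272-67aa-4515-a58c-977707b42560">1</Revision>'
    '</documentManagement>'
    '</p:properties>'
)

root = ET.fromstring(SAMPLE)

# Namespaced tags are addressed as {uri}name; the prefix is not stored.
revision = root.find(
    "documentManagement/{8db16272-67aa-4515-a58c-977707b42560}Revision")
revision.text = "5"

# Without registration, the writer invents ns0, ns1, ... prefixes.
before = ET.tostring(root, encoding="unicode")

# After registration, the chosen prefix is used for that URI.
ET.register_namespace(
    "p", "http://schemas.microsoft.com/office/2006/metadata/properties")
after = ET.tostring(root, encoding="unicode")

print(before)
print(after)
```

Note that `register_namespace` is a process-wide setting, so it affects every tree serialized afterwards.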

How to deal with xmlns values while parsing an XML file?

I have the following toy example of an XML file. I have thousands of these. I have difficulty parsing this file.
Look at the text in the second line. All my original files contain this text. When I delete i:type="Record" xmlns="http://schemas.datacontract.org/Storage" from the second line (retaining the remaining text), I am able to get the accelx and accely values using the code given below.
How can I parse this file with the original text?
<?xml version="1.0" encoding="utf-8"?>
<ArrayOfRecord xmlns:i="http://www.w3.org/2001/XMLSchema-instance" i:type="Record" xmlns="http://schemas.datacontract.org/Storage">
<AvailableCharts>
<Accelerometer>true</Accelerometer>
<Velocity>false</Velocity>
</AvailableCharts>
<Trics>
<Trick>
<EndOffset>PT2M21.835S</EndOffset>
<Values>
<TrickValue>
<Acceleration>26.505801694441629</Acceleration>
<Rotation>0.023379150593228679</Rotation>
</TrickValue>
</Values>
</Trick>
</Trics>
<Values>
<SensorValue>
<accelx>-3.593643144</accelx>
<accely>7.316485176</accely>
</SensorValue>
<SensorValue>
<accelx>0.31103436</accelx>
<accely>7.70408184</accely>
</SensorValue>
</Values>
</ArrayOfRecord>
Code to parse the data:
import lxml.etree as etree

tree = etree.parse(r"C:\testdel.xml")
root = tree.getroot()
val_of_interest = root.findall('./Values/SensorValue')
for sensor_val in val_of_interest:
    print(sensor_val.find('accelx').text)
    print(sensor_val.find('accely').text)
I asked a related question here: How to extract data from xml file that is deep down the tag
Thanks
The confusion was caused by the following default namespace (namespace declared without prefix) :
xmlns="http://schemas.datacontract.org/Storage"
Note that descendant elements without a prefix implicitly inherit the default namespace from their ancestor. To reference an element in a namespace, you need to map a prefix to the namespace URI and use that prefix in your XPath:
ns = {'d': 'http://schemas.datacontract.org/Storage'}
val_of_interest = root.findall('./d:Values/d:SensorValue', ns)
for sensor_val in val_of_interest:
    print(sensor_val.find('d:accelx', ns).text)
    print(sensor_val.find('d:accely', ns).text)
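The same prefix-map approach also works with the standard-library xml.etree.ElementTree. Here is a self-contained sketch using a trimmed copy of the sample file as an inline string; the function name `read_sensor_values` is made up for illustration, and the prefix `d` is arbitrary (only the URI has to match).

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample file, including the default namespace.
SAMPLE = """<ArrayOfRecord xmlns:i="http://www.w3.org/2001/XMLSchema-instance" i:type="Record" xmlns="http://schemas.datacontract.org/Storage">
  <Values>
    <SensorValue>
      <accelx>-3.593643144</accelx>
      <accely>7.316485176</accely>
    </SensorValue>
    <SensorValue>
      <accelx>0.31103436</accelx>
      <accely>7.70408184</accely>
    </SensorValue>
  </Values>
</ArrayOfRecord>"""

NS = {"d": "http://schemas.datacontract.org/Storage"}

def read_sensor_values(xml_text):
    """Return a list of (accelx, accely) float pairs."""
    root = ET.fromstring(xml_text)
    readings = []
    for sensor_val in root.findall("./d:Values/d:SensorValue", NS):
        accelx = float(sensor_val.findtext("d:accelx", namespaces=NS))
        accely = float(sensor_val.findtext("d:accely", namespaces=NS))
        readings.append((accelx, accely))
    return readings

# The unprefixed path matches nothing, because every element
# lives in the default namespace.
print(ET.fromstring(SAMPLE).findall("./Values/SensorValue"))
print(read_sensor_values(SAMPLE))
```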

Parser XML in python

I have a database in XML like the one below, and I'm trying to parse it with Python 2.7:
<team>
<generator>
<team_name>TeamMaster</team_name>
<team_year>2000</team_year>
<team_city>NewYork</team_city>
</generator>
<players>
<definition name="John V." number="4" age="25">
<criteria position="fow" side="right">
<criterion website="www.johnV.com" version="1" result="true"/>
</criteria>
<object debut="2003" version="3" flag="complete">
<history item_ref="team34"/>
<history item_ref="mainteam"/>
</object>
</definition>
<definition name="Emma" number="2" age="19">
<criteria position="mid" side="left">
<criterion website="www.emma.net" version="7" result="true"/>
</criteria>
<object debut="2008" version="1" flag="complete">
<history item_ref="newteam"/>
<history item_ref="youngteam"/>
<history item_ref="oldteam"/>
</object>
</definition>
</players>
</team>
With this small script I can easily parse the first part, "generator", of my XML, where I know all the elements it contains:
from xml.dom.minidom import parseString

# Placeholders to be filled in from the XML
mydb = {
    "team_name": None,
    "team_year": None,
    "team_data": None
}
with open('mydb.xml', 'r') as f:
    data = f.read()
dom = parseString(data)
# retrieve the first xml tag (<tag>data</tag>) that the parser finds with name tagName:
xmlTag = dom.getElementsByTagName('team_name')[0].toxml()
# strip off the tag (<tag>data</tag> ---> data):
xmlData = xmlTag.replace('<team_name>', '').replace('</team_name>', '')
mydb["team_name"] = xmlData  # TeamMaster
But my real problem came when I tried to parse the "players" elements, where "definition" carries attributes and "history" appears an unknown number of times.
Maybe there is another module that would help me with this better than minidom?
Better use xml.etree.ElementTree; it has a more Pythonic syntax. Get the text of team_name with root.findtext('generator/team_name'), or iterate over all definitions with root.iter('definition').
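A minimal sketch of that ElementTree approach, applied to a trimmed copy of the team XML. It assumes each `<object>` element is closed with `</object>`, as well-formed XML requires; the function name `load_team` and the shape of the returned dictionary are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the team XML, with <object> closed.
SAMPLE = """<team>
  <generator>
    <team_name>TeamMaster</team_name>
    <team_year>2000</team_year>
    <team_city>NewYork</team_city>
  </generator>
  <players>
    <definition name="John V." number="4" age="25">
      <criteria position="fow" side="right">
        <criterion website="www.johnV.com" version="1" result="true"/>
      </criteria>
      <object debut="2003" version="3" flag="complete">
        <history item_ref="team34"/>
        <history item_ref="mainteam"/>
      </object>
    </definition>
    <definition name="Emma" number="2" age="19">
      <criteria position="mid" side="left">
        <criterion website="www.emma.net" version="7" result="true"/>
      </criteria>
      <object debut="2008" version="1" flag="complete">
        <history item_ref="newteam"/>
        <history item_ref="youngteam"/>
        <history item_ref="oldteam"/>
      </object>
    </definition>
  </players>
</team>"""

def load_team(xml_text):
    """Collect the generator fields and one dict per player definition."""
    root = ET.fromstring(xml_text)
    team = {
        "team_name": root.findtext("generator/team_name"),
        "team_year": root.findtext("generator/team_year"),
        "players": [],
    }
    for definition in root.iter("definition"):
        criteria = definition.find("criteria")
        team["players"].append({
            # Attributes are read with .get(); values are always strings.
            "name": definition.get("name"),
            "number": int(definition.get("number")),
            "position": criteria.get("position") if criteria is not None else None,
            # iter() copes with any number of <history> elements.
            "history": [h.get("item_ref") for h in definition.iter("history")],
        })
    return team

team = load_team(SAMPLE)
for player in team["players"]:
    print(player["name"], player["history"])
```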
You can use either Element Tree - XML Parser or use BeautifulSoup XML Parser.
I have created a repo showing usage of these XML parsers here: XML Parsers Collection
Code snippet below:
# Get the data from the XML parser.
users = xml_parser(users_file, 'user')
# Iterate through the returned elements.
for user in users:
    print(user.find('country').text)
    print(user.find('city').text)

Drop all namespaces in lxml?

I'm working with some of google's data APIs, using the lxml library in python. Namespaces are a huge hassle here. For a lot of the work I'm doing (xpath stuff, mainly), it would be nice to just plain ignore them.
Is there a simple way to ignore xml namespaces in python/lxml?
thanks!
If you'd like to remove all namespaces from elements and attributes, I suggest the code shown below.
Context: In my application I'm obtaining XML representations of SOAP response streams, but I'm not interested in building objects on the client side; I'm only interested in the XML representations themselves. Moreover, I'm not interested in anything to do with namespaces, which only makes things more complicated than they need to be for my purposes. So I simply remove namespaces from elements and drop all attributes that contain namespaces.
import os
import lxml.etree as etree

def dropns(root):
    for elem in root.iter():
        # Keep only the part of the tag after the last ':'
        parts = elem.tag.split(':')
        if len(parts) > 1:
            elem.tag = parts[-1]
        # Collect prefixed attributes first, then delete them
        entries = []
        for attrib in elem.attrib:
            if attrib.find(':') > -1:
                entries.append(attrib)
        for entry in entries:
            del elem.attrib[entry]

# Test case
name = os.path.expanduser('~/tmp/mantisbt/test.xml')
f = open(name, 'rb')
parser = etree.XMLParser(ns_clean=True, recover=True)
root = etree.parse(f, parser=parser)
print('=====================================================================')
print(etree.tostring(root, pretty_print=True))
print('=====================================================================')
dropns(root)
print(etree.tostring(root, pretty_print=True))
print('=====================================================================')
which prints:
=====================================================================
<SOAP-ENV:Envelope SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
<SOAP-ENV:Body>
<ns1:mc_issue_getResponse>
<return xsi:type="tns:IssueData">
<id xsi:type="xsd:integer">356</id>
<view_state xsi:type="tns:ObjectRef">
<id xsi:type="xsd:integer">10</id>
<name xsi:type="xsd:string">public</name>
</view_state>
</return>
</ns1:mc_issue_getResponse>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
=====================================================================
<Envelope>
<Body>
<mc_issue_getResponse>
<return>
<id>356</id>
<view_state>
<id>10</id>
<name>public</name>
</view_state>
</return>
</mc_issue_getResponse>
</Body>
</Envelope>
=====================================================================
In lxml some_element.tag is a string like {namespace-uri}local-name if there is a namespace, just local-name otherwise. Beware that it is a non string value on non-element nodes (such as comments).
Try this:
for node in some_tree.iter():
    startswith = getattr(node.tag, 'startswith', None)
    if startswith and startswith('{'):
        node.tag = node.tag.rsplit('}', 1)[-1]
On Python 2.x the tag can be either an ASCII byte string or a Unicode string. The existence of a startswith method tests for either.
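The same tag-rewriting idea can be tried without lxml. The sketch below uses the standard-library xml.etree.ElementTree on Python 3 (an assumption, since the question is about lxml), where namespaced tags have the same {namespace-uri}local-name form. The function name `strip_namespaces` is made up for illustration, and attributes are left untouched.

```python
import xml.etree.ElementTree as ET

def strip_namespaces(root):
    """Rewrite every '{uri}local' tag in the tree to plain 'local', in place."""
    for elem in root.iter():
        # Comments and processing instructions have non-string tags.
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            elem.tag = elem.tag.split("}", 1)[1]
    return root

SAMPLE = (
    '<a:root xmlns:a="urn:example:a" xmlns="urn:example:default">'
    '<child><a:leaf>x</a:leaf></child>'
    '</a:root>'
)

root = strip_namespaces(ET.fromstring(SAMPLE))
# Plain paths now work without any prefix map.
print(root.find("child/leaf").text)
print(ET.tostring(root, encoding="unicode"))
```

Dropping namespaces can merge elements that were distinct only by namespace, so this is best kept to cases where local names are known to be unambiguous.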

How to validate XML with multiple namespaces in Python?

I'm trying to write some unit tests in Python 2.7 to validate against some extensions I've made to the OAI-PMH schema: http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd
The problem I'm running into, which has to do with multiple nested namespaces, is caused by this specification in the above-mentioned XSD:
<complexType name="metadataType">
<annotation>
<documentation>Metadata must be expressed in XML that complies
with another XML Schema (namespace=#other). Metadata must be
explicitly qualified in the response.</documentation>
</annotation>
<sequence>
<any namespace="##other" processContents="strict"/>
</sequence>
</complexType>
Here's a snippet of the code I'm using:
from lxml import etree
import urllib2

query = "http://localhost:8080/OAI-PMH?verb=GetRecord&by_doc_ID=false&metadataPrefix=nsdl_dc&identifier=http://www.purplemath.com/modules/ratio.htm"
schema_file = open("../schemas/OAI/2.0/OAI-PMH.xsd", "r")
schema_doc = etree.parse(schema_file)
oaischema = etree.XMLSchema(schema_doc)
# xml_headers is defined elsewhere (not shown in this snippet)
request = urllib2.Request(query, headers=xml_headers)
response = urllib2.urlopen(request)
body = response.read()
response_doc = etree.fromstring(body)
try:
    oaischema.assertValid(response_doc)
except etree.DocumentInvalid as e:
    # Print the response with line numbers so the reported line can be located
    line = 1
    for i in body.split("\n"):
        print("{0}\t{1}".format(line, i))
        line += 1
    print(e.message)
I end up with the following error:
AssertionError: http://localhost:8080/OAI-PMH?verb=GetRecord&by_doc_ID=false&metadataPrefix=nsdl_dc&identifier=http://www.purplemath.com/modules/ratio.htm
Element '{http://www.openarchives.org/OAI/2.0/oai_dc/}oai_dc': No matching global element declaration available, but demanded by the strict wildcard., line 22
I understand the error, in that the schema is requiring that the child element of the metadata element be strictly validated, which the sample xml does.
Now I've written a validator in Java that works - however it would be helpful for this to be in Python, since the rest of the solution I'm building is Python based. To make my Java variant work, I had to make my DocumentFactory namespace aware, otherwise I got the same error. I've not found any working example in python that performs this validation correctly.
Does anyone have an idea how I can get an XML document with multiple nested namespaces, such as my sample doc, to validate with Python?
Here is the sample XML document that I'm trying to validate:
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2002-02-08T08:55:46Z</responseDate>
<request verb="GetRecord" identifier="oai:arXiv.org:cs/0112017"
metadataPrefix="oai_dc">http://arXiv.org/oai2</request>
<GetRecord>
<record>
<header>
<identifier>oai:arXiv.org:cs/0112017</identifier>
<datestamp>2001-12-14</datestamp>
<setSpec>cs</setSpec>
<setSpec>math</setSpec>
</header>
<metadata>
<oai_dc:dc
xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Using Structural Metadata to Localize Experience of
Digital Content</dc:title>
<dc:creator>Dushay, Naomi</dc:creator>
<dc:subject>Digital Libraries</dc:subject>
<dc:description>With the increasing technical sophistication of
both information consumers and providers, there is
increasing demand for more meaningful experiences of digital
information. We present a framework that separates digital
object experience, or rendering, from digital object storage
and manipulation, so the rendering can be tailored to
particular communities of users.
</dc:description>
<dc:description>Comment: 23 pages including 2 appendices,
8 figures</dc:description>
<dc:date>2001-12-14</dc:date>
</oai_dc:dc>
</metadata>
</record>
</GetRecord>
</OAI-PMH>
Found this in lxml's doc on validation:
>>> schema_root = etree.XML('''\
... <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
... <xsd:element name="a" type="xsd:integer"/>
... </xsd:schema>
... ''')
>>> schema = etree.XMLSchema(schema_root)
>>> parser = etree.XMLParser(schema = schema)
>>> root = etree.fromstring("<a>5</a>", parser)
So, perhaps, what you need is this? (See last two lines.):
schema_doc = etree.parse(schema_file)
oaischema = etree.XMLSchema(schema_doc)
request = urllib2.Request(query, headers=xml_headers)
response = urllib2.urlopen(request)
body = response.read()
parser = etree.XMLParser(schema = oaischema)
response_doc = etree.fromstring(body, parser)
