Parser XML in python - python

I have some database like the next one in XML and im trying to parser it with Python 2.7:
<team>
<generator>
<team_name>TeamMaster</team_name>
<team_year>2000</team_year>
<team_city>NewYork</team_city>
</generator>
<players>
<definition name="John V." number="4" age="25">
<criteria position="fow" side="right">
<criterion website="www.johnV.com" version="1" result="true"/>
</criteria>
<object debut="2003" version="3" flag="complete">
<history item_ref="team34"/>
<history item_ref="mainteam"/>
</definition>
<definition name="Emma" number="2" age="19">
<criteria position="mid" side="left">
<criterion website="www.emma.net" version="7" result="true"/>
</criteria>
<object debut="2008" version="1" flag="complete">
<history item_ref="newteam"/>
<history item_ref="youngteam"/>
<history item_ref="oldteam"/>
</definition>
</players>
</team>
With this small scrip I can parse easily the first part "generator" from my xml, where I know all elements that contains:
from xml.dom.minidom import parseString
mydb = {
"team_name": ,
"team_year": ,
"team_data":
}
file = open('mydb.xml','r')
data = file.read()
file.close()
dom = parseString(data)
#retrieve the first xml tag (<tag>data</tag>) that the parser finds with name tagName:
xmlTag = dom.getElementsByTagName('team_name')[0].toxml()
#strip off the tag (<tag>data</tag> ---> data):
xmlData=xmlTag.replace('<team_name>','').replace('</team_name>','')
mydb["team_name"] = xmlData # TeamMaster
But my real problem came when I tried to parse the "players" elements, where attributes appears in "definition" and an unknown numbers of elements in "history".
Maybe there is another module that would help me for this better than minidon?

Better use xml.etree.ElementTree, it has a more pythonic syntax. Get the text of team_name by root.findtext('team_name') or iterate over all definitions with root.finditer('definitions').

You can use either Element Tree - XML Parser or use BeautifulSoup XML Parser.
I have created repo for usage of XML parser here XML Parsers Collection
Snippet code below:
#Get the data from XML parser.
users = xml_parser(users_file,'user')
#Iterate through root element.
for user in users:
print(user.find('country').text)
print(user.find('city').text)

Related

store content of a tag in a string with elementtree in python3

I'm using Python 3.7.2 and elementtree to copy the content of a tag in an XML file.
This is my XML file:
<?xml version="1.0" encoding="UTF-8"?>
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pain.001.003.03">
<CstmrCdtTrfInitn>
<GrpHdr>
<MsgId>nBblsUR-uH..6jmGgZNHLQAAAXgXN1Lu</MsgId>
<CreDtTm>2016-11-10T12:00:00.000+01:00</CreDtTm>
<NbOfTxs>1</NbOfTxs>
<CtrlSum>6</CtrlSum>
<InitgPty>
<Nm>TC 03000 Kunde 55 Protokollr ckf hrung</Nm>
</InitgPty>
</GrpHdr>
</CstmrCdtTrfInitn>
</Document>
I want to copy the content of the 'MsgId' tag and save it as a string.
I've manage to do this with minidom before, but due to new circumstances, I have to settle with elementtree for now.
This is that code with minidom:
dom = xml.dom.minidom.parse('H:\\app_python/in_spsh/{}'.format(filename_string))
message = dom.getElementsByTagName('MsgId')
for MsgId in message:
print(MsgId.firstChild.nodeValue)
Now I want to do the exact same thing with elementtree. How can I achieve this?
To get the text value of a single element, you can use the findtext() method. The namespace needs to be taken into account.
from xml.etree import ElementTree as ET
tree = ET.parse("test.xml") # Your XML document
msgid = tree.findtext('.//{urn:iso:std:iso:20022:tech:xsd:pain.001.003.03}MsgId')
With Python 3.8 and later, it is possible to use a wildcard for the namespace:
msgid = tree.findtext('.//{*}MsgId')

parsing XML file in python2.7

I know this is a very common question, but the kind of XML file and the kind of extraction of data i need is a little unique due to the nature of the xml file. So appreciate any help on the steps to extract the required data, with pyhton2.7
I have the below XML
<?xml version="1.0" encoding="UTF-8"?>
<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>Mango.XYZ_DIG_Team_ABCDEF_Mango_Review</members>
<members>Mango.XYZ_DIG_Team_Reporting_Mango_Review</members>
<members>Opportunity.A_T_Occupier_City_Job_List</members>
<name>ListView</name>
</types>
<types>
<members>Modify_All_Data_Permission</members>
<members>Opportunity_Alerts_Implementation</members>
<members>Process_Builder_Permission</members>
<members>Regional_Business_Support</members>
<members>Reports_Dashboards_Data_Export_for_Super_Users</members>
<name>PermissionSet</name>
</types>
<types>
<members>SolutionManager</members>
<members>Standard</members>
<name>Profile</name>
</types>
<types>
<members>Mango.Set Verified Date and System Id</members>
<members>Mango.Update Mango Site With Billing Street%2C City%2C Country</members>
<members>Mango.Update Family Id on Mango when created</members>
<members>Opportunity.Set Opportunity Name</members>
<name>WorkflowRule</name>
</types>
<version>38.0</version>
</Package>
i am trying to extract only the members from the PermissionSet block. So that eventually i will have a file, that only have the entries like
Modify_All_Data_Permission
Opportunity_Alerts_Implementation
Process_Builder_Permission
Regional_Business_Support
Reports_Dashboards_Data_Export_for_Super_Users
I have been able to extract only the 'name' tag by
from xml.dom import minidom
doc = minidom.parse("path_to_xmlFile")
t = doc.getElementsByTagName("types")
for n in t:
name = n.getElementsByTagName("name")[0]
print name.firstChild.data
How can i extract the members and save that to a file?
Note: the number of 'members' are not fixed they varies.
I can also try with a different library, if it serves the purpose.
Probably easiest to use XPath
import xml.etree.ElementTree as ET
root = ET.parse('file.xml').getroot()
for member in root.findall(".//members/")
print(member.text)
This may help you!
import xml.etree.ElementTree as ET
tree = ET.parse('file.xml')
root = tree.getroot()
for data in root[1]:
print data.text

How to deal with xmlns values while parsing an XML file?

I have the following toy example of an XML file. I have thousands of these. I have difficulty parsing this file.
Look at the text in second line. All my original files contain this text. When I delete i:type="Record" xmlns="http://schemas.datacontract.org/Storage" from second line (retaining the remaining text), I am able to get accelx and accely values using the code given below.
How can I parse this file with the original text?
<?xml version="1.0" encoding="utf-8"?>
<ArrayOfRecord xmlns:i="http://www.w3.org/2001/XMLSchema-instance" i:type="Record" xmlns="http://schemas.datacontract.org/Storage">
<AvailableCharts>
<Accelerometer>true</Accelerometer>
<Velocity>false</Velocity>
</AvailableCharts>
<Trics>
<Trick>
<EndOffset>PT2M21.835S</EndOffset>
<Values>
<TrickValue>
<Acceleration>26.505801694441629</Acceleration>
<Rotation>0.023379150593228679</Rotation>
</TrickValue>
</Values>
</Trick>
</Trics>
<Values>
<SensorValue>
<accelx>-3.593643144</accelx>
<accely>7.316485176</accely>
</SensorValue>
<SensorValue>
<accelx>0.31103436</accelx>
<accely>7.70408184</accely>
</SensorValue>
</Values>
</ArrayOfRecord>
Code to parse the data:
import lxml.etree as etree
tree = etree.parse(r"C:\testdel.xml")
root = tree.getroot()
val_of_interest = root.findall('./Values/SensorValue')
for sensor_val in val_of_interest:
print sensor_val.find('accelx').text
print sensor_val.find('accely').text
I asked related question here: How to extract data from xml file that is deep down the tag
Thanks
The confusion was caused by the following default namespace (namespace declared without prefix) :
xmlns="http://schemas.datacontract.org/Storage"
Note that descendants elements without prefix inherit default namespace from ancestor, implicitly. Now, to reference element in namespace, you need to map a prefix to the namespace URI, and use that prefix in your XPath :
ns = {'d': 'http://schemas.datacontract.org/Storage' }
val_of_interest = root.findall('./d:Values/d:SensorValue', ns)
for sensor_val in val_of_interest:
print sensor_val.find('d:accelx', ns).text
print sensor_val.find('d:accely', ns).text

Python ElementTree won't update new file after parsing

Using ElementTree to parse attribute's value in an XML and writing a new XML file. It will console the new updated value and write a new file. But won't update any changes in the new file. Please help me understand what I am doing wrong. Here is XML & Python code:
XML
<?xml version="1.0"?>
<!--
-->
<req action="get" msg="1" rank="1" rnklst="1" runuf="0" status="1" subtype="list" type="60" univ="IL" version="fhf.12.000.00" lang="ENU" chunklimit="1000" Times="1">
<flds>
<f i="bond(long) hff" aggregationtype="WeightedAverage" end="2016-02-29" freq="m" sid="fgg" start="2016-02-29"/>
<f i="bond(short) ggg" aggregationtype="WeightedAverage" end="2016-02-29" freq="m" sid="fhf" start="2016-02-29"/>
</flds>
<dat>
<r i="hello" CalculationType="3" Calculate="1" />
</dat>
</req>
Python
import xml.etree.ElementTree as ET
with open('test.xml', 'rt') as f:
tree = ET.parse(f)
for node in tree.iter('r'):
port_id = node.attrib.get('i')
new_port_id = port_id.replace(port_id, "new")
print node
tree.write('./new_test.xml')
When you get the attribute i, and assign it to port_id, you just have a regular Python string. Calling replace on it is just the Python string .replace() method.
You want to use the .set() method of the etree node:
for node in tree.iter('r'):
node.set('i', "new")
print node

Matching the XML Req and Res

I need advice on the below
Below are the request and response XML's. Request XML contains the words to be translated in the Foriegn language [String attribute inside Texts node] and the response XML contains the translation of these words in English [inside ].
REQUEST XML
<TranslateArrayRequest>
<AppId />
<From>ru</From>
<Options>
<Category xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" ></Category>
<ContentType xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2">text/plain</ContentType>
<ReservedFlags xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" />
<State xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" ></State>
<Uri xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" ></Uri>
<User xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" ></User>
</Options>
<Texts>
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">вк азиза и ринат</string>
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">скачать кайда кайдк кайрат нуртас бесплатно</string>
</Texts>
<To>en</To>
</TranslateArrayRequest>
RESPONSE XML
<ArrayOfTranslateArrayResponse xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<TranslateArrayResponse>
<From>ru</From>
<OriginalTextSentenceLengths xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays"><a:int>16</a:int>
</OriginalTextSentenceLengths>
<State/>
<TranslatedText>BK Aziza and Rinat</TranslatedText>
<TranslatedTextSentenceLengths xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays"><a:int>18</a:int>
</TranslatedTextSentenceLengths>
</TranslateArrayResponse>
<TranslateArrayResponse>
<From>ru</From>
<OriginalTextSentenceLengths xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays"><a:int>43</a:int> </OriginalTextSentenceLengths>
<State/>
<TranslatedText>Kairat kajdk Qaeda nurtas download free</TranslatedText>
<TranslatedTextSentenceLengths xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays"><a:int>39</a:int></TranslatedTextSentenceLengths>
</TranslateArrayResponse
</ArrayOfTranslateArrayResponse>
So there are two ways to relate the translated text to the original text:
Length of the original text; and
Order in the XML file
Relating by length being the probably unreliable because the probability of translating 2 or more phrases with the same number of characters is relatively significant.
So it comes down to order. I think it is relatively safe to assume that the files were processed and written in the same order. So I'll show you a way to relate the phrases using the order of the XML files.
This is relatively simple. We simply iterate through the trees and grab the words in the list. Also, for the translated XML due to its structure, we need to grab the root's namespace:
import re
import xml.etree.ElementTree as ElementTree
def map_translations(origin_file, translate_file):
origin_tree = ElementTree.parse(origin_file)
origin_root = origin_tree.getroot()
origin_text = [string.text for text_elem in origin_root.iter('Texts')
for string in text_elem]
translate_tree = ElementTree.parse(translate_file)
translate_root = translate_tree.getroot()
namespace = re.match('{.*}', translate_root.tag).group()
translate_text = [text.text for text in translate_root.findall(
'.//{}TranslatedText'.format(namespace))]
return dict(zip(origin_text, translate_text))
origin_file = 'some_file_path.xml'
translate_file = 'some_other_path.xml'
mapping = map_translations(origin_file, translate_file)
print(mapping)
Update
The above code is applicable for Python 2.7+. In Python 2.6 it changes slightly:
ElementTree objects do not have an iter function. Instead they have a getiterator function.
Change the appropriate line above to this:
origin_text = [string.text for text_elem in origin_root.iter('Texts')
for string in text_elem]
XPath syntax is (most likely) not supported. In order to get down to the TranslatedText nodes we need to use the same strategy as we do above:
Change the appropriate line above to this:
translate_text = [string.text for text in translate_root.getiterator(
'{0}TranslateArrayResponse'.format(namespace))
for string in text.getiterator(
'{0}TranslatedText'.format(namespace))]

Categories

Resources