Python parsing xml/opf file - python

I want to read the text between
<dc:title> </dc:title>
This is the XML:
<?xml version='1.0' encoding='utf-8'?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="calibre-uuid">
<metadata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:opf="http://www.idpf.org/2007/opf" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:calibre="http://calibre.kovidgoyal.net/2009/metadata" xmlns:dc="http://purl.org/dc/elements/1.1/">
<meta name="calibre:series_index" content="1"/>
<dc:language>UND</dc:language>
<dc:creator opf:file-as="Unbekannt" opf:role="aut">Johann Wolfgang von Goethe</dc:creator>
<meta name="calibre:timestamp" content="2009-10-08T07:26:21"/>
<dc:title>Faust_I_</dc:title>
<meta name="cover" content="cover"/>
<dc:date>2009-10-08T07:26:21</dc:date>
<dc:contributor opf:role="bkp">calibre (0.6.13) [http://calibre-ebook.com]</dc:contributor>
<dc:identifier id="calibre-uuid">urn:uuid:3cd4b26f-39a3-4783-9730-a86c26b30818</dc:identifier>
And that's my code:
from xml.etree import ElementTree as ET
tree = ET.parse('content.opf')
root = tree.getroot()
dc_namespace = "http://purl.org/dc/elements/1.1/"
print (root.attrib[ET.QName(dc_namespace, 'title')])
Output Error:
Traceback (most recent call last):
File "C:\Users\User\Documents\Visual Studio 2017\Projects\PythonApplication1\Modul1.py", line 8, in <module>
print (root.attrib[ET.QName(dc_namespace, 'title')])
KeyError: <QName '{xmlns:dc}title'>
What's wrong?

What you are looking for (<dc:title>) is an element, not an attribute. Here is how you can get its value:
from xml.etree import ElementTree as ET
tree = ET.parse('content.opf')
title = tree.find(".//{http://purl.org/dc/elements/1.1/}title")
print(title.text)
Output:
Faust_I_
Relevant references:
https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces
https://docs.python.org/3/library/xml.etree.elementtree.html#supported-xpath-syntax
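The first reference also describes a prefix-to-URI mapping that avoids spelling out the full namespace URI in every path. A minimal sketch, assuming a shortened, well-formed copy of the OPF data held in a string (the names opf_text and ns are illustrative, not part of the original code):

```python
from xml.etree import ElementTree as ET

# Shortened, well-formed stand-in for content.opf (illustrative data only).
opf_text = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:language>UND</dc:language>
    <dc:title>Faust_I_</dc:title>
  </metadata>
</package>"""

# Prefix-to-URI mapping; the prefix only has to match the one used in the path.
ns = {"dc": "http://purl.org/dc/elements/1.1/"}

root = ET.fromstring(opf_text)

# root.attrib only holds the attributes of <package> itself,
# so a lookup for dc:title there raises KeyError.
package_attributes = dict(root.attrib)

# dc:title is a descendant element, so search for it with find().
title = root.find(".//dc:title", ns)
title_text = title.text if title is not None else None
print(title_text)
```

The None check matters because find() returns None when nothing matches, and accessing .text on None would raise an AttributeError.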

You can use:
root[number][number]
to access the elements by position.
For example, in
<base>
<element1>
<element2>asdada</element2>
</element1>
</base>
root[0][0] will give you element2.
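A small runnable sketch of the index-based access described above, using the same example document (the variable name inner is illustrative):

```python
from xml.etree import ElementTree as ET

root = ET.fromstring(
    "<base><element1><element2>asdada</element2></element1></base>"
)

# root[0] is the first child of <base>, root[0][0] is that child's first child.
inner = root[0][0]
print(inner.tag, inner.text)
```

Index access depends on the order of the children, so it can break when the file layout changes. Searching by tag name with find() is usually more robust.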

Related

Get text inside xml tags by their name

I have some XML and I want to get the text of specific elements (XML tags) by name using Python.
I have tried a couple of solutions, but they didn't work.
import xml.etree.ElementTree as ET
tree = ET.fromstring(xml)
for node in tree.iter('Model'):
    print node
How can I do that?
XML:
<soap:Envelope
xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<GetVehicleLimitedInfoResponse
xmlns="http://schemas.conversesolutions.com/xsd/dmticta/v1">
<return>
<ResponseMessage xsi:nil="true" />
<ErrorCode xsi:nil="true" />
<RequestId> 2012290007705 </RequestId>
<TransactionCharge>150</TransactionCharge>
<VehicleNumber>GF-0176</VehicleNumber>
<AbsoluteOwner>SIYAPATHA FINANCE PLC</AbsoluteOwner>
<EngineNo>GA15-483936F</EngineNo>
<ClassOfVehicle>MOTOR CAR</ClassOfVehicle>
<Make>NISSAN</Make>
<Model>PULSAR</Model>
<YearOfManufacture>1998</YearOfManufacture>
<NoOfSpecialConditions>0</NoOfSpecialConditions>
<SpecialConditions xsi:nil="true" />
</return>
</GetVehicleLimitedInfoResponse>
</soap:Body>
</soap:Envelope>
Edited and improved answer:
import xml.etree.ElementTree as ET
import re
ns = {"veh": "http://schemas.conversesolutions.com/xsd/dmticta/v1"}
tree = ET.parse('test.xml') # save your xml as test.xml
root = tree.getroot()
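def get_tag_name(tag):
    # strip the "{namespace-uri}" prefix that ElementTree puts in front of tag names
    return re.sub(r'\{.*\}', '', tag)
# iterate over the direct children of <return>
for node in root.find(".//veh:return", ns):
    print(get_tag_name(node.tag) + ':', node.text)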
It should produce something like this:
ResponseMessage: None
ErrorCode: None
RequestId: 2012290007705
TransactionCharge: 150
VehicleNumber: GF-0176
AbsoluteOwner: SIYAPATHA FINANCE PLC
EngineNo: GA15-483936F
ClassOfVehicle: MOTOR CAR
Make: NISSAN
Model: PULSAR
YearOfManufacture: 1998
NoOfSpecialConditions: 0
SpecialConditions: None
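To fetch a single element by name, as the question's tree.iter('Model') attempted, the tag has to include the namespace URI in braces. A minimal sketch, assuming a shortened copy of the SOAP response (the names xml_text and VEH_NS are illustrative):

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the SOAP response in the question.
xml_text = """<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetVehicleLimitedInfoResponse xmlns="http://schemas.conversesolutions.com/xsd/dmticta/v1">
      <return>
        <Make>NISSAN</Make>
        <Model>PULSAR</Model>
      </return>
    </GetVehicleLimitedInfoResponse>
  </soap:Body>
</soap:Envelope>"""

VEH_NS = "http://schemas.conversesolutions.com/xsd/dmticta/v1"

root = ET.fromstring(xml_text)

# <Model> inherits the default namespace declared on its ancestor,
# so its full tag is "{namespace-uri}Model" rather than plain "Model".
models = [node.text for node in root.iter("{%s}Model" % VEH_NS)]
print(models)
```

A plain iter('Model') finds nothing here because no element in the document has the un-namespaced tag "Model".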

Why do I get an error while iterating through a dictionary

I'm a beginner in Python and struggling to understand why, while iterating through a dictionary obtained from an XML file, I get an error when I look up the required keys. I should also mention that I still get the result I want, but I keep receiving an error.
import os
import shelve
import xml.etree.ElementTree as et
shelfFile = shelve.open('xml_data')
base_path = os.path.dirname(os.path.realpath(__file__))
xml_file = os.path.join(base_path, "data\\nrm_icg_catalog.xml")
tree = et.parse(xml_file)
root = tree.getroot()
elements = ['Name','WBS']
for child in root:
    for itemGroup1 in child:
        for item in elements:
            print(itemGroup1.attrib[item])
Result with Error Message:
Facilitating Works
0
Substructure
1
Superstructure
2
Internal Finish
3
Fittings
4
Services
5
Prefabs
6
Works to Existing Building
7
External Works
8
MC Prelims
9
MC OH and P
10
Traceback (most recent call last):
File "c:/Users/Dodzi Agbenorku/OneDrive/Training Files/Programming Lessons/Python/xmlExcel/app.py", line 22, in <module>
print(itemGroup1.attrib[item])
KeyError: 'Name'
Here's a small section of the xml file I am using:
<?xml version="1.0" encoding="utf-8"?>
<Takeoff xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="http://download.autodesk.com/us/navisworks/schemas/nw-TakeoffCatalog-10.0.xsd">
<Catalog>
<ItemGroup Name="Facilitating Works" WBS="0" CatalogId="32b4ab2d-6fe8-4c45-9872-c8ea68c0c4de">
<ItemGroup Name="Hazardous Materials" WBS="1" CatalogId="2fdb6bd1-b2d1-4167-a74d-da818183a156">
<ItemGroup Name="Material Removal" WBS="1" CatalogId="ccc8a515-4152-400c-a72e-6fd78561325e">
<Item Name="Material Details" WBS="1" Transparency="0.3" Color="-15161029" LineThickness="0.1" CatalogId="c0a7de26-6bc3-491e-b3c7-3ff5b560eeaf">
<VariableCollection>
<Variable Name="Length" Formula="=ModelLength" Units="Meter" />
<Variable Name="Width" Formula="=ModelWidth" Units="Meter" />
<Variable Name="Thickness" Formula="=ModelThickness" Units="Meter" />
<Variable Name="Height" Formula="=ModelHeight" Units="Meter" />
<Variable Name="Perimeter" Formula="=ModelPerimeter" Units="Meter" />
<Variable Name="Area" Formula="=ModelArea" Units="SquareMeter" />
Any help will be extremely appreciated.
The error occurs because some nodes have no Name attribute, so that key is missing from attrib, which is a dictionary.
Instead of itemGroup1.attrib[item], use itemGroup1.attrib.get(item). It returns None if the key does not exist instead of raising a KeyError.
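A minimal sketch of that fix, assuming a simplified catalog without namespaces. The ConfigFile and Workbook elements are hypothetical stand-ins for whatever nodes in the real file lack the Name attribute:

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the catalog; <ConfigFile>/<Workbook> are hypothetical
# examples of nodes that carry neither a Name nor a WBS attribute.
xml_text = """<Takeoff>
  <Catalog>
    <ItemGroup Name="Facilitating Works" WBS="0"/>
    <ItemGroup Name="Substructure" WBS="1"/>
  </Catalog>
  <ConfigFile>
    <Workbook/>
  </ConfigFile>
</Takeoff>"""

root = ET.fromstring(xml_text)
rows = []
for child in root:
    for group in child:
        name = group.attrib.get("Name")  # None when the attribute is missing
        wbs = group.attrib.get("WBS")
        if name is None or wbs is None:
            continue  # skip nodes without both attributes instead of raising KeyError
        rows.append((name, wbs))
print(rows)
```

Skipping incomplete nodes is one possible policy. Printing None or supplying a default with get(key, default) would work just as well, depending on what the output is used for.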

How to parse an unstructured XML file using Python?

How can I parse an unstructured XML file? I need to get the data inside the patient tag and the title using ElementTree.
<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">
<templateId root="2.16.840.1.113883.10.20.22.1.1"/>
<id extension="4b78219a-1d02-4e7c-9870-dc7ce3b8a8fb" root="1.2.840.113619.21.1.3214775361124994304.5.1"/>
<code code="34133-9" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="Summarization of episode note"/>
<title>Summary</title>
<effectiveTime value="20170919160921ddfdsdsdsd31-0400"/>
<confidentialityCode code="N" codeSystem="2.16.840.dwdwddsd1.113883.5.25"/>
<recordTarget>
<patientRole><id extension="0" root="1.2.840.113619.21.1.3214775361124994304.2.1.1.2"/>
<addr use="HP"><streetAddressLine>addd2 </streetAddressLine><city>fgfgrtt</city><state>tr</state><postalCode>121213434</postalCode><country>rere</country></addr>
<patient>
<name><given>fname</given><family>lname</family></name>
<administrativeGenderCode code="F" codeSystem="2.16.840.1.113883.5.1" displayName="Female"/>
<birthTime value="19501025"/>
<maritalStatusCode code="M" codeSystem="2434.16.840.1.143434313883.5.2" displayName="M"/>
<languageCommunication>
<languageCode code="eng"/>
<proficiencyLevelCode nullFlavor="NI"/>
<preferenceInd value="true"/>
</languageCommunication>
</patient>
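I want the given name, family name, gender, and title.
You can use BeautifulSoup (bs4) with the lxml parser to extract the data from the XML. Note that this parser lowercases tag and attribute names, which is why the lookups below use lowercase.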
from bs4 import BeautifulSoup
xml_data = '''<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">
<templateId root="2.16.840.1.113883.10.20.22.1.1"/>
<id extension="4b78219a-1d02-4e7c-9870-dc7ce3b8a8fb" root="1.2.840.113619.21.1.3214775361124994304.5.1"/>
<code code="34133-9" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="Summarization of episode note"/>
<title>Summary</title>
<effectiveTime value="20170919160921ddfdsdsdsd31-0400"/>
<confidentialityCode code="N" codeSystem="2.16.840.dwdwddsd1.113883.5.25"/>
<recordTarget>
<patientRole><id extension="0" root="1.2.840.113619.21.1.3214775361124994304.2.1.1.2"/>
<addr use="HP"><streetAddressLine>addd2 </streetAddressLine><city>fgfgrtt</city><state>tr</state><postalCode>121213434</postalCode><country>rere</country></addr>
<patient>
<name><given>fname</given><family>lname</family></name>
<administrativeGenderCode code="F" codeSystem="2.16.840.1.113883.5.1" displayName="Female"/>
<birthTime value="19501025"/>
<maritalStatusCode code="M" codeSystem="2434.16.840.1.143434313883.5.2" displayName="M"/>
<languageCommunication>
<languageCode code="eng"/>
<proficiencyLevelCode nullFlavor="NI"/>
<preferenceInd value="true"/>
</languageCommunication>
</patient>'''
soup = BeautifulSoup(xml_data, "lxml")
title = soup.find("title")
print(title.text.strip())
patient = soup.find("patient")
given = patient.find("given").text.strip()
family = patient.find("family").text.strip()
gender = patient.find("administrativegendercode")['displayname'].strip()
print(given)
print(family)
print(gender)
Output:
Summary
fname
lname
Female
Install library dependency:
pip3 install beautifulsoup4==4.7.1
pip3 install lxml==4.3.3
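Alternatively, you can use lxml directly. Here is the tutorial that I used: https://lxml.de/tutorial.html
The code should look similar to this:
from lxml import etree
ns = {"hl7": "urn:hl7-org:v3"}  # default namespace of the document
root = etree.parse("cda.xml").getroot()  # assumption: the complete document is saved as cda.xml
patient = root.find(".//hl7:patient", ns)
print(patient.findtext("hl7:name/hl7:given", namespaces=ns))
print(patient.findtext("hl7:name/hl7:family", namespaces=ns))
print(patient.find("hl7:administrativeGenderCode", ns).get("displayName"))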
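Since the question asks for ElementTree, here is a sketch using only the standard library. It assumes a shortened, well-formed copy of the document (the snippet in the question is cut off before the closing tags), and the names cda_text, ns, and result are illustrative:

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the clinical document in the question.
cda_text = """<ClinicalDocument xmlns="urn:hl7-org:v3">
  <title>Summary</title>
  <recordTarget>
    <patientRole>
      <patient>
        <name><given>fname</given><family>lname</family></name>
        <administrativeGenderCode code="F" displayName="Female"/>
      </patient>
    </patientRole>
  </recordTarget>
</ClinicalDocument>"""

# Every element lives in the default namespace, so map it to a prefix for lookups.
ns = {"hl7": "urn:hl7-org:v3"}

root = ET.fromstring(cda_text)
patient = root.find(".//hl7:patient", ns)

result = {
    "title": root.findtext("hl7:title", namespaces=ns),
    "given": patient.findtext("hl7:name/hl7:given", namespaces=ns),
    "family": patient.findtext("hl7:name/hl7:family", namespaces=ns),
    # The gender is stored in an attribute, not in the element text.
    "gender": patient.find("hl7:administrativeGenderCode", ns).get("displayName"),
}
print(result)
```

Unlike the BeautifulSoup version, tag and attribute names keep their original case here, and the namespace has to be supplied explicitly.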

Extract elements from an XML file and write to another using cElementTree module

I have a large XML file and I want to extract some tags and write them in another xml file. I wrote this code:
import xml.etree.cElementTree as CE
tree = CE.ElementTree()
root = CE.Element("root")
i = 0
for event, elem in CE.iterparse('data.xml'):
    if elem.tag == "ActivityRef":
        print(elem.tag)
        a = CE.Element(elem.tag)
        root.append(elem)
        elem.clear()
        i += 1
    if i == 200:
        break
But I don't get the desired result, I got this:
<root>
<ActivityRef />
<ActivityRef />
<ActivityRef />
<ActivityRef />
...
</root>
instead of this:
<root>
<ActivityRef>
<Id>2008-12-11T20:43:07Z</Id>
</ActivityRef>
<ActivityRef>
<Id>2008-10-11T20:43:07Z</Id>
</ActivityRef>
...
</root>
Edit
Input file:
<?xml version="1.0" encoding="UTF-8"?>
<Folders>
<History>
<Running>
<ActivityRef>
<Id>2009-03-14T17:05:55Z</Id>
</ActivityRef>
<ActivityRef>
<Id>2009-03-13T06:12:42Z</Id>
</ActivityRef>
<ActivityRef>
<Id>2009-03-08T09:00:29Z</Id>
</ActivityRef>
<ActivityRef>
<Id>2009-03-04T19:39:39Z</Id>
</ActivityRef>
...
</Running>
</History>
</Folders>
And also I need to remove the element from the source file.
Use XPATH
import xml.etree.ElementTree as ET
data = '''<?xml version="1.0" encoding="UTF-8"?>
<Folders>
<History>
<Running>
<ActivityRef>
<Id>2009-03-14T17:05:55Z</Id>
</ActivityRef>
<ActivityRef>
<Id>2009-03-13T06:12:42Z</Id>
</ActivityRef>
<ActivityRef>
<Id>2009-03-08T09:00:29Z</Id>
</ActivityRef>
<ActivityRef>
<Id>2009-03-04T19:39:39Z</Id>
</ActivityRef>
</Running>
</History>
</Folders>'''
root = ET.fromstring(data)
# 'activities' contains the elements you are looking for
activities = root.findall('.//ActivityRef')
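The empty <ActivityRef /> elements in the question most likely come from calling elem.clear() on the same object that was just appended to root, which wipes its children. The sketch below builds on the findall idea to cover the remaining two requirements: collecting the elements under a new root and removing them from the source tree. It assumes the document fits in memory, and the names source_root and new_root are illustrative:

```python
import copy
import xml.etree.ElementTree as ET

data = """<Folders>
  <History>
    <Running>
      <ActivityRef><Id>2009-03-14T17:05:55Z</Id></ActivityRef>
      <ActivityRef><Id>2009-03-13T06:12:42Z</Id></ActivityRef>
    </Running>
  </History>
</Folders>"""

source_root = ET.fromstring(data)
new_root = ET.Element("root")

# ElementTree elements have no parent pointer, so visit every potential parent.
# list() takes a snapshot so the tree can be modified while looping.
for parent in list(source_root.iter()):
    for child in list(parent):
        if child.tag == "ActivityRef":
            new_root.append(copy.deepcopy(child))  # independent copy for the new file
            parent.remove(child)                   # remove it from the source tree

extracted_xml = ET.tostring(new_root, encoding="unicode")
remaining_xml = ET.tostring(source_root, encoding="unicode")
print(extracted_xml)
print(remaining_xml)
```

To save the results, ET.ElementTree(new_root).write("extracted.xml") and ET.ElementTree(source_root).write("data.xml") could be used, where both filenames are assumptions. For files too large to load at once, the iterparse approach from the question is still the right tool, provided the element is copied before it is cleared.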

XML Parse attributes using namespace

I'm using the following XML:
<feed xmlns:im="http://itunes.apple.com/rss" xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
<id>
https://itunes.apple.com/IN/rss/topfreeapplications/limit=200/xml
</id>
<title>iTunes Store: Top Free Apps</title>
<updated>2016-12-05T12:37:06-07:00</updated>
<link rel="alternate" type="text/html" href="https://itunes.apple.com/WebObjects/MZStore.woa/wa/viewTop?cc=in&id=134581&popId=27"/>
<link rel="self" href="https://itunes.apple.com/IN/rss/topfreeapplications/limit=200/xml"/>
<icon>http://itunes.apple.com/favicon.ico</icon>
<author>
<name>iTunes Store</name>
<uri>http://www.apple.com/uk/itunes/</uri>
</author>
<rights>Copyright 2008 Apple Inc.</rights>
<entry>
<updated>2016-12-05T12:37:06-07:00</updated>
<id im:id="473941634" im:bundleId="com.one97.paytm">https://itunes.apple.com/in/app/recharge-bill-payment-wallet/id473941634?mt=8&uo=2</id>
<title>Recharge, Bill Payment & Wallet - Paytm Mobile Solutions</title>
<summary></summary>
<im:name>Recharge, Bill Payment & Wallet</im:name>
<link rel="alternate" type="text/html" href="https://itunes.apple.com/in/app/recharge-bill-payment-wallet/id473941634?mt=8&uo=2"/>
<im:contentType term="Application" label="Application"/>
<category im:id="6024" term="Shopping" scheme="https://itunes.apple.com/in/genre/ios-shopping/id6024?mt=8&uo=2" label="Shopping"/>
<im:artist href="https://itunes.apple.com/in/developer/paytm-mobile-solutions/id473941637?mt=8&uo=2">Paytm Mobile Solutions</im:artist>
<im:price amount="0.00000" currency="INR">Get</im:price>
<im:image height="53">http://is1.mzstatic.com/image/thumb/Purple71/v4/9b/37/bf/9b37bf75-6b4d-9c95-a8a4-ea369f05ae7e/pr_source.png/53x53bb-85.png</im:image>
<im:image height="75">http://is5.mzstatic.com/image/thumb/Purple71/v4/9b/37/bf/9b37bf75-6b4d-9c95-a8a4-ea369f05ae7e/pr_source.png/75x75bb-85.png</im:image>
<im:image height="100">http://is5.mzstatic.com/image/thumb/Purple71/v4/9b/37/bf/9b37bf75-6b4d-9c95-a8a4-ea369f05ae7e/pr_source.png/100x100bb-85.png</im:image>
<rights>© One97 Communications Ltd</rights>
<im:releaseDate label="24 October 2011">2011-10-24T16:18:48-07:00</im:releaseDate>
<content type="html"></content>
</entry>
</feed>
I would like to extract the id information for each entry.
The attribute is named "im:id".
from xml.dom import minidom
xmldoc = minidom.parse('topIN.xml')
itemlist = xmldoc.getElementsByTagName('link')
print(len(itemlist))
print(itemlist[0].attributes.keys())
I get information:
1
[u'href', u'type', u'rel']
But when I do the same for id, nothing is returned.
Here is a version using xml.etree.ElementTree:
import xml.etree.ElementTree as ET
tree = ET.parse('topIN.xml')
root = tree.getroot()
ns={'im':"http://itunes.apple.com/rss", 'atom':"http://www.w3.org/2005/Atom"}
for id_ in root.findall('atom:entry/atom:id', ns):
    print(id_.attrib['{' + ns['im'] + '}id'])
Here is a version using lxml:
from lxml import etree
root=etree.parse('topIN.xml')
ns={'im':"http://itunes.apple.com/rss", 'atom':"http://www.w3.org/2005/Atom"}
print('\n'.join(root.xpath('atom:entry/atom:id/@im:id', namespaces=ns)))
This worked:
from xml.dom import minidom
xmldoc = minidom.parse('topIN.xml')
itemlist = xmldoc.getElementsByTagName('entry')
print(len(itemlist))
for s in itemlist:
    print(s.getElementsByTagName('id')[0].attributes['im:id'].value)
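The original attempt probably returned nothing because the first id element in the document is the feed-level one, which has no attributes. A namespace-aware variant of the minidom approach, sketched on a shortened feed (the entry URL is abbreviated for illustration, and the names IM_NS, ATOM_NS, and app_ids are illustrative):

```python
from xml.dom import minidom

# Shortened stand-in for topIN.xml; the entry URL is abbreviated.
feed_text = """<feed xmlns:im="http://itunes.apple.com/rss" xmlns="http://www.w3.org/2005/Atom">
  <id>https://itunes.apple.com/IN/rss/topfreeapplications/limit=200/xml</id>
  <entry>
    <id im:id="473941634" im:bundleId="com.one97.paytm">https://itunes.apple.com/in/app/id473941634</id>
  </entry>
</feed>"""

IM_NS = "http://itunes.apple.com/rss"
ATOM_NS = "http://www.w3.org/2005/Atom"

doc = minidom.parseString(feed_text)

app_ids = []
# Restrict the search to <entry> elements so the feed-level <id> is skipped.
for entry in doc.getElementsByTagNameNS(ATOM_NS, "entry"):
    id_node = entry.getElementsByTagNameNS(ATOM_NS, "id")[0]
    # Look the attribute up by namespace URI instead of the literal "im:" prefix.
    app_ids.append(id_node.getAttributeNS(IM_NS, "id"))
print(app_ids)
```

Looking attributes up by namespace URI keeps the code working even if the feed later declares the same namespace under a different prefix.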
