I want to read the text between the tags
<dc:title> </dc:title>
This is the XML:
<?xml version='1.0' encoding='utf-8'?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="calibre-uuid">
<metadata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:opf="http://www.idpf.org/2007/opf" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:calibre="http://calibre.kovidgoyal.net/2009/metadata" xmlns:dc="http://purl.org/dc/elements/1.1/">
<meta name="calibre:series_index" content="1"/>
<dc:language>UND</dc:language>
<dc:creator opf:file-as="Unbekannt" opf:role="aut">Johann Wolfgang von Goethe</dc:creator>
<meta name="calibre:timestamp" content="2009-10-08T07:26:21"/>
<dc:title>Faust_I_</dc:title>
<meta name="cover" content="cover"/>
<dc:date>2009-10-08T07:26:21</dc:date>
<dc:contributor opf:role="bkp">calibre (0.6.13) [http://calibre-ebook.com]</dc:contributor>
<dc:identifier id="calibre-uuid">urn:uuid:3cd4b26f-39a3-4783-9730-a86c26b30818</dc:identifier>
And that's my code:
from xml.etree import ElementTree as ET
tree = ET.parse('content.opf')
root = tree.getroot()
dc_namespace = "http://purl.org/dc/elements/1.1/"
print (root.attrib[ET.QName(dc_namespace, 'title')])
Output Error:
Traceback (most recent call last):
File "C:\Users\User\Documents\Visual Studio 2017\Projects\PythonApplication1\Modul1.py", line 8, in <module>
print (root.attrib[ET.QName(dc_namespace, 'title')])
KeyError: <QName '{xmlns:dc}title'>
What's wrong?
What you are looking for (<dc:title>) is an element, not an attribute. Here is how you can get its value:
from xml.etree import ElementTree as ET
tree = ET.parse('content.opf')
title = tree.find(".//{http://purl.org/dc/elements/1.1/}title")
print(title.text)
Output:
Faust_I_
Relevant references:
https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces
https://docs.python.org/3/library/xml.etree.elementtree.html#supported-xpath-syntax
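As a small sketch of the namespace-dictionary approach described in the first reference, the same lookup can also be written with a prefix map instead of the full {uri} form. The XML string below is a shortened, closed copy of the question's metadata, and the names opf and ns are only illustrative choices.

```python
from xml.etree import ElementTree as ET

# Shortened, well-formed stand-in for content.opf (assumed for illustration).
opf = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:language>UND</dc:language>
    <dc:title>Faust_I_</dc:title>
  </metadata>
</package>"""

root = ET.fromstring(opf)

# Map a local prefix to the Dublin Core namespace URI.
ns = {"dc": "http://purl.org/dc/elements/1.1/"}

# find() returns None when nothing matches, so guard before reading .text.
title = root.find(".//dc:title", ns)
title_text = title.text if title is not None else None
print(title_text)
```

The prefix used in the dictionary does not have to match the prefix in the file; only the namespace URI matters.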
You can use index notation:
root[number][number]
to access nested elements.
For example, in
<base>
<element1>
<element2>asdada</element2>
</element1>
</base>
root[0][0] gives you element2.
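A runnable sketch of the indexing idea, using the small example document as an inline string (the variable names are arbitrary):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<base><element1><element2>asdada</element2></element1></base>"
)

# Elements behave like lists of their children, so they can be indexed.
first_child = root[0]   # <element1>
nested = root[0][0]     # <element2>

print(first_child.tag)
print(nested.tag, nested.text)
```

Indexing depends on the exact order of the children, so it is probably best kept for documents whose layout is fixed; a lookup by tag name with find() is less fragile.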
I have some XML and I want to get the text of specific elements (XML tags) using Python.
I have tried a couple of solutions, but none of them worked.
import xml.etree.ElementTree as ET
tree = ET.fromstring(xml)
for node in tree.iter('Model'):
    print(node)
How can I do that?
XML:
<soap:Envelope
xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<GetVehicleLimitedInfoResponse
xmlns="http://schemas.conversesolutions.com/xsd/dmticta/v1">
<return>
<ResponseMessage xsi:nil="true" />
<ErrorCode xsi:nil="true" />
<RequestId> 2012290007705 </RequestId>
<TransactionCharge>150</TransactionCharge>
<VehicleNumber>GF-0176</VehicleNumber>
<AbsoluteOwner>SIYAPATHA FINANCE PLC</AbsoluteOwner>
<EngineNo>GA15-483936F</EngineNo>
<ClassOfVehicle>MOTOR CAR</ClassOfVehicle>
<Make>NISSAN</Make>
<Model>PULSAR</Model>
<YearOfManufacture>1998</YearOfManufacture>
<NoOfSpecialConditions>0</NoOfSpecialConditions>
<SpecialConditions xsi:nil="true" />
</return>
</GetVehicleLimitedInfoResponse>
</soap:Body>
</soap:Envelope>
Edited and improved answer:
import xml.etree.ElementTree as ET
import re
ns = {"veh": "http://schemas.conversesolutions.com/xsd/dmticta/v1"}
tree = ET.parse('test.xml') # save your xml as test.xml
root = tree.getroot()
def get_tag_name(tag):
    # Strip the leading "{namespace}" part from a qualified tag name.
    return re.sub(r'\{.*\}', '', tag)
for node in root.find(".//veh:return", ns):
    print(get_tag_name(node.tag)+': ', node.text)
It should produce something like this:
ResponseMessage: None
ErrorCode: None
RequestId: 2012290007705
TransactionCharge: 150
VehicleNumber: GF-0176
AbsoluteOwner: SIYAPATHA FINANCE PLC
EngineNo: GA15-483936F
ClassOfVehicle: MOTOR CAR
Make: NISSAN
Model: PULSAR
YearOfManufacture: 1998
NoOfSpecialConditions: 0
SpecialConditions: None
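If only a single field such as Model is needed, as in the question, a namespaced find() on that tag should be enough. The original tree.iter('Model') matched nothing because the element's full tag includes the default namespace of GetVehicleLimitedInfoResponse. A minimal sketch, assuming a trimmed copy of the SOAP response; the veh prefix is an arbitrary local name.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the SOAP response in the question.
xml = """<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetVehicleLimitedInfoResponse xmlns="http://schemas.conversesolutions.com/xsd/dmticta/v1">
      <return>
        <Make>NISSAN</Make>
        <Model>PULSAR</Model>
      </return>
    </GetVehicleLimitedInfoResponse>
  </soap:Body>
</soap:Envelope>"""

ns = {"veh": "http://schemas.conversesolutions.com/xsd/dmticta/v1"}
root = ET.fromstring(xml)

# Look the element up by its namespaced name and guard against a miss.
model = root.find(".//veh:Model", ns)
model_text = model.text.strip() if model is not None and model.text else None
print(model_text)
```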
I'm a beginner in Python and struggling to understand why, while iterating through a dictionary obtained from an XML file, I get an error when I look up the required keys. I should also mention that I still get the result I want, but I keep receiving an error at the end.
import os
import shelve
import xml.etree.ElementTree as et
shelfFile = shelve.open('xml_data')
base_path = os.path.dirname(os.path.realpath(__file__))
xml_file = os.path.join(base_path, "data\\nrm_icg_catalog.xml")
tree = et.parse(xml_file)
root = tree.getroot()
elements = ['Name','WBS']
for child in root:
    for itemGroup1 in child:
        for item in elements:
            print(itemGroup1.attrib[item])
Result with Error Message:
Facilitating Works
0
Substructure
1
Superstructure
2
Internal Finish
3
Fittings
4
Services
5
Prefabs
6
Works to Existing Building
7
External Works
8
MC Prelims
9
MC OH and P
10
Traceback (most recent call last):
File "c:/Users/Dodzi Agbenorku/OneDrive/Training Files/Programming Lessons/Python/xmlExcel/app.py", line 22, in <module>
print(itemGroup1.attrib[item])
KeyError: 'Name'
Here's a small section of the xml file I am using:
<?xml version="1.0" encoding="utf-8"?>
<Takeoff xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="http://download.autodesk.com/us/navisworks/schemas/nw-TakeoffCatalog-10.0.xsd">
<Catalog>
<ItemGroup Name="Facilitating Works" WBS="0" CatalogId="32b4ab2d-6fe8-4c45-9872-c8ea68c0c4de">
<ItemGroup Name="Hazardous Materials" WBS="1" CatalogId="2fdb6bd1-b2d1-4167-a74d-da818183a156">
<ItemGroup Name="Material Removal" WBS="1" CatalogId="ccc8a515-4152-400c-a72e-6fd78561325e">
<Item Name="Material Details" WBS="1" Transparency="0.3" Color="-15161029" LineThickness="0.1" CatalogId="c0a7de26-6bc3-491e-b3c7-3ff5b560eeaf">
<VariableCollection>
<Variable Name="Length" Formula="=ModelLength" Units="Meter" />
<Variable Name="Width" Formula="=ModelWidth" Units="Meter" />
<Variable Name="Thickness" Formula="=ModelThickness" Units="Meter" />
<Variable Name="Height" Formula="=ModelHeight" Units="Meter" />
<Variable Name="Perimeter" Formula="=ModelPerimeter" Units="Meter" />
<Variable Name="Area" Formula="=ModelArea" Units="SquareMeter" />
Any help will be extremely appreciated.
The error occurs because for some nodes, Name is not in attrib which is a dictionary.
Instead of itemGroup1.attrib[item], use itemGroup1.attrib.get(item). It will return the value None if the key does not exist, and it will not throw an error.
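A minimal sketch of the .get() approach, assuming a tiny stand-in catalog in which one child deliberately lacks the attributes (the element named Other and the list named rows are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for the catalog; the second child has no Name or WBS.
catalog = ET.fromstring(
    '<Catalog>'
    '<ItemGroup Name="Substructure" WBS="1"/>'
    '<Other Id="x"/>'
    '</Catalog>'
)

wanted = ["Name", "WBS"]
rows = []
for node in catalog:
    # .get() returns None instead of raising KeyError for a missing key.
    values = [node.attrib.get(key) for key in wanted]
    if None in values:
        continue  # skip nodes that lack one of the attributes
    rows.append(values)

print(rows)
```

Whether to skip such nodes or print a placeholder is a design choice; .get(key, default) accepts a fallback value as its second argument.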
How can I parse an unstructured XML file? I need to get the data inside the patient tag and the title using ElementTree.
<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">
<templateId root="2.16.840.1.113883.10.20.22.1.1"/>
<id extension="4b78219a-1d02-4e7c-9870-dc7ce3b8a8fb" root="1.2.840.113619.21.1.3214775361124994304.5.1"/>
<code code="34133-9" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="Summarization of episode note"/>
<title>Summary</title>
<effectiveTime value="20170919160921ddfdsdsdsd31-0400"/>
<confidentialityCode code="N" codeSystem="2.16.840.dwdwddsd1.113883.5.25"/>
<recordTarget>
<patientRole><id extension="0" root="1.2.840.113619.21.1.3214775361124994304.2.1.1.2"/>
<addr use="HP"><streetAddressLine>addd2 </streetAddressLine><city>fgfgrtt</city><state>tr</state><postalCode>121213434</postalCode><country>rere</country></addr>
<patient>
<name><given>fname</given><family>lname</family></name>
<administrativeGenderCode code="F" codeSystem="2.16.840.1.113883.5.1" displayName="Female"/>
<birthTime value="19501025"/>
<maritalStatusCode code="M" codeSystem="2434.16.840.1.143434313883.5.2" displayName="M"/>
<languageCommunication>
<languageCode code="eng"/>
<proficiencyLevelCode nullFlavor="NI"/>
<preferenceInd value="true"/>
</languageCommunication>
</patient>
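I want the given name, family name, gender and title.
You can use BeautifulSoup (bs4) with the lxml parser to extract the data. Note that the "lxml" parser treats the input as HTML and lowercases tag and attribute names, which is why the lookups below use administrativegendercode and displayname.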
from bs4 import BeautifulSoup
xml_data = '''<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">
<templateId root="2.16.840.1.113883.10.20.22.1.1"/>
<id extension="4b78219a-1d02-4e7c-9870-dc7ce3b8a8fb" root="1.2.840.113619.21.1.3214775361124994304.5.1"/>
<code code="34133-9" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="Summarization of episode note"/>
<title>Summary</title>
<effectiveTime value="20170919160921ddfdsdsdsd31-0400"/>
<confidentialityCode code="N" codeSystem="2.16.840.dwdwddsd1.113883.5.25"/>
<recordTarget>
<patientRole><id extension="0" root="1.2.840.113619.21.1.3214775361124994304.2.1.1.2"/>
<addr use="HP"><streetAddressLine>addd2 </streetAddressLine><city>fgfgrtt</city><state>tr</state><postalCode>121213434</postalCode><country>rere</country></addr>
<patient>
<name><given>fname</given><family>lname</family></name>
<administrativeGenderCode code="F" codeSystem="2.16.840.1.113883.5.1" displayName="Female"/>
<birthTime value="19501025"/>
<maritalStatusCode code="M" codeSystem="2434.16.840.1.143434313883.5.2" displayName="M"/>
<languageCommunication>
<languageCode code="eng"/>
<proficiencyLevelCode nullFlavor="NI"/>
<preferenceInd value="true"/>
</languageCommunication>
</patient>'''
soup = BeautifulSoup(xml_data, "lxml")
title = soup.find("title")
print(title.text.strip())
patient = soup.find("patient")
given = patient.find("given").text.strip()
family = patient.find("family").text.strip()
gender = patient.find("administrativegendercode")['displayname'].strip()
print(given)
print(family)
print(gender)
Output:
Summary
fname
lname
Female
Install library dependency:
pip3 install beautifulsoup4==4.7.1
pip3 install lxml==4.3.3
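Or you can simply use lxml. Here is the tutorial that I used: https://lxml.de/tutorial.html
The elements are in the urn:hl7-org:v3 default namespace, so the lookups need a namespace map. It should be similar to:
from lxml import etree
# "document.xml" is an assumed file name for a complete, well-formed copy of the XML.
root = etree.parse("document.xml").getroot()
ns = {"hl7": "urn:hl7-org:v3"}
patient = root.find(".//hl7:patient", namespaces=ns)
print(patient.find("hl7:name/hl7:given", namespaces=ns).text)
print(patient.find("hl7:name/hl7:family", namespaces=ns).text)
print(patient.find("hl7:administrativeGenderCode", namespaces=ns).get("displayName"))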
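Since the question asks for ElementTree specifically, here is a sketch with the standard library only. It assumes a shortened, properly closed copy of the document (the snippet in the question is cut off before its closing tags), and the hl7 prefix is just a local label for the urn:hl7-org:v3 namespace.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the clinical document.
doc = """<ClinicalDocument xmlns="urn:hl7-org:v3">
  <title>Summary</title>
  <recordTarget>
    <patientRole>
      <patient>
        <name><given>fname</given><family>lname</family></name>
        <administrativeGenderCode code="F" displayName="Female"/>
      </patient>
    </patientRole>
  </recordTarget>
</ClinicalDocument>"""

ns = {"hl7": "urn:hl7-org:v3"}
root = ET.fromstring(doc)

# findtext() returns the element text, or None when there is no match.
title = root.findtext("hl7:title", namespaces=ns)
given = root.findtext(".//hl7:patient/hl7:name/hl7:given", namespaces=ns)
family = root.findtext(".//hl7:patient/hl7:name/hl7:family", namespaces=ns)

# The gender is stored in an attribute, so fetch the element first.
gender_el = root.find(".//hl7:patient/hl7:administrativeGenderCode", ns)
gender = gender_el.get("displayName") if gender_el is not None else None

print(title, given, family, gender)
```

Unlike the BeautifulSoup version, ElementTree keeps the original case of tag and attribute names.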
I have a large XML file and I want to extract some tags and write them in another xml file. I wrote this code:
import xml.etree.cElementTree as CE
tree = CE.ElementTree()
root = CE.Element("root")
i = 0
for event, elem in CE.iterparse('data.xml'):
    if elem.tag == "ActivityRef":
        print(elem.tag)
        a = CE.Element(elem.tag)
        root.append(elem)
        elem.clear()
        i += 1
    if i == 200:
        break
But I don't get the desired result, I got this:
<root>
<ActivityRef />
<ActivityRef />
<ActivityRef />
<ActivityRef />
...
</root>
instead of this:
<root>
<ActivityRef>
<Id>2008-12-11T20:43:07Z</Id>
</ActivityRef>
<ActivityRef>
<Id>2008-10-11T20:43:07Z</Id>
</ActivityRef>
...
</root>
Edit
Input file:
<?xml version="1.0" encoding="UTF-8"?>
<Folders>
<History>
<Running>
<ActivityRef>
<Id>2009-03-14T17:05:55Z</Id>
</ActivityRef>
<ActivityRef>
<Id>2009-03-13T06:12:42Z</Id>
</ActivityRef>
<ActivityRef>
<Id>2009-03-08T09:00:29Z</Id>
</ActivityRef>
<ActivityRef>
<Id>2009-03-04T19:39:39Z</Id>
</ActivityRef>
...
</Running>
</History>
</Folders>
And also I need to remove the element from the source file.
Use XPath:
import xml.etree.ElementTree as ET
data = '''<?xml version="1.0" encoding="UTF-8"?>
<Folders>
<History>
<Running>
<ActivityRef>
<Id>2009-03-14T17:05:55Z</Id>
</ActivityRef>
<ActivityRef>
<Id>2009-03-13T06:12:42Z</Id>
</ActivityRef>
<ActivityRef>
<Id>2009-03-08T09:00:29Z</Id>
</ActivityRef>
<ActivityRef>
<Id>2009-03-04T19:39:39Z</Id>
</ActivityRef>
</Running>
</History>
</Folders>'''
root = ET.fromstring(data)
# 'activities' contains the elements you are looking for
activities = root.findall('.//ActivityRef')
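The empty elements in the question most likely come from calling elem.clear() on the same object that was just appended to the new root, which wipes its children in both places. Below is a sketch, under the assumption that the file fits in memory, of copying the matches into a new tree and removing them from the source. The names source and extracted are illustrative only.

```python
import copy
import xml.etree.ElementTree as ET

# Small in-memory stand-in for data.xml.
source = ET.fromstring(
    "<Folders><History><Running>"
    "<ActivityRef><Id>2009-03-14T17:05:55Z</Id></ActivityRef>"
    "<ActivityRef><Id>2009-03-13T06:12:42Z</Id></ActivityRef>"
    "</Running></History></Folders>"
)

extracted = ET.Element("root")

# ElementTree elements have no parent pointer, so walk every element
# and inspect its direct children. Snapshots (list(...)) make it safe
# to remove children while looping.
for parent in list(source.iter()):
    for child in list(parent):
        if child.tag == "ActivityRef":
            extracted.append(copy.deepcopy(child))  # keep an independent copy
            parent.remove(child)                    # drop it from the source

extracted_xml = ET.tostring(extracted, encoding="unicode")
remaining_xml = ET.tostring(source, encoding="unicode")
print(extracted_xml)
print(remaining_xml)
```

Both trees can then be saved with ET.ElementTree(extracted).write(...) and ET.ElementTree(source).write(...), using whatever file names suit. For a file too large to load at once, iterparse would still be the tool of choice, with the copy taken before any call to clear().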
I'm using the following XML:
<feed xmlns:im="http://itunes.apple.com/rss" xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
<id>
https://itunes.apple.com/IN/rss/topfreeapplications/limit=200/xml
</id>
<title>iTunes Store: Top Free Apps</title>
<updated>2016-12-05T12:37:06-07:00</updated>
<link rel="alternate" type="text/html" href="https://itunes.apple.com/WebObjects/MZStore.woa/wa/viewTop?cc=in&amp;id=134581&amp;popId=27"/>
<link rel="self" href="https://itunes.apple.com/IN/rss/topfreeapplications/limit=200/xml"/>
<icon>http://itunes.apple.com/favicon.ico</icon>
<author>
<name>iTunes Store</name>
<uri>http://www.apple.com/uk/itunes/</uri>
</author>
<rights>Copyright 2008 Apple Inc.</rights>
<entry>
<updated>2016-12-05T12:37:06-07:00</updated>
<id im:id="473941634" im:bundleId="com.one97.paytm">https://itunes.apple.com/in/app/recharge-bill-payment-wallet/id473941634?mt=8&amp;uo=2</id>
<title>Recharge, Bill Payment &amp; Wallet - Paytm Mobile Solutions</title>
<summary></summary>
<im:name>Recharge, Bill Payment &amp; Wallet</im:name>
<link rel="alternate" type="text/html" href="https://itunes.apple.com/in/app/recharge-bill-payment-wallet/id473941634?mt=8&amp;uo=2"/>
<im:contentType term="Application" label="Application"/>
<category im:id="6024" term="Shopping" scheme="https://itunes.apple.com/in/genre/ios-shopping/id6024?mt=8&amp;uo=2" label="Shopping"/>
<im:artist href="https://itunes.apple.com/in/developer/paytm-mobile-solutions/id473941637?mt=8&amp;uo=2">Paytm Mobile Solutions</im:artist>
<im:price amount="0.00000" currency="INR">Get</im:price>
<im:image height="53">http://is1.mzstatic.com/image/thumb/Purple71/v4/9b/37/bf/9b37bf75-6b4d-9c95-a8a4-ea369f05ae7e/pr_source.png/53x53bb-85.png</im:image>
<im:image height="75">http://is5.mzstatic.com/image/thumb/Purple71/v4/9b/37/bf/9b37bf75-6b4d-9c95-a8a4-ea369f05ae7e/pr_source.png/75x75bb-85.png</im:image>
<im:image height="100">http://is5.mzstatic.com/image/thumb/Purple71/v4/9b/37/bf/9b37bf75-6b4d-9c95-a8a4-ea369f05ae7e/pr_source.png/100x100bb-85.png</im:image>
<rights>© One97 Communications Ltd</rights>
<im:releaseDate label="24 October 2011">2011-10-24T16:18:48-07:00</im:releaseDate>
<content type="html"></content>
</entry>
</feed>
I would like to extract the id information for each entry.
The attribute I need is "im:id". This is what I tried:
from xml.dom import minidom
xmldoc = minidom.parse('topIN.xml')
itemlist = xmldoc.getElementsByTagName('link')
print(len(itemlist))
print(itemlist[0].attributes.keys())
I get information:
1
[u'href', u'type', u'rel']
But when I do the same for id, nothing is returned.
Here is a version using xml.etree.ElementTree:
import xml.etree.ElementTree as ET
tree = ET.parse('topIN.xml')
root = tree.getroot()
ns={'im':"http://itunes.apple.com/rss", 'atom':"http://www.w3.org/2005/Atom"}
for id_ in root.findall('atom:entry/atom:id', ns):
    print(id_.attrib['{' + ns['im'] + '}id'])
Here is a version using lxml:
from lxml import etree
root=etree.parse('topIN.xml')
ns={'im':"http://itunes.apple.com/rss", 'atom':"http://www.w3.org/2005/Atom"}
print('\n'.join(root.xpath('atom:entry/atom:id/@im:id', namespaces=ns)))
This worked:
from xml.dom import minidom
xmldoc = minidom.parse('topIN.xml')
itemlist = xmldoc.getElementsByTagName('entry')
print(len(itemlist))
for s in itemlist:
    print(s.getElementsByTagName('id')[0].attributes['im:id'].value)
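The first attempt probably returned nothing because the first id element in the document is the feed-level one, which carries no attributes; only the id inside each entry has im:id. As a variation on the loop above, the attribute can also be read by namespace URI rather than by its literal prefix. This is a sketch on a shortened inline feed whose entry text is a placeholder, and IM_NS and app_ids are illustrative names.

```python
from xml.dom import minidom

# Shortened stand-in for topIN.xml; the entry text is a placeholder.
feed = """<feed xmlns:im="http://itunes.apple.com/rss" xmlns="http://www.w3.org/2005/Atom">
  <id>https://itunes.apple.com/IN/rss/topfreeapplications/limit=200/xml</id>
  <entry>
    <id im:id="473941634" im:bundleId="com.one97.paytm">entry-url-placeholder</id>
  </entry>
</feed>"""

IM_NS = "http://itunes.apple.com/rss"
doc = minidom.parseString(feed)

app_ids = []
for entry in doc.getElementsByTagName("entry"):
    # Take the <id> that sits inside the entry, not the feed-level one.
    id_node = entry.getElementsByTagName("id")[0]
    # Look the attribute up by namespace URI and local name.
    app_ids.append(id_node.getAttributeNS(IM_NS, "id"))

print(app_ids)
```

Reading by namespace URI keeps working if the feed ever declares the same namespace under a different prefix.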