I have an XML file downloaded from WordPress that is structured like this:
<wp:postmeta>
<wp:meta_key><![CDATA[country]]></wp:meta_key>
<wp:meta_value><![CDATA[Germany]]></wp:meta_value>
</wp:postmeta>
My goal is to look through the XML file for all the country keys and print their values. I'm completely new to the XML library, so I'm not sure where to take it from here.
# load libraries
# importing os to handle directory functions
import os
# import XML handlers
from xml.etree import ElementTree
# importing json to handle structured data saving
import json
# dictionary with namespaces
ns = {'wp:meta_key', 'wp:meta_value'}
tree = ElementTree.parse('/var/www/python/file.xml')
root = tree.getroot()
# item
for item in root.findall('wp:post_meta', ns):
    print '- ', item.text
print "Finished running"
This throws an error about using wp as a namespace, but I'm not sure where to go from here; the documentation is unclear to me. Any help is appreciated.
Downvoters please let me know how I can improve my question.
I don't know XML, but I can treat it as a string like this.
from simplified_scrapy import SimplifiedDoc, req, utils
xml = '''
<wp:postmeta>
<wp:meta_key><![CDATA[country]]></wp:meta_key>
<wp:meta_value><![CDATA[Germany]]></wp:meta_value>
</wp:postmeta>
'''
doc = SimplifiedDoc(xml)
kvs = doc.select('wp:postmeta').selects('wp:meta_key|wp:meta_value').html
print (kvs)
Result:
['<![CDATA[country]]>', '<![CDATA[Germany]]>']
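For comparison, the same lookup can be sketched with the standard library's ElementTree, which wants the namespaces argument to be a dictionary mapping each prefix to its URI. This is only a sketch: the URI below is the one WXR 1.2 exports normally declare and is an assumption here, so check the xmlns:wp attribute at the top of your own file. The SAMPLE string and the meta_values helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the export declares xmlns:wp with this URI (WXR 1.2).
# Copy the real value from the root element of your own file.
NS = {"wp": "http://wordpress.org/export/1.2/"}

# Hypothetical, cut-down export used only to make the sketch runnable.
SAMPLE = """<rss xmlns:wp="http://wordpress.org/export/1.2/">
  <channel><item>
    <wp:postmeta>
      <wp:meta_key><![CDATA[country]]></wp:meta_key>
      <wp:meta_value><![CDATA[Germany]]></wp:meta_value>
    </wp:postmeta>
    <wp:postmeta>
      <wp:meta_key><![CDATA[city]]></wp:meta_key>
      <wp:meta_value><![CDATA[Berlin]]></wp:meta_value>
    </wp:postmeta>
  </item></channel>
</rss>"""


def meta_values(root, key):
    """Return the meta_value text of every postmeta whose meta_key equals key."""
    values = []
    # iter() needs the fully qualified {uri}name form of the tag.
    for meta in root.iter("{%s}postmeta" % NS["wp"]):
        if meta.findtext("wp:meta_key", namespaces=NS) == key:
            values.append(meta.findtext("wp:meta_value", namespaces=NS))
    return values


countries = meta_values(ET.fromstring(SAMPLE), "country")
print(countries)
```

The parser unwraps the CDATA sections, so the values come back as plain text rather than as '<![CDATA[...]]>' strings.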
I am reading an XML file and converting it to a DataFrame using xmltodict and pandas.
This is what one of the elements in the file looks like:
<net>
<ref>https://whois.arin.net/rest/v1/net/NET-66-125-37-120-1</ref>
<endAddress>66.125.37.127</endAddress>
<handle>NET-66-125-37-120-1</handle>
<name>SBC066125037120020307</name>
<netBlocks>
<netBlock>
<cidrLenth>29</cidrLenth>
<endAddress>066.125.037.127</endAddress>
<type>S</type>
<startAddress>066.125.037.120</startAddress>
</netBlock>
</netBlocks>
<pocLinks/>
<orgHandle>C00285134</orgHandle>
<parentNetHandle>NET-66-120-0-0-1</parentNetHandle>
<registrationDate>2002-03-08T00:00:00-05:00</registrationDate>
<startAddress>66.125.37.120</startAddress>
<updateDate>2002-03-08T07:56:59-05:00</updateDate>
<version>4</version>
</net>
Since a large number of records like this are pulled in through an API, some <net> objects at the end of the file are sometimes only partially downloaded,
for example with one tag missing its closing tag.
This is what I wrote to parse the XML:
import xmltodict
with open('/Users/dgoswami/Downloads/net.xml', 'r') as f:  # Read data
    xml_data = f.read()
xml_data = xmltodict.parse(xml_data,
                           process_namespaces=True,
                           namespaces={'http://www.arin.net/bulkwhois/core/v1': None})
When that happens, I get an error like this:
no element found: line 30574438, column 37
I want to be able to parse till the last valid <net> element.
How can that be done?
You may need to repair your XML beforehand; xmltodict cannot do that for you.
You can use lxml, as described in "Python xml - handle unclosed token", to fix your XML:
from lxml import etree
def fixme(x):
    p = etree.fromstring(x, parser = etree.XMLParser(recover=True))
    return etree.tostring(p).decode("utf8")
fixed = fixme("""<start><net>
<endAddress>66.125.37.127</endAddress>
<handle>NET-66-125-37-120-1</handle>
</net><net>
<endAddress>66.125.37.227</endAddress>
<handle>NET-66-125-37-220-1</handle>
""")
and then use the fixed xml:
import xmltodict
print(xmltodict.parse(fixed))
to get
OrderedDict([('start',
OrderedDict([('net', [
OrderedDict([('endAddress', '66.125.37.127'), ('handle', 'NET-66-125-37-120-1')]),
OrderedDict([('endAddress', '66.125.37.227'), ('handle', 'NET-66-125-37-220-1')])
])
]))
])
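If the goal is strictly to stop at the last valid <net> element rather than to repair the tail, a different technique is to stream the file with the standard library's iterparse and catch the ParseError raised at the truncation point. The sketch below is a swapped-in alternative to the lxml recovery above, not part of it; complete_nets is a hypothetical helper, and it flattens only the direct children of each <net>, so nested blocks such as <netBlocks> would need extra handling.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a leading {namespace-uri} from a tag, if present."""
    return tag.rsplit("}", 1)[-1]


def complete_nets(source):
    """Collect every fully closed <net> element, stopping at the first parse error."""
    records = []
    try:
        # An "end" event fires only once an element's closing tag has been read.
        for _event, elem in ET.iterparse(source, events=("end",)):
            if local_name(elem.tag) == "net":
                records.append({local_name(c.tag): c.text for c in elem})
                elem.clear()  # release children of the finished record
    except ET.ParseError:
        # The truncated tail is dropped; records gathered so far are kept.
        pass
    return records


# Same truncated input as in the answer above, cut off inside the second <net>.
truncated = b"""<start><net>
<endAddress>66.125.37.127</endAddress>
<handle>NET-66-125-37-120-1</handle>
</net><net>
<endAddress>66.125.37.227</endAddress>
<handle>NET-66-125-37-220-1"""

nets = complete_nets(io.BytesIO(truncated))
print(nets)
```

Unlike recover=True, which also salvages the partial second record, this keeps only the records that were closed properly, and it never holds the whole document in memory.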
I am a beginner in coding and have the question below. I would appreciate any help.
I have the Python code below, which requests information about an organization.
Note: the commented-out "target" variable is for future use, when I pass the user input from PHP to this Python script.
import requests, sys
#target = sys.argv[1]
target = "logitech"
request = requests.get('http://whois.arin.net/rest/nets;name={}'.format(target))
print(request.text)
The output is similar to this, but the number of "netRef" tags may vary depending on the organization.
<?xml version='1.0'?><?xml-stylesheet type='text/xsl' href='http://whois.arin.net/xsl/website.xsl' ?><nets xmlns="http://www.arin.net/whoisrws/core/v1" xmlns:ns2="http://www.arin.net/whoisrws/rdns/v1" xmlns:ns3="http://www.arin.net/whoisrws/netref/v2" copyrightNotice="Copyright 1997-2020, American Registry for Internet Numbers, Ltd." inaccuracyReportUrl="https://www.arin.net/resources/registry/whois/inaccuracy_reporting/" termsOfUse="https://www.arin.net/resources/registry/whois/tou/"><limitExceeded limit="256">false</limitExceeded>
<netRef endAddress="173.8.217.111" startAddress="173.8.217.96" handle="NET-173-8-217-96-1" name="LOGITECH">https://whois.arin.net/rest/net/NET-173-8-217-96-1</netRef>
<netRef endAddress="50.193.49.47" startAddress="50.193.49.32" handle="NET-50-193-49-32-1" name="LOGITECH">https://whois.arin.net/rest/net/NET-50-193-49-32-1</netRef></nets>
I was wondering, is it possible to display only the endAddress and startAddress attributes in PHP?
I've tried using the xml.etree.ElementTree module, but because the request variable is a Response object rather than bytes, I can't parse the XML directly into an element.
My PHP code currently looks like this, as I am unsure how to proceed. testapi.py refers to the Python code above.
<?php
$output1 = shell_exec('python testapi.py');
echo $output1;
?>
My desired output on the PHP side is as follows:
IP range: 173.8.217.96-173.8.217.111, 50.193.49.32-50.193.49.47
I would gladly appreciate any help, Thank You.
Python's ElementTree provides the fromstring method to parse an XML tree from text (here, the text attribute of the response). From there, you can query the content; be sure to assign a prefix to the default namespace declared in the XML:
xmlns="http://www.arin.net/whoisrws/core/v1"
import requests as rq
import xml.etree.ElementTree as ET
request = rq.get('http://whois.arin.net/rest/nets;name=logitech')
tree = ET.fromstring(request.text)
nmsp = {"doc": "http://www.arin.net/whoisrws/core/v1"}
for elem in tree.findall(".//doc:netRef", nmsp):
    print(f"endAddress: {elem.attrib['endAddress']}")
    print(f"startAddress: {elem.attrib['startAddress']}")
    print("---------------------------\n")
# endAddress: 173.8.217.111
# startAddress: 173.8.217.96
# ---------------------------
# endAddress: 50.193.49.47
# startAddress: 50.193.49.32
# ---------------------------
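To get the single "IP range: ..." line asked for on the PHP side, one option is to have the Python script print exactly that line, so that shell_exec plus echo needs no further parsing. The sketch below works on a trimmed copy of the response shown in the question instead of a live request; ip_range_line is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Prefix chosen locally for the document's default namespace.
NS = {"doc": "http://www.arin.net/whoisrws/core/v1"}

# Trimmed copy of the response from the question, used instead of a live request.
SAMPLE = """<nets xmlns="http://www.arin.net/whoisrws/core/v1">
<netRef endAddress="173.8.217.111" startAddress="173.8.217.96" handle="NET-173-8-217-96-1" name="LOGITECH">https://whois.arin.net/rest/net/NET-173-8-217-96-1</netRef>
<netRef endAddress="50.193.49.47" startAddress="50.193.49.32" handle="NET-50-193-49-32-1" name="LOGITECH">https://whois.arin.net/rest/net/NET-50-193-49-32-1</netRef>
</nets>"""


def ip_range_line(xml_text):
    """Format every netRef as start-end and join the ranges on one line."""
    root = ET.fromstring(xml_text)
    ranges = [
        f"{ref.get('startAddress')}-{ref.get('endAddress')}"
        for ref in root.findall("doc:netRef", NS)
    ]
    return "IP range: " + ", ".join(ranges)


line = ip_range_line(SAMPLE)
print(line)
# IP range: 173.8.217.96-173.8.217.111, 50.193.49.32-50.193.49.47
```

In the real script, the SAMPLE string would be replaced by the text of the response returned by requests.get.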
I'm trying to get a full list of XPaths from a device config in XML.
When I run it, though, I get:
AttributeError: 'Element' object has no attribute 'getpath'
The code is just a few lines:
import xml.etree.ElementTree
import os
from lxml import etree
file1 = 'C:\Users\test1\Desktop\test.xml'
file1_path = file1.replace('\\','/')
e = xml.etree.ElementTree.parse(file1_path).getroot()
for entry in e.iter():
    print e.getpath(entry)
Has anyone come across this before?
Thanks
Richie
getpath is a method of lxml's ElementTree object, not of an Element, and the standard library's xml.etree.ElementTree does not provide it at all. Don't call getroot; just parse with lxml.etree and iterate over the tree:
import lxml.etree as et
file1 = 'C:/Users/test1/Desktop/test.xml'
tree = et.parse(file1)
for e in tree.iter():
    print(tree.getpath(e))
If you are dealing with namespaces, you may find getelementpath useful:
tree.getelementpath(e)
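If lxml is not available, similar paths can be built by hand with the standard library. The sketch below is a swapped-in technique, not lxml's implementation: iter_paths is a hypothetical helper that walks the tree recursively and adds a positional index only where siblings share a tag. The config snippet is invented for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def iter_paths(elem, path=None):
    """Yield a path for elem and each descendant, in document order."""
    path = path or "/" + elem.tag
    yield path
    totals = Counter(child.tag for child in elem)  # siblings per tag
    seen = Counter()
    for child in elem:
        seen[child.tag] += 1
        step = child.tag
        if totals[child.tag] > 1:
            # Disambiguate repeated siblings with a 1-based index.
            step += f"[{seen[child.tag]}]"
        yield from iter_paths(child, f"{path}/{step}")


# Hypothetical device config used only for the demonstration.
root = ET.fromstring(
    "<config><iface><name>eth0</name></iface>"
    "<iface><name>eth1</name></iface><host>r1</host></config>"
)
paths = list(iter_paths(root))
for p in paths:
    print(p)
```

Namespaced tags would appear in their {uri}name form here, so for namespaced documents lxml's own getpath or getelementpath is likely the more convenient choice.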
I am working on CityGML data right now and trying to parse CityGML in Python.
To do so, I use ElementTree, which works fine with other XML files. But whenever I try to parse the CityGML file, I don't get any results.
For example, I want to get a list of all child tags named "creationDate" in the CityGML file. Here is the code:
import xml.etree.ElementTree as ET
tree = ET.parse('Gasometer.xml')
root = tree.getroot()
def child_list(child):
    list_child = list(tree.iter(child))
    return list_child
date = child_list('creationDate')
print (date)
I only get an empty list [].
Here is the very first part of the CityGML file (you can find the "creationDate" tag at the end):
<?xml version="1.0" encoding="UTF-8"?>
<CityModel>
<cityObjectMember>
<bldg:Building gml:id="UUID_899cac3f-e0b6-41e6-ae30-a91ce51d6d95">
<gml:description>Wohnblock in geschlossener Bauweise</gml:description>
<gml:boundedBy>
<gml:Envelope srsName="urn:ogc:def:crs,crs:EPSG::3068,crs:EPSG::5783" srsDimension="3">
<gml:lowerCorner>21549.6537889055 17204.3479916992 38.939998626709</gml:lowerCorner>
<gml:upperCorner>21570.6420902953 17225.660050148 60.6840192923434</gml:upperCorner>
</gml:Envelope>
</gml:boundedBy>
<creationDate>2014-03-28</creationDate>
This happens not only when I try to get lists of child tags; I can't print any attributes or tag names either. It looks like the way I parse the file is wrong. I hope somebody can help me with my problem and tell me what I should do! Thanks!
Since this is an old post I'll just leave this here in case someone else might need it.
To parse CityGML, try the following code; it should give a general idea of how to fetch the information.
import xml.etree.ElementTree as ET
def loadfile():
    tree = ET.parse('filename')
    root = tree.getroot()
    # Namespaced tags are matched in their {namespace-uri}name form
    for envelope in root.iter('{http://www.opengis.net/gml}Envelope'):
        print("ENV tag", envelope.tag)
        print("ENV attrib", envelope.attrib)
        print("ENV text", envelope.text)
        lCorner = envelope.find('{http://www.opengis.net/gml}lowerCorner').text
        uCorner = envelope.find('{http://www.opengis.net/gml}upperCorner').text
        print("lC", lCorner)
        print("uC", uCorner)
if __name__ == "__main__":
    loadfile()
To get the srsName, try the following:
import xml.etree.ElementTree as ET
def loadfile():
    tree = ET.parse('filename')
    root = tree.getroot()
    for envelope in root.iter('{http://www.opengis.net/gml}Envelope'):
        key = envelope.attrib
        srsName = key.get('srsName')
        print("SRS Name: ", srsName)
if __name__ == "__main__":
    loadfile()
I hope this helps you or anyone else who might try parsing CityGML with ElementTree.
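The empty list in the question is most likely a namespace effect: when the root element declares a default namespace, ElementTree stores every unprefixed tag as {uri}creationDate, so iter('creationDate') matches nothing. A minimal sketch that sidesteps this by comparing local names is shown below. The namespace URIs in SAMPLE are assumptions (the excerpt in the question omits the declarations), and local_name and find_by_local_name are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# Assumption: CityGML 2.0 namespace URIs; the real file may declare others.
SAMPLE = """<CityModel xmlns="http://www.opengis.net/citygml/2.0"
    xmlns:bldg="http://www.opengis.net/citygml/building/2.0"
    xmlns:gml="http://www.opengis.net/gml">
  <cityObjectMember>
    <bldg:Building gml:id="UUID_899cac3f-e0b6-41e6-ae30-a91ce51d6d95">
      <gml:description>Wohnblock in geschlossener Bauweise</gml:description>
      <creationDate>2014-03-28</creationDate>
    </bldg:Building>
  </cityObjectMember>
</CityModel>"""


def local_name(tag):
    """Strip a leading {namespace-uri} from a tag, if present."""
    return tag.rsplit("}", 1)[-1]


def find_by_local_name(root, name):
    """Return all elements whose tag matches name, ignoring the namespace."""
    return [el for el in root.iter() if local_name(el.tag) == name]


root = ET.fromstring(SAMPLE)
dates = [el.text for el in find_by_local_name(root, "creationDate")]
print(root.tag)                             # qualified name as stored by ElementTree
print(list(root.iter("creationDate")))      # plain name: no match
print(dates)
```

Printing root.tag on the real file is a quick way to see which namespace URI it actually uses.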
I am making a QGIS plugin to open and parse XML from a local or removable disk. This is the code I use to open the XML file:
from PyQt4 import QtCore, QtGui
from ui_testparse import Ui_testparse
import xml.etree.ElementTree as ETree
# create the dialog for zoom to point
class testparseDialog(QtGui.QDialog):
    def __init__(self):
        QtGui.QDialog.__init__(self)
        # Set up the user interface from Designer.
        self.ui = Ui_testparse()
        self.ui.setupUi(self)
        opendata = self.ui.btnCari
        QtCore.QObject.connect(opendata, QtCore.SIGNAL('clicked()'),self.openxml)
    def openxml(self, event=None):
        #open dialog
        openfile = QtGui.QFileDialog.getOpenFileName(self, 'Open File', '*.xml')
        self.ui.lineLokasi.setText(openfile)
        #call XML data
        self.isiData(openfile)
    def isiData(self, nmsatu):
        #open teks with read mode
        openteks = open(nmsatu, 'r').read()
        self.ui.textXml.setText(openteks)
To parse the XML after that, I tried ElementTree. This is the code I use to parse the XML loaded above:
        #Parse XML from Above
        self.parsenow(openteks)
    def parsenow(self, parse):
        element = ETree.fromstring(parse)
        xml_obj = ETree.ElementTree(element)
        for title_obj in xml_obj.findall('./{gmd#}dateStamp/{gco#}Date'):
            print element
            self.ui.lineSkala.setText(element)
The XML I want to parse has a format like this:
<gmd:datestamp>
<gco:Date> XML Date </gco:Date>
I am trying to show the XML Date in lineSkala (a lineEdit) in Qt. When I run it, the plugin can open and read the XML, but it fails to show the XML Date in lineSkala: the field stays blank and no error message appears.
What am I missing?
Thanks in advance for your help.
The XPath syntax supported by etree is quite limited. Also, you must either supply a prefix dictionary when using find/findall (although this is not properly documented in Python 2), or use the full namespace URI.
So try something like:
ns = {
    'gmd': 'http://www.isotc211.org/2005/gmd',
    'gco': 'http://www.isotc211.org/2005/gco',
}
tree.findall('.//gmd:dateStamp/gco:Date', ns)
or:
tree.findall('.//{http://www.isotc211.org/2005/gmd}dateStamp/'
'{http://www.isotc211.org/2005/gco}Date')
PS:
If you need to use more sophisticated XPath syntax, try lxml, which has a very similar API to ElementTree, but many more features.
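Put together, the prefix-dictionary approach could look like the sketch below. The SAMPLE document and the date_stamp helper are invented for illustration (the question only shows a fragment), and the root element name is an assumption. Note that the helper returns the element's text: a Qt line edit's setText expects a string, not the Element itself.

```python
import xml.etree.ElementTree as ET

NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

# Hypothetical minimal metadata record; the real file will contain much more.
SAMPLE = """<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:dateStamp>
    <gco:Date> 2013-05-01 </gco:Date>
  </gmd:dateStamp>
</gmd:MD_Metadata>"""


def date_stamp(xml_text):
    """Return the stripped text of the first gmd:dateStamp/gco:Date, or None."""
    root = ET.fromstring(xml_text)
    node = root.find(".//gmd:dateStamp/gco:Date", NS)
    if node is None or node.text is None:
        return None
    return node.text.strip()


stamp = date_stamp(SAMPLE)
print(stamp)
```

Tag names are case-sensitive, so a document that really contains gmd:datestamp (lowercase s) would need the path adjusted to match. The returned string could then be passed to self.ui.lineSkala.setText in the plugin.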