I have absolute paths, in the format "A/B/C", to the values I want to retrieve from XML files. How can I do this in Python?
Another method.
from simplified_scrapy import SimplifiedDoc, utils, req
# Basic
xml = '''<ROOT><A><B><C>The Value</C></B></A></ROOT>'''
doc = SimplifiedDoc(xml)
print (doc.select('A>B>C'))
# Multiple
xml = '''<ROOT><A><B><C>The Value 1</C></B></A><A><B><C>The Value 2</C></B></A></ROOT>'''
doc = SimplifiedDoc(xml)
# print (doc.selects('A').select('B').select('C'))
print (doc.selects('A').select('B>C'))
# Mixed structure
xml = '''<ROOT><A><other>no B</other></A><A><other></other><B>no C</B></A><A><B><C>The Value</C></B></A></ROOT>'''
doc = SimplifiedDoc(xml)
nodes = doc.selects('A').selects('B').select('C')
for node in nodes:
    for c in node:
        if c:
            print (c)
Result:
{'tag': 'C', 'html': 'The Value'}
[{'tag': 'C', 'html': 'The Value 1'}, {'tag': 'C', 'html': 'The Value 2'}]
{'tag': 'C', 'html': 'The Value'}
Using ElementTree library (Note that my answer uses core python library while the other answers are using external libraries.)
import xml.etree.ElementTree as ET
xml = '''<ROOT><A><B><C>The Value</C></B></A></ROOT>'''
root = ET.fromstring(xml)
print(root.find('./A/B/C').text)
output
The Value
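If the path can match more than one element, `findall` returns every match rather than only the first. A minimal sketch, reusing the two-value sample XML from the earlier answer; the variable names `path` and `values` are illustrative only:

```python
import xml.etree.ElementTree as ET

xml = '''<ROOT><A><B><C>The Value 1</C></B></A><A><B><C>The Value 2</C></B></A></ROOT>'''
root = ET.fromstring(xml)

# The path is interpreted relative to the root element (<ROOT> here).
path = 'A/B/C'
values = [node.text for node in root.findall('./' + path)]
print(values)
```

`find` would return only the first matching `C` element, so `findall` is the safer choice when the same path may occur several times.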
You can use lxml, which you can install with pip install lxml.
See also https://lxml.de/xpathxslt.html
from io import StringIO
from lxml import etree
data = '''\
<prestashop>
<combination>
<id>a</id>
<id_product>b</id_product>
<location>c</location>
<ean13>d</ean13>
<isbn>e</isbn>
<upc>f</upc>
<mpn>g</mpn>
</combination>
</prestashop>
'''
xpath = '/prestashop/combination/ean13'
f = StringIO(data)
tree = etree.parse(f)
matches = tree.xpath(xpath)
for e in matches:
    print(e.text)
There is a jsx file with contents
<import name="abcd" color="green" age="25" />
<View color={dsdssd}>
<IBG
color={[color.imagecolor, color.image125]}
imageStyle={[styles.imageStyle, styles.image125]}
source={{ uri: contents.aimeecard }} >
<View color={styles.titleContainer}>
<Text color={[{green: 45}, styles.mainTileText]}</Text>
<View color={[abcde.text]} />
</View>
</View>
I need to fetch the details of first line using python script:
Expected output
name="abcd"
color="green"
age="25"
Also the path of jsx file is passed through list
ex: [abcd/file1.jsx , dcef/file2.jsx]
Python code tried for fetching jsx file through the list
for file in jsx_path:
    data = md.parse("file")
    print( file.firstChild.tagName )
Values are not fetched and getting error.
Can anyone help me in resolving this?
Assuming jsx_path is the list containing all the paths to the jsx files, you can iterate over it and use a context manager so that you don't have to close the files explicitly:
data = ""
for file in jsx_path:
    with open(file) as f:
        # Slice off the leading '<import ' and the trailing ' />' plus newline
        data += f.readline()[8:-4] + "\n"
print(data) # name="abcd" color="green" age="25"
Following your comment, if you want to output it as a dict, you can tweak the previous code:
import re
data = []
for file in jsx_path:
    with open(file) as f:
        # Raw string so that \W is not treated as a string escape sequence
        data.append(re.split(r'\W+|=', f.readline()[8:-4]))
data_dict = []
for d in data:
    # Even positions hold the keys, odd positions hold the values
    data_dict.append({key:value for (key, value) in zip(d[::2], d[1::2])})
print(data_dict) # [{'name': 'abcd', 'color': 'green', 'age': '25'}]
Note that this is a hack. I only read the JSX file sequentially because your use case is simple enough to do so. You can also use a dedicated parser by extending the stdlib class HTMLParser:
from html.parser import HTMLParser
class JSXImportParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == "import":
            # Each new import tag overwrites the previous one
            self._import_attrs = {key:value for (key, value) in attrs}
    @property
    def import_attrs(self):
        return self._import_attrs
parser = JSXImportParser()
data = []
for file in jsx_path:
    with open(file) as f:
        parser.feed(f.read())
        data.append(parser.import_attrs)
print(data) # [{'name': 'abcd', 'color': 'green', 'age': '25'}]
Note that this only extracts the details of the last import tag in each file. You can alter this behavior by changing how the _import_attrs instance attribute is populated.
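As one way to keep every import tag instead of only the last one, the parser can append to a list. This is a sketch under assumptions: the class name `AllImportsParser`, the `imports` attribute, and the second sample tag are made up for illustration, and the input is a plain string rather than a real JSX file:

```python
from html.parser import HTMLParser

class AllImportsParser(HTMLParser):
    """Collects the attributes of every <import> tag (hypothetical class name)."""

    def __init__(self):
        super().__init__()
        self.imports = []  # one dict per import tag, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "import":
            self.imports.append(dict(attrs))

# Illustrative input; the second tag is invented for the example.
sample = (
    '<import name="abcd" color="green" age="25" />\n'
    '<import name="efgh" color="red" age="30" />\n'
)
parser = AllImportsParser()
parser.feed(sample)
parser.close()
print(parser.imports)
```

Creating a fresh parser per file avoids carrying results over from one file to the next.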
Edit: Following your additional comment about the requirement to use an XML parser library, the same thing can be achieved using ElementTree by sampling the file to extract only what's interesting for you (the import tag):
import xml.etree.ElementTree as ET
data = []
for file in jsx_path:
    with open(file) as f:
        import_statement = ET.XML(f.readline())
        data.append(import_statement.attrib)
print(data) # [{'name': 'abcd', 'color': 'green', 'age': '25'}]
Of course, this only works if the import statement is on the first line. If that's not the case, you'll have to locate it first before calling ET.XML.
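Locating the import statement first could look roughly like the following. It is a sketch, assuming the whole import tag sits on a single line; the helper name `first_import_attrs` and the comment line in the sample are hypothetical, and an in-memory `StringIO` stands in for an opened file:

```python
import io
import xml.etree.ElementTree as ET

def first_import_attrs(lines):
    """Return the attributes of the first line starting with '<import', or None."""
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("<import"):
            # Only this one line is handed to the XML parser.
            return ET.XML(stripped).attrib
    return None

# Stand-in for an opened JSX file whose import is not on the first line.
sample = io.StringIO(
    '// a comment line\n'
    '<import name="abcd" color="green" age="25" />\n'
    '<View color={dsdssd}>\n'
)
attrs = first_import_attrs(sample)
print(attrs)
```

With a real file, the same helper can be called as `first_import_attrs(f)` inside the `with open(file) as f:` block.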
My XML file:
<xml
xmlns="http://www.myweb.org/2003/instance"
xmlns:link="http://www.myweb.org/2003/linkbase"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:iso4217="http://www.myweb.org/2003/iso4217"
xmlns:utr="http://www.myweb.org/2009/utr">
<link:schemaRef xlink:type="simple" xlink:href="http://www.myweb.com/form/2020-01-01/test.xsd"></link:schemaRef>
I want to get the URL: http://www.myweb.com/folder/form/1/2020-01-01/test.xsd from the <link:schemaRef> tag.
My below python code finds the <link:schemaRef> tag. But I am unable to retrieve the URL.
from lxml import etree
with open(filepath,'rb') as f:
    file = f.read()
root = etree.XML(file)
print(root.nsmap["link"]) #http://www.myweb.org/2003/linkbase
print(root.find(".//{"+root.nsmap["link"]+"}"+"schemaRef"))
Try it this way and see if it works:
for i in root.xpath('//*/node()'):
    if isinstance(i, etree._Element):  # etree comes from "from lxml import etree"
        print(i.values()[1])
Output:
http://www.myweb.com/form/2020-01-01/test.xsd
Use:
>>> child = root[0]  # first child element; getchildren() is deprecated
>>> child.attrib
{'{http://www.w3.org/1999/xlink}type': 'simple', '{http://www.w3.org/1999/xlink}href': 'http://www.myweb.com/form/2020-01-01/test.xsd'}
>>> url = child.attrib['{http://www.w3.org/1999/xlink}href']
However, I believe the challenge is knowing which key to use (i.e. {http://www.w3.org/1999/xlink}href). If that is the issue, we just need:
>>> key_prefix = root.nsmap['xlink'] # Notice that the requested URL is an href in the xlink namespace
>>> print(key_prefix)
http://www.w3.org/1999/xlink
>>> key_url = "{"+key_prefix+"}href"
>>> print(child.attrib[key_url])
http://www.myweb.com/form/2020-01-01/test.xsd
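For comparison, the same lookup can be sketched with the standard library's ElementTree instead of lxml, passing a prefix-to-URI mapping to `find`. The XML below is a shortened copy of the snippet from the question; the closing `</xml>` tag is an assumption, since the original snippet is cut off, and the `ns` dictionary is written by hand:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's XML; the closing </xml> tag is assumed.
xml_text = '''<xml
  xmlns="http://www.myweb.org/2003/instance"
  xmlns:link="http://www.myweb.org/2003/linkbase"
  xmlns:xlink="http://www.w3.org/1999/xlink">
  <link:schemaRef xlink:type="simple" xlink:href="http://www.myweb.com/form/2020-01-01/test.xsd"></link:schemaRef>
</xml>'''

# Hand-written prefix map; the URIs are copied from the xmlns declarations.
ns = {
    'link': 'http://www.myweb.org/2003/linkbase',
    'xlink': 'http://www.w3.org/1999/xlink',
}

root = ET.fromstring(xml_text)
schema_ref = root.find('link:schemaRef', ns)

# Attribute names are stored in {namespace-uri}localname form.
url = schema_ref.get('{' + ns['xlink'] + '}href')
print(url)
```

Looking the attribute up by its qualified name avoids relying on attribute order, which `values()[1]` does.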
I am trying to parse an XML file in python and seems like my XML is different from the normal nomenclature.
Below is my XML snippet:
<records>
<record>
<parameter>
<name>Server</name>
<value>Application_server_01</value>
</parameter>
</record>
</records>
I am trying to get the name and value inside "parameter"; however, I seem to get empty values.
I checked the online documentation and almost all XML seems to be in the below format
<neighbor name="Switzerland" direction="W"/>
I am able to parse this fine. How can I get the values from my XML without changing its formatting?
working code
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
for neighbor in root.iter('neighbor'):
    print(neighbor.attrib)
output
C:/Users/xxxxxx/PycharmProjects/default/parse.py
{'direction': 'E', 'name': 'Austria'}
{'direction': 'W', 'name': 'Switzerland'}
{'direction': 'N', 'name': 'Malaysia'}
{'direction': 'W', 'name': 'Costa Rica'}
{'direction': 'E', 'name': 'Colombia'}
PS: I will be using the XML to fire an API call and doubt if the downstream application would like the second way of formatting.
Below is my python code
import xml.etree.ElementTree as ET
tree = ET.parse('at.xml')
root = tree.getroot()
for name in root.iter('name'):
    print(name.attrib)
Output for the above code
C:/Users/xxxxxx/PycharmProjects/default/learning.py
{}
{}
{}
{}
{}
{}
{}
{}
Use lxml and XPath:
from lxml import etree as et
tree = et.parse(open("/tmp/so.xml"))
name = tree.xpath("/records/record/parameter/name/text()")[0]
value = tree.xpath("/records/record/parameter/value/text()")[0]
print(name, value)
Output:
Server Application_server_01
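The empty dictionaries in the question appear because `name` and `value` are child elements carrying text, not attributes, so the data is in `.text` rather than `.attrib`. A minimal sketch with the standard library's ElementTree, using an inline copy of the question's XML (with the `</parameter>` tag closed) and a hypothetical `params` dictionary:

```python
import xml.etree.ElementTree as ET

xml_text = '''<records>
  <record>
    <parameter>
      <name>Server</name>
      <value>Application_server_01</value>
    </parameter>
  </record>
</records>'''

root = ET.fromstring(xml_text)

# Map each parameter's <name> text to its <value> text.
params = {}
for parameter in root.iter('parameter'):
    params[parameter.findtext('name')] = parameter.findtext('value')
print(params)
```

`findtext` returns the text of the first matching child, or `None` when the child is missing.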
If I've got an XML file like this:
<root
xmlns:a="http://example.com/a"
xmlns:b="http://example.com/b"
xmlns:c="http://example.com/c"
xmlns="http://example.com/base">
...
</root>
How can I get a list of the namespace definitions (ie, the xmlns:a="…", etc)?
Using:
import xml.etree.ElementTree as ET
tree = ET.parse('foo.xml')
root = tree.getroot()
print(root.attrib)
Shows an empty attribute dictionary.
Via @mzjn, in the comments, here's how to do it with stock ElementTree: https://stackoverflow.com/a/42372404/407651 :
import xml.etree.ElementTree as ET
my_namespaces = dict([
    node for (_, node) in ET.iterparse('file.xml', events=['start-ns'])
])
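To try the `start-ns` approach without a file on disk, the same call can be fed an in-memory buffer. A sketch, assuming the XML from the question with the elided body left out; `xml_bytes` is an illustrative name:

```python
import io
import xml.etree.ElementTree as ET

xml_bytes = (
    b'<root xmlns:a="http://example.com/a" xmlns:b="http://example.com/b" '
    b'xmlns:c="http://example.com/c" xmlns="http://example.com/base"></root>'
)

# Each 'start-ns' event yields a (prefix, uri) pair, which dict() accepts directly.
my_namespaces = dict(
    node for (_, node) in ET.iterparse(io.BytesIO(xml_bytes), events=['start-ns'])
)
print(my_namespaces)
```

The result holds one entry per declaration, including the default namespace.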
You might find it easier to use lxml.
from lxml import etree
xml_data = '<root xmlns:a="http://example.com/a" xmlns:b="http://example.com/b" xmlns:c="http://example.com/c" xmlns="http://example.com/base"></root>'
root_node = etree.fromstring(xml_data)
print(root_node.nsmap)
This outputs
{None: 'http://example.com/base',
'a': 'http://example.com/a',
'b': 'http://example.com/b',
'c': 'http://example.com/c'}
I'm working with ElementTree to parse an XML file (Nessus data). I've identified item.attrib, which looks to be a dictionary of the form 'name': 'IPaddress'. I'd like to add this data to a dictionary, or collect just the IP addresses in a list. How can I access only the value for name? I've tried variations on item[0]/[1]/.attrib/text/ but still no luck.
Current Code
import xml.etree.ElementTree as ET

def getDetails(nessus_file):
    host_list = []
    host_dict = {}
    try:
        tree = ET.parse(nessus_file)
        doc = tree.getroot()
        reporthost = doc.iter('ReportHost')
        for child in doc:
            if child.tag == 'Report':
                for item in child:
                    if item.tag == 'ReportHost':
                        print(item.attrib)
    except Exception as e:
        print(e)
        exit()

getDetails('file.nessus')
Example Output From Current Code
{'name': '172.121.26.80'}
{'name': '172.121.26.42'}
{'name': '172.121.26.41'}
{'name': '172.121.26.21'}
{'name': '172.121.26.15'}
{'name': '172.121.26.14'}
Use item.get('name'). See https://docs.python.org/2/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.get for details.
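Collecting the addresses into a list with `get` could look like this. It is a sketch only: the sample XML is a made-up, heavily reduced stand-in for a Nessus file (the root element name is an assumption), and `host_list` mirrors the variable from the question:

```python
import xml.etree.ElementTree as ET

# Made-up, reduced stand-in for a .nessus file; the root tag name is assumed.
sample = '''<NessusClientData_v2>
  <Report name="scan">
    <ReportHost name="172.121.26.80"></ReportHost>
    <ReportHost name="172.121.26.42"></ReportHost>
  </Report>
</NessusClientData_v2>'''

doc = ET.fromstring(sample)

# iter() walks the whole tree, so the nested tag checks are not needed.
host_list = [item.get('name') for item in doc.iter('ReportHost')]
print(host_list)
```

`get` returns `None` instead of raising when the attribute is missing, which `item.attrib['name']` would not.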