getting the node attribute of an XML file with LXML parsing - python

I can't wrap my head around why this isn't working properly:
from lxml import etree
data='''<?xml version="1.0" encoding="UTF-8"?>\n<div type="docs" xml:base="/kime-api/prod/api/emi/2" xml:lang="ja" xml:id="39532e30"> <div n="0001" type="doc" xml:id="_5738d00002"></div></div>'''
parser = etree.XMLParser(resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)
# I tried with and without this following line
#data = data.replace('<?xml version="1.0" encoding="UTF-8"?>','')
XML_tree = etree.fromstring(data.encode() , parser=parser)
lang = XML_tree.xpath('.//div[@xml:lang]')
lang
lang is an empty list, even though there is ONE element with xml:lang="ja" in the XML.
What am I doing wrong?

You could just do xpath('@xml:lang').
XML_tree = etree.fromstring(data.encode() , parser=parser)
lang = XML_tree.xpath('@xml:lang')
print(lang)
Output:
['ja']

XML_tree represents the root element (the <div> with an xml:lang attribute).
If you want to get the language, use the following:
lang = XML_tree.xpath('@xml:lang')
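To see why the original query came back empty: a path starting with `.//div` only searches the descendants of the context node, and here the context node is the root `<div>` itself, which is the only element carrying `xml:lang`. The sketch below shows this with the standard library's `xml.etree.ElementTree` instead of lxml (an assumption made so the snippet runs without third-party packages). The `xml:` prefix is bound to a fixed namespace URI, so the attribute can also be read by its fully qualified name; lxml's `Element.get` accepts the same `{uri}name` key.

```python
import xml.etree.ElementTree as ET

# The xml: prefix always maps to this reserved namespace URI.
XML_NS = "http://www.w3.org/XML/1998/namespace"

data = b'''<?xml version="1.0" encoding="UTF-8"?>
<div type="docs" xml:lang="ja" xml:id="39532e30"> <div n="0001" type="doc" xml:id="_5738d00002"></div></div>'''

root = ET.fromstring(data)

# './/div' searches below the root only, so the root <div> is never a candidate.
descendants = root.findall(".//div")
print([d.get("n") for d in descendants])

# Read the attribute directly from the root using its qualified name.
lang = root.get(f"{{{XML_NS}}}lang")
print(lang)
```

The inner `<div>` is the only match for `.//div`, and it has no `xml:lang`, which is consistent with the empty list in the question.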

Related

How to modify the value under a designated path with Python?

There is an XML file like the one below:
<aa>
<bb>BB</bb>
<cc>
<dd>Tom</dd>
</cc>
<cc>
<dd>David</dd>
</cc>
</aa>
I'm trying to modify the values "Tom" and "David", but I can't get any value from <dd>. I then tried to get the value of <bb>, but my code returned "None".
My code is below:
import xml.etree.ElementTree as ET
tree = ET.parse("abc.xml")
root = tree.getroot()
a = root.find('aa/bb')
print(a)
Could someone help me correct my code so I can get and modify the value of <dd>? Many thanks.
Your top-level object is aa, so root is the element aa.
To get bb, just do root.find('bb'):
>>> root
<Element 'aa' at 0x7fb1df5f0278>
>>> a = root.find('bb')
>>> a
<Element 'bb' at 0x7fb1df5f0228>
So to edit the names, try something like this:
for dd in root.findall('cc/dd'):
    if dd.text in ["Tom", "David"]:
        dd.text = "something else"
Using ElementTree
Demo:
import xml.etree.ElementTree
filename = "abc.xml"  # the file name used in the question
et = xml.etree.ElementTree.parse(filename)
root = et.getroot()
for cc in root.findall('cc'):  # Find all cc tags
    print(cc.find("dd").text)  # Print current text
    cc.find("dd").text = "NewValue"  # Update dd tags with new value
et.write(filename)  # Write back to xml
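Here is a minimal, self-contained sketch of the same idea that parses the XML from a string instead of a file, so it can be run as-is. The `replacements` mapping is a hypothetical example and not part of the question. It also shows why `root.find('aa/bb')` returns `None`: paths are resolved relative to the root, and `<aa>` has no child named `aa`.

```python
import xml.etree.ElementTree as ET

data = """<aa>
<bb>BB</bb>
<cc><dd>Tom</dd></cc>
<cc><dd>David</dd></cc>
</aa>"""

root = ET.fromstring(data)  # root is the <aa> element itself

missing = root.find("aa/bb")  # looks for aa/bb *under* <aa>: no match
bb = root.find("bb")          # direct child of the root: found

# Hypothetical old-name -> new-name mapping, for illustration only.
replacements = {"Tom": "Thomas", "David": "Dave"}
for dd in root.findall("cc/dd"):
    dd.text = replacements.get(dd.text, dd.text)

names = [dd.text for dd in root.iter("dd")]
print(missing, bb.text, names)
```

To persist the change to disk, the modified tree still has to be written back, as the `et.write(filename)` call in the answer above does.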
If you don't mind using BeautifulSoup, you can modify your XML through it:
data = """<aa>
<bb>BB</bb>
<cc>
<dd>Tom</dd>
</cc>
<cc>
<dd>David</dd>
</cc>
</aa>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'xml')
for dd in soup.select('cc > dd'): # using CSS selectors
    dd.clear()
    dd.append('XXX')
print(soup.prettify())
Output:
<?xml version="1.0" encoding="utf-8"?>
<aa>
 <bb>
  BB
 </bb>
 <cc>
  <dd>
   XXX
  </dd>
 </cc>
 <cc>
  <dd>
   XXX
  </dd>
 </cc>
</aa>

XML not returning correct child tags/data in Python

Hello, I am making a requests call to return order data from an online store. My issue is that once I have passed my data to a root variable, the iter method does not return the correct results: it displays multiple tags of the same name rather than one, and it does not show the data within the tag.
I thought this was because the XML was not correctly formatted, so I formatted it by saving it to a file using pretty_print, but that hasn't fixed the error.
How do I fix this? Thanks in advance.
Code:
import requests, xml.etree.ElementTree as ET, lxml.etree as etree
url="http://publicapi.ekmpowershop24.com/v1.1/publicapi.asmx"
headers = {'content-type': 'application/soap+xml'}
body = """<?xml version="1.0" encoding="utf-8"?>
<soap12:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap12="http://www.w3.org/2003/05/soap-envelope">
<soap12:Body>
<GetOrders xmlns="http://publicapi.ekmpowershop.com/">
<GetOrdersRequest>
<APIKey>my_api_key</APIKey>
<FromDate>01/07/2018</FromDate>
<ToDate>04/07/2018</ToDate>
</GetOrdersRequest>
</GetOrders>
</soap12:Body>
</soap12:Envelope>"""
#send request to ekm
r = requests.post(url,data=body,headers=headers)
#save output to file
file = open("C:/Users/Mark/Desktop/test.xml", "w")
file.write(r.text)
file.close()
#take the file and format the xml
x = etree.parse("C:/Users/Mark/Desktop/test.xml")
newString = etree.tostring(x, pretty_print=True)
file = open("C:/Users/Mark/Desktop/test.xml", "w")
file.write(newString.decode('utf-8'))
file.close()
#parse the file to get the roots
tree = ET.parse("C:/Users/Mark/Desktop/test.xml")
root = tree.getroot()
#access elements names in the data
for child in root.iter('*'):
    print(child.tag)
#show orders elements attributes
tree = ET.parse("C:/Users/Mark/Desktop/test.xml")
root = tree.getroot()
for order in root.iter('{http://publicapi.ekmpowershop.com/}Order'):
    out = {}
    for child in order:
        if child.tag in ('OrderID'):
            out[child.tag] = child.text
    print(out)
Elements output:
{http://publicapi.ekmpowershop.com/}Orders
{http://publicapi.ekmpowershop.com/}Order
{http://publicapi.ekmpowershop.com/}OrderID
{http://publicapi.ekmpowershop.com/}OrderNumber
{http://publicapi.ekmpowershop.com/}CustomerID
{http://publicapi.ekmpowershop.com/}CustomerUserID
{http://publicapi.ekmpowershop.com/}Order
{http://publicapi.ekmpowershop.com/}OrderID
{http://publicapi.ekmpowershop.com/}OrderNumber
{http://publicapi.ekmpowershop.com/}CustomerID
{http://publicapi.ekmpowershop.com/}CustomerUserID
Orders Output:
{http://publicapi.ekmpowershop.com/}Order {}
{http://publicapi.ekmpowershop.com/}Order {}
XML structure after formatting:
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<GetOrdersResponse xmlns="http://publicapi.ekmpowershop.com/">
<GetOrdersResult>
<Status>Success</Status>
<Errors/>
<Date>2018-07-10T13:47:00.1682029+01:00</Date>
<TotalOrders>10</TotalOrders>
<TotalCost>100</TotalCost>
<Orders>
<Order>
<OrderID>100</OrderID>
<OrderNumber>102/040718/67</OrderNumber>
<CustomerID>6910</CustomerID>
<CustomerUserID>204</CustomerUserID>
<FirstName>TestFirst</FirstName>
<LastName>TestLast</LastName>
<CompanyName>Test Company</CompanyName>
<EmailAddress>test@Test.com</EmailAddress>
<OrderStatus>Dispatched</OrderStatus>
<OrderStatusColour>#00CC00</OrderStatusColour>
<TotalCost>85.8</TotalCost>
<OrderDate>10/07/2018 14:30:43</OrderDate>
<OrderDateISO>2018-07-10T14:30:43</OrderDateISO>
<AbandonedOrder>false</AbandonedOrder>
<EkmStatus>SUCCESS</EkmStatus>
</Order>
</Orders>
<Currency>GBP</Currency>
</GetOrdersResult>
</GetOrdersResponse>
</soap:Body>
</soap:Envelope>
You need to consider the namespace when checking for tags.
>>> # Include the namespace part of the tag in the tag values that we check.
>>> tags = ('{http://publicapi.ekmpowershop.com/}OrderID', '{http://publicapi.ekmpowershop.com/}OrderNumber')
>>> for order in root.iter('{http://publicapi.ekmpowershop.com/}Order'):
...     out = {}
...     for child in order:
...         if child.tag in tags:
...             out[child.tag] = child.text
...     print(out)
...
{'{http://publicapi.ekmpowershop.com/}OrderID': '100', '{http://publicapi.ekmpowershop.com/}OrderNumber': '102/040718/67'}
If you don't want the namespace prefixes in the output, you can strip them by only including that part of the tag after the } character.
>>> for order in root.iter('{http://publicapi.ekmpowershop.com/}Order'):
...     out = {}
...     for child in order:
...         if child.tag in tags:
...             out[child.tag[child.tag.index('}')+1:]] = child.text
...     print(out)
...
{'OrderID': '100', 'OrderNumber': '102/040718/67'}
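As an alternative to comparing full `{namespace}tag` strings, ElementTree's `find`/`findtext` accept a prefix-to-URI mapping, so you can ask for the fields you want by their short names. The sketch below is a minimal, self-contained version that uses a trimmed-down copy of the response; the `ekm` prefix and the `wanted` tuple are names chosen here for illustration, not anything the API defines.

```python
import xml.etree.ElementTree as ET

# The prefix "ekm" is arbitrary; only the URI has to match the document.
NS = {"ekm": "http://publicapi.ekmpowershop.com/"}

data = """<Orders xmlns="http://publicapi.ekmpowershop.com/">
  <Order>
    <OrderID>100</OrderID>
    <OrderNumber>102/040718/67</OrderNumber>
    <CustomerID>6910</CustomerID>
  </Order>
</Orders>"""

root = ET.fromstring(data)
wanted = ("OrderID", "OrderNumber")  # illustrative field selection

orders = []
for order in root.iter(f"{{{NS['ekm']}}}Order"):
    # findtext resolves the "ekm:" prefix through the NS mapping.
    out = {name: order.findtext(f"ekm:{name}", namespaces=NS) for name in wanted}
    orders.append(out)

print(orders)
```

The dictionary keys come out without the namespace part because they are taken from `wanted` rather than from `child.tag`.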

How to add attribute to lxml Element

I would like to add an attribute to an lxml Element like this:
<outer xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Header>
<field1 name="blah">some value1</field1>
<field2 name="asdfasd">some value2</field2>
</Header>
</outer>
Here is what I have
import lxml.builder
E = lxml.builder.ElementMaker()
outer = E.outer
header = E.Header
FIELD1 = E.field1
FIELD2 = E.field2
the_doc = outer(
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance",
    header(
        FIELD1('some value1', name='blah'),
        FIELD2('some value2', name='asdfasd'),
    ),
)
It seems like this line is causing the problem:
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance",
even if I replace it with
'xmlns:xsi'="http://www.w3.org/2001/XMLSchema-instance",
it still doesn't work.
How can I add an attribute to an lxml Element?
That's a namespace definition, not an ordinary XML attribute. You can pass namespace information to ElementMaker() as a dictionary, for example:
from lxml import etree as ET
import lxml.builder
# Namespace declarations go into nsmap, not into the attribute keywords
nsdef = {'xsi':'http://www.w3.org/2001/XMLSchema-instance'}
E = lxml.builder.ElementMaker(nsmap=nsdef)
doc = E.outer(
    E.Header(
        E.field1('some value1', name='blah'),
        E.field2('some value2', name='asdfasd'),
    ),
)
print(ET.tostring(doc, pretty_print=True).decode())
Output:
<outer xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Header>
    <field1 name="blah">some value1</field1>
    <field2 name="asdfasd">some value2</field2>
  </Header>
</outer>
Link to the docs: http://lxml.de/api/lxml.builder.ElementMaker-class.html
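To make the distinction between a namespace declaration and an attribute concrete, here is a small sketch using the standard library's `xml.etree.ElementTree` rather than lxml (swapped in so it runs without extra packages). Ordinary attributes such as `name` are passed as keywords or via `set()`. An attribute that lives in a namespace is set with the `{uri}name` form, and the serializer emits the matching `xmlns:xsi` declaration on its own. The `noNamespaceSchemaLocation` attribute and the `schema.xsd` value are hypothetical and not part of the question; lxml's `Element.set` accepts the same `{uri}name` form.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
# Ask the serializer to use the conventional "xsi" prefix for this URI.
ET.register_namespace("xsi", XSI)

outer = ET.Element("outer")
header = ET.SubElement(outer, "Header")

# Ordinary attributes: plain keyword arguments.
field1 = ET.SubElement(header, "field1", name="blah")
field1.text = "some value1"

# Namespaced attribute (hypothetical example): use the {uri}name form.
outer.set(f"{{{XSI}}}noNamespaceSchemaLocation", "schema.xsd")

xml_text = ET.tostring(outer, encoding="unicode")
print(xml_text)
```

Note one difference from the lxml answer above: the standard library only writes the `xmlns:xsi` declaration when something in the tree actually uses that namespace, whereas lxml's `nsmap` declares it unconditionally.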

Extracting similar XML attributes with BeautifulSoup

Let's assume I have the following XML:
<time from="2017-07-29T08:00:00" to="2017-07-29T09:00:00">
<!-- Valid from 2017-07-29T08:00:00 to 2017-07-29T09:00:00 -->
<symbol number="4" numberEx="4" name="Cloudy" var="04"/>
<precipitation value="0"/>
<!-- Valid at 2017-07-29T08:00:00 -->
<windDirection deg="300.9" code="WNW" name="West-northwest"/>
<windSpeed mps="1.3" name="Light air"/>
<temperature unit="celsius" value="15"/>
<pressure unit="hPa" value="1002.4"/>
</time>
<time from="2017-07-29T09:00:00" to="2017-07-29T10:00:00">
<!-- Valid from 2017-07-29T09:00:00 to 2017-07-29T10:00:00 -->
<symbol number="4" numberEx="4" name="Partly cloudy" var="04"/>
<precipitation value="0"/>
<!-- Valid at 2017-07-29T09:00:00 -->
<windDirection deg="293.2" code="WNW" name="West-northwest"/>
<windSpeed mps="0.8" name="Light air"/>
<temperature unit="celsius" value="17"/>
<pressure unit="hPa" value="1002.6"/>
</time>
And I want to collect time from, symbol name and temperature value from it, and then print it out in the following manner: time from: symbol name, temperature value -- like this: 2017-07-29, 08:00:00: Cloudy, 15°.
(And there are a few name and value attributes in this XML, as you see.)
As of now, my approach was quite straightforward:
#!/usr/bin/env python
# coding: utf-8
import re
from BeautifulSoup import BeautifulSoup
# data is set to the above XML
soup = BeautifulSoup(data)
# collect the tags of interest into lists. can it be done wiser?
time_l = []
symb_l = []
temp_l = []
for i in soup.findAll('time'):
    i_time = str(i.get('from'))
    time_l.append(i_time)
for i in soup.findAll('symbol'):
    i_symb = str(i.get('name'))
    symb_l.append(i_symb)
for i in soup.findAll('temperature'):
    i_temp = str(i.get('value'))
    temp_l.append(i_temp)
# join the forecast lists to a dict
forc_l = []
for i, j in zip(symb_l, temp_l):
    forc_l.append([i, j])
rez = dict(zip(time_l, forc_l))
# combine and format the rezult. can this dict be printed simpler?
wew = ''
for key in sorted(rez):
    wew += re.sub("T", ", ", key) + str(rez[key])
    wew = re.sub("'", "", wew)
    wew = re.sub(r"\[", ": ", wew)
    wew = re.sub(r"\]", "°\n", wew)
# print the rezult
print(wew)
But I imagine there must be some better, more intelligent approach? Mostly, I'm interested in collecting the attributes from the XML, my way seems rather dumb to me, actually. Also, is there any simpler way to print out a dict {'a': '[b, c]'} nicely?
Would be grateful for any hints or suggestions.
from bs4 import BeautifulSoup
with open("sample.xml", "r") as f: # opening xml file
    content = f.read() # xml content stored in this variable
soup = BeautifulSoup(content, "lxml")
for values in soup.findAll("time"):
    print("{} : {}, {}°".format(values["from"], values.find("symbol")["name"], values.find("temperature")["value"]))
Output:
2017-07-29T08:00:00 : Cloudy, 15°
2017-07-29T09:00:00 : Partly cloudy, 17°
One more option: you can also read the XML data with the xml.dom.minidom module.
This gets the data you want:
from xml.dom.minidom import parse
doc = parse("path/to/xmlfile.xml") # parse an XML file by name
itemlist = doc.getElementsByTagName('time')
for items in itemlist:
    from_tag = items.getAttribute('from')
    symbol_list = items.getElementsByTagName('symbol')
    symbol_name = [d.getAttribute('name') for d in symbol_list][0]
    temperature_list = items.getElementsByTagName('temperature')
    temp_value = [d.getAttribute('value') for d in temperature_list][0]
    print("{} : {}, {}°".format(from_tag, symbol_name, temp_value))
Output will be as follows:
2017-07-29T08:00:00 : Cloudy, 15°
2017-07-29T09:00:00 : Partly cloudy, 17°
Hope it is useful.
Alternatively, you can use a built-in module (I'm using Python 3.6.2):
import xml.etree.ElementTree as et # this is built-in module in python3
tree = et.parse("sample.xml")
root = tree.getroot()
for temp in root.iter("time"): # iterate time element in xml
    print(temp.attrib["from"], end=": ") # prints attribute of time element
    for sym in temp.iter("symbol"): # iterate symbol element within time element
        print(sym.attrib["name"], end=", ")
    for t in temp.iter("temperature"): # iterate temperature element within time element
        print(t.attrib["value"], end="°\n")
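None of the snippets above produce the exact "date, time: name, temperature°" layout the question asks for, so here is a minimal sketch of one way to get it with the built-in ElementTree module. It assumes the `<time>` elements sit under a single root element; the `<forecast>` wrapper and the trimmed sample data are assumptions added here so the string parses on its own.

```python
import xml.etree.ElementTree as ET

# <forecast> is a hypothetical wrapper; XML needs exactly one root element.
data = """<forecast>
  <time from="2017-07-29T08:00:00" to="2017-07-29T09:00:00">
    <symbol number="4" numberEx="4" name="Cloudy" var="04"/>
    <temperature unit="celsius" value="15"/>
  </time>
  <time from="2017-07-29T09:00:00" to="2017-07-29T10:00:00">
    <symbol number="4" numberEx="4" name="Partly cloudy" var="04"/>
    <temperature unit="celsius" value="17"/>
  </time>
</forecast>"""

root = ET.fromstring(data)

lines = []
for time_el in root.iter("time"):
    # Split the ISO timestamp into its date and clock parts.
    date, clock = time_el.get("from").split("T")
    # Look the children up inside this <time>, so values stay paired correctly.
    name = time_el.find("symbol").get("name")
    temp = time_el.find("temperature").get("value")
    lines.append(f"{date}, {clock}: {name}, {temp}°")

print("\n".join(lines))
```

Reading `symbol` and `temperature` from within each `<time>` element avoids building three parallel lists and zipping them, which would silently misalign if one entry were missing a child.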

Get Values from child nodes from XML | Python

I have the following XML.
I am using ElementTree library to scrape the values.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc> Test1</loc>
</url>
<url>
<loc>Test 2</loc>
</url>
<url>
<loc>Test 3</loc>
</url>
</urlset>
I need to get the values out of 'loc tag'.
Desired Output:
Test 1
Test 2
Test 3
Tried Code:
import xml.etree.ElementTree as ET
tree = ET.parse('sitemap.xml')
root = tree.getroot()
for atype in root.findall('url'):
    rank = atype.find('loc').text
    print (rank)
Any suggestions on where I am going wrong?
Your XML has a default namespace (http://www.sitemaps.org/schemas/sitemap/0.9), so you either have to qualify every tag with it:
tree = ET.parse('sitemap.xml')
root = tree.getroot()
for atype in root.findall('{http://www.sitemaps.org/schemas/sitemap/0.9}url'):
    rank = atype.find('{http://www.sitemaps.org/schemas/sitemap/0.9}loc').text
    print(rank)
Or define a namespace map:
nsmap = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}
tree = ET.parse('sitemap.xml')
root = tree.getroot()
for atype in root.findall('ns:url', nsmap):
    rank = atype.find('ns:loc', nsmap).text
    print(rank)
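If you don't care which namespace the tags are in, a third option on Python 3.8 or newer is the `{*}` wildcard, which `findall` treats as "any namespace". This is a minimal sketch with the sitemap inlined as a bytes literal so it runs without a file; `strip()` is added only because the first `<loc>` in the sample has a leading space.

```python
import xml.etree.ElementTree as ET

data = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc> Test1</loc>
</url>
<url>
<loc>Test 2</loc>
</url>
<url>
<loc>Test 3</loc>
</url>
</urlset>"""

root = ET.fromstring(data)

# {*} matches the tag in any namespace (requires Python 3.8+).
locs = [loc.text.strip() for loc in root.findall("{*}url/{*}loc")]
for value in locs:
    print(value)
```

The explicit namespace map is still the safer choice when a document mixes several vocabularies, since the wildcard would match same-named tags from all of them.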
from lxml import etree
tree = etree.parse('sitemap.xml')
for element in tree.iter('*'):
    # element.text is None for empty elements, so check it first
    if element.text and 'Test' in element.text:
        print(element.text)
Probably isn't the most beautiful solution, but it works :)
