Parsing xml with etree - python

I am trying to parse an XML response from Amazon's Product Advertising API. This is the XML:
<?xml version="1.0" ?>
<ItemLookupResponse xmlns="http://webservices.amazon.com/AWSECommerceService/2010-11-01"> <OperationRequest>
<HTTPHeaders>
<Header Name="UserAgent" Value="TSN (Language=Python)"></Header>
</HTTPHeaders>
<RequestId>96ef9bc3-68a8-4bf3-a2c7-c98b8aeae00f</RequestId>
<Arguments>
<Argument Name="Operation" Value="ItemLookup"></Argument>
<Argument Name="Service" Value="AWSECommerceService"></Argument>
<Argument Name="Signature" Value="gjc4wRNum3YT82app1d06vMIDM7v44fOmZTP8Uh3LqE="></Argument><Argument Name="AssociateTag" Value="sneakick-20"></Argument>
<Argument Name="Version" Value="2010-11-01"></Argument>
<Argument Name="ItemId" Value="810056013349,810056013264"></Argument>
<Argument Name="IdType" Value="UPC"></Argument>
<Argument Name="AWSAccessKeyId" Value="AKIAIFMUMJLJOOINRVRA"></Argument>
<Argument Name="Timestamp" Value="2012-01-03T21:26:39Z"></Argument>
<Argument Name="ResponseGroup" Value="ItemIds"></Argument>
<Argument Name="SearchIndex" Value="Apparel"></Argument>
</Arguments>
<RequestProcessingTime>0.0595830000000000</RequestProcessingTime>
</OperationRequest>
<Items>
<Request>
<IsValid>True</IsValid>
<ItemLookupRequest>
<IdType>UPC</IdType>
<ItemId>810056013349</ItemId>
<ItemId>810056013264</ItemId>
<ResponseGroup>ItemIds</ResponseGroup>
<SearchIndex>Apparel</SearchIndex>
<VariationPage>All</VariationPage>
</ItemLookupRequest>
</Request>
<Item>
<ASIN>B000XR4K6U</ASIN>
</Item>
<Item>
<ASIN>B000XR2UU8</ASIN>
</Item>
</Items>
</ItemLookupResponse>
All I am interested in is the Item tags inside Items. Amazon returned all of that XML as a string, which I parsed like so:
from xml.etree.ElementTree import fromstring
response = "xml string returned by amazon"
parsed = fromstring(response)
items = parsed[1] # This is how i get the Items element
# These were my attempts at getting the Item element
items.find('Item')
items.findall('Item')
Here items is the Items element, but so far no success: find returns None and findall returns an empty list. Am I missing something, or is there another way to go about this?

It is a namespace issue. This works:
from xml.etree import ElementTree as ET
XML = """<?xml version="1.0" ?>
<ItemLookupResponse xmlns="http://webservices.amazon.com/AWSECommerceService/2010-11-01">
<OperationRequest>
<HTTPHeaders>
<Header Name="UserAgent" Value="TSN (Language=Python)"></Header>
</HTTPHeaders>
<RequestId>96ef9bc3-68a8-4bf3-a2c7-c98b8aeae00f</RequestId>
<Arguments>
<Argument Name="Operation" Value="ItemLookup"></Argument>
<Argument Name="Service" Value="AWSECommerceService"></Argument>
<Argument Name="Signature" Value="gjc4wRNum3YT82app1d06vMIDM7v44fOmZTP8Uh3LqE="></Argument>
<Argument Name="AssociateTag" Value="sneakick-20"></Argument>
<Argument Name="Version" Value="2010-11-01"></Argument>
<Argument Name="ItemId" Value="810056013349,810056013264"></Argument>
<Argument Name="IdType" Value="UPC"></Argument>
<Argument Name="AWSAccessKeyId" Value="AKIAIFMUMJLJOOINRVRA"></Argument>
<Argument Name="Timestamp" Value="2012-01-03T21:26:39Z"></Argument>
<Argument Name="ResponseGroup" Value="ItemIds"></Argument>
<Argument Name="SearchIndex" Value="Apparel"></Argument>
</Arguments>
<RequestProcessingTime>0.0595830000000000</RequestProcessingTime>
</OperationRequest>
<Items>
<Request>
<IsValid>True</IsValid>
<ItemLookupRequest>
<IdType>UPC</IdType>
<ItemId>810056013349</ItemId>
<ItemId>810056013264</ItemId>
<ResponseGroup>ItemIds</ResponseGroup>
<SearchIndex>Apparel</SearchIndex>
<VariationPage>All</VariationPage>
</ItemLookupRequest>
</Request>
<Item>
<ASIN>B000XR4K6U</ASIN>
</Item>
<Item>
<ASIN>B000XR2UU8</ASIN>
</Item>
</Items>
</ItemLookupResponse>"""
NS = "{http://webservices.amazon.com/AWSECommerceService/2010-11-01}"
doc = ET.fromstring(XML)
Item_elems = doc.findall(".//" + NS + "Item") # All Item elements in document
print(Item_elems)
Output:
[<Element '{http://webservices.amazon.com/AWSECommerceService/2010-11-01}Item' at 0xbf0c50>,
<Element '{http://webservices.amazon.com/AWSECommerceService/2010-11-01}Item' at 0xbf0cd0>]
Variation closer to your own code:
NS = "{http://webservices.amazon.com/AWSECommerceService/2010-11-01}"
doc = ET.fromstring(XML)
items = doc[1] # Items element
first_item = items.find(NS + 'Item') # First direct Item child
all_items = items.findall(NS + 'Item') # List of all direct Item children
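ElementTree's find, findall, and findtext also accept a prefix-to-URI mapping, which avoids concatenating the namespace by hand. A minimal sketch, assuming a trimmed-down copy of the response; the prefix `aws` and the names `NSMAP` and `asins` are arbitrary choices for illustration:

```python
from xml.etree import ElementTree as ET

# Trimmed-down stand-in for the Amazon response (same default namespace).
XML = """<?xml version="1.0" ?>
<ItemLookupResponse xmlns="http://webservices.amazon.com/AWSECommerceService/2010-11-01">
  <OperationRequest/>
  <Items>
    <Item><ASIN>B000XR4K6U</ASIN></Item>
    <Item><ASIN>B000XR2UU8</ASIN></Item>
  </Items>
</ItemLookupResponse>"""

# Map an arbitrary prefix to the namespace URI declared in the document.
NSMAP = {"aws": "http://webservices.amazon.com/AWSECommerceService/2010-11-01"}

doc = ET.fromstring(XML)
asins = [item.findtext("aws:ASIN", namespaces=NSMAP)
         for item in doc.findall(".//aws:Item", NSMAP)]
print(asins)
```

The prefix only has to match between the path and the mapping; it does not need to appear in the XML itself.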

This is a namespace issue.
You can put the namespace in front of every tag name, as spelled out in the first answer. A possibly simpler solution is to make the parser ignore the namespace with a quick hack: rename the xmlns attribute in the raw string so that it is no longer treated as a namespace declaration:
xml_hacked_namespace = raw_xml.replace(' xmlns=', ' xmlnamespace=')
doc = fromstring(xml_hacked_namespace)
item_list = doc.findall('.//Item')
If you find that you are doing a lot of work with XML, you may also be interested in checking out lxml. It is faster and provides a few extra methods that some find nice to have.
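An alternative to editing the raw string is to strip the `{uri}` prefix from each tag after parsing. This is only a sketch, under the assumption that dropping namespaces does not make two different tags collide; `strip_namespaces` is a hypothetical helper, not part of ElementTree, and it leaves namespaced attributes untouched:

```python
from xml.etree import ElementTree as ET

def strip_namespaces(root):
    """Remove the '{uri}' prefix from every element tag, in place."""
    for elem in root.iter():
        # Only string tags can carry a namespace prefix.
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            elem.tag = elem.tag.split("}", 1)[1]
    return root

raw_xml = """<ItemLookupResponse xmlns="http://webservices.amazon.com/AWSECommerceService/2010-11-01">
  <Items>
    <Item><ASIN>B000XR4K6U</ASIN></Item>
    <Item><ASIN>B000XR2UU8</ASIN></Item>
  </Items>
</ItemLookupResponse>"""

doc = strip_namespaces(ET.fromstring(raw_xml))
item_list = doc.findall(".//Item")
print([item.findtext("ASIN") for item in item_list])
```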

Related

Get items from xml Python

I have an XML document in Python and need to get the elements of the "items" tag as an iterable list, for example like this:
Item 1: Bicycle, value $250, iva_tax: 50.30
Item 2: Skateboard, value $120, iva_tax: 25.0
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<data>
<info>Listado de items</info>
<detalle>
<![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<tienda id="tiendaProd" version="1.1.0">
<items>
<item>
<nombre>Bicycle</nombre>
<valor>250</valor>
<data>
<tax name="iva" value="50.30"></tax>
</data>
</item>
<item>
<nombre>Skateboard</nombre>
<valor>120</valor>
<data>
<tax name="iva" value="25.0"></tax>
</data>
</item>
<item>
<nombre>Motorcycle</nombre>
<valor>900</valor>
<data>
<tax name="iva" value="120.50"></tax>
</data>
</item>
</items>
</tienda>]]>
</detalle>
</data>
I am working with xml.etree.ElementTree, for example:
import xml.etree.ElementTree as ET
xml = ET.fromstring(stringBase64)
ite = xml.find('.//detalle').text
tixml = ET.fromstring(ite)
You can use BeautifulSoup4 (BS4) to do this.
from bs4 import BeautifulSoup
# Read the XML file as a single string
with open("example.xml", "r") as f:
    contents = f.read()
# Create the soup object (the 'xml' parser requires lxml to be installed)
soup = BeautifulSoup(contents, 'xml')
# Find all the item tags
item_tags = soup.find_all("item")  # returns every <item> element
# Find the nombre and valor tags within each item
results = {}
for item in item_tags:
    name = item.find("nombre").text
    val = item.find("valor").text
    results[name] = val
# Print the dictionary of name/value pairs from the XML
print(results)
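Because the inner document sits inside a CDATA section, the outer parser sees it only as text, so it has to be parsed a second time. A sketch using only ElementTree, assuming the outer XML is already available as a string (here a shortened two-item copy named `outer_xml`; the dictionary keys are illustrative):

```python
import xml.etree.ElementTree as ET

outer_xml = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<data>
<info>Listado de items</info>
<detalle>
<![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<tienda id="tiendaProd" version="1.1.0">
<items>
<item><nombre>Bicycle</nombre><valor>250</valor>
<data><tax name="iva" value="50.30"></tax></data></item>
<item><nombre>Skateboard</nombre><valor>120</valor>
<data><tax name="iva" value="25.0"></tax></data></item>
</items>
</tienda>]]>
</detalle>
</data>"""

outer = ET.fromstring(outer_xml)
# The CDATA payload is plain text as far as the outer parser is concerned.
inner_text = outer.findtext("detalle")
# Strip surrounding whitespace: an XML declaration must come first in the input.
inner = ET.fromstring(inner_text.strip())

items = []
for item in inner.iter("item"):
    items.append({
        "nombre": item.findtext("nombre"),
        "valor": item.findtext("valor"),
        "iva_tax": item.find("data/tax[@name='iva']").get("value"),
    })

for number, entry in enumerate(items, start=1):
    print("Item %d: %s, value $%s, iva_tax: %s"
          % (number, entry["nombre"], entry["valor"], entry["iva_tax"]))
```

Without the `strip()` call, the newline that precedes the CDATA section would come before the inner XML declaration and the second parse would fail.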

read the text of a file between 2 words in python

I am trying to open and read an .xml file and extract the fragment that lies between two tags (the opening and closing profile tags, both included). The fragment is located by a keyword that I supply, and only that fragment should be written to a new .xml file.
My current Python script opens and reads the source .xml file, searches the text for the keyword, and writes the complete lines that contain it to a new .xml file, as follows:
keyword = 'Georgia'
occurrences = []
with open('test_input.xml') as lines:
    for line in lines:
        if keyword in line:
            occurrences.append(line)
archi1 = open("test_output.xml", "w")
archi1.write(''.join(occurrences))
archi1.close()
The result I get is a "test_output.xml" file that contains the following:
<id>Georgia-1</id>
<profile>Georgia-p1</profile>
<id>Georgia-2</id>
<profile>Georgia-p2</profile>
The problem is that I need not only the complete lines that contain the keyword (in this case 'Georgia') but the entire fragment around them, delimited by the opening and closing 'profile' tags. That is, I need the following result:
<profile>
<id>Georgia-1</id>
<activation>
<activeByDefault>false</activeByDefault>
</activation>
<properties>
<profile>Georgia-p1</profile>
<showtitle>Georgia_s1</showtitle>
<ip>000.000.0.3</ip>
<port>00003</port>
<persistencePort>00033</persistencePort>
<defaultLocale>en_GB</defaultLocale>
<webstart.server.name>host_3</webstart.server.name>
<codebaseProtocolServer>T3</codebaseProtocolServer>
</properties>
</profile>
<profile>
<id>Georgia-2</id>
<activation>
<activeByDefault>false</activeByDefault>
</activation>
<properties>
<profile>Georgia-p2</profile>
<showtitle>Georgia_s2</showtitle>
<ip>000.000.0.4</ip>
<port>00004</port>
<persistencePort>00044</persistencePort>
<defaultLocale>en_GB</defaultLocale>
<webstart.server.name>host_4</webstart.server.name>
<codebaseProtocolServer>T4</codebaseProtocolServer>
</properties>
</profile>
The full source .xml I am using is as follows:
<project>
<profile>
<id>Azerbaiyan-1</id>
<activation>
<activeByDefault>false</activeByDefault>
</activation>
<properties>
<profile>Azerbaiyan-p1</profile>
<showtitle>Azerbaiyan_s1</showtitle>
<ip>000.000.0.1</ip>
<port>00001</port>
<persistencePort>00011</persistencePort>
<defaultLocale>en_GB</defaultLocale>
<webstart.server.name>host_1</webstart.server.name>
<codebaseProtocolServer>T1</codebaseProtocolServer>
</properties>
</profile>
<profile>
<id>Azerbaiyan-2</id>
<activation>
<activeByDefault>false</activeByDefault>
</activation>
<properties>
<profile>Azerbaiyan-p2</profile>
<showtitle>Azerbaiyan_s2</showtitle>
<ip>000.000.0.2</ip>
<port>00002</port>
<persistencePort>00022</persistencePort>
<defaultLocale>en_GB</defaultLocale>
<webstart.server.name>host_2</webstart.server.name>
<codebaseProtocolServer>T2</codebaseProtocolServer>
</properties>
</profile>
<profile>
<id>Georgia-1</id>
<activation>
<activeByDefault>false</activeByDefault>
</activation>
<properties>
<profile>Georgia-p1</profile>
<showtitle>Georgia_s1</showtitle>
<ip>000.000.0.3</ip>
<port>00003</port>
<persistencePort>00033</persistencePort>
<defaultLocale>en_GB</defaultLocale>
<webstart.server.name>host_3</webstart.server.name>
<codebaseProtocolServer>T3</codebaseProtocolServer>
</properties>
</profile>
<profile>
<id>Georgia-2</id>
<activation>
<activeByDefault>false</activeByDefault>
</activation>
<properties>
<profile>Georgia-p2</profile>
<showtitle>Georgia_s2</showtitle>
<ip>000.000.0.4</ip>
<port>00004</port>
<persistencePort>00044</persistencePort>
<defaultLocale>en_GB</defaultLocale>
<webstart.server.name>host_4</webstart.server.name>
<codebaseProtocolServer>T4</codebaseProtocolServer>
</properties>
</profile>
<profile>
<id>USA-1</id>
<activation>
<activeByDefault>false</activeByDefault>
</activation>
<properties>
<profile>USA-p1</profile>
<showtitle>USA1_s1</showtitle>
<ip>000.000.0.5</ip>
<port>00005</port>
<persistencePort>00055</persistencePort>
<defaultLocale>en_GB</defaultLocale>
<webstart.server.name>host_5</webstart.server.name>
<codebaseProtocolServer>T5</codebaseProtocolServer>
</properties>
</profile>
<profile>
<id>USA-2</id>
<activation>
<activeByDefault>false</activeByDefault>
</activation>
<properties>
<profile>USA-p2</profile>
<showtitle>USA1_s2</showtitle>
<ip>000.000.0.6</ip>
<port>00006</port>
<persistencePort>00066</persistencePort>
<defaultLocale>en_GB</defaultLocale>
<webstart.server.name>host_6</webstart.server.name>
<codebaseProtocolServer>T6</codebaseProtocolServer>
</properties>
</profile>
</project>
Parse the input as XML and capture the profile elements that have an id child element whose text value contains the string "Georgia".
The following program uses ElementTree from the standard library and outputs the wanted result:
import xml.etree.ElementTree as ET
tree = ET.parse("input.xml")
# Iterate over the 'profile' elements that are direct children of the root
for profile in tree.findall("profile"):
    profile_id = profile.find("id").text
    if "Georgia" in profile_id:
        print(ET.tostring(profile).decode())
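To write the matching fragments to a new file instead of printing them, the same loop can collect the elements first. A sketch, assuming the source file has a single root element wrapping the `profile` elements; the shortened sample input, the `matching_profiles` helper, and the temporary file locations are placeholders for illustration:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

SAMPLE = """<project>
<profile><id>Azerbaiyan-1</id><properties><profile>Azerbaiyan-p1</profile></properties></profile>
<profile><id>Georgia-1</id><properties><profile>Georgia-p1</profile></properties></profile>
<profile><id>Georgia-2</id><properties><profile>Georgia-p2</profile></properties></profile>
</project>"""

def matching_profiles(root, keyword):
    """Return the top-level profile elements whose <id> text contains keyword."""
    return [p for p in root.findall("profile")
            if keyword in (p.findtext("id") or "")]

# Create a throwaway input file so the example is self-contained.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "test_input.xml")
out_path = os.path.join(workdir, "test_output.xml")
with open(in_path, "w") as f:
    f.write(SAMPLE)

root = ET.parse(in_path).getroot()
matches = matching_profiles(root, "Georgia")

# Serialize each matching element, nested tags included, into the output file.
with open(out_path, "w") as out:
    for profile in matches:
        out.write(ET.tostring(profile, encoding="unicode"))

print(len(matches), "profiles written to", out_path)
```

The output file holds a sequence of `profile` fragments without a common root, which matches the result shown in the question but is not itself a well-formed XML document.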

Python xml - Add tags in the same line with parent tag

I am trying to create a MyXml.xml file by parsing another file, Source.xml. The current structure of MyXml is:
<tag atrib="true" atrib2="false" atrib3="1" atrib4="7">
<tag1 txt="CONTENT">
<tag2 name="Category">1</Field>
<tag3 name="Wallet"> </Field>
<tag4 name="Increase">1</Field>
<tag5 name="Text">
<div />
</tag5>
</tag1>
</tag>
But my output should look like this (the children of tag5 should be on the same line):
<tag atrib="true" atrib2="false" atrib3="1" atrib4="7">
<tag1 txt="CONTENT">
<tag2 name="Category">1</Field>
<tag3 name="Wallet"> </Field>
<tag4 name="Increase">1</Field>
<tag5 name="Text"><div><h2>SomeTxt</h2></div></tag5>
</tag1>
</tag>
current code is this:
MDroot = minidom.Document()
tag = MDroot.createElement('tag')
MDroot.appendChild(tag)
# Other tags
root = ET.Element('tag')
tag1 = ET.SubElement(root, 'tag1', txt= 'CONTENT')
ET.SubElement(tag1, "tag2", name='Category').text = "Heading"
ET.SubElement(tag1, "tag3", name='Wallet').text = ' '
ET.SubElement(tag1, "tag4", name='Increase').text = 1
tag5 = ET.SubElement(tag1, "tag5 ", name='Text')
div = ET.SubElement(tag5 , "div",)
root1 = ET.Element(tag5)
root1.insert(1, div)
But this code always creates the normal XML structure, with each child on its own line. Any idea how I can put those on the same line?
Thanks!
In XML, line breaks and indentation between elements are generally NOT important!
<tag5 name="Text"><div><h2>SomeTxt</h2></div></tag5>
has the same meaning as:
<tag5 name="Text">
<div>
<h2>SomeTxt</h2>
</div>
</tag5>
So just ignore the line breaks.
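If the single-line form is still wanted, note that ElementTree adds no whitespace of its own when serializing; line breaks only appear when a pretty-printer (such as minidom's `toprettyxml`) is applied afterwards. A small sketch, reusing the tag names from the question:

```python
import xml.etree.ElementTree as ET

tag5 = ET.Element("tag5", name="Text")
div = ET.SubElement(tag5, "div")
ET.SubElement(div, "h2").text = "SomeTxt"

# No indentation or newlines are added unless a pretty-printer is applied.
serialized = ET.tostring(tag5, encoding="unicode")
print(serialized)  # <tag5 name="Text"><div><h2>SomeTxt</h2></div></tag5>
```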

Large XML parsing in Python

I am a novice in python and have the following task on hand.
I have a large xml file like the one below:
<Configuration>
<Parameters>
<Component Name='ABC'>
<Group Name='DEF'>
<Parameter Name='GHI'>
<Description>
Some Text
</Description>
<Type>Integer</Type>
<Restriction>
<Level>5</Level>
</Restriction>
<Value>
<Item Value='5'/>
</Value>
</Parameter>
<Parameter Name='JKL'>
<Description>
Some Text
</Description>
<Type>Integer</Type>
<Restriction>
<Level>5</Level>
</Restriction>
<Value>
<Item Value='5'/>
</Value>
</Parameter>
</Group>
<Group Name='MNO'>
<Parameter Name='PQR'>
<Description>
Some Text
</Description>
<Type>Integer</Type>
<Restriction>
<Level>5</Level>
</Restriction>
<Value>
<Item Value='5'/>
</Value>
</Parameter>
<Parameter Name='TUV'>
<Description>
Some Text
</Description>
<Type>Integer</Type>
<Restriction>
<Level>5</Level>
</Restriction>
<Value>
<Item Value='5'/>
</Value>
</Parameter>
</Group>
</Component>
</Parameters>
</Configuration>
In this XML file I have to go through component "ABC" to group "MNO" and then to parameter "TUV", and under it change the Item value to 10.
I have tried using xml.etree.cElementTree, but to no avail. lxml is not available on the server, which runs a very old version of Python, and I have no permission to upgrade it.
I have been using the following code to parse and edit a relatively small xml:
def fnXMLModification(ArgStr):
    argList = ArgStr.split()
    strXMLPath = argList[0]
    if not os.path.exists(strXMLPath):
        fnlogs("XML File: " + strXMLPath + " does not exist.\n")
        return False
    try:
        import xml.etree.cElementTree as ET
    except ImportError:
        import xml.etree.ElementTree as ET
    f=open(strXMLPath, 'rt')
    tree = ET.parse(f)
    ValueSetFlag = False
    AttrSetFlag = False
    for strXPath in argList[1:]:
        strXPathList = strXPath.split("[")
        sxPath = strXPathList[0]
        if len(strXPathList)==3:
            # both present
            AttrSetFlag = True
            ValueSetFlag = True
            valToBeSet = strXPathList[1].strip("]")
            sAttr = strXPathList[2].strip("]")
            attrList = sAttr.split(",")
        elif len(strXPathList) == 2:
            #anyone present
            if "=" in strXPathList[1]:
                AttrSetFlag = True
                sAttr = strXPathList[1].strip("]")
                attrList = sAttr.split(",")
            else:
                ValueSetFlag = True
                valToBeSet = strXPathList[1].strip("]")
        node = tree.find(sxPath)
        if AttrSetFlag:
            for att in attrList:
                slist = att.split("=")
                node.set(slist[0].strip(),slist[1].strip())
        if ValueSetFlag:
            node.text = valToBeSet
    tree.write(strXMLPath)
    fnlogs("XML File: " + strXMLPath + " has been modified successfully.\n")
    return True
Using this function I am not able to traverse the current XML, as it has many nested child elements and subgroups.
Import statement:
import xml.etree.cElementTree as ET  # on Python 3.9+, import xml.etree.ElementTree instead
Parse the content with the fromstring method:
root = ET.fromstring(data)
Iterate as required to reach the target Item tag, then change its Value attribute:
for component_tag in root.iter("Component"):
    if "Name" in component_tag.attrib and component_tag.attrib['Name']=='ABC':
        for group_tag in component_tag.iter("Group"):
            if "Name" in group_tag.attrib and group_tag.attrib['Name']=='MNO':
                for item_tag in group_tag.findall("Parameter[@Name='TUV']/Value/Item"):
                    item_tag.attrib["Value"] = "10"
Alternatively, a single XPath expression reaches the target Item tag directly:
for item_tag in root.findall("Parameters/Component[@Name='ABC']/Group[@Name='MNO']/Parameter[@Name='TUV']/Value/Item"):
    item_tag.attrib["Value"] = "10"
Use the tostring method to get the modified content:
data = ET.tostring(root)
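The XPath variant can be wrapped in a small function and checked against a reduced document. This is only a sketch: `set_item_value` is a hypothetical helper, the sample is a shortened version of the question's file, and the attribute predicates (`[@Name='...']`) need ElementTree 1.3, which to my knowledge ships with Python 2.7 and later:

```python
import xml.etree.ElementTree as ET

data = """<Configuration>
  <Parameters>
    <Component Name='ABC'>
      <Group Name='DEF'>
        <Parameter Name='TUV'><Value><Item Value='5'/></Value></Parameter>
      </Group>
      <Group Name='MNO'>
        <Parameter Name='PQR'><Value><Item Value='5'/></Value></Parameter>
        <Parameter Name='TUV'><Value><Item Value='5'/></Value></Parameter>
      </Group>
    </Component>
  </Parameters>
</Configuration>"""

def set_item_value(root, component, group, parameter, new_value):
    """Set Value on every matching Item; return how many were changed."""
    path = ("Parameters/Component[@Name='%s']/Group[@Name='%s']"
            "/Parameter[@Name='%s']/Value/Item" % (component, group, parameter))
    items = root.findall(path)
    for item in items:
        item.set("Value", new_value)
    return len(items)

root = ET.fromstring(data)
changed = set_item_value(root, "ABC", "MNO", "TUV", "10")
print(changed)
```

Returning the match count makes it easy to notice a path that silently matched nothing, which is the usual failure mode with hand-written paths.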

Modify XML file using ElementTree

I am trying to do the following with Python:
get the "price" value and change it
find "price_qty" and insert a new line with a new tier and a different price based on the "price".
So far I could only find the price, change it, and insert a line in roughly the correct place, but I can't find a way to create an "item" element there with "qty" and "price" attributes. Nothing has worked so far.
This is my original XML:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<body start="20.04.2014 10:02:60">
<pricelist>
<item>
<name>LEO - red pen</name>
<price>31,4</price>
<price_snc>0</price_snc>
<price_ao>0</price_ao>
<price_qty>
<item qty="150" price="28.20" />
<item qty="750" price="26.80" />
<item qty="1500" price="25.60" />
</price_qty>
<stock>50</stock>
</item>
</pricelist>
</body>
the new xml should look this way:
<pricelist>
<item>
<name>LEO - red pen</name>
<price>31,4</price>
<price_snc>0</price_snc>
<price_ao>0</price_ao>
<price_qty>
<item qty="10" price="31.20" /> **-this is the new line**
<item qty="150" price="28.20" />
<item qty="750" price="26.80" />
<item qty="1500" price="25.60" />
</price_qty>
<stock>50</stock>
</item>
</pricelist>
my code so far:
import xml.etree.cElementTree as ET
from xml.etree.ElementTree import Element, SubElement
tree = ET.ElementTree(file='pricelist.xml')
root = tree.getroot()
pos=0
# price - raise the main price and insert new tier
for elem in tree.iterfind('pricelist/item/price'):
    price = elem.text
    newprice = (float(price.replace(",", ".")))*1.2
    newtier = "NEW TIER"
    SubElement(root[0][pos][5], newtier)
    pos+=1
tree.write('pricelist.xml', "UTF-8")
result:
...
<price_qty>
<item price="28.20" qty="150" />
<item price="26.80" qty="750" />
<item price="25.60" qty="1500" />
<NEW TIER /></price_qty>
thank you for any help.
Don't use fixed indexing. You already have the item element, so why not use it?
import xml.etree.ElementTree as ET
tree = ET.ElementTree(file='pricelist.xml')
root = tree.getroot()
for elem in tree.iterfind('pricelist/item'):
    price = elem.findtext('price')
    newprice = float(price.replace(",", ".")) * 1.2
    newtier = ET.Element("item", qty="10", price="%.2f" % newprice)
    elem.find('price_qty').insert(0, newtier)
tree.write('pricelist.xml', "UTF-8")
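The same idea can be tried without touching a file by parsing a string. A sketch on a shortened copy of the price list; `add_entry_tier` and its `qty` and `factor` parameters are hypothetical names chosen for illustration, with the 1.2 factor taken from the code above:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<body>
<pricelist>
<item>
<name>LEO - red pen</name>
<price>31,4</price>
<price_qty>
<item qty="150" price="28.20" />
<item qty="750" price="26.80" />
</price_qty>
<stock>50</stock>
</item>
</pricelist>
</body>"""

def add_entry_tier(root, qty="10", factor=1.2):
    """Insert a new first tier into each price_qty, priced at price * factor."""
    for item in root.iterfind("pricelist/item"):
        # Prices use a decimal comma, so normalize before converting.
        price = float(item.findtext("price").replace(",", "."))
        tier = ET.Element("item", qty=qty, price="%.2f" % (price * factor))
        item.find("price_qty").insert(0, tier)

root = ET.fromstring(SAMPLE)
add_entry_tier(root)
first_tier = root.find("pricelist/item/price_qty/item")
print(first_tier.attrib)
```

The path `pricelist/item` only matches the product entries, so the nested tier elements that are also called `item` are not visited by the loop.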
