from lxml import etree
str = """<ROOT>
<ITEM>
<REVENUE_YEAR>2554-02</REVENUE_YEAR>
<REGION>Central</REGION>
</ITEM>
<ITEM>
<REVENUE_YEAR>2552-02</REVENUE_YEAR>
<REGION>Central</REGION>
</ITEM>
<ITEM>
<REVENUE_YEAR>2552-03</REVENUE_YEAR>
<REGION>Central</REGION>
</ITEM>
</ROOT>"""
xml = etree.fromstring(str)
xpath_str = ".//ITEM[starts-with(REVENUE_YEAR,'2554')]"
result = xml.find(xpath_str)
print(result)
Hi, the code above raises SyntaxError: invalid predicate. Does this mean lxml does not support starts-with? Is there another way to locate the ITEM whose REVENUE_YEAR starts with 2554 (here 2554-02) by XPath with lxml? Thanks!
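lxml does support starts-with, but only through the xpath() method. find() and findall() accept only the limited ElementPath syntax, which has no XPath functions, hence the invalid predicate error. Use xpath() instead: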
from lxml import etree
str = """<ROOT>
<ITEM>
<REVENUE_YEAR>2554-02</REVENUE_YEAR>
<REGION>Central</REGION>
</ITEM>
<ITEM>
<REVENUE_YEAR>2552-02</REVENUE_YEAR>
<REGION>Central</REGION>
</ITEM>
<ITEM>
<REVENUE_YEAR>2552-03</REVENUE_YEAR>
<REGION>Central</REGION>
</ITEM>
</ROOT>"""
xml = etree.fromstring(str)
xpath_str = ".//ITEM[starts-with(REVENUE_YEAR,'2554')]"
result = xml.xpath(xpath_str)
print(result)  # a list containing only the one matching ITEM element
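If the filter is easier to express in Python, the same lookup can also be done without any XPath function. The sketch below is only an illustration: it uses the standard-library xml.etree.ElementTree instead of lxml (lxml elements offer the same iter() and findtext() methods), and the helper name items_for_year is made up for this example.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """<ROOT>
<ITEM><REVENUE_YEAR>2554-02</REVENUE_YEAR><REGION>Central</REGION></ITEM>
<ITEM><REVENUE_YEAR>2552-02</REVENUE_YEAR><REGION>Central</REGION></ITEM>
<ITEM><REVENUE_YEAR>2552-03</REVENUE_YEAR><REGION>Central</REGION></ITEM>
</ROOT>"""


def items_for_year(root, year_prefix):
    """Return every ITEM whose REVENUE_YEAR text starts with year_prefix."""
    return [
        item
        for item in root.iter("ITEM")
        # findtext() returns None when REVENUE_YEAR is missing, hence the "or"
        if (item.findtext("REVENUE_YEAR") or "").startswith(year_prefix)
    ]


root = ET.fromstring(XML_TEXT)
matches = items_for_year(root, "2554")
print([m.findtext("REVENUE_YEAR") for m in matches])
```

This trades the compact XPath predicate for a plain str.startswith() check, which may be easier to extend when the condition gets more involved.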
I have an XML document in Python and need to get the elements under the "items" tag as an iterable list, for example like this:
Item 1: Bicycle, value $250, iva_tax: 50.30
Item 2: Skateboard, value $120, iva_tax: 25.0
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<data>
<info>Listado de items</info>
<detalle>
<![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<tienda id="tiendaProd" version="1.1.0">
<items>
<item>
<nombre>Bicycle</nombre>
<valor>250</valor>
<data>
<tax name="iva" value="50.30"></tax>
</data>
</item>
<item>
<nombre>Skateboard</nombre>
<valor>120</valor>
<data>
<tax name="iva" value="25.0"></tax>
</data>
</item>
<item>
<nombre>Motorcycle</nombre>
<valor>900</valor>
<data>
<tax name="iva" value="120.50"></tax>
</data>
</item>
</items>
</tienda>]]>
</detalle>
</data>
I am working with xml.etree.ElementTree, for example:
import xml.etree.ElementTree as ET
xml = ET.fromstring(stringBase64)
ite = xml.find('.//detalle').text
tixml = ET.fromstring(ite)
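You can use BeautifulSoup4 (BS4) to do this. Because the items sit inside a CDATA section, the text of <detalle> has to be parsed as a second document:
from bs4 import BeautifulSoup
# Read the XML file as a single string
with open("example.xml", "r") as f:
    contents = f.read()
# Parse the outer document and pull the embedded XML out of <detalle>
outer = BeautifulSoup(contents, 'xml')
inner_xml = outer.find("detalle").text.strip()
# Create the Soup object for the embedded document
soup = BeautifulSoup(inner_xml, 'xml')
# Find all the item tags
item_tags = soup.find_all("item")  # returns everything in the <item> tags
# Find the nombre and valor tags within each item
results = {}
for item in item_tags:
    num = item.find("nombre").text
    val = item.find("valor").text
    results[str(num)] = val
# Prints a dictionary with key/value pairs from the XML
print(results)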
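For completeness, here is a minimal sketch of the same two-step parse with the standard-library ElementTree that the question already uses, this time also picking up the iva tax. It assumes the CDATA text only needs its surrounding whitespace stripped before it can be parsed; the sample is a shortened copy of the question's document, and parse_items is a name invented for this example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the document from the question (two items only).
OUTER_XML = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<data>
<info>Listado de items</info>
<detalle>
<![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<tienda id="tiendaProd" version="1.1.0">
<items>
<item><nombre>Bicycle</nombre><valor>250</valor>
<data><tax name="iva" value="50.30"></tax></data></item>
<item><nombre>Skateboard</nombre><valor>120</valor>
<data><tax name="iva" value="25.0"></tax></data></item>
</items>
</tienda>]]>
</detalle>
</data>"""


def parse_items(outer_xml):
    """Return one dict per <item> found in the XML embedded in <detalle>."""
    outer = ET.fromstring(outer_xml.encode("utf-8"))
    # The CDATA text starts with a newline, so strip it before parsing;
    # an XML declaration must be the very first thing in a document.
    inner_text = outer.findtext("detalle").strip()
    inner = ET.fromstring(inner_text.encode("utf-8"))
    items = []
    for item in inner.iter("item"):
        tax = item.find("data/tax[@name='iva']")
        items.append({
            "nombre": item.findtext("nombre"),
            "valor": item.findtext("valor"),
            "iva_tax": tax.get("value") if tax is not None else None,
        })
    return items


items = parse_items(OUTER_XML)
for number, item in enumerate(items, start=1):
    print(f"Item {number}: {item['nombre']}, value ${item['valor']}, iva_tax: {item['iva_tax']}")
```

The values are kept as strings exactly as they appear in the XML; convert them with float() or Decimal if arithmetic is needed.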
What would be the best way to construct a DF from the below nested XML data?
Each "properties" element has three "property" elements nested containing the "name" and "value" of our data. I tried doing two for loops, pandas read_xml option, and a few other pieces but haven't quite gotten the nested logic figured out. My current approach below is closer, but does not keep the names and values together.
Using Python 3.7+ in Jupyter on Windows.
Sample XML Data:
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope
xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing"
xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd"
xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd">
<env:Header
xmlns:env="http://www.w3.org/2003/05/soap-envelope">
<wsa:Action>RetrieveResponse</wsa:Action>
<wsa:MessageID>urn:uuid:1234</wsa:MessageID>
<wsa:RelatesTo>urn:uuid:1234</wsa:RelatesTo>
<wsa:To>http://schemas.xmlsoap.org/ws/2004/08/addressing/role/anonymous</wsa:To>
<wsse:Security>
<wsu:Timestamp wsu:Id="Timestamp-45333">
<wsu:Created>2022-11-07T17:02:44Z</wsu:Created>
<wsu:Expires>2022-11-07T17:07:44Z</wsu:Expires>
</wsu:Timestamp>
</wsse:Security>
</env:Header>
<soap:Body>
<RetrieveResponseMsg
xmlns="http://exacttarget.com/wsdl/partnerAPI">
<OverallStatus>MoreDataAvailable</OverallStatus>
<RequestID>asdfds455</RequestID>
<Results xsi:type="DataExtensionObject">
<PartnerKey xsi:nil="true" />
<ObjectID xsi:nil="true" />
<Type>DataExtensionObject</Type>
<Properties>
<Property>
<Name>FIELD_NAME</Name>
<Value>asdfdfd12</Value>
</Property>
<Property>
<Name>FIELD_NAME_2</Name>
<Value>asdf</Value>
</Property>
<Property>
<Name>FIELD_NAME_3</Name>
<Value>fasdsa</Value>
</Property>
</Properties>
</Results>
<Results xsi:type="DataExtensionObject">
<PartnerKey xsi:nil="true" />
<ObjectID xsi:nil="true" />
<Type>DataExtensionObject</Type>
<Properties>
<Property>
<Name>FIELD_NAME</Name>
<Value>fasd123</Value>
</Property>
<Property>
<Name>FIELD_NAME_2</Name>
<Value>asdfd</Value>
</Property>
<Property>
<Name>FIELD_NAME_3</Name>
<Value>a0A4f</Value>
</Property>
</Properties>
</Results>
<Results xsi:type="DataExtensionObject">
<PartnerKey xsi:nil="true" />
<ObjectID xsi:nil="true" />
<Type>DataExtensionObject</Type>
<Properties>
<Property>
<Name>FIELD_NAME</Name>
<Value>0034P00</Value>
</Property>
<Property>
<Name>FIELD_NAME_2</Name>
<Value>fasdfs</Value>
</Property>
<Property>
<Name>FIELD_NAME_3</Name>
<Value>a0fasd</Value>
</Property>
</Properties>
</Results>
</RetrieveResponseMsg>
</soap:Body>
</soap:Envelope>
What I've Attempted So Far:
data_output = []
for el in soup_de.find_all('Property'):
    dict_ = {el.find('Name').text: el.find('Value').text}
    data_output.append(dict_)
print(len(data_output))
# print(data_output)
testing_de_df = pd.DataFrame(data_output)
display(testing_de_df.info())
display(testing_de_df.head(25))
Desired Output:
details = {'FIELD_NAME': ['asdfdfd12', 'fasd123', '0034P00'],
'FIELD_NAME_2': ['asdf', 'asdfd', 'fasdfs'],
'FIELD_NAME_3': ['fasdsa', 'a0A4f', 'a0fasd']}
desired_output = pd.DataFrame(details)
print(desired_output)
Since each <Property> has only flat children (<Name> and <Value>), simply call pandas.read_xml with an XPath that selects those nodes, mapping the default namespace (http://exacttarget.com/wsdl/partnerAPI) to a prefix:
import pandas as pd
property_df = pd.read_xml(
    "Input.xml",
    xpath=".//rrm:Property",
    namespaces={"rrm": "http://exacttarget.com/wsdl/partnerAPI"},
)
print(property_df)
#            Name      Value
# 0    FIELD_NAME  asdfdfd12
# 1  FIELD_NAME_2       asdf
# 2  FIELD_NAME_3     fasdsa
# 3    FIELD_NAME    fasd123
# 4  FIELD_NAME_2      asdfd
# 5  FIELD_NAME_3      a0A4f
# 6    FIELD_NAME    0034P00
# 7  FIELD_NAME_2     fasdfs
# 8  FIELD_NAME_3     a0fasd
To delineate by property, consider creating a property group number with groupby().cumcount() and reshaping data wide with pivot_table:
property_wide_df = (
    property_df
    .assign(property_no=lambda x: x.groupby("Name").cumcount().add(1))
    .pivot_table(index="property_no", columns="Name", values="Value", aggfunc="sum")
)
print(property_wide_df)
# Name        FIELD_NAME FIELD_NAME_2 FIELD_NAME_3
# property_no
# 1            asdfdfd12         asdf       fasdsa
# 2              fasd123        asdfd        a0A4f
# 3              0034P00       fasdfs       a0fasd
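An alternative that keeps each record's names and values together from the start is to build one dict per <Results> element and hand the list to pandas afterwards. The following is a rough sketch under a few assumptions: it uses the standard-library ElementTree, the sample is a trimmed-down version of the response body, and rows_from_results and the et prefix are names chosen for this example.

```python
import xml.etree.ElementTree as ET

# Prefix chosen for this example; only the namespace URI comes from the XML.
NS = {"et": "http://exacttarget.com/wsdl/partnerAPI"}

# Trimmed-down sample: two <Results> with two properties each.
SAMPLE = """<RetrieveResponseMsg xmlns="http://exacttarget.com/wsdl/partnerAPI">
<Results><Properties>
<Property><Name>FIELD_NAME</Name><Value>asdfdfd12</Value></Property>
<Property><Name>FIELD_NAME_2</Name><Value>asdf</Value></Property>
</Properties></Results>
<Results><Properties>
<Property><Name>FIELD_NAME</Name><Value>fasd123</Value></Property>
<Property><Name>FIELD_NAME_2</Name><Value>asdfd</Value></Property>
</Properties></Results>
</RetrieveResponseMsg>"""


def rows_from_results(xml_text):
    """Return one {Name: Value} dict per <Results> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for result in root.findall(".//et:Results", NS):
        row = {}
        for prop in result.findall(".//et:Property", NS):
            name = prop.findtext("et:Name", namespaces=NS)
            row[name] = prop.findtext("et:Value", namespaces=NS)
        rows.append(row)
    return rows


rows = rows_from_results(SAMPLE)
print(rows)
```

Passing rows to pd.DataFrame(rows) should then give one row per <Results> and one column per property name, without the pivot step.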
I thought this would be easy, but for some reason I am not able to append a dict to the list. Each iteration overwrites the previous data.
for child in data.find_all("item"):
    if "Traffic" in child.find("name").string:
        self.output["Name"] = child.find("name").string
        self.output["LastValue"] = child.find("lastvalue").string
        self.results.append(self.output)
print(self.results)
Here is the input data:
data = """
<item>
<name>In</name>
<lastvalue>5,000 MByte</lastvalue>
</item>
<item>
<name>Out</name>
<lastvalue>155 MByte</lastvalue>
</item>
<item>
<name>Total</name>
<lastvalue>5,000 MByte</lastvalue>
</item>"""
I tried running the code, but it always prints the last item, because it is overwriting the previous data.
output = [{"Name": "In", "LastValue": "5,000 MByte",
"Name": "Out", "LastValue": "5,000 MByte",
"Name": "Total", "LastValue": "5,000 MByte"}]
You can use the zip() function to pair the values from <name> and <lastvalue>, then build the dict with a dict comprehension:
data = """<item>
<name>In</name>
<lastvalue>5,000 MByte</lastvalue>
</item>
<item>
<name>Out</name>
<lastvalue>155 MByte</lastvalue>
</item>
<item>
<name>Total</name>
<lastvalue>5,000 MByte</lastvalue>
</item>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
results = []
results.append( {name.text: lastvalue.text for name, lastvalue in zip(soup.select('name'), soup.select('lastvalue'))} )
print(results)
Prints:
[{'In': '5,000 MByte', 'Out': '155 MByte', 'Total': '5,000 MByte'}]
EDIT: If there are more <lastvalue>:
data = """<item>
<name>In</name>
<lastvalue>5,000 MByte</lastvalue>
</item>
<item>
<name>Out</name>
<lastvalue>155 MByte</lastvalue>
<lastvalue>10,100 MByte</lastvalue>
</item>
<item>
<name>Total</name>
<lastvalue>5,000 MByte</lastvalue>
</item>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
results = []
for name in soup.select('name'):
    results.append(
        {name.text: [lv.text for lv in name.find_next_siblings('lastvalue')]}
    )
print(results)
Prints:
[{'In': ['5,000 MByte']},
{'Out': ['155 MByte', '10,100 MByte']},
{'Total': ['5,000 MByte']}]
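As for why the question's loop overwrites earlier entries: self.output is a single dict that is mutated and appended again on every pass, so every entry in the list refers to the same object. A minimal sketch of the fix is to create a new dict inside the loop. The example below makes some assumptions for the sake of being runnable: it uses the standard-library ElementTree rather than BeautifulSoup, wraps the items in an <items> root element, leaves out the "Traffic" filter, and collect_items is a made-up name.

```python
import xml.etree.ElementTree as ET

# The question's data, wrapped in a root element so it is well-formed XML.
DATA = """<items>
<item><name>In</name><lastvalue>5,000 MByte</lastvalue></item>
<item><name>Out</name><lastvalue>155 MByte</lastvalue></item>
<item><name>Total</name><lastvalue>5,000 MByte</lastvalue></item>
</items>"""


def collect_items(xml_text):
    """Return a list with one separate dict per <item>."""
    results = []
    for item in ET.fromstring(xml_text).iter("item"):
        # A new dict on every pass; reusing one dict would make all
        # list entries point at the same, last-written object.
        output = {
            "Name": item.findtext("name"),
            "LastValue": item.findtext("lastvalue"),
        }
        results.append(output)
    return results


results = collect_items(DATA)
print(results)
```

Each list entry is now an independent dict, so later iterations no longer change earlier entries.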
I have the following xml:
<Item>
<Platform>itunes</Platform>
<PlatformID>102224185</PlatformID>
<Info>
<LanguageOfMetadata>EN</LanguageOfMetadata>
<Name>Commander In Chief</Name>
<Studio>abc</Studio>
</Info>
<Type>TVSeries</Type>
</Item>
What would be the quickest way to uppercase all the values? For example:
<Item>
<Platform>ITUNES</Platform>
<PlatformID>102224185</PlatformID>
<Info>
<LanguageOfMetadata>EN</LanguageOfMetadata>
<Name>COMMANDER IN CHIEF</Name>
<Studio>ABC</Studio>
</Info>
<Type>TVSERIES</Type>
</Item>
You can find all elements and call upper() on each element's text:
import lxml.etree as ET
data = """<Item>
<Platform>itunes</Platform>
<PlatformID>102224185</PlatformID>
<Info>
<LanguageOfMetadata>EN</LanguageOfMetadata>
<Name>Commander In Chief</Name>
<Studio>abc</Studio>
</Info>
<Type>TVSeries</Type>
</Item>
"""
root = ET.fromstring(data)
for elm in root.xpath("//*"):  # //* finds all elements recursively
    if elm.text:  # elements without text have text set to None
        elm.text = elm.text.upper()
print(ET.tostring(root).decode())
Prints:
<Item>
<Platform>ITUNES</Platform>
<PlatformID>102224185</PlatformID>
<Info>
<LanguageOfMetadata>EN</LanguageOfMetadata>
<Name>COMMANDER IN CHIEF</Name>
<Studio>ABC</Studio>
</Info>
<Type>TVSERIES</Type>
</Item>
However, this does not cover an element's tail, i.e. text that follows its closing tag, such as <Studio>ABC</Studio>test instead of just <Studio>ABC</Studio>. To handle that as well, add the following inside the for loop:
    elm.tail = elm.tail.upper() if elm.tail else None
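Putting text and tail handling together, a self-contained sketch could look like the one below. It is an illustration under assumptions, not the answer's exact code: it uses the standard-library ElementTree (iter() instead of xpath("//*")), the sample adds a made-up tail "test" after <Studio>, and upper_values is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the "test" tail after </Studio> is added for illustration.
DATA = """<Item>
<Platform>itunes</Platform>
<Info><Name>Commander In Chief</Name><Studio>abc</Studio>test</Info>
<Type>TVSeries</Type>
</Item>"""


def upper_values(root):
    """Uppercase text and tail of every element, leaving tag names alone."""
    for elm in root.iter():  # iter() walks the root and all descendants
        if elm.text:
            elm.text = elm.text.upper()
        if elm.tail:
            elm.tail = elm.tail.upper()
    return root


root = upper_values(ET.fromstring(DATA))
print(ET.tostring(root, encoding="unicode"))
```

Attribute values are not touched here; they would need a separate pass over elm.attrib if they should be uppercased too.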
Here is a way to uppercase everything, though note that this will uppercase the tags as well:
node = etree.fromstring(etree.tostring(item).upper())
print(etree.tostring(node, pretty_print=True).decode())
<ITEM>
<PLATFORM>ITUNES</PLATFORM>
<PLATFORMID>102224185</PLATFORMID>
<INFO>
<LANGUAGEOFMETADATA>EN</LANGUAGEOFMETADATA>
<NAME>COMMANDER IN CHIEF</NAME>
<STUDIO>ABC</STUDIO>
</INFO>
<TYPE>TVSERIES</TYPE>
</ITEM>
Assuming you can parse the XML file, you can just rewrite the contents using the .upper() method that is built into Python strings. You can call it like this:
"mystring".upper()
I am trying to do the following with Python:
get the "price" value and change it
find "price_qty" and insert a new line with a new tier and a different price based on the "price".
So far I could only find the price, change it, and insert a line in roughly the correct place, but I can't find a way to give the new "item" its "qty" and "price" attributes. Nothing has worked so far...
this is my original xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<body start="20.04.2014 10:02:60">
<pricelist>
<item>
<name>LEO - red pen</name>
<price>31,4</price>
<price_snc>0</price_snc>
<price_ao>0</price_ao>
<price_qty>
<item qty="150" price="28.20" />
<item qty="750" price="26.80" />
<item qty="1500" price="25.60" />
</price_qty>
<stock>50</stock>
</item>
</pricelist>
the new xml should look this way:
<pricelist>
<item>
<name>LEO - red pen</name>
<price>31,4</price>
<price_snc>0</price_snc>
<price_ao>0</price_ao>
<price_qty>
<item qty="10" price="31.20" /> **-this is the new line**
<item qty="150" price="28.20" />
<item qty="750" price="26.80" />
<item qty="1500" price="25.60" />
</price_qty>
<stock>50</stock>
</item>
</pricelist>
my code so far:
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
from xml.etree.ElementTree import Element, SubElement
tree = ET.ElementTree(file='pricelist.xml')
root = tree.getroot()
pos=0
# price - raise the main price and insert new tier
for elem in tree.iterfind('pricelist/item/price'):
    price = elem.text
    newprice = (float(price.replace(",", ".")))*1.2
    newtier = "NEW TIER"
    SubElement(root[0][pos][5], newtier)
    pos+=1
tree.write('pricelist.xml', "UTF-8")
result:
...
<price_qty>
<item price="28.20" qty="150" />
<item price="26.80" qty="750" />
<item price="25.60" qty="1500" />
<NEW TIER /></price_qty>
thank you for any help.
Don't use fixed indexing. You already have the item element, so why not use it?
import xml.etree.ElementTree as ET
tree = ET.ElementTree(file='pricelist.xml')
root = tree.getroot()
for elem in tree.iterfind('pricelist/item'):
    price = elem.findtext('price')
    newprice = float(price.replace(",", ".")) * 1.2
    # Build the new tier with its attributes and insert it as the first child of price_qty
    newtier = ET.Element("item", qty="10", price="%.2f" % newprice)
    elem.find('price_qty').insert(0, newtier)
tree.write('pricelist.xml', "UTF-8")
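To try the approach without a pricelist.xml on disk, the same steps can be run against an in-memory string. This is only a sketch under assumptions: the sample is a shortened copy of the question's file with a closing </body> added, and the helper add_first_tier with its qty and factor parameters is invented for this example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's price list, closed with </body>.
PRICELIST = """<body>
<pricelist>
<item>
<name>LEO - red pen</name>
<price>31,4</price>
<price_qty>
<item qty="150" price="28.20" />
<item qty="750" price="26.80" />
</price_qty>
<stock>50</stock>
</item>
</pricelist>
</body>"""


def add_first_tier(root, qty="10", factor=1.2):
    """Insert a new first tier into every price_qty, priced at price * factor."""
    # "pricelist/item" matches only the product items, not the nested tier items
    for item in root.iterfind("pricelist/item"):
        price = float(item.findtext("price").replace(",", "."))
        tier = ET.Element("item", qty=qty, price="%.2f" % (price * factor))
        item.find("price_qty").insert(0, tier)
    return root


root = add_first_tier(ET.fromstring(PRICELIST))
tiers = root.findall("pricelist/item/price_qty/item")
print([(t.get("qty"), t.get("price")) for t in tiers])
```

insert(0, ...) places the new tier before the existing ones, which is what the desired output in the question shows; append() would put it last instead.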