I'm receiving the below XML response from an API call and am looking to iterate through the "Results" and store all of the data points as a pandas dataframe.
I was able to grab my data points of interest by chaining .find() calls as shown below, but I don't know how to loop through all of the Results blocks within the body given the structure of the XML response.
I am using Python 3.7+ in Jupyter on Windows.
What I've Tried:
import pandas as pd
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET
soup = BeautifulSoup(soap_response.text, "xml")
# print(soup.prettify())
objectid_field = soup.find('Results').find('ObjectID').text
customerkey_field = soup.find('Results').find('CustomerKey').text
name_field = soup.find('Results').find('Name').text
issendable_field = soup.find('Results').find('IsSendable').text
sendablesubscribe_field = soup.find('Results').find('SendableSubscriberField').text
# for de in soup:
# de_name = soup.find('Results').find('Name').text
# print(de_name)
# test_df = pd.read_xml(soup,
# xpath="//Results",
# namespaces={""})
Sample XML Data Structure:
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope
xmlns:soap="http://www.w3.org/2003/soap-envelope"
xmlns:xsi="http://www.w3.org/2001/XMLSchema"
xmlns:xsd="http://www.w3.org/XMLSchema"
xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing"
xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-201-wss-wssecurity-secext-1.0.xsd"
xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-201-wss-security-1.0.xsd">
<env:Header
xmlns:env="http://www.w3.org/2003/05/soap-envelope">
<wsa:Action>RetrieveResponse</wsa:Action>
<wsa:MessageID>urn:uuid:1234</wsa:MessageID>
<wsa:RelatesTo>urn:uuid:1234</wsa:RelatesTo>
<wsa:To>http://schemas.xmlsoap.org/ws/2004/08/dressing/role/anonymous</wsa:To>
<wsse:Security>
<wsu:Timestamp wsu:Id="Timestamp-1234">
<wsu:Created>2021-11-07T13:10:54Z</wsu:Created>
<wsu:Expires>2021-11-07T13:15:54Z</wsu:Expires>
</wsu:Timestamp>
</wsse:Security>
</env:Header>
<soap:Body>
<RetrieveResponseMsg
xmlns="http://partnerAPI">
<OverallStatus>OK</OverallStatus>
<RequestID>f9876</RequestID>
<Results xsi:type="Data">
<PartnerKey xsi:nil="true" />
<ObjectID>Object1</ObjectID>
<CustomerKey>Customer1</CustomerKey>
<Name>Test1</Name>
<IsSendable>true</IsSendable>
<SendableSubscriberField>
<Name>_Something1</Name>
</SendableSubscriberField>
</Results>
<Results xsi:type="Data">
<PartnerKey xsi:nil="true" />
<ObjectID>Object2</ObjectID>
<CustomerKey>Customer2</CustomerKey>
<Name>Name2</Name>
<IsSendable>true</IsSendable>
<SendableSubscriberField>
<Name>_Something2</Name>
</SendableSubscriberField>
</Results>
<Results xsi:type="Data">
<PartnerKey xsi:nil="true" />
<ObjectID>Object3</ObjectID>
<CustomerKey>AnotherKey</CustomerKey>
<Name>Something3</Name>
<IsSendable>false</IsSendable>
</Results>
</RetrieveResponseMsg>
</soap:Body>
</soap:Envelope>
You're very close. Find all of the Results tags, iterate over them, and grab the elements you want from each one:
for el in soup.find_all('Results'):
    objectid_field = el.find('ObjectID').text
    customerkey_field = el.find('CustomerKey').text
    name_field = el.find('Name').text
    issendable_field = el.find('IsSendable').text
    sendablesubscribe_field = el.find('SendableSubscriberField').text
However, SendableSubscriberField isn't always there, so you might need to check whether IsSendable is true first:
for el in soup.find_all('Results'):
    objectid_field = el.find('ObjectID').text
    customerkey_field = el.find('CustomerKey').text
    name_field = el.find('Name').text
    issendable_field = el.find('IsSendable').text
    # skip if not sendable
    if issendable_field == 'false':
        sendablesubscribe_field = None
        continue
    sendablesubscribe_field = el.find('SendableSubscriberField').find('Name').text
Edit: Constructing the dataframe
To build the dataframe from this, I'd collect everything into a list of dictionaries:
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(soap_response.text, "xml")
data = []
for el in soup.find_all('Results'):
    record = {}
    record['ObjectID'] = el.find('ObjectID').text
    record['CustomerKey'] = el.find('CustomerKey').text
    record['Name'] = el.find('Name').text
    record['IsSendable'] = el.find('IsSendable').text
    # SendableSubscriberField only exists for sendable records
    if record['IsSendable'] == 'false':
        record['SendableSubscriberField'] = None
    else:
        record['SendableSubscriberField'] = el.find('SendableSubscriberField').find('Name').text
    # append every record, sendable or not, so no row is dropped
    data.append(record)
df = pd.DataFrame(data)
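If BeautifulSoup is not available, roughly the same list-of-dictionaries loop can be sketched with the standard library. This is only an illustration: sample_xml is a trimmed, hypothetical stand-in for soap_response.text, and the prefix p is an arbitrary name I chose for the http://partnerAPI namespace. Because findtext returns None when a tag is missing, no IsSendable check is needed here.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for soap_response.text (assumption: same layout as the sample).
sample_xml = """<Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema">
  <RetrieveResponseMsg xmlns="http://partnerAPI">
    <Results xsi:type="Data">
      <ObjectID>Object1</ObjectID>
      <Name>Test1</Name>
      <IsSendable>true</IsSendable>
      <SendableSubscriberField><Name>_Something1</Name></SendableSubscriberField>
    </Results>
    <Results xsi:type="Data">
      <ObjectID>Object3</ObjectID>
      <Name>Something3</Name>
      <IsSendable>false</IsSendable>
    </Results>
  </RetrieveResponseMsg>
</Envelope>"""

ns = {"p": "http://partnerAPI"}  # default namespace of RetrieveResponseMsg
root = ET.fromstring(sample_xml)

records = []
for el in root.iterfind(".//p:Results", ns):
    records.append({
        "ObjectID": el.findtext("p:ObjectID", namespaces=ns),
        "Name": el.findtext("p:Name", namespaces=ns),  # direct child only
        "IsSendable": el.findtext("p:IsSendable", namespaces=ns),
        # None when the nested tag is absent
        "SendableSubscriberField": el.findtext(
            "p:SendableSubscriberField/p:Name", namespaces=ns
        ),
    })

print(records)
```

Passing records to pd.DataFrame should then give one row per Results block, with a missing value where SendableSubscriberField is absent.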
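Consider using pandas.read_xml after all, this time declaring the default namespace (http://partnerAPI) under a prefix. Because you also need a lower-level value, run read_xml twice and join the results. Note that all attribute and element values come back as columns, even when they are empty. The join aligns rows by position, so it matches here only because the one record without SendableSubscriberField comes last.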
soap_df = (
    pd.read_xml(
        soap_response.text,
        xpath = ".//rrm:RetrieveResponseMsg/rrm:Results",
        namespaces = {"rrm": "http://partnerAPI"}
    ).join(
        pd.read_xml(
            soap_response.text,
            xpath = ".//rrm:RetrieveResponseMsg/rrm:Results/rrm:SendableSubscriberField",
            namespaces = {"rrm": "http://partnerAPI"},
            names = ["SendableSubscriberField_Name", ""]
        ),
    )
)
print(soap_df)
# type PartnerKey ObjectID CustomerKey Name IsSendable SendableSubscriberField SendableSubscriberField_Name
# 0 Data NaN Object1 Customer1 Test1 True NaN _Something1
# 1 Data NaN Object2 Customer2 Name2 True NaN _Something2
# 2 Data NaN Object3 AnotherKey Something3 False NaN NaN
I have a directory of XML files, and I need to extract 4 values from each file and store to a dataframe/CSV.
The problem is that some of the data I need comes from repeated tags (e.g., <PathName>), so I'm not sure of the best way to do this. I could specify the exact line number to extract, because it appears consistent across the files I have seen, but I am not certain that will always be the case, so doing it that way is too brittle.
<?xml version="1.0" encoding="utf-8"?>
<BxfMessage xsi:schemaLocation="http://smpte-ra.org/schemas/2021/2019/BXF BxfSchema.xsd" id="jffsdfs" dateTime="2023-02-02T20:11:38Z" messageType="Info" origin="url" originType="Delivery" userName="ABC Corp User" destination=" System" xmlns="http://sffe-ra.org/schema/1999/2023/BXF" xmlns:xsi="http://www.w9.org/4232/XMLSchema-instance">
<BxfData action="Spotd">
<Content timestamp="2023-02-02T20:11:38Z">
<NonProgramContent>
<Details>
<SpotType>Paid</SpotType>
<SpotType>Standard</SpotType>
<Spotvertiser>
<SpotvertiserName>Spot Plateau</SpotvertiserName>
</Spotvertiser>
<Agency>
<AgencyName>Spot Plateau</AgencyName>
</Agency>
<Product>
<Name></Name>
<BrandName>zzTop</BrandName>
<DirectResponse>
<PhoneNo></PhoneNo>
<PCode></PCode>
<DR_URL></DR_URL>
</DirectResponse>
</Product>
</Details>
<ContentMetSpotata>
<ContentId>
<BHGXId idType="CISC" auth="Agency">AAAA1111999Z</BHGXId>
</ContentId>
<Name>Pill CC Dutch</Name>
<Policy>
<PlatformType>Spotcast</PlatformType>
</Policy>
<Media>
<BaseBand>
<Audio VO="true">
<AnalogAudio primAudio="false" />
<DigitalAudio>
<MPEGLayerIIAudio house="false" audioId="1" dualMono="false" />
</DigitalAudio>
</Audio>
<Video withlate="false" sidebend="false">
<Format>1182v</Format>
<CCs>true</CCs>
</Video>
<AccessServices>
<AudioDescription_DVS>false</AudioDescription_DVS>
</AccessServices>
<QC>Passed QC (AAAA1111103H )</QC>
</BaseBand>
<MediaLocation sourceType="Primary">
<Location>
<AssetServer PAA="true" FTA="true">
<PathName>zzTap_zzTop_AAAA1111999Z_30s_Pill_aa-bb.mp4</PathName>
</AssetServer>
</Location>
<SOM>
<SmpteTimeCode>00:00:00;00</SmpteTimeCode>
</SOM>
<Duration>
<SmpteDuration>
<SmpteTimeCode>00:00:30;00</SmpteTimeCode>
</SmpteDuration>
</Duration>
</MediaLocation>
<MediaLocation sourceType="Proxy" qualifer="Low-res">
<Location>
<AssetServer PAA="true" FTA="true">
<PathName>https://app.url.com/DMM/DL/wew52f</PathName>
</AssetServer>
</Location>
<SOM>
<SmpteTimeCode>00:00:00;00</SmpteTimeCode>
</SOM>
<Duration>
<SmpteDuration>
<SmpteTimeCode>00:00:30;00</SmpteTimeCode>
</SmpteDuration>
</Duration>
</MediaLocation>
<MediaLocation sourceType="Preview" qualifer="Thumbnail">
<Location>
<AssetServer PAA="true" FTA="true">
<PathName>https://f9-int-5.rainxyz.com/url.com/media/t43fs/423gs-389a-40a4.jpg?inline</PathName>
</AssetServer>
</Location>
<SOM>
<SmpteTimeCode>00:00:00;00</SmpteTimeCode>
</SOM>
<Duration>
<SmpteDuration>
<SmpteTimeCode>00:00:00;00</SmpteTimeCode>
</SmpteDuration>
</Duration>
</MediaLocation>
</Media>
</ContentMetSpotata>
</NonProgramContent>
</Content>
</BxfData>
</BxfMessage>
Is there a more flexible method so that I can get consistent output like:
FileName Brand ID URL
zzTap_zzTop_AAAA1111999Z_30s_Pill_aa-bb zzTop AAAA1111999Z https://app.url.com/DMM/DL/wew52f
zzTap_zzTab_BAAA1111999Z_30s_Pill_aa-cc zzTab BAAA1111999Z https://app.url.com/DMM/DL/wew52c
zzTap_zzTan_CAAA1111999Z_30s_Pill_aa-dd zzTan CAAA1111999Z https://app.url.com/DMM/DL/wew523
zzTap_zzTon_DAAA1111999Z_30s_Pill_aa-zz zzTon DAAA1111999Z https://app.url.com/DMM/DL/wew52y
To parse one XML file using beautifulsoup you can use this example:
from bs4 import BeautifulSoup
def get_info(xml_file):
    with open(xml_file, 'r') as f_in:
        soup = BeautifulSoup(f_in.read(), 'xml')
    file_name = soup.find(lambda tag: tag.name == 'PathName' and '.mp4' in tag.text).text.rsplit('.mp4', maxsplit=1)[0]
    url = soup.select_one('[sourceType="Proxy"] PathName').text
    brand_name = soup.select_one('BrandName').text
    id_ = soup.select_one('BHGXId').text
    return file_name, brand_name, id_, url
print(get_info('your_file.xml'))
Prints:
('zzTap_zzTop_AAAA1111999Z_30s_Pill_aa-bb', 'zzTop', 'AAAA1111999Z', 'https://app.url.com/DMM/DL/wew52f')
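For comparison, here is a rough standard-library sketch of the same extraction that picks each PathName by the sourceType attribute of its MediaLocation instead of by line position. The function name get_info_et, the prefix b, and the trimmed demo document are my own assumptions, and the sketch relies on every file declaring the same default namespace as the sample.

```python
import xml.etree.ElementTree as ET

# Default namespace taken from the sample file; "b" is an arbitrary prefix.
NS = {"b": "http://sffe-ra.org/schema/1999/2023/BXF"}

def get_info_et(xml_text):
    """Return (file_name, brand, id, url); missing parts come back as None."""
    root = ET.fromstring(xml_text)
    path = "b:Location/b:AssetServer/b:PathName"
    primary = root.findtext(f".//b:MediaLocation[@sourceType='Primary']/{path}", namespaces=NS)
    proxy = root.findtext(f".//b:MediaLocation[@sourceType='Proxy']/{path}", namespaces=NS)
    # drop the extension (".mp4" in the sample) from the primary file name
    file_name = primary.rsplit(".", 1)[0] if primary else None
    return (
        file_name,
        root.findtext(".//b:BrandName", namespaces=NS),
        root.findtext(".//b:BHGXId", namespaces=NS),
        proxy,
    )

# Trimmed, hypothetical version of the sample document.
demo = """<BxfMessage xmlns="http://sffe-ra.org/schema/1999/2023/BXF">
  <Product><BrandName>zzTop</BrandName></Product>
  <ContentId><BHGXId idType="CISC">AAAA1111999Z</BHGXId></ContentId>
  <Media>
    <MediaLocation sourceType="Primary">
      <Location><AssetServer><PathName>zzTap_zzTop_AAAA1111999Z_30s_Pill_aa-bb.mp4</PathName></AssetServer></Location>
    </MediaLocation>
    <MediaLocation sourceType="Proxy">
      <Location><AssetServer><PathName>https://app.url.com/DMM/DL/wew52f</PathName></AssetServer></Location>
    </MediaLocation>
  </Media>
</BxfMessage>"""

print(get_info_et(demo))
```

Selecting on sourceType means the order of the MediaLocation blocks should not matter, which addresses the brittleness concern in the question.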
What does your code look like? Here is my attempt.
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse("zzTab.xml")
root = tree.getroot()
ns = "{http://sffe-ra.org/schema/1999/2023/BXF}"
list_of_interest = [f"{ns}PathName", f"{ns}BHGXId", f"{ns}BrandName"]
PathName_dir_list = []
PathName_file_list = []
BHGXId_list = []
BrandName_list = []
for elem in root.iter():
    #print(elem.tag, elem.text)
    if elem.tag in list_of_interest:
        if elem.tag == f"{ns}PathName" and '.mp4' not in elem.text:
            #print("Dir:",elem.text)
            PathName_dir_list.append(elem.text)
        if elem.tag == f"{ns}PathName" and '.mp4' in elem.text:
            #print("File:",elem.text)
            PathName_file_list.append(elem.text)
        if elem.tag == f"{ns}BHGXId":
            #print("ID", elem.text)
            BHGXId_list.append(elem.text)
        if elem.tag == f"{ns}BrandName":
            print("Brand", elem.text)
            BrandName_list.append(elem.text)
t = zip(PathName_dir_list, PathName_file_list, BHGXId_list, BrandName_list,)
list_of_tuples = list(t)
df = pd.DataFrame(list_of_tuples, columns = ['Path', 'File', 'ID', 'Brand'])
df.to_csv('file_list.csv')
print(df)
If working with BeautifulSoup, I suggest looking into using .select with CSS selectors so that you can do something like
# from bs4 import BeautifulSoup
def getXMLdata(xmlFile:str, defaultVal=None):
    with open(xmlFile, 'r') as f: xSoup = BeautifulSoup(f, 'xml')
    selRef = {
        'FileName': 'MediaLocation[sourceType="Primary"] Location',
        'Brand': 'BrandName', 'ID': 'ContentId',
        'URL': 'MediaLocation[sourceType="Proxy"] Location'
    }
    xfDets = {} # {'fromFile': xmlFile}
    for k, sel in selRef.items():
        t = xSoup.select_one(sel)
        xfDets[k] = t.get_text(' ').strip() if t else defaultVal
    fn = xfDets.get('FileName')
    if isinstance(fn, str) and '.' in fn: # remove extensions like ".mp4"
        xfDets['FileName'] = '.'.join(fn.split('.')[:-1])
    return xfDets
Since I've seen only one example, I can't know for sure whether the selectors in selRef will apply to all your files; but I saved the snippet from your question to a file named x.xml, and getXMLdata('x.xml') returned
{'FileName': 'zzTap_zzTop_AAAA1111999Z_30s_Pill_aa-bb',
'Brand': 'zzTop',
'ID': 'AAAA1111999Z',
'URL': 'https://app.url.com/DMM/DL/wew52f'}
If you had a list of paths to XML files (let's say filesList), you could tabulate their outputs with pandas like
# import pandas
# filesList = ['x.xml', ...] ## LIST OF XML FILES
xDF = pandas.DataFrame([getXMLdata(x) for x in filesList])
[ If you wanted to save that output to a csv file, you can use .to_csv like xDF.to_csv('xmldata.csv'). ]
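If pandas is not needed for anything else, the directory-to-CSV step can also be sketched with only the standard library. Everything named here is hypothetical: tabulate_dir is a helper I made up, and root_tag is a placeholder extractor that you would swap for getXMLdata or a similar per-file function returning a dictionary.

```python
import csv
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

def root_tag(path):
    """Placeholder per-file extractor; replace with a real one such as getXMLdata."""
    return {"file": path.name, "root": ET.parse(path).getroot().tag}

def tabulate_dir(directory, extract, out_csv):
    """Run extract() on every *.xml file in directory and write the rows to out_csv."""
    rows = [extract(p) for p in sorted(Path(directory).glob("*.xml"))]
    if not rows:
        return rows
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return rows

# Demo on a throwaway directory holding two tiny files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.xml").write_text("<A/>", encoding="utf-8")
    Path(tmp, "b.xml").write_text("<B/>", encoding="utf-8")
    out_path = Path(tmp, "out.csv")
    rows = tabulate_dir(tmp, root_tag, out_path)
    csv_text = out_path.read_text(encoding="utf-8")

print(rows)
```

The column order of the CSV follows the keys of the first row, so the extractor should return the same keys for every file.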
I am trying to parse the following XML and fetch the specific tags I'm interested in for my business need, and I guess I'm doing something wrong. I'm not sure how to parse the tags I need. I want to use pandas so that I can filter further for specifics. I appreciate all the support.
My XML, coming from a URI:
<couponfeed>
<TotalMatches>1459</TotalMatches>
<TotalPages>3</TotalPages>
<PageNumberRequested>1</PageNumberRequested>
<link type="TEXT">
<categories>
<category id="1">Apparel</category>
</categories>
<promotiontypes>
<promotiontype id="11">Percentage off</promotiontype>
</promotiontypes>
<offerdescription>25% Off Boys Quiksilver Apparel. Shop now at Macys.com! Valid 7/23 through 7/25!</offerdescription>
<offerstartdate>2020-07-24</offerstartdate>
<offerenddate>2020-07-26</offerenddate>
<clickurl>https://click.synergy.com/fs-bin/click?id=Z&offerid=777210.100474694&type=3&subid=0</clickurl>
<impressionpixel>https://ad.synergy.com/fs-bin/show?id=ZNAweM&bids=777210.100474694&type=3&subid=0</impressionpixel>
<advertiserid>3184</advertiserid>
<advertisername>cys.com</advertisername>
<network id="1">US Network</network>
</link>
<link type="TEXT">
<categories>
<category id="1">Apparel</category>
</categories>
<promotiontypes>
<promotiontype id="11">Percentage off</promotiontype>
</promotiontypes>
<offerdescription>25% Off Boys' Quiksilver Apparel. Shop now at Macys.com! Valid 7/23 through 7/25!</offerdescription>
<offerstartdate>2020-07-24</offerstartdate>
<offerenddate>2020-07-26</offerenddate>
<clickurl>https://click.synergy.com/fs-bin/click?id=ZZvk49eM&offerid=777210.100474695&type=3&subid=0</clickurl>
<impressionpixel>https://ad.synergy.com/fs-bin/show?id=ZZvk49NAwbids=777210.100474695&type=3&subid=0</impressionpixel>
<advertiserid>3184</advertiserid>
<advertisername>cys.com</advertisername>
<network id="1">US Network</network>
</link>
My Code
from xml.dom import minidom
import urllib
import pandas as pd
url = "http://couponfeed.synergy.com/coupon?token=xxxxxxxxx122b&network=1&resultsperpage=500"
xmldoc = minidom.parse(urllib.request.urlopen(url))
#itemlist = xmldoc.getElementsByTagName('clickurl')
df_cols = ["promotiontype","category","offerdescription", "offerstartdate", "offerenddate", "clickurl","impressionpixel","advertisername","network"]
rows = []
for entry in xmldoc.couponfeed:
    s_promotiontype = couponfeed.get("promotiontype","")
    s_category = couponfeed.get("category","")
    s_offerdescription = couponfeed.get("offerdescription", "")
    s_offerstartdate = couponfeed.get("offerstartdate", "")
    s_offerenddate = couponfeed.get("offerenddate", "")
    s_clickurl = couponfeed.get("clickurl", "")
    s_impressionpixel = couponfeed.get("impressionpixel", "")
    s_advertisername = couponfeed.get("advertisername","")
    s_network = couponfeed.get ("network","")
    rows.append({"promotiontype":s_promotiontype, "category": s_category, "offerdescription": s_offerdescription,
                 "offerstartdate": s_offerstartdate, "offerenddate": s_offerenddate,"clickurl": s_clickurl,"impressionpixel":s_impressionpixel,
                 "advertisername": s_advertisername,"network": s_network})
out_df = pd.DataFrame(rows, columns=df_cols)
out_df.to_csv(r"C:\\Users\rai\Downloads\\merchants_offers_share.csv", index=False)
I also tried a simpler way, but I don't get any results:
import lxml.etree as ET
import urllib
response = urllib.request.urlopen('http://couponfeed.synergy.com/coupon?token=xxxxxd39f4e5fe392a25538bb122b&network=1&resultsperpage=500')
xml = response.read()
root = ET.fromstring(xml)
for item in root.findall('.//item'):
    title = item.find('category').text
    print (title)
Another try:
from lxml import etree
import pandas as pd
import urllib
url = "http://couponfeed.synergy.com/coupon?token=xxxxxxd39f4e5fe392a25538bb122b&network=1&resultsperpage=500"
xtree = etree.parse(urllib.request.urlopen(url))
for value in xtree.xpath("/root/couponfeed/categories"):
    print(value.text)
Another method.
from simplified_scrapy import SimplifiedDoc, utils, req
# html = req.get('http://couponfeed.synergy.com/coupon?token=xxxxxxxxx122b&network=1&resultsperpage=500')
html = '''
<couponfeed>
<TotalMatches>1459</TotalMatches>
<TotalPages>3</TotalPages>
<PageNumberRequested>1</PageNumberRequested>
<link type="TEXT">
<categories>
<category id="1">Apparel</category>
</categories>
<promotiontypes>
<promotiontype id="11">Percentage off</promotiontype>
</promotiontypes>
<offerdescription>25% Off Boys Quiksilver Apparel. Shop now at Macys.com! Valid 7/23 through 7/25!</offerdescription>
<offerstartdate>2020-07-24</offerstartdate>
<offerenddate>2020-07-26</offerenddate>
<clickurl>https://click.synergy.com/fs-bin/click?id=Z&offerid=777210.100474694&type=3&subid=0</clickurl>
<impressionpixel>https://ad.synergy.com/fs-bin/show?id=ZNAweM&bids=777210.100474694&type=3&subid=0</impressionpixel>
<advertiserid>3184</advertiserid>
<advertisername>cys.com</advertisername>
<network id="1">US Network</network>
</link>
</couponfeed>
'''
doc = SimplifiedDoc(html)
df_cols = [
"promotiontype", "category", "offerdescription", "offerstartdate",
"offerenddate", "clickurl", "impressionpixel", "advertisername", "network"
]
rows = [df_cols]
links = doc.couponfeed.links # Get all links
for link in links:
    row = []
    for col in df_cols:
        row.append(link.select(col).text) # Get col text
    rows.append(row)
utils.save2csv('merchants_offers_share.csv', rows) # Save to csv file
Result:
promotiontype,category,offerdescription,offerstartdate,offerenddate,clickurl,impressionpixel,advertisername,network
Percentage off,Apparel,25% Off Boys Quiksilver Apparel. Shop now at Macys.com! Valid 7/23 through 7/25!,2020-07-24,2020-07-26,https://click.synergy.com/fs-bin/click?id=Z&offerid=777210.100474694&type=3&subid=0,https://ad.synergy.com/fs-bin/show?id=ZNAweM&bids=777210.100474694&type=3&subid=0,cys.com,US Network
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
Remove the last empty row
import io
with io.open('merchants_offers_share.csv', "rb+") as f:
    f.seek(-1,2)
    l = f.read()
    if l == b"\n":
        f.seek(-2,2)
        f.truncate()
First, the XML document wasn't parsing because you copied a raw ampersand (&) from the source page, and & is a reserved character in XML. When your browser renders XML (or HTML), it converts &amp; into &.
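A small sketch of the ampersand issue, using only the standard library: the same URL fails to parse when it is embedded raw and parses once the ampersand is escaped. The shortened url value is taken from the sample feed, and the variable names are my own.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

url = "https://click.synergy.com/fs-bin/click?id=Z&offerid=777210.100474694&type=3"

raw = f"<clickurl>{url}</clickurl>"           # raw "&": not well-formed XML
safe = f"<clickurl>{escape(url)}</clickurl>"  # "&" written as "&amp;"

try:
    ET.fromstring(raw)
    raw_parses = True
except ET.ParseError:
    raw_parses = False

# The parser turns "&amp;" back into "&" in the element text.
parsed_url = ET.fromstring(safe).text
print(raw_parses, parsed_url == url)
```

A feed served by the API should already contain the escaped form, so this mostly matters when XML is pasted from a browser window into a string.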
As for the code, the easiest way to get the data is to iterate over df_cols, then execute getElementsByTagName for each column, which will return a list of elements for the given column.
from xml.dom import minidom
import pandas as pd
import urllib.request
limit = 500
url = f"http://couponfeed.synergy.com/coupon?token=xxxxxxxxx122b&network=1&resultsperpage={limit}"
xmldoc = minidom.parse(urllib.request.urlopen(url))
df_cols = ["promotiontype","category","offerdescription", "offerstartdate", "offerenddate", "clickurl","impressionpixel","advertisername","network"]
# create an object for each row
rows = [{} for i in range(limit)]
for row_name in df_cols:
    # get results for each row_name
    nodes = xmldoc.getElementsByTagName(row_name)
    for i, node in enumerate(nodes):
        rows[i][row_name] = node.firstChild.nodeValue
out_df = pd.DataFrame(rows, columns=df_cols)
This isn't the most efficient way to do this, but I'm not sure how else to do it using minidom. If efficiency is a concern, I'd recommend using lxml instead.
Assuming there is no issue parsing your XML from the URL (the link is not available on our end), your first lxml attempt can work if you search for nodes that actually exist. Specifically, there is no <item> node in the XML document.
Instead, use link, and consider a nested list/dict comprehension to migrate the content to a data frame. With lxml you can swap findall for xpath and get the same result.
df = pd.DataFrame([{item.tag: item.text if item.text and item.text.strip() else item.find("*").text
                    for item in lnk.findall("*")}
                   for lnk in root.findall('.//link')])
print(df)
# categories promotiontypes offerdescription ... advertiserid advertisername network
# 0 Apparel Percentage off 25% Off Boys Quiksilver Apparel. Shop now at M... ... 3184 cys.com US Network
# 1 Apparel Percentage off 25% Off Boys' Quiksilver Apparel. Shop now at ... ... 3184 cys.com US Network
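The per-link idea can also be sketched with the standard library and an explicit column list, so that a tag missing from one link becomes None in that row instead of shifting values between rows. The feed below is a trimmed version of the sample, and its second link, including the 2020-07-25 date, is hypothetical and deliberately incomplete.

```python
import xml.etree.ElementTree as ET

feed = """<couponfeed>
  <link type="TEXT">
    <categories><category id="1">Apparel</category></categories>
    <promotiontypes><promotiontype id="11">Percentage off</promotiontype></promotiontypes>
    <offerstartdate>2020-07-24</offerstartdate>
    <advertisername>cys.com</advertisername>
  </link>
  <link type="TEXT">
    <categories><category id="1">Apparel</category></categories>
    <offerstartdate>2020-07-25</offerstartdate>
  </link>
</couponfeed>"""

cols = ["promotiontype", "category", "offerstartdate", "advertisername"]

root = ET.fromstring(feed)
rows = []
for link in root.iter("link"):
    # ".//col" searches below this link only, so values never leak between rows
    rows.append({col: link.findtext(f".//{col}") for col in cols})

print(rows)
```

Passing rows to pd.DataFrame with columns=cols should then give one row per link.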
I need to read XML data from a URL (an exchange rate list) and output a dictionary. Right now I can only get the first currency; I tried find_all but without success.
Can somebody point out where I need to put a for loop to read all the values?
import bs4 as bs
import urllib.request
source = urllib.request.urlopen('http://www.xxxy.hr/Downloads/PBZteclist.xml').read()
soup = bs.BeautifulSoup(source,'xml')
name = soup.find('Name').text
unit = soup.find('Unit').text
buyratecache = soup.find('BuyRateCache').text
buyrateforeign = soup.find('BuyRateForeign').text
meanrate = soup.find('MeanRate').text
sellrateforeign = soup.find('SellRateForeign').text
sellratecache = soup.find('SellRateCache').text
devize = {'naziv_valute': '{}'.format(name),
'jedinica': '{}'.format(unit),
'kupovni': '{}'.format(buyratecache),
'kupovni_strani': '{}'.format(buyrateforeign),
'srednji': '{}'.format(meanrate),
'prodajni_strani': '{}'.format(sellrateforeign),
'prodajni': '{}'.format(sellratecache)}
print ("devize:",devize)
Example of XML:
<ExchRates>
<ExchRate>
<Bank>Privredna banka Zagreb</Bank>
<CurrencyBase>HRK</CurrencyBase>
<Date>12.01.2019.</Date>
<Currency Code="036">
<Name>AUD</Name>
<Unit>1</Unit>
<BuyRateCache>4,485390</BuyRateCache>
<BuyRateForeign>4,530697</BuyRateForeign>
<MeanRate>4,646869</MeanRate>
<SellRateForeign>4,786275</SellRateForeign>
<SellRateCache>4,834138</SellRateCache>
</Currency>
<Currency Code="124">
<Name>CAD</Name>
<Unit>1</Unit>
<BuyRateCache>4,724225</BuyRateCache>
<BuyRateForeign>4,771944</BuyRateForeign>
<MeanRate>4,869331</MeanRate>
<SellRateForeign>4,991064</SellRateForeign>
<SellRateCache>5,040975</SellRateCache>
</Currency>
<Currency Code="203">
<Name>CZK</Name>
<Unit>1</Unit>
<BuyRateCache>0,280057</BuyRateCache>
<BuyRateForeign>0,284322</BuyRateForeign>
<MeanRate>0,290124</MeanRate>
<SellRateForeign>0,297377</SellRateForeign>
<SellRateCache>0,300351</SellRateCache>
</Currency>
...etc...
</ExchRate>
</ExchRates>
Simply iterate through all Currency nodes (not the soup object) and even use a list comprehension to build a list of dictionaries:
soup = bs.BeautifulSoup(source, 'xml')
# ALL EXCHANGE RATE NODES
currency_nodes = soup.find_all('Currency')
# LIST OF DICTIONARIES
devize_list = [{'naziv_valute': c.find('Name').text,
                'jedinica': c.find('Unit').text,
                'kupovni': c.find('BuyRateCache').text,
                'kupovni_strani': c.find('BuyRateForeign').text,
                'srednji': c.find('MeanRate').text,
                'prodajni_strani': c.find('SellRateForeign').text,
                'prodajni': c.find('SellRateCache').text
               } for c in currency_nodes]
Alternatively, nest a dictionary comprehension inside the list comprehension, since you are extracting all child elements (the keys are then the XML tag names):
devize_list = [{n.name: n.text for n in c.children if n.name is not None}
               for c in currency_nodes]
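For reference, a minimal standard-library sketch of the same one-dictionary-per-Currency idea. The XML string is a shortened copy of the example, and adding the Code attribute as an extra key is my own addition rather than something the question asked for.

```python
import xml.etree.ElementTree as ET

xml_text = """<ExchRates><ExchRate>
  <Currency Code="036"><Name>AUD</Name><Unit>1</Unit><MeanRate>4,646869</MeanRate></Currency>
  <Currency Code="124"><Name>CAD</Name><Unit>1</Unit><MeanRate>4,869331</MeanRate></Currency>
</ExchRate></ExchRates>"""

root = ET.fromstring(xml_text)
# one dict per <Currency>: its Code attribute plus every child tag -> text
devize_list = [
    {"Code": cur.get("Code"), **{child.tag: child.text for child in cur}}
    for cur in root.iter("Currency")
]
print(devize_list)
```

The rates stay as strings with decimal commas, exactly as they appear in the feed.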
Hello, I am making a requests call to return order data from an online store. My issue is that once I have passed my data to a root variable, the iter method does not return the correct results: it displays multiple tags of the same name rather than one, and it does not show the data within the tag.
I thought this was because the XML was not correctly formatted, so I formatted it by saving it to a file using pretty_print, but that hasn't fixed the problem.
How do I fix this? Thanks in advance.
Code:
import requests, xml.etree.ElementTree as ET, lxml.etree as etree
url="http://publicapi.ekmpowershop24.com/v1.1/publicapi.asmx"
headers = {'content-type': 'application/soap+xml'}
body = """<?xml version="1.0" encoding="utf-8"?>
<soap12:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap12="http://www.w3.org/2003/05/soap-envelope">
<soap12:Body>
<GetOrders xmlns="http://publicapi.ekmpowershop.com/">
<GetOrdersRequest>
<APIKey>my_api_key</APIKey>
<FromDate>01/07/2018</FromDate>
<ToDate>04/07/2018</ToDate>
</GetOrdersRequest>
</GetOrders>
</soap12:Body>
</soap12:Envelope>"""
#send request to ekm
r = requests.post(url,data=body,headers=headers)
#save output to file
file = open("C:/Users/Mark/Desktop/test.xml", "w")
file.write(r.text)
file.close()
#take the file and format the xml
x = etree.parse("C:/Users/Mark/Desktop/test.xml")
newString = etree.tostring(x, pretty_print=True)
file = open("C:/Users/Mark/Desktop/test.xml", "w")
file.write(newString.decode('utf-8'))
file.close()
#parse the file to get the roots
tree = ET.parse("C:/Users/Mark/Desktop/test.xml")
root = tree.getroot()
#access elements names in the data
for child in root.iter('*'):
    print(child.tag)
#show orders elements attributes
tree = ET.parse("C:/Users/Mark/Desktop/test.xml")
root = tree.getroot()
for order in root.iter('{http://publicapi.ekmpowershop.com/}Order'):
    out = {}
    for child in order:
        if child.tag in ('OrderID'):
            out[child.tag] = child.text
    print(out)
Elements output:
{http://publicapi.ekmpowershop.com/}Orders
{http://publicapi.ekmpowershop.com/}Order
{http://publicapi.ekmpowershop.com/}OrderID
{http://publicapi.ekmpowershop.com/}OrderNumber
{http://publicapi.ekmpowershop.com/}CustomerID
{http://publicapi.ekmpowershop.com/}CustomerUserID
{http://publicapi.ekmpowershop.com/}Order
{http://publicapi.ekmpowershop.com/}OrderID
{http://publicapi.ekmpowershop.com/}OrderNumber
{http://publicapi.ekmpowershop.com/}CustomerID
{http://publicapi.ekmpowershop.com/}CustomerUserID
Orders Output:
{http://publicapi.ekmpowershop.com/}Order {}
{http://publicapi.ekmpowershop.com/}Order {}
XML structure after formatting:
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<GetOrdersResponse xmlns="http://publicapi.ekmpowershop.com/">
<GetOrdersResult>
<Status>Success</Status>
<Errors/>
<Date>2018-07-10T13:47:00.1682029+01:00</Date>
<TotalOrders>10</TotalOrders>
<TotalCost>100</TotalCost>
<Orders>
<Order>
<OrderID>100</OrderID>
<OrderNumber>102/040718/67</OrderNumber>
<CustomerID>6910</CustomerID>
<CustomerUserID>204</CustomerUserID>
<FirstName>TestFirst</FirstName>
<LastName>TestLast</LastName>
<CompanyName>Test Company</CompanyName>
<EmailAddress>test#Test.com</EmailAddress>
<OrderStatus>Dispatched</OrderStatus>
<OrderStatusColour>#00CC00</OrderStatusColour>
<TotalCost>85.8</TotalCost>
<OrderDate>10/07/2018 14:30:43</OrderDate>
<OrderDateISO>2018-07-10T14:30:43</OrderDateISO>
<AbandonedOrder>false</AbandonedOrder>
<EkmStatus>SUCCESS</EkmStatus>
</Order>
</Orders>
<Currency>GBP</Currency>
</GetOrdersResult>
</GetOrdersResponse>
</soap:Body>
</soap:Envelope>
You need to consider the namespace when checking for tags.
>>> # Include the namespace part of the tag in the tag values that we check.
>>> tags = ('{http://publicapi.ekmpowershop.com/}OrderID', '{http://publicapi.ekmpowershop.com/}OrderNumber')
>>> for order in root.iter('{http://publicapi.ekmpowershop.com/}Order'):
...     out = {}
...     for child in order:
...         if child.tag in tags:
...             out[child.tag] = child.text
...     print(out)
...
{'{http://publicapi.ekmpowershop.com/}OrderID': '100', '{http://publicapi.ekmpowershop.com/}OrderNumber': '102/040718/67'}
If you don't want the namespace prefixes in the output, you can strip them by only including that part of the tag after the } character.
>>> for order in root.iter('{http://publicapi.ekmpowershop.com/}Order'):
...     out = {}
...     for child in order:
...         if child.tag in tags:
...             out[child.tag[child.tag.index('}')+1:]] = child.text
...     print(out)
...
{'OrderID': '100', 'OrderNumber': '102/040718/67'}
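A variation on the same idea, as a sketch: map the namespace to a short prefix for the search and strip it from each tag with a small helper. The prefix ekm, the helper local_name, and the trimmed XML string are my own names and assumptions, not part of the API response.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the response body.
xml_text = """<Envelope xmlns="http://www.w3.org/2003/05/soap-envelope">
  <Body>
    <Orders xmlns="http://publicapi.ekmpowershop.com/">
      <Order><OrderID>100</OrderID><OrderNumber>102/040718/67</OrderNumber><FirstName>TestFirst</FirstName></Order>
    </Orders>
  </Body>
</Envelope>"""

ns = {"ekm": "http://publicapi.ekmpowershop.com/"}
wanted = {"OrderID", "OrderNumber"}

def local_name(tag):
    """'{uri}OrderID' -> 'OrderID'; tags without a namespace pass through."""
    return tag.rpartition("}")[2]

root = ET.fromstring(xml_text)
orders = [
    {local_name(c.tag): c.text for c in order if local_name(c.tag) in wanted}
    for order in root.iterfind(".//ekm:Order", ns)
]
print(orders)
```

Comparing on the local name keeps the list of wanted tags short, at the cost of ignoring which namespace a tag came from.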
Let's assume I have the following XML:
<time from="2017-07-29T08:00:00" to="2017-07-29T09:00:00">
<!-- Valid from 2017-07-29T08:00:00 to 2017-07-29T09:00:00 -->
<symbol number="4" numberEx="4" name="Cloudy" var="04"/>
<precipitation value="0"/>
<!-- Valid at 2017-07-29T08:00:00 -->
<windDirection deg="300.9" code="WNW" name="West-northwest"/>
<windSpeed mps="1.3" name="Light air"/>
<temperature unit="celsius" value="15"/>
<pressure unit="hPa" value="1002.4"/>
</time>
<time from="2017-07-29T09:00:00" to="2017-07-29T10:00:00">
<!-- Valid from 2017-07-29T09:00:00 to 2017-07-29T10:00:00 -->
<symbol number="4" numberEx="4" name="Partly cloudy" var="04"/>
<precipitation value="0"/>
<!-- Valid at 2017-07-29T09:00:00 -->
<windDirection deg="293.2" code="WNW" name="West-northwest"/>
<windSpeed mps="0.8" name="Light air"/>
<temperature unit="celsius" value="17"/>
<pressure unit="hPa" value="1002.6"/>
</time>
And I want to collect the time from, symbol name and temperature value from it, and then print them in the following manner: time from: symbol name, temperature value. For example: 2017-07-29, 08:00:00: Cloudy, 15°.
(And there are a few name and value attributes in this XML, as you see.)
As of now, my approach was quite straightforward:
#!/usr/bin/env python
# coding: utf-8
import re
from BeautifulSoup import BeautifulSoup
# data is set to the above XML
soup = BeautifulSoup(data)
# collect the tags of interest into lists. can it be done wiser?
time_l = []
symb_l = []
temp_l = []
for i in soup.findAll('time'):
    i_time = str(i.get('from'))
    time_l.append(i_time)
for i in soup.findAll('symbol'):
    i_symb = str(i.get('name'))
    symb_l.append(i_symb)
for i in soup.findAll('temperature'):
    i_temp = str(i.get('value'))
    temp_l.append(i_temp)
# join the forecast lists to a dict
forc_l = []
for i, j in zip(symb_l, temp_l):
    forc_l.append([i, j])
rez = dict(zip(time_l, forc_l))
# combine and format the rezult. can this dict be printed simpler?
wew = ''
for key in sorted(rez):
    wew += re.sub("T", ", ", key) + str(rez[key])
wew = re.sub("'", "", wew)
wew = re.sub("\[", ": ", wew)
wew = re.sub("\]", "°\n", wew)
# print the rezult
print wew
But I imagine there must be some better, more intelligent approach? Mostly, I'm interested in collecting the attributes from the XML, my way seems rather dumb to me, actually. Also, is there any simpler way to print out a dict {'a': '[b, c]'} nicely?
Would be grateful for any hints or suggestions.
from bs4 import BeautifulSoup
with open("sample.xml", "r") as f: # opening xml file
    content = f.read() # xml content stored in this variable
soup = BeautifulSoup(content, "lxml")
for values in soup.findAll("time"):
    print("{} : {}, {}°".format(values["from"], values.find("symbol")["name"], values.find("temperature")["value"]))
Output:
2017-07-29T08:00:00 : Cloudy, 15°
2017-07-29T09:00:00 : Partly cloudy, 17°
You can also fetch the XML data with the xml.dom.minidom module.
Here is code that gets the data you want:
from xml.dom.minidom import parse
doc = parse("path/to/xmlfile.xml") # parse an XML file by name
itemlist = doc.getElementsByTagName('time')
for items in itemlist:
    from_tag = items.getAttribute('from')
    symbol_list = items.getElementsByTagName('symbol')
    symbol_name = [d.getAttribute('name') for d in symbol_list ][0]
    temperature_list = items.getElementsByTagName('temperature')
    temp_value = [d.getAttribute('value') for d in temperature_list ][0]
    print ("{} : {}, {}°". format(from_tag, symbol_name, temp_value))
Output will be as follows:
2017-07-29T08:00:00 : Cloudy, 15°
2017-07-29T09:00:00 : Partly cloudy, 17°
Hope it is useful.
You can also do this with a built-in module (I'm using Python 3.6.2):
import xml.etree.ElementTree as et # this is built-in module in python3
tree = et.parse("sample.xml")
root = tree.getroot()
for temp in root.iter("time"): # iterate time element in xml
    print(temp.attrib["from"], end=": ") # prints attribute of time element
    for sym in temp.iter("symbol"): # iterate symbol element within time element
        print(sym.attrib["name"], end=", ")
    for t in temp.iter("temperature"): # iterate temperature element within time element
        print(t.attrib["value"], end="°\n")
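To get exactly the layout asked for in the question (date and time separated by a comma), the from attribute can be split on the T. This is a sketch under two assumptions: the fragment is a trimmed copy of the sample, and because it has no single root element it is wrapped in a hypothetical <forecast> element before parsing.

```python
import xml.etree.ElementTree as ET

fragment = """
<time from="2017-07-29T08:00:00" to="2017-07-29T09:00:00">
  <symbol number="4" name="Cloudy"/>
  <temperature unit="celsius" value="15"/>
</time>
<time from="2017-07-29T09:00:00" to="2017-07-29T10:00:00">
  <symbol number="4" name="Partly cloudy"/>
  <temperature unit="celsius" value="17"/>
</time>
"""

# The fragment has no single root, so wrap it in a made-up <forecast> element.
root = ET.fromstring(f"<forecast>{fragment}</forecast>")

lines = []
for t in root.iter("time"):
    date, clock = t.get("from").split("T")
    name = t.find("symbol").get("name")
    value = t.find("temperature").get("value")
    lines.append(f"{date}, {clock}: {name}, {value}°")

print("\n".join(lines))
```

Reading symbol and temperature from inside each time element keeps the three values together, so no separate lists or zip step are needed.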