Extracting similar XML attributes with BeautifulSoup

Extracting similar XML attributes with BeautifulSoup - python

Let's assume I have the following XML:
<time from="2017-07-29T08:00:00" to="2017-07-29T09:00:00">
<!-- Valid from 2017-07-29T08:00:00 to 2017-07-29T09:00:00 -->
<symbol number="4" numberEx="4" name="Cloudy" var="04"/>
<precipitation value="0"/>
<!-- Valid at 2017-07-29T08:00:00 -->
<windDirection deg="300.9" code="WNW" name="West-northwest"/>
<windSpeed mps="1.3" name="Light air"/>
<temperature unit="celsius" value="15"/>
<pressure unit="hPa" value="1002.4"/>
</time>
<time from="2017-07-29T09:00:00" to="2017-07-29T10:00:00">
<!-- Valid from 2017-07-29T09:00:00 to 2017-07-29T10:00:00 -->
<symbol number="4" numberEx="4" name="Partly cloudy" var="04"/>
<precipitation value="0"/>
<!-- Valid at 2017-07-29T09:00:00 -->
<windDirection deg="293.2" code="WNW" name="West-northwest"/>
<windSpeed mps="0.8" name="Light air"/>
<temperature unit="celsius" value="17"/>
<pressure unit="hPa" value="1002.6"/>
</time>
And I want to collect time from, symbol name and temperature value from it, and then print it out in the following manner: time from: symbol name, temperaure value -- like this: 2017-07-29, 08:00:00: Cloudy, 15°.
(And there are a few name and value attributes in this XML, as you see.)
As of now, my approach was quite straightforward:
#!/usr/bin/env python
# coding: utf-8
import re
from BeautifulSoup import BeautifulSoup
# data is set to the above XML
soup = BeautifulSoup(data)
# collect the tags of interest into lists. can it be done wiser?
time_l = []
symb_l = []
temp_l = []
for i in soup.findAll('time'):
i_time = str(i.get('from'))
time_l.append(i_time)
for i in soup.findAll('symbol'):
i_symb = str(i.get('name'))
symb_l.append(i_symb)
for i in soup.findAll('temperature'):
i_temp = str(i.get('value'))
temp_l.append(i_temp)
# join the forecast lists to a dict
forc_l = []
for i, j in zip(symb_l, temp_l):
forc_l.append([i, j])
rez = dict(zip(time_l, forc_l))
# combine and format the rezult. can this dict be printed simpler?
wew = ''
for key in sorted(rez):
wew += re.sub("T", ", ", key) + str(rez[key])
wew = re.sub("'", "", wew)
wew = re.sub("\[", ": ", wew)
wew = re.sub("\]", "°\n", wew)
# print the rezult
print wew
But I imagine there must be some better, more intelligent approach? Mostly, I'm interested in collecting the attributes from the XML, my way seems rather dumb to me, actually. Also, is there any simpler way to print out a dict {'a': '[b, c]'} nicely?
Would be grateful for any hints or suggestions.

from bs4 import BeautifulSoup
with open("sample.xml", "r") as f: # opening xml file
content = f.read() # xml content stored in this variable
soup = BeautifulSoup(content, "lxml")
for values in soup.findAll("time"):
print("{} : {}, {}°".format(values["from"], values.find("symbol")["name"], values.find("temperature")["value"]))
Output:
2017-07-29T08:00:00 : Cloudy, 15°
2017-07-29T09:00:00 : Partly cloudy, 17°

One more, also you can fetch xml data by importing xml.dom.minidom module.
Here is the data you want:
from xml.dom.minidom import parse
doc = parse("path/to/xmlfile.xml") # parse an XML file by name
itemlist = doc.getElementsByTagName('time')
for items in itemlist:
from_tag = items.getAttribute('from')
symbol_list = items.getElementsByTagName('symbol')
symbol_name = [d.getAttribute('name') for d in symbol_list ][0]
temperature_list = items.getElementsByTagName('temperature')
temp_value = [d.getAttribute('value') for d in temperature_list ][0]
print ("{} : {}, {}°". format(from_tag, symbol_name, temp_value))
Output will be as follows:
2017-07-29T08:00:00 : Cloudy, 15°
2017-07-29T09:00:00 : Partly cloudy, 17°
Hope it is useful.

Here you can also use an alternate way using builtin module(i'm using python 3.6.2):
import xml.etree.ElementTree as et # this is built-in module in python3
tree = et.parse("sample.xml")
root = tree.getroot()
for temp in root.iter("time"): # iterate time element in xml
print(temp.attrib["from"], end=": ") # prints attribute of time element
for sym in temp.iter("symbol"): # iterate symbol element within time element
print(sym.attrib["name"], end=", ")
for t in temp.iter("temperature"): # iterate temperature element within time element
print(t.attrib["value"], end="°\n")

Related

Creating dataframe from XML file with non-unique tags

I have a directory of XML files, and I need to extract 4 values from each file and store to a dataframe/CSV.
The problem is some of the data I need to extract uses redundant tags (e.g., <PathName>) so I'm not sure of the best way to do this. I could specify the exact line # to extract, because it appears consistent with the files I have seen; but I am not certain that will always be the case, so doing it that way is too brittle.
<?xml version="1.0" encoding="utf-8"?>
<BxfMessage xsi:schemaLocation="http://smpte-ra.org/schemas/2021/2019/BXF BxfSchema.xsd" id="jffsdfs" dateTime="2023-02-02T20:11:38Z" messageType="Info" origin="url" originType="Delivery" userName="ABC Corp User" destination=" System" xmlns="http://sffe-ra.org/schema/1999/2023/BXF" xmlns:xsi="http://www.w9.org/4232/XMLSchema-instance">
<BxfData action="Spotd">
<Content timestamp="2023-02-02T20:11:38Z">
<NonProgramContent>
<Details>
<SpotType>Paid</SpotType>
<SpotType>Standard</SpotType>
<Spotvertiser>
<SpotvertiserName>Spot Plateau</SpotvertiserName>
</Spotvertiser>
<Agency>
<AgencyName>Spot Plateau</AgencyName>
</Agency>
<Product>
<Name></Name>
<BrandName>zzTop</BrandName>
<DirectResponse>
<PhoneNo></PhoneNo>
<PCode></PCode>
<DR_URL></DR_URL>
</DirectResponse>
</Product>
</Details>
<ContentMetSpotata>
<ContentId>
<BHGXId idType="CISC" auth="Agency">AAAA1111999Z</BHGXId>
</ContentId>
<Name>Pill CC Dutch</Name>
<Policy>
<PlatformType>Spotcast</PlatformType>
</Policy>
<Media>
<BaseBand>
<Audio VO="true">
<AnalogAudio primAudio="false" />
<DigitalAudio>
<MPEGLayerIIAudio house="false" audioId="1" dualMono="false" />
</DigitalAudio>
</Audio>
<Video withlate="false" sidebend="false">
<Format>1182v</Format>
<CCs>true</CCs>
</Video>
<AccessServices>
<AudioDescription_DVS>false</AudioDescription_DVS>
</AccessServices>
<QC>Passed QC (AAAA1111103H )</QC>
</BaseBand>
<MediaLocation sourceType="Primary">
<Location>
<AssetServer PAA="true" FTA="true">
<PathName>zzTap_zzTop_AAAA1111999Z_30s_Pill_aa-bb.mp4</PathName>
</AssetServer>
</Location>
<SOM>
<SmpteTimeCode>00:00:00;00</SmpteTimeCode>
</SOM>
<Duration>
<SmpteDuration>
<SmpteTimeCode>00:00:30;00</SmpteTimeCode>
</SmpteDuration>
</Duration>
</MediaLocation>
<MediaLocation sourceType="Proxy" qualifer="Low-res">
<Location>
<AssetServer PAA="true" FTA="true">
<PathName>https://app.url.com/DMM/DL/wew52f</PathName>
</AssetServer>
</Location>
<SOM>
<SmpteTimeCode>00:00:00;00</SmpteTimeCode>
</SOM>
<Duration>
<SmpteDuration>
<SmpteTimeCode>00:00:30;00</SmpteTimeCode>
</SmpteDuration>
</Duration>
</MediaLocation>
<MediaLocation sourceType="Preview" qualifer="Thumbnail">
<Location>
<AssetServer PAA="true" FTA="true">
<PathName>https://f9-int-5.rainxyz.com/url.com/media/t43fs/423gs-389a-40a4.jpg?inline</PathName>
</AssetServer>
</Location>
<SOM>
<SmpteTimeCode>00:00:00;00</SmpteTimeCode>
</SOM>
<Duration>
<SmpteDuration>
<SmpteTimeCode>00:00:00;00</SmpteTimeCode>
</SmpteDuration>
</Duration>
</MediaLocation>
</Media>
</ContentMetSpotata>
</NonProgramContent>
</Content>
</BxfData>
</BxfMessage>
Is there a more flexible method so that I can get consistent output like:
FileName Brand ID URL
zzTap_zzTop_AAAA1111999Z_30s_Pill_aa-bb zzTop AAAA1111999Z https://app.url.com/DMM/DL/wew52f
zzTap_zzTab_BAAA1111999Z_30s_Pill_aa-cc zzTab BAAA1111999Z https://app.url.com/DMM/DL/wew52c
zzTap_zzTan_CAAA1111999Z_30s_Pill_aa-dd zzTan CAAA1111999Z https://app.url.com/DMM/DL/wew523
zzTap_zzTon_DAAA1111999Z_30s_Pill_aa-zz zzTon DAAA1111999Z https://app.url.com/DMM/DL/wew52y

To parse one XML file using beautifulsoup you can use this example:
from bs4 import BeautifulSoup
def get_info(xml_file):
with open(xml_file, 'r') as f_in:
soup = BeautifulSoup(f_in.read(), 'xml')
file_name = soup.find(lambda tag: tag.name == 'PathName' and '.mp4' in tag.text).text.rsplit('.mp4', maxsplit=1)[0]
url = soup.select_one('[sourceType="Proxy"] PathName').text
brand_name = soup.select_one('BrandName').text
id_ = soup.select_one('BHGXId').text
return file_name, brand_name, id_, url
print(get_info('your_file.xml'))
Prints:
('zzTap_zzTop_AAAA1111999Z_30s_Pill_aa-bb', 'zzTop', 'AAAA1111999Z', 'https://app.url.com/DMM/DL/wew52f')

How looks your code? Here is my try.
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse("zzTab.xml")
root = tree.getroot()
ns = "{http://sffe-ra.org/schema/1999/2023/BXF}"
list_of_interest = [f"{ns}PathName", f"{ns}BHGXId", f"{ns}BrandName"]
PathName_dir_list = []
PathName_file_list = []
BHGXId_list = []
BrandName_list = []
for elem in root.iter():
#print(elem.tag, elem.text)
if elem.tag in list_of_interest:
if elem.tag == f"{ns}PathName" and '.mp4' not in elem.text:
#print("Dir:",elem.text)
PathName_dir_list.append(elem.text)
if elem.tag == f"{ns}PathName" and '.mp4' in elem.text:
#print("File:",elem.text)
PathName_file_list.append(elem.text)
if elem.tag == f"{ns}BHGXId":
#print("ID", elem.text)
BHGXId_list.append(elem.text)
if elem.tag == f"{ns}BrandName":
print("Brand", elem.text)
BrandName_list.append(elem.text)
t = zip(PathName_dir_list, PathName_file_list, BHGXId_list, BrandName_list,)
list_of_tuples = list(t)
df = pd.DataFrame(list_of_tuples, columns = ['Path', 'File', 'ID', 'Brand'])
df.to_csv('file_list.csv')
print(df)

If working with BeautifulSoup, I suggest looking into using .select with CSS selectors so that you can do something like
# from bs4 import BeautifulSoup
def getXMLdata(xmlFile:str, defaultVal=None):
with open(xmlFile, 'r') as f: xSoup = BeautifulSoup(f, 'xml')
selRef = {
'FileName': 'MediaLocation[sourceType="Primary"] Location',
'Brand': 'BrandName', 'ID': 'ContentId',
'URL': 'MediaLocation[sourceType="Proxy"] Location'
}
xfDets = {} # {'fromFile': xmlFile}
for k, sel in selRef.items():
t = xSoup.select_one(sel)
xfDets[k] = t.get_text(' ').strip() if t else defaultVal
fn = xfDets.get('FileName')
if isinstance(fn, str) and '.' in fn: # remove extensions like ".mp4"
xfDets['FileName'] = '.'.join(fn.split('.')[:-1])
return xfDets
Since I've seen only one example, I can't know for sure if the selectors in selRef will apply for all your files; but I saved the snippet from your question to a file name x.xml, and getXMLdata('x.xml') returned
{'FileName': 'zzTap_zzTop_AAAA1111999Z_30s_Pill_aa-bb',
'Brand': 'zzTop',
'ID': 'AAAA1111999Z',
'URL': 'https://app.url.com/DMM/DL/wew52f'}
If you had a list of paths to XML files (let's say filesList), you could tabulate their outputs with pandas like
# import pandas
# filesList = ['x.xml', ...] ## LIST OF XML FILES
xDF = pandas.DataFrame([getXMLdata(x) for x in filesList])
[ If you wanted to save that output to a csv file, you can use .to_csv like xDF.to_csv('xmldata.csv'). ]

Retrieving text data from <content:encoded> in XML file

I have an XML file which looks like this:
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
>
<channel>
<item>
<title>Label: some_title"</title>
<link>some_link</link>
<pubDate>some_date</pubDate>
<dc:creator><![CDATA[University]]></dc:creator>
<guid isPermaLink="false">https://link.link</guid>
<description></description>
<content:encoded><![CDATA[[vc_row][vc_column][vc_column_text]<strong>some texttext some more text</strong><!--more-->
[caption id="attachment_344" align="aligncenter" width="524"]<img class="-image-" src="link.link.png" alt="" width="524" height="316" /> <em>A screenshot by the people</em>[/caption]
<strong>some more text</strong>
<div class="entry-content">
<em>Leave your comments</em>
</div>
<div class="post-meta wf-mobile-collapsed">
<div class="entry-meta"></div>
</div>
[/vc_column_text][/vc_column][/vc_row][vc_row][vc_column][/vc_column][/vc_row][vc_row][vc_column][dt_quote]<strong><b>RESEARCH | ARTICLE </b></strong>University[/dt_quote][/vc_column][/vc_row]]]></content:encoded>
<excerpt:encoded><![CDATA[]]></excerpt:encoded>
</item>
some more <item> </item>s here
</channel>
I want to extract the raw text within the <content:encoded> section, excluding the tags and urls. I have tried this with BeautifulSoup, and Scarpy, as well as other lxml methods. Most return an empty list.
Is there a way for me to retrieve this information without having to use regex?
Much appreciated.
UPDATE
I opened the XML file using:
content = []
with open(xml_file, "r") as file:
content = file.readlines()
content = "".join(content)
xml = bs(content, "lxml")
then I tried this with scrapy:
response = HtmlResponse(url=xml_file, encoding='utf-8')
response.selector.register_namespace('content',
'http://purl.org/rss/1.0/modules/content/')
response.xpath('channel/item/content:encoded').getall()
which returns an empty list.
and tried the code in the first answer:
soup = bs(xml.select_one("content:encoded").text, "html.parser")
text = "\n".join(
s.get_text(strip=True, separator=" ") for s in soup.select("strong"))
print(text)
and get this error: Only the following pseudo-classes are implemented: nth-of-type.
When I opened the file with lxml, I ran this for loop:
data = {}
n = 0
for item in xml.findall('item'):
id = 'claim_id_' + str(n)
keys = {}
title = item.find('title').text
keys['label'] = title.split(': ')[0]
keys['claim'] = title.split(': ')[1]
if item.find('content:encoded'):
keys['text'] = bs(html.unescape(item.encoded.text), 'lxml')
data[id] = keys
print(data)
n += 1
It saved the label and claim perfectly well, but nothing for the text. Now that I opened the file using BeautifulSoup, it returns this error: 'NoneType' object is not callable

If you only need text inside <strong> tags, you can use my example. Otherwise, only regex seems suitable here:
from bs4 import BeautifulSoup
xml_doc = """
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
>
...the XML from the question...
</rss>
"""
soup = BeautifulSoup(xml_doc, "xml")
soup = BeautifulSoup(soup.select_one("content|encoded").text, "html.parser")
text = "\n".join(
s.get_text(strip=True, separator=" ") for s in soup.select("strong")
)
print(text)
Prints:
some text text some more text
some more text
RESEARCH | ARTICLE

I eventually got the text part using regular expressions (regex).
import re
for item in root.iter('item'):
grandchildren = item.getchildren()
for grandchild in grandchildren:
if 'encoded' in grandchild.tag:
text = grandchild.text
text = re.sub(r'\[.*?\]', "", text) # gets rid of square brackets and their content
text = re.sub(r'\<.*?\>', "", text) # gets rid of <> signs and their content
text = text.replace(" ", "") # gets rid of
text = " ".join(text.split())

How to change tags with lxml in Python?

I want to change all tags names <p> to <para> using lxml in python.
Here's an example of what the xml file looks like.
<concept id="id15CDB0Q0Q4G"><title id="id15CDB0R0VHA">General</title>
<conbody><p>This section</p></conbody>
<concept id="id156F7H00GIE"><title id="id15CDB0R0V1W">
System</title>
<conbody><p> </p>
<p>The
</p>
<p>status.</p>
<p>sensors.</p>
And I've been trying to code it like this but it doesn't find the tags with .findall.
from lxml import etree
doc = etree.parse("73-20.xml")
print("\n")
print(etree.tostring(doc, pretty_print=True, xml_declaration=True, encoding="utf-8"))
print("\n")
raiz = doc.getroot()
print(raiz.tag)
children = raiz.getchildren()
print(children)
print("\n")
libros = doc.findall("p")
print(libros)
print("\n")
for i in range(len(libros)):
if libros[i].find("p").tag == "p" :
libros[i].find("p").tag = "para"
Any thoughts?

lxml findall() function provides ability to search by path:
libros = raiz.findall(".//p")
for el in libros:
el.tag = "para"
Here .//p means that lxml will search nested p elements as well.

Python XML Parse and getElementsByTagName

I was trying to parse the following xml and fetch specific tags that i'm interested in around my business need. and i guess i'm doing something wrong. Not sure how to parse my required tags?? Wanted to leverage pandas, so that i can further filter for specifics. Apprentice all the support
My XMl coming from URI
<couponfeed>
<TotalMatches>1459</TotalMatches>
<TotalPages>3</TotalPages>
<PageNumberRequested>1</PageNumberRequested>
<link type="TEXT">
<categories>
<category id="1">Apparel</category>
</categories>
<promotiontypes>
<promotiontype id="11">Percentage off</promotiontype>
</promotiontypes>
<offerdescription>25% Off Boys Quiksilver Apparel. Shop now at Macys.com! Valid 7/23 through 7/25!</offerdescription>
<offerstartdate>2020-07-24</offerstartdate>
<offerenddate>2020-07-26</offerenddate>
<clickurl>https://click.synergy.com/fs-bin/click?id=Z&offerid=777210.100474694&type=3&subid=0</clickurl>
<impressionpixel>https://ad.synergy.com/fs-bin/show?id=ZNAweM&bids=777210.100474694&type=3&subid=0</impressionpixel>
<advertiserid>3184</advertiserid>
<advertisername>cys.com</advertisername>
<network id="1">US Network</network>
</link>
<link type="TEXT">
<categories>
<category id="1">Apparel</category>
</categories>
<promotiontypes>
<promotiontype id="11">Percentage off</promotiontype>
</promotiontypes>
<offerdescription>25% Off Boys' Quiksilver Apparel. Shop now at Macys.com! Valid 7/23 through 7/25!</offerdescription>
<offerstartdate>2020-07-24</offerstartdate>
<offerenddate>2020-07-26</offerenddate>
<clickurl>https://click.synergy.com/fs-bin/click?id=ZZvk49eM&offerid=777210.100474695&type=3&subid=0</clickurl>
<impressionpixel>https://ad.synergy.com/fs-bin/show?id=ZZvk49NAwbids=777210.100474695&type=3&subid=0</impressionpixel>
<advertiserid>3184</advertiserid>
<advertisername>cys.com</advertisername>
<network id="1">US Network</network>
</link>
My Code
from xml.dom import minidom
import urllib
import pandas as pd
url = "http://couponfeed.synergy.com/coupon?token=xxxxxxxxx122b&network=1&resultsperpage=500"
xmldoc = minidom.parse(urllib.request.urlopen(url))
#itemlist = xmldoc.getElementsByTagName('clickurl')
df_cols = ["promotiontype","category","offerdescription", "offerstartdate", "offerenddate", "clickurl","impressionpixel","advertisername","network"]
rows = []
for entry in xmldoc.couponfeed:
s_promotiontype = couponfeed.get("promotiontype","")
s_category = couponfeed.get("category","")
s_offerdescription = couponfeed.get("offerdescription", "")
s_offerstartdate = couponfeed.get("offerstartdate", "")
s_offerenddate = couponfeed.get("offerenddate", "")
s_clickurl = couponfeed.get("clickurl", "")
s_impressionpixel = couponfeed.get("impressionpixel", "")
s_advertisername = couponfeed.get("advertisername","")
s_network = couponfeed.get ("network","")
rows.append({"promotiontype":s_promotiontype, "category": s_category, "offerdescription": s_offerdescription,
"offerstartdate": s_offerstartdate, "offerenddate": s_offerenddate,"clickurl": s_clickurl,"impressionpixel":s_impressionpixel,
"advertisername": s_advertisername,"network": s_network})
out_df = pd.DataFrame(rows, columns=df_cols)
out_df.to_csv(r"C:\\Users\rai\Downloads\\merchants_offers_share.csv", index=False)
Trying easy way but i dont get any results
import lxml.etree as ET
import urllib
response = urllib.request.urlopen('http://couponfeed.synergy.com/coupon?token=xxxxxd39f4e5fe392a25538bb122b&network=1&resultsperpage=500')
xml = response.read()
root = ET.fromstring(xml)
for item in root.findall('.//item'):
title = item.find('category').text
print (title)
another try
from lxml import etree
import pandas as pd
import urllib
url = "http://couponfeed.synergy.com/coupon?token=xxxxxxd39f4e5fe392a25538bb122b&network=1&resultsperpage=500"
xtree = etree.parse(urllib.request.urlopen(url))
for value in xtree.xpath("/root/couponfeed/categories"):
print(value.text)

Another method.
from simplified_scrapy import SimplifiedDoc, utils, req
# html = req.get('http://couponfeed.synergy.com/coupon?token=xxxxxxxxx122b&network=1&resultsperpage=500')
html = '''
<couponfeed>
<TotalMatches>1459</TotalMatches>
<TotalPages>3</TotalPages>
<PageNumberRequested>1</PageNumberRequested>
<link type="TEXT">
<categories>
<category id="1">Apparel</category>
</categories>
<promotiontypes>
<promotiontype id="11">Percentage off</promotiontype>
</promotiontypes>
<offerdescription>25% Off Boys Quiksilver Apparel. Shop now at Macys.com! Valid 7/23 through 7/25!</offerdescription>
<offerstartdate>2020-07-24</offerstartdate>
<offerenddate>2020-07-26</offerenddate>
<clickurl>https://click.synergy.com/fs-bin/click?id=Z&offerid=777210.100474694&type=3&subid=0</clickurl>
<impressionpixel>https://ad.synergy.com/fs-bin/show?id=ZNAweM&bids=777210.100474694&type=3&subid=0</impressionpixel>
<advertiserid>3184</advertiserid>
<advertisername>cys.com</advertisername>
<network id="1">US Network</network>
</link>
</couponfeed>
'''
doc = SimplifiedDoc(html)
df_cols = [
"promotiontype", "category", "offerdescription", "offerstartdate",
"offerenddate", "clickurl", "impressionpixel", "advertisername", "network"
]
rows = [df_cols]
links = doc.couponfeed.links # Get all links
for link in links:
row = []
for col in df_cols:
row.append(link.select(col).text) # Get col text
rows.append(row)
utils.save2csv('merchants_offers_share.csv', rows) # Save to csv file
Result:
promotiontype,category,offerdescription,offerstartdate,offerenddate,clickurl,impressionpixel,advertisername,network
Percentage off,Apparel,25% Off Boys Quiksilver Apparel. Shop now at Macys.com! Valid 7/23 through 7/25!,2020-07-24,2020-07-26,https://click.synergy.com/fs-bin/click?id=Z&offerid=777210.100474694&type=3&subid=0,https://ad.synergy.com/fs-bin/show?id=ZNAweM&bids=777210.100474694&type=3&subid=0,cys.com,US Network
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
Remove the last empty row
import io
with io.open('merchants_offers_share.csv', "rb+") as f:
f.seek(-1,2)
l = f.read()
if l == b"\n":
f.seek(-2,2)
f.truncate()

First, the xml document wasn't parsing because you copied a raw ampersand & from the source page, which is like a keyword in xml. When your browser renders xml (or html), it converts & into &.
As for the code, the easiest way to get the data is to iterate over df_cols, then execute getElementsByTagName for each column, which will return a list of elements for the given column.
from xml.dom import minidom
import pandas as pd
import urllib
limit = 500
url = f"http://couponfeed.synergy.com/coupon?token=xxxxxxxxx122b&network=1&resultsperpage={limit}"
xmldoc = minidom.parse(urllib.request.urlopen(url))
df_cols = ["promotiontype","category","offerdescription", "offerstartdate", "offerenddate", "clickurl","impressionpixel","advertisername","network"]
# create an object for each row
rows = [{} for i in range(limit)]
nodes = xmldoc.getElementsByTagName("promotiontype")
node = nodes[0]
for row_name in df_cols:
# get results for each row_name
nodes = xmldoc.getElementsByTagName(row_name)
for i, node in enumerate(nodes):
rows[i][row_name] = node.firstChild.nodeValue
out_df = pd.DataFrame(rows, columns=df_cols)
nodes = et.getElementsByTagName("promotiontype")
node = nodes[0]
for row_name in df_cols:
nodes = et.getElementsByTagName(row_name)
for i, node in enumerate(nodes):
rows[i][row_name] = node.firstChild.nodeValue
out_df = pd.DataFrame(rows, columns=df_cols)
This isn't the most efficient way to do this, but I'm not sure how else to using minidom. If efficiency is a concern, I'd recommend using lxml instead.

Assuming no issue with parsing your XML from URL (since link is not available on our end), your first lxml can work if you parse on actual nodes. Specifically, there is no <item> node in XML document.
Instead use link. And consider a nested list/dict comprehension to migrate content to a data frame. For lxml you can swap out findall and xpath to return same result.
df = pd.DataFrame([{item.tag: item.text if item.text.strip() != "" else item.find("*").text
for item in lnk.findall("*") if item is not None}
for lnk in root.findall('.//link')])
print(df)
# categories promotiontypes offerdescription ... advertiserid advertisername network
# 0 Apparel Percentage off 25% Off Boys Quiksilver Apparel. Shop now at M... ... 3184 cys.com US Network
# 1 Apparel Percentage off 25% Off Boys' Quiksilver Apparel. Shop now at ... ... 3184 cys.com US Network

Parsing nested xml with lxml and Python

I am having trouble parsing XML when it is in the form of:
<Cars>
<Car>
<Color>Blue</Color>
<Make>Ford</Make>
<Model>Mustant</Model>
</Car>
<Car>
<Color>Red</Color>
<Make>Chevy</Make>
<Model>Camaro</Model>
</Car>
</Cars>
I have figured out how to parse 1st level children like this:
<Car>
<Color>Blue</Color>
<Make>Chevy</Make>
<Model>Camaro</Model>
</Car>
With this kind of code:
from lxml import etree
a = os.path.join(localPath,file)
element = etree.parse(a)
cars = element.xpath('//Root/Foo/Bar/Car/node()[text()]')
parsedCars = [{field.tag: field.text for field in cars} for action in cars]
print parsedCars[0]['Make'] #Chevy
How can I parse our multiple "Car" tags that is a child tag of "Cars"?

Try this
from lxml import etree
a = os.path.join(localPath,file)
element = etree.parse(a)
cars = element.xpath('//Root/Foo/Bar/Car')
for car in cars:
colors = car.xpath('./Color')
makes = car.xpath('./Make')
models = car.xpath('./Model')

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extracting similar XML attributes with BeautifulSoup - python

Related

Creating dataframe from XML file with non-unique tags

Retrieving text data from <content:encoded> in XML file

How to change tags with lxml in Python?

Python XML Parse and getElementsByTagName

Parsing nested xml with lxml and Python

Categories

Resources