inserting html into an xml query using CDATA - python

I am trying to insert the content of an HTML file into an XML request.
I am opening the HTML file this way:
page = open(html).read()
then inserting the content into the XML this way:
"<Description><![CDATA["+page+"]]</Description>"+\
This errors out as follows:
XML Parse error. XML Error Text: "; nested exception is:
org.xml.sax.SAXParseException: XML document structures must start and
end within the same entity."
I'm assuming I have to do a bit more than just dump the content of the HTML file into a CDATA tag? Or maybe do it a different way?

Two potential issues.
First, the correct way to end a CDATA block is with ]]>, not ]].
Second, your HTML data might itself contain the sequence ]]>, which would terminate the CDATA section early; nested CDATA blocks are not allowed. You might consider encoding your HTML data instead, using Base64 for example:
import base64
encPage = base64.b64encode(page)
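On Python 3, b64encode operates on bytes rather than str; a minimal round-trip sketch (with a hypothetical page string standing in for your HTML file's contents) might look like:

```python
import base64

# Hypothetical stand-in for the HTML file contents from the question
page = "<html><body>tricky ]]> sequence</body></html>"

# Encode to ASCII-safe text before embedding it in the XML request
enc_page = base64.b64encode(page.encode("utf-8")).decode("ascii")
xml = "<Description>" + enc_page + "</Description>"

# The receiving side decodes it back to the original HTML
decoded = base64.b64decode(enc_page).decode("utf-8")
assert decoded == page
```

Since the Base64 alphabet contains no markup characters, the payload needs no CDATA section at all.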

You forgot the closing > for the CDATA element:
"<Description><![CDATA["+page+"]]></Description>"+\

XML parser in BeautifulSoup only scrapes the first symbol out of two

I wish to read symbols from some XML content stored in a text file. When I use xml as the parser, I get the first symbol only; however, I get both symbols when I use the lxml parser. Here is the XML content.
<?xml version="1.0" encoding="utf-8"?>
<lookupdata symbolstring="WDS">
<key>
<symbol>WDS</symbol>
<exchange>NYE</exchange>
<openfigi>BBG001S5WCY6</openfigi>
<qmidentifier>USI79Z473117AAG</qmidentifier>
</key>
<equityinfo>
<longname>
Woodside Energy Group Limited American Depositary Shares each representing one
</longname>
<shortname>Woodside Energy </shortname>
<instrumenttype>equity</instrumenttype>
<sectype>DR</sectype>
<isocfi>EDSXFR</isocfi>
<issuetype>AD</issuetype>
<proprietaryquoteeligible>false</proprietaryquoteeligible>
</equityinfo>
</lookupdata>
<lookupdata symbolstring="PAM">
<key>
<symbol>PAM</symbol>
<exchange>NYE</exchange>
<openfigi>BBG001T5K0S1</openfigi>
<qmidentifier>USI68Z3Z75887AS</qmidentifier>
</key>
<equityinfo>
<longname>Pampa Energia S.A.</longname>
<shortname>PAM</shortname>
<instrumenttype>equity</instrumenttype>
<sectype>DR</sectype>
<isocfi>EDSXFR</isocfi>
<issuetype>AD</issuetype>
</equityinfo>
</lookupdata>
When I read the xml content from a text file and parse the symbols, I get only the first symbol.
from bs4 import BeautifulSoup

with open("input_xml.txt") as infile:
    item = infile.read()

soup = BeautifulSoup(item, "xml")
for item in soup.select("lookupdata symbol"):
    print(item.text)
current output:
WDS
If I replace xml with lxml in BeautifulSoup(item,"xml"), I get both symbols. When I use lxml, a warning pops up, though.
As the content is xml, I would like to stick to xml parser instead of lxml.
Expected output:
WDS
PAM
The issue seems to be that the built-in xml parser only loads the first item; it just stops after the first lookupdata ends. Given that all the examples in the xml docs have some top-level container element, I'm assuming it simply stops parsing once the first top-level element ends (though I am not sure, it is just an assumption). You can add a print(soup) after you load it in to see what it is using.
You could use BeautifulSoup(item, "html.parser") which uses the builtin html library, which works.
Or, to keep using the xml library, surround it with some top-level dummy element, like:
from bs4 import BeautifulSoup

with open("input_xml.txt") as infile:
    item = infile.read()

patched = f"<root>{item}</root>"
soup = BeautifulSoup(patched, "xml")
for found in soup.select("lookupdata symbol"):
    print(found.text)
Output:
WDS
PAM
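The same dummy-root trick works with the standard library's ElementTree, if you would rather avoid BeautifulSoup entirely; a minimal sketch with the XML inlined rather than read from input_xml.txt:

```python
import xml.etree.ElementTree as ET

# Inline stand-in for the two top-level documents in input_xml.txt
content = (
    '<lookupdata symbolstring="WDS"><key><symbol>WDS</symbol></key></lookupdata>'
    '<lookupdata symbolstring="PAM"><key><symbol>PAM</symbol></key></lookupdata>'
)

# Wrapping in a dummy root makes the input a single well-formed document
root = ET.fromstring(f"<root>{content}</root>")
symbols = [el.text for el in root.iter("symbol")]
print(symbols)  # → ['WDS', 'PAM']
```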

How to add extracted html in yattag?

I am trying to create HTML via yattag. The only problem is that the header comes from another HTML file, so I read that file and try to insert its contents as the header. Here is the problem: though I pass an unescaped HTML string, yattag escapes it, i.e. it converts '<' to '&lt;' while adding it to the HTML string.
MWE:
from yattag import Doc, indent
import html

doc, tag, text = Doc().tagtext()

h = open(nbheader_template, 'r')
h_content = h.read()
h_content = html.unescape(h_content)

doc.asis('<!DOCTYPE html>')
with tag('html'):
    # insert dummy head
    with tag('head'):
        text(h_content)  # just some dummy text to replace later - workaround for now
    with tag('body'):
        # insert as many divs as no of files
        for i in range(counter):
            with tag('div', id='divID_' + str(1)):
                text('Div Page: ' + str(i))

result = indent(doc.getvalue())
# inject raw head - dirty workaround as yattag not doing it
# result = result.replace('<head>headtext</head>', h_content)
with open('test.html', "w") as file:
    file.write(result)
Output: (screenshot of the escaped header markup, omitted here)
Context: I am trying to combine multiple Jupyter Python notebooks into a single HTML file, which is why the header is heavy. The header content (nbheader_template) can be found here
If you want to prevent the escaping you have to use doc.asis instead of text.
The asis method appends a string to the document without any form of escaping.
See also the documentation.
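To see why text() mangles the header: it escapes markup characters in its argument, much like the standard library's html.escape, while asis() appends the string verbatim. A small stdlib-only illustration (the header fragment is hypothetical):

```python
import html

# Hypothetical header fragment like the one read from nbheader_template
h_content = "<script src='nb.js'></script>"

# What text() effectively does: escape markup, so tags show up literally
escaped = html.escape(h_content)
print(escaped)  # → &lt;script src=&#x27;nb.js&#x27;&gt;&lt;/script&gt;

# What asis() does: append the raw string, preserving the tags
print(h_content)  # → <script src='nb.js'></script>
```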
I'm not fully fluent in 'yattag', but the one thing I see missing is:
with tag('body'):
The code you have quoted (above) is placing <div> and text into your header, where it clearly doesn't belong.

Converting Python's bytes object to a string causes data inside html to disappear

I’m trying to read HTML content and extract only the data (such as the lines in a Wikipedia article). Here’s my code in Python:
import urllib.request
from html.parser import HTMLParser

urlText = []

# Define HTML parser
class parseText(HTMLParser):
    def handle_data(self, data):
        print(data)
        if data != '\n':
            urlText.append(data)

def main():
    thisurl = "https://en.wikipedia.org/wiki/Python_(programming_language)"
    # Create instance of HTML parser (the above class)
    lParser = parseText()
    # Feed HTML file into parser. The handle_data method is implicitly called.
    with urllib.request.urlopen(thisurl) as url:
        htmlAsBytes = url.read()
        # print(htmlAsBytes)
        htmlAsString = htmlAsBytes.decode(encoding="utf-8")
        # print(htmlAsString)
        lParser.feed(htmlAsString)
    lParser.close()
    # for item in urlText:
    #     print(item)
I do get the HTML content from the webpage and if I print the bytes object returned by the read() method, it looks like I receive all the HTML content of the webpage. However, when I try to parse this content to get rid of the tags and store only the readable data, I’m not getting the result I expect at all.
The problem is that in order to use the feed() method of the parser, one has to convert the bytes object to a string. To do that you use the decode() method, which receives the encoding with which to do the conversion. If I print the decoded string, the content printed doesn’t contain the data itself (the useful readable data I’m trying to extract). Why does that happen and how can I solve this?
Note: I'm using Python 3.
Thanks for the help.
All right, I eventually used beautifulsoup to do the job, as Alden recommended, but I still don't know why the decoding process mysteriously gets rid of the data.
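For reference, the HTMLParser approach itself does work once the page is decoded; here is a self-contained sketch fed an inline page instead of the live Wikipedia URL:

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects non-whitespace text nodes, like the parseText class above."""
    def __init__(self):
        super().__init__()
        self.url_text = []

    def handle_data(self, data):
        if data.strip():
            self.url_text.append(data.strip())

parser = TextCollector()
# Hypothetical page standing in for the decoded Wikipedia HTML
parser.feed("<html><body><h1>Python</h1><p>A programming language.</p></body></html>")
parser.close()
print(parser.url_text)  # → ['Python', 'A programming language.']
```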

How to read an entire web page into a variable

I am trying to read an entire web page and assign it to a variable, but am having trouble doing that. The variable seems to only be able to hold the first 512 or so lines of the page source.
I tried using readlines() to just print all lines of the source to the screen, and that gave me the source in its entirety, but I need to be able to parse it with regex, so I need to store it in a variable somehow. Help?
data = urllib2.urlopen(url)
print data
Only gives me about 1/3 of the source.
data = urllib2.urlopen(url)
for lines in data.readlines():
    print lines
This gives me the entire source.
Like I said, I need to be able to parse the string with regex, but the part I need isn't in the first 1/3 I'm able to store in my variable.
You probably are looking for beautiful soup: http://www.crummy.com/software/BeautifulSoup/ It's an open source web parsing library for python. Best of luck!
You should be able to use file.read() to read the entire file into a string. That will give you the entire source. Something like
data = urllib2.urlopen(url)
print data.read()
should give you the entire webpage.
From there, don't parse HTML with regex (well-worn post to this effect here), but use a dedicated HTML parser instead. Alternatively, clean up the HTML and convert it to XHTML (for instance with HTML Tidy), and then use an XML parsing library like the standard ElementTree. Which approach is best depends on your application.
Actually, print data should not give you any HTML content, because it's just a file-like object. From the official documentation https://docs.python.org/2/library/urllib2.html:
This function returns a file-like object
This is what I got :
print data
<addinfourl at 140131449328200 whose fp = <socket._fileobject object at 0x7f72e547fc50>>
readlines() returns a list of the lines of the HTML source, and you can store them in a string like:
import urllib2

data = urllib2.urlopen(url)
l = []
s = ''
for line in data.readlines():
    l.append(line)
s = ''.join(l)  # each line already ends with '\n', so join with the empty string
You can either use list l or string s, according to your need.
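The readlines-then-join step can be tried with any file-like object; here io.StringIO stands in for the response object, and a single read() gives the same string:

```python
import io

# StringIO stands in for the file-like object urllib2.urlopen() returns
data = io.StringIO("line one\nline two\nline three\n")

# Joining readlines() with '' (lines keep their '\n') equals one big read()
s = ''.join(data.readlines())
data.seek(0)
print(s == data.read())  # → True
```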
I would also recommend using open-source web-parsing libraries for easy work rather than regex for complete HTML parsing; you may still need a regex for URL parsing.
If you want to parse the variable afterwards, you might use gazpacho:
from gazpacho import Soup
url = "https://www.example.com"
soup = Soup.get(url)
str(soup)
That way you can perform finds to extract the information you're after!

How to insert a base64 file into a specific tag using Python

<book>
<title>sponge bob</title>
<author>Joe Doe</author>
<file>Tbase</file>
</book>
I have 2 files: one is an XML file and the other is a base64 file. I would like to know how to insert and replace the string "Tbase" with the content of the base64 file using Python.
Are you wanting to put the verbatim contents of the base64 file (still base64 encoded) into the XML file, in place of "Tbase"? If that's the case, you could just do something like:
xml = open("xmlfile.xml").read()
b64file = open("b64file.base64").read()
open("xmlfile.xml", "w").write(xml.replace("Tbase", b64file))
(If you're on Python 2.6 or later, you can do this a little bit cleaner using with statements, but that's another discussion.)
If you want to decode the base64 file first, and place the decoded contents into the XML file, then you'd replace b64file on the last line of the example above with b64file.decode("base64").
Of course, doing simple text replacement, as above, opens you up to the problems you'll have if, say, the title or author contain "Tbase" as well. A better way would be to use an actual XML parsing library, like so:
from xml.etree.ElementTree import fromstring, tostring
xml = fromstring(open("xmlfile.xml").read())
xml.find("file").text = open("b64file.base64").read()
open("xmlfile.xml", "w").write(tostring(xml))
This sets the contents of the <file> tag to be the contents of the file b64file.base64, regardless of what its former contents were and regardless of whether "Tbase" appears elsewhere in the XML document.
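Combining the ElementTree version with a real base64 payload (the encoded bytes here are a hypothetical stand-in for b64file.base64) might look like:

```python
import base64
from xml.etree.ElementTree import fromstring, tostring

xml_src = "<book><title>sponge bob</title><author>Joe Doe</author><file>Tbase</file></book>"
# Hypothetical base64 content standing in for the b64file.base64 file
b64_data = base64.b64encode(b"attachment bytes").decode("ascii")

tree = fromstring(xml_src)
tree.find("file").text = b64_data  # replaces the Tbase placeholder
out = tostring(tree, encoding="unicode")
print(out)
```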
