Does tidylib damage my HTML file? - python

I am using Python 3.5 and, in some cases, when I call tidylib.tidy_document on an HTML file, the '/' character at the end of the <link ../> tag in the header gets removed. Tidylib does not give any errors or warnings when it removes this character.
The HTML file I am using is part of an Epub generated with writer2epub. The
error occurs in almost all files in this Epub. The only exceptions are very
short ones (e.g. titlepage of the document). The error is the same in all
affected files.
I suspected a problem with the use of carriage returns (0x0d) instead of linefeeds (0x0a), but changing them doesn't make a difference. I also see that the file contains various other non-ASCII characters, so maybe they're to blame. Googling for Unicode problems with tidylib didn't turn up anything that seems related to this problem.
I have uploaded a test file that reproduces the problem with the following code:
import re
from tidylib import tidy_document

def printLink(html):
    """ Print the <link> tag from the HTML header """
    for line in html.split('\n'):
        match = re.search('<link[^>]+>', line)
        if match is not None:
            print(match.group(0))

if __name__ == '__main__':
    fname = 'test04.xhtml'
    print(fname)
    with open(fname, 'r') as fh:
        html = fh.read()
    print('checkpoint 01')
    printLink(html)
    newHtml, errors = tidy_document(html)
    print('checkpoint 02')
    printLink(newHtml)
If the problem is reproduced, the output will be:
<link rel="stylesheet" href="../styles/style001.css" type="text/css" />
at checkpoint 01 and
<link rel="stylesheet" href="../styles/style001.css" type="text/css">
at checkpoint 02.
What is causing tidylib to remove this one '/' character?
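For reference, pytidylib accepts HTML Tidy configuration options as a dict passed to tidy_document. A minimal sketch (assuming the goal is to keep the XHTML-style self-closing slash, and that the underlying Tidy build supports the standard output-xhtml option):

from tidylib import tidy_document

with open('test04.xhtml', 'r') as fh:
    html = fh.read()

# Request XHTML output so self-closing tags like <link ... /> keep their slash
newHtml, errors = tidy_document(html, options={'output-xhtml': 1})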

Related

How to load a yaml file from url to process in python?

I have a YAML file stored at a URL location. How do I load it into Python for processing?
This is the code I use to read it and then print it out to verify, but the output doesn't look like YAML; it looks like HTML to me.
code:
import yaml
import urllib
from urllib import request
x = urllib.request.urlopen("https://git.myplace.net/projects/groups%2users.yaml")
User_Object = yaml.load(x)
print(User_Object)
...
The output looks like:
anch:create-branch-action":{"serverCondition":false}});}(_PageDataPlugin));</script><meta name="application-name" content="Bitbucket"><link rel="shortcut icon" type="image/x-icon" href="/s/-1051105741/5ab4b55/261/1.0/_/download/resources/com.atlassian.bitbucket.server.bitbucket-webpack-INTERNAL:favicon/favicon.ico" /><link rel="search" href="https://git.cnvrmedia.net/plugins/servlet/opensearch-descriptor" type="application/opensearchdescription+xml" title="Bitbucket code search"/></head><body class="aui-page-sidebar bitbucket-theme"><ul id="assistive-skip-links" class="assistive"><li>Skip to sidebar navigation</li><li>Skip to content</li></ul><div id="page"><!-- start
The file name is "groups+users.yaml". What is the best way to read it as YAML so Python can parse and process it?
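For what it's worth, once you have a URL that returns the raw YAML text (rather than the repository's HTML viewer page), the parsing side is short. A minimal sketch, assuming a hypothetical raw-file URL and no authentication:

import urllib.request
import yaml

# Hypothetical URL that serves the raw file contents instead of the HTML viewer page
raw_url = 'https://git.myplace.net/projects/raw/groups%2Busers.yaml'

with urllib.request.urlopen(raw_url) as response:
    users = yaml.safe_load(response.read())

print(users)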

Mismatched Tag Error While Parsing XML?

I'm writing this script that downloads an HTML document from http://example.com/ and attempts to parse it as an XML by using:
with urllib.request.urlopen("http://example.com/") as f:
    tree = xml.etree.ElementTree.parse(f)
However, I keep getting a ParseError: mismatched tag error, supposedly at line 1, column 2781, so I downloaded the file manually (Ctrl+S in my browser) and checked it. That position points to a place in the middle of a string, not even near the EOF, but there were a few lines before the actual 2781st character, so that might have thrown off my calculation of the exact position. I then tried to download and actually write the response to a file, to parse it later:
response = urllib.request.urlopen("http://example.com/")
f = open("test.html", "wb")
f.write(response.read())
f.close()
html = open("test.html", "r")
tree = xml.etree.ElementTree.parse(html)
And I'm still getting the same mismatched tag error at the same column, but this time I opened the downloaded HTML and the only content near column 2781 is this:
;</script></head><body class
And the exact 2781st column marks the first "h" in </head>, so what could be wrong here? Am I missing something?
Edit:
I've been looking more into it and tried to parse the XML using another parser, this time minidom, but I'm still getting the exact same error at the exact same position. What could be the problem here? This also happens even though I've downloaded the file in several different ways (urllib, curl, wget, even Ctrl+Save in the browser), and the result is the same.
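(For reference, the minidom attempt presumably looked something like the sketch below; xml.dom.minidom.parseString is the standard-library call, and the URL is the same placeholder as above.)

import urllib.request
import xml.dom.minidom

with urllib.request.urlopen("http://example.com/") as f:
    dom = xml.dom.minidom.parseString(f.read())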
Edit 2:
This is what I've tried so far:
This is an example XML I just took from the API doc and saved to text.html:
<html>
<head>
<title>Example page</title>
</head>
<body>
<p>Moved to example.org
or example.com.</p>
</body>
</html>
And I tried:
with urllib.request.urlopen("text.html") as f:
    tree = xml.etree.ElementTree.parse(f)
And it works, then:
with urllib.request.urlopen("text.html") as f:
    tree = xml.etree.ElementTree.fromstring(f.read())
And it also works, but:
with urllib.request.urlopen("http://example.com/") as f:
    xml.etree.ElementTree.parse(f)
Doesn't, also tried:
with urllib.request.urlopen("http://example.com/") as f:
    xml.etree.ElementTree.fromstring(f.read())
And it doesn't work either. What could be the problem? As far as I can tell, the document doesn't have mismatched tags. Perhaps it's too large? It's only 95.2 KB.
You can use bs4 to parse this page. Like this:
import bs4
import urllib.request

url = 'http://boards.4chan.org/wsg/thread/629672/i-just-lost-my-marauder-on-eve-i-need-a-ylyl'
# the proxy is only needed if your network requires one
proxy = urllib.request.ProxyHandler({'http': 'http://www-proxy.ericsson.se:8080'})
opener = urllib.request.build_opener(proxy)

info = opener.open(url).read()
soup = bs4.BeautifulSoup(info, 'html.parser')
print(soup.a)
OUTPUT:
a
You can download bs4 from PyPI (pip install beautifulsoup4).
Based on the urllib and ElementTree documentation, this code snippet seemed to work without error for your sample URL.
import urllib.request
import xml.etree.ElementTree as ET

with urllib.request.urlopen('http://boards.4chan.org/wsg/thread/629672/i-just-lost-my-marauder-on-eve-i-need-a-ylyl') as response:
    html = response.read()
root = ET.fromstring(html)
If you don't want to read the response into a variable before parsing it with ElementTree, this also works:
with urllib.request.urlopen('http://boards.4chan.org/wsg/thread/629672/i-just-lost-my-marauder-on-eve-i-need-a-ylyl') as response:
    tree = ET.parse(response)
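That said, if the page is ordinary HTML rather than well-formed XHTML, an XML parser will keep raising ParseError on the first construct it cannot handle. A minimal diagnostic sketch (standard library only; err.position gives the line and column the parser reports, so you can print the surrounding text instead of counting columns by hand):

import urllib.request
import xml.etree.ElementTree as ET

url = 'http://boards.4chan.org/wsg/thread/629672/i-just-lost-my-marauder-on-eve-i-need-a-ylyl'
data = urllib.request.urlopen(url).read().decode('utf-8', errors='replace')
try:
    ET.fromstring(data)
except ET.ParseError as err:
    line, col = err.position
    context = data.splitlines()[line - 1]
    # show the text around the reported error position
    print(err)
    print(context[max(0, col - 40):col + 40])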

Python search and replace not working

I have the following simple HTML file.
<html data-noop=="http://www.w3.org/1999/xhtml">
<head>
<title>Hello World</title>
</head>
<body>
SUMMARY1
hello world
</body>
</html>
I want to read this into a Python script and replace SUMMARY1 with the text "hi there" (say). I do the following in Python:
with open('test.html','r') as htmltemplatefile:
    htmltemplate = htmltemplatefile.read().replace('\n','')
htmltemplate.replace('SUMMARY1','hi there')
print htmltemplate
The above code reads the file into the variable htmltemplate.
Next I call the replace() method of the string object to replace the pattern SUMMARY1 with "hi there", but the output shows that SUMMARY1 was not replaced. Here is what I'm getting.
<html data-noop=="http://www.w3.org/1999/xhtml"><head><title>Hello World</title></head><body>SUMMARY1hello world</body></html>
Could someone point out what I'm doing wrong here?
open() does not return a str, it returns a file object. Additionally, you are only opening it for reading ('r'), not for writing.
What you want to do is something like:
new_lines = []
with open('test.html', 'r') as f:
    new_lines = f.readlines()

with open('test.html', 'w') as f:
    f.writelines([x.replace('a', 'b') for x in new_lines])
The fileinput library makes this a lot easier.
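A minimal sketch with fileinput (inplace=True redirects print output back into the file being read; the replacement here reuses the names from the question):

import fileinput

# rewrite test.html in place, replacing the placeholder on every line
for line in fileinput.input('test.html', inplace=True):
    print(line.replace('SUMMARY1', 'hi there'), end='')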

Loading mako templates from files

I'm new to Python and currently trying to use Mako templating.
I want to be able to take an HTML file and add a template to it from another HTML file.
Let's say I have this index.html file:
<html>
<head>
<title>Hello</title>
</head>
<body>
<p>Hello, ${name}!</p>
</body>
</html>
and this name.html file:
world
(yes, it just has the word world inside).
I want the ${name} in index.html to be replaced with the content of the name.html file.
I've been able to do this without the name.html file, by stating in the render method what name is, using the following code:
#route(':filename')
def static_file(filename):
    mylookup = TemplateLookup(directories=['html'])
    mytemplate = mylookup.get_template('hello/index.html')
    return mytemplate.render(name='world')
This is obviously not useful for larger pieces of text. Now all I want is to simply load the text from name.html, but haven't yet found a way to do this. What should I try?
return mytemplate.render(name=open(<path-to-file>).read())
Thanks for the replies.
The idea is to use the mako framework since it does things like cache and check if the file has been updated...
This code eventually seems to work:
#route(':filename')
def static_file(filename):
    mylookup = TemplateLookup(directories=['.'])
    mytemplate = mylookup.get_template('index.html')
    temp = mylookup.get_template('name.html').render()
    return mytemplate.render(name=temp)
Thanks again.
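As a side note, Mako can also pull the second file in with its <%include> directive, which keeps lookup, caching, and reload checks inside Mako. A minimal sketch (assuming name.html lives in the same lookup directory):

from mako.lookup import TemplateLookup

# If index.html instead contains:  <p>Hello, <%include file="name.html"/>!</p>
# then no explicit render of name.html is needed:
lookup = TemplateLookup(directories=['.'])
print(lookup.get_template('index.html').render())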
Did I understand you correctly that all you want is to read the content from a file? If you want to read the complete content, use something like this (Python >= 2.5):
from __future__ import with_statement

with open(my_file_name, 'r') as fp:
    content = fp.read()
Note: The from __future__ line has to be the first line in your .py file (or right after the content encoding specification that can be placed in the first line)
Or the old approach:
fp = open(my_file_name, 'r')
try:
    content = fp.read()
finally:
    fp.close()
If your file contains non-ASCII characters, you should also take a look at the codecs page :-)
Then, based on your example, the last section could look like this:
from __future__ import with_statement

#route(':filename')
def static_file(filename):
    mylookup = TemplateLookup(directories=['html'])
    mytemplate = mylookup.get_template('hello/index.html')
    content = ''
    with open('name.html', 'r') as fp:
        content = fp.read()
    return mytemplate.render(name=content)
You can find more details about the file object in the official documentation :-)
There is also a shortcut version:
content = open('name.html').read()
But I personally prefer the long version with the explicit closing :-)
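On Python 3.4+ there is also a pathlib variant that stays short but still closes the file; a minimal sketch:

from pathlib import Path

# read_text() opens, reads, and closes the file in one call
content = Path('name.html').read_text()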

Edit and create HTML file using Python

I am currently working on an assignment that involves creating an HTML file using Python. I understand how to read an HTML file into Python and then edit and save it.
table_file = open('abhi.html', 'w')
table_file.write('<!DOCTYPE html><html><body>')
table_file.close()
The problem with the above piece is that it just replaces the whole HTML file with the string inside write(). How can I edit the file and at the same time keep its content intact? I mean, writing something like this, but inside the body tags:
<link rel="icon" type="image/png" href="img/tor.png">
I need the link to automatically go in between the opening and closing body tags.
You probably want to read up on BeautifulSoup:
import bs4

# load the file
with open("existing_file.html") as inf:
    txt = inf.read()
    soup = bs4.BeautifulSoup(txt)

# create new link
new_link = soup.new_tag("link", rel="icon", type="image/png", href="img/tor.png")

# insert it into the document
soup.head.append(new_link)

# save the file again
with open("existing_file.html", "w") as outf:
    outf.write(str(soup))
Given a file like
<html>
<head>
<title>Test</title>
</head>
<body>
<p>What's up, Doc?</p>
</body>
</html>
this produces
<html>
<head>
<title>Test</title>
<link href="img/tor.png" rel="icon" type="image/png"/></head>
<body>
<p>What's up, Doc?</p>
</body>
</html>
(Note: it has munched the whitespace, but got the HTML structure correct.)
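If the tag really does need to land between the <body> tags, as the question asks, the same pattern applies; a minimal sketch reusing the file name from the answer above:

import bs4

with open("existing_file.html") as inf:
    soup = bs4.BeautifulSoup(inf.read(), "html.parser")

# same idea, but append to <body> instead of <head>
new_link = soup.new_tag("link", rel="icon", type="image/png", href="img/tor.png")
soup.body.append(new_link)

with open("existing_file.html", "w") as outf:
    outf.write(str(soup))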
