Pulling doctype using beautifulsoup [duplicate] - python

I've just started tinkering with Scrapy in conjunction with BeautifulSoup. I may be missing something very obvious, but I can't figure out how to get the doctype of a returned HTML document from the resulting soup object.
Given the following html:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<meta charset=utf-8 />
<meta name="viewport" content="width=620" />
<title>HTML5 Demos and Examples</title>
<link rel="stylesheet" href="/css/html5demos.css" type="text/css" />
<script src="js/h5utils.js"></script>
</head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>
Can anyone tell me if there's a way of extracting the declared doctype from it using BeautifulSoup?

Beautiful Soup 4 has a class for DOCTYPE declarations, so you can use that to extract all the declarations at top level (though you're no doubt expecting one or none!):
import bs4
def doctype(soup):
    # Collect every top-level Doctype node; return the first one, or None
    items = [item for item in soup.contents if isinstance(item, bs4.Doctype)]
    return items[0] if items else None

You can go through the top-level elements and check each one to see whether it is a declaration. Then you can inspect it to find out what kind of declaration it is:
# Assumption: Beautiful Soup 3, imported as `import BeautifulSoup as BS`
declaration = None
for child in soup.contents:
    if isinstance(child, BS.Declaration):
        declaration_type = child.string.split()[0]
        if declaration_type.upper() == 'DOCTYPE':
            declaration = child

You could just fetch the first item in soup.contents:
>>> soup.contents[0]
u'DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"'
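If Beautiful Soup is not a hard requirement, the standard library can do the same job. This is only a sketch using html.parser instead of Beautiful Soup; the names DoctypeFinder and find_doctype are made up for this example. It relies on HTMLParser.handle_decl, which receives the text between "<!" and ">".

```python
from html.parser import HTMLParser


class DoctypeFinder(HTMLParser):
    """Records the first DOCTYPE declaration seen while parsing."""

    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # decl is the declaration text without the surrounding "<!" and ">"
        if self.doctype is None and decl.upper().startswith("DOCTYPE"):
            self.doctype = decl


def find_doctype(markup):
    """Return the declared doctype as a string, or None if there is none."""
    finder = DoctypeFinder()
    finder.feed(markup)
    return finder.doctype


sample = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/html4/strict.dtd">\n'
    '<html lang="en"><head><title>t</title></head><body></body></html>'
)
print(find_doctype(sample))
print(find_doctype("<html></html>"))
```

The first call prints the declaration text and the second prints None, so a missing doctype is easy to detect.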

Related

How to get raw html with absolute links paths when using 'requests-html'

When making a request using the requests library to https://stackoverflow.com:
import requests
page = requests.get(url='https://stackoverflow.com')
print(page.content)
I get the following:
<!DOCTYPE html>
<html class="html__responsive html__unpinned-leftnav">
<head>
<title>Stack Overflow - Where Developers Learn, Share, & Build Careers</title>
<link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196">
<link rel="apple-touch-icon" href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a">
<link rel="image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a">
..........
The source code here has absolute paths. But when I request the same URL using requests-html with JS rendering:
from requests_html import HTMLSession
with HTMLSession() as session:
    page = session.get('https://stackoverflow.com')
    page.html.render()
    print(page.content)
I get the following:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>StackOverflow.org</title>
<script type="text/javascript" src="lib/jquery.js"></script>
<script type="text/javascript" src="lib/interface.js"></script>
<script type="text/javascript" src="lib/window.js"></script>
<link href="lib/dock.css" rel="stylesheet" type="text/css" />
<link href="lib/window.css" rel="stylesheet" type="text/css" />
<link rel="icon" type="image/gif" href="favicon.gif"/>
..........
The links here are relative paths.
How can I get the source code with absolute paths, as with requests, when using requests-html with JS rendering?
This should probably be a feature request for the requests-html developers. However, for now we can achieve this with this hackish solution:
from requests_html import HTMLSession
from lxml import etree
with HTMLSession() as session:
    html = session.get('https://stackoverflow.com').html
    html.render()
    # iterate over all links
    for link in html.pq('a'):
        if "href" in link.attrib:
            # Make links absolute
            link.attrib["href"] = html._make_absolute(link.attrib["href"])
    # Print html with only absolute links
    print(etree.tostring(html.lxml).decode())
We change the html object's underlying lxml tree by iterating over all links and making their locations absolute with the html object's private _make_absolute function.
The module's documentation at this link distinguishes between absolute and relative links.
Quote:
Grab a list of all links on the page, in absolute form (anchors
excluded):
r.html.absolute_links
Could you try this statement?
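For what it's worth, making a link absolute boils down to joining it with the page's URL. Here is a rough standard-library sketch of that step with urllib.parse.urljoin. It is not how requests-html does it internally; BASE_URL and absolutize are names invented for this example, and the base URL is assumed to be the page that was fetched.

```python
from urllib.parse import urljoin

# Assumption: the URL of the page the links were scraped from
BASE_URL = "https://stackoverflow.com/"


def absolutize(hrefs, base_url=BASE_URL):
    """Resolve each href against base_url; absolute hrefs pass through unchanged."""
    return [urljoin(base_url, href) for href in hrefs]


links = [
    "lib/jquery.js",
    "/questions",
    "https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico",
]
for resolved in absolutize(links):
    print(resolved)
```

The same join can be applied to src attributes on script tags and href attributes on link tags, not just to anchors.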

get parents element of a tag using python requests-HTML

Hi, is there any way to get all the parent elements of a tag using requests-HTML?
For example:
<!DOCTYPE html>
<html lang="en">
<body id="two">
<h1 class="text-primary">hello there</h1>
<p>one two tree<b>four</b>five</p>
</body>
</html>
I want to get all parents of the b tag: [html, body, p]
or, for the h1 tag, this result: [html, body]
With the excellent lxml:
from lxml import etree
html = """<!DOCTYPE html>
<html lang="en">
<body id="two">
<h1 class="text-primary">hello there</h1>
<p>one two tree<b>four</b>five</p>
</body>
</html> """
tree = etree.HTML(html)
# We search the first <b> element
b_elt = tree.xpath('//b')[0]
print(b_elt.text)
# -> "four"
# Walking around ancestors of this <b> element
ancestors_tags = [elt.tag for elt in b_elt.iterancestors()]
print(ancestors_tags)
# -> ['p', 'body', 'html']
You can access the lower-level lxml Element via the element attribute, which has an iterancestors() method.
Here is how you could do it:
from requests_html import HTML
html = """<!DOCTYPE html>
<html lang="en">
<body id="two">
<h1 class="text-primary">hello there</h1>
<p>one two tree<b>four</b>five</p>
</body>
</html>"""
html = HTML(html=html)
b = html.find('b', first=True)
parents = [a for a in b.element.iterancestors()]
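Note that iterancestors() yields the nearest parent first, while the question asks for the outermost element first. A small sketch of getting the requested order with plain lxml; ancestor_tags is a helper name made up here.

```python
from lxml import etree

markup = """<!DOCTYPE html>
<html lang="en">
<body id="two">
<h1 class="text-primary">hello there</h1>
<p>one two tree<b>four</b>five</p>
</body>
</html>"""


def ancestor_tags(element):
    """Tag names of all ancestors, outermost (root) first."""
    # iterancestors() walks from the parent up to the root, so reverse it
    return [elt.tag for elt in reversed(list(element.iterancestors()))]


tree = etree.HTML(markup)
print(ancestor_tags(tree.xpath("//b")[0]))
print(ancestor_tags(tree.xpath("//h1")[0]))
```

With requests-html the same helper should work on b.element, since that attribute exposes the underlying lxml element.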

python lxml.html add parameter

I have an HTML template where I want to add some content. The template looks like the following:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" >
<head>
<title>Data Base</title>
<link rel="stylesheet" href="stylesheet.css" />
</head>
<body>
<h1>Data Base</h1>
<div class="file_explorer">
</div>
<div class="info_screen">
</div>
</body>
</html>
I want to search for the <div class="file_explorer"></div> and add some paragraphs to it. Afterwards it should look like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" >
<head>
<title>Data Base</title>
<link rel="stylesheet" href="stylesheet.css" />
</head>
<body>
<h1>Data Base</h1>
<div class="file_explorer">
<p class="folder">Folder_1</p>
<p class="folder">Folder_2</p>
</div>
<div class="info_screen">
</div>
</body>
</html>
Therefore I tried to parse the HTML template and wanted to search for the file_explorer div to add the paragraphs. How do I search for it and add the paragraphs afterwards? I tried html.cssselector but it did not work. Please help me. That's my code:
from lxml import html
from os import path
class HtmlGenerator:
    @staticmethod
    def modify_html(html_path, list_folders):
        html_path = path.abspath(html_path)
        parser = html.HTMLParser(remove_blank_text=True)
        if path.isfile(html_path) and html_path.endswith(".html"):
            tree = html.parse(html_path, parser)
            # search for <div class="file_explorer"> [MISSING]
            for folder in list_folders:
                # add folder as paragraph to html [MISSING]
                pass
            tree.write(html_path, pretty_print=True)
Thanks in advance.
You can use XPath to find the target div in your template, and then use the E-factory to build the new elements:
from lxml.html import builder as E
....
tree = html.parse(html_path, parser)
root = tree.getroot()
# search for <div class="file_explorer">
div = root.find('.//div[@class="file_explorer"]')
for folder in list_folders:
    # add folder as paragraph to html
    # I assume `folder` is a string like 'Folder_1', 'Folder_2', ...
    div.append(E.P(E.CLASS('folder'), folder))
tree.write(html_path, pretty_print=True)

Unicode characters from SQLite to HTML

I'm building a web page using CGI and Python (yes, I know, horrible combination!). I have some unicode data stored inside my SQLite 3 database, which I load in my Python script. When it's time to combine those unicode characters with HTML, I'm viewing things like this in my web browser:
\xc3\xb3
Instead of:
ó
I think the problem isn't in the HTML code, because I have defined the encoding like this:
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
This is how I'm rendering the HTML code (using web.py's templating system):
render = web.template.render("templates")
return render.index(name)
Where name is:
name = cursor.execute("SELECT name FROM names").fetchone()
And the template (index.html):
$def with (name)
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="es" lang="es">
<head>
<title>untitled</title>
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
</head>
<body>
Hello, $name!
</body>
</html>
How can I achieve this?
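One thing worth checking, offered as a guess rather than a confirmed diagnosis: fetchone() returns a whole row as a tuple, not the column value itself, so the template may be rendering the repr of a tuple, which can show escaped bytes instead of the character. A minimal sketch with an in-memory database; the table contents here are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (name TEXT)")
# Hypothetical sample row containing a non-ASCII character
conn.execute("INSERT INTO names VALUES (?)", ("Ram\u00f3n",))

row = conn.execute("SELECT name FROM names").fetchone()
# row is a one-element tuple; the actual text is its first item
name = row[0] if row is not None else None
conn.close()

print(type(row).__name__, type(name).__name__)
```

If that is the cause, passing row[0] to render.index() instead of the row itself should be enough; if the value then arrives as a UTF-8 byte string, it may also need decoding before it reaches the template.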

Inserting into a html file using python

I have an HTML file where I would like to insert a <meta> tag between the <head> & </head> tags using Python. If I open the file in append mode, how do I get to the relevant position where the <meta> tag is to be inserted?
Use BeautifulSoup. Here's an example where a meta tag is inserted right after the title tag using insert_after():
from bs4 import BeautifulSoup as Soup
html = """
<html>
<head>
<title>Test Page</title>
</head>
<body>
<div>test</div>
</html>
"""
soup = Soup(html, "html.parser")
title = soup.find('title')
meta = soup.new_tag('meta')
meta['content'] = "text/html; charset=UTF-8"
meta['http-equiv'] = "Content-Type"
title.insert_after(meta)
print(soup)
prints:
<html>
<head>
<title>Test Page</title>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/>
</head>
<body>
<div>test</div>
</body>
</html>
You can also find head tag and use insert() with a specified position:
head = soup.find('head')
head.insert(1, meta)
Also see:
Add parent tags with beautiful soup
How to append a tag after a link with BeautifulSoup
You need a markup manipulation library like the one in this link, Python and HTML Processing. I would advise against just opening the file and trying to append to it.
Hope it helps.
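On the file-handling part of the question: append mode can only write at the end of a file, so the usual pattern is to read the whole file, modify the parsed tree, and write everything back. A rough sketch of that round trip with Beautiful Soup 4; add_meta_to_file is a made-up helper name, and the demo runs on a throwaway temporary file.

```python
import os
import tempfile

from bs4 import BeautifulSoup


def add_meta_to_file(path):
    """Read an HTML file, append a <meta> tag to <head>, and write it back."""
    with open(path, encoding="utf-8") as fh:
        soup = BeautifulSoup(fh.read(), "html.parser")
    if soup.head is None:
        raise ValueError("document has no <head> element")
    meta = soup.new_tag("meta")
    meta["http-equiv"] = "Content-Type"
    meta["content"] = "text/html; charset=UTF-8"
    soup.head.append(meta)
    # Overwrite the file with the modified document
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(str(soup))


# Demo on a temporary file so no real file is touched
fd, demo_path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "w", encoding="utf-8") as fh:
    fh.write("<html><head><title>Test Page</title></head>"
             "<body><div>test</div></body></html>")
add_meta_to_file(demo_path)
with open(demo_path, encoding="utf-8") as fh:
    result = fh.read()
os.remove(demo_path)
print(result)
```

Writing to a separate output path first and renaming afterwards would be safer for files you cannot afford to lose.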
