python lxml.html add parameter

I have an HTML template where I want to add some content. The template looks like the following:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" >
<head>
<title>Data Base</title>
<link rel="stylesheet" href="stylesheet.css" />
</head>
<body>
<h1>Data Base</h1>
<div class="file_explorer">
</div>
<div class="info_screen">
</div>
</body>
</html>
I want to search for the <div class="file_explorer"></div> and add some paragraphs to it. Afterwards it should look like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" >
<head>
<title>Data Base</title>
<link rel="stylesheet" href="stylesheet.css" />
</head>
<body>
<h1>Data Base</h1>
<div class="file_explorer">
<p class="folder">Folder_1</p>
<p class="folder">Folder_2</p>
</div>
<div class="info_screen">
</div>
</body>
</html>
Therefore I tried to parse the HTML template and search for the file_explorer div to add the paragraphs. How do I find it and append the paragraphs afterwards? I tried html.cssselect but it did not work. Please help me. This is my code:
from lxml import html
from os import path

class HtmlGenerator:

    @staticmethod
    def modify_html(html_path, list_folders):
        html_path = path.abspath(html_path)
        parser = html.HTMLParser(remove_blank_text=True)
        if path.isfile(html_path) and html_path.endswith(".html"):
            tree = html.parse(html_path, parser)
            # search for <div class="file_explorer"> [MISSING]
            for folder in list_folders:
                # add folder as paragraph to html [MISSING]
            tree.write(html_path, pretty_print=True)
Thanks in advance.

You can use XPath to find the target div in your template, and then use the E-factory to build the new elements:
from lxml.html import builder as E
...
tree = html.parse(html_path, parser)
root = tree.getroot()
# search for <div class="file_explorer">
div = root.find('.//div[@class="file_explorer"]')
for folder in list_folders:
    # add folder as paragraph to html
    # I assume `folder` is a string like 'Folder_1', 'Folder_2', ...
    div.append(E.P(E.CLASS('folder'), folder))
tree.write(html_path, pretty_print=True)
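For completeness, the same approach can be run end to end on an in-memory string instead of a file; the template is trimmed and the folder names are invented for illustration:

```python
from lxml import html
from lxml.html import builder as E

template = """<html>
<body>
<div class="file_explorer"></div>
<div class="info_screen"></div>
</body>
</html>"""

root = html.fromstring(template)
# XPath: find the div whose class attribute is exactly "file_explorer"
div = root.find('.//div[@class="file_explorer"]')
for folder in ['Folder_1', 'Folder_2']:
    # E.P(...) builds a <p> element; E.CLASS(...) supplies its class attribute
    div.append(E.P(E.CLASS('folder'), folder))

result = html.tostring(root, pretty_print=True).decode()
```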

Related

How to Insert dictionary values into a html template file using python?

I have an HTML template file, as shown below, and I want to replace the title and body with dictionary values in my Python script.
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>#Want to insert dictionary values here in python></title>
<LINK href="styles.css" rel="stylesheet" type="text/css">
</head>
<body>
<img src="forkit.gif" id="octocat" alt="" />
<!-- Feel free to change this text here -->
<p>
#Want to insert dictionary values here in python>
</p>
<p>
#Want to insert dictionary values here in python>
</p>
</body>
</html>
I'm parsing a JSON file and have stored the values in a dictionary; now I want to insert those values into the HTML file.
import json
#from lxml import etree

with open('ninjs_basic.json','r') as file:
    resp_str = file.read()
#print(resp_str)
resp_dict = json.loads(resp_str)

with open('result.html','w') as output:
    output.write('uri: ' + resp_dict['uri'] + '\n')
    output.write(resp_dict['headline'] + '\n')
    output.write(resp_dict['body_text'])
I tried the following code and had no luck. What would be the right approach here?
Here is an example using SimplifiedDoc.
from simplified_scrapy import SimplifiedDoc, req, utils
import json

html = '''<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>#Want to insert dictionary values here in python></title>
<LINK href="styles.css" rel="stylesheet" type="text/css">
</head>
<body>
<img src="forkit.gif" id="octocat" alt="" />
<!-- Feel free to change this text here -->
<p>
#Want to insert dictionary values here in python>
<placeholder1 />
</p>
<p>
#Want to insert dictionary values here in python>
<placeholder2 />
</p>
</body>
</html>'''

doc = SimplifiedDoc(html)
# with open('ninjs_basic.json','r') as file:
#     resp_str = file.read()
#     resp_dict = json.loads(resp_str)
with open('result.html','w') as output:
    doc.title.setContent("The title you took from the JSON file")
    doc.placeholder1.repleaceSelf("Want to insert dictionary values here in python")
    doc.placeholder2.repleaceSelf("Want to insert dictionary values here in python")
    output.write(doc.html)
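If pulling in a third-party parser feels heavy for simple substitution, the standard library's string.Template can also fill a template like this. The placeholder names ($title, $headline, $body_text) and the sample dictionary below are assumptions for illustration, not part of the original files:

```python
from string import Template

page = Template("""<!DOCTYPE html>
<html>
<head><title>$title</title></head>
<body>
<p>$headline</p>
<p>$body_text</p>
</body>
</html>""")

# stands in for the dictionary loaded from ninjs_basic.json
resp_dict = {'title': 'Data Base',
             'headline': 'Some headline',
             'body_text': 'Some body text'}
# substitute() raises KeyError if a placeholder is missing from the dict
result = page.substitute(resp_dict)
```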

Switch all "href = (link)" with "onclick = (PythonScript(link))"

I am working on a web scraper that scrapes a website, does some processing on the body of the page, and outputs the result into a new HTML file. One of the features is to take any hyperlinks in the HTML file and instead run a script that takes the link as input.
I want to go from this..
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scraper</title>
</head>
<body>
<a href="/wiki/Mercury_poisoning" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
To this....
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scraper</title>
</head>
<body>
<a onclick ='pythonScript(/wiki/Mercury_poisoning)' href="#" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
I did a lot of googling and read about jQuery and Ajax, but I don't know these tools and would prefer to do this in Python. Is it possible to do this using file I/O in Python?
You can do something like this using BeautifulSoup:
PS: You need to install BeautifulSoup: pip install bs4
from bs4 import BeautifulSoup as bs
html = '''<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scraper</title>
</head>
<body>
<a href="/wiki/Mercury_poisoning" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
'''
soup = bs(html, 'html.parser')
links = soup.find_all('a')
for link in links:
    actual_link = link['href']
    link['href'] = '#'
    link['onclick'] = 'pythonScript({})'.format(actual_link)
print(soup)
Output:
<html>
<head>
<meta charset="utf-8"/>
<title>Scraper</title>
</head>
<body>
<a href="#" onclick="pythonScript(/wiki/Mercury_poisoning)" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
Bonus:
You can also create a new HTML file like this:
with open('new_html_file.html', 'w') as out:
    out.write(str(soup))
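One detail worth noting: `pythonScript(/wiki/Mercury_poisoning)` is not valid JavaScript, because the path is unquoted. If the onclick handler is meant to run in a browser, a quoted variant (my adjustment, not part of the original answer) would look like:

```python
from bs4 import BeautifulSoup

html = '<a href="/wiki/Mercury_poisoning" title="Mercury poisoning">mercury poisoning</a>'
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a'):
    target = link['href']
    link['href'] = '#'
    # quote the argument so the attribute holds a valid JavaScript call
    link['onclick'] = "pythonScript('{}')".format(target)

result = str(soup)
```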

get parent elements of a tag using python requests-HTML

Hi, is there any way to get all the parent elements of a tag using requests-HTML?
for example:
<!DOCTYPE html>
<html lang="en">
<body id="two">
<h1 class="text-primary">hello there</h1>
<p>one two tree<b>four</b>five</p>
</body>
</html>
I want to get all the parents of the b tag: [html, body, p]
or for the h1 tag get this result: [html, body]
With the excellent lxml:
from lxml import etree
html = """<!DOCTYPE html>
<html lang="en">
<body id="two">
<h1 class="text-primary">hello there</h1>
<p>one two tree<b>four</b>five</p>
</body>
</html> """
tree = etree.HTML(html)
# We search the first <b> element
b_elt = tree.xpath('//b')[0]
print(b_elt.text)
# -> "four"
# Walking over the ancestors of this <b> element
ancestors_tags = [elt.tag for elt in b_elt.iterancestors()]
print(ancestors_tags)
# -> ['p', 'body', 'html']
You can access the lower-level lxml Element via the element attribute, which has an iterancestors() method.
Here is how you could do it:
from requests_html import HTML
html = """<!DOCTYPE html>
<html lang="en">
<body id="two">
<h1 class="text-primary">hello there</h1>
<p>one two tree<b>four</b>five</p>
</body>
</html>"""
html = HTML(html=html)
b = html.find('b', first=True)
parents = [a for a in b.element.iterancestors()]
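BeautifulSoup can do the same walk with find_parents(), in case requests-HTML is not available; a minimal sketch of that alternative:

```python
from bs4 import BeautifulSoup

html = """<html lang="en">
<body id="two">
<h1 class="text-primary">hello there</h1>
<p>one two tree<b>four</b>five</p>
</body>
</html>"""

soup = BeautifulSoup(html, 'html.parser')
b = soup.find('b')
# find_parents() walks outward: p, body, html, then the document itself
parent_tags = [parent.name for parent in b.find_parents()]
```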

This XML file does not appear to have any style information associated with it. The document tree is shown below.

When using the following code in a Django template:
<!DOCTYPE html>
<html lang="en">
<head>
<link href="http://52.11.183.14/static/wiki/bootstrap/css/wiki-bootstrap.css" type="text/css" rel="stylesheet"/>
<link href="http://52.11.183.14/static/wiki/bootstrap/css/simple-sidebar.css" type="text/css" rel="stylesheet"/>
<title> Profile - Technology βιβλιοθήκη </title>
</head>
<body>
<div class="container">
{% for p in profiles %}
{{p}}
{% endfor %}
</div>
</body>
</html>
I receive the following error:
This XML file does not appear to have any style information associated with it. The document tree is shown below.
Why? And what can I do to fix it?
Solved: by changing HttpResponse to render_to_response
my_context = {'profiles': profiles}
c = RequestContext(request, {'profiles': profiles})
return render_to_response('wiki/profile.html',
                          my_context,
                          context_instance=RequestContext(request))
# return HttpResponse(t.render(c), content_type="application/xhtml+xml")
You must replace the start of your HTML document with this:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html dir="rtl" xmlns="http://www.w3.org/1999/xhtml">

Pulling doctype using beautifulsoup [duplicate]

I've just started tinkering with Scrapy in conjunction with BeautifulSoup, and I'm wondering if I'm missing something very obvious, but I can't seem to figure out how to get the doctype of a returned HTML document from the resulting soup object.
Given the following html:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<meta charset=utf-8 />
<meta name="viewport" content="width=620" />
<title>HTML5 Demos and Examples</title>
<link rel="stylesheet" href="/css/html5demos.css" type="text/css" />
<script src="js/h5utils.js"></script>
</head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>
Can anyone tell me if there's a way of extracting the declared doctype from it using BeautifulSoup?
Beautiful Soup 4 has a class for DOCTYPE declarations, so you can use that to extract all the declarations at the top level (though you're no doubt expecting one or none!):
def doctype(soup):
    items = [item for item in soup.contents if isinstance(item, bs4.Doctype)]
    return items[0] if items else None
You can go through top-level elements and check each to see whether it is a declaration. Then you can inspect it to find out what kind of declaration it is:
for child in soup.contents:
    if isinstance(child, BS.Declaration):
        declaration_type = child.string.split()[0]
        if declaration_type.upper() == 'DOCTYPE':
            declaration = child
You could just fetch the first item in soup.contents:
>>> soup.contents[0]
u'DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"'
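Putting the first answer together into a runnable form (the sample document is a trimmed version of the one in the question; note that bs4's Doctype string drops the leading "DOCTYPE " shown in the older BeautifulSoup 3 output above):

```python
import bs4
from bs4 import BeautifulSoup

html = '''<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head><title>HTML5 Demos and Examples</title></head>
<body><p>This is paragraph <b>one</b></p></body>
</html>'''

soup = BeautifulSoup(html, 'html.parser')

def doctype(soup):
    # bs4.Doctype nodes live at the top level of soup.contents
    items = [item for item in soup.contents if isinstance(item, bs4.Doctype)]
    return items[0] if items else None

decl = doctype(soup)
```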
