I am following a tutorial where the teacher pastes HTML inline into the Scrapy shell via %paste (the HTML below):
html_doc = " "
<html>
<head>
<title>Title of the page </title>
</head>
<body>
<h1>H1 Tag</h1>
<h2> H2 Tag with link</h2>
<p> First Paragraph </p>
<p>Second Paragraph </p>
</body>
</html>
" "
but I get this error:
<html>
File "<console>", line 1
<html>
SyntaxError: invalid syntax
I have imported tkinter and looked up other resources, but I can't figure out how to paste the HTML inline.
Use triple quotes (""") so the string literal can span multiple lines:
html_doc = """
<html>
<head>
<title>Title of the page </title>
</head>
<body>
<h1>H1 Tag</h1>
<h2> H2 Tag with link</h2>
<p> First Paragraph </p>
<p>Second Paragraph </p>
</body>
</html>
"""
Related
How can I put variables from dict into an HTML document?
For example, replace {{variable}} with 1 in the HTML document using Python.
Python code:
def convert_html_to_python(variables={}, code=""):
    ## Stuff with code variable
    pass
some_code = """
<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
{{var}}
</body>
</html>
"""
convert_html_to_python({"var":"1"}, some_code)
The Python script should then convert the HTML document to:
<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
1
</body>
</html>
I want it to be like Django templates. I do not want to use Django or Flask.
You can use the replace method on the string.
def convert_html_to_python(variables=None, code=""):
    # Replace every {{name}} placeholder with its value from the dict
    for name, value in (variables or {}).items():
        code = code.replace("{{" + str(name) + "}}", str(value))
    return code
some_code = """
<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
{{var}}
</body>
</html>
"""
print(convert_html_to_python({"var":"1"}, some_code))
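If the placeholders may contain spaces, as in {{ var }}, or the values may contain characters such as <, a regular expression plus escaping is one option. This is only a sketch using the standard library: the names PLACEHOLDER and render are hypothetical, and leaving unknown placeholders untouched is my own design choice, not Django's behavior.

```python
import re
from html import escape

# Matches {{name}} and {{ name }}; "name" is limited to identifier characters.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")

def render(template, variables):
    """Replace each {{name}} with its escaped value; leave unknown names as-is."""
    def substitute(match):
        name = match.group(1)
        if name not in variables:
            return match.group(0)  # keep the placeholder text unchanged
        return escape(str(variables[name]))
    return PLACEHOLDER.sub(substitute, template)

page = render("<p>{{ var }}</p><p>{{missing}}</p>", {"var": "1 < 2"})
print(page)
```

Escaping the values keeps user-supplied text from being interpreted as markup; drop the escape call if the values are trusted HTML fragments.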
Flask and Django come with a solution for this.
You can try reading an HTML file like a string and change it.
For example:
HTML code
<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
Function to add data:
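def add_data(data, html_code):
    # Index of the character just before </body> (a newline in the example above)
    write_index = html_code.find('</body>') - 1
    # Replace that character with the new heading, surrounded by newlines
    html_code = html_code[:write_index] + '\n<h1>' + data + '</h1>\n' + html_code[write_index + 1:]
    print(html_code)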
HTML now:
<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
<h1>~~data we insert in~~</h1>
</body>
</html>
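A variant of the same idea that returns the result instead of printing it, and fails loudly when the document has no closing body tag, might look like this. The function name add_before_body_end is hypothetical, and str.rpartition is swapped in for the index arithmetic above.

```python
def add_before_body_end(html_code, fragment):
    """Return html_code with fragment inserted just before the last </body>."""
    head, sep, tail = html_code.rpartition("</body>")
    if not sep:
        # rpartition returns an empty separator when the tag is absent
        raise ValueError("no </body> tag found")
    return head + fragment + "\n" + sep + tail

page = "<html>\n<body>\n<p>My first paragraph.</p>\n</body>\n</html>"
result = add_before_body_end(page, "<h1>new heading</h1>")
print(result)
```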
I am working on a webscraper that scrapes a website, does some stuff to the body of the website, and outputs that into a new html file. One of the features would be to take any hyperlinks in the html file and instead run a script where the link would be an input for the script.
I want to go from this:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scraper</title>
</head>
<body>
<a href="/wiki/Mercury_poisoning" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
to this:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scraper</title>
</head>
<body>
<a onclick ='pythonScript(/wiki/Mercury_poisoning)' href="#" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
I did a lot of googling and read about jQuery and Ajax, but I do not know these tools and would prefer to do this in Python. Is it possible to do this using file I/O in Python?
You can do something like this using BeautifulSoup.
PS: You need to install BeautifulSoup first: pip install beautifulsoup4
from bs4 import BeautifulSoup as bs
html = '''<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scraper</title>
</head>
<body>
<a href="/wiki/Mercury_poisoning" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
'''
soup = bs(html, 'html.parser')
links = soup.find_all('a')
for link in links:
    # Remember the original target, then point the link at '#'
    actual_link = link['href']
    link['href'] = '#'
    link['onclick'] = 'pythonScript({})'.format(actual_link)
print(soup)
Output:
<html>
<head>
<meta charset="utf-8"/>
<title>Scraper</title>
</head>
<body>
<a href="#" onclick="pythonScript(/wiki/Mercury_poisoning)" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
Bonus:
You can also create a new HTML file like this:
with open('new_html_file.html', 'w') as out:
    out.write(str(soup))
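One caveat about the onclick value: a browser evaluates it as JavaScript, so pythonScript would have to be a JavaScript function defined on the page, and the link passed to it needs to be a quoted string. A small sketch of building the attribute value with json.dumps for the quoting (the helper name build_onclick is hypothetical):

```python
import json

def build_onclick(href, function_name="pythonScript"):
    # json.dumps returns a double-quoted, escaped string, which is also
    # usable as a JavaScript string literal.
    return "{}({})".format(function_name, json.dumps(href))

onclick = build_onclick("/wiki/Mercury_poisoning")
print(onclick)
```

In the loop above this would be used as link['onclick'] = build_onclick(actual_link); BeautifulSoup takes care of escaping the double quotes when it serializes the attribute.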
Why is this not finding anything? I'm looking to extract the id out of this html.
from bs4 import BeautifulSoup
import re
a="""
<html lang="en-US">
<head>
<title>
Coverage
</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="2017-07-12T08:12:00.0000000" name="created"/>
</head>
<body data-absolute-enabled="true" style="font-family:Calibri;font-size:11pt">
<div id="div:{1586118a-0184-027e-07fc-99debbfc309f}{35}" style="position:absolute;left:1030px;top:449px;width:624px">
<p id="p:{dd73b86c-408c-4068-a1e7-769ad024cf2e}{40}" style="margin-top:5.5pt;margin-bottom:5.5pt">
{FB} 2 Facebook 465.8 /
<span style="color:green">
12
</span>
<span style="color:green">
5
</span>
<span style="color:green">
10
</span>
<span style="color:red">
-3
</span>
/ updated
</p>
</div>
</body>
</html>
"""
soup=BeautifulSoup(a,'html.parser')
ticker='{FB}'
target= soup.find('p', text = re.compile(ticker))
There is more than one p; I just omitted the rest. I need the text= part.
I've also tried wildcards (.*) but still can't get it to work.
I must get the id by ticker. I don't know anything else, and the rest of the page is dynamic.
This would get the "id" value for <p> tags which contain the text "{FB}":
ticker = '{FB}'
target = soup.find_all('p')
for items in target:
    check = items.text
    if ticker in check:
        print(items.get("id"))
More compact way:
for elem in soup(text=re.compile(ticker)):
    print(elem.parent.get("id"))
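As far as I understand it, the original find('p', text=...) finds nothing because the text filter is checked against the tag's .string, and .string is None whenever a tag has more than one child (here: text plus several span tags). The sketch below shows that on a shortened, made-up document and searches the text nodes directly instead. It assumes a Beautiful Soup version that accepts the string= argument (the newer spelling of text=); re.escape is added so the braces in the ticker are always treated literally.

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for the page in the question (ids are made up).
html = """
<p id="p:{a}{1}">{FB} 2 Facebook <span>12</span> / updated</p>
<p id="p:{b}{2}">{TW} 3 Twitter <span>7</span> / updated</p>
"""
soup = BeautifulSoup(html, "html.parser")
ticker = "{FB}"

first_p = soup.find("p")
print(first_p.string)  # None, because the tag has several children

# Match the text nodes themselves, then walk up to the enclosing <p>.
pattern = re.compile(re.escape(ticker))
ids = [node.find_parent("p").get("id") for node in soup.find_all(string=pattern)]
print(ids)
```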
I have an HTML template to which I want to add some content. The template looks like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" >
<head>
<title>Data Base</title>
<link rel="stylesheet" href="stylesheet.css" />
</head>
<body>
<h1>Data Base</h1>
<div class="file_explorer">
</div>
<div class="info_screen">
</div>
</body>
</html>
I want to find the <div class="file_explorer"></div> and add some paragraphs to it. Afterwards it should look like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" >
<head>
<title>Data Base</title>
<link rel="stylesheet" href="stylesheet.css" />
</head>
<body>
<h1>Data Base</h1>
<div class="file_explorer">
<p class="folder">Folder_1</p>
<p class="folder">Folder_2</p>
</div>
<div class="info_screen">
</div>
</body>
</html>
So I tried to parse the HTML template and search for the file_explorer div in order to add the paragraphs. How do I find the div and then add the paragraphs? I tried html.cssselector but it did not work. Please help me. This is my code:
from lxml import html
from os import path
class HtmlGenerator:
    @staticmethod
    def modify_html(html_path, list_folders):
        html_path = path.abspath(html_path)
        parser = html.HTMLParser(remove_blank_text=True)
        if path.isfile(html_path) and html_path.endswith(".html"):
            tree = html.parse(html_path, parser)
            # search for <div class="file_explorer"> [MISSING]
            for folder in list_folders:
                # add folder as paragraph to html [MISSING]
                pass
            tree.write(html_path, pretty_print=True)
Thanks in advance.
You can use XPath to find the target div in your template, and then use the E-factory to build the new elements:
from lxml.html import builder as E
....
tree = html.parse(html_path, parser)
root = tree.getroot()
# search for <div class="file_explorer">
div = root.find('.//div[@class="file_explorer"]')
for folder in list_folders:
    # add folder as paragraph to html
    # I assume `folder` is a string like 'Folder_1', 'Folder_2', ...
    div.append(E.P(E.CLASS('folder'), folder))
tree.write(html_path, pretty_print=True)
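For trying this out without a file on disk, here is a self-contained sketch of the same approach that parses the template from a string. The shortened template and the folder names are assumptions taken from the question's example; note that the predicate is written @class.

```python
from lxml import html
from lxml.html import builder as E

# Shortened stand-in for the template in the question.
template = """<html><body>
<div class="file_explorer"></div>
<div class="info_screen"></div>
</body></html>"""

root = html.fromstring(template)

# Find the target div by its class attribute.
div = root.find('.//div[@class="file_explorer"]')

# Append one <p class="folder"> per folder name.
for folder in ["Folder_1", "Folder_2"]:
    div.append(E.P(E.CLASS("folder"), folder))

result = html.tostring(root, encoding="unicode")
print(result)
```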
Let's say I have the following iframe:
s="""
<!DOCTYPE html>
<html>
<body>
<iframe src="http://www.w3schools.com">
<p>Your browser does not support iframes.</p>
</iframe>
</body>
</html>
"""
I want to replace all of its content with the string 'this is the replacement'.
If I use
dom = BeautifulSoup(s, 'html.parser')
f = dom.find('iframe')
f.contents[0].replace_with('this is the replacement')
then instead of replacing all the content I replace only the first child, which in this case is a newline. This also fails if the iframe is completely empty, because f.contents[0] then raises an IndexError.
Simply set the .string property:
from bs4 import BeautifulSoup
data = """
<!DOCTYPE html>
<html>
<body>
<iframe src="http://www.w3schools.com">
<p>Your browser does not support iframes.</p>
</iframe>
</body>
</html>
"""
soup = BeautifulSoup(data, "html.parser")
frame = soup.iframe
frame.string = 'this is the replacement'
print(soup.prettify())
Prints:
<!DOCTYPE html>
<html>
<body>
<iframe src="http://www.w3schools.com">
this is the replacement
</iframe>
</body>
</html>
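The question also mentions the case of a completely empty iframe. As a quick check under the same assumptions (bs4 with the html.parser backend), assigning to .string works there too, since it does not depend on any existing children:

```python
from bs4 import BeautifulSoup

# An iframe with no children at all.
soup = BeautifulSoup('<iframe src="http://www.w3schools.com"></iframe>', "html.parser")
frame = soup.iframe
child_count = len(frame.contents)
print(child_count)  # no children, so frame.contents[0] would fail here

# Assigning to .string replaces whatever the tag contains, even nothing.
frame.string = "this is the replacement"
print(frame)
```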
This replaces the whole iframe tag, including its content. Note that it uses the legacy BeautifulSoup 3 package and Python 2 syntax:
s="""
<!DOCTYPE html>
<html>
<body>
<iframe src="http://www.w3schools.com">
<p>Your browser does not support iframes.</p>
</iframe>
</body>
</html>
"""
from BeautifulSoup import BeautifulSoup
from HTMLParser import HTMLParser
soup = BeautifulSoup(s, convertEntities=BeautifulSoup.HTML_ENTITIES)
show= soup.findAll('iframe')[0]
show.replaceWith('<iframe src="http://www.w3schools.com">this is the replacement</iframe>'.encode('utf-8'))
html = HTMLParser()
print html.unescape(str(soup.prettify()))
Output:
<!DOCTYPE html>
<html>
<body>
<iframe src="http://www.w3schools.com">this is the replacement</iframe>
</body>
</html>