How to avoid printing utf-8 characters in BeautifulSoup with replace_with - python

I have a problem that I cannot find a way to solve. I am trying to parse an HTML page and then replace a string using Beautiful Soup. The process looks correct and I do not get any errors, but when I open the new HTML page it contains some UTF-8 characters that I do not want.
Sample of working code:
#!/usr/bin/python
import codecs
from bs4 import BeautifulSoup
html_sample = """
<!DOCTYPE html>
<html><head lang="en"><meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1"></head>
<body>
<div class="date">LAST UPDATE</div>
</body>
</html>
"""
try:
    my_soup = BeautifulSoup(html_sample.decode('utf-8'), 'html.parser') # html5lib or html.parser
    forecast = my_soup.find("div", {"class": "date"})
    forecast.tag = unicode(forecast).replace('LAST UPDATE', 'TEST')
    forecast.replace_with(forecast.tag)
    # print(my_soup.prettify())
    f = codecs.open('test.html', "w", encoding='utf-8')
    f.write(my_soup.prettify().encode('utf-8'))
    f.close()
except UnicodeDecodeError as e:
    print('Error, encoding/decoding: {}'.format(e))
except IOError as e:
    print('Error Replacing: {}'.format(e))
except RuntimeError as e:
    print('Error Replacing: {}'.format(e))
And the output with utf-8 characters in the new html page:
<!DOCTYPE html>
<html>
<head lang="en">
<meta charset="utf-8">
<meta content="width=device-width, initial-scale=1" name="viewport"/>
</meta>
</head>
<body>
<div class="date">TEST</div>
</body>
</html>
I think that I have mixed up the encoding and decoding process. Could someone with more knowledge in this area elaborate? I am a total beginner at coding and encoding.
Thank you for your time and effort in advance.

There is no need to get into encoding here. You can replace the text content of a Beautiful Soup element by setting the element.string as follows:
from bs4 import BeautifulSoup
html_sample = """
<!DOCTYPE html>
<html><head lang="en"><meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1"></head>
<body>
<div class="date">LAST UPDATE</div>
</body>
</html>
"""
soup = BeautifulSoup(html_sample, 'html.parser')  # name the parser explicitly to avoid a warning
forecast = soup.find("div", {"class": "date"})
forecast.string = 'TEST'
print(soup.prettify())
Output
<!DOCTYPE html>
<html>
<head lang="en">
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
</head>
<body>
<div class="date">
TEST
</div>
</body>
</html>
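As background on why replace_with can produce unwanted characters: when replace_with() receives a plain string, Beautiful Soup inserts it as text, so angle brackets in it are written out as &lt; and &gt;. The following is a minimal sketch in Python 3, using a shortened version of the sample HTML (an assumption made for brevity), that contrasts this with setting .string:

```python
from bs4 import BeautifulSoup

html = '<body><div class="date">LAST UPDATE</div></body>'

# Passing a plain string to replace_with() inserts it as text,
# so "<" and ">" are escaped when the document is written out.
soup_text = BeautifulSoup(html, "html.parser")
div = soup_text.find("div", {"class": "date"})
div.replace_with(str(div).replace("LAST UPDATE", "TEST"))
escaped = str(soup_text)

# Setting .string changes only the text node and keeps the tag intact.
soup_tag = BeautifulSoup(html, "html.parser")
soup_tag.find("div", {"class": "date"}).string = "TEST"
clean = str(soup_tag)

print(escaped)
print(clean)
```

The first print shows the escaped markup, the second shows the div with only its text changed.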

Related

How to use a Python Variable in HTML code?

I have a String variable(name) that contains the name of the song.
(Python)
from pytube import YouTube
yt = YouTube("https://www.youtube.com/watch?v=6BYIKEH0RCQ")
name = yt.title #Contains the title of the song
Here is my HTML code for website to download the mp3 song:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Download</title>
</head>
<body>
<button>Click here </button>
</body>
</html>
With this code, I'd like to use the exact title of the song as the name of the file when the user downloads it.
I want to use the name variable from Python in place of Song_name in the HTML code.
Please suggest any possible way to make this work.
You can try this:
from pytube import YouTube
yt = YouTube("https://www.youtube.com/watch?v=6BYIKEH0RCQ")
name = yt.title
HTML=f"""
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Download</title>
</head>
<body>
<button>Click here </button>
</body>
</html>
"""
with open("test.html","w") as f:
    f.write(HTML)
This will put the title of the song in the download attribute. You can put it anywhere else if you want. Just don't forget to use f"" and {variable}.
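Here is a small sketch of the same f-string idea with the download attribute written out. The title is hard-coded as a stand-in for yt.title so the example runs without pytube, and the file path song.mp3 is hypothetical because the question does not show the real link target. It also passes the title through html.escape, since a title may contain quotes or ampersands that would otherwise break the attribute:

```python
import html

# Stand-in for yt.title from the question, so the sketch runs without pytube.
name = 'Rock & Roll "Live"'

# Hypothetical file path; the real link target is not given in the question.
song_url = "song.mp3"

# html.escape() turns &, <, > and quotes into entities so the
# attribute value stays well-formed whatever the title contains.
page = f"""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Download</title>
</head>
<body>
<a href="{html.escape(song_url)}" download="{html.escape(name)}">Click here</a>
</body>
</html>
"""

with open("test.html", "w", encoding="utf-8") as f:
    f.write(page)
```

Note that browsers may still adjust the suggested file name, for example by replacing characters that are not allowed in file names.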

Find duplicate id attributes

Before uploading on my server I want to check if I accidentally defined an id two or more times in one of my html files:
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>The HTML5 Herald</title>
<meta name="description" content="The HTML5 Herald">
<meta name="author" content="SitePoint">
<link rel="stylesheet" href="css/styles.css?v=1.0">
</head>
<body>
<div id="test"></div>
<div id="test"></div>
</body>
</html>
The idea is to print an error message if there are duplicates:
"ERROR: The id="test" is not unique."
You can do this by using find_all to gather all elements with an id attribute, and then collections.Counter to count how often each id occurs:
import bs4
import collections
# html holds the page source shown in the question
soup = bs4.BeautifulSoup(html, 'html.parser')
ids = [a.attrs['id'] for a in soup.find_all(attrs={'id': True})]
ids = collections.Counter(ids)
dups = [key for key, value in ids.items() if value > 1]
for d in dups:
    print('ERROR: The id="{}" is not unique.'.format(d))
>>> ERROR: The id="test" is not unique.
You could use a regex to find all ids in the HTML and then search for duplicates.
For example:
import re
html_page = """
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>$The HTML5 Herald</title>
<div id="test1"></div>
<meta name="description" content="The HTML5 Herald">
<meta name="author" content="SitePoint">
<link $rel="stylesheet" href="css/styles.css?v=1.0">
</head>
<body>
<div id="test2"></div>
<div id="test2"></div>
</body>
<div id="test3"></div>
</html>
"""
ids_match = re.findall(r'(?<=\s)id=\"\w+\"',html_page)
print(ids_match) #-> ['id="test1"', 'id="test2"', 'id="test2"', 'id="test3"']
print(len(ids_match)) #-> 4
print(len(set(ids_match))) #-> 3
# the following returns True if there are duplicates in ids_match
print(len(ids_match) != len(set(ids_match))) #-> True
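If you want the requested error message without installing anything, one possible variation is to collect the ids with the standard library's html.parser instead of a regex, and count them with collections.Counter. This is only a sketch: the names IdCollector and duplicate_ids are made up for this example, and the sample markup is shortened. Unlike the regex, it also picks up single-quoted ids and ignores attributes such as data-id:

```python
from collections import Counter
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collects the value of every id attribute seen in start tags."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        for attr_name, attr_value in attrs:
            if attr_name == "id" and attr_value is not None:
                self.ids.append(attr_value)


def duplicate_ids(html_text):
    """Return the ids that occur more than once, in order of first appearance."""
    collector = IdCollector()
    collector.feed(html_text)
    collector.close()
    counts = Counter(collector.ids)
    return [id_value for id_value, count in counts.items() if count > 1]


sample = """
<body>
<div id="test1"></div>
<div id='test2'></div>
<div id="test2"></div>
<p data-id="test1"></p>
</body>
"""

for dup in duplicate_ids(sample):
    print('ERROR: The id="{}" is not unique.'.format(dup))
```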

Switch all "href = (link)" with "onclick = (PythonScript(link)) "

I am working on a webscraper that scrapes a website, does some stuff to the body of the website, and outputs that into a new html file. One of the features would be to take any hyperlinks in the html file and instead run a script where the link would be an input for the script.
I want to go from this..
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scraper</title>
</head>
<body>
<a href="/wiki/Mercury_poisoning" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
To this....
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scraper</title>
</head>
<body>
<a onclick ='pythonScript(/wiki/Mercury_poisoning)' href="#" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
I did a lot of googling and I read about jQuery and ajax but do not know these tools and would prefer to do this in python. Is it possible to do this using File IO in python?
You can do something like this using BeautifulSoup:
PS: You need to install Beautiful Soup: pip install beautifulsoup4
from bs4 import BeautifulSoup as bs
html = '''<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scraper</title>
</head>
<body>
<a href="/wiki/Mercury_poisoning" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
'''
soup = bs(html, 'html.parser')
links = soup.find_all('a')
for link in links:
    actual_link = link['href']
    link['href'] = '#'
    link['onclick'] = 'pythonScript({})'.format(actual_link)
print(soup)
Output:
<html>
<head>
<meta charset="utf-8"/>
<title>Scraper</title>
</head>
<body>
<a href="#" onclick="pythonScript(/wiki/Mercury_poisoning)" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
Bonus:
You can also create a new HTML file like this:
with open('new_html_file.html', 'w') as out:
    out.write(str(soup))
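One thing to watch out for: in the output above the link is passed to pythonScript without quotes, which a browser would not read as a string. Assuming pythonScript is a JavaScript function defined elsewhere on the page and expects a string argument, a possible variation is to quote the value with json.dumps and to skip anchors that have no href. This is a sketch with a shortened sample document:

```python
import json

from bs4 import BeautifulSoup

html = (
    '<a href="/wiki/Mercury_poisoning" title="Mercury poisoning">mercury poisoning</a>'
    '<a name="top">no href</a>'
)

soup = BeautifulSoup(html, "html.parser")

# href=True skips anchors that have no href attribute at all.
for link in soup.find_all("a", href=True):
    actual_link = link["href"]
    link["href"] = "#"
    # json.dumps wraps the value in double quotes and escapes it,
    # so the handler receives a JavaScript string literal.
    link["onclick"] = "pythonScript({})".format(json.dumps(actual_link))

first_link = soup.find("a")
print(first_link["onclick"])
```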

Inserting into a html file using python

I have a html file where I would like to insert a <meta> tag between the <head> & </head> tags using python. If I open the file in append mode how do I get to the relevant position where the <meta> tag is to be inserted?
Use BeautifulSoup. Here's an example where a meta tag is inserted right after the title tag using insert_after():
from bs4 import BeautifulSoup as Soup
html = """
<html>
<head>
<title>Test Page</title>
</head>
<body>
<div>test</div>
</html>
"""
soup = Soup(html, 'html.parser')
title = soup.find('title')
meta = soup.new_tag('meta')
meta['content'] = "text/html; charset=UTF-8"
meta['http-equiv'] = "Content-Type"
title.insert_after(meta)
print(soup)
prints:
<html>
<head>
<title>Test Page</title>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/>
</head>
<body>
<div>test</div>
</body>
</html>
You can also find head tag and use insert() with a specified position:
head = soup.find('head')
head.insert(1, meta)
Also see:
Add parent tags with beautiful soup
How to append a tag after a link with BeautifulSoup
You need a markup manipulation library like the one in this link: Python and HTML Processing. I would advise against just opening the file and trying to append to it.
Hope it helps.
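To connect this back to the file on disk: rather than seeking to a position in append mode, you can read the whole file, modify the parsed document, and write it back. A rough sketch, assuming the file is UTF-8 and has a head element (the helper name add_meta_to_head and the temporary demo file are made up for this example):

```python
import os
import tempfile

from bs4 import BeautifulSoup


def add_meta_to_head(path):
    """Parse the file at path, append a <meta> tag to <head>, and rewrite the file."""
    with open(path, "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    meta = soup.new_tag("meta")
    meta["content"] = "text/html; charset=UTF-8"
    meta["http-equiv"] = "Content-Type"
    soup.head.append(meta)  # assumes the document has a <head> element

    # Write mode replaces the old contents with the modified document.
    with open(path, "w", encoding="utf-8") as f:
        f.write(str(soup))


# Demonstration on a throwaway file.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "page.html")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("<html><head><title>Test Page</title></head><body><div>test</div></body></html>")

add_meta_to_head(demo_path)

with open(demo_path, "r", encoding="utf-8") as f:
    result = f.read()
print(result)
```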

Append before closing body tag in python

ok guys so I have a template.html file like so:
<h1>Hello wolrd</h1>
<div>This is me</div>
And I want to append that to my index file before the closing body tag. Just like so:
<!doctype html>
<html>
<head>
<meta charset="utf-8"/>
<title></title>
</head>
<body>
<script type="text/ng-template" id="templates/template.html">
<h1>Hello wolrd</h1>
<div>This is me</div>
</script>
</body>
</html>
I've so far gotten to read the file and append to the end of it but I have yet to add the script tags to the file that I am reading and append to the correct spot of my file. This is what I currently have:
#!/usr/bin/env python
import fileinput
to_readfile=open('index.html', "r")
try:
    reading_file=to_readfile.read()
    writefile=open('index2.html','a')
    try:
        writefile.write("\n")
        writefile.write(reading_file)
    finally:
        writefile.close()
finally:
    to_readfile.close()
Any help would be much appreciated. Thank you!
The simplest approach would be to add a placeholder in the layout template and then when processing the layout search for the placeholder and replace it with the contents of the other template.
<!doctype html>
<html>
<head>
<meta charset="utf-8"/>
<title></title>
</head>
<body>
<script type="text/ng-template" id="templates/template.html">
{{content}}
</script>
</body>
</html>
...
..
.
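# Read the layout and the partial, then substitute the placeholder.
with open('layout.html', 'r') as layout:
    layout_contents = layout.read()
with open('partial_file.html', 'r') as partial:
    partial_contents = partial.read()
# str.replace needs the partial's text, not the file object.
result = layout_contents.replace("{{content}}", partial_contents)
with open('file_to_write.html', 'w') as writefile:
    writefile.write("\n")
    writefile.write(result)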
.
..
....
You can also work on a much more extensive solution such as the ones used by jinja http://jinja.pocoo.org/docs/templates/#template-inheritance.
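If you cannot add a placeholder to the index file, a possible alternative is to split the page at its last closing body tag and put the wrapped template in front of it. This is a plain string sketch, not a real HTML parser; the function name insert_before_body_close is made up for this example, and the page and template are inlined so it runs without any files:

```python
def insert_before_body_close(page, template, template_id):
    """Wrap template in an ng-template script tag and place it before the last </body>."""
    script_block = (
        '<script type="text/ng-template" id="{}">\n{}\n</script>\n'
    ).format(template_id, template.strip())
    # rpartition splits at the last occurrence of the closing tag.
    head, closing_tag, tail = page.rpartition("</body>")
    if not closing_tag:
        raise ValueError("no </body> tag found")
    return head + script_block + closing_tag + tail


index_html = """<!doctype html>
<html>
<head>
<meta charset="utf-8"/>
<title></title>
</head>
<body>
</body>
</html>
"""
template_html = "<h1>Hello world</h1>\n<div>This is me</div>\n"

result = insert_before_body_close(index_html, template_html, "templates/template.html")
print(result)
```

In a real script you would read index_html and template_html from index.html and template.html and write result to the output file.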
