Switch all "href = (link)" with "onclick = (PythonScript(link)) " - python

I am working on a webscraper that scrapes a website, does some stuff to the body of the website, and outputs that into a new html file. One of the features would be to take any hyperlinks in the html file and instead run a script where the link would be an input for the script.
I want to go from this..
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scraper</title>
</head>
<body>
<a href="/wiki/Mercury_poisoning" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
To this....
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scraper</title>
</head>
<body>
<a onclick ='pythonScript(/wiki/Mercury_poisoning)' href="#" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
I did a lot of googling and I read about jQuery and ajax but do not know these tools and would prefer to do this in python. Is it possible to do this using File IO in python?

You can do something like this using BeautifulSoup:
PS: You need to install Beautifulsoup: pip install bs4
from bs4 import BeautifulSoup as bs
html = '''<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scraper</title>
</head>
<body>
<a href="/wiki/Mercury_poisoning" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
'''
soup = bs(html, 'html.parser')
links = soup.find_all('a')
for link in links:
actual_link = link['href']
link['href'] = '#'
link['onclick'] = 'pythonScript({})'.format(actual_link)
print(soup)
Output:
<html>
<head>
<meta charset="utf-8"/>
<title>Scraper</title>
</head>
<body>
<a href="#" onclick="pythonScript(/wiki/Mercury_poisoning)" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
Bonus:
You can also create a new HTML file like this:
with open('new_html_file.html', 'w') as out:
out.write(str(soup))

Related

How to use a Python Variable in HTML code?

I have a String variable(name) that contains the name of the song.
(Python)
from pytube import YouTube
yt = YouTube("https://www.youtube.com/watch?v=6BYIKEH0RCQ")
name = yt.title #Contains the title of the song
Here is my HTML code for website to download the mp3 song:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Download</title>
</head>
<body>
<button>Click here </button>
</body>
</html>
With this code, I'd like to use the exact title of song as the name of the file when its been downloaded from the user.
I want to use the name Variable from Python in place of Song_name in HTML code.
Please suggest me any possible way in order to make this work.
You can try try this:
from pytube import YouTube
yt = YouTube("https://www.youtube.com/watch?v=6BYIKEH0RCQ")
name = yt.title
HTML=f"""
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Download</title>
</head>
<body>
<button>Click here </button>
</body>
</html>
"""
with open("test.html","w") as f:
f.write(HTML)
This will put title of song in download attribute. If you want you may put it anywhere. Just don't forget to use f"" and {variable}.

Python - %paste

I am following a tutorial where the teacher pastes in the html inline with our scrappy shell via: %paste ( the html below)
html_doc = " "
<html>
<head>
<title>Title of hte page </title>
</head>
<body>
<h1>H1 Tag</h1>
<h2> H2 Tag with link</h2>
<p> First Paragraph </p>
<p>Second Paragraph </p>
</body>
</html>
" "
but I get this error:
<html>
File "<console>", line 1
<html>
SyntaxError: invalid syntax
I have imported tkinter, and looked up other reasources but cant figure out how to get html inline.
Try doing """:
html_doc = """
<html>
<head>
<title>Title of hte page </title>
</head>
<body>
<h1>H1 Tag</h1>
<h2> H2 Tag with link</h2>
<p> First Paragraph </p>
<p>Second Paragraph </p>
</body>
</html>
"""

Add content to iframe with BeautifulSoup

Let say I have the following iframe
s=""""
<!DOCTYPE html>
<html>
<body>
<iframe src="http://www.w3schools.com">
<p>Your browser does not support iframes.</p>
</iframe>
</body>
</html>
"""
I want to replace all content with this string 'this is the replacement'
If I use
dom = BeatifulSoup(s, 'html.parser')
f = dom.find('iframe')
f.contents[0].replace_with('this is the replacement')
Then instead of replacing all the content I will replace only the first character, which in this case is a newline. Also this does not work if the iframe is completely empty because f.contents[0] is out of index
Simply set the .string property:
from bs4 import BeautifulSoup
data = """
<!DOCTYPE html>
<html>
<body>
<iframe src="http://www.w3schools.com">
<p>Your browser does not support iframes.</p>
</iframe>
</body>
</html>
"""
soup = BeautifulSoup(data, "html.parser")
frame = soup.iframe
frame.string = 'this is the replacement'
print(soup.prettify())
Prints:
<!DOCTYPE html>
<html>
<body>
<iframe src="http://www.w3schools.com">
this is the replacement
</iframe>
</body>
</html>
This will work for you to replace the iframe tag content.
s="""
<!DOCTYPE html>
<html>
<body>
<iframe src="http://www.w3schools.com">
<p>Your browser does not support iframes.</p>
</iframe>
</body>
</html>
"""
from BeautifulSoup import BeautifulSoup
from HTMLParser import HTMLParser
soup = BeautifulSoup(s, convertEntities=BeautifulSoup.HTML_ENTITIES)
show= soup.findAll('iframe')[0]
show.replaceWith('<iframe src="http://www.w3schools.com">this is the replacement</iframe>'.encode('utf-8'))
html = HTMLParser()
print html.unescape(str(soup.prettify()))
Output:
<!DOCTYPE html>
<html>
<body>
<iframe src="http://www.w3schools.com">my text</iframe>
</body>
</html>

How to avoid printing utf-8 characters in BeautifulSoup with replace_with

I am having a problem and I can find a way to solve it. I am trying to parse an html page and then replace a string, while using Beautiful Soup. Although the process looks correct and I do not get any errors when I open the new html page I get some utf-8 characters inside that I do not want.
Sample of working code:
#!/usr/bin/python
import codecs
from bs4 import BeautifulSoup
html_sample = """
<!DOCTYPE html>
<html><head lang="en"><meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1"></head>
<body>
<div class="date">LAST UPDATE</div>
</body>
</html>
"""
try:
my_soup = BeautifulSoup(html_sample.decode('utf-8'), 'html.parser') # html5lib or html.parser
forecast = my_soup.find("div", {"class": "date"})
forecast.tag = unicode(forecast).replace('LAST UPDATE', 'TEST')
forecast.replace_with(forecast.tag)
# print(my_soup.prettify())
f = codecs.open('test.html', "w", encoding='utf-8')
f.write(my_soup.prettify().encode('utf-8'))
f.close()
except UnicodeDecodeError as e:
print('Error, encoding/decoding: {}'.format(e))
except IOError as e:
print('Error Replacing: {}'.format(e))
except RuntimeError as e:
print('Error Replacing: {}'.format(e))
And the output with utf-8 characters in the new html page:
<!DOCTYPE html>
<html>
<head lang="en">
<meta charset="utf-8">
<meta content="width=device-width, initial-scale=1" name="viewport"/>
</meta>
</head>
<body>
<div class="date">TEST</div>
</body>
</html>
I think that I have mixed up, the encoding and decoding process. Someone with more knowledge on this area can possible elaborate more. I am a total beginner on coding and encoding.
Thank you for your time and effort in advance.
There is no need to get into encoding here. You can replace the text content of a Beautiful Soup element by setting the element.string as follows:
from bs4 import BeautifulSoup
html_sample = """
<!DOCTYPE html>
<html><head lang="en"><meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1"></head>
<body>
<div class="date">LAST UPDATE</div>
</body>
</html>
"""
soup = BeautifulSoup(html_sample)
forecast = soup.find("div", {"class": "date"})
forecast.string = 'TEST'
print(soup.prettify())
Output
<!DOCTYPE html>
<html>
<head lang="en">
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
</head>
<body>
<div class="date">
TEST
</div>
</body>
</html>

Inserting into a html file using python

I have a html file where I would like to insert a <meta> tag between the <head> & </head> tags using python. If I open the file in append mode how do I get to the relevant position where the <meta> tag is to be inserted?
Use BeautifulSoup. Here's an example there a meta tag is inserted right after the title tag using insert_after():
from bs4 import BeautifulSoup as Soup
html = """
<html>
<head>
<title>Test Page</title>
</head>
<body>
<div>test</div>
</html>
"""
soup = Soup(html)
title = soup.find('title')
meta = soup.new_tag('meta')
meta['content'] = "text/html; charset=UTF-8"
meta['http-equiv'] = "Content-Type"
title.insert_after(meta)
print soup
prints:
<html>
<head>
<title>Test Page</title>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/>
</head>
<body>
<div>test</div>
</body>
</html>
You can also find head tag and use insert() with a specified position:
head = soup.find('head')
head.insert(1, meta)
Also see:
Add parent tags with beautiful soup
How to append a tag after a link with BeautifulSoup
You need a markup manipulation library like the one in this link Python and HTML Processing, I would advice against just opening the the file and try to append it.
Hope it helps.

Categories

Resources