Add content to iframe with BeautifulSoup - python

Let's say I have the following HTML containing an iframe:
s = """
<!DOCTYPE html>
<html>
<body>
<iframe src="http://www.w3schools.com">
<p>Your browser does not support iframes.</p>
</iframe>
</body>
</html>
"""
I want to replace all of its content with the string 'this is the replacement'.
If I use
from bs4 import BeautifulSoup
dom = BeautifulSoup(s, 'html.parser')
f = dom.find('iframe')
f.contents[0].replace_with('this is the replacement')
Then instead of replacing all of the content I only replace the first child node, which in this case is just the newline before the <p>. This also does not work if the iframe is completely empty, because f.contents[0] raises an IndexError.

Simply set the .string property:
from bs4 import BeautifulSoup
data = """
<!DOCTYPE html>
<html>
<body>
<iframe src="http://www.w3schools.com">
<p>Your browser does not support iframes.</p>
</iframe>
</body>
</html>
"""
soup = BeautifulSoup(data, "html.parser")
frame = soup.iframe
frame.string = 'this is the replacement'
print(soup.prettify())
Prints:
<!DOCTYPE html>
<html>
<body>
<iframe src="http://www.w3schools.com">
this is the replacement
</iframe>
</body>
</html>
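This also covers the empty-iframe case from the question: the .string setter replaces whatever is inside the tag (or nothing at all), so there is no index to run out of range. A quick check:
empty = BeautifulSoup('<iframe src="http://www.w3schools.com"></iframe>', "html.parser")
empty.iframe.string = 'this is the replacement'
print(empty.iframe)
# <iframe src="http://www.w3schools.com">this is the replacement</iframe>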

This will also replace the iframe tag's content; note that it uses the legacy BeautifulSoup 3 / Python 2 API:
s="""
<!DOCTYPE html>
<html>
<body>
<iframe src="http://www.w3schools.com">
<p>Your browser does not support iframes.</p>
</iframe>
</body>
</html>
"""
from BeautifulSoup import BeautifulSoup
from HTMLParser import HTMLParser
soup = BeautifulSoup(s, convertEntities=BeautifulSoup.HTML_ENTITIES)
show= soup.findAll('iframe')[0]
show.replaceWith('<iframe src="http://www.w3schools.com">this is the replacement</iframe>'.encode('utf-8'))
html = HTMLParser()
print html.unescape(str(soup.prettify()))
Output:
<!DOCTYPE html>
<html>
<body>
<iframe src="http://www.w3schools.com">my text</iframe>
</body>
</html>
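For reference, a roughly equivalent sketch with bs4 on Python 3 would be to parse the replacement markup separately and swap it in with replace_with():
from bs4 import BeautifulSoup

soup = BeautifulSoup(s, "html.parser")
replacement = BeautifulSoup(
    '<iframe src="http://www.w3schools.com">this is the replacement</iframe>',
    "html.parser").iframe
soup.iframe.replace_with(replacement)
print(soup.prettify())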

Related

Switch all "href = (link)" with "onclick = (PythonScript(link)) "

I am working on a web scraper that scrapes a website, does some processing on the body, and outputs the result into a new HTML file. One of the features would be to take any hyperlink in the HTML file and, instead of following it, run a script that takes the link as its input.
I want to go from this..
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scraper</title>
</head>
<body>
<a href="/wiki/Mercury_poisoning" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
To this....
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scraper</title>
</head>
<body>
<a onclick ='pythonScript(/wiki/Mercury_poisoning)' href="#" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
I did a lot of googling and read about jQuery and Ajax, but I do not know these tools and would prefer to do this in Python. Is it possible to do this using file I/O in Python?
You can do something like this using BeautifulSoup:
PS: You need to install BeautifulSoup first: pip install beautifulsoup4
from bs4 import BeautifulSoup as bs
html = '''<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scraper</title>
</head>
<body>
<a href="/wiki/Mercury_poisoning" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
'''
soup = bs(html, 'html.parser')
links = soup.find_all('a')
for link in links:
    actual_link = link['href']
    link['href'] = '#'
    link['onclick'] = 'pythonScript({})'.format(actual_link)
print(soup)
Output:
<html>
<head>
<meta charset="utf-8"/>
<title>Scraper</title>
</head>
<body>
<a href="#" onclick="pythonScript(/wiki/Mercury_poisoning)" title="Mercury poisoning">
mercury poisoning
</a>
</body>
</html>
Bonus:
You can also create a new HTML file like this:
with open('new_html_file.html', 'w') as out:
    out.write(str(soup))
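If the scraped page contains non-ASCII characters, it can be safer to write the file with an explicit encoding (a small variation on the above):
with open('new_html_file.html', 'w', encoding='utf-8') as out:
    out.write(str(soup))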

Get parent elements of a tag using Python requests-HTML

Hi, is there any way to get all the parent elements of a tag using requests-HTML?
For example:
<!DOCTYPE html>
<html lang="en">
<body id="two">
<h1 class="text-primary">hello there</h1>
<p>one two tree<b>four</b>five</p>
</body>
</html>
I want to get all parents of the b tag: [html, body, p]
or, for the h1 tag, get this result: [html, body].
With the excellent lxml:
from lxml import etree
html = """<!DOCTYPE html>
<html lang="en">
<body id="two">
<h1 class="text-primary">hello there</h1>
<p>one two tree<b>four</b>five</p>
</body>
</html> """
tree = etree.HTML(html)
# We search the first <b> element
b_elt = tree.xpath('//b')[0]
print(b_elt.text)
# -> "four"
# Walk over the ancestors of this <b> element
ancestors_tags = [elt.tag for elt in b_elt.iterancestors()]
print(ancestors_tags)
# -> ['p', 'body', 'html']
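The question asks for the outermost element first ([html, body, p]), so just reverse the list:
print(ancestors_tags[::-1])
# -> ['html', 'body', 'p']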
You can access the lower-level lxml element via the element attribute, which has an iterancestors() method.
Here is how you could do it:
from requests_html import HTML
html = """<!DOCTYPE html>
<html lang="en">
<body id="two">
<h1 class="text-primary">hello there</h1>
<p>one two tree<b>four</b>five</p>
</body>
</html>"""
html = HTML(html=html)
b = html.find('b', first=True)
parents = [a for a in b.element.iterancestors()]
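The items in parents are lxml elements, so you can pull out their tag names to get the list the question asks for (reversed so the outermost element comes first):
parent_tags = [parent.tag for parent in parents]
print(parent_tags[::-1])
# -> ['html', 'body', 'p']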

Wrap found tag inside new tag in bs4

How do I wrap a tag with a new tag in bs4?
For example, I have HTML like this:
<html>
<body>
<p>Demo</p>
<p>world</p>
</body>
</html>
I want to convert it to this.
<html>
<body>
<b><p>Demo</p></b>
<b> <p>world</p> </b>
</body>
</html>
Here is an illustration of what I have so far:
from bs4 import BeautifulSoup
html = """
<html>
<body>
<p>Demo</p>
<p>world</p>
</body>
</html>"""
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all('p'):
    # wrap tag with '<b>'
Use wrap(), as shown in the documentation:
from bs4 import BeautifulSoup
html = """
<html>
<body>
<p>Demo</p>
<p>world</p>
</body>
</html>"""
soup = BeautifulSoup(html, 'html.parser')
for p in soup('p'): # shortcut for soup.find_all('p')
    p.wrap(soup.new_tag("b"))
Output:
<html>
<body>
<b><p>Demo</p></b>
<b><p>world</p></b>
</body>
</html>
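As an aside, the inverse operation is unwrap(), which removes a tag but keeps its contents, so the wrapping above could be undone like this:
for b in soup('b'):
    b.unwrap()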

Inserting into a html file using python

I have an HTML file where I would like to insert a <meta> tag between the <head> and </head> tags using Python. If I open the file in append mode, how do I get to the relevant position where the <meta> tag should be inserted?
Use BeautifulSoup. Here's an example where a meta tag is inserted right after the title tag using insert_after():
from bs4 import BeautifulSoup as Soup
html = """
<html>
<head>
<title>Test Page</title>
</head>
<body>
<div>test</div>
</body>
</html>
"""
soup = Soup(html, "html.parser")
title = soup.find('title')
meta = soup.new_tag('meta')
meta['content'] = "text/html; charset=UTF-8"
meta['http-equiv'] = "Content-Type"
title.insert_after(meta)
print(soup)
Prints:
<html>
<head>
<title>Test Page</title>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/>
</head>
<body>
<div>test</div>
</body>
</html>
You can also find the head tag and use insert() with a specified position. Note that the position is an index into head.contents, which also counts the whitespace text nodes between tags:
head = soup.find('head')
head.insert(1, meta)
Also see:
Add parent tags with beautiful soup
How to append a tag after a link with BeautifulSoup
You need a markup manipulation library like the ones covered in Python and HTML Processing; I would advise against just opening the file and trying to append to it.
Hope it helps.

How can I add background color inside html code using beautifulsoup?

With BeautifulSoup I get the HTML code of a site; let's say it's this:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
How can I add this line, body {background-color:#b0c4de;}, inside the head tag using BeautifulSoup?
Let's say the Python code is:
#!/usr/bin/python
import cgi, cgitb, urllib2, sys
from bs4 import BeautifulSoup
site = "www.example.com"
page = urllib2.urlopen(site)
soup = BeautifulSoup(page)
You can use:
soup.head.append('body {background-color:#b0c4de;}')
But you should create a <style> tag first.
For instance:
head = soup.head
head.append(soup.new_tag('style', type='text/css'))
head.style.append('body {background-color:#b0c4de;}')
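Applied to the sample document from the question, the <head> should then end up roughly like this (exact whitespace depends on how you print the soup):
<head><style type="text/css">body {background-color:#b0c4de;}</style></head>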
