<html>
<body>
<h2>HTML Iframes</h2>
<p>You can use the height and width attributes to specify the size of the iframe:</p>
<iframe src="https://www.google.com/maps/embed/v1/place?key=AIzaSyAPi0wzs7IlNc4nlL3atU7iCd-A9QXfuHs&q=4.5596%2C-76.2801&zoom=18&maptype=satellite" height="200" width="300"></iframe>
</body>
</html>
I have this HTML file on my computer, and now I need to change what comes after src to a new string, for example:
https://www.google.com/maps/embed/v1/place?key=AIzaSyAPi0wzs7IlNc4nlL3atU7iCd-A9QXfuHs&q=25.5596%2C-7.2801&zoom=18&maptype=satellite
How can I put that example URL into the HTML file as the src value?
from bs4 import BeautifulSoup
htmlstr = '''
<html>
<body>
<h2>HTML Iframes</h2>
<p>You can use the height and width attributes to specify the size of the iframe:</p>
<iframe src="https://www.google.com/maps/embed/v1/place?key=AIzaSyAPi0wzs7IlNc4nlL3atU7iCd-A9QXfuHs&q=4.5596%2C-76.2801&zoom=18&maptype=satellite" height="200" width="300"></iframe>
</body>
</html>
'''
soup = BeautifulSoup(htmlstr, "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning
iframe = soup.find('iframe')
iframe["src"] = "test"
print(soup)
This should be the solution you're looking for. Replace "test" in iframe["src"] = "test" with the link you want to use, then save the result back to the HTML file.
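The answer leaves the "save the result back" step to the reader. A minimal sketch of the full round trip follows, assuming the file is UTF-8 and contains a single iframe. The helper name set_iframe_src, the throwaway file name, and the YOUR_KEY placeholder are made up for illustration.

```python
from pathlib import Path
import tempfile

from bs4 import BeautifulSoup


def set_iframe_src(path, new_src):
    """Point the first <iframe> in the file at new_src and save the file."""
    html_path = Path(path)
    soup = BeautifulSoup(html_path.read_text(encoding="utf-8"), "html.parser")
    iframe = soup.find("iframe")
    if iframe is None:
        raise ValueError(f"no <iframe> found in {html_path}")
    iframe["src"] = new_src
    # str(soup) serializes the modified tree back to HTML text.
    html_path.write_text(str(soup), encoding="utf-8")


# Demo on a throwaway file (hypothetical name, placeholder API key).
demo_path = Path(tempfile.mkdtemp()) / "iframe_demo.html"
demo_path.write_text(
    '<html><body><iframe src="https://example.com/old" '
    'height="200" width="300"></iframe></body></html>',
    encoding="utf-8",
)
new_url = (
    "https://www.google.com/maps/embed/v1/place?key=YOUR_KEY"
    "&q=25.5596%2C-7.2801&zoom=18&maptype=satellite"
)
set_iframe_src(demo_path, new_url)

# Re-read the file to confirm that the new value was stored.
saved = BeautifulSoup(demo_path.read_text(encoding="utf-8"), "html.parser")
saved_src = saved.iframe["src"]
print(saved_src)
```

The other attributes (height, width) are left untouched because only the src key of the tag is reassigned.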
I'm trying to check whether a text, e.g. "Recommended", is in the same div as the text "Product".
The structure of the HTML file is:
<html>
<head>
<title>Product Page</title>
</head>
<body>
<div class="div1">
<div class="div2">
<div class="divInside">
Recommended
</div>
<div class="above">
<div class="under"></div>
</div>
<div class="pct"></div>
<div class="prod">
Product
</div>
</div>
</div>
</body>
</html>
That's just an example HTML file to show my problem. As you can see, both texts are inside the same div with the div2 class, and each is also in its own div. How can I check whether both texts are present in a div tag with the div2 class?
For every element in your HTML DOM, you can get all of its children as a list, then check whether both texts occur in the same list (there are many ways to do that; for instance, turn such a list into a set, then compare their lengths). Repeat this for every child until a child has no further children.
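One way to sketch that idea with BeautifulSoup is shown below. Instead of recursing by hand, it relies on stripped_strings, which already walks all descendants of a tag. The function name texts_share_container is hypothetical, and the check assumes each target is the complete text of some descendant.

```python
from bs4 import BeautifulSoup

html = """
<div class="div1">
  <div class="div2">
    <div class="divInside">
      Recommended
    </div>
    <div class="above">
      <div class="under"></div>
    </div>
    <div class="pct"></div>
    <div class="prod">
      Product
    </div>
  </div>
</div>
"""


def texts_share_container(markup, css_class, targets):
    """Return True if one div with css_class contains every target text."""
    soup = BeautifulSoup(markup, "html.parser")
    for container in soup.find_all("div", class_=css_class):
        # Whitespace-trimmed text of every descendant of this container.
        found = set(container.stripped_strings)
        if all(target in found for target in targets):
            return True
    return False


both_present = texts_share_container(html, "div2", ["Recommended", "Product"])
one_missing = texts_share_container(html, "div2", ["Recommended", "Missing"])
print(both_present, one_missing)
```

Using set membership rather than comparing whole lists means extra text inside the same div does not break the check.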
One option is to try something along these lines (in this case, with CSS selectors):
from bs4 import BeautifulSoup as bs
data = """your html above"""
soup = bs(data,'lxml')
targets = ['Recommended', 'Product']
print(targets == list(soup.select('div.div2')[0].stripped_strings))
#or
print(targets == list(soup.select_one('div.div2').stripped_strings))
Output in either case:
True
What I want to do:
This HTML code:
<img class="poster lazyload lazyloaded"
data-src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg"
data-srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 2x"
alt="Hitman"
src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg"
srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 2x"
data-loaded="true">
I want to extract the value of the "data-src" or "src" attribute (or of any attribute that contains the URL to the image).
What I Tried:
Posters = soup.find("img")["src"]
print(Posters)
But this matches any img tag, so the links I get back are not related to the posters.
Output:
https://www.themoviedb.org/assets/2/v4/logos/v2/blue_short-8e7b30f73a4020692ccca9c88bafe5dcb6f8a62a4c6bc55cd9ba82bb2cd95f6c.SVG
https://www.themoviedb.org/assets/2/v4/logos/v2/blue_short-8e7b30f73a4020692ccca9c88bafe5dcb6f8a62a4c6bc55cd9ba82bb2cd95f6c.SVG
By posters I mean the film posters (see this URL: https://www.themoviedb.org/search?&query=Hitman).
Summary
I want to extract the value of an attribute on the element with the class "lazyloaded".
I hope everything is clear. Thanks.
Edit:
Explanation: where was the problem?
For everyone reading: Laurent's answer is the solution; the problem was the parsed HTML.
In my browser, the img tag I was trying to scrape had the class "poster lazyload lazyloaded":
but if we print website.content:
<img class="poster lazyload"
data-src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/lrDpwvha8VX05vIFxeSZTiPJGYl.jpg"
data-srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/lrDpwvha8VX05vIFxeSZTiPJGYl.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/lrDpwvha8VX05vIFxeSZTiPJGYl.jpg 2x"
alt="The Hitman's Bodyguard Collection">
it's very different: the lazyloaded class and the src attribute are missing.
You can try to filter by class:
posters = soup.find_all("img", {"class": "lazyloaded"})
for poster in posters:
    print(poster["src"])
See the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
edit: more explanation
Say you have the following file demo.html:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<img class="logo" src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg">
<img class="poster lazyload lazyloaded"
data-src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg"
data-srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 2x"
alt="Hitman"
src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg"
srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 2x"
data-loaded="true">
</body>
</html>
You can parse the "poster" images like this:
import io
from bs4 import BeautifulSoup
with io.open("demo.html", encoding="utf8") as fd:
    soup = BeautifulSoup(fd.read(), features="html.parser")
posters = soup.find_all("img", {"class": "lazyloaded"})
for poster in posters:
    print(poster["src"])
You get:
https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg
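As the edit in the question notes, the HTML returned by the server has only class="poster lazyload" and a data-src attribute; the lazyloaded class and src appear only in the browser. A hedged sketch for that raw HTML follows, assuming the poster class is a stable marker. The logo URL is a placeholder.

```python
from bs4 import BeautifulSoup

# Markup shaped like the raw response quoted in the question's edit.
raw_html = """
<img class="logo" src="https://example.com/logo.svg">
<img class="poster lazyload"
     data-src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/lrDpwvha8VX05vIFxeSZTiPJGYl.jpg"
     alt="The Hitman's Bodyguard Collection">
"""

soup = BeautifulSoup(raw_html, "html.parser")

poster_urls = []
for img in soup.find_all("img", class_="poster"):
    # Prefer data-src (present before lazy loading), fall back to src.
    url = img.get("data-src") or img.get("src")
    if url:
        poster_urls.append(url)

print(poster_urls)
```

Using .get() instead of img["src"] avoids a KeyError when an attribute is absent, which is the case for src in the raw response.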
Let's say I have the following iframe:
s = """
<!DOCTYPE html>
<html>
<body>
<iframe src="http://www.w3schools.com">
<p>Your browser does not support iframes.</p>
</iframe>
</body>
</html>
"""
I want to replace all of its content with the string 'this is the replacement'.
If I use
dom = BeautifulSoup(s, 'html.parser')
f = dom.find('iframe')
f.contents[0].replace_with('this is the replacement')
then instead of replacing all the content I replace only the first child, which in this case is a newline. This also fails if the iframe is completely empty, because f.contents[0] raises an IndexError.
Simply set the .string property:
from bs4 import BeautifulSoup
data = """
<!DOCTYPE html>
<html>
<body>
<iframe src="http://www.w3schools.com">
<p>Your browser does not support iframes.</p>
</iframe>
</body>
</html>
"""
soup = BeautifulSoup(data, "html.parser")
frame = soup.iframe
frame.string = 'this is the replacement'
print(soup.prettify())
Prints:
<!DOCTYPE html>
<html>
<body>
<iframe src="http://www.w3schools.com">
this is the replacement
</iframe>
</body>
</html>
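The question also raises the case of a completely empty iframe. A small sketch of the same .string assignment on an empty tag follows; it relies on the fact that setting .string clears any existing children first, so no indexing into .contents is needed.

```python
from bs4 import BeautifulSoup

# An iframe with no children at all, the case where f.contents[0] fails.
empty = BeautifulSoup('<iframe src="http://www.w3schools.com"></iframe>', "html.parser")
frame = empty.iframe

# Setting .string replaces whatever is inside the tag, even if that is nothing.
frame.string = "this is the replacement"

print(empty)
```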
This replaces the whole iframe tag, content included. Note that the code uses the legacy BeautifulSoup 3 package and Python 2 syntax.
s="""
<!DOCTYPE html>
<html>
<body>
<iframe src="http://www.w3schools.com">
<p>Your browser does not support iframes.</p>
</iframe>
</body>
</html>
"""
from BeautifulSoup import BeautifulSoup
from HTMLParser import HTMLParser
soup = BeautifulSoup(s, convertEntities=BeautifulSoup.HTML_ENTITIES)
show= soup.findAll('iframe')[0]
show.replaceWith('<iframe src="http://www.w3schools.com">this is the replacement</iframe>'.encode('utf-8'))
html = HTMLParser()
print html.unescape(str(soup.prettify()))
Output:
<!DOCTYPE html>
<html>
<body>
<iframe src="http://www.w3schools.com">this is the replacement</iframe>
</body>
</html>
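For readers on Python 3, a rough equivalent of this whole-tag replacement with the current bs4 package might look like the sketch below. It is not the answerer's code: it builds the replacement with new_tag and swaps it in with replace_with instead of inserting an HTML string.

```python
from bs4 import BeautifulSoup

s = """
<!DOCTYPE html>
<html>
<body>
<iframe src="http://www.w3schools.com">
<p>Your browser does not support iframes.</p>
</iframe>
</body>
</html>
"""

soup = BeautifulSoup(s, "html.parser")
old_frame = soup.find("iframe")

# Build a fresh iframe that keeps the original src but has new content.
new_frame = soup.new_tag("iframe", src=old_frame["src"])
new_frame.string = "this is the replacement"

# Swap the old tag (and everything inside it) for the new one.
old_frame.replace_with(new_frame)

print(soup.prettify())
```

Building a real tag rather than a string avoids the escaping and unescaping step that the legacy version needs.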
I'm using robobrowser to parse some HTML content. It has BeautifulSoup inside. How can I find a comment that contains a specified string?
<html>
<body>
<div>
<!-- some commented code here!!!<div><ul><li><div id='ANY_ID'>TEXT_1</div></li>
<li><div>other text</div></li></ul></div>-->
</div>
</body>
</html>
In fact, I need to get TEXT_1 given that I know ANY_ID.
Thanks
Use the text argument and check the type to be Comment. Then, load the contents with BeautifulSoup again and find the desired element by id:
from bs4 import BeautifulSoup
from bs4 import Comment
data = """
<html>
<body>
<div>
<!-- some commented code here!!!<div><ul><li><div id='ANY_ID'>TEXT_1</div></li>
<li><div>other text</div></li></ul></div>-->
</div>
</body>
</html>
"""
soup = BeautifulSoup(data, "html.parser")
comment = soup.find(text=lambda text: isinstance(text, Comment) and "ANY_ID" in text)
soup_comment = BeautifulSoup(comment, "html.parser")
text = soup_comment.find("div", id="ANY_ID").get_text()
print(text)
Prints TEXT_1.
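The same approach can be wrapped in a helper that also copes with pages where no comment contains the id. This is a sketch under the assumption that the commented-out markup is parseable on its own; the function name text_from_commented_element is made up for illustration.

```python
from bs4 import BeautifulSoup, Comment

data = """
<html>
<body>
<div>
<!-- some commented code here!!!<div><ul><li><div id='ANY_ID'>TEXT_1</div></li>
<li><div>other text</div></li></ul></div>-->
</div>
</body>
</html>
"""


def text_from_commented_element(markup, element_id):
    """Return the text of the element with element_id found inside an HTML comment, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    # Visit every comment node in the document.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # Parse the comment body as HTML and look for the id inside it.
        fragment = BeautifulSoup(comment, "html.parser")
        element = fragment.find(id=element_id)
        if element is not None:
            return element.get_text()
    return None


found_text = text_from_commented_element(data, "ANY_ID")
missing_text = text_from_commented_element(data, "NO_SUCH_ID")
print(found_text, missing_text)
```

Searching by id inside the parsed fragment, rather than by substring in the raw comment, avoids false matches where the id merely appears in the comment's prose.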
With BeautifulSoup I get the HTML code of a site; let's say it's this:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
How can I add the line body {background-color:#b0c4de;} inside the head tag using BeautifulSoup?
Let's say the Python code is:
#!/usr/bin/python
import cgi, cgitb, urllib2, sys
from bs4 import BeautifulSoup
site = "http://www.example.com"  # urlopen needs the scheme (http://)
page = urllib2.urlopen(site)
soup = BeautifulSoup(page)
You can use:
soup.head.append('body {background-color:#b0c4de;}')
But you should create a <style> tag first; otherwise the rule ends up as bare text inside <head>.
For instance:
head = soup.head
head.append(soup.new_tag('style', type='text/css'))
head.style.append('body {background-color:#b0c4de;}')
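Put together as a self-contained sketch (parsing the sample page from a string instead of fetching it, so no network access is assumed):

```python
from bs4 import BeautifulSoup

page = """<!DOCTYPE html>
<html>
<head>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>"""

soup = BeautifulSoup(page, "html.parser")

# Create the <style> element, fill it with the CSS rule, then attach it to <head>.
style = soup.new_tag("style", type="text/css")
style.string = "body {background-color:#b0c4de;}"
soup.head.append(style)

print(soup.head)
```

Filling the tag before appending it keeps a reference to the exact element being edited, which matters if the head already contains another style tag.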