What I want to do:
This HTML code:
<img class="poster lazyload lazyloaded"
data-src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg"
data-srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 2x"
alt="Hitman"
src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg"
srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 2x"
data-loaded="true">
I want to extract the value of the "data-src" or "src" attribute (or of any attribute that contains the URL of the image).
What I Tried:
Posters = soup.find("img")["src"]
print(Posters)
But this matches img tags in general rather than only the posters, so the links returned are not the poster URLs.
Output:
https://www.themoviedb.org/assets/2/v4/logos/v2/blue_short-8e7b30f73a4020692ccca9c88bafe5dcb6f8a62a4c6bc55cd9ba82bb2cd95f6c.SVG
https://www.themoviedb.org/assets/2/v4/logos/v2/blue_short-8e7b30f73a4020692ccca9c88bafe5dcb6f8a62a4c6bc55cd9ba82bb2cd95f6c.SVG
By posters I mean the film posters (see this URL: https://www.themoviedb.org/search?&query=Hitman).
Summary
I want to extract the value of an attribute from the img tags that have the class "lazyloaded".
I hope everything is clear. Thanks.
Edit:
Explanation: where was the problem?
For everyone reading, Laurent's answer is the solution; the problem was the parsed HTML.
In my browser, the img tag containing the attribute that I was trying to scrape has the class "poster lazyload lazyloaded",
but if we print website.content:
<img class="poster lazyload"
data-src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/lrDpwvha8VX05vIFxeSZTiPJGYl.jpg"
data-srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/lrDpwvha8VX05vIFxeSZTiPJGYl.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/lrDpwvha8VX05vIFxeSZTiPJGYl.jpg 2x"
alt="The Hitman's Bodyguard Collection">
the markup is quite different: the "lazyloaded" class and the "src" attribute are missing, so only "data-src" holds the URL.
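Given that difference, one way to read the poster URLs from the HTML as served is to select on the "poster" class and fall back from "data-src" to "src". The following is only a sketch: the function name poster_urls and the logo URL in the sample markup are made up for illustration, and the poster tag is copied from the snippet above.

```python
from bs4 import BeautifulSoup

# Sample of the HTML as served, before the lazy-loading script has run.
# The logo line is a hypothetical stand-in for the site's other images.
served_html = """
<img class="logo" src="https://example.com/logo.svg">
<img class="poster lazyload"
     data-src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/lrDpwvha8VX05vIFxeSZTiPJGYl.jpg"
     alt="The Hitman's Bodyguard Collection">
"""


def poster_urls(soup):
    """Return the image URL of every <img> that carries the "poster" class."""
    urls = []
    for img in soup.find_all("img", class_="poster"):
        # Prefer data-src (present in the served HTML), fall back to src.
        url = img.get("data-src") or img.get("src")
        if url:
            urls.append(url)
    return urls


soup = BeautifulSoup(served_html, "html.parser")
urls = poster_urls(soup)
print(urls)
```

Using .get() instead of indexing avoids a KeyError when an attribute is absent, which is exactly what happens with "src" in the served markup.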
You can try to filter by class:
posters = soup.find_all("img", {"class": "lazyloaded"})
for poster in posters:
    print(poster["src"])
See the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
edit: more explanation
Say you have the following file demo.html:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<img class="logo" src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg">
<img class="poster lazyload lazyloaded"
data-src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg"
data-srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 2x"
alt="Hitman"
src="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg"
srcset="https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 1x, https://image.tmdb.org/t/p/w188_and_h282_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg 2x"
data-loaded="true">
</body>
</html>
You can parse the "poster" images like this:
import io
from bs4 import BeautifulSoup
with io.open("demo.html", encoding="utf8") as fd:
    soup = BeautifulSoup(fd.read(), features="html.parser")
posters = soup.find_all("img", {"class": "lazyloaded"})
for poster in posters:
    print(poster["src"])
You get:
https://image.tmdb.org/t/p/w94_and_h141_bestv2/3qlQM9KP1cyvNfPChA9rASASdHr.jpg
Related
import bs4
foo = """<!DOCTYPE html>
<html>
<body>
<h1>This is heading1</h1>
<p>
This is a paragraph1
</p>
<h2>
This is heading2
</h2>
</body>
</html>"""
def remove_p(text):
    obj = bs4.BeautifulSoup(text, features="html.parser")
    for tag in obj.find_all("p"):
        tag.decompose()
    return str(obj)
foo = remove_p(foo)
print(foo)
beautifulsoup4 4.11.0
bs4 0.0.1
bs4 inserts blank lines corresponding to <p>. I expected entries corresponding to <p> tag to be deleted - no blank lines.
bs4 removes the leading spaces for opening tags. However, it doesn't remove leading spaces for closing tags </h2> and text.
I would like the function to return text with <p> entries removed without modifying the formatting. Please suggest.
Actual output
<!DOCTYPE html>
<html>
<body>
<h1>This is heading1</h1>
<h2>
This is heading2
</h2>
</body>
</html>
Expected Output
<!DOCTYPE html>
<html>
<body>
<h1>This is heading1</h1>
<h2>
This is heading2
</h2>
</body>
</html>
EDIT:
Thanks for all the suggestions to use prettify(). I have already tried using prettify() but it completely changes the formatting of the document. Excuse me for not mentioning it to start with.
To add some context, we receive these documents from our upstream, and we are supposed to just delete some nodes without changing the formatting.
This is not exactly what you want, but there is a way to prettify the code: use obj.prettify() instead of str(obj).
You can use the prettify() method that is built into BeautifulSoup.
Here is an example from the BeautifulSoup documentation (html_doc is the sample document defined there):
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
# <html>
# <head>
# <title>
# The Dormouse's story
# </title>
# </head>
# <body>
# <p class="title">
# <b>
# The Dormouse's story
# </b>
# </p>
# <p class="story">
# Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">
# Elsie
# </a>
# ,
# <a class="sister" href="http://example.com/lacie" id="link2">
# Lacie
# </a>
# and
# <a class="sister" href="http://example.com/tillie" id="link3">
# Tillie
# </a>
# ; and they lived at the bottom of a well.
# </p>
# <p class="story">
# ...
# </p>
# </body>
# </html>
I'm trying to check if the text e.g.: "Recommended" is in the same div as the text "Product".
The structure of the HTML file is:
<html>
<head>
<title>Product Page</title>
</head>
<body>
<div class="div1">
<div class="div2">
<div class="divInside">
Recommended
</div>
<div class="above">
<div class="under"></div>
</div>
<div class="pct"></div>
<div class="prod">
Product
</div>
</div>
</div>
</body>
</html>
That's just an example HTML file to show my problem but as you can see both texts are in the same div with div2 class and they are in their own divs as well. So how can I check if both texts are present in a div2 class div tag?
For every element in your HTML DOM, you can get all children of the element as a list, then check whether both texts occur in the same list (there are many ways to do that; for instance, turn such a list into a set, then compare their lengths). Repeat this for every child until a child has no further children.
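A simplified variant of that idea can be sketched as follows: collect the text fragments found anywhere under each "div2" element and test whether both targets are among them. The helper name has_all_texts is made up for illustration, and the subset test is an assumption about what "in the same div" should mean (extra text inside the div is tolerated).

```python
from bs4 import BeautifulSoup

html = """
<div class="div1">
  <div class="div2">
    <div class="divInside">Recommended</div>
    <div class="above"><div class="under"></div></div>
    <div class="pct"></div>
    <div class="prod">Product</div>
  </div>
</div>
"""


def has_all_texts(container, targets):
    """True if every target string appears as a text fragment inside container."""
    found = set(container.stripped_strings)
    return set(targets) <= found


soup = BeautifulSoup(html, "html.parser")
matches = [
    div
    for div in soup.find_all("div", class_="div2")
    if has_all_texts(div, ["Recommended", "Product"])
]
print(len(matches))
```

Because stripped_strings walks all descendants, no explicit recursion over the children is needed here.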
One option is to try something along these lines (in this case, with CSS selectors):
from bs4 import BeautifulSoup as bs
data = """your html above"""
soup = bs(data,'lxml')
targets = ['Recommended', 'Product']
print(targets == list(soup.select('div.div2')[0].stripped_strings))
#or
print(targets == list(soup.select_one('div.div2').stripped_strings))
Output in either case:
True
I have an HTML file where I would like to insert a <meta> tag between the <head> and </head> tags using Python. If I open the file in append mode, how do I get to the relevant position where the <meta> tag is to be inserted?
Use BeautifulSoup. Here's an example where a meta tag is inserted right after the title tag using insert_after():
from bs4 import BeautifulSoup as Soup
html = """
<html>
<head>
<title>Test Page</title>
</head>
<body>
<div>test</div>
</html>
"""
soup = Soup(html)
title = soup.find('title')
meta = soup.new_tag('meta')
meta['content'] = "text/html; charset=UTF-8"
meta['http-equiv'] = "Content-Type"
title.insert_after(meta)
print(soup)
prints:
<html>
<head>
<title>Test Page</title>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/>
</head>
<body>
<div>test</div>
</body>
</html>
You can also find head tag and use insert() with a specified position:
head = soup.find('head')
head.insert(1, meta)
Also see:
Add parent tags with beautiful soup
How to append a tag after a link with BeautifulSoup
You need a markup manipulation library like the ones described at this link: Python and HTML Processing. I would advise against just opening the file and trying to append to it.
Hope it helps.
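Append mode can only write at the end of a file, so the usual pattern is to read the file, modify the parsed tree, and write the whole document back. Here is a rough sketch of that round trip; the function name add_meta_to_head, the file name page.html, and the attribute values are all invented for the example.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def add_meta_to_head(path, attrs):
    """Parse the file, append a <meta> tag to <head>, and write the result back."""
    path = Path(path)
    soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
    if soup.head is None:
        raise ValueError("document has no <head> element")
    # attrs is passed as a dict because keys such as "name" or "http-equiv"
    # cannot be given as ordinary keyword arguments to new_tag().
    soup.head.append(soup.new_tag("meta", attrs=attrs))
    path.write_text(str(soup), encoding="utf-8")


# Demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page.html"
    page.write_text(
        "<html><head><title>Test Page</title></head><body></body></html>",
        encoding="utf-8",
    )
    add_meta_to_head(page, {"name": "description", "content": "demo"})
    result = page.read_text(encoding="utf-8")

print(result)
```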
With BeautifulSoup I get the HTML code of a site; let's say it's this:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
How can I add this line body {background-color:#b0c4de;} inside the head tag using BeautifulSoup?
Let's say the Python code is:
#!/usr/bin/python
# Python 3 version: urllib2 became urllib.request, and the URL needs a scheme.
from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "http://www.example.com"
page = urlopen(site)
soup = BeautifulSoup(page, "html.parser")
You can use:
soup.head.append('body {background-color:#b0c4de;}')
But you should create a <style> tag first, so that the rule is not added to <head> as bare text.
For instance:
head = soup.head
head.append(soup.new_tag('style', type='text/css'))
head.style.append('body {background-color:#b0c4de;}')
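If the page might already contain a <style> tag, head.style would return that first one, so it can be safer to keep a reference to the tag you create. A minimal sketch under that assumption, using a shortened copy of the sample page from the question:

```python
from bs4 import BeautifulSoup

html = (
    "<!DOCTYPE html><html><head></head>"
    "<body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Build the <style> tag, fill it, then attach it to <head>.
style = soup.new_tag("style", type="text/css")
style.string = "body {background-color:#b0c4de;}"
soup.head.append(style)

print(soup.head)
```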
Is there a way to use BeautifulSoup to match a tag with only the indicated class attribute, not the indicated class attribute and others? For example, in this simple HTML:
<html>
<head>
<title>
Title here
</title>
</head>
<body>
<div class="one two">
some content here
</div>
<div class="two">
more content here
</div>
</body>
</html>
is it possible to match only the div with class="two", but not match the div with class="one two"? Unless I'm missing something, that section of the documentation doesn't give me any ideas. This is the code I'm using currently:
from bs4 import BeautifulSoup
html = '''
<html>
<head>
<title>
Title here
</title>
</head>
<body>
<div class="one two">
should not be matched
</div>
<div class="two">
this should be matched
</div>
</body>
</html>
'''
soup = BeautifulSoup(html)
div_two = soup.find("div", "two")
print(div_two.contents[0].strip())
I'm trying to get this to print this should be matched instead of should not be matched.
EDIT: In this simple example, I know that the only options for classes are "one two" or "two", but in production code, I'll only know that what I want to match will have class "two"; other tags could have a large number of other classes in addition to "two", which may not be known.
On a related note, it's also helpful to read the documentation for version 4, not version 3 as I previously linked.
Try:
# class is a reserved word in Python, so BeautifulSoup uses the class_ keyword.
divs = soup.find_all('div', class_="two")
for div in divs:
    if div['class'] == ['two']:
        pass  # handle class="two"
    else:
        pass  # handle other cases, including but not limited to "one two"
I hope the code below helps, though I haven't tested it.
soup.find("div", { "class" : "two" })
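For comparison, the exact-match requirement can also be expressed as a filter function passed to find_all(), relying on the fact that BeautifulSoup stores the class attribute as a list of class names. This is a sketch, and the helper name has_only_class is made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="one two">should not be matched</div>
<div class="two">this should be matched</div>
"""
soup = BeautifulSoup(html, "html.parser")


def has_only_class(tag, name, cls):
    """True if tag is a <name> element whose class attribute is exactly [cls]."""
    return tag.name == name and tag.get("class") == [cls]


exact = soup.find_all(lambda tag: has_only_class(tag, "div", "two"))
texts = [div.get_text(strip=True) for div in exact]
print(texts)
```

A plain class search such as find("div", {"class": "two"}) matches any tag that has "two" among its classes, which is why the extra list comparison is needed.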