Beautiful Soup: extracting picture url from webpage - python

I'm currently having some issues trying to extract a picture URL from a web page using Beautiful Soup. I'm quite inexperienced with Beautiful Soup and would appreciate any feedback you have for me. Here is a snippet of the HTML I'm trying to extract the picture link from (more specifically, the data-srcset URL in the <source> tag with the media attribute):
<div class="container-fluid" itemscope="" itemtype="http://schema.org/Product">
<div class="row">
<div id="js_carousel" class="col-xs-12 col-md-8">
<div id="psp-carousel" class="carousel_outer">
<div id="product-carousel" class="pdp-carousel carousel pdp-initial" style="display:block;">
<!-- Wrapper for slides -->
<div class="carousel-inner" id="carousel-inner" role="listbox">
<img class="product-image-placeholder" itemprop="image" alt="..." src="data:image/svg+xml;charset=utf-8,%3Csvg xmlns%3D'http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg' viewBox%3D'0 0 355 462'%3E %3Crect fill%3D'%23eee' width%3D'100%25' height%3D'100%25'%2F%3E%3C%2Fsvg%3E" width="355" height="462">
<picture class="item active" data-image="//s7d2.scene7.com/is/image/aeo/1162_8725_499_of" role="option" aria-selected="true" tabindex="0">
<source media="(max-width: 767px)" data-srcset="//s7d2.scene7.com/is/image/aeo/1162_8725_499_of?$pdp-main_small$" srcset="//s7d2.scene7.com/is/image/aeo/1162_8725_499_of?$pdp-main_small$">
Any time I try to use the line
my_imgs = page_soup.findAll('picture',{'class':'item active'})
I get an empty array. I apologize if this is a dumb question, but any help would be appreciated.

Have you tried using the .select() method on your bs4 object? The documentation says this is the preferred way to find elements via CSS selectors in your HTML soup. So in this case, use page_soup.select('picture[class="item active"]') instead of .findAll().
The .find() and .findAll() spellings come from older versions of Beautiful Soup (the current name is .find_all()). Reading the documentation, it also looks like the code for those older versions should be formatted as my_imgs = page_soup.findAll('picture', attrs={'class': 'item active'}) instead of my_imgs = page_soup.findAll('picture', {'class': 'item active'}): the attrs keyword builds the dictionary that Beautiful Soup uses for attributes whose names can't be passed as keyword arguments.
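If the data-srcset on the <source> tag is the real target, a minimal sketch along these lines may help; it assumes the HTML snippet above is already in a string called page_html (if the site builds the carousel with JavaScript, the <picture> tag may not be present in the downloaded HTML at all, which would also explain the empty result):
from bs4 import BeautifulSoup

soup = BeautifulSoup(page_html, 'html.parser')

# Select the active <picture> slide, then the <source> child that carries data-srcset.
source = soup.select_one('picture.item.active source[data-srcset]')
if source is not None:
    print(source['data-srcset'])
    # //s7d2.scene7.com/is/image/aeo/1162_8725_499_of?$pdp-main_small$
else:
    print('No matching <picture> found; the markup may be generated by JavaScript.')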

Related

Beautiful Soup - How can I scrape images that contain a specific src attribute?

I've just started learning web scraping a few days ago and thought it would be fun to try scraping Mangadex as a mini project. Thank you for the advice in advance!
I'm trying to scrape images by extracting the src attribute of an img tag using Beautiful Soup 4 and Python 3.7
The HTML section I'm interested in is:
<div class="reader-image-wrapper col-auto my-auto justify-content-center align-items-center noselect nodrag row no-gutters" data-state="2" data-page="1" style="order: 1;">
<img draggable="false" class="noselect nodrag cursor-pointer" src="https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">
</div>
Each image that I'm interested in contains a src attribute that begins with "https://s5.mangadex.org/data/" so I thought maybe I could target images that begin with that specific attribute.
I've tried using select() to find the img element and then using get() to find the src but didn't have any luck with that specific html section.
HTML sections that did work using select() and get() were:
<img class="mx-2" height="38px" src="/images/misc/navbar.svg?3" alt="MangaDex" title="MangaDex">
<img src="/images/misc/miku.jpg" width="100%">
<img class="mx-2" height="38px" src="/images/misc/navbar.svg?3" alt="MangaDex" title="MangaDex">
Try this:
from bs4 import BeautifulSoup

html = """
<div class="reader-image-wrapper col-auto my-auto justify-content-center align-items-center noselect nodrag row no-gutters" data-state="2" data-page="1" style="order: 1;">
<img draggable="false" class="noselect nodrag cursor-pointer" src="https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">
</div>
<div class="reader-image-wrapper col-auto my-auto justify-content-center align-items-center noselect nodrag row no-gutters" data-state="2" data-page="1" style="order: 1;">
<img draggable="false" class="noselect nodrag cursor-pointer" src="https://s4.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
for n in soup.find_all('img'):
    # Keep only images hosted under the s5.mangadex.org data path.
    if n.get('src', '').startswith('https://s5.mangadex.org/data/'):
        print(n.get('src'))
result:
https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg
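An equivalent approach (a sketch, assuming the same soup object as above) is a CSS attribute selector, which lets select() do the prefix match for you:
# ^= matches attribute values that start with the given prefix
for img in soup.select('img[src^="https://s5.mangadex.org/data/"]'):
    print(img['src'])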
attrs will list all the attributes set on that tag. It's a dictionary, so to get a specific attribute value, see below.
# for getting webpages
import requests

r = requests.get(URL_LINK)
base_url = 'https://s5.mangadex.org/data/'

# for Beautiful Soup
from bs4 import BeautifulSoup

bs = BeautifulSoup(r.content, 'html.parser')
imgs = bs.findAll('img')
for img in imgs:
    src = img.attrs['src']
    if not src.startswith(base_url):
        src = base_url + src
    print(src)
You cannot scrape Mangadex with BeautifulSoup directly. Mangadex loads its images with JavaScript after the document is ready, so what you get with BeautifulSoup is the empty document, which is why it is failing. This article explains how you can scrape web pages that rely on JavaScript to serve their content:
https://towardsdatascience.com/data-science-skills-web-scraping-javascript-using-python-97a29738353f
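One common workaround (a sketch, not necessarily the approach taken in the linked article) is to let a real browser render the page with Selenium and only then hand the rendered HTML to Beautiful Soup; the chapter URL and the CSS selector used for the wait are assumptions:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://mangadex.org/chapter/...')  # hypothetical chapter URL

# Wait until JavaScript has injected at least one reader image.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'img.cursor-pointer')))

# The rendered DOM now contains the <img> tags that the raw download lacked.
soup = BeautifulSoup(driver.page_source, 'html.parser')
for img in soup.find_all('img'):
    src = img.get('src', '')
    if src.startswith('https://s5.mangadex.org/data/'):
        print(src)

driver.quit()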

Delete block in HTML based on text

I have an HTML snippet below and I need to delete a block based on its text, for example Name: John. I know I can do this with decompose() from BeautifulSoup using the class name sample, but I cannot apply decompose() here because the blocks have different attributes and tag names; only the text inside follows the same pattern. Is there a module in bs4 that can solve this?
<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
<div>
result:
<div id="container"><div>
To find tags based on inner text see How to select div by text content using Beautiful Soup?
Once you have the required div, you can simply call decompose():
import re
from bs4 import BeautifulSoup

html = '''<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
sample = soup.find(text=re.compile('Name'))
sample.parent.decompose()
print(soup.prettify())
Side note: notice that I fixed the closing tag for your "container" div!
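If several blocks share the same text pattern, a small extension of the same idea (a sketch under the same assumptions) removes them all; find_all() returns every matching text node and decompose() is called on each parent:
for text_node in soup.find_all(text=re.compile('Name')):
    # Each match is a NavigableString; remove the tag that encloses it.
    text_node.parent.decompose()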

Elements Inside Opening Tag

I am writing a spider to download all images on the front page of a subreddit using scrapy. To do so, I have to find the image links to download the images from and use a CSS or XPath selector.
Upon inspection, the links are provided but the HTML looks like this for all of them:
<div class="expando expando-uninitialized" style="display: none" data-cachedhtml=" <div class="media-preview" id="media-preview-7lp06p" style="max-width: 861px"> <div class="media-preview-content"> <img class="preview" src="https://i.redditmedia.com/Q-LKAeFelFa9wAdrnvuwCMyXLrs0ULUKMsJTXSf3y34.jpg?w=861&s=69085fb507bed30f1e4228e83e24b6b2" width="861" height="638"> </div> </div> " data-pin-condition="function() {return this.style.display != 'none';}"><span class="error">loading...</span></div>
From what I can tell, it looks like all of the new elements are being initialized inside the opening tag of the <div> element. Could you explain what exactly is going on here, and how one would go about extracting image information from this?
*Sorry, I'm not quite sure how to properly format the html code, but there really isn't all too much to format, as it is all one big tag anyway.
How to read the mangled attribute, data-cachedhtml
The HTML is a mess. Try the techniques listed in How to parse invalid (bad / not well-formed) XML? to get viable markup before using XPath. It may take three passes:
Clean up the markup mess.
Get the attribute value of data-cachedhtml.
Use XPath to extract the image links.
XPath part
For the de-mangled data-cachedhtml in this form:
<div class="media-preview" id="media-preview-7lp06p" style="max-width: 861px">
<div class="media-preview-content">
<a href="https://i.redd.it/29moua43so501.jpg" class="may-blank">
<img class="preview" src="https://i.redditmedia.com/elided"
width="861" height="638"/>
</a>
</div>
<span class="error">loading...</span>
</div>
This XPath will retrieve the preview image links:
//a/img/@src
(That is, all src attributes of img element children of a elements.)
or
This XPath will retrieve the click-through image links:
//a[img]/@href
(That is, all href attributes of the a elements that have an img child.)
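As a concrete illustration of steps 2 and 3 (a sketch, assuming the markup has already been cleaned up as described and lives in a string called cleaned_html), lxml can pull out the cached fragment and then run the XPaths above on it:
from lxml import html

# Step 2: parse the cleaned page and read the data-cachedhtml attribute.
outer = html.fromstring(cleaned_html)
cached = outer.xpath('//div[contains(@class, "expando")]/@data-cachedhtml')[0]

# Step 3: parse the recovered fragment and extract the image links.
inner = html.fromstring(cached)
print(inner.xpath('//a/img/@src'))    # preview image links
print(inner.xpath('//a[img]/@href'))  # click-through links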

web parsing using selenium and classes

I am trying to parse several items from a blog but I am unable to reach the last two items I need.
The html is:
<div class="post">
<div class="postHeader">
<h2 class="postTitle"><span></span>cuba and the cameraman</h2>
<span class="postMonth" title="2017">Nov</span>
<span class="postDay" title="2017">24</span>
<div class="postSubTitle"><span class="postCategories">TV Shows</span></div>
</div>
<div class="postContent"><p><a target="_blank" href="https://image.com/test.jpg"><img class="aligncenter" src="https://image.com/test.jpg"/></a> <br />
n/A<br />
<br />
<strong>Links:</strong> <a target='_blank' href='http://www.imdb.com/title/tt7320560/'>IMDB</a><br />
</p>
The data I need is the "cuba and the cameraman" (code below), the "https://image.com/test.jpg" url and the "http://www.imdb.com/title/tt7320560/" IMDB link.
I managed to parse correctly only the postTitle entries for the website:
all_titles = []
url = 'http://test.com'
browser.get(url)

titles = browser.find_elements_by_class_name('postHeader')
for title in titles:
    link = title.find_element_by_tag_name('a')
    all_titles.append(link.text)
But I can't get the image and IMDB links using the same method as above (by class name).
Could you support me on this? Thanks.
You need a more accurate search; there is a family of find_element_by_XX functions built in. Try XPath:
for post in driver.find_elements_by_xpath('//div[@class="post"]'):
    title = post.find_element_by_xpath('.//h2[@class="postTitle"]//a').text
    img_src = post.find_element_by_xpath('.//div[@class="postContent"]//img').get_attribute('src')
    link = post.find_element_by_xpath('.//div[@class="postContent"]//a[last()]').get_attribute('href')
Remember that you can always get the HTML source via driver.page_source and parse it with whatever tool you like.
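For example (a sketch reusing the class names from the snippet above; the exact selectors are assumptions), the same fields can be pulled out of driver.page_source with Beautiful Soup:
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'html.parser')
for post in soup.select('div.post'):
    title_tag = post.select_one('h2.postTitle')
    img = post.select_one('div.postContent img')
    links = post.select('div.postContent a')
    title = title_tag.get_text(strip=True) if title_tag else None
    img_src = img['src'] if img else None
    imdb_link = links[-1]['href'] if links else None
    print(title, img_src, imdb_link)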

Extracting HTML data fields with Python

Please forgive me for my lack of knowledge, but given HTML in the following format, what is the best way to extract the individual data fields? Please keep in mind that more often than not some, or all, of them will be NULL in which case we'll keep them at NULL.
<div class="profile-section" id="a-bit-more-about">
<dl>
<dt>Name:</dt>
<dd><span class="given-name">Clem</span> <span class="family-name">Kadiddlehopper</span></dd>
</dl>
<!-- <span class="RealName">/ <span class="fn n"><span class="given-name">Clem</span> <span class="family-name">Kadiddlehopper</span></span></span> -->
<dl>
<dt>Joined:</dt>
<dd>September 1910</dd>
</dl>
<div class="sep"></div>
<dl>
<dt>Hometown:</dt>
<dd>Quiet Rest Maximum Security Twilight Home</dd>
</dl>
<dl>
<dt>Currently:</dt>
<dd><span class="adr"><span class="locality">They won't tell me</span>, <span class="country-name">Zimbobwe</span></span></dd>
</dl>
<div class="sep"></div>
You want an HTML parser. I recommend Beautiful Soup or lxml.
Use the third-party modules Beautiful Soup or lxml, or the built-in module html.parser. For example:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><body><a>bbb</a></body></html>', 'html.parser')
soup.find('a')
Or, if you like, you can use a regex for a small target.
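For the profile block in the question, a minimal sketch (assuming the snippet is stored in profile_html and that missing fields should simply come back as None) could pair each <dt> label with its <dd> value:
from bs4 import BeautifulSoup

soup = BeautifulSoup(profile_html, 'html.parser')
fields = {}
for dl in soup.select('div#a-bit-more-about dl'):
    dt = dl.find('dt')
    dd = dl.find('dd')
    if dt is None:
        continue
    label = dt.get_text(strip=True).rstrip(':')  # e.g. 'Name', 'Joined'
    fields[label] = dd.get_text(' ', strip=True) if dd else None

print(fields.get('Name'))      # 'Clem Kadiddlehopper'
print(fields.get('Hometown'))  # 'Quiet Rest Maximum Security Twilight Home'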
