Elements Inside Opening Tag - python

I am writing a spider to download all images on the front page of a subreddit using scrapy. To do so, I have to find the image links to download the images from and use a CSS or XPath selector.
Upon inspection, the links are provided but the HTML looks like this for all of them:
<div class="expando expando-uninitialized" style="display: none" data-cachedhtml=" <div class="media-preview" id="media-preview-7lp06p" style="max-width: 861px"> <div class="media-preview-content"> <img class="preview" src="https://i.redditmedia.com/Q-LKAeFelFa9wAdrnvuwCMyXLrs0ULUKMsJTXSf3y34.jpg?w=861&s=69085fb507bed30f1e4228e83e24b6b2" width="861" height="638"> </div> </div> " data-pin-condition="function() {return this.style.display != 'none';}"><span class="error">loading...</span></div>
From what I can tell, it looks like all of the new elements are being initialized inside the opening tag of the <div> element. Could you explain what exactly is going on here, and how one would go about extracting image information from this?
*Sorry, I'm not quite sure how to properly format the html code, but there really isn't all too much to format, as it is all one big tag anyway.

How to read the mangled attribute, data-cachedhtml
The HTML is a mess. Try the techniques listed in How to parse invalid (bad / not well-formed) XML? to get viable markup before using XPath. It may take three passes:
Clean up the markup mess.
Get the attribute value of data-cachedhtml.
Use XPath to extract the image links.
XPath part
For the de-mangled data-cachedhtml in this form:
<div class="media-preview" id="media-preview-7lp06p" style="max-width: 861px">
<div class="media-preview-content">
<a href="https://i.redd.it/29moua43so501.jpg" class="may-blank">
<img class="preview" src="https://i.redditmedia.com/elided"
width="861" height="638"/>
</a>
</div>
<span class="error">loading...</span>
</div>
This XPath will retrieve the preview image links:
//a/img/@src
(That is, all src attributes of img element children of a elements.)
or
This XPath will retrieve the click-through image links:
//a[img]/@href
(That is, all href attributes of the a elements that have an img child.)
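The three passes can also be sketched with Python's standard library alone. This is a minimal sketch, not the answer's method: it assumes the live page serves the data-cachedhtml value entity-escaped (&lt;, &quot;), in which case html.parser decodes the attribute and no cleanup pass is needed. It walks tags with html.parser instead of evaluating XPath. The class names, the function name, and the example.com URLs are hypothetical.

```python
from html.parser import HTMLParser


class CachedHtmlCollector(HTMLParser):
    """Pass 1: collect the decoded value of every data-cachedhtml attribute."""

    def __init__(self):
        super().__init__()
        self.fragments = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-cachedhtml" and value:
                self.fragments.append(value)


class ImageLinkCollector(HTMLParser):
    """Pass 2: collect img src values and the href of the enclosing a element."""

    def __init__(self):
        super().__init__()
        self.current_href = None
        self.preview_links = []
        self.click_through_links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a":
            self.current_href = attributes.get("href")
        elif tag == "img" and attributes.get("src"):
            self.preview_links.append(attributes["src"])
            if self.current_href:
                self.click_through_links.append(self.current_href)

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None


def extract_image_links(page_html):
    """Return (preview links, click-through links) found in cached fragments."""
    outer = CachedHtmlCollector()
    outer.feed(page_html)
    outer.close()
    previews, click_throughs = [], []
    for fragment in outer.fragments:
        inner = ImageLinkCollector()
        inner.feed(fragment)
        inner.close()
        previews.extend(inner.preview_links)
        click_throughs.extend(inner.click_through_links)
    return previews, click_throughs


# Hypothetical page fragment with the attribute value entity-escaped.
sample_page = (
    '<div class="expando" data-cachedhtml="'
    '&lt;a href=&quot;https://example.com/full.jpg&quot;&gt;'
    '&lt;img class=&quot;preview&quot; src=&quot;https://example.com/preview.jpg&quot;&gt;'
    '&lt;/a&gt;">'
    '<span class="error">loading...</span></div>'
)

previews, click_throughs = extract_image_links(sample_page)
print(previews)
print(click_throughs)
```

In a Scrapy spider, the same idea would be to read the attribute value first and then run the XPath expressions above against that decoded fragment.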

Related

Find tag <a> and tag <img> when using bs4

I have the following source code:
<div class='aaa'>
<div class='aaa-child'>
<a>
<img></img>
</a>
</div>
</div>
So the structure is an image inside a hyperlink.
I would like to find out whether the tags "a" and "img" exist inside the above divs. Any ideas? I tried find_all, but I get too many results that don't match my expectations.
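Yes, use a descendant CSS selector together with a class selector. Repeat the .aaa prefix for each tag, because a bare img after the comma would match every img in the document:
soup.select('.aaa a, .aaa img')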

Using Selenium to find an image with a given src link and click

This is the html code of a website I want to scrape
<div>
<div class="activityinstance">
<a class="" onclick="" href="https://www.blablabla.com">
<img src="http://www.blablabla.com/justapicture.jpg" class="iconlarge activityicon" alt="" role="presentation" aria-hidden="true">
<span class="instancename">title<span class="accesshide "> text
</span>
</span>
</a>
</div>
</div>
IMAGELINK = "http://www.blablabla.com/justapicture.jpg"
My aim is to find, on a particular page, all of the hrefs that are associated with IMAGELINK, using Python.
The picture from this URL tends to be shown multiple times, and I want to receive all the links so I can click on them.
I tried to find elements by class name "a" to extract all of the links in the page; that way, if I could find their XPath, I could just append "/img" and get the "src" attribute from that element.
But the problem is that I haven't found a way to extract the XPath of a given webdriver element.
NOTE: I don't have access to the XPath of the element unless I write some function to generate it.
Find all elements with tag img and print the src attribute:
imgs = driver.find_elements_by_xpath("//img")
for img in imgs:
    print(img.get_attribute("src"))
I think the asker wanted the href of the parent a element for each img whose src equals IMAGELINK:
links = driver.find_elements_by_xpath("//img[@src='http://www.blablabla.com/justapicture.jpg']/parent::a")
for link in links:
    print(link.get_attribute("href"))
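The parent lookup can be checked offline against a static copy of the markup. This is a minimal sketch, not a Selenium example: it uses the standard library's xml.etree.ElementTree, whose limited XPath subset has no parent:: axis, so ".." takes its place. The snippet is a hypothetical, well-formed variant of the page (img is self-closed, and a second entry with made-up URLs is added for contrast).

```python
import xml.etree.ElementTree as ET

IMAGELINK = "http://www.blablabla.com/justapicture.jpg"

# Hypothetical, well-formed copy of the markup; the second entry is invented
# to show that non-matching images are skipped.
snippet = """
<div>
  <div class="activityinstance">
    <a href="https://www.blablabla.com">
      <img src="http://www.blablabla.com/justapicture.jpg" class="iconlarge activityicon"/>
      <span class="instancename">title</span>
    </a>
  </div>
  <div class="activityinstance">
    <a href="https://www.blablabla.com/other">
      <img src="http://www.blablabla.com/another.jpg"/>
    </a>
  </div>
</div>
"""

root = ET.fromstring(snippet)

# ".." steps from each matching img up to its parent element.
parents = root.findall(f".//img[@src='{IMAGELINK}']/..")

# Keep only parents that are links, mirroring parent::a.
hrefs = [parent.get("href") for parent in parents if parent.tag == "a"]
print(hrefs)
```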

Scrapy: How do I select the first a tag inside a div element using XPath

I am using Scrapy's SitemapSpider to pull all product links from their respective collections. My list of sites are all Shopify stores, and the code that links to the products looks like this:
<div class="grid__item grid-product medium--one-half large--one-third">
<div class="grid-product__wrapper">
<div class="grid-product__image-wrapper">
<a class="grid-product__image-link" href="/collections/accessories/products/black-double-layer-braided-leather-bracelet">
<img src="//cdn.shopify.com/s/files/1/1150/5108/products/product-image_50ce19b1-c700-4a77-9638-e2ac66a3acef_grande.jpg?v=1457310318" alt="Black Double Layer Braided Leather Bracelet" class="grid-product__image">
</a>
</div>
<a href="/collections/accessories/products/black-double-layer-braided-leather-bracelet" class="grid-product__meta">
<span class="grid-product__title">Black Double Layer Braided Leather Bracelet</span>
<span class="grid-product__price-wrap">
<span class="long-dash">—</span>
<span class="grid-product__price">
$ 15
</span>
</span>
</a>
</div>
</div>
Obviously, both href's are the exact same. The problem I'm having is scraping both links when using the following code:
product_links = response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")][1]/@href').extract()
I'm trying to select the div element that has both a tags as descendants. From that, I only want to pull the href from the first a tag to avoid duplicate links.
Although each site is a Shopify, their source code for the collections page isn't the exact same. So the depth of the a tag under the div element is inconsistent and I'm not able to add a predicate like
//div[@class="grid__item grid-product medium--one-half large--one-third"]
product_links = response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")][1]/@href').extract()
print(product_links[0])  # This is your first a tag
Just use extract_first() to extract only the first matched element. A benefit of this is that it avoids an IndexError and returns None when no element matches the selection.
So, it should be:
>>> response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")]/@href').extract_first()
u'/collections/accessories/products/black-double-layer-braided-leather-bracelet'
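Note that extract_first() returns a single link for the whole response. If the goal is one link per product, one option is to extract every matching href and drop the repeats while keeping the original order. The sketch below assumes the list came from .extract(); the helper name and the second product URL are hypothetical.

```python
def unique_in_order(links):
    """Drop repeated links while keeping first-seen order."""
    # dict preserves insertion order, so this keeps the first occurrence of each link.
    return list(dict.fromkeys(links))


# Hypothetical result of .extract() on a collections page: each product is
# linked twice (image link and meta link).
extracted = [
    "/collections/accessories/products/black-double-layer-braided-leather-bracelet",
    "/collections/accessories/products/black-double-layer-braided-leather-bracelet",
    "/collections/accessories/products/example-second-product",
    "/collections/accessories/products/example-second-product",
]

product_links = unique_in_order(extracted)
print(product_links)
```

This sidesteps the inconsistent nesting depth across stores, since it does not depend on where the a tags sit under the div.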

scrapy scrape only after checking if a class exists

I've created a crawler to crawl a webpage and store items in a MySQL database. I'm facing a slight problem while scraping a fixed part of the webpage. I want to check whether a div with a certain class name exists inside a div, and if it exists, I will store the root div.
<div class="page-col-1-2-right">
<div class="block">
<h2>Produktbewertung und Test</h2>
<div class="area spacing ingredient-rating"></div>
</div>
<div class="block">
<h2>Artikel zu Nasentropfen & Schnupfen</h2>
<div class="cell clickable teaser-large" data-id="62151"></div>
<div>
</div>
In the above code, I want the div block if and only if it has
<div class="area spacing ingredient-rating"></div>
inside it. Since some pages of the website I'm crawling may or may not have the required block, my code below didn't work.
response.xpath('//div[contains(@class, "page-col-1-2-right")]/div[contains(@class, "block")][2]').extract()[0]
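Since you have a test to perform before extracting the block, put the test inside the XPath as a nested predicate. Joining two XPath strings with Python's and does not combine them: 'a' and 'b' simply evaluates to the second string, so only that expression would be queried.
Applying it to your code:
response.xpath('//div[contains(@class, "page-col-1-2-right")]/div[contains(@class, "block")][div[contains(@class, "ingredient-rating")]]').extract_first()
extract_first() returns None instead of raising an IndexError on pages that lack the block.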
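The "only if it contains" logic can be tried without running a crawl. This is a minimal sketch, not Scrapy code: it uses the standard library's xml.etree.ElementTree. Its XPath subset has no contains(), so the sketch matches exact class values and does the child test in Python. The sample is a hypothetical, well-formed version of the page, and rating_blocks is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed version of the page from the question.
page = """
<div class="page-col-1-2-right">
  <div class="block">
    <h2>Produktbewertung und Test</h2>
    <div class="area spacing ingredient-rating"></div>
  </div>
  <div class="block">
    <h2>Artikel zu Nasentropfen</h2>
    <div class="cell clickable teaser-large" data-id="62151"></div>
  </div>
</div>
"""


def rating_blocks(root):
    """Return the block divs that have an ingredient-rating div as a child."""
    blocks = root.findall(".//div[@class='block']")
    return [
        block
        for block in blocks
        if block.find("div[@class='area spacing ingredient-rating']") is not None
    ]


root = ET.fromstring(page)
matches = rating_blocks(root)
headings = [block.findtext("h2") for block in matches]
print(headings)
```

A page without the rating div yields an empty list, so there is no index to fail on.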

scrapy scrape html source code

I'm using Scrapy to crawl and scrape a website. I need the whole HTML instead of individual components. We can easily extract a component using XPath selectors, but is there a method to extract the whole HTML block for a given class? For example, in the HTML below, I need the exact source code of the whole div block prod-basic-info. Is there any way I can do this?
<div class="block prod-basic-info">
<h2>Product information</h2>
<p class="product-info-label">Category</p>
<p>
<a href="xyz.html"></a>
</p>
</div>
Just point your xpath expression or CSS selector to the element and extract() it:
response.xpath('//div[contains(@class, "prod-basic-info")]').extract()[0]
response.css('div.prod-basic-info').extract()[0]
