scrapy: scrape only after checking if a class exists - python

I've created a crawler to crawl a webpage and store items in a MySQL database. I'm facing a slight problem while scraping a fixed part of the webpage: I want to check whether a div with a certain class name exists inside a div, and if it does, I will store the root div.
<div class="page-col-1-2-right">
<div class="block">
<h2>Produktbewertung und Test</h2>
<div class="area spacing ingredient-rating"></div>
</div>
<div class="block">
<h2>Artikel zu Nasentropfen & Schnupfen</h2>
<div class="cell clickable teaser-large" data-id="62151"></div>
</div>
</div>
In the above code, I want the div block if and only if it has
<div class="area spacing ingredient-rating"></div>
inside it. Since some pages of the website I'm crawling may or may not have the required block, my code below didn't work.
response.xpath('//div[contains(@class, "page-col-1-2-right")]/div[contains(@class, "block")][2]').extract()[0]

Since you have a test to perform before extracting the text, you can use an and expression inside an XPath predicate, i.e. a single expression of the form //parent/child[test1 and test2].
Applied to your code:
response.xpath('//div[contains(@class, "page-col-1-2-right")]/div[contains(@class, "block") and div[contains(@class, "ingredient-rating")]]').extract()[0]
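The same "keep the parent only if a certain child exists" test can be sketched with the standard library's ElementTree on a simplified, well-formed copy of the markup. This is only a stand-in for Scrapy's selectors (which are not assumed to be installed); the class names are taken from the question:

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the scraped page.
html = """
<div class="page-col-1-2-right">
  <div class="block">
    <h2>Produktbewertung und Test</h2>
    <div class="area spacing ingredient-rating"></div>
  </div>
  <div class="block">
    <h2>Artikel zu Nasentropfen</h2>
    <div class="cell clickable teaser-large" data-id="62151"></div>
  </div>
</div>
"""

root = ET.fromstring(html)
# Keep a .block div only if one of its children carries the
# ingredient-rating class -- the same test the XPath predicate performs.
blocks = [
    block for block in root.findall('div')
    if any('ingredient-rating' in (child.get('class') or '')
           for child in block)
]
print(len(blocks))  # 1: only the first block contains the rating div
```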


Can't find html sub-elements from inside an element

I'm somewhat inexperienced in scraping websites with lots of sub-elements and am trying to understand the best way to loop through elements that have the data you want buried in further levels of sub-elements.
Here is an example of the HTML:
<div class="s-item__info clearfix">
<h3 class="s-item__title">The Music Tree Activities Book: Part 1 (Music Tree (Summy)) by Clark, Frances, </h3>
<div class="s-item__subtitle"><span class="SECONDARY_INFO">Pre-Owned</span></div>
<div class="s-item__reviews">
</div>
<div class="s-item__details clearfix">
<div class="s-item__detail s-item__detail--primary"><span class="s-item__price">$3.99</span></div>
<span class="s-item__detail s-item__detail--secondary">
</span>
<div class="s-item__detail s-item__detail--primary"><span class="s-item__purchase-options-with-icon" aria-label="">Buy It Now</span></div>
<div class="s-item__detail s-item__detail--primary"><span class="s-item__shipping s-item__logisticsCost">Free shipping</span></div>
<div class="s-item__detail s-item__detail--primary"><span class="s-item__free-returns s-item__freeReturnsNoFee">Free returns</span></div>
<div class="s-item__detail s-item__detail--primary"></div>
</div>
</div>
There are multiple items, so I started by getting all of them in a list. I can find each title by iterating through, but I'm having an issue getting the price. Example code:
for item in driver.find_elements_by_class_name("s-item__info"):
    title = item.find_element_by_xpath('.//h3')
    print(title.text)
    details = item.find_element_by_xpath('.//span[@class="s-item__price"]')
    print(details.text)
This gets the title of the item, but it can't find the price. If I look outside of the "s-item__info" element and just use the driver, I can get all the prices with the code below. I'm wondering why it can't find the price inside the info element; I would think the details would be a sub-element and .// would look through those.
driver.find_elements_by_class_name("s-item__price")
I have also tried:
find_element_by_xpath('.//div[@class="s-item__detail"]//span[@class="s-item__price"]')
I can grab the data I need, but I want to understand why I can't get the price when I iterate through each item. Thanks.
See if this works:
for item in driver.find_elements_by_class_name("s-item__info"):
    title = item.find_element_by_xpath('.//h3')
    print(title.text)
    details = item.find_element_by_xpath('.//following::div[contains(@class, "s-item__details")]//span[@class="s-item__price"]')
    print(details.text)
OK, there are several problems here:
s-item__info is not the only class name on that element; you should use
//div[contains(@class, 's-item__info')] instead.
The first element matching this class name is not a valid search result.
The simplest approach to make your code work is:
for item in driver.find_elements_by_xpath("//div[contains(@class, 's-item__info')]"):
    title = item.find_elements_by_xpath('.//h3')
    if title:
        print(title[0].text)
    details = item.find_elements_by_xpath('.//span[@class="s-item__price"]')
    if details:
        print(details[0].text)
This prints the data where it exists; the plural find_elements calls return empty lists instead of raising NoSuchElementException when an element is missing.
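The class-name point above can be illustrated with a small hypothetical helper (has_class is not a Selenium API, just an illustration): by-class-name lookups and contains(@class, ...) match one space-separated token, while an exact [@class="..."] comparison tests the whole attribute string.

```python
def has_class(class_attr, name):
    """Match one space-separated token, the way by-class-name lookups do."""
    return name in (class_attr or "").split()

attr = "s-item__info clearfix"
# A by-class-name lookup matches the token...
print(has_class(attr, "s-item__info"))  # True
# ...but a naive exact comparison tests the whole string and misses it:
print(attr == "s-item__info")           # False
```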

Find the elements only after a specific text in html using selenium python

Let's say I have the following HTML code:
<div class="12">
<div class="something"></div>
</div>
<div class="12">
<div class="34">
<span>TODAY</span>
</div>
</div>
<div class="12">
<div class="something"></div>
</div>
<div class="12">
<div class="something"></div>
</div>
Now if I use driver.find_elements_by_class_name("something"), I get all such elements present in the HTML. But I want to get the elements only after a specific word ("TODAY") in the HTML. How do I exclude the elements that appear before that word? The following divs and classes could be at any level.
You can search by XPath as below:
driver.find_elements_by_xpath('//*/text()[.="some specific word"]/following-sibling::div[@class="something"]')
Note that you might need some modifications in case your real HTML differs from the simplified HTML provided.
Update
Replace following-sibling with following if the required div nodes are not siblings:
driver.find_elements_by_xpath('//*/text()[.="some specific word"]/following::div[@class="something"]')
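ElementTree has no following axis, but the effect it produces, filtering elements in document order after a marker, can be sketched with the standard library on a simplified copy of the question's HTML:

```python
import xml.etree.ElementTree as ET

html = """
<root>
  <div class="12"><div class="something">before</div></div>
  <div class="12"><div class="34"><span>TODAY</span></div></div>
  <div class="12"><div class="something">after-1</div></div>
  <div class="12"><div class="something">after-2</div></div>
</root>
"""

root = ET.fromstring(html)
# iter() walks the tree in document order, which is what the `following`
# axis uses: collect class="something" elements only after the marker.
seen_marker = False
hits = []
for el in root.iter():
    if el.text and el.text.strip() == "TODAY":
        seen_marker = True
    elif seen_marker and el.get("class") == "something":
        hits.append(el.text)

print(hits)  # ['after-1', 'after-2'] -- the element before TODAY is skipped
```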

Elements Inside Opening Tag

I am writing a spider to download all images on the front page of a subreddit using Scrapy. To do so, I have to find the links to the images and extract them with a CSS or XPath selector.
Upon inspection, the links are present, but the HTML looks like this for all of them:
<div class="expando expando-uninitialized" style="display: none" data-cachedhtml=" <div class="media-preview" id="media-preview-7lp06p" style="max-width: 861px"> <div class="media-preview-content"> <img class="preview" src="https://i.redditmedia.com/Q-LKAeFelFa9wAdrnvuwCMyXLrs0ULUKMsJTXSf3y34.jpg?w=861&s=69085fb507bed30f1e4228e83e24b6b2" width="861" height="638"> </div> </div> " data-pin-condition="function() {return this.style.display != 'none';}"><span class="error">loading...</span></div>
From what I can tell, all of the new elements are being initialized inside the opening tag of the <div> element. Could you explain what exactly is going on here, and how one would go about extracting image information from this?
*Sorry, I'm not quite sure how to properly format the HTML code, but there really isn't much to format, as it is all one big tag anyway.
How to read the mangled attribute, data-cachedhtml
The HTML is a mess. Try the techniques listed in How to parse invalid (bad / not well-formed) XML? to get viable markup before using XPath. It may take three passes:
Clean up the markup mess.
Get the attribute value of data-cachedhtml.
Use XPath to extract the image links.
XPath part
For the de-mangled data-cachedhtml in this form:
<div class="media-preview" id="media-preview-7lp06p" style="max-width: 861px">
<div class="media-preview-content">
<a href="https://i.redd.it/29moua43so501.jpg" class="may-blank">
<img class="preview" src="https://i.redditmedia.com/elided"
width="861" height="638"/>
</a>
</div>
<span class="error">loading...</span>
</div>
This XPath will retrieve the preview image links:
//a/img/@src
(That is, all src attributes of img element children of a elements.)
or
This XPath will retrieve the click-through image links:
//a[img]/@href
(That is, all href attributes of the a elements that have an img child.)
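Both expressions can be checked against the de-mangled snippet with the standard library; ElementTree supports this small XPath subset, with attribute values then read via .get():

```python
import xml.etree.ElementTree as ET

# The de-mangled data-cachedhtml value from the answer above.
cached = """
<div class="media-preview" id="media-preview-7lp06p" style="max-width: 861px">
  <div class="media-preview-content">
    <a href="https://i.redd.it/29moua43so501.jpg" class="may-blank">
      <img class="preview" src="https://i.redditmedia.com/elided"
           width="861" height="638"/>
    </a>
  </div>
  <span class="error">loading...</span>
</div>
"""

root = ET.fromstring(cached)
# //a/img/@src: src attributes of img children of a elements.
previews = [img.get("src") for img in root.findall(".//a/img")]
# //a[img]/@href: href attributes of a elements that have an img child.
clickthroughs = [a.get("href") for a in root.findall(".//a[img]")]
print(previews)       # ['https://i.redditmedia.com/elided']
print(clickthroughs)  # ['https://i.redd.it/29moua43so501.jpg']
```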

scrapy scrape html source code

I'm using Scrapy to crawl and scrape a website. I need the whole HTML instead of individual components. We can easily extract a component using XPath selectors, but is there any method to extract the whole HTML block for a given class? For example, in the HTML below, I need the exact HTML source of the whole div block prod-basic-info. Is there any way I can do this?
<div class="block prod-basic-info">
<h2>Product information</h2>
<p class="product-info-label">Category</p>
<p>
<a href="xyz.html"></a>
</p>
</div>
Just point your XPath expression or CSS selector at the element and extract() it:
response.xpath('//div[contains(@class, "prod-basic-info")]').extract()[0]
response.css('div.prod-basic-info').extract()[0]
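The same whole-block extraction can be sketched with the standard library, serializing the matched element back to markup (a stand-in for what .extract() returns for an element match in Scrapy):

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the question's HTML.
html = """
<div class="block prod-basic-info">
  <h2>Product information</h2>
  <p class="product-info-label">Category</p>
  <p><a href="xyz.html"></a></p>
</div>
"""

div = ET.fromstring(html)
# Serializing the matched element yields the full block, tags included.
raw = ET.tostring(div, encoding="unicode")
print("<h2>Product information</h2>" in raw)                  # True
print(raw.startswith('<div class="block prod-basic-info">'))  # True
```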

Using scrapy to get the "Next Page" data

I need to grab a commodity website's review data, but its user data is paged. There are 10 comments per page and about 100 pages. How can I crawl all of them?
My intention is to use yield and the Request method to follow the "Next Page" link, and then use XPath to extract the data. But I can't jump to the next page to extract the data.
Here is the Html code about the "Next Page" link:
<div class="xs-pagebar clearfix">
<div class="Pagecon">
<div class="Pagenum">
<a class="pre-page pre-disable"></a>
<a class="pre-page pre-disable"></a>
<span class="curpage">1</span>
<a href="#">2</a>
<a href="#">3</a>
<span class="elli">...</span>
<a href="#">Next Page</a>
<a href="#">Final Page</a>
</div>
</div>
</div>
What does href="#" mean, exactly?
Unfortunately you will not be able to do this with Scrapy alone. href="#" is an anchor that links nowhere; it just makes the element look like a link. What really happens is that a JavaScript onclick handler is executed. You will need a way to execute that JavaScript for your use case; you might want to look into Splinter to do this.
