How to use scrapy for Amazon.com links after "Next" Button? - python

I am relatively new to Python and Scrapy. I'm trying to scrape the links in "Customers who bought this item also bought".
For example: http://www.amazon.com/Confessions-Economic-Hit-John-Perkins-ebook/dp/B001AFF266/. There are 17 pages of "Customers who bought this item also bought". If I ask Scrapy to scrape that URL, it only scrapes the first page (6 items). How do I ask Scrapy to press the "Next" button to scrape all the items across the 17 pages? A sample code snippet (just the part that matters in the crawler.py) would be greatly appreciated. Thank you for your time!
OK, here is my code. As I said, I am new to Python, so the code might look quite stupid, but it works to scrape the first page (6 items). I work mostly with Fortran or Matlab. I would love to learn Python systematically if I have time, though.
# Code of my crawler.py:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from beta.items import BetaItem

class AlphaSpider(CrawlSpider):
    name = 'alpha'
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/s/ref=lp_4366_nr_p_n_publication_date_0?rh=n%3A283155%2Cn%3A%211000%2Cn%3A4366%2Cp_n_publication_date%3A1250226011&bbn=4366&ie=UTF8&qid=1384729756&rnid=1250225011']

    rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//h3/a',)), callback='parse_item'), )

    def parse_item(self, response):
        sel = Selector(response)
        stuff = BetaItem()

        # ISBN-10, e.g. "ISBN-10: 0452287081"
        isbn10R = sel.xpath('//li[b[contains(text(),"ISBN-10:")]]/text()').extract()
        isbn10 = []
        if len(isbn10R) > 0:
            isbn10 = [(isbn10R[0].split(' '))[1]]
        stuff['isbn10'] = isbn10

        # Average star rating, e.g. "4.5 out of 5 stars"
        starsR = sel.xpath('//div[contains(@id,"averageCustomerReviews")]/span/@title').extract()
        stars = []
        if len(starsR) > 0:
            stars = [(starsR[0].split(' '))[0]]
        stuff['stars'] = stars

        # Number of customer reviews
        reviewsR = sel.xpath('//div[contains(@id,"averageCustomerReviews")]/a[contains(@href,"showViewpoints=1")]/text()').extract()
        reviews = []
        if len(reviewsR) > 0:
            reviews = [(reviewsR[0].split(' '))[0]]
        stuff['reviews'] = reviews

        # ASINs of the "Customers who bought this item also bought" products
        copsR = sel.xpath('//a[@class="sim-img-title"]/@href').extract()
        ncops = len(copsR)
        cops = [None] * ncops
        if ncops > 0:
            for idx, cop in enumerate(copsR):
                cops[idx] = ((cop.split('dp/'))[1].split('/ref'))[0]
        stuff['cops'] = cops

        return stuff

So I understand you were able to scrape these "Customers Who Bought This Item Also Bought" product details. As you probably saw, they sit in a ul inside a div with class "shoveler-content":
<div id="purchaseButtonWrapper" class="shoveler-button-wrapper">
<a class="back-button" onclick="return false;" style="" href="#Back">
<div class="shoveler-content">
<ul tabindex="-1">
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">
<div id="purchase_B003LSTK8G" class="new-faceout p13nimp" data-ref="pd_sim_kstore_1" data-asin="B003LSTK8G">
...
</div>
</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
</ul>
</div>
<a class="next-button" onclick="return false;" style="" href="#Next">
<span class="auiTestSprite s_shvlNext">...</span>
</a>
</div>
</div>
If you inspect your browser's network activity (via Firebug or Chrome's DevTools) while clicking the "next" button for more suggested products, you'll see an AJAX query to this sort of URL:
http://www.amazon.com
/gp/product/features/similarities/shoveler/cell-render.html/ref=pd_sim_kstore?
id=B00261OOWQ,B003XQEVUI,B001NLL5WC,B000FC1KZC,B005G5PPGS,B0043RSJB8,
B004TSBWYC,B000RH0C8G,B0035IID08,B002AQRVXQ,B005DIAUN6,B000FC10QG
&pos=7&refTag=pd_sim_kstore&wdg=ebooks_display_on_website
&shovelerName=purchase
(I'm using this product page: http://www.amazon.com/Boomerang-Travels-New-Third-World-ebook/dp/B005CRQ2OE)
The id query argument holds a list of ASINs, identifying the next suggested products. Why 12 ASINs for only 6 displayed? Probably some in-page caching for the next "next" click a user is likely to make.
What do you get back from this AJAX query? Still within your browser's inspect tool, you'll see the response is of type application/json, and the response data is a JSON array of 12 elements, each element being an HTML snippet similar to:
<div class="new-faceout p13nimp" id="purchase_B00261OOWQ" data-asin="B00261OOWQ" data-ref="pd_sim_kstore_7">
<a href="/Home-Game-Accidental-Guide-Fatherhood-ebook/dp/B00261OOWQ/ref=pd_sim_kstore_7" class="sim-img-title" >
<div class="product-image">
<img src="http://ecx.images-amazon.com/images/I/51ZBpvGgsUL._SL500_PIsitb-sticker-arrow-big,TopRight,35,-73_OU01_SS100_.jpg" width="100" alt="" height="100" border="0" />
</div> Home Game: An Accidental Guide to Fatherhood
</a>
<div class="byline">
<span class="carat">&#8250</span>
Michael Lewis
</div>
<div class="rating-price">
<span class="rating-stars">
<span class="crAvgStars" style="white-space:no-wrap;">
<span class="asinReviewsSummary" name="B00261OOWQ">
<a href="http://www.amazon.com/Home-Game-Accidental-Guide-Fatherhood-ebook/product-reviews/B00261OOWQ/ref=pd_sim_kstore_cm_cr_acr_img_7">
<span class="auiTestSprite s_star_4_0 " title="4.1 out of 5 stars" >
<span>4.1 out of 5 stars</span>
</span>
</a>
</span>
(99)
</span>
</span>
</div>
<div class="binding-platform"> Kindle Edition </div>
<div class="pricetext"><span class="price" style="margin-right:5px">$11.36</span></div>
</div>
So you basically get back what was in the original page's suggested-products section, i.e. the content of each <li> from <div class="shoveler-content"><ul>.
But how do you get the ASIN codes to pass in the AJAX query's id parameter?
Well, on the product page, you'll notice this section:
<div id="purchaseSimsData"
class="sims-data" style="display:none"
data-baseAsin="B005CRQ2OE" data-featureId="pd_sim"
data-pageId="B005CRQ2OEr_sim_2" data-reftag="pd_sim_kstore"
data-wdg="ebooks_display_on_website" data-widgetName="purchase">
B003LSTK8G,B000VKVZR6,B003E20ZRY,B000RH0C9A,B000RH0CA4,B000YMDQRS,
B00261OOWQ,B003XQEVUI,B001NLL5WC,B000FC1KZC,B005G5PPGS,B0043RSJB8,
B004TSBWYC,B000RH0C8G,B0035IID08,B002AQRVXQ,B005DIAUN6,B000FC10QG,
B0018QQQKS,B002OTKEP6,B005PUWUKS,B007V65R54,B00B3VOTTI,B004EYT932,
B002UBRFFU,B000WJSB50,B000RH0DYE,B004JXXKWY,B003E8AJXI,B008TRU7PE,
B00555X8OA,B007OSIOWM,B00DLJIA54,B00139XTG4,B0058Z4NR8,B00ALBR6JG,
B004H0M8QS,B003F3PL7Q,B008UX8YPC,B000U913GG,B003HOXLVQ,B000VWM0MI,
B000SEIU28,B006VE7YS0,B008KPMBIG,B003CIQ57E,B0064EHZY0,B008UX3ITE,
B001NLKY38,B003VIWK4C,B005GSYZRA,B007YGGOVM,B004H4X84K,B00B5ZQ72Y,
B000R1BAH4,B008W02TIG,B000W8HC8I,B0036QVOKU,B000VRBBDC,B00APDGFOC,
B00EOAS0EK,B000QCS888,B001QIGZEK,B0074B55IK,B000FC12C8,B00AP2XVJ0,
B000FCK5YE,B006ID6UAW,B001FA0W5W,B005HFI0X2,B006ZOYM9K,B003SNJZ3Y,
B00C1N5WOI,B008EKORIY,B00C4GRK4W,B004V3WRNU,B00BV6RTUG,B001AFF266,
B00DUM1W3E,B00APDGGCS,B008WOUFIS,B008EKOO46,B008JHXO6S,B005AJM3U6,
B00BKRW6GI,B00CDUVSQ0,B00A287PG2,B009H679WA,B000VDUWMC,B009NF6IRW
</div>
which looks like the full list of suggested-product ASINs.
Therefore, I suggest you emulate the successive AJAX queries to fetch the suggested products 12 ASINs at a time, decode each response with the json package, and then parse each HTML snippet to extract the product info you want.
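Here is a minimal sketch of that approach, meant to slot into your existing spider (it reuses your BetaItem import; the URL template and the pos numbering are copied from the captured request above and are assumptions that may need adjusting):

import json
from scrapy.http import Request
from scrapy.selector import Selector

# AJAX endpoint observed in the browser; treat the query parameters as assumptions
SIMS_URL = ('http://www.amazon.com/gp/product/features/similarities/shoveler/'
            'cell-render.html/ref=pd_sim_kstore?id=%s&pos=%d'
            '&refTag=pd_sim_kstore&wdg=ebooks_display_on_website'
            '&shovelerName=purchase')

    def parse_item(self, response):
        sel = Selector(response)
        # The hidden sims-data div lists every suggested-product ASIN
        data = sel.xpath('//div[@id="purchaseSimsData"]/text()').extract()
        if data:
            asins = data[0].strip().split(',')
            # Emulate the page's own AJAX calls, 12 ASINs at a time
            for pos in range(0, len(asins), 12):
                chunk = ','.join(asins[pos:pos + 12])
                yield Request(SIMS_URL % (chunk, pos + 1),
                              callback=self.parse_similar)

    def parse_similar(self, response):
        # The response body is a JSON array of HTML snippets, one per product
        for snippet in json.loads(response.body):
            cell = Selector(text=snippet)
            stuff = BetaItem()
            href = cell.xpath('//a[@class="sim-img-title"]/@href').extract()
            if href:
                stuff['cops'] = [((href[0].split('dp/'))[1].split('/ref'))[0]]
            yield stuff

Note that parse_item becomes a generator here, so anything your current code produced with return would now be produced with yield as well.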

I would recommend you avoid Scrapy, especially since you're a beginner.
Use the excellent Requests module for downloading pages:
https://github.com/kennethreitz/requests
and BeautifulSoup for parsing them:
http://www.crummy.com/software/BeautifulSoup/.
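For instance, a minimal sketch of that combination applied to the problem above (the product URL and the purchaseSimsData id come from the earlier discussion; verify them against the live page):

import requests
from bs4 import BeautifulSoup

# Product page used as the example earlier
page = requests.get('http://www.amazon.com/dp/B005CRQ2OE')
soup = BeautifulSoup(page.text, 'html.parser')

# Read the hidden div listing all suggested-product ASINs
sims = soup.find('div', id='purchaseSimsData')
if sims is not None:
    asins = sims.get_text(strip=True).split(',')
    print(asins)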

Related

Any way to only extract a specific div from Beautiful Soup

I have run into an issue while working on a web scraping project in Python. I am new to Python and am not sure how to extract a specific line, or a value from part of a line, from the Beautiful Soup output. I would like to get only the data-rarity part from this site, but I haven't found how to do that without removing the entire line from the list.
Any help is much appreciated!
I have this:
rarity = soup.find_all('div', {'class': 'profileCards__card'})
print(rarity[0])
This outputs:
<div class="profileCards__card upgrade " data-level="902" data-elixir="2" data-rarity="102" data-arena="802">
<img src="//cdn.statsroyale.com/images/cards/full/snowball.png"><span class="profileCards__level">lvl.9</span>
<div class="profileCards__meter">
<span style="width: 100%"></span>
<div class="profileCards__meter__numbers">
8049/800
</div>
</div>
<div class="ui__tooltip ui__tooltipTop ui__tooltipMiddle cards__tooltip">
Giant Snowball
</div>
</div>
Ideally, I want only the value after data-rarity, so just the 102 part, from this markup in the site's inspect element:
<div class="profileCards__cards">
<div class="profileCards__card upgrade " data-level="902" data-elixir="2" data-rarity="102" data-arena="802">
<img src="//cdn.statsroyale.com/images/cards/full/snowball.png"><span class="profileCards__level">lvl.9</span>
<div class="profileCards__meter">
<span style="width: 100%"></span>
<div class="profileCards__meter__numbers">
8049/800
</div>
</div>
<div class="ui__tooltip ui__tooltipTop ui__tooltipMiddle cards__tooltip">
Giant Snowball
</div>
</div>
Use:
rarity = soup.find_all('div', {'class': 'profileCards__card'})
for r in rarity:
    print(r['data-rarity'])
Each element of rarity is already the div carrying the attribute, so you can index it directly.
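If you only want the first such value, an attribute CSS selector is a compact alternative (a sketch assuming a reasonably recent BeautifulSoup, whose select()/select_one() support attribute selectors):

# Grab the first card that actually carries a data-rarity attribute
card = soup.select_one('div.profileCards__card[data-rarity]')
if card is not None:
    print(card['data-rarity'])  # 102 for the card shown above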

Python Beautiful Soup to scrape URLs from a web page

I am trying to scrape URLs from an HTML page using Beautiful Soup. Here's a part of the HTML:
<li style="display: block;">
<article itemscope itemtype="http://schema.org/Article">
<div class="col-md-3 col-sm-3 col-xs-12" >
<a href="/stroke?p=3083" class="article-image">
<img itemprop="image" src="/FileUploads/Post/3083.jpg?w=300&h=160&mode=crop" alt="Banana" title="Good for health">
</a>
</div>
<div class="col-md-9 col-sm-9 col-xs-12">
<div class="article-content">
<a href="/stroke">
<img src="/assets/home/v2016/img/icon/stroke.png" style="float:left;margin-right:5px;width: 4%;">
</a>
<a href="/stroke?p=3083" class="article-title">
<div>
<h4 itemprop="name" id="playground">
Banana Good for health </h4>
</div>
</a>
<div>
<div class="clear"></div>
<span itemprop="dateCreated" style="font-size:10pt;color:#777;">
<i class="fa fa-clock-o" aria-hidden="true"></i>
09/10 </span>
</div>
<p itemprop="description" class="hidden-phone">
<a href="/stroke?p=3083">
I love Banana.
</a>
</p>
</div>
</div>
</article>
</li>
My code:
import requests
from bs4 import BeautifulSoup

res = requests.get('http://xxxxxx')
bs = BeautifulSoup(res.text, "html.parser")
for link in bs.find_all('a'):
    if link.has_attr('href'):
        print(link.attrs['href'])
The result prints out all the URLs on this page, but this is not what I am looking for; I only want a particular one, like "/stroke?p=3083" in this example. How can I set that condition in Python? (I know there are three occurrences of "/stroke?p=3083" here, but I just need one.)
Another question: this URL is not complete. I need to combine it with "http://www.abcde.com" so the result is "http://www.abcde.com/stroke?p=3083". I know I can use paste in R, but how do I do this in Python? Thanks in advance! :)
Just put a link into the scraper below, replacing some_link, and give it a go. You should get your desired link in its full form.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

res = requests.get(some_link).text
soup = BeautifulSoup(res, "lxml")
for item in soup.select(".article-image"):
    print(urljoin(some_link, item['href']))
For the second question, combining the relative URL with the site root is simple string concatenation:
link = 'http://www.abcde.com' + link
You are getting most of it right already. Collect the hrefs as follows (just a list-comprehension version of what you are doing already):
urls = [a['href'] for a in bs.find_all('a') if a.has_attr('href')]
This gives you the URLs. To take one of them and prepend the abcde URL, you can simply do:
if urls:
    new_url = 'http://www.abcde.com{}'.format(urls[0])

Remove HTML after some point in Beautiful Soup

I have a problem. My aim is to parse the data up to a certain point and then stop parsing.
<span itemprop="address">
Some address
</span>
<i class="fa fa-signal">
</i>
...
</p>
</div>
</div>
<div class="search_pagination" id="pagination">
<ul class="pagination">
</ul>
</div>
</div>
</div>
</div>
<div class="col-sm-3">
<div class="panel" itemscope="" itemtype="http://schema.org/WPSideBar">
<h2 class="heading_a" itemprop="name">
Top-10 today
</h2> <!-- a lot of tags after this point -->
I want to get all the values from <span itemprop="address"> (there are a lot of them) that appear before the "Top-10 today" heading.
You can actually let BeautifulSoup parse only the tags you are interested in via SoupStrainer:
from bs4 import BeautifulSoup, SoupStrainer
only_addresses = SoupStrainer("span", itemprop="address")
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_addresses)
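The matching address strings are then simply the text of the parsed tags (assuming the markup shown above):

addresses = [span.get_text(strip=True) for span in soup.find_all("span", itemprop="address")]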
If, though, some "addresses" come before the "Top-10 today" heading and some after, and you are only interested in those coming before it, you can make a custom searching function:
def search_addresses(tag):
    return tag.name == "span" and tag.get("itemprop") == "address" and \
           tag.find_next("h2", text=lambda text: text and "Top-10 today" in text)

addresses = soup.find_all(search_addresses)
It does not look trivial, but the idea is simple: we use find_next() for every "address" to check that the "Top-10 today" heading exists somewhere after it.

Web scraping with Beautiful Soup: multiple tags

I'm trying to get multiple addresses from a web page with an A-to-Z list of links.
First I get the A-to-Z links with:
URL = "http://www.example.com"
html = urlopen(URL).read()
soup = BeautifulSoup(html, "lxml")
content = soup.find("div", "view-content")
links = [BASE_URL + li.a["href"] for li in content.findAll("li")]
This works great and in links above I have a list of links to each individual web page with multiple addresses on each separate page.
For getting the addresses I need, I used:
for item in links[0:5]:
    try:
        htmlss = urlopen(item).read()
        soup = BeautifulSoup(htmlss, "lxml")
        titl = soup.find('div', 'views-field-title').a.contents
        add = soup.find('div', 'views-field-address').span.contents
        zipp = soup.find('div', 'views-field-city-state-zip').span.contents
    except AttributeError:
        continue
The above code takes each link and gets the first address on the page of all the A's, the first address on the page of all the B's, and so on.
My problem is that some of the pages contain multiple addresses, and the code above only retrieves the first address on each page, i.e. the first A address, the first B address, and so on.
I've tried using soup.findAll, but it doesn't work with .a.contents or .span.contents.
Basically, I need to find the address lines in HTML pages with non-unique tags. If I use soup.findAll on, say, (div, views-field-title), I get a lot of content I don't need.
Example of some html:
<div class="views-field-nothing-1"></div>
<div class="views-field-nothing">
<span class="field-content">
<div class="views-field-title">
<span class="field-content">
<a href="/golf-courses/details/ca/alameda/chuck-corica-golf-complex-earl-fry">
Chuck Corica Golf Complex, Earl Fry
</a>
</span>
</div>
<div class="views-field-address">
<span class="field-content"></span>
</div>
<div class="views-field-city-state-zip">
<span class="field-content">
Alameda, California 94502-6502
</span>
</div>
</span>
</div>
<div class="views-field-value"></div>
<div class="views-field-nothing-1"></div>
<div class="views-field-nothing">
<span class="field-content">
<div class="views-field-title">
<span class="field-content">
<a href="/golf-courses/details/ca/alameda/chuck-corica-golf-complex-jack-clark">
Chuck Corica Golf Complex, Jack Clark
</a>
</span>
</div>
<div class="views-field-address">
<span class="field-content">
1 Clubhouse Memorial Rd
<br></br>
</span>
</div>
<div class="views-field-city-state-zip">
<span class="field-content">
Alameda, California 94502-6502
</span>
</div>
</span>
</div>
This is just a sample of the similar HTML I need to find data for. Thanks
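One way to handle the non-unique tags (a sketch, assuming every address block sits in its own div with class "views-field-nothing", as in the sample above, and reusing the links list from the first snippet):

from urllib.request import urlopen
from bs4 import BeautifulSoup

for item in links[0:5]:
    soup = BeautifulSoup(urlopen(item).read(), "lxml")
    # One wrapper div per address block, so no address on the page is skipped
    for block in soup.find_all('div', 'views-field-nothing'):
        titl = block.find('div', 'views-field-title')
        add = block.find('div', 'views-field-address')
        zipp = block.find('div', 'views-field-city-state-zip')
        if titl and titl.a:
            print(titl.a.get_text(strip=True))
        if add and add.span:
            print(add.span.get_text(strip=True))
        if zipp and zipp.span:
            print(zipp.span.get_text(strip=True))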

How to extract text content from a website using Beautiful Soup's select() and specific CSS selectors

I'm learning to extract content from a website using Python and BeautifulSoup.
This is the HTML structure:
<div id="preview-prediction" class="two-cols rc-b rc-r">
<span style="position: absolute; top: 0.5em; left: 1em; color: #808080;">Prediction: </span>
<div class="home">
<div class="team-name">
<img src="http://164.177.157.12/img/teams/13.png" class="team-emblem">
Arsenal
</div>
<span class="predicted-score">2</span>
<div class="clear"></div>
</div>
<div class="away">
<span class="predicted-score">1</span>
<div class="team-name">
Liverpool
<img src="http://164.177.157.12/img/teams/26.png" class="team-emblem">
</div>
<div class="clear"></div>
</div>
</div>
I want to extract the exact text from a specific tag on the page. I cannot use find_all() or find() easily, as the page has this complex structure, so I'm using the select() function with a CSS selector:
soup.select("#preview-prediction > .home > .team-name > .team-link")
The team-link class contains the text I need to extract. How do I perform this task?
This creates a list of the text contents of all the selected tags:
>>> [i.text for i in soup.select('#preview-prediction > .home > .team-name > .team-link')]
['Arsenal']
OR
This prints the text of the first selected tag:
>>> soup.select('#preview-prediction > .home > .team-name > .team-link')[0].text
'Arsenal'
