Scrapy Splash missing elements - python

I've written a spider to crawl boardgamegeek.com/browse/boardgame for information about the board games in the list.
My problem is that two specific selectors in my code don't always return a result: sometimes they return a selector object, other times they don't. Inspecting the response during debugging shows that the dynamically loaded elements those selectors target don't exist in the returned HTML.
My two offending lines:
bggspider.py
bg['txt_cnt'] = response.xpath(
    selector_paths.SEL_TXT_REVIEWS).extract_first()
bg['vid_cnt'] = response.xpath(
    selector_paths.SEL_VID_REVIEWS).extract_first()
Where the selectors are defined as
selector_paths.py
SEL_TXT_REVIEWS = '//div[@class="panel-inline-links"]/a[contains(text(), "All Text Reviews")]/text()'
SEL_VID_REVIEWS = '//div[@class="panel-inline-links"]/a[contains(text(), "All Video Reviews")]/text()'
After the bg item is yielded, its attributes are processed in the pipeline, where a check is performed since many board games have very little information in various parts of the page.
pipelines.py
if item['txt_cnt']:
    item['txt_cnt'] = int(re.findall(r'\d+', item['txt_cnt'])[0])
else:
    item['txt_cnt'] = 0

if item['vid_cnt']:
    item['vid_cnt'] = int(re.findall(r'\d+', item['vid_cnt'])[0])
else:
    item['vid_cnt'] = 0
The aim of the field processing is just to grab the numerical value from the string, which is the number of text or video reviews for a board game.
I'm assuming I'm missing something to do with Splash, since I'm getting selector results for some/most queries but still missing many.
I am running the Scrapy-Splash Docker container locally at localhost:8050.
Code for the spider can be found here: BGGSpider on GitHub.
Any help or information about how to remedy this problem, or about how Scrapy-Splash works, would be appreciated.
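One likely cause is that Splash snapshots the page before the review links have finished rendering. A minimal sketch of asking Splash to wait longer (assuming the scrapy-splash downloader middlewares are already enabled in settings.py; the 2-second wait value is illustrative):
from scrapy_splash import SplashRequest

# inside the spider class:
def start_requests(self):
    for url in self.start_urls:
        # Give the page's JavaScript time to run before Splash
        # snapshots the HTML.
        yield SplashRequest(url, self.parse, args={'wait': 2.0})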

Related

Python web scraping recursively (next page)

From this website:
https://search2.ucl.ac.uk/s/search.html?query=max&collection=website-meta&profile=_directory&tab=directory&f.Profile+Type%7Cg=Student&start_rank=1
I need to scrape the next pages (2, 3, ...) using Selenium or lxml.
I can only scrape the first page.
You can try this:
nextNumberIsThere = True
i = 2  # page 1 is already loaded, so the first link to click is "2"
while nextNumberIsThere:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    profileDetails = driver.find_elements_by_xpath("//ul[@class='profile-details']/li")
    for element in profileDetails:
        print(element.text)
    next_page = driver.find_elements_by_xpath("//a[text()='" + str(i) + "']")
    i += 1
    if len(next_page) > 0:
        next_page[0].click()
    else:
        nextNumberIsThere = False
The above code will iterate and fetch the data until there are no page numbers left.
If you want to fetch the name, department, and email separately, then try the code below:
nextNumberIsThere = True
i = 2  # start with the link to page 2, as above
while nextNumberIsThere:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    profileDetails = driver.find_elements_by_xpath("//ul[@class='profile-details']")
    for element in profileDetails:
        name = element.find_element_by_xpath("./li[@class='fn']")
        department = element.find_element_by_xpath("./li[@class='org']")
        email = element.find_element_by_xpath("./li[@class='email']")
        print(name.text)
        print(department.text)
        print(email.text)
        print("------------------------------")
    next_page = driver.find_elements_by_xpath("//a[text()='" + str(i) + "']")
    i += 1
    if len(next_page) > 0:
        next_page[0].click()
    else:
        nextNumberIsThere = False
I hope it helps...
Change start_rank in the URL. For example:
https://search2.ucl.ac.uk/s/search.html?query=max&collection=website-meta&profile=_directory&tab=directory&f.Profile+Type%7Cg=Student&start_rank=11
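A minimal sketch of that approach with requests and lxml (assuming 10 results per page and that an empty result list marks the last page):
import requests
from lxml import html

BASE = ("https://search2.ucl.ac.uk/s/search.html?query=max"
        "&collection=website-meta&profile=_directory&tab=directory"
        "&f.Profile+Type%7Cg=Student&start_rank={}")

rank = 1
while True:
    tree = html.fromstring(requests.get(BASE.format(rank)).content)
    rows = tree.xpath("//ul[@class='profile-details']/li")
    if not rows:  # no profiles on this page: we've run out of pages
        break
    for row in rows:
        print(row.text_content().strip())
    rank += 10  # each page holds 10 results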
The usual solution to this kind of problem is not to use a loop that iterates through "all the pages" (because you don't know how many there are up-front), but rather have some kind of queue, where scraping one page optionally adds subsequent pages to the queue, to be scraped later.
In your specific example, during the scraping of each page you could look for the link to "next page" and, if it's there, add the next page's URL to the queue, so it will be scraped following the current page; once you hit a page with no "next page" link, the queue will empty and scraping will stop.
A more complex example might include scraping a category page and adding each of its sub-categories as a subsequent page to the scraping queue, each of which might in turn add multiple item pages to the queue, etc.
Take a look at scraping frameworks like Scrapy, which build this kind of functionality into their design. You might find some of its other features useful as well, e.g. its ability to find elements on the page using either XPath or CSS selectors.
The first example on the Scrapy homepage shows exactly the kind of functionality you're trying to implement:
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('.post-header>h2'):
            yield {'title': title.css('a ::text').get()}
        for next_page in response.css('a.next-posts-link'):
            yield response.follow(next_page, self.parse)
One important note about Scrapy: it doesn't use Selenium (at least not out-of-the-box), but rather downloads the page source and parses it. This means that it doesn't run JavaScript, which might be an issue for you if the website you're scraping is client-generated. In that case, you could look into solutions that combine Scrapy and Selenium (quick googling shows a bunch of them, as well as StackOverflow answers regarding this problem), or you could stick to your Selenium scraping code and implement a queuing mechanism yourself, without Scrapy.
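For that last option, a minimal sketch of a hand-rolled queue around Selenium (the start URL and the "next" link selector are illustrative):
from collections import deque
from selenium import webdriver

driver = webdriver.Firefox()
queue = deque(["https://example.com/page1"])  # illustrative start page
seen = set()

while queue:
    url = queue.popleft()
    if url in seen:
        continue
    seen.add(url)
    driver.get(url)
    # ... extract data from the current page here ...
    for link in driver.find_elements_by_xpath("//a[text()='next']"):
        queue.append(link.get_attribute("href"))  # enqueue follow-up pages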

Scrapy spider scraping data from links randomly, why?

First I grabbed all the coin links from the website and issued requests to them, but Scrapy doesn't request them serially from the link list. It scrapes the data from those links successfully, but when saving to a CSV file it makes a blank row after every successfully scraped item. (Result screenshot.)
I expected it to request the links serially and not produce any blank rows. How can I achieve that?
I am using Python 3.6 and Scrapy 1.5.1.
My code:
import scrapy

class MarketSpider(scrapy.Spider):
    name = 'market'
    allowed_domains = ['coinmarketcap.com']
    start_urls = ['http://coinmarketcap.com/']

    def parse(self, response):
        Coin = response.xpath('//*[@class="currency-name-container link-secondary"]/@href').extract()
        for link in Coin:
            absolute_url = response.urljoin(link)
            yield scrapy.Request(absolute_url, callback=self.website_link)

    def website_link(self, response):
        link = response.xpath('//*[@class="list-unstyled details-panel-item--links"]/li[2]/a/@href').extract()
        name = response.xpath('normalize-space(//h1)').extract()
        yield {'Name': name, 'Link': link}
I think Scrapy is visiting pages in a multi-threaded (producer/consumer) fashion, which would explain the non-sequential order of your results.
To verify this hypothesis, you could change your configuration to use a single thread.
As for the blank rows: are you sure none of your name or link values contains a \n?
Scrapy is an asynchronous framework - multiple requests are executed concurrently and the responses are parsed as they are received.
The only way to reliably control which responses are parsed first is to turn this feature off, e.g. by setting CONCURRENT_REQUESTS to 1.
This would make your spider less efficient though, and this kind of control of the parse order is rarely necessary, so I would avoid it if possible.
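If you do need strictly sequential requests, a minimal sketch using the spider's custom_settings (CONCURRENT_REQUESTS is a standard Scrapy setting):
class MarketSpider(scrapy.Spider):
    name = 'market'
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,  # only one request in flight at a time
    }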
The extra newlines in CSV exports on Windows are a known issue and will be fixed in the next Scrapy release.

Spider error URL processing

I'm getting an error while processing a URL with Scrapy 1.5.0 and Python 2.7.14.
class GoodWillOutSpider(Spider):
    name = "GoodWillOutSpider"
    allowed_domains = ["thegoodwillout.com"]
    start_urls = [GoodWillOutURL]

    def __init__(self):
        logging.critical("GoodWillOut STARTED.")

    def parse(self, response):
        products = Selector(response).xpath('//div[@id="elasticsearch-results-container"]/ul[@class="product-list clearfix"]')
        for product in products:
            item = GoodWillOutItem()
            item['name'] = product.xpath('//div[@class="name ng-binding"]').extract()[0]
            item['link'] = "www.thegoodwillout.com" + product.xpath('//@href').extract()[0]
            # item['image'] = "http:" + product.xpath("/div[@class='catalogue-product-cover']/a[@class='catalogue-product-cover-image']/img/@src").extract()[0]
            # item['size'] = '**NOT SUPPORTED YET**'
            yield item
        yield Request(GoodWillOutURL, callback=self.parse, dont_filter=True, priority=16)
This is my class GoodWillOutSpider, and this is the error I get:
[scrapy.core.scraper] ERROR: Spider error processing <GET https://www.thegoodwillout.com/footwear> (referer: None)
  line 1085, in parse
    item['name'] = product.xpath('//div[@class="name ng-binding"]').extract()[0]
IndexError: list index out of range
And for the future, I'd like to know how I can find the correct XPath for any site without having to ask here again.
The problem
If your scraper can't access data that you can see using your browser's developer tools, it is not seeing the same data as your browser.
This can mean one of two things:
Your scraper is being recognized as such and served different content
Some of the content is generated dynamically (usually through javascript)
The generic solution
The most straightforward way of getting around both of these problems is to use an actual browser.
There are many headless browsers available, and you can choose the best one for your needs.
For scrapy, scrapy-splash is probably the simplest option.
More specialized solutions
Sometimes, you can figure out what the reason for this different behavior is, and change your code.
This will usually be the more efficient solution, but might require significantly more work on your part.
For example, if your scraper is getting redirected, it is possible that you just need to use a different user agent string, pass some additional headers, or slow down your requests.
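For example, a sketch of passing browser-like headers on a Scrapy request (the header values are illustrative):
yield scrapy.Request(
    url,
    headers={
        # Illustrative values: pretend to be a regular browser.
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)',
        'Accept-Language': 'en-US,en;q=0.9',
    },
    callback=self.parse,
)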
If the content is generated by javascript, you might be able to look at the page source (response.text or view source in a browser), and figure out what is going on.
After that, there are two possibilities:
Extract the data in an alternate way (like gangabass did for your previous question)
Replicate what the javascript is doing in your spider code (such as making additional requests, like in the current example)
IndexError: list index out of range
You need to check whether the list has any values after extracting:
item['name'] = product.xpath('//div[@class="name ng-binding"]').extract()
if item['name']:
    item['name'] = item['name'][0]
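Equivalently, extract_first() returns None instead of raising an IndexError when nothing matches:
item['name'] = product.xpath('//div[@class="name ng-binding"]').extract_first()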

Scrapy crawled pages, but scraped 0 items

I am fairly new to Python, Scrapy and this board, so please bear with me, as I try to illustrate my problem.
My goal is to collect the names (and possibly prices) of all available hotels in Berlin on booking.com for a specific date (see for example the predefined start_url) with the help of Scrapy.
I think the crucial parts are:
I want to paginate through all next pages until the end.
On each page I want to collect the name of every hotel and the name should be saved respectively.
If I run "scrapy runspider bookingspider.py -o items.csv -t csv" for my code below, the terminal shows me that it crawls through all available pages, but in the end I only get an empty items.csv.
Step 1 seems to work, as the terminal shows succeeding URLs being crawled (e.g. [...]offset=15, then [...]offset=30). Therefore I think my problem is step 2.
For step 2 one needs to define a container or block in which each hotel's information is contained separately and which can serve as the basis for a loop, right?
I picked div class="sr_item_content sr_item_content_slider_wrapper", since every hotel block has this element at a superordinate level, but I am really unsure about this part. Maybe one has to consider a higher level (but which element should I take, since they are not the same across the hotel blocks?).
Anyway, based on that I figured out the remaining XPath to the element which contains the hotel name.
I followed two tutorials with similar settings (though different websites), but somehow it does not work here.
Maybe you have an idea, every help is very much appreciated. Thank you!
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.item import Item, Field
from scrapy.http.request import Request

class HotelItem(Item):
    title = Field()
    price = Field()

class BookingCrawler(CrawlSpider):
    name = 'booking_crawler'
    allowed_domains = ['booking.com']
    start_urls = ['http://www.booking.com/searchresults.html?checkin_monthday=25;checkin_year_month=2016-10;checkout_monthday=26;checkout_year_month=2016-10;class_interval=1;dest_id=-1746443;dest_type=city;offset=0;sb_travel_purpose=leisure;si=ai%2Cco%2Cci%2Cre%2Cdi;src=index;ss=Berlin']

    custom_settings = {
        'BOT_NAME': 'booking-scraper',
    }

    def parse(self, response):
        s = Selector(response)
        index_pages = s.xpath('//div[@class="results-paging"]/a/@href').extract()
        if index_pages:
            for page in index_pages:
                yield Request(response.urljoin(page), self.parse)

        hotels = s.xpath('//div[@class="sr_item_content sr_item_content_slider_wrapper"]')
        items = []
        for hotel in hotels:
            item = HotelItem()
            item['title'] = hotel.xpath('div[1]/div[1]/h3/a/span/text()').extract()[0]
            item['price'] = hotel.xpath('//div[@class="sr-prc--num sr-prc--final"]/text()').extract()[0]
            items.append(item)

        for item in items:
            yield item
I think the problem may be with your XPath on this line:
hotels = s.xpath('//div[@class="sr_item_content sr_item_content_slider_wrapper"]')
From this SO question it looks like you need to define something more along the lines of:
//div[contains(@class, 'sr_item_content') and contains(@class, 'sr_item_content_slider_wrapper')]
To help you debug further, you could try outputting the contents of index_pages first to see if it is definitely returning what you expect on that level.
Also, check Xpath Visualiser (also mentioned in the question), which can help with building Xpath.
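Dropped into the spider, the hotel selection line would then read (a sketch):
hotels = s.xpath("//div[contains(@class, 'sr_item_content') "
                 "and contains(@class, 'sr_item_content_slider_wrapper')]")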

if statement not working for spider in scrapy

I am a Python/Scrapy newbie. I am trying to scrape a website for practice; basically what I am trying to accomplish is to pull all the companies that are active and download them to a CSV file. You can see my code pasted below. I added an if statement and it doesn't seem to be working, and I am not sure what I am doing wrong.
Also, I think the spider is crawling the website multiple times based on its output. I only want it to crawl the site once every time I run it.
Just an FYI: I did search Stack Overflow for the answer and found a few solutions, but I couldn't get any of them to work. I guess this is part of being a rookie.
from scrapy.spider import Spider
from scrapy.selector import Selector
from bizzy.items import BizzyItem

class SunSpider(Spider):
    name = "Sun"
    allowed_domains = ['sunbiz.org']
    start_urls = [
        'http://search.sunbiz.org/Inquiry/CorporationSearch/SearchResults/EntityName/a/Page1'
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tbody/tr')
        items = []
        for site in sites:
            item = BizzyItem()
            item["company"] = sel.xpath('//td[1]/a/text()').extract()
            item["status"] = sel.xpath('//td[3]/text()').extract()
            if item["status"] != 'Active':
                pass
            else:
                items.append(item)
        return items
Crawling Multiple Times?
I've had time now to read over your code and glance at the source code for the site you are trying to scrape. First of all, I can tell you from my admittedly limited experience with Scrapy that your spider is not crawling the website multiple times. What you are experiencing is simply the nightmarish wall of debugging output the scrapy devs decided it was a good idea to spew by default. :)
It's actually very useful information if you read through it, and if you can learn to spot patterns you can almost read it as it's whizzing by. I believe they properly use stderr, so if you are in a Unix-y environment you can always silence it with scrapy crawl myspider -o output.json -t json 2>/dev/null (IIRC).
Mysterious if Statement
Because of the nature of extract operating over selectors that might well return multiple elements, it returns a list. If you were to print your result, even though in the xpath you selected down to text(), you would find it looked like this:
[u'string']  # Note the brackets
# ^ no little u if you are running this with Python 3.x
You want the first element (only member) of that list, [0]. Fortunately, you can add it right to the method chain you have already constructed for extract:
item["company"] = sel.xpath('//td[1]/a/text()').extract()[0]
item["status"] = sel.xpath('//td[3]/text()').extract()[0]
Then (assuming your XPath is correct; I didn't check it), your conditional should behave as expected. (A list of any size will never compare equal to a string, so the != check is always true and you always hit the pass branch.)
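Putting it together, the loop might look like this (a sketch; note it also switches to site.xpath with relative paths so each row's own cells are read, which the original code didn't do — worth verifying those paths against the page):
for site in sites:
    item = BizzyItem()
    item["company"] = site.xpath('td[1]/a/text()').extract()[0]
    item["status"] = site.xpath('td[3]/text()').extract()[0]
    if item["status"] == 'Active':
        items.append(item)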
