scrapy spider scraping data from links in random order - why? - python

First I grabbed all the coin links from the website and made requests to those links.
But Scrapy does not request them serially from the link list. After requesting those links it scrapes the data successfully, but when saving to a CSV file it adds a blank row after every successfully scraped item (see the result screenshot).
I expect it to request the links serially from the list and not to add any blank rows. How can I do that?
I am using Python 3.6 and Scrapy 1.5.1.
My code:
import scrapy

class MarketSpider(scrapy.Spider):
    name = 'market'
    allowed_domains = ['coinmarketcap.com']
    start_urls = ['http://coinmarketcap.com/']

    def parse(self, response):
        # collect the link of every coin on the front page
        Coin = response.xpath('//*[@class="currency-name-container link-secondary"]/@href').extract()
        for link in Coin:
            absolute_url = response.urljoin(link)
            yield scrapy.Request(absolute_url, callback=self.website_link)

    def website_link(self, response):
        # scrape the coin's website link and name from its detail page
        link = response.xpath('//*[@class="list-unstyled details-panel-item--links"]/li[2]/a/@href').extract()
        name = response.xpath('normalize-space(//h1)').extract()
        yield {'Name': name, 'Link': link}

I think Scrapy is visiting the pages in a concurrent (producer/consumer) fashion. This would explain the non-sequential order of your results.
To verify this hypothesis, you could change your configuration to use a single thread.
As for the blank rows, are you sure that none of your name or link values contains a \n?

Scrapy is an asynchronous framework - multiple requests are executed concurrently and the responses are parsed as they are received.
The only way to reliably control which responses are parsed first is to turn this feature off, e.g. by setting CONCURRENT_REQUESTS to 1.
This would make your spider less efficient though, and this kind of control of the parse order is rarely necessary, so I would avoid it if possible.
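If you do need it, a minimal sketch of that setting (in settings.py, or via custom_settings on the spider) would be:
# settings.py -- handle one request at a time
CONCURRENT_REQUESTS = 1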
The extra newlines in CSV exports on Windows are a known issue, and will be fixed in the next Scrapy release.

Related

How to make scrapy spider get start URL and allowed domains from csv list?

I am using a scrapy spider for URL crawling for my research project. My spider is based on the code from bhattraideb (Scrapy follow all the links and get status) and slightly edited to fit my needs better.
At the moment I'm restarting the spider every time I change the allowed domain and start URL, since I need the output for each allowed domain in a separate file. As my list of URLs grows, this is getting very tedious to do...
I tried to iterate over a .csv file, importing both columns into allowed_domains and start_urls as lists using "i" and "while", however it always clashes with the class attributes.
I'd appreciate any help :-)
See: How to loop through multiple URLs to scrape from a CSV file in Scrapy?
For info: when not using a CSV, you can also do something like this, building start_urls from a paginated URL:
# https://www.food.com/recipe/all/healthy?pn=1
list_url = 'https://www.food.com/recipe/all/healthy?pn='
page = 1
start_urls = [list_url + str(page)]
Increment the page variable to build the URL of the next page, and stop when there is no next page.
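If you do want a single spider driven by a CSV, a rough sketch could look like the following (the spider name, file name and column names are made up for illustration):
import csv
import scrapy

class CsvSeededSpider(scrapy.Spider):
    # hypothetical spider; domains.csv is assumed to have the columns
    # allowed_domain,start_url
    name = 'csv_seeded'

    def __init__(self, *args, **kwargs):
        super(CsvSeededSpider, self).__init__(*args, **kwargs)
        self.allowed_domains = []
        self.start_urls = []
        with open('domains.csv') as f:
            for row in csv.DictReader(f):
                self.allowed_domains.append(row['allowed_domain'])
                self.start_urls.append(row['start_url'])

    def parse(self, response):
        # here you could route each domain's items to its own output file
        # instead of restarting the spider per domain
        yield {'url': response.url}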

python scrapy Two-direction crawling with a spider

I am reading Learning Scrapy by Dimitrios Kouzis-Loukas, and I have a question about the "Two-direction crawling with a spider" part in chapter 3, page 58.
The original code looks like this:
def parse(self, response):
    # Get the next index URLs and yield Requests
    next_selector = response.xpath('//*[contains(@class,"next")]//@href')
    for url in next_selector.extract():
        yield Request(urlparse.urljoin(response.url, url))
    # Get item URLs and yield Requests
    item_selector = response.xpath('//*[@itemprop="url"]/@href')
    for url in item_selector.extract():
        yield Request(urlparse.urljoin(response.url, url),
                      callback=self.parse_item)
But from my understanding, shouldn't the second loop be nested inside the first one, so that we first download an index page, then download all the information pages it links to, and only after that move on to the next index page?
So I just want to know the order in which the original code operates. Please help!
You can't really merge the two loops.
The Request objects yielded in them have different callbacks.
The first one will be processed by the parse method (which seems to be parsing a listing of multiple items), and the second by the parse_item method (probably parsing the details of a single item).
As for the order of scraping, Scrapy (by default) uses a LIFO queue, which means the request created last will be processed first.
However, due to the asynchronous nature of scrapy, it's impossible to say what the exact order will be.
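If you did want something closer to breadth-first order (finish an index page's items before moving deeper), the Scrapy FAQ suggests switching the scheduler to FIFO queues; roughly, in settings.py:
# settings.py -- crawl breadth-first instead of the default depth-first (LIFO)
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
Even then, responses still come back asynchronously, so the parse order is only approximately breadth-first.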

Spider error URL processing

I'm getting an error when processing a URL with Scrapy 1.5.0 and Python 2.7.14.
class GoodWillOutSpider(Spider):
    name = "GoodWillOutSpider"
    allowded_domains = ["thegoodwillout.com"]
    start_urls = [GoodWillOutURL]

    def __init__(self):
        logging.critical("GoodWillOut STARTED.")

    def parse(self, response):
        products = Selector(response).xpath('//div[@id="elasticsearch-results-container"]/ul[@class="product-list clearfix"]')
        for product in products:
            item = GoodWillOutItem()
            item['name'] = product.xpath('//div[@class="name ng-binding"]').extract()[0]
            item['link'] = "www.thegoodwillout.com" + product.xpath('//@href').extract()[0]
            # item['image'] = "http:" + product.xpath("/div[@class='catalogue-product-cover']/a[@class='catalogue-product-cover-image']/img/@src").extract()[0]
            # item['size'] = '**NOT SUPPORTED YET**'
            yield item
        yield Request(GoodWillOutURL, callback=self.parse, dont_filter=True, priority=16)
This is my class GoodWillOutSpider, and this is the error I get:
[scrapy.core.scraper] ERROR: Spider error processing <GET https://www.thegoodwillout.com/footwear> (referer: None)
line 1085, in parse
    item['name'] = product.xpath('//div[@class="name ng-binding"]').extract()[0]
IndexError: list index out of range
Also, for the future: how can I figure out the correct XPath for any site myself, without having to ask here again?
The problem
If your scraper can't access data that you can see using your browsers developer tools, it is not seeing the same data as your browser.
This can mean one of two things:
Your scraper is being recognized as such and served different content
Some of the content is generated dynamically (usually through javascript)
The generic solution
The most straightforward way of getting around both of these problems is to use an actual browser.
There are many headless browsers available, and you can choose the best one for your needs.
For scrapy, scrapy-splash is probably the simplest option.
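As a rough illustration only (it assumes a Splash instance is running and the scrapy-splash downloader middleware is configured as described in its README), a request rendered through Splash looks something like this:
from scrapy_splash import SplashRequest

def start_requests(self):
    # render the page in Splash so javascript-generated markup is present
    # before parse() runs; the wait value is just a guess
    yield SplashRequest(
        'https://www.thegoodwillout.com/footwear',
        callback=self.parse,
        args={'wait': 1.0},
    )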
More specialized solutions
Sometimes, you can figure out what the reason for this different behavior is, and change your code.
This will usually be the more efficient solution, but might require significantly more work on your part.
For example, if your scraper is getting redirected, it is possible that you just need to use a different user agent string, pass some additional headers, or slow down your requests.
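For instance (only a sketch of the relevant built-in settings; the values are illustrative):
# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # present a browser-like user agent
DOWNLOAD_DELAY = 1.0                                      # slow down requests
DEFAULT_REQUEST_HEADERS = {'Accept-Language': 'en'}       # pass extra headers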
If the content is generated by javascript, you might be able to look at the page source (response.text or view source in a browser), and figure out what is going on.
After that, there are two possibilities:
Extract the data in an alternate way (like gangabass did for your previous question)
Replicate what the javascript is doing in your spider code (such as making additional requests, like in the current example)
IndexError: list index out of range
You first need to check whether the list has any values after extracting:
item['name'] = product.xpath('//div[@class="name ng-binding"]').extract()
if item['name']:
    item['name'] = item['name'][0]
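Alternatively, if I remember the selector API in Scrapy 1.5 correctly, extract_first() does that check for you and lets you supply a default:
# returns the first match, or the default if nothing matched
item['name'] = product.xpath('//div[@class="name ng-binding"]').extract_first(default='')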

if statement not working for spider in scrapy

I am a python/scrapy newbie. I am trying to scrape a website for practice, and basically what I am trying to accomplish is to pull all the companies that are active and download them to a CSV file. You can see my code pasted below. I added an if statement and it doesn't seem to be working, and I am not sure what I am doing wrong.
Also, I think the spider is crawling the website multiple times based on its output. I only want it to crawl the site once every time I run it.
Just an FYI: I did search Stack Overflow for the answer and found a few solutions, but I couldn't get any of them to work. I guess this is part of being a rookie.
from scrapy.spider import Spider
from scrapy.selector import Selector
from bizzy.items import BizzyItem

class SunSpider(Spider):
    name = "Sun"
    allowed_domains = ['sunbiz.org']
    start_urls = [
        'http://search.sunbiz.org/Inquiry/CorporationSearch/SearchResults/EntityName/a/Page1'
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tbody/tr')
        items = []
        for site in sites:
            item = BizzyItem()
            item["company"] = sel.xpath('//td[1]/a/text()').extract()
            item["status"] = sel.xpath('//td[3]/text()').extract()
            if item["status"] != 'Active':
                pass
            else:
                items.append(item)
        return items
Crawling Multiple Times?
I've had time now to read over your code and glance at the source code for the site you are trying to scrape. First of all, I can tell you from my admittedly limited experience with Scrapy that your spider is not crawling the website multiple times. What you are experiencing is simply the nightmarish wall of debugging output the scrapy devs decided it was a good idea to spew by default. :)
It's actually very useful information if you read through it, and if you can learn to spot patterns you can almost read it as it's whizzing by. I believe they properly use stderr, so if you are in a Unix-y environment you can always silence it with scrapy crawl myspider -o output.json -t json 2>/dev/null (IIRC).
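Another option, if I recall the settings correctly, is to raise the log level instead of redirecting the stream:
# settings.py -- only show warnings and errors instead of the full debug output
LOG_LEVEL = 'WARNING'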
Mysterious if Statement
Because extract operates over selectors that may well match multiple elements, it returns a list. If you were to print your result, even though in the xpath you selected down to text(), you would find it looked like this:
[u'string'] # Note the brackets
#^ no little u if you are running this with Python 3.x
You want the first element (only member) of that list, [0]. Fortunately, you can add it right to the method chain you have already constructed for extract:
item["company"] = sel.xpath('//td[1]/a/text()').extract()[0]
item["status"] = sel.xpath('//td[3]/text()').extract()[0]
Then (assuming your xpath is correct - I didn't check it), your conditional should behave as expected. (Before, a list of any size would never equal a string, so the != comparison was always true and you always hit pass.)
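Putting it together, the loop might look like this (just a sketch; note I've also switched to XPaths relative to each row via site.xpath('.//...'), which is presumably what was intended, but I haven't checked the page markup):
for site in sites:
    item = BizzyItem()
    # relative XPaths so each row yields its own company/status
    item["company"] = site.xpath('.//td[1]/a/text()').extract()[0]
    item["status"] = site.xpath('.//td[3]/text()').extract()[0]
    if item["status"] == 'Active':
        items.append(item)
return items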

how to crawl webpages based on the info on the index page

I am trying to write a spider that crawls certain pages based on the data or info on the index page, and then stores the results in a database.
For example, let's say I would like to crawl stackoverflow.com/questions/tagged/scrapy
I would go through the index page; if a question is not in my database, I would store its answer count in the database, then follow the link to the question and crawl that page.
If the question is already in the database, but the number of answers is greater than the one in the database: crawl that page again.
If the question is already in the database and the answer count is the same: skip this question.
At the moment I can get all the links and answer counts (in this example) from the index page,
but I don't know how to make the spider follow the link to the question page based on the answer count.
Is there a way to do this with one spider, instead of having two spiders where one gets all the links on the index page, compares the data with the database, exports a JSON or CSV file, and then passes it to another spider to crawl the question pages?
Just use BaseSpider. That way you can make all the logic depend on the content you are scraping. I personally prefer BaseSpider since it gives you a lot more control over the scraping process.
The spider should look something like this (this is more pseudo-code than working code):
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
from myproject.items import MyItem

class StackOverflow(BaseSpider):
    name = 'stackoverflow.com'
    allowed_domains = ['stackoverflow.com']
    start_urls = ['http://stackoverflow.com/questions']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for question in hxs.select('//question-xpath'):
            question_url = question.select('./question-url')
            answer_count = question.select('./answer-count-xpath')
            # you'll have to write the xpaths and db logic yourself
            if get_db_answer_count(question_url) != answer_count[0]:
                yield Request(question_url, callback=self.parse_question)

    def parse_question(self, response):
        insert_question_and_answers_into_db
        pass
This is what the CrawlSpider and its Rules do (be sure to check out the example in the docs). You could first get the information from the index site (though your approach of counting answers is somewhat flawed: what if a user deleted a post and a new one was added?) and then decide for each sub page whether you want to get its information or not.
Put simply: use the spider on the index pages and follow their questions. When given a question, check whether you want to get its information or drop/ignore it, as in the sketch below.
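For reference, a bare-bones CrawlSpider with Rules might look roughly like this (a sketch only: the link-extractor patterns and the keep-or-drop logic are placeholders, and newer import paths are used):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class StackOverflowCrawl(CrawlSpider):
    name = 'stackoverflow-crawl'  # hypothetical name
    allowed_domains = ['stackoverflow.com']
    start_urls = ['http://stackoverflow.com/questions']

    rules = (
        # follow index pagination
        Rule(LinkExtractor(allow=r'\?page=\d+'), follow=True),
        # hand individual question pages to parse_question
        Rule(LinkExtractor(allow=r'/questions/\d+/'), callback='parse_question'),
    )

    def parse_question(self, response):
        # decide here whether to store or skip the question,
        # e.g. by comparing its answer count with what is already in the database
        pass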
