Scrapy not writing all outputs to CSV - python

I am using Scrapy to ping a bunch of webpages and determine which of the homepages are no longer live.
The full code of my spider is very simple, and reproduced below:
class BrokenLinksSpider(scrapy.Spider):
name = 'broken-link-spider'
def __init__(self, id_homepage_mapping, *args, **kwargs):
self.org_homepages = id_homepage_mapping
self.download_timeout = 10
def start_requests(self):
for qId, url in self.org_homepages.items():
if url.endswith("/"):
url = url[0:-1]
yield scrapy.Request(url,
callback=self.complete_callback,
errback=self.error_callback,
dont_filter=True,
meta={ 'handle_httpstatus_all': True, 'id': id })
def complete_callback(self, response):
if str(response.status)[0] in ("4", "5"):
yield BrokenLinkCheckerItem(
url=response.request.url,
status=response.status,
qId=response.meta['id'])
def error_callback(self, failure):
yield BrokenLinkCheckerItem(
url=failure.request.url,
status=repr(failure),
qId=failure.request.meta['id'])
In the logs of my app, I can see in the scrapy stats collector that there were exceptions with fetching about 40,000 links. However, the csv output file that I'm writing to only contains about 8k rows. Furthermore, the CSV file seems to contain all of the different types of exceptions reported by the StatsCollector, just less instances of each.
So my question is, why is there such a discrepancy between what the statsCollector reports and what's actually written to the CSV? Given my understanding of the errback callback function, shouldn't all those exceptions end up being written to the CSV?
Thanks!

Related

Scrapy item enriching from multiple websites

I implemented the following scenario with python scrapy framework:
class MyCustomSpider(scrapy.Spider):
def __init__(self, name=None, **kwargs):
super().__init__(name, **kwargs)
self.days = getattr(self, 'days', None)
def start_requests(self):
start_url = f'https://some.url?days={self.days}&format=json'
yield scrapy.Request(url=start_url, callback=self.parse)
def parse(self, response):
json_data = response.json() if response and response.status == 200 else None
if json_data:
for entry in json_data['entries']:
yield self.parse_json_entry(entry)
if 'next' in json_data and json_data['next'] != "":
yield response.follow(f"https://some.url?days={self.days}&time={self.time}&format=json", self.parse)
def parse_json_entry(self, entry):
...
item = loader.load_item()
return item
I upsert parsed items into a database in one of pipelines. I would like to add the following functionality:
before upserting the item I would like to read it's current shape from database
if the item does not exist in a database or it exists but has some field empty I need to make a call to another website (exact webaddress is established based on the item's contents), scrap it's contents, enrich my item based on this additional reading and only then save the item into a database. I would like to have this call also covered by scrapy framework in order to have the cache and other conveniences
if the item does exist in a database and it has appropriate fields filled in then just update the item's status based on the currently read data
How to implement point 2 in a scrapy-like way? Now I perform the call to another website just in one of pipelines after scrapping the item but in that way I do not employ scrapy for doing that. Is there any smart way of doing that (maybe with pipelines) or rather should I put all the code into one spider with all database reading/checks and callbacks there?
Best regards!
I guess the best idea will be to upsert partially data in one spider/pipeline with some flag stating that it still needs adjustement. Then in another spider load data with the flag set on and perform e additional readings.

Pass a list of start_urls as parameter from Django to Scrapyd

I'm working in a little scraping platform using Django and Scrapy (scrapyd as API). Default spider is working as expected, and using ScrapyAPI (python-scrapyd-api) I'm passing a URL from Django and scrap data, I'm even saving results as JSON to a postgres instance. This is for a SINGLE URL pass as parameter.
When trying to pass a list of URLs, scrapy just takes the first URL from a list. I don't know if it's something about how Python or ScrapyAPI is treating or processing this arguments.
# views.py
# This is how I pass parameters from Django
task = scrapyd.schedule(
project=scrapy_project,
spider=scrapy_spider,
settings=scrapy_settings,
url=urls
)
# default_spider.py
def __init__(self, *args, **kwargs):
super(SpiderMercadoLibre, self).__init__(*args, **kwargs)
self.domain = kwargs.get('domain')
self.start_urls = [self.url] # list(kwargs.get('url'))<--Doesn't work
self.allowed_domains = [self.domain]
# Setup to tell Scrapy to make calls from same URLs
def start_requests(self):
...
for url in self.start_urls:
yield scrapy.Request(url, callback=self.parse, meta={'original_url': url}, dont_filter=True)
Of course I can make some changes to my model so I can save every result iterating from the list of URLs and scheduling each URL using ScrapydAPI, but I'm wondering if this is a limitation of scrapyd itself or am I missing something about Python mechanics.
This is how ScrapydAPI is processing the schedule method:
def schedule(self, project, spider, settings=None, **kwargs):
"""
Schedules a spider from a specific project to run. First class, maps
to Scrapyd's scheduling endpoint.
"""
url = self._build_url(constants.SCHEDULE_ENDPOINT)
data = {
'project': project,
'spider': spider
}
data.update(kwargs)
if settings:
setting_params = []
for setting_name, value in iteritems(settings):
setting_params.append('{0}={1}'.format(setting_name, value))
data['setting'] = setting_params
json = self.client.post(url, data=data, timeout=self.timeout)
return json['jobid']
I think i'm implementing everything as expected but everytime, no matter what approach is used, only the first URL from the list of URLs is scraped

Scrapy Spider - Saving data through Stats Collection

I'm trying to save some information between the last runned spider and current spider. To make this possible I found the Stats Collection supported by scrapy. My code bellow:
class StatsSpider(Spider):
name = 'stats'
def __init__(self, crawler, *args, **kwargs):
Spider.__init__(self, *args, **kwargs)
self.crawler = crawler
print self.crawler.stats.get_value('last_visited_url')
#classmethod
def from_crawler(cls, crawler):
return cls(crawler)
def start_requests(self):
return [Request(url)
for url in ['http://www.google.com', 'http://www.yahoo.com']]
def parse(self, response):
self.crawler.stats.set_value('last_visited_url', response.url)
print'URL: %s' % response.url
When I run my spider, I can see via debug that stats variable is being refreshed with the new data, however, when I run my spider again (locally), the stats variable starts empty. How should I propertly run my spider in order to persist the data?
I'm running it on console:
scrapy runspider stats.py
EDIT : If you are running it on Scrapinghub you can use their collections api
You need to save this data to disk in one way or another (in a file or database).
The crawler object your writing the data to only exists during the execution of your crawl. Once your spider finishes that object leaves memory and you lost your data.
I suggest loading the stats from your last run in init. Then updating them in parse like you are. Then hooking up the scrapy spider_closed signal to persist the data when the spider is done running.
If you need an example of spider_closed let me know and I'll update. But plenty of examples are readily available on the web.
Edit: I'll just give you an example: https://stackoverflow.com/a/12394371/2368836

python scrapy parse() function, where is the return value returned to?

I am new on Scrapy, and I am sorry if this question is trivial. I have read the document on Scrapy from official webpage. And while I look through the document, I met this example:
import scrapy
from myproject.items import MyItem
class MySpider(scrapy.Spider):
name = ’example.com’
allowed_domains = [’example.com’]
start_urls = [
’http://www.example.com/1.html’,
’http://www.example.com/2.html’,
’http://www.example.com/3.html’,
]
def parse(self, response):
for h3 in response.xpath(’//h3’).extract():
yield MyItem(title=h3)
for url in response.xpath(’//a/#href’).extract():
yield scrapy.Request(url, callback=self.parse)
I know, the parse method must return an item or/and request, but where are these return values returned to?
One is an item and the other is request, I think these two type would be handled differently and in the case of CrawlSpider, it has Rule with callback. What about this callback's return value? where to ? same as parse()?
I am very confused on Scrapy procedure, even i read the document....
According to the documentation:
The parse() method is in charge of processing the response and
returning scraped data (as Item objects) and more URLs to follow (as
Request objects).
In other words, returned/yielded items and requests are handled differently, items are being handed to the item pipelines and item exporters, but requests are being put into the Scheduler which pipes the requests to the Downloader for making a request and returning a response. Then, the engine receives the response and gives it to the spider for processing (to the callback method).
The whole data-flow process is described in the Architecture Overview page in a very detailed manner.
Hope that helps.

Multiple pages per item in Scrapy

Disclaimer: I'm fairly new to Scrapy.
To put my question plainly: How can I retrieve an Item property from a link on a page and get the results back into the same Item?
Given the following sample Spider:
class SiteSpider(Spider):
site_loader = SiteLoader
...
def parse(self, response):
item = Place()
sel = Selector(response)
bl = self.site_loader(item=item, selector=sel)
bl.add_value('domain', self.parent_domain)
bl.add_value('origin', response.url)
for place_property in item.fields:
parse_xpath = self.template.get(place_property)
# parse_xpath will look like either:
# '//path/to/property/text()'
# or
# {'url': '//a[#id="Location"]/#href',
# 'xpath': '//div[#class="directions"]/span[#class="address"]/text()'}
if isinstance(parse_xpath, dict): # place_property is at a URL
url = sel.xpath(parse_xpath['url_elem']).extract()
yield Request(url, callback=self.get_url_property,
meta={'loader': bl, 'parse_xpath': parse_xpath,
'place_property': place_property})
else: # parse_xpath is just an xpath; process normally
bl.add_xpath(place_property, parse_xpath)
yield bl.load_item()
def get_url_property(self, response):
loader = response.meta['loader']
parse_xpath = response.meta['parse_xpath']
place_property = response.meta['place_property']
sel = Selector(response)
loader.add_value(place_property, sel.xpath(parse_xpath['xpath'])
return loader
I'm running these spiders against multiple sites, and most of them have the data I need on one page and it works just fine. However, some sites have certain properties on a sub-page (ex., the "address" data existing at the "Get Directions" link).
The "yield Request" line is really where I have the problem. I see the items move through the pipeline, but they're missing those properties that are found at other URLs (IOW, those properties that get to "yield Request"). The get_url_property callback is basically just looking for an xpath within the new response variable, and adding that to the item loader instance.
Is there a way to do what I'm looking for, or is there a better way? I would like to avoid making a synchronous call to get the data I need (if that's even possible here), but if that's the best way, then maybe that's the right approach. Thanks.
If I understand you correctly, you have (at least) two different cases:
The crawled page links to another page containing the data (1+ further request necessary)
The crawled page contains the data (No further request necessary)
In your current code, you call yield bl.load_item() for both cases, but in the parse callback. Note that the request you yield is executed some later point in time, thus the item is incomplete and that's why you're missing the place_property key from the item for the first case.
Possible Solution
A possible solution (If I understood you correctly) Is to exploit the asynchronous behavior of Scrapy. Only minor changes to your code are involved.
For the first case, you pass the item loader to another request, which will then yield it. This is what you do in the isinstance if clause. You'll need to change the return value of the get_url_property callback to actually yield the loaded item.
For the second case, you can return the item directly,
thus simply yield the item in the else clause.
The following code contains the changes to your example.
Does this solve your problem?
def parse(self, response):
# ...
if isinstance(parse_xpath, dict): # place_property is at a URL
url = sel.xpath(parse_xpath['url_elem']).extract()
yield Request(url, callback=self.get_url_property,
meta={'loader': bl, 'parse_xpath': parse_xpath,
'place_property': place_property})
else: # parse_xpath is just an xpath; process normally
bl.add_xpath(place_property, parse_xpath)
yield bl.load_item()
def get_url_property(self, response):
loader = response.meta['loader']
# ...
loader.add_value(place_property, sel.xpath(parse_xpath['xpath'])
yield loader.load_item()
Related to that problem is the question of chaining requests, for which I have noted a similar solution.

Categories

Resources