I am pretty new to Scrapy. I am looking into using it to crawl an entire website for links, in which I would output the items into multiple JSON files. So I could then upload them to Amazon Cloud Search for indexing. Is it possible to split the items into multiple files instead of having just one giant file in the end? From what I've read, the Item Exporters can only output to one file per spider. But I am only using one CrawlSpider for this task. It would be nice if I could set a limit to the number of items included in each file, like 500 or 1000.
Here is the code I have set up so far (based off the Dmoz.org used in the tutorial):
dmoz_spider.py
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import DmozItem
class DmozSpider(CrawlSpider):
name = "dmoz"
allowed_domains = ["dmoz.org"]
start_urls = [
"http://www.dmoz.org/",
]
rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]
def parse_item(self, response):
for sel in response.xpath('//ul/li'):
item = DmozItem()
item['title'] = sel.xpath('a/text()').extract()
item['link'] = sel.xpath('a/#href').extract()
item['desc'] = sel.xpath('text()').extract()
yield item
items.py
import scrapy
class DmozItem(scrapy.Item):
title = scrapy.Field()
link = scrapy.Field()
desc = scrapy.Field()
Thanks for the help.
I don't think built-in feed exporters support writing into multiple files.
One option would be to export into a single file in jsonlines format basically, one JSON object per line which is convenient to pipe and split.
Then, separately, after the crawling is done, you can read the file in the desired chunks and write into separate JSON files.
So I could then upload them to Amazon Cloud Search for indexing.
Note that there is a direct Amazon S3 exporter (not sure it helps, just FYI).
You can add a name to each item and use a custom pipeline to output to different json files. like so:
from scrapy.exporters import JsonItemExporter
from scrapy import signals
class MultiOutputExporter(object):
#classmethod
def from_crawler(cls, crawler):
pipeline = cls()
crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
return pipeline
def spider_opened(self, spider):
self.items = ['item1','item2']
self.files = {}
self.exporters = {}
for item in self.items:
self.files[item] = open(f'{item}.json', 'w+b')
self.exporters[item] = JsonItemExporter(self.files[item])
self.exporters[item].start_exporting()
def spider_closed(self, spider):
for item in self.items:
self.exporters[item].finish_exporting()
self.files[item].close()
def process_item(self, item, spider):
self.exporters[item.name].export_item()
return item
Then add names to your items as follows:
class Item(scrapy.Item):
name = 'item1'
Now enable the pipeline in scrapy.setting and voila.
Related
I have created a spider to scrape problems from projecteuler.net. Here I have concluded my answer to a related question with
I launch this with the command scrapy crawl euler -o euler.json and it outputs an array of unordered json objects, everyone corrisponding to a single problem: this is fine for me because I'm going to process it with javascript, even if I think resolving the ordering problem via scrapy can be very simple.
But unfortunately, ordering items to write in json by scrapy (I need ascending order by id field) seem not to be so simple. I've studied every single component (middlewares, pipelines, exporters, signals, etc...) but no one seems useful for this purpose. I'm arrived at the conclusion that a solution to solve this problem doesn't exist at all in scrapy (except, maybe, a very elaborated trick), and you are forced to order things in a second phase. Do you agree, or do you have some idea? I copy here the code of my scraper.
Spider:
# -*- coding: utf-8 -*-
import scrapy
from eulerscraper.items import Problem
from scrapy.loader import ItemLoader
class EulerSpider(scrapy.Spider):
name = 'euler'
allowed_domains = ['projecteuler.net']
start_urls = ["https://projecteuler.net/archives"]
def parse(self, response):
numpag = response.css("div.pagination a[href]::text").extract()
maxpag = int(numpag[len(numpag) - 1])
for href in response.css("table#problems_table a::attr(href)").extract():
next_page = "https://projecteuler.net/" + href
yield response.follow(next_page, self.parse_problems)
for i in range(2, maxpag + 1):
next_page = "https://projecteuler.net/archives;page=" + str(i)
yield response.follow(next_page, self.parse_next)
return [scrapy.Request("https://projecteuler.net/archives", self.parse)]
def parse_next(self, response):
for href in response.css("table#problems_table a::attr(href)").extract():
next_page = "https://projecteuler.net/" + href
yield response.follow(next_page, self.parse_problems)
def parse_problems(self, response):
l = ItemLoader(item=Problem(), response=response)
l.add_css("title", "h2")
l.add_css("id", "#problem_info")
l.add_css("content", ".problem_content")
yield l.load_item()
Item:
import re
import scrapy
from scrapy.loader.processors import MapCompose, Compose
from w3lib.html import remove_tags
def extract_first_number(text):
i = re.search('\d+', text)
return int(text[i.start():i.end()])
def array_to_value(element):
return element[0]
class Problem(scrapy.Item):
id = scrapy.Field(
input_processor=MapCompose(remove_tags, extract_first_number),
output_processor=Compose(array_to_value)
)
title = scrapy.Field(input_processor=MapCompose(remove_tags))
content = scrapy.Field()
If I needed my output file to be sorted (I will assume you have a valid reason to want this), I'd probably write a custom exporter.
This is how Scrapy's built-in JsonItemExporter is implemented.
With a few simple changes, you can modify it to add the items to a list in export_item(), and then sort the items and write out the file in finish_exporting().
Since you're only scraping a few hundred items, the downsides of storing a list of them and not writing to a file until the crawl is done shouldn't be a problem to you.
By now I've found a working solution using pipeline:
import json
class JsonWriterPipeline(object):
def open_spider(self, spider):
self.list_items = []
self.file = open('euler.json', 'w')
def close_spider(self, spider):
ordered_list = [None for i in range(len(self.list_items))]
self.file.write("[\n")
for i in self.list_items:
ordered_list[int(i['id']-1)] = json.dumps(dict(i))
for i in ordered_list:
self.file.write(str(i)+",\n")
self.file.write("]\n")
self.file.close()
def process_item(self, item, spider):
self.list_items.append(item)
return item
Though it may be non optimal, because the guide suggests in another example:
The purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items into a JSON file you should use the Feed exports.
I wrote a code in scrapy to scrape coffee shops from yellowpage. The total data is around 870 but I'm getting around 1200 with a minimum number of duplicates. Moreover, in the csv output the data are getting placed in every alternate row. Expecting someone to provide me with a workaround. Thanks in advance.
Folder Name "yellpg" and "items.py" contains
from scrapy.item import Item, Field
class YellpgItem(Item):
name= Field()
address = Field()
phone= Field()
Spider Name "yellsp.py" which contains:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from yellpg.items import YellpgItem
class YellspSpider(CrawlSpider):
name = "yellsp"
allowed_domains = ["yellowpages.com"]
start_urls = (
'https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=Los%20Angeles%2C%20CA&page=1',
)
rules = (Rule(LinkExtractor(allow=('\&page=.*',)),callback='parse_item',follow=True),)
def parse_item(self, response):
page=response.xpath('//div[#class="info"]')
for titles in page:
item = YellpgItem()
item["name"] = titles.xpath('.//span[#itemprop="name"]/text()').extract()
item["address"] = titles.xpath('.//span[#itemprop="streetAddress" and #class="street-address"]/text()').extract()
item["phone"] = titles.xpath('.//div[#itemprop="telephone" and #class="phones phone primary"]/text()').extract()
yield item
To get the CSV output, the command line I'm using:
scrapy crawl yellsp -o items.csv
I could recommend creating a pipeline that stores items to later check if the new ones are duplicates, but that isn't a real solution here, as it could create memory problems.
The real solution here is that you should avoid "storing" duplicates in your final database.
Define what field of your item is going to behave as the index in your database and everything should be working.
The best way would be to use CSVItemExporter in your pipeline. Create a file named pipeline.py inside your scrapy project and add below code lines.
from scrapy import signals
from scrapy.exporters import CsvItemExporter
class CSVExportPipeline(object):
def __init__(self):
self.files = {}
#classmethod
def from_crawler(cls, crawler):
pipeline = cls()
crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
return pipeline
def spider_opened(self, spider):
file = open('%s_coffer_shops.csv' % spider.name, 'w+b') # hard coded filename, not a good idea
self.files[spider] = file
self.exporter = CsvItemExporter(file)
self.exporter.start_exporting()
def spider_closed(self, spider):
self.exporter.finish_exporting()
file = self.files.pop(spider)
file.close()
def process_item(self, item, spider):
self.exporter.export_item(item)
return item
Now add these lines in setting.py
ITEM_PIPELINES = {
'your_project_name.pipelines.CSVExportPipeline': 300
}
This custom CSVItemExporter will export your data in CSV styles. If you are not getting the data as expected you can modify process_item method to suits your need.
I have written two spiders in single file. When I ran scrapy runspider two_spiders.py, only the first Spider was executed. How can I run both of them without splitting the file into two files.
two_spiders.py:
import scrapy
class MySpider1(scrapy.Spider):
# first spider definition
...
class MySpider2(scrapy.Spider):
# second spider definition
...
Let's read the documentation:
Running multiple spiders in the same process
By default, Scrapy runs a
single spider per process when you run scrapy crawl. However, Scrapy
supports running multiple spiders per process using the internal API.
Here is an example that runs multiple spiders simultaneously:
import scrapy
from scrapy.crawler import CrawlerProcess
class MySpider1(scrapy.Spider):
# Your first spider definition
...
class MySpider2(scrapy.Spider):
# Your second spider definition
...
process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished
(there are few more examples in the documentation)
From your question it is not clear how have you put two spiders into one file. It was not enough to concatenate content of two files with single spiders.
Try to do what is written in the documentation. Or at least show us your code. Without it we can't help you.
Here is a full Scrapy project with 2 spiders in one file.
# quote_spiders.py
import json
import string
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.item import Item, Field
class TextCleaningPipeline(object):
def _clean_text(self, text):
text = text.replace('“', '').replace('”', '')
table = str.maketrans({key: None for key in string.punctuation})
clean_text = text.translate(table)
return clean_text.lower()
def process_item(self, item, spider):
item['text'] = self._clean_text(item['text'])
return item
class JsonWriterPipeline(object):
def open_spider(self, spider):
self.file = open(spider.settings['JSON_FILE'], 'a')
def close_spider(self, spider):
self.file.close()
def process_item(self, item, spider):
line = json.dumps(dict(item)) + "\n"
self.file.write(line)
return item
class QuoteItem(Item):
text = Field()
author = Field()
tags = Field()
spider = Field()
class QuotesSpiderOne(scrapy.Spider):
name = "quotes1"
def start_requests(self):
urls = ['http://quotes.toscrape.com/page/1/', ]
for url in urls:
yield scrapy.Request(url=url, callback=self.parse)
def parse(self, response):
for quote in response.css('div.quote'):
item = QuoteItem()
item['text'] = quote.css('span.text::text').get()
item['author'] = quote.css('small.author::text').get()
item['tags'] = quote.css('div.tags a.tag::text').getall()
item['spider'] = self.name
yield item
class QuotesSpiderTwo(scrapy.Spider):
name = "quotes2"
def start_requests(self):
urls = ['http://quotes.toscrape.com/page/2/', ]
for url in urls:
yield scrapy.Request(url=url, callback=self.parse)
def parse(self, response):
for quote in response.css('div.quote'):
item = QuoteItem()
item['text'] = quote.css('span.text::text').get()
item['author'] = quote.css('small.author::text').get()
item['tags'] = quote.css('div.tags a.tag::text').getall()
item['spider'] = self.name
yield item
if __name__ == '__main__':
settings = dict()
settings['USER_AGENT'] = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
settings['HTTPCACHE_ENABLED'] = True
settings['JSON_FILE'] = 'items.jl'
settings['ITEM_PIPELINES'] = dict()
settings['ITEM_PIPELINES']['__main__.TextCleaningPipeline'] = 800
settings['ITEM_PIPELINES']['__main__.JsonWriterPipeline'] = 801
process = CrawlerProcess(settings=settings)
process.crawl(QuotesSpiderOne)
process.crawl(QuotesSpiderTwo)
process.start()
Install Scrapy and run the script
$ pip install Scrapy
$ python quote_spiders.py
No other file is needed.
This example coupled with graphical debugger of pycharm/vscode can help understand scrapy workflow and make debugging easier.
Unsure of how to create a dynamic item class: http://scrapy.readthedocs.org/en/latest/topics/practices.html#dynamic-creation-of-item-classes
Not quite sure where I would use that code provided in the documentation.
Would I stick this in pipelines.py, items.py and call this from the parse function of the spider? or the main script file that calls the scrapy spider?
I would place the code snippet in items.py, and use it in the spider for any dynamic item I need (probably down to personal preferences), for example:
from myproject.items import create_item_class
# base on one of the scrapy example...
class MySpider(CrawlSpider):
# ... name, allowed_domains ...
def parse_item(self, response):
self.log('Hi, this is an item page! %s' % response.url)
# for need to use a dynamic item
field_list = ['id', 'name', 'description']
DynamicItem = create_item_class('DynamicItem', field_list)
item = DynamicItem()
# then you can use it here...
item['id'] = response.xpath('//td[#id="item_id"]/text()').re(r'ID: (\d+)')
item['name'] = response.xpath('//td[#id="item_name"]/text()').extract()
item['description'] = response.xpath('//td[#id="item_description"]/text()').extract()
return item
You may be interested to read the Dynamic Creation of Item Classes #398 for a better understanding.
I wrote a crawler using the scrapy framework in python to select some links and meta tags.It then crawls the start urls and write the data in a JSON encoded format onto a file.The problem is that when the crawler is run two or three times with the same start urls the data in the file gets duplicated .To avoid this I used a downloader middleware in scrapy which is this : http://snippets.scrapy.org/snippets/1/
What I did was copy and paste the above code in a file inside my scrapy project and I enabled it in the settings.py file by adding the following line:
SPIDER_MIDDLEWARES = {'a11ypi.removeDuplicates.IgnoreVisitedItems':560}
where "a11ypi.removeDuplicates.IgnoreVisitedItems" is the class path name and finally I went in and modified my items.py file and included the following fields
visit_id = Field()
visit_status = Field()
But this doesn't work and still the crawler produces the same result appending it to the file when run twice
I did the writing to the file in my pipelines.py file as follows:
import json
class AYpiPipeline(object):
def __init__(self):
self.file = open("a11ypi_dict.json","ab+")
# this method is called to process an item after it has been scraped.
def process_item(self, item, spider):
d = {}
i = 0
# Here we are iterating over the scraped items and creating a dictionary of dictionaries.
try:
while i<len(item["foruri"]):
d.setdefault(item["foruri"][i],{}).setdefault(item["rec"][i],{})[item["foruri_id"][i]] = item['thisurl'] + ":" +item["thisid"][i]
i+=1
except IndexError:
print "Index out of range"
json.dump(d,self.file)
return item
And my spider code is as follows:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from a11ypi.items import AYpiItem
class AYpiSpider(CrawlSpider):
name = "a11y.in"
allowed_domains = ["a11y.in"]
# This is the list of seed URLs to begin crawling with.
start_urls = ["http://www.a11y.in/a11ypi/idea/fire-hi.html"]
# This is the callback method, which is used for scraping specific data
def parse(self,response):
temp = []
hxs = HtmlXPathSelector(response)
item = AYpiItem()
wholeforuri = hxs.select("//#foruri").extract() # XPath to extract the foruri, which contains both the URL and id in foruri
for i in wholeforuri:
temp.append(i.rpartition(":"))
item["foruri"] = [i[0] for i in temp] # This contains the URL in foruri
item["foruri_id"] = [i.split(":")[-1] for i in wholeforuri] # This contains the id in foruri
item['thisurl'] = response.url
item["thisid"] = hxs.select("//#foruri/../#id").extract()
item["rec"] = hxs.select("//#foruri/../#rec").extract()
return item
Kindly suggest what to do.
try to understand why the snippet is written as it is:
if isinstance(x, Request):
if self.FILTER_VISITED in x.meta:
visit_id = self._visited_id(x)
if visit_id in visited_ids:
log.msg("Ignoring already visited: %s" % x.url,
level=log.INFO, spider=spider)
visited = True
Notice in line 2, you actually require a key in in Request.meta called FILTER_VISITED in order for the middleware to drop the request. The reason is well-intended because every single url you have visited will be skipped and you will not have urls to tranverse at all if you do not do so. So, FILTER_VISITED actually allows you to choose what url patterns you want to skip. If you want links extracted with a particular rule skipped, just do
Rule(SgmlLinkExtractor(allow=('url_regex1', 'url_regex2' )), callback='my_callback', process_request = setVisitFilter)
def setVisitFilter(request):
request.meta['filter_visited'] = True
return request
P.S I do not know if it works for 0.14 and above as some of the code has changed for storing spider context in the sqlite db.