Following the documentation, I can run Scrapy from a Python script, but I can't get the Scrapy result.
This is my spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from items import DmozItem

class DmozSpider(BaseSpider):
    name = "douban"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/group/xxx/discussion"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select("//table[@class='olt']/tr/td[@class='title']/a")
        items = []
        # print sites
        for row in rows:
            item = DmozItem()
            item["title"] = row.select('text()').extract()[0]
            item["link"] = row.select('@href').extract()[0]
            items.append(item)
        return items
Notice the last line: I try to use the result returned by parse. If I run:
scrapy crawl douban
the terminal prints the returned result.
But I can't get the returned result from a Python script. Here is my Python script:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from spiders.dmoz_spider import DmozSpider
from scrapy.xlib.pydispatch import dispatcher
def stop_reactor():
    reactor.stop()
dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = DmozSpider(domain='www.douban.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
log.msg("------------>Running reactor")
result = reactor.run()
print result
log.msg("------------>Running stoped")
I try to get the result from reactor.run(), but it returns nothing.
How can I get the result?
The terminal prints the result because the default log level is DEBUG.
When you run your spider from the script and call log.start(), the default log level is set to INFO.
Just replace:
log.start()
with
log.start(loglevel=log.DEBUG)
UPD:
To get the result as string, you can log everything to a file and then read from it, e.g.:
log.start(logfile="results.log", loglevel=log.DEBUG, crawler=crawler, logstdout=False)
reactor.run()
with open("results.log", "r") as f:
result = f.read()
print result
Hope that helps.
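As an alternative to scraping the log, here is a rough sketch (assuming the feed exporter that ships with Scrapy, and the same spider as above) that lets Scrapy serialize the items to a JSON file and loads them back once the crawl is done:

import json

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from scrapy.xlib.pydispatch import dispatcher

from spiders.dmoz_spider import DmozSpider

def stop_reactor():
    reactor.stop()

dispatcher.connect(stop_reactor, signal=signals.spider_closed)

# Ask the built-in feed exporter to write every scraped item to a JSON file.
settings = Settings()
settings.set("FEED_FORMAT", "json")
settings.set("FEED_URI", "items.json")

spider = DmozSpider(domain='www.douban.com')
crawler = Crawler(settings)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start(loglevel=log.DEBUG)
reactor.run()  # blocks until spider_closed stops the reactor

# The exported file now holds the parse() results as a list of dicts.
with open("items.json") as f:
    result = json.load(f)
print result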
I found your question while asking myself the same thing, namely: "How can I get the result?". Since this wasn't answered here, I endeavoured to find the answer myself, and now that I have, I can share it:
items = []
def add_item(item):
    items.append(item)
dispatcher.connect(add_item, signal=signals.item_passed)
Or, for Scrapy 0.22 (http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script), replace the last line of my solution with:
crawler.signals.connect(add_item, signals.item_passed)
My solution is freely adapted from http://www.tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/.
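For reference, on more recent Scrapy releases the same trick still works, but the signal is named item_scraped and CrawlerProcess replaces the manual reactor handling. A rough sketch (the spider import path is taken from the question and may differ in your project):

from scrapy import signals
from scrapy.crawler import CrawlerProcess

from spiders.dmoz_spider import DmozSpider  # adjust to your project layout

items = []

def add_item(item, response, spider):
    # item_scraped handlers receive the item plus the response and spider
    items.append(item)

process = CrawlerProcess({"LOG_LEVEL": "INFO"})
crawler = process.create_crawler(DmozSpider)
crawler.signals.connect(add_item, signal=signals.item_scraped)
process.crawl(crawler)
process.start()  # blocks until the crawl is finished

print(items)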
In my case, I placed the script file at the Scrapy project level, e.g. if the spiders are in scrapyproject/scrapyproject/spiders, then I placed it at scrapyproject/myscript.py.
Related
I have written a crawler in Scrapy, but I would like to initiate the crawling by using a main method.
import sys, getopt
import scrapy
from scrapy.spiders import Spider
from scrapy.http import Request
import re

class TutsplusItem(scrapy.Item):
    title = scrapy.Field()

class MySpider(Spider):
    name = "tutsplus"
    allowed_domains = ["bbc.com"]
    start_urls = ["http://www.bbc.com/"]

    def __init__(self, *args):
        try:
            opts, args = getopt.getopt(args, "hi:o:", ["ifile=", "ofile="])
        except getopt.GetoptError:
            print 'test.py -i <inputfile> -o <outputfile>'
            sys.exit(2)
        super(MySpider, self).__init__(self, *args)

    def parse(self, response):
        links = response.xpath('//a/@href').extract()

        # We stored already crawled links in this list
        crawledLinks = []

        # Pattern to check proper link
        # I only want to get the tutorial posts
        # linkPattern = re.compile("^\/tutorials\?page=\d+")

        for link in links:
            # If it is a proper link and is not checked yet, yield it to the Spider
            # if linkPattern.match(link) and not link in crawledLinks:
            if not link in crawledLinks:
                link = "http://www.bbc.com" + link
                crawledLinks.append(link)
                yield Request(link, self.parse)

        titles = response.xpath('//a[contains(@class, "media__link")]/text()').extract()
        count = 0
        for title in titles:
            item = TutsplusItem()
            item["title"] = title
            print("Title is : %s" % title)
            yield item
Instead of using scrapy runspider Crawler.py arg1 arg2,
I would like to have a separate class with a main function and initiate Scrapy from there. How can I do this?
There are different ways to approach this, but I suggest the following:
Have a main.py file in the same directory that opens a new process and launches the spider with the parameters you need.
The main.py file would have something like the following:
import subprocess
scrapy_command = 'scrapy runspider {spider_name} -a param_1="{param_1}"'.format(spider_name='your_spider', param_1='your_value')
process = subprocess.Popen(scrapy_command, shell=True)
With this code, you just need to call your main file.
python main.py
Hope it helps.
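If you would rather stay in a single Python process instead of shelling out, a hedged sketch with CrawlerProcess looks roughly like this (the import path and the ifile/ofile keyword arguments are placeholders; adapt them to your spider):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.tutsplus import MySpider  # placeholder import path

def main():
    process = CrawlerProcess(get_project_settings())
    # Keyword arguments are handed to the spider, like -a on the command line.
    process.crawl(MySpider, ifile="input.txt", ofile="output.txt")
    process.start()  # blocks until the crawl finishes

if __name__ == "__main__":
    main()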
I have the following script for crawling a website recursively:
#!/usr/bin/python
import scrapy
from scrapy.selector import Selector
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner

class GivenSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/",
        # "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        # "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    rules = (Rule(LinkExtractor(allow=r'/'), callback=parse, follow=True),)

    def parse(self, response):
        select = Selector(response)
        titles = select.xpath('//a[@class="listinglink"]/text()').extract()
        print ' [*] Start crawling at %s ' % response.url
        for title in titles:
            print '\t %s' % title

#configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()
d = runner.crawl(GivenSpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()
When I invoke it:
$ python spide.py
NameError: name 'Rule' is not defined
If you go by the documentation and search for the word Rule, you'll find this:
http://doc.scrapy.org/en/0.20/topics/spiders.html?highlight=rule#crawling-rules
Since you didn't import it, it's clear that Rule isn't defined.
class scrapy.contrib.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)
So, in theory, you should be able to import the Rule class with from scrapy.contrib.spiders import Rule
Loïc Faure-Lacroix is right. But in the current version of Scrapy (1.6), you need to import Rule from scrapy.spiders like this:
from scrapy.spiders import Rule
See the documentation for more information.
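For completeness, a small sketch of how the rule is typically declared on a CrawlSpider in current Scrapy, using the spider from the question; note that the callback is passed as a string and that a CrawlSpider should not override parse itself (parse_item here is just an illustrative name):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class GivenSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = ["http://www.dmoz.org/"]

    # Follow every internal link and hand each response to parse_item.
    rules = (
        Rule(LinkExtractor(allow=r'/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        titles = response.xpath('//a[@class="listinglink"]/text()').extract()
        print(' [*] Start crawling at %s' % response.url)
        for title in titles:
            print('\t%s' % title)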
I'm trying to crawl various websites looking for particular keywords of interest and to scrape only those pages. I've written the script to run as a standalone Python script rather than with the traditional Scrapy project structure (following this example), using the CrawlSpider class. The idea is that, from a given homepage, the spider will crawl pages within that domain and only scrape links from pages which contain the keyword. I'm also trying to save a copy of the page when I find one containing the keyword. The previous version of this question related to a syntax error (see comments below, thanks @tegancp for helping me clear that up), but now, although my code runs, I am still unable to crawl links only on pages of interest as intended.
I think I want to either i) remove the call to LinkExtractor in the __init__ function, or ii) only call LinkExtractor from within __init__ but with a rule based on what I find when I visit a page rather than some attribute of the URL. I can't do i) because the CrawlSpider class wants a rule, and I can't do ii) because LinkExtractor doesn't have a process_links option like the old SgmlLinkExtractor, which appears to be deprecated. I'm new to Scrapy, so I'm wondering whether my only option is to write my own LinkExtractor (see the process_links sketch after the script below).
import sys

from scrapy.crawler import Crawler
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose, TakeFirst
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy import log, signals, Spider, Item, Field
from scrapy.settings import Settings
from twisted.internet import reactor

# define an item class
class GenItem(Item):
    url = Field()

# define a spider
class GenSpider(CrawlSpider):
    name = "genspider3"

    # requires 'start_url', 'allowed_domains' and 'folderpath' to be passed as string arguments IN THIS PARTICULAR ORDER!!!
    def __init__(self):
        self.start_urls = [sys.argv[1]]
        self.allowed_domains = [sys.argv[2]]
        self.folder = sys.argv[3]
        self.writefile1 = self.folder + 'hotlinks.txt'
        self.writefile2 = self.folder + 'pages.txt'
        self.rules = [Rule(LinkExtractor(allow_domains=(sys.argv[2],)), follow=True, callback='parse_links')]
        super(GenSpider, self).__init__()

    def parse_start_url(self, response):
        # get list of links on start_url page and process using parse_links
        list(self.parse_links(response))

    def parse_links(self, response):
        # if this page contains a word of interest save the HTML to file and crawl the links on this page
        theHTML = response.body
        if 'keyword' in theHTML:
            with open(self.writefile2, 'a+') as f2:
                f2.write(theHTML + '\n')
            with open(self.writefile1, 'a+') as f1:
                f1.write(response.url + '\n')
            for link in LinkExtractor(allow_domains=(sys.argv[2],)).extract_links(response):
                linkitem = GenItem()
                linkitem['url'] = link.url
                log.msg(link.url)
                with open(self.writefile1, 'a+') as f1:
                    f1.write(link.url + '\n')
            return linkitem

# callback fired when the spider is closed
def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()  # collect/log stats?
    # stop the reactor
    reactor.stop()

# instantiate settings and provide a custom configuration
settings = Settings()
#settings.set('DEPTH_LIMIT', 2)
settings.set('DOWNLOAD_DELAY', 0.25)

# instantiate a crawler passing in settings
crawler = Crawler(settings)

# instantiate a spider
spider = GenSpider()

# configure signals
crawler.signals.connect(callback, signal=signals.spider_closed)

# configure and start the crawler
crawler.configure()
crawler.crawl(spider)
crawler.start()

# start logging
log.start(loglevel=log.DEBUG)

# start the reactor (blocks execution)
reactor.run()
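One note on option ii): process_links is accepted by Rule rather than by the link extractor, so a URL-level filter can still be plugged in without writing a custom LinkExtractor. Below is a hedged sketch with a made-up filter_links method; it only sees the extracted URLs, not the page content, so content-based decisions still belong in the callback:

from scrapy.contrib.linkextractors import LinkExtractor  # scrapy.linkextractors on newer versions
from scrapy.contrib.spiders import CrawlSpider, Rule      # scrapy.spiders on newer versions

class FilteredSpider(CrawlSpider):
    """Illustration only: follow only links whose URL passes filter_links."""
    name = "filteredspider"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    rules = (
        Rule(
            LinkExtractor(allow_domains=("example.com",)),
            follow=True,
            callback="parse_links",
            # process_links receives the list of extracted Link objects and
            # returns the subset that Scrapy should actually follow.
            process_links="filter_links",
        ),
    )

    def filter_links(self, links):
        return [link for link in links if "keyword" in link.url]

    def parse_links(self, response):
        self.log("visited %s" % response.url)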
I've made a lot of headway with this spider - I'm just growing accustomed to coding and am enjoying every minute of it. However, as I'm learning, the majority of my programming is problem solving. Here's my current error:
My spider shows all of the data I want in the terminal window, but when I go to output it, nothing shows up. Here is my code.
import re
import json
from urlparse import urlparse
from scrapy.selector import Selector
try:
    from scrapy.spider import Spider
except:
    from scrapy.spider import BaseSpider as Spider
from scrapy.utils.response import get_base_url
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from database.items import databaseItem
from scrapy.log import *

class CommonSpider(CrawlSpider):
    name = 'fenders.py'
    allowed_domains = ['usedprice.com']
    start_urls = ['http://www.usedprice.com/items/guitars-musical-instruments/fender/?ob=model_asc#results']
    rules = (
        Rule(LinkExtractor(allow=()), callback='parse_item'),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = []
        data = hxs.select('//tr[@class="oddItemColor baseText"]')
        tmpNextPage = hxs.select('//div[@class="baseText blue"]/span[@id="pnLink"]/a/@href').extract()
        for attr in data:
            #item = RowItem()
            instrInfo = attr.select('//td[@class="itemResult"]/text()').extract()
            print "Instrument Info: ", instrInfo
            yield instrInfo
As JoeLinux said, you're yielding a string instead of returning the item. If you're mostly working off the tutorial, you probably have an "items.py" file someplace (maybe under some other name) where your item is defined - it would appear that it's called "RowItem()". There you've got several fields, or maybe just one.
What you need to do is figure out how you want to store the data in the item. So, making a gross assumption, you probably want RowItem() to include a field called instrInfo. So your items.py file might include something like this:
class RowItem(scrapy.Item):
instrInfo = scrapy.Field()
Then your spider should include something like:
item = RowItem()
item['instrInfo'] = []
data = hxs.select('//tr[@class="oddItemColor baseText"]')
for attr in data:
    instrInfo = attr.select('//td[@class="itemResult"]/text()').extract()
    item['instrInfo'].append(instrInfo)
return item
This will send the item off to your pipeline for processing.
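To make that concrete, a bare-bones pipeline sketch (file and class names are only examples) could collect every item the spider yields; it would be enabled with something like ITEM_PIPELINES = {'database.pipelines.RowItemPipeline': 300} in settings.py, adjusted to your project path:

# pipelines.py (example only)
class RowItemPipeline(object):
    """Collects the instrInfo field of every RowItem the spider yields."""

    def open_spider(self, spider):
        self.rows = []

    def process_item(self, item, spider):
        self.rows.append(item['instrInfo'])
        return item  # always return the item so later pipelines still see it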
As I said, some gross assumptions about what you're trying to do, and the format of your information, but hopefully this gets you started.
Separately, the print function probably isn't necessary. When the item is returned, it's displayed in the terminal (or log) as the spider runs.
Good luck!
What is the way to view the data returned by the spider's parse function when I execute a script like this?
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
crawler.stats
#log.start()
reactor.run()
I disabled the log to view the print messages in the spiders, but with the log enabled the returned data doesn't show up either.
The spider's parse function returns a simple string.
How do I get this data? I tried printing the result of reactor.run(), but it is always None.
This is the way I have found to get the collected items:
items = []
def add_item(item):
    items.append(item)
crawler.signals.connect(add_item, signals.item_passed)
I gave my original answer, with a bit more detail, in this linked question:
https://stackoverflow.com/a/23892650/2730032
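Concretely, with the script from the question, that means connecting the handler before crawler.start(). A sketch against the same old Crawler API (and assuming parse() yields items rather than plain strings, since item_passed only fires for items):

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import signals
from testspiders.spiders.followall import FollowAllSpider

items = []

def add_item(item):
    items.append(item)

spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.signals.connect(add_item, signals.item_passed)  # collect each scraped item
crawler.configure()
crawler.crawl(spider)
crawler.start()
reactor.run()  # blocks until spider_closed fires

print items  # everything parse() produced, as a list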
If you want to see the logging on the screen, change this line in your script:
#log.start()
to this:
log.start(loglevel=log.DEBUG)
See this question