I've made a lot of headway with this spider. I'm just growing accustomed to coding and am enjoying every minute of it, but as I'm learning, the majority of my programming is problem solving. Here's my current error:
My spider shows all of the data I want in the terminal window, but when I go to output, nothing shows up. Here is my code.
import re
import json
from urlparse import urlparse
from scrapy.selector import Selector
try:
    from scrapy.spider import Spider
except ImportError:
    from scrapy.spider import BaseSpider as Spider
from scrapy.utils.response import get_base_url
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from database.items import databaseItem
from scrapy.log import *

class CommonSpider(CrawlSpider):
    name = 'fenders.py'
    allowed_domains = ['usedprice.com']
    start_urls = ['http://www.usedprice.com/items/guitars-musical-instruments/fender/?ob=model_asc#results']
    rules = (
        Rule(LinkExtractor(allow=()), callback='parse_item'),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = []
        data = hxs.select('//tr[@class="oddItemColor baseText"]')
        tmpNextPage = hxs.select('//div[@class="baseText blue"]/span[@id="pnLink"]/a/@href').extract()
        for attr in data:
            #item = RowItem()
            instrInfo = attr.select('//td[@class="itemResult"]/text()').extract()
            print "Instrument Info: ", instrInfo
            yield instrInfo
As JoeLinux said, you're yielding a string instead of returning the item. If you're mostly working off the tutorial, you probably have an "items.py" file someplace (maybe under some other name) where your item is defined; it would appear that it's called RowItem(). There you've got several fields, or maybe just one.
What you need to do is figure out how you want to store the data in the item. So, making a gross assumption, you probably want RowItem() to include a field called instrInfo. So your items.py file might include something like this:
class RowItem(scrapy.Item):
    instrInfo = scrapy.Field()
Then your spider should include something like:
item = RowItem()
item['instrInfo'] = []
data = hxs.select('//tr[@class="oddItemColor baseText"]')
for attr in data:
    # relative XPath (.//) so each row contributes only its own cells
    instrInfo = attr.select('.//td[@class="itemResult"]/text()').extract()
    item['instrInfo'].extend(instrInfo)
return item
This will send the item off to your pipeline for processing.
As I said, some gross assumptions about what you're trying to do, and the format of your information, but hopefully this gets you started.
Separately, the print function probably isn't necessary. When the item is returned, it's displayed in the terminal (or log) as the spider runs.
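And if by "output" you mean a file, Scrapy's built-in feed exports will write the returned items directly, with no extra pipeline code. A minimal example, assuming you run it from the project directory (your spider's name attribute is 'fenders.py', so that is what goes after crawl):
scrapy crawl fenders.py -o items.json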
Good luck!
Related
I have a scraper that works correctly and can get it into a CSV file easily, but it always returns the values in a weird order.
I checked to make sure the items.py fields were in the right order, and tried moving around the fields in the spider, but I can't figure out why it's yielding them in a weird way.
import scrapy
from scrapy.spiders import CrawlSpider
from scrapy import Selector
from scrapy.loader import ItemLoader
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from sofifa_scraper.items import Player

class FifaInfoScraper(scrapy.Spider):
    name = "player2_scraper"
    start_urls = ["https://www.futhead.com/19/players/?level=all_nif&bin_platform=ps"]

    def parse(self, response):
        for href in response.css("li.list-group-item > div.content > a::attr(href)"):
            yield response.follow(href, callback=self.parse_name)

    def parse_name(self, response):
        item = Player()
        item['name'] = response.css("div[itemprop = 'child'] > span[itemprop = 'title']::text").get()  # get player name
        club_league_nation = response.css("div.col-xs-5 > a::text").getall()  # club, league, nation are all stored under the same selectors, so pull them all at once
        item['club'], item['league'], item['nation'] = club_league_nation  # split the selected info from club_league_nation into 3 separate categories
        yield item
I'd like the scraper to return the player name in the first column, and am not too concerned with the order after that. The player name always ends up in another column, though, and this happens even when I'm only pulling the name and one other value.
Just add FEED_EXPORT_FIELDS to your settings.py (see the documentation):
FEED_EXPORT_FIELDS = ["name", "club", "league", "nation"]
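With that setting in place, the columns should come out in the declared order when you export, for example with the usual -o feed export:
scrapy crawl player2_scraper -o players.csv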
I am new to Scrapy and am attempting to teach myself the basics. I have compiled a code that goes to the Louisiana Department of Natural Resources website to retrieve the serial number for certain oil wells.
I have each well's link listed in start_urls, but Scrapy only downloads data from the first URL. What am I doing wrong?
import scrapy
from scrapy import Spider
from scrapy.selector import Selector
from mike.items import MikeItem

class SonrisSpider(Spider):
    name = "sspider"
    start_urls = [
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=207899",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=971683",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=214206",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=159420",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=243671",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248942",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=156613",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=972498",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=215443",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248463",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=195136",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=179181",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=199930",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=203419",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=220454",
    ]

    def parse(self, response):
        item = MikeItem()
        item['serial'] = response.xpath('/html/body/table[1]/tr[2]/td[1]/text()').extract()[0]
        yield item
Thank you for any help you might be able to provide. If I have not explained my problem thoroughly, please let me know and I will attempt to clarify.
I think this code might help.
By default Scrapy prevents duplicate requests. Since only the query parameters differ between your start URLs, Scrapy will consider the rest of the URLs in start_urls to be duplicates of the first one. That's why your spider stops after fetching the first URL. In order to parse the rest of the URLs, we have to enable the dont_filter flag on the Scrapy Request (check start_requests() below).
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from mike.items import MikeItem

class SonrisSpider(scrapy.Spider):
    name = "sspider"
    allowed_domains = ["sonlite.dnr.state.la.us"]
    start_urls = [
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=207899",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=971683",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=214206",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=159420",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=243671",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248942",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=156613",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=972498",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=215443",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248463",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=195136",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=179181",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=199930",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=203419",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=220454",
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse_data, dont_filter=True)

    def parse_data(self, response):
        item = MikeItem()
        serial = response.xpath(
            '/html/body/table[1]/tr[2]/td[1]/text()').extract()
        serial = serial[0] if serial else 'n/a'
        item['serial'] = serial
        yield item
Sample output returned by this spider is as follows:
{'serial': u'207899'}
{'serial': u'971683'}
{'serial': u'214206'}
{'serial': u'159420'}
{'serial': u'248942'}
{'serial': u'243671'}
Your code looks good; try adding this function:
class SonrisSpider(Spider):
    def start_requests(self):
        for url in self.start_urls:
            print(url)
            yield self.make_requests_from_url(url)
    # the rest of your code goes here
The URLs should be printed now. Test it, and if it doesn't work, please say so.
I currently have a Spider-based spider that I wrote for crawling an input JSON array of start_urls:
from scrapy.spider import Spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from foo.items import AtlanticFirearmsItem
from scrapy.contrib.loader import ItemLoader
import json
import datetime
import re

class AtlanticFirearmsSpider(Spider):
    name = "atlantic_firearms"
    allowed_domains = ["atlanticfirearms.com"]

    def __init__(self, start_urls='[]', *args, **kwargs):
        super(AtlanticFirearmsSpider, self).__init__(*args, **kwargs)
        self.start_urls = json.loads(start_urls)

    def parse(self, response):
        l = ItemLoader(item=AtlanticFirearmsItem(), response=response)
        product = l.load_item()
        return product
I can call it from the command line like so, and it does a wonderful job:
scrapy crawl atlantic_firearms -a start_urls='["http://www.atlanticfirearms.com/component/virtuemart/shipping-rifles/ak-47-receiver-aam-47-detail.html", "http://www.atlanticfirearms.com/component/virtuemart/shipping-accessories/nitride-ak47-7-62x39mm-barrel-detail.html"]'
However, I'm trying to add a CrawlSpider-based spider for crawling the entire site that inherits from it and re-uses the parse method logic. My first attempt looked like this:
class AtlanticFirearmsCrawlSpider(CrawlSpider, AtlanticFirearmsSpider):
    name = "atlantic_firearms_crawler"
    start_urls = [
        "http://www.atlanticfirearms.com"
    ]
    rules = (
        # I know, I need to update these to LxmlLinkExtractor
        Rule(SgmlLinkExtractor(allow=['detail.html']), callback='parse'),
        Rule(SgmlLinkExtractor(allow=[], deny=['/bro', '/news', '/howtobuy', '/component/search', 'askquestion'])),
    )
Running this spider with
scrapy crawl atlantic_firearms_crawler
crawls the site but never parses any items. I think it's because CrawlSpider apparently has its own definition of parse, so somehow I'm screwing things up.
When I change callback='parse' to callback='parse_item' and rename the parse method in AtlanticFirearmsSpider to parse_item, it works wonderfully, crawling the whole site and parsing items successfully. But then if I try to call my original atlantic_firearms spider again, it errors out with NotImplementedError, apparently because Spider-based spiders really want one to define the parse method as parse.
What's the best way for me to re-use my logic between these spiders so that I can both feed a JSON array of start_urls as well as do full-site crawls?
You can avoid multiple inheritance here.
Combine both spiders into a single one. If start_urls is passed from the command line, it behaves like a regular spider that parses only those URLs; otherwise it behaves like a CrawlSpider:
from scrapy import Item
from scrapy.contrib.spiders import CrawlSpider, Rule
from foo.items import AtlanticFirearmsItem
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.linkextractors import LinkExtractor
import json

class AtlanticFirearmsSpider(CrawlSpider):
    name = "atlantic_firearms"
    allowed_domains = ["atlanticfirearms.com"]

    def __init__(self, start_urls=None, *args, **kwargs):
        if start_urls:
            self.start_urls = json.loads(start_urls)
            self.rules = []
            self.parse = self.parse_response
        else:
            self.start_urls = ["http://www.atlanticfirearms.com/"]
            self.rules = [
                Rule(LinkExtractor(allow=['detail.html']), callback='parse_response'),
                Rule(LinkExtractor(allow=[], deny=['/bro', '/news', '/howtobuy', '/component/search', 'askquestion']))
            ]
        super(AtlanticFirearmsSpider, self).__init__(*args, **kwargs)

    def parse_response(self, response):
        l = ItemLoader(item=AtlanticFirearmsItem(), response=response)
        product = l.load_item()
        return product
Or, alternatively, just extract the logic inside the parse() method into a library function and call it from both spiders, which would then be unrelated, separate spiders.
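A minimal sketch of that second option, assuming a hypothetical helper module foo/parsing.py (the module and function names are illustrative, not part of the original project):
# foo/parsing.py (hypothetical helper module)
from scrapy.contrib.loader import ItemLoader
from foo.items import AtlanticFirearmsItem

def load_product(response):
    # shared item-loading logic used by both spiders
    l = ItemLoader(item=AtlanticFirearmsItem(), response=response)
    return l.load_item()
Each spider's callback then reduces to return load_product(response), and the two spiders no longer need to inherit from each other.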
This is driving me nuts. It drove me to consolidate and simplify a lot of code, but I just can't fix the problem. Here is an example of two spiders I wrote. The top one has a memory leak that causes the memory to slowly expand until it's full.
They are almost identical, and they use the same items and everything else outside of the spider, so I do not think there is anything in the rest of my code to blame. I have also isolated bits of code here and there and tried deleting variables towards the end. I've looked over the Scrapy docs and I am still stumped. Does anyone have any magic to offer?
import scrapy
from wordscrape.items import WordScrapeItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
import json

class EnglishWikiSpider(CrawlSpider):
    name = 'englishwiki'
    allowed_domains = ['en.wikipedia.org']
    start_urls = [
        'http://en.wikipedia.org/wiki/'
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=('/wiki/', )), callback='parse_it', follow=True),
    )

    def parse_it(self, response):
        the_item = WordScrapeItem()
        # This takes all the text that is in that div and extracts it, only the text, not html tags (see: //text())
        # if it meets the conditions of my regex
        english_text = response.xpath('//*[@id="mw-content-text"]//text()').re(ur'[a-zA-Z\'-]+')
        english_dict = {}
        for i in english_text:
            if len(i) > 1:
                word = i.lower()
                if word in english_dict:
                    english_dict[word] += 1
                else:
                    english_dict[word] = 1
        # Dump into json string and put it in the word item, it will be ['word': {<<jsondict>>}, 'site' : url, ...]
        jsondump = json.dumps(english_dict)
        the_item['word'] = jsondump
        the_item['site'] = response.url
        return the_item
The second, stable spider:
import scrapy
from wordscrape.items import WordScrapeItem
import re
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
import json

class NaverNewsSpider(CrawlSpider):
    name = 'navernews'
    allowed_domains = ['news.naver.com']
    start_urls = [
        'http://news.naver.com',
        'http://news.naver.com/main/read.nhn?oid=001&sid1=102&aid=0007354749&mid=shm&cid=428288&mode=LSD&nh=20150114125510',
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=('main/read\.nhn', )), callback='parse_it', follow=True),
    )

    def parse_it(self, response):
        the_item = WordScrapeItem()
        # gets all the text from the listed div and then applies the regex to find all words in the hangul range
        hangul_syllables = response.xpath('//*[@id="articleBodyContents"]//text()').re(ur'[\uac00-\ud7af]+')
        # go through all hangul syllables found and add to the value or add the key
        hangul_dict = {}
        for i in hangul_syllables:
            if i in hangul_dict:
                hangul_dict[i] += 1
            else:
                hangul_dict[i] = 1
        jsondump = json.dumps(hangul_dict)
        the_item['word'] = jsondump
        the_item['site'] = response.url
        return the_item
I think Jepio is right in his comment. The spider is finding too many links to follow and therefore has to store them all in the interim period.
EDIT: So, the problem is that it is storing all of those links in memory instead of on disk, and that eventually fills up all my memory. The solution was to run Scrapy with a job directory, which forces the pending requests to be stored on disk where there is plenty of space.
$ scrapy crawl spider -s JOBDIR=somedirname
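The same thing can be set project-wide instead of on the command line; a minimal sketch, assuming a crawls/ directory name of your own choosing, in settings.py:
JOBDIR = 'crawls/englishwiki-1'
With a job directory configured the pending request queue lives on disk, and the crawl can also be paused and resumed later.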
Since nothing so far is working I started a new project with
python scrapy-ctl.py startproject Nu
I followed the tutorial exactly, created the folders, and wrote a new spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from Nu.items import NuItem
from urls import u

class NuSpider(CrawlSpider):
    domain_name = "wcase"
    start_urls = ['http://www.whitecase.com/aabbas/']

    names = hxs.select('//td[@class="altRow"][1]/a/@href').re('/.a\w+')
    u = names.pop()

    rules = (Rule(SgmlLinkExtractor(allow=(u, )), callback='parse_item'),)

    def parse(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        hxs = HtmlXPathSelector(response)
        item = Item()
        item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
        return item

SPIDER = NuSpider()
and when I run
C:\Python26\Scripts\Nu>python scrapy-ctl.py crawl wcase
I get
[Nu] ERROR: Could not find spider for domain: wcase
The other spiders are at least recognized by Scrapy; this one is not. What am I doing wrong?
Thanks for your help!
Please also check your version of Scrapy. The latest version uses the "name" attribute instead of "domain_name" to uniquely identify a spider.
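For example, on a recent version the declaration would look something like this (only the identifying attribute changes):
class NuSpider(CrawlSpider):
    name = "wcase"
    start_urls = ['http://www.whitecase.com/aabbas/']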
Have you included the spider in SPIDER_MODULES list in your scrapy_settings.py?
It's not written anywhere in the tutorial that you should do this, but you do have to.
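A minimal sketch of the relevant setting, assuming the project package is Nu and the spider module lives under Nu/spiders/:
SPIDER_MODULES = ['Nu.spiders']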
These two lines look like they're causing trouble:
u = names.pop()
rules = (Rule(SgmlLinkExtractor(allow=(u, )), callback='parse_item'),)
Only one rule will be followed each time the script is run. Consider creating a rule for each URL.
You haven't created a parse_item callback, which means that the rule does nothing. The only callback you've defined is parse, which changes the default behaviour of the spider.
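A minimal sketch of what that callback could look like, keeping the question's selector and assuming NuItem defines a school field:
def parse_item(self, response):
    hxs = HtmlXPathSelector(response)
    item = NuItem()
    item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
    return item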
Also, here are some things worth looking into.
CrawlSpider doesn't like having its default parse method overloaded. Search for parse_start_url in the documentation or the docstrings. You'll see that this is the preferred way to override the default parse method for your starting URLs.
NuSpider.hxs is called before it's defined.
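On the parse_start_url point, a minimal sketch of the preferred pattern, assuming the extraction code has been moved into a parse_item callback as suggested above:
class NuSpider(CrawlSpider):
    # ... name, start_urls, rules ...
    def parse_start_url(self, response):
        # CrawlSpider invokes this for the start URLs instead of parse()
        return self.parse_item(response)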
I believe you have errors there. The names = hxs... line will not work because the hxs object is not defined before it's used.
Try running python yourproject/spiders/domain.py to get syntax errors.
You are overriding the parse method, instead of implementing a new parse_item method.