I am a Python/Scrapy newbie. I am trying to scrape a website for practice: I want to pull all the companies that are active and save them to a CSV file. My code is pasted below. I added an if statement, but it doesn't seem to be working, and I am not sure what I am doing wrong.
Also, judging by its output, I think the spider is crawling the website multiple times. I only want it to crawl the site once every time I run it.
Just an FYI: I did search Stack Overflow for the answer and found a few solutions, but I couldn't get any of them to work. I guess this is part of being a rookie.
from scrapy.spider import Spider
from scrapy.selector import Selector
from bizzy.items import BizzyItem

class SunSpider(Spider):
    name = "Sun"
    allowed_domains = ['sunbiz.org']
    start_urls = [
        'http://search.sunbiz.org/Inquiry/CorporationSearch/SearchResults/EntityName/a/Page1'
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tbody/tr')
        items = []
        for site in sites:
            item = BizzyItem()
            item["company"] = sel.xpath('//td[1]/a/text()').extract()
            item["status"] = sel.xpath('//td[3]/text()').extract()
            if item["status"] != 'Active':
                pass
            else:
                items.append(item)
        return items
Crawling Multiple Times?
I've had time now to read over your code and glance at the source code for the site you are trying to scrape. First of all, I can tell you from my admittedly limited experience with Scrapy that your spider is not crawling the website multiple times. What you are experiencing is simply the nightmarish wall of debugging output the scrapy devs decided it was a good idea to spew by default. :)
It's actually very useful information if you read through it, and once you learn to spot patterns you can almost read it as it whizzes by. I believe they properly use stderr, so in a Unix-y environment you can always silence it with scrapy crawl myspider -o output.json -t json 2>/dev/null (IIRC).
Mysterious if Statement
Because of the nature of extract operating over selectors that might well return multiple elements, it returns a list. If you were to print your result, even though in the xpath you selected down to text(), you would find it looked like this:
[u'string'] # Note the brackets
#^ no little u if you are running this with Python 3.x
You want the first element (only member) of that list, [0]. Fortunately, you can add it right to the method chain you have already constructed for extract:
item["company"] = sel.xpath('//td[1]/a/text()').extract()[0]
item["status"] = sel.xpath('//td[3]/text()').extract()[0]
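Then (assuming your xpath is correct - I didn't check it), your conditional should behave as expected. (A list of any size never equals a string, so item["status"] != 'Active' was always true and you always hit pass.) One thing to watch: inside the loop both queries run on sel with an absolute // path, so every iteration searches the whole page and [0] returns the same first row each time. Querying the current row with a relative path, e.g. site.xpath('td[1]/a/text()'), avoids that.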
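To make the list-versus-string point concrete, here is a minimal sketch in plain Python (no Scrapy needed). The list literal stands in for what extract() returns, and the helper first_text is a made-up name for illustration, not part of Scrapy.

```python
# Stand-in for what .extract() returns: always a list, even for one match.
extracted = ['Active']

# A list never equals a string, so the first test is always True and the
# original code always took the `pass` branch.
print(extracted != 'Active')      # True
print(extracted[0] != 'Active')   # False


def first_text(values, default=None):
    """Return the first extracted string, stripped, or `default` if the list is empty.

    Hypothetical helper: it avoids the IndexError that [0] raises when
    the XPath matched nothing.
    """
    return values[0].strip() if values else default


print(first_text(['  Active ']))  # Active
print(first_text([]))             # None
```

Stripping the value is an assumption on my part: table cells often carry surrounding whitespace, which would also make a comparison with 'Active' fail.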
Related
I am using a scrapy spider for URL crawling for my research project. My spider is based on the code from bhattraideb (Scrapy follow all the links and get status) and slightly edited to fit my needs better.
At the moment I’m restarting the spider every time when changing the allowed domain and start URL since I need the output for each allowed domain in a separate file. Since my list of URLs is growing this is getting very tedious to do...
I tried to iterate over a .csv importing both columns with the allowed_domains and start_urls as list using "i" and "while", however it always clashes with the classes.
I'd appreciate any help :-)
See: How to loop through multiple URLs to scrape from a CSV file in Scrapy?
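As a rough sketch of the CSV side of this (standard library only, no spider code), the domain/URL pairs can be read once and turned into one output file name per domain. The column names allowed_domain and start_url, the sample rows, and the helper load_targets are all assumptions for illustration.

```python
import csv
import io

# Hypothetical CSV content; the column names are assumptions.
CSV_TEXT = """allowed_domain,start_url
example.com,https://example.com/
example.org,https://example.org/start
"""


def load_targets(csv_file):
    """Read (allowed_domain, start_url, output_file) tuples from an open CSV file."""
    targets = []
    for row in csv.DictReader(csv_file):
        domain = row['allowed_domain'].strip()
        url = row['start_url'].strip()
        # One output file per domain, named after the domain.
        targets.append((domain, url, domain.replace('.', '_') + '.csv'))
    return targets


for domain, url, out_file in load_targets(io.StringIO(CSV_TEXT)):
    print(domain, url, out_file)
```

Each tuple could then drive one crawl, for example by passing the values to the spider as arguments (scrapy crawl accepts -a NAME=VALUE) together with -o for the output file.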
For info: when not using a CSV, you can also do something like this, building start_urls from a base URL and a page number:
# https://www.food.com/recipe/all/healthy?pn=1
list_url = 'https://www.food.com/recipe/all/healthy?pn='
page = 1
start_urls = [list_url + str(page)]
Then increment the page variable to request the next page, until there is no next page (the next-page lookup returns None).
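The increment-until-done idea can be sketched without Scrapy as a small generator. Here has_next_page is a hypothetical callable standing in for whatever check the spider performs on the response (for instance, whether a next-page link was found).

```python
from itertools import count

LIST_URL = 'https://www.food.com/recipe/all/healthy?pn='


def iter_page_urls(has_next_page, start=1):
    """Yield page URLs, stopping after the first page that has no successor."""
    for page in count(start):
        yield LIST_URL + str(page)
        if not has_next_page(page):
            break


# Pretend the site has exactly three pages.
urls = list(iter_page_urls(lambda page: page < 3))
print(urls)
```

In a real spider the same decision would be made inside the callback: yield a request for page + 1 only when the current response shows that a next page exists.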
First I grabbed all the coin links from the website and sent requests to those links.
But Scrapy doesn't request the links serially, in the order of the link list. After requesting those links, the data is scraped successfully, but when it is saved to a CSV file there is a blank row after every successfully scraped item.
I expect it to request the links serially, in list order, and not to produce any blank rows. How can I do that?
I am using Python 3.6 and Scrapy version 1.5.1.
My code:
import scrapy

class MarketSpider(scrapy.Spider):
    name = 'market'
    allowed_domains = ['coinmarketcap.com']
    start_urls = ['http://coinmarketcap.com/']

    def parse(self, response):
        Coin = response.xpath('//*[@class="currency-name-container link-secondary"]/@href').extract()
        for link in Coin:
            absolute_url = response.urljoin(link)
            yield scrapy.Request(absolute_url, callback=self.website_link)

    def website_link(self, response):
        link = response.xpath('//*[@class="list-unstyled details-panel-item--links"]/li[2]/a/@href').extract()
        name = response.xpath('normalize-space(//h1)').extract()
        yield {'Name': name, 'Link': link}
I think Scrapy is visiting pages in a multi-threaded (producer/consumer) fashion. This could explain the non-sequential order of your results.
To verify this hypothesis, you could change your config to use a single thread.
For the blank rows, are you sure none of your name or link values contains a \n?
Scrapy is an asynchronous framework - multiple requests are executed concurrently and the responses are parsed as they are received.
The only way to reliably control which responses are parsed first is to turn this feature off, e.g. by setting CONCURRENT_REQUESTS to 1.
This would make your spider less efficient though, and this kind of control of the parse order is rarely necessary, so I would avoid it if possible.
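For reference, a minimal sketch of that setting as it might appear in a project's settings.py. CONCURRENT_REQUESTS is the setting named above; putting it in settings.py is just one option, and a spider's custom_settings dict is another.

```python
# settings.py
# Process one request at a time, so responses are handled one by one.
# This trades crawl speed for a predictable order.
CONCURRENT_REQUESTS = 1
```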
The extra newlines in CSV exports on Windows are a known issue and will be fixed in the next Scrapy release.
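Until then, one possible workaround is to clean the exported file afterwards. The sketch below uses only the standard csv module. The function name drop_blank_rows and the sample data are made up for illustration, and this is a post-processing step, not something Scrapy does for you.

```python
import csv
import io


def drop_blank_rows(csv_text):
    """Return the CSV text with completely empty rows removed."""
    out = io.StringIO(newline='')
    writer = csv.writer(out, lineterminator='\n')
    for row in csv.reader(io.StringIO(csv_text)):
        if row:  # csv.reader yields [] for a blank line
            writer.writerow(row)
    return out.getvalue()


sample = "Name,Link\n\nBitcoin,https://bitcoin.org/\n\n"
print(drop_blank_rows(sample))
```

For a file on disk the same loop applies; opening the files with newline='' is what the csv module documentation recommends, so that the module handles line endings itself.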
I've written a spider to crawl the boardgamegeek.com/browse/boardgame site for information regarding boardgames in the list.
My problem is that two specific selectors in my code do not always get a result: sometimes a selector object comes back, other times it doesn't. Inspecting the response while debugging shows that, in those cases, the dynamically loaded elements don't exist in the returned HTML.
My two offending lines
bggspider.py
bg['txt_cnt'] = response.xpath(
    selector_paths.SEL_TXT_REVIEWS).extract_first()
bg['vid_cnt'] = response.xpath(
    selector_paths.SEL_VID_REVIEWS).extract_first()
Where the selectors are defined as
selector_paths.py
SEL_TXT_REVIEWS = '//div[@class="panel-inline-links"]/a[contains(text(), "All Text Reviews")]/text()'
SEL_VID_REVIEWS = '//div[@class="panel-inline-links"]/a[contains(text(), "All Video Reviews")]/text()'
After the bg item is yielded, its attributes are processed in the pipeline, where a check is performed because many boardgames have very little information for various parts of the page.
pipelines.py
if item['txt_cnt']:
    item['txt_cnt'] = int(re.findall(r'\d+', item['txt_cnt'])[0])
else:
    item['txt_cnt'] = 0
if item['vid_cnt']:
    item['vid_cnt'] = int(re.findall(r'\d+', item['vid_cnt'])[0])
else:
    item['vid_cnt'] = 0
The aim of the field processing is just to grab the numerical value in the string which is the number of text and video reviews for a boardgame.
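A compact way to express that field processing, as a sketch: the helper below returns 0 both when the selector produced nothing (None) and when the text contains no digits, a case in which indexing the findall result with [0] would raise an IndexError. The function name review_count and the sample strings are assumptions for illustration.

```python
import re


def review_count(text):
    """Return the first run of digits in `text` as an int, or 0 if there is none."""
    if not text:
        return 0
    match = re.search(r'\d+', text)
    return int(match.group()) if match else 0


print(review_count('All Text Reviews (12)'))  # 12
print(review_count('All Video Reviews'))      # 0
print(review_count(None))                     # 0
```

In the pipeline this would reduce each branch to a single assignment, e.g. item['txt_cnt'] = review_count(item['txt_cnt']).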
I'm assuming I'm missing something that has to do with Splash since I'm getting selector items for some/most queries but still missing many.
I am running the ScrapySplash docker container locally, localhost:8050.
Code for the spider can be found here. BGGSpider on Github
Any help or information about how to remedy this problem or how ScrapySplash works would be appreciated.
I am fairly new to Python, Scrapy and this board, so please bear with me, as I try to illustrate my problem.
My goal is to collect the names (and possibly prices) of all available hotels in Berlin on booking.com for a specific date (see for example the predefined start_url) with the help of Scrapy.
I think the crucial parts are:
1. I want to paginate through all next pages until the end.
2. On each page I want to collect the name of every hotel, and each name should be saved separately.
If I run "scrapy runspider bookingspider.py -o items.csv -t csv" for my code below, the terminal shows me that it crawls through all available pages, but in the end I only get an empty items.csv.
Step 1 seems to work, as the terminal shows succeeding urls are being crawled (e.g. [...]offset=15, then [...]offset=30). Therefore I think my problem is step 2.
For step 2 one needs to define a container or block that holds the information for each hotel separately and can serve as the basis for a loop, right?
I picked div class="sr_item_content sr_item_content_slider_wrapper", since every hotel block has this element at a superordinate level, but I am really unsure about this part. Maybe one has to consider a higher level (but which element should I take, since they are not the same across the hotel blocks?).
Anyway, based on that I figured out the remaining XPath to the element, which contains the hotel name.
I followed two tutorials with similar settings (though different websites), but somehow it does not work here.
Maybe you have an idea; any help is very much appreciated. Thank you!
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.item import Item, Field
from scrapy.http.request import Request

class HotelItem(Item):
    title = Field()
    price = Field()

class BookingCrawler(CrawlSpider):
    name = 'booking_crawler'
    allowed_domains = ['booking.com']
    start_urls = ['http://www.booking.com/searchresults.html?checkin_monthday=25;checkin_year_month=2016-10;checkout_monthday=26;checkout_year_month=2016-10;class_interval=1;dest_id=-1746443;dest_type=city;offset=0;sb_travel_purpose=leisure;si=ai%2Cco%2Cci%2Cre%2Cdi;src=index;ss=Berlin']
    custom_settings = {
        'BOT_NAME': 'booking-scraper',
    }

    def parse(self, response):
        s = Selector(response)
        index_pages = s.xpath('//div[@class="results-paging"]/a/@href').extract()
        if index_pages:
            for page in index_pages:
                yield Request(response.urljoin(page), self.parse)
        hotels = s.xpath('//div[@class="sr_item_content sr_item_content_slider_wrapper"]')
        items = []
        for hotel in hotels:
            item = HotelItem()
            item['title'] = hotel.xpath('div[1]/div[1]/h3/a/span/text()').extract()[0]
            item['price'] = hotel.xpath('//div[@class="sr-prc--num sr-prc--final"]/text()').extract()[0]
            items.append(item)
        for item in items:
            yield item
I think the problem may be with your XPath on this line:
hotels = s.xpath('//div[@class="sr_item_content sr_item_content_slider_wrapper"]')
From this SO question it looks like you need to define something more along the lines of:
//div[contains(@class, 'sr_item_content') and contains(@class, 'sr_item_content_slider_wrapper')]
To help you debug further, you could try outputting the contents of index_pages first to see if it is definitely returning what you expect on that level.
Also, check XPath Visualiser (also mentioned in the question), which can help with building XPath.
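The difference between comparing the whole class attribute and checking for individual class names can be tried out with the standard library alone. In the sketch below xml.etree.ElementTree stands in for Scrapy's selectors (it supports only a subset of XPath and has no contains()), the markup is made up, and has_classes is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up markup: the second block carries an extra class name.
PAGE = """
<html><body>
  <div class="sr_item_content sr_item_content_slider_wrapper"><h3>Hotel A</h3></div>
  <div class="sr_item_content sr_item_content_slider_wrapper featured"><h3>Hotel B</h3></div>
</body></html>
"""

root = ET.fromstring(PAGE)

# Exact comparison of the whole attribute value: misses "Hotel B".
exact = root.findall(".//div[@class='sr_item_content sr_item_content_slider_wrapper']")


def has_classes(element, *wanted):
    """True if the element's class attribute contains every wanted class name."""
    tokens = element.get('class', '').split()
    return all(name in tokens for name in wanted)


by_token = [div for div in root.iter('div')
            if has_classes(div, 'sr_item_content', 'sr_item_content_slider_wrapper')]

print([div.find('h3').text for div in exact])     # ['Hotel A']
print([div.find('h3').text for div in by_token])  # ['Hotel A', 'Hotel B']
```

Note that contains(@class, ...) in XPath is a substring test, so it is slightly looser than the whole-name check in has_classes; for these class names the result is the same.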
I am trying to write a spider that crawls certain pages based on the data on the index page and then stores the result in a database.
For example, let's say I would like to crawl stackoverflow.com/questions/tagged/scrapy.
I would go through the index page. If a question is not in my database, I would store its answer count in the database, then follow the link to the question and crawl that page.
If the question is already in the database but the number of answers is greater than the one in the database: crawl that page again.
If the question is already in the database and the answer count is the same: skip this question.
At the moment I can get all the links and answer counts (in this example) on the index page,
but I don't know how to make the spider follow the link to the question page based on the answer count.
Is there a way to do this with one spider instead of two, where one spider gets all the links on the index page, compares the data with the database, exports a JSON or CSV file, and then passes it to another spider that crawls the question pages?
Just use BaseSpider (named scrapy.Spider in later Scrapy releases). That way you can make all the logic depend on the content you are scraping. I personally prefer BaseSpider since it gives you a lot more control over the scraping process.
The spider should look something like this (this is closer to pseudocode than working code):
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
from myproject.items import MyItem

class StackOverflow(BaseSpider):
    name = 'stackoverflow.com'
    allowed_domains = ['stackoverflow.com']
    start_urls = ['http://stackoverflow.com/questions']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # placeholder XPaths: you'll have to write the xpaths and db logic yourself
        for question in hxs.select('//question-xpath'):
            # extract() returns a list of strings; the URL is assumed to be absolute
            question_url = question.select('./question-url').extract()[0]
            answer_count = int(question.select('./answer-count-xpath').extract()[0])
            # get_db_answer_count is a helper you write yourself;
            # follow the link only if the stored count differs
            if get_db_answer_count(question_url) != answer_count:
                yield Request(question_url, callback=self.parse_question)

    def parse_question(self, response):
        # insert the question and its answers into the db here
        pass
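The three cases from the question (new question, more answers than stored, same count) boil down to a small decision function that the parse callback could call before yielding a request. This is a sketch under assumptions: should_crawl is a made-up name, and the stored count is taken to be None when the question is not yet in the database.

```python
def should_crawl(stored_count, current_count):
    """Decide whether a question page needs to be (re)crawled.

    stored_count: answer count saved in the database, or None if the
                  question has not been seen before (assumed convention).
    current_count: answer count shown on the index page.
    """
    if stored_count is None:
        return True                      # new question: crawl it
    return current_count > stored_count  # crawl again only if answers were added


print(should_crawl(None, 2))  # True
print(should_crawl(2, 5))     # True
print(should_crawl(5, 5))     # False
```

Comparing with > follows the question's wording; the != test in the spider above would also re-crawl when the count drops, for instance after an answer was deleted.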
This is what the CrawlSpider and its Rules do (be sure to check out the example). You could first get the information from the index page (though your approach of counting answers is somewhat flawed: what if a user deleted a post and a new one was added?) and then decide for each sub page whether you want to get its information or not.
Put simply: use the spider on the index pages and follow the links to the questions. For each question, check whether you want to get its information or drop/ignore it.