How to get the number of items scraped by the Python Scrapy tool?

I am using the Python Scrapy tool to extract data from a website, and the scraping itself works. Now I want the count of items scraped from a particular website. How can I get the number of items scraped? Is there some built-in class for that in Scrapy? Any help will be appreciated. Thanks.

Based on the example here, I solved the same problem like this:
1. Write a custom web service like this to count the items scraped:
from scrapy.webservice import JsonResource
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

class ItemCountResource(JsonResource):
    ws_name = 'item_count'

    def __init__(self, crawler, spider_name=None):
        JsonResource.__init__(self, crawler)
        self.item_scraped_count = 0
        # Increment the counter every time an item is scraped
        dispatcher.connect(self.scraped, signals.item_scraped)
        self._spider_name = spider_name
        self.isLeaf = spider_name is not None

    def scraped(self):
        self.item_scraped_count += 1

    def render_GET(self, txrequest):
        return self.item_scraped_count

    def getChild(self, name, txrequest):
        # Note: the crawler is the first argument of __init__, then the spider name
        return ItemCountResource(self.crawler, name)
2. Register the service in settings.py like this:

WEBSERVICE_RESOURCES = {
    'path.to.ItemResource.ItemCountResource': 1,
}

3. Visiting http://localhost:6080/item_count will return the number of items crawled.
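If your Scrapy version no longer ships scrapy.webservice, the same count is also recorded by the built-in stats collector under the item_scraped_count key. Below is a minimal sketch of an extension that logs it when the spider closes; the class name, module path and log message are illustrative, not part of the original answer:

from scrapy import signals

class ItemCountLogger:
    """Logs the number of scraped items when the spider closes."""

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler.stats)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider):
        # 'item_scraped_count' is filled in by Scrapy's core stats extension
        count = self.stats.get_value('item_scraped_count', 0)
        spider.logger.info("Items scraped: %d", count)

Enable it in settings.py with something like EXTENSIONS = {'myproject.extensions.ItemCountLogger': 500}.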

Related

Scrapy use item and save data in a json file

I want to use a Scrapy Item, manipulate the data, and save it all in a JSON file (using the JSON file like a DB).
import json

import scrapy

# Spider Class
class Spider(scrapy.Spider):
    name = 'productpage'
    start_urls = ['https://www.productpage.com']

    def parse(self, response):
        for product in response.css('article'):
            link = product.css('a::attr(href)').get()
            id = link.split('/')[-1]
            title = product.css('a > span::attr(content)').get()
            # note: 'price' is not defined anywhere in this snippet (as posted)
            product = Product(self.name, id, title, price, '', link)
            yield scrapy.Request('{}.json'.format(link), callback=self.parse_product, meta={'product': product})
        yield scrapy.Request(url=response.url, callback=self.parse, dont_filter=True)

    def parse_product(self, response):
        product = response.meta['product']
        for size in json.loads(response.body_as_unicode()):
            product.size.append(size['name'])
        if self.storage.update(product.__dict__):
            product.send('url')
# STORAGE CLASS
class Storage:

    def __init__(self, name):
        self.name = name
        self.path = '{}.json'.format(self.name)
        self.load()  # load the JSON database (method omitted in the snippet)

    def update(self, new_item):
        # .... do things and update data ...
        return True
# Product Class
class Product:

    def __init__(self, name, id, title, size, link):
        self.name = name
        self.id = id
        self.title = title
        self.size = []
        self.link = link

    def send(self, url):
        return  # send notify...
The Spider class searches for products on the main page of start_urls, then parses each product page to also catch the sizes.
Finally it checks whether there are updates via self.storage.update(product.__dict__) and, if that returns true, sends a notification.
How can I implement Item in my code? I thought I could use it for the Product class, but I can't include the send method...
You should define the Item you want and yield it once it has been parsed.
Then run the command:
scrapy crawl [spider] -o xx.json
PS:
Scrapy supports exporting to a JSON file out of the box.
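For example, a minimal sketch of what that Item could look like for the Product data in the question (the field names mirror the question's Product class; the parse_product body below is an assumption about how the sizes would be filled in):

import json

import scrapy

class ProductItem(scrapy.Item):
    # Fields mirroring the question's Product class
    name = scrapy.Field()
    id = scrapy.Field()
    title = scrapy.Field()
    size = scrapy.Field()
    link = scrapy.Field()

In the spider you would then build a ProductItem in parse() instead of a Product, pass it along in meta as before, and yield it once the sizes are known:

    def parse_product(self, response):
        item = response.meta['product']  # the ProductItem built in parse()
        item['size'] = [s['name'] for s in json.loads(response.text)]
        yield item  # written to xx.json when running with -o xx.json

The notification/storage logic then fits more naturally in an item pipeline than on the item itself, which also addresses the send() concern.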
@Jadian's answer will get you a file with JSON in it, but not quite DB-like access to it. To do this properly from a design standpoint, I would follow the instructions below. You don't have to use Mongo either; there are plenty of other NoSQL DBs available that use JSON.
What I would recommend in this situation is that you build out the items properly using scrapy.Item() classes. Then you can json.dumps them into MongoDB. You will need to assign a PK to each item, but Mongo is basically made to be a non-relational JSON store. So you would then create an item pipeline which checks for the PK of the item; if it is found and no details have changed, raise DropItem(), otherwise update/store the new data in MongoDB. You could probably even pipe into the JSON exporter if you wanted to, but I think just dumping the Python object to JSON in Mongo is the way to go, and then Mongo will present you with JSON to work with on the front end.
I hope you understand this answer, but I think from a design standpoint this will be a much easier solution, since Mongo is basically a non-relational data store based on JSON, and you will be splitting your item pipeline logic into its own area instead of cluttering your spider with it.
I would provide a code sample, but most of mine use an ORM for a SQL DB. Mongo is actually easier to use than that...
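A rough sketch of the pipeline described above, assuming pymongo and using the product id as the PK (the database/collection names and the settings entry are placeholders):

import pymongo

from scrapy.exceptions import DropItem

class MongoProductPipeline:
    """Stores items in MongoDB and drops items whose data has not changed."""

    def open_spider(self, spider):
        self.client = pymongo.MongoClient('mongodb://localhost:27017')
        self.collection = self.client['products_db']['products']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        data = dict(item)
        existing = self.collection.find_one({'_id': data['id']})
        if existing and all(existing.get(k) == v for k, v in data.items()):
            # Nothing changed: drop the item so it goes no further
            raise DropItem('No changes for product {}'.format(data['id']))
        # Insert or update, keyed on the product id
        self.collection.replace_one({'_id': data['id']}, data, upsert=True)
        return item

Enable it with ITEM_PIPELINES = {'myproject.pipelines.MongoProductPipeline': 300} in settings.py.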

Order a json by field using scrapy

I have created a spider to scrape problems from projecteuler.net. Here I have concluded my answer to a related question with:
I launch this with the command scrapy crawl euler -o euler.json and it outputs an array of unordered JSON objects, each corresponding to a single problem: this is fine for me because I'm going to process it with JavaScript, even if I think resolving the ordering problem via Scrapy could be very simple.
But unfortunately, ordering the items that Scrapy writes to JSON (I need ascending order by the id field) does not seem to be so simple. I've studied every single component (middlewares, pipelines, exporters, signals, etc...) but none seems useful for this purpose. I've arrived at the conclusion that a solution to this problem doesn't exist in Scrapy at all (except, maybe, for a very elaborate trick), and you are forced to order things in a second phase. Do you agree, or do you have some idea? I copy the code of my scraper here.
Spider:
# -*- coding: utf-8 -*-
import scrapy
from eulerscraper.items import Problem
from scrapy.loader import ItemLoader

class EulerSpider(scrapy.Spider):
    name = 'euler'
    allowed_domains = ['projecteuler.net']
    start_urls = ["https://projecteuler.net/archives"]

    def parse(self, response):
        numpag = response.css("div.pagination a[href]::text").extract()
        maxpag = int(numpag[len(numpag) - 1])

        for href in response.css("table#problems_table a::attr(href)").extract():
            next_page = "https://projecteuler.net/" + href
            yield response.follow(next_page, self.parse_problems)

        for i in range(2, maxpag + 1):
            next_page = "https://projecteuler.net/archives;page=" + str(i)
            yield response.follow(next_page, self.parse_next)

        return [scrapy.Request("https://projecteuler.net/archives", self.parse)]

    def parse_next(self, response):
        for href in response.css("table#problems_table a::attr(href)").extract():
            next_page = "https://projecteuler.net/" + href
            yield response.follow(next_page, self.parse_problems)

    def parse_problems(self, response):
        l = ItemLoader(item=Problem(), response=response)
        l.add_css("title", "h2")
        l.add_css("id", "#problem_info")
        l.add_css("content", ".problem_content")
        yield l.load_item()
Item:
import re

import scrapy
from scrapy.loader.processors import MapCompose, Compose
from w3lib.html import remove_tags

def extract_first_number(text):
    i = re.search(r'\d+', text)
    return int(text[i.start():i.end()])

def array_to_value(element):
    return element[0]

class Problem(scrapy.Item):
    id = scrapy.Field(
        input_processor=MapCompose(remove_tags, extract_first_number),
        output_processor=Compose(array_to_value)
    )
    title = scrapy.Field(input_processor=MapCompose(remove_tags))
    content = scrapy.Field()
If I needed my output file to be sorted (I will assume you have a valid reason to want this), I'd probably write a custom exporter.
This is how Scrapy's built-in JsonItemExporter is implemented.
With a few simple changes, you can modify it to add the items to a list in export_item(), and then sort the items and write out the file in finish_exporting().
Since you're only scraping a few hundred items, the downsides of storing a list of them and not writing to a file until the crawl is done shouldn't be a problem for you.
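A rough sketch of that approach, subclassing the built-in exporter and sorting by the id field (the class name, module path and FEED_EXPORTERS entry are illustrative):

from scrapy.exporters import JsonItemExporter

class SortedJsonItemExporter(JsonItemExporter):
    """Buffers items and writes them in ascending order of 'id' at the end of the crawl."""

    def __init__(self, file, **kwargs):
        super().__init__(file, **kwargs)
        self._buffered = []

    def export_item(self, item):
        # Buffer instead of writing immediately
        self._buffered.append(item)

    def finish_exporting(self):
        for item in sorted(self._buffered, key=lambda i: i['id']):
            super().export_item(item)  # the parent handles commas and JSON encoding
        super().finish_exporting()

Register it in settings.py so that -o euler.json uses it:

FEED_EXPORTERS = {
    'json': 'eulerscraper.exporters.SortedJsonItemExporter',
}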
For now, I've found a working solution using a pipeline:
import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.list_items = []
        self.file = open('euler.json', 'w')

    def close_spider(self, spider):
        # Place every item at the slot given by its id, so the output is in ascending order
        ordered_list = [None for i in range(len(self.list_items))]
        for i in self.list_items:
            ordered_list[int(i['id']) - 1] = json.dumps(dict(i))
        # Join with commas to avoid a trailing comma, which would make the JSON invalid
        self.file.write("[\n")
        self.file.write(",\n".join(ordered_list))
        self.file.write("\n]\n")
        self.file.close()

    def process_item(self, item, spider):
        self.list_items.append(item)
        return item
Though it may be suboptimal, because the guide suggests in another example:
The purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items into a JSON file you should use the Feed exports.

Scrapy callback doesn't work at all in my script

I've written a script in Python Scrapy to parse the names and prices of different items available on a webpage. I tried to implement the logic in my script the way I've learnt so far. However, when I execute it, I get the following error. I suppose I can't make the callback method work properly. Here is the script I've tried:
The spider, named "sth.py", contains:
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http.request import Request

class SephoraSpider(CrawlSpider):
    name = "sephorasp"

    def start_requests(self):
        yield Request(url="https://www.sephora.ae/en/stores/", callback=self.parse_pages)

    def parse_pages(self, response):
        for link in response.xpath('//ul[@class="nav-primary"]//a[contains(@class,"level0")]/@href').extract():
            yield Request(url=link, callback=self.parse_inner_pages)

    def parse_inner_pages(self, response):
        for links in response.xpath('//li[contains(@class,"amshopby-cat")]/a/@href').extract():
            yield Request(url=links, callback=self.target_page)

    def target_page(self, response):
        for titles in response.xpath('//div[@class="product-info"]'):
            product = titles.xpath('.//div[contains(@class,"product-name")]/a/text()').extract_first()
            rate = titles.xpath('.//span[@class="price"]/text()').extract_first()
            yield {'Name': product, 'Price': rate}
"items.py" contains:
import scrapy

class SephoraItem(scrapy.Item):
    Name = scrapy.Field()
    Price = scrapy.Field()
Partial error looks like:
if cookie.secure and request.type != "https":
AttributeError: 'WrappedRequest' object has no attribute 'type'
Here is the total error log:
"https://www.dropbox.com/s/kguw8174ye6p3q9/output.log?dl=0"
Looks like you are running Scrapy v1.1 while the current release is v1.4. As far as I remember, there was a bug in some early 1.x version involving the WrappedRequest object used for handling cookies.
Try upgrading to v1.4:
pip install scrapy --upgrade
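If you are not sure which version is installed, you can check it first, e.g. with either of these standard commands:

scrapy version
python -c "import scrapy; print(scrapy.__version__)"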

wrong Xpath in IMDB spider scrapy

Here:
IMDB scrapy get all movie data
response.xpath("//*[@class='results']/tr/td[3]")
returns an empty list. I tried to change it to:
response.xpath("//*[contains(@class,'chart full-width')]/tbody/tr")
without success.
Any help please? Thanks.
I did not have time to go through IMDB scrapy get all movie data thoroughly, but I have got the gist of it. The problem statement is to get all the movie data from the given site. It involves two things. The first is to go through all the pages that contain the list of all the movies of that year. The second is to get the link to each movie, and from there you do your own magic.
The problem you faced is with getting the XPath for the link to each movie. This is most likely due to a change in the website structure (I did not have time to verify what the difference might be). Anyway, the following are the XPaths you would require.
FIRST:
We take the div with class nav as a landmark and find the a element with class lister-page-next next-page among its children.
response.xpath("//div[@class='nav']/div/a[@class='lister-page-next next-page']/@href").extract_first()
This gives the link for the next page, or returns None if you are at the last page (since the next-page element is not present).
SECOND:
This is the OP's original question.
# Get the list of the containers holding the title, etc.
list = response.xpath("//div[@class='lister-item-content']")
# From the containers extract the required links
paths = list.xpath("h3[@class='lister-item-header']/a/@href").extract()
Now all you need to do is loop through each of these path elements and request the page.
Thanks for your answer. I eventually used your XPath like so:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from crawler.items import MovieItem

IMDB_URL = "http://imdb.com"

class IMDBSpider(CrawlSpider):
    name = 'imdb'
    # in order to move to the next page
    rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=("//div[@class='nav']/div/a[@class='lister-page-next next-page']",)),
                  callback="parse_page", follow=True),)

    def __init__(self, start=None, end=None, *args, **kwargs):
        super(IMDBSpider, self).__init__(*args, **kwargs)
        self.start_year = int(start) if start else 1874
        self.end_year = int(end) if end else 2017

    # generate start_urls dynamically
    def start_requests(self):
        for year in range(self.start_year, self.end_year + 1):
            # movies are sorted by number of votes
            yield scrapy.Request('http://www.imdb.com/search/title?year={year},{year}&title_type=feature&sort=num_votes,desc'.format(year=year))

    def parse_page(self, response):
        content = response.xpath("//div[@class='lister-item-content']")
        paths = content.xpath("h3[@class='lister-item-header']/a/@href").extract()  # list of paths of movies in the current page

        # all movies in this page
        for path in paths:
            item = MovieItem()
            item['MainPageUrl'] = IMDB_URL + path
            request = scrapy.Request(item['MainPageUrl'], callback=self.parse_movie_details)
            request.meta['item'] = item
            yield request

    # make sure that the start_urls are parsed as well
    parse_start_url = parse_page

    def parse_movie_details(self, response):
        pass  # lots of parsing....
Run it with: scrapy crawl imdb -a start=<start-year> -a end=<end-year>

Scrapy Start_request parse

I am writing a Scrapy script to search a website and scrape the results. I need to search for items on the website and parse each URL from the search results. I started with Scrapy's start_requests, where I pass the search query and redirect to another function, parse, which retrieves the URLs from the search results. Finally I call another function, parse_item, to parse the results. I'm able to extract all the search result URLs, but I'm not able to parse the results (parse_item is not working). Here is the code:
# -*- coding: utf-8 -*-
from scrapy.http.request import Request
from scrapy.spider import BaseSpider

class xyzspider(BaseSpider):
    name = 'dspider'
    allowed_domains = ["www.example.com"]
    mylist = ['Search item 1', 'Search item 2']
    url = 'https://example.com/search?q='

    def start_requests(self):
        for i in self.mylist:
            i = i.replace(' ', '+')
            starturl = self.url + i
            yield Request(starturl, self.parse)

    def parse(self, response):
        itemurl = response.xpath(".//section[contains(@class, 'search-results')]/a/@href").extract()
        for j in itemurl:
            print j
            yield Request(j, self.parse_item)

    def parse_item(self, response):
        print "hello"
        '''rating = response.xpath(".//ul(@class = 'ratings')/li[1]/span[1]/text()").extract()
        print rating'''
Could anyone please help me? Thank you.
I was getting a "Filtered offsite request" error. I changed the allowed domain from allowed_domains = ['www.xyz.com'] to allowed_domains = ['xyz.com'] and it worked perfectly.
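In other words (a minimal illustration using the placeholder domain from the question):

# Before: only URLs on the www host are considered on-site
# allowed_domains = ["www.xyz.com"]

# After: the bare registered domain also covers www.xyz.com and other subdomains
allowed_domains = ["xyz.com"]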
Your code looks good, so you might need to use the Request attribute dont_filter set to True:
yield Request(j, self.parse_item, dont_filter=True)
From the docs:
dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.
Anyway, I recommend you have a look at Item Pipelines.
They process the scraped items that your spider hands over with:
yield my_object
Item pipelines post-process everything yielded by the spider.
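For illustration, a minimal pipeline sketch (the class name and settings entry are placeholders); process_item() receives every item the spider yields:

class LogItemsPipeline:
    """Post-processes every item yielded by the spider."""

    def process_item(self, item, spider):
        # Clean, validate or store the item here
        spider.logger.info("Scraped: %r", item)
        return item

Enable it with ITEM_PIPELINES = {'myproject.pipelines.LogItemsPipeline': 300} in settings.py.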
