Scrapy: use Item and save data in a JSON file - Python

I want to use a Scrapy Item, manipulate the data, and save everything in a JSON file (using the JSON file like a DB).
import json

import scrapy

# Spider Class
class Spider(scrapy.Spider):
    name = 'productpage'
    start_urls = ['https://www.productpage.com']

    def parse(self, response):
        for product in response.css('article'):
            link = product.css('a::attr(href)').get()
            id = link.split('/')[-1]
            title = product.css('a > span::attr(content)').get()
            product = Product(self.name, id, title, '', link)
            yield scrapy.Request('{}.json'.format(link), callback=self.parse_product, meta={'product': product})

        yield scrapy.Request(url=response.url, callback=self.parse, dont_filter=True)

    def parse_product(self, response):
        product = response.meta['product']
        for size in json.loads(response.body_as_unicode()):
            product.size.append(size['name'])

        # self.storage is expected to hold a Storage instance (see below)
        if self.storage.update(product.__dict__):
            product.send('url')
# Storage Class
class Storage:
    def __init__(self, name):
        self.name = name
        self.path = '{}.json'.format(self.name)
        self.load()  # load the JSON "database"

    def update(self, new_item):
        # ... do things and update data ...
        return True
# Product Class
class Product:
    def __init__(self, name, id, title, size, link):
        self.name = name
        self.id = id
        self.title = title
        self.size = []
        self.link = link

    def send(self, url):
        return  # send notification...
The Spider class searches for products on the main page of start_urls, then parses each product page to also capture the sizes.
Finally it checks whether there are updates via self.storage.update(product.__dict__) and, if that returns True, sends a notification.
How can I implement Item in my code? I thought I could build it into the Product class, but then I can't include the send method...

You should define the Item you want and yield it after parsing.
Then run the command:
scrapy crawl [spider] -o xx.json
PS:
Scrapy supports exporting to a JSON file by default.
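A minimal sketch of what that could look like, reusing the fields from the Product class above (the ProductItem name and the reworked parse_product are illustrative, not from the original post):

# items.py -- sketch; field names taken from the Product class above
import scrapy

class ProductItem(scrapy.Item):
    name = scrapy.Field()
    id = scrapy.Field()
    title = scrapy.Field()
    size = scrapy.Field()
    link = scrapy.Field()

# in the spider, yield the item from parse_product instead of calling product.send()
def parse_product(self, response):
    product = response.meta['product']
    sizes = [size['name'] for size in json.loads(response.text)]
    yield ProductItem(
        name=product.name,
        id=product.id,
        title=product.title,
        size=sizes,
        link=product.link,
    )

Running scrapy crawl productpage -o products.json then writes every yielded item to products.json; the storage/notification logic can move into an item pipeline instead of living in the Product class.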

@Jadian's answer will get you a file with JSON in it, but not quite DB-like access to it. To do this properly from a design standpoint, I would follow the instructions below. You don't have to use Mongo either; there are plenty of other NoSQL DBs available that work with JSON.
What I would recommend in this situation is that you build out the items properly using scrapy.Item() classes. Then you can json.dumps them into MongoDB. You will need to assign a PK to each item, but Mongo is basically made to be a non-relational JSON store. What you would do then is create an item pipeline which checks for the PK of the item; if it's found and no details have changed, raise DropItem(), otherwise update/store the new data in MongoDB. You could probably even pipe into the JSON exporter if you wanted to, but I think just dumping the Python object to JSON in Mongo is the way to go, and then Mongo will present you with JSON to work with on the front end.
I hope you understand this answer, but I think from a design point of view this will be a much easier solution, since Mongo is basically a non-relational data store based on JSON, and you will be dividing your item pipeline logic into its own area instead of cluttering your spider with it.
I would provide a code sample, but most of mine use an ORM for a SQL DB. Mongo is actually easier to use than that...
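A minimal sketch of such a pipeline, assuming pymongo and using the item's id as the PK (the connection string, database and collection names are illustrative):

# pipelines.py -- sketch, assuming pymongo; connection and collection names are assumptions
import pymongo
from scrapy.exceptions import DropItem

class MongoUpsertPipeline:
    def open_spider(self, spider):
        self.client = pymongo.MongoClient('mongodb://localhost:27017')
        self.collection = self.client['products_db']['products']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        data = dict(item)
        existing = self.collection.find_one({'_id': data['id']})
        if existing and all(existing.get(k) == v for k, v in data.items() if k != 'id'):
            # nothing changed since the last crawl, drop the item
            raise DropItem('No changes for product {}'.format(data['id']))
        # insert or update the stored document
        self.collection.update_one({'_id': data['id']}, {'$set': data}, upsert=True)
        return item

The pipeline still has to be enabled under ITEM_PIPELINES in settings.py.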

Related

Scrapy item enriching from multiple websites

I implemented the following scenario with the Python Scrapy framework:
class MyCustomSpider(scrapy.Spider):
    def __init__(self, name=None, **kwargs):
        super().__init__(name, **kwargs)
        self.days = getattr(self, 'days', None)

    def start_requests(self):
        start_url = f'https://some.url?days={self.days}&format=json'
        yield scrapy.Request(url=start_url, callback=self.parse)

    def parse(self, response):
        json_data = response.json() if response and response.status == 200 else None
        if json_data:
            for entry in json_data['entries']:
                yield self.parse_json_entry(entry)
            if 'next' in json_data and json_data['next'] != "":
                yield response.follow(f"https://some.url?days={self.days}&time={self.time}&format=json", self.parse)

    def parse_json_entry(self, entry):
        ...
        item = loader.load_item()
        return item
I upsert parsed items into a database in one of the pipelines. I would like to add the following functionality:
before upserting the item I would like to read its current shape from the database
if the item does not exist in the database, or it exists but has some field empty, I need to make a call to another website (the exact web address is established based on the item's contents), scrape its contents, enrich my item based on this additional reading, and only then save the item into the database. I would like this call to also be covered by the Scrapy framework in order to have the cache and other conveniences
if the item does exist in the database and has the appropriate fields filled in, then just update the item's status based on the currently read data
How do I implement point 2 in a Scrapy-like way? Right now I perform the call to the other website in one of the pipelines, after scraping the item, but that way I do not employ Scrapy for it. Is there any smart way of doing that (maybe with pipelines), or should I rather put all the code into one spider, with all the database reading/checks and callbacks there?
Best regards!
I guess the best idea will be to upsert the partial data in one spider/pipeline with a flag stating that it still needs adjustment. Then, in another spider, load the data with the flag set and perform the additional readings.
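A minimal sketch of that second spider, assuming some way to read the flagged rows from the database in start_requests (the self.db helper, field names and selectors are illustrative):

# enrichment spider -- sketch; database access and field names are assumptions
class EnrichmentSpider(scrapy.Spider):
    name = 'enrichment'

    def start_requests(self):
        # read rows that were stored with the "needs adjustment" flag set
        for row in self.db.fetch_items_needing_enrichment():
            url = row['details_url']  # exact address derived from the item's contents
            yield scrapy.Request(url, callback=self.parse_details, cb_kwargs={'row': row})

    def parse_details(self, response, row):
        # enrich the partially stored item with data from the second website
        row['missing_field'] = response.css('span.missing::text').get()
        row['needs_adjustment'] = False
        yield row  # the pipeline upserts it again and clears the flag

This keeps the second request inside Scrapy, so the HTTP cache and other middleware conveniences still apply.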

Pass a list of start_urls as parameter from Django to Scrapyd

I'm working on a little scraping platform using Django and Scrapy (Scrapyd as the API). The default spider is working as expected, and using ScrapydAPI (python-scrapyd-api) I'm passing a URL from Django, scraping data, and even saving the results as JSON to a Postgres instance. This is for a SINGLE URL passed as a parameter.
When trying to pass a list of URLs, Scrapy just takes the first URL from the list. I don't know if it's something about how Python or ScrapydAPI is treating or processing these arguments.
# views.py
# This is how I pass parameters from Django
task = scrapyd.schedule(
    project=scrapy_project,
    spider=scrapy_spider,
    settings=scrapy_settings,
    url=urls
)

# default_spider.py
def __init__(self, *args, **kwargs):
    super(SpiderMercadoLibre, self).__init__(*args, **kwargs)
    self.domain = kwargs.get('domain')
    self.start_urls = [self.url]  # list(kwargs.get('url')) <-- doesn't work
    self.allowed_domains = [self.domain]

# Setup to tell Scrapy to make calls from the same URLs
def start_requests(self):
    ...
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse, meta={'original_url': url}, dont_filter=True)
Of course I could change my model so I can save each result by iterating over the list of URLs and scheduling each URL separately through ScrapydAPI, but I'm wondering if this is a limitation of Scrapyd itself or if I'm missing something about Python mechanics.
This is how ScrapydAPI processes the schedule method:
def schedule(self, project, spider, settings=None, **kwargs):
    """
    Schedules a spider from a specific project to run. First class, maps
    to Scrapyd's scheduling endpoint.
    """
    url = self._build_url(constants.SCHEDULE_ENDPOINT)
    data = {
        'project': project,
        'spider': spider
    }
    data.update(kwargs)
    if settings:
        setting_params = []
        for setting_name, value in iteritems(settings):
            setting_params.append('{0}={1}'.format(setting_name, value))
        data['setting'] = setting_params
    json = self.client.post(url, data=data, timeout=self.timeout)
    return json['jobid']
I think I'm implementing everything as expected, but every time, no matter which approach I use, only the first URL from the list gets scraped.
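One workaround worth trying (not shown in the original post) is to serialize the list into a single string on the Django side and split it back in the spider, so only one spider argument is sent; a minimal sketch assuming comma-separated URLs:

# views.py -- join the URLs into one string before scheduling (sketch)
task = scrapyd.schedule(
    project=scrapy_project,
    spider=scrapy_spider,
    settings=scrapy_settings,
    url=','.join(urls)
)

# default_spider.py -- split the argument back into a list (sketch)
def __init__(self, *args, **kwargs):
    super(SpiderMercadoLibre, self).__init__(*args, **kwargs)
    self.domain = kwargs.get('domain')
    self.start_urls = kwargs.get('url', '').split(',')
    self.allowed_domains = [self.domain]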

Pointing Scrapy at a local cache instead of performing a normal spidering process

I'm using pipelines to cache the documents from Scrapy crawls into a database, so that I can reparse them if I change the item parsing logic without having to hit the server again.
What's the best way to have Scrapy process from the cache instead of trying to perform a normal crawl?
I like Scrapy's support for CSS and XPath selectors, otherwise I would just hit the database separately with an lxml parser.
For a time, I wasn't caching the document at all and was using Scrapy in the normal fashion - parsing the items on the fly - but I've found that changing the item logic requires a time- and resource-intensive recrawl. Instead, I'm now caching the document body along with the item parse, and I want the option to have Scrapy iterate through those documents from a database instead of crawling the target URL.
How do I go about modifying Scrapy to give me the option to pass it a set of documents and then parsing them individually as if it had just pulled them down from the web?
I think a custom Downloader Middleware is a good way to go. The idea is to have this middleware return the source code directly from the database and not let Scrapy make any HTTP requests.
Sample implementation (not tested and definitely needs error-handling):
import re
import MySQLdb
from scrapy.http import Response
from scrapy.exceptions import IgnoreRequest
from scrapy import log
from scrapy.conf import settings

class CustomDownloaderMiddleware(object):
    def __init__(self, *args, **kwargs):
        super(CustomDownloaderMiddleware, self).__init__(*args, **kwargs)
        self.connection = MySQLdb.connect(**settings.DATABASE)
        self.cursor = self.connection.cursor()

    def process_request(self, request, spider):
        # extracting the product id from the url (pattern goes first in re.search)
        product_id = re.search(r"(\d+)$", request.url).group(1)

        # getting the cached source code from the database by product id
        self.cursor.execute("""
            SELECT
                source_code
            FROM
                products
            WHERE
                product_id = %s
        """, (product_id,))
        source_code = self.cursor.fetchone()[0]

        # making an HTTP response instance without actually hitting the web-site
        return Response(url=request.url, body=source_code)
And don't forget to activate the middleware.
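For example, something along these lines in settings.py (the dotted path and priority value are illustrative):

# settings.py -- enable the middleware; the module path and order are assumptions
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
}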

HTML elements from an XPath response not splitting when pipelined into a PostgreSQL table

I've written a web scraper using Scrapy that parses HTML data about upcoming concerts from a table on Vivid Seats: http://www.vividseats.com/concerts/awolnation-tickets.html
I'm able to successfully scrape the data for only some of the elements (i.e. eventName, eventLocation, eventCity, and eventState), but when I pipeline the items into the database, it enters the full collection of scraped data into each row instead of separating each new concert ticket into its own row. I saw another SO question where someone suggested appending each item to an items list, but I tried that and got an error. If this is the solution, how could I implement it with both the parse method and the pipelines.py file? In addition, I am unable to scrape the data for the date/time, the links to the actual tickets, and the price for some reason. I made the column for the date/time a date-time type, so maybe that caused a problem. I mainly need to know if my parse method is even structured properly, as this is my first time using it. The code for the parse method and pipelines.py is below. Thanks!
def parse(self, response):
    tickets = Selector(response).xpath('//*[@itemtype="http://schema.org/Event"]')
    for ticket in tickets:
        item = ComparatorItem()
        item['eventName'] = ticket.xpath('//*[@class="productionsEvent"]/text()').extract()
        item['eventLocation'] = ticket.xpath('//*[@class = "productionsVenue"]/span[@itemprop = "name"]/text()').extract()
        item['price'] = ticket.xpath('//*[@class="eventTickets lastChild"]/div/div/@data-origin-price').extract()
        yield Request(url, self.parse_articles_follow_next_page)
        item['ticketsLink'] = ticket.xpath('//*[@class="productionsTicketCol productionsTicketCol"]/a[@class="btn btn-primary"]/@href').extract()
        item['eventDate'] = ticket.xpath('//*[@class = "productionsDateCol productionsDateCol sorting_3"]/meta/@content').extract()
        item['eventCity'] = ticket.xpath('//*[@class = "productionsVenue"]/span[@itemprop = "address"]/span[@itemprop = "addressLocality"]/text()').extract()
        item['eventState'] = ticket.xpath('//*[@class = "productionsVenue"]/span[@itemprop = "address"]/span[@itemprop = "addressRegion"]/text()').extract()
        #item['eventTime'] = ticket.xpath('//*[@class = "productionsDateCol productionsDateCol sorting_3"]/div[@class = "productionsTime"]/text()').extract()
        yield item
pipelines.py
from sqlalchemy.orm import sessionmaker
from models import Deals, db_connect, create_deals_table

class LivingSocialPipeline(object):
    """Livingsocial pipeline for storing scraped items in the database"""
    def __init__(self):
        """
        Initializes database connection and sessionmaker.
        Creates deals table.
        """
        engine = db_connect()
        create_deals_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        """Save deals in the database.

        This method is called for every item pipeline component.
        """
        session = self.Session()
        deal = Deals(**item)
        try:
            session.add(deal)
            session.commit()
        except:
            session.rollback()
            raise
        finally:
            session.close()
        return item
I think the problem here is not inserting data into the database but the way you're extracting it. I see you are not using relative XPaths when you iterate over the ticket selectors.
For example, this line:
ticket.xpath('//*[@class="productionsEvent"]/text()').extract()
will get you all elements with the 'productionsEvent' class found anywhere in the response, not only the elements of this class relative to the ticket selector. If you want to get children of the ticket selector, you need to use this XPath with a dot at the beginning:
'.//*[@class="productionsEvent"]/text()'
This XPath will only take elements which are children of the ticket selector, not all elements on the page. Using absolute XPaths instead of relative ones is a very common gotcha described in the Scrapy docs.
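Applied to the parse method above, the loop would use dotted, relative XPaths; a short sketch for two of the fields (the remaining fields follow the same pattern):

def parse(self, response):
    tickets = Selector(response).xpath('//*[@itemtype="http://schema.org/Event"]')
    for ticket in tickets:
        item = ComparatorItem()
        # the leading ".//" keeps the query scoped to the current ticket node
        item['eventName'] = ticket.xpath('.//*[@class="productionsEvent"]/text()').extract()
        item['eventLocation'] = ticket.xpath('.//*[@class="productionsVenue"]/span[@itemprop="name"]/text()').extract()
        yield item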

How to get number of Items scraped by Python Scrapy tool?

I am using the Python Scrapy tool to extract data from a website. I am able to scrape the data. Now I want the count of items scraped from a particular website. How can I get the number of items scraped? Is there some built-in class for that in Scrapy? Any help will be appreciated. Thanks..
Based on the example here, I solved the same problem like this:
1. Write a custom web service like this to count the items downloaded:
from scrapy.webservice import JsonResource
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

class ItemCountResource(JsonResource):
    ws_name = 'item_count'

    def __init__(self, crawler, spider_name=None):
        JsonResource.__init__(self, crawler)
        self.item_scraped_count = 0
        dispatcher.connect(self.scraped, signals.item_scraped)
        self._spider_name = spider_name
        self.isLeaf = spider_name is not None

    def scraped(self):
        self.item_scraped_count += 1

    def render_GET(self, txrequest):
        return self.item_scraped_count

    def getChild(self, name, txrequest):
        return ItemCountResource(name, self.crawler)
2. Register the service in settings.py like this:
WEBSERVICE_RESOURCES = {
    'path.to.ItemResource.ItemCountResource': 1,
}
3. Visiting http://localhost:6080/item_count will return the number of items crawled.
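For reference, Scrapy's stats collector also records an item_scraped_count, which can be read without the web service; a minimal sketch of an extension that logs it when the spider closes (the class name and its EXTENSIONS registration are illustrative):

# extensions.py -- sketch; reads the built-in item_scraped_count stat
from scrapy import signals

class ItemCountLogger:
    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler.stats)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider):
        count = self.stats.get_value('item_scraped_count', 0)
        spider.logger.info('Scraped %d items', count)

The extension would still need to be enabled under EXTENSIONS in settings.py.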
