I implemented the following scenario with the Python Scrapy framework:
class MyCustomSpider(scrapy.Spider):
    def __init__(self, name=None, **kwargs):
        super().__init__(name, **kwargs)
        self.days = getattr(self, 'days', None)

    def start_requests(self):
        start_url = f'https://some.url?days={self.days}&format=json'
        yield scrapy.Request(url=start_url, callback=self.parse)

    def parse(self, response):
        json_data = response.json() if response and response.status == 200 else None
        if json_data:
            for entry in json_data['entries']:
                yield self.parse_json_entry(entry)
            if 'next' in json_data and json_data['next'] != "":
                yield response.follow(f"https://some.url?days={self.days}&time={self.time}&format=json", self.parse)

    def parse_json_entry(self, entry):
        ...
        item = loader.load_item()
        return item
I upsert the parsed items into a database in one of the pipelines. I would like to add the following functionality:
1. Before upserting the item, I would like to read its current shape from the database.
2. If the item does not exist in the database, or it exists but has some field empty, I need to make a call to another website (the exact web address is established based on the item's contents), scrape its contents, enrich my item based on this additional reading, and only then save the item into the database. I would like to have this call also covered by the Scrapy framework, in order to have the cache and other conveniences.
3. If the item does exist in the database and it has the appropriate fields filled in, then just update the item's status based on the currently read data.
How do I implement point 2 in a Scrapy-like way? Right now I perform the call to the other website in one of the pipelines, after scraping the item, but that way I do not employ Scrapy for it. Is there any smart way of doing that (maybe with pipelines), or should I rather put all the code into one spider, with all the database reading/checks and callbacks there?
Best regards!
I guess the best idea will be to upsert partial data in one spider/pipeline with a flag stating that it still needs adjustment. Then, in another spider, load the data with the flag set and perform the additional readings.
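For illustration, the second spider in that idea could look roughly like the sketch below. It is only a sketch: fetch_flagged_rows(), the needs_enrichment flag, the detail URL scheme and the selector are all assumptions, not part of the original code.

import scrapy

class EnrichmentSpider(scrapy.Spider):
    name = 'enrichment'

    def start_requests(self):
        # load rows previously upserted with needs_enrichment=True
        for row in fetch_flagged_rows():  # hypothetical DB helper
            url = f"https://other.site/details/{row['external_id']}"  # assumed URL scheme
            yield scrapy.Request(url, callback=self.parse_details, cb_kwargs={'row': row})

    def parse_details(self, response, row):
        # enrich the partially stored item with data from the second site;
        # a pipeline can then upsert it and clear the flag
        row['extra_field'] = response.css('div.extra::text').get()
        row['needs_enrichment'] = False
        yield row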
Related
This is a thing I've been encountering very often lately. I am supposed to scrape data from multiple requests for a single item.
I've been using request meta to accumulate data between requests, like this:
def parse_data(self, response):
    data = 'something'
    yield scrapy.Request(
        url='url for another page for scraping images',
        method='GET',
        callback=self.parse_images,
        meta={'data': data}
    )

def parse_images(self, response):
    images = ['some images']
    data = response.meta['data']
    yield scrapy.Request(
        url='url for another page for scraping more data',
        method='GET',
        callback=self.parse_more,
        meta={'images': images, 'data': data}
    )

def parse_more(self, response):
    more_data = 'more data'
    images = response.meta['images']
    data = response.meta['data']
    # ... build the item from data, images and more_data, then:
    yield item
In the last parse method, I scrape the final needed data and yield the item. However, this approach looks awkward to me. Is there any better way to scrape webpages like those or am I doing this correctly?
It's quite a regular and correct approach, keeping in mind that Scrapy is an async framework.
If you wish to have a plainer code structure you can use scrapy-inline-requests,
but it will require more hassle than using meta, from my perspective.
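For reference, usage of scrapy-inline-requests looks roughly like this (a sketch based on the project's README; the URLs and selectors are placeholders):

from inline_requests import inline_requests
import scrapy

class InlineExampleSpider(scrapy.Spider):
    name = 'inline_example'
    start_urls = ['https://example.com/product']

    @inline_requests
    def parse(self, response):
        data = 'something'
        # the decorated callback can wait for further responses inline
        images_resp = yield scrapy.Request('https://example.com/images')
        images = images_resp.css('img::attr(src)').getall()
        more_resp = yield scrapy.Request('https://example.com/more')
        more_data = more_resp.css('span.more::text').get()
        yield {'data': data, 'images': images, 'more_data': more_data}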
This is the proper way of tracking your item throughout requests. What I would do differently though is actually just set the item values like so:
item['foo'] = bar
item['bar'] = foo
yield scrapy.Request(url, callback=self.parse, meta={'item':item})
With this approach you only have to pass one thing, the item itself, through each time. There will be some instances where this isn't desirable.
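Applied to the chain above, that could look like the sketch below (the URLs and field values are placeholders; MyItem is assumed to be a dict-like Item with the corresponding fields):

def parse_data(self, response):
    item = MyItem()
    item['data'] = 'something'
    yield scrapy.Request(
        url='url for another page for scraping images',
        callback=self.parse_images,
        meta={'item': item}
    )

def parse_images(self, response):
    item = response.meta['item']
    item['images'] = ['some images']
    yield scrapy.Request(
        url='url for another page for scraping more data',
        callback=self.parse_more,
        meta={'item': item}
    )

def parse_more(self, response):
    item = response.meta['item']
    item['more_data'] = 'more data'
    yield item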
I want to use a Scrapy Item, manipulate the data and save it all in a JSON file (using the JSON file like a DB).
# Spider Class
class Spider(scrapy.Spider):
    name = 'productpage'
    start_urls = ['https://www.productpage.com']

    def parse(self, response):
        for product in response.css('article'):
            link = product.css('a::attr(href)').get()
            id = link.split('/')[-1]
            title = product.css('a > span::attr(content)').get()
            product = Product(self.name, id, title, '', link)
            yield scrapy.Request('{}.json'.format(link), callback=self.parse_product, meta={'product': product})
        yield scrapy.Request(url=response.url, callback=self.parse, dont_filter=True)

    def parse_product(self, response):
        product = response.meta['product']
        for size in json.loads(response.body_as_unicode()):
            product.size.append(size['name'])
        if self.storage.update(product.__dict__):
            product.send('url')
# STORAGE CLASS
class Storage:
    def __init__(self, name):
        self.name = name
        self.path = '{}.json'.format(self.name)
        self.load()  # load the JSON database

    def update(self, new_item):
        # .... do things and update data ...
        return True
# Product Class
class Product:
    def __init__(self, name, id, title, size, link):
        self.name = name
        self.id = id
        self.title = title
        self.size = []
        self.link = link

    def send(self, url):
        return  # send notify...
The Spider class searches for products on the main page of start_urls, then it parses each product page to also catch the sizes.
Finally it checks for updates via self.storage.update(product.__dict__) and, if that returns True, sends a notification.
How can I implement Item in my code? I thought I could put it into the Product class, but then I can't include the send method...
You should define the Item you want and yield it after parsing.
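For example, something along these lines (a sketch; the field names are borrowed from your Product class):

import scrapy

class ProductItem(scrapy.Item):
    name = scrapy.Field()
    id = scrapy.Field()
    title = scrapy.Field()
    size = scrapy.Field()
    link = scrapy.Field()

# then, e.g. in parse_product, once the sizes are collected:
#     item = ProductItem(name=self.name, id=id, title=title, size=sizes, link=link)
#     yield item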
Last, run the command:
scrapy crawl [spider] -o xx.json
PS:
Scrapy supports exporting JSON files by default.
@Jadian's answer will get you a file with JSON in it, but not quite DB-like access to it. In order to do this properly from a design standpoint I would follow the instructions below. You don't have to use Mongo either; there are plenty of other NoSQL DBs available that use JSON.
What I would recommend in this situation is that you build out the items properly using scrapy.Item() classes. Then you can use json.dumps into MongoDB. You will need to assign a PK to each item, but Mongo is basically made to be a non-relational JSON store. What you would do then is create an item pipeline which checks for the PK of the item: if it's found and no details have changed, raise DropItem(), else update/store the new data into MongoDB. You could probably even pipe into the JSON exporter if you wanted to, but I think just dumping the Python object to JSON in Mongo is the way to go, and then Mongo will present you with JSON to work with on the front end.
I hope that you understand this answer, but I think from a design point of view this will be a much easier solution, since Mongo is basically a non-relational data store based on JSON, and you will be dividing your item pipeline logic into its own area instead of cluttering your spider with it.
I would provide a code sample, but most of mine use an ORM for a SQL DB. Mongo is actually easier to use than that...
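A minimal sketch of that pipeline idea, assuming pymongo and that every item carries a 'pk' field (the DB and collection names here are placeholders):

from pymongo import MongoClient
from scrapy.exceptions import DropItem

class MongoUpsertPipeline:
    def open_spider(self, spider):
        self.client = MongoClient('mongodb://localhost:27017')
        self.collection = self.client['mydb']['items']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        data = dict(item)
        existing = self.collection.find_one({'pk': data['pk']})
        if existing and all(existing.get(k) == v for k, v in data.items()):
            # nothing changed, no need to store it again
            raise DropItem('no changes for %s' % data['pk'])
        # insert the new document or update the changed one
        self.collection.replace_one({'pk': data['pk']}, data, upsert=True)
        return item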
I've written a web scraper using Scrapy that parses HTML data about upcoming concerts from a table on Vivid Seats: http://www.vividseats.com/concerts/awolnation-tickets.html
I'm able to successfully scrape the data for only some of the elements (i.e. eventName, eventLocation, eventCity, and eventState), but when I pipeline the item into the database, it enters the full collection of the scraped data into each row instead of giving each new concert ticket its own row. I saw another SO question where someone suggested appending each item to an items list, but I tried that and got an error. If that is the solution, how could I implement it with both the parse method and the pipelines.py file? In addition, I am unable to scrape the data for the date/time, the links for the actual tickets, and the price, for some reason. I tried making the column for the date/time the date-time type, so maybe that caused a problem. I mainly need to know if my parse method is even structured properly, as this is my first time using it. The code for the parse method and pipelines.py is below. Thanks!
def parse(self, response):
    tickets = Selector(response).xpath('//*[@itemtype="http://schema.org/Event"]')
    for ticket in tickets:
        item = ComparatorItem()
        item['eventName'] = ticket.xpath('//*[@class="productionsEvent"]/text()').extract()
        item['eventLocation'] = ticket.xpath('//*[@class = "productionsVenue"]/span[@itemprop = "name"]/text()').extract()
        item['price'] = ticket.xpath('//*[@class="eventTickets lastChild"]/div/div/@data-origin-price').extract()
        yield Request(url, self.parse_articles_follow_next_page)
        item['ticketsLink'] = ticket.xpath('//*[@class="productionsTicketCol productionsTicketCol"]/a[@class="btn btn-primary"]/@href').extract()
        item['eventDate'] = ticket.xpath('//*[@class = "productionsDateCol productionsDateCol sorting_3"]/meta/@content').extract()
        item['eventCity'] = ticket.xpath('//*[@class = "productionsVenue"]/span[@itemprop = "address"]/span[@itemprop = "addressLocality"]/text()').extract()
        item['eventState'] = ticket.xpath('//*[@class = "productionsVenue"]/span[@itemprop = "address"]/span[@itemprop = "addressRegion"]/text()').extract()
        #item['eventTime'] = ticket.xpath('//*[@class = "productionsDateCol productionsDateCol sorting_3"]/div[@class = "productionsTime"]/text()').extract()
        yield item
pipelines.py
from sqlalchemy.orm import sessionmaker
from models import Deals, db_connect, create_deals_table

class LivingSocialPipeline(object):
    """Livingsocial pipeline for storing scraped items in the database"""

    def __init__(self):
        """
        Initializes database connection and sessionmaker.
        Creates deals table.
        """
        engine = db_connect()
        create_deals_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        """Save deals in the database.

        This method is called for every item pipeline component.
        """
        session = self.Session()
        deal = Deals(**item)
        try:
            session.add(deal)
            session.commit()
        except:
            session.rollback()
            raise
        finally:
            session.close()
        return item
I think the problem here is not inserting data into the database but the way you're extracting it. I see you are not using relative XPaths when you iterate over the ticket selectors.
For example this line:
ticket.xpath('//*[@class="productionsEvent"]/text()').extract()
will get you all elements with the 'productionsEvent' class that are found in the response, not only the elements of this class relative to the ticket selector. If you want to get the children of the ticket selector, you need to use the XPath with a dot at the beginning:
'.//*[@class="productionsEvent"]/text()'
This XPath will only take elements that are children of the ticket selector, not all elements on the page. Using absolute XPaths instead of relative ones is a very common gotcha described in the Scrapy docs.
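Applied to the parse method from the question, the loop would look roughly like the sketch below (only a few fields shown). extract_first() is used here so each field gets a single value per ticket instead of the whole list, which is what produced the "full collection in every row" symptom:

def parse(self, response):
    tickets = response.xpath('//*[@itemtype="http://schema.org/Event"]')
    for ticket in tickets:
        item = ComparatorItem()
        # note the leading './/' -- these XPaths are relative to `ticket`
        item['eventName'] = ticket.xpath('.//*[@class="productionsEvent"]/text()').extract_first()
        item['eventLocation'] = ticket.xpath('.//*[@class="productionsVenue"]/span[@itemprop="name"]/text()').extract_first()
        item['price'] = ticket.xpath('.//*[@class="eventTickets lastChild"]/div/div/@data-origin-price').extract_first()
        yield item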
Disclaimer: I'm fairly new to Scrapy.
To put my question plainly: How can I retrieve an Item property from a link on a page and get the results back into the same Item?
Given the following sample Spider:
class SiteSpider(Spider):
    site_loader = SiteLoader
    ...

    def parse(self, response):
        item = Place()
        sel = Selector(response)
        bl = self.site_loader(item=item, selector=sel)
        bl.add_value('domain', self.parent_domain)
        bl.add_value('origin', response.url)
        for place_property in item.fields:
            parse_xpath = self.template.get(place_property)
            # parse_xpath will look like either:
            # '//path/to/property/text()'
            # or
            # {'url': '//a[@id="Location"]/@href',
            #  'xpath': '//div[@class="directions"]/span[@class="address"]/text()'}
            if isinstance(parse_xpath, dict):  # place_property is at a URL
                url = sel.xpath(parse_xpath['url_elem']).extract()
                yield Request(url, callback=self.get_url_property,
                              meta={'loader': bl, 'parse_xpath': parse_xpath,
                                    'place_property': place_property})
            else:  # parse_xpath is just an xpath; process normally
                bl.add_xpath(place_property, parse_xpath)
        yield bl.load_item()

    def get_url_property(self, response):
        loader = response.meta['loader']
        parse_xpath = response.meta['parse_xpath']
        place_property = response.meta['place_property']
        sel = Selector(response)
        loader.add_value(place_property, sel.xpath(parse_xpath['xpath']))
        return loader
I'm running these spiders against multiple sites, and most of them have the data I need on one page and it works just fine. However, some sites have certain properties on a sub-page (ex., the "address" data existing at the "Get Directions" link).
The "yield Request" line is really where I have the problem. I see the items move through the pipeline, but they're missing those properties that are found at other URLs (IOW, those properties that get to "yield Request"). The get_url_property callback is basically just looking for an xpath within the new response variable, and adding that to the item loader instance.
Is there a way to do what I'm looking for, or is there a better way? I would like to avoid making a synchronous call to get the data I need (if that's even possible here), but if that's the best way, then maybe that's the right approach. Thanks.
If I understand you correctly, you have (at least) two different cases:
1. The crawled page links to another page containing the data (one or more further requests necessary)
2. The crawled page contains the data (no further request necessary)
In your current code, you call yield bl.load_item() for both cases, but in the parse callback. Note that the request you yield is executed at some later point in time, thus the item is incomplete at that moment, and that's why you're missing the place_property key from the item in the first case.
Possible Solution
A possible solution (if I understood you correctly) is to exploit the asynchronous behavior of Scrapy. Only minor changes to your code are involved.
For the first case, you pass the item loader to another request, which will then yield it. This is what you do in the isinstance if clause. You'll need to change the return value of the get_url_property callback to actually yield the loaded item.
For the second case, you can return the item directly, thus simply yield the item in the else clause.
The following code contains the changes to your example.
Does this solve your problem?
def parse(self, response):
    # ...
    if isinstance(parse_xpath, dict):  # place_property is at a URL
        url = sel.xpath(parse_xpath['url_elem']).extract()
        yield Request(url, callback=self.get_url_property,
                      meta={'loader': bl, 'parse_xpath': parse_xpath,
                            'place_property': place_property})
    else:  # parse_xpath is just an xpath; process normally
        bl.add_xpath(place_property, parse_xpath)
        yield bl.load_item()

def get_url_property(self, response):
    loader = response.meta['loader']
    # ...
    loader.add_value(place_property, sel.xpath(parse_xpath['xpath']))
    yield loader.load_item()
Related to that problem is the question of chaining requests, for which I have noted a similar solution.
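The chaining pattern mentioned there boils down to passing the loader (or the item) along each request and only yielding the loaded item in the last callback; roughly like this sketch, which reuses the names from the question and assumes the "Get Directions" link case:

def parse(self, response):
    bl = self.site_loader(item=Place(), selector=Selector(response))
    bl.add_value('origin', response.url)
    directions_url = response.xpath('//a[@id="Location"]/@href').extract_first()
    yield Request(directions_url, callback=self.parse_address, meta={'loader': bl})

def parse_address(self, response):
    bl = response.meta['loader']
    bl.add_value('address', Selector(response).xpath('//span[@class="address"]/text()').extract())
    # the item is complete only at the end of the chain
    yield bl.load_item()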
I have in my pipeline a method to check if the post date of the item is older than that found in MySQL, so let lastseen be the newest datetime retrieved from the database:
def process_item(self, item, spider):
    if item['post_date'] < lastseen:
        # set flag to close_spider
        # raise DropItem("old item")
This code basically works, except: I check the site on an hourly basis just to get the new posts. If I don't stop the spider, it will keep crawling thousands of pages; if I stop the spider on the flag, chances are a few requests will not be processed, since they may come back in the queue after the spider closed, even though they might be newer in post date. Having said that, is there a workaround for more precise scraping?
Thanks,
Not sure if this fits your setup, but you can fetch lastseen from MySQL when initializing your spider and stop generating Requests in your callbacks when the response contains an item with post_date < lastseen, hence basically moving the stop-crawling logic directly into the Spider instead of the pipeline.
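A minimal sketch of that idea, assuming pymysql, a posts table with a post_date column, and made-up selectors and date format:

from datetime import datetime
import pymysql
import scrapy

class NewPostsSpider(scrapy.Spider):
    name = 'newposts'
    start_urls = ['http://example.com/posts']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # fetch the newest post_date we already have, once, at spider init
        conn = pymysql.connect(host='localhost', user='user', password='pass', db='mydb')
        with conn.cursor() as cur:
            cur.execute("SELECT MAX(post_date) FROM posts")
            self.lastseen = cur.fetchone()[0]
        conn.close()

    def parse(self, response):
        stop = False
        for post in response.css('.post'):
            post_date = datetime.strptime(post.css('.date::text').get(), '%Y-%m-%d')  # assumed format
            if self.lastseen and post_date < self.lastseen:
                stop = True  # everything from here on is older than what we stored
                continue
            yield {'post_date': post_date, 'title': post.css('.title::text').get()}
        # only follow the next page while we are still seeing new posts
        if not stop:
            next_page = response.css('a.next::attr(href)').get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)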
It can sometimes be simpler to pass an argument to your spider
scrapy crawl myspider -a lastseen=20130715
and set a property of your Spider to test in your callback (http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments):
class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, lastseen=None):
        self.lastseen = lastseen
        # ...

    def parse_new_items(self, response):
        follow_next_page = True

        # item fetch logic
        for element in <some_selector>:

            # get post_date
            post_date = <extract post_date from element>

            # check post_date
            if post_date < self.lastseen:
                follow_next_page = False
                continue

            item = MyItem()
            # populate item...
            yield item

        # find next page to crawl
        if follow_next_page:
            next_page_url = ...
            yield Request(url=next_page_url, callback=self.parse_new_items)