How to stop a Scrapy spider but process all wanted items?

I have in my pipeline a method to check whether the post date of the item is older than the one found in MySQL, so let lastseen be the newest datetime retrieved from the database:
def process_item(self, item, spider):
    if item['post_date'] < lastseen:
        # set flag to close_spider
        # raise DropItem("old item")
This code basically works, except: I check the site on an hourly basis just to get the new posts. If I don't stop the spider, it will keep crawling thousands of pages; if I stop the spider on the flag, chances are a few requests will not be processed, since they may come back in the queue after the spider closed, even though they might be newer in post date. That said, is there a workaround for more precise scraping?
Thanks,

Not sure if this fits your setup, but you can fetch lastseen from MySQL when initializing your spider and stop generating Requests in your callbacks when the response contains an item with post_date < lastseen, basically moving the logic that stops crawling directly into the Spider instead of the pipeline.
It can sometimes be simpler to pass an argument to your spider
scrapy crawl myspider -a lastseen=20130715
and set a property on your Spider to test in your callback (see http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments):
class MySpider(BaseSpider):

    name = 'myspider'

    def __init__(self, lastseen=None):
        self.lastseen = lastseen
        # ...

    def parse_new_items(self, response):

        follow_next_page = True

        # item fetch logic
        for element in <some_selector>:

            # get post_date
            post_date = <extract post_date from element>

            # check post_date
            if post_date < self.lastseen:
                follow_next_page = False
                continue

            item = MyItem()
            # populate item...
            yield item

        # find next page to crawl
        if follow_next_page:
            next_page_url = ...
            yield Request(url=next_page_url, callback=self.parse_new_items)
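One note on the spider-argument variant: -a lastseen=20130715 arrives as a string, so it needs to be converted before comparing with post_date. A minimal sketch, assuming post_date is a datetime.date and the YYYYMMDD format from the example above:
from datetime import datetime

class MySpider(BaseSpider):

    name = 'myspider'

    def __init__(self, lastseen=None):
        # '-a lastseen=20130715' arrives as the string "20130715";
        # parse it once here so callbacks can compare dates directly
        self.lastseen = datetime.strptime(lastseen, '%Y%m%d').date() if lastseen else None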

Related

Scrapy item enriching from multiple websites

I implemented the following scenario with the Python Scrapy framework:
class MyCustomSpider(scrapy.Spider):
    def __init__(self, name=None, **kwargs):
        super().__init__(name, **kwargs)
        self.days = getattr(self, 'days', None)

    def start_requests(self):
        start_url = f'https://some.url?days={self.days}&format=json'
        yield scrapy.Request(url=start_url, callback=self.parse)

    def parse(self, response):
        json_data = response.json() if response and response.status == 200 else None
        if json_data:
            for entry in json_data['entries']:
                yield self.parse_json_entry(entry)
            if 'next' in json_data and json_data['next'] != "":
                yield response.follow(f"https://some.url?days={self.days}&time={self.time}&format=json", self.parse)

    def parse_json_entry(self, entry):
        ...
        item = loader.load_item()
        return item
I upsert parsed items into a database in one of the pipelines. I would like to add the following functionality:
1. Before upserting an item, I would like to read its current shape from the database.
2. If the item does not exist in the database, or it exists but has some field empty, I need to make a call to another website (the exact web address is established based on the item's contents), scrape its contents, enrich my item based on this additional reading, and only then save the item into the database. I would like this call to also be covered by the Scrapy framework, in order to have the cache and other conveniences.
3. If the item does exist in the database and has the appropriate fields filled in, then just update the item's status based on the currently read data.
How do I implement point 2 in a Scrapy-like way? Right now I perform the call to the other website in one of the pipelines, after scraping the item, but that way I do not employ Scrapy for it. Is there any smart way of doing that (maybe with pipelines), or should I rather put all the code into one spider, with all the database reading/checks and callbacks there?
Best regards!
I guess the best idea will be to upsert the partial data in one spider/pipeline with some flag stating that it still needs adjustment. Then, in another spider, load the data with the flag set and perform the additional readings.
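A minimal sketch of that two-spider idea, assuming a hypothetical needs_enrichment flag in the database and illustrative helpers load_flagged_rows() / build_detail_url() (none of these names come from the original code):
import scrapy

class EnrichmentSpider(scrapy.Spider):
    """Second spider: picks up rows the first spider's pipeline flagged."""
    name = 'enrichment'

    def start_requests(self):
        # load_flagged_rows(): hypothetical helper returning DB rows
        # whose needs_enrichment flag is set
        for row in load_flagged_rows():
            url = build_detail_url(row)  # hypothetical: URL derived from the item's contents
            yield scrapy.Request(url, callback=self.parse_detail, cb_kwargs={'row': row})

    def parse_detail(self, response, row):
        # enrich the partial item with data from the second website,
        # then let a pipeline upsert it and clear the flag
        item = dict(row)
        item['missing_field'] = response.css('...').get()  # selector is a placeholder
        yield item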

Scrapy: Stop following requests for a specific target

My Scrapy spider has a bunch of independent target links to crawl.
def start_requests(self):
    search_targets = get_search_targets()
    for search in search_targets:
        request = get_request(search.contract_type, search.postal_code, 1)
        yield request
Each link has multiple pages that will be followed, i.e.
def parse(self, response, **kwargs):
    # Some logic depending on the response
    # ...
    if cur_page < num_pages:  # Following the link to the next page
        next_page = cur_page + 1
        request = get_request(contract_type, postal_code, next_page)
        yield request
    for estate_dict in estates:  # Parsing the items of response
        item = EstateItem()
        fill_item(item, estate_dict)
        yield item
Now, after a few pages, each link (target) will encounter duplicate, already seen items from previous crawls. Whether an item is a duplicate is decided in the pipeline, with a query to the database.
def save_estate_item(self, item: EstateItem, session: Session):
    query = session.query(EstateModel)
    previous_item = query.filter_by(code=item['code']).first()
    if previous_item is not None:
        logging.info("Duplicate Estate")
        return
    # Save the item in the DB
    # ...
Now, when I find a duplicate estate, I want Scrapy to stop following pages for that specific link target. How could I do that?
I figured I would raise exceptions.DropItem('Duplicate post') in the pipeline with the info about the finished search target and catch that exception in my spider. But how could I tell Scrapy to stop following links for that specific search target?
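One hedged sketch of that idea: since process_item receives the spider instance, the pipeline can mark the target as finished on the spider, and parse can skip the next-page request once the mark appears. The finished_targets set, the (contract_type, postal_code) key, and the is_duplicate() helper are illustrative assumptions, not from the original code; note that because pipelines run after parsing, one or two extra page requests may still slip through before the flag is seen.
import scrapy
from scrapy.exceptions import DropItem

class SaveEstatePipeline:
    def process_item(self, item, spider):
        if self.is_duplicate(item):  # hypothetical: the existing DB query
            # flag this search target as done on the spider instance
            spider.finished_targets.add((item['contract_type'], item['postal_code']))
            raise DropItem('Duplicate estate')
        return item

class EstateSpider(scrapy.Spider):
    name = 'estates'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.finished_targets = set()

    def parse(self, response, **kwargs):
        # ... extract estates, contract_type, postal_code, cur_page, num_pages ...
        if (contract_type, postal_code) not in self.finished_targets and cur_page < num_pages:
            yield get_request(contract_type, postal_code, cur_page + 1)
        for estate_dict in estates:
            item = EstateItem()
            fill_item(item, estate_dict)
            yield item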

Scrapy: Can't restart start_requests() properly

I have a scraper that requests two pages: one of them is the main page, and the other is a .js file which contains the long and lat coordinates I need to extract, because I need them later in the parsing process. I want to process the .js file first, extract the coordinates, and then parse the main page and start crawling its links/parsing its items.
For this purpose I am using the priority parameter of Request, saying that I want my .js page to be processed first. This works, but only around 70% of the time (probably due to Scrapy's asynchronous requests). The remaining 30% of the time I end up in my parse method trying to parse the long/lat coordinates from the .js file but having been passed the main website page, so it's impossible to parse them.
For this reason, I tried to fix it this way:
when in the parse() method, check which n-th URL this is; if it is the first one and it is not the .js one, restart the spider. However, when I restart the spider, the next time it correctly processes the .js file first, but after processing it the spider finishes work and exits the script without an error, as if it had completed.
Why is that happening? What is the difference in how the pages are processed when I restart the spider compared to when I just start it, and how can I fix this problem?
This is the code, with sample outputs for both scenarios from when I was trying to debug what is being executed and why it stops when restarted.
import re
from scrapy import Request, Spider


class QuotesSpider(Spider):
    name = "bot"
    url_id = 0
    home_url = 'https://website.com'
    longitude = None
    latitude = None

    def __init__(self, cat=None):
        self.cat = cat.replace("-", " ")

    def start_requests(self):
        print("Starting spider")
        self.start_urls = [
            self.home_url,
            self.home_url + '/js-file-with-long-lat.js'
        ]
        for priority, url in enumerate(self.start_urls):
            print("Processing", url)
            yield Request(url=url, priority=priority, callback=self.parse)

    def parse(self, response):
        print("Inside parse")
        if self.url_id == 0 and response.url == self.home_url:
            self.alert("Loaded main page before long/lat page, restarting", False)
            for _ in self.start_requests():
                yield _
        else:
            print("Everything is good, url id is", str(self.url_id))
            self.url_id += 1
            if self.longitude is None:
                for _ in self.parse_long_lat(response):
                    yield _
            else:
                print("Calling parse cats")
                for cat in self.parse_cats(response):
                    yield cat

    def parse_long_lat(self, response):
        print("called long lat")
        try:
            self.latitude = re.search(r'latitude:(\-?[0-9]{1,2}\.?[0-9]*)',
                                      response.text).group(1)
            self.longitude = re.search(r'longitude:(\-?[0-9]{1,3}\.?[0-9]*)',
                                       response.text).group(1)
            print("Extracted coords")
            yield None
        except AttributeError:
            self.alert("\nCan't extract lat/long coordinates, store availability will not be parsed.", False)
            yield None

    def parse_cats(self, response):
        """ Parsing links code goes here """
        pass
Output when the spider starts correctly, gets the .js page first, and then starts parsing the cats:
Starting spider
https://website.com
https://website.com/js-file-with-long-lat.js
Inside parse
Everything is good, url id is 0
called long lat
Extracted coords
Inside parse
Everything is good, url id is 1
Calling parse cats
And the script goes on and parses everything fine.
Output when the spider starts incorrectly, gets the main page first, and restarts start_requests():
Starting spider
https://website.com
https://website.com/js-file-with-long-lat.js
Inside parse
Loaded main page before long/lat page, restarting
Starting spider
https://website.com
https://website.com/js-file-with-long-lat.js
Inside parse
Everything is good, url id is 0
called long lat
Extracted coords
And the script stops its execution without an error, as if it had completed.
P.S. If this matters, I did notice that the URLs in start_requests() are processed in reverse order, but I find this natural due to the loop sequence, and I expect the priority param to do its job (as it does most of the time, and as it should per Scrapy's docs).
As to why your Spider doesn't continue in the "restarting" case: you probably run afoul of duplicate requests being filtered/dropped. Since the page has already been visited, Scrapy thinks it's done.
So you would have to re-send these requests with a dont_filter=True argument:
for priority, url in enumerate(self.start_urls):
    print("Processing", url)
    yield Request(url=url, dont_filter=True, priority=priority, callback=self.parse)
    #                      ^^^^^^^^^^^^^^^^ notice us forcing the Dupefilter to
    #                                       ignore duplicate requests to these pages
As for a better solution than this hacky approach, consider using InitSpider (one example; other methods exist). This guarantees your "initial" work got done and can be depended on.
(For some reason the class was never documented in the Scrapy docs, but it's a relatively simple Spider subclass: do some initial work before starting the actual run.)
And here is a code-example for that:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders.init import InitSpider
class QuotesSpider(InitSpider):
name = 'quotes'
allowed_domains = ['website.com']
start_urls = ['https://website.com']
# Without this method override, InitSpider behaves like Spider.
# This is used _instead of_ start_requests. (Do not override start_requests.)
def init_request(self):
# The last request that finishes the initialization needs
# to have the `self.initialized()` method as callback.
url = self.start_urls[0] + '/js-file-with-long-lat.js'
yield scrapy.Request(url, callback=self.parse_long_lat, dont_filter=True)
def parse_long_lat(self, response):
""" The callback for our init request. """
print ("called long lat")
# do some work and maybe return stuff
self.latitude = None
self.longitude = None
#yield stuff_here
# Finally, start our run.
return self.initialized()
# Now we are "initialized", will process `start_urls`
# and continue from there.
def parse(self, response):
print ("Inside parse")
print ("Everything is good, do parse_cats stuff here")
which would result in output like this:
2019-01-10 20:36:20 [scrapy.core.engine] INFO: Spider opened
2019-01-10 20:36:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-10 20:36:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1/js-file-with-long-lat.js> (referer: None)
called long lat
2019-01-10 20:36:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1> (referer: http://127.0.0.1/js-file-with-long-lat.js/)
Inside parse
Everything is good, do parse_cats stuff here
2019-01-10 20:36:21 [scrapy.core.engine] INFO: Closing spider (finished)
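For completeness, a hedged sketch of what parse_long_lat could look like inside this InitSpider skeleton, reusing the regexes from the question (the patterns and attribute names are carried over from the original code, not verified against the real site):
# assumes `import re` at module level
def parse_long_lat(self, response):
    """ Init callback: extract coords from the .js response, then start the run. """
    match_lat = re.search(r'latitude:(-?[0-9]{1,2}\.?[0-9]*)', response.text)
    match_lon = re.search(r'longitude:(-?[0-9]{1,3}\.?[0-9]*)', response.text)
    self.latitude = match_lat.group(1) if match_lat else None
    self.longitude = match_lon.group(1) if match_lon else None
    # Finally, start the normal run; `start_urls` is processed from here.
    return self.initialized()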
So I finally handled it with a workaround:
I check which response.url was received in parse() and, based on that, I send further parsing to the corresponding method:
def start_requests(self):
    self.start_urls = [
        self.home_url,
        self.home_url + 'js-file-with-long-lat.js'
    ]
    for priority, url in enumerate(self.start_urls):
        yield Request(url=url, priority=priority, callback=self.parse)

def parse(self, response):
    if response.url != self.home_url:
        for _ in self.parse_long_lat(response):
            yield _
    else:
        for cat in self.parse_cats(response):
            yield cat

Scrapy yield request from one spider to another

I have the following code:
# FirstSpider.py
class FirstSpider(scrapy.Spider):
    name = 'first'
    start_urls = ['https://www.basesite.com']
    next_urls = []

    def parse(self, response):
        for url in response.css('bunch > of > css > here'):
            self.next_urls.append(url.css('more > css > here'))
            l = Loader(item=Item(), selector=url.css('more > css'))
            l.add_css('add', 'more > css')
            ...
            ...
            yield l.load_item()

        for url in self.next_urls:
            new_urls = self.start_urls[0] + url
            yield scrapy.Request(new_urls, callback=SecondSpider.parse_url)


# SecondSpider.py
class SecondSpider(scrapy.Spider):
    name = 'second'
    start_urls = ['https://www.basesite.com']

    def parse_url(self):
        """Parse team data."""
        return self
        # self is a HtmlResponse not a 'response' object

    def parse(self, response):
        """Parse all."""
        summary = self.parse_url(response)
        return summary


# ThirdSpider.py
class ThirdSpider(scrapy.Spider):
    # take links from second spider, continue:
I want to be able to pass the URL scraped in Spider 1 to Spider 2 (in a different script). I'm curious as to why, when I do, the 'response' is an HtmlResponse and not a 'response' object (when doing something similar with a method in the same class as Spider 1, I don't have this issue).
What am I missing here? How do I just pass the original response(s) to the second spider (and from the second on to the third, etc.)?
You could use Redis as a shared resource between all spiders: https://github.com/rmax/scrapy-redis
Run all N spiders (don't close them on the idle state), so each of them is connected to the same Redis and waits for tasks (url, request headers) from there;
As a side effect, push task data to Redis from X_spider with a specific key (the Y_spider name).
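A minimal sketch of that setup with scrapy-redis (the spider name and Redis key are illustrative; it also assumes scrapy-redis's scheduler and dupefilter are enabled in settings.py as described in its README):
# SecondSpider.py, as a scrapy-redis spider that idles and waits for pushed URLs
from scrapy_redis.spiders import RedisSpider

class SecondSpider(RedisSpider):
    name = 'second'
    redis_key = 'second:start_urls'  # the Redis list this spider pulls URLs from

    def parse(self, response):
        """Parse team data for a URL pushed by the first spider."""
        ...

# In the first spider (or any script), push the discovered URLs instead of
# yielding cross-spider Requests, e.g. with redis-py:
#     import redis
#     redis.Redis().lpush('second:start_urls', new_url)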
What about using inheritance? The "parse" function names should be different.
If your first spider inherits from the second, it will be able to set the callback to self.parse_function_spider2.
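A minimal sketch of the inheritance idea, assuming the second spider's callback is renamed to parse_team (an illustrative name) so the two parse methods don't clash:
import scrapy

# SecondSpider.py
class SecondSpider(scrapy.Spider):
    name = 'second'

    def parse_team(self, response):
        """Parse team data from a detail page."""
        ...

# FirstSpider.py
class FirstSpider(SecondSpider):
    name = 'first'
    start_urls = ['https://www.basesite.com']

    def parse(self, response):
        for url in response.css('bunch > of > css > here'):
            next_url = self.start_urls[0] + url.css('more > css > here').get()
            # the inherited method is bound to this spider instance,
            # so it receives a proper Response when the request completes
            yield scrapy.Request(next_url, callback=self.parse_team)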

HTML elements from an XPath response not splitting when pipelined into a PostgreSQL table

I've written a web scraper using Scrapy that parses HTML data about upcoming concerts from a table on Vivid Seats: http://www.vividseats.com/concerts/awolnation-tickets.html
I'm able to successfully scrape the data for only some of the elements (i.e. eventName, eventLocation, eventCity, and eventState), but when I pipeline the item into the database, it enters the full collection of scraped data into each row instead of giving each new concert ticket its own row. I saw another SO question where someone suggested appending each item to an items list, but I tried that and got an error. If this is the solution, how could I implement it with both the parse method and the pipelines.py file? In addition to this, I am unable to scrape the data for the date/time, the links for the actual tickets, and the price for some reason. I tried making the column for the date/time a date-time type, so maybe that caused a problem. I mainly need to know if my parse method is even structured properly, as this is my first time using it. The code for the parse method and pipelines.py is below. Thanks!
def parse(self, response):
    tickets = Selector(response).xpath('//*[@itemtype="http://schema.org/Event"]')
    for ticket in tickets:
        item = ComparatorItem()
        item['eventName'] = ticket.xpath('//*[@class="productionsEvent"]/text()').extract()
        item['eventLocation'] = ticket.xpath('//*[@class = "productionsVenue"]/span[@itemprop = "name"]/text()').extract()
        item['price'] = ticket.xpath('//*[@class="eventTickets lastChild"]/div/div/@data-origin-price').extract()
        yield Request(url, self.parse_articles_follow_next_page)
        item['ticketsLink'] = ticket.xpath('//*[@class="productionsTicketCol productionsTicketCol"]/a[@class="btn btn-primary"]/@href').extract()
        item['eventDate'] = ticket.xpath('//*[@class = "productionsDateCol productionsDateCol sorting_3"]/meta/@content').extract()
        item['eventCity'] = ticket.xpath('//*[@class = "productionsVenue"]/span[@itemprop = "address"]/span[@itemprop = "addressLocality"]/text()').extract()
        item['eventState'] = ticket.xpath('//*[@class = "productionsVenue"]/span[@itemprop = "address"]/span[@itemprop = "addressRegion"]/text()').extract()
        #item['eventTime'] = ticket.xpath('//*[@class = "productionsDateCol productionsDateCol sorting_3"]/div[@class = "productionsTime"]/text()').extract()
        yield item
pipelines.py
from sqlalchemy.orm import sessionmaker
from models import Deals, db_connect, create_deals_table

class LivingSocialPipeline(object):
    """Livingsocial pipeline for storing scraped items in the database"""

    def __init__(self):
        """
        Initializes database connection and sessionmaker.
        Creates deals table.
        """
        engine = db_connect()
        create_deals_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        """Save deals in the database.
        This method is called for every item pipeline component.
        """
        session = self.Session()
        deal = Deals(**item)
        try:
            session.add(deal)
            session.commit()
        except:
            session.rollback()
            raise
        finally:
            session.close()
        return item
I think the problem here is not inserting data into the database but the way you're extracting it. I see you are not using relative XPaths when you iterate over the ticket selectors.
For example, this line:
ticket.xpath('//*[@class="productionsEvent"]/text()').extract()
will get you all elements with the 'productionsEvent' class found in the response, not the elements of this class relative to the ticket selector. If you want to get children of the ticket selector, you need to use this XPath with a dot at the beginning:
'.//*[@class="productionsEvent"]/text()'
This XPath will only take elements which are children of the ticket selector, not all elements on the page. Using absolute XPaths instead of relative ones is a very common gotcha described in the Scrapy docs.
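Applied to the loop above, a hedged sketch of the corrected extraction (only a couple of the fields are shown; extract_first() is used so each item keeps one value per ticket rather than a list, which is an assumption about the intended one-row-per-ticket output):
def parse(self, response):
    tickets = response.xpath('//*[@itemtype="http://schema.org/Event"]')
    for ticket in tickets:
        item = ComparatorItem()
        # the leading dot makes the query relative to this `ticket` node,
        # so each item gets that ticket's values instead of the whole page's
        item['eventName'] = ticket.xpath('.//*[@class="productionsEvent"]/text()').extract_first()
        item['eventLocation'] = ticket.xpath('.//*[@class="productionsVenue"]/span[@itemprop="name"]/text()').extract_first()
        # ... the remaining fields follow the same relative pattern ...
        yield item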
