Scrapy Deltafetch incremental crawling - python

I am working with Scrapy to scrape a website, and I want to extract only those items which have not been scraped in a previous run.
I am trying this on https://www.ndtv.com/top-stories to extract only the first headline, and only if it has been updated.
Below is my code:
import scrapy
from selenium import webdriver
from w3lib.url import url_query_parameter


class QuotesSpider(scrapy.Spider):
    name = "test"
    start_urls = [
        'https://www.ndtv.com/top-stories',
    ]

    def parse(self, response):
        print('testing')
        print(response.url)
        yield {
            'heading': response.css('div.nstory_header a::text').extract_first(),
        }
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawl_once.CrawlOnceMiddleware': 100,
}
SPIDER_MIDDLEWARES = {
    #'inc_crawling.middlewares.IncCrawlingSpiderMiddleware': 543,
    'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': True,
    'scrapy_deltafetch.DeltaFetch': 100,
    'scrapy_crawl_once.CrawlOnceMiddleware': 100,
    'scrapylib.deltafetch.DeltaFetch': 100,
    'inc_crawling.middlewares.deltafetch.DeltaFetch': 100,
}
COOKIES_ENABLED = True
COOKIES_DEBUG = True
DELTAFETCH_ENABLED = True
DELTAFETCH_DIR = '/home/administrator/apps/inc_crawling'
DOTSCRAPY_ENABLED = True
I have added the settings above to my settings.py file.
I am running the spider with the command "scrapy crawl test -o test.json", and after each run both the .db file and test.json get updated.
My expectation is that the .db file should only be updated when the first headline has changed.
Kindly help me if there is a better approach to extract only the updated headline.

A good way to implement this would be to override the DUPEFILTER_CLASS so that it checks your database before making the actual requests.
Scrapy uses a dupefilter class to avoid fetching the same request twice, but by default it only de-duplicates within a single run of the spider.
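
A minimal sketch of that idea, assuming the seen fingerprints are kept in a SQLite file (the seen_requests.db file name and the module path in the settings line below are hypothetical):

import os
import sqlite3

from scrapy.dupefilters import RFPDupeFilter


class PersistentDupeFilter(RFPDupeFilter):
    """Skips requests whose fingerprint was already stored by a previous run."""

    def __init__(self, path=None, debug=False, *args, **kwargs):
        super().__init__(path, debug, *args, **kwargs)
        # Hypothetical location; any writable directory works.
        db_path = os.path.join(path or '.', 'seen_requests.db')
        self.db = sqlite3.connect(db_path)
        self.db.execute('CREATE TABLE IF NOT EXISTS seen (fp TEXT PRIMARY KEY)')

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if self.db.execute('SELECT 1 FROM seen WHERE fp = ?', (fp,)).fetchone():
            return True  # seen in this run or a previous one, so skip it
        self.db.execute('INSERT INTO seen (fp) VALUES (?)', (fp,))
        self.db.commit()
        return False

Then point Scrapy at it in settings.py (module path is hypothetical):

DUPEFILTER_CLASS = 'inc_crawling.dupefilters.PersistentDupeFilter'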

Related

Can't get rid of a problematic page even when using rotation of proxies within scrapy

I've created a script using Scrapy, implementing rotation of proxies within it, to parse the address from a few hundred similar links. I've supplied those links from a csv file within the script.
The script runs fine until it encounters a response URL like https://www.bcassessment.ca//Property/UsageValidation; once the script starts getting that link, it can't get past it. FYI, I'm using a meta property containing lead_link to retry with the original link instead of the redirected one, so I should be able to bypass that barrier.
It doesn't happen when I use proxies with the requests library. To be clearer: while using the requests library, the script does encounter this /Property/UsageValidation page but bypasses it successfully after a few retries.
The spider is like:
import csv

import scrapy
from scrapy.crawler import CrawlerProcess


class mySpider(scrapy.Spider):
    name = "myspider"

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'stackoverflow_spider.middlewares.ProxiesMiddleware': 100,
        }
    }

    def start_requests(self):
        with open("output_main.csv", "r") as f:
            reader = csv.DictReader(f)
            for item in list(reader):
                lead_link = item['link']
                yield scrapy.Request(lead_link, self.parse, meta={"lead_link": lead_link, "download_timeout": 20}, dont_filter=True)

    def parse(self, response):
        address = response.css("h1#mainaddresstitle::text").get()
        print(response.meta['proxy'], address)


if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'LOG_LEVEL': 'ERROR',
    })
    c.crawl(mySpider)
    c.start()
How can I keep the script from getting stuck on that page?
PS: I've attached a few of the links in a text file in case anyone wants to give it a try.
To make the proxy implementation session-safe for a Scrapy app, you
need to add an additional cookiejar meta key wherever you assign the proxy to request.meta, like this:
....
yield scrapy.Request(url=link, meta={"proxy": address, "cookiejar": address})
In this case Scrapy's CookiesMiddleware will create a separate cookie session for each proxy.
Related specifics of Scrapy's proxy implementation are mentioned in this answer.
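
Applied to the question's spider, the change looks roughly like this sketch (the PROXIES list and its addresses are placeholders; here the proxy is chosen directly in start_requests for brevity, while in the question it is assigned inside ProxiesMiddleware, where the same cookiejar key can be set on request.meta):

import csv
import random

import scrapy

PROXIES = ["http://198.51.100.10:8080", "http://203.0.113.25:3128"]  # placeholder proxies


class mySpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        with open("output_main.csv", "r") as f:
            for item in csv.DictReader(f):
                lead_link = item['link']
                address = random.choice(PROXIES)
                yield scrapy.Request(
                    lead_link,
                    self.parse,
                    meta={
                        "lead_link": lead_link,
                        "proxy": address,
                        "cookiejar": address,  # one cookie session per proxy
                        "download_timeout": 20,
                    },
                    dont_filter=True,
                )

    def parse(self, response):
        address = response.css("h1#mainaddresstitle::text").get()
        print(response.meta.get("proxy"), address)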

Scrapy-Splash Waiting for Page to Load

I'm new to Scrapy and Splash, and I need to scrape data from single-page and regular web apps.
A caveat, though, is that I'm mostly scraping data from internal tools and applications, so some require authentication and all of them need at least a couple of seconds before the page fully loads.
I naively tried a Python time.sleep(seconds) and it didn't work. It seems like SplashRequest and scrapy.Request both run and yield results anyway. I then learned about Lua scripts as arguments to these requests, and attempted a Lua script with various forms of wait(), but it looks like the requests never actually run the Lua scripts. It finishes right away and my HTML selectors don't find anything I'm looking for.
I'm following the directions from https://github.com/scrapy-plugins/scrapy-splash, have their Docker instance running on localhost:8050, and created a settings.py.
Does anyone with experience here know what I might be missing?
Thanks!
spider.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest
import logging
import base64
import time
# from selenium import webdriver

# lua_script="""
# function main(splash)
#     splash:set_user_agent(splash.args.ua)
#     assert(splash:go(splash.args.url))
#     splash:wait(5)
#     -- requires Splash 2.3
#     -- while not splash:select('#user-form') do
#     --     splash:wait(5)
#     -- end
#     repeat
#         splash:wait(5))
#     until( splash:select('#user-form') ~= nil )
#     return {html=splash:html()}
# end
# """

load_page_script="""
function main(splash)
    splash:set_user_agent(splash.args.ua)
    assert(splash:go(splash.args.url))
    splash:wait(5)

    function wait_for(splash, condition)
        while not condition() do
            splash:wait(0.5)
        end
    end

    local result, error = splash:wait_for_resume([[
        function main(splash) {
            setTimeout(function () {
                splash.resume();
            }, 5000);
        }
    ]])

    wait_for(splash, function()
        return splash:evaljs("document.querySelector('#user-form') != null")
    end)

    -- repeat
    --     splash:wait(5))
    -- until( splash:select('#user-form') ~= nil )

    return {html=splash:html()}
end
"""


class HelpSpider(scrapy.Spider):
    name = "help"
    allowed_domains = ["secet_internal_url.com"]
    start_urls = ['https://secet_internal_url.com']
    # http_user = 'splash-user'
    # http_pass = 'splash-password'

    def start_requests(self):
        logger = logging.getLogger()
        login_page = 'https://secet_internal_url.com/#/auth'
        splash_args = {
            'html': 1,
            'png': 1,
            'width': 600,
            'render_all': 1,
            'lua_source': load_page_script
        }
        #splash_args = {
        #    'html': 1,
        #    'png': 1,
        #    'width': 600,
        #    'render_all': 1,
        #    'lua_source': lua_script
        #}
        yield SplashRequest(login_page, self.parse, endpoint='execute', magic_response=True, args=splash_args)

    def parse(self, response):
        # time.sleep(10)
        logger = logging.getLogger()
        html = response._body.decode("utf-8")
        # Looking for a form with the ID 'user-form'
        form = response.css('#user-form')
        logger.info("####################")
        logger.info(form)
        logger.info("####################")
I figured it out!
Short Answer
My Spider class was configured incorrectly for using splash with scrapy.
Long Answer
Part of running Splash with Scrapy is, in my case, running a local Docker instance that the requests are sent to so it can run the Lua scripts. An important caveat is that the Splash settings described on the GitHub page must be a property of the spider class itself, so I added this code to my spider:
custom_settings = {
    'SPLASH_URL': 'http://localhost:8050',
    # if installed Docker Toolbox:
    # 'SPLASH_URL': 'http://192.168.99.100:8050',
    'DOWNLOADER_MIDDLEWARES': {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    },
    'SPIDER_MIDDLEWARES': {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    },
    'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
}
Then I noticed my Lua code running, with the Docker container logs showing the interactions. After fixing errors with splash:select(), my login script worked, as did my waits:
splash:wait( seconds_to_wait )
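
For reference, a minimal sketch of that wait-for-element pattern (assumptions: Splash 2.3+ for splash:select, the custom_settings shown above, and the question's #user-form selector and auth URL):

import scrapy
from scrapy_splash import SplashRequest

# Lua script: load the page, then poll until the element exists before returning.
wait_for_form = """
function main(splash)
    assert(splash:go(splash.args.url))
    while not splash:select('#user-form') do
        splash:wait(0.5)
    end
    return {html = splash:html()}
end
"""


class WaitSketchSpider(scrapy.Spider):
    name = "wait_sketch"  # hypothetical spider name
    # custom_settings = { ... }  # the Splash settings shown above go here

    def start_requests(self):
        yield SplashRequest(
            'https://secet_internal_url.com/#/auth',
            self.parse,
            endpoint='execute',
            args={'lua_source': wait_for_form},
        )

    def parse(self, response):
        self.logger.info(response.css('#user-form'))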
Lastly, I created a Lua script to handle logging in, redirecting, and gathering links and text from pages. My application is an AngularJS app, so I can't gather hrefs and visit them directly; the only way through is clicking. This script let me run through every link, click it, and gather the content.
I suppose an alternative solution would have been to use end-to-end testing tools such as Selenium/WebDriver or Cypress, but I prefer to use Scrapy to scrape and testing tools to test. To each their own (Python or NodeJS tools), I suppose.
Neat Trick
Another thing that's really helpful for debugging: when you're running the Docker instance for Scrapy-Splash, you can visit its URL in your browser, where there's an interactive "request tester" that lets you try out Lua scripts and see rendered HTML results (for example, verifying login or page visits). For me this URL was http://0.0.0.0:8050; it is the same URL you set in your settings, so it should match your Docker container.
Cheers!

scrape an api result page with scrapy

I have this URL, and the content of its response contains some JSON data:
https://www.tripadvisor.com/TypeAheadJson?action=API&types=geo%2Cnbrhd%2Chotel%2Ctheme_park&legacy_format=true&urlList=true&strictParent=true&query=sadaf%20dubai%20hotel&max=6&name_depth=3&interleaved=true&scoreThreshold=0.5&strictAnd=false&typeahead1_5=true&disableMaxGroupSize=true&geoBoostFix=true&neighborhood_geos=true&details=true&link_type=hotel%2Cvr%2Ceat%2Cattr&rescue=true&uiOrigin=trip_search_Hotels&source=trip_search_Hotels&startTime=1516800919604&searchSessionId=BA939B3D93510DABB510328CBF3353131516800881576ssid&nearPages=true
Every time I paste this URL into the browser with a different query, I get a nice JSON result, but in Scrapy or the Scrapy shell I don't get any result. This is my Scrapy spider class:
link = "https://www.tripadvisor.com/TypeAheadJson?action=API&types=geo%2Cnbrhd%2Chotel%2Ctheme_park&legacy_format=true&urlList=true&strictParent=true&query={}%20dubai%20hotel&max=6&name_depth=3&interleaved=true&scoreThreshold=0.5&strictAnd=false&typeahead1_5=true&disableMaxGroupSize=true&geoBoostFix=true&neighborhood_geos=true&details=true&link_type=hotel%2Cvr%2Ceat%2Cattr&rescue=true&uiOrigin=trip_search_Hotels&source=trip_search_Hotels&startTime=1516800919604&searchSessionId=BA939B3D93510DABB510328CBF3353131516800881576ssid&nearPages=true"
def start_requests(self):
files = [f for f in listdir('results/') if isfile(join('results/', f))]
for file in files:
with open('results/' + file, 'r', encoding="utf8") as tour_info:
tour = json.load(tour_info)
for hotel in tour["hotels"]:
yield scrapy.Request(self.link.format(hotel))
name = 'tripadvisor'
allowed_domains = ['tripadvisor.com']
def parse(self, response):
print(response.body)
For this code, in the Scrapy shell, I get this result:
b'{"normalized":{"query":""},"query":{},"results":[],"partial_content":false}'
On the command line, running the spider, I first got a "Forbidden by robots.txt" error for every URL. I set ROBOTSTXT_OBEY to False so it does not obey that file. Now I get [] for every request, but I should get a JSON object like this:
[
    {
        "urls":[
            {
                "url_type":"hotel",
                "name":"Sadaf Hotel, Dubai, United Arab Emirates",
                "type":"HOTEL",
                "url":"\/Hotel_Review-g295424-d633008-Reviews-Sadaf_Hotel-Dubai_Emirate_of_Dubai.html"
            }
        ],
        .
        .
        .
Try removing the sessionID from the URL, and maybe check how "unfriendly" your settings.py is. (Also see this blog.)
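As a rough illustration of the settings.py side of that advice, something along these lines makes the spider look more like a browser (the values are placeholders, not a verified fix for this particular endpoint):

# settings.py -- illustrative values only
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0 Safari/537.36"
)
DEFAULT_REQUEST_HEADERS = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Language": "en-US,en;q=0.9",
}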
But it could be way easier to use Wget, like wget 'https://www.tripadvisor.com/TypeAheadJson?action=API&types=geo%2Cnbrhd%2Chotel%2Ctheme_park&legacy_format=true&urlList=true&strictParent=true&query={}%20dubai%20hotel&max=6&name_depth=3&interleaved=true&scoreThreshold=0.5&strictAnd=false&typeahead1_5=true&disableMaxGroupSize=true&geoBoostFix=true&neighborhood_geos=true&details=true&link_type=hotel%2Cvr%2Ceat%2Cattr&rescue=true&uiOrigin=trip_search_Hotels&source=trip_search_Hotels&startTime=1516800919604&nearPages=true' -O results.json

What is the correct use of Proxy in scrapy?

My code is:
import scrapy
from scrapy import log
from scrapy.exceptions import IgnoreRequest


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://www.*****']
    custom_settings = {
        'DOWNLOAD_DELAY': '5',
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
        'scrapy_proxies.RandomProxy': 100,
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    }
    PROXY_LIST = '/path/to/proxy/list.txt'

    def parse(self, response):
        bannCheck = response.css('.lead ::text').extract_first()
        for title in response.css('.seo-directory-doctor-link'):
            yield {'title': title.css('a ::attr(href)').extract_first()}
        next_page = response.css('li.seo-directory-page > a[rel=next] ::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
That is how I try to use a proxy and custom_settings with a download delay of 5, but it's not working.
I don't know where settings.py should be located or how to configure it.
Maybe someone can give me an example for this code?
Hoping for your support,
Thanks
EDIT: Now I know I have to create settings.py in the folder where my project is saved.
I tried the example from https://github.com/aivarsk/scrapy-proxies,
but it doesn't work: it doesn't use the proxy list.
What's wrong?
I have worked with proxies very well by implementing them this way.
I used scrapy-proxies, and this is how I organise the code:
put randomproxy.py beside settings.py.
Settings
Inside your settings.py file, put this:
# Retry many times since proxies often fail
RETRY_TIMES = 5
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    # Fix path to this module
    'botcrawler.randomproxy.RandomProxy': 600,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

PROXY_LIST = '/home/user/botcrawler/botcrawler/proxy/list.txt'
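
For reference, PROXY_LIST points at a plain text file with one proxy URL per line, optionally with credentials; the hosts below are placeholders:

http://198.51.100.10:8080
http://user:password@203.0.113.25:3128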
Spider
Then, in your spider code (in the parse function), check that the proxy is working by checking for something on the page:
if not response.xpath('//title'):
    yield Request(url=response.url, dont_filter=True)
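
Putting that check in context, here is a sketch of a parse method built from the question's selectors (the spider name is made up; if the page has no <title>, the same URL is re-queued so the proxy middleware can pick another proxy on the retry):

import scrapy


class DoctorSpider(scrapy.Spider):
    name = "doctor_sketch"  # hypothetical name for this sketch
    start_urls = ['https://www.*****']  # placeholder, as in the question

    def parse(self, response):
        if not response.xpath('//title'):
            # Probably a ban page served through a bad proxy; retry the URL.
            yield scrapy.Request(url=response.url, dont_filter=True)
            return
        for title in response.css('.seo-directory-doctor-link'):
            yield {'title': title.css('a ::attr(href)').extract_first()}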
Hope that helped. Regards.

scraping an html file saved on the local system

For example, I have a site "www.example.com".
I want to scrape the HTML of this site by saving it to my local system,
so for testing I saved that page on my desktop as example.html.
Now I have written the spider code for this as below:
# imports as in older Scrapy versions (BaseSpider / HtmlXPathSelector era)
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class ExampleSpider(BaseSpider):
    name = "example"
    start_urls = ["example.html"]

    def parse(self, response):
        print response
        hxs = HtmlXPathSelector(response)
But when I run the above code, I get the error below:
ValueError: Missing scheme in request url: example.html
My intention is to scrape the example.html file, which contains the HTML of www.example.com saved on my local system.
Can anyone suggest how to point start_urls at that example.html file?
Thanks in advance
You can crawl a local file using a URL of the following form:
file:///path/to/file.html
You can use HttpCacheMiddleware, which gives you the ability to do a spider run from cache. The documentation for the HttpCacheMiddleware settings is located here.
Basically, adding the following settings to your settings.py will make it work:
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0  # set to 0 to never expire
This, however, requires an initial spider run from the web to populate the cache.
In Scrapy, you can scrape a local file using:
class ExampleSpider(BaseSpider):
    name = "example"
    start_urls = ["file:///path_of_directory/example.html"]

    def parse(self, response):
        print response
        hxs = HtmlXPathSelector(response)
I suggest you check it using scrapy shell 'file:///path_of_directory/example.html'
Just to share the way that I like to do this scraping with local files:
import scrapy
import os

LOCAL_FILENAME = 'example.html'
LOCAL_FOLDER = 'html_files'
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = [
        f"file://{BASE_DIR}/{LOCAL_FOLDER}/{LOCAL_FILENAME}"
    ]
I'm using f-strings (Python 3.6+, https://www.python.org/dev/peps/pep-0498/), but you can switch to %-formatting or str.format() if you prefer.
scrapy shell "file:E:\folder\to\your\script\Scrapy\teste1\teste1.html"
this works for me today on Windows 10.
I have to put the full path without the ////.
You can simply do:
def start_requests(self):
    yield Request(url='file:///path_of_directory/example.html')
If you look at the source code of Scrapy's Request (for example on GitHub), you can see that Scrapy sends a request to an HTTP server and gets the needed page back in the response. Your filesystem is not an HTTP server. For testing purposes with Scrapy, you can set up an HTTP server and then point Scrapy at URLs like:
http://127.0.0.1/example.html
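
A quick way to stand up such a local server is Python's built-in http.server; a minimal sketch (the port 8000 and served directory are arbitrary choices):

# serve_local.py -- run this in the directory containing example.html,
# then use start_urls = ["http://127.0.0.1:8000/example.html"] in the spider.
from http.server import HTTPServer, SimpleHTTPRequestHandler

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), SimpleHTTPRequestHandler).serve_forever()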
