Scrapy-Splash Waiting for Page to Load - python

I'm new to Scrapy and Splash, and I need to scrape data from single-page and regular web apps.
A caveat, though, is that I'm mostly scraping data from internal tools and applications, so some require authentication and all of them need at least a couple of seconds of loading time before the page fully renders.
I naively tried a Python time.sleep(seconds) and it didn't work. It seems like both SplashRequest and scrapy.Request run and yield results before the page has finished loading. I then learned about Lua scripts as arguments to these requests, and attempted a Lua script with various forms of wait(), but it looks like the requests never actually run the Lua scripts. The request finishes right away and my HTML selectors don't find anything I'm looking for.
I'm following the directions from here https://github.com/scrapy-plugins/scrapy-splash, have their Docker instance running on localhost:8050, and have created a settings.py.
Anyone with experience here know what I might be missing?
Thanks!
spider.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest
import logging
import base64
import time
# from selenium import webdriver

# lua_script = """
# function main(splash)
#     splash:set_user_agent(splash.args.ua)
#     assert(splash:go(splash.args.url))
#     splash:wait(5)
#     -- requires Splash 2.3
#     -- while not splash:select('#user-form') do
#     --     splash:wait(5)
#     -- end
#     repeat
#         splash:wait(5)
#     until( splash:select('#user-form') ~= nil )
#     return {html=splash:html()}
# end
# """

load_page_script = """
function main(splash)
    splash:set_user_agent(splash.args.ua)
    assert(splash:go(splash.args.url))
    splash:wait(5)

    function wait_for(splash, condition)
        while not condition() do
            splash:wait(0.5)
        end
    end

    local result, error = splash:wait_for_resume([[
        function main(splash) {
            setTimeout(function () {
                splash.resume();
            }, 5000);
        }
    ]])

    wait_for(splash, function()
        return splash:evaljs("document.querySelector('#user-form') != null")
    end)

    -- repeat
    --     splash:wait(5)
    -- until( splash:select('#user-form') ~= nil )

    return {html=splash:html()}
end
"""
class HelpSpider(scrapy.Spider):
    name = "help"
    allowed_domains = ["secet_internal_url.com"]
    start_urls = ['https://secet_internal_url.com']
    # http_user = 'splash-user'
    # http_pass = 'splash-password'

    def start_requests(self):
        logger = logging.getLogger()
        login_page = 'https://secet_internal_url.com/#/auth'

        splash_args = {
            'html': 1,
            'png': 1,
            'width': 600,
            'render_all': 1,
            'lua_source': load_page_script
        }
        # splash_args = {
        #     'html': 1,
        #     'png': 1,
        #     'width': 600,
        #     'render_all': 1,
        #     'lua_source': lua_script
        # }
        yield SplashRequest(login_page, self.parse, endpoint='execute', magic_response=True, args=splash_args)

    def parse(self, response):
        # time.sleep(10)
        logger = logging.getLogger()
        html = response._body.decode("utf-8")

        # Looking for a form with the ID 'user-form'
        form = response.css('#user-form')
        logger.info("####################")
        logger.info(form)
        logger.info("####################")

I figured it out!
Short Answer
My Spider class was configured incorrectly for using Splash with Scrapy.
Long Answer
Part of running Splash with Scrapy is, in my case, running a local Docker instance that loads my requests and runs the Lua scripts. The important caveat is that the Splash settings described on the GitHub page must be a property of the spider class itself, so I added this code to my Spider:
custom_settings = {
    'SPLASH_URL': 'http://localhost:8050',
    # if installed Docker Toolbox:
    # 'SPLASH_URL': 'http://192.168.99.100:8050',
    'DOWNLOADER_MIDDLEWARES': {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    },
    'SPIDER_MIDDLEWARES': {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    },
    'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
}
Then I noticed my Lua code running, and the Docker container logs showed the interactions. After fixing errors with splash:select(), my login script worked, as did my waits:
splash:wait( seconds_to_wait )
Lastly, I created a Lua script to handle logging in, redirecting, and gathering links and text from pages. My application is an AngularJS app, so I can't gather links or visit them directly; I have to click through. This script let me run through every link, click it, and gather content.
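For reference, a stripped-down sketch of that kind of script is below. The selectors (#user-form, a.nav-link), the form field names, and the credentials passed in via splash.args are placeholders for this example, not the real ones from my internal app; it relies on the Splash element API (splash:select, element:fill, element:mouse_click), which needs Splash 2.3+.

login_and_click_script = """
function main(splash)
    splash:set_user_agent(splash.args.ua)
    assert(splash:go(splash.args.url))

    -- wait until the Angular app has rendered the login form
    -- (selector is a placeholder)
    while not splash:select('#user-form') do
        splash:wait(0.5)
    end

    -- fill in and submit the login form
    local form = splash:select('#user-form')
    assert(form:fill({ username = splash.args.username,
                       password = splash.args.password }))
    assert(form:submit())
    splash:wait(3)

    -- click each nav link and collect the rendered HTML after every click
    local pages = {}
    local links = splash:select_all('a.nav-link')
    for i, link in ipairs(links) do
        link:mouse_click()
        splash:wait(2)
        pages[i] = splash:html()
    end

    return {pages = pages}
end
"""

It gets passed to the execute endpoint the same way as load_page_script above, for example:

yield SplashRequest(login_page, self.parse, endpoint='execute',
                    args={'lua_source': login_and_click_script,
                          'ua': 'Mozilla/5.0',
                          'username': 'placeholder-user',
                          'password': 'placeholder-pass'})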
I suppose an alternative solution would have been to use end-to-end testing tools such as Selenium/WebDriver or Cypress, but I prefer to use scrapy to scrape and testing tools to test. To each their own (Python or NodeJS tools), I suppose.
Neat Trick
Another thing that's really helpful for debugging: while the Scrapy-Splash Docker instance is running, you can visit its URL in your browser and use the interactive "request tester" to try out Lua scripts and see the rendered HTML results (for example, to verify a login or a page visit). For me this URL was http://0.0.0.0:8050; it's the same URL you set in your settings, so it should match your Docker container.
Cheers!

Related

How to implement proxyrack with scrapy

Recently I bought an IP rotation service from ProxyRack and I want to use it with Scrapy. Their example is for the requests library, and I'm confused about how to implement it with Scrapy. Please help me. Here is their code, which I want to adapt for Scrapy:
import requests

username = "vranesevic"
password = "svranesevic"
PROXY_RACK_DNS = "megaproxy.rotating.proxyrack.net:222"

urlToGet = "http://ip-api.com/json"
proxy = {"http": "http://{}:{}@{}".format(username, password, PROXY_RACK_DNS)}

r = requests.get(urlToGet, proxies=proxy)
print("Response:\n{}".format(r.text))
You can follow the Scrapy documentation on how to set up a custom proxy; if you are not familiar with that, here are the steps.
Step 1 - Go to the middlewares.py file and paste this. Change the URL first to the one provided by ProxyRack, keeping the http:// scheme. Also, set the ProxyRack user and password inside basic_auth_header.
from w3lib.http import basic_auth_header

class CustomProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta["proxy"] = "http://192.168.1.1:8050"
        request.headers["Proxy-Authorization"] = basic_auth_header(
            "<proxy_user>", "<proxy_pass>")
Step 2 - Go to the settings.py file and enable DOWNLOADER_MIDDLEWARES by pasting this at the bottom. Also, make sure you replace the word myproject with your project name.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}
That's it and you are ready to go.
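For example, plugging in the ProxyRack details from the question, the middleware might end up looking roughly like this (a sketch only; the endpoint and credentials are copied from the question's requests example, so confirm the exact host and port in your ProxyRack dashboard):

from w3lib.http import basic_auth_header

class CustomProxyMiddleware(object):
    def process_request(self, request, spider):
        # ProxyRack endpoint and credentials taken from the question's example
        request.meta["proxy"] = "http://megaproxy.rotating.proxyrack.net:222"
        request.headers["Proxy-Authorization"] = basic_auth_header(
            "vranesevic", "svranesevic")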

Can't get rid of a problematic page even when using rotation of proxies within scrapy

I've created a script using Scrapy that implements rotation of proxies to parse the address from a few hundred similar links like this. I supply those links from a csv file within the script.
The script is doing fine until it encounters a response URL like https://www.bcassessment.ca//Property/UsageValidation; once the script starts getting that link, it can't get past it. FYI, I'm using a meta key containing lead_link to retry with the original link instead of the redirected one, so I should be able to bypass that barrier.
It doesn't happen when I use proxies with the requests library. To be clearer: while using requests, the script does encounter the /Property/UsageValidation page, but it bypasses it successfully after a few retries.
The spider is like:
import csv

import scrapy
from scrapy.crawler import CrawlerProcess


class mySpider(scrapy.Spider):
    name = "myspider"

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'stackoverflow_spider.middlewares.ProxiesMiddleware': 100,
        }
    }

    def start_requests(self):
        with open("output_main.csv", "r") as f:
            reader = csv.DictReader(f)
            for item in list(reader):
                lead_link = item['link']
                yield scrapy.Request(
                    lead_link, self.parse,
                    meta={"lead_link": lead_link, "download_timeout": 20},
                    dont_filter=True,
                )

    def parse(self, response):
        address = response.css("h1#mainaddresstitle::text").get()
        print(response.meta['proxy'], address)


if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'LOG_LEVEL': 'ERROR',
    })
    c.crawl(mySpider)
    c.start()
How can I keep the script from getting stuck on that page?
PS: I've attached a few of the links in a text file in case anyone wants to give it a try.
To make the proxy implementation session-safe for a Scrapy app, you need to add an additional cookiejar meta key wherever you assign the proxy to request.meta, like this:
....
yield scrapy.Request(url=link, meta={"proxy": address, "cookiejar": address})
In this case Scrapy's CookiesMiddleware will create an additional cookie session for each proxy.
Related specifics of Scrapy's proxy implementation are mentioned in this answer.
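Applied to the spider above, where the proxy is assigned inside ProxiesMiddleware, a minimal sketch could look like this (the middleware's internals aren't shown in the question, so the proxy list and the random choice here are assumptions made purely for illustration):

import random

class ProxiesMiddleware(object):
    # Illustrative sketch: the real middleware's proxy source isn't shown
    # in the question, so a hard-coded list stands in for it here.
    PROXIES = [
        "http://user:pass@proxy-one.example.com:3128",
        "http://user:pass@proxy-two.example.com:3128",
    ]

    def process_request(self, request, spider):
        address = random.choice(self.PROXIES)
        request.meta["proxy"] = address
        # Pair each proxy with its own cookie jar so CookiesMiddleware
        # keeps a separate session per proxy.
        request.meta["cookiejar"] = address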

Scrapy not following the next parse function

I am trying to write a simple scraping script to scrape Google Summer of Code orgs that use the tech I require. It's a work in progress. My parse function is working fine, but whenever it calls back into the parse_org function, that callback doesn't produce any output.
# -*- coding: utf-8 -*-
import scrapy


class GsocSpider(scrapy.Spider):
    name = 'gsoc'
    allowed_domains = ['https://summerofcode.withgoogle.com/archive/2018/organizations/']
    start_urls = ['https://summerofcode.withgoogle.com/archive/2018/organizations/']

    def parse(self, response):
        for href in response.css('li.organization-card__container a.organization-card__link::attr(href)'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_org)

    def parse_org(self, response):
        tech = response.css('li.organization__tag organization__tag--technology::text').extract()
        # if 'python' in tech:
        yield {
            'name': response.css('title::text').extract_first()
            # 'ideas_list': response.css('')
        }
First of all, you are configuring allowed_domains incorrectly, as specified in the documentation:
An optional list of strings containing domains that this spider is
allowed to crawl. Requests for URLs not belonging to the domain names
specified in this list (or their subdomains) won’t be followed if
OffsiteMiddleware is enabled.
Let’s say your target url is https://www.example.com/1.html, then add
'example.com' to the list.
As you can see, you need to include only the domains; this is a filtering functionality (so other domains don't get crawled). It is also optional, so I would actually recommend not including it.
Also, your CSS selector for getting tech is incorrect; it should be:
li.organization__tag.organization__tag--technology
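Putting both fixes together, a corrected version of the spider might look roughly like this (a sketch based on the question's code; the 'python' membership check is re-enabled here with a little normalization, since tag text may vary in case and whitespace):

import scrapy


class GsocSpider(scrapy.Spider):
    name = 'gsoc'
    # just the domain (and it's optional, so you can drop it entirely)
    allowed_domains = ['summerofcode.withgoogle.com']
    start_urls = ['https://summerofcode.withgoogle.com/archive/2018/organizations/']

    def parse(self, response):
        for href in response.css('li.organization-card__container a.organization-card__link::attr(href)'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_org)

    def parse_org(self, response):
        # both classes are on the same <li>, so chain them with a dot
        tech = response.css('li.organization__tag.organization__tag--technology::text').extract()
        if 'python' in [t.strip().lower() for t in tech]:
            yield {
                'name': response.css('title::text').extract_first(),
            }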

(Python) Interacting with Ajax webpages via Scrapy

System: Windows 10, Python 2.7.15, Scrapy 1.5.1
Goal: Retrieve text from within the HTML markup for each of the link items on the target website, including those revealed (6 at a time) via the '+ SEE MORE ARCHIVES' button.
Target Website: https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info
Initial Progress: Python and Scrapy successfully installed. The following code...
import scrapy
from scrapy import Request


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': ["href", "eventtype", "eventmonth", "eventdate", "eventyear"],
    }

    def start_requests(self):
        urls = [
            'https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info',
        ]
        for url in urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        for event in response.css('div.article-item-extended'):
            yield {
                'href': event.css('a::attr(href)').extract(),
                'eventtype': event.css('h3::text').extract(),
                'eventmonth': event.css('span.month::text').extract(),
                'eventdate': event.css('span.day::text').extract(),
                'eventyear': event.css('span.year::text').extract(),
            }
...successfully produces the following results (when -o to .csv)...
href,eventtype,eventmonth,eventdate,eventyear
/en/articles/archive/mtgo-standings/competitive-standard-constructed-league-2018-08-02,Competitive Standard Constructed League, August ,2, 2018
/en/articles/archive/mtgo-standings/pauper-constructed-league-2018-08-01,Pauper Constructed League, August ,1, 2018
/en/articles/archive/mtgo-standings/competitive-modern-constructed-league-2018-07-31,Competitive Modern Constructed League, July ,31, 2018
/en/articles/archive/mtgo-standings/pauper-challenge-2018-07-30,Pauper Challenge, July ,30, 2018
/en/articles/archive/mtgo-standings/legacy-challenge-2018-07-30,Legacy Challenge, July ,30, 2018
/en/articles/archive/mtgo-standings/competitive-standard-constructed-league-2018-07-30,Competitive Standard Constructed League, July ,30, 2018
However, the spider will not touch any of the info buried behind the Ajax button. I've done a fair amount of Googling and digesting of documentation, example articles, and 'help me' posts. I am under the impression that to get the spider to actually see the Ajax-buried info, I need to simulate some sort of request: variously, the correct type of request might be an XHR, a scrapy FormRequest, or something else. I am simply too new to web architecture in general to be able to surmise the answer.
I hacked together a version of the initial code that calls a FormRequest, which still seems to reach the initial page just fine, yet incrementing the only parameter that appears to change (when inspecting the XHR calls sent when physically clicking the button on the page) does not appear to have an effect. That code is here...
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': ["href", "eventtype", "eventmonth", "eventdate", "eventyear"],
    }

    def start_requests(self):
        for i in range(1, 10):
            yield scrapy.FormRequest(
                url='https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info',
                formdata={'l': 'en', 'f': '9041', 'search-result-theme': '', 'limit': '6',
                          'fromDate': '', 'toDate': '', 'event_format': '0', 'sort': 'DESC',
                          'word': '', 'offset': str(i*6)},
                callback=self.parse)

    def parse(self, response):
        for event in response.css('div.article-item-extended'):
            yield {
                'href': event.css('a::attr(href)').extract(),
                'eventtype': event.css('h3::text').extract(),
                'eventmonth': event.css('span.month::text').extract(),
                'eventdate': event.css('span.day::text').extract(),
                'eventyear': event.css('span.year::text').extract(),
            }
...and the results are the same as before, except the 6 output lines are repeated, as a block, 9 extra times.
Can anyone help point me to what I am missing? Thank you in advance.
Postscript: I always seem to get heckled out of my chair whenever I seek help with coding problems. If I am doing something wrong, please have mercy on me; I will do whatever I can to correct it.
Scrapy doesn't render dynamic content, so you need something else to deal with the JavaScript. Try one of these:
scrapy + selenium
scrapy + splash
This blog post about scrapy + splash has a good introduction to the topic.
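As a rough sketch of the scrapy + splash option for this particular page (it assumes a Splash instance running on localhost:8050 with the scrapy-splash settings from its README already configured, and the 5-second wait is an arbitrary guess at how long the Ajax content needs to load):

import scrapy
from scrapy_splash import SplashRequest


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info'
        # render the page in Splash and give the JavaScript some time to run
        yield SplashRequest(url, self.parse, endpoint='render.html', args={'wait': 5})

    def parse(self, response):
        for event in response.css('div.article-item-extended'):
            yield {
                'href': event.css('a::attr(href)').extract(),
                'eventtype': event.css('h3::text').extract(),
            }

Note that a plain wait only renders what loads by itself; repeatedly clicking the '+ SEE MORE ARCHIVES' button would still need the execute endpoint and a small Lua script, along the lines of the accepted answer at the top of this page.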

Scrapy Deltafetch incremental crawling

I am working with Scrapy to scrape a website, and I want to extract only those items which have not been scraped in the previous run.
I am trying it on the "https://www.ndtv.com/top-stories" website, to extract only the 1st headline if it has been updated.
Below is my code:
import scrapy
from selenium import webdriver
from w3lib.url import url_query_parameter


class QuotesSpider(scrapy.Spider):
    name = "test"
    start_urls = [
        'https://www.ndtv.com/top-stories',
    ]

    def parse(self, response):
        print('testing')
        print(response.url)
        yield {
            'heading': response.css('div.nstory_header a::text').extract_first(),
        }
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawl_once.CrawlOnceMiddleware': 100,
}

SPIDER_MIDDLEWARES = {
    # 'inc_crawling.middlewares.IncCrawlingSpiderMiddleware': 543,
    'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': True,
    'scrapy_deltafetch.DeltaFetch': 100,
    'scrapy_crawl_once.CrawlOnceMiddleware': 100,
    'scrapylib.deltafetch.DeltaFetch': 100,
    'inc_crawling.middlewares.deltafetch.DeltaFetch': 100,
}

COOKIES_ENABLED = True
COOKIES_DEBUG = True

DELTAFETCH_ENABLED = True
DELTAFETCH_DIR = '/home/administrator/apps/inc_crawling'
DOTSCRAPY_ENABLED = True
The settings above are what I have added to my settings.py file.
I am running the above code with the "scrapy crawl test -o test.json" command, and after each run the .db file and test.json file get updated.
My expectation is that the .db file should only get updated when the 1st headline has changed.
Kindly help me if there is a better approach to extracting the updated headline.
A good way to implement this would be to override the DUPEFILTER_CLASS so it checks your database before doing the actual requests.
Scrapy uses a dupefilter class to avoid requesting the same URL twice, but by default it only works within a single run of the spider.
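A minimal sketch of such a dupefilter, assuming a simple SQLite file of already-seen request fingerprints (the seen_requests.db filename, the table name, and the module path in the settings comment are made up for this example; it uses the request_fingerprint helper from the Scrapy versions this question dates from):

import sqlite3

from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.request import request_fingerprint


class DBDupeFilter(RFPDupeFilter):
    """Skip any request whose fingerprint was stored by a previous run."""

    def __init__(self, path=None, debug=False):
        super().__init__(path, debug)
        # Hypothetical store: a SQLite file holding fingerprints of requests
        # that were successfully scraped before (written, for example, by an
        # item pipeline after each item is extracted).
        self.conn = sqlite3.connect('seen_requests.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS seen (fp TEXT PRIMARY KEY)')

    def request_seen(self, request):
        fp = request_fingerprint(request)
        if self.conn.execute('SELECT 1 FROM seen WHERE fp = ?', (fp,)).fetchone():
            return True  # scraped in an earlier run -> drop the request
        return super().request_seen(request)  # normal in-run deduplication

# settings.py (module path is an assumption based on the question's project name)
# DUPEFILTER_CLASS = 'inc_crawling.dupefilters.DBDupeFilter'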
