Unable to scrape while running scrapy spider sequentially - python

I'm new to Scrapy and I'm practicing with an example. I want to run Scrapy spiders sequentially, but when I use the code from the documentation
(https://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script) to run the crawl from a script, it doesn't work. The spider opens and closes instantly without scraping any data from the website. When I run the spider on its own using "scrapy crawl", it works. I don't understand why the spider scrapes data when I call it alone but not when I try to run it sequentially. If someone could help me with that, it would be great.
Here's the code that I'm using:
class APASpider(scrapy.Spider):
    name = 'APA_test'
    allowed_domains = ['some_domain.com']
    start_urls = ['startin_url']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='execute',
                cache_args=['lua_source'],
                args={'lua_source': script,'timeout': 3600},
                headers={'X-My-Header': 'value'},
            )

    def parse(self, response):
        for href in response.xpath('//a[@class="product-link"]/@href').extract():
            yield SplashRequest(response.urljoin(href),self.parse_produits,
                endpoint='execute',
                cache_args=['lua_source'],
                args={'lua_source': script,'timeout': 3600},
                headers={'X-My-Header': 'value'},
            )
        for pages in response.xpath('//*[@id="loadmore"]/@href'):
            yield SplashRequest(response.urljoin(pages.extract()),self.parse,
                endpoint='execute',
                cache_args=['lua_source'],
                args={'lua_source': script,'timeout': 3600},
                headers={'X-My-Header': 'value'},
            )

    def parse_produits(self,response):
        Nom = response.xpath("//h1/text()").extract()
        Poids = response.xpath('//p[@class="description"]/text()').extract()
        item_APA = APAitem()
        item_APA["Titre"] = Nom
        item_APA["Poids"] = Poids
        yield item_APA

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(APASpider)
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished
Thank you

It's hard to tell exactly what the issue is, since the question provides no log messages.
That said, I'll still try to answer, as I had this same issue a while ago.
There is a known issue with scrapy_splash concerning the line local last_response = entries[#entries].response in Splash scripts. I'm assuming you have it in your script, as I did.
The workaround I used was to check that the history is not empty before taking the last entry (as suggested by GitHub user kmike).
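A minimal sketch of that guard, written as the Python string that would be passed as lua_source. The surrounding script is an assumption, since the question does not show the asker's actual Lua code, and the 0.5 second wait and the returned fields are only illustrative:

```python
# Hypothetical Splash script; only the "#entries > 0" guard reflects the workaround.
script = """
function main(splash)
    assert(splash:go(splash.args.url))
    assert(splash:wait(0.5))

    local result = {url = splash:url(), html = splash:html()}

    -- Guard: only read the last history entry when the history is not empty.
    local entries = splash:history()
    if #entries > 0 then
        local last_response = entries[#entries].response
        result.headers = last_response.headers
        result.http_status = last_response.status
    end
    return result
end
"""
```

With the guard in place, an empty history simply leaves headers and http_status out of the result instead of raising an error inside the script.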

Related

Having problems with a scrapy-splash script. I only get one result and my scraper does not parse other pages

I am trying to parse a list from a JavaScript-rendered website. When I run it, it only gives me back one entry for each column and then the spider shuts down. I have already set up my middleware settings. I am not sure what is going wrong. Thanks in advance!
import scrapy
from scrapy_splash import SplashRequest

class MalrusSpider(scrapy.Spider):
    name = 'malrus'
    allowed_domains = ['backgroundscreeninginrussia.com']
    start_urls = ['http://www.backgroundscreeninginrussia.com/publications/new-citizens-of-malta-since-january-2015-till-december-2017/']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url,
                                callback=self.parse,
                                endpoint='render.html')

    def parse(self, response):
        russians = response.xpath('//table[@id="tablepress-8"]')
        for russian in russians:
            yield {'name' : russian.xpath('//*[@class="column-1"]/text()').extract_first(),
                   'source' : russian.xpath('//*[@class="column-2"]/text()').extract_first()}
        script = """function main(splash)
            assert(splash:go(splash.args.url))
            splash:wait(0.3)
            button = splash:select("a[class=paginate_button next] a")
            splash:set_viewport_full()
            splash:wait(0.1)
            button:mouse_click()
            splash:wait(1)
            return {url = splash:url(),
                    html = splash:html()}
        end"""
        yield SplashRequest(url=response.url,
                            callback=self.parse,
                            endpoint='execute',
                            args={'lua_source': script})
The .extract_first() method (now .get()) always returns only the first match. It is not an iterator, so there is no point in calling it repeatedly. Try the .getall() method instead. That would look something like:
names = response.xpath('//table[@id="tablepress-8"]').xpath('//*[@class="column-1"]/text()').getall()
sources = response.xpath('//table[@id="tablepress-8"]').xpath('//*[@class="column-2"]/text()').getall()
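A related point is that an XPath starting with // searches the whole document even when called on a sub-selection, so the loop in the question keeps hitting the same first cell. The row-by-row alternative can be sketched with the standard library's ElementTree as a stand-in for Scrapy selectors; the table markup below is made up for illustration. In a Scrapy callback the analogous pattern would select the rows first and then use relative paths beginning with "./" on each row.

```python
import xml.etree.ElementTree as ET

# Made-up markup that mimics the structure of the table in the question.
HTML = """
<table id="tablepress-8">
  <tr><td class="column-1">Name A</td><td class="column-2">Source A</td></tr>
  <tr><td class="column-1">Name B</td><td class="column-2">Source B</td></tr>
</table>
"""

def extract_rows(markup):
    """Return one dict per table row, reading both cells relative to that row."""
    table = ET.fromstring(markup.strip())
    rows = []
    for tr in table.findall(".//tr"):
        # These paths are relative to the current row, not to the whole document.
        name = tr.find("td[@class='column-1']")
        source = tr.find("td[@class='column-2']")
        rows.append({"name": name.text, "source": source.text})
    return rows

rows = extract_rows(HTML)
for row in rows:
    print(row)
```

Keeping both cells in one dict per row avoids having to zip two separate lists back together afterwards.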

How does Scrapy proceed with the urls given in the urls variable under start_requests?

Just wondering why, when I have url = ['site1', 'site2'] and I run Scrapy from a script calling .crawl() twice in a row, like
def run_spiders():
    process.crawl(Spider)
    process.crawl(Spider)
the output is:
site1info
site1info
site2info
site2info
as opposed to
site1info
site2info
site1info
site2info
Because as soon as you call process.start(), requests are handled asynchronously. The order is not guaranteed.
In fact, even if you only call process.crawl() once, you may sometimes get:
site2info
site1info
To run spiders sequentially from Python, see this other answer.
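The effect can be illustrated without Scrapy. The sketch below uses asyncio purely as an analogy (Scrapy itself is built on Twisted), and the delays are invented: site1 is requested first but answers more slowly, so its result arrives last.

```python
import asyncio

async def fetch(site, delay):
    """Pretend to download a page; delay stands in for network latency."""
    await asyncio.sleep(delay)
    return site + "info"

async def crawl():
    # site1 is scheduled first, but it is the slower of the two.
    tasks = [
        asyncio.create_task(fetch("site1", 0.05)),
        asyncio.create_task(fetch("site2", 0.01)),
    ]
    finished = []
    # as_completed yields results in completion order, not submission order.
    for future in asyncio.as_completed(tasks):
        finished.append(await future)
    return finished

completion_order = asyncio.run(crawl())
print(completion_order)
```

The order in which results come back depends on how long each response takes, which is why the output of concurrent crawls should not be expected in any fixed order.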
start_requests relies on yield, which hands requests to Scrapy one at a time to be queued. To understand it fully, read this StackOverflow answer.
Here is a code example of how it works with start_urls in the start_requests method.
start_urls = [
    "url1.com",
    "url2.com",
]

def start_requests(self):
    for u in self.start_urls:
        yield scrapy.Request(u, callback=self.parse)
For custom request ordering, the priority argument of Request can be used.
def start_requests(self):
    yield scrapy.Request(self.start_urls[0], callback=self.parse)
    yield scrapy.Request(self.start_urls[1], callback=self.parse, priority=1)
The request with the higher priority value is taken from the queue first. By default, priority is 0.
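To make the ordering rule concrete, here is a small standalone model of a priority queue of URLs. It is not Scrapy's scheduler: the class name is hypothetical, and breaking ties in first-in, first-out order is a simplification that Scrapy does not necessarily share.

```python
import heapq
from itertools import count

class PriorityRequestQueue:
    """Toy queue: a higher priority value is dequeued first (illustration only)."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker so equal priorities stay first-in, first-out

    def push(self, url, priority=0):
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._order), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

queue = PriorityRequestQueue()
queue.push("url1.com")              # default priority 0
queue.push("url2.com", priority=1)  # queued later, but with a higher priority
queue.push("url3.com")

dequeued = [queue.pop() for _ in range(len(queue))]
print(dequeued)
```

Even though url2.com is pushed second, it comes out first because of its higher priority value.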

Running Scrapy for multiple times on same URL

I'd like to crawl a certain URL which returns a random response each time it's called. The code below returns what I want, but I'd like to run it for a long time so that I can use the data for an NLP application. This code only runs once with scrapy crawl the, though I expected it to keep running because of the last if statement.
Is Unix's start command what I'm looking for? I tried it, but it felt a bit slow. If I had to use the start command, would opening many terminal tabs and running the same command with the start prefix be good practice, or would it just throttle the speed?
class TheSpider(scrapy.Spider):
    name = 'the'
    allowed_domains = ['https://websiteiwannacrawl.com']
    start_urls = ['https://websiteiwannacrawl.com']

    def parse(self, response):
        info = {}
        info['text'] = response.css('.pd-text').extract()
        yield info
        next_page = 'https://websiteiwannacrawl.com'
        if next_page is not None:
            yield scrapy.Request(next_page, callback=self.parse)
From the documentation for the dont_filter argument of Request:
dont_filter indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
You should add it to your Request:
yield scrapy.Request(next_page, dont_filter=True)
This is not related to your question, but regarding callback=self.parse, please read the documentation on the parse method.
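The behaviour of the flag can be sketched with a toy duplicates filter. This is a simplified model and not Scrapy's implementation: the class and method names are hypothetical, and it compares raw URLs where Scrapy compares request fingerprints.

```python
class SeenFilter:
    """Toy duplicates filter keyed on the URL (illustration only)."""

    def __init__(self):
        self._seen = set()

    def should_schedule(self, url, dont_filter=False):
        # Requests flagged with dont_filter bypass the duplicate check entirely.
        if dont_filter:
            return True
        if url in self._seen:
            return False
        self._seen.add(url)
        return True

url = "https://websiteiwannacrawl.com"

dupefilter = SeenFilter()
without_flag = [dupefilter.should_schedule(url) for _ in range(3)]
with_flag = [dupefilter.should_schedule(url, dont_filter=True) for _ in range(3)]

print(without_flag)
print(with_flag)
```

Without the flag only the first identical request gets through, which matches the spider stopping early; with the flag every repeat is scheduled, so the loop needs its own stopping condition.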

How to handle large number of requests in scrapy?

I'm crawling around 20 million URLs, but before any request is actually made the process gets killed due to excessive memory usage (4 GB RAM). How can I handle this in Scrapy so that the process doesn't get killed?
class MySpider(Spider):
    name = "mydomain"
    allowed_domains = ["mydomain.com"]
    urls = []
    for d in range(0,20000000):
        link = "http://example.com/"+str(d)
        urls.append(link)
    start_urls = urls

    def parse(self, response):
        yield response
I think I found a workaround.
Add this method to your spider:
def start_requests(self):
    for d in range(1,26999999):
        yield scrapy.Request("http://example.com/"+str(d), self.parse)
You don't have to define start_urls at all.
The spider will generate URLs lazily and send asynchronous requests, and the callback is called when Scrapy gets each response. Memory usage is higher at the start, but later on it levels off to a roughly constant amount.
Along with this, you can use
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
With a JOBDIR set, you can pause the spider and resume it at any time by running the same command.
To save CPU (and log storage), set
LOG_LEVEL = 'INFO'
in settings.py of the Scrapy project.
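The memory difference between a prebuilt list and a generator can be seen in plain Python. This sketch uses a smaller, arbitrary count than the question so that it runs quickly, and the function name is made up for the example:

```python
import sys
from itertools import islice

def url_generator(n):
    """Yield URLs one at a time instead of building them all up front."""
    for d in range(n):
        yield "http://example.com/" + str(d)

N = 100_000  # scaled down from the 20 million in the question

eager = ["http://example.com/" + str(d) for d in range(N)]  # every URL held in memory
lazy = url_generator(N)                                     # nothing generated yet

# getsizeof reports the container only, not the strings the list also keeps alive.
list_bytes = sys.getsizeof(eager)
gen_bytes = sys.getsizeof(lazy)

first_three = list(islice(url_generator(N), 3))
print(list_bytes, gen_bytes, first_three)
```

The generator object stays small no matter how large N is, because each URL only exists once it is requested.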
I believe creating a big list of URLs to use as start_urls may be causing the problem.
How about doing this instead?
from scrapy import Request, Spider

class MySpider(Spider):
    name = "mydomain"
    allowed_domains = ["mydomain.com"]
    start_urls = ["http://example.com/0"]

    def parse(self, response):
        # range is lazy, so the URLs are produced one at a time
        for d in range(1,20000000):
            link = "http://example.com/"+str(d)
            yield Request(url=link, callback=self.parse_link)

    def parse_link(self, response):
        yield response

Scrapy only scrapes the first start url in a list of 15 start urls

I am new to Scrapy and am attempting to teach myself the basics. I have put together code that goes to the Louisiana Department of Natural Resources website to retrieve the serial number for certain oil wells.
I have each well's link listed in start_urls, but Scrapy only downloads data from the first URL. What am I doing wrong?
import scrapy
from scrapy import Spider
from scrapy.selector import Selector
from mike.items import MikeItem

class SonrisSpider(Spider):
    name = "sspider"
    start_urls = [
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=207899",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=971683",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=214206",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=159420",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=243671",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248942",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=156613",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=972498",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=215443",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248463",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=195136",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=179181",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=199930",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=203419",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=220454",
    ]

    def parse(self, response):
        item = MikeItem()
        item['serial'] = response.xpath('/html/body/table[1]/tr[2]/td[1]/text()').extract()[0]
        yield item
Thank you for any help you might be able to provide. If I have not explained my problem thoroughly, please let me know and I will attempt to clarify.
I think this code might help.
By default, Scrapy filters out duplicate requests. My understanding is that, because only the query parameter differs between your start URLs, Scrapy treats the remaining URLs as duplicates of the first one, which is why your spider stops after fetching the first URL. To fetch the rest of the URLs, enable the dont_filter flag on each request (see start_requests() in the code). The callback also falls back to 'n/a' when the XPath matches nothing, so a page without the expected cell no longer raises an IndexError.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from mike.items import MikeItem

class SonrisSpider(scrapy.Spider):
    name = "sspider"
    allowed_domains = ["sonlite.dnr.state.la.us"]
    start_urls = [
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=207899",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=971683",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=214206",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=159420",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=243671",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248942",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=156613",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=972498",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=215443",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248463",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=195136",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=179181",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=199930",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=203419",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=220454",
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse_data, dont_filter=True)

    def parse_data(self, response):
        item = MikeItem()
        serial = response.xpath(
            '/html/body/table[1]/tr[2]/td[1]/text()').extract()
        serial = serial[0] if serial else 'n/a'
        item['serial'] = serial
        yield item
Sample output returned by this spider is as follows:
{'serial': u'207899'}
{'serial': u'971683'}
{'serial': u'214206'}
{'serial': u'159420'}
{'serial': u'248942'}
{'serial': u'243671'}
Your code looks fine. Try adding this method:
class SonrisSpider(Spider):
    def start_requests(self):
        for url in self.start_urls:
            print(url)
            yield self.make_requests_from_url(url)
    #the result of your code goes here
The URLs should now be printed. Test it, and let me know if they are not.
