Is it possible to crawl multiple start_urls lists simultaneously? - Python

I have 3 URL files, all with the same structure, so the same spider can be used for all of them.
A special requirement is that all three need to be crawled simultaneously.
Is it possible to crawl them simultaneously without creating multiple spiders?
I believe this answer from Scrap multiple urls with scrapy:
start_urls = ["http://example.com/category/top/page-%d/" % i for i in range(4)] + \
             ["http://example.com/superurl/top/page-%d/" % i for i in range(55)]
only joins the two lists; it does not crawl them at the same time.
Thanks very much.

Use start_requests instead of start_urls; this will work for you:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        for page in range(1, 20):
            yield scrapy.Request('https://www.example.com/page-%s' % page)
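Coming back to the original question of crawling three URL files with one spider, here is a minimal sketch, assuming each file is a plain-text list of URLs (one per line); the file names are hypothetical:
import scrapy

class MultiListSpider(scrapy.Spider):
    name = 'multilist'

    # hypothetical file names; point these at your three URL files
    url_files = ['list1.txt', 'list2.txt', 'list3.txt']

    def start_requests(self):
        for path in self.url_files:
            with open(path) as f:
                for line in f:
                    url = line.strip()
                    if url:
                        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # same parsing logic for every list, since the pages share one structure
        yield {'url': response.url, 'title': response.css('title::text').get()}
Because Scrapy feeds every yielded request into the same asynchronous scheduler, the three lists are crawled concurrently rather than one file after another.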

Related

scrapy.spidermiddlewares.offsite DEBUG: Filtered offsite request to the website I want to scrape. Why don't I get to the parse method?

My goal is to print something from the parse method when I iterate through the for loop in the get_membership_no method.
I am using Python 3.8.5 and Scrapy 1.7.3. When I run the code below, the console output shows "Filtered offsite request".
Here is my code:
import scrapy
import json

class BasisMembersSpider(scrapy.Spider):
    name = 'basis'
    allowed_domains = ['www.basis.org.bd']

    def start_requests(self):
        yield scrapy.Request(url="https://basis.org.bd/get-member-list?page=1&team=",
                             callback=self.get_membership_no)

    def get_membership_no(self, response):
        data_array = json.loads(response.body)['data']
        for data in data_array:
            yield scrapy.Request(url='https://basis.org.bd/get-company-profile/{0}'.format(data['membership_no']),
                                 callback=self.parse)

    def parse(self, response):
        print("I want to get this line on console. thank you.")
The reason for this behavior is that you set allowed_domains = ['www.basis.org.bd'], which blocks requests to basis.org.bd.
You can either leave allowed_domains out completely or extend your list of allowed domains like this:
allowed_domains = ['www.basis.org.bd', 'basis.org.bd']
See the Scrapy documentation for allowed_domains for more information.
Removing "www." from allowed_domains worked for me.

How do I start crawling with multiple URLs of the same format in Scrapy

My Scrapy spider needs to start with URLs of the following format:
https://catalog.loc.gov/vwebv/search?searchArg={$variable}&searchCode=GKEY%5E*&searchType=1&limitTo=none&fromYear=&toYear=&limitTo=LOCA%3Dall&limitTo=PLAC%3Dall&limitTo=TYPE%3Dall&limitTo=LANG%3Dall&recCount=1000
where $variable is a parameter that can take many different values (possibly as many as 1000).
How do I implement this?
You could override the start_requests method with something like:
def start_requests(self):
    # the full search URL from the question, with {} in place of $variable
    base_url = 'https://catalog.loc.gov/vwebv/search?...'
    variables = [...]
    for variable in variables:
        url = base_url.format(variable)
        yield Request(url)
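A more complete sketch, filling in the full URL from the question with a {} placeholder for searchArg (the example search terms here are hypothetical):
import scrapy

class LocCatalogSpider(scrapy.Spider):
    name = 'loc_catalog'

    base_url = (
        'https://catalog.loc.gov/vwebv/search?searchArg={}'
        '&searchCode=GKEY%5E*&searchType=1&limitTo=none&fromYear=&toYear='
        '&limitTo=LOCA%3Dall&limitTo=PLAC%3Dall&limitTo=TYPE%3Dall'
        '&limitTo=LANG%3Dall&recCount=1000'
    )

    def start_requests(self):
        # hypothetical search terms; feed in as many values as needed
        variables = ['python', 'scrapy', 'crawling']
        for variable in variables:
            yield scrapy.Request(self.base_url.format(variable), callback=self.parse)

    def parse(self, response):
        pass  # extract whatever you need from each results page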

How to handle large number of requests in scrapy?

I'm crawling around 20 million URLs, but before the requests are actually made the process gets killed due to excessive memory usage (4 GB RAM). How can I handle this in Scrapy so that the process doesn't get killed?
class MySpider(Spider):
    name = "mydomain"
    allowed_domains = ["mydomain.com"]

    urls = []
    for d in range(0, 20000000):
        link = "http://example.com/" + str(d)
        urls.append(link)
    start_urls = urls

    def parse(self, response):
        yield response
I think I found the workaround.
Add this method to your spider:
def start_requests(self):
    for d in range(1, 26999999):
        yield scrapy.Request("http://example.com/" + str(d), self.parse)
You don't have to specify start_urls at all.
Scrapy will generate the URLs lazily and send asynchronous requests, calling the callback as each response arrives. Memory usage is higher at the start, but it settles to a roughly constant level afterwards.
Along with this you can use
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
which lets you pause the spider and resume it at any time with the same command.
To save CPU (and log storage), set
LOG_LEVEL = 'INFO'
in settings.py of the Scrapy project.
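For reference, a minimal settings.py sketch with the two settings mentioned above (JOBDIR can be set here instead of on the command line):
# settings.py (sketch)

LOG_LEVEL = 'INFO'             # log less than the default DEBUG level

# Persist scheduler state to disk so the crawl can be paused and resumed;
# equivalent to passing -s JOBDIR=crawls/somespider-1 on the command line.
JOBDIR = 'crawls/somespider-1'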
I believe creating a big list of URLs to use as start_urls may be causing the problem.
How about doing this instead?
class MySpider(Spider):
    name = "mydomain"
    allowed_domains = ["mydomain.com"]
    start_urls = ["http://example.com/0"]

    def parse(self, response):
        for d in range(1, 20000000):
            link = "http://example.com/" + str(d)
            yield Request(url=link, callback=self.parse_link)

    def parse_link(self, response):
        yield response

Webscrape from 2500 links - courses of action?

I have nearly 2,500 unique links. For each of the 2,500 pages I want to run BeautifulSoup and gather some text captured in paragraphs. I could create a variable for each link, but having 2,500 of them is obviously not the most efficient course of action. The links are contained in a list like the following:
linkslist = ["http://www.website.com/category/item1","http://www.website.com/category/item2","http://www.website.com/category/item3", ...]
Should I just write a for loop like the following?
for link in linkslist:
    opened_url = urllib2.urlopen(link).read()
    soup = BeautifulSoup(opened_url)
    ...
I'm looking for any constructive criticism. Thanks!
This is a good use case for Scrapy - a popular web-scraping framework based on Twisted:
Scrapy is written with Twisted, a popular event-driven networking framework for Python. Thus, it's implemented using non-blocking (aka asynchronous) code for concurrency.
Set the start_urls property of your spider and parse the page inside the parse() callback:
class MySpider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ["website.com"]
    start_urls = ["http://www.website.com/category/item1",
                  "http://www.website.com/category/item2",
                  "http://www.website.com/category/item3",
                  ...]

    def parse(self, response):
        print(response.xpath("//title/text()").extract())
How about writing a function that treats each URL separately?
def processURL(url):
    pass  # your scraping code here

# wrap map() in list() so it actually runs under Python 3, where map is lazy
list(map(processURL, linkslist))
This will run your function on each URL in your list. If you want to speed things up, it is easy to run in parallel:
from multiprocessing import Pool

with Pool(processes=10) as pool:
    results = pool.map(processURL, linkslist)
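Putting the pieces together, a minimal sketch of processURL that collects the paragraph text, using requests instead of urllib2 for the HTTP call (the URLs are the ones from the question):
from multiprocessing import Pool

import requests
from bs4 import BeautifulSoup

linkslist = ["http://www.website.com/category/item1",
             "http://www.website.com/category/item2",
             "http://www.website.com/category/item3"]

def processURL(url):
    # fetch the page and return the text of every <p> element
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p")]

if __name__ == "__main__":
    with Pool(processes=10) as pool:
        results = pool.map(processURL, linkslist)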

Python Scrapy: passing properties into parser

I'm new to Scrapy and web scraping in general, so this might be a stupid question, but it wouldn't be the first time, so here goes.
I have a simple Scrapy spider, based on the tutorial example, that processes various URLs (in start_urls). I would like to categorise the URLs, e.g. URLs A, B, and C are Category 1 while URLs D and E are Category 2, and then be able to store the category on the resulting Items when the parser processes the response for each URL.
I guess I could have a separate spider for each category and hold the category as an attribute on the class so the parser can pick it up from there. But I was hoping I could have just one spider for all the URLs and tell the parser which category to use for a given URL.
Right now I'm setting up the URLs in start_urls via my spider's __init__() method. How do I pass the category for a given URL from my __init__ method to the parser so that I can record the category on the Items generated from the responses for that URL?
As paul t. suggested:
class MySpider(CrawlSpider):

    def start_requests(self):
        ...
        yield Request(url1, meta={'category': 'cat1'}, callback=self.parse)
        yield Request(url2, meta={'category': 'cat2'}, callback=self.parse)
        ...

    def parse(self, response):
        category = response.meta['category']
        ...
Use start_requests to control the first URLs you visit, attaching metadata to each request; you can then access that metadata through response.meta in the callback.
The same technique works if you need to pass data from a parse function to a parse_item callback, for instance.
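Since the question mentions building the URL list in __init__(), here is a minimal sketch, assuming a hypothetical URL-to-category mapping set up there:
import scrapy

class CategorisedSpider(scrapy.Spider):
    name = 'categorised'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # hypothetical mapping of URL -> category
        self.url_categories = {
            'http://example.com/a': 'cat1',
            'http://example.com/b': 'cat1',
            'http://example.com/d': 'cat2',
        }

    def start_requests(self):
        for url, category in self.url_categories.items():
            yield scrapy.Request(url, meta={'category': category}, callback=self.parse)

    def parse(self, response):
        # record the category on the item built from this response
        yield {'url': response.url, 'category': response.meta['category']}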
