I am reading Learning Scrapy by Dimitrios Kouzis-Loukas, and I have a question about the "Two-direction crawling with a spider" section in chapter 3, page 58.
The original code is like:
def parse(self, response):
    # Get the next index URLs and yield Requests
    next_selector = response.xpath('//*[contains(@class,"next")]//@href')
    for url in next_selector.extract():
        yield Request(urlparse.urljoin(response.url, url))

    # Get item URLs and yield Requests
    item_selector = response.xpath('//*[@itemprop="url"]/@href')
    for url in item_selector.extract():
        yield Request(urlparse.urljoin(response.url, url),
                      callback=self.parse_item)
But from my understanding, shouldn't the second loop be nested inside the first one, so that we first download an index page, then download all the information pages linked from it, and only after that move on to the next index page?
So I just want to know the operating order of the original code. Please help!
You can't really merge the two loops.
The Request objects yielded in them have different callbacks.
The first one will be processed by the parse method (which seems to be parsing a listing of multiple items), and the second by the parse_item method (probably parsing the details of a single item).
As for the order of scraping, scrapy (by default) uses a LIFO queue, which means the last request created will be processed first.
However, due to the asynchronous nature of scrapy, it's impossible to say what the exact order will be.
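If you do want index pages handled in something closer to breadth-first order, the Scrapy FAQ describes switching the scheduler queues from the default LIFO to FIFO via settings. A sketch (exact ordering still depends on concurrency):

```python
# settings.py -- switch the scheduler from depth-first (LIFO, the default)
# to breadth-first (FIFO) order, per the Scrapy FAQ.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
```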
From this website:
https://search2.ucl.ac.uk/s/search.html?query=max&collection=website-meta&profile=_directory&tab=directory&f.Profile+Type%7Cg=Student&start_rank=1
I need to scrape the next pages (2, 3, ...) using Selenium or lxml.
I can only scrape the first page.
You can try this:
nextNumberIsThere = True
i = 1
while nextNumberIsThere:
    # scroll to the bottom of the page so all results are loaded
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    profileDetails = driver.find_elements_by_xpath("//ul[@class='profile-details']/li")
    for element in profileDetails:
        print(element.text)
    next = driver.find_elements_by_xpath("//a[text()='" + str(i) + "']")
    i += 1
    if len(next) > 0:
        next[0].click()
    else:
        nextNumberIsThere = False
The above code will iterate and fetch the data until there are no page numbers left.
If you want to fetch the name, department, and email separately, then try the code below:
nextNumberIsThere = True
i = 1
while nextNumberIsThere:
    # scroll to the bottom of the page so all results are loaded
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    profileDetails = driver.find_elements_by_xpath("//ul[@class='profile-details']")
    for element in profileDetails:
        name = element.find_element_by_xpath("./li[@class='fn']")
        department = element.find_element_by_xpath("./li[@class='org']")
        email = element.find_element_by_xpath("./li[@class='email']")
        print(name.text)
        print(department.text)
        print(email.text)
        print("------------------------------")
    next = driver.find_elements_by_xpath("//a[text()='" + str(i) + "']")
    i += 1
    if len(next) > 0:
        next[0].click()
    else:
        nextNumberIsThere = False
I hope it helps...
Change start_rank in the URL. For example:
https://search2.ucl.ac.uk/s/search.html?query=max&collection=website-meta&profile=_directory&tab=directory&f.Profile+Type%7Cg=Student&start_rank=11
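The page URLs can then be generated in a loop. A sketch, assuming 10 results per page (the page size is an assumption inferred from the start_rank values above):

```python
# Build the URL for each results page by incrementing start_rank.
BASE = ("https://search2.ucl.ac.uk/s/search.html?query=max"
        "&collection=website-meta&profile=_directory&tab=directory"
        "&f.Profile+Type%7Cg=Student&start_rank={rank}")

def page_url(page, per_page=10):
    # page 1 -> start_rank=1, page 2 -> start_rank=11, page 3 -> start_rank=21, ...
    return BASE.format(rank=(page - 1) * per_page + 1)

urls = [page_url(p) for p in range(1, 4)]  # first three pages
```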
The usual solution to this kind of problem is not to use a loop that iterates through "all the pages" (because you don't know how many there are up-front), but rather have some kind of queue, where scraping one page optionally adds subsequent pages to the queue, to be scraped later.
In your specific example, during the scraping of each page you could look for the link to "next page" and, if it's there, add the next page's URL to the queue, so it will be scraped following the current page; once you hit a page with no "next page" link, the queue will empty and scraping will stop.
A more complex example might include scraping a category page and adding each of its sub-categories as a subsequent page to the scraping queue, each of which might in turn add multiple item pages to the queue, etc.
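The queue mechanism described above can be sketched in plain Python with a deque standing in for the scraping queue (`fetch_links` is a hypothetical callback standing in for whatever downloads a page and extracts its follow-up links):

```python
from collections import deque

def crawl(start_url, fetch_links):
    # fetch_links(url) downloads `url` and returns the follow-up URLs found
    # on it (e.g. a "next page" link, sub-category pages, item pages).
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue:
        url = queue.popleft()          # scrape the oldest queued page first
        visited.append(url)
        for link in fetch_links(url):  # optionally queue subsequent pages
            if link not in seen:       # avoid re-queuing the same page
                seen.add(link)
                queue.append(link)
    return visited
```

Once a page yields no new links, the queue drains and the crawl stops on its own, with no need to know the page count up-front.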
Take a look at a scraping framework like Scrapy, which builds this kind of functionality directly into its design. You might find some of its other features useful as well, e.g. its ability to find elements on the page using XPath or CSS selectors.
The first example on the Scrapy homepage shows exactly the kind of functionality you're trying to implement:
class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('.post-header>h2'):
            yield {'title': title.css('a ::text').get()}

        for next_page in response.css('a.next-posts-link'):
            yield response.follow(next_page, self.parse)
One important note about Scrapy: it doesn't use Selenium (at least not out-of-the-box), but rather downloads the page source and parses it. This means that it doesn't run JavaScript, which might be an issue for you if the website you're scraping is client-generated. In that case, you could look into solutions that combine Scrapy and Selenium (quick googling shows a bunch of them, as well as StackOverflow answers regarding this problem), or you could stick to your Selenium scraping code and implement a queuing mechanism yourself, without Scrapy.
I'm currently using Scrapy for a project on a university institutional repository, where I need to get the external link for each university. Is there a way for me to deny certain URLs such as 'google.com' and 'twitter.com'? Below is what I have at the moment. I'm new to this, so any help would be appreciated. Thank you!
import scrapy

class UtmSpider(scrapy.Spider):
    name = 'utm'
    start_urls = ['http://eprints.utm.my/id/eprint/']

    def start_requests(self):
        yield scrapy.Request('http://eprints.utm.my/id/eprint/', self.parse)

    def parse(self, response):
        for href in response.xpath('//a/@href').getall():
            yield scrapy.Request(response.urljoin(href), self.parse)
If your spider is simple, then the solution from Swift will work just fine.
If your spiders have quite a lot of code in them, checking URLs every time you want to issue a request will pollute your code. In that case you can use the DownloaderMiddleware pipeline.
https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
If you search for 'IgnoreRequest' you'll find a description of how to implement a DownloaderMiddleware that will be able to discard certain requests.
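A sketch of the filtering logic such a middleware might use (the helper name and keyword list are illustrative, not part of any Scrapy API):

```python
from urllib.parse import urlparse

IGNORED_KEYWORDS = ['google', 'twitter']  # illustrative block list

def should_ignore(url, keywords=IGNORED_KEYWORDS):
    # Check the hostname only, so a link such as
    # http://example.com/article-about-google is not discarded by mistake.
    host = urlparse(url).netloc.lower()
    return any(kw in host for kw in keywords)

# In a DownloaderMiddleware, process_request(request, spider) would call
# should_ignore(request.url) and raise scrapy.exceptions.IgnoreRequest
# to discard the request before it is downloaded.
```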
ignore = ['google', 'twitter']

def parse(self, response):
    for href in response.xpath('//a/@href').getall():
        # only yield the request if none of the ignored keywords appear in the link
        if not any(kw in href.lower() for kw in ignore):
            yield scrapy.Request(response.urljoin(href), self.parse)
EDIT: as per request.
You asked how you could exclude certain links that contain text like the examples you gave, Google and Twitter.
I have not changed what your code does, but simply added a conditional statement which will check if the href contains the keywords.
We create a list (our list of excluded terms). Then we will need to iterate that list each time we want to check a link, so the shorter the list of keywords the better.
If none of the keywords appear in the href string, we yield the request; otherwise we skip that link and continue iterating.
Hope this helps
First I grabbed all the coin links from the website and issued requests to those links.
But Scrapy doesn't request the links from the list serially. It scrapes the data from those links successfully, but when saving to a CSV file it adds a blank row after every successfully scraped item. Result Screenshot
I expect it to request the links serially from the list and to produce no blank rows. How can I do that?
I am using Python 3.6 and Scrapy version 1.5.1.
My code:
import scrapy

class MarketSpider(scrapy.Spider):
    name = 'market'
    allowed_domains = ['coinmarketcap.com']
    start_urls = ['http://coinmarketcap.com/']

    def parse(self, response):
        Coin = response.xpath('//*[@class="currency-name-container link-secondary"]/@href').extract()
        for link in Coin:
            absolute_url = response.urljoin(link)
            yield scrapy.Request(absolute_url, callback=self.website_link)

    def website_link(self, response):
        link = response.xpath('//*[@class="list-unstyled details-panel-item--links"]/li[2]/a/@href').extract()
        name = response.xpath('normalize-space(//h1)').extract()
        yield {'Name': name, 'Link': link}
I think Scrapy visits pages in a multi-threaded (producer/consumer) fashion, which would explain the non-sequential ordering of your results.
To verify this hypothesis, you could change your configuration to use a single thread.
As for the blank rows, are you sure neither your name nor your link variable contains a \n?
Scrapy is an asynchronous framework - multiple requests are executed concurrently and the responses are parsed as they are received.
The only way to reliably control which responses are parsed first is to turn this feature off, e.g. by setting CONCURRENT_REQUESTS to 1.
This would make your spider less efficient though, and this kind of control of the parse order is rarely necessary, so I would avoid it if possible.
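For completeness, turning concurrency off is a one-line settings change (at the cost of crawling one request at a time):

```python
# settings.py -- process a single request at a time, so responses come back
# (and are parsed) in the order the requests were scheduled.
CONCURRENT_REQUESTS = 1
```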
The extra newlines in CSV exports on Windows are a known issue, and will be fixed in the next Scrapy release.
I'm getting an error when processing a URL with Scrapy 1.5.0 and Python 2.7.14.
class GoodWillOutSpider(Spider):
    name = "GoodWillOutSpider"
    allowed_domains = ["thegoodwillout.com"]
    start_urls = [GoodWillOutURL]

    def __init__(self):
        logging.critical("GoodWillOut STARTED.")

    def parse(self, response):
        products = Selector(response).xpath('//div[@id="elasticsearch-results-container"]/ul[@class="product-list clearfix"]')
        for product in products:
            item = GoodWillOutItem()
            item['name'] = product.xpath('//div[@class="name ng-binding"]').extract()[0]
            item['link'] = "www.thegoodwillout.com" + product.xpath('//@href').extract()[0]
            # item['image'] = "http:" + product.xpath("/div[@class='catalogue-product-cover']/a[@class='catalogue-product-cover-image']/img/@src").extract()[0]
            # item['size'] = '**NOT SUPPORTED YET**'
            yield item
        yield Request(GoodWillOutURL, callback=self.parse, dont_filter=True, priority=16)
This is my class GoodWillOutSpider, and this is the error I get:
[scrapy.core.scraper] ERROR: Spider error processing <GET https://www.thegoodwillout.com/footwear> (referer: None)
line 1085, in parse
    item['name'] = product.xpath('//div[@class="name ng-binding"]').extract()[0]
IndexError: list index out of range
And I want to know for the future: how can I work out the correct xpath for any site myself, without asking here again?
The problem
If your scraper can't access data that you can see using your browsers developer tools, it is not seeing the same data as your browser.
This can mean one of two things:
Your scraper is being recognized as such and served different content
Some of the content is generated dynamically (usually through javascript)
The generic solution
The most straightforward way of getting around both of these problems is to use an actual browser.
There are many headless browsers available, and you can choose the best one for your needs.
For scrapy, scrapy-splash is probably the simplest option.
More specialized solutions
Sometimes, you can figure out what the reason for this different behavior is, and change your code.
This will usually be the more efficient solution, but might require significantly more work on your part.
For example, if your scraper is getting redirected, it is possible that you just need to use a different user agent string, pass some additional headers, or slow down your requests.
If the content is generated by javascript, you might be able to look at the page source (response.text or view source in a browser), and figure out what is going on.
After that, there are two possibilities:
Extract the data in an alternate way (like gangabass did for your previous question)
Replicate what the javascript is doing in your spider code (such as making additional requests, like in the current example)
IndexError: list index out of range
You need to first check whether the list has any values after extracting:
item['name'] = product.xpath('//div[@class="name ng-binding"]').extract()
if item['name']:
    item['name'] = item['name'][0]
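That guard can also be wrapped in a small helper (Scrapy's selectors offer extract_first() for exactly this purpose); a plain-Python sketch of the idea, with an illustrative name:

```python
def first_or_default(values, default=None):
    # Return the first extracted value, or `default` when the xpath matched
    # nothing -- instead of raising IndexError on an empty list.
    return values[0] if values else default
```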
I am a Python/Scrapy newbie. I am trying to scrape a website for practice; basically, what I am trying to accomplish is to pull all the companies that are active and download them to a CSV file. You can see my code pasted below. I added an if statement and it doesn't seem to be working, and I am not sure what I am doing wrong.
Also, I think the spider is crawling the website multiple times, based on its output. I only want it to crawl the site once every time I run it.
Just an FYI, I did search Stack Overflow for the answer and I found a few solutions, but I couldn't get any of them to work. I guess this is part of being a rookie.
from scrapy.spider import Spider
from scrapy.selector import Selector

from bizzy.items import BizzyItem

class SunSpider(Spider):
    name = "Sun"
    allowed_domains = ['sunbiz.org']
    start_urls = [
        'http://search.sunbiz.org/Inquiry/CorporationSearch/SearchResults/EntityName/a/Page1'
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tbody/tr')
        items = []
        for site in sites:
            item = BizzyItem()
            item["company"] = sel.xpath('//td[1]/a/text()').extract()
            item["status"] = sel.xpath('//td[3]/text()').extract()
            if item["status"] != 'Active':
                pass
            else:
                items.append(item)
        return items
Crawling Multiple Times?
I've had time now to read over your code and glance at the source code for the site you are trying to scrape. First of all, I can tell you from my admittedly limited experience with Scrapy that your spider is not crawling the website multiple times. What you are experiencing is simply the nightmarish wall of debugging output the scrapy devs decided it was a good idea to spew by default. :)
It's actually very useful information if you read through it, and if you can learn to spot patterns you can almost read it as it's whizzing by. I believe they properly use stderr, so if you are in a Unix-y environment you can always silence it with scrapy crawl myspider -o output.json -t json 2>/dev/null (IIRC).
Mysterious if Statement
Because of the nature of extract operating over selectors that might well return multiple elements, it returns a list. If you were to print your result, even though in the xpath you selected down to text(), you would find it looked like this:
[u'string']  # Note the brackets
# ^ no little u if you are running this with Python 3.x
You want the first element (only member) of that list, [0]. Fortunately, you can add it right to the method chain you have already constructed for extract:
item["company"] = sel.xpath('//td[1]/a/text()').extract()[0]
item["status"] = sel.xpath('//td[3]/text()').extract()[0]
Then (assuming your xpath is correct - I didn't check it), your conditional should behave as expected. (A list of any size will never equal a string, so you always pass.)
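The gotcha is easy to demonstrate in a plain Python session (the values are illustrative):

```python
status = [u'Active']             # what extract() returns: a list
print(status != 'Active')        # True -- a list never equals a string,
                                 # so the original `if` skipped every row
print(status[0] != 'Active')     # False -- comparing the element itself works
```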