Deny certain URLs - python

I'm currently using Scrapy for a project on university institutional repositories, where I need to get the external links for each university. Is there a way for me to deny certain URLs such as 'google.com' and 'twitter.com'? Below is what I have at the moment. I'm new to this, so any help would be appreciated. Thank you!
import scrapy


class UtmSpider(scrapy.Spider):
    name = 'utm'
    start_urls = ['http://eprints.utm.my/id/eprint/']

    def start_requests(self):
        yield scrapy.Request('http://eprints.utm.my/id/eprint/', self.parse)

    def parse(self, response):
        for href in response.xpath('//a/@href').getall():
            yield scrapy.Request(response.urljoin(href), self.parse)

If your spider is simple, then the solution from Swift will work just fine.
If your spider (or spiders) has quite a lot of code in it, checking URLs every time you want to issue a request will pollute your code. In that case you can use the DownloaderMiddleware pipeline:
https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
If you search that page for 'IgnoreRequest', you'll find a description of how to implement a DownloaderMiddleware that can discard certain requests.
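For illustration, here is a minimal sketch of such a middleware; the class name, keyword list, and project path are assumptions for the example, not something taken from the question:

from scrapy.exceptions import IgnoreRequest


class DenyUrlsMiddleware:
    # Hypothetical keyword list; extend as needed.
    DENIED = ('google.com', 'twitter.com')

    def process_request(self, request, spider):
        # Discard the request before it is downloaded if its URL contains
        # any denied keyword; returning None lets other requests proceed.
        if any(keyword in request.url for keyword in self.DENIED):
            raise IgnoreRequest('Denied URL: %s' % request.url)
        return None

You would then enable it in settings.py (the module path 'myproject.middlewares' is hypothetical):

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DenyUrlsMiddleware': 543,
}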

ignore = ['google', 'twitter']

def parse(self, response):
    for href in response.xpath('//a/@href').getall():
        # Only follow the link if none of the ignored keywords appear in it.
        if not any(kw in href.lower() for kw in ignore):
            yield scrapy.Request(response.urljoin(href), self.parse)
EDIT (as per request):
You asked how you could exclude certain links that contain text like the examples you gave, Google and Twitter.
I have not changed what your code does; I simply added a conditional check on whether the href contains any of the keywords.
We create a list of excluded terms, and we iterate over that list for every link we check, so the shorter the keyword list the better.
If none of the keywords appear in the href string, we yield the request; otherwise we skip that link and move on to the next one.
Hope this helps


Python web scraping recursively (next page)

From this website:
https://search2.ucl.ac.uk/s/search.html?query=max&collection=website-meta&profile=_directory&tab=directory&f.Profile+Type%7Cg=Student&start_rank=1
I need to scrape the next pages (2, 3, ...) using Selenium or lxml.
I can only scrape the first page.
You can try this:
nextNumberIsThere = True
i = 1
while nextNumberIsThere:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    profileDetails = driver.find_elements_by_xpath("//ul[@class='profile-details']/li")
    for element in profileDetails:
        print(element.text)
    next_link = driver.find_elements_by_xpath("//a[text()='" + str(i) + "']")
    i += 1
    if len(next_link) > 0:
        next_link[0].click()
    else:
        nextNumberIsThere = False
The above code will iterate and fetch the data until there are no page numbers left.
If you want to fetch the name, department, and email separately, then try the code below:
nextNumberIsThere = True
i = 1
while nextNumberIsThere:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    profileDetails = driver.find_elements_by_xpath("//ul[@class='profile-details']")
    for element in profileDetails:
        name = element.find_element_by_xpath("./li[@class='fn']")
        department = element.find_element_by_xpath("./li[@class='org']")
        email = element.find_element_by_xpath("./li[@class='email']")
        print(name.text)
        print(department.text)
        print(email.text)
        print("------------------------------")
    next_link = driver.find_elements_by_xpath("//a[text()='" + str(i) + "']")
    i += 1
    if len(next_link) > 0:
        next_link[0].click()
    else:
        nextNumberIsThere = False
I hope it helps...
Change start_rank in the url. For example:
https://search2.ucl.ac.uk/s/search.html?query=max&collection=website-meta&profile=_directory&tab=directory&f.Profile+Type%7Cg=Student&start_rank=11
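Building on that idea, here is a minimal sketch that pages through the results by incrementing start_rank, using requests and lxml; the result XPath is taken from the answer above, but the page size of 10 and the stopping condition are assumptions about the site, not something confirmed here:

import requests
from lxml import html

BASE = ("https://search2.ucl.ac.uk/s/search.html?query=max&collection=website-meta"
        "&profile=_directory&tab=directory&f.Profile+Type%7Cg=Student&start_rank={}")

start_rank = 1
while True:
    page = requests.get(BASE.format(start_rank))
    tree = html.fromstring(page.content)
    rows = tree.xpath("//ul[@class='profile-details']/li")
    if not rows:          # no more results, stop paginating
        break
    for row in rows:
        print(row.text_content().strip())
    start_rank += 10      # assumed page size of 10 results per page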
The usual solution to this kind of problem is not to use a loop that iterates through "all the pages" (because you don't know how many there are up-front), but rather have some kind of queue, where scraping one page optionally adds subsequent pages to the queue, to be scraped later.
In your specific example, during the scraping of each page you could look for the link to "next page" and, if it's there, add the next page's URL to the queue, so it will be scraped following the current page; once you hit a page with no "next page" link, the queue will empty and scraping will stop.
A more complex example might include scraping a category page and adding each of its sub-categories as a subsequent page to the scraping queue, each of which might in turn add multiple item pages to the queue, etc.
Take a look at scraping frameworks like Scrapy, which build this kind of functionality into their design. You might find some of its other features useful as well, e.g. its ability to find elements on the page using either XPath or CSS selectors.
The first example on the Scrapy homepage shows exactly the kind of functionality you're trying to implement:
class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('.post-header>h2'):
            yield {'title': title.css('a ::text').get()}

        for next_page in response.css('a.next-posts-link'):
            yield response.follow(next_page, self.parse)
One important note about Scrapy: it doesn't use Selenium (at least not out-of-the-box), but rather downloads the page source and parses it. This means that it doesn't run JavaScript, which might be an issue for you if the website you're scraping is client-generated. In that case, you could look into solutions that combine Scrapy and Selenium (quick googling shows a bunch of them, as well as StackOverflow answers regarding this problem), or you could stick to your Selenium scraping code and implement a queuing mechanism yourself, without Scrapy.
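If you stick with Selenium, a minimal sketch of such a queuing mechanism might look like this; the start URL and the "next page" selector are placeholders, not taken from the question:

from collections import deque
from selenium import webdriver

driver = webdriver.Chrome()
queue = deque(["https://example.com/page1"])   # hypothetical start URL
seen = set(queue)

while queue:
    url = queue.popleft()
    driver.get(url)
    # ... scrape whatever you need from the current page here ...
    # If a "next page" link exists, enqueue it to be scraped later.
    for link in driver.find_elements_by_xpath("//a[@class='next']"):   # placeholder selector
        href = link.get_attribute("href")
        if href and href not in seen:
            seen.add(href)
            queue.append(href)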

python scrapy Two-direction crawling with a spider

I am reading Learning Scrapy by Dimitrios Kouzis-Loukas. I have a question about the "Two-direction crawling with a spider" part in chapter 3, page 58.
The original code is:
def parse(self, response):
    # Get the next index URLs and yield Requests
    next_selector = response.xpath('//*[contains(@class,"next")]//@href')
    for url in next_selector.extract():
        yield Request(urlparse.urljoin(response.url, url))

    # Get item URLs and yield Requests
    item_selector = response.xpath('//*[@itemprop="url"]/@href')
    for url in item_selector.extract():
        yield Request(urlparse.urljoin(response.url, url),
                      callback=self.parse_item)
But from my understanding, shouldn't the second loop be nested inside the first one, so that we first download an index page, then download all the item pages listed on it, and only after that move on to the next index page?
So I just want to know the operating order of the original code. Please help!
You can't really merge the two loops.
The Request objects yielded in them have different callbacks.
The first one will be processed by the parse method (which seems to be parsing a listing of multiple items), and the second by the parse_item method (probably parsing the details of a single item).
As for the order of scraping, Scrapy (by default) uses a LIFO queue, which means the last request created will be processed first.
However, due to the asynchronous nature of Scrapy, it's impossible to say what the exact order will be.
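If you do want something closer to breadth-first order (finish each index page's items before going deeper), the Scrapy FAQ describes switching the scheduler to FIFO queues via settings; a sketch (check the docs for your Scrapy version for the exact queue class paths):

# settings.py - switch from the default LIFO (depth-first) scheduling to FIFO (breadth-first)
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'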

Scrapy spider scraping data from links randomly - why?

First I grabbed all the coin links from the website and issued requests to those links.
But Scrapy doesn't request the links serially from the list. It scrapes the data from those links successfully, but when saving to a CSV file it adds a blank row after every successfully scraped item (see the result screenshot).
I expect it to request the links serially from the list and not produce any blank rows. How can I do that?
I am using Python 3.6 and Scrapy 1.5.1.
My code:
import scrapy


class MarketSpider(scrapy.Spider):
    name = 'market'
    allowed_domains = ['coinmarketcap.com']
    start_urls = ['http://coinmarketcap.com/']

    def parse(self, response):
        Coin = response.xpath('//*[@class="currency-name-container link-secondary"]/@href').extract()
        for link in Coin:
            absolute_url = response.urljoin(link)
            yield scrapy.Request(absolute_url, callback=self.website_link)

    def website_link(self, response):
        link = response.xpath('//*[@class="list-unstyled details-panel-item--links"]/li[2]/a/@href').extract()
        name = response.xpath('normalize-space(//h1)').extract()
        yield {'Name': name, 'Link': link}
I think Scrapy visits pages in a multi-threaded (producer/consumer) fashion, which would explain the non-sequential order of your results.
To verify this hypothesis, you could change your configuration to use a single thread.
As for the blank rows, are you sure that none of your name or link values contains a \n?
Scrapy is an asynchronous framework - multiple requests are executed concurrently and the responses are parsed as they are received.
The only way to reliably control which responses are parsed first is to turn this feature off, e.g. by setting CONCURRENT_REQUESTS to 1.
This would make your spider less efficient though, and this kind of control of the parse order is rarely necessary, so I would avoid it if possible.
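For reference, a minimal sketch of applying that setting per spider (using the spider from the question); note that this slows the crawl down considerably:

class MarketSpider(scrapy.Spider):
    name = 'market'
    # Limit the crawl to one request at a time, as suggested above.
    custom_settings = {'CONCURRENT_REQUESTS': 1}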
The extra newlines in CSV exports on Windows are a known issue and will be fixed in the next Scrapy release.

Scrapy crawled pages, but scraped 0 items

I am fairly new to Python, Scrapy and this board, so please bear with me, as I try to illustrate my problem.
My goal is to collect the names (and possibly prices) of all available hotels in Berlin on booking.com for a specific date (see for example the predefined start_url) with the help of Scrapy.
I think the crucial parts are:
I want to paginate through all next pages until the end.
On each page I want to collect the name of every hotel and the name should be saved respectively.
If I run "scrapy runspider bookingspider.py -o items.csv -t csv" for my code below, the terminal shows me that it crawls through all available pages, but in the end I only get an empty items.csv.
Step 1 seems to work, as the terminal shows that successive URLs are being crawled (e.g. [...]offset=15, then [...]offset=30). Therefore I think my problem is step 2.
For step 2 one needs to define a container or block in which each hotel's information is contained separately and which can serve as the basis for a loop, right?
I picked div class="sr_item_content sr_item_content_slider_wrapper", since every hotel block has this element at a superordinate level, but I am really unsure about this part. Maybe one has to consider a higher level (but which element should I take, since they are not the same across the hotel blocks?).
Anyway, based on that I figured out the remaining XPath to the element which contains the hotel name.
I followed two tutorials with similar settings (though different websites), but somehow it does not work here.
Maybe you have an idea, every help is very much appreciated. Thank you!
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.item import Item, Field
from scrapy.http.request import Request


class HotelItem(Item):
    title = Field()
    price = Field()


class BookingCrawler(CrawlSpider):
    name = 'booking_crawler'
    allowed_domains = ['booking.com']
    start_urls = ['http://www.booking.com/searchresults.html?checkin_monthday=25;checkin_year_month=2016-10;checkout_monthday=26;checkout_year_month=2016-10;class_interval=1;dest_id=-1746443;dest_type=city;offset=0;sb_travel_purpose=leisure;si=ai%2Cco%2Cci%2Cre%2Cdi;src=index;ss=Berlin']

    custom_settings = {
        'BOT_NAME': 'booking-scraper',
    }

    def parse(self, response):
        s = Selector(response)
        index_pages = s.xpath('//div[@class="results-paging"]/a/@href').extract()

        if index_pages:
            for page in index_pages:
                yield Request(response.urljoin(page), self.parse)

        hotels = s.xpath('//div[@class="sr_item_content sr_item_content_slider_wrapper"]')

        items = []
        for hotel in hotels:
            item = HotelItem()
            item['title'] = hotel.xpath('div[1]/div[1]/h3/a/span/text()').extract()[0]
            item['price'] = hotel.xpath('//div[@class="sr-prc--num sr-prc--final"]/text()').extract()[0]
            items.append(item)

        for item in items:
            yield item
I think the problem may be with your XPath on this line:
hotels = s.xpath('//div[#class="sr_item_content sr_item_content_slider_wrapper"]')
From this SO question it looks like you need to define something more along the lines of:
//div[contains(#class, 'sr_item_content') and contains(#class, 'sr_item_content_slider_wrapper')]
To help you debug further, you could try outputting the contents of index_pages first, to see whether it is definitely returning what you expect at that level.
Also, check out Xpath Visualiser (also mentioned in that question), which can help with building XPath expressions.
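As an aside, matching multiple classes with XPath is verbose, so a CSS selector is often the easier option here; a one-line sketch (untested against the live page):

hotels = response.css('div.sr_item_content.sr_item_content_slider_wrapper')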

if statement not working for spider in scrapy

I am a Python/Scrapy newbie. I am trying to scrape a website for practice, and basically what I am trying to accomplish is to pull all the companies that are active and download them to a CSV file. You can see my code pasted below; I added an if statement and it doesn't seem to be working, and I am not sure what I am doing wrong.
Also, I think the spider is crawling the website multiple times based on its output. I only want it to crawl the site once every time I run it.
Just an FYI, I did search Stack Overflow for the answer and found a few solutions, but I couldn't get any of them to work. I guess this is part of being a rookie.
from scrapy.spider import Spider
from scrapy.selector import Selector
from bizzy.items import BizzyItem


class SunSpider(Spider):
    name = "Sun"
    allowed_domains = ['sunbiz.org']
    start_urls = [
        'http://search.sunbiz.org/Inquiry/CorporationSearch/SearchResults/EntityName/a/Page1'
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tbody/tr')
        items = []

        for site in sites:
            item = BizzyItem()
            item["company"] = sel.xpath('//td[1]/a/text()').extract()
            item["status"] = sel.xpath('//td[3]/text()').extract()
            if item["status"] != 'Active':
                pass
            else:
                items.append(item)

        return items
Crawling Multiple Times?
I've had time now to read over your code and glance at the source code for the site you are trying to scrape. First of all, I can tell you from my admittedly limited experience with Scrapy that your spider is not crawling the website multiple times. What you are experiencing is simply the nightmarish wall of debugging output the scrapy devs decided it was a good idea to spew by default. :)
It's actually very useful information if you read through it, and if you can learn to spot patterns you can almost read it as it's whizzing by. I believe they properly use stderr, so if you are in a Unix-y environment you can always silence it with scrapy crawl myspider -o output.json -t json 2>/dev/null (IIRC).
Mysterious if Statement
Because extract operates over selectors that may well return multiple elements, it returns a list. If you were to print your result, even though your XPath selects down to text(), you would find it looks like this:
[u'string']  # Note the brackets
# ^ no little u if you are running this with Python 3.x
You want the first element (only member) of that list, [0]. Fortunately, you can add it right to the method chain you have already constructed for extract:
item["company"] = sel.xpath('//td[1]/a/text()').extract()[0]
item["status"] = sel.xpath('//td[3]/text()').extract()[0]
Then (assuming your xpath is correct - I didn't check it), your conditional should behave as expected. (A list of any size will never equal a string, so you always pass.)
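For what it's worth, the XPaths inside the loop are absolute, so every row ends up with the same values; here is a sketch of the loop using relative XPaths instead (the column positions are carried over from the original code, and the .strip() is an assumption about the page's whitespace):

for site in sites:
    item = BizzyItem()
    # Query relative to the current row, and take the first match from the list.
    item["company"] = site.xpath('./td[1]/a/text()').extract()[0]
    item["status"] = site.xpath('./td[3]/text()').extract()[0]
    if item["status"].strip() == 'Active':
        items.append(item)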
