Recently I bought an IP rotation service from ProxyRack and I want to use it with Scrapy. However, their example uses the requests library, and I'm confused about how to implement the same thing in Scrapy. Please help me. Here is their code, which I want to adapt to Scrapy:
import requests

username = "vranesevic"
password = "svranesevic"
PROXY_RACK_DNS = "megaproxy.rotating.proxyrack.net:222"

urlToGet = "http://ip-api.com/json"
# credentials go before the host, separated by "@"
proxy = {"http": "http://{}:{}@{}".format(username, password, PROXY_RACK_DNS)}

r = requests.get(urlToGet, proxies=proxy)
print("Response:\n{}".format(r.text))
You can follow the Scrapy documentation on how to set up a custom proxy. If you are not familiar with it, here are the steps.
Step 1 - Go to the middlewares.py file and paste the code below. Replace the proxy URL with the one provided by ProxyRack (keep the http:// scheme), and set your ProxyRack username and password inside basic_auth_header.
from w3lib.http import basic_auth_header

class CustomProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta["proxy"] = "http://192.168.1.1:8050"
        request.headers["Proxy-Authorization"] = basic_auth_header("<proxy_user>", "<proxy_pass>")
Step 2 - Go to the settings.py file and enable the downloader middlewares by pasting this at the bottom. Make sure you replace myproject with your own project name.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}
That's it and you are ready to go.
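For reference, here is a minimal sketch of that middleware filled in with the ProxyRack endpoint and credentials from the requests example in the question; treat the host, port, and credentials as placeholders for your own account details.
import requests  # not needed by the middleware itself; shown for context only
from w3lib.http import basic_auth_header

class CustomProxyMiddleware(object):
    # Sketch only: the endpoint and credentials below are copied from the
    # question's requests example and should be replaced with your own.
    def process_request(self, request, spider):
        request.meta["proxy"] = "http://megaproxy.rotating.proxyrack.net:222"
        request.headers["Proxy-Authorization"] = basic_auth_header("vranesevic", "svranesevic")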
I've created a script using Scrapy that rotates proxies to parse the address from a few hundred similar links like this one. I supply those links to the script from a csv file.
The script works fine until it encounters a response URL like https://www.bcassessment.ca//Property/UsageValidation; once the script starts getting that link, it can't get past it. FYI, I'm using meta to carry a lead_link so that retries use the original link instead of the redirected one, so the script should be able to get past that barrier.
This doesn't happen when I use proxies with the requests library. To be clearer: while using requests, the script does encounter the /Property/UsageValidation page but gets past it successfully after a few retries.
The spider is like:
import csv
import scrapy
from scrapy.crawler import CrawlerProcess

class mySpider(scrapy.Spider):
    name = "myspider"

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'stackoverflow_spider.middlewares.ProxiesMiddleware': 100,
        }
    }

    def start_requests(self):
        with open("output_main.csv", "r") as f:
            reader = csv.DictReader(f)
            for item in list(reader):
                lead_link = item['link']
                yield scrapy.Request(lead_link, self.parse, meta={"lead_link": lead_link, "download_timeout": 20}, dont_filter=True)

    def parse(self, response):
        address = response.css("h1#mainaddresstitle::text").get()
        print(response.meta['proxy'], address)

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'LOG_LEVEL': 'ERROR',
    })
    c.crawl(mySpider)
    c.start()
How can I keep the script from getting stuck on that page?
PS: I've attached a few of the links in a text file in case anyone wants to give it a try.
To make a session-safe proxy implementation for a Scrapy app, you need to add an additional cookiejar meta key at the point where you assign the proxy to request.meta, like this:
...
yield scrapy.Request(url=link, meta={"proxy": address, "cookiejar": address})
In this case Scrapy's CookiesMiddleware will create an additional cookie session for each proxy.
Related specifics of Scrapy's proxy implementation are mentioned in this answer.
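A minimal sketch of how this could look inside the ProxiesMiddleware referenced in your spider's settings; since the original middleware code isn't shown in the question, the class body and the proxy list below are assumptions for illustration only.
import random

class ProxiesMiddleware(object):
    # Hypothetical proxy pool; substitute your own rotating endpoints.
    PROXIES = [
        "http://user:pass@proxy-one.example.com:8000",
        "http://user:pass@proxy-two.example.com:8000",
    ]

    def process_request(self, request, spider):
        address = random.choice(self.PROXIES)
        request.meta["proxy"] = address
        # Tie a separate cookie session to each proxy so Scrapy's
        # CookiesMiddleware keeps their cookies from mixing.
        request.meta["cookiejar"] = address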
I am trying to write a simple scraping script to scrape Google Summer of Code orgs that use the tech I require. It's a work in progress. My parse function works fine, but whenever I call back into the org function it doesn't produce any output.
# -*- coding: utf-8 -*-
import scrapy

class GsocSpider(scrapy.Spider):
    name = 'gsoc'
    allowed_domains = ['https://summerofcode.withgoogle.com/archive/2018/organizations/']
    start_urls = ['https://summerofcode.withgoogle.com/archive/2018/organizations/']

    def parse(self, response):
        for href in response.css('li.organization-card__container a.organization-card__link::attr(href)'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_org)

    def parse_org(self, response):
        tech = response.css('li.organization__tag organization__tag--technology::text').extract()
        #if 'python' in tech:
        yield {
            'name': response.css('title::text').extract_first()
            #'ideas_list': response.css('')
        }
First of all, you are configuring allowed_domains incorrectly. As the documentation specifies:
An optional list of strings containing domains that this spider is
allowed to crawl. Requests for URLs not belonging to the domain names
specified in this list (or their subdomains) won’t be followed if
OffsiteMiddleware is enabled.
Let’s say your target url is https://www.example.com/1.html, then add
'example.com' to the list.
As you can see, you need to include only the domains, and it is a filtering feature (so other domains don't get crawled). It is also optional, so I would actually recommend not including it at all.
Your CSS selector for getting tech is also incorrect; it should be:
li.organization__tag.organization__tag--technology
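Putting both fixes together, a corrected version of your spider could look like the sketch below (based on the question's code, with allowed_domains dropped and the selector fixed; the 'python' check is only an illustration of filtering on the extracted tags).
import scrapy

class GsocSpider(scrapy.Spider):
    name = 'gsoc'
    # allowed_domains omitted; if you want it, use 'summerofcode.withgoogle.com'
    start_urls = ['https://summerofcode.withgoogle.com/archive/2018/organizations/']

    def parse(self, response):
        for href in response.css('li.organization-card__container a.organization-card__link::attr(href)'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_org)

    def parse_org(self, response):
        # Both classes belong to the same element, so chain them with a dot.
        tech = response.css('li.organization__tag.organization__tag--technology::text').extract()
        if 'python' in tech:
            yield {'name': response.css('title::text').extract_first()}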
System: Windows 10, Python 2.7.15, Scrapy 1.5.1
Goal: Retrieve text from within html markup for each of the link items on the target website, including those revealed (6 at a time) via the '+ SEE MORE ARCHIVES' button.
Target Website: https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info
Initial Progress: Python and Scrapy successfully installed. The following code...
import scrapy
from scrapy import Request

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': ["href", "eventtype", "eventmonth", "eventdate", "eventyear"],
    }

    def start_requests(self):
        urls = [
            'https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info',
        ]
        for url in urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        for event in response.css('div.article-item-extended'):
            yield {
                'href': event.css('a::attr(href)').extract(),
                'eventtype': event.css('h3::text').extract(),
                'eventmonth': event.css('span.month::text').extract(),
                'eventdate': event.css('span.day::text').extract(),
                'eventyear': event.css('span.year::text').extract(),
            }
...successfully produces the following results (when exported with -o to a .csv file)...
href,eventtype,eventmonth,eventdate,eventyear
/en/articles/archive/mtgo-standings/competitive-standard-constructed-league-2018-08-02,Competitive Standard Constructed League, August ,2, 2018
/en/articles/archive/mtgo-standings/pauper-constructed-league-2018-08-01,Pauper Constructed League, August ,1, 2018
/en/articles/archive/mtgo-standings/competitive-modern-constructed-league-2018-07-31,Competitive Modern Constructed League, July ,31, 2018
/en/articles/archive/mtgo-standings/pauper-challenge-2018-07-30,Pauper Challenge, July ,30, 2018
/en/articles/archive/mtgo-standings/legacy-challenge-2018-07-30,Legacy Challenge, July ,30, 2018
/en/articles/archive/mtgo-standings/competitive-standard-constructed-league-2018-07-30,Competitive Standard Constructed League, July ,30, 2018
However, the spider will not touch any of the info hidden behind the Ajax button. I've done a fair amount of Googling and digesting of documentation, example articles, and 'help me' posts. I am under the impression that to get the spider to actually see the Ajax-loaded info, I need to simulate some sort of request, possibly involving XHR, a Scrapy FormRequest, or something else. I am simply too new to web architecture in general to be able to work out the answer.
I hacked together a version of the initial code that sends a FormRequest, which still reaches the initial page just fine, yet incrementing the only parameter that appears to change (when inspecting the XHR calls sent out when physically clicking the button on the page) does not appear to have any effect. That code is here...
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': ["href", "eventtype", "eventmonth", "eventdate", "eventyear"],
    }

    def start_requests(self):
        for i in range(1, 10):
            yield scrapy.FormRequest(
                url='https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info',
                formdata={'l': 'en', 'f': '9041', 'search-result-theme': '', 'limit': '6',
                          'fromDate': '', 'toDate': '', 'event_format': '0', 'sort': 'DESC',
                          'word': '', 'offset': str(i*6)},
                callback=self.parse)

    def parse(self, response):
        for event in response.css('div.article-item-extended'):
            yield {
                'href': event.css('a::attr(href)').extract(),
                'eventtype': event.css('h3::text').extract(),
                'eventmonth': event.css('span.month::text').extract(),
                'eventdate': event.css('span.day::text').extract(),
                'eventyear': event.css('span.year::text').extract(),
            }
...and the results are the same as before, except the 6 output lines are repeated, as a block, 9 extra times.
Can anyone help point me to what I am missing? Thank you in advance.
Postscript: I always seem to get heckled out of my chair whenever I seek help for coding problems. If I am doing something wrong, please have mercy on me, I will do whatever I can to correct it.
Scrapy doesn't render dynamic content on its own; you need something else to deal with the JavaScript. Try one of these:
scrapy + selenium
scrapy + splash
This blog post about scrapy + splash has a good introduction to the topic.
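As a rough sketch of the scrapy + splash route, assuming the scrapy-splash package is installed and a Splash instance is running locally on port 8050, the spider could fetch the rendered page like this:
import scrapy
from scrapy_splash import SplashRequest

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    custom_settings = {
        # Assumes a local Splash container listening on port 8050.
        'SPLASH_URL': 'http://localhost:8050',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
    }

    def start_requests(self):
        url = 'https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info'
        # 'wait' gives the page's JavaScript time to load the extra items.
        yield SplashRequest(url, callback=self.parse, args={'wait': 2})

    def parse(self, response):
        for event in response.css('div.article-item-extended'):
            yield {'href': event.css('a::attr(href)').extract()}
Whether the "+ SEE MORE ARCHIVES" items appear after a plain render or still require simulating the button click depends on the page, so treat this as a starting point rather than a complete solution.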
I am working with Scrapy to scrape a website, and I want to extract only those items which have not been scraped in a previous run.
I am trying it on the https://www.ndtv.com/top-stories website, extracting only the 1st headline and only if it has been updated.
Below is my code:
import scrapy
from selenium import webdriver
from w3lib.url import url_query_parameter

class QuotesSpider(scrapy.Spider):
    name = "test"
    start_urls = [
        'https://www.ndtv.com/top-stories',
    ]

    def parse(self, response):
        print('testing')
        print(response.url)
        yield {
            'heading': response.css('div.nstory_header a::text').extract_first(),
        }
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawl_once.CrawlOnceMiddleware': 100,
}

SPIDER_MIDDLEWARES = {
    #'inc_crawling.middlewares.IncCrawlingSpiderMiddleware': 543,
    'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': True,
    'scrapy_deltafetch.DeltaFetch': 100,
    'scrapy_crawl_once.CrawlOnceMiddleware': 100,
    'scrapylib.deltafetch.DeltaFetch': 100,
    'inc_crawling.middlewares.deltafetch.DeltaFetch': 100,
}

COOKIES_ENABLED = True
COOKIES_DEBUG = True
DELTAFETCH_ENABLED = True
DELTAFETCH_DIR = '/home/administrator/apps/inc_crawling'
DOTSCRAPY_ENABLED = True
I have added the settings above to my settings.py file.
I run the spider with the "scrapy crawl test -o test.json" command, and after each run the .db file and the test.json file both get updated.
My expectation is that the .db file should be updated only when the 1st headline has changed.
Kindly help me if there is a better approach to extract only the updated headline.
A good way to implement this would be to override the DUPEFILTER_CLASS to check your database before making the actual requests.
Scrapy uses a dupefilter class to avoid fetching the same request twice, but by default it only works within a single run of a spider.
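For illustration, here is a minimal sketch of such a dupefilter that persists request fingerprints to a text file between runs; the class name and file path are made up for the example, and you could just as well check a database instead.
from scrapy.dupefilters import RFPDupeFilter

class PersistentDupeFilter(RFPDupeFilter):
    """Skips requests whose fingerprints were seen in any previous run."""

    SEEN_FILE = 'seen_requests.txt'  # hypothetical path; use your own store

    def __init__(self, path=None, debug=False):
        super(PersistentDupeFilter, self).__init__(path, debug)
        try:
            # Preload fingerprints recorded by earlier runs.
            with open(self.SEEN_FILE) as f:
                self.fingerprints.update(line.strip() for line in f)
        except IOError:
            pass  # first run, nothing seen yet

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        # Record the new fingerprint so the next run skips this request.
        with open(self.SEEN_FILE, 'a') as f:
            f.write(fp + '\n')
        return False
You would enable it in settings.py with DUPEFILTER_CLASS = 'myproject.dupefilters.PersistentDupeFilter', adjusting the module path to wherever you put the class.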