I am building a spider that uses Selenium as well as a proxy. The main goal is to make the spider as robust as possible at avoiding detection while web scraping. I know that Scrapy has the module 'scrapy-rotating-proxies', but I'm having trouble verifying whether Scrapy would check the status of the chromedriver's success in requesting a webpage and, if the request fails because it was detected, switch the proxy.
Second, I am somewhat unsure of how a proxy is handled by my computer. For example, when I set a proxy value, is that value consistent for anything that makes a request on my computer? I.e. will Scrapy and the webdriver have the same proxy value as long as one of them sets it? In particular, if Scrapy has a proxy value, will any Selenium webdriver instantiated inside the class definition inherit that proxy?
I'm quite inexperienced with these tools and would really appreciate some help!
I've tried looking for a method to test and check the proxy value of Selenium as well as Scrapy, so that I can compare them.
# gets the proxies and sets the value of the scrapy proxy list in settings
import requests
from itertools import cycle
from lxml.html import fromstring

def get_proxies():
    url = 'https://free-proxy-list.net/'
    response = requests.get(url)
    parser = fromstring(response.text)
    proxies = set()
    for i in parser.xpath('//tbody/tr')[:10]:
        if i.xpath('.//td[7][contains(text(),"yes")]'):
            # Grabbing IP and corresponding PORT
            proxy = ":".join([i.xpath('.//td[1]/text()')[0], i.xpath('.//td[2]/text()')[0]])
            proxies.add(proxy)
    proxy_pool = cycle(proxies)
    url = 'https://httpbin.org/ip'
    new_proxy_list = []
    for i in range(1, 30):
        # Get a proxy from the pool
        proxy = next(proxy_pool)
        try:
            response = requests.get(url, proxies={"http": proxy, "https": proxy})
            # Grab and append proxy if valid
            new_proxy_list.append(proxy)
        except requests.exceptions.RequestException:
            # Most free proxies will often get connection errors; you would have to
            # retry the request with another proxy, which we skip here.
            print("Skipping. Connection error")
    # add to settings proxy list (`settings` is assumed to be the project settings module, imported elsewhere)
    settings.ROTATING_PROXY_LIST = new_proxy_list
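On the second question, the safest assumption is that the proxy value does not propagate between the two tools automatically: Scrapy's HttpProxyMiddleware reads the http_proxy/https_proxy environment variables, while a chromedriver-launched browser follows its own system or command-line proxy settings. Below is a minimal sketch of setting the same proxy explicitly in both places so the value is deterministic regardless of inheritance; the spider name, proxy address, and httpbin check are placeholders, not part of the code above.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import scrapy

PROXY = "http://xxx.xx.xx.xxx:yy"  # placeholder proxy

def make_driver(proxy=PROXY):
    # Chrome only sees the proxy if it is passed on its own command line.
    opts = Options()
    opts.add_argument(f"--proxy-server={proxy}")
    return webdriver.Chrome(options=opts)

class ExampleSpider(scrapy.Spider):
    name = "example"  # hypothetical spider name
    start_urls = ["https://httpbin.org/ip"]

    def start_requests(self):
        for url in self.start_urls:
            # Scrapy uses this proxy only because it is set in request.meta.
            yield scrapy.Request(url, meta={"proxy": PROXY})

    def parse(self, response):
        # httpbin echoes the requesting IP, so the Scrapy and Selenium paths can be compared.
        self.logger.info(response.text)

Requesting https://httpbin.org/ip from both the spider and the driver returned by make_driver() is one way to compare the effective proxy values directly.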
Hi, thanks for taking the time to read this. I have residential proxies that I want to use in Scrapy, but in the Scrapy documentation proxies look like this:
http://some_proxy_server:port or http://username:password@some_proxy_server:port
The proxies I have look like this:
['username:password_session-xxxxx:some_proxy_server:port',
 'username:password_session-xxxx1:some_proxy_server:port',
 'username:password_session-xxxx2_session:some_proxy_server:port']
So normally the proxy server and port change while the username and password stay the same, and there is a list of server:port pairs that rotates. But when I start the scraper with my proxies, it only sends requests to some_proxy_server:port, which is the same for all of them; only the session part differs. I have tried adding this in a middleware and used the libraries below, but they all treat these proxies the same.
scrapy-rotated-proxy==0.1.5
scrapy-rotating-proxies==0.6.2
So my question is: how do I use proxies where it is the session_id that differs instead of the host and port, i.e. something like
http://username:password-session_id@host:port
Update:
Here is the middleware I used:
import random
from w3lib.http import basic_auth_header

with open('files/proxies.txt', 'r') as f:
    proxies = f.read()
proxies = proxies.split('\n')

class CustomProxyMiddleware(object):
    def process_request(self, request, spider):
        proxy = random.choice(proxies)
        print('_______', proxy)
        request.meta["proxy"] = proxy
        request.headers["Proxy-Authorization"] = basic_auth_header("username", "password")
I want to write a script with Python 3.7, but first I have to scrape the data it needs.
I have no problems connecting to and getting data from unblocked sites, but if the site is banned, it won't work.
If I use a VPN service I can reach these "banned" sites with the Chrome browser.
I tried setting a proxy in PyCharm, but I failed; I just got errors all the time.
What is the simplest free way to solve this problem?
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
req = Request('https://www.SOMEBANNEDSITE.com/', headers={'User-Agent': 'Mozilla/5.0'}) # that web site is blocked in my country
webpage = urlopen(req).read() # code stops running at this line because it can't connect to the site.
page_soup = soup(webpage, "html.parser")
There are multiple ways to scrape blocked sites. A solid way is to use a proxy service as already mentioned.
A proxy server, also known as a "proxy", is a computer that acts as a gateway between your computer and the internet.
When you are using a proxy, your requests are forwarded through it, so your IP is not directly exposed to the site you are scraping.
You can't simply take any IP (say xxx.xx.xx.xxx) and port (say yy), do
import requests
proxies = { 'http': "http://xxx.xx.xx.xxx:yy",
'https': "https://xxx.xx.xx.xxx:yy"}
r = requests.get('http://www.somebannedsite.com', proxies=proxies)
and expect to get a response.
The proxy should be configured to take your request and send you a response.
So, where can you get a proxy?
a. You could buy proxies from many providers.
b. Use a list of free proxies from the internet.
You don't need to buy proxies unless you are doing some massive scale scraping.
For now I will focus on free proxies available on the internet. Just do a Google search for "free proxy provider" and you will find a list of sites offering free proxies. Go to any one of them and get an IP and its corresponding port.
import requests
#replace the ip and port below with the ip and port you got from any of the free sites
proxies = { 'http': "http://182.52.51.155:39236",
'https': "https://182.52.51.155:39236"}
r = requests.get('http://www.somebannedsite.com', proxies=proxies)
print(r.text)
You should, if possible, use a proxy with an 'Elite' anonymity level (the anonymity level is specified on most of the sites providing free proxies). If interested, you could also do a Google search to find the difference between 'elite', 'anonymous' and 'transparent' proxies.
Note:
Most of these free proxies are not that reliable, so if you get an error with one IP and port combination, try a different one.
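Since free proxies die constantly, it can also help to wrap the request in a small loop that walks a list of candidates and falls back to the next one on failure. A minimal sketch; the function name and the ip:port strings are just placeholders:

import requests

def fetch_with_proxy_list(url, proxy_list, timeout=10):
    """Try each proxy in turn; return the first successful response."""
    for proxy in proxy_list:
        proxies = {'http': 'http://%s' % proxy, 'https': 'http://%s' % proxy}
        try:
            return requests.get(url, proxies=proxies, timeout=timeout,
                                headers={'User-Agent': 'Mozilla/5.0'})
        except requests.exceptions.RequestException:
            continue  # dead or blocked proxy, move on to the next one
    raise RuntimeError('All proxies in the list failed')

# usage, with placeholder ip:port strings taken from any free-proxy site
# r = fetch_with_proxy_list('http://www.somebannedsite.com',
#                           ['182.52.51.155:39236', 'xxx.xx.xx.xxx:yy'])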
Your best option is to use a proxy via the requests library, since it can flexibly handle requests through a proxy on a per-request basis.
Here is a small example:
import requests
from bs4 import BeautifulSoup as soup
# use your usable proxies here
# replace host with you proxy IP and port with port number
proxies = { 'http': "http://host:port",
'https': "https://host:port"}
text = requests.get('http://www.somebannedsite.com', proxies=proxies, headers={'User-Agent': 'Mozilla/5.0'}).text
page_soup = soup(text, "html.parser") # use whatever parser you prefer, maybe lxml?
If you want to use SOCKS5, you'd have to get the dependencies via pip install requests[socks] and then replace the proxies part with:
# user is your authentication username
# pass is your auth password
# host and port are the same as above
proxies = { 'http': 'socks5://user:pass@host:port',
            'https': 'socks5://user:pass@host:port' }
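To confirm that traffic is really going through the proxy, you can compare the IP reported by a service such as httpbin.org/ip with and without the proxies argument. A quick sketch, assuming requests[socks] is installed and with placeholder credentials and host:

import requests

# httpbin.org/ip echoes the IP address it sees, so this should print the proxy's IP
proxies = {'http': 'socks5://user:pass@host:port',
           'https': 'socks5://user:pass@host:port'}
print(requests.get('https://httpbin.org/ip', proxies=proxies, timeout=10).json())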
If you don't have proxies at hand, you can fetch some proxies.
I've written a script in Python using Scrapy that sends a request to a webpage through a proxy without changing anything in settings.py or DOWNLOADER_MIDDLEWARES. It is working great now. However, the only thing I can't manage is using a list of proxies so that if one fails, another takes over. How can I tweak this portion, os.environ["http_proxy"] = "http://176.58.125.65:80", to use a list of proxies one at a time, since it only supports one? Any help on this will be highly appreciated.
This is what I've tried so far (a working version):
import scrapy, os
from scrapy.crawler import CrawlerProcess

class ProxyCheckerSpider(scrapy.Spider):
    name = 'lagado'
    start_urls = ['http://www.lagado.com/proxy-test']
    os.environ["http_proxy"] = "http://176.58.125.65:80"  # can't modify this portion to get a list of proxies

    def parse(self, response):
        stat = response.css(".main-panel p::text").extract()[1:3]
        yield {"Proxy-Status": stat}

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(ProxyCheckerSpider)
c.start()
I do not want to change anything in settings.py or create any custom middleware for this purpose. I wish to achieve the same thing (externally) as I did above with a single proxy. Thanks.
You can also set the proxy meta key per request, to a value like http://some_proxy_server:port or http://username:password@some_proxy_server:port.
From the official docs: https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware
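For example, here is a minimal sketch of the spider from the question with the proxy moved from the environment variable into request.meta; the class name is arbitrary and the proxy is the one from the question:

import scrapy
from scrapy.crawler import CrawlerProcess

class ProxyMetaSpider(scrapy.Spider):
    name = 'lagado_meta'  # hypothetical name
    start_urls = ['http://www.lagado.com/proxy-test']
    # the single proxy from the question; swap in any entry from a list here
    proxy = 'http://176.58.125.65:80'

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, meta={'proxy': self.proxy})

    def parse(self, response):
        stat = response.css(".main-panel p::text").extract()[1:3]
        yield {"Proxy-Status": stat}

c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
c.crawl(ProxyMetaSpider)
c.start()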
So you need to write your own middleware that would:
Catch failed responses
If a response failed because of the proxy:
replace the request.meta['proxy'] value with a new proxy IP
reschedule the request
Alternatively, you can look into Scrapy extension packages that already solve this: https://github.com/TeamHG-Memex/scrapy-rotating-proxies
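For illustration, a rough sketch of a middleware following the steps above; the proxy list is a placeholder, the set of status codes treated as "blocked" is an assumption, and the class would still need to be enabled via DOWNLOADER_MIDDLEWARES:

import random

class RotateProxyOnFailureMiddleware(object):
    # placeholder proxy list; in practice load it from settings or a file
    PROXIES = ['http://176.58.125.65:80', 'http://xxx.xx.xx.xxx:yy']

    def process_request(self, request, spider):
        # assign a proxy if the request does not have one yet
        request.meta.setdefault('proxy', random.choice(self.PROXIES))

    def process_response(self, request, response, spider):
        # assumption: these status codes indicate the current proxy was blocked
        if response.status in (403, 429, 503):
            retry = request.replace(dont_filter=True)  # reschedule past the dupefilter
            retry.meta['proxy'] = random.choice(self.PROXIES)
            return retry
        return response

    def process_exception(self, request, exception, spider):
        # connection errors usually mean the current proxy is dead; retry with another
        retry = request.replace(dont_filter=True)
        retry.meta['proxy'] = random.choice(self.PROXIES)
        return retry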
I'm working with http://robobrowser.readthedocs.org/en/latest/readme.html (a new Python library built on top of Beautiful Soup and requests) within Django. My Django app contains:
def index(request):
    p = str(request.POST.get('p', False))  # p='https://www.yahoo.com/'
    pr = "http://10.10.1.10:3128/"
    setProxy(pr)
    browser = RoboBrowser(history=True)
    postedmessage = browser.open(p)
    return HttpResponse(postedmessage)
I would like to add a proxy to my code but can't find a reference in the docs on how to do this. Is it possible to do this?
EDIT:
Following your recommendation, I've changed the code to:
pr="http://10.10.1.10:3128/"
setProxy(pr)
browser = RoboBrowser(history=True)
with:
def setProxy(pr):
    import os
    os.environ['HTTP_PROXY'] = pr
    return
I'm now getting:
Django Version: 1.6.4
Exception Type: LocationParseError
Exception Value:
Failed to parse: Failed to parse: 10.10.1.10:3128
Any ideas on what to do next? I can't find a reference to this error
After some recent API cleanup in RoboBrowser, there are now two relatively straightforward ways to control proxies. First, you can configure proxies in your requests session, and then pass that session to your browser. This will apply your proxies to all requests made through the browser.
from requests import Session
from robobrowser import RoboBrowser
session = Session()
session.proxies = {'http': 'http://my.proxy.com/'}
browser = RoboBrowser(session=session)
Second, you can set proxies on a per-request basis. The open, follow_link, and submit_form methods of RoboBrowser now accept keyword arguments for requests.Session.send. For example:
browser.open('http://stackoverflow.com/', proxies={'http': 'http://your.proxy.com'})
Since RoboBrowser uses the requests library, you can also try setting the proxies as described in the requests docs, by setting the environment variables HTTP_PROXY and HTTPS_PROXY.
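A minimal sketch of that environment-variable approach, using the placeholder proxy address from the question; requests (and therefore RoboBrowser) picks these variables up automatically, and the value should be a full URL including the scheme:

import os
from robobrowser import RoboBrowser

# placeholder proxy; requests reads these variables for every request it sends
os.environ['HTTP_PROXY'] = 'http://10.10.1.10:3128/'
os.environ['HTTPS_PROXY'] = 'http://10.10.1.10:3128/'

browser = RoboBrowser(history=True)
browser.open('https://www.yahoo.com/')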
This question necessarily comes in two forms, because I don't know the better route to a solution.
A site I'm crawling often kicks me to a redirected "User Blocked" page, but the frequency (by requests/time) seems random, and they appear to have a blacklist that blocks many of the "open" proxies I'm using through Proxymesh. So...
When Scrapy receives a "Redirect" in response to its request (e.g. DEBUG: Redirecting (302) to (GET http://.../you_got_blocked.aspx) from (GET http://.../page-544.htm)), does it keep trying to get page-544.htm, or does it move on to page-545.htm and lose page-544.htm forever? If it "forgets" the page (or counts it as visited), is there a way to tell it to keep retrying it? (If it does that naturally, then yay, and good to know...)
What is the most efficient solution?
(a) What I'm currently doing: using a Proxymesh rotating proxy through the http_proxy environment variable, which appears to rotate proxies often enough to get through the target site's redirections fairly regularly. (Downsides: the open proxies are slow to ping, there are only so many of them, Proxymesh will eventually start charging me per gig past 10 gigs, I only need them to rotate when redirected, I don't know how often or on what trigger they rotate, and, as above, I don't know whether the pages I'm being redirected from are re-queued by Scrapy...) (If Proxymesh rotates on each request, then I'm okay with paying reasonable costs.)
(b) Would it make sense (and be simple) to use middleware to select a new proxy on each redirection? What about on every single request? Would that make more sense through something else like Tor or Proxifier? If this is relatively straightforward, how would I set it up? I've read something like this in a few places, but most of it is outdated, with broken links or deprecated Scrapy commands.
For reference, I do currently have middleware set up for Proxymesh (yes, I'm using the http_proxy environment variable, but I'm a fan of redundancy when it comes to not getting in trouble). This is what I have for it currently, in case that matters:
import base64

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://open.proxymesh.com:[port number]"
        proxy_user_pass = "username:password"
        # b64encode avoids the trailing newline that base64.encodestring appends
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
Yesterday I had a similar task involving proxies and protection against DDoS (I was parsing a site).
The idea is in random.choice: every request has a chance of changing its IP.
The setup uses Tor and Python's telnetlib; you need to configure Tor's ControlPort password.
from scrapy import log
from settings import USER_AGENT_LIST
import random
import telnetlib
import time

# 15% ip change
class RetryChangeProxyMiddleware(object):
    def process_request(self, request, spider):
        if random.choice(xrange(1, 100)) <= 15:
            log.msg('Changing proxy')
            tn = telnetlib.Telnet('127.0.0.1', 9051)
            tn.read_until("Escape character is '^]'.", 2)
            tn.write('AUTHENTICATE "<PASSWORD HERE>"\r\n')
            tn.read_until("250 OK", 2)
            tn.write("signal NEWNYM\r\n")
            tn.read_until("250 OK", 2)
            tn.write("quit\r\n")
            tn.close()
            log.msg('>>>> Proxy changed. Sleep Time')
            time.sleep(10)

# 30% useragent change
class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        if random.choice(xrange(1, 100)) <= 30:
            log.msg('Changing UserAgent')
            ua = random.choice(USER_AGENT_LIST)
            if ua:
                request.headers.setdefault('User-Agent', ua)
            log.msg('>>>> UserAgent changed')
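These middlewares only take effect once they are registered in the project settings. Assuming they live in a module such as myproject/middlewares.py (the module path and the order numbers below are assumptions), the registration would look roughly like:

# settings.py (sketch; adjust the module path to match your project)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RetryChangeProxyMiddleware': 400,
    'myproject.middlewares.RandomUserAgentMiddleware': 401,
}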