Hi, thanks for taking the time to read this. I have residential proxies that I want to use in Scrapy, but the Scrapy documentation shows proxies like this:
http://some_proxy_server:port or http://username:password@some_proxy_server:port
The proxies I have look like this:
['username:password_session-xxxxx:some_proxy_server:port',
 'username:password_session-xxxx1:some_proxy_server:port',
 'username:password_session-xxxx2_session:some_proxy_server:port']
Normally the proxy server and port change while the username and password stay the same, so you rotate through a list of server:port entries. With my proxies, every request goes to some_proxy_server:port, which is the same for all of them; only the session differs. I have tried setting this in a middleware and using the libraries below, but they all treat these proxies as one and the same proxy:
scrapy-rotated-proxy==0.1.5
scrapy-rotating-proxies==0.6.2
So my question is: how do I use proxies that differ by session_id instead of by host and port?
http://username:password-session_id@host:port
Update:
Here is the middleware I used:
import random

from w3lib.http import basic_auth_header

with open('files/proxies.txt', 'r') as f:
    proxies = f.read()
proxies = proxies.split('\n')

class CustomProxyMiddleware(object):
    def process_request(self, request, spider):
        print('_______', random.choice(proxies))
        request.meta["proxy"] = random.choice(proxies)
        request.headers["Proxy-Authorization"] = basic_auth_header("username", "password")
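One way to approach this, as a minimal sketch: the proxy URL form that Scrapy documents is http://username:password@host:port, so each 'username:password_session-xxxxx:host:port' line can be rewritten into that shape before it is assigned to request.meta["proxy"]. The helper name to_proxy_url and the port 8080 are made up for illustration, and the code assumes the four fields are separated by colons and that the host, port and username contain no colon themselves.

```python
from urllib.parse import quote


def to_proxy_url(entry, scheme="http"):
    """Convert 'user:password:host:port' into 'scheme://user:password@host:port'."""
    # Split from the right so host and port are taken first; this assumes
    # neither of them contains a colon (so no bare IPv6 hosts).
    credentials, host, port = entry.strip().rsplit(":", 2)
    # Assumes the username has no colon; everything after the first colon
    # (including the session suffix) is treated as the password.
    username, password = credentials.split(":", 1)
    # Percent-encode so characters such as '@' or '/' cannot break the URL.
    return f"{scheme}://{quote(username, safe='')}:{quote(password, safe='')}@{host}:{port}"


# Hypothetical entry in the same shape as the list in the question.
print(to_proxy_url("username:password_session-xxxxx:some_proxy_server:8080"))
```

In the middleware this would be used as request.meta["proxy"] = to_proxy_url(random.choice(proxies)), so the session-specific password travels with each request. Whether each session then gets its own exit IP depends on the proxy provider.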
Related
Let's say I have these multiple proxies and I want to use them to request information from an API
proxies = open("proxies.txt", encoding="UTF-8", errors="ignore").read().splitlines()
proxies = list(proxies)
https_proxy = random.choice(proxies)
proxyDict = {"http" : "http://"+https_proxy}
req2 = c.get(url=url2,timeout=timeout, proxies = proxyDict).json()
example = req2['data']
This does not seem to work for me. When I run this script, my IP still seems to be rate-limited even though I send the request through a proxy. Any help will be appreciated.
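One possible cause, assuming url2 is an https:// address (the question does not show it): proxyDict only has an "http" key, and requests picks the proxy by the scheme of the target URL, so an HTTPS request would bypass the proxy entirely. A minimal sketch that maps both schemes to the same proxy; build_proxy_dict is a hypothetical helper and the addresses are placeholders.

```python
import random


def build_proxy_dict(proxy):
    """Map both URL schemes to the same HTTP proxy ('host:port' or 'user:pass@host:port')."""
    proxy_url = "http://" + proxy
    # requests chooses the entry by the scheme of the target URL, so an
    # https:// API call only goes through the proxy if "https" is present.
    return {"http": proxy_url, "https": proxy_url}


# Hypothetical entries standing in for the lines of proxies.txt.
proxies = ["203.0.113.10:8080", "203.0.113.11:3128"]
proxy_dict = build_proxy_dict(random.choice(proxies))
print(proxy_dict)
```

The result would then be passed as proxies=proxy_dict to the existing c.get(...) call.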
I am building a spider that uses Selenium together with a proxy. The main goal is to make the spider as robust as possible against being detected as a web scraper. I know that Scrapy has the 'scrapy-rotating-proxies' module, but I'm having trouble verifying whether Scrapy would check if the chromedriver succeeded in requesting a webpage and, if it failed because the scraper was detected, switch the proxy.
Second, I am unsure how a proxy is handled by my computer. For example, when I set a proxy value, does that value apply to everything that makes a request on my computer? I.e., will Scrapy and the webdriver have the same proxy value as long as one of them sets it? In particular, if Scrapy has a proxy value, will any Selenium webdriver instantiated inside the class definition inherit that proxy?
I'm quite inexperienced with these tools and would really appreciate some help!
I've tried looking for a method to test and check the proxy value of Selenium as well as Scrapy so that I can compare them:
import requests
from itertools import cycle
from lxml.html import fromstring

# gets the proxies and sets the value of the scrapy proxy list in settings
def get_proxies():
    url = 'https://free-proxy-list.net/'
    response = requests.get(url)
    parser = fromstring(response.text)
    proxies = set()
    for i in parser.xpath('//tbody/tr')[:10]:
        if i.xpath('.//td[7][contains(text(),"yes")]'):
            # Grabbing IP and corresponding PORT
            proxy = ":".join([i.xpath('.//td[1]/text()')[0], i.xpath('.//td[2]/text()')[0]])
            proxies.add(proxy)
    proxy_pool = cycle(proxies)
    url = 'https://httpbin.org/ip'
    new_proxy_list = []
    for i in range(1, 30):
        # Get a proxy from the pool
        proxy = next(proxy_pool)
        try:
            response = requests.get(url, proxies={"http": proxy, "https": proxy})
            # Grab and append proxy if valid
            new_proxy_list.append(proxy)
        except Exception:
            # Most free proxies will often get connection errors. You will have to retry the entire request using another proxy for it to work.
            # We will just skip retries as it's beyond the scope of this tutorial and we are only downloading a single url
            print("Skipping. Connection error")
    # add to settings proxy list
    settings.ROTATING_PROXY_LIST = new_proxy_list
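On the second question: as far as I know, a proxy set through request.meta['proxy'] or ROTATING_PROXY_LIST only applies to requests made by Scrapy's own downloader, so a browser started through Selenium would not inherit it and has to be given the proxy separately. A minimal sketch, assuming Chrome is the browser; chrome_proxy_argument and make_driver are hypothetical helper names and the address is a placeholder.

```python
def chrome_proxy_argument(proxy, scheme="http"):
    """Build Chrome's --proxy-server flag from a 'host:port' string."""
    return f"--proxy-server={scheme}://{proxy}"


def make_driver(proxy):
    """Start Chrome through Selenium with its own proxy setting."""
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument(chrome_proxy_argument(proxy))
    return webdriver.Chrome(options=options)


# Hypothetical proxy address, only to show the resulting flag.
print(chrome_proxy_argument("203.0.113.10:8080"))
```

To compare the two, both Scrapy and the driver can load https://httpbin.org/ip (already used above) and the reported addresses can be checked against each other.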
I want to write a script with Python 3.7, but first I have to scrape a website.
I have no problems connecting to and getting data from sites that are not banned, but if the site is banned it won't work.
If I use a VPN service, I can open these "banned" sites in the Chrome browser.
I tried setting a proxy in PyCharm, but I failed; I just got errors all the time.
What's the simplest free way to solve this problem?
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
req = Request('https://www.SOMEBANNEDSITE.com/', headers={'User-Agent': 'Mozilla/5.0'}) # that web site is blocked in my country
webpage = urlopen(req).read() # code stops running at this line because it can't connect to the site.
page_soup = soup(webpage, "html.parser")
There are multiple ways to scrape blocked sites. A solid way is to use a proxy service, as already mentioned.
A proxy server, also known as a "proxy", is a computer that acts as a gateway between your computer and the internet.
When you use a proxy, your requests are forwarded through the proxy, so your IP is not directly exposed to the site that you are scraping.
You can't simply take any IP (say xxx.xx.xx.xxx) and port (say yy), do
import requests
proxies = {'http': "http://xxx.xx.xx.xxx:yy",
           'https': "http://xxx.xx.xx.xxx:yy"}
r = requests.get('http://www.somebannedsite.com', proxies=proxies)
and expect to get a response.
The proxy should be configured to take your request and send you a response.
So, where can you get a proxy?
a. You could buy proxies from many providers.
b. Use a list of free proxies from the internet.
You don't need to buy proxies unless you are doing some massive-scale scraping.
For now I will focus on free proxies available on the internet. Just do a Google search for "free proxy provider" and you will find a list of sites offering free proxies. Go to any one of them and pick any IP and its corresponding port.
import requests
# replace the IP and port below with the IP and port you got from any of the free sites
# (the proxy itself is addressed over plain HTTP, so both keys use http://)
proxies = {'http': "http://182.52.51.155:39236",
           'https': "http://182.52.51.155:39236"}
r = requests.get('http://www.somebannedsite.com', proxies=proxies)
print(r.text)
You should, if possible, use a proxy with the 'Elite' anonymity level (most sites providing free proxies specify the anonymity level). If interested, you could also do a Google search to find the difference between 'elite', 'anonymous' and 'transparent' proxies.
Note:
Most of these free proxies are not that reliable, so if you get an error with one IP and port combination, try a different one.
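The "try a different one" advice can be sketched as a small loop. This is only an illustration: first_working_proxy and check_with_requests are hypothetical helper names, the 5 second timeout is an arbitrary choice, and https://httpbin.org/ip is just one convenient test URL.

```python
def first_working_proxy(candidates, check):
    """Return the first proxy for which check(proxy) succeeds, else None."""
    for proxy in candidates:
        try:
            if check(proxy):
                return proxy
        except Exception:
            # Connection errors and timeouts are common with free proxies;
            # move on to the next candidate.
            continue
    return None


def check_with_requests(proxy, test_url="https://httpbin.org/ip", timeout=5):
    """Send one request through the proxy; True if it answers with HTTP 200."""
    import requests  # imported lazily so the helper above has no dependencies

    proxy_url = "http://" + proxy
    response = requests.get(
        test_url,
        proxies={"http": proxy_url, "https": proxy_url},
        timeout=timeout,
    )
    return response.status_code == 200
```

Usage would look like first_working_proxy(["182.52.51.155:39236", "another_ip:port"], check_with_requests), with the list filled from whichever free proxy site you picked.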
Your best solution would be to use a proxy via the requests library, since it handles sending requests through a proxy flexibly.
Here is a small example:
import requests
from bs4 import BeautifulSoup as soup
# use your usable proxies here
# replace host with your proxy IP and port with the port number
# (a plain HTTP proxy is addressed with http:// for both keys)
proxies = {'http': "http://host:port",
           'https': "http://host:port"}
text = requests.get('http://www.somebannedsite.com', proxies=proxies, headers={'User-Agent': 'Mozilla/5.0'}).text
page_soup = soup(text, "html.parser") # use whatever parser you prefer, maybe lxml?
If you want to use SOCKS5, you have to install the dependencies via pip install requests[socks] and then replace the proxies part with:
# user is your authentication username
# pass is your auth password
# host and port are similar as above
proxies = {'http': 'socks5://user:pass@host:port',
           'https': 'socks5://user:pass@host:port'}
If you don't have proxies at hand, you can fetch some proxies.
Requests isn't using the proxies I pass to it. The site at the URL I'm using shows which IP the request came from, and it's always my IP, not the proxy IP. I'm getting my proxy IPs from sslproxies.org, which are supposed to be anonymous.
url = 'http://www.lagado.com/proxy-test'
proxies = {'http': 'x.x.x.x:xxxx'}
headers = {'User-Agent': 'Mozilla...etc'}
res = requests.get(url, proxies=proxies, headers=headers)
Are there certain headers that need to be used or something else that needs to be configured so that my ip is hidden from the server?
The docs state that proxy URLs must include the scheme, where the format is scheme://hostname. So you should add 'http://' or 'socks5://' to your proxy URL, depending on the protocol you're using.
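As a small sketch of that fix, a helper can add the scheme only when it is missing. The name ensure_scheme is made up for illustration, and "http" as the default is an assumption about the kind of proxy being used.

```python
def ensure_scheme(proxy, default_scheme="http"):
    """Prefix a bare 'host:port' proxy with a scheme; leave full URLs untouched."""
    if "://" in proxy:
        return proxy
    return f"{default_scheme}://{proxy}"


# Same placeholder address as in the question.
proxies = {"http": ensure_scheme("x.x.x.x:xxxx")}
print(proxies)
```

The resulting dict is passed to requests.get(url, proxies=proxies, headers=headers) exactly as in the question.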
I am new to Scrapy. I found the following way to use an HTTP proxy, but I want to use HTTP and HTTPS proxies together, because the links I crawl include both http and https links. How do I use an HTTPS proxy as well as an HTTP one?
import base64

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"
        # like here request.meta['proxy'] = "https://YOUR_PROXY_IP:PORT"
        proxy_user_pass = "USERNAME:PASSWORD"
        # set up basic authentication for the proxy
        # (base64.encodestring was removed in Python 3.9; b64encode needs bytes)
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
You could use the standard environment variables in combination with the HttpProxyMiddleware:
This middleware sets the HTTP proxy to use for requests, by setting the proxy meta value for Request objects.
Like the Python standard library modules urllib and urllib2, it obeys the following environment variables:
http_proxy
https_proxy
no_proxy
You can also set the meta key proxy per-request, to a value like http://some_proxy_server:port.
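A minimal sketch of the environment-variable route, assuming the same proxy should handle both http and https links. The address some_proxy_server:8080 is a placeholder, and urllib.request.getproxies() from the standard library is used here only to show that the variables are picked up per scheme.

```python
import os
from urllib.request import getproxies

# Hypothetical proxy address; replace it with a real one.
proxy_url = "http://some_proxy_server:8080"

# Set the variables before the crawl starts (here from Python, but
# exporting them in the shell that launches Scrapy works the same way).
os.environ["http_proxy"] = proxy_url
os.environ["https_proxy"] = proxy_url

# The standard library reads these variables back per URL scheme.
detected = getproxies()
print(detected["http"], detected["https"])
```

With both variables set, no custom middleware should be needed for the simple case of one proxy for all links.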