Python3 Requests not using passed-in proxy

Requests isn't using the proxies I pass to it. The site at the URL I'm using shows which IP the request came from, and it's always my IP, not the proxy IP. I'm getting my proxy IPs from sslproxies.org, which are supposed to be anonymous.
import requests

url = 'http://www.lagado.com/proxy-test'
proxies = {'http': 'x.x.x.x:xxxx'}  # no scheme prefix on the proxy address
headers = {'User-Agent': 'Mozilla...etc'}
res = requests.get(url, proxies=proxies, headers=headers)
Are there certain headers that need to be used or something else that needs to be configured so that my ip is hidden from the server?

The docs state that proxy URLs must include the scheme, where a proxy URL has the form scheme://hostname. So you should add 'http://' or 'socks5://' to the front of your proxy URL, depending on the protocol the proxy speaks.
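With the scheme added, the proxies mapping from the question would look like this (the address is the same placeholder as in the question):

```python
# The scheme prefix is what the call in the question was missing
# (address kept as the question's placeholder):
proxies = {
    'http': 'http://x.x.x.x:xxxx',
    # or, for a SOCKS proxy (pip install requests[socks]):
    # 'http': 'socks5://x.x.x.x:xxxx',
}
```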

Related

How to pass proxy header in python

I am using the Python requests library for a GET request.
The request goes via a proxy, and I want to set a proxy header (X-AUTH) on it.
I could not find a way to set a proxy header in requests; I want something similar to curl's --proxy-header option. Setting it in the normal request headers does not seem to work.
import requests
proxies = {
"http": "http://myproxy:8000",
"https": "http://myproxy:8000",
}
r = requests.get("https://abc.xyz.com/some/endpoint", proxies=proxies, headers={"X-AUTH": "mysecret"})
print(r.text)
I have changed the endpoints for privacy. The equivalent curl call works when --proxy-header is used.
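One way to send a header to the proxy itself is to subclass requests' HTTPAdapter and override its proxy_headers hook. This is a sketch, not an answer from the thread; the ProxyHeaderAdapter name is mine. For HTTPS targets, headers returned from proxy_headers ride on the CONNECT request to the proxy, which is exactly where a proxy-only header like X-AUTH belongs.

```python
import requests
from requests.adapters import HTTPAdapter

class ProxyHeaderAdapter(HTTPAdapter):
    """Transport adapter that adds extra headers addressed to the proxy.

    For HTTPS targets these headers go on the CONNECT request, so the
    origin server never sees them -- similar to curl's --proxy-header.
    """
    def __init__(self, extra_proxy_headers=None, **kwargs):
        self.extra_proxy_headers = dict(extra_proxy_headers or {})
        super().__init__(**kwargs)

    def proxy_headers(self, proxy):
        # Keep whatever the base class produces (Proxy-Authorization from
        # credentials in the proxy URL, if any), then add our own headers.
        headers = super().proxy_headers(proxy)
        headers.update(self.extra_proxy_headers)
        return headers

session = requests.Session()
adapter = ProxyHeaderAdapter({'X-AUTH': 'mysecret'})
session.mount('http://', adapter)
session.mount('https://', adapter)
# session.get('https://abc.xyz.com/some/endpoint',
#             proxies={'http': 'http://myproxy:8000',
#                      'https': 'http://myproxy:8000'})
```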

How to get info/data from blocked web sites with BeautifulSoup?

I want to write a script with Python 3.7, but first I have to scrape the site.
I have no problems connecting to and getting data from unblocked sites, but if the site is blocked it won't work.
If I use a VPN service I can open these "banned" sites in the Chrome browser.
I tried setting a proxy in PyCharm, but I failed; I just got errors all the time.
What's the simplest free way to solve this problem?
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
req = Request('https://www.SOMEBANNEDSITE.com/', headers={'User-Agent': 'Mozilla/5.0'}) # that web site is blocked in my country
webpage = urlopen(req).read() # code stops running at this line because it can't connect to the site.
page_soup = soup(webpage, "html.parser")
There are multiple ways to scrape blocked sites. A solid way is to use a proxy service, as already mentioned.
A proxy server, also known as a "proxy", is a computer that acts as a gateway between your computer and the internet.
When you use a proxy, your requests are forwarded through it, so your IP is not directly exposed to the site you are scraping.
You can't simply take any IP (say xxx.xx.xx.xxx) and port (say yy), do
import requests
proxies = { 'http': "http://xxx.xx.xx.xxx:yy",
            'https': "https://xxx.xx.xx.xxx:yy"}
r = requests.get('http://www.somebannedsite.com', proxies=proxies)
and expect to get a response.
The proxy must be configured to take your request and send you back a response.
So, where can you get a proxy?
a. You could buy proxies from many providers.
b. Use a list of free proxies from the internet.
You don't need to buy proxies unless you are doing some massive-scale scraping.
For now I will focus on the free proxies available on the internet. Just do a Google search for "free proxy provider" and you will find a list of sites offering free proxies. Go to any one of them and pick an IP and the corresponding port.
import requests
#replace the ip and port below with the ip and port you got from any of the free sites
proxies = { 'http': "http://182.52.51.155:39236",
            'https': "https://182.52.51.155:39236"}
r = requests.get('http://www.somebannedsite.com', proxies=proxies)
print(r.text)
You should, if possible, use a proxy with 'elite' anonymity level (the anonymity level is specified on most of the sites providing free proxies). If interested, you could also do a Google search to find the difference between 'elite', 'anonymous' and 'transparent' proxies.
Note: most of these free proxies are not that reliable, so if you get an error with one IP and port combination, try a different one.
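Since free proxies die frequently, it can help to loop over a list of them until one answers. This is a sketch under my own naming: fetch_via_proxies is a hypothetical helper, and the `get` callable is injected (pass in requests.get) so the loop can be exercised without a live proxy.

```python
def fetch_via_proxies(url, proxy_list, get):
    """Try each proxy in ip:port form until one returns a response.

    `get` is a callable like requests.get, passed in explicitly so the
    retry loop can be tested without contacting a real proxy.
    """
    if not proxy_list:
        raise ValueError('no proxies to try')
    last_error = None
    for ip_port in proxy_list:
        proxies = {'http': 'http://%s' % ip_port,
                   'https': 'http://%s' % ip_port}
        try:
            return get(url, proxies=proxies, timeout=10)
        except Exception as exc:  # free proxies fail often; move to the next
            last_error = exc
    raise last_error
```

Usage would be `fetch_via_proxies('http://www.somebannedsite.com', ['182.52.51.155:39236', ...], requests.get)`.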
Your best solution would be to use a proxy via the requests library, since it can flexibly route requests through a proxy.
Here is a small example:
import requests
from bs4 import BeautifulSoup as soup
# use your usable proxies here
# replace host with your proxy IP and port with the port number
proxies = { 'http': "http://host:port",
'https': "https://host:port"}
text = requests.get('http://www.somebannedsite.com', proxies=proxies, headers={'User-Agent': 'Mozilla/5.0'}).text
page_soup = soup(text, "html.parser") # use whatever parser you prefer, maybe lxml?
If you want to use SOCKS5, then you'd have to get the dependencies via pip install requests[socks] and then replace the proxies part by:
# user is your authentication username
# pass is your auth password
# host and port are the same as above
proxies = { 'http': 'socks5://user:pass@host:port',
            'https': 'socks5://user:pass@host:port' }
If you don't have proxies at hand, you can find free ones the same way as described above.

Python Requests behind proxy

I'm behind a corporate proxy (ISA Server).
When using urllib2 I can connect through the proxy to the internet without any problem, but when using the requests library I can't.
Here is my urllib2 code:
proxy = urllib2.ProxyHandler({})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
page = urllib2.urlopen('http://www.google.com')
print page.getcode()
This prints '200' and works fine
However, when doing the same with requests, I get a 407 code and it doesn't work.
proxy_dict = {
'http': 'http://10.20.23.5:8080',
'https': 'ftp://10.20.23.5:8080',
'ftp': 'https://10.20.23.5:8080'
}
page = requests.get('http://www.google.com', proxies=proxy_dict)
print page.status_code
print page.reason
This prints '407' and the reason: 'Proxy Authentication Required ( Forefront TMG requires authorization to fulfill the request. Access to the Web Proxy filter is denied. )'
Even if I pass to requests the proxies from urllib2 doesn't work either:
page = requests.get('http://www.google.com', proxies=urllib2.getproxies())
Urllib2 is doing something that requests is not.
Any help?
If your proxy requires authentication, you need to include the credentials in the proxy URLs:
proxy_dict = {
    'http': 'http://username:password@10.20.23.5:8080',
    'https': 'https://username:password@10.20.23.5:8080',
    'ftp': 'ftp://username:password@10.20.23.5:8080'
}

requests library https get via proxy leads to error

Trying to send a simple GET request via a proxy. I have the 'Proxy-Authorization' and 'Authorization' headers; I don't think I needed the 'Authorization' header, but I added it anyway.
import requests, base64
URL = 'https://www.google.com'
sess = requests.Session()
user = 'someuser'
password = 'somepass'
token = base64.encodestring('%s:%s'%(user,password)).strip()
sess.headers.update({'Proxy-Authorization':'Basic %s'%token})
sess.headers['Authorization'] = 'Basic %s'%token
resp = sess.get(URL)
I get the following error:
requests.packages.urllib3.exceptions.ProxyError: Cannot connect to proxy. Socket error: Tunnel connection failed: 407 Proxy Authentication Required.
However when I change the URL to simple http://www.google.com, it works fine.
Do proxies use Basic, Digest, or some other sort of authentication for https? Is it proxy server specific? How do I discover that info? I need to achieve this using the requests library.
UPDATE
It seems that with HTTP requests we have to pass a Proxy-Authorization header, but with HTTPS requests we need to embed the username and password in the proxy URL:
#HTTP
import requests, base64
URL = 'http://www.google.com'
user = <username>
password = <password>
proxies = {'http': 'http://<IP>:<PORT>'}
token = base64.encodestring('%s:%s' % (user, password)).strip()
myheader = {'Proxy-Authorization': 'Basic %s' % token}
r = requests.get(URL, proxies=proxies, headers=myheader)
print r.status_code # 200
#HTTPS
import requests
URL = 'https://www.google.com'
user = <username>
password = <password>
proxy = {'https': 'http://<user>:<password>@<IP>:<PORT>'}
r = requests.get(URL, proxies=proxy)
print r.status_code # 200
When sending an HTTP request, if I leave out the header and pass in a proxy formatted with user/pass, I get a 407 response.
When sending an HTTPS request, if I pass in the header and leave the proxy unformatted I get a ProxyError mentioned earlier.
I am using requests 2.0.0, and a Squid proxy-caching web server. Why doesn't the header option work for HTTPS? Why does the formatted proxy not work for HTTP?
The answer is that the HTTP case is bugged. The expected behaviour in that case is the same as the HTTPS case: that is, you provide your authentication credentials in the proxy URL.
The reason the header option doesn't work for HTTPS is that HTTPS via proxies is totally different from HTTP via proxies. When you route an HTTP request via a proxy, you essentially just send a standard HTTP request to the proxy, with a path that indicates a totally different host, like this:
GET http://www.google.com/ HTTP/1.1
Host: www.google.com
The proxy then basically forwards this on.
For HTTPS that can't possibly work, because you need to negotiate an SSL connection with the remote server. Rather than doing anything like the HTTP case, you use the CONNECT verb: the proxy server connects to the remote end on behalf of the client, and from then on just proxies the TCP data.
When you attach a Proxy-Authorization header to an HTTPS request, we don't put it on the CONNECT message; we put it on the tunnelled HTTPS message. This means the proxy never sees it, and so refuses your connection. We special-case the authentication information in the proxy URL to make sure it is attached correctly to the CONNECT message.
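To make the mechanics concrete, here is a sketch that builds the CONNECT preamble a client sends to an authenticating proxy. connect_preamble is an illustrative helper of mine, not part of requests or urllib3; it just shows why credentials in the proxy URL end up somewhere the proxy can see them.

```python
import base64
from urllib.parse import urlsplit

def connect_preamble(proxy_url, target_host, target_port=443):
    """Build the CONNECT request sent to a proxy before an HTTPS tunnel.

    Credentials embedded in proxy_url become a Proxy-Authorization header
    on the CONNECT message itself -- the only message the proxy inspects.
    """
    parts = urlsplit(proxy_url)
    lines = [
        'CONNECT %s:%d HTTP/1.1' % (target_host, target_port),
        'Host: %s:%d' % (target_host, target_port),
    ]
    if parts.username:
        creds = '%s:%s' % (parts.username, parts.password)
        token = base64.b64encode(creds.encode('ascii')).decode('ascii')
        lines.append('Proxy-Authorization: Basic ' + token)
    return '\r\n'.join(lines) + '\r\n\r\n'

print(connect_preamble('http://user:pass@my.proxy:8080', 'www.google.com'))
```

Everything after this exchange succeeds is opaque TLS bytes to the proxy, which is why headers on the tunnelled request can never reach it.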
Requests and urllib3 are currently discussing the right place for this bug fix to go. I expect that the fix will be in the next Requests release.

Using urllib2 via proxy

I am trying to use urllib2 through a proxy; however, after trying just about every variation of passing my credentials using urllib2, I either get a request that hangs forever and returns nothing, or I get 407 errors. I can connect to the web fine using my browser, which connects to a proxy PAC and redirects accordingly; however, I can't seem to do anything from the command line (curl, wget, urllib2, etc.), even if I use the proxies the PAC file redirects to. I tried setting my proxy to each of the proxies from the PAC file using urllib2, none of which work.
My current script looks like this:
import urllib2 as url
proxy = url.ProxyHandler({'http': 'http://username:password@my.proxy:8080'})
auth = url.HTTPBasicAuthHandler()
opener = url.build_opener(proxy, auth, url.HTTPHandler)
url.install_opener(opener)
url.urlopen("http://www.google.com/")
which throws HTTP Error 407: Proxy Authentication Required and I also tried:
import urllib2 as url
handlePass = url.HTTPPasswordMgrWithDefaultRealm()
handlePass.add_password(None, "http://my.proxy:8080", "username", "password")
auth_handler = url.HTTPBasicAuthHandler(handlePass)
opener = url.build_opener(auth_handler)
url.install_opener(opener)
url.urlopen("http://www.google.com")
which hangs, like curl or wget, until it times out.
What do I need to do to diagnose the problem? How is it possible that I can connect via my browser but not from the command line on the same computer, using what appears to be the same proxy and credentials?
Might it be something to do with the router? If so, how could it distinguish browser HTTP requests from command-line HTTP requests?
Frustrations like this are what drove me to use Requests. If you're doing significant amounts of work with urllib2, you really ought to check it out. For example, to do what you wish to do using Requests, you could write:
import requests
from requests.auth import HTTPProxyAuth
proxy = {'http': 'http://my.proxy:8080'}
auth = HTTPProxyAuth('username', 'password')
r = requests.get('http://wwww.google.com/', proxies=proxy, auth=auth)
print r.text
Or you could use a Session object, and every request will automatically use the proxy information (plus it will store & handle cookies automatically!). Note that in requests 1.0+ the Session constructor takes no arguments, so set the attributes instead:
s = requests.Session()
s.proxies = proxy
s.auth = auth
r = s.get('http://www.google.com/')
print r.text
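A related option (my note, not from the thread): requests also honours the standard proxy environment variables, which the standard library reads via urllib.request.getproxies() in Python 3. This configures the proxy process-wide without touching any call site; the proxy address and credentials below are placeholders.

```python
import os
import urllib.request

# With these set, requests (and most command-line tools) will route
# traffic through the proxy without per-call proxies= arguments.
os.environ['HTTP_PROXY'] = 'http://username:password@my.proxy:8080'
os.environ['HTTPS_PROXY'] = 'http://username:password@my.proxy:8080'

# The stdlib helper shows what the environment currently advertises.
print(urllib.request.getproxies()['http'])
```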
