Python mechanize doesn't work when HTTPS and proxy authentication are required

I use Python 2.7.2 and Mechanize 0.2.5.
When I access the Internet, I have to go through a proxy server. I wrote the following code, but a URLError occurs at the last line. Does anyone have a solution for this?
import mechanize
br = mechanize.Browser()
br.set_debug_http(True)
br.set_handle_robots(False)
br.set_proxies({
    "http": "192.168.20.130:8080",
    "https": "192.168.20.130:8080",
})
br.add_proxy_password("username", "password")
br.open("http://www.google.co.jp/") # OK
br.open("https://www.google.co.jp/") # Proxy Authentication Required

I don't recommend using Mechanize; it's outdated. Take a look at requests, which will
make your life a lot easier. Using proxies with requests is just this:
import requests
proxies = {
    "http": "10.10.1.10:3128",
    "https": "10.10.1.10:1080",
}
requests.get("http://example.org", proxies=proxies)

Related

How to get info/data from blocked web sites with BeautifulSoup?

I want to write a scraping script with Python 3.7, but first I have to get the page.
I have no problems connecting to and getting data from sites that aren't blocked, but if the site is banned, it won't work.
If I use a VPN service, I can reach these "banned" sites with the Chrome browser.
I tried setting a proxy in PyCharm, but I failed; I just got errors all the time.
What's the simplest free way to solve this problem?
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
req = Request('https://www.SOMEBANNEDSITE.com/', headers={'User-Agent': 'Mozilla/5.0'}) # that web site is blocked in my country
webpage = urlopen(req).read() # code stops running at this line because it can't connect to the site.
page_soup = soup(webpage, "html.parser")
There are multiple ways to scrape blocked sites. A solid one is to use a proxy service, as already mentioned.
A proxy server, also known as a "proxy", is a computer that acts as a gateway between your computer and the internet.
When you use a proxy, your requests are forwarded through it, so your IP is not directly exposed to the site you are scraping.
You can't simply take any IP (say xxx.xx.xx.xxx) and port (say yy), do
import requests

# note: the scheme in the proxy URL is http:// even for the 'https' key;
# it describes how to reach the proxy, not the target site
proxies = {'http': "http://xxx.xx.xx.xxx:yy",
           'https': "http://xxx.xx.xx.xxx:yy"}
r = requests.get('http://www.somebannedsite.com', proxies=proxies)
and expect to get a response.
The proxy has to be configured to accept your request, forward it, and send the response back to you.
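Before building anything on top of a proxy, it's worth checking that it actually relays your traffic. One way (a sketch, using the placeholder address from above and httpbin.org as an IP-echo service):
import requests

proxies = {'http': "http://xxx.xx.xx.xxx:yy",
           'https': "http://xxx.xx.xx.xxx:yy"}
try:
    # httpbin echoes back the IP the request came from;
    # if the proxy works, this should be the proxy's IP, not yours
    r = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=10)
    print(r.json())
except requests.exceptions.RequestException as e:
    print("proxy did not respond:", e)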
So, where can you get a proxy?
a. You can buy proxies from many providers.
b. You can use a list of free proxies from the internet.
You don't need to buy proxies unless you are doing massive-scale scraping.
For now I will focus on the free proxies available on the internet. Just do a Google search for "free proxy provider" and you will find sites offering free proxies. Go to any one of them and pick an IP and the corresponding port.
import requests

# replace the IP and port below with the IP and port you got
# from one of the free-proxy sites
proxies = {'http': "http://182.52.51.155:39236",
           'https': "http://182.52.51.155:39236"}
r = requests.get('http://www.somebannedsite.com', proxies=proxies)
print(r.text)
If possible, you should use a proxy with 'Elite' anonymity level (the anonymity level is listed on most sites that provide free proxies). If you're interested, you can also do a Google search to find the difference between 'elite', 'anonymous' and 'transparent' proxies.
Note:
Most of these free proxies are not very reliable, so if you get an error with one IP and port combination, try a different one, as in the sketch below.
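A minimal sketch of that retry pattern, cycling through a small list of candidate proxies (the addresses are placeholders, not known-working proxies):
import requests

# placeholder addresses; fill in entries from any free-proxy list
candidates = ["182.52.51.155:39236", "xxx.xx.xx.xxx:yy"]

for addr in candidates:
    proxies = {'http': "http://" + addr, 'https': "http://" + addr}
    try:
        r = requests.get('http://www.somebannedsite.com',
                         proxies=proxies, timeout=10)
        print("working proxy:", addr)
        break  # keep the first proxy that responds
    except requests.exceptions.RequestException:
        continue  # dead or blocked proxy; try the next one
else:
    print("no working proxy found")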
Your best solution would be to use a proxy via the requests library, which can route requests through a proxy flexibly with very little code.
Here is a small example:
import requests
from bs4 import BeautifulSoup as soup
# use a working proxy here:
# replace host with your proxy's IP and port with its port number
proxies = {'http': "http://host:port",
           'https': "http://host:port"}
text = requests.get('http://www.somebannedsite.com', proxies=proxies, headers={'User-Agent': 'Mozilla/5.0'}).text
page_soup = soup(text, "html.parser") # use whatever parser you prefer, maybe lxml?
If you want to use SOCKS5, you'll have to get the dependencies via pip install requests[socks] and then replace the proxies part with:
# 'user' is your proxy username, 'pass' your proxy password
# host and port are as above
proxies = {'http': 'socks5://user:pass@host:port',
           'https': 'socks5://user:pass@host:port'}
If you don't have proxies at hand, you can fetch some from the free lists mentioned above.

Python Requests behind proxy

I'm behind a corporate proxy (Isa Server).
When using urllib2 I can connect through the proxy to the internet without any problem, but when using the requests library I can't.
Here is my urllib2 code:
proxy = urllib2.ProxyHandler({})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
page = urllib2.urlopen('http://www.google.com')
print page.getcode()
This prints '200' and works fine.
However, when I do the same with requests, I get a 407 and it doesn't work:
proxy_dict = {
    'http': 'http://10.20.23.5:8080',
    'https': 'http://10.20.23.5:8080',
    'ftp': 'http://10.20.23.5:8080'
}
page = requests.get('http://www.google.com', proxies=proxy_dict)
print page.status_code
print page.reason
This prints '407' and the reason: 'Proxy Authentication Required ( Forefront TMG requires authorization to fulfill the request. Access to the Web Proxy filter is denied. )'
Even if I pass requests the proxies that urllib2 reports, it doesn't work either:
page = requests.get('http://www.google.com', proxies=urllib2.getproxies())
Urllib2 is doing something that requests is not.
Any help?
If your proxy requires authentication, you need to include the credentials in the proxy URLs:
proxy_dict = {
    'http': 'http://username:password@10.20.23.5:8080',
    'https': 'http://username:password@10.20.23.5:8080',
    'ftp': 'http://username:password@10.20.23.5:8080'
}
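If you'd rather not embed the credentials in the URL, requests can also take them separately via requests.auth.HTTPProxyAuth; a sketch using the question's proxy address (note this sends Basic auth, so an NTLM-only proxy such as Forefront TMG may still answer 407):
import requests
from requests.auth import HTTPProxyAuth

proxies = {'http': 'http://10.20.23.5:8080',
           'https': 'http://10.20.23.5:8080'}
# HTTPProxyAuth attaches a Proxy-Authorization header (Basic auth only)
auth = HTTPProxyAuth('username', 'password')

page = requests.get('http://www.google.com', proxies=proxies, auth=auth)
print(page.status_code)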

requests via a SOCKS proxy

How can I make an HTTP request via a SOCKS proxy (simply using ssh -D as the proxy)? I've tried using requests with SOCKS proxies but it doesn't appear to work (I saw this pull request). For example:
proxies = { "http": "socks5://localhost:9999/" }
r = requests.post( endpoint, data=request, proxies=proxies )
It'd be convenient to keep using the requests library, but I can also switch to urllib2 if that is known to work.
Since SOCKS support was added to requests in version 2.10.0, this is remarkably simple, and very close to what you have.
Install requests[socks]:
$ pip install requests[socks]
Set up your proxies variable, and make use of it:
>>> import requests
>>> proxies = {
...     "http": "socks5://localhost:9999",
...     "https": "socks5://localhost:9999",
... }
>>> requests.get(
...     "https://api.ipify.org?format=json",
...     proxies=proxies
... ).json()
{u'ip': u'123.xxx.xxx.xxx'}
A few things to note: don't put a trailing / at the end of the proxy URL, and you can also use socks4:// as the scheme if the SOCKS server doesn't support SOCKS5.
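One related detail: with socks5:// the target hostname is resolved locally before the request goes to the proxy. If DNS lookups should also go through the tunnel (often the point of ssh -D), requests understands the socks5h:// scheme:
import requests

# socks5h:// asks the SOCKS proxy to resolve DNS, instead of resolving locally
proxies = {
    "http": "socks5h://localhost:9999",
    "https": "socks5h://localhost:9999",
}
print(requests.get("https://api.ipify.org?format=json", proxies=proxies).json())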
SOCKS support for requests is still pending. If you want, you can view my GitHub repository here to see my branch of the SocksiPy library. This is the branch that is currently being integrated into requests; it will be some time before requests fully supports it, though.
https://github.com/Anorov/PySocks/
In the meantime, it should work okay with urllib2. Import SocksiPyHandler from the sockshandler module in that repository and follow the example inside it. You'll want to create an opener like this:
import urllib2
import socks
from sockshandler import SocksiPyHandler

opener = urllib2.build_opener(SocksiPyHandler(socks.PROXY_TYPE_SOCKS5, "localhost", 9050))
Then you can use opener.open(url) and it should tunnel through the proxy.

Using urllib2 via proxy

I am trying to use urllib2 through a proxy; however, after trying just about every variation of passing my credentials with urllib2, I either get a request that hangs forever and returns nothing, or I get 407 errors. I can connect to the web fine using my browser, which reads a proxy PAC file and redirects accordingly; however, I can't seem to do anything from the command line with curl, wget, urllib2, etc., even if I use the proxies that the PAC file redirects to. I tried setting my proxy to each of the proxies from the PAC file in urllib2, and none of them work.
My current script looks like this:
import urllib2 as url
proxy = url.ProxyHandler({'http': 'username:password@my.proxy:8080'})
auth = url.HTTPBasicAuthHandler()
opener = url.build_opener(proxy, auth, url.HTTPHandler)
url.install_opener(opener)
url.urlopen("http://www.google.com/")
which throws HTTP Error 407: Proxy Authentication Required. I also tried:
import urllib2 as url
handlePass = url.HTTPPasswordMgrWithDefaultRealm()
handlePass.add_password(None, "http://my.proxy:8080", "username", "password")
auth_handler = url.HTTPBasicAuthHandler(handlePass)
opener = url.build_opener(auth_handler)
url.install_opener(opener)
url.urlopen("http://www.google.com")
which hangs like curl or wget timing out.
What do I need to do to diagnose the problem? How is it possible that I can connect via my browser but not from the command line on the same computer using what would appear to be the same proxy and credentials?
Might it have something to do with the router? If so, how can it distinguish between browser HTTP requests and command-line HTTP requests?
Frustrations like this are what drove me to use Requests. If you're doing significant amounts of work with urllib2, you really ought to check it out. For example, to do what you wish to do using Requests, you could write:
import requests
from requests.auth import HTTPProxyAuth
proxy = {'http': 'http://my.proxy:8080'}
auth = HTTPProxyAuth('username', 'password')
r = requests.get('http://www.google.com/', proxies=proxy, auth=auth)
print r.text
Or you could wrap it in a Session object, and every request will automatically use the proxy information (plus it will store and handle cookies automatically!):
s = requests.Session()
s.proxies = proxy
s.auth = auth
r = s.get('http://www.google.com/')
print r.text
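As for the diagnostic part of the question: urllib2's handlers accept a debuglevel that prints the raw request and response headers, which shows exactly what is (or isn't) being sent to the proxy; a minimal sketch based on the question's first snippet:
import urllib2

proxy = urllib2.ProxyHandler({'http': 'username:password@my.proxy:8080'})
# debuglevel=1 prints the wire-level headers to stdout, including
# any Proxy-Authorization header and the proxy's 407 response
opener = urllib2.build_opener(proxy, urllib2.HTTPHandler(debuglevel=1))
urllib2.install_opener(opener)
urllib2.urlopen('http://www.google.com/')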

urllib2 failing with https websites

I'm using urllib2 to try to fetch an https page, but it keeps failing with
Invalid url, unable to resolve
The url is
https://www.domainsbyproxy.com/default.aspx
but I have this happening on multiple https sites.
I am using Python 2.7, and below is the code I am using to set up the connection:
opener = urllib2.OpenerDirector()
opener.add_handler(urllib2.HTTPHandler())
opener.add_handler(urllib2.HTTPDefaultErrorHandler())
opener.addheaders = [('Accept-encoding', 'gzip')]
fetch_timeout = 12
response = opener.open(url, None, fetch_timeout)
The reason I am setting handlers manually is that I don't want redirects handled (which works fine). The above works fine for http requests; however, https fails.
Any clues?
You should be using HTTPSHandler instead of HTTPHandler, as sketched below.
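Concretely, keeping the manual opener (and its no-redirect behavior) and adding the https handler might look like this, a sketch based on the question's code:
import urllib2

opener = urllib2.OpenerDirector()
opener.add_handler(urllib2.HTTPHandler())
opener.add_handler(urllib2.HTTPSHandler())  # handles https:// URLs
opener.add_handler(urllib2.HTTPDefaultErrorHandler())
opener.addheaders = [('Accept-encoding', 'gzip')]

fetch_timeout = 12
response = opener.open('https://www.domainsbyproxy.com/default.aspx', None, fetch_timeout)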
If you don't mind external libraries, consider the excellent requests module. It takes care of quirks like these in urllib. Your code, using requests, is:
import requests

# allow_redirects=False mirrors the original opener, which did not handle redirects
r = requests.get(url, headers={'Accept-encoding': 'gzip'}, timeout=12,
                 allow_redirects=False)
