bypass proxy using tor and torctl - python

I am looking at how to bypass a proxy using Tor and torctl. I followed various guides and wrote this script in Python.
After starting Tor, ideally this should work:
import json
import urllib2

# 127.0.0.1:8118 is the default Privoxy listener, commonly chained in front of Tor
proxy_support = urllib2.ProxyHandler({"http": "127.0.0.1:8118"})
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)
#urllib2.urlopen('http://www.google.fr')
data = json.load(urllib2.urlopen("https://www.google.co.in/trends/hottrends/hotItems?geo=IN&mob=0&hvsm=0"))
which again gives this error:
File "/usr/lib/python2.7/urllib2.py", line 528, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 503: Service Unavailable
I have already started Tor and enabled the control port using
tor --controlport 9051
Do I need to make any other change?
EDIT
Traceback after changing to run Tor on port 1080, as per the answer below:
>>> import urllib2
>>> proxy_support = urllib2.ProxyHandler({"http" : "127.0.0.1:1080"})
>>> opener = urllib2.build_opener(proxy_support)
>>> urllib2.install_opener(opener)
>>> import json
>>> data = json.load(urllib2.urlopen("https://www.google.co.in/trends/hottrends/hotItems?geo=IN&mob=0&hvsm=0"))
('216.58.220.3', 443)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 407, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 520, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 445, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 379, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 528, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 503: Service Unavailable

You can use Tor as a SOCKS proxy.
Start Tor with:
tor --SocksPort 1080
and use the SOCKS5 proxy at 127.0.0.1:1080. Note that urllib2 has no native SOCKS support, which is why pointing an http ProxyHandler at the SOCKS listener fails; see the sketch below for one way to wire it in.
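A minimal sketch of that wiring, using the third-party PySocks module (pip install PySocks; the older socksipy spells the call socks.setdefaultproxy) and assuming Tor is listening on 127.0.0.1:1080:
import socket
import socks  # third-party: PySocks
import urllib2

# Route every new socket through Tor's SOCKS5 listener.
socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", 1080)
socket.socket = socks.socksocket  # monkey-patch so urllib2 picks it up

# Both http and https URLs now go through Tor. Caveat: hostname
# resolution still happens locally with this approach.
print urllib2.urlopen("https://check.torproject.org/").read()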

Related

unable to open some url in urllib2 which can still be opened in browsers? [duplicate]

This question already has an answer here: urllib2.urlopen cannot get image, but browser can
I can open this URL in Firefox or Chrome, but I cannot open it with urllib2.
>>> req = urllib2.Request(r"http://ratedata.gaincapital.com/2014/.\01 January", headers={'User-Agent' : "Mozilla/5.1"})
>>> urllib2.urlopen(req)
Traceback (most recent call last):
File "<pyshell#134>", line 1, in <module>
urllib2.urlopen(req)
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 400: Bad Request
A very strange URL, but how can I fix this?
Replace the white space with %20:
>>> req = urllib2.Request(r"http://ratedata.gaincapital.com/2014/.\01%20January", headers={'User-Agent' : "Mozilla/5.1"})
>>> urllib2.urlopen(req)
<addinfourl at 139708797193896 whose fp = <socket._fileobject object at 0x7f10820eb2d0>>
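More generally, the standard library can do the escaping; a minimal sketch using urllib.quote, with safe="\\" keeping the backslash literal to match the hand-escaped URL that worked above:
import urllib
import urllib2

base = "http://ratedata.gaincapital.com/2014/"
path = ".\\01 January"  # the odd path segment from the question
# quote() percent-encodes the space; safe="\\" leaves the backslash as-is.
url = base + urllib.quote(path, safe="\\")
req = urllib2.Request(url, headers={'User-Agent': "Mozilla/5.1"})
print urllib2.urlopen(req).getcode()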

urllib2 retrieve an arbitrary file based on URL and save it into a named file

I am writing a Python script that uses the urllib2 module as an equivalent to the command-line utility wget. The only function I want is to retrieve an arbitrary file by URL and save it into a named file. I only need to worry about two command-line arguments: the URL from which the file is to be downloaded, and the name of the file into which the contents are to be saved.
Example:
python Prog7.py www.python.org pythonHomePage.html
This is my code:
import urllib
import urllib2
#import requests

url = 'http://www.python.org/pythonHomePage.html'
print "downloading with urllib"
urllib.urlretrieve(url, "code.txt")
print "downloading with urllib2"
f = urllib2.urlopen(url)
data = f.read()
with open("code2.txt", "wb") as code:
    code.write(data)
urllib seems to work but urllib2 does not seem to work.
Errors received:
File "Problem7.py", line 11, in <module>
f = urllib2.urlopen(url)
File "/usr/lib64/python2.6/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib64/python2.6/urllib2.py", line 397, in open
response = meth(req, response)
File "/usr/lib64/python2.6/urllib2.py", line 510, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib64/python2.6/urllib2.py", line 429, in error
result = self._call_chain(*args)
File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
result = func(*args)
File "/usr/lib64/python2.6/urllib2.py", line 616, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "/usr/lib64/python2.6/urllib2.py", line 397, in open
response = meth(req, response)
File "/usr/lib64/python2.6/urllib2.py", line 510, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib64/python2.6/urllib2.py", line 435, in error
return self._call_chain(*args)
File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
result = func(*args)
File "/usr/lib64/python2.6/urllib2.py", line 518, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: NOT FOUND
And the URL indeed doesn't exist at all; https://www.python.org/pythonHomePage.html is a 404 Not Found page.
The difference between urllib and urllib2 then is that the latter automatically raises an exception when a 404 page is returned, while urllib.urlretrieve() just saves the error page for you:
>>> import urllib
>>> urllib.urlopen('https://www.python.org/pythonHomePage.html').getcode()
404
>>> import urllib2
>>> urllib2.urlopen('https://www.python.org/pythonHomePage.html')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: NOT FOUND
If you want to save the error page, you can catch the urllib2.HTTPError exception:
try:
    f = urllib2.urlopen(url)
    data = f.read()
except urllib2.HTTPError as err:
    data = err.read()
This is due to the different behaviour of urllib and urllib2.
Since the web page returns a 404 error (page not found), urllib2 raises it as an exception, while urllib downloads the HTML of the returned page regardless of the error.
If you want to write that HTML to the text file, read it from the error:
import urllib2

try:
    data = urllib2.urlopen('http://www.python.org/pythonHomePage.html').read()
except urllib2.HTTPError as e:
    print e.code
    print e.msg
    print e.headers
    data = e.fp.read()  # read the body once; a second e.fp.read() would return ''

with open("code2.txt", "wb") as code:
    code.write(data)
req will be a Request object, fp will be a file-like object with the
HTTP error body, code will be the three-digit code of the error, msg
will be the user-visible explanation of the code and hdrs will be a
mapping object with the headers of the error.
More on HTTPError in the urllib2 documentation.
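Putting it together, a minimal sketch of the two-argument wget-style script the question describes (the bare-hostname handling is an assumption, added so the example invocation above works):
import sys
import urllib2

def main():
    if len(sys.argv) != 3:
        print "usage: python Prog7.py <url> <output-file>"
        sys.exit(1)
    url, filename = sys.argv[1], sys.argv[2]
    if not url.startswith("http"):
        url = "http://" + url  # allow bare hostnames like www.python.org
    try:
        data = urllib2.urlopen(url).read()
    except urllib2.HTTPError as err:
        data = err.read()  # save the error page instead of crashing
    with open(filename, "wb") as out:
        out.write(data)

if __name__ == "__main__":
    main()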

why can't I use urllib2.urlopen for wikipedia site? [duplicate]

Possible Duplicate: Fetch a Wikipedia article with Python
>>> print urllib2.urlopen('http://zh.wikipedia.org/wiki/%E6%AF%9B%E6%B3%BD%E4%B8%9C').read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
response = meth(req, response)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
'http', request, response, code, msg, hdrs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
return self._call_chain(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
You need to provide a User-Agent header, or you'll get a 403, as you did.
On Wikimedia wikis, if you don't supply a User-Agent header, or you
supply an empty or generic one, your request will fail with an HTTP
403 error. See our User-Agent policy. Other MediaWiki installations
may have similar policies.
So just add a User-Agent to your code and it should work fine, for example:
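A minimal sketch (the User-Agent string is illustrative; Wikimedia's policy asks for a descriptive one with contact information):
import urllib2

url = 'http://zh.wikipedia.org/wiki/%E6%AF%9B%E6%B3%BD%E4%B8%9C'
# A descriptive User-Agent with contact details, per the Wikimedia policy.
req = urllib2.Request(url, headers={'User-Agent': 'MyScript/1.0 (me@example.com)'})
print urllib2.urlopen(req).read()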
Try to download the page with wget or cURL.
If you can't then you might have a network problem.
If you can, then Wikipedia might block certain user agents. In that case, use urllib2's add_header to define a custom user agent (to imitate a browser request), as sketched below.
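A short sketch of that add_header variant:
import urllib2

req = urllib2.Request('http://zh.wikipedia.org/wiki/%E6%AF%9B%E6%B3%BD%E4%B8%9C')
req.add_header('User-Agent', 'Mozilla/5.0')  # imitate a browser request
print urllib2.urlopen(req).read()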

url is not accessible through wget or script

I want to access a web page from a Python script. The URL is: http://www.idealo.de/preisvergleich/Shop/27039.html
When I access it through a web browser it is OK, but when I try to access it with urllib2:
a = urllib2.urlopen("http://www.idealo.de/preisvergleich/Shop/27039.html")
It gives me the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 406, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 444, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
I also tried to access it with wget:
wget http://www.idealo.de/preisvergleich/Shop/27039.html
The error is:
--2012-04-23 12:42:03-- http://www.idealo.de/preisvergleich/Shop/27039.html
Resolving www.idealo.de (www.idealo.de)... 62.146.49.133
Connecting to www.idealo.de (www.idealo.de)|62.146.49.133|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2012-04-23 12:42:03 ERROR 403: Forbidden.
Can anyone explain why this is so? And how can I access it using Python?
They're blocking some user agents. If you try with the following:
wget -U "Mozilla/5.0" http://www.idealo.de/preisvergleich/Shop/27039.html
it works. So you have to find a way to fake the user agent in your Python code to make it work.
Try this:
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
a = opener.open("http://www.idealo.de/preisvergleich/Shop/27039.html")

A site is available but it always responds "Internal Server Error"

The code looks like:
import urllib2

url = "http://www.example.com"
for a in range(0, 10):
    opener = urllib2.build_opener()
    urllib2.install_opener(opener)
    postdata = "info=123456" + str(a)
    urllib2.urlopen(url, postdata)
which just posts some data to a specific URL (e.g. http://www.example.com). However, I always get this error message:
Traceback (most recent call last):
File "test.py", line 9, in <module>
urllib2.urlopen(url, postdata)
File "c:\Python26\lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "c:\Python26\lib\urllib2.py", line 397, in open
response = meth(req, response)
File "c:\Python26\lib\urllib2.py", line 510, in http_response
'http', request, response, code, msg, hdrs)
File "c:\Python26\lib\urllib2.py", line 435, in error
return self._call_chain(*args)
File "c:\Python26\lib\urllib2.py", line 369, in _call_chain
result = func(*args)
File "c:\Python26\lib\urllib2.py", line 518, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 500: Internal Server Error
I am sure the site is working, so how can I fix the problem? Any help would be greatly appreciated.
You say you're sure the site is working, yet it returns an error. Try doing whatever you did to determine that the site is working while running a network logger like Wireshark, then run your test program, and compare whether the two really issue the same requests. If not, you've found the problem.
Otherwise, take a look at the server's logs. A much more descriptive error message should be found there. If it's not your server, consider asking whoever does own it.
Some websites don't accept requests from urllib. Try changing the User-Agent, as in the sketch below.
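A minimal sketch combining both suggestions: urllib.urlencode builds a properly encoded form body, and a browser-like User-Agent is set (the header value is illustrative; the URL and field name are from the question):
import urllib
import urllib2

url = "http://www.example.com"
for a in range(0, 10):
    postdata = urllib.urlencode({"info": "123456" + str(a)})  # proper form encoding
    req = urllib2.Request(url, postdata, headers={"User-Agent": "Mozilla/5.0"})
    response = urllib2.urlopen(req)  # POST, since data is supplied
    print response.getcode()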
