>>> print urllib2.urlopen('http://zh.wikipedia.org/wiki/%E6%AF%9B%E6%B3%BD%E4%B8%9C').read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
response = meth(req, response)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
'http', request, response, code, msg, hdrs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
return self._call_chain(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
You need to provide a User-Agent header, or you'll get a 403, as you did:
On Wikimedia wikis, if you don't supply a User-Agent header, or you
supply an empty or generic one, your request will fail with an HTTP
403 error. See our User-Agent policy. Other MediaWiki installations
may have similar policies.
So just add a User-Agent header to your request and it should work fine.
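For example, a minimal sketch (the tool name and contact address in the User-Agent string are made-up placeholders; the policy asks for something descriptive like this):
import urllib2

# A descriptive User-Agent satisfies the Wikimedia policy; the values here are placeholders
req = urllib2.Request('http://zh.wikipedia.org/wiki/%E6%AF%9B%E6%B3%BD%E4%B8%9C',
                      headers={'User-Agent': 'MyWikiTool/1.0 (contact: me@example.com)'})
print urllib2.urlopen(req).read()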
Try to download the page with wget or cURL.
If you can't then you might have a network problem.
If you can, then Wikipedia might block certain user agents. In that case, use urllib2's add_header to define a custom user agent (to imitate a browser request).
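A rough sketch of that add_header approach:
import urllib2

req = urllib2.Request('http://zh.wikipedia.org/wiki/%E6%AF%9B%E6%B3%BD%E4%B8%9C')
req.add_header('User-Agent', 'Mozilla/5.0')  # imitate a browser request
print urllib2.urlopen(req).read()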
I am trying to bypass a proxy using Tor and TorCtl. I looked at various guides and wrote this script in Python.
After starting Tor, this should ideally work:
import urllib2
import json

proxy_support = urllib2.ProxyHandler({"http": "127.0.0.1:8118"})
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)
#urllib2.urlopen('http://www.google.fr')
data = json.load(urllib2.urlopen("https://www.google.co.in/trends/hottrends/hotItems?geo=IN&mob=0&hvsm=0"))
but it keeps giving this error message:
File "/usr/lib/python2.7/urllib2.py", line 528, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 503: Service Unavailable
I have already started tor and enabled control port using
tor --controlport 9051
Do I need to make any other change?
EDIT
Traceback after changing the setup per the new answer below (running Tor on port 1080):
>>> import urllib2
>>> proxy_support = urllib2.ProxyHandler({"http" : "127.0.0.1:1080"})
>>> opener = urllib2.build_opener(proxy_support)
>>> urllib2.install_opener(opener)
>>> import json
>>> data = json.load(urllib2.urlopen("https://www.google.co.in/trends/hottrends/hotItems?geo=IN&mob=0&hvsm=0"))
('216.58.220.3', 443)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 407, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 520, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 445, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 379, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 528, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 503: Service Unavailable
You can use tor as a SOCKS proxy.
Start tor with:
tor SOCKSPort 1080
And use the SOCKS 5 proxy at 127.0.0.1:1080.
Also check this question on how to use a SOCKS proxy with urllib2.
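A minimal sketch of that approach, assuming the third-party SocksiPy module (installable as PySocks) and Tor listening on 127.0.0.1:1080 as above:
import socks
import socket
import urllib2

# Route every new socket through Tor's SOCKS5 listener
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 1080)
socket.socket = socks.socksocket

# HTTPS works too, because the tunnel sits at the socket level
print urllib2.urlopen("https://check.torproject.org/").read()
Note that this monkey-patches the socket module globally, so every connection made by the process goes through Tor.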
[Windows 7 64 bit; Python 2.7]
If I try to use urllib2, I get this error:
Traceback (most recent call last):
File "C:\Users\cYanide\Documents\Python Challenge\1\1.py", line 7, in <module>
response = urllib2.urlopen('http://python.org/')
File "C:\Python27\lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 400, in open
response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 513, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 438, in error
return self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 372, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 521, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 407: Proxy Authentication Required
Now, I'm behind a college proxy which requires authentication, so that's probably why this is happening. But isn't urllib2 supposed to pull the authentication and proxy information from the system settings?
I understand there's some extra code I can insert to 'hardcode' the proxy information into the program, but I really don't want to do that unless it's a last resort. It would hinder the program's portability across college computers with different authentication IDs and passwords.
Your program should see the environment variables that are set in Windows, so define these two:
HTTP_PROXY = http://username:password@proxyserver.domain.com
HTTPS_PROXY = https://username:password@proxyserver.domain.com
Then run your script; it should pick up the proper credentials and proceed with the connection.
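A quick sketch to check that Python actually sees those variables (the URL is just an example):
import urllib
import urllib2

# getproxies() reads HTTP_PROXY / HTTPS_PROXY from the environment;
# urllib2's default opener uses the same settings via ProxyHandler
print urllib.getproxies()
print urllib2.urlopen("http://python.org/").read(200)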
I want to access a web page from a Python script. The URL is: http://www.idealo.de/preisvergleich/Shop/27039.html
When I access it through a web browser it works fine, but when I try to access it with urllib2:
a = urllib2.urlopen("http://www.idealo.de/preisvergleich/Shop/27039.html")
It gives me the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 406, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 444, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
I also tried to access it with wget:
wget http://www.idealo.de/preisvergleich/Shop/27039.html
The error is:
--2012-04-23 12:42:03-- http://www.idealo.de/preisvergleich/Shop/27039.html
Resolving www.idealo.de (www.idealo.de)... 62.146.49.133
Connecting to www.idealo.de (www.idealo.de)|62.146.49.133|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2012-04-23 12:42:03 ERROR 403: Forbidden.
Can anyone explain why this happens, and how I can access the page using Python?
They're blocking some user agents. If you try with the following:
wget -U "Mozilla/5.0" http://www.idealo.de/preisvergleich/Shop/27039.html
it works. So you have to find a way to fake the user agent in your Python code as well.
Try this:
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]  # present a browser-like user agent
a = opener.open("http://www.idealo.de/preisvergleich/Shop/27039.html")
The code looks like:
url ="http://www.example.com"
for a in range(0,10):
opener = urllib2.build_opener()
urllib2.install_opener(opener)
postdata ="info=123456"+str(a)
urllib2.urlopen(url, postdata)
which just posts some data to a specific URL (e.g. http://www.example.com). However, I always get this error message:
Traceback (most recent call last):
File "test.py", line 9, in <module>
urllib2.urlopen(url, postdata)
File "c:\Python26\lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "c:\Python26\lib\urllib2.py", line 397, in open
response = meth(req, response)
File "c:\Python26\lib\urllib2.py", line 510, in http_response
'http', request, response, code, msg, hdrs)
File "c:\Python26\lib\urllib2.py", line 435, in error
return self._call_chain(*args)
File "c:\Python26\lib\urllib2.py", line 369, in _call_chain
result = func(*args)
File "c:\Python26\lib\urllib2.py", line 518, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 500: Internal Server Error
I am sure the site is working, so how can I fix the problem? Any help would be greatly appreciated.
You say you're sure the site is working, yet it returns an error. Try doing whatever you did to determine that the site is working while running a network logger like Wireshark, then run your test program and check whether the two are really issuing the same requests. If not, you've found the problem.
Otherwise, take a look at the server's logs; a much more descriptive error message should be found there. If it's not your server, consider asking whoever owns it.
Some websites don't accept requests from urllib's default User-Agent. Try changing it.
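For example, a sketch of the POST from the question with a browser-like User-Agent (the URL and payload are the question's own placeholders):
import urllib2

url = "http://www.example.com"
postdata = "info=1234560"

# Build the request with an explicit User-Agent instead of urllib2's default
req = urllib2.Request(url, postdata, {"User-Agent": "Mozilla/5.0"})
print urllib2.urlopen(req).read()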
I have completed a program that is outlined in chapter 3 of Head First Programming.
Basically, the program searches a website and stores the price on that page. Then depending on which option the user selects, a certain message will be sent to the user's twitter account.
Source code from book's website: http://headfirstlabs.com/books/hfprog/chapter03/page108.py
I get the same error whether I run my own program or the source code from the book's website.
Here is the error:
Traceback (most recent call last):
File "C:\Users\Krysten\Desktop\Ch3.py", line 28, in <module>
send_to_twitter(get_price())
File "C:\Users\Krysten\Desktop\Ch3.py", line 14, in send_to_twitter
resp = urllib.request.urlopen("http://twitter.com/statuses/update.json", params)
File "C:\Python31\lib\urllib\request.py", line 121, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python31\lib\urllib\request.py", line 356, in open
response = meth(req, response)
File "C:\Python31\lib\urllib\request.py", line 468, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python31\lib\urllib\request.py", line 394, in error
return self._call_chain(*args)
File "C:\Python31\lib\urllib\request.py", line 328, in _call_chain
result = func(*args)
File "C:\Python31\lib\urllib\request.py", line 476, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 401: Unauthorized
Is the error being caused because the book is somewhat outdated and twitter has to be accessed in a different way?
Basic authentication is deprecated for most of the Twitter API. Use the OAuth API instead.
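A minimal sketch using the third-party tweepy library; the credential strings are placeholders you would get by registering an application on Twitter's developer site:
import tweepy

# Placeholder OAuth credentials from your registered Twitter app
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

api = tweepy.API(auth)
api.update_status("Price alert!")  # replaces the basic-auth POST to update.json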