Accessing website using proxy in Python

I am not able to access a website through a proxy. Below is the code I used; I also tried it both for my application and for a "Multiple Circuit Tor Solution". Hopefully I will get help soon.
import urllib2
proxy_support = urllib2.ProxyHandler({'http':'80.82.69.72:3128'})
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler(debuglevel=1))
url_set_cookie = 'my_website_address'
req = urllib2.Request(url_set_cookie)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14')
opener.open(req)
I get this error:
URLError: <urlopen error [Errno 110] Connection timed out>

Try this:
import urllib2
proxy = urllib2.ProxyHandler({'http': '0.0.0.0:9090'})  # replace with a working proxy
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)  # the installed opener is used by urllib2.urlopen below
response = urllib2.urlopen('http://www.google.com/')
datum = response.read().decode("UTF-8")
response.close()
print datum
See if that helps.
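For what it's worth, Errno 110 usually means the proxy itself is unreachable rather than anything being wrong with the code. A minimal sketch (reusing the proxy address from the question, which may well be dead) that adds a timeout so a dead proxy fails fast instead of hanging:
import urllib2
proxy_support = urllib2.ProxyHandler({'http': '80.82.69.72:3128'})  # placeholder; swap in a proxy you know is alive
opener = urllib2.build_opener(proxy_support)
try:
    # timeout is in seconds; without it a dead proxy blocks until the OS gives up
    response = opener.open('http://example.com/', timeout=10)
    print response.read()
except urllib2.URLError as e:
    print 'proxy failed:', e.reason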

Related

urllib.request.urlopen not working for a specific website

I used urllib.request.Request to build a request for a memidex.com page, but the urllib.request.urlopen(url) line fails to open it.
import urllib.request
from bs4 import BeautifulSoup
url = urllib.request.Request("http://www.memidex.com/" + term)
my_request = urllib.request.urlopen(url)
info = BeautifulSoup(my_request, "html.parser")
I've tried the same code on a different website and it worked there, so I have no idea why it's not working for memidex.com.
You need to add headers to your URL request in order to overcome the error. BTW, 'HTTP Error 403: Forbidden' was your error, right?
Hope the below code helps you.
import urllib.request
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "http://www.memidex.com/"
headers = {'User-Agent': user_agent}
request = urllib.request.Request(url, None, headers)
response = urllib.request.urlopen(request)
data = response.read()
print(data)
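If it helps, the fetched bytes can then go straight into BeautifulSoup, as the question intended; a short sketch, assuming bs4 is installed:
from bs4 import BeautifulSoup
info = BeautifulSoup(data, "html.parser")  # 'data' is the bytes read above
print(info.title)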

urllib redirect error

I'm trying to scrape tables using urllib and BeautifulSoup, and I get the error:
"urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop. The last 30x error message was: Found"
I've heard that this is related to the site requiring cookies, but I still get this error after my 2nd attempt:
import urllib.request
from bs4 import BeautifulSoup
import re
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
file = opener.open(testURL).read().decode()
soup = BeautifulSoup(file)
tables = soup.find_all('tr',{'style': re.compile("color:#4A3C8C")})
print(tables)
A few suggestions:
Use HTTPCookieProcessor if you must handle cookies.
You don't have to use a custom User-Agent, but if you want to simulate Mozilla you'll have to use its full name. This site won't accept 'Mozilla/5.0' and will keep redirecting.
You can catch such exceptions with HTTPError.
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor())
user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:54.0) Gecko/20100101 Firefox/54.0'
opener.addheaders = [('user-agent', user_agent)]
try:
    response = opener.open(testURL)
except urllib.error.HTTPError as e:
    print(e)
except Exception as e:
    print(e)
else:
    file = response.read().decode()
    soup = BeautifulSoup(file, 'html.parser')
    ... etc ...

HTTP Error 403: Forbidden with urlretrieve

I am trying to download a PDF; however, I get the following error: HTTP Error 403: Forbidden
I am aware that the server is blocking it for whatever reason, but I can't seem to find a solution.
import urllib.request
import urllib.parse
import requests

def download_pdf(url):
    full_name = "Test.pdf"
    urllib.request.urlretrieve(url, full_name)

try:
    url = 'http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/Mathematics%20(0580)/0580_s03_qp_1.pdf'
    print('initialized')
    hdr = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36',
        'Content-Length': '136963',
    }
    print('HDR received')
    req = urllib.request.Request(url, headers=hdr)
    print('Header sent')
    resp = urllib.request.urlopen(req)
    print('Request sent')
    respData = resp.read()
    download_pdf(url)
    print('Complete')
except Exception as e:
    print(str(e))
You seem to have already realised this: the remote server is apparently checking the User-Agent header and rejecting requests from Python's urllib. urllib.request.urlretrieve() doesn't let you change the HTTP headers; however, you can use urllib.request.URLopener.retrieve():
import urllib.request
opener = urllib.request.URLopener()
opener.addheader('User-Agent', 'whatever')
filename, headers = opener.retrieve(url, 'Test.pdf')
N.B. You are using Python 3, and these functions are now considered part of the "Legacy interface"; URLopener has even been deprecated. For that reason you should not use them in new code.
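For completeness, a sketch of the non-deprecated way to do the same thing: pass the header via urllib.request.Request and write the body out yourself:
import urllib.request
req = urllib.request.Request(url, headers={'User-Agent': 'whatever'})
with urllib.request.urlopen(req) as resp, open('Test.pdf', 'wb') as out:
    out.write(resp.read())  # write the PDF bytes to disk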
The above aside, you are going to a lot of trouble simply to access a URL. Your code imports requests but never uses it; you should use it, though, because it is much easier than urllib. This works for me:
import requests
url = 'http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/Mathematics%20(0580)/0580_s03_qp_1.pdf'
r = requests.get(url)
with open('0580_s03_qp_1.pdf', 'wb') as outfile:
    outfile.write(r.content)
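Should a server reject requests' default User-Agent as well, the same header trick applies; a sketch:
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
r.raise_for_status()  # turns a 403 into an exception instead of a broken file on disk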

urllib2.URLError: <urlopen error Tunnel connection failed: 403 Tunnel or SSL Forbidden>

Basically, I'm trying to use Python's urllib2 to connect to a site and fetch data from it. The problem is that I get the error
urllib2.URLError: <urlopen error Tunnel connection failed: 403 Tunnel or SSL Forbidden>
After repeating my experiments with this library, I found that the code I had written worked well with https:// sites but not with http:// sites. I read a few earlier questions on Stack Overflow suggesting I add the header User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7 (to spoof a browser).
I did that, but it still failed.
After that I read this urllib2.HTTPError: HTTP Error 403: Forbidden
I tried that as well but that didn't work.
Here's my code:
import urllib2
url = "http://the_site_i_want_to_connect"
hdr = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}
req = urllib2.Request(url, headers=hdr)
p = urllib2.urlopen(req).read()
print p
PS: As I said, this works fine with https.
Please help!
Thanks in advance!
This error looks like an issue with your proxy settings; please refer to this blog.
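As a rough sketch of what checking your proxy settings means in code: you can see which proxies urllib2 picks up from the environment and, if one is interfering, bypass it with an empty ProxyHandler (whether this fixes the 403 depends on your network):
import urllib
import urllib2
print urllib.getproxies()  # proxies inherited from http_proxy/https_proxy etc.
opener = urllib2.build_opener(urllib2.ProxyHandler({}))  # an empty dict disables proxy use
print opener.open("http://the_site_i_want_to_connect").read()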

Python - urllib2 & cookielib

I am trying to open the following website, retrieve the initial cookie and use it for the second urlopen, BUT if you run the following code it outputs 2 different cookies. How do I use the initial cookie for the second urlopen?
import cookielib, urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
home = opener.open('https://www.idcourts.us/repository/start.do')
print cj
search = opener.open('https://www.idcourts.us/repository/partySearch.do')
print cj
The output shows 2 different cookies every time, as you can see:
<cookielib.CookieJar[<Cookie JSESSIONID=0DEEE8331DE7D0DFDC22E860E065085F for www.idcourts.us/repository>]>
<cookielib.CookieJar[<Cookie JSESSIONID=E01C2BE8323632A32DA467F8A9B22A51 for www.idcourts.us/repository>]>
This is not a problem with urllib. That site does some funky stuff. You need to request a couple of stylesheets for it to validate your session id:
import cookielib, urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# default User-Agent ('Python-urllib/2.6') will *not* work
opener.addheaders = [
    ('User-Agent', 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.11) Gecko/20101012 Firefox/3.6.11'),
]
stylesheets = [
    'https://www.idcourts.us/repository/css/id_style.css',
    'https://www.idcourts.us/repository/css/id_print.css',
]
home = opener.open('https://www.idcourts.us/repository/start.do')
print cj
sessid = cj._cookies['www.idcourts.us']['/repository']['JSESSIONID'].value
# Note the +=
opener.addheaders += [
    ('Referer', 'https://www.idcourts.us/repository/start.do'),
]
for st in stylesheets:
    # da trick
    opener.open(st + ';jsessionid=' + sessid)
search = opener.open('https://www.idcourts.us/repository/partySearch.do')
print cj
# perhaps need to keep updating the referer...
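Background on why the trick works: appending ;jsessionid=... to a URL is the standard Java-servlet URL-rewriting fallback for clients without cookies, which is presumably how this server decides the session belongs to a real browser-like client.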
Not an actual answer (but far too long for a comment); possibly useful to anyone else trying to answer this.
Despite my best attempts, I can't figure this out.
Looking in Firebug, the cookie seems to remain the same (works properly) for Firefox.
I added urllib2.HTTPSHandler(debuglevel=1) to debug what headers Python is sending, and it does appear to resend the cookie.
I also added all the Firefox request headers to see if that would help (it didn't):
opener.addheaders = [
    ('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13'),
    ..
]
My test code:
import cookielib, urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj), urllib2.HTTPSHandler(debuglevel=1))
opener.addheaders = [
    ('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
    ('Accept-Language', 'en-gb,en;q=0.5'),
    ('Accept-Encoding', 'gzip,deflate'),
    ('Accept-Charset', 'ISO-8859-1,utf-8;q=0.7,*;q=0.7'),
    ('Keep-Alive', '115'),
    ('Connection', 'keep-alive'),
    ('Cache-Control', 'max-age=0'),
    ('Referer', 'https://www.idcourts.us/repository/partySearch.do'),
]
home = opener.open('https://www.idcourts.us/repository/start.do')
print cj
search = opener.open('https://www.idcourts.us/repository/partySearch.do')
print cj
I feel like I'm missing something obvious.
I think it is a problem with the server; it is setting a new cookie for each request.
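One way to confirm that from the client side, as a sketch: print each response's Set-Cookie header (using the opener from the test code above) and compare the JSESSIONID values:
home = opener.open('https://www.idcourts.us/repository/start.do')
print home.info().getheader('Set-Cookie')
search = opener.open('https://www.idcourts.us/repository/partySearch.do')
print search.info().getheader('Set-Cookie')  # a fresh JSESSIONID here means the server resets it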
