Grab and Rotate proxies in Python - python

I'm creating a bot to register through a web form. I need to grab proxies from certain websites like proxyscrap and download them into a text file.
Then the Python script runs the browser and connects through the fastest working proxy first.
Can I do that?
If yes, please provide me with the code.

I have another method that is free, unlike proxyscrap: I suggest using http-request-randomizer (see the linked project page).
Using it is very simple:
from http_request_randomizer.requests.proxy import RequestProxy

req_proxy = RequestProxy()           # set up the proxy randomizer
pro = req_proxy.randomize_proxy()    # pick a random proxy from its list
pro1 = str(pro).split(' ')[0]        # keep only the first field (the proxy address)
Then use pro1 as the proxy for your request; it will give you a new proxy each time. If you need to test a custom proxy instead, you can use a helper like this:
import urllib2

def is_bad_proxy(pip):
    try:
        proxy_handler = urllib2.ProxyHandler({'http': pip})
        opener = urllib2.build_opener(proxy_handler)
        opener.addheaders = [('User-agent', 'Mozilla/5.0')]
        urllib2.install_opener(opener)
        req = urllib2.Request('http://www.example.com')  # change the URL to test here
        sock = urllib2.urlopen(req)
    except urllib2.HTTPError, e:
        print 'Error code: ', e.code
        return e.code
    except Exception, detail:
        print "ERROR:", detail
        return True
    return False
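To tie this back to the original question (proxies saved to a text file, fastest working one first), here is a minimal sketch that reuses is_bad_proxy above. The file name proxies.txt, the one-address-per-line format, and the timing approach are my assumptions, not part of the original answer.

import time

def fastest_good_proxy(path='proxies.txt'):
    # assumes one ip:port per line in the file (hypothetical format);
    # returns the working proxy that answered fastest, or None
    best = None
    best_time = None
    for line in open(path):
        pip = line.strip()
        if not pip:
            continue
        start = time.time()
        if is_bad_proxy(pip):  # helper defined above
            continue
        elapsed = time.time() - start
        if best_time is None or elapsed < best_time:
            best, best_time = pip, elapsed
    return best

print fastest_good_proxy()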

Related

Can't catch exceptions with urllib2

I have a script printing out the response from an API, but I can't seem to catch any exceptions. I think I've gone through every question asked on this topic without any luck.
How can I check whether the script will catch any errors/exceptions?
I'm testing the script on a site I know returns 403 Forbidden, but it doesn't show.
My script:
import urllib2

url_se = 'http://www.example.com'

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'API to File')]

try:
    request = opener.open(url_se)
except urllib2.HTTPError, e:
    print e.code
except urllib2.URLError, e:
    print e.args
except Exception:
    import traceback
    print 'Generic exception ' + traceback.format_exc()

response = request.read()
print response
Is this the right approach? What's the best practice for catching exceptions with urllib2?
There is a bug in your program. If any exception occurs in the try block, the variable request is undefined when the response = request.read() line runs.
Correct it as follows:
import urllib2

url_se = 'http://www.example.com'

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'API to File')]

try:
    request = opener.open(url_se)
    response = request.read()
    print response
except urllib2.HTTPError, e:
    print e.code
except urllib2.URLError, e:
    print e.args
except Exception as e:
    import traceback
    print 'Generic exception ' + traceback.format_exc()
Test it on your machine and you will surely see the exceptions.
Catching exceptions individually only makes sense if you are doing something specific with them; otherwise, if you only want to log the exceptions, a universal except block will do the job (see the sketch below).
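For reference, a minimal sketch of such a universal block, reusing the same URL and headers from the question:

import traceback
import urllib2

url_se = 'http://www.example.com'
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'API to File')]

try:
    request = opener.open(url_se)
    response = request.read()
    print response
except Exception:
    # one catch-all block: just log the full traceback
    print 'Request failed:'
    print traceback.format_exc()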

Python downloading hyperlink with urllib2

Still new to urllib2, so sorry if this is a basic question, but I couldn't find an answer anywhere that made sense. I tried to make a basic web scraper on the Raspberry Pi with Python 2.7.9 that finds a URL in the site and downloads the file. However, I keep getting HTTP 503 errors, and even if I loop over and retry, it never goes through.
import cfscrape
import urllib2

def find_hyperlink(text, first, last):
    try:
        start = text.index(first) + len(first)
        end = text.index(last, start)
        return text[start:end]
    except ValueError:
        return ""

def download_hyperlink(name):
    scraper = cfscrape.create_scraper()
    r = scraper.get(name).content
    link = find_hyperlink(r, 'Download (Save as...): <a href="', '">')
    print link
    opener = urllib2.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
    try:
        response = opener.open(link)
        print 'downloading episode'
    except urllib2.HTTPError as e:
        if e.code == 503:
            download_hyperlink(name)
        else:
            raise

download_hyperlink('http://www.kiss-anime.me/Anime-sword-art-online-episode-1')
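No answer appears here, but one hedged observation: the 503 responses are typically the Cloudflare challenge page, and the plain urllib2 opener does not carry the clearance cookies that the cfscrape session already obtained. A sketch of one possible fix, under that assumption, is to do the download with the same scraper session instead of urllib2 (the output file name is made up for illustration):

def download_with_scraper(page_url, out_path='episode.mp4'):
    # reuse one cfscrape session so the Cloudflare clearance cookies
    # are sent with the file request as well
    scraper = cfscrape.create_scraper()
    html = scraper.get(page_url).content
    link = find_hyperlink(html, 'Download (Save as...): <a href="', '">')
    if not link:
        return False
    with open(out_path, 'wb') as f:
        f.write(scraper.get(link).content)
    return True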

Python - How do I save/load specific cookies?

I have the following Python code that is supposed to log into a website using the .ROBLOSECURITY cookie. It also includes an except IOError: clause so that if the .ROBLOSECURITY cookie doesn't log in, it will use a username/password to log in AND it will save the cookies it gets from that.
import urllib2
import urllib
import cookielib
import re

try:
    cookielib.LWPCookieJar().load("cookies.txt")  # Trying to load the cookie file
except IOError:  # In case the cookies.txt fails to log in. I don't know if IOError is the correct error specification for an expired cookie
    print "Loading stored cookies failed, manually logging in..."
    cj = cookielib.LWPCookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    urllib2.install_opener(opener)
    authentication_url = 'https://www.roblox.com/newlogin'
    payload = {
        'username': 'UsernameHere',
        'password': 'PasswordHere',
        '': 'Log In',
    }
    data = urllib.urlencode(payload)
    req = urllib2.Request(authentication_url, data)
    resp = urllib2.urlopen(req)
    cj.save("cookies.txt")  # Saving to the cookie file

tc = urllib2.urlopen("http://www.roblox.com/My/Money.aspx").read()  # The hidden page
checksession = re.findall('My Transactions', tc)  # Just checking for text that is only found on the hidden page
print checksession
I think that cookielib.LWPCookieJar().load("cookies.txt") is not working because it is also loading cookies other than .ROBLOSECURITY (which I know logs in if you use only that one). How do I either load ONLY the .ROBLOSECURITY cookie or save ONLY the .ROBLOSECURITY cookie (so that other cookies don't interfere with .ROBLOSECURITY logging in)?
Also, I'm not sure if my except IOError: will function correctly, because I only know that it works if I change cookielib.LWPCookieJar().load("cookies.txt") to cookielib.MozillaCookieJar().load("cookies.txt").
Lastly, how can I change my .ROBLOSECURITY's expiration date to something like 2050-12-31 24:00:00Z?
So here was my final code:
import urllib  # Idk if this is necessary, but I'm not gonna bother checking. Only time I used it was for urllib.urlencode, which idk if urllib2 can do
import urllib2
import cookielib

try:
    cj = cookielib.MozillaCookieJar("./cookies.txt")
    cj.load()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    tc = opener.open("http://www.roblox.com/My/Money.aspx").read()
    # print "Loading stored cookies succeeded, automatically logging in..."
    # checksession = re.findall('My Transactions',tc)
    # if len(checksession) > 0:
    #     print "Login success!"
except IOError:
    print "Loading stored cookies failed, manually logging in..."
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    urllib2.install_opener(opener)
    authentication_url = 'https://www.roblox.com/newlogin'
    payload = {
        'username': 'USERNAMEHERE',
        'password': 'PASSWORDHERE',
        '': 'Log In',  # In hindsight Idk if this is necessary, but I don't feel like checking now
    }
    data = urllib.urlencode(payload)
    req = urllib2.Request(authentication_url, data)
    resp = urllib2.urlopen(req)
    cj.save("./cookies.txt")  # ./ to make sure its in the right directory
    tc = opener.open("http://www.roblox.com/My/Money.aspx").read()  # Only accessible if logged in
    # checksession = re.findall('My Transactions',tc)  # The rest is to check if log in was success by looking for specific phrase in the page
    # if len(checksession) > 0:
    #     print "Login success!"

Python check if website exists

I wanted to check whether a certain website exists; this is what I'm doing:
import urllib2

user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
link = "http://www.abc.com"
req = urllib2.Request(link, headers=headers)
page = urllib2.urlopen(req).read()  # ERROR 402 generated here!
If the page doesn't exist (error 402, or whatever other errors), what can I do in the page = ... line to make sure that the page I'm reading does exist?
You can use a HEAD request instead of GET. It will only download the headers, not the content. Then you can check the response status.
For Python 2.7.x, you can use httplib:
import httplib

c = httplib.HTTPConnection('www.example.com')
c.request("HEAD", '')
if c.getresponse().status == 200:
    print('web site exists')
or urllib2:
import urllib2

try:
    urllib2.urlopen('http://www.example.com/some_page')
except urllib2.HTTPError, e:
    print(e.code)
except urllib2.URLError, e:
    print(e.args)
or for 2.7 and 3.x, you can install requests
import requests

response = requests.get('http://www.example.com')
if response.status_code == 200:
    print('Web site exists')
else:
    print('Web site does not exist')
It's better to check that the status code is < 400, as was done here. Here is what the status codes mean (taken from Wikipedia):
1xx - informational
2xx - success
3xx - redirection
4xx - client error
5xx - server error
If you want to check whether a page exists and don't want to download the whole page, you should use a HEAD request:
import httplib2
h = httplib2.Http()
resp = h.request("http://www.google.com", 'HEAD')
assert int(resp[0]['status']) < 400
taken from this answer.
If you want to download the whole page, just make a normal request and check the status code. Example using requests:
import requests
response = requests.get('http://google.com')
assert response.status_code < 400
See also similar topics:
Python script to see if a web page exists without downloading the whole page?
Checking whether a link is dead or not using Python without downloading the webpage
How do you send a HEAD HTTP request in Python 2?
Making HTTP HEAD request with urllib2 from Python 2
from urllib2 import Request, urlopen, HTTPError, URLError

user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
link = "http://www.abc.com/"
req = Request(link, headers=headers)

try:
    page_open = urlopen(req)
except HTTPError, e:
    print e.code
except URLError, e:
    print e.reason
else:
    print 'ok'
To answer the comment of unutbu:
Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.
Source
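For illustration, a small sketch of that behaviour (www.example.com is just a stand-in for any URL that redirects):

import urllib2

# HTTPRedirectHandler is installed by default, so a 3xx answer is
# followed transparently and only the final response surfaces here
resp = urllib2.urlopen('http://www.example.com')
print resp.getcode()   # usually 200 after any redirects
print resp.geturl()    # the URL that was actually fetched in the end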
There is an excellent answer provided by @Adem Öztaş for use with httplib and urllib2. For requests, if the question is strictly about resource existence, the answer can be improved for the case of large resources.
The previous answer for requests suggested something like the following:
import requests

def uri_exists_get(uri: str) -> bool:
    try:
        response = requests.get(uri)
        try:
            response.raise_for_status()
            return True
        except requests.exceptions.HTTPError:
            return False
    except requests.exceptions.ConnectionError:
        return False
requests.get attempts to pull the entire resource at once, so for large media files, the above snippet would attempt to pull the entire media into memory. To solve this, we can stream the response.
import requests

def uri_exists_stream(uri: str) -> bool:
    try:
        with requests.get(uri, stream=True) as response:
            try:
                response.raise_for_status()
                return True
            except requests.exceptions.HTTPError:
                return False
    except requests.exceptions.ConnectionError:
        return False
I ran the above snippets with timers attached against two web resources:
1) http://bbb3d.renderfarming.net/download.html, a very light html page
2) http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4, a decently sized video file
Timing results below:
uri_exists_get("http://bbb3d.renderfarming.net/download.html")
# Completed in: 0:00:00.611239
uri_exists_stream("http://bbb3d.renderfarming.net/download.html")
# Completed in: 0:00:00.000007
uri_exists_get("http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4")
# Completed in: 0:01:12.813224
uri_exists_stream("http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4")
# Completed in: 0:00:00.000007
As a last note: this function also works in the case that the resource host doesn't exist. For example "http://abcdefghblahblah.com/test.mp4" will return False.
I see many answers that use requests.get, but I suggest this solution using only requests.head, which is faster and also kinder to the web server since it doesn't need to send back the body.
import requests

def check_url_exists(url: str):
    """
    Checks if a url exists
    :param url: url to check
    :return: True if the url exists, false otherwise.
    """
    return requests.head(url, allow_redirects=True).status_code == 200
The meta-information contained in the HTTP headers in response to a HEAD request should be identical to the information sent in response to a GET request.
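A quick usage example (the URLs are placeholders):

print(check_url_exists('http://www.example.com'))          # True for a reachable page
print(check_url_exists('http://www.example.com/missing'))  # False for anything that is not a 200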
code:
import urllib

a = "http://www.example.com"
try:
    print urllib.urlopen(a)
except:
    print a + " site does not exist"
You can simply use the stream option to avoid downloading the full file. In the latest Python 3 you won't get urllib2, so it's best to use the proven requests library. This simple function will solve your problem.
import requests

def uri_exists(url):
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        return True
    else:
        return False
import urllib.request
from urllib.error import HTTPError, URLError

def isok(mypath):
    try:
        thepage = urllib.request.urlopen(mypath)
    except HTTPError as e:
        return 0
    except URLError as e:
        return 0
    else:
        return 1
Try this one:
import urllib2

website = 'https://www.allyourmusic.com'
try:
    response = urllib2.urlopen(website)
    if response.code == 200:
        print("site exists!")
    else:
        print("site doesn't exist!")
except urllib2.HTTPError, e:
    print(e.code)
except urllib2.URLError, e:
    print(e.args)

Python form POST using urllib2 (also question on saving/using cookies)

I am trying to write a function to post form data and save returned cookie info in a file so that the next time the page is visited, the cookie information is sent to the server (i.e. normal browser behavior).
I wrote this relatively easily in C++ using libcurl, but have spent almost an entire day trying to write this in Python using urllib2, and still no success.
This is what I have so far:
import urllib, urllib2
import logging

# the path and filename to save your cookies in
COOKIEFILE = 'cookies.lwp'

cj = None
ClientCookie = None
cookielib = None
logger = logging.getLogger(__name__)

# Let's see if cookielib is available
try:
    import cookielib
except ImportError:
    logger.debug('importing cookielib failed. Trying ClientCookie')
    try:
        import ClientCookie
    except ImportError:
        logger.debug('ClientCookie isn\'t available either')
        urlopen = urllib2.urlopen
        Request = urllib2.Request
    else:
        logger.debug('imported ClientCookie succesfully')
        urlopen = ClientCookie.urlopen
        Request = ClientCookie.Request
        cj = ClientCookie.LWPCookieJar()
else:
    logger.debug('Successfully imported cookielib')
    urlopen = urllib2.urlopen
    Request = urllib2.Request
    # This is a subclass of FileCookieJar
    # that has useful load and save methods
    cj = cookielib.LWPCookieJar()

login_params = {'name': 'anon', 'password': 'pass'}

def login(theurl, login_params):
    init_cookies();
    data = urllib.urlencode(login_params)
    txheaders = {'User-agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    try:
        # create a request object
        req = Request(theurl, data, txheaders)
        # and open it to return a handle on the url
        handle = urlopen(req)
    except IOError, e:
        log.debug('Failed to open "%s".' % theurl)
        if hasattr(e, 'code'):
            log.debug('Failed with error code - %s.' % e.code)
        elif hasattr(e, 'reason'):
            log.debug("The error object has the following 'reason' attribute :" + e.reason)
        sys.exit()
    else:
        if cj is None:
            log.debug('We don\'t have a cookie library available - sorry.')
        else:
            print 'These are the cookies we have received so far :'
            for index, cookie in enumerate(cj):
                print index, ' : ', cookie
            # save the cookies again
            cj.save(COOKIEFILE)
        # return the data
        return handle.read()

# FIXME: I need to fix this so that it takes into account any cookie data we may have stored
def get_page(*args, **query):
    if len(args) != 1:
        raise ValueError(
            "post_page() takes exactly 1 argument (%d given)" % len(args)
        )
    url = args[0]
    query = urllib.urlencode(list(query.iteritems()))
    if not url.endswith('/') and query:
        url += '/'
    if query:
        url += "?" + query
    resource = urllib.urlopen(url)
    logger.debug('GET url "%s" => "%s", code %d' % (url,
                                                    resource.url,
                                                    resource.code))
    return resource.read()
When I attempt to log in, I pass the correct username and password, yet the login fails and no cookie data is saved.
My two questions are:
Can anyone see what's wrong with the login() function, and how may I fix it?
How may I modify the get_page() function to make use of any cookie info I have saved?
There are quite a few problems with the code that you've posted. Typically you'll want to build a custom opener which can handle redirects, HTTPS, etc., otherwise you'll run into trouble. As far as the cookies themselves go, you need to call the load and save methods on your cookie jar, and use one of its subclasses, such as MozillaCookieJar or LWPCookieJar.
Here's a class I wrote to log in to Facebook, back when I was playing silly web games. I just modified it to use a file-based cookie jar rather than an in-memory one.
import cookielib
import os
import urllib
import urllib2

# set these to whatever your fb account is
fb_username = "your#facebook.login"
fb_password = "secretpassword"

cookie_filename = "facebook.cookies"

class WebGamePlayer(object):

    def __init__(self, login, password):
        """ Start up... """
        self.login = login
        self.password = password

        self.cj = cookielib.MozillaCookieJar(cookie_filename)
        if os.access(cookie_filename, os.F_OK):
            self.cj.load()
        self.opener = urllib2.build_opener(
            urllib2.HTTPRedirectHandler(),
            urllib2.HTTPHandler(debuglevel=0),
            urllib2.HTTPSHandler(debuglevel=0),
            urllib2.HTTPCookieProcessor(self.cj)
        )
        self.opener.addheaders = [
            ('User-agent', ('Mozilla/4.0 (compatible; MSIE 6.0; '
                            'Windows NT 5.2; .NET CLR 1.1.4322)'))
        ]

        # need this twice - once to set cookies, once to log in...
        self.loginToFacebook()
        self.loginToFacebook()
        self.cj.save()

    def loginToFacebook(self):
        """
        Handle login. This should populate our cookie jar.
        """
        login_data = urllib.urlencode({
            'email': self.login,
            'pass': self.password,
        })
        response = self.opener.open("https://login.facebook.com/login.php", login_data)
        return ''.join(response.readlines())

test = WebGamePlayer(fb_username, fb_password)
After you've set your username and password, you should see a file, facebook.cookies, with your cookies in it. In practice you'll probably want to modify it to check whether you have an active cookie and use that, then log in again if access is denied.
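A rough sketch of that check, as a method you could add to the class above; the probe URL and the "logged-out" markers are assumptions, not part of the original code:

def ensureLoggedIn(self):
    # hit a page that needs a session; if we get bounced back to the
    # login form, the stored cookies are stale, so log in again
    response = self.opener.open("http://www.facebook.com/home.php")
    html = ''.join(response.readlines())
    if 'login' in response.geturl() or 'name="pass"' in html:
        self.loginToFacebook()
        self.cj.save()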
If you are having a hard time making your POST requests to work (like I had with a login form), it definitely pays to quickly install the Live HTTP headers extension to Firefox (http://livehttpheaders.mozdev.org/index.html). This small extension can, among other things, show you the exact POST data that are sent when you manually log in.
In my case, I had banged my head against the wall for hours because the site insisted on an extra field with 'action=login' (doh!).
Please use ignore_discard and ignore_expires while saving the cookie; in my case it saved OK.
self.cj.save(cookie_file, ignore_discard=True, ignore_expires=True)
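The matching load call accepts the same flags, so session cookies saved this way can be read back as well:

self.cj.load(cookie_file, ignore_discard=True, ignore_expires=True)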
