Still new to urllib2, so sorry if this is a basic question, but I couldn't find an answer anywhere that made sense. I'm trying to make a basic web scraper on the Raspberry Pi with Python 2.7.9 that finds a URL in the site and downloads the file, but I keep getting HTTP 503 errors; even if I loop over and retry, it never goes through.
import cfscrape
import urllib2

def find_hyperlink(text, first, last):
    try:
        start = text.index(first) + len(first)
        end = text.index(last, start)
        return text[start:end]
    except ValueError:
        return ""

def download_hyperlink(name):
    scraper = cfscrape.create_scraper()
    r = scraper.get(name).content
    link = find_hyperlink(r, 'Download (Save as...): <a href="', '">')
    print link
    opener = urllib2.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
    try:
        response = opener.open(link)
        print 'downloading episode'
    except urllib2.HTTPError as e:
        if e.code == 503:
            download_hyperlink(name)
        else:
            raise

download_hyperlink('http://www.kiss-anime.me/Anime-sword-art-online-episode-1')
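For what it's worth, a bounded retry loop with a pause usually behaves better against 503s than the unbounded recursion above; here is a minimal Python 3 sketch, where the attempt count and delay are arbitrary choices, not values from the question:

```python
import time
import urllib.request
import urllib.error

def open_with_retry(url, attempts=5, delay=2.0):
    """Open a URL, retrying a few times on HTTP 503 before giving up."""
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    for attempt in range(attempts):
        try:
            return urllib.request.urlopen(req, timeout=30)
        except urllib.error.HTTPError as e:
            if e.code != 503 or attempt == attempts - 1:
                raise  # not a transient error, or out of retries
            time.sleep(delay)  # give the server a moment to recover
```

Unlike the recursive version, this cannot blow the stack, and it stops after a fixed number of attempts instead of hammering the server forever.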
I am connected to the web via VPN and I would like to connect to a news site to grab, well, news. A library exists for this: FinNews. This is the code:
import FinNews as fn
cnbc_feed = fn.CNBC(topics=['finance', 'earnings'])
print(cnbc_feed.get_news())
print(cnbc_feed.possible_topics())
Now, because of the VPN, the connection won't work and it throws:
<urlopen error [WinError 10061] No connection could be made because
the target machine actively refused it ( client - server )
So I started separately to figure out how to make a connection work, and it does work (it prints "Connected"):
import urllib.request
from urllib.error import HTTPError, URLError

proxy = "user:pw@proxy:port"
proxies = {"http": "http://%s" % proxy}
url = "http://www.google.com/search?q=test"
headers = {'User-agent': 'Mozilla/5.0'}
try:
    proxy_support = urllib.request.ProxyHandler(proxies)
    opener = urllib.request.build_opener(proxy_support, urllib.request.HTTPHandler(debuglevel=1))
    urllib.request.install_opener(opener)
    req = urllib.request.Request(url, None, headers)
    html = urllib.request.urlopen(req).read()
    #print(html)
    print("Connected")
except (HTTPError, URLError) as err:
    print("No internet connection.")
Now I have figured out how to access the news and how to make a connection via the VPN proxy, but I can't bring the two together: I want to grab the news via the library through the VPN. I am fairly new to Python, so I guess I don't fully get the logic yet.
EDIT: I tried to combine it with feedparser, based on furas' hint:
import urllib.request
from urllib.error import HTTPError, URLError
import feedparser

proxy = "user:pw@proxy:port"
proxies = {"http": "http://%s" % proxy,
           "https": "http://%s" % proxy}  # the https feed URL needs an "https" mapping too
#url = "http://www.google.com/search?q=test"
#url = "http://www.reddit.com/r/python/.rss"
url = "https://timesofindia.indiatimes.com/rssfeedstopstories.cms"
headers = {'User-agent': 'Mozilla/5.0'}
try:
    proxy_support = urllib.request.ProxyHandler(proxies)
    opener = urllib.request.build_opener(proxy_support, urllib.request.HTTPHandler(debuglevel=1))
    urllib.request.install_opener(opener)
    req = urllib.request.Request(url, None, headers)
    html = urllib.request.urlopen(req).read()
    #print(html)
    #print("Connected")
    feed = feedparser.parse(html)
    #print(feed['feed']['link'])
    print("Number of RSS posts :", len(feed.entries))
    entry = feed.entries[1]
    print("Post Title :", entry.title)
except (HTTPError, URLError) as err:
    print("No internet connection.")
But I get the same error... this is a hard nut to crack.
May I ask for your advice? Thank you :)
I'm creating a bot to register in a web form, and I need to grab proxies from certain websites like ProxyScrape and download them into a text file.
Then the Python script runs the browser and connects to the fastest working proxy first.
Can I do that?
If yes, please provide me with the code.
I have another method that is free, unlike ProxyScrape: I suggest using http-request-randomizer from here Link.
Using it is simple:
from http_request_randomizer.requests.proxy import RequestProxy

req_proxy = RequestProxy()
pro = req_proxy.randomize_proxy()
pro1 = str(pro).split(' ')[0]
Then use pro1 as the proxy for your request; it will generate a new proxy each time. Or, if you need to check a custom proxy yourself:
import urllib2

def is_bad_proxy(pip):
    try:
        proxy_handler = urllib2.ProxyHandler({'http': pip})
        opener = urllib2.build_opener(proxy_handler)
        opener.addheaders = [('User-agent', 'Mozilla/5.0')]
        urllib2.install_opener(opener)
        req = urllib2.Request('http://www.example.com')  # change the URL to test here
        sock = urllib2.urlopen(req)
    except urllib2.HTTPError, e:
        print 'Error code: ', e.code
        return e.code
    except Exception, detail:
        print "ERROR:", detail
        return True
    return False
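Tying this back to the question, picking the first proxy that answers from a downloaded list is just a loop over a checker like the one above. A Python 3 sketch, where the test URL and timeout are arbitrary choices:

```python
import urllib.request
import urllib.error

def first_working_proxy(proxies, test_url='http://www.example.com', timeout=10):
    """Return the first proxy that can fetch test_url, or None if none work."""
    for pip in proxies:
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({'http': pip}))
        opener.addheaders = [('User-agent', 'Mozilla/5.0')]
        try:
            opener.open(test_url, timeout=timeout)
            return pip  # first responder wins; sort the list beforehand to prefer speed
        except (urllib.error.URLError, OSError):
            continue  # dead, refused, or timed out: try the next proxy
    return None
```

You would load the candidates with something like `proxies = open('proxies.txt').read().split()` (the filename is an assumption) and pass them in.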
I am looking to parse data from a large number of webpages (>10k) using Python, and I find that the function I have written often hits a timeout error every 500 loops or so. I have attempted to fix this with a try/except block, but I would like to improve the function so it re-attempts to open the URL four or five times before raising the error. Is there an elegant way to do this?
My code below:
def url_open(url):
    from urllib.request import Request, urlopen
    from urllib.error import HTTPError
    from bs4 import BeautifulSoup
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        s = urlopen(req, timeout=50).read()
    except HTTPError as e:
        if e.code == 404:
            print(str(e))
        else:
            print(str(e))
            s = urlopen(req, timeout=50).read()
            raise
    return BeautifulSoup(s, "lxml")
I've used a pattern like this for retrying in the past:
def url_open(url):
    from urllib.request import Request, urlopen
    from urllib.error import HTTPError
    from bs4 import BeautifulSoup
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    retrycount = 0
    s = None
    while s is None:
        try:
            s = urlopen(req, timeout=50).read()
        except HTTPError as e:
            print(str(e))
            if canRetry(e.code):
                retrycount += 1
                if retrycount > 5:
                    raise
                # time.sleep for a bit
            else:
                raise
    return BeautifulSoup(s, "lxml")
You just have to define canRetry somewhere else.
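One reasonable sketch of canRetry treats timeouts, rate limits, and server errors as transient; the exact set of codes is a judgment call, not something the original answer specified:

```python
def canRetry(code):
    # 408 Request Timeout and 429 Too Many Requests are transient by definition;
    # 5xx responses are server-side failures that may clear up on retry.
    # Other 4xx codes mean the request itself is wrong, so retrying won't help.
    return code in (408, 429) or 500 <= code < 600
```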
It would be great if someone could help me multi-thread this script and write the output to a text file.
I am really new to coding, so please help me out.
#!/usr/bin/python
from tornado import ioloop, httpclient
from BeautifulSoup import BeautifulSoup
from mechanize import Browser
import requests
import urllib2
import socket
import sys

def handle_request(response):
    print response.code
    global i
    i -= 1
    if i == 0:
        ioloop.IOLoop.instance().stop()

i = 0
http_client = httpclient.AsyncHTTPClient()
for url in open('urls.txt'):
    try:
        br = Browser()
        br.set_handle_robots(False)
        res = br.open(url, None, 2.5)
        data = res.get_data()
        soup = BeautifulSoup(data)
        title = soup.find('title')
        if soup.title != None:
            print url, title.renderContents(), '\n'
        i += 1
    except urllib2.URLError, e:
        print "Oops, timed out?", '\n'
    except socket.error, e:
        print "Oops, timed out?", '\n'
    except socket.timeout:
        print "Oops, timed out?", '\n'
print 'Processing of list completed, Cheers!!'
sys.exit()

try:
    ioloop.IOLoop.instance().start()
except KeyboardInterrupt:
    ioloop.IOLoop.instance().stop()
I am trying to grep the HTTP title of a list of hosts.
The basic idea you have already implemented is a non-blocking HTTP client.
def handle_request(response):
    if response.error:
        print "Error:", response.error
    else:
        print response.body

for url in ["http://google.com", "http://twitter.com"]:
    http_client = httpclient.AsyncHTTPClient()
    http_client.fetch(url, handle_request)
You can loop over your URLs, and the callback will be called as soon as the response for a specific URL becomes available.
I wouldn't mix mechanize, ioloop, ... if not necessary.
Apart from that, I recommend grequests. It is a lightweight tool that satisfies your requirements.
import grequests
from bs4 import BeautifulSoup

urls = ['http://google.com', 'http://www.python.org/']
rs = (grequests.get(u) for u in urls)
res = grequests.map(rs)
for r in res:
    soup = BeautifulSoup(r.text)
    print "%s: %s" % (r.url, soup.title.text)
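If installing grequests isn't an option, the standard library's ThreadPoolExecutor covers both halves of the question (threads plus a text file). A Python 3 sketch; the output filename is just an example:

```python
import concurrent.futures
import re
import urllib.request

def fetch_title(url):
    """Fetch one page and pull out its <title>, or None on any failure."""
    try:
        req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        html = urllib.request.urlopen(req, timeout=10).read().decode('utf-8', 'replace')
        match = re.search(r'<title[^>]*>(.*?)</title>', html, re.S | re.I)
        return match.group(1).strip() if match else None
    except Exception:
        return None  # treat timeouts and bad pages alike, as the script above does

def titles_to_file(urls, path='titles.txt', workers=10):
    """Fetch titles concurrently and write one 'url<TAB>title' line per URL."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        titles = list(pool.map(fetch_title, urls))
    with open(path, 'w') as f:
        for url, title in zip(urls, titles):
            f.write('%s\t%s\n' % (url, title))
```

Threads suit this job because each worker spends nearly all its time waiting on the network, so ten of them overlap nicely despite the GIL.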
I wanted to check whether a certain website exists; this is what I'm doing:
user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
link = "http://www.abc.com"
req = urllib2.Request(link, headers=headers)
page = urllib2.urlopen(req).read()  # <-- ERROR 402 generated here!
If the page doesn't exist (error 402, or whatever other error), what can I do in the page = ... line to make sure the page I'm reading does exist?
You can use a HEAD request instead of GET. It will only download the headers, not the content. Then you can check the response status from the headers.
For Python 2.7.x, you can use httplib:
import httplib

c = httplib.HTTPConnection('www.example.com')
c.request("HEAD", '')
if c.getresponse().status == 200:
    print('web site exists')
or urllib2:
import urllib2

try:
    urllib2.urlopen('http://www.example.com/some_page')
except urllib2.HTTPError, e:
    print(e.code)
except urllib2.URLError, e:
    print(e.args)
or for 2.7 and 3.x, you can install requests
import requests

response = requests.get('http://www.example.com')
if response.status_code == 200:
    print('Web site exists')
else:
    print('Web site does not exist')
It's better to check that the status code is < 400, as was done here. Here is what the status codes mean (taken from Wikipedia):
1xx - informational
2xx - success
3xx - redirection
4xx - client error
5xx - server error
If you want to check whether a page exists and don't want to download the whole page, you should use a HEAD request:
import httplib2
h = httplib2.Http()
resp = h.request("http://www.google.com", 'HEAD')
assert int(resp[0]['status']) < 400
taken from this answer.
If you want to download the whole page, just make a normal request and check the status code. Example using requests:
import requests
response = requests.get('http://google.com')
assert response.status_code < 400
See also similar topics:
Python script to see if a web page exists without downloading the whole page?
Checking whether a link is dead or not using Python without downloading the webpage
How do you send a HEAD HTTP request in Python 2?
Making HTTP HEAD request with urllib2 from Python 2
from urllib2 import Request, urlopen, HTTPError, URLError

user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
link = "http://www.abc.com/"
req = Request(link, headers=headers)
try:
    page_open = urlopen(req)
except HTTPError, e:
    print e.code
except URLError, e:
    print e.reason
else:
    print 'ok'
To answer unutbu's comment: because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.
Source
There is an excellent answer provided by @Adem Öztaş for use with httplib and urllib2. For requests, if the question is strictly about resource existence, the answer can be improved upon in the case of large resources.
The previous answer for requests suggested something like the following:
import requests

def uri_exists_get(uri: str) -> bool:
    try:
        response = requests.get(uri)
        try:
            response.raise_for_status()
            return True
        except requests.exceptions.HTTPError:
            return False
    except requests.exceptions.ConnectionError:
        return False
requests.get attempts to pull the entire resource at once, so for large media files, the above snippet would attempt to pull the entire media into memory. To solve this, we can stream the response.
def uri_exists_stream(uri: str) -> bool:
    try:
        with requests.get(uri, stream=True) as response:
            try:
                response.raise_for_status()
                return True
            except requests.exceptions.HTTPError:
                return False
    except requests.exceptions.ConnectionError:
        return False
I ran the above snippets with timers attached against two web resources:
1) http://bbb3d.renderfarming.net/download.html, a very light html page
2) http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4, a decently sized video file
Timing results below:
uri_exists_get("http://bbb3d.renderfarming.net/download.html")
# Completed in: 0:00:00.611239
uri_exists_stream("http://bbb3d.renderfarming.net/download.html")
# Completed in: 0:00:00.000007
uri_exists_get("http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4")
# Completed in: 0:01:12.813224
uri_exists_stream("http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4")
# Completed in: 0:00:00.000007
As a last note: this function also works in the case that the resource host doesn't exist. For example "http://abcdefghblahblah.com/test.mp4" will return False.
I see many answers that use requests.get, but I suggest this solution using only requests.head, which is faster and also kinder to the web server since it doesn't need to send the body back.
import requests

def check_url_exists(url: str):
    """
    Checks if a url exists.
    :param url: url to check
    :return: True if the url exists, False otherwise.
    """
    return requests.head(url, allow_redirects=True).status_code == 200
The meta-information contained in the HTTP headers in response to a HEAD request should be identical to the information sent in response to a GET request.
code:
import urllib

a = "http://www.example.com"
try:
    print urllib.urlopen(a)
except:
    print a + " site does not exist"
You can simply use the stream method so as not to download the full file. In the latest Python 3 you won't get urllib2, so it's best to use the proven requests library. This simple function will solve your problem:
import requests

def uri_exists(url):
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        return True
    else:
        return False
import urllib.request
from urllib.error import HTTPError, URLError

def isok(mypath):
    try:
        thepage = urllib.request.urlopen(mypath)
    except HTTPError as e:
        return 0
    except URLError as e:
        return 0
    else:
        return 1
Try this one:
import urllib2

website = 'https://www.allyourmusic.com'
try:
    response = urllib2.urlopen(website)
    if response.code == 200:
        print("site exists!")
    else:
        print("site doesn't exist!")
except urllib2.HTTPError, e:
    print(e.code)
except urllib2.URLError, e:
    print(e.args)