I am using Python Requests together with the cfscrape module to bypass Cloudflare-protected websites, but sometimes it does not validate the URL properly and returns a 403 status code.
I am also routing the requests through a Tor proxy to find the blocked URLs.
import sys
import requests
import cfscrape

# Create the proxy settings (Tor SOCKS5 on the default port).
proxies = {'http': 'socks5://127.0.0.1:9050',
           'https': 'socks5://127.0.0.1:9050'}

# Start the session.
# s = requests.Session()
s = cfscrape.create_scraper()  # https://github.com/Anorov/cloudflare-scrape/issues/103

# Attach the proxies to the session.
s.proxies = proxies

# Bypass a Cloudflare-enabled website - https://support.cloudflare.com/hc/en-us/articles/203306930-Does-Cloudflare-block-Tor-
scraper = cfscrape.create_scraper(sess=s, delay=10)

try:
    # User input
    LINK = input('Enter a URL: ')
    response = scraper.get(LINK)
except requests.ConnectionError as e:
    print("OOPS!! Connection Error - maybe the URL is not valid, or the Cloudflare challenge could not be bypassed")
except requests.Timeout as e:
    print("OOPS!! Timeout Error")
except requests.RequestException as e:
    print("OOPS!! General Error (enter a valid URL) - add http:// or https:// in front of the URL")
except (KeyboardInterrupt, SystemExit):
    print("Ok ok, quitting")
    sys.exit(1)
else:
    if response.history:
        print("URL was redirected")
        for resp in response.history:
            print(resp.status_code, resp.url)
        print("Final destination:")
        print(response.status_code, response.url)
    else:
        print(response.status_code, response.url + " - Current Live and Active URL")
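Since the symptom described above is an unhandled 403, one small addition worth sketching (it assumes the response object from the snippet above and is not part of the original code) is an explicit status check, so a Cloudflare block can be told apart from a dead or invalid URL:
# Sketch only: distinguish a Cloudflare 403 from other outcomes (uses `response` from above).
if response.status_code == 403:
    print("403 Forbidden - Cloudflare (or the site) refused the request; the URL itself may still be valid")
elif response.ok:
    print(response.status_code, response.url, "- reachable and active")
else:
    print(response.status_code, response.url, "- reachable but returned an error status")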
Trying to bypass Cloudflare's wall, I wanted to get access to the cf_clearance cookie.
I tried cfscrape (link to the package: [link]):
import cfscrape

# Build a raw request string, as in the cfscrape README example.
request = "GET / HTTP/1.1\r\n"
cookie_value, user_agent = cfscrape.get_cookie_string("https://somesite.com")
request += "Cookie: %s\r\nUser-Agent: %s\r\n" % (cookie_value, user_agent)
print(request)
This should return the cf_clearance and __cfduid cookies, but in our case it raises:
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://somesite.com
I also tried cf_clearance to pass the Cloudflare challenge; this is the code I tried:
from playwright.sync_api import sync_playwright
from cf_clearance import sync_cf_retry, sync_stealth
import requests

# Without cf_clearance, the Cloudflare challenge fails.
proxies = {
    "all": "socks5://localhost:7890"
}
res = requests.get('https://somesite.com', proxies=proxies)
if '<title>Please Wait... | Cloudflare</title>' in res.text:
    print("cf challenge fail")

# Get cf_clearance with a real browser.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, proxy={"server": "socks5://localhost:7890"})
    page = browser.new_page()
    sync_stealth(page, pure=True)
    page.goto('https://somesite.com')
    res = sync_cf_retry(page)
    if res:
        cookies = page.context.cookies()
        for cookie in cookies:
            if cookie.get('name') == 'cf_clearance':
                cf_clearance_value = cookie.get('value')
                print(cf_clearance_value)
        ua = page.evaluate('() => {return navigator.userAgent}')
        print(ua)
    else:
        print("cf challenge fail")
    browser.close()

# Reuse cf_clearance: the request must come from the same IP and use the same UA.
headers = {"user-agent": ua}
cookies = {"cf_clearance": cf_clearance_value}
res = requests.get('https://somesite.com', proxies=proxies, headers=headers, cookies=cookies)
if '<title>Please Wait... | Cloudflare</title>' not in res.text:
    print("cf challenge success")
The above code was from here.
I tried it with and without proxies.
With proxies, the output is:
Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it
Without proxies, the output is:
cf challenge fail
...
NotImplementedError: Encountered recaptcha. Check whether your proxy is an elite proxy.
I am connected to the web via a VPN and I would like to connect to a news site to grab, well, news. For this a library exists: FinNews. This is the code:
import FinNews as fn
cnbc_feed = fn.CNBC(topics=['finance', 'earnings'])
print(cnbc_feed.get_news())
print(cnbc_feed.possible_topics())
Now, because of the VPN, the connection won't work and it throws:
<urlopen error [WinError 10061] No connection could be made because
the target machine actively refused it ( client - server )
So, separately, I started to figure out how to make the connection work, and it does work (it prints "Connected"):
import urllib.request
from urllib.error import HTTPError, URLError

proxy = "http://user:pw#proxy:port"
proxies = {"http": "http://%s" % proxy}
url = "http://www.google.com/search?q=test"
headers = {'User-agent': 'Mozilla/5.0'}

try:
    proxy_support = urllib.request.ProxyHandler(proxies)
    opener = urllib.request.build_opener(proxy_support, urllib.request.HTTPHandler(debuglevel=1))
    urllib.request.install_opener(opener)
    req = urllib.request.Request(url, None, headers)
    html = urllib.request.urlopen(req).read()
    #print(html)
    print("Connected")
except (HTTPError, URLError) as err:
    print("No internet connection.")
Now I have figured out how to access the news and how to make a connection via the VPN, but I can't bring both together: I want to grab the news via the library, through the VPN. I am fairly new to Python, so I guess I don't fully get the logic yet.
EDIT: I tried to combine it with feedparser, based on furas' hint:
import urllib.request
import feedparser
from urllib.error import HTTPError, URLError

proxy = "http://user:pw#proxy:port"
proxies = {"http": "http://%s" % proxy}
#url = "http://www.google.com/search?q=test"
#url = "http://www.reddit.com/r/python/.rss"
url = "https://timesofindia.indiatimes.com/rssfeedstopstories.cms"
headers = {'User-agent': 'Mozilla/5.0'}

try:
    proxy_support = urllib.request.ProxyHandler(proxies)
    opener = urllib.request.build_opener(proxy_support, urllib.request.HTTPHandler(debuglevel=1))
    urllib.request.install_opener(opener)
    req = urllib.request.Request(url, None, headers)
    html = urllib.request.urlopen(req).read()
    #print(html)
    #print("Connected")
    feed = feedparser.parse(html)
    #print(feed['feed']['link'])
    print("Number of RSS posts :", len(feed.entries))
    entry = feed.entries[1]
    print("Post Title :", entry.title)
except (HTTPError, URLError) as err:
    print("No internet connection.")
But I get the same error... this is a big nut to crack.
May I ask for your advice? Thank you :)
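One thing that may be worth trying (a sketch only; I have not verified how FinNews fetches its feeds internally): both urllib and requests honor the standard proxy environment variables, so if FinNews uses either of them under the hood, setting HTTP_PROXY/HTTPS_PROXY before fetching may route the feed download through the proxy. The proxy URL below is the same kind of placeholder used above.
import os

# Placeholder proxy URL - replace with the real one.
os.environ["HTTP_PROXY"] = "http://user:pw@proxy:port"
os.environ["HTTPS_PROXY"] = "http://user:pw@proxy:port"

# urllib and requests both read these variables by default, so a library that
# uses either of them under the hood should now connect through the proxy.
import FinNews as fn

cnbc_feed = fn.CNBC(topics=['finance', 'earnings'])
print(cnbc_feed.get_news())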
I can't connect to the page. Here is my code and the error which I get:
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
import urllib

someurl = "https://www.genecards.org/cgi-bin/carddisp.pl?gene=MET"
req = Request(someurl)
try:
    response = urllib.request.urlopen(req)
except HTTPError as e:
    print('The server couldn\'t fulfill the request.')
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    print("Everything is fine")
Error code: 403
Some websites require a browser-like "User-Agent" header, others require specific cookies. In this case, I found out by trial and error that both are required. What you need to do is:
Send an initial request with a browser-like user-agent. This will fail with 403, but you will also obtain a valid cookie in the response.
Send a second request with the same user-agent and the cookie that you got before.
In code:
import urllib.request
from urllib.error import URLError

# This handler will store and send cookies for us.
handler = urllib.request.HTTPCookieProcessor()
opener = urllib.request.build_opener(handler)

# Browser-like user agent to make the website happy.
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.genecards.org/cgi-bin/carddisp.pl?gene=MET'
request = urllib.request.Request(url, headers=headers)

for i in range(2):
    try:
        response = opener.open(request)
    except URLError as exc:
        print(exc)

print(response)
# Output:
# HTTP Error 403: Forbidden (expected, first request always fails)
# <http.client.HTTPResponse object at 0x...> (correct 200 response)
Or, if you prefer, using requests:
import requests

session = requests.Session()
jar = requests.cookies.RequestsCookieJar()
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.genecards.org/cgi-bin/carddisp.pl?gene=MET'

for i in range(2):
    response = session.get(url, cookies=jar, headers=headers)
    print(response)

# Output:
# <Response [403]>
# <Response [200]>
You can use http.client. First, you need to open a connection to the server, and then make a GET request, like this:
import http.client

# genecards.org is served over HTTPS, so use HTTPSConnection (port 443).
conn = http.client.HTTPSConnection("www.genecards.org")
try:
    conn.request("GET", "/cgi-bin/carddisp.pl?gene=MET")
    response = conn.getresponse()
    body = response.read().decode("UTF-8")
except (http.client.HTTPException, OSError) as e:
    print("The server couldn't fulfill the request or couldn't be reached.")
    print('Reason: ', e)
else:
    # Note: http.client does not raise on HTTP error statuses such as 403,
    # so check response.status explicitly if needed.
    print("Everything is fine:", response.status)
I am trying to use requests to download a list of URLs and catch the exception if one is a bad URL. Here's my test code:
import requests
from requests.exceptions import ConnectionError

# good url
url = "http://www.google.com"
# bad url with good host
#url = "http://www.google.com/thereisnothing.jpg"
# url with bad host
#url = "http://somethingpotato.com"

print url
try:
    r = requests.get(url, allow_redirects=True)
    print "the url is good"
except ConnectionError, e:
    print e
    print "the url is bad"
The problem is that if I pass in url = "http://www.google.com", everything works as expected, since it is a good URL:
http://www.google.com
the url is good
But if I pass in url = "http://www.google.com/thereisnothing.jpg" I still get:
http://www.google.com/thereisnothing.jpg
the url is good
So it's almost like it's not even looking at anything after the "/".
Just to see whether the error checking works at all, I passed a bad hostname: url = "http://somethingpotato.com"
Which kicked back the error message I expected:
http://somethingpotato.com
HTTPConnectionPool(host='somethingpotato.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1b6cd15b90>: Failed to establish a new connection: [Errno -2] Name or service not known',))
the url is bad
What am I missing to make requests catch a bad URL, not just a bad hostname?
Thanks
requests does not raise an exception for a 404 response by default. Instead, you need to filter those out by checking whether the status is 'ok' (HTTP response 200):
import requests
from requests.exceptions import ConnectionError

# good url
url = "http://www.google.com/nothing"
# bad url with good host
#url = "http://www.google.com/thereisnothing.jpg"
# url with bad host
#url = "http://somethingpotato.com"

print url
try:
    r = requests.get(url, allow_redirects=True)
    if r.status_code == requests.codes.ok:
        print "the url is good"
    else:
        print "the url is bad"
except ConnectionError, e:
    print e
    print "the url is bad"
EDIT:
import requests
from requests.exceptions import ConnectionError

def printFailedUrl(url, response):
    if isinstance(response, ConnectionError):
        print "The url " + url + " failed to connect with the exception " + str(response)
    else:
        print "The url " + url + " produced the failed response code " + str(response.status_code)

def testUrl(url):
    try:
        r = requests.get(url, allow_redirects=True)
        if r.status_code == requests.codes.ok:
            print "the url is good"
        else:
            printFailedUrl(url, r)
    except ConnectionError, e:
        printFailedUrl(url, e)

def main():
    testUrl("http://www.google.com")                  # 'Good' url
    testUrl("http://www.google.com/doesnotexist.jpg") # 'Bad' url with a 404 response
    testUrl("http://sdjgb")                           # 'Bad' url with an unreachable host

main()
In this case, one function can handle being passed either an exception or a response object. This way you can react differently to a URL that returns a non-'good' (non-200) response versus an unusable URL that throws an exception. Hope this has the information you need.
What you want is to check r.status_code. Getting r.status_code for "http://www.google.com/thereisnothing.jpg" will give you 404. You can add a condition so that only a URL with a 200 status code counts as "good", for example:
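A minimal sketch of that check, reusing the URL from the question (print() with a single argument is used so it runs on both Python 2 and 3):
import requests

# Sketch of the status-code check described above.
r = requests.get("http://www.google.com/thereisnothing.jpg", allow_redirects=True)
if r.status_code == 200:
    print("the url is good")
else:
    print("the url is bad")  # prints this, since the server answers with 404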
I wanted to check whether a certain website exists; this is what I'm doing:
import urllib2

user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
link = "http://www.abc.com"
req = urllib2.Request(link, headers=headers)
page = urllib2.urlopen(req).read()  # ERROR 402 generated here!
If the page doesn't exist (error 402, or whatever other error), what can I do in the page = ... line to make sure that the page I'm reading does exist?
You can use a HEAD request instead of GET. It will only download the headers, not the content. Then you can check the response status from the headers.
For Python 2.7.x, you can use httplib:
import httplib

c = httplib.HTTPConnection('www.example.com')
c.request("HEAD", '')
if c.getresponse().status == 200:
    print('web site exists')
or urllib2:
import urllib2

try:
    urllib2.urlopen('http://www.example.com/some_page')
except urllib2.HTTPError, e:
    print(e.code)
except urllib2.URLError, e:
    print(e.args)
or, for 2.7 and 3.x, you can install requests:
import requests

response = requests.get('http://www.example.com')
if response.status_code == 200:
    print('Web site exists')
else:
    print('Web site does not exist')
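Since this answer's point is to use HEAD rather than GET, and the requests example above still uses GET, here is a small sketch of the HEAD variant (allow_redirects is enabled explicitly because requests.head does not follow redirects by default):
import requests

# Same check, but with a HEAD request so only the headers are transferred.
response = requests.head('http://www.example.com', allow_redirects=True)
if response.status_code == 200:
    print('Web site exists')
else:
    print('Web site does not exist')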
It's better to check that the status code is < 400, as was done here. Here is what the status codes mean (taken from Wikipedia):
1xx - informational
2xx - success
3xx - redirection
4xx - client error
5xx - server error
If you want to check whether the page exists without downloading the whole page, you should use a HEAD request:
import httplib2
h = httplib2.Http()
resp = h.request("http://www.google.com", 'HEAD')
assert int(resp[0]['status']) < 400
taken from this answer.
If you want to download the whole page, just make a normal request and check the status code. Example using requests:
import requests
response = requests.get('http://google.com')
assert response.status_code < 400
See also similar topics:
Python script to see if a web page exists without downloading the whole page?
Checking whether a link is dead or not using Python without downloading the webpage
How do you send a HEAD HTTP request in Python 2?
Making HTTP HEAD request with urllib2 from Python 2
from urllib2 import Request, urlopen, HTTPError, URLError

user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
link = "http://www.abc.com/"
req = Request(link, headers=headers)

try:
    page_open = urlopen(req)
except HTTPError, e:
    print e.code
except URLError, e:
    print e.reason
else:
    print 'ok'
To answer the comment of unutbu:
Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.
Source
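To illustrate that point, here is a sketch using the Python 3 urllib.request equivalent of the urllib2 code above; github.com is assumed here only as an example of a host that redirects plain HTTP to HTTPS:
from urllib.request import urlopen

# github.com answers http:// with a 301 redirect to https://;
# the default handlers follow it, so we only ever see the final status.
response = urlopen("http://github.com")
print(response.getcode())  # 200, not 301
print(response.geturl())   # the https:// URL after the redirect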
There is an excellent answer provided by @Adem Öztaş for use with httplib and urllib2. For requests, if the question is strictly about resource existence, then the answer can be improved upon for large resources.
The previous answer for requests suggested something like the following:
import requests

def uri_exists_get(uri: str) -> bool:
    try:
        response = requests.get(uri)
        try:
            response.raise_for_status()
            return True
        except requests.exceptions.HTTPError:
            return False
    except requests.exceptions.ConnectionError:
        return False
requests.get attempts to pull the entire resource at once, so for large media files, the above snippet would attempt to pull the entire media into memory. To solve this, we can stream the response.
def uri_exists_stream(uri: str) -> bool:
    try:
        with requests.get(uri, stream=True) as response:
            try:
                response.raise_for_status()
                return True
            except requests.exceptions.HTTPError:
                return False
    except requests.exceptions.ConnectionError:
        return False
I ran the above snippets with timers attached against two web resources:
1) http://bbb3d.renderfarming.net/download.html, a very light html page
2) http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4, a decently sized video file
Timing results below:
uri_exists_get("http://bbb3d.renderfarming.net/download.html")
# Completed in: 0:00:00.611239
uri_exists_stream("http://bbb3d.renderfarming.net/download.html")
# Completed in: 0:00:00.000007
uri_exists_get("http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4")
# Completed in: 0:01:12.813224
uri_exists_stream("http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4")
# Completed in: 0:00:00.000007
As a last note: this function also works in the case that the resource host doesn't exist. For example "http://abcdefghblahblah.com/test.mp4" will return False.
I see many answers that use requests.get, but I suggest this solution using only requests.head, which is faster and also easier on the web server since it doesn't need to send back the body.
import requests

def check_url_exists(url: str):
    """
    Checks if a url exists.
    :param url: url to check
    :return: True if the url exists, False otherwise.
    """
    return requests.head(url, allow_redirects=True).status_code == 200
The meta-information contained in the HTTP headers in response to a HEAD request should be identical to the information sent in response to a GET request.
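For example, assuming the check_url_exists helper defined above (example.com is just an illustrative host):
# Usage sketch for the helper above.
print(check_url_exists("http://www.example.com"))          # True - the server answers 200
print(check_url_exists("http://www.example.com/nothing"))  # expected False if the server answers 404 for this path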
code:
a="http://www.example.com"
try:
print urllib.urlopen(a)
except:
print a+" site does not exist"
You can simply use the stream parameter so that the full file is not downloaded. In the latest Python 3 you won't get urllib2, so it's best to use the proven requests method. This simple function will solve your problem:
import requests

def uri_exists(url):
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        return True
    else:
        return False
import urllib.request
from urllib.error import HTTPError, URLError

def isok(mypath):
    try:
        thepage = urllib.request.urlopen(mypath)
    except HTTPError as e:
        return 0
    except URLError as e:
        return 0
    else:
        return 1
Try this one:
import urllib2

website = 'https://www.allyourmusic.com'
try:
    response = urllib2.urlopen(website)
    if response.code == 200:
        print("site exists!")
    else:
        print("site doesn't exist!")
except urllib2.HTTPError, e:
    print(e.code)
except urllib2.URLError, e:
    print(e.args)