I am facing an issue where fetching the page source of a URL works fine from my local machine, but running the same piece of code on a Heroku dyno returns "Access Denied".
I have tried changing the headers (adding a Referer, changing the User-Agent), but none of those solutions work.
LOCAL MACHINE
~/Development/repos/eater-list (master) $ python manage.py shell
>>> from accounts.zomato import *
>>> z = ZomatoAPI()
>>> response = z.page_source(url='https://www.zomato.com/ncr/the-immigrant-cafe-khan-market-new-delhi')
>>> response[0:50]
'<!DOCTYPE html>\n<html lang="en" prefix="og: http'
>>> response[0:100]
'<!DOCTYPE html>\n<html lang="en" prefix="og: http://ogp.me/ns#" >\n<head>\n <meta charset="utf-8"
REMOTE MACHINE
~ $ python manage.py shell
Python 3.5.7 (default, Jul 17 2019, 15:27:27)
[GCC 7.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from accounts.zomato import *
>>> z = ZomatoAPI()
>>> response = z.page_source(url='https://www.zomato.com/ncr/the-immigrant-cafe-khan-market-new-delhi')
>>> response
'<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don\'t have permission to access "http://www.zomato.com/ncr/the-immigrant-cafe-khan-market-new-delhi" on this server.<P>\nReference #18.56273017.1572225939.46ec5af\n</BODY>\n</HTML>\n'
>>>
ZOMATO API CODE
There is no change in the headers or the requests version between the two environments.
import requests


class ZomatoAPI:
    def __init__(self):
        self.user_key = api_key  # api_key is defined elsewhere in the module
        self.headers = {
            'Accept': 'application/json',
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/77.0.3865.90 Safari/537.36',
            'user-key': self.user_key}

    def page_source(self, url):
        fng = requests.session()
        page_source = fng.get(
            url, headers=self.headers).content.decode("utf-8")
        return page_source
I would appreciate some advice on this.
Check the response's HTTP status code. It may well be that Heroku's IP range is simply banned by Zomato. This is more common than one might believe: services like Cloudflare often put the IP ranges of popular cloud providers on a block list.
The exact status code (for example 403 Forbidden versus 429 Too Many Requests) will give you more context about what is blocking you.
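A minimal diagnostic sketch along those lines (the headers below merely stand in for the ones built in ZomatoAPI.__init__ above):

import requests

# Surface the status code and the Server header instead of only the body.
headers = {'Accept': 'application/json', 'User-Agent': 'Mozilla/5.0'}

response = requests.get('https://www.zomato.com/ncr/the-immigrant-cafe-khan-market-new-delhi',
                        headers=headers)
print(response.status_code)            # e.g. 200 locally, likely 403 from Heroku if the IP is blocked
print(response.headers.get('Server'))  # the serving CDN/WAF often shows up here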
Related
I'm trying to connect to a website with the code below. It works fine on Heroku, but I get an error on DigitalOcean.
Code:
headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"}
def web():
rq = requests.get("https://myurl.com/gts?search=word", headers = headers)
print(rq)
The error I get in DigitalOcean:
HTTPSConnectionPool(host='myurl.com', port=443): Max retries exceeded with url: /gts?search=word (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fa364a89690>, 'Connection to myurl.com timed out. (connect timeout=None)'))
What I've tried, without success:
I added verify=False to the request and disabled the firewall from the console. I also tried to reach the site from the server's console with "curl -I myurl.com". All of it failed.
Thank you for your help.
It seems the issue is not caused by a bug in your code, but rather by a firewall (or other rules) on the remote server side (sozluk.gov.tr).
I ran curl -I https://sozluk.gov.tr and your snippet both on my local machine and on a remote (DigitalOcean-hosted) VM, and both worked fine.
You mentioned that the curl command from the console (I assume the remote VM's console) did not work. This indicates that the issue is at the network level rather than in the code.
I recommend spinning up a new VM in a different region (to get an IP from a different pool) or using some kind of proxy that is not blocked by the remote server. You can check the available regions (datacenters) on this link.
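If you go the proxy route, a minimal sketch with requests could look like this (the proxy address and credentials are placeholders, not a real service):

import requests

# Placeholder proxy - replace with one you control or are allowed to use.
proxies = {
    "http": "http://user:password@proxy.example.com:3128",
    "https": "http://user:password@proxy.example.com:3128",
}
headers = {"User-Agent": "Mozilla/5.0"}

rq = requests.get("https://sozluk.gov.tr/gts?search=word",
                  headers=headers, proxies=proxies, timeout=10)
print(rq.status_code)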
Responses
user@do-server:~$ curl -I https://sozluk.gov.tr
HTTP/1.1 200 OK
Server: nginx
Date: Sat, 18 Feb 2023 11:18:37 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 108975
Last-Modified: Fri, 13 Jan 2023 12:38:15 GMT
Connection: keep-alive
Vary: Accept-Encoding
ETag: "63c150b7-1a9af"
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET, POST, OPTIONS
Accept-Ranges: bytes
user@do-server:~$ python3
Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"}
>>> requests.get("https://sozluk.gov.tr/gts?search=word", headers = headers)
<Response [200]>
I have some software that uses web scraping, but for some reason it doesn't seem to work. It's bizarre: when I run it in Google Colab the code works fine and the URLs can be opened and scraped, but when I run it in my web application (started from my console with python3 run.py) it doesn't work.
Here is the code that is returning errors:
b = searchgoogle(query, num)
c = []
print(b)
for i in b:
    extractor = extractors.ArticleExtractor()
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/50.0.2661.102 Safari/537.36'
    }
    extractor = extractors.ArticleExtractor()
    req = Request(url=i, headers=headers)
    d = urlopen(req)
    try:
        if d.info()['content-type'].startswith('text/html'):
            print('its html')
            resp = requests.get(i, headers=headers)
            if resp.ok:
                doc = extractor.get_content(resp.text)
                c.append(comparetexts(text, doc, i))
            else:
                print(f'Failed to get URL: {resp.status_code}')
        else:
            print('its not html')
    except KeyError:
        print('its not html')
    print(i)
return c
The line raising the error is d = urlopen(req).
There is code above the section I've posted here, but it has nothing to do with the errors. Anyway, thanks for your time!
(By the way, I checked the OpenSSL version in python3 and it reports 'OpenSSL 1.1.1m 14 Dec 2021', so I think it's up to date.)
This happens because your web application's environment cannot verify the SSL certificate, so you should tell your script to ignore SSL verification when making the request, as described here:
Python 3 urllib ignore SSL certificate verification
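Applied to the snippet above, that would look roughly like this (a sketch only; the URL and headers are placeholders for the i and headers used in the question's loop, and disabling verification is insecure):

import ssl
from urllib.request import Request, urlopen

# Unverified SSL context: skips certificate validation (insecure, use with care).
context = ssl._create_unverified_context()

url = 'https://example.com/some-article'    # placeholder for the loop variable i
headers = {'User-Agent': 'Mozilla/5.0'}      # placeholder for the question's headers dict

req = Request(url=url, headers=headers)
d = urlopen(req, context=context)
print(d.getcode(), d.info()['content-type'])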
My Python code:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
xml_data1 = "hello"
hostxml = "https://example.com/xmlrpc.php"
check_method = requests.post(hostxml, data='1', verify=False, headers=headers).text
print(check_method)
Output:
Not Acceptable!An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.
Mod_Security can be configured to block POST requests to certain endpoints.
You would have to find out which version of Mod_Security the server is running and look for a way around that rule.
Note that doing so is illegal unless you have permission to test the server.
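If the goal is only to detect that Mod_Security rejected the request (rather than get around it), checking the status code and body is enough. A small sketch, reusing the placeholder host from the question:

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
resp = requests.post("https://example.com/xmlrpc.php", data='1',
                     headers=headers, verify=False)

# Mod_Security rejections typically come back as 403 or 406, with its
# signature text in the body.
blocked = resp.status_code in (403, 406) or 'Mod_Security' in resp.text
print(resp.status_code, blocked)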
I'm writing a script to find out which full URLs a large number of shortened URLs lead to. I'm using the requests module to follow redirects and get the URL one would end up at after entering the shortened URL in a browser. This works for almost all link shorteners, but fails for URLs from disq.us for reasons I can't figure out (i.e. for disq.us URLs I get back the same URL I entered, whereas in a browser I get redirected).
Below is a snippet that correctly resolves a bit.ly-shortened link but fails with a disq.us link. I'm running it with Python 3.6.4 and version 2.18.4 of the requests module.
SO will not allow me to include shortened URLs in the question, so I'll leave those in a comment.
import requests
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
url1 = "SOME BITLY URL"
url2 = "SOME DISQ.US URL"
for url in [url1, url2]:
    s = requests.Session()
    s.headers['User-Agent'] = user_agent
    r = s.get(url, allow_redirects=True, timeout=10)
    print(r.url)
Your first URL is a 404 for me. Interestingly, I just tried the second URL and it worked, but with a different user agent. Then I tried it with your user agent, and it doesn't redirect.
This suggests that the web server is doing something strange in response to that user agent string, and that the problem isn't with requests.
>>> import requests
>>> user_agent = 'foo'
>>> url = 'THE_DISCUS_URL'
>>> s = requests.Session()
>>> s.headers['User-Agent'] = user_agent
>>> r = s.get(url, allow_redirects=True, timeout=10)
>>> r.url
'https://www.elsevier.com/connect/could-dissolvable-microneedles-replace-injected-vaccines'
vs.
>>> import requests
>>> user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
>>> url = 'THE_DISCUS_URL'
>>> s = requests.Session()
>>> s.headers['User-Agent'] = user_agent
>>> r = s.get(url, allow_redirects=True, timeout=10)
>>> r.url
'THE_DISCUS_URL'
I got curious, so I investigated a little more. The actual content of the response is a noscript tag with the link, plus some JavaScript that performs the redirect.
What's probably going on is that if Disqus sees a real web browser user agent, it tries to redirect via JavaScript (and probably does a bunch of tracking in the process). On the other hand, if the user agent isn't familiar, the site assumes the visitor is a script, which probably can't run JavaScript, and just issues the redirect.
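If you need to keep the browser-like user agent, one possible workaround (a sketch based on the observation above, not anything Disqus documents) is to fall back to pulling the target link out of the noscript block whenever no HTTP redirect happens:

import re
import requests

def resolve(url, user_agent):
    s = requests.Session()
    s.headers['User-Agent'] = user_agent
    r = s.get(url, allow_redirects=True, timeout=10)
    if r.url != url:
        return r.url                              # a normal HTTP redirect was followed
    # Fallback: grab the first href inside <noscript>, where the target link sits.
    m = re.search(r'<noscript>.*?href="([^"]+)"', r.text, re.S | re.I)
    return m.group(1) if m else r.url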
I am building a broken-link checker in Python 3.4 to help ensure the quality of a large collection of articles that I manage. Initially I was using GET requests to check whether a link was viable; however, I'm trying to be as polite as possible when pinging the URLs I check, so I make sure never to re-check a URL that has already tested as working, and I have switched to plain HEAD requests.
However, I have found a site that causes the checker to simply stop. It neither throws an error nor opens:
https://www.icann.org/resources/pages/policy-2012-03-07-en
The link itself is fully functional. So ideally I'd like to find a way to process similar links. This code in Python 3.4 will reproduce the issue:
import urllib
import urllib.request
from http.cookiejar import CookieJar

URL = 'https://www.icann.org/resources/pages/policy-2012-03-07-en'
req = urllib.request.Request(
    URL,
    None,
    {'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
     'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
     'Accept-Encoding': 'gzip, deflate, sdch',
     'Accept-Language': 'en-US,en;q=0.8',
     'Connection': 'keep-alive'},
    method='HEAD')
cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
response = opener.open(req)
As it does not throw an error, I really do not know how to troubleshoot this further beyond narrowing it down to the link that halted the entire checker. How can I check if this link is valid?
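One small, hedged addition that at least keeps the checker from stalling on a link like this is to pass a timeout to opener.open, so a silent hang becomes an exception you can treat like any other broken link:

import socket
import urllib.error
import urllib.request
from http.cookiejar import CookieJar

URL = 'https://www.icann.org/resources/pages/policy-2012-03-07-en'
req = urllib.request.Request(URL, None, {'User-Agent': 'Mozilla/5.0'}, method='HEAD')

cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
try:
    # timeout (in seconds) turns an endless wait into a catchable error
    response = opener.open(req, timeout=15)
    print(response.getcode())
except (urllib.error.URLError, socket.timeout) as exc:
    print('Could not verify link:', exc)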
from bs4 import BeautifulSoup, SoupStrainer
import urllib2
import requests
import re
import certifi
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

def getStatus(url):
    a = requests.get(url, verify=False)
    report = str(a.status_code)
    return report

alllinks = []
passlinks = []
faillinks = []
html_page = urllib2.urlopen("https://link")
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.findAll('a', attrs={'href': re.compile("^http*")}):
    #print link.get('href')
    status = getStatus(link.get('href'))
    #print ('URL---->',link.get('href'),'Status---->',status)
    link = 'URL---->', link.get('href'), 'Status---->', status
    alllinks.append(link)
    if status == '200':
        passlinks.append(link)
    else:
        faillinks.append(link)
print alllinks
print passlinks
print faillinks
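Since the question targets Python 3.4 and the snippet above is Python 2 (urllib2, print statements), here is a rough Python 3 equivalent of the same idea using only requests; "https://link" is kept as the placeholder from the original, and this is a sketch rather than a drop-in replacement:

import re
import requests
from bs4 import BeautifulSoup

def get_status(url):
    # verify=False mirrors the original's decision to skip certificate checks
    return str(requests.get(url, verify=False, timeout=15).status_code)

alllinks, passlinks, faillinks = [], [], []

html_page = requests.get("https://link", verify=False, timeout=15).text
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.find_all('a', attrs={'href': re.compile("^http")}):
    status = get_status(link.get('href'))
    entry = ('URL---->', link.get('href'), 'Status---->', status)
    alllinks.append(entry)
    (passlinks if status == '200' else faillinks).append(entry)

print(alllinks)
print(passlinks)
print(faillinks)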