python urllib.request - headers that are likely to work

Working on a little script to fetch info from websites. I'm having trouble with HTTP errors.
req = urllib.request.Request(lnk['href'],
headers={'User-Agent': 'Mozilla/5.0', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'})
page = urllib.request.urlopen(req)
When this tries to fetch, for example, http://www.guru99.com/node-js-tutorial.html, I get a long series of errors, ending with 406 Not Acceptable:
Traceback (most recent call last):
File "get_links.py", line 45, in <module>
page = urllib.request.urlopen(req)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 162, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 471, in open
response = meth(req, response)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 581, in http_response
'http', request, response, code, msg, hdrs)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 509, in error
return self._call_chain(*args)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 443, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 589, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 406: Not Acceptable
Googling around, I have found that I should set the headers (as I have done above), and lots of tutorials about how to set them. Except not much of it actually works.
Is there some set of good headers that is unlikely to cause a problem with most sites? Is there a Python module someone else has created that already includes commonly working headers? Is there a good way to retry several times with different headers until you get a good response?
This seems like a problem everybody who does web scraping with Python deals with, and I haven't found a decent solution.

HTTP Error 406 Not Acceptable
The HyperText Transfer Protocol (HTTP) 406 Not Acceptable client error
response code indicates that the server cannot produce a response
matching the list of acceptable values defined in the request's
proactive content negotiation headers, and that the server is
unwilling to supply a default representation.
So I can see the issue is with your User-Agent header: the bare Mozilla/5.0 value is not enough for this server. Here are links to lists of correct User-Agent strings:
deviceatlas.com
developer.chrome.com
developer.mozilla.org
So change your code to the following:
headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
         'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'})
I know this answer is late, but I hope it helps someone else.

The following set of headers seems to work for most of the sites I've tested. If anyone else has suggestions, please offer them. I'm also interested in good solutions for trying different headers if one set doesn't work (a rough sketch of one such approach follows the code below).
req = urllib.request.Request(lnk['href'],
headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
page = urllib.request.urlopen(req)
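Regarding the "try different headers" part of the question, one pattern (my own sketch, not taken from the answers here; the fetch_with_fallback name and the particular User-Agent strings are only illustrative) is to keep a small list of header sets and fall back to the next one whenever a request fails with an HTTP error:
import urllib.request
from urllib.error import HTTPError

# Hypothetical list of header sets to try in order; the User-Agent strings
# are ordinary browser strings, nothing special.
HEADER_SETS = [
    {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'},
    {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'},
]

def fetch_with_fallback(url, header_sets=HEADER_SETS):
    # Try each header set in turn and return the first successful response.
    last_error = None
    for headers in header_sets:
        req = urllib.request.Request(url, headers=headers)
        try:
            return urllib.request.urlopen(req)
        except HTTPError as err:
            last_error = err  # remember the failure and try the next set
    raise last_error

page = fetch_with_fallback('http://www.guru99.com/node-js-tutorial.html')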

I tried your code and got the same error, as expected.
I also tried it with the User-Agent my Chrome browser sends, and this seems to work:
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36
.. and I also ran a test without passing an explicit header, which also returned HTTP 200 (success). That request used the default User-Agent provided by the requests library, e.g.
python-requests/2.10.0
Hope this helps
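If you want to see exactly which headers a given request actually sends, one option (my own suggestion, not part of this answer) is to point the request at an echo service such as https://httpbin.org/headers, which returns the received headers as JSON:
import json
import urllib.request

req = urllib.request.Request('https://httpbin.org/headers',
                             headers={'User-Agent': 'Mozilla/5.0'})
with urllib.request.urlopen(req) as resp:
    echoed = json.loads(resp.read().decode('utf-8'))
print(echoed['headers'])  # the headers exactly as the server received them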

Related

urllib.error.HTTPError: HTTP Error 403: Forbidden - multiple headers defined

I am trying to make a Python 3 script that iterates through a list of mods hosted on a shared website and downloads the latest one. I have gotten stuck on step one: go to the website and get the mod version list. I am trying to use urllib, but it is throwing a 403: Forbidden error.
I thought this might be some sort of anti-scraping rejection from the server, and I read that you could get around it by defining the headers. I ran Wireshark while using my browser and identified the headers it was sending out:
Host: ocsp.pki.goog\r\n
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0\r\n
Accept: */*\r\n
Accept-Language: en-US,en;q=0.5\r\n
Accept-Encoding: gzip, deflate\r\n
Content-Type: application/ocsp-request\r\n
Content-Length: 83\r\n
Connection: keep-alive\r\n
\r\n
I believe I was able to define the header correctly, but I had to back two entries out as they gave a 400 error:
from urllib.request import Request, urlopen
count = 0
mods = ['mod1', 'mod2', ...] #this has been created to complete the URL and has been tested to work
#iterate through all mods and download latest version
while mods:
    url = 'https://Domain/'+mods[count]
    #change the header to the browser I was using at the time of writing the script
    req = Request(url)
    #req.add_header('Host', 'ocsp.pki.goog\r\n') #this reports 400 bad request
    req.add_header('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0\r\n')
    req.add_header('Accept', '*/*\r\n')
    req.add_header('Accept-Language', 'en-US,en;q=0.5\r\n')
    req.add_header('Accept-Encoding', 'gzip, deflate\r\n')
    req.add_header('Content-Type', 'application/ocsp-request\r\n')
    #req.add_header('Content-Length', '83\r\n') #this reports 400 bad request
    req.add_header('Connection', 'keep-alive\r\n')
    html = urlopen(req).read().decode('utf-8')
This still throws a 403: Forbidden error:
Traceback (most recent call last):
File "SCRIPT.py", line 19, in <module>
html = urlopen(req).read().decode('utf-8')
File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.8/urllib/request.py", line 531, in open
response = meth(req, response)
File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
response = self.parent.error(
File "/usr/lib/python3.8/urllib/request.py", line 569, in error
return self._call_chain(*args)
File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
result = func(*args)
File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
I'm not sure what I'm doing wrong. I assume there is something wrong with the way I've defined my header values, but I am not sure what is wrong with them. Any help would be appreciated.
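For what it's worth, one likely culprit (my reading of the code above, not a confirmed fix) is that every header value ends in a literal \r\n; urllib writes the line endings itself, so the values should be passed bare, and Content-Type/Content-Length do not belong on a GET request, which has no body. A trimmed sketch under those assumptions (the URL is a placeholder for the 'https://Domain/' + mods[count] value from the question):
from urllib.request import Request, urlopen

url = 'https://example.com/'  # placeholder for 'https://Domain/' + mods[count]
req = Request(url)
# Header values go in without any trailing '\r\n'; urllib adds the line
# endings itself when it serialises the request.
req.add_header('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0')
req.add_header('Accept', '*/*')
req.add_header('Accept-Language', 'en-US,en;q=0.5')
# Accept-Encoding is omitted so the body comes back uncompressed and can be
# decoded directly; Content-Type/Content-Length are dropped because a GET has no body.
html = urlopen(req).read().decode('utf-8')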

Why can't I scrape some webpages using Python and bs4?

I've got this code with the purpose of getting the HTML code and scraping it using bs4.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
myUrl = '' # Here goes the webpage URL.
# opening up connection and downloading the page
uClient = uReq(myUrl)
pageHtml = uClient.read()
uClient.close()
#html parse
pageSoup = soup(pageHtml, "html.parser")
print(pageSoup)
However, it does not work; here are the errors shown in the terminal:
Traceback (most recent call last):
File "main.py", line 7, in <module>
uClient = uReq(myUrl)
File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 640, in http_response
response = self.parent.error(
File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 569, in error
return self._call_chain(*args)
File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 502, in _call_chain
result = func(*args)
File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
You are missing some headers that the site may require.
I suggest using the requests package instead of urllib, as it's more flexible. See a working example below:
import requests
url = "https://www.idealista.com/areas/alquiler-viviendas/?shape=%28%28wt_%7BF%60m%7Be%40njvAqoaXjzjFhecJ%7BebIfi%7DL%29%29"
querystring = {"shape":"((wt_{F`m{e@njvAqoaXjzjFhecJ{ebIfi}L))"}
payload = ""
headers = {
'authority': "www.idealista.com",
'cache-control': "max-age=0",
'upgrade-insecure-requests': "1",
'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36",
'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
'sec-fetch-site': "none",
'sec-fetch-mode': "navigate",
'sec-fetch-user': "?1",
'sec-fetch-dest': "document",
'accept-language': "en-US,en;q=0.9"
}
response = requests.request("GET", url, data=payload, headers=headers, params=querystring)
print(response.text)
From there you can parse the body using bs4:
pageSoup = soup(response.text, "html.parser")
However, beware that the site you are trying to scrape may show a CAPTCHA, so you'll probably need to rotate your user-agent header and IP address; a rough sketch of the user-agent part is below.
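A minimal sketch of the User-Agent rotation idea (the list of strings and the get_with_random_agent name are only illustrative; rotating the IP address would additionally require a proxy pool, which is not shown):
import random
import requests

# A few ordinary browser strings to rotate through; extend the list as needed.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0",
]

def get_with_random_agent(url):
    # Pick a different User-Agent for each request.
    headers = {"user-agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=30)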
An HTTP 403 error means that the web server rejected the script's request for the page because it did not have permission/credentials to access it.
I can access the page in your example from here, so most likely what happened is that the web server noticed that you were trying to scrape it and banned your IP address from requesting any more pages. Web servers often do this to prevent scrapers from affecting their performance.
The website explicitly forbids what you are trying to do in its terms here: https://www.idealista.com/ayuda/articulos/legal-statement/?lang=en
So I would suggest you contact the site owner to request an API to use (this probably won't be free though).

Getting an error when using urllib.request.urlopen() on Python 3.2.1

I am using urllib.request to open a page source with Python 3.2.1, but I am getting an error saying urllib.error.HTTPError: HTTP Error 503: Service Unavailable. Please find the code and error below.
Code
import re
import urllib.request
html = urllib.request.urlopen("http://www.pythonchallenge.com/pc/def/ocr.html").read().decode()
print (html)
Error
Traceback (most recent call last):
File "I:/Private/nabm/python/python_challenge/python_challenge_2.py", line 4, in <module>
html = urllib.request.urlopen("http://www.pythonchallenge.com/pc/def/ocr.html").read().decode()
File "C:\appl\Python\3.2.1\lib\urllib\request.py", line 138, in urlopen
return opener.open(url, data, timeout)
File "C:\appl\Python\3.2.1\lib\urllib\request.py", line 375, in open
response = meth(req, response)
File "C:\appl\Python\3.2.1\lib\urllib\request.py", line 487, in http_response
'http', request, response, code, msg, hdrs)
File "C:\appl\Python\3.2.1\lib\urllib\request.py", line 413, in error
return self._call_chain(*args)
File "C:\appl\Python\3.2.1\lib\urllib\request.py", line 347, in _call_chain
result = func(*args)
File "C:\appl\Python\3.2.1\lib\urllib\request.py", line 495, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Unavailable
Process finished with exit code 1
Could anyone see what could be causing this error?
HTTP error 503 means that the server wasn't able to respond at that moment, either due to overload or because it refused your connection. In other words, there is nothing you can change in your code to fix it.
I faced the same issue with some URLs, and providing a header helped. When I looked into it more, I found that servers sometimes detect that a bot is trying to access the website, and to prevent it they return a fake connection error.
from urllib.request import urlopen, Request
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"}
req = Request("url", headers=header)
response = urlopen(req, timeout=60)
I know it's been a while since this was asked, but I will post how I dealt with the "HTTP Error 503" in case it helps someone else.
First of all, I put request.urlretrieve(...) into a try block to catch the error.
In my case the server I tried to access simply needs time to handle the requests. (The server I accessed is not Amazon.com or the sort of site said to block programs from accessing their content.)
Within the try block, in case the exception occurs, I made the program wait for 20 seconds using time.sleep(20). This enables my program to complete.
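A rough sketch of that pattern (the attempt count and the download_with_retry name are my own choices; the 20-second pause is the one described above, and url/filename are placeholders):
import time
from urllib import request
from urllib.error import HTTPError

def download_with_retry(url, filename, attempts=3, pause=20):
    # Retry urlretrieve a few times, sleeping between attempts on HTTP errors.
    for attempt in range(attempts):
        try:
            return request.urlretrieve(url, filename)
        except HTTPError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(pause)  # give the server time to recover before retrying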

Python traceback error using urllib2

I am really confused. I am new to Python and working on a script that scrapes a website for products, on Python 2.7. I am trying to use urllib2 to do this, and when I run the script it prints a long traceback. Suggestions?
Script:
import urllib2, zlib, json
url='https://launches.endclothing.com/api/products'
req = urllib2.Request(url)
req.add_header(':host', 'launches.endclothing.com')
req.add_header(':method', 'GET')
req.add_header(':path', '/api/products')
req.add_header(':scheme', 'https')
req.add_header(':version', 'HTTP/1.1')
req.add_header('accept', 'application/json, text/plain, */*')
req.add_header('accept-encoding', 'gzip,deflate')
req.add_header('accept-language', 'en-US,en;q=0.8')
req.add_header('cache-control', 'max-age=0')
req.add_header('cookie', '__/')
req.add_header('user-agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36')
resp = urllib2.urlopen(req).read()
resp = zlib.decompress(bytes(bytearray(resp)),15+32)
data = json.loads(resp)
for product in data:
    for attrib in product.keys():
        print str(attrib)+' :: '+ str(product[attrib])
    print '\n'
Error(s):
C:\Users\Luke>py C:\Users\Luke\Documents\EndBot2.py
Traceback (most recent call last):
File "C:\Users\Luke\Documents\EndBot2.py", line 5, in <module>
resp = urllib2.urlopen(req).read()
File "C:\Python27\lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 391, in open
response = self._open(req, data)
File "C:\Python27\lib\urllib2.py", line 409, in _open
'_open', req)
File "C:\Python27\lib\urllib2.py", line 369, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 1181, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "C:\Python27\lib\urllib2.py", line 1148, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 1] _ssl.c:499: error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error>
You're running into issues with the SSL configuration of your request. I'm sorry, but I won't correct your code, because it's 2016 and there's a wonderful library that you should use instead: requests.
Its usage is pretty simple:
>>> user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'
>>> result = requests.get('https://launches.endclothing.com/api/products', headers={'user-agent': user_agent})
>>> result
<Response [200]>
>>> result.json()
[{u'name': u'Adidas Consortium x HighSnobiety Ultraboost', u'colour': u'Grey', u'id': 30, u'releaseDate': u'2016-04-09T00:01:00+0100', …
You'll notice that I changed the user-agent in the previous query to get it working, because, weirdly enough, the website refuses API access to the default requests user-agent:
>>> result = requests.get('https://launches.endclothing.com/api/products')
>>> result
<Response [403]>
>>> result.text
This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data. What can I do to resolve this? If you are on a personal connection, like at home, you can run an anti-virus scan on your device to make sure it is not infected with malware. If you are at an office or shared network, you can ask the network administrator to run a scan across the network looking for misconfigured or infected devices.
Otherwise, now that you've tried requests and your life has changed, you might still run into this issue again. As you can read in many places on the internet, it is related to SNI and outdated libraries, and you might get headaches trying to figure it out. My best advice is to upgrade to Python 3, as the problem is likely to be solved by installing a new vanilla version of Python and the libraries involved.
HTH

Downloading HTTPS pages with urllib, error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error

I'm using the newest Kubuntu with Python 2.7.6. I'm trying to download an HTTPS page using the code below:
import urllib2
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'pl-PL,pl;q=0.8',
'Connection': 'keep-alive'}
req = urllib2.Request(main_page_url, headers=hdr)
try:
    page = urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print e.fp.read()
content = page.read()
print content
However, I get such an error:
Traceback (most recent call last):
File "test.py", line 33, in <module>
page = urllib2.urlopen(req)
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1222, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 1] _ssl.c:510: error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error>
How to solve this?
SOLVED!
I used the URL https://www.ssllabs.com given by @SteffenUllrich. It turned out that the server uses TLS 1.2, so I updated Python to 2.7.10 and modified my code to:
import ssl
import urllib2
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'pl-PL,pl;q=0.8',
'Connection': 'keep-alive'}
req = urllib2.Request(main_page_url, headers=hdr)
try:
    page = urllib2.urlopen(req,context=context)
except urllib2.HTTPError, e:
    print e.fp.read()
content = page.read()
print content
Now it downloads the page.
Im using newest Kubuntu with Python 2.7.6
The latest Kubuntu (15.10) uses 2.7.10 as far as I know. But assuming you use 2.7.6, which is contained in 14.04 LTS:
Works with facebook for me too, so it's probably the page issue. What now?
Then it depends on the site. A typical problem with this version of Python is missing support for Server Name Indication (SNI), which was only added in Python 2.7.9. Since lots of sites require SNI today (like everything using Cloudflare Free SSL), I guess this is the problem.
But there are also other possibilities, like multiple trust paths, which are only handled properly with OpenSSL 1.0.2, or simply missing intermediate certificates, etc. More information, and maybe also workarounds, are only possible if you provide the URL or analyze the situation yourself based on this information and the analysis from SSLLabs.
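A quick way to check what your interpreter's ssl module supports (my own addition, not part of this answer) is to print the linked OpenSSL version and the SNI flag:
import ssl

print(ssl.OPENSSL_VERSION)             # e.g. 'OpenSSL 1.0.1f 6 Jan 2014'
print(getattr(ssl, 'HAS_SNI', False))  # False on builds without SNI support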
With an old version of Python (2.7.3), using
requests.get(download_url, headers=headers, timeout=10, stream=True)
I got the following warning and exception:
You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
SSLError(SSLError(1, '_ssl.c:504: error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error')
I just followed the advice, visited
Certificate verification in Python 2
ran
pip install urllib3[secure]
and the problem was solved.
The above answer is only partially correct; you can add a fix to solve this issue:
Code:
import os
import ssl

def allow_unverified_content():
    """
    A 'fix' for Python SSL CERTIFICATE_VERIFY_FAILED (mainly Python 2.7)
    """
    if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
            getattr(ssl, '_create_unverified_context', None)):
        ssl._create_default_https_context = ssl._create_unverified_context
Call it with no options:
allow_unverified_content()
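For example, in a urllib2 script it would be called once before any HTTPS request (my own usage sketch; note that this disables certificate verification process-wide, so only use it against hosts you trust):
import urllib2

allow_unverified_content()  # patch the default HTTPS context (function defined above)
page = urllib2.urlopen('https://example.com')  # any HTTPS URL
print page.read()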
