I am using Python to get HTML data from multiple pages at a URL. I found that urllib throws an exception when a URL does not exist. How do I retrieve the HTML of the custom 404 error page (the page that says something like "Page not found")?
Current code:
from urllib.request import Request, urlopen

# (inside a loop over URLs)
try:
    req = Request(URL, headers={'User-Agent': 'Mozilla/5.0'})
    client = urlopen(req)
    # download the HTML data
    page_html = client.read()
    # close the connection
    client.close()
except:
    print("The following URL was not found. Program terminated.\n" + URL)
    break
Have you tried the requests library?
Just install the library with pip:
pip install requests
And use it like this:
import requests
response = requests.get('https://stackoverflow.com/nonexistent_path')
print(response.status_code) # 404
print(response.text) # Prints the raw HTML response
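Note that, unlike urlopen, requests does not raise an exception for a 4xx response by default, so response.text above is already the HTML of the site's custom 404 page.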
To preserve the comment that also answers the question (and because it's what I was looking for), here is a way to do this without going outside urllib:
By t.m.adam, on Nov 4, 2018 at 10:07:
See HTTPError. It has a .read() method which returns the response content.
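A minimal sketch of that approach: the HTTPError that urlopen raises is itself a file-like response, so the custom 404 page's HTML can be read in the except branch (URL here is the same variable as in the question's code).

from urllib.request import Request, urlopen
from urllib.error import HTTPError

req = Request(URL, headers={'User-Agent': 'Mozilla/5.0'})
try:
    client = urlopen(req)
    page_html = client.read()
    client.close()
except HTTPError as e:
    # e.code is the status (e.g. 404); e.read() returns the error page's body
    page_html = e.read()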
Related
I get a response from an external service, then need to extract a URL from that response and download a file from it.
When I pass the URL from response.text to urlretrieve, urlretrieve returns an error.
But when I manually copy the URL and set it as a variable in Python, e.g. url = 'https://My_service_site.com/9a57v4db5_2023-02-14.csv.gz',
urlretrieve works fine and downloads the file to my computer.
response = requests.post(url, json=payload, headers=headers)

# method 1 - gets an error
url = response.text[17:-2]  # extracts the link, like 'https://my_provide_name.com/csv_exports/5704d5.csv.gz'
urlrtv = urllib.request.urlretrieve(url=url, filename='C:\\Users\\UserName\\Downloads\\test4.csv.gz')
>> returns: HTTP Error 404 Not Found

# method 2 - works fine
url2 = 'https://my_provide_name.com/csv_exports/5704d5.csv.gz'
urlrtv = urllib.request.urlretrieve(url=url2, filename='C:\\Users\\UserName\\Downloads\\test4.csv.gz')
>> works fine
When I copy the URL from method 1 and put it in a browser, it works fine.
Edit:
To be more precise, I also tried getting the URL not via response.text[17:-2] but with json.loads to parse the URL from the response. I still got the error:
a = json.loads(response.text)
>>{'csv_file_url': 'https://service_name.com/csv_exports/746d6.csv.gz'}
url = a['csv_file_url']
print(url)
>>https://service_name.com/csv_exports/746d6.csv.gz
Solved: just add time.sleep(3) before downloading the file; apparently the exported file isn't available at the returned URL immediately.
url = response.json()['csv_file_url']
time.sleep(3)
urlrtv = urllib.request.urlretrieve(url=url, filename=f'{storage_path}{filename}')
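A fixed sleep can still lose the race if the export takes longer; a sketch of a more robust variant (assuming the same response, storage_path and filename as above) retries on HTTPError a bounded number of times:

import time
import urllib.request
from urllib.error import HTTPError

url = response.json()['csv_file_url']
for attempt in range(5):
    try:
        urllib.request.urlretrieve(url=url, filename=f'{storage_path}{filename}')
        break  # download succeeded
    except HTTPError:
        time.sleep(3)  # file probably not published yet; wait and retry
else:
    raise RuntimeError(f'File never became available: {url}')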
I'm trying to download images with Python, but only this one picture can't be downloaded, and I don't know the reason: when I run the code it just stops and nothing happens.
No image, no error code...
Here's the code. Please tell me the reason and a solution:
import urllib.request

num = 404

def down(URL):
    fullname = str(num) + ".jpg"  # note: needs the dot, or the file is saved as "404jpg"
    urllib.request.urlretrieve(URL, fullname)

im = "https://www.thesun.co.uk/wp-content/uploads/2020/09/67d4aff1-ddd0-4036-a111-3c87ddc0387e.jpg"
down(im)
This code will work for you; try changing the URL you use and see the result:
import requests

pic_url = "https://www.thesun.co.uk/wp-content/uploads/2020/09/67d4aff1-ddd0-4036-a111-3c87ddc0387e.jpg"
cookies = dict(BCPermissionLevel='PERSONAL')
with open('aa.jpg', 'wb') as handle:
    response = requests.get(pic_url, headers={"User-Agent": "Mozilla/5.0"}, cookies=cookies, stream=True)
    if not response.ok:
        print(response)
    for block in response.iter_content(1024):
        if not block:
            break
        handle.write(block)
What @MoetazBrayek says in their comment (but not their answer) is correct: the website you're querying is blocking the request.
It's common for sites to block requests based on user-agent or referer: if you try to curl https://www.thesun.co.uk/wp-content/uploads/2020/09/67d4aff1-ddd0-4036-a111-3c87ddc0387e.jpg you will get an HTTP error (403 Access Denied):
❯ curl -I https://www.thesun.co.uk/wp-content/uploads/2020/09/67d4aff1-ddd0-4036-a111-3c87ddc0387e.jpg
HTTP/2 403
Apparently The Sun wants a browser's user-agent, and specifically the string "mozilla" is enough to get through:
❯ curl -I -A mozilla https://www.thesun.co.uk/wp-content/uploads/2020/09/67d4aff1-ddd0-4036-a111-3c87ddc0387e.jpg
HTTP/2 200
You will have to either switch to the requests package or replace your url string with a proper urllib.request.Request object so you can customise more pieces of the request. And apparently urlretrieve does not support Request objects, so you will also have to use urlopen:
import shutil
import urllib.request

req = urllib.request.Request(URL, headers={'User-Agent': 'mozilla'})
res = urllib.request.urlopen(req)
assert res.status == 200
with open(filename, 'wb') as out:
    # stream the response body to disk without loading it all into memory
    shutil.copyfileobj(res, out)
I have a list of URLs for Digikey product pages. The goal is to open each URL, then scrape pricing info and create a BoM.
The challenge I am having is that after opening a few URLs, URLError starts occurring with 403 (Forbidden), even though I can open these same URLs in my Chrome browser on a Mac.
What could make the script go from opening each URL successfully to being forbidden from opening them? Thank you!
Here is the code:
from urllib.request import urlopen
from urllib.error import URLError  # URLError lives in urllib.error, not urllib.request
urls = ['https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=RC0805JR-071KL',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=08055C333KAT2A',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=B72660M0251K072',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=HI1206T500R-10',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=LVR005NK-2',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=RL1220S-120-F',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=RMCF0805JT330R',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=IND-LED',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=CHV1206-JW-224ELF',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=RAC03-3.3SGA',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=202R18W102KV4E',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=GRM32DR72H104KW10L',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=CRE1S0505S3C',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=SJ-3523-SMT-TR',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=ATM90E26-YU-RCT-ND',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=CL21F104ZBCNNNC',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=CL21A106KQCLRNC',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=535-9865-1-ND',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=c',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=CL21C180JBANNNC',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=BLM15AG100SN1D',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=RMCF0805JT51R0',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=SI8651BB-B-IS1']
#####################################
for url in urls:
    print(url)
    try:
        with urlopen(url) as response:
            html = response.read()
        print(html)
        print("DONE WITH THIS URL.")
    except URLError as e:
        print(e.reason)
Thanks to the comments: indeed, Digikey was assuming my code was a bot. The workaround included:
- not using /scripts in the URL
- randomly selecting a different user agent on an HTTP 403 (see the sketch below)
Thank you.
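A sketch of that retry-with-a-different-user-agent idea, assuming a small hand-picked list of browser User-Agent strings (the strings and the fetch helper below are illustrative, not Digikey-specific):

import random
from urllib.request import Request, urlopen
from urllib.error import HTTPError

# A few example browser User-Agent strings; substitute whichever you like.
USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
]

def fetch(url, attempts=3):
    for _ in range(attempts):
        req = Request(url, headers={'User-Agent': random.choice(USER_AGENTS)})
        try:
            with urlopen(req) as response:
                return response.read()
        except HTTPError as e:
            if e.code != 403:
                raise  # only retry on 403; re-raise anything else
    raise RuntimeError(f'Still forbidden after {attempts} attempts: {url}')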
I would like to get an HTML page and read its content. I use requests (Python) and my code is very simple:
import requests
url = "http://www.romatoday.it"
r = requests.get(url)
print(r.text)
When I try this procedure I always get:
Connection aborted.', error(110, 'Connection timed out')
If I open the URL in a browser everything works well, and if I use requests with other URLs all is OK.
I think it's some particularity of "http://www.romatoday.it" but I don't understand what the problem is. Can you help me please?
Maybe the problem is that the comma here
>> url = "http://www.romatoday,it"
should be a dot:
>> url = "http://www.romatoday.it"
I tried that and it worked for me.
Hmm... have you tried other packages instead of requests?
The code below gives the same result as your code:
import urllib.request

url = "http://www.romatoday.it"
r = urllib.request.urlopen(url)
print(r.read())
Here is a picture that I captured after running your code.
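If urllib hangs the same way, one thing worth testing (a guess, not confirmed for this site) is whether the server simply ignores clients without a browser User-Agent; an explicit timeout at least makes the failure fast:

import requests

url = "http://www.romatoday.it"
# A browser-like User-Agent; some servers hang or drop requests without one.
headers = {"User-Agent": "Mozilla/5.0"}
try:
    r = requests.get(url, headers=headers, timeout=10)
    print(r.status_code)
except requests.exceptions.RequestException as e:
    print("Request failed:", e)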
I'm using the requests module to retrieve content from the website kat.cr, and here is the code I used:
try:
    response = requests.get('http://kat.cr')
    response.raise_for_status()
except Exception as e:
    print(e)
else:
    return response.text
At first the code worked just fine and I could retrieve the website's source code, but then it stopped working and I keep receiving this message: "404 Client Error: Not Found for url: https://kat.cr"
I tried fixing this issue with a user agent, like this:
from fake_useragent import UserAgent

try:
    ua = UserAgent()
    ua.update()
    headers = {'User-Agent': ua.random}
    response = requests.get(url, headers=headers)
    response.raise_for_status()
except Exception as e:
    print(e)
else:
    return response.text
But this doesn't seem to work either.
Can you please help me fix this problem? Thanks.
I think that, as other users suggested, you may be IP-blocked.
Try a proxy; see Proxies with Python 'Requests' module.
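A minimal sketch of routing requests through a proxy (the proxy address below is a placeholder; substitute one you actually control or rent):

import requests

# Placeholder proxy address; replace with a real proxy.
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:3128',
}
response = requests.get('http://kat.cr', proxies=proxies, timeout=10)
print(response.status_code)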