I'm using the requests module to retrieve content from the website kat.cr, and here is the code I used:
import requests

# The snippet returns, so it lives inside a function:
def get_page():
    try:
        response = requests.get('http://kat.cr')
        response.raise_for_status()
    except Exception as e:
        print(e)
    else:
        return response.text
At first the code worked just fine and I could retrieve the website's source code, but then it stopped working and I keep receiving this message: "404 Client Error: Not Found for url: https://kat.cr"
I tried fixing this issue with user-agent like this:
import requests
from fake_useragent import UserAgent

def get_page(url='http://kat.cr'):
    try:
        ua = UserAgent()
        ua.update()
        headers = {'User-Agent': ua.random}
        response = requests.get(url, headers=headers)
        response.raise_for_status()
    except Exception as e:
        print(e)
    else:
        return response.text
But this doesn't seem to work either.
Can you please help me fix this problem? Thanks.
I think that, as users suggested, you may be IP-blocked.
Try a proxy; see Proxies with Python 'Requests' module.
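A minimal sketch of routing the request through a proxy (the proxy addresses below are placeholders, not working servers):

import requests

# Placeholder proxy addresses; substitute proxies you actually have access to.
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

response = requests.get('http://kat.cr', proxies=proxies, timeout=10)
print(response.status_code)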
I am using Python to get HTML data from multiple pages at a URL. I found that urllib throws an exception when a URL does not exist. How do I retrieve the HTML of that custom 404 error page (the page that says something like "Page is not found")?
Current code:
from urllib.request import Request, urlopen

# inside a loop over URLs
try:
    req = Request(URL, headers={'User-Agent': 'Mozilla/5.0'})
    client = urlopen(req)
    # downloading html data
    page_html = client.read()
    # closing connection
    client.close()
except:
    print("The following URL was not found. Program terminated.\n" + URL)
    break
Have you tried the requests library?
Just install the library with pip:
pip install requests
And use it like this:
import requests
response = requests.get('https://stackoverflow.com/nonexistent_path')
print(response.status_code) # 404
print(response.text) # Prints the raw HTML response
To preserve the comment that also answers the question (and because a way to do this without going outside urllib is what I was looking for):
"See HTTPError. It has a .read() method which returns the response content." – t.m.adam, Nov 4, 2018 at 10:07
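A minimal sketch of that urllib-only approach (the URL is just an illustrative non-existent page):

from urllib.request import Request, urlopen
from urllib.error import HTTPError

url = 'https://stackoverflow.com/nonexistent_path'  # illustrative URL

try:
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    page_html = urlopen(req).read()
except HTTPError as e:
    # HTTPError is itself a response object: .read() returns the body
    # of the server's custom error page.
    print(e.code)          # e.g. 404
    page_html = e.read()

print(page_html[:200])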
I am trying to run a REST request on Windows 7, but it is not executed by the Python code below.
The code works on Ubuntu, but not on Windows 7:
import json
import requests

def get_load_names(url='http://<ip>:5000/loads_list'):
    response = requests.get(url)
    if response.status_code == 200:
        jData = json.loads(response.content)
        print(jData)
    else:
        print('error', response)
Also, if I paste the URL into a browser, I see the request output, so I assume it has something to do with the firewall.
I created rules to open port 5000 for inbound and outbound traffic, but no luck so far.
Unless you have a very specific reason for writing your own error handling, you should use the built-in raise_for_status():
import requests
import json
response = requests.get('http://<ip>:5000/loads_list')
response.raise_for_status()
jData = json.loads(response.text)
print(jData)
This will hopefully raise an informative error message that you can deal with.
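If you would rather handle the failure than let it propagate, raise_for_status() raises requests.exceptions.HTTPError, so a minimal sketch (keeping the question's <ip> placeholder) would be:

import requests

try:
    response = requests.get('http://<ip>:5000/loads_list', timeout=5)  # <ip> placeholder from the question
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    # Covers HTTP error statuses as well as connection failures,
    # which is what a firewall problem would surface as.
    print('Request failed:', e)
else:
    print(response.json())  # requests can decode the JSON body directly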
I have a list of URLs for Digikey product pages. The goal is to open each URL then scrape pricing info and create a BoM.
The challenge I am having is that after opening a few URLs, URLError starts occurring with 403 (Forbidden), even though I can open the same URLs in my Chrome browser (on Mac).
What could cause the script to go from opening each URL successfully to being forbidden from opening any of them? Thank you!
Here is the code:
from urllib.request import urlopen, Request
from urllib.error import URLError
urls = ['https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=RC0805JR-071KL',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=08055C333KAT2A',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=B72660M0251K072',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=HI1206T500R-10',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=LVR005NK-2',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=RL1220S-120-F',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=RMCF0805JT330R',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=IND-LED',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=CHV1206-JW-224ELF',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=RAC03-3.3SGA',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=202R18W102KV4E',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=GRM32DR72H104KW10L',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=CRE1S0505S3C',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=SJ-3523-SMT-TR',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=ATM90E26-YU-RCT-ND',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=CL21F104ZBCNNNC',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=CL21A106KQCLRNC',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=535-9865-1-ND',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=c',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=CL21C180JBANNNC',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=BLM15AG100SN1D',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=RMCF0805JT51R0',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=SI8651BB-B-IS1']
#####################################
for url in urls:
    print(url)
    try:
        with urlopen(url) as response:
            html = response.read()
        print(html)
        print("DONE WITH THIS URL.")
    except URLError as e:
        print(e.reason)
Thanks to the comments: indeed, Digikey was assuming my code was a bot. The "workaround" included:
- not using scripts in the URL
- randomly selecting a different user agent when I get an HTTP 403 (see the sketch below)
Thank you.
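A minimal sketch of that retry-with-a-random-user-agent idea (the user-agent strings and retry count are illustrative choices, not part of the original workaround):

import random
from urllib.request import Request, urlopen
from urllib.error import HTTPError

# Illustrative User-Agent strings; any set of common browser strings works.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]

def fetch(url, retries=3):
    # Fetch a URL, retrying with a different random User-Agent on HTTP 403.
    for _ in range(retries):
        req = Request(url, headers={'User-Agent': random.choice(USER_AGENTS)})
        try:
            with urlopen(req) as response:
                return response.read()
        except HTTPError as e:
            if e.code != 403:
                raise  # only retry on 403 Forbidden
    return None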
I am trying to get an HTTP response from a website using the requests module. I get status code 410 in my response:
<Response [410]>
From the documentation, it appears that the forwarding URL for the web content may be intentionally unavailable to clients. Is this indeed the case, or am I missing something? I'm trying to confirm whether the webpage can be scraped at all:
import requests

url = 'http://www.b2i.us/profiles/investor/ResLibraryView.asp?ResLibraryID=81517&GoTopage=3&Category=1836&BzID=1690&G=666'
try:
    response = requests.get(url)
except requests.exceptions.RequestException as e:
    print(e)
Some websites don't respond well to HTTP requests with 'python-requests' as the User-Agent string.
You can get a 200 OK response if you set the User-Agent header to 'Mozilla':

import requests

url = 'http://www.b2i.us/profiles/investor/ResLibraryView.asp?ResLibraryID=81517&GoTopage=3&Category=1836&BzID=1690&G=666'
headers = {'User-Agent': 'Mozilla/5'}
response = requests.get(url, headers=headers)
print(response)  # <Response [200]>
This works on Mac OS X, but I am having issues with the same approach on Windows, in a VMware virtual machine I run automated tasks from. Why would the behavior be different? Is there a separate workaround for Windows machines?
I am new to this, so please help me. I am using urllib.request to open and read webpages. Can someone tell me how my code can handle redirects, timeouts, and badly formed URLs?
I have sort of found a way for timeouts, though I am not sure if it is correct. Is it? All opinions are welcome! Here it is:
import logging
import urllib.request
from socket import timeout
from urllib.error import HTTPError, URLError

try:
    text = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
except (HTTPError, URLError) as error:
    logging.error('Data of %s not retrieved because %s\nURL: %s', name, error, url)
except timeout:
    logging.error('socket timed out - URL %s', url)
Please help me as I am new to this. Thanks!
Take a look at the urllib error page.
So for the following behaviours:
Redirect: HTTP code 302, so that's an HTTPError with a code. You could also use an HTTPRedirectHandler instead of failing.
Timeouts: you have that correct.
Badly formed URLs: that's a URLError.
Here's the code I would use:
import urllib.error
import urllib.request
from socket import timeout

try:
    text = urllib.request.urlopen("http://www.google.com", timeout=0.1).read()
except urllib.error.HTTPError as error:
    print(error)
except urllib.error.URLError as error:
    print(error)
except timeout as error:
    print(error)
I couldn't find a redirecting URL, so I'm not exactly sure how to check whether the HTTPError is a redirect.
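One way to check, sketched under the assumption that a 3xx actually surfaces as an HTTPError (by default urlopen follows redirects transparently, so this may never trigger):

import urllib.error
import urllib.request

try:
    urllib.request.urlopen("http://www.google.com")
except urllib.error.HTTPError as error:
    if 300 <= error.code < 400:
        # Redirect statuses carry the target in the Location header.
        print('Redirect:', error.code, error.headers.get('Location'))
    else:
        print('Other HTTP error:', error.code)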
You might find the requests package is a bit easier to use (it's suggested on the urllib page).
Using the requests package I was able to find a better solution. The only exceptions you need to handle are:
import requests

try:
    r = requests.get(url, timeout=5)
except requests.exceptions.Timeout:
    # Maybe set up for a retry, or continue in a retry loop
    pass
except requests.exceptions.TooManyRedirects:
    # Tell the user their URL was bad and try a different one
    pass
except requests.exceptions.ConnectionError:
    # Connection could not be completed
    pass
except requests.exceptions.RequestException as e:
    # Catastrophic error. Bail.
    raise
And to get the text of that page, all you need to do is:
r.text
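For the timeout case specifically, the "retry loop" mentioned in the comments above might look like this minimal sketch (the retry count and one-second pause are arbitrary choices):

import time
import requests

def get_with_retries(url, retries=3, timeout=5):
    # Retry a GET on timeout, pausing briefly between attempts.
    for attempt in range(retries):
        try:
            return requests.get(url, timeout=timeout)
        except requests.exceptions.Timeout:
            time.sleep(1)  # arbitrary pause before retrying
    raise requests.exceptions.Timeout('%s timed out after %d attempts' % (url, retries))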