I have a list of URLs for Digikey product pages. The goal is to open each URL, scrape the pricing info, and create a BoM.
The challenge I am having is that after opening a few URLs, URLError starts occurring with 403 (Forbidden), even though I can open the same URLs in my Chrome browser on a Mac.
What could cause the script to go from opening each URL successfully to being forbidden from opening them? Thank you!
Here is the code:
from urllib.request import urlopen, Request
from urllib.error import URLError
urls = ['https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=RC0805JR-071KL',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=08055C333KAT2A',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=B72660M0251K072',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=HI1206T500R-10',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=LVR005NK-2',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=RL1220S-120-F',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=RMCF0805JT330R',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=IND-LED',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=CHV1206-JW-224ELF',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=RAC03-3.3SGA',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=202R18W102KV4E',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=GRM32DR72H104KW10L',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=CRE1S0505S3C',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=SJ-3523-SMT-TR',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=ATM90E26-YU-RCT-ND',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=CL21F104ZBCNNNC',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=CL21A106KQCLRNC',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=535-9865-1-ND',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=c',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=CL21C180JBANNNC',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=BLM15AG100SN1D',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=RMCF0805JT51R0',
'https://www.digikey.com/scripts/DkSearch/dksus.dll?WT.z_header=search_go&lang=en&keywords=SI8651BB-B-IS1']
#####################################
for url in urls:
    print(url)
    try:
        with urlopen(url) as response:
            html = response.read()
        print(html)
        print("DONE WITH THIS URL.")
    except URLError as e:
        print(e.reason)
Thanks to the comments, indeed Digikey was assuming my code was a bot. The workaround included:
not using "scripts" in the URL
randomly selecting a different user agent if I get an HTTP 403 (a sketch follows below).
Thank you.
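A minimal sketch of the user-agent rotation, assuming the standard-library urllib; the agent strings, the fetch helper, and the retry count are illustrative placeholders, not the exact values used:
import random
from urllib.request import urlopen, Request
from urllib.error import HTTPError

# Hypothetical pool of user-agent strings to rotate through
USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
]

def fetch(url, retries=3):
    for _ in range(retries):
        req = Request(url, headers={'User-Agent': random.choice(USER_AGENTS)})
        try:
            with urlopen(req) as response:
                return response.read()
        except HTTPError as e:
            if e.code != 403:
                raise
            # got a 403: loop again with a different user agent
    return None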
Related
I am using Python to get HTML data from multiple pages at a URL. I found that urllib throws an exception when a URL does not exist. How do I retrieve the HTML of the custom 404 error page (the page that says something like "Page is not found")?
Current code:
from urllib.request import urlopen, Request

try:
    req = Request(URL, headers={'User-Agent': 'Mozilla/5.0'})
    client = urlopen(req)
    # downloading html data
    page_html = client.read()
    # closing connection
    client.close()
except:
    print("The following URL was not found. Program terminated.\n" + URL)
    break  # this runs inside a loop over URLs
Have you tried the requests library?
Just install the library with pip:
pip install requests
And use it like this:
import requests
response = requests.get('https://stackoverflow.com/nonexistent_path')
print(response.status_code) # 404
print(response.text) # Prints the raw HTML response
To preserve the comment that also answers the question, and because it is what I was looking for, here is a way to do this without going outside urllib. By t.m.adam, Nov 4, 2018:
See HTTPError. It has a .read() method which returns the response content.
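A minimal sketch of that approach, reusing the nonexistent Stack Overflow path from the requests example above:
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://stackoverflow.com/nonexistent_path')
    page_html = response.read()
except urllib.error.HTTPError as e:
    print(e.code)         # 404
    page_html = e.read()  # the HTTPError doubles as a response object, so
                          # .read() returns the server's custom error page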
I have a URL, and as soon as I open it, it redirects me to another webpage. I want to get that redirected URL in my code with urllib2.
Sample code:
import urllib2

link = 'http://mywebpage.com'
html = urllib2.urlopen(link).read()
Any help is much appreciated
Use the requests library; by default, Requests performs location redirection for all verbs except HEAD.
r = requests.get('https://mywebpage.com')
or turn off redirects:
r = requests.get('https://mywebpage.com', allow_redirects=False)
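Since the question asks for the redirected URL itself: with requests, the final URL after any redirects is available on the response object, for example:
r = requests.get('https://mywebpage.com')
print(r.url)      # the final URL after redirects
print(r.history)  # the chain of intermediate redirect responses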
I am using the Requests library with Python 2.7.
I am trying to download certain webpages through proxy servers. I have a list of available proxy servers, but not all of them work as desired. Some proxies require authentication, others redirect to advertisement pages, etc. In order to detect/verify incorrect responses, I have included two checks in my request code. It looks similar to this:
import requests

proxy = '37.228.111.137:80'
url = 'http://www.google.ca/'
response = requests.get(url, proxies={'http': 'http://%s' % proxy})
if response.url != url or response.status_code != 200:
    print 'incorrect response'
else:
    print 'response correct'
    print response.text
There are some proxy servers for which the requests.get call succeeds and both conditions pass, yet response.text still contains invalid HTML. However, if I use the same proxy in my Firefox browser and open the same webpage, I am shown an invalid webpage, while my Python script says the response is valid.
Can someone point out what other checks I am missing to weed out incorrect HTML results?
or
How can I verify that the webpage I received is the one I intended?
Regards.
What is an "invalid webpage" when displayed by your browser? The server can return an HTTP status code of 200, but the content can still be an error message. You understand it to be an error message because you can comprehend it; a browser or code cannot.
If you have any knowledge about the content of the target page, you could check whether the returned HTML contains that content and accept it on that basis.
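A hedged sketch of that check, based on the question's own snippet; the marker string is a hypothetical phrase assumed to appear on the genuine page:
import requests

EXPECTED_MARKER = 'Google'  # assumption: text known to be on the real page

proxy = '37.228.111.137:80'
url = 'http://www.google.ca/'
response = requests.get(url, proxies={'http': 'http://%s' % proxy})
if response.status_code == 200 and EXPECTED_MARKER in response.text:
    print 'response correct'
else:
    print 'incorrect response'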
I read that when I get this error I should specify the URL more precisely. I assume that means choosing between two displayed or accessible options. How can I do that?
I couldn't find anything about it in urllib or its tutorial. Is my assumption true? Where can I read the possible URLs?
When I open this URL in my browser, I am redirected to a new URL.
The URL I try to access: http://www.uniprot.org/uniprot/P08198_CSG_HALHA.fasta
The new URL I am redirected to: http://www.uniprot.org/uniprot/?query=replaces:P08198&format=fasta
import urllib.request
import urllib.error

url = 'http://www.uniprot.org/uniprot/P08198_CSG_HALHA.fasta'
try:
    response = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    if int(e.code) == 300:
        pass  # what now?
The status code 300 is returned by the server to tell you that your request is somehow not complete and you should be more specific.
Testing the URL, I searched from http://www.uniprot.org/ and entered "P08198" into the search box. This led to the page http://www.uniprot.org/uniprot/P08198, which tells me
Demerged into Q9HM69, B0R8E4 and P0DME1. [ List ]
To me it seems the query for this protein is not specific enough, as the protein code was split into the subcategories or subcodes Q9HM69, B0R8E4 and P0DME1.
Conclusion
Status code 300 is a signal from the server application that your request is somehow ambiguous. The way you can make it specific enough is application specific and has nothing to do with Python or HTTP status codes; you have to find the details of a good URL in the application's logic.
So I ran into this issue and wanted to get the actual content returned.
It turns out that this is the solution to my problem:
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    if int(e.code) == 300:
        # the HTTPError itself behaves like a response: read its body
        response = e.read()
Using Python 2.6.6 on CentOS 6.4
import urllib
#url = 'http://www.google.com.hk' #ok
#url = 'http://clients1.google.com.hk' #ok
#url = 'http://clients1.google.com.hk/complete/search' #ok (blank)
url = 'http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc' #fails
print url
page = urllib.urlopen(url).read()
print page
Using the first 3 URLs, the code works. But with the 4th URL, Python returns the following 302 page:
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
here.
</BODY></HTML>
The URL in my code is the same as the URL it tells me to use:
My URL: http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc
Its URL: http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc
Google says URL moved, but the URLs are the same. Any ideas why?
Update: The URLs all work fine in a browser, but from the Python command line the 4th URL gives a 302.
urllib ignores the cookies and sends the new request without them, which causes a redirect loop at that URL. To handle this you can use urllib2 (which is more up to date) and add a cookie handler:
import urllib2
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
response = opener.open('http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc')
print response.read()
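For what it's worth, HTTPCookieProcessor() with no argument creates its own cookielib.CookieJar, so the Set-Cookie value from the 302 response is stored and replayed on the redirected request, which is what breaks the loop.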
It most likely has to do with the headers and perhaps cookies. I did a quick test on the command line using curl. It also gives me the 302 Moved. The Location header it provides is different from the URL in the document body. If I follow the body URL I get a 204 response (weird). If I follow the Location header I end up in the circular response you describe.
Perhaps important is the Set-Cookie header. The server may keep redirecting until it gets an appropriate cookie set. It may also be scanning the User-Agent and behaving differently based on that. Those are the big aspects that differentiate a browser from a tool like requests or urllib: the browser creates sessions, stores cookies, and sends different headers. You can inspect that first response yourself, as in the sketch below.
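A small sketch for doing that inspection, assuming the requests library (the question itself uses urllib); this mirrors what curl shows on the command line:
import requests

url = 'http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc'
r = requests.get(url, allow_redirects=False)  # stop at the first response
print(r.status_code)                # 302
print(r.headers.get('Location'))    # where the server wants to send you
print(r.headers.get('Set-Cookie'))  # the cookie it expects on the retry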
I don't know why urllib fails (I get the same response); however, the requests library works perfectly:
import requests
url = 'http://clients1.google.com.hk/complete/search?output=toolbar&hl=zh-CN&q=abc' # fails
print(requests.get(url).text)
If you use your favorite web debugger (Fiddler for me) and open up that URL in your browser, you'll see that you also get that initial 302 response. Your browser is just smart enough to redirect you automatically. So your code is returning the correct response. If you want your code to redirect to the new URL automatically, then you have to make your code smart enough to do so.