I'm writing a simple Python 3 script to retrieve HTML data. Here's my test script:
import urllib.request
url="http://techxplore.com/news/2015-05-audi-r8-e-tron-aims-high.html"
req = urllib.request.Request(
url,
data=None,
headers={
'User-agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11',
'Referer': 'http://www.google.com'
}
)
f = urllib.request.urlopen(req)
This works fine for most websites but returns the following error for certain ones:
urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
The URL shown in the script is one of the sites that returns this error. Based on research from other posts and sites, it seems like manually setting the user-agent and/or the referer should solve the problem, but this script still times out. I'm not sure why this happens only for certain websites, and I don't know what else to try. I would appreciate any suggestions the community could offer.
I tried the script again today without changing anything, and it worked perfectly. Looks like it was just something strange going on with the remote web server.
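For anyone hitting the same intermittent [Errno 110] error, here is a minimal sketch of passing an explicit timeout to urlopen and retrying a couple of times; the timeout, retry count, and sleep values are arbitrary choices, not anything the server documents:

import time
import urllib.error
import urllib.request

url = "http://techxplore.com/news/2015-05-audi-r8-e-tron-aims-high.html"
req = urllib.request.Request(
    url,
    headers={
        'User-agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11',
        'Referer': 'http://www.google.com'
    }
)

html = None
for attempt in range(3):  # give the flaky server a few chances
    try:
        with urllib.request.urlopen(req, timeout=30) as f:
            html = f.read()
        break
    except urllib.error.URLError as err:
        print("attempt", attempt + 1, "failed:", err)
        time.sleep(5)  # brief pause before retrying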
Related
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'cache-control': 'private, max-age=0, no-cache'
}

# Fetch the shelf page and parse it
htmlFile = requests.get('https://www.goodreads.com/shelf/show/1', headers=headers).text
sou = BeautifulSoup(htmlFile, "html.parser")

# Collect every book title inside the left-hand container
titles = sou.find(class_="leftContainer").findAll(class_="bookTitle")
print(titles)
The content is not dynamic, so there is no need for JS rendering or anything like that.
So why does it sometimes return None?
A while loop that retries again and again can work around the problem, but that's not really a solution.
Error
After reproducing the issue, I deduced that the server returns a different response after 100-200 requests. The difference between a standard, parseable response and one that lacks the data to scrape is that the latter is an occasional error page returned with status code 504. That code essentially means the server timed out or errored, so a default HTML page without the books is returned, and the code then fails.
Snippet from Test Case Response from Error:
<h1>
Goodreads request took too long.
</h1>
<p>
The latest request to the Goodreads servers took too long to respond. We have been notified of the issue and are looking into it.
</p>
Solution
This can be due to chance or to a limit imposed to keep scrapers from using too many server resources, but it is most likely just an ordinary server-side error and is unavoidable. Simply wrap the request in a try/except (or check for the bad response) and retry it!
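A minimal sketch of that retry idea, assuming the page structure from the question and picking arbitrary retry/sleep values:

import time
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'cache-control': 'private, max-age=0, no-cache'
}

def fetch_titles(url, attempts=5):
    for _ in range(attempts):
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, "html.parser")
        container = soup.find(class_="leftContainer")
        # The occasional 504 error page has no leftContainer, so retry after a short pause.
        if response.status_code == 200 and container is not None:
            return container.findAll(class_="bookTitle")
        time.sleep(2)
    return None

print(fetch_titles('https://www.goodreads.com/shelf/show/1'))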
I am having a problem accessing a URL from Ruby, even though it works with Python's requests library.
Here is what I am doing: I want to access this link https://www.nseindia.com/get-quotes/derivatives?symbol=SBIN, start a session with it, and then hit https://www.nseindia.com/api/option-chain-equities?symbol=SBIN' within the same session. This answer really helped me a lot, but I need to do this in Ruby. I have tried rest-client, net/http, httparty, and httpclient; even when I simply do this
require 'rest-client'
request = RestClient.get 'https://www.nseindia.com/get-quotes/derivatives?symbol=SBIN'
It hangs indefinitely with no response. I tried the same thing with headers too, but still no response.
Any help would be appreciated.
Thanks.
Are you able to confirm that RestClient is working for other urls, such as google.com?
require 'rest-client'
RestClient.get "https://www.google.com"
For what it's worth, I was able to make a successful GET request to Google through RestClient, but not with the URL you provided. However, I was able to get a response by specifying a User-Agent in the headers:
require 'rest-client'
RestClient.get "https://www.nseindia.com/api/option-chain-equities?symbol=SBIN%27"
=> Hangs...
RestClient.get "https://www.nseindia.com/api/option-chain-equities?symbol=SBIN%27", {"User-Agent": "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)"}
=> RestClient::Unauthorized: 401 Unauthorized
I assume there is some authentication required if you want to get any useful data from the api.
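For reference, the session-based flow the question says already works in Python's requests looks roughly like this (a sketch only; I've dropped the stray trailing quote from the API URL, and the exact headers NSE expects may change over time):

import requests

headers = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)'}

with requests.Session() as session:
    # Visit the quote page first so the session picks up whatever cookies NSE sets.
    session.get('https://www.nseindia.com/get-quotes/derivatives?symbol=SBIN', headers=headers)
    # Then call the API endpoint within the same session.
    data = session.get('https://www.nseindia.com/api/option-chain-equities?symbol=SBIN', headers=headers).json()
    print(data)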
I am using the Python requests get method to query the MediaWiki API, but it takes a long time to receive a response. The same requests get a response very quickly through a web browser. I have the same issue requesting google.com. Here is the sample code I am trying with Python 3.5 on Windows 10:
response = requests.get("https://www.google.com")
response = requests.get("https://en.wikipedia.org/wiki/Main_Page")
response = requests.get("http://en.wikipedia.org/w/api.php?", params={'action':'query', 'format':'json', 'titles':'Labor_mobility'})
However, I don't face this issue retrieving other websites like:
response = requests.get("http://www.stackoverflow.com")
response = requests.get("https://www.python.org/")
This sounds like there is an issue with the underlying connection to the server, because requests to other URLs work. These come to mind:
The server might only allow specific user-agent strings
Try adding innocuous headers, e.g.: requests.get("https://www.example.com", headers={"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"})
The server rate-limits you
Wait for a few minutes, then try again. If this solves your issue, you could slow down your code by adding time.sleep() to prevent being rate-limited again (see the sketch below).
IPv6 does not work, but IPv4 does
Verify by executing curl --ipv6 -v https://www.example.com. Then, compare to curl --ipv4 -v https://www.example.com. If the latter is significantly faster, you might have a problem with your IPv6 connection. Check here for possible solutions.
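A rough sketch combining the header and rate-limit suggestions above (the URL, timings, and retry count are just placeholders):

import time
import requests

headers = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}

for attempt in range(3):
    try:
        response = requests.get("https://en.wikipedia.org/wiki/Main_Page", headers=headers, timeout=10)
        print(response.status_code, len(response.text))
        break
    except requests.exceptions.RequestException as err:
        print("attempt", attempt + 1, "failed:", err)
        time.sleep(2 ** attempt)  # simple exponential backoff between retries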
Didn't solve your issue?
If that did not solve your issue, I have collected some other possible solutions here.
I am trying to download a file with urllib. I am using a direct link to this rar (if I use Chrome on this link, it immediately starts downloading the rar file), but when I run the following code:
import urllib

# url holds the direct link to the .rar file mentioned at the end of the post
file_name = url.split('/')[-1]
u = urllib.urlretrieve(url, file_name)
... all I get back is a 22 KB rar file, which is obviously wrong. What is going on here? I'm on OS X Mavericks with Python 2.7.5, and here is the url.
(Disclaimer: this is a free download, as seen on the band's website.)
Got it. The headers were lacking a lot of information. I resorted to using Requests, and with each GET request I added the following content to the header:
'Connection': 'keep-alive'
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36'
'Cookie': 'JSESSIONID=36DAD704C8E6A4EF4B13BCAA56217961; ziplocale=en; zippop=2;'
However, I noticed that not all of this is necessary (the Cookie is really all you need), but it did the trick - I was able to download the entire file. If using urllib2, I am sure that doing the same (sending requests with the appropriate header content) would also work. Thank you all for the good tips and for pointing me in the right direction. I used Fiddler to see what my Requests GET header was missing in comparison to Chrome's GET header. If you have a similar issue, I suggest you check it out.
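Pieced together from the header values above, the Requests call looked roughly like this (the URL is the one quoted in the answer below, and the Cookie value is session-specific, so you would need to capture your own, e.g. with Fiddler):

import requests

url = "http://www29.zippyshare.com/d/12069311/2695/Del%20Paxton-Worst.%20Summer.%20Ever%20EP%20%282013%29.rar"
headers = {
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36',
    # The Cookie value below is tied to one session; replace it with your own captured value.
    'Cookie': 'JSESSIONID=36DAD704C8E6A4EF4B13BCAA56217961; ziplocale=en; zippop=2;'
}

response = requests.get(url, headers=headers)
with open(url.split('/')[-1], 'wb') as out_file:
    out_file.write(response.content)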
I tried this with Python using the following code, which replaces urllib with urllib2:
url = "http://www29.zippyshare.com/d/12069311/2695/Del%20Paxton-Worst.%20Summer.%20Ever%20EP%20%282013%29.rar"
import urllib2
file_name = url.split('/')[-1]
response = urllib2.urlopen(url)
data = response.read()
with open(file_name, 'wb') as bin_writer:
bin_writer.write(data)
and I get the same 22k file. Trying it with wget on that URL yields the same file; however, I was able to begin the download of the full file (around 35 MB, as I recall) by pasting the URL into the Chrome navigation bar. Perhaps they are serving different files based upon the headers you send in your request? The User-Agent in the GET request from Python/wget is going to look different to their server (i.e. not like a browser) than the one from your browser when you click on the button.
I did not open the .rar archives to inspect the two files.
This thread discusses setting headers with urllib2, and this is the Python documentation on how to read the response status code from your urllib2 request, which could be helpful as well.
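Along the lines of that thread, a hedged sketch of sending a browser-like User-Agent with urllib2 (untested against this particular server, and per the accepted answer the Cookie header may matter more):

import urllib2

url = "http://www29.zippyshare.com/d/12069311/2695/Del%20Paxton-Worst.%20Summer.%20Ever%20EP%20%282013%29.rar"

# Send a browser-like User-Agent so the server (hopefully) serves the real file
# instead of the small placeholder page.
request = urllib2.Request(url, headers={
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36'
})
response = urllib2.urlopen(request)
print response.getcode()  # check the status code before trusting the body

with open(url.split('/')[-1], 'wb') as bin_writer:
    bin_writer.write(response.read())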
One of my scripts runs perfectly on an XP system, but the exact same script hangs on a 2003 system. I always use mechanize to send the HTTP request; here's an example:
import socket, mechanize, urllib, urllib2
socket.setdefaulttimeout(60) #### No idea why it's not working
MechBrowser = mechanize.Browser()
Header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)', 'Referer': 'http://www.porn-w.org/ucp.php?mode=login'}
Request = urllib2.Request("http://google.com", None, Header)
Response = MechBrowser.open(Request)
I don't think there's anything wrong with my code, but each time it comes to a certain HTTP POST request to a specific URL, it hangs on that 2003 computer (only on that URL). What could be the reason for all this, and how should I debug it?
By the way, the script was running fine until several hours ago, and no settings have been changed.
You could use Fiddler or Wireshark to see what is happening at the HTTP level.
It is also worth checking whether the machine has been blocked from making requests to the machine you are trying to access. Try a regular browser (with your own HTML form) and the HTTP library used by mechanize, and see if you can manually construct a request. Fiddler can also help you do this.
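As a first debugging step along those lines, a sketch of bypassing mechanize and issuing the same request with plain urllib2 plus an explicit timeout, so a hang turns into a visible exception (the 60-second value is just an example):

import urllib2

header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)',
          'Referer': 'http://www.porn-w.org/ucp.php?mode=login'}
request = urllib2.Request("http://google.com", None, header)

try:
    # timeout is in seconds; a hang should now surface as URLError instead of blocking forever
    response = urllib2.urlopen(request, timeout=60)
    print response.getcode()
except urllib2.URLError as err:
    print "request failed: %s" % err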