One of my scripts runs perfectly on an XP system, but the exact same script hangs on a 2003 system. I always use mechanize to send the HTTP request; here's an example:
import socket, mechanize, urllib, urllib2
socket.setdefaulttimeout(60) #### No idea why it's not working
MechBrowser = mechanize.Browser()
Header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)', 'Referer': 'http://www.porn-w.org/ucp.php?mode=login'}
Request = urllib2.Request("http://google.com", None, Header)
Response = MechBrowser.open(Request)
I don't think there's anything wrong with my code, but every time it comes to a certain HTTP POST request to a specific URL, it hangs on that 2003 machine (only for that URL). What could be the reason for this, and how should I debug it?
By the way, the script ran fine until several hours ago, and no settings were changed.
You could use Fiddler or Wireshark to see what is happening at the HTTP level.
It is also worth checking whether the machine has been blocked from making requests to the host you are trying to access. Try to construct the request manually with a regular browser (using your own HTML form) and with the HTTP library that mechanize uses, and see whether either succeeds. Fiddler can also help you do this.
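If installing Fiddler or Wireshark isn't an option, you can get a rough HTTP-level trace from Python itself. A minimal sketch, shown with Python 3's urllib.request (where urllib2 ended up): build an opener whose handlers have debuglevel=1 so the raw request and response headers are printed to stdout.

```python
import urllib.request

# Handlers with debuglevel=1 print the raw request and response
# headers to stdout, which is often enough to spot where a hang occurs.
debug_http = urllib.request.HTTPHandler(debuglevel=1)
debug_https = urllib.request.HTTPSHandler(debuglevel=1)
opener = urllib.request.build_opener(debug_http, debug_https)

# opener.open("http://example.com") would now trace the full exchange.
```

Combined with socket.setdefaulttimeout, this should at least tell you whether it is the connection, the request, or the response that stalls.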
Related
I am having a problem accessing a URL from Ruby, although it works with Python's requests library.
Here is what I am doing: I want to access https://www.nseindia.com/get-quotes/derivatives?symbol=SBIN to start a session, and then hit https://www.nseindia.com/api/option-chain-equities?symbol=SBIN' in the same session. This answer really helped me a lot, but I need to do this in Ruby. I have tried rest-client, net/http, httparty, and httpclient. Even when I am simply doing this:
require 'rest-client'
request = RestClient.get 'https://www.nseindia.com/get-quotes/derivatives?symbol=SBIN'
it hangs indefinitely with no response. I tried the same thing with headers too, but it still hangs.
Any help would be appreciated.
Thanks.
Are you able to confirm that RestClient works for other URLs, such as google.com?
require 'rest-client'
RestClient.get "https://www.google.com"
For what it's worth, I was able to make a successful GET request to Google through RestClient, but not with the URL you provided. However, I was able to get a response by specifying a User-Agent header:
require 'rest-client'
RestClient.get "https://www.nseindia.com/api/option-chain-equities?symbol=SBIN%27"
=> Hangs...
RestClient.get "https://www.nseindia.com/api/option-chain-equities?symbol=SBIN%27", {"User-Agent": "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)"}
=> RestClient::Unauthorized: 401 Unauthorized
I assume there is some authentication required if you want to get any useful data from the api.
I am trying to do an automated task via python through the mechanize module:
Enter the keyword in a web form, submit the form.
Look for a specific element in the response.
This works once. But when I repeat the task for a list of keywords, I get HTTP Error 429 (Too Many Requests).
I tried the following to work around this:
Adding custom headers (I captured them for that very website using a proxy) so that the request looks like a legitimate browser request:
br = mechanize.Browser()
# addheaders must be assigned once; repeated assignments would each
# overwrite the previous list, leaving only the last header in place.
br.addheaders = [
    ('User-agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'),
    ('Connection', 'keep-alive'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'),
    ('Upgrade-Insecure-Requests', '1'),
    ('Accept-Encoding', 'gzip, deflate, sdch'),
    ('Accept-Language', 'en-US,en;q=0.8'),
]
Since the blocked response came on every 5th request, I tried sleeping for 20 seconds after every 5 requests.
Neither of the two methods worked.
You need to limit the rate of your requests to conform to what the server's configuration permits. (Web Scraper: Limit to Requests Per Minute/Hour on Single Domain? may show the permitted rate.)
mechanize uses a heavily-patched version of urllib2 (Lib/site-packages/mechanize/_urllib2.py) for network operations, and its Browser class is a descendant of its _urllib2_fork.OpenerDirector.
So, the simplest way to patch in rate limiting seems to be adding a handler to your Browser object
with default_open and an appropriate handler_order to place it before all the others (lower means higher priority),
that stalls until the request is eligible, e.g. with a token bucket or leaky bucket algorithm as implemented in Throttling with urllib2 (note that a bucket should probably be per-domain or per-IP),
and that finally returns None to pass the request on to the following handlers.
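The bucket logic itself is independent of mechanize; here is a minimal token-bucket sketch (class and method names are my own) that such a default_open handler could call before returning None:

```python
import time

class TokenBucket:
    """Token bucket: allow `rate` requests per second on average,
    with bursts of up to `capacity` requests."""

    def __init__(self, rate, capacity):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum burst size
        self.tokens = float(capacity)    # start with a full bucket
        self.last = time.monotonic()

    def consume(self, tokens=1.0):
        """Block until `tokens` are available, then spend them."""
        while True:
            now = time.monotonic()
            # refill proportionally to elapsed time, capped at capacity
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= tokens:
                self.tokens -= tokens
                return
            # sleep just long enough for the deficit to refill
            time.sleep((tokens - self.tokens) / self.rate)
```

As noted above, a real handler would keep one such bucket per request host rather than a single global one.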
Since this is a common need, you should probably publish your implementation as an installable package.
I subclassed a CrawlSpider and want to extract data from a website.
However, I always get redirected to the site's mobile version. I tried changing
the USER_AGENT variable in Scrapy's settings to Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1, but I still get redirected.
Is there another way to identify as a different client and avoid the redirection?
There are two types of redirection supported in Scrapy:
RedirectMiddleware - Handle redirection of requests based on response status
MetaRefreshMiddleware - Handle redirection of requests based on meta-refresh html tag
So, maybe your HTML page uses the second type of redirection?
See also:
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#redirectmiddleware-settings
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#metarefreshmiddleware-settings
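To check the meta-refresh case, you can switch that middleware off in settings.py and see whether you still land on the mobile site. A sketch; the User-Agent value is only an example, but USER_AGENT and METAREFRESH_ENABLED are standard Scrapy settings:

```python
# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0'

# Disable MetaRefreshMiddleware to check whether a <meta http-equiv="refresh">
# tag is what sends you to the mobile version.
METAREFRESH_ENABLED = False
```

If the redirect stops when this is disabled, it is the meta-refresh kind; if it persists, it is a status-code redirect handled by RedirectMiddleware.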
If I enter this URL in a browser, it returns the valid XML data that I am interested in scraping.
http://www.facebook.com/ajax/stream/profile.php?__a=1&profile_id=36343869811&filter=2&max_time=0&try_scroll_load=false&_log_clicktype=Filter%20Stories%20or%20Pagination&ajax_log=0
However, if I do it from the server side, it doesn't work as it previously did. Now it just returns this error, which seems to be a default error message:
{u'silentError': 0, u'errorDescription': u"Something went wrong. We're working on getting it fixed as soon as we can.", u'errorSummary': u'Oops', u'errorIsWarning': False, u'error': 1357010, u'payload': None}
Here is the code in question; I've tried multiple User-Agents, to no avail:
import urllib2
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; he; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3'
uaheader = { 'User-Agent' : user_agent }
wallurl='http://www.facebook.com/ajax/stream/profile.php?__a=1&profile_id=36343869811&filter=2&max_time=0&try_scroll_load=false&_log_clicktype=Filter%20Stories%20or%20Pagination&ajax_log=0'
req = urllib2.Request(wallurl, headers=uaheader)
resp = urllib2.urlopen(req)
pageData=convertTextToUnicode(resp.read())
print pageData #and get that error
What would be the difference between the server calls and my own browser aside from User Agents and IP addresses?
I tried the above URL in both Chrome and Firefox. It works in Chrome but fails in Firefox. In Chrome I am signed into Facebook, while in Firefox I am not.
This could be the reason for the discrepancy. You will need to provide authentication in the urllib2-based script that you have posted.
There is an existing question on authentication with urllib2.
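The authentication side usually starts with cookie handling. A minimal sketch, shown with Python 3's urllib.request (where urllib2 ended up; the User-Agent value is just a placeholder): attach a CookieJar to the opener so that once a login request succeeds, the session cookies are sent automatically on later calls.

```python
import http.cookiejar
import urllib.request

# Keep session cookies between requests, like a signed-in browser does.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (compatible; ExampleBot/0.1)')]

# After a successful login POST through this opener, `jar` holds the
# session cookies, and every later opener.open(...) call sends them
# automatically.
```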
I would like to write a program that changes my user agent string.
How can I do this in Python?
I assume you mean a user-agent string in an HTTP request? This is just an HTTP header that gets sent along with your request.
Using Python's urllib2:
import urllib2
url = 'http://foo.com/'
# add a header to define a custom User-Agent
headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' }
req = urllib2.Request(url, None, headers)  # data=None keeps this a GET request
response = urllib2.urlopen(req).read()
In urllib, it's done like this:
import urllib
class AppURLopener(urllib.FancyURLopener):
version = "MyStrangeUserAgent"
urllib._urlopener = AppURLopener()
and then just use urllib.urlopen normally. In urllib2, pass headers=somedict to req = urllib2.Request(...) to set all the headers you want (including the user agent) on the new request object req, then call urllib2.urlopen(req).
Other ways of sending HTTP requests have other ways of specifying headers, of course.
Using Python, you can use urllib to download webpages and override the version attribute to change the user agent.
There is a very good example on http://wolfprojects.altervista.org/changeua.php
Here is an example copied from that page:
>>> from urllib import FancyURLopener
>>> class MyOpener(FancyURLopener):
... version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
>>> myopener = MyOpener()
>>> page = myopener.open('http://www.google.com/search?q=python')
>>> page.read()
[…]Results <b>1</b> - <b>10</b> of about <b>81,800,000</b> for <b>python</b>[…]
urllib2 is nice because it's built in, but I tend to use mechanize when I have the choice. It extends a lot of urllib2's functionality (though much of that has since been added to Python itself). Anyhow, if that's what you're using, here's an example from its docs on how to change the user-agent string:
import mechanize
cookies = mechanize.CookieJar()
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))
opener.addheaders = [("User-agent", "Mozilla/5.0 (compatible; MyProgram/0.1)"),
                     ("From", "responsible.person@example.com")]
Best of luck.
As mentioned in the answers above, the User-Agent field in the HTTP request header can be changed using built-in Python modules such as urllib2. At the same time, it is also important to analyze what exactly the web server sees. A recent post on user agent detection gives sample code and output describing what the web server sees when a programmatic request is sent.
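You don't need an external service to check this. A small sketch using Python 3's http.server shows exactly which User-Agent your own request carries (the server setup and header value here are purely for illustration):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

seen = {}  # what the server observed

class EchoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # record the User-Agent exactly as the server receives it
        seen['ua'] = self.headers.get('User-Agent')
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'ok')

    def log_message(self, *args):
        pass  # keep the output quiet

# port 0 lets the OS pick a free port
server = HTTPServer(('127.0.0.1', 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = 'http://127.0.0.1:%d/' % server.server_port
req = urllib.request.Request(url, headers={'User-Agent': 'MyProgram/0.1'})
urllib.request.urlopen(req).read()
server.shutdown()

print(seen['ua'])  # the header the web server saw
```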
If you want to change the user agent string you send when opening web pages, google around for a Firefox plugin. ;) For example, I found this one. Or you could write a proxy server in Python, which changes all your requests independent of the browser.
My point is, changing the string is going to be the easy part; your first question should be, where do I need to change it? If you already know that (at the browser? proxy server? on the router between you and the web servers you're hitting?), we can probably be more helpful. Or, if you're just doing this inside a script, go with any of the urllib answers. ;)
Updated for Python 3.2 (py3k):
import urllib.request
headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' }
url = 'http://www.google.com'
request = urllib.request.Request(url, headers=headers)  # no data, so this stays a GET
response = urllib.request.urlopen(request).read()