I am trying to access and parse a website at work using Python. The site's authorization is done via SiteMinder, so the usual urllib/urllib2 username/password approach does not work.
Does anyone have an idea how to do that?
Thanks
NoamM
Just did this - I know it's an oldie - but if anyone else is looking to do this, use the requests library. I had done this in C# before with a mammoth amount of code, but this is all it takes to log in to my corporate SiteMinder system - nice. The requests.session() object will persist redirects, headers, and cookies, so all you need to worry about is posting the login form. I'm sure the variable names will be different in your environment, but the process will be the same.
output.text will be the body of the target page you wanted to parse, which you can then run through XPath or whatever.
import requests

# A session persists cookies, headers, and redirects across requests
r = requests.session()

# Post the login form; these field names come from the SiteMinder login page in my environment
postUrl = "https://loginUrl"
params = {
    'USER': 'user',
    'PASSWORD': 'pass',
    'SMENC': 'ISO-8859-1',
    'SMLOCALE': 'US-EN',
    'target': '/redir.shtml?GOTO=redirecturl',
    'smauthreason': '0'
}
r.post(postUrl, data=params)

# The same session is now authenticated and can fetch pages behind the login
getUrl = "http://urlFromBehindLogInYouWantDataFrom"
output = r.get(getUrl)
print(output.text)
First of all, you should find out what's happening when you authenticate through SiteMinder. Perhaps there's documentation for it, but if not, it's not so hard to find out: the Network tab in Chrome's or Safari's developer tools has all the information you need, including the HTTP headers and cookies for every network request. Firebug can give you that as well.
Once you have a clear idea of what's happening at each step of the authentication process, it's only a matter of replicating the same behavior in your script. urllib2 has support for cookies and headers. If you need something urllib2 doesn't provide, PycURL probably will.
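For example, here is a minimal urllib2 sketch of that approach: it keeps cookies between requests and replays the login POST. The URL and form field names are placeholders you would replace with whatever you see in the Network tab.

import urllib
import urllib2
import cookielib

# Cookie jar so the session cookie set at login is kept for later requests
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]

# Replay the login form POST observed in the browser (placeholder URL and field names)
login_data = urllib.urlencode({'USER': 'user', 'PASSWORD': 'pass'})
opener.open('https://loginUrl', login_data)

# Subsequent requests through the same opener reuse the authenticated cookies
html = opener.open('http://urlFromBehindLogInYouWantDataFrom').read()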
Agree with Martin - you need to just replicate what the browser does. SiteMinder will pass you a token once you're successfully authenticated. I have to do this as well; I'll post once I find a good way.
Related
I am coding a web scraper for the website with the following Python code:
import requests

def scrape(url):
    # Fetch the page and save the HTML to a file
    req = requests.get(url)
    with open('out.html', 'w') as f:
        f.write(req.text)
It works a few times, but then the website returns an error HTML page instead (and when I open my browser, I have a captcha to complete).
Is there a way to avoid this "ban", for example by changing the IP address?
As already mentioned in the comments and by yourself, changing the IP could help. To do this quite easily, have a look at vpngate.py:
https://gist.github.com/Lazza/bbc15561b65c16db8ca8
A how-to is provided at the link.
You can use a proxy with the requests library. You can find free proxies at a couple of different websites like https://www.sslproxies.org/ and http://free-proxy.cz/en/proxylist/country/US/https/uptime/level3, but not all of them work and they should not be trusted with sensitive information.
example:
import requests

# Send both HTTP and HTTPS traffic through the proxy (example address; substitute a live one)
proxy = {
    "https": 'https://158.177.252.170:3128',
    "http": 'https://158.177.252.170:3128'
}
response = requests.get('https://httpbin.org/ip', proxies=proxy)
print(response.text)  # httpbin echoes back the IP address it saw
I recently answered this on another question here, but using the requests-ip-rotator library to rotate IPs through AWS API Gateway is usually the most effective way.
It's free for the first million requests per region, and it means you won't have to give your data to unreliable proxy sites.
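Roughly, the usage looks like the sketch below. This is from memory, so check the library's README for the exact API; the target URL is a placeholder, and you need AWS credentials configured.

import requests
from requests_ip_rotator import ApiGateway

site = "https://site-you-are-scraping.com"  # placeholder target

# Spin up API Gateway endpoints that forward requests from rotating IPs
gateway = ApiGateway(site)
gateway.start()

session = requests.Session()
session.mount(site, gateway)  # route all requests to this site through the gateway

response = session.get(site + "/some/page")
print(response.status_code)

gateway.shutdown()  # tear the gateway endpoints down when finished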
Late answer - I found this looking for IP-spoofing, but to the OP's question: as some comments point out, you may or may not actually be getting banned. Here are two things to consider:
A soft ban: they don't like bots. A simple solution that has worked for me in the past is to add headers so they think you're a browser, e.g.,
req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
On-page active elements, scripts, or popups that act as content gates, not a ban per se - e.g., country/language selectors, cookie configuration, surveys, etc. requiring user input. The not-as-simple solution: use a webdriver like Selenium + chromedriver to render the page, including the JS, and then add "user" clicks to deal with the problems (a rough sketch is below).
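For illustration, a minimal Selenium sketch along those lines. The URL and the button ID are made-up placeholders; every site's gate looks different.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # requires chromedriver to be available
driver.get('https://example.com/page-behind-a-content-gate')  # placeholder URL

# Simulate the "user" click that dismisses the gate (selector is hypothetical)
driver.find_element(By.ID, 'accept-cookies').click()

html = driver.page_source  # fully rendered HTML, including JS-generated content
driver.quit()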
I recently wanted to extract data from a website that seems to use cookies to grant me access. I do not know very much about those procedures, but apparently this interferes with my method of getting the HTML content of the website via Python and its requests module.
The code I am running to extract the information contains the following lines:
import requests
# ...
response = requests.get(url, proxies=proxies)
content = response.text
where the website I am referring to is http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6675630&tag=1 and proxies is a valid dict of my proxy servers (I tested those settings on websites that seemed to work fine). However, instead of the content of the article on this site, I receive the HTML content of the page that you get when you do not accept cookies in your browser.
As I am not really aware of what the website is doing and lack real web development experience, I could not find a solution so far, even if a similar question might have been asked before. Is there any solution to access the content of this website via Python?
startr = requests.get('https://viennaairport.com/login/')
secondr = requests.post('http://xxx/', cookies=startr.cookies)
I've been trying to get a cookie and post it to a URL for later use in the program, but I can't seem to get the cookie parameters to work.
Right now I have
response = requests.get("url")
But how exactly do I retrieve the cookies from this URL and post them (the same cookies) to a new URL? The tutorial in requests is somewhat vague on the topic and gives examples I cannot test. Hope someone can help with further examples.
This is python 2.7 btw.
You want to use a session:
s = requests.session()
response = s.get('url')
You use the session just like the requests module (it has the same methods), but it'll retain cookies for you and send them along on future requests.
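If you do want to handle the cookies by hand instead (closer to what you described), it looks roughly like the sketch below; the URLs are placeholders.

import requests

first = requests.get('http://example.com/login')   # response that sets the cookies
cookies = first.cookies                             # a CookieJar you can reuse

# Send the same cookies along with a request to a different URL
second = requests.post('http://example.com/other-page', cookies=cookies)
print(second.status_code)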
I am a novice programmer attempting to access Google Insights using Python. I can access sites which don't require cookies fine, but I can't seem to properly pass the cookies along. The cookies file was exported from Mozilla Firefox and is on the Z: drive, which is also where I'm running Python from.
I'm also pretty sure my code for saving the file could be done better than reading and writing, but I don't know how to do that either. Any help would be appreciated.
import urllib2
import cookielib
import os

url = "http://www.google.com/insights/search/overviewReport?q=eagles%2Ccsco&geo=US&cmpt=q&content=1&export=2"

# Load the cookies exported from Firefox (Netscape/Mozilla cookies.txt format)
cj = cookielib.MozillaCookieJar()
cj.load('cookies6.txt')

# Build an opener that sends those cookies with every request
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
file = opener.open(url)

# Save the exported CSV to disk
output = open('test2.csv', 'wb')
output.write(file.read())
output.close()
I haven't tested your code; however:
As far as I can tell, there seems to be nothing wrong with your code.
I've tried the URL you're searching and had no problems downloading the CSV without any cookies.
In my previous experience with Google, you might be looking at the problem the wrong way: it is not that you don't have the right cookies, but that Google automatically blocks requests from bots. If this is the case, you must replace the User-Agent HTTP header to mimic an actual browser. Beware, however, that this is against Google's terms of service, and if you make too many requests per minute, Google will block all requests from your IP for about 8 hours.
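For instance, a minimal tweak to the code above that sets a browser-like User-Agent on the cookie-aware opener; the agent string is just an example and can be any real browser's.

import urllib2
import cookielib

url = "http://www.google.com/insights/search/overviewReport?q=eagles%2Ccsco&geo=US&cmpt=q&content=1&export=2"

cj = cookielib.MozillaCookieJar()
cj.load('cookies6.txt')

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# Pretend to be a regular browser instead of the default Python-urllib agent
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) Gecko/20100101 Firefox/10.0')]

response = opener.open(url)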
I am trying to fetch a Wikipedia article with Python's urllib:
f = urllib.urlopen("http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes")
s = f.read()
f.close()
However, instead of the HTML page I get the following response ("Error - Wikimedia Foundation"):
Request: GET http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes, from 192.35.17.11 via knsq1.knams.wikimedia.org (squid/2.6.STABLE21) to ()
Error: ERR_ACCESS_DENIED, errno [No Error] at Tue, 23 Sep 2008 09:09:08 GMT
Wikipedia seems to block requests which are not from a standard browser.
Anybody know how to work around this?
You need to use urllib2, which supersedes urllib in the Python standard library, in order to change the user agent.
Straight from the examples
import urllib2

# Build an opener that sends a browser-like User-Agent instead of urllib2's default
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
page = infile.read()
It is not a solution to the specific problem, but it might be interesting for you to use the mwclient library (http://botwiki.sno.cc/wiki/Python:Mwclient) instead. That would be so much easier, especially since you will directly get the article contents, which removes the need for you to parse the HTML.
I have used it myself for two projects, and it works very well.
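For reference, fetching an article's wikitext with mwclient looks roughly like this; the method names are from memory against a recent mwclient version, so double-check them against its documentation.

import mwclient

site = mwclient.Site('en.wikipedia.org')
page = site.pages['Albert Einstein']
wikitext = page.text()   # article source as wikitext, no HTML parsing needed
print(wikitext[:200])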
Rather than trying to trick Wikipedia, you should consider using their High-Level API.
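For example, the MediaWiki web API can return the parsed HTML of a page directly. A rough sketch using requests is below; the parameters are easy to adjust from the API sandbox, and the response layout shown in the comment reflects the default (legacy) JSON format.

import requests

params = {
    'action': 'parse',          # parse a page and return its rendered HTML
    'page': 'Albert Einstein',
    'format': 'json',
}
data = requests.get('https://en.wikipedia.org/w/api.php', params=params).json()
html = data['parse']['text']['*']   # body HTML of the article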
In case you are trying to access Wikipedia content (and don't need any specific information about the page itself), instead of using the API you should just call index.php with 'action=raw' in order to get the wikitext, as in:
'http://en.wikipedia.org/w/index.php?action=raw&title=Main_Page'
Or, if you want the HTML code, use 'action=render' like in:
'http://en.wikipedia.org/w/index.php?action=render&title=Main_Page'
You can also define a section to get just part of the content with something like 'section=3'.
You could then access it using the urllib2 module (as suggested in the chosen answer).
However, if you need information about the page itself (such as revisions), you'll be better off using mwclient, as suggested above.
Refer to MediaWiki's FAQ if you need more information.
The general solution I use for any site is to access the page using Firefox and, using an extension such as Firebug, record all details of the HTTP request including any cookies.
In your program (in this case in Python) you should try to send an HTTP request as similar as necessary to the one that worked from Firefox. This often includes setting the User-Agent, Referer and Cookie fields, but there may be others.
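With the requests library, that usually boils down to something like the sketch below; the header values are placeholders you would copy from whatever Firebug showed you.

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/60.0',  # placeholder
    'Referer': 'http://en.wikipedia.org/',                                        # placeholder
    'Cookie': 'name=value; other=value',                                          # placeholder
}
response = requests.get(
    'http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes',
    headers=headers)
print(response.status_code)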
requests is awesome!
Here is how you can get the html content with requests:
import requests
html = requests.get('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes').text
Done!
Try changing the user agent header you are sending in your request to something like:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008072820 Ubuntu/8.04 (hardy) Firefox/3.0.1 (Linux Mint)
You don't need to impersonate a browser user-agent; any user-agent at all will work, just not a blank one.
Requesting the page with ?printable=yes gives you an entire, relatively clean HTML document. ?action=render gives you just the body HTML. Requesting the page through the MediaWiki action API with action=parse likewise gives you just the body HTML, but is a good choice if you want finer control; see the parse API help.
If you just want the page HTML so you can render it, it's faster and better to use the new RESTBase API, which returns a cached HTML representation of the page. In this case, https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein.
As of November 2015, you don't have to set your user-agent, but it's strongly encouraged. Also, nearly all Wikimedia wikis require HTTPS, so avoid a 301 redirect and make https requests.
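Putting those pieces together, a short sketch that fetches the cached RESTBase HTML over HTTPS with a descriptive user agent; the agent string is only an example of how you might identify your client.

import requests

url = 'https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein'
headers = {'User-Agent': 'my-scraper/0.1 (contact@example.com)'}  # example identification

response = requests.get(url, headers=headers)
html = response.text   # cached HTML representation of the article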
import urllib
s = urllib.urlopen('http://en.wikipedia.org/w/index.php?action=raw&title=Albert_Einstein').read()
This seems to work for me without changing the user agent. Without the "action=raw" it does not work for me.