When I log into a page in my browser, I get 3 cookies: tips, ipb_member_id and ip_pass_hash. I need those last two to access some pages I can only see when logged in. When I log in via the browser it works fine, but under mechanize I only get the tips cookie.
Are there any flags I have to set for this to work, or is there a module I might need? I can't link to the page here, though I do know that Python's mechanize + cookielib stores the cookies correctly, since I already have a working version for it.
I am working on the same issue (I want to get all cookies loaded on a page).
I think it's impossible with mechanize. One reason is that it doesn't support JavaScript, so anything a little bit complex (such as an image loaded on a JS event, which sets a new cookie) will not work.
I am considering other options, such as WebKit: http://stackoverflow.com/questions/4730906/automating-chrome
If you find a good way to gather all the cookies, let me know :)
I've already seen multiple posts on Stackoverflow regarding this. However, some of the answers are outdated (such as using PhantomJS) and others didn't work for me.
I'm using selenium to scrape a few sports websites for their data. However, every time I try to scrape these sites, a few of them block me because they know I'm using chromedriver. I'm not sending very many requests at all, and I'm also using a VPN. I know the issue is with chromedriver because anytime I stop running my code but try opening these sites on chromedriver, I'm still blocked. However, when I open them in my default web browser, I can access them perfectly fine.
So, I wanted to know if anyone has any suggestions of how to avoid getting blocked from these sites when scraping them in selenium. I've already tried changing the '$cdc...' variable within the chromedriver, but that didn't work. I would greatly appreciate any ideas, thanks!
Obviously they can tell you're not using a common browser. Could it have something to do with the User Agent?
Try it out with something like Postman. See what the responses are. Try messing with the user agent and other request fields. Look at the request headers when you access the site with a regular browser (like chrome) and try to spoof those.
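A minimal sketch of the header-spoofing idea, using only the Python 3 standard library. The header values are placeholders made up for illustration (copy the real ones from the network tab of your regular browser), and build_browser_like_request is a hypothetical helper name. It only builds the request, so you can inspect it before sending anything:

```python
import urllib.request

# Placeholder values (assumption): replace them with the headers your
# regular browser actually sends, as shown in its developer tools.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_browser_like_request(url, headers=None):
    """Build (but do not send) a request that carries browser-style headers."""
    return urllib.request.Request(url, headers=dict(headers or BROWSER_HEADERS))


# To actually send it:
#   with urllib.request.urlopen(build_browser_like_request(url)) as response:
#       html = response.read()
```

If the site still blocks you with identical headers, the check is probably happening in JavaScript rather than on the request itself.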
Edit: just remembered this and realized the page might be performing some checks in JS and whatnot. It's worth looking into what happens when you block JS on the site with a regular browser.
From this question, the last responder seems to think that it is possible to use Python to open a webpage, let me sign in manually and go through a bunch of menus, then let Python parse the page when I get where I want. The website has a weird sign-in procedure, so using requests and passing a user name and password will not be sufficient.
However it seems from this question that it's not a possibility.
So the question is: is it possible? If so, do you know of some example code out there?
The way to approach this problem is to log in normally with the developer tools open next to you and see what the request is sending.
When logging in to bandcamp the XHR request that's being sent is the following:
From that response you can see that an identity cookie is being sent. That's probably how they identify that you are logged in. So when you've got that cookie set you would be authorized to view logged in pages.
So in your program you could log in normally using requests, save the cookie in a variable and then apply the cookie to further requests using requests.
Of course login procedures and how this authorization mechanism works may differ, but that's the general gist of it.
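Here is a rough sketch of that log-in-then-reuse-the-cookie flow. To keep it free of third-party packages it uses the Python 3 standard library (http.cookiejar plus urllib.request) instead of requests; the idea is the same. The function names, the login URL and the form field names are all assumptions for illustration; take the real ones from the request you see in the developer tools.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener plus the cookie jar it stores cookies in and sends from."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def log_in(opener, login_url, form_fields):
    """POST the login form; cookies set by the response end up in the jar."""
    body = urllib.parse.urlencode(form_fields).encode("utf-8")
    with opener.open(login_url, data=body) as response:
        return response.status


def fetch(opener, url):
    """GET a page; the opener attaches any matching stored cookies."""
    with opener.open(url) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical usage (URL and field names are placeholders):
#   opener, jar = make_session()
#   log_in(opener, "https://example.com/login", {"user": "me", "password": "secret"})
#   html = fetch(opener, "https://example.com/members-only")
```

With requests, a requests.Session object plays the same role as the opener and jar pair here.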
So when do you actually need selenium? You need it if a lot of the page is rendered by JavaScript. requests is only able to get the HTML, so if the menus and such are rendered with JavaScript you won't ever be able to see that information using requests.
Say I browse to a website (possibly on an intranet) that requires a login to access its contents. I fill in the required fields (e.g. username, password and any captcha) to log in from the browser itself.
Once I have logged into the site, there are lots of goodies that can be scraped from several links and tabs on the first page after logging in.
Now, from this point forward (that is, after logging in from the browser), I want to control the page and downloads from urllib2: going through page by page, downloading PDFs and images on each page, etc.
I understand that we can use urllib2 (or mechanize) for everything directly (that is, log in to the page and do the whole thing).
But for some sites it is really a pain to go through and find out the login mechanism, required hidden parameters, referrers, captcha, cookies and pop-ups.
Please advise. Hope my question makes sense.
In summary, I want the initial login part done manually using the web browser, and then to take over the automation for scraping through urllib2.
Did you consider Selenium? It's about browser automation instead of http requests (urllib2), and you can manipulate the browser in between steps.
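If you go the Selenium route, one way to hand the logged-in session over to plain HTTP requests is to copy the browser's cookies into a Cookie header. A rough sketch for Python 3 (where urllib2 became urllib.request), assuming the cookies arrive as a list of dicts with 'name' and 'value' keys, which is the shape Selenium's driver.get_cookies() returns. The helper names are made up for illustration, and the sketch ignores domain, path and expiry, so only use it for requests to the site you logged into:

```python
import urllib.request


def cookie_header_from_browser(cookies):
    """Join browser cookies into the value of a Cookie request header."""
    return "; ".join("{}={}".format(c["name"], c["value"]) for c in cookies)


def request_with_browser_cookies(url, cookies):
    """Build (but do not send) a request carrying the browser's cookies."""
    request = urllib.request.Request(url)
    request.add_header("Cookie", cookie_header_from_browser(cookies))
    return request


# With Selenium the input would come from: cookies = driver.get_cookies()
# Send the result with:
#   urllib.request.urlopen(request_with_browser_cookies(url, cookies))
```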
You want to use the cookielib module.
http://docs.python.org/library/cookielib.html
You can log on using your browser, then export the cookies into a Netscape-style cookie.txt file. Then from python you'll be able to load this and fetch the resource you require. The cookie will be good until the website expires your session (often around 30 days).
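# Python 2 (cookielib and urllib2 are the Python 2 module names)
import cookielib, urllib2
# Load the cookies exported from the browser (Netscape cookie.txt format)
cj = cookielib.MozillaCookieJar()
cj.load('cookie.txt')
# Build an opener that sends those cookies with every request
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
r = opener.open("http://example.com/resource")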
There are add-ons for Chrome and Firefox that will export the cookies in this format. For example:
https://chrome.google.com/webstore/detail/lopabhfecdfhgogdbojmaicoicjekelh
https://addons.mozilla.org/en-US/firefox/addon/export-cookies/
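The snippet above uses the Python 2 module names. On Python 3 the same approach would look roughly like this, with cookielib renamed to http.cookiejar and urllib2 to urllib.request. The function name opener_from_cookie_file is made up for illustration, and passing ignore_discard=True is my assumption that you also want session cookies (those without an expiry date) to be loaded:

```python
import http.cookiejar
import urllib.request


def opener_from_cookie_file(path):
    """Load a Netscape-style cookie file and return (opener, jar)."""
    jar = http.cookiejar.MozillaCookieJar()
    # ignore_discard=True also keeps cookies marked as session-only;
    # expired cookies are still skipped.
    jar.load(path, ignore_discard=True, ignore_expires=False)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


# Usage, with the file exported from the browser:
#   opener, jar = opener_from_cookie_file("cookie.txt")
#   r = opener.open("http://example.com/resource")
```

Note that load() raises http.cookiejar.LoadError if the file does not start with the Netscape cookie file header line, so check the export format if you hit that.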
I am using mechanize to retrieve data from many websites. When I tried to log into www.douban.com, I found that many cookies were not set even though the login succeeded. Eventually I found that they come from Google Analytics and are set by JavaScript. However, mechanize cannot handle JavaScript, so how can I get these cookies? Without them I still cannot visit www.douban.com.
PhantomJS is a headless WebKit-based client supporting all the bells and whistles, JavaScript included. It had a Python API (PyPhantomJS), which was unfortunately removed for lack of a maintainer. You may still want to take a look.
Sorry to say, but unless your crawler knows how to run JavaScript code, you are unable to fetch cookies set by JavaScript.
I've had a look at many tutorials regarding cookiejar, but my problem is that the webpage that I want to scrape creates the cookie using JavaScript and I can't seem to retrieve the cookie. Does anybody have a solution to this problem?
If all pages have the same JavaScript then maybe you could parse the HTML to find that piece of code, and from that get the value the cookie would be set to?
That would make your scraping quite vulnerable to changes in the third party website, but that's most often the case while scraping. (Please bear in mind that the third-party website owner may not like that you're getting the content this way.)
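As a very rough sketch of that parsing idea: the regular expression below only catches the simplest case, where the page assigns a literal string to document.cookie. It will miss cookies whose value is computed or concatenated at run time, and the names in it are made up for illustration.

```python
import re

# Matches e.g.  document.cookie = "name=value; path=/"
# (assumption: the cookie is written as a plain string literal)
COOKIE_ASSIGNMENT = re.compile(
    r"""document\.cookie\s*=\s*["'](?P<name>[^=;"']+)=(?P<value>[^;"']*)"""
)


def cookies_set_by_inline_js(html):
    """Return {name: value} for literal document.cookie assignments in html."""
    return {
        match.group("name").strip(): match.group("value")
        for match in COOKIE_ASSIGNMENT.finditer(html)
    }
```

For example, feeding it a page containing document.cookie = "token=abc123; path=/" gives a dict with the single entry token -> abc123, which you can then add to your cookie jar or send in a Cookie header.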
I responded to your other question as well: take a look at mechanize. It's probably the most fully featured scraping module I know: if the cookie is sent, then I'm sure you can get to it with this module.
Maybe you can execute the JavaScript code in a JavaScript engine with Python bindings (like python-spidermonkey or pyv8) and then retrieve the cookie. Or, as the javascript code is executed client side anyway, you may be able to convert the cookie-generating code to Python.
You could access the page using a real browser, via PAMIE, win32com or similar, so that the JavaScript runs in its native environment.