I am using mechanize to retrieve data from many web site. When I tried to log into www.douban.com , I found there are a lot of cookies not set when I log in success. Finally, I find they came from google analytics. They were set by javascript. However, mechanize can not handle javascript, so how to get these cookies. Without these cookies I still can not visit www.douban.com.
PhantomJS is a headless webkit-based client supporting all bells and wisthles, JavaScript included. It had Python API (PyPhantomJS) which was unfortunately removed due to lack of maintainer. You may still want to take a look.
Sorry to say that, but unless Your crawler knows how to run Javascript code, You are unable to fetch cookies set by Javascript.
Related
I've already seen multiple posts on Stackoverflow regarding this. However, some of the answers are outdated (such as using PhantomJS) and others didn't work for me.
I'm using selenium to scrape a few sports websites for their data. However, every time I try to scrape these sites, a few of them block me because they know I'm using chromedriver. I'm not sending very many requests at all, and I'm also using a VPN. I know the issue is with chromedriver because anytime I stop running my code but try opening these sites on chromedriver, I'm still blocked. However, when I open them in my default web browser, I can access them perfectly fine.
So, I wanted to know if anyone has any suggestions of how to avoid getting blocked from these sites when scraping them in selenium. I've already tried changing the '$cdc...' variable within the chromedriver, but that didn't work. I would greatly appreciate any ideas, thanks!
Obviously they can tell you're not using a common browser. Could it have something to do with the User Agent?
Try it out with something like Postman. See what the responses are. Try messing with the user agent and other request fields. Look at the request headers when you access the site with a regular browser (like chrome) and try to spoof those.
Edit: just remembered this and realized the page might be performing some checks in JS and whatnot. It's worth looking into what happens when you block JS on the site with a regular browser.
Are there any alternatives to Selenium that don't require a web driver or browser to operate? I recently moved my code over to a Google Cloud VM instance, and when I run it there are multiple errors. I've been trying to get it to work for hours but just can't (no luck with PhantomJS, Chrome and GeckoDriver - tried re-downloading browsers, editing the sources.list file e.c.t.).
The page I'm web scraping uses JavaScript to load in numbers, which I was I initially chose Selenium. Everything else works perfectly though!
You could simply use the request library.
https://requests.readthedocs.io/en/master/
https://anaconda.org/anaconda/requests
You would then need to send a GET or POST request to the server.
If you do not know how to generate a proper POST request, simply try to "record" it.
If you have chrome, got to the page you want to navigate, press F12, navigate to the "Network" section and write method:POST into the filter.
Further info here:
https://stackoverflow.com/a/39661536/11971785
At first it is a bit more confusing than selenium, but once you understand it its waaaay better in my opinion.
Also the Java values shown on the page can usually be simply read out of the java code which is returned by your request.
No web driver or anything required and a lot more stable and customizable.
We have developed a web based application, with user login etc, and we developed a python application that have to get some data on this page.
Is there any way to communicate python and system default browser ?
Our main goal is to open a webpage, with system browser, and get the HTML source code from it ? We tried with python webbrowser, opened web page succesfully, but could not get source code, and tried with urllib2, in that case, i think we have to use system default browser's cookie etc, and i dont want to this, because of security.
https://pypi.python.org/pypi/selenium
You can try to use Selenium, he was done for testing, but nothing prevents you from using it for other purposes
If your web site is navigable without Javascript, then you could try Mechanize or zope.testbrowser. These tools offer a higher level API than urllib2, letting you do things like follow links on pages and fill out HTML forms.
This can be helpful in navigating a site that uses cookie based authentication with HTML forms for login, for example.
Have a look at the nltk module---they have some utilities for looking at web pages and getting text. There's also BeautifulSoup, which is a bit more elaborate. I'm currently using both to scrape web pages for a learning algorithm---they're pretty widely used modules, so that means you can find lots of hints here :)
When I log into a page in my browser, I get 3 cookies: tips, ipb_member_id and ip_pass_hash. I need those last two to access some pages I can only see when logged in. When I log in via the browser it works fine, but under mechanize I only get the tips cookie.
Are there any flags I have to set up for this to work, or is there any module I might need? I can't link to the page here. Though I do know Python's Mechanize + cookielib stores the cookies correctly, since I already have a working version for it.
I am working on the same issue (I want to get all cookies loaded on a page).
I think it's impossible with mechanize. One reason is that it doesn't support javascript, so anything a little bit complex (such as a img loaded on a js event, which set a new cookie) will not work.
I am considering other options as webkit :http://stackoverflow.com/questions/4730906/automating-chrome
if you find a good way to gather all the cookies, let me know :)
I've had a look at many tutorials regarding cookiejar, but my problem is that the webpage that i want to scape creates the cookie using javascript and I can't seem to retrieve the cookie. Does anybody have a solution to this problem?
If all pages have the same JavaScript then maybe you could parse the HTML to find that piece of code, and from that get the value the cookie would be set to?
That would make your scraping quite vulnerable to changes in the third party website, but that's most often the case while scraping. (Please bear in mind that the third-party website owner may not like that you're getting the content this way.)
I responded to your other question as well: take a look at mechanize. It's probably the most fully featured scraping module I know: if the cookie is sent, then I'm sure you can get to it with this module.
Maybe you can execute the JavaScript code in a JavaScript engine with Python bindings (like python-spidermonkey or pyv8) and then retrieve the cookie. Or, as the javascript code is executed client side anyway, you may be able to convert the cookie-generating code to Python.
You could access the page using a real browser, via PAMIE, win32com or similar, then the JavaScript will be running in its native environment.