I am currently trying to write a small bot for a banking site that doesn't supply an API. The security of the login page seems more ingenious than I'd expected: even though I can't see any significant difference between the requests Chrome and Python send, the site doesn't let the Python requests through (I accounted for things such as headers and cookies).
I've been wondering if there is a tool to record requests in Firefox/Chrome/any browser and replicate them in Python (or any other language)? Think Selenium, but without the overhead of Selenium :p
You can use Selenium web drivers to have a real browser make the requests for you.
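For example, a minimal Selenium sketch that drives a real Chrome window (the URL, form field names and credentials below are placeholders, not anything from the original question):

    # Drive a real Chrome instance so the site sees genuine browser traffic.
    # Requires chromedriver on your PATH; all names below are hypothetical.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://example-bank.com/login")  # placeholder URL

    driver.find_element(By.NAME, "username").send_keys("my_user")  # placeholder field
    driver.find_element(By.NAME, "password").send_keys("my_pass")  # placeholder field
    driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

    print(driver.page_source)  # HTML after the browser ran all JavaScript
    driver.quit()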
In such cases, I usually check the request made by Chrome in the dev tools "Network" tab. Then I right-click the request and copy it as cURL to run on the command line and see if it works. If it does, I can be fairly certain the same thing can be achieved with Python's requests package.
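To illustrate, a copied request might look like the hypothetical cURL command in the comment below, and it translates almost mechanically into requests:

    import requests

    # Hypothetical command copied from the Network tab:
    #   curl 'https://example.com/login' \
    #     -H 'User-Agent: Mozilla/5.0 ...' \
    #     -H 'Content-Type: application/x-www-form-urlencoded' \
    #     --data 'user=me&pass=secret'
    # The same request with the requests package:
    resp = requests.post(
        "https://example.com/login",  # placeholder URL from the copied command
        headers={
            "User-Agent": "Mozilla/5.0 ...",  # copy the exact values Chrome sent
            "Content-Type": "application/x-www-form-urlencoded",
        },
        data={"user": "me", "pass": "secret"},  # placeholder form fields
    )
    print(resp.status_code, resp.text[:200])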
Look into PhantomJS or CasperJS. PhantomJS is a complete headless browser that can be programmed using JavaScript (CasperJS is a scripting layer on top of it).
Are there any alternatives to Selenium that don't require a web driver or browser to operate? I recently moved my code over to a Google Cloud VM instance, and when I run it there I get multiple errors. I've been trying to get it to work for hours with no luck (neither PhantomJS, Chrome nor GeckoDriver worked; I tried re-downloading the browsers, editing the sources.list file, etc.).
The page I'm web scraping uses JavaScript to load in numbers, which is why I initially chose Selenium. Everything else works perfectly though!
You could simply use the requests library.
https://requests.readthedocs.io/en/master/
https://anaconda.org/anaconda/requests
You would then need to send a GET or POST request to the server.
If you do not know how to generate a proper POST request, simply try to "record" it.
If you have Chrome, go to the page you want to scrape, press F12, open the "Network" tab and type method:POST into the filter.
Further info here:
https://stackoverflow.com/a/39661536/11971785
At first it is a bit more confusing than Selenium, but once you understand it, it's waaaay better in my opinion.
Also, the values that JavaScript renders on the page can usually be read straight out of the JavaScript (or JSON) returned by your request.
No web driver or anything required and a lot more stable and customizable.
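A minimal sketch of that approach, using a Session so the cookies set by the login POST carry over to later requests (the endpoint and field names are made up for illustration):

    import requests

    # A Session keeps cookies between requests, much like a browser tab.
    session = requests.Session()

    # Replay the POST you "recorded" in the Network tab (hypothetical endpoint/fields).
    session.post("https://example.com/login",
                 data={"username": "me", "password": "secret"})

    # Later requests automatically reuse the cookies the login response set.
    page = session.get("https://example.com/account")
    print(page.text)  # the values rendered by JavaScript often sit in here as JSON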
I am running some crawls in order to test whether the results I get deviate. For this I created two test suites: the first one with the requests and BeautifulSoup libraries, the other one based on Selenium. I would like to find out if pages detect both bots in the same way.
But I am still unsure whether I am right in assuming that requests and BeautifulSoup are independent of Selenium.
I hope it's not a dumb question, but I haven't found any proper answer yet (maybe because of the wrong keywords). Any help would be appreciated.
Thanks in advance
I checked the requests documentation, I wrote a mail to the developer without getting an answer, and of course I searched on Google. I found something about Scrapy vs. Selenium, but well... are requests and BeautifulSoup related to Scrapy?
The Python requests module does not use Selenium, and neither does BeautifulSoup. Both run independently of a web browser; both are pure Python implementations.
Selenium automates browsers, so you'll present to a web service with the user-agent string and other variables of whichever browser you drive with Selenium.
You can specify a user-agent string when you use requests, or not, but requests doesn't drive a browser, so by default you'll present as a different entity from the user-agent perspective, e.g. python-requests/2.18.4.
BeautifulSoup is a parser, and so it presents to a web service through another library (like requests); it doesn't have its own native presentation.
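To see the difference in presentation, here is a small sketch (httpbin.org simply echoes the user-agent it received):

    import requests

    # Default presentation: the server sees something like "python-requests/2.x".
    print(requests.get("https://httpbin.org/user-agent").json())

    # Presenting a browser-like user-agent string instead (example value):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
    print(requests.get("https://httpbin.org/user-agent", headers=headers).json())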
We have developed a web-based application with user login etc., and a Python application that has to get some data from this page.
Is there any way for Python to communicate with the system default browser?
Our main goal is to open a webpage with the system browser and get the HTML source code from it. We tried Python's webbrowser module and opened the web page successfully, but could not get the source code. We also tried urllib2, but in that case I think we would have to use the system default browser's cookies etc., and I don't want to do that, for security reasons.
https://pypi.python.org/pypi/selenium
You can try to use Selenium; it was made for testing, but nothing prevents you from using it for other purposes.
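A minimal sketch of reading the rendered HTML through a real browser (the URL is a placeholder):

    from selenium import webdriver

    driver = webdriver.Firefox()  # opens a real browser window; needs geckodriver
    driver.get("https://example.com/protected-page")  # placeholder URL
    # Log in (interactively or scripted), then read the rendered source:
    html = driver.page_source
    print(html)
    driver.quit()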
If your web site is navigable without JavaScript, then you could try Mechanize or zope.testbrowser. These tools offer a higher-level API than urllib2, letting you do things like follow links on pages and fill out HTML forms.
This can be helpful in navigating a site that uses cookie-based authentication with HTML forms for login, for example; a sketch follows below.
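A form login with Mechanize might look roughly like this (the URL and field names are assumptions for the sake of the sketch):

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)   # optional: don't consult robots.txt
    br.open("https://example.com/login")  # placeholder URL

    br.select_form(nr=0)          # pick the first HTML form on the page
    br["username"] = "my_user"    # field names are hypothetical
    br["password"] = "my_pass"
    response = br.submit()        # session cookies are kept automatically

    print(response.read())        # HTML of the page after login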
Have a look at the nltk module; it has some utilities for looking at web pages and extracting text. There's also BeautifulSoup, which is a bit more elaborate. I'm currently using both to scrape web pages for a learning algorithm; they're pretty widely used modules, so you can find lots of hints out there :)
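A quick sketch of pulling the visible text out of a page with BeautifulSoup (the URL is a placeholder):

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com").text  # placeholder URL
    soup = BeautifulSoup(html, "html.parser")

    # get_text() strips all tags and returns only the visible text.
    print(soup.get_text(separator="\n", strip=True))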
I am using Mechanize to retrieve data from many web sites. When I tried to log into www.douban.com, I found that a lot of cookies are not set even when I log in successfully. It turned out they come from Google Analytics and are set by JavaScript. Mechanize, however, cannot handle JavaScript, so how do I get these cookies? Without them I still cannot visit www.douban.com.
PhantomJS is a headless WebKit-based client supporting all the bells and whistles, JavaScript included. It had a Python API (PyPhantomJS), which was unfortunately removed due to lack of a maintainer. You may still want to take a look.
Sorry to say it, but unless your crawler knows how to run JavaScript code, you are unable to fetch cookies set by JavaScript.
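One workaround, if you can read the cookie values out of a real browser's dev tools after logging in there, is to hand them to Mechanize yourself. A sketch with made-up values (the real names and values must come from your browser):

    import mechanize

    br = mechanize.Browser()
    # Placeholder values: copy the real Google Analytics cookies from your
    # browser's dev tools, where the JavaScript has already set them.
    br.addheaders = [
        ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),
        ("Cookie", "__utma=PLACEHOLDER; __utmb=PLACEHOLDER; __utmz=PLACEHOLDER"),
    ]
    response = br.open("http://www.douban.com/")
    print(response.read())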
I have a basic structure set up for a crawler. I released it on some PHP-driven websites and it works like a charm. Now I want it to build datasheets from AJAX content.
At the moment I am using Mechanize (in Python and Perl) to build my crawler. The Mechanize module, however, does not execute AJAX. How do I get to content that is built by asynchronous AJAX calls?
I know there is something called Selenium, which automates a real browser. But is this my only option?
You can run a headless browser, e.g. PhantomJS, which understands JavaScript, the DOM, etc., but you will have to write your code in JavaScript. The benefit is that you can do whatever you want.
There is another way, but it's messy.
You could observe what requests are made when you click the button (using Firebug in Firefox or the Developer Tools in Chrome). Then try to reverse engineer the JavaScript running behind the page and do the same thing from your Python code; for that, take a look at SpiderMonkey.
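Once you know which URL the AJAX call hits, you can often skip the JavaScript entirely and request that endpoint yourself; a sketch (endpoint, parameters and headers are hypothetical):

    import requests

    # The XHR observed in the Network tab often returns plain JSON.
    resp = requests.get(
        "https://example.com/api/data",  # placeholder: the URL the XHR called
        params={"page": 1},              # placeholder query parameters
        headers={"X-Requested-With": "XMLHttpRequest"},  # some endpoints check this
    )
    print(resp.json())  # the data the page would have rendered via AJAX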