Are there any alternatives to Selenium that don't require a web driver or browser to operate? I recently moved my code over to a Google Cloud VM instance, and when I run it there are multiple errors. I've been trying to get it to work for hours but just can't (no luck with PhantomJS, Chrome and GeckoDriver - tried re-downloading browsers, editing the sources.list file e.c.t.).
The page I'm web scraping uses JavaScript to load in numbers, which I was I initially chose Selenium. Everything else works perfectly though!
You could simply use the request library.
https://requests.readthedocs.io/en/master/
https://anaconda.org/anaconda/requests
You would then need to send a GET or POST request to the server.
If you do not know how to generate a proper POST request, simply try to "record" it.
If you have chrome, got to the page you want to navigate, press F12, navigate to the "Network" section and write method:POST into the filter.
Further info here:
https://stackoverflow.com/a/39661536/11971785
At first it is a bit more confusing than selenium, but once you understand it its waaaay better in my opinion.
Also the Java values shown on the page can usually be simply read out of the java code which is returned by your request.
No web driver or anything required and a lot more stable and customizable.
Related
I've already seen multiple posts on Stackoverflow regarding this. However, some of the answers are outdated (such as using PhantomJS) and others didn't work for me.
I'm using selenium to scrape a few sports websites for their data. However, every time I try to scrape these sites, a few of them block me because they know I'm using chromedriver. I'm not sending very many requests at all, and I'm also using a VPN. I know the issue is with chromedriver because anytime I stop running my code but try opening these sites on chromedriver, I'm still blocked. However, when I open them in my default web browser, I can access them perfectly fine.
So, I wanted to know if anyone has any suggestions of how to avoid getting blocked from these sites when scraping them in selenium. I've already tried changing the '$cdc...' variable within the chromedriver, but that didn't work. I would greatly appreciate any ideas, thanks!
Obviously they can tell you're not using a common browser. Could it have something to do with the User Agent?
Try it out with something like Postman. See what the responses are. Try messing with the user agent and other request fields. Look at the request headers when you access the site with a regular browser (like chrome) and try to spoof those.
Edit: just remembered this and realized the page might be performing some checks in JS and whatnot. It's worth looking into what happens when you block JS on the site with a regular browser.
I am novice in python (c++ developer), I am trying to do some hands-on on web scraping on windows IE.
The problem which I am facing is, when I open a URL using "requests" library the server sends me a login page always. I figured out the problem. Its actually doing it because it presumes you are coming through IE tries to execute on function which uses some information from the SSO ( single signup object ) which is there executing on the background in Windows on the first login to the web server ( consider this as some weird setup.)
On observing this I changed my strategy & started using webbrowser lib.
Now, when I try to do a webbrowser.open("url"), the browser is opening the page properly which is great thing!!!
But, my problems now are :
1) I do not want that the browser page opened should be visible to the user ( some way that the browser is opened in background ). I tried to used this :
ie = webbrowser.BackgroundBrowser(webbrowser.iexplore)
ie.Visible = 0
ie.open('url')
but no success.
It opens the page which is visible to the user.
2) [This is main activity] I want to scrape the page which is opened in the web browser's IE page opened above. how to do?
I tried to dig into this link but did not find any APIs for getting the data.
Kindly help.
PS : I tried to use beautiful soup for scraping on some other web pages using requests. It was successful & I go the data I wanted. But not in this case.
The webbrowser module doesn't allow to do that. The get function you mentioned is to retrieve registered web browsers not to scrap a HTTP GET request.
I don't know what is triggering the behavior you described with IE, have you tried to change your User-Agent with IE ones? You can check this post for more details: Sending "User-agent" using Requests library in Python
I am currently trying to write a small bot for a banking site that doesn't supply an API. Nevertheless, the security of the login page seems a little more ingenious than what I'd have expected, since even though I don't see any significant difference between Chrome and Python, it doesn't let requests made by Python through (I accounted for things such as headers and cookies)
I've been wondering if there is a tool to record requests in FireFox/Chrome/Any browser and replicate them in Python (or any other language)? Think selenium, but without the overhead of selenium :p
You can use Selenium web drivers to actually use browsers to make the requests for you.
In such cases, I usually checkout the request made by Chrome from my dev tools "Network" tab. Then I right click on the request and copy the request as cURL to run it on command line to see if it works perfectly. If it does, then I can be certain it can be achieved using Python's requests package.
Look into Phantomjs or casperjs. That is a complete browser that can be programmed using JavaScript
We have developed a web based application, with user login etc, and we developed a python application that have to get some data on this page.
Is there any way to communicate python and system default browser ?
Our main goal is to open a webpage, with system browser, and get the HTML source code from it ? We tried with python webbrowser, opened web page succesfully, but could not get source code, and tried with urllib2, in that case, i think we have to use system default browser's cookie etc, and i dont want to this, because of security.
https://pypi.python.org/pypi/selenium
You can try to use Selenium, he was done for testing, but nothing prevents you from using it for other purposes
If your web site is navigable without Javascript, then you could try Mechanize or zope.testbrowser. These tools offer a higher level API than urllib2, letting you do things like follow links on pages and fill out HTML forms.
This can be helpful in navigating a site that uses cookie based authentication with HTML forms for login, for example.
Have a look at the nltk module---they have some utilities for looking at web pages and getting text. There's also BeautifulSoup, which is a bit more elaborate. I'm currently using both to scrape web pages for a learning algorithm---they're pretty widely used modules, so that means you can find lots of hints here :)
I am using mechanize to retrieve data from many web site. When I tried to log into www.douban.com , I found there are a lot of cookies not set when I log in success. Finally, I find they came from google analytics. They were set by javascript. However, mechanize can not handle javascript, so how to get these cookies. Without these cookies I still can not visit www.douban.com.
PhantomJS is a headless webkit-based client supporting all bells and wisthles, JavaScript included. It had Python API (PyPhantomJS) which was unfortunately removed due to lack of maintainer. You may still want to take a look.
Sorry to say that, but unless Your crawler knows how to run Javascript code, You are unable to fetch cookies set by Javascript.