Open URL in IE, then scrape the same using Python

I am a novice in Python (a C++ developer) trying to get some hands-on experience with web scraping on Windows IE.
The problem I am facing is that when I open a URL using the "requests" library, the server always sends me a login page. I figured out the problem: it does this because it presumes you are coming through IE and tries to execute a function that uses information from the SSO (single sign-on) running in the background on Windows since the first login to the web server (consider this a somewhat unusual setup).
On observing this I changed my strategy and started using the webbrowser library.
Now, when I do webbrowser.open("url"), the browser opens the page properly, which is great!
But, my problems now are :
1) I do not want the opened browser page to be visible to the user (i.e. I want the browser to open in the background). I tried this:
ie = webbrowser.BackgroundBrowser(webbrowser.iexplore)
ie.Visible = 0
ie.open('url')
but with no success: it still opens a page that is visible to the user.
2) [This is the main activity] I want to scrape the page that was opened in the IE window above. How do I do that?
I tried to dig into this link but did not find any APIs for getting the data.
Kindly help.
PS: I tried to use Beautiful Soup for scraping some other web pages using requests. It was successful and I got the data I wanted, but not in this case.

The webbrowser module doesn't let you do that. The get function you mentioned retrieves registered web browsers; it does not perform an HTTP GET request.
I don't know what is triggering the behavior you described with IE. Have you tried changing your User-Agent to an IE one? You can check this post for more details: Sending "User-agent" using Requests library in Python
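For example, a minimal sketch of sending an IE-like User-Agent with requests (the UA string and URL below are illustrative, not taken from your setup):
import requests

# An illustrative IE 11 User-Agent string; substitute whatever your IE actually sends.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Trident/7.0; rv:11.0) like Gecko'
}

response = requests.get('http://example.com/sso-protected-page', headers=headers)  # placeholder URL
print(response.status_code)
print(response.text[:500])  # first 500 characters of the returned HTML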

Related

Selenium Alternatives for Google Cloud VM instance?

Are there any alternatives to Selenium that don't require a web driver or browser to operate? I recently moved my code over to a Google Cloud VM instance, and when I run it there are multiple errors. I've been trying to get it to work for hours but just can't (no luck with PhantomJS, Chrome, and GeckoDriver; tried re-downloading browsers, editing the sources.list file, etc.).
The page I'm web scraping uses JavaScript to load in numbers, which is why I initially chose Selenium. Everything else works perfectly though!
You could simply use the requests library.
https://requests.readthedocs.io/en/master/
https://anaconda.org/anaconda/requests
You would then need to send a GET or POST request to the server.
If you do not know how to generate a proper POST request, simply try to "record" it.
If you have Chrome, go to the page you want to navigate, press F12, open the "Network" section and type method:POST into the filter.
Further info here:
https://stackoverflow.com/a/39661536/11971785
At first it is a bit more confusing than Selenium, but once you understand it, it's way better in my opinion.
Also, the JavaScript-driven values shown on the page can usually simply be read out of the JavaScript code returned by your request.
No web driver or anything else is required, and it's a lot more stable and customizable.
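As a rough sketch of what replaying a recorded POST might look like (the URL, payload, and headers below are placeholders; copy the real values from the Network tab):
import requests

# Placeholders -- copy the real endpoint, form fields, and headers
# from the request you recorded in the browser's Network tab.
url = 'http://example.com/api/endpoint'
payload = {'field1': 'value1', 'field2': 'value2'}
headers = {'User-Agent': 'Mozilla/5.0', 'Referer': 'http://example.com/page'}

response = requests.post(url, data=payload, headers=headers)
print(response.status_code)
print(response.text)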

miss part of html by getting html by requests in python [duplicate]

I need to scrape a site with Python. I obtain the source HTML with the urllib module, but I also need to scrape some HTML that is generated by a JavaScript function (which is included in the HTML source). What this function does on the site is that when you press a button it outputs some HTML. How can I "press" this button with Python code? Can Scrapy help me? I captured the POST request with Firebug, but when I try to pass it on the URL I get a 403 error. Any suggestions?
In Python, I think Selenium 1.0 is the way to go. It’s a library that allows you to control a real web browser from your language of choice.
You need to have the web browser in question installed on the machine your script runs on, but it looks like the most reliable way to programmatically interrogate websites that use a lot of JavaScript.
Since there is no comprehensive answer here, I'll go ahead and write one.
To scrape JS-rendered pages, we will need a browser that has a JavaScript engine (i.e., supports JavaScript rendering).
Options like Mechanize and urllib2 will not work since they DO NOT support JavaScript.
So here's what you do:
Set up PhantomJS to run with Selenium. After installing the dependencies for both of them (refer to this), you can use the following code as an example to fetch the fully rendered website.
from selenium import webdriver
from bs4 import BeautifulSoup  # missing import in the original snippet

driver = webdriver.PhantomJS()
driver.get('http://jokes.cc.com/')
soupFromJokesCC = BeautifulSoup(driver.page_source, 'html.parser')  # page_source fetches the page after rendering is complete
driver.save_screenshot('screen.png')  # save a screenshot to disk
driver.quit()
I have had to do this before (in .NET) and you are basically going to have to host a browser, get it to click the button, and then interrogate the DOM (document object model) of the browser to get at the generated HTML.
This is definitely one of the downsides to web apps moving towards an Ajax/Javascript approach to generating HTML client-side.
I use WebKit, which is the browser renderer behind Chrome and Safari. There are Python bindings to WebKit through Qt. Here is a full example to execute JavaScript and extract the final HTML.
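Since the original example link may not be available, here is a minimal sketch of that approach with the PyQt4 QtWebKit bindings (the Render class and target URL are illustrative):
import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    # Loads a URL, waits for the page (and its JavaScript) to finish, then quits the event loop.
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._load_finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _load_finished(self, result):
        self.app.quit()

r = Render('http://example.com')        # illustrative URL
html = unicode(r.mainFrame().toHtml())  # final HTML after JavaScript has run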
For Scrapy (a great Python scraping framework) there is scrapyjs: an additional downloader handler / middleware handler able to scrape JavaScript-generated content.
It's based on the WebKit engine via pygtk, python-webkit, and python-jswebkit, and it's quite simple.

Google serves its homepage to urllib2 when a local search is made

When a local search is done on Google, then the user clicks on the 'More ...' link below the map, the user is then brought to a page such as this.
If the URL:
https://www.google.com/ncr#q=chiropractors%2BNew+York,+NY&rflfq=1&rlha=0&tbm=lcl
is copied out and pasted back into a browser, one arrives, as expected, at the same page. Likewise when a browser is opened with WebDriver, directly accessing the URL brings WebDriver to the same page.
When an attempt is made, however, to request the same page with urllib2, Google serves its home page (google.com) instead, which means, among other things, that lxml's extraction capabilities cannot be used.
While urllib2 is not the culprit here (perhaps Google does the same with all headless requests), is there any way of getting Google to serve the desired page? A quick test with the requests library indicates the same issue.
I think the big hint here is in the URL:
https://www.google.com/ncr#q=chiropractors%2BNew+York,+NY&rflfq=1&rlha=0&tbm=lcl
Do you notice how there is that hash character (#) in there? Everything following the hash component is never actually sent to the server, so the server can't process it. This indicates (in this case) that the page you are seeing in WebDriver and in your browser is a result of client side scripting.
When you load up the page, your browser sends a request for https://www.google.com/ncr and Google returns the home page. The home page contains JavaScript that analyses the component after the hash and uses it to generate the page you expect to see. The browser and WebDriver can do this because they process the JavaScript. If you disable JavaScript in your browser and go to that link, you'll see that the page isn't generated either.
urllib2, however, does not process JavaScript. All it sees is the HTML that the website initially sent along with the JavaScript, but it can't run the JavaScript that actually generates the page you are expecting.
Google is serving the page you're asking for, but your problem is that urllib2 is not equipped to render it. To fix this, you'll have to use a scraping framework that supports JavaScript. Alternatively, in this particular case, you could simply use the non-JavaScript version of Google for your scraping.
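A quick way to see the fragment behavior for yourself (using Python 2's urlparse; in Python 3 the same function lives in urllib.parse):
from urlparse import urlparse  # Python 3: from urllib.parse import urlparse

url = 'https://www.google.com/ncr#q=chiropractors%2BNew+York,+NY&rflfq=1&rlha=0&tbm=lcl'
parts = urlparse(url)
print(parts.path)      # '/ncr' -- this is all the server ever sees
print(parts.fragment)  # everything after '#', handled purely client-side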

How to call AJAX in a webpage from a Python script without browser emulation or a headless browser?

I am new to AJAX and JavaScript.
I am crawling a website where I am able to fetch the relevant details with the help of XPath after downloading the webpage using Python (urllib2/requests/mechanize).
In the webpage there is some information that is visible only after clicking a link, and that link calls an XHR to fetch the details, which I found out using Firefox's web developer tools (Ctrl+Shift+Q or Tools >> Web Developer >> Network). I am showing that link and its JavaScript attributes, as seen in the Firefox inspector (Ctrl+Shift+C or Tools >> Web Developer >> Inspector), in the attached image inside the thick black rectangles.
I am also able to see the AJAX request URL, headers, response, and parameters through the same tools. The same image is visible at http://i.stack.imgur.com/9jhfr.png
I am thinking that I have all the payload for the POST request. How can I make this HTTP POST call using Python (requests/urllib2 etc.), so that the response shows the details that appear in the webpage after clicking that link? Something like:
requests.post(url, data=<parameters_to_post which I can see in Firefox>, headers=<request headers that I can see in Firefox>)
In short:
How do I simulate the AJAX call using Python? Or how do I fetch the information that I see after clicking that link?
I can automate this task using Selenium/PhantomJS or another headless browser, but I want to solve it with the HTTP POST and GET that actually happen in Firefox when I click the link.
Well, first of all install Firebug (https://getfirebug.com/).
Then go to your page, launch Firebug, and open the Net tab in the Firebug panel.
In this tab you can see all of the GET/POST calls your Firefox sends to the website.
Now you can click around and refresh the page to see what calls are being made. In your case, click the button and you'll see new calls being made; you will probably find it under the HTML tab.
There you can find the call; once you click on it you'll see the request and other details.
Make a dictionary of the parameters and attach it to data= in your POST. You can also build headers as a dictionary and attach it to headers= in your POST.
Take note: a lot of websites use cookies to identify whether calls are made by a legitimate browser, so it might require quite a bit of fiddling with cookies and URLs!
It's hard to give examples if you don't give us the website.
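That said, a rough sketch of replaying an XHR with requests, using a Session so cookies from the initial page load are carried over (every URL, parameter, and header below is a placeholder for the values you recorded):
import requests

session = requests.Session()  # keeps cookies across requests, like a browser would
session.get('http://example.com/page-with-the-link')  # placeholder: the page that sets the cookies

# Placeholders -- copy the real endpoint, parameters, and headers from the recorded call.
xhr_url = 'http://example.com/ajax/details'
payload = {'item_id': '12345'}
headers = {'X-Requested-With': 'XMLHttpRequest',
           'Referer': 'http://example.com/page-with-the-link'}

response = session.post(xhr_url, data=payload, headers=headers)
print(response.text)  # the HTML/JSON fragment the page would show after the click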

python open web page and get source code

We have developed a web-based application, with user login etc., and we developed a Python application that has to get some data from this page.
Is there any way for Python to communicate with the system's default browser?
Our main goal is to open a webpage with the system browser and get its HTML source code. We tried Python's webbrowser and opened the web page successfully, but could not get the source code. We also tried urllib2, but in that case I think we would have to use the system default browser's cookies etc., and I don't want to do that for security reasons.
https://pypi.python.org/pypi/selenium
You can try Selenium; it was made for testing, but nothing prevents you from using it for other purposes.
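For instance, a minimal sketch (the URL is a placeholder; you would drive your login form the same way a user would):
from selenium import webdriver

driver = webdriver.Firefox()  # or webdriver.Ie() for the system's Internet Explorer
driver.get('http://example.com/login')  # placeholder URL for your application
html = driver.page_source  # HTML source after the browser has rendered the page
driver.quit()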
If your web site is navigable without Javascript, then you could try Mechanize or zope.testbrowser. These tools offer a higher level API than urllib2, letting you do things like follow links on pages and fill out HTML forms.
This can be helpful in navigating a site that uses cookie based authentication with HTML forms for login, for example.
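For example, a rough sketch of logging in through an HTML form with Mechanize (the form and field names are placeholders for whatever your login page actually uses):
import mechanize

br = mechanize.Browser()
br.open('http://example.com/login')  # placeholder URL
br.select_form(name='login')         # placeholder form name
br['username'] = 'me'                # placeholder field names
br['password'] = 'secret'
br.submit()                          # cookies are kept for subsequent requests
html = br.response().read()          # HTML of the page after login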
Have a look at the nltk module---they have some utilities for looking at web pages and getting text. There's also BeautifulSoup, which is a bit more elaborate. I'm currently using both to scrape web pages for a learning algorithm---they're pretty widely used modules, so that means you can find lots of hints here :)
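For example, a small sketch of pulling the visible text and links out of a page with requests and BeautifulSoup (the URL is a placeholder):
import requests
from bs4 import BeautifulSoup

response = requests.get('http://example.com')  # placeholder URL
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.get_text())  # all visible text on the page
print([a['href'] for a in soup.find_all('a', href=True)])  # every link on the page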
