How to get the HTML that I see in Inspect Element? - Python

I'm writing a web-scraper app in Python. The website I want to scrape renders its data with JS.
How can I get the source that I see in Inspect Element?

pycurl won't work with JavaScript-rendered pages; you need Selenium to get the content you're after:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("your_url")
html = driver.page_source  # the DOM after JavaScript has run
Make sure you have Firefox (or another browser selenium supports) installed.
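Once you have driver.page_source, you can hand it to any HTML parser; a minimal sketch using only the standard library's html.parser (BeautifulSoup works just as well — the LinkCollector class name and the sample HTML are made up for illustration):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes from <a> tags in (already rendered) HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

collector = LinkCollector()
# In the scraper above, the string below would be driver.page_source
collector.feed('<a href="/one">1</a><p>no link</p><a href="/two">2</a>')
print(collector.links)  # → ['/one', '/two']
```

The point is that Selenium only solves the rendering problem; parsing the resulting HTML is a separate, ordinary step.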

Related

Selenium Chrome web driver inconsistently executes JS scripts on webpages

I'm trying to scrape articles on PubChem, such as this one. PubChem requires browsers to have Javascript enabled, or else it redirects to a page with virtually no content that says "This application requires Javascript. Please turn on Javascript in order to use this application". To get around this, I used the Chrome WebDriver from the Selenium library to obtain the HTML that PubChem generates with JavaScript.
And it does that about half the time. The rest of the time it does not render the full HTML and redirects to the JavaScript warning page. How do I make the script retrieve the JS-rendered version of the site consistently?
I've also tried to work around this with PhantomJS, but PhantomJS somehow does not work on my machine after installation.
from bs4 import BeautifulSoup
from requests import get
from requests_html import HTMLSession
from selenium import webdriver
import html5lib
session = HTMLSession()
browser = webdriver.Chrome('/Users/user/Documents/chromedriver')
url = "https://pubchem.ncbi.nlm.nih.gov/compound/"
browser.get(url)
innerHTML = browser.execute_script("return document.body.innerHTML")
soup = BeautifulSoup(innerHTML, "html5lib")
There are no error messages whatsoever. The only issue is that sometimes the web scraper cannot obtain the JS-rendered webpage as expected. Thank you so much!
Answering my own question because why not.
You need to quit your browser:
browser = webdriver.Chrome('/Users/user/Documents/chromedriver')
# stuff
browser.quit()
and do so right after the last operation that involves the browser; otherwise you risk the browser cache affecting your output on subsequent runs of the script.
Hope that whoever has this issue finds this helpful!
UPDATE EDIT:
So closing the browser does increase the frequency of success, but doesn't make it consistent. Another thing that helped it work more often was running
sudo purge
in the terminal (macOS). I'm still not getting consistent results, however. If anyone has an idea of how to do this without brute force (i.e. opening and closing the WebDriver until it renders the proper page), please let me know! Many thanks
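Until a cleaner fix turns up, the brute-force approach can at least be contained in a small helper; a sketch (fetch_with_retry and its parameters are made-up names for illustration). In the PubChem case, fetch would create a fresh Chrome driver, grab document.body.innerHTML, and quit; is_valid would check that the HTML is not the JavaScript warning page:

```python
import time

def fetch_with_retry(fetch, is_valid, max_attempts=5, delay=1.0):
    """Call fetch() until is_valid(result) is true, up to max_attempts times."""
    for attempt in range(max_attempts):
        result = fetch()
        if is_valid(result):
            return result
        time.sleep(delay)  # give the site a moment before retrying
    raise RuntimeError("no valid result after %d attempts" % max_attempts)

# Sketch of the Selenium wiring (not run here):
# def fetch():
#     browser = webdriver.Chrome('/Users/user/Documents/chromedriver')
#     browser.get(url)
#     html = browser.execute_script("return document.body.innerHTML")
#     browser.quit()  # quit every attempt, per the answer above
#     return html
# html = fetch_with_retry(fetch, lambda h: "requires Javascript" not in h)
```

This doesn't fix the underlying flakiness, but it bounds the retries and keeps the quit-after-every-attempt discipline in one place.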

How do I input information into a website with python?

I have this python code, which accesses a website using the module webbrowser:
import webbrowser
webbrowser.open('kahoot.it')
How could I input information into a text box on this website?
I suggest you use Selenium for this.
Here is some example code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# driver = webdriver.Firefox()  # Use this if you prefer Firefox.
driver = webdriver.Chrome()
driver.get('http://www.google.com/')
search_input = driver.find_element(By.CSS_SELECTOR, 'input.gLFyf.gsfi')
search_input.send_keys('some search string' + Keys.RETURN)
You can use Selenium more effectively if you know HTML and CSS well; knowing JavaScript/jQuery may help too.
You need the matching WebDriver executable to run it:
GeckoDriver (Firefox)
ChromeDriver (Chrome)
There are other webdrivers available, but one of these should be enough for you.
On Windows, keep the executable in the same folder as your code. On Ubuntu, copy the webdriver file to /usr/local/bin/ (or anywhere on your PATH).
You can use Selenium not only to input information but for many other automation tasks as well.
I don't think that's doable with the webbrowser module; I suggest you take a look at Selenium:
How to use Selenium with Python?
Depending on how complex (interactive, reliant on scripts, ...) your activity is, you can use requests or, as others have suggested, Selenium.
Requests lets you send and receive raw data from websites; you would use it to automatically submit an order form, query an API, check whether a page has been updated, and so on.
Selenium gives you programmatic control of a "normal" browser, which seems better suited to your specific use case.
The webbrowser module is really only able to open a browser; use it if you want to open a link from inside your application.
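For the plain "submit a form" case described above, even the standard library is enough, as long as the page doesn't depend on JavaScript. A sketch using urllib (the URL, field names, and the build_form_request helper are invented for illustration):

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def build_form_request(url, fields):
    """Build (but don't send) a POST request carrying form-encoded fields."""
    data = urlencode(fields).encode("utf-8")
    return Request(url, data=data, headers={
        "Content-Type": "application/x-www-form-urlencoded",
    })

# Hypothetical login form:
req = build_form_request("https://example.com/login",
                         {"username": "alice", "password": "secret"})
# urlopen(req) would actually submit it. For JS-driven pages like
# kahoot.it this approach won't work, and you need Selenium instead.
```

The rule of thumb: if the form is ordinary HTML, a POST like this is all "inputting information" amounts to; if the input box only exists after scripts run, fall back to Selenium.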

Selenium and Goodreads' pagination

I'm trying to extract information from Goodreads. The problem is if I go into a url like:
https://www.goodreads.com/shelf/show/programming?page=2
with the Selenium Chrome WebDriver or with BeautifulSoup, it still shows the first page instead of the second one.
Example with the chrome webdriver:
While on a normal browser, it displays those books instead:
It looks like that happens because your Selenium session isn't logged in; you'll have to log in and save the cookies between restarts.
Take a look at these Stack Overflow answers to see how to extract cookies.

Missing part of the HTML when fetching it with requests in Python [duplicate]

I need to scrape a site with Python. I obtain the HTML source with the urllib module, but I also need to scrape some HTML that is generated by a JavaScript function (which is included in the HTML source). What this function does on the site is that when you press a button it outputs some HTML. How can I "press" this button with Python code? Can Scrapy help me? I captured the POST request with Firebug, but when I try to send it to the URL I get a 403 error. Any suggestions?
In Python, I think Selenium 1.0 is the way to go. It’s a library that allows you to control a real web browser from your language of choice.
You need to have the web browser in question installed on the machine your script runs on, but it looks like the most reliable way to programmatically interrogate websites that use a lot of JavaScript.
Since there is no comprehensive answer here, I'll go ahead and write one.
To scrape JS-rendered pages, we need a browser that has a JavaScript engine (i.e., one that supports JavaScript rendering).
Options like Mechanize and urllib2 will not work, since they DO NOT support JavaScript.
So here's what you do:
Set up PhantomJS to run with Selenium. After installing the dependencies for both of them (refer to this), you can use the following code as an example to fetch the fully rendered website:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('http://jokes.cc.com/')
# page_source fetches the page after rendering is complete
soupFromJokesCC = BeautifulSoup(driver.page_source, 'html.parser')
driver.save_screenshot('screen.png')  # save a screenshot to disk
driver.quit()
I have had to do this before (in .NET) and you are basically going to have to host a browser, get it to click the button, and then interrogate the DOM (document object model) of the browser to get at the generated HTML.
This is definitely one of the downsides to web apps moving towards an Ajax/Javascript approach to generating HTML client-side.
I use WebKit, the browser renderer behind Chrome and Safari. There are Python bindings to WebKit through Qt, and here is a full example that executes JavaScript and extracts the final HTML.
For Scrapy (a great Python scraping framework) there is scrapyjs: an additional downloader/middleware handler able to scrape JavaScript-generated content.
It's based on the WebKit engine via pygtk, python-webkit, and python-jswebkit, and it's quite simple to use.

Use existing open tab and url in Selenium py

Hi, I'm trying to use Selenium in Python against a URL that is already open in Internet Explorer. I've had a look around and I'm not sure if this is possible.
The reason I don't want to open a new browser or tab is that the page then changes to different text.
So far my code only opens a new browser:
CODE
from selenium import webdriver
driver = webdriver.Ie()
driver.get("https://outlook.live.com/owa/")
This answer helped me with the same problem.
As of now, you cannot access previously opened tabs with Selenium.
But you can try to recreate your session, passing along what is needed (cookies, for example) using the requests library.
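A sketch of the "recreate your session" idea: copy the cookies out of the logged-in browser and replay them on plain HTTP requests (cookie_header is a made-up helper; the Selenium side is only sketched in comments, and the cookie values below are invented):

```python
def cookie_header(cookies):
    """Collapse Selenium-style cookie dicts ({'name': ..., 'value': ...})
    into a single Cookie request header value."""
    return "; ".join("%s=%s" % (c["name"], c["value"]) for c in cookies)

# cookies = driver.get_cookies()  # from the already-authenticated session
cookies = [{"name": "sessionid", "value": "abc123"},
           {"name": "csrftoken", "value": "xyz"}]
header = cookie_header(cookies)
# Then send it on subsequent requests, e.g.:
# urllib.request.Request(url, headers={"Cookie": header})
```

This won't let you drive the existing Internet Explorer window, but it lets a new session act as if it were the old one.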
