When I go to the following website, https://www.bvl.com.pe/mercado/movimientos-diarios, and use Selenium's page_source attribute or urllib.request.urlopen, what I get is a different string than if I open the site in Google Chrome, choose INSPECT from the contextual menu, and copy the entire thing.
From my research, I understand it has to do with Javascript running on the webpage and what I am getting is the base HTML.
What code can I use (Python) to get the same information?
That behavior is entirely browser-dependent. The browser takes the raw HTML, processes it, usually runs JavaScript against it, styles it with CSS, and does many other things. So to get such a result in pure Python you'd essentially have to build your own web browser.
After much digging around, I came upon a solution that works in most cases. Use Headless Chrome with the --dump-dom switch.
https://developers.google.com/web/updates/2017/04/headless-chrome
Programmatically, in Python, use the subprocess module to run Chrome in a shell and either capture the output in a variable or redirect it to a text file.
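A minimal sketch of that approach (an assumption here: the Chrome binary is on PATH as google-chrome; on Windows or macOS you'd use the full path to the executable instead):

import subprocess

# Run headless Chrome and capture the rendered, post-JavaScript DOM.
result = subprocess.run(
    ["google-chrome", "--headless", "--disable-gpu", "--dump-dom",
     "https://www.bvl.com.pe/mercado/movimientos-diarios"],
    capture_output=True, text=True,
)
rendered_html = result.stdout  # the serialized DOM as a string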
Related
I want to find the time at which the first thing (an object, image, text, link, DB request, or anything else) loads on a requested website, using Python and Selenium.
Check out performance.timing; it's JavaScript and comes built into your browser. It exposes a lot of properties, like:
navigationStart
connectStart
connectEnd
domLoading
domInteractive
domComplete
Just go to your console window in your browser and type performance.timing. Might be of use to you.
If you find something you can use, you can have Selenium execute the JavaScript inside the browser using execute_script:
driver.execute_script('return performance.timing.domComplete')
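For example, a minimal sketch (assuming ChromeDriver and an arbitrary example URL; each timing property is a Unix timestamp in milliseconds, so subtracting two of them gives a duration):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.example.com')

# Both values are millisecond Unix timestamps.
nav_start = driver.execute_script('return performance.timing.navigationStart')
dom_complete = driver.execute_script('return performance.timing.domComplete')
print('DOM complete after %d ms' % (dom_complete - nav_start))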
I'm trying to load one web page and get some elements from it. So the first thing I do is to check the page using "inspect element". When I search for the tags I'm looking for, I can see them (in Chrome).
But when I try to do driver.get(url) and then driver.find_element_by_..., it doesn't find those elements because they aren't in the source code.
I think this is probably because it doesn't load the whole page, only part of it.
Here is an example:
I'm trying to find ads on the web page.
from selenium import webdriver

PREPARED_TABOOLA_BLOCK = """//div[contains(@id, 'taboola') and not(ancestor::div[contains(@id, 'taboola')])]"""

driver = webdriver.PhantomJS(service_args=["--load-images=false"])
# driver = webdriver.Chrome()
driver.maximize_window()

def find_taboola_blocks_selenium(url):
    driver.get(url)
    taboola_blocks = driver.find_elements_by_xpath(PREPARED_TABOOLA_BLOCK)
    return taboola_blocks

print(len(find_taboola_blocks_selenium('http://www.breastfeeding-problems.com/breastfeeding-a-sick-baby.html')))

driver.get('http://www.breastfeeding-problems.com/breastfeeding-a-sick-baby.html')
print(len(driver.page_source))
OUTPUTS:
Using PhantomJS:
0
85103
Using ChromeDriver:
3
420869
Do you know how to make PhantomJS load as much HTML as possible, or is there any other way to solve this?
Can you compare the request that ChromeDriver is making versus the request you are making in PhantomJS? Since you are only doing a GET for the specified URL, you may not be including other request parameters that are needed to get the advertisements.
The open() method may give you a better representation of what you are looking for here: http://phantomjs.org/api/webpage/method/open.html
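One way to see what PhantomJS is actually requesting (a sketch; GhostDriver exposes network traffic through Selenium's 'har' log type, which ChromeDriver doesn't support, so on the Chrome side you'd inspect the Network tab instead):

import json
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('http://www.breastfeeding-problems.com/breastfeeding-a-sick-baby.html')

# GhostDriver packs a HAR archive into the first log entry's message.
har = json.loads(driver.get_log('har')[0]['message'])
for entry in har['log']['entries']:
    print(entry['request']['url'])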
The reason for this is that PhantomJS, by default, renders in a really small window, which makes it load the mobile version of the site. And with the PhantomJSDriver, calling maximizeWindow() (or maximize_window() in Python) does absolutely nothing, since there is no rendered window to maximize. You will have to explicitly set the window's render size.
edit: Below is the Java solution. I'm not entirely sure what the Python solution would be when setting the window size, but it should be similar.
driver.manage().window().setSize(new Dimension(1920, 1200));
edit again: Found the python version:
driver.set_window_size(1920, 1200)
Hope that helps!
PhantomJS 1.x is a really old browser. By default it only uses SSLv3 (now disabled on most sites) and doesn't implement most cutting-edge functionality.
Advertisement scripts are usually delivered over HTTPS (SSLv3/TLS) and often use some obscure JavaScript feature which is not well tested or simply not implemented in PhantomJS.
If you use PhantomJS < v1.9.8, then you should use these command-line options (service_args): --ignore-ssl-errors=true --ssl-protocol=any.
If iframes or strange cross-domain requests are necessary for the page/ads to work, then add --web-security=false to the service_args.
If this still doesn't solve the problem, then try upgrading to PhantomJS 2.0.0. You might need to compile it yourself on Linux.
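Putting those switches together in Python (a sketch; include only the ones your situation needs):

from selenium import webdriver

driver = webdriver.PhantomJS(service_args=[
    "--ignore-ssl-errors=true",  # PhantomJS < v1.9.8
    "--ssl-protocol=any",        # PhantomJS < v1.9.8
    "--web-security=false",      # only if iframes/cross-domain requests are needed
])
driver.set_window_size(1920, 1200)  # also avoids the mobile layout (see above)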
I want to write a Chrome extension which gets the page source, and I have found some references (1, 2) on how to do it. However, the end code that would be using this source is in Python. Is there any way I could write a Chrome extension and call its methods from Python?
Note:
I have tried using Selenium to get the browser's source. However, I'm stuck when the page doesn't stop loading: there is a bug in Selenium which prevents it from doing anything while the page is still loading. The browser never returns control to Selenium, so I'm trying alternate methods.
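For what it's worth, one partial workaround is to set a page-load timeout and read the source after catching the timeout. A sketch; this may not help if the driver itself hangs:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.set_page_load_timeout(10)  # stop waiting after 10 seconds
try:
    driver.get('https://www.bvl.com.pe/mercado/movimientos-diarios')
except TimeoutException:
    pass  # the page kept loading; take whatever has rendered so far
html = driver.page_source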
I'd like to use Python to scrape the contents of the "Were you looking for these authors:" box on web pages like this one: http://academic.research.microsoft.com/Search?query=lander
Unfortunately the contents of the box get loaded dynamically by JavaScript. Usually in this situation I can read the JavaScript to figure out what's going on, or I can use a browser extension like Firebug to figure out where the dynamic content is coming from. No such luck this time... the JavaScript is pretty convoluted and Firebug doesn't give many clues about how to get at the content.
Are there any tricks that will make this task easy?
Instead of trying to reverse engineer it, you can use ghost.py to directly interact with JavaScript on the page.
If you run the following query in the Chrome console, you'll see it returns everything you want.
document.getElementsByClassName('inline-text-org');
Returns
[<div class="inline-text-org" title="University of Manchester">University of Manchester</div>,
<div class="inline-text-org" title="University of California Irvine">University of California ...</div>
etc...
You can run JavaScript through python in a real life DOM using ghost.py.
This is really cool:
from ghost import Ghost

ghost = Ghost()
page, resources = ghost.open('http://academic.research.microsoft.com/Search?query=lander')
result, resources = ghost.evaluate(
    "document.getElementsByClassName('inline-text-org');")
A very similar question was asked earlier here.
The answer quoted there is Selenium, originally a testing environment for web apps.
I usually use Chrome's developer tools, which IMHO already give even more detail than Firefox's.
For scraping dynamic content, you need not a simple scraper but a full-fledged headless browser.
dhamaniasad/HeadlessBrowsers: A list of (almost) all headless web browsers in existence is the fullest list of these that I've seen; it lists which languages each has bindings for.
(Note that more than a few of the listed projects are abandoned!)
I'm ready to write a program to analyze some statistics in webpages, and I found many Python libraries for HTML DOM parsing, such as html5lib and Beautiful Soup.
However, I find it hard to access browser objects, like the window object, from Python.
Is it possible to use Python to fetch browser objects just like JavaScript does?
e.g. window.location
Any ideas?
Thanks
It's impossible. The window object is available only to JavaScript code running inside a browser, but you are parsing an HTML file in a Python script. Maybe you should ask a more specific question explaining in more detail what you are actually trying to achieve.
Maybe you can try PyQt's QtWebKit; it can access and evaluate JavaScript.
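A minimal sketch of that approach, assuming PyQt4's QtWebKit module (PyQt5 dropped it in favor of QtWebEngine, so the imports differ there):

import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    """Load a URL, run its JavaScript, and keep the rendered result."""
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # blocks until _finished() calls quit()

    def _finished(self, ok):
        # Browser objects such as window.location are reachable from here.
        self.location = self.mainFrame().evaluateJavaScript('window.location.href')
        self.html = self.mainFrame().toHtml()
        self.app.quit()

page = Render('http://www.example.com')
print(page.location)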