Long story short, all I am trying to do is scrape the contents of a certain page. Unfortunately, the specific info I need on that page sits inside an iframe, and every headless browser option I have tried yields the same response, the HTML displaying:
<iframe>Your browser does not support iframe</iframe>
In Python I have tried both Selenium (even with the --web-security=no and --disable-web-security flags) and PhantomJS (so I know it's not a JavaScript issue), and in Node.js I've tried Puppeteer; none of them work...
Is there anything else out there I can try that may work?
Also, no, a direct GET request is useless because the page detects that it isn't a real user and loads nothing at all, regardless of user agent and so on, so I really need a browser solution, preferably one that can run headless.
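For context, the usual Selenium approaches to iframe content are to switch the driver into the frame or to load the frame's own src URL directly. This is only a minimal sketch, assuming headless Chrome, a placeholder URL, and a generic frame locator:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Placeholder URL used purely for illustration.
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://example.com/page-with-iframe")

frame = driver.find_element(By.TAG_NAME, "iframe")
frame_src = frame.get_attribute("src")  # the frame's own URL, if it has one

# Option 1: switch the driver's context into the iframe and read its DOM.
driver.switch_to.frame(frame)
inner_html = driver.page_source
driver.switch_to.default_content()

# Option 2: load the iframe's URL directly and scrape it as a normal page.
if frame_src:
    driver.get(frame_src)

driver.quit()
```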
Are there any alternatives to Selenium that don't require a web driver or browser to operate? I recently moved my code over to a Google Cloud VM instance, and when I run it there I get multiple errors. I've been trying to get it to work for hours with no luck (PhantomJS, Chrome and GeckoDriver all fail; I've tried re-downloading the browsers, editing the sources.list file, etc.).
The page I'm scraping uses JavaScript to load in numbers, which is why I initially chose Selenium. Everything else works perfectly though!
You could simply use the requests library.
https://requests.readthedocs.io/en/master/
https://anaconda.org/anaconda/requests
You would then need to send a GET or POST request to the server.
If you do not know how to generate a proper POST request, simply try to "record" it.
If you have Chrome, go to the page you want, press F12, open the "Network" tab and type method:POST into the filter.
Further info here:
https://stackoverflow.com/a/39661536/11971785
At first it is a bit more confusing than Selenium, but once you understand it, it's far better in my opinion.
Also, the values the page renders with JavaScript can usually be read straight out of the JavaScript (or JSON) returned by your request.
No web driver or anything else required, and it's a lot more stable and customizable.
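As a rough illustration of the record-and-replay idea: once you've copied the request details from the Network tab, replaying them with requests looks something like this (the URL, headers and payload below are placeholders, not the real site's):

```python
import requests

# Placeholder values; copy the real URL, headers and form/JSON payload
# from the recorded request in Chrome's Network tab.
url = "https://example.com/api/data"
headers = {"User-Agent": "Mozilla/5.0", "Referer": "https://example.com/"}
payload = {"query": "something"}

session = requests.Session()
response = session.post(url, headers=headers, data=payload)
response.raise_for_status()

print(response.status_code)
print(response.text)  # or response.json() if the server returns JSON
```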
I'm trying to scrape articles on PubChem, such as this one. PubChem requires browsers to have JavaScript enabled, or else it redirects to a page with virtually no content that says "This application requires Javascript. Please turn on Javascript in order to use this application". To get around this, I used the Chrome web driver from the Selenium library to obtain the HTML that PubChem generates with JavaScript.
It works about half the time. The other half, it does not render the full HTML and redirects to the JavaScript warning page instead. How do I make the script retrieve the JS-rendered version of the site consistently?
I've also tried to overcome this issue by using PhantomJS, except PhantomJS somehow does not work on my machine after installation.
from bs4 import BeautifulSoup
from requests import get
from requests_html import HTMLSession
from selenium import webdriver
import html5lib

session = HTMLSession()

# Drive a real Chrome instance so PubChem's JavaScript can run.
browser = webdriver.Chrome('/Users/user/Documents/chromedriver')
url = "https://pubchem.ncbi.nlm.nih.gov/compound/"
browser.get(url)

# Pull the JS-rendered body out of the live DOM and parse it.
innerHTML = browser.execute_script("return document.body.innerHTML")
soup = BeautifulSoup(innerHTML, "html5lib")
There are no error messages whatsoever. The only issue is that the scraper sometimes fails to obtain the JS-rendered page as expected. Thank you so much!
Answering my own question because why not.
You need to quit your browser by
browser = webdriver.Chrome('/Users/user/Documents/chromedriver')
# stuff
browser.quit()
and do so right after the last operation that involves the browser; otherwise the browser cache can affect your output on subsequent runs of the script.
Hope that whoever has this issue finds this helpful!
UPDATE EDIT:
So closing the browser does increase the frequency of success, but doesn't make it consistent. Another thing that was helpful in making it work more frequently was running
sudo purge
in the terminal. I'm still not getting consistent results, however. If anyone has an idea of how to do it without using brute force (i.e. opening and closing the WebDriver until it renders the proper page), please let me know! Many thanks
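For anyone who does resort to the brute-force approach mentioned above, a minimal sketch of the retry loop might look like this (the success check on the soup is an assumption based on the warning-page text quoted in the question):

```python
from bs4 import BeautifulSoup
from selenium import webdriver

def fetch_rendered_page(url, max_attempts=5):
    """Open and close Chrome until the JS-rendered page comes back."""
    for attempt in range(max_attempts):
        browser = webdriver.Chrome('/Users/user/Documents/chromedriver')
        try:
            browser.get(url)
            inner_html = browser.execute_script("return document.body.innerHTML")
        finally:
            browser.quit()  # always release the browser between attempts

        soup = BeautifulSoup(inner_html, "html5lib")
        # Assumed success check: the JS warning page contains this text.
        if "This application requires Javascript" not in soup.get_text():
            return soup
    return None  # gave up after max_attempts tries
```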
I need to parse a page, keeping the HTML and JS the same as in my own browser. The site must think I am logged in from the same browser; I need to "press" some buttons via JS and find some elements.
When I use the requests library or selenium.webdriver.Firefox(), the site thinks I am coming from a new browser. But I think Selenium should be able to help.
Requests cannot process JavaScript, nor can it parse HTML and CSS to create a DOM. Requests is just a very nice abstraction around making HTTP requests to any server, but websites/browsers aren't the only things that use HTTP.
What you're looking for is a JavaScript engine along with an HTML and CSS parser so that it can create an actual DOM for the site and allow you to interact with it. Without these things, there'd be no way to tell what the DOM of the page would be, and so you wouldn't be able to click buttons on it and have the resulting JavaScript do what it should.
So what you're looking for is a web browser. There's just no way around it. Anything that does those things is, by definition, a web browser.
To clarify from one of your comments, just because something has a GUI, that doesn't mean it isn't automatic. In fact, that's exactly what Selenium is for (i.e. automating the interactions with the GUI that is the web page). It's not meant to emulate user behavior exactly 1:1, and it's actually an abstraction around the WebDriver protocol, which is meant for writing automated tests. However, it does allow you to interact with the webpage in a way that approximates how a user would interact with it.
You may not want to see the GUI of the browser, but luckily, Chrome and Firefox have "headless" modes, and Selenium can control headless instances of those browsers. This would have the browser GUI be hidden while Selenium controls it, which sounds like what you're looking for.
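Enabling headless mode is just an extra option when constructing the driver. A minimal sketch with Firefox (the URL and element id are placeholders, and option names can vary slightly between Selenium/browser versions):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("-headless")  # run Firefox without a visible window

driver = webdriver.Firefox(options=options)
driver.get("https://example.com")  # placeholder URL

# Interact with the page the same way a user would, just without a visible GUI.
driver.find_element(By.ID, "some-button").click()  # hypothetical element id
print(driver.page_source)

driver.quit()
```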
I am trying to scrape a dynamic-content (JavaScript) page with Python + Selenium + BS4, and the page blocks my requests at random (the software behind it might be F5 AMS).
I managed to bypass this by changing the user agent for each of the browsers I specified. The thing is, only the Chrome driver gets past the rejection. The same code, adjusted for the PhantomJS or Firefox drivers, is blocked constantly, as if I weren't changing the user agent at all.
I should also mention that I am multithreading, meaning I start 4 browsers at the same time.
Why does this happen? What does the Chrome WebDriver offer that gets it past the firewall when the others can't?
I really need to get this working because I want to switch to Firefox, so I want to make Firefox pass just as Chrome does. (The user-agent setup I'm describing is sketched below.)
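For reference, a per-browser user-agent override along the lines described above typically looks like this (an illustrative sketch; the user-agent string is a placeholder and exact option names can vary by Selenium version):

```python
from selenium import webdriver

ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # placeholder user-agent string

# Chrome: pass the user agent as a command-line switch.
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("user-agent=" + ua)
chrome = webdriver.Chrome(options=chrome_options)

# Firefox: override the preference in a profile.
profile = webdriver.FirefoxProfile()
profile.set_preference("general.useragent.override", ua)
firefox = webdriver.Firefox(firefox_profile=profile)
```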
Two words: browser fingerprinting. It's a huge topic in its own right and, as Tarun mentioned, would take a decent amount of research to nail this issue on the head. But it is possible, I believe.
Using mechanize, how can I wait for some time after page load (some websites have a timer before links appear, like in download pages), and after the links have been loaded, click on a specific link?
Since it's an anchor tag and not a submit button, will browser.submit() work (I got errors when I tried that)?
Mechanize does not offer JavaScript functionality, so you will not see dynamic content (like a timer that turns into a link).
As for clicking a link, you have to find the element, and then you can call click_link on it. See the Finding Links section of this site, and the short sketch below.
If you are looking for something to handle such sites, a good option is PhantomJS. It is scripted with JavaScript and runs on the WebKit engine, allowing you to parse dynamic content. If you have your heart set on Python, using Selenium to programmatically drive a real browser may be your best bet.
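A minimal sketch of finding and following a link with mechanize (the URL and link text are placeholders):

```python
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)               # mechanize obeys robots.txt by default
br.open("https://example.com/downloads")  # placeholder URL

# Find the anchor by its visible text (or use url_regex=...), then follow it.
link = br.find_link(text="Download")      # hypothetical link text
response = br.follow_link(link)
print(response.geturl())
```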
If it's an anchor tag, then just GET/POST whatever it is.
The timer before the links appear is generally done in JavaScript. Some sites you are attempting to scrape may not be usable without JavaScript, or may require a token generated in JavaScript with client-side math.
Depending on the site, you can either extract the wait time (in seconds or milliseconds) and time.sleep() for that long, or you'll have to use something that can execute JavaScript.
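As a small sketch of the first approach: if the page embeds the delay in something like a setTimeout call, you can pull the number out with a regex and sleep before re-fetching. The URL and the setTimeout pattern here are assumptions about how such a page might expose the timer:

```python
import re
import time
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
html = br.open("https://example.com/download-page").read().decode("utf-8")  # placeholder URL

# Assumed pattern: the page schedules the link with setTimeout(..., <milliseconds>).
match = re.search(r"setTimeout\([^,]+,\s*(\d+)\)", html)
if match:
    time.sleep(int(match.group(1)) / 1000.0)  # convert ms to seconds

# After the wait, re-open the page (or the revealed link) and continue scraping.
html = br.open("https://example.com/download-page").read().decode("utf-8")
```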