Selenium Chrome web driver inconsistently executes JS scripts on webpages - python

I'm trying to scrape articles on PubChem, such as this one. PubChem requires browsers to have JavaScript enabled, or else it redirects to a page with virtually no content that says "This application requires Javascript. Please turn on Javascript in order to use this application". To get around this, I used the Chrome web driver from the Selenium library to obtain the HTML that PubChem generates with JavaScript.
And it does that about half the time. The other half of the time it fails to render the full HTML and redirects to the JavaScript warning page. How do I make the script retrieve the JS-rendered version of the site consistently?
I've also tried to get around this with PhantomJS, but PhantomJS simply does not work on my machine even after installation.
from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Chrome('/Users/user/Documents/chromedriver')
url = "https://pubchem.ncbi.nlm.nih.gov/compound/"
browser.get(url)
# Grab the body HTML after Chrome has executed the page's JavaScript.
innerHTML = browser.execute_script("return document.body.innerHTML")
# html5lib must be installed for BeautifulSoup to use it as the parser.
soup = BeautifulSoup(innerHTML, "html5lib")
There are no error messages whatsoever. The only issue is that the scraper sometimes cannot obtain the JS-rendered webpage as expected. Thank you so much!

Answering my own question because why not.
You need to quit your browser with
browser = webdriver.Chrome('/Users/user/Documents/chromedriver')
# ... scraping work ...
browser.quit()
and to do so right after the last operation that involves the browser; otherwise the browser cache can affect your output on subsequent runs of the script.
Hope that whoever has this issue finds this helpful!
UPDATE EDIT:
So closing the browser does increase the frequency of success, but doesn't make it consistent. Another thing that helped it work more often was running
sudo purge
in the terminal (macOS; it flushes the disk cache). I'm still not getting consistent results, however. If anyone has an idea of how to do this without brute force (i.e. opening and closing the WebDriver until it renders the proper page), please let me know! Many thanks
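One less brute-force pattern worth trying: instead of restarting the WebDriver blindly, use Selenium's explicit waits to poll for an element that only the fully rendered page contains. A minimal sketch, where the id "main-content" is a placeholder for whatever the real rendered PubChem page exposes:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome('/Users/user/Documents/chromedriver')
try:
    browser.get("https://pubchem.ncbi.nlm.nih.gov/compound/")
    # Wait up to 15 seconds for an element that only the fully rendered
    # page contains; "main-content" is a placeholder id.
    WebDriverWait(browser, 15).until(
        EC.presence_of_element_located((By.ID, "main-content"))
    )
    innerHTML = browser.execute_script("return document.body.innerHTML")
finally:
    browser.quit()
If the wait times out, the JS warning page was probably served, and only then does restarting the browser make sense.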

Related

Getting reCAPTCHA result when scraping eBay sold page

I'm trying to get some information from the total-sold page of an item on eBay (because the API request for that is not available anymore).
I've tried both bs4 (Beautiful Soup) and Selenium, but I get a reCAPTCHA result instead of the page's content itself.
Any help with that issue?
Thanks!
Using vanilla Selenium generally leads to CAPTCHAs and blocks, because the browser does not hide the fact that it is being automated.
I suggest trying out undetected_chromedriver - a slightly modified version of Selenium's chromedriver that circumvents most automated bot-detection protocols.
From the project's description:
[undetected_chromedriver is an...] optimized Selenium Chromedriver patch which does not trigger anti-bot services like Distill Network / Imperva / DataDome / Botprotect.io
I've used it quite a bit and it almost always works exactly as intended.
Here's some code to get you started:
import undetected_chromedriver.v2 as uc

# Add a Chrome option to disable popup blocking.
options = uc.ChromeOptions()
options.add_argument("--disable-popup-blocking")
driver = uc.Chrome(options=options)
driver.get('https://amazon.com/')
You can visit the project's GitHub for more information. Keep in mind that you'll need slightly different syntax than regular Selenium when using certain functions.
Also, you don't need to download chromedriver.exe, as the module automatically downloads the latest version.

My Python web scraper using Selenium doesn't work every time. What is wrong?

So I was just experimenting with web scrapers and got a basic piece of code that just opens a webpage using Selenium and the Chrome driver. However, I have had this issue since I started: it won't work every time. Sometimes the webpage loads all the way to the Amazon home page; other times it fails to load properly.
I found that removing the trailing slash from the URL and changing 'https' to 'http' made it work almost every other time. This is my code:
from selenium import webdriver
url = "http://amazon.com/"
PATH = r"C:\Users\tyler\chromedriver.exe"
browser = webdriver.Chrome(PATH)
browser.get(url)
Please help if you can. I have friends who are doing the same thing and having no problems at all.
Thank you.
EDIT
There are instances of chromedriver.exe left running even after working attempts. Not sure if that affects anything.
I think the issue is your connectivity. The mere fact that Selenium is able to open a web browser means Selenium itself is working.
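Independent of connectivity, the leftover chromedriver.exe processes mentioned in the edit are worth ruling out: if the script never calls browser.quit(), each run leaves a driver process behind, and stale instances can interfere with later runs. A minimal sketch that guarantees cleanup, reusing the values from the question:
from selenium import webdriver

url = "http://amazon.com/"
PATH = r"C:\Users\tyler\chromedriver.exe"
browser = webdriver.Chrome(PATH)
try:
    browser.get(url)
    # ... scraping work ...
finally:
    # Quit even on failure so no chromedriver.exe is left running.
    browser.quit()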

Selenium, PhantomJS & Puppeteer: Your browser does not support iframe

Long story short, all I am trying to do is scrape the contents of a certain page. Unfortunately, the specific info I need on that page is inside an iframe, and I have tried several headless browser options, all of which yield the same response: HTML displaying
<iframe>Your browser does not support iframe</iframe>
In Python I have tried both Selenium (even with the --web-security=no & --disable-web-security flags) and PhantomJS (so I know it's not JavaScript related), and in NodeJS I've tried Puppeteer, none of which works...
Is there anything else out there I can try that may work?
Also, no, a direct GET request is useless, because the page detects that it's not a real user and loads nothing at all regardless of user-agent and so on, so I really need a browser-based solution, preferably headless.
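One thing worth ruling out before blaming bot detection, in case it helps: Selenium's page_source only shows the top-level document, so the iframe fallback text can appear simply because the scraper never switched into the frame. A minimal sketch, with a placeholder URL:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/page-with-iframe")  # placeholder URL
# page_source only shows the top-level document; switch into the frame
# to read the iframe's own DOM.
frame = driver.find_element_by_tag_name("iframe")
driver.switch_to.frame(frame)
print(driver.page_source)  # the frame's HTML, if the site served it
driver.switch_to.default_content()
driver.quit()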

Website always flags it as using an outdated browser

I am trying to scrape the site https://anichart.net/ in order to build a schedule from its information. The problem is that the site always detects an outdated browser (it shows a link to http://outdatedbrowser.com).
<noscript><div class="noscript">We're sorry but AniChart requires Javascript.<br>Please enable Javascript or <a href="http://outdatedbrowser.com">upgrade to a modern web browser</a>.</div></noscript>
<div class="noscript modern-browser" style="display: none">Sorry, AniChart requires a modern browser.<br>Please <a href="http://outdatedbrowser.com">upgrade to a newer web browser</a>.</div>
I have tried a regular request and have also tried forcing the user agent, shown below.
import requests

url = 'https://anichart.net/Winter-2019'
# Force a Chrome user agent; the server still returns the JS-only shell.
headers = {'User-Agent': 'Chrome/72.0.3626.109'}
page = requests.get(url, headers=headers)
print(page.content)
I understand that the site uses JavaScript, and that the requests module won't see the JavaScript-generated portion of the site unless I use other tools with it, or potentially Selenium. My browsers are up to date, so this should not be returning an outdated-browser result.
This was working just fine a few days ago, but they did just update their site, so they may have added something that prevents automated requests.
Edit:
Selenium code below:
from selenium import webdriver

url = 'https://anichart.net/Winter-2019'
website = webdriver.Chrome()
website.get(url)
print(website.page_source)  # full page source after Chrome loads the URL

# Body HTML after the site's JavaScript has run
html_after_JS = website.execute_script("return document.body.innerHTML")
print(html_after_JS)
The problem is not the browser detection.
requests simply does not render JavaScript (as you seem to know already), and most sites nowadays use front-end JavaScript libraries to render their content. Some sites also use JavaScript detection to prevent bots from scraping their pages...
You instead need a tool like Selenium, which opens a real, "modern" browser of your choice (optionally headless) and lets you scrape the page from there; the Selenium code in your edit is that approach, and a headless variant is sketched below.
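For reference, a headless variant of that Selenium code, assuming a recent Chrome/chromedriver pair that supports the --headless switch:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # render the page without opening a window
driver = webdriver.Chrome(options=options)
driver.get('https://anichart.net/Winter-2019')
html = driver.page_source  # HTML after the site's JavaScript has run
driver.quit()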
Or, better yet, they have an API - https://github.com/AniList/ApiV2-GraphQL-Docs
The AniList & AniChart websites themselves run on the API, so everything you can do on the sites, you can do via the API.
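For illustration, a season query against that API might look like the following; the endpoint https://graphql.anilist.co comes from the linked docs, but treat the exact query fields as assumptions to verify against the schema:
import requests

# One GraphQL request replaces the whole scraping setup.
query = """
query ($season: MediaSeason, $year: Int) {
  Page(perPage: 10) {
    media(season: $season, seasonYear: $year, type: ANIME) {
      title { romaji }
    }
  }
}
"""
variables = {"season": "WINTER", "year": 2019}
resp = requests.post("https://graphql.anilist.co",
                     json={"query": query, "variables": variables})
for show in resp.json()["data"]["Page"]["media"]:
    print(show["title"]["romaji"])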

Missing part of HTML when getting HTML with requests in Python [duplicate]

I need to scrape a site with Python. I obtain the HTML source with the urllib module, but I also need to scrape some HTML that is generated by a JavaScript function (which is included in the HTML source). What this function does on the site is that when you press a button, it outputs some HTML. How can I "press" this button with Python code? Can Scrapy help me? I captured the POST request with Firebug, but when I try to pass it on the URL I get a 403 error. Any suggestions?
In Python, I think Selenium 1.0 is the way to go. It’s a library that allows you to control a real web browser from your language of choice.
You need to have the web browser in question installed on the machine your script runs on, but it looks like the most reliable way to programmatically interrogate websites that use a lot of JavaScript.
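A minimal sketch of that approach with today's Selenium WebDriver API, using a placeholder URL and button id:
from selenium import webdriver

driver = webdriver.Firefox()  # the browser must be installed locally
driver.get('http://example.com/page')  # placeholder URL
# Click the button that runs the JavaScript function; the id is a
# placeholder for whatever the real page uses. A short wait may be
# needed before reading page_source.
driver.find_element_by_id('show-more-button').click()
html = driver.page_source  # the DOM now includes the JS-generated HTML
driver.quit()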
Since there is no comprehensive answer here, I'll go ahead and write one.
To scrape JS-rendered pages, we need a browser that has a JavaScript engine (i.e., one that supports JavaScript rendering).
Options like Mechanize and urllib2 will not work, since they DO NOT support JavaScript.
So here's what you do:
Set up PhantomJS to run with Selenium. After installing the dependencies for both of them, you can use the following code as an example to fetch the fully rendered website.
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('http://jokes.cc.com/')
# page_source fetches the page after rendering is complete
soupFromJokesCC = BeautifulSoup(driver.page_source, 'html.parser')
driver.save_screenshot('screen.png')  # save a screenshot to disk
driver.quit()
I have had to do this before (in .NET): you basically have to host a browser, get it to click the button, and then interrogate the browser's DOM (Document Object Model) to get at the generated HTML.
This is definitely one of the downsides of web apps moving toward an Ajax/JavaScript approach to generating HTML client-side.
I use WebKit, which is the browser renderer behind Chrome and Safari. There are Python bindings to WebKit through Qt, which let you execute JavaScript and extract the final HTML; a sketch follows.
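A sketch of that Qt/WebKit pattern, assuming PyQt4 with the QtWebKit module (long deprecated; QtWebKit was dropped from later Qt releases):
import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    """Load a URL in an off-screen WebKit page and keep the rendered frame."""
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._load_finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # block until loadFinished fires

    def _load_finished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

html = Render('http://example.com').frame.toHtml()  # final HTML after JS ran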
For Scrapy (a great Python scraping framework) there is scrapyjs: an additional downloader/middleware handler able to scrape JavaScript-generated content.
It's based on the WebKit engine via pygtk, python-webkit, and python-jswebkit, and it's quite simple.
