I am trying to scrape a site that attempts to block scraping. Viewing the source code through Chrome, requests, or requests_html does not show the correct source code.
Here is an example:
from requests_html import HTMLSession
session = HTMLSession()
content = session.get('website')
content.html.render()
print(content.html.html)
It gives this page:
It looks like JavaScript is disabled or not supported by your browser.
JavaScript is enabled, though, and the same thing happens when I view the page source in an actual browser.
However, in my actual browser, when I open Inspect Element, I can see the correct source code just fine. Is there a way to extract the HTML source shown in Inspect Element?
Thanks!
The issue you are facing is that the page is rendered by JavaScript on the front end. In this case, you need a JavaScript-enabled browser engine, and then you can easily read the rendered HTML source.
Here's working code showing how I would do it (using Selenium):
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
driver = webdriver.Chrome(chrome_options=chrome_options)
# Ensure that the full URL path is given
URL = 'https://proper_url'
# The following step will launch a browser.
driver.get(URL)
# Now you can easily read the source HTML
HTML = driver.page_source
You will have to figure out the details of installing and setting up Selenium and the webdriver. Here's a good place to start.
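If you would rather not have a visible browser window pop up, Chrome can also be run headless. This is only a minimal sketch on top of the snippet above; the exact headless flag can vary between Chrome versions, so treat it as an assumption to verify:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# Run Chrome without opening a visible window (flag name assumed; newer
# Chrome versions may prefer "--headless=new").
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get('https://proper_url')
HTML = driver.page_source
driver.quit()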
So I'm trying to learn Selenium for automated testing. I have the Selenium IDE and the WebDrivers for Firefox and Chrome, both of which are in my PATH, on Windows. I've been able to get basic testing working, but this part of the testing is eluding me. I've switched to using Python because the IDE doesn't have enough features; you can't even click the back button.
I'm pretty sure this has been answered elsewhere but none of the recommended links provided an answer that worked for me. I've searched Google and YouTube with no relevant results.
I'm trying to find every link on a page, which I've been able to accomplish, even listing them; I would think this would be a fairly standard test. I even got it to PRINT the text of each link, but when I try to click the link it doesn't work. I've tried waits of various sorts, including
visibility_of_any_elements_located and time.sleep(5), to wait before trying to click the link.
I've also tried clicking the link after the wait with self.driver.find_element(By.LINK_TEXT, ("lnktxt")).click(), but none of that works. The code below does work in the sense that it lists the URL text, the URL, and the URL text again, via a variable.
I guess I'm not sure how to get a variable into the By.LINK_TEXT or ...by_link_text call, assuming that would work. I figured that once I got the link text into a variable I could reuse it. That worked for print() but not for click().
I basically want to be able to load a page, list all links, click a link, go back and click the next link, etc.
The only post this site recommended that might be helpful was...
How can I test EVERY link on the WEBSITE with Selenium
But it's Java based and I've been trying to learn Python for the past month so I'm not ready to learn Java just to make this work. The IDE does not seem to have an easy option for this, or from all my searches it's not documented well.
Here is my current Selenium code in Python.
import pytest
import time
import json
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

wait_time_out = 15

class TestPazTestAll2():
    def setup_method(self, method):
        self.driver = webdriver.Firefox()
        self.vars = {}

    def teardown_method(self, method):
        self.driver.quit()

    def test_pazTestAll(self):
        self.driver.get('https://poetaz.com/poems/')
        lnks = self.driver.find_elements_by_tag_name("a")
        print("Total Links", len(lnks))
        # traverse the list of links
        for lnk in lnks:
            # get_attribute() to read the link text and href
            print(lnk.get_attribute("text"))
            lnktxt = lnk.get_attribute("text")
            print(lnk.get_attribute("href"))
            print(lnktxt)
Again, I'm sure I missed something in my searches but after hours of searching I'm reaching out.
Any help is appreciated.
I basically want to be able to load a page, list all links, click a link, go back and click the next link, etc.
I don't recommend doing this. Driving the browser with Selenium is slow, and you're not really using the browser for anything that actually needs one here.
What I recommend is simply sending requests to those scraped links and asserting response status codes.
import requests

link_elements = self.driver.find_elements_by_tag_name("a")
urls = map(lambda l: l.get_attribute("href"), link_elements)
for url in urls:
    response = requests.get(url)
    assert response.status_code == 200
(You also might need to prepend some base url to those strings found in href attributes.)
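If you do want to drive the browser the way the question describes (click a link, go back, click the next one), one pattern that avoids stale-element errors is to collect the href values first and then navigate to each one with get() and back(). This is just a rough sketch, assuming self.driver is already on the listing page:
# Collect hrefs up front; the WebElement objects go stale once you navigate away.
link_elements = self.driver.find_elements_by_tag_name("a")
urls = [lnk.get_attribute("href") for lnk in link_elements if lnk.get_attribute("href")]
for url in urls:
    self.driver.get(url)   # visit the link target (equivalent to clicking it)
    # ...do whatever checks you need on the target page here...
    self.driver.back()     # return to the listing page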
I am trying to learn data scraping using Python and have been using the Requests and BeautifulSoup4 libraries. It works well for normal HTML websites, but when I tried to get some data out of websites where the data loads after some delay, I found that I get an empty value. An example would be:
from bs4 import BeautifulSoup
from operator import itemgetter
from selenium import webdriver
url = "https://www.example.com/;1"
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
a = soup.find('span', 'buy')
print(a)
I am trying to grab this value from the page:
(value)
I have already referred to a similar topic and tried writing my code along the same lines as the solution provided there, but somehow it doesn't seem to work. I am a novice here, so I need help getting this to work.
How to scrape html table only after data loads using Python Requests?
The table (content) is probably generated by JavaScript and thus can't be "seen". I am using python3.6 / PhantomJS / Selenium as proposed by a lot of answers here.
You have to run a headless browser to scrape content that loads after a delay. Please use Selenium.
Here is sample code; it uses the Chrome browser as the driver:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.Chrome(<chromedriver path here>)
browser.set_window_size(1120, 550)
browser.get(link)
element = WebDriverWait(browser, 3).until(
    EC.presence_of_element_located((By.ID, "blabla"))
)
data = element.get_attribute('data-blabla')
print(data)
browser.quit()
You can also access the desired values by requesting them directly from the API and analyzing the JSON response.
import requests
import json
res = requests.get('https://api.example.com/api/')
d = json.loads(res.text)
print(d['market'])
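In practice you usually find such an endpoint in the browser's developer tools (Network tab) while the page loads, and some endpoints also expect browser-like headers. A hedged variant of the snippet above, where the URL, headers, and the 'market' key are only placeholders:
import requests
# Placeholder URL and headers; use whatever the Network tab shows for your site.
headers = {'User-Agent': 'Mozilla/5.0', 'Accept': 'application/json'}
res = requests.get('https://api.example.com/api/', headers=headers)
res.raise_for_status()   # fail loudly if the request is blocked or broken
d = res.json()           # same result as json.loads(res.text)
print(d['market'])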
When I view the source HTML after manually navigating to the site in Chrome, I can see the full page source, but when I load the page via Selenium I don't get the complete page source.
from bs4 import BeautifulSoup
from selenium import webdriver
import sys,time
driver = webdriver.Chrome(executable_path=r"C:\Python27\Scripts\chromedriver.exe")
driver.get('http://www.magicbricks.com/')
driver.find_element_by_id("buyTab").click()
time.sleep(5)
driver.find_element_by_id("keyword").send_keys("Navi Mumbai")
time.sleep(5)
driver.find_element_by_id("btnPropertySearch").click()
time.sleep(30)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"lxml")
print soup.prettify()
The website is possibly blocking or restricting the user agent for selenium. An easy test is to change the user agent and see if that does it. More info at this question:
Change user agent for selenium driver
Quoting:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
opts = Options()
opts.add_argument("user-agent=whatever you want")
driver = webdriver.Chrome(chrome_options=opts)
Try something like:
import time
time.sleep(5)
content = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
instead of driver.page_source.
Dynamic web pages often need to be rendered by JavaScript before their full content is present in the DOM.
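If the fixed sleep feels fragile, an explicit wait continues as soon as the content is actually present. A small sketch, assuming driver already exists and using a placeholder element ID:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Wait up to 10 seconds for a known element instead of sleeping a fixed time.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "some-element-id"))  # placeholder ID
)
content = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")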
I am having a strange issue with PhantomJS, or maybe I am just a newbie. I am trying to log in on NewEgg.com via Selenium using PhantomJS, with Python. The issue is that when I use Firefox as the driver it works well, but as soon as I set PhantomJS as the driver it does not go to the next page and instead gives this message:
Exception Message: u'{"errorMessage":"Unable to find element with id \'UserName\'","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Length":"89","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:55372","User-Agent":"Python-urllib/2.7"},"httpVersion":"1.1","method":"POST","post":"{\\"using\\": \\"id\\", \\"sessionId\\": \\"aaff4c40-6aaa-11e4-9cb1-7b8841e74090\\", \\"value\\": \\"UserName\\"}","url":"/element","urlParsed":{"anchor":"","query":"","file":"element","directory":"/","path":"/element","relative":"/element","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/element","queryKey":{},"chunks":["element"]},"urlOriginal":"/session/aaff4c40-6aaa-11e4-9cb1-7b8841e74090/element"}}' ; Screenshot: available via screen
After taking a screenshot, I found that PhantomJS could not navigate to the page and the script just finished. How do I sort this out? The code snippet I tried is given below:
import requests
from bs4 import BeautifulSoup
from time import sleep
from selenium import webdriver
import datetime
my_username = "user#mail.com"
my_password = "password"
driver = webdriver.PhantomJS('/Setups/phantomjs-1.9.7-macosx/bin/phantomjs')
firefox_profile = webdriver.FirefoxProfile()
#firefox_profile.set_preference('permissions.default.stylesheet', 2)
firefox_profile.set_preference('permissions.default.image', 2)
firefox_profile.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', 'false')
#driver = webdriver.Firefox(firefox_profile)
driver.set_window_size(1120, 550)
driver.get('http://newegg.com')
driver.find_element_by_link_text('Log in or Register').click()
driver.save_screenshot('screen.png')
I even put sleep but it is not making any difference.
I experienced this with PhantomJS when the content type of the second page was not correct. A normal browser just interprets the content dynamically, but PhantomJS simply dies, silently.
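Since PhantomJS also announces itself with its own user agent, some sites serve it a different (or empty) page. One thing worth trying, sketched here under the assumption that the phantomjs.page.settings.userAgent capability and the UserName id apply to your page, is to present a regular desktop user agent and wait explicitly for the login form:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Make PhantomJS identify itself as a regular desktop browser (assumed capability key).
caps = {'phantomjs.page.settings.userAgent':
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0 Safari/537.36'}
driver = webdriver.PhantomJS('/Setups/phantomjs-1.9.7-macosx/bin/phantomjs',
                             desired_capabilities=caps)
driver.get('http://newegg.com')
driver.find_element_by_link_text('Log in or Register').click()
# Wait for the login form instead of assuming it is already there.
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'UserName')))
driver.save_screenshot('screen.png')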
I need to download the whole content of HTML pages: images, CSS, JS.
First option:
Download the page with urllib or requests,
extract the page info with Beautiful Soup or lxml,
download all linked assets, and
edit the links in the original page to be relative.
Disadvantages
Multiple steps.
The downloaded page will never be identical to the remote page, possibly due to JS or AJAX content.
Second option
Some authors recommend automating the web browser to download the page, so that the JavaScript and AJAX are executed before the download:
scraping ajax sites and java script
I want to use this option.
First attempt
So I copied this piece of Selenium code to do two steps:
Open the URL in the Firefox browser.
Download the page.
The code
import os
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2)
profile.set_preference('browser.download.manager.showWhenStarting', False )
profile.set_preference('browser.download.dir', os.environ["HOME"])
profile.set_preference("browser.helperApps.alwaysAsk.force", False )
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/html,text/webviewhtml,text/x-server-parsed-html,text/plaintext,application/octet-stream')
browser = webdriver.Firefox(profile)
def open_new_tab(url):
    ActionChains(browser).send_keys(Keys.CONTROL, "t").perform()
    browser.get(url)
    return browser.current_window_handle

# call the function
open_new_tab("https://www.google.com")
# Result: the browser is opened at the given URL, no download occurs
Result
Unfortunately, no download occurs; it just opens the browser at the URL provided (first step).
Second attempt
I thought of downloading the page with a separate function, so I added this function.
The function added
def save_current_page():
    ActionChains(browser).send_keys(Keys.CONTROL, "s").perform()

# call the function
open_new_tab("https://www.google.com")
save_current_page()
Result
# Nothing more happens; the browser is opened at the given URL, no download occurs.
Question
How can I automate downloading web pages with Selenium?
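Ctrl+S opens the browser's native save dialog, which Selenium cannot drive, so a common workaround is to grab driver.page_source (taken after the JavaScript has run) and write it to a file yourself. A minimal sketch of that idea, reusing the browser and open_new_tab from the code above; note it saves only the rendered HTML, not the image/CSS/JS assets:
def save_current_page(filename="page.html"):
    # page_source reflects the DOM after JavaScript has executed.
    html = browser.page_source
    with open(filename, "w", encoding="utf-8") as f:
        f.write(html)

# call the functions
open_new_tab("https://www.google.com")
save_current_page("google.html")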