I am trying to use the Requests framework with Python (http://docs.python-requests.org/en/latest/), but the page I am trying to get uses JavaScript to fetch the info that I want.
I have tried to search the web for a solution, but since I am searching with the keyword "javascript", most of the results are about how to scrape with the JavaScript language.
Is there any way to use the Requests framework with pages that use JavaScript?
Good news: there is now a requests module that supports JavaScript: https://pypi.org/project/requests-html/
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://www.yourjspage.com')
r.html.render() # this call executes the js in the page
As a bonus, this wraps BeautifulSoup, I think, so you can do things like
r.html.find('#myElementID', first=True).text
which returns the content of the HTML element as you'd expect.
You are going to have to make the same request (using the Requests library) that the JavaScript is making. You can use any number of tools (including those built into Chrome and Firefox) to inspect the HTTP request coming from the JavaScript, and then simply make that request yourself from Python.
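For example, here is a minimal sketch of that approach, assuming the browser's Network tab revealed a JSON endpoint (the URL and parameters below are placeholders, not from the original page):

import requests

# Hypothetical example: suppose the Network tab shows the page's
# JavaScript fetching its data from a JSON endpoint like this one.
api_url = "https://www.example.com/api/data"  # placeholder, found via the Network tab

response = requests.get(
    api_url,
    params={"page": 1},  # copy whatever query parameters the browser sent
    headers={"User-Agent": "Mozilla/5.0"},  # some endpoints reject the default UA
)
response.raise_for_status()
data = response.json()  # many such endpoints return JSON directly
print(data)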
While Selenium might seem tempting and useful, it has one main problem that can't be fixed: performance. Because it computes every single thing a browser does, you will need a lot more power. Even PhantomJS does not compete with a simple request. I recommend using Selenium only when you really need to click buttons. If you only need JavaScript, I recommend PyQt (check https://www.youtube.com/watch?v=FSH77vnOGqU to learn it).
However, if you do want to use Selenium, I recommend Chrome over PhantomJS. Many users have problems with PhantomJS, where a website simply does not work in Phantom. Chrome can be headless (non-graphical) too!
First, make sure you have installed ChromeDriver, which Selenium depends on to drive Google Chrome.
Then, make sure you have Google Chrome version 60 or higher by checking chrome://settings/help in the URL bar.
Now, all you need to do is the following code:
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(chrome_options=chrome_options)
If you do not know how to use Selenium, here is a quick overview:
driver.get("https://www.google.com") #Browser goes to google.com
Finding elements:
Use either the ELEMENTS or ELEMENT method. Examples:
driver.find_element_by_css_selector("div.logo-subtext") # Find your country in Google. (singular)
driver.find_element(s)_by_css_selector(css_selector) # Every element that matches this CSS selector
driver.find_element(s)_by_class_name(class_name) # Every element with the given class
driver.find_element(s)_by_id(id) # Every element with the given ID
driver.find_element(s)_by_link_text(link_text) # Every link element with exactly this link text
driver.find_element(s)_by_partial_link_text(partial_link_text) # Every link element containing this partial link text
driver.find_element(s)_by_name(name) # Every element where name=argument
driver.find_element(s)_by_tag_name(tag_name) # Every element with the given tag name
Ok! I found an element (or elements list). But what do I do now?
Here are the methods you can call on an element elem:
elem.tag_name # Could return "button" for a <button> element.
elem.get_attribute("id") # Returns the ID of an element.
elem.text # The inner text of an element.
elem.clear() # Clears a text input.
elem.is_displayed() # True for visible elements, False for invisible elements.
elem.is_enabled() # True for an enabled input, False otherwise.
elem.is_selected() # Is this radio button or checkbox element selected?
elem.location # A dictionary representing the X and Y location of an element on the screen.
elem.click() # Click elem.
elem.send_keys("thelegend27") # Type thelegend27 into elem (useful for text inputs)
elem.submit() # Submit the form in which elem takes part.
Special commands:
driver.back() # Click the Back button.
driver.forward() # Click the Forward button.
driver.refresh() # Refresh the page.
driver.quit() # Close the browser including all the tabs.
foo = driver.execute_script("return 'hello';") # Execute javascript (COULD TAKE RETURN VALUES!)
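Putting those pieces together, a minimal sketch of a headless scrape might look like this (the URL and selector are placeholders, not from any particular site):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(chrome_options=chrome_options)

driver.get("https://www.example.com")  # placeholder URL; loads the page and runs its JS
elem = driver.find_element_by_css_selector("h1")  # grab the first <h1> on the page
print(elem.text)  # the rendered text of that element
driver.quit()  # always close the browser when done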
Using Selenium or JavaScript-enabled requests is slow. It is more efficient to find out which cookie is generated after the website checks for JavaScript in the browser, and then use that cookie with each of your requests.
In one example this worked through the following cookie:
The cookie generated after the JavaScript check in this example is "cf_clearance".
So simply create a session.
Update the cookie and headers as such:
s = requests.Session()
s.cookies["cf_clearance"] = "cb4c883efc59d0e990caf7508902591f4569e7bf-1617321078-0-150"
s.headers.update({
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
})
s.get(url)
and you are good to go; no need for a JavaScript solution such as Selenium. This is much faster and more efficient. You just have to get the cookie once after opening up the browser.
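If you don't want to copy the cookie by hand, one possible way (a sketch, with a placeholder URL) is to open a browser once with Selenium, harvest its cookies, and feed them into the requests session:

import requests
from selenium import webdriver

# One-time step: let a real browser pass the JavaScript check
# (the URL below is a placeholder).
driver = webdriver.Chrome()
driver.get("https://www.example.com")

# Copy the browser's cookies into a plain requests session.
s = requests.Session()
for cookie in driver.get_cookies():  # get_cookies() returns a list of dicts
    s.cookies.set(cookie["name"], cookie["value"])
driver.quit()

# From here on, requests alone carries the cookies; no browser needed.
print(s.get("https://www.example.com").status_code)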
One way to do that is to invoke your request by using Selenium.
Let's install dependencies by using pip or pip3:
pip install selenium
etc.
If you run the script by using python3,
use instead:
pip3 install selenium
(...)
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
url = 'http://myurl.com'
# Wait until the page is ready:
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.some_placeholder")))
text = element.text # <-- Here it is! I got what I wanted :)
Is it a wrapper around pyppeteer or something? :( I thought it was something different.
@property
async def browser(self):
    if not hasattr(self, "_browser"):
        self._browser = await pyppeteer.launch(ignoreHTTPSErrors=not(self.verify), headless=True, args=self.__browser_args)
    return self._browser
So I'm trying to learn Selenium for automated testing. I have the Selenium IDE and the WebDrivers for Firefox and Chrome, both of which are in my PATH on Windows. I've been able to get basic testing working, but this part of the testing is eluding me. I've switched to using Python because the IDE doesn't have enough features; you can't even click the back button.
I'm pretty sure this has been answered elsewhere but none of the recommended links provided an answer that worked for me. I've searched Google and YouTube with no relevant results.
I'm trying to find every link on a page, which I've been able to accomplish, even listing them; I would think this would be just a default test. I even got it to PRINT the text of the link, but when I try to click the link it doesn't work. I've tried doing waits of various sorts, including
visibility_of_any_elements_located AND time.sleep(5), to wait before trying to click the link.
I've tried this to click the link after waiting: self.driver.find_element(By.LINK_TEXT, ("lnktxt")).click(). But none of these work. The code below does work, listing the URL text, the URL, and the URL text again, defined by a variable.
I guess I'm not sure how to get a variable into the By.LINK_TEXT or ...by_link_text statement, assuming that would work. I figured if I got it into the variable I could use it again. That worked for print but not for click().
I basically want to be able to load a page, list all links, click a link, go back and click the next link, etc.
The only post this site recommended that might be helpful was...
How can I test EVERY link on the WEBSITE with Selenium
But it's Java-based, and I've been trying to learn Python for the past month, so I'm not ready to learn Java just to make this work. The IDE does not seem to have an easy option for this, or judging from all my searches, it's not documented well.
Here is my current Selenium code in Python.
import pytest
import time
import json
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

wait_time_out = 15

class TestPazTestAll2():
    def setup_method(self, method):
        self.driver = webdriver.Firefox()
        self.vars = {}

    def teardown_method(self, method):
        self.driver.quit()

    def test_pazTestAll(self):
        self.driver.get('https://poetaz.com/poems/')
        lnks = self.driver.find_elements_by_tag_name("a")
        print("Total Links", len(lnks))
        # traverse list
        for lnk in lnks:
            # get_attribute() to get all href
            print(lnk.get_attribute("text"))
            lnktxt = lnk.get_attribute("text")
            print(lnk.get_attribute("href"))
            print(lnktxt)
Again, I'm sure I missed something in my searches but after hours of searching I'm reaching out.
Any help is appreciated.
I basically want to be able to load a page, list all links, click a link, go back and click the next link, etc.
I don't recommend doing this. Selenium and manipulating the browser is slow, and you're not really using the browser for anything where you'd actually need one.
What I recommend is simply sending requests to those scraped links and asserting response status codes.
import requests
link_elements = self.driver.find_elements_by_tag_name("a")
urls = map(lambda l: l.get_attribute("href"), link_elements)
for url in urls:
    response = requests.get(url)
    assert response.status_code == 200
(You also might need to prepend some base url to those strings found in href attributes.)
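If the href values do turn out to be relative, urllib.parse.urljoin can resolve them against the page URL; a small sketch with a hypothetical relative link:

from urllib.parse import urljoin

base_url = "https://poetaz.com/poems/"   # the page the links were scraped from
relative_href = "/poems/some-poem/"      # hypothetical relative href value
print(urljoin(base_url, relative_href))  # -> https://poetaz.com/poems/some-poem/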
I am trying to scrape a site that attempts to block scraping. Viewing the source code through Chrome, or requests, or requests_html results in it not showing the correct source code.
Here is an example:
from requests_html import HTMLSession
session = HTMLSession()
content = session.get('website')
content.html.render()
print(content.html.html)
It gives this page:
It looks like JavaScript is disabled or not supported by your browser.
even though JavaScript is enabled. The same thing happens in an actual browser.
However, on my actual browser, when I go to inspect element, I can see the source code just fine. Is there a way to extract the HTML source from inspect element?
Thanks!
The issue you are facing is that the page is rendered by JavaScript on the front end. In this case, you need a JavaScript-enabled browser engine, and then you can easily read the HTML source.
Here's working code showing how I would do it (using Selenium):
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
driver = webdriver.Chrome(chrome_options=chrome_options)
# Ensure that the full URL path is given
URL = 'https://proper_url'
# The following step will launch a browser.
driver.get(URL)
# Now you can easily read the source HTML
HTML = driver.page_source
You will have to figure out the details of installing and setting up Selenium and the webdriver. Here's a good place to start.
I want to do a specific set of operations using Python:
1- Access a webpage
2- Click on a page button
3- Clear cache and cookies and any other site data from the browser memory.
4- Do the above in a loop.
I'm a complete novice when it comes to interacting with the web using Python.
The language itself, however, I'm intermediate in.
I want some learning material that I can use to understand the basic HTTP framework and be able to interact with a webpage using Python.
Which libraries, tutorials, documentation I can use to learn further?
Selenium sounds like your best bet! It's an open-source web-based automation tool.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
browser = webdriver.Firefox()
browser.get('http://www.yahoo.com')
assert 'Yahoo' in browser.title
elem = browser.find_element_by_name('p') # Find the search box
elem.send_keys('seleniumhq' + Keys.RETURN)
browser.quit()
This snippet will remove cache, etc:
from selenium.webdriver.support.ui import WebDriverWait
def get_clear_browsing_button(driver):
    """Find the "CLEAR BROWSING BUTTON" on the Chrome settings page."""
    return driver.find_element_by_css_selector('* /deep/ #clearBrowsingDataConfirm')

def clear_cache(driver, timeout=60):
    """Clear the cookies and cache for the ChromeDriver instance."""
    # navigate to the settings page
    driver.get('chrome://settings/clearBrowserData')
    # wait for the button to appear
    wait = WebDriverWait(driver, timeout)
    wait.until(get_clear_browsing_button)
    # click the button to clear the cache
    get_clear_browsing_button(driver).click()
    # wait for the button to be gone before returning
    wait.until_not(get_clear_browsing_button)
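For completeness, here is a sketch of how your loop might use this helper (the URL and button ID are hypothetical placeholders):

from selenium import webdriver

driver = webdriver.Chrome()
for _ in range(3):  # repeat the cycle a few times
    driver.get("https://www.example.com")  # 1. access the page (placeholder URL)
    driver.find_element_by_id("myButton").click()  # 2. click a button (hypothetical ID)
    clear_cache(driver)  # 3. clear cache, cookies and other site data
driver.quit()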
I have a url which I need to request in order for a refresh to happen. It will refresh the data cache and display the latest uploaded data in Tableau Server. The url is like this:
http://servername/views/workbookname/dashboard1?:refresh=yes
When I use the webbrowser library to open the url, the refresh is executed, but a browser window is left open. When I use requests to get the url, it does not refresh, and gives me a Response of 200, which I assume means success.
Does anyone know why this could happen? How can I silently use the webbrowser lib to open the url and close it afterwards, or have requests act as a web browser when doing the get?
import webbrowser
url = 'http://servername/views/workbookname/dashboard1?:refresh=yes'
webbrowser.open(url)
import requests
url = "http://servername/views/workbookname/dashboard1?:refresh=yes"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36',
    "Upgrade-Insecure-Requests": "1",
    "DNT": "1",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
}
html = requests.get(url,headers=headers)
print(html)
requests.get() simply returns the markup received from the server in response to the 'GET' request, without any further client-side execution.
Whereas in a browser context, a lot more can happen on the client side in JavaScript. I haven't looked at your page specifically, but there might be certain JavaScript code doing further processing.
Instead of webbrowser or requests, you can use Selenium. You can read more about it here.
Selenium lets you browse pages as you would in a browser, but also gives you the flexibility to automate and control actions on the page with Python code.
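A minimal sketch of that idea for your refresh URL (the server name is a placeholder, as in your question); the answers below flesh this out with a headless setup:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://servername/views/workbookname/dashboard1?:refresh=yes")  # runs the page's JS
driver.quit()  # close the browser once the refresh has been triggered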
You could perhaps use Selenium Chrome Webdriver to load the page in the background. (Or you can use Firefox driver).
Go to chrome://settings/help, check your current Chrome version, and download the driver for that version from here. Make sure to either keep the driver file in your PATH or in the same folder as your Python script.
Try this:
from selenium.webdriver import Chrome # pip install selenium
from selenium.webdriver.chrome.options import Options
url = "http://servername/views/workbookname/dashboard1?:refresh=yes"
#Make it headless, i.e. run in the background without opening a Chrome window
chrome_options = Options()
chrome_options.add_argument("--headless")
# use Chrome to get page with javascript generated content
with Chrome(executable_path="./chromedriver", options=chrome_options) as browser:
    browser.get(url)
    page_source = browser.page_source
Note
When you open your URL, the webbrowser module launches your default browser, which already has your credentials/cookies cached. Whereas, if your URL needs any authentication or login to access, you'll have to provide these when getting the page using Selenium. Think of each Selenium web driver session as an incognito session. Here's an example of how to simulate a login with the web driver.
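In case the linked example moves, here is a rough sketch of such a login (every element name below is a hypothetical assumption; inspect the real form to find the actual ones):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.example.com/login")  # placeholder login page

# All of these element names are hypothetical; inspect the real form to find them.
driver.find_element_by_name("username").send_keys("my_user")
driver.find_element_by_name("password").send_keys("my_password")
driver.find_element_by_name("submit").click()

# The driver session is now authenticated; later driver.get() calls reuse it.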
References:
selenium - chromedriver executable needs to be in PATH
The reason why your browser opens up is simply that this is what webbrowser.open() is supposed to do: instead of sending an HTTP request, it opens the browser and puts in the URL. A possible solution would be using Selenium instead of webbrowser, because when I looked at it I didn't find a headless option for the package you are using. So here it is:
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
url = "<URL>"
chrome_options = Options()
chrome_options.add_argument("--headless")
with Chrome(options=chrome_options) as browser:
    browser.get(url)
In case this solution is not acceptable because you need to use webbrowser instead of Selenium, you would need to find a way to pass options to your browser instance. I didn't find a way with dir() or help() to pass this argument to webbrowser, but if I find something I will add it.
I need to scrape google suggestions from search input. Now I use selenium+phantomjs webdriver.
search_input = selenium.find_element_by_xpath(".//input[@id='lst-ib']")
search_input.send_keys('phantomjs har python')
time.sleep(1.5)
from lxml.html import fromstring
etree = fromstring(selenium.page_source)
output = []
for suggestion in etree.xpath(".//ul[@role='listbox']/li//div[@class='sbqs_c']"):
    output.append(" ".join([s.strip() for s in suggestion.xpath(".//text()") if s.strip()]))
But in Firebug I see an XHR request like this, and the response is a simple text file with the data that I need. Then I look at the log:
selenium.get_log("har")
I can't see this request there. How can I catch it? I need this URL as a template for the requests lib, to use it with other search words. Or might it be possible to run JS that initiates this request with other arguments (not from the input field)?
You can solve it with Python+Selenium+PhantomJS only.
Here is the list of things I've done to make it work:
pretend to be a browser with a head by changing PhantomJS's User-Agent through Desired Capabilities
use Explicit Waits
ask for the direct https://www.google.com/?gws_rd=ssl#q=phantomjs+har+python url
Working solution:
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
desired_capabilities = webdriver.DesiredCapabilities.PHANTOMJS
desired_capabilities["phantomjs.page.customHeaders.User-Agent"] = "Mozilla/5.0 (Linux; U; Android 2.3.3; en-us; LG-LU3000 Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
driver = webdriver.PhantomJS(desired_capabilities=desired_capabilities)
driver.get("https://www.google.com/?gws_rd=ssl#q=phantomjs+har+python")
wait = WebDriverWait(driver, 10)
# focus the input and trigger the suggestion list to be shown
search_input = wait.until(EC.visibility_of_element_located((By.NAME, "q")))
search_input.send_keys(Keys.ARROW_DOWN)
search_input.click()
# wait for the suggestion box to appear
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "ul[role=listbox]")))
# parse suggestions
print "List of suggestions: "
for suggestion in driver.find_elements_by_css_selector("ul[role=listbox] li[dir]"):
print suggestion.text
Prints:
List of suggestions:
python phantomjs screenshot
python phantomjs ghostdriver
python phantomjs proxy
unable to start phantomjs with ghostdriver