How can I run a site's JS function with custom arguments? - python

I need to scrape Google suggestions from the search input. Currently I use Selenium with the PhantomJS webdriver.
search_input = selenium.find_element_by_xpath(".//input[@id='lst-ib']")
search_input.send_keys('phantomjs har python')
time.sleep(1.5)
from lxml.html import fromstring
etree = fromstring(selenium.page_source)
output = []
for suggestion in etree.xpath(".//ul[@role='listbox']/li//div[@class='sbqs_c']"):
    output.append(" ".join([s.strip() for s in suggestion.xpath(".//text()") if s.strip()]))
But in Firebug I see an XHR request like this, and the response is a simple text file with exactly the data I need. Then I look at the log:
selenium.get_log("har")
and I can't see this request. How can I catch it? I need this URL as a template for the requests library, to use it with other search words. Or maybe it is possible to run the JS that initiates this request with other arguments (not taken from the input field)?

You can solve it with Python+Selenium+PhantomJS only.
Here is the list of things I've done to make it work:
pretend to be a browser with a head by changing PhantomJS's User-Agent through Desired Capabilities
use Explicit Waits
ask for the direct https://www.google.com/?gws_rd=ssl#q=phantomjs+har+python url
Working solution:
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
desired_capabilities = webdriver.DesiredCapabilities.PHANTOMJS
desired_capabilities["phantomjs.page.customHeaders.User-Agent"] = "Mozilla/5.0 (Linux; U; Android 2.3.3; en-us; LG-LU3000 Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"
driver = webdriver.PhantomJS(desired_capabilities=desired_capabilities)
driver.get("https://www.google.com/?gws_rd=ssl#q=phantomjs+har+python")
wait = WebDriverWait(driver, 10)
# focus the input and trigger the suggestion list to be shown
search_input = wait.until(EC.visibility_of_element_located((By.NAME, "q")))
search_input.send_keys(Keys.ARROW_DOWN)
search_input.click()
# wait for the suggestion box to appear
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "ul[role=listbox]")))
# parse suggestions
print "List of suggestions: "
for suggestion in driver.find_elements_by_css_selector("ul[role=listbox] li[dir]"):
    print suggestion.text
Prints:
List of suggestions:
python phantomjs screenshot
python phantomjs ghostdriver
python phantomjs proxy
unable to start phantomjs with ghostdriver
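As the asker hinted, the suggestion XHR can also be templated for the requests library. A minimal sketch, assuming Google's long-standing unofficial suggest endpoint (the URL and response shape below are assumptions, not taken from the HAR log):

```python
# Sketch (assumption): build a templated URL for Google's unofficial suggest
# endpoint, so other search words can be substituted without a browser.
from urllib.parse import urlencode

def suggest_url(query, client="firefox"):
    # client=firefox is assumed to make the endpoint return plain JSON:
    # [query, [suggestion, suggestion, ...]]
    params = urlencode({"client": client, "q": query})
    return "https://suggestqueries.google.com/complete/search?" + params

url = suggest_url("phantomjs har python")
print(url)
```

Assuming the endpoint behaves as described, `requests.get(suggest_url("other words")).json()` should then return the suggestion list for any search term.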

Related

How to make the driver wait up to 10 seconds to click a hyperlink

In the URL below I need to click a mail-icon hyperlink. Sometimes it doesn't work even though the code is correct; in that case the driver needs to wait up to 10 seconds and then go on to the next one.
https://www.sciencedirect.com/science/article/pii/S1001841718305011
tags = driver.find_elements_by_xpath('//a[@class="author size-m workspace-trigger"]//*[local-name()="svg"]')
if tags:
    for tag in tags:
        tag.click()
How do I use an explicit or implicit wait here, around "tag.click()"?
From my understanding, after the element is clicked it should wait until the author popup appears, then extract using details()?
tags = driver.find_elements_by_css_selector('svg.icon-envelope')
if tags:
    for tag in tags:
        tag.click()
        # wait until the author dialog/popup on the right appears
        WebDriverWait(driver, 10).until(
            lambda d: d.find_element_by_class_name('e-address')  # selector for the email
        )
        try:
            details()
            # close the popup
            driver.find_element_by_css_selector('button.close-button').click()
        except Exception as ex:
            print(ex)
            continue
As an aside, you can extract the author contact e-mails (the same ones you get by clicking) from a JSON-like string in one of the page's scripts:
from selenium import webdriver
import json
d = webdriver.Chrome()
d.get('https://www.sciencedirect.com/science/article/pii/S1001841718305011#!')
script = d.find_element_by_css_selector('script[data-iso-key]').get_attribute('innerHTML')
script = script.replace(':false',':"false"').replace(':true',':"true"')
data = json.loads(script)
authors = data['authors']['content'][0]['$$']
emails = [author['$$'][3]['$']['href'].replace('mailto:','') for author in authors if len(author['$$']) == 4]
print(emails)
d.quit()
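The `['$$'][3]` indexing above is tied to one exact page layout. As a hedged alternative sketch, you could walk the whole parsed structure and collect every `mailto:` href wherever it sits; the `sample` below is synthetic, shaped only loosely like the page's script data:

```python
# Sketch: recursively collect 'mailto:' hrefs from arbitrarily nested
# JSON data, instead of relying on fixed positions like ['$$'][3].
def find_mailtos(node):
    emails = []
    if isinstance(node, dict):
        href = node.get('href', '')
        if isinstance(href, str) and href.startswith('mailto:'):
            emails.append(href[len('mailto:'):])
        for value in node.values():
            emails.extend(find_mailtos(value))
    elif isinstance(node, list):
        for item in node:
            emails.extend(find_mailtos(item))
    return emails

# Synthetic stand-in for the parsed script data:
sample = {'authors': [{'$$': [{'$': {'href': 'mailto:a@ecust.edu.cn'}},
                              {'$': {'href': 'mailto:b@ecust.edu.cn'}}]}]}
print(find_mailtos(sample))  # ['a@ecust.edu.cn', 'b@ecust.edu.cn']
```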
You can also use requests to get all the recommendations info
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
}
data = requests.get('https://www.sciencedirect.com/sdfe/arp/pii/S1001841718305011/recommendations?creditCardPurchaseAllowed=true&preventTransactionalAccess=false&preventDocumentDelivery=true', headers = headers).json()
print(data)
You have to wait until the element is clickable. You can do that with the WebDriverWait function.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get('url')
elements = driver.find_elements_by_xpath('xpath')
for element in elements:
    try:
        WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.LINK_TEXT, element.text)))
    finally:
        element.click()
You can try the approach below to click on the hyperlinks containing the mail icon. When a click is initiated, a pop-up box containing additional information shows up, and the script fetches the email address from it. It's always a great deal of trouble to dig anything out when svg elements are involved. I've used the BeautifulSoup library for its .extract() function, which kicks out the svg elements so that the script can reach the content.
from bs4 import BeautifulSoup
from contextlib import closing
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
with closing(webdriver.Chrome()) as driver:
    driver.get("https://www.sciencedirect.com/science/article/pii/S1001841718305011")
    for elem in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[starts-with(@name,'baut')]")))[-2:]:
        elem.click()
        soup = BeautifulSoup(driver.page_source, "lxml")
        [item.extract() for item in soup.select("svg")]
        email = soup.select_one("a[href^='mailto:']").text
        print(email)
Output:
weibingzhang@ecust.edu.cn
junhongqian@ecust.edu.cn
Use the built-in time.sleep() function:
from time import sleep
tags = driver.find_elements_by_xpath('//a[@class="author size-m workspace-trigger"]//*[local-name()="svg"]')
if tags:
    for tag in tags:
        sleep(10)
        tag.click()

Unable to get dynamically generated content from a webpage

I have written a script in python using selenium to fetch the business summary (which is within a p tag) located at the bottom right corner of a webpage under the header Company profile. The webpage is heavily dynamic, so I thought to use a browser simulator. I have created a css selector which is able to parse the summary if I copy the html elements directly from that webpage and try it locally. For some reason, when I tried the same selector within my script below, it doesn't do the trick; it throws a timeout exception error instead. How can I fetch it?
This is my try:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
link = "https://in.finance.yahoo.com/quote/AAPL?p=AAPL"
def get_information(driver, url):
    driver.get(url)
    item = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "[id$='-QuoteModule'] p[class^='businessSummary']")))
    driver.execute_script("arguments[0].scrollIntoView();", item)
    print(item.text)

if __name__ == "__main__":
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 20)
    try:
        get_information(driver, link)
    finally:
        driver.quit()
It seems that there is no Business Summary block initially; it is generated after you scroll the page down. Try the solution below:
from selenium.webdriver.common.keys import Keys
def get_information(driver, url):
    driver.get(url)
    driver.find_element_by_tag_name("body").send_keys(Keys.END)
    item = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "[id$='-QuoteModule'] p[class^='businessSummary']")))
    print(item.text)
You have to scroll the page down twice before the element is present:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import time

link = "https://in.finance.yahoo.com/quote/AAPL?p=AAPL"

def get_information(driver, url):
    driver.get(url)
    driver.find_element_by_tag_name("body").send_keys(Keys.END)  # scroll page
    time.sleep(1)  # small pause between
    driver.find_element_by_tag_name("body").send_keys(Keys.END)  # one more time
    item = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "[id$='-QuoteModule'] p[class^='businessSummary']")))
    driver.execute_script("arguments[0].scrollIntoView();", item)
    print(item.text)

if __name__ == "__main__":
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 20)
    try:
        get_information(driver, link)
    finally:
        driver.quit()
If you scroll only once, it won't work properly for some reason (at least for me). I think it depends on the window dimensions: in a smaller window you have to scroll more than in a bigger one.
Here is a much simpler approach using requests, working with the JSON data that is already in the page. I would also recommend always using requests if possible; it may take some extra work, but the end result is a lot more reliable and cleaner. You could also take my example further and parse the JSON to work with it directly (you need to clean up the text to be valid JSON). In my example I just use split, which was faster to write, but it could lead to problems down the road when doing something more complex.
import requests
from lxml import html
url = 'https://in.finance.yahoo.com/quote/AAPL?p=AAPL'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
r = requests.get(url, headers=headers)
tree = html.fromstring(r.text)
data = [e.text_content() for e in tree.iter('script') if 'root.App.main = ' in e.text_content()][0]
data = data.split('longBusinessSummary":"')[1]
data = data.split('","city')[0]
print(data)
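If you do want to go the "parse the JSON properly" route mentioned above, a sketch might look like this; the `script_text` here is a synthetic stand-in for the page's real `root.App.main = {...};` assignment, and the real text may need extra cleanup before `json.loads()`:

```python
# Sketch: capture the whole object assigned to root.App.main and parse it
# as JSON, instead of slicing it apart with split().
import json
import re

# Synthetic stand-in for the script element's text content:
script_text = 'root.App.main = {"quoteSummary": {"longBusinessSummary": "Apple designs iPhones.", "city": "Cupertino"}};'

match = re.search(r'root\.App\.main\s*=\s*(\{.*\});', script_text, re.DOTALL)
data = json.loads(match.group(1))
summary = data["quoteSummary"]["longBusinessSummary"]
print(summary)  # Apple designs iPhones.
```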

PhantomJS returning empty web page (python, Selenium)

I'm trying to screen-scrape a web site without launching an actual browser instance from a python script (using Selenium). I can do this with Chrome or Firefox - I've tried it and it works - but I want to use PhantomJS so it's headless.
The code looks like this:
import sys
import traceback
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
    "(KHTML, like Gecko) Chrome/15.0.87"
)
try:
    # Choose our browser
    browser = webdriver.PhantomJS(desired_capabilities=dcap)
    #browser = webdriver.PhantomJS()
    #browser = webdriver.Firefox()
    #browser = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")
    # Go to the login page
    browser.get("https://www.whatever.com")
    # For debug, see what we got back
    html_source = browser.page_source
    with open('out.html', 'w') as f:
        f.write(html_source)
    # PROCESS THE PAGE (code removed)
except Exception, e:
    browser.save_screenshot('screenshot.png')
    traceback.print_exc(file=sys.stdout)
finally:
    browser.close()
The output is merely:
<html><head></head><body></body></html>
But when I use the Chrome or Firefox options, it works fine. I thought maybe the web site was returning junk based on the user agent, so I tried faking that out. No difference.
What am I missing?
UPDATED: I will try to keep the snippet below updated until it works. What's below is what I'm currently trying.
import sys
import traceback
import time
import re
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support import expected_conditions as EC
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 (KHTML, like Gecko) Chrome/15.0.87")
try:
    # Set up our browser
    browser = webdriver.PhantomJS(desired_capabilities=dcap, service_args=['--ignore-ssl-errors=true'])
    #browser = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")
    # Go to the login page
    print "getting web page..."
    browser.get("https://www.website.com")
    # Need to wait for the page to load
    timeout = 10
    print "waiting %s seconds..." % timeout
    wait = WebDriverWait(browser, timeout)
    element = wait.until(EC.element_to_be_clickable((By.ID, 'the_id')))
    print "done waiting. Response:"
    # Rest of code snipped. Fails at "wait" above.
I was facing the same problem, and no amount of code to make the driver wait was helping.
The problem is the SSL encryption on HTTPS websites; ignoring SSL errors does the trick.
Call the PhantomJS driver as:
driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--ssl-protocol=TLSv1'])
This solved the problem for me.
You need to wait for the page to load. Usually, it is done by using an Explicit Wait to wait for a key element to be present or visible on a page. For instance:
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
# ...
browser.get("https://www.whatever.com")
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.content")))
html_source = browser.page_source
# ...
Here, we'll wait up to 10 seconds for a div element with class="content" to become visible before getting the page source.
Additionally, you may need to ignore SSL errors:
browser = webdriver.PhantomJS(desired_capabilities=dcap, service_args=['--ignore-ssl-errors=true'])
Though, I'm pretty sure this is related to the redirecting issues in PhantomJS. There is an open ticket in phantomjs bugtracker:
PhantomJS does not follow some redirects
driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--ssl-protocol=TLSv1'])
This worked for me

Amazon web scraping

I'm trying to scrape Amazon prices with PhantomJS and Python. I want to parse the page with Beautiful Soup to get the new and used prices for books. The problem: when I pass in the source of the request I make with PhantomJS, the prices are just 0,00. The code is this simple test.
I'm new to web scraping, but I don't understand whether Amazon has measures to prevent scraping prices or whether I'm doing it wrong, because with other, simpler pages I can get the data I want.
PS: I'm in a country not supported by the Amazon API; that's why the scraper is necessary.
import re
import urlparse
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
link = 'http://www.amazon.com/gp/offer-listing/1119998956/ref=dp_olp_new?ie=UTF8&condition=new'  # 'http://www.amazon.com/gp/product/1119998956'

class AmzonScraper(object):
    def __init__(self):
        self.driver = webdriver.PhantomJS()
        self.driver.set_window_size(1120, 550)

    def scrape_prices(self):
        self.driver.get(link)
        s = BeautifulSoup(self.driver.page_source)
        return s

    def scrape(self):
        source = self.scrape_prices()
        print source
        self.driver.quit()

if __name__ == '__main__':
    scraper = AmzonScraper()
    scraper.scrape()
First of all, to follow @Nick Bailey's comment, study the Terms of Use and make sure there are no violations on your side.
To solve it, you need to tweak PhantomJS desired capabilities:
caps = webdriver.DesiredCapabilities.PHANTOMJS
caps["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 (KHTML, like Gecko) Chrome/15.0.87"
self.driver = webdriver.PhantomJS(desired_capabilities=caps)
self.driver.maximize_window()
And, to make it bullet-proof, you can make a Custom Expected Condition and wait for the price to become non-zero:
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
class wait_for_price(object):
    def __init__(self, locator):
        self.locator = locator

    def __call__(self, driver):
        try:
            element_text = EC._find_element(driver, self.locator).text.strip()
            return element_text != "0,00"
        except StaleElementReferenceException:
            return False
Usage:
def scrape_prices(self):
    self.driver.get(link)
    WebDriverWait(self.driver, 200).until(wait_for_price((By.CLASS_NAME, "olpOfferPrice")))
    s = BeautifulSoup(self.driver.page_source)
    return s
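For intuition, `WebDriverWait(...).until(...)` just polls any callable of the driver until it returns something truthy. The stripped-down sketch below imitates that loop with a fake driver whose price fills in after a few polls; `FakeDriver` and `wait_for_nonzero_price` are illustrative names, not Selenium APIs:

```python
# Sketch of the mechanism behind the custom condition above: a simplified
# poll loop standing in for WebDriverWait, driven by a fake driver.
class FakeDriver:
    def __init__(self):
        self.polls = 0

    def price_text(self):
        # The price stays "0,00" for the first two polls, then fills in.
        self.polls += 1
        return "0,00" if self.polls < 3 else "12,34"

def wait_for_nonzero_price(driver, attempts=10):
    for _ in range(attempts):  # WebDriverWait's poll loop, simplified
        if driver.price_text() != "0,00":
            return True
    raise TimeoutError("price never became non-zero")

driver = FakeDriver()
print(wait_for_nonzero_price(driver))  # True (after three polls)
```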
Good answer on setting the user agent for PhantomJS to that of a normal browser. Since you said that your country is blocked by Amazon, I would imagine that you also need to set a proxy.
Here is an example of how to start PhantomJS in Python with a Firefox user agent and a proxy:
from selenium.webdriver import *
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

service_args = ['--proxy=1.1.1.1:port', '--proxy-auth=username:pass']
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0"
driver = PhantomJS(desired_capabilities=dcap, service_args=service_args)
where 1.1.1.1 is your proxy IP and port is the proxy port. The username and password are only necessary if your proxy requires authentication.
Another framework to try is Scrapy. It is simpler than Selenium, which is built to simulate browser interactions. Scrapy gives you classes for easily parsing data using CSS selectors or XPath, and a pipeline to store that data in whatever format you'd like, for example writing it to a MongoDB database.
Often you can write a fully built spider and deploy it to Scrapy Cloud in under 10 lines of code.
Check out this YT video on how to use Scrapy for scraping Amazon reviews as a use case.

Using python Requests with javascript pages

I am trying to use the Requests framework with python (http://docs.python-requests.org/en/latest/), but the page I am trying to get uses javascript to fetch the info that I want.
I have tried to search the web for a solution, but since I am searching with the keyword javascript, most of what I get is about how to scrape with the javascript language.
Is there any way to use the Requests framework with pages that use javascript?
Good news: there is now a requests module that supports javascript: https://pypi.org/project/requests-html/
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://www.yourjspage.com')
r.html.render() # this call executes the js in the page
As a bonus this wraps BeautifulSoup, I think, so you can do things like
r.html.find('#myElementID').text
which returns the content of the HTML element as you'd expect.
You are going to have to make the same request (using the Requests library) that the javascript is making. You can use any number of tools (including those built into Chrome and Firefox) to inspect the http request that is coming from javascript and simply make this request yourself from Python.
While Selenium might seem tempting and useful, it has one main problem that can't be fixed: performance. By computing every single thing a browser does, you need a lot more power. Even PhantomJS does not compete with a simple request. I recommend using Selenium only when you really need to click buttons. If you only need javascript, I recommend PyQt (check https://www.youtube.com/watch?v=FSH77vnOGqU to learn it).
However, if you want to use Selenium, I recommend Chrome over PhantomJS. Many users have problems with PhantomJS where a website simply does not work in Phantom. Chrome can be headless (non-graphical) too!
First, make sure you have installed ChromeDriver, which Selenium depends on for using Google Chrome.
Then, make sure you have Google Chrome of version 60 or higher by checking it in the URL chrome://settings/help
Now, all you need to do is the following code:
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(chrome_options=chrome_options)
If you do not know how to use Selenium, here is a quick overview:
driver.get("https://www.google.com") #Browser goes to google.com
Finding elements:
Use either the ELEMENTS or ELEMENT method. Examples:
driver.find_element_by_css_selector("div.logo-subtext") #Find your country in Google. (singular)
driver.find_element(s)_by_css_selector(css_selector) # Every element that matches this CSS selector
driver.find_element(s)_by_class_name(class_name) # Every element with the following class
driver.find_element(s)_by_id(id) # Every element with the following ID
driver.find_element(s)_by_link_text(link_text) # Every link (<a> element) with the full link text
driver.find_element(s)_by_partial_link_text(partial_link_text) # Every link with partial link text
driver.find_element(s)_by_name(name) # Every element where name=argument
driver.find_element(s)_by_tag_name(tag_name) # Every element with the tag name argument
Ok! I found an element (or elements list). But what do I do now?
Here are the methods you can do on an element elem:
elem.tag_name # The element's tag name, e.g. "button" for a <button>.
elem.get_attribute("id") # Returns the ID of an element.
elem.text # The inner text of an element.
elem.clear() # Clears a text input.
elem.is_displayed() # True for visible elements, False for invisible elements.
elem.is_enabled() # True for an enabled input, False otherwise.
elem.is_selected() # Is this radio button or checkbox element selected?
elem.location # A dictionary representing the X and Y location of an element on the screen.
elem.click() # Click elem.
elem.send_keys("thelegend27") # Type thelegend27 into elem (useful for text inputs)
elem.submit() # Submit the form in which elem takes part.
Special commands:
driver.back() # Click the Back button.
driver.forward() # Click the Forward button.
driver.refresh() # Refresh the page.
driver.quit() # Close the browser including all the tabs.
foo = driver.execute_script("return 'hello';") # Execute javascript (COULD TAKE RETURN VALUES!)
Using Selenium or JavaScript-enabled requests is slow. It is more efficient to find out which cookie is generated after the website checks for JavaScript in the browser, and then use that cookie with each of your requests.
In one example it worked through the following cookie: the cookie generated after the JavaScript check in this example is "cf_clearance".
So simply create a session and update the cookie and headers as such:
s = requests.Session()
s.cookies["cf_clearance"] = "cb4c883efc59d0e990caf7508902591f4569e7bf-1617321078-0-150"
s.headers.update({
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
})
s.get(url)
and you are good to go; no need for a JavaScript solution such as Selenium. This is much faster and more efficient. You just have to get the cookie once after opening up the browser.
One way to do that is to invoke your request using Selenium.
Let's install the dependencies using pip or pip3:
pip install selenium
If you run the script using python3, use instead:
pip3 install selenium
(...)
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
url = 'http://myurl.com'
# Wait until the page is ready:
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.some_placeholder")))
text = element.text  # <-- Here it is! I got what I wanted :)
It's a wrapper around pyppeteer or something? :( I thought it was something different.
@property
async def browser(self):
    if not hasattr(self, "_browser"):
        self._browser = await pyppeteer.launch(ignoreHTTPSErrors=not(self.verify), headless=True, args=self.__browser_args)
    return self._browser
