I'm trying to extract data from the link below using Selenium via Python:
www.oanda.com
But I'm getting an "Unable to Locate an Element" error. In the browser console I tried this CSS selector:
document.querySelector('div.position.short-position.style-scope.position-ratios-app')
This querySelector returns the short-percentage data for the first row in the browser console (as a test), but when I use the same selector in the Python script below it raises "Unable to Locate element" or sometimes returns an empty string.
Please suggest a solution if there is one. I'll be grateful, thanks :)
# All Imports
import time
from selenium import webdriver

# will return driver
def getDriver():
    driver = webdriver.Chrome()
    time.sleep(3)
    return driver

def getshortPercentages(driver):
    shortPercentages = []
    shortList = driver.find_elements_by_css_selector('div.position.short-position.style-scope.position-ratios-app')
    for elem in shortList:
        shortPercentages.append(elem.text)
    return shortPercentages

def getData(url):
    driver = getDriver()
    driver.get(url)
    time.sleep(5)
    # pagesource = driver.page_source
    # print("Page Source: ", pagesource)
    shortList = getshortPercentages(driver)
    print("Returned source from selector: ", shortList)

if __name__ == '__main__':
    url = "https://www.oanda.com/forex-trading/analysis/open-position-ratios"
    getData(url)
The required data is located inside an iframe, so you need to switch to the iframe before handling elements:
driver.switch_to.frame(driver.find_element_by_class_name('position-ratios-iframe'))
Also note that the data inside the iframe is dynamic, so make sure you're using an implicit/explicit wait (IMHO, time.sleep(5) is not the best solution).
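For example, a minimal sketch combining the iframe switch with an explicit wait (the selector and iframe class are taken from the question and answer above; the 10-second timeout is an arbitrary choice):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.oanda.com/forex-trading/analysis/open-position-ratios")

wait = WebDriverWait(driver, 10)
# Switch into the iframe hosting the position-ratios widget as soon as it is available
wait.until(EC.frame_to_be_available_and_switch_to_it((By.CLASS_NAME, 'position-ratios-iframe')))

# Wait until the ratio elements are present, then read their text
elems = wait.until(EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, 'div.position.short-position.style-scope.position-ratios-app')))
print([elem.text for elem in elems])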
I'm starting web scraping and followed tutorials, yet in this code I get "NameError: name 'avail' is not defined". I guess it's really easy, but how could I fix this? (The error is probably in the for loop, in avail = i.text().)
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome('/Users/victorfichtner/Downloads/Chromedriver')
driver.get('https://www.myntra.com/smart-watches/boat/boat-unisex-black-storm-m-smart-watch/13471916/buy')
a = driver.find_elements_by_xpath("//*[@class='pdp-add-to-bag pdp-button pdp-flex pdp-center']")
for i in a:
    avail = i.text()
driver.quit()
print(avail)
Things to be noted:
find_elements returns a list, whereas find_element returns a single web element.
XPath is brittle.
Use explicit waits for dynamic loading.
It is .text in Python, not .text().
Sample code:
driver = webdriver.Chrome('/Users/victorfichtner/Downloads/Chromedriver')
driver.maximize_window()
driver.implicitly_wait(50)
driver.get('https://www.myntra.com/smart-watches/boat/boat-unisex-black-storm-m-smart-watch/13471916/buy')
a = driver.find_elements_by_xpath("//*[contains(@class,'pdp-add-to-bag pdp-button pdp-flex')]")
avail = ""
for i in a:
    avail = i.text
driver.quit()
print(avail)
Output:
ADD TO BAG
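Since explicit waits are recommended above, here is a sketch of the same lookup using WebDriverWait in place of implicitly_wait (the 20-second timeout is an arbitrary choice):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome('/Users/victorfichtner/Downloads/Chromedriver')
driver.get('https://www.myntra.com/smart-watches/boat/boat-unisex-black-storm-m-smart-watch/13471916/buy')

# Block until at least one matching element is present, instead of an implicit wait
wait = WebDriverWait(driver, 20)
elements = wait.until(EC.presence_of_all_elements_located(
    (By.XPATH, "//*[contains(@class,'pdp-add-to-bag pdp-button pdp-flex')]")))

avail = elements[-1].text  # .text is a property, not a method
driver.quit()
print(avail)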
I have started using Selenium with Python. I am able to change the message text using find_element_by_id. I want to do the same with find_element_by_xpath, which is not successful because the XPath matches two instances. I want to try this out to learn about XPath.
I want to do web scraping of a page using Python, and I need clarity on using XPath, mainly for going to the next page.
# This code works:
import time
import requests
from selenium import webdriver

driver = webdriver.Chrome()
url = "http://www.seleniumeasy.com/test/basic-first-form-demo.html"
driver.get(url)
eleUserMessage = driver.find_element_by_id("user-message")
eleUserMessage.clear()
eleUserMessage.send_keys("Testing Python")
time.sleep(2)
driver.close()

# This works fine. I wish to do the same with XPath.
# I inspect the input box in Chrome and copy the XPath '//*[@id="user-message"]', which seems to refer to the other box as well.
# I wish to use the XPath method to write text in this box as follows, which does not work.
driver = webdriver.Chrome()
url = "http://www.seleniumeasy.com/test/basic-first-form-demo.html"
driver.get(url)
eleUserMessage = driver.find_elements_by_xpath('//*[@id="user-message"]')
eleUserMessage.clear()
eleUserMessage.send_keys("Test Python")
time.sleep(2)
driver.close()
To elaborate on my comment, you would use a list like this:
eleUserMessage_list = driver.find_elements_by_xpath('//*[@id="user-message"]')
my_desired_element = eleUserMessage_list[0]  # or maybe [1]
my_desired_element.clear()
my_desired_element.send_keys("Test Python")
time.sleep(2)
The only real difference between find_elements_by_xpath and find_element_by_xpath is that the first returns a list that needs to be indexed. Once it's indexed, it works the same as if you had run the second!
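For comparison, the singular form returns the first match directly, so no indexing is needed (assuming the first match is the box you want):

# find_element_by_xpath returns the first matching element directly
eleUserMessage = driver.find_element_by_xpath('//*[@id="user-message"]')
eleUserMessage.clear()
eleUserMessage.send_keys("Test Python")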
I use the Python package Selenium to click the "load more" button automatically, which succeeds. But why can't I get the data after "load more"?
I want to crawl reviews from IMDb using Python. It only displays 25 reviews until I click the "load more" button. I use Selenium to click the button automatically, which succeeds. But why can't I get the data after "load more", and instead just get the first 25 reviews repeatedly?
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import time

seed = 'https://www.imdb.com/title/tt4209788/reviews'
movie_review = requests.get(seed)
PATIENCE_TIME = 60
LOAD_MORE_BUTTON_XPATH = '//*[@id="browse-itemsprimary"]/li[2]/button/span/span[2]'

driver = webdriver.Chrome('D:/chromedriver_win32/chromedriver.exe')
driver.get(seed)

while True:
    try:
        loadMoreButton = driver.find_element_by_xpath("//button[@class='ipl-load-more__button']")
        review_soup = BeautifulSoup(movie_review.text, 'html.parser')
        review_containers = review_soup.find_all('div', class_='imdb-user-review')
        print('length: ', len(review_containers))
        for review_container in review_containers:
            review_title = review_container.find('a', class_='title').text
            print(review_title)
        time.sleep(2)
        loadMoreButton.click()
        time.sleep(5)
    except Exception as e:
        print(e)
        break
print("Complete")
I want all the reviews, but now I can only get the first 25.
You have several issues in your script. Hardcoded waits are very inconsistent and certainly the worst option to rely on. The way you have written your scraping logic inside the while True: loop slows the parsing process by collecting the same items over and over again. Moreover, every title produces a huge line gap in the output, which needs to be stripped. I've slightly changed your script to reflect the suggestions above.
Try this to get the required output:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

URL = "https://www.imdb.com/title/tt4209788/reviews"

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get(URL)
soup = BeautifulSoup(driver.page_source, 'lxml')

while True:
    try:
        driver.find_element_by_css_selector("button#load-more-trigger").click()
        wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, ".ipl-load-more__load-indicator")))
        soup = BeautifulSoup(driver.page_source, 'lxml')
    except Exception:
        break

for elem in soup.find_all(class_='imdb-user-review'):
    name = elem.find(class_='title').get_text(strip=True)
    print(name)

driver.quit()
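Note the design choice here: rather than sleeping a fixed amount of time after each click, the loop waits for the load indicator to become invisible, which signals that the next batch of reviews has finished loading, and the reviews are parsed once, after all the clicks have completed.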
Your code is fine. Great, even. But you never fetch the updated HTML for the page after hitting the 'Load More' button. That's why you get the same 25 reviews listed every time.
When you use Selenium to control the web browser and click the 'Load More' button, the click fires an XHR request (more commonly called an AJAX request), which you can see in the 'Network' tab of your browser's developer tools.
The bottom line is that JavaScript (which runs in the web browser) updates the page. But your Python program fetched the page's HTML only once, statically, using the Requests library.
seed = 'https://www.imdb.com/title/tt4209788/reviews'
movie_review = requests.get(seed)  # <-- SEE HERE? This is always the same HTML. You fetched it once at the beginning.
PATIENCE_TIME = 60
To fix this problem, you need to use Selenium to get the innerHTML of the div containing the reviews, then have BeautifulSoup parse that HTML again. We want to avoid picking up the entire page's HTML over and over, because re-parsing the full updated page each time wastes computation.
So, find the div on the page that contains the reviews and parse it again with BeautifulSoup. Something like this should work:
while True:
    try:
        allReviewsDiv = driver.find_element_by_xpath("//div[@class='lister-list']")
        allReviewsHTML = allReviewsDiv.get_attribute('innerHTML')
        loadMoreButton = driver.find_element_by_xpath("//button[@class='ipl-load-more__button']")
        review_soup = BeautifulSoup(allReviewsHTML, 'html.parser')
        review_containers = review_soup.find_all('div', class_='imdb-user-review')
        print('length: ', len(review_containers))
        for review_container in review_containers:
            review_title = review_container.find('a', class_='title').text
            print(review_title)
        time.sleep(2)
        loadMoreButton.click()
        time.sleep(5)
    except Exception as e:
        print(e)
        break
I'm trying to scrape Chinese economic data from an official website, but I keep getting an Element Not Found exception on the last line here. I've scoured Stack Overflow and have tried adding implicitly_wait and switching the problem line from XPath to ID, but nothing has worked. Any thoughts?
from selenium import webdriver

FAI = []
FAIinfra = []
FAIestate = []

path_to_chromedriver = '/Users/cargillsk/Downloads/chromedriver'
browser = webdriver.Chrome(executable_path=path_to_chromedriver)
browser.implicitly_wait(30)

url = 'http://www.cqdata.gov.cn/easyquery.htm?cn=A0101'
browser.get(url)
browser.find_element_by_id('treeZhiBiao_4').click()
browser.find_element_by_xpath('//*[@id="mySelect_sj"]/div[2]/div[1]').click()
browser.find_element_by_xpath('//*[@id="mySelect_sj"]/div[2]/div[2]/div[3]/input').clear()
browser.find_element_by_xpath('//*[@id="mySelect_sj"]/div[2]/div[2]/div[3]/input').send_keys('last100')
browser.find_element_by_xpath('//*[@id="mySelect_sj"]/div[2]/div[2]/div[3]/div[1]').click()
FAIinitial = browser.find_element_by_xpath('//*[@id="main-container"]/div[2]/div[2]/div[2]/div/div[2]/table/thead/tr/th[2]/strong').text
for i in range(2, 102):
    i = str(i)
    FAI.append(browser.find_element_by_xpath('//*[@id="table_main"]/tbody/tr[1]/td[%s]' % i).text)
    FAIinfra.append(browser.find_element_by_xpath('//*[@id="table_main"]/tbody/tr[4]/td[%s]' % i).text)
    FAIestate.append(browser.find_element_by_xpath('//*[@id="table_main"]/tbody/tr[55]/td[%s]' % i).text)
browser.find_element_by_id("treeZhiBiao_3").click()
browser.find_element_by_id("treeZhiBiao_14").click()
So... the implicit wait is not your issue. Looking through the website's code, I found that there is no "treeZhiBiao_14", so I'm not sure what you're trying to click here. Maybe try something like this instead, so you know what you're clicking:
browser.find_element_by_xpath("//*[contains(text(), '工业')]").click()
or
browser.find_element_by_xpath("//*[contains(text(), 'industry')]").click()
I am trying to learn how to read text inside a PDF using the IE driver of Selenium. I am getting a selenium.common.exceptions.NoSuchElementException: Message: Unable to find element with css selector == body
from selenium import webdriver
import time

# Raw string so the backslashes in the Windows path are not treated as escapes
TO_url = r"Y:\Work\Work\PFCToolbox\exampleTO\HT072663_001.pdf"
vpc_url = "http://dspgot03.vcc.ford.com/apps/vpc/vpc.nsf/"

driver = webdriver.Ie()
driver.get(TO_url)
element = driver.find_element_by_css_selector("body")
time.sleep(10)
I also tried other driver.find_element_by_* functions but couldn't find one that works.
Instead of finding the body element, try sending keystrokes to select and copy the page text. Note that the WebDriver object itself has no send_keys method, so route the keystrokes through ActionChains:
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

ActionChains(driver).key_down(Keys.CONTROL).send_keys('a').key_up(Keys.CONTROL).perform()
ActionChains(driver).key_down(Keys.CONTROL).send_keys('c').key_up(Keys.CONTROL).perform()
Then paste it from the clipboard.
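For example, a minimal sketch of the paste step, assuming the third-party pyperclip package is installed (pip install pyperclip):

import pyperclip

# After the Ctrl+A / Ctrl+C above, the PDF text should be on the system clipboard
pdf_text = pyperclip.paste()
print(pdf_text)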