I came across a website that I am hoping to scrape some data from, but the site seems to be un-scrapable given my limited Python knowledge. When using driver.find_element_by_xpath, I usually run into timeout exceptions.
Using the code below, I hope to click on the first search result and go to a new page. On the new page, I want to scrape the product title and pack size. But no matter what I try, I cannot even get Python to click the right element, let alone scrape the data. Can someone help out?
My desired output is:
Tris(triphenylphosphine)rhodium(I) chloride, 98%
190420010
1 GR 87.60
5 GR 367.50
This is the code I have so far:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "http://www.acros.com/"
cas = "14694-95-2"  # need to select the appropriate result

driver = webdriver.Firefox()
driver.get(url)

country = driver.find_element_by_name("ddlLand")
for option in country.find_elements_by_tag_name("option"):
    if option.text == "United States":
        option.click()
driver.find_element_by_css_selector("input[type = submit]").click()

choice = driver.find_element_by_name("_ctl1:DesktopThreePanes1:ThreePanes:_ctl4:ddlType")
for option in choice.find_elements_by_tag_name("option"):
    if option.text == "CAS registry number":
        option.click()

inputElement = driver.find_element_by_id("_ctl1_DesktopThreePanes1_ThreePanes__ctl4_tbSearchString")
inputElement.send_keys(cas)
driver.find_element_by_id("_ctl1_DesktopThreePanes1_ThreePanes__ctl4_btnGo").click()
Your code as presented works fine for me, in that it directs the instance of Firefox to http://www.acros.com/DesktopModules/Acros_Search_Results/Acros_Search_Results.aspx?search_type=CAS&SearchString=14694-95-2 which shows the search results.
If you locate the iframe element on that page:
<iframe id="searchAllFrame" allowtransparency="" background-color="transparent" frameborder="0" width="1577" height="3000" scrolling="auto" src="http://newsearch.chemexper.com/misc/hosted/acrosPlugin/center.shtml?query=14694-95-2&searchType=cas&currency=&country=NULL&language=EN&forGroupNames=AcrosOrganics,FisherSci,MaybridgeBB,BioReagents,FisherLCMS&server=www.acros.com"></iframe>
and use driver.switch_to.frame to switch to that frame, then I think the data you want should be scrapable from there, for example:
driver.switch_to.frame(driver.find_element_by_xpath("//iframe[@id='searchAllFrame']"))
You can then carry on using the driver as usual to find elements within that iframe. (I think switch_to_frame works similarly, but is deprecated.)
(I can't seem to find a decent link to the docs for switch_to; the page I did find isn't all that helpful.)
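To make the frame-switching flow concrete, here is a minimal hedged sketch. The helper is my own illustration (not part of Selenium's API), and the driver calls are shown as comments since they need a live browser; the frame id searchAllFrame comes from the iframe above:

```python
def iframe_locator(frame_id):
    # Build an XPath matching an iframe by its id attribute.
    # Note the @ prefix: //iframe[@id='...'] is the valid form.
    return "//iframe[@id='%s']" % frame_id

# Hypothetical usage against the page above:
# frame = driver.find_element_by_xpath(iframe_locator("searchAllFrame"))
# driver.switch_to.frame(frame)       # queries now run inside the iframe
# ...scrape the title and pack sizes...
# driver.switch_to.default_content()  # return to the top-level document
```

Switching back with switch_to.default_content matters if you need to touch anything outside the iframe afterwards.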
I am having trouble accessing an input element on this specific webpage: http://prod.symx.com/MTECorp/config.asp?cmd=edit&CID=428D77C8A7ED4DA190E6170116F3A71B
If the webpage has timed out, just go ahead and click the link below:
https://www.mtecorp.com/click-find/
and click the hyperlink "RL_reactors" to get to the page.
On this page, I am trying to access the search bar/input element to type in a part number that the company sells. This is for a school project collecting pricing data and such from different companies. I am using PyCharm (Python) and Selenium to write this script. Currently, this is the snippet of my code:
# web scraping for MTE product cost list
# reading excel files on the drive
import time
from openpyxl import Workbook, load_workbook
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
.........................
..more code
..........................
# part that it is getting stuck on
if (selection >= 1) and (selection <= 7):
    print("valid selection, going to page...")
    if selection == 1:
        target = driver.find_element(By.XPATH, "/html/body/main/article/div/div/div/table/tbody/tr[1]/td[1]/a")
        driver.execute_script("arguments[0].click();", target)
        element = WebDriverWait(driver, 100).until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".plxsty_pid"))).send_keys("test")
        print("passed clickable element argument\n")
Currently, my code does get to the RL_reactors page as intended, but when I use a CSS selector with the class name, it doesn't recognize the class I'm trying to get. Now, of course, many would say: why not use XPath? The reason I can't use XPath (or the id) is that the element id changes on every run of the script. For example, on the first run the id might be "hr8", while on another run it could be "dsfsih". From my observation, the only parts of the element that stay constant are the value and the class name. I have tried XPath, id, CSS selector, and such, but to no result. Any suggestions?
Thanks!
Because you are using JavaScript to click the link on the website, the page opens in a new tab and Selenium doesn't change to it (hence it cannot locate the class you are searching for). You can explicitly tell Selenium to switch to the new tab:
url = "https://www.mtecorp.com/click-find/"
driver.get(url)
target=driver.find_element(By.XPATH,"/html/body/main/article/div/div/div/table/tbody/tr[1]/td[1]/a")
driver.execute_script("arguments[0].click();", target)
driver.switch_to.window(driver.window_handles[1])
element = WebDriverWait(driver,20).until(EC.element_to_be_clickable((By.CSS_SELECTOR,".plxsty_pid"))).clear()
element = WebDriverWait(driver,20).until(EC.element_to_be_clickable((By.CSS_SELECTOR,".plxsty_pid"))).send_keys('test')
Alternatively, instead of clicking the link, you can grab the href and open it in a new instance of selenium by calling driver.get() again.
url = "https://www.mtecorp.com/click-find/"
driver.get(url)
target_link=driver.find_element(By.XPATH,"/html/body/main/article/div/div/div/table/tbody/tr[1]/td[1]/a").get_attribute('href')
driver.get(target_link)
element = WebDriverWait(driver,20).until(EC.element_to_be_clickable((By.CSS_SELECTOR,".plxsty_pid"))).clear()
element = WebDriverWait(driver,20).until(EC.element_to_be_clickable((By.CSS_SELECTOR,".plxsty_pid"))).send_keys("test")
I'm trying to pull the airline names and prices for a specific flight. I'm having trouble with the XPath and/or using the right HTML tags, because when I run the code below, all I get back is 14 empty lists.
from selenium import webdriver
from lxml import html
from time import sleep
driver = webdriver.Chrome(r"C:\Users\14074\Python\chromedriver")
URL = 'https://www.google.com/travel/flights/search?tfs=CBwQAhopagwIAxIIL20vMHBseTASCjIwMjEtMTItMjNyDQgDEgkvbS8wMWYwOHIaKWoNCAMSCS9tLzAxZjA4chIKMjAyMS0xMi0yN3IMCAMSCC9tLzBwbHkwcAGCAQsI____________AUABSAGYAQE&tfu=EgYIAhAAGAA'
driver.get(URL)
sleep(1)
tree = html.fromstring(driver.page_source)
for flight_tree in tree.xpath('//div[@class="TQqf0e sSHqwe tPgKwe ogfYpf"]'):
    title = flight_tree.xpath('.//*[@id="yDmH0d"]/c-wiz[2]/div/div[2]/div/c-wiz/div/c-wiz/div[2]/div[2]/div/div[2]/div[6]/div/div[2]/div/div[1]/div/div[1]/div/div[2]/div[2]/div[2]/span/text()')
    price = flight_tree.xpath('.//span[contains(@data-gs, "CjR")]')
    print(title, price)
# driver.close()
This is just the first part of my code, but I can't really continue without getting this to work. If anyone has ideas on what I'm doing wrong, that would be amazing! It's been driving me crazy. Thank you!
I noticed a few issues with your code. First of all, I believe that when entering this page, Google will first show you the "I agree to terms and conditions" popup before showing the content of the page, so you need to click that button first.
Also, you should use the find_elements_by_xpath function directly on the driver instead of on the page source, as this queries the JavaScript-rendered content. You can find more info here: python tree.xpath return empty list
To get more info on how to scrape using Selenium and Python, you could check out this guide: https://www.webscrapingapi.com/python-selenium-web-scraper/
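To illustrate the live-DOM point above with a hedged fragment: the driver evaluates XPaths against the JavaScript-rendered page, while lxml only sees the one-time snapshot in page_source. The small helper below is my own illustration, not from the question:

```python
def texts_from_elements(elements):
    # Collect the .text of each element-like object, dropping blanks.
    return [t.strip() for t in (getattr(e, "text", "") for e in elements) if t.strip()]

# Live query through the driver (hypothetical):
# titles = texts_from_elements(
#     driver.find_elements_by_xpath('//*[@class="W6bZuc YMlIz"]'))
# versus the static snapshot, which never re-renders JS:
# tree = html.fromstring(driver.page_source)
```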
I used the following code to scrape the titles. (I also changed the XPaths to do so, by extracting them directly from Google Chrome. You can do that by right-clicking an element -> Inspect, and in the Elements tab, right-click the element -> Copy -> Copy XPath.)
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
# I used these options to make the code work on my Windows Subsystem for Linux
option = webdriver.ChromeOptions()
option.add_argument('--no-sandbox')
option.add_argument('--disable-dev-sh-usage')
driver = webdriver.Chrome(ChromeDriverManager().install(), options=option)
URL = 'https://www.google.com/travel/flights/search?tfs=CBwQAhopagwIAxIIL20vMHBseTASCjIwMjEtMTItMjNyDQgDEgkvbS8wMWYwOHIaKWoNCAMSCS9tLzAxZjA4chIKMjAyMS0xMi0yN3IMCAMSCC9tLzBwbHkwcAGCAQsI____________AUABSAGYAQE&tfu=EgYIAhAAGAA'
driver.get(URL)
driver.find_element_by_xpath('//*[@id="yDmH0d"]/c-wiz/div/div/div/div[2]/div[1]/div[4]/form/div[1]/div/button/span').click()  # this is necessary to press the "I agree" button
elements = driver.find_elements_by_xpath('//*[@id="yDmH0d"]/c-wiz[2]/div/div[2]/div/c-wiz/div/c-wiz/div[2]/div[3]/div[3]/c-wiz/div/div[2]/div[1]/div/div/ol/li')
for flight_tree in elements:
    title = flight_tree.find_element_by_xpath('.//*[@class="W6bZuc YMlIz"]').text
    print(title)
I tried the code below, with the screen maximized and with explicit waits, and could successfully extract the information; please see below:
Sample code :
driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.get("https://www.google.com/travel/flights/search?tfs=CBwQAhopagwIAxIIL20vMHBseTASCjIwMjEtMTItMjNyDQgDEgkvbS8wMWYwOHIaKWoNCAMSCS9tLzAxZjA4chIKMjAyMS0xMi0yN3IMCAMSCC9tLzBwbHkwcAGCAQsI____________AUABSAGYAQE&tfu=EgYIAhAAGAA")
wait = WebDriverWait(driver, 10)
titles = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div/descendant::h3")))
for name in titles:
    print(name.text)
    price = name.find_element(By.XPATH, "./../following-sibling::div/descendant::span[2]").text
    print(price)
Imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Output :
Tokyo
₹38,473
Mumbai
₹3,515
Dubai
₹15,846
So I have to web scrape the info on car year, model, and make from https://auto-buy.geico.com/nb#/sale/vehicle/gskmsi/ (if the link doesn't work, kindly go to https://geico.com, fill in the zip code '75002', enter random details in the customer info, and you will land on the vehicle info page).
Having browsed through various answers, I have figured out that I can't use mechanize or anything similar, owing to the browser sending JavaScript requests every time I select an option in the menu. That leaves something like Selenium to help me.
Following is my code:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Ie("IEDriverServer.exe")
WebDriverWait(driver, 10)
driver.get('https://auto-buy.geico.com/nb#/sale/customerinformation/gskmsi')
html = driver.page_source
soup = BeautifulSoup(html)
select = Select(driver.find_element_by_id('vehicleYear'))
print(select)
The output is an empty [] because it's unable to locate the form.
Please let me know how to select the data from the forms of the page.
P.S.: Though I have used IE, any code correction using Mozilla or Chrome is also welcome.
You need to fill out all the info in "Customer" tab using Selenium and then wait for the appearance of this select element:
from selenium.webdriver.support import ui
select_element = ui.Select(ui.WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.ID, "vehicleYear"))))
Then select a needed option:
select_element.select_by_visible_text("2017")
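As a hedged side note: if select_by_visible_text ever raises NoSuchElementException here, it helps to print what the drop-down actually contains before selecting. The matching helper below is my own illustration, not part of Selenium's API:

```python
def match_option(option_texts, target):
    # Return the first option text equal to target, ignoring case and
    # surrounding whitespace; None if nothing matches.
    wanted = target.strip().lower()
    for text in option_texts:
        if text.strip().lower() == wanted:
            return text
    return None

# Hypothetical usage with the select_element above:
# available = [o.text for o in select_element.options]
# print(available)  # see the real option texts first
# choice = match_option(available, "2017")
# if choice:
#     select_element.select_by_visible_text(choice)
```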
Hope it helps you!
I am using Selenium and Python for web scraping, and the page that I am using for testing is this link.
The problem is that I am not able to handle the dynamic content of the drop-downs; this is where the problem arises.
When selecting the state, the cities are loaded based on the state; some PHP and JS run in the back end, as far as I know.
So I searched the web and came up with a solution to wait for some time; please use this link as a reference.
The following is a part of my code
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
chrome_path = r"E:\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("http://www.blooddonors.in")
select = Select(driver.find_element_by_xpath('/html/body/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/form/table/tbody/tr[2]/td[1]/select'))
select.select_by_visible_text('Tamil Nadu')
driver.implicitly_wait(60)
drop = Select(driver.find_element_by_xpath('//*[@id="div_city"]/select'))
select.select_by_visible_text('Coimbotore')
I am using a Windows system and I tried using CMD. It doesn't need a wait function; it works fine without it.
The error that I am facing is:
raise NoSuchElementException("Could not locate element with visible text: %s" % text)
selenium.common.exceptions.NoSuchElementException: Message: Could not locate element with visible text: Coimbotore
But it is actually there.
If someone can help me resolve the issue, that would be great, and I can move on to the next step.
Thanks
To select Tamil Nadu and then select Coimbotore, you can use the following code block:
driver.get("http://www.blooddonors.in")
select = Select(driver.find_element_by_name('select'))
select.select_by_visible_text('Tamil Nadu')
drop = Select(driver.find_element_by_name('city'))
city_option = WebDriverWait(driver, 5).until(lambda x: x.find_element_by_xpath("//select[#name='city']/option[text()='Coimbotore']"))
city_option.click()
Selenium provides a select class which can be used to grab elements from drop-down menu.
select = Select(driver.find_element_by_id('city'))
select.select_by_value('430') #search by value, coimbotore is 430
Your second drop-down is defined as drop, while you're still trying to handle the first drop-down (select), which doesn't have a "Coimbotore" option...
Just replace
drop = Select(driver.find_element_by_xpath('//*[#id="div_city"]/select'))
select.select_by_visible_text('Coimbotore')
with
drop = Select(driver.find_element_by_xpath('//select[@name="city" and count(option) > 1]'))
drop.select_by_visible_text('Coimbotore')
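The count(option) > 1 predicate in that XPath waits out the placeholder state of the city drop-down. An equivalent hedged sketch with an explicit wait; the small predicate helper is my own illustration:

```python
def is_populated(option_count, minimum=2):
    # A drop-down stuck on its placeholder typically has one option;
    # treat it as populated once at least `minimum` options exist.
    return option_count >= minimum

# Hypothetical usage (assumes WebDriverWait and Select are imported):
# WebDriverWait(driver, 10).until(lambda d: is_populated(
#     len(Select(d.find_element_by_name("city")).options)))
# Select(driver.find_element_by_name("city")).select_by_visible_text("Coimbotore")
```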
I started to learn to scrape websites with Python and Selenium. I chose Selenium because I need to navigate through the website and I also have to log in.
I wrote a script that opens a Firefox window and loads the website www.flashscore.com. With this script I am also able to log in and navigate to the different sports sections (main menu) they have.
The code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
# open website
driver = webdriver.Firefox()
driver.get("http://www.flashscore.com")
# login
driver.find_element_by_id('signIn').click()
username = driver.find_element_by_id("email")
password = driver.find_element_by_id("passwd")
username.send_keys("*****")
password.send_keys("*****")
driver.find_element_by_name("login").click()
# go to the tennis section
link = driver.find_element_by_link_text('Tennis')
link.click()
#go to the live games tab in the tennis section
# ?????????????????????????????'
Then it got more difficult. I also want to navigate to, for example, the "live games" and "finished" tabs in a sports section. This part wouldn't work. I tried many things, but I can't get into either of these tabs. When analyzing the website, I see that they use some iframes. I also found some code to switch to an iframe window, but the problem is that I can't find the name of the iframe containing the tabs I want to click. Maybe the iframes are not the problem and I'm looking in the wrong place. (Maybe the problem is caused by some JavaScript?)
Can anybody please help me with this?
No, the iframes are not the problem in this case. The "LIVE Games" element is not inside an iframe. Locate it by link text and click:
live_games_link = driver.find_element_by_link_text("LIVE Games")
live_games_link.click()
You may need to wait for this link to be clickable before actually trying to click it:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
wait = WebDriverWait(driver, 10)
live_games_link = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, "LIVE Games")))
live_games_link.click()