How to use Selenium for web scraping Google Flights? - python

I'm trying to pull the airline names and prices of a specific flight. I'm having trouble with the XPath and/or using the right HTML tags, because when I run the code below, all I get back is 14 empty lists.
from selenium import webdriver
from lxml import html
from time import sleep
driver = webdriver.Chrome(r"C:\Users\14074\Python\chromedriver")
URL = 'https://www.google.com/travel/flights/search?tfs=CBwQAhopagwIAxIIL20vMHBseTASCjIwMjEtMTItMjNyDQgDEgkvbS8wMWYwOHIaKWoNCAMSCS9tLzAxZjA4chIKMjAyMS0xMi0yN3IMCAMSCC9tLzBwbHkwcAGCAQsI____________AUABSAGYAQE&tfu=EgYIAhAAGAA'
driver.get(URL)
sleep(1)
tree = html.fromstring(driver.page_source)
for flight_tree in tree.xpath('//div[@class="TQqf0e sSHqwe tPgKwe ogfYpf"]'):
    title = flight_tree.xpath('.//*[@id="yDmH0d"]/c-wiz[2]/div/div[2]/div/c-wiz/div/c-wiz/div[2]/div[2]/div/div[2]/div[6]/div/div[2]/div/div[1]/div/div[1]/div/div[2]/div[2]/div[2]/span/text()')
    price = flight_tree.xpath('.//span[contains(@data-gs, "CjR")]')
    print(title, price)
#driver.close()
This is just the first part of my code, but I can't really continue without getting this to work. If anyone has ideas on what I'm doing wrong, that would be amazing! It's been driving me crazy. Thank you!

I noticed a few issues with your code. First of all, I believe that when you enter this page, Google will first show you the "I agree to terms and conditions" popup before showing you the content of the page, therefore you need to click on that button first.
Also, you should use the find_elements_by_xpath function directly on the driver instead of parsing the page content, as this also allows you to render the JavaScript content. You can find more info here: python tree.xpath return empty list
To get more info on how to scrape using Selenium and Python you could check out this guide: https://www.webscrapingapi.com/python-selenium-web-scraper/
I used the following code to scrape the titles. (I also changed the XPaths, extracting them directly from Google Chrome. You can do that by right-clicking on an element -> Inspect, and in the Elements tab, right-click the element -> Copy -> Copy XPath.)
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
# I used these for the code to work on my windows subsystem linux
option = webdriver.ChromeOptions()
option.add_argument('--no-sandbox')
option.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(ChromeDriverManager().install(), options=option)
URL = 'https://www.google.com/travel/flights/search?tfs=CBwQAhopagwIAxIIL20vMHBseTASCjIwMjEtMTItMjNyDQgDEgkvbS8wMWYwOHIaKWoNCAMSCS9tLzAxZjA4chIKMjAyMS0xMi0yN3IMCAMSCC9tLzBwbHkwcAGCAQsI____________AUABSAGYAQE&tfu=EgYIAhAAGAA'
driver.get(URL)
driver.find_element_by_xpath('//*[@id="yDmH0d"]/c-wiz/div/div/div/div[2]/div[1]/div[4]/form/div[1]/div/button/span').click() # this is necessary to press the "I agree" button
elements = driver.find_elements_by_xpath('//*[@id="yDmH0d"]/c-wiz[2]/div/div[2]/div/c-wiz/div/c-wiz/div[2]/div[3]/div[3]/c-wiz/div/div[2]/div[1]/div/div/ol/li')
for flight_tree in elements:
    title = flight_tree.find_element_by_xpath('.//*[@class="W6bZuc YMlIz"]').text
    print(title)

I tried the code below, with the screen maximized and with explicit waits, and could successfully extract the information; please see below.
Sample code:
driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.get("https://www.google.com/travel/flights/searchtfs=CBwQAhopagwIAxIIL20vMHBseTASCjIwMjEtMTItMjNyDQgDEgkvbS8wMWYwOHIaKWoNCAMSCS9tLzAxZjA4chIKMjAyMS0xMi0yN3IMCAMSCC9tLzBwbHkwcAGCAQsI____________AUABSAGYAQE&tfu=EgYIAhAAGAA")
wait = WebDriverWait(driver, 10)
titles = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div/descendant::h3")))
for name in titles:
    print(name.text)
    price = name.find_element(By.XPATH, "./../following-sibling::div/descendant::span[2]").text
    print(price)
Imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Output:
Tokyo
₹38,473
Mumbai
₹3,515
Dubai
₹15,846

Unable to Open .htm link with Selenium

I cannot open the link described in the picture with Selenium.
I have tried to find the element by CSS selector, link text, partial link text, and XPath. Still no success: the program shows no error, but it does not click the last link. Here is the picture from the Inspect view on the SEC website: Picture of Inspect Code. The line of code that tries to open it is marked with a comment below.
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from urllib.request import urlopen as uReq  # assumed import for uReq below; the question omits it
PATH = r"C:\Program Files (x86)\Misc Programs\chromedriver.exe"
stock = 'KO'
#stock = input("Enter stock ticker: ")
browser = webdriver.Chrome(PATH)
# First SEC search
sec_url = 'https://www.sec.gov/search/search.htm'
browser.get(sec_url)
tikSearch = browser.find_element_by_css_selector('#cik')
tikSearch.click()
tikSearch.send_keys(stock)
Sclick = browser.find_element_by_css_selector('#searchFormDiv > form > fieldset > span > input[type=submit]')
Sclick.click()
formDesc = browser.find_element_by_css_selector('#seriesDiv > table > tbody > tr:nth-child(2) > td:nth-child(1)')
print(formDesc)
doc = browser.find_element_by_css_selector('#documentsbutton')
doc.click()
## Cannot open file (this is the failing line)
form = browser.find_element_by_xpath('//*[@id="formDiv"]/div/table/tbody/tr[2]/td[3]/a')
form.click()
uClient = uReq(sec_url)
page_html = uClient.read()
On Firefox this worked and got https://www.sec.gov/Archives/edgar/data/21344/000002134421000018/a20201231crithrifplan.htm
Pasting that into Chrome directly also works.
But in the script, it indeed did not open and left one stuck at:
https://www.sec.gov/Archives/edgar/data/21344/000002134421000018/0000021344-21-000018-index.htm
where, oddly, clicking on the link by hand works in the browser that Selenium launched.
It's better with a wait, but if I put time.sleep(5) before your
form = browser.find_element_by_xpath('//*[@id="formDiv"]/div/table/tbody/tr[2]/td[3]/a')
it opens in Chrome.
EDIT: And here it is done properly with no sleep:
wait = WebDriverWait(browser, 20)
wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="formDiv"]/div/table/tbody/tr[2]/td[3]/a'))).click()
This assumes you have the imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Possibly useful addition:
I am surprised there is no Selenium test helper out there with methods that wrap in some bulletproofing (or maybe there are and I just do not know of them), like what Hetzner Cloud did in its Protractor Test Helper. So I wrote my own little wrapper method for the click (and one for send keys, which calls this one; a sketch of that follows the usage example below). If it's useful to you or other readers, enjoy. It could be enhanced to build in retries, or to take as parameters the wait time or whether to scroll the field to the top or bottom of the window (or at all). It is working in my context as is.
def safe_click(driver, locate_method, locate_string):
    """
    Parameters
    ----------
    driver : webdriver
        initialized browser object
    locate_method : Locator
        By.something
    locate_string : string
        how to find it

    Returns
    -------
    None
        returns whatever click() does.
    """
    wait = WebDriverWait(driver, 15)
    wait.until(EC.presence_of_element_located((locate_method, locate_string)))
    driver.execute_script("arguments[0].scrollIntoView(false);",
                          driver.find_element(locate_method, locate_string))
    return wait.until(EC.element_to_be_clickable((locate_method, locate_string))).click()
If you use it, then the call (which I just tested and it worked) would be:
safe_click(browser, By.XPATH, '//*[@id="formDiv"]/div/table/tbody/tr[2]/td[3]/a')
You could be using it elsewhere, too, but it does not seem like there is a need.
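For completeness, here is a minimal sketch of the send-keys companion mentioned above, assuming it simply calls safe_click first; the name safe_send_keys and the clear() call are my guesses, since the original only describes that helper rather than showing it:
def safe_send_keys(driver, locate_method, locate_string, text):
    # Hypothetical companion to safe_click (described but not shown above):
    # safe_click waits for presence and clickability, then we type into the field.
    safe_click(driver, locate_method, locate_string)
    field = driver.find_element(locate_method, locate_string)
    field.clear()  # start from an empty field
    return field.send_keys(text)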

Webdriver not returning some data

I am trying to get some information from a website. The Web Inspector shows the HTML source, with what JavaScript rendered into it. So I wanted to use ChromeDriver to render it for the purpose of extracting certain information which cannot be accessed by simply requesting the website.
Now what seems confusing is that even the driver is not returning anything.
My code looks like this:
driver = webdriver.Chrome('path/Chromedriver')
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
results = soup.find_all("tr", class_="odd")
And the website is:
https://www.amundietf.co.uk/professional/product/view/LU1681038243
Is there anything else that gets rendered into the HTML when the Web Inspector is opened, which ChromeDriver is not able to handle?
Thanks for your answers in advance!
First you need to accept the privacy settings, then click validateDisclaimer to enter the site:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
url = "https://www.amundietf.co.uk/professional/product/view/LU1681038243"
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
driver.implicitly_wait(10)
driver.get(url)
driver.find_element_by_id("footer_tc_privacy_button_3").click()
driver.find_element_by_id("validateDisclaimer").click()
WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".fpFrame.fpBannerMore #blockleft>#part_principale_1")))
soup = BeautifulSoup(driver.page_source, 'html.parser')
results = soup.find_all("tr", class_="odd")
print(results)
After that you need to wait for your page to load and to define the elements you are looking for correctly.
Your question really contains several problems that should be solved one by one.
I just pointed out the first of them.
Update
I solved the issue.
You will need to parse the result yourself; a minimal sketch follows the list below.
So, you had these problems:
You did not click the two buttons.
You did not wait for the table you need to load.
You did not have any waits. In Selenium you must use them.
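As an illustration, here is a minimal parsing sketch for the results list collected above; the assumption that each matched tr holds plain td cells is mine, so adjust it to the actual table layout on the Amundi page:
# Minimal sketch: print the cell texts of every matched row.
# Assumes `results` from the answer code above; the td-based layout is an assumption.
for row in results:
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    print(cells)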

Dynamic element (table) on page is not updated when I use click() in Selenium, so I couldn't retrieve the new data

Page which I need to scrape data from: Digikey Search result
Issue
Only 100 rows are allowed per table, so I have to move between multiple tables using the next-page button.
As illustrated in the code below, I actually do, but every time the results retrieved are those of the first table; the click action ActionChains(driver).click(element).perform() does not move me on to the next table's results.
Keep in mind that NO new page is opened; the click is intercepted by some JavaScript that does rich UI work on the same page to load a new table of data.
My Expectations
I am just trying to validate that I can move to the next table; then I will edit the code to loop through all of them.
This piece of code should return the data of the second table of results, BUT it actually returns the values of the first table, which loaded initially with the URL. This means that either the click action didn't occur, or it did occur but the WebDriver content isn't being updated when interacting with the dynamic JavaScript elements on the page.
I would appreciate any help. Thanks!
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import presence_of_element_located
from selenium.webdriver import ActionChains
import time
import sys
url = "https://www.digikey.com/en/products/filter/coaxial-connectors-rf-terminators/382?s=N4IgrCBcoA5QjAGhDOl4AYMF9tA"
chrome_driver_path = "..PATH\\chromedriver"
chrome_options = Options()
chrome_options.add_argument("--headless")
webdriver = webdriver.Chrome(
    executable_path=chrome_driver_path,
    options=chrome_options
)
with webdriver as driver:
    wait = WebDriverWait(driver, 10)
    driver.get(url)
    wait.until(presence_of_element_located((By.CSS_SELECTOR, "tbody")))
    element = driver.find_element_by_css_selector("button[data-testid='btn-next-page']")
    ActionChains(driver).click(element).perform()
    time.sleep(10)  # too much time, I know, but it makes sure this is not a waiting issue; something needs to be updated
    results = driver.find_elements_by_css_selector("tbody")
    for count in results:
        countArr = count.text
        print(countArr)
        print()
    driver.close()
Finally found a SOLUTION!
Source of the solution.
As expected, the issue was in the clicking action itself. It is somehow not being done right, or not being done at all, as illustrated in the solution's source question.
The solution is to click the button using JavaScript execution.
Change line 30
ActionChains(driver).click(element).perform()
to the following:
driver.execute_script("arguments[0].click();",element)
That's it!
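Building on that fix, here is a rough sketch of the paging loop the question wants to end up with; the fixed page count and the sleep-based wait are my assumptions, not part of the original answer:
# Hypothetical paging loop using the JavaScript click above.
# The page count of 5 is an assumption; adjust as needed.
for page in range(5):
    for table in driver.find_elements_by_css_selector("tbody"):
        print(table.text)
    button = driver.find_element_by_css_selector("button[data-testid='btn-next-page']")
    driver.execute_script("arguments[0].click();", button)
    time.sleep(3)  # crude; an explicit wait for the refreshed table would be better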

How can I use Selenium (Python) to do a Google Search and then open the results of the first page in new tabs?

As the title says, I'd like to perform a Google search using Selenium and then open all results of the first page in separate tabs.
Please have a look at the code; I can't get any further (it's just my third day learning Python).
Thank you for your help!
Code:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import pyautogui
query = 'New Search Query'
browser = webdriver.Chrome('/Users/MYUSERNAME/Desktop/Desktop-Files/Chromedriver/chromedriver')
browser.get('http://www.google.com')
search = browser.find_element_by_name('q')
search.send_keys(query)
search.send_keys(Keys.RETURN)
element = browser.find_element_by_class_name('LC20lb')
element.click()
The reason I imported pyautogui is that I tried simulating a right click and then "open in new tab" for each result, but it was a little confusing :)
Forget about pyautogui, as what you want to do can be done in Selenium. The same goes for most of the other imports; you just do not need them. See if this code meets your needs.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
query = 'sins of a solar empire' #my query about a video game
browser = webdriver.Chrome()
browser.get('http://www.google.com')
search = browser.find_element_by_name('q')
search.send_keys(query)
search.send_keys(Keys.RETURN)
links = browser.find_elements_by_class_name('r') #I went on Google Search and found the container class for the link
for link in links:
    url = link.find_element_by_tag_name('a').get_attribute("href") #this code extracts the url of the HTML link
    browser.execute_script('''window.open("{}","_blank");'''.format(url)) # this code uses Javascript to open a new tab and open the given url in that new tab
    print(link.find_element_by_tag_name('a').get_attribute("href"))
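If you then want to interact with the tabs you opened, a small addition of my own (not part of the original answer): Selenium exposes each tab as a window handle, so you can cycle through them like this:
# Each tab opened by window.open gets its own window handle.
for handle in browser.window_handles:
    browser.switch_to.window(handle)
    print(browser.title)  # confirm which tab we are on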

No Visible Text Results from Python XPATH Selenium Elements

I would like to extract the "First Published" dates using Selenium in Python, yet I am having trouble getting any visible date results. Even though the XPath matches successfully in the browser's inspection tab, and I get the expected number of elements in my script, I get no text results for the dates in my console.
My Code:
import time
from selenium import webdriver
driver = webdriver.Firefox(executable_path=r'C:\Users\MyComputer\PycharmProjects\SeleniumProject\venv\Lib\site-packages\selenium\webdriver\common\geckodriver.exe')
driver.get('https://tools.cisco.com/security/center/publicationListing.x?resourceIDs=93036,5834,80720&apply=1,1,1&totalbox=3&pt0=Cisco&cp0=93036&pt1=Cisco&cp1=5834&pt2=Cisco&cp2=80720#~FilterByProduct')
time.sleep(20)
prices = driver.find_elements_by_xpath('//span[@class="ng-binding" and contains(text(),"GMT")]')
for post in prices:
    if post.text != "":
        print(post.text)
print(len(prices))
driver.close()
I have tried other visible XPaths on the website to test in Python, and I can get the 20 vulnerability titles that show up on screen when you open the link. So am I right to assume that I have to tell Selenium to click every link and extract the date, doing that for every title? But then how am I able to get them all in one go through the browser inspection tab?
All help is appreciated,
Thanks,
Selenium can retrieve only visible text. If you don't want to open all the hidden sections, you can use get_attribute('textContent') or get_attribute('innerText'):
for post in prices:
    print(post.get_attribute('innerText'))
Please find below a solution if you want to fetch the first published date:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium import webdriver
driver = webdriver.Chrome(executable_path=r"C:\New folder\chromedriver.exe")
driver.maximize_window()
driver.get("https://tools.cisco.com/security/center/publicationListing.x?resourceIDs=93036,5834,80720&apply=1,1,1&totalbox=3&pt0=Cisco&cp0=93036&pt1=Cisco&cp1=5834&pt2=Cisco&cp2=80720#~FilterByProduct")
print(driver.current_url)
element = WebDriverWait(driver, 15).until(EC.visibility_of_element_located((By.XPATH, "//div[@id='tab-2']//tr[1]//td[1]//table[1]//tbody[1]//tr[1]/td[4]/span")))
print(element.text)
If you want to print all the dates:
dates = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@id='tab-2']//tr[*]//td[1]//table[1]//tbody[1]//tr[*]/td[4]/span")))
print(len(dates))
for date in dates:
    print(date.text)
