I want to use Selenium to find specific text - python

I was going to use Selenium to crawl the web
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome('./chromedriver', options=options)
driver.get('https://steamdb.info/tag/1742/?all')
driver.implicitly_wait(3)
li = []
games = driver.find_elements_by_xpath('//*[@class="table-products.text-center.dataTable"]')
for i in games:
    time.sleep(5)
    li.append(i.get_attribute("href"))
print(li)
After accessing the Steam URL I was looking for, I tried to find something called an appid.
The picture below is the HTML I'm looking for.
I'm trying to find the number next to "data-appid=".
But if I run my code, nothing is saved in games.

Correct me if I'm wrong, but from what I can see this Steam page requires you to log in. Are you sure that the same data is available to you when webdriver opens the page?
Additionally, when using By the correct syntax would be games = driver.find_elements(By.XPATH, '//*[@class="table-products text-center dataTable"]'). Note that By.CSS_SELECTOR is not callable, the expression you wrote is XPath rather than CSS, and an XPath @class comparison is an exact string match, so the class list must be space-separated ("table-products text-center dataTable"), not dot-separated.
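For the appid itself, once the elements are located you can read the attribute with get_attribute. A minimal sketch, assuming the rows of the products table carry a data-appid attribute (the CSS selector below is an assumption, not verified against the live, possibly login-gated page):
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('https://steamdb.info/tag/1742/?all')
driver.implicitly_wait(3)

# Assumed markup: <tr data-appid="..."> rows inside the products table.
rows = driver.find_elements(By.CSS_SELECTOR, 'table.table-products tr[data-appid]')
print([row.get_attribute('data-appid') for row in rows])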

Related

Zillow Web Scraping using Selenium PXCaptcha

I am trying to do a project using Selenium which goes to Zillow to find homes for rent and returns their properties, i.e. renting link, price, and address.
This is my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(executable_path=CHROME_DRIVER_PATH)
driver.get(ZILLOW_HOUSES_URL)
house_links = driver.find_elements(By.CSS_SELECTOR, LINKS_CSS_SELECTOR)
prices = driver.find_elements(By.CSS_SELECTOR, PRICES_CSS_SELECTOR)
addresses = driver.find_elements(By.CSS_SELECTOR, ADDRESSES_CSS_SELECTOR)
for link in house_links:
    print(link.get_attribute('href'))
for price in prices:
    print(price.text.split('+')[0].split(', ')[0].split('/')[0])
for address in addresses:
    print(address.text)
Mostly when I run it, it does go to the Zillow webpage, but this CaptchaPX thing comes up. I press and hold, but it comes up again saying Try Again. I try it again; it doesn't stop. How do I get rid of this?
You need to make sure cookies can be saved; this is what got me past the CAPTCHA. The profile path has to be fully qualified or Chrome complains.
import os
from selenium.webdriver.chrome.options import Options

sel_path = os.path.join(os.getcwd(), 'selenium')  # fully qualified profile directory
chrome_options = Options()
chrome_options.add_argument("user-data-dir=" + sel_path)
driver = webdriver.Chrome(options=chrome_options)
driver.get(zillow_path)
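One workflow that may help, though it's an assumption on my part rather than something the answer above guarantees: run the script once without headless mode and solve the press-and-hold challenge by hand. Chrome then stores the session cookies inside the selenium profile folder, and later runs that point user-data-dir at the same fully qualified path should reuse that session instead of hitting the CAPTCHA again.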

How to use selenium for webscraping google flights?

I'm trying to pull the airline names and prices of a specific flight. I'm having trouble with the XPath and/or using the right HTML tags, because when I run the code below, all I get back is 14 empty lists.
from selenium import webdriver
from lxml import html
from time import sleep
driver = webdriver.Chrome(r"C:\Users\14074\Python\chromedriver")
URL = 'https://www.google.com/travel/flights/search?tfs=CBwQAhopagwIAxIIL20vMHBseTASCjIwMjEtMTItMjNyDQgDEgkvbS8wMWYwOHIaKWoNCAMSCS9tLzAxZjA4chIKMjAyMS0xMi0yN3IMCAMSCC9tLzBwbHkwcAGCAQsI____________AUABSAGYAQE&tfu=EgYIAhAAGAA'
driver.get(URL)
sleep(1)
tree = html.fromstring(driver.page_source)
for flight_tree in tree.xpath('//div[@class="TQqf0e sSHqwe tPgKwe ogfYpf"]'):
    title = flight_tree.xpath('.//*[@id="yDmH0d"]/c-wiz[2]/div/div[2]/div/c-wiz/div/c-wiz/div[2]/div[2]/div/div[2]/div[6]/div/div[2]/div/div[1]/div/div[1]/div/div[2]/div[2]/div[2]/span/text()')
    price = flight_tree.xpath('.//span[contains(@data-gs, "CjR")]')
    print(title, price)
#driver.close()
This is just the first part of my code, but I can't really continue without getting this to work. If anyone has some ideas on what I'm doing wrong, that would be amazing! It's been driving me crazy. Thank you!
I noticed a few issues with your code. First of all, I believe that when you enter this page, Google will first show you the "I agree to terms and conditions" popup before showing the content, so you need to click that button first.
Also, you should use the find_elements_by_xpath function directly on driver instead of on the parsed page content, as this lets you query the JavaScript-rendered content directly. You can find more info here: python tree.xpath return empty list
To get more info on how to scrape using Selenium and Python you could check out this guide: https://www.webscrapingapi.com/python-selenium-web-scraper/
I used the following code to scrape the titles. (I also changed the XPaths, extracting them directly from Google Chrome: right-click an element -> Inspect, then in the Elements tab right-click the node -> Copy -> Copy XPath.)
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
# I used these to make the code work on my Windows Subsystem for Linux setup
option = webdriver.ChromeOptions()
option.add_argument('--no-sandbox')
option.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(ChromeDriverManager().install(), options=option)
URL = 'https://www.google.com/travel/flights/search?tfs=CBwQAhopagwIAxIIL20vMHBseTASCjIwMjEtMTItMjNyDQgDEgkvbS8wMWYwOHIaKWoNCAMSCS9tLzAxZjA4chIKMjAyMS0xMi0yN3IMCAMSCC9tLzBwbHkwcAGCAQsI____________AUABSAGYAQE&tfu=EgYIAhAAGAA'
driver.get(URL)
driver.find_element_by_xpath('//*[@id="yDmH0d"]/c-wiz/div/div/div/div[2]/div[1]/div[4]/form/div[1]/div/button/span').click() # this is necessary to press the "I agree" button
elements = driver.find_elements_by_xpath('//*[@id="yDmH0d"]/c-wiz[2]/div/div[2]/div/c-wiz/div/c-wiz/div[2]/div[3]/div[3]/c-wiz/div/div[2]/div[1]/div/div/ol/li')
for flight_tree in elements:
    title = flight_tree.find_element_by_xpath('.//*[@class="W6bZuc YMlIz"]').text
    print(title)
I tried the code below, with the screen maximized and explicit waits, and could successfully extract the information. Please see below:
Sample code:
driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.get("https://www.google.com/travel/flights/searchtfs=CBwQAhopagwIAxIIL20vMHBseTASCjIwMjEtMTItMjNyDQgDEgkvbS8wMWYwOHIaKWoNCAMSCS9tLzAxZjA4chIKMjAyMS0xMi0yN3IMCAMSCC9tLzBwbHkwcAGCAQsI____________AUABSAGYAQE&tfu=EgYIAhAAGAA")
wait = WebDriverWait(driver, 10)
titles = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div/descendant::h3")))
for name in titles:
    print(name.text)
    price = name.find_element(By.XPATH, "./../following-sibling::div/descendant::span[2]").text
    print(price)
Imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Output:
Tokyo
₹38,473
Mumbai
₹3,515
Dubai
₹15,846

How to use 'not start-with' attribute of xpath in selenium to skip certain websites in python

I've written code which asks the user for input and opens duckduckgo to search for the website related to that input value. In the search results, I want to open the website which doesn't start with the website mentioned by me in //a[not(starts-with(@href, 'website'))]. This is my code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import time
import pyautogui
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
stuff = input()
options = webdriver.ChromeOptions()
options.headless = True
browser = webdriver.Chrome()
browser.implicitly_wait(30)
browser.maximize_window()
browser.get("http://www.duckduckgo.com")
elem = browser.find_element_by_name("q")
elem.clear()
elem.send_keys(stuff)
elem.submit()
matched_elements = browser.find_elements_by_xpath('//a[not(starts-with(@href, "https://it.wikipedia.org/"))]' or '//a[not(starts-with(@href, "https://www.facebook.com"))]')
if matched_elements:
    matched_elements[0].click()
Suppose the user has entered this input: Regina Pacis, Reggio nell'Emilia, 42124, and the search results are these:
I want the code to skip over the Wikipedia and Facebook search results and click on the link highlighted in red. But instead of that, the code goes back to duckduckgo.
I know I can easily achieve the result by:
match_elements = browser.find_elements_by_class_name('result__url__domain')
match_elements[2].click()
But the search results are dynamic and will change as per the user input. I'd really appreciate it if you guys could help me.
One problem is that 'xpath1' or 'xpath2' is evaluated by Python before Selenium ever sees it: or between two non-empty strings returns the first one, so your Facebook condition is silently dropped. You can instead filter in Python:
dom_bl = ["wikipedia.org", "facebook.com"]
match_elements = browser.find_elements_by_class_name('result__url')
for link in match_elements:
    url = link.get_attribute("href")
    if any(dom in url for dom in dom_bl):
        continue
    link.click()
    break
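If you'd rather keep the whole filter in XPath, the two conditions can live in one expression as stacked predicates, which act as a logical and. A minimal sketch, reusing the result__url class from the question (whether it still matches DuckDuckGo's current markup is an assumption):
matched_elements = browser.find_elements_by_xpath(
    '//a[contains(@class, "result__url")]'
    '[not(starts-with(@href, "https://it.wikipedia.org/"))]'
    '[not(starts-with(@href, "https://www.facebook.com"))]'
)
if matched_elements:
    matched_elements[0].click()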

Python- Getting link of new webpage using selenium

I'm new to Selenium. I wrote this code that gets user input and searches on eBay, but I want to save the new link of the search so I can pass it on to BeautifulSoup.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
search_ = input()
browser = webdriver.Chrome(r'C:\Users\Leila\Downloads\chromedriver_win32')
browser.get("https://www.ebay.com.au/sch/i.html?_from=R40&_trksid=p2499334.m570.l1311.R1.TR12.TRC2.A0.H0.Xphones.TRS0&_nkw=phones&_sacat=0")
Search = browser.find_element_by_id('kw')
Search.send_keys(search_)
Search.send_keys(Keys.ENTER)
#how do you write a code that gets the link of the new page it loads
To extract a link from a webpage, you need to make use of the href attribute and the get_attribute() method.
This example illustrates how it would work:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
driver = webdriver.Chrome(options=options)
driver.get('https://www.w3.org/')
for a in driver.find_elements_by_xpath('.//a'):
    print(a.get_attribute('href'))
In your case, though, the search box is an input element and has no href attribute, so calling get_attribute('href') on it returns nothing useful. To save the link of the results page that loads after you submit the search, read it from the browser itself:
page_link = browser.current_url
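From there, a minimal sketch of handing the result over to BeautifulSoup (assuming bs4 is installed; parsing browser.page_source avoids downloading the page a second time):
from bs4 import BeautifulSoup

# Parse the DOM Selenium has already loaded instead of re-requesting page_link.
soup = BeautifulSoup(browser.page_source, 'html.parser')
print(soup.title.text)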

How can I use Selenium (Python) to do a Google Search and then open the results of the first page in new tabs?

As the title says, I'd like to perform a Google Search using Selenium and then open all results of the first page in separate tabs.
Please have a look at the code; I can't get any further (it's just my 3rd day learning Python).
Thank you for your help!!
Code:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import pyautogui
query = 'New Search Query'
browser = webdriver.Chrome('/Users/MYUSERNAME/Desktop/Desktop-Files/Chromedriver/chromedriver')
browser.get('http://www.google.com')
search = browser.find_element_by_name('q')
search.send_keys(query)
search.send_keys(Keys.RETURN)
element = browser.find_element_by_class_name('LC20lb')
element.click()
The reason I imported pyautogui is that I tried simulating a right click and then "open in new tab" for each result, but it was a little confusing :)
Forget about pyautogui; what you want to do can be done in Selenium alone. The same goes for most of your other imports: you just don't need them. See if this code meets your needs.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
query = 'sins of a solar empire' #my query about a video game
browser = webdriver.Chrome()
browser.get('http://www.google.com')
search = browser.find_element_by_name('q')
search.send_keys(query)
search.send_keys(Keys.RETURN)
links = browser.find_elements_by_class_name('r') #I went on Google Search and found the container class for the link
for link in links:
    url = link.find_element_by_tag_name('a').get_attribute("href") # this code extracts the url of the HTML link
    browser.execute_script('''window.open("{}","_blank");'''.format(url)) # this code uses JavaScript to open a new tab and open the given url in that new tab
    print(link.find_element_by_tag_name('a').get_attribute("href"))
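If you then want to interact with the tabs you opened, Selenium tracks one handle per tab in browser.window_handles; a minimal sketch of visiting each in turn (that the handles come back in opening order is an assumption that usually, but not always, holds in Chrome):
for handle in browser.window_handles:
    browser.switch_to.window(handle) # give focus to this tab
    print(browser.title)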
