I am trying to build a project with Selenium that goes to Zillow, finds homes for rent, and returns their details, i.e. the rental link, price, and address.
This is my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(executable_path=CHROME_DRIVER_PATH)
driver.get(ZILLOW_HOUSES_URL)
house_links = driver.find_elements(By.CSS_SELECTOR, LINKS_CSS_SELECTOR)
prices = driver.find_elements(By.CSS_SELECTOR, PRICES_CSS_SELECTOR)
addresses = driver.find_elements(By.CSS_SELECTOR, ADDRESSES_CSS_SELECTOR)
for link in house_links:
    print(link.get_attribute('href'))

for price in prices:
    print(price.text.split('+')[0].split(', ')[0].split('/')[0])

for address in addresses:
    print(address.text)
Mostly when I run it, it does go to the Zillow webpage, but a PerimeterX (PX) "press and hold" CAPTCHA comes up. I press and hold, but it reappears saying "Try Again", and no matter how many times I retry it never stops. How do I get rid of this?
You need to make sure cookies can be saved. Pointing Chrome at a persistent user-data directory got me past the CAPTCHA. It has to be a fully qualified path or Chrome complains.
import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Must be a fully qualified path; Chrome rejects a relative user-data-dir.
sel_path = os.path.join(os.getcwd(), 'selenium')
chrome_options = Options()
chrome_options.add_argument("user-data-dir=" + sel_path)
driver = webdriver.Chrome(options=chrome_options)
driver.get(zillow_path)
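For context, here is a minimal end-to-end sketch, assuming Selenium 4, that ties the persistent profile back into the original scraping loop; the URL and selector are placeholders standing in for the question's constants, not verified Zillow selectors:

import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

ZILLOW_HOUSES_URL = 'https://www.zillow.com/homes/for_rent/'  # placeholder URL
LINKS_CSS_SELECTOR = 'a.property-card-link'                   # hypothetical selector

options = Options()
# Reuse the same absolute profile directory on every run so cookies persist.
options.add_argument('user-data-dir=' + os.path.join(os.getcwd(), 'selenium'))
driver = webdriver.Chrome(options=options)

driver.get(ZILLOW_HOUSES_URL)
for link in driver.find_elements(By.CSS_SELECTOR, LINKS_CSS_SELECTOR):
    print(link.get_attribute('href'))
driver.quit()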
I'm creating a bot to download a PDF from a website. I use Selenium to open Google Chrome and I can open the website window, and I select the XPath of the first item in the grid, but the click that should download the PDF never happens. I believe I'm getting the wrong XPath.
I leave the site I'm accessing and my code below. Could you tell me what I am doing wrong? Am I getting the correct XPath? Thank you very much in advance.
The site is an open government data site from my country, Brazil, so for those trying to access it from outside, the IP may be blocked, but the page is https://www.tce.ce.gov.br/cidadao/diario-oficial-eletronico (the same URL used in the code below).
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service

service = Service(ChromeDriverManager().install())
navegador = webdriver.Chrome(service=service)
try:
    navegador.get("https://www.tce.ce.gov.br/cidadao/diario-oficial-eletronico")
    time.sleep(2)
    elem = navegador.find_element(By.XPATH, '//*[@id="formUltimasEdicoes:consultaAvancadaDataTable:0:j_idt101"]/input[1]')
    elem.click()
    time.sleep(2)
finally:
    # quit() closes all windows and ends the session; a separate close() is redundant.
    navegador.quit()
I think you'll need this PDF, right?:
<a class="maximenuck " href="https://www.tce.ce.gov.br/downloads/Jurisdicionado/CALENDARIO_DAS_OBRIGACOES_ESTADUAIS_2020_N.pdf" target="_blank"><span class="titreck">Estaduais</span></a>
You'll need to locate that element by XPath, read its "href" attribute value, and then download the PDF with requests.get(your_href_url).
The XPath in your source code is //*[@id="menu-principal"]/div[2]/ul/li[5]/div/div[2]/div/div[1]/ul/li[14]/div/div[2]/div/div[1]/ul/li[3]/a but that might not always be the same.
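A minimal sketch of that approach, assuming requests is installed; since the positional XPath above is fragile, this matches the anchor on a stable piece of its href instead:

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

navegador = webdriver.Chrome()
navegador.get("https://www.tce.ce.gov.br/cidadao/diario-oficial-eletronico")

# Find the link by the PDF it points to rather than by position in the menus.
elem = navegador.find_element(By.XPATH, '//a[contains(@href, "CALENDARIO_DAS_OBRIGACOES_ESTADUAIS")]')
pdf_url = elem.get_attribute('href')
navegador.quit()

# Download with requests instead of clicking in the browser.
resposta = requests.get(pdf_url)
with open('calendario.pdf', 'wb') as f:
    f.write(resposta.content)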
I was going to use Selenium to crawl the web
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome('./chromedriver', options=options)
driver.get('https://steamdb.info/tag/1742/?all')
driver.implicitly_wait(3)
li = []
games = driver.find_elements_by_xpath('//*[@class="table-products.text-center.dataTable"]')
for i in games:
    time.sleep(5)
    li.append(i.get_attribute("href"))
print(li)
After accessing the Steam URL I was looking for, I tried to find something called an appid.
The picture below is the HTML I'm looking for.
I'm trying to find the number next to "data-appid=".
But when I run my code, nothing is saved in games.
Correct me if I'm wrong, but from what I can see this Steam page requires you to log in. Are you sure that when the webdriver opens the page, that same data is available to you?
Additionally, when using By, the correct syntax passes the selector as a second argument, e.g. games = driver.find_elements(By.CSS_SELECTOR, '.table-products.text-center.dataTable'). Note that your dotted string is a CSS class selector, not an XPath; in the HTML the attribute is written class="table-products text-center dataTable".
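A short sketch of that correction, assuming those really are the table's classes and that each row carries the appid in a data-appid attribute, as the screenshot suggests:

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('https://steamdb.info/tag/1742/?all')
driver.implicitly_wait(3)

# Select rows of the products table that expose a data-appid attribute.
rows = driver.find_elements(By.CSS_SELECTOR, '.table-products.text-center.dataTable tr[data-appid]')
appids = [row.get_attribute('data-appid') for row in rows]
print(appids)

driver.quit()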
I'm trying to pull the airline names and prices of a specific flight. I'm having trouble with the XPath and/or using the right HTML tags, because when I run the code below, all I get back is 14 empty lists.
from selenium import webdriver
from lxml import html
from time import sleep

driver = webdriver.Chrome(r"C:\Users\14074\Python\chromedriver")
URL = 'https://www.google.com/travel/flights/search?tfs=CBwQAhopagwIAxIIL20vMHBseTASCjIwMjEtMTItMjNyDQgDEgkvbS8wMWYwOHIaKWoNCAMSCS9tLzAxZjA4chIKMjAyMS0xMi0yN3IMCAMSCC9tLzBwbHkwcAGCAQsI____________AUABSAGYAQE&tfu=EgYIAhAAGAA'
driver.get(URL)
sleep(1)
tree = html.fromstring(driver.page_source)
for flight_tree in tree.xpath('//div[@class="TQqf0e sSHqwe tPgKwe ogfYpf"]'):
    title = flight_tree.xpath('.//*[@id="yDmH0d"]/c-wiz[2]/div/div[2]/div/c-wiz/div/c-wiz/div[2]/div[2]/div/div[2]/div[6]/div/div[2]/div/div[1]/div/div[1]/div/div[2]/div[2]/div[2]/span/text()')
    price = flight_tree.xpath('.//span[contains(@data-gs, "CjR")]')
    print(title, price)
#driver.close()
This is just the first part of my code, but I can't really continue without getting this to work. If anyone has any ideas about what I'm doing wrong, that would be amazing! It's been driving me crazy. Thank you!
I noticed a few issues with your code. First of all, I believe that when entering this page, Google will first show you the "I agree to terms and conditions" popup before showing you the content of the page, so you need to click on that button first.
Also, you should use the find_elements_by_xpath function directly on the driver instead of on a parsed copy of the page source, as this lets you pick up the JavaScript-rendered content. You can find more info here: python tree.xpath return empty list
To get more info on how to scrape using Selenium and Python, you could check out this guide: https://www.webscrapingapi.com/python-selenium-web-scraper/
I used the following code to scrape the titles. (I also changed the XPaths, extracting them directly from Google Chrome. You can do that by right-clicking on an element -> Inspect, then in the Elements tab right-clicking on the element -> Copy -> Copy XPath.)
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# I used these for the code to work on my Windows Subsystem for Linux
option = webdriver.ChromeOptions()
option.add_argument('--no-sandbox')
option.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(ChromeDriverManager().install(), options=option)

URL = 'https://www.google.com/travel/flights/search?tfs=CBwQAhopagwIAxIIL20vMHBseTASCjIwMjEtMTItMjNyDQgDEgkvbS8wMWYwOHIaKWoNCAMSCS9tLzAxZjA4chIKMjAyMS0xMi0yN3IMCAMSCC9tLzBwbHkwcAGCAQsI____________AUABSAGYAQE&tfu=EgYIAhAAGAA'
driver.get(URL)
driver.find_element_by_xpath('//*[@id="yDmH0d"]/c-wiz/div/div/div/div[2]/div[1]/div[4]/form/div[1]/div/button/span').click()  # press the "I agree" button
elements = driver.find_elements_by_xpath('//*[@id="yDmH0d"]/c-wiz[2]/div/div[2]/div/c-wiz/div/c-wiz/div[2]/div[3]/div[3]/c-wiz/div/div[2]/div[1]/div/div/ol/li')
for flight_tree in elements:
    title = flight_tree.find_element_by_xpath('.//*[@class="W6bZuc YMlIz"]').text
    print(title)
I tried the code below, with the screen maximized and explicit waits, and could successfully extract the information; please see below.
Sample code:
driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.get("https://www.google.com/travel/flights/search?tfs=CBwQAhopagwIAxIIL20vMHBseTASCjIwMjEtMTItMjNyDQgDEgkvbS8wMWYwOHIaKWoNCAMSCS9tLzAxZjA4chIKMjAyMS0xMi0yN3IMCAMSCC9tLzBwbHkwcAGCAQsI____________AUABSAGYAQE&tfu=EgYIAhAAGAA")
wait = WebDriverWait(driver, 10)
titles = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div/descendant::h3")))
for name in titles:
    print(name.text)
    price = name.find_element(By.XPATH, "./../following-sibling::div/descendant::span[2]").text
    print(price)
Imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Output:
Tokyo
₹38,473
Mumbai
₹3,515
Dubai
₹15,846
I'm new to Selenium, and I wrote this code that takes user input and searches eBay, but I want to save the URL of the results page it loads so I can pass it on to BeautifulSoup.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
search_ = input()
browser = webdriver.Chrome(r'C:\Users\Leila\Downloads\chromedriver_win32')
browser.get("https://www.ebay.com.au/sch/i.html?_from=R40&_trksid=p2499334.m570.l1311.R1.TR12.TRC2.A0.H0.Xphones.TRS0&_nkw=phones&_sacat=0")
Search = browser.find_element_by_id('kw')
Search.send_keys(search_)
Search.send_keys(Keys.ENTER)
#how do you write a code that gets the link of the new page it loads
To extract a link from an element on a webpage, you make use of its href attribute via the get_attribute() method.
This example from here illustrates how that works:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
driver = webdriver.Chrome(options=options)
driver.get('https://www.w3.org/')
for a in driver.find_elements_by_xpath('.//a'):
    print(a.get_attribute('href'))
In your case, though, the link you want is the URL of the results page that loads after the search, and an input element has no href to read. Once the new page has loaded, take the URL straight from the browser:
page_link = browser.current_url
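A short sketch of the hand-off to BeautifulSoup, assuming bs4 is installed and that the search box id 'kw' from your code is correct (the sleep is a crude wait for illustration; an explicit wait on a results element would be more robust):

from time import sleep
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

browser = webdriver.Chrome()
browser.get("https://www.ebay.com.au/sch/i.html?_nkw=phones")
search_box = browser.find_element_by_id('kw')  # id taken from your code, assumed correct
search_box.send_keys('iphone')
search_box.send_keys(Keys.ENTER)
sleep(2)  # crude wait for the results page to finish loading

page_link = browser.current_url  # URL of the new results page
soup = BeautifulSoup(browser.page_source, 'html.parser')
print(page_link)
print(soup.title.string)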
I am trying to scrape information from this website, Singapore Streetdirectory; it's sort of like Google Maps but localized to Singapore. What I need to do is key a postal code into the search bar, click Search, and scrape the street address. I have more than a thousand postal codes to process.
The annoying thing is that whenever I launch this webpage, an in-browser ad popup appears. I tried to close it with Selenium using the following code, and it doesn't work:
close_ad = driver.find_element_by_id('btn_close')
close_ad.click()
I wanted to close it by XPath and looked for its href, but the href appears to be some JavaScript code, and that also doesn't seem to work.
I'd like to ask if there is any solution or workaround for this. The ultimate plan is to run this code in headless browser mode. Below is my full code for your reference, if necessary.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
options = Options()
options.headless = True
driver = webdriver.Chrome(executable_path='/Applications/chromedriver') # , options=options
driver.get('https://www.streetdirectory.com/')
driver.set_window_position(0,23)
print(driver.title)
time.sleep(3)
close_ad = driver.find_element_by_id('a.btn_close')
close_ad.click()
Thanks for the help.
You can also run headless and hand the rendered page source to BeautifulSoup, as follows:
from bs4 import BeautifulSoup

driver.get("https://www.streetdirectory.com/")
html_content = BeautifulSoup(driver.page_source, 'lxml')
print(html_content)
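If the popup still needs to be dismissed in a headed session, here is a hedged sketch using an explicit wait; it assumes the close button's id really is btn_close, which would need to be confirmed in DevTools:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.streetdirectory.com/')

# Wait until the ad's close button is clickable instead of sleeping a fixed time.
wait = WebDriverWait(driver, 10)
close_ad = wait.until(EC.element_to_be_clickable((By.ID, 'btn_close')))
close_ad.click()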