Scrape Zillow Lender Profiles - Python

I am trying to scrape desired information from Zillow lender profiles on this website: https://www.zillow.com/lender-directory/?sort=Relevance&location=Alabama%20Shores%20Muscle%20Shoals%20AL&language=English&page=1
I know how to scrape the info with Beautiful Soup... I'm just trying to create a list of clickable links for each profile so I can iterate to each one, scrape the desired info (I can do this), and then go back to the starting page and move on to the next profile link. Probably a simple solution, but I've been trying to get a list of darn clickable links for a couple of hours now and I think it's time to ask, lol.
Thanks.
I've tried a number of different approaches to get the list of clickable links but may have implemented them incorrectly, so I'm open to suggestions to double-check.
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
import time

# Driver to get website...need to get PhantomJS going..
driver = webdriver.Chrome(r'C:\Users\mfoytlin\Desktop\chromedriver.exe')
driver.get('https://www.zillow.com/lender-directory/?sort=Relevance&location=Alabama%20Shores%20Muscle%20Shoals%20AL&language=English&page=1')
time.sleep(2)

# Get page HTML data
soup = BeautifulSoup(driver.page_source, 'html.parser')

profile_links = []
profile_links = driver.find_elements_by_xpath("//div[@class='zsg-content-item']//a")
for profile in range(len(profile_links)):
    profile_links = driver.find_elements_by_xpath("//div[@class='zsg-content-item']//a")
    profile_links[profile].click()
    time.sleep(2)
    driver.back()
    time.sleep(2)

The find_elements parameter is wrong here; you can try either of the following. This is the library code that runs when you use find_elements():
def find_elements(self, by=By.ID, value=None):
    """
    Find elements given a By strategy and locator. Prefer the find_elements_by_* methods when
    possible.

    :Usage:
        elements = driver.find_elements(By.CLASS_NAME, 'foo')

    :rtype: list of WebElement
    """
    if self.w3c:
        if by == By.ID:
            by = By.CSS_SELECTOR
            value = '[id="%s"]' % value
        elif by == By.TAG_NAME:
            by = By.CSS_SELECTOR
        elif by == By.CLASS_NAME:
            by = By.CSS_SELECTOR
            value = ".%s" % value
        elif by == By.NAME:
            by = By.CSS_SELECTOR
            value = '[name="%s"]' % value

    # Return empty list if driver returns null
    # See https://github.com/SeleniumHQ/selenium/issues/4555
    return self.execute(Command.FIND_ELEMENTS, {
        'using': by,
        'value': value})['value'] or []
Try either of the following options:
profile_links = driver.find_elements_by_xpath("//div[@class='zsg-content-item']//a")
OR
profile_links = driver.find_elements(By.XPATH, "//div[@class='zsg-content-item']//a")
Here is the list you get when you use the above code.
['https://www.zillow.comhttps://www.zillow.com/lender-profile/courtneyhall17/', 'https://www.zillow.comhttps://www.zillow.com/lender-profile/SouthPointBank/', 'https://www.zillow.comhttps://www.zillow.com/lender-profile/kmcdaniel77/', 'https://www.zillow.comhttps://www.zillow.com/lender-profile/jdowney75/', 'https://www.zillow.comhttps://www.zillow.com/lender-profile/fredabutler/', 'https://www.zillow.comhttps://www.zillow.com/lender-profile/justindorroh/', 'https://www.zillow.comhttps://www.zillow.com/lender-profile/aball731/', 'https://www.zillow.comhttps://www.zillow.com/lender-profile/1stfedmort/', 'https://www.zillow.comhttps://www.zillow.com/lender-profile/tstutts/', 'https://www.zillow.comhttps://www.zillow.com/lender-profile/sbeckett0/', 'https://www.zillow.comhttps://www.zillow.com/lender-profile/DebiBretherick/', 'https://www.zillow.comhttps://www.zillow.com/lender-profile/cking313/', 'https://www.zillow.comhttps://www.zillow.com/lender-profile/Gregory%20Angus/', 'https://www.zillow.comhttps://www.zillow.com/lender-profile/cbsbankmarketing/', 'https://www.zillow.comhttps://www.zillow.com/lender-profile/ajones392/', 'https://www.zillow.comhttps://www.zillow.com/lender-profile/sschulte6/', 'https://www.zillow.comhttps://www.zillow.com/lender-profile/dreamhomemortgagellc/', 'https://www.zillow.comhttps://www.zillow.com/lender-profile/DarleenBrooksHill/', 'https://www.zillow.comhttps://www.zillow.com/lender-profile/sjones966/', 'https://www.zillow.comhttps://www.zillow.com/lender-profile/BlakeRobbins4/', 'https://www.zillow.comhttps://www.zillow.com/lender-profile/zajones5746/', 'https://www.zillow.comhttps://www.zillow.com/lender-profile/adeline%20perkins/']
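Note the doubled "https://www.zillow.com" prefix in that output: get_attribute('href') already returns an absolute URL, so prepending the base duplicates it. As a sketch, you could collect the hrefs once and visit each with driver.get() instead of clicking and going back, which also sidesteps stale-element problems:

from selenium import webdriver
import time

driver = webdriver.Chrome(r'C:\Users\mfoytlin\Desktop\chromedriver.exe')
driver.get('https://www.zillow.com/lender-directory/?sort=Relevance&location=Alabama%20Shores%20Muscle%20Shoals%20AL&language=English&page=1')
time.sleep(2)

# hrefs come back absolute, so no base prefix is needed
profile_links = [a.get_attribute('href')
                 for a in driver.find_elements_by_xpath("//div[@class='zsg-content-item']//a")]
for link in profile_links:
    driver.get(link)   # navigate directly instead of click()/back()
    time.sleep(2)      # scrape the desired info here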
Edit:
As I said, you need to re-assign the elements again on each iteration.
profile_links = driver.find_elements_by_xpath("//div[@class='ld-lender-info-column']//h2//a")
for profile in range(len(profile_links)):
    profile_links = driver.find_elements_by_xpath("//div[@class='ld-lender-info-column']//h2//a")
    driver.execute_script("arguments[0].click();", profile_links[profile])
    time.sleep(2)
    driver.back()
    time.sleep(2)

You can find all the clickable links using this approach. This is written in Java; you can write the equivalent in Python (a sketch follows the snippet).
List<WebElement> links = driver.findElements(By.xpath("//div[@class='zsg-content-item']//a"));
ArrayList<String> capturedLinks = new ArrayList<>();
for (WebElement link : links) {
    String myLink = "https://www.zillow.com" + link.getAttribute("href");
    if (!capturedLinks.contains(myLink)) { // to avoid duplicates
        capturedLinks.add(myLink);
    }
}
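A rough Python equivalent might look like this (dropping the base-URL concatenation, since get_attribute('href') in Python's Selenium already returns an absolute link):

links = driver.find_elements_by_xpath("//div[@class='zsg-content-item']//a")
captured_links = []
for link in links:
    my_link = link.get_attribute('href')
    if my_link not in captured_links:  # to avoid duplicates
        captured_links.append(my_link)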

I suppose the following script might do what you wanted. In short, it parses the profile links from the landing page and then iterates through those links to scrape the name from each target page.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://www.zillow.com/lender-directory/?sort=Relevance&location=Alabama%20Shores%20Muscle%20Shoals%20AL&language=English&page=1'

with webdriver.Chrome() as driver:
    wait = WebDriverWait(driver, 10)
    driver.get(url)
    items = [item.get_attribute("href") for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "h2 > a[href^='/lender-profile/']")))]
    for profilelink in items:
        driver.get(profilelink)
        name = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h1.lender-name"))).text
        print(name)

Related

Get information from webpage with the same class names (Python Selenium)

I have a simple question that can presumably be solved very easily. However, I have now spent quite some time trying to extract four lines of information from a page.
I first try to access the <ul _ngcontent-xl-byg-c79="" class="short ng-star-inserted"> item and then loop over the <li _ngcontent-xl-byg-c79="" class="table-row ng-star-inserted"> items in order to store the embedded information in my dataframe (columns are 'Mærke', 'Produkttype', 'Serie', and 'Model').
What am I doing wrong? My problem is that the four lines have the same class name, which gives me the same output in all four loops.
This is my code:
from selenium import webdriver
import pandas as pd

# Activate web browser: External control
browser = webdriver.Chrome(r'C:\Users\KristerJens\Downloads\chromedriver_win32\chromedriver')

# Get webpage
browser.get("https://www.xl-byg.dk/shop/knauf-insulation-ecobatt-murfilt-190-mm-2255993")

# Get information
brand = []
product = []
series = []
model = []
for i in browser.find_elements_by_xpath("//ul[@class='short ng-star-inserted']/li"):
    for p in i.find_elements_by_xpath("//span[@class='attribute-name']"):
        brand.append(i.find_elements_by_class_name('?').text)
        product.append(i.find_elements_by_class_name('?').text)
        series.append(i.find_elements_by_class_name('?').text)
        model.append(i.find_elements_by_class_name('?').text)

df = pd.DataFrame()
df['brand'] = brand
df['product'] = product
df['series'] = series
df['model'] = model
Any help is very much appreciated!!
Try like below and confirm:
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(executable_path="path to chromedriver.exe")
driver.maximize_window()
driver.implicitly_wait(10)
driver.get("https://www.xl-byg.dk/shop/knauf-insulation-ecobatt-murfilt-190-mm-2255993")
wait = WebDriverWait(driver, 30)

# Cookie pop-up
wait.until(EC.element_to_be_clickable((By.XPATH, "//button[@aria-label='Accept all' or @aria-label='Accepter alle']"))).click()

options = driver.find_elements_by_xpath("//div[@class='row-column']//ul[contains(@class,'short')]/li")
for opt in options:
    attribute = opt.find_element_by_xpath("./span[@class='attribute-name']").text  # use a "." in the XPath to find an element within an element
    value = opt.find_element_by_xpath("./*[contains(@class,'ng-star-inserted')]").text
    print(f"{attribute} : {value}")
Mærke : Knauf Insulation
Produkttype : Murfilt
Serie : ECOBATT
Materiale : Glasmineraluld
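If you then want those values in a DataFrame, as the original code intended, one sketch (reusing the options loop above) collects the attribute/value pairs into a dict first:

import pandas as pd

data = {}
for opt in options:
    attribute = opt.find_element_by_xpath("./span[@class='attribute-name']").text
    value = opt.find_element_by_xpath("./*[contains(@class,'ng-star-inserted')]").text
    data[attribute] = [value]  # one-row DataFrame; the keys become the column names

df = pd.DataFrame(data)  # columns: Mærke, Produkttype, Serie, Materiale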

ElementClickInterceptedException: Message: element click intercepted - how to locate and switch to iframe in a for loop

I'm trying to loop through 162 links of country rankings on a JavaScript page and click in and out of each country. The first 13 or so country links work, but once I get to around Belgium (give or take), I'm hit with ElementClickInterceptedException: Message: element click intercepted: Element <iframe class="js-lazyload loaded" data-src="https://assets.weforum.org/static/reports/gender-gap-report-2021/v8/index.html".... Earlier in the script I handled an iframe, but I'm not sure what I need to know and do in order to find and switch to an iframe in this loop should one arise. Here's my code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import re

driver = webdriver.Chrome(executable_path="C:/work/chromedriver.exe")

def load_list_page(returnlist=False):
    '''
    This function takes you to the list view of country rankings for the Gender Gap Index.
    There's a default option to return all country rankings on the page as a list.
    '''
    driver.get("https://www.weforum.org/reports/global-gender-gap-report-2021/in-full/economy-profiles#economy-profiles")
    # bring up list view of countries
    wait = WebDriverWait(driver, 10)
    wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, "iFrameResizer0")))
    wait.until(EC.element_to_be_clickable((By.XPATH, "//*[name()='svg' and @class='sc-gzOgki ftxBlu']"))).click()
    if returnlist:
        list_of_countries = driver.find_elements_by_xpath("//div[@id='root']//a[@class='sc-chPdSV lcYVwX sc-kAzzGY kraFwA']/div[1]/div")
        return list_of_countries

# Collect country names
countries = load_list_page(returnlist=True)
country_names_raw = [i.text for i in countries]
# get all non-empty strings
country_names = [i for i in country_names_raw if len(i) > 0]
# extract just the country name using regex
country_names = [re.match(r'\d{,3}. ([\w\s,\'\.]+).*\n', i).group(1) for i in country_names]

# Record the index for the country names that had non-empty strings. These indexes reference WebElements that
# will link to the country profile page. Use these indices to grab the webelements that link to country profiles.
# NOTE: I had to add 1 to each index since it seems the link is in the webelement immediately after the webelement with the country text
link_index = [i + 1 for i, j in enumerate(country_names_raw) if len(j) > 0]

# Loop through and click country rankings
for index, link in enumerate(link_index[:14]):
    try:
        countries = load_list_page(returnlist=True)
        countries[link].click()
    except Exception as e:
        print(f"{e}")
        print(f"Error for {country_names[index]}, link index: {link}")

for index, link in enumerate(link_index[:14]):
    try:
        countries = load_list_page(returnlist=True)
        #countries[link].click()
        driver.execute_script("arguments[0].click();", countries[link])
    except Exception as e:
        print(f"{e}")
        print(f"Error for {country_names[index]}, link index: {link}")
You can invoke click directly on the element to bypass the overlapping element. When iterating through the list of countries,
for index, link in enumerate(link_index[:14]):
scroll each element into view before clicking on it.
A Java method performing this looks like:
public void scrollElementIntoView(WebElement element) {
    waitForElementToBeVisible(element);
    ((JavascriptExecutor) driver).executeScript("arguments[0].scrollIntoView(true);", element);
    wait(300);
}
The Python method will be similar.
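For example, a minimal Python sketch of the same idea (assuming driver is already in scope and time is imported):

def scroll_element_into_view(element):
    # scroll the element into view so the lazy-loaded iframe no longer covers it
    driver.execute_script("arguments[0].scrollIntoView(true);", element)
    time.sleep(0.3)  # brief pause so the scroll settles before clicking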

Is there a way to wait for an <a> tag element that could be one of two based on their href value in Selenium?

So I'm not sure if this is practically valid, but I was wondering if there's a way in Selenium to wait for one <a> tag out of two, based on its href value or the text contained after the tag closes.
What I'm trying to do is power up this page https://www.coingecko.com/en/exchanges, iterate through the exchange links, visit each one of them, then click on the about tab of each of those newly opened pages, as they contain the info to be extracted. The code actually worked up until halfway through, when it failed to identify the tab properly, throwing a StaleElementException and ElementNotFound, as I did it through driver.find_element_by_text.
The problem is that the 'about' tab changes position from one page to the other, so it's either //ul[@role='tablist']/li[3] or li[2], and that's why I'm trying to wait for and click on the right element based on its href value, since the href value of one of the <a> tags on the page contains the text #about ---> //ul[@role='tablist']/li[3]/a.
Apologies if it wasn't straightforward, but I was trying to pinpoint what the issue was until recently :)
This is the code that I've attempted so far; I'd be grateful if anyone can point me in the right direction.
from selenium.webdriver import Chrome
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time
from selenium.common.exceptions import NoSuchElementException, ElementNotVisibleException

webdriver = '/Users/karimnabil/projects/selenium_js/chromedriver-1'
driver = Chrome(webdriver)
num_of_pages = 4
exchanges_list = []
names_list = []
websites_list = []
emails_list = []
years_list = []
countries_list = []
twitter_list = []

for i in range(num_of_pages):
    url = 'https://www.coingecko.com/en/exchanges?page=' + str(i+1)
    driver.get(url)
    links = driver.find_elements_by_xpath("//tbody[@data-target='exchanges-list.tableRows']/tr/td[2]/div/span[2]/a")
    links = [url.get_attribute('href') for url in links]
    time.sleep(0.5)
    for link in links:
        driver.get(link)
        wait = WebDriverWait(driver, 2)
        wait.until(EC.text_to_be_present_in_element_value((By.XPATH, "//ul[@role='tablist']/li[position()=2 or position()=3]/a"), '#about'))
        try:
            name = driver.find_element_by_xpath("//div[@class='exchange-details-header-content']/div/h1").text
            website = driver.find_element_by_xpath("//div[@class='row no-gutters']/div[8]/a").get_attribute('href')
            email = driver.find_element_by_xpath("//div[@class='row no-gutters']/div[9]/a").get_attribute('href')
            year_est = driver.find_element_by_xpath("//div[@class='row no-gutters']/div[10]").text
            inc_country = driver.find_element_by_xpath("//div[@class='row no-gutters']/div[12]").text
            twitter = driver.find_element_by_xpath("//div[@class='row no-gutters']/div[16]/div[2]/div[2]/a").get_attribute('title')
        except:
            pass
        try:
            print('---------------')
            print('exchange name is : {}'.format(name))
            print('exchange website is : {}'.format(website))
            print('exchange email is : {}'.format(email))
            print('exchange established in year: {}'.format(year_est))
            print('exchange incorporated in : {}'.format(inc_country))
            print('exchange twitter handle is: {}'.format(twitter))
        except:
            pass
        try:
            names_list.append(name)
            websites_list.append(website)
            emails_list.append(email)
            years_list.append(year_est)
            countries_list.append(inc_country)
            twitter_list.append(twitter)
        except:
            pass

df = pd.DataFrame(list(zip(names_list, websites_list, emails_list, years_list, countries_list, twitter_list)), columns=['Ex_Names', 'Website', 'Support Email', 'Inc Year', 'Inc Country', 'Twitter Handle'])
CoinGecko2_data = df.to_csv('CoinGecko4.csv', index=False)
If you know the href, just wait for: //a[contains(@href, 'my-href')]
I am not sure if there is a built-in wait for this, but you can create your own custom wait. Here is an example:
https://seleniumbyexamples.github.io/waitcustom
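As a concrete sketch for this case (assuming the about tab's href contains '#about'):

from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

wait = WebDriverWait(driver, 10)
# matches the about tab whether it sits in li[2] or li[3]
about_tab = wait.until(EC.element_to_be_clickable(
    (By.XPATH, "//ul[@role='tablist']/li/a[contains(@href, '#about')]")))
about_tab.click()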

Cannot extract the HTML table

I want to harvest information from a table within a given website, using Beautiful Soup and Python 3.
I have also tried the XPath method but still cannot find a way to obtain the data.
from urllib.request import urlopen
from bs4 import BeautifulSoup

coaches = 'https://www.badmintonengland.co.uk/coach/find-a-coach'
coachespage = urlopen(coaches)
soup = BeautifulSoup(coachespage, features="html.parser")
data = soup.find_all("tbody", {"id": "JGrid-az-com-1031-tbody"})

def crawler(table):
    for mytable in table:
        try:
            rows = mytable.find_all('tr')
            for tr in rows:
                cols = tr.find_all('td')
                for td in cols:
                    return td.text
        except:
            raise ValueError("no data")

print(crawler(data))
If you use selenium to make the selections and then pd.read_html on the page_source to get the table, the JavaScript is allowed to run and populate the values.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time
url = 'https://www.badmintonengland.co.uk/coach/find-a-coach'
driver = webdriver.Chrome()
driver.get(url)
ele = driver.find_element_by_css_selector('.az-triggers-panel a') #distance dropdown
driver.execute_script("arguments[0].scrollIntoView();", ele)
ele.click()
option = WebDriverWait(driver,10).until(EC.presence_of_element_located((By.ID, "comboOption-az-com-1015-8"))) # any distance
option.click()
driver.find_element_by_css_selector('.az-btn-text').click()
time.sleep(5) #seek better wait condition for page update
tables = pd.read_html(driver.page_source)
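Note that pd.read_html returns a list of DataFrames, one per <table> in the page source, so you still have to pick out the one you want, e.g.:

df = tables[0]   # inspect len(tables) first if the page holds more than one table
print(df.head())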

How to get a hidden href tag in Selenium with "::before"

I'm trying to get a URL from a PLP and visit each of the elements to get certain keywords from the PDP and dump them into a JSON file. However, the list only returns 1 item. I suspect the website is trying to block the action. I use this program once a month to see what new features are added to new items.
The code between the "***" markers is the part I am having trouble with. It returns the correct value, but only 1 item. How can I get more data? In the example below I am only getting the product names, to keep it simple.
sample url: "https://store.nike.com/us/en_us/pw/mens-running-shoes/7puZ8yzZoi3"
Actual element:
<div class="exp-product-wall clearfix">
    ::before
    <div class="grid-item fullSize" data-pdpurl="https://www.nike.com/t/epic-react-flyknit-2-mens-running-shoe-459stf" data-column-index="0" data-item-index="1">
        <div class="grid-item-box">
            <div class="grid-item-content">
                <div class="grid-item-image">
                    <div class="grid-item-image-wrapper sprite-sheet sprite-index-1">
                        <a href="https://www.nike.com/t/epic-react-flyknit-2-mens-running-shoe-459stf">
                            <img src="https://images.nike.com/is/image/DotCom/pwp_sheet2?$NIKE_PWPx3$&$img0=BQ8928_001&$img1=BQ8928_003&$img2=BQ8928_005">
Below is the working code:
import selenium
import json
import time
import re
import string
import requests
import bs4
from selenium import webdriver
from selenium.webdriver import Firefox
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

domain = 'website url goes here'

def prepare_driver(url):
    '''Returns a Firefox Webdriver.'''
    options = Options()
    # options.add_argument('-headless')
    driver = webdriver.Chrome(executable_path='location to chromedriver')
    driver.get(url)
    wait = WebDriverWait(driver, 10).until(EC.presence_of_element_located(
        (By.CLASS_NAME, 'product-name ')))
    time.sleep(2)
    return driver

def fill_form(driver, search_argument):
    '''Finds all the input tags in form and makes a POST requests.'''
    #search_field = driver.find_element_by_id('q')
    #search_field.send_keys(search_argument)
    # We look for the search button and click it
    #driver.find_element_by_class_name('search__submit')\
    #.click()
    wait = WebDriverWait(driver, timeout=10).until(
        EC.presence_of_all_elements_located(
            (By.CLASS_NAME, 'product-name ')))

def scrape_results(driver, n_results):
    '''Returns the data from n_results amount of results.'''
    products_urls = list()
    products_data = list()
    # *** the part in question ***
    for product_title in driver.find_elements_by_xpath('//div[@class="exp-gridwall-content clearfix"]'):
        products_urls.append(product_title.find_element_by_xpath(
            '//div[@class="grid-item fullSize"]').get_attribute('data-pdpurl'))
    # *** end of the part in question ***
    for url in range(0, n_results):
        if url == n_results:
            break
        url_data = scrape_product_data(driver, products_urls[url])
        products_data.append(url_data)
    return products_data

def scrape_product_data(driver, product_url):
    '''Visits a product page and extracts the data.'''
    if driver == None:
        driver = prepare_driver(product_url)
    driver.get(product_url)
    time.sleep(12)
    product_fields = dict()
    # Get the product name
    product_fields['product_name'] = driver.find_element_by_xpath(
        '//h1[@id="pdp_product_title"]').get_attribute('textContent')
    # .text.strip('name')
    return product_fields

if __name__ == '__main__':
    try:
        driver = prepare_driver(domain)
        #fill_form(driver, 'juniole tf')
        products_data = scrape_results(driver, 2)
        products_data = json.dumps(products_data, indent=4, ensure_ascii=False)  # ensure_ascii => changes japanese to correct characters
        with open('data.json', 'w') as f:
            f.write(products_data)
    finally:
        driver.quit()
Desired output in JSON:
[
    {
        "product_name": "Nike Epic React Flyknit 2",
        "descr": "The Nike Epic React Flyknit 2 takes a step up from its predecessor with smooth, lightweight performance and a bold look. An updated Flyknit upper conforms to your foot with a minimal, supportive design. Underfoot, durable Nike React technology defies the odds by being both soft and responsive, for comfort that lasts as long as you can run."
    },
    {
        "product_name": "Nike Zoom Fly SP Fast Nathan Bell",
        "descr": "The Nike Zoom Fly SP Fast Nathan Bell is part of a collaboration with artist Nathan Bell, featuring hand-drawn graphics that celebrate running as a competition with yourself. It's designed to meet the demands of your toughest tempo runs, long runs and race day with a responsive construction that turns the pressure of each stride into energy return for the next."
    }
]
You can easily get the URLs with requests; I targeted the data-pdpurl attribute. In the selenium loop you may need to add some handling of requests for location. A short wait is needed during the loop to prevent false claims of the product not being available.
import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

d = webdriver.Chrome()
results = []
r = requests.get('https://store.nike.com/us/en_us/pw/mens-running-shoes/7puZ8yzZoi3')
soup = bs(r.content, 'lxml')
products = []
listings = soup.select('.grid-item')

for listing in listings:
    url = listing['data-pdpurl']
    title = listing.select_one('.product-display-name').text
    row = {'title': title,
           'url': url}
    products.append(row)

for product in products:
    url = product['url']
    try:
        d.get(url)
        desc = WebDriverWait(d, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".description-preview")))
        results.append({'product_name': product['title'],
                        'descr': desc.text})
    except Exception as e:
        print(e, url)
    finally:
        time.sleep(1)

d.quit()
print(results)
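To produce the JSON file the question asked for, the results list can be dumped the same way as in the original script:

import json

with open('data.json', 'w') as f:
    json.dump(results, f, indent=4, ensure_ascii=False)  # ensure_ascii=False keeps non-ASCII characters readable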
