Search results don't change URL - Web Scraping with Python and Selenium - python

I am trying to create a Python script to scrape a public county records website. Ultimately I want to pass in a list of owner names and have the script run through all of them and pull the most recent deed of trust information (lender name and date filed). For the code below, I just hard-coded the owner name as the string 'ANCHOR EQUITIES LTD'.
I have used Selenium to automate entering the owner name into the form boxes, but when the return key is pressed and my results are shown, the website URL does not change. I try to locate the specific text in the table using XPath, but the path does not exist when I look for it. I have concluded the path does not exist because the search is being run against the first page, where no results are shown yet. BeautifulSoup4 wouldn't work in this situation because parsing the URL would only return the blank search form HTML.
See my code below:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
browser = webdriver.Chrome()
browser.get('http://deed.co.travis.tx.us/ords/f?p=105:5:0::NO:::#results')
ownerName = browser.find_element_by_id("P5_GRANTOR_FULLNAME")
ownerName.send_keys('ANCHOR EQUITIES LTD')
docType = browser.find_element_by_id("P5_DOCUMENT_TYPE")
docType.send_keys("deed of trust")
ownerName.send_keys(Keys.RETURN)
print(browser.page_source)
#lenderName = browser.find_element_by_xpath("//*[@id=\"report_results\"]/tbody[2]/tr/td/table/tbody/tr[25]/td[9]/text()")
I have commented out the line that is giving me trouble. Please help!
If I am not explaining my problem correctly, please feel free to ask and I will clear up any questions.

I think you almost have it.
You match the element you are interested in using:
lenderNameElement = browser.find_element_by_xpath("//*[@id=\"report_results\"]/tbody[2]/tr/td/table/tbody/tr[25]/td[9]")
Next you access the text of that element:
lenderName = lenderNameElement.text
Or in a single step:
lenderName = browser.find_element_by_xpath("//*[@id=\"report_results\"]/tbody[2]/tr/td/table/tbody/tr[25]/td[9]").text
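If the cell still comes back missing, it may simply not be rendered yet when the XPath runs; since the URL never changes, an explicit wait for the results report is safer than matching immediately after pressing RETURN. A minimal sketch (the report_results id is taken from your XPath, the 10-second timeout is an assumption):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the results report to appear after pressing RETURN
wait = WebDriverWait(browser, 10)
results = wait.until(EC.presence_of_element_located((By.ID, "report_results")))
# same path as above, but relative to the report element
lenderName = results.find_element(By.XPATH, "./tbody[2]/tr/td/table/tbody/tr[25]/td[9]").text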

Have you tried the following XPath?
//table[contains(@summary, "Search Results")]/tbody/tr
I have checked that it works. With it, you have to iterate over each tr, as in the sketch below.
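For example, a rough sketch of iterating the rows that XPath matches (which column holds the lender name is an assumption you will need to check):
rows = browser.find_elements_by_xpath('//table[contains(@summary, "Search Results")]/tbody/tr')
for row in rows:
    cells = row.find_elements_by_xpath('./td')
    # print every cell so you can see which column holds the lender name and filing date
    print([cell.text for cell in cells])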

Related

Xpath returns empty array - lxml

I'm trying to write a program that scrapes https://www.tcgplayer.com/ to get a list of Pokemon TCG prices based on a specified list of cards.
from lxml import etree, html
import requests
import string
def clean_text(element):
    all_text = element.text_content()
    cleaned = ' '.join(all_text.split())
    return cleaned
page = requests.get("http://www.tcgplayer.com/product/231462/pokemon-first-partner-pack-pikachu?xid=pi731833d1-f2cc-4043-9551-4ca08506b43a&page=1&Language=English")
tree = html.fromstring(page.content)
price = tree.xpath("/html/body/div[2]/div/div/section[2]/section/div/div[2]/section[3]/div/section[1]/ul/li[1]/span[2]")
print(price)
However, when I run this code, the output ends up just being an empty list: [].
I have tried using Selenium and its browser automation; however, I would rather not have to open a browser for 100+ cards just to get the price data. I have tested this code on another website URL and XPath (https://www.pricecharting.com/game/pokemon-promo/jolteon-v-swsh183, /html/body/div[1]/div[2]/div/div/table/tbody[1]/tr[1]/td[4]/span[1]) and it works there, so I wonder if it is just how https://www.tcgplayer.com/ is built.
The expected return value is around $5
Question answered by @Grismar:
When you test the XPath on a site, you probably do this in the Developer Console in the browser, after the page has loaded. At that point in time, any JavaScript will have already executed and completed and the page may have been updated or even been constructed from scratch by it. When using requests, it just loads the basic page and no scripts get executed - you'll need something that can execute JavaScript to get the same result, like selenium
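A minimal sketch of that approach, running Chrome headless so no visible browser window opens, and reusing the lxml XPath from the question (the selector itself is unchanged and may still break if tcgplayer's markup shifts):
from lxml import html
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # render the page without opening a window
driver = webdriver.Chrome(options=options)
driver.get("http://www.tcgplayer.com/product/231462/pokemon-first-partner-pack-pikachu")
tree = html.fromstring(driver.page_source)  # page source after JavaScript has run
price = tree.xpath("/html/body/div[2]/div/div/section[2]/section/div/div[2]/section[3]/div/section[1]/ul/li[1]/span[2]")
print(price[0].text_content() if price else "price element not found")
driver.quit()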

How to open and scrape multiple links with Selenium

I am new to scraping with Python and have encountered a weird issue.
I am attempting to scrape OCR'd newspaper articles from a list of URLs using Selenium -- the proxy settings on the data source make this easier than other options.
However, I receive tracebacks for the text data every time I run my code. Here is the code that I am using:
article_links = []
for link in driver.find_elements_by_xpath('/html/body/div[1]/main/section[1]/ul[2]/li[*]/div[2]/div[1]/h3/a'):
    links = link.get_attribute("href")
    article_links.append(links)
articles = []
for article in article_links:
    driver.switch_to.window(driver.window_handles[-1])
    driver.get(article)
    driver.find_element_by_css_selector("#js-doc-explorer-show-additional-views").click()
    time.sleep(1)
    for article_text in driver.find_elements_by_css_selector("#ocr-container > div.fulltext-ocr.js-page-ocr"):
        articles.append(article_text)
I come closest to solving the issue by using .click(), which opens a hidden panel for my data. However, upon using this code, the only data that fills is the last row in the dataset. Without the .click(), all rows come back with nothing. Changing the sleep settings also does not help.
The Xpath for the text data is:
/html/body/div[2]/main/section/div[2]/div[2]/section[2]/div/div[4]/text()
Alternatively, is there a way to get each link's source code and parse it with beautifulsoup after the fact?
UPDATE: There has to be something wrong with the loops -- I can get either the first or last values, but nothing in between.
In more recent versions of Selenium, the method find_elements_by_xpath() is deprecated. Is that the issue you are facing? If it is, add from selenium.webdriver.common.by import By and change the call to find_elements(By.XPATH, ...). Similarly, find_elements_by_css_selector() is replaced with find_elements(By.CSS_SELECTOR, ...).
You don't specify if this is even the issue, but if it is, I hope this helps :-)
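For example, against the selectors in the question, the migrated calls would look roughly like this:
from selenium.webdriver.common.by import By

# old: driver.find_elements_by_xpath(...)
links = driver.find_elements(By.XPATH, '/html/body/div[1]/main/section[1]/ul[2]/li[*]/div[2]/div[1]/h3/a')
# old: driver.find_element_by_css_selector(...)
driver.find_element(By.CSS_SELECTOR, "#js-doc-explorer-show-additional-views").click()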
The solution was found by selecting the relevant (unique) class and specifying that the element must contain text:
news = []
for article in article_links:
    driver2.get(article)
    driver2.find_element(By.CSS_SELECTOR, "#js-doc-explorer-show-additional-views").click()
    article_text = driver2.find_element(By.XPATH, '//div[@class="fulltext-ocr js-page-ocr"][contains(text()," ")]')
    news.append([article_text.text])

Export data from dynamic website with BS4 + Python:

I want to export all store data from the following website into an Excel file:
https://www.ybpn.de/ihre-parfuemerien
The problem: the map is "dynamic", so the data I need only loads once you enter a postal code.
The data I need is stored in divs with the class "storefinder__list-item", each carrying a unique reference in the data-storefinder-reference attribute, for example: data-storefinder-reference="132".
I tried:
soup.find("div", {"data-storefinder-reference": "132"})
But the output is: None
I think this problem is caused by the fact that the page is dynamic, so the data only loads after you enter a postal code. When I search for the reference id "132" it is "there" in the browser, but it is not present in the HTML that gets downloaded, so bs4 can't find the id.
Any ideas to improve the code?
For this you might need to look into tools like selenium and/or "firefox-headless".
Selenium in particular allows you to "remote-control" web pages with Python.
Here is a tutorial: https://realpython.com/modern-web-automation-with-python-and-selenium/
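A rough sketch of that idea, assuming a hypothetical search-box selector and postal code (neither is taken from the actual page, so adjust them after inspecting the form):
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()  # or webdriver.Chrome()
driver.get("https://www.ybpn.de/ihre-parfuemerien")
# hypothetical selector for the postal-code input
search_box = driver.find_element_by_css_selector("input[type='search']")
search_box.send_keys("10115")
search_box.send_keys(Keys.RETURN)
time.sleep(3)  # crude wait for the store list to load
# hand the rendered HTML to BeautifulSoup as before
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.find("div", {"data-storefinder-reference": "132"}))
driver.quit()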
If the problem is waiting for the page to load, you can do it with selenium.
`result = driver.execute_script('var text = document.title ; return text')`
If there is jQuery on the page, this certainly works:
result = driver.execute_script("""
    $(document).ready(function () {
        var $text = $('yourselector').text();
        return $text;
    });
""")
Note: For selenium you can look here
You could just open the page in Chrome or Firefox, open the web debug console and query the elements. If you see them, they are in the DOM and thus queryable. But that will be done in JavaScript. If you're lucky, they use jQuery.

Part of HTML not visible for Scrapy

Set-up
I'm using scrapy to scrape housing ads.
For each ad, I'm trying to obtain info on year of construction.
This info is stated in most ads.
Problem
I can see the year of construction and the other info around it in the about section when I check the ad in the browser and its HTML code in developer mode.
However, when I use Scrapy I get returned an empty list. I can scrape other parts of the ad page (price, rooms, etc.), but not the about section.
Check this example ad.
If I use response.css('#caracteristique_bien').extract_first(), I get,
<div id="caracteristique_bien"></div>
That's as far as I can go. Any deeper returns emptiness.
How can I obtain the year of construction?
As I mentioned, this is rendered using JavaScript, which means that some parts of the HTML will be loaded dynamically by the browser (Scrapy is not a browser).
The good thing in this case is that the JavaScript is inside the actual response, which means you can still parse that information, just differently.
For example, to get the description, you can find it inside:
import re
import demjson
script_info = response.xpath('//script[contains(., "Object.defineProperty")]/text()').extract_first()
# getting description
description_json = re.search(r"descriptionBien', (\{.+?\});", script_info, re.DOTALL).group(1)
real_description = demjson.decode(description_json)['value']
# getting surface area
surface_json = re.search(r"surfaceT', (\{.+?\})\);", script_info, re.DOTALL).group(1)
real_surface = demjson.decode(surface_json)['value']
...
As you can see, script_info contains all the information; you just need to come up with a way to parse it to get what you want.
But there is some information that isn't inside the same response. To get it you need to do a GET request to:
https://www.seloger.com/detail,json,caracteristique_bien.json?idannonce=139747359
As you can see, it only requires the idannonce, which you can get from the previous response with:
demjson.decode(re.search("idAnnonce', (\{.+?\})\);", script_info, re.DOTALL).group(1))['value']
Later with the second request, you can get for example the "construction year" with:
import json
...
[y for y in [x for x in json.loads(response.body)['categories'] if x['name'] == 'Général'][0]['criteria'] if 'construction' in y['value']][0]['value']
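Put together inside a Scrapy spider, the chaining could look roughly like this (a sketch, not tested against the live site):
import json
import re
import demjson
import scrapy

# methods of your existing scrapy.Spider subclass
def parse_ad(self, response):
    # the inline <script> that holds the page's JS configuration
    script_info = response.xpath('//script[contains(., "Object.defineProperty")]/text()').extract_first()
    idannonce = demjson.decode(
        re.search(r"idAnnonce', (\{.+?\})\);", script_info, re.DOTALL).group(1))['value']
    url = 'https://www.seloger.com/detail,json,caracteristique_bien.json?idannonce={}'.format(idannonce)
    yield scrapy.Request(url, callback=self.parse_caracteristiques)

def parse_caracteristiques(self, response):
    data = json.loads(response.body)
    general = [x for x in data['categories'] if x['name'] == 'Général'][0]
    year = [y for y in general['criteria'] if 'construction' in y['value']][0]['value']
    yield {'construction_year': year}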
I loaded the page, opened the browser's devtools, and did a Ctrl-F with the CSS selector you used (caracteristique_bien), and found this request: https://www.seloger.com/detail,json,caracteristique_bien.json?idannonce=139747359
That is where you can find what you are looking for.
Looking at your example, the ad is loaded dynamically with JavaScript, so you won't be able to get it via Scrapy alone.
You can use Selenium for (large-scale) scraping (I did similar things on a well-known French ads website).
Just use it headless with Chrome options and this will be fine:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome(options = options)
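With that driver in place, the otherwise-empty section from the question can be read after the page renders (ad_url is a placeholder for the listing URL, which is not reproduced here):
ad_url = "..."  # the seloger listing URL from the question
driver.get(ad_url)
year_section = driver.find_element_by_css_selector("#caracteristique_bien")
print(year_section.text)  # the rendered characteristics, including the construction year
driver.quit()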

Getting an element by attribute and using driver to click on a child element in webscraping - Python

Webscraping results from Indeed.com
- Searching 'Junior Python' in 'Los Angeles, CA' (done)
- Sometimes a popup window opens; close the window if a popup occurs (done)
- The top 3 results are sponsored, so skip these and go to the real results
- Click on a result's summary section, which opens a side panel with the full summary
- Scrape the full summary
- When a result summary is clicked, the URL changes; rather than opening a new window, I would like to scrape the full summary from the side panel
- Each real result is under ('div', {'data-tn-component':'organicJob'}). I am able to get the job title, company, and short summary using BeautifulSoup. However, I would like to get the full summary in the side panel.
Problem
1) When I try to click on the link using Selenium (the job title or the short summary, which opens up the side panel), the code only ends up clicking on the 1st link, which is a 'sponsored' one. I am unable to locate and click on a real result under id='jobOrganic'.
2) Once a real result is clicked on (manually), I can see that the full summary side panel is under <td id='auxCol'> and, within this, under <div id='vjs-desc'>. The full summary is contained within the <p> tags. When I try to scrape the full summary with Selenium using findAll('div', {'id':'vjs-desc'}), all I get is a blank result [].
3) The URL also changes when the side panel is opened. I tried using Selenium to have the driver get the new URL and then soup the page source to get results, but all I'm getting is the 1st sponsored result, which is not what I want. I'm not sure why BeautifulSoup keeps getting the sponsored result, even when I'm running the code against the id='jobOrganic' real results.
Here is my code. I've been working on this for almost past two days, went through stackoverflow, documentation, and google but unable to find the answer. I'm hoping someone can point out what I'm doing wrong and why I'm unable to get the full summary.
Thanks and sorry for this being so long.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup as bs
url = 'https://www.indeed.com/'
driver = webdriver.Chrome()
driver.get(url)
whatinput = driver.find_element_by_id('text-input-what')
whatinput.send_keys('Junior Python')
whereinput = driver.find_element_by_id('text-input-where')
whereinput.click()
whereinput.clear()
whereinput.send_keys('Los Angeles, CA')
findbutton = driver.find_element_by_xpath('//*[@id="whatWhere"]/form/div[3]/button')
findbutton.click()
try:
    popup = driver.find_element_by_id('prime-popover-close-button')
    popup.click()
except:
    pass
This is where I'm stuck. The result summary is under {'data-tn-component':'organicJob'}, span class='summary'. Once I click on this, side panel opens up.
soup = bs(driver.page_source,'html.parser')
contents = soup.findAll('div',{"data-tn-component":"organicJob"})
for each in contents:
    summary = driver.find_element_by_class_name('summary')
    summary.click()
This opens the side panel, but it clicks the first sponsored link on the whole page, not the real result. It basically goes outside of the 'organicJob' result set for some reason.
url = driver.current_url
driver.get(url)
I tried to get the new URL after clicking on the (sponsored) link to test whether I can even get the side panel's full summary (albeit sponsored, for test purposes).
soup=bs(driver.page_source,'html.parser')
fullsum = soup.findAll('div',{"id":"vjs-desc"})
print(fullsum)
This actually prints out the full summary from the side panel, but it keeps printing the same 1st result over and over through the whole loop, instead of moving on to the next one.
The problem is that you are fetching divs using BeautifulSoup but clicking using Selenium, which is not aware of your collected divs.
Because you are using the find_element_by_class_name() method of the driver object, it searches the whole page instead of your intended div object each (from the for loop). Thus, it ends up fetching the same first result from the whole page in every iteration.
One quick workaround is possible using only Selenium (this will be slower, though):
elements = driver.find_elements_by_tag_name('div')
for element in elements:
    component = element.get_attribute("data-tn-component")
    # guard against divs that lack the attribute (get_attribute returns None)
    if component and "organicJob" in component:
        summary = element.find_element_by_class_name('summary')
        summary.click()
The above code will search for all the divs and iterate over them to find divs whose data-tn-component attribute contains organicJob. Once it finds one, it will search for an element with the class name summary and click on that element.
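A slightly tighter variant of the same idea is to let Selenium match the attribute directly with a CSS selector instead of scanning every div (assuming Indeed still uses the data-tn-component="organicJob" markup described above):
for job in driver.find_elements_by_css_selector('div[data-tn-component="organicJob"]'):
    summary = job.find_element_by_class_name('summary')
    summary.click()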
