Export data from dynamic website with BS4 + Python:

I want to export all store data from the following website into an Excel file:
https://www.ybpn.de/ihre-parfuemerien
The problem: the map is "dynamic", so the data I need only loads after you enter a postal code.
The data I need is stored in divs with the class "storefinder__list-item", each carrying a unique reference in its data-storefinder-reference attribute, for example: data-storefinder-reference="132"
I tried:
soup.find("div", {"data-storefinder-reference": "132"})
But the output is: None
I think the cause is that the page is dynamic: the data only loads once a postal code has been entered. So the element with reference id "132" exists in the browser, but it is not in the HTML that bs4 receives, and bs4 cannot find it.
Any ideas to improve the code?

For this you might need to look into tools like Selenium and/or headless Firefox.
Selenium in particular lets you "remote-control" web pages from Python.
Here is a tutorial: https://realpython.com/modern-web-automation-with-python-and-selenium/
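As a rough sketch of the Selenium approach suggested above: the function below opens the page in headless Chrome, types a postal code, waits until at least one .storefinder__list-item element exists, and collects each item's data-storefinder-reference and visible text. The class and attribute names come from the question; the CSS selector of the postal code input (postal_code_css) is an assumption, since the question does not show it. The export uses CSV, which Excel opens directly and which needs only the standard library.

```python
import csv
import io


def scrape_stores(url, postal_code_css, postal_code, timeout=15):
    """Render the page in headless Chrome and return one dict per store item."""
    # Selenium is imported lazily so the CSV helper below works without it.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # postal_code_css is a hypothetical selector; inspect the page to find the real one.
        box = driver.find_element(By.CSS_SELECTOR, postal_code_css)
        box.send_keys(postal_code, Keys.RETURN)
        # Block until the script on the page has inserted at least one store item.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".storefinder__list-item"))
        )
        items = driver.find_elements(By.CSS_SELECTOR, ".storefinder__list-item")
        return [
            {
                "reference": item.get_attribute("data-storefinder-reference"),
                "text": item.text,
            }
            for item in items
        ]
    finally:
        driver.quit()


def stores_to_csv(stores):
    """Serialize the scraped dicts as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["reference", "text"])
    writer.writeheader()
    writer.writerows(stores)
    return buffer.getvalue()
```

Calling scrape_stores(...) for each postal code of interest and writing stores_to_csv(...) to a file would give one row per store; duplicates across postal codes can be dropped by their reference value.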

If the problem is waiting for the page to load, you can handle it with Selenium by running JavaScript inside the page:
result = driver.execute_script('var text = document.title; return text')
If jQuery is available on the page, you can use it in the script as well:
result = driver.execute_script("""
// execute_script returns the value of the top-level return statement
return $('yourselector').text();
""")
Note: For selenium you can look here

You could just open the page in Chrome or Firefox, open the web debug console, and query the elements. If you see them, they are in the DOM and thus queryable. But that will be done in JavaScript. If you're lucky, they use jQuery.

Related

Xpath returns empty array - lxml

I'm trying to write a program that scrapes https://www.tcgplayer.com/ to get a list of Pokemon TCG prices based on a specified list
from lxml import etree, html
import requests
import string
def clean_text(element):
    all_text = element.text_content()
    cleaned = ' '.join(all_text.split())
    return cleaned
page = requests.get("http://www.tcgplayer.com/product/231462/pokemon-first-partner-pack-pikachu?xid=pi731833d1-f2cc-4043-9551-4ca08506b43a&page=1&Language=English")
tree = html.fromstring(page.content)
price = tree.xpath("/html/body/div[2]/div/div/section[2]/section/div/div[2]/section[3]/div/section[1]/ul/li[1]/span[2]")
print(price)
However, when I am running this code the output ends up just being an empty list "[]"
I have tried using selenium and the browser function that it has, however I would like it to not need to open a browser for 100+ cards to get the price data. I have tested this code on another website url and xpath (https://www.pricecharting.com/game/pokemon-promo/jolteon-v-swsh183, /html/body/div[1]/div[2]/div/div/table/tbody[1]/tr[1]/td[4]/span[1]) - so I wonder if it is just how https://www.tcgplayer.com/ is built.
The expected return value is around $5

Question answered above by @Grismar:
When you test the XPath on a site, you probably do this in the Developer Console in the browser, after the page has loaded. At that point in time, any JavaScript will have already executed and completed, and the page may have been updated or even been constructed from scratch by it. When using requests, it just loads the basic page and no scripts get executed. You'll need something that can execute JavaScript to get the same result, like Selenium.
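Before reaching for a browser, it can help to confirm that the data really is missing from the raw response. The sketch below uses only the standard library: it downloads the page the way requests would (no scripts run) and reports which marker strings are absent. The function names and the User-Agent header are illustrative choices, not part of the original question.

```python
import urllib.request


def fetch_static_html(url, timeout=10):
    """Download the page as served, without executing any JavaScript."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def missing_from_static_html(html, markers):
    """Return the markers that do not occur anywhere in the raw HTML."""
    return [marker for marker in markers if marker not in html]
```

If a value that is visible in the browser (for example the price text) is reported as missing, the XPath is not the problem: the content is added by JavaScript after the initial response.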
BeautifulSoup scraping returns no data

How to open and scrape multiple links with Selenium

I am new to scraping with Python and have encountered a weird issue.
I am attempting to scrape OCR'd newspaper articles from a list of URLs using Selenium. The proxy settings on the data source make this easier than other options.
However, I receive tracebacks for the text data every time I run my code. Here is the code that I am using:
article_links = []
for link in driver.find_elements_by_xpath('/html/body/div[1]/main/section[1]/ul[2]/li[*]/div[2]/div[1]/h3/a'):
    links = link.get_attribute("href")
    article_links.append(links)
articles = []
for article in article_links:
    driver.switch_to.window(driver.window_handles[-1])
    driver.get(article)
    driver.find_element_by_css_selector("#js-doc-explorer-show-additional-views").click()
    time.sleep(1)
    for article_text in driver.find_elements_by_css_selector("#ocr-container > div.fulltext-ocr.js-page-ocr"):
        articles.append(article_text)
I come closest to solving the issue by using .click(), which opens a hidden panel for my data. However, upon using this code, the only data that fills is the last row in the dataset. Without the .click(), all rows come back with nothing. Changing the sleep settings also does not help.
The Xpath for the text data is:
/html/body/div[2]/main/section/div[2]/div[2]/section[2]/div/div[4]/text()
Alternatively, is there a way to get each link's source code and parse it with beautifulsoup after the fact?
UPDATE: There has to be something wrong with the loops -- I can get either the first or last values, but nothing in between.

In more recent versions of Selenium, find_elements_by_xpath() is deprecated. Is that the issue you are facing? If so, add from selenium.webdriver.common.by import By and change the call to find_elements(By.XPATH, ...). Similarly, find_elements_by_css_selector() is replaced with find_elements(By.CSS_SELECTOR, ...).
You don't specify if this is even the issue, but if it is, I hope this helps :-)

The solution is to select the relevant (unique) class and require that it contains text.
news = []
for article in article_links:
    driver2.get(article)
    driver2.find_element(By.CSS_SELECTOR, "#js-doc-explorer-show-additional-views").click()
    # match the OCR container only once it actually contains text
    article_text = driver2.find_element(By.XPATH, '//div[@class="fulltext-ocr js-page-ocr"][contains(text()," ")]')
    news.append([article_text.text])
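One plausible reason the original loop only kept the last row is that it stored WebElement objects, which become stale as soon as the driver navigates to the next link. The sketch below stores the text string instead and polls until the OCR container is non-empty rather than sleeping for a fixed second. The selectors come from the question and answer; the function name, timeout, and poll interval are assumptions. The locator strings "css selector" and "xpath" are the values of Selenium's By.CSS_SELECTOR and By.XPATH, written out so the function accepts any driver-like object.

```python
import time

CSS = "css selector"  # same value as selenium's By.CSS_SELECTOR
XPATH = "xpath"       # same value as selenium's By.XPATH


def collect_article_texts(driver, article_links, timeout=10.0, poll=0.25):
    """Visit each link and return the OCR text of each article as a string."""
    texts = []
    for url in article_links:
        driver.get(url)
        # Open the hidden panel that holds the OCR text.
        driver.find_element(CSS, "#js-doc-explorer-show-additional-views").click()
        deadline = time.monotonic() + timeout
        text = ""
        while time.monotonic() < deadline:
            found = driver.find_elements(XPATH, '//div[@class="fulltext-ocr js-page-ocr"]')
            text = found[0].text.strip() if found else ""
            if text:
                break
            time.sleep(poll)
        # Keep the string, not the WebElement: elements go stale after the next get().
        texts.append(text)
    return texts
```

An article whose text never appears within the timeout yields an empty string, so the result stays aligned with article_links.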

Part of HTML not visible for Scrapy

Set-up
I'm using scrapy to scrape housing ads.
For each ad, I'm trying to obtain info on year of construction.
This info is stated in most ads.
Problem
I can see the year of construction and the other info around it in the about section when I check the ad in the browser and its HTML code in developer mode.
However, when I use Scrapy I get returned an empty list. I can scrape other parts of the ad page (price, rooms, etc.), but not the about section.
Check this example ad.
If I use response.css('#caracteristique_bien').extract_first(), I get,
<div id="caracteristique_bien"></div>
That's as far as I can go. Any deeper returns emptiness.
How can I obtain the year of construction?

As I mentioned, this is rendered using JavaScript, which means that some parts of the HTML are loaded dynamically by the browser (Scrapy is not a browser).
The good thing in this case is that the JavaScript is inside the actual response, which means you can still parse that information, just differently.
For example, to get the description, you can find it inside:
import re
import demjson
script_info = response.xpath('//script[contains(., "Object.defineProperty")]/text()').extract_first()
# getting description (.group(1) returns the captured object literal as a string)
description_json = re.search(r"descriptionBien', (\{.+?\})\);", script_info, re.DOTALL).group(1)
real_description = demjson.decode(description_json)['value']
# getting surface area
surface_json = re.search(r"surfaceT', (\{.+?\})\);", script_info, re.DOTALL).group(1)
real_surface = demjson.decode(surface_json)['value']
...
As you can see script_info contains all the information, you just need to come up with a way to parse that to get what you want
But there is some information that isn't inside the same response. To get it you need to do a GET request to:
https://www.seloger.com/detail,json,caracteristique_bien.json?idannonce=139747359
As you can see, it only requires the idannonce, which you can get from the previous response with:
demjson.decode(re.search("idAnnonce', (\{.+?\})\);", script_info, re.DOTALL).group(1))['value']
Later with the second request, you can get for example the "construction year" with:
import json
...
[y for y in [x for x in json.loads(response.body)['categories'] if x['name'] == 'Général'][0]['criteria'] if 'construction' in y['value']][0]['value']
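The one-line comprehension above can be hard to debug when a category or criterion is missing. A more readable sketch of the same lookup follows; it assumes the JSON layout implied by that comprehension (a top-level "categories" list whose entries have "name" and "criteria", each criterion having a "value"). The function name is hypothetical, and it returns None instead of raising IndexError when nothing matches.

```python
import json


def find_criterion(body, category_name, keyword):
    """Return the first criterion value in the named category that contains keyword."""
    payload = json.loads(body)  # accepts str or bytes such as response.body
    for category in payload.get("categories", []):
        if category.get("name") != category_name:
            continue
        for criterion in category.get("criteria", []):
            value = criterion.get("value", "")
            if keyword in value:
                return value
    return None
```

In the Scrapy callback for the second request this would be called as find_criterion(response.body, 'Général', 'construction').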

I loaded the page, opened the browser's devtools, searched (Ctrl-F) for the selector you used (caracteristique_bien), and found this request: https://www.seloger.com/detail,json,caracteristique_bien.json?idannonce=139747359
There you can find what you are looking for.

Looking at your example, the ad is loaded dynamically with JavaScript, so you won't be able to get it via Scrapy alone.
You can use Selenium for (massive) scraping (I did similar things on a famous French ads website).
Just run it headless with Chrome options and this will be fine:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome(options=options)

Search results don't change URL - Web Scraping with Python and Selenium

I am trying to create a python script to scrape the public county records website. I ultimately want to be able to have a list of owner names and the script run through all the names and pull the most recent deed of trust information (lender name and date filed). For the code below, I just wrote the owner name as a string 'ANCHOR EQUITIES LTD'.
I have used Selenium to automate the entering of owner name into form boxes but when the 'return' button is pressed and my results are shown, the website url does not change. I try to locate the specific text in the table using xpath but the path does not exist when I look for it. I have concluded the path does not exist because it is searching for the xpath on the first page with no results shown. BeautifulSoup4 wouldn't work in this situation because parsing the url would only return a blank search form html
See my code below:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
browser = webdriver.Chrome()
browser.get('http://deed.co.travis.tx.us/ords/f?p=105:5:0::NO:::#results')
ownerName = browser.find_element_by_id("P5_GRANTOR_FULLNAME")
ownerName.send_keys('ANCHOR EQUITIES LTD')
docType = browser.find_element_by_id("P5_DOCUMENT_TYPE")
docType.send_keys("deed of trust")
ownerName.send_keys(Keys.RETURN)
print(browser.page_source)
#lenderName = browser.find_element_by_xpath("//*[@id=\"report_results\"]/tbody[2]/tr/td/table/tbody/tr[25]/td[9]/text()")
I have commented out the variable that is giving me trouble.. Please help!!!!
If I am not explaining my problem correctly, please feel free to ask and I will clear up any questions.

I think you almost have it.
You match the element you are interested in using:
lenderNameElement = browser.find_element_by_xpath("//*[@id=\"report_results\"]/tbody[2]/tr/td/table/tbody/tr[25]/td[9]")
Next you access the text of that element:
lenderName = lenderNameElement.text
Or in a single step:
lenderName = browser.find_element_by_xpath("//*[@id=\"report_results\"]/tbody[2]/tr/td/table/tbody/tr[25]/td[9]").text

Have you tried the following XPath?
//table[contains(@summary,"Search Results")]/tbody/tr
I have checked it and it works perfectly. With it, you have to iterate over each tr.

How to get a specific frame in a web page and retrieve its content

I wanted to access the translation results of the following url
http://translate.google.com/translate?hl=en&sl=en&tl=ar&u=http%3A%2F%2Fwww.saltycrane.com%2Fblog%2F2008%2F10%2Fhow-escape-percent-encode-url-python%2F
The translation is displayed in the bottom content frame of the two frames. I am only interested in retrieving that bottom frame to get the translation.
Selenium for Python lets us fetch page contents via web automation:
browser.get('http://translate.google.com/#en/ar/'+hurl)
The required frame is an iframe:
<div id="contentframe" style="top:160px"><iframe src="/translate_p?hl=en&am... name=c frameborder="0" style="height:100%;width:100%;position:absolute;top:0px;bottom:0px;"></div></iframe>
But how do I get the bottom content frame element and retrieve the translation using web automation?
I also learned that PyQuery lets us browse the contents using jQuery-style syntax.
Update:
An answer mentioned that Selenium provides a method where you can do that.
frame = browser.find_element_by_tag_name('iframe')
browser.switch_to_frame(frame)
# get page source
browser.page_source
But it does not work in the above example. It returns an empty page.

You can use driver.switchTo().frame(1); here, the digit 1 inside frame() is the index of the frame within the web page. Since you want to switch to the second frame and the index starts at 0, you should use driver.switchTo().frame(1);
But the above code is in Java. In Python, you can use the line below.
driver.switch_to.frame(1)
UPDATE
driver.get("http://translate.google.com/translate?hl=en&sl=en&tl=ar&u=http://www.saltycrane.com/blog/2008/10/how-escape-percent-encode-url-python/");
driver.switchTo().frame(0);
System.out.println(driver.findElement(By.xpath("/html/body/div/div/div[3]/h1/span/a")).getText());
Output: SaltyCrane ???????
I just tried to print the title SaltyCrane that is present inside the iframe.
It worked for me except for the ? symbols after SaltyCrane. As the text was Arabic, it could not be displayed correctly.
The above code is in Java. The same logic should also work in Python.

Selenium provides a method where you can do that.
frame = browser.find_element_by_tag_name('iframe')
browser.switch_to_frame(frame)
# get page source
browser.page_source
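In current Selenium releases the same idea is written with driver.switch_to.frame(...), since switch_to_frame() was removed. The sketch below reads the source of the iframe at a given position and then switches back to the top-level document. The function name is hypothetical, and "tag name" is the value of Selenium's By.TAG_NAME, written out so the function accepts any driver-like object.

```python
TAG_NAME = "tag name"  # same value as selenium's By.TAG_NAME


def read_iframe_source(driver, index=0):
    """Return the page source of the iframe at position index, then switch back."""
    frames = driver.find_elements(TAG_NAME, "iframe")
    if index >= len(frames):
        raise IndexError(f"page has {len(frames)} iframe(s), no index {index}")
    driver.switch_to.frame(frames[index])
    try:
        return driver.page_source
    finally:
        # Return to the top-level document so later lookups are not scoped to the frame.
        driver.switch_to.default_content()
```

If the frame's own content is filled in by script after it loads, the returned source can still be empty, and an explicit wait inside the frame may be needed before reading it.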
