Web scraping with Selenium - Python

I'm trying to scrape this website for the list of company names, codes, industries, sectors, market caps, etc. in the table with Selenium. I'm new to it and have written the below code:
from selenium import webdriver
import time

path_to_chromedriver = r'C:\Documents\chromedriver'
browser = webdriver.Chrome(executable_path=path_to_chromedriver)
url = r'http://sgx.com/wps/portal/sgxweb/home/company_disclosure/stockfacts'
browser.get(url)
time.sleep(15)
output = browser.page_source
print(output)
However, I'm able to get the below tags, but not the data in them:
<div class="table-wrapper results-display">
<table>
<thead>
<tr></tr>
</thead>
<tbody></tbody>
</table>
</div>
<div class="pager results-display"></div>
I have previously also tried BS4 to scrape it, but failed at it. Any help is much appreciated.

The results are in an iframe - switch to it and then get the .page_source:
iframe = driver.find_element_by_css_selector("#mainContent iframe")
driver.switch_to.frame(iframe)
I would also add a wait for the table to be loaded:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(driver, 10)
# locate and switch to the iframe
iframe = driver.find_element_by_css_selector("#mainContent iframe")
driver.switch_to.frame(iframe)
# wait for the table to load
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.companyName')))
print(driver.page_source)

This is totally doable. What might be easiest is to use a find_elements call (note that it's plural) and grab all of the <tr> elements. It will return a list that you can walk with find_element (singular) calls on each item, this time finding each element by class.
You may also be running into a timing issue. I noticed that the data you are looking for loads VERY slowly, so you probably need to wait for it. The best way to do that is to check for its existence repeatedly until it appears, then read it. find_elements calls (again, plural) do not throw an exception when no elements are found; they simply return an empty list, which makes them a decent way to poll for the data to appear.
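The polling idea above can be sketched as a small helper. This is my own sketch, not a Selenium API: `find_elements` here stands for any zero-argument callable that returns a (possibly empty) list, so the helper is not tied to a particular locator.

```python
import time

def wait_for_elements(find_elements, timeout=15, poll=0.5):
    """Call find_elements() repeatedly until it returns a non-empty
    list or the timeout expires. Returns the last (possibly empty)
    result rather than raising, mirroring find_elements semantics."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        elements = find_elements()
        if elements:
            return elements
        time.sleep(poll)
    return find_elements()  # one final attempt after the deadline

# With Selenium this would be used as, e.g.:
#   rows = wait_for_elements(lambda: driver.find_elements_by_tag_name("tr"))
```

Because the helper never raises on "not found", the caller decides what an empty result means, which matches the advice above.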


Selenium doesn't find element loaded from Ajax

I have been trying to get access to the 4 images on this page: https://altkirch-alsace.fr/serviable/demarches-en-ligne/prendre-un-rdv-cni/
However, the grey region seems to be Ajax-loaded (according to its class name). I want to get the element <div id="prestations"> inside it, but I can't access it, nor any other element within the grey area.
I have tried to follow several answers to similar questions, but no matter how long I wait I get an error that the element is not found; the element is there when I click "Inspect element", but I don't see it when I click "View source". Does that mean I can't access it through Selenium?
Here is my code:
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
driver.get("https://altkirch-alsace.fr/serviable/demarches-en-ligne/prendre-un-rdv-cni/")
element = WebDriverWait(driver, 10) \
.until(lambda x: x.find_element(By.ID, "prestations"))
print(element)
You're not using WebDriverWait(...).until properly. Your lambda uses find_element, which throws an exception when the element is not found.
You should use it like this instead:
from selenium.webdriver.support import expected_conditions as EC
...
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "prestations"))
)
I had the same problem. There is no issue with waiting; there is a frame on that webpage:
<iframe src="https://www.rdv360.com/mairie-d-altkirch?ajx_md=1" width="100%"
height="600" style="border:0px;"></iframe>
The Document Object Model (DOM) of the main website does not contain any information about the page that is loaded into the frame.
Even if you wait for hours, you will not find any elements inside this frame, because it has its own DOM.
You need to switch the WebDriver context to this frame. Then you can access the frame's DOM.
As the iframe on your website has no id, you can search for all frames as described in the Selenium documentation (https://www.selenium.dev/documentation/webdriver/browser/frames/).
The code searches for all HTML tags of type "iframe" and takes the second one (the [1] index is the second match; use [0] if you want the first):
iframe = driver.find_elements(By.TAG_NAME,'iframe')[1]
driver.switch_to.frame(iframe)
Now you may find your desired elements.
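A minimal helper for this pattern might look like the sketch below. The helper name and signature are my own, not part of Selenium; the string "tag name" is the literal value behind Selenium's By.TAG_NAME, used here so the sketch has no hard dependency on the selenium import.

```python
def switch_to_iframe(driver, index=0):
    """Switch the driver's context into the index-th <iframe> on the
    page and return that iframe element. Raises IndexError if the
    page has fewer than index + 1 iframes."""
    # "tag name" is the literal value of Selenium's By.TAG_NAME
    iframes = driver.find_elements("tag name", "iframe")
    iframe = iframes[index]
    driver.switch_to.frame(iframe)
    return iframe
```

Once the work inside the frame is done, driver.switch_to.default_content() returns to the top-level document.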
The solution that worked in my case, on a webpage like this:
<html>
...
<iframe ... id="myframe1">
...
</iframe>
<iframe ... id="myframe2">
...
</iframe>
...
</html>
my_iframe = my_WebDriver.find_element(By.ID, 'myframe1')
my_WebDriver.switch_to.frame(my_iframe)
also working:
my_WebDriver.switch_to.frame('myframe1')
According to the Selenium docs you may use the iframe's name, id, or the element itself to switch to that frame.

Change style with Selenium in Python

On this website I try to set some filters to collect data, but I can't access the table using a click event with Selenium in my Python script.
I noticed that I need to change the style from:
<div id="filtersWrapper" class="displayNone " style="display: none;">
to
<div id="filtersWrapper" class="displayNone " style="display: block;">
I think that I should use driver.execute_script(), but I have no clue how to do it
I would greatly appreciate some help with this. Thank you!
You can change an attribute on an element using JavaScript through Selenium:
element = driver.find_element_by_id('filtersWrapper')
driver.execute_script("arguments[0].setAttribute('attributeToChange', 'new value')", element)
or you can try clicking the element with javascript
driver.execute_script("arguments[0].click();", element)
I have checked the DOM tree of the webpage. Somehow I was unable to locate any element such as:
<div id="filtersWrapper" class="displayNone " style="display: none;">
However the following element exists:
<div id="filtersWrapper" class="displayNone ">
<div id="filtersArrowIndicator" class="arrowIndicator"></div>
...
<div id="economicCalendarSearchPopupResults" class="eventSearchPopupResults economicCalendarSearchPopupResults text_align_lang_base_1 dirSearchResults calendarFilterSearchResults displayNone">
</div>
</div>
Not sure if that was your desired element. A bit more information about your use case would have helped us debug the issue better. However, to set the display property of the style attribute to block for the element, you can use:
driver.execute_script("document.getElementById('filtersWrapper').style.display='block';")
You can use driver.execute_script() to accomplish this. This is how I change the style attribute in my own code:
div_to_change = driver.find_element_by_id("filtersWrapper")
driver.execute_script("arguments[0].style.display = 'block';", div_to_change)
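The arguments[0] pattern generalizes to any CSS property. Below is a small sketch of that idea; the helper name is mine, not a Selenium API. Passing the property and value through the arguments array means no string escaping is needed in the script:

```python
def set_style_property(driver, element, prop, value):
    """Set a single CSS property on an element via JavaScript.
    prop and value travel through the arguments array, so they are
    never interpolated into the script string itself."""
    driver.execute_script(
        "arguments[0].style.setProperty(arguments[1], arguments[2]);",
        element, prop, value,
    )

# e.g. set_style_property(driver, div_to_change, "display", "block")
```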
I had a look at the website you are automating, and you might not need JavaScript execution at all to do this - there's a reason the div you are trying to click has style = "display: none": it is not meant to be clicked in this context. Working around that with JavaScript might not produce your intended results. This code snippet has been updated with your requirements to set a Time filter in the Economic Calendar section:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.investing.com/economic-calendar/")
driver.find_element_by_id("economicCurrentTime").click()
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.ID, "filterStateAnchor"))).click()
checkbox_for_bull3 = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//*[@id='importance2']")))
driver.execute_script("arguments[0].scrollIntoView(true);", checkbox_for_bull3)
checkbox_for_bull3.click()
checkbox_for_time = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//fieldset[label[@for='timeFiltertimeOnly']]/input")))
checkbox_for_time.click()
I modified your code snippet to fix a few issues: when navigating to the economic-calendar page, you were clicking the 'Filters' field twice, which caused a problem when trying to click checkbox_for_bull3. I also added a scrollIntoView() JavaScript call.
I ran this on my local machine and the code executed end to end successfully.

Page source not showing advertisements for Selenium / Python

This should be a very straightforward element find, but it's just not happening. I've added a very long implicit wait to allow the page to load completely:
from selenium import webdriver
driver = webdriver.Firefox()
driver.implicitly_wait(30)
driver.get("https://www.smh.com.au")
driver.find_elements_by_class_name("img_ad")
As well as waits based on element location:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

timeout = 10
try:
    element_present = EC.presence_of_element_located((By.CLASS_NAME, 'img_ad'))
    WebDriverWait(driver, timeout).until(element_present)
except TimeoutException:
    print("Timed out waiting for page to load")
However, this element is not appearing, despite me seeing it clearly in inspect mode in Firefox:
<img src="https://tpc.googlesyndication.com/simgad/9181016285467049325" alt="" class="img_ad" width="970" height="250" border="0">
This is an advertisement on the page so I think there might be some funky code sitting on top of it which doesn't show in the driver, any advice on how to collect this?
The advert is in an iframe, so you need to switch to that frame first.
But I found that after several page loads the adverts stopped appearing on the web page. I did find that the adverts loaded nearly every time using driver = webdriver.Opera(), but not in Chrome or Firefox, even using private browsing and clearing all browsing data.
If they appeared, then this code worked.
To find the element by a partial class name I at first used find_element_by_css_selector("amp-img[class^='img_ad']"). Sometimes the element with the img_ad class is not present, so you can use driver.find_element_by_id("aw0"), which finds the data more often. Sometimes the web page's HTML does not even have this id, so my code prints the HTML.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Firefox()
driver.get("https://www.smh.com.au")
driver.implicitly_wait(10)
iFrame = driver.find_elements_by_tag_name("iframe")[1]
driver.switch_to.frame(iFrame)
try:
    # element = driver.find_element_by_css_selector("amp-img[class^='img_ad']")
    # print(element.get_attribute('outerHTML'))
    element = driver.find_element_by_id("aw0")
    print(element.get_attribute('innerHTML'))
except NoSuchElementException:
    print("Advert not found")
    print(driver.page_source)
driver.quit()
Outputs:
<amp-img alt="" class="img_ad i-amphtml-layout-fixed i-amphtml-layout-size-defined i-amphtml-element i-amphtml-layout" height="250" i-amphtml-layout="fixed" i-amphtml-ssr="" src="https://tpc.googlesyndication.com/simgad/16664324514375864185" style="width:970px;height:250px;" width="970"><img alt="" class="i-amphtml-fill-content i-amphtml-replaced-content" decoding="async" src="https://tpc.googlesyndication.com/simgad/16664324514375864185"></amp-img>
or:
<img src="https://tpc.googlesyndication.com/simgad/10498242030813793376" border="0" width="970" height="250" alt="" class="img_ad">
or:
<html><head></head><body></body></html>
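The try/except fallback in the code above can be generalized into a small helper that tries locators in order and returns the first hit. This is a sketch of the idea rather than a Selenium API; with Selenium you would pass NoSuchElementException as the exception type to swallow:

```python
def find_first(locators, not_found_exceptions=(Exception,)):
    """Try each zero-argument locator in turn; return the first result
    produced without raising one of not_found_exceptions. Returns
    None if every locator fails."""
    for locate in locators:
        try:
            return locate()
        except not_found_exceptions:
            continue
    return None

# Hypothetical Selenium usage, matching the snippet above:
#   element = find_first(
#       [lambda: driver.find_element_by_css_selector("amp-img[class^='img_ad']"),
#        lambda: driver.find_element_by_id("aw0")],
#       not_found_exceptions=(NoSuchElementException,))
```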

Python Selenium: Getting dynamic content within iframe

I am trying to scrape the available apartment listings from the following webpage: https://3160599v2.onlineleasing.realpage.com/
I am using the Python implementation of Selenium, but so far I haven't found an effective solution to programmatically get the content. My most basic code is the following, which currently just returns the non-dynamic HTML source code:
from selenium import webdriver
driver = webdriver.Chrome('/path_to_driver')
driver.get('https://3160599v2.onlineleasing.realpage.com/')
html = driver.page_source
The returned html variable does not contain the apartment listings I need.
If I 'Inspect' the element using Chrome's built-in inspect tool, I can see that the content is within an un-classed iframe: <iframe frameborder="0" realpage-oll-widget="RealPage-OLL-Widget" style="width: 940px; border: none; overflow: hidden; height: 2251px;"></iframe>
Several levels down within this iframe you can also see the div <div class="main-content"> which contains all the info I need.
Other solutions I have tried include implementing an explicit WebDriverWait:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CLASS_NAME, 'main-content')))
I get a TimeoutException with this method as the element is never found.
I also tried using the driver.switch_to.frame() method, with no success.
The only steps that have actually allowed me to get the apartment listings out of the webpage have been (using Chrome):
Manually right-click on an element of the listings within the webpage
Click Inspect
Find the div 'main-content'
Manually right-click on this div and select Copy -> Copy Element
This is not an effective solution since I'm seeking to automate this process.
How can I get this dynamically generated content out of the webpage in a programmatic way?
Try using the below code to switch to the iframe:
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait
wait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it(driver.find_element_by_xpath('//iframe[@realpage-oll-widget="RealPage-OLL-Widget"]')))
Also note that the method for switching to an iframe is switch_to.frame(), not switch-to.frame().
You cannot directly see the content that is inside the iframe. You need to change frames. You can do this by first selecting the iframe element and then switching to it with the driver.switch_to.frame() function. Since this particular iframe has no id, you can select it by tag name:
iframe = driver.find_element_by_tag_name('iframe')
driver.switch_to.frame(iframe)
After that you can access the iframe's content.
Alternatively, you can take the src attribute of the iframe and navigate to that page with Selenium. In the end, the iframe content is just another HTML page.
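If the iframe exposes a src attribute (the widget iframe in this question may be injected without one, so this is not guaranteed to apply here), you can pull the src out of page_source with the standard library and driver.get() it directly. A sketch:

```python
from html.parser import HTMLParser

class IframeSrcFinder(HTMLParser):
    """Collect the src attribute of every <iframe> in an HTML document."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)

def iframe_srcs(html):
    finder = IframeSrcFinder()
    finder.feed(html)
    return finder.srcs

# With Selenium, one could then do:
#   srcs = iframe_srcs(driver.page_source)
#   driver.get(srcs[0])
```

Whether the framed page renders correctly when loaded on its own depends on the site; some widgets refuse to load outside their host page.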

Scraper unable to extract titles from a website

I've written a script in Python, in combination with Selenium, to extract the titles of the different news items displayed in the left sidebar of the finance.yahoo website. I've used a CSS selector to get the content. However, the script neither gives any result nor throws an error, and I can't figure out the mistake I'm making. I hope somebody will take a look at it. Thanks in advance.
Here is my script:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://finance.yahoo.com/")
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "u.StretchedBox")))
for item in driver.find_elements_by_css_selector("u.StretchedBox span"):
    print(item.text)
driver.quit()
The elements containing the titles are:
<h3 class="M(0)" data-reactid="128"><a rel="nofollow noopener noreferrer" class="Fw(b) Fz(20px) Lh(23px) LineClamp(2,46px) Fz(17px)--sm1024 Lh(19px)--sm1024 LineClamp(2,38px)--sm1024 Td(n) C(#0078ff):h C(#000)" target="_blank" href="https://beap.gemini.yahoo.com/mbclk?bv=1.0.0&es=bVwDtPMGIS8NDKqncZWZBjLsQQHm58Z9cLJuMqC6LadDlYfVCoy.d3GqO599EPAiYnsxB0SB8aRURPve9Q8mOEjH.NrcVcVDhldut.C_9Vn16XER1q1G07a48FMQ_.sv9GCyVx7zcj1kBtWPysaYzQqboJWgUo5DRRHbAnejwVtYRPHJTEptil92tx_ccJZ9FnxE8L3tfDuS0Q3l5ftVhamTOon_nzuvtvqqBwD7X0T.7Z3wZBgtH93gM1xImZ0hdFUzsuQPDAjZWs1KdH0YsXIf3uLrmcJFoI9leh8KRljnIPC.RdhOF6OYcJfHtDks85nSIgfOsMyUr1wEhMA2Qa2htpEg5w.P4UIXeoldjzJ_NsUrtXqEFIJNKoaeq_FNiQ9wcI16utKO87167zkfSPzVY09d3pVLZg20V7tqTThOkG_IakPnmlOriJKnufsBWj1wp.6Q4PasAt2g4Y1yw9U71FIfG2dDwpryRKDWrUBfTvjwwItlSyXyvWvIYUyXXxR74qWcIEC3KAvVN7.iqSckV_EssVM8ytp5HiN4iTACpEmc96rpdNEqHYpRotwze8NF5cDubsZbW58Hauq_aO.DbhZJ7TbBDx5vZK_M%26lp=https%3A%2F%2Fin.search.yahoo.com%2Fsearch%3Fp%3Dcheap%2Bairfare%2Bdomestic%26fr%3Dstrm-tts-thg%26.tsrc%3Dstrm-tts-thg%26type%3Dcheapairfaredomestic-in" data-reactid="129">
<u class="StretchedBox" data-reactid="130"></u>
<span data-reactid="131">The Cheapest Domestic Airfare Rates</span></a></h3>
You got neither an error nor results because:
the find_elements_...() methods return a list; if your selector matches no elements you won't get an error, just an empty list, and iterating over an empty list raises no error either
your CSS selector expects a span that is a descendant of a u with class="StretchedBox", but the required span is actually not a descendant of the u; it is a sibling
Try to use below code:
for item in driver.find_elements_by_css_selector("u.StretchedBox+span"):
    print(item.text)
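The descendant-versus-sibling distinction can be checked outside the browser; here is a sketch using the standard library on a trimmed, well-formed version of the question's HTML:

```python
import xml.etree.ElementTree as ET

snippet = """
<h3>
  <a href="#">
    <u class="StretchedBox"></u>
    <span>The Cheapest Domestic Airfare Rates</span>
  </a>
</h3>
"""

root = ET.fromstring(snippet)
u = root.find(".//u[@class='StretchedBox']")
# A descendant lookup inside <u> finds nothing: the span is not inside it.
assert u.find("span") is None
# The span is the next sibling of <u> within the same <a> parent,
# which is what the CSS adjacent-sibling combinator '+' selects.
a = root.find(".//a")
children = list(a)
span = children[children.index(u) + 1]
print(span.text)  # → The Cheapest Domestic Airfare Rates
```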
