Page source not showing advertisements for Selenium / Python

This should be a very straightforward element find, but it just isn't happening. I've added a very long implicit wait to allow the page to load completely:
from selenium import webdriver
driver = webdriver.Firefox()
driver.implicitly_wait(30)
driver.get("https://www.smh.com.au")
driver.find_elements_by_class_name("img_ad")
I also tried an explicit wait based on the element's location:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
timeout = 10
try:
    element_present = EC.presence_of_element_located((By.CLASS_NAME, 'img_ad'))
    WebDriverWait(driver, timeout).until(element_present)
except TimeoutException:
    print("Timed out waiting for page to load")
However, this element is never found, despite my seeing it clearly in Firefox's inspector:
<img src="https://tpc.googlesyndication.com/simgad/9181016285467049325" alt="" class="img_ad" width="970" height="250" border="0">
This is an advertisement on the page, so I suspect there may be some script sitting on top of it that doesn't show up in the driver. Any advice on how to collect this?

The advert is in an iframe, so you need to switch to that frame first.
But I found that after several page loads the adverts stopped appearing on the page. They loaded nearly every time using driver = webdriver.Opera(), but not in Chrome or Firefox, even using private browsing and clearing all browsing data.
When they did appear, the following code worked.
To find the element by a partial class name I first used find_element_by_css_selector("amp-img[class^='img_ad']"). Sometimes the element with the img_ad class is not present, so you can use driver.find_element_by_id("aw0"), which finds the data more often. Sometimes the page HTML does not even have this id, so my code falls back to printing the HTML.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import NoSuchElementException
driver = webdriver.Firefox()
driver.get("https://www.smh.com.au")
driver.implicitly_wait(10)
iFrame = driver.find_elements_by_tag_name("iframe")[1]
driver.switch_to.frame(iFrame)
try:
    # element = driver.find_element_by_css_selector("amp-img[class^='img_ad']")
    # print(element.get_attribute('outerHTML'))
    element = driver.find_element_by_id("aw0")
    print(element.get_attribute('innerHTML'))
except NoSuchElementException:
    print("Advert not found")
    print(driver.page_source)
driver.quit()
Outputs:
<amp-img alt="" class="img_ad i-amphtml-layout-fixed i-amphtml-layout-size-defined i-amphtml-element i-amphtml-layout" height="250" i-amphtml-layout="fixed" i-amphtml-ssr="" src="https://tpc.googlesyndication.com/simgad/16664324514375864185" style="width:970px;height:250px;" width="970"><img alt="" class="i-amphtml-fill-content i-amphtml-replaced-content" decoding="async" src="https://tpc.googlesyndication.com/simgad/16664324514375864185"></amp-img>
or:
<img src="https://tpc.googlesyndication.com/simgad/10498242030813793376" border="0" width="970" height="250" alt="" class="img_ad">
or:
<html><head></head><body></body></html>

Selenium doesn't find element loaded from Ajax

I have been trying to get access to the 4 images on this page: https://altkirch-alsace.fr/serviable/demarches-en-ligne/prendre-un-rdv-cni/
However, the grey region seems to be Ajax-loaded (according to its class name). I want to get the element <div id="prestations"> inside it, but I can't access it, nor any other element within the grey area.
I have tried to follow several answers to similar questions, but no matter how long I wait, I get an error that the element is not found. The element is there when I click "Inspect element", but I don't see it when I click "View source". Does that mean I can't access it through Selenium?
Here is my code:
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
driver.get("https://altkirch-alsace.fr/serviable/demarches-en-ligne/prendre-un-rdv-cni/")
element = WebDriverWait(driver, 10) \
    .until(lambda x: x.find_element(By.ID, "prestations"))
print(element)
You're not using WebDriverWait(...).until properly. Your lambda calls find_element, which throws an exception when it runs and the element is not found.
You should use it like this instead:
from selenium.webdriver.support import expected_conditions as EC
...
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "prestations"))
)
I had the same problem. There is no issue with waiting. There is a frame on that webpage:
<iframe src="https://www.rdv360.com/mairie-d-altkirch?ajx_md=1" width="100%"
height="600" style="border:0px;"></iframe>
The Document Object Model (DOM) of the main website does not contain any information about the page that is loaded into a frame.
Even waiting for hours, you will not find any elements inside this frame, as it has its own DOM.
You need to switch the WebDriver context to this frame. Then you may access the frame's DOM.
As the iframe on your website has no id, you may search for all frames as described in the Selenium documentation (https://www.selenium.dev/documentation/webdriver/browser/frames/).
The code below finds all HTML tags of type "iframe" and takes the second one ([1]; for a different page you may need [0], the first):
iframe = driver.find_elements(By.TAG_NAME,'iframe')[1]
driver.switch_to.frame(iframe)
Now you may find your desired elements.
The solution that worked in my case on a webpage like this:
<html>
...
<iframe ... id="myframe1">
...
</iframe>
<iframe ... id="myframe2">
...
</iframe>
...
</html>
my_iframe = my_WebDriver.find_element(By.ID, 'myframe1')
my_WebDriver.switch_to.frame(my_iframe)
also working:
my_WebDriver.switch_to.frame('myframe1')
According to the Selenium docs, you may use the iframe's name, its id, or the element itself to switch to that frame.
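The three accepted argument forms can be sketched against a stub driver (a stand-in object from unittest.mock, since no real browser is needed to show the calls; the frame names are illustrative):

```python
from unittest.mock import MagicMock

driver = MagicMock()  # stand-in for a real WebDriver

frame_element = driver.find_element("id", "myframe1")

# All three forms go through switch_to.frame():
driver.switch_to.frame("frame_name")   # by name or id
driver.switch_to.frame(0)              # by zero-based index
driver.switch_to.frame(frame_element)  # by a located element

print(driver.switch_to.frame.call_count)  # → 3
```

With a real driver, an invalid name, index, or element raises NoSuchFrameException instead of being recorded silently.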

WebDriverWait works but page_source still returns half-rendered HTML

I have read Wait Until Page is Loaded, How to use Selenium Wait, Explicit Wait and other documentation on waiting for a page to load before scraping. The wait passes successfully, but I still get the same half-rendered, incomplete HTML.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
# prepare the option for the chrome driver
options = webdriver.ChromeOptions()
options.add_argument('headless')
# start chrome browser
browser = webdriver.Chrome(options=options,executable_path='C:/chromedriver_win32/chromedriver.exe')
browser.get('https://swappa.com/listing/view/LTNZ94446')
try:
    WebDriverWait(browser, 30).until(EC.presence_of_element_located((By.ID, "wrap")))
    print(browser.page_source)
except TimeoutException:
    print("not found")
With this, my output starts somewhere half-way through the document rather than from <html> at the top.
<div class="col-xs-6 col-sm-2 col-md-2">
<div class="img-container" style="margin-bottom: 15px;">
<a href="https://static.swappa.com/media/listing/LTNZ94446/mhhHypyw.jpg" class="lightbox">
<img class="img-responsive" src="https://static.swappa.com/images/cache/7b/67/7b679a1d89816bc341a802f19f661eac.jpg" alt="Listing Image" style="margin:0px 0px 0px 0px; ">
</a>
</div>
</div>
I am not sure where it is going wrong.
It is clearly able to see the presence of the element ID (<div id="wrap">), since it doesn't throw a timeout error.
I tried using visibility of the element, still no luck.
Tried using readyState as well, but no luck.
If there are ways to do this using other libraries such as BeautifulSoup/urllib/urllib2/Scrapy, those would be helpful as well.
You can check whether the page has fully loaded using JavaScript:
options = webdriver.ChromeOptions()
options.add_argument('headless')
# start chrome browser
browser = webdriver.Chrome(options=options)
browser.get('https://swappa.com/listing/view/LTNZ94446')
WebDriverWait(browser, 30).until(lambda d: d.execute_script(
    'return ["complete", "interactive"].indexOf(document.readyState) != -1'))
# or wait for "complete" only
# WebDriverWait(browser, 30).until(lambda d: d.execute_script('return document.readyState == "complete"'))
print(browser.page_source)
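For context, WebDriverWait is essentially a poll loop: it re-evaluates the condition until it returns a truthy value or the timeout expires. A minimal stdlib sketch of the same idea (the names here are illustrative, not Selenium's):

```python
import time

def wait_until(condition, timeout=30, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` elapses."""
    end = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() > end:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)

# Example: the condition becomes true shortly after we start waiting
start = time.monotonic()
ready = wait_until(lambda: time.monotonic() - start > 0.1, timeout=5, poll=0.01)
print(ready)  # True
```

This is why a passing wait does not guarantee a complete page: the loop stops as soon as its one condition is truthy, even if other parts of the page are still rendering.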
You can use the Python requests module.
Code:
import requests

response = requests.get("https://swappa.com/listing/view/LTNZ94446")
if response.status_code == 200:
    print(response.text)
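Note that requests returns the server's raw HTML without running any JavaScript, so it only helps when the data is present in the initial response. Once you have the text, parsing it needs no browser; a sketch with the stdlib parser on a stand-in snippet (the HTML string below is illustrative, not the real listing page):

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Stand-in for response.text
html = "<html><head><title>Swappa listing</title></head><body></body></html>"
parser = TitleParser()
parser.feed(html)
print(parser.title)  # Swappa listing
```

In practice you would feed `response.text` to the parser (or use BeautifulSoup for anything non-trivial).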

Switching into second iframe in Selenium Python3 [duplicate]

This question already has answers here:
Unable to locate the child iframe element which is within the parent iframe through Selenium in Java
(2 answers)
Multiple iframe tags Selenium webdriver
(2 answers)
Closed 3 years ago.
I am trying to switch into the second iframe of the website for a personal auto-filler for my business.
Just in case I get marked as a dup: I already tried Python Selenium switch into an iframe within an iframe, and sadly got not much out of it.
Here is the HTML for the two iframes. The second iframe is within the first one:
<div id="tab2_1" class="cms_tab_content1" style="display: block;">
<iframe id="the_iframe"
        src="http://www.scourt.go.kr/portal/information/events/search/search.jsp">
</iframe> <!-- allowfullscreen --></div>
<div id="contants">
<iframe frameborder="0" width="100%" height="1100" marginheight="0"
        marginwidth="0" scrolling="auto" title="나의 사건검색"
        src="http://safind.scourt.go.kr/sf/mysafind.jsp?sch_sa_gbn=&sch_bub_nm=&sa_year=&sa_gubun=&sa_serial=&x=&y=&saveCookie="></iframe>
<noframes title="나의사건검색(새창)">
<a href="http://safind.scourt.go.kr/sf/mysafind.jsp?sch_sa_gbn=&sch_bub_nm=&sa_year=&sa_gubun=&sa_serial=&x=&y=&saveCookie="
   target="_blank"
   title="나의사건검색(새창)">프레임이 보이지 않을경우 이곳을 클릭해주세요</a></noframes></div>
Just for a reference- so far, I tried these:
#METHOD-1
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.ID, "the_iframe")))
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "#contants > iframe")))
#METHOD-2
driver.switch_to.frame(driver.find_element_by_xpath('//*[@id="the_iframe"]'))
driver.WAIT
driver.switch_to.frame(driver.find_element_by_xpath('//*[@id="contants"]/iframe'))
driver.WAIT
#METHOD-3
iframe = driver.find_elements_by_tag_name('iframe')[0]
driver.switch_to.default_content()
driver.switch_to.frame(iframe)
driver.find_elements_by_tag_name('iframe')[0]
Here is the entire code I have right now:
import time
import requests
from bs4 import BeautifulSoup as SOUP
import lxml
import re
import pandas as pd
import numpy as np
from selenium import webdriver
from selenium.webdriver.support import ui
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.support.ui import Select
import psycopg2
#ID and Password for autologin
usernameStr= 'personalinfo'
passwordStr= 'personalinfo'
driver = webdriver.Chrome('./chromedriver')
ultrawait = WebDriverWait(driver, 9999)
ConsumerName=""
COURTNO=""
CASENO=""
CaseYear=""
CaseBun=""
CaseSerial=""
#AutoLogin
driver.get("http://PERSONALINFO")
username = driver.find_element_by_name('userID')
username.send_keys(usernameStr)
userpassword = driver.find_element_by_name('userPassword')
userpassword.send_keys(passwordStr)
login_button=driver.find_elements_by_xpath("/html/body/div/div[2]/div/form/input")[0]
login_button.click()
#Triggered when str of URL includes words in look_for
def condition(driver):
    look_for = ("SangDamPom", "jinHaengNo")
    url = driver.current_url
    WAIT = WebDriverWait(driver, 2)
    for s in look_for:
        if url.find(s) != -1:
            url = driver.current_url
            html = requests.get(url)
            soup = SOUP(html.text, 'html.parser')
            soup = str(soup)
            #Some info crawled
            CN_first_index = soup.find('type="text" value="')
            CN_last_index = soup.find('"/></td>\n<t')
            ConsumerName = soup[CN_first_index+19:CN_last_index]
            ConsumerName.replace(" ", "")
            #Some info crawled
            CTN_first_index = soup.find('background-color:#f8f8f8;')
            CTN_last_index = soup.find('</td>\n<td height="24"')
            COURTNO = soup[CTN_first_index+30:CTN_last_index]
            COURTNO = COURTNO.replace('\t', '')
            #Some info crawled
            CAN_first_index = soup.find('가능하게 할것(현제는 적용않됨)">')
            CAN_last_index = soup.find('</a></td>\n<td height="24"')
            CASENO = soup[CAN_first_index+19:CAN_last_index]
            CaseYear = CASENO[:4]
            CaseBun = CASENO[4:-5]
            CaseSerial = CASENO[-5:]
            print(ConsumerName, COURTNO, CaseYear, CaseBun, CaseSerial)
            #I need to press this button for the iframe I need to appear.
            frame_button = driver.find_elements_by_xpath("//*[@id='aside']/fieldset/ul/li[2]")[0]
            frame_button.click()
            time.sleep(1)
            #Switch iframe
            driver.switch_to.frame(driver.find_element_by_xpath('//*[@id="the_iframe"]'))
            driver.wait
            driver.switch_to.frame(driver.find_element_by_xpath("//iframe[contains(@src,'mysafind')]"))
            time.sleep(1)
            #Insert instit. name
            CTNselect = Select(driver.find_element_by_css_selector('#sch_bub_nm'))
            CTNselect.select_by_value(COURTNO)
            #Insert year
            CYselect = Select(driver.find_element_by_css_selector('#sel_sa_year'))
            CYselect.select_by_value(CaseYear)
            #Insert number
            CBselect = Select(driver.find_element_by_css_selector('#sa_gubun'))
            CBselect.select_by_visible_text(CaseBun)
            #Insert case number (numeric part)
            CS_Insert = WAIT.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#sa_serial")))
            CS_Insert.click()
            CS_Insert.clear()
            CS_Insert.send_keys(CaseSerial)
            #Insert name
            CN_Insert = WAIT.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "#ds_nm")))
            CN_Insert.click()
            CN_Insert.clear()
            CN_Insert.send_keys(ConsumerName)
            break
ultrawait.until(condition)
Please excuse any indentation errors; they are an artifact of copy-paste.
I think it's the #Switch iframe part I have an issue with.
The inputs that come after #Switch iframe are all functional; I've tested them by opening the iframe in another tab.
You need to deal with your frames separately.
When you finish working with one of them and want to switch to another, you need to do:
driver.switch_to.default_content()
then switch to another one.
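Since the frames are siblings of the top-level document, the switch order matters. A stub-driver sketch of the sequence (using unittest.mock so no real browser is needed; the frame names are illustrative):

```python
from unittest.mock import MagicMock, call

driver = MagicMock()  # stand-in for a real WebDriver

# Work in the first frame...
driver.switch_to.frame("the_iframe")
# ...then return to the top-level document before entering the sibling:
driver.switch_to.default_content()
driver.switch_to.frame("second_iframe")

# The recorded call order matches what a real driver would require
driver.switch_to.assert_has_calls([
    call.frame("the_iframe"),
    call.default_content(),
    call.frame("second_iframe"),
])
```

Skipping the default_content() step makes the second frame() call fail on a real page, because the driver would look for the sibling iframe inside the first frame's DOM.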
It is better to use an explicit wait when switching to the frame:
from selenium.webdriver.support import ui
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
ui.WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "#contants>iframe")))
where you can use any locator strategy from By.
As per your HTML structure, the two iframes are separate and the second one is not inside the first. It would have been inside the first one if the HTML structure were like:
<div>
<iframe1>
<iframe2>
</iframe2>
</iframe1>
</div>
But as your first <div> and first <iframe> end before the second div and iframe begin, the two iframes are separate.
So, according to your requirements, you just need to switch to the second iframe, which can be done using:
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "second iframe css")))
Updated Answer:
Try the code:
driver.switch_to.frame(driver.find_element_by_xpath('//*[@id="the_iframe"]'))
driver.WAIT
driver.switch_to.frame(driver.find_element_by_xpath("//iframe[contains(@src,'mysafind')]"))
driver.WAIT

Python Selenium web driver with chrome driver gets detected

I assumed that the Chrome browsing session opened by Selenium would be the same as the local Google Chrome installation. But when I try to search on this website, even just opening it with Selenium and controlling the search manually, I get an error message, whereas the search returns fine when I use regular Chrome with my own profile or in an incognito window.
Whenever I search on this issue, I find results stating that mouse movements or clicking patterns give it away. But that is not the case here, as I tried controlling it manually after opening the browser. Something in the HTML request gives it away. Is there any way to overcome that?
The website in question is: https://www.avnet.com/wps/portal/us
(Screenshot: the error message shown in the automated session.)
As per the website in question, https://www.avnet.com/wps/portal/us, I am not sure about the exact issue you are facing; your code block would have given us some more leads on what is going wrong. However, I am able to access the mentioned URL just fine:
Code Block :
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get('https://www.avnet.com/wps/portal/us')
print("Page Title is : %s" %driver.title)
Console Output :
Page Title is : Avnet: Quality Electronic Components & Services
Update
I had a relook at the issue you are facing. I read through the entire HTML DOM and found no traces of bot-detection mechanisms. Had any bot detection been implemented, the website wouldn't even have allowed you to traverse/scrape the DOM tree to find the Search Box.
On debugging the issue further, these are my observations:
Through your automated script you can proceed till sending the search text to the Search Box successfully.
When you search manually for a valid product, the auto-suggestions are displayed through a <span> tag, as in the HTML below, and you can click on any of the auto-suggestions to browse to the specific product.
SPAN tag HTML :
<span id="auto-suggest-parts-dspl">
<p class="heading">Recommended Parts</p>
<dl class="suggestion">
<div id="list_1" onmouseover="hoverColor(this)" onmouseout="hoverColorOut(this)" class="">
<div class="autosuggestBox">
AM8TW-4805DZ
<p class="desc1">Aimtec</p>
<p class="desc2">Module DC-DC 2-OUT 5V/-5V 0.8A/-0.8A 8W 9-Pin DIP Tube</p>
</div>
</div>
This <span> is simply not getting triggered/generated when we use the WebDriver.
In the absence of the auto-suggestions, forcing a search results in the error page.
Conclusion
The main issue seems to be either with the form-control class or with the function scrollDown(event,this) associated with the onkeydown event.
#TooLongForComment
To reproduce this issue
from random import randint
from time import sleep
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_argument('--disable-infobars')
options.add_argument('--disable-extensions')
options.add_argument('--profile-directory=Default')
options.add_argument('--incognito')
options.add_argument('--disable-plugins-discovery')
options.add_argument('--start-maximized')
browser = webdriver.Chrome('./chromedriver', chrome_options=options)
browser.get('https://www.avnet.com/wps/portal/us')
try:
    search_box_id = 'searchInput'
    myElem = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID, search_box_id)))
    elem = browser.find_element_by_id(search_box_id)
    sleep(randint(1, 5))
    s = 'CA51-SM'
    for c in s:  # randomize key pressing
        elem.send_keys(c)
        sleep(randint(1, 3))
    elem.send_keys(Keys.RETURN)
except TimeoutException as e:
    pass
finally:
    browser.close()
I've used hexedit to edit the chromedriver binary, changing the $cdc_ key to fff..
Investigate how the detection is done by reading every JavaScript block; look at this answer for a detection example.
Try adding an extension to modify headers and mask as Googlebot by changing the user-agent and referrer: options.add_extension('/path-to-modify-header-extension')
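The header idea can be shown without a browser: for plain-HTTP scraping you set the User-Agent on the request yourself. A stdlib sketch (the Googlebot string is purely illustrative; note that serious sites also verify real Googlebot traffic via reverse DNS, so this is no guarantee):

```python
import urllib.request

# Build a request with a spoofed User-Agent and Referer (nothing is sent yet)
req = urllib.request.Request(
    "https://www.avnet.com/wps/portal/us",
    headers={
        "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Referer": "https://www.google.com/",
    },
)

print(req.get_header("User-agent"))  # the spoofed UA string
```

In Selenium the closest equivalent is passing the flag directly: options.add_argument('--user-agent=...').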

Web scraping with Selenium

I'm trying to scrape this website with Selenium for the list of company names, codes, industries, sectors, market caps, etc. in the table. I'm new to it and have written the code below:
from selenium import webdriver
import time

path_to_chromedriver = r'C:\Documents\chromedriver'
browser = webdriver.Chrome(executable_path=path_to_chromedriver)
url = r'http://sgx.com/wps/portal/sgxweb/home/company_disclosure/stockfacts'
browser.get(url)
time.sleep(15)
output = browser.page_source
print(output)
However, I'm able to get the tags below, but not the data in them:
<div class="table-wrapper results-display">
<table>
<thead>
<tr></tr>
</thead>
<tbody></tbody>
</table>
</div>
<div class="pager results-display"></div>
I have previously tried BS4 to scrape it as well, but failed. Any help is much appreciated.
The results are in an iframe - switch to it and then get the .page_source:
iframe = driver.find_element_by_css_selector("#mainContent iframe")
driver.switch_to.frame(iframe)
I would also add a wait for the table to be loaded:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(driver, 10)
# locate and switch to the iframe
iframe = driver.find_element_by_css_selector("#mainContent iframe")
driver.switch_to.frame(iframe)
# wait for the table to load
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.companyName')))
print(driver.page_source)
This is totally doable. What might be easiest is to use a find_elements call (note that it's plural) to grab all of the <tr> elements. It returns a list you can walk, calling find_element (singular) on each item, this time locating each field by class.
You may also be running into a timing issue. I noticed that the data you are looking for loads VERY slowly, so you probably need to wait for it. The best approach is to check for its existence until it appears, then read it. find_elements calls (again, note the plural) do not throw an exception when no elements are found; they just return an empty list, which makes them a decent way to poll for the data to appear.
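Once page_source contains the table (after switching into the iframe and waiting), splitting each <tr> into its cells is plain parsing. A stdlib sketch on a stand-in snippet (the HTML string and column names below are illustrative, not SGX's actual markup):

```python
from html.parser import HTMLParser

class RowParser(HTMLParser):
    """Collects the cell texts of each <tr> into a list of rows."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

# Stand-in for the table HTML once the iframe content has loaded
html = ("<table><tr><th>Name</th><th>Code</th></tr>"
        "<tr><td>Acme Corp</td><td>A01</td></tr></table>")
p = RowParser()
p.feed(html)
print(p.rows)  # [['Name', 'Code'], ['Acme Corp', 'A01']]
```

The same row-by-row structure applies if you stay in Selenium: loop over the <tr> elements from find_elements and call find_element on each one for the individual fields.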
