How to check if a web element is visible - python

I am using Python with BeautifulSoup4 and I need to retrieve visible links on the page. Given this code:
soup = BeautifulSoup(html)
links = soup('a')
I would like to create a method is_visible that checks whether or not a link is displayed on the page.
Solution Using Selenium
Since I am also working with Selenium, I know that the following solution exists:
from selenium.webdriver import Firefox
firefox = Firefox()
firefox.get('https://google.com')
links = firefox.find_elements_by_tag_name('a')
for link in links:
    if link.is_displayed():
        print('{} => Visible'.format(link.text))
    else:
        print('{} => Hidden'.format(link.text))
firefox.quit()
Performance Issue
Unfortunately, the is_displayed method and reading the text attribute each perform an HTTP request to retrieve that information. Things can therefore get really slow when there are many links on a page, or when you have to do this multiple times.
On the other hand, BeautifulSoup can perform these parsing operations almost instantly once you have the page source, but I can't figure out how to do this.

AFAIK, BeautifulSoup will only help you parse the actual markup of the HTML document anyway. If that's all you need, then you can do it in a manner like so (yes, I already know it's not perfect):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc)

def is_visible_1(link):
    # do whatever in this function you can to determine your markup is correct
    try:
        style = link.get('style')
        if style and 'display' in style and 'none' in style:  # or use a regular expression
            return False
    except Exception:
        return False
    return True

def is_visible_2(**kwargs):
    try:
        soup = kwargs.get('soup', None)
        del kwargs['soup']
        # Exception thrown if element can't be found using kwargs
        link = soup.find_all(**kwargs)[0]
        style = link.get('style')
        if style and 'display' in style and 'none' in style:  # or use a regular expression
            return False
    except Exception:
        return False
    return True

# checks links that already exist, not *if* they exist
for link in soup.find_all('a'):
    print(str(is_visible_1(link)))

# checks if an element exists
print(str(is_visible_2(soup=soup, id='someID')))
BeautifulSoup doesn't take into account the other factors that determine whether an element is visible, such as external CSS, scripts, and dynamic DOM changes. Selenium, on the other hand, tells you whether an element is actually being rendered, and it generally does so through the accessibility APIs of the given browser. You must decide whether sacrificing accuracy for speed is worth it. Good luck! :-)
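If you stay with the markup-only approach, the regular expression hinted at in the comments above is a bit more robust than plain substring checks. A minimal sketch, assuming html_doc holds the page source and that only inline styles and the HTML5 hidden attribute matter (external CSS is still ignored):
import re
from bs4 import BeautifulSoup

# Matches "display:none" and "visibility:hidden" with optional whitespace
HIDDEN_RE = re.compile(r'(display\s*:\s*none|visibility\s*:\s*hidden)', re.I)

def is_visible_inline(tag):
    """Markup-only check: inline style and the 'hidden' attribute."""
    if tag.has_attr('hidden'):
        return False
    style = tag.get('style') or ''
    return not HIDDEN_RE.search(style)

soup = BeautifulSoup(html_doc, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'), is_visible_inline(link))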

Try find_elements_by_xpath and execute_script:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.google.com/?hl=en")
links = driver.find_elements_by_xpath('//a')
driver.execute_script('''
    var links = document.querySelectorAll('a');
    links.forEach(function(a) {
        a.addEventListener("click", function(event) {
            event.preventDefault();
        });
    });
''')
visible = []
hidden = []
for link in links:
    try:
        link.click()
        visible.append('{} => Visible'.format(link.text))
    except:
        hidden.append('{} => Hidden'.format(link.get_attribute('textContent')))
    #time.sleep(0.1)

print('\n'.join(visible))
print('===============================')
print('\n'.join(hidden))
print('===============================\nTotal links length: %s' % len(links))
driver.execute_script('alert("Finish")')
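If the goal is simply to avoid one HTTP round trip per link, another option (not from the answers above, just a sketch) is to let the browser compute visibility for every anchor in a single execute_script call and return everything at once. The getComputedStyle/offsetParent check is an assumption about what "visible" should mean; adjust it to your own definition:
results = driver.execute_script('''
    return Array.from(document.querySelectorAll('a')).map(function(a) {
        var style = window.getComputedStyle(a);
        var visible = style.display !== 'none' &&
                      style.visibility !== 'hidden' &&
                      a.offsetParent !== null;
        return [a.textContent.trim(), visible];
    });
''')
for text, visible in results:
    print('{} => {}'.format(text, 'Visible' if visible else 'Hidden'))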

Related

Python beautiful soup web scraper doesn't return tag contents

I am trying to scrape matches and their respective odds from a local bookie site, but for every site I try, my web scraper doesn't return anything; it just prints "Process finished with exit code 0".
Can someone help me crack open the containers and get at the contents?
I have tried all of the following sites for almost a month with no success. The problem seems to be with the exact div, class, or span element layout.
https://www.betlion.co.ug/
https://www.betpawa.ug/
https://www.premierbet.ug/
For example, I tried link 2 in the code shown below:
import requests
from bs4 import BeautifulSoup

url = "https://www.betpawa.ug/"
response = requests.get(url, timeout=5)
content = BeautifulSoup(response.content, "html.parser")
for match in content.findAll("div", attrs={"class": "events-container prematch", "id": "Bp-Event-591531"}):
    print(match.text.strip())
I expect the program to return a list of matches, odds, and all the other components of the container. However, the program runs and just prints "Process finished with exit code 0" and nothing else.
It looks like the base site gets loaded in two phases:
Load some HTML structure for the page,
Use JavaScript to fill in the contents
You can prove this to yourself by right-clicking on the page, choosing "View page source", and then searching for "events-container" (it is not there).
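You can also check this programmatically; a small sketch (the "events-container" marker and the URL are taken from the question):
import requests

raw = requests.get("https://www.betpawa.ug/", timeout=5).text
# False here means the markup is filled in later by JavaScript
print('"events-container" found in raw HTML:', "events-container" in raw)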
So you'll need something more powerful than requests + bs4. I have heard of folks using Selenium to do this, but I'm not familiar with it.
You could consider using urllib.request from the standard library instead of requests:
from urllib.request import Request, urlopen
- build your request:
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
- retrieve the document:
res = urlopen(req)
- parse it using bs4:
html = BeautifulSoup(res, 'html.parser')
As Chris Curvey described, the problem is that requests can't execute the JavaScript on the page. If you print your content variable, you can see that the page displays a message like: "JavaScript Required! To provide you with the best possible product, our website requires JavaScript to function..." With Selenium you control a full browser through a WebDriver (for example the ChromeDriver binary for the Google Chrome browser):
from bs4 import BeautifulSoup
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
# chrome_options.add_argument('headless')
driver = webdriver.Chrome(chrome_options=chrome_options)

url = "https://www.betpawa.ug/"
driver.get(url)
page = driver.page_source
content = BeautifulSoup(page, 'html.parser')
for match in content.findAll("div", attrs={"class": "events-container"}):
    print(match.text.strip())
Update:
The print(match.text.strip()) command at the end of the loop simply extracts the text of each match div that has the class attribute "events-container".
If you want to extract more specific content, you can access each match through the match variable.
You need to know:
which of the available pieces of information you want,
how to identify this information inside the match div's structure,
and which data type you need it in.
To make this easy, run the program, open Chrome's developer tools with F12, and in the top-left corner you will see the icon for "select an element ...".
If you click on the icon and then click on the desired element in the browser, you will see the corresponding source in the area below the icon.
Analyse it carefully to get the information you need, for example:
The title of the football match is the first h3 tag in the match div and is a string.
The odds shown are span tags with the class event-odds and are numbers (float/double).
Search for the function you need on Google or in the documentation of the package you use (BeautifulSoup4).
Let's do it quick and dirty by using the BeautifulSoup functions on the match variable, so that we don't search the elements of the full site:
# (1) let's try to find the h3 tag
title_tags = match.findAll("h3")        # use on the match variable
if len(title_tags) > 0:                 # at least one found?
    title = title_tags[0].getText()     # get the text of the first one
    print("Title: ", title)             # show it
else:
    print("no h3-tags found")
    exit()

# (2) let's try to get some odds as numbers in the order in which they are displayed
odds_tags = match.findAll("span", attrs={"class": "event-odds"})
if len(odds_tags) > 2:                  # at least three found?
    odds = []                           # create a list
    for tag in odds_tags:               # loop over the odds_tags we found
        odd = tag.getText()             # get the text
        print("Odd: ", odd)
        # good, but it is a string; you can't compare it with a number in
        # Python and expect a good result.
        # You have to clean it and convert it:
        clean_odd = odd.strip()         # remove empty spaces
        odd = float(clean_odd)          # convert it to float
        print("Odd as Number:", odd)
        odds.append(odd)                # keep it in the list
else:
    print("something went wrong with the odds")
    exit()

input("Press enter to try it on the next match!")

How can I check if an element exists on a page using Selenium XPath?

I'm writing a script to do some web scraping on my Firebase for a few select users. After accessing the events page for a user, I want to check first for the condition that no events have been logged by that user.
For this, I am using Selenium and Python. Using XPath seems to work fine for locating links and navigation in all other parts of the script, except for accessing elements in a table. At first, I thought I might have been using the wrong XPath expression, so I copied the path directly from Chrome's inspection window, but still no luck.
As an alternative, I have tried to copy the page source and pass it into Beautiful Soup, and then parse it there to check for the element. No luck there either.
Here's some of the code, and some of the HTML I'm trying to parse. Where am I going wrong?
# Using WebDriver - always triggers an exception
def check_if_user_has_any_data():
    try:
        time.sleep(10)
        element = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="event-table"]/div/div/div[2]/mobile-table/md-whiteframe/div[1]/ga-no-data-table/div')))
        print(type(element))
        if element == True:
            print("Found empty state by copying XPath expression directly. It is a bit risky, but it seems to have worked")
        else:
            print("didn't find empty state")
    except:
        print("could not find the empty state element", EC)

# Using Beautiful Soup
def check_if_user_has_any_data_2():
    time.sleep(10)
    html = driver.execute_script("return document.documentElement.outerHTML")
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.text[:500])
    print(len(soup.findAll('div', {"class": "table-row-no-data ng-scope"})))
HTML
<div class="table-row-no-data ng-scope" ng-if="::config" ng-class="{overlay: config.isBuilderOpen()}">
<div class="no-data-content layout-align-center-center layout-row" layout="row" layout-align="center center">
<!-- ... -->
</div>
The first version triggers the exception; it is expected to evaluate 'element' as True. In reality, the element is not found.
The second version prints the first 500 characters (correctly, as far as I can tell), but it returns '0'. It is expected to return '1' after inspecting the page source.
Use the following code:
elements = driver.find_elements_by_xpath("//*[@id='event-table']/div/div/div[2]/mobile-table/md-whiteframe/div[1]/ga-no-data-table/div")
size = len(elements)
if size > 0:
    pass  # Element is present. Do your action
else:
    pass  # Element is not present. Do alternative action
Note: find_elements does not raise an exception if nothing matches; it simply returns an empty list.
Here is the method that I generally use.
Imports
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
Method
def is_element_present(self, how, what):
    try:
        self.driver.find_element(by=how, value=what)
    except NoSuchElementException as e:
        return False
    return True
Some things load dynamically. It is better to use an explicit wait with a timeout and handle the timeout exception, as sketched below.
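A minimal sketch of that idea (is_element_present_within is a hypothetical helper; the XPath is shortened from the one in the question):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def is_element_present_within(driver, xpath, timeout=10):
    """Return True if the element shows up within `timeout` seconds."""
    try:
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.XPATH, xpath)))
        return True
    except TimeoutException:
        return False

print(is_element_present_within(driver, '//*[@id="event-table"]'))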
If you're using Python and Selenium, you can use this:
try:
    driver.find_element_by_xpath("<Full XPath expression>")  # Test if the element exists
    # <Other code>
except:
    pass  # <Run these if element doesn't exist>
I've solved it. The page had a bunch of different iframe elements, and I didn't know that one had to switch between frames in Selenium to access those elements.
There was nothing wrong with the initial code, or the suggested solutions which also worked fine when I tested them.
Here's the code I used to test it:
# Time for the page to load
time.sleep(20)

# Find all iframes
iframes = driver.find_elements_by_tag_name("iframe")

# From inspecting page source, it looks like the index for the relevant iframe is [0]
x = len(iframes)
print("Found ", x, " iFrames")  # Should return 5

driver.switch_to.frame(iframes[0])
print("switched to frame [0]")

if WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '//*[@class="no-data-title ng-binding"]'))):
    print("Found it in this frame!")
Check the length of the elements you retrieve with an if statement. For example, after driver.get('https://www.example.com'):
elements = driver.find_elements_by_xpath('<your XPath expression>')
if len(elements) > 0:
    pass  # Do something.

python selenium - get (ctrl-u) equivalent page_source

I need to get the ctrl-u equivalent of browser.page_source for comparative purposes.
Is this possible with browser.execute_script or another method?
I've tried various methods like browser.get('view-source:https://www.example.com') but haven't seen a solution.
It works fine for me; I guess the problem is with the quotes:
browser.get('https://www.example.com')
browser.page_source
You can also achieve the same using browser.execute_script()
browser.execute_script('return document.documentElement.outerHTML')
If I'm not wrong, you want to compare the original HTML (Ctrl+U) with the rendered HTML (browser.page_source). For that you can use requests:
import requests
originalHTML = requests.get('http://...').text
print(originalHTML)
Or you can open the view-source: URL in another tab:
url = 'https://..../'
browser.get(url)
renderedHTML = browser.page_source

# open a blank page because JS cannot open special URLs like `view-source:`
browser.execute_script("window.open('about:blank', '_blank')")
# switch to tab 2
browser.switch_to_window(browser.window_handles[1])
browser.get("view-source:" + url)
originalHTML = browser.find_element_by_css_selector('body').text
# switch back to tab 1
#browser.switch_to_window(browser.window_handles[0])
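For the actual comparison, a quick sketch using difflib from the standard library, assuming originalHTML and renderedHTML from the snippet above:
import difflib

diff = difflib.unified_diff(
    originalHTML.splitlines(),
    renderedHTML.splitlines(),
    fromfile='original (view-source / Ctrl+U)',
    tofile='rendered (page_source)',
    lineterm='')
print('\n'.join(list(diff)[:50]))  # show only the first 50 diff lines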

Scraping hidden jquery values using python

HERE is the site; I believe it's hosted on GitHub.
I am having trouble scraping the values in the input fields, specifically the private and public keys. I tried using Selenium and BeautifulSoup, but they give empty values, or rather None (the HTML doesn't contain the keys).
I checked the page source and it seems that the input value is empty (not contained in the HTML), but when you load the page, the value is visible and present in the input box.
Here is my code:
def openit(browser):
    browser.get('file:///Users/Aha/Desktop/Code/english/index.html')
    time.sleep(5)
    nav = browser.find_element_by_id("addr")
    print(nav.text)
    return browser.page_source

soupdata = openit(browser)
soup = BeautifulSoup(soupdata, 'html.parser')
val = soup.find('input', {'id': 'addr'}).get('value')
print(val)
You can retrieve that value via the execute_script method of the Selenium webdriver:
print(browser.execute_script("return $('#addr').val();"))
Output:
14ropRunS5iY9sx9d9mpCRNEsXj7RtTtuS
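If you'd rather not rely on jQuery being available on the page, Selenium's get_attribute should give the same result, since it reads the current value property of the input rather than the static HTML attribute; a minimal sketch:
elem = browser.find_element_by_id("addr")
print(elem.get_attribute("value"))  # reflects the value filled in by the page's JavaScript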

Fetch all href links using Selenium in Python

I am practicing Selenium in Python and I wanted to fetch all the links on a web page using Selenium.
For example, I want all the links in the href= property of all the <a> tags on http://psychoticelites.com/
I've written a script and it works, but it gives me the object address. I've tried using the id tag to get the value, but it doesn't work.
My current script:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://psychoticelites.com/")
assert "Psychotic" in driver.title
continue_link = driver.find_element_by_tag_name('a')
elem = driver.find_elements_by_xpath("//*[@href]")
#x = str(continue_link)
#print(continue_link)
print(elem)
Well, you have to simply loop through the list:
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))
find_elements_by_* returns a list of elements (note the spelling of 'elements'). Loop through the list, take each element and fetch the required attribute value you want from it (in this case href).
I have checked and tested that there is a function named find_elements_by_tag_name() you can use. This example works fine for me.
elems = driver.find_elements_by_tag_name('a')
for elem in elems:
    href = elem.get_attribute('href')
    if href is not None:
        print(href)
driver.get(URL)
time.sleep(7)
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))
driver.close()
Note: Adding a delay is very important. First run it in debug mode and make sure your URL page is getting loaded. If the page is loading slowly, increase the delay (sleep time) and then extract.
If you still face any issues, please refer to the link below (explained with an example) or comment:
Extract links from webpage using selenium webdriver
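As an alternative to a fixed sleep, an explicit wait for the links to be present is usually more reliable. A sketch, reusing the URL variable and XPath from the snippet above:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get(URL)
# wait up to 15 seconds for at least one anchor with an href to be present
WebDriverWait(driver, 15).until(
    EC.presence_of_all_elements_located((By.XPATH, "//a[@href]")))
for elem in driver.find_elements_by_xpath("//a[@href]"):
    print(elem.get_attribute("href"))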
You can try something like:
links = driver.find_elements_by_partial_link_text('')
You can import the HTML DOM using the htmldom library in Python. You can find it here and install it using pip:
https://pypi.python.org/pypi/htmldom/2.0
from htmldom import htmldom
dom = htmldom.HtmlDom("https://www.github.com/")
dom = dom.createDom()
The above code creates an HtmlDom object. HtmlDom takes a default parameter, the URL of the page. Once the dom object is created, you need to call the "createDom" method of HtmlDom. This parses the HTML data and constructs the parse tree, which can then be used for searching and manipulating the HTML data. The only restriction the library imposes is that the data, whether HTML or XML, must have a root element.
You can query for elements using the "find" method of the HtmlDom object:
p_links = dom.find("a")
for link in p_links:
    print("URL: " + link.attr("href"))
The above code will print all the links/URLs present on the web page.
Unfortunately, the original link posted by OP is dead...
If you're looking for a way to scrape links on a page, here's how you can scrape all of the "Hot Network Questions" links on this page with gazpacho:
from gazpacho import Soup
url = "https://stackoverflow.com/q/34759787/3731467"
soup = Soup.get(url)
a_tags = soup.find("div", {"id": "hot-network-questions"}).find("a")
[a.attrs["href"] for a in a_tags]
You can also do this with BeautifulSoup in a very easy and efficient way. I have tested the code below and it worked fine for the same purpose.
After this line:
driver.get("http://psychoticelites.com/")
use the code below:
import requests
from bs4 import BeautifulSoup

response = requests.get(driver.current_url)
soup = BeautifulSoup(response.content, 'html.parser')
for link in soup.find_all('a'):
    if link.get('href'):
        print(link.get("href"))
print('\n')
The answers above using Selenium's driver.find_elements_by_*** no longer work with Selenium 4. The current method is to use find_elements() with the By class.
Method 1: For loop
The code below uses two lists, one for By.XPATH and the other for By.TAG_NAME. You can use either one; both are not needed.
By.XPATH IMO is the easiest as it does not return a seemingly useless None value like By.TAG_NAME does. The code also removes duplicates.
from selenium.webdriver.common.by import By

driver.get("https://www.amazon.com/")

href_links = []
href_links2 = []

elems = driver.find_elements(by=By.XPATH, value="//a[@href]")
elems2 = driver.find_elements(by=By.TAG_NAME, value="a")

for elem in elems:
    l = elem.get_attribute("href")
    if l not in href_links:
        href_links.append(l)

for elem in elems2:
    l = elem.get_attribute("href")
    if (l not in href_links2) and (l is not None):
        href_links2.append(l)

print(len(href_links))   # 360
print(len(href_links2))  # 360
print(href_links == href_links2)  # True
Method 2: List Comprehension
If duplicates are OK, a one-line list comprehension can be used.
from selenium.webdriver.common.by import By

driver.get("https://www.amazon.com/")

elems = driver.find_elements(by=By.XPATH, value="//a[@href]")
href_links = [e.get_attribute("href") for e in elems]

elems2 = driver.find_elements(by=By.TAG_NAME, value="a")
# href_links2 = [e.get_attribute("href") for e in elems2]  # Does not remove None values
href_links2 = [e.get_attribute("href") for e in elems2 if e.get_attribute("href") is not None]

print(len(href_links))   # 387
print(len(href_links2))  # 387
print(href_links == href_links2)  # True
import requests
from selenium import webdriver
import bs4

driver = webdriver.Chrome(r'C:\chromedrivers\chromedriver')  # enter the path
data = requests.request('get', 'https://google.co.in/')  # any website
s = bs4.BeautifulSoup(data.text, 'html.parser')
for link in s.findAll('a'):
    print(link)
Update to the existing answer above:
For the current version it needs to be:
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))
