I'm scraping a site and am able to pull down an email href attribute, but all of the emails contain the mailto: prefix. For example, I'd like the email mailto:john@gmail.com to just be john@gmail.com. I've searched Stack and am finding several regular expression solutions but am unable to implement them. In Python 3.6 the import re stays gray; it seems like that must be a default library now, but it isn't working. I've also tried altering the XPath, but am unclear on how to render the XPath, since apparently Selenium doesn't allow you to do that.
Here is my code:
try:
    element = "//div[@class='business-buttons']/a[1]"
    email_el = driver.find_element(By.XPATH, element)
    email = email_el.get_attribute("href")
except NoSuchElementException:
    print("Handled NoSuchElementException no email")
    pass
You can try the method .replace(). Note that .replace() returns a new string rather than modifying the original, so reassign the result:
email = email.replace("mailto:", "")
If you have a list of scraped emails you can use .replace() in a loop:
email_list = ['mailto:john@gmail.com', 'mailto:john2@gmail.com', 'mailto:john3@gmail.com']
for item in email_list:
    item = item.replace("mailto:", "")
    print(item)
Output:
john@gmail.com
john2@gmail.com
john3@gmail.com
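As an aside, re is part of Python's standard library, so import re should work; in many IDEs (PyCharm, for example) a grayed-out import just means it isn't used yet, not that it is missing. A regex version of the same cleanup, if you prefer that route, might look like this:
import re

email = "mailto:john@gmail.com"
email = re.sub(r"^mailto:", "", email)  # strip the prefix only at the start of the string
print(email)  # john@gmail.com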
I've been looking for a solution to this but to no avail. I am scraping a website using Selenium with Python, looping through some XML URLs and extracting some information from each. Here is an example page.
This page works fine, but further down the loop some of the pages do not have some of the elements I'm looking for. How do I return a null where an element does not exist? I have tried appending or None, but this does not seem to work. Here is my code snippet:
if dataset_id is not None:
    xml_url = f'https://www.spatialdata.gov.scot/geonetwork/srv/eng/xml.metadata.get?uuid={dataset_id}'
    driver.get(xml_url)
    contact_email = driver.find_element(By.XPATH, '//gmd:CI_ResponsibleParty/gmd:organisationName/gco:CharacterString').get_attribute('textContent')
    contact_name = driver.find_element(By.XPATH, '//gmd:CI_Address/gmd:electronicMailAddress/gco:CharacterString').get_attribute('textContent')
    update_frequency = driver.find_element(By.XPATH, '//gmd:maintenanceAndUpdateFrequency/gmd:MD_MaintenanceFrequencyCode').get_attribute("codeListValue")
    date_span_start = driver.find_element(By.XPATH, '//gml:TimePeriod/gml:beginPosition').get_attribute('textContent') or None
    date_span_end = driver.find_element(By.XPATH, '//gml:TimePeriod/gml:endPosition').get_attribute('textContent') or None
else:
    contact_email = None
    contact_name = None
    update_frequency = None
    date_span_start = None
    date_span_end = None
Here is a snippet of what the XML page looks like:
<gmd:address>
  <gmd:CI_Address>
    <gmd:deliveryPoint>
      <gco:CharacterString>Great Glen House, Leachkin Road</gco:CharacterString>
    </gmd:deliveryPoint>
    <gmd:city>
      <gco:CharacterString>INVERNESS</gco:CharacterString>
    </gmd:city>
    <gmd:postalCode>
      <gco:CharacterString>IV3 8NW</gco:CharacterString>
    </gmd:postalCode>
    <gmd:country>
      <gco:CharacterString>United Kingdom</gco:CharacterString>
    </gmd:country>
    <gmd:electronicMailAddress>
      <gco:CharacterString>data_supply@snh.gov.uk</gco:CharacterString>
    </gmd:electronicMailAddress>
  </gmd:CI_Address>
</gmd:address>
Every time it lands on a page without the given element, I get an error like the one below, depending on which field is missing:
InvalidSelectorException: Message: invalid selector: Unable to locate an element with the xpath expression //gml:TimePeriod/gml:beginPosition because of the following error:
NamespaceError: Failed to execute 'evaluate' on 'Document': The string '//gml:TimePeriod/gml:beginPosition' contains unresolvable namespaces.
I'm really hoping to get this sorted out. Thanks in advance!
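One way to sidestep the NamespaceError is to match on local-name() so the XPath never mentions a namespace prefix, and to use find_elements so a missing node yields None instead of an exception. A sketch (the helper name get_text_or_none is mine, not from the original code):
from selenium.webdriver.common.by import By

def get_text_or_none(driver, xpath):
    # find_elements returns an empty list instead of raising when there
    # is no match, so a missing node maps cleanly to None
    elements = driver.find_elements(By.XPATH, xpath)
    return elements[0].get_attribute('textContent') if elements else None

# local-name() matches the element regardless of its namespace prefix,
# so the document's namespaces never need to be resolved
date_span_start = get_text_or_none(
    driver, "//*[local-name()='TimePeriod']/*[local-name()='beginPosition']")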
I am trying to scrape the PSI readings from this website. But no matter which selection criterion I use (id: first-half, class: allow-overflow-item), Selenium cannot locate the table and always runs the except clause. The webpage can be opened without a problem.
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('https://www.haze.gov.sg/resources/readings-over-the-last-24-hours')
try:
    elem = browser.find_elements_by_id('first-half')
    print(elem.text)
except:
    print('Was not able to find an element with that name.')
You are using find_elements, which returns a list of matching elements, and lists do not have a text attribute. Use find_element_by_id instead:
try:
    elem = browser.find_element_by_id('first-half')
    print(elem.text)
except:
    print('Was not able to find an element with that name.')
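If the table is rendered by JavaScript after the page loads, a plain find can also fire too early. An explicit wait gives the element time to appear; a sketch, assuming the id first-half is correct:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the element to be present before giving up
elem = WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, 'first-half')))
print(elem.text)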
Is the table inside an iframe?
If it is, you'll need to switch to that iframe first:
self.driver.switch_to.frame(self.driver.find_element_by_id("frameNameXXXX"))
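If you don't know the frame's name or id, you can also locate the iframe element itself, switch to it, and switch back when done (a sketch using the same old-style API as above):
# Switch into the first iframe on the page, read the table, then return
# to the top-level document
iframe = self.driver.find_element_by_tag_name("iframe")
self.driver.switch_to.frame(iframe)
print(self.driver.find_element_by_id('first-half').text)
self.driver.switch_to.default_content()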
cheers
I'm writing a script to do some web scraping on my Firebase for a few select users. After accessing the events page for a user, I want to first check for the condition that no events have been logged by that user.
For this, I am using Selenium and Python. Using XPath seems to work fine for locating links and navigation in all other parts of the script, except for accessing elements in a table. At first, I thought I might have been using the wrong XPath expression, so I copied the path directly from Chrome's inspection window, but still no luck.
As an alternative, I have tried to copy the page source and pass it into Beautiful Soup, and then parse it there to check for the element. No luck there either.
Here's some of the code, and some of the HTML I'm trying to parse. Where am I going wrong?
# Using WebDriver - always triggers an exception
def check_if_user_has_any_data():
    try:
        time.sleep(10)
        element = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="event-table"]/div/div/div[2]/mobile-table/md-whiteframe/div[1]/ga-no-data-table/div')))
        print(type(element))
        if element == True:
            print("Found empty state by copying XPath expression directly. It is a bit risky, but it seems to have worked")
        else:
            print("didn't find empty state")
    except:
        print("could not find the empty state element", EC)
# Using Beautiful Soup
def check_if_user_has_any_data_2():
    time.sleep(10)
    html = driver.execute_script("return document.documentElement.outerHTML")
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.text[:500])
    print(len(soup.findAll('div', {"class": "table-row-no-data ng-scope"})))
HTML
<div class="table-row-no-data ng-scope" ng-if="::config" ng-class="{overlay: config.isBuilderOpen()}">
  <div class="no-data-content layout-align-center-center layout-row" layout="row" layout-align="center center">
    <!-- ... -->
  </div>
</div>
The first version was expected to find the element and evaluate 'element' as True; in actuality, the element is not found and the exception is triggered.
The second version prints the first 500 characters (correctly, as far as I can tell), but returns '0' where, judging by the page source, it should return '1'.
Use the following code:
elements = driver.find_elements_by_xpath("//*[@id='event-table']/div/div/div[2]/mobile-table/md-whiteframe/div[1]/ga-no-data-table/div")
if len(elements) > 0:
    pass  # Element is present. Do your action
else:
    pass  # Element is not present. Do alternative action
Note: find_elements will not throw any exception if nothing matches; it simply returns an empty list.
Here is the method that I generally use.
Imports
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
Method
def is_element_present(self, how, what):
    try:
        self.driver.find_element(by=how, value=what)
    except NoSuchElementException:
        return False
    return True
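Usage might look like this (the class name is just an example, taken from the HTML snippet above):
if self.is_element_present(By.CLASS_NAME, "table-row-no-data"):
    print("Empty state found")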
Some elements load dynamically, so it is better to use an explicit wait with a timeout rather than failing immediately.
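A wait-based variant of the same check could look like this (a sketch; the function name and the 10-second timeout are my choices):
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def is_element_present_with_wait(driver, how, what, timeout=10):
    try:
        # Poll the DOM until the element appears or the timeout expires
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((how, what)))
    except TimeoutException:
        return False
    return True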
If you're using Python and Selenium, you can use this:
try:
    driver.find_element_by_xpath("<Full XPath expression>")  # Test whether the element exists
    # <Other code>
except:
    pass  # <Run this if the element doesn't exist>
I've solved it. The page had a bunch of different iframe elements, and I didn't know that one had to switch between frames in Selenium to access those elements.
There was nothing wrong with the initial code, or the suggested solutions which also worked fine when I tested them.
Here's the code I used to test it:
# Time for the page to load
time.sleep(20)

# Find all iframes
iframes = driver.find_elements_by_tag_name("iframe")

# From inspecting the page source, it looks like the index of the relevant iframe is [0]
x = len(iframes)
print("Found", x, "iframes")  # Should return 5
driver.switch_to.frame(iframes[0])
print("switched to frame [0]")

if WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '//*[@class="no-data-title ng-binding"]'))):
    print("Found it in this frame!")
Check the length of the collection of elements you are retrieving with an if statement.
Example:
elements = driver.find_elements_by_xpath('//a[@href="https://www.example.com"]')
if len(elements) > 0:
    pass  # Do something.
The following script follows a page on Instagram:
from time import sleep

from selenium import webdriver

browser = webdriver.Chrome('./chromedriver')
# GO INSTAGRAM PAGE FOR LOGIN
browser.get('https://www.instagram.com/accounts/login/?hl=it')
sleep(2)
# ID AND PASSWORD
elem = browser.find_element_by_name("username").send_keys('test')
elem = browser.find_element_by_name("password").send_keys('passw')
# CLICK BUTTON AND OPEN INSTAGRAM
sleep(5)
good_elem = browser.find_element_by_xpath('//*[#id="react-root"]/section/main/div/article/div/div[1]/div/form/span/button').click()
sleep(5)
browser.get("https://www.instagram.com")
# GO TO PAGE FOR FOLLOW
browser.get("https://www.instagram.com/iam.ai4/")
sleep(28)
segui = browser.find_element_by_class_name('BY3EC').click()
If an element with class BY3EC isn't found I want the script to keep working.
When an element is not found it throws NoSuchElementException, so you can use try/except to avoid that, for example:
from selenium.common.exceptions import NoSuchElementException
try:
    segui = browser.find_element_by_class_name('BY3EC').click()
except NoSuchElementException:
    print('Element BY3EC not found')  # or do something else here
You can take a look at selenium exceptions to get an idea of what each one of them is for.
Surround it with try/except blocks; then you can build a happy path and handle failures as well, so your test case will always work.
Best practice is to not use Exceptions to control flow. Exceptions should be exceptional... rare and unexpected. The simple way to do this is to get a collection using the locator and then see if the collection is empty. If it is, you know the element doesn't exist.
In the example below we search the page for the element you wanted and check that the collection contains an element; if it does, we click it.
segui = browser.find_elements_by_class_name('BY3EC')
if segui:
    segui[0].click()
Say for example I have this list of keywords: "Head,Feet, Hand,Fingers"
How can I pass all of these inside "()" of browser.find_element_by_link_text()?
The purpose is to search for these keywords one by one and, if found, simulate clicking through each of them.
sample code:
for i in browser.find_element_by_link_text("**all keywords should be passed here**"):
    i.click()
PS. Python Newbie.
You cannot pass multiple link texts to find_element_by_link_text().
You have multiple ways to approach the problem. You can, for instance, switch to XPath locators and dynamically construct an expression that checks all the link text variations:
link_texts = ["Head", "Feet", "Hand", "Fingers"]
expression = "//a[%s]" % (" or ".join(". = '%s'" % link_text for link_text in link_texts))

for link in driver.find_elements_by_xpath(expression):
    link.click()
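For the list above, the constructed expression comes out as:
//a[. = 'Head' or . = 'Feet' or . = 'Hand' or . = 'Fingers']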
Or, you can issue find_element_by_link_text() in a loop, properly handling the NoSuchElementException that is raised when an element is not found:
from selenium.common.exceptions import NoSuchElementException

link_texts = ["Head", "Feet", "Hand", "Fingers"]
for link_text in link_texts:
    try:
        link = driver.find_element_by_link_text(link_text)
        link.click()
    except NoSuchElementException:
        print("Link text '%s' not found" % link_text)
The latter option would be slower, but at the same time much more explicit.
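A middle ground (a sketch) is to call find_elements_by_link_text() per keyword; it returns an empty list instead of raising, so no exception handling is needed:
link_texts = ["Head", "Feet", "Hand", "Fingers"]
for link_text in link_texts:
    # An empty list simply means this link text is absent from the page
    links = driver.find_elements_by_link_text(link_text)
    if links:
        links[0].click()
    else:
        print("Link text '%s' not found" % link_text)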