I'm writing a program to iterate through elements on a webpage. I start the browser like so:
self.browser = webdriver.Chrome(executable_path="C:/Users/me/chromedriver.exe")
self.browser.get("https://www.google.com/maps/place/Foster+Street+Coffee/@36.0016436,-78.9018397,19z/data=!4m7!3m6!1s0x89ace473f05b7d39:0x42c63a92682d9ec3!8m2!3d36.0016427!4d-78.9012927!9m1!1b1")
This opens the site, where I can find the elements I'm interested in using:
reviews = self.browser.find_elements_by_class_name("section-review-line")
Now I have a list of elements for class name "section-review-line", which seems to populate correctly. I'd like to iterate through this list of elements and pick out subelements with some logic. To get the subelements, which I know exist under class name "section-review-review-content", I try this:
for review in reviews:
    content = review.find_element_by_class_name("section-review-review-content")
This errors out with:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":".section-review-review-content"}
OK, here is all the item information you need from each review.
reviews = driver.find_elements_by_class_name("section-review-content")
for review in reviews:
    reviewer = review.find_element_by_class_name("section-review-title").text
    numOfReviews = review.find_element_by_xpath(".//div[@class='section-review-subtitle']//span[contains(.,'reviews')]").text.strip().replace('.','')
    numberOfStars = review.find_element_by_class_name("section-review-stars").get_attribute('aria-label').strip()
    publishDate = review.find_element_by_class_name("section-review-publish-date").text
    content = review.find_element_by_class_name("section-review-review-content").text
Ah, figured it out.
I was on a strange page that had an empty element up top, causing it to error out. The large majority of the elements did not have this problem; a try/except solved it like so:
reviews = self.browser.find_elements_by_class_name("section-review-line")
for review in reviews:
    try:
        content = review.find_element_by_class_name("section-review-review-content")
        rtext = content.find_element_by_class_name("section-review-text").text
    except NoSuchElementException:
        continue
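An alternative to a bare try/except (which also hides unrelated errors) is the plural find_elements_* form, which returns an empty list instead of raising when nothing matches. A minimal sketch, assuming the same old-style Selenium API as above; the helper name is mine, not from the original code:

```python
# Hypothetical helper: return the text of the first child matching a
# class name, or None when no such child exists.
def first_text(element, class_name):
    # find_elements (plural) returns [] for a missing element
    # instead of raising NoSuchElementException.
    matches = element.find_elements_by_class_name(class_name)
    return matches[0].text if matches else None

# Usage in the loop above would look like:
# for review in reviews:
#     rtext = first_text(review, "section-review-text")
#     if rtext is None:
#         continue
```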
I've been looking for a solution to this but to no avail. I am scraping a website using Selenium with Python. I am looping through some XML URLs which I am extracting some information from. Here is an example page.
This page works fine but down the loop, some of the pages do not have some of the elements I'm looking for. How do I return a null where the element does not exist? I have tried using or None but this does not seem to work. Here is my code snippet:
if dataset_id is not None:
    xml_url = f'https://www.spatialdata.gov.scot/geonetwork/srv/eng/xml.metadata.get?uuid={dataset_id}'
    driver.get(xml_url)
    contact_name = driver.find_element(By.XPATH, '//gmd:CI_ResponsibleParty/gmd:organisationName/gco:CharacterString').get_attribute('textContent')
    contact_email = driver.find_element(By.XPATH, '//gmd:CI_Address/gmd:electronicMailAddress/gco:CharacterString').get_attribute('textContent')
    update_frequency = driver.find_element(By.XPATH, '//gmd:maintenanceAndUpdateFrequency/gmd:MD_MaintenanceFrequencyCode').get_attribute('codeListValue')
    date_span_start = driver.find_element(By.XPATH, '//gml:TimePeriod/gml:beginPosition').get_attribute('textContent') or None
    date_span_end = driver.find_element(By.XPATH, '//gml:TimePeriod/gml:endPosition').get_attribute('textContent') or None
else:
    contact_email = None
    contact_name = None
    update_frequency = None
    date_span_start = None
    date_span_end = None
Here is a snippet of what the XML page looks like:
<gmd:address>
  <gmd:CI_Address>
    <gmd:deliveryPoint>
      <gco:CharacterString>Great Glen House, Leachkin Road</gco:CharacterString>
    </gmd:deliveryPoint>
    <gmd:city>
      <gco:CharacterString>INVERNESS</gco:CharacterString>
    </gmd:city>
    <gmd:postalCode>
      <gco:CharacterString>IV3 8NW</gco:CharacterString>
    </gmd:postalCode>
    <gmd:country>
      <gco:CharacterString>United Kingdom</gco:CharacterString>
    </gmd:country>
    <gmd:electronicMailAddress>
      <gco:CharacterString>data_supply@snh.gov.uk</gco:CharacterString>
    </gmd:electronicMailAddress>
  </gmd:CI_Address>
</gmd:address>
Every time it lands on a page without a given element, I get an error like this, depending on which field is missing:
InvalidSelectorException: Message: invalid selector: Unable to locate an element with the xpath expression //gml:TimePeriod/gml:beginPosition because of the following error:
NamespaceError: Failed to execute 'evaluate' on 'Document': The string '//gml:TimePeriod/gml:beginPosition' contains unresolvable namespaces.
I'm really hoping to get this sorted out. Thanks in advance!
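For what it's worth, the exception above is not about a missing element: the browser's XPath evaluator simply has no bindings for the gmd:/gml:/gco: prefixes. In browser-evaluated XPath you can sidestep prefixes with local-name() (e.g. //*[local-name()='beginPosition']) and wrap each lookup in try/except for the None fallback. The prefix-to-URI idea can be sketched with the standard library's ElementTree, parsing the XML directly instead of through the browser; the namespace URIs below are the usual ISO 19139 ones, declared inline so the sketch is self-contained:

```python
import xml.etree.ElementTree as ET

xml_doc = """<gmd:CI_Address xmlns:gmd="http://www.isotc211.org/2005/gmd"
                             xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:electronicMailAddress>
    <gco:CharacterString>data_supply@snh.gov.uk</gco:CharacterString>
  </gmd:electronicMailAddress>
</gmd:CI_Address>"""

root = ET.fromstring(xml_doc)
# The gmd:/gco: prefixes only resolve once they are mapped to URIs:
ns = {"gmd": "http://www.isotc211.org/2005/gmd",
      "gco": "http://www.isotc211.org/2005/gco"}
el = root.find(".//gmd:electronicMailAddress/gco:CharacterString", ns)
# .find returns None (no exception) when the path matches nothing:
missing = root.find(".//gmd:deliveryPoint/gco:CharacterString", ns)
print(el.text, missing)  # data_supply@snh.gov.uk None
```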
I need to scrape this webpage: https://br.advfn.com/bolsa-de-valores/bovespa/aper3/balanco
My Selenium code crashes most of the time, and I noticed that the data I need is not loaded by scripts, so I don't need Selenium for this specific webpage. I tried this:
def minerar_advfns(ticker):
    quote = 'https://br.advfn.com/bolsa-de-valores/bovespa/' + ticker + '/balanco'
    page = requests.get(quote)
    time.sleep(0.9)
    dom = etree.HTML(page.content)
    f = open('www.htm', 'w')
    temp1 = str(page.content)
    f.write(temp1)
    f.close()
    lucro = dom.xpath('//*[@id="financials_table_2"]/table/tbody/tr[11]/td[2]/text()')[0]
    print(lucro)
I'm trying to scrape the data point '1.605' with this code. It stops in the penultimate line and says 'IndexError: list index out of range'.
When I check the saved 'www.htm' page, that '1.605' value is there exactly where my XPath points. I tried with and without '/text()' and '[0]'. If I remove the [0], it prints '[]'.
Your XPath has a tbody problem. Browsers insert a <tbody> element into the DOM even when the raw HTML does not contain one, so an XPath copied from DevTools can include a level that does not exist in the source requests downloads.
You can change //*[@id="financials_table_2"]/table/tbody/tr[11]/td[2]/text()
to //*[@id="financials_table_2"]/table//tr[11]/td[2]/text()
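The effect can be illustrated with the standard library alone; the markup below is invented to mirror the page's structure, with no <tbody> in the source:

```python
import xml.etree.ElementTree as ET

# Simplified, invented markup: note there is no <tbody> in the source.
raw = ("<div id='financials_table_2'>"
       "<table><tr><td>label</td><td>1.605</td></tr></table></div>")
root = ET.fromstring(raw)

strict = root.findall(".//table/tbody/tr")   # DevTools-style path: no match
loose = root.findall(".//table//tr")         # '//' skips the missing level
print(len(strict), loose[0].findall("td")[1].text)  # 0 1.605
```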
I am trying to scrape data from naukri.com; here I am trying to scrape the location details for each recruiter visible on the page.
The code I am writing is:
def sol7(url):
    driver = webdriver.Chrome(r'C:\Users\Rahul\Downloads\chromedriver_win32\chromedriver.exe')
    driver.maximize_window()
    driver.get(url)
    # getting link of recruiters
    recruiters_tag = driver.find_element_by_xpath("//ul[@class='midSec menu']/li[2]/a")
    driver.get(recruiters_tag.get_attribute('href'))
    # search and click for data scientist
    driver.find_element_by_xpath("//input[@class='sugInp']").send_keys('Data science')
    driver.find_element_by_xpath("//button[@id='qsbFormBtn']").click()
    highlight_table_tag = driver.find_elements_by_xpath("//p[@class='highlightable']")
    print(len(highlight_table_tag))
    for i in highlight_table_tag:
        try:
            print((i.find_element_by_xpath("//small[@class='ellipsis']")).text)
        except NoSuchElementException:
            print('-')
First I extracted all recruiter details into the list highlight_table_tag.
highlight_table_tag includes all the elements on the page, yet the loop only ever prints the first element of the list.
I want to scrape the location in such a way that, if the location tag is not present in an element of highlight_table_tag, '-' is printed instead.
Please help!
When XPathing child elements, put a . in front of the XPath; otherwise the search starts from the root node, which is why you end up with the same element every time.
print(i.find_element_by_xpath(".//small[@class='ellipsis']").text)
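The root-node behavior is easy to reproduce outside Selenium, since lxml's XPath works the same way. A small sketch with invented markup, assuming lxml is available:

```python
from lxml import html

doc = html.fromstring(
    "<div>"
    "<p class='card'><small>London</small></p>"
    "<p class='card'><small>Paris</small></p>"
    "</div>")
cards = doc.xpath("//p[@class='card']")

# Without the leading dot, the search restarts at the document root,
# so every card yields the first <small> in the whole page:
from_root = [c.xpath("//small")[0].text for c in cards]
# With './/', the search is scoped to each card element:
scoped = [c.xpath(".//small")[0].text for c in cards]
print(from_root, scoped)  # ['London', 'London'] ['London', 'Paris']
```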
I want to extract data in <div class="user-profile_list __relatives"> ... (see image)
Source code of the page https://gist.github.com/mascai/59e3bf779c2ba7cecb973ab9653ed419
My code:
def get_relatives(driver):
    relatives = []
    relatives_container = driver.find_element_by_class_name("user-profile_list __relatives")
    return relatives

driver = webdriver.Chrome(executable_path='chromedriver')
get_relatives(driver)
Error text
Message: no such element: Unable to locate element: {"method":"css selector","selector":".user-profile_list __relatives"}
This happens a lot; find_element_by_class_name accepts only a single class name, and this element has two. It's better to use XPath and match the class attribute:
relatives_container = driver.find_element_by_xpath('//*[@class="user-profile_list __relatives"]')
You can also try contains() in the XPath; it works even when the element has multiple classes and you write only one of them:
relatives_container = driver.find_element_by_xpath("//*[contains(@class, '__relatives')]")
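Both XPath forms can be checked outside the browser with lxml (invented markup, assuming lxml is installed); in Selenium, a CSS selector such as .user-profile_list.__relatives would also work:

```python
from lxml import html

doc = html.fromstring(
    '<div class="user-profile_list __relatives"><span>mother</span></div>')

# Matching the full class attribute works when you know the exact string:
exact = doc.xpath('//*[@class="user-profile_list __relatives"]/span/text()')
# contains() is laxer: one of the classes is enough:
partial = doc.xpath('//*[contains(@class, "__relatives")]/span/text()')
print(exact, partial)  # ['mother'] ['mother']
```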
I'm scraping a site and am able to pull down an email href attribute, but all of the emails contain the mailto: prefix. For example, I'd like the email mailto:john@gmail.com to just be john@gmail.com. I've searched Stack Overflow and found several regular-expression solutions but am unable to implement them. In Python 3.6 the import re stays gray in my IDE; it seems like that must be a default library now, but it isn't working. I've also tried altering the XPath, but am unclear on how to do that, since Selenium apparently doesn't allow it.
Here is my code:
try:
    element = "//div[@class='business-buttons']/a[1]"
    email_el = driver.find_element(By.XPATH, element)
    email = email_el.get_attribute("href")
except NoSuchElementException:
    print("Handled NoSuchElementException no email")
You can try the string method .replace(); note that strings are immutable, so you need to assign the result:
email = email.replace("mailto:", "")
If you have a list of scraped emails you can use .replace() in a loop:
email_list = ['mailto:john@gmail.com', 'mailto:john2@gmail.com', 'mailto:john3@gmail.com']
for item in email_list:
    item = item.replace("mailto:", "")
    print(item)
Output:
john@gmail.com
john2@gmail.com
john3@gmail.com
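On Python 3.9+, str.removeprefix is a slightly safer alternative to .replace, since it only strips the prefix at the very start of the string rather than every occurrence:

```python
emails = ["mailto:john@gmail.com", "mailto:john2@gmail.com"]
cleaned = [e.removeprefix("mailto:") for e in emails]
print(cleaned)  # ['john@gmail.com', 'john2@gmail.com']

# Strings that don't start with the prefix are returned unchanged:
print("note: mailto: links only".removeprefix("mailto:"))
```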