How can I filter elements that have the same class?
<html>
<body>
<p class="content">Link1.</p>
</body>
</html>
<html>
<body>
<p class="content">Link2.</p>
</body>
</html>
You can try to get the list of all elements with class = "content" by using find_elements_by_class_name:
a = driver.find_elements_by_class_name("content")
Then you can click on the link that you are looking for.
By.CLASS_NAME was not yet mentioned:
from selenium.webdriver.common.by import By
driver.find_element(By.CLASS_NAME, "content")
This is the list of attributes which can be used as locators in By:
CLASS_NAME
CSS_SELECTOR
ID
LINK_TEXT
NAME
PARTIAL_LINK_TEXT
TAG_NAME
XPATH
As per the HTML:
<html>
<body>
<p class="content">Link1.</p>
</body>
</html>
<html>
<body>
<p class="content">Link2.</p>
</body>
</html>
Both <p> elements have the same class, content.
To collect every element with the class content into a list, you can use any of the following locator strategies:
Using class_name:
elements = driver.find_elements_by_class_name("content")
Using css_selector:
elements = driver.find_elements_by_css_selector(".content")
Using xpath:
elements = driver.find_elements_by_xpath("//*[@class='content']")
Ideally, before clicking an element, use WebDriverWait with visibility_of_all_elements_located(); any of the following locator strategies works:
Using CLASS_NAME:
elements = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "content")))
Using CSS_SELECTOR:
elements = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".content")))
Using XPATH:
elements = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//*[@class='content']")))
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
References
You can find a couple of relevant discussions in:
How to identify an element through classname even though there are multiple elements with the same classnames using Selenium and Python
Unable to locate element using className in Selenium and Java
What are properties of find_element_by_class_name in selenium python?
How to locate the last web element using classname attribute through Selenium and Python
Use nth-child, for example: http://www.w3schools.com/cssref/sel_nth-child.asp
driver.find_element(By.CSS_SELECTOR, 'p.content:nth-child(1)')
or http://www.w3schools.com/cssref/sel_firstchild.asp
driver.find_element(By.CSS_SELECTOR, 'p.content:first-child')
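The simplest way is to use find_element_by_class_name('class_name'). Note that the find_element_by_* helpers were removed in Selenium 4.3; there, find_element(By.CLASS_NAME, 'class_name') is the equivalent.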
The first answer has been deprecated, and the other answers only return one result. This is the correct answer:
driver.find_elements(By.CLASS_NAME, "content")
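The difference between "first match" and "all matches" that this thread keeps returning to can be tried without a browser. The following is only a sketch: it swaps Selenium for the standard library's xml.etree.ElementTree (which understands a small XPath subset) and assumes both <p> elements live in one well-formed document. Its find()/findall() pair plays the role of find_element()/find_elements(); the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: both paragraphs are in a single well-formed page.
page = ET.fromstring(
    "<html><body>"
    '<p class="content">Link1.</p>'
    '<p class="content">Link2.</p>'
    "</body></html>"
)

# find() stops at the first match, like find_element().
first = page.find(".//*[@class='content']")

# findall() returns every match as a list, like find_elements().
all_matches = page.findall(".//*[@class='content']")

# Once you hold the list, filter it by whatever tells the elements apart.
wanted = [p for p in all_matches if p.text == "Link2."]

print(first.text)
print([p.text for p in all_matches])
print([p.text for p in wanted])
```

With Selenium the same filtering step would apply to the list returned by driver.find_elements(By.CLASS_NAME, "content"), comparing each element's .text.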
Related
I'm trying to get the text inside these div's but I'm not succeeding. They have a code between the class that doesn't change with each execution.
<div data-v-a5b90146="" class="html-content"> Text to be captured</div>
<div data-v-a5b90146="" class="html-content"><b> TEXT WANTED </b><div><br></div>
I've tried with XPATH, but I was not successful too.
content = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '/html/body/div/div/div/div/div/div/div[1]/div[2]/div[2]/div[4]/div/div/b'))).text
You need to change a couple of things.
presence_of_all_elements_located() returns a list of elements, so you can't call .text on the result. To get the text, iterate over the list and read .text from each element.
The absolute XPath is very fragile. Use a relative XPath instead; since the class name identifies these elements, you can match on it.
Your code should look like this:
contents = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '//div[@class="html-content"][@data-v-a5b90146]')))
for content in contents:
    print(content.text)
You can use visibility_of_all_elements_located() as well:
contents = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, '//div[@class="html-content"][@data-v-a5b90146]')))
for content in contents:
    print(content.text)
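Because the wait returns a list, the per-element loop can be wrapped in a small helper. This is a minimal sketch, not part of Selenium: FakeElement is a hypothetical stand-in that models only the .text attribute of a WebElement, and element_texts is an illustrative name.

```python
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement; only .text is modelled."""
    text: str


def element_texts(elements):
    """Return the stripped, non-empty text of every element in a list."""
    return [el.text.strip() for el in elements if el.text.strip()]


# In real code this list would come from the WebDriverWait(...).until(...) call.
found = [FakeElement(" Text to be captured"), FakeElement(" TEXT WANTED "), FakeElement("")]
print(element_texts(found))
```

Calling .text on the list itself is what raises the original error; the helper only ever touches individual elements.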
Both the <div> tags have the attribute class="html-content"
Solution
To extract the text from the <div> tags, use WebDriverWait with visibility_of_all_elements_located() instead of presence_of_all_elements_located(), together with a list comprehension. You can use either of the following locator strategies:
Using CSS_SELECTOR:
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.html-content")))])
Using XPATH:
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='html-content']")))])
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Get the element first, then read its text. Note that presence_of_element_located() returns a single element, whereas presence_of_all_elements_located() returns a list, which has no .text:
element = WebDriverWait(
    driver,
    10
).until(
    EC.presence_of_element_located((
        By.XPATH, '/html/body/div/div/div/div/div/div/div[1]/div[2]/div[2]/div[4]/div/div/b'
    ))
)
content = element.text
I'm trying to download a bunch of images and categorize them into folders using Selenium. To do so, I need to grab two ID's associated with each image within the URL. However I'm having trouble scraping the image link from the src attribute. Whether I try to grab by tag, Xpath, or other method the end result is merely "None".
Here's an example of an inspected image page:
<html style="height: 100%;">
<head><meta name="viewport" content="width=device-width, minimum-scale=0.1">
<title>index.php (2448×3264)</title>
</head>
<body style="margin: 0px; background: #0e0e0e; height: 100%">
<img style="-webkit-user-select: none;margin: auto;cursor: zoom-in;background-color: hsl(0, 0%, 90%);transition: background-color 300ms;" src="https://haalsi.net/haalsi_pride2/custom/picture/index.php?id=LQCMY&fieldname=DT006_picture&p=show" width="444" height="593">
</body>
</html>
For this example, I would need to grab "LQCMY" and "DT006_picture" as strings from the URL above. The code below shows my attempt at scraping the URL link (edited down since prior screens I click through are locked behind passwords that I can't give out).
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Image = '/html/body/div[1]/div[2]/div/table/tbody/tr[1]/td[1]/a'
driver.find_element_by_xpath(Image).click()
Image_URL = WebDriverWait(driver, 100).until(EC.element_to_be_clickable((By.XPATH, Image))).get_attribute('src')
print(Image_URL)
Are there certain src's that can't be scraped, or am I scraping the wrong tag?
I've tried grabbing by tag but that also returns "None" as well.
Image_URL = driver.find_element_by_xpath(Image).get_attribute('src')
Other posts said WebDriverWait would help, but I've tried adjusting the wait time and am still receiving "None" too
To print the value of the src attribute you can use either of the following locator strategies:
Using css_selector:
print(driver.find_element_by_css_selector("body img[style*='webkit-user-select'][src^='https://haalsi.net/haalsi_pride2/custom/picture/index.php?id=']").get_attribute("src"))
Using xpath:
print(driver.find_element_by_xpath("//body//img[contains(@style, 'webkit-user-select') and starts-with(@src, 'https://haalsi.net/haalsi_pride2/custom/picture/index.php?id=')]").get_attribute("src"))
Ideally, wait for the image with WebDriverWait and visibility_of_element_located(); either of the following locator strategies works:
Using CSS_SELECTOR:
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "body img[style*='webkit-user-select'][src^='https://haalsi.net/haalsi_pride2/custom/picture/index.php?id=']"))).get_attribute("src"))
Using XPATH:
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//body//img[contains(@style, 'webkit-user-select') and starts-with(@src, 'https://haalsi.net/haalsi_pride2/custom/picture/index.php?id=')]"))).get_attribute("src"))
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
You can find a relevant discussion in Python Selenium - get href value
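Once the src value is in hand, the two identifiers the question asks for ("LQCMY" and "DT006_picture") can be pulled out of the query string with the standard library. The sketch below assumes the URL always carries id and fieldname parameters, as in the example page; picture_ids is a hypothetical helper name.

```python
from urllib.parse import urlparse, parse_qs


def picture_ids(src_url):
    """Return the (id, fieldname) query parameters of an image URL."""
    query = parse_qs(urlparse(src_url).query)
    # parse_qs maps each key to a list of values; take the first of each.
    return query["id"][0], query["fieldname"][0]


# The src value from the example page in the question.
src = "https://haalsi.net/haalsi_pride2/custom/picture/index.php?id=LQCMY&fieldname=DT006_picture&p=show"
image_id, field_name = picture_ids(src)
print(image_id, field_name)
```

A URL without one of the two parameters would raise KeyError here, which is probably the behaviour you want when sorting images into folders by those values.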
I'm trying to pull a specific number out of a div class in Python Selenium but can't figure out how to do it. I'd want to get the "post_parent" ID 947630 as long as it matches the "post_name" number starting 09007.
I'm looking to do this across multiple "post_name" classes, so I'd feed it something like this: search_text = "0900766b80090cb6", but there will be multiple in the future so it has to read the "post_name" first then pull the "post_parent" if that makes sense.
Appreciate any advice anyone has to offer.
<div class="hidden" id="inline_947631">
<div class="post_title">Interface Converter</div>
<div class="post_name">0900766b80090cb6</div>
<div class="post_author">28</div>
<div class="comment_status">closed</div>
<div class="ping_status">closed</div>
<div class="_status">inherit</div>
<div class="jj">06</div>
<div class="mm">07</div>
<div class="aa">2001</div>
<div class="hh">15</div>
<div class="mn">44</div>
<div class="ss">17</div>
<div class="post_password"></div>
<div class="post_parent">947630</div>
<div class="page_template">default</div>
<div class="tags_input" id="rs-language-code_947631">de</div>
</div>
Note that <div class="post_name">0900766b80090cb6</div> and <div class="post_parent">947630</div> are sibling nodes.
You can use the XPath following-sibling axis like this:
Code:
search_text = "0900766b80090cb6"
post_parent_num = driver.find_element(By.XPATH, f"//div[@class='post_name' and text()='{search_text}']//following-sibling::div[@class='post_parent']").text
print(post_parent_num)
or Using ExplicitWait:
search_text = "0900766b80090cb6"
post_parent_num = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, f"//div[@class='post_name' and text()='{search_text}']//following-sibling::div[@class='post_parent']"))).get_attribute('innerText')
print(post_parent_num)
Imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Update:
NoSuchElementException:
Check in the browser dev tools (Google Chrome) whether the XPath matches a unique entry in the HTML DOM.
XPath that you should check:
//div[@class='post_name' and text()='0900766b80090cb6']//following-sibling::div[@class='post_parent']
Steps to check:
Press F12 in Chrome -> go to the Elements tab -> press CTRL + F -> paste the XPath and see whether your desired element is highlighted with 1/1 matching node.
If //div[@class='post_name' and text()='0900766b80090cb6']//following-sibling::div[@class='post_parent'] is unique, then you need to check the conditions below as well.
Check if it's in any iframe/frame/frameset.
Solution: switch to iframe/frame/frameset first and then interact with this web element.
Check if it's in any shadow-root.
Solution: Use driver.execute_script('return document.querySelector to get the web element returned, then operate on it accordingly.
Make sure that the element is rendered properly before interacting with it. Add a hardcoded delay or an explicit wait and try again.
Solution: time.sleep(5) or
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='post_name' and text()='0900766b80090cb6']//following-sibling::div[@class='post_parent']"))).text
Check whether you have been redirected to a new tab or window without switching to it; if so, you will likely get a NoSuchElementException.
Solution: switch to the relevant window/tab first.
If you have switched to an iframe and the new desired element is not in the same iframe context then first switch to default content and then interact with it.
Solution: switch to default content and then switch to respective iframe.
I don't see any specific relation between "post_parent" ID 947630 and "post_name" number starting 09007. Moreover, the parent <div> has class="hidden".
However, to pull the specific number you can use either of the following locator strategies:
Using css_selector:
print(driver.find_element(By.CSS_SELECTOR, "div[id^='inline'] div.post_parent").text)
Using xpath:
print(driver.find_element(By.XPATH, "//div[starts-with(@id, 'inline_')]//div[@class='post_parent']").text)
Ideally you need to induce WebDriverWait for the presence_of_element_located() and you can use either of the following locator strategies:
Using CSS_SELECTOR:
print(WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div[id^='inline'] div.post_parent"))).text)
Using XPATH:
print(WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, "//div[starts-with(@id, 'inline_')]//div[@class='post_parent']"))).text)
Note: You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
You can create a function that uses the following XPath to get the post_parent text based on the post_name text.
def getPostParent(postname):
    element = driver.find_element(By.XPATH, "//div[@class='post_name' and starts-with(text(),'{}')]/following-sibling::div[@class='post_parent']".format(postname))
    print(element.get_attribute("textContent"))
getPostParent('09007')
This prints the value when the post_name text starts with '09007'.
The parent <div> appears to be hidden (class="hidden"), so you need to read textContent to get the value; .text only returns visible text.
The following is the HTML structure:
<div class='list'>
<div>
<p class='code'>12345</p>
<p class='name'>abc</p>
</div>
<div>
<p class='code'>23456</p>
<p class='name'>bcd</p>
</div>
</div>
And there is a config.py for user input. If the user input 23456 to config.code, how can the selenium python select the second object? I am using find_by_css_selector() to locate and select the object, but it can only select the first object, which is Code='12345'. I tried to use find_by_link_text(), but it is a <p> element not <a> element. Anyone can help.....
To locate the element matching the user's input using Selenium and Python, you need to induce WebDriverWait for visibility_of_element_located(); you can use any of the following locator strategies:
Using variable in XPATH:
user_input = '23456'
element = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='list']//div/p[@class='code' and text()='" +user_input+ "']")))
Using %s in XPATH:
user_input = '23456'
element = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='list']//div/p[@class='code' and text()='%s']"% str(user_input))))
Using format() in XPATH:
user_input = '23456'
element = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='list']//div/p[@class='code' and text()='{}']".format(str(user_input)))))
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Try the below xpath:
code = '23456'
element = driver.find_element_by_xpath("//p[@class='code' and text()='" +code +"']")
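All of the variants above paste the user's value straight into the XPath, which breaks as soon as the value contains a quote character. Since XPath 1.0 has no escape syntax inside string literals, one common workaround is to pick the other quote style or fall back to concat(). The helpers below are a sketch under that assumption; xpath_literal and code_xpath are hypothetical names, not Selenium functions.

```python
def xpath_literal(value):
    """Quote a Python string as an XPath 1.0 string literal."""
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    # Both quote types are present: split on single quotes and glue the
    # pieces back together with concat(), quoting each ' as "'".
    parts = value.split("'")
    return "concat(" + ", \"'\", ".join(f"'{part}'" for part in parts) + ")"


def code_xpath(code):
    """Build the locator for the <p class='code'> element holding `code`."""
    return f"//p[@class='code' and text()={xpath_literal(str(code))}]"


print(code_xpath("23456"))
```

The resulting string can be passed to driver.find_element(By.XPATH, ...) or to an expected condition in place of the hand-concatenated versions.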
I am trying to copy the href value from a website, and the html code looks like this:
<p class="sc-eYdvao kvdWiq">
<a href="https://www.iproperty.com.my/property/setia-eco-park/sale-
1653165/">Shah Alam Setia Eco Park, Setia Eco Park
</a>
</p>
I've tried driver.find_elements_by_css_selector(".sc-eYdvao.kvdWiq").get_attribute("href") but it returned 'list' object has no attribute 'get_attribute'. Using driver.find_element_by_css_selector(".sc-eYdvao.kvdWiq").get_attribute("href") returned None. But i cant use xpath because the website has like 20+ href which i need to copy all. Using xpath would only copy one.
If it helps, all the 20+ href are categorised under the same class which is sc-eYdvao kvdWiq.
Ultimately i would want to copy all the 20+ href and export them out to a csv file.
Appreciate any help possible.
You want driver.find_elements if there is more than one element; it returns a list. For the CSS selector, make sure you select the descendants of those classes that carry an href attribute:
elems = driver.find_elements_by_css_selector(".sc-eYdvao.kvdWiq [href]")
links = [elem.get_attribute('href') for elem in elems]
You might also need a wait condition for presence of all elements located by css selector.
elems = WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".sc-eYdvao.kvdWiq [href]")))
As per the given HTML:
<p class="sc-eYdvao kvdWiq">
<a href="https://www.iproperty.com.my/property/setia-eco-park/sale-
1653165/">Shah Alam Setia Eco Park, Setia Eco Park
</a>
</p>
As the href attribute is within the <a> tag ideally you need to move deeper till the <a> node. So to extract the value of the href attribute you can use either of the following Locator Strategies:
Using css_selector:
print(driver.find_element_by_css_selector("p.sc-eYdvao.kvdWiq > a").get_attribute('href'))
Using xpath:
print(driver.find_element_by_xpath("//p[@class='sc-eYdvao kvdWiq']/a").get_attribute('href'))
If you want to extract all the values of the href attribute you need to use find_elements* instead:
Using css_selector:
print([my_elem.get_attribute("href") for my_elem in driver.find_elements_by_css_selector("p.sc-eYdvao.kvdWiq > a")])
Using xpath:
print([my_elem.get_attribute("href") for my_elem in driver.find_elements_by_xpath("//p[@class='sc-eYdvao kvdWiq']/a")])
Dynamic elements
However, the class values sc-eYdvao and kvdWiq look dynamically generated. To extract the href attribute, wait for the element with WebDriverWait and visibility_of_element_located(), using either of the following locator strategies:
Using CSS_SELECTOR:
print(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "p.sc-eYdvao.kvdWiq > a"))).get_attribute('href'))
Using XPATH:
print(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//p[@class='sc-eYdvao kvdWiq']/a"))).get_attribute('href'))
If you want to extract all the values of the href attribute you can use visibility_of_all_elements_located() instead:
Using CSS_SELECTOR:
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "p.sc-eYdvao.kvdWiq > a")))])
Using XPATH:
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//p[@class='sc-eYdvao kvdWiq']/a")))])
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
The XPath
//p[@class='sc-eYdvao kvdWiq']/a
returns the elements you are looking for.
Writing the data to CSV file is not related to the scraping challenge. Just try to look at examples and you will be able to do it.
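For completeness, the export step could look roughly like this with the standard csv module. It is only a sketch: the two example.com links are placeholders for the list of href values collected by Selenium, links_to_csv is an illustrative name, and an in-memory buffer stands in for the output file.

```python
import csv
import io


def links_to_csv(links, out):
    """Write one link per row to an open text stream, with a header row."""
    writer = csv.writer(out)
    writer.writerow(["href"])
    for link in links:
        writer.writerow([link])


# Placeholder data; in practice this is the list of scraped href values.
links = ["https://example.com/listing-1", "https://example.com/listing-2"]

buffer = io.StringIO()
links_to_csv(links, buffer)
print(buffer.getvalue())
```

To write a real file, pass open("links.csv", "w", newline="") instead of the buffer; newline="" is what the csv module's documentation recommends so that row endings are not translated twice.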
To crawl any hyperlink or href, the ProxyCrawl API is one option, as it offers pre-built functions for fetching the desired information. Install the API with pip and follow its code examples to get the required output. A second approach, fetching links with Python Selenium, is to run the following code.
Source Code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
import time
urls = ['https://www.heliosholland.com/Ampullendoos-voor-63-ampullen', 'https://www.heliosholland.com/lege-testdozen']
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 29)
for url in urls:
    driver.get(url)
    # Wait until the product image is visible, then read its src attribute
    image = wait.until(EC.visibility_of_element_located((By.XPATH,'/html/body/div[1]/div[3]/div[2]/div/div[2]/div/div/form/div[1]/div[1]/div/div/div/div[1]/div/img'))).get_attribute('src')
    print(image)
To scrape the link, use .get_attribute('src').
Get the elements you want with driver.find_elements(By.XPATH, 'path').
To extract the href link from an element, use get_attribute('href').
find_elements returns a list, so call get_attribute on each element:
[elem.get_attribute('href') for elem in driver.find_elements(By.XPATH, 'path')]
Try something like:
elems = driver.find_elements_by_xpath("//p[contains(@class, 'sc-eYdvao') and contains(@class, 'kvdWiq')]/a")
for elem in elems:
    print(elem.get_attribute('href'))