I am currently working on a small web scraping script with Python and Selenium.
I am trying to get some information from a table which, in the browser's inspection mode, has a certain ID.
However, when I open the page as raw HTML (which I did after not being able to locate that table using either xpath or css_selector), the table does not have the mentioned ID.
How is that possible?
For better explanation:
This is what it looks like in inspection mode in my browser
<table id='ext-gen1076' class='bats-table bats-table--center'>
[...]
</table>
And this is what it looks like when I open the page as raw HTML file
<table class='bats-table bats-table--center'>
[...]
</table>
How is it possible, that the ID is simply disappearing?
(JFI, this is my first question so apologies for the bad formatting!).
Thanks in advance!
The reason is that the ID was added at runtime.
The value of the id attribute, i.e. ext-gen1076, contains a number and is clearly dynamically generated. The prefix of the value, i.e. ext-gen, indicates that the id was generated at runtime by Ext JS.
Ext JS
Ext JS is a JavaScript framework for building data-intensive, cross-platform web and mobile applications for any modern device.
This usecase
Possibly you identified the <table> element before the JavaScript had rendered the complete DOM tree; hence the id attribute was missing.
Identifying Ext JS elements
As the value of the id attribute is dynamic in nature, you won't be able to use its complete value; you can use only the partial value which is static. As per the HTML you have provided:
<table id='ext-gen1076' class='bats-table bats-table--center'>
[...]
</table>
To identify the <table> node you need to induce WebDriverWait for the visibility_of_element_located() and you can use either of the following Locator Strategies:
Using CSS_SELECTOR:
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table[id^='ext-gen']")))
Using XPATH:
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[starts-with(@id,'ext-gen')]")))
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
However, there could be a lot of other elements with an id attribute starting with ext-gen. So to uniquely identify the <table> element you need to combine it with the class attribute as follows:
Using CSS_SELECTOR:
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table.bats-table.bats-table--center[id^='ext-gen']")))
Using XPATH:
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='bats-table bats-table--center' and starts-with(@id,'ext-gen')]")))
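As a quick offline sanity check of the prefix idea (not Selenium itself), the attribute selector table[id^='ext-gen'] can be emulated against the sample HTML with the standard library's html.parser; PrefixIdFinder is just an illustrative name:

```python
from html.parser import HTMLParser

SAMPLE = "<table id='ext-gen1076' class='bats-table bats-table--center'></table>"

class PrefixIdFinder(HTMLParser):
    """Emulates the CSS attribute selector table[id^='ext-gen']."""
    def __init__(self):
        super().__init__()
        self.hits = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        # Match on the static prefix only, since the numeric suffix is dynamic.
        if tag == "table" and a.get("id", "").startswith("ext-gen"):
            self.hits.append(a["id"])

finder = PrefixIdFinder()
finder.feed(SAMPLE)
print(finder.hits)  # ['ext-gen1076']
```

The same prefix keeps working across page loads even though the trailing number changes.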
Reference
You can find a relevant detailed discussion in:
How to find element by part of its id name in selenium with python
How to get selectors with dynamic part inside using Selenium with Python?
Related
I am trying to extract a table from a webpage with Python. I managed to get all the contents inside of that table, but since I am very new to web scraping I don't know how to keep only the elements that I am looking for.
I know that I should look for this class in the code: <a class="_3BFvyrImF3et_ZF21Xd8SC", which specifies the items in the table.
So how can I keep only those classes and then extract their titles?
<a class="_3BFvyrImF3et_ZF21Xd8SC" title="r/Python" href="/r/Python/">r/Python</a>
<a class="_3BFvyrImF3et_ZF21Xd8SC" title="r/Java" href="/r/Java/">r/Java</a>
I miserably failed at writing code for that. I don't know how I could extract only these classes, so any input will be highly appreciated.
To extract the value of title attributes you can use list comprehension and you can use either of the following locator strategies:
Using CSS_SELECTOR:
print([my_elem.get_attribute("title") for my_elem in driver.find_elements(By.CSS_SELECTOR, "a._3BFvyrImF3et_ZF21Xd8SC[title]")])
Using XPATH:
print([my_elem.get_attribute("title") for my_elem in driver.find_elements(By.XPATH, "//a[@class='_3BFvyrImF3et_ZF21Xd8SC' and @title]")])
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
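For what it's worth, the same filter-then-collect idea can be checked without a browser using the standard library's html.parser against the sample anchors above (TitleCollector is just an illustrative name):

```python
from html.parser import HTMLParser

SAMPLE = """
<a class="_3BFvyrImF3et_ZF21Xd8SC" title="r/Python" href="/r/Python/">r/Python</a>
<a class="_3BFvyrImF3et_ZF21Xd8SC" title="r/Java" href="/r/Java/">r/Java</a>
<a class="other" title="skip me" href="/x/">x</a>
"""

class TitleCollector(HTMLParser):
    """Keeps the title of every <a> whose class attribute matches exactly."""
    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("class") == "_3BFvyrImF3et_ZF21Xd8SC" and "title" in a:
            self.titles.append(a["title"])

collector = TitleCollector()
collector.feed(SAMPLE)
print(collector.titles)  # ['r/Python', 'r/Java']
```

The third anchor is skipped because its class does not match, mirroring what the locator does in the live page.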
Okay, I have made a very simple thing that worked.
Basically I pasted the code into VS Code and then selected all the occurrences of that class. Then I just had to copy and paste them into another file. Not sure why the shortcut Ctrl + Shift + L did not work, but I have managed to get what I needed.
Select all occurrences of selected word in VSCode
Last week user @KunduK kindly helped me scrape a website to return the address of a particular record.
Record in question : https://register.fca.org.uk/s/firm?id=001b000000MfQU0AAN
By Using the following snippet of Code;
address=WebDriverWait(driver,10).until(EC.visibility_of_element_located((By.CSS_SELECTOR,"h4[data-aura-rendered-by] ~p:nth-of-type(1)"))).text
print(address)
However, whilst trying to understand the snippet I started to see some additional data being returned.
In the screenshot below, the left side shows the expected results; the right side shows what is actually being returned.
Inspecting the element, I can see there is an additional row (highlighted in yellow) that's not being presented in the UI (right-hand side).
I am also trying to get the "Website" and "Reference Number" and, following the example provided before as well as these steps (https://www.scrapingbee.com/blog/selenium-python/), I am not able to get the desired results.
Current Code:
Website=WebDriverWait(driver,10).until(EC.visibility_of_element_located((By.CSS_SELECTOR,".accordion_text h4"))).text
print(Website)
Website Inspect
Looking forward to your help!
To extract the Website address and Firm reference number ideally you need to induce WebDriverWait for the visibility_of_element_located() and you can use either of the following locator strategies:
Using Website address:
driver.get('https://register.fca.org.uk/s/firm?id=001b000000MfQU0AAN')
print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, "//h4[text()='Website']//following-sibling::a[1]"))).get_attribute("href"))
Using Firm reference number:
driver.get('https://register.fca.org.uk/s/firm?id=001b000000MfQU0AAN')
print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, "//h4[text()='Firm reference number']//following-sibling::p[1]"))).text)
Console Output:
https://www.masonowen.com/
311960
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python
References
Link to useful documentation:
get_attribute() method Gets the given attribute or property of the element.
text attribute returns The text of the element.
Difference between text and innerHTML using Selenium
So, I need to get the text (number of people recovered from covid) from this webpage into the console, but I can't find the class for the numbers. Can someone help me locate the class so I can print the numbers to the console? I need to use PhantomJS because I don't want the browser window to open when I run the code.
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get('https://www.tvnet.lv/covid19Live')
text = driver.find_element_by_class_name("covid-summary__count covid-c-recovered")
print(text)
find_element_by_class_name() expects a single class as an argument but you are providing two class names (class is a "multi-valued attribute", multiple values are separated by a space).
Either check for a single class:
driver.find_element_by_class_name("covid-c-recovered")
Or, switch to a CSS selector:
driver.find_element_by_css_selector(".covid-summary__count.covid-c-recovered")
Digging Deeper
Let's look at the source code. When elements are searched by class name, Python selenium actually constructs a CSS selector under the hood:
elif by == By.CLASS_NAME:
by = By.CSS_SELECTOR
value = ".%s" % value
This means that when you've used covid-summary__count covid-c-recovered as a class name value, the actual CSS selector that was used to find an element happened to be:
.covid-summary__count covid-c-recovered
which understandably did not match any elements (covid-c-recovered would be considered as a tag name here).
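To make the failure mode concrete, the translation quoted above can be mirrored in a two-line sketch (class_name_to_css is a made-up helper, not Selenium API):

```python
def class_name_to_css(value):
    # Mirrors Selenium's By.CLASS_NAME translation: prefix the raw
    # value with a dot, with no escaping of embedded spaces.
    return ".%s" % value

# A single class produces a valid class selector:
print(class_name_to_css("covid-c-recovered"))  # .covid-c-recovered
# Two space-separated classes produce a descendant selector, where the
# second token is read as a tag name, so no element matches:
print(class_name_to_css("covid-summary__count covid-c-recovered"))
```

This is why joining the classes with dots yourself (or switching to a CSS selector) fixes the lookup.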
If you want the number, make sure you have dots between the class names.
driver.get('https://www.tvnet.lv/covid19Live')
element = driver.find_element_by_class_name("covid-summary__count.covid-c-recovered")
print(element.text)
Outputs
19 072
As per the documentation of the selenium.webdriver.common.by implementation:
class selenium.webdriver.common.by.By
Set of supported locator strategies.
CLASS_NAME = 'class name'
So using find_element_by_class_name() you won't be able to pass multiple class names as it accepts a single class.
Solution
To print the number of people HEALED you can use either of the following Locator Strategies:
LATVIJĀ:
print(driver.find_element_by_xpath("//h1[contains(., 'COVID-19 LATVIJĀ')]//following::ul[1]//p[@class='covid-summary__count covid-c-recovered']").text)
PASAULĒ:
print(driver.find_element_by_xpath("//h1[contains(., 'COVID-19 PASAULĒ')]//following::ul[1]//p[@class='covid-summary__count covid-c-recovered']").text)
Ideally you need to induce WebDriverWait for the visibility_of_element_located() and you can use either of the following Locator Strategies:
LATVIJĀ:
driver.get("https://www.tvnet.lv/covid19Live")
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h1[contains(., 'COVID-19 LATVIJĀ')]//following::ul[1]//p[@class='covid-summary__count covid-c-recovered']"))).text)
PASAULĒ:
driver.get("https://www.tvnet.lv/covid19Live")
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h1[contains(., 'COVID-19 PASAULĒ')]//following::ul[1]//p[@class='covid-summary__count covid-c-recovered']"))).text)
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Console Output:
19 072
52 546 925
You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python
References
You can find a couple of relevant detailed discussions in:
Invalid selector: Compound class names not permitted error using Selenium
How to locate an element with multiple class names?
I am trying to extract some information from the amazon website using selenium. But I am not able to scrape that information using xpath in selenium.
In the image below I want to extract the info highlighted.
This is the code I am using
try:
path = "//div[@id='desktop_buybox']//div[@class='a-box-inner']//span[@class='a-size-small')]"
seller_element = WebDriverWait(driver, 5).until(
EC.visibility_of_element_located((By.XPATH, path)))
except Exception as e:
print(e)
When I run this code, it shows that there is an error with seller_element = WebDriverWait(driver, 5).until( EC.visibility_of_element_located((By.XPATH, path))) but does not say what exception it is.
I tried looking online and found that this happens when selenium is not able to find the element in the webpage.
But I think the path I have specified is right. Please help me.
Thanks in advance
[EDIT-1]
This is the exception I am getting
Message:
//div[@class='a-section a-spacing-none a-spacing-top-base']//span[@class='a-size-small a-color-secondary']
The XPath could be something like this; you can shorten it.
The CSS selectors could be as follows:
.a-section.a-spacing-none.a-spacing-top-base
.a-size-small.a-color-secondary
I think the reason is that the xpath expression is not correct.
Take the following element as an example; it means the span has two classes:
<span class="a-size-small a-color-secondary">
So, span[@class='a-size-small') will not work.
Instead of this, you can use xpath as
//span[contains(@class, 'a-size-small') and contains(@class, 'a-color-secondary')]
or cssSelector as
span.a-size-small.a-color-secondary
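The semantics of such a chained class selector (it matches when the element carries all of the listed classes, in any order) can be sketched in plain Python; matches_all_classes is a hypothetical helper, not a Selenium call:

```python
def matches_all_classes(class_attr, required):
    # CSS "span.a-size-small.a-color-secondary" matches an element when
    # its space-separated class list contains every required class.
    return set(required) <= set(class_attr.split())

print(matches_all_classes("a-size-small a-color-secondary",
                          ["a-size-small", "a-color-secondary"]))  # True
print(matches_all_classes("a-size-small",
                          ["a-size-small", "a-color-secondary"]))  # False
```

This is also why @class='a-size-small' fails in XPath: there @class compares against the full attribute string, not the class list.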
Amazon updates its content on the basis of the country you are living in. When I clicked on the link you provided, I did not find the element you are looking for, simply because the item is not sold here in India.
So, in short: if you are sitting in India and try to find your element, it is not there, but as you change the location to "United States", it appears.
Solution - Change the location
To print the Ships from and sold by Amazon.com text of an element you have to induce WebDriverWait for the visibility_of_element_located() and you can use either of the following Locator Strategies:
Using CSS_SELECTOR and get_attribute():
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.a-section.a-spacing-none.a-spacing-top-base > span.a-size-small.a-color-secondary"))).get_attribute("innerHTML"))
Using XPATH and text attribute:
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='a-section a-spacing-none a-spacing-top-base']/span[@class='a-size-small a-color-secondary']"))).text)
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python
Outro
Link to useful documentation:
get_attribute() method Gets the given attribute or property of the element.
text attribute returns The text of the element.
Difference between text and innerHTML using Selenium
I've got markup of the following format I'm attempting to work with using Selenium/Python:
<tr>
<td><a href="...">google</a></td>
<td>useless text</td>
<td>useless text2</td>
<td>useless text3</td>
<td><a href="needle@email.com">emailaddress</a></td>
</tr>
The idea being that given a known email address (part of the href in the emailaddress td), I can get to (and click) the a in the first td. It looks like xpath is the best choice to accomplish this with Selenium. I'm trying the following xpath:
//*[@id="page_content"]/table/tbody/tr[2]/td[2]/div/table[1]/tbody/tr/td[4]/a[contains(@href, "mailto:needle@email.com")]/../../td/a[0]
But I'm getting this error:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"xpathhere"}
I do know the xpath to get to the "needle@email.com" a is correct, as it's just copied from Chrome dev tools, so the error must be with the part of the xpath after reaching the first a element. Can anyone shed some light on the problem with my xpath?
Try to use below code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait
xpath = '//td[a[@href="needle@email.com"]]/preceding-sibling::td/a'
wait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, xpath))).click()
This should allow you to match the first link based on the href attribute of the last link in the table row (tr) and click it once it becomes clickable.
First, note that (this may be a meaningless typo) you are looking for "mailto:needle@email.com" while the value of your href attribute is "needle@email.com".
Second, you actually know how to get back [...]. But XPath indexing starts with 1, so why this 'a[0]'? Is this also a meaningless typo?
Anyway, this xpath would get your sibling
'//a[contains(@href, "needle@email.com")]/../../td[1]/a[1]'
Or, more accurately than using contains (since you may have other email addresses that could be matched, e.g. "otherneedle@email.com"):
'//a[@href="needle@email.com"]/../../td[1]/a[1]'
Or even better, i.e. with no index and no parent/child-style exploration:
'//td[a[@href="needle@email.com"]]/preceding-sibling::td/a'
All tested.
Try to find the tr that contains that email, and click on the first link from it.
//tr[.//a[contains(@href, 'your_email')]]//a
or
//tr[.//a[contains(@href, 'your_email')]]//a[@href]
or
//tr[.//a[contains(@href, 'your_email')]]//a[contains(@href, 'common_url_part')]
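The row-scoped logic of these XPath expressions (find the row containing the email link, then take a link from that row) can be illustrated without a browser via the standard library's html.parser; the markup and the RowScanner name below are hypothetical:

```python
from html.parser import HTMLParser

ROW = """
<table><tr>
  <td><a href="http://example.com">google</a></td>
  <td>useless text</td>
  <td><a href="mailto:needle@email.com">emailaddress</a></td>
</tr></table>
"""

class RowScanner(HTMLParser):
    """Mimics //tr[.//a[contains(@href, needle)]]//a: first link of the matching row."""
    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.row_links = []
        self.match = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row_links = []  # start collecting links for a fresh row
        elif tag == "a":
            self.row_links.append(dict(attrs).get("href", ""))

    def handle_endtag(self, tag):
        # Once the row is closed, keep its first link if the needle appeared.
        if tag == "tr" and any(self.needle in h for h in self.row_links):
            self.match = self.row_links[0]

scanner = RowScanner("needle@email.com")
scanner.feed(ROW)
print(scanner.match)  # http://example.com
```

The key design point is scoping: the email link only selects the row, while the link actually returned comes from elsewhere in that same row.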
Your HTML should be this.
<tr>
<td><a href="...">google</a></td>
<td>useless text</td>
<td>useless text2</td>
<td>useless text3</td>
<td><a href="mailto:needle@email.com">emailaddress</a></td>
</tr>
Otherwise, your user can click on the link until they work themselves into a frenzy. :)
Then you can do this in selenium.
>>> from selenium import webdriver
>>> driver = webdriver.Chrome()
>>> driver.get("file://c:/scratch/temp2.htm")
>>> link = driver.find_element_by_xpath('.//a[contains(@href,"needle@email.com")]')
>>> link.click()
I've used contains because an email address in a link can be something like mailto:Jose Greco <needle@email.com>.
PS: And incidentally I've just executed this stuff on my machine.