How to find all elements and their text using Selenium in Python

There is an anchor tag (<a>) under a div class, and under the <a> tag there is a <p> tag whose class matches 12 items. I am trying to find all the text under those <p> tags using Python.
Here is my code.
First approach:
for ele in driver.find_element_by_xpath('//p[@class="BrandCard___StyledP-sc-1kq2v0k-1 bAWFRI text-sm font-semibold text-center cursor-pointer"]'):
    print(ele.text)
Second approach:
content_blocks = driver.find_element(By.CSS_SELECTOR, "div.CategoryBrand__GridCategory-sc-17tjxen-0.PYjFK.my-4")
for block in content_blocks:
    elements = block.find_elements_by_tag_name("a")
    for el in elements:
        list_of_hrefs.append(el.text)
But every time it gives me the error "WebElement is not iterable".
I have added a picture of the page element.

This should help you: in your first approach you are missing the "s" in find_elements (with the "s" it returns a list of all matches; without it, only the first match, which is a single WebElement and not iterable).
I use an XPath expression with contains() to match a substring of the class:
r_elems = driver.find_elements_by_xpath("//p[contains(@class, 'BrandCard')]")
[x.text for x in r_elems]
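For the second approach, the same fix applies: use find_elements (plural) so you get a list back. A minimal corrected sketch, assuming the same class names from the question:

from selenium.webdriver.common.by import By

# find_elements (plural) returns a list, so iterating over it works
content_blocks = driver.find_elements(By.CSS_SELECTOR, "div.CategoryBrand__GridCategory-sc-17tjxen-0.PYjFK.my-4")
list_of_texts = []
for block in content_blocks:
    for anchor in block.find_elements(By.TAG_NAME, "a"):
        list_of_texts.append(anchor.text)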

Related

How to extract the href attribute of an element using Selenium and Python

I want to scrape the URLs within the HTML of the 'Racing-Next to Go' section of www.tab.com.au.
Here is an excerpt of the HTML:
<a ng-href="/racing/2020-07-31/MACKAY/MAC/R/8" href="/racing/2020-07-31/MACKAY/MAC/R/8"><i ng-
All I want to scrape is the last bit of that HTML which is a link, so:
/racing/2020-07-31/MACKAY/MAC/R/8
I have tried to find the element by using xpath, but I can't get the URL I need.
My code:
driver = webdriver.Firefox(executable_path=r"C:\Users\Harrison Pollock\Downloads\Python\geckodriver-v0.27.0-win64\geckodriver.exe")
driver.get('https://www.tab.com.au/')
elements = driver.find_elements_by_xpath('/html/body/ui-view/main/div[1]/ui-view/version[2]/div/section/section/section/race-list/ul/li[1]/a')
for e in elements:
    print(e.text)
Probably you want to use get_attribute instead of .text. Documentation here.
elements = driver.find_elements_by_xpath('/html/body/ui-view/main/div[1]/ui-view/version[2]/div/section/section/section/race-list/ul/li[1]/a')
for e in elements:
    print(e.get_attribute("href"))
Yes, you can use the getAttribute(attributeLocator) function (Selenium RC) for this requirement:
selenium.getAttribute("//xpath@href");
Specify the XPath of the element whose attribute value you need.
The value /racing/2020-07-31/MACKAY/MAC/R/8 within the HTML is the value of the href attribute, not the innerText.
Solution
Instead of using the text attribute you need to use get_attribute("href") and the effective lines of code will be:
elements = driver.find_elements_by_xpath('/html/body/ui-view/main/div[1]/ui-view/version[2]/div/section/section/section/race-list/ul/li[1]/a')
for e in elements:
    print(e.get_attribute("href"))
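If the race list is rendered dynamically (likely on tab.com.au), a hedged variant is to wait for the anchors before reading href; this sketch assumes the same XPath from the question and a 10-second timeout:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

xpath = '/html/body/ui-view/main/div[1]/ui-view/version[2]/div/section/section/section/race-list/ul/li[1]/a'
# wait until at least one matching anchor is present, then read the href attributes
elements = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, xpath)))
for e in elements:
    print(e.get_attribute("href"))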

Scrapy selector: get nth-child text of an element

I am using Scrapy selector to extract fields from html
xpath = /html/body/path/to/element/text()
This is similar to the question scrapy get nth-child text of same class,
and following the documentation we can use the .getall() method to get all matching elements and select a specific one from the list.
selected_list = Selector(text=soup.prettify()).xpath(xpath).getall()
Is it possible to directly specify which nth element to select in the xpath itself?
Something like below
xpath = /html/body/path/to/element/text(2)  # to select the 3rd text child
Example
<body>
<div>
<i class="ent_sprite remind_icon">
</i>
text that needs to be
</div>
</body>
The result of response.xpath('/body/div/text()').getall() consists of 2 elements:
'\n'
'text that needs to be'
You can use following-sibling:: to get the nearest following sibling of an expression. For example, in this case you want the nearest text() node after the <i> tag, so you can do:
response.xpath('//i[@class="ent_sprite remind_icon"]/following-sibling::text()').get()
This gives you the nearest text() node after <i class="ent_sprite remind_icon">.
If you want the nth nearest following sibling of a node, the XPath would be following-sibling::node()[n]; in our case that becomes:
'//i[@class="ent_sprite remind_icon"]/following-sibling::text()[n]'
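To answer the original question directly: XPath supports positional predicates on text() nodes, but indexing is 1-based, so the nth text node is text()[n], not text(n). A self-contained sketch against the example HTML (assuming Scrapy is installed):

from scrapy.selector import Selector

html = '''<body>
<div>
<i class="ent_sprite remind_icon"></i>
text that needs to be
</div>
</body>'''

sel = Selector(text=html)
# positional predicate: the second text() child of the div (the first is the newline before <i>)
print(sel.xpath('//div/text()[2]').get())
# equivalent: the nearest text() sibling after the <i> tag
print(sel.xpath('//i[@class="ent_sprite remind_icon"]/following-sibling::text()[1]').get())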

Python/Selenium Finding a specific class element, analyzing if it contains a specific span class, if it does, copy the link

Trying to create a script that loops through my inbox and finds all divs with class "relative flex". If the div contains a span with class "dn dib-l", then it copies and saves the following href link to my list and moves on to the next div.
Here is the html code:
<div class="relative flex">
<span class="dn dib-l" style="left: -16px;"></span>
hey how are you?
Here is the code I have now:
link_list = []
sex_list = []
message = browser.find_elements_by_xpath('//*[@class="relative flex"]')
message_new = browser.find_elements_by_xpath('//*[@class="dn dib-l"]')
for item in message:
    link = item.find_element_by_xpath('.//a').get_attribute('href')
    if message_new in message:
        link_list.append(link)
Issue:
message and message_new both contain data when requested; however, despite there being multiple messages with these classes, the link variable only contains one element and link_list contains no elements. What changes do I need to make in my code for it to save all links within div classes that contain this span class?
I would restructure this code a bit to make it more efficient. To me, it sounds like you want to analyze all of the div elements that have class relative flex. Then, if the div contains a certain span element, you want to save the href tag of the following a item. Here's how I would write this:
# locate the span elements which exist under your desired div
spans_to_iterate = browser.find_elements_by_xpath("//div[contains(@class, 'relative flex')]/span[contains(@class, 'dn dib-l')]")
link_list = []
# iterate span elements to save the href attribute of the following a element
for span in spans_to_iterate:
    # get the href attribute, where the 'a' element is a following sibling of the span
    link_text = span.find_element_by_xpath("following-sibling::a").get_attribute("href")
    link_list.append(link_text)
The idea behind this code is that we first retrieve the span elements that exist in your desired div. In your problem description, you mentioned you only wanted to save the link if the div and span elements contained specific class names. So, we query directly for the elements you mentioned, rather than finding the div first and then the span.
Then, we iterate these span elements and use XPath's following-sibling notation to grab the a element that appears right after. We can use get_attribute to grab the href attribute, and then append the link to the list.
Try this:
xpth = "//div[@class='relative flex' and span[@class='dn dib-l']]//a"
links = [e.get_attribute('href') for e in browser.find_elements_by_xpath(xpth)]

How to extract the required element using find_all()

I am trying to extract the authors' names from an Amazon page. The problem is, there are many tags with the same class and no other attributes to identify the exact element. I want to extract the author name, which is present in the second span tag.
<div class="a-row a-spacing-none">
<span class="a-size-small a-color-secondary">by </span>
<span class="a-size-small a-color-secondary"><a class="a-link-normal a-text-normal" href="/Arthur-Conan-Doyle/e/B000AQ43GQ/ref=sr_ntt_srch_lnk_2?qid=1510823399&sr=8-2">Arthur Conan Doyle</a></span></div>
As we can see, both span tags have the same class; I want the second one. Moreover, the a tag is not present in all blocks, so I have to use only the span tag to extract the author name. How could I get the author name?
I am using BeautifulSoup and Selenium. My code is:
soup = BeautifulSoup(self.driver.page_source, "html.parser")
titles = soup.find_all("h2", {"class": "a-size-medium s-inline s-access-title a-text-normal"})
authors = soup.find_all("span", {"class": "a-size-small a-color-secondary"})
for value in range(len(titles)):
    d = {}
    d["Title"] = titles[value].text
    d["Author"] = authors[value + 2].text
    title.append(d)
Find the parent "div" element of that "span", then extract the entire text of the div tag. As you can observe, there will be a "by" substring in every block. Use it to split the text and copy the author part into d["Author"]. If "by" is not present, check before copying into the dictionary with an if condition; if you copy directly, you may get an index-out-of-bounds error, so use the if.
Here is the code:
temp = authors[value].text
temp1 = temp.split("by")
# print(temp[1])
if temp1[0] != temp:
    d["Author"] = temp1[1]
else:
    d["Author"] = "None"
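A self-contained sketch of that idea, using the markup from the question (splitting the whole div text on "by", with a "None" fallback when no author is present):

from bs4 import BeautifulSoup

html = '''<div class="a-row a-spacing-none">
<span class="a-size-small a-color-secondary">by </span>
<span class="a-size-small a-color-secondary"><a class="a-link-normal a-text-normal" href="/Arthur-Conan-Doyle/e/B000AQ43GQ/ref=sr_ntt_srch_lnk_2?qid=1510823399&sr=8-2">Arthur Conan Doyle</a></span></div>'''

soup = BeautifulSoup(html, "html.parser")
for div in soup.find_all("div", {"class": "a-row a-spacing-none"}):
    parts = div.get_text().split("by", 1)
    # if "by" was found, the author name follows it; otherwise record "None"
    author = parts[1].strip() if len(parts) > 1 else "None"
    print(author)  # Arthur Conan Doyle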

Iterating Over Elements and Sub Elements With lxml

This one is for legitimate lxml gurus. I have a web scraping application where I want to iterate over a number of div.content (content is the class) tags on a website. Once in a div.content tag, I want to see if there are any <a> tags that are the children of <h3> elements. This seems relatively simple by just trying to create a list using XPath from the div.cont tag, i.e.,
linkList = tree.xpath('div[contains(@class,"cont")]//h3//a')
The problem is, I then want to create a tuple that contains the link from the div.content box as well as the text from the paragraph element of the same div.content box. I could obviously iterate over the whole document and store all of the paragraph text as well as all of the links, but I wouldn't have any real way of matching the appropriate paragraphs to the <a> tags.
lxml's Element.iter() function could ALMOST achieve this by iterating over all of the div.cont elements, ignoring those without <a> tags, and pairing up the paragraph/a combos, but unfortunately there doesn't seem to be any option for iterating over class names, only tag names, with that method.
Edit: here's an extremely stripped down version of the HTML I want to parse:
<body>
<div class="cont">
<h1>Random Text</h1>
<p>The text I want to obtain</p>
<h3><a href="...">The link I want to obtain</a></h3>
</div>
</body>
There are a number of div.conts like this that I want to work with -- most of them have far more elements than this, but this is just a sketch to give you an idea of what I'm working with.
You could just use a less specific XPath expression:
for matchingdiv in tree.xpath('div[contains(@class,"cont")]'):
    # skip those without a h3 > a setup.
    link = matchingdiv.xpath('.//h3//a')
    if not link:
        continue
    # grab the `p` text and of course the link.
You could expand this (be ambitious) and select for the h3 > a tags, then go to the div.cont ancestor (based off XPath query with descendant and descendant text() predicates):
for matchingdiv in tree.xpath('.//h3//a/ancestor::*[self::div[contains(@class,"cont")]]'):
    # no need to skip anymore; this is a div.cont with h3 and a contained
    link = matchingdiv.xpath('.//h3//a')
    # grab the `p` text and of course the link
but since you need to then scan for the link anyway that doesn't actually buy you anything.
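Putting the first variant together, a minimal sketch that pairs each div.cont's paragraph text with its link (assuming page_source holds the fetched HTML and the link's href attribute is what's wanted):

from lxml import html

tree = html.fromstring(page_source)  # page_source: the HTML string to parse
pairs = []
for matchingdiv in tree.xpath('//div[contains(@class, "cont")]'):
    links = matchingdiv.xpath('.//h3//a')
    if not links:
        continue  # skip divs without an h3 > a setup
    paragraphs = matchingdiv.xpath('.//p/text()')
    # pair the paragraph text with the first matching link's href
    pairs.append((paragraphs[0].strip() if paragraphs else None,
                  links[0].get('href')))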
