How to extract the required element using find_all() - python

I am trying to extract the authors' names from an Amazon page. The problem is that many tags share the same class, and there are no other attributes to identify the exact element. The author name is in the second span tag:
<div class="a-row a-spacing-none">
<span class="a-size-small a-color-secondary">by </span>
<span class="a-size-small a-color-secondary"><a class="a-link-normal a-text-normal" href="/Arthur-Conan-Doyle/e/B000AQ43GQ/ref=sr_ntt_srch_lnk_2?qid=1510823399&sr=8-2">Arthur Conan Doyle</a></span></div>
As we can see, both span tags have the same class, and I want the second one. Moreover, the a tag is not present in every block, so I have to rely on the span tag alone to extract the author name. How can I get the author name?
I am using BeautifulSoup and Selenium. My code is:
soup=BeautifulSoup(self.driver.page_source,"html.parser")
titles=soup.find_all("h2",{"class":"a-size-medium s-inline s-access-title a-text-normal"})
authors=soup.find_all("span",{"class":"a-size-small a-color-secondary"})
for value in range(len(titles)):
    d={}
    d["Title"]=titles[value].text
    d["Author"]=authors[value+2].text
    title.append(d)

Find the parent "div" element of that "span", then extract the entire text of the div tag. As you can see, there is a "by" substring in every block. Use it to split the text and copy the part after it into d["Author"]. Check with an if condition that "by" is actually present before copying: if you index the split result directly, you may get an IndexError (list index out of range) for blocks without "by".
Here is the code:
temp = authors[value].text
temp1 = temp.split("by")
# print(temp1[1])
# If "by" is absent, split() returns the whole text as its only item.
if temp1[0] != temp:
    d["Author"] = temp1[1]
else:
    d["Author"] = "None"
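As an alternative to splitting on "by", here is a minimal sketch that locates the "by" label span and reads the span that follows it. It assumes every author row starts with a span whose text is just "by", as in the question's snippet. The second row, the author name "Jane Example", the shortened href, and the helper name extract_authors are made up for illustration.

```python
from bs4 import BeautifulSoup

# First row is modelled on the question; the second row (no <a> tag) is hypothetical.
html = """
<div class="a-row a-spacing-none">
<span class="a-size-small a-color-secondary">by </span>
<span class="a-size-small a-color-secondary"><a class="a-link-normal a-text-normal" href="/Arthur-Conan-Doyle/e/B000AQ43GQ">Arthur Conan Doyle</a></span></div>
<div class="a-row a-spacing-none">
<span class="a-size-small a-color-secondary">by </span>
<span class="a-size-small a-color-secondary">Jane Example</span></div>
"""


def extract_authors(soup):
    """Return the text of the span that follows each "by" label span."""
    authors = []
    # Match only spans whose own text is exactly "by" (ignoring whitespace).
    labels = soup.find_all(
        "span", string=lambda s: s is not None and s.strip() == "by"
    )
    for label in labels:
        name_span = label.find_next_sibling("span")
        # get_text() flattens children, so an inner <a> is optional.
        authors.append(name_span.get_text(strip=True) if name_span else None)
    return authors


authors = extract_authors(BeautifulSoup(html, "html.parser"))
print(authors)
```

Because the lookup is anchored on the label rather than on a list index, it does not depend on how many other spans with the same class appear earlier in the page.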

Related

How to find all element and text by using Selenium in python

There is an anchor tag (<a>) under the div, and under the <a> tag there is a <p> tag with a class. That <p> class matches 12 items. I am trying to get the text of all of those p tags using Python.
Here is my code.
First approach:
for ele in driver.find_element_by_xpath('//p[@class="BrandCard___StyledP-sc-1kq2v0k-1 bAWFRI text-sm font-semibold text-center cursor-pointer"]'):
    print(ele.text)
Second approach:
content_blocks=driver.find(By.CSS_SELECTOR, "div.CategoryBrand__GridCategory-sc-17tjxen-0.PYjFK.my-4")
for block in content_blocks:
    elements = block.find_elements_by_tag_name("a")
    for el in elements:
        list_of_hrefs.append(el.text)
but every time it gives me an error "WebElement is not iterable".
I have added a picture of the page element.
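This should help you. In your first approach you are missing the "s" in find_elements: with the "s", the method returns a list of all matches; without it, it returns only the first match, which is a single WebElement and cannot be iterated.
I use an XPath with contains() to match a substring of the class attribute.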
r_elems = driver.find_elements_by_xpath("//p[contains(@class, 'BrandCard')]")
[x.text for x in r_elems]

How to scrape a single element out of 2 elements having same set of attributes and same hierarchy in html source code (using python's beautiful soup)

I want to scrape the element highlighted in blue in the image. That element represents the number of votes for a particular movie. Whenever I try to scrape it, I also get the bottom element in the image, which represents the "collections" for that movie, because both elements have the same attributes and sit at the same level of the hierarchy. Is there a way to extract only the highlighted element?
One approach could be iterating over all siblings of <p class="sort-num_votes-visible"> and, if you find a <span name="nv"> that's surrounded by a <span class="text-muted"> and a <span class="ghost">, then this must be the span you're looking for. This of course implies that the structure of this snippet of HTML is always the same. If one of those spans could be missing, then this method obviously fails.
If it's guaranteed that those two spans are always there and in that exact order, you could do something like this (your souped HTML is in html_soup):
votes = html_soup.find("p", {"class": "sort-num_votes-visible"}).find_all("span", {"name": "nv"})[0]
EDIT:
According to your comment, you could do the following to parse the votes for multiple movies:
for p in html_soup.find_all("p", {"class": "sort-num_votes-visible"}):
    votes = p.find_all("span", {"name": "nv"})[0]
    # Put whatever code you need here for each of your movies
    ...
You can use something like this (assuming that you are using BeautifulSoup):
soup = BeautifulSoup('yourhtml', 'lxml')
p_sort = soup.find('p', {'class':'sort-num_votes-visible'})
req_span = p_sort.find_all('span', {'name':'nv'})[0]
req_span will contain the tag you were asking about.
If the order of these two similar span elements is always the same, you can select the first element of the result, or use .find() instead of .find_all()[0].
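The sibling check described in the first answer (a span with name="nv" sitting between a text-muted span and a ghost span) can be sketched roughly as follows. The markup, the label texts, the numbers, and the helper names has_class and find_votes_span are assumptions made for illustration; only the class names and the name and data-value attributes come from the answers.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the structure described in the answers.
html = """
<p class="sort-num_votes-visible">
  <span class="text-muted">Votes:</span>
  <span name="nv" data-value="1234">1,234</span>
  <span class="ghost">|</span>
  <span class="text-muted">Gross:</span>
  <span name="nv" data-value="5,600,000">$5.60M</span>
</p>
"""


def has_class(tag, css_class):
    """True if tag exists and carries the given CSS class."""
    return tag is not None and css_class in tag.get("class", [])


def find_votes_span(p):
    """Return the nv span that sits between a text-muted and a ghost span."""
    # attrs= is needed because "name" is also find_all()'s first parameter.
    for span in p.find_all("span", attrs={"name": "nv"}):
        before = span.find_previous_sibling("span")
        after = span.find_next_sibling("span")
        if has_class(before, "text-muted") and has_class(after, "ghost"):
            return span
    return None  # structure differs from what this sketch expects


soup = BeautifulSoup(html, "html.parser")
p = soup.find("p", class_="sort-num_votes-visible")
votes = find_votes_span(p)
print(votes.get_text(strip=True) if votes else None)
```

As the first answer warns, this returns None when either neighbouring span is missing, so it only helps if the page structure is stable.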
I think parsel is a better HTML parsing package, with XPath support.
from parsel import Selector
s = Selector(text=html)
nv_data = s.xpath('//span[@name="nv"]/@data-value').extract_first()

Python - How do I find the text of all spans with the id of 'value' using Beautiful Soup?

I would like to get all of the text of the spans which have the class of 'value'.
I then need to get the online ISSN of the page by using the first 9 characters of the text. I don't need the ones with text ending in "(print)", but I do need the ones ending in "(online)".
Example
<span class="bold">ISSN: </span>
<span class="value">0890-037X (Print)</span>
<span class="value">1550-2740 (Online)</span>
Here I would need to get "1550-2740" as it is the online ISSN.
I think I need to find all the spans, check the class and then check the text. If the text ends in "(online)" then I need to get the first 9 characters.
How do I do this?
Thank you in advance.
Use find_all to extract the elements. Create a generator (or a list, if you prefer) that yields the text attribute of each one. Filter out those that do not end in "(Online)" and slice the rest to extract just the ISSN. I have used a generator and next() to get only the first occurrence, but you could use a list if you want all of them (if there are multiple).
Hope this works for the whole file!
soup = BeautifulSoup(open("p.html").read(), "lxml")
txt = (t.text for t in soup.find_all("span", class_="value"))
issn = next(t[:9] for t in txt if t.endswith("(Online)"))
which gives issn as '1550-2740'.
Another way could be something like below:
soup = BeautifulSoup(content,"lxml")
for item in soup.find_all(class_="value"):
    if "Online" in item.text:
        print(item.text.split()[0])
Output:
1550-2740
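Both answers above match the suffix case-sensitively. If the page might spell it "(online)" or "(Online)", a regular expression is one possible variation. This is only a sketch: the pattern name ONLINE_ISSN and the helper online_issns are invented here, and the pattern assumes the usual ISSN shape of four digits, a hyphen, three digits, and a final digit or X.

```python
import re

from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """
<span class="bold">ISSN: </span>
<span class="value">0890-037X (Print)</span>
<span class="value">1550-2740 (Online)</span>
"""

# ISSN shape, followed by "(online)" in any letter case.
ONLINE_ISSN = re.compile(r"^\s*(\d{4}-\d{3}[\dXx])\s*\(online\)\s*$", re.IGNORECASE)


def online_issns(soup):
    """Collect the ISSN part of every span.value marked as online."""
    found = []
    for span in soup.find_all("span", class_="value"):
        match = ONLINE_ISSN.match(span.get_text())
        if match:
            found.append(match.group(1))
    return found


issns = online_issns(BeautifulSoup(html, "html.parser"))
print(issns)
```

Capturing the ISSN with a group instead of slicing the first 9 characters also tolerates leading whitespace in the span text.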

How to extract text from HTML (after certain string)

I have the following HTML:
<li class="group-ib medium-gap line-120 vertical-offset-10">
<i class="fa fa-angle-right font-bold font-95 text-primary text-dark">
::before
</i>
<span>
abc:
<b class="text-primary text-dark">str1</b>
</span>
</li>
And I want to extract str1, which always appears after abc. I was able to do it by using this XPath expression:
xpath('.//b[@class = "text-primary text-dark"]')[0].text
But the solution depended on it being the first appearance of this particular class, which appears more than once and isn't always in the same order. I was wondering if there was a way to search the HTML for abc and pull the subsequent text?
You could find the element that contains abc, navigate to its child or parent if needed, and then get the text.
Example selectors:
Find any element (* matches any tag) that contains the text abc and select any of its children:
//*[contains(text(), 'abc')]/*
Find any element (* matches any tag) that contains the text abc and select its b child:
//*[contains(text(), 'abc')]/b
Find the li element that has a descendant containing the text abc and select the b element inside that li. Use // because b is not a direct child of li:
//li[.//*[contains(text(), 'abc')]]//b
If you know abc, start from there, see which element is returned, and navigate to the parent, ancestor, or child as needed.
For more about XPath, please see w3schools XPath selectors.
The following XPath should give the text you are searching for:
//*[contains(text(),'abc')]/*[@class='text-primary text-dark'][1]/text()
This assumes that the str1 you are looking for is always under an element with the attribute class="text-primary text-dark".
It also assumes that you want the first such occurrence (ignoring any other text-primary text-dark elements); that is why [1] is used.
This XPath ensures that the parent of the node carrying those classes contains the text abc before the node is selected.
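To try a label-anchored XPath outside a scraper, a small sketch with lxml might look like this. The two-item list, the "xyz" label, and the helper name value_after_label are invented for illustration; the expression itself follows the pattern from the answers.

```python
from lxml import html as lxml_html

# Hypothetical list: the same class appears in both items, in either order.
doc = lxml_html.fromstring("""
<ul>
  <li><span>xyz: <b class="text-primary text-dark">other</b></span></li>
  <li><span>abc: <b class="text-primary text-dark">str1</b></span></li>
</ul>
""")


def value_after_label(tree, label):
    """Return the <b> text inside the element whose text contains label."""
    # $label is an XPath variable, so the label needs no manual quoting.
    hits = tree.xpath(
        "//*[contains(text(), $label)]/b[@class='text-primary text-dark']/text()",
        label=label,
    )
    return hits[0].strip() if hits else None


print(value_after_label(doc, "abc"))
```

Because the match starts from the label text, the result does not change if the two list items swap places.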

XPATH to check on a specific text within a node

I have this as a node to parse:
<h3 class="atag">
<a href="http://www.example.com">
<span class="btag">text to be ignored</span>
</a>
<span class="ctag">text to be checked</span>
</h3>
I need to extract "http://www.example.com" but not the text to be ignored; I also need to check that ctag contains text to be checked.
I came up with this, but it doesn't seem to do the job:
response.xpath("//h3/a/@*[not(self::span)]").extract()
Any idea on this?
If you just need to select the href from the 'a' tag, use @href.
To also check whether the ctag contains some text, I think you can use an expression like this:
'//h3[contains(span[@class="ctag"]/text(), "text to be checked")]/a/@href'
This checks whether there is a span with "text to be checked" inside the given h3 block. If the text exists, 'http://www.example.com' is returned; otherwise the result is empty.
Do you mean something like this XPath?
//h3/a[following-sibling::span[@class='ctag' and .='text to be checked']]/@href
The XPath above selects the <a> tag that is followed by a <span class="ctag"> whose value is "text to be checked", then returns the href attribute of that <a> tag.
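The two expressions from the answers, written with @ attribute syntax and balanced brackets, can be compared with a quick sketch. It uses lxml as a stand-in for the response.xpath call in the question, and the second h3 block with example.org is a made-up counterexample whose ctag text does not match.

```python
from lxml import html as lxml_html

# First <h3> is the question's node; the second (example.org) is hypothetical.
doc = lxml_html.fromstring("""
<div>
<h3 class="atag"><a href="http://www.example.com"><span class="btag">text to be ignored</span></a><span class="ctag">text to be checked</span></h3>
<h3 class="atag"><a href="http://www.example.org"><span class="btag">text to be ignored</span></a><span class="ctag">something else</span></h3>
</div>
""")

# Filter on the <h3>: keep it only if its ctag span contains the text.
by_contains = doc.xpath(
    '//h3[contains(span[@class="ctag"]/text(), "text to be checked")]/a/@href'
)

# Filter on the <a>: keep it only if a following sibling ctag span matches exactly.
by_sibling = doc.xpath(
    "//h3/a[following-sibling::span[@class='ctag' and .='text to be checked']]/@href"
)

print(by_contains, by_sibling)
```

The first form accepts any ctag text that contains the phrase, while the second requires the span's text to equal it exactly, so they can differ when the span has extra whitespace or words.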
