BS4 text inside <span> which has no class

I am trying to scrape the 4.1 rating inside the span tag with this Python code, but it returns nothing.
for item in soup.select("._9uwBC wY0my"):
    n = soup.find("span").text()
    print(n)
---------------------------------------
<div class="_9uwBC wY0my">
<span class="icon-star _537e4"></span>
<span>4.1</span>
</div>

@Aditya, I think soup.find("span") will only return the first "span", and you want the text from the second one.
I would try:
for item in soup.select("div._9uwBC.wY0my"):
    spans = item.find_all("span")
    for span in spans:
        n = span.text
        if n != '':
            print(n)
This should print the text of the non-empty span tags under the div you specified.
Does that accomplish what you want?
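To make the difference concrete, here is a minimal, self-contained sketch that runs against the HTML from the question. Two things in the question's code look like the cause of the empty result: the selector "._9uwBC wY0my" searches for a wY0my tag inside ._9uwBC rather than for a div carrying both classes, and .text is a property, not a method. The has_attr("class") check is my own addition for picking the span without a class; it is one possible approach, not the only one.

```python
from bs4 import BeautifulSoup

# HTML copied from the question.
html = """
<div class="_9uwBC wY0my">
<span class="icon-star _537e4"></span>
<span>4.1</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

ratings = []
# "div._9uwBC.wY0my" matches a div that has BOTH classes.
for item in soup.select("div._9uwBC.wY0my"):
    for span in item.find_all("span"):
        # Keep only spans that carry no class attribute at all.
        if not span.has_attr("class"):
            ratings.append(span.get_text(strip=True))

print(ratings)
```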

OK, here's one approach for getting the names and stars for each restaurant on the page. It's not necessarily the most elegant way to do it, but I've tried it a couple of times and it seems to work:
divs = soup.find_all('div')
for div in divs:
    if div.has_attr('class'):
        if div['class'] == ['nA6kb']:  # the class of the divs with the name
            name = div.text
            k = div.find_next('div')  # the next div
            l = k.find_next('div')  # the div with the stars
            spans = l.find_all('span')  # this part is the same as the answer above
            for span in spans:
                n = span.text
                if n != '':
                    print(name, n)
This assumes that the div that contains the stars span is always the second div after the div that contains the restaurant name. It looks like that's always the case, but I'm not positive that it never changes.
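If counting divs feels fragile, one possible variation is to ask find_next for the rating div by class instead of by position. The sketch below runs against hypothetical markup: the card and cuisine classes and the restaurant names are invented for illustration, and only nA6kb and _9uwBC come from this thread. Note that if a restaurant had no rating at all, find_next would pick up the next restaurant's rating, so this is a sketch rather than a complete solution.

```python
from bs4 import BeautifulSoup

# Hypothetical page structure; only the nA6kb and _9uwBC classes come from the thread.
html = """
<div class="card">
  <div class="nA6kb">Pizza Place</div>
  <div class="cuisine">Italian</div>
  <div class="_9uwBC wY0my"><span class="icon-star _537e4"></span><span>4.1</span></div>
</div>
<div class="card">
  <div class="nA6kb">Curry House</div>
  <div class="cuisine">Indian</div>
  <div class="_9uwBC wY0my"><span class="icon-star _537e4"></span><span>3.9</span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

results = []
for name_div in soup.find_all("div", class_="nA6kb"):
    # Next div in document order that has the rating class, however far away it is.
    rating_div = name_div.find_next("div", class_="_9uwBC")
    if rating_div is None:
        continue
    results.append((name_div.get_text(strip=True), rating_div.get_text(strip=True)))

print(results)
```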

Related

Scraping the attribute of the first child from multiple div (selenium)

I'm trying to scrape the class name of the first child (span) from multiple divs.
Here is the HTML code:
<div class="ui_column is-9">
<span class="name1"></span>
<span class="...">...</span>
...
<div class="ui_column is-9">
<span class="name2"></span>
<span class="...">...</span>
...
<div class ..
URL of the page for the complete code.
I'm achieving this task with this code for the first five divs:
i = 0
liste = []
while i <= 4:
    parent = driver.find_elements_by_xpath("//div[@class='ui_column is-9']")[i]
    child = parent.find_element_by_xpath("./child::*")
    class_name = child.get_attribute('class')
    i = i + 1
    liste.append(class_name)
But do you know if there is an easier way to do it?

You can directly get all these first span elements and then extract their class attribute values as follows:
liste = []
# Selenium 4 removed find_elements_by_xpath; there, use driver.find_elements(By.XPATH, ...)
first_spans = driver.find_elements_by_xpath("//div[@class='ui_column is-9']//span[1]")
for element in first_spans:
    class_name = element.get_attribute('class')
    liste.append(class_name)
You can also extract the class attribute values from only the first 5 elements by limiting the loop to 5 iterations.
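The XPath idea can be checked without a browser. The sketch below is not Selenium: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, on a small well-formed stand-in for the page (the root wrapper and the "other" class are made up for the example). It uses /span[1] so that only the first span child of each div is selected.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the page; <root> and class "other" are invented.
markup = """
<root>
  <div class="ui_column is-9">
    <span class="name1"></span>
    <span class="other">text</span>
  </div>
  <div class="ui_column is-9">
    <span class="name2"></span>
    <span class="other">text</span>
  </div>
</root>
"""

root = ET.fromstring(markup)

# First <span> child of every div whose class attribute is exactly 'ui_column is-9'.
first_spans = root.findall(".//div[@class='ui_column is-9']/span[1]")

# Slice to keep at most the first five matches, as in the question.
liste = [span.get("class") for span in first_spans[:5]]

print(liste)
```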
Update:
Now that you have updated your question, the answer is different and much simpler.
You can get the desired elements directly and extract their class attribute values as follows:
liste = []
first_spans = driver.find_elements_by_xpath("//div[@class='ui_column is-9']//span[contains(@class,'ui_bubble_rating')]")
for element in first_spans:
    class_name = element.get_attribute('class')
    liste.append(class_name)

How to find text of <div><span>text</span></div> in beautifulsoup?

This is the HTML:
<div><div id="NhsjLK">
<li class="EditableListItem NavListItem FollowersNavItem NavItem not_removable">
Followers <span class="list_count">92</span></li></div></div>
I want to extract the text 92, convert it to an integer, and print it in Python 2. How can I do that?
Code:
i = soup.find('div', id='NhsjLK')
print "Followers :", i.find('span', id='list_count').text

I wouldn't get it by the class directly, since I think "list_count" is too broad a class value and might be used for other things on the page.
There are definitely several different options judging by this HTML snippet alone, but one of the nicest, from my point of view, is to use the "Followers" text label and get its next sibling:
from bs4 import BeautifulSoup
data = """
<div><div id="NhsjLK">
<li class="EditableListItem NavListItem FollowersNavItem NavItem not_removable">
Followers <span class="list_count">92</span></li></div></div>"""
soup = BeautifulSoup(data, "html.parser")
# strip() first: in this snippet the text node begins with a newline before "Followers"
count = soup.find(text=lambda text: text and text.strip().startswith('Followers')).next_sibling.get_text()
count = int(count)
print(count)
Another very concise and reliable approach is to use a partial match (the *= part below) on the href value of the parent a element:
count = int(soup.select_one("a[href*=followers] .list_count").get_text())
Or, you might check the class value of the parent li element:
count = int(soup.select_one("li.FollowersNavItem .list_count").get_text())
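For comparison, the code in the question most likely fails for a simpler reason: list_count is a class, not an id, so find('span', id='list_count') returns None. A minimal sketch of the smallest fix, using the HTML from the question (the None guard is my own addition):

```python
from bs4 import BeautifulSoup

# HTML copied from the question.
data = """
<div><div id="NhsjLK">
<li class="EditableListItem NavListItem FollowersNavItem NavItem not_removable">
Followers <span class="list_count">92</span></li></div></div>"""

soup = BeautifulSoup(data, "html.parser")

container = soup.find("div", id="NhsjLK")
# list_count is a class, so search with class_ rather than id.
span = container.find("span", class_="list_count")

# Guard against a missing element before converting to int.
count = int(span.get_text(strip=True)) if span is not None else None

print("Followers :", count)
```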

BeautifulSoup Scraping Span Class HTML

I am trying to scrape from the <span class="">. The markup looks like this on the pages I am scraping:
<span class="catnum"> Disc Number </span>
"1"
<br>
<span class="catnum"> Track Number </span>
"1"
<br>
<span class="catnum"> Duration </span>
"5:28"
<br>
What I need to get are those numbers after the </span> tag. I should also mention I am writing a larger piece of code that is scraping 1200 sites and this will have to loop over 1200 sites where the numbers in the quotation marks will change from page to page.
I tried this code as a test on one page:
from bs4 import BeautifulSoup
soup = BeautifulSoup (open("Smith.html"), "html.parser")
for tag in soup.findAll('span'):
    if tag.has_key('class'):
        if tag['class'] == 'catnum':
            print tag.string
I know that will print ALL the 'span class' tags and not just the three I want, but I thought I would still test it to see if it worked and I got this error:
/Library/Python/2.7/site-packages/bs4/element.py:1527: UserWarning:
has_key is deprecated. Use has_attr("class") instead. key))

As the warning message says, you should use tag.has_attr("class") in place of the deprecated tag.has_key("class") method.
Hope it helps.
Simone
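A small sketch of that change in context. One more detail that may matter: Beautiful Soup returns the class attribute as a list, so comparing tag['class'] to the string 'catnum' is never true, and a membership test is needed instead. The two extra spans in the sample markup are invented to show the filtering.

```python
from bs4 import BeautifulSoup

# The first span is from the question; the other two are made up for contrast.
html = (
    '<span class="catnum"> Disc Number </span>'
    '<span>no class here</span>'
    '<span class="other"> Something else </span>'
)

soup = BeautifulSoup(html, "html.parser")

labels = []
for tag in soup.find_all("span"):
    # has_attr replaces the deprecated has_key; tag["class"] is a list of class names.
    if tag.has_attr("class") and "catnum" in tag["class"]:
        labels.append(tag.get_text(strip=True))

print(labels)
```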

You can constrain your search by attribute {'class': 'catnum'} and the text inside text=re.compile('Disc Number'). Then use .next_sibling to find the text:
from bs4 import BeautifulSoup
import re
s = '''
<span class = "catnum"> Disc Number </span>
"1"
<br/>
<span class = "catnum"> Track Number </span>
"1"
<br/>
<span class = "catnum"> Duration </span>
"5:28"
<br/>'''
soup = BeautifulSoup(s, 'html.parser')
span = soup.find('span', {'class': 'catnum'}, text=re.compile(r'Disc Number'))
print(span.next_sibling)
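Since the question mentions looping over many pages, here is a sketch that collects all three values at once instead of searching for one label. It assumes each value is the text node right after its span, as in the sample above; the values dictionary and the quote-stripping are my own choices, not part of the original answer.

```python
from bs4 import BeautifulSoup

s = '''
<span class="catnum"> Disc Number </span>
"1"
<br/>
<span class="catnum"> Track Number </span>
"1"
<br/>
<span class="catnum"> Duration </span>
"5:28"
<br/>'''

soup = BeautifulSoup(s, "html.parser")

values = {}
for span in soup.find_all("span", class_="catnum"):
    label = span.get_text(strip=True)
    sibling = span.next_sibling
    # Only use the sibling if it is a text node (NavigableString subclasses str).
    if isinstance(sibling, str):
        # Drop surrounding whitespace, then any literal quotation marks.
        values[label] = sibling.strip().strip('"')

print(values)
```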

BeautifulSoup: How to skip a child node within a find_all?

I have the following code to scrape this page:
soup = BeautifulSoup(html)
result = u''
# Find Starting point
start = soup.find('div', class_='main-content-column')
if start:
    news.image_url_list = []
    for item in start.find_all('p'):
The problem I'm facing is that it also grabs the <p> inside <div class="type-gallery">, which I would like to avoid, but I can't find a way to do so.
Any ideas, please?

You want direct children, not just any descendant, which is what element.find_all() returns. Your best bet here is to use a CSS selector instead:
for item in soup.select('div.main-content-column > div > p'):
The > operator limits this to p tags that are direct children of div tags, which are themselves direct children of the div with the given class. You can make this as specific as you like, for example by adding the itemprop attribute:
for item in soup.select('div.main-content-column > div[itemprop="articleBody"] > p'):
The alternative is to loop over the element.children iterable:
start = soup.find('div', class_='main-content-column')
if start:
    news.image_url_list = []
    for item in start.children:
        if item.name != 'div':
            # skip children that are not <div> tags
            continue
        for para in item.children:
            if para.name != 'p':
                # skip children that are not <p> tags
                continue
            # para is now a <p> that is a direct child of the div
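Here is a runnable sketch of the direct-child idea on a small stand-in document. The markup is hypothetical (I don't have the real page), with one gallery paragraph nested a level deeper. It also shows find_all(..., recursive=False), which is another way to restrict a search to direct children.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page; the gallery <p> is nested one level deeper.
html = """
<div class="main-content-column">
  <div itemprop="articleBody">
    <p>First paragraph</p>
    <div class="type-gallery"><p>Gallery caption</p></div>
    <p>Second paragraph</p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS child combinator: only <p> tags directly under the articleBody div.
selected = soup.select('div.main-content-column > div[itemprop="articleBody"] > p')
texts = [p.get_text(strip=True) for p in selected]

# Equivalent restriction with find_all: recursive=False looks at direct children only.
body = soup.find("div", itemprop="articleBody")
direct = [p.get_text(strip=True) for p in body.find_all("p", recursive=False)]

print(texts)
print(direct)
```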

How to count the number of lines of code retrieved using beautiful soup?

Is there any function in beautiful soup to count the number of lines retrieved? Or is there any other way this can be done?
from bs4 import BeautifulSoup
import string
content = open("webpage.html","r")
soup = BeautifulSoup(content)
divTag = soup.find_all("div", {"class":"classname"})
for tag in divTag:
    ulTags = tag.find_all("ul", {"class":"classname"})
    for tag in ulTags:
        aTags = tag.find_all("a",{"class":"classname"})
        for tag in aTags:
            name = tag.find('img')['alt']
            print(name)

If you meant to get the number of elements retrieved by find_all(), try using the len() function:
......
redditAll = soup.find_all("a")
print(len(redditAll))
UPDATE:
You can change the logic to select specific elements in one go using a CSS selector. This way, getting the number of elements retrieved is as easy as calling len() on the return value:
imgTags = soup.select("div.classname ul.classname a.classname img")
# print the number of <img> tags retrieved:
print(len(imgTags))
for tag in imgTags:
    name = tag['alt']
    print(name)
Or you can keep the logic using multiple for loops and manually keep track of the number of elements in a variable:
counter = 0
divTag = soup.find_all("div", {"class":"classname"})
for tag in divTag:
    ulTags = tag.find_all("ul", {"class":"classname"})
    for tag in ulTags:
        aTags = tag.find_all("a",{"class":"classname"})
        for tag in aTags:
            name = tag.find('img')['alt']
            print(name)
            # update counter:
            counter += 1
print(counter)
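To try the counting on something concrete, here is a sketch with made-up markup (the alt values, file names, and the "other" div are invented). Only the two images nested under div.classname, ul.classname, and a.classname should be counted.

```python
from bs4 import BeautifulSoup

# Made-up sample page; the last div should not contribute to the count.
html = """
<div class="classname">
  <ul class="classname">
    <li><a class="classname" href="#"><img alt="first" src="a.png"></a></li>
    <li><a class="classname" href="#"><img alt="second" src="b.png"></a></li>
  </ul>
</div>
<div class="other"><a class="classname"><img alt="ignored" src="c.png"></a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list-like result, so len() gives the number of matches.
img_tags = soup.select("div.classname ul.classname a.classname img")
names = [img["alt"] for img in img_tags]

print(len(img_tags))
print(names)
```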
