BeautifulSoup Scraping Span Class HTML - python

I am trying to scrape the text that follows each <span class="..."> element. The HTML looks like this on the pages I am scraping:
<span class="catnum">Disc Number</span>
"1"
<br>
<span class="catnum">Track Number</span>
"1"
<br>
<span class="catnum">Duration</span>
"5:28"
<br>
What I need to get are the numbers after each </span> tag. I should also mention that this is part of a larger piece of code that scrapes 1200 sites, so it will have to run in a loop where the numbers in the quotation marks change from page to page.
I tried this code as a test on one page:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("Smith.html"), "html.parser")
for tag in soup.findAll('span'):
    if tag.has_key('class'):
        if tag['class'] == 'catnum':
            print tag.string
I know that will print ALL the 'span class' tags and not just the three I want, but I thought I would still test it to see if it worked and I got this error:
/Library/Python/2.7/site-packages/bs4/element.py:1527: UserWarning:
has_key is deprecated. Use has_attr("class") instead. key))

As the warning message says, you should use tag.has_attr("class") in place of the deprecated tag.has_key("class") method.
Hope it helps.
Simone
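A side note on the test code in the question: apart from the deprecated has_key call, bs4 treats class as a multi-valued attribute, so tag['class'] is a list such as ['catnum'], and comparing it with the string 'catnum' is never true. Below is a minimal sketch (written for Python 3, with a made-up one-line document standing in for Smith.html) that lets find_all do the class filtering instead:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the contents of Smith.html
html = '<span class="catnum">Disc Number</span>"1"<br><span class="other">x</span>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("span")
class_value = first["class"]              # a list, not a string
matches_string = class_value == "catnum"  # comparing a list to a string

# Filter on the class in the search itself instead of checking attributes by hand
catnum_spans = soup.find_all("span", class_="catnum")
labels = [span.string for span in catnum_spans]
print(class_value, matches_string, labels)
```

With class_="catnum" the has_attr check is no longer needed, because spans without a class attribute simply do not match.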

You can constrain your search by attribute {'class': 'catnum'} and the text inside text=re.compile('Disc Number'). Then use .next_sibling to find the text:
from bs4 import BeautifulSoup
import re
s = '''
<span class = "catnum"> Disc Number </span>
"1"
<br/>
<span class = "catnum"> Track Number </span>
"1"
<br/>
<span class = "catnum"> Duration </span>
"5:28"
<br/>'''
soup = BeautifulSoup(s, 'html.parser')
span = soup.find('span', {'class': 'catnum'}, text=re.compile(r'Disc Number'))
print span.next_sibling
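Since the question asks for all three values on each of many pages, the same .next_sibling idea can be wrapped in a loop over every catnum span. This is only a sketch for Python 3: the helper name catnum_values is made up for illustration, and it assumes the value is always the text node directly after the span.

```python
from bs4 import BeautifulSoup

html = '''
<span class="catnum">Disc Number</span>
"1"
<br/>
<span class="catnum">Track Number</span>
"1"
<br/>
<span class="catnum">Duration</span>
"5:28"
<br/>'''


def catnum_values(markup):
    """Map each catnum label to the text node that follows its span."""
    soup = BeautifulSoup(markup, "html.parser")
    values = {}
    for span in soup.find_all("span", class_="catnum"):
        label = span.get_text(strip=True)
        sibling = span.next_sibling
        # Strip whitespace and the literal quotes shown in the question
        values[label] = str(sibling).strip(' \n"') if sibling is not None else None
    return values


values = catnum_values(html)
print(values)
```

Calling catnum_values once per downloaded page keeps the parsing separate from whatever loop fetches the 1200 pages.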

Related

cant get text from span with Beautifulsoup

Why can't I get the text 3.7M from these span elements (there are several with the same class name) with the code below?
result_prices_pc = soup.find_all("span", attrs={"class": "pc_color font-weight-bold"})
HTML:
<td><span class="pc_color font-weight-bold">3.7M <img alt="c" class="small-coins-icon" src="/design/img/coins_bin.png"></span></td>
I try to get all prices with a for loop:
for price in result_prices_pc:
    print(price.text)
But I can't get the text from it.
The "problem" is the pc_color CSS class. When you load the page, you need to specify which version of the page you want (PS4/XBOX/PC). This is done with the "platform" cookie (alternatively, you can search for ps4_color instead of pc_color, for example):
import requests
from bs4 import BeautifulSoup
url = "https://www.futbin.com/players"
cookies = {"platform": "pc"}
soup = BeautifulSoup(requests.get(url, cookies=cookies).content, "html.parser")
result_prices_pc = soup.find_all(
    "span", attrs={"class": "pc_color font-weight-bold"}
)
for price in result_prices_pc:
    print(price.text)
Prints:
0
1.15M
3.75M
1.7M
4.19M
1.81M
351.65K
0
1.66M
98K
1.16M
3M
775K
99K
1.62M
187K
280K
245K
220K
1.03M
395K
100K
185K
864.2K
0
1.95M
540K
0
0
89K
These elements actually have multiple class names: pc_color font-weight-bold is really two classes, pc_color and font-weight-bold.
For this case you can pass a list of class names, which matches any span that carries at least one of the listed classes:
result_prices_pc = soup.find_all("span", attrs={"class": ['pc_color', 'font-weight-bold']})
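One caveat worth illustrating: as far as I know, a list passed for class matches a tag that has any one of the listed classes, not necessarily all of them. If both classes must be present, a CSS selector is one option. The sketch below (Python 3) uses a small made-up document with one extra span to show the difference:

```python
from bs4 import BeautifulSoup

# Made-up markup: only the first span carries both classes
html = """
<span class="pc_color font-weight-bold">3.7M</span>
<span class="font-weight-bold">not a price</span>
"""
soup = BeautifulSoup(html, "html.parser")

# A list of classes matches spans that have at least one of them
any_of = soup.find_all("span", attrs={"class": ["pc_color", "font-weight-bold"]})

# A chained class selector requires every listed class to be present
both = soup.select("span.pc_color.font-weight-bold")
prices = [span.get_text(strip=True) for span in both]
print(len(any_of), prices)
```

The selector form also does not depend on the order in which the classes appear in the attribute.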

How to scrape values of an array under a "ul" tag using BeautifulSoup?

I need to get values from a "ul" element, but there are no "li" items in it. Instead it has a tag with array values, like below.
<div class ="family">
<ul class ="age">
<ll-per-person count ="[4, 36, 60]" extracount="[]"></ll-per-person>
</ul>
</div>
I want to retrieve the count values. This is the code I have tried in Python:
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html5lib')
table = soup.find('div', attrs={'class': 'family'})
for ul in table.findAll('ul', attrs={'class': 'age'}):
    print(ul)
    for li in ul.findAll('ll-per-person'):
        print(li)
        for numbers in li.findAll(attrs = {"ll-per-person" : "count"}):
            print(numbers)
I get output for print(ul) and print(li), but nothing for print(numbers), and there is no error either.
I need to get the values of count, which is an array. How can I do that?
count is an attribute of the ll-per-person element, and you can read an element's attribute by indexing the tag like this:
for li in ul.findAll('ll-per-person'):
    print(li["count"])
If it helps with your problem then don't forget to mark this as answer.
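To make it clearer why the innermost loop in the question prints nothing: findAll(attrs={"ll-per-person": "count"}) looks for descendant tags that have an attribute named ll-per-person with the value count, and there are none. The attribute lives on the tag itself. A small Python 3 sketch using the markup from the question:

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="family">
<ul class="age">
<ll-per-person count="[4, 36, 60]" extracount="[]"></ll-per-person>
</ul>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
tag = soup.find("ll-per-person")

# Searches for descendants with an attribute called "ll-per-person": finds nothing
no_match = tag.find_all(attrs={"ll-per-person": "count"})

# The attributes of the tag itself are available as a dict
attributes = dict(tag.attrs)
count_raw = tag.get("count")  # .get returns None instead of raising if missing
print(len(no_match), attributes, count_raw)
```

Note that count_raw is still a string here; it has to be parsed before it can be used as a list of numbers.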
To extract the numbers from the <ll-per-person> tag you can use the json module, for example:
import json
from bs4 import BeautifulSoup
html_doc = """
<div class ="family">
<ul class ="age">
<ll-per-person count="[4, 36, 60]" extracount="[]"></ll-per-person>
</ul>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
for item in soup.select("ll-per-person"):
    lst = json.loads(item["count"])
    print("Numbers are:")
    for number in lst:
        print(number)
Prints:
Numbers are:
4
36
60
Since the "ul" tag has <ll-per-person count="[4, 36, 60]" extracount="[]"></ll-per-person> as its second child (use soup.ul.contents to view the children), we can access it and get the value of its count attribute.
from bs4 import BeautifulSoup as bs
html_doc = """
<div class ="family">
<ul class ="age">
<ll-per-person count ="[4, 36, 60]" extracount="[]"></ll-per-person>
</ul>
</div>"""
soup = bs(html_doc,'html.parser')
tag_ll = soup.ul.contents[1]
print(tag_ll['count'])

Python BeautifulSoup and HTML with unusual spaces

I am trying to update product prices by scraping them from a website. However, I have run into some unusual HTML formatting which is giving me trouble. I am trying to return the price without the spaces. Currently my code brings in all the spaces.
<p class='product__price'> == $0
<span class='visuallyhidden'>Regular price</span>
"
£9.99
" == $0
</p>
I am trying the following:
soup = BeautifulSoup(web_page, "html.parser")
for product in soup.find_all('div', class_="product-wrapper"):
    # Get product name
    product_title = product.find('p', class_='h4 product__title').text
    # Get product price
    product_price = product.find('p', class_='product__price').text
    product_price.strip()
But unfortunately using the .strip() method does not work and the script returns the prices with a bunch of spaces and "Regular price".
Any ideas on how I can get exactly "£9.99"?
This does not work because the p element contains two children:
A span element
A text node
When you call .text on the parent p element you get the text of both children; only the span tags themselves are dropped, so "Regular price" is still included. In addition, the content contains quotes, and strip() only removes leading and trailing whitespace, so the spaces and newlines between the quotes and the price are left in place.
To solve the problem you must first isolate the text node from the span, which you can do by iterating over the p element's children with .children.
Finally, you can tell .strip() which characters to remove.
So, assuming the structure inside the p element is always like this, we can do the following:
from bs4 import BeautifulSoup
data = """
<div>
<p class='product__price'>
<span class='visuallyhidden'>Regular price</span>
"
£9.99
"
</p>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for product in soup.find_all('div'):
    # Get product price
    product_price = product.find('p', class_='product__price')
    raw_data = list(product_price.children)[-1]
    # Remove spaces, newlines and quotes
    cleaned = raw_data.strip(' \n"')
    print(repr(cleaned))
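If you would rather not rely on the price being the last child, a variation is to collect only the text nodes that sit directly inside the p element and ignore the span entirely. This is a sketch for Python 3 under the same assumption about the structure; the quotes in the sample may just be how the browser inspector displays text nodes, and the strip call copes with them either way.

```python
from bs4 import BeautifulSoup

data = """
<div>
<p class='product__price'>
<span class='visuallyhidden'>Regular price</span>
"
£9.99
"
</p>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
price_tag = soup.find("p", class_="product__price")

# recursive=False keeps only text nodes that are direct children of <p>,
# so the "Regular price" text inside the span is skipped
direct_text = price_tag.find_all(string=True, recursive=False)

# Join the pieces, then remove surrounding spaces, newlines and quotes
price = "".join(direct_text).strip(' \n"')
print(repr(price))
```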
You can use .contents, take the last element, and then split the string on the " character:
from bs4 import BeautifulSoup
data='''<p class='product__price'> == $0
<span class='visuallyhidden'>Regular price</span>
"
£9.99
" == $0
</p>'''
soup=BeautifulSoup(data,'html.parser')
items=soup.select_one('.product__price').contents
print(items[-1].split('"')[1].strip())
You should try this:
product_price = product_price.strip().replace(" ","")
An alternative approach: how about regex?
from bs4 import BeautifulSoup
import re
html = """<div><p class='product__price'> == $0
<span class='visuallyhidden'>Regular price</span>
"
£9.99
" == $0
</p></div>"""
soup = BeautifulSoup(html, "html.parser")
for product in soup.find_all('div'):
    # Get product price
    product_price = product.find('p', class_='product__price').text
    # Regex (raw string, so the backslashes reach the regex engine unchanged)
    price = re.search(r"(£\d*\.?\d*)", product_price)
    # Print only when there is a match
    if price:
        print(price[0])

How to find text of <div><span>text</span></div> in beautifulsoup?

This is the HTML:
<div><div id="NhsjLK">
<li class="EditableListItem NavListItem FollowersNavItem NavItem not_removable">
Followers <span class="list_count">92</span></li></div></div>
I want to extract the text 92, convert it to an integer, and print it in Python 2. How can I do that?
Code:
i = soup.find('div', id='NhsjLK')
print "Followers :", i.find('span', id='list_count').text
I wouldn't get it by the class directly, since I think "list_count" is too broad a class value and might be used for other things on the page.
There are definitely several different options judging by this HTML snippet alone, but one of the nicest, from my point of view, is to use that "Followers" text/label and get its next sibling:
from bs4 import BeautifulSoup
data = """
<div><div id="NhsjLK">
<li class="EditableListItem NavListItem FollowersNavItem NavItem not_removable">
Followers <span class="list_count">92</span></li></div></div>"""
soup = BeautifulSoup(data, "html.parser")
count = soup.find(text=lambda text: text and text.startswith('Followers')).next_sibling.get_text()
count = int(count)
print(count)
Or, another very concise and reliable approach would be to use a partial match (the *= part below) on the href value of the parent a element:
count = int(soup.select_one("a[href*=followers] .list_count").get_text())
Or, you might check the class value of the parent li element:
count = int(soup.select_one("li.FollowersNavItem .list_count").get_text())
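For completeness, here is the li class variant run against the snippet from the question, with a guard for the case where the element is missing. It is a Python 3 sketch: the helper name follower_count is made up, and the handling of thousands separators is an assumption about what larger counts might look like, not something shown in the question.

```python
from bs4 import BeautifulSoup

data = """
<div><div id="NhsjLK">
<li class="EditableListItem NavListItem FollowersNavItem NavItem not_removable">
Followers <span class="list_count">92</span></li></div></div>"""


def follower_count(markup):
    """Return the follower count as an int, or None if it is not found."""
    soup = BeautifulSoup(markup, "html.parser")
    node = soup.select_one("li.FollowersNavItem .list_count")
    if node is None:
        return None
    # Assumption: larger counts may contain separators such as "1,234"
    return int(node.get_text(strip=True).replace(",", ""))


count = follower_count(data)
print(count)
```

Returning None instead of raising makes it easier to skip pages where the layout differs.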

Pulling specific (text) spaced between HTML tag during BeautifulSoup

I'm trying to pull something that is categorized as (text) when I look at it in "Inspect Element" mode:
<div class="sammy">
<div class="sammyListing">
<a href="/Chicago_Magazine/blahblahblah">
<b>BLT</b>
<br>
"
Old Oak Tap" <---**THIS IS THE TEXT I WANT**
<br>
<em>Read more</em>
</a>
</div>
</div>
This is my code thus far, with the line in question being the bottom list comprehension at the end:
STEM_URL = 'http://www.chicagomag.com'
BASE_URL = 'http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/'
soup = BeautifulSoup(urlopen(BASE_URL).read())
sammies = soup.find_all("div", "sammy")
sammy_urls = []
for div in sammies:
    if div.a["href"].startswith("http"):
        sammy_urls.append(div.a["href"])
    else:
        sammy_urls.append(STEM_URL + div.a["href"])
restaurant_names = [x for x in div.a.content]
I've tried div.a.br.content, div.br, but can't seem to get it right.
If suggesting a RegEx way, I'd also really appreciate a nonRegEx way if possible.
Locate the b element for every listing using a CSS selector and find the next text sibling:
for b in soup.select("div.sammy > div.sammyListing > a > b"):
    print b.find_next_sibling(text=True).strip()
Demo:
In [1]: from urllib2 import urlopen
In [2]: from bs4 import BeautifulSoup
In [3]: soup = BeautifulSoup(urlopen('http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/'))
In [4]: for b in soup.select("div.sammy > div.sammyListing > a > b"):
   ...:     print b.find_next_sibling(text=True).strip()
   ...:
Old Oak Tap
Au Cheval
...
The Goddess and Grocer
Zenwich
Toni Patisserie
Phoebe’s Bakery
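The demo above is Python 2 and hits the live page. For anyone who wants to try the idea offline in Python 3, here is a sketch against a static snippet shaped like the one in the question. One difference is deliberate: if the markup has a line break between </b> and <br>, the first text sibling is just whitespace, so this version walks the siblings and takes the first non-empty text node. The snippet is an assumption about the page structure, not a copy of the real page.

```python
from bs4 import BeautifulSoup, NavigableString

html = """
<div class="sammy">
<div class="sammyListing">
<a href="/Chicago_Magazine/blahblahblah">
<b>BLT</b>
<br>
Old Oak Tap
<br>
<em>Read more</em>
</a>
</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

restaurant_names = []
for b in soup.select("div.sammy > div.sammyListing > a > b"):
    # Walk the nodes after <b> and keep the first text node with real content
    for sibling in b.next_siblings:
        if isinstance(sibling, NavigableString) and sibling.strip():
            restaurant_names.append(sibling.strip())
            break

print(restaurant_names)
```

No regular expression is needed; tags such as <br> and <em> are skipped because they are not text nodes.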
