Python HTML Parsing via CSS Selectors

Python HTML Parsing via CSS Selectors - python

I'm trying to collect the plain text/business title from the following:
<div class = "business-detail-text>
<h1 class = "business-title" style="position:relative;" itemprop="name">H&H Construction Co.</h1>
What is the best way to do this? The style & itemprop attribute's are where I get stuck. I know I can use soup.select but I've had no luck so far.
Here is my code so far:
def bbb_profiles(profile_urls):
sauce_code = requests.get(profile_urls)
plain_text = sauce_code.text
soup = BeautifulSoup(plain_text, "html.parser")
for profile_info in soup.findAll("h1", {"class": "business-title"}):
print(profile_info.string)

is it what you need?
>>> from bs4 import BeautifulSoup
>>> txt='''<div class = "business-detail-text">
<h1 class = "business-title" style="position:relative;" itemprop="name">H&H Construction Co.</h1></div>'''
>>> soup = BeautifulSoup(txt, "html.parser")
>>> soup.find_all('h1', 'business-title')
[<h1 class="business-title" itemprop="name" style="position:relative;">H&H; Construction Co.</h1>]
>>> soup.find_all('h1', 'business-title')[0].text
u'H&H; Construction Co.'
I see your html is missing " after "business-detail-text and < /div> in the very end

Related

Beautiful Soup: Test if a div is children of a div

Is it possible to test with Beautiful Soup whether a div is a (not necessarily immediate) child of a div?
Eg.
<div class='a'>
<div class='aa'>
<div class='aaa'>
<div class='aaaa'>
</div>
</div>
</div>
<div class='ab'>
<div class='aba'>
<div class='abaa'>
</div>
</div>
</div>
</div>
Now I want to test whether the div with class aaaa and the div with class abaa are (not necessarily immediate) children of the div with class aa.
import bs4
with open('test.html','r') as i_file:
soup = bs4.BeautifulSoup(i_file.read(), 'lxml')
div0 = soup.find('div', {'class':'aa'})
div1 = soup.find('div', {'class':'aaaa'})
div2 = soup.find('div', {'class':'abaa'})
print(div1 in div0) # must return True, but returns False
print(div2 in div0) # must return False
How can this be done?
(Of course, the actual HTML is more complicated, with more nested divs.)

You can use find_parent method from Beautifulsoup.
import bs4
with open("test.html", "r") as i_file:
soup = bs4.BeautifulSoup(i_file.read(), "lxml")
div0 = soup.find("div", {"class": "aa"})
div1 = soup.find("div", {"class": "aaaa"})
div2 = soup.find("div", {"class": "abaa"})
print(div1.find_parent(div0.name, attrs=div0.attrs) is not None) # Returns True
print(div2.find_parent(div0.name, attrs=div0.attrs) is not None) # Returns False

try finding all the child elements using find_all_next and see if the child elements has the required class attribute.
from bs4 import BeautifulSoup
soup = BeautifulSoup(text, "html.parser")
def is_child(element, parent_class, child_class):
return any(
child_class in i.attrs['class']
for i in soup.find("div", attrs={"class": parent_class}).find_all_next(element)
)
print(is_child("div", "aa", "aaa")) # True
print(is_child("div", "abaa", "aa")) # False

Okay, I think I found a way. You gotta get all children divs of the parent div with find_all:
import bs4
with open('test.html','r') as i_file:
soup = bs4.BeautifulSoup(i_file.read(), 'lxml')
div0 = soup.find('div', {'class':'aa'})
div1 = soup.find('div', {'class':'aaaa'})
div2 = soup.find('div', {'class':'abaa'})
children = div0.find_all('div')
print(div1 in children)
print(div2 in children)

How to get text inside tag using python and BeautifulSoup

I'm trying to get text (example text) inside tags using beautiful soup
The html structure looks like this:
...
<div>
<div>Description</div>
<span>
<div><span>example text</span></div>
</span>
</div>
...
What i tried:
r = requests.get(url)
soup = bs(r.content, 'html.parser')
desc = soup.find('div.div.span.div.span')
print(str(desc))

You cannot use .find() with multiple tag names in it stacked like this. You need to repeatedly call .find() to get desired result. Check out docs for more information. Below code will give you desired output:
soup.find('div').find('span').get_text()

Your selector is wrong.
>>> from bs4 import BeautifulSoup
>>> data = '''\
... <div>
... <div>Description</div>
... <span>
... <div><span>example text</span></div>
... </span>
... </div>'''
>>> soup = BeautifulSoup(data, 'html.parser')
>>> desc = soup.select_one('div span div span')
>>> desc.text
'example text'
>>>

r = requests.get(url)
soup = bs(r.content, 'html.parser')
desc = soup.find('div').find('span')
print(desc.getText())

Check this out -
soup = BeautifulSoup('''<div>
<div>Description</div>
<span>
<div><span>example text</span></div>
</span>
</div>''',"html.parser")
text = soup.span.get_text()
print(text)

Getting only numbers from BeautifulSoup instead of whole div

I am trying to learn python by creating a small websraping program to make life easier, although I am having issues with only getting number when using BS4. I was able to get the price when I scraped an actual ad, but I would like to get all the prices from the page.
Here is my code:
from bs4 import BeautifulSoup
import requests
prices = []
url = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
result = requests.get(url)
print (result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print ("CLEARING")
price = soup.findAll("div", class_="price")
prices.append(price)
print (prices)
Here is my output
[<div class="price">
$46,999.00
<div class="dealer-logo">
<div class="dealer-logo-image">
<img src="https://i.ebayimg.com/00/s/NjBYMTIw/z/xMQAAOSwi9ZfoW7r/$_69.PNG"/>
</div>
</div>
</div>
Ideally, I would only want the output to be "46,999.00".
I tried with text=True, although this did not work and I would not get any output from it besides an empty list.
Thank you

You need to get the text portion of tag and then perform some regex processing on it.
import re
def get_price_from_div(div_item):
str_price = re.sub('[^0-9\.]','', div_item.text)
float_price = float(str_price)
return float_price
Just call this method in your code after you find the divs
from bs4 import BeautifulSoup
import requests
prices = []
url = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
result = requests.get(url)
print (result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print ("CLEARING")
price = soup.findAll("div", class_="price")
prices.extend([get_price_from_div(curr_div) for curr_div in price])
print (prices)

An option without using RegEx, is to filter out tags that startwith() a dollar sign $:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
price_tags = soup.find_all("div", class_="price")
prices = [
tag.get_text(strip=True)[1:] for tag in price_tags
if tag.get_text(strip=True).startswith('$')
]
print(prices)
Output:
['48,888.00', '21,999.00', '44,488.00', '5,500.00', '33,000.00', '14,900.00', '1,750.00', '35,600.00', '1,800.00', '25,888.00', '36,888.00', '32,888.00', '30,888.00', '18,888.00', '21,888.00', '29,888.00', '22,888.00', '30,888.00', '17,888.00', '17,888.00', '16,888.00', '22,888.00', '22,888.00', '34,888.00', '31,888.00', '32,888.00', '30,888.00', '21,888.00', '15,888.00', '21,888.00', '28,888.00', '19,888.00', '18,888.00', '30,995.00', '30,995.00', '30,995.00', '19,888.00', '47,995.00', '21,888.00', '46,995.00', '32,888.00', '29,888.00', '26,888.00', '21,888.00']

Certain content not loading when scraping a site with Beautiful Soup

I'm trying to scrape the ratings off recipes on NYT Cooking but having issues getting the content I need. When I look at the source on the NYT page, I see the following:
<div class="ratings-rating">
<span class="ratings-header ratings-content">194 ratings</span>
<div class="ratings-stars-wrap">
<div class="ratings-stars ratings-content four-star-rating avg-rating">
The content I'm trying to pull out is 194 ratings and four-star-rating. However, when I pull in the page source via Beautiful Soup I only see this:
<div class="ratings-rating">
<span class="ratings-header ratings-content"><%= header %></span>
<div class="ratings-stars-wrap">
<div class="ratings-stars ratings-content <%= ratingClass %> <%= state %>">
The code I'm using is:
url = 'https://cooking.nytimes.com/recipes/1020049-lemony-chicken-soup-with-fennel-and-dill'
r = get(url, headers = headers, timeout=15)
page_soup = soup(r.text,'html.parser')
Any thoughts why that information isn't pulling through?

Try using below code
import requests
import lxml
from lxml import html
import re
url = "https://cooking.nytimes.com/recipes/1019706-spiced-roasted-cauliflower-with-feta-and-garlic?action=click&module=Recirculation%20Band%20Recipe%20Card&region=More%20recipes%20from%20Alison%20Roman&pgType=recipedetails&rank=1"
r = requests.get(url)
tree = html.fromstring(r.content)
t = tree.xpath('/html/body/script[14]')[0]
# look for value for bootstrap.recipe.avg_rating
m = re.search("bootstrap.recipe.avg_rating = ", t.text)
colon = re.search(";", t.text[m.end()::])
rating = t.text[m.end():m.end()+colon.start()]
print(rating)
# look for value for bootstrap.recipe.num_ratings =
n = re.search("bootstrap.recipe.num_ratings = ", t.text)
colon2 = re.search(";", t.text[n.end()::])
star = t.text[n.end():n.end()+colon2.start()]
print(star)

much easier to use attribute = value selectors to grab from span with class ratings-metadata
import requests
from bs4 import BeautifulSoup
data = requests.get('https://cooking.nytimes.com/recipes/1020049-lemony-chicken-soup-with-fennel-and-dill')
soup = BeautifulSoup(data.content, 'lxml')
rating = soup.select_one('[itemprop=ratingValue]').text
ratingCount = soup.select_one('[itemprop=ratingCount]').text
print(rating, ratingCount)

Beautifulsoup: parsing html – get part of href

I'm trying to parse
<td height="16" class="listtable_1">76561198134729239</td>
for the 76561198134729239. and I can't figure out how to do it. what I tried:
import requests
from lxml import html
from bs4 import BeautifulSoup
r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
content = r.content
soup = BeautifulSoup(content, "html.parser")
element = soup.find("td",
{
"class":"listtable_1",
"target":"_blank"
})
print(element.text)

There are many such entries in that HTML. To get all of them you could use the following:
import requests
from lxml import html
from bs4 import BeautifulSoup
r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
soup = BeautifulSoup(r.content, "html.parser")
for td in soup.findAll("td", class_="listtable_1"):
for a in td.findAll("a", href=True, target="_blank"):
print(a.text)
This would then return:
76561198143466239
76561198094114508
76561198053422590
76561198066478249
76561198107353289
76561198043513442
76561198128253254
76561198134729239
76561198003749039
76561198091968935
76561198071376804
76561198068375438
76561198039625269
76561198135115106
76561198096243060
76561198067255227
76561198036439360
76561198026089333
76561198126749681
76561198008927797
76561198091421170
76561198122328638
76561198104586244
76561198056032796
76561198059683068
76561197995961306
76561198102013044

"target":"_blank" is a class of anchor tag a within the td tag. It's not a class of td tag.
You can get it like so:
from bs4 import BeautifulSoup
html="""
<td height="16" class="listtable_1">
<a href="http://steamcommunity.com/profiles/76561198134729239" target="_blank">
76561198134729239
</a>
</td>"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('td', {'class': "listtable_1"}).find('a', {"target":"_blank"}).text)
Output:
76561198134729239

As others mentioned you are trying to check attributes of different elements in a single find(). Instead, you can chain find() calls as MYGz suggested, or use a single CSS selector:
soup.select_one("td.listtable_1 a[target=_blank]").get_text()
If, you need to locate multiple elements this way, use select():
for elm in soup.select("td.listtable_1 a[target=_blank]"):
print(elm.get_text())

"class":"listtable_1" belong to td tag and target="_blank" belong to a tag, you should not use them together.
you should use Steam Community as an anchor to find the numbers after it.
OR use URL, The URL contain the info you need and it's easy to find, you can find the URL and split it by /:
for a in soup.find_all('a', href=re.compile(r'steamcommunity')):
num = a['href'].split('/')[-1]
print(num)
Code:
import requests
from lxml import html
from bs4 import BeautifulSoup
r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
content = r.content
soup = BeautifulSoup(content, "html.parser")
for td in soup.find_all('td', string="Steam Community"):
num = td.find_next_sibling('td').text
print(num)
out:
76561198143466239
76561198094114508
76561198053422590
76561198066478249
76561198107353289
76561198043513442
76561198128253254
76561198134729239
76561198003749039
76561198091968935
76561198071376804
76561198068375438
76561198039625269
76561198135115106
76561198096243060
76561198067255227
76561198036439360
76561198026089333
76561198126749681
76561198008927797
76561198091421170
76561198122328638
76561198104586244
76561198056032796
76561198059683068
76561197995961306
76561198102013044

You could chain together two finds in gazpacho to solve this problem:
from gazpacho import Soup
html = """<td height="16" class="listtable_1">76561198134729239</td>"""
soup = Soup(html)
soup.find("td", {"class": "listtable_1"}).find("a", {"target": "_blank"}).text
This outputs:
'76561198134729239'

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python HTML Parsing via CSS Selectors - python

Related

Beautiful Soup: Test if a div is children of a div

How to get text inside tag using python and BeautifulSoup

Getting only numbers from BeautifulSoup instead of whole div

Certain content not loading when scraping a site with Beautiful Soup

Beautifulsoup: parsing html – get part of href

Categories

Resources