Beautifulsoup: parsing html – get part of href

Beautifulsoup: parsing html – get part of href - python

I'm trying to parse
<td height="16" class="listtable_1">76561198134729239</td>
for the 76561198134729239. and I can't figure out how to do it. what I tried:
import requests
from lxml import html
from bs4 import BeautifulSoup
r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
content = r.content
soup = BeautifulSoup(content, "html.parser")
element = soup.find("td",
{
"class":"listtable_1",
"target":"_blank"
})
print(element.text)

There are many such entries in that HTML. To get all of them you could use the following:
import requests
from lxml import html
from bs4 import BeautifulSoup
r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
soup = BeautifulSoup(r.content, "html.parser")
for td in soup.findAll("td", class_="listtable_1"):
for a in td.findAll("a", href=True, target="_blank"):
print(a.text)
This would then return:
76561198143466239
76561198094114508
76561198053422590
76561198066478249
76561198107353289
76561198043513442
76561198128253254
76561198134729239
76561198003749039
76561198091968935
76561198071376804
76561198068375438
76561198039625269
76561198135115106
76561198096243060
76561198067255227
76561198036439360
76561198026089333
76561198126749681
76561198008927797
76561198091421170
76561198122328638
76561198104586244
76561198056032796
76561198059683068
76561197995961306
76561198102013044

"target":"_blank" is a class of anchor tag a within the td tag. It's not a class of td tag.
You can get it like so:
from bs4 import BeautifulSoup
html="""
<td height="16" class="listtable_1">
<a href="http://steamcommunity.com/profiles/76561198134729239" target="_blank">
76561198134729239
</a>
</td>"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('td', {'class': "listtable_1"}).find('a', {"target":"_blank"}).text)
Output:
76561198134729239

As others mentioned you are trying to check attributes of different elements in a single find(). Instead, you can chain find() calls as MYGz suggested, or use a single CSS selector:
soup.select_one("td.listtable_1 a[target=_blank]").get_text()
If, you need to locate multiple elements this way, use select():
for elm in soup.select("td.listtable_1 a[target=_blank]"):
print(elm.get_text())

"class":"listtable_1" belong to td tag and target="_blank" belong to a tag, you should not use them together.
you should use Steam Community as an anchor to find the numbers after it.
OR use URL, The URL contain the info you need and it's easy to find, you can find the URL and split it by /:
for a in soup.find_all('a', href=re.compile(r'steamcommunity')):
num = a['href'].split('/')[-1]
print(num)
Code:
import requests
from lxml import html
from bs4 import BeautifulSoup
r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
content = r.content
soup = BeautifulSoup(content, "html.parser")
for td in soup.find_all('td', string="Steam Community"):
num = td.find_next_sibling('td').text
print(num)
out:
76561198143466239
76561198094114508
76561198053422590
76561198066478249
76561198107353289
76561198043513442
76561198128253254
76561198134729239
76561198003749039
76561198091968935
76561198071376804
76561198068375438
76561198039625269
76561198135115106
76561198096243060
76561198067255227
76561198036439360
76561198026089333
76561198126749681
76561198008927797
76561198091421170
76561198122328638
76561198104586244
76561198056032796
76561198059683068
76561197995961306
76561198102013044

You could chain together two finds in gazpacho to solve this problem:
from gazpacho import Soup
html = """<td height="16" class="listtable_1">76561198134729239</td>"""
soup = Soup(html)
soup.find("td", {"class": "listtable_1"}).find("a", {"target": "_blank"}).text
This outputs:
'76561198134729239'

Related

How can I extract href links from a within a table th using BeautifulSoup

I am trying to create a list of all football teams/links from any one of a number of tables within the base URL: https://fbref.com/en/comps/10/stats/Championship-Stats
I would then use the link from the href to scrape each individual team's data. The href is embedded within the th tag as per below
th scope="row" class="left " data-stat="squad">Barnsley</th
a href="/en/squads/293cb36b/Barnsley-Stats">Barnsley</a
The following code gives me a list of the 'a' tags
page = "https://fbref.com/en/comps/10/Championship-Stats"
pageTree = requests.get(page)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
Teams = pageSoup.find_all("th", {"class": "left"})
Output(for each class of 'left'):
th class="left" data-stat="squad" scope="row">
a href="/en/squads/293cb36b/Barnsley-Stats">Barnsley,
I have tried the guidance from a previous Stack question (Extract links after th in beautifulsoup)
However, the following code based on that thread produces errors
AttributeError: 'NoneType' object has no attribute 'find_parent'
def import_TeamList():
BASE_URL = "https://fbref.com/en/comps/10/Championship-Stats"
r = requests.get(BASE_URL)
soup = BeautifulSoup(r.text, 'lxml')
team_list = []
team_tr = soup.find('a', {'data-stat': 'squad'}).find_parent('tr')
for tr in reels_tr.find_next_siblings('tr'):
if tr.find('a').text != 'squad':
break
midi_list.append(BASE_URL + tr.find('a')['href'])
return TeamList

Here is a version using CSS selectors, which I find simpler than most other methods.
import requests
from bs4 import BeautifulSoup
url = 'https://fbref.com/en/comps/10/stats/Championship-Stats'
data = requests.get(url).text
soup = BeautifulSoup(data)
links = BeautifulSoup(data).select('th a')
urls = [link['href'] for link in links]
print(urls)

Is this what you're looking for?
import requests
from bs4 import BeautifulSoup as BS
from lxml import etree
with requests.Session() as session:
r = session.get('https://fbref.com/en/comps/10/stats/Championship-Stats')
r.raise_for_status()
dom = etree.HTML(str(BS(r.text, 'lxml')))
for a in dom.xpath('//th[#class="left"]/a'):
print(a.attrib['href'])

Getting only numbers from BeautifulSoup instead of whole div

I am trying to learn python by creating a small websraping program to make life easier, although I am having issues with only getting number when using BS4. I was able to get the price when I scraped an actual ad, but I would like to get all the prices from the page.
Here is my code:
from bs4 import BeautifulSoup
import requests
prices = []
url = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
result = requests.get(url)
print (result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print ("CLEARING")
price = soup.findAll("div", class_="price")
prices.append(price)
print (prices)
Here is my output
[<div class="price">
$46,999.00
<div class="dealer-logo">
<div class="dealer-logo-image">
<img src="https://i.ebayimg.com/00/s/NjBYMTIw/z/xMQAAOSwi9ZfoW7r/$_69.PNG"/>
</div>
</div>
</div>
Ideally, I would only want the output to be "46,999.00".
I tried with text=True, although this did not work and I would not get any output from it besides an empty list.
Thank you

You need to get the text portion of tag and then perform some regex processing on it.
import re
def get_price_from_div(div_item):
str_price = re.sub('[^0-9\.]','', div_item.text)
float_price = float(str_price)
return float_price
Just call this method in your code after you find the divs
from bs4 import BeautifulSoup
import requests
prices = []
url = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
result = requests.get(url)
print (result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print ("CLEARING")
price = soup.findAll("div", class_="price")
prices.extend([get_price_from_div(curr_div) for curr_div in price])
print (prices)

An option without using RegEx, is to filter out tags that startwith() a dollar sign $:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.kijiji.ca/b-cars-trucks/calgary/new__used/c174l1700199a49?ll=51.044733%2C-114.071883&address=Calgary%2C+AB&radius=50.0'
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
price_tags = soup.find_all("div", class_="price")
prices = [
tag.get_text(strip=True)[1:] for tag in price_tags
if tag.get_text(strip=True).startswith('$')
]
print(prices)
Output:
['48,888.00', '21,999.00', '44,488.00', '5,500.00', '33,000.00', '14,900.00', '1,750.00', '35,600.00', '1,800.00', '25,888.00', '36,888.00', '32,888.00', '30,888.00', '18,888.00', '21,888.00', '29,888.00', '22,888.00', '30,888.00', '17,888.00', '17,888.00', '16,888.00', '22,888.00', '22,888.00', '34,888.00', '31,888.00', '32,888.00', '30,888.00', '21,888.00', '15,888.00', '21,888.00', '28,888.00', '19,888.00', '18,888.00', '30,995.00', '30,995.00', '30,995.00', '19,888.00', '47,995.00', '21,888.00', '46,995.00', '32,888.00', '29,888.00', '26,888.00', '21,888.00']

How to get a li with no class underneath a div tag

I am attempting to pull line and over under data for games from ESPN. To do this I need to pull a list item underneath a div tag.I can successfully get the over/under data because it's clear to me what the tag is, but the list item for the line doesn't seem to have a clear tag. Essentially I would be wanting to pull out "Line: IOWA -3.5" from this specific URL.
from bs4 import BeautifulSoup
page = requests.get('https://www.espn.com/college- football/game/_/gameId/401012863')
soup = BeautifulSoup(page.text, 'html.parser')
#Get over/under
game_ou = soup.find('li',class_='ou')
game_ou2 = game_ou.contents[0]
game_ou3=game_ou2.strip()
#Get Line
game_line = soup.find('div',class_='odds-details')
print(game_line)

Add in the parent class (with descendant combinator and child li type selector) then you can retrieve both li in a list and index in or just use select_one to retrieve the first
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://www.espn.com/college-football/game/_/gameId/401012863')
soup = bs(r.content, 'lxml')
lis = [i.text.strip() for i in soup.select('.odds-details li')]
print(lis[0])
print(lis[1])
print(soup.select_one('.odds-details li').text)

Use find('li') after find the div element.
from bs4 import BeautifulSoup
page = requests.get('https://www.espn.com/college-football/game/_/gameId/401012863')
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.find("div",class_="odds-details").find('li').text)
print(soup.find("div",class_="odds-details").find('li',class_='ou').text.strip())
Output:
Line: IOWA -3.5
Over/Under: 47

Got empty list with Beautiful Soup and Selenium

https://www.rottentomatoes.com/m/the_lord_of_the_rings_the_return_of_the_king
I want to get TOMATOMETER and AUDIENCE SCORE from that website,
but got an empty list.
soup = BeautifulSoup(html, 'html.parser')
notices = soup.select('#tomato_meter_link > span.mop-ratings-wrap__percentage')

You can use last-child selector for span type with the parent class. This is using BeautifulSoup 4.7.1
import requests
from bs4 import BeautifulSoup
res = requests.get('https://www.rottentomatoes.com/m/the_lord_of_the_rings_the_return_of_the_king')
soup = bs(res.content, 'lxml')
ratings = [item.text.strip() for item in soup.select('h1.mop-ratings-wrap__score span:last-child')]
print(ratings)

Your code works well
>>> from bs4 import BeautifulSoup
>>> html = requests.get('https://www.rottentomatoes.com/m/the_lord_of_the_rings_the_return_of_the_king').text
>>> soup = BeautifulSoup(html, 'html.parser')
>>> notices = soup.select('#tomato_meter_link > span.mop-ratings-wrap__percentage')
>>> notices
[<span class="mop-ratings-wrap__percentage">93%</span>]
How did you get html variable?

Scrape all href into list with BeautifulSoup

I'd like to to grab links from this page and put them in a list.
I have this code:
import bs4 as bs
import urllib.request
source = urllib.request.urlopen('http://www.gcoins.net/en/catalog/236').read()
soup = bs.BeautifulSoup(source,'lxml')
links = soup.find_all('a', attrs={'class': 'view'})
print(links)
It produces following output:
[<a class="view" href="/en/catalog/view/514">
<img alt="View details" height="32" src="/img/actions/file.png" title="View details" width="32"/>
</a>,
"""There are 28 lines more"""
<a class="view" href="/en/catalog/view/565">
<img alt="View details" height="32" src="/img/actions/file.png" title="View details" width="32"/>
</a>]
I need to get following: [/en/catalog/view/514, ... , '/en/catalog/view/565']
But then I go ahead and add following: href_value = links.get('href') I got an error.

Try:
soup = bs.BeautifulSoup(source,'lxml')
links = [i.get("href") for i in soup.find_all('a', attrs={'class': 'view'})]
print(links)
Output:
['/en/catalog/view/514', '/en/catalog/view/515', '/en/catalog/view/179080', '/en/catalog/view/45518', '/en/catalog/view/521', '/en/catalog/view/111429', '/en/catalog/view/522', '/en/catalog/view/182223', '/en/catalog/view/168153', '/en/catalog/view/523', '/en/catalog/view/524', '/en/catalog/view/60228', '/en/catalog/view/525', '/en/catalog/view/539', '/en/catalog/view/540', '/en/catalog/view/31642', '/en/catalog/view/553', '/en/catalog/view/558', '/en/catalog/view/559', '/en/catalog/view/77672', '/en/catalog/view/560', '/en/catalog/view/55377', '/en/catalog/view/55379', '/en/catalog/view/32001', '/en/catalog/view/561', '/en/catalog/view/562', '/en/catalog/view/72185', '/en/catalog/view/563', '/en/catalog/view/564', '/en/catalog/view/565']

Your links is currently a python list. What you want to do is loop into that list and fetch the hrefs as below.
final_hrefs = []
for each_link in links:
final_hrefs.append(each_link.a['href'])
or a one-liner
final_hrefs = [each_link['href'] for each_link in links]
print(final_hrefs)

Try the code below. You get the HTML list in one step:
import bs4 as bs
import urllib.request
source = urllib.request.urlopen('http://www.gcoins.net/en/catalog/236').read()
soup = bs.BeautifulSoup(source,'lxml')
links = [i.get("href") for i in soup.find_all('a', attrs={'class': 'view'})]
for link in links:
print('http://www.gcoins.net'+ link)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Beautifulsoup: parsing html – get part of href - python

You could chain together two finds in gazpacho to solve this problem: from gazpacho import Soup html = """<td height="16" class="listtable_1">76561198134729239</td>""" soup = Soup(html) soup.find("td", {"class": "listtable_1"}).find("a", {"target": "_blank"}).text This outputs: '76561198134729239'

Related

How can I extract href links from a within a table th using BeautifulSoup

Getting only numbers from BeautifulSoup instead of whole div

How to get a li with no class underneath a div tag

Got empty list with Beautiful Soup and Selenium

Scrape all href into list with BeautifulSoup

Categories

Resources