How to get the text and URL from a link using beautifulsoup - python

I have the following code which prints out a list of the links for each team in a table:
import requests
from bs4 import BeautifulSoup
# Get all teams in Big Sky standings table
URL = 'https://www.espn.com/college-football/standings/_/group/20/view/fcs-i-aa'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
standings = soup.find_all('table', 'Table Table--align-right Table--fixed Table--fixed-left')
for team in standings:
team_data = team.find_all('span', 'hide-mobile')
print(team_data)
The code prints out the entire list and if I pinpoint an index such as 'print(team_data[0])', it will print out the specific link from the page.
How can I then go into that link and get the string from the URL as well as the text for the link?
For example, my code prints out the following for the first index in the list.
<span class="hide-mobile"><a class="AnchorLink" data-clubhouse-uid="s:20~l:23~t:2692" href="/college-football/team/_/id/2692/weber-state-wildcats" tabindex="0">Weber State Wildcats</a></span>
How can I pull
/college-football/team/_/id/2692/weber-state-wildcats
and
Weber State Wildcats
from the link?
Thank you for your time and if there is anything I can add for clarification, please don't hesitate to ask.

Provided that you have an html like:
<span class="hide-mobile"><a class="AnchorLink" data-clubhouse-uid="s:20~l:23~t:2692" href="/college-football/team/_/id/2692/weber-state-wildcats" tabindex="0">Weber State Wildcats</a></span>
To get the /college-football/team/_/id/2692/weber-state-wildcats:
>>> team_data.find_all('a')[0]['href']
'/college-football/team/_/id/2692/weber-state-wildcats'
To get the Weber State Wildcats:
>>> team_data.find_all('a')[0].text
'Weber State Wildcats''

In terms of the href/url, you can do something like this.
In regards to the link text, you could do something like this.
Both amount to filtering down to the target element, and then extracting the desired attribute.

Related

Scrape <div<span from HTML-page

I am trying to create a simple weather forecast with Python in Eclipse. So far I have written this:
from bs4 import BeautifulSoup
import requests
def weather_forecast():
url = 'https://www.yr.no/nb/v%C3%A6rvarsel/daglig-tabell/1-92416/Norge/Vestland/Bergen/Bergen'
r = requests.get(url) # Get request for contents of the page
print(r.content) # Outputs HTML code for the page
soup = BeautifulSoup(r.content, 'html5lib') # Parse the data with BeautifulSoup(HTML-string, html-parser)
min_max = soup.select('min-max.temperature') # Select all spans with a "min-max-temperature" attribute
print(min_max.prettify())
table = soup.find('div', attrs={'daily-weather-list-item__temperature'})
print(table.prettify())
From a html-page with elements that looks like this:
I have found the path to the first temperature in the HTML-page's elements, but when I try and execute my code, and print to see if I have done it correctly, nothing is printed. My goal is to print a table with dates and corresponding temperatures, which seems like an easy task, but I do not know how to properly name the attribute or how to scrape them all from the HTML-page in one iteration.
The <span has two temperatures stored, one min and one max, here it just happens that they're the same.
I want to go into each <div class="daily-weather-list-item__temperature", collect the two temperatures and add them to a dictionary, how do I do this?
I have looked at this question on stackoverflow but I couldn't figure it out:
Python BeautifulSoup - Scraping Div Spans and p tags - also how to get exact match on div name
You could use a dictionary comprehension. Loop over all the forecasts which have class daily-weather-list-item, then extract date from the datetime attribute of the time tags, and use those as keys; associate the keys with the maxmin info.
import requests
from bs4 import BeautifulSoup
def weather_forecast():
url = 'https://www.yr.no/nb/v%C3%A6rvarsel/daglig-tabell/1-92416/Norge/Vestland/Bergen/Bergen'
r = requests.get(url) # Get request for contents of the page
soup = BeautifulSoup(r.content, 'html5lib')
temps = {i.select_one('time')['datetime']:i.select_one('.min-max-temperature').get_text(strip= True)
for i in soup.select('.daily-weather-list-item')}
return temps
weather_forecast()

Navigating html tree with beautifulsoup

I'm trying to scrape some data with beautifulsoup on python (url:http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html)
When the data is first occurrence, no problem like
titlebook = soup.find("h1")
titlebook = titlebook.text
but i want to scrape different values, further in page, like upc, price incl.tax, etc
Upc value is first and i have it running universal_product_code= soup.find("tr").find("td").text
I tried so many solutions to access the other ones (i've read beautifulsoup documentation and tried lot of things but it didn't really help me)
So my question is, how to access specific values in a tree where tags are same? I joined a screenshot of the tree to help you understand what i'm talking about
Thank you for your help
For example, if you want to find price (excluding tax), you can use string= parameter in .find and then search for text in next <td>:
import requests
from bs4 import BeautifulSoup
url = "http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# get "Price (excl. tax)" from the table:
key = "Price (excl. tax)"
print("{}: {}".format(key, soup.find("th", string=key).find_next("td").text))
Prints:
Price (excl. tax): £53.74
Or: Use CSS selector:
print(soup.select_one('th:-soup-contains("Price (excl. tax)") + td').text)

How to parse information between {} on web page using Beautifulsoup

I am scraping sports odds from a web page and, as seen below, my result from a find request. The .get_text() will display -110 which is fine.
What if I wanted to get any of the numbers within the {}. How would I go about getting these values?
I purposely deleted the opening < and closing > from the first div statement down below in order for it to appear.
results = soup.find('div', attrs={'class':'op-item spread-price'})
print(results)
div class="op-item spread-price" data-op-info='{"fullgame":"-110","firsthalf":"-121","secondhalf":"-115","firstquarter":"-109","secondquarter":"","thirdquarter":"","fourthquarter":""}' data-op-overprice='{"fullgame":"-110","firsthalf":"-109","secondhalf":"-122","firstquarter":"-103","secondquarter":"","thirdquarter":"","fourthquarter":""}'>-110</div
screen
Long story short - This should show what todo
from bs4 import BeautifulSoup
html_doc = """<div class="op-item spread-price" data-op-info='{"fullgame":"-110","firsthalf":"-121","secondhalf":"-115","firstquarter":"-109","secondquarter":"","thirdquarter":"","fourthquarter":""}' data-op-overprice='{"fullgame":"-110","firsthalf":"-109","secondhalf":"-122","firstquarter":"-103","secondquarter":"","thirdquarter":"","fourthquarter":""}'>-110</div>"""
bs = BeautifulSoup(html_doc)
data = bs.find('div')
print(data['data-op-info'])
print(data['data-op-overprice'])
Based on your "provided" information
print(result['data-op-info'])
print(result['data-op-overprice'])
Based on your comment you can replace the printing by
import json
for k,v in json.loads(result['data-op-info']).items():
print(k,v)
Hope that helps, let us know

How do I scrape a 'td' in web scraping

I am learning web scraping and I'm scraping in this following website: ivmp servers. I have trouble with scraping the number of players in the server, can someone help me? I will send the code of what I've done so far
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.game-state.com/index.php?game=ivmp').text
soup = BeautifulSoup(source, 'html.parser')
players = soup.find('table')
summary = players.find('div', class_ ='players')
print(summary)
Looking at the page you provided, i can assume that the table you want to extract information from is the one with server names and ip adresses.
There are actually 4 "table" element on this page.
Luckily for you, this table has an id (serverlist). You can easily find it with right click > inspect on Chrome
players = soup.select_one('table#serverlist')
Now you want to get the td.
You can print all of them using :
for td in players.select("td"):
print(td)
Or you can select the one you are interested in :
players.select("td.hostname")
for example.
Hope this helps.
Looking at the structure of the page, there are a few table cells (td) with the class "players", it looks like two of them are for sorting the table, so we'll assume you don't want those.
In order to extract the one(s) you do want, I would first query for all the td elements with the class "players", and then loop through them adding only the ones we do want to an array.
Something like this:
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.game-state.com/index.php?game=ivmp').text
soup = BeautifulSoup(source, 'html.parser')
players = soup.find_all('td', class_='players')
summary = []
for cell in players:
# Exclude the cells which are for sorting
if cell.get_text() != 'Players':
summary.append(cell.get_text())
print(summary)

Scraping page with BS in python only captures first column of splitColumn

I'm trying to scrape the last part of this page through BeautifulSoup in python.
I want to retrieve all the companies listed in the bottom. Furthermore, the companies are ordered alphabetically, where the companies with titles starting with "A-F" appear under the first tab, then "G-N" under the second tab and so on. You have to click the tabs for the names to appear, so I'll loop through the different "name pages" and apply the same code.
I'm having trouble retrieving all the names of a single page, however.
When looking at the companies named "A-F" I can only retrieve the names of the first column of the table.
My code is:
from bs4 import BeautifulSoup as Soup
import requests
incl_page_url = "https://www.triodos.com/en/investment-management/socially-
responsible-investment/sustainable-investment-universe/companies-atmf1/"
page = requests.get(incl_page_url)
soup = Soup(page.content, "html.parser")
for header in soup.find("h2").next_siblings:
try:
for a in header.childGenerator():
if str(type(a)) == "<class 'bs4.element.NavigableString'>":
print(str(a))
except:
pass
As can be seen by running this, I only get the names from the first column.
Any help is very much appreciated.
Give this a shot and tell me this is not what you wanted:
from bs4 import BeautifulSoup
import requests
incl_page_url = "https://www.triodos.com/en/investment-management/socially-responsible-investment/sustainable-investment-universe/companies-atmf1/"
page = requests.get(incl_page_url).text
soup = BeautifulSoup(page, "lxml")
for items in soup.select(".splitColumn p"):
title = '\n'.join([item for item in items.strings])
print(title)
Result:
3iGroup
8point3 Energy Partners 
A
ABN AMRO
Accell Group
Accsys Technologies
Achmea
Acuity Brands
Adecco
Adidas
Adobe Systems

Categories

Resources