Navigating an HTML tree with BeautifulSoup - Python

I'm trying to scrape some data with BeautifulSoup in Python (URL: http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html).
When the value I want is the first occurrence of its tag, there is no problem, for example:
titlebook = soup.find("h1")
titlebook = titlebook.text
But I want to scrape other values further down the page, such as the UPC, the price including tax, etc.
The UPC value comes first, and I have it working with universal_product_code = soup.find("tr").find("td").text
I tried many approaches to access the other values (I've read the BeautifulSoup documentation and tried a lot of things, but it didn't really help).
So my question is: how do I access specific values in a tree where the tags are all the same? I attached a screenshot of the tree to help you understand what I'm talking about.
Thank you for your help.

For example, to find the price (excluding tax), you can use the string= parameter of .find() to locate the <th> with that label, and then read the text of the next <td>:
import requests
from bs4 import BeautifulSoup
url = "http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# get "Price (excl. tax)" from the table:
key = "Price (excl. tax)"
print("{}: {}".format(key, soup.find("th", string=key).find_next("td").text))
Prints:
Price (excl. tax): £53.74
Or use a CSS selector:
print(soup.select_one('th:-soup-contains("Price (excl. tax)") + td').text)
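If several values are needed at once, one option is to read the whole table into a dictionary keyed by the header text. The following is a minimal sketch, not the answer's own method: the inline HTML is a simplified, assumed stand-in for the product table (the UPC value is a placeholder), and table_to_dict is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Assumed, simplified stand-in for the product information table.
# Only the "Price (excl. tax)" value comes from the answer above;
# the other values are placeholders.
html = """
<table>
  <tr><th>UPC</th><td>placeholder-upc</td></tr>
  <tr><th>Price (excl. tax)</th><td>£53.74</td></tr>
  <tr><th>Price (incl. tax)</th><td>placeholder-price</td></tr>
</table>
"""


def table_to_dict(soup):
    """Map each <th> label to the text of the <td> in the same row."""
    data = {}
    for row in soup.find_all("tr"):
        header = row.find("th")
        cell = row.find("td")
        # Skip rows that are missing either the label or the value.
        if header is not None and cell is not None:
            data[header.get_text(strip=True)] = cell.get_text(strip=True)
    return data


soup = BeautifulSoup(html, "html.parser")
info = table_to_dict(soup)
print(info["Price (excl. tax)"])
```

With the real page, the same helper could be called on the soup built from requests.get(url).content, provided each row really holds one <th> and one <td>.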

Related

Using Beautiful Soup to extract specific groups of links from a blogspot website

I want to extract, let's say, every 7th year link on a school website. In the archives, it's pretty easy to find them with Ctrl + F and "year-7". It's not that easy with BeautifulSoup, though. Or maybe I'm doing it wrong.
import requests
from bs4 import BeautifulSoup
URL = '~school URL~'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
This gives me every link on the website archive. Every link that matters to me is pretty much like this:
~school URL~blogspot.com/2020/10/mathematics-activity-year-x.html
I tried storing the result of link.get('href') in a variable and searching it for "year-x", but that doesn't work.
Any ideas on how I'd search through it? Blogspot search is horrific. I'm doing this as a project to help kids from a poor area find their classes more easily, because everything was just left on the website for the next school year and there are hundreds of links without tags for the different school years. I'm trying to at least compile a list of the links for each school year to help them.
If I understand correctly, you want to extract the years from the links. Try using a regex to extract them.
In your case it would be:
import re
from bs4 import BeautifulSoup
txt = """<a href="blogspot.com/2020/10/mathematics-activity-year-x.html"></a>"""
soup = BeautifulSoup(txt, "html.parser")
years = []
for tag in soup.find_all("a"):
    link = tag.get("href")
    # Matches "year-" plus at most one following character
    year = re.search(r"year-.?", link).group()
    years.append(year)
print(years)
Output:
['year-x']
Edit: Try using a CSS selector to select all hrefs that end with year-7.html
...
for tag in soup.select('a[href$="year-7.html"]'):
    print(tag)
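To compile a list of links for every school year in one pass, the regex and the loop above can be combined. This is a sketch under assumptions: the example.blogspot.com URLs are made up, and it assumes the year always appears in the href as "year-" followed by digits.

```python
import re
from collections import defaultdict

from bs4 import BeautifulSoup

# Assumed sample archive markup; the URLs are hypothetical.
html = """
<a href="https://example.blogspot.com/2020/10/mathematics-activity-year-7.html">Maths</a>
<a href="https://example.blogspot.com/2020/10/science-activity-year-8.html">Science</a>
<a href="https://example.blogspot.com/2020/11/history-activity-year-7.html">History</a>
<a href="https://example.blogspot.com/p/about.html">About</a>
<a>No href here</a>
"""

soup = BeautifulSoup(html, "html.parser")
links_by_year = defaultdict(list)

# href=True skips <a> tags that have no href attribute at all.
for tag in soup.find_all("a", href=True):
    match = re.search(r"year-(\d+)", tag["href"])
    if match:  # ignore links that do not mention a school year
        links_by_year[int(match.group(1))].append(tag["href"])

for year in sorted(links_by_year):
    print(year, links_by_year[year])
```

Checking the match before calling .group() avoids an AttributeError on links that contain no year at all.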

How do I scrape a 'td' in web scraping

I am learning web scraping and I'm scraping the following website: ivmp servers. I'm having trouble scraping the number of players on the server. Can someone help me? Here is the code I have so far:
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.game-state.com/index.php?game=ivmp').text
soup = BeautifulSoup(source, 'html.parser')
players = soup.find('table')
summary = players.find('div', class_ ='players')
print(summary)
Looking at the page you provided, I assume the table you want to extract information from is the one with the server names and IP addresses.
There are actually 4 "table" elements on this page.
Luckily, this table has an id (serverlist). You can find it easily with right click > Inspect in Chrome.
players = soup.select_one('table#serverlist')
Now you want to get the td elements.
You can print all of them using:
for td in players.select("td"):
    print(td)
Or you can select only the ones you are interested in, for example:
players.select("td.hostname")
Hope this helps.
Looking at the structure of the page, there are a few table cells (td) with the class "players". It looks like two of them are for sorting the table, so we'll assume you don't want those.
To extract the ones you do want, I would first query for all the td elements with the class "players", and then loop through them, adding only the ones we want to a list.
Something like this:
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.game-state.com/index.php?game=ivmp').text
soup = BeautifulSoup(source, 'html.parser')
players = soup.find_all('td', class_='players')
summary = []
for cell in players:
    # Exclude the cells which are for sorting
    if cell.get_text() != 'Players':
        summary.append(cell.get_text())
print(summary)
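The two answers can be combined by scoping the search to the serverlist table and pairing each player count with its hostname. This is only a sketch: the inline markup is an assumption based on the class and id names mentioned in the answers, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Assumed markup built from the names in the answers above
# (table#serverlist, td.hostname, td.players); the values are made up.
html = """
<table id="other"><tr><td class="players">ignored</td></tr></table>
<table id="serverlist">
  <tr><td class="hostname">Hostname</td><td class="players">Players</td></tr>
  <tr><td class="hostname">Server A</td><td class="players">12/32</td></tr>
  <tr><td class="hostname">Server B</td><td class="players">0/100</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

servers = {}
for row in soup.select("table#serverlist tr"):
    name = row.select_one("td.hostname")
    players = row.select_one("td.players")
    if name is None or players is None:
        continue  # row does not have both cells
    count = players.get_text(strip=True)
    if count == "Players":
        continue  # sorting/header cell, not a real value
    servers[name.get_text(strip=True)] = count

print(servers)
```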

How to get the text and URL from a link using beautifulsoup

I have the following code which prints out a list of the links for each team in a table:
import requests
from bs4 import BeautifulSoup
# Get all teams in Big Sky standings table
URL = 'https://www.espn.com/college-football/standings/_/group/20/view/fcs-i-aa'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
standings = soup.find_all('table', 'Table Table--align-right Table--fixed Table--fixed-left')
for team in standings:
    team_data = team.find_all('span', 'hide-mobile')
    print(team_data)
The code prints out the entire list and if I pinpoint an index such as 'print(team_data[0])', it will print out the specific link from the page.
How can I then go into that link and get the string from the URL as well as the text for the link?
For example, my code prints out the following for the first index in the list.
<span class="hide-mobile"><a class="AnchorLink" data-clubhouse-uid="s:20~l:23~t:2692" href="/college-football/team/_/id/2692/weber-state-wildcats" tabindex="0">Weber State Wildcats</a></span>
How can I pull
/college-football/team/_/id/2692/weber-state-wildcats
and
Weber State Wildcats
from the link?
Thank you for your time and if there is anything I can add for clarification, please don't hesitate to ask.
Provided that you have HTML like:
<span class="hide-mobile"><a class="AnchorLink" data-clubhouse-uid="s:20~l:23~t:2692" href="/college-football/team/_/id/2692/weber-state-wildcats" tabindex="0">Weber State Wildcats</a></span>
To get /college-football/team/_/id/2692/weber-state-wildcats (team_data is the list returned by find_all, so index it first):
>>> team_data[0].find('a')['href']
'/college-football/team/_/id/2692/weber-state-wildcats'
To get Weber State Wildcats:
>>> team_data[0].find('a').text
'Weber State Wildcats'
For the href/URL, you can read the href attribute of the <a> tag.
For the link text, you can read the text of the same tag.
Both amount to filtering down to the target element, and then extracting the desired attribute.
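Putting this together for every team, a loop can collect the text and the href of each link. The sketch below reuses the markup from the question; BASE_URL and the use of urljoin to build an absolute URL are additions for illustration, not something the question asked for.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base used to turn the relative href into an absolute URL.
BASE_URL = "https://www.espn.com"

# Markup copied from the question.
html = (
    '<span class="hide-mobile"><a class="AnchorLink" '
    'data-clubhouse-uid="s:20~l:23~t:2692" '
    'href="/college-football/team/_/id/2692/weber-state-wildcats" '
    'tabindex="0">Weber State Wildcats</a></span>'
)

soup = BeautifulSoup(html, "html.parser")

teams = []
# Select only <a> tags that sit inside span.hide-mobile and have an href.
for link in soup.select("span.hide-mobile a[href]"):
    teams.append(
        {
            "name": link.get_text(strip=True),
            "path": link["href"],
            "url": urljoin(BASE_URL, link["href"]),
        }
    )

print(teams)
```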

Scraping page with BS in python only captures first column of splitColumn

I'm trying to scrape the last part of this page with BeautifulSoup in Python.
I want to retrieve all the companies listed at the bottom. The companies are ordered alphabetically: those with names starting with "A-F" appear under the first tab, "G-N" under the second tab, and so on. You have to click the tabs for the names to appear, so I'll loop through the different "name pages" and apply the same code.
I'm having trouble retrieving all the names of a single page, however.
When looking at the companies named "A-F" I can only retrieve the names of the first column of the table.
My code is:
from bs4 import BeautifulSoup as Soup
import requests
incl_page_url = "https://www.triodos.com/en/investment-management/socially-responsible-investment/sustainable-investment-universe/companies-atmf1/"
page = requests.get(incl_page_url)
soup = Soup(page.content, "html.parser")
for header in soup.find("h2").next_siblings:
    try:
        for a in header.childGenerator():
            if str(type(a)) == "<class 'bs4.element.NavigableString'>":
                print(str(a))
    except:
        pass
As can be seen by running this, I only get the names from the first column.
Any help is very much appreciated.
Give this a shot and tell me if this is not what you wanted:
from bs4 import BeautifulSoup
import requests
incl_page_url = "https://www.triodos.com/en/investment-management/socially-responsible-investment/sustainable-investment-universe/companies-atmf1/"
page = requests.get(incl_page_url).text
soup = BeautifulSoup(page, "lxml")
for items in soup.select(".splitColumn p"):
    title = '\n'.join([item for item in items.strings])
    print(title)
Result:
3iGroup
8point3 Energy Partners 
A
ABN AMRO
Accell Group
Accsys Technologies
Achmea
Acuity Brands
Adecco
Adidas
Adobe Systems
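If a flat list of company names is more useful than printed text, the same selector can feed a list. This is a sketch under assumptions: the inline markup only imitates what the output above suggests (names separated by <br> inside <p> tags, with single-letter headings such as "A"), and skipping single-character strings is a heuristic, not something the page guarantees.

```python
from bs4 import BeautifulSoup

# Assumed markup imitating the .splitColumn section; the real page may differ.
html = """
<div class="splitColumn">
  <div><p>3iGroup<br/>8point3 Energy Partners</p><p><strong>A</strong><br/>ABN AMRO<br/>Accell Group</p></div>
  <div><p>Adecco<br/>Adidas</p></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

names = []
for paragraph in soup.select(".splitColumn p"):
    # stripped_strings yields each text fragment with whitespace removed.
    for text in paragraph.stripped_strings:
        if len(text) > 1:  # heuristic: drop single-letter headings like "A"
            names.append(text)

print(names)
```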

How to extract a certain number from a website in Python

I need to make a bot for a client that gets data from http://backpack.tf/stats/Unique/AWPer%20Hand/Tradable/Craftable and does some research.
At the top of the website you see the recommended price (3 ref), and if you scroll down you can see what people are actually selling them for.
I need to see if what they're selling them for is less than the recommended price. I have inspected the element and found that each listing uses a class called "media listing" followed by a random ID. Where do I go from here?
I would suggest reading the BeautifulSoup documentation, but this should give you a good idea of what you want to do:
from bs4 import BeautifulSoup
import requests
url = "http://backpack.tf/stats/Unique/AWPer%20Hand/Tradable/Craftable"
r = requests.get(url)
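soup = BeautifulSoup(r.text, "html.parser")
# Take the text of the first link after the first <h2> as the current price
curPrice = soup.find('h2').find_next('a').text
print('The current price is: {0}'.format(curPrice))
print('These are the prices they are being sold at: ')
# Print the text of every <span> with these class and data-tip attributes
print('\n'.join([item.text for item in soup.find_all('span', attrs={'class': 'label label-black', 'data-tip': 'bottom'})]))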
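Once the price strings are scraped, they still have to be turned into numbers before they can be compared with the recommended price. The sketch below shows one way to do that, assuming every price looks like "3 ref" or "2.88 ref"; parse_ref is a hypothetical helper and the listing values are made up, so other formats on the real site would need extra handling.

```python
import re


def parse_ref(text):
    """Return the number in a price such as '2.88 ref', or None if absent."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*ref", text)
    return float(match.group(1)) if match else None


recommended = parse_ref("3 ref")

# Made-up listing prices standing in for the scraped <span> texts.
listings = ["2.88 ref", "3 ref", "3.11 ref", "n/a"]

cheaper = []
for price_text in listings:
    value = parse_ref(price_text)
    # Keep only listings that parsed and are strictly below the recommended price.
    if value is not None and value < recommended:
        cheaper.append(price_text)

print(cheaper)
```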
