BeautifulSoup - can't find attribute - python

I'm trying to scrape this link.
I want to get to this part here:
I can see where this part of the website is when I inspect the page:
But I can't get to it from BeautifulSoup.
Here is the code that I'm using and all the ways I've tried to access it:
from bs4 import BeautifulSoup
import requests
link = 'https://www.sports-reference.com/cbb/players/temetrius-morant-1.html'
html_text = requests.get(link).text
soup = BeautifulSoup(html_text, 'html.parser')
soup.find_all(class_='data_grid')
soup.find_all(string="data_grid")
soup.find_all(attrs={"class": "data_grid"})
Also, when I just look at the html I can see that it is there:

Look at the actual HTML source you get back in the response, not the HTML shown when you inspect the page (which is what you have posted). You'll notice those tables sit inside HTML comments, i.e. between <!-- and -->. BeautifulSoup does not parse markup inside comments as tags, so find_all never sees them.
There are a few ways to go about it. BeautifulSoup can search for and pull out comments, but with this particular site I find it easier to just remove the comment markers.
Once you do that, you can parse the HTML with BeautifulSoup to get the desired <div> tag, then let pandas parse the <table> tag inside it.
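import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup
url = 'https://www.sports-reference.com/cbb/players/temetrius-morant-1.html'
response = requests.get(url)
html = response.text
# Remove the comment markers so the commented-out markup is parsed as regular tags.
html = html.replace('<!--', '').replace('-->', '')
soup = BeautifulSoup(html, 'html.parser')
leaderboard_pts = soup.find('div', {'id':'leaderboard_pts'})
# Wrap the markup in StringIO; newer pandas versions deprecate passing literal HTML strings.
df = pd.read_html(StringIO(str(leaderboard_pts)))[0]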
Output:
print(df)
0
0 2017-18 OVC 405 (18th)
1 2018-19 NCAA 808 (9th)
2 2018-19 OVC 808 (1st)
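
To see the effect without hitting the site, here is a minimal offline sketch. The SAMPLE_HTML string and the find_div helper are made up for illustration (only the leaderboard_pts id comes from the answer above); the real page is much larger and may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page: the div we want sits inside an HTML comment.
SAMPLE_HTML = """
<div id="all_leaderboard">
<!--
<div id="leaderboard_pts" class="data_grid_box">
  <table><tr><td>2018-19 OVC</td><td>808 (1st)</td></tr></table>
</div>
-->
</div>
"""


def find_div(html, div_id):
    """Return the <div> with the given id, or None if the parser cannot see it."""
    return BeautifulSoup(html, "html.parser").find("div", {"id": div_id})


# Parsed as-is, the commented-out div is just text inside a Comment node.
before = find_div(SAMPLE_HTML, "leaderboard_pts")

# With the comment markers stripped, the same markup becomes real tags.
stripped = SAMPLE_HTML.replace("<!--", "").replace("-->", "")
after = find_div(stripped, "leaderboard_pts")

print(before is None)
print(after is not None)
```

The first lookup comes back empty and the second one succeeds, which is the same behaviour the question ran into on the live page.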

If you're looking for the points section, I suggest searching by id like this:
point_section=soup.find("div",{"id":"leaderboard_pts"})

Related

Neither pandas.read_html nor BeautifulSoup can find all tables on webpage

I am trying to get the 3rd and 6th tables from a webpage (https://www.pro-football-reference.com/years/2021/) but pandas.read_html and BeautifulSoup are both only finding the first two tables on the page. Here is what I've tried.
url = 'https://www.pro-football-reference.com/years/2021/'
data_pd = pd.read_html(url)
print(len(data_pd))
Output:
2
and also
url = 'https://www.pro-football-reference.com/years/2021/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for table in soup.find_all('table'):
    print(table.get('class'))
Output:
['sortable', 'stats_table']
['sortable', 'stats_table']
I am guessing it has something to do with the way the webpage is formatted, but is there anything I can do to grab the tables that I need?
Yes, you could use Selenium to let the page render and then pull in the HTML. However, I try to avoid Selenium when I can because of the overhead.
The better option is a simple request: the static HTML does contain the other tables, but they are inside comments. You can either a) use BeautifulSoup's ability to pull out the comments and then parse those tables, or b) simply remove the comment markers and then parse.
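import requests
import pandas as pd
from io import StringIO
url = 'https://www.pro-football-reference.com/years/2021/'
# Strip the comment markers so the commented-out tables become ordinary markup.
html = requests.get(url).text.replace("<!--", "").replace("-->", "")
# Wrap the markup in StringIO; newer pandas versions deprecate passing literal HTML strings.
data_pd = pd.read_html(StringIO(html))
print(len(data_pd))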
Output:
13
Or, using BeautifulSoup to go through the comments:
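import requests
from io import StringIO
from bs4 import BeautifulSoup, Comment
import pandas as pd
url = 'https://www.pro-football-reference.com/years/2021/'
result = requests.get(url).text
data = BeautifulSoup(result, 'html.parser')
# Comments are NavigableString subclasses, so filter the strings by type.
comments = data.find_all(string=lambda text: isinstance(text, Comment))
# Start with the tables that are visible in the plain markup.
data_pd = pd.read_html(StringIO(result))
for each in comments:
    if '<table' in str(each):
        # Parse the commented-out markup as its own document.
        data_pd.append(pd.read_html(StringIO(str(each)))[0])
print(len(data_pd))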
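
The comment-walking approach can also be tried offline. This is only a sketch: SAMPLE_HTML and the table_ids helper are invented for illustration, and it collects table ids instead of DataFrames so it needs nothing beyond bs4.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical page: one visible table, one plain comment, one commented-out table.
SAMPLE_HTML = """
<table id="visible"><tr><td>1</td></tr></table>
<!-- just a note, no markup -->
<!--
<table id="hidden"><tr><td>2</td></tr></table>
-->
"""


def table_ids(html):
    """Return ids of all tables, including tables hidden inside comments."""
    soup = BeautifulSoup(html, "html.parser")
    # Tables that the parser sees directly.
    ids = [table.get("id") for table in soup.find_all("table")]
    # Comments are strings, so select them by type and re-parse the ones with tables.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        if "<table" in str(comment):
            inner = BeautifulSoup(str(comment), "html.parser")
            ids.extend(table.get("id") for table in inner.find_all("table"))
    return ids


print(table_ids(SAMPLE_HTML))
```

Checking for "<table" first skips comments that are plain notes, so only comments that actually hold markup get re-parsed.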

How to python Scrape text in span class

So I'm making a bitcoin checker for practice, and I'm having trouble scraping data because the data I want is in a span with a class and I don't know how to retrieve it.
Here is the line that I got from inspect:
<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>
I want to scrape the "11,511.31" number. How do I do this?
I tried so many different things and I honestly have no clue what to do anymore.
Here is the URL: link
I'm scraping the current USD price (right next to "BTC/USD").
EDIT: Guys, a lot of the examples you gave me are ones where I input the data myself. That's not useful because I want to refresh the page every 30 seconds, so I need the program to find the span class, extract the data and print it.
EDIT: Current code. I need the program to get the "html" part by itself.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup
url = 'https://www.gdax.com/trade/BTC-USD'
# the program needs to retrieve this by itself
html = """<span class="MarketInfo_market-num_1lAXs">11,560.00 USD</span>"""
soup = BeautifulSoup(html, "html.parser")
spans=soup.find_all('span', {'class': 'MarketInfo_market-num_1lAXs'})
for span in spans:
    print(span.text.replace('USD','').strip())
You just have to search for the right tag and class -
from bs4 import BeautifulSoup
html_text = """
<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>
"""
html = BeautifulSoup(html_text, "lxml")
spans = html.find_all('span', {'class': 'MarketInfo_market-num_1lAXs'})
for span in spans:
    print(span.text.replace('USD', '').strip())
This searches for all <span> tags and filters them by class attribute, which in your case has the value MarketInfo_market-num_1lAXs. Once the filter is done, loop through the spans, retrieve the text with the .text attribute, and strip out the 'USD'.
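
If you want a number rather than a string, the cleanup step can go a little further. This is a small sketch under the assumption that the text always looks like the span in the question (thousands separated by commas, followed by the currency code); parse_price is a hypothetical helper name.

```python
from decimal import Decimal


def parse_price(text, currency="USD"):
    """Turn span text such as ' 11,511.31 USD ' into a Decimal."""
    # Drop the currency code and the thousands separators, then trim whitespace.
    number = text.replace(currency, "").replace(",", "").strip()
    return Decimal(number)


price = parse_price(" 11,511.31 USD ")
print(price)
```

Decimal keeps the value exact, which is usually what you want for prices; float(number) would also work if exactness does not matter.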
UPDATE
import requests
import json
url = 'https://api.gdax.com/products/BTC-USD/trades'
res = requests.get(url)
json_res = json.loads(res.text)
print(json_res[0]['price'])
No need to understand the HTML. The data in that HTML tag is getting populated from an API call which has a JSON response. You can call that API directly. This will keep your data current.
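
Since the question asks for a refresh every 30 seconds, the API call can be wrapped in a small polling loop. The sketch below runs offline: SAMPLE_PAYLOAD is a made-up payload shaped like the usage above (a list of trades, each with a "price" field, assumed here to be a string with the newest trade first), and latest_price and poll are hypothetical helpers. The real response may differ.

```python
import json
import time
from decimal import Decimal

# Hypothetical payload; the real API response may carry other fields or ordering.
SAMPLE_PAYLOAD = '[{"price": "11511.31", "size": "0.01"}, {"price": "11510.00", "size": "0.5"}]'


def latest_price(payload):
    """Return the price of the first trade in a JSON list of trades."""
    trades = json.loads(payload)
    return Decimal(trades[0]["price"])


def poll(fetch, interval_seconds, rounds):
    """Call fetch() `rounds` times, sleeping between calls, and collect the prices."""
    prices = []
    for i in range(rounds):
        prices.append(latest_price(fetch()))
        if i < rounds - 1:
            time.sleep(interval_seconds)
    return prices


# Offline demo: a stub fetcher and no delay between rounds.
print(poll(lambda: SAMPLE_PAYLOAD, interval_seconds=0, rounds=2))
```

In real use, fetch would be something like `lambda: requests.get(url).text` with `interval_seconds=30`.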
You can use BeautifulSoup or lxml.
For BeautifulSoup, the code looks like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup("""<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>""", "lxml")
print(soup.string)
lxml is faster:
from lxml import etree
span = etree.HTML("""<span class="MarketInfo_market-num_1lAXs"> 11,511.31 USD </span>""")
for i in span.xpath("//span/text()"):
    print(i)
Try a real browser, e.g. Selenium with Firefox. I tried Selenium with PhantomJS, but it failed...
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
url = 'https://www.gdax.com/trade/BTC-USD'
driver = webdriver.Firefox(executable_path='./geckodriver')
driver.get(url)
sleep(10) # Sleep 10 seconds while waiting for the page to load...
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
spans=soup.find_all('span', {'class': 'MarketInfo_market-num_1lAXs'})
for span in spans:
    print(span.text.replace('USD','').strip())
driver.close()
Output:
11,493.00
+
3.06 %
13,432 BTC
[Finished in 15.0s]

Python BeautifulSoup cannot find table ID

I am running into some trouble scraping a table using BeautifulSoup. Here is my code
from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page,"html.parser")
stats = soup.find('table', id = 'totals')
In [78]: print(stats)
None
When I right click on the table to inspect the element the HTML looks as I'd expect, however when I view the source the only element with id = 'totals' is commented out. Is there a way to scrape a table from the commented source code?
I have referenced this post but can't seem to replicate their solution.
Here is a link to the webpage I am interested in. I'd like to scrape the table labeled "Totals" and store it as a data frame.
I am relatively new to Python, HTML, and web scraping. Any help would be greatly appreciated.
Thanks in advance.
Michael
Comments are string instances in BeautifulSoup. You can use BeautifulSoup's find method with a regular expression to find the particular string that you're after. Once you have the string, have BeautifulSoup parse that and there you go.
In other words,
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page,"html.parser")
stats_html = soup.find(string=re.compile('id="totals"'))
stats_soup = BeautifulSoup(stats_html, "html.parser")
print(stats_soup.table.caption.text)
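
To go from the commented-out table to rows you can load into a data frame, something like the following could work. It is a sketch against a made-up SAMPLE_HTML (the player names and numbers are placeholders, not real data), and commented_table_rows is a hypothetical helper.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical miniature of the page: the "totals" table is commented out.
SAMPLE_HTML = """
<div id="all_totals">
<!--
<table id="totals">
  <caption>Totals</caption>
  <tr><th>Player</th><th>PTS</th></tr>
  <tr><td>Player A</td><td>500</td></tr>
  <tr><td>Player B</td><td>320</td></tr>
</table>
-->
</div>
"""


def commented_table_rows(html, table_id):
    """Find the comment containing the table and return its rows as lists of text."""
    soup = BeautifulSoup(html, "html.parser")
    # Comments are strings, so a regex search over strings can match them.
    comment = soup.find(string=re.compile('id="{}"'.format(re.escape(table_id))))
    if comment is None:
        return []
    # Re-parse the comment text as its own document to get real tags.
    table = BeautifulSoup(str(comment), "html.parser").find("table", id=table_id)
    return [
        [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        for row in table.find_all("tr")
    ]


rows = commented_table_rows(SAMPLE_HTML, "totals")
print(rows)
```

The first row holds the headers, so `pd.DataFrame(rows[1:], columns=rows[0])` would turn the result into a data frame if pandas is available.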
You can do this:
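from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page, "lxml")
# This looks up the div with id 'all_totals' rather than the table itself;
# the table inside it may still be commented out.
stats = soup.find_all('div', id='all_totals')
print(stats)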
Please inform me if I helped!

Parsing NBA reference with python beautiful soup

So I'm trying to scrape the miscellaneous stats table from this site http://www.basketball-reference.com/leagues/NBA_2016.html using Python and Beautiful Soup. This is the basic code so far. I just want to see if it is even reading the table, but when I print table I just get None.
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "http://www.basketball-reference.com/leagues/NBA_2016.html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
table = soup.find('table', id='misc_stats')
print table
When I inspect the HTML on the webpage itself, the table that I want appears with <!-- in front of it, and the HTML text for that portion is green. What can I do?
<!-- is the start of a comment and --> is the end in HTML, so just remove the comment markers before you parse it:
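import re
import requests
from bs4 import BeautifulSoup
comm = re.compile("<!--|-->")
# Use .text (str) rather than .content (bytes) so the str pattern can be applied.
html = requests.get("http://www.basketball-reference.com/leagues/NBA_2016.html").text
cleaned_soup = BeautifulSoup(comm.sub("", html), "html.parser")
# The table in the question has id 'misc_stats'; substitute that id to fetch it instead.
tableStats = cleaned_soup.find('table', {'id':'team_stats'})
print(tableStats)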
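
One thing to watch on Python 3: response.content is bytes, and applying a str pattern to bytes raises a TypeError. A small helper that accepts either type avoids that. This is a sketch; strip_comment_markers is a hypothetical name and UTF-8 is an assumed default encoding.

```python
import re

# Matches the opening and closing markers of an HTML comment.
COMMENT_MARKERS = re.compile(r"<!--|-->")


def strip_comment_markers(html, encoding="utf-8"):
    """Remove <!-- and --> from HTML given as str or bytes; always returns str."""
    if isinstance(html, bytes):
        # Decode first so the str pattern can be applied.
        html = html.decode(encoding, errors="replace")
    return COMMENT_MARKERS.sub("", html)


sample = b'<div><!--<table id="misc_stats"></table>--></div>'
print(strip_comment_markers(sample))
```

The result can be passed straight to BeautifulSoup, after which the previously commented-out table is found like any other tag.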

Beautifulsoup url loading error

So I am trying to get the content of this page using beautiful soup. I want to create a dictionary of all the css color names and this seemed like a quick and easy way to access this. So naturally I did the quick basic:
from bs4 import BeautifulSoup as bs
url = 'http://www.w3schools.com/cssref/css_colornames.asp'
soup = bs(url)
For some reason I am only getting the URL in a p tag inside the body, and that's it:
>>> print soup.prettify()
<html>
<body>
<p>
http://www.w3schools.com/cssref/css_colornames.asp
</p>
</body>
</html>
Why won't BeautifulSoup give me access to the information I need?
BeautifulSoup does not load a URL for you.
You need to pass in the full HTML page, which means you need to load it from the URL first. Here is a sample using the urllib.request.urlopen function (urllib2.urlopen on Python 2) to achieve that:
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
url = 'http://www.w3schools.com/cssref/css_colornames.asp'
source = urlopen(url).read()
soup = bs(source, 'html.parser')
Now you can extract the colours just fine:
css_table = soup.find('table', class_='reference')
for row in css_table.find_all('tr'):
    cells = row.find_all('td')
    if cells:
        print(cells[0].a.text, cells[1].a.text)
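
Since the goal was a dictionary of colour names, the same loop can fill one instead of printing. The sketch below runs on a made-up SAMPLE_HTML that mimics the table layout the answer assumes (class "reference", name in the first cell, hex value in the second); the real page may be laid out differently, and colour_dict is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Hypothetical excerpt shaped like the table the answer's loop expects.
SAMPLE_HTML = """
<table class="reference">
  <tr><th>Color Name</th><th>HEX</th></tr>
  <tr><td><a href="#">AliceBlue</a></td><td><a href="#">#F0F8FF</a></td></tr>
  <tr><td><a href="#">AntiqueWhite</a></td><td><a href="#">#FAEBD7</a></td></tr>
</table>
"""


def colour_dict(html):
    """Map colour name to hex value for every data row of the reference table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="reference")
    colours = {}
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # Header rows have <th> cells only, so they are skipped here.
        if len(cells) >= 2:
            colours[cells[0].get_text(strip=True)] = cells[1].get_text(strip=True)
    return colours


print(colour_dict(SAMPLE_HTML))
```

Using get_text on the cell rather than `.a.text` also copes with rows where the cell has no link inside it.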
