Parsing NBA reference with python beautiful soup - python

I'm trying to scrape the miscellaneous stats table from http://www.basketball-reference.com/leagues/NBA_2016.html using Python and Beautiful Soup. This is the basic code so far. I just want to check whether it is even reading the table, but when I print table I just get None.
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "http://www.basketball-reference.com/leagues/NBA_2016.html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
table = soup.find('table', id='misc_stats')
print table
When I inspect the HTML on the webpage itself, the table I want appears with <!-- in front of it, and that portion of the HTML is shown in green. What can I do?

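<!-- marks the start of an HTML comment and --> marks the end, so just remove the comment markers before you parse it:
import re
import requests
from bs4 import BeautifulSoup
# Matches the opening and closing markers of an HTML comment.
comm = re.compile("<!--|-->")
# Use .text (str) rather than .content (bytes) so the str pattern also works on Python 3.
html = requests.get("http://www.basketball-reference.com/leagues/NBA_2016.html").text
cleaned_soup = BeautifulSoup(comm.sub("", html), "html.parser")
# Use id 'misc_stats' instead to get the miscellaneous stats table.
tableStats = cleaned_soup.find('table', {'id':'team_stats'})
print(tableStats)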
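To see why the first attempt prints None, here is a small offline sketch. The HTML string is made up for illustration (only the misc_stats id mirrors the question), and html.parser is used so that no extra parser has to be installed.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup that mimics the page: the table sits inside an HTML comment.
SAMPLE_HTML = """
<div id="all_misc_stats">
<!--
<table id="misc_stats"><tr><td>Golden State Warriors</td></tr></table>
-->
</div>
"""

# Parsed as-is, the table is only text inside a Comment node, so find() returns None.
raw_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
before = raw_soup.find("table", id="misc_stats")

# After deleting the comment markers, the same markup is parsed as real tags.
cleaned_html = re.sub(r"<!--|-->", "", SAMPLE_HTML)
cleaned_soup = BeautifulSoup(cleaned_html, "html.parser")
after = cleaned_soup.find("table", id="misc_stats")

print(before)
print(after.td.get_text())
```

Note that this also turns any genuine comments on the page into parsed markup, which is usually harmless for a scrape but worth keeping in mind.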

Related

BeautifulSoup - can't find attribute

I'm trying to scrape this link.
I want to get to this part here:
I can see where this part of the website is when I inspect the page:
But I can't get to it from BeautifulSoup.
Here is the code that I'm using and all the ways I've tried to access it:
from bs4 import BeautifulSoup
import requests
link = 'https://www.sports-reference.com/cbb/players/temetrius-morant-1.html'
html_text = requests.get(link).text
soup = BeautifulSoup(html_text, 'html.parser')
soup.find_all(class_='data_grid')
soup.find_all(string="data_grid")
soup.find_all(attrs={"class": "data_grid"})
Also, when I just look at the html I can see that it is there:
You need to look at the actual source HTML that you get in the response (not the HTML you see when you inspect the page, which is what you have shown). You'll notice those tables sit inside HTML comments, i.e. between <!-- and -->. BeautifulSoup does not parse the markup inside comments as tags.
There are a few ways to go about it. BeautifulSoup can search for and pull out comments, but with this particular site I find it easier to just remove the comment markers.
Once you do that, you can easily parse the HTML with BeautifulSoup to get the desired <div> tag, then let pandas parse the <table> tag within it.
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.sports-reference.com/cbb/players/temetrius-morant-1.html'
response = requests.get(url)
html = response.text
html = html.replace('<!--', '').replace('-->', '')
soup = BeautifulSoup(html, 'html.parser')
leaderboard_pts = soup.find('div', {'id':'leaderboard_pts'})
df = pd.read_html(str(leaderboard_pts))[0]
Output:
print(df)
0
0 2017-18 OVC 405 (18th)
1 2018-19 NCAA 808 (9th)
2 2018-19 OVC 808 (1st)
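The other route mentioned in this answer, pulling the comments out instead of deleting the markers, might look like the sketch below. The sample HTML and the helper name find_in_comments are invented for illustration; only the leaderboard_pts id comes from the answer.

```python
from bs4 import BeautifulSoup, Comment

# Made-up markup: the div we want is hidden inside an HTML comment.
SAMPLE_HTML = """
<div id="all_leaderboard">
<!--
<div id="leaderboard_pts">
<table><tr><td>2018-19 NCAA 808 (9th)</td></tr></table>
</div>
-->
</div>
"""

def find_in_comments(html, tag, element_id):
    """Return the first matching tag hidden inside any HTML comment, else None."""
    soup = BeautifulSoup(html, "html.parser")
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # Re-parse the comment body as its own small document.
        inner = BeautifulSoup(str(comment), "html.parser")
        match = inner.find(tag, id=element_id)
        if match is not None:
            return match
    return None

leaderboard_pts = find_in_comments(SAMPLE_HTML, "div", "leaderboard_pts")
print(leaderboard_pts.td.get_text())
```

This leaves real comments untouched, at the cost of parsing each comment separately.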
If you're looking for the points section, I suggest searching by id, like this:
point_section=soup.find("div",{"id":"leaderboard_pts"})

Neither pandas.read_html nor BeautifulSoup can find all tables on webpage

I am trying to get the 3rd and 6th tables from a webpage (https://www.pro-football-reference.com/years/2021/) but pandas.read_html and BeautifulSoup are both only finding the first two tables on the page. Here is what I've tried.
url = 'https://www.pro-football-reference.com/years/2021/'
data_pd = pd.read_html(url)
print(len(data_pd))
Output:
2
and also
url = 'https://www.pro-football-reference.com/years/2021/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for table in soup.find_all('table'):
    print(table.get('class'))
Output:
['sortable', 'stats_table']
['sortable', 'stats_table']
I am guessing it has something to do with the way the webpage is formatted, but is there anything I can do to grab the tables that I need?
Yes, you could use Selenium to let the page render and then pull in the HTML. However, I try to avoid Selenium where I can because of the overhead.
The better option is a simple request: the static HTML does contain the other tables, but inside comments. You could either use BeautifulSoup to pull out the comments and then parse those tables, or simply remove the comment markers and then parse.
import requests
import pandas as pd
url = 'https://www.pro-football-reference.com/years/2021/'
response = requests.get(url).text.replace("<!--","").replace("-->","")
data_pd = pd.read_html(response)
print(len(data_pd))
Output:
13
Or, using BeautifulSoup to go through the comments:
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
url = 'https://www.pro-football-reference.com/years/2021/'
result = requests.get(url).text
data = BeautifulSoup(result, 'html.parser')
comments = data.find_all(string=lambda text: isinstance(text, Comment))
data_pd = pd.read_html(url)
for each in comments:
    if '<table' in str(each):
        data_pd.append(pd.read_html(str(each))[0])
print(len(data_pd))
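Since the question asks for specific tables, picking them by position can be fragile. A possible variation, sketched here with invented sample HTML and made-up table ids, collects every table (visible or commented out) into a dict keyed by its id attribute. It assumes each table has an id; tables without one would all land under the key None.

```python
from bs4 import BeautifulSoup, Comment

# Made-up markup: one visible table, two tables inside comments, one plain comment.
SAMPLE_HTML = """
<table id="AFC"><tr><td>visible</td></tr></table>
<!-- <table id="team_stats"><tr><td>hidden 1</td></tr></table> -->
<!-- just a note, no table here -->
<!-- <table id="passing"><tr><td>hidden 2</td></tr></table> -->
"""

def tables_by_id(html):
    """Map table id -> table tag, covering visible and commented-out tables."""
    soup = BeautifulSoup(html, "html.parser")
    tables = {t.get("id"): t for t in soup.find_all("table")}
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        if "<table" in comment:
            # Re-parse the comment body so its tables become real tags.
            inner = BeautifulSoup(str(comment), "html.parser")
            for t in inner.find_all("table"):
                tables[t.get("id")] = t
    return tables

tables = tables_by_id(SAMPLE_HTML)
print(sorted(tables))
```

Each value can then be handed to pd.read_html as a string (recent pandas versions prefer the string wrapped in io.StringIO).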

how can I get the names in this html code by python?

I want to get both names, "Justin Cutroni" and "Krista Seiden", without the tags.
This is the HTML I want to extract the names from with Python 3:
I used BeautifulSoup, but I don't know how to go deeper into the HTML and get the names.
import requests
from bs4 import BeautifulSoup as bs
web_pages = ["https://maktabkhooneh.org/learn/"]
def find_lessons(web_page):
    # Load the webpage content
    r = requests.get(web_page)
    # Convert to a beautiful soup object
    soup = bs(r.content, features="html.parser")
    table = soup.select('div[class="course-card__title"]')
    data = [x.text.split(';')[-1].strip() for x in table]
    return data
find_lessons(web_pages[0])
You are looking at course-card__title, when it appears that what you want is course-card__teacher. When you're using requests, it's often more useful to look at the real HTML (using wget or curl) rather than the object model, as in your image.
What you have pretty much works with that change:
import requests
from bs4 import BeautifulSoup as bs
web_pages = ["https://maktabkhooneh.org/learn/"]
def find_teachers(web_page):
    # Load the webpage content
    r = requests.get(web_page)
    soup = bs(r.content, features="html.parser")
    table = soup.select('div[class="course-card__teacher"]')
    return [x.text.strip() for x in table]
print(find_teachers(web_pages[0]))
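One thing to keep in mind with the selector used here: div[class="..."] is an exact match on the whole class attribute, so it would miss a div that carries an additional class. The sketch below uses invented sample HTML (the extra "highlighted" class is hypothetical, not taken from the site) to contrast it with the plain class selector.

```python
from bs4 import BeautifulSoup

# Made-up markup; the second teacher div has an extra, hypothetical class.
SAMPLE_HTML = """
<div class="course-card__title">Google Analytics</div>
<div class="course-card__teacher">Justin Cutroni</div>
<div class="course-card__teacher highlighted">Krista Seiden</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact attribute match: the class attribute must equal this string exactly.
exact = [d.get_text(strip=True)
         for d in soup.select('div[class="course-card__teacher"]')]

# Class selector: matches any div that has this class among its classes.
by_class = [d.get_text(strip=True)
            for d in soup.select("div.course-card__teacher")]

print(exact)
print(by_class)
```

If the real page only ever uses the single class, both selectors return the same list.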

Trying to scrape Aliexpress

So I am trying to scrape the price of a product on Aliexpress. I tried inspecting the element which looks like
<span class="product-price-value" itemprop="price" data-spm-anchor-id="a2g0o.detail.1000016.i3.fe3c2b54yAsLRn">US $14.43</span>
I'm trying to run the following code
'''
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
url = 'https://www.aliexpress.com/item/32981494236.html?spm=a2g0o.productlist.0.0.44ba26f6M32wxY&algo_pvid=520e41c9-ba26-4aa6-b382-4aa63d014b4b&algo_expid=520e41c9-ba26-4aa6-b382-4aa63d014b4b-22&btsid=0bb0623b16170222520893504e9ae8&ws_ab_test=searchweb0_0,searchweb201602_,searchweb201603_'
source = urlopen(url).read()
soup = BeautifulSoup(source, 'lxml')
soup.find('span', class_='product-price-value')
'''
but I keep getting a blank output. I must be doing something wrong but these methods seem to work in the tutorials I've seen.
Here is what I found. As far as I can tell, the price on that page is filled in by scripts: the original HTML does not contain the span, only script tags, so I used split to pull the value out of the script text. Here is my code:
from bs4 import BeautifulSoup
import requests
url = 'https://aliexpress.ru/item/1005002281350811.html?spm=a2g0o.productlist.0.0.42d53b59T5ddTM&algo_pvid=f3c72fef-c5ab-44b6-902c-d7d362bcf5a5&algo_expid=f3c72fef-c5ab-44b6-902c-d7d362bcf5a5-1&btsid=0b8b035c16170960366785062e33c0&ws_ab_test=searchweb0_0,searchweb201602_,searchweb201603_&sku_id=12000019900010138'
data = requests.get(url)
soup = BeautifulSoup(data.content, features="lxml")
res = soup.findAll("script")
total_value = str(res[-3]).split("totalValue:")[1].split("}")[0].replace("\"", "").replace(".", "").strip()
print(total_value)
It works fine; I tried it on a few AliExpress pages.
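Indexing the script list with a fixed position and chaining split calls breaks as soon as the page layout shifts. A slightly more defensive sketch is to scan every script tag with a regular expression. The sample HTML, the PRICE_RE pattern, and the helper name extract_total_value are all assumptions made for illustration; the real script payload may be formatted differently.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup: the price lives inside a script, not in a span.
SAMPLE_HTML = """
<html><body>
<script>window.config = {lang: "en"};</script>
<script>window.runParams = {data: {priceModule: {totalValue: "US $14.43"}}};</script>
</body></html>
"""

# Hypothetical pattern: totalValue: "<currency text><number>"
PRICE_RE = re.compile(r'totalValue:\s*"[^"\d]*([\d.,]+)"')

def extract_total_value(html):
    """Return the first totalValue number found in any <script> tag, else None."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script"):
        text = script.string or ""  # .string is None for empty or external scripts
        match = PRICE_RE.search(text)
        if match:
            return match.group(1)
    return None

price = extract_total_value(SAMPLE_HTML)
print(price)
```

Returning None when nothing matches makes a layout change show up as a missing value rather than an IndexError.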

How to get text following a table/span with BeautifulSoup and Python?

I need to get the text 2,585 shown in the screenshot below. I am very new to coding, but this is what I have so far:
import urllib2
from bs4 import BeautifulSoup
url= 'insertURL'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
span = soup.find('span', id='d21475972e793-wk-Fact -8D34B98C76EF518C788A2177E5B18DB0')
print (span.text)
Any info is helpful!! Thanks.
Website HTML
Three things: you're using requests, not urllib2. You're selecting XML with namespaces, so you need to use xml as the parser. The element you want is not span; it is ix:nonFraction. Here is a working example using another web page (you just need to point it at your page and use the commented-out line).
# Using requests no need for urllib2.
import requests
from bs4 import BeautifulSoup
# Using this page as an example.
url= 'https://www.sec.gov/Archives/edgar/data/27904/000002790417000004/0000027904-17-000004.txt'
r = requests.get(url)
data = r.text
# use xml as the parser.
soup = BeautifulSoup(data, 'xml')
ix = soup.find('ix:nonFraction', id="Fact-7365D69E1478B0A952B8159A2E39B9D8-wk-Fact-7365D69E1478B0A952B8159A2E39B9D8")
# Your original code for your page.
# ix = soup.find('ix:nonFraction', id='d21475972e793-wk-Fact-8D34B98C76EF518C788A2177E5B18DB0')
print (ix.text)
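If the xml parser is not available (in Beautiful Soup it relies on lxml), a lookup by id alone sidesteps the tag-name question entirely. This is a swapped-in alternative, not the method from the answer: it uses html.parser, and the sample markup and the id "fact-1" are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up inline XBRL fragment; the id is hypothetical.
SAMPLE_HTML = """
<td><ix:nonFraction id="fact-1" name="us-gaap:Revenues" unitRef="usd">2,585</ix:nonFraction></td>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Searching by id alone does not depend on how the parser spells the tag name.
fact = soup.find(id="fact-1")

# html.parser lowercases tag names; either way, the element is not a span.
print(fact.name)

# Strip the thousands separator before converting to a number.
value = int(fact.get_text().replace(",", ""))
print(value)
```

The id passed to find must match the attribute exactly, so a stray space inside the id string will make the lookup return None.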
