So I am trying to get the content of this page using Beautiful Soup. I want to create a dictionary of all the CSS color names, and this seemed like a quick and easy way to access them. So naturally I did the quick basic:
from bs4 import BeautifulSoup as bs
url = 'http://www.w3schools.com/cssref/css_colornames.asp'
soup = bs(url)
For some reason I am only getting the URL in a p tag inside the body and that's it:
>>> print soup.prettify()
<html>
<body>
<p>
http://www.w3schools.com/cssref/css_colornames.asp
</p>
</body>
</html>
Why won't BeautifulSoup give me access to the information I need?
BeautifulSoup does not load a URL for you.
You need to pass in the full HTML page, which means you need to load it from the URL first. Here is a sample using the urllib2.urlopen function (Python 2) to achieve that:
from urllib2 import urlopen
from bs4 import BeautifulSoup as bs
url = 'http://www.w3schools.com/cssref/css_colornames.asp'
source = urlopen(url).read()
soup = bs(source)
Now you can extract the colours just fine:
css_table = soup.find('table', class_='reference')
for row in css_table.find_all('tr'):
    cells = row.find_all('td')
    if cells:
        print cells[0].a.text, cells[1].a.text
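From there, the colour dictionary the question asks for is a one-step change. A minimal Python 3 sketch, using an inline fragment shaped like the real table in place of the live page:

```python
from bs4 import BeautifulSoup

# An inline fragment shaped like the w3schools colour table, so the
# parsing logic can be shown without a network request.
html = """
<table class="reference">
  <tr><th>Name</th><th>Hex</th></tr>
  <tr><td><a>AliceBlue</a></td><td><a>#F0F8FF</a></td></tr>
  <tr><td><a>AntiqueWhite</a></td><td><a>#FAEBD7</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
colors = {}
for row in soup.find('table', class_='reference').find_all('tr'):
    cells = row.find_all('td')
    if cells:  # skip the header row, which has <th> cells only
        colors[cells[0].a.text] = cells[1].a.text

print(colors)
```

Against the live page, the same loop fills the dictionary with every named colour.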
Related
I'm trying to scrape this link.
I want to get to this part here:
I can see where this part of the website is when I inspect the page:
But I can't get to it from BeautifulSoup.
Here is the code that I'm using and all the ways I've tried to access it:
from bs4 import BeautifulSoup
import requests
link = 'https://www.sports-reference.com/cbb/players/temetrius-morant-1.html'
html_text = requests.get(link).text
soup = BeautifulSoup(html_text, 'html.parser')
soup.find_all(class_='data_grid')
soup.find_all(string="data_grid")
soup.find_all(attrs={"class": "data_grid"})
Also, when I just look at the raw HTML I can see that it is there.
You need to look at the actual source HTML you get in the response, not the rendered HTML you inspect in the browser (which you have shown you've done). You'll notice those tables sit inside HTML comments, i.e. between <!-- and -->. BeautifulSoup ignores comments.
There are a few ways to go about it. BeautifulSoup does have a way to search for and pull out comments, but with this particular site I find it easier to simply remove the comment tags.
Once you do that, you can parse the HTML with BeautifulSoup to get the desired <div> tag, then let pandas parse the <table> tag inside it.
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.sports-reference.com/cbb/players/temetrius-morant-1.html'
response = requests.get(url)
html = response.text
html = html.replace('<!--', '').replace('-->', '')
soup = BeautifulSoup(html, 'html.parser')
leaderboard_pts = soup.find('div', {'id':'leaderboard_pts'})
df = pd.read_html(str(leaderboard_pts))[0]
Output:
print(df)
0
0 2017-18 OVC 405 (18th)
1 2018-19 NCAA 808 (9th)
2 2018-19 OVC 808 (1st)
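The comment-extraction route mentioned above uses bs4's Comment class: find each comment node, re-parse its text, and search inside it. A minimal sketch on an inline fragment, not the live site:

```python
from bs4 import BeautifulSoup, Comment

# An inline stand-in for the real page: the table lives inside a comment.
html = "<div><!-- <table id='hidden'><tr><td>42</td></tr></table> --></div>"
soup = BeautifulSoup(html, 'html.parser')

value = None
# Walk every comment node, re-parse its text, and look for the table there.
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    inner = BeautifulSoup(comment, 'html.parser')
    table = inner.find('table', id='hidden')
    if table:
        value = table.td.text
print(value)
```

Stripping the comment markers up front, as in the answer above, avoids this second parsing pass.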
If you're looking for the points section, I suggest searching by id like this:
point_section = soup.find("div", {"id": "leaderboard_pts"})
I tried this, but it doesn't seem to be working. I only need the article links in a list.
from urllib import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://feeds.bbci.co.uk/news/entertainment_and_arts/rss.xml")
bsObj = BeautifulSoup(html.read(), "html.parser")
for link in bsObj.find_all('a'):
    print(link.get('href'))
Even though it renders as HTML when accessed through a browser, the server returns an XML to Python. If you print(html.read()) you will see that XML.
In this XML the <a> tags are replaced with <link> tags (with no attributes), so you need to change your code to reflect that:
from urllib import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://feeds.bbci.co.uk/news/entertainment_and_arts/rss.xml")
bsObj = BeautifulSoup(html.read(), "html.parser")
for link in bsObj.find_all('link'):
    print(link.text)
# http://www.bbc.co.uk/news/
# http://www.bbc.co.uk/news/
# http://www.bbc.co.uk/news/entertainment-arts-41914725
# http://www.bbc.co.uk/news/entertainment-arts-41886207
# http://www.bbc.co.uk/news/entertainment-arts-41886475
# ...
# ...
import feedparser
url = 'http://feeds.bbci.co.uk/news/entertainment_and_arts/rss.xml'
data = feedparser.parse(url)
i = 0
while i < len(data['entries']):
    print(data['entries'][i]["link"])
    i = i + 1
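Since the feed is RSS, plain XML parsing also works without any third-party library. A minimal sketch with the standard library's xml.etree, run here on an inline fragment rather than the live feed:

```python
import xml.etree.ElementTree as ET

# An inline fragment shaped like an RSS feed, standing in for the BBC URL.
rss = """<rss><channel>
  <item><link>http://www.bbc.co.uk/news/entertainment-arts-1</link></item>
  <item><link>http://www.bbc.co.uk/news/entertainment-arts-2</link></item>
</channel></rss>"""

root = ET.fromstring(rss)
# Each <item> holds one article; its <link> child holds the URL.
links = [item.find('link').text for item in root.iter('item')]
print(links)
```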
So I'm trying to scrape the miscellaneous stats table from this site http://www.basketball-reference.com/leagues/NBA_2016.html using Python and Beautiful Soup. This is the basic code so far; I just want to see if it is even reading the table, but when I print table I just get None.
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "http://www.basketball-reference.com/leagues/NBA_2016.html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
table = soup.find('table', id='misc_stats')
print table
When I inspect the HTML on the webpage itself, the table that I want appears with this symbol in front, <!--, and the HTML text is green for that portion. What can I do?
<!-- is the start of a comment and --> is the end in HTML, so just remove the comments before you parse it:
import re
from bs4 import BeautifulSoup
import requests
comm = re.compile("<!--|-->")
html = requests.get("http://www.basketball-reference.com/leagues/NBA_2016.html").text
cleaned_soup = BeautifulSoup(comm.sub("", html), "html.parser")
tableStats = cleaned_soup.find('table', {'id': 'misc_stats'})
print(tableStats)
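To see why the original find returned None, here is a self-contained sketch with an inline page: the commented-out table is invisible to BeautifulSoup until the comment markers are stripped.

```python
import re
from bs4 import BeautifulSoup

# The table exists only inside an HTML comment, as on the live site.
html = "<html><body><!-- <table id='misc_stats'><tr><td>x</td></tr></table> --></body></html>"

before = BeautifulSoup(html, 'html.parser').find('table', id='misc_stats')
print(before)  # None - the table only exists inside a comment

cleaned = re.sub("<!--|-->", "", html)
after = BeautifulSoup(cleaned, 'html.parser').find('table', id='misc_stats')
print(after is not None)  # True
```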
I'm trying to download the text from a news website. The HTML is:
<div class="pane-content">
<div class="field field-type-text field-field-noticia-bajada">
<div class="field-items">
<div class="field-item odd">
<p>"My Text" target="_blank">www.injuv.cl</a></strong></p> </div>
The output should be: My Text
I'm using the following python code:
try:
from BeautifulSoup import BeautifulSoup
except ImportError:
from bs4 import BeautifulSoup
html = "My URL"
parsed_html = BeautifulSoup(html)
p = parsed_html.find("div", attrs={'class':'pane-content'})
print(p)
But the output of the code is None. Do you know what is wrong with my code?
The problem is that you are not parsing the HTML, you are parsing the URL string:
html = "My URL"
parsed_html = BeautifulSoup(html)
Instead, you need to download the page source first. For example, in Python 2:
from urllib2 import urlopen
html = urlopen("My URL")
parsed_html = BeautifulSoup(html)
In Python 3, it would be:
from urllib.request import urlopen
html = urlopen("My URL")
parsed_html = BeautifulSoup(html)
Or, you can use the third-party "for humans"-style requests library:
import requests
html = requests.get("My URL").content
parsed_html = BeautifulSoup(html)
Also note that you should not be using BeautifulSoup version 3 at all - it is not maintained anymore. Replace:
try:
from BeautifulSoup import BeautifulSoup
except ImportError:
from bs4 import BeautifulSoup
with just:
from bs4 import BeautifulSoup
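One more piece of version hygiene: current bs4 warns if no parser is named, because BeautifulSoup(html) picks whichever parser happens to be installed and results can differ between them. Naming one explicitly keeps the behaviour stable:

```python
from bs4 import BeautifulSoup

html = "<p>hello</p>"

# 'html.parser' ships with Python itself; 'lxml' and 'html5lib' are
# faster or more lenient alternatives, but must be installed separately.
soup = BeautifulSoup(html, 'html.parser')
print(soup.p.text)
```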
BeautifulSoup accepts a string of HTML. You need to retrieve the HTML from the page using the URL.
Check out urllib for making HTTP requests. (Or requests for an even simpler way.) Retrieve the HTML and pass that to BeautifulSoup like so:
import urllib
from bs4 import BeautifulSoup
# Get the HTML
conn = urllib.urlopen("http://www.example.com")
html = conn.read()
# Give BeautifulSoup the HTML:
soup = BeautifulSoup(html)
From here, just parse as you attempted previously.
p = soup.find("div", attrs={'class':'pane-content'})
print(p)
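With a real page source in hand, the quoted text itself can then be pulled out. A sketch on a cleaned-up version of the question's snippet (the original HTML is malformed, so this assumes the <p> holds the text directly before a link):

```python
from bs4 import BeautifulSoup

# A simplified, well-formed stand-in for the question's snippet.
html = """
<div class="pane-content">
  <div class="field-items">
    <div class="field-item odd"><p>"My Text" <a target="_blank">www.injuv.cl</a></p></div>
  </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
pane = soup.find('div', attrs={'class': 'pane-content'})
# The quoted text is the first string inside the <p> tag, before the <a>.
text = pane.p.contents[0].strip().strip('"')
print(text)
```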
I have a specific piece of text I'm trying to get using BeautifulSoup and Python; however, I am not sure how to get it using soup.find().
I am trying to obtain "#1 in Beauty" only from the following:
<ul>
<li>...<li>
<li>...<li>
<li id="salesRank">
<b>Amazon Best Sellers Rank:</b>
"#1 in Beauty ("
See top 100
")
Can anyone help me with this?
find_all('#1 in Beauty') searches for a tag named "#1 in Beauty", which doesn't exist. The rank is a bare string inside the li with id "salesRank", sitting right after the <b> tag, so find that li and take the string that follows the <b>. Try below:
import urllib2
from bs4 import BeautifulSoup
url = 'your url here'
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content, "html.parser")
rank = soup.find('li', id='salesRank')
print rank.b.next_sibling.strip().rstrip(' (')
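A self-contained Python 3 sketch of that extraction, using an inline copy of the question's snippet in place of the live page:

```python
from bs4 import BeautifulSoup

# An inline stand-in for the question's salesRank markup.
html = """
<li id="salesRank">
  <b>Amazon Best Sellers Rank:</b>
  #1 in Beauty (<a>See top 100</a>)
</li>
"""

soup = BeautifulSoup(html, 'html.parser')
rank_li = soup.find('li', id='salesRank')
# The rank is the bare string that immediately follows the <b> tag;
# trim whitespace and the trailing "(" that opens the "See top 100" link.
rank = rank_li.b.next_sibling.strip().rstrip(' (')
print(rank)
```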