I am running into some trouble scraping a table using BeautifulSoup. Here is my code
from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page,"html.parser")
stats = soup.find('table', id = 'totals')
In [78]: print(stats)
None
When I right click on the table to inspect the element the HTML looks as I'd expect, however when I view the source the only element with id = 'totals' is commented out. Is there a way to scrape a table from the commented source code?
I have referenced this post but can't seem to replicate their solution.
Here is a link to the webpage I am interested in. I'd like to scrape the table labeled "Totals" and store it as a data frame.
I am relatively new to Python, HTML, and web scraping. Any help would be greatly appreciated.
Thanks in advance.
Michael
Comments are string instances in BeautifulSoup. You can use BeautifulSoup's find method with a regular expression to find the particular string that you're after. Once you have the string, have BeautifulSoup parse that and there you go.
In other words,
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page,"html.parser")
stats_html = soup.find(string=re.compile('id="totals"'))
stats_soup = BeautifulSoup(stats_html, "html.parser")
print(stats_soup.table.caption.text)
You can do this:
from urllib2 import *
from bs4 import BeautifulSoup
site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page,"lxml")
stats = soup.findAll('div', id = 'all_totals')
print stats
Please inform me if I helped!
Related
I am trying to get the 3rd and 6th tables from a webpage (https://www.pro-football-reference.com/years/2021/) but pandas.read_html and BeautifulSoup are both only finding the first two tables on the page. Here is what I've tried.
url = 'https://www.pro-football-reference.com/years/2021/'
data_pd = pd.read_html(url)
print(len(data_pd))
Output:
2
and also
url = 'https://www.pro-football-reference.com/years/2021/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for table in soup.find_all('table'):
print(table.get('class'))
Output:
['sortable', 'stats_table']
['sortable', 'stats_table']
I am guessing it has something to do with the way the webpage is formatted, but is there anything I can do to grab the tables that I need?
Yes you could use Selenium to let the page render then pull in the html. However I try to avoid Selenium if I could as to avoid the overhead.
The better option though is through the simple request, the static html does have the other tables in there, but within the comments. You could do a) BeautifulSoup does have the ability to pull out the Comments to then parse those tables. Or simply remove the comment tags and then parse.
import requests
import pandas as pd
url = 'https://www.pro-football-reference.com/years/2021/'
response = requests.get(url).text.replace("<!--","").replace("-->","")
data_pd = pd.read_html(response)
print(len(data_pd))
Output:
print(len(data_pd))
13
OR Using BEautifulSoup to co through the comments:
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
url = 'https://www.pro-football-reference.com/years/2021/'
result = requests.get(url).text
data = BeautifulSoup(result, 'html.parser')
comments = data.find_all(string=lambda text: isinstance(text, Comment))
data_pd = pd.read_html(url)
for each in comments:
if '<table' in str(each):
data_pd.append(pd.read_html(str(each))[0])
print(len(data_pd))
i am new to webscraping, i am scraping a website - https://www.valueresearchonline.com/funds/22/uti-mastershare-fund-regular-plan/
In this,i want to scrape this text - Regular Plan
But the thing is, when i do it using inspect element,
code -
import requests
from bs4 import BeautifulSoup
import csv
import sys
url = 'https://www.valueresearchonline.com/funds/newsnapshot.asp?schemecode=22'
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
regular_direct = soup.find('span',class_="filter-option pull-left").text
print(regular_direct)
i get none in printing, and i don't know why, the code in inspect element and view page source is also different, because in view page source, this span and class is not there.
why i am getting none?? can anyone please tell me and how can i get that text and why inspect element code and view page source code are different?
You need to change the selector because the html source that gets downloaded is different.
import requests
from bs4 import BeautifulSoup
import csv
import sys
url = 'https://www.valueresearchonline.com/funds/newsnapshot.asp?schemecode=22'
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
regular_direct = soup.find("select", {"id":"select-plan"}).find("option",{"selected":"selected"}).get_text(strip=True)
print(regular_direct)
Output:
Regular plan
I am using the lxml and requests modules, and just trying to parse the article from a website. I tried using find_all from BeautifulSoup but still came up empty
from lxml import html
import requests
page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
tree = html.fromstring(page.content)
article = tree.xpath('//div[#class="article"]/text()')
Once I print article, I get a list of ['\n','\n','\n','\n','\n'], rather than the body of the article. Where exactly am I going wrong?
I would use bs4 and the class name in css select_one
import requests
from bs4 import BeautifulSoup as bs
page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
soup = bs(page.content, 'lxml')
print(soup.select_one('.article').text)
If you use
article = tree.xpath('//div[#class="article"]//text()')
you get a list and still get all the \n but also the text which I think you can handle with re.sub or conditional logic.
https://en.wikipedia.org/wiki/Economy_of_the_European_Union
Above is the link to website and I want to scrape table: Fortune top 10 E.U. corporations by revenue (2016).
Please, share the code for the same:
import requests
from bs4 import BeautifulSoup
def web_crawler(url):
page = requests.get(url)
plain_text = page.text
soup = BeautifulSoup(plain_text,"html.parser")
tables = soup.findAll("tbody")[1]
print(tables)
soup = web_crawler("https://en.wikipedia.org/wiki/Economy_of_the_European_Union")
following what #FanMan said , this is simple code to help you get started, keep in mind that you will need to clean it and also perform the rest of the work on your own.
import requests
from bs4 import BeautifulSoup
url='https://en.wikipedia.org/wiki/Economy_of_the_European_Union'
r=requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
temp_datastore=list()
for text in soup.findAll('p'):
w=text.findAll(text=True)
if(len(w)>0):
temp_datastore.append(w)
Some documentation
beautiful soup:https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests: http://docs.python-requests.org/en/master/user/intro/
urllib: https://docs.python.org/2/library/urllib.html
You're first issue is that your url is not properly defined. After that you need to find the table to extract and it's class. In this case the class was "wikitable" and it was a the first table. I have started your code for you so it gives you the extracted data from the table. Web-scraping is good to learn but if your are just starting to program, practice with some simpler stuff first.
import requests
from bs4 import BeautifulSoup
def webcrawler():
url = "https://en.wikipedia.org/wiki/Economy_of_the_European_Union"
page = requests.get(url)
soup = BeautifulSoup(page.text,"html.parser")
tables = soup.findAll("table", class_='wikitable')[0]
print(tables)
webcrawler()
I am trying to find a table in a Wikipedia page using BeautifulSoup and for some reason I don't get the table.
Can anyone tell why I don't get the table?
my code:
import BeautifulSoup
import requests
url='https://en.wikipedia.org/wiki/List_of_National_Historic_Landmarks_in_Louisiana'
r=requests.get(url)
url=r.content
soup = BeautifulSoup(url,'html.parser')
tab=soup.find("table",{"class":"wikitable sortable jquery-tablesorter"})
print tab
prints: None
You shouldn't use jquery-tablesorter to select against in the response you get from requests because it is dynamically applied after the page loads. If you omit that, you should be good to go.
tab = soup.find("table",{"class":"wikitable sortable"})