I am trying to find a table on a Wikipedia page using BeautifulSoup, but for some reason I don't get the table. Can anyone tell me why?
My code:
from bs4 import BeautifulSoup
import requests
url='https://en.wikipedia.org/wiki/List_of_National_Historic_Landmarks_in_Louisiana'
r=requests.get(url)
url=r.content
soup = BeautifulSoup(url,'html.parser')
tab=soup.find("table",{"class":"wikitable sortable jquery-tablesorter"})
print(tab)
prints: None
You shouldn't include jquery-tablesorter in the class you select against, because that class is added by JavaScript after the page loads, so it isn't present in the response you get from requests. If you omit it, you should be good to go.
tab = soup.find("table",{"class":"wikitable sortable"})
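For reference, a minimal sketch of the corrected snippet (same URL as in the question); printing whether the table was found is just a quick sanity check:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_National_Historic_Landmarks_in_Louisiana'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

# "jquery-tablesorter" is applied client-side, so match only the classes
# present in the raw HTML
tab = soup.find("table", {"class": "wikitable sortable"})
print(tab is not None)  # should print True if the table was found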
I am having trouble getting the hyperlinks for the tennis matches listed on a webpage. How do I fix the code below so that it prints those links?
import requests
from bs4 import BeautifulSoup
response = requests.get("https://www.betexplorer.com/results/tennis/?year=2022&month=11&day=02")
webpage = response.content
soup = BeautifulSoup(webpage, "html.parser")
print(soup.findAll('a href'))
In newer code, avoid the old syntax findAll(); instead use find_all() or select() with CSS selectors. For more, take a minute to check the docs.
Select your elements more specifically and use a set comprehension to avoid duplicates:
{'https://www.betexplorer.com' + a.get('href') for a in soup.select('a[href^="/tennis"]:has(strong)')}
Example
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.betexplorer.com/results/tennis/?year=2022&month=11&day=02')
soup = BeautifulSoup(r.text, 'html.parser')
print({'https://www.betexplorer.com' + a.get('href') for a in soup.select('a[href^="/tennis"]:has(strong)')})
Output
{'https://www.betexplorer.com/tennis/itf-men-singles/m15-new-delhi-2/sinha-nitin-kumar-vardhan-vishnu/tOasQaJm/',
'https://www.betexplorer.com/tennis/itf-women-doubles/w25-jerusalem/mushika-mao-mushika-mio-cohen-sapir-nagornaia-sofiia/xbNOHTEH/',
'https://www.betexplorer.com/tennis/itf-men-singles/m25-jakarta-2/barki-nathan-anthony-sun-fajing/zy2r8bp0/',
'https://www.betexplorer.com/tennis/itf-women-singles/w15-solarino/margherita-marcon-abbagnato-anastasia/lpq2YX4d/',
'https://www.betexplorer.com/tennis/itf-women-singles/w60-sydney/lee-ya-hsuan-namigata-junri/CEQrNPIG/',
'https://www.betexplorer.com/tennis/itf-men-doubles/m15-sharm-elsheikh-16/echeverria-john-marrero-curbelo-ivan-ianin-nikita-jasper-lai/nsGbyqiT/',...}
Change the last line to
print([a['href'] for a in soup.find_all('a', href=True)])
See a full tutorial here: https://pythonprogramminglanguage.com/get-links-from-webpage/
I'm trying to read a news webpage to get the titles of their stories. I'm attempting to put them in a list, but I keep getting an empty list. Can someone please point me in the right direction here? What am I missing? Please see the code below. Thanks.
import requests
from bs4 import BeautifulSoup
url = 'https://nypost.com/'
ttl_lst = []
soup = BeautifulSoup(requests.get(url).text, "lxml")
title = soup.findAll('h2', {'class': 'story-heading'})
for row in title:
    ttl_lst.append(row.text)
print(ttl_lst)
The requests module only returns the first HTML document the server sends back. Sites like nypost load their articles afterwards via AJAX requests, so the headings aren't in that initial response. You will have to use something like Selenium for this, which runs the JavaScript and lets those AJAX requests complete after the page loads.
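If you go the Selenium route, a rough sketch might look like the following; the 'story-heading' selector is taken from the question and may need adjusting if the site has changed, and the driver setup assumes Chrome:
import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes a Chrome driver is available on PATH
driver.get('https://nypost.com/')
time.sleep(3)  # crude wait for the AJAX content; WebDriverWait would be more robust

# hand the rendered HTML to BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

ttl_lst = [h.get_text(strip=True) for h in soup.find_all('h2', {'class': 'story-heading'})]
print(ttl_lst)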
I am trying to extract one table from the web page (https://en.wikipedia.org/wiki/2018%E2%80%9319_Premier_League) using Selenium and BeautifulSoup.
But I am stuck on parsing the table.
I want just one table from the page, the "League table", but whatever I've tried, I get error messages.
Here is the code I've tried.
import selenium
from bs4 import BeautifulSoup
from selenium import webdriver
import requests
import time

driver = webdriver.Chrome()  # assuming a Chrome driver is on PATH
driver.get("https://google.com")
elem = driver.find_element_by_xpath('//*[@id="tsf"]/div[2]/div[1]/div[1]/div/div[2]/input')
elem.send_keys("2018 epl")
elem.submit()
try:
    print(driver.title)
    driver.find_element_by_partial_link_text("Wikipedia").click()
    website = requests.get(driver.current_url).text
    soup = BeautifulSoup(website, 'html.parser')
And then I'm facing trouble.
I've tried several snippets; one of them is below.
rows = soup.find_all('td')
So can you help me complete my code?
Thank you a lot.
You could just use pandas read_html and extract the table via the appropriate index. I will, however, show how to use the :has selector (bs4 4.7.1+) to make sure you select the h2 that contains the id League_table, then the immediate sibling combinator to get the adjacent table.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
r = requests.get('https://en.wikipedia.org/wiki/2018%E2%80%9319_Premier_League')
soup = bs(r.content, 'lxml')
table = pd.read_html(str(soup.select_one('h2:has(#League_table) + table')))[0]
print(table)
Just read_html
import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/2018%E2%80%9319_Premier_League')
print(tables[4])
Maybe this will help you get started:
import requests
from bs4 import BeautifulSoup
respond = requests.get('https://en.wikipedia.org/wiki/2018%E2%80%9319_Premier_League')
soup = BeautifulSoup(respond.text, 'lxml')
table = soup.find_all('table', {'class': 'wikitable'})
I got the table by adding this code below yours:
soup.body.find_all("table", class_="wikitable")[3]
I found the table by trial and error: first look at the table's class, then use find_all, then list the individual items and verify the output.
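A quick sketch of that verification step; printing each wikitable's caption (or a bit of its text when there is no caption) makes it easy to spot the index you need:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://en.wikipedia.org/wiki/2018%E2%80%9319_Premier_League')
soup = BeautifulSoup(r.text, 'html.parser')

# print an identifying snippet of every wikitable so the right index is obvious
for i, t in enumerate(soup.find_all('table', class_='wikitable')):
    caption = t.find('caption')
    print(i, caption.get_text(strip=True) if caption else t.get_text(strip=True)[:60])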
https://en.wikipedia.org/wiki/Economy_of_the_European_Union
Above is the link to the website, and I want to scrape the table "Fortune top 10 E.U. corporations by revenue (2016)".
Please share code for this; here is my attempt:
import requests
from bs4 import BeautifulSoup

def web_crawler(url):
    page = requests.get(url)
    plain_text = page.text
    soup = BeautifulSoup(plain_text, "html.parser")
    tables = soup.findAll("tbody")[1]
    print(tables)

soup = web_crawler("https://en.wikipedia.org/wiki/Economy_of_the_European_Union")
Following what @FanMan said, this is simple code to help you get started; keep in mind that you will need to clean it up and do the rest of the work on your own.
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Economy_of_the_European_Union'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

temp_datastore = list()
for text in soup.findAll('p'):
    w = text.findAll(text=True)
    if len(w) > 0:
        temp_datastore.append(w)
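The snippet above collects paragraph text rather than the table the question asks about. A hedged sketch for targeting the "Fortune top 10" table by its caption (this assumes the table has a caption containing the word "Fortune", which may have changed):
import requests
from bs4 import BeautifulSoup

r = requests.get('https://en.wikipedia.org/wiki/Economy_of_the_European_Union')
soup = BeautifulSoup(r.text, 'html.parser')

# look for the wikitable whose caption mentions "Fortune" (assumption about the caption text)
for table in soup.find_all('table', class_='wikitable'):
    caption = table.find('caption')
    if caption and 'Fortune' in caption.get_text():
        print(table)
        break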
Some documentation:
Beautiful Soup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests: http://docs.python-requests.org/en/master/user/intro/
urllib: https://docs.python.org/2/library/urllib.html
Your first issue is that your url is not properly defined. After that, you need to find the table to extract and its class. In this case the class was "wikitable" and it was the first table. I have started your code for you so it gives you the extracted data from the table. Web scraping is good to learn, but if you are just starting to program, practice with some simpler stuff first.
import requests
from bs4 import BeautifulSoup

def webcrawler():
    url = "https://en.wikipedia.org/wiki/Economy_of_the_European_Union"
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    tables = soup.findAll("table", class_='wikitable')[0]
    print(tables)

webcrawler()
I am running into some trouble scraping a table using BeautifulSoup. Here is my code
from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page,"html.parser")
stats = soup.find('table', id = 'totals')
print(stats)  # prints None
When I right-click the table to inspect the element, the HTML looks as I'd expect; however, when I view the page source, the only element with id='totals' is commented out. Is there a way to scrape a table from the commented-out source code?
I have referenced this post but can't seem to replicate their solution.
Here is a link to the webpage I am interested in. I'd like to scrape the table labeled "Totals" and store it as a data frame.
I am relatively new to Python, HTML, and web scraping. Any help would be greatly appreciated.
Thanks in advance.
Michael
In BeautifulSoup, comments are a type of string (Comment objects). You can use BeautifulSoup's find method with a regular expression to find the particular string you're after. Once you have the string, have BeautifulSoup parse it, and there you go.
In other words,
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page,"html.parser")
# grab the comment string whose markup contains id="totals"
stats_html = soup.find(string=re.compile('id="totals"'))
# parse the comment's contents as HTML in their own right
stats_soup = BeautifulSoup(stats_html, "html.parser")
print(stats_soup.table.caption.text)
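If you want the result as a data frame, as the question asks, one option is to hand that extracted comment string straight to pandas.read_html. This is only a sketch, assuming pandas is installed and the comment contains exactly one table:
import re
import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
soup = BeautifulSoup(urlopen(site), "html.parser")

# the commented-out markup is just a string, so pandas can parse it directly
totals_html = soup.find(string=re.compile('id="totals"'))
df = pd.read_html(str(totals_html))[0]
print(df.head())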
You can do this:
from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page, "lxml")
stats = soup.findAll('div', id='all_totals')
print(stats)
Let me know if this helped!