On the website https://www.shanghairanking.com/rankings/arwu/2020
the URL doesn't change when I hit "next". Any ideas on how to scrape the tables on the following pages? Using bs4 in Python, I can only scrape the table on the first page.
What I did so far:
from bs4 import BeautifulSoup
import requests
import pandas as pd
html_text = requests.get('https://www.shanghairanking.com/rankings/arwu/2020').text
soup = BeautifulSoup(html_text,'lxml')
data = soup.find('table', class_= "rk-table").text.replace(' ','')
print(data)
I am struggling to scrape data from this website:
https://wix-visual-data.appspot.com/app/widget?pageId=cu7nt&compId=comp-kesofw00&viewerCompId=comp-kesofw00&siteRevision=947&viewMode=site&deviceType=desktop&locale=en&tz=Europe%2FLondon&width=980&height=890&instance=k983l1LiiUeOz5_3Pd_CLXbjfadc08q1fEu54xfh9aA.eyJpbnN0YW5jZUlkIjoiYjQ0MWIxMGUtNTRmNy00YzdhLTgwY2QtNmU0ZjkwYzljMzA3IiwiYXBwRGVmSWQiOiIxMzQxMzlmMy1mMmEwLTJjMmMtNjkzYy1lZDIyMTY1Y2ZkODQiLCJtZXRhU2l0ZUlkIjoiM2M3ZmE5OWItY2I3Yy00MTg0LTk1OTEtNWY0MDhmYWYwZmRhIiwic2lnbkRhdGUiOiIyMDIxLTAxLTMwVDAxOjIzOjAyLjU1MVoiLCJ1aWQiOiIzYWMyNDI3YS04NGVhLTQ0ZGUtYjYxMS02MTNiZTVlOWJiZGQiLCJkZW1vTW9kZSI6ZmFsc2UsImFpZCI6IjczYWE3ZWNjLTQyODUtNDY2My1iNjMxLTMzMjE0MWJiZDhhMiIsImJpVG9rZW4iOiI4ODNlMTg5NS05ZjhiLTBkZmUtMTU1Yy0zMTBmMWY2NmNjZGQiLCJzaXRlT3duZXJJZCI6ImVhYWU1MDEzLTMxZjgtNDQzNC04MDFhLTE3NDQ2N2EwZjE5YSIsImV4cGlyYXRpb25EYXRlIjoiMjAyMS0wMS0zMFQwNToyMzowMi41NTFaIiwiaGFzVXNlclJvbGUiOmZhbHNlfQ&currency=GBP&currentCurrency=GBP&vsi=795183b4-8f30-4854-bd85-77678dbe4cf8&consent-policy=%7B%22func%22%3A0%2C%22anl%22%3A0%2C%22adv%22%3A0%2C%22dt3%22%3A1%2C%22ess%22%3A1%7D&commonConfig=%7B%22brand%22%3A%22wix%22%2C%22bsi%22%3Anull%2C%22BSI%22%3Anull%7D
The page has a table, but for some reason I am not able to scrape it into an Excel file. This is my current Python code. Any help is much appreciated, thank you legends!
import urllib
import urllib.request
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://wix-visual-data.appspot.com/app/widget?pageId=cu7nt&compId=comp-kesofw00&viewerCompId=comp-kesofw00&siteRevision=947&viewMode=site&deviceType=desktop&locale=en&tz=Europe%2FLondon&width=980&height=890&instance=dxGyx3zK9ULK0A8UtGOrLw-__FTD9EBEfzQojJ7Bz00.eyJpbnN0YW5jZUlkIjoiYjQ0MWIxMGUtNTRmNy00YzdhLTgwY2QtNmU0ZjkwYzljMzA3IiwiYXBwRGVmSWQiOiIxMzQxMzlmMy1mMmEwLTJjMmMtNjkzYy1lZDIyMTY1Y2ZkODQiLCJtZXRhU2l0ZUlkIjoiM2M3ZmE5OWItY2I3Yy00MTg0LTk1OTEtNWY0MDhmYWYwZmRhIiwic2lnbkRhdGUiOiIyMDIxLTAxLTI5VDE4OjM0OjQwLjgwM1oiLCJ1aWQiOiIzYWMyNDI3YS04NGVhLTQ0ZGUtYjYxMS02MTNiZTVlOWJiZGQiLCJkZW1vTW9kZSI6ZmFsc2UsImFpZCI6IjczYWE3ZWNjLTQyODUtNDY2My1iNjMxLTMzMjE0MWJiZDhhMiIsImJpVG9rZW4iOiI4ODNlMTg5NS05ZjhiLTBkZmUtMTU1Yy0zMTBmMWY2NmNjZGQiLCJzaXRlT3duZXJJZCI6ImVhYWU1MDEzLTMxZjgtNDQzNC04MDFhLTE3NDQ2N2EwZjE5YSIsImV4cGlyYXRpb25EYXRlIjoiMjAyMS0wMS0yOVQyMjozNDo0MC44MDNaIiwiaGFzVXNlclJvbGUiOmZhbHNlfQ&currency=GBP&currentCurrency=GBP&vsi=57130cda-8191-488e-8089-f472928266e3&consent-policy=%7B%22func%22%3A0%2C%22anl%22%3A0%2C%22adv%22%3A0%2C%22dt3%22%3A1%2C%22ess%22%3A1%7D&commonConfig=%7B%22brand%22%3A%22wix%22%2C%22bsi%22%3Anull%2C%22BSI%22%3Anull%7D"
table_id = "theTable"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
table = soup.find('table', attrs={"id": table_id})  # use the table_id variable defined above
df = pd.read_html(str(table))[0]  # read_html returns a list of DataFrames
The page loads its data with JavaScript, so the table is not in the HTML that requests receives. You can find the URL the data comes from in the Network tab of Firefox's developer tools. Even better, the data is in CSV format, so you don't even need an HTML parser.
You can find the CSV here.
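Once the Network tab has revealed the CSV request, the response body can be parsed with the standard library alone. The following is a minimal sketch, assuming the endpoint returns comma-separated text with a header row. The sample text and its column names are made up for illustration; in practice `csv_text` would be `requests.get(csv_url).text` for whatever URL you found.

```python
import csv
import io


def parse_csv_text(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


# Hypothetical sample standing in for the real response body.
sample = "name,price\nWidget,9.99\nGadget,24.50\n"
rows = parse_csv_text(sample)
print(rows)
```

If pandas is already installed, `pd.read_csv(io.StringIO(csv_text))` gives a DataFrame that can then be written out with `to_excel`.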
I am trying to extract one table from the web page (https://en.wikipedia.org/wiki/2018%E2%80%9319_Premier_League) by using selenium and BeautifulSoup.
But I am stuck on parsing the table.
I want just one table from the page, the "League table", but whatever I try, I get error messages.
Here is the code I have tried.
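import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

# Start a browser session (assumes chromedriver is available on PATH)
driver = webdriver.Chrome()
driver.get("https://google.com")
# XPath attribute tests use @id, not #id
elem = driver.find_element(By.XPATH, '//*[@id="tsf"]/div[2]/div[1]/div[1]/div/div[2]/input')
elem.send_keys("2018 epl")
elem.submit()
try:
    print(driver.title)
    driver.find_element(By.PARTIAL_LINK_TEXT, "Wikipedia").click()
    # Download the page the browser landed on and parse it
    website = requests.get(driver.current_url).text
    soup = BeautifulSoup(website, 'html.parser')
finally:
    driver.quit()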
And this is where I run into trouble.
I've tried several approaches; one of them is below.
rows=soup.find_all('td')
So can you help me to complete my code?
Thank you a lot.
You could just use pandas read_html and pick the table by its index. I will, however, show the :has selector (bs4 4.7.1+) to select the h2 that contains the element with id League_table, followed by the adjacent sibling combinator (+) to get the table immediately after it.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
r = requests.get('https://en.wikipedia.org/wiki/2018%E2%80%9319_Premier_League')
soup = bs(r.content, 'lxml')
table = pd.read_html(str(soup.select_one('h2:has(#League_table) + table')))
print(table)
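To see what the selector matches without hitting the network, it can be tried on a small in-memory document. This is a sketch under the assumption that the heading wraps the element carrying the id and that the table is the very next sibling element; the HTML below is a made-up miniature, not the real Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Made-up miniature of the page: each heading wraps an element with an id,
# and the table we want directly follows the "League_table" heading.
html = """
<h2><span id="Results">Results</span></h2>
<table><tr><td>wrong table</td></tr></table>
<h2><span id="League_table">League table</span></h2>
<table><tr><th>Pos</th><th>Team</th></tr><tr><td>1</td><td>Team A</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
# h2:has(#League_table) -> the h2 containing the id; "+ table" -> its next sibling
table = soup.select_one("h2:has(#League_table) + table")
print(table.get_text(" ", strip=True))
```

If the real page places any other element between the heading and the table, `+` will not match and `select_one` returns None, so it is worth checking the result before passing it to read_html.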
Or just use read_html and pick the table by index:
import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/2018%E2%80%9319_Premier_League')
print(tables[4])
Maybe that will help you start:
import requests
from bs4 import BeautifulSoup
respond = requests.get('https://en.wikipedia.org/wiki/2018%E2%80%9319_Premier_League')
soup = BeautifulSoup(respond.text, 'lxml')
table = soup.find_all('table', {'class': 'wikitable'})
I got the table by adding the line below after your code.
soup.body.find_all("table", class_="wikitable")[3]
I found the table by trial and error: first look at the class of the table, then use find_all, then inspect the individual items and verify the output.
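The trial-and-error step can be made less manual by listing each candidate table's index next to its header cells. A rough sketch follows; `summarize_tables` is a hypothetical helper name, and the sample HTML is made up so the snippet runs offline. On the real page you would pass `respond.text` instead.

```python
from bs4 import BeautifulSoup


def summarize_tables(html, css_class="wikitable"):
    """Return (index, header texts) for each table with the given class."""
    soup = BeautifulSoup(html, "html.parser")
    summary = []
    for i, table in enumerate(soup.find_all("table", class_=css_class)):
        headers = [th.get_text(strip=True) for th in table.find_all("th")]
        summary.append((i, headers))
    return summary


# Made-up sample with two matching tables and one that is ignored.
sample = """
<table class="wikitable"><tr><th>Team</th><th>Stadium</th></tr></table>
<table class="other"><tr><th>Ignored</th></tr></table>
<table class="wikitable"><tr><th>Pos</th><th>Pts</th></tr></table>
"""
for index, headers in summarize_tables(sample):
    print(index, headers)
```

The printed index is the one to use with `find_all(...)[index]`. Note that it can shift whenever the page is edited.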
I am trying to scrape box-score data from Pro Football Reference. After running into issues with JavaScript, I turned to selenium to get the initial soup object. I'm trying to find a specific table on the page and then iterate through its rows.
The code works if I simply use find_all('table')[#], but the # changes depending on which box score I am looking at, so it isn't reliable. I therefore want to use the id='player_offense' attribute to identify the same table across games, but when I use it, it returns nothing. What am I missing here?
from selenium import webdriver
import os
from bs4 import BeautifulSoup
# path to chromedriver
chrome_path = os.path.expanduser('~/Documents/chromedriver.exe')
driver = webdriver.Chrome(chrome_path)
driver.get('https://www.pro-football-reference.com/boxscores/201709070nwe.htm')
soup = BeautifulSoup(driver.page_source,'lxml')
driver.quit()
#doesn't work
soup.find('table',id='player_offense')
#works
table = soup.find_all('table')[3]
The data is inside HTML comments. Find the appropriate comment and then extract the table from it:
import requests
from bs4 import BeautifulSoup as bs
from bs4 import Comment
import pandas as pd
r= requests.get('https://www.pro-football-reference.com/boxscores/201709070nwe.htm#')
soup = bs(r.content, "lxml")
comments = soup.find_all(string=lambda text:isinstance(text,Comment))
for comment in comments:
    if 'id="player_offense"' in comment:
        print(pd.read_html(comment)[0])
        break
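Since the question also asks to iterate through the table's rows, the same comment lookup can return a bs4 element instead of a DataFrame. A minimal sketch: `table_from_comments` is a hypothetical helper, and the sample HTML is a made-up stand-in for the real page.

```python
from bs4 import BeautifulSoup, Comment


def table_from_comments(html, table_id):
    """Return the <table> with this id, looking inside HTML comments if needed."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is not None:
        return table
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        if table_id in comment:
            # Re-parse the comment body as HTML and search it.
            inner = BeautifulSoup(str(comment), "html.parser")
            table = inner.find("table", id=table_id)
            if table is not None:
                return table
    return None


# Made-up sample mimicking a table hidden inside a comment.
sample = """
<div id="all_player_offense">
<!--
<table id="player_offense">
  <tr><th>Player</th><th>Yds</th></tr>
  <tr><td>Player A</td><td>100</td></tr>
</table>
-->
</div>
"""
table = table_from_comments(sample, "player_offense")
for tr in table.find_all("tr"):
    print([cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])])
```

Because the lookup is by id and not by position, it should not depend on how many other tables a given box score contains.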
This also works.
from requests_html import HTMLSession, HTML
import pandas as pd
with HTMLSession() as s:
    r = s.get('https://www.pro-football-reference.com/boxscores/201709070nwe.htm')
    r = HTML(html=r.text)
    r.render()
table = r.find('table#player_offense', first=True)
df = pd.read_html(table.html)
print(df)
I'm scraping a table from a page.
But the table's caption is 'blind'.
Is there no way to extract the table from the site?
Using BeautifulSoup like:
from urllib.request import urlopen
from bs4 import BeautifulSoup
Take a look at this:
import bs4 as bs
import urllib.request
link = 'http://companyinfo.stock.naver.com/v1/company/c1010001.aspx?cn=&cmp_cd=005930&menuType=block'
source = urllib.request.urlopen(link)
soup = bs.BeautifulSoup(source, 'html.parser')
table = soup.find('table', attrs={'id' : 'cTB24'})
for tr in table.find_all('tr'):
    for td in tr.find_all('td'):
        print(td.text)
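Printing each cell on its own line loses the row structure. The sketch below collects one list per row instead. It assumes that a "blind" caption means a caption hidden by CSS, which would still be present in the HTML and would not stop the table from being found by id. The sample markup is invented and `table_to_rows` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def table_to_rows(table):
    """Collect the text of each <th>/<td> cell, one list per <tr>."""
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:  # skip rows that contain no cells
            rows.append(cells)
    return rows


# Made-up sample: a table whose caption carries a "blind" class.
sample = """
<table id="cTB24">
  <caption class="blind">hidden caption</caption>
  <tr><th>Item</th><th>Value</th></tr>
  <tr><td>A</td><td>1</td></tr>
</table>
"""
soup = BeautifulSoup(sample, "html.parser")
rows = table_to_rows(soup.find("table", id="cTB24"))
print(rows)
```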
So I'm trying to scrape the miscellaneous stats table from http://www.basketball-reference.com/leagues/NBA_2016.html using Python and Beautiful Soup. This is the basic code so far. I just want to see whether it is even reading the table, but when I print the table I just get None.
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "http://www.basketball-reference.com/leagues/NBA_2016.html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
table = soup.find('table', id='misc_stats')
print(table)
When I inspect the HTML on the page itself, the table I want is preceded by <!-- and that portion of the HTML is shown in green. What can I do?
<!-- marks the start of an HTML comment and --> marks the end, so just remove the comment markers before you parse:
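import re
from bs4 import BeautifulSoup
import requests
# Matches the opening and closing markers of HTML comments
comm = re.compile("<!--|-->")
# Use .text (str) rather than .content (bytes) so the str pattern can be applied
html = requests.get("http://www.basketball-reference.com/leagues/NBA_2016.html").text
cleaned_soup = BeautifulSoup(comm.sub("", html), "html.parser")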
tableStats = cleaned_soup.find('table', {'id':'team_stats'})
print(tableStats)
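The effect of stripping the comment markers can be checked offline on a small sample. This is only a sketch: the HTML is made up, and the id `misc_stats` is taken from the question rather than verified against the live page.

```python
import re
from bs4 import BeautifulSoup

COMMENT_MARKERS = re.compile(r"<!--|-->")

# Made-up sample: the table only exists inside an HTML comment.
sample = """
<div id="all_misc_stats">
<!--
<table id="misc_stats"><tr><th>Team</th><th>Pace</th></tr></table>
-->
</div>
"""

before = BeautifulSoup(sample, "html.parser").find("table", id="misc_stats")
stripped = COMMENT_MARKERS.sub("", sample)
after = BeautifulSoup(stripped, "html.parser").find("table", id="misc_stats")

print(before)              # None: the parser treats the table as comment text
print(after is not None)   # True once the markers are removed
```

One caveat with this approach: it un-comments everything on the page, not just the table, so any markup the site had deliberately commented out becomes part of the parsed tree as well.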