I'm trying to use Python's BeautifulSoup to scrape data from the following website. The data on the website is split over four different pages. Each page has a unique link (i.e. http://insider.espn.com/nbadraft/results/top100/_/year/2019/set/0 for the first page, http://insider.espn.com/nbadraft/results/top100/_/year/2019/set/1 for the second page, etc.). I am able to successfully scrape the data on the first page, but when I try to scrape data for the second page onward it comes up empty. Here is the code I'm using:
# Import libraries
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
import pandas as pd
# Define url and request webpage
season = 2019
page = 1
url = "http://insider.espn.com/nbadraft/results/top100/_/year/{}/set/{}".format(season, page)
req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")
# Scrape all of the data in the table
rows = page_soup.findAll('tr')[1:]
player_stats = [[td.getText() for td in rows[i].findAll('td')]
                for i in range(len(rows))]
# Get the column headers
headers = player_stats[0]
# Remove the first row
player_stats.pop(0)
# Convert to pandas dataframe
df = pd.DataFrame(player_stats, columns = headers)
# Remove all rows where Name = None
df = df[~df['NAME'].isnull()]
# Remove PLAYER column because it's empty
df = df.drop(columns='PLAYER')
df
Any advice would be much appreciated! I'm a bit new to using BeautifulSoup, so I apologize in advance if the code isn't particularly nice or efficient.
Update: The links only work if opened in Chrome, which is likely what is causing the problem. Is there any way around it?
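A workaround I'm considering (untested sketch; the extra headers are a guess that the server rejects requests that don't look like a real browser) is to send a fuller set of browser-like headers and loop over all four set values in one go:
# Untested sketch: fetch all four pages with fuller browser-like headers
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

season = 2019
browser_headers = {
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
}
all_rows = []
for page in range(4):  # the four pages are set/0 through set/3
    url = "http://insider.espn.com/nbadraft/results/top100/_/year/{}/set/{}".format(season, page)
    req = Request(url, headers=browser_headers)
    page_soup = soup(urlopen(req).read(), "html.parser")
    all_rows.extend(page_soup.findAll('tr')[1:])  # skip the header row on each page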
Here is the link:
https://www.vit.org/WebReports/vesselschedule.aspx
I'm using BeautifulSoup, and my goal is to extract the table from it.
I wrote this code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url="https://www.vit.org/WebReports/vesselschedule.aspx"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
gdp_table = soup.find("table", attrs={"id": "ctl00_ContentPlaceHolder1_VesselScheduleControl1_Grid1"})
The final line of code returns None instead of the table.
I'm new to web scraping; can you help me find a way to get the table?
Why not pd.read_html(url)?
It will extract tables automatically
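For example (a minimal sketch; the table index is an assumption to verify against the list that comes back, and this assumes the server answers a plain request):
import pandas as pd

# read_html returns a list of DataFrames, one per <table> found in the page
tables = pd.read_html("https://www.vit.org/WebReports/vesselschedule.aspx")
print(len(tables))  # check how many tables were parsed
df = tables[6]      # assumed index of the schedule table; verify for your run
print(df.head())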
The problem is that the id you are using to look up this table is attached to the element dynamically via JS. Since the requests library only downloads the file at the URL, nothing is dynamically appended, and as a result your table has no id.
If you encounter a similar issue in the future (an element exists on the page but bs4 can't find it), try saving the response text to an HTML file and inspecting it in your browser.
For your particular case this code could be used:
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://www.vit.org/WebReports/vesselschedule.aspx")
# save the raw response so you can inspect what the server actually returned
with open("tmp.html", "w") as f:
    f.write(resp.text)
bs = BeautifulSoup(resp.text, "html.parser")  # specify a parser explicitly
table = bs.find_all("table")[6]  # not the best way to select elements
rows = table.find_all("tr")
Warning: try to avoid this style of relative selection; web pages are constantly updated, and such code may produce errors in the future.
I parsed the table, collected each row into a list, and appended those to a data list.
And here you go!
I added the full list in [Hashbin].
from bs4 import BeautifulSoup
import requests

url = "https://www.vit.org/WebReports/vesselschedule.aspx"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
table = soup.find_all('table')[6]  # index-based selection; not ideal, as darkKnight noted
rows = table.find_all('tr')
data = []
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append(cols)
print(data)
from bs4 import BeautifulSoup
import requests

res = requests.get("https://www.vit.org/WebReports/vesselschedule.aspx")
soup = BeautifulSoup(res.text, "html.parser")
Find the columns with the code below:
table = soup.find_all("table")[6]
columns = [col.get_text(strip=True) for col in table.find("tr", class_="HeadingRow").find_all("td")[1:-1]]
Find the row data with the code below:
main_lst = []
for row in table.find_all("tr", class_="Row"):
    lst = [i.get_text(strip=True) for i in row.find_all("td")[1:-1]]
    main_lst.append(lst)
Create the table using pandas:
import pandas as pd
df = pd.DataFrame(columns=columns, data=main_lst)
df
You need a way to specify a pattern that uniquely identifies the target table within the nested tabular structure. The following CSS pattern grabs that table based on a string it contains ("Shipline"), an attribute that is not present, and the table's relationship to other elements within the DOM.
You can then pass that specific table to read_html and do some cleaning on the returned DataFrame.
import requests
from bs4 import BeautifulSoup as bs
from pandas import read_html as rh

r = requests.get('https://www.vit.org/WebReports/vesselschedule.aspx').text
soup = bs(r, 'lxml')
# with earlier soupsieve versions, use :contains instead of :-soup-contains
df = rh(str(soup.select_one('table table:not([style]):-soup-contains("Shipline")')))[0]
df.dropna(how='all', axis=1, inplace=True)
df.columns = df.iloc[0, :]  # promote the first row to column headers
df = df.iloc[1:, :]
I am scraping the ranking data from Vivino.
The website is: link
But I got back an empty data list. Can anyone help me?
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.vivino.com/explore?e=eJzLLbI11jNVy83MszVXy02ssDU2UEuutPXzUUu2dQ0NUiuwNVRLT7MtSyzKTC1JzFHLT7ItSizJzEsvjk8sSy1KTE9Vy7dNSS1OVisviY4FKgZTRhDKGMozgdDmEMoEAJ7xJhY%3D"  # the url of the website we're scraping
page = requests.get(url)
html = BeautifulSoup(page.content, "html.parser")
Name_html = html.find_all(class_="item_description")
Rating_html = html.find_all(class_="vivinoRating__averageValue--3Navj")
Name_list = []
Rating_list = []
# loop over HTML elements, get text and add to list
for i in Name_html:
    Name_list.append(i.text)
for i in Rating_html:
    Rating_list.append(i.text)
# make a dictionary with column names for data to put in DataFrame
data = {"description": Name_list,
        "airdate": Rating_list}
df = pd.DataFrame(data)  # make a dataframe
The wine labels are loaded through JavaScript; you can't scrape them with bs4 as if they were part of static HTML.
You need to either manually send the POST requests to get the response content, or scrape the page dynamically with Selenium through any of the available browser drivers (e.g. chromedriver).
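A minimal sketch of the Selenium route (assuming chromedriver is available; the class name is copied from the question and may have changed since):
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

url = "https://www.vivino.com/explore?e=eJzLLbI11jNVy83MszVXy02ssDU2UEuutPXzUUu2dQ0NUiuwNVRLT7MtSyzKTC1JzFHLT7ItSizJzEsvjk8sSy1KTE9Vy7dNSS1OVisviY4FKgZTRhDKGMozgdDmEMoEAJ7xJhY%3D"
driver = webdriver.Chrome()
driver.get(url)
time.sleep(5)  # crude wait for the JS-rendered labels to appear
ratings = driver.find_elements(By.CLASS_NAME, "vivinoRating__averageValue--3Navj")
print([r.text for r in ratings])
driver.quit()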
I am having difficulty extracting text when web-scraping a table, and I think the filters on the page are to blame. I have tried isolating all the "tr" and "th" elements, but cannot seem to get the underlying text into Python. What am I doing wrong?
My code:
from bs4 import BeautifulSoup
import requests
page_link ='https://www.ersteliga.hu/stats#/players/1945/regular/points'
page = requests.get(page_link)
html = page.content
soup = BeautifulSoup(html, 'html.parser')
my_table = soup.find_all('table', class_= "IHD-TABLE")
columns = my_table.find('th')
I expect the output to contain the text values seen online, but what I get looks different from the HTML on the web page.
The data comes dynamically from another XHR request the page makes, which you can find in the network tab of your browser's developer tools. It returns JSON.
import requests
import pandas as pd

headers = {
    'referer': 'https://www.ersteliga.hu/stats',
    'user-agent': 'Mozilla/5.0'
}
data = {'championshipId': '1945', 'division': 'Alapszakasz', 'type': 'playerStatsChampionShipPeriod'}
r = requests.post('https://www.ersteliga.hu/ajax/CallWS', headers=headers, data=data).json()
df = pd.DataFrame(r['d'], columns=list(r['d'][0].keys()))  # r['d'] holds one dict per player
print(df)
Sort on the point column descending to get the same order as on the page:
print(df.sort_values(['point'], ascending=[False]))
I am trying to scrape data from https://www.wsj.com/market-data/bonds/treasuries.
There are two tables on this website, which get switched when you select one of these options:
1. Treasury Notes and Bonds
2. Treasury Bills
I want to scrape the data for Treasury Bills, but nothing in the link or the page attributes changes when I click that option. I have tried a lot of things, but every time I end up scraping the data for Treasury Notes and Bonds.
Can someone help me with that?
Following is my code:
import re
import csv
import requests
import pandas as pd
from bs4 import BeautifulSoup

mostActiveStocksUrl = "https://www.wsj.com/market-data/bonds/treasuries"
page = requests.get(mostActiveStocksUrl)
data = page.text
soup = BeautifulSoup(page.content, 'html.parser')
rows = soup.find_all('tr')
list_rows = []
clean = re.compile('<.*?>')  # pattern that matches HTML tags
for row in rows:
    cells = row.find_all('td')
    str_cells = str(cells)
    clean2 = re.sub(clean, '', str_cells)
    list_rows.append(clean2)
df = pd.DataFrame(list_rows)
df1 = df[0].str.split(',', expand=True)
All the data on the site is loaded once, and JS is then used to update the values in the table.
Here is some quickly written working code:
import requests
from bs4 import BeautifulSoup
import json

mostActiveStocksUrl = "https://www.wsj.com/market-data/bonds/treasuries"
page = requests.get(mostActiveStocksUrl)
data = page.text
soup = BeautifulSoup(page.content, 'html.parser')
rows = soup.find_all('script')  # we get all the script tags
importantJson = ''
for r in rows:
    text = r.text
    if 'NOTES_AND_BONDS' in text:  # the script tag containing the data; you can probably do this better
        importantJson = text
        break
# remove the non-json stuff
importantJson = importantJson\
    .replace('window.__STATE__ =', '')\
    .replace(';', '')\
    .strip()
# parse the json
jsn = json.loads(importantJson)
print(jsn)  # json object containing all the data you need
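Since the question asks for the Treasury Bills table specifically, a generic helper like this (a sketch; the exact structure of window.__STATE__ is not documented, so the search string is a guess) can locate where the bills data lives inside the parsed JSON:
def find_paths(obj, needle, path=""):
    # recursively yield the paths of keys that contain `needle`
    if isinstance(obj, dict):
        for k, v in obj.items():
            p = path + "/" + str(k)
            if needle.lower() in str(k).lower():
                yield p
            yield from find_paths(v, needle, p)
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from find_paths(v, needle, "{}[{}]".format(path, i))

for p in find_paths(jsn, "BILLS"):
    print(p)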
How did I get to this conclusion?
First I noticed that switching between the two tables makes no HTTP requests to the server, meaning the data is already there.
Then I inspected the table HTML and noticed that there is only one table, whose contents change dynamically, which led me to the conclusion that the data is already on the page.
Then a simple search in the page source found the script tag containing the JSON.
I'm trying to scrape the following website:
https://www.bandsintown.com/?came_from=257&sort_by_filter=Number+of+RSVPs
I'm able to successfully scrape the events listed on the page using beautifulsoup, using the following code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.bandsintown.com/?came_from=257&sort_by_filter=Number+of+RSVPs'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
dates = soup.find_all('div', {'class': 'event-b58f7990'})
month=[]
day=[]
for i in dates:
    md = i.find_all('div')
    month.append(md[0].text)
    day.append(md[1].text)
However, the issue I'm having is that I'm only able to scrape the first 18 events - the rest of the page is only available if the 'view all' button is clicked at the bottom. Is there a way in beautifulsoup, or otherwise, to simulate this button being clicked, so that I can scrape ALL of the data? I'd prefer to keep this in python as I'm doing most scraping with beautifulsoup. Thanks so much!
If you can work out the end point, or set an end point for the range in the following (with error handling for going too far), you can get a JSON response and parse out the info you require as follows. Depending on how many requests you are making, you may choose to re-use the connection with a Session.
import requests
import pandas as pd

url = 'https://www.bandsintown.com/upcomingEvents?came_from=257&sort_by_filter=Number+of+RSVPs&page={}&latitude=51.5167&longitude=0.0667'
results = []
for page in range(1, 20):
    data = requests.get(url.format(page)).json()
    for item in data['events']:
        results.append([item['artistName'], item['eventDate']['day'], item['eventDate']['month']])
df = pd.DataFrame(results)
print(df)
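If you end up making many requests, the Session re-use mentioned above might look like this (same endpoint and fields; the stop condition is an assumption about how the API signals running past the last page):
import requests
import pandas as pd

url = 'https://www.bandsintown.com/upcomingEvents?came_from=257&sort_by_filter=Number+of+RSVPs&page={}&latitude=51.5167&longitude=0.0667'
results = []
with requests.Session() as s:  # re-uses the underlying connection across requests
    for page in range(1, 20):
        data = s.get(url.format(page)).json()
        if not data.get('events'):  # assume an empty page means we've gone too far
            break
        for item in data['events']:
            results.append([item['artistName'], item['eventDate']['day'], item['eventDate']['month']])
df = pd.DataFrame(results)
print(df)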