I am mostly new to this. I have a go-to formula for scraping sports data, but I am unable to get it to work with Yahoo. I am able to get my tokens, but scraping with BeautifulSoup then returns nothing. Here is the working part of my script:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

# URL to be scraped
url = "https://basketball.fantasysports.yahoo.com/nba/75810/1"
# this is the HTML from the given URL
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")
# use findAll() to get the column headers
headers = [th.getText() for th in soup.findAll('tr')[0].findAll('th')]
# avoid the first header row
rows = soup.findAll('tr')[1:]
# use getText() to extract the text we need into a list
team_stats_advanced = [[td.getText() for td in rows[i].findAll('td')]
                       for i in range(len(rows))]
# create data frames
advancedstats = pd.DataFrame(team_stats_advanced[1:], columns=headers)
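A quick diagnostic sketch (not part of the original script) to see what urlopen actually returns for this page; if the row count prints as 0, the table rows are not present in the fetched HTML at all:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://basketball.fantasysports.yahoo.com/nba/75810/1"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
print(len(soup.findAll('tr')))  # number of table rows BeautifulSoup can see
print(html[:500])               # first part of the raw response that was actually served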
Related
I would like to scrape all data-oid tags from this page, but nothing is returned in the output.
Code
import requests
from bs4 import BeautifulSoup

url = 'https://www.betexplorer.com/soccer/south-korea/k-league-2/bucheon-fc-1995-jeonnam/EDwej14E/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table', class_='table-main')
for rows in table.find_all('tr')[1:]:
    for row in rows.find_all('td'):
        data = row.get('data-oid')
        print(data)
The table part of the page is loaded from an external URL via JavaScript. To get the data along with the tags that carry data-oid= attributes, you can use this example:
import requests
from bs4 import BeautifulSoup
url = "https://www.betexplorer.com/soccer/south-korea/k-league-2/bucheon-fc-1995-jeonnam/EDwej14E/"
match_id = "EDwej14E" # <-- this is the last part of URL
api_url = "https://www.betexplorer.com/match-odds/{}/1/1x2/".format(match_id)
headers = {"Referer": "https://www.betexplorer.com"}
data = requests.get(api_url, headers=headers).json()
soup = BeautifulSoup(data["odds"], "html.parser")
# your code:
table = soup.find("table", class_="table-main")
for rows in table.find_all("tr")[1:]:
    for row in rows.select("td[data-oid]"):
        data = row["data-oid"]
        print(data)
Prints:
...
4kqjpxv464x0xc6aif
4kqjpxv464x0xc6aie
4kqjpxv498x0x0
4kqjpxv464x0xc6aif
4kqjpxv464x0xc6aie
4kqjpxv498x0x0
4kqjpxv464x0xc6aif
I want to retrieve a financial dataset from a website which has a login. I've managed to log in using requests and access the HTML.
import requests
from bs4 import BeautifulSoup
import pandas as pd

s = requests.session()
login_data = dict(email='my login', password='password')
s.post('*portal website with /login*', data=login_data)
r = s.get('*website with financial page*')
print(r.content)

# work on r as it's a direct link
soup = BeautifulSoup(r.text, "html.parser")  # returns the HTML of the finance page
The above code allows me to log in and get the html from the correct page.
headers = []
# locate the table first (the exact find() arguments depend on the page's markup)
table = soup.find('table')
# find all the column headers
for i in table.find_all('th'):
    title = i.text.strip()
    headers.append(title)
df = pd.DataFrame(columns=headers)
print(df)
This block finds the table and gets the column headers, which are printed as:
Columns: [Date, Type, Type, Credit, Debit, Outstanding, Case File, ]
The next part is the problem. When I attempt to retrieve the financials using the following code:
for row in table.find_all('tr')[1:]:
    data = row.find_all('td')
    row_data = [td.text.strip() for td in data]
    print(row_data)
it returns this
['"Loading Please Wait..."']
The HTML of the site looks like this: [screenshot of the HTML of the site I want to scrape]
I am trying to scrape data from stathead.com, basketball-reference.com's new subscription service. When using the normal approach I would've used on BR, it won't scrape the first 10 rows or rows 21-100, only rows 11-20. Any thoughts? For example, stats below only returns a subset of the full data.
url = "https://stathead.com/basketball/lineup_finder.cgi?request=1&match=single&order_by_asc=0&order_by=diff_pts&lineup_type=2-man&output=per_poss&is_playoffs=N&year_id=2015&ccomp%5B1%5D=gt&cval%5B1%5D=100&cstat%5B1%5D=mp&game_month=0&game_num_min=0&game_num_max=99"
html = urlopen(url)
soup = BeautifulSoup(html)
rows = soup.findAll('tr')[1:]
headers = [th.getText() for th in soup.findAll('tr', limit=2)[1].findAll('th')][1:]
player_stats = [[td.getText() for td in rows[i].findAll('td')]
for i in range(len(rows))]
stats = pd.DataFrame(player_stats, columns = headers)
You can try the code below and later filter out the required data.
import pandas as pd
url = 'https://stathead.com/basketball/lineup_finder.cgi?request=1&match=single&order_by_asc=0&order_by=diff_pts&lineup_type=2-man&output=per_poss&is_playoffs=N&year_id=2015&ccomp%5B1%5D=gt&cval%5B1%5D=100&cstat%5B1%5D=mp&game_month=0&game_num_min=0&game_num_max=99'
df = pd.read_html(url)
print(df)
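pd.read_html returns a list of DataFrames, one per <table> it finds, so the filtering step is mostly picking the right table and dropping the header rows that the site repeats inside the table body. A minimal sketch, assuming the lineup table is the first table returned and that the repeated header rows carry the literal 'Rk' label in the first column (both are assumptions to verify against the actual page):
import pandas as pd

url = 'https://stathead.com/basketball/lineup_finder.cgi?request=1&match=single&order_by_asc=0&order_by=diff_pts&lineup_type=2-man&output=per_poss&is_playoffs=N&year_id=2015&ccomp%5B1%5D=gt&cval%5B1%5D=100&cstat%5B1%5D=mp&game_month=0&game_num_min=0&game_num_max=99'
tables = pd.read_html(url)   # list of DataFrames, one per <table> on the page
df = tables[0]               # assumption: the lineup table is the first one found
# drop in-body repeats of the header row (assumption: they show 'Rk' in the first column)
df = df[df.iloc[:, 0] != 'Rk'].reset_index(drop=True)
print(df.head())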
I've written a script in Python to scrape the tabular content from a webpage. In the first column of the main table there are names. Some names have links leading to another page, while some are just names without any link. My intention is to parse the rows as-is when a name has no link to another page. However, when a name does have a link to another page, the script should first parse the relevant rows from the main table and then follow that link to parse the associated information for that name from the table located at the bottom, under the title Companies. Finally, write them to a csv file.
I've tried so far:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
link = "https://suite.endole.co.uk/insight/company/ajax_people.php?ajax_url=ajax_people&page=1&company_number=03512889"
base = "https://suite.endole.co.uk"
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
for item in soup.select("table tr")[1:]:
    if not item.select_one("td a[href]"):
        first_table = [i.text for i in item.select("td")]
        print(first_table)
    else:
        first_table = [i.text for i in item.select("td")]
        print(first_table)
        url = urljoin(base, item.select_one("td a[href]").get("href"))
        resp = requests.get(url)
        soup_ano = BeautifulSoup(resp.text, "lxml")
        for elems in soup_ano.select(".content:contains(Companies) table tr")[1:]:
            associated_info = [elem.text for elem in elems.select("td")]
            print(associated_info)
My script above can do almost everything, but I can't come up with the logic to print once, rather than printing thrice, so that I can get all the data altogether and write it to a csv file.
Put all your scraped data into a list (here I've called the list associated_info); then all the data is in one place, and you can iterate over the list to write it out to a CSV if you like...
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
link = "https://suite.endole.co.uk/insight/company/ajax_people.php?ajax_url=ajax_people&page=1&company_number=03512889"
base = "https://suite.endole.co.uk"
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
associated_info = []
for item in soup.select("table tr")[1:]:
    if not item.select_one("td a[href]"):
        associated_info.append([i.text for i in item.select("td")])
    else:
        associated_info.append([i.text for i in item.select("td")])
        url = urljoin(base, item.select_one("td a[href]").get("href"))
        resp = requests.get(url)
        soup_ano = BeautifulSoup(resp.text, "lxml")
        for elems in soup_ano.select(".content:contains(Companies) table tr")[1:]:
            associated_info.append([elem.text for elem in elems.select("td")])

print(associated_info)
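To write that list out, Python's built-in csv module is enough. A minimal sketch (the output file name people.csv is just an example):
import csv

# one CSV row per scraped row in associated_info
with open("people.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(associated_info)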
import requests
from bs4 import BeautifulSoup
import csv
from urlparse import urljoin
import urllib2
from lxml import html
base_url = 'http://www.pro-football-reference.com' # base url for concatenation
data = requests.get("http://www.pro-football-reference.com/years/2014/games.htm") #website for scraping
soup = BeautifulSoup(data.content)
list_of_cells = []
for link in soup.find_all('a'):
    if link.has_attr('href'):
        if link.get_text() == 'boxscore':
            url = base_url + link['href']
            for x in url:
                response = requests.get('x')
                html = response.content
                soup = BeautifulSoup(html)
                table = soup.find('table', attrs={'class': 'stats_table x_large_text'})
                for row in table.findAll('tr'):
                    for cell in row.findAll('td'):
                        text = cell.text.replace(' ', '')
                        list_of_cells.append(text)

print list_of_cells
I am using the code above to get all the boxscore URLs from http://www.pro-football-reference.com/years/2014/games.htm. After I get these boxscore URLs I would like to loop through them to scrape the quarter-by-quarter data for each team, but my syntax always seems to be off no matter how I format the code.
If possible, I would like to scrape more than just the scoring data by also getting the Game Info, officials, and expected points per game.
If you modify your loop slightly to:
for link in soup.find_all('a'):
    if not link.has_attr('href'):
        continue

    if link.get_text() != 'boxscore':
        continue

    url = base_url + link['href']
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html)

    # Scores
    table = soup.find('table', attrs={'id': 'scoring'})
    for row in table.findAll('tr'):
        for cell in row.findAll('td'):
            text = cell.text.replace(' ', '')
            list_of_cells.append(text)

print list_of_cells
That returns each of the cells for each row in the scoring table for each page linked to with the 'boxscore' text.
The issues I found with the existing code were:
You were attempting to loop through each character in the href returned for the 'boxscore' link.
You were always requesting the string 'x'.
Not so much an issue, but I changed the table selector to identify the table by its id 'scoring' rather than the class. IDs at least should be unique within the page (though there is no guarantee).
I'd recommend that you find each table (or HTML element) containing the data you want in the main loop (e.g. score_table = soup.find('table'...)), but that you move the code that parses that data (e.g.)...
for row in table.findAll('tr'):
    for cell in row.findAll('td'):
        text = cell.text.replace(' ', '')
        list_of_cells.append(text)
print list_of_cells
...into a separate function that returns said data (one for each type of data you are extracting), just to keep the code slightly more manageable. The more the code indents to handle if tests and for loops, the more difficult it tends to be to follow the flow. For example:
score_table = soup.find('table', attrs={'id': 'scoring'})
score_data = parse_score_table(score_table)
other_table = soup.find('table', attrs={'id': 'other'})
other_data = parse_other_table(other_table)
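A minimal sketch of what one of those helper functions could look like (parse_score_table is just the illustrative name used in the snippet above, not an existing library function):
def parse_score_table(table):
    """Return the text of every cell in the given table, row by row."""
    parsed_rows = []
    for row in table.findAll('tr'):
        cells = [cell.text.strip() for cell in row.findAll('td')]
        if cells:  # skip rows that contain only <th> header cells
            parsed_rows.append(cells)
    return parsed_rows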