This is a context-specific question regarding how to use BeautifulSoup to parse an HTML table in Python 2.7.
I would like to extract the HTML table here and place it in a tab-delimited CSV, and have tried playing around with BeautifulSoup.
Code for context:
proxies = {
    "http": "198.204.231.235:3128",  # requests expects the scheme name ("http") as the key
}
site = "http://sloanconsortium.org/onlineprogram_listing?page=11&Institution=&field_op_delevery_mode_value_many_to_one[0]=100%25%20online"
r = requests.get(site, proxies=proxies)
print 'r: ', r
html_source = r.text
print 'src: ', html_source
soup = BeautifulSoup(html_source)
Why doesn't this code get the 4th row?
soup.find('table','views-table cols-6').tr[4]
How would I print out all of the elements in the first row (not the header row)?
Okay, someone might be able to give you a one-liner, but the following should get you started:
table = soup.find('table', class_='views-table cols-6')
for row in table.find_all('tr'):
    row_text = list()
    for item in row.find_all('td'):
        text = item.text.strip()
        row_text.append(text.encode('utf8'))
    print row_text
As for your tr[4]: subscripting a tag in BeautifulSoup looks up an attribute, not a child by index, so tr[4] asks the first tr for an attribute named 4 rather than fetching the fourth row.
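To answer both of your questions directly, index into the list that find_all returns; a minimal sketch, reusing the table variable from above:

rows = table.find_all('tr')
fourth_row = rows[4]      # index the Python list of rows, not the tag
first_data_row = rows[1]  # rows[0] is the header row
print [td.text.strip() for td in first_data_row.find_all('td')]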
I am trying to extract each row individually to eventually create a dataframe and export them to a CSV. I can't locate the individual parts of the HTML.
I can find and save the entire content (although I only seem to be able to save it in a loop, so the page content appears hundreds of times), but I can't find any HTML parts nested beneath it. My code is as follows, trying to find the first row:
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
content = soup.find('div', {'class': 'view-content'})
for infos in content:
    try:
        data = infos.find('div', {'class': 'type type_18'}).text
    except:
        print("None found")
df = pd.DataFrame(data)
df.columns = df.columns.str.lower().str.replace(': ','')
df[['type','rrr']] = df['rrr'].str.split("|",expand=True)
df.to_csv(r'savehere.csv', index=False, header=True)
This code just prints "None found" because, I assume, it hasn't found anything to extract. I don't know if I am looking in the wrong HTML part or what.
Any help would be much appreciated.
What happens?
The main issue here is that content = soup.find('div', {'class': 'view-content'}) is not a ResultSet; it is a single element, so your loop iterates over that one element's children rather than over result blocks.
Also, caused by this behavior, some of those children are plain strings, and on them you end up calling Python's string method find() instead of BeautifulSoup's find(); the two operate in completely different ways. Without the try/except you would see what is going on: it tries to find a substring:
for x in soup.find('div', {'class': 'view-content'}):
    print(x.find('div'))
Output
...
-1
<div class="views-field views-field-title-1"> <span class="views-label views-label-title-1">RRR: </span> <span class="field-content"><div class="type type_18">Eleemosynary grant</div>2256</span> </div>
...
How to fix?
Select your elements more specifically, in this case the views-row:
sections = soup.find_all('div', {'class': 'views-row'})
While iterating each section, you can select the expected value:
sections = soup.find_all('div', {'class': 'views-row'})
for section in sections:
    print(section.select_one('div[class*="type_"]').text)
Example
Scrapes all the information and creates a DataFrame:
import requests
from bs4 import BeautifulSoup
import pandas as pd

data = []
url = #link here#
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
sections = soup.find_all('div', {'class': 'views-row'})
for section in sections:
    d = {}
    for row in section.select('div.views-field'):
        d[row.span.text] = row.select_one('span:nth-of-type(2)').get_text('|', strip=True)
    data.append(d)
df = pd.DataFrame(data)
### strip ': ' from the headers and set them to lower case
df.columns = df.columns.str.lower().str.replace(': ', '')
...
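From there, the split and export from your original attempt should work on the cleaned frame; a short sketch reusing the column name and output path from your own code:

# reuses the 'rrr' column and CSV path from the question's code
df[['type', 'rrr']] = df['rrr'].str.split('|', expand=True)
df.to_csv(r'savehere.csv', index=False, header=True)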
I think you wanted to paginate with a for loop and range() and to grab the RRR value. I've handled the next pages, meaning the pagination, through the page parameter in the long url.
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = #insert url#
data = []
for page in range(1, 7):
    req = requests.get(url.format(page=page))
    soup = BeautifulSoup(req.content, 'lxml')
    for r in soup.select('[class="views-field views-field-title-1"] span:nth-child(2)'):
        rr = list(r.stripped_strings)[-1]
        #print(rr)
        data.append(rr)
df = pd.DataFrame(data, columns=['RRR'])
print(df)
#df.to_csv('data.csv', index=False)
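Note that this relies on the url string containing a {page} placeholder for .format(page=page) to fill in; a hypothetical shape, since the real url was left out of the post:

# hypothetical template; substitute the real listing url
url = 'http://example.com/onlineprogram_listing?page={page}&Institution='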
I'm trying to get the advanced stats of players onto an excel sheet but the table it's scraping is the first one instead of the advanced stats table.
ValueError: Length of passed values is 23, index implies 21
If I try to use the id instead, I get another error about tbody.
Also, I get an error about
lname=name.split(" ")[1]
IndexError: list index out of range.
I think that has to do with 'Nene' in the list. Is there a way to fix that?
import requests
from bs4 import BeautifulSoup
import pandas as pd

playernames = ['Carlos Delfino',
               'Yao Ming',
               'Andris Biedrins',
               'Nene']

for name in playernames:
    fname = name.split(" ")[0]
    lname = name.split(" ")[1]
    url = "https://basketball.realgm.com/search?q={}+{}".format(fname, lname)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find('table', attrs={'class': 'tablesaw', 'data-tablesaw-mode-exclude': 'columntoggle'}).find_next('tbody')
    print(table)

    columns = ['Season', 'Team', 'League', 'GP', 'GS', 'TS%', 'eFG%', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'TOV%', 'STL%', 'BLK%', 'USG%', 'Total S%', 'PPR', 'PPS', 'ORtg', 'DRtg', 'PER']
    df = pd.DataFrame(columns=columns)
    trs = table.find_all('tr')
    for tr in trs:
        tds = tr.find_all('td')
        row = [td.text.replace('\n', '') for td in tds]
        df = df.append(pd.Series(row, index=columns), ignore_index=True)
df.to_csv('international players.csv', index=False)
Brazilian players often go by a single name in soccer (think of the footballer Fred). If you want to use their moniker (Nene/Fred), then you need to implement exception handling for this, something like:
try:
    lname = name.split(" ")[1]
except IndexError:
    lname = name
For your scraping issue, try using find_all as opposed to find; this will give you every matching data table on the page, and you can then pull the correct table out of that list.
Change table = soup.find('table', attrs={'class': 'tablesaw', 'data-tablesaw-mode-exclude': 'columntoggle'}) to the equivalent soup.find_all(...) call.
FYI also, the table IDs change every time you refresh the page, so you can't use an ID as a search mechanism.
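A minimal sketch of that change; which index holds the advanced stats table is an assumption you would confirm by inspecting the page:

# find_all returns a list of every matching table on the page
tables = soup.find_all('table', attrs={'class': 'tablesaw', 'data-tablesaw-mode-exclude': 'columntoggle'})
table = tables[1].find_next('tbody')  # assumed position of the advanced stats table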
I am trying to scrape some baseball-related data and keep getting an empty list. I'm somewhat new to scraping and hoping someone can help. Thanks!
from bs4 import BeautifulSoup
import requests
url = 'https://www.fangraphs.com/statss.aspx?playerid=2520&position=P'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
playerData = soup.find_all('tr', {"id":"SeasonStats1_dgSeason11_ctl00"})
print(playerData)
That's because the rows have a different id than the parent table.
You can access them like this:
playerData = soup.find('table', {"id":"SeasonStats1_dgSeason11_ctl00"}).find_all('tr')
SeasonStats1_dgSeason11_ctl00 doesn't exist as a row id in your data. You need to wildcard it with a lambda or a regex:
playerData = soup.find_all('tr',{"id": lambda L: L and L.startswith('SeasonStats1_dgSeason11_ctl00')})
print(playerData)
There is no row with the id "SeasonStats1_dgSeason11_ctl00".
But you could get the whole table by searching for 'table' instead of the row 'tr':
playerData = soup.find_all('table', {"id":"SeasonStats1_dgSeason11_ctl00"})
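Either way, once you have the table you can walk its rows and pull the cell text; a minimal sketch building on the suggestions above:

table = soup.find('table', {"id": "SeasonStats1_dgSeason11_ctl00"})
if table:  # guard in case the id really is absent, as discussed
    for tr in table.find_all('tr'):
        print([td.get_text(strip=True) for td in tr.find_all('td')])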
I need to scrape all 'a' tags with the "result-title" class, and all 'span' tags with either the 'results-price' or 'results-hood' class. Then, I need to write the output to a .csv file across multiple columns. The current code does not print anything to the CSV file. This may be bad syntax but I really can't see what I am missing. Thanks.
f = csv.writer(open(r"C:\Users\Sean\Desktop\Portfolio\Python - Web Scraper\RE Competitor Analysis.csv", "wb"))

def scrape_links(start_url):
    for i in range(0, 2500, 120):
        source = urllib.request.urlopen(start_url.format(i)).read()
        soup = BeautifulSoup(source, 'lxml')
        for a in soup.find_all("a", "span", {"class" : ["result-title hdrlnk", "result-price", "result-hood"]}):
            f.writerow([a['href']], span['results-title hdrlnk'].getText(), span['results-price'].getText(), span['results-hood'].getText())
        if i < 2500:
            sleep(randint(30,120))
            print(i)

scrape_links('my_url')
If you want to find multiple tags with one call to find_all, you should pass them in a list. For example:
soup.find_all(["a", "span"])
Without access to the page you are scraping, it's too hard to give you a complete solution, but I recommend extracting one variable at a time and printing it to help you debug. For example:
a = soup.find('a', class_ = 'result-title')
a_link = a['href']
a_text = a.text
spans = soup.find_all('span', class_ = ['results-price', 'result-hood'])
row = [a_link, a_text] + [s.text for s in spans]
print(row) # verify we are getting the results we expect
f.writerow(row)
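To keep each listing's link, title, price, and hood aligned on one CSV row, you would typically iterate a per-result container first; a sketch assuming each listing sits in a li element with class 'result-row' (a common Craigslist layout, not confirmed by your post):

for result in soup.find_all('li', class_='result-row'):  # assumed container class
    a = result.find('a', class_='result-title')
    price = result.find('span', class_='result-price')
    hood = result.find('span', class_='result-hood')
    f.writerow([a['href'] if a else '',
                a.get_text(strip=True) if a else '',
                price.get_text(strip=True) if price else '',
                hood.get_text(strip=True) if hood else ''])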
import requests
from bs4 import BeautifulSoup
import csv
from urlparse import urljoin
import urllib2
from lxml import html

base_url = 'http://www.pro-football-reference.com' # base url for concatenation
data = requests.get("http://www.pro-football-reference.com/years/2014/games.htm") # website for scraping
soup = BeautifulSoup(data.content)
list_of_cells = []
for link in soup.find_all('a'):
    if link.has_attr('href'):
        if link.get_text() == 'boxscore':
            url = base_url + link['href']
            for x in url:
                response = requests.get('x')
                html = response.content
                soup = BeautifulSoup(html)
                table = soup.find('table', attrs={'class': 'stats_table x_large_text'})
                for row in table.findAll('tr'):
                    for cell in row.findAll('td'):
                        text = cell.text.replace(' ', '')
                        list_of_cells.append(text)
print list_of_cells
I am using this code to get all the boxscore urls from http://www.pro-football-reference.com/years/2014/games.htm. After I get these boxscore urls, I would like to loop through them to scrape the quarter-by-quarter data for each team, but my syntax always seems to be off no matter how I format the code.
If it is possible I would like to scrape more than just the scoring data by also getting the Game Info, officials, and Expected points per game.
If you modify your loop slightly to:
for link in soup.find_all('a'):
    if not link.has_attr('href'):
        continue
    if link.get_text() != 'boxscore':
        continue
    url = base_url + link['href']
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html)
    # Scores
    table = soup.find('table', attrs={'id': 'scoring'})
    for row in table.findAll('tr'):
        for cell in row.findAll('td'):
            text = cell.text.replace(' ', '')
            list_of_cells.append(text)
print list_of_cells
That returns each of the cells for each row in the scoring table for each page linked to with the 'boxscore' text.
The issues I found with the existing code were:
You were attempting to loop through each character in the href returned for the 'boxscore' link.
You were always requesting the string 'x'.
Not so much an issue, but I changed the table selector to identify the table by its id 'scoring' rather than by class. IDs at least should be unique within the page (though there is no guarantee).
I'd recommend that you find each table (or HTML element) containing the data you want in the main loop (e.g. score_table = soup.find('table'...)), but move the code that parses that data, e.g....
for row in table.findAll('tr'):
    for cell in row.findAll('td'):
        text = cell.text.replace(' ', '')
        list_of_cells.append(text)
print list_of_cells
...into a separate function that returns said data (one for each type of data you are extracting), just to keep the code slightly more manageable. The more the code indents to handle if tests and for loops, the more difficult it tends to be to follow the flow. For example:
score_table = soup.find('table', attrs={'id': 'scoring'})
score_data = parse_score_table(score_table)
other_table = soup.find('table', attrs={'id': 'other'})
other_data = parse_other_table(other_table)
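A sketch of what one such helper could look like; the name and return shape are illustrative, not prescribed:

def parse_score_table(table):
    # return a list of rows, each a list of cleaned cell strings
    rows = []
    for tr in table.findAll('tr'):
        cells = [td.text.replace(' ', '') for td in tr.findAll('td')]
        if cells:  # skips header rows that contain only th cells
            rows.append(cells)
    return rows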