Trying to scrape table and keep getting empty list - python

I am trying to scrape some baseball related data and keep getting an empty list. I'm somewhat new to scraping and hoping someone can help. Thanks!
from bs4 import BeautifulSoup
import requests
url = 'https://www.fangraphs.com/statss.aspx?playerid=2520&position=P'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
playerData = soup.find_all('tr', {"id":"SeasonStats1_dgSeason11_ctl00"})
print(playerData)

That's because the rows have a different id than the parent table.
You can access them like this:
playerData = soup.find('table', {"id":"SeasonStats1_dgSeason11_ctl00"}).find_all('tr')

SeasonStats1_dgSeason11_ctl00 doesn't exist as a row id in your data; that id belongs to the table, and the row ids merely start with it. You need to wildcard the match with a lambda or a regex:
playerData = soup.find_all('tr',{"id": lambda L: L and L.startswith('SeasonStats1_dgSeason11_ctl00')})
print(playerData)

There is no row with the id "SeasonStats1_dgSeason11_ctl00", but you could get the whole table by searching for 'table' instead of the row 'tr':
playerData = soup.find_all('table', {"id":"SeasonStats1_dgSeason11_ctl00"})
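To see why the exact-id lookup comes back empty while the prefix match works, here is a minimal sketch. The markup is invented to mirror the pattern described above (the id on the table, suffixed ids on the rows); the real page differs:

```python
from bs4 import BeautifulSoup

html = """
<table id="SeasonStats1_dgSeason11_ctl00">
<tr id="SeasonStats1_dgSeason11_ctl00__0"><td>2011</td></tr>
<tr id="SeasonStats1_dgSeason11_ctl00__1"><td>2012</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# An exact id match on <tr> finds nothing -- that id is on the <table>:
exact = soup.find_all('tr', {"id": "SeasonStats1_dgSeason11_ctl00"})

# A prefix match with a lambda finds every row:
rows = soup.find_all('tr', {"id": lambda L: L and L.startswith('SeasonStats1_dgSeason11_ctl00')})
print(len(exact), len(rows))  # 0 2
```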

Related

HTML parts locating

I am trying to extract each row individually to eventually create a dataframe to export them into a csv. I can't locate the individual parts of the html.
I can find and save the entire content (although I can only seem to save this on a loop so the pages appear hundreds of times), but I can't find any html parts nested beneath this. My code is as follows, trying to find the first row:
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
content = soup.find('div', {'class': 'view-content'})
for infos in content:
    try:
        data = infos.find('div', {'class': 'type type_18'}).text
    except:
        print("None found")
df = pd.DataFrame(data)
df.columns = df.columns.str.lower().str.replace(': ','')
df[['type','rrr']] = df['rrr'].str.split("|", expand=True)
df.to_csv(r'savehere.csv', index=False, header=True)
This code just prints "None found" because, I assume, it hasn't found anything else to print. I don't know if I am not finding the right html part or what.
Any help would be much appreciated.
What happens?
The main issue is that content = soup.find('div', {'class': 'view-content'}) is not a ResultSet: find() returns a single element, so your loop iterates over that one div's children rather than over a list of matches.
Because those children include plain NavigableStrings, you also silently swap from BeautifulSoup's find() method to Python's string find() method, and the two behave very differently. Without the try/except you would see what is going on: the string method searches for a substring and returns -1 when it isn't found:
for x in soup.find('div', {'class': 'view-content'}):
    print(x.find('div'))
Output
...
-1
<div class="views-field views-field-title-1"> <span class="views-label views-label-title-1">RRR: </span> <span class="field-content"><div class="type type_18">Eleemosynary grant</div>2256</span> </div>
...
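The -1 values come from Python's string find(), not BeautifulSoup's. A minimal self-contained demonstration of the mix-up, using invented markup:

```python
from bs4 import BeautifulSoup

html = '<div class="view-content">\n<div class="views-row">text</div>\n</div>'
soup = BeautifulSoup(html, "html.parser")
content = soup.find('div', {'class': 'view-content'})

# Iterating a single Tag walks its children, which include the newline
# NavigableStrings. On those, .find() is str.find() and returns -1;
# on the inner Tag (which holds no nested div), it returns None.
results = [child.find('div') for child in content]
print(results)  # [-1, None, -1]
```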
How to fix?
Select your elements more specifically, in this case the views-row:
sections = soup.find_all('div', {'class': 'views-row'})
While you iterate each section you could select expected value:
sections = soup.find_all('div', {'class': 'views-row'})
for section in sections:
print(section.select_one('div[class*="type_"]').text)
Example
This scrapes all the information and creates a DataFrame:
import requests
from bs4 import BeautifulSoup
import pandas as pd

data = []
url = #link here#
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
sections = soup.find_all('div', {'class': 'views-row'})
for section in sections:
    d = {}
    for row in section.select('div.views-field'):
        d[row.span.text] = row.select_one('span:nth-of-type(2)').get_text('|', strip=True)
    data.append(d)
df = pd.DataFrame(data)
### replacing : in header and set all to lower case
df.columns = df.columns.str.lower().str.replace(': ','')
...
I think you wanted to paginate with a for loop and range and grab the RRR value. I've handled the next pages (pagination) via the page parameter in the URL.
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = #insert url#
data = []
for page in range(1, 7):
    req = requests.get(url.format(page=page))
    soup = BeautifulSoup(req.content, 'lxml')
    for r in soup.select('[class="views-field views-field-title-1"] span:nth-child(2)'):
        rr = list(r.stripped_strings)[-1]
        #print(rr)
        data.append(rr)
df = pd.DataFrame(data, columns=['RRR'])
print(df)
#df.to_csv('data.csv', index=False)
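Note that url.format(page=page) only works if the URL contains a {page} placeholder. A quick sketch with a hypothetical URL (substitute the real listing URL):

```python
# Hypothetical URL pattern for illustration only.
url = "https://example.com/listing?page={page}"
urls = [url.format(page=page) for page in range(1, 7)]
print(urls[0])   # https://example.com/listing?page=1
print(urls[-1])  # https://example.com/listing?page=6
```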

Extracting names from Wikipedia bullet lists only returns the first name for each letter

I am trying to grab all the names from this following Wikipedia page: https://ro.wikipedia.org/wiki/List%C4%83_de_prenume_rom%C3%A2ne%C8%99ti
This is the code I'm running:
from bs4 import BeautifulSoup
import requests
url = 'https://ro.m.wikipedia.org/wiki/List%C4%83_de_prenume_rom%C3%A2ne%C8%99ti'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
wikiName = [x.find('a').text.upper() for x in soup.findAll('div', class_ = 'div-col columns column-count column-count-5')]
for i in wikiName:
print(i)
I want to preface this by saying that I'm an absolute beginner. I have tried different strings after class_, but nothing returns the entire list of names. The only names returned are the first one for each letter:
ADA
BEATRICE
CAMELIA
DACIANA
ECATERINA
FABIA
etc.
I would appreciate it if somebody could let me know what I have to do in order to get all the names from the page. Thank you very much in advance!
You can try this. Use find_all to get all names and filter the junk out later.
from bs4 import BeautifulSoup
import requests
url = 'https://ro.m.wikipedia.org/wiki/List%C4%83_de_prenume_rom%C3%A2ne%C8%99ti'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
wikiName = [x.find_all('a') for x in soup.find_all('div', class_ = 'div-col columns column-count column-count-5')]
for names in wikiName:
    print([name.text for name in names if name.text != 'wikt' and name.text != '#'])
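The difference between find and find_all is the whole fix here: find returns only the first matching anchor, find_all returns them all. A small sketch with deliberately simplified markup (the real page's class names differ):

```python
from bs4 import BeautifulSoup

html = """
<div class="div-col">
<ul>
<li><a href="#">Ada</a></li>
<li><a href="#">Adela</a></li>
<li><a href="#">Adina</a></li>
</ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
col = soup.find('div', class_='div-col')

first_only = col.find('a').text                       # just the first anchor
all_names = [a.text.upper() for a in col.find_all('a')]  # every anchor
print(first_only)  # Ada
print(all_names)   # ['ADA', 'ADELA', 'ADINA']
```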

Scraping wrong table

I'm trying to get the advanced stats of players onto an excel sheet but the table it's scraping is the first one instead of the advanced stats table.
ValueError: Length of passed values is 23, index implies 21
If I try to use the id instead, I get another error about tbody. I also get an error on this line:
lname = name.split(" ")[1]
IndexError: list index out of range
I think that has to do with 'Nene' in the list. Is there a way to fix that?
import requests
import pandas as pd
from bs4 import BeautifulSoup

playernames = ['Carlos Delfino',
               'Yao Ming',
               'Andris Biedrins',
               'Nene']
for name in playernames:
    fname = name.split(" ")[0]
    lname = name.split(" ")[1]
    url = "https://basketball.realgm.com/search?q={}+{}".format(fname, lname)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find('table', attrs={'class': 'tablesaw', 'data-tablesaw-mode-exclude': 'columntoggle'}).find_next('tbody')
    print(table)
    columns = ['Season', 'Team', 'League', 'GP', 'GS', 'TS%', 'eFG%', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'TOV%', 'STL%', 'BLK%', 'USG%', 'Total S%', 'PPR', 'PPS', 'ORtg', 'DRtg', 'PER']
    df = pd.DataFrame(columns=columns)
    trs = table.find_all('tr')
    for tr in trs:
        tds = tr.find_all('td')
        row = [td.text.replace('\n', '') for td in tds]
        df = df.append(pd.Series(row, index=columns), ignore_index=True)
    df.to_csv('international players.csv', index=False)
Brazilian footballers often go by a single name (think Fred). If you want to handle such one-word monikers (Nene), you need exception handling around the name split, something like:
try:
    lname = name.split(" ")[1]
except IndexError:
    lname = name
For your scraping issue, try find_all instead of find. That gives you every matching data table on the page, and you can then pull the correct one out of the list: change table = soup.find('table', attrs={'class': 'tablesaw', 'data-tablesaw-mode-exclude': 'columntoggle'}) to find_all and index the result.
FYI also, the table ids change every time you refresh the page, so you can't use an id as a search mechanism.
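Since the ids are unstable, one option is to pick the table by a header it contains rather than by position. A sketch with invented, simplified markup (the real page has more columns and classes):

```python
from bs4 import BeautifulSoup

# Two tables share the same class; select the advanced-stats one by
# looking for a header only it contains ("TS%").
html = """
<table class="tablesaw"><thead><tr><th>Season</th><th>PTS</th></tr></thead></table>
<table class="tablesaw"><thead><tr><th>Season</th><th>TS%</th></tr></thead></table>
"""
soup = BeautifulSoup(html, "html.parser")

tables = soup.find_all('table', class_='tablesaw')
advanced = next(t for t in tables if t.find('th', string='TS%'))
print(advanced.find_all('th')[1].text)  # TS%
```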

Beautiful Soup scrape table with table breaks

I'm trying to scrape a table into a dataframe. My attempt only returns the table name and not the data within rows for each region.
This is what i have so far:
from bs4 import BeautifulSoup as bs4
import requests
url = 'https://www.eia.gov/todayinenergy/prices.php'
r = requests.get(url)
soup = bs4(r.text, "html.parser")
table_regions = soup.find('table', {'class': "t4"})
regions = table_regions.find_all('tr')
for row in regions:
    print(row)
ideal outcome i'd like to get:
region | price
---------------|-------
new england | 2.59
new york city | 2.52
Thanks for any assistance.
If you check your html response (soup), you will see that the table tag you get in table_regions = soup.find('table', {'class': "t4"}) is closed before the rows that contain the information you need (the ones containing td's with the class names up, dn, d1 and s1).
So how about using the raw td tags like this:
from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd
url = 'https://www.eia.gov/todayinenergy/prices.php'
r = requests.get(url)
soup = bs4(r.text, "html.parser")
a = soup.find_all('tr')
rows = []
subel = []
for tr in a[42:50]:
    b = tr.find_all('td')
    for td in b:
        subel.append(td.string)
    rows.append(subel)
    subel = []
df = pd.DataFrame(rows, columns=['Region','Price_1', 'Percent_change_1', 'Price_2', 'Percent_change_2', 'Spark Spread'])
Notice that I use just the a[42:50] slice of the results, because a contains every tr on the website. You can use the rest too if you need to.
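Hard-coding a slice like a[42:50] breaks as soon as the page layout shifts; a more robust sketch is to keep only the rows whose cells carry the data classes mentioned above. The markup below is invented and simplified, and the real class names may differ:

```python
from bs4 import BeautifulSoup
import pandas as pd

html = """
<tr><td class="s1">New England</td><td class="d1">2.59</td></tr>
<tr><td class="s1">New York City</td><td class="d1">2.52</td></tr>
<tr><th>a header row with no data cells</th></tr>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only rows whose first cell is a region name (class "s1").
rows = [[td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in soup.find_all('tr') if tr.find('td', class_='s1')]
df = pd.DataFrame(rows, columns=['region', 'price'])
print(df)
```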

How to use BeautifulSoup to parse a table?

This is a context-specific question regarding how to use BeautifulSoup to parse an html table in python2.7.
I would like to extract the html table here and place it in a tab-delim csv, and have tried playing around with BeautifulSoup.
Code for context:
proxies = {
    "http": "198.204.231.235:3128",
}
site = "http://sloanconsortium.org/onlineprogram_listing?page=11&Institution=&field_op_delevery_mode_value_many_to_one[0]=100%25%20online"
r = requests.get(site, proxies=proxies)
print 'r: ', r
html_source = r.text
print 'src: ', html_source
soup = BeautifulSoup(html_source)
Why doesn't this code get the 4th row?
soup.find('table','views-table cols-6').tr[4]
How would I print out all of the elements in the first row (not the header row)?
Okay, someone might be able to give you a one-liner, but the following should get you started:
table = soup.find('table', class_='views-table cols-6')
for row in table.find_all('tr'):
    row_text = list()
    for item in row.find_all('td'):
        text = item.text.strip()
        row_text.append(text.encode('utf8'))
    print row_text
As for tr[4]: on a Tag, [4] is treated as an attribute lookup, not an index as you suppose. soup.find('table', ...).tr only ever gives the first row; to get the fourth row, index the list returned by find_all('tr').
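A quick sketch of the attribute-vs-index distinction, with invented markup:

```python
from bs4 import BeautifulSoup

html = "<table>" + "".join("<tr><td>row %d</td></tr>" % i for i in range(6)) + "</table>"
soup = BeautifulSoup(html, "html.parser")
table = soup.find('table')

# table.tr is only the FIRST <tr>; subscripting a Tag with [4] would be
# attribute lookup (raising KeyError), not list indexing.
# Index the list from find_all instead:
fifth_row = table.find_all('tr')[4]
print(fifth_row.td.text)  # row 4
```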
