I have an html document with 4 or 5 different tables. The one I want has an attribute class = "data". I can't figure out how to make BeautifulSoup return just that table.
soup = BeautifulSoup(myhtml)
t = soup.findAll('table', 'class="data"')
for table in t:
    rows = table.findAll('tr')
    for tr in rows:
        cols = tr.findAll('td')
        for td in cols:
            print td
If I remove the 'class="data"' argument above, I get the results from every table. Is it possible to select only the one with class = "data"? Or is there some other way to iterate through the tables?
Specify the class attribute as a dictionary, as follows:
t = soup.findAll('table', {'class': 'data'})
If you use bs4, you can also use a CSS selector with the select method:
t = soup.select("table.data")
Look at the following code:
t = soup.find_all('table', class_='data')
Because class is a reserved word in Python, bs4 takes the keyword argument class_ (with a trailing underscore) instead.
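To compare the three answers above side by side, here is a minimal sketch. The sample HTML and the variable names are made up for illustration, and it uses the built-in html.parser so that nothing beyond bs4 is required:

```python
from bs4 import BeautifulSoup

# Hypothetical sample document: two tables, only one has class="data".
html = """
<table class="nav"><tr><td>menu</td></tr></table>
<table class="data"><tr><td>1</td><td>2</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Three equivalent ways to select only the table with class="data".
by_dict = soup.find_all("table", {"class": "data"})
by_keyword = soup.find_all("table", class_="data")
by_selector = soup.select("table.data")

# Walk the cells of the single matching table.
cells = [td.get_text() for td in by_dict[0].find_all("td")]
print(cells)
```

All three calls should return the same single table here, so which one to use is mostly a matter of taste.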
I am trying to scrape a bunch of tables from one web page. With the code below I can get one table and display it correctly with pandas, but I cannot get more than one table at a time.
import bs4 as bs
import urllib.request
import pandas as pd
source = urllib.request.urlopen('https://www.URLHERE.com').read()
soup = bs.BeautifulSoup(source, 'lxml')
tables = soup.select('table')[-1]
rows = tables.find_all('tr')
output = []
for row in rows:
    cols = row.find_all('td')
    cols = [item.text.strip() for item in cols]
    output.append([item for item in cols if item])
df = pd.DataFrame(output, columns = ['1','2', '3', '4', '5', '6'])
df = df.iloc[1:]
print(df)
If I remove the [-1] from my tables variable, I get the error below.
AttributeError: 'list' object has no attribute 'find_all'
What do I need to change to get all the tables off the page?
You're on the right track. As a commenter already said, you need to find_all tables, then apply the row logic you are already using to each table in a loop instead of to just a single table. Your code will look something like this:
tables = soup.find_all('table')
for table in tables:
    # individual table logic here
    rows = table.find_all('tr')
    for row in rows:
        # individual row logic here
        pass
I took a closer look at this, and here is the sample code that I tested:
import bs4 as bs
import urllib.request
source = urllib.request.urlopen('URL').read()
soup = bs.BeautifulSoup(source, 'lxml')
tables = soup.select('table')
print("I found " + str(len(tables)) + " tables.")
all_rows = []
for table in tables:
    print("Searching for <tr> items...")
    rows = table.find_all('tr')
    print("Found " + str(len(rows)) + " rows.")
    for row in rows:
        all_rows.append(row)
print("In total I have got " + str(len(all_rows)) + " rows.")
# example of first row
print(all_rows[0])
A little explanation: the AttributeError appeared when you removed [-1] because the tables variable was then a list object, and a list has no find_all method.
Your approach with [-1] is okay. I assume you know that [-1] grabs the last item from a list. You have to do the same thing with every element, which is what the code above shows.
You might find it interesting to read about the for construct and iterables in Python: https://pythonfordatascience.org/for-loops-and-iterations-python/
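As a small illustration of looping over every table rather than indexing one, the sketch below keeps the rows grouped per table. The HTML string and the helper name table_to_rows are hypothetical, and html.parser stands in for lxml only to keep the example dependency-free:

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables of different shapes.
html = """
<table><tr><td>a</td><td>b</td></tr></table>
<table><tr><td>c</td></tr><tr><td>d</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")


def table_to_rows(table):
    """Return one list of stripped cell texts per <tr> in the table."""
    return [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")
    ]


# find_all returns a list-like result, so apply the row logic to each item.
all_tables = [table_to_rows(table) for table in soup.find_all("table")]
print(all_tables)
```

Each inner list could then be handed to pd.DataFrame separately, which avoids forcing tables with different column counts into a single frame.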
If you want to extract all the tables present on a web page in one go, you should try:
tables = pd.read_html("<URL_HERE>")
tables will be a list of DataFrames, one for each table present on that page.
For more details, refer to the pandas documentation for read_html.
I'm scraping from this link: https://www.pro-football-reference.com/boxscores/201809060phi.htm
My code is as follows:
import requests
from bs4 import BeautifulSoup
# assign url
url = 'https://www.pro-football-reference.com/boxscores/201809060phi.htm'
#parse and format url
r = requests.get(url).text
res = r.replace("<!--","").replace("-->","")
soup = BeautifulSoup(res, 'lxml')
#get tables
tables = soup.findAll("div",{"class":"table_outer_container"})
#get offense_stats table
offense_table = tables[5]
rows = offense_table.tbody.findAll("tr")
#here i want to iterate through the player rows and pull their stats
player = test_row.find("th",{"data-stat":"player"}).text
carries = test_row.find("td",{"data-stat":"rush_att"}).text
rush_yds = test_row.find("td",{"data-stat":"rush_yds"}).text
rush_tds = test_row.find("td",{"data-stat":"rush_td"}).text
targets = test_row.find("td",{"data-stat":"targets"}).text
recs = test_row.find("td",{"data-stat":"rec"}).text
rec_yds= test_row.find("td",{"data-stat":"rec_yds"}).text
rec_tds= test_row.find("td",{"data-stat":"rec_td"}).text
The table on the page that I need (offensive stats) has the stats for all the players in the game. I want to iterate through the rows pulling the stats for each player. Problem is that there are two rows in the middle that are headers and not player stats. My "rows" variable pulled all "tr" elements in the "tbody" of my "offense_table" variable. This includes the two header rows that I do not want. They would be rows[8] and rows[9] in this particular case, but that could be different from game to game.
#this is how the data rows begin (the ones I want)
<tr data-row="0">
#and this is how the header rows begin (the ones I want to skip over)
<tr class="over_header thead" data-row="8">
Anybody know a way for me to ignore these rows when iterating through?
To select only the tr elements that have no class attribute, replace
rows = offense_table.tbody.findAll("tr")
with
rows = offense_table.findAll("tr", attrs={'class': None})
If the rows you want to skip always have the over_header class, and the rows you want to keep never do, you can filter the results of findAll("tr") for rows that don't have the over_header class:
rows = offense_table.tbody.findAll("tr")
rows = [row for row in rows if 'over_header' not in row.get('class', [])]
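For a self-contained illustration of skipping the header rows, here is a sketch against a made-up fragment that mimics the markup quoted in the question. The player names and numbers are invented, and only two of the data-stat fields are pulled to keep it short:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two player rows with a header row in between.
html = """
<table><tbody>
<tr data-row="0"><th data-stat="player">Player A</th><td data-stat="rush_att">10</td></tr>
<tr class="over_header thead" data-row="1"><th>Header</th></tr>
<tr data-row="2"><th data-stat="player">Player B</th><td data-stat="rush_att">3</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

stats = []
for row in soup.find("tbody").find_all("tr"):
    # class is multi-valued, so bs4 returns it as a list (or nothing at all).
    if "over_header" in row.get("class", []):
        continue
    stats.append({
        "player": row.find("th", {"data-stat": "player"}).get_text(),
        "carries": row.find("td", {"data-stat": "rush_att"}).get_text(),
    })

print(stats)
```

Checking the class on the row itself means the header rows are skipped wherever they appear, so their index can differ from game to game.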
I'm using BeautifulSoup to pull the elements of an HTML table into a Python dict. The problem I'm having is that, when I create the dict, the first record from the table is repeatedly loaded into the dict. Printing the variable rows shows the expected number of different records returned in the response, but only the first record is printed when print(d) is called.
import requests
from bs4 import BeautifulSoup as bs
url = 'http://host.com/user_activity?page=3'
r = requests.get(url)
#print(r.text)
soup = bs(r.text, 'lxml')
table = soup.find_all('table')[0]
rows = table.find_all('td')
#records = soup.find_all('td')
#print(table.prettify())
ct=0
for record in rows :
    if ct < 20:
        keys = [th.get_text(strip=True)for th in table.find_all('th')]
        values = [td.get_text(strip=True) for td in rows]
        d = dict(zip(keys, values))
        print(d)
    ct+=1
I think you meant to get the header cells from the first row of the table (once, before the loop) and iterate over the tr elements instead of td.
You can also use a regular find() instead of find_all()[0] and enumerate() to handle the loop increment variable more nicely:
table = soup.find('table')
rows = table.find_all('tr')
headers = [th.get_text(strip=True) for th in rows[0].find_all('th')]
for ct, row in enumerate(rows[1:]):
    values = [td.get_text(strip=True) for td in row.find_all('td')]
    d = dict(zip(headers, values))
    print(d)
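To try the header-then-rows approach without a live page, the sketch below runs it on an invented three-row table and collects the dicts in a list instead of printing them. The column names and values are placeholders:

```python
from bs4 import BeautifulSoup

# Hypothetical table: one header row followed by two records.
html = """
<table>
<tr><th>user</th><th>action</th></tr>
<tr><td>ann</td><td>login</td></tr>
<tr><td>bob</td><td>logout</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")

# Read the header cells once, before looping over the data rows.
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

records = []
for row in rows[1:]:
    values = [td.get_text(strip=True) for td in row.find_all("td")]
    records.append(dict(zip(headers, values)))

print(records)
```

Because values is rebuilt from row.find_all('td') on every pass, each dict holds its own row rather than repeating the first one.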
Apart from what sir alecxe has already shown, you can also do it like this using selectors. Just make sure the table index is accurate, i.e. that it points at the first table, the second table, or whichever one you want to parse.
table = soup.select("table")[0] #be sure to put here the correct index
for items in table.select("tr"):
    data = ' '.join([item.text for item in items.select("th,td")])
    print(data)
I'm getting stuck trying to grab the text values of the a tags. I've managed to isolate the target values but keep running into an error when I try to call get_text().
import requests
from bs4 import BeautifulSoup
base_url = 'http://finviz.com/screener.ashx?v=152&s=ta_topgainers&o=price&c=0,1,2,3,4,5,6,7,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs = {'id':'screener-content'})
table = main_div.find('table')
sub = table.findAll('tr')
rows = sub[5].findAll('td')
for row in rows:
    data = row.a
    print(data)
Assuming you are actually trying to print data.get_text(), it would fail for some of the cells in rows, because in some cases there is no child link element in the td cell and row.a is None. You can check that a link was found beforehand:
for row in rows:
    link = row.a
    if link is not None:
        print(link.get_text())
Note that "row" and "rows" are probably not the best variable names since you are actually iterating over the "cells" - td elements.
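Here is a minimal sketch of the same guard on a made-up row in which one cell has no link. The ticker symbols and hrefs are placeholders, not data from the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical row: the middle cell has no <a> child.
html = (
    "<table><tr>"
    "<td><a href='/quote?t=ABC'>ABC</a></td>"
    "<td>12.5</td>"
    "<td><a href='/quote?t=XYZ'>XYZ</a></td>"
    "</tr></table>"
)

soup = BeautifulSoup(html, "html.parser")
cells = soup.find("tr").find_all("td")

# cell.a is None when the cell has no link, so filter those out first.
link_texts = [cell.a.get_text() for cell in cells if cell.a is not None]
print(link_texts)
```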
Currently, my code is parsing through the link and printing all of the information from the website. I only want to print a single specific line from the website. How can I go about doing that?
Here's my code:
from bs4 import BeautifulSoup
import urllib.request
r = urllib.request.urlopen("Link goes here").read()
soup = BeautifulSoup(r, "html.parser")
# This is what I want to change. I currently have it printing everything.
# I just want a specific line from the website
print (soup.prettify())
li = soup.prettify().split('\n')
print(li[line_number - 1])
Don't use pretty print to try to parse tds. Select the tag specifically: if the class name is unique, use that; if an attribute is unique, use that:
td = soup.select_one("td.content")
td = soup.select_one("td[colspan=3]")
If it was the fourth td:
td = soup.select_one("td:nth-of-type(4)")
If it is in a specific table, select the table and then find the td within it. Splitting the HTML into lines and indexing them is actually worse than using a regex to parse HTML.
You can get the specific td using the text of the bold tag preceding it, i.e. Department of Finance Building Classification:
In [19]: from bs4 import BeautifulSoup
In [20]: import urllib.request
In [21]: url = "http://a810-bisweb.nyc.gov/bisweb/PropertyProfileOverviewServlet?boro=1&houseno=1&street=park+ave&go2=+GO+&requestid=0"
In [22]: r = urllib.request.urlopen(url).read()
In [23]: soup = BeautifulSoup(r, "html.parser")
In [24]: print(soup.find("b",text="Department of Finance Building Classification:").find_next("td").text)
O6-OFFICE BUILDINGS
Pick the nth table and row:
In [25]: print(soup.select_one("table:nth-of-type(8) tr:nth-of-type(5) td[colspan=3]").text)
O6-OFFICE BUILDINGS
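The label lookup can also be tried offline. The sketch below uses an invented two-row table shaped loosely like the page above, and it passes string=, which newer bs4 releases accept as the preferred spelling of the text= argument used in the session:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: each label sits in a <b> tag, the value in the next <td>.
html = """
<table>
<tr><td><b>Owner:</b></td><td>City</td></tr>
<tr><td><b>Building Classification:</b></td><td colspan="3">O6-OFFICE BUILDINGS</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the label by its exact text, then step forward to the next <td>.
label = soup.find("b", string="Building Classification:")
value = label.find_next("td").get_text(strip=True)

# The same cell reached through a CSS attribute selector.
by_selector = soup.select_one('td[colspan="3"]').get_text(strip=True)

print(value)
```

Anchoring on the label text tends to survive layout changes better than positional selectors such as nth-of-type, which break as soon as a row or table is added.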