How to skip over certain rows in table when web scraping - python

I'm scraping from this link: https://www.pro-football-reference.com/boxscores/201809060phi.htm
My code is as follows:
import requests
from bs4 import BeautifulSoup
# assign url
url = 'https://www.pro-football-reference.com/boxscores/201809060phi.htm'
#parse and format url
r = requests.get(url).text
res = r.replace("<!--","").replace("-->","")
soup = BeautifulSoup(res, 'lxml')
#get tables
tables = soup.findAll("div",{"class":"table_outer_container"})
#get offense_stats table
offense_table = tables[5]
rows = offense_table.tbody.findAll("tr")
#here i want to iterate through the player rows and pull their stats
player = test_row.find("th",{"data-stat":"player"}).text
carries = test_row.find("td",{"data-stat":"rush_att"}).text
rush_yds = test_row.find("td",{"data-stat":"rush_yds"}).text
rush_tds = test_row.find("td",{"data-stat":"rush_td"}).text
targets = test_row.find("td",{"data-stat":"targets"}).text
recs = test_row.find("td",{"data-stat":"rec"}).text
rec_yds= test_row.find("td",{"data-stat":"rec_yds"}).text
rec_tds= test_row.find("td",{"data-stat":"rec_td"}).text
The table on the page that I need (offensive stats) has the stats for all the players in the game. I want to iterate through the rows pulling the stats for each player. Problem is that there are two rows in the middle that are headers and not player stats. My "rows" variable pulled all "tr" elements in the "tbody" of my "offense_table" variable. This includes the two header rows that I do not want. They would be rows[8] and rows[9] in this particular case, but that could be different from game to game.
#this is how the data rows begin (the ones I want)
<tr data-row="0">
#and this is how the header rows begin (the ones I want to skip over)
<tr class="over_header thead" data-row="8">
Anybody know a way for me to ignore these rows when iterating through?

To select only the tr elements that have no class attribute, replace
rows = offense_table.tbody.findAll("tr")
with
rows = offense_table.findAll("tr", attrs={'class': None})

If the rows you want to skip always have the over_header class, and the rows you want to keep never do, you can filter the results of findAll("tr") by checking each row's own class attribute (BeautifulSoup returns it as a list) and keeping only the rows without over_header:
rows = offense_table.tbody.findAll("tr")
rows = [row for row in rows if 'over_header' not in row.get('class', [])]
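As a rough sketch of the class-based filtering, here is a self-contained example on a made-up HTML fragment that imitates the row markup quoted in the question. The player names, the numbers, and the assumption that the second header row also carries the thead class are all invented for illustration; check the real page before relying on them.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the structure described in the question.
html = """
<table><tbody>
<tr data-row="0"><th data-stat="player">Player A</th><td data-stat="rush_att">10</td></tr>
<tr class="over_header thead" data-row="1"><th>Rushing</th></tr>
<tr class="thead" data-row="2"><th>Player</th><th>Att</th></tr>
<tr data-row="3"><th data-stat="player">Player B</th><td data-stat="rush_att">7</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")


def is_header_row(row):
    # bs4 exposes the class attribute as a list; rows without one fall back to [].
    # Assumption: every header row carries the "thead" class.
    return "thead" in row.get("class", [])


stats = []
for row in soup.tbody.find_all("tr"):
    if is_header_row(row):
        continue  # skip the header rows in the middle of the table
    player = row.find("th", {"data-stat": "player"}).text
    carries = row.find("td", {"data-stat": "rush_att"}).text
    stats.append((player, carries))

print(stats)
```

Checking the row's own class list avoids hard-coding positions such as rows[8] and rows[9], which can change from game to game.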

Related

Scraping worldometers homepage to pull COVID-19 table data but values don't pull correctly (Python)

I'm scraping worldometers home page to pull the data in the table in Python, but I am struggling as the values aren't pulling in correctly. (The strings are... (Country: USA, Spain, Italy...).
import requests
import lxml.html as lh
import pandas as pd
from tabulate import tabulate
url="https://www.worldometers.info/coronavirus/"
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')
#Create empty list
col=[]
colLen = len(tr_elements[1])
i=0
#For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print ('%d:"%s"'%(i,name))
    col.append((name,[]))
print(colLen)
#Since our first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    if len(T)!=len(tr_elements[0]): break
    #i is the index of our column
    i=0
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content()
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
df.head()
#Print Total Cases Col (this is incorrect when comparing to the webpage)
print(col[1][0:])
#Print Country Col (this is correct)
print(col[0][0:])
I can't figure out what the issue is. Please help me solve it. I'm also open to suggestions for doing this another way :)
Data table on the webpage
Command prompt output for Country (correct)
Command prompt output for Total Cases (incorrect)

Scrape multiple individual tables on one web page

I am trying to scrape a bunch of tables from one web page, with the code below I can get one table and the output to show correctly with pandas, but I cannot get more than one table at a time.
import bs4 as bs
import urllib.request
import pandas as pd
source = urllib.request.urlopen('https://www.URLHERE.com').read()
soup = bs.BeautifulSoup(source, 'lxml')
tables = soup.select('table')[-1]
rows = tables.find_all('tr')
output = []
for rows in rows:
    cols = rows.find_all('td')
    cols = [item.text.strip() for item in cols]
    output.append([item for item in cols if item])
df = pd.DataFrame(output, columns = ['1','2', '3', '4', '5', '6'])
df = df.iloc[1:]
print(df)
If I remove the [-1] from my table variable then I get the error below.
AttributeError: 'list' object has no attribute 'find_all'
What do I need to change to get all the tables off the page?
You're on the right track already. As a commenter said, you need to find_all tables and then apply the row logic you are already using to each table in a loop, instead of to just the last table. Your code will look something like this:
tables = soup.find_all('table')
for table in tables:
    # individual table logic here
    rows = table.find_all('tr')
    for row in rows:
        # individual row logic here
        pass
I took a closer look at this, and here is the sample code that I tested:
source = urllib.request.urlopen('URL').read()
soup = bs.BeautifulSoup(source, 'lxml')
tables = soup.select('table')
print("I found " + str(len(tables)) + " tables.")
all_rows = []
for table in tables:
    print("Searching for <tr> items...")
    rows = table.find_all('tr')
    print("Found " + str(len(rows)) + " rows.")
    for row in rows:
        all_rows.append(row)
print("In total I have got " + str(len(all_rows)) + " rows.")
# example of first row
print(all_rows[0])
A little explanation: the AttributeError you got after removing [-1] occurred because the tables variable was then a list, and a list has no find_all method.
Your approach with [-1] is fine. I assume you know that [-1] grabs the last item from a list. You have to do the same thing for every element of the list, which is what the code above does.
You might find it interesting to read about the for construct and iterables in Python: https://pythonfordatascience.org/for-loops-and-iterations-python/
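To make the per-table loop concrete, here is a small sketch that applies the question's row logic to every table and keeps the results separate. The inline HTML and the helper name table_to_rows are made up for illustration; they stand in for the real page, which is not shown in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two small tables.
html = """
<table><tr><td>a</td><td>b</td></tr><tr><td>c</td><td>d</td></tr></table>
<table><tr><td>1</td><td>2</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")


def table_to_rows(table):
    """Return the table's cell text as a list of rows (lists of strings)."""
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # skip rows without <td> cells, e.g. header-only rows
            rows.append(cells)
    return rows


# One list of rows per table, in document order.
all_tables = [table_to_rows(table) for table in soup.find_all("table")]

print(len(all_tables))
print(all_tables[-1])
```

Each entry of all_tables is a list of lists, so it can be passed to pd.DataFrame in the same way as the output list in the question.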
If you want to extract all the tables on a web page in one go, you can try:
tables = pd.read_html("<URL_HERE>")
tables will be a list of DataFrames, one for each table found on that page.
For details, refer to the pandas documentation for read_html.

Python Web Scraping Script not iterating over HTML table properly

I'm using BeautifulSoup to pull the elements of an HTML table into a Python dict. The problem I'm having is that, when I create the dict, the first record from the table is repeatedly loaded into the dict. Printing the variable rows shows the expected number of different records returned in the response, but only the first record is printed when print(d) is called.
import requests
from bs4 import BeautifulSoup as bs
url = 'http://host.com/user_activity?page=3'
r = requests.get(url)
#print(r.text)
soup = bs(r.text, 'lxml')
table = soup.find_all('table')[0]
rows = table.find_all('td')
#records = soup.find_all('td')
#print(table.prettify())
ct=0
for record in rows :
    if ct < 20:
        keys = [th.get_text(strip=True)for th in table.find_all('th')]
        values = [td.get_text(strip=True) for td in rows]
        d = dict(zip(keys, values))
        print(d)
    ct+=1
I think you meant to get the header cells from the first row of the table (once, before the loop) and iterate over the tr elements instead of td.
You can also use a regular find() instead of find_all()[0] and enumerate() to handle the loop increment variable more nicely:
table = soup.find('table')
rows = table.find_all('tr')
headers = [th.get_text(strip=True) for th in rows[0].find_all('th')]
for ct, row in enumerate(rows[1:]):
    values = [td.get_text(strip=True) for td in row.find_all('td')]
    d = dict(zip(headers, values))
    print(d)
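The same header-then-rows idea can be tried on a tiny inline table. This is only a sketch: the column names and values below are invented, since the real user_activity page is not shown in the question. It also collects one dict per row in a list, so earlier records are not overwritten.

```python
from bs4 import BeautifulSoup

# Hypothetical table standing in for the page in the question.
html = """
<table>
<tr><th>user</th><th>action</th></tr>
<tr><td>alice</td><td>login</td></tr>
<tr><td>bob</td><td>logout</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
rows = table.find_all("tr")

# Read the header cells once, from the first row only.
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

# Build one dict per data row by pairing headers with that row's own cells.
records = [
    dict(zip(headers, (td.get_text(strip=True) for td in row.find_all("td"))))
    for row in rows[1:]
]

print(records)
```

The key difference from the question's code is that the values come from row.find_all("td") for the current row, not from every td in the table.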
Apart from what alecxe has already shown, you can also do this with CSS selectors. Just make sure the table index is correct, i.e. that it points at the first, second, or whichever table you want to parse.
table = soup.select("table")[0] #be sure to put here the correct index
for items in table.select("tr"):
    data = ' '.join([item.text for item in items.select("th,td")])
    print(data)

Forgetting something - Python BeautifulSoup and FinViz

I'm getting stuck trying to grab the text values off the a.href tags. I've managed to isolate the target values but keep running into an error when I try to get_text().
import requests
from bs4 import BeautifulSoup
base_url = 'http://finviz.com/screener.ashx?v=152&s=ta_topgainers&o=price&c=0,1,2,3,4,5,6,7,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs = {'id':'screener-content'})
table = main_div.find('table')
sub = table.findAll('tr')
rows = sub[5].findAll('td')
for row in rows:
    data = row.a
    print data
Assuming you are actually trying to print data.get_text(), it would fail for some of the cells in rows because some td cells contain no child link element, so row.a is None. You can check that a link was found beforehand:
for row in rows:
    link = row.a
    if link is not None:
        print(link.get_text())
Note that "row" and "rows" are probably not the best variable names since you are actually iterating over the "cells" - td elements.
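A minimal sketch of the None check, using a made-up row in which one cell has no link. The ticker symbols and href values are placeholders, not data from FinViz.

```python
from bs4 import BeautifulSoup

# Hypothetical row: the middle cell has no <a> child.
html = """
<table><tr>
<td><a href="quote.ashx?t=AAA">AAA</a></td>
<td>no link here</td>
<td><a href="quote.ashx?t=BBB">BBB</a></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
cells = soup.find("tr").find_all("td")

# cell.a is None when the cell has no link, so filter those out first.
link_texts = [cell.a.get_text(strip=True) for cell in cells if cell.a is not None]

print(link_texts)
```

Calling get_text() on the middle cell's cell.a without the check would raise an AttributeError, because None has no get_text method.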

Parsing html data into python list for manipulation

I am trying to read in html websites and extract their data. For example, I would like to read in the EPS (earnings per share) for the past 5 years of companies. Basically, I can read it in and can use either BeautifulSoup or html2text to create a huge text block. I then want to search the file -- I have been using re.search -- but can't seem to get it to work properly. Here is the line I am trying to access:
EPS (Basic)\n13.4620.6226.6930.1732.81\n\n
So I would like to create a list called EPS = [13.46, 20.62, 26.69, 30.17, 32.81].
Thanks for any help.
from stripogram import html2text
from urllib import urlopen
import re
from BeautifulSoup import BeautifulSoup
ticker_symbol = 'goog'
url = 'http://www.marketwatch.com/investing/stock/'
full_url = url + ticker_symbol + '/financials' #build url
text_soup = BeautifulSoup(urlopen(full_url).read()) #read in
text_parts = text_soup.findAll(text=True)
text = ''.join(text_parts)
eps = re.search("EPS\s+(\d+)", text)
if eps is not None:
    print eps.group(1)
It's not a good practice to use regex for parsing html. Use BeautifulSoup parser: find the cell with rowTitle class and EPS (Basic) text in it, then iterate over next siblings with valueCell class:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
url = 'http://www.marketwatch.com/investing/stock/goog/financials'
text_soup = BeautifulSoup(urlopen(url).read()) #read in
titles = text_soup.findAll('td', {'class': 'rowTitle'})
for title in titles:
    if 'EPS (Basic)' in title.text:
        print [td.text for td in title.findNextSiblings(attrs={'class': 'valueCell'}) if td.text]
prints:
['13.46', '20.62', '26.69', '30.17', '32.81']
Hope that helps.
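The answer above targets Python 2 and BeautifulSoup 3. A rough Python 3 / bs4 sketch of the same sibling-based technique follows. It runs on an inline fragment that assumes the rowTitle and valueCell classes from the answer; the miniGraphCell class is a made-up stand-in for a trailing non-value cell, and the live page may look different.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the class names used in the answer.
html = """
<table>
<tr><td class="rowTitle">EPS (Basic)</td>
<td class="valueCell">13.46</td><td class="valueCell">20.62</td>
<td class="valueCell">26.69</td><td class="valueCell">30.17</td>
<td class="valueCell">32.81</td><td class="miniGraphCell"></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

eps = []
for title in soup.find_all("td", class_="rowTitle"):
    if "EPS (Basic)" in title.get_text():
        # Only siblings with the valueCell class are considered values.
        cells = title.find_next_siblings("td", class_="valueCell")
        eps = [float(td.get_text(strip=True)) for td in cells if td.get_text(strip=True)]

print(eps)
```

The float conversion assumes plain decimal strings. Values with thousands separators or parentheses for negatives would need extra cleaning first.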
I would take a very different approach. We use lxml for scraping HTML pages.
One of the reasons we switched was that BS was not being maintained (or, I should say, updated) for a while.
In my test I ran the following:
import requests
from lxml import html
from collections import OrderedDict
page_as_string = requests.get('http://www.marketwatch.com/investing/stock/goog/financials').content
tree = html.fromstring(page_as_string)
Now I looked at the page and I see the data is divided into two tables. Since you want EPS, I noted that it is in the second table. We could write some code to sort this out programmatically but I will leave that for you.
tables = [ e for e in tree.iter() if e.tag == 'table']
eps_table = tables[-1]
Now I noticed that the first row has the column headings, so I want to separate all of the rows:
table_rows = [ e for e in eps_table.iter() if e.tag == 'tr']
Now let's get the column headings:
column_headings = [ e.text_content() for e in table_rows[0].iter() if e.tag == 'th']
Finally we can map the column headings to the row labels and cell values
my_results = []
for row in table_rows[1:]:
    cell_content = [ e.text_content() for e in row.iter() if e.tag == 'td']
    temp_dict = OrderedDict()
    for numb, cell in enumerate(cell_content):
        if numb == 0:
            temp_dict['row_label'] = cell.strip()
        else:
            dict_key = column_headings[numb]
            temp_dict[dict_key] = cell
    my_results.append(temp_dict)
Now, to access the results:
for row_dict in my_results:
    if row_dict['row_label'] == 'EPS (Basic)':
        for key in row_dict:
            print key, ':', row_dict[key]
row_label : EPS (Basic)
2008 : 13.46
2009 : 20.62
2010 : 26.69
2011 : 30.17
2012 : 32.81
5-year trend :
Now there is still more to do, for example I did not test for squareness (number of cells in each row is equal).
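The squareness test mentioned above could be sketched as follows. The helper name is_square and the sample rows are hypothetical, and the check works on plain lists of cell values, independent of any parsing library.

```python
def is_square(rows):
    """Return True if every row has the same number of cells."""
    lengths = {len(row) for row in rows}
    # Zero or one distinct length means the table is rectangular.
    return len(lengths) <= 1


# Placeholder rows, not data from the page.
good = [["row A", "1", "2"], ["row B", "3", "4"]]
ragged = [["row A", "1", "2"], ["row B"]]

print(is_square(good), is_square(ragged))
```

Running such a check before mapping cells to column_headings would flag rows where indexing column_headings[numb] could fail or misalign.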
Finally I am a novice and I suspect others will advise more direct methods of getting at these elements (xPath or cssselect) but this does work and it gets you everything from the table in a nice structured manner.
I should add that every row from the table is available, they are in the original row order. The first item (which is a dictionary) in the my_results list has the data from the first row, the second item has the data from the second row etc.
When I need a new build of lxml I visit a page maintained by a really nice guy at UC-IRVINE
I hope this helps
from bs4 import BeautifulSoup
import urllib2
import lxml
import pandas as pd
url = 'http://markets.ft.com/research/Markets/Tearsheets/Financials?s=CLLN:LSE&subview=BalanceSheet'
soup = BeautifulSoup(urllib2.urlopen(url).read())
table = soup.find('table', {'data-ajax-content' : 'true'})
data = []
for row in table.findAll('tr'):
    cells = row.findAll('td')
    cols = [ele.text.strip() for ele in cells]
    data.append([ele for ele in cols if ele])
df = pd.DataFrame(data)
print df
dictframe = df.to_dict()
print dictframe
The above code builds a DataFrame from the table on the webpage and then uses it to create a Python dictionary.
