Beautiful Soup: processing cell data using Python

I am using python 2.7 with beautifulsoup to read in a simple HTML table.
After reading in the table, I then try to access the returned data.
As far as I can see, a Python list object is returned. But when I try to access the data using statements such as cell = row[0], I get an "IndexError: list index out of range" error.
from bs4 import BeautifulSoup

# read in HTML data
html = open("in.html").read()
soup = BeautifulSoup(html, "lxml")
table = soup.find("table")

output_rows = []
for table_row in table.findAll('tr'):
    columns = table_row.findAll('td')
    output_row = []
    for column in columns:
        output_row.append(column.text)
    output_rows.append(output_row)

# process some cell data
for row in output_rows:
    name = row[0]
    print name
# fails with "IndexError: list index out of range"

I have come up with the following code to parse each cell of a row into a variable that I can then process further, but it's not very elegant. Ideas for a more elegant solution, suitable for newbies, are welcome.
for x in range(len(row)):
    if x == 0:
        name = row[x]
        print name
    if x == 1:
        address = row[x]
        print address
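One slightly more elegant pattern, as a minimal sketch (assuming, as in your snippet, that each data row holds a name and an address in that order): skip the rows that come back empty, which is what a header row of <th> cells produces since findAll('td') finds nothing there, and then pick the cells out by position.

for row in output_rows:
    if len(row) < 2:  # header rows contain <th>, not <td>, so they parse as empty lists
        continue
    name, address = row[0], row[1]
    print name
    print address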

Related

Scraping the worldometers homepage to pull COVID-19 table data, but values don't pull in correctly (Python)

I'm scraping the worldometers home page to pull the data in the table in Python, but I am struggling because the numeric values aren't pulling in correctly. (The string values, such as Country: USA, Spain, Italy..., come through fine.)
import requests
import lxml.html as lh
import pandas as pd
from tabulate import tabulate

url = "https://www.worldometers.info/coronavirus/"
# Create a handle, page, to handle the contents of the website
page = requests.get(url)
# Store the contents of the website under doc
doc = lh.fromstring(page.content)
# Parse data that is stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')

# Create empty list
col = []
colLen = len(tr_elements[1])
i = 0
# For each header cell, store its name and an empty list
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print('%d:"%s"' % (i, name))
    col.append((name, []))
print(colLen)

# Since our first row is the header, data is stored from the second row onwards
for j in range(1, len(tr_elements)):
    # T is our j'th row
    T = tr_elements[j]
    if len(T) != len(tr_elements[0]):
        break
    # i is the index of our column
    i = 0
    # Iterate through each element of the row
    for t in T.iterchildren():
        data = t.text_content()
        # Append the data to the empty list of the i'th column
        col[i][1].append(data)
        # Increment i for the next column
        i += 1

Dict = {title: column for (title, column) in col}
df = pd.DataFrame(Dict)
df.head()

# Print Total Cases column (this is incorrect when comparing to the webpage)
print(col[1][0:])
# Print Country column (this is correct)
print(col[0][0:])
I can't seem to figure out what the issue is. Please help me solve it. I'm also open to suggestions for doing this another way :)
(Screenshots: the data table on the webpage; command prompt output for Country (correct); command prompt output for Total Cases (incorrect).)
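Not from the thread, but one suggestion for doing it another way: let pandas parse the table directly with read_html. This is a minimal sketch; it assumes the main statistics table is the first <table> on the page and that lxml is installed.

import requests
import pandas as pd

url = "https://www.worldometers.info/coronavirus/"
# Some sites reject the default requests user agent, so send a browser-like one
page = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
# read_html returns one DataFrame per <table> found in the markup
tables = pd.read_html(page.text)
df = tables[0]  # assumption: the main stats table is the first one
print(df.head())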

Scrape multiple individual tables on one web page

I am trying to scrape a bunch of tables from one web page. With the code below I can get one table and show the output correctly with pandas, but I cannot get more than one table at a time.
import bs4 as bs
import urllib.request
import pandas as pd

source = urllib.request.urlopen('https://www.URLHERE.com').read()
soup = bs.BeautifulSoup(source, 'lxml')
tables = soup.select('table')[-1]
rows = tables.find_all('tr')
output = []
for row in rows:
    cols = row.find_all('td')
    cols = [item.text.strip() for item in cols]
    output.append([item for item in cols if item])
df = pd.DataFrame(output, columns=['1', '2', '3', '4', '5', '6'])
df = df.iloc[1:]
print(df)
If I remove the [-1] from my table variable then I get the error below.
AttributeError: 'list' object has no attribute 'find_all'
What do I need to change to get all the tables off the page?
You're on the right track already. As a commenter said, you'll need to find_all the tables, then apply the row logic you are already using to each table in a loop instead of just one table. Your code will look something like this:
tables = soup.find_all('table')
for table in tables:
    # individual table logic here
    rows = table.find_all('tr')
    for row in rows:
        pass  # individual row logic here
I took a closer look at this, and here is the sample code that I tested:
source = urllib.request.urlopen('URL').read()
soup = bs.BeautifulSoup(source, 'lxml')
tables = soup.select('table')
print("I found " + str(len(tables)) + " tables.")
all_rows = []
for table in tables:
    print("Searching for <tr> items...")
    rows = table.find_all('tr')
    print("Found " + str(len(rows)) + " rows.")
    for row in rows:
        all_rows.append(row)
print("In total I have got " + str(len(all_rows)) + " rows.")
# example of first row
print(all_rows[0])
A little explanation: the AttributeError you got when you removed [-1] happened because the tables variable was a list object, and a list doesn't have a find_all method.
Your approach with [-1] is okay; I assume you know that [-1] grabs the last item from a list. You have to do the same with all the elements, which is what the code above shows.
You might find it interesting to read about the for construct and iterables in Python: https://pythonfordatascience.org/for-loops-and-iterations-python/
Well, if you want to extract all the different tables present on a web page in one go, you should try:
tables = pd.read_html("<URL_HERE>")
tables will be a list of DataFrames, one for each table present on that page.
For more specific documentation, refer to the Pandas documentation.
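For example, a short usage sketch (the URL is the placeholder from the question, and which index you want depends on the page):

import pandas as pd

tables = pd.read_html('https://www.URLHERE.com')
print('Found ' + str(len(tables)) + ' tables.')
df = tables[-1]  # same "last table" choice as the original [-1] selector
print(df.head())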

Python Web Scraping Script not iterating over HTML table properly

I'm using BeautifulSoup to pull the elements of an HTML table into a Python dict. The problem I'm having is that, when I create the dict, the first record from the table is repeatedly loaded into the dict. Printing the variable rows shows the expected number of different records returned in the response, but only the first record is printed when print(d) is called.
import requests
from bs4 import BeautifulSoup as bs

url = 'http://host.com/user_activity?page=3'
r = requests.get(url)
#print(r.text)
soup = bs(r.text, 'lxml')
table = soup.find_all('table')[0]
rows = table.find_all('td')
#records = soup.find_all('td')
#print(table.prettify())
ct = 0
for record in rows:
    if ct < 20:
        keys = [th.get_text(strip=True) for th in table.find_all('th')]
        values = [td.get_text(strip=True) for td in rows]
        d = dict(zip(keys, values))
        print(d)
    ct += 1
I think you meant to get the header cells from the first row of the table (once, before the loop) and iterate over the tr elements instead of td.
You can also use a regular find() instead of find_all()[0] and enumerate() to handle the loop increment variable more nicely:
table = soup.find('table')
rows = table.find_all('tr')
headers = [th.get_text(strip=True) for th in rows[0].find_all('th')]
for ct, row in enumerate(rows[1:]):
    values = [td.get_text(strip=True) for td in row.find_all('td')]
    d = dict(zip(headers, values))
    print(d)
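If you want to keep the records instead of just printing them, a small extension of the same answer (not in the original) is to collect each dict into a list:

records = []
for row in rows[1:]:
    values = [td.get_text(strip=True) for td in row.find_all('td')]
    records.append(dict(zip(headers, values)))
# records is now one dict per data row, keyed by the header text
print(len(records))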
Apart from what alecxe has already shown, you can also do it like this using a CSS selector. Just make sure the table index is accurate: the first table, the second table, or whichever one you want to parse.
table = soup.select("table")[0]  # be sure to put the correct index here
for items in table.select("tr"):
    data = ' '.join([item.text for item in items.select("th,td")])
    print(data)

Forgetting something - Python BeautifulSoup and FinViz

I'm getting stuck trying to grab the text values off the a (anchor) tags. I've managed to isolate the target values but keep running into an error when I try to get_text().
import requests
from bs4 import BeautifulSoup

base_url = 'http://finviz.com/screener.ashx?v=152&s=ta_topgainers&o=price&c=0,1,2,3,4,5,6,7,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs={'id': 'screener-content'})
table = main_div.find('table')
sub = table.findAll('tr')
rows = sub[5].findAll('td')
for row in rows:
    data = row.a
    print data
Assuming you are actually trying to print data.get_text(), it would fail for some of the rows in rows because, in some cases, there are no child link elements in the td cells. You can check that a link was found beforehand:
for row in rows:
    link = row.a
    if link is not None:
        print(link.get_text())
Note that "row" and "rows" are probably not the best variable names since you are actually iterating over the "cells" - td elements.

Error with beautiful soup: list index out of range

I'm a very new programmer to Python, working on a webcrawler using urllib and BeautifulSoup. Please ignore the while loop at the top and the incrementation of i; I'm just running this test version for one page, but it will eventually cover a whole set. My problem is that this gets the soup but generates an error. I'm not sure that I'm collecting the table data correctly, but I hope this code can ignore the links and just write the text to a .csv file. For now I'm focused on just printing the text to the screen correctly.
line 17, in <module>
uspc = col[0].string
IndexError: list index out of range
Here is the code:
import urllib
from bs4 import BeautifulSoup

i = 125
while i == 125:
    url = "http://www.uspto.gov/web/patents/classification/cpc/html/us" + str(i) + "tocpc.html"
    print url + '\n'
    i += 1
    data = urllib.urlopen(url).read()
    print data
    # get the table data from dump
    # append to csv file
    soup = BeautifulSoup(data)
    table = soup.find("table", width='80%')
    for row in table.findAll('tr')[1:]:
        col = row.findAll('td')
        uspc = col[0].string
        cpc1 = col[1].string
        cpc2 = col[2].string
        cpc3 = col[3].string
        record = (uspc, cpc1, cpc2, cpc3)
        print "|".join(record)
In the end, I solved this problem by changing the following line:
for row in table.findAll('tr')[1:]:
to:
for row in table.findAll('tr')[2:]:
The error occurred because the first row of the table had split columns.
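A more defensive variant, as a sketch (not what was actually used), is to skip any row that doesn't have the four expected cells, so neither the header row nor a split row can raise the IndexError:

for row in table.findAll('tr'):
    col = row.findAll('td')
    if len(col) < 4:  # header and split rows don't have all four data cells
        continue
    # .string can be None when a cell holds nested tags, so fall back to ''
    record = tuple(c.string or '' for c in col[:4])
    print "|".join(record)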
