I'm using BeautifulSoup to pull the elements of an HTML table into a Python dict. The problem I'm having is that when I create the dict, the first record from the table is repeatedly loaded into the dict. Printing the variable rows shows the expected number of different records returned in the response, but only the first record is printed when print(d) is called.
import requests
from bs4 import BeautifulSoup as bs
url = 'http://host.com/user_activity?page=3'
r = requests.get(url)
#print(r.text)
soup = bs(r.text, 'lxml')
table = soup.find_all('table')[0]
rows = table.find_all('td')
#records = soup.find_all('td')
#print(table.prettify())
ct = 0
for record in rows:
    if ct < 20:
        keys = [th.get_text(strip=True) for th in table.find_all('th')]
        values = [td.get_text(strip=True) for td in rows]
        d = dict(zip(keys, values))
        print(d)
    ct += 1
I think you meant to get the header cells from the first row of the table (once, before the loop) and iterate over the tr elements instead of td.
You can also use a regular find() instead of find_all()[0] and enumerate() to handle the loop increment variable more nicely:
table = soup.find('table')
rows = table.find_all('tr')
headers = [th.get_text(strip=True) for th in rows[0].find_all('th')]
for ct, row in enumerate(rows[1:]):
    values = [td.get_text(strip=True) for td in row.find_all('td')]
    d = dict(zip(headers, values))
    print(d)
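If you want to keep the records around rather than just print them, a small follow-up sketch (using the same headers and rows as above) collects them into a list of dicts:
records = [dict(zip(headers, [td.get_text(strip=True) for td in row.find_all('td')]))
           for row in rows[1:]]
print(records[:3])  # the first three parsed records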
Apart from what alecxe has already shown, you can also do this using CSS selectors. Just make sure the table index is accurate: the first table, the second table, or whichever one you want to parse.
table = soup.select("table")[0]  # be sure to use the correct table index here
for items in table.select("tr"):
    data = ' '.join([item.text for item in items.select("th,td")])
    print(data)
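Since the original goal was a dict per record, a minimal variation of this selector approach (assuming the headers sit in th cells of the first tr) would be:
table = soup.select("table")[0]
trs = table.select("tr")
headers = [th.get_text(strip=True) for th in trs[0].select("th")]
for tr in trs[1:]:
    values = [td.get_text(strip=True) for td in tr.select("td")]
    print(dict(zip(headers, values)))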
I am using python 2.7 with beautifulsoup to read in a simple HTML table.
After reading in the table, I then try to access the returned data.
As far as I can see, a python list object is returned. But when I try to access the data using statements such as cell=row[0] I get an "IndexError: list index out of range" error.
from bs4 import BeautifulSoup
# read in HTML data
html = open("in.html").read()
soup = BeautifulSoup(html,"lxml")
table = soup.find("table")
output_rows = []
for table_row in table.findAll('tr'):
    columns = table_row.findAll('td')
    output_row = []
    for column in columns:
        output_row.append(column.text)
    output_rows.append(output_row)
# process some cell data
for row in output_rows:
    name = row[0]
    print name
    # fails with list index out of range error
I have come up with this code to parse each cell of a row into a variable, which I can then process further. But it's not very elegant... ideas for a more elegant solution, suitable for newbies, are welcome.
for x in range(len(row)):
    if x == 0:
        name = row[x]
        print name
    if x == 1:
        address = row[x]
        print address
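One newbie-friendly possibility is a sketch like the following, assuming each data row has exactly two cells (name, address). Skipping rows with a different shape also filters out the header row, which comes through as an empty list and causes the IndexError above:
for row in output_rows:
    if len(row) == 2:  # the header row has no td cells and is skipped
        name, address = row
        print name
        print address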
I scraped some data from a website. I need to do a simple if statement with two values from the list of strings, and I am totally lost. The values I need are both floats, but they are stored in the list as strings. The website URL is 'http://vixcentral.com/historical/?days=30'. Here is my code so far. I need the float values of the three most recent 'Contango 2/1' values for my if statements; check out the website to get a better idea of which values I need.
import requests
from bs4 import BeautifulSoup
data = requests.get('http://vixcentral.com/historical/?days=30')
soup = BeautifulSoup(data.text, 'html.parser')
data = []
for tr in soup.find_all('tr'):
    values = [td.text for td in tr.find_all('td')]
    data.append(values)
print(data)
You want to define a conversion function for string to float.
Also, I would not use the variable name data for the request, since you later overwrite it with a list. The convention is r = requests.get(...).
def str_to_float(s):
    try:
        return float(s)
    except ValueError:
        return s
r = requests.get('http://vixcentral.com/historical/?days=30')
soup = BeautifulSoup(r.text, 'html.parser')
data = [[str_to_float(td.text) for td in tr.find_all('td')]
        for tr in soup.find_all('tr')]
Then you can loop through data, which is a list of lists, to pull out the specific values you are looking for.
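For example, a sketch for grabbing the three most recent 'Contango 2/1' values; the column position is an assumption, so verify it against the live table:
# Assumption: 'Contango 2/1' is the last cell of each data row.
contango = [row[-1] for row in data if row and isinstance(row[-1], float)]
recent = contango[:3]  # or contango[-3:], depending on the table's row order
if recent[0] > recent[1] > recent[2]:
    print('contango has been rising')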
I am new to Python and I want to get the "price" column of data from a table, but I'm unable to retrieve it.
Currently what I'm doing:
# Libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://pythonscraping.com/pages/page3.html")
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
for row in table.find_all("tr"):
    col = row.find_all("td")
    print(col[2])
    print("---")
I keep getting a "list index out of range" error. I've read the documentation and tried a few different ways, but I can't seem to get it working.
Also, I am using Python 3.
The problem is that you are iterating over all tr elements inside the table, and there is one header tr at the beginning that you don't need, so just skip it:
# Libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://pythonscraping.com/pages/page3.html")
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
for row in table.find_all("tr")[1:]:
    col = row.find_all("td")
    print(col[2])
    print("---")
This probably means that one of the rows has no td tags. You could wrap the print (or whatever usage of col[2]) in a try/except block and ignore the cases where col is empty or has fewer than three items:
for row in table.find_all("tr"):
    col = row.find_all("td")
    try:
        print(col[2])
        print("---")
    except IndexError:
        pass
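Alternatively, you can test the cell count up front instead of catching the exception:
for row in table.find_all("tr"):
    col = row.find_all("td")
    if len(col) > 2:  # skip rows without a third cell, e.g. the header row
        print(col[2])
        print("---")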
I'm getting stuck trying to grab the text values off the a href tags. I've managed to isolate the target values but keep running into an error when I try to get_text().
import requests
from bs4 import BeautifulSoup
base_url = 'http://finviz.com/screener.ashx?v=152&s=ta_topgainers&o=price&c=0,1,2,3,4,5,6,7,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs = {'id':'screener-content'})
table = main_div.find('table')
sub = table.findAll('tr')
rows = sub[5].findAll('td')
for row in rows:
    data = row.a
    print data
Assuming you are actually trying to print data.get_text(), it would fail for some of the rows, because in some cases there are no child link elements in the td cells. You can check that a link was found beforehand:
for row in rows:
    link = row.a
    if link is not None:
        print(link.get_text())
Note that "row" and "rows" are probably not the best variable names since you are actually iterating over the "cells" - td elements.
I am trying to read in html websites and extract their data. For example, I would like to read in the EPS (earnings per share) for the past 5 years of companies. Basically, I can read it in and can use either BeautifulSoup or html2text to create a huge text block. I then want to search the file -- I have been using re.search -- but can't seem to get it to work properly. Here is the line I am trying to access:
EPS (Basic)\n13.4620.6226.6930.1732.81\n\n
So I would like to create a list called EPS = [13.46, 20.62, 26.69, 30.17, 32.81].
Thanks for any help.
from stripogram import html2text
from urllib import urlopen
import re
from BeautifulSoup import BeautifulSoup
ticker_symbol = 'goog'
url = 'http://www.marketwatch.com/investing/stock/'
full_url = url + ticker_symbol + '/financials' #build url
text_soup = BeautifulSoup(urlopen(full_url).read()) #read in
text_parts = text_soup.findAll(text=True)
text = ''.join(text_parts)
eps = re.search("EPS\s+(\d+)", text)
if eps is not None:
    print eps.group(1)
It's not good practice to use regular expressions for parsing HTML. Use the BeautifulSoup parser instead: find the cell with the rowTitle class and 'EPS (Basic)' text in it, then iterate over the next siblings with the valueCell class:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
url = 'http://www.marketwatch.com/investing/stock/goog/financials'
text_soup = BeautifulSoup(urlopen(url).read()) #read in
titles = text_soup.findAll('td', {'class': 'rowTitle'})
for title in titles:
    if 'EPS (Basic)' in title.text:
        print [td.text for td in title.findNextSiblings(attrs={'class': 'valueCell'}) if td.text]
prints:
['13.46', '20.62', '26.69', '30.17', '32.81']
Hope that helps.
I would take a very different approach. We use lxml for scraping html pages.
One of the reasons we switched was that BeautifulSoup was not being maintained for a while, or I should say updated.
In my test I ran the following:
import requests
from lxml import html
from collections import OrderedDict
page_as_string = requests.get('http://www.marketwatch.com/investing/stock/goog/financials').content
tree = html.fromstring(page_as_string)
Now I looked at the page and I see the data is divided into two tables. Since you want EPS, I noted that it is in the second table. We could write some code to sort this out programmatically but I will leave that for you.
tables = [ e for e in tree.iter() if e.tag == 'table']
eps_table = tables[-1]
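If you'd rather locate the right table programmatically instead of hard-coding the index, one minimal sketch is to pick the table whose text actually contains the row label you care about:
# Assumption: exactly one table on the page mentions 'EPS (Basic)'.
eps_table = next(t for t in tables if 'EPS (Basic)' in t.text_content())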
Now I noticed that the first row has the column headings, so I want to separate all of the rows:
table_rows = [ e for e in eps_table.iter() if e.tag == 'tr']
Now let's get the column headings:
column_headings =[ e.text_content() for e in table_rows[0].iter() if e.tag == 'th']
Finally, we can map the column headings to the row labels and cell values:
my_results = []
for row in table_rows[1:]:
    cell_content = [e.text_content() for e in row.iter() if e.tag == 'td']
    temp_dict = OrderedDict()
    for numb, cell in enumerate(cell_content):
        if numb == 0:
            temp_dict['row_label'] = cell.strip()
        else:
            dict_key = column_headings[numb]
            temp_dict[dict_key] = cell
    my_results.append(temp_dict)
Now, to access the results:
for row_dict in my_results:
    if row_dict['row_label'] == 'EPS (Basic)':
        for key in row_dict:
            print key, ':', row_dict[key]
row_label : EPS (Basic)
2008 : 13.46
2009 : 20.62
2010 : 26.69
2011 : 30.17
2012 : 32.81
5-year trend :
Now there is still more to do; for example, I did not test for squareness (that the number of cells in each row is equal).
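A minimal squareness check, assuming the header row defines the expected width, might look like:
expected = len(column_headings)
for numb, row in enumerate(table_rows[1:]):
    cells = [e for e in row.iter() if e.tag == 'td']
    if len(cells) != expected:
        print 'row', numb, 'has', len(cells), 'cells, expected', expected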
Finally, I am a novice and I suspect others will advise more direct methods of getting at these elements (XPath or cssselect), but this does work and it gets you everything from the table in a nicely structured manner.
I should add that every row from the table is available, in the original row order. The first item (which is a dictionary) in the my_results list has the data from the first row, the second item has the data from the second row, and so on.
When I need a new build of lxml I visit a page maintained by a really nice guy at UC-IRVINE
I hope this helps
from bs4 import BeautifulSoup
import urllib2
import lxml
import pandas as pd
url = 'http://markets.ft.com/research/Markets/Tearsheets/Financials?s=CLLN:LSE&subview=BalanceSheet'
soup = BeautifulSoup(urllib2.urlopen(url).read())
table = soup.find('table', {'data-ajax-content' : 'true'})
data = []
for row in table.findAll('tr'):
    cells = row.findAll('td')
    cols = [ele.text.strip() for ele in cells]
    data.append([ele for ele in cols if ele])
df = pd.DataFrame(data)
print df
dictframe = df.to_dict()
print dictframe
The above code builds a DataFrame from the web page's table and then uses it to create a Python dictionary.
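Note that to_dict() defaults to a column-oriented mapping; if you would rather have one dict per table row, pandas accepts an orient argument:
row_dicts = df.to_dict('records')  # a list with one dict per row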