Get column from a table with Python and Beautiful Soup

I am new to Python and want to get the "price" column of data from a table, but I'm unable to retrieve it.
This is what I'm currently doing:
# Libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://pythonscraping.com/pages/page3.html")
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
for row in table.find_all("tr"):
    col = row.find_all("td")
    print(col[2])
    print("---")
I keep getting a "list index out of range" error. I've read the documentation and tried a few different approaches, but I can't get it to work.
I am using Python 3.

The problem is that you iterate over every tr in the table, including the header row at the top. That row has no td cells (header cells are typically th), so row.find_all("td") returns an empty list and col[2] raises an IndexError. Skip the first row:
# Libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://pythonscraping.com/pages/page3.html")
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
# [1:] skips the header row
for row in table.find_all("tr")[1:]:
    col = row.find_all("td")
    print(col[2])
    print("---")

This probably means that one of the rows has no td tags. You could wrap the print (or whatever else uses col[2]) in a try/except block and ignore rows where col is empty or has fewer than three items:
for row in table.find_all("tr"):
    col = row.find_all("td")
    try:
        print(col[2])
        print("---")
    except IndexError:
        pass
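
As a variation on the two answers above, the length of the cell list can be checked explicitly instead of slicing off the header or catching the exception. The following is a minimal sketch, not the original posters' code: SAMPLE_HTML is a made-up stand-in for the real page (whose markup may differ) and column_values is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's table; the real markup may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Item Title</th><th>Description</th><th>Cost</th></tr>
  <tr><td>Vegetable Basket</td><td>Wicker basket</td><td>$15.00</td></tr>
  <tr><td>Russian Nesting Dolls</td><td>Hand-painted</td><td>$10,000.52</td></tr>
</table>
"""


def column_values(table, index):
    """Return the text of cell `index` from every row that has enough td cells."""
    values = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # Header rows (th only) and short rows have too few td cells and are skipped.
        if len(cells) > index:
            values.append(cells[index].get_text(strip=True))
    return values


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
prices = column_values(soup.find("table"), 2)
print(prices)
```

Using get_text(strip=True) returns the cell's text rather than the whole td tag, which is usually what a "column of data" means.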

Related

Python Web Scraping Script not iterating over HTML table properly

I'm using BeautifulSoup to pull the elements of an HTML table into a Python dict. The problem is that when I create the dict, the first record from the table is loaded into it repeatedly. Printing the variable rows shows the expected number of different records returned in the response, but only the first record is printed when print(d) is called.
import requests
from bs4 import BeautifulSoup as bs
url = 'http://host.com/user_activity?page=3'
r = requests.get(url)
#print(r.text)
soup = bs(r.text, 'lxml')
table = soup.find_all('table')[0]
rows = table.find_all('td')
#records = soup.find_all('td')
#print(table.prettify())
ct = 0
for record in rows:
    if ct < 20:
        keys = [th.get_text(strip=True) for th in table.find_all('th')]
        values = [td.get_text(strip=True) for td in rows]
        d = dict(zip(keys, values))
        print(d)
    ct += 1

I think you meant to get the header cells from the first row of the table (once, before the loop) and iterate over the tr elements instead of td.
You can also use a regular find() instead of find_all()[0] and enumerate() to handle the loop increment variable more nicely:
table = soup.find('table')
rows = table.find_all('tr')
headers = [th.get_text(strip=True) for th in rows[0].find_all('th')]
for ct, row in enumerate(rows[1:]):
    values = [td.get_text(strip=True) for td in row.find_all('td')]
    d = dict(zip(headers, values))
    print(d)

Apart from what alecxe has already shown, you can also do this with CSS selectors. Just make sure the table index is correct, i.e. that it points at the first table, the second table, or whichever one you want to parse.
table = soup.select("table")[0]  # be sure to put the correct index here
for items in table.select("tr"):
    data = ' '.join([item.text for item in items.select("th,td")])
    print(data)
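
To check the header-plus-rows approach without hitting a live URL, it can be run against an inline table. This is a minimal sketch under assumptions: SAMPLE_HTML and its column names are invented for illustration, and table_to_records is a hypothetical helper that collects the dicts in a list instead of printing them.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real user_activity page is not shown in the question.
SAMPLE_HTML = """
<table>
  <tr><th>User</th><th>Action</th></tr>
  <tr><td>alice</td><td>login</td></tr>
  <tr><td>bob</td><td>logout</td></tr>
</table>
"""


def table_to_records(table):
    """Build one dict per data row, keyed by the header cells of the first row."""
    rows = table.find_all("tr")
    # Read the headers once, before the loop.
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = []
    for row in rows[1:]:
        values = [td.get_text(strip=True) for td in row.find_all("td")]
        records.append(dict(zip(headers, values)))
    return records


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = table_to_records(soup.find("table"))
print(records)
```

Each row now produces its own dict because the values come from that row's td cells rather than from every td in the table.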

Looking to get a json and csv file from this

import urllib.request
import bs4 as bs
sauce = urllib.request.urlopen("http://www.nhl.com/scores/htmlreports/20172018/TH020070.HTM").read()
soup = bs.BeautifulSoup(sauce, "html.parser")
table = soup.table
table = soup.find('table')
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
I am trying to output this to CSV and JSON. How would I do both (not at the same time)? Eventually, once it is properly formatted, I would like to dump it straight into Postgres. I'm new to Python, so any help and suggestions would be appreciated. I previously got help with CSV output using pandas, but I can't get it formatted the way I want with pandas, although I've been told it's much easier.

I'm assuming you want to output the row variable from each iteration to JSON or CSV.
For JSON, you can collect all rows in a list and dump that list to a file. Something like:
import json
# Your logic here
rows = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    rows.append(row)
with open("out", "w") as fp:
    json.dump(rows, fp)
For CSV, you can use similar logic.
Check out the documentation:
https://docs.python.org/2/library/csv.html
https://docs.python.org/2/library/json.html
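
The CSV case mentioned above can be sketched with the standard library's csv.writer. This is an illustration under assumptions, not the answerer's code: the sample rows are made up to stand in for the scraped rows list, write_rows_csv is a hypothetical helper name, and an in-memory buffer is used so the example runs without touching the disk.

```python
import csv
import io


def write_rows_csv(rows, fp):
    """Write a list of row lists to an already open text file as CSV."""
    writer = csv.writer(fp)
    writer.writerows(rows)


# Hypothetical sample data standing in for the scraped `rows` list.
rows = [["Player", "Goals"], ["A. Example", "2"], ["B. Sample", "0"]]

buffer = io.StringIO()
write_rows_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file in Python 3, the same helper can be called inside `with open("out.csv", "w", newline="") as fp:`; the csv documentation recommends newline="" so the writer controls line endings itself.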

Parse complex multi-header html table with pandas and bs4

Complex Table link
I have used the bs4, pandas and lxml libraries to parse the HTML table above, but without success. With pandas I tried skipping rows and setting header to 0, but the result is a highly unstructured DataFrame, and some data also seems to be missing.
With the other two libraries I tried selectors and even the XPath from the tbody section, but I get an empty list in both cases.
This would be what I want to retrieve:
Can anyone give me a hand with how I can scrape that data?
Thank you!

from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd
page = urlopen('https://transparency.entsoe.eu/generation/r2/actualGenerationPerProductionType/show?name=&defaultValue=true&viewType=TABLE&areaType=BZN&atch=false&datepicker-day-offset-select-dv-date-from_input=D&dateTime.dateTime=09.08.2017%2000:00%7CUTC%7CDAYTIMERANGE&dateTime.endDateTime=09.08.2017%2000:00%7CUTC%7CDAYTIMERANGE&area.values=CTY%7C10YES-REE------0!BZN%7C10YES-REE------0&productionType.values=B01&productionType.values=B02&productionType.values=B03&productionType.values=B04&productionType.values=B05&productionType.values=B06&productionType.values=B07&productionType.values=B08&productionType.values=B09&productionType.values=B10&productionType.values=B11&productionType.values=B12&productionType.values=B13&productionType.values=B14&productionType.values=B20&productionType.values=B15&productionType.values=B16&productionType.values=B17&productionType.values=B18&productionType.values=B19&dateTime.timezone=UTC&dateTime.timezone_input=UTC')
soup = BeautifulSoup(page.read(), "html.parser")
table = soup.find('tbody')
res = []
row = []
for tr in table.find_all('tr'):
    for td in tr.find_all('td'):
        row.append(td.text)
    res.append(row)
    row = []
df = pd.DataFrame(data=res)
Then add column names with df.columns and drop empty columns.
EDIT: I suggest this modified for-loop. (BillBell)
>>> for tr in table.find_all('tr'):
...     for td in tr.find_all('td'):
...         row.append(td.text.strip())
...     res.append(row)
...     row = []
The original form of the for statement failed compilation.
The original form of the append left newlines and blanks in the cell values.

Forgetting something - Python BeautifulSoup and FinViz

I'm stuck trying to grab the text values of the a tags. I've managed to isolate the target values, but I keep running into an error when I call get_text().
import requests
from bs4 import BeautifulSoup
base_url = 'http://finviz.com/screener.ashx?v=152&s=ta_topgainers&o=price&c=0,1,2,3,4,5,6,7,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs = {'id':'screener-content'})
table = main_div.find('table')
sub = table.findAll('tr')
rows = sub[5].findAll('td')
for row in rows:
    data = row.a
    print(data)

Assuming you are actually trying to print data.get_text(), it would fail for some of the cells in rows, because some td cells have no child link element and row.a is None for them. You can check that a link was found beforehand:
for row in rows:
    link = row.a
    if link is not None:
        print(link.get_text())
Note that "row" and "rows" are probably not the best variable names since you are actually iterating over the "cells" - td elements.
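
The None check can be tried on a small inline row. This is a minimal sketch: SAMPLE_HTML is invented for illustration and is not FinViz's real markup, and the loop variable is named cell to follow the naming note above.

```python
from bs4 import BeautifulSoup

# Hypothetical row: two cells contain a link, two do not.
SAMPLE_HTML = """
<table><tr>
  <td>1</td>
  <td><a href="quote.ashx?t=ABC">ABC</a></td>
  <td><a href="quote.ashx?t=ABC">Example Corp</a></td>
  <td></td>
</tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
cells = soup.find("tr").find_all("td")

# cell.a is None when the td has no <a> descendant, so filter those out first.
link_texts = [cell.a.get_text(strip=True) for cell in cells if cell.a is not None]
print(link_texts)
```

Calling get_text() on a cell without a link would raise AttributeError, since None has no such method; the filter avoids that.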

Print specific line (Beautifulsoup)

Currently, my code is parsing through the link and printing all of the information from the website. I only want to print a single specific line from the website. How can I go about doing that?
Here's my code:
from bs4 import BeautifulSoup
import urllib.request
r = urllib.request.urlopen("Link goes here").read()
soup = BeautifulSoup(r, "html.parser")
# This is what I want to change. I currently have it printing everything.
# I just want a specific line from the website
print (soup.prettify())

# line_number is the 1-based number of the line you want
li = soup.prettify().split('\n')
print(li[line_number - 1])

Don't use the pretty-printed output to try to parse tds. Select the tag specifically: if an attribute is unique, use that; if the class name is unique, use that:
td = soup.select_one("td.content")
td = soup.select_one('td[colspan="3"]')
If it was the fourth td:
td = soup.select_one("td:nth-of-type(4)")
If it is in a specific table, select the table first and then find the td inside it. Splitting the HTML into lines and indexing into them is actually worse than using a regex to parse HTML.
You can get the specific td using the text of the bold tag that precedes it, i.e. "Department of Finance Building Classification:":
In [19]: from bs4 import BeautifulSoup
In [20]: import urllib.request
In [21]: url = "http://a810-bisweb.nyc.gov/bisweb/PropertyProfileOverviewServlet?boro=1&houseno=1&street=park+ave&go2=+GO+&requestid=0"
In [22]: r = urllib.request.urlopen(url).read()
In [23]: soup = BeautifulSoup(r, "html.parser")
In [24]: print(soup.find("b",text="Department of Finance Building Classification:").find_next("td").text)
O6-OFFICE BUILDINGS
Pick the nth table and row:
In [25]: print(soup.select_one('table:nth-of-type(8) tr:nth-of-type(5) td[colspan="3"]').text)
O6-OFFICE BUILDINGS
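
The label-then-next-cell lookup can be wrapped in a small function and tried on inline HTML. This is a sketch under assumptions: SAMPLE_HTML is a simplified, made-up imitation of the property page, value_after_label is a hypothetical helper name, and the string= argument is used here in place of text=, which newer Beautiful Soup releases treat as the older spelling.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page has many more tables and rows.
SAMPLE_HTML = """
<table>
  <tr><td><b>Owner:</b></td><td>Example Owner</td></tr>
  <tr><td><b>Building Classification:</b></td><td colspan="3">O6-OFFICE BUILDINGS</td></tr>
</table>
"""


def value_after_label(soup, label):
    """Return the text of the first td that follows the <b> tag whose text is `label`."""
    bold = soup.find("b", string=label)
    if bold is None:
        return None  # label not present on the page
    cell = bold.find_next("td")
    return cell.get_text(strip=True) if cell is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
classification = value_after_label(soup, "Building Classification:")
print(classification)

# The same cell reached with an attribute selector; the value is quoted
# because an unquoted value starting with a digit is not a valid CSS identifier.
by_selector = soup.select_one('td[colspan="3"]').get_text(strip=True)
print(by_selector)
```

Returning None for a missing label keeps the caller from hitting an AttributeError when the page layout changes.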
