I'm getting stuck trying to grab the text values from the a href tags. I've managed to isolate the target values but keep running into an error when I try get_text().
import requests
from bs4 import BeautifulSoup

base_url = 'http://finviz.com/screener.ashx?v=152&s=ta_topgainers&o=price&c=0,1,2,3,4,5,6,7,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs={'id': 'screener-content'})
table = main_div.find('table')
sub = table.findAll('tr')
rows = sub[5].findAll('td')
for row in rows:
    data = row.a
    print(data)
Assuming you are actually trying to print data.get_text(), it would fail for some of the items in rows, because in some cases there are no child link elements in the td cells. You can check that a link was found beforehand:
for row in rows:
    link = row.a
    if link is not None:
        print(link.get_text())
Note that "row" and "rows" are probably not the best variable names since you are actually iterating over the "cells" - td elements.
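With clearer naming, the same idea looks like this against a small inline snippet (toy markup standing in for the Finviz table, not the live page):

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the screener table (hypothetical data)
html = """
<table>
  <tr><td><a href="/q1">AAPL</a></td><td>1.23</td></tr>
  <tr><td><a href="/q2">MSFT</a></td><td>4.56</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

tickers = []
for row in soup.find_all("tr"):       # each table row
    for cell in row.find_all("td"):   # each cell in that row
        link = cell.a
        if link is not None:          # skip cells without a child link
            tickers.append(link.get_text())

print(tickers)  # ['AAPL', 'MSFT']
```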
I am relatively new to programming and completely new to Stack Overflow. I thought a good way to learn would be with a Python and Excel based project, but I am stuck. My plan was to scrape a website of addresses using Beautiful Soup, look up the Zillow value estimates for those addresses, and populate them into tabular form in Excel. I am unable to figure out how to get the addresses (the HTML on the site I am trying to scrape seems pretty messy), but I was able to pull Google address links from the site. Sorry if this is a very basic question; any advice would help though:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import pandas as pd

req = Request("http://www.tjsc.com/Sales/TodaySales")
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "lxml")

count = 0
links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
    count = count + 1

print(links)
print("count is", count)

po = links
pd.DataFrame(po).to_excel('todaysale.xlsx', header=False, index=False)
You are on the right track. Instead of 'a', you need to use a different HTML tag, 'td', for the rows (and 'th' for the column names). Here is one way to implement it: the list_slice function converts every 14 elements into one row, since the original table has 14 columns.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

url = "http://www.tjsc.com/Sales/TodaySales"
r = requests.get(url, verify=False)
text = r.text
soup = bs(text, 'lxml')

# Get column headers from the html file
header = []
for c_name in soup.findAll('th'):
    header.append(c_name)

# clean up the extracted header content
header = [h.contents[0].strip() for h in header]

# get each row of the table
row = []
for link in soup.find_all('td'):
    row.append(link.get_text().strip())

def list_slice(my_list, step):
    """This function takes any list, and divides it to chunks of size of "step"
    """
    return [my_list[x:x + step] for x in range(0, len(my_list), step)]

# creating the final dataframe
df = pd.DataFrame(list_slice(row, 14), columns=header[:14])
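As a quick sanity check, list_slice behaves like this on any flat list (toy data with step 3 here, rather than the table's 14 columns):

```python
def list_slice(my_list, step):
    """Divide any list into chunks of size "step"."""
    return [my_list[x:x + step] for x in range(0, len(my_list), step)]

flat = ['a', 'b', 'c', 'd', 'e', 'f']
print(list_slice(flat, 3))  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```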
I'm using BeautifulSoup to pull the elements of an HTML table into a Python dict. The problem I'm having is that, when I create the dict, the first record from the table is repeatedly loaded into the dict. Printing the variable rows shows the expected number of distinct records returned in the response, but only the first record is printed when print(d) is called.
import requests
from bs4 import BeautifulSoup as bs

url = 'http://host.com/user_activity?page=3'
r = requests.get(url)
#print(r.text)
soup = bs(r.text, 'lxml')
table = soup.find_all('table')[0]
rows = table.find_all('td')
#records = soup.find_all('td')
#print(table.prettify())
ct = 0
for record in rows:
    if ct < 20:
        keys = [th.get_text(strip=True) for th in table.find_all('th')]
        values = [td.get_text(strip=True) for td in rows]
        d = dict(zip(keys, values))
        print(d)
    ct += 1
I think you meant to get the header cells from the first row of the table (once, before the loop) and iterate over the tr elements instead of td.
You can also use a regular find() instead of find_all()[0] and enumerate() to handle the loop increment variable more nicely:
table = soup.find('table')
rows = table.find_all('tr')

headers = [th.get_text(strip=True) for th in rows[0].find_all('th')]
for ct, row in enumerate(rows[1:]):
    values = [td.get_text(strip=True) for td in row.find_all('td')]
    d = dict(zip(headers, values))
    print(d)
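Run against a small inline table (toy markup, not the original page), the corrected approach produces one dict per data row:

```python
from bs4 import BeautifulSoup

# Toy table standing in for the real user_activity page
html = """
<table>
  <tr><th>User</th><th>Action</th></tr>
  <tr><td>bob</td><td>login</td></tr>
  <tr><td>eve</td><td>logout</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

table = soup.find('table')
rows = table.find_all('tr')
# headers come from the first row, once, before the loop
headers = [th.get_text(strip=True) for th in rows[0].find_all('th')]

records = []
for row in rows[1:]:
    values = [td.get_text(strip=True) for td in row.find_all('td')]
    records.append(dict(zip(headers, values)))

print(records)
# [{'User': 'bob', 'Action': 'login'}, {'User': 'eve', 'Action': 'logout'}]
```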
Apart from what sir alecxe has already shown, you can also do it like this using a selector. Just make sure the table index is accurate, as in whether it is the first table, the second table, or another one you want to parse.
table = soup.select("table")[0]  # be sure to put here the correct index
for items in table.select("tr"):
    data = ' '.join([item.text for item in items.select("th,td")])
    print(data)
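For example, with a minimal inline table (toy markup), the selector loop prints one space-joined line per row, header row included:

```python
from bs4 import BeautifulSoup

# Toy table standing in for the real page
html = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alice</td><td>10</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

table = soup.select("table")[0]
lines = []
for items in table.select("tr"):
    # "th,td" matches both header and data cells in document order
    lines.append(' '.join(item.text for item in items.select("th,td")))

print(lines)  # ['Name Score', 'Alice 10']
```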
Complex Table link
I have used the bs4, pandas and lxml libraries to parse the HTML table above, but I am not having success. With pandas I tried skipping rows and setting the header to 0; however, the result is a highly unstructured DataFrame, and it also seems that some data is missing.
With the other 2 libraries I tried to use selectors and even the XPath of the tbody section, but I receive an empty list in both cases.
This is what I want to retrieve:
Can anyone give me a hand with how I can scrape that data?
Thank you!
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd

page = urlopen('https://transparency.entsoe.eu/generation/r2/actualGenerationPerProductionType/show?name=&defaultValue=true&viewType=TABLE&areaType=BZN&atch=false&datepicker-day-offset-select-dv-date-from_input=D&dateTime.dateTime=09.08.2017%2000:00%7CUTC%7CDAYTIMERANGE&dateTime.endDateTime=09.08.2017%2000:00%7CUTC%7CDAYTIMERANGE&area.values=CTY%7C10YES-REE------0!BZN%7C10YES-REE------0&productionType.values=B01&productionType.values=B02&productionType.values=B03&productionType.values=B04&productionType.values=B05&productionType.values=B06&productionType.values=B07&productionType.values=B08&productionType.values=B09&productionType.values=B10&productionType.values=B11&productionType.values=B12&productionType.values=B13&productionType.values=B14&productionType.values=B20&productionType.values=B15&productionType.values=B16&productionType.values=B17&productionType.values=B18&productionType.values=B19&dateTime.timezone=UTC&dateTime.timezone_input=UTC')
soup = BeautifulSoup(page.read())
table = soup.find('tbody')

res = []
row = []
for tr in table.find_all('tr'):
    for td in tr.find_all('td'):
        row.append(td.text)
    res.append(row)
    row = []

df = pd.DataFrame(data=res)
Then add column names with df.columns and drop empty columns.
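Those two clean-up steps might look like this on a toy result list (the column names here are made up for illustration, not the real ENTSO-E headers):

```python
import pandas as pd

# Toy result standing in for the scraped rows (hypothetical data)
res = [['Biomass', '20', ''], ['Wind', '35', '']]
df = pd.DataFrame(data=res)

# Assign column names, then drop columns that are entirely empty
df.columns = ['Production Type', 'MW', 'Unused']
df = df.replace('', pd.NA).dropna(axis=1, how='all')

print(df.columns.tolist())  # ['Production Type', 'MW']
```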
EDIT: Suggest this modified for-loop. (BillBell)
>>> for tr in table.find_all('tr'):
...     for td in tr.find_all('td'):
...         row.append(td.text.strip())
...     res.append(row)
...     row = []
The original form of the for statement failed compilation.
The original form of the append left newlines and blanks in the constants.
I am new to Python and I want to get the "price" column of data from a table; however, I'm unable to retrieve that data.
Currently what I'm doing:
# Libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/page3.html")
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
for row in table.find_all("tr"):
    col = row.find_all("td")
    print(col[2])
    print("---")
I keep getting a "list index out of range" error. I've read the documentation and tried a few different ways, but I can't seem to get it down.
Also, I am using Python3.
The problem is that you are iterating over every tr inside the table, and there is one header tr at the beginning (with th cells rather than td) that you don't need, so just skip it:
# Libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/page3.html")
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
for row in table.find_all("tr")[1:]:
    col = row.find_all("td")
    print(col[2])
    print("---")
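To see why the [1:] slice matters, here is the same pattern against a minimal inline table (toy markup, not the real page); without the slice, the header row has no td cells and col[2] raises IndexError:

```python
from bs4 import BeautifulSoup

# Toy table: a header row of th cells, then one data row
html = """
<table>
  <tr><th>Item</th><th>Qty</th><th>Price</th></tr>
  <tr><td>Pen</td><td>2</td><td>$1.50</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

prices = []
for row in table.find_all("tr")[1:]:  # [1:] skips the header row
    col = row.find_all("td")
    prices.append(col[2].get_text())

print(prices)  # ['$1.50']
```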
It probably means that one of the rows has no td tags. You could wrap the print (or whatever usage of col[2]) in a try/except block and ignore cases where col is empty or has fewer than three items:
for row in table.find_all("tr"):
    col = row.find_all("td")
    try:
        print(col[2])
        print("---")
    except IndexError:
        pass
Currently, my code is parsing through the link and printing all of the information from the website. I only want to print a single specific line from the website. How can I go about doing that?
Here's my code:
from bs4 import BeautifulSoup
import urllib.request

r = urllib.request.urlopen("Link goes here").read()
soup = BeautifulSoup(r, "html.parser")

# This is what I want to change. I currently have it printing everything.
# I just want a specific line from the website
print(soup.prettify())
li = soup.prettify().split('\n')
print(li[line_number - 1])
Don't use prettified output to try and parse tds; select the tag specifically. If an attribute is unique then use that; if the class name is unique then just use that:
td = soup.select_one("td.content")
td = soup.select_one("td[colspan=3]")
If it was the fourth td:
td = soup.select_one("td:nth-of-type(4)")
If it is in a specific table, then select the table and then find the td within it. Trying to split the HTML into lines and indexing is actually worse than using a regex to parse HTML.
You can get the specific td using the text of the bold tag preceding it, i.e. Department of Finance Building Classification:
In [19]: from bs4 import BeautifulSoup
In [20]: import urllib.request
In [21]: url = "http://a810-bisweb.nyc.gov/bisweb/PropertyProfileOverviewServlet?boro=1&houseno=1&street=park+ave&go2=+GO+&requestid=0"
In [22]: r = urllib.request.urlopen(url).read()
In [23]: soup = BeautifulSoup(r, "html.parser")
In [24]: print(soup.find("b",text="Department of Finance Building Classification:").find_next("td").text)
O6-OFFICE BUILDINGS
Pick the nth table and row:
In [25]: print(soup.select_one("table:nth-of-type(8) tr:nth-of-type(5) td[colspan=3]").text)
O6-OFFICE BUILDINGS