How to select a specific table from a webpage with Python

I am a newbie in programming and Python, but I want to parse HTML in my Python script.
Here is the webpage:
http://stock.finance.sina.com.cn/hkstock/finance/00759.html
Question 1:
This page shows the financial information for a particular share. There are four tables:
Financial Summary,
Balance Sheet,
Cash Flow,
Income Statement.
I want to extract the information in tables 3 and 4. Here is my code:
import urllib
from bs4 import BeautifulSoup

url = 'http://stock.finance.sina.com.cn/hkstock/finance/00759.html'
html = urllib.urlopen(url).read()  # .read() reads the whole response into one string
soup = BeautifulSoup(html, "lxml")
table = soup.find("table", {"class": "tab05"})
for row in table.findAll("tr"):
    print row.findAll("td")
But this code can only get the first table's information. How can I change the code to get the third and fourth tables? I found that those four tables do not have unique ids or class names, so I don't know how to locate them individually.
Question 2:
Also, this is a Simplified Chinese webpage; how do I keep the original text in the output?
Question 3:
On the upper right corner of each table, there is a drop-down menu for selecting the appropriate period, namely: "All", "Whole Year", "Half Year", "First Quarter" and "Third Quarter". Is urllib able to change this drop-down menu?
Thank you very much.

According to the website, all four tables have the class name "tab05".
Therefore, all you have to do is change the .find call to .findAll on soup, and then all four tables can be accessed.
import urllib
from bs4 import BeautifulSoup

url = 'http://stock.finance.sina.com.cn/hkstock/finance/00759.html'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
tables = soup.findAll("table", {"class": "tab05"})
print len(tables)  # 4
for table in tables:
    for row in table.findAll("tr"):
        for col in row.findAll("td"):
            print col.getText()
As for the encoding of Simplified Chinese, print col.getText() will display the correct characters in the terminal. If you want to write them to a file, you have to encode the string to gb2312:
f.write(col.getText().encode('gb2312'))
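For example, a minimal sketch (Python 2, reusing the tables list from the snippet above) that writes every cell to a text file; the file name tables.txt is only an illustration:
# Minimal sketch: write all table cells to a file, one comma-separated line per <tr>.
# Characters outside the gb2312 range would raise an error; adjust the encoding if needed.
with open('tables.txt', 'w') as f:
    for table in tables:
        for row in table.findAll("tr"):
            cells = [col.getText().encode('gb2312') for col in row.findAll("td")]
            f.write(','.join(cells) + '\n')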
For the third question, since the data are rendered by a JavaScript function written in datatable.js, I don't think it is possible to get all of them with urllib alone. You will need to check out another library that can execute the page's JavaScript.
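One commonly used option (not part of the original answer) is Selenium, which drives a real browser and can interact with the drop-down. The sketch below is only an assumption of how that could look; the element locator and option index are hypothetical placeholders, since the page's actual markup was not inspected here, and it requires a browser driver such as chromedriver.
# Hedged sketch using Selenium (not from the original answer).
# The locator and option index are placeholders.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('http://stock.finance.sina.com.cn/hkstock/finance/00759.html')
time.sleep(3)  # crude wait for the JavaScript-rendered tables; WebDriverWait would be cleaner

period = Select(driver.find_element(By.TAG_NAME, 'select'))  # placeholder locator for the period drop-down
period.select_by_index(1)  # placeholder: choose one of the period options
time.sleep(3)  # wait for the tables to refresh

soup = BeautifulSoup(driver.page_source, "lxml")
tables = soup.findAll("table", {"class": "tab05"})
driver.quit()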

Thanks for your reply.
I may have misunderstood your meaning. I rewrote the code as follows:
tables = soup.findAll("table", {"class": "tab05"})
print len(tables)
for row in tables[0].findAll("tr"):
    for col in row.findAll("td"):
        print col.getText()
The result of "len(tables)" is 1. Only the first table can be accessed.
Also I found that if I use
for row in tables[0].findAll("tr"):
    for col in row.findAll("td"):
I cannot get all of the information in that table. The last figure this code returns is "-45.7852", which is only about half of the table.

Related

Beautiful Soup to scrape data

I'm trying to scrape the EPS Estimates and EPS Earnings History (1st and 3rd tables) from Yahoo Finance into an existing CSV file using BeautifulSoup. https://uk.finance.yahoo.com/quote/MSFT/analysis?p=MSFT
I have made a start, but I am struggling to pull the exact data that I need; I am guessing I will need a for loop across the rows and td tags.
from requests import get
from bs4 import BeautifulSoup

index = 'MSFT'
url = 'https://uk.finance.yahoo.com/quote/' + index + '/analysis?p=' + index
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
EP = soup.find('table', attrs={'class': "W(100%)"})
print(EP)
This appears to be getting only the first table, but I am not sure how to write the loop to get the appropriate data. Looking at the HTML, it looks like both the first and third tables have the same class name, so I can't use that to go straight to the appropriate table.
Another idea I had is searching for all tables on the page and putting them into a list. I could then select the correct index, but I'm not sure how I would do that in code.
Replace soup.find with soup.find_all(). It returns a list of all the tables, which you can then iterate.
EPs = soup.find_all('table', attrs={'class':"W(100%)"})
for EP in EPs:
    ...
Your first and third tables would be EPs[0] and EPs[2] if that is what you are looking for.
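For instance, a minimal sketch (not part of the original answer) that writes the rows of the first and third tables to a CSV file; the output file name is only an illustration:
# Minimal sketch: dump the first and third tables to a CSV file.
# 'eps_tables.csv' is only an example name; EPs comes from the snippet above.
import csv

with open('eps_tables.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for EP in (EPs[0], EPs[2]):
        for row in EP.find_all('tr'):
            cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
            writer.writerow(cells)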

Data Scraping / Regex expression Error (python)

I'm trying to scrape data from a table on a website. I can pull the data in as source code, but in my program I get the error: TypeError: replace_with() takes exactly 2 arguments (3 given)
import urllib2
import bs4
import re

page_content = ""
for i in range(1, 11):
    page = urllib2.urlopen("http://b3.caspio.com/dp.asp?appSession=627360156923294&RecordID=&PageID=2&PrevPageID=2&CPIpage="+str(i)+"&CPISortType=&CPIorderBy=")
    page_content += page.read()

soup = bs4.BeautifulSoup(page_content)
tables = soup.find_all('tr')
file = open('crime_data.csv', 'w+')
for i in tables:
    i = i.replace_with('</td>', (','))  # this is where I get the error
    i = re.sub(r'<.?td[^>]*>', '', i)
    file.write(i + '\n')
Why is it giving me that error?
Also, in essence, I'm trying to take the data from the table and put it into a CSV file. Any and all help would be greatly appreciated!
That replace_with function does not do what you seem to want it to. The linked docs state that: PageElement.replace_with() removes a tag or string from the tree, and replaces it with the tag or string of your choice.
From your code it looks more like you want to replace the whole end tag </td> with a , in an effort to get some sort of comma-separated data.
Perhaps you should instead just use the get_text method on your <td> elements, and format them from there:
for i in tables:
    file.write(i.get_text(',').strip() + '\n')
file.close()  ####### <----- VERY IMPORTANT TO CLOSE FILES
Note
I tested your code out and you are not really scraping what you are after. I played around with it and came up with this:
import urllib2
import bs4

def scrape_crimes(html, write_headers):
    soup = bs4.BeautifulSoup(html)  # make the soup
    table = soup.find_all('table', class_=('cbResultSetTable',))  # search for the exact table you want; there are multiple nested tables on the pages you are scraping
    if len(table) > 0:  # if the table is found
        table = table[0]  # set the table to the first result
    else:
        return  # no table found, no use scraping
    with open('crime_data.csv', 'a') as f:  # open the file to append content
        trs = table.find_all('tr')  # get all the rows in the table
        if write_headers:  # if we request that headers are written
            for th in trs[0].find_all('th'):  # write each header followed by a comma
                f.write(th.get_text(strip=True).encode('utf-8') + ',')  # ensure data is writable by calling encode
            f.write('\n')  # write a newline
        for tr in trs:  # for each table row in the table
            tds = tr.find_all('td')  # get all the td elements
            if len(tds) > 0:  # if there are td elements (not true for header rows)
                for td in tds:  # for each td element
                    f.write(td.get_text(strip=True).encode('utf-8') + ',')  # add the data followed by a comma
                f.write('\n')  # finish the row off with a newline

open('crime_data.csv', 'w').close()  # clear the file before running
for i in range(1, 11):
    page = urllib2.urlopen("http://b3.caspio.com/dp.asp?appSession=627360156923294&RecordID=&PageID=2&PrevPageID=2&CPIpage="+str(i)+"&CPISortType=&CPIorderBy=")
    scrape_crimes(page.read(), i == 1)  # offload the work to a function; the second argument is True only the first time,
    # which ensures that you will get headers only at the top of your output file
I removed the use of the re library because, in general, regex and HTML do not play nicely together; the short explanation being: HTML is not a regular language.
I also switched from using the coding pattern:
file = open('file_name','w')
# do stuff
file.close()
to this preferred pattern:
with open('file_name','w') as f:
    # do stuff
In the first example it is common to forget to close the file, which you did forget in your provided code. The second pattern will handle the close for you, so no worries there. Also, it is not good practice to name your variables with the same names as built-in Python names (such as file).
I changed your script's pattern from combining all the pages' HTML to scraping each page one by one, because combining everything up front is not a good idea: you could run into memory issues if you were doing this with large pages. Instead, it is usually better to handle the data in chunks.
The next thing I did was look at the HTML of the page you were scraping. You were pulling all <tr> elements, but had you closely inspected the page, you would have seen that the table you are after is actually contained in a <tr>, giving you some big nasty block of text as a "result". Using bs4's optional class_ argument to denote a specific class to look for in the table element leads to the data you are after.
The next thing I noticed was that the table headers would get pulled for every page, sprinkling your results with this redundant information. You would only want to pull this info the first time, so I added some logic for that.
I switched to using the .get_text method instead of the regex/replace_with combo you had, because of the above explanations. The get_text method returns unicode, however, so I added the call to .encode('utf-8'), which ensures the data can be written. I also specified the strip=True argument to get rid of any pesky whitespace characters in the data. The reasoning behind this: you load the whole bs4 library, why not use it? The good people who write that library spent a lot of time taking care of parsing the text so you don't have to waste time doing it.
Hope this was helpful! Happy scraping!

How does table parsing work in Python? Is there an easy way other than Beautiful Soup?

I am trying to understand how one can use Beautiful Soup to extract the href links for the contents under a particular column in a table on a webpage. For example, consider this link: http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015.
On this page the table with class wikitable has a column Title; I need to extract the href links behind each of the values under that column and put them into an Excel sheet. What would be the best way to do this? I am having a little difficulty understanding the Beautiful Soup table-parsing documentation.
You don't really have to literally navigate the tree; you can simply try to see what identifies those lines.
In this example, the URLs you are looking for reside in a table with class="wikitable"; within that table they reside in td tags with align=center. Now that we have a somewhat unique identification for our links, we can start extracting them.
However, you should take into consideration that multiple tables with class="wikitable" and td tags with align=center may exist; if you only want the first or second table, you will have to add extra filters.
The code should look something like this for extracting all links from those tables:
import urllib2
from bs4 import BeautifulSoup, SoupStrainer

content = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015").read()
filter_tag = SoupStrainer("table", {"class": "wikitable"})
soup = BeautifulSoup(content, parse_only=filter_tag)
links = []
for sp in soup.find_all(align="center"):
    a_tag = sp('a')
    if a_tag:
        links.append(a_tag[0].get('href'))
There's one more thing to note here: the use of SoupStrainer. It is used to specify a filter for the content you want to process, which helps to speed up parsing. Try removing the parse_only argument on this line:
soup = BeautifulSoup(content, parse_only=filter_tag)
and notice the difference. (I noticed it because my PC is not that powerful.)
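To get the collected links into a spreadsheet, a minimal follow-up sketch (not part of the original answer) could write them to a CSV file, which Excel can open; the file name is only an illustration:
# Minimal sketch (Python 2, matching the code above): write the hrefs to a CSV file.
# 'title_links.csv' is only an example name; `links` comes from the snippet above.
import csv

with open('title_links.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(['href'])
    for href in links:
        writer.writerow([href])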

Python 3 Beautiful Soup Data type incompatibility issue

Hello there stack community!
I'm having an issue that I can't seem to resolve since it looks like most of the help out there is for Python 2.7.
I want to pull a table from a webpage and then just get the link text, not the whole anchor.
Here is the code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url = 'http://www.craftcount.com/category.php?cat=5'
html = urlopen(url).read()
soup = BeautifulSoup(html)
alltables = soup.findAll("table")

## This bit captures the input from the previous sequence
results = []
for link in alltables:
    rows = link.findAll('a')
    ## Find just the names
    top100 = re.findall(r">(.*?)<\/a>", rows)
    print(top100)
When I run it, I get: "TypeError: expected string or buffer"
Up to the second-to-last line, it does everything correctly (when I swap out print(top100) for print(rows)).
As an example of the response I get:
<a href="...">michellechangjewelry</a>
And I just need to get:
michellechangjewelry
According to pythex.org, my (ir)regular expression should work, so I wanted to see if anyone out there knew how to do that. As an additional issue, it looks like most people like to go the other way, that is, from having the full text and only wanting the URL part.
Finally, I'm using BeautifulSoup out of "convenience", but I'm not beholden to it if you can suggest a better package to narrow down the parsing to the linktext.
Many thanks in advance!!
BeautifulSoup results are not strings; they are Tag objects, mostly.
To get the text of the <a> tags, use the .string attribute:
for table in alltables:
    link = table.find('a')
    top100 = link.string
    print(top100)
This finds the first <a> link in a table. To find all text of all links:
for table in alltables:
    links = table.find_all('a')
    top100 = [link.string for link in links]
    print(top100)
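One caveat worth adding (not in the original answer): .string returns None when a tag contains nested elements, so .get_text() is a safer fallback if some anchors wrap other tags:
# If an <a> wraps other tags (e.g. <a><b>name</b></a>), .string is None,
# while .get_text() still returns the combined text.
for table in alltables:
    top100 = [link.get_text(strip=True) for link in table.find_all('a')]
    print(top100)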

How to extract table information using BeautifulSoup?

I am trying to scrape information from these kinds of pages.
I need the information contained under Internship, Residency, and Fellowship. I can extract values from tables, but in this case I could not decide which table to use, because the heading (like Internship) is present under a div tag outside the table as plain text, and the table whose values I need to extract comes after it. I have many pages of this kind, and not every page has all of these values; on some pages Residency may not be present at all (which decreases the total number of tables on the page). One example of such a page is this one, where Internship is not present at all.
The main problem I am facing is that all the tables have the same attribute values, so I cannot decide which table to use for different pages. If a value of interest is not present on a page, I have to return an empty string for that value.
I am using BeautifulSoup in Python. Can someone point out how I could proceed in extracting those values?
It looks like the ids for the headings and data each have a unique value and standard suffixes. You can use that to search for the appropriate values. Here's my solution:
from BeautifulSoup import BeautifulSoup

# Insert whatever networking stuff you're doing here. I'm going to assume
# that you've already downloaded the page and assigned it to a variable
# named 'html'
soup = BeautifulSoup(html)
headings = ['Internship', 'Residency', 'Fellowship']
values = []
for heading in headings:
    x = soup.find('span', text=heading)
    if x:
        span_id = x.parent['id']
        table_id = span_id.replace('dnnTITLE_lblTitle', 'Display_HtmlHolder')
        values.append(soup.find('td', attrs={'id': table_id}).text)
    else:
        values.append('')
print zip(headings, values)
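If you then want to save those heading/value pairs, a minimal follow-up sketch (not part of the original answer) could append one CSV row per scraped page; the file name is only an illustration:
# Minimal sketch (Python 2, matching the answer above): append one row per page.
# 'programs.csv' is only an example name; `values` comes from the snippet above.
import csv

with open('programs.csv', 'ab') as f:
    writer = csv.writer(f)
    writer.writerow([v.encode('utf-8') for v in values])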
