Python - capture ALL tables from an HTML page

I have emails with embedded HTML tables, and I have code that uses BeautifulSoup to extract the tables and the data within them. My problem is that sometimes it only succeeds in capturing one table when there are more.
The code I normally run on these tables is:
import email
import bs4

with open(file_path) as in_f:
    msg = email.message_from_file(in_f)
html_msg = msg.get_payload(1)
body = html_msg.get_payload(decode=True)
html = body.decode()
table = bs4.BeautifulSoup(html).find("table")
data = [[cell.text.strip() for cell in row.find_all("td")] for row in table.find_all("tr")]
But for this email, and some others like it, I only successfully extract the first Package. I've tried changing one line to table = bs4.BeautifulSoup(html).find_all("table"), but find_all doesn't work there.
I'm a novice when it comes to BeautifulSoup so any help would be appreciated, thanks.

I think I see what you are doing wrong. If you do
table = bs4.BeautifulSoup(html).find("table")
it returns a Tag (i.e. a single element). If instead you do
tables = bs4.BeautifulSoup(html).find_all("table")
it returns a ResultSet (basically a list of tables). So far so good! The problem comes in the next line, when you try to treat the ResultSet as if it were a single Tag:
... for row in tables.find_all("tr")  # can't do this!
tables is not a single element (which would have a .find_all method); it is a list of elements (which doesn't), hence the AttributeError. Instead, you have to iterate over each table, like so:
tables = bs4.BeautifulSoup(html).find_all("table")
data = []
for table in tables:  # <-- extra level of iteration!
    for row in table.find_all("tr"):
        data.append([cell.text.strip() for cell in row.find_all("td")])
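If you prefer the comprehension style of your original one-liner, the extra level of iteration can also be expressed as a nested comprehension (a minimal sketch, equivalent to the loop above):

soup = bs4.BeautifulSoup(html)
data = [
    [cell.text.strip() for cell in row.find_all("td")]  # one list per row
    for table in soup.find_all("table")                 # every table on the page
    for row in table.find_all("tr")                     # every row in each table
]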
Hope that helps!

Related

Python: read HTML table from Confluence and print each row as a list

I'd like to parse a Confluence page, read a table, and create a list for each row.
My table looks like the one in the attached screenshot.
My code:
from bs4 import BeautifulSoup

# 'confluence' here is an atlassian-python-api Confluence client
x = confluence.get_page_by_id(p_id, expand="body.storage")
soup = BeautifulSoup(x["body"]["storage"]["value"], 'html.parser')
for tables in soup.select("table tr"):
    data = [item.get_text() for item in tables.select("td")]
    print(data)
But the problem is that, because of new lines inside one of the columns, the output of the code is
['Karnataka','Bangalore','BangaloreMysoreTumkur']
and I want the output to look like
['Karnataka','Bangalore','Bangalore Mysore Tumkur']
Can you please provide code to fix this?
Thanks for the help!
By default, get_text() joins the text of nested elements with no separator, so the line breaks you see in the rendered HTML disappear. To use a custom separator, do this:
data = [item.get_text(separator=" ") for item in tables.select("td")]
Since the HTML example is missing as text, I am not aware of the exact contents, but you could try setting the separator parameter of .get_text():
item.get_text(' ')
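For illustration, here is a minimal, self-contained sketch; the markup is invented to mirror the described table, with the multi-city cell split across <p> tags:

from bs4 import BeautifulSoup

html = """
<table>
  <tr>
    <td>Karnataka</td>
    <td>Bangalore</td>
    <td><p>Bangalore</p><p>Mysore</p><p>Tumkur</p></td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
for row in soup.select("table tr"):
    # strip=True trims each fragment before joining with the separator
    print([td.get_text(separator=" ", strip=True) for td in row.select("td")])
# ['Karnataka', 'Bangalore', 'Bangalore Mysore Tumkur']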

How can I parse an HTML table that holds different object types?

I have an HTML table that holds the following object types: text, textbox, listbox (select) and buttons (see the attached picture).
My purpose is to parse, where it is possible, the text from the table.
For instance, I would like to parse the User Name, Permission, SNMPv3 Auth and SNMPv3 Priv columns.
In the case of the listboxes, I already know how to collect the selected option text.
Tables that include only text are well known to me and I can parse them very well, but the methods I have used for them don't suit this kind of table.
How would you suggest I deal with this kind of table?
In the code example I print the contents of the table (the text), but in practice I will store it for later analysis of its contents. By the way, you can also see that I am not referring to the first row (the header) of the table.
This is how the users-view list rows look; each one has only a div tag.
As per the HTML you have shared, each tr has three elements: a text box, a select box, and a button.
Also, in the screenshot of the saved record, I don't see an input field; for example, for the text user1. I assume user1 is inside a span tag,
like
<td>
    <div>user1</div>
</td>
You have to handle each element differently to get the value out of it:
To get the innerText of a div, use elem.text
To get the value attribute of an input text box, use elem.get_attribute('value')
To get the selected option, use Select(elem).first_selected_option
Here is example code to get the data out of your DOM; please feel free to edit it as per your need.
I have used CSS selectors to find elements. Look here for the syntax.
from selenium.webdriver.support.ui import Select

# This returns all the tr elements in the table
rows = driver.find_elements_by_css_selector("table#sec_user_table>tbody>tr")
for i in range(1, len(rows)):
    # This returns only the divs, the inputs of type text, and the select elements
    cols = rows[i].find_elements_by_xpath("td//*[self::div[not(.//input)] or self::input[@type='text'] or self::select]")
    for col in cols:
        if col.tag_name == 'select':  # tag_name is lowercase for HTML documents
            print(Select(col).first_selected_option.text)  # to get the selected value
        elif col.tag_name == 'input':
            print(col.get_attribute('value'))  # to get the input value
        else:
            print(col.text)  # to get the text from the div
Or
A simpler solution with single selectors. This is specific to your case, since you don't need the input elements at all:
# This returns all the tr elements in the table
rows = driver.find_elements_by_css_selector("table#sec_user_table>tbody>tr")
for i in range(1, len(rows)):
    username = rows[i].find_element_by_xpath(".//div[not(.//input)]")  # './/' keeps the search within this row
    print(username.text)
    selects = rows[i].find_elements_by_css_selector("select")
    for col in selects:
        print(Select(col).first_selected_option.text)  # to get the selected value
I improved the above solution to solve my specific problem. It still may need some tweaks; for instance, I will need to think of a way to ignore the last line, but this is not a big issue. Another thing that I want to fix is the time it takes to get the result; for some reason it takes several seconds:
rows = driver.find_elements_by_css_selector("table#sec_user_table>tbody>tr")
for row in rows:
    cols = row.find_elements_by_css_selector("div,select")
    for col in cols:
        if col.tag_name == 'div':
            if col.text != '':
                print(col.text)
        elif col.tag_name == 'select':
            print(Select(col).first_selected_option.text)
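On the slowness: every find_elements_* call, every .text access, and every Select(...) lookup is a separate round trip to the WebDriver, so a table with many rows multiplies the latency. A hedged sketch of one way around this (assuming the same table id and div/select structure as above, and that hidden-text differences don't matter for your data) is to collect everything in a single execute_script round trip:

data = driver.execute_script("""
    const rows = document.querySelectorAll('table#sec_user_table > tbody > tr');
    const result = [];
    for (const row of rows) {
        const cells = [];
        for (const el of row.querySelectorAll('div, select')) {
            if (el.tagName === 'SELECT') {
                const opt = el.options[el.selectedIndex];  // currently selected option
                cells.push(opt ? opt.text : '');
            } else if (el.textContent.trim() !== '') {
                cells.push(el.textContent.trim());
            }
        }
        result.push(cells);
    }
    return result;
""")
print(data)  # one list of cell texts per row, fetched in one driver call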

Getting pandas.read_html() to work when HTML table contains more than one <tbody> tag

I'm trying to parse the tables found at http://www.swiftcodesbic.com and I'm using Pandas to grab the tables automatically. For the most part, this is working fine, but there is one table where there are two <tbody> tags and I think it's causing a hiccup. The faulty table can be found here.
The code I'm using to parse the html into a pandas.DataFrame is:
pandas.read_html(countryPage.text, attrs={"id":"t2"}, skiprows=1)[0]
where countryPage is a requests.get() object. Is there anything I can add to the pandas call to tell it to grab the second <tbody> tag? Or, if that's not the issue, can someone explain what might be causing it to return a "table not found" error? Thanks in advance.
EDIT
Here's the solution I'm currently using, but I'd still like to know a more 'pythonic' approach to this.
try:
    tempDataFrame = pd.read_html(countryPage.text, attrs={"id": "t2"}, skiprows=1)[0]
except:
    if "france" in url:  # pseudo-code
        soup = BeautifulSoup(countryPage.text)
        table = soup.find_all("table")[2].findAll('tbody')[1]  # this will vary based on your situation
        table = "<table>" + str(table) + "</table>"  # pandas needs the table tag to recognize a table
        tempDataFrame = pd.read_html(table)[0]
Again, I'd be interested in knowing how to do this in a more efficient manner.
Using the match parameter should do the trick. From pandas.read_html documentation:
match : str or compiled regular expression, optional
The set of tables containing text matching this regex or string will be returned. Unless the HTML is extremely simple you will probably need to pass a non-empty string here. Defaults to ‘.+’ (match any non-empty string). The default value will return all tables contained on a page. This value is converted to a regular expression so that there is consistent behavior between Beautiful Soup and lxml.
Try something of this sort:
tempDataFrame = pd.read_html(countryPage.text, match='foo', skiprows=1)
where 'foo' is a string contained in the table you want.
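As an illustration, here is a self-contained sketch with made-up data (not the actual swiftcodesbic markup); with a recent pandas, read_html collects the rows from both <tbody> tags once the table is matched:

import io
import pandas as pd

html = """
<table id="t2">
  <thead><tr><th>Bank</th><th>SWIFT</th></tr></thead>
  <tbody><tr><td>Bank A</td><td>AAAAFRPP</td></tr></tbody>
  <tbody><tr><td>Bank B</td><td>BBBBFRPP</td></tr></tbody>
</table>
"""
df = pd.read_html(io.StringIO(html), match="Bank B")[0]
print(df)  # both rows appear, despite the two <tbody> tags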

Data Scraping / Regex expression Error (python)

I'm trying to scrape data from a table in a website. I can pull the data in as source code, but in my program I get the error: TypeError: replace_with() takes exactly 2 arguments (3 given)
import urllib2
import bs4
import re

page_content = ""
for i in range(1, 11):
    page = urllib2.urlopen("http://b3.caspio.com/dp.asp?appSession=627360156923294&RecordID=&PageID=2&PrevPageID=2&CPIpage="+str(i)+"&CPISortType=&CPIorderBy=")
    page_content += page.read()
soup = bs4.BeautifulSoup(page_content)
tables = soup.find_all('tr')
file = open('crime_data.csv', 'w+')
for i in tables:
    i = i.replace_with('</td>', (','))  # this is where I get the error
    i = re.sub(r'<.?td[^>]*>', '', i)
    file.write(i + '\n')
Why is it giving me that error?
Also, in essence, I'm trying take the data from the table and put it into a csv file. Any and all help would be greatly appreciated!
That replace_with function does not do what it appears you want it to. The linked docs state that PageElement.replace_with() removes a tag or string from the tree, and replaces it with the tag or string of your choice.
From your code it looks more like you want to replace the closing tag </td> with a , in an effort to get some sort of comma-separated data.
Perhaps you should instead just use the get_text method on your <td> elements, and format them from there:
for i in tables:
    file.write(i.get_text(',').strip() + '\n')
file.close()  ####### <----- VERY IMPORTANT TO CLOSE FILES
Note
I tested your code out and you are not really scraping what you are after. I played around with it and came up with this:
import urllib2
import bs4
def scrape_crimes(html, write_headers):
    soup = bs4.BeautifulSoup(html)  # make the soup
    table = soup.find_all('table', class_=('cbResultSetTable',))  # search for the exact table you want; there are multiple nested tables on the pages you are scraping
    if len(table) > 0:  # if the table is found
        table = table[0]  # set the table to the first result
    else:
        return  # no table found, no use scraping
    with open('crime_data.csv', 'a') as f:  # opens file to append content
        trs = table.find_all('tr')  # get all the rows in the table
        if write_headers:  # if we request that headers are written
            for th in trs[0].find_all('th'):  # write each header followed by a comma
                f.write(th.get_text(strip=True).encode('utf-8') + ',')  # ensure data is writable by calling encode
            f.write('\n')  # write a newline
        for tr in trs:  # for each table row in the table
            tds = tr.find_all('td')  # get all the td elements
            if len(tds) > 0:  # if there are td elements (not true for header rows)
                for td in tds:  # for each td element
                    f.write(td.get_text(strip=True).encode('utf-8') + ',')  # add the data followed by a comma
                f.write('\n')  # finish the row off with a newline

open('crime_data.csv', 'w').close()  # clear the file before running
for i in range(1, 11):
    page = urllib2.urlopen("http://b3.caspio.com/dp.asp?appSession=627360156923294&RecordID=&PageID=2&PrevPageID=2&CPIpage="+str(i)+"&CPISortType=&CPIorderBy=")
    scrape_crimes(page.read(), i == 1)  # offload the work to a function; the second argument is only true the first time,
    # which ensures that you will get headers only at the top of your output file
I removed the use of the re library because, in general, regex and HTML do not play nicely together; the short explanation being: HTML is not a regular language.
I also switched from using the coding pattern:
file = open('file_name','w')
# do stuff
file.close()
to this preferred pattern:
with open('file_name', 'w') as f:
    # do stuff
In the first example it is common to forget to close the file, which you did indeed forget in your provided code. The second pattern will handle the close for you, so no worries there. Also, it is not good practice to name your variables with the same names as native Python builtins (like file).
I changed your script's pattern from combining all the pages' HTML to scraping each page one by one, because accumulating everything is not a good idea: you could run into memory issues if you were doing this with large pages. Instead, it is usually better to handle the data in chunks.
The next thing I did was look at the HTML of the page you were scraping. You were pulling all <tr> elements, but had you closely inspected the page, you would have seen that the table you are after is itself contained in a <tr> of an outer table, giving you one big nasty block of text as a "result". Using bs4's optional class_ argument to denote a specific class to look for in the table element leads to the data you are after.
The next thing I noticed was that the table headers would get pulled for every page, sprinkling your results with redundant information. You only want to pull this info the first time, so I added some logic for that.
I switched to using the .get_text method instead of the regex/replace_with combo you had, for the reasons explained above. The get_text method returns unicode, however, so I added the call to .encode('utf-8'), which ensures the data is writable. I also specified the strip=True argument to get rid of any pesky whitespace characters in the data. The reasoning behind this: you load the whole bs4 library, why not use it? The good people who write that library spent a lot of time taking care of parsing the text so you don't have to waste time doing it.
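One last side note: all of the above is Python 2 code (urllib2, and .encode before writing). If you land here on Python 3, a rough sketch of the same approach, with the csv module handling the quoting and encoding for you, might look like this:

import csv
import urllib.request
import bs4

def scrape_crimes(html, writer, write_headers):
    soup = bs4.BeautifulSoup(html, 'html.parser')
    tables = soup.find_all('table', class_='cbResultSetTable')
    if not tables:  # no table found, no use scraping
        return
    trs = tables[0].find_all('tr')
    if write_headers:
        writer.writerow([th.get_text(strip=True) for th in trs[0].find_all('th')])
    for tr in trs:
        tds = tr.find_all('td')
        if tds:  # skip header rows, which have no td elements
            writer.writerow([td.get_text(strip=True) for td in tds])

with open('crime_data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for i in range(1, 11):
        url = "http://b3.caspio.com/dp.asp?appSession=627360156923294&RecordID=&PageID=2&PrevPageID=2&CPIpage=" + str(i) + "&CPISortType=&CPIorderBy="
        with urllib.request.urlopen(url) as page:
            scrape_crimes(page.read(), writer, i == 1)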
Hope this was helpful! Happy scraping!

How to extract table information using BeautifulSoup?

I am trying to scrape information from these kinds of pages.
I need the information contained under Internship, Residency, and Fellowship. I can extract values from tables, but in this case I could not decide which table to use, because the heading (like Internship) is present under a div tag outside the table as simple plain text, and after that comes the table whose values I need to extract. I have many pages of this kind, and it is not guaranteed that each page has all these values; for example, on some pages Residency may not be present at all (which decreases the total number of tables on the page). One example of such a page is this; on that page, Internship is not present at all.
The main problem I am facing is that all the tables have the same attribute values, so I cannot decide which table to use on different pages. If any value of my interest is not present on a page, I have to return an empty string for that value.
I am using BeautifulSoup in Python. Can someone point out how I could proceed in extracting those values?
It looks like the ids for the headings and data each have a unique value and standard suffixes. You can use that to search for the appropriate values. Here's my solution:
from BeautifulSoup import BeautifulSoup

# Insert whatever networking stuff you're doing here. I'm going to assume
# that you've already downloaded the page and assigned it to a variable
# named 'html'
soup = BeautifulSoup(html)
headings = ['Internship', 'Residency', 'Fellowship']
values = []
for heading in headings:
    x = soup.find('span', text=heading)
    if x:
        span_id = x.parent['id']
        table_id = span_id.replace('dnnTITLE_lblTitle', 'Display_HtmlHolder')
        values.append(soup.find('td', attrs={'id': table_id}).text)
    else:
        values.append('')
print zip(headings, values)
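Note that the answer above uses the old BeautifulSoup 3 package and Python 2. A hedged sketch of the same idea with bs4 on Python 3 follows; the id suffixes ('dnnTITLE_lblTitle' -> 'Display_HtmlHolder') are taken from the answer and remain an assumption about the target pages:

import bs4

soup = bs4.BeautifulSoup(html, 'html.parser')
headings = ['Internship', 'Residency', 'Fellowship']
values = []
for heading in headings:
    span = soup.find('span', string=heading)  # 'string' replaced 'text' in bs4
    if span and span.parent.has_attr('id'):
        table_id = span.parent['id'].replace('dnnTITLE_lblTitle', 'Display_HtmlHolder')
        cell = soup.find('td', id=table_id)
        values.append(cell.get_text(strip=True) if cell else '')
    else:
        values.append('')  # heading not on this page, return an empty string
print(list(zip(headings, values)))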
