I have an HTML structure like this one:
<table class="info" id="stats">
<tbody>
<tr>
<th> Brand </th>
<td> 2 Guys Smoke Shop </td>
</tr>
<tr>
<th> Blend Type </th>
<td> Aromatic </td>
</tr>
<tr>
<th> Contents </th>
<td> Black Cavendish, Virginia </td>
</tr>
<tr>
<th> Flavoring </th>
<td> Other / Misc </td>
</tr>
</tbody>
</table>
Those attributes are not always present; sometimes I have only Brand, in other cases Brand and Flavoring.
To scrape this I wrote code like this:
BlendInfo = namedtuple('BlendInfo', ['brand', 'type', 'contents', 'flavoring'])
stats_rows = soup.find('table', id='stats').find_all('tr')
bi = BlendInfo(brand = stats_rows[1].td.get_text(),
               type = stats_rows[2].td.get_text(),
               contents = stats_rows[3].td.get_text(),
               flavoring = stats_rows[4].td.get_text())
But as expected it fails with an index-out-of-bounds error (or gets really mixed up) when the table ordering is different (type before brand) or some of the rows are missing (no contents).
Is there a better approach, something like:
give me the data from the row whose header string is 'brand'?
It is definitely possible. Check this out:
from bs4 import BeautifulSoup
html_content='''
<table class="info" id="stats">
<tbody>
<tr>
<th> Brand </th>
<td> 2 Guys Smoke Shop </td>
</tr>
<tr>
<th> Blend Type </th>
<td> Aromatic </td>
</tr>
<tr>
<th> Contents </th>
<td> Black Cavendish, Virginia </td>
</tr>
<tr>
<th> Flavoring </th>
<td> Other / Misc </td>
</tr>
</tbody>
</table>
'''
soup = BeautifulSoup(html_content, "lxml")
for item in soup.find_all(class_='info')[0].find_all("th"):
    header = item.text
    value = item.find_next_sibling().text
    print(header, value)
Output:
Brand 2 Guys Smoke Shop
Blend Type Aromatic
Contents Black Cavendish, Virginia
Flavoring Other / Misc
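If you only need the value for one specific header (the "give me the data from the row whose header says 'brand'" case from the question), the same sibling lookup works on a single th. A small sketch reusing the soup built above:
# find the <th> whose text is 'Brand' (ignoring surrounding whitespace),
# then read the <td> that follows it in the same row
th = soup.find('th', string=lambda s: s and s.strip().lower() == 'brand')
brand = th.find_next_sibling('td').get_text(strip=True) if th else None
print(brand)  # 2 Guys Smoke Shop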
This would build a dict for you:
from bs4 import BeautifulSoup
valid_headers = ['brand', 'blend type', 'contents', 'flavoring']
t = """<table class="info" id="stats">
<tbody>
<tr>
<th> Brand </th>
<td> 2 Guys Smoke Shop </td>
</tr>
<tr>
<th> Blend Type </th>
<td> Aromatic </td>
</tr>
<tr>
<th> Contents </th>
<td> Black Cavendish, Virginia </td>
</tr>
<tr>
<th> Flavoring </th>
<td> Other / Misc </td>
</tr>
</tbody>
</table>"""
bs = BeautifulSoup(t, 'html.parser')
results = {}
for row in bs.find_all('tr'):
    header = row.find('th')
    if header and header.get_text(strip=True).lower() in valid_headers:
        value = row.find('td')
        results[header.get_text(strip=True)] = value.get_text(strip=True)
print(results)
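To get back to the namedtuple shape from the question, the resulting dict can be mapped onto BlendInfo with .get(), so missing rows become None instead of raising an IndexError. A sketch assuming the results dict built above (keys are the stripped header texts):
from collections import namedtuple
BlendInfo = namedtuple('BlendInfo', ['brand', 'type', 'contents', 'flavoring'])
# .get() returns None for any header that was not present in the table
bi = BlendInfo(brand=results.get('Brand'),
               type=results.get('Blend Type'),
               contents=results.get('Contents'),
               flavoring=results.get('Flavoring'))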
Related
I'm having trouble appending a new row after the first row (the header row) in the table body.
My code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('page_content.xml'), 'html.parser')
# append a row to the first row in the table body
row = soup.find('tbody').find('tr')
row.append(soup.new_tag('tr', text='New Cell'))
print(row)
The output:
<tr>
<th>Version</th>
<th>Jira</th>
<th colspan="1">Date/Time</th>
<tr text="New Cell"></tr></tr>
What the output should be:
<tr>
<th>Version</th>
<th>Jira</th>
<th colspan="1">Date/Time</th>
</tr>
<tr text="New Cell"></tr>
The full XML file is:
<h1>Rental Agreement/Editor</h1>
<table class="wrapped">
<colgroup>
<col/>
<col/>
<col/>
</colgroup>
<tbody>
<tr>
<th>Version</th>
<th>Jira</th>
<th colspan="1">Date/Time</th>
<tr text="New Cell"></tr></tr>
<tr>
<td>1.0.1-0</td>
<td>ABC-1234</td>
<td colspan="1">
<br/>
</td>
</tr>
</tbody>
</table>
<p class="auto-cursor-target">
<br/>
</p>
You can use .insert_after:
from bs4 import BeautifulSoup
html_doc = """
<table>
<tr>
<th>Version</th>
<th>Jira</th>
<th colspan="1">Date/Time</th>
</tr>
<tr>
<td> something else </td>
</tr>
</table>
"""
soup = BeautifulSoup(html_doc, "html.parser")
row = soup.select_one("tr:has(th)")
row.insert_after(soup.new_tag("tr", text="New Cell"))
print(soup.prettify())
Prints:
<table>
<tr>
<th>
Version
</th>
<th>
Jira
</th>
<th colspan="1">
Date/Time
</th>
</tr>
<tr text="New Cell">
</tr>
<tr>
<td>
something else
</td>
</tr>
</table>
EDIT: If you want to insert arbitrary HTML code, you can try:
what_to_insert = BeautifulSoup(
'<tr param="xxx">This is new <b>text</b></tr>', "html.parser"
)
row.insert_after(what_to_insert)
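For context (this is not part of the original answer): the nesting seen in the question's output happens because append() places the new tag inside the selected row, while insert_after() places it as the row's next sibling. A quick comparison, reusing the row selected above:
# append() -> <tr> ... <tr text="nested"></tr></tr>   (new tag ends up inside the header row)
row.append(soup.new_tag("tr", text="nested"))
# insert_after() -> <tr> ... </tr><tr text="sibling"></tr>   (new tag becomes a sibling row)
row.insert_after(soup.new_tag("tr", text="sibling"))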
There's this table I want to scrape to get all of its details. The HTML code is this one:
<table id="bnConnectionTemplate:r1:0:tl1" class="detailTable" cellpadding="0" cellspacing="0" border="0" summary="">
<tbody>
<tr>
<th>Name: </th>
<td>EVERBRITE CORPORATION LIMITED</td>
</tr>
<tr>
<th><abbr title="Australian Company Number">ACN: </abbr></th>
<td>104 436 704</td>
</tr>
<tr>
<th><abbr title="Australian Business Number">ABN: </abbr></th>
<td><a id="bnConnectionTemplate:r1:0:j_id__ctru57pc2" class="contentLink af_goLink" href="http://abr.business.gov.au/Search.aspx?SearchText=96%20104%20436%20704" target="_blank"><span title="">96 104 436 704</span><span class="hiddenHint"> (External Link)</span></a></td>
</tr>
<tr>
<th>Registration date: </th>
<td>15/04/2003</td>
</tr>
<tr>
<th>Next review date: </th>
<td>15/04/2013</td>
</tr>
<tr>
<th>Former name(s): </th>
<td>VISIONGLOW GLOBAL LIMITED</td>
</tr>
<tr>
<th></th>
<td></td>
</tr>
<tr>
<th>Status: </th>
<td>Deregistered</td>
</tr>
<tr>
<th>Date deregistered: </th>
<td>7/09/2012</td>
</tr>
<tr>
<th>Type: </th>
<td>Australian Public Company, Limited By Shares</td>
</tr>
<tr>
<th>Locality of registered office: </th>
<td></td>
</tr>
<tr>
<th>Regulator: </th>
<td>Australian Securities & Investments Commission</td>
</tr>
</tbody>
</table>
My problem is that I can't get this table, even when I try to get it by its class or id.
# noinspection PyUnresolvedReferences
import requests
# noinspection PyUnresolvedReferences
from bs4 import BeautifulSoup
source = requests.get("https://connectonline.asic.gov.au/RegistrySearch/faces/landing/panelSearch.jspx?searchText=104+436+704&searchType=OrgAndBusNm&_adf.ctrl-state=139sjjyk9g_15").text
soup = BeautifulSoup(source, 'lxml')
I tried doing:
table = soup.find('table', class_= 'detailTable') # Gives output : none
table = soup.find('table', id="bnConnectionTemplate:r1:0:tl1") # Gives output : none
At this point I'm confused as to why this is happening. I have web scraped in the past with these commands and they worked fine. Any kind of help would be appreciated.
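One thing worth checking (not part of the original post): whether the table is present at all in the raw HTML that requests receives. Pages like this often build the table with JavaScript after the initial load, in which case BeautifulSoup never sees it and find() correctly returns None. A minimal check, using the same URL as above:
import requests
from bs4 import BeautifulSoup
url = ("https://connectonline.asic.gov.au/RegistrySearch/faces/landing/panelSearch.jspx"
       "?searchText=104+436+704&searchType=OrgAndBusNm&_adf.ctrl-state=139sjjyk9g_15")
source = requests.get(url).text
# if both checks come back False/None, the table is not in the server response, and a
# browser-driven tool such as Selenium (or the site's underlying data endpoint) is needed
print('detailTable' in source)
print(BeautifulSoup(source, 'lxml').find('table', class_='detailTable'))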
I have a table with these headers (Courses, Teacher, Avg). How would I select a whole column using XPath and store it in an array?
I was hoping for different arrays, like:
courses = []
teacher = []
avg = []
Bear in mind these columns don't have any IDs or classes, so I need a way to select them just by using the name of the column.
Here is the code for the table:
<table border="0">
<tbody>
<tr>
<td nowrap="nowrap">Courses</td>
<td nowrap="nowrap">Teacher</td>
<td><select name="fldMarkingPeriod" onchange="switchMarkingPeriod(this.value);">
<option value="MP1">MP1</option>
<option selected="selected" value="MP2">MP2</option>
<option value="MP3">MP3</option>
</select>Avg</td>
</tr>
<tr>
<td nowrap="nowrap">[Course Name]</td>
<td nowrap="nowrap">[Teacher Name]</td>
<td>
<table width="100%" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td title="View Course Summary" width="70%">100%</td>
<td width="30%">A+</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td nowrap="nowrap">[Course Name]</td>
<td nowrap="nowrap">[Teacher Name]</td>
<td>
<table width="100%" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td title="View Course Summary" width="70%">100%</td>
<td width="30%">A+</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td nowrap="nowrap">[Course Name]</td>
<td nowrap="nowrap">[Teacher Name]</td>
<td>
<table width="100%" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td title="View Course Summary" width="70%">100%</td>
<td width="30%">A+</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
Any ideas? Thanks.
Not sure why exactly you need the data by columns, but here is a sample implementation:
courses = []
teachers = []
avgs = []
# skip the header row with [1:] and take only the direct child cells of each row,
# so the nested percentage table in the third column does not add extra <td> elements
for row in table.find_elements_by_xpath("./tbody/tr")[1:]:
    course, teacher, avg = [td.text for td in row.find_elements_by_xpath("./td")]
    courses.append(course)
    teachers.append(teacher)
    avgs.append(avg)
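Since the question explicitly asks for XPath, the same column split can also be done offline with lxml on the posted markup. This is just a sketch, assuming the page HTML is in a string called page_source (not a name from the original post):
from lxml import html

tree = html.fromstring(page_source)
# direct child rows of the outer table's tbody, skipping the header row
rows = tree.xpath('//table[not(ancestor::table)]/tbody/tr')[1:]
courses = [row.xpath('normalize-space(./td[1])') for row in rows]
teachers = [row.xpath('normalize-space(./td[2])') for row in rows]
avgs = [row.xpath('normalize-space(./td[3])') for row in rows]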
I am parsing an HTML file with BeautifulSoup and got stuck on <br> tags.
I want to append a <br> tag after inserting a list element, but it didn't work.
What is the easiest way to do this?
soup = BeautifulSoup(open("test.html"))
mylist = ["Item_1", "Item_2"]
for i in range(len(mylist)):
    # insert the items into the 4th column
This is the default HTML:
<html>
<body>
<table>
<tr>
<th>
1. Column
</th>
<th>
2. Column
</th>
<th>
3. Column
</th>
<th>
4. Column
</th>
<th>
5. Column
</th>
<th>
6. Column
</th>
<th>
7. Column
</th>
<th>
8. Column
</th>
</tr>
<tr class="a">
<td class="h">
Text in first column
</td>
<td>
<br/>
</td>
<td>
<br/>
</td>
<td>
<!--I want to insert items here-->
</td>
<td>
1
</td>
<td>
37
</td>
<td>
38
</td>
<td>
38
</td>
</tr>
</table>
</body>
</html>
This is the HTML I want to produce:
<html>
<body>
<table>
<tr>
<th>
1. Column
</th>
<th>
2. Column
</th>
<th>
3. Column
</th>
<th>
4. Column
</th>
<th>
5. Column
</th>
<th>
6. Column
</th>
<th>
7. Column
</th>
<th>
8. Column
</th>
</tr>
<tr class="a">
<td class="h">
Text in first column
</td>
<td>
<br/>
</td>
<td>
<br/>
</td>
<td>
Item_1 <br>
Item_2
</td>
<td>
1
</td>
<td>
37
</td>
<td>
38
</td>
<td>
38
</td>
</tr>
</table>
</body>
</html>
To append a tag, first create it with the new_tag() factory function, like so:
soup.td.append(soup.new_tag('br'))
Consider the following program. For every table cell (that is, every td) in the HTML, it appends a <br/> tag and some text to the cell.
from bs4 import BeautifulSoup
html_doc = '''
<html>
<body>
<table>
<tr>
<td>
data1
</td>
<td>
data2
</td>
</tr>
</table>
</body>
</html>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
mylist = ['addendum 1', 'addendum 2']
for td, item in zip(soup.find_all('td'), mylist):
    td.append(soup.new_tag('br'))
    td.append(item)
print(soup.prettify())
Result:
<html>
<body>
<table>
<tr>
<td>
data1
<br/>
addendum 1
</td>
<td>
data2
<br/>
addendum 2
</td>
</tr>
</table>
</body>
</html>
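Applied to the question's own file, only the 4th cell of the data row needs the items, so the loop can target that one td instead of every cell. A sketch assuming the structure of test.html shown above:
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("test.html"), "html.parser")
mylist = ["Item_1", "Item_2"]
# the data row has class="a"; its 4th <td> (index 3) is the target column
target = soup.find("tr", class_="a").find_all("td")[3]
for i, item in enumerate(mylist):
    if i:  # put a <br/> between consecutive items
        target.append(soup.new_tag("br"))
    target.append(item)
print(soup.prettify())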
I have a table formed like this from a website:
<table>
<tr class="head">
<td class="One">
Column 1
</td>
<td class="Two">
Column 2
</td>
<td class="Four">
Column 3
</td>
<td class="Five">
Column 4
</td>
</tr>
<tr class="DataSet1">
<td class="One">
<table>
<tr>
<td class="DataType1">
Data 1
</td>
</tr>
<tr>
<td class="DataType_2">
<ul>
<li> Data 2a</li>
<li> Data 2b</li>
<li> Data 2c</li>
<li> Data 2d</li>
</ul>
</td>
</tr>
</table>
</td>
<td class="Two">
<table>
<tr>
<td class="DataType_3">
Data 3
</td>
</tr>
<tr>
<td class="DataType_4">
Data 4
</td>
</tr>
</table>
</td>
<td class="Three">
<table>
<tr>
<td class="DataType_5">
Data 5
</td>
</tr>
</table>
</td>
<td class="Four">
<table>
<tr>
<td class="DataType_6">
Data 6
</td>
</tr>
</table>
</td>
</tr>
<tr class="Empty">
<td class="One">
</td>
<td class="Two">
</td>
<td class="Four">
</td>
<td class="Five">
</td>
</tr>
<tr class="DataSet2">
<td class="One">
<table>
<tr>
<td class="DataType_1">
Data 7
</td>
</tr>
<tr>
<td class="DataType_2">
Data 8
</td>
</tr>
</table>
</td>
<td class="Two">
<table>
<tr>
<td class="DataType_3">
Data 9
</td>
</tr>
<tr>
<td class="DataType_4">
Data 10
</td>
</tr>
</table>
</td>
<td class="Three">
<table>
<tr>
<td class="DataType_5">
Data 11
</td>
</tr>
</table>
</td>
<td class="Four">
<table>
<tr>
<td class="DataType_6">
Data 12
</td>
</tr>
</table>
</td>
</tr>
<!-- and so on -->
</table>
The cells are also sometimes empty, for example:
<td class="DataType_6"> </td>
I tried to scrape the content with Scrapy and the following script:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from project.items import ProjectItem
class MySpider(BaseSpider):
    name = "SpiderName"
    allowed_domains = ["url"]
    start_urls = ["url"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//tr')
        items = []
        item = ProjectItem()
        item["Data_1"] = rows.select('//td[@class="DataType_1"]/text()').extract()
        item["Data_2"] = rows.select('//td[@class="DataType_2"]/text()').extract()
        item["Data_3"] = rows.select('//td[@class="DataType_3"]/text()').extract()
        item["Data_4"] = rows.select('//td[@class="DataType_4"]/text()').extract()
        item["Data_5"] = rows.select('//td[@class="DataType_5"]/text()').extract()
        item["Data_6"] = rows.select('//td[@class="DataType_6"]/text()').extract()
        items.append(item)
        return items
If I crawl using this command:
scrapy crawl SpiderName -o output.csv -t csv
I only get garbage: all of the values, repeated as many times as there are DataSets, end up lumped together in "Data_1".
I had a similar problem. First of all, rows = hxs.select('//tr') is going to match every row in the document, not just the rows of each dataset. You need to dig a bit deeper and use relative paths. This link gives an excellent explanation of how to structure your code.
When I finally got my head around it, I realised that in order to parse each item separately, row.select should not have the // in it.
Hope this helps.
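A sketch of what that looks like in the question's (older) Scrapy API: iterate over the dataset rows and drop the leading // so each field is looked up inside the current row only. The starts-with(@class, "DataSet") selector is an assumption based on the posted HTML:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    items = []
    for row in hxs.select('//tr[starts-with(@class, "DataSet")]'):
        item = ProjectItem()
        # relative paths (./ or .//) keep each lookup scoped to this row
        item["Data_1"] = row.select('.//td[@class="DataType_1"]//text()').extract()
        item["Data_2"] = row.select('.//td[@class="DataType_2"]//text()').extract()
        item["Data_3"] = row.select('.//td[@class="DataType_3"]//text()').extract()
        item["Data_4"] = row.select('.//td[@class="DataType_4"]//text()').extract()
        item["Data_5"] = row.select('.//td[@class="DataType_5"]//text()').extract()
        item["Data_6"] = row.select('.//td[@class="DataType_6"]//text()').extract()
        items.append(item)
    return items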