Trying to web scrape a table but can't - Python

There's a table I want to scrape to get all of its details. The HTML is this:
<table id="bnConnectionTemplate:r1:0:tl1" class="detailTable" cellpadding="0" cellspacing="0" border="0" summary="">
<tbody>
<tr>
<th>Name: </th>
<td>EVERBRITE CORPORATION LIMITED</td>
</tr>
<tr>
<th><abbr title="Australian Company Number">ACN: </abbr></th>
<td>104 436 704</td>
</tr>
<tr>
<th><abbr title="Australian Business Number">ABN: </abbr></th>
<td><a id="bnConnectionTemplate:r1:0:j_id__ctru57pc2" class="contentLink af_goLink" href="http://abr.business.gov.au/Search.aspx?SearchText=96%20104%20436%20704" target="_blank"><span title="">96 104 436 704</span><span class="hiddenHint"> (External Link)</span></a></td>
</tr>
<tr>
<th>Registration date: </th>
<td>15/04/2003</td>
</tr>
<tr>
<th>Next review date: </th>
<td>15/04/2013</td>
</tr>
<tr>
<th>Former name(s): </th>
<td>VISIONGLOW GLOBAL LIMITED</td>
</tr>
<tr>
<th></th>
<td></td>
</tr>
<tr>
<th>Status: </th>
<td>Deregistered</td>
</tr>
<tr>
<th>Date deregistered: </th>
<td>7/09/2012</td>
</tr>
<tr>
<th>Type: </th>
<td>Australian Public Company, Limited By Shares</td>
</tr>
<tr>
<th>Locality of registered office: </th>
<td></td>
</tr>
<tr>
<th>Regulator: </th>
<td>Australian Securities & Investments Commission</td>
</tr>
</tbody>
</table>
My problem is that I can't get this table even if I try to select it by its class or id.
import requests
from bs4 import BeautifulSoup
source = requests.get("https://connectonline.asic.gov.au/RegistrySearch/faces/landing/panelSearch.jspx?searchText=104+436+704&searchType=OrgAndBusNm&_adf.ctrl-state=139sjjyk9g_15").text
soup = BeautifulSoup(source, 'lxml')
I tried doing:
table = soup.find('table', class_='detailTable')  # gives None
table = soup.find('table', id="bnConnectionTemplate:r1:0:tl1")  # gives None
At this point I'm confused as to why this is happening. I have web scraped in the past with these commands and they worked fine. Any kind of help would be appreciated.
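The selectors are not the problem; the table simply isn't in the HTML that requests receives. The `_adf.ctrl-state` parameter in the URL suggests an Oracle ADF application, which typically builds the detail table with JavaScript after the initial page load, so `find` correctly returns None. As a sanity check, the same two lookups succeed when run against a static copy of the snippet from the question (a minimal sketch):

```python
from bs4 import BeautifulSoup

# A trimmed copy of the table snippet shown in the question.
snippet = """
<table id="bnConnectionTemplate:r1:0:tl1" class="detailTable">
<tbody>
<tr><th>Name: </th><td>EVERBRITE CORPORATION LIMITED</td></tr>
<tr><th>Status: </th><td>Deregistered</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(snippet, "html.parser")

# Both lookups that returned None on the live page work here, which
# points at the page content, not the selectors, as the problem.
by_class = soup.find("table", class_="detailTable")
by_id = soup.find("table", id="bnConnectionTemplate:r1:0:tl1")

# Turn the header/value rows into a dict, dropping the trailing colon.
rows = {tr.th.get_text(strip=True).rstrip(":"): tr.td.get_text(strip=True)
        for tr in by_class.find_all("tr")}
print(rows)
```

If the selectors work on the snippet but not on the live response, you need the rendered page, for example via a browser-automation tool such as Selenium, rather than a different selector.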

Related

Trying to append a new row to the first row in a the table body with BeautifulSoup

Having trouble appending a new row after the first row (the header row) in the table body.
My code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('page_content.xml'), 'html.parser')
# append a row to the first row in the table body
row = soup.find('tbody').find('tr')
row.append(soup.new_tag('tr', text='New Cell'))
print(row)
the output:
<tr>
<th>Version</th>
<th>Jira</th>
<th colspan="1">Date/Time</th>
<tr text="New Cell"></tr></tr>
what the output should be:
<tr>
<th>Version</th>
<th>Jira</th>
<th colspan="1">Date/Time</th>
</tr>
<tr text="New Cell"></tr>
the full xml file is:
<h1>Rental Agreement/Editor</h1>
<table class="wrapped">
<colgroup>
<col/>
<col/>
<col/>
</colgroup>
<tbody>
<tr>
<th>Version</th>
<th>Jira</th>
<th colspan="1">Date/Time</th>
<tr text="New Cell"></tr></tr>
<tr>
<td>1.0.1-0</td>
<td>ABC-1234</td>
<td colspan="1">
<br/>
</td>
</tr>
</tbody>
</table>
<p class="auto-cursor-target">
<br/>
</p>
You can use .insert_after:
from bs4 import BeautifulSoup
html_doc = """
<table>
<tr>
<th>Version</th>
<th>Jira</th>
<th colspan="1">Date/Time</th>
</tr>
<tr>
<td> something else </td>
</tr>
</table>
"""
soup = BeautifulSoup(html_doc, "html.parser")
row = soup.select_one("tr:has(th)")
row.insert_after(soup.new_tag("tr", text="New Cell"))
print(soup.prettify())
Prints:
<table>
<tr>
<th>
Version
</th>
<th>
Jira
</th>
<th colspan="1">
Date/Time
</th>
</tr>
<tr text="New Cell">
</tr>
<tr>
<td>
something else
</td>
</tr>
</table>
EDIT: If you want to insert arbitrary HTML code, you can try:
what_to_insert = BeautifulSoup(
'<tr param="xxx">This is new <b>text</b></tr>', "html.parser"
)
row.insert_after(what_to_insert)
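One thing worth knowing here: keyword arguments to `new_tag` become HTML attributes, which is why both the question's output and its expected output show `<tr text="New Cell">`, an attribute, rather than a row containing the text. If the intent was a row whose content is the text, create the tag and assign to `.string` instead (a sketch):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<table><tr><th>Version</th></tr></table>", "html.parser")
row = soup.find("tr")

# new_tag() keyword arguments turn into attributes, so build the row
# explicitly and set the cell's text via .string.
new_row = soup.new_tag("tr")
new_cell = soup.new_tag("td")
new_cell.string = "New Cell"
new_row.append(new_cell)

row.insert_after(new_row)
print(soup)
```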

How do we select the child element tbody after extracting the entire HTML?

I'm still a Python noob trying to learn BeautifulSoup. I looked at solutions on Stack Overflow but was unsuccessful. Please help me understand this better.
I have extracted the HTML, which is shown below:
<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>
I tried to parse with find_all('tbody') but was unsuccessful:
# table = bs.find("table", {"id": "ContentPlaceHolder1_dlDetails"})
html = browser.page_source
soup = bs(html, "lxml")
table = soup.find_all('table', {'id': 'ContentPlaceHolder1_dlDetails'})
table_body = table.find('tbody')
rows = table.select('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])
I'm trying to save the values in the "listmaintext" class.
Error message:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
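As the message says, `find_all` returns a ResultSet (a list of tags), and a list has no `.find`. Use `find` for the table so chained lookups work. A minimal sketch of the fix, run against a trimmed copy of the table from the question rather than `browser.page_source`:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question.
html = """<table id="ContentPlaceHolder1_dlDetails">
<tr><td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag (or None), so .select()/.find_all()
# can be chained on it; find_all() returns a list and cannot.
table = soup.find("table", {"id": "ContentPlaceHolder1_dlDetails"})

data = []
for row in table.select("tr"):
    cols = [td.text.strip() for td in row.find_all("td")]
    data.append([c for c in cols if c])
print(data)
```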
Another way to do this, using next_sibling:
from bs4 import BeautifulSoup as bs
html ='''
<html>
<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>
</html>'''
soup = bs(html, 'lxml')
data = [' '.join((item.text, item.next_sibling.next_sibling.text)) for item in soup.select('#ContentPlaceHolder1_dlDetails tr .listmaintext:first-child') if item.text !='']
print(data)
from bs4 import BeautifulSoup
data = '''<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>'''
soup = BeautifulSoup(data, 'lxml')
s = soup.select('.listmaintext')
for td1, td2 in zip(s[::2], s[1::2]):
    print('{} [{}]'.format(td1.text.strip(), td2.text.strip()))
Prints:
ATM ID: [DAGR00401111111]
ATM Centre: []
Site Location: [ADA Building - Agra]
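If a mapping is more convenient than printed pairs, the same even/odd slicing can feed a dict comprehension (a sketch on a trimmed copy of the snippet):

```python
from bs4 import BeautifulSoup

data = '''<table>
<tr><td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td></tr>
<tr><td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td></tr>
</table>'''

soup = BeautifulSoup(data, 'html.parser')
s = soup.select('.listmaintext')

# Pair each label cell with the value cell that follows it, stripping
# the trailing colon from labels to get clean dictionary keys.
info = {td1.text.strip().rstrip(':'): td2.text.strip()
        for td1, td2 in zip(s[::2], s[1::2])}
print(info)
```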

Regex html dynamic table

I am stuck on regex syntax. I am trying to create a regex for HTML code that looks for a specific string located in a table and gives you back the value of the column next to the search string.
[u'<table> <tr> <td>Ingatlan \xe1llapota</td> <td>fel\xfaj\xedtott</td> </tr> <tr> <td>\xc9p\xedt\xe9s \xe9ve</td> <td>2018</td> </tr> <tr> <td>Komfort</td> <td>luxus</td> </tr> <tr> <td>Energiatan\xfas\xedtv\xe1ny</td> <td class="is-empty">nincs megadva</td> </tr> <tr> <td>Emelet</td> <td>1</td> </tr> <tr> <td>\xc9p\xfclet szintjei</td> <td class="is-empty">nincs megadva</td> </tr> <tr> <td>Lift</td> <td>van</td> </tr> <tr> <td>Belmagass\xe1g</td> <td>3 m vagy magasabb</td> </tr> <tr> <td>F\u0171t\xe9s</td> <td>g\xe1z (cirko)</td> </tr> <tr> <td>L\xe9gkondicion\xe1l\xf3</td> <td>van</td> </tr> </table>', u'<table> <tr> <td>Akad\xe1lymentes\xedtett</td> <td>nem</td> </tr> <tr> <td>F\xfcrd\u0151 \xe9s WC</td> <td>k\xfcl\xf6n \xe9s atlan \xe1llapota')
So I would like to create a regex to look for "Ingatlan \xe1llapota" and return "fel\xfaj\xedtott":
Ingatlan \xe1llapota fel\xfaj\xedtott
My current regex expression is the following: \bIngatlan állapota\s+(.*)
I would need to incorporate the td tags and limit how long a string it returns after the search string (Ingatlan állapota).
Any help is much appreciated. Thanks!
As pointed out before, use XPath or CSS instead:
import scrapy

class txt_filter:
    sterm = 'Ingatlan \xe1llapota'
    txt = '''<table> <tr> <td>Ingatlan \xe1llapota</td> <td>fel\xfaj\xedtott</td> </tr> <tr> <td>\xc9p\xedt\xe9s \xe9ve</td> <td>2018</td> </tr> <tr> <td>Komfort</td> <td>luxus</td> </tr> <tr> <td>Energiatan\xfas\xedtv\xe1ny</td> <td class="is-empty">nincs megadva</td> </tr> <tr> <td>Emelet</td> <td>1</td> </tr> <tr> <td>\xc9p\xfclet szintjei</td> <td class="is-empty">nincs megadva</td> </tr> <tr> <td>Lift</td> <td>van</td> </tr> <tr> <td>Belmagass\xe1g</td> <td>3 m vagy magasabb</td> </tr> <tr> <td>F\u0171t\xe9s</td> <td>g\xe1z (cirko)</td> </tr> <tr> <td>L\xe9gkondicion\xe1l\xf3</td> <td>van</td> </tr> </table>', u'<table> <tr> <td>Akad\xe1lymentes\xedtett</td> <td>nem</td> </tr> <tr> <td>F\xfcrd\u0151 \xe9s WC</td> <td>k\xfcl\xf6n \xe9s atlan </td></tr></table>
'''
    resp = scrapy.http.response.text.TextResponse(body=txt, url='abc', encoding='utf-8')
    print(resp.xpath('.//td[.="' + sterm + '"]/following-sibling::td[1]/text()').extract())
Result:
$ python3 so_51590811.py
['felújított']
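If Scrapy isn't already part of the project, the same lookup works in plain BeautifulSoup (a sketch, not part of the original answer, shown on a one-row excerpt of the data):

```python
from bs4 import BeautifulSoup

sterm = 'Ingatlan \xe1llapota'
html = ('<table> <tr> <td>Ingatlan \xe1llapota</td> '
        '<td>fel\xfaj\xedtott</td> </tr> </table>')

soup = BeautifulSoup(html, 'html.parser')

# Find the <td> whose text is exactly the search term, then read the
# text of the cell immediately following it.
cell = soup.find('td', string=sterm)
value = cell.find_next_sibling('td').get_text() if cell else None
print(value)
```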

Use BeautifulSoup to fetch rows by header

I have an html structure like this one:
<table class="info" id="stats">
<tbody>
<tr>
<th> Brand </th>
<td> 2 Guys Smoke Shop </td>
</tr>
<tr>
<th> Blend Type </th>
<td> Aromatic </td>
</tr>
<tr>
<th> Contents </th>
<td> Black Cavendish, Virginia </td>
</tr>
<tr>
<th> Flavoring </th>
<td> Other / Misc </td>
</tr>
</tbody>
</table>
Those attributes are not always present; sometimes I have only Brand, other times Brand and Flavoring.
To scrape this I wrote code like this:
BlendInfo = namedtuple('BlendInfo', ['brand', 'type', 'contents', 'flavoring'])
stats_rows = soup.find('table', id='stats').find_all('tr')
bi = BlendInfo(brand=stats_rows[1].td.get_text(),
               type=stats_rows[2].td.get_text(),
               contents=stats_rows[3].td.get_text(),
               flavoring=stats_rows[4].td.get_text())
But as expected it fails with an index out of bounds error (or gets really messed up) when the table ordering is different (type before brand) or some of the rows are missing (no contents).
Is there any better approach to something like:
Give me the data from row with header with string 'brand'
It is definitely possible. Check this out:
from bs4 import BeautifulSoup
html_content='''
<table class="info" id="stats">
<tbody>
<tr>
<th> Brand </th>
<td> 2 Guys Smoke Shop </td>
</tr>
<tr>
<th> Blend Type </th>
<td> Aromatic </td>
</tr>
<tr>
<th> Contents </th>
<td> Black Cavendish, Virginia </td>
</tr>
<tr>
<th> Flavoring </th>
<td> Other / Misc </td>
</tr>
</tbody>
</table>
'''
soup = BeautifulSoup(html_content,"lxml")
for item in soup.find_all(class_='info')[0].find_all("th"):
    header = item.text
    rows = item.find_next_sibling().text
    print(header, rows)
Output:
Brand 2 Guys Smoke Shop
Blend Type Aromatic
Contents Black Cavendish, Virginia
Flavoring Other / Misc
This would build a dict for you:
from bs4 import BeautifulSoup

valid_headers = ['brand', 'blend type', 'contents', 'flavoring']
t = """<table class="info" id="stats">
<tbody>
<tr>
<th> Brand </th>
<td> 2 Guys Smoke Shop </td>
</tr>
<tr>
<th> Blend Type </th>
<td> Aromatic </td>
</tr>
<tr>
<th> Contents </th>
<td> Black Cavendish, Virginia </td>
</tr>
<tr>
<th> Flavoring </th>
<td> Other / Misc </td>
</tr>
</tbody>
</table>"""
bs = BeautifulSoup(t, 'html.parser')
results = {}
for row in bs.find_all('tr'):
    hea = row.find('th')
    # Compare the header cell's text; row.find_all('th') returns a list,
    # which has no .strip() and would raise an AttributeError.
    if hea and hea.get_text().strip().lower() in valid_headers:
        val = row.find('td')
        results[hea.get_text().strip()] = val.get_text().strip()
print(results)

Scrape table with Scrapy

I have a table formed like this from a website:
<table>
<tr class="head">
<td class="One">
Column 1
</td>
<td class="Two">
Column 2
</td>
<td class="Four">
Column 3
</td>
<td class="Five">
Column 4
</td>
</tr>
<tr class="DataSet1">
<td class="One">
<table>
<tr>
<td class="DataType1">
Data 1
</td>
</tr>
<tr>
<td class="DataType_2">
<ul>
<li> Data 2a</li>
<li> Data 2b</li>
<li> Data 2c</li>
<li> Data 2d</li>
</ul>
</td>
</tr>
</table>
</td>
<td class="Two">
<table>
<tr>
<td class="DataType_3">
Data 3
</td>
</tr>
<tr>
<td class="DataType_4">
Data 4
</td>
</tr>
</table>
</td>
<td class="Three">
<table>
<tr>
<td class="DataType_5">
Data 5
</td>
</tr>
</table>
</td>
<td class="Four">
<table>
<tr>
<td class="DataType_6">
Data 6
</td>
</tr>
</table>
</td>
</tr>
<tr class="Empty">
<td class="One">
</td>
<td class="Two">
</td>
<td class="Four">
</td>
<td class="Five">
</td>
</tr>
<tr class="DataSet2">
<td class="One">
<table>
<tr>
<td class="DataType_1">
Data 7
</td>
</tr>
<tr>
<td class="DataType_2">
Data 8
</td>
</tr>
</table>
</td>
<td class="Two">
<table>
<tr>
<td class="DataType_3">
Data 9
</td>
</tr>
<tr>
<td class="DataType_4">
Data 10
</td>
</tr>
</table>
</td>
<td class="Three">
<table>
<tr>
<td class="DataType_5">
Data 11
</td>
</tr>
</table>
</td>
<td class="Four">
<table>
<tr>
<td class="DataType_6">
Data 12
</td>
</tr>
</table>
</td>
</tr>
<!-- and so on -->
</table>
The tags are sometimes empty too, for example:
<td class="DataType_6"> </td>
I tried to scrape the content with Scrapy and the following script:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from project.items import ProjectItem

class MySpider(BaseSpider):
    name = "SpiderName"
    allowed_domains = ["url"]
    start_urls = ["url"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//tr')
        items = []
        item = ProjectItem()
        item["Data_1"] = rows.select('//td[@class="DataType_1"]/text()').extract()
        item["Data_2"] = rows.select('//td[@class="DataType_2"]/text()').extract()
        item["Data_3"] = rows.select('//td[@class="DataType_3"]/text()').extract()
        item["Data_4"] = rows.select('//td[@class="DataType_4"]/text()').extract()
        item["Data_5"] = rows.select('//td[@class="DataType_5"]/text()').extract()
        item["Data_6"] = rows.select('//td[@class="DataType_6"]/text()').extract()
        items.append(item)
        return items
If I crawl using this command:
scrapy crawl SpiderName -o output.csv -t csv
I only get junk output: repeated once for every DataSet on the page, the item contains all of the "Data_1" values at once (and likewise for the other fields), instead of the values belonging to that row.
Had a similar problem. First of all, rows = hxs.select('//tr') selects every <tr> in the whole document, not just the rows under the current node. You need to dig a bit deeper and use relative paths. This link gives an excellent explanation of how to structure your code.
When I finally got my head around it, I realised that in order to parse each item separately, row.select should not have the // in it.
Hope this helps.
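The absolute-vs-relative distinction the answer describes can be seen without Scrapy at all; the standard library's ElementTree makes the same point (a sketch on a trimmed, well-formed version of the table, keeping only two of the question's classes):

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed version of the table from the question.
html = """<table>
  <tr class="DataSet1">
    <td class="DataType_1">Data 1</td>
    <td class="DataType_3">Data 3</td>
  </tr>
  <tr class="DataSet2">
    <td class="DataType_1">Data 7</td>
    <td class="DataType_3">Data 9</td>
  </tr>
</table>"""

root = ET.fromstring(html)
items = []
# Iterate over rows, then use paths relative to the current row
# (leading '.') so each item only picks up its own row's cells;
# a '//'-anchored path would match every matching cell in the document.
for tr in root.findall('.//tr'):
    item = {
        "Data_1": [td.text for td in tr.findall('.//td[@class="DataType_1"]')],
        "Data_3": [td.text for td in tr.findall('.//td[@class="DataType_3"]')],
    }
    items.append(item)
print(items)
```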
