I have the following HTML code:
<tbody>
<tr>
<td>1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa</td>
<td>62e907b15cbf27d5425399ebf6f0fb50ebb88f18</td>
<td class="num">66.6771<small class="b-blockExplorer__small">1246</small> BTC</td>
<td class="num">66.6771<small class="b-blockExplorer__small">1246</small> BTC</td>
<td class="num">1089</td>
</tr>
<tr>
<td>12c6DSiU4Rq3P4ZxziKxzrL5LmMBrzjrJX</td>
<td>119b098e2e980a229e139a9ed01a469e518e6f26</td>
<td class="num">50.0572<small class="b-blockExplorer__small">3154</small> BTC</td>
<td class="num">50.0572<small class="b-blockExplorer__small">3154</small> BTC</td>
<td class="num">55</td>
</tr>
<!--- SNIP --->
</tbody>
I want to parse it to get something like:
1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa,62e907b15cbf27d5425399ebf6f0fb50ebb88f18,66.6771,66.6771
12c6DSiU4Rq3P4ZxziKxzrL5LmMBrzjrJX,119b098e2e980a229e139a9ed01a469e518e6f26,50.0572,50.0572
Tried with BeautifulSoup:
soup.select('tbody > tr > td')[rowcount].get_text(strip=True)
I get only the fist <td>*</td>
What am I doing wrong?
Try this
for row in soup.select('tbody tr'):
row_text = [x.text for x in row.find_all('td')]
print(', '.join(row_text)) # You can save or print this string however you want.
Output:
1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa, 62e907b15cbf27d5425399ebf6f0fb50ebb88f18, 66.67711246 BTC, 66.67711246 BTC, 1089
12c6DSiU4Rq3P4ZxziKxzrL5LmMBrzjrJX, 119b098e2e980a229e139a9ed01a469e518e6f26, 50.05723154 BTC, 50.05723154 BTC, 55
I was able to find what you want to scrape by doing the following:
from bs4 import BeautifulSoup
html = """<tbody>
<tr>
<td>1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa</td>
<td>62e907b15cbf27d5425399ebf6f0fb50ebb88f18</td>
<td class="num">66.6771<small class="b-blockExplorer__small">1246</small> BTC</td>
<td class="num">66.6771<small class="b-blockExplorer__small">1246</small> BTC</td>
<td class="num">1089</td>
</tr>
<tr>
<td>12c6DSiU4Rq3P4ZxziKxzrL5LmMBrzjrJX</td>
<td>119b098e2e980a229e139a9ed01a469e518e6f26</td>
<td class="num">50.0572<small class="b-blockExplorer__small">3154</small> BTC</td>
<td class="num">50.0572<small class="b-blockExplorer__small">3154</small> BTC</td>
<td class="num">55</td>
</tr>
<!--- SNIP --->
</tbody>"""
b = BeautifulSoup(html, 'lxml')
for tr in b.find_all('tr'):
data = tr.find_all('td')
val1 = data[0].find('a').text
val2 = data[1].find('a').text
num1 = data[2].text.split()[0]
num2 = data[3].text.split()[0]
print(val1, val2, num1, num2)
This results in:
1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa 62e907b15cbf27d5425399ebf6f0fb50ebb88f18 66.67711246 66.67711246
12c6DSiU4Rq3P4ZxziKxzrL5LmMBrzjrJX 119b098e2e980a229e139a9ed01a469e518e6f26 50.05723154 50.05723154
Related
I have the following page structure:
<tr class="small data-row" bgcolor="#f9f9f9">.</tr>
<td class="stats1" align="right">0</td>
<td class="stats1" align="right">0</td>
<td class="stats1" align="right">0</td>
<td class="stats1 stats-dash" align="right">-</td>
.
.
.
<tr class="small data-row" bgcolor="#ffffff">.</tr>
<tr class="small data-row" bgcolor="#f9f9f9">.</tr>
<tr class="small" bgcolor="#eff6ef">.</tr>
<td class="stats1" align="right">215</td>
<td class="stats1" align="right">183</td>
<td class="stats1" align="right">0</td>
<td class="stats1 stats-dash" align="right">-</td>
</tr>
I would like to get this second value == 183, but I am not sure how to do it. I tried in that way:
content = driver.page_source
soup = BeautifulSoup(content)
for elm in soup.select(".stats1"):
val=elm.get("align")
and the output is:
right
<td align="right" class="stats1">215</td>
if I got 183 instead of 215 I could use .split, but in this case I get only this first value.
.select() will return a list of elements. Just call that element by index:
from bs4 import BeautifulSoup
html = '''<tr class="small data-row" bgcolor="#f9f9f9">.</tr>
<tr class="small" bgcolor="#ffffff">.</tr>
<td class="stats1" align="right">215</td>
<td class="stats1" align="right">183</td>
<td class="stats1" align="right">0</td>
<td class="stats1 stats-dash" align="right">-</td>
</tr>'''
soup = BeautifulSoup(html, 'html.parser')
elm = soup.select(".stats1")[1]
Output:
print(elm.text)
183
I am trying to retrieve all the td text for the below table using Beautiful Soup, unfortunately the tag names are the same and I am either only able to retrieve the first element or some elements are repeatedly printing. Hence not really sure of how to go about it.
Below is HTML table snippet:
<div>Table</div>
<table class="Auto" width="100%">
<tr>
<td class="Auto_head">Address</td>
<td class="Auto_head">Name</td>
<td class="Auto_head">Type</td>
<td class="Auto_head">Value IN</td>
<td class="Auto_head">AUTO Statement</td>
<td class="Auto_head">Value OUT</td>
<td class="Auto_head">RESULT</td>
<td class="Auto_head"></td>
</tr>
<tr>
<td class="Auto_body">1</td>
<td class="Auto_body">abc</td>
<td class="Auto_body">yes</td>
<td class="Auto_body">abc123</td>
<td class="Auto_body">jar</td>
<td class="Auto_body">123abc</td>
<td class="Auto_body">PASS</td>
<td class="Auto_body">na</td>
</tr>
What I want is all the text content inside these tags for example the first auto_head corresponds to first auto_body i.e. Address = 1 similarly all the values should be retrieved.
I have used find,findall,findNext and next_sibling but no luck. Here is my current code in python:
self.table = self.soup_file.findAll(class_="Table")
self.headers = [tab.find(class_="Auto_head").findNext('td',class_="Auto_head").contents[0] for tab in self.table]
self.data = [data.find(class_="Auto_body").findNext('td').contents[0] for data in self.table]
Get the headers first, then use zip(...) to combine
from bs4 import BeautifulSoup
data = '''\
<table class="Auto" width="100%">
<tr>
<td class="Auto_head">Address</td>
<td class="Auto_head">Name</td>
<td class="Auto_head">Type</td>
</tr>
<tr>
<td class="Auto_body">1</td>
<td class="Auto_body">abc</td>
<td class="Auto_body">yes</td>
</tr>
<tr>
<td class="Auto_body">2</td>
<td class="Auto_body">def</td>
<td class="Auto_body">no</td>
</tr>
<tr>
<td class="Auto_body">3</td>
<td class="Auto_body">ghi</td>
<td class="Auto_body">maybe</td>
</tr>
</table>
'''
soup = BeautifulSoup(data, 'html.parser')
for table in soup.select('table.Auto'):
# get rows
rows = table.select('tr')
# get headers
headers = [td.text for td in rows[0].select('td.Auto_head')]
# get details
for row in rows[1:]:
values = [td.text for td in row.select('td.Auto_body')]
print(dict(zip(headers, values)))
My output:
{'Address': '1', 'Name': 'abc', 'Type': 'yes'}
{'Address': '2', 'Name': 'def', 'Type': 'no'}
{'Address': '3', 'Name': 'ghi', 'Type': 'maybe'}
Get each category first then iterate using zip
s = '''<div>Table</div>
<table class="Auto" width="100%">
<tr>
<td class="Auto_head">Address</td>
<td class="Auto_head">Name</td>
<td class="Auto_head">Type</td>
<td class="Auto_head">Value IN</td>
<td class="Auto_head">AUTO Statement</td>
<td class="Auto_head">Value OUT</td>
<td class="Auto_head">RESULT</td>
<td class="Auto_head"></td>
</tr>
<tr>
<td class="Auto_body">1</td>
<td class="Auto_body">abc</td>
<td class="Auto_body">yes</td>
<td class="Auto_body">abc123</td>
<td class="Auto_body">jar</td>
<td class="Auto_body">123abc</td>
<td class="Auto_body">PASS</td>
<td class="Auto_body">na</td>
</tr></table>'''
soup = BeautifulSoup(s,features='html')
head = soup.find_all(name='td',class_='Auto_head')
body = soup.find_all(name='td',class_='Auto_body')
for one,two in zip(head,body):
print(f'{one.text}={two.text}')
Address=1
Name=abc
Type=yes
Value IN=abc123
AUTO Statement=jar
Value OUT=123abc
RESULT=PASS
=na
Searching by CSS class
The easiest solution is to add the find_all method at the end of the find
so your code will be
source = requests.get('YOUR URL')
soup=BeautifulSoup(source.text,'html.parser')
data = soup.find('tr').find_all('td')[0]
data = soup.find('tr').find_all('td')[1]
and so on just change the last list number 0,1,2... or else use for loop for the same
I'm still a python noob trying to learn beautifulsoup.I looked at solutions on stack but was unsuccessful Please help me to understand this better.
i have extracted the html which is as shown below
<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>
i tried to parse find_all('tbody') but was unsuccessful
#table = bs.find("table", {"id": "ContentPlaceHolder1_dlDetails"})
html = browser.page_source
soup = bs(html, "lxml")
table = soup.find_all('table', {'id':'ContentPlaceHolder1_dlDetails'})
table_body = table.find('tbody')
rows = table.select('tr')
for row in rows:
cols = row.find_all('td')
cols = [ele.text.strip() for ele in cols]
data.append([ele for ele in cols if ele])values
I'm trying to save values in "listmaintext" class
Error message
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
Another way to do this using next_sibling
from bs4 import BeautifulSoup as bs
html ='''
<html>
<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>
</html>'''
soup = bs(html, 'lxml')
data = [' '.join((item.text, item.next_sibling.next_sibling.text)) for item in soup.select('#ContentPlaceHolder1_dlDetails tr .listmaintext:first-child') if item.text !='']
print(data)
from bs4 import BeautifulSoup
data = '''<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>'''
soup = BeautifulSoup(data, 'lxml')
s = soup.select('.listmaintext')
for td1, td2 in zip(s[::2], s[1::2]):
print('{} [{}]'.format(td1.text.strip(), td2.text.strip()))
Prints:
ATM ID: [DAGR00401111111]
ATM Centre: []
Site Location: [ADA Building - Agra]
I am new in Python and someone suggested me to use Beautiful soup for Scrapping and i am struck in a problem to fetch the href attribute from a td tag Column 2 on the basis of year in column 4.
<table class="tableFile2" summary="Results">
<tr>
<th width="7%" scope="col">Filings</th>
<th width="10%" scope="col">Format</th>
<th scope="col">Description</th>
<th width="10%" scope="col">Filing Date</th>
<th width="15%" scope="col">File/Film Number</th>
</tr>
<tr>
<td nowrap="nowrap">8-K</td>
<td nowrap="nowrap"> Documents</td>
<td class="small" >Current report, items 8.01 and 9.01
<br />Acc-no: 0001193125</td>
<td>2013-05-03</td>
<td nowrap="nowrap">000-10030<br>13813281 </td>
</tr>
<tr class="blueRow">
<td nowrap="nowrap">424B2</td>
<td nowrap="nowrap"> Documents</td>
<td class="small" >Prospectus [Rule 424(b)(2)]<br />Acc-no: 0001193125</td>
<td>2013-05-01</td>
<td nowrap="nowrap">333-188191<br>13802405 </td>
</tr>
<tr>
<td nowrap="nowrap">FWP</td>
<td nowrap="nowrap"> Documents</td>
<td class="small" >Filing under Securities Act Rules 163/433 of free writing prospectuses<br />Acc-no: 0001193125-13-189053 (34 Act) Size: 52 KB </td>
<td>2013-05-01</td>
<td nowrap="nowrap">333-188191<br>13800170 </td>
</tr>
</table>
table = soup.find('table', class="tableFile2")
rows = table.findAll('tr')
for tr in rows:
cols = tr.findAll('td')
if "2013" in cols[3]
link = cols[1].find('a').get('href')
print
This works for me in Python 2.7:
table = soup.find('table', {'class': 'tableFile2'})
rows = table.findAll('tr')
for tr in rows:
cols = tr.findAll('td')
if len(cols) >= 4 and "2013" in cols[3].text:
link = cols[1].find('a').get('href')
print link
A few issues with your previous code:
soup.find() requires a dictionary of attributes (e.g., {'class' : 'tableFile2'})
Not every cols instance will have at least 3 columns, so you need to check length first.
I'm trying to parse through a table of rows using beautiful soup and save values of each row in a dict.
One hiccup is the structure of the table has some rows as the section headers. So for any row with the class 'header' I want to define a variable called "section". Here's what I have, but it's not working because it's saying ['class'] TypeError: string indices must be integers
Here's what I have:
for i in credits.contents:
if i['class'] == 'header':
section = i.contents
DATA_SET[section] = {}
else:
DATA_SET[section]['data_point_1'] = i.find('td', {'class' : 'data_point_1'}).find('p').contents
DATA_SET[section]['data_point_2'] = i.find('td', {'class' : 'data_point_2'}).find('p').contents
DATA_SET[section]['data_point_3'] = i.find('td', {'class' : 'data_point_3'}).find('p').contents
Example of data:
<table class="credits">
<tr class="header">
<th colspan="3"><h1>HEADER NAME</h1></th>
</tr>
<tr>
<td class="data_point_1"><p>DATA</p></td>
<td class="data_point_2"><p>DATA</p></td>
<td class="data_point_3"><p>DATA</p></td>
</tr>
<tr>
<td class="data_point_1"><p>DATA</p></td>
<td class="data_point_2"><p>DATA</p></td>
<td class="data_point_3"><p>DATA</p></td>
</tr>
<tr>
<td class="data_point_1"><p>DATA</p></td>
<td class="data_point_2"><p>DATA</p></td>
<td class="data_point_3"><p>DATA</p></td>
</tr>
<tr class="header">
<th colspan="3"><h1>HEADER NAME</h1></th>
</tr>
<tr>
<td class="data_point_1"><p>DATA</p></td>
<td class="data_point_2"><p>DATA</p></td>
<td class="data_point_3"><p>DATA</p></td>
</tr>
<tr>
<td class="data_point_1"><p>DATA</p></td>
<td class="data_point_2"><p>DATA</p></td>
<td class="data_point_3"><p>DATA</p></td>
</tr>
<tr>
<td class="data_point_1"><p>DATA</p></td>
<td class="data_point_2"><p>DATA</p></td>
<td class="data_point_3"><p>DATA</p></td>
</tr>
</table>
Here is one solution, with a slight adaptation of your example data so that the result is clearer:
from BeautifulSoup import BeautifulSoup
from pprint import pprint
html = '''<body><table class="credits">
<tr class="header">
<th colspan="3"><h1>HEADER 1</h1></th>
</tr>
<tr>
<td class="data_point_1"><p>DATA11</p></td>
<td class="data_point_2"><p>DATA12</p></td>
<td class="data_point_3"><p>DATA12</p></td>
</tr>
<tr>
<td class="data_point_1"><p>DATA21</p></td>
<td class="data_point_2"><p>DATA22</p></td>
<td class="data_point_3"><p>DATA23</p></td>
</tr>
<tr>
<td class="data_point_1"><p>DATA31</p></td>
<td class="data_point_2"><p>DATA32</p></td>
<td class="data_point_3"><p>DATA33</p></td>
</tr>
<tr class="header">
<th colspan="3"><h1>HEADER 2</h1></th>
</tr>
<tr>
<td class="data_point_1"><p>DATA11</p></td>
<td class="data_point_2"><p>DATA12</p></td>
<td class="data_point_3"><p>DATA13</p></td>
</tr>
<tr>
<td class="data_point_1"><p>DATA21</p></td>
<td class="data_point_2"><p>DATA22</p></td>
<td class="data_point_3"><p>DATA23</p></td>
</tr>
<tr>
<td class="data_point_1"><p>DATA31</p></td>
<td class="data_point_2"><p>DATA32</p></td>
<td class="data_point_3"><p>DATA33</p></td>
</tr>
</table></body>'''
soup = BeautifulSoup(html)
rows = soup.findAll('tr')
section = ''
dataset = {}
for row in rows:
if row.attrs:
section = row.text
dataset[section] = {}
else:
cells = row.findAll('td')
for cell in cells:
if cell['class'] in dataset[section]:
dataset[section][ cell['class'] ].append( cell.text )
else:
dataset[section][ cell['class'] ] = [ cell.text ]
pprint(dataset)
Produces:
{u'HEADER 1': {u'data_point_1': [u'DATA11', u'DATA21', u'DATA31'],
u'data_point_2': [u'DATA12', u'DATA22', u'DATA32'],
u'data_point_3': [u'DATA12', u'DATA23', u'DATA33']},
u'HEADER 2': {u'data_point_1': [u'DATA11', u'DATA21', u'DATA31'],
u'data_point_2': [u'DATA12', u'DATA22', u'DATA32'],
u'data_point_3': [u'DATA13', u'DATA23', u'DATA33']}}
EDIT ADAPTATION OF YOUR SOLUTION
Your code is neat and has only a couple of issues. You use contents in places where you shoul duse text or findAll -- I repaired that below:
soup = BeautifulSoup(html)
credits = soup.find('table')
section = ''
DATA_SET = {}
for i in credits.findAll('tr'):
if i.get('class', '') == 'header':
section = i.text
DATA_SET[section] = {}
else:
DATA_SET[section]['data_point_1'] = i.find('td', {'class' : 'data_point_1'}).find('p').contents
DATA_SET[section]['data_point_2'] = i.find('td', {'class' : 'data_point_2'}).find('p').contents
DATA_SET[section]['data_point_3'] = i.find('td', {'class' : 'data_point_3'}).find('p').contents
print DATA_SET
Please note that if successive cells have the same data_point class, then successive rows will replace earlier ones. I suspect this is not an issue in your real dataset, but that is why your code would return this, abbreviated, result:
{u'HEADER 2': {'data_point_2': [u'DATA32'],
'data_point_3': [u'DATA33'],
'data_point_1': [u'DATA31']},
u'HEADER 1': {'data_point_2': [u'DATA32'],
'data_point_3': [u'DATA33'],
'data_point_1': [u'DATA31']}}