I have the following page structure:
<tr class="small data-row" bgcolor="#f9f9f9">.</tr>
<td class="stats1" align="right">0</td>
<td class="stats1" align="right">0</td>
<td class="stats1" align="right">0</td>
<td class="stats1 stats-dash" align="right">-</td>
.
.
.
<tr class="small data-row" bgcolor="#ffffff">.</tr>
<tr class="small data-row" bgcolor="#f9f9f9">.</tr>
<tr class="small" bgcolor="#eff6ef">.</tr>
<td class="stats1" align="right">215</td>
<td class="stats1" align="right">183</td>
<td class="stats1" align="right">0</td>
<td class="stats1 stats-dash" align="right">-</td>
</tr>
I would like to get the second value, 183, but I am not sure how to do it. I tried it this way:
content = driver.page_source
soup = BeautifulSoup(content)
for elm in soup.select(".stats1"):
    val = elm.get("align")
and the output is:
right
<td align="right" class="stats1">215</td>
If I got 183 instead of 215 I could use .split, but this way I only get the first value.
.select() returns a list of elements, so just access the one you want by index:
from bs4 import BeautifulSoup
html = '''<tr class="small data-row" bgcolor="#f9f9f9">.</tr>
<tr class="small" bgcolor="#ffffff">.</tr>
<td class="stats1" align="right">215</td>
<td class="stats1" align="right">183</td>
<td class="stats1" align="right">0</td>
<td class="stats1 stats-dash" align="right">-</td>
</tr>'''
soup = BeautifulSoup(html, 'html.parser')
elm = soup.select(".stats1")[1]
print(elm.text)
Output:
183
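On the full page, the `.stats1` cells of the earlier data rows come first, so a fixed index of 1 may land in the wrong row. A minimal sketch of one way around that, assuming the cells are really nested inside their rows and that the totals row is the only `tr.small` without the `data-row` class (the trimmed-down HTML below is hypothetical, not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed version of the page: one data row and the totals row.
html = """
<table>
<tr class="small data-row" bgcolor="#f9f9f9">
<td class="stats1" align="right">0</td>
<td class="stats1" align="right">0</td>
</tr>
<tr class="small" bgcolor="#eff6ef">
<td class="stats1" align="right">215</td>
<td class="stats1" align="right">183</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Restrict the search to the row that lacks the data-row class,
# then index into the cells of that row only.
cells = soup.select("tr.small:not(.data-row) td.stats1")
second_value = cells[1].get_text(strip=True)
print(second_value)
```

If the real page marks the totals row differently, only the selector needs to change; the indexing stays the same.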
I have the HTML below, and I want to get the value of the event_timestamp attribute:
<tr id="eventRowId_454169" event_attr_id="25" event_timestamp="2022-07-19 12:30:00" onclick="javascript:changeEventDisplay(454169, this, 'overview');">
<td class="first left time">15:30</td>
<td class="flagCur"> USD</td> <td class="sentiment" title="High Volatility Expected"><i class="newSiteIconsSprite grayFullBullishIcon middle"></i><i class="newSiteIconsSprite grayFullBullishIcon middle"></i><i class="newSiteIconsSprite grayFullBullishIcon middle"></i></td> <td class="left event">Building Permits (Jun)</td>
</tr>
Below is my code
Time = row.tr['event_timestamp']
I am getting None. What can I change to get the time?
from bs4 import BeautifulSoup
html = '''<tr id="eventRowId_454169" event_attr_id="25" event_timestamp="2022-07-19 12:30:00" onclick="javascript:changeEventDisplay(454169, this, 'overview');">
<td class="first left time">15:30</td>
<td class="flagCur"> USD</td> <td class="sentiment" title="High Volatility Expected"><i class="newSiteIconsSprite grayFullBullishIcon middle"></i><i class="newSiteIconsSprite grayFullBullishIcon middle"></i><i class="newSiteIconsSprite grayFullBullishIcon middle"></i></td> <td class="left event">Building Permits (Jun)</td>
</tr>
'''
soup = BeautifulSoup(html, 'html.parser')
time = soup.select_one('tr').get('event_timestamp')
print(time)
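A likely reason the original attempt fails, sketched under the assumption that `row` in the question was already the `<tr>` tag itself: `row.tr` searches for a `<tr>` nested inside `row`, finds none, and returns None. Reading the attribute directly from the tag avoids that. The shortened HTML below is only for illustration:

```python
from bs4 import BeautifulSoup

# Shortened copy of the row from the question.
html = '''<tr id="eventRowId_454169" event_attr_id="25" event_timestamp="2022-07-19 12:30:00">
<td class="first left time">15:30</td>
<td class="left event">Building Permits (Jun)</td>
</tr>'''

soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr")

# row is the <tr> itself, so there is no nested <tr> to find.
nested = row.tr

# Attributes are read from the tag directly; .get() returns None if the attribute is missing.
timestamp = row.get("event_timestamp")
print(nested, timestamp)
```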
I have a text file with the content below. All I need is to extract 29565618 when a specific string matches (the parts highlighted in bold below):
<div title="Available on both MOS and OTN">OracleJDK8 Update 212 <strong>(public)</strong></div>
Note: The href tag sits two lines above this pattern match in the input text file.
Input Text File:
<tr>
<td class="km">29565618</td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle **JDK** 8 Update 212 <strong>(**public**)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km">29206839</td>
<td class="km">Oracle JRE 8 Update 211 Enterprise Installer</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km">29206838</td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle SERVER JRE 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km">29206859</td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle Java SE Embedded 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
Expected Output:
29565618
My Code:
with open('file.txt') as f:
    my_list = list(f)
try:
    if my_list.index('JDK') > 0 and my_list.index('public') > 0:
        print(string[4:-4])
except:
    pass
You can do it with Beautiful Soup like this:
from bs4 import BeautifulSoup
html_doc = """
<tr>
<td class="km">29565618</td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle **JDK** 8 Update 212 <strong>(**public**)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km">29206839</td>
<td class="km">Oracle JRE 8 Update 211 Enterprise Installer</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km">29206838</td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle SERVER JRE 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km">29206859</td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle Java SE Embedded 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>"""
soup = BeautifulSoup(html_doc, 'html.parser')
trs = soup.find_all('tr')
for tr in trs:
    if tr.div:
        div_text = tr.div.get_text()
        if "JDK" in div_text and "public" in div_text:
            for td in tr.find_all('td'):
                td_text = td.get_text()
                if td_text.isdigit():
                    print(td_text)
Output:
29565618
If data is your HTML snippet from the question, this script:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
for a in soup.select('td.km:has(~ td.km) > a'):
    if re.findall(r' JDK.*?\(public\)', a.find_next('td', class_='km').text):
        print(a.text)
prints:
29565618
soup = BeautifulSoup(html_doc, 'html.parser')
match = soup.find(text=lambda t: "JDK" in t)
if match and 'public' in match.parent.text:
    print(match.find_previous('a').text)
Thanks to @Andrej Kesely
You can use:
(?=<a.*?>(.*)</a>)
Check here, it uses your data to confirm the match: https://regex101.com/r/W2wV2I/1/
What about this
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = ''' <tr>
<td class="km">29565618</td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle **JDK** 8 Update 212 <strong>(**public**)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km">29206839</td>
<td class="km">Oracle JRE 8 Update 211 Enterprise Installer</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km">29206838</td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle SERVER JRE 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km">29206859</td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle Java SE Embedded 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>'''
doc = SimplifiedDoc(html)
trs = doc.trs.contains(['JDK','public'])
for tr in trs:
    print(tr.a.text) # 29565618
I'm still a Python noob trying to learn BeautifulSoup. I looked at solutions on Stack Overflow but was unsuccessful. Please help me understand this better.
I have extracted the HTML shown below:
<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>
I tried to parse it with find_all('tbody') but was unsuccessful:
#table = bs.find("table", {"id": "ContentPlaceHolder1_dlDetails"})
html = browser.page_source
soup = bs(html, "lxml")
table = soup.find_all('table', {'id':'ContentPlaceHolder1_dlDetails'})
table_body = table.find('tbody')
rows = table.select('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])
I'm trying to save the values of the cells with the "listmaintext" class.
Error message:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
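The error message points at the cause: find_all() returns a ResultSet (a list of tags), which has no find() method, whereas find() returns a single tag. A minimal sketch of the corrected loop, assuming a shortened copy of the table from the question and using the built-in html.parser so that it runs without lxml:

```python
from bs4 import BeautifulSoup

# Shortened copy of the nested table from the question.
html = """<table id="ContentPlaceHolder1_dlDetails"><tbody><tr><td>
<table><tbody>
<tr><td class="listmaintext">ATM ID: </td><td class="listmaintext">DAGR00401111111</td></tr>
<tr><td class="listmaintext">ATM Centre:</td><td class="listmaintext"></td></tr>
<tr><td class="listmaintext">Site Location: </td><td class="listmaintext">ADA Building - Agra</td></tr>
</tbody></table>
</td></tr></tbody></table>"""

soup = BeautifulSoup(html, "html.parser")

# find() returns one Tag, so further searches on it work.
table = soup.find("table", {"id": "ContentPlaceHolder1_dlDetails"})

data = []
for row in table.find_all("tr"):
    # recursive=False keeps the outer wrapper row from collecting the inner cells again.
    cols = [td.get_text(strip=True)
            for td in row.find_all("td", class_="listmaintext", recursive=False)]
    if cols:
        data.append(cols)

print(data)
```

Empty cells are kept as empty strings here so that every row still has a label and a value.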
Another way to do this, using next_sibling:
from bs4 import BeautifulSoup as bs
html ='''
<html>
<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>
</html>'''
soup = bs(html, 'lxml')
data = [' '.join((item.text, item.next_sibling.next_sibling.text)) for item in soup.select('#ContentPlaceHolder1_dlDetails tr .listmaintext:first-child') if item.text !='']
print(data)
from bs4 import BeautifulSoup
data = '''<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>'''
soup = BeautifulSoup(data, 'lxml')
s = soup.select('.listmaintext')
for td1, td2 in zip(s[::2], s[1::2]):
    print('{} [{}]'.format(td1.text.strip(), td2.text.strip()))
Prints:
ATM ID: [DAGR00401111111]
ATM Centre: []
Site Location: [ADA Building - Agra]
I have the following HTML code:
<tbody>
<tr>
<td>1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa</td>
<td>62e907b15cbf27d5425399ebf6f0fb50ebb88f18</td>
<td class="num">66.6771<small class="b-blockExplorer__small">1246</small> BTC</td>
<td class="num">66.6771<small class="b-blockExplorer__small">1246</small> BTC</td>
<td class="num">1089</td>
</tr>
<tr>
<td>12c6DSiU4Rq3P4ZxziKxzrL5LmMBrzjrJX</td>
<td>119b098e2e980a229e139a9ed01a469e518e6f26</td>
<td class="num">50.0572<small class="b-blockExplorer__small">3154</small> BTC</td>
<td class="num">50.0572<small class="b-blockExplorer__small">3154</small> BTC</td>
<td class="num">55</td>
</tr>
<!--- SNIP --->
</tbody>
I want to parse it to get something like:
1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa,62e907b15cbf27d5425399ebf6f0fb50ebb88f18,66.6771,66.6771
12c6DSiU4Rq3P4ZxziKxzrL5LmMBrzjrJX,119b098e2e980a229e139a9ed01a469e518e6f26,50.0572,50.0572
Tried with BeautifulSoup:
soup.select('tbody > tr > td')[rowcount].get_text(strip=True)
I get only the first <td>*</td>.
What am I doing wrong?
Try this
for row in soup.select('tbody tr'):
    row_text = [x.text for x in row.find_all('td')]
    print(', '.join(row_text)) # You can save or print this string however you want.
Output:
1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa, 62e907b15cbf27d5425399ebf6f0fb50ebb88f18, 66.67711246 BTC, 66.67711246 BTC, 1089
12c6DSiU4Rq3P4ZxziKxzrL5LmMBrzjrJX, 119b098e2e980a229e139a9ed01a469e518e6f26, 50.05723154 BTC, 50.05723154 BTC, 55
I was able to find what you want to scrape by doing the following:
from bs4 import BeautifulSoup
html = """<tbody>
<tr>
<td>1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa</td>
<td>62e907b15cbf27d5425399ebf6f0fb50ebb88f18</td>
<td class="num">66.6771<small class="b-blockExplorer__small">1246</small> BTC</td>
<td class="num">66.6771<small class="b-blockExplorer__small">1246</small> BTC</td>
<td class="num">1089</td>
</tr>
<tr>
<td>12c6DSiU4Rq3P4ZxziKxzrL5LmMBrzjrJX</td>
<td>119b098e2e980a229e139a9ed01a469e518e6f26</td>
<td class="num">50.0572<small class="b-blockExplorer__small">3154</small> BTC</td>
<td class="num">50.0572<small class="b-blockExplorer__small">3154</small> BTC</td>
<td class="num">55</td>
</tr>
<!--- SNIP --->
</tbody>"""
b = BeautifulSoup(html, 'lxml')
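for tr in b.find_all('tr'):
    data = tr.find_all('td')
    # The first two cells hold the address and the hash
    val1 = data[0].get_text(strip=True)
    val2 = data[1].get_text(strip=True)
    # Keep only the number, dropping the trailing "BTC"
    num1 = data[2].text.split()[0]
    num2 = data[3].text.split()[0]
    print(val1, val2, num1, num2)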
This results in:
1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa 62e907b15cbf27d5425399ebf6f0fb50ebb88f18 66.67711246 66.67711246
12c6DSiU4Rq3P4ZxziKxzrL5LmMBrzjrJX 119b098e2e980a229e139a9ed01a469e518e6f26 50.05723154 50.05723154
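Both outputs above keep the digits from the `<small>` tag (66.67711246), while the question asks for 66.6771. A minimal sketch that takes only the first text node of each amount cell, assuming the amount always comes before the `<small>` tag as in the sample rows:

```python
from bs4 import BeautifulSoup

html = """<table><tbody>
<tr>
<td>1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa</td>
<td>62e907b15cbf27d5425399ebf6f0fb50ebb88f18</td>
<td class="num">66.6771<small class="b-blockExplorer__small">1246</small> BTC</td>
<td class="num">66.6771<small class="b-blockExplorer__small">1246</small> BTC</td>
<td class="num">1089</td>
</tr>
<tr>
<td>12c6DSiU4Rq3P4ZxziKxzrL5LmMBrzjrJX</td>
<td>119b098e2e980a229e139a9ed01a469e518e6f26</td>
<td class="num">50.0572<small class="b-blockExplorer__small">3154</small> BTC</td>
<td class="num">50.0572<small class="b-blockExplorer__small">3154</small> BTC</td>
<td class="num">55</td>
</tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")

lines = []
for tr in soup.select("tbody tr"):
    tds = tr.find_all("td")
    address = tds[0].get_text(strip=True)
    hash160 = tds[1].get_text(strip=True)
    # stripped_strings yields the text nodes in document order;
    # the first one is the part before <small>.
    amounts = [next(td.stripped_strings) for td in tds[2:4]]
    lines.append(",".join([address, hash160, *amounts]))

print("\n".join(lines))
```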
I'm trying to parse through a table of rows using beautiful soup and save values of each row in a dict.
One hiccup is that some rows of the table are section headers. So for any row with the class 'header' I want to define a variable called "section". My attempt is not working: it fails on ['class'] with TypeError: string indices must be integers.
Here's what I have:
for i in credits.contents:
    if i['class'] == 'header':
        section = i.contents
        DATA_SET[section] = {}
    else:
        DATA_SET[section]['data_point_1'] = i.find('td', {'class' : 'data_point_1'}).find('p').contents
        DATA_SET[section]['data_point_2'] = i.find('td', {'class' : 'data_point_2'}).find('p').contents
        DATA_SET[section]['data_point_3'] = i.find('td', {'class' : 'data_point_3'}).find('p').contents
Example of data:
<table class="credits">
<tr class="header">
<th colspan="3"><h1>HEADER NAME</h1></th>
</tr>
<tr>
<td class="data_point_1"><p>DATA</p></td>
<td class="data_point_2"><p>DATA</p></td>
<td class="data_point_3"><p>DATA</p></td>
</tr>
<tr>
<td class="data_point_1"><p>DATA</p></td>
<td class="data_point_2"><p>DATA</p></td>
<td class="data_point_3"><p>DATA</p></td>
</tr>
<tr>
<td class="data_point_1"><p>DATA</p></td>
<td class="data_point_2"><p>DATA</p></td>
<td class="data_point_3"><p>DATA</p></td>
</tr>
<tr class="header">
<th colspan="3"><h1>HEADER NAME</h1></th>
</tr>
<tr>
<td class="data_point_1"><p>DATA</p></td>
<td class="data_point_2"><p>DATA</p></td>
<td class="data_point_3"><p>DATA</p></td>
</tr>
<tr>
<td class="data_point_1"><p>DATA</p></td>
<td class="data_point_2"><p>DATA</p></td>
<td class="data_point_3"><p>DATA</p></td>
</tr>
<tr>
<td class="data_point_1"><p>DATA</p></td>
<td class="data_point_2"><p>DATA</p></td>
<td class="data_point_3"><p>DATA</p></td>
</tr>
</table>
Here is one solution, with a slight adaptation of your example data so that the result is clearer:
from BeautifulSoup import BeautifulSoup
from pprint import pprint
html = '''<body><table class="credits">
<tr class="header">
<th colspan="3"><h1>HEADER 1</h1></th>
</tr>
<tr>
<td class="data_point_1"><p>DATA11</p></td>
<td class="data_point_2"><p>DATA12</p></td>
<td class="data_point_3"><p>DATA12</p></td>
</tr>
<tr>
<td class="data_point_1"><p>DATA21</p></td>
<td class="data_point_2"><p>DATA22</p></td>
<td class="data_point_3"><p>DATA23</p></td>
</tr>
<tr>
<td class="data_point_1"><p>DATA31</p></td>
<td class="data_point_2"><p>DATA32</p></td>
<td class="data_point_3"><p>DATA33</p></td>
</tr>
<tr class="header">
<th colspan="3"><h1>HEADER 2</h1></th>
</tr>
<tr>
<td class="data_point_1"><p>DATA11</p></td>
<td class="data_point_2"><p>DATA12</p></td>
<td class="data_point_3"><p>DATA13</p></td>
</tr>
<tr>
<td class="data_point_1"><p>DATA21</p></td>
<td class="data_point_2"><p>DATA22</p></td>
<td class="data_point_3"><p>DATA23</p></td>
</tr>
<tr>
<td class="data_point_1"><p>DATA31</p></td>
<td class="data_point_2"><p>DATA32</p></td>
<td class="data_point_3"><p>DATA33</p></td>
</tr>
</table></body>'''
soup = BeautifulSoup(html)
rows = soup.findAll('tr')
section = ''
dataset = {}
for row in rows:
    if row.attrs:
        section = row.text
        dataset[section] = {}
    else:
        cells = row.findAll('td')
        for cell in cells:
            if cell['class'] in dataset[section]:
                dataset[section][ cell['class'] ].append( cell.text )
            else:
                dataset[section][ cell['class'] ] = [ cell.text ]
pprint(dataset)
Produces:
{u'HEADER 1': {u'data_point_1': [u'DATA11', u'DATA21', u'DATA31'],
u'data_point_2': [u'DATA12', u'DATA22', u'DATA32'],
u'data_point_3': [u'DATA12', u'DATA23', u'DATA33']},
u'HEADER 2': {u'data_point_1': [u'DATA11', u'DATA21', u'DATA31'],
u'data_point_2': [u'DATA12', u'DATA22', u'DATA32'],
u'data_point_3': [u'DATA13', u'DATA23', u'DATA33']}}
EDIT ADAPTATION OF YOUR SOLUTION
Your code is neat and has only a couple of issues. You use contents in places where you should use text or findAll. I repaired that below:
soup = BeautifulSoup(html)
credits = soup.find('table')
section = ''
DATA_SET = {}
for i in credits.findAll('tr'):
    if i.get('class', '') == 'header':
        section = i.text
        DATA_SET[section] = {}
    else:
        DATA_SET[section]['data_point_1'] = i.find('td', {'class' : 'data_point_1'}).find('p').contents
        DATA_SET[section]['data_point_2'] = i.find('td', {'class' : 'data_point_2'}).find('p').contents
        DATA_SET[section]['data_point_3'] = i.find('td', {'class' : 'data_point_3'}).find('p').contents
print DATA_SET
Please note that if cells in successive rows share the same data_point class, later rows will overwrite earlier ones. I suspect this is not an issue in your real dataset, but it is why your code would return this abbreviated result:
{u'HEADER 2': {'data_point_2': [u'DATA32'],
'data_point_3': [u'DATA33'],
'data_point_1': [u'DATA31']},
u'HEADER 1': {'data_point_2': [u'DATA32'],
'data_point_3': [u'DATA33'],
'data_point_1': [u'DATA31']}}
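The code above targets the old BeautifulSoup 3 package on Python 2. A minimal sketch of the same idea for bs4 on Python 3, under two assumptions: `class` comes back as a list in bs4 (so the first class name is used as the dictionary key), and values are appended so that later rows do not overwrite earlier ones. The shortened HTML and its cell values are hypothetical:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened table in the same shape as the example data.
html = """<table class="credits">
<tr class="header"><th colspan="2"><h1>HEADER 1</h1></th></tr>
<tr><td class="data_point_1"><p>A1</p></td><td class="data_point_2"><p>B1</p></td></tr>
<tr><td class="data_point_1"><p>A2</p></td><td class="data_point_2"><p>B2</p></td></tr>
<tr class="header"><th colspan="2"><h1>HEADER 2</h1></th></tr>
<tr><td class="data_point_1"><p>C1</p></td><td class="data_point_2"><p>D1</p></td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

dataset = {}
section = None
# Selecting the <tr> tags skips the whitespace strings that .contents also contains;
# indexing such a string with ['class'] is what raises the TypeError.
for row in soup.select("table.credits tr"):
    if "header" in row.get("class", []):
        section = row.get_text(strip=True)
        dataset[section] = {}
    elif section is not None:
        for cell in row.find_all("td"):
            key = cell["class"][0]  # bs4 returns class as a list of names
            dataset[section].setdefault(key, []).append(cell.get_text(strip=True))

print(dataset)
```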