How do I fetch table with attributes using BeautifulSoup? [duplicate] - python

How can I fetch a table with other attributes using BeautifulSoup?
I have this table:
I tried this, but it didn't succeed:
Soup.find_all('table',{'style':"width=100%;align:center;border:0;cellpadding:0;cellspacing:0"})

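The table in the example below carries width, border and align as separate attributes rather than inside a style attribute, so pass each one as its own key. You can do this: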
from bs4 import BeautifulSoup
html_source = '''
<table width="100%" border="0" align="center" cellpadding="0" cellspacing="0">
<table width="100%" border="0" align="center" cellpadding="0" cellspacing="0">
<table width="100%" border="0" align="center" cellpadding="0" cellspacing="0">
'''
soup = BeautifulSoup(html_source, 'html.parser')
# find_all returns every matching table; find would return only the first one
els = soup.find_all("table", attrs={"width": "100%", "border": "0", "align": "center"})
print(' '.join(map(str, els)))
You can provide as many attributes as you want in that dictionary. You can get a bit more information from here.
Example: https://www.napkin.io/n/c174e5c8557d402d
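As a rough sketch of why a lookup on a style attribute finds nothing while a lookup on the individual attributes works, the snippet below compares the two and adds an equivalent CSS selector. The second table and the variable names by_style, by_attrs and by_css are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one table that should match and one that should not.
html_source = '''
<table width="100%" border="0" align="center" cellpadding="0" cellspacing="0">
  <tr><td>target</td></tr>
</table>
<table width="50%" border="1">
  <tr><td>other</td></tr>
</table>
'''
soup = BeautifulSoup(html_source, "html.parser")

# Neither table has a style attribute, so this search finds nothing.
by_style = soup.find_all(
    "table",
    {"style": "width=100%;align:center;border:0;cellpadding:0;cellspacing:0"},
)

# Each HTML attribute is matched as its own dictionary key.
by_attrs = soup.find_all(
    "table", attrs={"width": "100%", "border": "0", "align": "center"}
)

# The same condition written as a CSS attribute selector.
by_css = soup.select('table[width="100%"][border="0"][align="center"]')

print(len(by_style), len(by_attrs), len(by_css))
```

Both the dictionary form and the selector form return the same Tag objects, so the choice between them is mostly a matter of taste.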

Related

How to extract onClick url using beautifulsoup

Below is the HTML code which needs extraction
<div class="one_block" style="display:block;" onClick="location.href=\'/games/box.html?&game_type=01&game_id=13&game_date=2020-04-19&pbyear=2020\';" style="cursor:pointer;">
<!-- \xe5\xb0\x8d\xe6\x88\xb0\xe7\x90\x83\xe9\x9a\x8a\xe5\x8f\x8a\xe5\xa0\xb4\xe5\x9c\xb0 start -->
<table width="100%" border="0" cellspacing="0" cellpadding="0" class="schedule_team">
<tr>
How do I get the location.href value?
Tried:
soup.findAll("div", {"onClick": "location.href"})
Returns null
Desired Output:
/games/box.html?&game_type=01&game_id=13&game_date=2020-04-19&pbyear=2020
PS: there are plenty of location.href values on the page.
How about using the .select() method, which relies on the SoupSieve package to run a CSS selector:
from bs4 import BeautifulSoup
html = '<div class="one_block" style="display:block;" onClick="location.href=\'/games/box.html?&game_type=01&game_id=13&game_date=2020-04-19&pbyear=2020\';" style="cursor:pointer;">' \
'<!-- \xe5\xb0\x8d\xe6\x88\xb0\xe7\x90\x83\xe9\x9a\x8a\xe5\x8f\x8a\xe5\xa0\xb4\xe5\x9c\xb0 start -->' \
'<table width="100%" border="0" cellspacing="0" cellpadding="0" class="schedule_team"><tr>'
soup = BeautifulSoup(html, features="lxml")
element = soup.select('div.one_block')[0]
print(element.get('onclick'))
Use split to get just the URL:
print(element.get('onclick').split("'")[1])
/games/box.html?&game_type=01&game_id=13&game_date=2020-04-19&pbyear=2020
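Since the page reportedly has many such handlers, here is a minimal sketch that collects every location.href target in one pass. The original attempt probably found nothing because a plain string filter has to equal the whole attribute value, and the common parsers lower-case attribute names, so the key is onclick. The sample HTML, the href_pattern regex and the urls list are assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page with several clickable blocks.
html = (
    '<div class="one_block" onClick="location.href=\'/games/box.html?game_id=13\';"></div>'
    '<div class="one_block" onClick="location.href=\'/games/box.html?game_id=14\';"></div>'
    '<div class="one_block">no handler</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Capture whatever sits between the single quotes after location.href=
href_pattern = re.compile(r"location\.href\s*=\s*'([^']*)'")

urls = []
# A compiled regex as the attribute filter keeps only divs whose
# onclick value contains a match.
for div in soup.find_all("div", onclick=href_pattern):
    match = href_pattern.search(div["onclick"])
    urls.append(match.group(1))

print(urls)
```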

Extracting a value from html table using BeautifulSoup

I'm trying to extract a value from an HTML table using bs4; however, the structure of the table is in the form of:
<td class="celda400" vAlign="center" align="right" width="100" bgColor="#DFEDFF" style="color:Black">
575,42
</td>
The value I'm interested in is 575,42; however, the cell has no id or other unique identifier that bs4 could use to extract it.
How can I get this value? Or under what id?
You can use any of the attributes to extract the value. For example, to use the class="celda400" attribute:
response.find('td', {'class':"celda400"}).string
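One detail worth checking: .string returns the text node as it is, including the newlines around the number. A minimal sketch, assuming the response object is a BeautifulSoup document (here called soup) and using a trimmed copy of the cell from the question:

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical copy of the cell from the question.
html = '''
<td class="celda400" align="right" width="100">
575,42
</td>
'''
soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td", {"class": "celda400"})

raw = cell.string                    # text node, surrounding newlines kept
clean = cell.get_text(strip=True)    # same text with whitespace stripped

print(repr(raw), repr(clean))
```

get_text(strip=True) also copes with cells that contain nested tags, where .string would return None.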
Another solution.
from simplified_scrapy import SimplifiedDoc,req,utils
html = '''
<td class="celda400" vAlign="center" align="right" width="100" bgColor="#DFEDFF" style="color:Black">
575,42
</td>
<td class="celda400" vAlign="center" align="right" width="100" bgColor="#DFEDFF" style="color:Black">
575,43
</td>
'''
doc = SimplifiedDoc(html)
texts = doc.selects('td.celda400').text
print (texts)
Result:
['575,42', '575,43']
Here are more examples. https://github.com/yiyedata/simplified-scrapy-demo/blob/master/doc_examples
You can try this; it should be easy to follow:
from bs4 import BeautifulSoup
html_doc = """
<td class="celda400" vAlign="center" align="right" width="100" bgColor="#DFEDFF" style="color:Black">
575,42
</td>
<td class="celda400" vAlign="center" align="right" width="100" bgColor="#DFEDFF" style="color:Black">
875,42
</td>
"""
soup = BeautifulSoup(html_doc, 'lxml')
all_td = soup.find_all('td', {'class':"celda400"})
for td in all_td:
    value = td.text.strip()
    print(value)

Python Beautiful Soup parsing a UTF-8 coded table (using mechanize)

I'm trying to parse the following table, coded in UTF-8 (this is part of it):
<table cellspacing="0" cellpadding="3" border="0" id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1" style="width:100%;border-collapse:collapse;">
<tr class="gridHeader" valign="top">
<td class="titleGridRegNoB" align="center" valign="top"><span dir=RTL>שווי שוק (אלפי ש"ח)</span></td><td class="titleGridReg" align="center" valign="top">הון רשום למסחר</td><td class="titleGridReg" align="center" valign="top">שער נמוך</td><td class="titleGridReg" align="center" valign="top">שער גבוה</td><td class="titleGridReg" align="center" valign="top">שער בסיס</td><td class="titleGridReg" align="center" valign="top">שער פתיחה</td><td class="titleGridReg" align="center" valign="top"><span dir="rtl">שער נעילה (באגורות)</span>
</td><td class="titleGridReg" align="center" valign="top">שער נעילה מתואם</td><td class="titleGridReg" align="center" valign="top">תאריך</td>
</tr><tr onmouseover="this.style.backgroundColor='#FDF1D7'" onmouseout="this.style.backgroundColor='#ffffff'">
My code is:
html = br.response().read().decode('utf-8')
soup = BeautifulSoup(html)
table_id = "ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1"
table = soup.findall("table", id=table_id)
And I'm getting the following error:
TypeError: 'NoneType' object is not callable
Since you are searching by id, you can pass just the id and nothing else, because ids are unique.
UPDATE
Using your paste:
# encoding=utf-8
from bs4 import BeautifulSoup
import requests
data = requests.get('https://dpaste.de/EWCK/raw/')
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(data.text, 'html.parser')
print(soup.find("table",
                id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1"))
I'm using Python's requests library to fetch the page, which is equivalent to what you are doing to get the data. The code above works, and the ID you gave is correct. As a change, try not calling .decode('utf-8'); just pass br.response().read() to BeautifulSoup.
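The TypeError in the question most likely comes from the method name rather than from the encoding. BeautifulSoup treats an unknown attribute such as soup.findall as a lookup for a child tag named findall, which returns None, and calling None raises "'NoneType' object is not callable". A minimal sketch with a shortened, hypothetical id (grid1) standing in for the long one:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page; "grid1" replaces the long id.
html = '<table id="grid1"><tr><td>x</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

# Unknown attribute names are treated as child-tag lookups, so this is
# the same as soup.find("findall"), and there is no such tag.
missing = soup.findall
print(missing)

# The search methods are spelled find_all (or the older findAll) and find.
tables = soup.find_all("table", id="grid1")
table = soup.find(id="grid1")
print(len(tables), table.name)
```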

Extract content of a HTML-file

I've got an HTML file which looks like this (simplified):
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style="table-layout:fixed; width:325.68pt; height:528.96pt;">
Here is some text.
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>
Here are also some words...
</table>
What I'd like to extract is the content of each <table class="main"> element; in other words, I'd like to write exactly what is shown above to a file. Note that the example is simplified; around the <table> tags, there are many others...
I tried to extract the content using the following code:
import lxml.html
root = lxml.html.parse('www.test.xyz').getroot()
# Remove empty <b> and <i> elements before selecting the tables
for empty in root.xpath('//*[self::b or self::i][not(node())]'):
    empty.getparent().remove(empty)
tables = root.cssselect('table.main')
The above code works, but part of the output appears twice. The result of the code is:
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style="table-layout:fixed; width:325.68pt; height:528.96pt;">
Here is some text.
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>
Here are also some words...
</table>
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>
So the problem is that the inner table appears a second time at the end.
Why does this happen, and how can it be avoided?
paul t., another Stack Overflow user, told me to use "root.xpath('//table[@class="main" and not(.//table[@class="main"])]')". This code prints exactly the part that I get twice.
I hope the problem is described clearly enough. Thanks for any help and suggestions :)
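table.main matches both the outer and the nested table, and serializing the outer one already includes the nested one. You want to select only the tables with class "main" that are not descendants of another table with class "main".
This seems to work fine:
root.xpath('//table[@class="main" and not(ancestor::table[@class="main"])]')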
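To see the effect of the ancestor test, here is a small self-contained sketch. The sample markup is hypothetical and places the inner table inside a cell so that the nesting is unambiguous for the parser; all_main and outermost are illustrative names.

```python
import lxml.html

# Hypothetical sample: a "main" table nested inside another "main" table.
html = '''<html><body>
<table class="main"><tr><td>Outer text
  <table class="main"><tr><td>Inner text</td></tr></table>
</td></tr></table>
</body></html>'''
root = lxml.html.fromstring(html)

# Matches the outer and the inner table, so the inner one would be
# serialized twice: once inside the outer table and once on its own.
all_main = root.xpath('//table[@class="main"]')

# Keeps only tables that have no "main" table among their ancestors.
outermost = root.xpath(
    '//table[@class="main" and not(ancestor::table[@class="main"])]'
)

print(len(all_main), len(outermost))
for table in outermost:
    print(lxml.html.tostring(table, encoding="unicode"))
```

The earlier suggestion with not(.//table[...]) does the opposite: it keeps only tables that contain no further "main" table, which is why it printed just the inner one.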

HTML Parsing Table - BeautifulSoup

I am attempting to parse the second table shown below using BeautifulSoup. I am having trouble telling the second table from the first because the tables' attributes are exactly the same. How do I access the information in the table, such as name = PATHWAY? What I have used so far to attempt to access the table is:
table = soup.find('table', {'name':'PATHWAY'})
I receive a response of "None" although I know the table is present. To me this means that my method of distinguishing between the two is not working. Any suggestions?
<table border="0" cellspacing="0" cellpadding="0" bgcolor="#DCDCDC">
<tr><td>
<table border="0" cellspacing="1" cellpadding="3">
<tr>
<td class=ue><a name="REACTION TYPE">REACTION TYPE</td><td class=ue>ORGANISM</td><td class=ue>COMMENTARY</td><td class=ue>LITERATURE</td></tr>
<tr class=tr1>
<td class=g>condensation</td><td class=no>-</td><td class=no>-</td><td class=no>-</td></tr>
</table>
</td></tr></table>
<br>
<table border="0" cellspacing="0" cellpadding="0" bgcolor="#DCDCDC">
<tr><td>
<table border="0" cellspacing="1" cellpadding="3">
<tr>
<td class=ue><a name="PATHWAY">PATHWAY</td><td class=ue>KEGG Link</td><td class=ue>MetaCyc Link</td><td class=ue></td></tr>
<table>
Mu Mind has it right: find the "a" then traverse back up to the parent
soup.find(attrs={"name":"PATHWAY"}).findParent('table')
That's the Python way. There is a single XPath expression for this, but working with XPath axes is more complicated and only worth the effort if it has some specific use (for example, XSLT or JavaScript requirements).
>>> soup.find(attrs={"name":"PATHWAY"})
<a name="PATHWAY">PATHWAY</a>
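First:
table = soup.find('table' {'name':'PATHWAY'}
is not valid Python code: it is missing a comma and the closing parenthesis.
What should the corrected call match? It will only match a <table> element that itself has the attribute name="PATHWAY", and there is none in your HTML.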
Either you iterate through each single table and perform related check inside each table or you iterate over each single node of the tree until you find the related node and then walk up the node hierarchy (by following the parent nodes) until you find a table element. The recursiveChildGenerator() can be used to iterate over all nodes (like in a flat list).
You can use the function form of find:
soup.find(lambda tag: (tag.name=='table' and \
(tag.find('a', attrs={'name': 'PATHWAY'}) is not None)))
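The two approaches above do not necessarily return the same table when tables are nested, as they are in the question. A minimal sketch on a cleaned-up, hypothetical version of the markup (closing tags added, variable names inner and outer chosen for illustration):

```python
from bs4 import BeautifulSoup

# Cleaned-up, hypothetical version of the nested tables from the question.
html = '''
<table border="0" cellspacing="0"><tr><td>
  <table border="0" cellspacing="1">
    <tr><td><a name="REACTION TYPE">REACTION TYPE</a></td><td>ORGANISM</td></tr>
  </table>
</td></tr></table>
<table border="0" cellspacing="0"><tr><td>
  <table border="0" cellspacing="1">
    <tr><td><a name="PATHWAY">PATHWAY</a></td><td>KEGG Link</td></tr>
  </table>
</td></tr></table>
'''
soup = BeautifulSoup(html, "html.parser")

# Approach 1: find the anchor, then walk up to the nearest enclosing table.
# attrs= is needed because "name" is also find()'s first parameter.
anchor = soup.find("a", attrs={"name": "PATHWAY"})
inner = anchor.find_parent("table")

# Approach 2: first table in document order that contains the anchor.
# With nested tables this is the outer wrapper, not the inner table.
outer = soup.find(
    lambda tag: tag.name == "table"
    and tag.find("a", attrs={"name": "PATHWAY"}) is not None
)

print(inner["cellspacing"], outer["cellspacing"])
```

If the inner table is the one you want, walking up from the anchor with find_parent is the more direct route.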
