I have a table like this:
<table id="test" class="tablesorter">
<tr class="even">
<td style="background: #F5645C; color: #F5645C;">1 </td>
<td>Major Lazer</td>
<td class="right">64</td>
<td>93.1.15.107</td>
<td>0x0110000105DAB310</td>
<td class="center">No</td>
<td class="center">No</td>
</tr>
<tr class="odd">
<td style="background: #8FB9B0; color: #8FB9B0;">0 </td>
<td>Michael gunin</td>
<td class="right">64</td>
<td>57.48.41.27</td>
<td>0x0110000631HDA213</td>
<td class="center">No</td>
<td class="center">No</td>
</tr>
...
</table>
This table has over 100 rows, in the same format. What I want to do is to search after the long id, and then find that table row and get the IP and name.
For example, search after: 0x0110000105DAB310
Then find the specific table row in which this text exists, and grab the rest of the info like: Major Lazer and 93.1.15.107
table = playerssoup.find('table')
table_rows = table.find_all('tr')
for tr in table_rows:
td = tr.find('td', text='0x0110000101517CC6')
This shows me the td, but I don't know from here what to do.
One approach is to use find_previous_sibling('td')
Ex:
for tr in table_rows:
td = tr.find('td', text='0x0110000105DAB310')
if td is not None:
print( td.find_previous_sibling('td').text )
print( td.find_previous_sibling('td').find_previous_sibling('td').find_previous_sibling('td').text )
Related
I am trying to add another row to this table in my HTML page. The table has four columns.
enter image description here
This is the code I have so far:
#Table Data
newVersion = soup.new_tag('td',{'colspan':'1'},**{'class': 'confluenceTd'})
newRow = soup.new_tag('tr')
newRow.insert(1,newVersion)
tableBody = soup.select("tbody")
#This is a magic number
soup.insert(tableBody[1],newRow)
I have only filled in one column (the version) and I have inserted it into the a 'tr' tag. The idea being I could fill in the other 3 columns and insert them into the tr.
The tableBody[1] is due to the their being multiple tables on the page, which don't have unique IDs or classes.
The problem line is the soup.insert(tableBody[1],newRow) as it raises:
TypeError: '<' not supported between instances of 'int' and 'Tag'
But how do I provide a reference point for the insertion of the tr tag?
To create a new tag with different attributes, you can use the attr parameter of new_tag.
newVersion = soup.new_tag('td', attrs= {'class': 'confluenceTd', 'colspan': '1'})
Since you haven't provided any HTML code, I have tried to reproduce the HTML code based on your input.
This code will append the newly created row to the tbody.
from bs4 import BeautifulSoup
s = '''
<table>
<thead>
</thead>
<tbody>
<tr>
<td colspan="1" class="confluenceTd">1.0.17</td>
<td colspan="1" class="confluenceTd">...</td>
<td colspan="1" class="confluenceTd">...</td>
<td colspan="1" class="confluenceTd">...</td>
</tr>
</tbody>
</table>
'''
soup = BeautifulSoup(s, 'html.parser')
newVersion = soup.new_tag('td', attrs= {'class': 'confluenceTd', 'colspan': '1'})
newRow = soup.new_tag('tr')
newRow.insert(1,newVersion)
tableBody = soup.select("tbody")
#This is a magic number
tableBody[0].append(newRow)
print(soup)
Output
<table>
<thead>
</thead>
<tbody>
<tr>
<td class="confluenceTd" colspan="1">1.0.17</td>
<td class="confluenceTd" colspan="1">...</td>
<td class="confluenceTd" colspan="1">...</td>
<td class="confluenceTd" colspan="1">...</td>
</tr>
<tr><td class="confluenceTd" colspan="1"></td></tr></tbody>
</table>
I am trying to retrieve all the td text for the below table using Beautiful Soup, unfortunately the tag names are the same and I am either only able to retrieve the first element or some elements are repeatedly printing. Hence not really sure of how to go about it.
Below is HTML table snippet:
<div>Table</div>
<table class="Auto" width="100%">
<tr>
<td class="Auto_head">Address</td>
<td class="Auto_head">Name</td>
<td class="Auto_head">Type</td>
<td class="Auto_head">Value IN</td>
<td class="Auto_head">AUTO Statement</td>
<td class="Auto_head">Value OUT</td>
<td class="Auto_head">RESULT</td>
<td class="Auto_head"></td>
</tr>
<tr>
<td class="Auto_body">1</td>
<td class="Auto_body">abc</td>
<td class="Auto_body">yes</td>
<td class="Auto_body">abc123</td>
<td class="Auto_body">jar</td>
<td class="Auto_body">123abc</td>
<td class="Auto_body">PASS</td>
<td class="Auto_body">na</td>
</tr>
What I want is all the text content inside these tags for example the first auto_head corresponds to first auto_body i.e. Address = 1 similarly all the values should be retrieved.
I have used find,findall,findNext and next_sibling but no luck. Here is my current code in python:
self.table = self.soup_file.findAll(class_="Table")
self.headers = [tab.find(class_="Auto_head").findNext('td',class_="Auto_head").contents[0] for tab in self.table]
self.data = [data.find(class_="Auto_body").findNext('td').contents[0] for data in self.table]
Get the headers first, then use zip(...) to combine
from bs4 import BeautifulSoup
data = '''\
<table class="Auto" width="100%">
<tr>
<td class="Auto_head">Address</td>
<td class="Auto_head">Name</td>
<td class="Auto_head">Type</td>
</tr>
<tr>
<td class="Auto_body">1</td>
<td class="Auto_body">abc</td>
<td class="Auto_body">yes</td>
</tr>
<tr>
<td class="Auto_body">2</td>
<td class="Auto_body">def</td>
<td class="Auto_body">no</td>
</tr>
<tr>
<td class="Auto_body">3</td>
<td class="Auto_body">ghi</td>
<td class="Auto_body">maybe</td>
</tr>
</table>
'''
soup = BeautifulSoup(data, 'html.parser')
for table in soup.select('table.Auto'):
# get rows
rows = table.select('tr')
# get headers
headers = [td.text for td in rows[0].select('td.Auto_head')]
# get details
for row in rows[1:]:
values = [td.text for td in row.select('td.Auto_body')]
print(dict(zip(headers, values)))
My output:
{'Address': '1', 'Name': 'abc', 'Type': 'yes'}
{'Address': '2', 'Name': 'def', 'Type': 'no'}
{'Address': '3', 'Name': 'ghi', 'Type': 'maybe'}
Get each category first then iterate using zip
s = '''<div>Table</div>
<table class="Auto" width="100%">
<tr>
<td class="Auto_head">Address</td>
<td class="Auto_head">Name</td>
<td class="Auto_head">Type</td>
<td class="Auto_head">Value IN</td>
<td class="Auto_head">AUTO Statement</td>
<td class="Auto_head">Value OUT</td>
<td class="Auto_head">RESULT</td>
<td class="Auto_head"></td>
</tr>
<tr>
<td class="Auto_body">1</td>
<td class="Auto_body">abc</td>
<td class="Auto_body">yes</td>
<td class="Auto_body">abc123</td>
<td class="Auto_body">jar</td>
<td class="Auto_body">123abc</td>
<td class="Auto_body">PASS</td>
<td class="Auto_body">na</td>
</tr></table>'''
soup = BeautifulSoup(s,features='html')
head = soup.find_all(name='td',class_='Auto_head')
body = soup.find_all(name='td',class_='Auto_body')
for one,two in zip(head,body):
print(f'{one.text}={two.text}')
Address=1
Name=abc
Type=yes
Value IN=abc123
AUTO Statement=jar
Value OUT=123abc
RESULT=PASS
=na
Searching by CSS class
The easiest solution is to add the find_all method at the end of the find
so your code will be
source = requests.get('YOUR URL')
soup=BeautifulSoup(source.text,'html.parser')
data = soup.find('tr').find_all('td')[0]
data = soup.find('tr').find_all('td')[1]
and so on just change the last list number 0,1,2... or else use for loop for the same
Could BeautifulSoup select no tag table?
There's many tables in a HTML, but the data I want is in the table without any tags.
Here is my example:
There are 2 tables in HTML.
One is english, and the other is number.
from bs4 import BeautifulSoup
HTML2 = """
<table>
<tr>
<td class>a</td>
<td class>b</td>
<td class>c</td>
<td class>d</td>
</tr>
<tr>
<td class>e</td>
<td class>f</td>
<td class>g</td>
<td class>h</td>
</tr>
</table>
<table cellpadding="0">
<tr>
<td class>111</td>
<td class>222</td>
<td class>333</td>
<td class>444</td>
</tr>
<tr>
<td class>555</td>
<td class>666</td>
<td class>777</td>
<td class>888</td>
</tr>
"""
soup2 = BeautifulSoup(HTML2, 'html.parser')
f2 = soup2.select('table[cellpadding!="0"]') #<---I think the key point is here.
for div in f2:
row = ''
rows = div.findAll('tr')
for row in rows:
if(row.text.find('td') != False):
print(row.text)
I only want the data in the "english" table
And make the format like following:
a b c d
e f g h
Then save to excel.
But I can only access that "number" table.
Is there a hint?
Thanks!
You could use find_all and select only tables that don't have a specific attribute.
f2 = soup2.find_all('table', {'cellpadding':None})
Or if you want to select tables that have absolutely no attributes:
f2 = [tbl for tbl in soup2.find_all('table') if not tbl.attrs]
Then you can make a list of columns from f2 and pass it to the dataframe .
data = [
[td.text for td in tr.find_all('td')]
for table in f2 for tr in table.find_all('tr')
]
You can use has_attr method to test whether table contains the cellpadding attribute:
soup2 = BeautifulSoup(HTML2, 'html.parser')
f2 = soup2.find_all('table')
for div in f2:
if not div.has_attr('cellpadding'):
row = ''
rows = div.findAll('tr')
for row in rows:
if(row.text.find('td') != False):
print(row.text)
I have a webpage with a table that only appears when I click 'Inspect Element' and is not visible through the View Source page. The table contains only two rows with several cells each and looks similar to this:
<table class="datadisplaytable">
<tbody>
<tr>
<td class="dddefault">16759</td>
<td class="dddefault">MATH</td>
<td class="dddefault">123</td>
<td class="dddefault">001</td>
<td class="dddefault">Calculus</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
<tr>
<td class="dddefault">16449</td>
<td class="dddefault">PHY</td>
<td class="dddefault">456</td>
<td class="dddefault">002</td>
<td class="dddefault">Physics</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
</tbody>
</table>
What I'm trying to do is to iterate through the rows and return the text contained in each cell. I can't really seem to do it with Selenium. The elements contain no IDs and I'm not sure how else to get them. I'm not very familiar with using xpaths and such.
Here is a debugging attempt that returns a TypeError:
def check_grades(self):
table = []
for i in self.driver.find_element_by_class_name("dddefault"):
table.append(i)
print(table)
What is an easy way to get the text from the rows?
XPath is fragile. It's better to use CSS selectors or classes:
mytable = find_element_by_css_selector('table.datadisplaytable')
for row in mytable.find_elements_by_css_selector('tr'):
for cell in row.find_elements_by_tag_name('td'):
print(cell.text)
If you want to go row by row using an xpath, you can use the following:
h = """<table class="datadisplaytable">
<tr>
<td class="dddefault">16759</td>
<td class="dddefault">MATH</td>
<td class="dddefault">123</td>
<td class="dddefault">001</td>
<td class="dddefault">Calculus</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
<tr>
<td class="dddefault">16449</td>
<td class="dddefault">PHY</td>
<td class="dddefault">456</td>
<td class="dddefault">002</td>
<td class="dddefault">Physics</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
</table>"""
from lxml import html
xml = html.fromstring(h)
# gets the table
table = xml.xpath("//table[#class='datadisplaytable']")[0]
# iterate over all the rows
for row in table.xpath(".//tr"):
# get the text from all the td's from each row
print([td.text for td in row.xpath(".//td[#class='dddefault'][text()])
Which outputs:
['16759', 'MATH', '123', '001', 'Calculus']
['16449', 'PHY', '456', '002', 'Physics']
Using td[text()] will avoid getting any Nones returned for the td's that hold no text.
So to do the same using selenium you would:
table = driver.find_element_by_xpath("//table[#class='datadisplaytable']")
for row in table.find_elements_by_xpath(".//tr"):
print([td.text for td in row.find_elements_by_xpath(".//td[#class='dddefault'][1]"])
For multiple tables:
def get_row_data(table):
for row in table.find_elements_by_xpath(".//tr"):
yield [td.text for td in row.find_elements_by_xpath(".//td[#class='dddefault'][text()]"])
for table in driver.find_elements_by_xpath("//table[#class='datadisplaytable']"):
for data in get_row_data(table):
# use the data
Correction of the Selenium part of #Padraic Cunningham's answer:
table = driver.find_element_by_xpath("//table[#class='datadisplaytable']")
for row in table.find_elements_by_xpath(".//tr"):
print([td.text for td in row.find_elements_by_xpath(".//td[#class='dddefault']")])
Note: there was one missing round bracket at the end; also removed the [1] index, to match the first XML example.
Another note: Though, the example with the index [1] should also be preserved, to show how to extract individual elements.
Another Version (modified and corrected post by Padraic Cunningham):
Tested with Python 3.x
#!/usr/bin/python
h = """<table class="datadisplaytable">
<tr>
<td class="dddefault">16759</td>
<td class="dddefault">MATH</td>
<td class="dddefault">123</td>
<td class="dddefault">001</td>
<td class="dddefault">Calculus</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
<tr>
<td class="dddefault">16449</td>
<td class="dddefault">PHY</td>
<td class="dddefault">456</td>
<td class="dddefault">002</td>
<td class="dddefault">Physics</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
</table>"""
from lxml import html
xml = html.fromstring(h)
# gets the table
table = xml.xpath("//table[#class='datadisplaytable']")[0]
# iterate over all the rows
for row in table.xpath(".//tr"):
# get the text from all the td's from each row
print([td.text for td in row.xpath(".//td[#class='dddefault']")])
I am new in Python and someone suggested me to use Beautiful soup for Scrapping and i am struck in a problem to fetch the href attribute from a td tag Column 2 on the basis of year in column 4.
<table class="tableFile2" summary="Results">
<tr>
<th width="7%" scope="col">Filings</th>
<th width="10%" scope="col">Format</th>
<th scope="col">Description</th>
<th width="10%" scope="col">Filing Date</th>
<th width="15%" scope="col">File/Film Number</th>
</tr>
<tr>
<td nowrap="nowrap">8-K</td>
<td nowrap="nowrap"> Documents</td>
<td class="small" >Current report, items 8.01 and 9.01
<br />Acc-no: 0001193125</td>
<td>2013-05-03</td>
<td nowrap="nowrap">000-10030<br>13813281 </td>
</tr>
<tr class="blueRow">
<td nowrap="nowrap">424B2</td>
<td nowrap="nowrap"> Documents</td>
<td class="small" >Prospectus [Rule 424(b)(2)]<br />Acc-no: 0001193125</td>
<td>2013-05-01</td>
<td nowrap="nowrap">333-188191<br>13802405 </td>
</tr>
<tr>
<td nowrap="nowrap">FWP</td>
<td nowrap="nowrap"> Documents</td>
<td class="small" >Filing under Securities Act Rules 163/433 of free writing prospectuses<br />Acc-no: 0001193125-13-189053 (34 Act) Size: 52 KB </td>
<td>2013-05-01</td>
<td nowrap="nowrap">333-188191<br>13800170 </td>
</tr>
</table>
table = soup.find('table', class="tableFile2")
rows = table.findAll('tr')
for tr in rows:
cols = tr.findAll('td')
if "2013" in cols[3]
link = cols[1].find('a').get('href')
print
This works for me in Python 2.7:
table = soup.find('table', {'class': 'tableFile2'})
rows = table.findAll('tr')
for tr in rows:
cols = tr.findAll('td')
if len(cols) >= 4 and "2013" in cols[3].text:
link = cols[1].find('a').get('href')
print link
A few issues with your previous code:
soup.find() requires a dictionary of attributes (e.g., {'class' : 'tableFile2'})
Not every cols instance will have at least 3 columns, so you need to check length first.