I am a newbie in web scraping and I have been stuck with some issue. I tried and searched but nothing at all. I want to extract data from the table. The problem is that tr and td elements don't have attributes, just class. <td class="review-value"> is the same for different values. I don't know how to separate them. I need lists with every single value for example:
list_1 = [aircraft1, aircraft2, aircraft3, ..]
list_2 = [type_of_traveller1, type_of_traveller2, ...]
list_3 = [cabin_flown1, cabin_flown2,...]
list_4 = [route1, route2,...]
list_5 = [date_flown1, date_flown2, ..]
This is the table code:
<table class="review-ratings">
<tr>
<td class="review-rating-header aircraft">Aircraft</td>
<td class="review-value">Boeing 787-9</td>
</tr>
<tr>
<td class="review-rating-header type_of_traveller">Type Of Traveller</td>
<td class="review-value">Couple Leisure</td>
</tr>
<tr>
<td class="review-rating-header cabin_flown">Seat Type</td>
<td class="review-value">Business Class</td>
</tr>
<tr>
<td class="review-rating-header route">Route</td>
<td class="review-value">Mexico City to London</td>
</tr>
<tr>
<td class="review-rating-header date_flown">Date Flown</td>
<td class="review-value">February 2023</td>
</tr>
</table>
I am using BeautifulSoup:
page = requests.get(url)
table = soup.find('article')
review_table = table.find_all('table', class_ = 'review- ratings')
find_tr_header = table.find_all('td', class_ = 'review-rating-header')
headers = []
for i in find_tr_header:
headers.append(i.text.strip())
And I don't know what to do with class="review-value".
As I can see in your table each field has a cell .review-value that is following it (direct sibling).
So what you can do is use the selector + in CSS.
For instance .aircraft + .review-value will give you the value of the aircraft.
In Beautiful Soup you can even avoid this type of selector since there are built-in methods available for you. Check next-sibling
Related
I am trying to add another row to this table in my HTML page. The table has four columns.
enter image description here
This is the code I have so far:
#Table Data
newVersion = soup.new_tag('td',{'colspan':'1'},**{'class': 'confluenceTd'})
newRow = soup.new_tag('tr')
newRow.insert(1,newVersion)
tableBody = soup.select("tbody")
#This is a magic number
soup.insert(tableBody[1],newRow)
I have only filled in one column (the version) and I have inserted it into the a 'tr' tag. The idea being I could fill in the other 3 columns and insert them into the tr.
The tableBody[1] is due to the their being multiple tables on the page, which don't have unique IDs or classes.
The problem line is the soup.insert(tableBody[1],newRow) as it raises:
TypeError: '<' not supported between instances of 'int' and 'Tag'
But how do I provide a reference point for the insertion of the tr tag?
To create a new tag with different attributes, you can use the attr parameter of new_tag.
newVersion = soup.new_tag('td', attrs= {'class': 'confluenceTd', 'colspan': '1'})
Since you haven't provided any HTML code, I have tried to reproduce the HTML code based on your input.
This code will append the newly created row to the tbody.
from bs4 import BeautifulSoup
s = '''
<table>
<thead>
</thead>
<tbody>
<tr>
<td colspan="1" class="confluenceTd">1.0.17</td>
<td colspan="1" class="confluenceTd">...</td>
<td colspan="1" class="confluenceTd">...</td>
<td colspan="1" class="confluenceTd">...</td>
</tr>
</tbody>
</table>
'''
soup = BeautifulSoup(s, 'html.parser')
newVersion = soup.new_tag('td', attrs= {'class': 'confluenceTd', 'colspan': '1'})
newRow = soup.new_tag('tr')
newRow.insert(1,newVersion)
tableBody = soup.select("tbody")
#This is a magic number
tableBody[0].append(newRow)
print(soup)
Output
<table>
<thead>
</thead>
<tbody>
<tr>
<td class="confluenceTd" colspan="1">1.0.17</td>
<td class="confluenceTd" colspan="1">...</td>
<td class="confluenceTd" colspan="1">...</td>
<td class="confluenceTd" colspan="1">...</td>
</tr>
<tr><td class="confluenceTd" colspan="1"></td></tr></tbody>
</table>
I am trying to retrieve all the td text for the below table using Beautiful Soup, unfortunately the tag names are the same and I am either only able to retrieve the first element or some elements are repeatedly printing. Hence not really sure of how to go about it.
Below is HTML table snippet:
<div>Table</div>
<table class="Auto" width="100%">
<tr>
<td class="Auto_head">Address</td>
<td class="Auto_head">Name</td>
<td class="Auto_head">Type</td>
<td class="Auto_head">Value IN</td>
<td class="Auto_head">AUTO Statement</td>
<td class="Auto_head">Value OUT</td>
<td class="Auto_head">RESULT</td>
<td class="Auto_head"></td>
</tr>
<tr>
<td class="Auto_body">1</td>
<td class="Auto_body">abc</td>
<td class="Auto_body">yes</td>
<td class="Auto_body">abc123</td>
<td class="Auto_body">jar</td>
<td class="Auto_body">123abc</td>
<td class="Auto_body">PASS</td>
<td class="Auto_body">na</td>
</tr>
What I want is all the text content inside these tags for example the first auto_head corresponds to first auto_body i.e. Address = 1 similarly all the values should be retrieved.
I have used find,findall,findNext and next_sibling but no luck. Here is my current code in python:
self.table = self.soup_file.findAll(class_="Table")
self.headers = [tab.find(class_="Auto_head").findNext('td',class_="Auto_head").contents[0] for tab in self.table]
self.data = [data.find(class_="Auto_body").findNext('td').contents[0] for data in self.table]
Get the headers first, then use zip(...) to combine
from bs4 import BeautifulSoup
data = '''\
<table class="Auto" width="100%">
<tr>
<td class="Auto_head">Address</td>
<td class="Auto_head">Name</td>
<td class="Auto_head">Type</td>
</tr>
<tr>
<td class="Auto_body">1</td>
<td class="Auto_body">abc</td>
<td class="Auto_body">yes</td>
</tr>
<tr>
<td class="Auto_body">2</td>
<td class="Auto_body">def</td>
<td class="Auto_body">no</td>
</tr>
<tr>
<td class="Auto_body">3</td>
<td class="Auto_body">ghi</td>
<td class="Auto_body">maybe</td>
</tr>
</table>
'''
soup = BeautifulSoup(data, 'html.parser')
for table in soup.select('table.Auto'):
# get rows
rows = table.select('tr')
# get headers
headers = [td.text for td in rows[0].select('td.Auto_head')]
# get details
for row in rows[1:]:
values = [td.text for td in row.select('td.Auto_body')]
print(dict(zip(headers, values)))
My output:
{'Address': '1', 'Name': 'abc', 'Type': 'yes'}
{'Address': '2', 'Name': 'def', 'Type': 'no'}
{'Address': '3', 'Name': 'ghi', 'Type': 'maybe'}
Get each category first then iterate using zip
s = '''<div>Table</div>
<table class="Auto" width="100%">
<tr>
<td class="Auto_head">Address</td>
<td class="Auto_head">Name</td>
<td class="Auto_head">Type</td>
<td class="Auto_head">Value IN</td>
<td class="Auto_head">AUTO Statement</td>
<td class="Auto_head">Value OUT</td>
<td class="Auto_head">RESULT</td>
<td class="Auto_head"></td>
</tr>
<tr>
<td class="Auto_body">1</td>
<td class="Auto_body">abc</td>
<td class="Auto_body">yes</td>
<td class="Auto_body">abc123</td>
<td class="Auto_body">jar</td>
<td class="Auto_body">123abc</td>
<td class="Auto_body">PASS</td>
<td class="Auto_body">na</td>
</tr></table>'''
soup = BeautifulSoup(s,features='html')
head = soup.find_all(name='td',class_='Auto_head')
body = soup.find_all(name='td',class_='Auto_body')
for one,two in zip(head,body):
print(f'{one.text}={two.text}')
Address=1
Name=abc
Type=yes
Value IN=abc123
AUTO Statement=jar
Value OUT=123abc
RESULT=PASS
=na
Searching by CSS class
The easiest solution is to add the find_all method at the end of the find
so your code will be
source = requests.get('YOUR URL')
soup=BeautifulSoup(source.text,'html.parser')
data = soup.find('tr').find_all('td')[0]
data = soup.find('tr').find_all('td')[1]
and so on just change the last list number 0,1,2... or else use for loop for the same
I have a webpage with a table that only appears when I click 'Inspect Element' and is not visible through the View Source page. The table contains only two rows with several cells each and looks similar to this:
<table class="datadisplaytable">
<tbody>
<tr>
<td class="dddefault">16759</td>
<td class="dddefault">MATH</td>
<td class="dddefault">123</td>
<td class="dddefault">001</td>
<td class="dddefault">Calculus</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
<tr>
<td class="dddefault">16449</td>
<td class="dddefault">PHY</td>
<td class="dddefault">456</td>
<td class="dddefault">002</td>
<td class="dddefault">Physics</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
</tbody>
</table>
What I'm trying to do is to iterate through the rows and return the text contained in each cell. I can't really seem to do it with Selenium. The elements contain no IDs and I'm not sure how else to get them. I'm not very familiar with using xpaths and such.
Here is a debugging attempt that returns a TypeError:
def check_grades(self):
table = []
for i in self.driver.find_element_by_class_name("dddefault"):
table.append(i)
print(table)
What is an easy way to get the text from the rows?
XPath is fragile. It's better to use CSS selectors or classes:
mytable = find_element_by_css_selector('table.datadisplaytable')
for row in mytable.find_elements_by_css_selector('tr'):
for cell in row.find_elements_by_tag_name('td'):
print(cell.text)
If you want to go row by row using an xpath, you can use the following:
h = """<table class="datadisplaytable">
<tr>
<td class="dddefault">16759</td>
<td class="dddefault">MATH</td>
<td class="dddefault">123</td>
<td class="dddefault">001</td>
<td class="dddefault">Calculus</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
<tr>
<td class="dddefault">16449</td>
<td class="dddefault">PHY</td>
<td class="dddefault">456</td>
<td class="dddefault">002</td>
<td class="dddefault">Physics</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
</table>"""
from lxml import html
xml = html.fromstring(h)
# gets the table
table = xml.xpath("//table[#class='datadisplaytable']")[0]
# iterate over all the rows
for row in table.xpath(".//tr"):
# get the text from all the td's from each row
print([td.text for td in row.xpath(".//td[#class='dddefault'][text()])
Which outputs:
['16759', 'MATH', '123', '001', 'Calculus']
['16449', 'PHY', '456', '002', 'Physics']
Using td[text()] will avoid getting any Nones returned for the td's that hold no text.
So to do the same using selenium you would:
table = driver.find_element_by_xpath("//table[#class='datadisplaytable']")
for row in table.find_elements_by_xpath(".//tr"):
print([td.text for td in row.find_elements_by_xpath(".//td[#class='dddefault'][1]"])
For multiple tables:
def get_row_data(table):
for row in table.find_elements_by_xpath(".//tr"):
yield [td.text for td in row.find_elements_by_xpath(".//td[#class='dddefault'][text()]"])
for table in driver.find_elements_by_xpath("//table[#class='datadisplaytable']"):
for data in get_row_data(table):
# use the data
Correction of the Selenium part of #Padraic Cunningham's answer:
table = driver.find_element_by_xpath("//table[#class='datadisplaytable']")
for row in table.find_elements_by_xpath(".//tr"):
print([td.text for td in row.find_elements_by_xpath(".//td[#class='dddefault']")])
Note: there was one missing round bracket at the end; also removed the [1] index, to match the first XML example.
Another note: Though, the example with the index [1] should also be preserved, to show how to extract individual elements.
Another Version (modified and corrected post by Padraic Cunningham):
Tested with Python 3.x
#!/usr/bin/python
h = """<table class="datadisplaytable">
<tr>
<td class="dddefault">16759</td>
<td class="dddefault">MATH</td>
<td class="dddefault">123</td>
<td class="dddefault">001</td>
<td class="dddefault">Calculus</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
<tr>
<td class="dddefault">16449</td>
<td class="dddefault">PHY</td>
<td class="dddefault">456</td>
<td class="dddefault">002</td>
<td class="dddefault">Physics</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
</table>"""
from lxml import html
xml = html.fromstring(h)
# gets the table
table = xml.xpath("//table[#class='datadisplaytable']")[0]
# iterate over all the rows
for row in table.xpath(".//tr"):
# get the text from all the td's from each row
print([td.text for td in row.xpath(".//td[#class='dddefault']")])
I have a big long table in an HTML, so the tags aren't nested within each other. It looks like this:
<tr>
<td>A</td>
</tr>
<tr>
<td class="x">...</td>
<td class="x">...</td>
<td class="x">...</td>
<td class="x">...</td>
</tr>
<tr>
<td class ="y">...</td>
<td class ="y">...</td>
<td class ="y">...</td>
<td class ="y">...</td>
</tr>
<tr>
<td>B</td>
</tr>
<tr>
<td class="x">...</td>
<td class="x">...</td>
<td class="x">...</td>
<td class="x">...</td>
</tr>
<tr>
<td class ="y">I want this</td>
<td class ="y">and this</td>
<td class ="y">and this</td>
<td class ="y">and this</td>
</tr>
So first I want to search the tree to find "B". Then I want to grab the text of every td tag with class y after B but before the next row of table starts over with "C".
I've tried this:
results = soup.find_all('td')
for result in results:
if result.string == "B":
print(result.string)
This gets me the string B that I want. But now I am trying to find all after this and I'm not getting what I want.
for results in soup.find_all('td'):
if results.string == 'B':
a = results.find_next('td',class_='y')
This gives me the next td after the 'B', which is what I want, but I can only seem to get that first td tag. I want to grab all of the tags that have class y, after 'B' but before 'C' (C isn't shown in the html, but follows the same pattern), and I want to it to a list.
My resulting list would be:
[['I want this'],['and this'],['and this'],['and this']]
Basically, you need to locate the element containing B text. This is your starting point.
Then, check every tr sibling of this element using find_next_siblings():
start = soup.find("td", text="B").parent
for tr in start.find_next_siblings("tr"):
# exit if reached C
if tr.find("td", text="C"):
break
# get all tds with a desired class
tds = tr.find_all("td", class_="y")
for td in tds:
print(td.get_text())
Tested on your example data, it prints:
I want this
and this
and this
and this
I am new in Python and someone suggested me to use Beautiful soup for Scrapping and i am struck in a problem to fetch the href attribute from a td tag Column 2 on the basis of year in column 4.
<table class="tableFile2" summary="Results">
<tr>
<th width="7%" scope="col">Filings</th>
<th width="10%" scope="col">Format</th>
<th scope="col">Description</th>
<th width="10%" scope="col">Filing Date</th>
<th width="15%" scope="col">File/Film Number</th>
</tr>
<tr>
<td nowrap="nowrap">8-K</td>
<td nowrap="nowrap"> Documents</td>
<td class="small" >Current report, items 8.01 and 9.01
<br />Acc-no: 0001193125</td>
<td>2013-05-03</td>
<td nowrap="nowrap">000-10030<br>13813281 </td>
</tr>
<tr class="blueRow">
<td nowrap="nowrap">424B2</td>
<td nowrap="nowrap"> Documents</td>
<td class="small" >Prospectus [Rule 424(b)(2)]<br />Acc-no: 0001193125</td>
<td>2013-05-01</td>
<td nowrap="nowrap">333-188191<br>13802405 </td>
</tr>
<tr>
<td nowrap="nowrap">FWP</td>
<td nowrap="nowrap"> Documents</td>
<td class="small" >Filing under Securities Act Rules 163/433 of free writing prospectuses<br />Acc-no: 0001193125-13-189053 (34 Act) Size: 52 KB </td>
<td>2013-05-01</td>
<td nowrap="nowrap">333-188191<br>13800170 </td>
</tr>
</table>
table = soup.find('table', class="tableFile2")
rows = table.findAll('tr')
for tr in rows:
cols = tr.findAll('td')
if "2013" in cols[3]
link = cols[1].find('a').get('href')
print
This works for me in Python 2.7:
table = soup.find('table', {'class': 'tableFile2'})
rows = table.findAll('tr')
for tr in rows:
cols = tr.findAll('td')
if len(cols) >= 4 and "2013" in cols[3].text:
link = cols[1].find('a').get('href')
print link
A few issues with your previous code:
soup.find() requires a dictionary of attributes (e.g., {'class' : 'tableFile2'})
Not every cols instance will have at least 3 columns, so you need to check length first.