From a large table I want to read rows 5, 10, 15, 20 ... using BeautifulSoup. How do I do this? Is findNextSibling and an incrementing counter the way to go?
You could also use findAll to get all the rows in a list, then use slice syntax to access the elements you need:
rows = soup.findAll('tr')[4::5]
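In bs4 the same method is spelled find_all; a minimal equivalent sketch, assuming html holds your table's markup:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
rows = soup.find_all('tr')[4::5]  # zero-based index 4 is row 5, then stride by 5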
This can also be done with select in Beautiful Soup if you know the row numbers to be selected. (Note: this requires bs4.)
row = 5
while True:
    # :nth-of-type is 1-based, so this selects row 5, then 10, 15, ...
    element = soup.select('tr:nth-of-type(%d)' % row)
    if len(element) > 0:
        # element[0] is your desired row element, do what you want with it
        row += 5
    else:
        break
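Since bs4's select supports the full An+B syntax, every fifth row can also be grabbed in a single query; a minimal sketch, with the caveat that :nth-of-type counts positions among siblings, so it assumes all rows share one parent:
rows = soup.select('tr:nth-of-type(5n)')  # rows 5, 10, 15, ... in one call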
As a general solution, you can convert the table to a nested list and iterate...
import BeautifulSoup

def listify(table):
    """Convert an html table to a nested list"""
    result = []
    rows = table.findAll('tr')
    for row in rows:
        result.append([])
        cols = row.findAll('td')
        for col in cols:
            strings = [_string.encode('utf8') for _string in col.findAll(text=True)]
            text = ''.join(strings)
            result[-1].append(text)
    return result
if __name__ == "__main__":
    """Build a small table with one column and ten rows, then parse it into a list"""
    htstring = """<table> <tr> <td>foo1</td> </tr> <tr> <td>foo2</td> </tr> <tr> <td>foo3</td> </tr> <tr> <td>foo4</td> </tr> <tr> <td>foo5</td> </tr> <tr> <td>foo6</td> </tr> <tr> <td>foo7</td> </tr> <tr> <td>foo8</td> </tr> <tr> <td>foo9</td> </tr> <tr> <td>foo10</td> </tr></table>"""
    soup = BeautifulSoup.BeautifulSoup(htstring)
    for idx, ii in enumerate(listify(soup)):
        if (idx + 1) % 5 > 0:
            continue
        print ii
Running that...
[mpenning@Bucksnort ~]$ python testme.py
['foo5']
['foo10']
[mpenning@Bucksnort ~]$
Another option, if you prefer raw html...
"""Build a small table with one column and ten rows, then parse it into a list"""
htstring = """<table> <tr> <td>foo1</td> </tr> <tr> <td>foo2</td> </tr> <tr> <td>foo3</td> </tr> <tr> <td>foo4</td> </tr> <tr> <td>foo5</td> </tr> <tr> <td>foo6</td> </tr> <tr> <td>foo7</td> </tr> <tr> <td>foo8</td> </tr> <tr> <td>foo9</td> </tr> <tr> <td>foo10</td> </tr></table>"""
result = [html_tr for idx, html_tr in enumerate(soup.findAll('tr')) \
if (idx+1)%5==0]
print result
Running that...
[mpenning@Bucksnort ~]$ python testme.py
[<tr> <td>foo5</td> </tr>, <tr> <td>foo10</td> </tr>]
[mpenning@Bucksnort ~]$
Here's how you could scrape every 5th distribution link on this Wikipedia page with gazpacho:
from gazpacho import Soup
url = "https://en.wikipedia.org/wiki/List_of_probability_distributions"
soup = Soup.get(url)
a_tags = soup.find("a", {"href": "distribution"})
links = ["https://en.wikipedia.org" + a.attrs["href"] for a in a_tags]
links[4::5] # start at 0,1,2,3,**4** and stride by 5
I am a newbie at web scraping and I am stuck on an issue; I have tried and searched but found nothing. I want to extract data from the table. The problem is that the tr and td elements don't have distinguishing attributes, just a class, and <td class="review-value"> is the same for different values. I don't know how to separate them. I need lists with every single value, for example:
list_1 = [aircraft1, aircraft2, aircraft3, ..]
list_2 = [type_of_traveller1, type_of_traveller2, ...]
list_3 = [cabin_flown1, cabin_flown2,...]
list_4 = [route1, route2,...]
list_5 = [date_flown1, date_flown2, ..]
This is the table code:
<table class="review-ratings">
<tr>
<td class="review-rating-header aircraft">Aircraft</td>
<td class="review-value">Boeing 787-9</td>
</tr>
<tr>
<td class="review-rating-header type_of_traveller">Type Of Traveller</td>
<td class="review-value">Couple Leisure</td>
</tr>
<tr>
<td class="review-rating-header cabin_flown">Seat Type</td>
<td class="review-value">Business Class</td>
</tr>
<tr>
<td class="review-rating-header route">Route</td>
<td class="review-value">Mexico City to London</td>
</tr>
<tr>
<td class="review-rating-header date_flown">Date Flown</td>
<td class="review-value">February 2023</td>
</tr>
</table>
I am using BeautifulSoup:
import requests
from bs4 import BeautifulSoup

page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('article')
review_table = table.find_all('table', class_='review-ratings')
find_tr_header = table.find_all('td', class_='review-rating-header')
headers = []
for i in find_tr_header:
    headers.append(i.text.strip())
And I don't know what to do with class="review-value".
As you can see in your table, each header cell has a .review-value cell directly following it (a direct sibling).
So you can use the CSS adjacent-sibling selector +.
For instance, .aircraft + .review-value will give you the value for the aircraft.
In Beautiful Soup you can even avoid this type of selector, since there are built-in methods for it: check find_next_sibling.
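A minimal sketch of both approaches, assuming soup is parsed from your review markup (the class names come from your snippet):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # html holds the review table markup

# CSS adjacent-sibling selector: every .review-value directly after .aircraft
list_1 = [td.get_text(strip=True) for td in soup.select('.aircraft + .review-value')]

# built-in sibling navigation: the same idea without a CSS selector
list_4 = [td.find_next_sibling('td', class_='review-value').get_text(strip=True)
          for td in soup.find_all('td', class_='route')]
Repeat either pattern with type_of_traveller, cabin_flown and date_flown to build the other lists.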
I am trying to add another row to this table in my HTML page. The table has four columns.
This is the code I have so far:
#Table Data
newVersion = soup.new_tag('td',{'colspan':'1'},**{'class': 'confluenceTd'})
newRow = soup.new_tag('tr')
newRow.insert(1,newVersion)
tableBody = soup.select("tbody")
#This is a magic number
soup.insert(tableBody[1],newRow)
I have only filled in one column (the version) and inserted it into a 'tr' tag, the idea being that I could fill in the other 3 columns and insert them into the tr.
The tableBody[1] is due to there being multiple tables on the page, which don't have unique IDs or classes.
The problem line is the soup.insert(tableBody[1],newRow) as it raises:
TypeError: '<' not supported between instances of 'int' and 'Tag'
But how do I provide a reference point for the insertion of the tr tag?
To create a new tag with several attributes, you can use the attrs parameter of new_tag.
newVersion = soup.new_tag('td', attrs={'class': 'confluenceTd', 'colspan': '1'})
Since you haven't provided any HTML code, I have tried to reproduce it based on your input. Note that your TypeError occurs because Tag.insert expects an integer position as its first argument, so soup.insert(tableBody[1], newRow) passes a Tag where an int belongs.
This code will append the newly created row to the tbody.
from bs4 import BeautifulSoup
s = '''
<table>
<thead>
</thead>
<tbody>
<tr>
<td colspan="1" class="confluenceTd">1.0.17</td>
<td colspan="1" class="confluenceTd">...</td>
<td colspan="1" class="confluenceTd">...</td>
<td colspan="1" class="confluenceTd">...</td>
</tr>
</tbody>
</table>
'''
soup = BeautifulSoup(s, 'html.parser')
newVersion = soup.new_tag('td', attrs={'class': 'confluenceTd', 'colspan': '1'})
newRow = soup.new_tag('tr')
newRow.insert(1, newVersion)
tableBody = soup.select("tbody")
# append to the first (and, in this sample, only) tbody
tableBody[0].append(newRow)
print(soup)
Output
<table>
<thead>
</thead>
<tbody>
<tr>
<td class="confluenceTd" colspan="1">1.0.17</td>
<td class="confluenceTd" colspan="1">...</td>
<td class="confluenceTd" colspan="1">...</td>
<td class="confluenceTd" colspan="1">...</td>
</tr>
<tr><td class="confluenceTd" colspan="1"></td></tr></tbody>
</table>
The website I'm scraping (using lxml) works just fine for everything except a table, in which all the tr's, td's and heading th's are nested and mixed, forming an unstructured HTML table.
<table class='table'>
<tr>
<th>Serial No.
<th>Full Name
<tr>
<td>1
<td rowspan='1'> John
<tr>
<td>2
<td rowspan='1'>Jane Alleman
<tr>
<td>3
<td rowspan='1'>Mukul Jha
.....
.....
.....
</table>
I tried the following xpaths but each of these just gives me back an empty list.
persons = [x for x in tree.xpath('//table[@class="table"]/tr/th/th/tr/td/td/text()')]
persons = [x for x in tree.xpath('//table[@class="table"]/tr/td/td/text()')]
persons = [x for x in tree.xpath('//table[@class="table"]/tr/th/th/tr/td/td/text()') if x.isdigit() == False] # to remove the serial no.s
Finally, what is the reason for such nesting - is it to prevent scraping?
It seems lxml loads the table the same way a browser does: it builds the corrected structure in memory, and you can see the corrected HTML when you use lxml.html.tostring(table).
So the table in memory is correctly formatted, and a normal './tr/td//text()' is all it takes to get the values:
import requests
import lxml.html
text = requests.get('https://delhimetrorail.info/dwarka-sector-8-delhi-metro-station-to-dwarka-sector-14-delhi-metro-station').text
s = lxml.html.fromstring(text)
table = s.xpath('//table')[1]
for row in table.xpath('./tr'):
    cells = row.xpath('./td//text()')
    print(cells)
print(lxml.html.tostring(table, pretty_print=True).decode())
Result
['Fare', ' DMRC Rs. 30']
['Time', '0:14']
['First', '6:03']
['Last', '22:24']
['Phone ', '8800793196']
<table class="table">
<tr>
<td title="Monday To Saturday">Fare</td>
<td><div> DMRC Rs. 30</div></td>
</tr>
<tr>
<td>Time</td>
<td>0:14</td>
</tr>
<tr>
<td>First</td>
<td>6:03</td>
</tr>
<tr>
<td>Last</td>
<td>22:24</td>
</tr>
<tr>
<td>Phone </td>
<td>8800793196</td>
</tr>
</table>
Original HTML for comparison - note the missing closing tags:
<table class='table'>
<tr><td title='Monday To Saturday'>Fare<td><div> DMRC Rs. 30</div></tr>
<tr><td>Time<td>0:14</tr>
<tr><td>First<td>6:03</tr>
<tr><td>Last<td>22:24
<tr><td>Phone <td><a href='tel:8800793196'>8800793196</a></tr>
</table>
Similar to furas' answer, but using pandas to scrape the last table on the page:
import requests
import lxml.html
import pandas as pd
url = 'https://delhimetrorail.info/dwarka-sector-8-delhi-metro-station-to-dwarka-sector-14-delhi-metro-station'
response = requests.get(url)
root = lxml.html.fromstring(response.text)
rows = []
info = root.xpath('//table[4]/tr/td[@rowspan]')
for i in info:
    row = []
    row.append(i.getprevious().text)
    row.append(i.text)
    rows.append(row)
columns = root.xpath('//table[4]//th/text()')
df1 = pd.DataFrame(rows, columns=columns)
df1
Output:
  Gate  Dwarka Sector 14 Metro Station
0    1                  Eros Etro Mall
1    2  Nirmal Bharatiya Public School
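If the page's tables parse cleanly, pandas can also do the whole job in one call; a minimal sketch, where the table index 3 is an assumption mirroring the //table[4] xpath above:
import pandas as pd

url = 'https://delhimetrorail.info/dwarka-sector-8-delhi-metro-station-to-dwarka-sector-14-delhi-metro-station'
tables = pd.read_html(url)  # one DataFrame per <table> on the page
df1 = tables[3]  # hypothetical index; check len(tables) to find the right one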
How can I retrieve all td information from this html data:
<h1>All staff</h1>
<h2>Manager</h2>
<table class="StaffList">
<tbody>
<tr>
<th>Name</th>
<th>Post title</th>
<th>Telephone</th>
<th>Email</th>
</tr>
<tr>
<td>
Jon Staut
</td>
<td>Line Manager</td>
<td>0160 315 3832</td>
<td>
Jon.staut@strx.usc.com </td>
</tr>
</tbody>
</table>
<h2>Junior Staff</h2>
<table class="StaffList">
<tbody>
<tr>
<th>Name</th>
<th>Post title</th>
<th>Telephone</th>
<th>Email</th>
</tr>
<tr>
<td>
Peter Boone
</td>
<td>Mailer</td>
<td>0160 315 3834</td>
<td>
Peter.Boone@strx.usc.com
</td>
</tr>
<tr>
<td>
John Peters
</td>
<td>Builder</td>
<td>0160 315 3837</td>
<td>
John.Peters@strx.usc.com
</td>
</tr>
</tbody>
</table>
Here's my code that generated an error:
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.findAll('table', attrs={'class': 'StaffList'})
list_of_rows = []
for row in table.findAll('tr'): #2 rows found in table - loop through
    list_of_cells = []
    for cell in row.findAll('td'): # each cell in a row
        text = cell.text.replace(' ', '')
        list_of_cells.append(text)
    #print list_of_cells
    list_of_rows.append(list_of_cells)

#print all cells in the two rows
print list_of_rows
Error message:
AttributeError: 'ResultSet' object has no attribute 'findAll'
What do I need to do to make the code output all the information in the two web tables?
The problem starts at this line:
table = soup.findAll('table', attrs={'class': 'StaffList'})
findAll returns a ResultSet (a list of matching tags), which has no findAll attribute.
Simply, change the findAll to find:
table = soup.find('table', attrs={'class': 'StaffList'})
Alternatively, you can use a CSS selector expression to return the tr elements from the StaffList table without having to extract the table first:
for row in soup.select('table.StaffList tr'): #2 rows found in table -loop through
......
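Put together, a minimal sketch of the full loop over both StaffList tables, keeping your whitespace cleanup:
import requests
from bs4 import BeautifulSoup

response = requests.get(url)  # url as in your script
soup = BeautifulSoup(response.text, 'html.parser')

list_of_rows = []
for row in soup.select('table.StaffList tr'):
    cells = [cell.get_text(strip=True) for cell in row.findAll('td')]
    if cells:  # header rows contain th, not td, and yield an empty list
        list_of_rows.append(cells)

print(list_of_rows)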
Thanks for the suggestions, guys. The problem is now solved after replacing 2 lines of code:
The first one:
table = soup.findAll('table', attrs={'class': 'StaffList'})
replaced with:
table = soup.findAll('tr')
The second one:
for row in table.findAll('tr'):
replaced with:
for row in table:
I have a selenium python script that reads a table on a page. The table has 3 columns: the first is a list of IDs and the 3rd is a checkbox. I iterate through the IDs until I find the one I want, then click the corresponding checkbox and save. It works fine but is very slow, as the table can be 4K rows.
This is the current code (self.questionID is a dictionary with the IDs I'm looking for):
for k, v in self.questionID.items():
    foundQuestion = False
    i = 1
    while foundQuestion is False:
        questionIndex = driver.find_element_by_xpath('/html/body/div[1]/form/table[2]/tbody/tr/td[1]/table/tbody/tr/td/fieldset[2]/div/table[1]/tbody/tr/td/table/tbody/tr/td/div/table/tbody[%d]/tr/td[1]' % i).text
        if questionIndex.strip() == k:
            d = i - 1
            driver.find_element_by_name('selectionIndex[%d]' % d).click()
            foundQuestion = True
        i += 1
This is a sample of the table, just the first couple of rows:
<thead>
<tr>
<th class="first" width="5%">ID</th>
<th width="90%">Question</th>
<th class="last" width="1%"> </th>
</tr>
</thead>
<tbody>
<tr>
<td class="rowodd">AG001 </td>
<td class="rowodd">Foo: </td>
<td class="rowodd"><input class="input" name="selectionIndex[0]" tabindex="30" type="checkbox"></td>
</tr>
</tbody>
<tbody>
<tr>
<td class="roweven">AG002 </td>
<td class="roweven">Bar </td>
<td class="roweven"><input class="input" name="selectionIndex[1]" tabindex="30" type="checkbox"></td>
</tr>
</tbody>
As you can probably guess, I'm no python ninja. Is there a quicker way to read this table and locate the correct row?
You can find the relevant checkbox in one go by using an xpath expression that searches for a question node by text, then gets its following td sibling and the input inside it:
checkbox = driver.find_element_by_xpath('//tr/td[1][(@class="rowodd" or @class="roweven") and text() = "%s${nbsp}"]/following-sibling::td[2]/input[starts-with(@name, "selectionIndex")]' % k)
checkbox.click()
Note that it would throw NoSuchElementException in case a question and its related checkbox are not found. You probably need to catch the exception:
from selenium.common.exceptions import NoSuchElementException

try:
    checkbox = driver.find_element_by_xpath('//tr/td[1][(@class="rowodd" or @class="roweven") and text() = "%s${nbsp}"]/following-sibling::td[2]/input[starts-with(@name, "selectionIndex")]' % k)
    checkbox.click()
except NoSuchElementException:
    # question not found - need to handle it, or just move on?
    pass
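If per-question lookups are still slow over 4K rows, another option is to read the whole ID column once and match in Python, so the driver walks the table a single time instead of once per question; a minimal sketch using the same old-style selenium API as your script (the //tbody/tr/td[1] path is an assumption based on your sample rows):
# fetch every ID cell once, then resolve checkbox indexes locally
id_cells = driver.find_elements_by_xpath('//tbody/tr/td[1]')
ids = [cell.text.strip() for cell in id_cells]
for k in self.questionID:
    if k in ids:
        driver.find_element_by_name('selectionIndex[%d]' % ids.index(k)).click()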