Beautifulsoup iterate to get either <td>sometext</td> or url

I want to create a list of dictionaries (key-value pairs), with the <thead> items as the keys. For the values I want the text of every <td> item, except for <td> items that contain an <a href='url'>; for those I want the URL instead.
Currently I can only get the text of all items. How do I get '/someurl' instead of Makulerad and Detaljer?
<table class="table table-bordered table-hover table-striped zero-margin-top">
<thead>
<tr>
<th>Volymsenhet</th>
<th>Pris</th>
<th>Valuta</th>
<th>Handelsplats</th>
<th>url1</th>
<th>url2</th>
</tr>
</thead>
<tbody>
<tr class="iprinactive">
<td>Antal</td>
<td>5,40</td>
<td>SEK</td>
<td>NASDAQ STOCKHOLM AB</td>
<td><a href="/someurl">Makulerad</a></td>
<td>
<a href="/someurl">Detaljer</a>
</td>
</tr>
</tbody>
</table>
My code:
raw_html = simple_get('https://example.com/')
soup = BeautifulSoup(raw_html, 'html.parser')
table = soup.find("table", attrs={"class":"table"})
head = [th.get_text() for th in table.find("tr").find_all("th")]
datasets = []
for row in table.find_all("tr")[1:]:
    dataset = dict(zip(head,(td.get_text() for td in row.find_all("td"))))
    datasets.append(dataset)

Try this:
Take the text of each <td> if it does not contain an <a>; otherwise take the href value of that <a>.
from bs4 import BeautifulSoup
raw_html = '''<table class="table table-bordered table-hover table-striped zero-margin-top">
<thead>
<tr>
<th>Volymsenhet</th>
<th>Pris</th>
<th>Valuta</th>
<th>Handelsplats</th>
<th>url1</th>
<th>url2</th>
</tr>
</thead>
<tbody>
<tr class="iprinactive">
<td>Antal</td>
<td>5,40</td>
<td>SEK</td>
<td>NASDAQ STOCKHOLM AB</td>
<td><a href="/someurl">Makulerad</a></td>
<td>
<a href="/someurl">Detaljer</a>
</td>
</tr>
</tbody>
</table>'''
soup = BeautifulSoup(raw_html, 'html.parser')
table = soup.find("table", attrs={"class":"table"})
head = [th.get_text() for th in table.find("tr").find_all("th")]
datasets = []
for row in table.find_all("tr")[1:]:
    dataset = dict(zip(head, [td.get_text() if not td.a else td.a['href'] for td in row.find_all("td")]))
    datasets.append(dataset)
print(datasets)
OUTPUT:
[{'Volymsenhet': 'Antal', 'Pris': '5,40', 'Valuta': 'SEK', 'Handelsplats': 'NASDAQ STOCKHOLM AB', 'url1': '/someurl', 'url2': '/someurl'}]
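The same idea can be pulled into a small helper, which also trims the whitespace that surrounds text in cells like the last one. This is only a sketch: the name cell_value and the two-column sample table are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one plain cell, one cell whose text is wrapped in a link.
html = """
<table>
<tr><th>Pris</th><th>url1</th></tr>
<tr><td> 5,40 </td><td><a href="/someurl">Makulerad</a></td></tr>
</table>
"""

def cell_value(td):
    """Return the link target if the cell has an <a href>, else its stripped text."""
    link = td.find("a", href=True)
    return link["href"] if link else td.get_text(strip=True)

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")
head = [th.get_text(strip=True) for th in rows[0].find_all("th")]
datasets = [dict(zip(head, map(cell_value, row.find_all("td")))) for row in rows[1:]]
print(datasets)
```

Searching with href=True skips anchors that have no href attribute, so such cells fall back to their text instead of raising a KeyError.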

Related

Trying to append a new row to the first row in the table body with BeautifulSoup

I'm having trouble appending a new row after the first row (the header row) in the table body (<tbody>).
My code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('page_content.xml'), 'html.parser')
# append a row to the first row in the table body
row = soup.find('tbody').find('tr')
row.append(soup.new_tag('tr', text='New Cell'))
print(row)
the output:
<tr>
<th>Version</th>
<th>Jira</th>
<th colspan="1">Date/Time</th>
<tr text="New Cell"></tr></tr>
what the output should be:
<tr>
<th>Version</th>
<th>Jira</th>
<th colspan="1">Date/Time</th>
</tr>
<tr text="New Cell"></tr>
the full xml file is:
<h1>Rental Agreement/Editor</h1>
<table class="wrapped">
<colgroup>
<col/>
<col/>
<col/>
</colgroup>
<tbody>
<tr>
<th>Version</th>
<th>Jira</th>
<th colspan="1">Date/Time</th>
<tr text="New Cell"></tr></tr>
<tr>
<td>1.0.1-0</td>
<td>ABC-1234</td>
<td colspan="1">
<br/>
</td>
</tr>
</tbody>
</table>
<p class="auto-cursor-target">
<br/>
</p>

You can use .insert_after:
from bs4 import BeautifulSoup
html_doc = """
<table>
<tr>
<th>Version</th>
<th>Jira</th>
<th colspan="1">Date/Time</th>
</tr>
<tr>
<td> something else </td>
</tr>
</table>
"""
soup = BeautifulSoup(html_doc, "html.parser")
row = soup.select_one("tr:has(th)")
row.insert_after(soup.new_tag("tr", text="New Cell"))
print(soup.prettify())
Prints:
<table>
<tr>
<th>
Version
</th>
<th>
Jira
</th>
<th colspan="1">
Date/Time
</th>
</tr>
<tr text="New Cell">
</tr>
<tr>
<td>
something else
</td>
</tr>
</table>
EDIT: If you want to insert arbitrary HTML code, you can try:
what_to_insert = BeautifulSoup(
    '<tr param="xxx">This is new <b>text</b></tr>', "html.parser"
)
row.insert_after(what_to_insert)
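Note that in the output above text="New Cell" ended up as an attribute of the new <tr>, not as visible content. If the goal is a row containing a real cell, one possible sketch is to build the <td> separately and set its .string. The one-column table below is a made-up stand-in for the real document.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the real page.
html_doc = "<table><tr><th>Version</th></tr><tr><td>1.0.1-0</td></tr></table>"
soup = BeautifulSoup(html_doc, "html.parser")

# Build <tr><td>New Cell</td></tr>; the visible text goes in through .string.
new_cell = soup.new_tag("td")
new_cell.string = "New Cell"
new_row = soup.new_tag("tr")
new_row.append(new_cell)

# Use the header row as the reference point and place the new row right after it.
header_row = soup.find("th").find_parent("tr")
header_row.insert_after(new_row)
print(soup)
```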

Adding a new table to tbody using Beautiful Soup

I am trying to add another row to this table in my HTML page. The table has four columns.
This is the code I have so far:
#Table Data
newVersion = soup.new_tag('td',{'colspan':'1'},**{'class': 'confluenceTd'})
newRow = soup.new_tag('tr')
newRow.insert(1,newVersion)
tableBody = soup.select("tbody")
#This is a magic number
soup.insert(tableBody[1],newRow)
I have only filled in one column (the version) and inserted it into a 'tr' tag. The idea is that I could fill in the other 3 columns and insert them into the tr as well.
The tableBody[1] is there because the page contains multiple tables, which don't have unique IDs or classes.
The problem line is the soup.insert(tableBody[1],newRow) as it raises:
TypeError: '<' not supported between instances of 'int' and 'Tag'
But how do I provide a reference point for the insertion of the tr tag?

To create a new tag with several attributes, you can use the attrs parameter of new_tag.
newVersion = soup.new_tag('td', attrs= {'class': 'confluenceTd', 'colspan': '1'})
Since you haven't provided any HTML code, I have tried to reproduce the HTML code based on your input.
This code will append the newly created row to the tbody.
from bs4 import BeautifulSoup
s = '''
<table>
<thead>
</thead>
<tbody>
<tr>
<td colspan="1" class="confluenceTd">1.0.17</td>
<td colspan="1" class="confluenceTd">...</td>
<td colspan="1" class="confluenceTd">...</td>
<td colspan="1" class="confluenceTd">...</td>
</tr>
</tbody>
</table>
'''
soup = BeautifulSoup(s, 'html.parser')
newVersion = soup.new_tag('td', attrs= {'class': 'confluenceTd', 'colspan': '1'})
newRow = soup.new_tag('tr')
newRow.insert(1,newVersion)
tableBody = soup.select("tbody")
#This is a magic number
tableBody[0].append(newRow)
print(soup)
Output
<table>
<thead>
</thead>
<tbody>
<tr>
<td class="confluenceTd" colspan="1">1.0.17</td>
<td class="confluenceTd" colspan="1">...</td>
<td class="confluenceTd" colspan="1">...</td>
<td class="confluenceTd" colspan="1">...</td>
</tr>
<tr><td class="confluenceTd" colspan="1"></td></tr></tbody>
</table>
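On the question of a reference point: insert expects a numeric position as its first argument, which is why passing a Tag raises the TypeError. An alternative is to call insert_after (or append) on a tag that is already in the tree. The sketch below assumes a page with two unlabelled tables and uses a hypothetical helper make_row; the cell values are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two unlabelled tables; the second one is the target.
html = """
<table><tbody><tr><td>other</td></tr></tbody></table>
<table><tbody>
<tr><td class="confluenceTd">1.0.17</td><td class="confluenceTd">a</td>
<td class="confluenceTd">b</td><td class="confluenceTd">c</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")

def make_row(soup, values):
    """Build a <tr> holding one confluenceTd cell per value."""
    row = soup.new_tag("tr")
    for value in values:
        cell = soup.new_tag("td", attrs={"class": "confluenceTd", "colspan": "1"})
        cell.string = value
        row.append(cell)
    return row

target_body = soup.select("tbody")[1]       # chosen by position, as in the question
last_row = target_body.find_all("tr")[-1]   # the existing tag used as reference point
last_row.insert_after(make_row(soup, ["1.0.18", "d", "e", "f"]))
print(target_body)
```

Using last_row.insert_before(...) instead would place the new row above the reference row.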

Get the content of tr in tbody

I have the following table :
<table class="table table-bordered adoption-status-table">
<thead>
<tr>
<th>Extent of IFRS application</th>
<th>Status</th>
<th>Additional Information</th>
</tr>
</thead>
<tbody>
<tr>
<td>IFRS Standards are required for domestic public companies</td>
<td>
</td>
<td></td>
</tr>
<tr>
<td>IFRS Standards are permitted but not required for domestic public companies</td>
<td>
<img src="/images/icons/tick.png" alt="tick">
</td>
<td>Permitted, but very few companies use IFRS Standards.</td>
</tr>
<tr>
<td>IFRS Standards are required or permitted for listings by foreign companies</td>
<td>
</td>
<td></td>
</tr>
<tr>
<td>The IFRS for SMEs Standard is required or permitted</td>
<td>
<img src="/images/icons/tick.png" alt="tick">
</td>
<td>The IFRS for SMEs Standard is permitted, but very few companies use it. Nearly all SMEs use Paraguayan national accounting standards.</td>
</tr>
<tr>
<td>The IFRS for SMEs Standard is under consideration</td>
<td>
</td>
<td></td>
</tr>
</tbody>
</table>
I am trying to extract the data as it appears in the original source.
This is my work so far:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
# Site URL
url = "https://www.ifrs.org/use-around-the-world/use-of-ifrs-standards-by-jurisdiction/paraguay"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse HTML code for the entire site
soup = BeautifulSoup(html_content, "lxml")
gdp = soup.find_all("table", attrs={"class": "adoption-status-table"})
print("Number of tables on site: ",len(gdp))
table1 = gdp[0]
body = table1.find_all("tr")
head = body[0]
body_rows = body[1:]
headings = []
for item in head.find_all("th"):
    item = (item.text).rstrip("\n")
    headings.append(item)
print(headings)
all_rows = []
for row_num in range(len(body_rows)):
    row = []
    for row_item in body_rows[row_num].find_all("td"):
        aa = re.sub("(\xa0)|(\n)|,","",row_item.text)
        row.append(aa)
    all_rows.append(row)
df = pd.DataFrame(data=all_rows,columns=headings)
This is the only output I get :
Number of tables on site: 1
['Extent of IFRS application', 'Status', 'Additional Information']
I want the empty Status cells to become False and the cells that contain the tick image to become True.

You need to look for an img element inside the td. Here is an example:
data = []
for tr in body_rows:
    cells = tr.find_all('td')
    img = cells[1].find('img')
    if img and img['src'] == '/images/icons/tick.png':
        status = True
    else:
        status = False
    data.append({
        'Extent of IFRS application': cells[0].string,
        'Status': status,
        'Additional Information': cells[2].string,
    })
print(pd.DataFrame(data).head())
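For a version that can be run without fetching the page, the sketch below applies the same check to a shortened, made-up copy of the table (the cell texts are abbreviated placeholders). It uses get_text(strip=True) rather than .string, an assumption made here so that empty cells yield an empty string instead of None.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the table from the question.
html = """
<table class="adoption-status-table">
<tbody>
<tr><td>Required</td><td>
</td><td></td></tr>
<tr><td>Permitted</td><td>
<img src="/images/icons/tick.png" alt="tick">
</td><td>Very few companies use it.</td></tr>
</tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

records = []
for tr in soup.select("table.adoption-status-table tbody tr"):
    cells = tr.find_all("td")
    records.append({
        "Extent of IFRS application": cells[0].get_text(strip=True),
        "Status": cells[1].find("img") is not None,   # True only when a tick image is present
        "Additional Information": cells[2].get_text(strip=True),
    })
print(records)
```

The resulting list of dictionaries can be passed to pd.DataFrame in the same way as data above.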

The answer above is good. Another option is to use pandas.read_html to load the table into a DataFrame and then fill in the missing Status column with an lxml XPath query (BeautifulSoup works too, but is more verbose):
import pandas as pd
import requests
from lxml import html
r = requests.get("https://www.ifrs.org/use-around-the-world/use-of-ifrs-standards-by-jurisdiction/paraguay")
table = pd.read_html(r.content)[0]
tree = html.fromstring(r.content)
table["Status"] = [True if t.xpath("img") else False for t in tree.xpath('//table/tbody/tr/td[2]')]
print(table)

Extracting contents of two tables from web data

How can I retrieve all td information from this html data:
<h1>All staff</h1>
<h2>Manager</h2>
<table class="StaffList">
<tbody>
<tr>
<th>Name</th>
<th>Post title</th>
<th>Telephone</th>
<th>Email</th>
</tr>
<tr>
<td>
Jon Staut
</td>
<td>Line Manager</td>
<td>0160 315 3832</td>
<td>
Jon.staut#strx.usc.com </td>
</tr>
</tbody>
</table>
<h2>Junior Staff</h2>
<table class="StaffList">
<tbody>
<tr>
<th>Name</th>
<th>Post title</th>
<th>Telephone</th>
<th>Email</th>
</tr>
<tr>
<td>
Peter Boone
</td>
<td>Mailer</td>
<td>0160 315 3834</td>
<td>
Peter.Boone#strx.usc.com
</td>
</tr>
<tr>
<td>
John Peters
</td>
<td>Builder</td>
<td>0160 315 3837</td>
<td>
John.Peters#strx.usc.com
</td>
</tr>
</tbody>
</table>
Here's my code that generated an error:
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.findAll('table', attrs={'class': 'StaffList'})
list_of_rows = []
for row in table.findAll('tr'):  # 2 rows found in table - loop through
    list_of_cells = []
    for cell in row.findAll('td'):  # each cell in a row
        text = cell.text.replace('&nbsp','')
        list_of_cells.append(text)
    # print(list_of_cells)
    list_of_rows.append(list_of_cells)
# print all cells in the two rows
print(list_of_rows)
Error message:
AttributeError: 'ResultSet' object has no attribute 'findAll'
What do I need to do to make the code output all the information in the two web tables?

The problem starts at this line:
table = soup.findAll('table', attrs={'class': 'StaffList'})
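findAll returns a ResultSet (a list of matching tags), which has no findAll method of its own.
To search inside a single table, change findAll to find; note that find returns only the first matching table: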
table = soup.find('table', attrs={'class': 'StaffList'})

Alternatively, you can use a CSS selector expression to return the tr elements of every StaffList table without having to extract the tables first:
for row in soup.select('table.StaffList tr'):  # all rows of both tables - loop through
    ......

Thanks for the suggestions. The problem is now solved after replacing two lines of code:
The first one:
table = soup.findAll('table', attrs={'class': 'StaffList'})
replaced with:
table = soup.findAll('tr')
The second one:
for row in table.findAll('tr'):
replaced with:
for row in table:
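Putting the suggestions together, a minimal sketch that collects the cells of both tables could look like the following. It iterates over the ResultSet instead of calling findAll on it, and it runs on a shortened inline copy of the HTML (two columns only), which is an assumption made to keep the example self-contained.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the page from the question.
html = """
<h2>Manager</h2>
<table class="StaffList"><tbody>
<tr><th>Name</th><th>Post title</th></tr>
<tr><td>
Jon Staut
</td><td>Line Manager</td></tr>
</tbody></table>
<h2>Junior Staff</h2>
<table class="StaffList"><tbody>
<tr><th>Name</th><th>Post title</th></tr>
<tr><td>Peter Boone</td><td>Mailer</td></tr>
<tr><td>John Peters</td><td>Builder</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")

list_of_rows = []
for table in soup.find_all("table", class_="StaffList"):  # a ResultSet: iterate over it
    for row in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:  # header rows contain only <th>, so they produce no cells
            list_of_rows.append(cells)
print(list_of_rows)
```

Unlike soup.findAll('tr'), this only touches rows inside StaffList tables, which matters if the page contains other tables.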

Using Beautiful soup to analyze table in python

So I've got a table:
<table border="1" style="width: 100%">
<caption></caption>
<col>
<col>
<tbody>
<tr>
<td>Pig</td>
<td>House Type</td>
</tr>
<tr>
<td>Pig A</td>
<td>Straw</td>
</tr>
<tr>
<td>Pig B</td>
<td>Stick</td>
</tr>
<tr>
<td>Pig C</td>
<td>Brick</td>
</tr>
And I was simply trying to return a JSON string of the table pairs like so:
[["Pig A", "Straw"], ["Pig B", "Stick"], ["Pig C", "Brick"]]
However, with my code I can't seem to get rid of the HTML tags:
stable = soup.find('table')
cells = [ ]
rows = stable.findAll('tr')
for tr in rows[1:4]:
    # Process the body of the table
    row = []
    td = tr.findAll('td')
    #td = [el.text for el in soup.tr.finall('td')]
    row.append( td[0])
    row.append( td[1])
    cells.append( row )
return cells
#eventually, I'd like to do this:
#h = json.dumps(cells)
#return h
My output is this:
[[<td>Pig A</td>, <td>Straw</td>], [<td>Pig B</td>, <td>Stick</td>], [<td>Pig C</td>, <td>Brick</td>]]

Use the text property to get only the inner text of the element:
row.append(td[0].text)
row.append(td[1].text)
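As a complete sketch of the whole round trip, including the json.dumps step mentioned in the question: the function name table_pairs is hypothetical, and the HTML is a condensed copy of the table from the question.

```python
import json
from bs4 import BeautifulSoup

# Condensed copy of the table from the question.
html = """
<table>
<tbody>
<tr><td>Pig</td><td>House Type</td></tr>
<tr><td>Pig A</td><td>Straw</td></tr>
<tr><td>Pig B</td><td>Stick</td></tr>
<tr><td>Pig C</td><td>Brick</td></tr>
</tbody>
</table>
"""

def table_pairs(html):
    """Return the text of every cell, row by row, skipping the first (header) row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table").find_all("tr")
    return [[td.get_text(strip=True) for td in tr.find_all("td")] for tr in rows[1:]]

result = json.dumps(table_pairs(html))
print(result)
```

Because the cells are converted to plain strings before serialization, json.dumps has no Tag objects to deal with.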

You can also try the lxml library.
import lxml.html as PARSER
#data = open('example.html').read() # You can read it from a html file.
#OR
data = """
<table border="1" style="width: 100%">
<caption></caption>
<col>
<col>
<tbody>
<tr>
<td>Pig</td>
<td>House Type</td>
</tr>
<tr>
<td>Pig A</td>
<td>Straw</td>
</tr>
<tr>
<td>Pig B</td>
<td>Stick</td>
</tr>
<tr>
<td>Pig C</td>
<td>Brick</td>
</tr>
"""
root = PARSER.fromstring(data)
main_list = []
for ele in root.iter("tr"):  # visit every <tr> element in document order
    text = ele.text_content().strip().split('\n')
    main_list.append(text)
print(main_list)
Output:
[['Pig', ' House Type'], ['Pig A', ' Straw'], ['Pig B', ' Stick'], ['Pig C', ' Brick']]
