Get the content of tr in tbody - python

I have the following table:
<table class="table table-bordered adoption-status-table">
<thead>
<tr>
<th>Extent of IFRS application</th>
<th>Status</th>
<th>Additional Information</th>
</tr>
</thead>
<tbody>
<tr>
<td>IFRS Standards are required for domestic public companies</td>
<td>
</td>
<td></td>
</tr>
<tr>
<td>IFRS Standards are permitted but not required for domestic public companies</td>
<td>
<img src="/images/icons/tick.png" alt="tick">
</td>
<td>Permitted, but very few companies use IFRS Standards.</td>
</tr>
<tr>
<td>IFRS Standards are required or permitted for listings by foreign companies</td>
<td>
</td>
<td></td>
</tr>
<tr>
<td>The IFRS for SMEs Standard is required or permitted</td>
<td>
<img src="/images/icons/tick.png" alt="tick">
</td>
<td>The IFRS for SMEs Standard is permitted, but very few companies use it. Nearly all SMEs use Paraguayan national accounting standards.</td>
</tr>
<tr>
<td>The IFRS for SMEs Standard is under consideration</td>
<td>
</td>
<td></td>
</tr>
</tbody>
</table>
I am trying to extract the data as it appears in the original source.
This is my code so far:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
# Site URL
url = "https://www.ifrs.org/use-around-the-world/use-of-ifrs-standards-by-jurisdiction/paraguay"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse HTML code for the entire site
soup = BeautifulSoup(html_content, "lxml")
gdp = soup.find_all("table", attrs={"class": "adoption-status-table"})
print("Number of tables on site: ",len(gdp))
table1 = gdp[0]
body = table1.find_all("tr")
head = body[0]
body_rows = body[1:]
headings = []
for item in head.find_all("th"):
    item = (item.text).rstrip("\n")
    headings.append(item)
print(headings)
all_rows = []
for row_num in range(len(body_rows)):
    row = []
    for row_item in body_rows[row_num].find_all("td"):
        aa = re.sub("(\xa0)|(\n)|,","",row_item.text)
        row.append(aa)
    all_rows.append(row)
df = pd.DataFrame(data=all_rows,columns=headings)
This is the only output I get :
Number of tables on site: 1
['Extent of IFRS application', 'Status', 'Additional Information']
I want the empty Status cells to become False and the cells that contain the tick image to become True.

You need to look for an img element inside the td. Here is an example:
data = []
for tr in body_rows:
    cells = tr.find_all('td')
    img = cells[1].find('img')
    if img and img['src'] == '/images/icons/tick.png':
        status = True
    else:
        status = False
    data.append({
        'Extent of IFRS application': cells[0].string,
        'Status': status,
        'Additional Information': cells[2].string,
    })
print(pd.DataFrame(data).head())
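As a self-contained illustration of the same check, here is a minimal sketch that runs against an inline excerpt of the table rather than the live page. The SAMPLE_HTML string and the has_tick helper are assumptions made for this example, not part of the original code.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (assumed to match the live page's structure).
SAMPLE_HTML = """
<table class="table table-bordered adoption-status-table">
<tbody>
<tr>
<td>IFRS Standards are required for domestic public companies</td>
<td>
</td>
<td></td>
</tr>
<tr>
<td>IFRS Standards are permitted but not required for domestic public companies</td>
<td>
<img src="/images/icons/tick.png" alt="tick">
</td>
<td>Permitted, but very few companies use IFRS Standards.</td>
</tr>
</tbody>
</table>
"""


def has_tick(cell):
    """Return True when the cell contains an <img> element (hypothetical helper)."""
    return cell.find("img") is not None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = []
for tr in soup.select("table.adoption-status-table tbody tr"):
    cells = tr.find_all("td")
    records.append({
        "Extent of IFRS application": cells[0].get_text(strip=True),
        "Status": has_tick(cells[1]),
        "Additional Information": cells[2].get_text(strip=True),
    })

for record in records:
    print(record["Status"], "-", record["Extent of IFRS application"])
```

The list of dictionaries can be passed to pd.DataFrame in the same way as the data list above.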

The answer above is good. Another option is to use pandas.read_html to load the table into a DataFrame and then fill the empty Status column with an lxml XPath query (BeautifulSoup works too, but is more verbose):
import pandas as pd
import requests
from lxml import html
r = requests.get("https://www.ifrs.org/use-around-the-world/use-of-ifrs-standards-by-jurisdiction/paraguay")
table = pd.read_html(r.content)[0]
tree = html.fromstring(r.content)
table["Status"] = [True if t.xpath("img") else False for t in tree.xpath('//table/tbody/tr/td[2]')]
print(table)

Related

Adding a new table to tbody using Beautiful Soup

I am trying to add another row to this table in my HTML page. The table has four columns.
This is the code I have so far:
#Table Data
newVersion = soup.new_tag('td',{'colspan':'1'},**{'class': 'confluenceTd'})
newRow = soup.new_tag('tr')
newRow.insert(1,newVersion)
tableBody = soup.select("tbody")
#This is a magic number
soup.insert(tableBody[1],newRow)
I have only filled in one column (the version) and inserted it into a 'tr' tag. The idea is that I could fill in the other 3 columns and insert them into the tr as well.
The tableBody[1] is there because the page has multiple tables, which don't have unique IDs or classes.
The problem line is the soup.insert(tableBody[1],newRow) as it raises:
TypeError: '<' not supported between instances of 'int' and 'Tag'
But how do I provide a reference point for the insertion of the tr tag?
To create a new tag with several attributes, you can use the attrs parameter of new_tag.
newVersion = soup.new_tag('td', attrs= {'class': 'confluenceTd', 'colspan': '1'})
Since you haven't provided any HTML code, I have tried to reproduce the HTML code based on your input.
This code will append the newly created row to the tbody.
from bs4 import BeautifulSoup
s = '''
<table>
<thead>
</thead>
<tbody>
<tr>
<td colspan="1" class="confluenceTd">1.0.17</td>
<td colspan="1" class="confluenceTd">...</td>
<td colspan="1" class="confluenceTd">...</td>
<td colspan="1" class="confluenceTd">...</td>
</tr>
</tbody>
</table>
'''
soup = BeautifulSoup(s, 'html.parser')
newVersion = soup.new_tag('td', attrs= {'class': 'confluenceTd', 'colspan': '1'})
newRow = soup.new_tag('tr')
newRow.insert(1,newVersion)
tableBody = soup.select("tbody")
#This is a magic number
tableBody[0].append(newRow)
print(soup)
Output
<table>
<thead>
</thead>
<tbody>
<tr>
<td class="confluenceTd" colspan="1">1.0.17</td>
<td class="confluenceTd" colspan="1">...</td>
<td class="confluenceTd" colspan="1">...</td>
<td class="confluenceTd" colspan="1">...</td>
</tr>
<tr><td class="confluenceTd" colspan="1"></td></tr></tbody>
</table>
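Building on this, here is a minimal sketch that fills in all four columns before appending the row. The add_row helper and the cell values are hypothetical; the index argument selects which tbody to use when the page contains several tables.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tbody>
<tr>
<td colspan="1" class="confluenceTd">1.0.17</td>
<td colspan="1" class="confluenceTd">...</td>
<td colspan="1" class="confluenceTd">...</td>
<td colspan="1" class="confluenceTd">...</td>
</tr>
</tbody>
</table>
"""


def add_row(soup, values, index=0):
    """Append a <tr> with one <td> per value to the index-th <tbody> (hypothetical helper)."""
    row = soup.new_tag("tr")
    for value in values:
        cell = soup.new_tag("td", attrs={"class": "confluenceTd", "colspan": "1"})
        cell.string = value  # set the visible text of the cell
        row.append(cell)
    soup.select("tbody")[index].append(row)
    return row


soup = BeautifulSoup(html, "html.parser")
# Placeholder values, for illustration only.
add_row(soup, ["1.0.18", "a", "b", "c"])
print(soup.tbody.prettify())
```

Calling append on the chosen tbody gives the insertion a concrete parent, which is the reference point the original soup.insert call was missing.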

Scraping a nested and unstructured table in python (lxml)

Scraping the website with lxml works fine for everything except one table, in which the tr, td and th elements are never closed, so they appear nested and mixed and the table looks unstructured.
<table class='table'>
<tr>
<th>Serial No.
<th>Full Name
<tr>
<td>1
<td rowspan='1'> John
<tr>
<td>2
<td rowspan='1'>Jane Alleman
<tr>
<td>3
<td rowspan='1'>Mukul Jha
.....
.....
.....
</table>
I tried the following XPaths, but each of them just gives me back an empty list.
persons = [x for x in tree.xpath('//table[@class="table"]/tr/th/th/tr/td/td/text()')]
persons = [x for x in tree.xpath('//table[@class="table"]/tr/td/td/text()')]
persons = [x for x in tree.xpath('//table[@class="table"]/tr/th/th/tr/td/td/text()') if x.isdigit() ==False] # to remove the serial no.s
Finally, what is the reason for such nesting? Is it meant to prevent scraping?
It seems lxml loads the table much as a browser does: it builds the correct structure in memory, and you can see the corrected HTML with lxml.html.tostring(table).
So the parsed table is correctly formed, and a normal './tr/td//text()' is enough to get all values.
import requests
import lxml.html
text = requests.get('https://delhimetrorail.info/dwarka-sector-8-delhi-metro-station-to-dwarka-sector-14-delhi-metro-station').text
s = lxml.html.fromstring(text)
table = s.xpath('//table')[1]
for row in table.xpath('./tr'):
    cells = row.xpath('./td//text()')
    print(cells)
print(lxml.html.tostring(table, pretty_print=True).decode())
Result
['Fare', ' DMRC Rs. 30']
['Time', '0:14']
['First', '6:03']
['Last', '22:24']
['Phone ', '8800793196']
<table class="table">
<tr>
<td title="Monday To Saturday">Fare</td>
<td><div> DMRC Rs. 30</div></td>
</tr>
<tr>
<td>Time</td>
<td>0:14</td>
</tr>
<tr>
<td>First</td>
<td>6:03</td>
</tr>
<tr>
<td>Last</td>
<td>22:24</td>
</tr>
<tr>
<td>Phone </td>
<td>8800793196</td>
</tr>
</table>
Original HTML for comparison (note the missing closing tags):
<table class='table'>
<tr><td title='Monday To Saturday'>Fare<td><div> DMRC Rs. 30</div></tr>
<tr><td>Time<td>0:14</tr>
<tr><td>First<td>6:03</tr>
<tr><td>Last<td>22:24
<tr><td>Phone <td><a href='tel:8800793196'>8800793196</a></tr>
</table>
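Applied to markup like that in the question, the same approach can be sketched as follows. The sample string is an assumption reconstructed from the question (the elided rows are left out), and the attribute test uses the XPath form @class.

```python
import lxml.html

# Markup modelled on the question: no closing tags for <tr>, <th> or <td>.
SAMPLE_HTML = """<html><body>
<table class='table'>
<tr>
<th>Serial No.
<th>Full Name
<tr>
<td>1
<td rowspan='1'> John
<tr>
<td>2
<td rowspan='1'>Jane Alleman
<tr>
<td>3
<td rowspan='1'>Mukul Jha
</table>
</body></html>"""

tree = lxml.html.fromstring(SAMPLE_HTML)

# The parser closes the open cells and rows itself, so a plain tr/td path works.
names = [
    text.strip()
    for text in tree.xpath('//table[@class="table"]/tr/td[2]/text()')
]
print(names)
```

Selecting td[2] skips the serial numbers directly, so no isdigit filter is needed.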
Similar to furas' answer, but using pandas to scrape the last table on the page:
import requests
import lxml.html
import pandas as pd
url = 'https://delhimetrorail.info/dwarka-sector-8-delhi-metro-station-to-dwarka-sector-14-delhi-metro-station'
response = requests.get(url)
root = lxml.html.fromstring(response.text)
rows = []
info = root.xpath('//table[4]/tr/td[@rowspan]')
for i in info:
    row = []
    row.append(i.getprevious().text)
    row.append(i.text)
    rows.append(row)
columns = root.xpath('//table[4]//th/text()')
df1 = pd.DataFrame(rows, columns=columns)
df1
Output:
Gate Dwarka Sector 14 Metro Station
0 1 Eros Etro Mall
1 2 Nirmal Bharatiya Public School

Parse integer from a "td" tag with beautifulsoup

I read many articles about beautifulsoup but still I do not understand. I need an example.
I want to get the value of "PD/DD" which is 1,9.
Here is the source:
<div class="table vertical">
<table>
<tbody>
<tr>
<th>F/K</th>
<td>A/D</td>
</tr>
<tr>
<th>FD/FAVÖK</th>
<td>19,7</td>
</tr>
<tr>
HERE--> <th>PD/DD</th>
HERE--> <td>1,9</td>
</tr>
<tr>
<th>FD/Satışlar</th>
<td>5,1</td>
</tr>
<tr>
<th>Yabancı Oranı (%)</th>
<td>2,43</td>
</tr>
<tr>
<th>Ort Hacim (mn$) 3A/12A</th>
<td>1,3 / 1,6</td>
</tr>
My code is:
a="afyon"
url_bank = "https://www.isyatirim.com.tr/tr-tr/analiz/hisse/sayfalar/sirket-karti.aspx?hisse={}".format(a.upper())
response_bank = requests.get(url_bank)
html_content_bank = response_bank.content
soup_bank = BeautifulSoup(html_content_bank, "html.parser")
b=soup_bank.find_all("div", {"class": "table vertical"})
for i in b:
    children = i.findChildren("td" , recursive=True)
    for child in children:
        l=[]
        l_text = child.text
        l.append(l_text)
        print(l)
When I run this code, it prints a one-element list for every cell:
['Afyon Çimento ']
['11.04.1990']
['Çimento üretip satmak ve ana faaliyet konusu ile ilgili her türlü yan sanayi kuruluşlarına iştirak etmek.']
['(0216)5547000']
['(0216)6511415']
['Kısıklı Cad. Sarkusyan-Ak İş Merkezi S Blok kat:2 34662 Altunizade - Üsküdar / İstanbul']
['A/D']
['19,7']
['1,9']
['5,1']
['2,43']
['1,3 / 1,6']
['407,0 mnTL']
['395,0 mnTL']
['-']
How can I get only the PD/DD value? I am expecting something like:
PD/DD : 1,9
My preference:
With bs4 4.7.1+ you can use :contains (spelled :-soup-contains in newer Soup Sieve releases) to target the th by its text, then take the adjacent sibling td.
import requests
from bs4 import BeautifulSoup
a="afyon"
url_bank = "https://www.isyatirim.com.tr/tr-tr/analiz/hisse/sayfalar/sirket-karti.aspx?hisse={}".format(a.upper())
response_bank = requests.get(url_bank)
html_content_bank = response_bank.content
soup_bank = BeautifulSoup(html_content_bank, "html.parser")
print(soup_bank.select_one('th:contains("PD/DD") + td').text)
You could also use :nth-of-type for positional matching (3rd row 1st column):
soup_bank.select_one('.vertical table:not([class]) tr:nth-of-type(3) td:nth-of-type(1)').text
As we are using select_one, which returns first match, we can shorten to:
soup_bank.select_one('.vertical table:not([class]) tr:nth-of-type(3) td').text
If the id is static:
soup_bank.select_one('#ctl00_ctl45_g_76ae4504_9743_4791_98df_dce2ca95cc0d tr:nth-of-type(3) td').text
You already know the PD/DD label, but it could be retrieved with:
soup_bank.select_one('.vertical table:not([class]) tr:nth-of-type(3) th').text
If those ids remain static for at least a while, then:
soup_bank.select_one('#ctl00_ctl45_g_76ae4504_9743_4791_98df_dce2ca95cc0d tr:nth-of-type(3) th').text
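If CSS selectors are not a requirement, the same lookup can be sketched with find and find_next_sibling. This is a minimal example on an inline excerpt of the table; the SAMPLE_HTML string and the value_for helper are assumptions for illustration, and the lookup relies on the header cell containing exactly the label text.

```python
from bs4 import BeautifulSoup

# Excerpt of the table from the question (assumed to match the live page).
SAMPLE_HTML = """
<div class="table vertical">
<table>
<tbody>
<tr>
<th>FD/FAVÖK</th>
<td>19,7</td>
</tr>
<tr>
<th>PD/DD</th>
<td>1,9</td>
</tr>
<tr>
<th>FD/Satışlar</th>
<td>5,1</td>
</tr>
</tbody>
</table>
</div>
"""


def value_for(soup, label):
    """Return the text of the <td> that follows the <th> whose text equals label."""
    header = soup.find("th", string=label)
    if header is None:
        return None  # label not present in the markup
    cell = header.find_next_sibling("td")
    return cell.get_text(strip=True) if cell is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print("PD/DD :", value_for(soup, "PD/DD"))
```

Looking the row up by its label avoids depending on row positions or generated ids, which may change when the page layout does.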

Beautifulsoup iterate to get either <td>sometext</td> or url

I want to create a list of dictionaries, with the <thead> items as the keys. For the values I want the text of every <td> item, except where the <td> contains an <a href='url'>; in that case I want the URL instead.
Currently I am only able to get the text from all items. How do I get '/someurl' instead of Makulerad and Detaljer?
<table class="table table-bordered table-hover table-striped zero-margin-top">
<thead>
<tr>
<th>Volymsenhet</th>
<th>Pris</th>
<th>Valuta</th>
<th>Handelsplats</th>
<th>url1</th>
<th>url2</th>
</tr>
</thead>
<tbody>
<tr class="iprinactive">
<td>Antal</td>
<td>5,40</td>
<td>SEK</td>
<td>NASDAQ STOCKHOLM AB</td>
<td><a href="/someurl">Makulerad</a></td>
<td>
<a href="/someurl">Detaljer</a>
</td>
</tr>
</tbody>
</table>
My code:
raw_html = simple_get('https://example.com/')
soup = BeautifulSoup(raw_html, 'html.parser')
table = soup.find("table", attrs={"class":"table"})
head = [th.get_text() for th in table.find("tr").find_all("th")]
datasets = []
for row in table.find_all("tr")[1:]:
    dataset = dict(zip(head,(td.get_text() for td in row.find_all("td"))))
    datasets.append(dataset)
Try this:
Take the text of each <td> if it doesn't contain an <a>; otherwise take the href value.
from bs4 import BeautifulSoup
raw_html = '''<table class="table table-bordered table-hover table-striped zero-margin-top">
<thead>
<tr>
<th>Volymsenhet</th>
<th>Pris</th>
<th>Valuta</th>
<th>Handelsplats</th>
<th>url1</th>
<th>url2</th>
</tr>
</thead>
<tbody>
<tr class="iprinactive">
<td>Antal</td>
<td>5,40</td>
<td>SEK</td>
<td>NASDAQ STOCKHOLM AB</td>
<td><a href="/someurl">Makulerad</a></td>
<td>
<a href="/someurl">Detaljer</a>
</td>
</tr>
</tbody>
</table>'''
soup = BeautifulSoup(raw_html, 'html.parser')
table = soup.find("table", attrs={"class":"table"})
head = [th.get_text() for th in table.find("tr").find_all("th")]
datasets = []
for row in table.find_all("tr")[1:]:
    dataset = dict(zip(head, [td.get_text() if not td.a else td.a['href'] for td in row.find_all("td")]))
    datasets.append(dataset)
print(datasets)
OUTPUT:
[{'Volymsenhet': 'Antal', 'Pris': '5,40', 'Valuta': 'SEK', 'Handelsplats': 'NASDAQ STOCKHOLM AB', 'url1': '/someurl', 'url2': '/someurl'}]
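A slightly more defensive variant, sketched under the assumption that some cells may hold an <a> without an href: the hypothetical cell_value helper returns a URL only when the attribute exists and otherwise falls back to the stripped text. The sample markup is a shortened, assumed version of the table above.

```python
from bs4 import BeautifulSoup

# Shortened, assumed variant of the table; the last link deliberately has no href.
SAMPLE_HTML = """
<table class="table">
<thead>
<tr><th>Pris</th><th>url1</th><th>url2</th></tr>
</thead>
<tbody>
<tr>
<td>5,40</td>
<td><a href="/someurl">Makulerad</a></td>
<td>
<a>Detaljer</a>
</td>
</tr>
</tbody>
</table>
"""


def cell_value(td):
    """Return the link target if the cell has an <a href=...>, else its stripped text."""
    link = td.find("a", href=True)
    return link["href"] if link is not None else td.get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", attrs={"class": "table"})
head = [th.get_text(strip=True) for th in table.thead.find_all("th")]
datasets = [
    dict(zip(head, (cell_value(td) for td in row.find_all("td"))))
    for row in table.tbody.find_all("tr")
]
print(datasets)
```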

Using Python + BeautifulSoup to pick up text in a table on webpage

I want to pick up a date on a webpage.
The original webpage source code looks like:
<TR class=odd>
<TD>
<TABLE class=zp>
<TBODY>
<TR>
<TD><SPAN>Expiry Date</SPAN>2016</TD></TR></TBODY></TABLE></TD>
<TD> </TD>
<TD> </TD></TR>
I want to pick up the ‘2016’ but I fail. The most I can do is:
page = urllib2.urlopen('http://www.thewebpage.com')
soup = BeautifulSoup(page.read())
a = soup.find_all(text=re.compile("Expiry Date"))
And I tried:
b = a[0].findNext('').text
print b
and
b = a[0].find_next('td').select('td:nth-of-type(1)')
print b
Neither of them works.
Any help? Thanks.
There are multiple options.
Option #1 (using CSS selector, being very explicit about the path to the element):
from bs4 import BeautifulSoup
data = """
<TR class="odd">
<TD>
<TABLE class="zp">
<TBODY>
<TR>
<TD>
<SPAN>
Expiry Date
</SPAN>
2016
</TD>
</TR>
</TBODY>
</TABLE>
</TD>
<TD> </TD>
<TD> </TD>
</TR>
"""
soup = BeautifulSoup(data, 'html.parser')
span = soup.select('tr.odd table.zp > tbody > tr > td > span')[0]
print(span.next_sibling.strip())  # prints 2016
We are basically saying: get me the span tag that is directly inside the td that is directly inside the tr that is directly inside tbody that is directly inside the table tag with zp class that is inside the tr tag with odd class. Then, we are using next_sibling to get the text after the span tag.
Option #2 (find span by text; think it is more readable)
span = soup.find('span', string=re.compile('Expiry Date'))
print(span.next_sibling.strip())  # prints 2016
re.compile() is needed since there could be multi-lines and additional spaces around the text. Do not forget to import re module.
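Starting from the string match used in the question, here is a Python 3 sketch on assumed sample markup that walks from the matched text to its parent span and then to the text node that follows it:

```python
import re

from bs4 import BeautifulSoup

# Reduced version of the markup from the question (assumption for this example).
data = """
<TABLE class="zp">
<TBODY>
<TR>
<TD><SPAN>Expiry Date</SPAN>2016</TD>
</TR>
</TBODY>
</TABLE>
"""

soup = BeautifulSoup(data, "html.parser")
# find_all(string=...) returns the matching text nodes, not the tags around them.
matches = soup.find_all(string=re.compile("Expiry Date"))
label_span = matches[0].parent            # the <span> holding the label
expiry = label_span.next_sibling.strip()  # text node right after the span
print(expiry)
```

The attempts in the question searched from the text node itself, which has no td after it inside the same cell; stepping up to the span first makes next_sibling land on the value.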
An alternative to the css selector is:
import bs4
data = """
<TR class="odd">
<TD>
<TABLE class="zp">
<TBODY>
<TR>
<TD>
<SPAN>
Expiry Date
</SPAN>
2016
</TD>
</TR>
</TBODY>
</TABLE>
</TD>
<TD> </TD>
<TD> </TD>
</TR>
"""
soup = bs4.BeautifulSoup(data, 'html.parser')
exp_date = soup.find('table', class_='zp').tbody.tr.td.span.next_sibling
print(exp_date.strip())  # 2016
To learn about BeautifulSoup, I recommend you read the documentation.
