Scraping a nested and unstructured table in python (lxml)

Scraping this website with lxml works fine for everything except one table, in which the tr, td and heading th elements appear nested and mixed, forming an unstructured HTML table.
<table class='table'>
<tr>
<th>Serial No.
<th>Full Name
<tr>
<td>1
<td rowspan='1'> John
<tr>
<td>2
<td rowspan='1'>Jane Alleman
<tr>
<td>3
<td rowspan='1'>Mukul Jha
.....
.....
.....
</table>
I tried the following XPaths, but each of them just gives me back an empty list.
persons = [x for x in tree.xpath('//table[@class="table"]/tr/th/th/tr/td/td/text()')]
persons = [x for x in tree.xpath('//table[@class="table"]/tr/td/td/text()')]
persons = [x for x in tree.xpath('//table[@class="table"]/tr/th/th/tr/td/td/text()') if x.isdigit() ==False] # to remove the serial no.s
Finally, what is the reason for such nesting? Is it meant to prevent scraping?

It seems lxml loads the table much as a browser does: it builds the correct structure in memory, and you can see the repaired HTML with lxml.html.tostring(table).
So the table is correctly formed once parsed, and the ordinary './tr/td//text()' XPath is enough to get all the values.
import requests
import lxml.html
text = requests.get('https://delhimetrorail.info/dwarka-sector-8-delhi-metro-station-to-dwarka-sector-14-delhi-metro-station').text
s = lxml.html.fromstring(text)
table = s.xpath('//table')[1]
for row in table.xpath('./tr'):
    cells = row.xpath('./td//text()')
    print(cells)
print(lxml.html.tostring(table, pretty_print=True).decode())
Result
['Fare', ' DMRC Rs. 30']
['Time', '0:14']
['First', '6:03']
['Last', '22:24']
['Phone ', '8800793196']
<table class="table">
<tr>
<td title="Monday To Saturday">Fare</td>
<td><div> DMRC Rs. 30</div></td>
</tr>
<tr>
<td>Time</td>
<td>0:14</td>
</tr>
<tr>
<td>First</td>
<td>6:03</td>
</tr>
<tr>
<td>Last</td>
<td>22:24</td>
</tr>
<tr>
<td>Phone </td>
<td>8800793196</td>
</tr>
</table>
Original HTML for comparison (note the missing closing tags):
<table class='table'>
<tr><td title='Monday To Saturday'>Fare<td><div> DMRC Rs. 30</div></tr>
<tr><td>Time<td>0:14</tr>
<tr><td>First<td>6:03</tr>
<tr><td>Last<td>22:24
<tr><td>Phone <td><a href='tel:8800793196'>8800793196</a></tr>
</table>
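To see why the XPaths in the question return nothing, here is a minimal offline sketch that feeds the question's markup (with the elided rows left out) to lxml. The parser closes each th, td and tr when the next one opens, so the cells end up as siblings and a path such as tr/th/th/tr/td/td has nothing to match. The variable names are illustrative, not taken from the site.

```python
import lxml.html

# Markup from the question, minus the elided rows: no </th>, </td> or </tr>.
html = """
<table class='table'>
<tr>
<th>Serial No.
<th>Full Name
<tr>
<td>1
<td rowspan='1'> John
<tr>
<td>2
<td rowspan='1'>Jane Alleman
<tr>
<td>3
<td rowspan='1'>Mukul Jha
</table>
"""

tree = lxml.html.fromstring(html)

# After parsing, every row is a direct child of the table and every cell
# a direct child of its row, so the second cell of each row is the name.
names = [
    td.text_content().strip()
    for td in tree.xpath('//table[@class="table"]/tr/td[2]')
]
print(names)

# The nested path from the question matches nothing in the repaired tree.
nested = tree.xpath('//table[@class="table"]/tr/td/td/text()')
print(nested)
```

As for the reason behind the markup: HTML allows the end tags of tr, td and th to be omitted, so this is more likely terse markup than an anti-scraping measure.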

Similar to furas' answer, but using pandas to scrape the last table on the page:
import requests
import lxml.html
import pandas as pd
url = 'https://delhimetrorail.info/dwarka-sector-8-delhi-metro-station-to-dwarka-sector-14-delhi-metro-station'
response = requests.get(url)
root = lxml.html.fromstring(response.text)
rows = []
info = root.xpath('//table[4]/tr/td[@rowspan]')
for i in info:
    row = []
    row.append(i.getprevious().text)
    row.append(i.text)
    rows.append(row)
columns = root.xpath('//table[4]//th/text()')
df1 = pd.DataFrame(rows, columns=columns)
df1
Output:
  Gate  Dwarka Sector 14 Metro Station
0    1                  Eros Etro Mall
1    2  Nirmal Bharatiya Public School
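The pairing step above (take each td that carries a rowspan attribute and read its previous sibling) can be tried offline. The markup below is a hypothetical reconstruction built from the output shown; the real page's table may differ. The sketch stops at plain lists, which can then be handed to pd.DataFrame(rows, columns=columns) as in the answer.

```python
import lxml.html

# Hypothetical markup modelled on the output above, with closing tags
# omitted the same way the question's table omits them.
html = """
<table class='table'>
<tr><th>Gate<th>Dwarka Sector 14 Metro Station
<tr><td>1<td rowspan='1'>Eros Etro Mall
<tr><td>2<td rowspan='1'>Nirmal Bharatiya Public School
</table>
"""

root = lxml.html.fromstring(html)

columns = [th.text_content().strip() for th in root.xpath('//table//th')]

rows = []
for cell in root.xpath('//table/tr/td[@rowspan]'):
    # getprevious() returns the sibling element just before this cell,
    # which is the gate-number td in the same row.
    rows.append([cell.getprevious().text.strip(), cell.text.strip()])

print(columns)
print(rows)
```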

Related

Adding a new table to tbody using Beautiful Soup

I am trying to add another row to this table in my HTML page. The table has four columns.
This is the code I have so far:
#Table Data
newVersion = soup.new_tag('td',{'colspan':'1'},**{'class': 'confluenceTd'})
newRow = soup.new_tag('tr')
newRow.insert(1,newVersion)
tableBody = soup.select("tbody")
#This is a magic number
soup.insert(tableBody[1],newRow)
I have only filled in one column (the version) and inserted it into a 'tr' tag. The idea is that I could fill in the other three columns and insert them into the tr as well.
The tableBody[1] is needed because there are multiple tables on the page, which don't have unique IDs or classes.
The problem line is the soup.insert(tableBody[1],newRow) as it raises:
TypeError: '<' not supported between instances of 'int' and 'Tag'
But how do I provide a reference point for the insertion of the tr tag?

To create a new tag with several attributes, you can use the attrs parameter of new_tag.
newVersion = soup.new_tag('td', attrs= {'class': 'confluenceTd', 'colspan': '1'})
Since you haven't provided any HTML code, I have tried to reproduce the HTML code based on your input.
This code will append the newly created row to the tbody.
from bs4 import BeautifulSoup
s = '''
<table>
<thead>
</thead>
<tbody>
<tr>
<td colspan="1" class="confluenceTd">1.0.17</td>
<td colspan="1" class="confluenceTd">...</td>
<td colspan="1" class="confluenceTd">...</td>
<td colspan="1" class="confluenceTd">...</td>
</tr>
</tbody>
</table>
'''
soup = BeautifulSoup(s, 'html.parser')
newVersion = soup.new_tag('td', attrs= {'class': 'confluenceTd', 'colspan': '1'})
newRow = soup.new_tag('tr')
newRow.insert(1,newVersion)
tableBody = soup.select("tbody")
#This is a magic number
tableBody[0].append(newRow)
print(soup)
Output
<table>
<thead>
</thead>
<tbody>
<tr>
<td class="confluenceTd" colspan="1">1.0.17</td>
<td class="confluenceTd" colspan="1">...</td>
<td class="confluenceTd" colspan="1">...</td>
<td class="confluenceTd" colspan="1">...</td>
</tr>
<tr><td class="confluenceTd" colspan="1"></td></tr></tbody>
</table>
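As a follow-up sketch for the "reference point" part of the question: the position can be given relative to an existing row with insert_before or insert_after, or as an integer index through tbody.insert(position, new_row). The cell values below are hypothetical placeholders, and the HTML is the same reconstruction used above.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tbody>
<tr>
<td colspan="1" class="confluenceTd">1.0.17</td>
<td colspan="1" class="confluenceTd">...</td>
<td colspan="1" class="confluenceTd">...</td>
<td colspan="1" class="confluenceTd">...</td>
</tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
tbody = soup.select("tbody")[0]
first_row = tbody.find("tr")

# Build a complete four-cell row; the values are made up for illustration.
new_row = soup.new_tag("tr")
for value in ["1.0.18", "x", "y", "z"]:
    cell = soup.new_tag("td", attrs={"class": "confluenceTd", "colspan": "1"})
    cell.string = value
    new_row.append(cell)

# Use an existing row as the reference point: the new row goes right before it.
first_row.insert_before(new_row)

versions = [tr.find("td").get_text() for tr in tbody.find_all("tr")]
print(versions)  # ['1.0.18', '1.0.17']
```

The original call failed because insert expects an integer position as its first argument and the new element as its second, and it has to be called on the parent tag (here the tbody), not on the soup with a tag as the position.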

Get the content of tr in tbody

I have the following table :
<table class="table table-bordered adoption-status-table">
<thead>
<tr>
<th>Extent of IFRS application</th>
<th>Status</th>
<th>Additional Information</th>
</tr>
</thead>
<tbody>
<tr>
<td>IFRS Standards are required for domestic public companies</td>
<td>
</td>
<td></td>
</tr>
<tr>
<td>IFRS Standards are permitted but not required for domestic public companies</td>
<td>
<img src="/images/icons/tick.png" alt="tick">
</td>
<td>Permitted, but very few companies use IFRS Standards.</td>
</tr>
<tr>
<td>IFRS Standards are required or permitted for listings by foreign companies</td>
<td>
</td>
<td></td>
</tr>
<tr>
<td>The IFRS for SMEs Standard is required or permitted</td>
<td>
<img src="/images/icons/tick.png" alt="tick">
</td>
<td>The IFRS for SMEs Standard is permitted, but very few companies use it. Nearly all SMEs use Paraguayan national accounting standards.</td>
</tr>
<tr>
<td>The IFRS for SMEs Standard is under consideration</td>
<td>
</td>
<td></td>
</tr>
</tbody>
</table>
I am trying to extract the data as it appears on the source page.
This is my code so far:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
# Site URL
url = "https://www.ifrs.org/use-around-the-world/use-of-ifrs-standards-by-jurisdiction/paraguay"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse HTML code for the entire site
soup = BeautifulSoup(html_content, "lxml")
gdp = soup.find_all("table", attrs={"class": "adoption-status-table"})
print("Number of tables on site: ",len(gdp))
table1 = gdp[0]
body = table1.find_all("tr")
head = body[0]
body_rows = body[1:]
headings = []
for item in head.find_all("th"):
    item = (item.text).rstrip("\n")
    headings.append(item)
print(headings)
all_rows = []
for row_num in range(len(body_rows)):
    row = []
    for row_item in body_rows[row_num].find_all("td"):
        aa = re.sub("(\xa0)|(\n)|,","",row_item.text)
        row.append(aa)
    all_rows.append(row)
df = pd.DataFrame(data=all_rows,columns=headings)
This is the only output I get :
Number of tables on site: 1
['Extent of IFRS application', 'Status', 'Additional Information']
I want to replace the empty cells with False and the cells containing the tick image with True.

You need to look for an img element inside the second td of each row. Here is an example:
data = []
for tr in body_rows:
    cells = tr.find_all('td')
    img = cells[1].find('img')
    if img and img['src'] == '/images/icons/tick.png':
        status = True
    else:
        status = False
    data.append({
        'Extent of IFRS application': cells[0].string,
        'Status': status,
        'Additional Information': cells[2].string,
    })
print(pd.DataFrame(data).head())
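For an offline check of the same idea, the sketch below runs against the first two rows of the table from the question instead of the live page. It uses get_text(strip=True) rather than .string so that empty cells become '' instead of None; that swap is a choice made here, not part of the answer above.

```python
from bs4 import BeautifulSoup

# First two body rows of the table from the question.
html = """
<table class="table table-bordered adoption-status-table">
<thead>
<tr><th>Extent of IFRS application</th><th>Status</th><th>Additional Information</th></tr>
</thead>
<tbody>
<tr>
<td>IFRS Standards are required for domestic public companies</td>
<td>
</td>
<td></td>
</tr>
<tr>
<td>IFRS Standards are permitted but not required for domestic public companies</td>
<td>
<img src="/images/icons/tick.png" alt="tick">
</td>
<td>Permitted, but very few companies use IFRS Standards.</td>
</tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="adoption-status-table")

data = []
for tr in table.find("tbody").find_all("tr"):
    cells = tr.find_all("td")
    data.append({
        "Extent of IFRS application": cells[0].get_text(strip=True),
        # True when the status cell holds an <img>, False when it is empty.
        "Status": cells[1].find("img") is not None,
        "Additional Information": cells[2].get_text(strip=True),
    })

statuses = [row["Status"] for row in data]
print(statuses)  # [False, True]
```

The list of dicts can be passed straight to pd.DataFrame(data), as the answer does.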

The answer above is good. Another option is to use pandas.read_html to extract the table into a DataFrame and then populate the missing Status column with an lxml XPath query (or with BeautifulSoup if you prefer, though that is more verbose):
import pandas as pd
import requests
from lxml import html
r = requests.get("https://www.ifrs.org/use-around-the-world/use-of-ifrs-standards-by-jurisdiction/paraguay")
table = pd.read_html(r.content)[0]
tree = html.fromstring(r.content)
table["Status"] = [True if t.xpath("img") else False for t in tree.xpath('//table/tbody/tr/td[2]')]
print(table)
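The XPath part of this approach can be checked in isolation. This is a small sketch on a cut-down, hypothetical two-row table; it only shows how the list of booleans is built. An empty XPath result is falsy, so bool() does the same job as the conditional expression.

```python
from lxml import html

# Cut-down stand-in for the page's table: one empty status cell, one with a tick.
snippet = """
<table>
<tbody>
<tr><td>Row A</td><td></td><td></td></tr>
<tr><td>Row B</td><td><img src="/images/icons/tick.png" alt="tick"></td><td>note</td></tr>
</tbody>
</table>
"""

tree = html.fromstring(snippet)

# td[2] is the second cell of each row; xpath("img") returns a list of
# child <img> elements, which is empty (falsy) for blank cells.
status = [bool(td.xpath("img")) for td in tree.xpath("//table/tbody/tr/td[2]")]
print(status)
```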

Established html table line to python

Let's say I have an HTML table like this:
<tr>
<td class="Klasse gerade">12A<br></td>
<td class="Stunde gerade">4<br></td>
<td class="Fach gerade">GEO statt GE<br></td>
<td class="Lehrer gerade"><br></td>
<td class="Vertretung gerade">Herr Grieger<br></td>
<td class="Raum gerade">603<br></td>
<td class="Anmerkung gerade"><br></td>
</tr>
<tr>
<td class="Klasse gerade">10A<br></td>
<td class="Stunde gerade">2<br></td>
<td class="Fach gerade">MA statt GE<br></td>
<td class="Lehrer gerade"><br></td>
<td class="Vertretung gerade">Herr Grieger<br></td>
<td class="Raum gerade">406<br></td>
<td class="Anmerkung gerade"><br></td>
</tr>
If I fetch the HTML in Python (2.7) with:
link = "http://www.test.com/vplan.html"
f = urllib.urlopen(link)
vplan = f.read()
print vplan
how can I do this: if a td contains 10A, print the complete tr for 10A?
Sorry for the clumsy wording, but in my opinion this is the easiest way to explain my question. Don't be surprised by the German words (I'm German).

You need an HTML parser like BeautifulSoup. Assuming the table in question is the only one or the first one in the document, the program may look like this:
#!/usr/bin/env python
import urllib
from bs4 import BeautifulSoup
def main():
    link = 'http://www.test.com/vplan.html'
    soup = BeautifulSoup(urllib.urlopen(link), 'lxml')
    table = soup.find('table')
    # find every text node equal to '10A' and climb to its enclosing row
    rows = [x.find_parent('tr') for x in table.find_all(text='10A')]
    for row in rows:
        for cell in row.find_all('td'):
            print cell.text
        print '-' * 10
if __name__ == '__main__':
    main()
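For readers on Python 3, here is a rough equivalent that runs offline against the two rows from the question, wrapped in a table element added for this sketch. It uses the string= argument (the current name for text=) and the built-in html.parser; on a live page the HTML would come from urllib.request.urlopen instead.

```python
from bs4 import BeautifulSoup

# The two rows from the question, wrapped in a <table> for this sketch.
html = """
<table>
<tr>
<td class="Klasse gerade">12A<br></td>
<td class="Stunde gerade">4<br></td>
<td class="Fach gerade">GEO statt GE<br></td>
<td class="Lehrer gerade"><br></td>
<td class="Vertretung gerade">Herr Grieger<br></td>
<td class="Raum gerade">603<br></td>
<td class="Anmerkung gerade"><br></td>
</tr>
<tr>
<td class="Klasse gerade">10A<br></td>
<td class="Stunde gerade">2<br></td>
<td class="Fach gerade">MA statt GE<br></td>
<td class="Lehrer gerade"><br></td>
<td class="Vertretung gerade">Herr Grieger<br></td>
<td class="Raum gerade">406<br></td>
<td class="Anmerkung gerade"><br></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

matches = []
# Each hit is the text node "10A"; find_parent climbs to the row holding it.
for text_node in soup.find_all(string="10A"):
    row = text_node.find_parent("tr")
    matches.append([td.get_text(strip=True) for td in row.find_all("td")])

print(matches)
# [['10A', '2', 'MA statt GE', '', 'Herr Grieger', '406', '']]
```

Matching on the text node rather than on the td is deliberate: each td also contains a br, so td.string is None and a search such as find('td', string='10A') would not match.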

Using Python + BeautifulSoup to pick up text in a table on webpage

I want to pick up a date on a webpage.
The original webpage source code looks like:
<TR class=odd>
<TD>
<TABLE class=zp>
<TBODY>
<TR>
<TD><SPAN>Expiry Date</SPAN>2016</TD></TR></TBODY></TABLE></TD>
<TD> </TD>
<TD> </TD></TR>
I want to pick up the ‘2016’ but I fail. The most I can do is:
page = urllib2.urlopen('http://www.thewebpage.com')
soup = BeautifulSoup(page.read())
a = soup.find_all(text=re.compile("Expiry Date"))
And I tried:
b = a[0].findNext('').text
print b
and
b = a[0].find_next('td').select('td:nth-of-type(1)')
print b
Neither of them works.
Any help? Thanks.

There are multiple options.
Option #1 (using CSS selector, being very explicit about the path to the element):
from bs4 import BeautifulSoup
data = """
<TR class="odd">
<TD>
<TABLE class="zp">
<TBODY>
<TR>
<TD>
<SPAN>
Expiry Date
</SPAN>
2016
</TD>
</TR>
</TBODY>
</TABLE>
</TD>
<TD> </TD>
<TD> </TD>
</TR>
"""
soup = BeautifulSoup(data)
span = soup.select('tr.odd table.zp > tbody > tr > td > span')[0]
print span.next_sibling.strip() # prints 2016
We are basically saying: get me the span tag that is directly inside the td that is directly inside the tr that is directly inside tbody that is directly inside the table tag with zp class that is inside the tr tag with odd class. Then, we are using next_sibling to get the text after the span tag.
Option #2 (find the span by its text; I think this is more readable):
span = soup.find('span', text=re.compile('Expiry Date'))
print span.next_sibling.strip() # prints 2016
re.compile() is needed because there could be newlines and extra spaces around the text. Do not forget to import the re module.
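Option #2 as a self-contained Python 3 sketch, using the markup from the question in lowercase and wrapped in an outer table element added for this example. The parser choice (html.parser) and the variable names are assumptions made here.

```python
import re
from bs4 import BeautifulSoup

data = """
<table>
<tr class="odd">
<td>
<table class="zp">
<tbody>
<tr>
<td><span>Expiry Date</span>2016</td></tr></tbody></table></td>
<td> </td>
<td> </td></tr>
</table>
"""

soup = BeautifulSoup(data, "html.parser")

# Locate the label by its text, then read the text node that follows it.
span = soup.find("span", string=re.compile("Expiry Date"))
expiry = span.next_sibling.strip()
print(expiry)  # 2016
```

next_sibling returns the bare text node after the span, which is why no further tag lookup is needed; strip() removes any surrounding whitespace the page may contain.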

An alternative to the css selector is:
import bs4
data = """
<TR class="odd">
<TD>
<TABLE class="zp">
<TBODY>
<TR>
<TD>
<SPAN>
Expiry Date
</SPAN>
2016
</TD>
</TR>
</TBODY>
</TABLE>
</TD>
<TD> </TD>
<TD> </TD>
</TR>
"""
soup = bs4.BeautifulSoup(data)
exp_date = soup.find('table', class_='zp').tbody.tr.td.span.next_sibling
print exp_date.strip() # 2016
To learn about BeautifulSoup, I recommend you read the documentation.

Getting the nth element using BeautifulSoup

From a large table I want to read rows 5, 10, 15, 20 ... using BeautifulSoup. How do I do this? Is findNextSibling and an incrementing counter the way to go?

You could also use findAll to get all the rows in a list and after that just use the slice syntax to access the elements that you need:
rows = soup.findAll('tr')[4::5]
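A quick way to convince yourself the slice picks rows 5, 10, 15 and so on is a small sketch with a generated ten-row table (find_all is the bs4 spelling of findAll; the table content is made up):

```python
from bs4 import BeautifulSoup

# Ten single-cell rows: foo1 .. foo10.
html = "<table>" + "".join(
    f"<tr><td>foo{i}</td></tr>" for i in range(1, 11)
) + "</table>"

soup = BeautifulSoup(html, "html.parser")

# Index 4 is the fifth row; a step of 5 then yields rows 5, 10, 15, ...
rows = soup.find_all("tr")[4::5]
texts = [row.get_text() for row in rows]
print(texts)  # ['foo5', 'foo10']
```

Note that find_all('tr') collects rows from every table in the document, so on a page with several tables it is safer to call it on the specific table element.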

This can be done easily with select in Beautiful Soup if you know the row numbers to be selected. (Note: this requires bs4.)
row = 5
while True:
    # row is an int, so convert it before building the selector string
    element = soup.select('tr:nth-of-type(' + str(row) + ')')
    if len(element) > 0:
        # element is a list holding your desired row; do what you want with it
        row += 5
    else:
        break
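If the installed bs4 ships with Soup Sieve (it has been a dependency since 4.7), the loop can likely be collapsed into a single selector, because nth-of-type accepts the an+b form. A minimal sketch on a generated ten-row table, with made-up content:

```python
from bs4 import BeautifulSoup

# Ten single-cell rows: foo1 .. foo10.
html = "<table>" + "".join(
    f"<tr><td>foo{i}</td></tr>" for i in range(1, 11)
) + "</table>"

soup = BeautifulSoup(html, "html.parser")

# 5n matches the 5th, 10th, 15th, ... tr among its siblings.
selected = soup.select("tr:nth-of-type(5n)")
texts = [row.get_text() for row in selected]
print(texts)  # ['foo5', 'foo10']
```

One difference from slicing a flat list: nth-of-type counts rows per parent element, so with several tables (or a thead and tbody) each group of sibling rows is counted separately.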

As a general solution, you can convert the table to a nested list and iterate...
import BeautifulSoup
def listify(table):
    """Convert an html table to a nested list"""
    result = []
    rows = table.findAll('tr')
    for row in rows:
        result.append([])
        cols = row.findAll('td')
        for col in cols:
            strings = [_string.encode('utf8') for _string in col.findAll(text=True)]
            text = ''.join(strings)
            result[-1].append(text)
    return result
if __name__=="__main__":
    """Build a small table with one column and ten rows, then parse into a list"""
    htstring = """<table> <tr> <td>foo1</td> </tr> <tr> <td>foo2</td> </tr> <tr> <td>foo3</td> </tr> <tr> <td>foo4</td> </tr> <tr> <td>foo5</td> </tr> <tr> <td>foo6</td> </tr> <tr> <td>foo7</td> </tr> <tr> <td>foo8</td> </tr> <tr> <td>foo9</td> </tr> <tr> <td>foo10</td> </tr></table>"""
    soup = BeautifulSoup.BeautifulSoup(htstring)
    for idx, ii in enumerate(listify(soup)):
        if ((idx+1)%5>0):
            continue
        print ii
Running that...
[mpenning@Bucksnort ~]$ python testme.py
['foo5']
['foo10']
[mpenning@Bucksnort ~]$

Another option, if you prefer raw html...
"""Build a small table with one column and ten rows, then parse it into a list"""
htstring = """<table> <tr> <td>foo1</td> </tr> <tr> <td>foo2</td> </tr> <tr> <td>foo3</td> </tr> <tr> <td>foo4</td> </tr> <tr> <td>foo5</td> </tr> <tr> <td>foo6</td> </tr> <tr> <td>foo7</td> </tr> <tr> <td>foo8</td> </tr> <tr> <td>foo9</td> </tr> <tr> <td>foo10</td> </tr></table>"""
result = [html_tr for idx, html_tr in enumerate(soup.findAll('tr')) \
if (idx+1)%5==0]
print result
Running that...
[mpenning@Bucksnort ~]$ python testme.py
[<tr> <td>foo5</td> </tr>, <tr> <td>foo10</td> </tr>]
[mpenning@Bucksnort ~]$

Here's how you could scrape every 5th distribution link on this Wikipedia page with gazpacho:
from gazpacho import Soup
url = "https://en.wikipedia.org/wiki/List_of_probability_distributions"
soup = Soup.get(url)
a_tags = soup.find("a", {"href": "distribution"})
links = ["https://en.wikipedia.org" + a.attrs["href"] for a in a_tags]
links[4::5] # start at 0,1,2,3,**4** and stride by 5
