Parse specific data from an HTML table using lxml and XPath (Python)

First of all, I am new to Python and Stack Overflow, so please be kind.
This is the source code of the HTML page I want to extract data from.
Webpage: http://gbgfotboll.se/information/?scr=table&ftid=51168
The table is at the bottom of the page
<html>
<table class="clCommonGrid" cellspacing="0">
<thead>
<tr>
<td colspan="3">Kommande matcher</td>
</tr>
<tr>
<th style="width:1%;">Tid</th>
<th style="width:69%;">Match</th>
<th style="width:30%;">Arena</th>
</tr>
</thead>
<tbody class="clGrid">
<tr class="clTrOdd">
<td nowrap="nowrap" class="no-line-through">
<span class="matchTid"><span>2014-09-26<!-- br ok --> 19:30</span></span>
</td>
<td>Guldhedens IK - IF Warta</td>
<td>Guldheden Södra 1 Konstgräs </td>
</tr>
<tr class="clTrEven">
<td nowrap="nowrap" class="no-line-through">
<span class="matchTid"><span>2014-09-26<!-- br ok --> 13:00</span></span>
</td>
<td>Romelanda UF - IK Virgo</td>
<td>Romevi 1 Gräs </td>
</tr>
<tr class="clTrOdd">
<td nowrap="nowrap" class="no-line-through">
<span class="matchTid"><span>2014-09-27<!-- br ok --> 13:00</span></span>
</td>
<td>Kode IF - IK Kongahälla</td>
<td>Kode IP 1 Gräs </td>
</tr>
<tr class="clTrEven">
<td nowrap="nowrap" class="no-line-through">
<span class="matchTid"><span>2014-09-27<!-- br ok --> 14:00</span></span>
</td>
<td>Floda BoIF - Partille IF FK </td>
<td>Flodala IP 1 </td>
</tr>
</tbody>
</table>
</html>
I need to extract the time (19:30) and the team names (Guldhedens IK - IF Warta), i.e. the first and second table cells (not the third) from the first table row, then 13:00 / Romelanda UF - IK Virgo from the second row, and so on for every table row there is.
As you can see, every table row has a date right before the time, and here comes the tricky part: I only want to get the time and the team names from those table rows where the date equals the date on which I run this code.
The only thing I have managed so far is not much; I can only get the time and the team name using this code:
import lxml.html
html = lxml.html.parse("http://gbgfotboll.se/information/?scr=table&ftid=51168")
test = html.xpath("//*[@id='content-primary']/table[3]/tbody/tr[1]/td[1]/span/span//text()")
print test
which gives me the result ['2014-09-26', ' 19:30']. After this I am lost on how to iterate through the table rows, picking the specific cells where the date matches the date I run the code.
Any help is appreciated.

If I understood you, try something like this:
import lxml.html
url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
html = lxml.html.parse(url)
for i in range(12):
    xpath1 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[1]/span/span//text()" % (i+1)
    xpath2 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[2]/a/text()" % (i+1)
    print html.xpath(xpath1)[1], html.xpath(xpath2)[0]
I know this is fragile and there are better solutions, but it works. ;)
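For example, one less fragile variant (a sketch, not tested against the live page) iterates over whatever rows the table actually contains instead of hard-coding range(12):

import lxml.html

url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
html = lxml.html.parse(url)

# Grab every row of the target table, however many there are
rows = html.xpath(".//*[@id='content-primary']/table[3]/tbody/tr")
for row in rows:
    # Relative XPaths: both lookups happen inside the current row
    date_time = row.xpath("td[1]/span/span//text()")
    team = row.xpath("td[2]/a/text()")
    if len(date_time) > 1 and team:
        print date_time[1], team[0]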
Edit:
A better way, using BeautifulSoup:
from bs4 import BeautifulSoup
import requests
respond = requests.get("http://gbgfotboll.se/information/?scr=table&ftid=51168")
soup = BeautifulSoup(respond.text)
l = soup.find_all('table')
t = l[2].find_all('tr')  # change this to l[0] to parse the first table
for i in t:
    try:
        print i.find('span').get_text()[-5:], i.find('a').get_text()
    except AttributeError:
        pass
Edit2:
the page is not responding right now, but this should work:
from bs4 import BeautifulSoup
import requests
respond = requests.get("http://gbgfotboll.se/information/?scr=table&ftid=51168")
soup = BeautifulSoup(respond.text)
l = soup.find_all('table')
t = l[2].find_all('tr')
time = ""
for i in t:
try:
dateTime = i.find('span').get_text()
teamName = i.find('a').get_text()
if time == dateTime[:-5]:
print dateTime[-5,], teamName
else:
print dateTime, teamName
time = dateTime[:-5]
except AttributeError:
pass
lxml:
import lxml.html
url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
html = lxml.html.parse(url)
dateTemp = ""
for i in range(12):
xpath1 = ".//*[#id='content-primary']/table[3]/tbody/tr[%d]/td[1]/span/span// text()" %(i+1)
xpath2 = ".//*[#id='content-primary']/table[3]/tbody/tr[%d]/td[2]/a/text()" %(i+1)
time = html.xpath(xpath1)[1]
date = html.xpath(xpath1)[0]
teamName = html.xpath(xpath2)[0]
if date == dateTemp:
print time, teamName
else:
print date, time, teamName

So thanks to @CodeNinja's help, I just tweaked it a little to get exactly what I wanted.
I imported time to get the date on which I run the code. Anyway, here is the code for what I wanted. Thank you for the help!!
import lxml.html
import time
url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
html = lxml.html.parse(url)
currentDate = time.strftime("%Y-%m-%d")
for i in range(12):
    xpath1 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[1]/span/span//text()" % (i+1)
    xpath2 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[2]/a/text()" % (i+1)
    time = html.xpath(xpath1)[1]  # note: this rebinds the name time, shadowing the module
    date = html.xpath(xpath1)[0]
    teamName = html.xpath(xpath2)[0]
    if date == currentDate:
        print time, teamName
So here is the FINAL version, the correct way to do it. This parses through all the table rows without using "range" in the for loop. I got this answer from my other post here: Iterate through all the rows in a table using python lxml xpath
import lxml.html
from lxml.etree import XPath
url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
date = '2014-09-27'
rows_xpath = XPath("//*[@id='content-primary']/table[3]/tbody/tr[td[1]/span/span//text()='%s']" % (date))
time_xpath = XPath("td[1]/span/span//text()[2]")
team_xpath = XPath("td[2]/a/text()")
html = lxml.html.parse(url)
for row in rows_xpath(html):
    time = time_xpath(row)[0].strip()
    team = team_xpath(row)[0]
    print time, team
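To filter on the day the script runs rather than a hard-coded date, the same pattern works with time.strftime; a small sketch combining the two snippets above:

import time
import lxml.html
from lxml.etree import XPath

url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
date = time.strftime("%Y-%m-%d")  # today's date instead of a literal

rows_xpath = XPath("//*[@id='content-primary']/table[3]/tbody/tr[td[1]/span/span//text()='%s']" % date)
time_xpath = XPath("td[1]/span/span//text()[2]")
team_xpath = XPath("td[2]/a/text()")

html = lxml.html.parse(url)
for row in rows_xpath(html):
    print time_xpath(row)[0].strip(), team_xpath(row)[0]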

Related

Get the content of tr in tbody

I have the following table:
<table class="table table-bordered adoption-status-table">
<thead>
<tr>
<th>Extent of IFRS application</th>
<th>Status</th>
<th>Additional Information</th>
</tr>
</thead>
<tbody>
<tr>
<td>IFRS Standards are required for domestic public companies</td>
<td>
</td>
<td></td>
</tr>
<tr>
<td>IFRS Standards are permitted but not required for domestic public companies</td>
<td>
<img src="/images/icons/tick.png" alt="tick">
</td>
<td>Permitted, but very few companies use IFRS Standards.</td>
</tr>
<tr>
<td>IFRS Standards are required or permitted for listings by foreign companies</td>
<td>
</td>
<td></td>
</tr>
<tr>
<td>The IFRS for SMEs Standard is required or permitted</td>
<td>
<img src="/images/icons/tick.png" alt="tick">
</td>
<td>The IFRS for SMEs Standard is permitted, but very few companies use it. Nearly all SMEs use Paraguayan national accounting standards.</td>
</tr>
<tr>
<td>The IFRS for SMEs Standard is under consideration</td>
<td>
</td>
<td></td>
</tr>
</tbody>
</table>
I am trying to extract the data as it appears in its original source.
This is my work:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
# Site URL
url = "https://www.ifrs.org/use-around-the-world/use-of-ifrs-standards-by-jurisdiction/paraguay"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse HTML code for the entire site
soup = BeautifulSoup(html_content, "lxml")
gdp = soup.find_all("table", attrs={"class": "adoption-status-table"})
print("Number of tables on site: ",len(gdp))
table1 = gdp[0]
body = table1.find_all("tr")
head = body[0]
body_rows = body[1:]
headings = []
for item in head.find_all("th"):
    item = (item.text).rstrip("\n")
    headings.append(item)
print(headings)
all_rows = []
for row_num in range(len(body_rows)):
    row = []
    for row_item in body_rows[row_num].find_all("td"):
        aa = re.sub("(\xa0)|(\n)|,", "", row_item.text)
        row.append(aa)
    all_rows.append(row)
df = pd.DataFrame(data=all_rows, columns=headings)
This is the only output I get:
Number of tables on site: 1
['Extent of IFRS application', 'Status', 'Additional Information']
I want to replace the empty cells with False and the cells containing the tick image with True.
You need to look for the img element inside the td. Here is an example:
data = []
for tr in body_rows:
    cells = tr.find_all('td')
    img = cells[1].find('img')
    if img and img['src'] == '/images/icons/tick.png':
        status = True
    else:
        status = False
    data.append({
        'Extent of IFRS application': cells[0].string,
        'Status': status,
        'Additional Information': cells[2].string,
    })
print(pd.DataFrame(data).head())
The above answer is good. One other option is to use pandas.read_html to extract the table into a dataframe and populate the missing Status column using an lxml XPath (or BeautifulSoup if you prefer, but it is more verbose):
import pandas as pd
import requests
from lxml import html
r = requests.get("https://www.ifrs.org/use-around-the-world/use-of-ifrs-standards-by-jurisdiction/paraguay")
table = pd.read_html(r.content)[0]
tree = html.fromstring(r.content)
table["Status"] = [True if t.xpath("img") else False for t in tree.xpath('//table/tbody/tr/td[2]')]
print(table)
Try this on repl.it

Scraping a nested and unstructured table in python (lxml)

The website I'm scraping (using lxml) works just fine for everything except one table, in which all the tr's, td's and heading th's are nested and mixed, forming an unstructured HTML table.
<table class='table'>
<tr>
<th>Serial No.
<th>Full Name
<tr>
<td>1
<td rowspan='1'> John
<tr>
<td>2
<td rowspan='1'>Jane Alleman
<tr>
<td>3
<td rowspan='1'>Mukul Jha
.....
.....
.....
</table>
I tried the following XPaths, but each of these just gives me back an empty list.
persons = [x for x in tree.xpath('//table[#class="table"]/tr/th/th/tr/td/td/text()')]
persons = [x for x in tree.xpath('//table[#class="table"]/tr/td/td/text()')]
persons = [x for x in tree.xpath('//table[#class="table"]/tr/th/th/tr/td/td/text()') if x.isdigit() ==False] # to remove the serial no.s
Finally, what is the reason for such nesting? Is it there to prevent scraping?
It seems lxml loads the table the same way a browser does: it builds the correct structure in memory, and you can see the corrected HTML when you use lxml.html.tostring(table).
So it ends up with a correctly formatted table, and a normal './tr/td//text()' is enough to get all the values:
import requests
import lxml.html
text = requests.get('https://delhimetrorail.info/dwarka-sector-8-delhi-metro-station-to-dwarka-sector-14-delhi-metro-station').text
s = lxml.html.fromstring(text)
table = s.xpath('//table')[1]
for row in table.xpath('./tr'):
    cells = row.xpath('./td//text()')
    print(cells)
print(lxml.html.tostring(table, pretty_print=True).decode())
Result
['Fare', ' DMRC Rs. 30']
['Time', '0:14']
['First', '6:03']
['Last', '22:24']
['Phone ', '8800793196']
<table class="table">
<tr>
<td title="Monday To Saturday">Fare</td>
<td><div> DMRC Rs. 30</div></td>
</tr>
<tr>
<td>Time</td>
<td>0:14</td>
</tr>
<tr>
<td>First</td>
<td>6:03</td>
</tr>
<tr>
<td>Last</td>
<td>22:24</td>
</tr>
<tr>
<td>Phone </td>
<td>8800793196</td>
</tr>
</table>
Original HTML for comparison; note the missing closing tags:
<table class='table'>
<tr><td title='Monday To Saturday'>Fare<td><div> DMRC Rs. 30</div></tr>
<tr><td>Time<td>0:14</tr>
<tr><td>First<td>6:03</tr>
<tr><td>Last<td>22:24
<tr><td>Phone <td><a href='tel:8800793196'>8800793196</a></tr>
</table>
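As a quick, self-contained way to see this repair happen (a minimal sketch, independent of the live page), you can feed a fragment with missing closing tags straight to lxml.html.fromstring and print what lxml actually built:

import lxml.html

# A fragment with the same kind of missing </td> and </tr> tags
broken = "<table><tr><td>Time<td>0:14<tr><td>First<td>6:03</table>"
fixed = lxml.html.fromstring(broken)

# lxml closes the tags for us, just like a browser would
print(lxml.html.tostring(fixed, pretty_print=True).decode())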
Similar to furas' answer, but using pandas to scrape the last table on the page:
import requests
import lxml.html
import pandas as pd
url = 'https://delhimetrorail.info/dwarka-sector-8-delhi-metro-station-to-dwarka-sector-14-delhi-metro-station'
response = requests.get(url)
root = lxml.html.fromstring(response.text)
rows = []
info = root.xpath('//table[4]/tr/td[@rowspan]')
for i in info:
    row = []
    row.append(i.getprevious().text)
    row.append(i.text)
    rows.append(row)
columns = root.xpath('//table[4]//th/text()')
df1 = pd.DataFrame(rows, columns=columns)
df1
Output:
Gate Dwarka Sector 14 Metro Station
0 1 Eros Etro Mall
1 2 Nirmal Bharatiya Public School

Parse integer from a "td" tag with beautifulsoup

I have read many articles about BeautifulSoup, but I still do not understand; I need an example.
I want to get the value of "PD/DD", which is 1,9.
Here is the source:
<div class="table vertical">
<table>
<tbody>
<tr>
<th>F/K</th>
<td>A/D</td>
</tr>
<tr>
<th>FD/FAVÖK</th>
<td>19,7</td>
</tr>
<tr>
HERE--> <th>PD/DD</th>
HERE--> <td>1,9</td>
</tr>
<tr>
<th>FD/Satışlar</th>
<td>5,1</td>
</tr>
<tr>
<th>Yabancı Oranı (%)</th>
<td>2,43</td>
</tr>
<tr>
<th>Ort Hacim (mn$) 3A/12A</th>
<td>1,3 / 1,6</td>
</tr>
My code is:
a="afyon"
url_bank = "https://www.isyatirim.com.tr/tr-tr/analiz/hisse/sayfalar/sirket-karti.aspx?hisse={}".format(a.upper())
response_bank = requests.get(url_bank)
html_content_bank = response_bank.content
soup_bank = BeautifulSoup(html_content_bank, "html.parser")
b=soup_bank.find_all("div", {"class": "table vertical"})
for i in b:
    children = i.findChildren("td", recursive=True)
    for child in children:
        l = []
        l_text = child.text
        l.append(l_text)
        print(l)
When I run this code, it gives me one single-item list per cell:
['Afyon Çimento ']
['11.04.1990']
['Çimento üretip satmak ve ana faaliyet konusu ile ilgili her türlü yan sanayi kuruluşlarına iştirak etmek.']
['(0216)5547000']
['(0216)6511415']
['Kısıklı Cad. Sarkusyan-Ak İş Merkezi S Blok kat:2 34662 Altunizade - Üsküdar / İstanbul']
['A/D']
['19,7']
['1,9']
['5,1']
['2,43']
['1,3 / 1,6']
['407,0 mnTL']
['395,0 mnTL']
['-']
How can I get only the PD/DD value? I am expecting something like:
PD/DD : 1,9
My preference: with bs4 4.7.1+ you can use :contains to target the th by its text value, then take the adjacent sibling td.
import requests
from bs4 import BeautifulSoup
a="afyon"
url_bank = "https://www.isyatirim.com.tr/tr-tr/analiz/hisse/sayfalar/sirket-karti.aspx?hisse={}".format(a.upper())
response_bank = requests.get(url_bank)
html_content_bank = response_bank.content
soup_bank = BeautifulSoup(html_content_bank, "html.parser")
print(soup_bank.select_one('th:contains("PD/DD") + td').text)
You could also use :nth-of-type for positional matching (3rd row 1st column):
soup_bank.select_one('.vertical table:not([class]) tr:nth-of-type(3) td:nth-of-type(1)').text
As we are using select_one, which returns first match, we can shorten to:
soup_bank.select_one('.vertical table:not([class]) tr:nth-of-type(3) td').text
If the id is static:
soup_bank.select_one('#ctl00_ctl45_g_76ae4504_9743_4791_98df_dce2ca95cc0d tr:nth-of-type(3) td').text
You already know the PD/DD label, but it could be retrieved with:
soup_bank.select_one('.vertical table:not([class]) tr:nth-of-type(3) th').text
If those ids remain static for at least a while, then:
soup_bank.select_one('#ctl00_ctl45_g_76ae4504_9743_4791_98df_dce2ca95cc0d tr:nth-of-type(3) th').text
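If your bs4 is older than 4.7.1 and :contains is unavailable, a find/find_next_sibling fallback works too; a small sketch under the same page structure, assuming your bs4 supports the string argument (it was called text= in very old versions):

# Assumes soup_bank was built as in the snippets above
th = soup_bank.find('th', string='PD/DD')    # match the header cell by its exact text
if th is not None:
    print(th.find_next_sibling('td').text)   # the adjacent value cell: 1,9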

How to extract 2nd column in html table in python?

<table style="width:300px" border="1">
<tr>
<td>John</td>
<td>Doe</td>
<td>80</td>
</tr>
<tr>
<td>ABC</td>
<td>abcd</td>
<td>80</td>
</tr>
<tr>
<td>EFC</td>
<td>efc</td>
<td>80</td>
</tr>
</table>
I need to grab all the td elements in column 2 in Python. I am new to Python.
import urllib2
from bs4 import BeautifulSoup
url = "http://ccdsiu.byethost33.com/magento/adamo-13.html"
text = urllib2.urlopen(url).read()
soup = BeautifulSoup(text)
data = soup.findAll('div',attrs={'class':'madhu'})
for div in data:
    trdata = div.findAll('tr')
    tddata = div.findAll('td')
    for trr in trdata:
        print trr
I am trying to get data with the above code. It prints all the td elements in the table. I am trying to achieve this with XPath.
I don't think you can use XPath like you mentioned with BeautifulSoup. However, the third-party lxml module can do it.
from lxml import etree
table = '''
<table style="width:300px" border="1">
<tr>
<td>John</td>
<td>Doe</td>
<td>80</td>
</tr>
<tr>
<td>ABC</td>
<td>abcd</td>
<td>80</td>
</tr>
<tr>
<td>EFC</td>
<td>efc</td>
<td>80</td>
</tr>
</table>
'''
parser = etree.HTMLParser()
tree = etree.fromstring(table, parser)
results = tree.xpath('//tr/td[position()=2]')
print 'Column 2\n========'
for r in results:
    print r.text
Which when run prints
Column 2
========
Doe
abcd
efc
You don't have to iterate over your td elements. Use this:
for div in data:
    trdata = div.findAll('tr')
    tddata = div.findAll('td')
    if len(tddata) >= 2:
        print tddata[1]
Lists are indexed starting from 0. I check the length of the list to make sure that the second td exists.
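With a newer bs4 (which supports CSS selectors via select), the same column can also be grabbed with one selector; a sketch, assuming the table structure shown above:

# Every second cell of every row, in document order
for td in soup.select('tr > td:nth-of-type(2)'):
    print td.text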
It is not really clear what you want, since your HTML example is not relevant and the description of just "second column tds" isn't very helpful. Anyway, I modified Elmo's answer to give you the Importance title and then the actual importance level of each item.
for div in data:
    trdata = div.findAll('tr')
    tddata = div.findAll('td')
    count = 0
    for i in range(0, len(tddata)):
        if count % 6 == 0:
            print tddata[count + 1]
        count += 1

How to select some urls with BeautifulSoup?

I want to scrape the following information, except the last row and the class="Region" row:
...
<td>7</td>
<td bgcolor="" align="left" style=" width:496px"><a class="xnternal" href="http://www.whitecase.com">White and Case</a></td>
<td bgcolor="" align="left">New York</td>
<td bgcolor="" align="left" class="Region">N/A</td>
<td bgcolor="" align="left">1,863</td>
<td bgcolor="" align="left">565</td>
<td bgcolor="" align="left">1,133</td>
<td bgcolor="" align="left">$160,000</td>
<td bgcolor="" align="center"><a class="xnternal" href="/nlj250/firmDetail/7"> View Profile </a></td></tr><tr class="small" bgcolor="#FFFFFF">
...
I tested with this handler:
class TestUrlOpen(webapp.RequestHandler):
    def get(self):
        soup = BeautifulSoup(urllib.urlopen("http://www.ilrg.com/nlj250/"))
        link_list = []
        for a in soup.findAll('a', href=True):
            link_list.append(a["href"])
        self.response.out.write("""<p>link_list: %s</p>""" % link_list)
This works, but it also gets the "View Profile" links, which I don't want:
link_list: [u'http://www.ilrg.com/', u'http://www.ilrg.com/', u'http://www.ilrg.com/nations/', u'http://www.ilrg.com/gov.html', ......]
I can easily remove the u'http://www.ilrg.com/' entries after scraping the site, but it would be nice to have a list without them. What is the best way to do this? Thanks.
I think this may be what you are looking for. The attrs argument can be helpful for isolating the sections you want.
from BeautifulSoup import BeautifulSoup
import urllib
soup = BeautifulSoup(urllib.urlopen("http://www.ilrg.com/nlj250/"))
rows = soup.findAll(name='tr', attrs={'class': 'small'})
for row in rows:
    number = row.find('td').text
    tds = row.findAll(name='td', attrs={'align': 'left'})
    link = tds[0].find('a')['href']
    firm = tds[0].text
    office = tds[1].text
    attorneys = tds[3].text
    partners = tds[4].text
    associates = tds[5].text
    salary = tds[6].text
    print number, firm, office, attorneys, partners, associates, salary
I would get each tr in the table with class=listings. Your search is obviously too broad for the information you want. Because HTML has structure, you can easily get just the table data. This is easier in the long run than getting all hrefs and filtering out the ones you don't want. BeautifulSoup has plenty of documentation on how to do this: http://www.crummy.com/software/BeautifulSoup/documentation.html
Rough sketch (not exact code):
for tr in soup.findAll('tr'):
    data_list = tr.findAll('td')
    data_list[0].text  # 7
    data_list[1].text  # New York
    data_list[2].text  # Region <-- ignore this
    # etc.
