<table style="width:300px" border="1">
<tr>
<td>John</td>
<td>Doe</td>
<td>80</td>
</tr>
<tr>
<td>ABC</td>
<td>abcd</td>
<td>80</td>
</tr>
<tr>
<td>EFC</td>
<td>efc</td>
<td>80</td>
</tr>
</table>
I need to grab all the td's in column 2 in Python. I am new to Python.
import urllib2
from bs4 import BeautifulSoup
url = "http://ccdsiu.byethost33.com/magento/adamo-13.html"
text = urllib2.urlopen(url).read()
soup = BeautifulSoup(text)
data = soup.findAll('div',attrs={'class':'madhu'})
for div in data:
    trdata = div.findAll('tr')
    tddata = div.findAll('td')
    for trr in trdata:
        print trr
I am trying to get the data with the code above, but it prints all the td elements in the table. I would like to achieve this with XPath.
I don't think you can use XPath the way you describe with BeautifulSoup. However, the lxml module (a third-party package, installed separately, e.g. with pip install lxml) can do it.
from lxml import etree
table = '''
<table style="width:300px" border="1">
<tr>
<td>John</td>
<td>Doe</td>
<td>80</td>
</tr>
<tr>
<td>ABC</td>
<td>abcd</td>
<td>80</td>
</tr>
<tr>
<td>EFC</td>
<td>efc</td>
<td>80</td>
</tr>
</table>
'''
parser = etree.HTMLParser()
tree = etree.fromstring(table, parser)
results = tree.xpath('//tr/td[position()=2]')
print 'Column 2\n========'
for r in results:
    print r.text
Which when run prints
Column 2
========
Doe
abcd
efc
You don't have to iterate over your td elements. Use this:
for div in data:
    trdata = div.findAll('tr')
    tddata = div.findAll('td')
    if len(tddata) >= 2:
        print tddata[1]
Lists are indexed starting from 0. I check the length of the list to make sure that second td exists.
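Note that div.findAll('td') collects the cells of every row at once, so tddata[1] is only the second cell of the first row. A minimal Python 3 sketch of collecting the second cell of each row, assuming the sample table from the question and the html.parser backend (the variable names here are illustrative):

```python
from bs4 import BeautifulSoup

# Sample table copied from the question.
html = """
<table style="width:300px" border="1">
<tr><td>John</td><td>Doe</td><td>80</td></tr>
<tr><td>ABC</td><td>abcd</td><td>80</td></tr>
<tr><td>EFC</td><td>efc</td><td>80</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

column_two = []
for row in soup.find_all("tr"):
    cells = row.find_all("td")      # cells of this row only
    if len(cells) >= 2:             # skip rows without a second cell
        column_two.append(cells[1].get_text(strip=True))

print(column_two)  # ['Doe', 'abcd', 'efc']
```

Looking up the cells row by row keeps the column index meaningful even when rows have different numbers of cells.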
It is not really clear what you want, since your example HTML is not relevant and the description of "just the second column tds" isn't very helpful. Anyway, I modified Elmo's answer to give you the Importance title and then the actual importance level of each item.
for div in data:
    trdata = div.findAll('tr')
    tddata = div.findAll('td')
    count = 0
    for i in range(0, len(tddata)):
        if count % 6 == 0:
            print tddata[count + 1]
        count += 1
The website I'm scraping (using lxml) works fine with everything except one table, in which all the tr's, td's and heading th's are nested and mixed, forming an unstructured HTML table.
<table class='table'>
<tr>
<th>Serial No.
<th>Full Name
<tr>
<td>1
<td rowspan='1'> John
<tr>
<td>2
<td rowspan='1'>Jane Alleman
<tr>
<td>3
<td rowspan='1'>Mukul Jha
.....
.....
.....
</table>
I tried the following XPaths, but each of them just gives me back an empty list.
persons = [x for x in tree.xpath('//table[@class="table"]/tr/th/th/tr/td/td/text()')]
persons = [x for x in tree.xpath('//table[@class="table"]/tr/td/td/text()')]
persons = [x for x in tree.xpath('//table[@class="table"]/tr/th/th/tr/td/td/text()') if x.isdigit() ==False] # to remove the serial no.s
Finally, what is the reason for such nesting? Is it meant to prevent scraping?
It seems lxml loads the table the same way a browser does: it builds the correct structure in memory, and you can see the corrected HTML with lxml.html.tostring(table).
So the table is correctly formed, and a normal './tr/td//text()' is enough to get all the values.
import requests
import lxml.html
text = requests.get('https://delhimetrorail.info/dwarka-sector-8-delhi-metro-station-to-dwarka-sector-14-delhi-metro-station').text
s = lxml.html.fromstring(text)
table = s.xpath('//table')[1]
for row in table.xpath('./tr'):
    cells = row.xpath('./td//text()')
    print(cells)
print(lxml.html.tostring(table, pretty_print=True).decode())
Result
['Fare', ' DMRC Rs. 30']
['Time', '0:14']
['First', '6:03']
['Last', '22:24']
['Phone ', '8800793196']
<table class="table">
<tr>
<td title="Monday To Saturday">Fare</td>
<td><div> DMRC Rs. 30</div></td>
</tr>
<tr>
<td>Time</td>
<td>0:14</td>
</tr>
<tr>
<td>First</td>
<td>6:03</td>
</tr>
<tr>
<td>Last</td>
<td>22:24</td>
</tr>
<tr>
<td>Phone </td>
<td>8800793196</td>
</tr>
</table>
Original HTML for comparison (some closing tags are missing):
<table class='table'>
<tr><td title='Monday To Saturday'>Fare<td><div> DMRC Rs. 30</div></tr>
<tr><td>Time<td>0:14</tr>
<tr><td>First<td>6:03</tr>
<tr><td>Last<td>22:24
<tr><td>Phone <td><a href='tel:8800793196'>8800793196</a></tr>
</table>
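The same behaviour can be checked offline on a snippet shaped like the one in the question. This is only a sketch: the inline HTML is a shortened, hypothetical copy of the question's table, and it relies on lxml's HTML parser closing the open th, td and tr elements on its own.

```python
import lxml.html

# Shortened copy of the question's table: no closing </th>, </td> or </tr> tags.
html = (
    "<table class='table'>"
    "<tr><th>Serial No.<th>Full Name"
    "<tr><td>1<td rowspan='1'> John"
    "<tr><td>2<td rowspan='1'>Jane Alleman"
    "</table>"
)
table = lxml.html.fromstring(html)

rows = []
for tr in table.xpath(".//tr"):
    # Header cells and data cells are siblings inside the repaired row.
    rows.append([cell.text_content().strip() for cell in tr.xpath("./th|./td")])

for row in rows:
    print(row)

# The repaired markup, with the closing tags that the parser added.
print(lxml.html.tostring(table, pretty_print=True).decode())
```

Since the cells end up as siblings inside each tr, paths such as /tr/td/td or /tr/th/th/tr match nothing, which would explain the empty lists.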
Similar to furas' answer, but using pandas to scrape the last table on the page:
import requests
import lxml.html
import pandas as pd
url = 'https://delhimetrorail.info/dwarka-sector-8-delhi-metro-station-to-dwarka-sector-14-delhi-metro-station'
response = requests.get(url)
root = lxml.html.fromstring(response.text)
rows = []
info = root.xpath('//table[4]/tr/td[@rowspan]')
for i in info:
    row = []
    row.append(i.getprevious().text)
    row.append(i.text)
    rows.append(row)
columns = root.xpath('//table[4]//th/text()')
df1 = pd.DataFrame(rows, columns=columns)
df1
Output:
  Gate Dwarka Sector 14 Metro Station
0    1                 Eros Etro Mall
1    2 Nirmal Bharatiya Public School
I have read many articles about BeautifulSoup, but I still do not understand it. I need an example.
I want to get the value of "PD/DD" which is 1,9.
Here is the source:
<div class="table vertical">
<table>
<tbody>
<tr>
<th>F/K</th>
<td>A/D</td>
</tr>
<tr>
<th>FD/FAVÖK</th>
<td>19,7</td>
</tr>
<tr>
HERE--> <th>PD/DD</th>
HERE--> <td>1,9</td>
</tr>
<tr>
<th>FD/Satışlar</th>
<td>5,1</td>
</tr>
<tr>
<th>Yabancı Oranı (%)</th>
<td>2,43</td>
</tr>
<tr>
<th>Ort Hacim (mn$) 3A/12A</th>
<td>1,3 / 1,6</td>
</tr>
My code is:
import requests
from bs4 import BeautifulSoup

a="afyon"
url_bank = "https://www.isyatirim.com.tr/tr-tr/analiz/hisse/sayfalar/sirket-karti.aspx?hisse={}".format(a.upper())
response_bank = requests.get(url_bank)
html_content_bank = response_bank.content
soup_bank = BeautifulSoup(html_content_bank, "html.parser")
b=soup_bank.find_all("div", {"class": "table vertical"})
for i in b:
    children = i.findChildren("td" , recursive=True)
    for child in children:
        l=[]
        l_text = child.text
        l.append(l_text)
        print(l)
When I run this code, it prints a separate one-element list for each cell:
['Afyon Çimento ']
['11.04.1990']
['Çimento üretip satmak ve ana faaliyet konusu ile ilgili her türlü yan sanayi kuruluşlarına iştirak etmek.']
['(0216)5547000']
['(0216)6511415']
['Kısıklı Cad. Sarkusyan-Ak İş Merkezi S Blok kat:2 34662 Altunizade - Üsküdar / İstanbul']
['A/D']
['19,7']
['1,9']
['5,1']
['2,43']
['1,3 / 1,6']
['407,0 mnTL']
['395,0 mnTL']
['-']
How can I get only the PD/DD value? I am expecting something like:
PD/DD : 1,9
My preference:
With bs4 4.7.1 you can use :contains to target the th by its text value then take the adjacent sibling td.
import requests
from bs4 import BeautifulSoup
a="afyon"
url_bank = "https://www.isyatirim.com.tr/tr-tr/analiz/hisse/sayfalar/sirket-karti.aspx?hisse={}".format(a.upper())
response_bank = requests.get(url_bank)
html_content_bank = response_bank.content
soup_bank = BeautifulSoup(html_content_bank, "html.parser")
print(soup_bank.select_one('th:contains("PD/DD") + td').text)
You could also use :nth-of-type for positional matching (3rd row 1st column):
soup_bank.select_one('.vertical table:not([class]) tr:nth-of-type(3) td:nth-of-type(1)').text
As we are using select_one, which returns first match, we can shorten to:
soup_bank.select_one('.vertical table:not([class]) tr:nth-of-type(3) td').text
If the id is static:
soup_bank.select_one('#ctl00_ctl45_g_76ae4504_9743_4791_98df_dce2ca95cc0d tr:nth-of-type(3) td').text
You already know the label PD/DD, but it could be retrieved with:
soup_bank.select_one('.vertical table:not([class]) tr:nth-of-type(3) th').text
If those ids remain static for at least a while, then:
soup_bank.select_one('#ctl00_ctl45_g_76ae4504_9743_4791_98df_dce2ca95cc0d tr:nth-of-type(3) th').text
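For comparison, here is a sketch that avoids CSS selectors altogether and produces the "PD/DD : 1,9" line the question asks for. It runs on a shortened inline copy of the question's table rather than the live page, and it assumes the th text is exactly PD/DD with no surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Shortened inline copy of the table from the question (not the live page).
html = """
<div class="table vertical">
<table><tbody>
<tr><th>FD/FAVÖK</th><td>19,7</td></tr>
<tr><th>PD/DD</th><td>1,9</td></tr>
<tr><th>FD/Satışlar</th><td>5,1</td></tr>
</tbody></table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Find the header cell by its exact text, then step to the td next to it.
header = soup.find("th", string="PD/DD")
value = header.find_next_sibling("td").get_text(strip=True)

line = "{} : {}".format(header.get_text(strip=True), value)
print(line)  # PD/DD : 1,9
```

Matching on the label rather than on the row position keeps working if the site reorders the rows.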
Is it possible to extract the content that comes after the text Final Text: (which is not a tag) using BeautifulSoup?
i.e. expecting only
<td>0 / 22 FAIL</td></tr><tr>
The problem is that many of the tags don't have a class or id. If I extract all the <td> elements, I get everything, which is not what I need.
<td><strong>Final Text:</strong></td>
<td>0 / 22 FAIL</td></tr><tr>
<td><strong>Ext:</strong></td>
<td>343 / 378 FAIL</td></tr></table>
You can find the <strong>Final Text:</strong> tag using find('strong', text='Final Text:'). Then, you can use the find_next() method to get the next <td> tag.
html = '''
<table>
<tr>
<td><strong>Final Text:</strong></td>
<td>0 / 22 FAIL</td>
</tr>
<tr>
<td><strong>Ext:</strong></td>
<td>343 / 378 FAIL</td>
</tr>
</table>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
txt = soup.find('strong', text='Final Text:').find_next('td')
print(txt)
Output:
<td>0 / 22 FAIL</td>
Yes, it's possible. Consider this HTML:
<table>
<tr>
<td><strong>Final Text:</strong></td>
<td>0 / 22 FAIL</td>
</tr>
<tr>
<td><strong>Ext:</strong></td>
<td>343 / 378 FAIL</td>
</tr>
</table>
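This XPath will work:
//*[contains(text(),'Final Text')]/parent::td/following-sibling::td[1]
Find the tag containing the text Final Text, get its parent td, then take the td that immediately follows it in the same row. (Going up to the parent tr and taking following-sibling::tr would select the next row, the Ext row, instead.)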
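A sketch of running an XPath of this kind with lxml, selecting the cell that sits next to the Final Text label in the same row. It assumes the cleaned-up table shown above and uses lxml.html rather than BeautifulSoup.

```python
import lxml.html

html = """
<table>
<tr>
<td><strong>Final Text:</strong></td>
<td>0 / 22 FAIL</td>
</tr>
<tr>
<td><strong>Ext:</strong></td>
<td>343 / 378 FAIL</td>
</tr>
</table>
"""
tree = lxml.html.fromstring(html)

# Label element -> its parent td -> the first td after it in the same row.
cells = tree.xpath(
    "//strong[contains(text(), 'Final Text')]"
    "/parent::td/following-sibling::td[1]/text()"
)
print(cells)  # ['0 / 22 FAIL']
```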
If the content you're trying to get always sits in the second <td> element, why not take the element at index 1 of the list of <td> elements?
soup = BeautifulSoup(html)
td_list = soup.find_all('td')  # find() would return only the first td, not a list
td_list[1] # This would be the FAIL element
I am programming a web crawler with the help of Beautiful Soup. I have the following HTML code:
<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td>
</tr>
<tr class="odd-row">
<td>ghf</td>
<td class="numeric">2,50%</td>
My goal is to write the numbers in the class="numeric" cells to specific variables. I want to do this conditional on the string in the cell just before them (e.g. "xyz", "abc", ...).
At the moment I am doing the following:
for c in soup.find_all("a", string=re.compile('abc')):
    abc=c.string
But of course it returns the string "abc" and not the number in the tag after it.
So basically my question is how to address the string in the class="numeric" cell conditional on the string in the cell before it.
Thanks for your help!!!
Once you find the correct td (which I presume is what you meant to use in place of a), get the next sibling with the class you want:
h = """<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td>
</tr>
<tr class="odd-row">
<td>ghf</td>
<td class="numeric">2,50%</td>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(h, "html.parser")
for td in soup.find_all("td",text="abc"):
    print(td.find_next_sibling("td",class_="numeric"))
If the numeric td is always next you can just call find_next_sibling():
for td in soup.find_all("td",text="abc"):
    print(td.find_next_sibling())
For your input both would give you:
<td class="numeric">50,00%</td>
If I understand your question correctly, and if I assume your html code will always follow your sample structure, you can do this:
result = {}
table_rows = soup.find_all("tr")
for row in table_rows:
    table_columns = row.find_all("td")
    # first cell is the label, second cell is the numeric value
    result[table_columns[0].text] = table_columns[1].text
print result #### e.g. {u'xyz': u'5,00%', u'abc': u'50,00%', u'ghf': u'2,50%'}
You end up with a dictionary whose keys are 'xyz', 'abc', etc. and whose values are the strings in the class="numeric" cells.
So as I understand your question you want to iterate over the tuples
('xyz', '5,00%'), ('abc', '50,00%'), ('ghf', '2,50%'). Is that correct?
But I don't understand how your code produces any results, since you are searching for <a> tags.
Instead you should iterate over the <tr> tags and then take the strings inside the <td> tags. Notice the double next_sibling for accessing the second <td>, since the first next_sibling would reference the whitespace between the two tags.
html = """
<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td>
</tr>
<tr class="odd-row">
<td>ghf</td>
<td class="numeric">2,50%</td>
</tr>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for tr in soup.find_all("tr"):
    print((tr.td.string, tr.td.next_sibling.next_sibling.string))
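To get from those pairs to "a specific variable" per label, one option is a dictionary keyed by the label plus a small conversion of the comma-decimal percentage. This is a sketch under assumptions: the rows are well-formed and wrapped in a table element, and to_float is a hypothetical helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr class="odd-row"><td>xyz</td><td class="numeric">5,00%</td></tr>
<tr class="even-row"><td>abc</td><td class="numeric">50,00%</td></tr>
<tr class="odd-row"><td>ghf</td><td class="numeric">2,50%</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

values = {}
for row in soup.find_all("tr"):
    label = row.find("td")                       # first cell: the name
    number = row.find("td", class_="numeric")    # the cell with the percentage
    if label is not None and number is not None:
        values[label.get_text(strip=True)] = number.get_text(strip=True)


def to_float(text):
    """Hypothetical helper: turn a string like '50,00%' into 50.0."""
    return float(text.rstrip("%").replace(",", "."))


abc = to_float(values["abc"])
print(values)  # {'xyz': '5,00%', 'abc': '50,00%', 'ghf': '2,50%'}
print(abc)     # 50.0
```

Selecting the value by class instead of by sibling position also sidesteps the whitespace text nodes mentioned above.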
First of all I am new to python and Stack Overflow so please be kind.
This is the source code of the html page I want to extract data from.
Webpage: http://gbgfotboll.se/information/?scr=table&ftid=51168
The table is at the bottom of the page
<html>
<table class="clCommonGrid" cellspacing="0">
<thead>
<tr>
<td colspan="3">Kommande matcher</td>
</tr>
<tr>
<th style="width:1%;">Tid</th>
<th style="width:69%;">Match</th>
<th style="width:30%;">Arena</th>
</tr>
</thead>
<tbody class="clGrid">
<tr class="clTrOdd">
<td nowrap="nowrap" class="no-line-through">
<span class="matchTid"><span>2014-09-26<!-- br ok --> 19:30</span></span>
</td>
<td>Guldhedens IK - IF Warta</td>
<td>Guldheden Södra 1 Konstgräs </td>
</tr>
<tr class="clTrEven">
<td nowrap="nowrap" class="no-line-through">
<span class="matchTid"><span>2014-09-26<!-- br ok --> 13:00</span></span>
</td>
<td>Romelanda UF - IK Virgo</td>
<td>Romevi 1 Gräs </td>
</tr>
<tr class="clTrOdd">
<td nowrap="nowrap" class="no-line-through">
<span class="matchTid"><span>2014-09-27<!-- br ok --> 13:00</span></span>
</td>
<td>Kode IF - IK Kongahälla</td>
<td>Kode IP 1 Gräs </td>
</tr>
<tr class="clTrEven">
<td nowrap="nowrap" class="no-line-through">
<span class="matchTid"><span>2014-09-27<!-- br ok --> 14:00</span></span>
</td>
<td>Floda BoIF - Partille IF FK </td>
<td>Flodala IP 1 </td>
</tr>
</tbody>
</table>
</html>
I need to extract the time (19:30) and the team name (Guldhedens IK - IF Warta), meaning the first and the second table cell (not the third) from the first table row, then 13:00 / Romelanda UF - IK Virgo from the second table row, and so on for all the table rows there are.
As you can see, every table row has a date right before the time, so here comes the tricky part: I only want to get the time and the team names mentioned above from those table rows where the date is equal to the date on which I run this code.
The only thing I have managed so far is not much: I can only get the date and time of the first row with this code:
import lxml.html
html = lxml.html.parse("http://gbgfotboll.se/information/?scr=table&ftid=51168")
test=html.xpath("//*[@id='content-primary']/table[3]/tbody/tr[1]/td[1]/span/span//text()")
print test
which gives me the result ['2014-09-26', ' 19:30']. After this I'm lost on how to iterate through the table rows and pick out the specific cells where the date matches the date on which I run the code.
I hope you can answer as much as you can.
If I understood you, try something like this:
import lxml.html
url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
html = lxml.html.parse(url)
for i in range(12):
    xpath1 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[1]/span/span//text()" %(i+1)
    xpath2 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[2]/a/text()" %(i+1)
    print html.xpath(xpath1)[1], html.xpath(xpath2)[0]
I know this is fragile and there are better solutions, but it works. ;)
Edit:
Better way with using BeautifulSoup:
from bs4 import BeautifulSoup
import requests
respond = requests.get("http://gbgfotboll.se/information/?scr=table&ftid=51168")
soup = BeautifulSoup(respond.text)
l = soup.find_all('table')
t = l[2].find_all('tr') #change this to [0] to parse first table
for i in t:
    try:
        print i.find('span').get_text()[-5:], i.find('a').get_text()
    except AttributeError:
        pass
Edit2:
page not responding, but that should work:
from bs4 import BeautifulSoup
import requests
respond = requests.get("http://gbgfotboll.se/information/?scr=table&ftid=51168")
soup = BeautifulSoup(respond.text)
l = soup.find_all('table')
t = l[2].find_all('tr')
time = ""
for i in t:
    try:
        dateTime = i.find('span').get_text()
        teamName = i.find('a').get_text()
        if time == dateTime[:-5]:
            # same date as the previous row: print only the time
            print dateTime[-5:], teamName
        else:
            print dateTime, teamName
            time = dateTime[:-5]
    except AttributeError:
        pass
lxml:
import lxml.html
url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
html = lxml.html.parse(url)
dateTemp = ""
for i in range(12):
    xpath1 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[1]/span/span//text()" %(i+1)
    xpath2 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[2]/a/text()" %(i+1)
    time = html.xpath(xpath1)[1]
    date = html.xpath(xpath1)[0]
    teamName = html.xpath(xpath2)[0]
    if date == dateTemp:
        print time, teamName
    else:
        print date, time, teamName
        dateTemp = date  # remember the last date printed
So thanks to @CodeNinja's help, I just tweaked it a little to get exactly what I wanted.
I imported time to get the date on which I run the code. Anyway, here is the code for what I wanted. Thank you for the help!!
import lxml.html
import time
url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
html = lxml.html.parse(url)
currentDate = (time.strftime("%Y-%m-%d"))
for i in range(12):
    xpath1 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[1]/span/span//text()" %(i+1)
    xpath2 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[2]/a/text()" %(i+1)
    time = html.xpath(xpath1)[1]
    date = html.xpath(xpath1)[0]
    teamName = html.xpath(xpath2)[0]
    if date == currentDate:
        print time, teamName
So here is the FINAL version of how to do it the correct way. This will parse through all the table rows it has without using "range" in the for loop. I got this answer from my other post here: Iterate through all the rows in a table using python lxml xpath
import lxml.html
from lxml.etree import XPath
url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
date = '2014-09-27'
rows_xpath = XPath("//*[@id='content-primary']/table[3]/tbody/tr[td[1]/span/span//text()='%s']" % (date))
time_xpath = XPath("td[1]/span/span//text()[2]")
team_xpath = XPath("td[2]/a/text()")
html = lxml.html.parse(url)
for row in rows_xpath(html):
    time = time_xpath(row)[0].strip()
    team = team_xpath(row)[0]
    print time, team
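Since the page may not be reachable, here is an offline Python 3 sketch of the same date filter. It assumes a shortened inline copy of the table from the question, in which the team names are plain text inside td[2] (the live page apparently wraps them in a link); matches_on is a hypothetical helper name.

```python
import lxml.html

# Shortened inline copy of the table from the question.
html = """
<table class="clCommonGrid">
<tbody class="clGrid">
<tr class="clTrOdd">
<td><span class="matchTid"><span>2014-09-26<!-- br ok --> 19:30</span></span></td>
<td>Guldhedens IK - IF Warta</td>
<td>Guldheden Södra 1 Konstgräs</td>
</tr>
<tr class="clTrEven">
<td><span class="matchTid"><span>2014-09-27<!-- br ok --> 13:00</span></span></td>
<td>Kode IF - IK Kongahälla</td>
<td>Kode IP 1 Gräs</td>
</tr>
<tr class="clTrOdd">
<td><span class="matchTid"><span>2014-09-27<!-- br ok --> 14:00</span></span></td>
<td>Floda BoIF - Partille IF FK</td>
<td>Flodala IP 1</td>
</tr>
</tbody>
</table>
"""


def matches_on(table, wanted_date):
    """Return (time, teams) pairs for rows whose date equals wanted_date."""
    found = []
    for row in table.xpath(".//tbody/tr"):
        parts = row.xpath("td[1]//span[@class='matchTid']/span/text()")
        pieces = "".join(parts).split()   # date and time, whitespace removed
        if len(pieces) != 2:
            continue                      # row without a date/time cell
        date, kickoff = pieces
        if date == wanted_date:
            found.append((kickoff, row.xpath("string(td[2])").strip()))
    return found


table = lxml.html.fromstring(html)
todays_matches = matches_on(table, "2014-09-27")
for kickoff, teams in todays_matches:
    print(kickoff, teams)
```

To filter on the day the script runs, datetime.date.today().isoformat() gives the date in the same YYYY-MM-DD form as the page.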